Re: MySQL-Cluster Backup error
Posted by: Jay Ward
Date: July 31, 2012 02:56PM

Hey Jose,

I am fighting with the exact same problem. I did a whole lot of digging on this and here is what I found:

First, two examples of stack traces I hit while debugging this. I have a bunch more, but they all look identical to one of these two (usually the first one):

#0 0x000000000070a3c9 in insert_signal (self=<value optimized out>, s=0x7f9391d5048c, data=0x7f9391d504a8, secPtr=0x7f9391d504d8)
at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/storage/ndb/src/kernel/vm/mt.cpp:2976
#1 sendlocal (self=<value optimized out>, s=0x7f9391d5048c, data=0x7f9391d504a8, secPtr=0x7f9391d504d8)
at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/storage/ndb/src/kernel/vm/mt.cpp:3908
#2 0x00000000006fa677 in SimulatedBlock::sendSignal (this=0x17f9870, ref=1, gsn=1, signal=0x7f9391d50480,
length=<value optimized out>, jobBuffer=JBB, ptr=0x7f9391d4fe90, noOfSections=1)
at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/storage/ndb/src/kernel/vm/SimulatedBlock.cpp:771
#3 0x000000000065ff4c in Backup::sendScanFragReq (this=0x17f9870, signal=0x7f9391d50480, ptr=..., filePtr=..., tabPtr=...,
fragPtr=..., delay=0)
at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/storage/ndb/src/kernel/blocks/backup/Backup.cpp:4157
#4 0x0000000000669535 in Backup::execBACKUP_FRAGMENT_REQ (this=0x17f9870, signal=0x7f9391d50480)
at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/storage/ndb/src/kernel/blocks/backup/Backup.cpp:4090
#5 0x0000000000708fc5 in executeFunction (selfptr=0xb83d20, q=0xb84260, h=0xb83ef0, r=<value optimized out>, sig=0x7f9391d50480,
max_signals=100, signalIdCounter=0x7f9391d5addc)
at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/storage/ndb/src/kernel/vm/SimulatedBlock.hpp:1034
#6 execute_signals (selfptr=0xb83d20, q=0xb84260, h=0xb83ef0, r=<value optimized out>, sig=0x7f9391d50480, max_signals=100,
signalIdCounter=0x7f9391d5addc)
at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/storage/ndb/src/kernel/vm/mt.cpp:3203
#7 0x000000000070c9a5 in run_job_buffers (thr_arg=0xb83d20)
at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/storage/ndb/src/kernel/vm/mt.cpp:3244
#8 mt_job_thread_main (thr_arg=0xb83d20)
at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/storage/ndb/src/kernel/vm/mt.cpp:3783
#9 0x00000000006a93da in ndb_thread_wrapper (_ss=0x15566d0)
at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/storage/ndb/src/common/portlib/NdbThread.c:160
#10 0x0000003f8b607851 in start_thread () from /lib64/libpthread.so.0
#11 0x0000003f8aee76dd in clone () from /lib64/libc.so.6

... and two ...

#0 0x000000000070a3c9 in insert_signal (self=<value optimized out>, s=0x7f81c14f858c, data=0x7f81c14f85a8, secPtr=0x7f81c14f85cc)
at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/storage/ndb/src/kernel/vm/mt.cpp:2976
#1 sendlocal (self=<value optimized out>, s=0x7f81c14f858c, data=0x7f81c14f85a8, secPtr=0x7f81c14f85cc)
at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/storage/ndb/src/kernel/vm/mt.cpp:3908
#2 0x00000000006fa677 in SimulatedBlock::sendSignal (this=0x7f81c2438010, ref=3, gsn=1, signal=0x7f81c14f8580,
length=<value optimized out>, jobBuffer=JBB, ptr=0x7f81c14f7150, noOfSections=3)
at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/storage/ndb/src/kernel/vm/SimulatedBlock.cpp:771
#3 0x00000000005aa531 in Dbtup::executeTrigger (this=0x7f81c2438010, req_struct=0x7f81c14f7420, trigPtr=0x7f78d69bc2d0,
regOperPtr=0x7f78d8cbc240, disk=<value optimized out>)
at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/storage/ndb/src/kernel/blocks/dbtup/DbtupTrigger.cpp:1548
#4 0x00000000005aacd3 in Dbtup::fireDetachedTriggers (this=0x7f81c2438010, req_struct=0x7f81c14f7420, triggerList=...,
regOperPtr=0x7f78d8cbc240, disk=<value optimized out>)
at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/storage/ndb/src/kernel/blocks/dbtup/DbtupTrigger.cpp:1080
#5 0x00000000005aafc0 in Dbtup::checkDetachedTriggers (this=0x7f81c2438010, req_struct=0x7f81c14f7420, regOperPtr=0x7f78d8cbc240,
regTablePtr=0x7f78d72fd830, disk=<value optimized out>)
at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/storage/ndb/src/kernel/blocks/dbtup/DbtupTrigger.cpp:955
#6 0x000000000065ac02 in Dbtup::execTUP_COMMITREQ (this=0x7f81c2438010, signal=0x7f81c14f8580)
at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/storage/ndb/src/kernel/blocks/dbtup/DbtupCommit.cpp:861
#7 0x000000000054fbc3 in executeFunction (this=0x162d4b0, signal=0x7f81c14f8580)
at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/storage/ndb/src/kernel/vm/SimulatedBlock.hpp:1034
#8 EXECUTE_DIRECT (this=0x162d4b0, signal=0x7f81c14f8580)
at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/storage/ndb/src/kernel/vm/SimulatedBlock.hpp:1279
#9 Dblqh::commitContinueAfterBlockedLab (this=0x162d4b0, signal=0x7f81c14f8580)
at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:8176
#10 0x0000000000708fc5 in executeFunction (selfptr=0xb8a6a0, q=0xb8afe0, h=0xb8a890, r=<value optimized out>, sig=0x7f81c14f8580,
max_signals=100, signalIdCounter=0x7f81c1502ddc)
at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/storage/ndb/src/kernel/vm/SimulatedBlock.hpp:1034
#11 execute_signals (selfptr=0xb8a6a0, q=0xb8afe0, h=0xb8a890, r=<value optimized out>, sig=0x7f81c14f8580, max_signals=100,
signalIdCounter=0x7f81c1502ddc)
at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/storage/ndb/src/kernel/vm/mt.cpp:3203
#12 0x000000000070c9a5 in run_job_buffers (thr_arg=0xb8a6a0)
at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/storage/ndb/src/kernel/vm/mt.cpp:3244
#13 mt_job_thread_main (thr_arg=0xb8a6a0)
at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/storage/ndb/src/kernel/vm/mt.cpp:3783
#14 0x00000000006a93da in ndb_thread_wrapper (_ss=0x13b20a0)
at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/storage/ndb/src/common/portlib/NdbThread.c:160
#15 0x00000030fb607851 in start_thread () from /lib64/libpthread.so.0
#16 0x00000030fb2e76dd in clone () from /lib64/libc.so.6

So the thread (2 in the first trace, 3 in the second) does some work in its SimulatedBlock object, which then sends a signal to another thread (3 in the first example, 2 in the second). The insert_signal function then attempts to write the signal into the write state for the target thread's job buffer. It took a bit of digging around, but here's where it's kept:

*(g_thr_repository->m_thread[m_threadId]->m_thr_data->m_write_states + dst);
where m_threadId is the ID of the current thread and dst is the ID of the destination thread. In all cases, the write state in question looks like this:

{
m_write_index = 0,
m_write_pos = 0,
m_write_buffer = 0x0,
m_pending_signals = 0,
m_pending_signals_wakeup = 0
}

Notice that m_write_buffer is a null pointer. So when insert_signal tries to memcpy the signal information to it, the Linux kernel delivers a SIGSEGV. And rightly so: you can't copy data into a null pointer.
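To make the failure mode concrete, here is a minimal sketch of a write state and a guarded insert. The field names follow the gdb dump above, but the types, the WriteState name, the try_insert_signal helper, and the capacity parameter are all assumptions for illustration; this is not the actual code in mt.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Simplified stand-in for the job buffer write state. Field names match the
// gdb dump; the types and layout are assumptions, not the real NDB struct.
struct WriteState {
    std::uint32_t  m_write_index = 0;
    std::uint32_t  m_write_pos = 0;
    std::uint32_t* m_write_buffer = nullptr;  // null if no buffer was ever set up
    std::uint32_t  m_pending_signals = 0;
    std::uint32_t  m_pending_signals_wakeup = 0;
};

// Hypothetical guarded insert: refuses to copy when the destination buffer is
// missing or too small, instead of handing memcpy a null pointer.
// words and capacity are counted in 32-bit words.
bool try_insert_signal(WriteState& w, const std::uint32_t* data,
                       std::size_t words, std::size_t capacity) {
    if (w.m_write_buffer == nullptr) {
        return false;  // the situation seen in the core dumps
    }
    if (w.m_write_pos + words > capacity) {
        return false;  // not enough room left in this buffer
    }
    std::memcpy(w.m_write_buffer + w.m_write_pos, data,
                words * sizeof(std::uint32_t));
    w.m_write_pos += static_cast<std::uint32_t>(words);
    ++w.m_pending_signals;
    return true;
}
```

With an all-zero write state like the one in the dump, try_insert_signal returns false rather than crashing. An unguarded memcpy into m_write_buffer would fault in the same way the traces show.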

What I haven't found out is WHY the job buffer in question is null. Did it never get initialized? Did it get GCed too quickly? Are the producer threads referencing the wrong consumer threads? I don't know...

It would be nice if someone who did could shed some light on it, or at least point us in the right direction to look.

It seems to have something to do with the disk-access portion of the backup, since that's the only time I am seeing it. Otherwise, the ndbmtd nodes run rock solid.

If I find anything else out, I will send it your way. Please do likewise!

PS 0# I just noticed that the backups finish writing the .ctl files, but die after writing 0 or 48 bytes into the .log file of the Master NDB node. Usually there are no .Data file entries, since the .log file gets its 48 bytes before that, but sometimes I see data in them, though not much (usually, if any, only 48 bytes as well). So ... it seems to be a problem with how it's handling the info going into the .log files. Maybe.

Time to use the source!

PS 1# The plot thickens... turns out there is a very good reason the job buffers are null in these cases: these two threads are both LDM (LQH) threads and, by definition, are not even allowed to talk to each other:

// LQH threads can communicates with TC-, main- and itself
return is_main_thread(to) ||
is_tc_thread(to) ||
(to == from);

And since they aren't allowed to communicate, there is no sense in allocating them a buffer to do so. So the question is... why are they trying, then?
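As a rough illustration of that rule, the sketch below applies the same three conditions to a made-up thread layout. The ThreadKind enum, the ldm_may_send_to function, and the layout itself (thread 0 as main, 1 as TC, 2 and 3 as LDM) are assumptions for the example, not the real ndbmtd thread numbering or API.

```cpp
#include <cassert>
#include <vector>

// Hypothetical thread roles; the real kernel derives these from thread numbers.
enum class ThreadKind { Main, Tc, Ldm };

// Same shape as the quoted rule: an LDM (LQH) sender may reach the main
// thread, a TC thread, or itself, and nothing else.
bool ldm_may_send_to(const std::vector<ThreadKind>& kinds,
                     unsigned from, unsigned to) {
    return kinds[to] == ThreadKind::Main ||
           kinds[to] == ThreadKind::Tc ||
           to == from;
}
```

Under the assumed layout {Main, Tc, Ldm, Ldm}, ldm_may_send_to(kinds, 2, 3) is false while ldm_may_send_to(kinds, 2, 2) and ldm_may_send_to(kinds, 2, 0) are true. That matches the observation that no buffer exists between the two LDM threads.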

Stay tuned for, we hope, the exciting conclusion...



Edited 2 time(s). Last edit at 08/01/2012 11:48AM by Jay Ward.
