MySQL :: NDB data node failed and sent another data node into a crash loop

Posted by: Richard Cruise
Date: January 27, 2022 04:30AM

I have an NDB cluster running 3 data nodes and 3 management nodes.

It appears that an error occurred on one data node (node 19), which caused it to restart:

Node 19:
2022-01-15 04:20:46 [ndbd] INFO -- findNeighbours from: 2905 old (left: 17 right: 17) new (17 18)
2022-01-15 04:20:47 [ndbd] INFO -- NR Status: node=18,OLD=Node failure handling complete,NEW=All nodes permitted us
2022-01-15 04:20:47 [ndbd] INFO -- Switch to 17 multi trp for node 18
2022-01-15 04:21:24 [ndbd] INFO -- NR Status: node=18,OLD=All nodes permitted us,NEW=Include node in LCP/GCP protocols
2022-01-15 04:21:24 [ndbd] INFO -- NR Status: node=18,OLD=Include node in LCP/GCP protocols,NEW=Synchronize start node with live nodes
For help with below stacktrace consult:
https://dev.mysql.com/doc/refman/en/using-stack-trace.html
Also note that stack_bottom and thread_stack will always show up as zero.
stack_bottom = 0 thread_stack 0x0
/usr/sbin/ndbmtd(my_print_stacktrace(unsigned char const*, unsigned long)+0x3d) [0x8c676d]
/usr/sbin/ndbmtd(ErrorReporter::handleError(int, char const*, char const*, NdbShutdownType)+0x2f) [0x81f59f]
/usr/sbin/ndbmtd(SimulatedBlock::progError(int, int, char const*, char const*) const+0xf9) [0x887ca9]
/usr/sbin/ndbmtd(Dblqh::execCOPY_FRAGREQ(Signal*)+0xa29) [0x65f949]
/usr/sbin/ndbmtd() [0x89604c]
/usr/sbin/ndbmtd() [0x89bc98]
/usr/sbin/ndbmtd(mt_job_thread_main+0x230) [0x8a11e0]
/usr/sbin/ndbmtd() [0x869366]
/lib64/libpthread.so.0(+0x7ea5) [0x7f7b742a3ea5]
/lib64/libc.so.6(clone+0x6d) [0x7f7b72ccb96d]
2022-01-15 04:21:25 [ndbd] INFO -- /var/lib/pb2/sb_1-2918142-1619218179.52/rpm/BUILD/mysql-cluster-com-8.0.25/mysql-cluster-com-8.0.25/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp
2022-01-15 04:21:25 [ndbd] INFO -- DBLQH (Line: 19166) 0x00000002 Check getFragmentrec(fragId) failed
2022-01-15 04:21:25 [ndbd] INFO -- Error handler shutting down system
2022-01-15 04:21:26 [ndbd] ALERT -- Node 19: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2022-01-15 04:21:27 [ndbd] INFO -- Angel pid: 34452 started child: 34453

A second data node (node 18) then failed at almost the same moment and went into a crash loop:

Node 18:
2022-01-15 04:21:25 [ndbd] INFO -- LDM(8): Completed copy of fragment T175F3. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2022-01-15 04:21:26 [ndbd] INFO -- Node 19 disconnected in state: 0
For help with below stacktrace consult:
https://dev.mysql.com/doc/refman/en/using-stack-trace.html
Also note that stack_bottom and thread_stack will always show up as zero.
stack_bottom = 0 thread_stack 0x0
/usr/sbin/ndbmtd(my_print_stacktrace(unsigned char const*, unsigned long)+0x3d) [0x8c676d]
/usr/sbin/ndbmtd(ErrorReporter::handleError(int, char const*, char const*, NdbShutdownType)+0x2f) [0x81f59f]
/usr/sbin/ndbmtd(SimulatedBlock::progError(int, int, char const*, char const*) const+0xf9) [0x887ca9]
/usr/sbin/ndbmtd(Qmgr::execDISCONNECT_REP(Signal*)+0x21f) [0x79be7f]
/usr/sbin/ndbmtd() [0x89604c]
/usr/sbin/ndbmtd() [0x89bc05]
/usr/sbin/ndbmtd(mt_job_thread_main+0x4c9) [0x8a1479]
/usr/sbin/ndbmtd() [0x869366]
/lib64/libpthread.so.0(+0x7ea5) [0x7efd8a3e9ea5]
/lib64/libc.so.6(clone+0x6d) [0x7efd88e1196d]
2022-01-15 04:21:26 [ndbd] INFO -- Node 19 disconnected in state: 0
2022-01-15 04:21:26 [ndbd] INFO -- Node 19 disconnected in phase: 3
2022-01-15 04:21:26 [ndbd] INFO -- QMGR (Line: 4245) 0x00000002
2022-01-15 04:21:26 [ndbd] INFO -- Error handler shutting down system
2022-01-15 04:21:26 [ndbd] ALERT -- Node 18: Forced node shutdown completed. Occurred during startphase 5. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.
2022-01-15 04:21:27 [ndbd] INFO -- Angel pid: 14167 started child: 14168

It appears that the two nodes then got caught in a cycle in which one would crash during its restart and then bring the other down with it:

Node 19:
2022-01-15 13:33:22 [ndbd] INFO -- Node 18 disconnected in state: 0
For help with below stacktrace consult:
https://dev.mysql.com/doc/refman/en/using-stack-trace.html
Also note that stack_bottom and thread_stack will always show up as zero.
stack_bottom = 0 thread_stack 0x0
/usr/sbin/ndbmtd(my_print_stacktrace(unsigned char const*, unsigned long)+0x3d) [0x8c676d]
/usr/sbin/ndbmtd(ErrorReporter::handleError(int, char const*, char const*, NdbShutdownType)+0x2f) [0x81f59f]
/usr/sbin/ndbmtd(SimulatedBlock::progError(int, int, char const*, char const*) const+0xf9) [0x887ca9]
/usr/sbin/ndbmtd(Qmgr::failReportLab(Signal*, unsigned short, FailRep::FailCause, unsigned short)+0x96d) [0x7956ad]
/usr/sbin/ndbmtd() [0x89604c]
/usr/sbin/ndbmtd() [0x89bc05]
/usr/sbin/ndbmtd(mt_job_thread_main+0x4c9) [0x8a1479]
/usr/sbin/ndbmtd() [0x869366]
/lib64/libpthread.so.0(+0x7ea5) [0x7ff4b634bea5]
/lib64/libc.so.6(clone+0x6d) [0x7ff4b4d7396d]
2022-01-15 13:33:22 [ndbd] INFO -- Node 18 failed
2022-01-15 13:33:22 [ndbd] INFO -- QMGR (Line: 5039) 0x00000002
2022-01-15 13:33:22 [ndbd] INFO -- Error handler shutting down system
2022-01-15 13:33:22 [ndbd] ALERT -- Node 19: Forced node shutdown completed. Occurred during startphase 2. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.

Node 18:
2022-01-15 13:33:02 [ndbd] INFO -- (16), tab(6,3), lcpNo: 65535, m_max_restorable_gci: 2429, crestartNewestGci: 2430, srStartGci: 0
For help with below stacktrace consult:
https://dev.mysql.com/doc/refman/en/using-stack-trace.html
Also note that stack_bottom and thread_stack will always show up as zero.
stack_bottom = 0 thread_stack 0x0
stack_bottom = 0 thread_stack 0x0
/usr/sbin/ndbmtd(my_print_stacktrace(unsigned char const*, unsigned long)+0x3d) [0x8c676d]
/usr/sbin/ndbmtd(ErrorReporter::handleError(int, char const*, char const*, NdbShutdownType)+0x2f) [0x81f59f]
/usr/sbin/ndbmtd(SimulatedBlock::progError(int, int, char const*, char const*) const+0xf9) [0x887ca9]
/usr/sbin/ndbmtd(Dblqh::send_restore_lcp(Signal*)+0x9d9) [0x624b59]
/usr/sbin/ndbmtd() [0x89604c]
/usr/sbin/ndbmtd() [0x89bc98]
/usr/sbin/ndbmtd(mt_job_thread_main+0x4c9) [0x8a1479]
/usr/sbin/ndbmtd() [0x869366]
/lib64/libpthread.so.0(+0x7ea5) [0x7fe8c2222ea5]
/lib64/libc.so.6(clone+0x6d) [0x7fe8c0c4a96d]
[the identical stack trace above was printed five more times]
2022-01-15 13:33:02 [ndbd] INFO -- /var/lib/pb2/sb_1-2918142-1619218179.52/rpm/BUILD/mysql-cluster-com-8.0.25/mysql-cluster-com-8.0.25/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp
2022-01-15 13:33:02 [ndbd] INFO -- DBLQH (Line: 27173) 0x00000002 Check c_local_sysfile.m_max_restorable_gci >= crestartNewestGci failed
2022-01-15 13:33:02 [ndbd] INFO -- Error handler shutting down system
2022-01-15 13:33:02 [ndbd] ALERT -- Node 18: Forced node shutdown completed. Occurred during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

Eventually node 19 was able to correct itself and start properly.
Node 18, however, stayed stuck in a crash loop and was eventually taken offline.
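
To make the crash sequence easier to follow, here is a rough Python sketch that pulls the "Forced node shutdown completed" ALERT lines out of the node logs and orders them by time. The regular expression is based only on the ALERT lines quoted in this post, so it is an assumption that other NDB versions format them the same way. The names `ALERT_RE`, `Shutdown`, `extract_shutdowns` and `SAMPLE` are made up for illustration, and the sample lines are the alerts from above with the error text abbreviated.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Pattern derived from the ALERT lines quoted in this post (assumption:
# other versions may word or space these lines differently).
ALERT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[ndbd\] ALERT\s+-- "
    r"Node (?P<node>\d+): Forced node shutdown completed\."
    r"(?: Occurred during startphase (?P<phase>\d+)\.)?"
    r" Caused by error (?P<error>\d+):"
)


class Shutdown(NamedTuple):
    timestamp: str             # "YYYY-MM-DD HH:MM:SS", sorts correctly as text
    node: int                  # data node id reported in the alert
    startphase: Optional[int]  # None when the alert names no start phase
    error: int                 # NDB error code, e.g. 2341 or 2308


def extract_shutdowns(lines: Iterable[str]) -> List[Shutdown]:
    """Return forced-shutdown events found in the given log lines, oldest first."""
    events = []
    for line in lines:
        match = ALERT_RE.match(line.strip())
        if match is None:
            continue  # not a forced-shutdown alert
        phase = match.group("phase")
        events.append(
            Shutdown(
                timestamp=match.group("ts"),
                node=int(match.group("node")),
                startphase=int(phase) if phase is not None else None,
                error=int(match.group("error")),
            )
        )
    # sorted() is stable, so alerts with the same timestamp keep input order.
    return sorted(events, key=lambda event: event.timestamp)


# The four alerts from this post, with the quoted error text abbreviated.
SAMPLE = [
    "2022-01-15 04:21:26 [ndbd] ALERT -- Node 19: Forced node shutdown completed. Caused by error 2341: '...'",
    "2022-01-15 04:21:26 [ndbd] ALERT -- Node 18: Forced node shutdown completed. Occurred during startphase 5. Caused by error 2308: '...'",
    "2022-01-15 13:33:22 [ndbd] ALERT -- Node 19: Forced node shutdown completed. Occurred during startphase 2. Caused by error 2308: '...'",
    "2022-01-15 13:33:02 [ndbd] ALERT -- Node 18: Forced node shutdown completed. Occurred during startphase 5. Caused by error 2341: '...'",
]

if __name__ == "__main__":
    for event in extract_shutdowns(SAMPLE):
        phase = "n/a" if event.startphase is None else str(event.startphase)
        print(f"{event.timestamp}  node {event.node}  startphase {phase}  error {event.error}")
```

In practice the lines from each node's log file would be concatenated and passed in instead of `SAMPLE`. Sorting on the timestamp string is enough here because the format is fixed-width.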

Can you advise on how to recover from this state?
