Help needed finding a bug in NDB Cluster 7.5.11
Posted by:
Nimbi lin
Date: November 13, 2018 01:34AM
I need help finding a bug in NDB Cluster 7.5.11. The cluster log is below:
Node 23: Stall LCP: current stall time: 0 secs, max wait time:11 secs
2018-11-13 15:01:22 [MgmtSrvr] INFO -- Node 23: Local checkpoint 4043 started. Keep GCI = 3283088 oldest restorable GCI = 3283119
2018-11-13 15:03:33 [MgmtSrvr] INFO -- Node 23: LDM(0): Completed LCP, #frags = 1152 #records = 21314442, #bytes = 4108609828
2018-11-13 15:03:33 [MgmtSrvr] INFO -- Node 23: Local checkpoint 4043 completed
2018-11-13 15:03:34 [MgmtSrvr] INFO -- Node 23: Stall LCP, LCP time = 131 secs, wait for Node24, state Synchronize start node with live nodes
2018-11-13 15:03:34 [MgmtSrvr] INFO -- Node 23: Stall LCP: current stall time: 0 secs, max wait time:9 secs
2018-11-13 15:03:43 [MgmtSrvr] INFO -- Node 23: Local checkpoint 4044 started. Keep GCI = 3283166 oldest restorable GCI = 3283149
2018-11-13 15:03:45 [MgmtSrvr] ALERT -- Node 24: Forced node shutdown completed. Occured during startphase 5. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2018-11-13 15:03:45 [MgmtSrvr] ALERT -- Node 23: Node 24 Disconnected
2018-11-13 15:03:45 [MgmtSrvr] INFO -- Node 23: Communication to Node 24 closed
2018-11-13 15:03:45 [MgmtSrvr] ALERT -- Node 23: Network partitioning - arbitration required
2018-11-13 15:03:45 [MgmtSrvr] INFO -- Node 23: President restarts arbitration thread [state=7]
2018-11-13 15:03:45 [MgmtSrvr] ALERT -- Node 22: Node 24 Disconnected
2018-11-13 15:03:45 [MgmtSrvr] ALERT -- Node 23: Arbitration won - positive reply from node 22
2018-11-13 15:03:45 [MgmtSrvr] INFO -- Node 23: NR Status: node=24,OLD=Synchronize start node with live nodes,NEW=Node failed, fail handling on
2018-11-13 15:03:45 [MgmtSrvr] INFO -- Node 23: Removed lock for node 24
2018-11-13 15:03:45 [MgmtSrvr] INFO -- Node 23: DICT: remove lock by failed node 24 for NodeRestart
2018-11-13 15:03:45 [MgmtSrvr] INFO -- Node 23: DICT: unlocked by node 24 for NodeRestart
2018-11-13 15:03:46 [MgmtSrvr] INFO -- Node 23: Started arbitrator node 22 [ticket=f07b00582279a5e5]
2018-11-13 15:04:14 [MgmtSrvr] WARNING -- Node 23: Failure handling of node 24 has not completed in 29 seconds - state = 6
2018-11-13 15:04:14 [MgmtSrvr] INFO -- Node 23: NF Node 24 tc: 1 lqh: 1 dih: 0 dict: 1 recNODE_FAILREP: 1
2018-11-13 15:04:14 [MgmtSrvr] INFO -- Node 23: m_NF_COMPLETE_REP: [SignalCounter: m_count=1 0000000000800000] m_nodefailSteps: 00000002
2018-11-13 15:04:25 [MgmtSrvr] INFO -- Node 23: NR Status: node=24,OLD=Node failed, fail handling ongoing,NEW=Node failure handling complete
2018-11-13 15:04:25 [MgmtSrvr] INFO -- Node 23: Communication to Node 24 opened
2018-11-13 15:05:46 [MgmtSrvr] INFO -- Node 23: LDM(0): Completed LCP, #frags = 1152 #records = 21314469, #bytes = 4108625512
2018-11-13 15:05:46 [MgmtSrvr] INFO -- Node 23: Local checkpoint 4044 completed
2018-11-13 15:05:46 [MgmtSrvr] INFO -- Node 23: Stall LCP, LCP time = 122 secs, wait for Node24, state Node failure handling complete
2018-11-13 15:05:46 [MgmtSrvr] INFO -- Node 23: Stall LCP: current stall time: 0 secs, max wait time:9 secs
2018-11-13 15:05:55 [MgmtSrvr] INFO -- Node 23: Local checkpoint 4045 started. Keep GCI = 3283235 oldest restorable GCI = 3283237
2018-11-13 15:09:42 [MgmtSrvr] INFO -- Node 23: LDM(0): Completed LCP, #frags = 1152 #records = 21314480, #bytes = 4108632000
2018-11-13 15:09:42 [MgmtSrvr] INFO -- Node 23: Local checkpoint 4045 completed
2018-11-13 15:09:43 [MgmtSrvr] INFO -- Node 23: Stall LCP, LCP time = 226 secs, wait for Node24, state Node failure handling complete
2018-11-13 15:09:43 [MgmtSrvr] INFO -- Node 23: Stall LCP: current stall time: 0 secs, max wait time:16 secs
2018-11-13 15:09:58 [MgmtSrvr] INFO -- Node 23: Local checkpoint 4046 started. Keep GCI = 3283299 oldest restorable GCI = 3283295
2018-11-13 15:13:45 [MgmtSrvr] INFO -- Node 23: LDM(0): Completed LCP, #frags = 1152 #records = 21314520, #bytes = 4108655412
2018-11-13 15:13:45 [MgmtSrvr] INFO -- Node 23: Local checkpoint 4046 completed
2018-11-13 15:13:46 [MgmtSrvr] INFO -- Node 23: Stall LCP, LCP time = 226 secs, wait for Node24, state Node failure handling complete
2018-11-13 15:13:46 [MgmtSrvr] INFO -- Node 23: Stall LCP: current stall time: 0 secs, max wait time:16 secs
2018-11-13 15:14:01 [MgmtSrvr] INFO -- Node 23: Local checkpoint 4047 started. Keep GCI = 3283417 oldest restorable GCI = 3283412
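To make a log like the one above easier to scan, here is a minimal Python sketch that pulls out the ALERT lines and the "LCP time" values. It is not part of NDB Cluster: the function name summarize_cluster_log and both regular expressions are assumptions written for this illustration, based only on the line format visible in the excerpt.

```python
import re

# Assumed line format, taken from the excerpt above, for example:
# 2018-11-13 15:03:45 [MgmtSrvr] ALERT -- Node 23: Node 24 Disconnected
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[MgmtSrvr\] "
    r"(?P<level>[A-Z]+)\s+-- Node (?P<node>\d+): (?P<msg>.*)$"
)
# Matches the "LCP time = N secs" part of a "Stall LCP" message.
LCP_TIME_RE = re.compile(r"LCP time = (\d+) secs")


def summarize_cluster_log(lines):
    """Return (alerts, lcp_times) extracted from cluster-log lines.

    alerts is a list of (timestamp, reporting_node, message) tuples,
    lcp_times is a list of LCP durations in seconds.
    """
    alerts = []
    lcp_times = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines without the timestamp/level prefix
        if match.group("level") == "ALERT":
            alerts.append(
                (match.group("ts"), int(match.group("node")), match.group("msg"))
            )
        lcp = LCP_TIME_RE.search(match.group("msg"))
        if lcp:
            lcp_times.append(int(lcp.group(1)))
    return alerts, lcp_times


if __name__ == "__main__":
    # Three lines copied from the log above.
    sample = [
        "2018-11-13 15:03:34 [MgmtSrvr] INFO -- Node 23: Stall LCP, LCP time = 131 secs, "
        "wait for Node24, state Synchronize start node with live nodes",
        "2018-11-13 15:03:45 [MgmtSrvr] ALERT -- Node 23: Node 24 Disconnected",
        "2018-11-13 15:03:45 [MgmtSrvr] ALERT -- Node 23: Network partitioning - arbitration required",
    ]
    alerts, lcp_times = summarize_cluster_log(sample)
    for ts, node, msg in alerts:
        print(ts, "node", node, msg)
    print("LCP times (secs):", lcp_times)
```

Run against the full cluster log file, this should list the forced shutdown of node 24 next to the disconnect and arbitration alerts, and show how the LCP time changes between checkpoints.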
The problem appeared after the following steps:
1. I have an NDB Cluster with 2 data nodes and 2 SQL nodes on CentOS 6.8. Node 24's hard disk was almost out of space.
2. I stopped node 24 with the command "24 stop" in the ndb_mgm console.
3. I then used pvcreate and other LVM commands to extend the root file system (/).
4. After successfully extending the disk space, I started ndbd without the --initial option, but got the error: "startphase 5 error 2355: 'Failure to restore schema(Resource configuration error). Permanent error, external action needed'."
5. I then started node 24 with ndbd's --initial option, but got the errors shown in the log above.
6. Sorry, I also remember that I added the following 3 variables to config.ini to solve error 2355, but I only restarted the management node and forgot to restart the other data node:
TimeBetweenLocalCheckpoints=10
#not work NoOfFragmentLogFiles=32
#ok MaxNoOfExecutionThreads=6
Could an NDB Cluster expert please help me solve this as soon as possible?
Oracle&MCluster lover: Georgelin
Edited 1 time(s). Last edit at 11/13/2018 02:18AM by Nimbi lin.