Hi,
Is it advisable to force an LCP (local checkpoint) with the following command?
ndb_mgm> ALL DUMP 7099
No LCP has completed for quite a while, and the undo space keeps growing. In the cluster log, LCP 5029 started at 2017-11-21 19:59:49 and there is no matching "completed" entry:
...
root@infra01.dc1:~# grep "Local checkpoint" /var/lib/mysql-cluster/ndbmgm01/ndb_1_cluster.log
2017-11-20 00:29:59 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5011 completed
2017-11-20 00:30:00 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5012 started. Keep GCI = 22782266 oldest restorable GCI = 22790412
2017-11-20 02:51:34 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5012 completed
2017-11-20 02:51:35 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5013 started. Keep GCI = 22790428 oldest restorable GCI = 22798426
2017-11-20 05:16:36 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5013 completed
2017-11-20 05:16:37 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5014 started. Keep GCI = 22798428 oldest restorable GCI = 22806613
2017-11-20 07:41:17 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5014 completed
2017-11-20 07:41:18 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5015 started. Keep GCI = 22806615 oldest restorable GCI = 22814786
2017-11-20 10:20:56 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5015 completed
2017-11-20 10:20:57 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5016 started. Keep GCI = 22814787 oldest restorable GCI = 22823807
2017-11-20 12:58:41 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5016 completed
2017-11-20 12:58:42 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5017 started. Keep GCI = 22823810 oldest restorable GCI = 22832726
2017-11-20 15:40:53 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5017 completed
2017-11-20 15:40:54 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5018 started. Keep GCI = 22832728 oldest restorable GCI = 22841891
2017-11-20 18:29:08 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5018 completed
2017-11-20 18:29:09 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5019 started. Keep GCI = 22841893 oldest restorable GCI = 22851404
2017-11-20 20:59:30 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5019 completed
2017-11-20 20:59:30 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5020 started. Keep GCI = 22851407 oldest restorable GCI = 22859900
2017-11-20 23:31:42 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5020 completed
2017-11-20 23:31:43 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5021 started. Keep GCI = 22859902 oldest restorable GCI = 22868499
2017-11-21 01:52:35 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5021 completed
2017-11-21 01:52:36 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5022 started. Keep GCI = 22868500 oldest restorable GCI = 22876085
2017-11-21 04:16:59 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5022 completed
2017-11-21 04:16:59 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5023 started. Keep GCI = 22876094 oldest restorable GCI = 22884245
2017-11-21 06:48:00 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5023 completed
2017-11-21 06:48:01 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5024 started. Keep GCI = 22884247 oldest restorable GCI = 22892777
2017-11-21 09:18:45 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5024 completed
2017-11-21 09:18:46 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5025 started. Keep GCI = 22892778 oldest restorable GCI = 22901296
2017-11-21 11:55:10 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5025 completed
2017-11-21 11:55:11 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5026 started. Keep GCI = 22901298 oldest restorable GCI = 22910138
2017-11-21 14:31:44 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5026 completed
2017-11-21 14:31:44 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5027 started. Keep GCI = 22910140 oldest restorable GCI = 22918989
2017-11-21 17:16:40 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5027 completed
2017-11-21 17:16:41 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5028 started. Keep GCI = 22918991 oldest restorable GCI = 22928310
2017-11-21 19:59:48 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5028 completed
2017-11-21 19:59:49 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5029 started. Keep GCI = 22928313 oldest restorable GCI = 22937530
root@infra01.dc1:~#
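To make the stalled checkpoint easier to spot, here is a rough sketch that pairs the "started" and "completed" lines from the grep output above and reports how long each LCP took and which ones never finished. The regular expression is based only on the line format shown in this post and may need adjusting for other NDB versions; `lcp_report` is a hypothetical helper name, not part of any MySQL tool.

```python
import re
from datetime import datetime

# Assumption: cluster-log lines look like the ones pasted above.
LCP_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+\[MgmtSrvr\]\s+INFO\s+--\s+"
    r"Node (\d+): Local checkpoint (\d+) (started|completed)"
)


def lcp_report(lines):
    """Pair LCP start/completion events.

    Returns (durations, unfinished):
      durations  maps LCP id -> timedelta between 'started' and 'completed'
      unfinished maps LCP id -> start time for LCPs with no 'completed' line
    """
    started = {}
    durations = {}
    for line in lines:
        m = LCP_RE.match(line)
        if not m:
            continue  # ignore anything that is not an LCP event
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        lcp_id = int(m.group(3))
        if m.group(4) == "started":
            started[lcp_id] = ts
        elif lcp_id in started:
            durations[lcp_id] = ts - started.pop(lcp_id)
    return durations, started


# Last three lines of the output above, copied verbatim.
sample = [
    "2017-11-21 17:16:41 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5028 started. Keep GCI = 22918991 oldest restorable GCI = 22928310",
    "2017-11-21 19:59:48 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5028 completed",
    "2017-11-21 19:59:49 [MgmtSrvr] INFO -- Node 5: Local checkpoint 5029 started. Keep GCI = 22928313 oldest restorable GCI = 22937530",
]

durations, unfinished = lcp_report(sample)
for lcp_id, took in sorted(durations.items()):
    print(f"LCP {lcp_id} took {took}")
for lcp_id, since in sorted(unfinished.items()):
    print(f"LCP {lcp_id} started at {since} and has not completed")
```

Feeding the whole cluster log through the same function (for example, the lines of `ndb_1_cluster.log`) should show each earlier LCP taking roughly two and a half hours, with 5029 as the only one left open.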
...
Node 6 crashed on 2017-11-21 22:35 and has not come back up since. Undo space keeps growing, and the restart of the node stalls at:
2017-11-23 08:39:13 [ndbd] INFO -- LDM(11): We have completed restoring our fragments and executed REDO log and rebuilt ordered indexes
We have 12 LDMs per node and only 2 have completed. The CPU is idle, and nothing seems to be happening on that node.
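
For tracking restart progress, a minimal sketch that counts which LDM instances have logged the "completed restoring" message in the ndbd node log. It assumes the message text matches the line quoted above exactly; `ldm_restore_progress` is a hypothetical helper, and the total of 12 LDMs comes from our configuration.

```python
import re

# Assumption: the ndbd log marks a finished LDM with this exact message.
DONE_RE = re.compile(r"LDM\((\d+)\): We have completed restoring our fragments")


def ldm_restore_progress(lines, total_ldms):
    """Return (done, remaining).

    done      sorted list of LDM instance numbers that logged completion
    remaining how many of total_ldms have not logged completion yet
    """
    done = set()
    for line in lines:
        m = DONE_RE.search(line)
        if m:
            done.add(int(m.group(1)))
    return sorted(done), total_ldms - len(done)


# The only completion line quoted in this post.
sample = [
    "2017-11-23 08:39:13 [ndbd] INFO -- LDM(11): We have completed restoring "
    "our fragments and executed REDO log and rebuilt ordered indexes",
]

done, remaining = ldm_restore_progress(sample, total_ldms=12)
print(f"LDMs done: {done}, still pending: {remaining}")
```

Run against the full node log rather than this single line, it would list both LDMs that have finished so far and show 10 still pending.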