
Wait LCP to ensure durability
Posted by: Stefan Auweiler
Date: September 28, 2017 03:46AM

We had to restart a node because of a HW/disk failure, and now we cannot bring the node back up: it seems to be waiting for the next LCP...

Running mysql-5.6.37 ndb-7.4.16


Everything seems to start fine until phase 5, which never seems to finish:

Node 5:


2017-09-28 00:42:20 [ndbd] INFO -- Start phase 4 completed
2017-09-28 00:42:20 [ndbd] INFO -- Phase 4 continued preparations of the REDO log
2017-09-28 00:42:20 [ndbd] INFO -- Request copying of distribution and dictionary information from master Starting
2017-09-28 00:42:22 [ndbd] INFO -- Copying of dictionary information from master Starting
2017-09-28 00:42:24 [ndbd] INFO -- Copying of dictionary information from master Completed
2017-09-28 00:42:24 [ndbd] INFO -- Request copying of distribution and dictionary information from master Completed
2017-09-28 00:42:24 [ndbd] INFO -- NDB start phase 4 completed
2017-09-28 00:42:24 [ndbd] INFO -- Start NDB start phase 5 (only to DBDIH)
2017-09-28 00:42:24 [ndbd] INFO -- Restore Database Off-line Starting
2017-09-28 00:42:24 [ndbd] INFO -- All start fragments sent, requesting LDM to restore all fragments and to execute the REDO log to bring the database to an off-line but consistent state

.... a lot of index creation and data copying ....

2017-09-28 00:42:51 [ndbd] INFO -- LDM(15): Completed copy of fragment T79F233. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2017-09-28 00:42:51 [ndbd] INFO -- LDM(15): Completed copy of fragment T80F225. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2017-09-28 00:42:51 [ndbd] INFO -- LDM(15): Completed copy of fragment T80F233. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2017-09-28 00:42:51 [ndbd] INFO -- LDM(3): Completed copy of fragment T75F41. Changed +30691/-92315 rows, 8238924 bytes. 400 pct churn to 30693 rows.
2017-09-28 00:42:51 [ndbd] INFO -- LDM(3): Completed copy of fragment T76F33. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2017-09-28 00:42:51 [ndbd] INFO -- LDM(3): Completed copy of fragment T76F41. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2017-09-28 00:42:51 [ndbd] INFO -- LDM(3): Completed copy of fragment T77F33. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2017-09-28 00:42:51 [ndbd] INFO -- LDM(3): Completed copy of fragment T77F41. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2017-09-28 00:42:51 [ndbd] INFO -- LDM(3): Completed copy of fragment T78F33. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2017-09-28 00:42:51 [ndbd] INFO -- LDM(3): Completed copy of fragment T78F41. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2017-09-28 00:42:51 [ndbd] INFO -- LDM(3): Completed copy of fragment T79F33. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2017-09-28 00:42:51 [ndbd] INFO -- LDM(3): Completed copy of fragment T79F41. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2017-09-28 00:42:51 [ndbd] INFO -- LDM(3): Completed copy of fragment T80F33. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2017-09-28 00:42:51 [ndbd] INFO -- LDM(3): Completed copy of fragment T80F41. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2017-09-28 00:42:51 [ndbd] INFO -- Bring Database On-line Completed
2017-09-28 00:42:51 [ndbd] INFO -- Starting REDO logging
2017-09-28 00:42:51 [ndbd] INFO -- Make On-line Database recoverable by waiting for LCP Starting, all parallel threads have now ceased their activity and we have a single wait state here
2017-09-28 03:15:07 [ndbd] INFO -- Adjusting disk write speed bounds due to : Backup started
2017-09-28 03:16:20 [ndbd] INFO -- Backup log buffer report : size 17039360 bytes, hwm 326308 bytes (1 pct)
2017-09-28 03:16:22 [ndbd] INFO -- Adjusting disk write speed bounds due to : Backup completed


This has been the state for about 8 hours now...
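For reference, this is how we have been checking the state from the management client (plain ndb_mgm commands; the CLUSTERLOG line only raises the logging level for checkpoint events so that LCP/GCP activity shows up in the cluster log):

ndb_mgm -e "ALL STATUS"                    # node status / current start phase
ndb_mgm -e "ALL REPORT MemoryUsage"        # data and index memory usage per node
ndb_mgm -e "ALL CLUSTERLOG CHECKPOINT=15"  # log all checkpoint events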

on Node 6 (the replica):

2017-09-28 00:20:31 [ndbd] INFO -- Node 5 has completed node fail handling
2017-09-28 00:37:07 [ndbd] INFO -- findNeighbours from: 2248 old (left: 4 right: 8) new (4 8)
2017-09-28 00:42:19 [ndbd] INFO -- Started invalidation of node 5
2017-09-28 00:42:19 [ndbd] INFO -- Completed invalidation of node 5
2017-09-28 00:42:19 [ndbd] INFO -- NR Status: node=5,OLD=Node failure handling complete,NEW=All nodes permitted us
2017-09-28 00:42:24 [ndbd] INFO -- NR Status: node=5,OLD=All nodes permitted us,NEW=Include node in LCP/GCP protocols
2017-09-28 00:42:24 [ndbd] INFO -- NR Status: node=5,OLD=Include node in LCP/GCP protocols,NEW=Synchronize start node with live nodes
Ignoring scan.m_state == ScanOp::Blocked, refetch
Ignoring scan.m_state == ScanOp::Blocked, refetch
Ignoring scan.m_state == ScanOp::Blocked, refetch
Ignoring scan.m_state == ScanOp::Blocked, refetch
Ignoring scan.m_state == ScanOp::Blocked, refetch
Ignoring scan.m_state == ScanOp::Blocked, refetch
Ignoring scan.m_state == ScanOp::Blocked, refetch
2017-09-28 00:42:51 [ndbd] INFO -- NR Status: node=5,OLD=Synchronize start node with live nodes,NEW=Wait LCP to ensure durability
2017-09-28 03:15:06 [ndbd] INFO -- Adjusting disk write speed bounds due to : Backup started
2017-09-28 03:16:19 [ndbd] INFO -- Backup log buffer report : size 17039360 bytes, hwm 283952 bytes (1 pct)
2017-09-28 03:16:19 [ndbd] INFO -- Adjusting disk write speed bounds due to : Backup completed



on Node 1 (the Manager):

2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 5: Node restart completed copying the fragments to Node 5
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 5: Bring Database On-line Completed on node 5
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 5: Starting REDO logging on node 5
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 5: LDM(1): CopyFrag complete. 122 frags, +1158054/-1069371 rows, 355073368 bytes/27292 ms 13010162
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 5: LDM(2): CopyFrag complete. 124 frags, +1159354/-1125594 rows, 358707920 bytes/27292 ms 13143335
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 5: LDM(3): CopyFrag complete. 126 frags, +1170642/-1602995 rows, 387661512 bytes/27292 ms 14204217
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 5: LDM(4): CopyFrag complete. 122 frags, +1151585/-1078742 rows, 352054332 bytes/27292 ms 12899543
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 5: LDM(5): CopyFrag complete. 122 frags, +1153245/-1085182 rows, 353370352 bytes/27292 ms 12947763
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 5: LDM(6): CopyFrag complete. 122 frags, +1160755/-1057196 rows, 355492268 bytes/27292 ms 13025511
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 5: LDM(7): CopyFrag complete. 122 frags, +1158949/-1071535 rows, 355111804 bytes/27292 ms 13011571
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 5: LDM(8): CopyFrag complete. 122 frags, +1139816/-1080159 rows, 346548228 bytes/27292 ms 12697795
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 5: LDM(9): CopyFrag complete. 122 frags, +1142529/-1084242 rows, 348131352 bytes/27292 ms 12755802
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 5: LDM(13): CopyFrag complete. 122 frags, +1154890/-1060333 rows, 352134648 bytes/27292 ms 1290248
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 5: LDM(14): CopyFrag complete. 122 frags, +1134603/-1090741 rows, 345012724 bytes/27291 ms 1264199
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 5: LDM(15): CopyFrag complete. 122 frags, +1149243/-1071930 rows, 350748056 bytes/27291 ms 1285215
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 5: LDM(16): CopyFrag complete. 122 frags, +1135839/-1086801 rows, 345380012 bytes/27291 ms 1265545
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 5: LDM(10): CopyFrag complete. 122 frags, +1157643/-1065288 rows, 354863748 bytes/27292 ms 1300248
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 5: LDM(11): CopyFrag complete. 122 frags, +1133592/-1086868 rows, 344716888 bytes/27292 ms 1263069
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 5: LDM(12): CopyFrag complete. 122 frags, +1146841/-1079050 rows, 350373596 bytes/27292 ms 1283795
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 9: NR Status: node=5,OLD=Synchronize start node with live nodes,NEW=Wait LCP to ensure durability
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 7: NR Status: node=5,OLD=Synchronize start node with live nodes,NEW=Wait LCP to ensure durability
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 11: NR Status: node=5,OLD=Synchronize start node with live nodes,NEW=Wait LCP to ensure durability
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 3: Make On-line Database recoverable by waiting for LCP Starting on node 5, LCP id 551370
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 3: NR Status: node=5,OLD=Synchronize start node with live nodes,NEW=Wait LCP to ensure durability
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 12: NR Status: node=5,OLD=Synchronize start node with live nodes,NEW=Wait LCP to ensure durability
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 14: NR Status: node=5,OLD=Synchronize start node with live nodes,NEW=Wait LCP to ensure durability
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 15: NR Status: node=5,OLD=Synchronize start node with live nodes,NEW=Wait LCP to ensure durability
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 16: NR Status: node=5,OLD=Synchronize start node with live nodes,NEW=Wait LCP to ensure durability
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 18: NR Status: node=5,OLD=Synchronize start node with live nodes,NEW=Wait LCP to ensure durability
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 8: NR Status: node=5,OLD=Synchronize start node with live nodes,NEW=Wait LCP to ensure durability
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 4: NR Status: node=5,OLD=Synchronize start node with live nodes,NEW=Wait LCP to ensure durability
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 17: NR Status: node=5,OLD=Synchronize start node with live nodes,NEW=Wait LCP to ensure durability
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 6: NR Status: node=5,OLD=Synchronize start node with live nodes,NEW=Wait LCP to ensure durability
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 10: NR Status: node=5,OLD=Synchronize start node with live nodes,NEW=Wait LCP to ensure durability
2017-09-28 00:42:51 [MgmtSrvr] INFO -- Node 13: NR Status: node=5,OLD=Synchronize start node with live nodes,NEW=Wait LCP to ensure durability
....
2017-09-28 00:43:07 [MgmtSrvr] INFO -- Node 5: Data usage is 11%(151129 32K pages of total 1280000)
2017-09-28 00:43:07 [MgmtSrvr] INFO -- Node 5: Index usage is 5%(34511 8K pages of total 640512)


So it seems everybody is waiting for the LCP...
We first started the cluster node normally, then with an initial start; both attempts ended with the same result.
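One thing we found in the documentation, but have not tried yet (we are not sure it is safe, or even meaningful, while a node restart is waiting in this state), would be to request an LCP manually from the management client via the DUMP interface:

ndb_mgm -e "ALL DUMP 7099"    # DihStartLcpImmediately - ask DBDIH to start an LCP now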

Any idea how to solve this?

Thanks
Stefan



