MySQL Cluster restart: data nodes stuck after phases 1 & 3
Hello,
I performed a restart of a production MySQL Cluster today following a configuration change. I realise now that I should have done a rolling restart, but I did not (I tested the change and the restart in staging, where it worked OK).
The data nodes are taking a very long time to start. The current details from ndb_mgm are:
ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)] 4 node(s)
id=3 @192.168.3.43 (mysql-5.5.29 ndb-7.2.10, starting, Nodegroup: 0)
id=5 @192.168.3.44 (mysql-5.5.29 ndb-7.2.10, starting, Nodegroup: 0, Master)
id=7 @192.168.3.45 (mysql-5.5.29 ndb-7.2.10, starting, Nodegroup: 0)
id=9 @192.168.3.46 (mysql-5.5.29 ndb-7.2.10, starting, Nodegroup: 0)
[ndb_mgmd(MGM)] 1 node(s)
id=1 @192.168.3.41 (mysql-5.5.29 ndb-7.2.10)
[mysqld(API)] 6 node(s)
id=10 (not connected, accepting connect from any host)
id=11 (not connected, accepting connect from any host)
id=12 (not connected, accepting connect from any host)
id=13 (not connected, accepting connect from any host)
id=14 (not connected, accepting connect from any host)
id=15 (not connected, accepting connect from any host)
ndb_mgm> all status
Node 3: starting (Last completed phase 1) (mysql-5.5.29 ndb-7.2.10)
Node 5: starting (Last completed phase 3) (mysql-5.5.29 ndb-7.2.10)
Node 7: starting (Last completed phase 3) (mysql-5.5.29 ndb-7.2.10)
Node 9: starting (Last completed phase 3) (mysql-5.5.29 ndb-7.2.10)
This has been the status for about 7 hours now. Each data node has 96 GB of RAM and about 160 GB of data.
The log on each data node shows variations of:
jbalock thr: 0 waiting for lock, contentions: 20 spins: 46
Looking at 'top', CPU usage for the ndbmtd process on node 3 is around 2%; on the other three it is between 99% and 101%.
So on the one hand there is no error I can point to, but on the other hand there has been no change for 7 hours, even though the nodes reached this status after about 5 minutes. This was not an '--initial' restart, and according to the docs phase 4 is essentially 'find the tail of the redo log'. I don't know exactly what that entails, but it shouldn't take this long. I think there must be something else going on.
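In case it helps anyone reproduce what I'm seeing, this is roughly the diagnostic loop I've been running from a shell (a sketch only: the management-server address 192.168.3.41 is from the 'show' output above, and the cluster-log path is the default location, which may differ on your install):

```shell
#!/bin/sh
# Poll start-phase progress via the management node. The guard lets
# the script exit cleanly on hosts without the NDB client tools.
if command -v ndb_mgm >/dev/null 2>&1; then
    MGMD=192.168.3.41   # management node, from 'show' above
    # Last completed start phase for every data node
    ndb_mgm -c "$MGMD" -e "all status"
    # Memory usage can hint at whether nodes are still loading data
    ndb_mgm -c "$MGMD" -e "all report memoryusage"
    # Cluster log on the management host (default path; an assumption)
    tail -n 50 /var/lib/mysql-cluster/ndb_1_cluster.log 2>/dev/null
else
    echo "ndb_mgm not installed on this host"
fi
```

Nothing in that output has changed for hours either, beyond the jbalock lines in the node logs.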
I started ndbmtd using sudo and '&' from the command line on each data node, as per the instructions I had been left. I had to do this twice on node 3, which is why it is not the master. The first time, I got this message:
2014-02-08 13:06:58 [MgmtSrvr] ALERT -- Node 3: Forced node shutdown completed. Occured during startphase 1. Caused by error 2353: 'Insufficent nodes for system restart(Restart error). Temporary error, restart node'.
So I just issued the same command again, and the node moved to the 'starting' status.
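For the record, the rolling restart I should have done in the first place would look something like this from the management client (a sketch, not what I actually ran; node IDs and the management address are taken from my 'show' output):

```shell
#!/bin/sh
# Restart data nodes one at a time so the cluster stays up throughout.
# '<id> restart' asks the management server to restart that node.
if command -v ndb_mgm >/dev/null 2>&1; then
    MGMD=192.168.3.41
    for id in 3 5 7 9; do           # data node IDs from 'show'
        ndb_mgm -c "$MGMD" -e "$id restart"
        # Wait until the node reports 'started' before moving on,
        # so two nodes in a node group are never down at once.
        until ndb_mgm -c "$MGMD" -e "$id status" | grep -qi started; do
            sleep 10
        done
    done
else
    echo "ndb_mgm not installed on this host"
fi
```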
Does anyone have any advice or insight? I can post more details if needed. I have searched for this topic but could not find this exact scenario.
Thanks
Edited 1 time(s). Last edit at 02/08/2014 07:24PM by Paul Williamson.