MySQL Forums
Forum List  »  NDB clusters

MySQL Cluster restart : data nodes stuck after phases 1 & 3
Posted by: Paul Williamson
Date: February 08, 2014 07:16PM

Hello,

I performed a full restart of a production MySQL cluster today following a configuration change. I realise now that I should have done a rolling restart, but did not (I had tested the change and the restart in staging, where it worked fine).
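For reference, the rolling restart I should have done would look roughly like this (a sketch using the node IDs from our cluster, not the exact commands I ran):

```shell
# Rolling restart sketch: restart one data node at a time via the
# management client, so the cluster stays up throughout.
for node in 3 5 7 9; do
    # Ask the management server to restart this data node
    ndb_mgm -e "$node RESTART"
    # Wait until this node reports 'started' before touching the next one
    until ndb_mgm -e "$node STATUS" | grep -Eq "Node $node: started"; do
        sleep 10
    done
done
```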

The data nodes are taking a long time to start. The details from ndb_mgm are currently:

ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)] 4 node(s)
id=3 @192.168.3.43 (mysql-5.5.29 ndb-7.2.10, starting, Nodegroup: 0)
id=5 @192.168.3.44 (mysql-5.5.29 ndb-7.2.10, starting, Nodegroup: 0, Master)
id=7 @192.168.3.45 (mysql-5.5.29 ndb-7.2.10, starting, Nodegroup: 0)
id=9 @192.168.3.46 (mysql-5.5.29 ndb-7.2.10, starting, Nodegroup: 0)

[ndb_mgmd(MGM)] 1 node(s)
id=1 @192.168.3.41 (mysql-5.5.29 ndb-7.2.10)

[mysqld(API)] 6 node(s)
id=10 (not connected, accepting connect from any host)
id=11 (not connected, accepting connect from any host)
id=12 (not connected, accepting connect from any host)
id=13 (not connected, accepting connect from any host)
id=14 (not connected, accepting connect from any host)
id=15 (not connected, accepting connect from any host)

ndb_mgm> all status
Node 3: starting (Last completed phase 1) (mysql-5.5.29 ndb-7.2.10)
Node 5: starting (Last completed phase 3) (mysql-5.5.29 ndb-7.2.10)
Node 7: starting (Last completed phase 3) (mysql-5.5.29 ndb-7.2.10)
Node 9: starting (Last completed phase 3) (mysql-5.5.29 ndb-7.2.10)

This has been the status for about 7 hours now. Each data node has 96G of RAM and about 160G of data.

The log on each data node shows variations on lines like:

jbalock thr: 0 waiting for lock, contentions: 20 spins: 46

Looking at 'top', the ndbmtd process on node 3 is using about 2% CPU; on the other three nodes it is between 99% and 101%.

So on the one hand, there's no error I can point to; on the other hand, there has been no change for 7 hours, when it reached this status after about 5 minutes. This was not an '--initial' restart, and in the docs phase 4 looks to be just 'find the tail of the redo log'. I don't know exactly what that entails, but it shouldn't take this long. I think there must be something else going on.
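In case it helps, this is roughly how I have been watching for progress (a sketch; it just samples the same 'all status' output shown above once a minute):

```shell
# Sample the start phase of every data node once a minute, printing a
# timestamp and a count of nodes per 'Last completed phase' value, so
# any phase change since the last sample is visible.
while sleep 60; do
    printf '%s ' "$(date '+%H:%M:%S')"
    ndb_mgm -e "all status" \
        | grep -o 'Last completed phase [0-9]*' | sort | uniq -c
done
```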

I started ndbmtd using sudo and '&' from the command line on each data node, as per the instructions I had been left. I did have to do this twice on node 3, which is why it is not the master. The first time I got the message:

2014-02-08 13:06:58 [MgmtSrvr] ALERT -- Node 3: Forced node shutdown completed. Occured during startphase 1. Caused by error 2353: 'Insufficent nodes for system restart(Restart error). Temporary error, restart node'.

So I just issued the same command and it got the starting status.
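For completeness, the start command on each data node was along these lines (the connect string is assumed from the management node's address in the 'show' output above; paths and options may differ on your setup):

```shell
# Start the multi-threaded data node daemon in the background,
# pointing it at the management server (connect string assumed).
sudo ndbmtd --ndb-connectstring=192.168.3.41 &
```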

Has anyone got any advice or insight? I can post more details if required. I have searched for this topic but could not find this exact scenario.

Thanks



