This morning I upgraded my cluster from 5.0.24a to 5.0.27 using the rolling restart method (see
http://dev.mysql.com/doc/refman/5.0/en/mysql-cluster-rolling-restart.html).
Everything ran fine for nine hours, but then one data node crashed. Here are the management server logs:
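For reference, the rolling restart boiled down to restarting the ndb_mgmd processes first and then each data node in turn from the ndb_mgm client (a sketch with the node IDs from the config below; I waited for each node to report "Started" before moving on):

```shell
# After the management nodes are back, restart one data node at a time.
# Poll with SHOW after each RESTART until the node reports "Started".
ndb_mgm -e "2 RESTART"
ndb_mgm -e "SHOW"
ndb_mgm -e "3 RESTART"
ndb_mgm -e "SHOW"
```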
2006-12-04 14:53:55 [MgmSrvr] INFO -- Node 2: Local checkpoint 1602 started. Keep GCI = 355249 oldest restorable GCI = 355276
2006-12-04 15:00:48 [MgmSrvr] INFO -- Node 2: Local checkpoint 1603 started. Keep GCI = 355443 oldest restorable GCI = 355470
2006-12-04 15:01:19 [MgmSrvr] WARNING -- Node 2: Node 3 missed heartbeat 2
2006-12-04 15:01:20 [MgmSrvr] WARNING -- Node 2: Node 3 missed heartbeat 3
2006-12-04 15:01:21 [MgmSrvr] INFO -- Node 1: Node 3 Connected
2006-12-04 15:01:22 [MgmSrvr] WARNING -- Node 2: Node 3 missed heartbeat 4
2006-12-04 15:01:22 [MgmSrvr] ALERT -- Node 2: Node 3 declared dead due to missed heartbeat
2006-12-04 15:01:22 [MgmSrvr] INFO -- Node 2: Communication to Node 3 closed
2006-12-04 15:01:22 [MgmSrvr] ALERT -- Node 2: Network partitioning - arbitration required
2006-12-04 15:01:22 [MgmSrvr] INFO -- Node 2: President restarts arbitration thread [state=7]
2006-12-04 15:01:22 [MgmSrvr] ALERT -- Node 2: Arbitration won - positive reply from node 1
2006-12-04 15:01:22 [MgmSrvr] INFO -- Node 2: DICT: lock bs: 0 ops: 0 poll: 0 cnt: 0 queue:
2006-12-04 15:01:22 [MgmSrvr] ALERT -- Node 2: Node 3 Disconnected
2006-12-04 15:01:22 [MgmSrvr] ALERT -- Node 2: Backup 216 started from 1 has been aborted. Error: 1326
2006-12-04 15:01:22 [MgmSrvr] INFO -- Node 2: Started arbitrator node 1 [ticket=5c0b00054dc4dab3]
2006-12-04 15:01:36 [MgmSrvr] ALERT -- Node 3: Forced node shutdown completed. Initiated by signal 0. Caused by error 6050: 'WatchDog terminate, internal error or massive overload on the machine running this node(Internal error, programming error or missing error message, please report a bug).
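As an aside, NDB error codes such as 6050 can usually be looked up with the perror utility shipped with MySQL (assuming the build includes the --ndb option):

```shell
perror --ndb 6050
```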
I tried to start the failed node with --initial, but it hangs in start phase 5. I had a similar problem last week when I wanted to increase the MaxNoOfTables attribute: I did a rolling restart of the cluster, but the second data node never restarted (I waited 10 hours, and eventually the whole cluster crashed).
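To be precise about what I did: on the failed host I wipe the node's local data so it rebuilds from its node group peer, then watch which start phase it is stuck in from the management client (a sketch; node 3 runs on 192.168.0.156 per the config below):

```shell
# On 192.168.0.156 (node 3): discard local data and rebuild from node 2
ndbd --initial

# From the management host: poll the start phase node 3 is in
ndb_mgm -e "3 STATUS"
```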
Config file:
[NDBD DEFAULT]
NoOfReplicas=2
DataMemory=2048M
IndexMemory=512M
LockPagesInMainMemory=Y
MaxNoOfUniqueHashIndexes=1024
MaxNoOfOrderedIndexes=512
MaxNoOfConcurrentOperations=65535
MaxNoOfAttributes=10000
MaxNoOfTables=512
[MYSQLD DEFAULT]
[NDB_MGMD DEFAULT]
[TCP DEFAULT]
# Management Servers
[NDB_MGMD]
id=1
HostName=192.168.0.155
DataDir=/usr/local/mysql/mysql-cluster
[NDB_MGMD]
id=10
HostName=192.168.0.152
DataDir=/usr/local/mysql/mysql-cluster
# Data Nodes
[NDBD]
id=2
HostName=192.168.0.151
DataDir=/usr/local/mysql/mysql-cluster
[NDBD]
id=3
HostName=192.168.0.156
DataDir=/usr/local/mysql/mysql-cluster
# API Nodes
[MYSQLD]
id=20
HostName=192.168.0.152
[MYSQLD]
id=21
HostName=192.168.0.155
[MYSQLD]
id=22
HostName=192.168.0.151
[MYSQLD]
id=23
HostName=192.168.0.156
[MYSQLD]
[MYSQLD]
[MYSQLD]
[MYSQLD]
[MYSQLD]
NB: Nodes 22 and 23 are not performing queries.
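For what it's worth, here is a rough check of the memory each data node needs under this config (it ignores operation records, send/receive buffers, and other overhead, so the real figure is higher):

```shell
# DataMemory + IndexMemory per data node, from [NDBD DEFAULT] above
data_memory=2048   # MB
index_memory=512   # MB
echo "per-node minimum: $((data_memory + index_memory)) MB"
# prints "per-node minimum: 2560 MB"
```

Since LockPagesInMainMemory is set, this memory is locked at startup, so each data-node host needs comfortably more than 2.5 GB of RAM free; swapping or overload on the host is exactly what the error 6050 watchdog message hints at.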
Could you tell me if I missed something?
Thanks and regards,