[Data Loss] Cluster nodes have different data
Posted by: eric
Date: October 13, 2005 05:16PM
I'm running MySQL 4.1.12-max on SuSE Linux 9.2. I have a four-node cluster: three data (NDB) nodes and one management (MGM) node. The MGM node is node 1, the NDB nodes are 2, 3, and 4, and there are 9 API node slots (5-13). Today the following sequence of events occurred, based on the log file on the MGM node (full logs available on request):
MGM node detects NDB node 2 disconnection
MGM node arbitrates and selects NDB node 3 as new master
MGM node detects NDB node 3 disconnection
MGM node arbitrates and selects NDB node 4 as new master
NDB node 4 starts taking over (table fragments scroll by with LcpStatus 3)
NDB node 2 reconnects and starts up, with CM_president = 4, own Node = 2, our dynamic id = 4
NDB node 3 reconnects and begins the startup process
NDB node 2 loads all indexes and completes the startup process
NDB node 3 disconnects again
MGM node arbitrates and NDB node 4 wins the election again
NDB node 3 reconnects, loads indexes, and completes startup.
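For anyone trying to follow along or diagnose something similar, this is roughly how I was watching node states while the above happened (a sketch; the log path comes from my DataDir and will differ on other setups):

```shell
# Show which nodes are connected and which data node is currently master
ndb_mgm -e show

# Follow the cluster log on the MGM node to see the disconnect and
# arbitration events as they happen (filename assumes MGM node ID 1;
# the directory depends on DataDir in config.ini)
tail -f /clusterdata/ndb_1_cluster.log
```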
Now, the problems:
NDB nodes 2 and 4 complain every minute that "Failure handling of node 3 has not completed in [n] min."
API nodes are still allowed to write data (INSERT/UPDATE) to the cluster, but some writes are handled by node 3 and others by node 4.
As a result, an API node querying NDB node 4 sees different data than one querying node 3.
Backups at this point are incomplete, even though the logs report them as successful. The strange part is that node 4 produces two backup fragments, node 3 produces one, and node 2 doesn't run the backup at all (no files are generated in /clusterdata/backup, at least).
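In case it helps anyone reproduce the backup problem, this is the kind of thing I was doing to trigger it and to check which nodes actually wrote anything (standard ndb_mgm client syntax; the backup path is from my config and the node ID matches my cluster):

```shell
# Restart the node whose failure handling appears stuck, in the hope of
# clearing the "Failure handling of node 3 has not completed" messages
ndb_mgm -e "3 RESTART"

# Start a backup manually, then check which data nodes wrote fragments
ndb_mgm -e "START BACKUP"
ls -lR /clusterdata/backup/
```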
This occurred on a production system, and we lost an hour's worth of data while I struggled with my disbelief that the cluster had actually become desynchronized.
Has anyone seen this kind of behavior before? What could have caused it? How can I prevent it in the future? We're horrified that production data was lost and would sure like to feel good about not having it happen again.
Thanks in advance for any help.