Data node failure causes all SQL nodes to fail
Posted by: Adam Goss
Date: June 04, 2008 12:11PM
I have been testing MySQL Cluster in the following configuration. I have three test systems, all running CentOS 5.1 and MySQL 5.0.51a Community from the Downloads area (RHEL 5 on x86). One is a management server, and all three run ndbd and mysqld, with NoOfReplicas=1, so the architecture is effectively as follows:
System 1 - Management node, Data Node, SQL Node
System 2 - Data Node, SQL Node
System 3 - Data Node, SQL Node
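For reference, a minimal config.ini on the management node for the layout above might look like this. This is a sketch, not my exact file; the HostName values (system1/system2/system3) are placeholders for my actual hosts.

```ini
# config.ini on the management server (System 1) -- hostnames are placeholders
[ndbd default]
NoOfReplicas=1          # one replica per node group, as described above

[ndb_mgmd]
HostName=system1        # management node

[ndbd]
HostName=system1        # data node on System 1
[ndbd]
HostName=system2        # data node on System 2
[ndbd]
HostName=system3        # data node on System 3

[mysqld]                # one [mysqld] slot per SQL node
[mysqld]
[mysqld]
```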
Given this configuration, I would expect the cluster to be fully fault-tolerant. Loss of any one system (or even two) should still allow the cluster to function, since a data node and an SQL node are always available, and the MySQL 5.0 reference documentation indicates that failure of the management node won't impact a running cluster. I have verified that with all systems running the cluster is functional (i.e. all systems show as connected in ndb_mgm), and that for my clustered database, data is being replicated among all three systems.
When testing a network/hardware failure (i.e. unplugging a network cable), at first the whole cluster would fail, but adding the ndbd option StopOnError=false to the manager's config.ini fixed this. Now when I disconnect a system, the other two data nodes stay active (or occasionally one of the remaining two crashes while the third stays alive), but ALL SQL nodes get disconnected. I have been unable to find a configuration option for mysqld like StopOnError for ndbd. I don't have enough systems at my disposal to test a cluster of two data nodes and two SQL nodes on four systems, so is this a bug, or an artifact of my cluster's configuration? Any input would be very much appreciated.

For those who have questions: the clustered MySQL database is intended as a high-availability backend to a RADIUS implementation (i.e. three systems each running a radiusd instance, with MySQL clustered so that user credential info is up to date on all three). It is not expected to see a massive volume of queries given a user pool of fewer than 250.
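For anyone trying to reproduce this: the StopOnError change I mention goes in the [ndbd default] section of the manager's config.ini, and each SQL node's my.cnf enables the cluster engine and points at the management server. A sketch of both fragments (system1 as the management host is a placeholder for my actual setup):

```ini
# config.ini fragment on the management node
[ndbd default]
NoOfReplicas=1
StopOnError=false       # restart ndbd on error instead of exiting (fixed the whole-cluster failure)

# my.cnf fragment on each SQL node
[mysqld]
ndbcluster                      # enable the NDB storage engine
ndb-connectstring=system1       # management server host (placeholder)
```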