I am about to give up on MySQL Cluster as an HA solution.
The situation:
First we ran into the per-process memory limit of 32-bit systems. A workaround for this is to run more ndbd processes per machine, but when I placed a comment saying so on the 5.0 Cluster FAQ documentation page, it got revoked, because 'Currently we do not recommend running multiple data node processes per machine.'
Hmm... So maybe the fact that the cluster can't recover from killing the ndbd processes on one of our two data node machines is because running it this way is not recommended.
So we went for the 64-bit version of RHEL4AS on two machines. We run one ndbd process per machine, as recommended by the docs@mysql.com team.
We ran a stress test: about 2500 random SELECT, UPDATE, INSERT and DELETE queries per second.
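For anyone who wants to try something similar, here is a minimal sketch of how such a random query mix could be generated. It is not our actual test harness: the table name, the columns and the ID range are made up for illustration, and the statements still have to be sent through whatever MySQL client library you use.

```python
import random

# Hypothetical table, columns and ID range; the real schema is not shown here.
TABLE = "stress_test"
MAX_ID = 100_000


def random_query(rng):
    """Return one (sql, params) pair, picking the statement type at random."""
    row_id = rng.randint(1, MAX_ID)
    payload = rng.randint(0, 2**31 - 1)
    kind = rng.choice(["select", "update", "insert", "delete"])
    if kind == "select":
        return (f"SELECT payload FROM {TABLE} WHERE id = %s", (row_id,))
    if kind == "update":
        return (f"UPDATE {TABLE} SET payload = %s WHERE id = %s", (payload, row_id))
    if kind == "insert":
        # INSERT IGNORE so that a randomly chosen, already existing id
        # does not abort the run with a duplicate-key error.
        return (f"INSERT IGNORE INTO {TABLE} (id, payload) VALUES (%s, %s)", (row_id, payload))
    return (f"DELETE FROM {TABLE} WHERE id = %s", (row_id,))
```

Each pair can be passed to a DB-API style `cursor.execute(sql, params)`; reaching about 2500 queries per second would take several client connections running this in a loop.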
Process to wreck a cluster:
* run this test
* killall -9 ndbd on one of the data nodes
The database keeps running without a hitch.
* start the killed node
The node is stuck in startup phase 5 forever (according to the ndb_mgm node).
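The steps above can be written down as a small script. This is only a sketch under some assumptions: it is run on the data node that gets killed, `ndbd` finds its connect string in my.cnf, and the `ndb_mgm` client is installed on the same machine. By default it only prints the commands, so nothing is killed unless you ask for it.

```python
import shlex
import subprocess

# The kill command is the one from the steps above. Starting "ndbd" without
# arguments assumes the connect string comes from my.cnf. "ALL STATUS" is the
# ndb_mgm client command that reports each node's state and start phase.
STEPS = [
    ("kill the data node processes", "killall -9 ndbd"),
    ("start the killed node again", "ndbd"),
    ("check node status", 'ndb_mgm -e "ALL STATUS"'),
]


def reproduce(dry_run=True):
    """Print each step and run it only when dry_run is False."""
    commands = []
    for label, command in STEPS:
        argv = shlex.split(command)
        print(f"# {label}\n{command}")
        if not dry_run:
            # check=False: keep going even if a command exits non-zero.
            subprocess.run(argv, check=False)
        commands.append(argv)
    return commands
```

Run the stress test first, then call `reproduce(dry_run=False)` on the data node and repeat the status check until the node either reports as started or stays in the same phase.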
I can't imagine that nobody has tried something like this before, so why are there no bug reports, workarounds, solutions or caveats to be found?
All in all this cost us a lot, not only in additional hardware (memory, disks) but also in time. The conclusion is that MySQL clustering is a nice idea, but for now we will have to wait until it works and look elsewhere for a stable HA solution in the meantime.