Re: 5.1.6-alpha NDB: Could not get apply status share
Further Update:
In an effort to narrow down the cause of the malfunction, I moved all components of my cluster onto the CentOS4 Xeon machine at 192.168.0.212. I had to wrestle with this beast for a while, but eventually I had the data node 'started'. Amid the confounding factors, I then tried to work out exactly what I had done to make the system work.
<how_I_got_it_working>
I am working out of the precompiled '.tar.gz' for 'mysql-max-5.0.19-linux-x86_64-glibc23'. This lives in /usr/local and is symlinked to /usr/local/mysql. The three components are started from pwd=/usr/local/mysql by calling:
mysqld: support-files/mysql.server start (and stop or restart)
ndbd: bin/ndbd
ndb_mgmd: bin/ndb_mgmd
and I call the clients from there also:
mysql: bin/mysql
ndb_mgm: bin/ndb_mgm
I was using two configuration files, my.cnf and config.ini:
(my.cnf)
<quote>
[mysqld]
skip_innodb
skip_bdb
skip_locking
default-storage-engine=ndbcluster
ndbcluster
#ndb-connectstring=192.168.0.20
ndb-connectstring=192.168.0.212
[mysql_cluster]
#ndb-connectstring=192.168.0.20
ndb-connectstring=192.168.0.212
[ndb_mgmd]
config-file=/usr/local/mysql/config.ini
</quote>
(config.ini)
<quote>
[ndbd default]
noofreplicas=3
[ndb_mgmd]
hostname=192.168.0.212
[ndbd]
hostname=192.168.0.212
datadir=/var/lib/mysql-cluster
[ndbd]
hostname=192.168.0.214
datadir=/var/lib/mysql-cluster
[ndbd]
hostname=192.168.0.216
datadir=/var/lib/mysql-cluster
[mysqld]
</quote>
In this configuration, I saw what looked like the same problem: the data node would get stuck at Phase 2 of starting.
At this point, I believe I did two things: I dropped the number of replicas to one (by deleting the last two [ndbd] sections and setting noofreplicas=1) and I refreshed the ndb_fs by
[root]# killall ndbd
[root]# rm -rf /var/lib/mysql-cluster/ndb_2_fs
[root]# killall ndb_mgmd
[root]# vi config.ini
(apply changes)
[root]# bin/ndb_mgmd
[root]# bin/ndbd --initial
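For reference, the reduced config.ini after these changes should have looked roughly like this (a reconstruction from the edits described above, not a verbatim copy of the file):
<quote>
[ndbd default]
noofreplicas=1
[ndb_mgmd]
hostname=192.168.0.212
[ndbd]
hostname=192.168.0.212
datadir=/var/lib/mysql-cluster
[mysqld]
</quote>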
(It may be relevant that when I did '(apply changes)', I also converted the file from the capitalized form seen in my previous post to the all-lowercase form shown above. Hopefully any effect was not from the capitalization change itself, but because I retyped the whole file line by line and may have eliminated non-printable characters.)
When ndbd came up, I was amazed to see that it started. To complete the system, I immediately tried to start mysqld. I was still unable to create tables, and a 'show errors' also indicated that the NDB cluster had failed (error code 1296, error 4009 from NDB). I remembered that I had added some lines to my.cnf in previous steps while trying to get mysqld working, so I stopped mysql and commented out the three 'skip' lines and the 'default-storage-engine' line.
I brought mysqld back up. This time, it showed in the mgm console that it was connected. Now everything was connected!
To test, I created a table, and it succeeded. I realized that it had gone in as MyISAM ('show table status'), so I altered the table to engine=ndbcluster, and that succeeded as well (again according to 'show table status'). I was able to add rows.
To verify that the config lines were responsible for the sudden success, I brought mysqld down, and uncommented the lines. When I brought the server back up, it still worked! I created a table and added data to it.
I was now able to restart each piece independently, and the overall system returned to this working state and did not report any errors.
</how_I_got_it_working>
<how_I_broke_it_again>
Mystified, I decided to add the FC4 Athlon64 machine at 192.168.0.20 to the cluster. I increased noofreplicas to 2 and added an [ndbd] section for 192.168.0.20 with datadir=/var/lib/mysql-cluster.
The local data node still registered as starting OK. When restarting mysqld, it again reported the same error with NDB cluster, as well as an error 1105: 'Failed to open 'test_table', error while unpacking from engine'.
This could be interpreted to imply that if all of the replicas are not available, then mysqld will not make a satisfactory connection to the mgmd (and thus will not use whatever replicas are available). Is this ordinary behavior?
To try to provide the second replica, I moved on to the Athlon64 machine. I installed the 'MySQL-ndb-storage-5.0.19-0.glibc23.x86_64.rpm' package, and added a few lines to /etc/my.cnf:
<quote>
[mysql_cluster]
ndb-connectstring=192.168.0.212
</quote>
From the Athlon, I ran 'ndbd --initial'. It dumped an error to the console:
<quote>
error=2350
2006-03-29 01:46:52 [ndbd] INFO -- Error handler restarting system
2006-03-29 01:46:52 [ndbd] INFO -- Error handler shutdown completed - exiting
sphase=0
exit=-1
</quote>
Trying to determine what happened, I looked into '/usr/local/mysql/ndb_1_cluster.log' on the Xeon:
<quote>
2006-03-29 01:44:25 [MgmSrvr] INFO -- Node 1: Node 2 Connected
2006-03-29 01:44:26 [MgmSrvr] INFO -- Node 2: Started arbitrator node 1 [ticket=08ec000244f5c3f8]
2006-03-29 01:45:41 [MgmSrvr] WARNING -- Allocate nodeid (0) failed. Connection from ip 192.168.0.20. Returned error string "No free node id found for ndbd(NDB)."
2006-03-29 01:45:41 [MgmSrvr] INFO -- Mgmt server state: node id's 2 3 connected but not reserved
... 4 additional repetitions of WARNING/INFO pattern ...
2006-03-29 01:46:52 [MgmSrvr] WARNING -- Allocate nodeid (0) failed. Connection from ip 192.168.0.20. Returned error string "No free node id found for ndbd(NDB)."
2006-03-29 01:46:52 [MgmSrvr] INFO -- Mgmt server state: node id's 2 3 connected but not reserved
2006-03-29 01:47:26 [MgmSrvr] ALERT -- Node 2: Node 3 Disconnected
2006-03-29 01:47:26 [MgmSrvr] INFO -- Node 2: Communication to Node 3 closed
</quote>
It looked like my new ndbd was trying to request nodeid 0 when it should be assigned nodeid 3. I tried to force ndbd to request the right nodeid from the Athlon command line:
[root]# ndbd --ndb-nodeid=3
It did not dump to the console, but the mgm console had something to say:
<quote>
Node 3: Forced node shutdown completed. Occured during startphase 1. Initiated by signal 0. Caused by error 2302: 'node id in the configuration has the wrong type, (i.e. not an NDB node)(Configuration error). Permanent error, external action needed'.
</quote>
So I looked into the cluster log again, and found more (expected) lines:
<quote>
2006-03-29 01:47:27 [MgmSrvr] INFO -- Mgmt server state: nodeid 4 reserved for ip 192.168.0.212, m_reserved_nodes 0000000000000012.
2006-03-29 01:47:27 [MgmSrvr] INFO -- Node 4: mysqld --server-id=0
2006-03-29 01:47:29 [MgmSrvr] INFO -- Node 2: Communication to Node 3 opened
2006-03-29 01:49:57 [MgmSrvr] INFO -- Mgmt server state: nodeid 3 reserved for ip 192.168.0.20, m_reserved_nodes 000000000000001a.
2006-03-29 01:49:57 [MgmSrvr] INFO -- Node 1: Node 3 Connected
2006-03-29 01:49:57 [MgmSrvr] INFO -- Node 2: Node 3 Connected
2006-03-29 01:49:57 [MgmSrvr] INFO -- Node 3: Node 2 Connected
2006-03-29 01:49:58 [MgmSrvr] INFO -- Node 1: Node 3 Connected
2006-03-29 01:49:58 [MgmSrvr] ALERT -- Node 2: Node 3 Disconnected
2006-03-29 01:49:58 [MgmSrvr] INFO -- Node 2: Communication to Node 3 closed
2006-03-29 01:49:58 [MgmSrvr] INFO -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000000012.
2006-03-29 01:49:58 [MgmSrvr] ALERT -- Node 3: Forced node shutdown completed. Occured during startphase 1. Initiated by signal 0. Caused by error 2302: 'node id in the configuration has the wrong type, (i.e. not an NDB node)(Configuration error). Permanent error, external action needed'.
</quote>
I tried again, but this time without --ndb-nodeid. It did not dump to the console this time (somehow it learned its nodeid...), but I got the same message at the mgm console, and it left a similar pattern in the cluster log.
<messing_around>
Just for kicks, I tried --ndb-nodeid=6. I got the original console dump, and a slightly different (but expected) error in the cluster log:
<quote>
2006-03-29 01:53:09 [MgmSrvr] WARNING -- Allocate nodeid (6) failed. Connection from ip 192.168.0.20. Returned error string "No node defined with id=6 in config file."
</quote>
In case I'd broken something, I gave it --ndb-nodeid=3 again, with the same results as the previous trials at nodeid=3.
</messing_around>
To narrow the failure case, I tried to connect the ndbd without starting it:
[root]# ndbd -n
so far so good:
<quote>
[ndbd(NDB)] 2 node(s)
id=2 @192.168.0.212 (Version: 5.0.19, Nodegroup: 0, Master)
id=3 @192.168.0.20 (Version: 5.0.19, not started)
</quote>
but a little farther, and:
<quote>
ndb_mgm> 3 start
Database node 3 is being started.
ndb_mgm> Node 3: Start initiated (version 5.0.19)
Node 3: Forced node shutdown completed. Occured during startphase 1. Initiated by signal 0. Caused by error 2302: 'node id in the configuration has the wrong type, (i.e. not an NDB node)(Configuration error). Permanent error, external action needed'.
show
Cluster Configuration
---------------------
[ndbd(NDB)] 2 node(s)
id=2 @192.168.0.212 (Version: 5.0.19, Nodegroup: 0, Master)
id=3 (not connected, accepting connect from 192.168.0.20)
[ndb_mgmd(MGM)] 1 node(s)
id=1 @192.168.0.212 (Version: 5.0.19)
[mysqld(API)] 1 node(s)
id=4 (not connected, accepting connect from any host)
ndb_mgm>
</quote>
Narrowing the failure case further, I killed node 2 and started node 3. Node 3 did not bail, but it got stuck in 'starting (Phase 2)'.
I then killed node 3 and tried to start node 2, and now it too is stuck in 'starting (Phase 2)'.
Nothing works again!
</how_I_broke_it_again>
That is where I am at. I am going to try 'refreshing' the ndb_fs and a few other things, and I'll post again.
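Among the 'other things': since error 2302 complains that the slot handed to node 3 is not an NDB slot, I suspect ndb_mgmd may still be serving a stale configuration, so I will restart it after editing config.ini. I may also try pinning explicit node ids in config.ini so there is no ambiguity when ids are allocated. A sketch (untested on my setup; I believe Id= is how 5.0 fixes an id per section, but treat that as an assumption):
<quote>
[ndb_mgmd]
Id=1
hostname=192.168.0.212
[ndbd]
Id=2
hostname=192.168.0.212
datadir=/var/lib/mysql-cluster
[ndbd]
Id=3
hostname=192.168.0.20
datadir=/var/lib/mysql-cluster
[mysqld]
Id=4
</quote>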
In light of this new information, does anyone have any suggestions?