
Re: 5.1.6-alpha NDB: Could not get apply status share
Posted by: jim shnider
Date: March 29, 2006 03:01PM

Further Update:

To narrow down the possible causes of the malfunction, I moved all components of my cluster onto the CentOS4 Xeon machine at 192.168.0.212. I had to wrestle with this beast for a while, but eventually I had the data node 'started'. With several confounding factors still in play, I then tried to determine what I had done to make the system work.

<how_I_got_it_working>

I am working out of the precompiled '.tar.gz' for 'mysql-max-5.0.19-linux-x86_64-glibc23'. This lives in /usr/local and is symlinked to /usr/local/mysql. The three components are started from pwd=/usr/local/mysql by calling:
mysqld: support-files/mysql.server start (and stop or restart)
ndbd: bin/ndbd
ndb_mgmd: bin/ndb_mgmd

and I call the clients from there also:
mysql: bin/mysql
ndb_mgm: bin/ndb_mgm
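
For completeness, the order I bring them up in (the same order that worked later in this post) is the management node first, then the data node, then mysqld, all from pwd=/usr/local/mysql:
[root]# bin/ndb_mgmd
[root]# bin/ndbd
[root]# support-files/mysql.server start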

I was using two configuration files, my.cnf and config.ini:

(my.cnf)
<quote>
[mysqld]
skip_innodb
skip_bdb
skip_locking
default-storage-engine=ndbcluster
ndbcluster
#ndb-connectstring=192.168.0.20
ndb-connectstring=192.168.0.212

[mysql_cluster]
#ndb-connectstring=192.168.0.20
ndb-connectstring=192.168.0.212

[ndb_mgmd]
config-file=/usr/local/mysql/config.ini
</quote>

(config.ini)
<quote>
[ndbd default]
noofreplicas=3
[ndb_mgmd]
hostname=192.168.0.212
[ndbd]
hostname=192.168.0.212
datadir=/var/lib/mysql-cluster
[ndbd]
hostname=192.168.0.214
datadir=/var/lib/mysql-cluster
[ndbd]
hostname=192.168.0.216
datadir=/var/lib/mysql-cluster
[mysqld]
</quote>

In this configuration, I saw what looked like the same problem: the data node would get stuck at Phase 2 of starting.

At this point, I believe I did two things: I dropped the number of replicas to one (by deleting the last two [ndbd] sections and setting noofreplicas=1) and I refreshed the ndb_fs by
[root]# killall ndbd
[root]# rm -rf /var/lib/mysql-cluster/ndb_2_fs
[root]# killall ndb_mgmd
[root]# vi config.ini
(apply changes)
[root]# bin/ndb_mgmd
[root]# bin/ndbd --initial

(It may be relevant that when I did '(apply changes)', I also converted the file from the capitalized form seen in my previous post to the all-lowercase form shown above. This may have helped - hopefully not because of the capitalization change itself, but because I retyped the whole file line by line and may have eliminated non-printable characters...)
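
For reference, after those edits config.ini was down to roughly this (a sketch of the single-replica version, reconstructed from the changes described above):
<quote>
[ndbd default]
noofreplicas=1
[ndb_mgmd]
hostname=192.168.0.212
[ndbd]
hostname=192.168.0.212
datadir=/var/lib/mysql-cluster
[mysqld]
</quote>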

When ndbd came up, I was amazed to see that it started. To complete the system, I immediately tried to start mysqld. I was still unable to create tables, and 'show errors' indicated that the NDB cluster had failed (error code 1296, error 4009 from NDB). I remembered that I had added some lines to my.cnf in earlier steps to try to get mysqld working. I stopped mysqld and commented out the three 'skip' lines and the 'default-storage-engine' line.
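
In other words, the [mysqld] section of my.cnf ended up looking roughly like this (the rest of the file unchanged):
<quote>
[mysqld]
#skip_innodb
#skip_bdb
#skip_locking
#default-storage-engine=ndbcluster
ndbcluster
ndb-connectstring=192.168.0.212
</quote>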

I brought mysqld back up. This time, it showed in the mgm console that it was connected. Now everything was connected!

To test, I created a table, and it succeeded. I realized that it had gone in as MyISAM (show table status), so I altered the table to engine=ndbcluster, and that succeeded (again according to 'show table status'). I was able to add rows.

To verify that the config lines were responsible for the sudden success, I brought mysqld down, and uncommented the lines. When I brought the server back up, it still worked! I created a table and added data to it.

I was now able to restart each piece independently, and the overall system returned to this working state and did not report any errors.

</how_I_got_it_working>

<how_I_broke_it_again>

Mystified, I decided to add the FC4 Athlon64 machine at 192.168.0.20 to the cluster. I increased noofreplicas to 2 and added an [ndbd] section for 192.168.0.20 with datadir=/var/lib/mysql-cluster.
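
So config.ini now looked roughly like this:
<quote>
[ndbd default]
noofreplicas=2
[ndb_mgmd]
hostname=192.168.0.212
[ndbd]
hostname=192.168.0.212
datadir=/var/lib/mysql-cluster
[ndbd]
hostname=192.168.0.20
datadir=/var/lib/mysql-cluster
[mysqld]
</quote>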

The local data node still registered as starting OK. When I restarted mysqld, it again reported the same NDB cluster error, as well as error 1105: 'Failed to open 'test_table', error while unpacking from engine'.

This could be interpreted to mean that if not all of the replicas are available, then mysqld will not make a satisfactory connection to the mgmd (and thus will not use the replicas that are available). Is this ordinary behavior?

To try to provide the second replica, I moved on to the Athlon64 machine. I installed the 'MySQL-ndb-storage-5.0.19-0.glibc23.x86_64.rpm' package, and added a few lines to /etc/my.cnf:
<quote>
[mysql_cluster]
ndb-connectstring=192.168.0.212
</quote>

From the Athlon, I ran 'ndbd --initial'. It dumped an error to the console:
<quote>
error=2350
2006-03-29 01:46:52 [ndbd] INFO -- Error handler restarting system
2006-03-29 01:46:52 [ndbd] INFO -- Error handler shutdown completed - exiting
sphase=0
exit=-1
</quote>

Trying to determine what happened, I looked into '/usr/local/mysql/ndb_1_cluster.log' on the Xeon:
<quote>
2006-03-29 01:44:25 [MgmSrvr] INFO -- Node 1: Node 2 Connected
2006-03-29 01:44:26 [MgmSrvr] INFO -- Node 2: Started arbitrator node 1 [ticket=08ec000244f5c3f8]
2006-03-29 01:45:41 [MgmSrvr] WARNING -- Allocate nodeid (0) failed. Connection from ip 192.168.0.20. Returned error string "No free node id found for ndbd(NDB)."
2006-03-29 01:45:41 [MgmSrvr] INFO -- Mgmt server state: node id's 2 3 connected but not reserved

... 4 additional repetitions of the WARNING/INFO pattern ...

2006-03-29 01:46:52 [MgmSrvr] WARNING -- Allocate nodeid (0) failed. Connection from ip 192.168.0.20. Returned error string "No free node id found for ndbd(NDB)."
2006-03-29 01:46:52 [MgmSrvr] INFO -- Mgmt server state: node id's 2 3 connected but not reserved
2006-03-29 01:47:26 [MgmSrvr] ALERT -- Node 2: Node 3 Disconnected
2006-03-29 01:47:26 [MgmSrvr] INFO -- Node 2: Communication to Node 3 closed
</quote>

It looked like my new ndbd was trying to request nodeid 0 when it should be assigned nodeid 3. I tried to force ndbd to request the right nodeid from the Athlon command line:
[root]# ndbd --ndb-nodeid=3

It did not dump to the console, but the mgm console had something to say:
<quote>
Node 3: Forced node shutdown completed. Occured during startphase 1. Initiated by signal 0. Caused by error 2302: 'node id in the configuration has the wrong type, (i.e. not an NDB node)(Configuration error). Permanent error, external action needed'.
</quote>

so I looked into the cluster log again, and found more (expected) lines:
<quote>
2006-03-29 01:47:27 [MgmSrvr] INFO -- Mgmt server state: nodeid 4 reserved for ip 192.168.0.212, m_reserved_nodes 0000000000000012.
2006-03-29 01:47:27 [MgmSrvr] INFO -- Node 4: mysqld --server-id=0
2006-03-29 01:47:29 [MgmSrvr] INFO -- Node 2: Communication to Node 3 opened
2006-03-29 01:49:57 [MgmSrvr] INFO -- Mgmt server state: nodeid 3 reserved for ip 192.168.0.20, m_reserved_nodes 000000000000001a.
2006-03-29 01:49:57 [MgmSrvr] INFO -- Node 1: Node 3 Connected
2006-03-29 01:49:57 [MgmSrvr] INFO -- Node 2: Node 3 Connected
2006-03-29 01:49:57 [MgmSrvr] INFO -- Node 3: Node 2 Connected
2006-03-29 01:49:58 [MgmSrvr] INFO -- Node 1: Node 3 Connected
2006-03-29 01:49:58 [MgmSrvr] ALERT -- Node 2: Node 3 Disconnected
2006-03-29 01:49:58 [MgmSrvr] INFO -- Node 2: Communication to Node 3 closed
2006-03-29 01:49:58 [MgmSrvr] INFO -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000000012.
2006-03-29 01:49:58 [MgmSrvr] ALERT -- Node 3: Forced node shutdown completed. Occured during startphase 1. Initiated by signal 0. Caused by error 2302: 'node id in the configuration has the wrong type, (i.e. not an NDB node)(Configuration error). Permanent error, external action needed'.
</quote>

I tried again, but this time without --ndb-nodeid. It did not dump to the console this time (somehow it learned its nodeid...), but I got the same message at the mgm console, and it left a similar pattern in the cluster log.
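
One thing I have not tried yet is pinning the node ids explicitly in config.ini, on the guess that error 2302 means the id the new ndbd ends up with is mapped to a non-NDB slot. Something along these lines (the explicit id= lines are my assumption, numbered to match what the logs show; I have not verified that this changes anything):
<quote>
[ndbd default]
noofreplicas=2
[ndb_mgmd]
id=1
hostname=192.168.0.212
[ndbd]
id=2
hostname=192.168.0.212
datadir=/var/lib/mysql-cluster
[ndbd]
id=3
hostname=192.168.0.20
datadir=/var/lib/mysql-cluster
[mysqld]
id=4
</quote>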

<messing_around>

Just for kicks, I tried --ndb-nodeid=6. I got the original console dump, and a slightly different (but expected) error in the cluster log:
<quote>
2006-03-29 01:53:09 [MgmSrvr] WARNING -- Allocate nodeid (6) failed. Connection from ip 192.168.0.20. Returned error string "No node defined with id=6 in config file."
</quote>

In case I'd broken something, I gave it --ndb-nodeid=3 again, with the same results as the previous trials at nodeid=3.

</messing_around>

To narrow the failure case, I tried to connect but not start the ndbd:
[root]# ndbd -n

so far so good:
<quote>
[ndbd(NDB)] 2 node(s)
id=2 @192.168.0.212 (Version: 5.0.19, Nodegroup: 0, Master)
id=3 @192.168.0.20 (Version: 5.0.19, not started)
</quote>

but a little further along:
<quote>
ndb_mgm> 3 start
Database node 3 is being started.

ndb_mgm> Node 3: Start initiated (version 5.0.19)
Node 3: Forced node shutdown completed. Occured during startphase 1. Initiated by signal 0. Caused by error 2302: 'node id in the configuration has the wrong type, (i.e. not an NDB node)(Configuration error). Permanent error, external action needed'.
show
Cluster Configuration
---------------------
[ndbd(NDB)] 2 node(s)
id=2 @192.168.0.212 (Version: 5.0.19, Nodegroup: 0, Master)
id=3 (not connected, accepting connect from 192.168.0.20)

[ndb_mgmd(MGM)] 1 node(s)
id=1 @192.168.0.212 (Version: 5.0.19)

[mysqld(API)] 1 node(s)
id=4 (not connected, accepting connect from any host)

ndb_mgm>
</quote>

Narrowing the failure case, I killed node 2, and started node 3. Node 3 did not bail, but it got stuck in 'starting (Phase 2)'.

I killed node 3 again and tried to start node 2, and now it is stuck in 'starting (Phase 2)'.

Nothing works again!

</how_I_broke_it_again>

That is where I am at. I am going to try 'refreshing' the ndb_fs and a few other things, and I'll post again.
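
For the record, the 'refresh' plan is roughly the same procedure as before, but applied to both data nodes this time (the wildcard in the ndb_*_fs path is just so I don't have to guess each node's filesystem directory name):
[root]# killall ndbd
(on each data node)
[root]# killall ndb_mgmd
[root]# rm -rf /var/lib/mysql-cluster/ndb_*_fs
(on each data node)
[root]# bin/ndb_mgmd
[root]# bin/ndbd --initial
(on each data node)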

In light of this new information, does anyone have any suggestions?
