
Cluster node lost connection
Posted by: Guus De Graeve
Date: February 24, 2009 07:06AM

We're having trouble keeping our cluster up and running for longer than 5 minutes.

If we start the management application and our nodes, they eventually connect, although it takes more than 10 minutes to start up.

Once all nodes are connected to the management application, the cluster keeps running for a few minutes, but then our first node crashes for some reason.

This is our cluster config.ini:
[NDBD DEFAULT]
NoOfReplicas=2

DataMemory=1536M
IndexMemory=512M
LockPagesInMainMemory=2
MaxNoOfConcurrentTransactions = 500
MaxNoOfConcurrentOperations = 250000
MaxNoOfOrderedIndexes = 27000
MaxNoOfTables = 9000
MaxNoOfAttributes = 25000

# For DataMemory and IndexMemory, we have used the
# default values. Since the "world" database takes up
# only about 500KB, this should be more than enough for
# this example Cluster setup.
[MYSQLD DEFAULT]
[NDB_MGMD DEFAULT]
[TCP DEFAULT]
# Section for the cluster management node
[NDB_MGMD]
# IP address of the management node (this system)
HostName=xx.xx.xx.2
# Section for the storage nodes
[NDBD]
# IP address of the first storage node
HostName=xx.xx.xx.18
DataDir=/usr/local/mysql/var/mysql-cluster
BackupDataDir=/usr/local/mysql/var/mysql-cluster/backup

[NDBD]
# IP address of the second storage node
HostName=xx.xx.xx.26
DataDir=/usr/local/mysql/var/mysql-cluster
BackupDataDir=/usr/local/mysql/var/mysql-cluster/backup

# one [MYSQLD] per storage node
[MYSQLD]
[MYSQLD]
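
For reference, here is a minimal sketch that adds up the DataMemory and IndexMemory values from the [NDBD DEFAULT] section above, to show how much RAM each data node is asked to reserve for them. It assumes the sizes are whole numbers with an optional K, M or G suffix, as in this config; `parse_size` and `read_settings` are hypothetical helper names, and the total does not include the other buffers ndbd allocates.

```python
import re

# Only the two memory settings from the [NDBD DEFAULT] section of the config above.
CONFIG_SNIPPET = """
[NDBD DEFAULT]
DataMemory=1536M
IndexMemory=512M
"""

# Assumption: suffixes are binary multiples (K = 1024 bytes, and so on).
UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert a size such as '1536M' into a number of bytes."""
    match = re.fullmatch(r"(\d+)([KMG]?)", text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number, unit = match.groups()
    return int(number) * UNITS[unit]


def read_settings(config_text):
    """Collect key=value pairs, skipping blanks, comments and section headers."""
    settings = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("["):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


settings = read_settings(CONFIG_SNIPPET)
total = parse_size(settings["DataMemory"]) + parse_size(settings["IndexMemory"])
print(total // 1024 ** 2, "MB per data node for DataMemory + IndexMemory")
```

With the values above this comes to 2048 MB per data node, before any other ndbd buffers are counted.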

These are some lines from the cluster log file when the cluster disconnects:
2009-02-24 13:42:29 [MgmSrvr] WARNING  -- Node 3: Node 4 missed heartbeat 2
2009-02-24 13:42:30 [MgmSrvr] WARNING  -- Node 3: Failure handling of node 2 has not completed in 1 min. - state = 3
2009-02-24 13:42:35 [MgmSrvr] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
2009-02-24 13:42:38 [MgmSrvr] INFO     -- Node 3: Communication to Node 2 opened
2009-02-24 13:43:51 [MgmSrvr] WARNING  -- Node 3: Node 4 missed heartbeat 2
...
...
...
2009-02-24 13:52:44 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2009-02-24 13:52:45 [MgmSrvr] ALERT    -- Node 2: Forced node shutdown completed. Occured during startphase 0. Initiated by signal 11. Caused by error 6050: 'WatchDog terminate, internal error or massive overload on the machine running this node(Internal error, programming error or missing error
2009-02-24 13:52:45 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2009-02-24 13:52:45 [MgmSrvr] ALERT    -- Node 3: Node 4 Disconnected
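
To pull the forced shutdowns out of a longer cluster log, something like the following sketch could be used. The line layout in the regular expression is inferred only from the lines quoted above, not from any documented format, and `forced_shutdowns` is a hypothetical helper name. The two ALERT lines in the sample are cut off right after the error code.

```python
import re

# Sample lines copied from the cluster log above; the ALERT messages are
# truncated after the error number to keep the example short.
LOG_LINES = [
    "2009-02-24 13:42:29 [MgmSrvr] WARNING  -- Node 3: Node 4 missed heartbeat 2",
    "2009-02-24 13:42:35 [MgmSrvr] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 2305",
    "2009-02-24 13:42:38 [MgmSrvr] INFO     -- Node 3: Communication to Node 2 opened",
    "2009-02-24 13:52:44 [MgmSrvr] INFO     -- Node 1: Node 2 Connected",
    "2009-02-24 13:52:45 [MgmSrvr] ALERT    -- Node 2: Forced node shutdown completed. "
    "Occured during startphase 0. Initiated by signal 11. Caused by error 6050",
]

# Assumed layout: "<date> <time> [<source>] <SEVERITY>  -- Node <id>: <message>"
LINE_RE = re.compile(r"^(\S+ \S+) \[(\w+)\] (\w+)\s+-- Node (\d+): (.*)$")
ERROR_RE = re.compile(r"Caused by error (\d+)")


def forced_shutdowns(lines):
    """Return (timestamp, reporting node id, error code) for each ALERT with an error code."""
    found = []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # line does not follow the assumed layout
        timestamp, _source, severity, node, message = match.groups()
        error = ERROR_RE.search(message)
        if severity == "ALERT" and error:
            found.append((timestamp, int(node), int(error.group(1))))
    return found


shutdowns = forced_shutdowns(LOG_LINES)
for timestamp, node, code in shutdowns:
    print(f"{timestamp}: node {node} shut down with error {code}")
```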

And this is the content of the node's error log file (the timestamps do not match the cluster log, but these are the corresponding entries):
Current byte-offset of file-pointer is: 1067


Time: Tuesday 24 February 2009 - 05:25:11
Status: Temporary error, restart node
Message: Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s) (Arbitration error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 4917) 0x0000000a
Program: /usr/local/mysql/libexec/ndbd
Pid: 2055
Trace: /usr/local/mysql/var/mysql-cluster/ndb_2_trace.log.1
Version: Version 5.1.30
***EOM***

Time: Tuesday 24 February 2009 - 05:34:16
Status: Temporary error, restart node
Message: WatchDog terminate, internal error or massive overload on the machine running this node (Internal error, programming error or missing error message, please report a bug)
Error: 6050
Error data: Allocating memory
Error object: WatchDog.cpp
Program: /usr/local/mysql/libexec/ndbd
Pid: 3016
Trace: /usr/local/mysql/var/mysql-cluster/ndb_2_trace.log.2
Version: Version 5.1.30
***EOM***
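
The error log entries above can be split into records for easier comparison, as in this rough sketch. It assumes each record is a run of "Key: value" lines closed by a `***EOM***` line, which is only what the two entries above suggest; `parse_records` is a hypothetical helper, and the sample keeps a subset of the fields.

```python
# A subset of the fields from the two error log entries above.
ERROR_LOG = """\
Time: Tuesday 24 February 2009 - 05:25:11
Status: Temporary error, restart node
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 4917) 0x0000000a
***EOM***

Time: Tuesday 24 February 2009 - 05:34:16
Status: Temporary error, restart node
Error: 6050
Error data: Allocating memory
Error object: WatchDog.cpp
***EOM***
"""


def parse_records(text):
    """Split the log into one dict per record, using ***EOM*** as the terminator."""
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if line == "***EOM***":
            if current:
                records.append(current)
            current = {}
        elif ": " in line:
            # Split on the first ": " only, so values may contain colons.
            key, value = line.split(": ", 1)
            current[key] = value
    return records


records = parse_records(ERROR_LOG)
for record in records:
    print(record["Time"], "| error", record["Error"], "|", record["Error data"])
```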

Does anyone have an idea of what I should do to fix this issue? I've tried so many things, but now I'm desperate...

Thanks!
