MySQL Forums

NDB cluster failure
Posted by: smriti .
Date: March 29, 2012 08:31AM

We have an NDB Cluster setup with replication from Prod to DR. We keep running into cluster failures, and after each failure replication breaks.
From our investigation, CPU utilization appears to climb before each failure; the data nodes then miss heartbeats and the cluster goes down.

At the time of the cluster failure, the following logs were generated:

=============================================================================

Time: Thursday 29 March 2012 - 07:58:52
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug)
Error: 6000
Error data: Signal 6 received; Aborted
Error object: ndbd.cpp
Program: /usr/local/mysql/mysql/bin//ndbmtd
Pid: 30550 thr: 1
Version: mysql-5.1.44 ndb-7.1.3
Trace: /mysql/data/mysqlcluster//ndb_3_trace.log.20 /mysql/data/mysqlcluster//ndb_3_trace.log.20_t1 /mysql/data/mysqlcluster//ndb_3_trac


Prod error log
===============================================================

120329 7:58:52 [Note] NDB Binlog: Node: 3, down, Subscriber bitmask 00
120329 7:58:52 [Note] NDB Binlog: Node: 4, down, Subscriber bitmask 00
120329 7:58:52 [Note] NDB Binlog: cluster failure for ./mysql/ndb_schema at epoch 50814438/0.
120329 7:58:52 [Note] NDB Binlog: ndb tables initially read only on reconnect.
120329 7:58:52 [ERROR] /usr/local/mysql/mysql/bin//mysqld: Incorrect information in file: './proddbconsum/CustomCoBrandingMappings.frm'
120329 7:58:52 [ERROR] /usr/local/mysql/mysql/bin//mysqld: Incorrect information in file: './proddbconsum/CustomCoBrandingBrands.frm'

Prod ndb_out.log
==========================================================

2012-03-29 07:58:17 [ndbd] WARNING -- Ndb kernel thread 4 is stuck in: Performing Receive elapsed=102
2012-03-29 07:58:17 [ndbd] INFO -- Watchdog: User time: 6790573 System time: 285716
2012-03-29 07:58:17 [ndbd] WARNING -- Ndb kernel thread 3 is stuck in: Job Handling elapsed=102
2012-03-29 07:58:17 [ndbd] INFO -- Watchdog: User time: 6790573 System time: 285716
WARNING: timerHandlingLab now: 53811545092 sent: 53811544897 diff: 195
WARNING: timerHandlingLab now: 53811570675 sent: 53811570571 diff: 104
WARNING: timerHandlingLab now: 53811570931 sent: 53811570764 diff: 167
WARNING: timerHandlingLab now: 53811573783 sent: 53811573656 diff: 127
WARNING: timerHandlingLab now: 53811574105 sent: 53811574030 diff: 75
WARNING: timerHandlingLab now: 53811574567 sent: 53811574401 diff: 166
saving 0x2aaaab060000 at 0xa2a790 (1)
saving 0x2aaadb2b8000 at 0xa2a798 (2)
2012-03-29 07:58:52 [ndbd] INFO -- Received signal 6. Running error handler.
2012-03-29 07:58:52 [ndbd] INFO -- Signal 6 received; Aborted
2012-03-29 07:58:52 [ndbd] INFO -- ndbd.cpp
2012-03-29 07:58:52 [ndbd] INFO -- Error handler signal shutting down system
2012-03-29 07:58:52 [ndbd] INFO -- Error handler shutdown completed - exiting
2012-03-29 07:58:53 [ndbd] ALERT -- Node 3: Forced node shutdown completed. Initiated by signal 6. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2012-03-29 08:32:23 [ndbd] INFO -- Angel pid: 32471 ndb pid: 32472
NDBMT: MaxNoOfExecutionThreads=4
NDBMT: workers=2 threads=2
2012-03-29 08:32:23 [ndbd] INFO -- NDB Cluster -- DB node 3
2012-03-29 08:32:23 [ndbd] INFO -- mysql-5.1.44 ndb-7.1.3 --
2012-03-29 08:32:23 [ndbd] INFO -- WatchDog timer is set to 6000 ms
2012-03-29 08:32:23 [ndbd] INFO -- Ndbd_mem_manager::init(1) min: 4951Mb initial: 4971Mb
Adding 4972Mb to ZONE_LO (1,159087)
NDBMT: num_threads=5
thr: 4 tid: 32472 CMVMI(0)

Prod management ndb_cluster.log
========================================

2012-03-29 07:58:42 [MgmtSrvr] INFO -- Node 3: Index usage is 10%(8995 8K pages of total 85952)
2012-03-29 07:58:50 [MgmtSrvr] WARNING -- Node 3: Node 4 missed heartbeat 2
2012-03-29 07:58:52 [MgmtSrvr] WARNING -- Node 3: Node 4 missed heartbeat 3
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 4: Data usage is 14%(18858 32K pages of total 131072)
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 4: Index usage is 10%(8995 8K pages of total 85952)
2012-03-29 07:58:52 [MgmtSrvr] ALERT -- Node 4: Node 7 Disconnected
2012-03-29 07:58:52 [MgmtSrvr] ALERT -- Node 4: Node 8 Disconnected
2012-03-29 07:58:52 [MgmtSrvr] ALERT -- Node 4: Node 10 Disconnected
2012-03-29 07:58:52 [MgmtSrvr] ALERT -- Node 4: Node 12 Disconnected
2012-03-29 07:58:52 [MgmtSrvr] ALERT -- Node 4: Node 13 Disconnected
2012-03-29 07:58:52 [MgmtSrvr] ALERT -- Node 4: Node 14 Disconnected
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 4: Communication to Node 7 closed
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 4: Communication to Node 8 closed
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 4: Communication to Node 10 closed
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 4: Communication to Node 12 closed
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 4: Communication to Node 13 closed
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 4: Communication to Node 14 closed
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 3: Communication to Node 7 closed
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 3: Communication to Node 8 closed
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 3: Communication to Node 10 closed
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 3: Communication to Node 12 closed
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 3: Communication to Node 13 closed
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 3: Communication to Node 14 closed
2012-03-29 07:58:53 [MgmtSrvr] ALERT -- Node 1: Node 3 Disconnected
2012-03-29 07:58:53 [MgmtSrvr] ALERT -- Node 3: Forced node shutdown completed. Initiated by signal 6. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2012-03-29 07:58:53 [MgmtSrvr] ALERT -- Node 4: Forced node shutdown completed. Initiated by signal 6. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2012-03-29 07:58:53 [MgmtSrvr] ALERT -- Node 1: Node 4 Disconnected
========================================================================
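
To line up the watchdog warnings in ndb_out.log with the missed heartbeats in ndb_cluster.log, something like the following could be used. This is only a minimal sketch: the regular expressions are assumptions derived from the lines quoted above (not from any documented log format), and `summarize` and `SAMPLE` are hypothetical names introduced for illustration.

```python
import re

# Assumption: these patterns match only the line shapes quoted in this post.
HEARTBEAT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[MgmtSrvr\] WARNING -- "
    r"Node (?P<reporter>\d+): Node (?P<missed>\d+) missed heartbeat (?P<count>\d+)"
)
WATCHDOG_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[ndbd\] WARNING -- "
    r"Ndb kernel thread (?P<thread>\d+) is stuck in: "
    r"(?P<state>.+?) elapsed=(?P<elapsed>\d+)"
)


def summarize(lines):
    """Return (heartbeat_events, watchdog_events) found in an iterable of log lines."""
    heartbeats, watchdogs = [], []
    for line in lines:
        line = line.strip()
        m = HEARTBEAT_RE.match(line)
        if m:
            # (timestamp, reporting node, node that missed, heartbeat count)
            heartbeats.append(
                (m["ts"], int(m["reporter"]), int(m["missed"]), int(m["count"]))
            )
            continue
        m = WATCHDOG_RE.match(line)
        if m:
            # (timestamp, kernel thread, state it is stuck in, elapsed value)
            watchdogs.append(
                (m["ts"], int(m["thread"]), m["state"], int(m["elapsed"]))
            )
    return heartbeats, watchdogs


# Lines copied from the logs above, used here as sample input.
SAMPLE = [
    "2012-03-29 07:58:17 [ndbd] WARNING -- Ndb kernel thread 4 is stuck in: Performing Receive elapsed=102",
    "2012-03-29 07:58:42 [MgmtSrvr] INFO -- Node 3: Index usage is 10%(8995 8K pages of total 85952)",
    "2012-03-29 07:58:50 [MgmtSrvr] WARNING -- Node 3: Node 4 missed heartbeat 2",
    "2012-03-29 07:58:52 [MgmtSrvr] WARNING -- Node 3: Node 4 missed heartbeat 3",
]

if __name__ == "__main__":
    hb, wd = summarize(SAMPLE)
    for event in sorted(hb + wd):
        print(event)
```

In practice the lines of both files would be passed to `summarize` instead of `SAMPLE`; sorting the combined events by timestamp shows whether the "stuck" warnings consistently precede the missed heartbeats.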

Apart from high CPU utilization, what else could be causing this problem, and how can it be fixed?

