MySQL :: NDB cluster failure


NDB cluster failure

Posted by: smriti .
Date: March 29, 2012 08:31AM

We have an NDB Cluster setup with replication from Prod to DR. We keep hitting cluster failures, and replication breaks after each one.
According to our investigation, CPU utilization goes high before each failure; the data nodes then miss heartbeats and the cluster goes down.

At the time of cluster failure the following logs are generated:

=============================================================================

Time: Thursday 29 March 2012 - 07:58:52
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug)
Error: 6000
Error data: Signal 6 received; Aborted
Error object: ndbd.cpp
Program: /usr/local/mysql/mysql/bin//ndbmtd
Pid: 30550 thr: 1
Version: mysql-5.1.44 ndb-7.1.3
Trace: /mysql/data/mysqlcluster//ndb_3_trace.log.20 /mysql/data/mysqlcluster//ndb_3_trace.log.20_t1 /mysql/data/mysqlcluster//ndb_3_trac

Prod error log
===============================================================

120329 7:58:52 [Note] NDB Binlog: Node: 3, down, Subscriber bitmask 00
120329 7:58:52 [Note] NDB Binlog: Node: 4, down, Subscriber bitmask 00
120329 7:58:52 [Note] NDB Binlog: cluster failure for ./mysql/ndb_schema at epoch 50814438/0.
120329 7:58:52 [Note] NDB Binlog: ndb tables initially read only on reconnect.
120329 7:58:52 [ERROR] /usr/local/mysql/mysql/bin//mysqld: Incorrect information in file: './proddbconsum/CustomCoBrandingMappings.frm'
120329 7:58:52 [ERROR] /usr/local/mysql/mysql/bin//mysqld: Incorrect information in file: './proddbconsum/CustomCoBrandingBrands.frm'

Prod ndb_out.log
==========================================================

2012-03-29 07:58:17 [ndbd] WARNING -- Ndb kernel thread 4 is stuck in: Performing Receive elapsed=102
2012-03-29 07:58:17 [ndbd] INFO -- Watchdog: User time: 6790573 System time: 285716
2012-03-29 07:58:17 [ndbd] WARNING -- Ndb kernel thread 3 is stuck in: Job Handling elapsed=102
2012-03-29 07:58:17 [ndbd] INFO -- Watchdog: User time: 6790573 System time: 285716
WARNING: timerHandlingLab now: 53811545092 sent: 53811544897 diff: 195
WARNING: timerHandlingLab now: 53811570675 sent: 53811570571 diff: 104
WARNING: timerHandlingLab now: 53811570931 sent: 53811570764 diff: 167
WARNING: timerHandlingLab now: 53811573783 sent: 53811573656 diff: 127
WARNING: timerHandlingLab now: 53811574105 sent: 53811574030 diff: 75
WARNING: timerHandlingLab now: 53811574567 sent: 53811574401 diff: 166
saving 0x2aaaab060000 at 0xa2a790 (1)
saving 0x2aaadb2b8000 at 0xa2a798 (2)
2012-03-29 07:58:52 [ndbd] INFO -- Received signal 6. Running error handler.
2012-03-29 07:58:52 [ndbd] INFO -- Signal 6 received; Aborted
2012-03-29 07:58:52 [ndbd] INFO -- ndbd.cpp
2012-03-29 07:58:52 [ndbd] INFO -- Error handler signal shutting down system
2012-03-29 07:58:52 [ndbd] INFO -- Error handler shutdown completed - exiting
2012-03-29 07:58:53 [ndbd] ALERT -- Node 3: Forced node shutdown completed. Initiated by signal 6. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2012-03-29 08:32:23 [ndbd] INFO -- Angel pid: 32471 ndb pid: 32472
NDBMT: MaxNoOfExecutionThreads=4
NDBMT: workers=2 threads=2
2012-03-29 08:32:23 [ndbd] INFO -- NDB Cluster -- DB node 3
2012-03-29 08:32:23 [ndbd] INFO -- mysql-5.1.44 ndb-7.1.3 --
2012-03-29 08:32:23 [ndbd] INFO -- WatchDog timer is set to 6000 ms
2012-03-29 08:32:23 [ndbd] INFO -- Ndbd_mem_manager::init(1) min: 4951Mb initial: 4971Mb
Adding 4972Mb to ZONE_LO (1,159087)
NDBMT: num_threads=5
thr: 4 tid: 32472 CMVMI(0)
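
To make the stall pattern in ndb_out.log easier to spot, a rough Python sketch like the one below could pull out the watchdog "stuck in" warnings. The regex is only inferred from the lines pasted above, and the function name `find_stuck_threads` is made up for illustration; this is not an NDB tool or an official log format.

```python
import re

# Pattern assumed from the ndb_out.log lines in this post; not an official format spec.
STUCK_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+\[ndbd\]\s+WARNING\s+--\s+"
    r"Ndb kernel thread (?P<thread>\d+) is stuck in: (?P<state>.+?) elapsed=(?P<elapsed>\d+)$"
)


def find_stuck_threads(lines):
    """Return (timestamp, thread, state, elapsed) for each watchdog 'stuck' warning."""
    hits = []
    for line in lines:
        m = STUCK_RE.match(line.strip())
        if m:
            hits.append((m["ts"], int(m["thread"]), m["state"], int(m["elapsed"])))
    return hits


# Three lines copied from the log above; the INFO line is skipped by the pattern.
sample = [
    "2012-03-29 07:58:17 [ndbd] WARNING -- Ndb kernel thread 4 is stuck in: Performing Receive elapsed=102",
    "2012-03-29 07:58:17 [ndbd] INFO -- Watchdog: User time: 6790573 System time: 285716",
    "2012-03-29 07:58:17 [ndbd] WARNING -- Ndb kernel thread 3 is stuck in: Job Handling elapsed=102",
]

for hit in find_stuck_threads(sample):
    print(hit)
```

Run over the full file, this would show which kernel threads stall and in which state, which could then be compared against CPU graphs for the same timestamps.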

Prod management ndb_cluster.log
========================================

2012-03-29 07:58:42 [MgmtSrvr] INFO -- Node 3: Index usage is 10%(8995 8K pages of total 85952)
2012-03-29 07:58:50 [MgmtSrvr] WARNING -- Node 3: Node 4 missed heartbeat 2
2012-03-29 07:58:52 [MgmtSrvr] WARNING -- Node 3: Node 4 missed heartbeat 3
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 4: Data usage is 14%(18858 32K pages of total 131072)
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 4: Index usage is 10%(8995 8K pages of total 85952)
2012-03-29 07:58:52 [MgmtSrvr] ALERT -- Node 4: Node 7 Disconnected
2012-03-29 07:58:52 [MgmtSrvr] ALERT -- Node 4: Node 8 Disconnected
2012-03-29 07:58:52 [MgmtSrvr] ALERT -- Node 4: Node 10 Disconnected
2012-03-29 07:58:52 [MgmtSrvr] ALERT -- Node 4: Node 12 Disconnected
2012-03-29 07:58:52 [MgmtSrvr] ALERT -- Node 4: Node 13 Disconnected
2012-03-29 07:58:52 [MgmtSrvr] ALERT -- Node 4: Node 14 Disconnected
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 4: Communication to Node 7 closed
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 4: Communication to Node 8 closed
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 4: Communication to Node 10 closed
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 4: Communication to Node 12 closed
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 4: Communication to Node 13 closed
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 4: Communication to Node 14 closed
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 3: Communication to Node 7 closed
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 3: Communication to Node 8 closed
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 3: Communication to Node 10 closed
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 3: Communication to Node 12 closed
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 3: Communication to Node 13 closed
2012-03-29 07:58:52 [MgmtSrvr] INFO -- Node 3: Communication to Node 14 closed
2012-03-29 07:58:53 [MgmtSrvr] ALERT -- Node 1: Node 3 Disconnected
2012-03-29 07:58:53 [MgmtSrvr] ALERT -- Node 3: Forced node shutdown completed. Initiated by signal 6. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2012-03-29 07:58:53 [MgmtSrvr] ALERT -- Node 4: Forced node shutdown completed. Initiated by signal 6. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2012-03-29 07:58:53 [MgmtSrvr] ALERT -- Node 1: Node 4 Disconnected
========================================================================
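
As an aid for reading the management log, the sketch below builds a small timeline of missed heartbeats and forced shutdowns. It assumes the line layout seen in the ndb_cluster.log excerpt above; the function name `summarize` is hypothetical, and the sample lines are shortened copies of the real ones.

```python
import re
from datetime import datetime

# Layout assumed from the ndb_cluster.log excerpt in this post.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+\[MgmtSrvr\]\s+(?P<level>\w+)\s+--\s+"
    r"Node (?P<reporter>\d+): (?P<msg>.*)$"
)
HEARTBEAT_RE = re.compile(r"Node (\d+) missed heartbeat (\d+)")


def summarize(lines):
    """Collect missed-heartbeat events and the forced-shutdown time of each node."""
    missed = []     # (time, reporting node, silent node, heartbeat count)
    shutdowns = {}  # node id -> time its forced shutdown was logged
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
        hb = HEARTBEAT_RE.search(m["msg"])
        if hb:
            missed.append((ts, int(m["reporter"]), int(hb.group(1)), int(hb.group(2))))
        elif m["msg"].startswith("Forced node shutdown completed"):
            shutdowns[int(m["reporter"])] = ts
    return missed, shutdowns


# Shortened copies of lines from the log above.
sample = [
    "2012-03-29 07:58:50 [MgmtSrvr] WARNING -- Node 3: Node 4 missed heartbeat 2",
    "2012-03-29 07:58:52 [MgmtSrvr] WARNING -- Node 3: Node 4 missed heartbeat 3",
    "2012-03-29 07:58:53 [MgmtSrvr] ALERT -- Node 3: Forced node shutdown completed. Initiated by signal 6.",
    "2012-03-29 07:58:53 [MgmtSrvr] ALERT -- Node 4: Forced node shutdown completed. Initiated by signal 6.",
]

missed, shutdowns = summarize(sample)
first_miss = missed[0][0]
for node, ts in sorted(shutdowns.items()):
    delay = (ts - first_miss).total_seconds()
    print(f"node {node}: forced shutdown {delay:.0f}s after the first missed heartbeat")
```

Lining these times up against the watchdog warnings in ndb_out.log should make it clearer whether the stalls come before the missed heartbeats in every incident.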

What other causes could explain this problem, and what are the possible solutions?


