Cluster crashing with error 2341: Internal program error (failed ndbrequire)
Date: October 31, 2009 05:56AM
I have set up a cluster with two data nodes, two SQL nodes, and one management node. We also set up a load balancer using Heartbeat and a Linux virtual IP. After I connected our application to the cluster it worked fine for a while, but then it crashed with this error:
Node 3: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
I tried logging the queries, but they worked fine when I executed them directly on the node.
We are using temporary tables. I know temporary tables won't work with NDB, so I configured those queries to use a direct connection to a node (bypassing the load balancer). I don't think this is causing the problem. Is there a better workaround for temporary tables with the cluster?
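The routing described above could be sketched roughly like this. It is only an illustration: the two host addresses and the `choose_host` helper are hypothetical placeholders, not the actual application code. The idea it shows is that a temporary table is visible only to the connection that created it, so a session that uses one has to stay on a single SQL node instead of going through the load balancer.

```python
import re

# Hypothetical endpoints (placeholders, not the real addresses):
LOAD_BALANCER_HOST = "192.0.2.10"    # virtual IP in front of both SQL nodes
DIRECT_SQL_NODE_HOST = "192.0.2.21"  # one fixed SQL node

# Matches CREATE/DROP TEMPORARY TABLE in any letter case.
_TEMP_TABLE = re.compile(r"\bTEMPORARY\s+TABLE\b", re.IGNORECASE)


def choose_host(session_statements):
    """Pick the host for a whole session, given the statements it will run.

    If any statement touches a temporary table, the entire session is pinned
    to one SQL node, because the table exists only on that connection.
    """
    for statement in session_statements:
        if _TEMP_TABLE.search(statement):
            return DIRECT_SQL_NODE_HOST
    return LOAD_BALANCER_HOST


if __name__ == "__main__":
    report = ["CREATE TEMPORARY TABLE tmp_ids (id INT)", "SELECT * FROM tmp_ids"]
    print(choose_host(report))        # session pinned to the direct node
    print(choose_host(["SELECT 1"]))  # ordinary session goes via the balancer
```

The decision is made per session rather than per query, since sending the later SELECT through the load balancer could land it on the other SQL node, where the temporary table does not exist.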
Server version: 5.1.34-ndb-7.0.6-cluster-gpl-log MySQL Cluster Server (GPL)
OS: Fedora Core 6 (Linux)
This is the log from the management node:
2009-10-31 01:57:42 [MgmSrvr] ALERT -- Node 3: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2009-10-31 01:57:42 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
2009-10-31 01:57:42 [MgmSrvr] ALERT -- Node 4: Node 3 Disconnected
2009-10-31 01:57:42 [MgmSrvr] INFO -- Node 4: Communication to Node 3 closed
2009-10-31 01:57:42 [MgmSrvr] ALERT -- Node 4: Network partitioning - arbitration required
2009-10-31 01:57:42 [MgmSrvr] INFO -- Node 4: President restarts arbitration thread [state=7]
2009-10-31 01:57:42 [MgmSrvr] ALERT -- Node 4: Arbitration won - positive reply from node 1
2009-10-31 01:57:42 [MgmSrvr] INFO -- Node 4: GCP Take over started
2009-10-31 01:57:42 [MgmSrvr] INFO -- Node 4: Node 4 taking over as DICT master
2009-10-31 01:57:42 [MgmSrvr] INFO -- Node 4: GCP Take over completed
2009-10-31 01:57:42 [MgmSrvr] INFO -- Node 4: kk: 1768610/17 0 0
2009-10-31 01:57:42 [MgmSrvr] INFO -- Node 4: LCP Take over started
2009-10-31 01:57:42 [MgmSrvr] INFO -- Node 4: ParticipatingDIH = 0000000000000000
2009-10-31 01:57:42 [MgmSrvr] INFO -- Node 4: ParticipatingLQH = 0000000000000000
2009-10-31 01:57:42 [MgmSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2009-10-31 01:57:42 [MgmSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=0 0000000000000000]
2009-10-31 01:57:42 [MgmSrvr] INFO -- Node 4: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2009-10-31 01:57:42 [MgmSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_From_Master_Received = 1
2009-10-31 01:57:42 [MgmSrvr] INFO -- Node 4: LCP Take over completed (state = 4)
2009-10-31 01:57:42 [MgmSrvr] INFO -- Node 4: ParticipatingDIH = 0000000000000000
2009-10-31 01:57:42 [MgmSrvr] INFO -- Node 4: ParticipatingLQH = 0000000000000000
2009-10-31 01:57:42 [MgmSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2009-10-31 01:57:42 [MgmSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=0 0000000000000000]
2009-10-31 01:57:42 [MgmSrvr] INFO -- Node 4: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2009-10-31 01:57:42 [MgmSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_From_Master_Received = 1
2009-10-31 01:57:42 [MgmSrvr] ALERT -- Node 4: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2009-10-31 01:57:42 [MgmSrvr] ALERT -- Node 1: Node 4 Disconnected
2009-10-31 02:50:06 [MgmSrvr] WARNING -- Allocate nodeid (0) failed. Connection from ip 127.0.0.1. Returned error string "Connection done from wrong host ip 127.0.0.1."
2009-10-31 02:50:06 [MgmSrvr] INFO -- Mgmt server state: node id's 1 not connected but reserved
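To make the sequence in the log above easier to see, here is a small sketch that pulls out only the forced-shutdown events. The regular expression is an assumption based on the line format shown in this log, and `forced_shutdowns` is a hypothetical helper name.

```python
import re

# Pattern assumed from the management log lines pasted above.
SHUTDOWN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[MgmSrvr\] ALERT -- "
    r"Node (?P<node>\d+): Forced node shutdown completed\. "
    r"Caused by error (?P<code>\d+)"
)


def forced_shutdowns(lines):
    """Return (timestamp, node_id, error_code) for each forced shutdown line."""
    events = []
    for line in lines:
        match = SHUTDOWN.match(line)
        if match:
            events.append(
                (match.group("ts"), int(match.group("node")), int(match.group("code")))
            )
    return events


# Three lines taken from the log above (messages shortened after the error code).
SAMPLE = [
    "2009-10-31 01:57:42 [MgmSrvr] ALERT -- Node 3: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)'.",
    "2009-10-31 01:57:42 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected",
    "2009-10-31 01:57:42 [MgmSrvr] ALERT -- Node 4: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)'.",
]

if __name__ == "__main__":
    for ts, node, code in forced_shutdowns(SAMPLE):
        print(ts, "node", node, "error", code)
```

Run against the log above, this reports node 3 and node 4 both going down with error 2341 in the same second, which means both data nodes were lost and not just one.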
This is my config.ini:
[TCP DEFAULT]
SendBufferMemory=2M
ReceiveBufferMemory=2M
[NDB_MGMD DEFAULT]
Datadir=/var/lib/mysql-cluster
[NDB_MGMD]
Id=1
Hostname=192.168.1.70
ArbitrationRank=1
[NDBD DEFAULT]
NoOfReplicas=2
Datadir=/var/lib/mysql-cluster
DataMemory=4096M
IndexMemory=512M
MaxNoOfOrderedIndexes=1024
MaxNoOfAttributes=5000
MaxNoOfTables=1024
# ### Watchdog
# TimeBetweenWatchdogCheckInitial=30000
# TimeBetweenWatchdogCheck=30000
MaxNoOfConcurrentOperations=1000000
#
# LockPagesInMainMemory=1
#
#
# StringMemory=25
# MaxNoOfUniqueHashIndexes=512
# DiskCheckpointSpeedInRestart=100M
FragmentLogFileSize=512M
# InitFragmentLogFiles=FULL
NoOfFragmentLogFiles=12
RedoBuffer=1024M
#
# TimeBetweenLocalCheckpoints=20
# TimeBetweenGlobalCheckpoints=1000
# TimeBetweenEpochs=100
#
MemReportFrequency=30
BackupReportFrequency=10
#
# ### Params for setting logging
LogLevelStartup=15
LogLevelShutdown=15
LogLevelCheckpoint=8
LogLevelNodeRestart=15
#
# ### Params for increasing Disk throughput
# BackupMaxWriteSize=1M
# BackupDataBufferSize=16M
# BackupLogBufferSize=4M
# BackupMemory=20M
# #Reports indicates that odirect=1 can cause io errors (os err code 5) on some systems. You must test.
# #ODirect=1
#
# ### TransactionInactiveTimeout - should be enabled in Production
# #TransactionInactiveTimeout=30000
# ### CGE 6.3 - REALTIME EXTENSIONS
# #RealTimeScheduler=1
# #SchedulerExecutionTimer=80
# #SchedulerSpinTimer=40
#
# ### DISK DATA
# #SharedGlobalMemory=384M
# #read my blog how to set this:
# #DiskPageBufferMemory=3072M
# ### DISK DATA
# #SharedGlobalMemory=384M
# #read my blog how to set this:
# #DiskPageBufferMemory=3072M
#
# ### Multithreading
# MaxNoOfExecutionThreads=8
# BatchSizePerLocalScan=512
[NDBD]
Id=3
Hostname=192.168.1.71
[NDBD]
Id=4
Hostname=192.168.1.72
[MYSQLD DEFAULT]
BatchSize=512
# BatchByteSize=2048K
# MaxScanBatchSize=2048K
[MYSQLD]
[MYSQLD]
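For reference, here is a rough calculation of what the [NDBD DEFAULT] values above ask of each data node. It is a sketch only: the RAM figure is a lower bound that adds up just DataMemory, IndexMemory and RedoBuffer and ignores every other buffer, and the function names are made up for this example. The redo log figure uses the fact that NDB creates redo log files in sets of four.

```python
# Values copied from the [NDBD DEFAULT] section above, in megabytes.
DATA_MEMORY_MB = 4096
INDEX_MEMORY_MB = 512
REDO_BUFFER_MB = 1024
FRAGMENT_LOG_FILE_SIZE_MB = 512
NO_OF_FRAGMENT_LOG_FILES = 12


def redo_log_disk_mb(no_of_files, file_size_mb):
    """Disk space for the redo log: files are created in sets of four."""
    return no_of_files * 4 * file_size_mb


def min_ram_mb(*parts_mb):
    """Lower bound on RAM: the sum of the explicitly configured areas only."""
    return sum(parts_mb)


redo_mb = redo_log_disk_mb(NO_OF_FRAGMENT_LOG_FILES, FRAGMENT_LOG_FILE_SIZE_MB)
ram_mb = min_ram_mb(DATA_MEMORY_MB, INDEX_MEMORY_MB, REDO_BUFFER_MB)

if __name__ == "__main__":
    print("redo log on disk per data node:", redo_mb, "MB")
    print("configured RAM per data node (lower bound):", ram_mb, "MB")
```

With these settings each data node needs about 24 GB of disk for the redo log and at least 5.5 GB of RAM, so it is worth checking that both hosts actually have that much available.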
Does anyone know about this issue? Any help would be highly appreciated.
Thanks,
Biju
Edited 1 time(s). Last edit at 10/31/2009 05:59AM by Biju Thaj.