MySQL Forums

Random node and full cluster crash after upgrade from 7.2 to 7.3.x
Posted by: Christian Ehmig
Date: July 24, 2014 05:26PM

Hi,

After upgrading from NDB 7.2.13 to 7.3.5 and then 7.3.6, we are experiencing random node crashes, and eventually the whole cluster dies.

Setup:
4 data nodes, each with 256 GB RAM and 6 cores with HT enabled; no NUMA; threads pinned to CPUs (see config)

With 7.3.6, the crash errors are mostly:

Time: Thursday 24 July 2014 - 22:13:35
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DblqhMain.cpp
Error object: DBLQH (Line: 8862) 0x00000006
Program: ndbmtd
Pid: 22137 thr: 3
Version: mysql-5.6.19 ndb-7.3.6
Trace: /mnt/data/cluster/ndb_4_trace.log.9 [t1..t10]

With 7.3.5 we encountered:
(Version 7.3.5)
Status: Temporary error, restart node
Message: Assertion (Internal error, programming error or missing error message, please report a bug)
Error: 2301
Error data: Illegal signal received (GSN 40 not added)
Error object: Illegal signal received (GSN 40 not added)
Program: ndbmtd
Pid: 6029 thr: 0
Version: mysql-5.6.17 ndb-7.3.5
Trace: /mnt/data/cluster/ndb_6_trace.log.8 [t1..t10]

More details here: http://bugs.mysql.com/bug.php?id=73339


The cluster was first upgraded with a rolling restart. After the crashes, we started the nodes with --initial and restored the data with ndb_restore. We have now been doing this nearly twice a day since Monday. Memory usage is around 63% of the configured cluster memory.

NDB config (important part) here:
[NDBD DEFAULT]
NoOfReplicas=2
Datadir=/mnt/data/cluster
FileSystemPathDD=/mnt/data/cluster
DataMemory=183000M
IndexMemory=53000M
LockPagesInMainMemory=1

MaxNoOfConcurrentOperations=2000000
TransactionDeadlockDetectionTimeout=10000

StringMemory=25
MaxNoOfTables=4096
MaxNoOfOrderedIndexes=2048
MaxNoOfUniqueHashIndexes=512
MaxNoOfAttributes=24576
MaxNoOfTriggers=14336
DiskCheckpointSpeed=100M
FragmentLogFileSize=128M
InitFragmentLogFiles=SPARSE
NoOfFragmentLogFiles=300
RedoBuffer=64M
CompressedLCP=1

TimeBetweenLocalCheckpoints=20
TimeBetweenGlobalCheckpoints=1000
TimeBetweenEpochs=100

MemReportFrequency=30
BackupReportFrequency=10

### Params for setting logging
LogLevelStartup=15
LogLevelShutdown=15
LogLevelCheckpoint=8
LogLevelNodeRestart=15

### Params for increasing Disk throughput
BackupMaxWriteSize=1M
BackupDataBufferSize=16M
BackupLogBufferSize=4M
BackupMemory=20M

### Backup Settings
BackupDataDir=/var/lib/mysql-cluster/backup
CompressedBACKUP=1

# Reports indicate that ODirect=1 can cause I/O errors (OS error code 5) on some systems. You must test.
ODirect=1

### Watchdog
TimeBetweenWatchdogCheckInitial=60000
TransactionInactiveTimeout=60000

### DISK DATA
SharedGlobalMemory=20M
DiskPageBufferMemory=64M

### Multithreading
ThreadConfig=ldm={count=4,cpubind=2,3,4,5},tc={count=2,cpubind=6,7},recv={count=2,cpubind=8,9},send={count=1,cpubind=10},main={count=1,cpubind=11},io={count=1,cpubind=11}

### Increasing the LongMessageBuffer b/c of a bug (20090903)
LongMessageBuffer=32M
BatchSizePerLocalScan=512

DiskCheckpointSpeedInRestart=100M
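
In case it helps anyone reviewing the thread pinning, here is a minimal Python sketch that parses the ThreadConfig value above and lists CPUs bound to more than one thread type, plus CPUs left unbound. The function names are made up for illustration, and the total of 12 logical CPUs numbered 0..11 is an assumption derived from "6 cores with HT". It only reports the bindings; it does not claim that any of them is related to the crashes.

```python
import re
from collections import defaultdict

# ThreadConfig value copied from the config above.
THREAD_CONFIG = (
    "ldm={count=4,cpubind=2,3,4,5},tc={count=2,cpubind=6,7},"
    "recv={count=2,cpubind=8,9},send={count=1,cpubind=10},"
    "main={count=1,cpubind=11},io={count=1,cpubind=11}"
)

# Assumption: 6 cores with HT enabled give 12 logical CPUs, numbered 0..11.
LOGICAL_CPUS = 12


def parse_thread_config(value):
    """Return {thread_type: {"count": int, "cpubind": [int, ...]}}."""
    parsed = {}
    for name, body in re.findall(r"(\w+)=\{([^}]*)\}", value):
        count_match = re.search(r"count=(\d+)", body)
        bind_match = re.search(r"cpubind=([\d,]+)", body)
        cpus = []
        if bind_match:
            cpus = [int(c) for c in bind_match.group(1).split(",") if c]
        parsed[name] = {
            "count": int(count_match.group(1)) if count_match else 1,
            "cpubind": cpus,
        }
    return parsed


def shared_cpus(parsed):
    """Return {cpu: [thread_type, ...]} for CPUs bound by several thread types."""
    users = defaultdict(list)
    for name, spec in parsed.items():
        for cpu in spec["cpubind"]:
            users[cpu].append(name)
    return {cpu: names for cpu, names in users.items() if len(names) > 1}


def unbound_cpus(parsed, total=LOGICAL_CPUS):
    """Return the sorted logical CPUs that no thread type is bound to."""
    bound = {cpu for spec in parsed.values() for cpu in spec["cpubind"]}
    return sorted(set(range(total)) - bound)
```

Calling shared_cpus(parse_thread_config(THREAD_CONFIG)) on this config reports CPU 11 for both main and io, and unbound_cpus(...) reports CPUs 0 and 1 under the 12-CPU assumption.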


Any ideas? Maybe disable thread pinning? Increase max total operations / transactions?

One important thing: all tables are "non-logging", meaning they are not persistent, so this is certainly not a file-system-related issue.


Thanks!

Christian
