Hi,
After upgrading from NDB 7.2.13 to 7.3.5 and 7.3.6, we are experiencing random node crashes, and eventually the whole cluster dies.
Setup:
4 data nodes, each with 256 GB and 6 cores with HT enabled, no NUMA, threads pinned to CPUs (see config)
With 7.3.6, the crash errors are mostly:
Time: Thursday 24 July 2014 - 22:13:35
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DblqhMain.cpp
Error object: DBLQH (Line: 8862) 0x00000006
Program: ndbmtd
Pid: 22137 thr: 3
Version: mysql-5.6.19 ndb-7.3.6
Trace: /mnt/data/cluster/ndb_4_trace.log.9 [t1..t10]
With 7.3.5 we encountered:
(Version 7.3.5)
Status: Temporary error, restart node
Message: Assertion (Internal error, programming error or missing error message, please report a bug)
Error: 2301
Error data: Illegal signal received (GSN 40 not added)
Error object: Illegal signal received (GSN 40 not added)
Program: ndbmtd
Pid: 6029 thr: 0
Version: mysql-5.6.17 ndb-7.3.5
Trace: /mnt/data/cluster/ndb_6_trace.log.8 [t1..t10]
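
To compare the two crash reports above field by field, a small sketch like the following may help. It is plain Python, not an NDB tool; `parse_error_report` and `diff_reports` are hypothetical helper names, and the sketch assumes every report line has the `Key: value` shape shown above.

```python
# Hypothetical helpers for comparing two ndbmtd error reports.
# Assumes each line looks like "Key: value"; lines without a colon are skipped.

REPORT_736 = """\
Time: Thursday 24 July 2014 - 22:13:35
Status: Temporary error, restart node
Error: 2341
Error data: DblqhMain.cpp
Error object: DBLQH (Line: 8862) 0x00000006
Program: ndbmtd
Version: mysql-5.6.19 ndb-7.3.6
"""

REPORT_735 = """\
Status: Temporary error, restart node
Error: 2301
Error data: Illegal signal received (GSN 40 not added)
Error object: Illegal signal received (GSN 40 not added)
Program: ndbmtd
Version: mysql-5.6.17 ndb-7.3.5
"""


def parse_error_report(text):
    """Split each line on its first colon and return a {key: value} dict."""
    report = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            report[key.strip()] = value.strip()
    return report


def diff_reports(a, b):
    """Return {key: (value_in_a, value_in_b)} for fields that differ.

    A field missing from one report shows up as None on that side.
    """
    keys = list(a) + [k for k in b if k not in a]
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


r736 = parse_error_report(REPORT_736)
r735 = parse_error_report(REPORT_735)
for field, (v736, v735) in diff_reports(r736, r735).items():
    print(f"{field}: 7.3.6={v736!r} 7.3.5={v735!r}")
```

Fields that are identical in both reports (for example Status and Program) are left out, so only the differing error codes, error data and versions are printed.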
More details are in the bug report:
http://bugs.mysql.com/bug.php?id=73339
The cluster was first upgraded with a rolling restart. After the crashes, we started the nodes with --initial and restored the data with ndb_restore. We have now been doing this nearly twice a day since Monday. Memory usage is around 63% of the configured cluster memory.
NDB config (important part) here:
[NDBD DEFAULT]
NoOfReplicas=2
Datadir=/mnt/data/cluster
FileSystemPathDD=/mnt/data/cluster
DataMemory=183000M
IndexMemory=53000M
LockPagesInMainMemory=1
MaxNoOfConcurrentOperations=2000000
TransactionDeadlockDetectionTimeout=10000
StringMemory=25
MaxNoOfTables=4096
MaxNoOfOrderedIndexes=2048
MaxNoOfUniqueHashIndexes=512
MaxNoOfAttributes=24576
MaxNoOfTriggers=14336
DiskCheckpointSpeed=100M
FragmentLogFileSize=128M
InitFragmentLogFiles=SPARSE
NoOfFragmentLogFiles=300
RedoBuffer=64M
CompressedLCP=1
TimeBetweenLocalCheckpoints=20
TimeBetweenGlobalCheckpoints=1000
TimeBetweenEpochs=100
MemReportFrequency=30
BackupReportFrequency=10
### Params for setting logging
LogLevelStartup=15
LogLevelShutdown=15
LogLevelCheckpoint=8
LogLevelNodeRestart=15
### Params for increasing Disk throughput
BackupMaxWriteSize=1M
BackupDataBufferSize=16M
BackupLogBufferSize=4M
BackupMemory=20M
### Backup Settings
BackupDataDir=/var/lib/mysql-cluster/backup
CompressedBACKUP=1
# Reports indicate that ODirect=1 can cause I/O errors (OS error code 5) on some systems. You must test.
ODirect=1
### Watchdog
TimeBetweenWatchdogCheckInitial=60000
TransactionInactiveTimeout=60000
### DISK DATA
SharedGlobalMemory=20M
DiskPageBufferMemory=64M
### Multithreading
ThreadConfig=ldm={count=4,cpubind=2,3,4,5},tc={count=2,cpubind=6,7},recv={count=2,cpubind=8,9},send={count=1,cpubind=10},main={count=1,cpubind=11},io={count=1,cpubind=11}
### Increasing the LongMessageBuffer b/c of a bug (20090903)
LongMessageBuffer=32M
BatchSizePerLocalScan=512
DiskCheckpointSpeedInRestart=100M
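
As a sanity check on the ThreadConfig line above, here is a minimal sketch that lists which CPUs are bound to more than one thread type and which are not bound at all. It is plain Python, not an NDB tool; `parse_thread_config`, `shared_cpus` and `unbound_cpus` are hypothetical helper names. It assumes the 12 logical CPUs (6 cores with HT) are numbered 0 to 11, and it only handles integer values such as the `count` and `cpubind` lists used in this config.

```python
import re
from collections import defaultdict

# ThreadConfig value copied from the config above.
THREAD_CONFIG = (
    "ldm={count=4,cpubind=2,3,4,5},tc={count=2,cpubind=6,7},"
    "recv={count=2,cpubind=8,9},send={count=1,cpubind=10},"
    "main={count=1,cpubind=11},io={count=1,cpubind=11}"
)

# Assumption: 6 cores with HT show up as logical CPUs 0..11.
LOGICAL_CPUS = range(12)


def parse_thread_config(value):
    """Return {thread_type: {param: [int, ...]}} for a ThreadConfig string."""
    result = {}
    for name, body in re.findall(r"(\w+)=\{([^}]*)\}", value):
        params = {}
        key = None
        for token in body.split(","):
            token = token.strip()
            if "=" in token:
                # "cpubind=2" starts a new parameter ...
                key, token = token.split("=", 1)
                params[key] = []
            if key is not None and token:
                # ... and bare numbers continue the current one.
                params[key].append(int(token))
        result[name] = params
    return result


def shared_cpus(config):
    """Return {cpu: [thread types]} for CPUs bound by more than one type."""
    users = defaultdict(list)
    for name, params in config.items():
        for cpu in params.get("cpubind", []):
            users[cpu].append(name)
    return {cpu: names for cpu, names in users.items() if len(names) > 1}


def unbound_cpus(config, cpus=LOGICAL_CPUS):
    """Return the CPUs in `cpus` that no thread type is bound to."""
    bound = {c for p in config.values() for c in p.get("cpubind", [])}
    return sorted(set(cpus) - bound)


config = parse_thread_config(THREAD_CONFIG)
print("shared:", shared_cpus(config))
print("unbound:", unbound_cpus(config))
```

For the config above, this reports CPU 11 as shared by main and io, and CPUs 0 and 1 as unbound. Whether that has anything to do with the crashes is not established here.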
Any ideas? Should we disable thread pinning, or increase the maximum number of operations / transactions?
One important thing: all tables are "non-logging", i.e. not persistent, so this is certainly not a file-system-related issue.
Thanks!
Christian