MySQL Forums
Forum List  »  NDB clusters

NDB engine possibly not a realtime database engine?
Posted by: Stefan Auweiler
Date: March 16, 2017 05:27AM

Hi,

for month now, we are trying to find, why we are getting long queries every now and then.

Cluster is prettie stable and runs at millions of queries per second in a telco environment.
Sub second response times are required. We have slow query logging running at 100ms and we dont see any queries... but...

Several times a day, we see nodes halting for about 4-6 seconds for no reason.
During this period of time, the complete system literally stops working, takes a rest and keeps up with its work after that 6 seconds.

The only evidences in the logs (mostly only on one of the data nodes) are like this:


2017-03-16 11:10:58 [ndbd] WARNING -- Ndb kernel thread 10 is stuck in: Job Handling elapsed=5435
2017-03-16 11:10:58 [ndbd] INFO -- Watchdog: User time: 1521948563 System time: 392495304
2017-03-16 11:10:58 [ndbd] WARNING -- Ndb kernel thread 10 is stuck in: Job Handling elapsed=5535
2017-03-16 11:10:58 [ndbd] INFO -- Watchdog: User time: 1521948571 System time: 392495308
2017-03-16 11:10:58 [ndbd] WARNING -- Ndb kernel thread 10 is stuck in: Job Handling elapsed=5635
2017-03-16 11:10:58 [ndbd] INFO -- Watchdog: User time: 1521948579 System time: 392495312
2017-03-16 11:10:58 [ndbd] WARNING -- Ndb kernel thread 10 is stuck in: Job Handling elapsed=5735
2017-03-16 11:10:58 [ndbd] INFO -- Watchdog: User time: 1521948587 System time: 392495317
2017-03-16 11:10:58 [ndbd] WARNING -- Ndb kernel thread 10 is stuck in: Job Handling elapsed=5835
2017-03-16 11:10:58 [ndbd] INFO -- Watchdog: User time: 1521948596 System time: 392495321
2017-03-16 11:10:58 [ndbd] WARNING -- Ndb kernel thread 10 is stuck in: Job Handling elapsed=5935
2017-03-16 11:10:58 [ndbd] INFO -- Watchdog: User time: 1521948602 System time: 392495325
2017-03-16 11:10:58 [ndbd] WARNING -- Ndb kernel thread 10 is stuck in: Job Handling elapsed=6035
2017-03-16 11:10:58 [ndbd] INFO -- Watchdog: User time: 1521948612 System time: 392495330
2017-03-16 11:10:59 [ndbd] WARNING -- Ndb kernel thread 10 is stuck in: Job Handling elapsed=6136
2017-03-16 11:10:59 [ndbd] INFO -- Watchdog: User time: 1521948619 System time: 392495334
2017-03-16 11:10:59 [ndbd] WARNING -- Ndb kernel thread 10 is stuck in: Job Handling elapsed=6236
2017-03-16 11:10:59 [ndbd] INFO -- Watchdog: User time: 1521948627 System time: 392495339
2017-03-16 11:10:59 [ndbd] WARNING -- Ndb kernel thread 10 is stuck in: Job Handling elapsed=6336
2017-03-16 11:10:59 [ndbd] INFO -- Watchdog: User time: 1521948636 System time: 392495344
2017-03-16 11:10:59 [ndbd] WARNING -- Ndb kernel thread 10 is stuck in: Job Handling elapsed=6436
2017-03-16 11:10:59 [ndbd] INFO -- Watchdog: User time: 1521948644 System time: 392495348
2017-03-16 11:10:59 [ndbd] WARNING -- Ndb kernel thread 10 is stuck in: Job Handling elapsed=6536
2017-03-16 11:10:59 [ndbd] INFO -- Watchdog: User time: 1521948652 System time: 392495353
2017-03-16 11:10:59 [ndbd] WARNING -- thr: 10: Overslept 6716 ms, expected ~10ms




All our clients are complaining, that during this period, query results (regardless of their type) are taking that exact amount of time (watch the 'Time: 5871ms' values)




bbbsv276_25502_events 16.03.17 11:10:59.559350 141 5555 [INF] Queryid: 0072 Time: 5871ms on spr at intern462:33263: SELECT * FROM sms WHERE claimed_by = 56775019 AND 0072 limit 20
bbbsv221_13203_inhttp 16.03.17 11:10:59.559394 141 5555 [INF] Queryid: 0412 Time: 5557ms on spr at localhost:/var/opt/acctopus/run/spr.sock: SELECT id,functionid,qtime,qvol,processedtime FROM TicketBookings WHERE subscriber=491748415 and bucketid=206566388 AND functionid<>1 AND 0412
bbbsv265_04193_inhttp 16.03.17 11:10:59.559677 141 5555 [INF] Queryid: 0412 Time: 5832ms on spr at localhost:/var/opt/acctopus/run/spr.sock: SELECT id,functionid,qtime,qvol,processedtime FROM TicketBookings WHERE subscriber=4915250560 and bucketid=7822930 AND functionid<>1 AND 0412
bbbsv229_27150_subchk 16.03.17 11:10:59.559629 141 5555 [INF] Queryid: 0174 Time: 6702ms on spr at intern463:33263: SELECT n_subscriber,n_bc_real,n_checked,n_status,n_fingerprint FROM t_subscriberVolatile WHERE n_claimed_by = 57001982 AND 0174 LIMIT 300
bbbsv272_21808_inhttp 16.03.17 11:10:59.559565 141 5555 [INF] Queryid: 0412 Time: 5820ms on spr at localhost:/var/opt/acctopus/run/spr.sock: SELECT id,functionid,qtime,qvol,processedtime FROM TicketBookings WHERE subscriber=4917050560 and bucketid=7822930 AND functionid<>1 AND 0412
bbbsv278_14885_events 16.03.17 11:10:59.559264 141 5555 [INF] Queryid: 0072 Time: 6462ms on spr at intern464:33263: SELECT * FROM sms WHERE claimed_by = 56924178 AND 0072 limit 20
bbbsv227_29888_events 16.03.17 11:10:59.559672 141 5555 [INF] Queryid: 0072 Time: 6005ms on spr at intern467:33261: SELECT * FROM sms WHERE claimed_by = 56914443 AND 0072 limit 20
bbbsv248_00418_diamgx 16.03.17 11:10:59.560178 141 5555 [INF] Queryid: 0339 Time: 5857ms on spr at intern462:33261: SELECT * FROM t_subscriber WHERE n_subscriber=0491622845 AND 0339
bbbsv236_16240_events 16.03.17 11:10:59.559670 141 5555 [INF] Queryid: 0072 Time: 5691ms on spr at intern464:33263: SELECT * FROM sms WHERE claimed_by = 56924172 AND 0072 limit 20
bbbsv437_22511_schedu 16.03.17 11:10:59.559721 141 5555 [INF] Queryid: 23 Time: 5611ms on spr at intern461:33263: SELECT progname,last,n_pid FROM alive WHERE server='bbbsv437' AND 23
bbbsv464_14863_schedu 16.03.17 11:10:59.559629 141 5555 [INF] Queryid: 23 Time: 5915ms on spr at localhost:/var/opt/acctopus/run/spr.sock: SELECT progname,last,n_pid FROM alive WHERE server='bbbsv464' AND 23
bbbsv406_00196_schedu 16.03.17 11:10:59.559816 141 5555 [INF] Queryid: 23 Time: 5633ms on spr at intern464:33261: SELECT progname,last,n_pid FROM alive WHERE server='bbbsv406' AND 23
bbbsv279_28039_events 16.03.17 11:10:59.559152 141 5555 [INF] Queryid: 0072 Time: 5904ms on spr at intern467:33261: SELECT * FROM sms WHERE claimed_by = 56914452 AND 0072 limit 20
bbbsv262_01095_inhttp 16.03.17 11:10:59.559983 141 5555 [INF] Queryid: 0412 Time: 6099ms on spr at localhost:/var/opt/acctopus/run/spr.sock: SELECT id,functionid,qtime,qvol,processedtime FROM TicketBookings WHERE subscriber=491723015 and bucketid=220957566 AND functionid<>1 AND 0412
bbbsv463_14328_schedu 16.03.17 11:10:59.559970 141 5555 [INF] Queryid: 23 Time: 5927ms on spr at intern467:33262: SELECT progname,last,n_pid FROM alive WHERE server='bbbsv463' AND 23
bbbsv223_24550_inhttp 16.03.17 11:10:59.560121 141 5555 [INF] Queryid: 0412 Time: 6157ms on spr at localhost:/var/opt/acctopus/run/spr.sock: SELECT id,functionid,qtime,qvol,processedtime FROM TicketBookings WHERE subscriber=491735257 and bucketid=218731820 AND functionid<>1 AND 0412
bbbsv259_01845_inhttp 16.03.17 11:10:59.559868 141 5555 [INF] Queryid: 0412 Time: 5583ms on spr at localhost:/var/opt/acctopus/run/spr.sock: SELECT id,functionid,qtime,qvol,processedtime FROM TicketBookings WHERE subscriber=491738806 and bucketid=217818471 AND functionid<>1 AND 0412
bbbsv251_04708_events 16.03.17 11:10:59.560059 141 5555 [INF] Queryid: 0072 Time: 6399ms on spr at intern463:33262: SELECT * FROM sms WHERE claimed_by = 56909694 AND 0072 limit 20
bbbsv208_27247_inhttp 16.03.17 11:10:59.560427 141 5555 [INF] Queryid: 0412 Time: 6319ms on spr at localhost:/var/opt/acctopus/run/spr.sock: SELECT id,functionid,qtime,qvol,processedtime FROM TicketBookings WHERE subscriber=491702423 and bucketid=219024115 AND functionid<>1 AND 0412


About the system


The Boxes are Oracle Fire X5 with lots of disks, memory, running Solaris 11.
Nothing else is running on the data nodes, Processor groups are configured to give the cluster its own CPU space and OS is working in the default group.
IRQs are taken out of the way.
NDB is configured to give any ndb thraed its own core, leaving the HT core free in case of the LDMs:


NoOfFragmentLogParts=16
ThreadConfig=ldm={count=16,cpubind=2-13,20-23,spintime=100},main={cpubind=18},io={cpubind=37},watchdog={cpubind=19},rep={cpubind=24},tc={count=8,cpubind=14,15,50,51,32,33,68,69,realtime=1},recv={count=4,cpubind=34,35,70,71},send={count=4,cpubind=16,17,52,53}


Does anybody have an idea, how to dig into this?
Has anybody have had that strange behaviour?

Thanks
Stefan

Options: ReplyQuote


Subject
Views
Written By
Posted
NDB engine possibly not a realtime database engine?
372
March 16, 2017 05:27AM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.