MySQL Forums

MySQL Cluster data nodes fail with error 7200: LCP fragment scan watchdog
Posted by: Puneet N
Date: January 06, 2018 10:38PM

Hi,

We have a MySQL Cluster setup with 1 management node, 2 data nodes, and 2 SQL nodes.

The versions and configuration are as follows:
mysql-5.6.28
ndb-7.4.10
------------------------------------------------
[ndb_mgmd default]
# Directory for MGM node log files
DataDir=/var/lib/mysql/mycluster

[ndb_mgmd]
#Management Node mgm-01
HostName=<mgm_host_ip>
nodeid=1

[ndbd default]
NoOfReplicas=2
DataMemory=60G
IndexMemory=25G
FragmentLogFileSize=16M
NoOfFragmentLogFiles=500
MaxNoOfTables=256
MaxNoOfAttributes=1536
MaxNoOfOrderedIndexes=512
MaxNoOfUniqueHashIndexes=128
MaxNoOfConcurrentOperations=100000
MaxNoOfLocalOperations=110000
#Directory for Data Node
DataDir=/var/lib/mysql/mycluster_data

[ndbd]
#Data Node data-01
HostName=<data_node1_host_ip>
nodeid=2

[ndbd]
#Data Node data-02
HostName=<data_node2_host_ip>
nodeid=3

[mysqld]
#SQL Node sqlndb-01.
HostName=<sql_node1_host_ip>
nodeid=4

[mysqld]
#SQL Node sqlndb-02.
HostName=<sql_node2_host_ip>
nodeid=5
----------------------------------------------------------
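As a rough sanity check on the redo log settings above, the sketch below computes the redo log capacity per data node. It relies on the documented rule that redo log files are allocated in sets of four, so the capacity is NoOfFragmentLogFiles x 4 x FragmentLogFileSize. The function name is made up for illustration, and the result is only arithmetic on the posted values, not a sizing recommendation.

```python
def redo_log_capacity_mib(no_of_fragment_log_files: int,
                          fragment_log_file_size_mib: int) -> int:
    """Redo log capacity in MiB for one data node.

    NDB allocates redo log files in sets of 4, so the total is
    NoOfFragmentLogFiles * 4 * FragmentLogFileSize.
    (Hypothetical helper name; values come from the config above.)
    """
    return no_of_fragment_log_files * 4 * fragment_log_file_size_mib


# Values from the [ndbd default] section above.
capacity_mib = redo_log_capacity_mib(500, 16)
data_memory_mib = 60 * 1024  # DataMemory=60G

print(capacity_mib, "MiB")                    # 32000 MiB
print(round(capacity_mib / 1024, 2), "GiB")   # 31.25 GiB
print(round(capacity_mib / data_memory_mib, 2), "x DataMemory")  # 0.52 x DataMemory
```

With these settings the redo log holds about 31.25 GiB, roughly half of the configured DataMemory.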




The problem is that after about 1-2 weeks, both data nodes go down with the following error log:
===========================================================
LCP Frag watchdog : No progress on table 37, frag 0 for 29 s. 8232464 rows completed
LCP Frag watchdog : No progress on table 37, frag 0 for 39 s. 8232464 rows completed
LCP Frag watchdog : No progress on table 37, frag 0 for 49 s. 8232464 rows completed
LCP Frag watchdog : No progress on table 37, frag 0 for 59 s. 8232464 rows completed
LCP Frag watchdog : No progress on table 37, frag 0 for 69 s. 8232464 rows completed
LCP Frag watchdog : Checkpoint of table 37 fragment 0 too slow (no progress for > 69 s).
m_curr_disk_write_speed: 2048kb m_words_written_this_period: 0kwords m_overflow_disk_write: 0kb
m_reset_delay_used: 96 time since last RESET_DISK_SPEED: 92 millis
m_monitor_words_written : 0, duration : 389 millis, rate : 0 bytes/s : (0 pct of config)
BackupRecord 0: BackupId: 245 MasterRef: f70002 ClientRef: 0
State: 4
noOfByte: 643920 noOfRecords: 4570
noOfLogBytes: 0 noOfLogRecords: 0
errorCode: 0
file 0: type: 3 flags: H'19 tableId: 37 fragmentId: 0
ready: TRUE eof: FALSE
2018-01-06 16:55:30 [ndbd] INFO -- Please report this as a bug. Provide as much info as possible, expecially all the ndb_*_out.log files, Thanks. Shutting down node due to lack of LCP fragment scan progress
2018-01-06 16:55:30 [ndbd] INFO -- DBLQH (Line: 25405) 0x00000002
2018-01-06 16:55:30 [ndbd] INFO -- Error handler shutting down system
2018-01-06 16:55:30 [ndbd] INFO -- Error handler shutdown completed - exiting
2018-01-06 16:55:38 [ndbd] ALERT -- Node 2: Forced node shutdown completed. Caused by error 7200: 'LCP fragment scan watchdog detected a problem. Please report a bug.(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
========================================================
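To make the watchdog messages above easier to read across several log files, here is a minimal Python sketch that extracts the table id, fragment id, stall time, and row count from each "No progress" line. The regular expression is written against the exact message format shown in this log; the function names and the dictionary keys are assumptions introduced for illustration.

```python
import re

# Matches lines such as:
# LCP Frag watchdog : No progress on table 37, frag 0 for 29 s. 8232464 rows completed
WATCHDOG_RE = re.compile(
    r"LCP Frag watchdog : No progress on table (\d+), frag (\d+) "
    r"for (\d+) s\. (\d+) rows completed"
)


def parse_watchdog_lines(log_text):
    """Return one dict per 'No progress' watchdog line in log_text."""
    stalls = []
    for line in log_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match:
            table_id, fragment_id, seconds, rows = map(int, match.groups())
            stalls.append({
                "table_id": table_id,
                "fragment_id": fragment_id,
                "stalled_seconds": seconds,
                "rows_completed": rows,
            })
    return stalls


def rows_never_advanced(stalls):
    """True if the row count is identical in every reported stall line."""
    return len(stalls) > 1 and len({s["rows_completed"] for s in stalls}) == 1


# Lines copied from the log excerpt above.
sample_log = """\
LCP Frag watchdog : No progress on table 37, frag 0 for 29 s. 8232464 rows completed
LCP Frag watchdog : No progress on table 37, frag 0 for 39 s. 8232464 rows completed
LCP Frag watchdog : No progress on table 37, frag 0 for 49 s. 8232464 rows completed
LCP Frag watchdog : No progress on table 37, frag 0 for 59 s. 8232464 rows completed
LCP Frag watchdog : No progress on table 37, frag 0 for 69 s. 8232464 rows completed
"""

stalls = parse_watchdog_lines(sample_log)
last = stalls[-1]
print(last["table_id"], last["fragment_id"], last["stalled_seconds"])  # 37 0 69
print(rows_never_advanced(stalls))  # True
```

For this excerpt the scan of table 37, fragment 0 stays at 8232464 rows for the whole 69 seconds, which is what triggers the shutdown.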

It seems that the LCP watchdog times out on one of the tables, but I cannot tell which table it is getting stuck on or why. I don't know much about what exactly LCP does or what could cause it to become unresponsive.

I also want to highlight that one of the tables in the database has around 60 million records. I assume the LCP is getting stuck on this table, but I can't be sure.

When I restart the data nodes (using ndbd), each node takes around 30 minutes to start completely, which causes a downtime of around 1 hour.
The logs show different phases. Most of the time goes into rebuilding the indexes on the first node that is started and into copying fragments on the second node. (The nodes are started one after the other, and the second node is started only once the first has started completely.)

First Node Logs Example:
------------------------
2018-01-06 19:53:55 [ndbd] INFO -- LDM(0): index id 82 rebuild done
2018-01-06 20:00:43 [ndbd] INFO -- LDM(0): index id 83 rebuild done

Second Node Logs Example:
-------------------------
2018-01-06 20:05:09 [ndbd] INFO -- LDM(0): Completed copy of fragment T36F1. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2018-01-06 20:20:21 [ndbd] INFO -- LDM(0): Completed copy of fragment T37F0. Changed +31064372/-460 rows, 5733781380 bytes. 100 pct churn to 31064372 rows.
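To put numbers on the restart phases, the sketch below computes the gaps between the timestamps in the two log excerpts. The logs only show completion times, so it assumes that the rebuild of index 83 started when index 82 finished and that the copy of T37F0 started when T36F1 finished. If the copy actually started later, the real copy rate is higher than the figure printed here. The helper name is made up for illustration.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def seconds_between(start: str, end: str) -> int:
    """Whole seconds between two ndbd log timestamps."""
    delta = (datetime.strptime(end, TIMESTAMP_FORMAT)
             - datetime.strptime(start, TIMESTAMP_FORMAT))
    return int(delta.total_seconds())


# First node: "index id 82 rebuild done" -> "index id 83 rebuild done".
index_rebuild_s = seconds_between("2018-01-06 19:53:55", "2018-01-06 20:00:43")

# Second node: "Completed copy of fragment T36F1" -> "... T37F0".
copy_window_s = seconds_between("2018-01-06 20:05:09", "2018-01-06 20:20:21")

# Bytes reported in the T37F0 completion message.
copied_bytes = 5733781380
copy_rate_mb_s = copied_bytes / copy_window_s / 1_000_000

print(index_rebuild_s)           # 408 (6 min 48 s)
print(copy_window_s)             # 912 (15 min 12 s)
print(round(copy_rate_mb_s, 2))  # 6.29 (MB/s, lower bound under the assumption above)
```

Under these assumptions, a single index rebuild and a single fragment copy account for about 22 minutes of the restart time between them.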

Can anybody please help me with the following:
1. What does LCP actually do, and what could cause it to become unresponsive?
2. Are there any issues with the configuration?
3. Are there any tools that would show which table has the id "37"?
4. How can I speed up the restart phases so that the cluster is up within 5-10 minutes?
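On question 3, one possible approach is to search the object listing printed by the ndb_show_tables utility, which includes an id column. The Python sketch below filters such a listing by id. The column layout (id, type, state, logging, database, schema, name) is assumed from typical ndb_show_tables output, and the sample text, database name, and table names are invented purely for illustration; they are not from this cluster.

```python
def find_objects_by_id(show_tables_output: str, object_id: int):
    """Return the rows of an ndb_show_tables listing whose id matches.

    Assumed column order: id, type, state, logging, database, schema, name.
    """
    matches = []
    for line in show_tables_output.splitlines():
        fields = line.split()
        # Skip the header and any line that does not start with a numeric id.
        if len(fields) >= 7 and fields[0].isdigit() and int(fields[0]) == object_id:
            matches.append({
                "id": int(fields[0]),
                "type": fields[1],
                "database": fields[4],
                "name": fields[6],
            })
    return matches


# HYPOTHETICAL listing: the names below are made up for this example.
sample_output = """\
id    type                 state    logging database     schema   name
36    UserTable            Online   Yes     exampledb    def      small_table
37    UserTable            Online   Yes     exampledb    def      big_table
82    OrderedIndex         Online   No      sys          def      PRIMARY
"""

for obj in find_objects_by_id(sample_output, 37):
    print(obj["type"], obj["database"], obj["name"])  # UserTable exampledb big_table
```

On a real cluster the listing would come from running ndb_show_tables against the management node, for example by saving its output to a file and passing the file contents to the function.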

Thanks.



