MySQL Forums
Forum List  »  NDB clusters

How to restart or recover data when all data nodes crash and then fail to complete start up due to frag log error in phase 4
Posted by: Rowan Drew
Date: February 10, 2023 01:52PM

We've been running an older version of MySQL Cluster 7.4.7 for many years with minimum configuration, containing 2 data nodes, 2 sql nodes, and 1 management node.
After running an Alter Table statement to add 2 big integer columns to a table with 1,000,000 rows. the query failed and crashed the 1st data node. Upon running the query again it crashed the 2nd data node, bringing down the whole cluster.
When trying to restart either data node, they both fail during start up calling in phase 4 when reaching a frag file. see error below in logs.

So the question is, is it possible to repair or have the data nodes ignore the frag file(s) that's stopping the start up process from completing? Or how do you get around this "File has already been opened" error? We tried moving the problematic S11.FragLog file but then errors get reported that a file is missing.
If there is no way to fix these broken frag files, is there a way to recover the table data from a data node (such as using a separate tool to read/export the node's data stored on disk)?
A more general question is how do you recover data when all data nodes in a cluster cannot start (say due a single corrupt file) and there's no separate backups available?


$./ndbmtd
...
2023-02-11 06:05:13 [ndbd] INFO -- LDM instance 0: Restored T1474F1 LCP 0 rows, 1 millis, 0 rows/s)
2023-02-11 06:05:13 [ndbd] INFO -- LDM instance 0: Restored T1477F0 LCP 161953 rows, 622 millis, 260374 rows/s)
2023-02-11 06:05:14 [ndbd] INFO -- LDM instance 0: Restored T1477F1 LCP 163565 rows, 630 millis, 259626 rows/s)
2023-02-11 06:05:14 [ndbd] INFO -- LDM instance 0: Restored T1482F0 LCP 1 rows, 1 millis, 1000 rows/s)
2023-02-11 06:05:14 [ndbd] INFO -- LDM instance 0: Restored T1482F1 LCP 3 rows, 1 millis, 3000 rows/s)
2023-02-11 06:05:14 [ndbd] INFO -- LDM instance 0: Restored LCP : 1114 fragments, 47359940 rows, 160001 millis, 295997 rows/s
2023-02-11 06:05:14 [ndbd] INFO -- LDM(0): Completed fuzzy restore 1118 fragments (1118 from LCP, 0 by non-trans copy)
2023-02-11 06:05:14 [ndbd] INFO -- LDM(0): Starting DD Undo log application
2023-02-11 06:05:14 [ndbd] INFO -- LDM(0): Completed DD Undo log application
2023-02-11 06:05:14 [ndbd] INFO -- LDM(0): Starting REDO log execution phase 0
2023-02-11 06:05:14 [ndbd] INFO -- LDM(0): Node 3 ready to execute REDO log
2023-02-11 06:05:14 [ndbd] INFO -- LDM(0): All starting nodes ready to execute REDO log. Phases completed = 0
2023-02-11 06:05:14 [ndbd] INFO -- LDM(0):Ready to start execute REDO log phase, collect REDO log execution info phase completed
2023-02-11 06:05:14 [ndbd] INFO -- LDM(0): Log part 0 will execute REDO log records from GCI 85012871 -> 85013314 and last log file is number 11 and last MByte in this file is 13
2023-02-11 06:05:14 [ndbd] INFO -- LDM(0): Log part 1 will execute REDO log records from GCI 85012871 -> 85013314 and last log file is number 2 and last MByte in this file is 12
2023-02-11 06:05:14 [ndbd] INFO -- LDM(0): Log part 2 will execute REDO log records from GCI 85012871 -> 85013314 and last log file is number 6 and last MByte in this file is 6
2023-02-11 06:05:14 [ndbd] INFO -- LDM(0): Log part 3 will execute REDO log records from GCI 85012871 -> 85013314 and last log file is number 11 and last MByte in this file is 14
2023-02-11 06:05:14 [ndbd] INFO -- LDM(0): Start executing REDO log for part 0
2023-02-11 06:05:14 [ndbd] INFO -- LDM(0): Start executing REDO log for part 1
2023-02-11 06:05:14 [ndbd] INFO -- LDM(0): Start executing REDO log for part 2
2023-02-11 06:05:14 [ndbd] INFO -- LDM(0): Start executing REDO log for part 3
2023-02-11 06:05:14 [ndbd] INFO -- LDM(0): Completed REDO log execution on part 1, ops executed = 0, bytes executed = 0,
2023-02-11 06:05:14 [ndbd] INFO -- LDM(0): Log part 1 stats: ops skipped = 735, ops prepared 735, pages read = 8, GCIs executed = 444
2023-02-11 06:05:14 [ndbd] INFO -- LDM(0): Completed REDO log execution on part 0, ops executed = 388, bytes executed = 7212,
2023-02-11 06:05:14 [ndbd] INFO -- LDM(0): Log part 0 stats: ops skipped = 639, ops prepared 1027, pages read = 8, GCIs executed = 1283
2023-02-11 06:05:14 [ndbd] INFO -- LDM(0): Completed REDO log execution on part 2, ops executed = 249, bytes executed = 49328,
2023-02-11 06:05:14 [ndbd] INFO -- LDM(0): Log part 2 stats: ops skipped = 7320, ops prepared 7569, pages read = 48, GCIs executed = 2080
2023-02-11 06:05:15 [ndbd] INFO -- open: >/var/lib/mysql-cluster/data/ndb_3_fs/D11/DBLQH/S11.FragLog< existing: >/var/lib/mysql-cluster/data/ndb_3_fs/D11/DBLQH/S11.FragLog<
2023-02-11 06:05:15 [ndbd] INFO -- OpenFiles::insert()
2023-02-11 06:05:15 [ndbd] INFO -- Error handler shutting down system
2023-02-11 06:05:15 [ndbd] INFO -- Error handler shutdown completed - exiting
2023-02-11 06:05:18 [ndbd] ALERT -- Node 3: Forced node shutdown completed. Occured during startphase 4. Caused by error 2807: 'File has already been opened(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

Options: ReplyQuote


Subject
Views
Written By
Posted
How to restart or recover data when all data nodes crash and then fail to complete start up due to frag log error in phase 4
249
February 10, 2023 01:52PM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.