
Re: A question about epoch boundary in system-wide recovery
Posted by: Frazer Clement
Date: May 31, 2024 10:47AM

Hi Alex,
Short answer :

All nodes will recover to the same consistent point in time, which will correspond to a single epoch boundary.

Long answer :

As you know, the cluster is continually closing the current epoch and opening a new one, roughly once every TimeBetweenEpochs milliseconds (default 100 millis).
Each epoch fully contains a consistent set of transactions; every transaction is recorded in exactly one epoch.

Every epoch is part of exactly one GlobalCheckpoint. Every TimeBetweenGlobalCheckpoints milliseconds, the currently open GlobalCheckpoint is closed.
This involves :
- Increasing the GCI part of the epoch number (top 32 bits)
- Resetting the micro GCI (uGCI) part (bottom 32 bits); the numbering is sketched in code after this list
- Making all of the closing GCI's epochs recoverable when the last epoch in the GCI has closed.
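
To make the epoch numbering concrete, here is a minimal C++ sketch (the names and helpers are mine, not from the NDB source) of a 64-bit epoch number split into GCI and micro-GCI, and of what closing a GCI does to the numbering:

#include <cstdint>
#include <cstdio>

// Illustrative helpers only: GCI in the top 32 bits, uGCI in the bottom 32 bits.
static uint32_t gciOf(uint64_t epoch)  { return (uint32_t)(epoch >> 32); }
static uint32_t ugciOf(uint64_t epoch) { return (uint32_t)(epoch & 0xFFFFFFFFu); }
static uint64_t makeEpoch(uint32_t gci, uint32_t ugci) { return ((uint64_t)gci << 32) | ugci; }

int main() {
  uint64_t epoch = makeEpoch(1000, 17);            // epoch 17 within GCI 1000
  uint64_t next  = makeEpoch(gciOf(epoch) + 1, 0); // closing the GCI: bump GCI, reset uGCI
  printf("epoch %u/%u, next GCI opens at %u/%u\n",
         gciOf(epoch), ugciOf(epoch), gciOf(next), ugciOf(next));
  return 0;
}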

Making the closing GCI's epochs recoverable involves the following (a rough sketch in code follows the list) :
- Waiting for the last epoch to close
- (A) GCP_SAVE : Messaging all data nodes to flush and fsync their parallel redo log parts up to [and beyond] the content describing the changes in the closing GCI
- (B) Wait for acknowledgement
- (C) GCP_COPY : Messaging all data nodes to record the fact that the cluster is now recoverable to a state containing all of the closing GCP's epochs, via all of the running nodes' parallel redo log parts.
- Every node records this fact in a small metadata file.
- (D) Wait for acknowledgements
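
For illustration only, here is a minimal sketch of that (A)-(D) sequence from the coordinator's point of view. The node operations and names are hypothetical stand-ins, not the actual NDB signals or code, and synchronous calls stand in for the wait-for-acknowledgement steps (B) and (D):

#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical per-data-node operations (illustrative only).
struct DataNode {
  int id;
  void flushRedoUpToGci(uint32_t gci) {            // (A) flush + fsync all parallel redo log parts
    printf("node %d: redo durable up to GCI %u\n", id, gci);
  }
  void writeRecoverableGci(uint32_t gci) {         // (C) update the small metadata file
    printf("node %d: metadata now says recoverable to GCI %u\n", id, gci);
  }
};

// Coordinator-side view of making one closing GCI recoverable.
void makeGciRecoverable(std::vector<DataNode>& nodes, uint32_t closingGci) {
  for (auto& n : nodes) n.flushRedoUpToGci(closingGci);     // (A); (B) ack implied by return
  // Only once every redo part is durable do we touch the metadata:
  for (auto& n : nodes) n.writeRecoverableGci(closingGci);  // (C); (D) ack implied by return
  // From here the cluster is recoverable to a state containing all of closingGci's epochs.
}

int main() {
  std::vector<DataNode> nodes{{1}, {2}, {3}, {4}};
  makeGciRecoverable(nodes, 1001);
  return 0;
}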

From point (D), the cluster is recoverable to a new consistent state across all live nodes.
The redo logs are written (and acknowledged) in a separate phase from the metadata update, so if *any* data node's metadata file indicates that the cluster is recoverable, then *all* redo log parts must contain the required info.
If the cluster were to crash before any metadata write completed, then recovery would occur only up to the previously durable GCP.
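
If it helps, here is a tiny sketch of the consequence of that ordering (again just an illustration, not the actual restart code): because the redo fsync for a GCI completes on every node before any node writes its metadata, taking the newest GCI recorded in any surviving node's metadata file is safe.

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustration: per-node "recoverable GCI" values read from the metadata files after a crash.
// A crash during step (C)/(D) can leave them differing in this simplified scheme, but the
// ordering above guarantees every node's redo log covers the newest recorded value.
uint32_t chooseRestoreGci(const std::vector<uint32_t>& metadataGci) {
  return *std::max_element(metadataGci.begin(), metadataGci.end());
}

int main() {
  std::vector<uint32_t> metadataGci{1001, 1000, 1001, 1000}; // crash midway through GCP_COPY
  printf("restore the cluster to GCI %u\n", chooseRestoreGci(metadataGci));
  return 0;
}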

This design avoids problems with reading the 'bleeding edge' of the redo log, and the potentially different logical content across nodes that could result, at the cost of more overhead and delay until transactions are durable. Since durability is asynchronous and delayed in any case, this is a reasonable tradeoff for Ndb.

Hope this makes sense + helps,
Frazer
