Re: A question about epoch boundary in system-wide recovery
Hi Alex,
Short answer :
All nodes will recover to the same consistent point in time, which will correspond to a single epoch boundary.
Long answer :
As you know, the cluster is continually closing the current epoch and opening a new one, roughly once every TimeBetweenEpochs milliseconds (default 100 ms).
Each epoch fully contains a consistent set of transactions; every transaction is recorded in exactly one epoch.
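For reference, the epoch and GlobalCheckpoint intervals are cluster configuration parameters in config.ini. A minimal sketch, showing what I believe are the defaults (check the documentation for your version):

  [ndbd default]
  # Interval between epoch (micro GCP) boundaries, in milliseconds
  TimeBetweenEpochs = 100
  # Interval between GlobalCheckpoints (described below), in milliseconds
  TimeBetweenGlobalCheckpoints = 2000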
Every epoch is part of exactly one GlobalCheckpoint. Every TimeBetweenGlobalCheckpoints milliseconds, the currently open GlobalCheckpoint is closed.
This involves :
- Increasing the GCI part of the epoch number (top 32 bits)
- Resetting the micro GCI (uGCI) part (bottom 32 bits)
- Making all of the closing GCI's epochs recoverable when the last epoch in the GCI has closed.
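To make the numbering concrete, here is a minimal C++ sketch (illustrative only, not NDB source) of how a 64-bit epoch number splits into its GCI and uGCI halves, and what closing a GCI does to it:

  #include <cstdint>
  #include <cstdio>

  // A 64-bit epoch number: GCI in the top 32 bits, micro GCI (uGCI) in the bottom 32.
  static uint32_t gci_of(uint64_t epoch)  { return (uint32_t)(epoch >> 32); }
  static uint32_t ugci_of(uint64_t epoch) { return (uint32_t)(epoch & 0xffffffffULL); }

  // Closing a GCI: increase the GCI half, reset the uGCI half to zero.
  static uint64_t first_epoch_of_next_gci(uint64_t epoch) {
    return (uint64_t)(gci_of(epoch) + 1) << 32;
  }

  int main() {
    uint64_t epoch = ((uint64_t)1234 << 32) | 56;  // GCI 1234, uGCI 56
    printf("gci=%u ugci=%u first epoch of next gci=%llu\n",
           gci_of(epoch), ugci_of(epoch),
           (unsigned long long)first_epoch_of_next_gci(epoch));
    return 0;
  }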
Making the closing GCI's epochs recoverable involves :
- Waiting for the last epoch to close
- (A) GCP_SAVE : Messaging all data nodes to flush and fsync their parallel redo log parts up to [and beyond] the content describing the changes in the closing GCI
- (B) Wait for acknowledgements
- (C) GCP_COPY : Messaging all data nodes to record the fact that the cluster is now recoverable to a state containing all of the closing GCP's epochs, via all of the running nodes' parallel redo log parts.
- Every node records this fact in a small metadata file.
- (D) Wait for acknowledgements
From point (D), the cluster is recoverable to a new consistent state across all live nodes.
The redo log writes are done (and acknowledged) in a separate phase from the metadata update, so if *any* data node's metadata file indicates that the cluster is recoverable, then *all* redo log parts must already contain the required information.
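To illustrate why that ordering gives the guarantee, here is a rough C++ sketch of the coordinator-side flow. The names are made up for illustration, and plain synchronous calls stand in for the asynchronous signals and acknowledgements:

  #include <cstdint>
  #include <cstdio>
  #include <vector>

  // Illustrative stand-in for a data node; not real NDB block or signal names.
  struct DataNode {
    int node_id;

    // (A) GCP_SAVE: flush + fsync all redo log parts covering the closing GCI.
    void save_redo_up_to(uint32_t gci) {
      printf("node %d: redo log parts durable up to GCI %u\n", node_id, gci);
    }

    // (C) GCP_COPY: record "recoverable up to gci" in the small metadata file.
    void write_recoverable_gci(uint32_t gci) {
      printf("node %d: metadata says recoverable to GCI %u\n", node_id, gci);
    }
  };

  // Coordinator-side view of the two-phase flow described above.
  void make_gci_recoverable(std::vector<DataNode>& nodes, uint32_t closing_gci) {
    // (A)+(B): every node's redo must be durable before ANY metadata write starts,
    // so a crash here loses at most the not-yet-acknowledged GCI.
    for (DataNode& n : nodes) n.save_redo_up_to(closing_gci);

    // (C)+(D): once any one metadata file says "recoverable to closing_gci",
    // every redo log part on every node is already guaranteed to contain the data.
    for (DataNode& n : nodes) n.write_recoverable_gci(closing_gci);
  }

  int main() {
    std::vector<DataNode> nodes{{1}, {2}, {3}, {4}};
    make_gci_recoverable(nodes, 1234);
    return 0;
  }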
If the cluster were to crash before any metadata write completed, then recovery would occur only up to the previously durable GCP.
This design avoids the problems of reading the 'bleeding edge' of the redo log, and the potentially different logical content across nodes that would result, at the cost of extra overhead and a delay before transactions become durable. Since durability in Ndb is asynchronous and delayed in any case, this is a reasonable tradeoff.
Hope this makes sense + helps,
Frazer