ARIES Recovery
Three-Phase Recovery
After a crash, the recovery algorithm reads the WAL from the last checkpoint and classifies every transaction as committed, aborted, or incomplete. Committed transactions are redone (applied forward), and incomplete transactions are undone (rolled back).
WAL Segments
|
v
Phase 1: ANALYSIS
Scan WAL from checkpoint LSN forward
Build transaction table (tx_id -> state, last_lsn)
Build dirty page table (page_key -> recovery_lsn)
Compute redo_start_lsn = min(dirty page recovery LSNs)
|
v
Phase 2: REDO
Replay all Insert/Update/Delete records from redo_start_lsn
Apply after_images to restore pages to crash-time state
Includes committed AND uncommitted transactions
|
v
Phase 3: UNDO
Identify "loser" transactions (Active at crash)
Roll back in REVERSE LSN order using before_images
Restores database to last consistent stateData Structures
/// Transaction state during recovery
pub enum RecoveryTxState {
Active, // In-progress at crash (loser)
Committed, // Successfully committed (winner)
Aborted, // Explicitly rolled back
}
/// Per-transaction recovery entry
pub struct TxEntry {
pub tx_id: TxId,
pub state: RecoveryTxState,
pub last_lsn: Lsn, // Last WAL record for this tx
}
/// Dirty page tracking for redo optimization
pub struct DirtyPageEntry {
pub page_key: Vec<u8>,
pub recovery_lsn: Lsn, // LSN of first modification
}
/// Complete recovery result
pub struct AriesRecoveryResult {
pub analysis: AnalysisResult,
pub redo_actions: Vec<RedoAction>,
pub undo_actions: Vec<UndoAction>,
pub winners: HashSet<TxId>, // Committed transactions
pub losers: HashSet<TxId>, // Incomplete transactions
}Phase 1: Analysis
The analysis phase scans the WAL from the checkpoint LSN forward, building two tables: the transaction table (which transactions were active, committed, or aborted) and the dirty page table (which pages were modified and when).
| Record Type | Action |
|---|---|
| Insert / Update / Delete | Create or update tx entry as Active; track dirty page with first-modification LSN |
| Commit | Set tx state to Committed; update last_lsn |
| Abort | Set tx state to Aborted; update last_lsn |
| Checkpoint | Skipped (already handled by checkpoint LSN parameter) |
The analysis phase also computes the redo_start_lsn: the minimum of all dirty page recovery LSNs. This is where the redo phase begins, ensuring no necessary redo is missed.
Phase 2: Redo
The redo phase collects all data operations (Insert, Update, Delete) from redo_start_lsn forward, regardless of transaction outcome. This restores every page to its exact crash-time state, including changes from uncommitted transactions.
pub struct RedoAction {
pub tx_id: TxId,
pub lsn: Lsn,
pub op_type: WalOpType, // Insert, Update, or Delete
pub after_image: Vec<u8>, // Post-operation page state
}Redo actions are applied in forward LSN order, ensuring operations are replayed in the exact sequence they originally occurred.
Phase 3: Undo
The undo phase identifies "loser" transactions — those that were Active (neither committed nor aborted) at crash time. Their operations are rolled back in reverse LSN order using the before-images stored in the WAL, restoring affected pages to their pre-transaction state.
pub struct UndoAction {
pub tx_id: TxId,
pub lsn: Lsn,
pub before_image: Vec<u8>, // Pre-operation page state
}
// Undo actions sorted by reverse LSN
// Latest changes rolled back first
undo_actions.sort_by_key(|a| std::cmp::Reverse(a.lsn));| Transaction State | Classification | Action |
|---|---|---|
| Committed | Winner | Redo only (changes preserved) |
| Aborted | Handled | Already rolled back before crash |
| Active (no commit/abort) | Loser | Undo in reverse LSN order |
Checkpoint Integration
Checkpoints bound the recovery window. The analysis phase skips all WAL records before the checkpoint LSN, since those changes are guaranteed to be flushed to disk. This limits the amount of WAL that must be processed during recovery.
WAL Timeline:
LSN 1 LSN 2 LSN 3 LSN 4 LSN 5 LSN 6
[tx1 ][commit] | [tx2 ][update] [CRASH]
|
CHECKPOINT at LSN 3
|
Recovery starts here ─┘
- tx1: before checkpoint, NOT in tx_table
- tx2: after checkpoint, classified as Active (loser)
- Only LSN 4-6 need processingWAL Record Format
Each WAL record carries both before and after images, enabling both redo (apply after_image) and undo (restore before_image) operations. Records are checksummed with CRC32 for corruption detection.
| Operation | Value | before_image | after_image |
|---|---|---|---|
| Insert | 1 | Empty | New page data |
| Update | 2 | Pre-update page | Post-update page |
| Delete | 3 | Deleted page data | Empty |
| Commit | 4 | Empty | Empty |
| Abort | 5 | Empty | Empty |
| Checkpoint | 6 | Empty | Empty |
Recovery Entry Point
pub fn aries_recover(
records: &[WalRecord],
checkpoint_lsn: Lsn,
) -> AriesRecoveryResult {
let analysis = analysis_phase(records, checkpoint_lsn);
let redo_actions = redo_phase(records, &analysis);
let (undo_actions, losers) = undo_phase(records, &analysis);
let winners: HashSet<TxId> = analysis.tx_table
.values()
.filter(|e| e.state == RecoveryTxState::Committed)
.map(|e| e.tx_id)
.collect();
AriesRecoveryResult {
analysis, redo_actions, undo_actions, winners, losers,
}
}Test Coverage
The ARIES recovery implementation is covered by 11 tests verifying each phase independently and end-to-end recovery scenarios.
| Test | Description |
|---|---|
| analysis_identifies_tx_states | Classifies transactions as Committed, Active, or Aborted from WAL records |
| analysis_tracks_last_lsn | Records correct last_lsn for each transaction (commit=2, update=4, abort=6) |
| analysis_dirty_pages | Populates dirty page table and computes redo_start_lsn from first modification |
| analysis_respects_checkpoint | Skips records before checkpoint LSN; only post-checkpoint transactions in tx_table |
| redo_collects_all_data_ops | Collects all Insert/Update/Delete records regardless of transaction outcome |
| undo_rolls_back_active_transactions | Identifies Active transactions as losers; generates undo actions in reverse LSN order |
| undo_has_before_images | Undo actions carry correct before_images (update has [0xBB], insert has empty) |
| full_recovery | End-to-end: 1 winner, 1 loser, 4 redo actions, 2 undo actions |
| recovery_all_committed | All transactions committed: 2 winners, no losers, no undo actions |
| recovery_empty_wal | Empty WAL: all result fields empty |
| recovery_single_incomplete_tx | Single incomplete transaction: 0 winners, 1 loser, 2 undo actions |
Design
Physiological Logging
WAL records store full before and after images of affected data, enabling both redo (forward replay) and undo (backward rollback) without accessing the original pages.
Steal / No-Force
The ARIES protocol allows dirty pages to be flushed before commit (steal) and does not require all pages to be flushed at commit (no-force). This maximizes I/O flexibility while the WAL guarantees durability.
Bounded Recovery
Checkpoints limit the recovery window. Only WAL records after the last checkpoint need processing, keeping recovery time predictable regardless of total WAL size.
Reverse Undo
Undo actions are sorted by descending LSN, ensuring dependent changes are rolled back in the correct order. An update at LSN 4 is undone before the insert at LSN 3 that created the record.