Technology

ARIES Recovery

Anvil DB implements the ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) protocol for crash recovery. The algorithm processes the Write-Ahead Log in three phases — analysis, redo, and undo — to restore the database to a consistent state after a crash.

Three-Phase Recovery

After a crash, the recovery algorithm reads the WAL from the last checkpoint and classifies every transaction as committed, aborted, or incomplete. Committed transactions are redone (applied forward), and incomplete transactions are undone (rolled back).

recovery flowtext
WAL Segments
    |
    v
Phase 1: ANALYSIS
    Scan WAL from checkpoint LSN forward
    Build transaction table (tx_id -> state, last_lsn)
    Build dirty page table (page_key -> recovery_lsn)
    Compute redo_start_lsn = min(dirty page recovery LSNs)
    |
    v
Phase 2: REDO
    Replay all Insert/Update/Delete records from redo_start_lsn
    Apply after_images to restore pages to crash-time state
    Includes committed AND uncommitted transactions
    |
    v
Phase 3: UNDO
    Identify "loser" transactions (Active at crash)
    Roll back in REVERSE LSN order using before_images
    Restores database to last consistent state

Data Structures

recovery types (aries.rs)rust
/// Transaction state during recovery
pub enum RecoveryTxState {
    Active,     // In-progress at crash (loser)
    Committed,  // Successfully committed (winner)
    Aborted,    // Explicitly rolled back
}

/// Per-transaction recovery entry
pub struct TxEntry {
    pub tx_id: TxId,
    pub state: RecoveryTxState,
    pub last_lsn: Lsn,       // Last WAL record for this tx
}

/// Dirty page tracking for redo optimization
pub struct DirtyPageEntry {
    pub page_key: Vec<u8>,
    pub recovery_lsn: Lsn,  // LSN of first modification
}

/// Complete recovery result
pub struct AriesRecoveryResult {
    pub analysis: AnalysisResult,
    pub redo_actions: Vec<RedoAction>,
    pub undo_actions: Vec<UndoAction>,
    pub winners: HashSet<TxId>,  // Committed transactions
    pub losers: HashSet<TxId>,   // Incomplete transactions
}

Phase 1: Analysis

The analysis phase scans the WAL from the checkpoint LSN forward, building two tables: the transaction table (which transactions were active, committed, or aborted) and the dirty page table (which pages were modified and when).

Record TypeAction
Insert / Update / DeleteCreate or update tx entry as Active; track dirty page with first-modification LSN
CommitSet tx state to Committed; update last_lsn
AbortSet tx state to Aborted; update last_lsn
CheckpointSkipped (already handled by checkpoint LSN parameter)

The analysis phase also computes the redo_start_lsn: the minimum of all dirty page recovery LSNs. This is where the redo phase begins, ensuring no necessary redo is missed.

Phase 2: Redo

The redo phase collects all data operations (Insert, Update, Delete) from redo_start_lsn forward, regardless of transaction outcome. This restores every page to its exact crash-time state, including changes from uncommitted transactions.

redo actionrust
pub struct RedoAction {
    pub tx_id: TxId,
    pub lsn: Lsn,
    pub op_type: WalOpType,      // Insert, Update, or Delete
    pub after_image: Vec<u8>,   // Post-operation page state
}

Redo actions are applied in forward LSN order, ensuring operations are replayed in the exact sequence they originally occurred.

Phase 3: Undo

The undo phase identifies "loser" transactions — those that were Active (neither committed nor aborted) at crash time. Their operations are rolled back in reverse LSN order using the before-images stored in the WAL, restoring affected pages to their pre-transaction state.

undo actionrust
pub struct UndoAction {
    pub tx_id: TxId,
    pub lsn: Lsn,
    pub before_image: Vec<u8>,  // Pre-operation page state
}

// Undo actions sorted by reverse LSN
// Latest changes rolled back first
undo_actions.sort_by_key(|a| std::cmp::Reverse(a.lsn));
Transaction StateClassificationAction
CommittedWinnerRedo only (changes preserved)
AbortedHandledAlready rolled back before crash
Active (no commit/abort)LoserUndo in reverse LSN order

Checkpoint Integration

Checkpoints bound the recovery window. The analysis phase skips all WAL records before the checkpoint LSN, since those changes are guaranteed to be flushed to disk. This limits the amount of WAL that must be processed during recovery.

checkpoint optimizationtext
WAL Timeline:
  LSN 1   LSN 2   LSN 3   LSN 4   LSN 5   LSN 6
  [tx1    ][commit]  |     [tx2    ][update] [CRASH]
                     |
              CHECKPOINT at LSN 3
                     |
Recovery starts here ─┘
  - tx1: before checkpoint, NOT in tx_table
  - tx2: after checkpoint, classified as Active (loser)
  - Only LSN 4-6 need processing

WAL Record Format

Each WAL record carries both before and after images, enabling both redo (apply after_image) and undo (restore before_image) operations. Records are checksummed with CRC32 for corruption detection.

OperationValuebefore_imageafter_image
Insert1EmptyNew page data
Update2Pre-update pagePost-update page
Delete3Deleted page dataEmpty
Commit4EmptyEmpty
Abort5EmptyEmpty
Checkpoint6EmptyEmpty

Recovery Entry Point

aries_recover (aries.rs)rust
pub fn aries_recover(
    records: &[WalRecord],
    checkpoint_lsn: Lsn,
) -> AriesRecoveryResult {
    let analysis = analysis_phase(records, checkpoint_lsn);
    let redo_actions = redo_phase(records, &analysis);
    let (undo_actions, losers) = undo_phase(records, &analysis);

    let winners: HashSet<TxId> = analysis.tx_table
        .values()
        .filter(|e| e.state == RecoveryTxState::Committed)
        .map(|e| e.tx_id)
        .collect();

    AriesRecoveryResult {
        analysis, redo_actions, undo_actions, winners, losers,
    }
}

Test Coverage

The ARIES recovery implementation is covered by 11 tests verifying each phase independently and end-to-end recovery scenarios.

TestDescription
analysis_identifies_tx_statesClassifies transactions as Committed, Active, or Aborted from WAL records
analysis_tracks_last_lsnRecords correct last_lsn for each transaction (commit=2, update=4, abort=6)
analysis_dirty_pagesPopulates dirty page table and computes redo_start_lsn from first modification
analysis_respects_checkpointSkips records before checkpoint LSN; only post-checkpoint transactions in tx_table
redo_collects_all_data_opsCollects all Insert/Update/Delete records regardless of transaction outcome
undo_rolls_back_active_transactionsIdentifies Active transactions as losers; generates undo actions in reverse LSN order
undo_has_before_imagesUndo actions carry correct before_images (update has [0xBB], insert has empty)
full_recoveryEnd-to-end: 1 winner, 1 loser, 4 redo actions, 2 undo actions
recovery_all_committedAll transactions committed: 2 winners, no losers, no undo actions
recovery_empty_walEmpty WAL: all result fields empty
recovery_single_incomplete_txSingle incomplete transaction: 0 winners, 1 loser, 2 undo actions

Design

Physiological Logging

WAL records store full before and after images of affected data, enabling both redo (forward replay) and undo (backward rollback) without accessing the original pages.

Steal / No-Force

The ARIES protocol allows dirty pages to be flushed before commit (steal) and does not require all pages to be flushed at commit (no-force). This maximizes I/O flexibility while the WAL guarantees durability.

Bounded Recovery

Checkpoints limit the recovery window. Only WAL records after the last checkpoint need processing, keeping recovery time predictable regardless of total WAL size.

Reverse Undo

Undo actions are sorted by descending LSN, ensuring dependent changes are rolled back in the correct order. An update at LSN 4 is undone before the insert at LSN 3 that created the record.