Multiversion Concurrency Control
Version Chains
Each record maintains an ordered chain of versions keyed by timestamp. Readers traverse the chain to find the latest version visible at their snapshot timestamp. Deleted records are represented as tombstone versions (data = None).
pub struct VersionChain<V> {
versions: BTreeMap<Timestamp, Version<V>>,
}
struct Version<V> {
timestamp: Timestamp, // Commit timestamp
data: Option<V>, // None = tombstone (deleted)
}
// Read: find latest version <= read_timestamp
chain.read(read_timestamp) -> Option<&V>
// Conflict detection: any version since timestamp?
chain.has_version_since(timestamp) -> bool
// GC: remove old versions, keeping one before cutoff
chain.gc(oldest_active_timestamp)Record "Alice" version chain:
T=10: {name: "Alice", age: 30} ← written by tx1
T=25: {name: "Alice", age: 31} ← written by tx5
T=40: None (tombstone) ← deleted by tx8
Reader at T=20 sees: {name: "Alice", age: 30} (T=10)
Reader at T=30 sees: {name: "Alice", age: 31} (T=25)
Reader at T=50 sees: (deleted) (T=40)MVCC Store
The MvccStore maps keys to version chains and provides snapshot-consistent scan and garbage collection.
pub struct MvccStore<K, V> {
records: BTreeMap<K, VersionChain<V>>,
}
// Snapshot scan: iterate visible keys at timestamp
store.scan(read_timestamp) -> Vec<(&K, &V)>
// Write-write conflict detection
chain.has_version_since(my_start_timestamp)
// GC + tombstone cleanup
store.gc_with_tombstone_cleanup(oldest_active_timestamp)Isolation Levels
| Level | Snapshot Behavior | Validation |
|---|---|---|
| Read Committed | Timestamp advances per read — sees latest committed data at each statement | None |
| Repeatable Read | Fixed snapshot from transaction start — consistent view throughout | None |
| Serializable | Fixed snapshot + read/write set tracking — validates no conflicts at commit | Checks if any key in read_set was written after start_timestamp |
pub struct Snapshot {
read_timestamp: Timestamp,
isolation: IsolationLevel,
read_set: HashSet<Vec<u8>>, // Serializable only
write_set: HashSet<Vec<u8>>, // Serializable only
}
// Read Committed: advance timestamp per read
snapshot.advance_timestamp(new_timestamp);
// Serializable: validate at commit time
snapshot.validate(store, start_timestamp)
-> Result<(), ValidationError> // Fails if read_set was modifiedRecord-Level Locking
Writes acquire record-level locks via a lock table. Shared locks allow concurrent readers; exclusive locks block all other access. Lock upgrade from shared to exclusive is supported when the transaction is the sole holder.
pub enum LockMode { Shared, Exclusive }
pub enum EntityType { Node, Relationship, Property, Label }
pub struct LockKey {
entity_type: EntityType,
id: u64,
}
pub struct LockTable {
locks: HashMap<LockKey, LockEntry>,
lock_timeout: Duration, // Default 30 seconds
}| Compatibility | Shared Held | Exclusive Held |
|---|---|---|
| Request Shared | Granted | Blocked |
| Request Exclusive | Blocked (unless sole holder → upgrade) | Blocked |
Lock acquisition supports configurable timeouts with exponential backoff. If a lock cannot be acquired within the timeout, the transaction receives a LockError::Timeout.
Deadlock Detection
A wait-for graph tracks which transactions are waiting on which other transactions. Cycle detection via depth-first search identifies deadlocks. The youngest transaction in the cycle (highest TxId) is chosen as the victim and aborted.
pub struct WaitForGraph {
edges: HashMap<TxId, HashSet<TxId>>, // waiter → holders
}
// Proactive: check before waiting
graph.would_deadlock(waiter, holder) -> Option<Vec<TxId>>
// Reactive: find all existing cycles
graph.detect_deadlocks() -> Vec<Vec<TxId>>
// Victim selection: abort youngest (highest TxId)
graph.choose_victim(&cycle) -> TxIdtx1 waits on tx2 (tx2 holds lock on node 5)
tx2 waits on tx3 (tx3 holds lock on node 10)
tx3 waits on tx1 (tx1 holds lock on node 15)
Wait-for graph: tx1 → tx2 → tx3 → tx1 (CYCLE!)
Victim: tx3 (highest TxId = youngest)
Result: tx3 aborted, tx1 and tx2 proceedSavepoints
Savepoints enable partial rollback within a transaction. An undo log records every mutation with before-images. Rolling back to a savepoint reverses operations in reverse order and removes nested savepoints created after it.
pub enum UndoEntry {
Insert { key: Vec<u8> }, // Undo: delete
Update { key: Vec<u8>, before_image: Vec<u8> }, // Undo: restore
Delete { key: Vec<u8>, before_image: Vec<u8> }, // Undo: re-insert
}
pub struct SavepointManager {
undo_log: Vec<UndoEntry>,
savepoints: HashMap<SavepointId, usize>, // ID → log position
savepoint_stack: Vec<SavepointId>, // Creation order
}
// Create savepoint at current undo log position
manager.create_savepoint() -> SavepointId
// Rollback: returns undo entries in reverse order
manager.rollback_to(savepoint_id) -> Vec<UndoEntry>
// Full rollback of entire transaction
manager.rollback_all() -> Vec<UndoEntry>Transaction Lifecycle
pub struct TransactionManager {
next_tx_id: AtomicU64, // Atomic ID counter
next_timestamp: AtomicU64, // Atomic timestamp counter
active: HashMap<TxId, Transaction>,
completed: HashMap<TxId, Transaction>,
}
// Lifecycle
let tx_id = manager.begin(); // Active
let ts = manager.commit(tx_id)?; // Active → Committed
manager.rollback(tx_id)?; // Active → Aborted
// GC: oldest active timestamp for version cleanup
manager.oldest_active_timestamp() -> Option<Timestamp>| State | Transition | Effect |
|---|---|---|
| Active | begin() / begin_read_only() | Allocates TxId and start timestamp |
| Committed | commit(tx_id) | Assigns commit timestamp, moves to completed |
| Aborted | rollback(tx_id) | Moves to completed, locks released |
Test Coverage
The MVCC subsystem is covered by 98 tests across 7 modules.
| Module | Tests | Coverage |
|---|---|---|
| mvcc.rs | 19 | Version chains, snapshot reads, conflict detection, GC, tombstone cleanup |
| lock_table.rs | 20 | Shared/exclusive locks, upgrade, timeout, release, multi-key independence |
| deadlock.rs | 13 | Wait-for graph, cycle detection (DFS), 3-way deadlocks, victim selection |
| savepoint.rs | 12 | Create, rollback, nested savepoints, release, undo entry types |
| isolation.rs | 11 | Read committed, repeatable read, serializable validation, conflict detection |
| manager.rs | 13 | Begin, commit, rollback, sequential IDs, oldest active timestamp, purge |
| transaction.rs | 10 | State transitions, double-commit rejection, read-only flag |
Design
Non-Blocking Reads
Readers never block writers and writers never block readers. Each reader sees a consistent snapshot via version chain traversal, while writers create new versions without modifying existing ones.
Write-Write Conflict Detection
Before committing, a transaction checks if any record it wrote was also written by another transaction after its start. This prevents lost updates without requiring read locks.
Garbage Collection
Old versions are cleaned up when no active transaction can see them. The oldest active timestamp serves as the GC watermark. Tombstones are removed when no transaction can see the deleted record.
Youngest Victim
Deadlock resolution always aborts the youngest transaction (highest TxId), minimizing wasted work since younger transactions have done less processing.