# Data Compression

## Architecture
The compression subsystem sits between the storage engine and disk I/O. A 5-byte header prepended to every compressed buffer identifies the algorithm and original size, enabling transparent decompression without external metadata.
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 1 | algorithm | 0=None, 1=LZ4, 2=Zstd |
| 1 | 4 | uncompressed_len | Original size (u32 LE) |
| 5 | var | payload | Compressed (or raw) bytes |

```rust
pub struct CompressionConfig {
    pub algorithm: CompressionAlgorithm, // None, Lz4, or Zstd
    pub min_size: usize,                 // Skip compression below this (default 64 bytes)
    pub zstd_level: i32,                 // Zstd level 1-22 (default 3)
}
```

## Smart Compression
The compressor makes two automatic decisions to avoid wasting CPU on data that won't benefit from compression:
| Check | Behavior |
|---|---|
| Data below min_size | Stored raw with None header (avoids overhead on small values) |
| Compressed larger than original | Stored raw with None header (incompressible data like random bytes or already-compressed blobs) |
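The two checks can be sketched as follows. This is an illustrative sketch, not the actual API: `compress_with` and its shape are assumptions, and the compression backend is passed in as a plain function so the sketch stays self-contained.

```rust
/// Illustrative sketch: apply the two checks above, returning the header's
/// algorithm byte (0 = None, 1 = LZ4) alongside the bytes to store.
fn compress_with<F: Fn(&[u8]) -> Vec<u8>>(data: &[u8], min_size: usize, backend: F) -> (u8, Vec<u8>) {
    if data.len() < min_size {
        return (0, data.to_vec()); // too small: store raw
    }
    let compressed = backend(data);
    if compressed.len() >= data.len() {
        (0, data.to_vec()) // incompressible: store raw
    } else {
        (1, compressed) // compression paid off
    }
}
```

Returning the algorithm byte alongside the payload mirrors the header layout: the caller can prepend the 5-byte header without re-deriving which branch was taken.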
```
compress(data, config)
        |
  data.len() < min_size? ── YES ──> store raw (algorithm=None)
        |
        NO
        |
  compress with LZ4 or Zstd
        |
  compressed.len() < data.len()? ── NO ──> store raw (algorithm=None)
        |
       YES
        |
  prepend header + compressed payload
```

## LZ4-Style Compression
The LZ4 implementation uses a hash-chain approach with 4-byte minimum matches and a 4,096-entry hash table (12-bit). It scans the input for repeated byte sequences and encodes them as back-references (offset + length), falling back to literal byte runs for non-matching regions.
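A minimal sketch of the match search under these parameters, simplified to keep only the most recent position per hash bucket rather than a full chain (`find_match` and its signature are illustrative, not the actual implementation):

```rust
const MIN_MATCH: usize = 4;   // 4-byte minimum match
const HASH_BITS: usize = 12;  // 4,096-entry table
const MAX_OFFSET: usize = 65_535;

fn hash4(data: &[u8]) -> usize {
    let v = u32::from_le_bytes([data[0], data[1], data[2], data[3]]);
    (v.wrapping_mul(2654435761) >> (32 - HASH_BITS)) as usize
}

/// Look for a back-reference at `pos`; the table maps each 4-byte hash to the
/// most recent position that produced it.
fn find_match(data: &[u8], pos: usize, table: &mut [usize; 1 << HASH_BITS]) -> Option<(usize, usize)> {
    if pos + MIN_MATCH > data.len() {
        return None;
    }
    let h = hash4(&data[pos..]);
    let candidate = table[h];
    table[h] = pos; // remember the newest position for this hash
    if candidate >= pos || pos - candidate > MAX_OFFSET {
        return None;
    }
    // Hashes can collide, so verify the bytes actually match.
    let mut len = 0;
    while pos + len < data.len() && data[candidate + len] == data[pos + len] {
        len += 1;
    }
    if len >= MIN_MATCH {
        Some((pos - candidate, len)) // (offset, match length)
    } else {
        None
    }
}
```

In a full compressor this runs at every input position; matches become back-references and the gaps between them become literal runs.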
### Token Format
Each sequence is encoded as a token byte followed by optional extra length bytes, literal data, a 2-byte back-reference offset, and optional extra match length bytes:
```
Token byte: [LLLL MMMM]
  LLLL = literal length (0-15, 15 means more bytes follow)
  MMMM = match length - 4 (0-15, 15 means more bytes follow)

Sequence layout:
  [token: 1B] [extra_lit_len: 0+B] [literals: varB]
  [offset: 2B LE] [extra_match_len: 0+B]

Extra length encoding:
  Emit 255 repeatedly until remainder < 255, then emit remainder.
  Example: length 300 = 15 + 255 + 30 → token(15) + [255, 30]
```

### Hash Function
```rust
fn hash4(data: &[u8]) -> usize {
    let v = u32::from_le_bytes([data[0], data[1], data[2], data[3]]);
    (v.wrapping_mul(2654435761) >> 20) as usize // 32 - 20 = 12-bit index
}
```

Uses the Knuth multiplicative hash (golden ratio constant 2654435761) with 12-bit output for the LZ4 hash table or 14-bit for the Zstd variant.
## Zstd-Style Compression
The Zstd implementation uses the same core algorithm as LZ4 but with a larger hash table (16,384 entries, 14-bit) for better match finding. This trades slightly more memory for improved compression ratios on larger inputs. The encoded format is identical, so decompression uses the same decoder.
| Parameter | LZ4 | Zstd |
|---|---|---|
| Hash table size | 4,096 entries (12-bit) | 16,384 entries (14-bit) |
| Min match length | 4 bytes | 4 bytes |
| Max back-reference | 65,535 bytes | 65,535 bytes |
| Encoding format | Token + literals + offset | Same (shared decoder) |
| Compression goal | Speed-optimized | Ratio-optimized |
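Since the two profiles differ only in hash-table size, the distinction reduces to a single parameter. A sketch with illustrative names (`MatcherParams` is not the actual type):

```rust
/// Illustrative sketch: both profiles share one matcher and differ only in
/// how many hash-table entries they allocate.
struct MatcherParams {
    hash_bits: usize, // 12 for the LZ4 profile, 14 for the Zstd profile
}

impl MatcherParams {
    const LZ4: MatcherParams = MatcherParams { hash_bits: 12 };
    const ZSTD: MatcherParams = MatcherParams { hash_bits: 14 };

    fn table_entries(&self) -> usize {
        1 << self.hash_bits
    }
}
```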
## Decompression
Decompression reads the 5-byte header to determine the algorithm and expected output size, then decodes the token stream. The decoder validates bounds at every step: literal overflows, zero offsets, offset-exceeds-output, and length mismatches all return errors instead of panicking.
```rust
pub fn decompress(data: &[u8]) -> io::Result<Vec<u8>> {
    if data.len() < 5 {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "buffer too short for header"));
    }
    let algo = data[0]; // 0=None, 1=LZ4, 2=Zstd
    let len = u32::from_le_bytes(data[1..5].try_into().unwrap()) as usize;
    let payload = &data[5..];
    match algo {
        0 => Ok(payload.to_vec()),         // Raw (no compression)
        1 => lz4_decompress(payload, len), // LZ4 decode
        2 => lz4_decompress(payload, len), // Zstd uses the same decoder
        _ => Err(io::Error::new(io::ErrorKind::InvalidData, "unknown algorithm")),
    }
}
```

## Test Coverage
The compression subsystem is covered by 9 tests verifying round-trip correctness, smart skip behavior, and edge cases.
| Test | Description |
|---|---|
| roundtrip_none | Compress/decompress with algorithm=None preserves data exactly |
| roundtrip_lz4 | LZ4 round-trip on 1,000-byte repetitive data; verifies compressed is smaller |
| roundtrip_zstd | Zstd round-trip on 1,000-byte repetitive data |
| small_data_skips_compression | Data below min_size stored raw (header byte = 0) |
| incompressible_data_stored_raw | Random-looking data stored raw when compression doesn't save space |
| empty_data | Empty input produces empty output |
| decompress_invalid_header | Short buffers rejected with error |
| decompress_unknown_algorithm | Unknown algorithm byte (99) rejected with error |
| lz4_large_repetitive | 4,000-byte repetitive data compresses to less than half size |
## Design Decisions

### Zero Dependencies
Both compressors are implemented in pure safe Rust with no external crates. This eliminates C library linking, simplifies cross-compilation, and ensures the entire binary is auditable Rust.
### Self-Describing Format
The 5-byte header stores algorithm and original size, so any compressed buffer can be decompressed without knowing the algorithm in advance. This enables mixed compression within the same storage file.
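A sketch of the header round-trip implied by the layout in the Architecture table (`write_header` and `read_header` are illustrative names, not the actual API):

```rust
/// Build the 5-byte header: algorithm byte followed by original length (u32 LE).
fn write_header(algorithm: u8, uncompressed_len: u32) -> [u8; 5] {
    let mut header = [0u8; 5];
    header[0] = algorithm;
    header[1..5].copy_from_slice(&uncompressed_len.to_le_bytes());
    header
}

/// Parse it back; any buffer shorter than the header is rejected.
fn read_header(buf: &[u8]) -> Option<(u8, u32)> {
    if buf.len() < 5 {
        return None;
    }
    let len = u32::from_le_bytes(buf[1..5].try_into().ok()?);
    Some((buf[0], len))
}
```

Because every buffer carries its own header, a reader needs no external metadata to pick the right decoder, which is what allows mixed compression within one file.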
### Graceful Fallback
If compression doesn't reduce size (incompressible data), the original bytes are stored with a None header. No CPU is wasted on decompression for data that wasn't actually compressed.
### No Unsafe Code
The entire compression implementation uses safe Rust. All buffer accesses are bounds-checked. The decompressor validates every offset and length before accessing memory.
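As an illustration of that validation, a back-reference copy might be guarded like this (a sketch only: the real decoder returns `io::Error`, simplified here to `&'static str`):

```rust
/// Sketch: validate a back-reference before copying. Overlapping copies are
/// legal (they replicate bytes just written), so the copy goes byte by byte.
fn apply_back_reference(out: &mut Vec<u8>, offset: usize, len: usize) -> Result<(), &'static str> {
    if offset == 0 {
        return Err("zero offset");
    }
    if offset > out.len() {
        return Err("offset exceeds output");
    }
    let start = out.len() - offset;
    for i in 0..len {
        let byte = out[start + i]; // in-bounds: start + i < out.len() at this point
        out.push(byte);
    }
    Ok(())
}
```

With the checks up front, every index in the copy loop is provably in range, so safe indexing never panics here.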