# Data Compression

## Architecture
The compression subsystem sits between the storage engine and disk I/O. A 5-byte header prepended to every compressed buffer identifies the algorithm and original size, enabling transparent decompression without external metadata.
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 1 | algorithm | 0=None, 1=LZ4, 2=Zstd |
| 1 | 4 | uncompressed_len | Original size (u32 LE) |
| 5 | var | payload | Compressed (or raw) bytes |

```rust
pub struct CompressionConfig {
    pub algorithm: CompressionAlgorithm, // None, Lz4, or Zstd
    pub min_size: usize,                 // Skip compression below this (default 64 bytes)
    pub zstd_level: i32,                 // Zstd level 1-22 (default 3)
}
```

## Smart Compression
The compressor makes two automatic decisions to avoid wasting CPU on data that won't benefit from compression:
| Check | Behavior |
|---|---|
| Data below min_size | Stored raw with None header (avoids overhead on small values) |
| Compressed larger than original | Stored raw with None header (incompressible data like random bytes or already-compressed blobs) |
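The two checks can be sketched as follows. This is an illustrative sketch, not the actual API: `compress_with` and its shape are assumptions, and the compression backend is passed in as a plain function so the sketch stays self-contained.

```rust
/// Illustrative sketch: apply the two checks above, returning the header's
/// algorithm byte (0 = None, 1 = LZ4) alongside the bytes to store.
fn compress_with<F: Fn(&[u8]) -> Vec<u8>>(data: &[u8], min_size: usize, backend: F) -> (u8, Vec<u8>) {
    if data.len() < min_size {
        return (0, data.to_vec()); // too small: store raw
    }
    let compressed = backend(data);
    if compressed.len() >= data.len() {
        (0, data.to_vec()) // incompressible: store raw
    } else {
        (1, compressed) // compression paid off
    }
}
```

Returning the algorithm byte alongside the payload mirrors the header layout: the caller can prepend the 5-byte header without re-deriving which branch was taken.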
```
compress(data, config)
        |
  data.len() < min_size? ── YES ──> store raw (algorithm=None)
        |
        NO
        |
  compress with LZ4 or Zstd
        |
  compressed.len() < data.len()? ── NO ──> store raw (algorithm=None)
        |
       YES
        |
  prepend header + compressed payload
```

## LZ4-Style Compression
The LZ4 implementation uses a hash-chain approach with 4-byte minimum matches and a 4,096-entry hash table (12-bit). It scans the input for repeated byte sequences and encodes them as back-references (offset + length), falling back to literal byte runs for non-matching regions.
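A minimal sketch of the match search under these parameters, simplified to keep only the most recent position per hash bucket rather than a full chain (`find_match` and its signature are illustrative, not the actual implementation):

```rust
const MIN_MATCH: usize = 4;   // 4-byte minimum match
const HASH_BITS: usize = 12;  // 4,096-entry table
const MAX_OFFSET: usize = 65_535;

fn hash4(data: &[u8]) -> usize {
    let v = u32::from_le_bytes([data[0], data[1], data[2], data[3]]);
    (v.wrapping_mul(2654435761) >> (32 - HASH_BITS)) as usize
}

/// Look for a back-reference at `pos`; the table maps each 4-byte hash to the
/// most recent position that produced it.
fn find_match(data: &[u8], pos: usize, table: &mut [usize; 1 << HASH_BITS]) -> Option<(usize, usize)> {
    if pos + MIN_MATCH > data.len() {
        return None;
    }
    let h = hash4(&data[pos..]);
    let candidate = table[h];
    table[h] = pos; // remember the newest position for this hash
    if candidate >= pos || pos - candidate > MAX_OFFSET {
        return None;
    }
    // Hashes can collide, so verify the bytes actually match.
    let mut len = 0;
    while pos + len < data.len() && data[candidate + len] == data[pos + len] {
        len += 1;
    }
    if len >= MIN_MATCH {
        Some((pos - candidate, len)) // (offset, match length)
    } else {
        None
    }
}
```

In a full compressor this runs at every input position; matches become back-references and the gaps between them become literal runs.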
### Token Format
Each sequence is encoded as a token byte followed by optional extra length bytes, literal data, a 2-byte back-reference offset, and optional extra match length bytes:
```
Token byte: [LLLL MMMM]
  LLLL = literal length (0-15, 15 means more bytes follow)
  MMMM = match length - 4 (0-15, 15 means more bytes follow)

Sequence layout:
  [token: 1B] [extra_lit_len: 0+B] [literals: varB]
  [offset: 2B LE] [extra_match_len: 0+B]

Extra length encoding:
  Emit 255 repeatedly until remainder < 255, then emit remainder.
  Example: length 300 = 15 + 255 + 30 → token(15) + [255, 30]
```

### Hash Function
```rust
fn hash4(data: &[u8]) -> usize {
    let v = u32::from_le_bytes([data[0], data[1], data[2], data[3]]);
    (v.wrapping_mul(2654435761) >> 20) as usize // 32 - 20 = 12-bit index
}
```

Uses the Knuth multiplicative hash (golden ratio constant 2654435761) with 12-bit output for the LZ4 hash table or 14-bit for the Zstd variant.
## Zstd-Style Compression
The Zstd implementation uses the same core algorithm as LZ4 but with a larger hash table (16,384 entries, 14-bit) for better match finding. This trades slightly more memory for improved compression ratios on larger inputs. The encoded format is identical, so decompression uses the same decoder.
| Parameter | LZ4 | Zstd |
|---|---|---|
| Hash table size | 4,096 entries (12-bit) | 16,384 entries (14-bit) |
| Min match length | 4 bytes | 4 bytes |
| Max back-reference | 65,535 bytes | 65,535 bytes |
| Encoding format | Token + literals + offset | Same (shared decoder) |
| Compression goal | Speed-optimized | Ratio-optimized |
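Since the two profiles differ only in hash-table size, the distinction reduces to a single parameter. A sketch with illustrative names (`MatcherParams` is not the actual type):

```rust
/// Illustrative sketch: both profiles share one matcher and differ only in
/// how many hash-table entries they allocate.
struct MatcherParams {
    hash_bits: usize, // 12 for the LZ4 profile, 14 for the Zstd profile
}

impl MatcherParams {
    const LZ4: MatcherParams = MatcherParams { hash_bits: 12 };
    const ZSTD: MatcherParams = MatcherParams { hash_bits: 14 };

    fn table_entries(&self) -> usize {
        1 << self.hash_bits
    }
}
```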
## Decompression
Decompression reads the 5-byte header to determine the algorithm and expected output size, then decodes the token stream. The decoder validates bounds at every step: literal overflows, zero offsets, offset-exceeds-output, and length mismatches all return errors instead of panicking.
```rust
pub fn decompress(data: &[u8]) -> io::Result<Vec<u8>> {
    if data.len() < 5 {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "buffer too short for header"));
    }
    let algo = data[0]; // 0=None, 1=LZ4, 2=Zstd
    let len = u32::from_le_bytes(data[1..5].try_into().unwrap()) as usize;
    let payload = &data[5..];
    match algo {
        0 => Ok(payload.to_vec()),         // Raw (no compression)
        1 => lz4_decompress(payload, len), // LZ4 decode
        2 => lz4_decompress(payload, len), // Zstd uses the same decoder
        _ => Err(io::Error::new(io::ErrorKind::InvalidData, "unknown algorithm")),
    }
}
```

## Test Coverage
The compression subsystem is covered by 9 tests verifying round-trip correctness, smart skip behavior, and edge cases.
| Test | Description |
|---|---|
| roundtrip_none | Compress/decompress with algorithm=None preserves data exactly |
| roundtrip_lz4 | LZ4 round-trip on 1,000-byte repetitive data; verifies compressed is smaller |
| roundtrip_zstd | Zstd round-trip on 1,000-byte repetitive data |
| small_data_skips_compression | Data below min_size stored raw (header byte = 0) |
| incompressible_data_stored_raw | Random-looking data stored raw when compression doesn't save space |
| empty_data | Empty input produces empty output |
| decompress_invalid_header | Short buffers rejected with error |
| decompress_unknown_algorithm | Unknown algorithm byte (99) rejected with error |
| lz4_large_repetitive | 4,000-byte repetitive data compresses to less than half size |
## Design Decisions

### Zero Dependencies
Both compressors are implemented in pure safe Rust with no external crates. This eliminates C library linking, simplifies cross-compilation, and ensures the entire binary is auditable Rust.
### Self-Describing Format
The 5-byte header stores algorithm and original size, so any compressed buffer can be decompressed without knowing the algorithm in advance. This enables mixed compression within the same storage file.
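A sketch of the header round-trip implied by the layout in the Architecture table (`write_header` and `read_header` are illustrative names, not the actual API):

```rust
/// Build the 5-byte header: algorithm byte followed by original length (u32 LE).
fn write_header(algorithm: u8, uncompressed_len: u32) -> [u8; 5] {
    let mut header = [0u8; 5];
    header[0] = algorithm;
    header[1..5].copy_from_slice(&uncompressed_len.to_le_bytes());
    header
}

/// Parse it back; any buffer shorter than the header is rejected.
fn read_header(buf: &[u8]) -> Option<(u8, u32)> {
    if buf.len() < 5 {
        return None;
    }
    let len = u32::from_le_bytes(buf[1..5].try_into().ok()?);
    Some((buf[0], len))
}
```

Because every buffer carries its own header, a reader needs no external metadata to pick the right decoder, which is what allows mixed compression within one file.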
### Graceful Fallback
If compression doesn't reduce size (incompressible data), the original bytes are stored with a None header. No CPU is wasted on decompression for data that wasn't actually compressed.
### No Unsafe Code
The entire compression implementation uses safe Rust. All buffer accesses are bounds-checked. The decompressor validates every offset and length before accessing memory.
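As an illustration of that validation, a back-reference copy might be guarded like this (a sketch only: the real decoder returns `io::Error`, simplified here to `&'static str`):

```rust
/// Sketch: validate a back-reference before copying. Overlapping copies are
/// legal (they replicate bytes just written), so the copy goes byte by byte.
fn apply_back_reference(out: &mut Vec<u8>, offset: usize, len: usize) -> Result<(), &'static str> {
    if offset == 0 {
        return Err("zero offset");
    }
    if offset > out.len() {
        return Err("offset exceeds output");
    }
    let start = out.len() - offset;
    for i in 0..len {
        let byte = out[start + i]; // in-bounds: start + i < out.len() at this point
        out.push(byte);
    }
    Ok(())
}
```

With the checks up front, every index in the copy loop is provably in range, so safe indexing never panics here.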