Codec Support

Jetliner supports all standard Avro compression codecs. This guide covers codec characteristics and selection guidance.

Supported Codecs

Codec	Compression	Speed	Use Case
`null`	None	Fastest	When storage isn't a concern
`snappy`	Low-Medium	Very Fast	General purpose, good balance
`deflate`	Medium-High	Medium	When compression matters more than speed
`zstd`	High	Fast	Best overall balance
`bzip2`	Very High	Slow	Maximum compression, archival
`xz`	Highest	Slowest	Maximum compression, archival

All codecs are included in the default Jetliner build. No additional configuration is required.

Codec Characteristics

null (No Compression)

Compression ratio: 1:1 (no compression)
Read speed: Fastest
Best for: Small files, already-compressed data, maximum read speed

import jetliner

# Fastest reads, largest files
df = jetliner.scan_avro("uncompressed.avro").collect()

snappy

Compression ratio: ~2-4x
Read speed: Very fast
Best for: General purpose, real-time processing

Snappy prioritizes speed over compression ratio. It's a good default choice when you need fast reads and writes.

deflate (gzip)

Compression ratio: ~4-8x
Read speed: Medium
Best for: When storage cost matters, network transfer

Deflate is the algorithm behind gzip and zlib. It compresses better than Snappy at the cost of slower decompression.

zstd (Zstandard)

Compression ratio: ~4-10x
Read speed: Fast
Best for: Best overall balance of compression and speed

Zstd typically compresses much better than Snappy while still decompressing quickly, though usually not quite as fast as Snappy. It's often the best choice for new projects.

bzip2

Compression ratio: ~6-12x
Read speed: Slow
Best for: Archival, when storage is expensive

Bzip2 provides excellent compression but slow decompression. Use for cold storage or archival.

xz (LZMA)

Compression ratio: ~8-15x
Read speed: Slowest
Best for: Maximum compression, long-term archival

XZ provides the highest compression ratios but the slowest decompression. Use only when storage is at a premium and read speed isn't critical.
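The ratio differences described above can be explored without Avro at all. The sketch below uses Python's standard-library `zlib`, `bz2`, and `lzma` modules as stand-ins for the deflate, bzip2, and xz codecs; snappy and zstd need third-party packages and are left out. The sample payload and the zlib level are assumptions made for illustration, so the ratios it prints will not match Jetliner's figures or your own data.

```python
import bz2
import lzma
import zlib

# Hypothetical sample payload: repetitive JSON-like records compress well.
payload = b"".join(
    b'{"id": %d, "status": "active", "region": "eu-west-1"}\n' % i
    for i in range(20_000)
)

# Standard-library stand-ins for the null, deflate, bzip2, and xz codecs.
compressors = {
    "null": lambda data: data,
    "deflate": lambda data: zlib.compress(data, 6),
    "bzip2": lambda data: bz2.compress(data),
    "xz": lambda data: lzma.compress(data),
}


def compression_ratios(data: bytes) -> dict[str, float]:
    """Return original_size / compressed_size for each stand-in codec."""
    return {name: len(data) / len(fn(data)) for name, fn in compressors.items()}


ratios = compression_ratios(payload)
for name, ratio in ratios.items():
    print(f"{name:8s} {ratio:6.1f}x")
```

Running the same function over a representative sample of your own records gives a more useful picture than any generic ratio range.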

Codec Selection Guide

By Use Case

Use Case	Recommended Codec
Real-time analytics	`snappy` or `zstd`
Data lake storage	`zstd`
Network transfer	`zstd` or `deflate`
Cold storage/archival	`bzip2` or `xz`
Maximum read speed	`null`
Lambda/serverless	`snappy` or `zstd`

By Priority

Speed priority: null > snappy > zstd > deflate > bzip2 > xz

Compression priority: xz > bzip2 > zstd ≈ deflate > snappy > null

Balanced: zstd (best overall trade-off)
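One way to read the two orderings together is to ask for the fastest codec that still compresses at least as well as some baseline. The sketch below encodes the speed and compression orderings from this section as plain Python data. The function name `fastest_codec_at_least` and the numeric ranks are hypothetical helpers for illustration, not part of Jetliner.

```python
# Speed ordering copied from this section; index 0 is the fastest codec.
SPEED_ORDER = ["null", "snappy", "zstd", "deflate", "bzip2", "xz"]

# Higher rank means smaller files; zstd and deflate are treated as equal.
COMPRESSION_RANK = {
    "null": 0,
    "snappy": 1,
    "deflate": 2,
    "zstd": 2,
    "bzip2": 3,
    "xz": 4,
}


def fastest_codec_at_least(baseline: str) -> str:
    """Return the fastest codec that compresses at least as well as baseline."""
    required = COMPRESSION_RANK[baseline]  # raises KeyError for unknown codecs
    return next(c for c in SPEED_ORDER if COMPRESSION_RANK[c] >= required)


print(fastest_codec_at_least("deflate"))
```

Asking for deflate-level compression returns zstd, which is the reasoning behind the "balanced" recommendation.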

Performance Considerations

Decompression is the Hot Path

Codec decompression is often the bottleneck when reading Avro files. Choose codecs based on your read patterns:

Frequent reads: Prefer snappy or zstd
Infrequent reads: deflate, bzip2, or xz are acceptable
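To see how much decompression cost differs between codecs, it helps to time it directly. The following is a minimal sketch using only standard-library modules as stand-ins for deflate, bzip2, and xz. The payload is invented and `time_decompress` is a hypothetical helper, so treat the printed timings as relative indications on your machine rather than Jetliner benchmarks.

```python
import bz2
import lzma
import time
import zlib


def time_decompress(decompress, blob: bytes, repeats: int = 5) -> float:
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        decompress(blob)
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical payload: a CSV-like header followed by repeated rows.
payload = b"timestamp,sensor,value\n" + b"2024-01-01T00:00:00,probe-7,21.5\n" * 50_000

codecs = {
    "deflate": (zlib.compress, zlib.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),
}

timings = {}
for name, (compress, decompress) in codecs.items():
    blob = compress(payload)
    assert decompress(blob) == payload  # the round trip must be lossless
    timings[name] = time_decompress(decompress, blob)
    print(f"{name:8s} {timings[name] * 1000:8.2f} ms")
```

Taking the best of several runs reduces noise from other processes; for files that are read often, even small per-read differences add up.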

Memory Usage

Decompression requires temporary buffers. In memory-constrained environments, lower the buffer limits:

import jetliner

# Smaller buffers for memory-constrained environments
df = jetliner.scan_avro(
    "data.avro",
    buffer_blocks=2,
    buffer_bytes=16 * 1024 * 1024,  # 16MB
).collect()

S3 Considerations

For S3 files, codec choice affects both storage cost and transfer time:

Storage cost: Higher compression = lower cost
Transfer time: Higher compression = less data to transfer
CPU time: Higher compression = more decompression work

For S3, zstd often provides the best balance.
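The three factors above pull in different directions, and a back-of-envelope model makes the trade-off concrete. The sketch below adds transfer time for the compressed bytes to decompression time for the uncompressed output. The function name, the profile labels, and every number in `profiles` are illustrative placeholders, not measurements of any codec or of S3.

```python
def estimated_read_seconds(
    raw_bytes: float,
    compression_ratio: float,
    network_bytes_per_s: float,
    decompress_bytes_per_s: float,
) -> float:
    """Estimate transfer time plus decompression time for one file.

    decompress_bytes_per_s is measured in uncompressed output bytes.
    Transfer and decompression are assumed not to overlap, which
    overstates the total for a streaming reader.
    """
    transfer = (raw_bytes / compression_ratio) / network_bytes_per_s
    decompress = raw_bytes / decompress_bytes_per_s
    return transfer + decompress


# Placeholder profiles: (compression ratio, decompression throughput in bytes/s).
profiles = {
    "no compression": (1.0, float("inf")),
    "fast codec": (4.0, 1e9),
    "high-ratio codec": (10.0, 5e7),
}

raw_size = 1e9  # 1 GB of uncompressed data
network = 1e8   # 100 MB/s link, an assumed figure

for label, (ratio, throughput) in profiles.items():
    seconds = estimated_read_seconds(raw_size, ratio, network, throughput)
    print(f"{label:18s} {seconds:5.1f} s")
```

With these placeholder inputs the estimates come out to 10.0 s, 3.5 s, and 21.0 s: the middle profile wins because it cuts transfer time without making decompression the dominant cost. Substituting measured ratios and throughputs for your own data is what makes the model meaningful.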

Error Handling

Codec errors are raised as CodecError:

import jetliner

try:
    df = jetliner.scan_avro("data.avro").collect()
except jetliner.CodecError as e:
    print(f"Decompression failed: {e}")

For structured error handling with metadata:

import jetliner

try:
    df = jetliner.scan_avro("data.avro").collect()
except jetliner.PyCodecError as e:
    print(f"Decompression failed: {e.message}")

Common causes:

- Corrupted compressed data
- Truncated file
- Invalid compression block
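To get a feel for how these failures appear at the byte level, the sketch below damages a compressed block in two ways and tries to decompress it. It uses the standard-library `zlib` module as a stand-in; it does not reproduce Jetliner's own error types or messages, and `try_decompress` is a hypothetical helper.

```python
import zlib

# A valid zlib-format block standing in for one compressed Avro block.
block = zlib.compress(b"avro block payload " * 1000)


def try_decompress(data: bytes) -> str:
    """Return 'ok' or a short description of the decompression failure."""
    try:
        zlib.decompress(data)
        return "ok"
    except zlib.error as exc:
        return f"failed: {exc}"


# Truncated file: only the first half of the block was written or downloaded.
truncated = block[: len(block) // 2]

# Corrupted data: flip every bit in the last byte of the trailing checksum.
corrupted = bytearray(block)
corrupted[-1] ^= 0xFF

print("intact:   ", try_decompress(block))
print("truncated:", try_decompress(truncated))
print("corrupted:", try_decompress(bytes(corrupted)))
```

Both damaged inputs fail while the intact block succeeds, which is why a codec error usually points at the file rather than at the reader.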

Building with Specific Codecs

When building from source, you can customize codec support:

# All codecs (default)
maturin develop

# Specific codecs only (faster compile)
maturin develop --cargo-extra-args="--no-default-features --features snappy,zstd"

# Minimal build (null codec only)
maturin develop --cargo-extra-args="--no-default-features"

Available feature flags:

- snappy
- deflate
- zstd
- bzip2
- xz

Checking File Codec

To check which codec a file uses:

import jetliner

with jetliner.open("data.avro") as reader:
    # schema_dict exposes the parsed Avro schema
    schema = reader.schema_dict
    # Note: the codec name lives in the file header metadata
    # (the avro.codec key), not in the schema

Or use fastavro for inspection:

import fastavro

with open("data.avro", "rb") as f:
    reader = fastavro.reader(f)
    print(f"Codec: {reader.codec}")
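If neither library is at hand, the codec can be read straight from the file header, because the Avro object container format stores it under the `avro.codec` metadata key. The following is a minimal standard-library sketch of that header parse. The function names (`read_long`, `read_bytes`, `read_avro_codec`, `avro_bytes`) are hypothetical helpers, and the demo header is hand-built rather than taken from a real file.

```python
import io
from typing import BinaryIO

AVRO_MAGIC = b"Obj\x01"


def read_long(stream: BinaryIO) -> int:
    """Decode one Avro long (zigzag-encoded variable-length integer)."""
    shift = 0
    accum = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise EOFError("unexpected end of file inside header")
        byte = raw[0]
        accum |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)


def read_bytes(stream: BinaryIO) -> bytes:
    """Read a length-prefixed Avro bytes or string value."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of file inside header")
    return data


def read_avro_codec(stream: BinaryIO) -> str:
    """Return the codec named in the header; a missing key means 'null'."""
    if stream.read(4) != AVRO_MAGIC:
        raise ValueError("not an Avro object container file")
    metadata = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count terminates the metadata map
            break
        if count < 0:  # negative count: a byte size follows, unused here
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            metadata[key] = read_bytes(stream)
    return metadata.get("avro.codec", b"null").decode("utf-8")


def avro_bytes(data: bytes) -> bytes:
    """Length-prefix data; valid only for lengths below 64 (one-byte prefix)."""
    return bytes([len(data) * 2]) + data


# Hand-built header: magic, one metadata block with two entries,
# the terminating zero count, and a 16-byte sync marker.
demo_header = (
    AVRO_MAGIC
    + bytes([2 * 2])
    + avro_bytes(b"avro.schema") + avro_bytes(b'"string"')
    + avro_bytes(b"avro.codec") + avro_bytes(b"deflate")
    + b"\x00"
    + bytes(16)
)
print(read_avro_codec(io.BytesIO(demo_header)))
```

For a real file, pass `open("data.avro", "rb")` instead of the in-memory stream. Note that the Avro specification names the Zstandard codec `zstandard` in this metadata field, so that is the string to expect for zstd-compressed files.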

Next Steps

Streaming Large Files - Memory-efficient processing
Query Optimization - Reduce data read
Error Handling - Handle codec errors