API Overview
Jetliner provides a focused API for reading Avro files into Polars DataFrames.
Core Functions
scan_avro()
The primary API for reading Avro files. Returns a Polars LazyFrame with query optimization support.
Key features:
- Projection pushdown (only read needed columns)
- Predicate pushdown (filter during read)
- Early stopping (stop after row limit)
- S3 support via `storage_options`
- Multi-file support with glob patterns
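The features above can be sketched as a single lazy query. This is a minimal illustration, not a definitive usage pattern: the helper name `first_names` is hypothetical, and the column names `user_id` and `name` are borrowed from the schema comment shown later on this page.

```python
def first_names(source, min_user_id=1000, limit=100):
    """Hypothetical helper: run one lazy query against an Avro source."""
    # Imported inside the function so the sketch loads without the packages installed.
    import polars as pl
    import jetliner

    return (
        jetliner.scan_avro(source)                # LazyFrame; nothing is read yet
        .filter(pl.col("user_id") > min_user_id)  # candidate for predicate pushdown
        .select(["user_id", "name"])              # candidate for projection pushdown
        .head(limit)                              # candidate for early stopping
        .collect()                                # executes the optimized plan
    )
```

Calling `first_names("data.avro")` would return a DataFrame with at most `limit` rows, assuming the file contains both columns.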
read_avro()
Eager API for loading Avro files directly into a DataFrame with column selection.
Key features:
- Column selection via the `columns` parameter
- Row limiting via the `n_rows` parameter
- Multi-file support with glob patterns
- Equivalent to `scan_avro(...).collect()` with eager projection
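A minimal sketch of an eager read, assuming `columns` and `n_rows` are passed as keyword arguments as listed above. The helper name `load_preview` and the column names in the usage note are illustrative only.

```python
def load_preview(source, columns, n_rows=1000):
    """Hypothetical helper: eagerly load a few columns and rows."""
    # Imported inside the function so the sketch loads without jetliner installed.
    import jetliner

    # `source` may be a single path or a glob pattern such as "data/*.avro".
    return jetliner.read_avro(source, columns=columns, n_rows=n_rows)
```

For example, `load_preview("data/*.avro", ["user_id", "name"], n_rows=50)` would load at most 50 rows of two columns across the matching files.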
open()
Iterator API for streaming control. Returns a context manager yielding DataFrame batches.
Key features:
- Batch-by-batch processing
- Schema access before reading
- Error tracking in skip mode
- Memory-efficient streaming
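Batch-by-batch processing might look like the sketch below. It assumes that `open()` accepts the `batch_size` and `ignore_errors` keyword arguments from the Common Parameters table and that each yielded batch is a Polars DataFrame (so `.height` gives its row count). The helper name `count_rows` is hypothetical.

```python
def count_rows(source, batch_size=50_000):
    """Hypothetical helper: count rows without holding the whole file in memory."""
    # Imported inside the function so the sketch loads without jetliner installed.
    import jetliner

    total_rows = 0
    # Assumption: batch_size and ignore_errors are valid keyword arguments here.
    with jetliner.open(source, batch_size=batch_size, ignore_errors=True) as reader:
        for batch in reader:            # one DataFrame per batch
            total_rows += batch.height  # only the current batch is kept alive
        skipped = reader.error_count    # errors recorded in skip mode
    return total_rows, skipped
```

Returning the error count alongside the total lets the caller decide whether skipped records are acceptable.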
read_avro_schema()
Extract Polars schema from an Avro file without reading data.
import jetliner
schema = jetliner.read_avro_schema("data.avro")
# {'user_id': Int64, 'name': String, ...}
Classes
AvroReader
Context manager returned by open(). Provides iteration and schema access.
import jetliner
with jetliner.open("data.avro") as reader:
    print(reader.schema)       # JSON string
    print(reader.schema_dict)  # Python dict
    print(reader.error_count)  # Errors in skip mode
AvroReaderCore
Low-level reader used internally. Most users should use open() instead.
Exception Types
All exceptions inherit from JetlinerError:
| Exception | When Raised |
|---|---|
| `JetlinerError` | Base class for all errors |
| `ParseError` | Invalid Avro file format |
| `SchemaError` | Invalid or unsupported schema |
| `CodecError` | Decompression failure |
| `DecodeError` | Record decoding failure |
| `SourceError` | File/S3 access errors |
Structured Exception Types
For programmatic error handling, use the structured exception types with metadata attributes:
| Exception | Attributes |
|---|---|
| `PyDecodeError` | `block_index`, `record_index`, `offset`, `message` |
| `PyParseError` | `offset`, `message` |
| `PySourceError` | `path`, `message` |
| `PySchemaError` | `message` |
| `PyCodecError` | `message` |
import jetliner
try:
    df = jetliner.scan_avro("data.avro").collect()
except jetliner.PyDecodeError as e:
    print(f"Error at block {e.block_index}, record {e.record_index}")
    print(f"Offset: {e.offset}")
except jetliner.JetlinerError as e:
    print(f"Error: {e}")
Quick Reference
Common Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `source` | `str` | required | File path, S3 URI, or glob pattern |
| `columns` | `list` | `None` | Columns to read (`read_avro` only) |
| `n_rows` | `int` | `None` | Maximum rows to read |
| `row_index_name` | `str` | `None` | Name for row index column |
| `row_index_offset` | `int` | `0` | Starting value for row index |
| `glob` | `bool` | `True` | Whether to expand glob patterns |
| `include_file_paths` | `str` | `None` | Column name for source file paths |
| `ignore_errors` | `bool` | `False` | Skip bad records instead of failing |
| `batch_size` | `int` | 100,000 | Records per batch |
| `buffer_blocks` | `int` | 4 | Blocks to prefetch |
| `buffer_bytes` | `int` | 64MB | Max buffer size |
| `read_chunk_size` | `int` | `None` | I/O read chunk size (auto-detect if `None`) |
| `storage_options` | `dict` | `None` | S3 configuration |
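Several of these parameters can be combined in one call. The sketch below assumes `scan_avro()` accepts the row-index, file-path, and error-handling keywords from the table. The table does not state which functions take which parameters, so treat this as an assumption. The helper name and the output column names `row_nr` and `source_file` are illustrative.

```python
def scan_with_provenance(pattern):
    """Hypothetical helper: tag every row with an index and its source file."""
    # Imported inside the function so the sketch loads without jetliner installed.
    import jetliner

    return jetliner.scan_avro(
        pattern,                           # e.g. "logs/*.avro", expanded because glob defaults to True
        row_index_name="row_nr",           # adds a row index column with this name
        row_index_offset=1,                # start counting at 1 instead of 0
        include_file_paths="source_file",  # adds a column holding each row's source path
        ignore_errors=True,                # skip bad records instead of failing
    )
```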
Storage Options Keys
| Key | Description |
|---|---|
| `endpoint` | Custom S3 endpoint (MinIO, LocalStack, R2) |
| `aws_access_key_id` | AWS access key |
| `aws_secret_access_key` | AWS secret key |
| `region` | AWS region |
| `max_retries` | Maximum retry attempts for transient failures |
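One way to assemble these keys is to read credentials from the conventional AWS environment variables, as in the sketch below. Both helper names are hypothetical, and string values are an assumption. `max_retries` is left out because the table does not state its expected value type.

```python
import os


def s3_storage_options(endpoint=None):
    """Hypothetical helper: build a storage_options dict from the environment."""
    candidates = {
        "endpoint": endpoint,  # e.g. a MinIO or LocalStack URL
        "aws_access_key_id": os.environ.get("AWS_ACCESS_KEY_ID"),
        "aws_secret_access_key": os.environ.get("AWS_SECRET_ACCESS_KEY"),
        "region": os.environ.get("AWS_REGION"),
    }
    # Drop unset entries so only explicitly configured keys are passed on.
    return {key: value for key, value in candidates.items() if value is not None}


def scan_bucket(uri, endpoint=None):
    """Hypothetical helper: lazily scan an S3 URI with the options above."""
    # Imported inside the function so the sketch loads without jetliner installed.
    import jetliner

    return jetliner.scan_avro(uri, storage_options=s3_storage_options(endpoint))
```

Keeping credentials in the environment avoids hard-coding secrets in scripts that call `scan_bucket`.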
Module Exports
All public symbols are available from the jetliner module:
import jetliner
# Functions
jetliner.scan_avro
jetliner.read_avro
jetliner.read_avro_schema
jetliner.open
# Classes
jetliner.AvroReader
jetliner.AvroReaderCore
# Exceptions (legacy)
jetliner.JetlinerError
jetliner.ParseError
jetliner.SchemaError
jetliner.CodecError
jetliner.DecodeError
jetliner.SourceError
# Structured exceptions with metadata
jetliner.PyDecodeError
jetliner.PyParseError
jetliner.PySourceError
jetliner.PySchemaError
jetliner.PyCodecError
# Type aliases
jetliner.FileSource
Next Steps
- Full API Reference - Complete function signatures and details
- User Guide - Practical usage examples