Schema Inspection¶
Jetliner provides multiple ways to inspect Avro schemas and understand how they map to Polars types.
Accessing Schemas¶
From an Open Reader¶
The open() API provides schema access before reading data:
import jetliner
with jetliner.open("data.avro") as reader:
    # JSON string representation
    print(reader.schema)

    # Python dictionary
    schema_dict = reader.schema_dict
    print(schema_dict)
Using read_avro_schema()¶
Get the Polars schema without reading any data:
import jetliner
# Returns a Polars schema (dict of column names to types)
polars_schema = jetliner.read_avro_schema("data.avro")
print(polars_schema)
# {'user_id': Int64, 'name': String, 'amount': Float64, ...}
This only reads the file header, making it fast even for large files.
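Because only the header is touched, it stays cheap to sweep the schemas of many files at once. A minimal sketch (the data/*.avro glob pattern is a hypothetical example):

import glob

import jetliner

# Print the column names of every Avro file in a directory
for path in glob.glob("data/*.avro"):
    schema = jetliner.read_avro_schema(path)
    print(path, list(schema.keys()))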
From S3¶
Schema inspection works with S3 files:
import jetliner
polars_schema = jetliner.read_avro_schema(
    "s3://bucket/data.avro",
    storage_options={"region": "us-east-1"},
)
Schema Formats¶
JSON String¶
The raw Avro schema, available as reader.schema, is a JSON string:

import jetliner

with jetliner.open("data.avro") as reader:
    print(reader.schema)

Output:
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"]}
  ]
}
Python Dictionary¶
Parsed schema as a Python dictionary:
import jetliner

with jetliner.open("data.avro") as reader:
    schema = reader.schema_dict
    print(f"Record name: {schema['name']}")
    print(f"Fields: {[f['name'] for f in schema['fields']]}")
Polars Schema¶
Column names mapped to Polars data types:
import jetliner
schema = jetliner.read_avro_schema("data.avro")
for name, dtype in schema.items():
    print(f"{name}: {dtype}")
Type Mapping¶
Jetliner maps Avro types to Polars types:
Primitive Types¶
| Avro Type | Polars Type |
|---|---|
| null | Null |
| boolean | Boolean |
| int | Int32 |
| long | Int64 |
| float | Float32 |
| double | Float64 |
| bytes | Binary |
| string | String |
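As a quick sanity check of this mapping (assuming a file with an Avro long field named user_id, as in the earlier output example):

import jetliner
import polars as pl

schema = jetliner.read_avro_schema("data.avro")
# Avro long -> Polars Int64, per the table above
assert schema["user_id"] == pl.Int64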
Logical Types¶
| Avro Logical Type | Polars Type |
|---|---|
| date | Date |
| time-millis | Time |
| time-micros | Time |
| timestamp-millis | Datetime(ms) |
| timestamp-micros | Datetime(μs) |
| uuid | String |
| decimal | Decimal |
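Since both timestamp flavours arrive as Datetime and differ only in time unit, dtype checks should match on the class rather than a specific unit. A sketch (the events.avro path is hypothetical):

import jetliner
import polars as pl

schema = jetliner.read_avro_schema("events.avro")
# isinstance matches Datetime(ms) and Datetime(μs) alike
datetime_cols = [
    name for name, dtype in schema.items()
    if isinstance(dtype, pl.Datetime)
]
print(datetime_cols)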
Complex Types¶
| Avro Type | Polars Type |
|---|---|
| array | List |
| map | Struct (with key/value fields) |
| record | Struct |
| enum | Categorical |
| fixed | Binary |
| union | Depends on variants |
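For example, to confirm an enum field arrives as Categorical before treating it as one (the status field name is a hypothetical example):

import jetliner
import polars as pl

schema = jetliner.read_avro_schema("data.avro")
# Avro enum -> Polars Categorical, per the table above
assert schema["status"] == pl.Categorical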
Union Types¶
Unions are handled based on their variants; the resulting Polars type depends on the union's branches (see the sketch below).
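A minimal sketch of the common nullable case (assuming data.avro carries the User schema shown earlier, where email is the union ["null", "string"]):

import jetliner

schema = jetliner.read_avro_schema("data.avro")
# The ["null", "string"] union surfaces as a plain String column;
# the null branch appears as null values rather than a separate dtype.
print(schema["email"])  # String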
Working with Schemas¶
Validate Schema Before Processing¶
import jetliner

def validate_schema(path, required_columns):
    """Check that file has required columns."""
    schema = jetliner.read_avro_schema(path)
    missing = set(required_columns) - set(schema.keys())
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return schema

# Usage
schema = validate_schema("data.avro", ["user_id", "amount"])
Schema-Driven Processing¶
import jetliner
import polars as pl

def process_by_schema(path):
    """Process file based on its schema."""
    schema = jetliner.read_avro_schema(path)

    # Build query based on available columns
    numeric_cols = [
        name for name, dtype in schema.items()
        if dtype in (pl.Int32, pl.Int64, pl.Float32, pl.Float64)
    ]

    return (
        jetliner.scan_avro(path)
        .select(numeric_cols)
        .collect()
    )
Compare Schemas¶
import jetliner

def schemas_compatible(path1, path2):
    """Check if two files have compatible schemas."""
    schema1 = jetliner.read_avro_schema(path1)
    schema2 = jetliner.read_avro_schema(path2)

    # Check column names match
    if set(schema1.keys()) != set(schema2.keys()):
        return False

    # Check types match
    for name in schema1:
        if schema1[name] != schema2[name]:
            return False

    return True
Nested Schemas¶
Records (Structs)¶
Nested records become Polars Structs:
# Avro schema:
# {"type": "record", "name": "Event", "fields": [
#   {"name": "user", "type": {
#     "type": "record", "name": "User", "fields": [
#       {"name": "id", "type": "long"},
#       {"name": "name", "type": "string"}
#     ]
#   }}
# ]}
import jetliner
import polars as pl

df = jetliner.scan_avro("events.avro").collect()

# Access nested fields
df.select(pl.col("user").struct.field("name"))
Arrays (Lists)¶
Arrays become Polars Lists:
# Avro schema:
# {"name": "tags", "type": {"type": "array", "items": "string"}}
import jetliner

df = jetliner.scan_avro("data.avro").collect()

# Explode the list column into one row per element
df.explode("tags")
Maps¶
Maps become Structs with key/value arrays:
# Avro schema:
# {"name": "metadata", "type": {"type": "map", "values": "string"}}
import jetliner
import polars as pl

df = jetliner.scan_avro("data.avro").collect()

# Access map as struct
df.select(pl.col("metadata"))
Recursive Types¶
Avro supports recursive types (self-referencing records). Since Polars doesn't support recursive structures, Jetliner serializes recursive fields to JSON strings:
# Avro schema with recursive type:
# {"type": "record", "name": "Node", "fields": [
#   {"name": "value", "type": "int"},
#   {"name": "children", "type": {"type": "array", "items": "Node"}}
# ]}
import jetliner
import polars as pl

df = jetliner.scan_avro("tree.avro").collect()

# The "children" column contains JSON strings; decode them back into structure
df.select(pl.col("children").str.json_decode())
Known Limitations¶
Top-Level Non-Record Schemas¶
Jetliner handles some non-record top-level schemas:
| Top-Level Type | Support |
|---|---|
| record | ✅ Full support |
| int, long, string, etc. | ✅ Single "value" column |
| array | ❌ Not yet supported |
| map | ❌ Not yet supported |
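For a primitive top-level schema, the values land in one column. A sketch (assuming longs.avro is a hypothetical file whose top-level schema is just "long"):

import jetliner

df = jetliner.scan_avro("longs.avro").collect()
# Per the table above, primitive top-level values arrive
# in a single column named "value"
print(df["value"])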
Schema Evolution¶
Jetliner reads the writer schema embedded in the file. Schema evolution (reading with a reader schema that differs from the writer schema) is not currently supported.
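Any evolution-style adjustments therefore have to happen after the read, in Polars. A minimal sketch (the rename and cast below are hypothetical examples, not a Jetliner feature):

import jetliner
import polars as pl

df = (
    jetliner.scan_avro("data.avro")
    .rename({"name": "full_name"})  # emulate a field rename
    .with_columns(pl.col("amount").cast(pl.Float64))  # emulate a type promotion
    .collect()
)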
Next Steps¶
- Codec Support - Compression options
- Error Handling - Handle schema errors
- Query Optimization - Use schema for efficient queries