Memory & Streaming

sql-splitter uses a streaming architecture to handle files of any size with constant memory.

Traditional approaches load the entire file into memory:

# DON'T DO THIS
with open("dump.sql") as f:
    content = f.read()  # Loads the entire file into memory
process(content)

For a 10 GB file, this uses 10+ GB of RAM.

sql-splitter processes files incrementally:

  1. Read a chunk (64 KB default)
  2. Parse statements in the chunk
  3. Write output immediately
  4. Release memory
  5. Repeat

Memory stays constant at ~50 MB regardless of file size.

File → [64 KB Buffer] → Parser → Statement
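The Python sketch below shows the shape of that loop. It is illustrative only, not sql-splitter's implementation: it splits naively on semicolons, which is exactly the shortcut the real parser cannot take, as the notes on parser state below explain.

CHUNK_SIZE = 64 * 1024  # 64 KB, matching the default buffer size

def stream_statements(path):
    """Yield SQL statements one at a time without loading the whole file.
    Naive sketch: splits on ';' and ignores semicolons inside strings."""
    remainder = ""
    with open(path, "r", encoding="utf-8", errors="replace") as src:
        while True:
            chunk = src.read(CHUNK_SIZE)                            # 1. read a chunk
            if not chunk:
                break
            *complete, remainder = (remainder + chunk).split(";")   # 2. parse the chunk
            for stmt in complete:
                if stmt.strip():
                    yield stmt.strip() + ";"                        # 3. hand off for immediate output
            # 4./5. only `remainder` carries over to the next iteration
    if remainder.strip():
        yield remainder.strip()                                     # trailing partial statement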

The parser maintains state across chunk boundaries, handling:

  • Multi-line strings
  • Escaped characters
  • Statement boundaries
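
The simplified scanner below illustrates why that state matters. It is a sketch, not the actual parser: it remembers whether it is inside a quoted string and whether the previous character was a backslash, so a chunk boundary that falls mid-string cannot produce a false statement boundary.

class BoundaryScanner:
    """Finds statement boundaries while carrying quote/escape state
    across chunks (simplified sketch)."""

    def __init__(self):
        self.in_string = False   # inside a '...' literal
        self.escaped = False     # previous character was a backslash
        self.partial = []        # characters of the current statement

    def feed(self, chunk):
        """Consume one chunk, yielding any statements it completes."""
        for ch in chunk:
            self.partial.append(ch)
            if self.escaped:
                self.escaped = False
                continue
            if ch == "\\":
                self.escaped = True
            elif ch == "'":
                self.in_string = not self.in_string
            elif ch == ";" and not self.in_string:
                stmt = "".join(self.partial).strip()
                self.partial.clear()
                if stmt:
                    yield stmt

Because in_string and escaped survive between feed() calls, a semicolon inside a multi-line string is never mistaken for the end of a statement, no matter where the chunk boundary lands.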

Statement → [64 KB Buffer] → File

Output is written immediately, not buffered in memory.
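
On the output side, a fixed-size write buffer is enough; the sketch below (again not the tool's code) relies on Python's built-in file buffering to hand data to the operating system as the buffer fills, so finished statements never accumulate in memory.

def write_statements(statements, out_path, buffer_size=64 * 1024):
    """Write statements as they arrive; at most `buffer_size` bytes are
    pending in memory before being flushed to disk."""
    with open(out_path, "w", encoding="utf-8", buffering=buffer_size) as out:
        for stmt in statements:
            out.write(stmt + "\n")   # flushed automatically as the buffer fills

# e.g. write_statements(stream_statements("dump.sql"), "out.sql")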

Compressed files are decompressed on-the-fly:

compressed.sql.gz → [Decompressor] → [64 KB Buffer] → Parser

Only the decompression buffer exists in memory, not the full file.
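
With gzip input, for instance, the same chunked loop can sit behind Python's gzip module, which inflates data incrementally; only the current decompressed chunk ever exists in memory. A sketch under that assumption, not sql-splitter's code:

import gzip

CHUNK_SIZE = 64 * 1024

def open_maybe_compressed(path):
    """Return a text stream, decompressing gzip on the fly if needed."""
    if path.endswith(".gz"):
        # gzip.open inflates incrementally; the full file is never
        # held in memory, only the current decompressed chunk.
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")

def read_chunks(path):
    with open_maybe_compressed(path) as src:
        while True:
            chunk = src.read(CHUNK_SIZE)
            if not chunk:
                break
            yield chunk   # fed to the parser exactly as with plain files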

File Size    Memory Usage
10 MB        ~50 MB
100 MB       ~50 MB
1 GB         ~50 MB
10 GB        ~50 MB
100 GB       ~50 MB

All commands use streaming:

Command      Streaming I/O
split        Read → Write multiple files
merge        Read multiple → Write one
analyze      Read → Statistics in memory
convert      Read → Transform → Write
validate     Read → Validate → Report
sample       Read → Filter → Write
shard        Read → Filter → Write
diff         Read both → Compare → Report
redact       Read → Transform → Write
order        Read → Reorder (memory for graph)
query        Read → Import to DuckDB

A few operations require more memory:

The order and graph commands build an in-memory graph of table relationships. For schemas with thousands of tables, this graph can push memory noticeably above the ~50 MB baseline.
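
For intuition about what such a graph costs, here is an illustrative sketch (not sql-splitter's data structure): every table and foreign-key edge is held in memory at once, so usage scales with schema complexity rather than file size.

from collections import defaultdict

def build_dependency_graph(foreign_keys):
    """foreign_keys: iterable of (child_table, parent_table) pairs.
    Returns {table: set of tables it must be created after}."""
    graph = defaultdict(set)
    for child, parent in foreign_keys:
        graph[child].add(parent)
        graph.setdefault(parent, set())   # include tables with no parents
    return graph

def creation_order(graph):
    """Depth-first topological order: parents before children.
    (Cycle handling omitted for brevity.)"""
    order, visited = [], set()

    def visit(table):
        if table in visited:
            return
        visited.add(table)
        for dep in graph[table]:
            visit(dep)
        order.append(table)

    for table in sorted(graph):
        visit(table)
    return order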

query imports data into DuckDB, which uses memory proportional to data size. Use --disk for large files:

sql-splitter query huge.sql "SELECT ..." --disk

diff with data comparison tracks primary keys in memory. Use --max-pk-entries to cap how many keys are held:

sql-splitter diff old.sql new.sql --max-pk-entries 1000000
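
One way to picture that cap (a sketch only, not the actual diff logic): once the limit is reached, further keys are simply not tracked, trading completeness of the row-level comparison for bounded memory.

def track_primary_keys(rows, max_entries=1_000_000):
    """Collect primary keys for comparison, never holding more than
    `max_entries` in memory. `rows` yields (primary_key, row) pairs."""
    seen = set()
    truncated = False
    for pk, _row in rows:
        if len(seen) >= max_entries:
            truncated = True       # comparison is partial beyond this point
            continue
        seen.add(pk)
    return seen, truncated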