Memory & Streaming

sql-splitter uses a streaming architecture to handle files of any size with constant memory.

Traditional approaches load the entire file into memory:

# DON'T DO THIS
with open("dump.sql") as f:
    content = f.read()  # Loads the entire file into memory
process(content)

For a 10 GB file, this uses 10+ GB of RAM.

sql-splitter processes files incrementally:

  1. Read a chunk (64 KB default)
  2. Parse statements in the chunk
  3. Write output immediately
  4. Release memory
  5. Repeat

Memory stays constant at ~50 MB regardless of file size.

File → [64 KB Buffer] → Parser → Statement
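The Python sketch below shows the shape of that loop. It is illustrative only, not sql-splitter's implementation: it splits naively on semicolons, which is exactly the shortcut the real parser cannot take, as the notes on parser state below explain.

CHUNK_SIZE = 64 * 1024  # 64 KB, matching the default buffer size

def stream_statements(path):
    """Yield SQL statements one at a time without loading the whole file.
    Naive sketch: splits on ';' and ignores semicolons inside strings."""
    remainder = ""
    with open(path, "r", encoding="utf-8", errors="replace") as src:
        while True:
            chunk = src.read(CHUNK_SIZE)                            # 1. read a chunk
            if not chunk:
                break
            *complete, remainder = (remainder + chunk).split(";")   # 2. parse the chunk
            for stmt in complete:
                if stmt.strip():
                    yield stmt.strip() + ";"                        # 3. hand off for immediate output
            # 4./5. only `remainder` carries over to the next iteration
    if remainder.strip():
        yield remainder.strip()                                     # trailing partial statement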

The parser maintains state across chunk boundaries, handling:

  • Multi-line strings
  • Escaped characters
  • Statement boundaries
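
The simplified scanner below illustrates why that state matters. It is a sketch, not the actual parser: it remembers whether it is inside a quoted string and whether the previous character was a backslash, so a chunk boundary that falls mid-string cannot produce a false statement boundary.

class BoundaryScanner:
    """Finds statement boundaries while carrying quote/escape state
    across chunks (simplified sketch)."""

    def __init__(self):
        self.in_string = False   # inside a '...' literal
        self.escaped = False     # previous character was a backslash
        self.partial = []        # characters of the current statement

    def feed(self, chunk):
        """Consume one chunk, yielding any statements it completes."""
        for ch in chunk:
            self.partial.append(ch)
            if self.escaped:
                self.escaped = False
                continue
            if ch == "\\":
                self.escaped = True
            elif ch == "'":
                self.in_string = not self.in_string
            elif ch == ";" and not self.in_string:
                stmt = "".join(self.partial).strip()
                self.partial.clear()
                if stmt:
                    yield stmt

Because in_string and escaped survive between feed() calls, a semicolon inside a multi-line string is never mistaken for the end of a statement, no matter where the chunk boundary lands.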

Statement → [64 KB Buffer] → File

Output is written immediately, not buffered in memory.
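
On the output side, a fixed-size write buffer is enough; the sketch below (again not the tool's code) relies on Python's built-in file buffering to hand data to the operating system as the buffer fills, so finished statements never accumulate in memory.

def write_statements(statements, out_path, buffer_size=64 * 1024):
    """Write statements as they arrive; at most `buffer_size` bytes are
    pending in memory before being flushed to disk."""
    with open(out_path, "w", encoding="utf-8", buffering=buffer_size) as out:
        for stmt in statements:
            out.write(stmt + "\n")   # flushed automatically as the buffer fills

# e.g. write_statements(stream_statements("dump.sql"), "out.sql")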

Compressed files are decompressed on-the-fly:

compressed.sql.gz → [Decompressor] → [64 KB Buffer] → Parser

Only the decompression buffer exists in memory, not the full file.
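
With gzip input, for instance, the same chunked loop can sit behind Python's gzip module, which inflates data incrementally; only the current decompressed chunk ever exists in memory. A sketch under that assumption, not sql-splitter's code:

import gzip

CHUNK_SIZE = 64 * 1024

def open_maybe_compressed(path):
    """Return a text stream, decompressing gzip on the fly if needed."""
    if path.endswith(".gz"):
        # gzip.open inflates incrementally; the full file is never
        # held in memory, only the current decompressed chunk.
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")

def read_chunks(path):
    with open_maybe_compressed(path) as src:
        while True:
            chunk = src.read(CHUNK_SIZE)
            if not chunk:
                break
            yield chunk   # fed to the parser exactly as with plain files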

File Size    Memory Usage
10 MB        ~50 MB
100 MB       ~50 MB
1 GB         ~50 MB
10 GB        ~50 MB
100 GB       ~50 MB

All commands use streaming:

Command      Streaming I/O
split        Read → Write multiple files
merge        Read multiple → Write one
analyze      Read → Statistics in memory
convert      Read → Transform → Write
validate     Read → Validate → Report
sample       Read → Filter → Write
shard        Read → Filter → Write
diff         Read both → Compare → Report
redact       Read → Transform → Write
order        Read → Reorder (memory for graph)
query        Read → Import to DuckDB

A few operations require more memory:

The order and graph commands build an in-memory graph of table relationships. For schemas with thousands of tables, this graph can push memory noticeably above the ~50 MB baseline.
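
For intuition about what such a graph costs, here is an illustrative sketch (not sql-splitter's data structure): every table and foreign-key edge is held in memory at once, so usage scales with schema complexity rather than file size.

from collections import defaultdict

def build_dependency_graph(foreign_keys):
    """foreign_keys: iterable of (child_table, parent_table) pairs.
    Returns {table: set of tables it must be created after}."""
    graph = defaultdict(set)
    for child, parent in foreign_keys:
        graph[child].add(parent)
        graph.setdefault(parent, set())   # include tables with no parents
    return graph

def creation_order(graph):
    """Depth-first topological order: parents before children.
    (Cycle handling omitted for brevity.)"""
    order, visited = [], set()

    def visit(table):
        if table in visited:
            return
        visited.add(table)
        for dep in graph[table]:
            visit(dep)
        order.append(table)

    for table in sorted(graph):
        visit(table)
    return order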

query imports data into DuckDB, which uses memory proportional to data size. Use --disk for large files:

sql-splitter query huge.sql "SELECT ..." --disk

diff with data comparison tracks primary keys in memory. Use --max-pk-entries to cap how many keys are held:

sql-splitter diff old.sql new.sql --max-pk-entries 1000000
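
One way to picture that cap (a sketch only, not the actual diff logic): once the limit is reached, further keys are simply not tracked, trading completeness of the row-level comparison for bounded memory.

def track_primary_keys(rows, max_entries=1_000_000):
    """Collect primary keys for comparison, never holding more than
    `max_entries` in memory. `rows` yields (primary_key, row) pairs."""
    seen = set()
    truncated = False
    for pk, _row in rows:
        if len(seen) >= max_entries:
            truncated = True       # comparison is partial beyond this point
            continue
        seen.add(pk)
    return seen, truncated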