
sample

Create smaller test datasets while preserving foreign key relationships.

Alias: sa (e.g., sql-splitter sa dump.sql --percent 10 -o dev.sql)

Common use cases:

  • Creating dev/test datasets - Generate smaller databases for local development from production dumps
  • Reducing CI test times - Smaller datasets mean faster test runs without losing coverage
  • Reproducing production bugs - Extract a consistent slice of data that maintains relationships between tables
  • Size-limited backups - Create smaller backups for quick restores or archival purposes
  • Demo environments - Build realistic demo data that maintains referential integrity

Use shard instead if you need to extract data for a specific tenant from a multi-tenant database.

The sample command operates in one of two modes:

Independent Sampling (Default)

Without --preserve-relations, each table is sampled independently using reservoir sampling. Rows are selected at random based on --percent or --rows. This mode is fast but doesn't guarantee that foreign key relationships remain valid.
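The per-table mode can be pictured as classic reservoir sampling (Algorithm R). A minimal sketch, not the tool's actual implementation:

```python
import random

def reservoir_sample(rows, k, seed=None):
    """Pick k rows uniformly at random in a single pass (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)      # fill the reservoir first
        else:
            j = rng.randint(0, i)   # keep new row with probability k/(i+1)
            if j < k:
                sample[j] = row
    return sample

picked = reservoir_sample(range(100_000), 1000, seed=42)
```

Because each row is considered exactly once, this works on streams of unknown length, which is presumably how a large dump can be sampled without loading it fully into memory.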

FK-Preserving Sampling (--preserve-relations)


When FK preservation is enabled, the algorithm:

  1. Splits the dump into per-table files in a temp directory
  2. Builds a dependency graph from CREATE TABLE and ALTER TABLE statements
  3. Identifies root tables - tables with no outgoing foreign keys (or explicitly specified via --root-tables)
  4. Processes tables in topological order - parents before children
  5. Samples root tables first using the specified percentage or row count
  6. Follows FK chains - for each sampled row, tracks its primary key hash. When processing child tables, only rows whose FK values match a sampled parent’s PK are eligible
  7. Outputs a consistent dataset where all FK relationships are satisfied
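Steps 4-6 above can be sketched as follows. The table names, the `r["id"]` primary key, and the `sample_root` callback are all hypothetical, and the real tool stores 64-bit hashes rather than Python's `hash()`:

```python
def sample_with_fk(tables, sample_root):
    """tables: (name, rows, fk) tuples in topological order (parents first).
    fk is (fk_column, parent_table), or None for a root table."""
    kept = {}      # table name -> rows that survived sampling
    pk_seen = {}   # table name -> set of PK hashes of kept rows
    for name, rows, fk in tables:
        if fk is None:
            kept[name] = sample_root(rows)        # step 5: sample roots
        else:
            col, parent = fk                      # step 6: follow FK chain
            kept[name] = [r for r in rows
                          if hash(r[col]) in pk_seen[parent]]
        pk_seen[name] = {hash(r["id"]) for r in kept[name]}
    return kept

tables = [
    ("users", [{"id": 1}, {"id": 2}, {"id": 3}], None),
    ("orders", [{"id": 10, "user_id": 1}, {"id": 11, "user_id": 3}],
     ("user_id", "users")),
]
kept = sample_with_fk(tables, sample_root=lambda rows: rows[:2])
# Order 11 is dropped: its user (id 3) was not sampled.
```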

Memory usage: PK values are stored as 64-bit hashes in a HashSet (8 bytes per row). For 10 million sampled rows across all tables, expect ~80MB of memory for PK tracking. Use --max-total-rows to cap memory growth.
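The quoted ~80MB figure follows directly from the 8-bytes-per-row bookkeeping (real hash-set overhead adds somewhat more):

```python
sampled_rows = 10_000_000
bytes_per_pk_hash = 8   # one 64-bit hash per sampled row
mb = sampled_rows * bytes_per_pk_hash / 1_000_000
# mb == 80.0
```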

Fractional rows: When a percentage yields a fractional row count (e.g., 10% of 15 rows = 1.5), reservoir sampling handles this naturally: over many runs you'll get approximately the target percentage. Use --seed for deterministic results.
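A quick illustration of the seed behavior: percentage sampling is a random process, but fixing the seed makes it repeatable. This is a sketch, not the tool's RNG:

```python
import random

def sample_fraction(rows, percent, seed):
    """Keep each row independently with probability percent/100."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < percent / 100]

a = sample_fraction(range(15), 10, seed=42)
b = sample_fraction(range(15), 10, seed=42)
assert a == b  # same seed, identical sample every run
```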

sql-splitter sample <INPUT> [OPTIONS]
# Sample 10% of rows from each table
sql-splitter sample dump.sql -o sampled.sql --percent 10
# Sample fixed number of rows per table
sql-splitter sample dump.sql -o sampled.sql --rows 1000
# Preserve FK relationships - ensures if you sample an order, the user exists
sql-splitter sample dump.sql -o dev.sql --percent 10 --preserve-relations
# Fail if any FK integrity issues are detected
sql-splitter sample dump.sql -o dev.sql --percent 10 --preserve-relations --strict-fk
# Same seed = identical sample every time (useful for CI/debugging)
sql-splitter sample dump.sql -o dev.sql --percent 10 --seed 42
# Sample only specific tables
sql-splitter sample dump.sql -o dev.sql --percent 10 --tables users,orders,products
# Exclude large or irrelevant tables
sql-splitter sample dump.sql -o dev.sql --percent 10 --exclude logs,events,audit_trail
# Use a config file for complex sampling requirements
sql-splitter sample dump.sql -o dev.sql --config sample.yaml
| Flag | Short | Description | Default |
| --- | --- | --- | --- |
| --output | -o | Output SQL file | stdout |
| --dialect | -d | SQL dialect | auto-detect |
| --percent | | Sample percentage (1-100) | - |
| --rows | | Sample fixed number of rows per table | - |
| --preserve-relations | | Preserve FK relationships | false |
| --tables | -t | Only sample these tables (comma-separated) | all |
| --exclude | -e | Exclude these tables (comma-separated) | none |
| --root-tables | | Explicit root tables for sampling | auto |
| --include-global | | Global table handling: none, lookups, all | lookups |
| --seed | | Random seed for reproducibility | random |
| --config | -c | YAML config file for per-table settings | - |
| --max-total-rows | | Maximum total rows to sample | unlimited |
| --no-limit | | Disable row limit | false |
| --strict-fk | | Fail if any FK integrity issues | false |
| --no-schema | | Exclude CREATE TABLE statements | false |
| --progress | -p | Show progress bar | false |
| --dry-run | | Preview without writing | false |
| --json | | Output as JSON | false |

Root tables are tables with no outgoing foreign keys—they don’t reference other tables. The sample command auto-detects these as starting points for FK-preserving sampling.
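Auto-detection boils down to finding nodes with no outgoing FK edges. A minimal sketch with hypothetical table names:

```python
def find_root_tables(all_tables, fk_edges):
    """fk_edges: (child, parent) pairs parsed from CREATE/ALTER TABLE.
    A root table never appears as a child, i.e. it references no one."""
    children = {child for child, _parent in fk_edges}
    return sorted(t for t in all_tables if t not in children)

edges = [("orders", "users"),
         ("order_items", "orders"),
         ("order_items", "products")]
roots = find_root_tables({"users", "orders", "order_items", "products"}, edges)
# roots == ["products", "users"]
```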

Override auto-detection when:

  • Your FKs are implicit (not declared in the schema)
  • You want to start sampling from specific tables
  • Auto-detection picks the wrong tables
# Explicitly specify which tables to start sampling from
sql-splitter sample dump.sql -o dev.sql --percent 10 --preserve-relations --root-tables users,organizations

The --include-global option controls how lookup/reference tables are handled:

| Value | Behavior |
| --- | --- |
| none | Skip global/lookup tables entirely |
| lookups | Include small lookup tables in full (default) |
| all | Include all global tables in full |

What counts as a lookup table?

Tables are auto-classified as lookups if they match common patterns:

  • countries, states, provinces, cities
  • currencies, languages, timezones
  • permissions, roles, settings

You can override classification in a YAML config file.

System tables (migrations, jobs, cache, sessions) are automatically skipped.
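Classification can be imagined as simple name matching, with config overrides taking precedence. The exact pattern lists here are illustrative, not the tool's:

```python
LOOKUP_NAMES = {"countries", "states", "provinces", "cities",
                "currencies", "languages", "timezones",
                "permissions", "roles", "settings"}
SYSTEM_NAMES = {"migrations", "jobs", "cache", "sessions"}

def classify(table, overrides=None):
    """Return 'system', 'lookup', or 'data'; YAML overrides win."""
    if overrides and table in overrides:
        return overrides[table]
    if table in SYSTEM_NAMES:
        return "system"
    if table in LOOKUP_NAMES:
        return "lookup"
    return "data"

classify("countries")                             # 'lookup': kept in full
classify("migrations")                            # 'system': skipped
classify("audit_logs", {"audit_logs": "system"})  # override wins
```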

For complex sampling requirements, use a YAML config file:

sample.yaml
default:
  percent: 10

classification:
  # Include these tables fully
  global:
    - permissions
    - roles
  # Skip these tables
  system:
    - migrations
    - failed_jobs
    - telescope_entries
  # Classify as lookup tables
  lookup:
    - currencies
    - countries
  # Explicitly mark as root tables
  root:
    - organizations

tables:
  users:
    rows: 500 # Fixed row count for this table
  posts:
    percent: 5 # Different percentage for this table
  sessions:
    skip: true # Skip entirely
  audit_logs:
    classification: system # Override classification

Then run:

sql-splitter sample dump.sql -o dev.sql --config sample.yaml --preserve-relations
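The likely precedence when both the default block and a tables entry apply is that per-table keys override the defaults. This resolution order is an assumption, and resolve_table_settings is a hypothetical helper:

```python
def resolve_table_settings(config, table):
    """Merge the default block with per-table overrides (per-table wins).
    NOTE: this precedence is assumed, not documented behavior."""
    settings = dict(config.get("default", {}))
    settings.update(config.get("tables", {}).get(table, {}))
    return settings

config = {
    "default": {"percent": 10},
    "tables": {
        "users": {"rows": 500},   # presumably supersedes the percent
        "posts": {"percent": 5},
        "sessions": {"skip": True},
    },
}
resolve_table_settings(config, "posts")     # {'percent': 5}
resolve_table_settings(config, "sessions")  # {'percent': 10, 'skip': True}
```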
# Sample 10% with FK preservation, then redact PII
sql-splitter sample prod.sql --percent 10 --preserve-relations -o sampled.sql
sql-splitter redact sampled.sql --config redact.yaml -o dev.sql
# Ensure the sample maintains FK integrity
sql-splitter sample dump.sql --percent 10 --preserve-relations -o dev.sql
sql-splitter validate dev.sql --strict
# Understand the FK graph before deciding on root tables
sql-splitter graph dump.sql --format mermaid -o schema.md
sql-splitter sample dump.sql --percent 10 --preserve-relations --root-tables users,orgs -o dev.sql
# Create sample, then run ad-hoc queries against it
sql-splitter sample dump.sql --percent 5 --preserve-relations -o sample.sql
sql-splitter query sample.sql "SELECT COUNT(*) FROM users"

Sample is larger than expected

Cause: FK preservation pulls in all related rows. If you sample 10% of orders, you get 100% of the users who placed those orders.

Solutions:

  • Use --max-total-rows to cap the total: --max-total-rows 100000
  • Sample from tables higher in the dependency chain (use --root-tables)
  • Use --dry-run to preview sizes before sampling
# Preview what would be sampled
sql-splitter sample dump.sql --percent 10 --preserve-relations --dry-run --json | jq '.table_stats'

Output has broken foreign keys

Cause: Foreign keys weren't detected in the schema (implicit FKs, or FKs defined only in application code).

Solutions:

  • Specify root tables explicitly: --root-tables users,organizations
  • Add FK constraints to your schema dump
  • Use --strict-fk to fail early when FK issues are detected

High memory usage

Cause: --preserve-relations tracks all sampled PK values in memory (8 bytes per row).

Solutions:

  • Reduce sample size: --percent 1 instead of --percent 10
  • Use --max-total-rows 500000 to cap memory usage
  • Disable FK preservation for very large dumps (accept potential broken FKs)
  • Sample fewer tables: --tables critical_table1,critical_table2

Different results on each run

Cause: Without --seed, sampling uses a new random seed on each run.

Solution: Add a fixed seed for reproducible results:

sql-splitter sample dump.sql --percent 10 --seed 42 -o dev.sql

“Warning: Reached max_total_rows limit”


Cause: The --max-total-rows guard stopped sampling to prevent runaway growth.

Solutions:

  • Increase the limit if you have sufficient memory
  • Reduce --percent to sample fewer rows initially
  • Review which tables are pulling in the most related rows with --dry-run --json
| Code | Meaning |
| --- | --- |
| 0 | Success |
| 1 | Error (I/O, parse error, FK integrity failure with --strict-fk) |
| 2 | Invalid arguments |

See Exit Codes for usage in scripts.

  • shard - Extract tenant-specific data
  • redact - Anonymize sampled data
  • validate - Verify FK integrity of sampled output
  • graph - Visualize FK relationships before sampling
  • JSON Output Schema - Schema for --json output
  • Glossary - Definitions of FK, PK, root table, etc.