
sample

Create smaller test datasets while preserving foreign key relationships.

Alias: sa (e.g., sql-splitter sa dump.sql --percent 10 -o dev.sql)

Common use cases:

  • Creating dev/test datasets - Generate smaller databases for local development from production dumps
  • Reducing CI test times - Smaller datasets mean faster test runs without losing coverage
  • Reproducing production bugs - Extract a consistent slice of data that maintains relationships between tables
  • Size-limited backups - Create smaller backups for quick restores or archival purposes
  • Demo environments - Build realistic demo data that maintains referential integrity

Use shard instead if you need to extract data for a specific tenant from a multi-tenant database.

The sample command operates in one of two modes:

Independent Sampling (Default)

Without --preserve-relations, each table is sampled independently using reservoir sampling. Rows are selected at random based on --percent or --rows. This mode is fast but doesn't guarantee that foreign key relationships remain valid.
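The per-table mode can be pictured as classic reservoir sampling (Algorithm R). A minimal sketch, not the tool's actual implementation:

```python
import random

def reservoir_sample(rows, k, seed=None):
    """Pick k rows uniformly at random in a single pass (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)      # fill the reservoir first
        else:
            j = rng.randint(0, i)   # keep new row with probability k/(i+1)
            if j < k:
                sample[j] = row
    return sample

picked = reservoir_sample(range(100_000), 1000, seed=42)
```

Because each row is considered exactly once, this works on streams of unknown length, which is presumably how a large dump can be sampled without loading it fully into memory.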

FK-Preserving Sampling (--preserve-relations)


When FK preservation is enabled, the algorithm:

  1. Splits the dump into per-table files in a temp directory
  2. Builds a dependency graph from CREATE TABLE and ALTER TABLE statements
  3. Identifies root tables - tables with no outgoing foreign keys (or explicitly specified via --root-tables)
  4. Processes tables in topological order - parents before children
  5. Samples root tables first using the specified percentage or row count
  6. Follows FK chains - for each sampled row, tracks its primary key hash. When processing child tables, only rows whose FK values match a sampled parent’s PK are eligible
  7. Outputs a consistent dataset where all FK relationships are satisfied
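Steps 4-6 above can be sketched as follows. The table names, the `r["id"]` primary key, and the `sample_root` callback are all hypothetical, and the real tool stores 64-bit hashes rather than Python's `hash()`:

```python
def sample_with_fk(tables, sample_root):
    """tables: (name, rows, fk) tuples in topological order (parents first).
    fk is (fk_column, parent_table), or None for a root table."""
    kept = {}      # table name -> rows that survived sampling
    pk_seen = {}   # table name -> set of PK hashes of kept rows
    for name, rows, fk in tables:
        if fk is None:
            kept[name] = sample_root(rows)        # step 5: sample roots
        else:
            col, parent = fk                      # step 6: follow FK chain
            kept[name] = [r for r in rows
                          if hash(r[col]) in pk_seen[parent]]
        pk_seen[name] = {hash(r["id"]) for r in kept[name]}
    return kept

tables = [
    ("users", [{"id": 1}, {"id": 2}, {"id": 3}], None),
    ("orders", [{"id": 10, "user_id": 1}, {"id": 11, "user_id": 3}],
     ("user_id", "users")),
]
kept = sample_with_fk(tables, sample_root=lambda rows: rows[:2])
# Order 11 is dropped: its user (id 3) was not sampled.
```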

Memory usage: PK values are stored as 64-bit hashes in a HashSet (8 bytes per row). For 10 million sampled rows across all tables, expect ~80MB of memory for PK tracking. Use --max-total-rows to cap memory growth.
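The quoted ~80MB figure follows directly from the 8-bytes-per-row bookkeeping (real hash-set overhead adds somewhat more):

```python
sampled_rows = 10_000_000
bytes_per_pk_hash = 8   # one 64-bit hash per sampled row
mb = sampled_rows * bytes_per_pk_hash / 1_000_000
# mb == 80.0
```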

Fractional rows: When a percentage yields a fractional row count (e.g., 10% of 15 rows = 1.5), reservoir sampling handles this naturally: over many runs you'll get approximately the target percentage. Use --seed for deterministic results.
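A quick illustration of the seed behavior: percentage sampling is a random process, but fixing the seed makes it repeatable. This is a sketch, not the tool's RNG:

```python
import random

def sample_fraction(rows, percent, seed):
    """Keep each row independently with probability percent/100."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < percent / 100]

a = sample_fraction(range(15), 10, seed=42)
b = sample_fraction(range(15), 10, seed=42)
assert a == b  # same seed, identical sample every run
```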

sql-splitter sample <INPUT> [OPTIONS]
# Sample 10% of rows from each table
sql-splitter sample dump.sql -o sampled.sql --percent 10
# Sample fixed number of rows per table
sql-splitter sample dump.sql -o sampled.sql --rows 1000
# Preserve FK relationships - ensures if you sample an order, the user exists
sql-splitter sample dump.sql -o dev.sql --percent 10 --preserve-relations
# Fail if any FK integrity issues are detected
sql-splitter sample dump.sql -o dev.sql --percent 10 --preserve-relations --strict-fk
# Same seed = identical sample every time (useful for CI/debugging)
sql-splitter sample dump.sql -o dev.sql --percent 10 --seed 42
# Sample only specific tables
sql-splitter sample dump.sql -o dev.sql --percent 10 --tables users,orders,products
# Exclude large or irrelevant tables
sql-splitter sample dump.sql -o dev.sql --percent 10 --exclude logs,events,audit_trail
# Use a config file for complex sampling requirements
sql-splitter sample dump.sql -o dev.sql --config sample.yaml
| Flag | Short | Description | Default |
| --- | --- | --- | --- |
| --output | -o | Output SQL file | stdout |
| --dialect | -d | SQL dialect | auto-detect |
| --percent | | Sample percentage (1-100) | - |
| --rows | | Sample fixed number of rows per table | - |
| --preserve-relations | | Preserve FK relationships | false |
| --tables | -t | Only sample these tables (comma-separated) | all |
| --exclude | -e | Exclude these tables (comma-separated) | none |
| --root-tables | | Explicit root tables for sampling | auto |
| --include-global | | Global table handling: none, lookups, all | lookups |
| --seed | | Random seed for reproducibility | random |
| --config | -c | YAML config file for per-table settings | - |
| --max-total-rows | | Maximum total rows to sample | unlimited |
| --no-limit | | Disable row limit | false |
| --strict-fk | | Fail if any FK integrity issues | false |
| --no-schema | | Exclude CREATE TABLE statements | false |
| --progress | -p | Show progress bar | false |
| --dry-run | | Preview without writing | false |
| --json | | Output as JSON | false |

Root tables are tables with no outgoing foreign keys—they don’t reference other tables. The sample command auto-detects these as starting points for FK-preserving sampling.
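Auto-detection boils down to finding nodes with no outgoing FK edges. A minimal sketch with hypothetical table names:

```python
def find_root_tables(all_tables, fk_edges):
    """fk_edges: (child, parent) pairs parsed from CREATE/ALTER TABLE.
    A root table never appears as a child, i.e. it references no one."""
    children = {child for child, _parent in fk_edges}
    return sorted(t for t in all_tables if t not in children)

edges = [("orders", "users"),
         ("order_items", "orders"),
         ("order_items", "products")]
roots = find_root_tables({"users", "orders", "order_items", "products"}, edges)
# roots == ["products", "users"]
```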

Override auto-detection when:

  • Your FKs are implicit (not declared in the schema)
  • You want to start sampling from specific tables
  • Auto-detection picks the wrong tables
# Explicitly specify which tables to start sampling from
sql-splitter sample dump.sql -o dev.sql --percent 10 --preserve-relations --root-tables users,organizations

The --include-global option controls how lookup/reference tables are handled:

| Value | Behavior |
| --- | --- |
| none | Skip global/lookup tables entirely |
| lookups | Include small lookup tables in full (default) |
| all | Include all global tables in full |

What counts as a lookup table?

Tables are auto-classified as lookups if they match common patterns:

  • countries, states, provinces, cities
  • currencies, languages, timezones
  • permissions, roles, settings

You can override classification in a YAML config file.

System tables (migrations, jobs, cache, sessions) are automatically skipped.
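Classification can be imagined as simple name matching, with config overrides taking precedence. The exact pattern lists here are illustrative, not the tool's:

```python
LOOKUP_NAMES = {"countries", "states", "provinces", "cities",
                "currencies", "languages", "timezones",
                "permissions", "roles", "settings"}
SYSTEM_NAMES = {"migrations", "jobs", "cache", "sessions"}

def classify(table, overrides=None):
    """Return 'system', 'lookup', or 'data'; YAML overrides win."""
    if overrides and table in overrides:
        return overrides[table]
    if table in SYSTEM_NAMES:
        return "system"
    if table in LOOKUP_NAMES:
        return "lookup"
    return "data"

classify("countries")                             # 'lookup': kept in full
classify("migrations")                            # 'system': skipped
classify("audit_logs", {"audit_logs": "system"})  # override wins
```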

For complex sampling requirements, use a YAML config file:

sample.yaml
default:
  percent: 10

classification:
  # Include these tables fully
  global:
    - permissions
    - roles
  # Skip these tables
  system:
    - migrations
    - failed_jobs
    - telescope_entries
  # Classify as lookup tables
  lookup:
    - currencies
    - countries
  # Explicitly mark as root tables
  root:
    - organizations

tables:
  users:
    rows: 500 # Fixed row count for this table
  posts:
    percent: 5 # Different percentage for this table
  sessions:
    skip: true # Skip entirely
  audit_logs:
    classification: system # Override classification

Then run:

sql-splitter sample dump.sql -o dev.sql --config sample.yaml --preserve-relations
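The likely precedence when both the default block and a tables entry apply is that per-table keys override the defaults. This resolution order is an assumption, and resolve_table_settings is a hypothetical helper:

```python
def resolve_table_settings(config, table):
    """Merge the default block with per-table overrides (per-table wins).
    NOTE: this precedence is assumed, not documented behavior."""
    settings = dict(config.get("default", {}))
    settings.update(config.get("tables", {}).get(table, {}))
    return settings

config = {
    "default": {"percent": 10},
    "tables": {
        "users": {"rows": 500},   # presumably supersedes the percent
        "posts": {"percent": 5},
        "sessions": {"skip": True},
    },
}
resolve_table_settings(config, "posts")     # {'percent': 5}
resolve_table_settings(config, "sessions")  # {'percent': 10, 'skip': True}
```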
# Sample 10% with FK preservation, then redact PII
sql-splitter sample prod.sql --percent 10 --preserve-relations -o sampled.sql
sql-splitter redact sampled.sql --config redact.yaml -o dev.sql
# Ensure the sample maintains FK integrity
sql-splitter sample dump.sql --percent 10 --preserve-relations -o dev.sql
sql-splitter validate dev.sql --strict
# Understand the FK graph before deciding on root tables
sql-splitter graph dump.sql --format mermaid -o schema.md
sql-splitter sample dump.sql --percent 10 --preserve-relations --root-tables users,orgs -o dev.sql
# Create sample, then run ad-hoc queries against it
sql-splitter sample dump.sql --percent 5 --preserve-relations -o sample.sql
sql-splitter query sample.sql "SELECT COUNT(*) FROM users"

Sample is larger than expected

Cause: FK preservation pulls in all related rows. If you sample 10% of orders, you get 100% of the users who placed those orders.

Solutions:

  • Use --max-total-rows to cap the total: --max-total-rows 100000
  • Sample from tables higher in the dependency chain (use --root-tables)
  • Use --dry-run to preview sizes before sampling
# Preview what would be sampled
sql-splitter sample dump.sql --percent 10 --preserve-relations --dry-run --json | jq '.table_stats'

Output has broken foreign keys

Cause: Foreign keys weren't detected in the schema (implicit FKs, or FKs defined only in application code).

Solutions:

  • Specify root tables explicitly: --root-tables users,organizations
  • Add FK constraints to your schema dump
  • Use --strict-fk to fail early when FK issues are detected

High memory usage

Cause: --preserve-relations tracks all sampled PK values in memory (8 bytes per row).

Solutions:

  • Reduce sample size: --percent 1 instead of --percent 10
  • Use --max-total-rows 500000 to cap memory usage
  • Disable FK preservation for very large dumps (accept potential broken FKs)
  • Sample fewer tables: --tables critical_table1,critical_table2

Different results on each run

Cause: Without --seed, sampling uses a new random seed on each run.

Solution: Add a fixed seed for reproducible results:

sql-splitter sample dump.sql --percent 10 --seed 42 -o dev.sql

“Warning: Reached max_total_rows limit”


Cause: The --max-total-rows guard stopped sampling to prevent runaway growth.

Solutions:

  • Increase the limit if you have sufficient memory
  • Reduce --percent to sample fewer rows initially
  • Review which tables are pulling in the most related rows with --dry-run --json
| Code | Meaning |
| --- | --- |
| 0 | Success |
| 1 | Error (I/O, parse error, FK integrity failure with --strict-fk) |
| 2 | Invalid arguments |

See Exit Codes for usage in scripts.

  • shard - Extract tenant-specific data
  • redact - Anonymize sampled data
  • validate - Verify FK integrity of sampled output
  • graph - Visualize FK relationships before sampling
  • JSON Output Schema - Schema for --json output
  • Glossary - Definitions of FK, PK, root table, etc.