Parallelogram — The linter for fine-tuning data


Parallelogram catches the silent killers — broken role sequences, context-window overflows, duplicates, mojibake — before your $500 GPU run discovers them for you.

If it exits 0, your run won't fail because of data.

schema error

Malformed records.

Missing fields, wrong types, invalid roles. The first rule, because every other rule assumes it.

data.jsonl:84
{"messages": [{"role": "owner", "content": "…"}]}
          ^^^^^^^ invalid role: must be system|user|assistant|tool
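To make the rule concrete, here is a minimal sketch of a per-line schema check. It is not Parallelogram's implementation: the function name `check_schema` is hypothetical, and requiring every `content` to be a string is a simplifying assumption. Only the allowed role set comes from the message above.

```python
import json

# Role names taken from the error message above.
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def check_schema(line: str) -> list[str]:
    """Return schema problems for one JSONL line (empty list if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record must be a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} must be an object")
            continue
        role = msg.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            problems.append(
                f"message {i} has invalid role {role!r}: "
                "must be system|user|assistant|tool"
            )
        # Assumption: content is always a plain string in this sketch.
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i} 'content' must be a string")
    return problems
```

Calling `check_schema` on the record from line 84 above would report the `owner` role and nothing else.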

roles error

Bad role sequences.

System out of place. Doubled turns. Conversations that don't end on the assistant. The model learns to talk to itself.

data.jsonl:23
[user → user → assistant]
       ^^^^ role alternation broken at turn 1: expected 'assistant', got 'user'
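One way such a check could work, as a sketch under stated assumptions: an optional `system` message is allowed only at position 0, the remaining turns alternate starting with `user`, and the last turn must be `assistant`. The function name `check_roles` is hypothetical, and `tool` turns are deliberately left out to keep the example short.

```python
def check_roles(roles: list[str]) -> list[str]:
    """Return role-sequence problems for one conversation (empty list if none)."""
    problems = []
    # An optional system message may only appear first.
    start = 1 if roles[:1] == ["system"] else 0
    expected = "user"
    for i, role in enumerate(roles[start:], start=start):
        if role == "system":
            problems.append(f"system message out of place at turn {i}")
            break
        if role != expected:
            problems.append(
                f"role alternation broken at turn {i}: "
                f"expected {expected!r}, got {role!r}"
            )
            break
        # Flip the expectation for the next turn.
        expected = "assistant" if expected == "user" else "user"
    if not roles or roles[-1] != "assistant":
        problems.append("conversation does not end on an assistant turn")
    return problems
```

The sketch stops at the first alternation problem, since every later turn would be reported against a shifted expectation.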

empty-content error

Empty turns.

Whitespace-only content slips past most validators. The model trains to produce silence.

data.jsonl:312
{"role": "user", "content": "   "}
                       ^^^^^ message 0 has empty content
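The check itself can be tiny. The sketch below assumes each message is a dict with a `content` key and treats a missing or null value as empty too. `find_empty_turns` is a hypothetical name, not part of the tool.

```python
def find_empty_turns(messages: list[dict]) -> list[int]:
    """Return indexes of messages whose content is missing or whitespace-only."""
    empty = []
    for i, msg in enumerate(messages):
        content = msg.get("content")
        # None and "" are empty; strings are empty if strip() leaves nothing.
        if content is None or (isinstance(content, str) and not content.strip()):
            empty.append(i)
    return empty
```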

context-window error

Context-window overflow.

TRL truncates oversized records silently — usually severing the assistant turn. You train on noise and don't know.

data.jsonl:1209
~8512 tokens > max_seq_len = 8192
will be silently truncated, severing the assistant response
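The shape of the check is a token count compared against a limit. In the sketch below the counter is pluggable, and the default whitespace counter is only a stand-in: a real check would need the target model's tokenizer and chat template, which this example does not attempt. Both function names are hypothetical.

```python
from typing import Callable, Optional


def whitespace_token_count(text: str) -> int:
    # Stand-in only: real tokenizers produce different (usually higher) counts.
    return len(text.split())


def check_context_window(
    messages: list[dict],
    max_seq_len: int,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> Optional[str]:
    """Return an overflow message, or None if the record fits."""
    total = sum(count_tokens(m["content"]) for m in messages)
    if total > max_seq_len:
        return f"{total} tokens > max_seq_len = {max_seq_len}"
    return None
```

Passing the tokenizer in as a callable keeps the rule independent of any one model family.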

duplicates error

Exact duplicates.

Repeated examples push the model toward memorization, not generalization. We hash with normalized whitespace so trivial differences don't mask real dupes.

data.jsonl:147
duplicate of line 89  (3 copies total at lines: [89, 147, 402])
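A minimal sketch of whitespace-normalized hashing, assuming one record per line. The choice of SHA-256 and the exact normalization (collapse every run of whitespace to a single space) are assumptions for illustration. The text above only says that whitespace is normalized before hashing.

```python
import hashlib
from collections import defaultdict


def normalized_hash(line: str) -> str:
    # Collapse runs of whitespace and trim the ends before hashing.
    collapsed = " ".join(line.split())
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


def find_duplicates(lines: list[str]) -> list[list[int]]:
    """Return groups of 1-based line numbers that share a normalized hash."""
    seen = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        if line.strip():  # ignore blank lines
            seen[normalized_hash(line)].append(lineno)
    return [nums for nums in seen.values() if len(nums) > 1]
```

This normalization also collapses whitespace inside string values, and it would not match `{"a":1}` against `{"a": 1}`. A stricter variant could re-serialize the parsed JSON instead.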

encoding warning

Mojibake & BOM.

UTF-8 → latin-1 → UTF-8 round-trips look fine in your editor and ruin your model's punctuation forever.

data.jsonl:401
"donâ€™t do it" → should be: "don't do it"
     ^^^ latin-1 → UTF-8 round-trip artifact
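One common way to detect this kind of artifact is to reverse the bad round-trip and see whether the result is valid UTF-8. The sketch below uses Python's `cp1252` codec rather than `latin-1`, because the `€` and `™` characters in the example exist in Windows-1252 but not in ISO-8859-1. The function names are hypothetical and this is not necessarily how Parallelogram detects it.

```python
import codecs
from typing import Optional


def repair_mojibake(text: str) -> Optional[str]:
    """Return the repaired string if text looks like a bad round-trip, else None."""
    try:
        # Undo "UTF-8 bytes decoded as Windows-1252".
        repaired = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None  # not a cp1252 round-trip artifact
    return repaired if repaired != text else None


def has_bom(raw: bytes) -> bool:
    """True if the raw file bytes start with a UTF-8 byte order mark."""
    return raw.startswith(codecs.BOM_UTF8)
```

For the line above, the repair yields the typographic apostrophe U+2019. The sketch misses sequences containing bytes that Windows-1252 leaves undefined, so treat it as a detector for the common cases only.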

Free CI integration.
No config required.

Clean POSIX exit codes — 0 clean, 1 warnings, 2 errors — plus a structured JSON report. Drop one line into your workflow and your training data gets the same gate as your code.

# .github/workflows/data.yml
- run: pip install parallelogram
- run: parallelogram check data.jsonl --json
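Outside GitHub Actions, the same gate can be scripted. The sketch below wraps the command shown above and maps the three documented exit codes to a pass or fail decision. The helper names and the `fail_on_warnings` switch are hypothetical, and the JSON report is returned as raw text because its fields are not documented here.

```python
import subprocess

# Exit codes as documented above.
EXIT_MEANINGS = {0: "clean", 1: "warnings", 2: "errors"}


def interpret(code: int) -> str:
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def should_block(code: int, fail_on_warnings: bool = False) -> bool:
    """Decide whether a training job should be stopped for this exit code."""
    if code == 0:
        return False
    if code == 1:
        return fail_on_warnings
    return True  # errors, or anything unexpected


def run_gate(path: str, fail_on_warnings: bool = False) -> tuple[bool, str]:
    """Run the linter and return (blocked, raw JSON report text)."""
    proc = subprocess.run(
        ["parallelogram", "check", path, "--json"],
        capture_output=True,
        text=True,
    )
    print(f"{path}: {interpret(proc.returncode)}")
    return should_block(proc.returncode, fail_on_warnings), proc.stdout
```

Treating unknown exit codes as blocking is a deliberate choice here: a crashed linter should not let a run through.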

No telemetry. No upload boundary. No backend.

Streams the file. Memory stays flat at 100k records.
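Flat memory usually comes from reading one line at a time instead of loading the whole file. A minimal sketch of that pattern, with the hypothetical helper name `iter_records`:

```python
import io
import json
from typing import Any, Iterator, TextIO


def iter_records(stream: TextIO) -> Iterator[tuple[int, Any]]:
    """Yield (line_number, parsed_record) lazily, one line at a time."""
    for lineno, line in enumerate(stream, start=1):
        if line.strip():  # skip blank lines
            yield lineno, json.loads(line)


# Usage with an in-memory file; a real run would pass
# open("data.jsonl", encoding="utf-8") instead.
sample = io.StringIO(
    '{"messages": []}\n\n{"messages": [{"role": "user", "content": "hi"}]}\n'
)
line_numbers = [lineno for lineno, _ in iter_records(sample)]
```

Rules that only look at one record work directly on this iterator. A rule that compares records across the file still has to keep some state, such as one hash per record.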

Pluggable rules — disable or extend without forking.

See what it catches.

Six rules. Every silent killer.

Malformed records.

Bad role sequences.

Empty turns.

Context-window overflow.

Exact duplicates.

Mojibake & BOM.

Two commands.
Then never lose a run again.

Install

Validate

Ship

Validates the formats your trainer already speaks.

Run before you train.
