v0.1 · open source · pre-flight

The linter for
fine-tuning data.

Parallelogram catches the silent killers — broken role sequences, context-window overflows, duplicates, mojibake — before your $500 GPU run discovers them for you.

Get started
If it exits 0, your run won't fail because of data.
/ 01 — pre-flight

See what it catches.

One command before you train. Errors report the exact line and rule. A clean exit is a guarantee.

/ 02 — what it checks

Six rules. Every silent killer.

Every check maps to a real failure mode that has cost someone a training run. Rules are pluggable; the v0.1 set is non-negotiable.

schema error

Malformed records.

Missing fields, wrong types, invalid roles. The first rule, because every other rule assumes it.

data.jsonl:84
{"messages": [{"role": "owner", "content": "…"}]}
                       ^^^^^^^ invalid role: must be system|user|assistant|tool
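As an illustration of what a rule like this does, here is a minimal sketch in Python. The record shape follows the OpenAI chat format the tool validates; the function name and error strings are this sketch's own, not the tool's actual implementation.

```python
import json

VALID_ROLES = {"system", "user", "assistant", "tool"}

def check_schema(line: str) -> list[str]:
    """Return schema errors for one JSONL record (empty list = clean)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e.msg}"]
    if not isinstance(record, dict):
        return ["record must be a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    errors = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            errors.append(f"message {i}: must be an object")
            continue
        if msg.get("role") not in VALID_ROLES:
            errors.append(f"message {i}: invalid role: must be system|user|assistant|tool")
        if not isinstance(msg.get("content"), str):
            errors.append(f"message {i}: 'content' must be a string")
    return errors
```

Every later rule can then assume a well-formed record, which is why this one runs first.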
roles error

Bad role sequences.

System out of place. Doubled turns. Conversations that don't end on the assistant. The model learns to talk to itself.

data.jsonl:23
[user → user → assistant]
        ^^^^ role alternation broken at turn 1: expected 'assistant', got 'user'
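A hedged sketch of the alternation logic, simplified to the common case: one optional leading system message, strict user/assistant alternation, no tool calls. The real rule set is richer; this only shows the shape of the check.

```python
def check_roles(roles: list[str]) -> list[str]:
    """Return role-sequence errors for one conversation."""
    errors = []
    # An optional system prompt may open the conversation, nowhere else.
    body = roles[1:] if roles and roles[0] == "system" else roles
    if "system" in body:
        errors.append("system message only allowed at position 0")
    expected = "user"
    for turn, role in enumerate(body):
        if role != expected:
            errors.append(
                f"role alternation broken at turn {turn}: "
                f"expected '{expected}', got '{role}'"
            )
            break
        expected = "assistant" if expected == "user" else "user"
    if body and body[-1] != "assistant":
        errors.append("conversation must end on 'assistant'")
    return errors
```

Run on the sequence above, `["user", "user", "assistant"]`, it reports the alternation break at turn 1.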
empty-content error

Empty turns.

Whitespace-only content slips past most validators. The model trains to produce silence.

data.jsonl:312
{"role": "user", "content": "   "}
                            ^^^^^ message 0 has empty content
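The rule itself reduces to strip-and-test. A minimal sketch (not the tool's actual code):

```python
def check_empty(messages: list[dict]) -> list[str]:
    """Flag messages whose content is empty or whitespace-only."""
    return [
        f"message {i} has empty content"
        for i, msg in enumerate(messages)
        if not msg.get("content", "").strip()
    ]
```

`"   ".strip()` is falsy, so whitespace-only turns are caught along with truly empty ones.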
context-window error

Context-window overflow.

TRL truncates oversized records silently — usually severing the assistant turn. You train on noise and don't know.

data.jsonl:1209
~8512 tokens > max_seq_len = 8192
will be silently truncated, severing the assistant response
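A sketch of the overflow check. `count_tokens` stands in for a real tokenizer (what the tool's [tokenizer] extra provides); the whitespace-split stand-in below exists only to keep the example self-contained, and real token counts will differ.

```python
def check_context_window(messages, count_tokens, max_seq_len=8192):
    """Flag records whose total token count exceeds the training context."""
    total = sum(count_tokens(m["content"]) for m in messages)
    if total > max_seq_len:
        return (f"~{total} tokens > max_seq_len = {max_seq_len}; "
                "will be silently truncated, severing the assistant response")
    return None

# Crude stand-in; use the model's own tokenizer for real counts.
approx_tokens = lambda text: len(text.split())
```

The point is to fail loudly before the trainer truncates quietly.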
duplicates error

Exact duplicates.

Repeated examples push the model toward memorization, not generalization. We hash with normalized whitespace so trivial differences don't mask real dupes.

data.jsonl:147
duplicate of line 89 (3 copies total at lines: [89, 147, 402])
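The whitespace-normalized hashing described above can be sketched like this. The hash choice and grouping are illustrative, not the tool's exact scheme.

```python
import hashlib
import re
from collections import defaultdict

def find_duplicates(lines) -> list[list[int]]:
    """Group line numbers (1-based) of records that hash identically
    after whitespace normalization."""
    seen = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        # Collapse runs of whitespace so trivial differences
        # don't mask real dupes.
        normalized = re.sub(r"\s+", " ", line).strip()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        seen[digest].append(lineno)
    return [nums for nums in seen.values() if len(nums) > 1]
```

Because only digests and line numbers are kept, this stays cheap even on large files.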
encoding warning

Mojibake & BOM.

UTF-8 → latin-1 → UTF-8 round-trips leave artifacts that are easy to scroll past in an editor and ruin your model's punctuation for good.

data.jsonl:401
"don’t do it" → should be: "don't do it"
    ^^^ latin-1 → UTF-8 round-trip artifact
/ 03 — anywhere a build runs

Free CI integration.
No config required.

Clean POSIX exit codes — 0 clean, 1 warnings, 2 errors — plus a structured JSON report. Drop one line into your workflow and your training data gets the same gate as your code.

# .github/workflows/data.yml
- run: pip install parallelogram
- run: parallelogram check data.jsonl --json
  • No telemetry. No uploads. No backend.
  • Streams the file. Memory stays flat even at 100k records.
  • Pluggable rules — disable or extend without forking.
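Consuming the report in a pipeline might look like the sketch below. The JSON field names (`errors`, `warnings`) are assumptions for illustration; check the tool's actual report schema. Only the exit-code contract (0 clean, 1 warnings, 2 errors) comes from above.

```python
import json

def exit_code_from_report(report_json: str) -> int:
    """Map a report to the documented exit codes: 0 clean, 1 warnings,
    2 errors. Field names here are assumed, not from the tool's docs."""
    report = json.loads(report_json)
    if report.get("errors"):
        return 2
    if report.get("warnings"):
        return 1
    return 0
```

CI runners treat any nonzero exit as a failed step, so the gate needs no extra configuration.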
github.com / Thatayotlhe04 / openai-fine-tune PR #847 · 2m ago
Check failed · parallelogram / data

3 errors, 1 warning in data.jsonl

L23 roles Conversation must end on 'assistant', ended on 'user'
L147 duplicates Duplicate of line 89
L312 context-window Record exceeds max_seq_len: ~8512 > 8192 tokens
L401 encoding Likely mojibake: '’'
3 errors · 1 warning · 543 clean
/ 04 — quickstart

Two commands.
Then never lose a run again.

01

Install

$ pip install parallelogram

Add the [tokenizer] extra (pip install "parallelogram[tokenizer]") for the context-window check.

02

Validate

$ parallelogram check data.jsonl \
    --tokenizer meta-llama/Llama-3-8B \
    --max-seq-len 8192

Exits 0 if your data is fine. Otherwise, exact lines and rules.

03

Ship

$ parallelogram check data.jsonl \
    --output clean.jsonl

Writes only error-free records. Ready to feed your trainer.
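Under the hood, a streaming filter of this shape keeps memory flat: read a line, check it, write it or skip it. A minimal sketch, where a hypothetical `record_is_clean` stands in for the full rule set:

```python
def filter_clean(src_path, dst_path, record_is_clean):
    """Stream src line by line, writing only records that pass all checks."""
    kept = dropped = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if record_is_clean(line):
                dst.write(line)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

Nothing is buffered beyond the current line, which is why file size never matters.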

/ 05 — works with

Validates the formats your trainer already speaks.

v0.1 supports OpenAI chat ({"messages":[…]}). ShareGPT and raw-completion shipping next.

Run before you train.

Free, open source, local. No telemetry, no upload, no account.