Parallelogram catches the silent killers — broken role sequences, context-window overflows, duplicates, mojibake — before your $500 GPU run discovers them for you.
One command before you train. Errors report the exact line and rule. A clean exit is a guarantee.
Every check maps to a real failure mode that has cost someone a training run. Rules are pluggable; the v0.1 set is non-negotiable.
Malformed records.
Missing fields, wrong types, invalid roles. The first rule, because every other rule assumes it.
data.jsonl:84
  {"messages": [{"role": "owner", "content": "…"}]}
                         ^^^^^^^
  invalid role: must be system|user|assistant|tool
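A minimal sketch of what a malformed-record check might look like, assuming the OpenAI chat shape shown above. The function name `check_record` and its messages are illustrative, not Parallelogram's actual API, and the sketch assumes every message carries string content.

```python
import json

# Roles taken from the error message above.
VALID_ROLES = {"system", "user", "assistant", "tool"}

def check_record(line: str) -> list[str]:
    """Return the problems found in one JSONL line (empty list if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record must be a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        role = msg.get("role")
        if not isinstance(role, str) or role not in VALID_ROLES:
            problems.append(
                f"message {i} has invalid role {role!r}: "
                "must be system|user|assistant|tool"
            )
        # Assumption: content must be a string (no null or structured content).
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i} is missing string 'content'")
    return problems
```

Later checks can then assume that `messages` is a non-empty list of objects with a known role, which is why this rule runs first.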
System out of place. Doubled turns. Conversations that don't end on the assistant. The model learns to talk to itself.
data.jsonl:23
  [user → user → assistant]
          ^^^^
  role alternation broken at turn 1: expected 'assistant', got 'user'
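The three failure modes above (system out of place, doubled turns, not ending on the assistant) can be sketched as one pass over the role list. This is an illustration, not Parallelogram's implementation. `check_role_sequence` is a hypothetical name, and the handling of `tool` turns (a tool result must be followed by an assistant turn) is an assumption.

```python
def check_role_sequence(roles: list[str]) -> list[str]:
    """Return sequence problems for a list of already-validated roles."""
    problems = []
    expected = "user"   # the first non-system turn should come from the user
    last = None
    for i, role in enumerate(roles):
        if role == "system":
            # A system message is only accepted as the very first turn.
            if i != 0:
                problems.append(f"system message out of place at turn {i}")
            continue
        if role == "tool":
            # Assumption: a tool result is followed by an assistant turn.
            expected = "assistant"
            last = role
            continue
        if role != expected:
            problems.append(
                f"role alternation broken at turn {i}: "
                f"expected {expected!r}, got {role!r}"
            )
        expected = "assistant" if role == "user" else "user"
        last = role
    if last != "assistant":
        problems.append("conversation does not end on an assistant turn")
    return problems
```

Calling `check_role_sequence(["user", "user", "assistant"])` reports the same break at turn 1 as the example above.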
Whitespace-only content slips past most validators. The model trains to produce silence.
data.jsonl:312
  {"role": "user", "content": " "}
                              ^^^^^
  message 0 has empty content
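A sketch of the whitespace-only check, assuming each message is a dict with a `content` key. The helper name `empty_content_indices` is hypothetical.

```python
def empty_content_indices(messages: list[dict]) -> list[int]:
    """Return indices of messages whose content is missing, empty, or whitespace-only."""
    empty = []
    for i, msg in enumerate(messages):
        content = msg.get("content")
        # str.strip() removes spaces, tabs, and newlines, so " \n\t" counts as empty.
        if not isinstance(content, str) or not content.strip():
            empty.append(i)
    return empty
```

A validator that only tests `content != ""` would let a single space through; stripping first closes that gap.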
TRL truncates oversized records silently — usually severing the assistant turn. You train on noise and don't know.
data.jsonl:1209
  ~8512 tokens > max_seq_len = 8192
  will be silently truncated, severing the assistant response
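A rough sketch of a context-window check with a pluggable token counter. The names `tokens_over_budget` and `count_tokens` are illustrative, and the sum ignores whatever tokens a chat template adds around each message, so it should be read as a lower bound.

```python
from typing import Callable, Optional

def tokens_over_budget(
    messages: list[dict],
    count_tokens: Callable[[str], int],
    max_seq_len: int,
) -> Optional[int]:
    """Return the estimated token count if it exceeds max_seq_len, else None."""
    # Lower-bound estimate: content tokens only, no template or special tokens.
    total = sum(count_tokens(m["content"]) for m in messages)
    return total if total > max_seq_len else None


def whitespace_count(text: str) -> int:
    """Stand-in counter for the example; a real run would use the model's tokenizer."""
    return len(text.split())
```

With Hugging Face `transformers` installed, one possible counter is `lambda text: len(tokenizer.encode(text))`, where `tokenizer` comes from `AutoTokenizer.from_pretrained(...)`. Whether Parallelogram counts tokens this way is not stated here.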
Repeated examples push the model toward memorization, not generalization. We hash with normalized whitespace so trivial differences don't mask real dupes.
data.jsonl:147
  duplicate of line 89 (3 copies total at lines: [89, 147, 402])
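Hashing with normalized whitespace might look like the sketch below. The choice of SHA-256 and the exact normalization (collapse every whitespace run to one space, trim the ends) are assumptions, as are the function names. Records are assumed to have passed the schema check already.

```python
import hashlib
import json
from collections import defaultdict

def normalize(text: str) -> str:
    """Collapse runs of whitespace to single spaces and trim both ends."""
    return " ".join(text.split())

def record_key(record: dict) -> str:
    """Hash the (role, normalized content) pairs of one record."""
    canonical = [
        [m.get("role"), normalize(m.get("content") or "")]
        for m in record.get("messages", [])
    ]
    blob = json.dumps(canonical, ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

def find_duplicates(lines: list[str]) -> dict[int, list[int]]:
    """Map the first line number of each duplicate group to all its 1-based line numbers."""
    groups = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        groups[record_key(json.loads(line))].append(lineno)
    return {nums[0]: nums for nums in groups.values() if len(nums) > 1}
```

Because the key is built from normalized content, a trailing newline or a doubled space no longer hides a repeated example.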
UTF-8 → latin-1 → UTF-8 round-trips look fine in your editor and ruin your model's punctuation forever.
data.jsonl:401
  "donâ€™t do it" → should be: "don't do it"
      ^^^
  latin-1 → UTF-8 round-trip artifact
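One common way to detect this kind of round-trip artifact is to reverse it: re-encode the text in the legacy codec and see whether the bytes decode as valid UTF-8. The sketch below uses Python's `cp1252` codec, on the assumption that "latin-1" here means Windows-1252 in practice (strict ISO-8859-1 has no `€` or `™`, which appear in the typical artifact). The function name is hypothetical.

```python
from typing import Optional

def repair_mojibake(text: str) -> Optional[str]:
    """Return the repaired string if text looks like UTF-8 mis-decoded as cp1252, else None."""
    try:
        # Undo the bad decode: back to the original bytes, then decode correctly.
        repaired = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in cp1252, or the bytes are not valid UTF-8:
        # the text was most likely fine to begin with.
        return None
    return repaired if repaired != text else None
```

Note that this sketch recovers the original curly apostrophe (U+2019); mapping it to a straight `'` would be a separate normalization step.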
Clean POSIX exit codes (0 clean, 1 warnings, 2 errors), plus a structured JSON report. Drop one line into your workflow and your training data gets the same gate as your code.
# .github/workflows/data.yml
- run: pip install parallelogram
- run: parallelogram check data.jsonl --json
3 errors, 1 warning in data.jsonl
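Outside CI, the same exit codes can gate a training script. The sketch below is a hypothetical wrapper (`data_gate` is not part of Parallelogram) that runs any checker command and treats exit code 1 as passable only when warnings are explicitly allowed.

```python
import subprocess

def data_gate(cmd: list[str], allow_warnings: bool = False) -> bool:
    """Run a checker command and decide whether training may proceed.

    Follows the convention described above: 0 clean, 1 warnings, 2 errors.
    """
    code = subprocess.run(cmd).returncode
    if code == 0:
        return True
    if code == 1 and allow_warnings:
        return True
    return False
```

For example, `data_gate(["parallelogram", "check", "data.jsonl"])` would return False on the run shown above. A CI step fails on any nonzero exit code, so warnings block the workflow too unless the step handles exit code 1 itself.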
$ pip install parallelogram
Add the [tokenizer] extra for the context-window check.
$ parallelogram check data.jsonl \
    --tokenizer meta-llama/Llama-3-8B \
    --max-seq-len 8192
Exits 0 if your data is fine. Otherwise, exact lines and rules.
$ parallelogram check data.jsonl \
    --output clean.jsonl
Writes only error-free records. Ready to feed your trainer.
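Conceptually, writing a clean file is a filter over the input lines. A minimal sketch, assuming a `check` callable that returns a list of problems per line (empty when the record is fine); `write_clean` and `check` are illustrative names, not the tool's internals.

```python
from typing import Callable, Iterable, TextIO

def write_clean(
    lines: Iterable[str],
    out: TextIO,
    check: Callable[[str], list],
) -> int:
    """Copy only the lines for which check() reports no problems; return how many were kept."""
    kept = 0
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and any record with at least one reported problem.
        if line and not check(line):
            out.write(line + "\n")
            kept += 1
    return kept
```

Records are copied through byte-for-byte rather than re-serialized, so the clean file differs from the input only by the lines that were dropped.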
v0.1 supports the OpenAI chat format ({"messages":[…]}). ShareGPT and raw-completion formats ship next.
Free, open source, local. No telemetry, no upload, no account.