v0.4.2 open source local CLI
Parallelogram is an open-source CLI that checks OpenAI chat JSONL and Qwen/ShareGPT datasets for broken role sequences, empty messages, duplicates, encoding artifacts, and context-window overflows. It runs locally, before the training run.
If it exits 0 with all rules enabled, your run won't fail on any of the data problems these rules cover.
02 · Ecosystem
Parallelogram is designed to sit between your dataset and anything that trains on it — no SDK, no lock-in. It reads the formats and tokenizers your stack already uses, and its exit codes drop into any CI.
Compatibility means file formats, tokenizers, and exit codes — no partnership or endorsement implied.
03 · Silent failure modes
Bad JSONL doesn't always crash the trainer. It trains anyway — on poisoned data. Eight failure modes Parallelogram catches:
{"role":"user","content":"Try again."},
{"role":"user","content":"Hello? Anyone?"}
Alternation is the supervision signal — doubled turns teach the model that user messages answer user messages.
[roles] ERROR
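A role-order check of this kind can be sketched in a few lines of Python. This is an illustrative sketch, not Parallelogram's implementation: the function name `role_sequence_errors` and the message wording are assumptions, and it covers only doubled turns and a misplaced system message.

```python
def role_sequence_errors(messages):
    """Return human-readable problems with the order of roles in one record."""
    errors = []
    for i, msg in enumerate(messages):
        role = msg["role"]
        # A system message is only expected as the very first message.
        if role == "system" and i != 0:
            errors.append(f"message {i}: system message out of first position")
        # Two consecutive messages from the same speaker break alternation.
        if i > 0 and role == messages[i - 1]["role"]:
            errors.append(f"message {i}: doubled {role} turn")
    return errors


record = [
    {"role": "user", "content": "Try again."},
    {"role": "user", "content": "Hello? Anyone?"},
]
print(role_sequence_errors(record))  # ['message 1: doubled user turn']
```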
{"role":"user","content":"Explain RoPE."},
{"role":"assistant","content":" "}
A whitespace-only target slips past most loaders and trains the model that silence is a valid completion.
[empty-content] ERROR
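To make the whitespace case concrete, here is a minimal sketch of an empty-content check. The helper name `empty_turns` is hypothetical; the only logic it relies on is that `str.strip()` reduces a whitespace-only string to an empty one.

```python
def empty_turns(messages):
    """Return the indices of messages whose content is missing, empty, or whitespace-only."""
    flagged = []
    for i, msg in enumerate(messages):
        content = msg.get("content")
        if content is None or (isinstance(content, str) and not content.strip()):
            flagged.append(i)
    return flagged


record = [
    {"role": "user", "content": "Explain RoPE."},
    {"role": "assistant", "content": " "},
]
print(empty_turns(record))  # [1]
```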
line 89: {"messages":[…]} # hash 9f31c2…
line 147: {"messages":[…]} # hash 9f31c2…
Repeated records push the model toward memorization and quietly inflate any eval that samples the same file.
[duplicates] ERROR
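One way to detect exact-content duplicates is to hash each record after collapsing runs of whitespace. The sketch below assumes SHA-256 and the helper names `record_key` and `find_duplicates`; the hash function and key layout Parallelogram actually uses are not stated here.

```python
import hashlib


def record_key(record):
    """Hash a record's roles and contents, with whitespace runs collapsed to one space."""
    parts = []
    for msg in record["messages"]:
        normalized = " ".join(msg["content"].split())
        parts.append(msg["role"] + "\x1f" + normalized)
    return hashlib.sha256("\x1e".join(parts).encode("utf-8")).hexdigest()


def find_duplicates(records):
    """Return (duplicate_index, first_index) pairs; the first occurrence is never flagged."""
    first_seen = {}
    duplicates = []
    for idx, record in enumerate(records):
        key = record_key(record)
        if key in first_seen:
            duplicates.append((idx, first_seen[key]))
        else:
            first_seen[key] = idx
    return duplicates


records = [
    {"messages": [{"role": "user", "content": "Explain RoPE."}]},
    {"messages": [{"role": "user", "content": "Explain ALiBi."}]},
    {"messages": [{"role": "user", "content": "Explain   RoPE.\n"}]},
]
print(find_duplicates(records))  # [(2, 0)]
```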
{"messages":[{"role":"user","content":"ok"}
# unterminated — not JSON
One truncated line either fails the whole ingest or gets skipped silently — which one depends on the loader, not on you.
[schema] ERROR
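The loader-dependent behavior goes away if every line is parsed on its own and failures are reported with their line number. A minimal sketch using only the standard library; `parse_jsonl` is a hypothetical helper, not part of the CLI:

```python
import json


def parse_jsonl(lines):
    """Parse each line independently; return (parsed, failures) keyed by 1-based line number."""
    parsed, failures = [], []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            parsed.append((lineno, json.loads(line)))
        except json.JSONDecodeError as exc:
            # Keep the reason instead of skipping the line silently.
            failures.append((lineno, exc.msg))
    return parsed, failures


lines = [
    '{"messages":[{"role":"user","content":"ok"}]}',
    '{"messages":[{"role":"user","content":"ok"}',  # truncated record
]
parsed, failures = parse_jsonl(lines)
print(len(parsed), [lineno for lineno, _ in failures])  # 1 [2]
```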
"content":"Don’t merge yet"
# UTF-8 → latin-1 → UTF-8
Round-trip artifacts look almost right in an editor and bake corrupted punctuation directly into the weights.
[encoding] WARNING
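A common heuristic for reversing this kind of round trip is to re-encode the text with the wrong codec and decode it as UTF-8 again. The sketch below is an assumption about how such a repair could work, not a description of Parallelogram's rule. It uses Windows-1252 (`cp1252`) because the characters in the example (`€`, `™`) exist in that code page rather than in strict Latin-1. Text that does not survive the round trip is returned unchanged.

```python
def repair_mojibake(text):
    """Undo a UTF-8 -> cp1252 -> UTF-8 round trip when possible; otherwise return the input."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text is not this kind of mojibake, so leave it alone.
        return text


broken = "Don\u00e2\u20ac\u2122t merge yet"  # the garbled form of "Don’t merge yet"
print(repair_mojibake(broken) == "Don\u2019t merge yet")  # True
print(repair_mojibake("café"))  # café (unchanged)
```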
record 131: 5234 tokens
max_seq_len: 4096 tokens
TRL truncates oversized records without a word — usually severing the assistant turn you were training on.
[context-window] ERROR
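The arithmetic shows why the assistant turn is usually the one that gets cut. The sketch below assumes truncation from the right and a hypothetical split of the 5234 tokens into a 3900-token user turn and a 1334-token assistant turn; only the totals come from the example above.

```python
def truncation_point(token_counts, max_seq_len):
    """Return (message_index, tokens_kept) where a right-side cut lands, or None if the record fits."""
    used = 0
    for i, count in enumerate(token_counts):
        if used + count > max_seq_len:
            return i, max_seq_len - used
        used += count
    return None


# Hypothetical per-message counts: user turn, then assistant turn (3900 + 1334 = 5234).
counts = [3900, 1334]
print(truncation_point(counts, 4096))  # (1, 196)
```

Under these assumptions the cut lands in message 1, the assistant turn, and keeps only 196 of its 1334 tokens.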
{"messages":[{"role":"user", …}]}
{"conversations":[{"from":"human", …}]}
Parallelogram detects both shapes before the rules run, then normalizes them into one internal message list.
[format] AUTO normalized at the parse boundary
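Normalizing both shapes into one list can be sketched as follows. The role mapping (`human` to `user`, `gpt` to `assistant`) and the `value` key follow the common ShareGPT convention and are assumptions here, as is the function name `normalize`; the internal representation Parallelogram uses may differ.

```python
# Assumed mapping from ShareGPT speaker names to OpenAI-style roles.
SHAREGPT_ROLES = {"human": "user", "gpt": "assistant", "system": "system"}


def normalize(record):
    """Return a list of {"role", "content"} dicts for either record shape."""
    if "messages" in record:
        return [{"role": m["role"], "content": m["content"]} for m in record["messages"]]
    if "conversations" in record:
        return [
            {"role": SHAREGPT_ROLES[turn["from"]], "content": turn["value"]}
            for turn in record["conversations"]
        ]
    raise ValueError("unrecognized record shape")


openai_record = {"messages": [{"role": "user", "content": "hi"}]}
sharegpt_record = {"conversations": [{"from": "human", "value": "hi"}]}
print(normalize(openai_record) == normalize(sharegpt_record))  # True
```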
{"role":"assistant","content":"Done."},
{"role":"user","content":"thanks!!"}
The final turn is the training target. Ending on a user message means there is no target at all — just wasted loss.
[roles] ERROR
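The assistant-final condition reduces to a check on the last message. A minimal sketch, with the hypothetical name `ends_on_assistant`:

```python
def ends_on_assistant(messages):
    """True only if the record is non-empty and its last message comes from the assistant."""
    return bool(messages) and messages[-1]["role"] == "assistant"


record = [
    {"role": "assistant", "content": "Done."},
    {"role": "user", "content": "thanks!!"},
]
print(ends_on_assistant(record))  # False
```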
04 · Architecture
Parallelogram turns OpenAI chat JSONL and ShareGPT records into one internal message list, then prints the same rule diagnostics, safe-fix notes, and output reasons for both.
The console is the source of truth: raw records in, normalized diagnostics in the middle, clean output and held-back reasons on the right.
05 · Live demo
The same six rules, ported to the browser. Pick a fixture or edit the JSONL yourself; the diagnostics below are exactly what the CLI prints.
One JSON record per line. Runs entirely in your browser — nothing is uploaded.
The browser port counts tokens with the ~4 chars/token estimate, so context-window findings here are WARNINGs. The CLI uses exact tokenizers when installed.
06 · Checks
A JSON schema validator can tell you a record is well-formed. It can't tell you the conversation ends on the wrong speaker, repeats itself, or won't fit the model's context window.
| Check | Detects | Why it matters | Fixable | Severity |
|---|---|---|---|---|
| schema validation | Non-object records, missing messages, wrong types, invalid role values | Every other rule depends on its structural guarantees, so it cannot be disabled | No: with --fix the record is dropped, with the reason | ERROR DROPPED |
| role alternation | Doubled turns, system message out of first position | Alternation is the supervision signal; the model learns to talk to itself | No — dropped | ERROR |
| assistant-final | Conversations that end on a user message | The last turn is the training target — ending on user means no target | No — dropped | ERROR |
| empty content | Empty or whitespace-only message content | The model trains to produce silence | Yes — empty turns dropped from the record | ERROR FIXED |
| duplicate detection | Exact-content duplicates, hashed with normalized whitespace | Memorization instead of generalization; inflated eval metrics | Yes — first occurrence kept | ERROR FIXED |
| encoding / mojibake | BOM markers, UTF-8 → latin-1 → UTF-8 round-trip artifacts | Corrupted punctuation gets baked into the weights | Yes — BOM stripped, mojibake repaired | WARNING FIXED |
| context-window overflow | Records exceeding --max-seq-len (default 4096) | TRL truncates these silently, usually severing the assistant turn | Yes: longest user message truncated until the record fits | ERROR (exact) / WARNING (estimated) |
| safe fix output | What changed: unchanged / fixed / dropped / unparseable, per record | Repairs are mechanical and re-validated; anything still erroring is dropped, never papered over | Writes to --output; --dry-run previews | FIXED DROPPED |
| format normalization | OpenAI chat and ShareGPT shapes; anything else fails as schema | Every rule runs on one internal representation — formats never leak into rule logic | Applied at the parse boundary, not a repair | PARSE-TIME |
| tokenizer-aware counting | Per-record token totals against the budget | Exact counts make overflow a hard gate; estimates stay honest as warnings | Feeds the context-window rule | EXACT ESTIMATED |
07 · Tokenizers
When an exact tokenizer is available — tiktoken for OpenAI models, Hugging Face tokenizers for open-weight models — overflow is a hard ERROR. Otherwise Parallelogram falls back to a ~4 chars/token estimate and reports a WARNING. A heuristic never deletes records or fails CI. Claude models have no offline tokenizer, so they always use the estimate.
# optional — adds exact tokenizers (tiktoken + Hugging Face)
$ pip install 'parallelogram[tokenizer]'
$ parallelogram check data.jsonl --tokenizer gpt-4o
$ parallelogram check data.jsonl --tokenizer llama-3 --max-seq-len 8192
$ parallelogram check data.jsonl --tokenizer Qwen/Qwen2.5-7B --max-seq-len 32768
✗ train.jsonl:131 [context-window] 5234 > 4096 tokens — assistant response at risk of truncation
! train.jsonl:131 [context-window] ~5234 > 4096 tokens (estimated) — a WARNING, never a hard failure
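The split between exact and estimated counts can be sketched as a single function that takes an optional encoder. Everything here is illustrative: `estimate_tokens` and `check_budget` are hypothetical names, and `encode` stands in for any callable that returns a token list (for example the `encode` method of a tiktoken encoding, if that library is installed).

```python
import math


def estimate_tokens(text):
    """Rough count using the ~4 characters per token rule of thumb."""
    return math.ceil(len(text) / 4)


def check_budget(text, max_seq_len, encode=None):
    """Return (severity, token_count) on overflow, or None if the text fits.

    With an exact encoder the overflow is an ERROR; with the estimate it is a WARNING.
    """
    if encode is not None:
        count, exact = len(encode(text)), True
    else:
        count, exact = estimate_tokens(text), False
    if count <= max_seq_len:
        return None
    return ("ERROR" if exact else "WARNING", count)


print(check_budget("x" * 40, 8))                  # ('WARNING', 10)
print(check_budget("a b c", 2, encode=str.split))  # ('ERROR', 3)
```

In the second call `str.split` is only a stand-in encoder that makes the example runnable without extra packages.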
08 · Quickstart
No config file, no account, no network. Python ≥ 3.10.
# install — that's the whole setup
$ pip install parallelogram
# validate — format is auto-detected
$ parallelogram check data.jsonl
# repair what's mechanical, drop what isn't — with reasons
$ parallelogram check data.jsonl --fix --output clean.jsonl
# exact token counts against a budget
$ parallelogram check data.jsonl --tokenizer gpt-4o --max-seq-len 4096
OpenAI chat JSONL and Qwen/ShareGPT files enter through the same command; the same rules and exit codes apply.
09 · Use cases
One command between your raw data and anything that consumes it.
Validate JSONL before upload or trainer launch. Catch schema, role, duplicate, and token-budget problems before the job queues or the GPU bill starts.
ShareGPT dumps and generation pipelines drift in predictable ways: empty turns, bad order, duplicates, mojibake, and truncated records.
Exit codes 0/1/2 map directly to CI status. A dataset PR that breaks role order can fail the same way a code PR with a failing test does.
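A CI step only needs to run the command and propagate its exit code. The wrapper below is a sketch under that assumption: `gate` is a hypothetical helper, and it does not interpret the difference between codes 1 and 2, since any non-zero status fails a typical CI job.

```python
import subprocess
import sys


def gate(cmd):
    """Run a command and return its exit code unchanged so CI can act on it."""
    return subprocess.run(cmd).returncode


# In a CI script this would be, for example:
#     sys.exit(gate(["parallelogram", "check", "data.jsonl"]))
# Demonstration with a stand-in child process that exits with status 2:
code = gate([sys.executable, "-c", "import sys; sys.exit(2)"])
print(code)  # 2
```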
10 · Open source
Apache-2.0. The source is on GitHub, the package is on PyPI, and heuristic checks are labeled as such.
Your dataset never leaves your disk: no telemetry, no network, no upload step.
$ pip install parallelogram && parallelogram check data.jsonl