v0.4.2 · open source · local CLI

Catch fine-tuning dataset bugs before they poison training.

Parallelogram is an open-source CLI that checks OpenAI chat JSONL and Qwen/ShareGPT datasets for broken role sequences, empty messages, duplicates, encoding artifacts, and context-window overflows. It runs locally, before the training run.

$ pip install parallelogram

If it exits 0 with all rules enabled, your run won't fail on any of the data problems these rules check for.

dirty.jsonl · 3 issues
 41  {"messages":[…,{"role":"assistant","content":"Three breaking changes: …"}]}
 42  {"messages":[…,{"role":"user","content":"Actually, make it shorter."}]}  ← ends on 'user'
 97  {"messages":[…,"Donâ€™t worry, itâ€™s handled."]}  ← mojibake
131  {"messages":[{"role":"user","content":"[full 40-page transcript…]"},…]}  ← 5234 > 4096 tokens
547 records · nothing here crashes a trainer
parallelogram check preflight
clean.jsonl · 546 records
 41  {"messages":[…,{"role":"assistant","content":"Three breaking changes: …"}]}
     line 42 dropped → roles (unfixable)
 97  {"messages":[…,"Don't worry, it's handled."]}  ← repaired
131  {"messages":[{"role":"user","content":"[transcript, truncated to fit]"},…]}  ← fits 4096 ✓
546 records · exit 0 on re-check

02 · Ecosystem

Works with your fine-tuning stack.

Parallelogram is designed to sit between your dataset and anything that trains on it — no SDK, no lock-in. It reads the formats and tokenizers your stack already uses, and its exit codes drop into any CI.

  • OpenAI chat JSONL: format · supported
  • ShareGPT: format · supported
  • tiktoken: exact tokenizer
  • Hugging Face tokenizers: exact tokenizer
  • Axolotl: sits before
  • TRL: sits before
  • Unsloth: sits before
  • GitHub Actions: CI gate
  • PyPI: distribution
  • GitHub: source

Compatibility means file formats, tokenizers, and exit codes — no partnership or endorsement implied.

03 · Silent failure modes

Your dataset can be broken without looking broken.

Bad JSONL doesn't always crash the trainer. It trains anyway — on poisoned data. Eight failure modes Parallelogram catches:

Broken role alternation

{"role":"user","content":"Try again."},
{"role":"user","content":"Hello? Anyone?"}

Alternation is the supervision signal — doubled turns teach the model that user messages answer user messages.

[roles] ERROR
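
A minimal sketch of how a doubled-turn check could work, assuming each record has already been reduced to a list of role/content dicts. The helper name `find_doubled_turns` is hypothetical and is not Parallelogram's API.

```python
def find_doubled_turns(messages):
    """Return indexes of messages that repeat the previous message's role.

    Hypothetical helper for illustration; not Parallelogram's actual code.
    """
    return [
        i
        for i in range(1, len(messages))
        if messages[i]["role"] == messages[i - 1]["role"]
    ]


record = [
    {"role": "user", "content": "Try again."},
    {"role": "user", "content": "Hello? Anyone?"},
]
print(find_doubled_turns(record))  # [1]
```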

Empty assistant messages

{"role":"user","content":"Explain RoPE."},
{"role":"assistant","content":"   "}

A whitespace-only target slips past most loaders and trains the model that silence is a valid completion.

[empty-content] ERROR
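
As a rough illustration, a whitespace-only target can be caught by stripping each message before testing it. The function name below is hypothetical, and the sketch assumes content is a string or missing.

```python
def empty_message_indexes(messages):
    """Indexes of messages whose content is missing, empty, or whitespace-only.

    Hypothetical helper; assumes "content" is a string or None.
    """
    return [
        i
        for i, message in enumerate(messages)
        if not (message.get("content") or "").strip()
    ]


record = [
    {"role": "user", "content": "Explain RoPE."},
    {"role": "assistant", "content": "   "},
]
print(empty_message_indexes(record))  # [1]
```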

Exact duplicates

line  89: {"messages":[…]}   # hash 9f31c2…
line 147: {"messages":[…]}   # hash 9f31c2…

Repeated records push the model toward memorization and quietly inflate any eval that samples the same file.

[duplicates] ERROR
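
One way to find exact duplicates is to hash each record after collapsing whitespace and remember the first line that produced each hash. The sketch below assumes SHA-256 and hypothetical helper names; the hash function Parallelogram actually uses is not stated here.

```python
import hashlib
import json


def record_hash(messages):
    """Hash a message list after collapsing runs of whitespace (assumed scheme)."""
    normalized = [
        {"role": m["role"], "content": " ".join(m["content"].split())}
        for m in messages
    ]
    payload = json.dumps(normalized, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def find_duplicates(records):
    """Return (line_number, first_seen_line) pairs for repeated records."""
    first_seen = {}
    duplicates = []
    for line_no, messages in enumerate(records, start=1):
        digest = record_hash(messages)
        if digest in first_seen:
            duplicates.append((line_no, first_seen[digest]))
        else:
            first_seen[digest] = line_no
    return duplicates
```

Keeping the first occurrence and reporting later ones matches the "first occurrence kept" behavior described for the fix.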

Invalid JSONL

{"messages":[{"role":"user","content":"ok"}
            # unterminated — not JSON

One truncated line either fails the whole ingest or gets skipped silently — which one depends on the loader, not on you.

[schema] ERROR
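
A small sketch of line-by-line parsing that reports a broken line instead of aborting or silently skipping it. `parse_jsonl` is a hypothetical name, and skipping blank lines is a choice made for this example only.

```python
import json


def parse_jsonl(text):
    """Parse JSONL text into (records, errors), continuing past bad lines."""
    records, errors = [], []
    for line_no, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # blank lines are ignored in this sketch
        try:
            records.append((line_no, json.loads(line)))
        except json.JSONDecodeError as exc:
            errors.append((line_no, exc.msg))
    return records, errors


sample = (
    '{"messages":[{"role":"user","content":"ok"}]}\n'
    '{"messages":[{"role":"user","content":"ok"}\n'  # unterminated
)
records, errors = parse_jsonl(sample)
print(len(records), [line_no for line_no, _ in errors])  # 1 [2]
```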

Mojibake / encoding artifacts

"content":"Donâ€™t merge yet"
           # UTF-8 → latin-1 → UTF-8

Round-trip artifacts look almost right in an editor and bake corrupted punctuation directly into the weights.

[encoding] WARNING
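
The round trip can often be reversed by re-encoding the text with the wrong codec and decoding it as UTF-8 again. The sketch below assumes the intermediate codec was Windows-1252 (what many tools mean by "latin-1"); `repair_mojibake` is a hypothetical helper, not the repair Parallelogram ships.

```python
def repair_mojibake(text):
    """Undo one UTF-8 -> cp1252 -> UTF-8 round trip; return text unchanged otherwise."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not a clean round-trip artifact, so leave it alone


# Build a broken string the same way the artifact arises in practice
broken = "Don\u2019t merge yet".encode("utf-8").decode("cp1252")
print(repair_mojibake(broken) == "Don\u2019t merge yet")  # True
```

Text that is already correct fails the reverse trip and is returned untouched, which keeps the repair from damaging clean records.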

Context-window overflow

record 131:  5234 tokens
max_seq_len: 4096 tokens

TRL truncates oversized records without a word — usually severing the assistant turn you were training on.

[context-window] ERROR

Format drift (OpenAI chat vs ShareGPT)

{"messages":[{"role":"user", …}]}
{"conversations":[{"from":"human", …}]}

Parallelogram detects both shapes before the rules run, then normalizes them into one internal message list.

[format] AUTO normalized at the parse boundary
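
A sketch of what normalizing both shapes at the parse boundary might look like. The role mapping and the ShareGPT `value` key follow the common ShareGPT convention and are assumptions here, as is the function name `normalize`.

```python
# Assumed ShareGPT-to-chat role mapping; real datasets may use other labels
SHAREGPT_ROLES = {"system": "system", "human": "user", "gpt": "assistant"}


def normalize(record):
    """Return a list of {"role", "content"} dicts for either record shape."""
    if "messages" in record:
        return [
            {"role": m["role"], "content": m["content"]}
            for m in record["messages"]
        ]
    if "conversations" in record:
        return [
            {"role": SHAREGPT_ROLES[turn["from"]], "content": turn["value"]}
            for turn in record["conversations"]
        ]
    raise ValueError("unrecognized record shape")


openai_record = {"messages": [{"role": "user", "content": "Hi"}]}
sharegpt_record = {"conversations": [{"from": "human", "value": "Hi"}]}
print(normalize(openai_record) == normalize(sharegpt_record))  # True
```

Once both shapes produce the same list, every later check can ignore where the record came from.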

Conversation ends on user

{"role":"assistant","content":"Done."},
{"role":"user","content":"thanks!!"}

The final turn is the training target. Ending on a user message means there is no target at all — just wasted loss.

[roles] ERROR
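
Checking the final speaker takes one comparison. The helper below is a hypothetical illustration and treats an empty conversation as failing the check.

```python
def ends_on_assistant(messages):
    """True when the conversation has a final assistant turn to train on."""
    return bool(messages) and messages[-1]["role"] == "assistant"


record = [
    {"role": "assistant", "content": "Done."},
    {"role": "user", "content": "thanks!!"},
]
print(ends_on_assistant(record))  # False
```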

04 · Architecture

Diagnostics first. Formats disappear at the parse boundary.

Parallelogram turns OpenAI chat JSONL and ShareGPT records into one internal message list, then prints the same rule diagnostics, safe-fix notes, and output reasons for both.

[Diagram: openai-chat and sharegpt records normalize to role + content]

The console is the source of truth: raw records in, normalized diagnostics in the middle, clean output and held-back reasons on the right.

05 · Live demo

Paste nothing. Upload nothing. It runs locally — so does this demo.

The same six rules, ported to the browser. Pick a fixture or edit the JSONL yourself; the diagnostics below are exactly what the CLI prints.

One JSON record per line. Runs entirely in your browser — nothing is uploaded.

parallelogram check
--fix --output clean.jsonl

The browser port counts tokens with the ~4 chars/token estimate, so context-window findings here are WARNINGs. The CLI uses exact tokenizers when installed.

06 · Checks

Checks built for fine-tuning data, not generic JSON.

A JSON schema validator can tell you a record is well-formed. It can't tell you the conversation ends on the wrong speaker, repeats itself, or won't fit the model's context window.

| Check | Detects | Why it matters | Fixable | Severity |
| --- | --- | --- | --- | --- |
| schema validation | Non-object records, missing messages, wrong types, invalid role values | Every other rule depends on its structural guarantees, so it cannot be disabled | No; with --fix the record is dropped, with the reason | ERROR · DROPPED |
| role alternation | Doubled turns, system message out of first position | Alternation is the supervision signal; the model learns to talk to itself | No; dropped | ERROR |
| assistant-final | Conversations that end on a user message | The last turn is the training target; ending on user means no target | No; dropped | ERROR |
| empty content | Empty or whitespace-only message content | The model trains to produce silence | Yes; empty turns dropped from the record | ERROR · FIXED |
| duplicate detection | Exact-content duplicates, hashed with normalized whitespace | Memorization instead of generalization; inflated eval metrics | Yes; first occurrence kept | ERROR · FIXED |
| encoding / mojibake | BOM markers, UTF-8 → latin-1 → UTF-8 round-trip artifacts | Corrupted punctuation gets baked into the weights | Yes; BOM stripped, mojibake repaired | WARNING · FIXED |
| context-window overflow | Records exceeding --max-seq-len (default 4096) | TRL truncates these silently, usually severing the assistant turn | Yes; longest user message truncated until the record fits | ERROR (EXACT) · WARNING (ESTIMATED) |
| safe fix output | What changed: unchanged / fixed / dropped / unparseable, per record | Repairs are mechanical and re-validated; anything still erroring is dropped, never papered over | Writes to --output; --dry-run previews | FIXED · DROPPED |
| format normalization | OpenAI chat and ShareGPT shapes; anything else fails as schema | Every rule runs on one internal representation, so formats never leak into rule logic | Applied at the parse boundary, not a repair | PARSE-TIME |
| tokenizer-aware counting | Per-record token totals against the budget | Exact counts make overflow a hard gate; estimates stay honest as warnings | Feeds the context-window rule | EXACT · ESTIMATED |
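
The context-window fix listed above (truncate the longest user message until the record fits) could be sketched roughly as follows. This version counts tokens with the ~4 chars/token estimate rather than an exact tokenizer, and both function names are hypothetical; the real repair may differ in how it measures and where it cuts.

```python
import math


def estimated_record_tokens(messages):
    """Estimate tokens for a whole record at ~4 characters per token."""
    return math.ceil(sum(len(m["content"]) for m in messages) / 4)


def truncate_to_fit(messages, max_tokens):
    """Trim the longest user message until the record fits, or return None."""
    messages = [dict(m) for m in messages]  # copy so the input is not mutated
    while estimated_record_tokens(messages) > max_tokens:
        users = [m for m in messages if m["role"] == "user" and m["content"]]
        if not users:
            return None  # no user text left to trim; the record cannot be fixed
        longest = max(users, key=lambda m: len(m["content"]))
        total_chars = sum(len(m["content"]) for m in messages)
        excess_chars = total_chars - max_tokens * 4
        keep = max(0, len(longest["content"]) - excess_chars)
        longest["content"] = longest["content"][:keep]
    return messages
```

Only user messages are shortened in this sketch, so the assistant turn that serves as the training target is left intact.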

07 · Tokenizers

Context-window checks with exact tokenizers when possible.

When an exact tokenizer is available — tiktoken for OpenAI models, Hugging Face tokenizers for open-weight models — overflow is a hard ERROR. Otherwise Parallelogram falls back to a ~4 chars/token estimate and reports a WARNING. A heuristic never deletes records or fails CI. Claude models have no offline tokenizer, so they always use the estimate.

# optional — adds exact tokenizers (tiktoken + Hugging Face)
$ pip install 'parallelogram[tokenizer]'
$ parallelogram check data.jsonl --tokenizer gpt-4o
$ parallelogram check data.jsonl --tokenizer llama-3 --max-seq-len 8192
$ parallelogram check data.jsonl --tokenizer Qwen/Qwen2.5-7B --max-seq-len 32768
record tokens 0 / 4096 EXACT · o200k_base
same record · no tokenizer ~5,234 / 4096 ESTIMATED · ~4 chars/token

! train.jsonl:131 [context-window] ~5234 > 4096 tokens (estimated) — a WARNING, never a hard failure
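
The split between exact and estimated counts can be illustrated with a short sketch. It assumes the estimate is simply character length divided by four, rounded up; the function names are hypothetical and the CLI's real heuristic may differ.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Rough token count from character length (assumed heuristic)."""
    return math.ceil(len(text) / chars_per_token)


def overflow_severity(token_count, max_seq_len, exact):
    """Return None if the record fits, else ERROR for exact counts, WARNING for estimates."""
    if token_count <= max_seq_len:
        return None
    return "ERROR" if exact else "WARNING"


print(overflow_severity(5234, 4096, exact=True))   # ERROR
print(overflow_severity(5234, 4096, exact=False))  # WARNING
```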

tiktoken EXACT

gpt-4o · gpt-4.1 · o1 · o3 → o200k_base
gpt-4 · gpt-3.5 → cl100k_base

Hugging Face tokenizers EXACT

llama-3 · mistral · mixtral · qwen2.5 · gemma-2 · phi-3 · any HF repo id

no offline tokenizer ESTIMATED

Claude models · ~4 chars/token · WARNING, never ERROR

08 · Quickstart

Install. Point it at a dataset. Read the diagnostics.

No config file, no account, no network. Python ≥ 3.10.

# install — that's the whole setup
$ pip install parallelogram
# validate — format is auto-detected
$ parallelogram check data.jsonl
# repair what's mechanical, drop what isn't — with reasons
$ parallelogram check data.jsonl --fix --output clean.jsonl
# exact token counts against a budget
$ parallelogram check data.jsonl --tokenizer gpt-4o --max-seq-len 4096
exit codes
0  clean
1  warnings · partial fix
2  errors · nothing fixable

OpenAI chat JSONL and Qwen/ShareGPT files enter through the same command; the same rules and exit codes apply.

09 · Use cases

Run it wherever a dataset meets a trainer.

One command between your raw data and anything that consumes it.

Before fine-tuning jobs

Validate JSONL before upload or trainer launch. Catch schema, role, duplicate, and token-budget problems before the job queues or the GPU bill starts.

Cleaning public or synthetic data

ShareGPT dumps and generation pipelines drift in predictable ways: empty turns, bad order, duplicates, mojibake, and truncated records.

CI gate for dataset PRs

Exit codes 0/1/2 map directly to CI status. A dataset PR that breaks role order can fail the same way a code PR with a failing test does.
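
A sketch of a small wrapper a CI job might use to turn those exit codes into a pass/fail decision. The `allow_warnings` switch is a hypothetical policy choice for this example, not a CLI flag; only `parallelogram check <file>` comes from the commands shown earlier.

```python
import subprocess

# Meanings as listed under "exit codes" in the quickstart
EXIT_MEANING = {0: "clean", 1: "warnings · partial fix", 2: "errors · nothing fixable"}


def gate_passes(exit_code, allow_warnings=False):
    """Decide whether the CI step should pass for a given exit code."""
    if exit_code == 0:
        return True
    return allow_warnings and exit_code == 1


def run_check(path):
    """Run the CLI on one dataset file and return its exit code."""
    return subprocess.run(["parallelogram", "check", path]).returncode
```

In a pipeline, `gate_passes(run_check("data.jsonl"))` would fail the job on any non-zero exit unless warnings are explicitly tolerated.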

10 · Open source

Open source, local, and honest about what it can't see.

Apache-2.0. No telemetry. No upload boundary. The source is on GitHub, the package is on PyPI, and heuristic checks are labeled as such.

Apache-2.0

Permissive license

Use it anywhere, fork it, ship it. The only strings are Apache-2.0's standard license and notice requirements.

local-only

Nothing is uploaded

Your dataset never leaves your disk. No telemetry, no network — the upload boundary doesn't exist.

on PyPI

One pip install

pip install parallelogram — nothing else to set up, nothing to sign up for.

$ pip install parallelogram && parallelogram check data.jsonl