We lead with proof. Every deliverable runs the same pipeline — authored goldens, evidence-grounded verifiers, multi-stage QA, and reproducible benchmarking.
We align on the capability under test, the deliverable schema, and what 'good' means before any data is produced.
Senior authors write the reference goldens — ideal responses and trajectories — that everything else is measured against.
We author rubric verifiers in which every criterion carries its justification and supporting evidence, calibrated to the difficulty target.
Contributors execute the task under clear guidelines; every submission is traceable and evidence-grounded.
Quality leads review submissions in layers — catching noise, calibration drift, and spec violations before delivery.
Where applicable, we run candidate models in Docker with pytest-based scoring for fully reproducible results.
Final artifacts ship as structured JSON with patches, logs, and taxonomies — ready to drop into your pipeline.
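As a rough illustration of the steps above, a single rubric judgment and its JSON artifact might look like the sketch below. The class name, field names, and `to_artifact` helper are assumptions made for this example, not the actual deliverable schema, which is agreed per engagement.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class RubricVerdict:
    """One rubric judgment. All field names are illustrative assumptions."""
    criterion: str       # what is being checked
    justification: str   # why the verdict was reached
    evidence: str        # quote or pointer that grounds the verdict
    mandatory: bool      # whether the criterion is mandatory
    passed: bool         # outcome of the check


def to_artifact(task_id: str, verdicts: list[RubricVerdict]) -> str:
    """Serialize verdicts to a JSON string, rejecting any that cite no evidence."""
    for v in verdicts:
        if not v.evidence.strip():
            raise ValueError(f"verdict for {v.criterion!r} cites no evidence")
    payload = {"task_id": task_id, "verdicts": [asdict(v) for v in verdicts]}
    return json.dumps(payload, indent=2, sort_keys=True)
```

Rejecting evidence-free verdicts at serialization time is one way to keep every judgment traceable; a real pipeline could equally enforce this with a schema validator.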
Six principles behind every workstream.
Human-authored reference answers and trajectories.
Rubric verifiers with criterion, justification, evidence.
Deliverables in exact, drop-in schemas.
Reproducible benchmarking, every run.
Every judgment cites its evidence.
Frontier models fail ≥25% of mandatory criteria.
Our goldens, taxonomies, and verifiers are calibrated so frontier models fail ≥25% of mandatory criteria while human goldens pass 100%.
Fail-to-pass (F2P) construction combined with verifier calibration yields a discriminating signal.
Final outputs ship as structured JSON with everything needed to reproduce them.
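The calibration target described above (frontier models fail at least 25% of mandatory criteria, human goldens pass all of them) can be expressed as a simple gate. This is a minimal sketch: the function names and the `mandatory`/`passed` keys are hypothetical, and it assumes the rate is aggregated over all mandatory criteria in a result set, which the text does not specify.

```python
def mandatory_fail_rate(results: list[dict]) -> float:
    """Fraction of mandatory criteria that failed in one result set."""
    mandatory = [r for r in results if r["mandatory"]]
    if not mandatory:
        raise ValueError("no mandatory criteria to score")
    failed = sum(1 for r in mandatory if not r["passed"])
    return failed / len(mandatory)


def is_calibrated(model_results: list[dict],
                  golden_results: list[dict],
                  min_model_fail: float = 0.25) -> bool:
    """True when the model fails enough mandatory criteria and the golden fails none."""
    return (mandatory_fail_rate(model_results) >= min_model_fail
            and mandatory_fail_rate(golden_results) == 0.0)
```

For example, a model that fails one of four mandatory criteria sits exactly at the 25% threshold and passes the gate, provided the golden passes every mandatory criterion.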
Two ways in — whether you have work to ship or want to contribute to it.