How we work

Quality is a system, not a promise

We lead with proof. Every deliverable runs the same pipeline: authored goldens, evidence-grounded verifiers, multi-stage QA, and reproducible benchmarking.

Scoping

We align on the capability under test, the deliverable schema, and what 'good' means before any data is produced.

Task spec & schema · Success criteria · Difficulty targets

Golden authoring

Senior authors write the reference goldens (ideal responses and trajectories) that everything else is measured against.

Golden trajectories · Reference answers · JSON-schema deliverables

Rubric & verifier design

We author rubric verifiers in which every criterion carries a justification and the evidence behind it, calibrated to the difficulty target.

HLI rubric verifiers · Error taxonomies · F2P tests
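As an illustration only, a rubric rating that carries criterion, justification, and evidence could be modeled as below. This is a minimal sketch, not our production schema: the class name `RubricRating`, its fields, and the helper `ungrounded` are hypothetical names chosen for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RubricRating:
    """One graded criterion. All field names are hypothetical."""
    criterion: str          # what the response must satisfy
    passed: bool            # verdict for this criterion
    justification: str      # why the verdict was reached
    evidence: list[str] = field(default_factory=list)  # quotes or log excerpts backing the verdict
    mandatory: bool = True  # whether the criterion counts toward calibration


def ungrounded(ratings: list[RubricRating]) -> list[str]:
    """Return criteria whose verdict lacks a justification or any non-empty evidence."""
    return [
        r.criterion
        for r in ratings
        if not r.justification.strip() or not any(e.strip() for e in r.evidence)
    ]


# Illustrative ratings, not real project data.
ratings = [
    RubricRating(
        criterion="Cites the failing test by name",
        passed=True,
        justification="The response names the test in its first paragraph.",
        evidence=["'test_parse_empty_input fails because...'"],
    ),
    RubricRating(
        criterion="Explains the root cause",
        passed=False,
        justification="Only the symptom is described.",
        evidence=[],  # nothing attached, so this rating gets flagged
    ),
]

print(ungrounded(ratings))
```

A check like this can run before review so that a rating without evidence never reaches a quality lead.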

Annotation

Contributors execute the task under clear guidelines; every submission is traceable and evidence-grounded.

Labeled turns · Graded responses · Evidence trails

Multi-stage QA

Quality leads review submissions in layers, catching noise, calibration drift, and spec violations before delivery.

Review passes · Calibration checks · Reviewer sign-off

Docker-reproducible benchmarking

Where applicable, we run candidate models in Docker with pytest-based scoring for fully reproducible results.

Reproducible images · Pass-rate scoring · Cost & log capture
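One way pass-rate scoring can work, sketched under assumptions: pytest runs inside the container with its `--junitxml` option, and the resulting report is parsed outside it. The sample report, the function name `pass_rate`, and the choice to leave skipped tests out of the denominator are all illustrative assumptions, not a description of our exact harness.

```python
import xml.etree.ElementTree as ET

# Illustrative JUnit-style report, shaped like the output of `pytest --junitxml=...`.
# The test names and outcomes are made up for the example.
SAMPLE_REPORT = """<testsuites>
  <testsuite name="pytest" tests="4" failures="1" errors="0" skipped="1">
    <testcase classname="tests.test_task" name="test_a"/>
    <testcase classname="tests.test_task" name="test_b">
      <failure message="assert 1 == 2">traceback text</failure>
    </testcase>
    <testcase classname="tests.test_task" name="test_c"/>
    <testcase classname="tests.test_task" name="test_d">
      <skipped message="not applicable"/>
    </testcase>
  </testsuite>
</testsuites>"""


def pass_rate(junit_xml: str) -> float:
    """Fraction of non-skipped test cases that passed (0.0 if none were scored)."""
    root = ET.fromstring(junit_xml)
    passed = failed = 0
    for case in root.iter("testcase"):
        if case.find("skipped") is not None:
            continue  # assumption: skipped tests do not count either way
        if case.find("failure") is not None or case.find("error") is not None:
            failed += 1
        else:
            passed += 1
    scored = passed + failed
    return passed / scored if scored else 0.0


print(f"pass rate: {pass_rate(SAMPLE_REPORT):.2%}")
```

Because the score is derived from a report file rather than from console output, the same image and the same report yield the same number on every rerun.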

Structured-JSON delivery

Final artifacts ship as structured JSON with patches, logs, and taxonomies, ready to drop into your pipeline.

Rubric ratings JSON · Patches & logs · Taxonomies
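For a sense of what "drop-in" can mean on the receiving side, here is a small sketch of a consumer checking a delivered record before ingesting it. The key names in `REQUIRED_KEYS`, the sample record, and the function `validate_deliverable` are hypothetical; the actual schema is agreed during scoping.

```python
import json

# Hypothetical top-level keys; the real schema is fixed per project.
REQUIRED_KEYS = {"task_id", "rubric_ratings", "patches", "logs", "taxonomy"}


def validate_deliverable(raw: str) -> dict:
    """Parse one delivered JSON record and check that the required keys exist."""
    record = json.loads(raw)
    if not isinstance(record, dict):
        raise ValueError("deliverable must be a JSON object")
    missing = REQUIRED_KEYS - set(record)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return record


# Illustrative record, not real project data.
deliverable = {
    "task_id": "example-001",
    "rubric_ratings": [
        {"criterion": "Explains the root cause", "passed": False,
         "justification": "Only the symptom is described.", "evidence": []},
    ],
    "patches": {"starter": "starter.patch", "golden": "golden.patch"},
    "logs": ["run-1.log"],
    "taxonomy": {"error_type": "incomplete-diagnosis"},
}

raw = json.dumps(deliverable, indent=2, sort_keys=True)
record = validate_deliverable(raw)
print(sorted(record))
```

A stricter setup would validate against a full JSON Schema; the key check above is only the smallest version of that idea.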

Quality pillars

What makes our data hold up

Six principles behind every workstream.

Goldens

Human-authored reference answers and trajectories.

Verifiers

Rubric verifiers with criterion, justification, evidence.

Structured JSON

Deliverables in exact, drop-in schemas.

Docker pipelines

Reproducible benchmarking, every run.

Evidence-grounded

Every judgment cites its evidence.

Calibrated difficulty

Frontier models fail ≥25% of mandatory criteria.

Calibrated difficulty, by design

Our goldens, taxonomies, and verifiers are calibrated so frontier models fail ≥25% of mandatory criteria while human goldens pass 100%.

F2P (fail-to-pass) construction + verifier calibration = discriminating signal.

Human golden passes: 100%

Frontier model fails: ≥25%
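The two thresholds above can be read as a gate on each task. The sketch below assumes the gate is applied per task against a single model's results; how results are aggregated across models or tasks is not specified here, and the names `mandatory_fail_rate` and `is_calibrated` are hypothetical.

```python
from __future__ import annotations


def mandatory_fail_rate(results: dict[str, bool], mandatory: set[str]) -> float:
    """Fraction of mandatory criteria failed. A criterion with no result counts as failed."""
    if not mandatory:
        raise ValueError("at least one mandatory criterion is required")
    failed = sum(1 for c in mandatory if not results.get(c, False))
    return failed / len(mandatory)


def is_calibrated(
    golden: dict[str, bool],
    model: dict[str, bool],
    mandatory: set[str],
    min_model_fail: float = 0.25,  # the 25% threshold stated above
) -> bool:
    """True when the golden passes every mandatory criterion and the model fails enough of them."""
    golden_ok = mandatory_fail_rate(golden, mandatory) == 0.0
    model_hard = mandatory_fail_rate(model, mandatory) >= min_model_fail
    return golden_ok and model_hard


# Illustrative criteria and verdicts, not real evaluation results.
mandatory = {"c1", "c2", "c3", "c4"}
golden = {"c1": True, "c2": True, "c3": True, "c4": True}
model = {"c1": True, "c2": False, "c3": True, "c4": True}  # fails 1 of 4

print(is_calibrated(golden, model, mandatory))
```

A task that the model passes outright, or that the golden itself fails, would be sent back for rework rather than shipped.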

Deliverables

Artifacts you can drop into your pipeline

Final outputs ship as structured JSON with everything needed to reproduce them.

Rubric ratings JSON

Starter & golden patches

Run logs & transcripts

Error & capability taxonomies

Ready to raise your data quality bar?

Two ways in — whether you have work to ship or want to contribute to it.

For clients

Scope a project, get goldens, rubrics & benchmarks.

Work with us

Book a call