When the model passed every eval and still failed in the wild

Situation

A regulated organization replaced a rules-based decisioning system with an LLM-backed assistant. Offline evals were rigorous, but online review was limited to a 1% sample of production traffic.

Constraint

Evals were drawn from historical traffic, which materially under-represented edge-case prompts that emerged after launch.

Decision Point

After two weeks of clean dashboards, an external audit surfaced a class of failures invisible to the 1% sample. The team had to choose: increase review coverage, broaden the eval set, or roll back.
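
A rough way to see how a 1% sample can miss a whole class: if a failure class occurs n times during the review window and each request is sampled independently with probability p, the chance that none of those occurrences is reviewed is (1 - p)^n. The sketch below is illustrative only. The occurrence counts are assumptions, not figures from the audit, and `miss_probability` is a hypothetical helper name.

```python
def miss_probability(n_occurrences: int, sample_rate: float) -> float:
    """Chance that none of n independent occurrences lands in the review sample."""
    if n_occurrences < 0:
        raise ValueError("n_occurrences must be non-negative")
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be in [0, 1]")
    return (1.0 - sample_rate) ** n_occurrences


if __name__ == "__main__":
    # Assumed occurrence counts for a rare class over the review window.
    for n in (10, 50, 200):
        print(f"{n:>4} occurrences at 1% sampling: "
              f"P(never reviewed) = {miss_probability(n, 0.01):.3f}")
    # Prints 0.904, 0.605 and 0.134 respectively.
```

Even when an occurrence is sampled, a single reviewed instance may not be recognized as part of a class, so these numbers are best read as an optimistic bound on detection.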

What Changed

They moved review to 100% on the affected class while broadening the eval set, and added a structural rule: any new external-input class triggers re-evaluation regardless of summary metrics.
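
The change can be sketched as a small routing policy. This is a minimal illustration under stated assumptions, not the team's implementation: `ReviewPolicy`, its method names, and the class labels are hypothetical, requests are assumed to arrive already labeled with an input class by some upstream step, and reviewing a new class at 100% until it has been re-evaluated is an added assumption beyond what the write-up states.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ReviewPolicy:
    """Hypothetical per-class review routing with a structural re-eval trigger."""

    baseline_rate: float = 0.01
    known_classes: set[str] = field(default_factory=set)
    full_review_classes: set[str] = field(default_factory=set)
    pending_reeval: set[str] = field(default_factory=set)

    def flag_affected(self, input_class: str) -> None:
        """Move a class with known failures (e.g. audit-surfaced) to 100% review."""
        self.known_classes.add(input_class)
        self.full_review_classes.add(input_class)

    def route(self, input_class: str) -> float:
        """Return the review rate for a request of the given input class."""
        if input_class not in self.known_classes:
            # Structural rule: a new external-input class triggers re-evaluation.
            # Summary metrics are deliberately not consulted here.
            self.known_classes.add(input_class)
            self.pending_reeval.add(input_class)
            # Assumption: review the new class fully until it is re-evaluated.
            self.full_review_classes.add(input_class)
        if input_class in self.full_review_classes:
            return 1.0
        return self.baseline_rate

    def needs_reeval(self) -> bool:
        """True while any newly seen class has not been re-evaluated."""
        return bool(self.pending_reeval)

    def mark_reevaluated(self, input_class: str) -> None:
        """Record that the eval set now covers this class; return it to baseline."""
        self.pending_reeval.discard(input_class)
        self.full_review_classes.discard(input_class)


if __name__ == "__main__":
    policy = ReviewPolicy(known_classes={"standard_application"})
    policy.flag_affected("audit_flagged_class")
    print(policy.route("standard_application"))  # baseline rate
    print(policy.route("audit_flagged_class"))   # full review
    print(policy.route("forwarded_document"))    # unseen class: full review
    print(policy.needs_reeval())                 # re-eval now pending
```

Because `needs_reeval` takes no metrics as input, a clean dashboard cannot clear it. Only an explicit `mark_reevaluated` call does.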

Lessons Learned

<ul><li data-list-item-id="e0998b8dda17ec1a764b03d93889e61b3">Offline evals describe the past; production describes the present. Treat them as different signals.</li><li data-list-item-id="e569cc0090b20d5b99586235d5231c68e">Uniform sample-based review can miss an entire class of failures. Bias your review toward classes you have not yet seen.</li><li data-list-item-id="e9ca56944c7fbeb085ee609783f4720ef">Add structural rules, such as "new external-input class triggers re-eval", that survive team turnover.</li></ul>