Situation
A regulated organization replaced a rules-based decisioning system with an LLM-backed assistant. Offline evals were rigorous, but online review covered only a 1% sample of production traffic.
Constraint
Evals were drawn from historical traffic, which materially under-represented edge-case prompts that emerged after launch.
Decision Point
After two weeks of clean dashboards, an external audit surfaced a class of failures that the 1% sample had not caught. The team had to choose among three options: review a larger share of traffic, broaden the eval set, or roll back.
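A short calculation shows how a uniform 1% sample can miss a whole failure class. The sketch below assumes each request is sampled for review independently at the same rate. The function name `miss_probability` and the volume of 50 requests are hypothetical and chosen only for illustration; they are not figures from the incident.

```python
def miss_probability(sample_rate: float, occurrences: int) -> float:
    """Probability that uniform random sampling reviews none of the
    `occurrences` requests that belong to a single failure class.

    Assumes every request is sampled independently at `sample_rate`.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    if occurrences < 0:
        raise ValueError("occurrences must be non-negative")
    # Each request escapes review with probability (1 - sample_rate);
    # all of them escape with that probability raised to the count.
    return (1.0 - sample_rate) ** occurrences


# Hypothetical volume: a failure class that shows up 50 times
# while review runs at a 1% sample.
print(f"{miss_probability(0.01, 50):.3f}")
```

Under these assumptions, a class that appears 50 times has roughly a 60% chance of never reaching a reviewer, which is consistent with dashboards staying clean while the failures accumulate.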
What Changed
They moved review to 100% on the affected class while broadening the eval set, and added a structural rule: any new external-input class triggers re-evaluation regardless of summary metrics.
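The routing change can be sketched as a small policy object. This is a minimal illustration, not the team's implementation: `ReviewPolicy`, its fields, and the class labels are hypothetical, and it assumes each request already carries an input-class label. Holding a newly seen class at full review until re-evaluation finishes is also an assumption added here.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ReviewPolicy:
    """Hypothetical review router; every name here is illustrative."""
    base_rate: float = 0.01                                   # default sampled-review rate
    known_classes: set[str] = field(default_factory=set)      # classes already covered by evals
    full_review_classes: set[str] = field(default_factory=set)  # classes held at 100% review
    reeval_queue: list[str] = field(default_factory=list)     # classes awaiting re-evaluation

    def route(self, input_class: str, rng: random.Random) -> bool:
        """Return True if the request should be sent to human review."""
        if input_class not in self.known_classes:
            # Structural rule: a new external-input class triggers
            # re-evaluation, whatever the summary metrics say.
            self.known_classes.add(input_class)
            self.reeval_queue.append(input_class)
            # Assumption: keep the new class at full review for now.
            self.full_review_classes.add(input_class)
            return True
        if input_class in self.full_review_classes:
            return True  # affected or unvetted class: review everything
        # Everything else falls back to the sampled review rate.
        return rng.random() < self.base_rate


policy = ReviewPolicy(known_classes={"standard_request"})
rng = random.Random(0)
print(policy.route("unseen_request_type", rng))  # new class, so it is reviewed
print(policy.reeval_queue)
```

Encoding the rule in the routing path, rather than in a dashboard threshold, means it fires on the first request of a new class instead of waiting for an aggregate metric to move.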
Lessons Learned
<ul><li data-list-item-id="e0998b8dda17ec1a764b03d93889e61b3">Offline evals describe the past; production describes the present. Treat them as different signals.</li><li data-list-item-id="e569cc0090b20d5b99586235d5231c68e">Sample-based review can hide entire classes of failure. Bias your review toward classes you have not yet seen.</li><li data-list-item-id="e9ca56944c7fbeb085ee609783f4720ef">Add structural rules, such as "new external-input class triggers re-eval", that survive turnover.</li></ul>