Make Agent Evals Part of Your Observability

The first agent project I worked on had evals, but they all ran at CI/CD time.

What we didn't have was live evals as part of our observability. The model passed the gate on its way out the door, and then... that was it. Once it was in production, our evaluation story went quiet — a pre-flight checklist and no in-flight instruments.

Those CI evals are worth having as a golden set of regression tests for every build. But a fixed set only sees the world you imagined when writing it. It doesn't answer "how is this doing right now?" Between releases, we were flying blind.

Closing the gap: evals as observability

The fix is to pull evaluation out of the pipeline and into your observability, running continuously on real traffic. The pattern:

Capture every turn — the question, the answer, and the grounding context it used (the retrieved rows, the query, the cited sources). This is just structured logging with the right fields.
Sample recent turns and have an LLM judge them — a capable model at temperature 0, scoring each turn against the question and its grounding context, returning structured JSON.
Aggregate into rolling metrics over a trailing window, on a dashboard you actually look at.
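As a rough illustration of the capture step, here is a minimal sketch of a turn record serialized as one JSON log line. The class name, field names, and example values are all made up for this post; use whatever your logging stack and retrieval layer already expose.

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict


@dataclass
class TurnRecord:
    """One captured agent turn. All field names here are illustrative."""
    question: str
    answer: str
    # Grounding context the agent actually used for this turn.
    retrieved_rows: list = field(default_factory=list)
    query: str = ""
    cited_sources: list = field(default_factory=list)
    turn_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: float = field(default_factory=time.time)


def to_log_line(turn: TurnRecord) -> str:
    """Serialize a turn as one JSON line, ready for any structured log sink."""
    return json.dumps(asdict(turn), sort_keys=True)


# Hypothetical example data, not from a real system.
turn = TurnRecord(
    question="How many orders shipped last week?",
    answer="1,204 orders shipped last week.",
    retrieved_rows=[{"week": "2024-W18", "shipped": 1204}],
    query="SELECT week, shipped FROM weekly_orders WHERE week = '2024-W18'",
    cited_sources=["weekly_orders"],
)
print(to_log_line(turn))
```

The point is only that the grounding context travels with the question and answer. If the context is not captured at turn time, the judge has nothing to score against later.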

The judge's question is the trick: not "is this true in general?" (unanswerable) but "is this answer supported by the context the system actually retrieved?" That's mechanical, repeatable, and exactly what matters for a grounded system — and it works on live inputs, not a frozen set.
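To make the judge's question concrete, here is one possible sketch of a judge call. The prompt wording, the JSON keys, and the `call_model` parameter are assumptions of mine: `call_model` stands in for whatever client you use to call the judge model at temperature 0, and a canned stub takes its place here so the example runs offline.

```python
import json
from typing import Callable

# Illustrative rubric; the wording and JSON keys are assumptions.
JUDGE_PROMPT = """You are grading an assistant's answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}

Is the answer supported by the retrieved context? Reply with JSON only:
{{"grounding": <integer 1-5>, "instruction_following": <integer 1-5>,
  "coherence": <integer 1-5>, "safe": <true or false>}}"""


def judge_turn(turn: dict, call_model: Callable[[str], str]) -> dict:
    """Score one turn against its own captured context.

    `call_model` is a placeholder: it takes the prompt text and returns
    the judge model's raw text reply.
    """
    prompt = JUDGE_PROMPT.format(
        question=turn["question"],
        context=json.dumps(turn["context"]),
        answer=turn["answer"],
    )
    scores = json.loads(call_model(prompt))
    # Validate the structured output before trusting it.
    for key in ("grounding", "instruction_following", "coherence"):
        value = scores.get(key)
        if type(value) is not int or not 1 <= value <= 5:
            raise ValueError(f"{key} must be an integer 1-5, got {value!r}")
    if type(scores.get("safe")) is not bool:
        raise ValueError("safe must be true or false")
    return scores


def fake_judge(prompt: str) -> str:
    # Stand-in for a real model call, returning a canned reply.
    return '{"grounding": 5, "instruction_following": 4, "coherence": 5, "safe": true}'


example_turn = {
    "question": "How many orders shipped last week?",
    "context": [{"week": "2024-W18", "shipped": 1204}],
    "answer": "1,204 orders shipped last week.",
}
result = judge_turn(example_turn, fake_judge)
print(result)
```

Validating the reply matters more than it looks: a judge that occasionally returns malformed JSON or a score of 7 will quietly skew the averages if you let it through.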

The measures

Raw judge scores are typically 1–5 (safety is binary), normalized to 0–1 and averaged over the window.

Grounding accuracy — is the answer supported by the retrieved context, or did the model embellish? Scored against the captured context; turns with no context to ground against are excluded, not scored. Decent ≥ 0.85; worry below 0.70.
Hallucination rate — the inverse framing, literally 1 - grounding. Same signal, but "3% hallucinated" and "97% grounded" land differently with different audiences. Decent ≤ 0.15; worry above 0.30.
Instruction-following — did it answer the question that was asked? An answer can be perfectly true and still fail by addressing an adjacent question. Decent ≥ 0.90; worry below 0.75 — usually a routing or prompt issue.
Safety / appropriateness — binary per turn; the metric is the fraction safe. Decent = 1.0; worry below 1.0, full stop. A single unsafe turn is an incident to investigate, not a number to average away.
Coherence — clear, well-formed, professional? A useful quality signal, especially where groundedness can't be computed.
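Here is a small sketch of how those measures might be rolled up over one window. It assumes a linear mapping of the 1–5 scale onto 0–1 (1 becomes 0.0, 5 becomes 1.0), which is one reasonable reading of "normalized" rather than the only one, and it assumes a turn with no context carries `None` for grounding. The function and key names are illustrative.

```python
from statistics import mean


def normalize(score: int) -> float:
    """Map a 1-5 judge score onto 0-1 (assumed linear: 1 -> 0.0, 5 -> 1.0)."""
    return (score - 1) / 4


def window_metrics(judged: list) -> dict:
    """Aggregate judged turns from one trailing window.

    Each item looks like:
      {"grounding": int or None, "instruction_following": int,
       "coherence": int, "safe": bool}
    where grounding is None if the turn had no context to ground against.
    """
    if not judged:
        raise ValueError("no judged turns in window")
    # Turns without context are excluded from grounding, not scored as zero.
    grounded = [normalize(t["grounding"]) for t in judged if t["grounding"] is not None]
    grounding = mean(grounded) if grounded else None
    return {
        "n": len(judged),
        "n_grounded": len(grounded),
        "grounding": grounding,
        "hallucination": None if grounding is None else 1 - grounding,
        "instruction_following": mean(normalize(t["instruction_following"]) for t in judged),
        "coherence": mean(normalize(t["coherence"]) for t in judged),
        "safety": mean(1.0 if t["safe"] else 0.0 for t in judged),
    }


# Made-up judged turns for illustration.
judged = [
    {"grounding": 5, "instruction_following": 5, "coherence": 4, "safe": True},
    {"grounding": 3, "instruction_following": 5, "coherence": 5, "safe": True},
    {"grounding": None, "instruction_following": 4, "coherence": 5, "safe": True},
]
metrics = window_metrics(judged)
print(metrics["grounding"], metrics["hallucination"], metrics["n_grounded"])  # 0.75 0.25 2
```

Note that the grounding average carries its own count (`n_grounded`), since it is computed over fewer turns than the other measures.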

Read every metric with its sample size: 0.92 over n = 4 is noise; over n = 400 it's a trend. And remember the headline is an average — averages hide the very failures you most want to see.
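One way to keep sample size in view is to report a rough margin of error next to every mean. The sketch below uses a plain normal approximation, which is crude at very small n (a t-interval or a bootstrap would be more defensible), so treat it as a sanity check and not a statistical guarantee. The function name and example scores are made up.

```python
import math
from statistics import mean, stdev


def mean_with_margin(scores: list, z: float = 1.96) -> tuple:
    """Return (mean, approximate 95% margin) for normalized 0-1 scores.

    Uses a normal approximation: margin = z * stdev / sqrt(n).
    With fewer than two scores the margin is reported as infinite.
    """
    if len(scores) < 2:
        return (mean(scores) if scores else float("nan"), math.inf)
    return mean(scores), z * stdev(scores) / math.sqrt(len(scores))


small = [1.0, 1.0, 1.0, 0.75]   # n = 4
large = small * 100             # same mix of scores, n = 400

small_mean, small_margin = mean_with_margin(small)
large_mean, large_margin = mean_with_margin(large)
print(f"n=4:   {small_mean:.3f} +/- {small_margin:.3f}")
print(f"n=400: {large_mean:.3f} +/- {large_margin:.3f}")
```

Both windows have the same mean, but the margin at n = 4 is roughly ten times wider than at n = 400, which is the difference between noise and a trend.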

Evals aren't guardrails (a quick aside)

Worth separating two things people lump together. Guardrails act in real time (grounding by construction; screening prompts and responses for injection, data leakage, and harmful content), and they prevent. Evals measure. CI evals and live evals are both measurement; one runs offline against a fixed set, the other runs online against reality. This post is about that second split, offline versus online, but don't mistake either kind of eval for a guardrail. Detection is not prevention.

What to improve: tuning evals over time

A first live-eval setup is an instrument you tune, not a box you check. Where the real work is:

Baseline before you threshold. The "0.85 / 0.70" lines above are reasonable defaults, not universal truths. Run for a while, learn your system's normal, and alert on deviation from your own baseline. A drop from a steady 0.94 to 0.88 is a louder signal than sitting at a "passing" 0.86 forever.
Calibrate the judge, and re-calibrate it. LLM-as-judge has bias and variance; it can be wrong. Periodically hand-label a sample and measure human/judge agreement. When they diverge, fix the rubric or change the judge — and re-calibrate whenever you swap the judge model, because you've changed your ruler.
Watch deltas, not absolutes. The most actionable signal is movement after a change — a prompt edit, a model upgrade, new retrieval data. Alert on week-over-week regressions, not just static lines.
Segment; don't average. One global score buries a regression in a single route or intent. Break metrics out by channel, intent, and user type.
Grow a golden set from real failures. Every low-scoring or escalated turn becomes a regression test — which, neatly, flows your live findings back into your CI evals. The two layers feed each other: production teaches the test set what it was missing.
Add task-specific measures as you learn them. Start with generic groundedness and safety; layer on citation precision/recall, tool-call correctness, refusal appropriateness, and latency SLOs as you discover what actually breaks.
Mind cost and cadence. Judging every turn with a large model gets expensive. Sample (but stratify so rare, high-stakes paths stay covered), use a cheaper judge for routine traffic, and run scoring both on a schedule and on-demand so you're never blind right after a change.
Close the loop, or don't bother. Low scores should drive prompt, retrieval, and fine-tuning fixes — and then you re-measure to confirm the number moved. A dashboard nobody acts on is theater.

When to tune, concretely: re-baseline after any model/prompt/retrieval change; trust a metric only once its sample size earns it; investigate any sustained delta; and add a new measure the first time production surprises you.
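A minimal sketch of what that check could look like in code: compare each segment's current window against its own baseline, and skip windows too small to trust. The `max_drop` and `min_n` defaults, the segment names, and the numbers are placeholders I picked for illustration. This checks a single window, so a real version would also want to see the drop persist before calling it sustained.

```python
from dataclasses import dataclass


@dataclass
class WindowStat:
    segment: str   # e.g. a route, intent, or channel
    value: float   # metric averaged over the window, 0-1
    n: int         # judged turns in the window


def find_regressions(baseline: dict, current: list,
                     max_drop: float = 0.05, min_n: int = 100) -> list:
    """Flag segments whose metric fell more than `max_drop` below baseline.

    Windows with fewer than `min_n` judged turns are skipped as too noisy.
    Both thresholds are illustrative defaults, not recommendations.
    """
    flagged = []
    for stat in current:
        base = baseline.get(stat.segment)
        if base is None or stat.n < min_n:
            continue
        drop = base - stat.value
        if drop > max_drop:
            flagged.append((stat.segment, round(drop, 4)))
    return flagged


# Made-up grounding baselines and current windows.
baseline = {"billing": 0.94, "search": 0.91, "onboarding": 0.90}
current = [
    WindowStat("billing", 0.88, 400),     # real drop, enough samples
    WindowStat("search", 0.90, 350),      # within normal variation
    WindowStat("onboarding", 0.70, 4),    # big drop, but n is too small
]
print(find_regressions(baseline, current))  # [('billing', 0.06)]
```

Here billing is flagged even though 0.88 would clear a static 0.85 line, and onboarding is left alone until its sample size earns a verdict.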

The lesson from that first agent

CI evals gate the door. Live evals watch the room. We'd built the gate and called it done — and for a while, between deploys, nobody was watching the room.

If your evals only run in the pipeline, you don't have an observability problem yet. You have one waiting to happen. Capture real traffic, judge a sample of it continuously, and feed what you learn back into the gate. Then keep tuning the instrument, because the test set you wrote on day one already knows less than your users do.

None of this is straightforward, I know. I'm still working on implementing all of it, but I think it will be worth it.