The eval pipeline is the product

A demo is a model. A product is a model + evals + fallback + cost cap + abuse detection + observability. The 5× effort to build the second is what determines whether it ships.

A friend sent me a Loom in March. Two minutes of his AI assistant handling a complex multi-step query: chain-of-thought reasoning visible, citation links back to source docs, a clarifying question when the prompt was ambiguous, a clean structured answer.

“Looks great,” I said. “How does it score on your eval suite?”

He didn’t have one.

His team is a Series B B2B SaaS. They had GPT-4o-mini in production, were paying ~$3K/month in inference, and had a chatbot their CS team was telling customers about. What they had instead of an eval suite was a Loom and a feeling.

This is the moment most production AI projects quietly become liabilities — and most teams don’t notice until something breaks publicly.

The claim

A demo is a model. A product is a model + evals + fallback + cost cap + abuse detection + observability.

The 5× effort to build the second is what separates “AI demo” from “production AI.” Most teams skip it because the demo already works.

I have been shipping production AI systems for the last two years — 30+ agents at this point, plus chatbots, RAG implementations, full A4 product builds. Without exception, the systems that survive contact with real users are the ones with serious eval pipelines behind them. The ones that don’t survive are the ones that shipped on vibes.

This essay is the long version of why.

What “no eval pipeline” actually means in production

Without an eval pipeline, your AI is shipping blind in three specific ways. Each of these has bitten me at least once on a client engagement I came in to fix.

1. You don’t know when it regresses.

A prompt change, a model version upgrade, a context length change, a retrieval re-chunk, a system prompt tweak — any of these can silently degrade performance on cases you weren’t thinking about when you made the change. The bug doesn’t manifest as an exception. It manifests as a subtly worse answer that a customer will eventually notice.

I had a client whose chatbot’s response quality dropped 11 percentage points (on a metric they weren’t tracking at the time) when they upgraded from GPT-4-turbo to GPT-4o, because GPT-4o handled their specific RAG retrieval format slightly worse on long documents. They found out three weeks later when a customer complained on their support channel. Three weeks of degraded output, with no internal signal.

2. You don’t know when you improve.

When you invest engineering time in improvements — better retrieval, smarter routing, hybrid search, fine-tuning — you have no way to prove it actually helped. You’re optimizing on vibes.

This is a different failure mode than (1) but related. Without comparable scores, every “improvement” is a hypothesis you can’t validate. You’ll ship things that feel better in your three test cases but degrade on the long tail.

3. You can’t make a deployment decision.

Should this prompt change ship? Should this new model go to production? Without a comparable score on a fixed test set with a regression threshold, the answer is always “we’ll see how it goes.” That’s just a slower, more expensive way of saying “we’ll find out from production.”

The eval pipeline is what makes deployment decisions evidence-based rather than vibes-based. Without it, you cannot tell good days from bad days, good changes from bad changes, good models from bad models.

What a real eval pipeline looks like

The eval pipelines I ship into production have three layers. Each catches different failure modes.

Layer 1: Capability evals

These measure whether the system does what it’s supposed to do on representative tasks.

Accuracy on representative cases. 200-2000 test cases that mirror the real distribution of user queries. Each case has a known correct answer (or an LLM-judge with a careful rubric). The system runs all cases on every prompt or model change.
Format compliance. If the output needs to be valid JSON, valid function calls, or a specific structured shape — measure it. Mine for the cases where the model returns prose instead of structured output.
Latency distribution. P50, P95, P99. Most teams only watch P50; the 99th percentile is where the user-visible problems live.
Cost per task. With model + prompt + context length factored in. Without this you’ll be surprised by your monthly bill.

These run on every PR that touches the AI layer. Failures block merge.
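To make the four measurements concrete, here is a minimal sketch of a report over one eval run. It assumes the system is supposed to return JSON and that exact match against the labeled answer is a good enough scorer; in practice an LLM-judge would often replace that comparison. The names `CaseResult` and `capability_report` are hypothetical, not from any library.

```python
import json
import math
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Outcome of running one eval case (field names are illustrative)."""
    expected: dict      # labeled correct answer
    raw_output: str     # what the system actually returned
    latency_s: float    # wall-clock seconds for this case
    cost_usd: float     # inference spend for this case


def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def parse_json(text):
    """Return the parsed object, or None if the output is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


def capability_report(results):
    """Summarize accuracy, format compliance, latency, and cost for one run."""
    parsed = [parse_json(r.raw_output) for r in results]
    n = len(results)
    latencies = [r.latency_s for r in results]
    return {
        # Assumption: exact match stands in for a rubric-based judge.
        "accuracy": sum(p == r.expected for p, r in zip(parsed, results)) / n,
        # Prose instead of structured output counts as a format failure.
        "format_compliance": sum(p is not None for p in parsed) / n,
        "latency_p50": percentile(latencies, 50),
        "latency_p95": percentile(latencies, 95),
        "latency_p99": percentile(latencies, 99),
        "cost_per_task": sum(r.cost_usd for r in results) / n,
    }
```

Keeping format compliance separate from accuracy matters: a case that returns the right answer as prose fails both, and you want to know which of the two problems you are fixing.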

Layer 2: Adversarial evals

These measure whether the system fails safely. The question is not whether it can be tricked, but whether the failure mode is acceptable when it is.

Refusal calibration. Over-refusal (annoying) and under-refusal (unsafe) are both failures. Measure both.
Hallucination rate. On retrieval-grounded answers, measure how often the system makes up content not present in the retrieved documents. Use an LLM-judge with a strict rubric.
Prompt injection resistance. A small but real test of whether crafted user input can subvert the system prompt. This is part of red-team output, not capability eval, but it lives here in the pipeline.
PII handling. Especially relevant for chatbots over support tickets or customer data. Mine for cases where the system echoes PII it shouldn’t.

These run weekly, not per-PR. The signal is slower to move but the consequences are higher.
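As one illustration, refusal calibration reduces to two rates computed over a labeled set. This is a rough sketch, assuming each case is labeled with whether the system should refuse and that something upstream (a judge or a classifier) has already decided whether it did. `RefusalCase` and `refusal_calibration` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RefusalCase:
    should_refuse: bool  # label from the eval set
    did_refuse: bool     # observed behaviour, decided upstream (assumption)


def refusal_calibration(cases):
    """Report both failure directions instead of a single blended score."""
    benign = [c for c in cases if not c.should_refuse]
    harmful = [c for c in cases if c.should_refuse]
    # Over-refusal: the system refused a request it should have answered.
    over = sum(c.did_refuse for c in benign) / len(benign) if benign else 0.0
    # Under-refusal: the system answered a request it should have refused.
    under = sum(not c.did_refuse for c in harmful) / len(harmful) if harmful else 0.0
    return {"over_refusal_rate": over, "under_refusal_rate": under}
```

Reporting the two rates separately is the point. A single "refusal accuracy" number lets a fix for one direction hide a regression in the other.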

Layer 3: Operational evals

These measure whether the system can actually be operated by a real team without 24/7 founder attention.

Logging completeness. Can you reconstruct what happened in any incident from logs alone? If not, your future on-call engineer (or your future self at 3am) is in trouble.
Trajectory replay. For agents — can you re-run a specific failed conversation deterministically? Most teams cannot, which makes debugging impossible.
Drift detection. Same eval set, run weekly, with a chart. Most “the model got worse” moments are actually drift in the input distribution (new user phrasings, new document formats) rather than model regression. You only know which is which if you’re measuring.
Cost drift detection. Daily spend, broken down by feature and customer. The most common production AI surprise is a 3-5× cost increase that no one noticed for two weeks.

These run continuously. They are the difference between “production-grade” and “demo-grade with extra steps.”
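Cost drift detection is the easiest of these to sketch. The version below assumes you already have daily spend per feature as a simple list of numbers; the 7-day baseline window and the 2× alert ratio are illustrative defaults, not recommendations from any particular engagement. `cost_drift_alerts` is a hypothetical name.

```python
from statistics import mean


def cost_drift_alerts(daily_spend, baseline_days=7, ratio_threshold=2.0):
    """Flag features whose latest daily spend jumped against a trailing baseline.

    daily_spend maps a key (a feature name, or a (feature, customer) tuple)
    to a chronological list of daily USD spend, with today last.
    """
    alerts = {}
    for key, series in daily_spend.items():
        if len(series) < baseline_days + 1:
            continue  # not enough history to form a baseline yet
        baseline = mean(series[-(baseline_days + 1):-1])  # excludes today
        today = series[-1]
        if baseline > 0 and today / baseline >= ratio_threshold:
            alerts[key] = round(today / baseline, 2)
    return alerts
```

Run daily, a check like this turns a 3-5× increase into a one-day surprise instead of a two-week one.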

The eval cases are the asset

Here is the thing most teams miss when they finally get around to building evals: the eval cases themselves are your most valuable asset. Not the model, not the prompt, not the agent framework — the cases.

The model can be swapped. Anthropic deprecates Claude 3, you swap to Claude 4. The prompt can be rewritten. The framework can be ripped out. The cases — your accumulated knowledge of what your users actually ask, with the correct answers labeled — that’s the work product that compounds.

I tell clients: if your office burned down and you could save one artifact from your AI codebase, save the eval suite, not the prompt. The prompt can be rewritten in a day. The eval suite represents months of accumulated production traffic, carefully labeled edge cases, and adversarial probes that someone had to come up with.

This is why I push every engagement to build the eval set first and the system second. The first sprint of an A2/A3/A4 build is always: write the eval cases. Define what “good” means before you build the thing that has to be good.

A concrete example

The invoice processing agent I built for a Series A B2B fintech last year (96.3% straight-through rate, still holding nine months later) has 1,247 eval cases. Each case is a real historical invoice with the correct extraction labeled by their ops team during the audit phase.

The eval suite runs on every prompt change, every model upgrade, every retrieval modification. It runs in CI, blocking merge if pass rate drops more than 1.5 percentage points. It runs in production weekly on the same cases plus a rolling sample of recent invoices, to catch drift.
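One way to read the two weekly numbers, sketched under my own assumptions: if the fixed set drops, the system changed; if the fixed set holds but the rolling sample drops, the inputs changed. The function name is hypothetical, and reusing the 1.5pp CI tolerance for the weekly run is an assumption made for illustration.

```python
def classify_weekly_run(fixed_now, fixed_baseline,
                        rolling_now, rolling_baseline,
                        tolerance_pp=1.5):
    """Classify a weekly run. All arguments are pass rates in percent."""
    fixed_drop = fixed_baseline - fixed_now
    rolling_drop = rolling_baseline - rolling_now
    if fixed_drop > tolerance_pp:
        # Same cases, worse score: the prompt, model, or retrieval changed.
        return "system regression"
    if rolling_drop > tolerance_pp:
        # Fixed cases still pass, recent traffic does not: inputs shifted.
        return "input drift"
    return "stable"
```

For example, `classify_weekly_run(96.0, 96.3, 91.0, 96.0)` returns `"input drift"`, because the fixed set barely moved while the rolling sample fell five points.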

Three months after launch, an upstream vendor changed how they emitted a specific PDF field. The eval caught it within seven days. Without that pipeline, we’d have caught it from a customer escalation — except in this domain, the customer is the CFO of the fintech, and the “escalation” is them noticing their AP automation is mis-categorizing invoices, which loses trust at a board meeting.

The eval suite cost us roughly two engineer-weeks to build during the initial sprint. It has prevented at least three production-degradation incidents I am aware of (and probably more I am not). The math on that ROI is comically favorable.

The blocking criteria question

An eval pipeline that doesn’t block is mostly theatre. If a failed eval doesn’t block a merge, the engineer who introduced the regression doesn’t see the signal at decision time. They see it in a Slack channel three days later, with no context.

The single most consequential design decision in an eval pipeline is what counts as a blocking failure. Two patterns work:

Hard threshold. “Accuracy must not drop more than 2 percentage points on the baseline suite.” Simple, predictable, easy to reason about. Doesn’t handle the case where you’re shipping a feature that intentionally trades capability for safety.

Per-category regression check. “No category may drop more than 5 percentage points.” Catches the regression-buried-in-improvement case where overall accuracy improves but a specific user segment got dramatically worse.

I default to per-category with a tighter overall threshold (2pp overall, 5pp per category). It catches more issues without adding intolerable friction.
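A minimal sketch of that combined gate, assuming each run is summarized as pass counts per category. The function name `gate` and the input shape are hypothetical; only the 2pp and 5pp thresholds come from the text above.

```python
def gate(baseline, candidate, overall_max_drop_pp=2.0, category_max_drop_pp=5.0):
    """Compare a candidate run to the baseline.

    Both arguments map category -> (passed, total). Returns (ok, reasons).
    """
    def rate(passed, total):
        return 100.0 * passed / total

    def overall(scores):
        passed = sum(p for p, _ in scores.values())
        total = sum(t for _, t in scores.values())
        return rate(passed, total)

    reasons = []
    drop = overall(baseline) - overall(candidate)
    if drop > overall_max_drop_pp:
        reasons.append(f"overall dropped {drop:.1f}pp")
    for category, (passed, total) in baseline.items():
        if category not in candidate:
            reasons.append(f"{category}: missing from candidate run")
            continue
        cat_drop = rate(passed, total) - rate(*candidate[category])
        if cat_drop > category_max_drop_pp:
            reasons.append(f"{category} dropped {cat_drop:.1f}pp")
    return (not reasons, reasons)
```

In CI, the job would exit non-zero whenever `ok` is false and print `reasons`, so the engineer sees which category regressed at the moment they are deciding whether to merge.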

Whichever pattern you pick, pick one and enforce it. On any team, an eval pipeline that warns but doesn’t block is equivalent to no eval pipeline by week three. The discipline has to be built into the tools, not the engineers.

The cost math

The most common pushback when I tell a team to build an eval suite before the build is: “We don’t have time. Ship the demo, build evals later.”

The eval pipeline costs ~2 engineer-weeks for a typical build.

Each production AI incident costs:

Detection time: days to weeks before you notice (without evals).
Engineering time: 1-2 weeks of investigation and remediation, often involving the founders.
Trust cost: weeks to months of customer skepticism.
Hard cost (if customer-facing failure): 0 to 6 figures, depending on severity.

Two engineer-weeks against the expected cost of one incident is comically favorable. Two engineer-weeks against the expected cost of three to five incidents over the next 12 months is the actual comparison. Every team I work with hits at least three of these in year one without eval-gated deployment.

Build the evals first. The math is not close.

What this looks like at scale

The pattern I run on A3 and A4 engagements:

Audit phase (weeks 1-2): Define what “good” means. Sample 200-500 representative cases from production logs (or, if pre-launch, from the user research / customer support ticket archive). Label them with the ops team or domain experts.
Build phase, sprint 1: Implement the eval harness. Wire it to CI. Set the blocking thresholds. Now start building the actual system, with eval pass rate as the success metric every sprint.
Build phase, ongoing: Every prompt change runs the suite. Failed cases get added to the regression set. New customer-reported issues get a case added before the fix ships.
Operate phase: Weekly eval run on rolling production sample. Alert on >2pp drop. Quarterly review with the client to expand the case library.
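The "failed cases get added to the regression set" step can be as small as an append to a file. Here is a sketch that assumes the case library lives in a JSONL file and that every case carries a stable ID; the `EvalCase` fields and the `add_to_regression_set` helper are hypothetical, chosen only for illustration.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class EvalCase:
    case_id: str
    query: str
    expected: str
    source: str  # e.g. "ci_failure" or "customer_report" (illustrative values)


def add_to_regression_set(path, case):
    """Append a case to a JSONL file unless its case_id is already present.

    Returns True if the case was added, False if it was a duplicate.
    """
    path = Path(path)
    existing = set()
    if path.exists():
        with path.open(encoding="utf-8") as f:
            existing = {json.loads(line)["case_id"] for line in f if line.strip()}
    if case.case_id in existing:
        return False
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(case)) + "\n")
    return True
```

A plain file under version control is a deliberate choice in this sketch: new cases show up in code review next to the fix they guard.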

This isn’t novel methodology. It’s what every team I respect in production AI is doing. The Anthropic team writes about it. The OpenAI team writes about it. The companies you’ve heard of building serious AI features are all doing some version of this.

The teams that aren’t doing it are the ones whose AI features quietly underperform, get blamed on “the model,” and eventually get cut from the roadmap because nobody could prove they were working.

What I tell teams who push back

The most common objection isn’t “evals don’t help” — most engineers know they help. The objection is “we don’t have the labeled data.”

Three responses:

You have more labeled data than you think. Customer support tickets are labeled examples of what users ask. Bug reports are labeled examples of failures. Sales transcripts are labeled examples of conversation flow. You have to spend a day or two extracting it, but it’s there.

Synthetic cases are useful even before you have real ones. Have an LLM generate adversarial cases against your spec. Have a domain expert hand-write the first 50. You don’t need 1,000 cases on day one. You need 50 good ones and a discipline to grow them.

The cost of not doing it is higher than the cost of doing it imperfectly. A 200-case eval suite that catches 60% of regressions is dramatically better than no eval suite, which catches 0%. Don’t let perfect be the enemy of “we will know when something breaks.”

The closing argument

If you cannot measure whether your AI system is improving or regressing, you cannot ship it.

You can stand it up. You can put it in production. You can demo it. You can pay for inference. But you cannot ship it — in the sense of “deploy with confidence that it does what it’s supposed to” — without an eval pipeline.

The eval pipeline isn’t a nice-to-have you add later when the team grows. It’s the load-bearing piece of infrastructure that makes everything else trustworthy. Build it first, build it well, and treat the eval cases as the most valuable artifact your team produces.

The model is a commodity. The prompt is a draft. The eval suite is the product.

If you’re shipping an AI feature and you’d like an outside read on what your eval pipeline is missing — that’s exactly what the A1 audit is. Two weeks, written deliverable, no slides. Book a 30-min call or email hello@mishrasiddharth.com.