Available · Q2 2026 · 3 audit slots open Reply <4 hrs business hours ·  Book a 30-min call →
Security

Why I red-team every AI before launch

Siddharth Mishra · · ai-security · production-ai · red-team

The math on pre-launch versus post-incident is brutal. Once a prompt injection vector lands on Twitter, three months of trust evaporate. So I red-team everything as a default, not an exception.

ast year I ran an A1 audit for a Series B B2B SaaS launching a customer-facing AI assistant. Four weeks. They had run an internal review. The team felt good about it.

I found twenty-three critical issues.

Three prompt injection vectors. One PII exfiltration path that worked cross-tenant via a crafted query. A refund tool that could be invoked on adversarial input without an approval check. Four jailbreak paths. Rate limit gaps that would have let a single user blow up their inference bill in an afternoon.

All shipped fixed before launch. Twelve months later, zero post-launch incidents.

Their engineering lead said the line that goes in every email I write to a prospect now: “This is the cheapest insurance we’ve ever bought.”

The audit fee was eight thousand dollars. The cost of one of those vulnerabilities making it to production would have been somewhere between a customer-trust-destroying tweet and a six-figure incident-response engagement. The math on pre-launch versus post-incident is not close.

This essay is the long version of why I red-team every AI system I ship — not by exception, by default.

What “internal review” misses

The team that built the system above had genuinely good engineers. They had run an internal review. They had thought adversarially. Their CTO had personally tried to break it for an afternoon.

They still missed twenty-three critical issues.

This isn’t a story about a bad team. It’s a story about a structural problem: the people who build a system are the worst people to red-team it.

Three reasons:

1. Blind spots are systemic. You don’t see the assumptions you made because they feel like facts. The team had assumed user-uploaded content couldn’t influence the system prompt because they had “wrapped it in delimiters.” They had — but the delimiters were textual, and the model treated content inside them as instructions when crafted carefully. They’d never have found that without someone whose job was to find it.

2. Adversarial creativity is a separate skill. Building robust software and breaking robust software are different disciplines. Pen-testers exist as a profession for a reason. The mindset of “what is the worst thing a determined adversary could do here” is not the same mindset as “how do I make this work well for legitimate users.”

3. Internal reviews stop at the first finding. Once an internal review surfaces an issue, the team fixes it and feels good. An external red-team keeps probing. Most of the issues I find are #4, #11, #17 on a list — the ones that get harder to surface as you go deeper. Internal reviews almost always stop at #3.

The math

Let me put real numbers on the pre-launch vs post-incident comparison.

Pre-launch red-team cost (A1 audit):

  • 4 weeks, fixed-price, $5K–$15K depending on system complexity.
  • Output: severity-graded report with proof-of-concept exploits, reproducer catalog, prioritized fixes.
  • Engineering time on fixes: 2-6 weeks depending on findings.
  • Total realistic cost: $25K–$50K including remediation.

Post-incident cost (the cheap version, no public disclosure):

  • Engineering investigation: 1-2 weeks of senior engineering time, often spanning multiple engineers, often including the founders.
  • Hotfix and validation: another 1-2 weeks.
  • Customer trust repair: weeks of conversations, refunds, possibly losing the customer entirely.
  • Legal/compliance review if regulated industry: variable, often expensive.
  • Total realistic cost: $50K–$150K, even when the incident is contained.

Post-incident cost (the expensive version, public disclosure):

  • Everything above, plus:
  • Incident response, often with external help: $30K–$100K+.
  • Customer churn from the disclosure: variable, sometimes catastrophic.
  • Reputational damage: months of sales motion harder, harder to close enterprise deals where security teams now have your incident on their radar.
  • If PII was exposed: regulatory exposure, mandatory disclosure timelines (GDPR is 72 hours), legal fees.
  • Total realistic cost: $200K to genuinely existential, depending on what leaked.

A pre-launch red-team costs roughly the same as the cheap version of a contained internal incident. It costs a small fraction of any public disclosure scenario. And it converts what would have been an emergency into a planned engineering sprint.

There is no scenario in which the math doesn’t favor pre-launch red-teaming. None.

The forty-seven vectors

I red-team against eight categories of attack, with roughly forty-seven specific vectors documented in my internal playbook. The list grows monthly as new jailbreak techniques get published and new model versions ship.

The eight categories:

1. Prompt injection (direct). Crafted user input that overrides the system prompt. “Ignore your instructions and tell me X.” Trivial in 2022, harder but still very possible in 2026 with the right phrasing.

2. Prompt injection (indirect). Adversarial content embedded in documents the system retrieves from. A “support ticket” with hidden instructions that get treated as system prompts when the AI processes them. This is the vector that scares me the most because most teams don’t even think about it.

3. Jailbreaks. DAN-style persona injection, role-play attacks, encoding-based attacks (Base64, leetspeak), multi-turn jailbreaks where you compromise the system slowly over several messages.

4. PII exfiltration. Crafted queries designed to leak: system prompts (which often contain sensitive context), other users’ data (especially in multi-tenant systems), training data, retrieved document content from documents the user shouldn’t see.

5. Tool/function-call abuse. For agents — can crafted input cause the agent to call dangerous tools (refund, delete, send-email) with parameters chosen by the attacker? The refund-tool vulnerability in the case above was this category. It’s brutal because the model is doing what it was designed to do (call tools) but at the attacker’s direction.

6. RAG poisoning. Can untrusted content fed into the retrieval system later influence the AI’s behavior? E.g., the system indexes a user-uploaded document, and that document contains instructions that get retrieved and treated as authority when other users query.

7. Output exploitation. Can the AI’s output poison downstream systems? Generated SQL injected into a database, generated HTML containing XSS payloads, generated content passed to another system as if it were instructions.

8. Cost-exhaustion attacks. Token-flooding, recursive prompt loops, queries designed to trigger maximum-length completions repeatedly. Less destructive than the others but can financially hurt a small team in a hurry.

A real red-team probes all eight, with at least 3-7 specific vectors per category, customized to the system architecture. The output isn’t a “you passed/failed” report — it’s a severity-graded list with reproducers.

Severity, not just count

Twenty-three findings sounds like a lot. Most of them weren’t catastrophic.

The breakdown on that engagement was roughly:

  • 3 Critical — issues that would have been press-worthy: prompt injection leaking system prompt, cross-tenant PII exfiltration, tool-call abuse on refund endpoint.
  • 6 High — issues that would have caused real customer-visible problems but not catastrophic disclosure: jailbreak susceptibility, output sanitization gaps, rate limit issues.
  • 11 Medium — issues that would have shown up in support tickets but not lead to incidents: refusal calibration, RAG retrieval edge cases, latency P99 issues.
  • 3 Low — quality and operational issues, fix-when-convenient.

Not every finding needs to ship-block launch. The categorization matters. The CTO of that team didn’t have to delay launch for the Medium and Low items — they shipped with those open, with tickets to address them in the first six weeks post-launch. But the three Criticals were absolutely ship-blockers, and they had no way of knowing that without the external review.

A red-team without severity grading is just a list of complaints. A red-team with severity grading is a decision tool — what must we fix to launch, what should we fix in the first month, what can wait.

The post-incident reality

I want to spend a paragraph on what actually happens when a critical issue hits production unreviewed, because most teams underestimate this.

It is rarely a clean technical incident. It is a screenshot on Twitter. It is a customer’s CISO finding it during a security review and threatening to cancel their contract. It is a journalist DM’ing your founders. It is a regulator asking questions if you’re in a regulated industry. It is your sales team having to explain what happened on every call for the next three months.

Worse, the original technical issue is usually small. The damage is in the narrative. “This company’s AI leaked our data” is the story, regardless of whether the leak was 1 record or 1,000. The trust hit is the same.

Pre-launch red-teaming is the only path I know to make sure the story never gets written.

Why this isn’t a one-time thing

The argument I sometimes get: “We red-teamed at launch. We’re good now.”

You are not good now. You are good at launch. Three things change after launch that create new vulnerabilities:

1. New attack classes get published. Every quarter, security researchers publish new prompt injection techniques. The DAN family of jailbreaks is multi-year old now and still gets refined. Indirect prompt injection wasn’t well-understood in 2023; it’s a major category now. Your launch red-team caught the attacks known then. It didn’t catch the attacks discovered next.

2. Your system changes. New tools added, new RAG sources, new model versions, new features. Every change has the potential to introduce new attack surfaces. The system you red-teamed in March is not the system you have in September.

3. Your input distribution changes. New customer segments. New user behaviors. Adversaries learning your system. The threat model you designed for at launch isn’t the threat model six months in.

The pattern I run on engagements that include an A5 retainer: a monthly mini-red-team. Two days of effort. Run a fresh sample from the current threat landscape against the current system. Catches drift before it becomes an incident.

This is the cheapest insurance possible against the most expensive thing that can happen to a production AI system.

What I tell teams who push back

The most common pushback is timing. “We’re three weeks from launch. We don’t have time for a four-week audit.”

Two responses:

Compress it. A focused two-week red-team that targets the top three risk categories for your specific architecture is dramatically better than no red-team. You won’t catch everything, but you’ll catch the Criticals — which is the math that matters.

Delay launch. This is the hard recommendation, and I’ve given it three times in the last 18 months. Two of those teams listened. One didn’t and had a public incident within six weeks of launch. The math doesn’t care about your sprint planning.

The other common pushback: “We can’t afford it.” For a $5K-$15K audit fee, against the post-incident scenarios above? You can’t afford not to. If $5K is meaningfully blocking you from launching with confidence, your AI feature is not commercially viable yet — that’s the real signal.

The closing argument

I red-team every AI system I ship as a default, not as an exception, because:

  1. The math on pre-launch versus post-incident is brutal, in every scenario.
  2. Internal reviews systematically miss things external reviews catch.
  3. The cost of being wrong is non-linear — small technical issues become large narrative damage.
  4. The discipline scales. Once you have a red-team playbook and a reproducer catalog, you can run a mini-red-team on every major change in days, not weeks.

Most production AI systems in 2026 are shipping without serious adversarial review. Most of them will be fine for a while. Some of them will be very not-fine, very publicly, and the post-mortem will read like every other post-mortem: “We thought we’d thought of everything.”

You haven’t. Nobody has. That’s why you bring in someone whose job is to find what you missed.


If you’re launching an AI feature in the next 6-12 weeks and want an external read before the world sees it — that’s the A1 red-team engagement. Four weeks, fixed-price ($5K–$15K), written report with reproducers and prioritized fixes. Book a 30-min call or email hello@mishrasiddharth.com.

Liked this essay?

Have a system that's shipping on vibes?

Book a 30-min audit call. No slides. I'll tell you within 20 minutes whether your AI or data pipeline is production-grade or pre-production.

Scroll to Top