What real agent architecture actually requires — tool definitions, planning steps, trajectory observability, eval pipelines focused on end-to-end completion — versus what most teams ship and call ‘agents.’
A prospective client showed me their “AI agent” in a discovery call last month. It was a system prompt that had instructions for “if the user asks X, do Y, then do Z.” The model would produce text output that the team would then… parse with regex.
This is not an agent. This is a prompt with delusions of grandeur.
I want to be careful here because the line is fuzzy and the word is overloaded. But the fuzziness is itself the problem. The vocabulary has been hollowed out by people who shipped a chatbot, added a system prompt mentioning “tools,” and started calling it an agent in their pitch deck.
If everything is an agent, then nothing useful is.
This essay is the argument for keeping the definition tight — not because vocabulary purity matters, but because the engineering implications of being a real agent are dramatically different from the implications of being a chatbot. If you ship one thinking you’ve shipped the other, the failure modes will surprise you.
The definitional question
I’ll commit to a working definition I find useful:
An agent is an LLM-powered system that (a) takes multiple consequential steps, (b) uses tools that affect the world outside the conversation, (c) operates with sufficient autonomy that the human is not in the loop for every step, and (d) has explicit observable trajectory state that can be inspected and replayed.
That’s it. Four properties. Most things called “agents” in 2026 have one or two of those at best.
Specifically, the most common gap is (d): the trajectory state. Most “agent” implementations I see are stateless LLM calls dressed up with some control-flow logic. There’s no persistent state representing where the agent is in its work, no way to inspect why it chose to do what it did, no way to replay the decision deterministically. That’s not an agent — that’s a script with an LLM in it.
The other common gap is the consequential-steps part. A chatbot that handles a single Q&A turn isn’t an agent even if it called a search tool to ground its answer. The “agent” framing implies the system is doing work across multiple steps, where each step’s outcome influences the next and a wrong step has real consequences.
What real agent architecture has
Let me walk through what production-grade agent architecture looks like, from the bottom up. This is the standard I hold engagements to. If you’re missing more than one of these, you have an elaborate prompt, not an agent.
1. Explicit tool definitions with execution boundaries
Real agents have a discrete, enumerable set of tools they can call. Each tool has:
- A specification of what inputs it accepts (validated before execution).
- A specification of what outputs it produces (validated before being passed back to the model).
- An execution boundary that determines what the tool is allowed to do (rate limits, parameter constraints, permission scopes).
- Logging of every invocation with full input/output capture.
Most “agent” systems I audit have a function that accepts a JSON blob from the LLM and… just calls whatever the LLM said to call. No validation. No boundaries. The model returned {"tool": "refund", "amount": 9999} and the system refunds $9,999. This is not architecture; this is a vulnerability with a Loom video.
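To make the four bullets concrete, here is a minimal sketch of a tool registry that validates input and output, enforces a per-run call budget, and logs every invocation. Every name in it (`Tool`, `ToolRegistry`, `MAX_REFUND`, the refund handler) is hypothetical and invented for illustration; a real system would add permission scopes and persistent logging.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


class ToolError(Exception):
    """Raised when a tool call is rejected before or after execution."""


@dataclass
class Tool:
    name: str
    validate_input: Callable[[dict], None]   # raises ToolError on bad input
    validate_output: Callable[[Any], None]   # raises ToolError on bad output
    run: Callable[[dict], Any]
    max_calls_per_run: int = 3               # execution boundary: call budget


@dataclass
class ToolRegistry:
    tools: dict = field(default_factory=dict)
    call_counts: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # every invocation, accepted or rejected

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def call(self, name: str, args: dict) -> Any:
        record = {"tool": name, "input": args, "output": None, "error": None}
        self.log.append(record)
        try:
            if name not in self.tools:
                raise ToolError(f"unknown tool: {name}")
            tool = self.tools[name]
            used = self.call_counts.get(name, 0)
            if used >= tool.max_calls_per_run:
                raise ToolError(f"call budget exhausted for {name}")
            tool.validate_input(args)          # reject before anything runs
            self.call_counts[name] = used + 1
            output = tool.run(args)
            tool.validate_output(output)       # reject before the model sees it
            record["output"] = output
            return output
        except ToolError as exc:
            record["error"] = str(exc)
            raise


MAX_REFUND = 200  # hypothetical policy limit, not from any real system


def check_refund_input(args: dict) -> None:
    amount = args.get("amount")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        raise ToolError("amount must be a number")
    if not 0 < amount <= MAX_REFUND:
        raise ToolError(f"amount must be in (0, {MAX_REFUND}]")


def check_refund_output(out: Any) -> None:
    if not isinstance(out, dict) or out.get("status") != "refunded":
        raise ToolError("unexpected refund result")


registry = ToolRegistry()
registry.register(Tool(
    name="refund",
    validate_input=check_refund_input,
    validate_output=check_refund_output,
    run=lambda args: {"status": "refunded", "amount": args["amount"]},  # stub
))
```

With something like this in place, the {"tool": "refund", "amount": 9999} call from the example above raises a `ToolError` before any money moves, and the rejected attempt still lands in the log.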
2. Planning step (explicit, not implicit)
Real agents have a planning step. Before executing tools, the agent commits to a plan: here’s what I think I need to do, here’s the order, here’s what success looks like. The plan is observable — you can log it, evaluate it independently of execution, and use it as the basis for human-in-loop intervention.
Most “agent” implementations skip this. They prompt the model to “respond with the next tool call” and chain those together until something happens. The model is improvising the entire trajectory, one step at a time, with no commitment to a plan, which means:
- You cannot evaluate whether the plan was right (there’s no plan).
- You cannot detect when the agent is going off-script (no script).
- You cannot intervene productively (intervene in what?).
- Failure modes are emergent and surprising.
Some of the better agent frameworks (LangGraph, Inngest, Temporal-based patterns) push you toward explicit planning. Many teams use these frameworks and still don’t get the benefit, because they wire the LLM to decide one step at a time instead of planning first and then executing.
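A rough sketch of plan-then-execute, independent of any framework: the plan is a plain data object that gets logged before any tool runs, and the executor refuses a call that departs from it. `Plan`, `PlanStep`, `execute`, and the stub model and tool are all hypothetical names for illustration; this version only checks tool names, which is the weakest useful form of "off-script" detection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanStep:
    tool: str
    purpose: str


@dataclass(frozen=True)
class Plan:
    goal: str
    steps: tuple
    success_criteria: str


class OffPlanError(Exception):
    """Raised when a proposed call does not match the committed plan."""


def execute(plan, propose_call, run_tool, log):
    """Run a committed plan and reject any call that departs from it."""
    log.append({"event": "plan", "plan": plan})  # logged before any tool runs
    results = []
    for index, step in enumerate(plan.steps):
        call = propose_call(plan, index, results)  # the model fills in arguments
        if call["tool"] != step.tool:
            log.append({"event": "off_plan", "step": index, "call": call})
            raise OffPlanError(
                f"step {index}: expected {step.tool}, got {call['tool']}"
            )
        result = run_tool(call["tool"], call["args"])
        log.append({"event": "step", "step": index, "call": call, "result": result})
        results.append(result)
    return results


plan = Plan(
    goal="refund a duplicate charge",
    steps=(
        PlanStep("lookup_order", "find the duplicate charge"),
        PlanStep("refund", "return the duplicate amount"),
    ),
    success_criteria="customer was charged exactly once",
)


def scripted_model(plan, index, results):
    # Stand-in for an LLM call that would be prompted with the plan and prior results.
    return {"tool": plan.steps[index].tool, "args": {"order_id": "o-1"}}


def fake_tool(name, args):
    # Stand-in for real tool execution.
    return {"tool": name, "ok": True}


events = []
outputs = execute(plan, scripted_model, fake_tool, events)
```

Because the plan exists as data, it can be evaluated on its own, shown to a reviewer before execution, or diffed against what actually ran.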
3. Trajectory storage and replay
This is the single most differentiating property of a real agent: every run is stored, and every run is replayable.
Concretely, that means:
- Every LLM call made during the run is logged with full request/response.
- Every tool call is logged with full input/output.
- The decision tree (which branch did the agent take, why) is captured.
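- A specific failed run can be re-executed deterministically (given the same inputs, the same model version, and the same retrieved context, with the recorded model responses served back, since sampling alone does not guarantee identical output).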
When something goes wrong in production — and it will — you cannot debug an agent without trajectory replay. The bug is some interaction between LLM output, tool call result, retrieved context, and the next LLM call. Without the trajectory, you have a 4-hour conversation log and no idea what happened.
Most teams shipping “agents” have logs that look like chat history. That’s not a trajectory; that’s a transcript. The difference matters: you can read a transcript; you can re-execute a trajectory.
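One way to make the transcript-versus-trajectory distinction concrete is to record every model and tool call as a structured event, then replay a run by serving the recorded responses back in order. This is a sketch under that assumption; `Trajectory`, `Replayer`, the toy `agent`, and the model version string are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Event:
    kind: str        # "llm" or "tool"
    request: dict
    response: dict


@dataclass
class Trajectory:
    run_id: str
    model_version: str
    events: list = field(default_factory=list)

    def record(self, kind, request, response):
        self.events.append(Event(kind, request, response))
        return response

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw):
        data = json.loads(raw)
        events = [Event(**e) for e in data["events"]]
        return cls(data["run_id"], data["model_version"], events)


class ReplayMismatch(Exception):
    """Raised when a replayed run asks for something the original run did not."""


class Replayer:
    """Serves recorded responses in order instead of calling the model or tools."""

    def __init__(self, trajectory):
        self._events = iter(trajectory.events)

    def call(self, kind, request):
        event = next(self._events, None)
        if event is None or event.kind != kind or event.request != request:
            raise ReplayMismatch(f"diverged at {kind} call: {request}")
        return event.response


def agent(call):
    # Toy two-step agent: ask the model which order to check, then look it up.
    decision = call("llm", {"prompt": "which order?"})
    return call("tool", {"name": "lookup_order", "order_id": decision["order_id"]})


def live_backend(kind, request):
    # Stand-in for real model and tool calls.
    if kind == "llm":
        return {"order_id": "o-1"}
    return {"status": "shipped"}


traj = Trajectory(run_id="run-1", model_version="hypothetical-model-v1")
live_result = agent(lambda kind, req: traj.record(kind, req, live_backend(kind, req)))

# Later: load the stored run and re-execute the same agent code against it.
restored = Trajectory.from_json(traj.to_json())
replayed_result = agent(Replayer(restored).call)
```

The replay raises `ReplayMismatch` the moment the agent code asks for something the original run did not, which is usually the exact step worth debugging.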
4. Eval pipeline scoped to end-to-end task completion
I covered eval pipelines in a different essay, but they look different for agents than for chatbots. Specifically:
Chatbot evals measure individual response quality. Was this answer correct? Was the format right? Did it cite the right source?
Agent evals measure end-to-end task completion. Did the multi-step task get done? How many steps did it take? Were any intermediate steps wrong-but-recoverable? Did the human-in-loop queue catch the right cases?
You can have an agent where every individual LLM call scores 95% on chatbot-style evals and the end-to-end task success rate is 23%, because errors compound across steps: at 95% per step, that is what a run of roughly 29 steps gets you. Even a short run pays the tax. Assuming steps fail independently, an agent that scores 95% per step and takes 5 steps has a 77% end-to-end success rate (0.95^5). Whether the run is 4, 5, or 6 steps deep, “good per-step performance” doesn’t translate to good system performance.
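The compounding arithmetic is easy to check. The sketch below assumes every step succeeds independently with the same probability, which is a simplification (real steps are correlated and some errors are recoverable); the function names are mine.

```python
import math


def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps succeeds."""
    return per_step ** steps


def steps_until(per_step: float, target: float) -> int:
    """Smallest number of steps at which success falls to `target` or below."""
    return math.ceil(math.log(target) / math.log(per_step))


for n in (4, 5, 6, 10, 20):
    print(n, round(end_to_end_success(0.95, n), 3))
```

At 95% per step, five steps give about 0.774, and `steps_until(0.95, 0.23)` comes out to 29: a success rate in the low twenties implies a run of roughly that length, or per-step accuracy well below 95%.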
The eval discipline for agents has to be end-to-end-task-shaped. Most teams don’t do this and are surprised when their per-step-good agent has a 30% overall success rate.
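An end-to-end-shaped eval can be sketched in a few lines: each case judges the final state of the task rather than any single reply, and the report tracks task success rate and step count. `EvalCase`, `evaluate`, and the stub agent are hypothetical names, and a real harness would run many more cases against a sandboxed environment.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    task: str
    check: Callable[[dict], bool]  # judges the final state, not individual replies


def evaluate(run_agent, cases):
    """run_agent(task) must return (final_state, steps_taken)."""
    rows = []
    for case in cases:
        final_state, steps = run_agent(case.task)
        rows.append({
            "case": case.name,
            "passed": bool(case.check(final_state)),
            "steps": steps,
        })
    passed = sum(row["passed"] for row in rows)
    return {
        "task_success_rate": passed / len(rows),
        "mean_steps": sum(row["steps"] for row in rows) / len(rows),
        "rows": rows,
    }


def stub_agent(task):
    # Stand-in for a real agent run against a test environment.
    if task == "refund order o-1":
        return {"refunded": True}, 4
    return {"refunded": False}, 6


cases = [
    EvalCase("refund-happy-path", "refund order o-1",
             lambda state: state.get("refunded") is True),
    EvalCase("missing-order-escalates", "refund order o-404",
             lambda state: state.get("escalated") is True),
]
report = evaluate(stub_agent, cases)
```

Here the stub passes the first case and fails the second (it neither refunds nor escalates), so the report shows a 50% task success rate regardless of how good each individual model reply looked.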
5. Human-in-loop pipeline (designed, not bolted on)
The agents I ship to production almost all have an HITL surface. Not because the AI isn’t good enough — the AI is often quite good — but because the consequences of being wrong are too expensive to fully automate.
Real HITL design has:
- Explicit confidence scoring. The agent commits to a confidence in its outcome. This is harder than it sounds; getting calibrated confidence from an LLM requires structured output and explicit prompting.
- Routing logic. With X as the upper threshold and Y as the lower one: confidence above X goes straight through; confidence between Y and X routes to a human reviewer with the agent’s answer prefilled; confidence below Y routes to a human starting from scratch.
- Reviewer UI. A queue, an interface for the reviewer, a way to see what the agent did and why, a way to accept/reject/edit cleanly.
- Feedback loop. Reviewer corrections flow back into the system as new eval cases, prompt updates, and fine-tuning data.
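The routing bullet reduces to a small, testable function. This sketch assumes confidence is already a calibrated number in [0, 1]; the threshold values 0.9 and 0.6 are placeholders for illustration, not recommendations, and would be tuned against reviewer outcomes.

```python
from enum import Enum


class Route(Enum):
    STRAIGHT_THROUGH = "straight_through"
    REVIEW_PREFILLED = "review_prefilled"
    MANUAL = "manual"


def route(confidence: float,
          auto_threshold: float = 0.9,      # "X" in the list above (placeholder value)
          review_threshold: float = 0.6):   # "Y" in the list above (placeholder value)
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if not review_threshold < auto_threshold:
        raise ValueError("review_threshold must be below auto_threshold")
    if confidence > auto_threshold:
        return Route.STRAIGHT_THROUGH    # no human involved
    if confidence >= review_threshold:
        return Route.REVIEW_PREFILLED    # human reviews the agent's prefilled answer
    return Route.MANUAL                  # human starts from scratch
```

Keeping this as plain code rather than prompt text means the thresholds can be versioned, tested, and changed without touching the model.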
Most “agent” systems I audit have no HITL pipeline at all — they’re either fully automated (terrifying for consequential work) or fully manual (the human reviews everything, defeating the point of the agent).
The shipped-quality middle ground is a designed HITL surface. The invoice processing agent I built for a fintech runs 96% straight-through, 4% through a human review queue where the agent has prefilled an answer with confidence reasoning. The human spends 90 seconds reviewing where they previously spent 6 minutes processing from scratch.
What “elaborate prompt” looks like by contrast
Let me describe the system I see most often, which is not an agent despite being called one:
- Single LLM call, with a long system prompt mentioning “tools you have available.”
- LLM returns text output that is then parsed.
- If the output mentions a tool name, the system extracts the params with regex and calls the tool.
- The tool result is concatenated into the next prompt.
- Loop until output doesn’t contain a tool call, or until 5 iterations have happened.
- Final output returned to user.
This is “an LLM in a while loop.” It can produce surprisingly good output for some tasks. It is not an agent. It has no planning step. The trajectory isn’t first-class — it’s reconstructed from LLM output text. There’s no execution boundary on tools. Failure modes are emergent and impossible to debug.
I’ve been in audit conversations where the team showed me this and called it an A3 agent. It’s not their fault — the vocabulary is muddy and the demos look the same when they work. But when they break, they break in fundamentally different ways than a real agent, and the team is not equipped to debug them.
Common architectural mistakes
Some patterns I see repeatedly in “agent” systems that should be flagged on review:
Treating the LLM as the orchestrator. The LLM should make decisions; it should not be the control flow. If your agent’s logic is “ask LLM what to do, do it, ask again,” you have no architecture. You have an LLM running your application, and LLMs are bad at running applications.
No timeout / loop detection. Real agents have explicit step budgets and timeouts. Without them, an agent can get into a stuck loop calling the same tool repeatedly because the LLM is confused, burning through your inference budget and your dignity.
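A step budget, a wall-clock timeout, and repeated-call detection fit in one small guard that runs before every tool execution. This is a minimal sketch; `RunGuard` and its default limits are hypothetical, and it treats "same tool with the same arguments" as the loop signal, which will miss loops that vary their arguments.

```python
import json
import time


class BudgetExceeded(Exception):
    """Raised when a run exhausts its step, time, or repeat budget."""


class RunGuard:
    def __init__(self, max_steps=20, max_seconds=60.0, max_repeats=3,
                 clock=time.monotonic):
        self.max_steps = max_steps
        self.max_seconds = max_seconds
        self.max_repeats = max_repeats
        self._clock = clock
        self._started = clock()
        self._steps = 0
        self._seen = {}  # (tool, canonical args) -> number of calls so far

    def check(self, tool, args):
        """Call before every tool execution; raises if any budget is exhausted."""
        self._steps += 1
        if self._steps > self.max_steps:
            raise BudgetExceeded(f"step budget of {self.max_steps} exhausted")
        if self._clock() - self._started > self.max_seconds:
            raise BudgetExceeded(f"run exceeded {self.max_seconds} seconds")
        key = (tool, json.dumps(args, sort_keys=True))
        self._seen[key] = self._seen.get(key, 0) + 1
        if self._seen[key] > self.max_repeats:
            raise BudgetExceeded(f"{tool} called with identical arguments too often")


guard = RunGuard(max_steps=10, max_repeats=2)
guard.check("search", {"q": "invoice 42"})
guard.check("search", {"q": "invoice 42"})
# A third identical call would raise BudgetExceeded.
```

When the guard trips, the sensible default is to stop the run and hand it to a human with the trajectory attached, not to retry.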
Tools that mutate state without dry-run mode. Every consequential tool (refund, send email, delete record, post to channel) should have a dry-run mode the agent can use during planning. The agent’s planning step should run dry-run; only the execution step should run live.
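A dry-run mode can be as simple as a flag that defaults to safe. In this sketch the tool validates and describes its effect either way, but only touches state when `dry_run=False` is passed explicitly. `RefundTool` and its in-memory `ledger` are hypothetical stand-ins for a real payment integration.

```python
from dataclasses import dataclass, field


@dataclass
class RefundTool:
    ledger: list = field(default_factory=list)  # stands in for the real payment system

    def call(self, order_id: str, amount: float, dry_run: bool = True) -> dict:
        # Validation runs in both modes, so planning sees the same rejections.
        if amount <= 0:
            raise ValueError("amount must be positive")
        effect = {"action": "refund", "order_id": order_id, "amount": amount}
        if dry_run:
            return {"executed": False, "would_do": effect}
        self.ledger.append(effect)  # the only line that mutates state
        return {"executed": True, "did": effect}


refunds = RefundTool()
preview = refunds.call("o-1", 25.0)                 # planning: nothing happens
result = refunds.call("o-1", 25.0, dry_run=False)   # execution: state changes
```

Defaulting to dry-run means a forgotten argument fails safe: the worst case is a refund that did not happen, not one that should not have.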
No deterministic replay. If you can’t reproduce a failed run with the same inputs, you can’t debug. Period. Trajectory storage with replay is non-negotiable for production agents.
Confidence scoring as theatre. Many systems have a “confidence score” field in the agent’s output that turns out to be the LLM just inventing a number. Real confidence has to be derived from something measurable: log probabilities, structured uncertainty, or specific signals you can probe.
When you actually need an agent
The honest second half of this essay is: you probably don’t need an agent.
Most use cases that call for “an agent” are well-served by simpler patterns:
- Chatbots for conversational Q&A grounded in retrieval. Fully covered by single-turn LLM + RAG.
- Workflows for multi-step processes where the steps are known and the LLM is just one of the nodes. Use a real workflow engine (Inngest, Temporal, plain Python) with LLM calls embedded where useful. This is dramatically simpler than agent architecture.
- Decision trees for “if X then call Y” logic where the conditions are knowable. Don’t ask an LLM to do a switch statement.
The actual case for agent architecture is when:
- The work genuinely requires multiple consequential steps.
- The exact sequence of steps depends on intermediate results.
- Each step requires tool use that affects the world.
- The cost of getting any individual step wrong is meaningful.
That’s a narrower set than “everything I previously built as a chatbot but now want to call an agent.”
When the use case fits, build it as a real agent, with all five architectural components above. When it doesn’t, build the simpler thing and don’t lie about what it is.
What I tell teams who push back
The most common pushback when I make this argument is: “We’re not pretending it’s a real agent. We’re just using the word for marketing.”
Fair. The word does its work in marketing. Investors hear “agent” and price it differently. Customers hear “agent” and expect more. I get it.
The problem is: the engineering team eventually has to ship what was sold. When the marketing says “AI agent that handles your entire workflow autonomously” and the implementation is an LLM in a while loop, you have a debt that gets paid at the worst possible moment — usually when a customer hits a failure mode and you have to explain why.
The compromise I recommend: use the word externally if you must, but internally name it accurately. “Our customer-facing AI assistant” or “our automated workflow system” is fine for engineering vocabulary. Calling it an agent internally makes you start treating it like one, which means you’ll feel disappointed when it doesn’t behave like one.
Better: build the engineering as if it were a real agent, even if you start with the simpler version. Get the trajectory storage right. Get the eval pipeline right. Get the tool boundaries right. Then the gap between marketing claim and reality starts to close.
The closing argument
The word “agent” is going to keep being abused. That’s fine. Words are tools.
But if you’re building one — or buying one, or auditing one — keep the four properties in your head:
- Multiple consequential steps.
- Tools that affect the world outside the conversation.
- Sufficient autonomy that humans aren’t in the loop for every step.
- Explicit observable trajectory that can be inspected and replayed.
If your system has all four, you’ve built an agent. Treat it like one: build the eval pipeline for end-to-end task success, build the HITL surface, build the trajectory storage. The engineering load is heavier but the system survives contact with reality.
If your system has fewer than four, you’ve built something else. Probably a chatbot or a workflow with LLM calls in it. That’s fine — those are useful patterns. But don’t call it an agent, because the moment you call it an agent, you start making promises the architecture can’t keep.
The work in 2026 isn’t building more “agents.” The work is being honest about what we’ve built so we can ship it well.
If you’re shipping something you’ve been calling an agent and want a clear-eyed external review of the architecture — that’s exactly what the A1 audit catches. Two weeks, written report, no slides. Book a 30-min call or email hello@mishrasiddharth.com.