Agents & State Machines
How LLMs Run Multi-Step Tasks — and Why Most Agent Bugs Are State Bugs
Single-turn inference is well understood. But the moment a model starts taking actions, checking results, and deciding what to do next, you're dealing with a fundamentally different kind of system — one where state management matters as much as the model itself.
What an agent actually is
The word "agent" gets used loosely. For the purposes of inference engineering, an agent is a system where a language model makes decisions that affect what happens next — including what inputs it receives in subsequent calls. The model is in a loop, not just answering a single question.
The minimal definition: an agent runs the model, observes the result, takes some action based on that result, and then runs the model again. This loop continues until some termination condition is met. Everything else — memory, tools, planning, multi-agent coordination — is built on top of that basic structure.
What distinguishes this from single-turn inference is the presence of state. Between each model call, something about the world (or the agent's knowledge of it) has changed. Managing that state — what to track, how to represent it, when to update it, and how to recover when it goes wrong — is where most of the real engineering work in agentic systems lives.
// The key distinction
Single-turn inference: input → model → output. Done. Agentic inference: the output feeds back into the input. The model's decisions shape what it sees next. This creates feedback loops, and feedback loops require careful state management.
What a state machine is
A state machine (formally: a finite state automaton) is one of the oldest and most useful abstractions in computer science. The concept is simple: a system is always in exactly one state from a defined set of possible states. Events or conditions cause transitions from one state to another. Each state has defined behaviour, and each transition has defined conditions.
That's the whole thing. States, transitions, conditions. The power comes from making implicit behaviour explicit.
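The classic example is a turnstile: it is locked until a coin is inserted, then unlocked until someone pushes through. A minimal sketch in Python (the representation is illustrative):

```python
# A turnstile has two states and two events. Anything not listed in the
# transition table is an illegal move and simply leaves the state unchanged.
TRANSITIONS = {
    ("locked", "coin"): "unlocked",   # paying unlocks the arm
    ("unlocked", "push"): "locked",   # walking through re-locks it
}

def step(state: str, event: str) -> str:
    """Return the next state; unknown (state, event) pairs are no-ops."""
    return TRANSITIONS.get((state, event), state)

assert step("locked", "push") == "locked"    # pushing a locked turnstile does nothing
assert step("locked", "coin") == "unlocked"
assert step("unlocked", "push") == "locked"
```

The entire behaviour of the system fits in one small table, and every possible (state, event) pair has a defined outcome.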
State machines have a key property that makes them valuable: they make illegal states unrepresentable. If you've defined your states correctly, the system can never be in a state you haven't thought about. Compare this to a tangle of boolean flags and conditionals — where you can easily end up in a state you never intended, because no one made the implicit states explicit.
The three parts of any state machine
Every state machine has the same three components, regardless of complexity:
- States — the finite set of configurations the system can be in. The system is always in exactly one of them.
- Transitions — the permitted moves from one state to another.
- Conditions — the events or guards that trigger each transition.
Agents as state machines
An LLM agent is a state machine whether you model it that way or not. The question is whether your state machine is explicit (designed, documented, testable) or implicit (scattered across prompts, conditionals, and glue code that nobody fully understands).
Consider a basic research agent that can browse the web and answer questions. Even something this simple has multiple states:
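One plausible set of states for such an agent (the names are illustrative) can be written as an enum, along with an explicit table of which transitions are legal, so every piece of orchestration code refers to the same finite set:

```python
from enum import Enum, auto

class AgentState(Enum):
    PLANNING = auto()      # deciding what information is needed
    SEARCHING = auto()     # a web search has been issued
    READING = auto()       # extracting content from a fetched page
    SYNTHESISING = auto()  # composing an answer from gathered evidence
    ERROR = auto()         # a tool call failed; decide whether to retry
    DONE = auto()          # final answer produced

# Which transitions are legal. Anything not listed here is a bug, not a feature.
ALLOWED = {
    AgentState.PLANNING: {AgentState.SEARCHING, AgentState.SYNTHESISING},
    AgentState.SEARCHING: {AgentState.READING, AgentState.ERROR},
    AgentState.READING: {AgentState.PLANNING, AgentState.SYNTHESISING, AgentState.ERROR},
    AgentState.SYNTHESISING: {AgentState.DONE, AgentState.PLANNING},
    AgentState.ERROR: {AgentState.SEARCHING, AgentState.DONE},
    AgentState.DONE: set(),
}
```

Note that ERROR and DONE are states in their own right, not afterthoughts, and DONE has no outgoing transitions: it is terminal by construction.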
Without the state machine framing, this same agent is usually implemented as a chain of if/else conditions and prompt checks. It works until an unexpected transition happens — a tool returns a malformed response, the model decides to call a tool that's out of scope, a loop runs longer than expected. At that point, undefined behaviour takes over.
The inference loop in an agent
From an inference engineering perspective, an agent is a process that runs the model repeatedly, with each call's output feeding the next call's input. Each model call is stateless — the model has no memory between calls — so the agent must construct the full context window on every iteration.
// The agent inference loop
Environment / Observation
The current state of the world is serialised into text: tool results, memory contents, previous turns, system context. This becomes the model's input.
Model Inference
A single forward pass. The model produces an output: either a final answer, a tool call, or a reasoning step. From the model's perspective, this is just inference — it has no awareness of being in a loop.
Action / Tool Execution
If the model called a tool, the agent executes it: web search, code execution, database query, API call. The result is an observation that feeds back into the environment.
State Update & Termination Check
The agent's state machine transitions based on what just happened. Is the task complete? Has a termination condition been met? Is the context window approaching its limit? If not, construct the next context and loop.
Notice that the model itself is just one component in this loop. Everything around it — context construction, tool dispatch, state transitions, termination conditions — is orchestration code. This is why "building an agent" is primarily a software engineering problem, not a prompting problem.
Context window as working memory
Each model call starts from zero. The model has no memory of previous calls — it sees only what is in its context window right now. This means the agent's context window is its working memory. Everything the model needs to know to make a good decision must be present in the context.
This creates a hard constraint: context windows are finite. An agent that naively appends every observation, tool result, and reasoning step will eventually hit the context limit and either fail or produce degraded outputs as the beginning of the context rolls off. Managing what goes into the context — and what gets summarised, compressed, or evicted — is state management.
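A minimal eviction policy, as a sketch: always keep the original task, keep the most recent observations, and drop (or in a real system, summarise) older ones once a token budget is exceeded. The 4-characters-per-token estimate is a rough heuristic standing in for a real tokeniser:

```python
def fit_to_budget(task: str, observations: list[str], budget_tokens: int = 8000) -> list[str]:
    """Keep the task plus as many recent observations as fit the budget."""
    est = lambda s: len(s) // 4          # crude token estimate; use a real tokeniser in practice
    kept: list[str] = []
    remaining = budget_tokens - est(task)
    for obs in reversed(observations):   # newest first: recent steps matter most
        if est(obs) > remaining:
            kept.append("[earlier steps summarised/evicted]")
            break
        kept.append(obs)
        remaining -= est(obs)
    return [task] + list(reversed(kept))
```

The key design choice is that the task prompt is never evicted: losing it is exactly the goal-drift failure described below.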
// The context window trap
Long-running agents often fail not because the model can't do the task, but because the context window fills up with irrelevant earlier steps, the model loses track of the original goal, and its outputs degrade. Context management is agent state management. They're the same problem.
Where agents fail — a state machine diagnosis
Most agent failures map directly to state machine failure modes. Once you see them this way, they become much easier to prevent:
| Failure mode | What it looks like | State machine cause |
|---|---|---|
| Infinite loops | Agent keeps calling the same tool or asking the same question | No transition out of the current state; missing termination condition |
| Silent tool failure | Tool returns an error; agent proceeds as if it succeeded | No ERROR state defined; tool result not checked before transition |
| Goal drift | Agent ends up working on a subtask and forgets the original task | State doesn't encode the top-level goal; context window crowded out original instruction |
| Hallucinated tool calls | Model invents a tool that doesn't exist or calls with wrong arguments | Valid transitions not constrained; model can attempt any transition, including invalid ones |
| Premature termination | Agent decides task is done when it isn't | Termination condition too loose; model's DONE state doesn't match task completion criteria |
| Unrecoverable error | One bad tool call crashes the entire agent run | No retry/recovery transitions defined from ERROR state |
Designing agent states well
The practical implication of treating your agent as a state machine is that you have to decide, upfront, what your states are. This sounds obvious, but most agent implementations skip it entirely, jumping straight to the LLM call and hoping the model figures out what to do.
Good state design has a few properties. States should be meaningful — they should represent a real distinction in what the agent is doing or what information is available. States should be mutually exclusive — the agent is in exactly one state at a time. And states should be complete — every plausible situation should map to one of your defined states, including failure modes.
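Concretely, a useful discipline is to write down, for each state: what it means, which transitions out of it are legal, and what bounds it. One way to record that, as an illustrative sketch:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StateSpec:
    name: str
    description: str               # what this state means, in one sentence
    allowed_next: frozenset[str]   # legal transitions out of this state
    max_visits: int = 10           # bound: a state revisited too often signals a loop

SEARCHING = StateSpec(
    name="SEARCHING",
    description="A web search has been issued; awaiting results.",
    allowed_next=frozenset({"READING", "ERROR"}),
    max_visits=5,
)
```

A table like this doubles as documentation and as data the orchestration code can check transitions against at runtime.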
Structured outputs as transition guards
One practical technique for enforcing state machine behaviour is to constrain the model's outputs structurally. If the model can only output a predefined set of actions — defined by a schema, an enum, or a grammar — then invalid transitions are impossible by construction. The model cannot call a tool that doesn't exist if tool calls are validated against a strict schema before execution.
This is one of the main practical benefits of structured output generation. The constraint isn't just aesthetic tidiness — it's a way of making the agent's transition graph enforceable rather than aspirational.
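A minimal sketch of this idea: validate a model-proposed tool call against a declared registry before executing it. The tool names and the required-argument check are illustrative stand-ins for a real JSON-schema validation step:

```python
from typing import Optional

TOOLS = {
    # tool name -> required argument names (a stand-in for a full schema)
    "web_search": {"query"},
    "fetch_page": {"url"},
}

def validate_tool_call(name: str, args: dict) -> Optional[str]:
    """Return an error message if the call is invalid, else None."""
    if name not in TOOLS:
        return f"unknown tool: {name}"   # hallucinated tool: rejected, never executed
    missing = TOOLS[name] - args.keys()
    if missing:
        return f"{name} missing arguments: {sorted(missing)}"
    return None

# An invalid transition becomes a recoverable observation fed back to the model,
# not an executed action.
assert validate_tool_call("web_search", {"query": "state machines"}) is None
assert validate_tool_call("teleport", {}) == "unknown tool: teleport"
```

The error string is deliberately returned rather than raised: feeding it back as an observation gives the model a chance to correct itself on the next iteration.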
// State machines and reliability
An agent with an explicit state machine is dramatically easier to test, debug, and monitor than one without. You can log state transitions, write tests for specific transitions, and build alerts for states that should be transient but persist. The state machine gives you a vocabulary for describing what went wrong.
The inference cost of agents
Every iteration of the agent loop is at minimum one model call. Multi-step agents make many calls. This has a direct, compounding effect on latency and cost that single-turn inference does not have.
A task that takes ten model calls at 2 seconds per call takes at minimum 20 seconds — even with zero overhead. If each call has a long context (accumulated tool results and history), the per-call cost rises too, because prefill cost scales with context length. Long-running agents on large models can be expensive to run at scale.
This is why agent design and inference efficiency are not separate concerns. Decisions like context compression strategy, which model to use per step (a smaller model may suffice for tool call routing; only use the large model for synthesis), and how aggressively to cache prefixes — these are both agent design decisions and inference engineering decisions simultaneously.
Multi-agent systems
When multiple agents coordinate — a planner agent that breaks down tasks, specialist agents that execute subtasks, a critic agent that reviews outputs — you have a system of communicating state machines. Each agent has its own state. The coordination layer has its own state. Messages between agents are transitions.
The same principles apply, just at a higher level. The coordination layer needs its own explicit states (what are all the states the system as a whole can be in?), defined transitions, and explicit failure handling. The most common failure in multi-agent systems is treating inter-agent communication as reliable when it isn't: an agent that receives no response from a subagent can end up in an undefined state unless that case has its own transition.
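As a sketch, here is a coordinator that dispatches a subtask and treats a missing or rejected reply as an explicit transition to a recovery state rather than an undefined one (all names are illustrative):

```python
from enum import Enum, auto

class CoordState(Enum):
    DISPATCHED = auto()   # subtask sent to a specialist agent
    REVIEWING = auto()    # a reply arrived; the critic checks it
    RETRYING = auto()     # no reply, or a bad reply: an explicit recovery state
    FAILED = auto()       # terminal: retries exhausted
    DONE = auto()         # terminal: reply accepted

def coordinate(run_subagent, review, max_retries: int = 2) -> CoordState:
    state, attempts = CoordState.DISPATCHED, 0
    while True:
        if state is CoordState.DISPATCHED:
            reply = run_subagent()   # returns None on timeout / no response
            state = CoordState.REVIEWING if reply is not None else CoordState.RETRYING
        elif state is CoordState.REVIEWING:
            state = CoordState.DONE if review(reply) else CoordState.RETRYING
        elif state is CoordState.RETRYING:
            attempts += 1
            state = CoordState.FAILED if attempts > max_retries else CoordState.DISPATCHED
        else:
            return state             # DONE or FAILED: terminal
```

The point is the RETRYING state: a silent subagent produces a defined transition, and the run ends in FAILED rather than hanging or proceeding on a missing result.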
// In short
An agent is a loop around a stateless model, and that loop is a state machine whether or not you design it as one. Define the states explicitly, constrain the transitions, give every state an exit (including error states), and treat the context window as the state you are managing. Most agent bugs are state bugs.