// Concept 07 — Production

Production Metrics

TTFT, TPOT, and the Numbers That Define LLM Experience

A model that is fast in a benchmark can feel slow in production. The metrics you track determine what you optimise. Know these numbers and you can have a useful conversation about any LLM deployment.

Concept 07 of 07

The two phases create two separate problems

LLM inference has two distinct phases — prefill and decode — and they generate different user experience problems. Prefill is fast but produces waiting. Decode is slow but produces visible progress. Users perceive these very differently, which is why a single latency number fails to capture what matters.

A response that takes 10 seconds with nothing displayed, then dumps 500 words instantly, is experienced as a 10-second wait. A response that streams 500 words over 12 seconds, starting within 200ms, feels fast and responsive. Both have the same end-to-end latency. The metrics below capture this distinction precisely.
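The arithmetic behind the two scenarios, as a quick sketch using the numbers from the paragraph above:

```python
# Both responses deliver ~650 tokens (~500 words at ~1.3 tokens/word).
tokens = 650

# A: non-streaming; nothing is shown until the full response is ready.
a_first_output = 10.0                     # seconds before anything appears

# B: streaming; first token at 200 ms, then ~18 ms between tokens.
b_first_output = 0.2
b_complete = b_first_output + 0.018 * tokens

print(f"A: first output at {a_first_output:.1f}s")
print(f"B: first output at {b_first_output:.1f}s, complete at {b_complete:.1f}s")
```

Response B finishes slightly later than A, yet the user has been reading for almost the entire time.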

Time to First Token (TTFT)

The elapsed time from when a request is submitted to when the first output token is received by the client. Determined primarily by prefill time and queue wait time. This is the metric that governs perceived responsiveness — how long users wait before anything appears.

Excellent: < 200ms
Acceptable: 200–800ms
Poor: > 1s

TTFT is dominated by prompt length and system load. Long system prompts (2k+ tokens) can add hundreds of milliseconds of prefill time before any streaming begins. Prefix caching is the primary lever for reducing TTFT in production — a cached system prompt costs near zero prefill time.
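A back-of-envelope TTFT model makes the caching effect concrete. The prefill rate below is a made-up constant for illustration, not a measured figure:

```python
def ttft_estimate_s(queue_wait_s: float, prompt_tokens: int,
                    cached_prefix_tokens: int,
                    prefill_tokens_per_s: float = 10_000) -> float:
    """TTFT ~ queue wait + prefill of the uncached part of the prompt.
    The default prefill rate is a hypothetical value for illustration."""
    uncached = max(0, prompt_tokens - cached_prefix_tokens)
    return queue_wait_s + uncached / prefill_tokens_per_s

# 2k-token system prompt plus a 200-token user turn, 50 ms queue wait.
print(f"cold:   {ttft_estimate_s(0.05, 2200, 0) * 1000:.0f} ms")
print(f"cached: {ttft_estimate_s(0.05, 2200, 2000) * 1000:.0f} ms")
```

With the system prompt cached, only the 200 new tokens need prefill, so TTFT drops from hundreds of milliseconds into the "excellent" band.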

Time per Output Token (TPOT)

The average time between successive output tokens, measured during the decode phase. Also called inter-token latency (ITL). This determines the streaming "feel" — whether the text appears to flow naturally or stutters. At good TPOT values, text appears at roughly reading speed.

Excellent: < 30ms
Acceptable: 30–80ms
Poor: > 100ms

Human reading speed is roughly 250 words per minute, or about 4 words per second. At ~1.3 tokens per word, this is approximately 5 tokens per second, or 200ms per token. Anything below 100ms TPOT produces a perceptibly smooth stream; above 150ms, users typically notice stuttering.
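The arithmetic, spelled out. The article rounds to 5 tokens per second and 200ms; the exact figures come out slightly lower:

```python
words_per_minute = 250
tokens_per_word = 1.3

words_per_second = words_per_minute / 60                 # ~4.2
tokens_per_second = words_per_second * tokens_per_word   # ~5.4
ms_per_token = 1000 / tokens_per_second                  # ~185 ms

print(f"{tokens_per_second:.1f} tok/s, {ms_per_token:.0f} ms/token")
```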

End-to-End Latency (E2EL)

Total time from request submission to receiving the final token. Equals TTFT + (TPOT × output_tokens). The least useful single metric for UX, but important for batch/offline workloads where streaming is irrelevant.

Chat (100 tokens): 2–8s
Code (500 tokens): 8–30s
Document (2k tokens): 30–120s
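The formula can be sanity-checked against these ranges. The TTFT and TPOT values below are illustrative mid-range assumptions, not measurements:

```python
def e2el_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """E2EL = TTFT + TPOT x output_tokens."""
    return ttft_s + tpot_s * output_tokens

# Assumed mid-range values: TTFT = 0.5 s, TPOT = 40 ms/token.
for name, toks in [("Chat", 100), ("Code", 500), ("Document", 2000)]:
    print(f"{name}: {e2el_s(0.5, 0.040, toks):.1f}s")
```

At these assumptions the three workloads land at roughly 4.5s, 20.5s, and 80.5s, each inside its band above.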

Throughput

The rate at which the system generates output tokens across all concurrent requests, measured in tokens per second (TPS) or requests per second (RPS). The primary metric for cost efficiency — maximising throughput minimises cost per token.

Flagship GPU, 70B INT4: high-thousands TPS
Flagship GPU, 70B FP16: low-thousands TPS
Previous-gen GPU, 70B INT4: mid-range TPS

Throughput figures are estimates derived from bandwidth and model size. Actual numbers vary significantly with batch size, context length, quantization implementation, and serving configuration. Treat as order-of-magnitude references, not benchmarks.
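One way such estimates are derived is a memory-bandwidth roofline: during decode, each step streams the full weights from HBM once, and that read is amortised across the batch. All numbers below are hypothetical round figures:

```python
def decode_tps_roofline(hbm_bandwidth_gb_s: float, model_size_gb: float,
                        batch_size: int) -> float:
    """Upper bound on decode throughput: each decode step reads the full
    weights from HBM once, amortised over the batch. Ignores KV-cache
    traffic, compute time, and interconnect overheads."""
    steps_per_s = hbm_bandwidth_gb_s / model_size_gb
    return steps_per_s * batch_size

# Hypothetical: ~3000 GB/s HBM, 70B model at INT4 (~35 GB), batch of 64.
print(f"{decode_tps_roofline(3000, 35, 64):.0f} tok/s")
```

This lands in the high-thousands range quoted above; halving the batch or doubling the weight size (INT4 to FP16) scales the bound proportionally.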

// Throughput vs. latency tension

Increasing throughput usually increases latency. Batching more requests together improves GPU utilisation and tokens-per-second, but each individual request waits longer in the queue and experiences higher TPOT as the batch competes for compute. Your SLO defines where to set this tradeoff.
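A toy cost model shows the tension. The constants are invented for illustration; only the shape of the tradeoff matters:

```python
# Toy decode-step model: a fixed weight-read cost shared by the whole
# batch, plus a small per-request cost. Constants are made up.
def step_time_s(batch_size: int) -> float:
    return 0.010 + 0.0005 * batch_size

for batch in (1, 8, 32, 128):
    tpot_ms = step_time_s(batch) * 1000     # latency each request sees
    tps = batch / step_time_s(batch)        # system-wide throughput
    print(f"batch={batch:>3}  TPOT={tpot_ms:5.1f} ms  throughput={tps:5.0f} tok/s")
```

Growing the batch from 1 to 128 multiplies throughput many times over while per-request TPOT degrades by roughly 7×; the SLO tells you where on this curve to operate.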

Percentiles, not averages

Average latency is nearly useless for SLO management. A system with 200ms average TTFT might have p99 of 3 seconds — meaning 1% of users wait 15× longer than average. In production, you care about the tail.

// latency_percentile_hierarchy

p50: median latency; half of requests are faster, half slower. Useful for characterising "typical" behaviour.
p90: 90% of requests complete within this time. Your primary SLO target for most services.
p95: a stricter SLO. Common for user-facing chat applications.
p99: the tail you use to identify and fix spikes. Often 3–5× the p50.

Monitor p50, p95, and p99 TTFT and TPOT. A degrading p99 while p50 holds is an early sign of queue buildup or memory pressure.
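A minimal nearest-rank percentile computation over synthetic TTFT samples, with a deliberate 1.5% slow tail to show how p99 exposes what p50 hides:

```python
import random

def percentile(data, p):
    """Nearest-rank percentile of a list of samples."""
    s = sorted(data)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

random.seed(0)
# Synthetic TTFT samples: 985 fast requests plus a 1.5% slow tail.
samples = ([random.gauss(0.2, 0.03) for _ in range(985)]
           + [random.uniform(1.0, 3.0) for _ in range(15)])

for p in (50, 95, 99):
    print(f"p{p}: {percentile(samples, p) * 1000:.0f} ms")
```

Here p50 and p95 sit near 200–300ms while p99 lands above a second: averages and medians would report a healthy system.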

Cost per token

The economic metric that ultimately governs hardware and optimisation decisions.

Cost per million output tokens

cost = (hourly_GPU_cost / throughput_tokens_per_hour) × 1_000_000

Example: at $3/hr and 4,000 tokens/sec, you generate ~14.4M tokens/hr → ~$0.21 per million tokens before margin. Substitute your actual GPU cost and measured throughput.
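The formula as a function, reproducing the worked example:

```python
def cost_per_million_tokens(hourly_gpu_cost: float,
                            tokens_per_second: float) -> float:
    """Cost per 1M output tokens from GPU rental cost and throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_gpu_cost / tokens_per_hour * 1_000_000

print(f"${cost_per_million_tokens(3.0, 4000):.2f} per million tokens")
```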

Throughput is the key variable in the denominator. Every optimisation that increases tokens per second — continuous batching, quantization, prefix caching, speculative decoding — directly reduces cost per token. This is why throughput is the primary engineering target for cost-sensitive deployments.

Goodput

A subtler metric gaining adoption among production teams: goodput is the fraction of GPU compute that produces tokens delivered to users within SLO, as opposed to tokens that were generated but arrived too late (after a client timeout), or compute spent on cancelled requests.

A system at 90% throughput utilisation but 20% timeout rate has much lower goodput than it appears. Goodput = useful work ÷ total work. It's the metric that most honestly captures whether your serving system is doing what you're paying for.
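A sketch of goodput accounting over a handful of requests. The SLO threshold and the per-request data are invented for illustration:

```python
SLO_TTFT_S = 1.0  # assumed SLO for this sketch

# (ttft_s, output_tokens, cancelled) per request; data is invented.
requests = [
    (0.3, 120, False),
    (0.8, 400, False),
    (2.5, 300, False),   # generated, but missed the TTFT SLO
    (0.4, 250, True),    # client cancelled; all of this compute is waste
]

total = sum(toks for _, toks, _ in requests)
useful = sum(toks for ttft, toks, cancelled in requests
             if ttft <= SLO_TTFT_S and not cancelled)
print(f"throughput counts {total} tokens; goodput counts {useful} "
      f"({useful / total:.0%})")
```

A throughput dashboard would report all 1,070 tokens; goodput credits only the 520 that users actually received within SLO.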

What to instrument

Metric                   Unit         Why it matters
TTFT p50/p95/p99         ms           User-perceived responsiveness; first impression
TPOT p50/p95/p99         ms/token     Streaming quality; reading-speed match
Token throughput         tokens/sec   Cost efficiency; capacity planning
Request queue depth      count        Early warning of overload; latency predictor
KV cache utilisation     %            Memory pressure; preemption risk
Prefix cache hit rate    %            Prefill efficiency; TTFT reduction
Token error rate         %            Generation failures; OOM events
GPU utilisation          %            Headroom assessment; over/under-provisioning

That's the full core series. You now have the concepts behind every major decision in LLM inference. The glossary has 40+ term definitions, and deeper guides on specific tools are on the way.

// In short

01. TTFT governs perceived responsiveness. Users judge "fast" by when the first character appears. Optimise TTFT for chat; optimise throughput for batch.
02. TPOT governs streaming quality. Below ~100ms per token, text streams feel natural. Above 150ms, users notice stuttering and the UX degrades.
03. Track percentiles, not averages. p50 describes typical behaviour. p99 reveals your worst 1%, which is often what users remember and what causes support tickets.
04. Throughput is cost. Every token per second added to system throughput reduces cost per million tokens. This is the through-line connecting all inference optimisations.
05. Goodput is the honest metric. Throughput counts tokens generated. Goodput counts tokens delivered within SLO. The gap between them is waste you're paying for.