The 40K-token system prompt that wasn't slow

Mon, 18 May 2026 00:00:00 +0000

Spent a week chasing latency on my local LLM setup. The answer was not the one I expected.

The setup. RTX 5090. Ollama. A ~40K-token system prompt — instructions, tool schemas, few-shot examples, the usual modern agent stack. Each request was a fresh user query against that fixed preamble.

Inference felt sluggish. My first instinct was the cliché list: bad quantisation, model too large, batch size wrong, maybe a driver issue with the 5090. None of it.

Where the time was actually going

Serving a request with an autoregressive transformer has two phases:

Prefill — process the entire prompt, build the KV cache.
Decode — autoregressively generate output tokens, one at a time.

Decode is what people quote when they say “120 tokens/sec.” Prefill is what you wait through before the first token ever appears — time-to-first-token, TTFT.
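A minimal sketch of how that split can be measured, assuming only that your client hands back tokens as a lazy stream. `split_latency` and `fake_stream` are hypothetical names for illustration, and `fake_stream` stands in for a real streaming client.

```python
import time
from typing import Iterable, Iterator, Tuple


def split_latency(token_stream: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttft_seconds, decode_seconds, n_tokens) for one streamed response."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in token_stream:
        if first_token_at is None:
            # Everything before this point is time-to-first-token.
            first_token_at = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()
    if first_token_at is None:
        # No tokens arrived: the whole wait counts as TTFT.
        return end - start, 0.0, 0
    return first_token_at - start, end - first_token_at, n_tokens


def fake_stream(prefill_s: float, per_token_s: float, n: int) -> Iterator[str]:
    """Hypothetical stand-in for a streaming client: slow start, then steady tokens."""
    time.sleep(prefill_s)  # simulated prefill
    for i in range(n):
        if i:
            time.sleep(per_token_s)  # simulated decode step
        yield f"tok{i}"


ttft, decode, n = split_latency(fake_stream(0.2, 0.01, 20))
print(f"TTFT {ttft:.3f}s, decode {decode:.3f}s for {n} tokens")
```

The clock starts when iteration begins, so this only gives an honest TTFT if the request is sent lazily on the first read. If your client sends the request when the stream object is created, start the timer before that call instead.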

For a 40K-token prompt, prefill has to push all forty thousand tokens through every layer, with attention work that grows roughly quadratically in prompt length, before the model emits a single new token. And here is the cruelty of my setup: I was paying that cost on every request, even though those 40K tokens were byte-identical from one call to the next.
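A back-of-envelope sketch of what that costs. The model shape here (7B parameters, 32 layers, hidden size 4096) and the 50-token user query are assumptions for illustration, not my actual model. The formula is the usual rough approximation: about 2 FLOPs per parameter per token for the weight matmuls, plus a causal attention term.

```python
def prefill_flops(n_params: int, n_layers: int, d_model: int,
                  n_tokens: int, n_cached: int = 0) -> int:
    """Rough FLOPs to prefill a prompt of n_tokens when the first n_cached are cached."""
    new_tokens = n_tokens - n_cached
    # Weight matmuls: about 2 FLOPs per parameter per newly processed token.
    dense = 2 * n_params * new_tokens
    # Causal attention: the token at position i does about 4 * d_model * i FLOPs
    # per layer (QK^T plus the weighted sum over V). Summing that over the new
    # positions only gives approximately 2 * d_model * (n_tokens^2 - n_cached^2).
    attention = 2 * n_layers * d_model * (n_tokens ** 2 - n_cached ** 2)
    return dense + attention


# Hypothetical 7B-class model; 40K-token preamble plus a 50-token user query.
N_PARAMS, N_LAYERS, D_MODEL = 7_000_000_000, 32, 4096
PROMPT, QUERY = 40_000, 50

full = prefill_flops(N_PARAMS, N_LAYERS, D_MODEL, PROMPT + QUERY)
cached = prefill_flops(N_PARAMS, N_LAYERS, D_MODEL, PROMPT + QUERY, n_cached=PROMPT)
print(f"full prefill:   {full:.2e} FLOPs")
print(f"cached prefill: {cached:.2e} FLOPs")
print(f"ratio:          {full / cached:.0f}x")
```

Treat the ratio as an upper bound on the speedup rather than a prediction. Measured TTFT also includes fixed overheads such as tokenisation and scheduling, which is consistent with the order-of-magnitude drop I report later in this post.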

Prefix caching

The fix has a name: prefix caching. Store the KV cache of the shared prefix once, reuse it for any subsequent request that starts with the same tokens. The model skips straight to processing the user’s new suffix.

request 1:  [system 40K][user query A]  → compute KV for 40K + suffix
request 2:  [system 40K][user query B]  → reuse KV for 40K, compute only suffix
request 3:  [system 40K][user query C]  → reuse KV for 40K, compute only suffix
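The three requests above can be sketched as a toy cache keyed on the exact prefix tokens. This is an illustration of the idea, not how any real engine stores KV tensors: `ToyPrefixCache` is a hypothetical class, and the strings stand in for per-token key/value tensors.

```python
from typing import Dict, List, Sequence, Tuple

KV = List[str]  # stand-in for real key/value tensors


class ToyPrefixCache:
    def __init__(self) -> None:
        self._kv_by_prefix: Dict[Tuple[int, ...], KV] = {}

    def _compute_kv(self, tokens: Sequence[int]) -> KV:
        # Placeholder for the expensive forward pass over `tokens`.
        # In a real model the suffix pass also attends to the cached prefix KV.
        return [f"kv[{t}]" for t in tokens]

    def prefill(self, prefix: Sequence[int], suffix: Sequence[int]) -> Tuple[KV, int]:
        """Return (kv_for_full_prompt, number_of_tokens_actually_computed)."""
        key = tuple(prefix)
        computed = 0
        if key not in self._kv_by_prefix:
            # First time this prefix is seen: pay for it once and keep the result.
            self._kv_by_prefix[key] = self._compute_kv(prefix)
            computed += len(prefix)
        kv = self._kv_by_prefix[key] + self._compute_kv(suffix)
        computed += len(suffix)
        return kv, computed


cache = ToyPrefixCache()
system = list(range(1000))  # stands in for the 40K-token preamble
_, cost_a = cache.prefill(system, [7001, 7002, 7003])  # request 1
_, cost_b = cache.prefill(system, [8001, 8002])        # request 2
print(cost_a, cost_b)  # 1003 2
```

The first request pays for the prefix plus its suffix; every later request with the same prefix pays only for its own suffix.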

That is the simple version. SGLang’s RadixAttention generalises it: cached prefixes live in a radix tree, so any request that shares a prefix path — even a partial one — with prior requests skips the redundant compute. Multi-tenant serving, multi-turn conversations, branching agent trees all benefit from the same machinery.
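A toy version of the lookup side, as a sketch only. This is an uncompressed token trie rather than a true radix tree (which merges single-child chains), it stores no KV data, and it is not SGLang's implementation. `TokenTrie` and its methods are hypothetical names.

```python
from typing import Dict, Sequence


class _Node:
    def __init__(self) -> None:
        self.children: Dict[int, "_Node"] = {}


class TokenTrie:
    """Tracks which token prefixes have been seen (and would have cached KV)."""

    def __init__(self) -> None:
        self._root = _Node()

    def insert(self, tokens: Sequence[int]) -> None:
        # Record every prefix of `tokens` as cached.
        node = self._root
        for t in tokens:
            node = node.children.setdefault(t, _Node())

    def match_len(self, tokens: Sequence[int]) -> int:
        # Length of the longest cached prefix of `tokens`.
        node = self._root
        n = 0
        for t in tokens:
            nxt = node.children.get(t)
            if nxt is None:
                break
            node = nxt
            n += 1
        return n


trie = TokenTrie()
trie.insert([1, 2, 3, 4, 5])                # e.g. system prompt + first turn
print(trie.match_len([1, 2, 3, 9]))         # 3: partial overlap still counts
print(trie.match_len([1, 2, 3, 4, 5, 6]))   # 5: next turn of the same conversation
print(trie.match_len([9]))                  # 0: nothing shared
```

Only the tokens past `match_len` need a prefill pass. A production engine also has to handle eviction and GPU memory management, which this sketch ignores.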

What changed

Same model. Same hardware. Same 40K system prompt. Different serving engine.

TTFT dropped close to an order of magnitude. This is not a 10–20% optimisation: the cached path replaces prefill over the 40K prefix with a cache lookup, so only the short user suffix still has to be computed. End-to-end latency on short outputs went from “noticeable pause” to “feels instant.” For an agent loop that fires off five or ten LLM calls per user turn, the cumulative effect is the difference between a toy and something you would ship.

Lessons I’m keeping

Profile before you guess. TTFT vs. decode is the first split to make. If TTFT dominates, no amount of quantisation will save you — you are bottlenecked on prefill compute, not on memory bandwidth during decode.
Long static prefixes are everywhere now. Agent system prompts, RAG context, tool catalogues, multi-turn history. If the first N tokens of your requests are stable, prefix caching is no longer an optimisation — it’s the design.
The serving engine is an architectural choice. Ollama and llama.cpp are wonderful for prototyping and single-user workloads. For anything resembling production with shared prefixes, look at engines that treat prefix caching as a first-class feature: SGLang, vLLM, TGI.
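A quick way to check how stable your first N tokens really are, sketched under the assumption that you can log requests as token lists. The function names are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
from typing import Sequence


def shared_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading tokens that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def common_prefix_len(requests: Sequence[Sequence[str]]) -> int:
    """Length of the prefix shared by every request in the list."""
    if not requests:
        return 0
    first = requests[0]
    n = len(first)
    for other in requests[1:]:
        n = min(n, shared_prefix_len(first[:n], other))
    return n


# Same content, different ordering. Whitespace split stands in for a tokenizer.
stable_first = [
    "SYSTEM rules tools examples | time=09:00 | query A".split(),
    "SYSTEM rules tools examples | time=09:05 | query B".split(),
]
volatile_first = [
    "time=09:00 | SYSTEM rules tools examples | query A".split(),
    "time=09:05 | SYSTEM rules tools examples | query B".split(),
]
print(common_prefix_len(stable_first))    # 5
print(common_prefix_len(volatile_first))  # 0
```

Anything that changes per request (timestamps, user IDs, retrieved context) ends the shared prefix at the point where it appears, so it belongs after the static material.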

The 5090 is a beast. The real unlock was making it do less.

If the first N tokens of every request are the same, compute them once.

— Antonie

Welcome — what this blog is about

Sun, 17 May 2026 00:00:00 +0000

I’ve been meaning to start writing for a while. The trigger is finally having something worth saying — a year of building production LLM systems at Keysight, and a growing collection of opinions that don’t fit in a tweet.

What you’ll find here

Mostly notes on the unglamorous half of agentic AI:

Guided generation — Outlines, JSON schemas, grammar-constrained decoding.
Validators and self-healing loops — when the model is wrong, what do you do next?
Local inference — llama.cpp, quantization, the actual cost of a token.
Observability for agents — drift detection, eval design, when retries cost more than they help.

A bit of computer vision and some math when the mood strikes.

How I’ll write

Short. Specific. Code where it helps, prose where it doesn’t.

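from outlines import models, generate
from pydantic import BaseModel

# Schema the model's output has to match.
class Decision(BaseModel):
    action: str
    confidence: float

# Load a GGUF model through the llama.cpp backend.
model = models.llamacpp("qwen2.5-7b-instruct-q4.gguf")
# Build a generator whose output is constrained to JSON for Decision.
generator = generate.json(model, Decision)
result = generator("Should I retry? Error: rate_limited (3rd time)")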

If you read something here and disagree, email me. The shorter the email, the faster the reply.

The unglamorous half of agentic AI is where the production wins live. Everyone wants to demo the agent. Few want to ship its retry policy.