<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
  <title>Antonie Chirilus — Reliable LLM systems</title>
  <link>https://roglia.ro/</link>
  <description>Antonie Chirilus, R&amp;D Engineer at Keysight Technologies. Building deterministic LLM systems: guided generation, validators, agents.</description>
  <language>en</language>
  <lastBuildDate>Tue, 19 May 2026 18:18:23 +0000</lastBuildDate>
  <atom:link href="https://roglia.ro/feed.xml" rel="self" type="application/rss+xml"/>
  <item>
    <title>The 40K-token system prompt that wasn't slow</title>
    <link>https://roglia.ro/posts/prefix-caching-40k-system-prompt/</link>
    <guid isPermaLink="true">https://roglia.ro/posts/prefix-caching-40k-system-prompt/</guid>
    <pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate>
    <description>I spent a week chasing latency on a local LLM setup. The bottleneck wasn't the GPU, the model, or the quantization — it was the prefill stage doing the same 40K tokens of work on every request.</description>
    <content:encoded><![CDATA[<p>Spent a week chasing latency on my local LLM setup. The answer was not the one I expected.</p>
<p><strong>The setup.</strong> RTX 5090. Ollama. A ~40K-token system prompt — instructions, tool schemas, few-shot
examples, the usual modern agent stack. Each request was a fresh user query against that fixed preamble.</p>
<p>Inference felt sluggish. My first instinct was the cliché list: bad quantisation, model too large,
batch size wrong, maybe a driver issue with the 5090. None of it.</p>
<h2 id="where-the-time-was-actually-going">Where the time was actually going</h2>
<p>Serving a single request with a transformer LLM has two phases:</p>
<ol>
<li><strong>Prefill</strong> — process the entire prompt, build the KV cache.</li>
<li><strong>Decode</strong> — autoregressively generate output tokens, one at a time.</li>
</ol>
<p>Decode is what people quote when they say &ldquo;120 tokens/sec.&rdquo; Prefill is what you wait through before
the first token ever appears — <strong>time-to-first-token</strong>, TTFT.</p>
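<p>The split between the two phases can be measured from any streaming response. The following is a minimal sketch, assuming the serving engine exposes generated tokens as a Python iterable; <code>split_latency</code> and its injectable <code>clock</code> argument are hypothetical names for illustration, not part of any library.</p>

```python
import time
from typing import Callable, Iterable, Optional


def split_latency(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a token stream and separate TTFT from decode throughput.

    `stream` is assumed to yield one item per generated token.
    """
    start = clock()
    first: Optional[float] = None
    last = start
    count = 0
    for _ in stream:
        now = clock()
        if first is None:
            first = now  # the first token marks the end of prefill
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_s = last - first
    # Decode rate counts only the tokens that arrived after the first one.
    rate = (count - 1) / decode_s if decode_s > 0 else None
    return {"ttft_s": first - start, "tokens": count, "decode_tok_per_s": rate}
```

<p>If <code>ttft_s</code> dwarfs the time spent decoding, the request is prefill-bound, which is the case this post is about.</p>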
<p>For a 40K-token prompt, prefill has to run all forty thousand tokens through every layer of the
model, attention included, before a single new token is emitted. And here is the cruelty of my setup:
I was paying that cost on <em>every</em> request, even though the 40,000 system-prompt tokens were
identical from one call to the next.</p>
<h2 id="prefix-caching">Prefix caching</h2>
<p>The fix has a name: prefix caching. Store the KV cache of the shared prefix once, then reuse it for
any subsequent request that starts with the same tokens. The model skips straight to processing the
user&rsquo;s new suffix.</p>
<div class="highlight"><pre><span></span><code>request 1:  [system 40K][user query A]  → compute KV for 40K + suffix
request 2:  [system 40K][user query B]  → reuse KV for 40K, compute only suffix
request 3:  [system 40K][user query C]  → reuse KV for 40K, compute only suffix
</code></pre></div>
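<p>The bookkeeping in that diagram can be sketched in a few lines. This is a toy model, not how any engine stores its cache: <code>PrefixCache</code> is a hypothetical class, the stored value is a placeholder string rather than real KV tensors, and the token ids are made up. It only counts how many tokens each request has to prefill.</p>

```python
from typing import Dict, List, Tuple


class PrefixCache:
    """Toy stand-in for a KV cache keyed by an exact token prefix."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[int, ...], str] = {}

    def prefill(self, prefix: List[int], suffix: List[int]) -> int:
        """Return how many tokens had to be computed for this request."""
        key = tuple(prefix)
        computed = len(suffix)  # the new suffix is always computed
        if key not in self._store:
            # Cache miss: pay for the whole prefix once and remember it.
            self._store[key] = f"kv-for-{len(prefix)}-tokens"  # placeholder value
            computed += len(prefix)
        return computed


cache = PrefixCache()
system_prompt = list(range(40_000))  # hypothetical ids standing in for the 40K prompt
first = cache.prefill(system_prompt, [7, 8, 9])         # miss: prefix + suffix
second = cache.prefill(system_prompt, [4, 5, 6, 7, 8])  # hit: suffix only
print(first, second)  # prints: 40003 5
```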

<p>That is the simple version. SGLang&rsquo;s <strong>RadixAttention</strong> generalises it: cached prefixes live in a
radix tree, so <em>any</em> request that shares a prefix path — even a partial one — with prior requests
skips the redundant compute. Multi-tenant serving, multi-turn conversations, branching agent trees
all benefit from the same machinery.</p>
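<p>The partial-match idea can be illustrated with a plain token trie. This is a simplified sketch and not SGLang&rsquo;s implementation: a real radix tree compresses chains of single-child nodes into one edge, attaches actual KV blocks to the nodes, and evicts entries under memory pressure. <code>PrefixTrie</code> and its methods are hypothetical names.</p>

```python
from typing import Dict, List


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[int, "TrieNode"] = {}


class PrefixTrie:
    """Token-level trie that tracks which prefixes have been seen before."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def match_len(self, tokens: List[int]) -> int:
        """Length of the longest cached prefix of `tokens`."""
        node, matched = self.root, 0
        for tok in tokens:
            nxt = node.children.get(tok)
            if nxt is None:
                break
            node, matched = nxt, matched + 1
        return matched

    def insert(self, tokens: List[int]) -> None:
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())

    def serve(self, tokens: List[int]) -> int:
        """Return how many tokens still need prefill, then cache the request."""
        reused = self.match_len(tokens)
        self.insert(tokens)
        return len(tokens) - reused
```

<p>With this structure, a request <code>[1, 2, 3, 9, 9]</code> arriving after <code>[1, 2, 3, 4]</code> reuses the first three tokens and computes only two, even though neither request is a full prefix of the other.</p>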
<h2 id="what-changed">What changed</h2>
<p>Same model. Same hardware. Same 40K system prompt. Different serving engine.</p>
<p>TTFT dropped by close to an order of magnitude. This is not a 10–20% optimisation: on the cached
path, prefill of the shared prefix becomes a cache lookup and only the new suffix is computed.
End-to-end latency on short outputs went from &ldquo;noticeable pause&rdquo; to
&ldquo;feels instant.&rdquo; For an agent loop that fires off five or ten LLM calls per user turn, the
cumulative effect is the difference between a toy and something you would ship.</p>
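<p>The cumulative effect on an agent loop is easy to put into numbers. The sketch below assumes the calls in a turn run sequentially; every figure in it is hypothetical and chosen only to illustrate a roughly tenfold TTFT reduction, not measured on my setup.</p>

```python
def turn_latency(calls: int, ttft_s: float, decode_s: float) -> float:
    """Latency of one user turn when its LLM calls run one after another."""
    return calls * (ttft_s + decode_s)


# Hypothetical figures, for illustration only.
CALLS_PER_TURN = 10
DECODE_S = 0.5            # time spent generating a short output
TTFT_UNCACHED_S = 4.0     # full 40K-token prefill on every call
TTFT_CACHED_S = 0.4       # prefix served from cache, only the suffix prefilled

uncached = turn_latency(CALLS_PER_TURN, TTFT_UNCACHED_S, DECODE_S)
cached = turn_latency(CALLS_PER_TURN, TTFT_CACHED_S, DECODE_S)
print(f"uncached: {uncached:.1f}s per turn, cached: {cached:.1f}s per turn")
```

<p>Under these assumptions, decode time barely matters until the prefill cost is gone; after that, it becomes the larger share of each call.</p>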
<h2 id="lessons-im-keeping">Lessons I&rsquo;m keeping</h2>
<ul>
<li><strong>Profile before you guess.</strong> TTFT vs. decode is the first split to make. If TTFT dominates,
  no amount of quantisation will save you — you are bottlenecked on prefill compute, not on
  memory bandwidth during decode.</li>
<li><strong>Long static prefixes are everywhere now.</strong> Agent system prompts, RAG context, tool catalogues,
  multi-turn history. If the first N tokens of your requests are stable, prefix caching is no
  longer an optimisation — it&rsquo;s the design.</li>
<li><strong>The serving engine is an architectural choice.</strong> Ollama and llama.cpp are wonderful for
  prototyping and single-user workloads. For anything resembling production with shared prefixes,
  look at engines that treat prefix caching as a first-class feature: SGLang, vLLM, TGI.</li>
</ul>
<p>The 5090 is a beast. The real unlock was making it do less.</p>
<blockquote>
<p>If the first N tokens of every request are the same,
compute them once.</p>
</blockquote>
<p>— Antonie</p>]]></content:encoded>
  </item>
  <item>
    <title>Welcome — what this blog is about</title>
    <link>https://roglia.ro/posts/welcome/</link>
    <guid isPermaLink="true">https://roglia.ro/posts/welcome/</guid>
    <pubDate>Sun, 17 May 2026 00:00:00 +0000</pubDate>
    <description>Why I'm starting this, what to expect, and the kind of writing I want to do here.</description>
    <content:encoded><![CDATA[<p>I&rsquo;ve been meaning to start writing for a while. The trigger is finally having something
worth saying — a year of building production LLM systems at Keysight, and a growing
collection of opinions that don&rsquo;t fit in a tweet.</p>
<h2 id="what-youll-find-here">What you&rsquo;ll find here</h2>
<p>Mostly notes on the unglamorous half of agentic AI:</p>
<ul>
<li><strong>Guided generation</strong> — Outlines, JSON schemas, grammar-constrained decoding.</li>
<li><strong>Validators and self-healing loops</strong> — when the model is wrong, what do you do next?</li>
<li><strong>Local inference</strong> — llama.cpp, quantization, the actual cost of a token.</li>
<li><strong>Observability for agents</strong> — drift detection, eval design, when retries cost more than they help.</li>
</ul>
<p>A bit of computer vision and some math when the mood strikes.</p>
<h2 id="how-ill-write">How I&rsquo;ll write</h2>
<p>Short. Specific. Code where it helps, prose where it doesn&rsquo;t.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">outlines</span> <span class="kn">import</span> <span class="n">models</span><span class="p">,</span> <span class="n">generate</span>
<span class="kn">from</span> <span class="nn">pydantic</span> <span class="kn">import</span> <span class="n">BaseModel</span>

<span class="k">class</span> <span class="nc">Decision</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">action</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">confidence</span><span class="p">:</span> <span class="nb">float</span>

<span class="n">model</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">llamacpp</span><span class="p">(</span><span class="s2">&quot;qwen2.5-7b-instruct-q4.gguf&quot;</span><span class="p">)</span>
<span class="n">generator</span> <span class="o">=</span> <span class="n">generate</span><span class="o">.</span><span class="n">json</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">Decision</span><span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">generator</span><span class="p">(</span><span class="s2">&quot;Should I retry? Error: rate_limited (3rd time)&quot;</span><span class="p">)</span>
</code></pre></div>

<p>If you read something here and disagree, <a href="mailto:hello@chirilus.dev">email me</a>.
The shorter the email, the faster the reply.</p>
<blockquote>
<p>The unglamorous half of agentic AI is where the production wins live.
Everyone wants to demo the agent. Few want to ship its retry policy.</p>
</blockquote>
<p>— Antonie</p>]]></content:encoded>
  </item>
</channel>
</rss>
