INFEROA / FLIGHT RECORDER announcement · built on the vLLM stack

Your agent harness is burning tokens. Watch it happen.

Inferoa is an inference-native agent harness for long-horizon coding work. It treats the mechanics and cost of inference — prefix caches, context shape, model routing — as the design constraint, not an afterthought. Below: a turn-by-turn simulation of the same coding task run through a naive harness vs. Inferoa.

task: “fix the failing auth test” · 8 turns
What this is: the page below is a client-side simulation, parameterized from the results reported in the Inferoa announcement (90.0% cached-token discount, 80.8% context reduction via CodeGraph, 61.4% tool-output reduction via RTK). The mechanics are real, though: we also ran the real stack and kept the receipts.
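To make the simulated comparison concrete, here is a minimal sketch of the turn-by-turn arithmetic for the prefix-cache lever alone. Only the 90.0% cached-token discount and the 8-turn task come from this page; the price, the prefix size, and the per-turn growth are hypothetical placeholders, and the sketch is not the page's actual simulation code.

```python
# Illustrative only: the price and token counts below are assumptions.
PRICE_PER_MTOK = 3.00         # hypothetical fresh-input price, USD per million tokens
CACHED_DISCOUNT = 0.90        # cached-token discount reported in the announcement
PREFIX_TOKENS = 16_000        # hypothetical stable prefix (system prompt, tool schemas)
NEW_TOKENS_PER_TURN = 2_000   # hypothetical fresh context added each turn
TURNS = 8                     # the task on this page runs for 8 turns


def context_size(turn: int) -> int:
    """Total input tokens sent on a given turn (1-indexed)."""
    return PREFIX_TOKENS + NEW_TOKENS_PER_TURN * turn


def naive_cost(turns: int) -> float:
    """Resend the whole growing context at full price on every turn."""
    return sum(context_size(t) for t in range(1, turns + 1)) * PRICE_PER_MTOK / 1e6


def cached_cost(turns: int) -> float:
    """Bill tokens already sent on the previous turn at the cached rate."""
    total = 0.0
    for turn in range(1, turns + 1):
        seen_before = 0 if turn == 1 else context_size(turn - 1)
        fresh = context_size(turn) - seen_before
        billable = fresh + seen_before * (1 - CACHED_DISCOUNT)
        total += billable * PRICE_PER_MTOK / 1e6
    return total


if __name__ == "__main__":
    print(f"naive:  ${naive_cost(TURNS):.4f}")
    print(f"cached: ${cached_cost(TURNS):.4f}")
```

With these made-up numbers the naive run bills 200,000 input tokens at full price, while the cached run bills 32,000 fresh tokens and 168,000 tokens at the cached rate. Context reduction and tool-output reduction are deliberately left out of this sketch.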

Receipts — what's real.

A simulation is legible; receipts are credible. We ran the real stack in isolated islo.dev sandboxes and kept the artifacts. Everything below actually executed.

REAL PREFIX CACHE · vLLM /metrics
97.8%

measured cache-hit rate

Real Inferoa v0.11.0 (npm) drove a real vLLM v0.22.1 server (Qwen2.5-0.5B, CPU). vLLM's own Prometheus counters after the run: prefix_cache_hits_total 1,611,008 of queries_total 1,647,574. Note: the announcement's 90.0% is a cached-token discount (pricing); ours is the measured hit rate — related but distinct metrics, both driven by the same byte-stable prefixes.

REAL AGENT WORK · MERGED TO MAIN
2 lines

failing test → fix → green, in a sandbox

An agent in an isolated sandbox was handed a repo with a genuinely failing test (tz-naive vs tz-aware datetime) and a task prompt that named the suspected cause. It produced the 2-line fix, re-ran pytest to green, and the change is merged on main. Verbatim before/after output is embedded in step 7 below. What's proven: the sandboxed execute-verify-publish workflow — not unguided agent diagnosis.

REAL WIRING · TWO SANDBOXES

vLLM in one, Inferoa in the other

Sandbox A serves vLLM on :8000; islo share exposed it at a public *.share.islo.dev URL; sandbox B's Inferoa pointed its base_url there. Inferoa's event log records the proof: provider_id: vllm:openai_compatible:https://…share.islo.dev/v1, prompt_tokens: 16,829 per turn, with stable prompt/tool-schema hashes — cache discipline, visible.

HONEST LIMITS

what we don't claim

The 0.5B CPU model proves the mechanics (caching, routing, harness loop), not frontier-grade coding — it made a real but adorably confused tool call. Sim pricing is illustrative and published in the README. Share URLs expire in 24h; the repos and PR are permanent.

[Interactive simulation: press RUN to start the 8-turn agent loop]
Counters: tokens processed · prefix cache hit rate · spend avoided vs. naive · agent turns (0 / 8)
Panels: agent stream · context window (128k tokens; legend: cached prefix at ≈10% price, fresh input, output) · semantic router decisions
Cumulative spend, same task, two harnesses:
naive harness: full resend, raw tool dumps, frontier-only
inferoa: prefix cache + codegraph + rtk + routing

The real run, end to end.

Every block below is captured output from the actual run on 2026-06-10 — three isolated islo.dev sandboxes, real vLLM, real Inferoa, real metrics. Nothing here is mocked. Two pipelines, two results:

laptop: the islo CLI creates three isolated sandboxes on islo.dev · no GPU, no devops, one command each

Pipeline 1 · measure the inference claim
SANDBOX B: inferoa@0.11.0 · the actual harness, installed from npm
  ↓ real inference requests · HTTPS
PUBLIC URL (islo share): https://…share.islo.dev · sandbox port 8000, exposed in one command
  ↓
SANDBOX A: vLLM v0.22.1 · Qwen2.5-0.5B · CPU · prefix_caching = ON
  ↓ read off vLLM's own /metrics
RESULT: 97.8% cache-hit rate · 1,611,008 / 1,647,574 prompt tokens served from prefix cache

Pipeline 2 · prove the agent workflow
SANDBOX C: coding agent + failing repo · TypeError: naive vs aware datetime comparison
  ↓ pytest red → 2-line fix → pytest green
GITHUB · MAIN: fix merged (ffda3d7) · verbatim before/after output in step 7 below
RESULT: 0 → 2 tests passing · execute → verify → publish, fully isolated

What happened: the harness in B drove the model in A through a public islo.dev URL, and vLLM's own counters measured the cache. The agent in C turned a failing repo green. Every box maps to a numbered step below with its captured output.
97.8% · REAL CACHE-HIT RATE (vLLM /metrics)
1.65M · PROMPT TOKENS THROUGH vLLM
138 · SUCCESSFUL MODEL REQUESTS
0 → 2 · TESTS PASSING AFTER AGENT FIX
Step 1 · Spin up a vLLM sandbox

One CLI command turns the official vLLM CPU image into a running sandbox. Note the engine config: enable_prefix_caching=True.

$ islo use vllm-cpu-1781079422 -i docker.io/vllm/vllm-openai-cpu:latest-x86_64 \
    --cpu 6 --memory 12288 --disk 30
 Sandbox 'vllm-cpu-1781079422' created
$ vllm serve Qwen/Qwen2.5-0.5B-Instruct --dtype bfloat16 --max-model-len 32768
(EngineCore) INFO core.py:112 Initializing a V1 LLM engine (v0.22.1) with config:
  model='Qwen/Qwen2.5-0.5B-Instruct', device_config=cpu, enable_prefix_caching=True …
(Worker) INFO Time spent downloading weights: 10.27 s
(Worker) INFO Loading weights took 0.27 seconds
real output · sandbox A
Step 2 · Expose it to the world

One more command gives the sandbox port a public HTTPS URL on the islo.dev domain.

$ islo share vllm-cpu-1781079422 8000 --ttl 24h
 Share created for vllm-cpu-1781079422:8000
  URL: https://itc807adoyzwz6hzj0rkmcp7m.ca.share.islo.dev
  Expires: in 1d 0h
real output · share url (24h ttl — may be expired by the time you read this)
Step 3 · Prove the inference is real

A completion through the public URL. Asked to echo “INFERENCE OK”, the 0.5B model replied “INFORMATION OK”. A mock would have echoed the string exactly, so the model's mistake is the evidence of live inference.

$ curl https://itc807….ca.share.islo.dev/v1/chat/completions \
    -H 'Content-Type: application/json' -d '{
    "model":"Qwen/Qwen2.5-0.5B-Instruct",
    "messages":[{"role":"user","content":"Reply with exactly: INFERENCE OK"}]}'
reply: INFORMATION OK
usage: {'prompt_tokens': 36, 'total_tokens': 40, 'completion_tokens': 4}
real output · external request from a laptop, through islo.dev, into sandbox A
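The same request can be issued from Python with only the standard library. This is a sketch, assuming the server exposes the OpenAI-compatible `/v1/chat/completions` route shown in the curl call above; `build_chat_request` and `send` are hypothetical helper names, and the base URL is whatever your own share command printed.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to an OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> tuple:
    """Send the request and return (reply text, usage dict)."""
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.loads(response.read())
    return body["choices"][0]["message"]["content"], body["usage"]
```

Calling `send(build_chat_request(share_url, "Qwen/Qwen2.5-0.5B-Instruct", "Reply with exactly: INFERENCE OK"))` would return the reply and the token usage, provided the share URL is still live.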
Step 4 · Point real Inferoa at it

Second sandbox: install the actual npm package and aim its engine config at the share URL. Inferoa's default provider is literally vLLM.

$ npm install -g inferoa   # inferoa@0.11.0
$ cat ~/.inferoa/config.yaml
model_setup:
  mode: direct
  provider: vllm
  model: Qwen/Qwen2.5-0.5B-Instruct
  base_url: https://itc807….ca.share.islo.dev/v1
$ inferoa debug status --json | jq .endpoint_signals.errors
[]
real output · sandbox B
Step 5 · Run the harness for real

Inferoa's own event log records each model call: the islo.dev provider URL, a 16,829-token prompt, and stable prompt/tool-schema hashes — the cache-discipline machinery, visible. The tiny model then made a real but confused tool call (it tried to invoke is_palindrome instead of writing it). Honest limit: 0.5B proves mechanics, not coding skill.

$ inferoa --print "Write a Python function is_palindrome(s)…"
$ inferoa debug events s_6b9da06c… 
type: model.request.started
  provider_id: vllm:openai_compatible:https://itc807….ca.share.islo.dev/v1
  prompt_hash: 990301121c…  tool_schema_hash: a381b770ce…
type: model.response.settled
  usage: { prompt_tokens: 16829, completion_tokens: 30 }
  tool_calls: [ { name: is_palindrome,
    arguments: { s: "A man, a plan, a canal, Panama" } } ]
real output · inferoa event log, session s_6b9da06c…
Step 6 · Read the cache off vLLM itself

Not our numbers — vLLM's Prometheus counters after the session. 1,611,008 of 1,647,574 prompt tokens hit the prefix cache: a 97.8% hit rate. (The announcement's 90.0% is a cached-token discount — a pricing metric; this is the hit rate. Same mechanism: byte-stable prefixes.)

$ curl localhost:8000/metrics | grep prefix_cache
vllm:prefix_cache_queries_total{model="Qwen/Qwen2.5-0.5B-Instruct"} 1,647,574
vllm:prefix_cache_hits_total{model="Qwen/Qwen2.5-0.5B-Instruct"}    1,611,008
# hit rate = 97.8%
real output · sandbox A, vllm /metrics
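The hit-rate arithmetic can be reproduced from the scraped text. The sketch below parses counters in the Prometheus text format and divides hits by queries; `counter_value` and `hit_rate` are hypothetical helper names, and the parser strips thousands separators in case values are displayed with commas, as in the listing above.

```python
import re


def counter_value(metrics_text: str, name: str) -> float:
    """Sum every sample of one counter in Prometheus text exposition format."""
    pattern = re.compile(re.escape(name) + r"(\{.*\})?\s+(\S+)")
    total = 0.0
    for line in metrics_text.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE lines and comments
            continue
        match = pattern.match(line)
        if match:
            total += float(match.group(2).replace(",", ""))
    return total


def hit_rate(metrics_text: str) -> float:
    """Prefix-cache hit rate: cached prompt tokens / queried prompt tokens."""
    hits = counter_value(metrics_text, "vllm:prefix_cache_hits_total")
    queries = counter_value(metrics_text, "vllm:prefix_cache_queries_total")
    return hits / queries if queries else 0.0


if __name__ == "__main__":
    sample = (
        'vllm:prefix_cache_queries_total{model="Qwen/Qwen2.5-0.5B-Instruct"} 1,647,574\n'
        'vllm:prefix_cache_hits_total{model="Qwen/Qwen2.5-0.5B-Instruct"}    1,611,008\n'
    )
    print(f"hit rate = {hit_rate(sample):.1%}")
```

Summing across samples means the function still works if the endpoint reports the counter once per label set.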
Step 7 · And the agent loop, with receipts

Separately, an agent in a third sandbox was handed a repo with a genuinely failing test and the suspected cause. It produced the fix, re-ran pytest to green, and the change is merged on main (ffda3d7). Before / fix / after:

$ python3 -m pytest -v
FAILED tests/test_token_refresh.py::test_token_refresh
E  TypeError: can't compare offset-naive and offset-aware datetimes

--- a/auth_service/token.py
+++ b/auth_service/token.py
-    now = datetime.utcnow()
+    now = datetime.now(timezone.utc)

$ python3 -m pytest -v
tests/test_token_refresh.py::test_token_refresh PASSED
tests/test_token_refresh.py::test_expired_token_rejected PASSED
============== 2 passed ==============
real output · sandbox C · merged to zozo123/inferoa-receipts@main
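For readers unfamiliar with the failure mode, the following self-contained sketch reproduces the same class of error outside the repo. The `is_expired` helper is hypothetical and is not the code in `auth_service/token.py`; it only shows why comparing a naive timestamp with an aware one raises, and why `datetime.now(timezone.utc)` avoids it.

```python
from datetime import datetime, timedelta, timezone


def is_expired(now: datetime, expires_at: datetime) -> bool:
    """Hypothetical expiry check: True once `now` reaches the expiry time."""
    return now >= expires_at


# An aware expiry timestamp, five minutes in the future.
aware_expiry = datetime.now(timezone.utc) + timedelta(minutes=5)

# A naive timestamp, equivalent to what datetime.utcnow() returns (no tzinfo).
naive_now = datetime.now(timezone.utc).replace(tzinfo=None)

try:
    is_expired(naive_now, aware_expiry)
    comparison_failed = False
    message = ""
except TypeError as error:
    # Python refuses to order naive and aware datetimes.
    comparison_failed = True
    message = str(error)

# The fix: take "now" as an aware datetime, so both operands carry tzinfo.
aware_now = datetime.now(timezone.utc)
still_valid = not is_expired(aware_now, aware_expiry)

if __name__ == "__main__":
    print("naive vs aware raised:", comparison_failed, "|", message)
    print("aware vs aware, token still valid:", still_valid)
```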

Three levers, multiplied.

Each mechanism compounds with the others. The discount isn't one trick — it's the product of treating inference as a first-class engineering surface.
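One way to read "multiplied" is as a product of two factors: how many input tokens remain after context and tool-output reduction, and what price those remaining tokens are billed at. The sketch below works that out using the three percentages reported on this page; the token mix (`context_share`) and the share of tokens billed at the cached rate (`cached_fraction`) are hypothetical inputs, and this formula is an assumption about how the levers combine, not the announcement's own model.

```python
CACHED_DISCOUNT = 0.900        # cached-token discount (from the announcement)
CONTEXT_REDUCTION = 0.808      # context reduction via CodeGraph
TOOL_OUTPUT_REDUCTION = 0.614  # tool-output reduction via RTK


def relative_input_cost(context_share: float, cached_fraction: float) -> float:
    """Input cost relative to a naive harness (1.0 means no savings).

    context_share:   fraction of naive input tokens that are repo context;
                     the remainder is treated as tool output (assumption).
    cached_fraction: fraction of the remaining tokens billed at the cached
                     rate (assumption).
    """
    remaining_tokens = (
        context_share * (1 - CONTEXT_REDUCTION)
        + (1 - context_share) * (1 - TOOL_OUTPUT_REDUCTION)
    )
    price_factor = cached_fraction * (1 - CACHED_DISCOUNT) + (1 - cached_fraction)
    return remaining_tokens * price_factor


if __name__ == "__main__":
    # Hypothetical mix: 70% repo context, 80% of remaining tokens cached.
    print(f"{relative_input_cost(0.7, 0.8):.4f}")
```

In the best case for this model, where all input is repo context and all of it is cached, the relative cost is 0.192 × 0.10, roughly 2% of the naive bill. Real workloads will land somewhere above that.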

PREFIX CACHE DISCIPLINE
90.0%

cached-token discount

The harness keeps prompt prefixes (system prompt, tool schemas, repo context) byte-stable across turns, so vLLM's prefix cache can reuse the KV blocks it already computed for repeated context, and frontier providers' cache pricing bills that context at the cached rate instead of full price.
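Byte stability is easy to lose by accident: a reordered dictionary key or a timestamp in the system prompt changes the prefix and defeats the cache. Below is a minimal sketch of one way to keep a prefix canonical and to detect drift with a hash. It is not how Inferoa computes its `prompt_hash` or `tool_schema_hash`; `canonical_bytes` and `prefix_hash` are hypothetical names.

```python
import hashlib
import json


def canonical_bytes(obj) -> bytes:
    """Serialize to JSON with sorted keys and fixed separators."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def prefix_hash(system_prompt: str, tool_schemas: list) -> str:
    """Hash the stable part of the prompt; equal hashes mean equal bytes."""
    # Sort tools by name so registration order cannot change the prefix.
    tools = sorted(tool_schemas, key=lambda tool: tool["name"])
    prefix = {"system": system_prompt, "tools": tools}
    return hashlib.sha256(canonical_bytes(prefix)).hexdigest()


if __name__ == "__main__":
    tools_a = [{"name": "read_file", "args": {"path": "string"}},
               {"name": "run_tests", "args": {}}]
    tools_b = [{"args": {}, "name": "run_tests"},
               {"args": {"path": "string"}, "name": "read_file"}]
    print(prefix_hash("You are a coding agent.", tools_a)
          == prefix_hash("You are a coding agent.", tools_b))
```

Comparing the hash between consecutive turns gives a cheap check that nothing volatile has leaked into the cached portion of the prompt.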

CODEGRAPH
80.8%

context reduction

Instead of pasting whole files, the agent carries a graph-shaped slice of the repository — symbols, edges, and the spans that matter for the task. Same signal, a fifth of the tokens.

RTK
61.4%

tool-output reduction

Raw tool output is the silent token killer. RTK records compact, structured command results — exit codes, deltas, the failing assertion — not 400 lines of test runner noise.
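As an illustration of the idea, not of RTK's actual format or API, the sketch below reduces raw pytest output to an exit code, the failed test IDs, the assertion lines, and the summary. `CompactResult` and `compact_pytest_output` are hypothetical names.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CompactResult:
    exit_code: int
    failed: list = field(default_factory=list)   # IDs of failing tests
    errors: list = field(default_factory=list)   # pytest "E ..." lines
    summary: str = ""                            # e.g. "2 passed"


def compact_pytest_output(exit_code: int, raw: str) -> CompactResult:
    """Keep only the lines an agent needs to decide its next step."""
    result = CompactResult(exit_code=exit_code)
    for line in raw.splitlines():
        stripped = line.strip()
        if stripped.startswith("FAILED "):
            result.failed.append(stripped[len("FAILED "):])
        elif stripped.startswith("E "):
            result.errors.append(stripped[1:].strip())
        elif re.search(r"\b\d+ (passed|failed)\b", stripped):
            result.summary = stripped.strip("= ")
    return result


if __name__ == "__main__":
    raw = (
        "FAILED tests/test_token_refresh.py::test_token_refresh\n"
        "E  TypeError: can't compare offset-naive and offset-aware datetimes\n"
    )
    print(compact_pytest_output(1, raw))
```

Everything else in the runner's output (progress dots, platform banners, full tracebacks) is dropped before it reaches the model's context.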

Route by economics, not habit.

The vLLM Semantic Router sends each step where it belongs: mechanical edits and log digestion to self-hosted models at marginal cost; architectural reasoning to frontier models when capability actually pays for itself. Cost, privacy, and capability are routing inputs — not vibes.
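A toy version of that decision can be written as a few rules. This is only a rule-based stand-in to make the routing inputs explicit: the vLLM Semantic Router classifies requests itself, and the `Step` type, the step kinds, and the target names here are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    kind: str        # hypothetical labels: "edit", "log_digest", "architecture"
    sensitive: bool  # True if the content must stay on self-hosted infrastructure


# Step kinds assumed cheap enough for a self-hosted model (assumption).
SELF_HOSTED_KINDS = {"edit", "log_digest"}


def route(step: Step) -> str:
    """Pick a target from privacy first, then cost versus capability."""
    if step.sensitive:
        return "self_hosted"   # privacy overrides capability
    if step.kind in SELF_HOSTED_KINDS:
        return "self_hosted"   # mechanical work at marginal cost
    return "frontier"          # reasoning-heavy steps pay for capability


if __name__ == "__main__":
    for step in (Step("edit", False), Step("architecture", False), Step("architecture", True)):
        print(step.kind, step.sensitive, "->", route(step))
```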

“Great to see Inferoa, a community agent harness built on the vLLM stack — it designs the agent loop around the mechanics and cost of inference, with prefix-cache discipline, context optimization, and routing across self-hosted and frontier models.”

— vLLM

Try it for real.

The simulation above is illustrative. The harness is real:

# install
$ npm install -g inferoa
# configure engine, router, cache
$ inferoa setup
# run on a long-horizon task
$ inferoa