INFEROA / FLIGHT RECORDER announcement · built on the vLLM stack

Your agent harness is
burning tokens.
Watch it happen.

Inferoa is an inference-native agent harness for long-horizon coding work. It treats the mechanics and cost of inference — prefix caches, context shape, model routing — as the design constraint, not an afterthought. Below: a turn-by-turn simulation of the same coding task run through a naive harness vs. Inferoa.

task: “fix the failing auth test” · 8 turns

What this is: the page below is a client-side simulation, parameterized from the results reported in the Inferoa announcement (90.0% cached-token discount, 80.8% context reduction via CodeGraph, 61.4% tool-output reduction via RTK). The mechanics are real, though: we also ran the actual stack and kept the receipts ↓
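A minimal sketch of how a simulation like this can be parameterized. Only the three percentages come from the announcement; the price, the token counts, and the function names are hypothetical placeholders, not the page's real inputs or the README's pricing. Output tokens are ignored to keep it short.

```python
# Only these three percentages come from the announcement.
CACHED_DISCOUNT = 0.900      # cached prompt tokens cost 10% of the fresh price
CONTEXT_REDUCTION = 0.808    # share of repo context removed by CodeGraph
TOOL_REDUCTION = 0.614       # share of tool output removed by RTK

# Everything below is an assumed, illustrative value.
PRICE_PER_MTOK = 3.00        # USD per million fresh input tokens
BASE_PREFIX = 4_000          # system prompt + tool schemas, in tokens
REPO_CONTEXT = 20_000        # raw repository context, in tokens
TOOL_OUTPUT_PER_TURN = 3_000 # raw tool output appended each turn, in tokens


def naive_spend(turns: int) -> float:
    """Resend the full, growing prompt every turn at the fresh-input price."""
    prompt = BASE_PREFIX + REPO_CONTEXT
    billed = 0
    for _ in range(turns):
        billed += prompt
        prompt += TOOL_OUTPUT_PER_TURN
    return billed * PRICE_PER_MTOK / 1e6


def optimized_spend(turns: int) -> float:
    """Sliced context, compacted tool output, and a discounted cached prefix."""
    context = REPO_CONTEXT * (1 - CONTEXT_REDUCTION)
    tool = TOOL_OUTPUT_PER_TURN * (1 - TOOL_REDUCTION)
    prompt = BASE_PREFIX + context
    seen = 0.0       # prompt tokens already sent in an earlier turn
    weighted = 0.0   # tokens weighted by their relative price
    for _ in range(turns):
        fresh = prompt - seen
        weighted += fresh + seen * (1 - CACHED_DISCOUNT)
        seen = prompt    # assumes an append-only prompt, so the old part is a cache hit
        prompt += tool
    return weighted * PRICE_PER_MTOK / 1e6


if __name__ == "__main__":
    print(f"naive:     ${naive_spend(8):.4f}")
    print(f"optimized: ${optimized_spend(8):.4f}")
```

The append-only assumption matters: if anything earlier in the prompt changes between turns, the tokens after that point stop being cache hits and are billed as fresh input.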

Receipts — what's real.

A simulation is legible; receipts are credible. We ran the real stack in isolated islo.dev sandboxes and kept the artifacts. Everything in these receipts actually executed.

REAL PREFIX CACHE · vLLM /metrics

97.8%

measured cache-hit rate

Real Inferoa v0.11.0 (npm) drove a real vLLM v0.22.1 server (Qwen2.5-0.5B, CPU). vLLM's own Prometheus counters after the run: vllm:prefix_cache_hits_total 1,611,008 of vllm:prefix_cache_queries_total 1,647,574, both counted in prompt tokens. Note: the announcement's 90.0% is a cached-token discount (a pricing figure); ours is the measured hit rate. The two metrics are related but distinct, and both are driven by the same byte-stable prefixes.
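To see how the two metrics relate, here is a small worked calculation. It assumes, hypothetically, that cache hits from this run were billed at the announcement's 90.0% discount; the measured run used a self-hosted CPU model with no per-token billing, so the result is illustrative only. The function name is ours.

```python
def relative_input_cost(hit_rate: float, cached_discount: float) -> float:
    """Input cost as a fraction of the fully uncached price."""
    cached_part = hit_rate * (1 - cached_discount)  # hits billed at the discounted rate
    fresh_part = 1 - hit_rate                       # misses billed at full price
    return cached_part + fresh_part


hits = 1_611_008      # vllm:prefix_cache_hits_total, from the run above
queries = 1_647_574   # vllm:prefix_cache_queries_total, from the run above
hit_rate = hits / queries

cost = relative_input_cost(hit_rate, cached_discount=0.900)
print(f"hit rate {hit_rate:.1%}, relative input cost {cost:.1%}")
```

Under that assumption the hit rate says how many prompt tokens qualify for the discount, and the discount says how much each qualifying token saves; neither number alone gives the bill.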

REAL AGENT WORK · MERGED TO MAIN

2 lines

failing test → fix → green, in a sandbox

An agent in an isolated sandbox was handed a repo with a genuinely failing test (tz-naive vs tz-aware datetime) and a task prompt that named the suspected cause. It produced the 2-line fix, re-ran pytest to green, and the change is merged on main. Verbatim before/after output is embedded in step 7 below. What's proven: the sandboxed execute-verify-publish workflow — not unguided agent diagnosis.

REAL WIRING · TWO SANDBOXES

vLLM in one, Inferoa in the other

Sandbox A serves vLLM on :8000; islo share exposed it at a public *.share.islo.dev URL; sandbox B's Inferoa pointed its base_url there. Inferoa's event log records the proof: provider_id: vllm:openai_compatible:https://…share.islo.dev/v1, prompt_tokens: 16,829 per turn, with stable prompt/tool-schema hashes — cache discipline, visible.

HONEST LIMITS

what we don't claim

The 0.5B CPU model proves the mechanics (caching, routing, harness loop), not frontier-grade coding — it made a real but adorably confused tool call. Sim pricing is illustrative and published in the README. Share URLs expire in 24h; the repos and PR are permanent.

TOKENS PROCESSED

— · PREFIX CACHE HIT RATE

$0.0000 · SPEND AVOIDED VS NAIVE

0 / 8 · AGENT TURNS

AGENT STREAM · IDLE

// press RUN to start the agent loop…

CONTEXT WINDOW

0 tokens · 128k window

cached prefix (≈10% price) · fresh input · output

SEMANTIC ROUTER

// routing decisions appear here

CUMULATIVE SPEND — SAME TASK, TWO HARNESSES

naive harness — full resend, raw tool dumps, frontier-only · $0.0000

inferoa — prefix cache + codegraph + rtk + routing · $0.0000

The real run, end to end.

Every block below is captured output from the actual run on 2026-06-10 — three isolated islo.dev sandboxes, real vLLM, real Inferoa, real metrics. Nothing here is mocked. Two pipelines, two results:

⌨ laptop — the islo CLI creates three isolated sandboxes on islo.dev · no GPU, no devops, one command each

Pipeline 1 · measure the inference claim

SANDBOX B

inferoa@0.11.0 · the actual harness, installed from npm

real inference requests · HTTPS

PUBLIC URL · islo share

https://…share.islo.dev · sandbox port 8000, exposed in one command

SANDBOX A

vLLM v0.22.1 · Qwen2.5-0.5B · CPU · prefix_caching = ON

read off vLLM's own /metrics

RESULT

97.8% cache-hit rate · 1,611,008 / 1,647,574 prompt tokens served from prefix cache

Pipeline 2 · prove the agent workflow

SANDBOX C

coding agent + failing repo · TypeError: naive vs aware datetime comparison

pytest red → 2-line fix → pytest green

GITHUB · MAIN

fix merged (ffda3d7) · verbatim before/after output in step 7 below

RESULT

0 → 2 tests passing · execute → verify → publish, fully isolated

What happened: the harness in B drove the model in A through a public islo.dev URL, and vLLM's own counters measured the cache. The agent in C turned a failing repo green. Every box maps to a numbered step below with its captured output.

97.8%

REAL CACHE-HIT RATE (vLLM /metrics)

1.65M

PROMPT TOKENS THROUGH vLLM

138

SUCCESSFUL MODEL REQUESTS

0 → 2

TESTS PASSING AFTER AGENT FIX

Spin up a vLLM sandbox

One CLI command turns the official vLLM CPU image into a running sandbox. Note the engine config: enable_prefix_caching=True.

$ islo use vllm-cpu-1781079422 -i docker.io/vllm/vllm-openai-cpu:latest-x86_64 \
    --cpu 6 --memory 12288 --disk 30
✓ Sandbox 'vllm-cpu-1781079422' created
$ vllm serve Qwen/Qwen2.5-0.5B-Instruct --dtype bfloat16 --max-model-len 32768
(EngineCore) INFO core.py:112 Initializing a V1 LLM engine (v0.22.1) with config:
  model='Qwen/Qwen2.5-0.5B-Instruct', device_config=cpu, enable_prefix_caching=True …
(Worker) INFO Time spent downloading weights: 10.27 s
(Worker) INFO Loading weights took 0.27 seconds

real output · sandbox A

Expose it to the world

One more command gives the sandbox port a public HTTPS URL on the islo.dev domain.

$ islo share vllm-cpu-1781079422 8000 --ttl 24h
✓ Share created for vllm-cpu-1781079422:8000
  URL: https://itc807adoyzwz6hzj0rkmcp7m.ca.share.islo.dev
  Expires: in 1d 0h

real output · share url (24h ttl — may be expired by the time you read this)

Prove the inference is real

A completion through the public URL. Asked to echo “INFERENCE OK”, the 0.5B model replied “INFORMATION OK”; a mock would have echoed the string exactly. The mistake is the proof.

$ curl https://itc807….ca.share.islo.dev/v1/chat/completions \
    -H 'Content-Type: application/json' -d '{
    "model":"Qwen/Qwen2.5-0.5B-Instruct",
    "messages":[{"role":"user","content":"Reply with exactly: INFERENCE OK"}]}'
reply: INFORMATION OK
usage: {'prompt_tokens': 36, 'total_tokens': 40, 'completion_tokens': 4}

real output · external request from a laptop, through islo.dev, into sandbox A

Point real Inferoa at it

Second sandbox: install the actual npm package and aim its engine config at the share URL. Inferoa's default provider is literally vLLM.

$ npm install -g inferoa   # inferoa@0.11.0
$ cat ~/.inferoa/config.yaml
model_setup:
  mode: direct
  provider: vllm
  model: Qwen/Qwen2.5-0.5B-Instruct
  base_url: https://itc807….ca.share.islo.dev/v1
$ inferoa debug status --json | jq .endpoint_signals.errors
[]

real output · sandbox B

Run the harness for real

Inferoa's own event log records each model call: the islo.dev provider URL, a 16,829-token prompt, and stable prompt/tool-schema hashes — the cache-discipline machinery, visible. The tiny model then made a real but confused tool call (it tried to invoke is_palindrome instead of writing it). Honest limit: 0.5B proves mechanics, not coding skill.

$ inferoa --print "Write a Python function is_palindrome(s)…"
$ inferoa debug events s_6b9da06c… 
type: model.request.started
  provider_id: vllm:openai_compatible:https://itc807….ca.share.islo.dev/v1
  prompt_hash: 990301121c…  tool_schema_hash: a381b770ce…
type: model.response.settled
  usage: { prompt_tokens: 16829, completion_tokens: 30 }
  tool_calls: [ { name: is_palindrome,
    arguments: { s: "A man, a plan, a canal, Panama" } } ]

real output · inferoa event log, session s_6b9da06c…

Read the cache off vLLM itself

Not our numbers — vLLM's Prometheus counters after the session. 1,611,008 of 1,647,574 prompt tokens hit the prefix cache: a 97.8% hit rate. (The announcement's 90.0% is a cached-token discount — a pricing metric; this is the hit rate. Same mechanism: byte-stable prefixes.)

$ curl localhost:8000/metrics | grep prefix_cache
vllm:prefix_cache_queries_total{model="Qwen/Qwen2.5-0.5B-Instruct"} 1,647,574
vllm:prefix_cache_hits_total{model="Qwen/Qwen2.5-0.5B-Instruct"}    1,611,008
# hit rate = 97.8%

real output · sandbox A, vllm /metrics
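The hit rate above can be recomputed from the two counter lines. This is a minimal sketch, not part of Inferoa or vLLM; the function name is ours. It strips the thousands separators shown on this page, which raw Prometheus exposition output would not normally contain.

```python
def prefix_cache_hit_rate(metrics_text: str) -> float:
    """Return hits / queries from vLLM's prefix-cache counter lines."""
    counters = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line.startswith("vllm:prefix_cache_"):
            continue  # skip comments, blank lines, and unrelated metrics
        name_and_labels, value = line.rsplit(None, 1)
        name = name_and_labels.split("{", 1)[0]  # drop the {model="..."} labels
        counters[name] = float(value.replace(",", ""))
    return (counters["vllm:prefix_cache_hits_total"]
            / counters["vllm:prefix_cache_queries_total"])


sample = '''
vllm:prefix_cache_queries_total{model="Qwen/Qwen2.5-0.5B-Instruct"} 1,647,574
vllm:prefix_cache_hits_total{model="Qwen/Qwen2.5-0.5B-Instruct"}    1,611,008
'''
rate = prefix_cache_hit_rate(sample)
print(f"hit rate = {rate:.1%}")
```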

And the agent loop, with receipts

Separately, an agent in a third sandbox was handed a repo with a genuinely failing test and the suspected cause. It produced the fix, re-ran pytest to green, and the change is merged on main (ffda3d7). Before / fix / after:

$ python3 -m pytest -v
FAILED tests/test_token_refresh.py::test_token_refresh
E  TypeError: can't compare offset-naive and offset-aware datetimes

--- a/auth_service/token.py
+++ b/auth_service/token.py
-    now = datetime.utcnow()
+    now = datetime.now(timezone.utc)

$ python3 -m pytest -v
tests/test_token_refresh.py::test_token_refresh PASSED
tests/test_token_refresh.py::test_expired_token_rejected PASSED
============== 2 passed ==============

real output · sandbox C · merged to zozo123/inferoa-receipts@main
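The failure and the fix can be reproduced in isolation. In this sketch the name `expires_at` is hypothetical (the repo's surrounding code is not shown here); the naive value is built the way `datetime.utcnow()` builds it, as a UTC wall-clock time with no tzinfo attached.

```python
from datetime import datetime, timedelta, timezone

# A timezone-aware expiry, as a token record might store it (hypothetical name).
expires_at = datetime.now(timezone.utc) + timedelta(minutes=5)

# Equivalent to datetime.utcnow(): UTC clock time, but tzinfo is None.
naive_now = datetime.now(timezone.utc).replace(tzinfo=None)

try:
    naive_now < expires_at
    raised = False
    message = ""
except TypeError as exc:
    raised = True
    message = str(exc)

# The fix from the diff: take an aware "now" so both operands carry a timezone.
aware_now = datetime.now(timezone.utc)
still_valid = aware_now < expires_at

print(raised, message)
print(still_valid)
```

Python refuses to order a naive datetime against an aware one because it cannot know which offset the naive value was meant to have, so the comparison fails loudly instead of guessing.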

Three levers, multiplied.

Each mechanism compounds with the others. The discount isn't one trick — it's the product of treating inference as a first-class engineering surface.

PREFIX CACHE DISCIPLINE

90.0%

cached-token discount

The harness keeps prompt prefixes byte-stable across turns — system prompt, tool schemas, repo context — so vLLM's prefix cache (and frontier cache pricing) bills repeated context at the cached rate instead of full price.
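One way to keep a prefix byte-stable is to serialize it canonically and hash the bytes, so any drift shows up as a changed hash. This is a minimal sketch under our own assumptions, not Inferoa's implementation; the function names, the payload layout, and the choice of SHA-256 are all hypothetical.

```python
import hashlib
import json


def stable_prefix(system_prompt: str, tool_schemas: list[dict]) -> bytes:
    """Serialize the prompt prefix so equal content always yields equal bytes."""
    tools = sorted(tool_schemas, key=lambda t: t["name"])  # fixed tool order
    payload = {"system": system_prompt, "tools": tools}
    text = json.dumps(payload, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)                  # fixed key order and spacing
    return text.encode("utf-8")


def prefix_hash(prefix: bytes) -> str:
    return hashlib.sha256(prefix).hexdigest()


read_file = {"name": "read_file", "parameters": {"path": "string"}}
run_tests = {"parameters": {"target": "string"}, "name": "run_tests"}

turn_1 = prefix_hash(stable_prefix("You are a coding agent.", [read_file, run_tests]))
turn_2 = prefix_hash(stable_prefix("You are a coding agent.", [run_tests, read_file]))
drifted = prefix_hash(stable_prefix("You are a coding agent. Time: 12:00",
                                    [read_file, run_tests]))
print(turn_1 == turn_2, turn_1 == drifted)
```

The drifted case is the common failure mode: a timestamp or other per-turn value placed early in the prompt changes the bytes, and everything after it misses the cache.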

CODEGRAPH

80.8%

context reduction

Instead of pasting whole files, the agent carries a graph-shaped slice of the repository — symbols, edges, and the spans that matter for the task. Same signal, a fifth of the tokens.
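As an illustration of what a graph-shaped slice can mean, the sketch below walks a symbol graph outward from the symbols a task touches and keeps only what is reachable within a few hops. It is not CodeGraph's algorithm; the function, the graph, and every symbol name are hypothetical.

```python
from collections import deque


def slice_graph(edges: dict[str, list[str]], seeds: list[str], max_hops: int) -> set[str]:
    """Breadth-first walk: keep symbols within max_hops of any seed."""
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand past the hop budget
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return seen


# Hypothetical call graph: symbol -> symbols it references.
edges = {
    "test_token_refresh": ["refresh_token"],
    "refresh_token": ["is_expired", "load_user"],
    "load_user": ["db_connect"],
    "render_invoice": ["format_currency"],
}
kept = slice_graph(edges, seeds=["test_token_refresh"], max_hops=2)
print(sorted(kept))
```

Only the source spans for the kept symbols would go into the prompt; unrelated code such as `render_invoice` never enters the context.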

RTK

61.4%

tool-output reduction

Raw tool output is the silent token killer. RTK records compact, structured command results — exit codes, deltas, the failing assertion — not 400 lines of test runner noise.
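A compact command record might look like the sketch below, which reduces pytest output to the exit code, the failing test IDs, and the error lines. This is not RTK's format or API; the function name and the record fields are hypothetical, and the parsing relies only on pytest's `FAILED` summary lines and `E` error lines.

```python
import re


def compact_result(command: str, exit_code: int, output: str) -> dict:
    """Keep the exit code, failing test IDs, and error lines; drop the rest."""
    failed = re.findall(r"^FAILED (\S+)", output, flags=re.MULTILINE)
    errors = [line[1:].strip() for line in output.splitlines()
              if line.startswith("E ")]
    return {"command": command, "exit_code": exit_code,
            "failed": failed, "errors": errors}


raw_output = """\
============================= test session starts ==============================
collected 2 items
tests/test_token_refresh.py F.
E  TypeError: can't compare offset-naive and offset-aware datetimes
FAILED tests/test_token_refresh.py::test_token_refresh
"""
record = compact_result("python3 -m pytest -v", 1, raw_output)
print(record)
```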

Route by economics, not habit.

The vLLM Semantic Router sends each step where it belongs: mechanical edits and log digestion to self-hosted models at marginal cost; architectural reasoning to frontier models when capability actually pays for itself. Cost, privacy, and capability are routing inputs — not vibes.
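The routing idea can be sketched as a plain rule table. This toy is not the vLLM Semantic Router's API or its classification logic; the step kinds, the privacy flag, and the target names are hypothetical and stand in for whatever signals a real router uses.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str                   # e.g. "edit", "log_digest", "architecture"
    touches_private_code: bool  # hypothetical privacy signal


MECHANICAL_KINDS = {"edit", "log_digest"}


def route(step: Step) -> str:
    """Pick a target using privacy first, then cost, then capability."""
    if step.touches_private_code:
        return "self_hosted"    # privacy constraint overrides everything else
    if step.kind in MECHANICAL_KINDS:
        return "self_hosted"    # cheap, mechanical work stays on marginal-cost models
    return "frontier"           # reasoning-heavy work goes where capability pays off


for step in (Step("edit", False), Step("architecture", False), Step("architecture", True)):
    print(step.kind, step.touches_private_code, "->", route(step))
```

Putting the privacy check first is a design choice in this sketch: a step that must stay in-house is never sent out, even when a frontier model would handle it better.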

“Great to see Inferoa, a community agent harness built on the vLLM stack — it designs the agent loop around the mechanics and cost of inference, with prefix-cache discipline, context optimization, and routing across self-hosted and frontier models.”

— vLLM

Try it for real.

The simulation above is illustrative. The harness is real:

# install

$ npm install -g inferoa

# configure engine, router, cache

$ inferoa setup

# run on a long-horizon task

$ inferoa
