Inferoa is an inference-native agent harness for long-horizon coding work. It treats the mechanics and cost of inference — prefix caches, context shape, model routing — as the design constraint, not an afterthought. Below: a turn-by-turn simulation of the same coding task run through a naive harness vs. Inferoa.
A simulation is legible; receipts are credible. We ran the real stack in isolated islo.dev sandboxes and kept the artifacts. Everything below actually executed.
Real Inferoa v0.11.0 (npm) drove a real vLLM v0.22.1 server (Qwen2.5-0.5B, CPU). vLLM's own Prometheus counters after the run: prefix_cache_hits_total 1,611,008 of queries_total 1,647,574, a 97.8% hit rate. Note: the announcement's 90.0% is a cached-token discount (pricing); ours is the measured hit rate. The two are related but distinct metrics, both driven by the same byte-stable prefixes.
An agent in an isolated sandbox was handed a repo with a genuinely failing test (tz-naive vs tz-aware datetime) and a task prompt that named the suspected cause. It produced the 2-line fix, re-ran pytest to green, and the change is merged on main. Verbatim before/after output is embedded in step 7 below. What's proven: the sandboxed execute-verify-publish workflow — not unguided agent diagnosis.
Sandbox A serves vLLM on :8000; islo share exposed it at a public *.share.islo.dev URL; sandbox B's Inferoa pointed its base_url there. Inferoa's event log records the proof: provider_id: vllm:openai_compatible:https://…share.islo.dev/v1, prompt_tokens: 16,829 per turn, with stable prompt/tool-schema hashes. Cache discipline, visible.
The 0.5B CPU model proves the mechanics (caching, routing, harness loop), not frontier-grade coding — it made a real but adorably confused tool call. Sim pricing is illustrative and published in the README. Share URLs expire in 24h; the repos and PR are permanent.
Every block below is captured output from the actual run on 2026-06-10 — three isolated islo.dev sandboxes, real vLLM, real Inferoa, real metrics. Nothing here is mocked. Two pipelines, two results:
islo CLI creates three isolated sandboxes on islo.dev · no GPU, no devops, one command each
One CLI command turns the official vLLM CPU image into a running sandbox. Note the engine config: enable_prefix_caching=True.
$ islo use vllm-cpu-1781079422 -i docker.io/vllm/vllm-openai-cpu:latest-x86_64 \
    --cpu 6 --memory 12288 --disk 30
✓ Sandbox 'vllm-cpu-1781079422' created
$ vllm serve Qwen/Qwen2.5-0.5B-Instruct --dtype bfloat16 --max-model-len 32768
(EngineCore) INFO core.py:112 Initializing a V1 LLM engine (v0.22.1) with config: model='Qwen/Qwen2.5-0.5B-Instruct', device_config=cpu, enable_prefix_caching=True …
(Worker) INFO Time spent downloading weights: 10.27 s
(Worker) INFO Loading weights took 0.27 seconds
One more command gives the sandbox port a public HTTPS URL on the islo.dev domain.
$ islo share vllm-cpu-1781079422 8000 --ttl 24h
✓ Share created for vllm-cpu-1781079422:8000
URL: https://itc807adoyzwz6hzj0rkmcp7m.ca.share.islo.dev
Expires: in 1d 0h
A completion through the public URL. Asked to echo “INFERENCE OK”, the 0.5B model replied “INFORMATION OK”. A mock would have echoed the string exactly, so the mismatch is the proof that a real model answered.
$ curl https://itc807….ca.share.islo.dev/v1/chat/completions -d '{
    "model":"Qwen/Qwen2.5-0.5B-Instruct",
    "messages":[{"role":"user","content":"Reply with exactly: INFERENCE OK"}]}'
reply: INFORMATION OK
usage: {'prompt_tokens': 36, 'total_tokens': 40, 'completion_tokens': 4}
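The same request can be sketched in Python with only the standard library. This is an illustration under assumptions, not part of the captured run: `BASE_URL` is a placeholder (the share URL is elided above and expires after 24h), and `build_chat_request` and `extract_reply` are hypothetical helper names. The payload and response shapes follow the OpenAI-compatible chat completions format that vLLM serves.

```python
import json
import urllib.request

# Placeholder (assumption): substitute the URL printed by `islo share`, plus /v1.
BASE_URL = "https://example.invalid/v1"


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build, but do not send, an OpenAI-compatible chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: bytes):
    """Return (assistant text, usage dict) from a chat completion response body."""
    body = json.loads(response_body)
    return body["choices"][0]["message"]["content"], body.get("usage", {})


req = build_chat_request(
    BASE_URL, "Qwen/Qwen2.5-0.5B-Instruct", "Reply with exactly: INFERENCE OK"
)

# Sending is left commented out because BASE_URL is a placeholder:
# with urllib.request.urlopen(req) as resp:
#     reply, usage = extract_reply(resp.read())
```

Comparing `reply` against the requested string is then a one-line check, which is how a mismatch like the one above would surface in an automated smoke test.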
Second sandbox: install the actual npm package and aim its engine config at the share URL. Inferoa's default provider is literally vLLM.
$ npm install -g inferoa   # inferoa@0.11.0
$ cat ~/.inferoa/config.yaml
model_setup:
  mode: direct
  provider: vllm
  model: Qwen/Qwen2.5-0.5B-Instruct
  base_url: https://itc807….ca.share.islo.dev/v1
$ inferoa debug status --json | jq .endpoint_signals.errors
[]
Inferoa's own event log records each model call: the islo.dev provider URL, a 16,829-token prompt, and stable prompt/tool-schema hashes — the cache-discipline machinery, visible. The tiny model then made a real but confused tool call (it tried to invoke is_palindrome instead of writing it). Honest limit: 0.5B proves mechanics, not coding skill.
$ inferoa --print "Write a Python function is_palindrome(s)…"
$ inferoa debug events s_6b9da06c…
type: model.request.started
  provider_id: vllm:openai_compatible:https://itc807….ca.share.islo.dev/v1
  prompt_hash: 990301121c…
  tool_schema_hash: a381b770ce…
type: model.response.settled
  usage: { prompt_tokens: 16829, completion_tokens: 30 }
  tool_calls: [ { name: is_palindrome, arguments: { s: "A man, a plan, a canal, Panama" } } ]
Not our numbers — vLLM's Prometheus counters after the session. 1,611,008 of 1,647,574 prompt tokens hit the prefix cache: a 97.8% hit rate. (The announcement's 90.0% is a cached-token discount — a pricing metric; this is the hit rate. Same mechanism: byte-stable prefixes.)
$ curl localhost:8000/metrics | grep prefix_cache
vllm:prefix_cache_queries_total{model="Qwen/Qwen2.5-0.5B-Instruct"} 1,647,574
vllm:prefix_cache_hits_total{model="Qwen/Qwen2.5-0.5B-Instruct"} 1,611,008
# hit rate = 97.8%
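The hit rate is simply hits divided by queries. A minimal sketch of that arithmetic, parsing the two counter lines exactly as displayed above; `read_counter` is a hypothetical helper, and the comma handling is an assumption made to match this capture (a raw Prometheus scrape may print the values in a different numeric form).

```python
import re

# The two counter lines as displayed in the captured output above.
METRICS_TEXT = """\
vllm:prefix_cache_queries_total{model="Qwen/Qwen2.5-0.5B-Instruct"} 1,647,574
vllm:prefix_cache_hits_total{model="Qwen/Qwen2.5-0.5B-Instruct"} 1,611,008
"""


def read_counter(text: str, name: str) -> int:
    """Return the integer value of the named counter, ignoring its label set."""
    pattern = re.compile(
        "^" + re.escape(name) + r"(?:\{[^}]*\})?\s+([\d,]+)\s*$",
        re.MULTILINE,
    )
    match = pattern.search(text)
    if match is None:
        raise KeyError(name)
    # Strip the thousands separators shown in the capture before converting.
    return int(match.group(1).replace(",", ""))


queries = read_counter(METRICS_TEXT, "vllm:prefix_cache_queries_total")
hits = read_counter(METRICS_TEXT, "vllm:prefix_cache_hits_total")
hit_rate = hits / queries

print(f"hit rate = {hit_rate:.1%}")  # prints "hit rate = 97.8%"
```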
Separately, an agent in a third sandbox was handed a repo with a genuinely failing test and the suspected cause. It produced the fix, re-ran pytest to green, and the change is merged on main (ffda3d7). Before / fix / after:
$ python3 -m pytest -v
FAILED tests/test_token_refresh.py::test_token_refresh
E   TypeError: can't compare offset-naive and offset-aware datetimes

--- a/auth_service/token.py
+++ b/auth_service/token.py
- now = datetime.utcnow()
+ now = datetime.now(timezone.utc)

$ python3 -m pytest -v
tests/test_token_refresh.py::test_token_refresh PASSED
tests/test_token_refresh.py::test_expired_token_rejected PASSED
============== 2 passed ==============
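The failure mode behind that diff can be reproduced in a few lines. This sketch is not the repo's code: `expires_at` and both helper functions are hypothetical stand-ins, and the naive "now" is built with `.replace(tzinfo=None)`, which yields the same kind of tz-naive UTC value that `datetime.utcnow()` returns.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical token expiry, stored as a tz-aware UTC datetime.
expires_at = datetime.now(timezone.utc) + timedelta(hours=1)


def is_expired_naive(expiry: datetime) -> bool:
    """Pre-fix behavior: 'now' carries no tzinfo, like datetime.utcnow()."""
    now = datetime.now(timezone.utc).replace(tzinfo=None)
    return now >= expiry  # raises TypeError when expiry is tz-aware


def is_expired_aware(expiry: datetime) -> bool:
    """Post-fix behavior: both sides of the comparison are tz-aware."""
    now = datetime.now(timezone.utc)
    return now >= expiry


try:
    is_expired_naive(expires_at)
    naive_error = None
except TypeError as exc:
    naive_error = str(exc)

print(naive_error)
print(is_expired_aware(expires_at))
```

Python refuses to order a naive datetime against an aware one, so the fix is to make both operands aware rather than to strip the timezone from the stored expiry.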
Each mechanism compounds with the others. The discount isn't one trick — it's the product of treating inference as a first-class engineering surface.
The harness keeps prompt prefixes byte-stable across turns — system prompt, tool schemas, repo context — so vLLM's prefix cache (and frontier cache pricing) bills repeated context at the cached rate instead of full price.
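One way to picture byte-stable prefixes: serialize everything that does not change between turns in a canonical form, put it first, and append only the volatile turn content after it. The sketch below is an assumption about how such discipline could be implemented, not Inferoa's actual code; the function names, the sample tool schemas, and the use of SHA-256 for the hash are all illustrative.

```python
import hashlib
import json


def canonical_json(obj) -> str:
    """Sorted keys and fixed separators, so equal data always gives equal text."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def build_prefix(system_prompt: str, tool_schemas: list, repo_context: str) -> bytes:
    """Stable part of the prompt: system prompt, tool schemas, repo context."""
    tools = sorted(tool_schemas, key=lambda tool: tool["name"])
    return "\n".join([system_prompt, canonical_json(tools), repo_context]).encode("utf-8")


def short_hash(data: bytes) -> str:
    """Truncated SHA-256, handy for logging whether the prefix changed."""
    return hashlib.sha256(data).hexdigest()[:10]


def build_prompt(prefix: bytes, turns: list) -> bytes:
    """Volatile turn content is only ever appended after the stable prefix."""
    return prefix + b"\n" + "\n".join(turns).encode("utf-8")


SYSTEM = "You are a coding agent."
REPO = "repo: auth_service"
# Same tools, different list order and key order (hypothetical schemas).
TOOLS_A = [{"name": "read_file", "parameters": {"path": "string"}},
           {"name": "run_tests", "parameters": {}}]
TOOLS_B = [{"parameters": {}, "name": "run_tests"},
           {"parameters": {"path": "string"}, "name": "read_file"}]

prefix_a = build_prefix(SYSTEM, TOOLS_A, REPO)
prefix_b = build_prefix(SYSTEM, TOOLS_B, REPO)
turn_1 = build_prompt(prefix_a, ["user: fix the failing test"])
turn_2 = build_prompt(prefix_b, ["user: fix the failing test",
                                 "tool: 1 failed",
                                 "assistant: patching token.py"])

print(short_hash(prefix_a) == short_hash(prefix_b))
print(turn_2.startswith(turn_1))
```

Because turn 2 begins with the exact bytes of turn 1, a prefix cache can reuse the work already done for the earlier prompt; a reordered key or a timestamp near the top of the prompt would break that match.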
Instead of pasting whole files, the agent carries a graph-shaped slice of the repository — symbols, edges, and the spans that matter for the task. Same signal, a fifth of the tokens.
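A graph-shaped slice can be pictured as a bounded walk outward from the symbols the task touches. The sketch below is illustrative only: the `EDGES` graph, the symbol names other than `test_token_refresh`, and the depth limit are hypothetical, and Inferoa's real selection logic is not shown in the source.

```python
from collections import deque

# Hypothetical symbol graph: each symbol maps to the symbols it references.
EDGES = {
    "test_token_refresh": ["refresh_token"],
    "refresh_token": ["is_expired", "issue_token"],
    "is_expired": [],
    "issue_token": ["sign"],
    "sign": [],
    "render_homepage": ["load_template"],
    "load_template": [],
}


def slice_graph(edges: dict, seeds: list, max_depth: int) -> list:
    """Breadth-first walk from the seeds, keeping symbols within max_depth hops."""
    depth_of = {}
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        symbol, depth = queue.popleft()
        if symbol in depth_of or depth > max_depth:
            continue
        depth_of[symbol] = depth
        for neighbor in edges.get(symbol, []):
            queue.append((neighbor, depth + 1))
    # Nearest symbols first, ties broken alphabetically for a stable order.
    return sorted(depth_of, key=lambda s: (depth_of[s], s))


context_slice = slice_graph(EDGES, ["test_token_refresh"], max_depth=2)
print(context_slice)
# ['test_token_refresh', 'refresh_token', 'is_expired', 'issue_token']
```

Unrelated code such as `render_homepage` never enters the slice, and the deterministic ordering keeps the resulting context stable from turn to turn.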
Raw tool output is the silent token killer. RTK records compact, structured command results — exit codes, deltas, the failing assertion — not 400 lines of test runner noise.
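To make the idea concrete, here is a rough sketch of reducing test runner output to a compact record. It is not RTK's format or API, which the source does not show; `CompactResult` and `compact_pytest` are hypothetical names, and the parsing assumes pytest output shaped like the lines quoted on this page.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CompactResult:
    exit_code: int
    passed: int = 0
    failed: list = field(default_factory=list)
    errors: list = field(default_factory=list)


def compact_pytest(exit_code: int, raw: str) -> CompactResult:
    """Keep the exit code, failing test ids, and assertion lines; drop the rest."""
    result = CompactResult(exit_code=exit_code)
    for line in raw.splitlines():
        line = line.strip()
        if line.startswith("FAILED "):
            result.failed.append(line.split()[1])
        elif line.startswith("E "):
            result.errors.append(line[1:].strip())
        elif re.search(r"\sPASSED\b", line):
            result.passed += 1
    return result


RAW = """\
============================= test session starts =============================
collected 2 items
tests/test_token_refresh.py::test_expired_token_rejected PASSED
E   TypeError: can't compare offset-naive and offset-aware datetimes
FAILED tests/test_token_refresh.py::test_token_refresh
"""

summary = compact_pytest(1, RAW)
print(summary)
```

The record that reaches the model is a handful of fields rather than the full log, and because it is structured, it serializes the same way every time.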
The vLLM Semantic Router sends each step where it belongs: mechanical edits and log digestion to self-hosted models at marginal cost; architectural reasoning to frontier models when capability actually pays for itself. Cost, privacy, and capability are routing inputs — not vibes.
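As a rough mental model, routing is a function from step attributes to a target. The sketch below is a simple rule-based stand-in, not the vLLM Semantic Router's classification logic or API; the `Step` fields, the step kinds, and the rule that privacy overrides everything else are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    SELF_HOSTED = "self_hosted"
    FRONTIER = "frontier"


@dataclass(frozen=True)
class Step:
    kind: str                    # e.g. "mechanical_edit", "log_digest", "architecture"
    contains_private_code: bool  # privacy signal (hypothetical field)


# Step kinds assumed cheap enough for a self-hosted model.
MECHANICAL_KINDS = {"mechanical_edit", "log_digest"}


def route(step: Step) -> Target:
    """Privacy first, then cost; everything else goes to the frontier model."""
    if step.contains_private_code:
        return Target.SELF_HOSTED
    if step.kind in MECHANICAL_KINDS:
        return Target.SELF_HOSTED
    return Target.FRONTIER


plan = [
    Step("log_digest", contains_private_code=False),
    Step("architecture", contains_private_code=False),
    Step("architecture", contains_private_code=True),
]
targets = [route(step).value for step in plan]
print(targets)  # ['self_hosted', 'frontier', 'self_hosted']
```

Writing the policy as explicit inputs and rules is what makes a routing decision auditable after the fact, whatever classifier ultimately produces those inputs.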
“Great to see Inferoa, a community agent harness built on the vLLM stack — it designs the agent loop around the mechanics and cost of inference, with prefix-cache discipline, context optimization, and routing across self-hosted and frontier models.”
The simulation above is illustrative. The harness is real: