The Stochastic CPU.
A field report on the cloud that's being built underneath Software 3.0. From the agent hour, to the cache keyed on the prompt, to the three signatures on every 2030 merge.
In one sentence, then eight beats.
The LLM is the first nondeterministic computer in production, and every layer of the cloud above it has to be rebuilt.
- Frontier quality answers cost more than 1,000× less than they did in 2023.
- Coding agents on SWE-bench Verified went from 14% to 95% in 27 months, crossing the 85% human baseline in Q1 2026.
- Agents now coauthor 12.5% of merged PRs at Shopify and (!) about 75% of new code at Google, per Pichai in Q1 2026.
- The agent hour becomes a billing line item at all three hyperscalers by 2027.
- By 2030 the AI native cloud bill splits roughly 40 / 25 / 15 / 10 / 5 across inference, agent runtime, snapshots, eval, and legacy compute.
- Regulated merges carry three signatures by 2030: human, agent, platform.
- The direction is clear, the dates are not, and the substrate is being poured underneath teams that still think they ship software.
- This may be the most consequential rewrite of computing in 60 years.
Six numbers, before the prose.
Frontier quality answers cost a thousand times less than they did three years ago.
Coding agents went from undergrad clumsy to better than most engineers in 27 months.
What used to be a 46 second build becomes a 3 second one when the cache hits.
One in eight of Shopify's merged code changes is now coauthored by an AI agent.
By 2030, regulated code carries proof of who asked, who wrote, and where it ran.
Inference, agent time, and cache will eat four fifths of an AI native company's cloud bill.
01 Prelude ¶
The way we make software changed three times in a decade. We've only named two of them.
Software 1.0 was the code a human typed for a machine. Software 2.0 was the weights a dataset taught a machine to hold. Karpathy named those in 2017.1 Software 3.0 is the prompt. A program written in English, running on a model, running on a fleet. He named it in 2025. The eight year gap is the tell.
1.0 reshaped how code got written. 2.0 reshaped how programs got learned. 3.0 reshapes where they run. The wiring underneath always arrives after the workload that needs it. The workload is here. Coding agents that scored under 14% on SWE-bench Verified in early 2024 are clearing 95% by June 2026.2 Inference cost at equivalent quality has fallen more than a thousandfold across the same window.3 Two curves moving in opposite directions at comparable speed, and where they cross is where the substrate sits.
We are about to stop deploying programs and start deploying invocations. The cloud, every layer of it, from the load balancer to the line item, was not designed for that.
This essay is the substrate, looked at from underneath.
02 Software 3.0, plainly ¶
If the substrate is the unnamed shift, the first job is to name what's sitting on top of it. Three sentences carry the whole idea.
Your prompts are programs. They have inputs and outputs, side effects, version drift, a runtime cost per call. They get checked into repos, reviewed, A/B tested. The English surface is a UX choice. Underneath, the semantics are program semantics, and the runtime treats them that way.
The LLM is a computer.4 Karpathy's sense of the word. An instruction stream (the prompt), a memory hierarchy (context, KV cache, retrieval), a clock that ticks in tokens. Kind of like a CPU, except this one will insist that 9.11 is greater than 9.9. Every architectural property of the old computer needs a new answer on this one. Scheduling, caching, isolation, billing, even what it means for a program to terminate.
The autonomy slider exists. The same model can be a tab complete, a copilot, a background agent, or a fleet. The slider is continuous. The cost curve is not. A keystroke is cents. An agent hour is dollars. A thousand parallel agents is the bill.
A working test for whether you're in 3.0. Did you author a prompt, did a model execute it, did the output land as code or a ticket or a refund? Three yeses and you're in.
All of this sits on top of a training compute curve that has grown about fivefold per year since 2020, and is on track to cross 10²⁷ FLOPs between 2027 and 2028.5 Four orders of magnitude in six years on the training side. A thousandfold drop on the inference side. The substrate is the place those two curves land.
Read the three sentences again. Each names the model and then spends the rest of its breath on the plumbing.
02b The Stochastic CPU ¶
So if the LLM is a computer, what kind? The honest answer is the one nobody puts on a slide. It's the first computer in production that isn't deterministic.
A classical CPU is a deterministic function. Same input bytes, same instruction stream, same output bytes. Yesterday, today, on the rack next to it, on the one in Dublin. Cache hits because the hash of the input maps to the hash of the output. Tests pass or fail because the answer is fixed. Reproducible builds reproduce. Blame walks cleanly from a stack trace back to a line. Contracts hold because both sides compute the same thing from the same bytes. Sixty years of systems engineering (virtual memory, distributed consensus, exactly once delivery, Merkle trees, the whole shape of git) is scaffolding on top of f(x) = y, always.
The LLM is a computer in Karpathy's sense.16 It is also a computer that does not satisfy that guarantee. Ask the same question twice. You get two answers. Sometimes a synonym, sometimes a renamed variable, sometimes the control flow inverts and the tests still pass. Temperature is the knob you see. Batched matmul reductions over nonassociative floating point are the ones you don't. Even at temperature=0 the output drifts, because greedy decoding over a batched GPU kernel is not bit exact when the batch composition changes.17
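The nonassociativity claim is easy to check without a GPU. A minimal sketch in plain Python, with values picked purely for illustration: the same numbers added in a different grouping produce different results, which is the effect a changed reduction order inside a batched kernel can trigger.

```python
# Floating point addition is not associative: the grouping changes the low bits.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False

# Same four values, two reduction orders (illustrative values, not from a real kernel).
values = [1e16, 1.0, -1e16, 1.0]

# Strict left-to-right: the first 1.0 is absorbed by 1e16 and lost.
sequential = ((values[0] + values[1]) + values[2]) + values[3]

# Pairwise grouping: the large terms cancel first, so both 1.0s survive.
pairwise = (values[0] + values[2]) + (values[1] + values[3])

print(sequential, pairwise)  # 1.0 2.0
```

A greedy decoder only needs two logits to be close enough for a difference like this to flip the argmax, and from that token on the outputs diverge.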
Then the model itself moves underneath you. A snapshot rolls. A safety fine tune lands on Tuesday. The provider rotates a server side seed you were never told existed. The function you were calling last week is, literally, not the function you are calling now.
Regenerate a 40 line Python file twice and diff the bytes. (I tried.) The first version writes for i, row in enumerate(rows):, the second writes for idx, row in enumerate(rows):. Same behavior, different SHA. Your content addressed cache, your code review tooling, your "did anything change" check, all of them think work happened. None did.
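The rename case can be reproduced in a few lines. This is a sketch under stated assumptions: the `total` function and its body are hypothetical stand-ins for the regenerated file, and only the loop variable differs between the two versions.

```python
import hashlib

# Two "regenerations" of the same function; only the loop variable name differs.
VERSION_A = """\
def total(rows):
    out = 0
    for i, row in enumerate(rows):
        out += i * row
    return out
"""

VERSION_B = """\
def total(rows):
    out = 0
    for idx, row in enumerate(rows):
        out += idx * row
    return out
"""


def sha(source: str) -> str:
    """Content hash of the source bytes, as a byte-keyed cache would compute it."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def run(source: str, rows: list) -> int:
    """Execute the source in a fresh namespace and call its total()."""
    namespace = {}
    exec(source, namespace)
    return namespace["total"](rows)


same_hash = sha(VERSION_A) == sha(VERSION_B)
same_behavior = run(VERSION_A, [3, 4, 5]) == run(VERSION_B, [3, 4, 5])
print(same_hash, same_behavior)  # False True
```

Anything keyed on the hash sees a change. Anything keyed on behavior does not.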
f(x) = y, always. The new CPU answers the same question twice and returns two answers.
A faster CPU would have fit on the old substrate. A stochastic one doesn't, because every layer above the CPU was built on the property the new one no longer provides. I think this is the most important architectural shift since the move from a single machine to distributed compute, and it is happening underneath teams that still think they are shipping software.
03 What broke ¶
Take §02b as given. The CPU is stochastic. Now walk the layers above it and ask what broke.
A container is a frozen filesystem with a process boundary. It was designed for a workload that boots, serves a request, scales out, and dies. Agents are the opposite shape. They wake up, think for ninety seconds, write a file, sleep, wake up again with a different prompt and a different model. Five things break.
Cold start. A Python image pulls in seconds. A CUDA image with a 70B model pulls in minutes. On a web server, cold start is the p99. On an agent fleet, every wakeup is a cold start, so the p99 becomes the p50.
State. A container is stateless by convention. An agent has a scratchpad, a working tree, a tool history, and a context window worth keeping. Snapshotting that across a fleet of ephemeral runners is a new primitive. Modal, E2B, and a half dozen startups now charge for it at a documented 3× premium over their own serverless tier.6
Pricing. A container bills CPU seconds. An agent bills tokens, and tokens are 95% of the line. A 2 vCPU sandbox costs about ten cents an hour on Modal or E2B. The same hour of inference, in a heavy loop, runs roughly $5 at Haiku rates, $30 at Sonnet, $130 at Opus. Two to three orders of magnitude over the box, on the same invoice.
CI. A boolean assertion over a stochastic function is a coin flip dressed as CI. The gate becomes an eval. A statistical regression with a budget and a confidence interval, run on its own pool. Green and red give way to a distribution.
Cache. Content hash caches assume same input means same output bytes. Agents regenerate semantically equivalent files with different bytes constantly, so Bazel, Buck, and Nix all miss when they shouldn't. A smarter hash doesn't help. The bytes really are different. The fix is a second cache layer that keys on the prompt prefix, sitting above the byte layer. See §05.
None of this is hypothetical. Shopify reports one in eight merged PRs is coauthored by its agent, River. 3,536 merged in a 30 day window.8 Sundar Pichai walked the number up the curve across four earnings calls. 25% in Q3 2024, 30% in Q1 2025, around 50% by fall 2025, around 75% by Q1 2026.9 The Shopify number is the more honest one, because it counts what got merged, not what got typed.
The workload that broke the container is already the dominant workload at the largest engineering orgs in the world. The infrastructure has not caught up.
03b The sandbox. ¶
Compute has fractured its unit four times. The fourth fracture is the one that matters here.
Mainframes treated the machine as the unit. Many humans queued for one box, time sliced. Servers cut the unit down to the process: one daemon per service, scaled horizontally across cheap hardware. Containers cut it again, down to the request: ephemeral, stateless, immortal in aggregate but disposable individually. Each fracture made the unit smaller and, oddly, more stateful in what surrounded it. The sandbox is the fourth fracture. The unit is no longer the machine, the process, or the request. The unit is the task.
A task thinks for thirty minutes. It sleeps for six hours. It wakes up mid sentence. It has an identity that survives across the gap. The previous three units could not do this, and were never asked to.
A sandbox is what compute looks like when the unit is a task instead of a request.
Each prior unit carried an assumption that the sandbox quietly breaks. The VM assumed work was a server: long lived, network attached, identified by an address. The container assumed work was a request: short, stateless, addressed by a load balancer that did not care which copy answered. The function assumed work was a transaction: a few hundred milliseconds, no memory of yesterday, no expectation of tomorrow. The sandbox assumes work is a train of thought. It can be interrupted. It can be resumed. It must be the same train when it comes back, or the work is wasted.
This changes what the lifecycle paragraph is even about. Cold, warm, idle, paused, archived, evicted are no longer engineering states of a box. They are verbs of a thinking thing. Cold means the thought has not started. Warm means it is loaded and ready to continue. Idle is the pause between sentences. Paused is sleep. Archived is long term memory, retrievable but slow. Evicted is forgetting. (I realize this sounds a little overwrought, but the API surface really does look like this once you stop pretending it is a container.) The interesting numbers are quiet ones: Firecracker cold boots in roughly 125 ms, snapshot resumes land in 5 to 30 ms, and Modal restores a loaded PyTorch process in about a second. Those are not infrastructure benchmarks. They are the latency of waking up.
A small supply chain has formed around this primitive. Modal sells the warm start path for Python and ML workloads. E2B sells sandboxes tuned for code writing agents, with paused state held for thirty days. Daytona sells developer environments that auto archive after seven days of silence. Tensorlake sells sandboxed document and data work for retrieval pipelines. Fly Sprites sells fast sandboxes welded to a global edge. islo.dev (which I work on, so take the placement with appropriate salt) sells the runtime as a metered hour rather than a server. Six vendors, one shared bet: that the task, not the request, is the thing worth selling.
It is worth saying plainly what this is. For sixty years compute has been a substrate that humans rented to run programs they wrote. The sandbox is the first substrate designed to be rented by something that is itself thinking. That is a civilizational shift in who the customer of compute actually is, and it has happened in roughly eighteen months.
Once the unit of compute is cognitive, the unit of billing has to be cognitive too. That is the agent hour.
03b The sandbox. ¶
Containers broke. What replaced them is a primitive the cloud did not have a name for in 2023. Every vendor in this section now sells a flavor of it.
A sandbox is a microVM that boots in tens of milliseconds, captures its full memory and filesystem in a snapshot, resumes from that snapshot anywhere on the fleet, and bills per second of wall clock time.
Mechanically it is a hardware isolated guest under a tiny userspace VMM. Firecracker is the canonical VMM, written in Rust by AWS, with a ~120 host syscall surface and a ~5 MiB memory floor per VM (per Firecracker docs, 2026). Cold boot is around 125 ms to userspace, snapshot restore is 5 to 30 ms, and a single host can launch up to 150 microVMs per second (per Firecracker docs, 2026). Kata Containers wraps the same idea in a CRI shim so Kubernetes can schedule it. gVisor takes the cheaper road, intercepting syscalls in a Go process called Sentry, trading 10 to 30% runtime overhead for no kernel boot and no nested virtualization requirement (per gVisor docs, 2026). Cloud Hypervisor is the rising alternative when GPU passthrough or userfaultfd lazy paging matters. Underneath the microVM options sit VT-x, EPT, and the IOMMU. A container escape is a kernel bug in one shared kernel. A microVM escape has to traverse a guest kernel, a tiny VMM, and the IOMMU before it touches the host. That is the actual security boundary.
It is the category between a container and a VM. A container is a chroot with cgroups. No kernel boundary, single digit millisecond start, no real memory snapshot. A VM is the opposite. Full hardware boundary, minute scale boot, designed to hum for months. The sandbox sits in the middle on every axis and was never the design center of either. It is what you build when the workload thinks for thirty minutes, sleeps for six hours, and resumes mid thought somewhere else on the fleet.
What it replaces is three assumptions, not one. The container assumed a request that returns, so its state model is "throw it away." That fails the moment an agent has a working tree, a tool history, and a context window worth keeping. The VM assumed a workload that hums for months, so its boot budget is generous and its cost model is committed instances. That fails the moment you need a fresh isolated environment per agent attempt, fanned out a hundred wide. The serverless function assumed a request that returns in seconds with no state between calls. That fails the moment the agent loop runs for half an hour and the cache prefix matters more than the cold start. The sandbox keeps the isolation of the VM, the start time of the function, and adds a snapshot primitive neither had.
The lifecycle is the thing most teams have not internalized yet. A sandbox is not running or off. It walks through six states (cold, warm, idle, paused, archived, evicted), and what survives each transition is different.
Note what does and doesn't survive. The filesystem lives through cold to warm. Memory lives through warm to paused and back, including loaded variables, tmux sessions, and open files. Outbound TCP connections die on pause and have to be reopened by the client (per E2B and Modal docs, 2026). On archive the memory is dropped and only the filesystem persists. On eviction nothing survives. The implication for agent runtimes is that any in flight HTTP call held by the agent at pause time has to be re driven by a durable layer above the sandbox. The sandbox is the checkpoint. The event log is the truth.
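The survival rules above fit in a small lookup table. A minimal sketch, not any vendor's API: the state names come from the text, the `RETAINED` table and `lost_on` helper are hypothetical, and treating idle as retaining everything warm retains is an assumption the text does not spell out.

```python
from enum import Enum


class State(Enum):
    COLD = "cold"
    WARM = "warm"
    IDLE = "idle"
    PAUSED = "paused"
    ARCHIVED = "archived"
    EVICTED = "evicted"


FS, MEM, TCP = "filesystem", "memory", "outbound_tcp"

# What a sandbox still holds while it sits in each state.
RETAINED = {
    State.COLD: {FS},
    State.WARM: {FS, MEM, TCP},
    State.IDLE: {FS, MEM, TCP},   # assumption: idle keeps what warm keeps
    State.PAUSED: {FS, MEM},      # outbound TCP connections die on pause
    State.ARCHIVED: {FS},         # memory is dropped, filesystem persists
    State.EVICTED: set(),         # nothing survives
}


def lost_on(src: State, dst: State) -> set:
    """Resources held in `src` that are gone once the sandbox reaches `dst`."""
    return RETAINED[src] - RETAINED[dst]


print(sorted(lost_on(State.WARM, State.PAUSED)))      # ['outbound_tcp']
print(sorted(lost_on(State.PAUSED, State.ARCHIVED)))  # ['memory']
```

Whatever `lost_on` returns for a transition is what the durable layer above the sandbox has to re-drive or rebuild.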
The supply chain in 2026 looks roughly like this. Modal sells raw serverless compute and a sandbox SKU at a 3× premium over its own standard tier ($0.142 per physical core hour, per Modal pricing, 2026). E2B sells Firecracker microVMs you can pause and resume by name, around $0.165 per hour for a 2 vCPU 4 GiB box, capped at 100 concurrent on Pro (per E2B pricing, 2026). Daytona sells sub 90 ms code to execution on a pre warmed Docker pool, with hard idle and archive defaults built in (per Daytona docs, 2026). Tensorlake sells the long sandbox: pause and resume on the same microVM, with running processes and tmux sessions intact for weeks. Fly Sprites are persistent Firecracker microVMs with 100 GB NVMe and sub second checkpoint restore. islo.dev sells the agent computer plus persistence plus a secrets gateway that injects tokens at egress so the agent never holds them, deployable on Islo cloud, your cloud, or your infra. (I run islo.dev, so weight that accordingly.) Cursor and Cognition build their own runtimes on top of these primitives because their entire product thesis is the sandbox UX. Build versus buy is a real decision now, and the answer for almost everyone is buy, because the orchestrator (snapshot storage, pause and resume, networking, egress, auth, billing, fleet health) is where the six to twelve months of work lives.
One unit ties all of these vendors together. They meter per second of wall clock, not per request. Modal's sandbox SKU is priced in core seconds and gibibyte seconds. E2B and Daytona quote vCPU hours that decompose to seconds. Fly Sprites pause the clock when the agent is idle. islo bills the same way. The sandbox second is the new unit of cloud compute, and the next section is about what happens when you aggregate it into an hour and put that hour on an invoice.6
04 The agent hour ¶
If containers don't fit and agents are real workloads, the billing unit has to move. It's moving now. Read the billing pages.
Keynotes sell models. SKUs reveal which abstractions a provider has committed compute, finance, and oncall to. Anthropic announced a billing split on May 14 2026 that would have broken out agent calls as a separate SKU from raw inference, then paused the change on June 15 after subscriber pushback. The pause matters less than the announcement. The SKU was drafted, priced, and put on the calendar. Modal added a sandbox tier at $0.1419 per physical core hour, roughly 3× its standard serverless rate, and called it the sandbox tax in the docs.6 Cursor's background agent has its own SKU. E2B sandboxes meter at about $0.05 per vCPU hour. The pattern is the same in every case. A unit of billing that did not exist in 2023 now has a price.
The unit is the agent hour. By 2027 all three hyperscalers will have an agent hour SKU, and the one that doesn't will be reselling someone else's.
The economics work because the substrate underneath is finally cheap enough. The agent loop that issues hundreds of LLM round trips per task only closes because inference at equivalent quality has fallen more than a thousandfold since 2023.3 Frontier prices stay flat around $15 per million output tokens. The floor at equivalent quality has collapsed underneath it.
The other half of solvency is that the work is worth doing. Devin 1.0 cleared 13.86% on SWE-bench Verified in March 2024.10 Claude 3.5 Sonnet (new) reached 49% in Q4 2024.12 Sonnet 3.7 hit 63% in Q1 2025, Opus 4 hit 72% in Q2 2025, Sonnet 4.5 reached 77.2% in September 2025, Opus 4.5 became the first model over 80% in November 2025, Opus 4.7 hit 87.6% in April 2026, and Fable 5 sits at 95.0% as of June 2026.2 Sometime in Q1 2026 the line crossed the human baseline of about 85%. That crossing is what turned SW 3.0 from a research problem into a deployment problem.
Walk the math for one agent hour at mid 2026 prices. A background loop issues 200 to 600 round trips per hour, with contexts of 30K to 150K tokens (80% to 95% cached) and 2K to 8K output per call. At Haiku 4.5 rates that lands between $1 and $5 per hour. At Sonnet 4.5, between $15 and $155. At Opus, $30 to $450. A fully loaded engineer is $80 to $150 an hour. The widget below lets you move the sliders. The dollars are real Anthropic pricing.
The agent is cheap now. The expensive part is making a hundred of them agree on a working tree, a test suite, and a deploy without stepping on each other. The bottleneck moved up the stack, which is, I think, where every interesting infrastructure problem now lives.
[Interactive widget: "Sandbox fleet, live". Prompts enter on the left, the model fans work out to sandboxes on the right, and cached prefixes take the green shortcut. Workload tiers (light, medium, heavy) are modeling assumptions you choose, not measurements. Prices and the 730 hour month are public anchors.]
The math
Cost per agent hour decomposes into four terms. Uncached input tokens pay the full input rate. Cached input tokens pay the cache read rate. Output tokens pay the output rate. The sandbox is a flat add.
cost_per_agent_hour =
input_tokens * (1 - C) * tier.in_rate
+ input_tokens * C * tier.cache_read_rate
+ output_tokens * tier.out_rate
+ sandbox_rate
Work an example. One medium agent on Sonnet at C = 0.60. Medium is 400 calls/hr, 80k input/call, 4k output/call. So input_tokens = 400 * 80,000 = 32,000,000 per hour. Output_tokens = 400 * 4,000 = 1,600,000 per hour.
Sonnet rates are $3 / $15 / $0.30 per million. Plug in.
Uncached input: 32M * 0.40 * $3 / 1M = 12.8 * $3 = $38.40.
Cached input: 32M * 0.60 * $0.30 / 1M = 19.2 * $0.30 = $5.76.
Output: 1.6M * $15 / 1M = $24.00.
Sandbox: $0.10.
Sum: 38.40 + 5.76 + 24.00 + 0.10 = $68.26 per agent hour.
Now drop the cache. Same agent at C = 0. Input collapses to one term: 32M * $3 / 1M = $96.00. Output and sandbox unchanged at $24.10. Total $120.10 per agent hour. The 60% prefix cache saves $51.84 per agent hour, about 43%.
The widget scales linearly. A fleet of 100,000 such agents costs 100,000 * $68.26 = $6,826,000 per hour, which is what the top number reports.
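The worked example above can be checked mechanically. A minimal sketch of the four term formula, using the Sonnet rates and the medium tier quoted in the text. The `Tier` dataclass and the function signature are illustrative names, not part of any pricing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    in_rate: float          # dollars per million uncached input tokens
    cache_read_rate: float  # dollars per million cached input tokens
    out_rate: float         # dollars per million output tokens


SONNET = Tier(in_rate=3.00, cache_read_rate=0.30, out_rate=15.00)


def cost_per_agent_hour(calls_per_hour: int, input_per_call: int,
                        output_per_call: int, cache_hit: float,
                        tier: Tier, sandbox_rate: float = 0.10) -> float:
    """Four terms: uncached input, cached input, output, flat sandbox add."""
    input_m = calls_per_hour * input_per_call / 1_000_000    # millions of tokens
    output_m = calls_per_hour * output_per_call / 1_000_000
    return (input_m * (1 - cache_hit) * tier.in_rate
            + input_m * cache_hit * tier.cache_read_rate
            + output_m * tier.out_rate
            + sandbox_rate)


# Medium tier from the text: 400 calls/hr, 80k input and 4k output per call.
with_cache = cost_per_agent_hour(400, 80_000, 4_000, 0.60, SONNET)
no_cache = cost_per_agent_hour(400, 80_000, 4_000, 0.0, SONNET)

print(f"{with_cache:.2f} {no_cache:.2f}")                      # 68.26 120.10
print(f"saved {no_cache - with_cache:.2f} per agent hour")     # saved 51.84 per agent hour
```

Changing `cache_hit` is the quickest way to see how much of the bill is prefix discipline rather than model choice.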
Why the cache works
Content hash caches key on bytes. The LLM regenerates a file with a renamed local variable, the SHA shifts, Bazel recomputes the world. Nothing about the file's meaning changed. The hash sat too low in the stack to notice.
Prompt prefix caches key higher. The expensive object inside a transformer is the KV cache. The key and value tensors produced by attending over every token in the prompt. Recomputing that state on each call is most of what you pay for.
Anthropic stores it. The key is the prefix of your request as identical bytes, typically the system prompt, tool definitions, and the unchanged file contents you pass in. A second call with the same prefix reuses the tensors and skips the attention pass over those tokens. The skip is the discount.
The 0.1x rate is a cost ratio, not a promotion. A cached read is RAM access with no matmul behind it. The 10 percent charge amortizes the write and the eviction bookkeeping. The read itself is close to free, and the price reflects that.
Two constraints fall out of this for your harness. First, the prefix must stay identical byte for byte across calls. Pin the system prompt, hash and freeze the tool schema, reuse file contents verbatim rather than asking the model to regenerate them. Second, calls must land inside the TTL. Five minutes for the cheap write, one hour for the expensive one.
Drift on either and the discount evaporates. The 43 percent on the widget above is engineering work, not a config flag.
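One way to police the two constraints in a harness is to serialize the prefix once, fingerprint it, and alert when the fingerprint moves. This is a sketch under assumptions: `build_prefix`, `fingerprint`, and `within_ttl` are hypothetical helpers, and how a real SDK serializes the request on the wire is outside what this shows.

```python
import hashlib
import json


def build_prefix(system_prompt: str, tool_schemas: list) -> bytes:
    """Serialize the stable part of a request in one canonical form.

    Sorting keys and fixing separators keeps a reordered dict from silently
    changing the bytes. The harness should send exactly these bytes.
    """
    tools = json.dumps(tool_schemas, sort_keys=True, separators=(",", ":"))
    return (system_prompt + "\n" + tools).encode("utf-8")


def fingerprint(prefix: bytes) -> str:
    """Hash to log per call; any change means the cached prefix is gone."""
    return hashlib.sha256(prefix).hexdigest()


def within_ttl(last_call_s: float, now_s: float, ttl_s: float = 300.0) -> bool:
    """True if the next call still lands inside the cache TTL (5 minutes by default)."""
    return (now_s - last_call_s) <= ttl_s


tools_a = [{"name": "read_file", "input": {"path": "string"}}]
tools_b = [{"input": {"path": "string"}, "name": "read_file"}]  # same schema, keys reordered

stable = fingerprint(build_prefix("You are a build agent.", tools_a))
reordered = fingerprint(build_prefix("You are a build agent.", tools_b))
drifted = fingerprint(build_prefix("You are a build agent. ", tools_a))  # one trailing space

print(stable == reordered, stable == drifted)  # True False
```

A single trailing space is enough to change the fingerprint, which is the point: the check has to be as strict as the cache it protects.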
05 The two level cache ¶
If the agent hour is the unit, the cache is the first thing that has to be rebuilt around it. Caches mirror the workload they serve.
Bazel hashes inputs and reuses outputs. Buck hashes inputs and reuses outputs. Nix hashes inputs and reuses outputs. The entire build system tradition of the last fifteen years is a content addressed cache that assumes f(x) = y, always. An LLM regenerating a file is the workload that breaks the assumption. Variable renames, reordered imports, equivalent control flow. Bytes change while meaning doesn't. The hash misses. The cache rebuilds. The bill arrives.
The fix is two caches stacked. The bottom layer is the old content hash cache. It still works for most of the build graph, because most of the build graph is still deterministic compilers chewing on deterministic inputs. The top layer is a prompt prefix cache. It keys on the prompt the agent issued, normalized. On a hit, you reuse the agent's last output without recomputing the bytes.
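The stacking can be sketched in a few lines. This is an illustration, not the implementation behind the benchmark: `TwoLevelCache`, the whitespace only normalization, and the fake generator and compiler are all assumptions made for the example.

```python
import hashlib


def _sha(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class TwoLevelCache:
    def __init__(self):
        self.prompt = {}   # top layer: normalized prompt -> agent output
        self.content = {}  # bottom layer: input bytes -> build output

    @staticmethod
    def normalize(prompt: str) -> str:
        # Assumption: collapse whitespace only. Real normalization is a design choice.
        return " ".join(prompt.split())

    def agent_step(self, prompt: str, generate):
        """Reuse the agent's last output for an equivalent prompt."""
        key = _sha(self.normalize(prompt))
        if key not in self.prompt:
            self.prompt[key] = generate(prompt)
        return self.prompt[key]

    def build_step(self, source: str, compile_fn):
        """Classic content addressed cache over the exact input bytes."""
        key = _sha(source)
        if key not in self.content:
            self.content[key] = compile_fn(source)
        return self.content[key]


calls = {"generate": 0, "compile": 0}


def fake_generate(prompt: str) -> str:
    # Stand-in for a model: alternates variable names on every real call.
    calls["generate"] += 1
    name = "i" if calls["generate"] % 2 else "idx"
    return f"for {name}, row in enumerate(rows): pass"


def fake_compile(source: str) -> int:
    calls["compile"] += 1
    return len(source)


cache = TwoLevelCache()
for prompt in ["add a loop", "add  a loop", "add a loop\n"]:
    source = cache.agent_step(prompt, fake_generate)
    cache.build_step(source, fake_compile)

print(calls)  # {'generate': 1, 'compile': 1}
```

Without the top layer the fake model would be called three times and emit two different byte strings, so the bottom layer would compile twice for work that never changed in meaning.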
Anthropic already does the prompt prefix layer at inference. Cache reads are billed at 0.1× the input rate (a 90% discount), and writes at 1.25× (5 minute TTL) or 2× (1 hour TTL). For Opus 4.x that's $5 per million input tokens dropping to $0.50 on a cache hit, with $6.25 cache writes at 5 minutes and $10 at one hour.7 The catch is a prefix with identical bytes. The agent harness has to be disciplined about prefix stability or the discount evaporates. (Don't ask how I learned this.)
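The multipliers above imply a break even point for paying the write premium. A small sketch, counting cost in units of one uncached read of the prefix, and assuming every later call lands inside the TTL and hits the cache (function names are illustrative):

```python
def uncached_cost(n_calls: int) -> float:
    """Prefix cost over n calls with no caching, in units of one full price read."""
    return float(n_calls)


def cached_cost(n_calls: int, write_mult: float, read_mult: float = 0.1) -> float:
    """One cache write on the first call, then cache reads on every later call."""
    return write_mult + read_mult * (n_calls - 1)


def break_even_calls(write_mult: float, read_mult: float = 0.1) -> int:
    """Smallest number of calls at which caching is strictly cheaper."""
    n = 1
    while cached_cost(n, write_mult, read_mult) >= uncached_cost(n):
        n += 1
    return n


print(break_even_calls(1.25))  # 2  (5 minute TTL write)
print(break_even_calls(2.0))   # 3  (1 hour TTL write)
```

Under these assumptions the 5 minute write pays for itself on the second call and the 1 hour write on the third, which is why a loop issuing hundreds of calls per hour cares far more about prefix stability than about the write premium.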
That discipline propagates down the build system. Prompts get versioned. System prompts get pinned. Tool schemas get hashed. The "what does this build depend on" question grows a new column. Not just source files and compiler version, but model snapshot, prompt prefix, and the seed if you bothered to set one.
06 The Tokio benchmark ¶
Given the two level cache works in principle, here is one number. We ran cargo test --workspace on Tokio three ways. Cold, 46.0 seconds. Warm content hash cache, 12.0 seconds. Hot two level cache with the prompt prefix layer included, 3.0 seconds.13
The 46 is specific to the project. Tokio is a particular workspace. A different repo gives you a different cold build. The number to watch is 3, because 3 is what an agent sees when it edits one crate out of fifty and reruns the workspace. Agents iterate like this constantly. Edit, run tests, edit, run tests.
At 3 seconds the agent stays in the call frame. At 46 seconds the agent harness, the human watching, and the CI runner all context switch out, and once you have context switched out, the next iteration costs more than just the build delta. The wall clock difference is fifteen times. The behavioral difference is larger than that.
The mechanism is the two layers doing their separate jobs. The bottom (content addressed) cache catches the 47 unchanged crates. The top (prompt prefix) cache catches the one regenerated file as semantically equivalent to the previous call. The work left over is the actual delta, which is roughly what 3 seconds buys.
This is one repo. The same shape recurs on any monorepo where an agent is the dominant editor. The break even point, where the prompt prefix cache pays back the cost of running it, comes around the third or fourth iteration on the same codebase. After that, every loop is on the 3 second side.
Three seconds is what makes the agent loop a loop. Forty six seconds is what makes it a coffee break.
06b Dev, test, eval, prod.
Three seconds at the build layer is the floor. Above the build, the rest of the SDLC has to change too. If the inner loop is a stochastic CPU answering in seconds, the four boxes around it (dev, test, eval, prod) stop looking like the four boxes a 2018 platform team would have drawn. The shapes change. The unit of work changes. The line items on the invoice change.
Dev.
A developer no longer spins up a laptop and a docker compose file. They request a sandbox. E2B boots a Firecracker microVM in approximately 80ms in the same region and resumes a paused one in 5 to 30ms (per E2B docs, 2026). Daytona advertises sub 90ms code to execution, as low as 27ms from a pre warmed pool (per Daytona docs, 2026). Modal restores a memory snapshot in about 1.05s for a PyTorch process and 3.5s for Stable Diffusion, roughly 2.5x faster than a cold container (per Modal blog, 2026). The legacy primitive was a long lived workstation. The neo primitive is a snapshot that gets paged in.
The vendor landscape splits along a roughly 3x price spread that reads as positioning, not markup. Modal and E2B sell raw compute, about $0.10 per hour for a typical 2 vCPU sandbox. Daytona pitches sub 90ms code to execution at about $0.17 per hour all in. Tensorlake leans on sandbox native pause and resume for agents, with a Pro tier around $0.10 per hour plus a $250 a month base. islo.dev packages the box plus persistence plus secrets plus multi cloud as one product, about $0.31 per hour, which works out to $1.23 for a 4 hour run on 2 vCPU and 4 GB. (I run islo.dev, so weight that accordingly.) The lower end sells the box. The higher end sells the agent computer.
State lives in two places: the filesystem snapshot that captures the OS, deps, and configured base, and the memory snapshot that captures the running process. E2B paused sandboxes persist for up to 30 days. Daytona auto stops at 15 minutes idle and auto archives at 7 days. Billing is per second, not per box. Modal sandbox CPU runs at about $0.142 per physical core hour and RAM at $0.024 per GiB hour (per Modal pricing, 2026). E2B sits in roughly the same band. A dev environment that costs cents per hour while running and zero while paused is a different economic object than a laptop, and the agent loop from section 06 is the thing that actually exercises it.6
Test.
A unit test over a deterministic function is an assertion. A unit test over a stochastic function is a coin flip. The fix is not to retry until green. The fix is to stop pretending the output is a scalar and start treating it as a distribution.
Practice has converged. Five to ten samples per case is standard. Eight is the de facto reliability multiplier in academic benchmarks. SWE-bench Verified runs 500 human validated Python issues as the reference grid (per SWE-bench, 2026). Golden sets in production tend to land at 200 to 500 curated examples per route, mined from real failures (per industry write ups, March 2026). The gate is no longer zero failures. The gate is mean above threshold T with a confidence interval narrower than W, where T is route specific. A faithfulness score of 0.82 passes for casual chat and fails for regulated insurance advice. Same score, different box.
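The gate described above is a few lines of arithmetic. A minimal sketch, assuming a normal approximation for the interval, eight samples per case, and made up thresholds for the two routes (0.75 and 0.90 are illustrative, not from any cited source):

```python
import math
import statistics


def eval_gate(scores: list, threshold: float, max_ci_width: float,
              z: float = 1.96) -> dict:
    """Pass if the mean clears T and the ~95% interval is narrower than W."""
    mean = statistics.fmean(scores)
    # Normal approximation to the CI of the mean; crude at small n, fine as a sketch.
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    width = 2 * half_width
    return {
        "mean": mean,
        "ci_width": width,
        "passed": mean > threshold and width < max_ci_width,
    }


# Eight samples of one case, averaging 0.82.
samples = [0.80, 0.84, 0.82, 0.81, 0.83, 0.82, 0.80, 0.84]

casual_chat = eval_gate(samples, threshold=0.75, max_ci_width=0.05)
regulated = eval_gate(samples, threshold=0.90, max_ci_width=0.05)

print(casual_chat["passed"], regulated["passed"])  # True False
```

Same samples, same mean, two verdicts. The threshold belongs to the route, not to the model.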
Eval.
Eval is the unit of work in the neo SDLC, and it is the line item that decides whether the loop terminates before the invoice does. Two modes run in parallel. Offline eval runs against curated reference sets on every commit, gated in CI. Online eval samples production traffic, typically 1 to 10 percent of matching traces for high volume apps and up to 100 percent for critical workflows (per Braintrust, 2026), scores them async with a cheap classifier or an LLM judge, and writes the score back onto the trace.
Cost asymmetry is the design pressure. A unit test costs roughly zero. A classifier eval costs about a tenth of a cent per call. An LLM judge costs $0.05 to $0.50 per call. At frontier scale the Holistic Agent Leaderboard ran 21,730 rollouts across 9 models for approximately $40k, and repeating that at k = 8 for reliability scales it to roughly $320k (per Hugging Face evaleval, 2026). Drift detection rides on top: a sustained 2 to 5 percent drop in per rubric score over 24 to 48 hours opens an investigation, 5 percent or more pages on call, and PSI on the input distribution above 0.2 forces action. The cascade pattern (deterministic scanner first, cheap classifier next, LLM judge only on residual uncertainty) is how teams claw back about an order of magnitude without losing precision.
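The cascade is simple to express as control flow. A sketch with stub scorers: the `scanner`, `classifier`, and `judge` callables and the 0.2 and 0.8 uncertainty band are assumptions for illustration, not any particular eval product's interface.

```python
def cascade_score(output: str, scanner, classifier, judge,
                  low: float = 0.2, high: float = 0.8):
    """Return (score, stage), escalating only while the cheaper stage is unsure."""
    verdict = scanner(output)          # True, False, or None when it cannot decide
    if verdict is not None:
        return (1.0 if verdict else 0.0), "scanner"

    p = classifier(output)             # cheap probability that the output is good
    if p <= low or p >= high:
        return p, "classifier"

    return judge(output), "judge"      # expensive call, residual uncertainty only


# Stub scorers, purely illustrative.
def scanner(output):
    if "DROP TABLE" in output:
        return False                   # deterministic rule fires
    return None                        # no rule applies, defer


def classifier(output):
    return 0.95 if output.startswith("SELECT") else 0.5


def judge(output):
    return 0.7


print(cascade_score("DROP TABLE users", scanner, classifier, judge))  # (0.0, 'scanner')
print(cascade_score("SELECT 1", scanner, classifier, judge))          # (0.95, 'classifier')
print(cascade_score("maybe fine", scanner, classifier, judge))        # (0.7, 'judge')
```

The savings come from how rarely the last branch runs, so the band between `low` and `high` is the knob that trades cost against precision.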
Prod.
Prod is a fleet of long lived sandboxes, not a deployment target. Shopify's River runs approximately 60,000 sessions across roughly 7,000 employees in 30 days and co authors one in eight merged PRs at Shopify.8 Anthropic's dynamic workflows cap at 16 concurrent agents per run and 1,000 subagents total (per Anthropic, 2026). Cognition runs Devin sessions as separate VMs, infinitely parallelizable, billed per ACU only while the VM is actively working.
Scheduling is per task, never per repo or per PR. The fleet pattern is three layers: a durable event log (Postgres for River), a disposable harness process, a disposable execution sandbox. Inference routes by prefix cache affinity, not round robin.7 The llm-d reference stack on 8 vLLM pods over 16 H100s drives P90 TTFT from more than 31 s down to 0.542 s, and Ranvier's benchmark on 8 A100s lifts cache hit rate from 12.5 percent to 97.5 percent and cuts P99 latency from 6,800 ms to 1,000 ms (per llm-d and Ranvier, 2026). Kubernetes is still there. It has been demoted to fleet scheduler underneath the inference aware layer, the role it should have had all along.13
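Prefix cache affinity can be illustrated with a toy router. The sketch below is a simplification and makes no claim about how llm-d or Ranvier score pods: the `Pod` class, the 16 token block size, and the tie break on in flight requests are assumptions, and eviction is not modeled.

```python
import hashlib
from dataclasses import dataclass, field

BLOCK = 16  # tokens per cache block (assumed for illustration)


def block_hashes(tokens):
    """Chain-hash full blocks so each hash identifies the entire prefix up to it."""
    hashes, running = [], hashlib.sha256()
    full = len(tokens) - len(tokens) % BLOCK
    for i in range(0, full, BLOCK):
        running.update(repr(tokens[i:i + BLOCK]).encode("utf-8"))
        hashes.append(running.copy().hexdigest())
    return hashes


@dataclass
class Pod:
    name: str
    cached: set = field(default_factory=set)  # prefix hashes believed warm on this pod
    in_flight: int = 0


def route(pods, tokens):
    """Pick the pod with the longest warm prefix; break ties by lowest load."""
    hashes = block_hashes(tokens)

    def warm_blocks(pod):
        count = 0
        for h in hashes:
            if h not in pod.cached:
                break
            count += 1
        return count

    best = max(pods, key=lambda p: (warm_blocks(p), -p.in_flight))
    best.cached.update(hashes)  # the chosen pod now holds this prefix
    best.in_flight += 1         # caller is expected to decrement on completion
    return best


pods = [Pod("pod-a"), Pod("pod-b")]
system_prompt = list(range(64))
print(route(pods, system_prompt).name)              # cold start: goes to the first idle pod
print(route(pods, system_prompt + [999] * 16).name)  # shares the prefix, follows the cache
print(route(pods, [7] * 32).name)                    # unrelated prompt, goes to the idle pod
```

Round robin would have sent the second request to the other pod and paid full prefill again. Keying the decision on the prefix is what turns a shared system prompt into a cache hit.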
Once the four boxes look like this, the question that closes the loop is who signed which output.
07 Provenance ¶
If agents stay in the call frame at 3 seconds, the volume of agent authored code keeps climbing. Pichai's 75% and Shopify's 12.5% are the leading edge.8,9 Which means the next question is the one auditors ask. Who signed this?
A commit used to carry one signature. By 2030, in any regulated industry, it will carry three. The human who asked. The agent identity (model, version, prompt hash, tool schema, seed). The platform attestation (which snapshot ran, on which cluster, against which evals). SLSA v1.1 already specifies the build provenance shape. Sigstore already signs the artifacts. The EU Cyber Resilience Act already requires it for software placed on the European market.14 The primitives exist. They just haven't been composed into a default yet.
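To make the three signatures concrete, here is a minimal sketch of what a merge record might hold. It is not the SLSA provenance format and it does not use Sigstore: the field names, the `attest_merge` and `verify_merge` helpers, and the HMAC shared secrets standing in for real asymmetric signatures are all assumptions for illustration.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass

ROLES = ("human", "agent", "platform")


@dataclass(frozen=True)
class AgentIdentity:
    model: str
    version: str
    prompt_hash: str
    tool_schema_hash: str
    seed: int


@dataclass(frozen=True)
class PlatformAttestation:
    snapshot: str
    cluster: str
    eval_suite: str


def canonical(obj) -> bytes:
    """Stable JSON encoding so every signer signs identical bytes."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(key: bytes, payload: bytes) -> str:
    # HMAC is a stand-in; a real system would use asymmetric signatures.
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def attest_merge(commit_sha, human, agent, platform, keys):
    """Build one statement and have all three roles sign it."""
    statement = {
        "commit": commit_sha,
        "human": human,
        "agent": asdict(agent),
        "platform": asdict(platform),
    }
    payload = canonical(statement)
    return {
        "statement": statement,
        "signatures": {role: sign(keys[role], payload) for role in ROLES},
    }


def verify_merge(record, keys) -> bool:
    """True only if all three signatures match the recorded statement."""
    payload = canonical(record["statement"])
    sigs = record["signatures"]
    return all(
        role in sigs and hmac.compare_digest(sigs[role], sign(keys[role], payload))
        for role in ROLES
    )
```

All three roles sign the same canonical statement, so changing any one field (a different seed, a different snapshot) invalidates every signature at once.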
The reason this becomes mandatory isn't only regulatory. It's IP. When 75% of Google's new code is AI coauthored and Shopify is merging more than 3,500 agent PRs a month, "who wrote this" stops being a credit line question and starts being an indemnity question. A signature you can't trace is a liability you can't bound. The triple signature is what makes the merge reviewable, the audit defensible, and the IP transferable.
It also closes the loop on the stochastic CPU. The output drifts every run. The invocation that produced it is still a fixed tuple you can write down on one line. f(x) = y is gone. The pinning moves up one level, to (human, agent, platform, prompt, snapshot) → artifact, and that tuple is what the next decade's compliance regime will be written against.
Reproducibility moved up one level of abstraction. Provenance moved with it.
08 Timeline 2025 → 2030 ¶
The sections above name the pieces. Here are the dates, written down so they can be checked later.
2025. Coding agents cross 75% on SWE-bench Verified. Sonnet 4.5 lands at 77.2% in September. Opus 4.5 becomes the first model over 80% in November.2 The autonomy slider becomes a product surface, not a research demo.
2026. Agents cross the roughly 85% human baseline in Q1. Fable 5 hits 95% in June. The agent hour ships as a discrete SKU on at least one major provider (Anthropic's split was drafted and on the calendar). Shopify reports one in eight merged PRs are agent coauthored.8
2027. All three hyperscalers have an agent hour SKU. Frontier training compute crosses 10²⁷ FLOPs in the aggressive case. Epoch AI's median forecast puts the crossing closer to 2028.5 Large engineering orgs cross the 50% coauthored code threshold.
2028. Power binds, not silicon. The IEA energy curve dominates the procurement conversation, and at least one frontier cluster is sited next to a dedicated generator.15
2029. Triple signature provenance ships as a default in at least one major CI provider. EU CRA enforcement does the rest.14
2030. Global data center electricity consumption hits about 945 TWh, more than double 2024's 415 TWh, with the US and China taking about 80% of the growth.15 Kubernetes still runs, but it has been demoted from the platform to the scheduler for the agent fleet underneath.
09 The 2030 bill ¶
Given the timeline, the invoice flips. The shape of a cloud bill is the most honest forward indicator we have. Money goes where the workload is, and the workload moved.
The 2030 split, for a large engineering org running agent heavy workflows, lands roughly here. Inference, 40%. Agent hours, billed separately from raw inference because the coordination surface is its own product, 25%. Snapshot and cache infrastructure, including the two level cache, 15%. Eval compute, which has eaten the role CI used to play, 10%. Egress, 5%. Legacy stateless compute, the EC2 instances, the RDS, the old shape, 5%.
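The split above is easy to sanity check. The snippet below encodes the six shares and applies them to a hypothetical bill. The $1,000,000 total and the `allocate` helper are illustrative assumptions, not figures from any provider.

```python
# Shares from the 2030 split above, in percent.
SPLIT_2030 = {
    "inference": 40,
    "agent_hours": 25,
    "snapshot_and_cache": 15,
    "eval": 10,
    "egress": 5,
    "legacy_compute": 5,
}


def allocate(total_usd, split):
    """Spread a total bill across line items by percentage share."""
    if sum(split.values()) != 100:
        raise ValueError("shares must sum to 100 percent")
    return {item: total_usd * pct / 100 for item, pct in split.items()}


bill = allocate(1_000_000, SPLIT_2030)  # hypothetical monthly bill
top_three = sum(SPLIT_2030[k] for k in ("inference", "agent_hours", "snapshot_and_cache"))
for item, usd in bill.items():
    print(f"{item:>20}: ${usd:>10,.0f}")
print(f"inference + agent hours + cache: {top_three}% of the bill")
```

The six shares sum to 100, and the first three account for the four fifths quoted at the top of this report.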
Compare that to 2023, when roughly 70% of a comparable bill was the legacy shape and inference barely cleared single digits. Reallocations this large happen once a workload crosses a threshold. The agent workload crossed in 2026. The bill is following on a two year lag.
Kubernetes survives. It just gets demoted. The control plane that schedules pods becomes the control plane that schedules agent runtimes. And the things it schedules aren't stateless. They have warm KV caches, prompt prefix locality, model affinity to specific GPU pools, snapshot warmth. The scheduler that wins this round will look more like a database query planner than the round robin we have today.
Power is the binding constraint. The IEA projects global data center consumption rising from 415 TWh in 2024 to about 945 TWh by 2030, with AI taking the dominant share of the growth and the US plus China responsible for about 80% of it.15 Frontier inference clusters in 2028 will be sited around the megawatt, not the rack.
This is the most consequential infrastructure shift since virtualization, and most of the people building it have never written a CUDA kernel.
10 Closing ¶
Software 1.0 was humans writing for machines. Software 2.0 was data writing weights. Software 3.0 is humans writing for models that write for machines. Software 3.5 is already here in the agent loop. Models write prompts for models that write code for machines, and the human moves up the abstraction one more notch.
The model in the middle is a stochastic CPU, and every layer above it was built on a guarantee the new CPU doesn't provide. The cache misses. The test flakes. The build's reproducibility moves from the binary up to the invocation. The commit grows from one signature to three. The invoice grows a line item with a unit nobody priced in 2023. A smarter model would have left the substrate alone. A stochastic one rewrites it.
What you are watching is the rewriting of a sixty year stack against a property change in its lowest layer. The container, the cache, the test runner, the build system, the scheduler, the bill. Each one is being reformed around the same observation, and most teams are still writing CI files as if the answer were a boolean.
The substrate is shifting. It's already shifted. You're standing on it.