Docker made code portable. Sandboxes make it safe to run —
because the author isn't human anymore. It's a model, writing code faster than anyone can read it.
Generation got cheap. Safe execution is the new bottleneck.
sandbox · the unit
A disposable computer an agent works inside: it can edit, run, test, build & ship there, with controlled reach into files, network & secrets, then vanish. Safe enough to let a model run wild; complete enough to let it finish the job.
Container era · 2013
Trusted human writes the code
Goal: portability & reproducibility. Code is reviewed and vouched for. Environments are long-lived pets.
Sandbox era · now
A model writes the code
Goal: instant, disposable isolation. Code is guilty until proven safe. Environments are ephemeral cattle, by the thousand.
01
Why
The author is untrusted. Model-written code can carry an injected payload, a hallucinated rm -rf /, or a typo'd dependency that resolves to malware. You can't read every run.
Blast radius is bounded. The box decides what the code may touch — filesystem, network egress, secrets, and (if isolation is weak) the host kernel itself.
Reproducible, at scale. Identical clean state every run, thousands in parallel, cheap to spawn and kill. Without it, an eval's reward signal is just noise.
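To make "the box decides what the code may touch" concrete, here is a minimal sketch of a deny-by-default policy check. The `SandboxPolicy` class, its field names and the example values are hypothetical, invented for illustration; real sandbox providers expose their own configuration, and a check like this only describes the rules, it does not enforce them.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class SandboxPolicy:
    """Hypothetical policy object: everything is denied unless listed."""
    writable_dirs: tuple[str, ...] = ()           # filesystem reach
    allowed_hosts: frozenset[str] = frozenset()   # network egress
    secrets: frozenset[str] = frozenset()         # secret names the run may read

    def may_write(self, path: str) -> bool:
        p = PurePosixPath(path)
        # ".." segments could climb out of a writable dir, so refuse them outright
        if ".." in p.parts:
            return False
        return any(p.is_relative_to(d) for d in self.writable_dirs)

    def may_connect(self, host: str) -> bool:
        return host in self.allowed_hosts

    def may_read_secret(self, name: str) -> bool:
        return name in self.secrets


# Example values are assumptions: one writable workspace, one allowed host, no secrets.
policy = SandboxPolicy(
    writable_dirs=("/workspace",),
    allowed_hosts=frozenset({"pypi.org"}),
)
```

With no entries at all, the default `SandboxPolicy()` answers "no" to every question, which is the posture the bounded blast radius depends on.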
▣
A wall, and a workbench
Containment is only half the story. A sandbox is also the whole computer an agent works inside — one place to dev, test, build and deploy, where it takes a task end to end with no human in the seat.
edit → run → test → build → deploy → observe ↺
the agent's loop, repeated autonomously, thousands of times over
Docker packaged the artifact: build, ship, run. The sandbox hands the agent the entire workflow — and the keys. Give it a box that can only run code and you get a calculator; give it one that can edit, build and ship, and you get a developer.
02
When
Four jobs people actually reach for them:
dev
The harness's workshop
Give the coding agent (Claude Code & friends) a box to develop in: edit, install, run, break things, without touching your laptop or your main branch.
test · eval
Prove it works
Run code you or your AI just wrote on a clean slate and grade it, before you trust a single line.
deploy
Ship AI-written code
Run model-authored code in production inside a contained runtime, where its blast radius is bounded by design.
parallel
Ten at once
Fork one environment ten ways and let agents chase ten features and bugs at the same time. Keep what passes; bin the rest.
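The "ten at once" job can be sketched in a few lines. This is a stand-in under loud assumptions: a copied directory plays the role of a forked environment and a plain subprocess plays the role of the box, so it shows the fork, check and discard shape but provides no isolation. The names `fork_and_check`, `run_forks`, `candidate.py` and `check.py` are hypothetical.

```python
import shutil
import subprocess
import sys
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def fork_and_check(template: Path, variant: int, patch: str) -> tuple[int, bool]:
    """Copy the template into a throwaway dir, drop in one candidate, run its check."""
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp) / "fork"
        shutil.copytree(template, work)             # the "fork"
        (work / "candidate.py").write_text(patch)   # one agent's attempt
        result = subprocess.run(
            [sys.executable, "check.py"],           # check.py ships with the template
            cwd=work, capture_output=True, timeout=30,
        )
        return variant, result.returncode == 0      # the fork is deleted on exit


def run_forks(template: Path, patches: list[str]) -> list[int]:
    """Run every candidate in its own fork in parallel; return the indices that passed."""
    with ThreadPoolExecutor(max_workers=max(1, len(patches))) as pool:
        results = pool.map(
            lambda item: fork_and_check(template, item[0], item[1]),
            enumerate(patches),
        )
        return [variant for variant, passed in results if passed]
```

Each fork starts from the same template and is deleted whether it passes or fails, so a bad attempt cannot contaminate a good one.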
And the same primitive, by role — pick yours:
Agents now open PRs and run shell commands. You ship code no human fully reviewed.
Why
An autonomous coding agent's diff is untrusted input the moment it executes.
Where
Code-interpreter tools · agent dev loops (a worktree per task) · CI · production tool-calls serving real users.
The call
Untrusted code, nothing private to reach → ephemeral container or microVM, egress off.
Model-generated code and SQL run against real datasets and pipelines.
Why
The generated step touches data you actually care about — and might exfiltrate or corrupt it.
The call
Untrusted code that needs private data → inside the VPC, scoped credentials, deny-by-default egress.
Every rollout and every eval runs untrusted code — thousands at once — and the reward must be reproducible.
Why
RL and evals are untrusted execution at scale; dirty state silently poisons the signal.
Where
RL environments · verifiable evals · reasoning sandboxes that run harnesses: spawn → set up task → run the agent's code → score → destroy.
The call
Disposability is the whole game → microVMs: VM-grade isolation, ~100 ms boot, thousands per host.
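The harness cycle named above (spawn → set up task → run the agent's code → score → destroy) might look like this in miniature. It is a sketch, not a real sandbox: a temporary directory stands in for a fresh microVM, which is an assumption made so the example runs anywhere, and the function name `rollout` and the exact-match scoring rule are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def rollout(agent_code: str, expected_stdout: str, timeout_s: float = 10.0) -> float:
    """One spawn -> set up -> run -> score -> destroy cycle; returns a reward of 0.0 or 1.0."""
    # spawn: a fresh directory stands in for a fresh microVM (NOT real isolation).
    # destroy: the directory is removed when the with-block exits, even on early return.
    with tempfile.TemporaryDirectory() as box:
        # set up task
        script = Path(box) / "solution.py"
        script.write_text(agent_code)
        # run the agent's code
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=box, capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0
        # score: exact match on stdout, and the process must exit cleanly
        passed = proc.returncode == 0 and proc.stdout.strip() == expected_stdout
        return 1.0 if passed else 0.0
```

Because every call starts from an empty box, two rollouts of the same code get the same score, which is what keeps the reward signal reproducible.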
The boundary decision
Untrusted code with nothing to steal belongs outside — a public, ephemeral box, lowest blast radius.
Untrusted code that needs your data belongs inside the VPC, isolated hard — now you're fencing something with real network reach.
03
How
An isolation ladder — weak & fast at the bottom, strong & heavy at the top. Pick the lowest rung that holds your threat model.
subprocess + limits
Same kernel, same user. A timeout and a ulimit. Trusted code only.
microVM · Firecracker
A real VM boundary that boots in ~100 ms, thousands per host. The agent sweet spot.
full VM · air-gapped
Maximum isolation, maximum cost. For the genuinely hostile.
isolation strength ↔ boot latency ↔ density / cost
microVMs broke the old rule that VM-grade isolation must be slow and expensive — which is what makes per-run disposability economically real.
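For contrast, the bottom rung ("a timeout and a ulimit") is small enough to show whole. A minimal sketch, assuming a POSIX host; the helper names and the specific limit values are illustrative choices, not recommendations. It caps resources only: the child still shares the kernel, the user and the filesystem with its parent, which is why this rung is for trusted code.

```python
import resource
import subprocess
import sys


def _apply_limits() -> None:
    # Runs in the child process just before exec (POSIX only).
    resource.setrlimit(resource.RLIMIT_CPU, (2, 2))                # 2 s of CPU time
    resource.setrlimit(resource.RLIMIT_FSIZE, (1 << 20, 1 << 20))  # 1 MiB max file size


def run_limited(code: str, wall_timeout_s: float = 5.0) -> subprocess.CompletedProcess:
    """Run a Python snippet with rlimits plus a wall-clock timeout."""
    return subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=_apply_limits,
        capture_output=True, text=True,
        timeout=wall_timeout_s,   # raises subprocess.TimeoutExpired if exceeded
    )
```

A busy loop is killed by the kernel once it exhausts its CPU allowance, and a sleeping process is caught by the wall-clock timeout instead. Nothing here stops the code from reading files or opening sockets, which is the gap the higher rungs close.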
the real lesson
Docker didn't beat LXC on isolation; it won on developer experience. Sandboxes get won the same way: the best API and the fastest boot, not the thickest wall.
¶
Isn't this just…?
…a VM? In isolation terms, yes. The novelty is booting it in ~100 ms and packing thousands per host; those economics are what make per-rollout disposability possible at all.
…containers? Containers share the host kernel; one kernel CVE is an escape. Fine for code you trust, thin for unreviewed model output. That gap is why gVisor and microVMs exist.
…hype? We've run untrusted code for decades. True. What changed is authorship and volume: code is now machine-written, unreviewed, and generated faster than humans can vet. Isolation moves from edge case to default substrate.
⬡
The landscape
A vast, fast-moving market — converging on the same shape: fast, disposable, API-driven isolation.
Pick any of them and the concepts on this page don't change. crabbox.sh exists so you don't have to choose blindly — it runs the same task across every provider, turning the sandbox into a commodity you can swap.
full disclosure
islo.dev is one of these providers, and it's ours. There's only one bear in town. 🐻
the bet
It's the container revolution again, at a far larger scale. Shipping software without sandboxes in 2026 is like shipping without Docker in 2016.