The Sandbox Shift — a field manual for running untrusted code

01

Why

The author is untrusted. Model-written code can carry an injected payload, a hallucinated rm -rf /, or a typo'd dependency that resolves to malware. You can't read every run.
Blast radius is bounded. The box decides what the code may touch — filesystem, network egress, secrets, and (if isolation is weak) the host kernel itself.
Reproducible, at scale. Identical clean state every run, thousands in parallel, cheap to spawn and kill. Without it, an eval's reward signal is just noise.

▣

A wall, and a workbench

Containment is only half the story. A sandbox is also the whole computer an agent works inside — one place to dev, test, build and deploy, where it takes a task end to end with no human in the seat.

edit → run → test → build → deploy → observe ↺
the agent's loop: repeated autonomously, thousands of times over

Docker packaged the artifact: build, ship, run. The sandbox hands the agent the entire workflow — and the keys. Give it a box that can only run code and you get a calculator; give it one that can edit, build and ship, and you get a developer.

02

When

Four jobs people actually reach for them:

dev: The harness's workshop. Give the coding agent (Claude Code & friends) a box to develop in: edit, install, run, break things, without touching your laptop or your main branch.
test · eval: Prove it works. Run code you or your AI just wrote on a clean slate and grade it, before you trust a single line.
deploy: Ship AI-written code. Run model-authored code in production inside a contained runtime, where its blast radius is bounded by design.
parallel: Ten at once. Fork one environment ten ways and let agents chase ten features and bugs at the same time. Keep what passes; bin the rest.
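A minimal sketch of the fork-and-filter pattern in the last item, assuming the "environment" is a plain directory and that a script named check.py decides pass or fail. Copying a directory is only a stand-in for forking a real sandbox and is not an isolation boundary. The names try_patch, fan_out, feature.py and check.py are hypothetical, not any provider's API.

```python
import shutil
import subprocess
import sys
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def try_patch(base: Path, name: str, patch: str) -> tuple[str, bool]:
    """Fork the base environment, apply one candidate, and run the check."""
    with tempfile.TemporaryDirectory() as scratch:
        fork = Path(scratch) / "env"
        shutil.copytree(base, fork)              # fork: a private copy of the base
        (fork / "feature.py").write_text(patch)  # the candidate change (hypothetical file name)
        result = subprocess.run(
            [sys.executable, "check.py"],        # hypothetical pass/fail script in the base
            cwd=fork,
            capture_output=True,
            timeout=30,
        )
        return name, result.returncode == 0
    # Leaving the with-block deletes the fork: failed candidates are binned for free.


def fan_out(base: Path, patches: dict[str, str], workers: int = 10) -> list[str]:
    """Try every candidate in parallel and return the names that passed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda item: try_patch(base, *item), patches.items()))
    return sorted(name for name, passed in results if passed)
```

Each candidate gets its own copy, so a bad attempt cannot dirty the base or a sibling fork.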

And the same primitive, by role — pick yours:

Agents now open PRs and run shell commands. You ship code no human fully reviewed.

Why: An autonomous coding agent's diff is untrusted input the moment it executes.
Where: Code-interpreter tools · agent dev loops (a worktree per task) · CI · production tool-calls serving real users.
The call: Untrusted code, nothing private to reach → ephemeral container or microVM, egress off.

Model-generated code and SQL run against real datasets and pipelines.

Why: The generated step touches data you actually care about — and might exfiltrate or corrupt it.
Where: Feature pipelines · notebook / analysis agents · batch scoring · data-cleaning tools.
The call: Untrusted code that needs private data → inside the VPC, scoped credentials, deny-by-default egress.
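A small sketch of what "deny-by-default egress" means as a policy, assuming an allowlist of internal hosts. The host names and the function egress_allowed are hypothetical. In practice this rule would more likely be enforced at the network layer (a firewall or egress proxy) than in application code; the snippet only illustrates the decision logic.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: the only destinations the sandbox may reach.
ALLOWED_HOSTS = frozenset({"warehouse.internal.example", "pypi.internal.example"})


def egress_allowed(url: str, allowed_hosts: frozenset = ALLOWED_HOSTS) -> bool:
    """Return True only for HTTPS requests to an explicitly listed host."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False  # anything that is not HTTPS is denied
    host = (parts.hostname or "").lower()
    return host in allowed_hosts  # exact match: no suffix or wildcard matching
```

Everything not named is refused, so a new exfiltration target needs no new rule to be blocked.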

Every rollout and every eval runs untrusted code — thousands at once — and the reward must be reproducible.

Why: RL and evals are untrusted execution at scale; dirty state silently poisons the signal.
Where: RL environments · verifiable evals · reasoning sandboxes that run harnesses: spawn → set up task → run the agent's code → score → destroy.
The call: Disposability is the whole game → microVMs: VM-grade isolation, ~100 ms boot, thousands per host.
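The spawn → set up task → run → score → destroy cycle can be sketched as below. This is an illustration under loud assumptions: a temporary directory stands in for a fresh sandbox (it is not an isolation boundary), the score is an exact match on stdout, and the names rollout and solution.py are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def rollout(agent_code: str, task_files: dict[str, str],
            expected_stdout: str, timeout_s: float = 5.0) -> float:
    """Run one episode in a fresh scratch directory and return a 0/1 reward."""
    # spawn: every call starts from an empty directory, so no state leaks between runs
    with tempfile.TemporaryDirectory() as box:
        root = Path(box)
        # set up task: write the task's input files
        for name, content in task_files.items():
            (root / name).write_text(content)
        (root / "solution.py").write_text(agent_code)
        # run the agent's code
        try:
            proc = subprocess.run(
                [sys.executable, "solution.py"],
                cwd=root, capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0
        # score: assumed here to be an exact match on trimmed stdout
        passed = proc.returncode == 0 and proc.stdout.strip() == expected_stdout.strip()
        reward = 1.0 if passed else 0.0
    # destroy: leaving the with-block deletes the directory
    return reward
```

Because each call builds its state from scratch, the same code on the same task yields the same reward, which is the reproducibility property the section asks for.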

The boundary decision: Untrusted code with nothing to steal belongs outside, in a public, ephemeral box with the lowest blast radius. Untrusted code that needs your data belongs inside the VPC, isolated hard, because now you're fencing something with real network reach.
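The boundary decision reduces to one question, which the sketch below encodes. The field values come from the "The call" lines above; the Placement type, the place function, and the "none" credentials value for the outside case are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    where: str
    egress: str
    credentials: str


def place(needs_private_data: bool) -> Placement:
    """Map the single boundary question to a placement."""
    if needs_private_data:
        # Untrusted code that needs your data: fence it inside the VPC.
        return Placement(where="inside the VPC",
                         egress="deny-by-default",
                         credentials="scoped")
    # Nothing to steal: a public, ephemeral box with egress off.
    # "none" is an assumption: with nothing private to reach, no credentials are mounted.
    return Placement(where="outside, ephemeral",
                     egress="off",
                     credentials="none")
```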

03

How

An isolation ladder, listed here from weak & fast (first) to strong & heavy (last). Pick the lowest rung that holds your threat model.

subprocess + limits: Same kernel, same user. A timeout and a ulimit. Trusted code only.
namespaces · cgroups · seccomp: nsjail, bubblewrap. Cheap kernel-level fencing for semi-trusted code.
container: runc / containerd. Shared kernel, so one kernel CVE is an escape.
gVisor: A user-space kernel intercepts syscalls. Container ergonomics, smaller attack surface.
microVM · Firecracker: A real VM boundary that boots in ~100 ms, thousands per host. The agent sweet spot.
full VM · air-gapped: Maximum isolation, maximum cost. For the genuinely hostile.
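For a feel of the weakest rung (subprocess + limits), here is a minimal POSIX-only sketch: a wall-clock timeout plus one rlimit, which is what a shell ulimit sets. The function name run_with_limits and the default values are assumptions. As the ladder says, this is for trusted code only: the child still shares the kernel, the user, and the filesystem.

```python
import resource
import subprocess
import sys


def _cap(kind: int, value: int) -> None:
    """Lower the soft limit for one resource, leaving the hard limit untouched."""
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(kind, (value, hard))


def run_with_limits(code: str, timeout_s: float = 2.0, cpu_s: int = 1) -> dict:
    """Run a Python snippet in a child process with a timeout and a CPU-time limit."""

    def apply_limits() -> None:
        # Runs in the child between fork and exec (POSIX only).
        _cap(resource.RLIMIT_CPU, cpu_s)

    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores user site and env vars
            capture_output=True, text=True,
            timeout=timeout_s, preexec_fn=apply_limits,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising.
        return {"status": "timeout", "stdout": "", "returncode": None}
    status = "ok" if proc.returncode == 0 else "failed"
    return {"status": status, "stdout": proc.stdout, "returncode": proc.returncode}
```

The timeout catches code that sleeps or blocks; the CPU limit catches code that spins. Neither stops the child from reading files or opening sockets, which is why the higher rungs exist.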

isolation strength ↔ boot latency ↔ density / cost

microVMs broke the old rule that VM-grade isolation must be slow and expensive — which is what makes per-run disposability economically real.

the real lesson: Docker didn't beat LXC on isolation; it won on developer experience. Sandboxes get won the same way: the best API and the fastest boot, not the thickest wall.

⌗

Decide

Four questions. Live verdict, recommended rung (it highlights the ladder), and placement. Nothing leaves your browser.

¶

Isn't this just…?

…a VM? In isolation terms, yes. The novelty is booting it in ~100 ms and packing thousands per host — that economics is what makes per-rollout disposability possible at all.

…containers? Containers share the host kernel; one kernel CVE is an escape. Fine for code you trust, thin for unreviewed model output. That gap is why gVisor and microVMs exist.

…hype? We've run untrusted code for decades. True. What changed is authorship and volume: code is now machine-written, unreviewed, and generated faster than humans can vet. Isolation moves from edge case to default substrate.

⬡

The landscape

A vast, fast-moving market — converging on the same shape: fast, disposable, API-driven isolation.

Daytona · E2B · Blacksmith · Tensorlake · exe.dev · Modal · islo.dev · …

Pick any of them and the concepts on this page don't change. crabbox.sh exists so you don't have to choose blindly — it runs the same task across every provider, turning the sandbox into a commodity you can swap.

full disclosure: islo.dev is one of these providers, and it's ours. There's only one bear in town. 🐻

the bet

It's the container revolution again, at a far larger scale. Shipping software without sandboxes in 2026 is like not using Docker in 2016.