AI Sandbox Landscape

Choose the execution contract first.

For AI-generated code, the hard question is not which vendor to choose. It is what the agent may do, what resets, what persists, how work fans out, where secrets live, and what evidence comes back.

Updated May 16, 2026 | Public sources only | AI-generated baseline

Thesis: The sandbox is the mechanism. The contract is the decision: permissions, isolation, reset, state, scale, secrets, evidence, and stop conditions.

What can the agent execute?
What network or tools can it reach?
What resets after each run?
What state is allowed to persist?
Where do credentials stay?
What logs, diffs, artifacts, or rewards prove the outcome?
Who owns timeout, kill, and rollback?
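The seven questions above can be written down as a single record before any runtime is chosen. The following is a minimal sketch, not a schema used by any vendor on this page: the class name, field names, and the three checks in `problems()` are all illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Reset(Enum):
    EVERY_RUN = "every_run"
    ON_DEMAND = "on_demand"
    NEVER = "never"


@dataclass(frozen=True)
class ExecutionContract:
    """Hypothetical record; one field per question in the list above."""
    allowed_commands: tuple[str, ...]        # What can the agent execute?
    allowed_hosts: tuple[str, ...] = ()      # What network or tools can it reach?
    reset: Reset = Reset.EVERY_RUN           # What resets after each run?
    persistent_paths: tuple[str, ...] = ()   # What state is allowed to persist?
    credentials_in_sandbox: bool = False     # Where do credentials stay?
    evidence: tuple[str, ...] = ("logs",)    # What proves the outcome?
    timeout_seconds: int = 300               # Timeout budget
    kill_owner: str = "orchestrator"         # Who owns timeout, kill, and rollback?

    def problems(self) -> list[str]:
        """Return human-readable gaps; an empty list means no obvious gap."""
        issues = []
        if not self.evidence:
            issues.append("no evidence is collected")
        if self.timeout_seconds <= 0:
            issues.append("timeout must be positive")
        if self.credentials_in_sandbox and self.allowed_hosts:
            issues.append("credentials sit inside a sandbox that has network reach")
        return issues


contract = ExecutionContract(allowed_commands=("pytest",))
print(contract.problems())
```

Filling in this record for a workload tends to make the runtime shape obvious: a contract with no persistent paths and reset on every run points at an ephemeral sandbox, while one with persistent paths and on-demand reset points at a stateful workspace.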

Execution atlas

Route the contract before picking a runtime.

Each path starts with code risk and ends with evidence. The useful question is not “which sandbox?” but which policy, reset model, state boundary, control loop, and artifact trail the workload needs.

Eval benchmark route: task definition, agent disclosure, clean environment, logs, artifacts, and a reproducible score.

Runtime shapes

Four contract shapes, then vendors.

Classify the boundary first. Vendor choice is secondary to what the workload needs from isolation, reset, state, orchestration, and evidence.

Embedded micro runtime

Small generated programs inside your agent loop. Start with no filesystem, no network, and explicit host-provided functions.

Monty / WASM / model-native code tools
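To make "explicit host-provided functions" concrete, here is a toy expression evaluator that walks a parsed syntax tree and refuses everything except constants, basic arithmetic, and calls to functions the host passes in. It is only a sketch of the contract, and it is not how Monty, WASM runtimes, or model-native tools are implemented. The `evaluate` function and the `add` capability are hypothetical, and the sketch does not bound CPU or memory.

```python
import ast
import operator

# Arithmetic the toy runtime permits; everything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(source: str, host_functions: dict):
    """Evaluate one expression with access only to host_functions."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float, str)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name) and not node.keywords:
            if node.func.id not in host_functions:
                raise PermissionError(f"function not exposed: {node.func.id}")
            return host_functions[node.func.id](*[walk(arg) for arg in node.args])
        # Attribute access, imports, comprehensions, and the rest land here.
        raise PermissionError(f"syntax not allowed: {type(node).__name__}")

    return walk(ast.parse(source, mode="eval"))


capabilities = {"add": lambda a, b: a + b}
print(evaluate("add(1, 2) * 3", capabilities))
```

Because the generated code never reaches Python's builtins, a call such as `open('x')` fails with `PermissionError` unless the host deliberately exposes a function under that name.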

Ephemeral task sandbox

Run generated Python, notebooks, tests, or one benchmark task with clean reset, logs, files, and bounded lifetime.

E2B / Modal / model code interpreters
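The shape of an ephemeral run (clean start, bounded lifetime, evidence returned, everything else discarded) can be sketched with the standard library alone. This is an assumption-laden illustration, not any listed vendor's API, and a subprocess in a temporary directory is not a security boundary: it shows the reset and evidence contract, not isolation. The `run_once` name and the result keys are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_once(code: str, timeout_s: float = 5.0) -> dict:
    """Run code in a fresh directory, return evidence, then discard the directory."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "task.py"
        script.write_text(code)
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode, ignores user site and env vars
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,  # bounded lifetime: the child is killed on expiry
            )
            result = {
                "exit_code": proc.returncode,
                "stdout": proc.stdout,
                "stderr": proc.stderr,
                "timed_out": False,
            }
        except subprocess.TimeoutExpired:
            result = {"exit_code": None, "stdout": "", "stderr": "", "timed_out": True}
        # Record which files the task produced before the directory is deleted.
        result["files"] = sorted(
            p.name for p in Path(workdir).iterdir() if p.name != "task.py"
        )
        return result


print(run_once("print(2 + 2)"))
```

A real ephemeral sandbox replaces the subprocess with a container or microVM, but the caller-facing contract stays the same: code in, logs and artifacts out, nothing left behind.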

Persistent agent workspace

Give the agent a real computer: repo checkout, packages, services, browser state, partial work, snapshots, and resume.

Daytona / Runloop / Islo / Docker / exe.dev

Control and eval plane

Fan out rollouts, hydrate repos, reset environments, capture evidence, score tasks, retry failures, and gate risky actions.

Harbor / Temporal / ARC / Crabbox / Tensorlake

Use cases

Map workload to contract shape.

“Sandbox” is overloaded. Inline code, evals, RL episodes, PR validation, long agent tasks, and production-generated code need different contracts.

Inline code in the model loop

Use Monty, WASM, or model-native code tools when generated code should only call capabilities you explicitly expose.

Agent evals and benchmarks

Use Harbor with Docker, Modal, Daytona, or E2B when every run needs a task, reset policy, agent/model disclosure, logs, and a score.

RL and post-training rollouts

Use Modal plus Harbor/TRL-style reward loops when you need many parallel episodes, deterministic resets, and reward artifacts.

PR validation for generated code

Use GitHub Actions, ARC, Docker Sandboxes, Daytona, or Crabbox when an AI patch needs tests without touching shared runners or host Docker.

Long-running coding agents

Use Daytona, Runloop, Islo, Docker Sandboxes, Tensorlake, or exe.dev when repo state, services, caches, logs, and partial work must survive.

Production sandbox for generated code

Use microVM/VM-backed sandboxes with network policy, credential proxying, audit logs, timeouts, kill switches, and human approval gates.

Decision matrix

Workload to default contract.

Workload	State	Scale	Security	Default choice
Inline generated code in model loop	No	Low	Medium	Monty / WASM / model-native code execution
Generated script / data analysis	Low	Medium	Medium	E2B / Modal / Anthropic / OpenAI / Gemini / AWS code tools
Agent eval harness	No / medium	High	Medium / high	Harbor + Modal / Daytona
SWE-bench / Terminal-Bench	Medium	High	High	Harbor + Daytona / Modal / E2B
GitHub Actions ordinary CI	No	Medium	Medium	GitHub-hosted runners
GitHub Actions private infra	No	High	High	ARC ephemeral self-hosted runners
AI-generated PR validation	No / medium	Medium	High	ARC + Docker Sandboxes / Daytona / Crabbox
Long-running coding agent	Yes	Medium	High	Daytona / Runloop / Islo / exe.dev
Enterprise unattended agent	Yes	Medium	Very high	Islo / Runloop / Docker Sandboxes
Post-training RL rollouts	Usually no	Very high	Medium / high	Modal + Harbor
Production sandbox for generated code	No / bounded	Medium / high	Very high	Docker Sandboxes / Islo / Runloop / Daytona + approvals
Stateful RL environments	Yes	High	High	Tensorlake / Daytona / Runloop / Fly Sprites
Auto-healing agent workflow	Yes	Medium / high	High	Temporal + sandbox backend
Auto-spawning experiments	Snapshot	Very high	Medium / high	Modal / Harbor / Daytona / Tensorlake / Crabbox
Browser / research agents	Medium	Medium	High	E2B Desktop / Daytona / Tensorlake / browser-specific sandbox
Production-adjacent actions	Yes	Medium	Very high	Islo / Runloop / Docker Sandboxes + approvals
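The matrix can also be queried as data. The sketch below copies four rows from the table above verbatim and filters them by the State and Security columns; the `Row` type and `defaults_for` function are hypothetical names, and the exact-match filtering is a simplifying assumption.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    workload: str
    state: str
    scale: str
    security: str
    default: str


# A subset of the decision matrix, copied from the table above.
MATRIX = [
    Row("Inline generated code in model loop", "No", "Low", "Medium",
        "Monty / WASM / model-native code execution"),
    Row("Long-running coding agent", "Yes", "Medium", "High",
        "Daytona / Runloop / Islo / exe.dev"),
    Row("Post-training RL rollouts", "Usually no", "Very high", "Medium / high",
        "Modal + Harbor"),
    Row("Production-adjacent actions", "Yes", "Medium", "Very high",
        "Islo / Runloop / Docker Sandboxes + approvals"),
]


def defaults_for(state: Optional[str] = None, security: Optional[str] = None) -> list[str]:
    """Return default choices for rows matching the given column values exactly."""
    return [
        row.default
        for row in MATRIX
        if (state is None or row.state == state)
        and (security is None or row.security == security)
    ]


print(defaults_for(state="Yes"))
```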

Solutions

Public links by execution role.

These links point to product pages, docs, or canonical repos. The labels are workload roles, not benchmark rankings.

Tier 0

Embedded and model-native code

Small generated code, data transforms, notebook-like cells, and capability-limited functions inside the model loop.

Monty: Capability-limited Python runtime
Anthropic Code Execution: Model-native execution tool
OpenAI Code Interpreter: Hosted code interpreter surface
Gemini Code Execution: Generated Python execution
AWS AgentCore Code Interpreter: AWS-native agent code tool

Tier 1

Ephemeral sandboxes

Generated scripts, tests, batch jobs, data tasks, and benchmark episodes that should start clean and leave evidence.

E2B: AI-generated code in secure sandboxes
Modal Sandboxes: Compute scale, Python, GPU-adjacent work
ComputeSDK: One API over sandbox providers
Vercel Sandbox: Web-oriented AI code execution
Namespace: Ephemeral container instances
Blaxel: Serverless AI compute

Tier 2

Stateful agent workspaces

Repo checkouts, package installs, services, browser/session state, snapshots, resume, and multi-hour coding work.

Daytona: Stateful sandboxes and agent workspaces
Runloop: Devboxes for coding agents
Islo: Persistent isolated agent environments
Docker Sandboxes: Local/enterprise AI coding isolation
Tensorlake: Stateful microVM and orchestration stack
exe.dev: Persistent Linux computers for agents
Fly Sprites: Stateful agent runtime pattern

Tier 3

Eval, RL, and benchmark planes

Task definitions, rollout fan-out, reward functions, raw logs, artifacts, traces, and reproducible scorekeeping.

Harbor: Agent eval and rollout harness
ComputeSDK Benchmarks: Daily sandbox/provider performance data
Modal: Parallel rollout and compute substrate
Hugging Face TRL: Post-training algorithm layer
SWE-bench: Software engineering benchmark
Terminal-Bench: Terminal task benchmark
Braintrust: Eval and trace management
LangSmith: Trace and evaluation workflows
Langfuse: Open-source LLM observability

Control

CI, brokers, and orchestration

Generated PR validation, ephemeral runners, provider routing, durable workflows, retries, and approval gates.

Crabbox: Broker across sandbox backends
Crabbox repo: Canonical source repository
Actions Runner Controller: Autoscaled GitHub runners
Temporal: Durable workflow state
Inngest: Evented durable functions
Trigger.dev: Background jobs and automation
Kubernetes Agent Sandbox: Kubernetes-native agent runtime API

Watchlist

Adjacent sandbox providers

Promising or specialized providers to test under the same benchmark protocol before promoting into defaults.

Podflare: Sandbox provider candidate
Sandbox0: AI sandbox provider candidate
Northflank Sandboxes: Cloud sandbox product
InstaVM: VM-oriented runtime candidate
shuru: Sandbox/runtime candidate
Noid: Agent runtime candidate
Hopx: MicroVM sandbox provider
CodeSandbox: Cloud development sandboxes

Costs

Public cost signals.

Pricing changes. Treat this as a pointer to public pricing surfaces, not a purchasing quote.

Solution	Cost signal	Links
Crabbox	Open-source broker; spend comes from leased provider capacity and operations.	Docs Repo
ComputeSDK	Abstraction layer; cost depends on chosen provider. Use benchmarks to compare cold start, throughput, and reliability.	Site Benchmarks
E2B	Free/hobby entry plus usage; CPU and RAM metered per second.	Pricing
Modal	Plan plus compute usage; sandbox CPU/RAM billed by time used.	Pricing
Daytona	Usage-based vCPU, memory, storage, and GPU pricing.	Pricing
Runloop	Plan plus usage; devbox CPU, RAM, and storage billed separately.	Pricing
Islo	Usage pricing for CPU, memory, and storage.	Pricing
Tensorlake	Free tier plus usage-based on-demand and pro plans.	Pricing
exe.dev	Public VM plan; production usage is sales-led.	Pricing
Docker Sandboxes	Depends on Docker plan and feature availability.	Pricing
GitHub Actions / ARC	Hosted runner minutes for GitHub; ARC also carries Kubernetes infrastructure cost.	Billing
Temporal / workflow tools	Self-hosted infrastructure or managed cloud plan usage.	Temporal pricing
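Several rows above describe per-second metering of CPU and memory. The arithmetic behind such a bill is sketched below. The function name is hypothetical, and the rates in the example are placeholder numbers chosen only to show the calculation. They are not any provider's prices; take real rates from the linked pricing pages.

```python
def metered_cost(
    seconds: float,
    vcpus: float,
    memory_gib: float,
    vcpu_rate_per_s: float,
    gib_rate_per_s: float,
) -> float:
    """Cost of one sandbox run under per-second CPU and RAM metering."""
    if min(seconds, vcpus, memory_gib, vcpu_rate_per_s, gib_rate_per_s) < 0:
        raise ValueError("all inputs must be non-negative")
    return seconds * (vcpus * vcpu_rate_per_s + memory_gib * gib_rate_per_s)


# Placeholder rates for illustration only, not real prices.
one_hour = metered_cost(
    seconds=3600, vcpus=2, memory_gib=4,
    vcpu_rate_per_s=0.00001, gib_rate_per_s=0.000005,
)
print(f"{one_hour:.3f}")
```

Multiplying the result by the number of parallel rollouts gives a first-order estimate for fan-out workloads. Storage, GPU, egress, and plan fees sit outside this formula.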

Method

How this should evolve.

Public information only

This page is AI-generated from public docs, public repos, public product pages, changelogs, and reproducible experiments.

Agent disclosure

Every benchmark should disclose prompt, agent/model, date, commit, harness, raw logs, and human edits.
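A disclosure like this is easier to enforce when it is a record that can be checked before a result is published. The sketch below is one possible shape, assuming field names that mirror the sentence above; the `BenchmarkDisclosure` class and its `missing()` helper are hypothetical, not part of Harbor or any other harness listed here.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class BenchmarkDisclosure:
    """Hypothetical disclosure record attached to one benchmark result."""
    prompt: str
    agent: str
    model: str
    date: str
    commit: str
    harness: str
    raw_log_paths: list[str] = field(default_factory=list)
    human_edits: str = ""  # write "none" explicitly; empty means undisclosed

    def missing(self) -> list[str]:
        """Names of fields that are still empty, in declaration order."""
        return [name for name, value in asdict(self).items() if value in ("", [], None)]

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


record = BenchmarkDisclosure(
    prompt="Fix the failing test",   # example values, not real results
    agent="example-agent",
    model="example-model",
    date="2026-05-16",
    commit="abc1234",
    harness="example-harness",
)
print(record.missing())
```

A publishing step can refuse any result whose `missing()` list is non-empty, which turns the disclosure rule into a gate rather than a convention.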

Evidence over positioning

Vendor claims are leads. Benchmark results need code, logs, artifacts, scoring rubrics, and failure notes.

Skills sit above runtimes

Use agent-skill registries such as officialskills.sh as adjacent context; this page maps where those skills execute and how the execution is controlled.

Benchmark protocol | ComputeSDK benchmarks | officialskills.sh | Prompt provenance | Agent instructions | Launch notes

Feedback

Want to be considered or correct something?

Send public evidence, benchmark ideas, pricing corrections, or inclusion requests on LinkedIn.