AI Sandbox Landscape
For AI-generated code, the hard question is not which vendor to pick. It is what the agent may do, what resets, what persists, how work fans out, where secrets live, and what evidence comes back.
Execution atlas
Each path starts with code risk and ends with evidence. The useful question is not “which sandbox?” but which policy, reset model, state boundary, control loop, and artifact trail the workload needs.
Eval benchmark route: task definition, agent disclosure, clean environment, logs, artifacts, and a reproducible score.
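One way to make those five questions concrete is to write them down as a record before comparing vendors. The following is a minimal sketch; `WorkloadContract`, `ResetModel`, and every field value are illustrative assumptions, not part of any product's API.

```python
from dataclasses import dataclass
from enum import Enum


class ResetModel(Enum):
    """How the environment returns to a known state (illustrative values)."""
    PER_RUN = "fresh environment for every run"
    SNAPSHOT = "restore from a snapshot"
    NONE = "state carries over"


@dataclass(frozen=True)
class WorkloadContract:
    """Hypothetical record of what a workload needs from its sandbox."""
    policy: tuple[str, ...]          # what the agent may do, e.g. ("no-network",)
    reset_model: ResetModel          # what resets between runs
    state_boundary: tuple[str, ...]  # what persists, e.g. ("repo", "caches")
    control_loop: str                # who drives fan-out, retries, approvals
    artifact_trail: tuple[str, ...]  # what evidence comes back

    def gaps(self) -> list[str]:
        """Return human-readable problems with an incomplete contract."""
        problems = []
        if not self.artifact_trail:
            problems.append("artifact_trail is empty: the run returns no evidence")
        if not self.control_loop:
            problems.append("control_loop is empty: nobody owns retries or approvals")
        if self.reset_model is ResetModel.NONE and not self.state_boundary:
            problems.append("state carries over but state_boundary names nothing")
        return problems


# Example contract for a single eval task (values are placeholders).
eval_task = WorkloadContract(
    policy=("no-network",),
    reset_model=ResetModel.PER_RUN,
    state_boundary=(),
    control_loop="eval harness",
    artifact_trail=("logs", "score"),
)
```

Calling `eval_task.gaps()` returns an empty list here; a contract with no artifact trail or no owner for the control loop would be flagged before any sandbox is provisioned.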
Runtime shapes
Classify the boundary first. Vendor choice is secondary to what the workload needs from isolation, reset, state, orchestration, and evidence.
Small generated programs inside your agent loop. Start with no filesystem, no network, and explicit host-provided functions.
Monty / WASM / model-native code tools
Run generated Python, notebooks, tests, or one benchmark task with clean reset, logs, files, and bounded lifetime.
E2B / Modal / model code interpreters
Give the agent a real computer: repo checkout, packages, services, browser state, partial work, snapshots, and resume.
Daytona / Runloop / Islo / Docker / exe.dev
Fan out rollouts, hydrate repos, reset environments, capture evidence, score tasks, retry failures, and gate risky actions.
Harbor / Temporal / ARC / Crabbox / Tensorlake
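The first shape above (no filesystem, no network, explicit host-provided functions) can be illustrated with a tiny interpreter that accepts only constants and calls to functions the host has registered. This is a sketch under stated assumptions: `HOST_FUNCTIONS` and `run_snippet` are hypothetical names, the approach is not how Monty or any WASM runtime is implemented, and it does not limit CPU or memory, so it is not a security boundary on its own.

```python
import ast

# Hypothetical host capabilities: the only functions generated code may call.
HOST_FUNCTIONS = {
    "add": lambda a, b: a + b,
    "upper": lambda s: s.upper(),
}

# Syntax the snippet may contain: one expression built from literal
# constants and positional calls to named functions. Everything else
# (attributes, imports, operators, keywords, lambdas) is rejected.
_ALLOWED_NODES = (ast.Expression, ast.Call, ast.Name, ast.Load, ast.Constant)


def _evaluate(node, host_functions):
    """Interpret the checked tree directly, without eval() or builtins."""
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        func = host_functions[node.func.id]
        args = [_evaluate(arg, host_functions) for arg in node.args]
        return func(*args)
    raise ValueError(f"cannot evaluate: {type(node).__name__}")


def run_snippet(source, host_functions=HOST_FUNCTIONS):
    """Evaluate one expression made only of constants and host-function calls."""
    tree = ast.parse(source, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, _ALLOWED_NODES):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
        if isinstance(node, ast.Name) and node.id not in host_functions:
            raise ValueError(f"unknown name: {node.id}")
    return _evaluate(tree.body, host_functions)
```

With these definitions, `run_snippet("add(1, add(2, 3))")` returns `6`, while `run_snippet("__import__('os')")` and `run_snippet("open('x')")` raise `ValueError` because neither name was exposed by the host.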
Use cases
“Sandbox” is overloaded. Inline code, evals, RL episodes, PR validation, long agent tasks, and production-generated code need different contracts.
Inline code: use Monty, WASM, or model-native code tools when generated code should only call capabilities you explicitly expose.
Evals: use Harbor with Docker, Modal, Daytona, or E2B when every run needs a task, reset policy, agent/model disclosure, logs, and a score.
RL episodes: use Modal plus Harbor/TRL-style reward loops when you need many parallel episodes, deterministic resets, and reward artifacts.
PR validation: use GitHub Actions, ARC, Docker Sandboxes, Daytona, or Crabbox when an AI patch needs tests without touching shared runners or host Docker.
Long agent tasks: use Daytona, Runloop, Islo, Docker Sandboxes, Tensorlake, or exe.dev when repo state, services, caches, logs, and partial work must survive.
Production-generated code: use microVM/VM-backed sandboxes with network policy, credential proxying, audit logs, timeouts, kill switches, and human approval gates.
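The approval gate and audit log named for production-generated code can be sketched as a default-deny check that runs before each agent action. The action kinds, the `gate` function, and the two policy sets below are hypothetical; real enforcement of network policy, credentials, and kill switches belongs in the sandbox layer, not in application code like this.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Action:
    kind: str    # e.g. "read_file", "http_request", "deploy"
    target: str  # what the action touches


# Hypothetical policy: kinds that run freely vs. kinds that need a human.
AUTO_ALLOWED = {"read_file", "run_tests"}
NEEDS_APPROVAL = {"http_request", "deploy"}


def gate(action: Action, approve: Callable[[Action], bool], audit_log: list) -> bool:
    """Decide whether an action may run and record the decision."""
    if action.kind in AUTO_ALLOWED:
        decision = "allowed"
    elif action.kind in NEEDS_APPROVAL:
        # `approve` stands in for a human reviewer or ticketing step.
        decision = "approved" if approve(action) else "rejected"
    else:
        # Default-deny anything the policy does not name.
        decision = "denied"
    audit_log.append((action.kind, action.target, decision))
    return decision in ("allowed", "approved")
```

Every call appends one entry to `audit_log`, including rejected and denied actions, so the evidence trail covers what the agent tried as well as what it did.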
Decision matrix
| Workload | State | Scale | Security | Default choice |
|---|---|---|---|---|
| Inline generated code in model loop | No | Low | Medium | Monty / WASM / model-native code execution |
| Generated script / data analysis | Low | Medium | Medium | E2B / Modal / Anthropic / OpenAI / Gemini / AWS code tools |
| Agent eval harness | No / medium | High | Medium / high | Harbor + Modal / Daytona |
| SWE-bench / Terminal-Bench | Medium | High | High | Harbor + Daytona / Modal / E2B |
| GitHub Actions ordinary CI | No | Medium | Medium | GitHub-hosted runners |
| GitHub Actions private infra | No | High | High | ARC ephemeral self-hosted runners |
| AI-generated PR validation | No / medium | Medium | High | ARC + Docker Sandboxes / Daytona / Crabbox |
| Long-running coding agent | Yes | Medium | High | Daytona / Runloop / Islo / exe.dev |
| Enterprise unattended agent | Yes | Medium | Very high | Islo / Runloop / Docker Sandboxes |
| Post-training RL rollouts | Usually no | Very high | Medium / high | Modal + Harbor |
| Production sandbox for generated code | No / bounded | Medium / high | Very high | Docker Sandboxes / Islo / Runloop / Daytona + approvals |
| Stateful RL environments | Yes | High | High | Tensorlake / Daytona / Runloop / Fly Sprites |
| Auto-healing agent workflow | Yes | Medium / high | High | Temporal + sandbox backend |
| Auto-spawning experiments | Snapshot | Very high | Medium / high | Modal / Harbor / Daytona / Tensorlake / Crabbox |
| Browser / research agents | Medium | Medium | High | E2B Desktop / Daytona / Tensorlake / browser-specific sandbox |
| Production-adjacent actions | Yes | Medium | Very high | Islo / Runloop / Docker Sandboxes + approvals |
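The matrix can also be treated as data and filtered by requirement. The sketch below copies four rows from the table (those with a single value in the State and Security columns) and is only an illustration; `ROWS`, `SECURITY_ORDER`, and `defaults_for` are hypothetical names, and the ordering of security levels is an assumption read off the column labels.

```python
# Assumed ordering of the Security column, weakest to strongest.
SECURITY_ORDER = ["Medium", "High", "Very high"]

# (workload, state, security, default choice), copied from the matrix above.
ROWS = [
    ("Inline generated code in model loop", "No", "Medium",
     "Monty / WASM / model-native code execution"),
    ("GitHub Actions private infra", "No", "High",
     "ARC ephemeral self-hosted runners"),
    ("Long-running coding agent", "Yes", "High",
     "Daytona / Runloop / Islo / exe.dev"),
    ("Enterprise unattended agent", "Yes", "Very high",
     "Islo / Runloop / Docker Sandboxes"),
]


def defaults_for(needs_state: bool, min_security: str) -> list[str]:
    """Return default choices whose row matches the state need and security floor."""
    floor = SECURITY_ORDER.index(min_security)
    wanted_state = "Yes" if needs_state else "No"
    return [
        default
        for _, state, security, default in ROWS
        if state == wanted_state and SECURITY_ORDER.index(security) >= floor
    ]
```

For example, `defaults_for(needs_state=True, min_security="Very high")` returns only the enterprise unattended agent row's default.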
Solutions
These links point to product pages, docs, or canonical repos. The labels are workload roles, not benchmark rankings.
Small generated code, data transforms, notebook-like cells, and capability-limited functions inside the model loop.
Generated scripts, tests, batch jobs, data tasks, and benchmark episodes that should start clean and leave evidence.
Repo checkouts, package installs, services, browser/session state, snapshots, resume, and multi-hour coding work.
Task definitions, rollout fan-out, reward functions, raw logs, artifacts, traces, and reproducible scorekeeping.
Generated PR validation, ephemeral runners, provider routing, durable workflows, retries, and approval gates.
Promising or specialized providers to test under the same benchmark protocol before promoting into defaults.
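Two of the roles above, runs that start clean and leave evidence, and rollout fan-out with retries, can be sketched together using only the Python standard library. This is an illustration of the control flow, not an isolation mechanism: a temporary directory plus a subprocess stands in for a call to a real sandbox provider, and `run_episode`, `run_with_retry`, and `fan_out` are hypothetical names.

```python
import subprocess
import sys
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_episode(script: str, timeout_s: float = 10.0) -> dict:
    """Run one script in a fresh working directory and return its evidence."""
    # The directory is created empty and deleted on exit: a clean reset.
    with tempfile.TemporaryDirectory() as workdir:
        path = Path(workdir) / "task.py"
        path.write_text(script)
        try:
            proc = subprocess.run(
                [sys.executable, str(path)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,  # bounded lifetime
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "stdout": "", "stderr": "timed out", "files": []}
        # Record the names of files the script produced before cleanup.
        files = sorted(p.name for p in Path(workdir).iterdir() if p.name != "task.py")
        return {
            "ok": proc.returncode == 0,
            "stdout": proc.stdout,
            "stderr": proc.stderr,
            "files": files,
        }


def run_with_retry(script: str, attempts: int = 2) -> dict:
    """Retry a failing episode, each time in a new clean directory."""
    result = {}
    for attempt in range(1, attempts + 1):
        result = run_episode(script)
        result["attempt"] = attempt
        if result["ok"]:
            break
    return result


def fan_out(scripts: list[str], max_workers: int = 4) -> list[dict]:
    """Run episodes in parallel and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_with_retry, scripts))
```

A call such as `fan_out(["print('hi')", "import sys; sys.exit(1)"])` returns one result per script; the second is retried once and still reports `ok` as `False`. A real harness would copy artifacts out of the workspace rather than only listing their names.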
Costs
Pricing changes. Treat this as a pointer to public pricing surfaces, not a purchasing quote.
| Solution | Cost signal | Links |
|---|---|---|
| Crabbox | Open-source broker; spend comes from leased provider capacity and operations. | Docs, Repo |
| ComputeSDK | Abstraction layer; cost depends on chosen provider. Use benchmarks to compare cold start, throughput, and reliability. | Site, Benchmarks |
| E2B | Free/hobby entry plus usage; CPU and RAM metered per second. | Pricing |
| Modal | Plan plus compute usage; sandbox CPU/RAM billed by time used. | Pricing |
| Daytona | Usage-based vCPU, memory, storage, and GPU pricing. | Pricing |
| Runloop | Plan plus usage; devbox CPU, RAM, and storage billed separately. | Pricing |
| Islo | Usage pricing for CPU, memory, and storage. | Pricing |
| Tensorlake | Free tier plus usage-based on-demand and pro plans. | Pricing |
| exe.dev | Public VM plan; production usage is sales-led. | Pricing |
| Docker Sandboxes | Depends on Docker plan and feature availability. | Pricing |
| GitHub Actions / ARC | Hosted runner minutes for GitHub; ARC also carries Kubernetes infrastructure cost. | Billing |
| Temporal / workflow tools | Self-hosted infrastructure or managed cloud plan usage. | Temporal pricing |
Method
This page is AI-generated from public docs, public repos, public product pages, changelogs, and reproducible experiments.
Every benchmark should disclose prompt, agent/model, date, commit, harness, raw logs, and human edits.
Vendor claims are leads. Benchmark results need code, logs, artifacts, scoring rubrics, and failure notes.
Use agent-skill registries such as officialskills.sh as adjacent context; this page maps where those skills execute and how the execution is controlled.
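The disclosure rule above (prompt, agent/model, date, commit, harness, raw logs, and human edits) can be checked mechanically before a benchmark result is accepted. The field names and the `missing_disclosures` helper below are assumptions for illustration; the page does not define a schema.

```python
# Field names mirror the disclosure list in the Method section;
# the exact spelling is an assumption.
REQUIRED_FIELDS = (
    "prompt",
    "agent_model",
    "date",
    "commit",
    "harness",
    "raw_logs",
    "human_edits",
)


def missing_disclosures(record: dict) -> list[str]:
    """Return required fields that are absent or explicitly None.

    An empty value such as human_edits=[] counts as disclosed: it states
    that there were no human edits, which is different from saying nothing.
    """
    return [name for name in REQUIRED_FIELDS if record.get(name) is None]


# Placeholder record; every value here is a stand-in, not real data.
example_record = {
    "prompt": "<prompt text>",
    "agent_model": "<agent and model name>",
    "date": "<run date>",
    "commit": "<commit hash>",
    "harness": "<harness name>",
    "raw_logs": "<path or URL to logs>",
    "human_edits": [],
}
```

`missing_disclosures(example_record)` returns an empty list, while a record that omits `commit` and `raw_logs` returns `["commit", "raw_logs"]`, which makes it easy to reject a result that cannot be reproduced.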
Feedback
Send public evidence, benchmark ideas, pricing corrections, or inclusion requests on LinkedIn.