AI Sandbox Landscape

Choose the execution contract first.

For AI-generated code, the hard question is not which vendor to choose. It is what the agent may do, what resets, what persists, how work fans out, where secrets live, and what evidence comes back.

Updated May 16, 2026. Public sources only. AI-generated baseline.
Read path: 01 Route workload, 02 Runtime shapes, 03 Use cases, 04 Decision matrix, 05 Solution links, 06 Cost signals
Thesis: The sandbox is the mechanism. The contract is the decision: permissions, isolation, reset, state, scale, secrets, evidence, and stop conditions.
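
One way to make the thesis concrete is to write the contract down as data before any runtime is chosen. The sketch below is illustrative only: the class name `ExecutionContract`, its field names, the example values, and the `review` checks are assumptions made for this example, not a schema used by any product listed on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExecutionContract:
    """Hypothetical record of the eight contract dimensions named in the thesis."""
    permissions: frozenset               # capabilities the agent may use
    isolation: str                       # e.g. "wasm", "container", "microvm"
    reset: str                           # e.g. "per-task", "per-session", "never"
    state: str                           # e.g. "none", "snapshot", "persistent"
    max_parallel: int                    # scale: how far work may fan out
    secrets: str                         # e.g. "none", "proxy", "injected"
    evidence: tuple                      # artifacts that must come back
    stop_after_seconds: Optional[int]    # stop condition; None means unbounded


def review(contract):
    """Return warnings for combinations that deserve a second look."""
    warnings = []
    if contract.stop_after_seconds is None:
        warnings.append("no stop condition")
    if "network" in contract.permissions and contract.secrets == "injected":
        warnings.append("network access with secrets injected into the sandbox")
    if not contract.evidence:
        warnings.append("no evidence returned")
    return warnings


# Example: inline code in the model loop, with nothing exposed by default.
inline = ExecutionContract(
    permissions=frozenset(),
    isolation="wasm",
    reset="per-task",
    state="none",
    max_parallel=1,
    secrets="none",
    evidence=("stdout",),
    stop_after_seconds=5,
)
print(review(inline))  # []
```

The three checks are examples of the kind of question a contract review might ask; a real review would be specific to the workload.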

Execution atlas

Route the contract before picking a runtime.

Each path starts with code risk and ends with evidence. The useful question is not “which sandbox?” but which policy, reset model, state boundary, control loop, and artifact trail the workload needs.

Eval benchmark route: task definition, agent disclosure, clean environment, logs, artifacts, and a reproducible score.
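
The route above can be sketched with the Python standard library alone. This is a minimal illustration of the shape (clean directory, captured logs, hashed artifacts, a score), assuming a task that is a single Python script scored by exact stdout match. The function name `run_task` and the scoring rule are made up for this example. A temporary directory and a child process are not a security boundary; a real harness would run this inside one of the sandboxes discussed on this page.

```python
import hashlib
import subprocess
import sys
import tempfile
from pathlib import Path


def run_task(code, expected_stdout, timeout_s=10):
    """Run code in a fresh directory and return logs, artifact hashes, and a score."""
    # Clean environment: the directory is created empty and deleted on exit.
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "task.py"
        script.write_text(code)
        proc = subprocess.run(
            # -I: isolated mode (ignores PYTHON* env vars and user site-packages)
            [sys.executable, "-I", str(script)],
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout_s,   # raises subprocess.TimeoutExpired if exceeded
        )
        # Hash every file the task left behind, excluding the script itself.
        artifacts = {
            p.name: hashlib.sha256(p.read_bytes()).hexdigest()
            for p in sorted(Path(workdir).iterdir())
            if p.is_file() and p != script
        }
    return {
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "artifacts": artifacts,
        "score": 1.0 if proc.stdout.strip() == expected_stdout else 0.0,
    }


result = run_task("open('out.txt', 'w').write('hi')\nprint(6 * 7)", "42")
print(result["score"], sorted(result["artifacts"]))
```

Because the working directory is rebuilt for every call, two runs of the same task start from the same state, which is what makes the score reproducible.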

Runtime shapes

Four contract shapes, then vendors.

Classify the boundary first. Vendor choice is secondary to what the workload needs from isolation, reset, state, orchestration, and evidence.

01

Embedded micro runtime

Small generated programs inside your agent loop. Start with no filesystem, no network, and explicit host-provided functions.

Examples: Monty / WASM / model-native code tools

02

Ephemeral task sandbox

Run generated Python, notebooks, tests, or one benchmark task with clean reset, logs, files, and bounded lifetime.

Examples: E2B / Modal / model code interpreters

03

Persistent agent workspace

Give the agent a real computer: repo checkout, packages, services, browser state, partial work, snapshots, and resume.

Examples: Daytona / Runloop / Islo / Docker / exe.dev

04

Control and eval plane

Fan out rollouts, hydrate repos, reset environments, capture evidence, score tasks, retry failures, and gate risky actions.

Examples: Harbor / Temporal / ARC / Crabbox / Tensorlake
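
The "reset, capture evidence, retry failures" part of a control plane reduces to a small loop. The sketch below is a generic illustration, not the API of Harbor, Temporal, or any other tool named here; `run_with_retries`, `task`, and `reset` are hypothetical names.

```python
def run_with_retries(task, reset, max_attempts=3):
    """Reset the environment before each attempt and keep evidence from all of them."""
    evidence = []
    for attempt in range(1, max_attempts + 1):
        reset()  # every attempt starts from a known state
        try:
            result = task()
        except Exception as exc:
            # Failed attempts are evidence too, so record them instead of discarding.
            evidence.append({"attempt": attempt, "ok": False, "error": repr(exc)})
            continue
        evidence.append({"attempt": attempt, "ok": True, "result": result})
        return result, evidence
    raise RuntimeError(f"failed after {max_attempts} attempts: {evidence}")
```

Resetting before every attempt, not only after a failure, keeps a retry from inheriting state left by the attempt that failed.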

Use cases

Map workload to contract shape.

“Sandbox” is overloaded. Inline code, evals, RL episodes, PR validation, long agent tasks, and production-generated code need different contracts.

01

Inline code in the model loop

Use Monty, WASM, or model-native code tools when generated code should only call capabilities you explicitly expose.
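
To illustrate "only call capabilities you explicitly expose", here is a toy evaluator built on Python's `ast` module. It accepts arithmetic plus calls to functions the host passes in, and rejects everything else. It is a teaching sketch, not a hardened sandbox and not how Monty or any WASM runtime works internally; `run_expression` and `price_of` are hypothetical names.

```python
import ast
import operator

_BINOPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def run_expression(source, host_functions):
    """Evaluate an arithmetic expression that may call only host_functions."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINOPS:
            return _BINOPS[type(node.op)](ev(node.left), ev(node.right))
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in host_functions
                and not node.keywords):
            return host_functions[node.func.id](*[ev(arg) for arg in node.args])
        # Names, attributes, imports, subscripts, and so on are all refused.
        raise PermissionError(f"not allowed: {type(node).__name__}")

    return ev(ast.parse(source, mode="eval"))


# The host decides what exists. There is no filesystem or network to reach.
host = {"price_of": lambda sku: {1: 2.5, 2: 4.0}[sku]}
print(run_expression("price_of(1) * 3 + price_of(2)", host))  # 11.5
```

An expression such as `open('/etc/passwd')` raises `PermissionError` here because `open` was never exposed, which is the default-deny posture this contract shape starts from.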

02

Agent evals and benchmarks

Use Harbor with Docker, Modal, Daytona, or E2B when every run needs a task, reset policy, agent/model disclosure, logs, and a score.

03

RL and post-training rollouts

Use Modal plus Harbor/TRL-style reward loops when you need many parallel episodes, deterministic resets, and reward artifacts.
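
"Many parallel episodes, deterministic resets, and reward artifacts" can be shown in miniature without any vendor SDK. In this sketch the environment is a toy random walk and the reward rule is invented; `run_episode` and `rollout` are hypothetical names. The point is only that each episode rebuilds its state from a seed, so fan-out does not change results.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def run_episode(seed):
    """Toy episode: all state is rebuilt from the seed, so a reset is deterministic."""
    rng = random.Random(seed)  # private generator; nothing shared across episodes
    position, reward = 0, 0.0
    for _ in range(20):
        position += rng.choice((-1, 1))
        reward += 1.0 if position > 0 else 0.0
    # The returned dict is the reward artifact for this episode.
    return {"seed": seed, "final_position": position, "reward": reward}


def rollout(seeds, max_workers=4):
    """Fan episodes out across workers; results come back in seed order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_episode, seeds))


first = rollout(range(8))
second = rollout(range(8))
print(first == second)  # True: same seeds, same episodes, regardless of scheduling
```

In a real setup each worker would be a remote sandbox instead of a thread, but the contract is the same: the seed and task definition fully determine the starting state.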

04

PR validation for generated code

Use GitHub Actions, ARC, Docker Sandboxes, Daytona, or Crabbox when an AI patch needs tests without touching shared runners or host Docker.

05

Long-running coding agents

Use Daytona, Runloop, Islo, Docker Sandboxes, Tensorlake, or exe.dev when repo state, services, caches, logs, and partial work must survive.
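
What "partial work must survive" means can be shown with a checkpoint file. The sketch below is an assumption-laden miniature: a persistent workspace snapshots a whole machine, while this example only persists a list of completed steps. `save_checkpoint`, `load_checkpoint`, and `work` are hypothetical names.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path, state):
    """Write state atomically so a crash never leaves a half-written checkpoint."""
    path = Path(path)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic when source and target share a filesystem


def load_checkpoint(path, default):
    """Return the saved state, or default when no checkpoint exists yet."""
    path = Path(path)
    return json.loads(path.read_text()) if path.exists() else default


def work(path, steps):
    """Resume from the last completed step instead of starting over."""
    state = load_checkpoint(path, {"done": []})
    for step in steps:
        if step in state["done"]:
            continue                      # finished in an earlier session
        state["done"].append(step)        # stand-in for the real work
        save_checkpoint(path, state)
    return state
```

Calling `work` a second time with a longer step list only performs the new steps, which is the resume behavior a persistent workspace is expected to provide at the level of repos, services, and caches.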

06

Production sandbox for generated code

Use microVM/VM-backed sandboxes with network policy, credential proxying, audit logs, timeouts, kill switches, and human approval gates.
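
The policy layer around a production sandbox (network policy, kill switch, approval gate, audit log) can be sketched as a single decision function. Everything specific here is a placeholder chosen for illustration: the allowlisted hosts, the set of risky actions, the decision strings, and the name `gate`. Real enforcement belongs at the sandbox and network boundary, not in code the agent can reach.

```python
import time
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"pypi.org", "files.pythonhosted.org"}   # hypothetical egress allowlist
NEEDS_APPROVAL = {"deploy", "delete", "rotate_secret"}   # hypothetical risky actions


def gate(action, target, approved_by=None, audit_log=None, kill_switch=False):
    """Decide whether an action may run and record the decision."""
    if kill_switch:
        decision = "deny: kill switch engaged"
    elif action == "http_request" and urlsplit(target).hostname not in ALLOWED_HOSTS:
        decision = "deny: host not in allowlist"
    elif action in NEEDS_APPROVAL and approved_by is None:
        decision = "hold: human approval required"
    else:
        decision = "allow"
    # Denials and holds are logged as well as allows.
    if audit_log is not None:
        audit_log.append({
            "ts": time.time(),
            "action": action,
            "target": target,
            "approved_by": approved_by,
            "decision": decision,
        })
    return decision


log = []
print(gate("http_request", "https://pypi.org/simple/", audit_log=log))   # allow
print(gate("deploy", "prod", audit_log=log))                             # hold: human approval required
```

The order of the checks matters: the kill switch wins over everything, and an approval never overrides a network denial.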

Decision matrix

Workload to default contract.

| Workload | State | Scale | Security | Default choice |
| --- | --- | --- | --- | --- |
| Inline generated code in model loop | No | Low | Medium | Monty / WASM / model-native code execution |
| Generated script / data analysis | Low | Medium | Medium | E2B / Modal / Anthropic / OpenAI / Gemini / AWS code tools |
| Agent eval harness | No / medium | High | Medium / high | Harbor + Modal / Daytona |
| SWE-bench / Terminal-Bench | Medium | High | High | Harbor + Daytona / Modal / E2B |
| GitHub Actions ordinary CI | No | Medium | Medium | GitHub-hosted runners |
| GitHub Actions private infra | No | High | High | ARC ephemeral self-hosted runners |
| AI-generated PR validation | No / medium | Medium | High | ARC + Docker Sandboxes / Daytona / Crabbox |
| Long-running coding agent | Yes | Medium | High | Daytona / Runloop / Islo / exe.dev |
| Enterprise unattended agent | Yes | Medium | Very high | Islo / Runloop / Docker Sandboxes |
| Post-training RL rollouts | Usually no | Very high | Medium / high | Modal + Harbor |
| Production sandbox for generated code | No / bounded | Medium / high | Very high | Docker Sandboxes / Islo / Runloop / Daytona + approvals |
| Stateful RL environments | Yes | High | High | Tensorlake / Daytona / Runloop / Fly Sprites |
| Auto-healing agent workflow | Yes | Medium / high | High | Temporal + sandbox backend |
| Auto-spawning experiments | Snapshot | Very high | Medium / high | Modal / Harbor / Daytona / Tensorlake / Crabbox |
| Browser / research agents | Medium | Medium | High | E2B Desktop / Daytona / Tensorlake / browser-specific sandbox |
| Production-adjacent actions | Yes | Medium | Very high | Islo / Runloop / Docker Sandboxes + approvals |

Solutions

Public links by execution role.

These links point to product pages, docs, or canonical repos. The labels are workload roles, not benchmark rankings.

Tier 3

Eval, RL, and benchmark planes

Task definitions, rollout fan-out, reward functions, raw logs, artifacts, traces, and reproducible scorekeeping.

Costs

Public cost signals.

Pricing changes. Treat this as a pointer to public pricing surfaces, not a purchasing quote.

| Solution | Cost signal | Links |
| --- | --- | --- |
| Crabbox | Open-source broker; spend comes from leased provider capacity and operations. | Docs, Repo |
| ComputeSDK | Abstraction layer; cost depends on chosen provider. Use benchmarks to compare cold start, throughput, and reliability. | Site, Benchmarks |
| E2B | Free/hobby entry plus usage; CPU and RAM metered per second. | Pricing |
| Modal | Plan plus compute usage; sandbox CPU/RAM billed by time used. | Pricing |
| Daytona | Usage-based vCPU, memory, storage, and GPU pricing. | Pricing |
| Runloop | Plan plus usage; devbox CPU, RAM, and storage billed separately. | Pricing |
| Islo | Usage pricing for CPU, memory, and storage. | Pricing |
| Tensorlake | Free tier plus usage-based on-demand and pro plans. | Pricing |
| exe.dev | Public VM plan; production usage is sales-led. | Pricing |
| Docker Sandboxes | Depends on Docker plan and feature availability. | Pricing |
| GitHub Actions / ARC | Hosted runner minutes for GitHub; ARC also carries Kubernetes infrastructure cost. | Billing |
| Temporal / workflow tools | Self-hosted infrastructure or managed cloud plan usage. | Temporal pricing |
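
For the usage-metered rows, a back-of-the-envelope estimate is seconds times resources times rate. The function below shows only that arithmetic. The rates in the example are invented placeholders, not any vendor's prices; substitute figures from the public pricing pages before drawing conclusions. Plan fees, storage, GPU, and egress are ignored.

```python
def estimate_cost(seconds, vcpus, gib_ram, vcpu_rate_per_s, gib_rate_per_s, runs=1):
    """Cost of `runs` identical sandboxes billed per second for CPU and RAM."""
    per_run = seconds * (vcpus * vcpu_rate_per_s + gib_ram * gib_rate_per_s)
    return per_run * runs


# Placeholder rates for illustration only (currency units per second).
cost = estimate_cost(
    seconds=120, vcpus=2, gib_ram=4,
    vcpu_rate_per_s=0.00002, gib_rate_per_s=0.000005,
    runs=1000,
)
print(round(cost, 2))  # 7.2
```

At rollout scale, the `runs` multiplier usually dominates, so per-task lifetime and cold-start overhead matter more than small differences in the per-second rate.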

Method

How this should evolve.

Public information only

This page is AI-generated from public docs, public repos, public product pages, changelogs, and reproducible experiments.

Agent disclosure

Every benchmark should disclose prompt, agent/model, date, commit, harness, raw logs, and human edits.
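
A disclosure rule like this is easy to enforce mechanically. The sketch below checks a manifest for the seven items listed above; the key names, the manifest layout, and the function `missing_disclosures` are assumptions for illustration, not an established format.

```python
REQUIRED_FIELDS = (
    "prompt", "agent_model", "date", "commit", "harness", "raw_logs", "human_edits",
)


def missing_disclosures(manifest):
    """Return the required fields that are absent or empty in a benchmark manifest."""
    return [f for f in REQUIRED_FIELDS if manifest.get(f) in (None, "", [], {})]


# All values below are placeholders.
manifest = {
    "prompt": "prompts/task.md",
    "agent_model": "example-agent / example-model",
    "date": "2026-05-16",
    "commit": "",                       # left empty on purpose
    "harness": "example-harness",
    "raw_logs": "logs/run-001.jsonl",
    "human_edits": "none",
}
print(missing_disclosures(manifest))  # ['commit']
```

Note that "no human edits" still has to be stated explicitly; an empty field counts as undisclosed.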

Evidence over positioning

Vendor claims are leads. Benchmark results need code, logs, artifacts, scoring rubrics, and failure notes.

Skills sit above runtimes

Use agent-skill registries such as officialskills.sh as adjacent context; this page maps where those skills execute and how the execution is controlled.

Feedback

Want to be considered, or want to correct something?

Send public evidence, benchmark ideas, pricing corrections, or inclusion requests on LinkedIn.
