Score Repo2RLEnv pr_diff tasks on a remote islo.dev sandbox via crabbox. ~50s end-to-end per task. Parallel batch over a whole dataset with one flag.
Repo2RLEnv turns any GitHub repo into a verifiable RL training/eval dataset, emitted in Harbor's task spec. Harbor's built-in runners are {docker, modal, daytona, e2b, runloop}. Crabbox is a separate control plane that fronts ~15 more providers — AWS, Azure, GCP, Hetzner, Proxmox, islo.dev, …
This is the small adapter (one Python file, ~250 lines) that runs Repo2RLEnv's pr_diff tasks on any crabbox provider. Source: examples/crabbox/runner.py in PR #53.
repo2rlenv pull AdithyaSK/repo2rlenv-pr-diff ./datasets/pr-diff # 161 pr_diff tasks across ripgrep · click · chalk · axios · ...
python3 examples/crabbox/runner.py ./datasets/pr-diff/pallets__click-3466
[crabbox] default/pallets__click-3466 provider=islo image=python:3.12-slim
leased isb_crabbox-tmp-...-cafe98 slug=violet-crayfish provider=islo
sync candidate: 5 files, 24.5 KiB · sync complete in 1.014s
+ git clone --filter=blob:none https://github.com/pallets/click.git /repo
+ git reset --hard b8b9ffeb5d012f5d041685c81152636bf596cf72
+ git apply --whitespace=nowarn /workspace/task/agent.diff
+ bash /tmp/test.sh
{
  "final_reward": 1.0,
  "components": {
    "format_valid": 1.0,
    "size_sanity": 1.0,
    "file_targeting": 1.0,
    "region_overlap": 1.0,
    "similarity": 1.0,
    "llm_judge": null
  }
}
islo run summary sync=1.014s command=4.888s total=51.336s exit=0
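The `+`-prefixed lines in the log suggest the sequence that runs inside the sandbox: clone the repo, pin it to the base commit, apply the agent's patch, then run the verifier. A minimal sketch of assembling such a script is below. `build_remote_script` and its parameters are hypothetical names chosen for illustration, not taken from runner.py; `set -x` and the `cd /repo` step are assumptions inferred from the trace, and error handling (for example, how an empty or non-applying patch ends up scored) is left out.

```python
import shlex


def build_remote_script(
    repo_url: str,
    base_commit: str,
    patch_path: str = "/workspace/task/agent.diff",
    test_script: str = "/tmp/test.sh",
) -> str:
    """Return a bash script mirroring the traced steps in the log above."""
    q = shlex.quote  # quote every interpolated value for the shell
    steps = [
        "set -x",  # assumption: produces the '+ cmd' trace lines
        f"git clone --filter=blob:none {q(repo_url)} /repo",
        "cd /repo",  # assumption: not visible in the log
        f"git reset --hard {q(base_commit)}",
        f"git apply --whitespace=nowarn {q(patch_path)}",
        f"bash {q(test_script)}",
    ]
    return "\n".join(steps) + "\n"


if __name__ == "__main__":
    print(
        build_remote_script(
            "https://github.com/pallets/click.git",
            "b8b9ffeb5d012f5d041685c81152636bf596cf72",
        )
    )
```

Quoting each value with `shlex.quote` keeps the script safe if a task ever carries a path or URL containing spaces or shell metacharacters.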
: > /tmp/empty.diff   # empty patch
python3 examples/crabbox/runner.py ./datasets/pr-diff/pallets__click-3466 \
  --agent-patch /tmp/empty.diff \
  --reward-out /tmp/empty-reward.json
{
"final_reward": 0.0,
"components": {
"format_valid": 0.0,
"size_sanity": 0.0,
"file_targeting": 0.0,
"region_overlap": 0.0,
"similarity": 0.0,
"llm_judge": null
}
}
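A downstream script would typically read the file written by `--reward-out` and decide what to do with it. The sketch below parses the JSON shape shown above and lists which components fell short. `summarize_reward` is a hypothetical helper, and treating a `null` component as "not evaluated" is an assumption; nothing here claims to reproduce how `final_reward` is derived from the components.

```python
import json

# The empty-patch reward shown above, inlined so the sketch is self-contained.
EMPTY_PATCH_REWARD = """
{
  "final_reward": 0.0,
  "components": {
    "format_valid": 0.0,
    "size_sanity": 0.0,
    "file_targeting": 0.0,
    "region_overlap": 0.0,
    "similarity": 0.0,
    "llm_judge": null
  }
}
"""


def summarize_reward(text: str) -> dict:
    """Split reward components into failing and skipped (null) groups."""
    reward = json.loads(text)
    components = reward["components"]
    # Assumption: a null component was not evaluated for this run.
    skipped = sorted(k for k, v in components.items() if v is None)
    failing = sorted(
        k for k, v in components.items() if v is not None and v < 1.0
    )
    return {
        "final_reward": float(reward["final_reward"]),
        "passed": reward["final_reward"] == 1.0,
        "failing": failing,
        "skipped": skipped,
    }


if __name__ == "__main__":
    print(summarize_reward(EMPTY_PATCH_REWARD))
```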
python3 examples/crabbox/runner.py --all ./datasets/pr-diff -j 3
[batch] 3 tasks; j=3; provider=islo
[batch] pallets__click-3444 reward= 1.0 ok
[batch] pallets__click-3466 reward= 1.0 ok
[batch] chalk__chalk-541 reward= 1.0 ok
[batch] wrote /Users/.../mini-dataset/rewards.csv

real 0m56.774s
# mini-dataset/rewards.csv
task,final_reward,status
chalk__chalk-541,1.0,ok
pallets__click-3444,1.0,ok
pallets__click-3466,1.0,ok
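The batch mode above can be pictured as a bounded worker pool that scores tasks concurrently and then writes one CSV row per task, sorted by task name as in the sample. This is a sketch under assumptions, not the code in runner.py: `score_batch`, `write_rewards_csv`, and the `score_one` callback are hypothetical names, threads are assumed to be adequate because each task mostly waits on a remote sandbox, and the `error` status is invented for illustration (the sample output only shows `ok`).

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def score_batch(tasks, score_one, jobs=3):
    """Score tasks with at most `jobs` running at once; return sorted rows."""

    def safe(task):
        try:
            return (task, score_one(task), "ok")
        except Exception as exc:  # assumption: one failure must not stop the batch
            return (task, "", f"error: {type(exc).__name__}")

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        rows = list(pool.map(safe, tasks))
    return sorted(rows, key=lambda row: row[0])


def write_rewards_csv(rows, fh):
    """Write rows using the header shown in the sample rewards.csv."""
    writer = csv.writer(fh, lineterminator="\n")
    writer.writerow(["task", "final_reward", "status"])
    writer.writerows(rows)


if __name__ == "__main__":
    demo_tasks = ["pallets__click-3444", "pallets__click-3466", "chalk__chalk-541"]
    buf = io.StringIO()
    # A stub scorer stands in for the real per-task sandbox run.
    write_rewards_csv(score_batch(demo_tasks, lambda task: 1.0, jobs=3), buf)
    print(buf.getvalue(), end="")
```

With `-j 3` and three tasks, every task gets its own worker, which is consistent with the batch finishing in roughly the time of a single run.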
Each crabbox provider names its image flag differently: islo and modal use --<p>-image, e2b uses --e2b-template, daytona uses --daytona-snapshot, and local-container uses --local-container-image together with a -work-root flag. The wrapper carries an explicit PROVIDER_CONFIG table so that the right flag is selected per provider. In the table below, "unit" marks providers covered by the unit tests and "live" marks providers verified against a real sandbox:
| --provider | sandbox kind | image flag | unit | live |
|---|---|---|---|---|
| islo | islo.dev sandbox | --islo-image | ✓ | ✓ |
| e2b | E2B template | --e2b-template | ✓ | — |
| modal | Modal Image | --modal-image | ✓ | — |
| daytona | Daytona snapshot | --daytona-snapshot | ✓ | — |
| local-container | local Docker | --local-container-image | ✓ | — |
| docker | alias for local-container | --local-container-image | ✓ | — |
| namespace-devbox | Namespace Devbox | --namespace-image | ✓ | — |
| tensorlake | Tensorlake | --tensorlake-image | ✓ | — |
python3 runner.py <task> --provider islo             # default · live-verified
python3 runner.py <task> --provider e2b              # --e2b-template instead of --e2b-image
python3 runner.py <task> --provider daytona          # --daytona-snapshot, --daytona-work-root
python3 runner.py <task> --provider local-container  # laptop Docker, no cloud
Pass an unsupported provider and the wrapper fails fast with a list of supported names and a note on why VM providers (aws / hetzner / gcp / …) need a separate adapter:
ValueError: provider 'aws' not supported by this example.
Container-style providers: daytona, docker, e2b, islo, local-container,
modal, namespace-devbox, tensorlake.
For VM providers (aws, azure, gcp, hetzner, proxmox, ssh) the sandbox
needs a pre-baked image with python+git; not handled here.
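The provider table and the fail-fast error above could be backed by a lookup along these lines. This is a sketch under assumptions: the real PROVIDER_CONFIG in runner.py may hold more than the image flag (for example work-root flags), and `image_args` is a hypothetical helper name. Only the flag names and the error wording come from the text above.

```python
# Image flag per supported provider, copied from the table above.
PROVIDER_CONFIG = {
    "islo": "--islo-image",
    "e2b": "--e2b-template",
    "modal": "--modal-image",
    "daytona": "--daytona-snapshot",
    "local-container": "--local-container-image",
    "docker": "--local-container-image",  # alias for local-container
    "namespace-devbox": "--namespace-image",
    "tensorlake": "--tensorlake-image",
}

VM_PROVIDERS = ("aws", "azure", "gcp", "hetzner", "proxmox", "ssh")


def image_args(provider: str, image: str) -> list:
    """Return the [flag, value] pair that selects the sandbox image."""
    try:
        flag = PROVIDER_CONFIG[provider]
    except KeyError:
        supported = ", ".join(sorted(PROVIDER_CONFIG))
        raise ValueError(
            f"provider {provider!r} not supported by this example.\n"
            f"Container-style providers: {supported}.\n"
            f"For VM providers ({', '.join(VM_PROVIDERS)}) the sandbox\n"
            "needs a pre-baked image with python+git; not handled here."
        ) from None
    return [flag, image]


if __name__ == "__main__":
    print(image_args("e2b", "my-template"))  # placeholder template name
```

Keeping the mapping explicit, rather than deriving `--<p>-image` from the provider name, is what lets e2b and daytona use differently named flags without special cases in the calling code.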
tests/test_examples_crabbox.py exercises the wrapper in two layers:
$ uv run pytest tests/test_examples_crabbox.py -v
test_provider_config_covers_known_container_providers PASSED
test_unsupported_provider_raises_with_helpful_message PASSED
test_task_load_extracts_metadata PASSED
test_extract_verifier_recovers_three_files PASSED
test_run_task_builds_correct_command_for_each_provider[islo] PASSED
test_run_task_builds_correct_command_for_each_provider[e2b] PASSED
test_run_task_builds_correct_command_for_each_provider[modal] PASSED
test_run_task_builds_correct_command_for_each_provider[daytona] PASSED
test_run_task_builds_correct_command_for_each_provider[local-…] PASSED
test_run_task_builds_correct_command_for_each_provider[docker] PASSED
test_run_task_builds_correct_command_for_each_provider[namespace…] PASSED
test_run_task_builds_correct_command_for_each_provider[tensorlake] PASSED
test_run_task_rejects_unsupported_provider PASSED
test_run_task_writes_keep_flag_when_requested PASSED
test_run_task_raises_when_no_sentinel_in_stdout PASSED
test_live_islo_oracle_scores_one SKIPPED (ISLO_API_KEY not set — live islo.dev smoke is opt-in)
======================== 15 passed, 1 skipped in 1.07s ========================
Unit tests (always run in CI) monkeypatch subprocess.run and assert the exact crabbox argv built for every supported provider, so they need neither network access nor the crabbox binary. The live islo smoke test pulls pallets__click-3466 from the public reference dataset, scores the oracle, and asserts final_reward == 1.0. It runs only when ISLO_API_KEY is set and crabbox is on PATH:
ISLO_API_KEY=ak_... uv run pytest tests/test_examples_crabbox.py \
  -k live_islo -v
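To illustrate the two layers, here is a stdlib-only sketch of the patching technique and of the opt-in gate. It does not reproduce the tests in tests/test_examples_crabbox.py: `launch`, `live_smoke_enabled`, and the argv layout (including the `run` subcommand) are placeholders assumed for the example, not the wrapper's actual command line.

```python
import os
import shutil
import subprocess
from unittest import mock


def launch(provider: str, image_flag: str, image: str) -> int:
    """Hypothetical stand-in for the wrapper's launch step."""
    # Assumption: placeholder argv, not the real crabbox invocation.
    argv = ["crabbox", "run", "--provider", provider, image_flag, image]
    return subprocess.run(argv, check=False).returncode


def test_launch_builds_expected_argv() -> None:
    """Unit layer: replace subprocess.run so nothing is executed."""
    with mock.patch("subprocess.run") as fake_run:
        fake_run.return_value = subprocess.CompletedProcess(args=[], returncode=0)
        assert launch("e2b", "--e2b-template", "my-template") == 0
    (argv,), kwargs = fake_run.call_args
    assert argv == ["crabbox", "run", "--provider", "e2b", "--e2b-template", "my-template"]
    assert kwargs == {"check": False}


def live_smoke_enabled(environ=None) -> bool:
    """Live layer gate: needs an API key and the crabbox binary on PATH."""
    environ = os.environ if environ is None else environ
    return bool(environ.get("ISLO_API_KEY")) and shutil.which("crabbox") is not None


if __name__ == "__main__":
    test_launch_builds_expected_argv()
    print("live smoke enabled:", live_smoke_enabled())
```

Asserting on the full argv list, rather than on a substring, is what catches a provider silently receiving the wrong image flag.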
The adapter targets pr_diff tasks such as those in AdithyaSK/repo2rlenv-pr-diff (161 tasks). Pipelines like pr_runtime and commit_runtime build a per-repo Docker image at repo2rlenv bootstrap time and need docker-in-docker; wiring those through crabbox is a follow-up.

Repo2RLEnv's stated focus is synthesis; running is delegated to Harbor's stack. Crabbox extends the runtime reach without modifying either project: Harbor for the in-stack runners, crabbox for the providers Harbor doesn't ship a backend for. Generate once, run anywhere.