Repo2RLEnv × crabbox × islo.dev

Score Repo2RLEnv pr_diff tasks on a remote islo.dev sandbox via crabbox. ~50s end-to-end per task. Parallel batch over a whole dataset with one flag.

● live · verified May 2026 · PR #53 · huggingface/Repo2RLEnv · openclaw/crabbox · islo.dev

What this is

Repo2RLEnv turns any GitHub repo into a verifiable RL training/eval dataset, emitted in Harbor's task spec. Harbor's built-in runners are {docker, modal, daytona, e2b, runloop}. Crabbox is a separate control plane that fronts ~15 more providers — AWS, Azure, GCP, Hetzner, Proxmox, islo.dev, …

This is the small adapter (one Python file, ~250 lines) that runs Repo2RLEnv's pr_diff tasks on any crabbox provider. Source: examples/crabbox/runner.py in PR #53.

Pipeline

host                                    islo.dev sandbox
┌─ runner.py ──────────────────┐  sync  ┌─ python:3.12-slim ─────────┐
│ parse task.toml              │ ─────► │ apt-get install git        │
│ extract verifier from        │        │ git clone <repo> /repo     │
│   environment/Dockerfile     │        │ git reset --hard <ref>     │
│ stage agent.diff +           │        │ git apply agent.diff       │
│   verifier/* + test.sh       │        │ bash tests/test.sh         │
│ git init (crabbox needs it)  │ ◄───── │ echo SENTINEL              │
│ crabbox run --provider islo  │ stdout │ cat /logs/verifier/        │
│ parse reward.json from pipe  │        │   reward.json              │
└──────────────────────────────┘        └────────────────────────────┘
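
The last two sandbox steps, echo SENTINEL followed by cat of reward.json, let the host pick the reward out of a stdout stream that may also carry other output. A minimal sketch of that parsing step follows. The marker string and the function name parse_reward are hypothetical; the actual sentinel used by runner.py is not shown on this page.

```python
import json

# Hypothetical marker; the real sentinel string in runner.py may differ.
SENTINEL = "<<<REWARD_JSON>>>"


def parse_reward(stdout: str, sentinel: str = SENTINEL) -> dict:
    """Return the first JSON object printed after the last sentinel."""
    marker = stdout.rfind(sentinel)
    if marker == -1:
        raise RuntimeError("sentinel not found in sandbox stdout")
    start = stdout.find("{", marker + len(sentinel))
    if start == -1:
        raise RuntimeError("no JSON object found after the sentinel")
    # raw_decode stops at the end of the object, so trailing log lines
    # (for example a run summary) are ignored.
    reward, _end = json.JSONDecoder().raw_decode(stdout, start)
    return reward
```

Searching for the last occurrence of the marker keeps the parser robust if an earlier line happens to echo the same string.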

Live runs · pallets/click on islo.dev

Single task · oracle: reward 1.0 · ~51s wall · expected ≈ 1.0
Single task · empty diff: reward 0.0 · ~50s · agent gave up → baseline
Batch · 3 tasks, j=3: reward 1.0 · ~57s wall (3× parallel)

1 · pull a published dataset

repo2rlenv pull AdithyaSK/repo2rlenv-pr-diff ./datasets/pr-diff
# 161 pr_diff tasks across ripgrep · click · chalk · axios · ...

2 · score the oracle diff (sanity check)

python3 examples/crabbox/runner.py ./datasets/pr-diff/pallets__click-3466
[crabbox] default/pallets__click-3466 provider=islo image=python:3.12-slim
leased isb_crabbox-tmp-...-cafe98 slug=violet-crayfish provider=islo
sync candidate: 5 files, 24.5 KiB  ·  sync complete in 1.014s
+ git clone --filter=blob:none https://github.com/pallets/click.git /repo
+ git reset --hard b8b9ffeb5d012f5d041685c81152636bf596cf72
+ git apply --whitespace=nowarn /workspace/task/agent.diff
+ bash /tmp/test.sh
{
  "final_reward": 1.0,
  "components": {
    "format_valid":   1.0,
    "size_sanity":    1.0,
    "file_targeting": 1.0,
    "region_overlap": 1.0,
    "similarity":     1.0,
    "llm_judge":      null
  }
}
islo run summary  sync=1.014s  command=4.888s  total=51.336s  exit=0
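
The traced commands in the log above suggest the shape of the script the sandbox runs. A rough sketch of assembling it on the host follows. The function name build_remote_script, the marker string, the set -x line, and the cd step are assumptions, and the step that places the verifier files under /tmp and /logs is left out.

```python
import shlex

# Hypothetical marker; the real sentinel string in runner.py may differ.
SENTINEL = "<<<REWARD_JSON>>>"


def build_remote_script(repo_url: str, ref: str, sentinel: str = SENTINEL) -> str:
    """Assemble the shell script run inside the sandbox (sketch only)."""
    q = shlex.quote  # quote every value that comes from task metadata
    steps = [
        "set -x",  # assumed: produces the '+ cmd' trace lines seen in the log
        "apt-get update -qq && apt-get install -y -qq git",
        f"git clone --filter=blob:none {q(repo_url)} /repo",
        "cd /repo",
        f"git reset --hard {q(ref)}",
        "git apply --whitespace=nowarn /workspace/task/agent.diff",
        "bash /tmp/test.sh",
        f"echo {q(sentinel)}",
        "cat /logs/verifier/reward.json",
    ]
    return "\n".join(steps)
```

The sketch deliberately omits set -e: a patch that fails to apply should still reach the verifier and produce a reward rather than abort the run. Whether the real runner does the same is not stated here.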

3 · score an empty agent diff (agent gave up)

: > /tmp/empty.diff                              # empty patch
python3 examples/crabbox/runner.py ./datasets/pr-diff/pallets__click-3466 \
  --agent-patch /tmp/empty.diff \
  --reward-out  /tmp/empty-reward.json
{
  "final_reward": 0.0,
  "components": {
    "format_valid":   0.0,
    "size_sanity":    0.0,
    "file_targeting": 0.0,
    "region_overlap": 0.0,
    "similarity":     0.0,
    "llm_judge":      null
  }
}

4 · whole dataset, parallel (3 tasks, j=3)

# the log below comes from a 3-task subset (mini-dataset), not the full 161-task pull
python3 examples/crabbox/runner.py --all ./datasets/pr-diff -j 3
[batch] 3 tasks; j=3; provider=islo
[batch] pallets__click-3444                  reward=   1.0  ok
[batch] pallets__click-3466                  reward=   1.0  ok
[batch] chalk__chalk-541                     reward=   1.0  ok
[batch] wrote /Users/.../mini-dataset/rewards.csv

real    0m56.774s
# mini-dataset/rewards.csv
task,final_reward,status
chalk__chalk-541,1.0,ok
pallets__click-3444,1.0,ok
pallets__click-3466,1.0,ok
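
The batch mode can be pictured as a thread pool that scores tasks concurrently and then writes one CSV row per task. The sketch below makes several assumptions: score_all and score_task are hypothetical names, score_task stands in for the single-task path, and the "error" status is invented for illustration (the log above only shows "ok").

```python
import csv
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable


def score_all(task_dirs: Iterable[Path], score_task: Callable[[Path], float],
              jobs: int, out_csv: Path) -> list[tuple[str, str, str]]:
    """Score tasks in parallel and write task,final_reward,status rows."""

    def safe(task_dir: Path) -> tuple[str, str, str]:
        # One failing task should not abort the whole batch.
        try:
            return task_dir.name, str(score_task(task_dir)), "ok"
        except Exception:
            return task_dir.name, "", "error"

    # Threads are enough here: each task mostly waits on a remote sandbox.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        rows = sorted(pool.map(safe, task_dirs), key=lambda row: row[0])

    with open(out_csv, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["task", "final_reward", "status"])
        writer.writerows(rows)
    return rows
```

Sorting by task name before writing matches the ordering of the rewards.csv shown above and keeps the file stable across runs regardless of completion order.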

One flag, eight providers

Each crabbox provider names its image flag differently: islo and modal use --<p>-image, e2b uses --e2b-template, daytona uses --daytona-snapshot, and local-container uses --local-container-image; daytona and local-container are each paired with a -work-root flag (for example --daytona-work-root). The wrapper carries an explicit PROVIDER_CONFIG table so the right flag is selected per provider:

--provider        sandbox kind               image flag                unit  live
islo              islo.dev sandbox           --islo-image              yes   yes
e2b               E2B template               --e2b-template            yes
modal             Modal Image                --modal-image             yes
daytona           Daytona snapshot           --daytona-snapshot        yes
local-container   local Docker               --local-container-image   yes
docker            alias for local-container  --local-container-image   yes
namespace-devbox  Namespace Devbox           --namespace-image         yes
tensorlake        Tensorlake                 --tensorlake-image        yes

python3 runner.py <task> --provider islo            # default · live-verified
python3 runner.py <task> --provider e2b             # --e2b-template instead of --e2b-image
python3 runner.py <task> --provider daytona         # --daytona-snapshot, --daytona-work-root
python3 runner.py <task> --provider local-container # laptop Docker — no cloud

Pass an unsupported provider and the wrapper fails fast with a list of supported names and a note on why VM providers (aws / hetzner / gcp / …) need a separate adapter:

ValueError: provider 'aws' not supported by this example.
  Container-style providers: daytona, docker, e2b, islo, local-container,
                             modal, namespace-devbox, tensorlake.
  For VM providers (aws, azure, gcp, hetzner, proxmox, ssh) the sandbox
  needs a pre-baked image with python+git; not handled here.
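
The lookup and the fail-fast behaviour described above can be sketched as a plain dictionary plus one helper. The table contents mirror the provider list on this page, but the table's shape, the helper name provider_args, and the choice to resolve docker to local-container before calling crabbox are assumptions; the real PROVIDER_CONFIG in runner.py may be structured differently and also carries work-root settings that are omitted here.

```python
# provider name -> (name passed to crabbox --provider, image flag)
# Shape is assumed for illustration; work-root flags are left out.
PROVIDER_CONFIG: dict[str, tuple[str, str]] = {
    "islo": ("islo", "--islo-image"),
    "e2b": ("e2b", "--e2b-template"),
    "modal": ("modal", "--modal-image"),
    "daytona": ("daytona", "--daytona-snapshot"),
    "local-container": ("local-container", "--local-container-image"),
    "docker": ("local-container", "--local-container-image"),  # alias
    "namespace-devbox": ("namespace-devbox", "--namespace-image"),
    "tensorlake": ("tensorlake", "--tensorlake-image"),
}


def provider_args(provider: str, image: str) -> list[str]:
    """Return the provider-specific part of the crabbox argv, or fail fast."""
    if provider not in PROVIDER_CONFIG:
        supported = ", ".join(sorted(PROVIDER_CONFIG))
        raise ValueError(
            f"provider {provider!r} not supported by this example. "
            f"Container-style providers: {supported}."
        )
    crabbox_name, image_flag = PROVIDER_CONFIG[provider]
    return ["--provider", crabbox_name, image_flag, image]
```

Keeping the mapping explicit, rather than deriving the flag as f"--{provider}-image", is what lets e2b and daytona use their differently named flags without special cases in the calling code.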

Tests

tests/test_examples_crabbox.py exercises the wrapper in two layers, mocked unit tests and an opt-in live smoke test:

$ uv run pytest tests/test_examples_crabbox.py -v
test_provider_config_covers_known_container_providers              PASSED
test_unsupported_provider_raises_with_helpful_message              PASSED
test_task_load_extracts_metadata                                   PASSED
test_extract_verifier_recovers_three_files                         PASSED
test_run_task_builds_correct_command_for_each_provider[islo]       PASSED
test_run_task_builds_correct_command_for_each_provider[e2b]        PASSED
test_run_task_builds_correct_command_for_each_provider[modal]      PASSED
test_run_task_builds_correct_command_for_each_provider[daytona]    PASSED
test_run_task_builds_correct_command_for_each_provider[local-…]    PASSED
test_run_task_builds_correct_command_for_each_provider[docker]     PASSED
test_run_task_builds_correct_command_for_each_provider[namespace…] PASSED
test_run_task_builds_correct_command_for_each_provider[tensorlake] PASSED
test_run_task_rejects_unsupported_provider                         PASSED
test_run_task_writes_keep_flag_when_requested                      PASSED
test_run_task_raises_when_no_sentinel_in_stdout                    PASSED
test_live_islo_oracle_scores_one                                   SKIPPED
  (ISLO_API_KEY not set — live islo.dev smoke is opt-in)
======================== 15 passed, 1 skipped in 1.07s ========================

Unit tests (always run in CI) monkeypatch subprocess.run and assert the exact crabbox argv built for every supported provider, so they need neither network access nor the crabbox binary. The live islo smoke test pulls pallets__click-3466 from the public reference dataset, scores the oracle, and asserts final_reward == 1.0. It runs only when ISLO_API_KEY is set and crabbox is on PATH:

ISLO_API_KEY=ak_... uv run pytest tests/test_examples_crabbox.py \
  -k live_islo -v
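
The two test layers can be illustrated with the standard library alone. In the sketch below, run_crabbox is a hypothetical stand-in for the wrapper's subprocess call, and the argv layout after "crabbox run --provider <p>" is assumed rather than taken from crabbox's documentation. The real tests use pytest's monkeypatch; unittest.mock is used here to keep the example self-contained.

```python
import os
import shutil
import subprocess
from unittest import mock


def run_crabbox(provider: str, image_flag: str, image: str, script: str):
    """Hypothetical stand-in for the wrapper's crabbox invocation."""
    argv = ["crabbox", "run", "--provider", provider, image_flag, image,
            "--", "bash", "-c", script]  # layout after --provider is assumed
    return subprocess.run(argv, capture_output=True, text=True, check=False)


def captured_argv(provider: str, image_flag: str) -> list[str]:
    """Call run_crabbox with subprocess.run mocked out; return the argv built."""
    fake = subprocess.CompletedProcess(args=[], returncode=0, stdout="", stderr="")
    with mock.patch("subprocess.run", return_value=fake) as fake_run:
        run_crabbox(provider, image_flag, "python:3.12-slim", "true")
    return fake_run.call_args.args[0]


def live_smoke_enabled() -> bool:
    """Gate for the opt-in live test: API key present and crabbox on PATH."""
    return bool(os.environ.get("ISLO_API_KEY")) and shutil.which("crabbox") is not None
```

Because subprocess.run is replaced, captured_argv never launches a process, so the check passes on machines without crabbox installed.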
Scope. Supported today: every pr_diff task in AdithyaSK/repo2rlenv-pr-diff (161 tasks). Pipelines like pr_runtime and commit_runtime build a per-repo Docker image at repo2rlenv bootstrap time and need docker-in-docker — wiring those through crabbox is a follow-up.

Why this matters

Repo2RLEnv's stated focus is synthesis; running is delegated to Harbor's stack. Crabbox extends the runtime reach without modifying either project — Harbor for the in-stack runners, crabbox for the providers Harbor doesn't ship a backend for. Generate once, run anywhere.