Document AA-02 Appendix to — doc AA-01, the execution model Status — actually executed

The loop isn't a diagram. We ran it. The bytes match.

The one-pager explains the vision. This page is the receipt. We seeded a real ClickHouse source with 50,000 web-analytics events, dispatched an Airbyte-contract worker into a live islo sandbox, moved every row into a DuckDB warehouse, and verified the result four independent ways — down to a byte-exact SHA-256 of the moved data. Crabbox is the dispatcher boundary; islo is the sandbox provider used for this executed run.

Sync succeeded · 50,000 / 50,000 rows · 4 / 4 parity checks passed · exit 0

0
rows moved end-to-end
0
rows/sec in the sandbox
4 / 4
parity checks passed
0.33s
sync wall-time
§01 what actually moved

ClickHouse → worker → DuckDB, inside a borrowed box.

A standard pipe: an OLAP system of record on the left, an analytics warehouse on the right, and the Airbyte source→destination record contract in the middle. The whole thing executes inside an islo sandbox — the harness never touches the data. In the generalized shape, Crabbox can target other providers behind the same run contract.

exhibit a · the pipe
source · system of record

ClickHouse

analytics.events
50,000 rows · MergeTree
v26.6.1
engine · airbyte contract

etl.py worker

discover · read
RECORD streams · 10 batches
~9.0 MB read
destination · warehouse

DuckDB

raw.events
50,000 rows · typed
v1.5.3

all of the above runs as one governed job: Crabbox dispatches the run contract, islo provides the sandbox, the repo worker moves the data, and the harness only reads the evidence below.

§02 how a row actually crosses

The engine, in four moves.

The worker follows the same contract Airbyte uses — discover, read, write, verify — on the lightweight custom-CDK path so the whole thing runs self-contained in a borrowed box. Here it is, condensed from the real file.

exhibit b · condensed source
worker/etl.py · the source→destination contractpython 3.12
# 1 · DISCOVER — ask the source for its schema, map types for the destination
cols = describe(f"{db}.{stream}")            # ClickHouse DESCRIBE → catalog
ddl  = ", ".join(map_type(c) for c in cols)  # CH type → DuckDB type (typed, not stringly)

# 2 · READ — pull RECORD batches as JSONEachRow (exactly Airbyte's record shape)
for offset in range(0, n_rows, BATCH):                # 10 batches × 5,000
    page = ch(f"SELECT … LIMIT {BATCH} OFFSET {offset} FORMAT JSONEachRow")

    # 3 · WRITE — vectorized bulk load into the destination warehouse
    duck.execute(f"INSERT INTO raw.events SELECT * FROM read_json({page})")

# 4 · VERIFY — compare src vs dst four ways; exit non-zero unless ALL pass
assert src_count   == dst_count                       # row count
assert src_revenue == dst_revenue                     # decimal aggregate
assert src_tally   == dst_tally                       # per-type group-by
assert sha256(src_rows) == sha256(dst_rows)          # ← byte-exact

Condensed for reading — the real, runnable source is worker/etl.py.

What this proves — and what it doesn't.

This run uses the Airbyte source→destination contract on the custom-CDK path, not a full packaged connector deployment — that's what lets it run self-contained in a sandbox in under a second. What it does prove is the part that matters for agentic data movement: a goal-driven worker can be dispatched into an isolated box, move real typed data end-to-end, and return evidence strong enough (a byte-exact checksum) for a harness to trust the result and decide what to do next.

§03 the receipt · 4 independent parity checks

How we know every row arrived intact.

“It ran” isn't proof. These four checks compare the source and the destination after the move. The worker exits non-zero unless all of them pass. Here is the real output of this run.

exhibit c · four parity checks

1 · Row count

PASS

Every record made it across — no drops, no dupes.

source50,000 destination50,000

2 · Revenue sum

PASS

Decimal aggregate is preserved exactly across the type boundary.

source$611,815.02 destination$611,815.02

3 · Per-event-type tally

PASS

Group-by counts agree for all five event types.

page_view29,958 = 29,958 search7,598 = 7,598 add_to_cart6,014 = 6,014 checkout3,928 = 3,928 purchase2,502 = 2,502

4 · Content SHA-256

PASS

Byte-exact: hash of every (event_id, type, revenue) tuple, sorted. The strongest check.

sourcea82239cc…a5c73ebcb destinationa82239cc…a5c73ebcb fulla82239cc73c96e68528e61f885d0b073ea60fb68e4332aec8d586c2a5c73ebcb
§04 proof the data is live, not just present

Queries run on the destination warehouse.

After the load, the worker queries DuckDB directly — if these return sensible business answers, the moved data is genuinely usable, not just byte-counted.

Top countries by revenue

SELECT … GROUP BY country (on DuckDB)
🇧🇷 BR
$66,444
🇺🇸 US
$66,072
🇫🇷 FR
$63,935
🇦🇺 AU
$62,368
🇮🇱 IL
$60,967

Conversion funnel

count(*) by event_type (on DuckDB)
page_view
29,958
search
7,598
add_to_cart
6,014
checkout
3,928
purchase
2,502
discover step · typed, not stringly

The schema crossed the type boundary intact.

The worker reads the source catalog and maps each ClickHouse type to a DuckDB type before loading — decimals stay decimals, datetimes stay timestamps.

exhibit d · schema map
columnClickHouse · sourceDuckDB · destination
event_idUInt64BIGINT
user_idUInt32BIGINT
session_idUInt32BIGINT
event_typeLowCardinality(String)VARCHAR
channelLowCardinality(String)VARCHAR
deviceLowCardinality(String)VARCHAR
countryLowCardinality(String)VARCHAR
urlStringVARCHAR
revenueDecimal(12, 2)DECIMAL(18,2)
tsDateTimeTIMESTAMP
§05 the worker · a real islo sandbox

Where it ran — full provenance.

This wasn't your laptop. The job ran in a fresh, repo-defined islo microVM in the Canada compute region, then was torn down. Every field below is captured from the live box.

exhibit e · provenance
sandbox id019ea238-1f81-7950-80a9-1b80a5e0b556
sandbox nameairbyte-etl
imagedocker.io/library/python:3.12
kernelLinux 6.16.9+ · x86_64
vCPU / memory4 vCPU · 3930 MB
compute regionca.compute.islo.dev (Canada)
public ownerredacted
source engineClickHouse 26.6.1.472
destination engineDuckDB 1.5.3
runtimePython 3.12.13
sync modefull_refresh_overwrite
bytes read9,450,617 (~9.0 MB)
batches10 × 5,000 records
read / write split0.186s read · 0.145s write

The whole run, phase by phase.

Nine ::CRABBOX_PHASE:: markers split the job into named steps the orchestrator can time, attach evidence to, and reason over. This is the end-to-end, start to finish.

exhibit f · the run, phase by phase
Setup · make the box ready
1

bootstrap ::CRABBOX_PHASE::bootstrap

Install the ClickHouse static binary and a Python venv (duckdb, requests) into the fresh box.

emits → a deterministic, repo-defined environment
2

boot_clickhouse ::CRABBOX_PHASE::boot_clickhouse

Start a local ClickHouse server and wait until it answers on the HTTP port.

emits → ClickHouse 26.6.1 up
Prepare · data + schema
3

seed ::CRABBOX_PHASE::seed

Generate 50,000 deterministic web-analytics events (fixed RNG) and insert them — this is the system of record we're asked to move.

emits → analytics.events · 50,000 rows · $611,815.02
4

discover ::CRABBOX_PHASE::discover

Read the source catalog and map every ClickHouse type to a DuckDB type before a single row moves.

emits → a 10-column typed catalog
5

write_setup ::CRABBOX_PHASE::write_setup

Create the typed destination table in DuckDB from the mapped catalog — drop and recreate, since the sync mode is full_refresh_overwrite.

emits → raw.events ready to receive typed rows
Move · the actual transfer
6

sync ::CRABBOX_PHASE::sync

Stream the rows out as JSONEachRow RECORD batches and bulk-load them into the destination warehouse.

emits → 50,000 rows moved · 0.332s · 0.186s read + 0.145s write
Prove & report · evidence for the harness
7

verify ::CRABBOX_PHASE::verify

Compare source and destination four ways, including a byte-exact SHA-256. The worker exits non-zero unless all four agree.

emits → 4 / 4 checks PASS
8

analytics ::CRABBOX_PHASE::analytics

Query the destination directly (top countries, funnel) to prove the moved data is usable, not just present.

emits → real business answers from DuckDB
9

emit ::CRABBOX_PHASE::emit

Write metrics.json plus a final STATE message — the structured evidence a reflection loop ingests to decide the next move.

emits → status = SUCCEEDED · exit 0
raw evidence · unedited tail

The actual run, from the sandbox.

exhibit g · raw log
airbyte-etl · /workspace/agentic-airbyte/poc · isloexit 0
::CRABBOX_PHASE::bootstrap
[e2e] installing clickhouse static binary
::CRABBOX_PHASE::boot_clickhouse
[e2e] ClickHouse up: 26.6.1.472
::CRABBOX_PHASE::seed
[seed] analytics.events ready: rows=50000 total_revenue=611815.02
::CRABBOX_PHASE::discover
{"type":"LOG","log":{"level":"INFO","message":"discovered stream 'events'"}}
{"type":"LOG","log":{"level":"INFO","message":"source has 50000 records, total_revenue=611815.02"}}
::CRABBOX_PHASE::write_setup
::CRABBOX_PHASE::sync
{"type":"LOG","log":{"level":"INFO","message":"synced 5000/50000 records"}}
          … 10 batches …
{"type":"LOG","log":{"level":"INFO","message":"synced 50000/50000 records"}}
::CRABBOX_PHASE::verify
::CRABBOX_PHASE::analytics
::CRABBOX_PHASE::emit
{"type":"STATE","state":{"records_moved":50000,"status":"SUCCEEDED"}}
{"type":"LOG","log":{"level":"INFO","message":"sync SUCCEEDED: moved 50000 rows in 0.549s (150700.2 rows/s); checks_passed=True"}}
EXIT=0

Full artifacts in the repo: metrics.sandbox.json · e2e_sandbox.log

§06 run it yourself

One command. Same bytes, every time.

The seed is deterministic (fixed RNG), so the SHA-256 above is reproducible. Borrow a box, hydrate it from the repo, run the proof, tear it down.

The islo way — persistent box

Lease a box, hydrate from the repo, run the proof.

your machine
# lease a box, hydrate from the repo, run the proof
islo use airbyte-etl \
  --config poc/islo.yaml \
  --source github://zozo123/agentic-airbyte \
  -- bash poc/run_e2e.sh

The crabbox way — ephemeral worker

Dispatch it as a governed run with evidence capture.

your harness
# dispatch as a governed run with evidence capture
crabbox run --pool org/data-movement/main/... \
  --shell 'bash poc/run_e2e.sh' \
  --artifact-glob 'poc/reports/**' \
  --junit poc/reports/