Agentic Airbyte — Proof Appendix · ClickHouse → DuckDB, verified

§01 what actually moved

ClickHouse → worker → DuckDB, inside a borrowed box.

A standard pipe: an OLAP system of record on the left, an analytics warehouse on the right, and the Airbyte source→destination record contract in the middle. The whole thing executes inside an islo sandbox — the harness never touches the data. In the generalized shape, Crabbox can target other providers behind the same run contract.

exhibit a · the pipe

source · system of record

ClickHouse

analytics.events
50,000 rows · MergeTree
v26.6.1

engine · airbyte contract

etl.py worker

discover · read
RECORD streams · 10 batches
~9.0 MB read

destination · warehouse

DuckDB

raw.events
50,000 rows · typed
v1.5.3

all of the above runs as one governed job: Crabbox dispatches the run contract, islo provides the sandbox, the repo worker moves the data, and the harness only reads the evidence below.

§02 how a row actually crosses

The engine, in four moves.

The worker follows the same contract Airbyte uses — discover, read, write, verify — on the lightweight custom-CDK path so the whole thing runs self-contained in a borrowed box. Here it is, condensed from the real file.

exhibit b · condensed source

worker/etl.py · the source→destination contractpython 3.12

# 1 · DISCOVER — ask the source for its schema, map types for the destination
cols = describe(f"{db}.{stream}")            # ClickHouse DESCRIBE → catalog
ddl  = ", ".join(map_type(c) for c in cols)  # CH type → DuckDB type (typed, not stringly)

# 2 · READ — pull RECORD batches as JSONEachRow (exactly Airbyte's record shape)
for offset in range(0, n_rows, BATCH):                # 10 batches × 5,000
    page = ch(f"SELECT … LIMIT {BATCH} OFFSET {offset} FORMAT JSONEachRow")

    # 3 · WRITE — vectorized bulk load into the destination warehouse
    duck.execute(f"INSERT INTO raw.events SELECT * FROM read_json({page})")

# 4 · VERIFY — compare src vs dst four ways; exit non-zero unless ALL pass
assert src_count   == dst_count                       # row count
assert src_revenue == dst_revenue                     # decimal aggregate
assert src_tally   == dst_tally                       # per-type group-by
assert sha256(src_rows) == sha256(dst_rows)          # ← byte-exact

Condensed for reading — the real, runnable source is worker/etl.py.

What this proves — and what it doesn't.

This run uses the Airbyte source→destination contract on the custom-CDK path, not a full packaged connector deployment — that's what lets it run self-contained in a sandbox in under a second. What it does prove is the part that matters for agentic data movement: a goal-driven worker can be dispatched into an isolated box, move real typed data end-to-end, and return evidence strong enough (a byte-exact checksum) for a harness to trust the result and decide what to do next.

§03 the receipt · 4 independent parity checks

How we know every row arrived intact.

“It ran” isn't proof. These four checks compare the source and the destination after the move. The worker exits non-zero unless all of them pass. Here is the real output of this run.

exhibit c · four parity checks

1 · Row count

PASS

Every record made it across — no drops, no dupes.

source50,000 destination50,000

2 · Revenue sum

PASS

Decimal aggregate is preserved exactly across the type boundary.

source$611,815.02 destination$611,815.02

3 · Per-event-type tally

PASS

Group-by counts agree for all five event types.

page_view29,958 = 29,958 search7,598 = 7,598 add_to_cart6,014 = 6,014 checkout3,928 = 3,928 purchase2,502 = 2,502

4 · Content SHA-256

PASS

Byte-exact: hash of every (event_id, type, revenue) tuple, sorted. The strongest check.

sourcea82239cc…a5c73ebcb destinationa82239cc…a5c73ebcb fulla82239cc73c96e68528e61f885d0b073ea60fb68e4332aec8d586c2a5c73ebcb

§04 proof the data is live, not just present

Queries run on the destination warehouse.

After the load, the worker queries DuckDB directly — if these return sensible business answers, the moved data is genuinely usable, not just byte-counted.

Top countries by revenue

SELECT … GROUP BY country (on DuckDB)

🇧🇷 BR

$66,444

🇺🇸 US

$66,072

🇫🇷 FR

$63,935

🇦🇺 AU

$62,368

🇮🇱 IL

$60,967

Conversion funnel

count(*) by event_type (on DuckDB)

page_view

29,958

search

7,598

add_to_cart

6,014

checkout

3,928

purchase

2,502

discover step · typed, not stringly

The schema crossed the type boundary intact.

The worker reads the source catalog and maps each ClickHouse type to a DuckDB type before loading — decimals stay decimals, datetimes stay timestamps.

exhibit d · schema map

column	ClickHouse · source	DuckDB · destination
event_id	UInt64	BIGINT
user_id	UInt32	BIGINT
session_id	UInt32	BIGINT
event_type	LowCardinality(String)	VARCHAR
channel	LowCardinality(String)	VARCHAR
device	LowCardinality(String)	VARCHAR
country	LowCardinality(String)	VARCHAR
url	String	VARCHAR
revenue	Decimal(12, 2)	DECIMAL(18,2)
ts	DateTime	TIMESTAMP

§05 the worker · a real islo sandbox

Where it ran — full provenance.

This wasn't your laptop. The job ran in a fresh, repo-defined islo microVM in the Canada compute region, then was torn down. Every field below is captured from the live box.

exhibit e · provenance

sandbox id	019ea238-1f81-7950-80a9-1b80a5e0b556
sandbox name	airbyte-etl
image	docker.io/library/python:3.12
kernel	Linux 6.16.9+ · x86_64
vCPU / memory	4 vCPU · 3930 MB
compute region	ca.compute.islo.dev (Canada)
public owner	redacted

source engine	ClickHouse 26.6.1.472
destination engine	DuckDB 1.5.3
runtime	Python 3.12.13
sync mode	full_refresh_overwrite
bytes read	9,450,617 (~9.0 MB)
batches	10 × 5,000 records
read / write split	0.186s read · 0.145s write

The whole run, phase by phase.

Nine ::CRABBOX_PHASE:: markers split the job into named steps the orchestrator can time, attach evidence to, and reason over. This is the end-to-end, start to finish.

exhibit f · the run, phase by phase

Setup · make the box ready

1

bootstrap `::CRABBOX_PHASE::bootstrap`

Install the ClickHouse static binary and a Python venv (duckdb, requests) into the fresh box.

emits → a deterministic, repo-defined environment

2

boot_clickhouse `::CRABBOX_PHASE::boot_clickhouse`

Start a local ClickHouse server and wait until it answers on the HTTP port.

emits → ClickHouse 26.6.1 up

Prepare · data + schema

3

seed `::CRABBOX_PHASE::seed`

Generate 50,000 deterministic web-analytics events (fixed RNG) and insert them — this is the system of record we're asked to move.

emits → analytics.events · 50,000 rows · $611,815.02

4

discover `::CRABBOX_PHASE::discover`

Read the source catalog and map every ClickHouse type to a DuckDB type before a single row moves.

emits → a 10-column typed catalog

5

write_setup `::CRABBOX_PHASE::write_setup`

Create the typed destination table in DuckDB from the mapped catalog — drop and recreate, since the sync mode is full_refresh_overwrite.

emits → raw.events ready to receive typed rows

Move · the actual transfer

6

sync `::CRABBOX_PHASE::sync`

Stream the rows out as JSONEachRow RECORD batches and bulk-load them into the destination warehouse.

emits → 50,000 rows moved · 0.332s · 0.186s read + 0.145s write

Prove & report · evidence for the harness

7

verify `::CRABBOX_PHASE::verify`

Compare source and destination four ways, including a byte-exact SHA-256. The worker exits non-zero unless all four agree.

emits → 4 / 4 checks PASS

8

analytics `::CRABBOX_PHASE::analytics`

Query the destination directly (top countries, funnel) to prove the moved data is usable, not just present.

emits → real business answers from DuckDB

9

emit `::CRABBOX_PHASE::emit`

Write metrics.json plus a final STATE message — the structured evidence a reflection loop ingests to decide the next move.

emits → status = SUCCEEDED · exit 0

raw evidence · unedited tail

The actual run, from the sandbox.

exhibit g · raw log

airbyte-etl · /workspace/agentic-airbyte/poc · isloexit 0

::CRABBOX_PHASE::bootstrap
[e2e] installing clickhouse static binary
::CRABBOX_PHASE::boot_clickhouse
[e2e] ClickHouse up: 26.6.1.472
::CRABBOX_PHASE::seed
[seed] analytics.events ready: rows=50000 total_revenue=611815.02
::CRABBOX_PHASE::discover
{"type":"LOG","log":{"level":"INFO","message":"discovered stream 'events'"}}
{"type":"LOG","log":{"level":"INFO","message":"source has 50000 records, total_revenue=611815.02"}}
::CRABBOX_PHASE::write_setup
::CRABBOX_PHASE::sync
{"type":"LOG","log":{"level":"INFO","message":"synced 5000/50000 records"}}
          … 10 batches …
{"type":"LOG","log":{"level":"INFO","message":"synced 50000/50000 records"}}
::CRABBOX_PHASE::verify
::CRABBOX_PHASE::analytics
::CRABBOX_PHASE::emit
{"type":"STATE","state":{"records_moved":50000,"status":"SUCCEEDED"}}
{"type":"LOG","log":{"level":"INFO","message":"sync SUCCEEDED: moved 50000 rows in 0.549s (150700.2 rows/s); checks_passed=True"}}
EXIT=0

Full artifacts in the repo: metrics.sandbox.json · e2e_sandbox.log

§06 run it yourself

One command. Same bytes, every time.

The seed is deterministic (fixed RNG), so the SHA-256 above is reproducible. Borrow a box, hydrate it from the repo, run the proof, tear it down.

The islo way — persistent box

Lease a box, hydrate from the repo, run the proof.

your machine

# lease a box, hydrate from the repo, run the proof
islo use airbyte-etl \
  --config poc/islo.yaml \
  --source github://zozo123/agentic-airbyte \
  -- bash poc/run_e2e.sh

The crabbox way — ephemeral worker

Dispatch it as a governed run with evidence capture.

your harness

# dispatch as a governed run with evidence capture
crabbox run --pool org/data-movement/main/... \
  --shell 'bash poc/run_e2e.sh' \
  --artifact-glob 'poc/reports/**' \
  --junit poc/reports/

source — run_e2e.sh · worker/etl.py · worker/seed.py · islo.yaml

ClickHouse → worker → DuckDB, inside a borrowed box.

The engine, in four moves.

What this proves — and what it doesn't.

How we know every row arrived intact.

1 · Row count

2 · Revenue sum

3 · Per-event-type tally

4 · Content SHA-256

Queries run on the destination warehouse.

Top countries by revenue

Conversion funnel

The schema crossed the type boundary intact.

Where it ran — full provenance.

The whole run, phase by phase.

bootstrap ::CRABBOX_PHASE::bootstrap

boot_clickhouse ::CRABBOX_PHASE::boot_clickhouse

seed ::CRABBOX_PHASE::seed

discover ::CRABBOX_PHASE::discover

write_setup ::CRABBOX_PHASE::write_setup

sync ::CRABBOX_PHASE::sync

verify ::CRABBOX_PHASE::verify

analytics ::CRABBOX_PHASE::analytics

emit ::CRABBOX_PHASE::emit

The actual run, from the sandbox.

One command. Same bytes, every time.

The islo way — persistent box

The crabbox way — ephemeral worker

bootstrap `::CRABBOX_PHASE::bootstrap`

boot_clickhouse `::CRABBOX_PHASE::boot_clickhouse`

seed `::CRABBOX_PHASE::seed`

discover `::CRABBOX_PHASE::discover`

write_setup `::CRABBOX_PHASE::write_setup`

sync `::CRABBOX_PHASE::sync`

verify `::CRABBOX_PHASE::verify`

analytics `::CRABBOX_PHASE::analytics`

emit `::CRABBOX_PHASE::emit`