The database disaggregated itself: storage became Parquet on S3, the table became a tree of metadata files, and the engine became a replaceable visitor.
Lecture 1 — Iceberg Internals: The Metadata Tree · Lecture 2 — Photon and Engines Over Open Files
How a pile of immutable files acquires ACID semantics.
dt=2026-06-09/region=eu/
Up to 1,000 keys per S3 LIST call, so planning a query over a 1M-file table costs ~1,000 sequential API round-trips: tens of seconds before a single byte of data is read.
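The round-trip estimate above is simple arithmetic, and a minimal sketch makes the assumptions explicit. The function name `list_planning_cost` and the 30 ms per-call latency are illustrative assumptions, not measurements from the lecture; the 1,000-key page size is the S3 LIST maximum.

```python
import math

def list_planning_cost(num_files, keys_per_call=1000, latency_ms=30):
    """Estimate sequential LIST calls and wall-clock seconds to enumerate a table.

    latency_ms is an assumed per-call round-trip time; real values vary.
    """
    calls = math.ceil(num_files / keys_per_call)  # each call returns at most one page
    seconds = calls * latency_ms / 1000           # pages are fetched one after another
    return calls, seconds

calls, seconds = list_planning_cost(1_000_000)
print(calls, seconds)  # 1000 30.0
```

Because each page's continuation token comes from the previous response, the calls within one prefix cannot be issued in parallel, which is why the cost scales with the call count.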
INSERT OVERWRITE deletes files under running queries
dt = '2026-06-09' AND user_id = 42
LIST calls to enumerate files
dt
user_id min/max straddle 42
v4.metadata.json with a new snapshot
The commit itself is a compare-and-swap on a few hundred bytes: a metastore lock, a DynamoDB conditional put, or a REST catalog’s conditional POST. Steps 1–4 are just unreferenced objects in a bucket.
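The commit step can be sketched with a toy in-memory catalog. Everything here is hypothetical: `ToyCatalog`, `compare_and_swap`, the table name `events`, and the `v3`/`v4-b` file names are invented for illustration, and the lock merely stands in for whatever conditional write a real catalog provides.

```python
import threading

class ToyCatalog:
    """Hypothetical catalog: maps a table name to its current metadata file."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()  # stands in for the catalog's atomic primitive

    def current(self, table):
        return self._pointers.get(table)

    def compare_and_swap(self, table, expected, new):
        """Swap the pointer only if it still equals `expected`."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False  # someone else committed first
            self._pointers[table] = new
            return True

catalog = ToyCatalog()
catalog.compare_and_swap("events", None, "v3.metadata.json")

# Two writers both build on v3 and race to publish their own metadata file.
base = catalog.current("events")
writer_a = catalog.compare_and_swap("events", base, "v4.metadata.json")
writer_b = catalog.compare_and_swap("events", base, "v4-b.metadata.json")
print(writer_a, writer_b, catalog.current("events"))  # True False v4.metadata.json
```

The losing writer's data and metadata files still exist in the bucket, but nothing points at them, so no reader can ever see them.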
dt into queries: miss it, scan everything, silently
day(event_ts), bucket(16, user_id), truncate(4, zipcode)
event_ts → derived partition bounds → pruning
month(ts) for 2024, day(ts) from 2025
rename just changes the name attached to ID 7
drop + add of the same name → a fresh ID
FOR VERSION AS OF 8124 plans an old snapshot, for free
expire_snapshots: reference-counting GC over the bucket
_delta_log/ of ordered JSON add/remove actions

Can open files perform like a warehouse? The burden of proof is the engine.
≈3× average speedup over the prior Spark engine on customer workloads, and beyond 10× on the most compute-heavy queries (SIGMOD 2022).
baseline text-to-SQL accuracy on Spider 2.0-style enterprise benchmarks — largely because real catalogs are documentation deserts. Descriptions, join paths, and freshness in context move it dramatically.
| Property | Hive directories | Iceberg (tree) | Delta (log) |
|---|---|---|---|
| Table definition | whatever LIST returns | snapshot in metadata file | log replay from checkpoint |
| Commit atomicity | none (file-at-a-time) | CAS on catalog pointer | atomic create of next log file |
| Plan cost, 1M files | ~1,000 LIST calls + opens | tens of metadata reads | checkpoint + log tail |
| Partition predicate | user must name dt | hidden: transform on column | generated columns (partial) |
| Schema evolution | positional; unsafe drops | column IDs; safe rename/drop | column mapping (opt-in) |
| Time travel | — | snapshot pin, O(1) | log version pin |
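
The "Partition predicate" row can be illustrated with a small sketch of hidden partitioning: the user filters on the timestamp column, and the planner derives partition bounds through the transform. The file names, the `files` index, and the `plan` function are assumptions made up for this example; only the `day` transform is modeled, and a real planner reads the partition values from manifests rather than a dict.

```python
from datetime import date, datetime, timedelta

def day(ts: datetime) -> date:
    """Partition transform: map a timestamp to its calendar day."""
    return ts.date()

# Hypothetical table state: data files grouped by their day(event_ts) value.
files = {
    date(2026, 6, 8): ["a.parquet"],
    date(2026, 6, 9): ["b.parquet", "c.parquet"],
    date(2026, 6, 10): ["d.parquet"],
}

def plan(lo: datetime, hi: datetime) -> list:
    """Return files that may hold rows with lo <= event_ts < hi."""
    lo_day = day(lo)
    hi_day = day(hi - timedelta(microseconds=1))  # hi is exclusive
    return sorted(
        name
        for part, names in files.items()
        if lo_day <= part <= hi_day
        for name in names
    )

# The query never mentions a partition column, yet only one day is scanned.
print(plan(datetime(2026, 6, 9, 8), datetime(2026, 6, 9, 17)))
# ['b.parquet', 'c.parquet']
```

Contrast this with the Hive column, where the same pruning happens only if the query author remembers to add a dt predicate by hand.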
user_id? (Ex 7.1)
toyberg, which racing commits may retry and which must abort? (Ex 7.2)