DATA 2027 · Week 07 · Part II — New Access Methods & Engines

The Lakehouse & Open Formats

The database disaggregated itself: storage became Parquet on S3, the table became a tree of metadata files, and the engine became a replaceable visitor.

Lecture 1 — Iceberg Internals: The Metadata Tree · Lecture 2 — Photon and Engines Over Open Files

Lecture 1 · Tuesday

Iceberg Internals: The Metadata Tree

How a pile of immutable files acquires ACID semantics.

L1 · The failure that made the field

Before 2018: a “table” was a convention

L1 · The failure that made the field

Planning by directory listing

1,000

keys per S3 LIST call — so planning a query over a 1M-file table costs ~1,000 sequential API round-trips: tens of seconds before a single byte of data is read.

L1 · The failure that made the field

ACID dies on a directory listing

L1 · Iceberg’s move

Define the table by a tree, not a place

L1 · The tree, level by level

The Iceberg metadata tree

CATALOG db.events → v3.metadata (only mutable bit) commit = CAS this pointer, v3 → v4 v3.metadata.json schema (col IDs) partition spec snapshots: s1 s2 [s3]* snap-s3.avro manifest list m1: dt≤06-08 m2: dt=06-09 m1.avro per file: min/max, nulls, rows m2.avro per file: min/max, nulls, rows a.parquet b.parquet … (data) everything right of the catalog is immutable; readers never block writers
Fig. 7.1 — Four layers of immutable files: table metadata → snapshot → manifest list → manifests → Parquet data files.
L1 · Plan-cost arithmetic

1M files, dt = '2026-06-09' AND user_id = 42

  • Hive: ~1,000 LIST calls to enumerate files
  • Then open every file in the matching partition
  • Cost proportional to table size
  • Iceberg: 1 metadata JSON (~100 KB) + 1 manifest list
  • Only manifests whose range can contain dt
  • Only files whose user_id min/max straddle 42
  • ~a dozen files, read in parallel
  • Same zone-map idea as Week 5 — lifted into the format
L1 · Commits

An append, step by step

  1. Write the new Parquet data files
  2. Write a new manifest listing them
  3. Write a manifest list: old + new manifests
  4. Write v4.metadata.json with a new snapshot
  5. Ask the catalog to compare-and-swap v3 → v4
L1 · Commits

Where all the atomicity lives

1 CAS

on a few hundred bytes — a metastore lock, a DynamoDB conditional put, or a REST catalog’s conditional POST. Steps 1–4 are just unreferenced objects in a bucket.

L1 · Concurrency

Optimistic concurrency, made cheap

L1 · Field note

S3 has no rename

L1 · Hidden partitioning

Partitions the query can’t miss

L1 · Schema evolution

Columns are IDs, not positions

L1 · Time travel & Delta

Free snapshots, and Delta’s other route

Lecture 2 · Thursday

Photon and Engines Over Open Files

Can open files perform like a warehouse? The burden of proof is the engine.

L2 · The two-tier problem

Two copies of the truth

L2 · The lakehouse argument

Three falsifiable bets

L2 · Photon’s why

The JVM in the inner loop

L2 · Photon’s how

Vectorized and interpreted, in C++

L2 · Results

The claim, not hand-waved

average speedup over the prior Spark engine on customer workloads — ≈3× on average, beyond 10× on the most compute-heavy queries (SIGMOD 2022).

L2 · Results

The industry conceded by imitation

L2 · The catalog wars

The new choke point is the map

L2 · The agent angle

Agents skip the SQL front door

L2 · The agent angle

Metadata is load-bearing now

<20%

baseline text-to-SQL accuracy on Spider 2.0-style enterprise benchmarks — largely because real catalogs are documentation deserts. Descriptions, join paths, and freshness in context move it dramatically.

L2 · Recap

Three table definitions

PropertyHive directoriesIceberg (tree)Delta (log)
Table definitionwhatever LIST returnssnapshot in metadata filelog replay from checkpoint
Commit atomicitynone (file-at-a-time)CAS on catalog pointeratomic create of next log file
Plan cost, 1M files~1,000 LIST calls + opens~10s of metadata readscheckpoint + log tail
Partition predicateuser must name dthidden: transform on columngenerated columns (partial)
Schema evolutionpositional; unsafe dropscolumn IDs; safe rename/dropcolumn mapping (opt-in)
Time travelsnapshot pin, O(1)log version pin
When every client reads the files directly, the format is the database — and the catalog is the only map the agent has.
— Week 7 lecture notes
Checkpoint · Discussion

Before you leave

Readings · Due Thursday

Read before Thursday

  1. Lakehouse — Armbrust, Ghodsi, Xin, Zaharia, CIDR 2021. Extract the three bets in §3; which carries the most risk?
  2. Apache Iceberg Table Spec, v2 — ASF; pair with Ryan Blue’s Netflix 2018 talk. Find where per-column bounds live.
  3. Photon — Behm, Palkar, Agarwal, et al., SIGMOD 2022. Focus on §3’s vectorize-and-interpret decision; skim the eval skeptically.