DATA 2027 · Week 07 · Part II — New Access Methods & Engines

The Lakehouse & Open Formats

The database disaggregated itself: storage became Parquet on S3, the table became a tree of metadata files, and the engine became a replaceable visitor.

Lecture 1 — Iceberg Internals: The Metadata Tree · Lecture 2 — Photon and Engines Over Open Files

Lecture 1 · Tuesday

Iceberg Internals: The Metadata Tree

How a pile of immutable files acquires ACID semantics.

L1 · The failure that made the field

Before 2018: a “table” was a convention

A directory, with subdirectories encoding partitions: dt=2026-06-09/region=eu/
A metastore row points at the root
The file set = whatever a LIST returns right now
Hive isn’t a bad database — it is not a database

L1 · The failure that made the field

Planning by directory listing

1,000

keys per S3 LIST call — so planning a query over a 1M-file table costs ~1,000 sequential API round-trips: tens of seconds before a single byte of data is read.
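A back-of-envelope check of that figure, as a minimal sketch. The 1,000-key page size and the 1M-file table come from the slide; the 30 ms per-call latency is an assumed illustrative value, not a measurement.

```python
import math

KEYS_PER_LIST = 1_000      # page size quoted on the slide
FILES = 1_000_000          # table size quoted on the slide
ASSUMED_RTT_S = 0.030      # ASSUMPTION: ~30 ms per LIST round-trip

def list_calls(n_files, page_size=KEYS_PER_LIST):
    """Sequential LIST pages needed to enumerate n_files keys."""
    return math.ceil(n_files / page_size)

calls = list_calls(FILES)
planning_seconds = calls * ASSUMED_RTT_S
print(calls, "calls,", planning_seconds, "seconds before any data is read")
```

Each page depends on the continuation token of the previous one, which is why the calls under a single prefix are counted as sequential here.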

L1 · The failure that made the field

ACID dies on a directory listing

Atomicity: an 800-file write appears 1 file at a time
Readers see torn writes for minutes
Consistency: S3 listings were eventually consistent for years
Isolation: none — INSERT OVERWRITE deletes files under running queries

L1 · Iceberg’s move

Define the table by a tree, not a place

Ryan Blue’s team at Netflix, 2018
Stop defining a table by where its files live
Define it by a persistent, immutable metadata tree
The root is a single swappable pointer
The catalog is the only mutable thing in the system

L1 · The tree, level by level

The Iceberg metadata tree

Fig. 7.1. Four layers of immutable files: table metadata (which holds the snapshots) → manifest list → manifests → Parquet data files.
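One way to picture the tree is as nested immutable records. This is a simplified sketch, not the spec's field layout: the class names, the `record_count` field, and the `data/*.parquet` paths are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataFile:
    path: str
    record_count: int

@dataclass(frozen=True)
class Manifest:
    files: tuple            # tuple of DataFile

@dataclass(frozen=True)
class ManifestList:
    manifests: tuple        # tuple of Manifest

@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    manifest_list: ManifestList

@dataclass(frozen=True)
class TableMetadata:
    snapshots: tuple        # every snapshot still retained
    current_snapshot_id: int

    def snapshot(self, snapshot_id=None):
        wanted = self.current_snapshot_id if snapshot_id is None else snapshot_id
        return next(s for s in self.snapshots if s.snapshot_id == wanted)

def live_files(meta, snapshot_id=None):
    """Walk metadata -> snapshot -> manifest list -> manifests -> data files."""
    snap = meta.snapshot(snapshot_id)
    return [f.path for m in snap.manifest_list.manifests for f in m.files]

m1 = Manifest((DataFile("data/a.parquet", 100),))
m2 = Manifest((DataFile("data/b.parquet", 50),))
s1 = Snapshot(1, ManifestList((m1,)))
s2 = Snapshot(2, ManifestList((m1, m2)))   # m1 is shared, not copied
meta = TableMetadata((s1, s2), current_snapshot_id=2)
```

Because every node is immutable, the second snapshot reuses the first manifest by reference; only the new leaf and the path back to the root are written.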

L1 · Plan-cost arithmetic

1M files, `dt = '2026-06-09' AND user_id = 42`

Hive: ~1,000 LIST calls to enumerate files
Then open every file in the matching partition
Cost proportional to table size

Iceberg: 1 metadata JSON (~100 KB) + 1 manifest list
Only manifests whose partition range can contain dt = '2026-06-09'
Only files whose user_id min/max straddle 42
~a dozen files, read in parallel
Same zone-map idea as Week 5 — lifted into the format
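The two pruning levels in the Iceberg column can be sketched as follows. The record shapes and file names are hypothetical; a real planner reads these bounds from the manifest list and manifest files rather than from in-memory tuples.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileStats:
    path: str
    user_id_min: int
    user_id_max: int

@dataclass(frozen=True)
class ManifestEntry:
    dt_min: str             # ISO dates compare correctly as strings
    dt_max: str
    files: tuple            # tuple of FileStats

def plan(manifests, dt, user_id):
    """Return the data files that could hold rows matching both predicates."""
    selected = []
    for m in manifests:
        if not (m.dt_min <= dt <= m.dt_max):
            continue        # whole manifest skipped without opening it
        for f in m.files:
            if f.user_id_min <= user_id <= f.user_id_max:
                selected.append(f.path)
    return selected

manifests = (
    ManifestEntry("2026-06-01", "2026-06-07",
                  (FileStats("old.parquet", 1, 100),)),
    ManifestEntry("2026-06-08", "2026-06-14",
                  (FileStats("a.parquet", 1, 40),
                   FileStats("b.parquet", 41, 90),
                   FileStats("c.parquet", 91, 200))),
)
to_read = plan(manifests, "2026-06-09", 42)
```

The min/max test only helps when the column is clustered across files; if every file spans the full range of user_id, nothing is skipped.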

L1 · Commits

An append, step by step

Write the new Parquet data files
Write a new manifest listing them
Write a manifest list: old + new manifests
Write v4.metadata.json with a new snapshot
Ask the catalog to compare-and-swap v3 → v4
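The five steps above can be sketched against a toy object store. Everything here is an assumption made for illustration: the bucket is a dict, the key names are invented, and the JSON shapes are far simpler than real Iceberg metadata.

```python
import json

store = {}                    # stand-in for the bucket: key -> JSON text
catalog = {"events": None}    # table name -> key of the current metadata file

def put(key, obj):
    assert key not in store, "objects are write-once"
    store[key] = json.dumps(obj)
    return key

def get(key):
    return json.loads(store[key])

def cas(table, expected, new):
    """Swap the pointer only if nobody moved it since we read it."""
    if catalog[table] != expected:
        return False
    catalog[table] = new
    return True

def append(table, version, data_files):
    base_key = catalog[table]
    base = get(base_key) if base_key else {"snapshots": [], "current": None}
    old = get(base["snapshots"][-1]["manifest_list"]) if base["snapshots"] else []
    for path, rows in data_files.items():                               # step 1
        put(path, rows)
    manifest = put(f"meta/manifest-{version}.json", sorted(data_files))  # step 2
    mlist = put(f"meta/snap-{version}.json", old + [manifest])           # step 3
    snap = {"id": version, "manifest_list": mlist}
    meta = put(f"meta/v{version}.metadata.json",                         # step 4
               {"snapshots": base["snapshots"] + [snap], "current": version})
    return cas(table, base_key, meta)                                    # step 5

ok1 = append("events", 1, {"data/a.parquet": [1, 2]})
ok2 = append("events", 2, {"data/b.parquet": [3]})
```

Note that step 3 reuses the earlier manifest key unchanged; an append never rewrites existing manifests.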

L1 · Commits

Where all the atomicity lives

1 CAS

on a few hundred bytes: a metastore lock, a DynamoDB conditional put, or a REST catalog’s conditional POST. Until that swap succeeds, steps 1–4 are just unreferenced objects in a bucket.
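A minimal sketch of that single point of atomicity, with two writers racing from the same base. The `Catalog` class is hypothetical and uses an in-process lock as a stand-in for whatever conditional write a real catalog offers.

```python
import threading

class Catalog:
    """Holds one mutable thing: the pointer to the current metadata file."""

    def __init__(self, pointer):
        self._pointer = pointer
        self._lock = threading.Lock()

    def current(self):
        with self._lock:
            return self._pointer

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._pointer != expected:
                return False      # someone else committed first
            self._pointer = new
            return True

catalog = Catalog("v3.metadata.json")
results = {}
barrier = threading.Barrier(2)    # release both writers at the same moment

def writer(name, new_pointer):
    barrier.wait()
    results[name] = catalog.compare_and_swap("v3.metadata.json", new_pointer)

threads = [
    threading.Thread(target=writer, args=("a", "v4-a.metadata.json")),
    threading.Thread(target=writer, args=("b", "v4-b.metadata.json")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Which writer wins is not deterministic, but exactly one does; the loser's files stay in the bucket, unreferenced.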

L1 · Concurrency

Optimistic concurrency, made cheap

Two writers race → one CAS fails
Loser re-reads metadata, validates for real conflicts
Appends to disjoint files: re-attach manifests, retry
Compaction rewrote the same files: abort
Readers pin a snapshot — never take a lock
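The retry-or-abort decision can be sketched like this. It is a toy model under stated assumptions: a snapshot is just a version number plus a set of file names, and the only conflict rule checked is "a file I intended to remove is already gone".

```python
from dataclasses import dataclass

class CommitConflict(Exception):
    pass

@dataclass(frozen=True)
class Snapshot:
    version: int
    files: frozenset

class Catalog:
    def __init__(self, snap):
        self.snap = snap

    def cas(self, expected_version, new):
        if self.snap.version != expected_version:
            return False
        self.snap = new
        return True

def commit(catalog, base, added=frozenset(), removed=frozenset(), max_retries=3):
    for _ in range(max_retries + 1):
        new = Snapshot(base.version + 1, (base.files - removed) | added)
        if catalog.cas(base.version, new):
            return new
        latest = catalog.snap             # lost the race: re-read metadata
        if not removed <= latest.files:   # real conflict: inputs already rewritten
            raise CommitConflict("files to replace no longer exist")
        base = latest                     # no real conflict: rebase and retry
    raise CommitConflict("retries exhausted")

catalog = Catalog(Snapshot(1, frozenset({"a", "b"})))
stale = catalog.snap
commit(catalog, catalog.snap, added=frozenset({"c"}))       # wins, v2
rebased = commit(catalog, stale, added=frozenset({"d"}))    # loses once, retries

stale = catalog.snap
commit(catalog, stale, added=frozenset({"ab1"}), removed=frozenset({"a", "b"}))
try:
    commit(catalog, stale, added=frozenset({"ab2"}), removed=frozenset({"a", "b"}))
    aborted = False
except CommitConflict:
    aborted = True                        # second compaction must abort
```

The append carrying `d` succeeds on its second attempt because it touches no existing file; the second compaction cannot be replayed because its inputs are gone.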

L1 · Field note

S3 has no rename

Hive’s “atomic” write: write temp directory, then rename
HDFS rename: O(1) metadata operation
S3: copy-then-delete, one object at a time, non-atomic
Millisecond commits on-prem took 10+ minutes in the cloud
Design rule: never need more than “write a new object”
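To see why a copy-then-delete "rename" is not a commit, here is a small sketch over a dict standing in for a bucket. The prefixes and file names are made up, and the generator yields after each object purely so the intermediate states can be observed.

```python
def rename_prefix(store, src, dst):
    """Emulate a directory rename on an object store: copy, then delete, per key.

    Yields after every object so a caller can inspect what a concurrent
    reader would see at that moment.
    """
    for key in sorted(k for k in store if k.startswith(src)):
        store[dst + key[len(src):]] = store[key]   # copy to the new key
        del store[key]                              # delete the old key
        yield

store = {f"tmp/job1/part-{i}.parquet": b"" for i in range(4)}
visible_counts = []
for _ in rename_prefix(store, "tmp/job1/", "table/dt=2026-06-09/"):
    # What a LIST of the destination would return mid-rename.
    visible_counts.append(sum(k.startswith("table/") for k in store))
```

Every intermediate count is a state some reader could observe, which is the torn write the design rule is meant to rule out.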

L1 · Hidden partitioning

Partitions the query can’t miss

Hive leaks dt into queries — miss it, scan everything, silently
Iceberg: partition spec is a transform on a real column
day(event_ts), bucket(16, user_id), truncate(4, zipcode)
Predicate on event_ts → derived partition bounds → pruning
Specs evolve: month(event_ts) for 2024, day(event_ts) from 2025
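The three transforms named above, and the way a predicate on event_ts turns into partition bounds, can be sketched as plain functions. One loud assumption: `bucket` below uses CRC32 as a stand-in hash, whereas the Iceberg spec defines a Murmur3-based one, so the bucket numbers will not match a real table. The sample row is invented.

```python
import zlib
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def day(ts):
    """day(event_ts): whole days since 1970-01-01 UTC."""
    return (ts - EPOCH).days

def bucket(n, value):
    """bucket(n, col). ASSUMPTION: CRC32 stands in for the spec's hash."""
    return zlib.crc32(str(value).encode("utf-8")) % n

def truncate(width, text):
    """truncate(width, col) for strings: keep the first `width` characters."""
    return text[:width]

def partition_of(row):
    return (day(row["event_ts"]),
            bucket(16, row["user_id"]),
            truncate(4, row["zipcode"]))

def prune_days(partition_days, ts_lo, ts_hi):
    """A range predicate on event_ts becomes an inclusive range of day values."""
    lo, hi = day(ts_lo), day(ts_hi)
    return [d for d in partition_days if lo <= d <= hi]

row = {"event_ts": datetime(2026, 6, 9, 14, 30, tzinfo=timezone.utc),
       "user_id": 42, "zipcode": "94107"}
part = partition_of(row)
```

The query never mentions a partition column; the planner applies the same transform to the predicate's bounds that the writer applied to the rows.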

L1 · Schema evolution

Columns are IDs, not positions

Every column carries a permanent integer ID
rename just changes the name attached to ID 7
drop + add of the same name → a fresh ID
Old Parquet is read by ID — deleted data can’t resurrect
Kills the classic Hive bug: drop a column, everything shifts left
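A small sketch of reading by ID rather than by name or position. The schemas, IDs, and values are invented for illustration; a stored row is modeled as a dict keyed by column ID.

```python
def read_row(file_row, schema):
    """Project a stored row through the current schema.

    file_row: {column_id: value} as written in the data file.
    schema:   {column_id: current_name} from table metadata.
    Columns absent from the file come back as None.
    """
    return {name: file_row.get(cid) for cid, name in schema.items()}

old_file_row = {1: 42, 7: "eu"}            # written when ID 7 was "region"

schema_v1 = {1: "user_id", 7: "region"}
schema_v2 = {1: "user_id", 7: "geo"}       # rename: same ID, new name
schema_v3 = {1: "user_id"}                 # drop: ID 7 retired
schema_v4 = {1: "user_id", 8: "region"}    # re-add same name: fresh ID 8

renamed = read_row(old_file_row, schema_v2)
readded = read_row(old_file_row, schema_v4)
```

Under `schema_v4` the old "eu" value stays invisible, because the file only knows ID 7 and the new `region` column is ID 8.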

L1 · Time travel & Delta

Free snapshots, and Delta’s other route

FOR VERSION AS OF 8124 plans an old snapshot — free
expire_snapshots: reference-counting GC over the bucket
Delta: _delta_log/ of ordered JSON add/remove actions
Commit = atomically create the next-numbered log file
Redo log vs. functional tree — converging (Apache XTable)
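Delta's route can be sketched as replaying add/remove actions in version order. This is a simplified model: the log is a dict keyed by version instead of numbered files under `_delta_log/`, the actions carry only a `path`, and checkpoints are omitted.

```python
import json

def commit(log, version, actions):
    """Atomic create of the next-numbered entry; fails if it already exists."""
    if version in log:
        return False
    log[version] = [json.dumps(a) for a in actions]
    return True

def replay(log):
    """Apply every action in version order to get the live file set."""
    live = set()
    for version in sorted(log):
        for line in log[version]:
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live

def as_of(log, version):
    """Time travel: replay only the log prefix up to `version`."""
    return replay({v: a for v, a in log.items() if v <= version})

log = {}
first = commit(log, 0, [{"add": {"path": "a.parquet"}},
                        {"add": {"path": "b.parquet"}}])
second = commit(log, 1, [{"remove": {"path": "a.parquet"}},
                         {"add": {"path": "c.parquet"}}])
raced = commit(log, 1, [{"add": {"path": "d.parquet"}}])   # version 1 is taken
```

The contrast with the tree is where the state lives: here it is derived by replay, whereas an Iceberg snapshot names its full file set directly.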

Lecture 2 · Thursday

Photon and Engines Over Open Files

Can open files perform like a warehouse? The burden of proof is the engine.

L2 · The two-tier problem

Two copies of the truth

A lake of cheap Parquet for ML
A proprietary warehouse for BI
A fleet of ETL jobs shuttling data between them
Every hop adds staleness, cost, and a drifting second copy
The Lakehouse paper (CIDR 2021) names this two-tier world as the problem

L2 · The lakehouse argument

Three falsifiable bets

Metadata fixes ACID — Tuesday’s transactional layer
Caching + statistics fix latency — NVMe hides S3’s ~50–100 ms first byte
The format gap is closable — Parquet ≈ warehouses’ internal layout, standardized
Nothing claims novelty; the claim is architectural: open at every lock-in interface

L2 · Photon’s why

The JVM in the inner loop

Spark SQL generated Java bytecode per query (whole-stage codegen)
HotSpot’s JIT refuses to compile a method past 8 KB of bytecode, which is exactly what the widest operators generate
GC pauses scale with heap
With NVMe caches, CPU — not I/O — became the limit

L2 · Photon’s how

Vectorized and interpreted, in C++

Columnar batches through precompiled kernels — MonetDB/X100, Week 5
Chose interpretation over codegen: observable, debuggable kernels
Adaptive per batch: ASCII fast path, no-NULL paths, sparse/dense vectors
Slots under Spark’s optimizer via JNI, reads the same Parquet
An engine swap invisible above the operator boundary
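The per-batch adaptivity can be illustrated with a toy kernel dispatcher. This is only a sketch of the idea in Python, not Photon's C++ code: the kernel names are invented, and the list comprehensions stand in for tight loops over columnar buffers.

```python
def upper_ascii(values):
    """Fast path: byte-wise uppercase, valid only when every value is ASCII."""
    return [v.encode("ascii").upper().decode("ascii") for v in values]

def upper_generic(values):
    """Slow path: full Unicode-aware uppercase."""
    return [v.upper() for v in values]

def upper_kernel(batch):
    """Uppercase one batch of str-or-None, choosing kernels per batch.

    Returns (result, path) so the chosen path is observable.
    """
    has_nulls = any(v is None for v in batch)
    # No-NULL batches skip the filtering and re-stitching work entirely.
    values = [v for v in batch if v is not None] if has_nulls else batch
    path = "ascii" if all(v.isascii() for v in values) else "generic"
    kernel = upper_ascii if path == "ascii" else upper_generic
    out = kernel(values)
    if has_nulls:
        it = iter(out)
        out = [None if v is None else next(it) for v in batch]
    return out, path

fast = upper_kernel(["abc", "de"])
slow = upper_kernel(["é", None])
```

The decision is made once per batch rather than once per row, so the check is amortized over every value in the batch.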

L2 · Results

The claim, not hand-waved

3×

average speedup over the prior Spark engine on customer workloads, and beyond 10× on the most compute-heavy queries (SIGMOD 2022).

L2 · Results

The industry conceded by imitation

Headline datapoint: 100 TB TPC-DS world record, 2021, over open formats
Apply Week 2’s benchmark skepticism — the direction still stands
Trino, DuckDB, ClickHouse, Snowflake, BigQuery: all vectorized over Parquet/Iceberg
Polars, DataFusion, Velox compete inside the open-format world

L2 · The catalog wars

The new choke point is the map

Anyone can read the bytes → control moves to the catalog
Databricks: Unity Catalog, open-sourced under pressure
Snowflake: Polaris — open Iceberg REST catalog protocol
Databricks paid a reported ~$1–2B for Tabular that same week
The REST protocol is the JDBC of this era

L2 · The agent angle

Agents skip the SQL front door

REST catalog → metadata location + vended credential
Parquet read straight into Arrow with DuckDB or Polars
No server enforces semantics on that path
Snapshot, column IDs, transforms — all live in the format
The format is the database; the spec is its source code

L2 · The agent angle

Metadata is load-bearing now

<20%

baseline text-to-SQL accuracy on Spider 2.0-style enterprise benchmarks — largely because real catalogs are documentation deserts. Descriptions, join paths, and freshness in context move it dramatically.

L2 · Recap

Three table definitions

Property	Hive directories	Iceberg (tree)	Delta (log)
Table definition	whatever `LIST` returns	snapshot in metadata file	log replay from checkpoint
Commit atomicity	none (file-at-a-time)	CAS on catalog pointer	atomic create of next log file
Plan cost, 1M files	~1,000 LIST calls + opens	tens of metadata reads	checkpoint + log tail
Partition predicate	user must name `dt`	hidden: transform on column	generated columns (partial)
Schema evolution	positional; unsafe drops	column IDs; safe rename/drop	column mapping (opt-in)
Time travel	—	snapshot pin, O(1)	log version pin

When every client reads the files directly, the format is the database — and the catalog is the only map the agent has.

— Week 7 lecture notes

Checkpoint · Discussion

Before you leave

Which single Iceberg statistic skips files within the matching week — and what data property makes it useless for user_id? (Ex 7.1)
In your toyberg, which racing commits may retry and which must abort? (Ex 7.2)
How could an agent detect stale catalog metadata rather than trust it? (Ex 7.3)

Readings · Due Thursday

Read before Thursday

Lakehouse — Armbrust, Ghodsi, Xin, Zaharia, CIDR 2021. Extract the three bets in §3; which carries the most risk?
Apache Iceberg Table Spec, v2 — ASF; pair with Ryan Blue’s Netflix 2018 talk. Find where per-column bounds live.
Photon — Behm, Palkar, Agarwal, et al., SIGMOD 2022. Focus on §3’s vectorize-and-interpret decision; skim the eval skeptically.
