Week 7: The Lakehouse & Open Formats

Learning objectives — after this week you can…

Trace a query through Iceberg’s metadata tree (table metadata → snapshot → manifest list → manifests → data files) and count exactly which files get read during planning.
Explain why one atomic compare-and-swap on a metadata pointer is sufficient to give serializable commits over an object store with no server in the write path.
Contrast hidden partitioning and ID-based schema evolution with Hive’s directory-and-name conventions, and predict which queries each breaks.
Summarize the lakehouse argument (CIDR 2021) and Photon’s engineering bet (SIGMOD 2022): vectorized C++ over open Parquet beats a closed JVM engine without giving up format openness.
Argue for or against the claim that for agent workloads the catalog — not the engine — is the accuracy-critical component.

Lecture 1 · Tuesday

Iceberg Internals: The Metadata Tree

Start with the failure that made this field. Until roughly 2018, a “table” on a data lake was a Hive-style convention: a directory, with subdirectories encoding partitions (dt=2026-06-09/region=eu/), and a metastore row pointing at the root. The set of files in the table was defined as whatever a directory listing returns right now. Every property of a real database that you’ve taken for granted since Week 1 dies on that definition. Atomicity: a job writing 800 files into a partition is visible 1 file at a time, so readers see torn writes for minutes. Consistency: S3’s LIST pages 1,000 keys per call, so planning a query over a million-file table costs 1,000 sequential API round-trips — tens of seconds before a single byte of data is read — and for years those listings were only eventually consistent. Isolation: there is none; INSERT OVERWRITE deletes files out from under running queries. The Hive model isn’t a bad database. It is not a database. Iceberg’s contribution — Ryan Blue’s team at Netflix, 2018 — is to stop defining a table by where its files live and instead define it by a persistent, immutable metadata tree whose root is a single swappable pointer.

The tree, level by level

An Iceberg table is four layers of immutable files, each written once and never modified. At the root sits the table metadata file (v3.metadata.json): the schema (with every column carrying a permanent integer ID), the partition spec, table properties, and the list of snapshots — every committed version of the table, each with a timestamp and an operation summary. A snapshot points to one manifest list, an Avro file with one row per manifest, including per-manifest partition-value ranges. Each manifest is itself an Avro file listing data files, and for every file it records the partition tuple, record count, file size, and per-column lower/upper bounds, null counts, and NaN counts. The leaves are ordinary Parquet data files. The catalog — the only mutable thing in the whole system — stores one pointer per table: “db.events → v3.metadata.json”.

Fig. 7.1 — The Iceberg metadata tree. A query plans by walking metadata files (small, statistics-rich Avro/JSON), never by listing directories. A commit creates new files bottom-up, then performs one atomic swap of the catalog pointer.

Do the arithmetic for a 1M-file table, query WHERE dt = '2026-06-09' AND user_id = 42. Hive: ~1,000 LIST calls just to enumerate files, then open every file in the matching partition. Iceberg: read 1 metadata JSON (~100 KB), 1 manifest list (one row per manifest — a few thousand rows), then only the manifests whose partition range can contain dt = 2026-06-09, then only the data files whose user_id min/max bounds straddle 42. Planning touches perhaps a dozen files, in parallel, with cost proportional to data that might match, not table size. The min/max pruning at the manifest level is the same zone-map idea you saw inside column stores in Week 5 — lifted out of the engine and into the table format itself, where every engine inherits it for free.

Commits: one CAS to rule them all

Writes exploit immutability. A writer appending data (1) writes its Parquet files, (2) writes a new manifest listing them, (3) writes a new manifest list that includes both old and new manifests, (4) writes a new table metadata file v4.metadata.json with a new snapshot, and (5) asks the catalog to atomically compare-and-swap the table pointer from v3 to v4. Steps 1–4 are invisible to everyone — just unreferenced objects in a bucket. All the atomicity in the system is concentrated into step 5, a single CAS on a few hundred bytes, done in a metastore lock, a DynamoDB conditional put, or a REST catalog’s conditional POST. If two writers race, one CAS fails; the loser re-reads the current metadata, checks whether the winner’s changes actually conflict with its own (two appends to disjoint files don’t — it can re-attach its manifests to the new base and retry; a compaction that rewrote files the loser also rewrote does, and aborts). This is textbook optimistic concurrency control from Week 3, with the validation step made cheap because manifests tell you exactly which files each transaction touched. Readers, meanwhile, pin a snapshot at plan time and never take a lock: every file they will read is immutable, so a long analytical scan can run for an hour while a hundred commits land.

Field note Netflix’s motivating horror story: an atomic-looking Hive write was implemented as “write to temp directory, then rename.” On HDFS, rename is an O(1) metadata operation. On S3 there is no rename — it’s copy-then-delete, one object at a time, non-atomic and slow. Jobs that “committed” in milliseconds on-prem took 10+ minutes in the cloud and could fail halfway, leaving half-renamed tables that on-call engineers reconciled by hand. Iceberg’s design rule — never depend on a filesystem operation stronger than “write a new object” — falls directly out of that incident.

Hidden partitioning and schema evolution

Hive partitioning leaks into queries: the table has a dt string column derived from event_ts, and if an analyst (or an LLM) writes WHERE event_ts > '2026-06-09' instead of WHERE dt = ..., the engine scans the entire table — silently. Iceberg stores the partition spec as a transform on a real column: day(event_ts), bucket(16, user_id), truncate(4, zipcode). The engine sees a predicate on event_ts, applies the transform to derive partition bounds, and prunes — the user never names a partition column because none exists in the schema. Better: the spec can change. Partition by month(ts) for 2024’s data, switch to day(ts) in 2025; old files keep their old spec, new files get the new one, and queries plan against both. Schema evolution gets the same treatment via column IDs: rename changes a name attached to ID 7, drop+add of the same name produces a fresh ID, so old Parquet files are read by ID and can never resurrect deleted data or mismatch on position — the classic CSV-on-Hive bug where dropping a middle column shifted every subsequent column one place to the left.

Time travel, and Delta’s alternative

Because snapshots are just rows in the metadata file pointing at immutable trees, time travel is free: SELECT ... FOR VERSION AS OF 8124 plans against an old snapshot’s manifest list. Old snapshots are garbage-collected by an explicit expire_snapshots maintenance action that finds unreachable files — a reference-counting GC over the bucket. We will lean on snapshots hard in Week 8: an agent that can pin a snapshot gets a stable world to reason over, and an agent whose write is rejected by the CAS gets a clean, machine-readable conflict instead of corrupted state.

Delta Lake reaches the same destination by a different route: instead of a tree re-rooted on every commit, Delta keeps a transaction log — a directory _delta_log/ of ordered JSON files (000…0123.json), each holding add/remove file actions plus metadata changes; commit atomicity comes from atomically creating the next-numbered log file, and readers reconstruct table state by replaying the log from the latest Parquet checkpoint (a compacted snapshot of the log written every ~10 commits). It’s a redo log where Iceberg is a persistent functional data structure; the trade-offs are second-order (log replay vs. tree walk, per-commit metadata size), and the formats have been converging — Delta’s checkpoints carry the same file-level statistics, and translation layers like Apache XTable expose one set of data files under both formats’ metadata.

Lecture 2 · Thursday

Photon and Engines Over Open Files

Tuesday established that open files plus a metadata tree can behave like a table. Thursday’s question is whether they can perform like a warehouse — because if not, the two-tier architecture survives: a lake of cheap Parquet for ML and a proprietary warehouse for BI, with a fleet of ETL jobs shuttling data between them, each hop adding staleness, cost, and a second copy of the truth that drifts from the first. The lakehouse paper (CIDR 2021) names this two-tier world as the problem and makes a falsifiable claim: a system built on open formats in commodity object storage, with a transactional metadata layer (Lecture 1) and a sufficiently good engine, can match warehouse performance — so the warehouse’s only remaining moat, the closed storage format it can co-design with its engine, isn’t worth what it costs you in lock-in and copies. The burden of proof sits entirely on “sufficiently good engine.” Photon (SIGMOD 2022) is Databricks’ attempt to discharge it.

The lakehouse argument, stated precisely

Strip the marketing and the CIDR paper makes three technical bets. First, metadata fixes ACID: Delta/Iceberg-style layers give the lake transactions, time travel, and schema enforcement — covered Tuesday. Second, caching and statistics fix latency: hot Parquet cached on local NVMe of stateless compute nodes hides S3’s ~50–100 ms first-byte latency, and file-level zone maps fix the “scan everything” cost model. Third, the format gap is closable: whatever a proprietary format gains from co-design (better encodings, custom indexes), an engine over Parquet can recover through smarter execution and auxiliary data structures, because Parquet already gives you columnar layout, dictionary and RLE encodings, and per-row-group statistics — it’s roughly the same physical design warehouses use internally, standardized. Notice what the argument does not claim: that any single component is novel. The claim is architectural — open at every interface where lock-in used to live.

Photon: evicting the JVM from the inner loop

By 2020 Databricks had a specific bottleneck: Spark SQL’s execution layer. Spark generated Java bytecode per query (whole-stage codegen), which worked until it didn’t — the JIT bails out of methods over 8 KB of bytecode, falling back to the interpreter precisely on the widest, most complex operators; GC pauses scale with heap; and on cloud NVMe-cached NVMe-fast storage, CPU — not I/O — had become the limiting resource for typical customer queries. Photon’s answer: rewrite the execution engine in C++ as a vectorized, interpreted engine — columnar batches of a few thousand values flowing through precompiled operator kernels, the MonetDB/X100 design from Week 5 — rather than doing code generation in C++. The paper is candid about why: vectorization made the engine observable (a fixed set of kernels you can profile and count) and debuggable, and runtime adaptivity replaced compile-time specialization — kernels switch per batch based on observed data, e.g. an ASCII fast path for strings, dedicated paths for batches with no NULLs, sparse-vs-dense activity vectors after a selective filter. Photon slots in under Spark’s existing optimizer and task scheduler via JNI, consuming the same Parquet — an engine swap invisible above the operator boundary, which is itself the lakehouse thesis in microcosm: when the storage format is open, the engine is a replaceable component, and vendors compete on execution rather than on holding your data hostage.

Results, so the claim isn’t hand-waved: the SIGMOD paper reports roughly 3× average speedup over the prior Spark engine on customer workloads, about 3× on average over the previous Databricks engine and beyond 10× on the most compute-heavy queries, and the headline external datapoint — a 100 TB TPC-DS world-record submission in 2021, a benchmark warehouses had owned forever, run over open-format storage. Treat vendor benchmarks with Week 2’s skepticism, but the directional point stands and the rest of the industry conceded it by imitation: every major engine — Trino, DuckDB, ClickHouse, Snowflake (which added Iceberg tables), BigQuery (BigLake) — now executes vectorized over Parquet/Iceberg, and the startups (Polars, DataFusion, Velox as a reusable C++ execution library) compete inside the open-format world rather than against it.

Historical aside There’s an irony worth savoring: Spark’s 2014 pitch was that the JVM was fine and the bottleneck was disk and network; a decade of NVMe and 100 Gb networking later, Databricks paid C++ engineers to remove the JVM from its own product’s inner loop. Hardware trends don’t argue. They invalidate.

The catalog wars

Open data files created a new choke point one level up: if anyone’s engine can read the bytes, whoever controls the catalog — the mutable pointer from Lecture 1, plus permissions, audit, and discovery — controls the platform. Hence the 2023–24 “catalog wars,” compressed: Databricks pushed Unity Catalog as the governance layer for everything (tables, files, ML models), then open-sourced it under competitive pressure; Snowflake countered by launching Polaris, an open-source implementation of the Iceberg REST catalog protocol — a vendor-neutral HTTP API for the operations Lecture 1 made fundamental: resolve a table to its metadata location, and commit via conditional swap, plus namespaces and credential vending. The REST protocol is the durable artifact here, the JDBC of this era: engines program against it once, catalogs compete behind it, and the metadata pointer stops being a moat. Both companies spent 2024 buying or building toward the same conclusion (Databricks paid a reported ~$1–2B for Tabular, the Iceberg creators’ company, the week of Polaris’s announcement) — which tells you exactly where the industry believes the remaining strategic value sits. Not in the files. In the map of the files.

The agent angle: the catalog is the map

Now connect this to the course thesis. Agent frameworks increasingly skip the SQL front door entirely: a Python-tool-equipped agent calls the REST catalog, gets the metadata location and a vended credential, and reads Parquet straight into Arrow with DuckDB or Polars. There is no server enforcing semantics in that path. Everything that makes the result a table rather than some files — the snapshot it pinned, the schema with its column IDs, the partition transforms it must respect to avoid scanning a petabyte — lives in the format. When every client reads the files directly, the table format is the database, and the spec document is its source code.

And the catalog inherits the role the DBA used to play for humans. An analyst who queried the wrong revenue table got corrected in code review; an agent picks its table by reading catalog metadata, so a stale description or a missing deprecation flag converts directly into a confidently wrong answer. Measure it: text-to-SQL accuracy on enterprise schemas jumps dramatically when column descriptions, canonical join paths, and freshness annotations are in context — the Spider 2.0-style enterprise benchmarks put baseline accuracy below 20% largely because real catalogs are documentation deserts. Discoverability metadata used to be a librarian’s nicety. For agent clients it is load-bearing for correctness — which is why “semantic layer” quietly became the most contested feature in every catalog vendor’s roadmap.

When every client reads the files directly, the format is the database — and the catalog is the only map the agent has.

Property	Hive-style directories	Iceberg (metadata tree)	Delta (transaction log)
Table definition	whatever `LIST` returns	snapshot in metadata file	log replay from checkpoint
Commit atomicity	none (file-at-a-time)	CAS on catalog pointer	atomic create of next log file
Plan cost, 1M files	~1,000 LIST calls + opens	~10s of metadata file reads	checkpoint + log tail read
Partition predicate	user must name `dt` column	hidden: transform on real column	generated columns (partial)
Schema evolution	positional; unsafe drops	column IDs; safe rename/drop	column mapping (IDs, opt-in)
Time travel	—	snapshot pin, O(1)	log version pin

Readings

Read Before Thursday

Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics — Armbrust, Ghodsi, Xin, Zaharia, CIDR 2021.Read it as an argument, not an ad: extract the three technical bets in §3 and decide which one carries the most risk. Note what the paper says two-tier architectures cost in staleness and copies.

Apache Iceberg Table Spec, v2 — Apache Software Foundation; pair with Ryan Blue’s “Iceberg: a fast table format for S3” talk, Netflix 2018.Focus on the manifest and snapshot sections: find where per-column bounds live and convince yourself the commit protocol needs nothing stronger than one CAS. The talk supplies the Hive war stories the spec politely omits.

Photon: A Fast Query Engine for Lakehouse Systems — Behm, Palkar, Agarwal, et al., SIGMOD 2022.Focus on §3’s decision to vectorize-and-interpret rather than code-generate (and the JVM pathologies in §2), plus the adaptive per-batch kernels. Skim the eval with Week 2’s benchmark skepticism.

Exercises

This Week’s Problems

Exercise 7.1 · warm-up

A table holds 730 daily partitions × 1,400 Parquet files each (≈1.02M files). S3: LIST returns 1,000 keys per call at ~30 ms; GET first-byte ~60 ms; metadata files as described in lecture (1 JSON root, 1 manifest list with 950 manifest entries, manifests covering ~1,100 data files each). For the query WHERE event_ts BETWEEN '2026-06-01' AND '2026-06-07' AND user_id = 42, compute the number of object-store operations needed to plan the query under (a) Hive-style listing and (b) Iceberg, assuming manifests are partition-aligned and 64-way parallel metadata reads. Then state the planning wall-clock for each. Finally: which single Iceberg statistic lets you skip data files within the matching week, and what data property would make it useless for user_id?

Exercise 7.2 · core

Build toyberg: a ~300-line table format over a local directory. Requirements: (1) immutable Parquet data files plus JSON manifests recording per-file min/max for every column; (2) a root pointer file updated only via atomic rename(2) of a temp file, with the expected prior version embedded so a stale writer fails; (3) scan(table, predicate, version=None) that prunes by manifest stats and supports time travel; (4) commit retry that re-validates: appends touching disjoint files succeed on retry, overlapping rewrites abort. Demonstrate with two concurrent writer processes racing 100 commits each — show zero lost updates and report how many retries occurred. Bonus: implement expire_snapshots and prove with a file census that it removes exactly the unreachable files.

Exercise 7.3 · stretch

Design and evaluate an agent-facing catalog. Take any Iceberg REST catalog (Polaris, Lakekeeper, or your 7.2 toy behind HTTP) and a schema of ≥20 tables with realistic ambiguity (e.g., three revenue-like tables at different grains, one deprecated). Define a metadata extension — semantic descriptions, canonical join paths, freshness SLAs, deprecation, per-column value samples — and decide where it lives (table properties? separate endpoint? embedded in LoadTableResult?) and how it stays true (who writes it, what CI validates it against the actual schema and snapshot history?). Then measure: run an LLM agent on 30 natural-language questions in three conditions — bare catalog, your enriched catalog, and your enriched catalog with 20% of descriptions deliberately stale — reporting accuracy and tokens consumed per condition. The third condition is the interesting one: propose and prototype one mechanism that lets the agent detect stale metadata rather than trust it. This is an open research problem; a negative result with a sharp analysis earns full credit.

❦