Resources & Reading List

§ 1 · The Canon

The 42 Weekly Readings, Annotated

Three readings a week for fourteen weeks. The list below is the complete canon, organized by part, with each paper's one-line reading instruction carried over from its week page. Two readings are each assigned in two different weeks (the 2005 What Goes Around Comes Around and the Anthropic self-service analytics post), so the deduplicated list runs to forty entries. Where a reading is assigned twice, both week numbers are shown: the repetition is the assignment.

Part I — Foundations Under New Workloads (Weeks 1–4)

1.1

Architecture of a Database System — Hellerstein, Stonebraker & Hamilton, Foundations and Trends in Databases, 2007. Sections 1–4 only. The canonical map of the machine. Focus on the process model (§2) and the life of a query (§1.1) — and, on every page, ask which stated assumption is about client behavior.

1.2

What Goes Around Comes Around — Stonebraker & Hellerstein, in Readings in Database Systems, 4th ed., 2005. Assigned twice: Weeks 1 & 14. WEEK 1: thirty-five years of data-model fashion cycles in twenty pages — read it as inoculation against rebranded old ideas. WEEK 14: skim the XML chapter and grade its 2005 predictions against the 2024 scorecard — a rare controlled experiment in technological forecasting.

1.3

Self-Service Analytics with Claude — Anthropic engineering blog, June 2026. Assigned twice: Weeks 1 & 9. The 21%→95% result. WEEK 1: notice that none of the gains required touching the database engine. WEEK 9: focus on what the curated skills actually contain (metric definitions, join guidance, pitfalls) and who maintains them — it is a semantic layer in everything but name.

2.1

Database Internals, ch. 2–4 & 7 — Alex Petrov, O’Reilly, 2019. Ch. 2–4 give you the B-tree at implementation depth (cell layouts, splits, B-link); ch. 7 is the cleanest LSM treatment in print. Read with a pencil; redo the fanout math for 16 KB pages.

2.2

Designing Access Methods: The RUM Conjecture — Athanassoulis, Kester, Maas, Stoica, Idreos, Ailamaki & Callaghan, EDBT 2016. Short and sharp. Focus on the overhead definitions in §2 and the design-space figure; come to class able to say which corner the last system you used had silently chosen.

2.3

The Log-Structured Merge-Tree (LSM-Tree) — O’Neil, Cheng, Gawlick & O’Neil, Acta Informatica, 1996. Read §1–3 for the rolling-merge idea; skim the rest. The cost model is HDD-era — your job is to notice exactly which assumptions flash and NVMe broke, and which survived.

3.1

“One Size Fits All”: An Idea Whose Time Has Come and Gone — Stonebraker & Çetintemel, ICDE 2005. Read for the method, not the predictions: how to argue from workload characteristics to architecture. Note which of its bets aged well (warehousing, streams) and which didn’t.

3.2

C-Store: A Column-oriented DBMS — Stonebraker et al., VLDB 2005. Focus on §3–4: projections, sort-order-dependent compression, and the WS/RS split. Ask yourself at every design choice: which corner of the RUM triangle is being traded away?

3.3

MonetDB/X100: Hyper-Pipelining Query Execution — Boncz, Zukowski & Nes, CIDR 2005. Focus on §2 (why TPC-H Q1 gets <10% of hand-coded performance in tuple-at-a-time engines) and the vector-size experiment — the U-shaped curve is the whole lecture in one figure.

4.1

Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases — Verbitski et al., SIGMOD 2017. The canonical “ship only the log” paper. Focus on §2–3: the write amplification accounting in Figure 2, and derive the 4/6–3/6 quorum from the AZ+1 model yourself before reading their derivation.

4.2

The Snowflake Elastic Data Warehouse — Dageville et al., SIGMOD 2016. Read for the three-layer architecture and §3.3 on micro-partitions and pruning. Ask at every section: which property here is enabled by immutability alone?

4.3

Building an Elastic Query Engine on Disaggregated Storage — Vuppalapati et al., NSDI 2020. The production retrospective: what the 2016 bet got right, measured. Focus on the workload skew and cache hit-rate data, and the open problem of intermediate (shuffle/spill) data.
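Entry 4.1 asks you to derive the 4/6–3/6 quorum yourself. Once you have tried it on paper, a brute-force check like the sketch below can confirm your answer. The four constraints are one reading of the AZ+1 model (quorum overlap, no conflicting writes, writes survive the loss of one AZ, reads survive the loss of one AZ plus one more copy), and the function name `feasible_quorums` is hypothetical, not taken from the paper.

```python
def feasible_quorums(azs, copies_per_az):
    """Enumerate (write_quorum, read_quorum) pairs that satisfy all four constraints.

    The constraints are an assumed formalization of the AZ+1 model, stated here
    so you can compare them with your own derivation.
    """
    v = azs * copies_per_az  # total copies of each piece of data
    found = []
    for vw in range(1, v + 1):
        for vr in range(1, v + 1):
            reads_see_latest_write = vr + vw > v          # read and write quorums overlap
            writes_cannot_conflict = 2 * vw > v           # two write quorums overlap
            can_write_after_az_loss = v - copies_per_az >= vw
            can_read_after_az_plus_one = v - copies_per_az - 1 >= vr
            if (reads_see_latest_write and writes_cannot_conflict
                    and can_write_after_az_loss and can_read_after_az_plus_one):
                found.append((vw, vr))
    return found


print(feasible_quorums(azs=3, copies_per_az=1))  # [] : three copies cannot meet AZ+1
print(feasible_quorums(azs=3, copies_per_az=2))  # [(4, 3)] : six copies, write 4/6, read 3/6
```

Under these assumptions the six-copy layout admits exactly one quorum pair, which is why the numbers in the paper are forced rather than chosen.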

Part II — New Access Methods & Engines (Weeks 5–8)

5.1

The Case for Learned Index Structures — Kraska, Beutel, Chi, Dean & Polyzotis, SIGMOD 2018. Read §1–3 closely for the CDF framing and the RMI; treat the eval skeptically and bring one benchmark objection (the SOSD authors found several).

5.2

Bao: Making Learned Query Optimization Practical — Marcus, Negi, Mao, Tatbul, Alizadeh & Kraska, SIGMOD 2021. Focus on §2’s design constraints and the Thompson-sampling loop — note how every choice traces back to a specific deployment failure of Neo.

5.3

SageDB: A Learned Database System — Kraska, Alizadeh, Beutel, Chi, Ding, Kristo, Leclerc, Madden, Mao & Nathan, CIDR 2019. Read as a manifesto, not a system paper; as you read, mark each proposed component as policy or mechanism, and check your marks against what Redshift shipped.

6.1

Product Quantization for Nearest Neighbor Search — Jégou, Douze & Schmid, IEEE TPAMI, 2011. The codebook math behind every compressed vector index. Work through §III until the 256^16-codebook trick and the asymmetric distance tables feel obvious; skim the GIST results.

6.2

Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs — Malkov & Yashunin, IEEE TPAMI, 2018. Focus on the neighbor-selection heuristic (Alg. 4) and the layer assignment — the skip-list analogy is in the paper. Note carefully what the memory model assumes.

6.3

DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node — Subramanya, Devvrit, Simhadri, Krishnaswamy & Kadekodi, NeurIPS 2019. Read for the systems decisions, not the graph theory: why α > 1, why PQ steers while SSD stores, why node + adjacency share a block. Compare its cost model against HNSW’s before class.

7.1

Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics — Armbrust, Ghodsi, Xin & Zaharia, CIDR 2021. Read it as an argument, not an ad: extract the three technical bets in §3 and decide which one carries the most risk. Note what the paper says two-tier architectures cost in staleness and copies.

7.2

Apache Iceberg Table Spec, v2 — Apache Software Foundation; pair with Ryan Blue’s “Iceberg: a fast table format for S3” talk, Netflix, 2018. Focus on the manifest and snapshot sections: find where per-column bounds live and convince yourself the commit protocol needs nothing stronger than one CAS. The talk supplies the Hive war stories the spec politely omits.

7.3

Photon: A Fast Query Engine for Lakehouse Systems — Behm, Palkar, Agarwal et al., SIGMOD 2022. Focus on §3’s decision to vectorize-and-interpret rather than code-generate (and the JVM pathologies in §2), plus the adaptive per-batch kernels. Skim the eval with Week 2’s benchmark skepticism.

8.1

Calvin: Fast Distributed Transactions for Partitioned Database Systems — Thomson, Diamond, Weng, Ren, Shao & Abadi, SIGMOD 2012. Focus on §3 (the sequencer) and §3.2.1 (dependent transactions / OLLP) — convince yourself why determinism really removes 2PC, and what it costs.

8.2

Spanner: Google’s Globally-Distributed Database — Corbett et al., OSDI 2012. Read §3 (TrueTime) and §4.1.2 (commit wait) closely; skim the rest. Work the invariant: why must locks be held through the wait?

8.3

Neon architecture posts: “Architecture decisions in Neon” & the pageserver/branching deep-dives — Neon engineering blog, 2022–24. Focus on GetPage@LSN, the layer-file map, and what branch creation actually writes — verify Thursday’s O(metadata) claim against their design.
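Entry 7.2 asks you to convince yourself that the table commit needs nothing stronger than one CAS. The sketch below is a toy model of that idea, not Iceberg's implementation: `Catalog`, `commit_append`, and the in-memory `store` are hypothetical names, and a `threading.Lock` stands in for whatever atomic swap a real catalog provides. Each writer builds a new immutable metadata version off the version it read, then tries to swap the single pointer; a failed swap means someone else committed first, so the writer rebases and retries.

```python
import threading
import uuid


class Catalog:
    """Hypothetical catalog holding one pointer to the current table metadata."""

    def __init__(self, initial_id):
        self._current = initial_id
        self._lock = threading.Lock()  # stands in for the catalog's atomic swap

    def current(self):
        return self._current

    def compare_and_swap(self, expected_id, new_id):
        """Move the pointer to new_id only if it still equals expected_id."""
        with self._lock:
            if self._current != expected_id:
                return False
            self._current = new_id
            return True


def commit_append(catalog, store, new_files, max_retries=100):
    """Optimistic append: write new metadata, then CAS the pointer; retry on conflict."""
    for _ in range(max_retries):
        base_id = catalog.current()
        new_id = uuid.uuid4().hex
        # Metadata versions are immutable: a new tuple is written, the old one is untouched.
        store[new_id] = store[base_id] + tuple(new_files)
        if catalog.compare_and_swap(base_id, new_id):
            return new_id
        # Lost the race: the version just written is orphaned; rebase on the new head.
    raise RuntimeError("too many concurrent commits; giving up")


store = {"v0": ()}  # metadata id -> tuple of data files in that version
catalog = Catalog("v0")
writers = [
    threading.Thread(target=commit_append, args=(catalog, store, [f"file-{i}.parquet"]))
    for i in range(8)
]
for w in writers:
    w.start()
for w in writers:
    w.join()

print(sorted(store[catalog.current()]))  # all eight files, no lost appends
```

The point to check against the spec is that the only shared mutable state is the pointer; everything a writer produces before the swap is invisible to readers until the swap succeeds.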

Part III — Semantics, Agents, Governance (Weeks 9–12)

9.1

Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows — Lei et al., ICLR 2025. The 86→17 cliff, quantified. Focus on §3’s task construction and the error analysis: count how many failures are schema-scale or business-logic problems rather than SQL-skill problems.

9.2

Can LLM Already Serve as a Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs (BIRD) — Li et al., NeurIPS 2023. Read for the “external knowledge” design decision — the first benchmark to admit questions underdetermine SQL — and bring a skeptical eye to the gold annotations; we discuss the audit findings in class.

9.3

How Anthropic Enables Self-Service Data Analytics with Claude — Anthropic engineering blog, June 2026. See entry 1.3 above — assigned in both Weeks 1 and 9; the Week-9 reading instruction is folded into that entry. Reread it after Spider 2.0 and BIRD: the same post reads completely differently once you know what the benchmarks can’t measure.

10.1

Semantic Operators: A Declarative Model for Rich, AI-Based Data Processing (LOTUS) — Patel, Jha, Asawa, Pan, Guestrin & Zaharia, VLDB 2025. The algebra itself. Focus on the formal semantics of accuracy targets and the cascade algorithms for sem_filter/sem_join — especially how thresholds are calibrated to give statistical guarantees, not vibes.

10.2

Palimpzest: Optimizing AI-Powered Analytics with Declarative Query Processing — Liu, Russo, Cafarella et al., CIDR 2025. The optimizer story. Focus on the physical plan space (model tiers, fusion, code synthesis) and how plan search handles the (runtime, cost, quality) Pareto frontier; skim the implementation section.

10.3

DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing — Shankar, Parameswaran & Wu, VLDB 2025. Optimization beyond physical operator choice. Focus on the rewrite directives and on how LLM-as-judge validates rewrites — ask yourself where this validation loop could be fooled.

11.1

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory — Chhikara, Khant, Aryan, Singh & Yadav, arXiv:2504.19413, 2025. Read §3 as a write-path spec: extraction then ADD/UPDATE/DELETE/NOOP consolidation. Focus on the latency/token tables and ask where an audit log would have to live.

11.2

Zep: A Temporal Knowledge Graph Architecture for Agent Memory — Rasmussen, Paliychuk, Beauvais, Ryan & Chalef, arXiv:2501.13956, 2025. The Graphiti bi-temporal edge model is the heart. Map t_valid/t_invalid onto SQL:2011 periods and Type-2 SCDs as you read — the correspondence is nearly exact.

11.3

MemGPT: Towards LLMs as Operating Systems — Packer, Wooders, Lin, Fang, Patil, Stoica & Gonzalez, arXiv:2310.08560, 2023. Read the OS metaphor adversarially: main/external context is a buffer pool with the application as its own buffer manager. List the failure modes that DBMSs solved by NOT doing this.

12.1

Model Context Protocol — Specification — Anthropic & the MCP community, 2024–2025. Read the architecture and the tools/resources/prompts sections. Focus on the trust boundaries and what the spec explicitly leaves to server authors — that list is the attack surface.

12.2

The Lethal Trifecta & The Supabase MCP can leak your entire SQL database — Simon Willison, 2025. The trifecta post is the mental model; the Supabase walkthrough is the proof. Trace each leg of the trifecta onto each step of the exploit as you read.

12.3

OWASP Top 10 for LLM Applications — OWASP, 2025. Read LLM01 (Prompt Injection), LLM02 (Sensitive Information Disclosure), and the excessive-agency entry. Map each to a database control from Thursday’s defenses table.
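Entry 11.2 asks you to map t_valid/t_invalid onto SQL:2011 periods and Type-2 slowly changing dimensions. The sketch below shows the Type-2 half of that mapping using Python's built-in sqlite3. Everything in it is an assumption made for illustration: the `edge` table, the `assert_fact` and `as_of` helpers, and the sample facts are hypothetical, and only the valid-time axis is modelled (the paper's second, transaction-time axis is left out).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE edge (
        subject   TEXT NOT NULL,
        predicate TEXT NOT NULL,
        object    TEXT NOT NULL,
        t_valid   TEXT NOT NULL,  -- start of the validity period (inclusive)
        t_invalid TEXT            -- end of the period (exclusive); NULL = still valid
    )
""")


def assert_fact(subject, predicate, obj, at):
    """Type-2 style write: close the currently open row, then insert the new version."""
    with conn:  # both statements commit together
        conn.execute(
            "UPDATE edge SET t_invalid = ? "
            "WHERE subject = ? AND predicate = ? AND t_invalid IS NULL",
            (at, subject, predicate),
        )
        conn.execute(
            "INSERT INTO edge VALUES (?, ?, ?, ?, NULL)",
            (subject, predicate, obj, at),
        )


def as_of(subject, predicate, at):
    """Return the object that was valid at time `at`, or None if nothing was."""
    row = conn.execute(
        "SELECT object FROM edge "
        "WHERE subject = ? AND predicate = ? "
        "AND t_valid <= ? AND (t_invalid IS NULL OR ? < t_invalid)",
        (subject, predicate, at, at),
    ).fetchone()
    return row[0] if row else None


# ISO-8601 date strings compare correctly as text, so no date type is needed here.
assert_fact("alice", "works_at", "Acme", "2023-01-01")
assert_fact("alice", "works_at", "Initech", "2024-06-01")

print(as_of("alice", "works_at", "2023-09-15"))  # Acme
print(as_of("alice", "works_at", "2024-06-01"))  # Initech
print(as_of("alice", "works_at", "2022-01-01"))  # None
```

Nothing is ever deleted: a superseded fact is closed by setting its end time, which is the same move a Type-2 dimension makes when an attribute changes.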

Part IV — Frontier & Futures (Weeks 13–14)

13.1

Self-Driving Database Management Systems — Pavlo et al., CIDR 2017. The manifesto. Focus on the forecast→plan→act architecture and the explicit analogy ladder from “advisor” to “autonomous” — then ask which rung an LLM agent occupies.

13.2

The Data Calculator: Data Structure Design and Cost Synthesis from First Principles and Learned Cost Models — Idreos et al., SIGMOD 2018. Read for the design-space framing, not the implementation: layout primitives, the 10^32 continuum, and how learned micro-benchmark models compose into whole-structure cost predictions.

13.3

OtterTune postmortem — A. Pavlo, blog post, 2024. A rare honest startup autopsy by its own founder. Focus on why the science worked and the business didn’t: episodic value, and platforms absorbing the feature.

14.1

What Goes Around Comes Around… And Around… — M. Stonebraker & A. Pavlo, SIGMOD Record 53(2), 2024. The lecture’s backbone. Focus on the scoring of NoSQL and graph, and on which “lessons” the authors say never change — you will cite both sides of it in the debate.

14.2

The Seattle Report on Database Research — D. Abadi et al., CACM 65(8), 2022. The field’s last pre-agentic self-portrait. Focus on what the community ranked urgent in 2022 versus what this course argued matters now — the gaps are your debate ammunition.

14.3

What Goes Around Comes Around — Stonebraker & Hellerstein, 2005. See entry 1.2 above — assigned in both Weeks 1 and 14; the Week-14 reading instruction is folded into that entry. The pairing with 14.1 is the point: the same authorship lineage grading its own twenty-year-old predictions.

A note on links

Where a reading lists only a venue, find it through the venue’s proceedings or the authors’ pages — every paper above is freely available from at least one of those. We deliberately don’t pin URLs for papers: they rot faster than citations do.

§ 2 · The Two Books

Petrov, and the Alternate On-Ramp

Only one book is required: Alex Petrov’s Database Internals (O’Reilly, 2019; databass.dev). It is the course’s implementation-depth backstop — when a lecture asserts something about page layouts, recovery, or consensus and you want to see the actual mechanics, Petrov is where you go. Only chapters 2–4 and 7 are formally assigned (Week 2), but the rest of the book shadows the syllabus:

Chapters	Topic	Where it lands in DATA 2027
ch. 1	Introduction; storage-engine taxonomy	Background for Week 1
ch. 2–4	B-tree basics, file formats, implementing B-trees	Assigned, Week 2; foundation for Lab 1
ch. 5	Transaction processing & recovery (WAL, ARIES-style thinking)	Weeks 2 & 8; Lab 2’s WAL design
ch. 6	B-tree variants (B-link, copy-on-write trees)	Week 2 skim; copy-on-write returns in Week 8
ch. 7	Log-structured storage (LSM-trees, compaction)	Assigned, Week 2; the spec for Lab 1
ch. 8–10	Distributed systems primer; failure detection; leader election	Background for Week 4
ch. 11–12	Replication, consistency, anti-entropy	Weeks 4 & 8
ch. 13–14	Distributed transactions; consensus	Week 8, alongside Calvin and Spanner

The alternate on-ramp is Martin Kleppmann’s Designing Data-Intensive Applications (O’Reilly, 2017; dataintensive.net). It covers much of the same territory one level of abstraction up — system properties and trade-offs rather than page formats and cell layouts. If Petrov’s chapter 4 feels like reading a disassembly, start with DDIA’s chapters 3 (storage and retrieval) and 5–9 (replication through consistency), then come back. Students who arrive from an applications background consistently report that DDIA-first, Petrov-second is the gentler path; students who have written a storage engine before can skip DDIA entirely.

§ 3 · Benchmarks & Leaderboards

How Progress Gets Measured — and Mismeasured

Week 9 and Lab 3 lean on two text-to-SQL benchmarks, and you should know both as artifacts, not just as numbers.

Spider 2.0 (Lei et al., ICLR 2025) — enterprise-scale workflows over real warehouses: thousand-column schemas, dialect quirks, multi-step tasks that require reading project documentation. This is the benchmark where leaderboard scores fell off the famous cliff relative to its predecessor, and the one Lab 3’s eval harness imitates in miniature.
BIRD (Li et al., NeurIPS 2023) — large-scale, database-grounded text-to-SQL with “external knowledge” attached to questions: the first major benchmark to admit that natural-language questions underdetermine SQL.

What to know about eval quality

Treat every leaderboard number as a measurement made with an imperfect instrument. Independent audits of BIRD’s gold annotations have found a substantial fraction of reference queries that are arguably wrong — ambiguous questions, gold SQL that doesn’t match the stated intent, schema values that contradict the “external knowledge.” This matters in both directions: systems get penalized for correct answers and rewarded for reproducing annotation mistakes. When a model “beats human performance” on a benchmark whose human-written gold labels contain errors, ask what is actually being measured. Lab 3 makes this concrete: part of your grade is auditing your own eval set and reporting the annotation defects you find. The habit generalizes — Week 2’s benchmark skepticism (RUM corners, hardware assumptions) and Week 9’s annotation skepticism are the same skill applied at different layers.
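To make "penalized for correct answers and rewarded for reproducing annotation mistakes" concrete, here is a minimal sketch of an audit-aware scorer. It assumes a simplified execution-match rule (a prediction counts if its result rows equal the reference rows as a set), and every name and data point in it is hypothetical: the `Item` fields, the `audit_score` function, and the five toy items are illustrations, not BIRD's evaluation code or real audit results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    predicted: frozenset  # rows returned by the system's SQL
    gold: frozenset       # rows returned by the benchmark's reference SQL
    audited: frozenset    # rows an auditor judges to be the right answer


def audit_score(items):
    """Compare the leaderboard-style score with the score against audited answers."""
    n = len(items)
    reported = sum(i.predicted == i.gold for i in items)
    actual = sum(i.predicted == i.audited for i in items)
    penalized_but_right = sum(
        i.predicted == i.audited and i.predicted != i.gold for i in items
    )
    rewarded_but_wrong = sum(
        i.predicted == i.gold and i.predicted != i.audited for i in items
    )
    return {
        "reported_accuracy": reported / n,
        "audited_accuracy": actual / n,
        "penalized_but_right": penalized_but_right,
        "rewarded_but_wrong": rewarded_but_wrong,
    }


# Toy result sets standing in for query outputs.
A, B, C = (frozenset({(x,)}) for x in "abc")

items = [
    Item(predicted=A, gold=A, audited=A),  # right, and scored right
    Item(predicted=A, gold=B, audited=A),  # right, but the gold label is wrong
    Item(predicted=B, gold=B, audited=A),  # wrong, but it matches a wrong gold label
    Item(predicted=C, gold=A, audited=A),  # wrong, and scored wrong
    Item(predicted=A, gold=A, audited=A),  # right, and scored right
]

print(audit_score(items))
```

In this toy set the reported and audited accuracies are both 0.6, yet two of the five verdicts are wrong in opposite directions. A matching headline number does not show that the instrument is sound, which is why the last two counters are worth reporting alongside it.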

§ 4 · Tools You’ll Touch

The Lab Toolchain

Four labs, four primary tools. Install all of them in Week 1; nothing here takes more than a few minutes to set up, and Lab 1’s toolchain check is due before Week 2.

Tool	Lab	Role	Link
Rust + cargo	Lab 1 (Weeks 2–4)	Implementation language for VLSM, your LSM-tree with vector segments — you’ll own memtables, SSTables, compaction, and a vector access method.	rust-lang.org
MinIO	Lab 2 (Weeks 5–8)	S3-compatible object store run locally; the disaggregated storage layer under Mini-Neon’s copy-on-write pages and branches.	min.io
DuckDB	Lab 3 (Weeks 9–10)	The analytical engine your text-to-SQL agent targets and your eval harness queries — in-process, zero-ops, full SQL.	duckdb.org
Python 3.12+	Lab 4 (Weeks 10–12)	Host language for the semantic-operator optimizer: logical plans, model-tier physical operators, and cascade calibration.	python.org

Secondary dependencies (an LLM API key for Lab 3 — Lab 4’s simulator makes no real API calls — and plotting libraries for lab reports) are listed on each lab page: Lab 1, Lab 2, Lab 3, Lab 4.

§ 5 · Staying Current

After Week 14

This course will be stale in places within a year — Part III especially. The fix is a short list of feeds that have stayed reliable while everything else churned.

Andy Pavlo’s annual “Databases in 20XX” retrospectives — cs.cmu.edu/~pavlo/blog. One post a year, every January, scoring the previous year’s acquisitions, funding, failures, and fashion cycles. The single highest signal-to-noise database publication in existence; read the whole back catalog.
The three venues — CIDR (January; where systems people publish opinions and architectures — Lakehouse, SageDB, Palimpzest, and self-driving DBMSs all debuted here), SIGMOD (June), and VLDB (August/September). Skim the industrial tracks first: that is where the production retrospectives live.
CMU Database Group seminar videos — youtube.com/@CMUDatabaseGroup. Weekly talks from the people building the systems in this syllabus, including the “quarantine tech talks” archive; the fastest way to hear how a paper’s claims sound when its authors defend them live.
The two companion essays on this site — Essay № 01, the argument that the database client has changed, and Essay № 02, the design rationale for this course. Reread both after the final exam; they were written to be disagreed with, and by Week 14 you will be equipped to.

❦