Field Notes: What the Flagships Actually Teach
Before inventing a course, survey the territory. We read the current syllabi of every flagship database course at Stanford, MIT, Harvard, CMU, and Berkeley — the actual 2025–2026 lecture schedules, not the catalog blurbs. The result is a snapshot of a discipline mid-pivot: the AI era has breached the core curriculum at exactly two schools, is knocking at a third, and has left the other two split between classical rigor and a vacated chair.
MIT
The bellwether. The core grad DB course's 24-lecture arc now includes Lecture 17: Vector Databases and Lecture 22: Semantic Operators. LLM-powered relational operators, taught from the group’s own Palimpzest (CIDR ’25), have entered the canon alongside ARIES and C-Store. Kraska is on leave running applied science at AWS, where SageDB’s learned layouts shipped inside Redshift.
Harvard
The boldest rename: CS265 is now literally “Big Data & AI Systems” — NoSQL, LLMs, RAG, image AI. The systems project: build a full NoSQL engine and a full LLM, in C/C++. DASlab’s “self-designing systems” program (Data Calculator, Monkey, Dostoevsky) now treats RAG pipelines and LLM serving as data systems with navigable design spaces.
CMU
Vector indexes are now in the intro indexing lectures of 15-445 — commoditized, as Pavlo predicted. But 15-721 (Advanced, now under Jignesh Patel) stays deliberately classical: vectorized execution, OCC, cloud warehouses, lakehouses. Pavlo’s 2025 year-in-review: every DBMS shipped an MCP server, agents provision the databases, and nobody has solved the security story.
Stanford
The famous data-intensive systems course is effectively frozen at its 2021 edition — Zaharia left for Berkeley and the Databricks CTO chair. The energy moved sideways: CS336 makes students build Common Crawl dedup pipelines (data engineering as part of the model), CS224V teaches SUQL’s hybrid SQL+retrieval queries, and the LOTUS semantic-operators work (VLDB ’25) came from the Zaharia–Guestrin orbit.
Berkeley
No flagship grad DB course since 2020, but Berkeley runs the world’s largest agent courses (LLM Agents, Agentic AI; five-thousand-student MOOCs) and houses the EPIC lab’s DocETL (VLDB ’25), the third leg of the semantic-operator stool. Hellerstein’s Aqueduct pivoted to RunLLM; his verdict: enterprise picks-and-shovels LLM tooling “is not ready yet.”
The Gap
Vector internals live at CMU, semantic operators at MIT, self-designing systems at Harvard, data curation at Stanford, agents at Berkeley. No single course teaches the agentic data stack end-to-end — storage physics to semantic layer to agent governance. That course is below.
One quote from the field survey deserves framing, because it is the entire reason the course below exists. Michael Stonebraker — Turing laureate, creator of Ingres and Postgres — ran text-to-SQL against MIT’s own data warehouse in 2025 and reported:
“An accuracy of zero — not low, zero.” Private data, idiosyncratic terminology, semantic overlap, complex queries. His corollary: agentic AI is about to go read-write, and what it needs is ACID — “durable computing is basically the D in ACID.” — Michael Stonebraker, 2025 year-in-review interview; his DBOS project has pivoted to durable execution for “crashproof AI agents”
Zero on MIT’s warehouse; ninety-five percent at Anthropic with curated context. The entire pedagogy of the agentic era lives in the space between those two numbers — and no course on Earth currently teaches students how to cross it.
The Course
DATA 2027: Data Systems in the Agentic Era
The arc follows the stack: Part I re-reads the classical canon as a set of workload bets about to be violated. Part II covers the new access methods and engine architectures. Part III climbs to where the accuracy actually lives — semantics, agents, governance. Part IV asks the field’s oldest question of its newest material.
Fourteen Weeks
The Client Has Changed
What Goes Around Comes Around — Stonebraker & Hellerstein
How Anthropic Enables Self-Service Data Analytics with Claude — Anthropic, June 2026
B-Trees, LSM-Trees & the RUM Triangle
The RUM Conjecture — Athanassoulis et al., EDBT 2016
The Log-Structured Merge-Tree — O’Neil et al., 1996
One Size Fits None
C-Store — VLDB 2005
MonetDB/X100: Hyper-Pipelining Query Execution — Boncz et al., CIDR 2005
Disaggregation & Elasticity
The Snowflake Elastic Data Warehouse — Dageville et al., SIGMOD 2016
Building an Elastic Query Engine on Disaggregated Storage — NSDI 2020
Learned Components
Bao: Making Learned Query Optimization Practical — Marcus et al., SIGMOD 2021
SageDB — CIDR 2019
Vector Indexes Are Access Methods, Not Products
DiskANN — Subramanya et al., NeurIPS 2019
Product Quantization — Jégou et al., TPAMI 2011
The Lakehouse & Open Formats
Apache Iceberg Table Spec v2 + Ryan Blue, Netflix 2018
Photon — Behm et al., SIGMOD 2022
Transactions & Branching for Agent Swarms
Spanner — Corbett et al., OSDI 2012
Neon: Architecture Decisions — neon.tech engineering, 2022–24
Text-to-SQL Is Not Solved; It’s Specified
BIRD — Li et al., NeurIPS 2023
Semantic Layer vs Text-to-SQL benchmarks — Cube / dbt Labs, 2026
Semantic Operators
What if sem_filter and sem_join were relational operators with cost models, not prompt spaghetti? LOTUS, Palimpzest, and DocETL converge from three directions on the same idea: declarative LLM pipelines deserve an optimizer, and accuracy joins latency and cost in the objective.
Palimpzest — Liu et al., CIDR 2025
DocETL — Shankar et al., VLDB 2025
Memory Is a Database Problem
Zep/Graphiti: Temporal KG for Agent Memory — Rasmussen et al., arXiv 2501.13956
MemGPT — Packer et al., 2023
Protocols, Permissions & the Lethal Trifecta
The Lethal Trifecta — Willison, 2025 (incl. the Supabase MCP exfiltration case)
OWASP Top 10 for LLM Applications — 2025 rev.
Self-Driving, Self-Assembling, Self-Designing
The Data Calculator — Idreos et al., SIGMOD 2018
OtterTune postmortem — 2024
What Goes Around
The Seattle Report on Database Research — Abadi et al., CACM 2022
The Labs
Four labs, each one a thesis in miniature. Graded on Pareto frontiers and ablation tables, not point estimates — the Anthropic discipline, enforced from problem set one.
VLSM: an LSM-tree with vector segments
Extend a skeleton LSM engine (Rust) so each SSTable carries a quantized vector segment with a per-level HNSW graph; implement compaction that merges graphs. Measure the read / update / memory / recall frontier under a synthetic agent workload. Deliverable: a Pareto plot, not a number.
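To make "a Pareto plot, not a number" concrete, here is a minimal sketch of how the non-dominated configurations could be picked out of a set of benchmark runs before plotting. The `Run` record, the configuration names, and every number below are invented for illustration; the lab does not prescribe this representation.

```python
from typing import NamedTuple

class Run(NamedTuple):
    """One measured configuration (hypothetical fields and units)."""
    name: str
    read_ms: float     # lower is better
    update_ms: float   # lower is better
    memory_mb: float   # lower is better
    recall: float      # higher is better

def dominates(a: Run, b: Run) -> bool:
    """True if a is no worse than b on every axis and strictly better on at least one."""
    no_worse = (a.read_ms <= b.read_ms and a.update_ms <= b.update_ms
                and a.memory_mb <= b.memory_mb and a.recall >= b.recall)
    better = (a.read_ms < b.read_ms or a.update_ms < b.update_ms
              or a.memory_mb < b.memory_mb or a.recall > b.recall)
    return no_worse and better

def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Keep only the runs that no other run dominates (O(n^2), fine for a lab report)."""
    return [r for r in runs if not any(dominates(o, r) for o in runs)]

# Made-up measurements, purely to exercise the function.
runs = [
    Run("pq8-m16", 2.1, 0.8, 512, 0.91),
    Run("pq16-m16", 2.4, 0.9, 768, 0.95),
    Run("pq8-m32", 3.0, 1.4, 900, 0.93),   # worse than pq16-m16 on all four axes
    Run("flat", 9.5, 0.2, 4096, 1.00),
]
frontier = pareto_frontier(runs)
print([r.name for r in frontier])
```

Only the frontier points belong on the plot; a dominated configuration such as `pq8-m32` here is strictly a worse bet and can be reported in the ablation table instead.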
Mini-Neon: copy-on-write pages over S3
Build a page layer over object storage: WAL ingestion, page materialization at any LSN, and CREATE BRANCH in O(metadata). Demonstrate 50 concurrent agent branches forked from one parent, with garbage collection of the abandoned ones.
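The "CREATE BRANCH in O(metadata)" requirement can be illustrated with a toy in-memory model: a branch records only its parent and the LSN it forked at, and reads fall through to the parent for any page the branch has not rewritten. This is a sketch under simplifying assumptions (no WAL, no object storage, no garbage collection); the `Branch` class and its methods are hypothetical names, not Neon's design or API.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Branch:
    parent: Branch | None = None
    fork_lsn: int = 0   # highest parent LSN visible to this branch
    last_lsn: int = 0   # latest LSN on this branch's own timeline
    # page_id -> [(lsn, page_bytes), ...] in ascending LSN order
    versions: dict[int, list[tuple[int, bytes]]] = field(default_factory=dict)

    def write(self, page_id: int, data: bytes) -> int:
        """Append a new page version; earlier versions are never overwritten."""
        self.last_lsn += 1
        self.versions.setdefault(page_id, []).append((self.last_lsn, data))
        return self.last_lsn

    def read(self, page_id: int, lsn: int | None = None) -> bytes | None:
        """Materialize a page as of `lsn` (defaults to the branch head)."""
        lsn = self.last_lsn if lsn is None else lsn
        for version_lsn, data in reversed(self.versions.get(page_id, [])):
            if version_lsn <= lsn:
                return data
        if self.parent is not None:
            # Not written on this branch: fall through to the parent, capped at the fork point.
            return self.parent.read(page_id, min(lsn, self.fork_lsn))
        return None

    def create_branch(self) -> Branch:
        """Fork at the current head. Copies no pages, so cost is independent of data size."""
        return Branch(parent=self, fork_lsn=self.last_lsn, last_lsn=self.last_lsn)

main = Branch()
main.write(1, b"v1")
agents = [main.create_branch() for _ in range(50)]  # 50 forks, zero pages copied
agents[0].write(1, b"agent0")                       # visible only on agents[0]
main.write(1, b"v2")                                # after the fork: invisible to all agents
```

After this runs, `agents[0]` sees its own write, every other branch still sees `b"v1"`, and `main.read(1, lsn=1)` returns the page as it was before the later write. Garbage collection of abandoned branches, which the lab also asks for, is deliberately left out of the sketch.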
Text-to-SQL agent + eval harness
Build an agent over a 126-table enterprise-style schema; construct a BIRD-style eval set; then add a Cube-style semantic layer and report execution accuracy with and without it, broken down by error class: wrong join, wrong metric, hallucinated column. Reproduce the 21%→95% curve yourself.
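As a starting point for the harness, the sketch below computes execution accuracy by running predicted and gold SQL against the same SQLite database and comparing result sets, then tallies failures by class. The classification rules are crude assumptions made for illustration: an engine "no such column" error counts as a hallucinated column, a row-count mismatch as a wrong join, and a same-shape value mismatch as a wrong metric. A real harness would need better diagnostics. The two-table schema and the queries are invented.

```python
import sqlite3
from collections import Counter

def rows(conn, sql):
    # Order-insensitive comparison; sorting by repr stays safe with NULLs and mixed types.
    return sorted(conn.execute(sql).fetchall(), key=repr)

def classify(conn, predicted_sql, gold_sql):
    """Label one prediction. Heuristic only; see the assumptions stated above."""
    gold = rows(conn, gold_sql)
    try:
        pred = rows(conn, predicted_sql)
    except sqlite3.OperationalError as exc:
        return "hallucinated_column" if "no such column" in str(exc) else "other_error"
    if pred == gold:
        return "correct"
    return "wrong_join" if len(pred) != len(gold) else "wrong_metric"

def evaluate(conn, cases):
    """cases: iterable of (predicted_sql, gold_sql) pairs."""
    by_class = Counter(classify(conn, pred, gold) for pred, gold in cases)
    return {"accuracy": by_class["correct"] / len(cases), "by_class": dict(by_class)}

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'EU'), (2, 'US');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.0);
""")

REVENUE = ("SELECT c.region, SUM(o.amount) FROM orders o "
           "JOIN customers c ON o.customer_id = c.id GROUP BY c.region")
PAIRS = "SELECT o.id, c.region FROM orders o JOIN customers c ON o.customer_id = c.id"

cases = [
    (REVENUE, REVENUE),                                           # matches gold
    (REVENUE.replace("SUM", "AVG"), REVENUE),                     # right shape, wrong aggregate
    ("SELECT o.id, c.region FROM orders o, customers c", PAIRS),  # missing join predicate
    ("SELECT region, SUM(revenue) FROM orders GROUP BY region", REVENUE),  # no such column
]
report = evaluate(conn, cases)
print(report)
```

Running the same `cases` list once against raw agent output and once against agent output produced with the semantic layer gives the with/without comparison the lab asks for, with the per-class counts showing which error types the layer actually removes.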
A semantic-operator optimizer
Implement sem_filter, sem_join, sem_topk over a 25,000-document corpus with a cost model spanning model tiers, cascades with cheap-proxy scoring, and an accuracy budget. Beat the naive plan by 10× on cost at ≤2% quality loss.
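One simple way to picture a cascade with cheap-proxy scoring is a two-threshold filter: a cheap scorer decides the confident cases, and only the uncertain middle band is sent to the expensive model. The sketch below is an assumption-laden illustration, not the algorithm from LOTUS, Palimpzest, or DocETL. The thresholds, the unit costs, and the keyword-based stand-ins for both models are all invented so the example runs without any LLM.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CascadeResult:
    kept: list[str]
    cost: float
    oracle_calls: int

def sem_filter_cascade(
    docs: list[str],
    proxy_score: Callable[[str], float],  # cheap model: score in [0, 1]
    oracle: Callable[[str], bool],        # expensive model: final verdict
    low: float = 0.2,
    high: float = 0.8,
    proxy_cost: float = 1.0,              # hypothetical cost units per call
    oracle_cost: float = 50.0,
) -> CascadeResult:
    kept, cost, calls = [], 0.0, 0
    for doc in docs:
        cost += proxy_cost
        score = proxy_score(doc)
        if score >= high:                 # confidently relevant: accept on the proxy alone
            kept.append(doc)
        elif score > low:                 # uncertain band: escalate to the expensive model
            cost += oracle_cost
            calls += 1
            if oracle(doc):
                kept.append(doc)
        # score <= low: confidently irrelevant, dropped without an oracle call
    return CascadeResult(kept, cost, calls)

# Keyword stand-ins for the two model tiers (purely illustrative).
def proxy(doc: str) -> float:
    hits = sum(k in doc.lower() for k in ("index", "hnsw", "b-tree"))
    return min(1.0, hits / 2)

def oracle(doc: str) -> bool:
    return any(k in doc.lower() for k in ("hnsw", "b-tree", "lsm"))

docs = ["HNSW index build notes", "index of quarterly reports",
        "B-tree page splits", "lunch menu"]
result = sem_filter_cascade(docs, proxy, oracle)
naive_cost = len(docs) * 50.0             # every document sent to the oracle
print(result, naive_cost)
```

Widening the band between `low` and `high` buys accuracy at higher cost, and narrowing it does the reverse. Choosing those thresholds to meet an accuracy budget is the optimization problem the lab is asking for.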
Final Projects & Grading
Teams of two; deliverable is a CIDR-format paper plus a working artifact. The menu — each item is a publishable question wearing a project’s clothing:
Why this course exists: the agent era is producing a generation of engineers who treat the database as a vibes-based retrieval API, and a generation of database researchers who treat LLMs as a noisy UDF. Both are wrong, and the cost of being wrong is measured in hallucinated joins shipped to production and in storage engines optimized for clients that no longer exist. This course exists to produce people who can hold Petrov’s page layouts and a transformer’s context window in their head at the same time — because the systems that win the agentic era will be built by exactly those people, and there are currently about two hundred of them on Earth.
The Synthesis Exam
Take-home, open-everything — including frontier models, because pretending otherwise teaches the wrong lesson. Each question requires holding two layers of the stack in tension; none can be answered from one community’s literature alone.
Sources & Provenance
- [01] MIT 6.5830 Spring 2026 — course site, lecture schedule (Lec 17 Vector DB, Lec 22 Semantic Operators); instructors Cafarella & Li.
- [02] Harvard CS265 “Big Data & AI Systems,” Spring 2026, Idreos — site, syllabus; CS165 Fall 2025 — site.
- [03] CMU 15-445/645 Spring 2026 — syllabus (vector indexes in intro indexing); 15-721 Fall 2025 (Patel) — site.
- [04] Pavlo, “Databases in 2025: A Year in Review” (MCP everywhere; agentic provisioning; security warning).
- [05] Stanford: CS245 (frozen at Winter 2021); CS336 (Common Crawl data labs); CS224V (SUQL); CS224G; CS528 MLSys seminar.
- [06] Berkeley: RDI agent courses — LLM Agents F24, Agentic AI F25; EPIC lab DocETL, VLDB 2025; Hellerstein on LLM tooling, Firebolt interview.
- [07] Stonebraker “accuracy of zero” + ACID-for-agents: 2025 year-in-review interview; DBOS durable execution.
- [08] Semantic operators: LOTUS — Patel et al., VLDB 2025; Palimpzest — Liu et al., CIDR 2025; MIT DSG project page.
- [09] Benchmarks: Spider 2.0 — ICLR 2025 oral; BIRD — NeurIPS 2023, arXiv 2305.03111.
- [10] Learned components: Kraska et al. SIGMOD 2018; Bao — Marcus et al., SIGMOD 2021; SageDB — CIDR 2019; MIT DSAIL; Redshift learned layouts — Amazon Science.
- [11] Vector internals: HNSW — TPAMI 2018; DiskANN — NeurIPS 2019; PQ — Jégou et al., TPAMI 2011.
- [12] Canon: Hellerstein/Stonebraker/Hamilton FnT 2007; C-Store VLDB 2005; MonetDB/X100 CIDR 2005; Aurora SIGMOD 2017; Snowflake SIGMOD 2016; Lakehouse CIDR 2021; Photon SIGMOD 2022; Calvin SIGMOD 2012; Spanner OSDI 2012; Stonebraker & Pavlo, SIGMOD Record 2024.
- [13] Agent memory: Mem0 — arXiv 2504.19413; Zep/Graphiti — arXiv 2501.13956; MemGPT — arXiv 2310.08560.
- [14] Security: Willison, the lethal trifecta / Supabase MCP case; Supabase, Defense in Depth for MCP Servers; Datadog Security Labs Postgres MCP injection.
- [15] Context engineering: Anthropic, self-service analytics (June 2026) and effective context engineering; EvalGen — Shankar et al., UIST 2024.
- [16] OtterTune (2020–2024) postmortem — ottertune.com; Idreos, The Data Calculator — SIGMOD 2018; Pavlo et al., Self-Driving DBMS — CIDR 2017.