Why this course exists
For fifty years, every database design decision — the buffer pool, the optimizer’s cost model, the isolation-level menu, SQL itself — quietly assumed a human at the other end of the connection. That client is gone. The dominant client of the late 2020s issues thousands of speculative queries per task, branches and abandons entire database states, retrieves by similarity as often as by key, and lies confidently when the data model confuses it. Database courses teach the system; ML courses teach the model; nobody teaches the interface between them, which is where the next decade of both fields will be decided.
This course rebuilds database systems knowledge from storage engines upward, asking at each layer what survives, what bends, and what breaks when the client is a model. Part I re-reads the classical canon as a set of workload bets about to be violated. Part II covers the new access methods and engine architectures. Part III climbs to where accuracy actually lives — semantics, agents, governance. Part IV asks the field’s oldest question of its newest material.
LLM policy: frontier models are allowed — encouraged — everywhere, including the exam. You are training to build systems for these clients; pretending they don’t exist teaches the wrong lesson. What is graded is what models can’t fake: Pareto frontiers from your own benchmarks, ablation tables, and designs you can defend in a hallway argument.
Fourteen weeks
Each week has full lecture notes (linked below) and a keyboard-navigable slide deck, reached through the Slides → link in every week’s header. Use the arrow keys to advance and p to print.
Four labs, each a thesis in miniature
Graded on Pareto frontiers and ablation tables, not point estimates — the validation discipline of production analytics agents, enforced from problem set one.
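For readers new to the term, a Pareto frontier is the set of configurations that no other configuration beats on every axis at once. The sketch below is a minimal illustration, not the course's grading code: the function names, the three axes chosen, and every number in `configs` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point, maximize: Sequence[bool]) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    strictly_better = False
    for x, y, bigger_is_better in zip(a, b, maximize):
        if not bigger_is_better:
            # Flip the sign so that "larger is better" holds on every axis.
            x, y = -x, -y
        if x < y:
            return False
        if x > y:
            strictly_better = True
    return strictly_better


def pareto_front(points: List[Point], maximize: Sequence[bool]) -> List[Point]:
    """Keep the points that no other point dominates, in input order."""
    return [
        p for p in points
        if not any(dominates(q, p, maximize) for q in points if q is not p)
    ]


# Hypothetical benchmark results: (read latency in ms, memory in MB, recall).
# Lower is better for the first two axes, higher is better for recall.
maximize = (False, False, True)
configs = [
    (1.0, 100.0, 0.90),
    (2.0, 100.0, 0.90),  # slower than the first config, otherwise identical
    (0.5, 200.0, 0.85),
    (3.0, 50.0, 0.99),
]

front = pareto_front(configs, maximize)
print(front)
```

Here the second configuration drops out because the first matches it on memory and recall and beats it on latency; the other three each win on at least one axis, so all three stay on the frontier. The pairwise check is quadratic in the number of configurations, which is fine for lab-sized benchmark sweeps.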
VLSM: an LSM-tree with vector segments
SSTables carry quantized vector segments with per-level HNSW graphs; compaction merges graphs. Deliverable: a read/update/memory/recall Pareto plot.
Mini-Neon: copy-on-write pages over S3
WAL ingestion, GetPage@LSN, and CREATE BRANCH in O(metadata) time, meaning branch cost scales with metadata rather than data size. Demo: 50 concurrent agent branches plus garbage collection of the abandoned ones.
Text-to-SQL agent + eval harness
An agent over a 126-table schema full of deliberate traps; build the eval set, add a semantic layer, and reproduce the 21%→95% curve in miniature.
A semantic-operator optimizer
sem_filter / sem_join / sem_topk with model-tier cost models and proxy cascades. Beat the naive plan by 10× on cost at ≤2% quality loss.
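As a rough illustration of the proxy-cascade idea named in the last lab, the sketch below runs a cheap proxy scorer over every row and escalates only the uncertain middle band to an expensive oracle. It is a toy under stated assumptions, not the lab's starter code: `sem_filter_cascade`, the thresholds, the per-call costs, and both stub "models" are hypothetical stand-ins for real model tiers.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CascadeResult:
    kept: List[str]
    oracle_calls: int
    cost: float


def sem_filter_cascade(
    rows: List[str],
    proxy: Callable[[str], float],   # cheap scorer, returns a value in [0, 1]
    oracle: Callable[[str], bool],   # expensive judge, treated as ground truth
    lo: float,
    hi: float,
    proxy_cost: float = 1.0,         # assumed cost units per proxy call
    oracle_cost: float = 100.0,      # assumed cost units per oracle call
) -> CascadeResult:
    """Accept rows scoring >= hi, reject rows scoring <= lo, ask the oracle otherwise."""
    kept: List[str] = []
    oracle_calls = 0
    for row in rows:
        score = proxy(row)
        if score >= hi:
            kept.append(row)
        elif score <= lo:
            continue
        else:
            oracle_calls += 1
            if oracle(row):
                kept.append(row)
    cost = len(rows) * proxy_cost + oracle_calls * oracle_cost
    return CascadeResult(kept, oracle_calls, cost)


# Deterministic stubs standing in for a small and a large model.
def toy_proxy(text: str) -> float:
    if "database" in text.lower():
        return 0.9   # confident accept
    if len(text) < 10:
        return 0.1   # confident reject
    return 0.5       # unsure, so the row is escalated


def toy_oracle(text: str) -> bool:
    return any(word in text.lower() for word in ("database", "sql", "index"))


rows = [
    "Postgres is a database",
    "hi",
    "B-tree index internals",
    "weekend hiking plans",
]
result = sem_filter_cascade(rows, toy_proxy, toy_oracle, lo=0.2, hi=0.8)
naive_cost = len(rows) * 100.0  # oracle on every row, same assumed unit cost
print(result, naive_cost)
```

With these stubs the cascade keeps the same two rows the oracle alone would keep, while calling the oracle on only the two uncertain rows. Widening the gap between `lo` and `hi` trades cost for quality, which is the knob a cost-versus-quality plot for this lab would sweep.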
Grading, projects, exam
The final project is a team-of-two, CIDR-format paper plus working artifact, chosen from a menu of eight publishable questions — from recall-aware compaction to the agent-native TPC benchmark. The take-home synthesis exam is open-everything, including frontier models; its five questions each require holding two layers of the stack in tension. The famous ones: RUM-R (formalize recall as the fourth axis of the RUM conjecture) and Isolation for liars (define an isolation level for agents acting on stale memory summaries).
Every lab ships with real starter materials — skeleton code, the 126-table trap schema with gold-SQL evals, the deterministic LLM simulator — linked from each lab’s “Provided materials” section.
Teach it, or take it alone
Teaching DATA 2027
Pacing variants (semester / quarter / undergrad), running the labs, enforcing the open-model policy, adapting the trap schema, and an honest provenance note.
Resources & reading list
All 42 weekly readings in one place, organized by part, plus the two books, the benchmarks, and the tools the labs use.