DATA 2027 · SPRING 2027 · A COURSE THAT DOESN’T EXIST YET
MIT 6.5837 · Stanford CS 345A · Harvard CS 2650

Data Systems in the Agentic Era

A complete graduate database course for CS students — 14 weeks of full lecture notes, four systems labs, eight final projects, and a synthesis exam. Free, open, and built on one thesis: the database’s dominant client is now a model, and that changes the workload at every layer of the stack — but not the physics.

Units: 12 (3-0-9)  ·  Meets: Tu/Th 1:00–2:30 + Fri lab  ·  Prereqs: undergrad DB internals, systems programming (Rust/C++/Go), LLM API fluency
Everything on this site is real coursework; only the registrar entry is fictional. Steal freely — CC BY 4.0.
About

Why this course exists

For fifty years, every database design decision — the buffer pool, the optimizer’s cost model, the isolation-level menu, SQL itself — quietly assumed a human at the other end of the connection. That client is gone. The dominant client of the late 2020s issues thousands of speculative queries per task, branches and abandons entire database states, retrieves by similarity as often as by key, and lies confidently when the data model confuses it. Database courses teach the system; ML courses teach the model; nobody teaches the interface between them, which is where the next decade of both fields will be decided.

This course rebuilds database systems knowledge from storage engines upward, asking at each layer what survives, what bends, and what breaks when the client is a model. Part I re-reads the classical canon as a set of workload bets about to be violated. Part II covers the new access methods and engine architectures. Part III climbs to where accuracy actually lives — semantics, agents, governance. Part IV asks the field’s oldest question of its newest material.

LLM policy: frontier models are allowed — encouraged — everywhere, including the exam. You are training to build systems for these clients; pretending they don’t exist teaches the wrong lesson. What is graded is what models can’t fake: Pareto frontiers from your own benchmarks, ablation tables, and designs you can defend in a hallway argument.

Schedule

Fourteen weeks

Each week has full lecture notes (linked below) and a keyboard-navigable slide deck — the Slides → link in every week’s header. Arrow keys to advance, p to print.

Part I — Foundations Under New Workloads
01 | Feb 2 / 4 | The Client Has Changed | Anatomy of a DBMS · the agentic workload, measured | notes →
02 | Feb 9 / 11 | B-Trees, LSM-Trees & the RUM Triangle | The disk made me do it · amplification arithmetic | notes →
03 | Feb 16 / 18 | One Size Fits None | Column stores · vectorized execution | notes →
04 | Feb 23 / 25 | Disaggregation & Elasticity | Aurora · Snowflake · Neon’s branchable storage | notes →
Part II — New Access Methods & Engines
05 | Mar 2 / 4 | Learned Components | Learned indexes · Bao · the deployable pattern | notes →
06 | Mar 9 / 11 | Vector Indexes Are Access Methods, Not Products | IVF · PQ · HNSW · DiskANN · the recall axis | notes →
07 | Mar 16 / 18 | The Lakehouse & Open Formats | Iceberg’s metadata tree · Photon · the catalog wars | notes →
(spring break · week of Mar 22)
08 | Mar 30 / Apr 1 | Transactions & Branching for Agent Swarms | Calvin vs Spanner · copy-on-write as product | notes →
Part III — Semantics, Agents, Governance
09 | Apr 6 / 8 | Text-to-SQL Is Not Solved; It’s Specified | Spider 2.0 · the semantic layer as schema-for-models | notes →
10 | Apr 13 / 15 | Semantic Operators | sem_filter, sem_join · optimizing pipelines that cost dollars and lie | notes →
11 | Apr 20 / 22 | Memory Is a Database Problem | Mem0, Graphiti, MemGPT: graded as database designs | notes →
12 | Apr 27 / 29 | Protocols, Permissions & the Lethal Trifecta | MCP as ODBC for agents · securing DBs against hypnotizable clients | notes →
Part IV — Frontier & Futures
13 | May 4 / 6 | Self-Driving, Self-Assembling, Self-Designing | Auto-tuning to self-design · the agent as DBA | notes →
14 | May 11 / 13 | What Goes Around | Fifty years of rebellions · final debate: the first genuine exception? | notes →
Labs

Four labs, each a thesis in miniature

Graded on Pareto frontiers and ablation tables, not point estimates: the validation discipline of production analytics agents, enforced from the first lab.
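
To make "Pareto frontier, not point estimate" concrete, here is a minimal sketch of how a set of benchmark runs might be reduced to its frontier. It assumes two axes, recall (higher is better) and latency (lower is better); the `Run` type, the configuration names, and the numbers are all hypothetical and are not taken from the lab materials.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str          # hypothetical configuration label
    recall: float      # higher is better
    latency_ms: float  # lower is better


def dominates(a: Run, b: Run) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.recall >= b.recall and a.latency_ms <= b.latency_ms
    strictly_better = a.recall > b.recall or a.latency_ms < b.latency_ms
    return no_worse and strictly_better


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Return the non-dominated runs, sorted by latency."""
    frontier = [r for r in runs if not any(dominates(o, r) for o in runs)]
    return sorted(frontier, key=lambda r: r.latency_ms)


# Made-up measurements, for illustration only.
runs = [
    Run("cfg-a", 0.80, 1.0),
    Run("cfg-b", 0.90, 2.0),
    Run("cfg-c", 0.95, 4.0),
    Run("cfg-d", 0.95, 6.0),  # same recall as cfg-c but slower: dominated
    Run("cfg-e", 0.78, 1.5),  # worse than cfg-a on both axes: dominated
]

frontier = pareto_frontier(runs)
for r in frontier:
    print(f"{r.name}: recall={r.recall:.2f}, latency={r.latency_ms:.1f} ms")
```

A single "best" configuration hides the trade-off; reporting the frontier shows which configurations are defensible at each latency budget. The quadratic scan is fine for a handful of benchmark runs; a sort-based sweep would be the natural choice for larger sets.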

Assessment

Grading, projects, exam

40% labs
35% final project
15% papers + debate
10% exam

The final project is a team-of-two, CIDR-format paper plus working artifact, chosen from a menu of eight publishable questions — from recall-aware compaction to the agent-native TPC benchmark. The take-home synthesis exam is open-everything, including frontier models; its five questions each require holding two layers of the stack in tension. The famous ones: RUM-R (formalize recall as the fourth axis of the RUM conjecture) and Isolation for liars (define an isolation level for agents acting on stale memory summaries).

Every lab ships with real starter materials — skeleton code, the 126-table trap schema with gold-SQL evals, the deterministic LLM simulator — linked from each lab’s “Provided materials” section.

For Instructors & Self-Studiers

Teach it, or take it alone