DATA 2027: Data Systems in the Agentic Era

About

Why this course exists

For fifty years, every database design decision — the buffer pool, the optimizer’s cost model, the isolation-level menu, SQL itself — quietly assumed a human at the other end of the connection. That client is gone. The dominant client of the late 2020s issues thousands of speculative queries per task, branches and abandons entire database states, retrieves by similarity as often as by key, and lies confidently when the data model confuses it. Database courses teach the system; ML courses teach the model; nobody teaches the interface between them, which is where the next decade of both fields will be decided.

This course rebuilds database systems knowledge from storage engines upward, asking at each layer what survives, what bends, and what breaks when the client is a model. Part I re-reads the classical canon as a set of workload bets about to be violated. Part II covers the new access methods and engine architectures. Part III climbs to where accuracy actually lives — semantics, agents, governance. Part IV asks the field’s oldest question of its newest material.

LLM policy: frontier models are allowed — encouraged — everywhere, including the exam. You are training to build systems for these clients; pretending they don’t exist teaches the wrong lesson. What is graded is what models can’t fake: Pareto frontiers from your own benchmarks, ablation tables, and designs you can defend in a hallway argument.

Schedule

Fourteen weeks

Each week has full lecture notes (linked below) and a keyboard-navigable slide deck, reached through the Slides → link in every week’s header. Use the arrow keys to advance and p to print.

Part I — Foundations Under New Workloads

01 · Feb 2 / 4 · The Client Has Changed
Anatomy of a DBMS · the agentic workload, measured
notes →

02 · Feb 9 / 11 · B-Trees, LSM-Trees & the RUM Triangle
The disk made me do it · amplification arithmetic
notes →

03 · Feb 16 / 18 · One Size Fits None
Column stores · vectorized execution
notes →

04 · Feb 23 / 25 · Disaggregation & Elasticity
Aurora · Snowflake · Neon’s branchable storage
notes →

Part II — New Access Methods & Engines

05 · Mar 2 / 4 · Learned Components
Learned indexes · Bao · the deployable pattern
notes →

06 · Mar 9 / 11 · Vector Indexes Are Access Methods, Not Products
IVF · PQ · HNSW · DiskANN · the recall axis
notes →

07 · Mar 16 / 18 · The Lakehouse & Open Formats
Iceberg’s metadata tree · Photon · the catalog wars
notes →

spring break · week of Mar 22

08 · Mar 30 / Apr 1 · Transactions & Branching for Agent Swarms
Calvin vs Spanner · copy-on-write as product
notes →

Part III — Semantics, Agents, Governance

09 · Apr 6 / 8 · Text-to-SQL Is Not Solved; It’s Specified
Spider 2.0 · the semantic layer as schema-for-models
notes →

10 · Apr 13 / 15 · Semantic Operators
sem_filter, sem_join · optimizing pipelines that cost dollars and lie
notes →

11 · Apr 20 / 22 · Memory Is a Database Problem
Mem0, Graphiti, MemGPT: graded as database designs
notes →

12 · Apr 27 / 29 · Protocols, Permissions & the Lethal Trifecta
MCP as ODBC for agents · securing DBs against hypnotizable clients
notes →

Part IV — Frontier & Futures

13 · May 4 / 6 · Self-Driving, Self-Assembling, Self-Designing
Auto-tuning to self-design · the agent as DBA
notes →

14 · May 11 / 13 · What Goes Around
Fifty years of rebellions · final debate: the first genuine exception?
notes →

Labs

Four labs, each a thesis in miniature

Graded on Pareto frontiers and ablation tables, not point estimates: the validation discipline of production analytics agents, enforced from the first lab.
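For readers new to the term, here is a minimal sketch of what "graded on a Pareto frontier" could mean in practice: given benchmark runs with a cost to minimize and a quality to maximize, keep only the runs that no other run beats on both axes. The names `Run` and `pareto_frontier` and the two-axis framing are illustrative assumptions, not part of the course's starter code; the labs may use more axes.

```python
from typing import NamedTuple


class Run(NamedTuple):
    """One benchmark configuration (hypothetical record, for illustration)."""
    name: str
    cost: float     # lower is better, e.g. latency, memory, or dollars
    quality: float  # higher is better, e.g. recall or accuracy


def pareto_frontier(runs):
    """Return the runs that are not dominated on (cost, quality).

    A run is dominated if another run is at least as cheap and at least
    as good, and strictly better on one of the two axes.
    """
    frontier = []
    best_quality = float("-inf")
    # Walk from cheapest to most expensive; among equal costs, best quality first.
    for run in sorted(runs, key=lambda r: (r.cost, -r.quality)):
        # A run earns its place only by beating every cheaper run on quality.
        if run.quality > best_quality:
            frontier.append(run)
            best_quality = run.quality
    return frontier
```

A point estimate reports one of these runs; a frontier reports the whole trade-off curve, which is why a single good number cannot stand in for it.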

Lab 1 · Weeks 2–4 · Rust

VLSM: an LSM-tree with vector segments

SSTables carry quantized vector segments with per-level HNSW graphs; compaction merges graphs. Deliverable: a read/update/memory/recall Pareto plot.

Lab 2 · Weeks 5–8 · any language

Mini-Neon: copy-on-write pages over S3

WAL ingestion, GetPage@LSN, CREATE BRANCH in O(metadata). Demo: 50 concurrent agent branches plus garbage collection of the abandoned ones.
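As a rough illustration of why CREATE BRANCH can cost only metadata, the sketch below keeps page versions in memory and models a branch as a pointer to its parent plus the LSN at which it forked. Everything here is an assumption for illustration (the `Branch` class, its method names, the in-memory dictionary standing in for S3); WAL ingestion and garbage collection are left out, and this is not the lab's starter code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Branch:
    """Hypothetical branch record: a parent pointer plus its own page versions."""
    name: str
    parent: Optional["Branch"] = None
    branch_lsn: int = 0  # LSN on the parent at which this branch forked
    # page_id -> list of (lsn, page_bytes) written on this branch only
    versions: dict = field(default_factory=dict)

    def write(self, page_id, lsn, data):
        # Assumes LSNs arrive in increasing order for each page.
        self.versions.setdefault(page_id, []).append((lsn, data))

    def create_branch(self, name, at_lsn):
        # O(metadata): records the fork point, copies no pages.
        return Branch(name=name, parent=self, branch_lsn=at_lsn)

    def get_page(self, page_id, lsn):
        """Return the newest version of page_id visible at lsn, or None."""
        for version_lsn, data in reversed(self.versions.get(page_id, [])):
            if version_lsn <= lsn:
                return data
        if self.parent is not None:
            # Nothing written on this branch yet: read through to the parent,
            # but never past the fork point.
            return self.parent.get_page(page_id, min(lsn, self.branch_lsn))
        return None
```

Under this model an abandoned branch owns only the versions it wrote itself, so collecting it never touches pages still visible to the parent.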

Lab 3 · Weeks 9–10 · Python

Text-to-SQL agent + eval harness

An agent over a 126-table schema full of deliberate traps; build the eval set, add a semantic layer, and reproduce the 21%→95% curve in miniature.

Lab 4 · Weeks 10–12 · Python

A semantic-operator optimizer

sem_filter / sem_join / sem_topk with model-tier cost models and proxy cascades. Beat the naive plan by 10× on cost at ≤2% quality loss.
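One way to picture a proxy cascade for sem_filter, sketched under stated assumptions: a cheap proxy scores every row, confident scores are decided on the spot, and only the uncertain middle band is escalated to the expensive model. The function name `cascade_filter`, the thresholds, and the unit costs are all hypothetical; `proxy` and `oracle` are plain callables standing in for a small and a large model.

```python
def cascade_filter(rows, proxy, oracle, low, high,
                   proxy_cost=1.0, oracle_cost=100.0):
    """Filter rows with a two-tier cascade; return (kept_rows, total_cost).

    proxy(row)  -> score in [0, 1] from the cheap model (assumed)
    oracle(row) -> bool from the expensive model (assumed)
    Scores >= high are accepted and scores <= low are rejected without
    consulting the oracle; everything in between is escalated.
    """
    kept = []
    cost = 0.0
    for row in rows:
        cost += proxy_cost
        score = proxy(row)
        if score >= high:
            kept.append(row)          # proxy is confident: accept
        elif score > low:
            cost += oracle_cost       # uncertain: pay for the oracle
            if oracle(row):
                kept.append(row)
        # score <= low: proxy is confident the row fails, so drop it
    return kept, cost
```

Widening the gap between `low` and `high` buys quality with oracle calls; narrowing it buys cost savings with proxy mistakes. Sweeping the two thresholds traces out the cost/quality curve against which a "10× cheaper at ≤2% quality loss" target would be measured.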

Assessment

Grading, projects, exam

40% labs

35% final project

15% papers + debate

10% exam

The final project is a team-of-two, CIDR-format paper plus working artifact, chosen from a menu of eight publishable questions — from recall-aware compaction to the agent-native TPC benchmark. The take-home synthesis exam is open-everything, including frontier models; its five questions each require holding two layers of the stack in tension. The famous ones: RUM-R (formalize recall as the fourth axis of the RUM conjecture) and Isolation for liars (define an isolation level for agents acting on stale memory summaries).

Every lab ships with real starter materials — skeleton code, the 126-table trap schema with gold-SQL evals, the deterministic LLM simulator — linked from each lab’s “Provided materials” section.

For Instructors & Self-Studiers

Teach it, or take it alone

Guide

Teaching DATA 2027

Pacing variants (semester / quarter / undergrad), running the labs, enforcing the open-model policy, adapting the trap schema, and an honest provenance note.

Reference

Resources & reading list

All 42 weekly readings in one place, organized by part, plus the two books, the benchmarks, and the tools the labs use.
