- Turn a one-line research question into a falsifiable hypothesis with a measurable baseline
- Build a systems artifact whose experiments someone else can rerun from a single script
- Write a CIDR-style paper: a clear claim, an honest evaluation, and a real limitations section
- Defend a negative result as rigorously as a positive one
What “Done” Means
Every project on the menu is a publishable question. That cuts both ways: nobody expects a VLDB acceptance in fourteen weeks, but everybody is held to the standards that make one possible. Three of those standards are non-negotiable.
Paper format
Your report is a CIDR-format paper, at most seven pages plus references, written for a reader who knows databases but has not taken this course. It must state a hypothesis in the introduction, describe the artifact precisely enough to reimplement, and evaluate against at least one serious baseline — “we didn’t compare against anything” is the most common cause of a failing project grade. A related-work section that engages honestly with the references listed under each project below is expected, not optional.
The artifact bar
The artifact is code that runs. The bar: a fresh checkout, one setup command, one experiment script, and every figure in your paper regenerates. We will actually do this. Prototypes are fine — you may fork RocksDB rather than write an LSM tree, stub a planner rather than write one — but the measured path must be real, and anything mocked must be declared in the paper.
Honest negative results
If your hypothesis turns out to be false, you have not failed the project — you have completed it. A paper that says “speculative pre-execution does not pay for itself below 40 branches/hour, and here is the measurement that shows why” earns full marks. What fails is dishonesty: cherry-picked workloads, missing error bars, baselines configured to lose. We grade the rigor of the investigation, not the sign of the result.
Timeline
| Milestone | Week | Deliverable | Stakes |
|---|---|---|---|
| Proposal | Week 6 | 2 pages: hypothesis, baseline, evaluation plan, division of labor | Graded pass/revise; you cannot start Lab credit on the project without a pass |
| Midway check | Week 10 | Artifact skeleton runs end-to-end on a toy workload; one preliminary plot | 15-minute meeting; this is where doomed scopes get cut, not punished |
| Final report | Week 14 | CIDR paper + tagged artifact release + 10-minute talk | The 35% |
Choose One of Eight
Each brief expands the one-liner from the syllabus into motivation, deliverables, milestones, and the shape of a strong result. You may propose a variation; you may not propose a ninth project without instructor sign-off and a written hypothesis at least as sharp as these.
Motivation. LSM compaction policies are tuned for write amplification and read cost; vector indexes are tuned for recall. When a graph index like HNSW lives inside an LSM engine, compaction silently rewires the graph — and nobody currently co-optimizes the two. The result is engines that pay full compaction cost while degrading the very recall the index exists to provide.
Deliverables. A compaction policy (in a real LSM engine or a faithful simulator) whose cost function includes a graph-quality term; a measurement harness that tracks recall@k and write amplification across compaction schedules; a paper characterizing the recall/write-amp Pareto frontier.
Milestones. (1) Reproduce baseline: measure recall degradation under leveled and tiered compaction on a standard ANN dataset. (2) Implement the recall-aware policy and show it moves at least one point on the frontier. (3) Sensitivity study: vary update rate and dimensionality; find where the policy stops paying.
A strong result shows a policy that dominates a stock policy on at least one axis without losing on the other, with an explanation of why grounded in graph connectivity — or a principled argument that the frontier is inherently flat.
Key references. Malkov & Yashunin, HNSW (TPAMI 2018); Idreos et al., The Data Calculator (SIGMOD 2018).
Motivation. Agent memory systems like Mem0 and Graphiti give agents long-term state, but with the consistency guarantees of a sticky note: no isolation, no history, no audit trail. An agent acting on memory that another agent is concurrently rewriting is a correctness bug wearing a product feature’s clothing.
Deliverables. A Mem0-class memory store with snapshot isolation across concurrent agent sessions, time-travel reads (“what did the agent believe at t?”), and an append-only audit log; a benchmark suite comparing latency, recall, and consistency anomalies against Graphiti.
Milestones. (1) MVCC core: versioned memory writes with snapshot reads, demonstrated with two concurrent sessions. (2) Time travel + audit log, with a replay tool. (3) Head-to-head benchmark against Graphiti on a multi-agent workload, counting observed anomalies under each system.
A strong result quantifies the price of isolation — how much latency and storage snapshot semantics cost — and demonstrates at least one real anomaly class that the baseline exhibits and your system prevents.
Key references. Mem0 (arXiv 2504.19413); Zep/Graphiti (arXiv 2501.13956); MemGPT (arXiv 2310.08560).
Motivation. The lethal-trifecta incidents of 2025 showed that an MCP server with database access turns every row of untrusted data into a potential instruction stream. Application-layer prompt filtering keeps failing because it asks the model to police itself; the database community’s answer — enforce policy below the model — has not yet been built for MCP.
Deliverables. A gateway that sits between an MCP client and a real database, tagging rows with provenance and enforcing row-security policy regardless of what the model asks for; a red-team report where you attack your own gateway with injection payloads embedded in data; a paper on what survived.
Milestones. (1) Working gateway with static row policies; demonstrate it blocks a textbook exfiltration. (2) Provenance tagging: policies that depend on where data came from, not just who asks. (3) Structured red-team: a corpus of injection attacks, success rates with and without the gateway.
A strong result is a defense with a precisely stated threat model, an attack corpus that breaks the unprotected baseline near-100% of the time, and an honest accounting of what still gets through — a gateway that “blocks everything” is a gateway that hasn’t been attacked properly.
Key references. Willison, the lethal trifecta / Supabase MCP case (2025); Supabase, Defense in Depth for MCP Servers.
Motivation. Agents issue queries in long, structured chains — explore the schema, sample a table, refine a filter — and each round trip stalls the whole reasoning loop. Branching storage engines make speculation cheap; what’s missing is the predictor. If an agent’s next five queries are guessable from its trace, the database can have the answers waiting.
Deliverables. A predictor (learned or heuristic) over agent query traces; a speculative executor that pre-runs predicted queries on branches and serves hits from cache; an evaluation on real agent traces measuring hit rate, latency saved, and wasted work.
Milestones. (1) Trace collection: instrument an agent on a text-to-SQL benchmark, characterize query-sequence predictability. (2) Predictor + speculative executor with a wasted-work budget. (3) End-to-end: agent task latency with and without speculation, across budget settings.
A strong result reports the crossover honestly: at what hit rate does speculation pay for its wasted compute, and do real agent traces clear that bar? A clean “no, and here’s the predictability ceiling” is as publishable as a yes.
Key references. Spider 2.0 (ICLR 2025) for workloads; Neon’s branching architecture; Marcus et al., Bao (SIGMOD 2021) for learning-from-feedback framing.
Motivation. Every team now pastes its dbt docs into a system prompt and calls it context engineering. But a semantic layer is a formal artifact — metrics, joins, grain — and the mapping from that artifact to a token budget is a compilation problem, not a copy-paste problem. Nobody has treated it as one.
Deliverables. A compiler from dbt/Cube semantic definitions to model-optimized context, with token-budget-aware schema elision (drop, summarize, or defer the parts of the schema least relevant to the query); an evaluation on a text-to-SQL benchmark across budget levels.
Milestones. (1) Parser + naive serializer: full semantic layer to context, measure baseline accuracy and token cost. (2) Elision engine: relevance-ranked compression under a hard budget. (3) Accuracy-vs-tokens curves across three budgets and two models; ablate each elision strategy.
A strong result is the curve itself: accuracy as a function of token budget, with the compiled context dominating naive truncation, plus a taxonomy of which schema facts models actually need and which they reliably infer.
Key references. Anthropic, effective context engineering for AI agents; BIRD (NeurIPS 2023) and Spider 2.0 (ICLR 2025) for evaluation.
Motivation. Bao showed that a learned advisor steering a classical optimizer’s hints beats replacing the optimizer outright. Semantic-operator pipelines in LOTUS face an even richer steering space — model choice, cascade thresholds, operator ordering — and currently navigate it with static heuristics. The Bao recipe has not been tried where it may matter most.
Deliverables. A learned steering layer for LOTUS pipelines that chooses among rewrite/configuration options using execution feedback (cost and accuracy); a benchmark of semantic pipelines with ground truth; an evaluation against LOTUS’s native optimizations.
Milestones. (1) Define the hint space: enumerate steerable choices in three real LOTUS pipelines and measure the spread between best and worst plans. (2) Feedback loop: bandit or regression model picking plans from observed (cost, accuracy) outcomes. (3) Evaluation: cumulative regret against static defaults and against an oracle.
A strong result shows the learned advisor closing most of the gap to the oracle within a realistic number of pipeline executions — or demonstrates that accuracy feedback is too noisy/expensive to learn from, with a measurement of exactly how noisy.
Key references. Marcus et al., Bao (SIGMOD 2021); Patel et al., LOTUS (VLDB 2025).
Motivation. Software earned its reliability with pull requests, CI, and review; tables get none of that — agents and pipelines mutate production data with no staging, no diff, no gate. Iceberg snapshots make cheap table branches possible, and agents are good at writing tests; the missing piece is the system that wires them into a “pull request for tables.”
Deliverables. A working PR-for-tables system on Iceberg: branch, mutate, auto-generate data tests (agent-written expectations over the diff), gate the merge; a study of test quality — do agent-generated tests catch real regressions?
Milestones. (1) Mechanics: branch/diff/merge workflow on Iceberg snapshots with a CLI. (2) Agent test generation from table diffs and history; merge gate. (3) Fault injection: seed realistic data bugs, measure catch rate and false-positive rate of generated tests.
A strong result reports precision and recall of the merge gate on injected faults, and an honest analysis of the failure modes — the bugs agent-written tests systematically miss are the most interesting finding available here.
Key references. Lakehouse (CIDR 2021); Shankar et al., EvalGen (UIST 2024) on validating LLM-generated checks.
Motivation. Every claim in this course — about speculation, memory, gateways, semantic layers — is currently unfalsifiable at scale because no benchmark models an agentic database client: bursty query chains, schema exploration, branch-heavy speculation, mid-task abandonment. TPC-C modeled a warehouse clerk; nothing models a model. The field needs this more than it needs another engine.
Deliverables. A published benchmark: a workload generator that replays parameterized agent behavior against any SQL endpoint, a metrics suite beyond throughput (time-to-task-completion, wasted work, context-fetch cost), and reference runs on at least two real systems.
Milestones. (1) Workload characterization: collect and analyze real agent traces; defend the generator’s parameters against them. (2) Generator + metrics implementation, runnable against Postgres out of the box. (3) Reference runs on two systems with a published, versioned spec and leaderboard format.
A strong result is a benchmark other people can run without you in the room — and whose metrics provably distinguish systems that conventional benchmarks rank as equivalent. If TPC-C and your benchmark always agree, your benchmark isn’t measuring anything new.
Key references. Spider 2.0 (ICLR 2025); BIRD (NeurIPS 2023); the TPC-C specification, read as a design document rather than a relic.