- Draw the self-driving DBMS control loop (forecast → plan → act → observe) and explain why workload forecasting, not ML cleverness, is the load-bearing component.
- Explain how OtterTune framed knob tuning as black-box optimization with Gaussian-process regression, and give two structural reasons the standalone product died in 2024 while the technique survived inside platforms.
- Navigate the Data Calculator’s design space: given a read/write/memory budget, predict which layout primitives a cost synthesizer would choose, and extend the same framing to RAG and serving pipelines.
- Contrast a narrow learned tuner with a general LLM agent acting as DBA, and specify the guardrail / eval / rollback harness the agent needs before it touches production.
- Audit any “learned database component” claim circa 2026 and classify it honestly: in-production, advisor-mode, or lab-only.
From Auto-Tuning to Self-Design
Andy Pavlo opened CIDR 2017 with a deliberately immodest claim: everything before — index advisors, knob wizards, partition recommenders — was “automated” but not autonomous, because a human still decided when to apply the advice and absorbed the blame when it went wrong. A self-driving DBMS, by analogy to a self-driving car, must close the loop itself: predict where the workload is going, choose actions whose costs pay off before the prediction expires, apply them without a maintenance window, and learn from the outcome. The paper’s architecture (built into the Peloton prototype, later reborn as NoisePage) has three stages, and the ordering is the whole argument: forecast first, then plan. You cannot decide whether building an index is worth its 40-minute build cost unless you can predict that the queries it serves will still be arriving 40 minutes from now. Autonomy is a bet on the future of the workload, and the quality of the bet is bounded by the quality of the forecast.
Workload forecasting: the unglamorous keystone
The CMU follow-up work (QueryBot 5000, SIGMOD 2018) made the forecasting problem concrete. Raw query logs are too high-cardinality to forecast — a busy OLTP system sees millions of distinct statements per day — so the first move is to template queries (strip constants) and then cluster templates by arrival-rate correlation. The striking empirical fact: across the workloads they studied (a bus-tracking app, an academic site, MOOC infrastructure), the top 5 clusters covered roughly 95% of query volume. Forecast those few time series — with an ensemble of linear regression for short horizons and a recurrent network for longer ones — and you can predict arrival rates an hour to a week out, accurately enough to schedule index builds at the 3 a.m. trough the model saw coming. Note what this implies for our course thesis: agentic clients make this harder. Agent-generated SQL templates poorly (every prompt produces a slightly different join order), and agent traffic has bursts correlated with upstream model behavior, not human circadian rhythm. Forecasting research assumed the diurnal human; week 13’s client doesn’t sleep.
Action planning: a control problem wearing an ML costume
Given a forecast, the planner searches a space of actions — add/drop index, repartition, resize buffer pool, change knob — each with an estimated benefit, an estimated cost to apply, and a confidence. Pavlo’s framing is receding-horizon control: pick the action sequence that maximizes predicted objective (latency, cost, SLO compliance) over the forecast window, re-plan as observations arrive. Two disciplined constraints fall out. First, every action needs an undo; an autonomous system that cannot cheaply reverse a bad index is an autonomous system that will eventually destroy a weekend. Second, the system must never explore on the critical path — exploration (trying a knob to see what happens) is the expensive part of learning, and doing it on production traffic is how trust dies. Hold these two constraints in mind; they return on Thursday unchanged, with an LLM in the planner’s seat.
OtterTune: knob tuning as black-box optimization
PostgreSQL exposes roughly 350 configuration knobs; MySQL, over 500. They interact nonlinearly (shared_buffers and effective_cache_size fight over the same RAM; WAL knobs trade throughput against recovery time), and the defaults are tuned for a machine from 2008. OtterTune (SIGMOD 2017, out of the same CMU group) treated the configuration vector as input to a black box: run the workload, observe a target metric, fit a Gaussian-process regression over configurations, and pick the next configuration by maximizing expected improvement — classic Bayesian optimization, plus two pragmatic tricks. It used factor analysis and k-means to prune hundreds of runtime metrics down to a non-redundant handful, and LASSO to rank knobs by impact, tuning only the top of the list. The empirical punchline students should retain: a small set — order of 10 knobs — captures the large majority of achievable gain, and the tuned configurations beat defaults by wide margins (in the paper’s runs, up to 58–94% lower latency than default or other-advisor configs, competitive with human DBA configs). The deeper lesson is the framing: tuning needed no model of the database. The database is a function; sample it.
The postmortem: why the company died and the capability didn’t
OtterTune the startup (founded 2020, ~$14.5M raised) shut down in 2024 — Pavlo announced it with characteristic bluntness after an acquisition fell through. The failure was structural, not scientific, and it generalizes, so treat it as a case study. One: tuning is episodic. A customer pays, gets a 30% improvement in week one, and then the optimizer has converged — what exactly is the recurring subscription buying in month nine? Two: the distribution channel belonged to someone else. Knob access requires deep cloud-provider cooperation, and the providers — Amazon, Microsoft, Google — concluded, correctly, that auto-tuning is a platform feature, not a product. Azure SQL has shipped automatic index creation/drop since 2019; Aurora and the autonomous-branded offerings absorbed the rest. By 2024 the capability hadn’t failed; it had dissolved into the substrate. This is a recurring pattern in our field — remember it when you evaluate any startup whose pitch is “we tune your X”: if X’s vendor can absorb the feature, it will.
The Data Calculator: stop tuning structures, start deriving them
Stratos Idreos’s Data Calculator (SIGMOD 2018) raises the ambition a full level. Tuning accepts the B-tree and adjusts its knobs; self-design asks why you assumed a B-tree at all. The insight: data structures are not a zoo of inventions but points in a combinatorial design space assembled from a few dozen layout primitives — is a node sorted or unsorted? fixed or variable fanout? does it carry filters, fence pointers, a buffer? The Harvard group counts more than 1032 valid two-node-type designs from ~50 primitives; B-trees, LSM-trees, tries, and hash tables are merely the famous coordinates. The second half is cost synthesis: instead of implementing a candidate to benchmark it, the Calculator composes per-primitive cost models — themselves learned from micro-benchmarks on the target hardware, so cache effects come from measurement rather than from a heroic analytical model — into an end-to-end cost for a given workload and machine. What-if questions (“what is this design’s point-read cost on this box at this read/write mix?”) answer in seconds instead of person-months. Design becomes navigation: specify the workload, search the continuum, and the structure falls out. The RUM tradeoff (read vs. update vs. memory — you may optimize two) is the terrain’s conservation law; the Calculator is its map.
And the framing refuses to stay inside storage engines. A RAG pipeline is also a point in a design space: chunk size, embedding model, index family (HNSW’s M and efSearch vs. IVF’s nlist/nprobe), reranker depth, cache policy. Today teams pick these by folklore and vibes — exactly how people picked B-tree fill factors in 1985. A “Data Calculator for serving pipelines” — primitives, composable learned cost models, synthesis under a recall/latency/cost objective — is an obvious open problem, and it is this week’s stretch exercise.
The Agent as DBA
Here is the twist the 2017 roadmap did not predict: the first broadly deployed “autonomous DBA” is not a learned component inside the engine. It is a general-purpose LLM agent sitting outside the engine, holding a connection string, an MCP server, and a packaged master prompt. Every serious Postgres platform now ships an MCP server exposing schema inspection, EXPLAIN, statistics views, and migration execution as tools; alongside them ship what amount to master prompts — multi-thousand-token operational playbooks encoding schema-design review, index selection heuristics, zero-downtime migration choreography (create concurrently, backfill in batches, swap, drop). The expertise that OtterTune encoded in a Gaussian process and the 2004 advisors encoded in C++ is now encoded in prose, executed by a model that can also read the manual. This sounds like a regression in rigor. In three specific ways it is an advance — and in three others, nothing has changed at all.
What changes: the tuner can read, experiment, and explain
First, the agent reads documentation. OtterTune treated the DBMS as a black box because that was the only honest contract a narrow ML system could have. An LLM agent knows — from training and from fetching release notes — that checkpoint_completion_target spreads checkpoint I/O across the interval, that Postgres 17 changed vacuum memory accounting, that your ORM’s lazy loading explains the N+1 pattern in the log. It debugs causally where the GP could only regress. Second, the agent experiments on branches. Copy-on-write database branching (Neon, Aurora clones, and friends) gives a writable replica of production in seconds for cents — the missing substrate for the “never explore on the critical path” rule. The 2017 loop had to learn from cautious production nudges; the 2026 agent forks reality, replays a captured workload, and measures. Third, the agent explains itself. A GP emits a configuration vector; the agent emits a hypothesis, an experiment, a result, and a pull request a human can argue with. Recall the advisor-mode plateau: the approval gap was a trust gap, and legible reasoning is the first technology that actually narrows it.
-- agent tuning run #47 · branch: tune/checkpoint-2026-06-09 (fork of prod@LSN 0/8A3F1C40)
-- hypothesis: p99 spikes align with checkpoints (log: "checkpoint starting: wal" every ~140s)
ALTER SYSTEM SET max_wal_size = '8GB'; -- was 1GB
ALTER SYSTEM SET checkpoint_completion_target = 0.9; -- was 0.5
-- replay: 30 min captured workload, 14,212 statements, agent-traffic mix 71%
-- result: p99 412ms → 287ms · p50 38ms → 37ms · WAL volume +9% · recovery-time est. +6.2 min
-- gate: PROMOTE requires p99 −20%, WAL ≤ +15%, recovery ≤ +10 min → PASS
-- action: open PR with diff + this transcript; do NOT apply to prod directly
What doesn’t change: guardrails, evals, rollback
Now the cold water. The agent inherits every failure mode we catalogued for analytics agents in weeks 10–12, plus root-shaped blast radius. It hallucinates knob semantics across versions (plausible-sounding settings that were renamed two releases ago). It over-indexes — every individual CREATE INDEX looks locally beneficial, and twelve of them quietly halve your write throughput; the planner discipline of charging actions their maintenance cost does not come for free with a master prompt. And it is exposed to last week’s lethal trifecta: a DBA agent has private data, ingests untrusted content (query comments, table contents, the docs it fetches), and can execute DDL — the highest-privilege instance of the pattern we’ve seen. So the harness is non-negotiable, and it is exactly the validation discipline from the analytics chapter: guardrails (allowlisted tools; DDL only on branches; production changes only via reviewed migration PRs; hard SLO tripwires that revert autonomously), evals (a regression suite of tuning scenarios with known-good outcomes — replayed incidents, seeded slow queries — run against every prompt or model change, because a master prompt edit is a deploy), and rollback (every change carries a down-migration; every promoted config carries the branch evidence that justified it). The loop diagram from Tuesday survives intact; only the PLAN box got smarter.
ALTER TABLE … ADD COLUMN … DEFAULT — safe in Postgres 11+, which the agent correctly cited. The production instance was Postgres 10. Full table rewrite, 40 minutes of ACCESS EXCLUSIVE lock, on the orders table. The post-incident fix was not a better model; it was a version-pinned eval case and a rule that migrations execute only through the same CI gate humans use. The agent’s citation was right in general and wrong here — which is the entire epistemology of LLM operations in one incident.An honest ledger: learned components in production, 2026
Close the week by auditing the field’s own marketing. The 2017–2020 literature promised learned everything: indexes (Kraska et al.’s “The Case for Learned Index Structures,” 2018), cardinality estimators, optimizers (Bao, Neo), schedulers. Where did each actually land? The honest 2026 scorecard:
| Component | Status 2026 | Evidence & caveats |
|---|---|---|
| Automatic indexing | Production, fleet-scale | Azure SQL DB has created/dropped indexes autonomously since 2019, with built-in verification & auto-revert — the existence proof that closed-loop actions can work at scale. |
| Knob tuning | Absorbed into platforms | OtterTune dead (2024); the capability lives on as managed-service defaults, adaptive autovacuum, “autonomous” tiers. Nobody buys it; everybody has it. |
| Optimizer steering (Bao-style) | Production, narrow | Hint-steering learned models run inside Microsoft’s big-data stack; per-query plan choice, with the classical optimizer as floor. The “learning to steer, not to plan” compromise won. |
| Learned cardinality estimation | Advisor-mode / lab | Beats histograms on benchmarks; loses on cold-start, drift, and tail-risk accountability. No major engine trusts it on the critical path by default. |
| Learned index structures | Niche | Ideas absorbed into LSM components and specialized stores; the B-tree, with its 60-cycle worst case and zero training pipeline, remains undefeated as default. |
| LLM agent as DBA | Early production, gated | Real deployments behind branch-only DDL, eval suites, and human-merged PRs. The autonomy is real; so is the harness. |
Read the pattern in the table: learned components succeed in production precisely in proportion to how cheap their mistakes are to detect and undo. Auto-indexing shipped because an index is verifiable and reversible; learned cardinalities stall because a bad estimate hides inside a slow plan with no alarm attached. That is the same conservation law that governs the agent-DBA harness, and the same one from Tuesday’s dashed guardrail line. The physics here was never about ML capability. It was about the cost of being wrong — and on that axis, nothing about 2026 is new.
Read Before Thursday
This Week’s Problems
You benchmark five Postgres knobs one-at-a-time against a default baseline of p99 = 500 ms: shared_buffers → 410 ms, max_wal_size → 380 ms, work_mem → 470 ms, random_page_cost → 455 ms, effective_io_concurrency → 495 ms. (a) Rank the knobs by marginal sensitivity, OtterTune-LASSO style. (b) Jointly tuning the top two yields 290 ms — better than either alone predicts multiplicatively. Name the phenomenon, give a mechanistic hypothesis for these two knobs, and explain why it forces sample-efficient methods like GP regression instead of one-knob-at-a-time sweeps over a 350-knob space.
Design the full experiment protocol an LLM-DBA agent must follow before any production change, in the spirit of run #47 from lecture. Specify: (a) how the workload is captured and replayed on a branch, including how you make replay representative of a 70%-agent-traffic mix with poor query templating; (b) the promotion gate as concrete, falsifiable thresholds over at least four metrics — including one maintenance-cost metric that punishes over-indexing and one recovery/availability metric; (c) the rollback contract for both config changes and DDL; (d) five eval scenarios (seeded incidents with known-good resolutions) you would run against every change to the master prompt, with one scenario specifically targeting version-dependent knowledge like the Postgres 10/11 ADD COLUMN DEFAULT trap. Justify each gate threshold in one sentence.
Sketch a Data Calculator for RAG pipelines. (a) Define a primitive set of 10–15 design dimensions (chunking strategy and size, embedding model and dimensionality, index family with its parameters — e.g., HNSW M/efSearch vs. IVF nlist/nprobe — hybrid lexical fusion, reranker depth, cache policy) and estimate the size of the resulting design space. (b) Propose composable cost models for recall@k, p99 latency, and $/1k queries, stating which sub-models must be learned from micro-benchmarks (à la the Calculator) and which can be analytical. (c) Identify this space’s analogue of the RUM conjecture — which three desiderata cannot be simultaneously optimized, and why? (d) Either prototype the synthesizer on a small corpus and report where the cost model’s ranking diverges from measured reality, or argue rigorously for why a specific dimension (we suggest chunking) resists compositional cost modeling in a way storage-layout primitives do not. This is an open research problem; a sufficiently good answer is a workshop paper.