Week 13: Self-Driving, Self-Assembling, Self-Designing

Learning objectives — after this week you can…

Draw the self-driving DBMS control loop (forecast → plan → act → observe) and explain why workload forecasting, not ML cleverness, is the load-bearing component.
Explain how OtterTune framed knob tuning as black-box optimization with Gaussian-process regression, and give two structural reasons the standalone product died in 2024 while the technique survived inside platforms.
Navigate the Data Calculator’s design space: given a read/write/memory budget, predict which layout primitives a cost synthesizer would choose, and extend the same framing to RAG and serving pipelines.
Contrast a narrow learned tuner with a general LLM agent acting as DBA, and specify the guardrail / eval / rollback harness the agent needs before it touches production.
Audit any “learned database component” claim circa 2026 and classify it honestly: in-production, advisor-mode, or lab-only.

Lecture 1 · Tuesday

From Auto-Tuning to Self-Design

Andy Pavlo opened CIDR 2017 with a deliberately immodest claim: everything before — index advisors, knob wizards, partition recommenders — was “automated” but not autonomous, because a human still decided when to apply the advice and absorbed the blame when it went wrong. A self-driving DBMS, by analogy to a self-driving car, must close the loop itself: predict where the workload is going, choose actions whose costs pay off before the prediction expires, apply them without a maintenance window, and learn from the outcome. The paper’s architecture (built into the Peloton prototype, later reborn as NoisePage) has three stages, and the ordering is the whole argument: forecast first, then plan. You cannot decide whether building an index is worth its 40-minute build cost unless you can predict that the queries it serves will still be arriving 40 minutes from now. Autonomy is a bet on the future of the workload, and the quality of the bet is bounded by the quality of the forecast.

Workload forecasting: the unglamorous keystone

The CMU follow-up work (QueryBot 5000, SIGMOD 2018) made the forecasting problem concrete. Raw query logs are too high-cardinality to forecast — a busy OLTP system sees millions of distinct statements per day — so the first move is to template queries (strip constants) and then cluster templates by arrival-rate correlation. The striking empirical fact: across the workloads they studied (a bus-tracking app, an academic site, MOOC infrastructure), the top 5 clusters covered roughly 95% of query volume. Forecast those few time series — with an ensemble of linear regression for short horizons and a recurrent network for longer ones — and you can predict arrival rates an hour to a week out, accurately enough to schedule index builds at the 3 a.m. trough the model saw coming. Note what this implies for our course thesis: agentic clients make this harder. Agent-generated SQL templates poorly (every prompt produces a slightly different join order), and agent traffic has bursts correlated with upstream model behavior, not human circadian rhythm. Forecasting research assumed the diurnal human; week 13’s client doesn’t sleep.

Action planning: a control problem wearing an ML costume

Given a forecast, the planner searches a space of actions — add/drop index, repartition, resize buffer pool, change knob — each with an estimated benefit, an estimated cost to apply, and a confidence. Pavlo’s framing is receding-horizon control: pick the action sequence that maximizes predicted objective (latency, cost, SLO compliance) over the forecast window, re-plan as observations arrive. Two disciplined constraints fall out. First, every action needs an undo; an autonomous system that cannot cheaply reverse a bad index is an autonomous system that will eventually destroy a weekend. Second, the system must never explore on the critical path — exploration (trying a knob to see what happens) is the expensive part of learning, and doing it on production traffic is how trust dies. Hold these two constraints in mind; they return on Thursday unchanged, with an LLM in the planner’s seat.

Fig. 13.1 — The self-driving control loop from Pavlo et al. (CIDR 2017). OtterTune replaced PLAN with a Gaussian process; the Data Calculator widened ACT from “tune the structure” to “choose the structure”; the 2026 LLM-DBA replaces the planner with a general agent. The dashed guardrail line is the only part that has never changed.

OtterTune: knob tuning as black-box optimization

PostgreSQL exposes roughly 350 configuration knobs; MySQL, over 500. They interact nonlinearly (shared_buffers and effective_cache_size fight over the same RAM; WAL knobs trade throughput against recovery time), and the defaults are tuned for a machine from 2008. OtterTune (SIGMOD 2017, out of the same CMU group) treated the configuration vector as input to a black box: run the workload, observe a target metric, fit a Gaussian-process regression over configurations, and pick the next configuration by maximizing expected improvement — classic Bayesian optimization, plus two pragmatic tricks. It used factor analysis and k-means to prune hundreds of runtime metrics down to a non-redundant handful, and LASSO to rank knobs by impact, tuning only the top of the list. The empirical punchline students should retain: a small set — order of 10 knobs — captures the large majority of achievable gain, and the tuned configurations beat defaults by wide margins (in the paper’s runs, up to 58–94% lower latency than default or other-advisor configs, competitive with human DBA configs). The deeper lesson is the framing: tuning needed no model of the database. The database is a function; sample it.

The postmortem: why the company died and the capability didn’t

OtterTune the startup (founded 2020, ~$14.5M raised) shut down in 2024 — Pavlo announced it with characteristic bluntness after an acquisition fell through. The failure was structural, not scientific, and it generalizes, so treat it as a case study. One: tuning is episodic. A customer pays, gets a 30% improvement in week one, and then the optimizer has converged — what exactly is the recurring subscription buying in month nine? Two: the distribution channel belonged to someone else. Knob access requires deep cloud-provider cooperation, and the providers — Amazon, Microsoft, Google — concluded, correctly, that auto-tuning is a platform feature, not a product. Azure SQL has shipped automatic index creation/drop since 2019; Aurora and the autonomous-branded offerings absorbed the rest. By 2024 the capability hadn’t failed; it had dissolved into the substrate. This is a recurring pattern in our field — remember it when you evaluate any startup whose pitch is “we tune your X”: if X’s vendor can absorb the feature, it will.

Field note · the advisor-mode plateauNone of this began in 2017. Microsoft’s AutoAdmin project shipped an index-tuning wizard in SQL Server 7.0 in 1998; IBM’s DB2 had SMART; Oracle shipped the SQL Tuning Advisor in 2004. Every one of them plateaued at the same altitude: the system recommends, a human approves. Two decades of research couldn’t cross the approval gap, because the gap was never technical — it was about who gets paged at 3 a.m. when the recommendation is wrong. Keep that in mind Thursday, when we ask whether an LLM crosses it.

The Data Calculator: stop tuning structures, start deriving them

Stratos Idreos’s Data Calculator (SIGMOD 2018) raises the ambition a full level. Tuning accepts the B-tree and adjusts its knobs; self-design asks why you assumed a B-tree at all. The insight: data structures are not a zoo of inventions but points in a combinatorial design space assembled from a few dozen layout primitives — is a node sorted or unsorted? fixed or variable fanout? does it carry filters, fence pointers, a buffer? The Harvard group counts more than 10³² valid two-node-type designs from ~50 primitives; B-trees, LSM-trees, tries, and hash tables are merely the famous coordinates. The second half is cost synthesis: instead of implementing a candidate to benchmark it, the Calculator composes per-primitive cost models — themselves learned from micro-benchmarks on the target hardware, so cache effects come from measurement rather than from a heroic analytical model — into an end-to-end cost for a given workload and machine. What-if questions (“what is this design’s point-read cost on this box at this read/write mix?”) answer in seconds instead of person-months. Design becomes navigation: specify the workload, search the continuum, and the structure falls out. The RUM tradeoff (read vs. update vs. memory — you may optimize two) is the terrain’s conservation law; the Calculator is its map.

And the framing refuses to stay inside storage engines. A RAG pipeline is also a point in a design space: chunk size, embedding model, index family (HNSW’s M and efSearch vs. IVF’s nlist/nprobe), reranker depth, cache policy. Today teams pick these by folklore and vibes — exactly how people picked B-tree fill factors in 1985. A “Data Calculator for serving pipelines” — primitives, composable learned cost models, synthesis under a recall/latency/cost objective — is an obvious open problem, and it is this week’s stretch exercise.

Autonomy never shipped as a product you buy. It shipped as a layer you stop noticing.

Lecture 2 · Thursday

The Agent as DBA

Here is the twist the 2017 roadmap did not predict: the first broadly deployed “autonomous DBA” is not a learned component inside the engine. It is a general-purpose LLM agent sitting outside the engine, holding a connection string, an MCP server, and a packaged master prompt. Every serious Postgres platform now ships an MCP server exposing schema inspection, EXPLAIN, statistics views, and migration execution as tools; alongside them ship what amount to master prompts — multi-thousand-token operational playbooks encoding schema-design review, index selection heuristics, zero-downtime migration choreography (create concurrently, backfill in batches, swap, drop). The expertise that OtterTune encoded in a Gaussian process and the 2004 advisors encoded in C++ is now encoded in prose, executed by a model that can also read the manual. This sounds like a regression in rigor. In three specific ways it is an advance — and in three others, nothing has changed at all.

What changes: the tuner can read, experiment, and explain

First, the agent reads documentation. OtterTune treated the DBMS as a black box because that was the only honest contract a narrow ML system could have. An LLM agent knows — from training and from fetching release notes — that checkpoint_completion_target spreads checkpoint I/O across the interval, that Postgres 17 changed vacuum memory accounting, that your ORM’s lazy loading explains the N+1 pattern in the log. It debugs causally where the GP could only regress. Second, the agent experiments on branches. Copy-on-write database branching (Neon, Aurora clones, and friends) gives a writable replica of production in seconds for cents — the missing substrate for the “never explore on the critical path” rule. The 2017 loop had to learn from cautious production nudges; the 2026 agent forks reality, replays a captured workload, and measures. Third, the agent explains itself. A GP emits a configuration vector; the agent emits a hypothesis, an experiment, a result, and a pull request a human can argue with. Recall the advisor-mode plateau: the approval gap was a trust gap, and legible reasoning is the first technology that actually narrows it.

-- agent tuning run #47 · branch: tune/checkpoint-2026-06-09 (fork of prod@LSN 0/8A3F1C40)
-- hypothesis: p99 spikes align with checkpoints (log: "checkpoint starting: wal" every ~140s)
ALTER SYSTEM SET max_wal_size = '8GB';                 -- was 1GB
ALTER SYSTEM SET checkpoint_completion_target = 0.9;   -- was 0.5
-- replay: 30 min captured workload, 14,212 statements, agent-traffic mix 71%
-- result: p99 412ms → 287ms · p50 38ms → 37ms · WAL volume +9% · recovery-time est. +6.2 min
-- gate:   PROMOTE requires p99 −20%, WAL ≤ +15%, recovery ≤ +10 min   → PASS
-- action: open PR with diff + this transcript; do NOT apply to prod directly

What doesn’t change: guardrails, evals, rollback

Now the cold water. The agent inherits every failure mode we catalogued for analytics agents in weeks 10–12, plus root-shaped blast radius. It hallucinates knob semantics across versions (plausible-sounding settings that were renamed two releases ago). It over-indexes — every individual CREATE INDEX looks locally beneficial, and twelve of them quietly halve your write throughput; the planner discipline of charging actions their maintenance cost does not come for free with a master prompt. And it is exposed to last week’s lethal trifecta: a DBA agent has private data, ingests untrusted content (query comments, table contents, the docs it fetches), and can execute DDL — the highest-privilege instance of the pattern we’ve seen. So the harness is non-negotiable, and it is exactly the validation discipline from the analytics chapter: guardrails (allowlisted tools; DDL only on branches; production changes only via reviewed migration PRs; hard SLO tripwires that revert autonomously), evals (a regression suite of tuning scenarios with known-good outcomes — replayed incidents, seeded slow queries — run against every prompt or model change, because a master prompt edit is a deploy), and rollback (every change carries a down-migration; every promoted config carries the branch evidence that justified it). The loop diagram from Tuesday survives intact; only the PLAN box got smarter.

Field note · the 40-minute lockA team we’ll keep anonymous let a tuning agent apply its own “safe” migration: ALTER TABLE … ADD COLUMN … DEFAULT — safe in Postgres 11+, which the agent correctly cited. The production instance was Postgres 10. Full table rewrite, 40 minutes of ACCESS EXCLUSIVE lock, on the orders table. The post-incident fix was not a better model; it was a version-pinned eval case and a rule that migrations execute only through the same CI gate humans use. The agent’s citation was right in general and wrong here — which is the entire epistemology of LLM operations in one incident.

An honest ledger: learned components in production, 2026

Close the week by auditing the field’s own marketing. The 2017–2020 literature promised learned everything: indexes (Kraska et al.’s “The Case for Learned Index Structures,” 2018), cardinality estimators, optimizers (Bao, Neo), schedulers. Where did each actually land? The honest 2026 scorecard:

Component	Status 2026	Evidence & caveats
Automatic indexing	Production, fleet-scale	Azure SQL DB has created/dropped indexes autonomously since 2019, with built-in verification & auto-revert — the existence proof that closed-loop actions can work at scale.
Knob tuning	Absorbed into platforms	OtterTune dead (2024); the capability lives on as managed-service defaults, adaptive autovacuum, “autonomous” tiers. Nobody buys it; everybody has it.
Optimizer steering (Bao-style)	Production, narrow	Hint-steering learned models run inside Microsoft’s big-data stack; per-query plan choice, with the classical optimizer as floor. The “learning to steer, not to plan” compromise won.
Learned cardinality estimation	Advisor-mode / lab	Beats histograms on benchmarks; loses on cold-start, drift, and tail-risk accountability. No major engine trusts it on the critical path by default.
Learned index structures	Niche	Ideas absorbed into LSM components and specialized stores; the B-tree, with its 60-cycle worst case and zero training pipeline, remains undefeated as default.
LLM agent as DBA	Early production, gated	Real deployments behind branch-only DDL, eval suites, and human-merged PRs. The autonomy is real; so is the harness.

Read the pattern in the table: learned components succeed in production precisely in proportion to how cheap their mistakes are to detect and undo. Auto-indexing shipped because an index is verifiable and reversible; learned cardinalities stall because a bad estimate hides inside a slow plan with no alarm attached. That is the same conservation law that governs the agent-DBA harness, and the same one from Tuesday’s dashed guardrail line. The physics here was never about ML capability. It was about the cost of being wrong — and on that axis, nothing about 2026 is new.

Readings

Read Before Thursday

Self-Driving Database Management Systems — Pavlo et al., CIDR 2017.The manifesto. Focus on the forecast→plan→act architecture and the explicit analogy ladder from “advisor” to “autonomous” — then ask which rung an LLM agent occupies.

The Data Calculator: Data Structure Design and Cost Synthesis from First Principles and Learned Cost Models — Idreos et al., SIGMOD 2018.Read for the design-space framing, not the implementation: layout primitives, the 10^32 continuum, and how learned micro-benchmark models compose into whole-structure cost predictions.

OtterTune postmortem — A. Pavlo, blog post, 2024.A rare honest startup autopsy by its own founder. Focus on why the science worked and the business didn’t: episodic value, and platforms absorbing the feature.

Exercises

This Week’s Problems

Exercise 13.1 · warm-up

You benchmark five Postgres knobs one-at-a-time against a default baseline of p99 = 500 ms: shared_buffers → 410 ms, max_wal_size → 380 ms, work_mem → 470 ms, random_page_cost → 455 ms, effective_io_concurrency → 495 ms. (a) Rank the knobs by marginal sensitivity, OtterTune-LASSO style. (b) Jointly tuning the top two yields 290 ms — better than either alone predicts multiplicatively. Name the phenomenon, give a mechanistic hypothesis for these two knobs, and explain why it forces sample-efficient methods like GP regression instead of one-knob-at-a-time sweeps over a 350-knob space.

Exercise 13.2 · core

Design the full experiment protocol an LLM-DBA agent must follow before any production change, in the spirit of run #47 from lecture. Specify: (a) how the workload is captured and replayed on a branch, including how you make replay representative of a 70%-agent-traffic mix with poor query templating; (b) the promotion gate as concrete, falsifiable thresholds over at least four metrics — including one maintenance-cost metric that punishes over-indexing and one recovery/availability metric; (c) the rollback contract for both config changes and DDL; (d) five eval scenarios (seeded incidents with known-good resolutions) you would run against every change to the master prompt, with one scenario specifically targeting version-dependent knowledge like the Postgres 10/11 ADD COLUMN DEFAULT trap. Justify each gate threshold in one sentence.

Exercise 13.3 · stretch

Sketch a Data Calculator for RAG pipelines. (a) Define a primitive set of 10–15 design dimensions (chunking strategy and size, embedding model and dimensionality, index family with its parameters — e.g., HNSW M/efSearch vs. IVF nlist/nprobe — hybrid lexical fusion, reranker depth, cache policy) and estimate the size of the resulting design space. (b) Propose composable cost models for recall@k, p99 latency, and $/1k queries, stating which sub-models must be learned from micro-benchmarks (à la the Calculator) and which can be analytical. (c) Identify this space’s analogue of the RUM conjecture — which three desiderata cannot be simultaneously optimized, and why? (d) Either prototype the synthesizer on a small corpus and report where the cost model’s ranking diverges from measured reality, or argue rigorously for why a specific dimension (we suggest chunking) resists compositional cost modeling in a way storage-layout primitives do not. This is an open research problem; a sufficiently good answer is a workshop paper.

❦