Week 5: Learned Components

Learning objectives — after this week you can…

Explain why a sorted index is an approximation of the data’s CDF, and derive the lookup cost of a recursive model index from its error bound.
State precisely where learned indexes win (read-mostly, in-memory, sorted numeric) and lose (strings, updates, cold caches), citing SOSD evidence.
Contrast end-to-end learned optimizers (Neo) with steering approaches (Bao) and argue why bounded action spaces made Bao deployable.
Trace what survived from the SageDB vision into shipping systems, and explain OtterTune’s failure as a product-shape lesson, not an ML lesson.
Articulate the inversion: why the agentic era moved the “learned component” from inside the engine to an LLM client outside it, connected over MCP.

Lecture 1 · Tuesday

The Case For (and Against) Learned Indexes

Start with a thought experiment Kraska, Beutel, Chi, Dean, and Polyzotis open their SIGMOD 2018 paper with: suppose your table holds every integer from 1 to 100M, densely packed in a sorted array. You do not need a B-tree to find key k — its position is k − 1, computable in one subtraction, O(1) time and O(1) space. The B-tree you would have built is doing something embarrassing: spending 25 bytes per key and four cache-missing pointer hops to rediscover a linear function. Real data is never that clean, but the provocation generalizes. Every sorted index answers one question — given a key, what is its position? — and that question is exactly evaluating the data’s empirical cumulative distribution function: pos = F(key) · N. An index is a model of the CDF. The only debate is which model class you use.

B-trees were models all along

Read a B-tree through this lens. Each traversal from root to leaf is a piecewise-constant regression: the inner nodes partition the key space, and the leaf tells you “your key lives somewhere in this 4KB page — go scan it.” That final scan is the model’s error bound, fixed by construction at the page size. A B-tree over N keys with fanout f guarantees error ≤ page_size after log_f(N) lookups. It is a model with a hard worst-case guarantee and zero training cost, which is why it has run the world since 1971. The learned-index bet is that for many real distributions, a learned regressor can hit a smaller error bound with fewer memory accesses than the tree spends on its descent — because evaluating two linear layers costs ~50–100ns of pure arithmetic in registers, while three or four pointer chases through a cold tree cost ~100ns each in DRAM latency. The trade is branchy memory traffic for predictable FLOPs, and on 2018 hardware FLOPs were already nearly free.

The recursive model index

A single model over 200M keys is hopeless — a linear regression might miss by millions of positions, and a neural net accurate to ±100 everywhere would be slower than the tree it replaces. Kraska’s fix is the recursive model index (RMI): a fixed-depth hierarchy of models where each stage narrows the problem. Stage 1 is one model (often a small neural net or just a linear fit) that takes the key and predicts not a position but which stage-2 model to ask — effectively learning a coarse partition of the key space. Stage 2 is an array of, say, 100,000 cheap linear models, each responsible for a thin slice of the distribution where the CDF is locally near-linear. Crucially this is not a tree: there are no pointers, no traversal decisions, just two array indexings and two fused multiply-adds. At build time you record, per leaf model, the maximum over- and under-shoot on the actual data: err_min, err_max. Lookup is then: predict p, binary-search the window [p − err_min, p + err_max]. If the leaf’s worst error is 128 positions, that residual search is 7 comparisons inside one or two cache lines. The reported numbers: up to 1.5–3× faster than a tuned B-tree, using 10–100× less memory for the index structure itself. The memory result matters more than the speed result — an RMI over 200M doubles fits in a few megabytes, i.e., in L2.

Fig. 5.1 — Every sorted index approximates the CDF. A B-tree does it as a step function with error fixed at one page; an RMI does it with piecewise-linear leaves plus a recorded worst-case error band. Lookup = two array indexings, two multiply-adds, then a 7-comparison search inside the band.

The update problem, and the two serious fixes

The 2018 paper handled read-only data, and critics pounced: insert one key and every position to its right shifts, invalidating the model and its error bounds. Two designs answered properly. ALEX (Microsoft Research, SIGMOD 2020) keeps the RMI shape but stores data in gapped arrays — leaves deliberately left ~30% empty, with inserts placed where the model predicts they belong (“model-based insertion”), so the model stays accurate by construction rather than by retraining. When a leaf’s error or density degrades past a threshold, ALEX splits or retrains just that leaf. Result: up to 4× the throughput of a B+tree on read-heavy mixes while remaining competitive on writes. PGM-index (Ferragina & Vinciguerra, VLDB 2020) goes the other way: it abandons learning-as-training entirely and computes the provably minimal piecewise-linear approximation with error ≤ ε using an O(N) streaming convex-hull algorithm, then recurses on the segment endpoints. You get B-tree-style worst-case guarantees — O(log N) lookups, fully dynamic via an LSM-like scheme — with a structure typically 10–100× smaller than the equivalent tree. The PGM is the version of this idea you can put in a textbook: no gradient descent, no hyperparameters, a theorem instead of a benchmark.

SOSD: the honest scoreboard

The field then did something admirable: it built a neutral benchmark and let the idea be falsified. SOSD (Marcus, Kipf, van Renen et al., VLDB 2020) raced RMI, PGM, RadixSpline against tuned B-trees, ART, FAST, and plain interpolation search on 200M-key datasets — real ones (OSM cell IDs, Facebook user IDs, Amazon book popularity) and synthetic ones. The verdict is worth memorizing because it is the template for every honest ML-for-systems evaluation since. Learned structures genuinely win — often by 1.5–2× on lookup latency at a fraction of the space — but only inside a specific box: read-mostly, in-memory, sorted, numeric keys. Step outside any wall of that box and the win evaporates. Strings? The model needs an embedding and the comparison cost dominates anyway. Heavy updates? You pay ALEX’s gap maintenance or PGM’s merge amortization, and a B-tree’s boring predictability looks good again. Data on disk? One I/O is 100µs; saving 200ns of search inside the page is rounding error — the B-tree’s fanout already minimized I/Os, which was always its actual job. The 2018 paper’s implicit claim was “this replaces B-trees”; the 2020 evidence says “this is a better node layout for a specific, large, but bounded regime.” Both things can be true. Notably, Google reported learned-index techniques in production inside Bigtable’s SSTable block indexes — exactly the read-only, sorted, in-memory regime SOSD identified.

Field note The loudest early objection wasn’t technical. “The case for learned index structures” has Jeff Dean on the author list, and half the community read it as Google announcing that systems people would be replaced by TensorFlow. The lasting contribution turned out to be almost the opposite of the hype: a framing (indexes are CDF models) that let non-ML people like Ferragina build the PGM-index with no machine learning whatsoever. The best outcome of an ML-for-systems paper was a new piece of classical algorithmics.

Lecture 2 · Thursday

Bao: ML Steering the Optimizer

Query optimization should have been the killer app for learned components, because the classical solution is built on estimates everyone knows are fiction. Leis et al.’s “How Good Are Query Optimizers, Really?” (VLDB 2015) measured it: on the Join Order Benchmark, PostgreSQL’s cardinality estimates for multi-join queries were off by factors of 10⁴ to 10⁸ — not percent, orders of magnitude — because the independence and uniformity assumptions behind per-column histograms collapse under correlated predicates. (Movies produced in Germany really are more likely to have German-language titles.) Cost models are quadratic refinements of garbage inputs. So researchers reached for ML twice: first to fix the estimates, then to replace the optimizer wholesale. Both attacks taught us more about deployment than about ML.

Learned cardinalities: better numbers, same fragility

Learned cardinality estimators — MSCN’s supervised set-convolutions over query features (Kipf et al., CIDR 2019), DeepDB’s sum-product networks over the data itself, NeuroCard’s join-aware autoregressive models — cut median q-error from hundreds down to single digits in lab settings. And yet none of them ship in a major engine’s hot path. Three reasons recur. Tail risk: a median-better estimator that occasionally hallucinates a 10⁶× error produces one catastrophic plan a day, and a database that is fast on average but unpredictable is a database users abandon. Drift: the model is trained against yesterday’s data and yesterday’s workload; after a bulk load or a schema migration its errors are silently stale, while a histogram is at least cheaply rebuildable by ANALYZE. Inference cost in the wrong place: optimization happens per-query on the latency path; nobody wants a 5ms neural-net call to plan a 2ms query. Keep this failure triangle — tails, drift, inference placement — on the board; Bao is engineered point-by-point against it.

Neo versus Bao: replace the optimizer, or steer it?

Marcus and the MIT group tried both ends of the spectrum, which makes the pair a perfect controlled experiment in research strategy. Neo (VLDB 2019) is the maximalist: an end-to-end learned optimizer that builds join orders bottom-up with a tree-convolution value network guiding best-first search, bootstrapped from PostgreSQL’s plans and then improving past them. It worked — after about a day of training it matched or beat commercial optimizers on its training workload. It also exhibited every deployment pathology at once: cold start on a new schema (it knows nothing until it has watched thousands of queries), brittleness under drift, an unbounded action space in which a bad network can emit an arbitrarily terrible plan, and zero explainability when it does.

Bao (SIGMOD 2021, ACM SIGMOD Best Paper) is the same group’s deliberate retreat to a defensible position, and it is the most important design move in this whole literature. Bao does not generate plans. PostgreSQL already exposes coarse planner switches — enable_hashjoin, enable_nestloop, enable_indexscan, and friends. A subset of those switches is a hint set; Bao fixes a menu of 48 of them. For each incoming query it asks the classical optimizer to plan the query under each hint set, feeds each resulting plan tree (operators + cardinality estimates as features) through a tree-convolutional value model predicting latency, and executes the predicted-best plan. Exploration is handled as a contextual multi-armed bandit with Thompson sampling — Bao deliberately occasionally picks a non-greedy arm, observes the true latency, and retrains; the regret framework gives it a principled answer to “how much do we pay to keep learning?” Look at how each leg of the failure triangle is amputated. Tail risk: every arm is a plan a real optimizer produced, so the worst case is a valid, costed, classical plan — the floor is “PostgreSQL on an off day,” not chaos. Cold start: the default arm (empty hint set) is PostgreSQL, so day-one behavior is the status quo. Drift: a 48-way choice retrains in minutes on one GPU, and the paper shows it adapting to workload and schema shifts in hours, cutting tail (p99) latencies substantially on real workloads and reducing cost on cloud instances. Bao shipped in spirit: Microsoft’s steered query optimizer for its SCOPE big-data workloads (SIGMOD 2021; deployed in production at Microsoft) and Meta’s “AutoSteer” for PrestoDB (an Intel/Meta/TUM collaboration, PVLDB 2023) are recognizably Bao-shaped.

System	Where the ML sits	Action space	Worst case	Fate
Neo (VLDB ’19)	Replaces the optimizer	All possible plans	Arbitrarily bad plan, unexplainable	Influential; never deployed as-is
Bao (SIGMOD ’21)	Advises the optimizer (picks 1 of 48 hint sets)	48 classical plans	A valid plan from a real optimizer	Pattern adopted (MSFT steered optimizer, Meta AutoSteer & kin)
OtterTune (’17–’24)	Outside the engine, tuning knobs	~10–200 config knobs	Bad config (recoverable restart)	Good tech, dead company (2024)
LLM-DBA via MCP (now)	Outside the engine, as a client	SQL, EXPLAIN, DDL, knobs — anything a human DBA can type	Whatever you let it execute	The current frontier — and Week 12’s safety problem

The systems that shipped all share one clause in their contract: the model advises, the engine decides.

SageDB’s vision, and the parts that actually shipped

SageDB (Kraska et al., CIDR 2019) is the maximalist manifesto: a database where every component — index, sort, join, scheduler, optimizer, storage layout — is synthesized from a learned model of the data and workload, the “instance-optimal” system. As a research agenda it was generative; as a system it never arrived in that form. What shipped is instructive precisely because of which pieces made it. Amazon Redshift — staffed by several SageDB authors — ships learned components today, but all in advisory, off-critical-path roles: ML-driven automatic table optimization picks sort keys and distribution styles by watching the workload; automatic workload management uses a model to predict query memory and runtime to schedule admission; short-query acceleration is a learned classifier deciding which queue a query enters. Notice the pattern — every one of these is a policy decision feeding a classical mechanism, made asynchronously, where a wrong answer degrades performance but never correctness. The learned sort and the learned join from the manifesto, the parts that touch query results? Nowhere in production. The physics separation you should carry out of this course: ML is excellent at policies (what to do, given this workload) and dangerous at mechanisms (how to do it correctly under all inputs).

OtterTune: the cautionary product tale

OtterTune began as Carnegie Mellon research (Van Aken, Pavlo et al., SIGMOD 2017): use Gaussian-process Bayesian optimization over telemetry to tune the hundreds of configuration knobs in PostgreSQL/MySQL — shared_buffers, work_mem, checkpoint cadence — a task human DBAs do badly and rarely. The research was sound and the company (founded 2020) raised real money, yet it shut down in mid-2024, with Pavlo delivering the unsparing postmortem himself. The autopsy matters for this course. First, knob tuning is bounded-upside: a great config buys you maybe 20–50% on a workload, once — then the value is captured and the subscription looks expensive. Second, the cloud vendors absorbed the feature: if Aurora and Azure auto-tune for free, a third-party tuner is selling against a bundled default. Third, and most relevant to us: customers didn’t actually want an autonomous closed loop touching production — they wanted a recommendation with an explanation they could review. OtterTune died selling automation to people who wanted advice. Hold that thought for thirty seconds.

The inversion: the DBA model moved outside the engine

Now place this week inside the course thesis. A decade of “ML for systems” fought to push models into the engine — into the index, the estimator, the tuner — and the survivors were the ones that stayed advisory and bounded. The agentic era inverts the geometry entirely. The most capable learned component in your 2026 database deployment is not in the engine at all: it is an LLM connected as a client over MCP, holding EXPLAIN ANALYZE output, pg_stat_statements, and the schema in its context window, proposing indexes and rewrites in natural language. It is, structurally, exactly what OtterTune’s customers were asking for — a tireless advisor that explains itself — built from a general model instead of a bespoke one, and crucially it inherits Bao’s safety contract for free if you design the tool surface correctly: the agent advises, a human or a guarded executor decides. This inversion changes what engines must optimize for. The internal-ML decade asked engines for training hooks; the agentic decade asks them for observability and safe control surfaces — rich EXPLAIN output, queryable statistics, dry-run DDL, permissioned hint interfaces. The engine’s job is no longer to learn; it is to be legible to something that does. Workloads changed; the physics — bounded action spaces, classical floors under learned policies — did not.

Historical aside Self-tuning databases are not a 2018 idea. Microsoft’s AutoAdmin project shipped the Index Tuning Wizard in SQL Server 7.0 — in 1998 — and IBM ran SMART/LEO (a learning optimizer that corrected its own cardinality estimates from runtime feedback) at the turn of the millennium. LEO was Bao’s feedback loop, twenty years early, and it was quietly turned off in DB2 because customers feared plan instability more than they valued plan improvement. Every generation rediscovers that the binding constraint is operator trust, not model accuracy.

Readings

Read Before Thursday

The Case for Learned Index Structures — Kraska, Beutel, Chi, Dean, Polyzotis, SIGMOD 2018.Read §1–3 closely for the CDF framing and the RMI; treat the eval skeptically and bring one benchmark objection (the SOSD authors found several).

Bao: Making Learned Query Optimization Practical — Marcus, Negi, Mao, Tatbul, Alizadeh, Kraska, SIGMOD 2021.Focus on §2’s design constraints and the Thompson-sampling loop — note how every choice traces back to a specific deployment failure of Neo.

SageDB: A Learned Database System — Kraska, Alizadeh, Beutel, Chi, Ding, Kristo, Leclerc, Madden, Mao, Nathan, CIDR 2019.Read as a manifesto, not a system paper; as you read, mark each proposed component as policy or mechanism, and check your marks against what Redshift shipped.

Exercises

This Week’s Problems

Exercise 5.1 · warm-up

You build a two-stage RMI over N = 200M sorted 64-bit keys: one stage-1 linear model and 100,000 stage-2 linear models, each storing its own (err_min, err_max). (a) Compute the index’s total memory footprint, assuming each linear model is two doubles plus two 32-bit error bounds, and compare against a B+tree with 16-byte entries and fanout 256 over the same keys (count inner nodes only). (b) If the worst leaf error is 512 positions, how many key comparisons does the residual binary search cost, and how many cache lines (64B, 8B keys) can it touch? (c) State the dataset property that would blow up the error bound of a linear stage-2 model, and name which fix — ALEX or PGM — addresses it without changing the model class.

Exercise 5.2 · core

Reproduce a miniature Bao. On PostgreSQL with the Join Order Benchmark (or TPC-H SF1 if your laptop protests), define four hint sets: {} (default), {enable_nestloop=off}, {enable_hashjoin=off}, {enable_nestloop=off, enable_mergejoin=off}. For 20 queries, run EXPLAIN ANALYZE under each arm and record true latencies. (a) Report the regret of always choosing the default arm versus the per-query oracle — how much was on the table? (b) Featurize each plan (operator counts, total estimated cost, estimated rows) and fit any regressor you like to predict latency; report how often your model picks the oracle arm on held-out queries. (c) Identify one query where the default optimizer’s choice was wrong by >2× and explain, from the plan, which cardinality estimate misled it.

Exercise 5.3 · stretch

Design the Bao contract for an LLM-DBA. Specify an MCP tool surface for an agent that tunes a production PostgreSQL instance, such that the agent’s worst case is provably bounded the way Bao’s was. Your design must answer: (a) which capabilities are read-only, which are dry-run (e.g., hypothetical indexes via hypopg), and which mutate — and what guards each mutation (cost ceilings, canary windows, automatic rollback triggers expressed over which metrics)? (b) What is the formal analogue of Bao’s “every arm is a classical plan” floor when the action space is free-form SQL DDL — can you define one, or must you restrict the action space to a menu? Argue precisely. (c) Bao explored via Thompson sampling against a latency posterior; what is the legitimate exploration mechanism for an agent on a production system where exploration has customer-visible cost — shadow replicas, traffic mirroring, off-peak budgets? Propose one, estimate its dollar cost on a cloud instance, and identify the failure mode it cannot catch. There is no canonical answer; we will tear three of these apart in Friday’s lab.

❦