DATA 2027 · Week 05 · Part II — New Access Methods & Engines

Learned Components

ML inside the engine mostly lost; ML advising the engine quietly won — and the LLM agent outside the engine is the strangest winner of all.

Lecture 1 — The case for (and against) learned indexes · Lecture 2 — Bao: ML steering the optimizer

Lecture 1 · Tuesday

The Case For (and Against) Learned Indexes

What if an index is just a model — and the model class is up for debate?

L1 · The provocation

Kraska’s thought experiment (SIGMOD 2018)

Table holds every integer 1…100M, sorted, densely packed.
Position of key k is k − 1: one subtraction, O(1).
The B-tree spends 25 bytes per key instead.
Plus four cache-missing pointer hops per lookup.
It is expensively rediscovering a linear function.

L1 · The framing

Every index is a model of the CDF

Every sorted index answers one question: key → position.
That is the empirical CDF: pos = F(key) · N.
An index is a model of the data’s distribution.
The only debate: which model class to use.
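
A minimal sketch of the framing above, assuming a 0-based position convention (F counts keys strictly below the lookup key). The function names are illustrative and not taken from the paper. It shows the dense thought-experiment model agreeing with the empirical CDF, and the same model failing on skewed keys.

```python
import bisect

def cdf_position(sorted_keys, key):
    """0-based position of `key`: the number of stored keys strictly below it.

    This equals F(key) * N when F counts keys < key (a convention chosen
    here so that positions start at 0).
    """
    return bisect.bisect_left(sorted_keys, key)

def dense_position(key):
    """Closed-form model for the dense table 1..N: position = key - 1."""
    return key - 1

dense_keys = list(range(1, 1001))  # small stand-in for 1..100M
# On dense data the one-subtraction model reproduces the CDF exactly.
assert all(cdf_position(dense_keys, k) == dense_position(k) for k in dense_keys)

# On skewed data the same model is far off, so the model class matters.
skewed_keys = [1, 2, 3, 1000, 5000, 90000]
print(cdf_position(skewed_keys, 5000), dense_position(5000))  # prints: 4 4999
```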

L1 · The framing

B-trees were models all along

Root-to-leaf traversal = a piecewise-constant regression.
The leaf says: scan this 4KB page.
That scan is the error bound, fixed at page size.
Error ≤ page_size after log_f(N) node visits (f = fanout).
Hard worst case, zero training — running the world since 1971.
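
The piecewise-constant view can be sketched with a single flat level of separators standing in for the tree (an assumption made for brevity; a real B-tree stacks several such levels). `build_page_index` and `page_lookup` are hypothetical names. The model predicts the start of a page, and the in-page search is the fixed error bound.

```python
import bisect

def build_page_index(keys, page_size):
    """Keep the first key of every page: a step function over the key space."""
    return keys[::page_size]

def page_lookup(keys, separators, page_size, key):
    """Predict the page start, then search inside that page only."""
    page = bisect.bisect_right(separators, key) - 1
    if page < 0:
        return None  # key is smaller than every stored key
    start = page * page_size                   # the model's prediction
    end = min(start + page_size, len(keys))    # error bound: one page
    pos = bisect.bisect_left(keys, key, start, end)
    if pos < end and keys[pos] == key:
        return pos
    return None

keys = list(range(0, 3000, 3))
separators = build_page_index(keys, page_size=64)
print(page_lookup(keys, separators, 64, 1500))  # prints: 500
```

No training is involved: the "model" is built by slicing, and its error never exceeds the page size regardless of the key distribution.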

L1 · The hardware bet

Trade pointer chases for FLOPs

Learned: two linear layers ≈ 50–100ns total.
Pure arithmetic, in registers, predictable.

Tree: 3–4 pointer chases through cold DRAM.
≈ 100ns of latency each.
On 2018 hardware, FLOPs were nearly free.

L1 · RMI

The recursive model index

One model over 200M keys is hopeless.
Stage 1: one model picks which stage-2 model to ask.
Stage 2: ~100,000 cheap linear models on thin slices.
Not a tree: no pointers, no traversal decisions.
Just two array indexings, two fused multiply-adds.

L1 · RMI

The error bound makes it correct

At build time, record each leaf’s worst miss: err_min, err_max.
Lookup: predict p, binary-search [p − err_min, p + err_max].
Worst error 128 → 7 comparisons, one or two cache lines.
Reported: 1.5–3× faster lookups than a tuned B-tree.
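
The two RMI slides above can be sketched as follows. This is a simplified illustration, not the paper's implementation: stage 1 is a single linear map from key range to leaf id, each leaf is fitted by interpolating its endpoints rather than by regression, and keys are assumed to be a non-empty sorted list of distinct numbers. The class and method names are hypothetical.

```python
import bisect

class TwoStageRMI:
    def __init__(self, keys, num_leaves=64):
        self.keys = keys
        self.num_leaves = num_leaves
        self.lo = keys[0]
        span = keys[-1] - keys[0]
        self.scale = num_leaves / span if span > 0 else 0.0
        # Route every key to a leaf with the stage-1 model.
        buckets = [[] for _ in range(num_leaves)]
        for pos, key in enumerate(keys):
            buckets[self._leaf(key)].append(pos)
        # Fit one linear model per leaf and record its worst misses.
        self.leaves = []
        for positions in buckets:
            if not positions:
                self.leaves.append(None)
                continue
            first, last = positions[0], positions[-1]
            k0, k1 = keys[first], keys[last]
            slope = (last - first) / (k1 - k0) if k1 > k0 else 0.0
            intercept = first - slope * k0
            err_min = err_max = 0
            for pos in positions:
                pred = int(slope * keys[pos] + intercept)
                err_min = max(err_min, pred - pos)  # worst overshoot
                err_max = max(err_max, pos - pred)  # worst undershoot
            self.leaves.append((slope, intercept, err_min, err_max))

    def _leaf(self, key):
        idx = int((key - self.lo) * self.scale)
        return min(max(idx, 0), self.num_leaves - 1)

    def lookup(self, key):
        leaf = self.leaves[self._leaf(key)]
        if leaf is None:
            return None
        slope, intercept, err_min, err_max = leaf
        pred = int(slope * key + intercept)
        n = len(self.keys)
        lo = min(max(pred - err_min, 0), n)
        hi = max(min(pred + err_max + 1, n), lo)
        # Binary search only inside the recorded error window.
        pos = bisect.bisect_left(self.keys, key, lo, hi)
        if pos < hi and self.keys[pos] == key:
            return pos
        return None

sample = sorted({(i * 2654435761) % 1_000_003 for i in range(5000)})
rmi = TwoStageRMI(sample, num_leaves=64)
assert rmi.lookup(sample[1234]) == 1234
```

Correctness does not depend on model quality: a bad leaf only widens its recorded window, so stored keys are always found.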

L1 · RMI

An index as a CDF model

Fig. 5.1 — B-tree: step function, error fixed at one page. RMI: piecewise-linear leaves plus a recorded worst-case error band.

L1 · RMI

The memory result matters more

10–100×

less memory than the B-tree’s index structure — an RMI over 200M doubles fits in a few megabytes, i.e., in L2.

L1 · Updates

Inserts break it — ALEX’s answer

Insert one key: every position to its right shifts.
ALEX (MSR, SIGMOD 2020): gapped arrays, leaves ~30% empty.
Model-based insertion keeps the model accurate by construction.
Degraded leaf? Split or retrain just that leaf.
Up to 4× B+tree throughput on read-heavy mixes.

L1 · Updates

PGM-index: a theorem, not a benchmark

Ferragina & Vinciguerra, VLDB 2020 — no learning at all.
Provably minimal piecewise-linear approximation, error ≤ ε.
Built by an O(N) streaming convex-hull pass, then recurse.
B-tree-style guarantees: O(log N) lookups, fully dynamic.
Typically 10–100× smaller than the equivalent tree.
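
A minimal sketch of an ε-bounded piecewise-linear approximation. Note the swap: this uses the simpler greedy "shrinking cone" construction, which guarantees error ≤ ε but not the minimal number of segments that the PGM's convex-hull pass provides. It also finds the segment with a flat binary search instead of recursing. Keys are assumed to be a non-empty sorted list of distinct numbers, and all function names are hypothetical.

```python
import bisect
import math

def _pick_slope(s_lo, s_hi):
    # A single-point segment leaves the cone unbounded; any slope works.
    return s_lo if math.isinf(s_hi) else (s_lo + s_hi) / 2.0

def build_segments(keys, eps):
    """One streaming pass; returns a list of (start_key, slope, start_pos)."""
    segments = []
    k0, p0 = keys[0], 0
    s_lo, s_hi = 0.0, math.inf
    for pos in range(1, len(keys)):
        dx = keys[pos] - k0  # > 0 because keys are distinct and sorted
        new_lo = max(s_lo, (pos - eps - p0) / dx)
        new_hi = min(s_hi, (pos + eps - p0) / dx)
        if new_lo > new_hi:
            # No single slope fits this point within eps: close the segment.
            segments.append((k0, _pick_slope(s_lo, s_hi), p0))
            k0, p0 = keys[pos], pos
            s_lo, s_hi = 0.0, math.inf
        else:
            s_lo, s_hi = new_lo, new_hi
    segments.append((k0, _pick_slope(s_lo, s_hi), p0))
    return segments

def predict(segments, starts, key):
    i = bisect.bisect_right(starts, key) - 1
    if i < 0:
        return None
    k0, slope, p0 = segments[i]
    return p0 + slope * (key - k0)

def lookup(keys, segments, starts, eps, key):
    pred = predict(segments, starts, key)
    if pred is None:
        return None
    lo = max(0, math.floor(pred - eps))
    hi = max(min(len(keys), math.ceil(pred + eps) + 1), lo)
    pos = bisect.bisect_left(keys, key, lo, hi)
    if pos < hi and keys[pos] == key:
        return pos
    return None

sample = sorted({(i * 2654435761) % 1_000_003 for i in range(5000)})
segs = build_segments(sample, eps=16)
starts = [s[0] for s in segs]
assert lookup(sample, segs, starts, 16, sample[777]) == 777
```

The error bound is a property of the construction rather than a measurement taken after training, which is the sense in which the guarantee is a theorem.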

L1 · SOSD

SOSD: the honest scoreboard

Wins inside the box: read-mostly, in-memory, sorted, numeric.
Often 1.5–2× faster lookups at a fraction of the space.

Strings: embedding needed; comparison cost dominates.
Heavy updates: gap maintenance, merge amortization.
Disk: one I/O is 100µs; 200ns saved is rounding error.

L1 · Verdict

What actually shipped

Not “replaces B-trees” — a better node layout for a bounded regime.
Google: learned indexes in Bigtable’s SSTable block indexes.
Exactly the regime SOSD identified: read-only, sorted, in-memory.
The lasting gift was the framing, not the neural nets.
It let Ferragina build the PGM with zero machine learning.

Lecture 2 · Thursday

Bao: ML Steering the Optimizer

Replace the optimizer, or advise it? The answer decided what shipped.

L2 · The opening

The estimates are fiction

10⁴–10⁸×

PostgreSQL’s cardinality errors on Join Order Benchmark multi-joins (Leis et al., VLDB 2015) — independence and uniformity assumptions collapse under correlated predicates.
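
A toy illustration of why the independence assumption collapses, using a synthetic table in which column b always equals column a (an extreme, made-up correlation, not data from the benchmark). The function names are hypothetical.

```python
N = 10_000
# Two perfectly correlated columns: b is always equal to a.
rows = [(i % 100, i % 100) for i in range(N)]

def selectivity(table, col, value):
    """Fraction of rows where column `col` equals `value`."""
    return sum(1 for r in table if r[col] == value) / len(table)

def independence_estimate(table, a_val, b_val):
    """Classical estimate: multiply per-column selectivities."""
    return selectivity(table, 0, a_val) * selectivity(table, 1, b_val) * len(table)

def true_count(table, a_val, b_val):
    return sum(1 for a, b in table if a == a_val and b == b_val)

estimated = independence_estimate(rows, 7, 7)  # about 1 row
actual = true_count(rows, 7, 7)                # 100 rows
print(round(actual / estimated))               # prints: 100
```

A single correlated predicate pair already yields a 100× underestimate here; the benchmark's larger errors arise when such misestimates tend to compound across many joins.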

L2 · Learned cardinalities

Attack one: learn better estimates

MSCN: supervised set-convolutions over query features.
DeepDB: sum-product networks over the data itself.
NeuroCard: join-aware autoregressive models.
Median q-error: hundreds → single digits, in the lab.
And yet: none ship in a major engine’s hot path.
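
For reference, q-error can be sketched as below. Clamping both values to at least one row is an assumption made here to avoid division by zero, and the sample pairs are invented for illustration. The example shows how a good median can coexist with a bad tail.

```python
import statistics

def q_error(estimated, actual):
    """Symmetric ratio error: 1.0 is perfect, larger is worse."""
    est = max(float(estimated), 1.0)  # assumption: clamp to one row
    act = max(float(actual), 1.0)
    return max(est / act, act / est)

# (estimated rows, actual rows) for five hypothetical subqueries
pairs = [(1, 100), (50, 40), (1000, 10), (7, 7), (30, 10)]
errors = [q_error(e, a) for e, a in pairs]
print(statistics.median(errors), max(errors))  # prints: 3.0 100.0
```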

L2 · Learned cardinalities

The failure triangle

Tail risk: one 10⁶× hallucination = one catastrophic plan a day.
Drift: the model was trained on yesterday’s data; ANALYZE rebuilds a histogram cheaply.
Inference placement: no 5ms model call to plan a 2ms query.
Bao is engineered point-by-point against this triangle.

L2 · Neo

Neo: replace the optimizer

End-to-end learned optimizer (VLDB 2019): tree-conv value network, best-first search.
Bootstrapped from PostgreSQL’s plans, then improved past them.
~One day of training to match commercial optimizers.
But: cold start, drift brittleness, unbounded action space.
And zero explainability when a plan goes wrong.

L2 · Bao

Bao: steer, don’t replace

SIGMOD 2021, Best Paper — a deliberate retreat.
Doesn’t generate plans: picks among 48 hint sets.
Hint sets = planner switches: enable_hashjoin, enable_nestloop, …
Tree-convolutional value model predicts each candidate plan’s latency.
Thompson-sampling bandit: a principled price for continued learning.
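
A heavily simplified sketch of the bandit idea. Bao conditions a tree-convolutional model on each candidate plan; this sketch drops all query context and keeps only a per-arm Gaussian posterior over latency, so it illustrates Thompson sampling rather than Bao itself. The four hint sets are illustrative (the switch names are real PostgreSQL planner settings), and the latencies are simulated.

```python
import math
import random

HINT_SETS = [
    {},                                   # arm 0: empty hint set, stock optimizer
    {"enable_nestloop": False},
    {"enable_hashjoin": False},
    {"enable_mergejoin": False, "enable_indexscan": False},
]

class GaussianThompson:
    """Thompson sampling with a Normal prior and known noise variance."""

    def __init__(self, n_arms, prior_mean=100.0, prior_sd=100.0,
                 noise_sd=5.0, seed=0):
        self.rng = random.Random(seed)
        self.prior_mean = prior_mean
        self.prior_sd = prior_sd
        self.noise_sd = noise_sd
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms

    def _posterior(self, arm):
        prior_prec = 1.0 / self.prior_sd ** 2
        obs_prec = self.counts[arm] / self.noise_sd ** 2
        prec = prior_prec + obs_prec
        mean = (self.prior_mean * prior_prec
                + self.sums[arm] / self.noise_sd ** 2) / prec
        return mean, math.sqrt(1.0 / prec)

    def choose(self):
        # Draw one plausible latency per arm; run the arm that looks fastest.
        draws = [self.rng.gauss(*self._posterior(a))
                 for a in range(len(self.counts))]
        return min(range(len(draws)), key=draws.__getitem__)

    def update(self, arm, latency_ms):
        self.counts[arm] += 1
        self.sums[arm] += latency_ms

def simulate(true_means_ms, rounds=300, seed=0):
    """Return how often each arm was chosen against simulated latencies."""
    bandit = GaussianThompson(len(true_means_ms), seed=seed)
    env = random.Random(seed + 1)
    for _ in range(rounds):
        arm = bandit.choose()
        bandit.update(arm, env.gauss(true_means_ms[arm], 5.0))
    return bandit.counts

counts = simulate([100.0, 60.0, 140.0, 90.0])
print(counts)  # most pulls should go to arm 1, the fastest simulated hint set
```

Exploration falls out of posterior uncertainty: an arm with few observations has a wide posterior and is occasionally sampled as fastest, which is the price paid for continued learning.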

L2 · Bao

The Bao loop

Fig. 5.2 — Bao never generates plans: it picks 1 of 48 hint sets, and the classical optimizer does the rest.

L2 · Bao

The triangle, amputated

Tail risk: worst case is “PostgreSQL on an off day,” not chaos.
Cold start: the empty hint set is PostgreSQL — day one = status quo.
Drift: a 48-way choice retrains in minutes on one GPU.
Cuts tail (p99) latencies substantially; reduces cloud cost.
Shipped in spirit: Microsoft’s steered optimizer (SCOPE), Meta’s AutoSteer (PrestoDB).

L2 · The pattern

Where the ML sits decides the fate

System	Where the ML sits	Worst case	Fate
Neo (VLDB ’19)	Replaces the optimizer	Arbitrarily bad plan	Influential; never deployed as-is
Bao (SIGMOD ’21)	Advises: 1 of 48 hint sets	A valid classical plan	Pattern adopted (MSFT steered optimizer, Meta AutoSteer & kin)
OtterTune (’17–’24)	Outside, tuning knobs	Bad config (recoverable)	Good tech, dead company (2024)
LLM-DBA via MCP	Outside, as a client	Whatever you let it execute	The current frontier

L2 · SageDB

SageDB: the manifesto vs. the shipping list

CIDR 2019: every component learned — the “instance-optimal” system.
Never arrived as a system; the shipped pieces are the lesson.
Redshift: learned table optimization, workload management, short-query classifier.
All advisory, off the critical path — wrong answers never break correctness.
Learned sort and learned join? Nowhere in production.

L2 · SageDB

The physics separation

ML is excellent at policies: what to do, given this workload.
ML is dangerous at mechanisms: doing it correctly under all inputs.
Every survivor is a policy decision feeding a classical mechanism.
Made asynchronously, where errors degrade performance, not correctness.

L2 · OtterTune

OtterTune: the product-shape autopsy

CMU research (SIGMOD 2017): Bayesian optimization over config knobs.
Sound research, real funding — shut down mid-2024.
Bounded upside: a great config buys a 20–50% improvement, once.
Absorbed: Aurora and Azure auto-tune for free.
Customers wanted a recommendation with an explanation, not autonomy.

L2 · History

Self-tuning is not a 2018 idea

1998

SQL Server 7.0 ships the Index Tuning Wizard. IBM’s LEO (VLDB 2001) was Bao’s feedback loop twenty years early; it was turned off because customers feared plan instability. The binding constraint is operator trust, not model accuracy.

L2 · The inversion

The learned component moved outside

The strongest learned component in 2026 is an LLM client, over MCP.
Holds EXPLAIN ANALYZE, pg_stat_statements, schema in context.
Structurally what OtterTune’s customers asked for: an advisor that explains.
Inherits Bao’s safety contract — if the tool surface is designed right.
Engines must now be legible: rich EXPLAIN, dry-run DDL, guarded hints.
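
One way to picture a "guarded hints" tool surface is an allowlist that turns any proposal into planner-switch statements or rejects it. This is a hypothetical sketch, not an MCP API or an existing tool: `render_hint_set` and `ALLOWED_SWITCHES` are invented names, while the switch names and `SET LOCAL` are standard PostgreSQL.

```python
ALLOWED_SWITCHES = frozenset({
    "enable_hashjoin",
    "enable_mergejoin",
    "enable_nestloop",
    "enable_seqscan",
    "enable_indexscan",
    "enable_indexonlyscan",
})

def render_hint_set(proposal):
    """Validate a proposed hint set and render transaction-scoped SET statements.

    Anything outside the allowlist is rejected, so the worst outcome of an
    accepted proposal is a plan the classical optimizer would produce anyway.
    """
    if not isinstance(proposal, dict):
        raise TypeError("proposal must be a dict of switch name -> bool")
    statements = []
    for name in sorted(proposal):
        value = proposal[name]
        if name not in ALLOWED_SWITCHES:
            raise ValueError(f"switch not allowed: {name!r}")
        if not isinstance(value, bool):
            raise ValueError(f"value for {name!r} must be a bool")
        statements.append(f"SET LOCAL {name} = {'on' if value else 'off'};")
    return statements

print(render_hint_set({"enable_nestloop": False}))
# prints: ['SET LOCAL enable_nestloop = off;']
```

The design choice mirrors the hint-set floor: the client may only choose among settings the engine already knows how to honor, and the setting expires with the transaction.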

The systems that shipped all share one clause in their contract: the model advises, the engine decides.

— Week 5 lecture notes, DATA 2027

Checkpoint · Discussion

Before you leave

Two-stage RMI over 200M keys: footprint vs. a fanout-256 B+tree? Which dataset property blows up a linear leaf? (Ex 5.1)
Mini-Bao, four hint sets, 20 queries: how much regret does always-default leave on the table? (Ex 5.2)
An LLM-DBA over MCP: what is the formal analogue of Bao’s “every arm is a classical plan” floor? (Ex 5.3)

Readings · Week 05

Read before Thursday

The Case for Learned Index Structures — Kraska et al., SIGMOD 2018. §1–3 closely; bring one benchmark objection.
Bao: Making Learned Query Optimization Practical — Marcus et al., SIGMOD 2021. §2 constraints + the Thompson-sampling loop.
SageDB: A Learned Database System — Kraska et al., CIDR 2019. Mark each component policy or mechanism; check against Redshift.