Week 2: B-Trees, LSM-Trees & the RUM Triangle

Learning objectives — after this week you can…

Derive B-tree height from page size and entry width, and explain why ~250-way fanout puts a billion keys four levels deep instead of thirty.
Read a slotted-page layout and trace an insert through page splits, occupancy drift, and sibling pointers.
Explain why a steal/no-force buffer pool forces write-ahead logging, and sketch ARIES recovery in three sentences.
Define read, write, and space amplification with formulas, and compute all three for leveled and size-tiered LSM compaction.
Place a workload — including an agent-memory workload — on the RUM triangle and defend the placement quantitatively.

Lecture 1 · Tuesday

B-trees: the disk made me do it

Take the data structure every undergraduate reaches for — a balanced binary search tree — and put it on a disk. It dies instantly. Not because comparisons got slower; comparisons are free. It dies because storage devices do not sell you bytes, they sell you pages, and a binary tree wastes almost the entire page on every single hop. The B-tree (Bayer & McCreight, 1970) is not a clever data structure. It is the least clever data structure that takes the page seriously, which is exactly why it has run the world’s OLTP for fifty-five years. Today we rebuild it from the constraint up; Thursday we meet the structure that took the opposite constraint just as seriously.

The fanout argument, in full

A balanced binary tree over 10⁹ keys has height log₂ 10⁹ ≈ 30. On disk, every pointer chase is a potential device read. On a 2010 spinning disk at ~10 ms a seek, that’s 300 ms per point lookup; on a modern NVMe drive at ~80 µs, it’s still ~2.4 ms — and the deeper insult is that each of those 30 reads fetched a full 4 KB page to use a 16-byte node. You consumed 0.4% of every page you paid for.

So fill the page. An interior node’s job is pure navigation: separator key plus child pointer. With an 8-byte key and an 8-byte child pointer, a 4 KB page holds 4096 / 16 = 256 entries — call it ~250-way fanout after the header. Now the height is log₂₅₀ 10⁹ = ln 10⁹ / ln 250 ≈ 20.7 / 5.5 ≈ 3.8, i.e. four levels, versus thirty. And the first three levels total 1 + 250 + 62,500 ≈ 63 k pages ≈ 250 MB — small enough to pin in memory permanently, so a point lookup costs one device read: the leaf. That is the entire theory of the B-tree in one sentence: since the device forces you to read a page, make every byte of that page advance the search.

Notice what the math is sensitive to: page size barely matters (heights are logarithms, and logarithms are stubborn), but key width matters a lot. Fatten the key from 8 bytes to a 36-character string and you’ve cut fanout by 3× — which is why serious engines do prefix truncation in interior nodes: a separator needn’t be a real key, only a string that sorts between two leaves.

Anatomy of a page: the slotted layout

Inside a page, real records are variable-length, so you can’t just array-index them. The universal answer is the slotted page: a small header, then a slot array growing forward from the front, then the records (“cells”) growing backward from the end, with free space in the middle. Each slot is a 2–4 byte offset into the page. The slots are kept in key order even though the cells are in arrival order, so binary search runs over the slot array without ever moving a record. Deletion just marks a slot dead; the page compacts itself lazily when free space fragments.

The slot array is also an indirection layer with consequences beyond one page: a record’s external identity is (page #, slot #), so the engine may shuffle cells within the page without invalidating a single secondary-index entry pointing at it. Every row update that changes the row’s size and survives is this two-level indirection earning its keep.

Splits, merges, and the 69% equilibrium

Insert into a full leaf and the leaf splits: half the entries move to a new page, and the separator key is pushed into the parent — which may itself split, all the way up. The tree grows only at the root, which is the B-tree’s quiet masterpiece: balance is a side effect of the growth rule, with no rotations and no rebalancing pass, ever.

Splitting at the midpoint leaves both pages half full, and under uniformly random inserts the steady-state page occupancy converges to ln 2 ≈ 69% — a number worth memorizing, because it is a permanent ~1.44× space tax the B-tree pays for being update-friendly. Engines cheat where they can: for monotonically increasing keys, splitting at the insertion point instead of the midpoint yields ~100%-full left pages. Deletion is the embarrassing part: textbooks merge pages below half full; production engines mostly don’t, because merging under concurrency is misery. They let pages drift sparse and clean up offline (VACUUM, OPTIMIZE TABLE) — which is why a table that grew to 100 GB and shrank to 10 GB of live data is still 100 GB on disk until you say otherwise.

Two refinements complete the modern picture. The B+-tree puts all records in the leaves, keeps interior nodes navigation-only, and chains the leaves with sibling pointers — a range scan descends once and walks right, never re-entering the tree. And the Lehman–Yao B-link variant gives each node a high key and a right-link, so a reader racing an in-flight split detects “my key is past this node’s high key” and follows the link — concurrency without lock-coupling. PostgreSQL’s nbtree is a B-link tree; this is not an academic flourish.

Field note A team I knew shipped a write-heavy service with random UUIDv4 primary keys on InnoDB. Every insert landed on a uniformly random leaf, dirtying a cold page: the buffer pool thrashed, checkpoints fell behind, throughput collapsed at a few thousand rows/s on hardware good for ten times that. Switching to time-sortable IDs (UUIDv7/ULID) confined inserts to the tree’s right edge — hot pages stayed hot, throughput tripled, nothing in the schema changed. The key distribution is part of your storage engine whether you admit it or not.

WAL, the buffer pool, and steal/no-force

B-tree writes are in-place writes, and in-place writes have a crash problem. Pages are modified in the buffer pool, not on disk, so the engine faces two policy questions: may the pool evict a dirty page of an uncommitted transaction (steal)? Must commit flush every page the transaction dirtied (force)? Every serious engine answers steal / no-force, decoupling eviction policy from commit latency. The price: after a crash, the disk may contain uncommitted changes and may be missing committed ones. The write-ahead log covers both with two rules: no dirty page reaches disk before the log records describing its changes are durable (enabling undo), and commit doesn’t return before the commit record is durable (enabling redo). Note what got bought: commit costs one sequential log append instead of a spray of random page writes.

ARIES, in one paragraph, since you will meet it everywhere: every log record gets a monotonically increasing LSN, and every page is stamped with the LSN of the last record applied to it, which makes redo idempotent — replay a record only if the page’s LSN is older. Recovery runs three passes: analysis reconstructs which transactions were live at the crash from a fuzzy checkpoint; redo repeats history exactly — including the losers — to restore the precise crash-moment state; undo then rolls the losers back, writing compensation log records (CLRs) so that a crash during recovery never undoes the same change twice. Fuzzy checkpoints mean the system never stalls to snapshot. Keep one contrast in your pocket for Thursday: ARIES is this elaborate because B-trees mutate pages in place. A structure that never overwrites anything gets to throw half this machinery away.

Lecture 2 · Thursday

LSM-trees and the amplification triangle

Now invert every assumption. Suppose your workload is not “read a customer row, update a balance” but “ingest a firehose, occasionally look things up.” The B-tree’s in-place discipline becomes pure liability: every insert is a random page write, every random write is a buffer-pool eviction fight. O’Neil, Cheng, Gawlick & O’Neil asked in 1996: what if we never update in place at all? Buffer writes in memory, dump them to disk as immutable sorted runs, and pay a background process to merge the runs. That is the log-structured merge-tree, and today it sits under RocksDB, Cassandra, HBase — and under most of the memory stores being sold to agent builders right now, which is why you must be able to audit its costs.

Outrunning the disk: memtables and SSTables

The write path: append the operation to a WAL (sequential, cheap), insert it into the memtable — an in-memory sorted structure, typically a skiplist. When the memtable hits a threshold (~64 MB in RocksDB), freeze it and write it out as an SSTable: an immutable file of sorted key–value pairs with a block index, min/max key metadata, and a Bloom filter. One sequential write, no page in the tree is ever touched, and the file is never modified again. Updates are simply newer versions; deletes are tombstones — a special “this key is dead” record that must survive until compaction has provably buried every older version below it.

The read path is where the bill arrives: check the memtable, then each SSTable newest to oldest, because the newest version wins. Untreated, a point lookup costs one block read per run the key might be in. Everything else in this lecture — compaction policy, Bloom filters, the RUM triangle — is about managing that bill.

Compaction: leveled vs. size-tiered

Size-tiered compaction (Cassandra’s historical default) waits until ~4 runs of similar size accumulate, then merges them into one run of the next tier. Each byte is rewritten only about once per tier — roughly log₄(N) times in its life — so write amplification is low. But runs within and across tiers overlap freely in key range, so a key may live in many runs (high read amplification) and a hot, repeatedly updated key can have stale versions in every tier; transient space amplification of 2× during a big merge is routine.

Leveled compaction (LevelDB, RocksDB default) imposes order: levels L1…Ln, each T = 10× larger than the last, and within a level the SSTables are non-overlapping — together they form one giant sorted run. Compaction picks a file from Lᵢ and merges it with the ~T overlapping files in Lᵢ₊₁. Consequence one: at most one version of any key per level, so reads probe at most one file per level and space amplification collapses to ≈ 1 + 1/T ≈ 1.11 (only the not-yet-compacted top is duplicated). Consequence two: every byte that descends into Lᵢ₊₁ drags ~T bytes of rewrite with it. You just bought reads and space with writes.

The amplification ledger

Define the three numbers precisely, because vendors won’t. Write amplification WA = bytes physically written to the device ÷ bytes logically written by the client. Read amplification RA = device reads performed per query ÷ reads the query logically needed (for a point lookup: how many block reads to find one key). Space amplification SA = bytes occupied on the device ÷ bytes of live, unique data.

Worked example — RocksDB, leveled, T = 10, four levels below L0, dataset 640 GB. Follow one user byte: written once to the WAL, once when the memtable flushes, then on each descent Lᵢ → Lᵢ₊₁ it joins a merge that rewrites ~T bytes of the level below per byte arriving. Total: WA ≈ 1 + 1 + 10×4 = 42 — the folklore “~10× per level, 40–50× overall” is just this sum. Before you gasp, audit the B-tree the same way: updating a 128-byte row dirties a 4 KB leaf, and with full-page writes in the WAL that’s 8 KB written for 128 bytes — WA = 64. The LSM’s sin is not that it amplifies more; often it amplifies less, and sequentially. The sin is that the cost is deferred, bursty, and easy to ignore until compaction debt eats your p99. Run the same dataset size-tiered with 4-run tiers and ~3 tiers: WA ≈ 1 + 1 + 3 = 5 — eight times less device wear — but a point read must consider up to ~12 overlapping runs, and SA can transiently hit 2×.

The numbers are not hypothetical; RocksDB prints the ledger on demand:

** Compaction Stats [default] **
Level   Files  Size(GB)  Read(GB)  Write(GB)  W-Amp
  L0     4/0      0.25       0.0       62.1    1.0
  L1    10/1      0.62     601.7      598.9    9.6
  L2    98/3      6.21     580.4      577.8    9.3
  L3   940/8     62.05     551.2      549.0    8.9
 Sum                      1733.3     1787.8   28.8

Friday’s lab is an autopsy of exactly this table from a live ingest run: you will predict the W-Amp column from T and the level count before you run it, then explain every deviation.

Bloom filters: buying back the reads

The read bill gets paid down with a probabilistic IOU. Each SSTable carries a Bloom filter: m bits, k hash functions; inserting a key sets k bits, querying checks them. False positives only — the filter can say “maybe here” wrongly, never “not here” wrongly — with rate FPR ≈ (1 − e^(−kn/m))^k, minimized at k = (m/n)·ln 2. The engineering point: at the standard 10 bits per key, k = 7 and FPR ≈ 0.8%. A point lookup now reads, in expectation, ≈ 1 + 0.008 × (runs − 1) blocks — size-tiered’s twelve-run nightmare collapses to ~1.1 reads. Ten bits per key is the best deal in systems.

But mark the limitation, because it returns with force in week 6: a Bloom filter answers “is key X in this file?” It cannot answer “does anything in [a, b) live here?” — so range scans get no help and must consult every overlapping run. And a semantic query — “memories similar to this embedding” — isn’t even a range. There is no Bloom filter for meaning. Hold that thought.

The RUM conjecture — and the corner agents sit on

Athanassoulis et al. (EDBT 2016) named the pattern you’ve now seen three times. Every access method has a Read overhead, an Update overhead, and a Memory (space) overhead, and the conjecture states: optimizing any two toward their minimum forces the third away from its minimum. The B-tree takes reads (giving up update locality and 31% of its space). Leveled LSM takes space and decent reads (paying 40× writes). Size-tiered takes updates (paying reads and space). A hash index takes point reads and space (surrendering ranges entirely). It is a conjecture, not a theorem — the frontier has never been formally bounded — but in thirty years of access-method design, nobody has bought all three corners.

Fig. 2.1 — The RUM triangle (after Athanassoulis et al., EDBT 2016). Each engine buys two corners and bills the third. The agent-memory workload is pulled toward the read and write corners simultaneously while its space requirement grows monotonically — and week 6 will add a fourth, off-plane axis: recall.

Now place this course’s protagonist on the triangle. An agent fleet’s memory system ingests at machine speed: every step appends observations, tool outputs, reflections, embeddings — millions of small writes per hour, essentially never updated in place. That is the write corner; LSM territory; the easy half. But before nearly every action, the agent runs recall: semantic search over its entire accumulated history — read-heavy, scan-shaped, top-k with no exact key, so Bloom filters are spectators. And agents, unlike users, almost never delete: memory is the product, so live data grows monotonically. The agent workload sits on an awkward corner — write-heavy at ingest, read-heavy at recall, space-hungry forever — and the conjecture says it cannot have all three cheaply. Most “AI memory” products are an LSM with a vector index bolted on: they have silently chosen U and are paying for R at query time. Week 6 makes this precise by adding the axis RUM omits — recall: when reads are approximate, you trade R, U, and M against answer quality, and the triangle becomes a tetrahedron.

You don’t choose whether to amplify. You choose which amplification to confess to — and the workload audits you.

Historical aside The LSM paper landed in 1996 to near-total silence and waited a decade: Bigtable (2006) rebuilt the idea at Google scale, LevelDB (2011) shrank it to a library, and Meta’s RocksDB fork turned it into infrastructure — MyRocks cut Facebook’s MySQL storage roughly in half versus InnoDB, a space-amplification argument, not a speed one. The LSM wasn’t wrong in 1996; the hardware economics were. It needed flash — cheap sequential bandwidth, costly random writes, finite endurance — to become inevitable. When you meet a “premature” design this semester, ask what hardware bill it is waiting for.

The week’s scorecard, in one table — memorize the shape, not the digits:

Dimension	B+-tree (in-place)	LSM — leveled	LSM — size-tiered
Write amp	page/record ratio: ~30–60× per small update, random I/O	~T per level; ≈ 40–50× at T=10, 4 levels — but sequential	~1× per tier; ≈ 5× total — sequential
Point-read amp	1 I/O (upper levels cached)	≤ 1 file/level; ≈ 1 I/O with Bloom filters (10 bits/key, FPR 0.8%)	up to runs-per-tier × tiers candidates; ≈ 1.1 I/O with filters
Range scans	excellent — descend once, walk sibling pointers	good — merge ~1 iterator per level; no Bloom help	poor — merge every overlapping run; no Bloom help
Space amp	~1.44× (ln 2 occupancy) + non-merging deletes	≈ 1 + 1/T ≈ 1.1×, tombstones transient	up to ~2× transient; stale versions linger across tiers
Concurrency	latching + lock coupling; B-link eases readers; hot-page contention on right-edge inserts	immutable SSTables: readers never block writers; cost moves to compaction scheduling and write stalls

Readings

Read Before Thursday

Database Internals, ch. 2–4 & 7 — Alex Petrov, O’Reilly 2019.Ch. 2–4 give you the B-tree at implementation depth (cell layouts, splits, B-link); ch. 7 is the cleanest LSM treatment in print. Read with a pencil; redo the fanout math for 16 KB pages.

Designing Access Methods: The RUM Conjecture — Athanassoulis, Kester, Maas, Stoica, Idreos, Ailamaki, Callaghan, EDBT 2016.Short and sharp. Focus on the overhead definitions in §2 and the design-space figure; come to class able to say which corner the last system you used had silently chosen.

The Log-Structured Merge-Tree (LSM-Tree) — O’Neil, Cheng, Gawlick & O’Neil, Acta Informatica 1996.Read §1–3 for the rolling-merge idea; skim the rest. The cost model is HDD-era — your job is to notice exactly which assumptions flash and NVMe broke, and which survived.

Exercises

This Week’s Problems

Exercise 2.1 · warm-up

InnoDB uses 16 KB pages. Assume interior entries cost 20 bytes (8-byte key, 6-byte page pointer, 6 bytes overhead). Compute the fanout, the tree height for 10¹⁰ rows, and the I/Os per point lookup if the top two levels are cached. How much RAM does caching all interior levels require? Now repeat with 16-byte UUIDv4 keys (28-byte entries) and write two sentences on why your DBA keeps muttering about surrogate keys.

Exercise 2.2 · core

An agent fleet writes 200 GB/day of new memories into a store that grows toward 6 TB. (a) Under leveled compaction with T = 10 and four levels below L0, estimate total device writes per day, including WAL and flush. (b) Your drive is a 7.68 TB SSD rated 3 DWPD over five years. Does the workload fit? With what headroom? (c) Redo the estimate for size-tiered compaction with 4-run tiers and 3 tiers. (d) Each SSTable has a 10-bits-per-key Bloom filter (FPR ≈ 0.8%). Compute the expected block reads per point lookup under both policies, assuming 5 candidate files (leveled) vs. 12 overlapping runs (size-tiered). (e) The recall path is a range-shaped scan over the newest 30 days of memories: which policy wins now, and why do the Bloom filters not appear in your answer?

Exercise 2.3 · stretch

Design a storage engine for agent memory with this workload: append-only writes (no in-place updates, no deletes); exact reads with strong recency skew (80% of point lookups hit the newest 5% of data); a weekly full-corpus scan to rebuild embeddings; and a monthly “consolidation” pass in which an LLM rewrites cold memories into summaries one-tenth the size. Specify your layout (you may hybridize: e.g., leveled hot levels over a time-partitioned, compressed, size-tiered cold store), derive WA/RA/SA bounds for each component as functions of your design parameters, and argue your engine’s position on the RUM triangle. Then take a position on the hard question: consolidation destroys information to save space — is it just semantic compaction, and if so, what is its amplification metric? Propose one, with units, and defend it. Optionally, prototype the hot path with RocksDB column families (leveled hot, FIFO cold) under db_bench, and report predicted vs. measured write amplification with an explanation of every deviation above 15%.

❦