DATA 2027 · Week 02 · Part I — Foundations Under New Workloads

B-Trees, LSM-Trees & the RUM Triangle

Two storage architectures, three amplifications, and one triangle nobody has escaped.

Lecture 1 — B-trees: the disk made me do it · Lecture 2 — LSM-trees and the amplification triangle

Lecture 1 · Tuesday

B-trees: the disk made me do it

The least clever data structure that takes the page seriously.

L1 · The problem

Put a binary tree on disk. It dies.

L1 · Fanout

So fill the page

L1 · Fanout

A billion keys, this deep

4

levels instead of thirty (log₂ of a billion). At fanout 250, levels 1–3 total 1 + 250 + 62,500 ≈ 63 k pages ≈ 250 MB at 4 KB per page, pinned in memory. A point lookup then costs one device read: the leaf.
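
The arithmetic on this slide can be checked in a few lines. This is a minimal sketch, assuming a uniform fanout of 250 and 4 KB pages; the function names are illustrative and not part of any engine.

```python
def btree_height(n_keys: int, fanout: int) -> int:
    """Smallest number of levels h such that fanout**h >= n_keys."""
    height, capacity = 0, 1
    while capacity < n_keys:
        capacity *= fanout
        height += 1
    return height


def interior_pages(fanout: int, height: int) -> int:
    """Pages in every level above the leaves: 1 + f + f^2 + ..."""
    return sum(fanout ** level for level in range(height - 1))


N_KEYS = 1_000_000_000
FANOUT = 250          # assumed separators per interior page
PAGE_BYTES = 4096     # assumed page size

height = btree_height(N_KEYS, FANOUT)      # 4
binary_height = btree_height(N_KEYS, 2)    # 30
pinned = interior_pages(FANOUT, height)    # 62751 pages
print(height, binary_height, pinned, round(pinned * PAGE_BYTES / 1e6), "MB")
```

The last figure comes out a little above 250 MB; the slide rounds.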

L1 · Fanout

What the math is sensitive to

  • Page size barely matters — heights are logarithms, and logarithms are stubborn.
  • Key width matters a lot — a 36-char key cuts fanout 3×.
  • Fix: prefix truncation — a separator only needs to sort between leaves.
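
A rough sketch of the three bullets above. Assumptions, loudly: an interior entry is just key bytes plus an 8-byte child pointer, page headers and slot overhead are ignored, and `shortest_separator` is a hypothetical helper showing one simple form of prefix truncation, not any particular engine's routine.

```python
def fanout(page_bytes: int, key_bytes: int, ptr_bytes: int = 8) -> int:
    """Entries per interior page, ignoring header and slot overhead."""
    return page_bytes // (key_bytes + ptr_bytes)


def height(n_keys: int, f: int) -> int:
    """Smallest number of levels h such that f**h >= n_keys."""
    h, capacity = 0, 1
    while capacity < n_keys:
        capacity *= f
        h += 1
    return h


def shortest_separator(left_max: str, right_min: str) -> str:
    """Shortest prefix s of right_min with left_max < s <= right_min."""
    assert left_max < right_min
    for length in range(1, len(right_min) + 1):
        candidate = right_min[:length]
        if candidate > left_max:
            return candidate
    return right_min


N = 1_000_000_000
print(fanout(4096, 8), height(N, fanout(4096, 8)))    # 256 4
print(fanout(8192, 8), height(N, fanout(8192, 8)))    # 512 4  (page doubled, same height)
print(fanout(4096, 36), height(N, fanout(4096, 36)))  # 93 5   (wide key, extra level)
print(shortest_separator("user:1000-alice", "user:1001-bob"))  # user:1001
```

Under these assumptions the 36-byte key drops fanout from 256 to 93, roughly the 3× on the slide, and the truncated separator is 9 bytes instead of 13.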
L1 · Page anatomy

The slotted page

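[Figure: slotted page layout. Left to right: header; slot array (s0, s1, s2), kept in key order and growing forward; free space; cells (C, B, A), stored in arrival order and growing backward from the page end. Binary search runs over the slots; records never move for a search.]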
Fig. — Slotted page: slots grow forward, cells grow backward, free space in the middle. Each slot is a 2–4 byte offset.
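
The figure can be mimicked with a toy class. This is a sketch under stated assumptions: a 16-byte header, 2-byte slots, and a dict standing in for the raw cell bytes. `SlottedPage` and its methods are hypothetical names, not any engine's page format.

```python
class SlottedPage:
    """Toy slotted page: slot array grows forward, cells grow backward."""

    def __init__(self, size=4096, header_bytes=16, slot_bytes=2):
        self.size = size
        self.header_bytes = header_bytes
        self.slot_bytes = slot_bytes
        self.slots = []          # cell offsets, kept in key order
        self.cells = {}          # offset -> (key, value); stands in for raw bytes
        self.cell_start = size   # lowest offset used by any cell so far

    def free_space(self):
        slots_end = self.header_bytes + len(self.slots) * self.slot_bytes
        return self.cell_start - slots_end

    def _lower_bound(self, key):
        # Binary search over the slot array; cells are never moved.
        lo, hi = 0, len(self.slots)
        while lo < hi:
            mid = (lo + hi) // 2
            if self.cells[self.slots[mid]][0] < key:
                lo = mid + 1
            else:
                hi = mid
        return lo

    def insert(self, key, value):
        need = len(key) + len(value)
        if need + self.slot_bytes > self.free_space():
            return False         # a real tree would split the page here
        self.cell_start -= need  # the new cell lands just below the previous one
        self.cells[self.cell_start] = (key, value)
        self.slots.insert(self._lower_bound(key), self.cell_start)
        return True

    def get(self, key):
        pos = self._lower_bound(key)
        if pos < len(self.slots) and self.cells[self.slots[pos]][0] == key:
            return self.cells[self.slots[pos]][1]
        return None


page = SlottedPage()
for k in (b"C", b"B", b"A"):     # arrival order, as in the figure
    page.insert(k, b"val-" + k)
print(page.slots)                # [4078, 4084, 4090]: key order A, B, C
print(page.get(b"B"))            # b'val-B'
```

Inserting in sorted position only shifts small slot entries; the cells themselves stay where they landed.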
L1 · Page anatomy

Indirection earns its keep

L1 · Splits

Growth only at the root

L1 · Splits

The equilibrium occupancy

69%

steady-state page occupancy under random inserts (ln 2) — a permanent ~1.44× space tax for being update-friendly. Monotonic keys + split-at-insertion-point cheat to ~100%-full pages.
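
A small simulation makes the 69% figure less mysterious. This is a sketch that models only the leaf level, with an assumed capacity of 64 keys per leaf and a fixed seed; `simulate_leaf_occupancy` is an illustrative name, and the numbers it prints are simulation output, not measurements from a real engine.

```python
import bisect
import random


def simulate_leaf_occupancy(keys, capacity=64, split_at_insertion_point=False):
    """Insert keys into a chain of sorted leaves; return average fill fraction."""
    leaves = [[]]              # each leaf is a sorted list of keys
    lows = [float("-inf")]     # lows[i] = smallest key routed to leaves[i]
    for key in keys:
        i = bisect.bisect_right(lows, key) - 1
        leaf = leaves[i]
        bisect.insort(leaf, key)
        if len(leaf) <= capacity:
            continue
        if split_at_insertion_point and leaf[-1] == key:
            cut = capacity     # new key was the largest: leave the old leaf full
        else:
            cut = len(leaf) // 2   # classic half split
        right = leaf[cut:]
        del leaf[cut:]
        leaves.insert(i + 1, right)
        lows.insert(i + 1, right[0])
    total = sum(len(leaf) for leaf in leaves)
    return total / (len(leaves) * capacity)


rng = random.Random(2027)
random_keys = [rng.random() for _ in range(100_000)]
print(simulate_leaf_occupancy(random_keys))                    # lands near ln 2
print(simulate_leaf_occupancy(range(100_000)))                 # monotonic, half splits
print(simulate_leaf_occupancy(range(100_000),
                              split_at_insertion_point=True))  # monotonic, close to 1.0
```

Monotonic keys with plain half splits are the worst of the three: every leaf left behind stays about half full, which is why the split-at-insertion-point trick exists.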

L1 · Splits

Deletion is the embarrassing part

L1 · Refinements

B+-tree and B-link

  • B+-tree: all records in leaves; interior nodes navigation-only.
  • Sibling pointers: range scan descends once, walks right.
  • B-link (Lehman–Yao): high key + right-link per node.
  • Readers racing a split just follow the link — no lock-coupling.
  • PostgreSQL’s nbtree is a B-link tree.
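
The "just follow the link" bullet, as a single-threaded sketch. Assumptions: one leaf level only, integer keys, `high_key = None` meaning "no upper bound", and hypothetical names throughout. It shows the move-right rule, not Lehman and Yao's full protocol or PostgreSQL's code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Leaf:
    keys: List[int]
    high_key: Optional[int] = None    # upper bound of this node's range; None = +infinity
    right: Optional["Leaf"] = None    # right-link to the next sibling


def split(leaf: Leaf) -> Leaf:
    """Move the upper half to a new right sibling and link it in."""
    mid = len(leaf.keys) // 2
    sibling = Leaf(keys=leaf.keys[mid:], high_key=leaf.high_key, right=leaf.right)
    leaf.keys = leaf.keys[:mid]
    leaf.high_key = leaf.keys[-1]
    leaf.right = sibling
    return sibling


def search(start: Leaf, key: int):
    """Return (found, hops). Move right while key is past the node's high key."""
    node, hops = start, 0
    while node.high_key is not None and key > node.high_key:
        node = node.right
        hops += 1
    return key in node.keys, hops


leaf = Leaf(keys=[10, 20, 30, 40])
stale_pointer = leaf          # what a reader got from a parent not yet updated
split(leaf)                   # the split happens underneath the reader
print(search(stale_pointer, 40))   # (True, 1): one hop right, no restart
print(search(stale_pointer, 10))   # (True, 0)
```

The reader never needs the parent to be current; the high key tells it that it arrived too far left.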
L1 · Field note

Your key distribution is part of the engine

L1 · Crash safety

Steal / no-force → WAL

L1 · Crash safety

ARIES in three passes

Lecture 2 · Thursday

LSM-trees and the amplification triangle

What if we never update in place at all?

L2 · Inversion

Invert every assumption

L2 · Write path

Memtables and SSTables

L2 · Read path

Where the bill arrives

L2 · Compaction

Size-tiered vs. leveled

  • Size-tiered: merge ~4 similar-size runs into the next tier.
  • Each byte rewritten ~once per tier: low write amp.
  • Runs overlap freely → high read amp, ~2× transient space.
  • Leveled: levels each T = 10× larger, non-overlapping within level.
  • ≤ 1 version per key per level; SA ≈ 1 + 1/T ≈ 1.1 (the full series 1 + 1/T + 1/T² + … gives ≈ 1.11).
  • Each descending byte drags ~T bytes of rewrite.
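
The space bound for leveled compaction follows from the level sizes alone. This is a sketch assuming full levels, each T times the one above, and the worst case where every key in an upper level is a stale duplicate of a key in the last level; the function name is illustrative.

```python
def leveled_space_amp(T: int, levels: int) -> float:
    """Worst-case bytes on disk per live byte for full, geometrically sized levels."""
    sizes = [T ** i for i in range(levels)]   # relative sizes: 1, T, T^2, ...
    return sum(sizes) / sizes[-1]             # live data all sits in the last level


print(leveled_space_amp(10, 4))   # 1.111
print(1 + 1 / 10)                 # 1.1, the first-order estimate
```

Size-tiered has no such bound per key: a merge of the largest tier holds inputs and output on disk at once, which is where the ~2× transient figure comes from.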
L2 · The ledger

Define the three numbers precisely

L2 · The ledger

One byte’s life, leveled (T = 10, 4 levels)

42

write amplification ≈ 1 (WAL) + 1 (flush) + 10 × 4 (descents) — the folklore “~10× per level, 40–50× overall” is just this sum. Size-tiered, 3 tiers: WA ≈ 5.
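
The same ledger as code. This is a back-of-envelope model using the slide's assumptions (one WAL write, one flush, about T rewrites per leveled descent, about one rewrite per size-tiered tier); the function names are illustrative.

```python
def leveled_write_amp(T: int, levels: int, wal: int = 1, flush: int = 1) -> int:
    """Each descent into a level rewrites roughly T bytes per incoming byte."""
    return wal + flush + T * levels


def size_tiered_write_amp(tiers: int, wal: int = 1, flush: int = 1) -> int:
    """Each byte is rewritten roughly once per tier it passes through."""
    return wal + flush + tiers


print(leveled_write_amp(T=10, levels=4))   # 42
print(size_tiered_write_amp(tiers=3))      # 5
```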

L2 · The ledger

Now audit the B-tree the same way

64

B-tree WA for a 128-byte update: 4 KB leaf dirtied + full-page write in WAL = 8 KB for 128 bytes. The LSM often amplifies less, and sequentially — its sin is that the cost is deferred, bursty, and eats your p99.
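
The B-tree audit in the same style. This is a sketch assuming 4 KB pages and one dirtied leaf per update; `btree_update_write_amp` is an illustrative name, and the no-full-page variant is an added assumption for comparison, where the WAL logs only the changed bytes.

```python
def btree_update_write_amp(update_bytes: int, page_bytes: int = 4096,
                           full_page_in_wal: bool = True) -> float:
    """Device bytes written per logical byte for one in-place update."""
    wal_bytes = page_bytes if full_page_in_wal else update_bytes
    return (page_bytes + wal_bytes) / update_bytes


print(btree_update_write_amp(128))                          # 64.0
print(btree_update_write_amp(128, full_page_in_wal=False))  # 33.0
```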

L2 · The ledger

RocksDB prints the bill on demand

** Compaction Stats [default] **
Level   Files  Size(GB)  Read(GB)  Write(GB)  W-Amp
  L0     4/0      0.25       0.0       62.1    1.0
  L1    10/1      0.62     601.7      598.9    9.6
  L2    98/3      6.21     580.4      577.8    9.3
  L3   940/8     62.05     551.2      549.0    8.9
 Sum                      1733.3     1787.8   28.8

Friday’s lab: predict the W-Amp column before you run it.
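
A quick way to audit the dump above before the lab. This sketch parses only the rows shown on the slide (pasted in as a string) and checks the Sum row; in this dump the overall W-Amp matches total Write(GB) divided by the L0 flush volume. The parsing helper is hypothetical and assumes whitespace-separated columns exactly as printed here.

```python
STATS = """\
Level   Files  Size(GB)  Read(GB)  Write(GB)  W-Amp
  L0     4/0      0.25       0.0       62.1    1.0
  L1    10/1      0.62     601.7      598.9    9.6
  L2    98/3      6.21     580.4      577.8    9.3
  L3   940/8     62.05     551.2      549.0    8.9
"""


def parse_levels(text):
    """Turn the per-level rows into dicts; skips the header line."""
    rows = []
    for line in text.splitlines()[1:]:
        level, files, size, read, write, w_amp = line.split()
        rows.append({"level": level, "files": files, "size_gb": float(size),
                     "read_gb": float(read), "write_gb": float(write),
                     "w_amp": float(w_amp)})
    return rows


rows = parse_levels(STATS)
total_read = round(sum(r["read_gb"] for r in rows), 1)
total_write = round(sum(r["write_gb"] for r in rows), 1)
flushed = rows[0]["write_gb"]                 # bytes that entered the tree at L0
overall_w_amp = round(total_write / flushed, 1)
print(total_read, total_write, overall_w_amp)   # 1733.3 1787.8 28.8
```

Note that 28.8 excludes the WAL write, which the one-byte ledger counts separately.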

L2 · Bloom filters

Buying back the reads

L2 · Bloom filters

The limitation that returns in week 6

L2 · RUM

The RUM triangle

[Figure: triangle with corners READ-OPTIMIZED, WRITE-OPTIMIZED, SPACE-OPTIMIZED. Plotted points: B+-tree (1 I/O reads, 69% pages); LSM, leveled (SA ~1.1, WA ~40); LSM, size-tiered (WA ~5); agent memory (wants R and U, never deletes; awkward).]
Fig. 2.1 — The RUM conjecture (Athanassoulis et al., EDBT 2016): optimize two overheads toward their minimum and the third moves away. A conjecture, not a theorem — but nobody has bought all three corners in thirty years.
L2 · RUM

The corner agents sit on

L2 · Scorecard

Memorize the shape, not the digits

Dimension     B+-tree                   LSM leveled               LSM size-tiered
Write amp     ~30–60×, random           ≈ 40–50×, sequential      ≈ 5×, sequential
Point reads   1 I/O                     ≈ 1 I/O w/ Bloom          ≈ 1.1 I/O w/ Bloom
Range scans   excellent                 good, no Bloom help       poor, no Bloom help
Space amp     ~1.44×                    ≈ 1.1×                    up to ~2× transient
Concurrency   latching, lock coupling   immutable SSTables; cost moves to compaction stalls (both LSM columns)
You don’t choose whether to amplify. You choose which amplification to confess to — and the workload audits you.
— Week 2 notes
Checkpoint · Discussion

Before you leave

Readings

Read before Thursday