DATA 2027 · Week 04 · Part I — Foundations Under New Workloads

Disaggregation & Elasticity

Snowflake said the warehouse is rented compute over object storage; Amazon said the log is the database — this week we take both literally.

Lecture 1 — Aurora: the Log Is the Database · Lecture 2 — Snowflake and the Elastic Warehouse

Lecture 1 · Tuesday

Aurora: the Log Is the Database

Verbitski et al., SIGMOD 2017 — ship only the redo log, and let storage do the rest.

L1 · The insight

Two descriptions of one state

L1 · Write amplification

What mirrored MySQL writes per transaction

L1 · Write amplification

Count it: 10 rows, 8 pages

  • Mirrored MySQL: 8 × 16 KB pages, doubled.
  • Plus log, binlog, full replay downstream.
  • Over half a megabyte of writes.
  • Aurora: 10 redo records, ~100 bytes each.
  • Sent to six storage nodes.
  • About 6 KB total.
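The tally above can be reproduced as back-of-envelope arithmetic. What follows is a minimal sketch, not the paper's accounting: it assumes "doubled" means a double-write of every dirty page, assumes the mirrored standby repeats those page writes, and leaves log and binlog bytes out, so the MySQL figure is a lower bound. All names are illustrative.

```python
PAGE_BYTES = 16 * 1024   # 16 KB page, from the slide
DIRTY_PAGES = 8          # pages touched by the 10-row transaction
REDO_RECORDS = 10        # one redo record per row, as counted on the slide
REDO_BYTES = 100         # approximate size of one redo record
STORAGE_NODES = 6        # each record is sent to six storage nodes


def mirrored_mysql_bytes() -> int:
    """Page writes only; log and binlog traffic would add to this."""
    primary = DIRTY_PAGES * PAGE_BYTES * 2   # assumption: "doubled" = double-write
    standby = primary                        # assumption: the mirror replays everything
    return primary + standby


def aurora_bytes() -> int:
    """Redo records only, fanned out to every storage node."""
    return REDO_RECORDS * REDO_BYTES * STORAGE_NODES


mysql = mirrored_mysql_bytes()
aurora = aurora_bytes()
print(f"mirrored MySQL >= {mysql / 1024:.0f} KiB")   # 512 KiB
print(f"Aurora         ~= {aurora / 1000:.0f} KB")   # 6 KB
print(f"ratio          ~= {mysql / aurora:.0f}x")    # 87x
```

Under these assumptions the page writes alone reach half a megabyte against roughly 6 KB of redo, a ratio in the high tens before log and binlog bytes are counted.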
L1 · Write amplification

The intent was always small

~100×

reduction in write traffic when only redo records cross the network — a few hundred bytes of intent, not megabytes of pages.

L1 · The benchmark

And it shows up in throughput

35×

more transactions than mirrored MySQL over a 30-minute SysBench write-only run — with 7.7× fewer I/Os per transaction.

L1 · Write path

Only the log crosses the network

L1 · Quorum math

Why not the usual 2/3?

L1 · Quorum math

Six copies, two per AZ

[Figure 1: the primary (InnoDB: SQL, transactions, buffer pool) ships redo records only (~100 B, no pages) to six storage nodes, SN1 to SN6, two in each of AZ-a, AZ-b, and AZ-c. Writes are acknowledged at a 4/6 quorum; reads need 3/6. Lose an AZ (−2): writes continue. Lose an AZ plus one node (−3): no data lost, rebuild and resume.]
Fig. 1 — AZ+1: losing a whole AZ leaves 4/6 (writes continue); AZ plus one node leaves 3/6 (reads stand, data survives).
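The quorum arithmetic in Fig. 1 can be checked mechanically. The sketch below is illustrative only: the node-to-AZ map mirrors the figure, and the function names are assumptions. It tests the two standard quorum rules (read and write sets overlap; two write sets overlap) and then counts survivors after an AZ failure, with and without one extra node lost, for both the 4/6 scheme and a conventional 2/3 scheme with one copy per AZ.

```python
NODES = {"SN1": "AZ-a", "SN2": "AZ-a", "SN3": "AZ-b",
         "SN4": "AZ-b", "SN5": "AZ-c", "SN6": "AZ-c"}


def quorums_overlap(copies: int, write_q: int, read_q: int) -> bool:
    """Every read sees the latest write, and two writes cannot both win."""
    return read_q + write_q > copies and 2 * write_q > copies


def can_write_read(alive: int, write_q: int, read_q: int) -> tuple:
    """Return (writes possible, reads possible) for a number of live copies."""
    return alive >= write_q, alive >= read_q


def survivors_after_az_loss(nodes: dict, az: str, extra_failures: int = 0) -> int:
    """Live copies after losing one whole AZ plus some unrelated nodes."""
    return sum(1 for home in nodes.values() if home != az) - extra_failures


print(quorums_overlap(6, 4, 3))                                          # True
print(can_write_read(survivors_after_az_loss(NODES, "AZ-a"), 4, 3))      # (True, True)
print(can_write_read(survivors_after_az_loss(NODES, "AZ-a", 1), 4, 3))   # (False, True)

# The usual 2/3 scheme, one copy per AZ, under the same AZ+1 failure.
THREE = {"N1": "AZ-a", "N2": "AZ-b", "N3": "AZ-c"}
print(can_write_read(survivors_after_az_loss(THREE, "AZ-a", 1), 2, 2))   # (False, False)
```

The last line is the answer to "why not the usual 2/3?": with three copies, an AZ outage plus one more failure leaves a single copy, which cannot form a read quorum.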
L1 · Repair speed

Durability is a race

~10 s

to re-replicate a lost 10 GB segment over a 10 Gbps link. A second failure only hurts inside that window, in the same protection group. Big segments would stretch it to hours.
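The repair window is plain bandwidth arithmetic. A minimal sketch, assuming decimal units, a fully available link, and no protocol overhead; the 10 TB case is a hypothetical segment size chosen only to show how the window grows.

```python
def repair_seconds(segment_gb: float, link_gbps: float) -> float:
    """Time to copy one segment at full link speed (GB to gigabits, then divide)."""
    return segment_gb * 8 / link_gbps


small = repair_seconds(10, 10)        # the 10 GB segment from the slide
large = repair_seconds(10_000, 10)    # hypothetical 10 TB segment

print(f"10 GB segment: {small:.0f} s")            # 8 s
print(f"10 TB segment: {large / 3600:.1f} h")     # 2.2 h
```

Eight seconds of raw transfer is consistent with the ~10 s figure once detection and setup are allowed for, and the same link needs hours for a segment a thousand times larger.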

L1 · Storage nodes

Storage materializes the pages

  • Foreground: receive redo, queue, persist, ACK.
  • Two steps — the whole latency-critical path.
  • Background: sort by LSN, gossip holes.
  • Coalesce redo onto pages; back up to S3.
  • Garbage-collect versions; scrub checksums.
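The split between foreground and background work can be sketched as a toy storage node. This is an illustration under loud assumptions, not Aurora's implementation: LSNs are dense integers starting at 1, a dictionary stands in for the durable log, a "page" is just the list of changes applied to it, and gossip, S3 backup, garbage collection, and scrubbing are omitted. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StorageNode:
    """Toy storage node: redo records in, materialized pages out."""
    log: dict = field(default_factory=dict)     # lsn -> (page_id, change)
    pages: dict = field(default_factory=dict)   # page_id -> list of applied changes
    applied_lsn: int = 0                        # highest LSN coalesced so far

    def receive(self, lsn: int, page_id: int, change: str) -> str:
        """Foreground path: queue and persist the record, then acknowledge."""
        self.log[lsn] = (page_id, change)       # stands in for the durable write
        return "ACK"

    def holes(self, up_to: int) -> list:
        """Background: LSNs never received, which peers would be asked for."""
        return [lsn for lsn in range(1, up_to + 1) if lsn not in self.log]

    def coalesce(self) -> None:
        """Background: apply redo in LSN order, stopping at the first hole."""
        while self.applied_lsn + 1 in self.log:
            self.applied_lsn += 1
            page_id, change = self.log[self.applied_lsn]
            self.pages.setdefault(page_id, []).append(change)


node = StorageNode()
for lsn, page_id, change in [(1, 7, "a"), (2, 7, "b"), (4, 9, "d")]:
    node.receive(lsn, page_id, change)
node.coalesce()
print(node.applied_lsn, node.holes(4))   # 2 [3]

node.receive(3, 9, "c")                  # the hole arrives, e.g. from a peer
node.coalesce()
print(node.applied_lsn, node.holes(4))   # 4 []
```

Only `receive` sits between the database and its acknowledgement; everything else can lag without adding write latency.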
L1 · Recovery

Crash recovery, inverted

Lecture 2 · Thursday

Snowflake and the Elastic Warehouse

Dageville et al., SIGMOD 2016 — what if compute owned nothing at all?

L2 · Architecture

Three layers, three domains

L2 · Micro-partitions

Immutable micro-partitions

L2 · Micro-partitions

What immutability buys

L2 · Economics

The one-line argument

L2 · Economics

NSDI 2020: the bet, measured

L2 · Neon

Neon: Postgres, taken apart

L2 · Neon

The thesis of the week, as an API

GetPage@LSN(tenant, timeline, rel, blkno, lsn)
  → 8 KB page image
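One way to picture what sits behind that call is a per-page history of full images and deltas. The sketch below is a simplification and not Neon's code: it drops the tenant, timeline, and rel arguments and models a single block, treats a delta as "overwrite these bytes at this offset", and assumes an unwritten page reads as zeros. `PageHistory` and its fields are hypothetical names.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 8192  # 8 KB, as on the slide


@dataclass
class PageHistory:
    """History of one block: full images plus byte-range deltas, keyed by LSN."""
    images: dict = field(default_factory=dict)   # lsn -> full page bytes
    deltas: dict = field(default_factory=dict)   # lsn -> (offset, data)

    def get_page_at_lsn(self, lsn: int) -> bytes:
        """Rebuild the page as of `lsn`: newest image at or below it, then deltas."""
        base_lsn = max((at for at in self.images if at <= lsn), default=None)
        if base_lsn is None:
            page = bytearray(PAGE_SIZE)          # assumption: no image means zeros
            base_lsn = 0
        else:
            page = bytearray(self.images[base_lsn])
        for at in sorted(self.deltas):
            if base_lsn < at <= lsn:
                offset, data = self.deltas[at]
                page[offset:offset + len(data)] = data
        return bytes(page)


history = PageHistory()
history.deltas[10] = (0, b"AA")
history.images[20] = b"B" * PAGE_SIZE
history.deltas[30] = (2, b"CC")

print(history.get_page_at_lsn(15)[:4])   # b'AA\x00\x00'
print(history.get_page_at_lsn(30)[:4])   # b'BBCC'
```

Asking for an older LSN returns the older page, which is the property that makes point-in-time reads and branching cheap.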
L2 · Neon

WAL in, pages out

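[Figure 2: stateless Postgres (scales to zero when idle) streams WAL to the safekeepers (Paxos quorum, 2/3 = commit). The pageserver turns WAL into delta + image layers, keeps an NVMe cache, and answers GetPage@LSN with an 8 KB page. Immutable layers are stored in S3. A copy-on-write branch = (parent, LSN).]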
Fig. 2 — Neon: WAL durable at a 2/3 safekeeper quorum; pageservers reorganize it into immutable layers tiered to S3; stateless Postgres fetches pages on demand.
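The "2/3 = commit" rule in Fig. 2 amounts to taking the highest WAL position that a quorum has persisted. A minimal sketch of that general rule, not Neon's protocol code; `commit_lsn` and the sample positions are illustrative.

```python
def commit_lsn(acked: list, quorum: int) -> int:
    """Highest LSN that at least `quorum` safekeepers have persisted."""
    return sorted(acked, reverse=True)[quorum - 1]


# Three safekeepers have flushed WAL up to these positions.
print(commit_lsn([120, 100, 90], quorum=2))   # 100
```

The slowest safekeeper does not hold commits back: with positions 120, 100, and 90, everything up to 100 is already on two of the three.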
L2 · Branching

Copy-on-write, like a Git branch

~1 s

to branch a 2 TB database — a branch is just (parent timeline, LSN) plus its own WAL after. A hundred branches diverging by megabytes cost a few hundred megabytes total.
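Why branching is nearly free can be seen in a toy model where a branch stores only a pointer to its parent and an LSN. This is a sketch under assumptions, not Neon's storage format: blocks hold strings instead of pages, LSNs on a branch are assumed to be larger than its branch point, and `Timeline` with its methods is a hypothetical name.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Timeline:
    """A branch is (parent, branch_lsn) plus only the writes made after it."""
    parent: Timeline | None = None
    branch_lsn: int = 0
    writes: dict = field(default_factory=dict)   # blkno -> {lsn: contents}

    def write(self, blkno: int, lsn: int, contents: str) -> None:
        self.writes.setdefault(blkno, {})[lsn] = contents

    def read(self, blkno: int, lsn: int) -> str | None:
        """Newest local version at or below lsn, else defer to the parent."""
        versions = self.writes.get(blkno, {})
        visible = [at for at in versions if at <= lsn]
        if visible:
            return versions[max(visible)]
        if self.parent is not None:
            # The parent is only visible up to the point where we branched.
            return self.parent.read(blkno, min(lsn, self.branch_lsn))
        return None

    def branch(self, lsn: int) -> Timeline:
        """Creating a branch copies no data at all."""
        return Timeline(parent=self, branch_lsn=lsn)


main = Timeline()
main.write(1, 10, "v1")
dev = main.branch(15)
main.write(1, 20, "v2-main")   # written after the branch point
dev.write(1, 25, "v2-dev")

print(main.read(1, 30))   # v2-main
print(dev.read(1, 30))    # v2-dev
print(dev.read(1, 20))    # v1
```

The branch never sees `v2-main`, and its own footprint is the single write it made, which is the sense in which many branches diverging by megabytes cost only those megabytes.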

L2 · Comparison

Three cuts, one idea

Axis                     | Aurora (2017)          | Snowflake (2016)          | Neon (2020s)
Crosses network on write | Redo records only      | New micro-partition files | WAL stream
Durability               | 4/6 quorum, 3 AZs      | S3 + replicated metadata  | 2/3 Paxos, then S3
Unit of elasticity       | Read replicas          | Warehouses, per second    | Compute → zero per branch
Materialization          | Storage coalesces redo | Compute writes files      | Delta + image layers
Cost of a full copy      | Volume clone (CoW)     | Zero-copy clone           | Branch = (parent, LSN)

L2 · Agents

Seismograph demand curves

L2 · Agents

Branch-per-experiment is isolation

L2 · Synthesis

The cache hierarchy is the architecture

The buffer pool didn’t disappear into the cloud — it became the architecture.
— Week 4 lecture notes, DATA 2027
Checkpoint · Discussion

Before you leave

Readings · Due Thursday

Read before Thursday