Week 4: Disaggregation & Elasticity

Learning objectives — after this week you can…

Quantify write amplification in traditional replicated databases and explain why shipping only the redo log collapses it by two orders of magnitude.
Reconstruct Aurora’s 6-way quorum (4/6 write, 3/6 read) from first principles of the AZ+1 failure model, and explain why 10 GB segments make repair fast enough for the math to hold.
Describe Snowflake’s three-layer split and trace how immutable micro-partitions enable pruning, time travel, and zero-copy cloning.
Walk a GetPage@LSN request through Neon’s safekeeper → pageserver → S3 pipeline, and explain why a copy-on-write branch costs almost nothing to create.
Argue, with cost arithmetic, when scale-to-zero and branch-per-experiment make disaggregated storage the right substrate for agent workloads.

Lecture 1 · Tuesday

Aurora: the Log Is the Database

Every durable database already contains two complete descriptions of its state: the data pages, and the redo log that produced them. They are informationally redundant — given the log, you can rebuild every page. For forty years we shipped both across every replication boundary, because networks were slow and disks were local, so materializing pages eagerly everywhere seemed obviously right. Aurora’s insight (Verbitski et al., SIGMOD 2017) is brutal in its simplicity: in a cloud datacenter, where the network is fast and the storage fleet can run code, shipping pages is pure waste. Ship only the log. Let storage materialize pages itself, lazily, close to the disks. Everything else in the paper — the quorum, the segments, the recovery story — falls out of that one decision.

The write amplification problem

Start by counting what a conventional highly-available MySQL writes per transaction. InnoDB writes the redo log record, the data page itself (16 KB, even if you changed 120 bytes of it), and a second copy of the page to the double-write buffer to guard against torn writes. MySQL also writes the binlog for replication, plus assorted metadata. Now mirror that instance for HA: networked block storage typically replicates each of those writes again, and a hot standby repeats the entire set on its own mirrored volumes. The Aurora paper’s Figure 2 counts the same logical write traveling the network and hitting media many times over — and several of those writes are sequential and synchronous, so latency stacks.

Make it concrete. A transaction updates 10 rows spread over 8 pages. Mirrored MySQL moves roughly 8 × 16 KB of pages, doubled by the double-write buffer, plus log and binlog — then the whole bundle is replayed down a second instance’s stack: comfortably over half a megabyte of writes for a few hundred bytes of intent. Aurora sends 10 redo records of ~100 bytes each to six storage nodes: about 6 KB total. That is roughly a 100× reduction, and it shows up in the paper’s headline benchmark: over a 30-minute SysBench write-only run, Aurora completed 35× more transactions than mirrored MySQL while issuing 7.7× fewer I/Os per transaction. The transaction’s intent is small; only our habits made it big.
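The arithmetic above can be checked with a few lines of Python. This is a back-of-envelope sketch, not a measurement: the function names are made up for illustration, the MySQL side counts page bytes only (so it is a lower bound), and the sizes are the ones quoted in the paragraph.

```python
# Back-of-envelope write amplification for the example in the text.
# All sizes are illustrative assumptions, not measurements.

PAGE_BYTES = 16 * 1024      # InnoDB page size
REDO_RECORD_BYTES = 100     # approximate redo record size used in the text
AURORA_REPLICAS = 6


def mirrored_mysql_bytes(pages_touched: int, instances: int = 2) -> int:
    """Page bytes only: each dirty page plus its double-write copy, per instance.

    Redo log, binlog, and block-storage mirroring are left out, so this is a
    lower bound on what the mirrored setup really writes.
    """
    per_instance = pages_touched * PAGE_BYTES * 2  # page + double-write copy
    return per_instance * instances


def aurora_bytes(redo_records: int) -> int:
    """Redo records only, sent to every storage replica."""
    return redo_records * REDO_RECORD_BYTES * AURORA_REPLICAS


mysql = mirrored_mysql_bytes(pages_touched=8)
aurora = aurora_bytes(redo_records=10)
print(f"mirrored MySQL (lower bound): {mysql / 1024:.0f} KiB")
print(f"Aurora: {aurora / 1024:.1f} KiB")
print(f"ratio: {mysql / aurora:.0f}x")
```

Pages alone give a ratio of about 87×. Adding the redo log, the binlog, and the block-storage mirroring that this sketch leaves out is what pushes the figure toward the rough 100× quoted above.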

Only the log crosses the network

In Aurora the primary instance still runs MySQL’s parser and optimizer and InnoDB’s transaction machinery and buffer pool, but the only thing it ever writes is redo records. Each record is tagged with its log sequence number (LSN), grouped by the 10 GB storage segment it belongs to, and streamed to the six replicas of that segment. Commit is asynchronous and quorum-based: a worker thread marks a transaction durable once the volume durable LSN (the highest point at which the log is known complete on a write quorum) reaches or passes the transaction’s commit LSN. No page ever crosses the wire in the write path. The primary doesn’t even checkpoint in the traditional sense — checkpointing has been delegated to storage, which is continuously, incrementally applying the log it already holds.
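To see why a commit waits on the durable frontier and not on its own record, here is a toy model of the acknowledgement bookkeeping. It is a deliberate simplification: `CommitTracker`, `on_ack`, and `wait_for_commit` are hypothetical names, there is a single protection group, and LSNs are consecutive integers. The paper's real mechanism tracks completeness per segment and has more moving parts.

```python
from dataclasses import dataclass, field

WRITE_QUORUM = 4  # acknowledgements needed out of 6 replicas


@dataclass
class CommitTracker:
    """Toy model of one protection group; LSNs are consecutive integers from 1."""

    acks: dict = field(default_factory=dict)     # lsn -> set of replica ids that ACKed
    durable_lsn: int = 0                         # simplified stand-in for the VDL
    waiting: dict = field(default_factory=dict)  # commit lsn -> transaction name

    def wait_for_commit(self, txn: str, commit_lsn: int) -> None:
        self.waiting[commit_lsn] = txn

    def on_ack(self, lsn: int, replica: int) -> list:
        """Record one ACK and return the transactions that just became durable."""
        self.acks.setdefault(lsn, set()).add(replica)
        # Advance over every consecutive LSN that has reached the write quorum;
        # a hole in the log stops the frontier even if later LSNs are acknowledged.
        while len(self.acks.get(self.durable_lsn + 1, ())) >= WRITE_QUORUM:
            self.durable_lsn += 1
        ready = sorted(c for c in self.waiting if c <= self.durable_lsn)
        return [self.waiting.pop(c) for c in ready]


tracker = CommitTracker()
tracker.wait_for_commit("txn-A", commit_lsn=2)

released = []
for replica in range(4):
    released += tracker.on_ack(2, replica)    # LSN 2 reaches quorum first
released_before_hole_filled = list(released)  # still empty: LSN 1 is a hole

for replica in range(4):
    released += tracker.on_ack(1, replica)    # filling LSN 1 releases txn-A
```

The point of the example is the ordering: four acknowledgements for LSN 2 are not enough while LSN 1 is missing, because the frontier only advances over a complete prefix of the log.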

Quorum math for three availability zones

Why six copies, and why 4/6 and 3/6? Quorum systems need two inequalities: the read and write quorums must intersect (V_r + V_w > V), and two writes must intersect (V_w > V/2). A 2/3 scheme satisfies both, and that’s what most systems used. Aurora rejects it because of correlated failure: in AWS, an entire availability zone can vanish — fire, flood, a bad network push — and failures of individual nodes (disk dies, top-of-rack flaps) continue at background rates regardless. The design target is AZ+1: lose a whole AZ plus one unrelated node, and you must not lose data. With 2/3 across three AZs, AZ+1 destroys two of three copies — quorum gone. So: six copies, two per AZ. Losing an AZ (−2) leaves 4/6, so writes continue. Losing AZ+1 (−3) leaves 3/6: writes stall, but the read quorum still stands, so no data is lost and the system can rebuild fresh copies and resume.
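The argument above is mechanical enough to check in code. The sketch below takes a placement (copies per AZ) and a pair of quorum sizes, then tests the two inequalities and the worst-case AZ and AZ+1 failures. The function name and the dictionary keys are illustrative choices, not terminology from the paper.

```python
def survives(copies_per_az: list, write_q: int, read_q: int) -> dict:
    """Check the quorum inequalities and the worst-case AZ and AZ+1 failures."""
    total = sum(copies_per_az)
    after_az = total - max(copies_per_az)   # lose the AZ holding the most copies
    after_az_plus_1 = after_az - 1          # then lose one unrelated node
    return {
        "quorums_intersect": read_q + write_q > total and 2 * write_q > total,
        "write_after_az": after_az >= write_q,
        "read_after_az_plus_1": after_az_plus_1 >= read_q,
    }


two_of_three = survives([1, 1, 1], write_q=2, read_q=2)
aurora = survives([2, 2, 2], write_q=4, read_q=3)

print("2/3 :", two_of_three)
print("4/6 :", aurora)
```

The 2/3 scheme passes the first two checks and fails the third, which is exactly the gap the AZ+1 requirement exposes. The same function works for any other placement you want to try.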

The subtle half of the argument is repair speed, because a quorum is only as good as the time you spend degraded. Aurora carves the volume into 10 GB segments, each replicated six ways to form a protection group, precisely so that re-replicating a lost segment over a 10 Gbps link takes about ten seconds (80 gigabits at 10 Gbps is 8 seconds before overhead). For a double failure to hurt you, the second failure must land inside that ten-second window in the same protection group. Big segments would stretch that window to minutes or hours and quietly invalidate the whole availability story. Durability is a race between failure rate and repair rate, and Aurora wins it by making the unit of repair small.
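A rough way to put numbers on that race: compute the repair window from segment size and link speed, then ask how likely an independent node failure is inside the window. The exponential failure model and the one-failure-per-node-year rate are assumptions chosen for illustration only; they are not figures from the Aurora paper.

```python
import math


def repair_seconds(segment_gb: float, link_gbps: float) -> float:
    """Time to copy one segment over one link, ignoring protocol overhead."""
    return segment_gb * 8 / link_gbps  # gigabytes -> gigabits, then divide by Gbit/s


def p_second_failure(window_s: float, node_mttf_s: float) -> float:
    """P(an independent node fails within the window) under an exponential model."""
    return -math.expm1(-window_s / node_mttf_s)


HYPOTHETICAL_MTTF_S = 365 * 24 * 3600  # assumed: one failure per node-year

for size_gb in (10, 1000):
    window = repair_seconds(size_gb, link_gbps=10)
    risk = p_second_failure(window, HYPOTHETICAL_MTTF_S)
    print(f"{size_gb:>5} GB segment: window {window:.0f} s, P(second failure) {risk:.2e}")
```

Because the window is tiny compared with the failure interval, the risk scales almost linearly with segment size: a segment a hundred times larger is close to a hundred times more exposed.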

Storage nodes materialize the pages

Each storage node runs a small pipeline. Foreground: receive redo records, append to an in-memory queue, persist, ACK. That’s the entire latency-critical path, and the paper counts it as two steps: enqueue the record, then persist and acknowledge. Everything else is background: sort records by LSN, gossip with peer replicas to fill holes in the log, coalesce redo onto data pages, back up log and pages to S3 continuously, garbage-collect superseded versions, and scrub blocks against checksums. When the database later reads a page, the storage node returns a materialized page at the requested LSN — applying any not-yet-coalesced redo on demand.
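The split between the foreground and background paths is easier to see in a toy storage node. Everything here is a simplified sketch with hypothetical names (`StorageNode`, `Redo`, `coalesce`): pages are 16 bytes, redo records are byte overwrites, and persistence, gossip, backup, and garbage collection are left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Redo:
    lsn: int
    page_id: int
    offset: int
    data: bytes


@dataclass
class StorageNode:
    page_size: int = 16
    base: dict = field(default_factory=dict)     # page_id -> (lsn, image) coalesced so far
    pending: dict = field(default_factory=dict)  # page_id -> redo not yet coalesced

    def receive(self, rec: Redo) -> str:
        """Foreground path: enqueue and acknowledge (the disk write is elided)."""
        self.pending.setdefault(rec.page_id, []).append(rec)
        return "ACK"

    def read_page(self, page_id: int, lsn: int) -> bytes:
        """Materialize the page as of `lsn`, applying pending redo on demand."""
        base_lsn, image = self.base.get(page_id, (0, bytes(self.page_size)))
        if lsn < base_lsn:
            raise ValueError("requested LSN is older than the coalesced image")
        page = bytearray(image)
        for rec in sorted(self.pending.get(page_id, []), key=lambda r: r.lsn):
            if rec.lsn <= lsn:
                page[rec.offset:rec.offset + len(rec.data)] = rec.data
        return bytes(page)

    def coalesce(self, page_id: int, upto_lsn: int) -> None:
        """Background path: fold redo up to `upto_lsn` into the stored image."""
        image = self.read_page(page_id, upto_lsn)
        self.base[page_id] = (upto_lsn, image)
        self.pending[page_id] = [
            r for r in self.pending.get(page_id, []) if r.lsn > upto_lsn
        ]


node = StorageNode()
node.receive(Redo(lsn=2, page_id=7, offset=2, data=b"bb"))  # arrives out of order
node.receive(Redo(lsn=1, page_id=7, offset=0, data=b"aa"))
at_1 = node.read_page(7, lsn=1)
at_2 = node.read_page(7, lsn=2)
node.coalesce(7, upto_lsn=2)
after = node.read_page(7, lsn=2)
```

A read before coalescing and a read after it return the same bytes. Coalescing changes where the work happens, never what the page contains at a given LSN.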

This inverts crash recovery. Traditional recovery replays the log from the last checkpoint before opening for business — minutes of downtime proportional to log backlog. Aurora’s storage is always applying the log, so on crash the database re-establishes the durable LSN by quorum reads and opens immediately; pages are brought to consistency lazily, on first touch. Recovery time stops being a function of how much work was in flight. The log is not a recovery aid for the database; the log is the database, and pages are merely a cache of its prefix sums.

Field note A team I worked with once ran the AZ+1 drill for real: an AZ-wide network partition during a deploy, then twenty minutes in, a routine disk failure took a storage node. On a 2/3 system that’s a data-loss incident. On the 6-way layout it was an on-call engineer watching repair jobs re-replicate 10 GB segments while the database kept serving reads. Nobody got paged because of the quorum — only because a dashboard turned yellow.

Lecture 2 · Thursday

Snowflake and the Elastic Warehouse

Aurora disaggregated storage but kept compute monogamous: one primary writes, replicas read, and the unit you scale is still “a database instance.” Snowflake (Dageville et al., SIGMOD 2016) went further and made compute promiscuous. Storage is just S3. Compute is a fleet of virtual warehouses — clusters of EC2 nodes you summon, resize, and dismiss like contractors — and any number of them can work over the same data simultaneously without coordinating, because the data is immutable and the metadata lives in a third, separately-scaled service tier. Aurora asked “what if storage understood the log?” Snowflake asked “what if compute owned nothing at all?”

Three layers, three failure and scaling domains

Snowflake splits into: cloud services (a multi-tenant, replicated brain holding the catalog, transaction manager, optimizer, and access control), virtual warehouses (stateless worker clusters with local SSD acting purely as cache), and storage (S3, which is slow at first byte but has effectively unbounded aggregate bandwidth and eleven nines of durability). The layers scale independently: a metadata-heavy workload stresses services; a giant join stresses a warehouse; nothing ever stresses S3. Crucially, a warehouse failure loses no state — kill every node mid-query and you’ve lost cache and in-flight work, never data. That property is what makes elasticity safe, not merely possible: you can only treat compute as disposable if it was never entrusted with anything.

Immutable micro-partitions and pruning

Tables are stored as micro-partitions: immutable files each holding roughly 50–500 MB of uncompressed data, internally columnar (a PAX-style hybrid layout), heavily compressed per column. Writes never modify a file; they write new files and update metadata to swap them in — an update to one row rewrites that row’s micro-partition. That sounds wasteful until you notice what immutability buys. First, pruning: cloud services keep min/max values (zone maps) per column per micro-partition, so WHERE event_date = '2026-06-09' over a 100,000-partition table touches only the partitions whose range can contain that date — 99%+ of I/O eliminated before a single byte leaves S3. There are no B-tree indexes in Snowflake; pruning over sorted-ish data is the index. Second, time travel: old micro-partitions stick around for a retention window, so querying the table “as of” an hour ago is just resolving an older file list. Third, zero-copy cloning: a clone of a 50 TB table is a metadata entry pointing at the same files. Versioning collapses into bookkeeping.
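Pruning on min/max metadata fits in a dozen lines. The sketch below is an illustration under stated assumptions: `MicroPartition` and `prune` are invented names, the table has 1,000 partitions instead of 100,000, and the zone map covers a single date column. It also builds an unclustered table to show why the data has to be at least roughly sorted.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class MicroPartition:
    file: str
    min_date: date   # zone map: smallest event_date in the file
    max_date: date   # zone map: largest event_date in the file


def prune(partitions, target: date):
    """Keep only partitions whose [min, max] range can contain `target`."""
    return [p for p in partitions if p.min_date <= target <= p.max_date]


start = date(2024, 1, 1)

# Hypothetical table loaded in rough date order: each file spans two days,
# so neighbouring files overlap ("sorted-ish").
clustered = [
    MicroPartition(f"part-{i:04d}", start + timedelta(days=i), start + timedelta(days=i + 1))
    for i in range(1000)
]

# Same data volume, but every file holds rows from the whole date range.
unclustered = [
    MicroPartition(f"part-{i:04d}", start, start + timedelta(days=1000))
    for i in range(1000)
]

target = date(2026, 6, 9)
kept_clustered = prune(clustered, target)
kept_unclustered = prune(unclustered, target)
print(f"clustered:   scan {len(kept_clustered)} of {len(clustered)} files")
print(f"unclustered: scan {len(kept_unclustered)} of {len(unclustered)} files")
```

On the clustered table the filter leaves two files out of a thousand. On the unclustered one it leaves all of them, because every file's range can contain the date. The metadata is identical in kind; only the physical ordering differs.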

The economics of elasticity

The deep argument for disaggregation is economic, and it’s one line: 100 machines for 1 hour costs the same as 1 machine for 100 hours — but only one of them returns your answer today. In a coupled architecture you can’t exploit that identity, because adding machines means re-sharding data, which takes longer than the query. With state in S3 and compute stateless, a warehouse resize is just “boot more nodes and warm their caches.” Per-second billing plus auto-suspend turns capacity planning into a control loop: a nightly transform that needs 64 nodes for 9 minutes costs about the same as one node grinding for 9.6 hours, and an analyst’s warehouse that’s idle 95% of the day costs 5% of list price. The 2020 NSDI follow-up (Vuppalapati et al.), written from production telemetry, confirms the bet and reports the honest residue: real workloads are violently bursty, local SSD caches achieve high hit rates despite holding a sliver of total data (access skew is your friend), and the unsolved annoyance is intermediate data — shuffle and spill during a big join wants storage that is neither as expensive as RAM nor as far away as S3.
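The two cost claims in that paragraph are simple enough to verify directly. This sketch assumes billing is purely proportional to node-time with no minimum charge and no price difference between warehouse sizes; both helper names are made up for the example.

```python
def node_hours(nodes: int, minutes: float) -> float:
    """Total billed node-hours for a cluster of `nodes` running for `minutes`."""
    return nodes * minutes / 60


def billed_fraction(active_minutes: float, period_minutes: float = 24 * 60) -> float:
    """Share of an always-on bill that is paid when billing stops during suspension."""
    return active_minutes / period_minutes


burst = node_hours(nodes=64, minutes=9)         # the nightly transform, run wide
grind = node_hours(nodes=1, minutes=9.6 * 60)   # the same work on a single node
analyst = billed_fraction(active_minutes=0.05 * 24 * 60)

print(f"64 nodes x 9 min  = {burst:.1f} node-hours")
print(f"1 node x 9.6 h    = {grind:.1f} node-hours")
print(f"analyst warehouse = {analyst:.0%} of the always-on price")
```

The identity only holds if the work parallelizes perfectly and the warehouse starts quickly enough that boot time is negligible against nine minutes. Both are assumptions worth stating when you make this argument to a finance team.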

Neon: Postgres, taken apart along the same seam

Neon applies the Aurora cut to open-source Postgres, with a cleaner separation and one killer API. Postgres runs nearly unmodified above the storage layer (Neon maintains a small set of patches, mostly around the storage manager) and is stateless: no data directory that matters. Its WAL streams to a trio of safekeepers, which run a Paxos-style consensus to decide the durable WAL frontier — commit means a quorum of safekeepers has the bytes. Downstream, pageservers ingest the WAL, reorganize it into immutable layer files (delta layers holding WAL slices, image layers holding materialized pages), cache them on local NVMe, and tier everything to S3. When Postgres misses in its buffer pool, it doesn’t read a disk — it issues an RPC:

GetPage@LSN(tenant, timeline, rel, blkno, lsn) → 8 KB page image

Read that signature slowly, because it’s the thesis of the week. The page is requested at an LSN — at a point in logical time. The pageserver reconstructs it by finding the newest image layer at or below that LSN and applying any delta records between. Once pages are addressable by (block, time), a branch is trivially cheap: it’s a new timeline defined as “parent timeline up to LSN L, plus my own WAL after.” Creating one copies nothing — reads of unmodified pages resolve through the parent’s layers, copy-on-write, exactly like a Git branch sharing history. A 2 TB database branches in roughly a second, and a hundred branches that diverge by a few megabytes each cost a few hundred megabytes total.
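A toy version of the lookup makes the branching claim concrete. This is a sketch under heavy simplification, not Neon's implementation: `Timeline`, `branch`, `write`, and `get_page` are hypothetical names, layers are plain Python lists keyed by block number, pages are 16 bytes, and a delta is a byte overwrite. Real layer files are keyed by key range and LSN range and live on NVMe and S3.

```python
from dataclasses import dataclass, field
from typing import Optional

PAGE_SIZE = 16  # toy page size; real Postgres pages are 8 KB


@dataclass
class Timeline:
    parent: Optional["Timeline"] = None
    branch_lsn: int = 0                         # parent history is visible up to here
    images: dict = field(default_factory=dict)  # blkno -> [(lsn, page_bytes)]
    deltas: dict = field(default_factory=dict)  # blkno -> [(lsn, offset, data)]

    def branch(self, at_lsn: int) -> "Timeline":
        """Create a child timeline; nothing is copied, only a pointer and an LSN."""
        return Timeline(parent=self, branch_lsn=at_lsn)

    def write(self, blkno: int, lsn: int, offset: int, data: bytes) -> None:
        self.deltas.setdefault(blkno, []).append((lsn, offset, data))

    def _visible(self, store: str, blkno: int, lsn: int) -> list:
        """Records for `blkno` visible at `lsn`, including inherited history."""
        found, timeline, cap = [], self, lsn
        while timeline is not None:
            found += [r for r in getattr(timeline, store).get(blkno, []) if r[0] <= cap]
            cap = min(cap, timeline.branch_lsn)  # ancestors count only up to the branch point
            timeline = timeline.parent
        return found

    def get_page(self, blkno: int, lsn: int) -> bytes:
        """Newest visible image at or below `lsn`, plus the deltas after it."""
        images = self._visible("images", blkno, lsn)
        base_lsn, image = max(images, key=lambda r: r[0], default=(0, bytes(PAGE_SIZE)))
        page = bytearray(image)
        for rec_lsn, offset, data in sorted(self._visible("deltas", blkno, lsn)):
            if rec_lsn > base_lsn:
                page[offset:offset + len(data)] = data
        return bytes(page)


main = Timeline()
main.write(0, lsn=10, offset=0, data=b"prod")
dev = main.branch(at_lsn=10)                     # copies nothing
main.write(0, lsn=20, offset=0, data=b"PROD")    # parent moves on
dev.write(0, lsn=21, offset=4, data=b"+dev")     # branch diverges

dev_page = dev.get_page(0, lsn=30)
main_page = main.get_page(0, lsn=30)
main_past = main.get_page(0, lsn=10)
```

The branch stores one delta record of its own and still reads the parent's write at LSN 10, while the parent's later write at LSN 20 stays invisible to it. Reading `main` at LSN 10 is the same mechanism used for time travel.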

Fig. 4.1 — Two cuts through the same idea. Aurora (left): the primary ships ~100-byte redo records to six storage nodes (two per AZ); commit waits for 4/6, reads need 3/6, and page materialization happens inside storage. Neon (right): WAL is made durable by a Paxos quorum of safekeepers, pageservers reorganize it into immutable layers tiered to S3, and stateless Postgres fetches pages on demand via GetPage@LSN — which is also what makes copy-on-write branches free.

Axis	Aurora (2017)	Snowflake (2016)	Neon (2020s)
What crosses the network on write	Redo records only	New micro-partition files	WAL stream
Durability mechanism	4/6 quorum over 3 AZs	S3 + replicated metadata	2/3 Paxos on safekeepers, then S3
Unit of elasticity	Read replicas (writes: one primary)	Whole virtual warehouses, per second	Compute scales to zero per branch
Page/file materialization	Storage nodes coalesce redo	Compute writes immutable files	Pageservers build delta + image layers
Cost of a full copy	Volume clone (CoW, fast)	Zero-copy clone (metadata)	Branch = (parent, LSN), ~1 second

The agent angle: seismograph demand curves

Why does this week sit in a course about agents? Look at the load profile an agent fleet generates. Human-driven OLTP is a diurnal sine wave; you provision for the afternoon peak and eat the overnight waste. Agents produce a seismograph: a flat line of near-zero, then a workflow fires and 400 concurrent sessions appear within two seconds, run hot for ninety seconds, and vanish. The duty cycle can be under 1%. Provisioned-for-peak hardware at a 1% duty cycle is a 100× cost penalty; only scale-to-zero compute over disaggregated storage prices that curve honestly. Neon suspends an idle compute after a few minutes and cold-starts it with well under a second of compute boot, followed by cache warming, which is acceptable to an agent that just spent four seconds thinking about what query to write.
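The penalty figure falls out of a demand trace directly. The sketch below builds a hypothetical three-hour trace with one burst matching the numbers in the paragraph (400 sessions for ninety seconds) and compares capacity provisioned for the peak against capacity actually used. The trace shape and the function name are assumptions for illustration.

```python
def provisioning_penalty(trace: list) -> tuple:
    """Return (duty cycle, provisioned-for-peak capacity / capacity actually used)."""
    peak = max(trace)
    used = sum(trace)                       # session-seconds consumed
    provisioned = peak * len(trace)         # session-seconds paid for at peak size
    duty_cycle = sum(1 for x in trace if x > 0) / len(trace)
    return duty_cycle, provisioned / used


# Hypothetical per-second demand over three hours: silence, one 90-second
# burst of 400 concurrent sessions, then silence again.
trace = [0] * 5000 + [400] * 90 + [0] * 5710

duty_cycle, penalty = provisioning_penalty(trace)
print(f"duty cycle {duty_cycle:.2%}, provisioned-for-peak penalty {penalty:.0f}x")
```

With a single rectangular burst the penalty is just the reciprocal of the duty cycle, here 120×. Real traces are ragged, so the penalty is usually worse than the duty cycle alone suggests, since the peak is set by the tallest burst.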

Second, agents are reckless in exactly the way branches make safe. A migration-writing agent should never rehearse ALTER TABLE against the production timeline; it should branch, apply, check, then promote or delete. At human scale that workflow meant a weekend restoring dumps onto a staging box. At agent scale — hundreds of experiments a day — it is feasible only because a branch costs one second and zero bytes up front. Branch-per-experiment is not a convenience feature; it is the isolation mechanism for a client that ships code faster than you can review it. Disaggregation is the precondition: copy-on-write across experiments only works when all experiments share one storage substrate addressed by (page, LSN).

Third, notice what happened to the humble buffer pool. In 1981 it was a few megabytes of RAM in front of a local disk. In these systems the cache hierarchy is the architecture: Postgres shared buffers (~100 ns) miss to pageserver or warehouse NVMe (~100 µs) miss to S3 (tens of milliseconds to first byte, but ~$0.023/GB·month and request-priced GETs). Every design decision this week — what to materialize where, what to keep hot, what an acceptable miss costs — is a buffer-pool policy question scaled up to datacenter size. The five-minute rule from Week 2 didn’t retire; it got promoted to chief architect.
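One way to feel how steep that hierarchy is: compute the mean read latency for a few hit rates. The latencies are the ones quoted above, with S3 pinned at 30 ms as a single point inside "tens of milliseconds". The hit rates are hypothetical, and the model ignores the lookup cost paid at each level before a miss.

```python
# Latencies from the text; the S3 figure is one assumed point in its range.
BUFFER_S = 100e-9   # Postgres shared buffers
NVME_S = 100e-6     # pageserver or warehouse NVMe
S3_S = 30e-3        # object storage, time to first byte


def expected_read_s(buffer_hit: float, nvme_hit: float) -> float:
    """Mean page-read latency for a three-level hierarchy (lookup costs ignored)."""
    miss_cost = nvme_hit * NVME_S + (1 - nvme_hit) * S3_S
    return buffer_hit * BUFFER_S + (1 - buffer_hit) * miss_cost


for nvme_hit in (0.90, 0.99, 0.999):
    mean = expected_read_s(buffer_hit=0.99, nvme_hit=nvme_hit)
    print(f"NVMe hit rate {nvme_hit:.1%}: mean read {mean * 1e6:.2f} us")
```

Even with both caches hitting 99% of the time, the rare S3 trips dominate the mean. That is why the NVMe tier's hit rate matters far more than its raw speed.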

The buffer pool didn’t disappear into the cloud — it became the architecture.

Historical aside We litigated this once before. Stonebraker’s 1986 “The Case for Shared Nothing” won the argument against shared-disk designs like IBM’s and later Oracle RAC, largely because 1980s networks were slow and cache-coherence across nodes was agony. Snowflake and Aurora are shared-disk architectures that won anyway — because S3 changed the constants (unbounded bandwidth, eleven nines) and because both systems quietly dodged coherence: Aurora allows one writer; Snowflake makes data immutable. The physics didn’t change. The constants did, and the constants were the argument.

Readings

Read Before Thursday

Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases — Verbitski et al., SIGMOD 2017. The canonical “ship only the log” paper. Focus on §2–3: the write amplification accounting in Figure 2, and derive the 4/6–3/6 quorum from the AZ+1 model yourself before reading their derivation.

The Snowflake Elastic Data Warehouse — Dageville et al., SIGMOD 2016. Read for the three-layer architecture and §3.3 on micro-partitions and pruning. Ask at every section: which property here is enabled by immutability alone?

Building an Elastic Query Engine on Disaggregated Storage — Vuppalapati et al., NSDI 2020. The production retrospective: what the 2016 bet got right, measured. Focus on the workload skew and cache hit-rate data, and the open problem of intermediate (shuffle/spill) data.

Exercises

This Week’s Problems

Exercise 4.1 · warm-up

A transaction updates 200 bytes spread across 5 distinct 16 KB pages. Count the bytes that cross the network to achieve durable, replicated commit in (a) a mirrored MySQL pair where each instance writes data pages, a double-write buffer copy, redo, and binlog to network-attached storage that itself mirrors every write; and (b) Aurora, assuming 120-byte redo records and 6 storage replicas. State your assumptions explicitly, compute the amplification ratio, and identify which writes in (a) are on the synchronous commit path versus deferrable.

Exercise 4.2 · core

Aurora chose V=6, V_w=4, V_r=3 over three AZs. (a) Show that 3/5 and 2/3 schemes each violate some part of the AZ+1 requirement, stating the placement of copies across AZs in each case. (b) Compute the expected window of vulnerability for a protection group: with 10 GB segments re-replicated at 10 Gbps, how long is the system one-failure-from-data-loss after an AZ+1 event, and how does that change with 100 GB segments at the same bandwidth? (c) Suppose AWS adds a fourth AZ. Propose a quorum (V, V_w, V_r, placement) that tolerates AZ+1 with the minimum total copies, and argue why Aurora might still decline to use it.

Exercise 4.3 · stretch

Design the storage layer for a “branch-per-agent-experiment” service on a Neon-like substrate. Assumptions: one 2 TB parent database; 10,000 branches created per day; the median branch lives 11 minutes, writes 40 MB of WAL, issues 50,000 GetPage@LSN reads (80% within 2 GB of hot pages shared with the parent), and is then deleted; 1% of branches are promoted and live indefinitely. Specify: (a) your layer-file design and a compaction policy that bounds read amplification for a page at an arbitrary LSN on an arbitrary branch — define read amplification precisely and prove your bound; (b) the pageserver NVMe cache size and admission policy that keeps p99 GetPage latency under 1 ms for the 80% hot set, with arithmetic; (c) a monthly cost model (S3 storage at $0.023/GB·month, GETs at $0.0004 per 1,000 requests, your chosen NVMe instance pricing) and the break-even duty cycle at which always-on provisioned replicas beat scale-to-zero; (d) the failure analysis: what happens to in-flight branches when a pageserver dies, and what invariant must the safekeepers preserve for your answer to hold? You will be graded on the precision of your invariants and the honesty of your arithmetic.

❦