DATA 2027 DATA SYSTEMS IN THE AGENTIC ERA ← SCHEDULE
Lab 2 · Weeks 5–8

Mini-Neon — Copy-on-Write Pages over Object Storage

Neon rebuilt Postgres storage as a log-structured index over S3; you will do the same in miniature. By the end of week 8 your pageserver hands 50 concurrent agents their own writable branch of one database — and cleans up after the 40 that wander off.

Out: Fri, week 5 · Due: Fri, week 8, 23:59  ·  Weight: 10% of course grade (¼ of the 40% lab component) · Teams: pairs  ·  Language: any with an S3 client (Go skeleton provided)
Learning objectives — after this lab you can…
  • Implement a log-structured page store that materializes any page at any LSN from immutable layer files in object storage.
  • Design binary record and layer-file formats whose read cost is bounded by layers touched, never by total WAL length.
  • Implement copy-on-write branching whose cost is independent of database size, by treating a branch as pure metadata.
  • State and defend a garbage-collection safety invariant for immutable layers shared across live branches.
  • Evaluate a storage system on a latency-versus-round-trips Pareto frontier with ablations, not point estimates.
Overview & Goals

What You Are Building

You will build a pageserver: a service that ingests a write-ahead log, stores it as immutable layer files in an S3-compatible bucket, and answers one question fast — GetPage(branch, page, lsn): “give me the 8 KiB image of page p as it existed at LSN n on branch b.” On top of that single primitive you get time travel for free, and branching becomes a metadata write: a new branch is just a record saying “I fork from timeline T at LSN n.” No pages are copied. Ever.

This is the storage architecture from the Neon paper read in week 5, stripped to its load-bearing walls: no SQL, no MVCC, no multi-tenancy. The page space is flat (page_no in 0…N−1), there is exactly one writer per timeline, and the WAL stream is handed to you by our generator. Everything else — formats, layer organization, resolution, branching, GC — is yours.

A branch is not a copy. It is a name for an LSN you promise not to garbage-collect.

Non-goals and forbidden shortcuts

The grader instruments your S3 endpoint, so these are enforced mechanically, not on the honor system:

The key × LSN plane

Think of all state as points in a two-dimensional plane: page number on one axis, LSN on the other. A delta layer is a rectangle — every WAL record for a key range over an LSN range, sorted by (page, lsn). An image layer is a vertical slice — full 8 KiB images of every page in a key range as of a single LSN. Image layers exist purely to bound reads: resolution walks deltas backward only until it hits one.

delta layer (WAL over an LSN range) image layer (full pages @ one LSN) timeline: main CREATE BRANCH agent-07 @ 0x48 timeline: agent-07 GetPage(p, @0x5a) LSN → 0x10 0x30 0x48 0x5a 0x70
Fig. 1 — The layer map. Vertical red bars are image layers; outlined rectangles are delta layers. agent-07 forks from main at 0x48 by writing one metadata object. The dashed path shows GetPage(p, @0x5a) resolving: newest delta on the child, hop to the parent at the fork LSN, walk parent deltas back to the image layer at 0x30, then replay forward.

Wire format, bucket layout, reference API

Your implementation may use any language; the formats below are normative. All integers are little-endian. Pages are exactly 8,192 bytes.

// ---- WAL record (as emitted by walgen, little-endian) -------------
// off      size  field     notes
// 0        8     lsn       uint64, strictly increasing within a timeline
// 8        4     page_no   uint32, flat page space 0..N-1
// 12       1     kind      uint8: 1 = FULL_PAGE, 2 = DELTA
// 13       2     plen      uint16, payload length in bytes
// 15       plen  payload   FULL_PAGE: 8192-byte image
//                          DELTA: k × { uint16 off; uint16 len; len bytes }
// 15+plen  4     crc32c    over bytes [0, 15+plen); reject on mismatch

// ---- bucket layout (names normative, layer encoding is yours) -----
// s3://lab2-<team>/tl/<timeline_id>/img__<keylo>-<keyhi>__<lsn>
//   full images of pages in [keylo, keyhi] as of <lsn>; indexed for
//   single-GET point lookup (range GETs allowed)
// s3://lab2-<team>/tl/<timeline_id>/del__<keylo>-<keyhi>__<lo>-<hi>
//   every WAL record with page in [keylo, keyhi], lsn in (lo, hi],
//   sorted by (page_no, lsn)
// s3://lab2-<team>/branches/<name>.json
//   authoritative branch metadata; LIST is never the source of truth

// ---- reference API (stubbed in pageserver.go / branch.go) ----------
type LSN uint64
type PageID uint32

func (ps *Pageserver) IngestWAL(ctx context.Context, tl TimelineID,
        rec []byte) (LSN, error)               // validate crc32c, append
func (ps *Pageserver) Flush(ctx context.Context) error
                                               // seal open layer, PUT to S3
func (ps *Pageserver) GetPage(ctx context.Context, branch string,
        p PageID, at LSN) ([]byte, error)      // exactly 8192 bytes
func (ps *Pageserver) CreateBranch(ctx context.Context, name, parent string,
        at LSN) (*BranchMeta, error)           // O(metadata); no page I/O
func (ps *Pageserver) DeleteBranch(ctx context.Context,
        name string) error                     // mark dead; GC reclaims
func (ps *Pageserver) RunGC(ctx context.Context, horizon, idle,
        grace time.Duration) (GCReport, error) // two-phase, safe
Provided Materials

What Is in the Starter Kit

1
lab2-starter repository.Go skeleton — types.go and storage.go complete, pageserver.go and branch.go stubbed — plus the MinIO docker one-liner in the README
2
walgen — deterministic WAL generator.seeded streams; seeds 1–9 are the grading seeds; emits records in the wire format above
3
Conformance vectors.(seed, page, lsn) → sha256 of the expected page image; 1,000 sampled historical reads per seed
4
Swarm workload (via walgen).main.wal plus agent-001…agent-050.wal and plan.json; your demo driver branches 50 agents from one parent, deletes 10, abandons 40
5
Report format (you write report/REPORT.md).required headers for the ablation table, the Pareto plot, and the GC safety argument
Milestone 1 · Week 5

WAL Ingestion and Page Materialization

Ingest a walgen stream into an in-memory open layer; on Flush() (or a configurable size threshold), seal it into a delta layer object and PUT it to the bucket. Reject any record with a bad CRC and return the LSN of the last durable record. Then implement GetPage at the head LSN only: locate the covering layers, replay the page’s records in ascending LSN order starting from its most recent FULL_PAGE, and return the image.

Survive the crash test: after the grader deletes local scratch state, a fresh process pointed at the same bucket must answer the same reads correctly. This forces your layer inventory to be reconstructible from durable metadata, not from a warm process.

Exit criteria: all head-LSN conformance vectors pass on seeds 1–3, before and after the simulated crash.

Milestone 2 · Week 6

Time Travel: GetPage@LSN

Generalize to arbitrary LSNs and add image layers as read-bound insurance. The normative resolution algorithm, which Milestone 3 extends across timelines:

  1. Set T ← the request’s timeline and L ← min(at, head_lsn(T)).
  2. In T’s layer map, collect delta layers whose key range contains p and whose LSN range intersects (L_img, L], newest first, stopping at the newest image layer for p with lsn ≤ L (call it L_img).
  3. If no image layer exists in T and T has a parent: set L ← min(L, fork_lsn(T)), T ← parent(T), go to step 2.
  4. Replay the collected records for p in ascending LSN order onto the image; return the page.

Write image layers during compaction on a configurable interval (every D bytes of WAL per key range). This knob is your main ablation axis in the evaluation: small D means fewer deltas per read but more bucket bytes; large D means the opposite.

Exit criteria: 1,000 sampled (page, lsn) historical reads per seed match the vectors; instrumentation shows GETs per read bounded by layers intersected, independent of total WAL length.

Milestone 3 · Week 7

CREATE BRANCH in O(metadata)

CreateBranch(name, parent, at) writes exactly one object: branches/<name>.json. It must validate at ≤ the parent’s flushed LSN (a fork above durable WAL would dangle), then the new timeline starts empty and resolution hops to the parent per step 3 above. Writes to the child append to the child’s timeline only; the parent never observes them.

FieldTypeMeaning
branch_iduuidstable identity; name is a mutable alias
parent_iduuid | nullnull only for main
fork_lsnLSNreads at lsn ≤ fork_lsn resolve through the ancestor chain
head_lsnLSNhighest durable LSN on this timeline
stateenumlive | dead (set by DeleteBranch; GC consumes)
created_at, last_read_attimestampsinputs to the abandonment policy in Milestone 4

Exit criteria: branch-create p99 latency is flat (within noise) across parents of 1 GiB and 8 GiB; reads on a child below fork_lsn are byte-identical to reads on the parent at the same LSN.

Milestone 4 · Week 8

GC and the 50-Branch Swarm

Immutable layers are shared down the ancestry tree, so deletion needs a real argument, not a TTL. Your report must state and defend the safety invariant — a sufficient form: a layer on timeline T may be deleted only if, for every live branch whose ancestry passes through T, every readable LSN that could resolve through that layer is already covered by a newer image layer at or below it. “Readable” is defined by your retention policy: each branch’s head, plus a PITR horizon, plus every descendant’s fork_lsn. Note that a layer entirely older than every head can still be load-bearing if it sits below a fork — see Pitfalls.

Run GC in two phases: mark candidates against a snapshot of authoritative branch metadata, wait out a grace epoch (in-flight GetPage calls may hold a stale layer map), then issue DELETEs. Branch abandonment is a policy you choose and justify — e.g. state = dead, or live with last_read_at older than a configured idle horizon.

Finally, the demo: your driver replays the walgen swarm workload, creating 50 branches from one parent and running each agent’s WAL against its own branch; 10 call DeleteBranch and 40 are abandoned without cleanup. After GC, the surviving branches must pass conformance reads, and bucket bytes attributable solely to deleted and abandoned branches must be reclaimed. Your driver prints both numbers.

Exit criteria: zero wrong reads after GC across three consecutive swarm runs; reclaimed-bytes ratio reported; safety argument written.

Grading Rubric

How the 100 Points Fall

Performance is graded on a Pareto frontier, not a point estimate: submit at least three configurations (vary the image-layer interval D; optionally cache size and layer fan-out) and we plot GetPage p50/p99 against mean S3 GETs per read. Dominated frontiers lose points; a slow-but-honest frontier beats a fast number we cannot reproduce. The ablation table in REPORT.md needs one row per configuration with columns: p50, p99, GETs/read (cold and warm), bucket bytes, branch-create p99.

ComponentPointsWhat we measure
M1 — WAL ingestion & materialization10head-LSN vectors pass; CRC rejection; crash test
M2 — GetPage@LSN time travel151,000 historical reads per seed; layers-touched bound holds
M3 — O(metadata) branching15create p99 flat from 1 to 8 GiB; child/parent read equivalence
M4 — GC + 50-branch swarm15bytes reclaimed; zero wrong reads after GC, three runs
Performance frontier20p50/p99 vs GETs-per-read Pareto across ≥ 3 configurations
Report: design, ablations, GC argument15complete ablation table; invariant stated and defended
Code quality & reproducibility10make demo runs everything from a clean MinIO
Pitfalls

Where Previous Cohorts Lost Points

Pitfall · the copy trap If CreateBranch ever reads or writes page data, you have built a slow pg_dump, and the 1-vs-8 GiB latency check will fail loudly. Branching is one JSON PUT. If your design “needs” to seed the child with images, your resolution algorithm is wrong, not your branch path.
Pitfall · LSN boundary semantics A record at LSN n produces the page version readable at n; that version remains valid on [n, next). Decide once whether delta-layer ranges are (lo, hi] (we recommend it; it composes with fork_lsn) and use it everywhere. Most wrong-read bugs in this lab are a < that should be at a fork point.
Pitfall · LIST is not a catalog Driving correctness from ListObjects couples you to listing latency and to half-written state after a crash mid-flush. Write metadata that names every live layer, commit it atomically (PUT-then-rename pattern, or a versioned manifest), and treat unreferenced objects as garbage by definition.
Pitfall · GC below a fork A layer can be older than every branch head and still be the only path to pages a child reads below its fork_lsn. Any policy of the shape “delete layers older than X” is unsafe without consulting the fork-point set. Checkpoint 2.3 makes you construct the counterexample.
Pitfall · caches hiding your GETs A warm local cache makes any design look great. The frontier requires cold and warm columns separately, and the grader counts GETs at the MinIO proxy — not in your code. Benchmark from a cleared cache or your numbers will not reproduce.
Checkpoints

Checkpoint Questions

Answer in report/CHECKPOINTS.md; graded with the report. These are designed so that a correct implementation makes them easy and a confused one makes them impossible.

Checkpoint 2.1 · resolution trace

Timeline main holds img__0-99__10, del__0-99__10-30, img__0-99__30, del__0-99__30-48, del__0-99__48-70. Branch agent-07 forks at 0x48 and holds del__0-99__48-5e. For GetPage("agent-07", 9, 0x5a), list every object read, in order, with the timeline hop marked; give the cold-cache GET count; then repeat for at = 0x44 and explain why the child’s delta layer is not consulted.

Checkpoint 2.2 · the cost of a branch

Enumerate the exact bytes CreateBranch writes and argue from your design that the count is independent of parent size. Then: the parent’s head_lsn advances concurrently with branch creation. Why must fork_lsn be validated against the parent’s flushed LSN rather than its in-memory head, and what wrong read becomes possible if you skip that check and the parent crashes before its next flush?

Checkpoint 2.3 · breaking naive GC

A teammate proposes: “delete any delta layer whose lsn_hi is older than 24 hours of WAL behind every branch head.” Construct a concrete branch topology and read that this policy corrupts. State your safety invariant precisely, and explain why the two-phase mark/grace/delete protocol is still required even when the invariant is evaluated correctly — i.e., what failure does the grace epoch alone prevent?

Provided Materials

Starter files

README.mdarchitecture recap (safekeeper → pageserver → S3), MinIO docker one-liner, provided-vs-TODO map, and the 50-branch demo workflow
go.modmodule data2027/lab2 (Go 1.22, minio-go v7); run `go mod tidy` once, then `go build ./...` compiles out of the box
types.gocomplete & normative: WalRecord encode/decode with crc32c, delta-span codec, LayerFileMeta + object-key grammar, BranchMeta — do not change
storage.gocomplete: thin ObjectStore (Put/Get/GetRange/List/Delete), in-memory fake with GET/PUT counters for tests, MinIO implementation
pageserver.goyour M1–M2: IngestWAL, Flush, GetPage@LSN with the timeline-ancestry walk spelled out in comments; materializePage stubbed
branch.goyour M3–M4: CreateBranch in O(metadata), DeleteBranch, and two-phase mark/grace/sweep RunGC with the safety contract in doc comments
cmd/walgen/main.gocomplete deterministic WAL generator (seeded SplitMix64): emits main.wal, 50 branch streams forking at random LSNs, and plan.json