- Implement a log-structured page store that materializes any page at any LSN from immutable layer files in object storage.
- Design binary record and layer-file formats whose read cost is bounded by layers touched, never by total WAL length.
- Implement copy-on-write branching whose cost is independent of database size, by treating a branch as pure metadata.
- State and defend a garbage-collection safety invariant for immutable layers shared across live branches.
- Evaluate a storage system on a latency-versus-round-trips Pareto frontier with ablations, not point estimates.
What You Are Building
You will build a pageserver: a service that ingests a write-ahead log, stores it as immutable layer files in an S3-compatible bucket, and answers one question fast — GetPage(branch, page, lsn): “give me the 8 KiB image of page p as it existed at LSN n on branch b.” On top of that single primitive you get time travel for free, and branching becomes a metadata write: a new branch is just a record saying “I fork from timeline T at LSN n.” No pages are copied. Ever.
This is the storage architecture from the Neon paper read in week 5, stripped to its load-bearing walls: no SQL, no MVCC, no multi-tenancy. The page space is flat (page_no in 0…N−1), there is exactly one writer per timeline, and the WAL stream is handed to you by our generator. Everything else — formats, layer organization, resolution, branching, GC — is yours.
Non-goals and forbidden shortcuts
The grader instruments your S3 endpoint, so these are enforced mechanically, not on the honor system:
- Forbidden: copying or rewriting any page data during
CreateBranch. Branch creation must perform O(1) object writes regardless of parent size. - Forbidden: a local-disk source of truth. Local disk is a cache only — mid-run, the grader deletes your scratch directory (“crash”) and immediately re-issues reads. They must still be correct.
- Forbidden: replaying unbounded WAL per read.
GetPagemay touch only the layers that cover the requested key and LSN range; we count GETs. - Forbidden: using S3
LISTas the authority for which layers or branches exist. Listing is for debugging; your metadata must be self-describing. - Non-goals: SQL, concurrent writers on one timeline, encryption, multi-region. Durability of WAL not yet flushed to S3 may rely on a local append-only log.
The key × LSN plane
Think of all state as points in a two-dimensional plane: page number on one axis, LSN on the other. A delta layer is a rectangle — every WAL record for a key range over an LSN range, sorted by (page, lsn). An image layer is a vertical slice — full 8 KiB images of every page in a key range as of a single LSN. Image layers exist purely to bound reads: resolution walks deltas backward only until it hits one.
agent-07 forks from main at 0x48 by writing one metadata object. The dashed path shows GetPage(p, @0x5a) resolving: newest delta on the child, hop to the parent at the fork LSN, walk parent deltas back to the image layer at 0x30, then replay forward.Wire format, bucket layout, reference API
Your implementation may use any language; the formats below are normative. All integers are little-endian. Pages are exactly 8,192 bytes.
// ---- WAL record (as emitted by walgen, little-endian) -------------
// off size field notes
// 0 8 lsn uint64, strictly increasing within a timeline
// 8 4 page_no uint32, flat page space 0..N-1
// 12 1 kind uint8: 1 = FULL_PAGE, 2 = DELTA
// 13 2 plen uint16, payload length in bytes
// 15 plen payload FULL_PAGE: 8192-byte image
// DELTA: k × { uint16 off; uint16 len; len bytes }
// 15+plen 4 crc32c over bytes [0, 15+plen); reject on mismatch
// ---- bucket layout (names normative, layer encoding is yours) -----
// s3://lab2-<team>/tl/<timeline_id>/img__<keylo>-<keyhi>__<lsn>
// full images of pages in [keylo, keyhi] as of <lsn>; indexed for
// single-GET point lookup (range GETs allowed)
// s3://lab2-<team>/tl/<timeline_id>/del__<keylo>-<keyhi>__<lo>-<hi>
// every WAL record with page in [keylo, keyhi], lsn in (lo, hi],
// sorted by (page_no, lsn)
// s3://lab2-<team>/branches/<name>.json
// authoritative branch metadata; LIST is never the source of truth
// ---- reference API (stubbed in pageserver.go / branch.go) ----------
type LSN uint64
type PageID uint32
func (ps *Pageserver) IngestWAL(ctx context.Context, tl TimelineID,
rec []byte) (LSN, error) // validate crc32c, append
func (ps *Pageserver) Flush(ctx context.Context) error
// seal open layer, PUT to S3
func (ps *Pageserver) GetPage(ctx context.Context, branch string,
p PageID, at LSN) ([]byte, error) // exactly 8192 bytes
func (ps *Pageserver) CreateBranch(ctx context.Context, name, parent string,
at LSN) (*BranchMeta, error) // O(metadata); no page I/O
func (ps *Pageserver) DeleteBranch(ctx context.Context,
name string) error // mark dead; GC reclaims
func (ps *Pageserver) RunGC(ctx context.Context, horizon, idle,
grace time.Duration) (GCReport, error) // two-phase, safe
What Is in the Starter Kit
report/REPORT.md).required headers for the ablation table, the Pareto plot, and the GC safety argumentWAL Ingestion and Page Materialization
Ingest a walgen stream into an in-memory open layer; on Flush() (or a configurable size threshold), seal it into a delta layer object and PUT it to the bucket. Reject any record with a bad CRC and return the LSN of the last durable record. Then implement GetPage at the head LSN only: locate the covering layers, replay the page’s records in ascending LSN order starting from its most recent FULL_PAGE, and return the image.
Survive the crash test: after the grader deletes local scratch state, a fresh process pointed at the same bucket must answer the same reads correctly. This forces your layer inventory to be reconstructible from durable metadata, not from a warm process.
Exit criteria: all head-LSN conformance vectors pass on seeds 1–3, before and after the simulated crash.
Time Travel: GetPage@LSN
Generalize to arbitrary LSNs and add image layers as read-bound insurance. The normative resolution algorithm, which Milestone 3 extends across timelines:
- Set
T ←the request’s timeline andL ← min(at, head_lsn(T)). - In
T’s layer map, collect delta layers whose key range containspand whose LSN range intersects(L_img, L], newest first, stopping at the newest image layer forpwithlsn ≤ L(call itL_img). - If no image layer exists in
TandThas a parent: setL ← min(L, fork_lsn(T)),T ← parent(T), go to step 2. - Replay the collected records for
pin ascending LSN order onto the image; return the page.
Write image layers during compaction on a configurable interval (every D bytes of WAL per key range). This knob is your main ablation axis in the evaluation: small D means fewer deltas per read but more bucket bytes; large D means the opposite.
Exit criteria: 1,000 sampled (page, lsn) historical reads per seed match the vectors; instrumentation shows GETs per read bounded by layers intersected, independent of total WAL length.
CREATE BRANCH in O(metadata)
CreateBranch(name, parent, at) writes exactly one object: branches/<name>.json. It must validate at ≤ the parent’s flushed LSN (a fork above durable WAL would dangle), then the new timeline starts empty and resolution hops to the parent per step 3 above. Writes to the child append to the child’s timeline only; the parent never observes them.
| Field | Type | Meaning |
|---|---|---|
branch_id | uuid | stable identity; name is a mutable alias |
parent_id | uuid | null | null only for main |
fork_lsn | LSN | reads at lsn ≤ fork_lsn resolve through the ancestor chain |
head_lsn | LSN | highest durable LSN on this timeline |
state | enum | live | dead (set by DeleteBranch; GC consumes) |
created_at, last_read_at | timestamps | inputs to the abandonment policy in Milestone 4 |
Exit criteria: branch-create p99 latency is flat (within noise) across parents of 1 GiB and 8 GiB; reads on a child below fork_lsn are byte-identical to reads on the parent at the same LSN.
GC and the 50-Branch Swarm
Immutable layers are shared down the ancestry tree, so deletion needs a real argument, not a TTL. Your report must state and defend the safety invariant — a sufficient form: a layer on timeline T may be deleted only if, for every live branch whose ancestry passes through T, every readable LSN that could resolve through that layer is already covered by a newer image layer at or below it. “Readable” is defined by your retention policy: each branch’s head, plus a PITR horizon, plus every descendant’s fork_lsn. Note that a layer entirely older than every head can still be load-bearing if it sits below a fork — see Pitfalls.
Run GC in two phases: mark candidates against a snapshot of authoritative branch metadata, wait out a grace epoch (in-flight GetPage calls may hold a stale layer map), then issue DELETEs. Branch abandonment is a policy you choose and justify — e.g. state = dead, or live with last_read_at older than a configured idle horizon.
Finally, the demo: your driver replays the walgen swarm workload, creating 50 branches from one parent and running each agent’s WAL against its own branch; 10 call DeleteBranch and 40 are abandoned without cleanup. After GC, the surviving branches must pass conformance reads, and bucket bytes attributable solely to deleted and abandoned branches must be reclaimed. Your driver prints both numbers.
Exit criteria: zero wrong reads after GC across three consecutive swarm runs; reclaimed-bytes ratio reported; safety argument written.
How the 100 Points Fall
Performance is graded on a Pareto frontier, not a point estimate: submit at least three configurations (vary the image-layer interval D; optionally cache size and layer fan-out) and we plot GetPage p50/p99 against mean S3 GETs per read. Dominated frontiers lose points; a slow-but-honest frontier beats a fast number we cannot reproduce. The ablation table in REPORT.md needs one row per configuration with columns: p50, p99, GETs/read (cold and warm), bucket bytes, branch-create p99.
| Component | Points | What we measure |
|---|---|---|
| M1 — WAL ingestion & materialization | 10 | head-LSN vectors pass; CRC rejection; crash test |
| M2 — GetPage@LSN time travel | 15 | 1,000 historical reads per seed; layers-touched bound holds |
| M3 — O(metadata) branching | 15 | create p99 flat from 1 to 8 GiB; child/parent read equivalence |
| M4 — GC + 50-branch swarm | 15 | bytes reclaimed; zero wrong reads after GC, three runs |
| Performance frontier | 20 | p50/p99 vs GETs-per-read Pareto across ≥ 3 configurations |
| Report: design, ablations, GC argument | 15 | complete ablation table; invariant stated and defended |
| Code quality & reproducibility | 10 | make demo runs everything from a clean MinIO |
Where Previous Cohorts Lost Points
CreateBranch ever reads or writes page data, you have built a slow pg_dump, and the 1-vs-8 GiB latency check will fail loudly. Branching is one JSON PUT. If your design “needs” to seed the child with images, your resolution algorithm is wrong, not your branch path.
n produces the page version readable at n; that version remains valid on [n, next). Decide once whether delta-layer ranges are (lo, hi] (we recommend it; it composes with fork_lsn) and use it everywhere. Most wrong-read bugs in this lab are a < that should be ≤ at a fork point.
ListObjects couples you to listing latency and to half-written state after a crash mid-flush. Write metadata that names every live layer, commit it atomically (PUT-then-rename pattern, or a versioned manifest), and treat unreferenced objects as garbage by definition.
fork_lsn. Any policy of the shape “delete layers older than X” is unsafe without consulting the fork-point set. Checkpoint 2.3 makes you construct the counterexample.
Checkpoint Questions
Answer in report/CHECKPOINTS.md; graded with the report. These are designed so that a correct implementation makes them easy and a confused one makes them impossible.
Timeline main holds img__0-99__10, del__0-99__10-30, img__0-99__30, del__0-99__30-48, del__0-99__48-70. Branch agent-07 forks at 0x48 and holds del__0-99__48-5e. For GetPage("agent-07", 9, 0x5a), list every object read, in order, with the timeline hop marked; give the cold-cache GET count; then repeat for at = 0x44 and explain why the child’s delta layer is not consulted.
Enumerate the exact bytes CreateBranch writes and argue from your design that the count is independent of parent size. Then: the parent’s head_lsn advances concurrently with branch creation. Why must fork_lsn be validated against the parent’s flushed LSN rather than its in-memory head, and what wrong read becomes possible if you skip that check and the parent crashes before its next flush?
A teammate proposes: “delete any delta layer whose lsn_hi is older than 24 hours of WAL behind every branch head.” Construct a concrete branch topology and read that this policy corrupts. State your safety invariant precisely, and explain why the two-phase mark/grace/delete protocol is still required even when the invariant is evaluated correctly — i.e., what failure does the grace epoch alone prevent?