DATA 2027 DATA SYSTEMS IN THE AGENTIC ERA ← SCHEDULE
Lab 1 · Weeks 2–4

VLSM — an LSM-Tree with Vector Segments

You will take a working key–value LSM engine and teach it to answer approximate nearest-neighbor queries — without giving up its write path. The deliverable is not a demo; it is a measured frontier.

Out: Friday, Week 2  ·  Due: Friday, Week 4, 23:59  ·  Teams: pairs  ·  Weight: 10% of course grade (¼ of the 40% lab component)  ·  Language: Rust (skeleton provided)
Learning objectives — after this lab you can…
  • Design and document a binary on-disk format with explicit alignment, endianness, and checksum rules — and write a reader that rejects corrupt input.
  • Implement HNSW construction and beam search from the paper, and reason about its M/ef knobs as memory–recall–latency trade-offs.
  • Extend leveled compaction so that an auxiliary index (a proximity graph) is merged, not rebuilt, and quantify what that buys you.
  • Build a benchmark harness for a mixed agentic workload and report results as Pareto frontiers and ablation tables rather than point estimates.
  • Argue, with your own numbers, where the LSM write path and the ANN read path fight each other — and which knob resolves each fight.
Overview & Goals

What You Are Building

Agents read and write the same store in the same breath: an upsert of a tool result is followed, milliseconds later, by a semantic search over everything written so far. Classical vector databases assume a mostly-static corpus and rebuild indexes offline; classical LSM-trees ingest fast but cannot answer “what is near this embedding?” In this lab you build VLSM, an LSM engine in which every SSTable carries a vector segment: a block of scalar-quantized embeddings plus a serialized HNSW graph over them. Searches fan out across the memtable and every level’s graphs; compaction merges graphs instead of throwing them away.

The lab has four milestones. Milestone 1 is a file format. Milestone 2 is an index. Milestone 3 is the heart of the lab — compaction that preserves index work. Milestone 4 turns your engine into evidence: a harness, a frontier, and an ablation table. Each milestone has an acceptance test in the grading suite (cargo test --features grading, distributed at the end of Week 2); passing it is necessary but not sufficient for full credit.

We grade the frontier, not the point. A system that can’t trade memory for recall isn’t tuned — it’s stuck.

Fix these workload parameters unless a milestone says otherwise: vector dimension D = 768 (f32 source vectors), cosine distance over L2-normalized inputs (so you may implement it as negative dot product), k = 10 for all recall figures, and the synthetic agent trace produced by the provided generator with seed 2027. “Recall@10” always means recall against exact f32 brute-force search over the live (non-deleted, latest-version) rows.

Provided Materials

The Skeleton

Clone data2027/vlsm-skeleton. It is a small but real engine — roughly 1,000 lines — and it already passes its own key–value test suite. You get:

The public surface you extend, abridged:

// src/lib.rs — provided; the memtable implements it, your Db must too
pub trait Engine {
    fn put(&mut self, key: Key, value: Value, vector: Option<Vector>)
        -> Result<Option<DocId>>;
    fn get(&self, key: &[u8]) -> Result<Option<Value>>;
    fn scan(&self, start: &[u8], end: &[u8]) -> Result<Vec<(Key, Value)>>;
    fn knn(&self, query: &Vector, k: usize, ef: usize)
        -> Result<Vec<(DocId, f32)>>;
}

// src/sstable.rs — writer provided (codec 0)
impl SstWriter {
    pub fn new(path: impl Into<PathBuf>, dim: usize) -> SstWriter;
    pub fn add(&mut self, key: &[u8], value: &[u8],
               vector: Option<(DocId, &Vector)>) -> Result<()>;
    pub fn add_tombstone(&mut self, key: &[u8]) -> Result<()>;
    pub fn finish(self) -> Result<SstMeta>;
}

// src/sstable.rs — YOU WRITE (Milestones 1–2): the todo!() stubs
impl SstReader {
    pub fn open(path: &Path) -> Result<SstReader>;
    pub fn get(&self, key: &[u8]) -> Result<Option<Value>>;
    pub fn scan(&self, start: &[u8], end: &[u8]) -> Result<Vec<(Key, Option<Value>)>>;
    pub fn vector(&self, ordinal: u32) -> Result<Vector>;
    pub fn knn(&self, query: &Vector, k: usize, ef: usize)
        -> Result<Vec<(DocId, f32)>>;                   // Milestone 2
}

// new modules — YOU WRITE (Milestones 2–3): HNSW build/search/serialize,
// a Db (memtable + levels) implementing Engine, and graph-merging
// compaction, e.g.
pub fn merge_graphs(srcs: &[&SstReader], live: &DocIdFilter,
                    params: HnswParams) -> (SstWriter, MergeStats);

DocId is a u64, assigned sequentially in put order, one per vectored version; each segment’s row-id map takes a graph ordinal to its DocId. The workload generator pre-assigns the same sequence in Op::Upsert::doc_id — replay the trace in order, exactly once, and ground-truth ids match engine ids with no translation table.

Milestone 1

The Vector Segment Format

Every .sst file holds six sections — header, key block, vector codes, codebook, row-id map, HNSW adjacency — followed by a 64-byte footer at EOF that points backward at all of them. All integers are little-endian. Every section starts at a 64-byte-aligned offset (pad with zeros), so the reader can hand out aligned zero-copy slices from the mmap. The layout:

sstable-000042.sst header 32 B key block vector sections (this lab) footer 64 B vector codes N×D · codec 0/1/2 codebook scale/bias · PQ row-id map N × u64 HNSW CSR graph every section 64-B aligned · all integers little-endian footer = 6 × u64 section offsets + u32 crc32c + u32 'VSFT' + u64 pad HNSW: u32 num_layers, u32 entry_point, then per-layer CSR — u32 node_count, u32 offsets[n+1], u32 neighbors[] reader is zero-copy: cast mmap slices, never deserialize into owned structs
Fig. 1 — VLSM on-disk layout. Top: an SSTable with the vector sections between key block and footer. Bottom: the vector sections expanded. The footer is written last and points backward at every section, so a reader seeks to it first.
SectionFieldType / sizeMeaning
header (32 B)magic[u8; 8]ASCII “VSST0001”; reject anything else
versionu321
dimu32vector dimension D (768 here)
entry_countu32rows in the key block
vector_countu32number of vectors N (may be 0)
codecu80 = raw f32, 1 = per-dimension i8 scalar quant, 2 = PQ
reserved7 B, zerofuture codecs; readers must ignore
key blockentry_count recordsper record: key_len: u32 · val_len: u32 · flags: u32 (bit0 = tombstone) · vec_ordinal: u32 (0xFFFF_FFFF = no vector), then key and value bytes; keys strictly increasing
codesN × D codesrow-major vectors; codec 0 = f32, codec 1 = i8 — code c decodes as scale[j]·c + bias[j]; codec 2 = PQ u8
codebookm_pq: u32 · k_pq: u32 + paramscodec 1: D × f32 scales then D × f32 biases, computed over this segment’s vectors; codec 2: PQ centroids
row-id mapN × u64graph ordinal → DocId; sorted by ordinal
HNSWsee Fig. 1empty in Milestone 1 (num_layers = 0); filled in Milestone 2
footer (64 B)6 × u64 + u32 crc32c + u32 magic “VSFT” + u64 padsection offsets from file start; CRC covers header through HNSW

Acceptance: round-trip property tests pass (write → read → decode within quantization error); the reader returns Err(Corrupt), never panics or UB, on the provided corpus of 200 mutated files; brute-force search over decoded i8 codes achieves recall@10 ≥ 0.95 against f32 brute force on the 100k-vector sample set.

Milestone 2

Per-Level HNSW: Build and Search

Implement HNSW (Malkov & Yashunin, 2018 — assigned in Week 2) from scratch. Construction runs at flush and at the end of compaction, over the quantized codes; drive it from the provided HnswParams { m, ef_construction, ef_search } and serialize the result into the segment’s HNSW section as per-layer CSR adjacency (Fig. 1). Distances during build and search are computed directly on i8 codes using the segment’s scale/bias — do not decode whole vectors to f32 buffers.

Db::knn must consult every live source: brute-force over the memtable, then beam search (width ef) in each level’s segment graphs, merging candidates in a single global heap. A candidate whose DocId is shadowed by a newer version or a tombstone above it must be filtered before it can occupy one of the k result slots. Optionally re-rank the final 4k candidates with f32 vectors — keep this behind a flag; it is one of your ablations.

Acceptance: on a 1M-vector single-segment fixture with m = 16, ef = 64: recall@10 ≥ 0.90 with mean query latency ≤ 2 ms on the grading machine (8 cores, AVX2); construction ≤ 8 minutes; the serialized graph reloads byte-identically.

Milestone 3

Graph-Merging Compaction

This is where VLSM differs from “LSM plus a vector index bolted on.” When compaction merges SSTables S₁…Sₙ into new output tables, the naive move is to rebuild each output graph from scratch — O(N log N) distance computations you already paid for once. Implement merge_graphs instead:

Acceptance: after the scripted compaction storm in tests/m3_storm.rs (50 compactions, 30% deletes), recall@10 of merged graphs is within 0.02 of rebuilt-from-scratch graphs at equal ef, and total compaction CPU time is at least 3× lower than the rebuild baseline — which you must also implement, behind --features rebuild-baseline, because it is the control arm of your ablation.

Milestone 4

Benchmark Harness and Pareto Report

The provided generator (workload.rs) emits a synthetic agent trace: sessions that interleave upserts (tool results, ~20% of ops), range scans (~10%), and vector searches (~70%) with Zipfian skew over both keys and query templates. Build a harness that replays it at a fixed arrival rate while sampling four axes: search latency p99, sustained update throughput, resident memory (RSS, plus your own accounting of graph bytes), and recall@10 (sampled: every 50th search is also answered by the brute-force oracle).

Sweep at least three knobs — m ∈ {8, 16, 32}, ef ∈ {16, 64, 256}, and re-ranking on/off at minimum; codec and level fan-out are encouraged — and plot the Pareto frontier in the recall–latency plane with memory encoded as mark size, plus one ablation table isolating each knob’s marginal effect with the others held at defaults. The report (≤ 6 pages, PDF) must state the hardware, the seed, and the exact commit; every figure must be regenerable by make report.

Grading

Rubric

ComponentPointsWhat we look at
Correctness40Milestone acceptance tests (10 each): format round-trip and corruption handling; HNSW recall/latency floor; merge-vs-rebuild recall gap and CPU ratio; harness oracle agreement. Tests run on our machine, fresh clone, cargo test --features grading.
Performance frontier30Your frontier’s dominance region against the class. Full credit requires a frontier — at least 6 non-dominated configurations — not a single tuned point. A fast system with recall 0.6 and a slow one with recall 0.99 both earn less than a genuine trade-off curve.
Report20Pareto plot correctness, ablation table, honest negative results, reproducibility (make report works). Claims must cite your own measurements.
Code quality10No unsafe outside the SstReader cast paths (each with a safety comment); no panics on user input; clippy-clean; the format documented in FORMAT.md well enough for a stranger to write an independent reader.

Late policy: 10% per day, 3 days maximum. One milestone may be designated “best effort” in your report for a 5-point cap reduction instead of failing its tests silently — honesty is cheaper than hope.

Pitfalls

Where Last Year’s Cohorts Bled

Pitfall · the accidental rebuildIf your “merge” inserts every node of every source graph into a fresh graph, you have written the rebuild baseline twice and the merge zero times. The 3× CPU acceptance test exists precisely to catch this. Instrument distance evaluations per compaction from day one.
Pitfall · unaligned zero-copy castsCasting an mmap slice to &[u32] at an unaligned offset is undefined behavior that often “works” on x86 until it doesn’t. The 64-byte section alignment in the spec is not decorative — validate alignment in the reader and add a miri test for the cast paths.
Pitfall · grading recall against the wrong oracleRecall measured against brute force over quantized codes, or over a corpus that still contains tombstoned rows, will flatter you by several points and fail our oracle check. Ground truth is exact f32 over live rows only.
Pitfall · tombstones poisoning the beamIf shadowed rows are filtered after the result heap fills, deletes silently shrink your effective ef and recall decays as the trace ages. Filter at candidate-expansion time, and keep dropped-but-repaired nodes usable for routing until the next compaction if you want the easy fix.
Ground Rules

What You May Not Do

Checkpoints

Checkpoint Questions

Answer in ≤ 1 page each, in CHECKPOINTS.md, before the milestone they accompany. They are graded inside the report component.

Checkpoint 1.1 · due with Milestone 2

Why per-level graphs instead of one global mutable HNSW over the whole store? Give the write-amplification argument from SSTable immutability, then compute: with level fan-out 10, L0 holding up to 4 files, and 100M vectors at steady state, how many graphs does one search touch in the worst case, and how does expected search cost scale with store size compared to a single graph? State your assumptions about per-graph search cost.

Checkpoint 1.2 · due with Milestone 2

Derive a worst-case bound on the error of a dot product computed over per-dimension i8 scalar-quantized vectors with affine decode scale[j]·c + bias[j], as a function of D and the per-dimension value ranges. Using your Milestone 1 measurements, explain when f32 re-ranking of the top 4k candidates recovers essentially all lost recall — and construct (or describe) a data distribution where it cannot.

Checkpoint 1.3 · due with Milestone 3

In merge_graphs, how do you choose the entry point and maximum layer of the merged graph, and why is naively keeping the largest source’s entry point risky after heavy deletes? Then give an asymptotic cost comparison of merge (edge repair + minority insertion) versus full rebuild as a function of the surviving fraction s and the size ratio between sources, and predict — before running the storm test — the s at which rebuilding wins. Check the prediction against your measurements and report the discrepancy.

Provided Materials

Starter files

README.mdRepo layout, build/run instructions, provided-vs-TODO map, and the Pareto CSV submission schema
Cargo.tomlCrate manifest — rand, serde, csv only; the ANN-crate ban is enforced from here
src/lib.rsCore types, the Engine trait (put/get/scan/knn), EngineConfig/HnswParams, error enum, ParetoRow
src/memtable.rsComplete BTreeMap memtable with a DocId→Vector side-table, brute-force knn, and the flush iterator
src/sstable.rsNormative on-disk byte layout (header, key block, codes, PQ codebook, HNSW CSR, footer), working writer, reader stubs for Milestone 1
src/workload.rsDeterministic seeded agent trace — 70% Zipfian-repeat knn, 20% upserts, 10% scans — with brute-force ground truth per query
src/bin/bench.rsSmoke harness: replays the trace against the memtable (recall must print 1.0) and emits a schema-correct pareto.csv row