DATA 2027 · Week 06 · Part II — New Access Methods & Engines

Vector Indexes Are Access Methods, Not Products

The industry sold approximate nearest-neighbor search as a new category of database. It isn’t one — it’s an index, and we treat it with B-tree rigor.

Lecture 1 — ANN Search: IVF, PQ, HNSW · Lecture 2 — DiskANN and the Recall Axis

Lecture 1 · Tuesday

ANN Search: IVF, PQ, HNSW

Three ideas underlie essentially every ANN index in production — and each trades something away.

L1 · The Query

Every agent issues the same query

Given this embedding, find the k most similar among N.
Agent memory is this query. RAG is this query.
Tool retrieval, dedup, semantic caching — same query.
Plausibly the most-executed access pattern in new systems.
Relational engines never had a native index for it.

L1 · Brute Force

The exact-kNN invoice

N = 100M vectors, d = 768 floats: 307 GB raw.
One exact query: 100M dot products ≈ 154 GFLOP (77B multiply-adds).
At ~50 GB/s bandwidth: six seconds per query.
Before you’ve served a second user.
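The invoice above can be reproduced as a back-of-envelope calculation. This is a minimal sketch that assumes float32 storage and a purely bandwidth-bound scan; the function name and its parameters are illustrative, not from any library.

```python
def exact_knn_budget(n, d, bytes_per_float=4, bandwidth_gb_s=50.0):
    """Back-of-envelope cost of one exact kNN scan (bandwidth-bound model)."""
    raw_bytes = n * d * bytes_per_float        # every vector is read once
    multiply_adds = n * d                      # one per dimension per vector
    flop = 2 * multiply_adds                   # multiply and add counted separately
    scan_seconds = raw_bytes / (bandwidth_gb_s * 1e9)
    return raw_bytes, multiply_adds, flop, scan_seconds


raw_bytes, multiply_adds, flop, scan_seconds = exact_knn_budget(100_000_000, 768)
print(f"raw storage  : {raw_bytes / 1e9:.1f} GB")
print(f"multiply-adds: {multiply_adds / 1e9:.1f} billion")
print(f"work         : {flop / 1e9:.1f} GFLOP")
print(f"scan time    : {scan_seconds:.2f} s at 50 GB/s")
```

Changing bandwidth_gb_s shows how little faster memory helps: the scan stays linear in N.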

L1 · Curse of Dimensionality

Concentration of distances

99.96%

of a unit ball’s volume at d = 768 lies in the outer 1% of radius (1 − 0.99^768). Everything is “far,” by roughly the same amount, so pruning indexes (kd-trees, R-trees) can’t exclude anything and degenerate to a scan.
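Both halves of this slide can be checked numerically. The sketch below computes the outer-shell volume fraction in closed form and then measures the farthest-to-nearest distance ratio for random points. The Gaussian sampling, the sample size, and the function names are assumptions made for illustration only.

```python
import math
import random


def outer_shell_fraction(d, shell=0.01):
    """Fraction of a d-dimensional ball's volume in the outer `shell` of radius."""
    return 1.0 - (1.0 - shell) ** d


def distance_spread(d, n=200, seed=0):
    """Ratio of farthest to nearest distance from one random query to n random points."""
    rng = random.Random(seed)
    q = [rng.gauss(0, 1) for _ in range(d)]
    dists = []
    for _ in range(n):
        p = [rng.gauss(0, 1) for _ in range(d)]
        dists.append(math.dist(q, p))
    return max(dists) / min(dists)


print(f"outer 1% shell, d=2  : {outer_shell_fraction(2):.4f}")
print(f"outer 1% shell, d=768: {outer_shell_fraction(768):.4f}")
print(f"d_max/d_min, d=2  : {distance_spread(2):.2f}")
print(f"d_max/d_min, d=768: {distance_spread(768):.2f}")
```

When the ratio sits close to 1, no bounding box or hyperplane can rule out a subtree, which is why the pruning indexes degrade to a scan.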

L1 · The Unlock

The great surrender

d_max/d_min → 1: branch-and-bound can’t prune.
A kd-tree on 768-dim data visits nearly every leaf.
So: stop demanding the true neighbor.
Accept recall@k — fraction of true top-k returned — as a dial.
Orders of magnitude appear.
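Recall@k as defined above is simple to compute once an exact top-k is available as ground truth. A minimal sketch; the function name and the toy id lists are illustrative.

```python
def recall_at_k(approx_ids, true_ids, k):
    """Fraction of the true top-k that appears in the approximate top-k."""
    true_top = set(true_ids[:k])
    found = set(approx_ids[:k]) & true_top
    return len(found) / k


true_top4 = [11, 42, 7, 93]       # exact answer, e.g. from a brute-force scan
approx_top4 = [11, 7, 42, 58]     # index answer: one true neighbor missed
print(recall_at_k(approx_top4, true_top4, k=4))
```

Order inside the top-k does not matter for this metric; only membership does.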

L1 · IVF

IVF: k-means wearing a database hat

Offline: cluster into nlist ≈ √N centroids (~10,000 for 100M).
Each vector filed in a posting list — “shard by meaning.”
Query: compare to 10,000 centroids, not 100M vectors.
Scan only the closest nprobe lists.
nprobe = 10 of 10,000 → scan ~0.1%: a 1000× reduction.
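The build and probe steps above can be sketched in a few dozen lines. This is a toy under stated assumptions, not a production index: plain Lloyd's k-means with a fixed iteration count, Euclidean distance, and a hypothetical IVFIndex class whose search also reports how many vectors it scanned.

```python
import math
import random


def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            buckets[j].append(p)
        for j, b in enumerate(buckets):
            if b:  # keep the old centroid if a cluster went empty
                centroids[j] = [sum(col) / len(b) for col in zip(*b)]
    return centroids


class IVFIndex:
    def __init__(self, vectors, nlist, seed=0):
        self.vectors = vectors
        self.centroids = kmeans(vectors, nlist, seed=seed)
        self.lists = [[] for _ in range(nlist)]          # posting lists of vector ids
        for i, v in enumerate(vectors):
            self.lists[self._nearest_centroids(v, 1)[0]].append(i)

    def _nearest_centroids(self, q, n):
        order = sorted(range(len(self.centroids)),
                       key=lambda c: math.dist(q, self.centroids[c]))
        return order[:n]

    def search(self, q, k, nprobe):
        """Scan only the nprobe closest lists; return (top-k ids, vectors scanned)."""
        cand = [i for c in self._nearest_centroids(q, nprobe) for i in self.lists[c]]
        cand.sort(key=lambda i: math.dist(q, self.vectors[i]))
        return cand[:k], len(cand)


rng = random.Random(1)
data = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(1000)]
index = IVFIndex(data, nlist=int(math.sqrt(len(data))))   # nlist ≈ √N
query = [rng.gauss(0, 1) for _ in range(8)]
exact, scanned_all = index.search(query, k=10, nprobe=len(index.centroids))
approx, scanned_few = index.search(query, k=10, nprobe=3)
print(f"scanned {scanned_few} of {scanned_all} vectors with nprobe=3")
```

Probing every list degenerates to the exact scan, which makes it a convenient ground truth for measuring recall at smaller nprobe.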

L1 · IVF

The edge problem is the dial

Queries near Voronoi boundaries miss neighbors in unprobed cells.
Recall@10: 40% at nprobe = 1, 95% at nprobe = 64.
Monotone: more probes, more recall, more time.

What IVF does not fix: lists hold raw floats.
100M × 768d is still 307 GB somewhere.
That’s the problem PQ attacks.

L1 · Product Quantization

PQ compression

32×

A 128-dim float vector (512 bytes) becomes 16 bytes: m = 16 subvectors of 8 dims, each encoded by an 8-bit codebook of 256 k-means centroids. The Cartesian-product trick: 256¹⁶ = 2¹²⁸ representable points from ≈131 KB of stored centroids. Jégou, Douze & Schmid, TPAMI 2011.
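The numbers on this slide follow from four parameters. A small sketch of the accounting, assuming float32 inputs and float32 centroids; the function name is illustrative.

```python
def pq_budget(d=128, m=16, bits=8, bytes_per_float=4):
    """Storage accounting for product quantization with m subquantizers."""
    ksub = 2 ** bits                      # centroids per sub-codebook
    dsub = d // m                         # dimensions per subvector
    raw_bytes = d * bytes_per_float
    code_bytes = m * bits // 8
    codebook_bytes = m * ksub * dsub * bytes_per_float
    representable = ksub ** m             # Cartesian product of the m codebooks
    return raw_bytes, code_bytes, codebook_bytes, representable


raw, code, codebooks, points = pq_budget()
print(f"{raw} B -> {code} B per vector ({raw // code}x)")
print(f"codebooks: {codebooks / 1000:.0f} KB, representable points: 2^{points.bit_length() - 1}")
```

The codebook cost is paid once per index, while the 16-byte code is paid per vector, so the ratio approaches 32× as N grows.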

L1 · Product Quantization

Distance without multiplies

Per query: precompute a 16 × 256 table of subspace distances.
Distance to any candidate: 16 lookups + 15 adds.
Scans become bandwidth-bound on 16-byte codes, not 512-byte vectors.
Price: distortion — PQ distances are biased estimates.
Remedy: fetch ~200 candidates cheaply, re-rank exactly. FAISS IVFADC quantizes residuals.
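The lookup-table distance and the re-rank remedy can be sketched end to end on toy data. Assumptions to note: squared Euclidean distance, tiny codebooks (m = 4, 16 centroids each) so it runs quickly in pure Python, a shortlist of 50 rather than ~200, and no residual quantization. All function names are illustrative.

```python
import random


def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=8, seed=0):
    rng = random.Random(seed)
    cents = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[min(range(k), key=lambda c: sqdist(p, cents[c]))].append(p)
        for j, b in enumerate(buckets):
            if b:
                cents[j] = [sum(col) / len(b) for col in zip(*b)]
    return cents


def train_pq(vectors, m, ksub):
    dsub = len(vectors[0]) // m
    return [kmeans([v[j * dsub:(j + 1) * dsub] for v in vectors], ksub, seed=j)
            for j in range(m)]


def encode(v, codebooks):
    dsub = len(v) // len(codebooks)
    return bytes(min(range(len(cb)), key=lambda c: sqdist(v[j * dsub:(j + 1) * dsub], cb[c]))
                 for j, cb in enumerate(codebooks))


def decode(code, codebooks):
    return [x for j, c in enumerate(code) for x in codebooks[j][c]]


def adc_table(q, codebooks):
    """m x ksub table: distance from each query subvector to each centroid."""
    dsub = len(q) // len(codebooks)
    return [[sqdist(q[j * dsub:(j + 1) * dsub], c) for c in cb]
            for j, cb in enumerate(codebooks)]


def adc_distance(table, code):
    return sum(table[j][c] for j, c in enumerate(code))   # m lookups, m - 1 adds


def search(q, codes, vectors, codebooks, k, shortlist):
    table = adc_table(q, codebooks)
    approx = sorted(range(len(codes)), key=lambda i: adc_distance(table, codes[i]))[:shortlist]
    return sorted(approx, key=lambda i: sqdist(q, vectors[i]))[:k]   # exact re-rank


rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(400)]
codebooks = train_pq(vectors, m=4, ksub=16)
codes = [encode(v, codebooks) for v in vectors]
q = [rng.gauss(0, 1) for _ in range(8)]
table = adc_table(q, codebooks)
top = search(q, codes, vectors, codebooks, k=5, shortlist=50)
print(top)
```

The table distance equals the exact distance from the query to the decoded (reconstructed) vector, not to the original one; that gap is the distortion the re-rank step repairs.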

L1 · Field Note

What an unindexed scan costs

400M × 1536-dim raw embeddings, brute-forced on GPUs.
Monthly bill exceeded the salary of the engineer who replaced it.
Nobody asked “what is the access method here?”

After IVF-PQ: 2.4 TB → 38 GB of codes.
p99: 1.8 s → 19 ms.
At recall@10 = 0.96.

L1 · HNSW

A skip list, generalized to a metric space

Fig. 6.1 — HNSW search: enter sparse top layer, greedy-walk highways, drop a layer when stalled, finish with an efSearch-wide beam on layer 0.

L1 · HNSW

Why the walk doesn’t get stuck

Each node links to ~M near neighbors (M = 16–32).
Navigability: diversity heuristic keeps long-range highways, not M redundant short edges.
Hierarchy: layer drawn geometrically, P(layer ≥ ℓ) = e^(−ℓ/m_L), with m_L = 1/ln M.
Each layer up has ~1/M the nodes — coarse hops first.
Expected logarithmic search depth survives. Malkov & Yashunin, TPAMI 2018.
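The layer draw can be simulated directly. This sketch assumes the normalization m_L = 1/ln M recommended in the HNSW paper, under which about 1/M of the nodes survive to each next layer; the function names are illustrative.

```python
import math
import random


def draw_layer(M, rng):
    """Top layer for a new node: floor(-ln(U) * m_L) with m_L = 1/ln(M)."""
    m_L = 1.0 / math.log(M)
    u = 1.0 - rng.random()            # uniform on (0, 1], avoids log(0)
    return int(-math.log(u) * m_L)


def layer_counts(n, M, seed=0):
    """How many of n nodes are present on each layer (a node lives on 0..top)."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n):
        top = draw_layer(M, rng)
        for layer in range(top + 1):
            counts[layer] = counts.get(layer, 0) + 1
    return counts


n, M = 200_000, 16
counts = layer_counts(n, M)
for layer in sorted(counts):
    print(f"layer {layer}: {counts[layer]} nodes")
```

Every node is on layer 0, and the population thins by roughly a factor of M per layer, which is what bounds the number of layers at about log_M N.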

L1 · HNSW

Two knobs govern everything

efConstruction (100–500): build-time beam.
Bigger → better edges, slower builds.

efSearch: query-time beam on layer 0.
10 → recall 0.80 in 30 µs; 200 → 0.99 in 400 µs.
It’s nprobe in graph clothes — every ANN family has one such dial.
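The query-time dial can be seen in a sketch of the layer-0 beam search. Assumptions to note: the graph here is a brute-force M-nearest-neighbor graph standing in for real HNSW construction (no hierarchy, no diversity heuristic), and the count of visited nodes stands in for latency. Function names are illustrative.

```python
import heapq
import math
import random


def build_knn_graph(vectors, M):
    """Brute-force M-nearest-neighbor graph; a stand-in for HNSW's layer 0."""
    graph = []
    for i, v in enumerate(vectors):
        order = sorted((j for j in range(len(vectors)) if j != i),
                       key=lambda j: math.dist(v, vectors[j]))
        graph.append(order[:M])
    return graph


def search_layer(vectors, graph, q, entry, ef):
    """Best-first search keeping the ef closest nodes seen so far."""
    def dist(i):
        return math.dist(q, vectors[i])

    visited = {entry}
    candidates = [(dist(entry), entry)]    # min-heap: closest unexpanded node first
    best = [(-dist(entry), entry)]         # max-heap via negation: the ef best so far
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0]:                # nothing left that can improve the beam
            break
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(nb)
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)    # drop the current worst
    results = sorted((-neg, i) for neg, i in best)
    return results, len(visited)


rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(300)]
graph = build_knn_graph(vectors, M=8)
q = [rng.gauss(0, 1) for _ in range(8)]
greedy, work_greedy = search_layer(vectors, graph, q, entry=0, ef=1)
wide, work_wide = search_layer(vectors, graph, q, entry=0, ef=32)
print(f"ef=1 : best distance {greedy[0][0]:.3f}, visited {work_greedy}")
print(f"ef=32: best distance {wide[0][0]:.3f}, visited {work_wide}")
```

With ef = 1 the search is a pure greedy walk that stops at the first local minimum; a wider beam keeps expanding past it, trading more distance computations for a better answer.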

L1 · HNSW’s Sin

Everything lives in RAM

350 GB

DRAM for one replica of 100M × 768d: 307 GB of full-precision vectors + ~13 GB of graph edges + allocator overhead. DRAM costs ~20× NVMe per byte — that gap is why Thursday’s lecture exists.
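The two large terms in that footprint can be derived from the index parameters. A rough sketch assuming float32 vectors, 4-byte neighbor ids, M = 16, and a layer-0 degree cap of 2·M (the setting suggested in the HNSW paper); upper layers and allocator overhead are left out. The function name is illustrative.

```python
def hnsw_dram_gb(n, d, M=16, bytes_per_float=4, bytes_per_link=4):
    """Approximate DRAM for full-precision vectors plus layer-0 adjacency lists."""
    vectors_gb = n * d * bytes_per_float / 1e9
    links_gb = n * 2 * M * bytes_per_link / 1e9    # assumes 2*M links per node on layer 0
    return vectors_gb, links_gb


vectors_gb, links_gb = hnsw_dram_gb(100_000_000, 768)
print(f"vectors: {vectors_gb:.1f} GB, layer-0 links: {links_gb:.1f} GB")
```

The vectors dominate: compressing the graph barely moves the total, which is why the Thursday material attacks where the vectors live rather than how the edges are stored.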

Lecture 2 · Thursday

DiskANN and the Recall Axis

The impolite question of 2019: what if the graph lived on SSD?

L2 · DiskANN

One node, one billion vectors

~5 ms

DiskANN (Microsoft Research, 2019): a single 64 GB-RAM machine with an NVMe drive serves a billion vectors at 95%+ recall. The 2019 baseline was a DRAM cluster an order of magnitude more expensive.

L2 · Vamana

A flatter graph, built for blocks

Drops HNSW’s hierarchy: one flat graph, fixed medoid entry.
α-pruning (α = 1.2) keeps aggressive long-range shortcuts.
Payoff: fewer hops — on SSD, hops are the cost model.
SSD read ~100 µs vs DRAM ~100 ns: 1000× worse.
Same B-tree lesson: access count dominates when storage is slow.
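The effect of α on which edges survive can be shown with a small sketch of the pruning rule (called RobustPrune in the DiskANN paper). It assumes Euclidean distance and a degree bound R; the function name and the one-dimensional toy points are illustrative.

```python
import math


def robust_prune(p, candidates, vectors, alpha=1.2, R=4):
    """Keep up to R out-neighbors of p, dropping candidates already 'covered'."""
    pool = sorted(set(candidates) - {p},
                  key=lambda c: math.dist(vectors[p], vectors[c]))
    kept = []
    while pool and len(kept) < R:
        star = pool.pop(0)                 # closest remaining candidate to p
        kept.append(star)
        # Drop c when star is much closer to c than p is:
        # alpha * d(star, c) <= d(p, c) means the edge p -> c is redundant.
        pool = [c for c in pool
                if alpha * math.dist(vectors[star], vectors[c])
                > math.dist(vectors[p], vectors[c])]
    return kept


# Node 0 at the origin; candidates at 1, 2 and a distant point at 10.
vectors = [[0.0], [1.0], [2.0], [10.0]]
print(robust_prune(0, [1, 2, 3], vectors, alpha=1.0))   # [1]
print(robust_prune(0, [1, 2, 3], vectors, alpha=1.2))   # [1, 3]
```

With α = 1 the nearby node at 1 covers everything behind it; raising α to 1.2 makes the test stricter, so the long edge to 10 survives and a query heading that way needs one hop instead of several.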

L2 · DiskANN

PQ steers in RAM, SSD stores the truth

Fig. 6.2 — DiskANN’s split: PQ demoted from index to steering mechanism; the exact re-rank at the end repairs its distortion.

L2 · Beam Search

Hiding SSD latency

NVMe gives huge random IOPS — but only at queue depth.
Beam width W = 4–8: issue W adjacency reads concurrently.
Cuts wall-clock hops nearly W-fold.
Representative query: ~40 I/Os ≈ 1–4 ms; PQ arithmetic free beside it.
Billion-scale, one node, five milliseconds, recall 0.95.
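A crude latency model makes the beam-width arithmetic concrete. It assumes each round of W concurrent reads costs one SSD read latency (~100 µs, the figure used earlier) and ignores CPU time and queueing; the function name is illustrative.

```python
import math


def query_latency_ms(total_ios, beam_width, read_us=100.0):
    """Wall-clock I/O time when beam_width reads are issued concurrently per round."""
    rounds = math.ceil(total_ios / beam_width)
    return rounds * read_us / 1000.0


for w in (1, 4, 8):
    print(f"W={w}: {query_latency_ms(40, w):.1f} ms for 40 I/Os")
```

Under this model 40 reads cost 4 ms one at a time and 1 ms at W = 4, which brackets the 1–4 ms range quoted above. In practice wider beams also issue some wasted reads, so the gain is somewhat less than W-fold.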

L2 · Historical Aside

The arc rhymes with 1970s hashing

Early hash tables assumed memory; databases needed disk.
Linear & extendible hashing (1979–80): re-derived under “one I/O per lookup.”
DiskANN: the same move for similarity graphs.
The access-method playbook hasn’t changed in fifty years.
Only the query has.

L2 · Updates

Graphs hate deletes

Removing a node can sever the only navigable path.
In-place edge repair under live queries: concurrency nightmare.
FreshDiskANN (2021): inserts go to a small in-RAM Vamana.
Deletes → tombstones: still traversed, filtered from results.
Background merge excises tombstones, repairs edges with α-prune.
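The insert, tombstone, and merge lifecycle can be sketched without any graph at all. In this toy, brute-force scans stand in for both the on-SSD index and the in-RAM delta, and the merge simply rebuilds the base; the edge repair with α-pruning is not modeled. The class name FreshIndex and its methods are hypothetical, not FreshDiskANN's API.

```python
import math


class FreshIndex:
    def __init__(self, base):
        self.base = dict(base)      # id -> vector; stands in for the on-SSD index
        self.delta = {}             # id -> vector; stands in for the small in-RAM index
        self.tombstones = set()     # deleted ids, not yet physically removed

    def insert(self, vid, vec):
        self.delta[vid] = vec

    def delete(self, vid):
        self.tombstones.add(vid)    # logical delete only

    def search(self, q, k):
        scored = []
        for store in (self.base, self.delta):
            for vid, vec in store.items():
                scored.append((math.dist(q, vec), vid))   # tombstoned ids are still visited
        scored.sort()
        return [vid for _, vid in scored if vid not in self.tombstones][:k]

    def merge(self):
        """Background compaction: fold the delta into the base, drop tombstoned ids."""
        merged = {**self.base, **self.delta}
        self.base = {vid: v for vid, v in merged.items() if vid not in self.tombstones}
        self.delta = {}
        self.tombstones = set()


idx = FreshIndex({1: [0.0, 0.0], 2: [1.0, 0.0], 3: [5.0, 5.0]})
idx.insert(4, [0.1, 0.0])
idx.delete(1)
before = idx.search([0.0, 0.0], k=2)
idx.merge()
after = idx.search([0.0, 0.0], k=2)
print(before, after)
```

Query results are identical before and after the merge; only the physical layout changes, which is the property that lets the merge run in the background.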

L2 · Convergence

Read it back slowly: that’s an LSM-tree

Write-optimized RAM delta + deletion markers + background compaction.
Lucene/Elasticsearch: HNSW per immutable segment, rebuild at merge.
pgvector: Postgres heap pages, MVCC versions, vacuum reclaims.
Three independent codebases, one shape: immutable units + tombstones + merge.
When every lineage converges, it’s physics, not fashion.

L2 · The Recall Axis

RUM grows a fourth axis

RUM: balance Read, Update, Memory — optimize two, pay the third.
ANN adds Recall: a knob turned at query time.
Same index: 0.80 at 30 µs or 0.99 at 400 µs — one integer apart.
A vector index is a curve, not a point.
Comparing systems at unstated recall: the benchmark fraud of our decade.

L2 · The Recall Axis

Recall is a budget, per query

Agent re-ranking 50 candidates via LLM: buy cheap recall-0.85, let the model absorb noise.
Exact-match memory lookup (“already filed this ticket?”): pay for 0.99+.

Per-query targets are an API surface.
Agents already set temperature.
min_recall is the same kind of dial.
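One way such a dial could be served is by mapping min_recall to a search parameter through an offline calibration curve. The sketch below is hypothetical: pick_ef_search and CALIBRATION are invented names, and the curve holds only the two efSearch operating points quoted in these notes. A real deployment would measure many more points on its own data.

```python
# (efSearch, measured recall, latency in microseconds); the two points quoted in these notes.
CALIBRATION = [(10, 0.80, 30), (200, 0.99, 400)]


def pick_ef_search(min_recall, curve=CALIBRATION):
    """Cheapest calibrated setting whose measured recall meets the target."""
    for ef, recall, latency_us in sorted(curve):
        if recall >= min_recall:
            return ef, latency_us
    raise ValueError("no calibrated setting reaches the requested recall")


print(pick_ef_search(0.80))   # noise-tolerant re-ranking workload
print(pick_ef_search(0.99))   # exact-match memory lookup
```

With only two calibrated points, any target above 0.80 falls through to the expensive setting, which is an argument for calibrating the curve densely.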

L2 · Three Families

Pick your trade

Property	IVF-PQ	HNSW	DiskANN
Memory, 100M × 768d	~3–6 GB codes	~350 GB DRAM	~6–10 GB RAM + ~340 GB SSD
QPS / latency	High QPS, ~1–10 ms	10⁴–10⁵ QPS, 0.1–1 ms	~10³ QPS, 2–10 ms
Recall	Needs re-rank above ~0.95	0.99+ via efSearch	0.95–0.99 with re-rank
Updates	Appends easy; retrain on drift	Deletes poison → segment + rebuild	RAM delta + tombstones + merge
Use when	Huge N, tight memory	Latency is king, DRAM paid for	Billion-scale; cost is king

L2 · The Agent Angle

Vector-as-feature beat vector-as-product

2021–23: VC funded ANN as a database category.
By 2025: pgvector, Elasticsearch, MongoDB, Redis, SQLite, every warehouse.
Agent retrieval never arrives alone: similarity joined with predicates, freshness, transactions.
Filtered ANN is open research: filter out 99% and the graph disconnects.
Solving it needs a query planner — B-trees were never a company.

Recall is the fourth resource. You don’t maximize it — you budget it, query by query, like CPU.

— Week 6 lecture notes, DATA 2027

L2 · Checkpoint

Discussion questions

For 1B × 768d float32: raw storage, HNSW DRAM footprint (M = 32), and DiskANN’s PQ RAM at 32 B/vector — do the budgets by hand.
Where does the no-re-rank IVF-PQ recall curve plateau, and which PQ error term explains the ceiling?
Filtered ANN: as label selectivity drops 50% → 0.1%, where do post-filter and pre-filter cross — and what would an optimizer need to choose automatically?

L2 · Readings

Read before Thursday

Product Quantization for Nearest Neighbor Search — Jégou, Douze & Schmid, TPAMI 2011. Work §III until the 256¹⁶ trick feels obvious.
Efficient and Robust ANN Search Using HNSW Graphs — Malkov & Yashunin, TPAMI 2018. Focus on Alg. 4 and layer assignment; note the memory model.
DiskANN: Billion-point Search on a Single Node — Subramanya et al., NeurIPS 2019. Read for systems decisions: why α > 1, why PQ steers while SSD stores.