Week 3: One Size Fits None

Learning objectives — after this week you can…

Reconstruct Stonebraker’s “one size fits all” argument and apply its method — workload analysis first, architecture second — to a new workload class.
Compute compression ratios for RLE, dictionary, and bit-packed encodings on a concrete column, and explain when each wins.
Explain late materialization and C-Store projections, and trace which columns a given query actually touches.
Quantify the per-tuple overhead of the Volcano iterator model and explain how ~1000-value vectors amortize it while staying cache-resident.
Argue, with numbers, why scan-dominant speculative agent analytics favors columnar formats (Parquet/Arrow) plus vectorized engines with cheap startup.

Lecture 1 · Tuesday

The Column-Store Argument

Michael Stonebraker’s 2005 ICDE paper “One Size Fits All: An Idea Whose Time Has Come and Gone” is a polemic, and we should read it as one — not for its specific predictions (some aged badly) but for its method. The argument runs: commercial RDBMSs of the era were all descendants of System R, tuned for the workload of 1985 — OLTP, write-heavy, short transactions, terminals operated by humans. By 2005 the actual workload population had diversified: stream processing, scientific arrays, text, warehousing. Stonebraker’s move was to take one workload class, design an engine for it alone, and measure the gap against the general-purpose incumbent. For warehousing, that purpose-built engine was C-Store, and the gap wasn’t 20% — it was one to two orders of magnitude. When a specialized design beats the generalist by 50×, the generalist isn’t a compromise; it’s a category error. That method — pick the workload, derive the architecture, measure the gap — is the method of this entire course, because agents are exactly the kind of new workload class that the method demands we take seriously.

Rows are an OLTP decision, not a law of nature

A row store lays a tuple’s attributes contiguously: (id, name, state, price, qty, …), then the next tuple. This is optimal when your unit of work is the tuple — insert an order, read a customer record, update a balance. One seek, one page, done. But an analytical query like SELECT AVG(price) FROM lineitem WHERE shipdate > '2025-01-01' touches 2 of, say, 16 attributes. In a row store with 200-byte tuples, you drag all 200 bytes through the memory hierarchy to use 12 of them — a 94% waste of every cache line and every byte of disk bandwidth. A column store flips the layout: all values of price contiguous, all values of shipdate contiguous. The scan now reads only the columns the query names. For a 16-column table and a 2-column query, that’s an immediate 8× I/O reduction before we’ve even mentioned compression.

Fig. 3.1 — The same table, two physical layouts. The shaded cells are what a SELECT AVG(price) … WHERE date > … scan must pull through the memory hierarchy. NSM (n-ary storage model) drags every attribute along; DSM (decomposition storage model) reads only the two named columns, and each 64-byte cache line it fetches is 100% useful, type-homogeneous, and SIMD-friendly.

Compression: where columns earn their second order of magnitude

The I/O reduction above is the boring half. The exciting half: a column is a run of same-typed, often-sorted, low-entropy values, which compresses dramatically better than interleaved rows. Three encodings carry most of the weight. Work the numbers with me on a 1-billion-row lineitem table.

Run-length encoding (RLE). Take shipdate, and suppose the table is sorted on it. A billion rows spanning ~2,500 distinct days means each date repeats ~400,000 times consecutively. Store (value, run_length) pairs: 2,500 pairs × 8 bytes ≈ 20 KB, versus 4 GB raw (4-byte dates × 10⁹). That’s a 200,000× ratio — absurd, and real, but only on the sort column. RLE on an unsorted column degenerates to runs of length 1 and a 2× expansion. Lesson: compression choice is a function of sort order, which is why C-Store makes sort order a first-class physical-design decision.

Dictionary encoding. Take ship_state: 50 distinct values, stored as variable-length strings averaging 9 bytes. Build a dictionary mapping each string to an integer code; ⌈log₂ 50⌉ = 6 bits per value. Raw: 9 GB. Encoded: 10⁹ × 6 bits = 750 MB, plus a 450-byte dictionary. Ratio: 12×. Better still, predicates rewrite into code space — state = 'CA' becomes code = 4, an integer comparison against packed data, no string handling in the inner loop.

Bit-packing. Take quantity, stored as a 4-byte int but actually ranging 1–50. You need ⌈log₂ 50⌉ = 6 bits, not 32. Packed: 750 MB vs 4 GB — 5.3×, from nothing but refusing to store leading zeros. Combine with frame-of-reference (store deltas from a per-block minimum) and ascending keys like order IDs pack into a few bits each. These three encodings stack: dictionary first, then bit-pack the codes, then RLE the packed runs. Real Parquet files on real warehouse data routinely land at 5–10× overall — which means a 10 TB logical scan is a 1–2 TB physical one.

Late materialization and projections

If you decompress and glue columns back into tuples at the bottom of the plan, you’ve rebuilt a row store with extra steps. C-Store’s answer is late materialization: operate on columns (ideally still compressed) as long as possible, carrying only position lists between operators. Evaluate shipdate > X directly on RLE runs — a predicate on a run of 400,000 identical values is one comparison, not 400,000 — producing a bitmap of qualifying positions. AND that bitmap with the result of state = 4. Only then fetch price values at the surviving positions and aggregate. Tuples get stitched together only at the top, for the handful of rows that survive, instead of at the bottom for all billion.

C-Store’s other heresy: it doesn’t store “the table” at all. It stores projections — overlapping subsets of columns, each sorted on a different key, each compressed per its own sort order. A query picks whichever projection covers its columns with the friendliest sort order. Redundant storage buying read speed — this is the RUM triangle from Week 2 wearing different clothes. The write problem (sorted, compressed, replicated data is miserable to update) is handled by a small row-oriented Writeable Store in front, merged into the Read-optimized Store in batches. Squint and you’ll see the LSM shape again: optimize reads in the big immutable structure, absorb writes in a small mutable one, pay with background merges.

Why the hardware votes for columns

Two mechanical facts seal the argument. Cache lines: memory arrives in 64-byte lines. Scanning a column of 4-byte ints, every line delivers 16 useful values; scanning the same predicate over 200-byte rows, every line delivers at most a fraction of one. On a machine with ~25 GB/s per-core effective bandwidth, that utilization difference is the speedup, because scans are bandwidth-bound. SIMD: AVX-512 compares 16 packed 32-bit values per instruction — but only if the values are contiguous and same-typed. Columnar layout is precisely the precondition for SIMD; row layout makes it impossible without a gather, which forfeits the win. The column store isn’t clever; it’s shaped like the machine.

Field note · 2009, a telecom warehouse

A team I knew ran the same 40-line revenue query on their row-store warehouse (11 minutes) and on an early Vertica cluster — C-Store’s commercial child — on cheaper hardware (4.2 seconds). The DBA’s first reaction was to assume the result was cached and restart the cluster. It wasn’t the cache. It was 14 columns they were no longer reading, a sort order that turned the date predicate into RLE-run arithmetic, and a plan that never materialized a tuple until the final 900 rows. The vendor didn’t win the deal on price. They won it because the demo looked like a magic trick.

A 50× gap isn’t an optimization opportunity — it’s the sound of the wrong architecture hitting the right workload.

Lecture 2 · Thursday

Vectorized Execution: From Tuple-at-a-Time to Batches

Tuesday was about layout; today is about execution, and the villain is one of the most beloved designs in database history. The Volcano iterator model (Graefe, 1990) gives every operator the same interface — open(), next(), close() — and composes plans like Unix pipes: each next() call pulls one tuple from the operator below. It is beautiful, modular, and memory-frugal: only one tuple is in flight at a time, which mattered enormously when RAM was measured in megabytes. It is also, on modern CPUs, catastrophically slow for analytics — and the MonetDB/X100 paper (CIDR 2005) measured exactly how slow: commercial DBMSs executing TPC-H Query 1 ran at roughly a third of the instructions-per-cycle of hand-written C — and paid an order of magnitude more cycles per tuple — spending under 10% of their cycles on actual query work.

The price of one tuple at a time

Count the overhead per tuple in a Volcano plan. Each next() is a virtual function call — an indirect branch the CPU can’t reliably predict, ~20–40 cycles with the pipeline flush. A three-operator plan (scan → filter → aggregate) costs three of those per tuple, plus attribute extraction from the tuple slot, plus type-dispatch (“is this an int4 or a numeric?”) decided per value inside the expression interpreter. Total: hundreds of cycles of interpretation wrapping ~5 cycles of actual work — a compare, an add. At 10⁹ tuples, a query whose useful work is ~5 billion cycles spends 100+ billion on ceremony. The function-call tax doesn’t show up in OLTP, where a query touches 10 rows; it devours OLAP, where a query touches 10⁹. Same model, different workload, opposite verdict — Stonebraker’s thesis again, now at the level of the inner loop.

X100: vectors of ~1000, primitives, cache residency

MonetDB/X100’s fix keeps Volcano’s composable pull-based plan but changes the unit of exchange: next() returns a vector of ~1000 values per column, not one tuple. The virtual call, the branch misprediction, the operator-dispatch logic — all still there, but amortized over 1000 values: 30 cycles ÷ 1000 = 0.03 cycles per value of overhead instead of 30. Inside operators, work is done by primitives: pre-compiled, type-specialized tight loops like select_lt_int32 or map_mul_double. No interpretation inside the loop — the type dispatch happened once per vector, not once per value — so the compiler unrolls, the hardware prefetcher streams, and SIMD applies. X100 reported TPC-H Q1 going from ~minutes-per-gigabyte interpretation regimes to being limited by memory bandwidth, roughly 3× the IPC and one to two orders of magnitude more raw execution power than the tuple-at-a-time commercial engines it benchmarked.

Why 1000 and not 1,000,000? Because the other failure mode is full-column-at-a-time execution (original MonetDB BAT algebra): materializing every intermediate result as a complete column array spills intermediates to RAM — or disk — between every operator. The vector size is chosen so that all the vectors live in a plan’s working set fit in L1/L2 cache. A 1000-value vector of 4-byte ints is 4 KB; a dozen of them in flight is ~50 KB — comfortably inside a 1 MB L2. Each vector is produced, consumed by the next primitive, and overwritten while still cache-hot. The X100 paper’s own sensitivity graph is U-shaped: performance climbs steeply from vector size 1 (interpretation-dominated), plateaus around 10²–10³, then degrades past ~10⁴ as vectors fall out of cache. The sweet spot is “big enough to amortize interpretation, small enough to stay resident.” Memorize that sentence; it generalizes to half of systems design.

/* An X100-style primitive: selection on one column vector.
 * One virtual call delivered 'n' ≈ 1024 values; everything
 * below is a branch-free, type-specialized, cache-resident loop
 * the compiler auto-vectorizes (AVX2: 8 lanes per instruction).
 */
size_t sel_lt_int32(const int32_t *col, int32_t bound,
                    uint32_t *sel_out, size_t n)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        sel_out[k] = (uint32_t)i;     /* write position…        */
        k += (col[i] < bound);        /* …keep it only on match */
    }
    return k;   /* selection vector: positions, not copies —
                   late materialization at execution scale */
}

Note what the primitive returns: a selection vector of qualifying positions, not a copied-out vector of survivors. Downstream primitives take (data, sel, n) and iterate only the selected positions. This is Tuesday’s late materialization recurring at micro-scale — pass positions, defer copying — and the branch-free k += (cond) idiom sidesteps the branch predictor entirely, which matters because a 50%-selective predicate is the predictor’s worst nightmare.

Vectorize or compile? The HyPer counterpoint

Vectorization is not the only escape from Volcano. The other is JIT compilation: HyPer (Neumann, 2011) compiles each query into LLVM machine code that fuses a whole pipeline — scan, filter, aggregate — into one loop, keeping the current tuple in registers with zero materialization between operators. Both approaches kill interpretation overhead; they differ in what they pay. The 2018 “Everything You Always Wanted to Know…” study (Kersten et al.) built both models in one codebase and found neither dominates: compilation wins on computation-heavy queries with long pipelines (no vector materialization at all), vectorization wins on memory-bound scans (SIMD-friendly, better cache behavior on hash joins) — usually within 2× of each other, both ~100× over Volcano.

Dimension	Volcano (tuple-at-a-time)	Vectorized (X100 → DuckDB)	JIT-compiled (HyPer → Umbra)
Interpretation overhead per value	~100s of cycles (virtual calls, type dispatch)	~0.03–0.5 cycles (amortized over ~1K vector)	≈0 (fused native loop)
Intermediate results	One tuple in flight	Cache-resident vectors (~4 KB/col)	None — values held in registers
SIMD	Impossible	Natural (contiguous typed vectors)	Possible, harder to coax from codegen
Query startup latency	None	None — primitives precompiled	Compile time: ms–100s of ms per query
Profiling & adaptivity	Easy, per-operator	Easy — per-primitive timing, swap kernels mid-query	Hard — one opaque fused loop
Engineering cost	Low	Medium (primitive library, ~exhaustive type combos)	High (codegen, debugging machine code)

Aside · the HyPer compile-time tax

HyPer’s authors discovered that for short interactive queries, LLVM’s optimization passes could cost more than running the query — so HyPer shipped an adaptive scheme: start executing a quick-and-dirty compiled version (or interpret the bytecode) while optimized code compiles in the background, swapping it in mid-flight if the query lasts long enough. Its successor Umbra went further with a custom IR to make codegen itself near-instant. Keep this in your pocket for the agent discussion below: when queries are numerous, short, and speculative, fixed per-query startup cost stops being a rounding error and starts being the bill.

The agent angle: scan-dominant, speculative, and impatient

Now run this week through the course thesis. A human analyst issues perhaps a dozen carefully-considered queries per hour. An agent doing exploratory analysis issues hundreds: profile every column, test five hypotheses, compute a correlation matrix, abandon four branches, drill into the fifth. Three properties of this workload follow. It is scan-dominant — profiling and hypothesis-testing queries are aggregations over wide ranges, the exact shape columns plus vectorization serve best, and the exact shape B-tree point indexes serve worst. It is speculative — most queries are dead ends, so the economics demand cheap queries over fast-but-expensive ones; you can’t justify building an index, or even a long JIT compile, for a query that will run once and be discarded. And it is impatient in aggregate — per-query overhead (connection setup, planning, compilation) multiplies by the query count, which is why vectorized engines with precompiled primitives and near-zero startup (DuckDB embedded in the agent’s process is the limit case) fit agents better than engines that amortize heavy per-query investment over long-running reports.

This is also why the open descendants of this week’s papers matter more than the papers’ own systems. Parquet is C-Store’s storage chapter standardized: columnar layout, dictionary + RLE + bit-packing per column chunk, min/max zone maps per row group for predicate skipping — readable by every engine, sitting in object storage. Arrow is the X100 vector standardized: a common in-memory columnar format so that the engine, the dataframe library, and the agent’s tool-call boundary pass vectors by pointer instead of serializing tuples. An agent that calls DuckDB over Arrow against Parquet in S3 is running the 2005 research stack, verbatim, as commodity open infrastructure. One size fit none; so the sizes became formats, and the formats won.

Readings

Read Before Thursday

“One Size Fits All”: An Idea Whose Time Has Come and Gone — Stonebraker & Çetintemel, ICDE 2005.Read for the method, not the predictions: how to argue from workload characteristics to architecture. Note which of its bets aged well (warehousing, streams) and which didn’t.

C-Store: A Column-oriented DBMS — Stonebraker et al., VLDB 2005.Focus on §3–4: projections, sort-order-dependent compression, and the WS/RS split. Ask yourself at every design choice: which corner of the RUM triangle is being traded away?

MonetDB/X100: Hyper-Pipelining Query Execution — Boncz, Zukowski & Nes, CIDR 2005.Focus on §2 (why TPC-H Q1 gets <10% of hand-coded performance in tuple-at-a-time engines) and the vector-size experiment — the U-shaped curve is the whole lecture in one figure.

Exercises

This Week’s Problems

Exercise 3.1 · warm-up

A 2-billion-row events table has a country column (200 distinct values, stored as strings averaging 11 bytes) and a ts timestamp column (8 bytes, table sorted on it, ~50,000 rows share each distinct second). Compute the compressed size and compression ratio of: (a) country under dictionary encoding with bit-packed codes; (b) ts under RLE with 16-byte (value, run) pairs; (c) country under RLE — and explain in one sentence why (c) comes out the way it does.

Exercise 3.2 · core

Model a three-operator plan (scan → filter → sum) over 10⁹ tuples. Assume each next() call costs 30 cycles of call-and-dispatch overhead, per-value useful work is 4 cycles, and your CPU runs 3×10⁹ cycles/s. (a) Estimate runtime under tuple-at-a-time execution. (b) Re-estimate with 1024-value vectors, where each operator additionally pays a fixed 200 cycles per vector for primitive setup. (c) Find the vector size at which total overhead (call amortization + setup) drops below 5% of useful work. (d) Now add a memory constraint — each in-flight vector column is 4 bytes/value and the plan keeps 10 vectors live in a 512 KB L2 — and state the largest admissible vector size. Compare (c) and (d) and write two sentences on what bounds the sweet spot from each side.

Exercise 3.3 · stretch

Design — on paper, with concrete mechanisms — an execution engine for a speculative agent workload: bursts of 200–500 short analytical queries arriving over ~60 seconds, where ~80% of results are discarded by the agent and many queries are near-duplicates differing only in one predicate constant or one grouping column. Address: (a) vectorized vs. JIT vs. hybrid, with a defensible startup-cost budget per query; (b) what you cache and reuse across near-duplicate queries (plans? compiled primitives? subexpression results? zone-map verdicts?) and how you detect reuse safely; (c) whether you would speculatively execute queries the agent hasn’t issued yet, and what admission control prevents speculation from starving real work; (d) the metric you would optimize — it is not single-query latency — and how you would benchmark it. There is no canonical answer; published systems disagree with each other here. Be quantitative where you can and honest where you can’t.

❦