DATA 2027 · Week 03 · Part I — Foundations Under New Workloads

One Size Fits None

In 2005 the column store beat the general-purpose engine by 50× — this week we dissect why columns beat rows, and why batches beat tuples.

Lecture 1 — The column-store argument · Lecture 2 — Vectorized execution: from tuple-at-a-time to batches

Lecture 1 · Tuesday

The Column-Store Argument

Pick the workload, derive the architecture, measure the gap.

L1 · Stonebraker’s method

Read the 2005 polemic for its method

Commercial RDBMSs: System R descendants, tuned for 1985 OLTP.
By 2005: streams, scientific arrays, text, warehousing.
Move: design for one workload class alone, then measure.
For warehousing, that engine was C-Store.
Agents are exactly the new workload class this method demands.

L1 · Stonebraker’s method

The gap wasn’t 20%

50×

C-Store’s scan speedup over the general-purpose incumbent — one to two orders of magnitude. The generalist isn’t a compromise; it’s a category error.

L1 · Layout

Rows are an OLTP decision, not a law of nature

Row store: tuple’s attributes contiguous — optimal per-tuple work.
AVG(price) WHERE shipdate > … touches 2 of 16 attributes.
200-byte tuples, 12 useful bytes: 94% waste per cache line.
Column store: each column contiguous; scan reads only named columns.
2-column query on 16 columns: 8× I/O reduction before compression.
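
A minimal C sketch of the two layouts, assuming a hypothetical row with an 8-byte price, a 4-byte shipdate, and 188 bytes standing in for the other 14 attributes. The struct and function names are illustrative and do not come from C-Store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical row layout: 8 + 4 + 188 = 200 bytes, as on the slide. */
struct row {
    double  price;       /* 8 bytes */
    int32_t shipdate;    /* 4 bytes, days since an arbitrary epoch */
    uint8_t other[188];  /* the 14 attributes this query never reads */
};

/* Row store: every 200-byte tuple crosses the memory hierarchy
 * so that 12 of its bytes can be used. */
double avg_price_rows(const struct row *t, size_t n, int32_t cutoff)
{
    assert(n == 0 || t != NULL);
    double sum = 0.0;
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (t[i].shipdate > cutoff) {
            sum += t[i].price;
            hits++;
        }
    }
    return hits ? sum / (double)hits : 0.0;
}

/* Column store: two dense arrays; the other columns are never touched. */
double avg_price_cols(const double *price, const int32_t *shipdate,
                      size_t n, int32_t cutoff)
{
    double sum = 0.0;
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (shipdate[i] > cutoff) {
            sum += price[i];
            hits++;
        }
    }
    return hits ? sum / (double)hits : 0.0;
}

/* Bytes a full scan has to read under each layout. */
size_t scan_bytes_rows(size_t n) { return n * sizeof(struct row); }
size_t scan_bytes_cols(size_t n) { return n * (sizeof(double) + sizeof(int32_t)); }
```

With these assumed widths the byte ratio is 200 / 12 ≈ 16.7×. The 8× on the slide counts columns (2 of 16), which is the same thing only if all attributes are equally wide.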

L1 · Layout

Same table, two physical layouts

Fig. 3.1 — Shaded cells: what SELECT AVG(price) … WHERE shipdate > … must pull through the memory hierarchy.

L1 · Compression

Compression: the second order of magnitude

A column is same-typed, often-sorted, low-entropy data.
Compresses dramatically better than interleaved rows.
Three encodings carry most of the weight: RLE, dictionary, bit-packing.
Worked example: a 1-billion-row lineitem table.

L1 · Compression

Run-length encoding (RLE)

shipdate, table sorted on it: ~2,500 distinct days.
Each date repeats ~400,000 times consecutively.
Store (value, run_length): 2,500 pairs × 8 B ≈ 20 KB.
Raw: 4 GB (4-byte dates × 10⁹).
Unsorted column: runs of length 1, so each 4-byte value costs an 8-byte pair → 2× expansion.
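
The arithmetic above can be checked with a small encoder. This is a sketch, assuming an 8-byte (value, run_length) pair as on the slide; `rle_run`, `rle_encode`, and `rle_bytes` are hypothetical names, not C-Store's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One (value, run_length) pair: 4 + 4 = 8 bytes. */
struct rle_run {
    int32_t  value;
    uint32_t length;
};

/* Encode col[0..n) into runs[]. The caller provides room for n runs,
 * the worst case when no two neighbours are equal.
 * Returns the number of runs written. */
size_t rle_encode(const int32_t *col, size_t n, struct rle_run *runs)
{
    assert(n == 0 || (col != NULL && runs != NULL));
    size_t r = 0;
    for (size_t i = 0; i < n; i++) {
        if (r > 0 && runs[r - 1].value == col[i]
                  && runs[r - 1].length < UINT32_MAX) {
            runs[r - 1].length++;          /* extend the current run */
        } else {
            runs[r].value = col[i];        /* start a new run */
            runs[r].length = 1;
            r++;
        }
    }
    return r;
}

/* Encoded size in bytes for a given run count. */
size_t rle_bytes(size_t run_count)
{
    return run_count * sizeof(struct rle_run);
}
```

On a sorted column the run count equals the number of distinct values. On a column with no adjacent repeats it equals n, so the encoded size is twice the raw size.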

L1 · Compression

RLE on the sort column

200,000×

4 GB → 20 KB. Absurd, and real — but only on the sort column. That’s why C-Store makes sort order a first-class physical-design decision.

L1 · Compression

Dictionary & bit-packing

Dictionary: ship_state, 50 values, ~9-byte strings.
⌈log₂ 50⌉ = 6 bits/value: 9 GB → 750 MB, 12×.
Predicates rewrite into code space: state = 'CA' → code = 4.

Bit-packing: quantity ranges 1–50, needs 6 bits, not 32.
4 GB → 750 MB, 5.3× — just dropping leading zeros.
Encodings stack: dictionary → bit-pack → RLE; Parquet lands at 5–10× overall.
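
A sketch of fixed-width bit-packing, the step shared by both encodings above: dictionary codes and small integers are stored in ⌈log₂ N⌉ bits each. The function names are hypothetical, and the bit-at-a-time loops favor clarity over speed; production formats pack whole words at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bits needed for codes 0..distinct-1, i.e. ceil(log2(distinct)). */
unsigned bits_needed(uint32_t distinct)
{
    unsigned bits = 0;
    while (bits < 32 && ((uint32_t)1 << bits) < distinct)
        bits++;
    return bits;
}

/* Bytes occupied by n values of `width` bits each. */
size_t packed_bytes(size_t n, unsigned width)
{
    return (n * width + 7) / 8;
}

/* Pack n values, least-significant bit first.
 * out must hold packed_bytes(n, width) bytes. */
void bitpack(const uint32_t *in, size_t n, unsigned width, uint8_t *out)
{
    assert(width >= 1 && width <= 32);
    memset(out, 0, packed_bytes(n, width));
    size_t bitpos = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t v = in[i];
        assert((v >> width) == 0);   /* value must fit in `width` bits */
        for (unsigned b = 0; b < width; b++, bitpos++) {
            if ((v >> b) & 1u)
                out[bitpos / 8] |= (uint8_t)(1u << (bitpos % 8));
        }
    }
}

/* Read back the i-th packed value without unpacking its neighbours. */
uint32_t bitunpack_at(const uint8_t *packed, size_t i, unsigned width)
{
    uint32_t v = 0;
    size_t bitpos = i * width;
    for (unsigned b = 0; b < width; b++, bitpos++) {
        if ((packed[bitpos / 8] >> (bitpos % 8)) & 1u)
            v |= (uint32_t)1 << b;
    }
    return v;
}
```

`bits_needed(50)` gives 6, and `packed_bytes(1000, 6)` gives 750, the same ratio as the 10⁹-row example scaled down by 10⁶.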

L1 · Late materialization

Operate on columns as long as possible

Gluing tuples at the bottom rebuilds a row store, badly.
Carry only position lists between operators.
Predicate on an RLE run of 400,000 values: one comparison.
AND the position bitmaps from each predicate; fetch price only at surviving positions.
Stitch tuples at the top, for the few survivors.
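
The steps above can be sketched in C, assuming an RLE-encoded shipdate column, a plain quantity column, and one bit per row position. All names are hypothetical; the quantity predicate is added only to give the AND something to combine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct rle_run { int32_t value; uint32_t length; };

static void set_bit(uint64_t *bm, size_t i)
{
    bm[i / 64] |= (uint64_t)1 << (i % 64);
}

/* shipdate > cutoff on an RLE column: one comparison decides a whole run.
 * bm must be zeroed by the caller. */
void mark_shipdate_after(const struct rle_run *runs, size_t n_runs,
                         int32_t cutoff, uint64_t *bm)
{
    size_t pos = 0;
    for (size_t r = 0; r < n_runs; r++) {
        if (runs[r].value > cutoff) {
            for (size_t i = pos; i < pos + runs[r].length; i++)
                set_bit(bm, i);
        }
        pos += runs[r].length;
    }
}

/* quantity < bound on a plain column. bm must be zeroed by the caller. */
void mark_quantity_below(const int32_t *quantity, size_t n,
                         int32_t bound, uint64_t *bm)
{
    for (size_t i = 0; i < n; i++) {
        if (quantity[i] < bound)
            set_bit(bm, i);
    }
}

/* AND the two bitmaps, then read price only at surviving positions. */
double sum_price_where(const uint64_t *a, const uint64_t *b,
                       const double *price, size_t n, size_t *hits)
{
    double sum = 0.0;
    *hits = 0;
    for (size_t w = 0; w * 64 < n; w++) {
        uint64_t both = a[w] & b[w];        /* 64 positions per AND */
        for (size_t bit = 0; both != 0; bit++, both >>= 1) {
            if (both & 1u) {
                size_t i = w * 64 + bit;
                assert(i < n);
                sum += price[i];            /* the only access to price */
                (*hits)++;
            }
        }
    }
    return sum;
}
```

Nothing resembling a tuple exists until `sum_price_where` runs, and even there only the price column is touched, and only at positions that passed both predicates.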

L1 · Projections

C-Store doesn’t store “the table”

Stores projections: overlapping column subsets, each sorted differently.
Queries pick the projection with the friendliest sort order.
Redundant storage buying read speed — the RUM triangle again.
Writes hit a small row-oriented Writeable Store, merged in batches.
Squint: it’s the LSM shape from Week 2.

L1 · Hardware

Why the hardware votes for columns

Cache lines: memory arrives in 64-byte lines.
Column of 4-byte ints: 16 useful values per line.
200-byte rows: each row spans more than three lines, so roughly one useful 4-byte value per three lines fetched.
Scans are bandwidth-bound (~25 GB/s per core).

SIMD: AVX-512 compares 16 packed 32-bit values per instruction.
Requires contiguous, same-typed values — columnar is the precondition.
Row layout forces a gather, forfeiting the win.
The column store is shaped like the machine.

L1 · Field note, 2009

11 minutes became

4.2 s

Same 40-line revenue query: row-store warehouse vs. an early Vertica cluster on cheaper hardware. The speedup was not caching: 14 columns were never read, the date predicate ran as arithmetic over RLE runs, and tuples were materialized only for the final 900 rows.

Lecture 2 · Thursday

Vectorized Execution

From tuple-at-a-time to batches — the inner loop meets the modern CPU.

L2 · Volcano

The beloved villain: Volcano (Graefe, 1990)

Every operator: open(), next(), close() — composes like Unix pipes.
Each next() pulls one tuple from below.
Beautiful, modular, memory-frugal: one tuple in flight.
Mattered enormously when RAM was megabytes.
On modern CPUs: catastrophically slow for analytics.
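
A minimal sketch of the iterator model in C, with function pointers standing in for virtual calls. The `struct op` layout, the operator names, and the `calls` counter are assumptions made for illustration; open() and close() are omitted to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Volcano-style operator: each next() yields one tuple (here, one int32). */
struct op {
    int (*next)(struct op *self, int32_t *out);  /* 1 = tuple, 0 = exhausted */
    struct op *child;
    const int32_t *data;   /* scan only */
    size_t n, pos;         /* scan only */
    int32_t bound;         /* filter only */
    size_t calls;          /* number of next() invocations on this operator */
};

static int scan_next(struct op *self, int32_t *out)
{
    self->calls++;
    if (self->pos == self->n)
        return 0;
    *out = self->data[self->pos++];
    return 1;
}

static int filter_next(struct op *self, int32_t *out)
{
    self->calls++;
    int32_t v;
    /* One indirect call into the child for every input tuple. */
    while (self->child->next(self->child, &v)) {
        if (v < self->bound) {
            *out = v;
            return 1;
        }
    }
    return 0;
}

/* Plan root: count tuples below `bound`, pulling them one at a time. */
size_t count_below(const int32_t *data, size_t n, int32_t bound,
                   size_t *scan_calls)
{
    struct op scan   = { .next = scan_next, .data = data, .n = n };
    struct op filter = { .next = filter_next, .child = &scan, .bound = bound };
    size_t count = 0;
    int32_t v;
    while (filter.next(&filter, &v))
        count++;
    assert(scan.calls == n + 1);   /* n tuples plus one end-of-stream call */
    *scan_calls = scan.calls;
    return count;
}
```

The scan alone is entered n + 1 times for n tuples, and every further operator in the pipeline adds its own per-tuple calls on top.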

L2 · Volcano

What X100 measured (CIDR 2005)

<10%

Share of CPU cycles commercial DBMSs spent on actual query work running TPC-H Q1 — ~3× lower IPC and an order of magnitude more cycles per tuple than hand-written C.

L2 · Volcano

The price of one tuple at a time

Each next(): a virtual call, ~20–40 cycles with pipeline flush.
Three operators → three calls per tuple, plus attribute extraction.
Type dispatch decided per value in the interpreter.
Hundreds of cycles of ceremony around ~5 cycles of work.
At 10⁹ tuples: ~5 B useful cycles, 100+ B on ceremony.

L2 · X100

X100’s fix: change the unit of exchange

next() returns a vector of ~1000 values per column.
Virtual-call overhead amortized: 30 ÷ 1000 = 0.03 cycles/value.
Work done by primitives: precompiled, type-specialized tight loops.
Type dispatch once per vector, not once per value.
Result: TPC-H Q1 limited by memory bandwidth, not interpretation.
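
The same pull interface with the unit of exchange changed to a vector, as a sketch. `VECTOR_SIZE`, `struct vscan`, and the function names are hypothetical, and a direct call stands in for the virtual call to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VECTOR_SIZE 1024

struct vscan {
    const int32_t *data;
    size_t n, pos;
    size_t calls;   /* number of next() invocations */
};

/* Returns the number of values in the next vector (0 = exhausted).
 * *vec points into the column; nothing is copied. */
size_t vscan_next(struct vscan *s, const int32_t **vec)
{
    s->calls++;
    assert(s->pos <= s->n);
    size_t len = s->n - s->pos;
    if (len > VECTOR_SIZE)
        len = VECTOR_SIZE;
    *vec = s->data + s->pos;
    s->pos += len;
    return len;
}

/* Primitive: a type-specialized tight loop over one vector. */
static int64_t sum_int32(const int32_t *v, size_t n)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

/* Sum a column vector-at-a-time and report how many next() calls it took. */
int64_t sum_column(const int32_t *data, size_t n, size_t *next_calls)
{
    struct vscan s = { .data = data, .n = n };
    const int32_t *vec;
    size_t len;
    int64_t total = 0;
    while ((len = vscan_next(&s, &vec)) > 0)
        total += sum_int32(vec, len);
    *next_calls = s.calls;
    return total;
}
```

For 3,000 values this makes 4 calls to `vscan_next` (three vectors plus end-of-stream), where a tuple-at-a-time scan would make 3,001.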

L2 · X100

Why ~1000, not 1,000,000

Fig. 3.2 — X100’s sensitivity curve: big enough to amortize interpretation, small enough to stay cache-resident.

L2 · Primitives

An X100-style primitive

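#include <stddef.h>
#include <stdint.h>

/* One virtual call delivered n ≈ 1024 values.
 * The loop below is branch-free, type-specialized, and cache-resident.
 * The comparison is SIMD-friendly (AVX2: 8 int32 lanes per instruction);
 * whether the compacting store is auto-vectorized depends on the
 * compiler and target.
 * sel_out must have room for n entries. Returns the number of matches. */
size_t sel_lt_int32(const int32_t *col, int32_t bound,
                    uint32_t *sel_out, size_t n)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        sel_out[k] = (uint32_t)i;     /* write position…        */
        k += (col[i] < bound);        /* …keep it only on match */
    }
    return k;   /* positions, not copies */
}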

L2 · Primitives

Selection vectors: positions, not copies

Primitive returns qualifying positions, not copied survivors.
Downstream primitives take (data, sel, n).
Tuesday’s late materialization, recurring at micro-scale.
Branch-free k += (cond) sidesteps the branch predictor.
A 50%-selective predicate is the predictor’s worst nightmare.
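
A sketch of a downstream primitive consuming a selection vector. `sel_lt_int32` is repeated from the previous slide so the block stands alone; `sum_int32_sel` and `sum_price_where_qty_lt` are hypothetical names, and the stack buffer assumes one vector of at most 1024 values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VECTOR_SIZE 1024

/* Selection primitive: writes the positions where col[i] < bound. */
size_t sel_lt_int32(const int32_t *col, int32_t bound,
                    uint32_t *sel_out, size_t n)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        sel_out[k] = (uint32_t)i;
        k += (col[i] < bound);
    }
    return k;
}

/* Downstream primitive: takes (data, sel, n_sel) and reads data
 * only at the selected positions. */
int64_t sum_int32_sel(const int32_t *data, const uint32_t *sel, size_t n_sel)
{
    int64_t sum = 0;
    for (size_t j = 0; j < n_sel; j++)
        sum += data[sel[j]];
    return sum;
}

/* SUM(price) WHERE qty < bound over one vector of at most VECTOR_SIZE rows. */
int64_t sum_price_where_qty_lt(const int32_t *qty, const int32_t *price,
                               size_t n, int32_t bound)
{
    assert(n <= VECTOR_SIZE);
    uint32_t sel[VECTOR_SIZE];
    size_t k = sel_lt_int32(qty, bound, sel, n);
    return sum_int32_sel(price, sel, k);
}
```

The qty and price vectors are never copied or compacted; only the small array of positions travels between the two primitives.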

L2 · HyPer

Vectorize or compile? The HyPer counterpoint

HyPer (Neumann, 2011): JIT-compile each query to machine code via LLVM.
Fuses a whole pipeline into one loop; tuple stays in registers.
Kersten et al. 2018: built both in one codebase — neither dominates.
Compilation wins computation-heavy, long pipelines; vectorization wins memory-bound scans.
Usually within 2× of each other; both ~100× over Volcano.

L2 · HyPer

Three execution models

Dimension	Volcano	Vectorized (X100 → DuckDB)	JIT (HyPer → Umbra)
Overhead / value	~100s of cycles	~0.03–0.5 cycles	≈0 (fused loop)
Intermediates	One tuple in flight	Cache-resident vectors (~4 KB/col)	None — registers
SIMD	Impossible	Natural	Possible, harder
Startup latency	None	None — precompiled	ms–100s of ms / query
Profiling	Easy, per-operator	Easy, per-primitive	Hard — opaque loop
Engineering cost	Low	Medium	High

L2 · HyPer

The compile-time tax

For short queries, LLVM passes can cost more than execution.
HyPer’s fix: run quick-and-dirty code, swap in optimized mid-flight.
Umbra: a custom IR to make codegen itself near-instant.
When queries are numerous, short, and speculative, fixed startup cost dominates the bill.

L2 · The agent angle

Agents: hundreds of queries, not a dozen

Analyst: ~a dozen considered queries per hour. Agent: hundreds.
Scan-dominant: profiling and hypothesis tests are wide aggregations.
Speculative: most queries are dead ends, so each must be cheap to launch, not fast only after an expensive setup.
Impatient in aggregate: per-query overhead multiplies by query count.
Fit: vectorized engines, precompiled primitives, near-zero startup (DuckDB embedded).

L2 · The agent angle

The ideas became formats, and won

Parquet = C-Store’s storage chapter, standardized.
Dictionary + RLE + bit-packing per column chunk.
Min/max statistics (zone maps) per row group let predicates skip whole row groups.
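
Row-group skipping with min/max statistics can be sketched as follows. `struct row_group_stats` and `groups_to_read` are hypothetical and do not reflect Parquet's actual metadata layout or any reader API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-row-group statistics for one column (assumed shape). */
struct row_group_stats {
    int32_t min, max;
    size_t  n_rows;
};

/* For the predicate `col > cutoff`, a row group whose max is <= cutoff
 * cannot contain a match and is skipped without being read.
 * Returns the number of row groups that must be read and stores the
 * number of rows they contain in *rows_to_read. */
size_t groups_to_read(const struct row_group_stats *g, size_t n_groups,
                      int32_t cutoff, size_t *rows_to_read)
{
    size_t keep = 0;
    *rows_to_read = 0;
    for (size_t i = 0; i < n_groups; i++) {
        assert(g[i].min <= g[i].max);
        if (g[i].max > cutoff) {
            keep++;
            *rows_to_read += g[i].n_rows;
        }
    }
    return keep;
}
```

Skipping is most effective when the data is sorted or clustered on the predicate column, since then each row group covers a narrow value range.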

Arrow = the X100 vector, standardized in memory.
Engine, dataframe library, and tool-call boundary exchange vectors by pointer, with no copy.
DuckDB over Arrow on Parquet in S3 = the 2005 research stack, verbatim.

A 50× gap isn’t an optimization opportunity — it’s the sound of the wrong architecture hitting the right workload.

— Week 3 lecture notes

Checkpoint · Discussion

Before you leave

Why does RLE on an unsorted column expand instead of compress?
What bounds the vector-size sweet spot from each side?
For bursts of 200–500 speculative agent queries, what metric do you optimize? (Hint: not single-query latency.)

Readings

Read before Thursday

“One Size Fits All”: An Idea Whose Time Has Come and Gone — Stonebraker & Çetintemel, ICDE 2005. Read for the method, not the predictions.
C-Store: A Column-oriented DBMS — Stonebraker et al., VLDB 2005. Focus on §3–4: projections, sort-order-dependent compression, WS/RS split.
MonetDB/X100: Hyper-Pipelining Query Execution — Boncz, Zukowski & Nes, CIDR 2005. The U-shaped vector-size curve is the whole lecture in one figure.