DATA 2027 DATA SYSTEMS IN THE AGENTIC ERA ← SCHEDULE
Week 08 · Part II — New Access Methods & Engines

Transactions & Branching for Agent Swarms

When ten thousand agents want to try something before they mean it, the transaction abstraction stretches in two directions at once: stricter global ordering underneath, and cheap speculative universes on top. This week we study both ends.

Lectures: Tue — Distributed transactions: Calvin vs Spanner · Thu — Copy-on-write branching as product  ·  Lab: Fri — Create, diverge, and garbage-collect 500 Neon branches; measure storage and branch-creation latency  ·  Slides →
Learning objectives — after this week you can…
  • State precisely what serializability and external consistency promise, and which one Calvin and Spanner each deliver.
  • Explain why deterministic pre-ordering eliminates two-phase commit — and why the same move eliminates interactive transactions.
  • Compute the commit-wait cost of TrueTime from a clock-uncertainty bound ε, and argue when that cost is worth paying.
  • Do the page-sharing arithmetic for copy-on-write branches: creation cost, marginal storage per branch, and GC pressure under a given branch birth/death rate.
  • Design a branch–validate–merge write path for an agent swarm and identify the anomalies it does and doesn’t prevent.
Lecture 1 · Tuesday

Distributed Transactions: Calvin vs Spanner

Agents are greedy transactional clients. A human clicks “checkout” once; an agent fleet retries, forks, races itself, and issues correlated writes from fifty workers that all read the same stale plan. If your isolation story is “read committed and hope,” a swarm will find the anomaly within the hour — we saw the lost-update demo in Week 2. So today we go to the two most instructive poles of distributed transaction design, both published in 2012, both still defining the trade space: Calvin, which says agree on the order first and execution becomes embarrassingly simple, and Spanner, which says buy better clocks and the order falls out of physics. Everything shipping today — CockroachDB, Yugabyte, FoundationDB, TiDB — is a point on the line between them.

Sixty seconds of serializability

One paragraph of refresher, because the rest of the lecture leans on it. A schedule of concurrently executing transactions is serializable if its effects equal those of some serial execution of the same transactions — note the existential quantifier: the system may pick any order it likes. Strict serializability (Spanner calls it external consistency) adds a real-time constraint: if transaction T₁ commits before T₂ begins, in wall-clock time observable from outside the system, then the chosen serial order must put T₁ before T₂. The gap matters for agents: a serializable-but-not-strict system may let an agent commit a write, tell its sibling about it over HTTP, and have the sibling’s subsequent read not see it — perfectly legal, because the equivalent serial order put the sibling first. Out-of-band communication between agents is the norm, not the exception, which is why external consistency stopped being an academic nicety around 2024.

Calvin: order first, execute later

Calvin (Thomson et al., SIGMOD 2012) makes one observation and follows it off a cliff: if every replica of every partition agrees on the full input sequence of transactions before executing any of them, then no agreement is needed afterward. A sequencing layer batches incoming transaction requests into epochs (10 ms in the paper), replicates each batch via Paxos, and hands every partition the same global log. A deterministic lock manager then grants locks in log order — no deadlock is possible, because lock acquisition order is a total order — and workers execute. A node that crashes mid-epoch doesn’t trigger an abort vote; it just replays the log, and determinism guarantees it converges to the same state. There is no two-phase commit, no commit protocol at all for multi-partition transactions. Throughput on TPC-C-like workloads scaled to 500,000 txns/sec on 100 commodity machines in 2012, when Spanner-style designs were paying cross-datacenter round trips per commit.

The price is steep and structural: Calvin must know each transaction’s read and write sets before sequencing it. That means no interactive transactions — no BEGIN, read something, think, decide what to write, COMMIT. Your transaction is a stored procedure, submitted whole. For transactions whose write set depends on a read (“delete the order with the lowest score”), Calvin runs a reconnaissance query at low cost first, then submits the real transaction with the discovered keys, re-checking at execution that the recon results still hold — Optimistic Lock Location Prediction. It works, but every dependent transaction now makes two trips, and a hot key can make recon perpetually stale. Note the irony for our course: an LLM agent’s “transaction” is the most interactive workload imaginable — read, think for nine seconds, write — and yet the agent’s final intent, once it stops thinking, is usually a perfectly declarable read/write set. Hold that thought until Thursday.

Spanner: trust the clocks, pay the wait

Spanner (Corbett et al., OSDI 2012) keeps interactive transactions — two-phase locking, with 2PC across shards coordinated over Paxos groups so the coordinator itself can’t be a single point of failure — and solves ordering with hardware. TrueTime exposes time as an interval: TT.now() returns [earliest, latest], with the guarantee that absolute time lies inside. GPS receivers and atomic clocks in every datacenter keep the uncertainty ε small: the paper reports ε typically 1–7 ms, averaging around 4 ms. The trick is commit wait: a transaction is assigned timestamp s = TT.now().latest at commit, then the coordinator blocks until TT.after(s) is true — i.e., until every clock in the fleet agrees that s is in the past — before releasing locks and acknowledging. Expected stall: about 2ε, call it 5–10 ms. After that wait, timestamp order is real-time order, and external consistency holds globally, across continents, with no communication between non-overlapping transactions.

So Spanner pays in latency-per-commit what Calvin pays in programming-model flexibility. And Spanner still runs 2PC: a multi-shard commit costs prepare and commit rounds across Paxos groups, each a replication round trip. Cross-region read-write transactions land in the 10–100 ms range. What you buy: lock-free snapshot reads at any timestamp in the past — a replica can serve a consistent read at timestamp t with zero coordination once its applied state covers t. For read-heavy agent fleets, this is the feature that matters more than write latency.

Field note The determinism idea didn’t start with Calvin — it’s H-Store/VoltDB’s lineage, and Abadi’s group spent years being told that “no interactive transactions” made it a toy. Then FoundationDB shipped a deterministic-simulation-tested core, FaunaDB built directly on Calvin’s design, and the “toy” restriction quietly became a feature: a transaction you can declare is a transaction you can simulate, replay, and audit. Agent platforms rediscovered this in 2025 when audit logs of exactly what an agent did became a compliance requirement. Determinism is auditability.

The poles, side by side

DimensionCalvin (SIGMOD ’12)Spanner (OSDI ’12)
Ordering mechanismPre-agreed global log (Paxos on input batches)TrueTime timestamps + 2PL
Distributed commitNone — determinism makes it unnecessary2PC layered over Paxos groups
Interactive transactionsNo; recon + re-submit for dependent onesYes, full BEGIN…COMMIT
Consistency deliveredSerializable (one global order, by construction)Externally consistent / strict serializable
Latency taxEpoch batching (~10 ms) + recon tripsCommit wait ≈ 2ε (5–10 ms) + 2PC rounds
Special hardwareNoneGPS + atomic clocks per datacenter
Failure handlingReplay the log; no aborts from failuresPaxos re-election; 2PC blocking bounded by group failover
Best-fit agent workloadHigh-throughput declared writes (tool calls with known keys)Read-mostly fleets needing consistent global snapshots

MVCC and snapshot isolation: what engines already maintain

The bridge to Thursday. Nearly every serious engine — Postgres, InnoDB, Oracle, SQL Server under RCSI, RocksDB-based systems — is already a multi-version store: an update doesn’t overwrite a row, it creates a new version stamped with a transaction ID, and old versions linger until vacuum/purge decides nobody can see them. Snapshot isolation falls out almost for free: give a transaction the set of versions committed as of its start, and it reads a frozen, consistent world while writers proceed. (Remember SI’s one famous leak — write skew: two transactions read overlapping data, write disjoint data, both commit, and an invariant spanning both rows breaks. Serializable SI à la Postgres SERIALIZABLE closes it by tracking rw-antidependencies.) The point I want you to carry out of the room: your database already pays the storage and bookkeeping cost of keeping many coexisting versions of reality. It then aggressively garbage-collects them after seconds. Branching, Thursday’s topic, is just the decision to keep one of those realities alive, give it a name, and let it diverge. The mechanism was always there; only the product was missing.

A branch is a transaction that lived long enough to get a name.
Lecture 2 · Thursday

Copy-on-Write Branching as Product

Speculation is the defining agentic workload. An agent doesn’t know whether its migration script is right; the honest move is to run it somewhere consequence-free and look. Before 2022 “somewhere” meant a nightly-restored staging database, shared by everyone and trusted by no one. Copy-on-write branching changes the economics: a full, writable copy of a production-sized database in tens of milliseconds, costing only the pages you touch. Today: the mechanism from a single B-tree page up to a cloud product, the arithmetic of running ten thousand speculative branches an hour, and why branch–validate–merge — not write credentials — is the correct way to let agents write.

Copy-on-write, from one LMDB page to a whole timeline

Start small. LMDB is a memory-mapped B-tree where a writer never modifies a page in place: it writes the new version of a leaf to a free page, then new versions of every ancestor up to the root, then flips one of two meta pages to point at the new root. Readers grab whichever meta page was current when they started and traverse an immutable tree — no locks, no torn reads, and crash recovery is “use the last valid meta page.” Every commit is, structurally, a tiny fork: old root and new root coexist, sharing all unmodified pages. LMDB just recycles the old tree’s exclusive pages as soon as no reader holds them. ZFS and APFS snapshots, Aurora’s log-structured storage, and RocksDB checkpoints all play the same chord.

Neon plays it at cloud scale by separating compute from storage. Postgres runs unmodified as compute; the storage layer ingests the WAL — the write-ahead log, the totally ordered record of every page modification, addressed by LSN (log sequence number, a byte offset into history) — and indexes page versions by the pair (page id, LSN). When compute needs a page, it asks GetPage@LSN and the pageserver materializes that page as of that point in history from layered immutable files. Once your storage engine can answer “page P at LSN L” for any L it retains, time travel is a query parameter, and a branch is barely more than bookkeeping.

Branch = a new timeline at a WAL LSN

Creating a branch in Neon means writing one metadata record: new timeline T₂, parent T₁, fork point LSN L. No pages are copied — O(metadata), not O(data); tens of milliseconds whether the parent holds 1 GB or 10 TB. Reads on T₂ for pages unwritten since the fork resolve through to the parent’s layer files at LSN ≤ L; writes on T₂ append to T₂’s own WAL and produce layer files only T₂ can see. The parent is never touched — it can’t even tell it has children, which is exactly the isolation property you want when the children are agents. Storage cost of a branch = (its own WAL and layers) + (a constant-size record), and the parent’s history up to L is now pinned: GC on the parent must retain everything a live child might read through.

main (T1) — WAL → LSN 0/52A0 LSN 0/9F10 agent-7f/try-migration (T2) own WAL: dirty pages only agent-7f/plan-B (T3) abandoned → GC reclaims pages unwritten since fork: resolved from parent @ fork LSN branch create = 1 metadata record: (child, parent, fork LSN). O(metadata), ~10s of ms.
Fig. 8.1 — Branches as timelines forking from the parent’s WAL at a chosen LSN. Solid child: live, accumulating its own delta. Dashed child: abandoned; its exclusive layers are GC’d, while parent history up to the oldest live fork point stays pinned.

Ten thousand speculative branches an hour

Now the arithmetic, because “cheap” must survive multiplication. Suppose a planning fleet forks 10,000 branches/hour against a 64 GB parent (8 KB pages → 8.4 M pages). Creation is free-ish: 10,000 metadata records and some control-plane traffic. Storage grows only with divergence. Measure your agents: a typical speculative run touches maybe 200 pages (a few tables, some index leaves, catalog if it DDLs) ≈ 1.6 MB of exclusive layers. If branches live 5 minutes on average before being abandoned or merged, steady-state live branches = 10,000/h × (5/60)h ≈ 833, holding ≈ 833 × 1.6 MB ≈ 1.3 GB of deltas against a 64 GB base. Trivial. What is not trivial: (1) GC pressure — abandoned branches must actually die. Agents are terrible at cleanup (a crashed agent never calls delete_branch), so you need TTLs and a reaper, or the 1.6 MB-each corpses accumulate at 16 GB/hour — they overtake the base in four hours. (2) Parent GC pinning — one forgotten week-old branch pins a week of parent WAL and layer history; retention cost is set by your oldest live fork point, not your average. (3) Page-materialization load — 833 branches cold-reading the same hot pages turn the pageserver into a fan-out read amplifier; shared caches help precisely because the pages are shared.

Worked example — 50 branches, 1% divergence each. Same 64 GB parent. Each of 50 analyst-agent branches dirties 1% of pages: 0.01 × 8.4 M ≈ 83,900 pages ≈ 0.64 GB exclusive. Dirty pages are private even if two branches modify the same page — divergent versions can’t be shared — so deltas add linearly:

StrategyFormulaTotal storageCreate time
50 full copies (pg_dump/restore)50 × 64 GB3,200 GB~40 min each
50 CoW branches @ 1% divergence64 + 50 × 0.64 GB96 GB~50 ms each
Break-even divergenceshared wins until ≈ 98% dirty

A 33× storage reduction, and the break-even calculation tells you CoW only loses when branches rewrite essentially the whole database — at which point they aren’t branches, they’re different databases. The honest caveat: long-lived branches drift. At 1% divergence per day, a branch parked for three months has rewritten most of its hot set and pinned a quarter-year of parent history. Branches are a speculation primitive, not an archival one.

The agent transaction shape: long think-time, optimistic wins

Look at an agent write through Tuesday’s lens. The shape is: read some rows (50 ms) → call an LLM to decide (3–30 s) → write (20 ms). Under pessimistic two-phase locking, that transaction holds read locks across the think-time, and 30 seconds of held locks on a hot row is a denial-of-service attack you launched against yourself: at even 10 such transactions/sec touching one row, the queue never drains. The classic concurrency-control result (Agrawal–Carey–Livny, 1987) says optimistic methods win when conflicts are rare and transaction “footprint time” is long relative to execution — and the agent workload is that regime, taken to a comic extreme: think-time inflates duration 100–1000×, while genuine row-level conflict between agents working different tasks stays low. So the right discipline is read a snapshot, think with no locks held, then validate-and-write: at commit, check that everything you read is unchanged (version check on the read set), and abort-and-retry if not. Retries are cheap for agents — re-prompting with “the data changed, here’s the new state” is one more API call, and nobody is sitting at the keyboard getting angry. A branch is exactly this OCC pattern made durable: the snapshot you “read” is the fork LSN, the think-time can be minutes, and validation happens at merge.

Field note A team I advised in 2025 gave a fleet of refund-processing agents a shared Postgres role with UPDATE on the ledger, reasoning that the agents’ prompts forbade anything dangerous. Within a month an agent, retrying after a timeout it had actually half-committed, issued the “same” refund batch twice — prompts are not idempotency keys. The postmortem fix was not a better prompt. It was revoking the write credential entirely: agents now write to branches, and a 40-line validator owns the only path to main. Incidents since: zero. The validator has rejected 3–4% of merges, which is the real number to frame on the wall — that’s how often a confident agent was about to be wrong in production.

Branch–validate–merge: the safe agent write path

Which brings us to the architecture this week has been building toward. Giving an agent production write credentials means trusting the weakest prompt-injection in its context window with your ledger. Instead: (1) Branch — agent gets its own timeline at fork LSN L; full read/write freedom, zero blast radius. (2) Validate — a deterministic, non-LLM gate inspects the branch delta: schema diffs against policy, invariant queries (balances sum, foreign keys hold, row-count deltas within bounds), test suites run on the branch. (3) Merge — replay the branch’s logical changes onto main as one ordinary transaction, with OCC-style validation that main hasn’t conflictingly moved past L; on conflict, re-fork from the new head and let the agent reconcile. Notice what this is: Calvin’s insight, rediscovered at the workflow layer. The agent’s exploration was interactive, but the thing that touches main is a declared, validated, replayable change set — a deterministic transaction with a known write set, which is why the merge step needs no human and no 2PC, just one serialization point. The branch absorbed the interactivity; the merge inherits the auditability. That division of labor — speculation on cheap CoW timelines, commitment through a narrow deterministic gate — is, I’d argue, the transaction model of the agentic era.

Readings

Read Before Thursday

1
Calvin: Fast Distributed Transactions for Partitioned Database Systems — Thomson, Diamond, Weng, Ren, Shao & Abadi, SIGMOD 2012.Focus on §3 (the sequencer) and §3.2.1 (dependent transactions / OLLP) — convince yourself why determinism really removes 2PC, and what it costs.
2
Spanner: Google’s Globally-Distributed Database — Corbett et al., OSDI 2012.Read §3 (TrueTime) and §4.1.2 (commit wait) closely; skim the rest. Work the invariant: why must locks be held through the wait?
3
Neon architecture posts: “Architecture decisions in Neon” & the pageserver/branching deep-dives — Neon engineering blog, 2022–24.Focus on GetPage@LSN, the layer-file map, and what branch creation actually writes — verify Thursday’s O(metadata) claim against their design.
Exercises

This Week’s Problems

Exercise 8.1 · warm-up

A Spanner-style system has clock uncertainty ε = 4 ms and intra-region Paxos replication latency of 2 ms per round. (a) Estimate the minimum latency of a single-shard read-write commit, including commit wait. (b) Now a transaction spans 3 shards: add 2PC (prepare + commit, each Paxos-replicated). (c) A Calvin deployment with 10 ms sequencing epochs runs the same 3-shard transaction with a fully known write set. Which system acknowledges first, and by roughly how much? (d) One sentence: which assumption, if violated, flips your answer to (c)?

Exercise 8.2 · core

Extend Thursday’s arithmetic. A 200 GB parent (8 KB pages) serves an agent platform that forks b branches/hour; each branch dirties d% of pages, lives m minutes if completed (70% of branches) and forever if its agent crashed without cleanup (30%), unless a reaper kills branches idle > R minutes. (a) Derive steady-state branch storage as a function of (b, d, m, R). (b) For b = 2,000, d = 0.5, m = 10: plot (or tabulate) storage at R = 15, 60, 1440 minutes. (c) The oldest live fork point pins parent WAL at 5 GB/hour of write traffic — add that term and recompute. (d) Recommend an R, and state the failure mode of setting it too low for agents with long think-times.

Exercise 8.3 · stretch

Design a merge protocol for branch–validate–merge and analyze its isolation guarantee. Agents fork at LSN L, produce a logical change set C with recorded read set R, and merge replays C onto main iff no committed-since-L transaction wrote into R. (a) Prove this yields serializability for merged change sets, or exhibit a counterexample — pay attention to write skew between two branches whose read sets overlap but whose write sets don’t, and to invariants spanning rows neither branch read. (b) Propose the cheapest strengthening that closes any gap you found (predicate read tracking? rw-antidependency checks à la SSI? invariant re-validation at merge?), and bound its cost in pages tracked per branch. (c) Open question, argue a position: should merges of commutative change sets (e.g., inserts into disjoint keyspaces) skip the serialization point entirely, CRDT-style — and what does your answer concede about external consistency for agents that communicate out of band? There is no settled answer; we’ll workshop the best two designs in Friday’s lab.