DATA 2027 · Week 11 · Part III — Semantics, Agents, Governance

Memory Is a Database Problem

Every agent memory system shipped since 2023 is a storage engine wearing a trench coat. This week we take the coat off and grade what’s underneath.

Lecture 1 — Agent memory systems, read as database designs · Lecture 2 — Temporal knowledge, consolidation, and forgetting

Lecture 1 · Tuesday

Agent Memory Systems, Read as Database Designs

Strip the branding off any memory product and you find a workload spec.

L1 · The Workload

Four requirements, no branding

Episodic append — cheap, sequential, never blocks the response.
Semantic recall — “what do I know that bears on this?”
Namespace isolation — leaks between users are privacy breaches.
Cheap point reads — timezone lookups cost microseconds, not embeddings.

L1 · The Workload

You learned these in week two

Episodic append → sequential append
Semantic recall → secondary indexes

Namespace isolation → multi-tenancy
Point reads → primary-key lookup

The physics didn’t change. The client did.

L1 · The Workload

Writes are not the hard part

~200k

episodic appends/day for a support agent at 10k conversations × 20 turns — trivial; a single Postgres instance yawns at it. The hard part is recall on every turn.
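
The arithmetic behind that number, as a quick sketch. The 10x peak-to-average factor is a made-up assumption for illustration, not a figure from the lecture.

```python
# Back-of-envelope write rate for the support-agent workload on this slide.
conversations_per_day = 10_000
turns_per_conversation = 20
appends_per_day = conversations_per_day * turns_per_conversation

seconds_per_day = 24 * 60 * 60
avg_appends_per_second = appends_per_day / seconds_per_day

# Hypothetical burst factor: assume peak traffic is 10x the daily average.
peak_factor = 10
peak_appends_per_second = avg_appends_per_second * peak_factor

print(appends_per_day)                    # 200000
print(round(avg_appends_per_second, 1))   # 2.3
print(round(peak_appends_per_second, 1))  # 23.1
```

Even under the assumed burst factor the write rate stays in the tens per second, which is why the slide puts the difficulty on the read side.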

L1 · The Workload

Why you must index

17s → 1.4s

p95 end-to-end latency: full-history replay vs. retrieval over distilled memories on LOCOMO-length conversations (Mem0 paper) — with >90% fewer tokens billed. Not an optimization; the difference between a product and a demo.

L1 · Semantic Recall

Recall means hybrid retrieval

Vector similarity — catches paraphrase.
BM25 — exact tokens: error codes, invoice numbers.
Graph traversal — multi-hop: “who manages this service’s owner?”
Once you can’t scan, you index — then decide what the index is over.
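
One common way to combine the three retrievers is reciprocal rank fusion. This is a swapped-in illustration, not something the lecture or any of the three systems prescribes; the memory ids and the constant k=60 are assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists; each hit contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, memory_id in enumerate(ranking, start=1):
            scores[memory_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is stable.
    return sorted(scores, key=lambda m: (-scores[m], m))

# Hypothetical result lists from the three retrievers, best hit first.
vector_hits = ["m3", "m1", "m7"]   # paraphrase matches
bm25_hits = ["m1", "m9", "m3"]     # exact-token matches
graph_hits = ["m1", "m4"]          # multi-hop neighbours

fused = reciprocal_rank_fusion([vector_hits, bm25_hits, graph_hits])
print(fused[:2])  # ['m1', 'm3']
```

A memory that all three indexes agree on rises to the top even if no single retriever ranked it first everywhere.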

L1 · Mem0

Mem0: an LLM-driven upsert pipeline

Mem0 (arXiv 2504.19413) write path: extraction, then consolidation choosing ADD / UPDATE / DELETE / NOOP.
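
A toy sketch of the four-way decision, to make the write path concrete. In Mem0 the choice is made by an LLM over retrieved similar memories; the rule-based `decide` below is a hypothetical stand-in, and the keyed fact format is an assumption made only so the example runs.

```python
from enum import Enum

class Op(Enum):
    ADD = "ADD"
    UPDATE = "UPDATE"
    DELETE = "DELETE"
    NOOP = "NOOP"

def decide(store, key, value):
    """Hypothetical rule-based stand-in for the LLM consolidation call."""
    if key not in store:
        return Op.NOOP if value is None else Op.ADD
    if value is None:              # the extractor reports a retraction
        return Op.DELETE
    return Op.NOOP if store[key] == value else Op.UPDATE

def apply(store, key, value):
    """Apply one extracted fact to the store and report the chosen op."""
    op = decide(store, key, value)
    if op in (Op.ADD, Op.UPDATE):
        store[key] = value
    elif op is Op.DELETE:
        del store[key]
    return op

store = {}
key = ("acct-17", "contact")       # hypothetical fact key
extracted = [(key, "Marcus"), (key, "Marcus"), (key, "Dana"), (key, None)]
ops = [apply(store, k, v) for k, v in extracted]
print([op.value for op in ops])    # ['ADD', 'NOOP', 'UPDATE', 'DELETE']
```

Note what the sketch shares with the real pipeline: UPDATE and DELETE mutate the store in place, so nothing records what was there before.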

L1 · Mem0

The payoff is real

+26%

relative improvement over OpenAI’s built-in memory on LOCOMO question answering (LLM-as-judge). The graph variant Mem0ᵍ adds a couple more points on temporal and multi-hop questions.

L1 · Mem0

Now grade it as a database

Durability: facts the extractor skips were never written.
There is no redo log for attention.
Determinism: same transcript twice → different stores.
Audit: UPDATE keeps no tombstone, no before-image, no provenance.
All fixable — fact log, source citations, versioned writes.
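
The three fixes can be sketched together as an append-only fact log. The class and field names here are hypothetical, and this is one possible layout rather than how Mem0 stores anything.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class FactVersion:
    key: str
    value: Optional[str]      # None marks a tombstone
    version: int
    source_episode: str       # provenance: the episode that justified the write
    before: Optional[str]     # before-image of the value being replaced

class FactLog:
    """Append-only log; the current state is derived, never overwritten."""

    def __init__(self):
        self._log = []

    def current(self, key):
        for rec in reversed(self._log):
            if rec.key == key:
                return rec.value
        return None

    def history(self, key):
        return [rec for rec in self._log if rec.key == key]

    def write(self, key, value, source_episode):
        rec = FactVersion(key, value, len(self.history(key)) + 1,
                          source_episode, self.current(key))
        self._log.append(rec)
        return rec

log = FactLog()
log.write("contact", "Marcus", "ep-001")   # hypothetical episode ids
log.write("contact", "Dana", "ep-214")
log.write("contact", None, "ep-300")       # delete leaves a tombstone
print(log.history("contact")[1].before)    # Marcus
```

Every write now answers the audit questions: what changed, from what, and on whose say-so.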

L1 · MemGPT

MemGPT: the buffer pool, rediscovered

Main context = RAM: scarce, fast, fixed-size.
External context: recall + archival storage, unbounded.
Reachable only via model-issued function calls.

Main context = buffer pool.
Eviction-with-summary = lossy page replacement.
Warning at ~70% occupancy = high-water mark flush.
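
A minimal sketch of the buffer-pool reading. Assumptions: tokens are counted as whitespace-separated words, the low-water mark of 50% is invented, and the "summary" is a fixed placeholder string where the real system would call the model. In MemGPT itself the model decides what to save once it sees the warning; here eviction is automatic to keep the example short.

```python
class MainContext:
    """Fixed-size context with a high-water mark and lossy eviction."""

    def __init__(self, capacity_tokens=100, high_water_pct=70, low_water_pct=50):
        self.capacity = capacity_tokens
        self.high_water_pct = high_water_pct
        self.low_water_pct = low_water_pct   # assumed; not from the lecture
        self.messages = []                   # in context, oldest first
        self.archival = []                   # external storage, unbounded
        self.summary = ""

    @staticmethod
    def tokens(text):
        return len(text.split())             # crude stand-in for a tokenizer

    def used(self):
        return self.tokens(self.summary) + sum(self.tokens(m) for m in self.messages)

    def append(self, message):
        self.messages.append(message)
        if self.used() * 100 > self.high_water_pct * self.capacity:
            self._evict()

    def _evict(self):
        # Page out the oldest messages until occupancy is at or below low water.
        while self.messages and self.used() * 100 > self.low_water_pct * self.capacity:
            self.archival.append(self.messages.pop(0))
        # Lossy page replacement: the detail now lives only in archival storage.
        self.summary = f"[{len(self.archival)} earlier messages summarized]"

ctx = MainContext()
for i in range(10):
    ctx.append(" ".join([f"m{i}"] + ["w"] * 9))   # ten 10-token messages
print(len(ctx.messages), len(ctx.archival))        # 4 6
```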

L1 · MemGPT

The application as its own buffer manager

The genuinely novel move: the model runs page replacement, via tool calls.
DBAs spent thirty years learning why application-managed caching is hard.
Hint rot, working-set misestimation, no global view — all reappear.
Agents forget to page in and confabulate instead of faulting.

L1 · Field Note

The eight-month primary key

Sales agent kept calling a customer by his predecessor’s name.
Consolidation judged “new contact is Dana” a NOOP vs. “contact is Marcus.”
Similar embeddings, different truth.
The fix wasn’t a better prompt — it was a unique constraint.
One contact_for(account) per account; writes invalidate the old row.
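
The constraint from the field note, sketched with Python's built-in sqlite3 and a partial unique index. The column names, account id, and dates are hypothetical; only the contact_for relation and the one-row-per-account rule come from the story.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contact_for (
    account   TEXT NOT NULL,
    contact   TEXT NOT NULL,
    t_valid   TEXT NOT NULL,
    t_invalid TEXT              -- NULL means "still current"
);
-- At most one current contact per account; closed rows are exempt.
CREATE UNIQUE INDEX one_current_contact
    ON contact_for(account) WHERE t_invalid IS NULL;
""")

def set_contact(conn, account, contact, as_of):
    """Invalidate the old row, then insert the successor, in one transaction."""
    with conn:
        conn.execute(
            "UPDATE contact_for SET t_invalid = ? "
            "WHERE account = ? AND t_invalid IS NULL",
            (as_of, account),
        )
        conn.execute(
            "INSERT INTO contact_for (account, contact, t_valid) VALUES (?, ?, ?)",
            (account, contact, as_of),
        )

set_contact(conn, "acct-17", "Marcus", "2024-07-01")
set_contact(conn, "acct-17", "Dana", "2025-03-17")

# A write that skips invalidation is rejected by the index, not by a prompt.
try:
    with conn:
        conn.execute(
            "INSERT INTO contact_for (account, contact, t_valid) "
            "VALUES ('acct-17', 'Marcus', '2025-04-01')"
        )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

current = conn.execute(
    "SELECT contact FROM contact_for WHERE account = 'acct-17' AND t_invalid IS NULL"
).fetchall()
print(current, rejected)  # [('Dana',)] True
```

The consolidation step can still misjudge similarity; the difference is that a second "current" contact can no longer be stored silently.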

L1 · Report Card

Three systems, one rubric

Property	Mem0	MemGPT	Zep / Graphiti	A DBMS would say
Write path	LLM extract → LLM upsert	Self-directed tool calls	Edges into temporal graph	Log first, derive later
Read path	Vector top-k over facts	Model-issued search	Cosine + BM25 + graph	Optimizer picks the path
Durability	Lossy at extraction	Lossy at eviction	Edges invalidated, kept	WAL or it didn’t happen
Audit	Overwrites destroy history	Edits unversioned	Bi-temporal lineage	Every version AS OF
Consistency	Async → stale reads	Serial until you shard	Summaries can lag	Define isolation, enforce it

Lecture 2 · Thursday

Temporal Knowledge, Consolidation, and Forgetting

Mutable key–value memory is the wrong data model. The right one is from the 1990s.

L2 · Temporal Model

Facts are intervals, not values

“Alice is on-call” was true March 3 to March 17.
When Bob takes over, the Alice-edge becomes bounded, not false.
Contradiction closes intervals; it never deletes rows.
The past stays queryable.

L2 · Graphiti

Bi-temporal edges: two timelines per fact

Valid time: t_valid / t_invalid
When the fact held in the world.

Transaction time: learned / superseded
The database’s own epistemic history.

Textbook bi-temporality: Snodgrass’s TSQL2 (1995); SQL:2011 later standardized application-time and system-versioned tables.

L2 · Graphiti

Contradiction as invalidation

The Dana edge closes the Marcus edge’s interval (t_invalid := Mar 17) but preserves the row; “current truth” is t_invalid IS NULL.

L2 · Graphiti

Three queries overwrite stores can’t answer

What is true now — filter t_invalid IS NULL.
What was true on April 2 — interval containment.
What did we believe on April 2 — transaction-time variant.
The last one determines liability: belief at action time.
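
The three queries, sketched over a textbook bi-temporal layout in which a correction supersedes the old row and writes a new one. This is an assumption for illustration, not Graphiti's actual schema; the years and the April 10 learning date are invented. Because current rows here must also be unsuperseded, "true now" adds that check to the t_invalid filter.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Edge:
    fact: str
    t_valid: str                # valid time: when the fact started holding
    t_invalid: Optional[str]    # valid time end; None = still true
    learned: str                # transaction time: when the store recorded it
    superseded: Optional[str]   # when the store stopped believing this row

def contains(start, end, day):
    """Half-open interval test on ISO dates (string order matches date order)."""
    return start <= day and (end is None or day < end)

edges = [
    # Original belief, replaced on April 10 when the handover was learned.
    Edge("contact is Marcus", "2025-01-01", None, "2025-01-05", "2025-04-10"),
    # Corrected belief: Marcus's interval is closed at March 17.
    Edge("contact is Marcus", "2025-01-01", "2025-03-17", "2025-04-10", None),
    Edge("contact is Dana", "2025-03-17", None, "2025-04-10", None),
]

def true_now(edges):
    return [e.fact for e in edges if e.superseded is None and e.t_invalid is None]

def true_on(edges, day):
    # Valid-time question, answered with today's knowledge.
    return [e.fact for e in edges
            if e.superseded is None and contains(e.t_valid, e.t_invalid, day)]

def believed_on(edges, day):
    # Transaction-time variant: only rows the store held on that day.
    return [e.fact for e in edges
            if contains(e.learned, e.superseded, day)
            and contains(e.t_valid, e.t_invalid, day)]

print(true_now(edges))                   # ['contact is Dana']
print(true_on(edges, "2025-04-02"))      # ['contact is Dana']
print(believed_on(edges, "2025-04-02"))  # ['contact is Marcus']
```

The last two answers differ for the same date: Dana was the contact on April 2, but the store did not know it until April 10.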

L2 · Graphiti

Temporal structure pays rent

94.8%

Zep on Deep Memory Retrieval vs. MemGPT’s 93.4% — plus up to 18.5% accuracy gains on LongMemEval at ~90% lower latency than full-context baselines. It retrieves the right version of a fact.

L2 · Prior Art

The field did this in the 90s

Kimball’s Type-2 slowly-changing dimension, circa 1996.
Expire the row, insert a successor; history joins as it was.
Graphiti = Type-2 SCD with an LLM deciding row identity.
Power: natural-language contradiction. Risk: probabilistic expiry trigger.
Ask vendors: which TSQL2 query classes? How do you coalesce intervals?

L2 · Consolidation

Consolidation is LSM compaction

Raw episodes are L0: small, recent, overlapping, fast to write.
Background passes merge into deduplicated, contradiction-resolved runs.
Same design questions: when to compact, what it costs.
Here compaction cost is LLM tokens — a real dollar number.
Read amplification before vs. write amplification after.
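
A toy version of the compaction pass. The merge rule here is deterministic newest-wins, standing in for the LLM pass the slide describes; the entries and the word-count token estimate are assumptions.

```python
def compact(l0_entries):
    """Merge overlapping L0 entries into one run: newest value per key wins."""
    run = {}
    for seq, key, value in sorted(l0_entries):
        run[key] = (seq, value)
    return run

# Hypothetical L0: (sequence number, key, value), small and overlapping.
l0 = [
    (1, "contact", "Marcus"),
    (2, "timezone", "UTC+1"),
    (3, "contact", "Dana"),
    (4, "timezone", "UTC+1"),
]

run = compact(l0)

# Before compaction a read scans every entry; afterwards one entry per key.
read_amplification_before = len(l0) / len(run)

# Compaction cost: every entry is read once. Tokens approximated as words.
tokens_read = sum(len(f"{key} {value}".split()) for _, key, value in l0)

print(run)                        # {'contact': (3, 'Dana'), 'timezone': (4, 'UTC+1')}
print(read_amplification_before)  # 2.0
print(tokens_read)                # 8
```

Multiply tokens_read by whatever your model charges per token and the "when to compact" question becomes a budget line.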

L2 · Forgetting

Forgetting is garbage collection

TTL deletion (“drop after 90 days”) is the blunt instrument.
Decay-scored eviction, scoring each memory as e^(−Δt/τ) + use count (Δt = elapsed time, τ = decay constant), is generational GC.
Bi-temporal stores can forget retrievability without forgetting the record.
Exactly the split GDPR-era systems need.
Overwrite memories have one delete — and it destroys evidence.
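
A sketch of decay-scored eviction that hides memories from retrieval without deleting them. The 30-day τ, the 0.1 weight on use count, and the 0.2 threshold are all invented parameters; the slide only gives the shape of the score.

```python
import math

def retention_score(age_days, use_count, tau_days=30.0, use_weight=0.1):
    """Exponential recency decay plus a weighted use count."""
    return math.exp(-age_days / tau_days) + use_weight * use_count

# Hypothetical memories: fresh, old and unused, old but frequently recalled.
memories = [
    {"id": "a", "age_days": 1, "use_count": 0},
    {"id": "b", "age_days": 120, "use_count": 0},
    {"id": "c", "age_days": 120, "use_count": 8},
]

THRESHOLD = 0.2
for m in memories:
    m["score"] = retention_score(m["age_days"], m["use_count"])
    # Forget retrievability, not the record: the row stays for audit.
    m["retrievable"] = m["score"] >= THRESHOLD

print([(m["id"], m["retrievable"]) for m in memories])
# [('a', True), ('b', False), ('c', True)]
```

Memory b drops out of recall but is still in the list, which is the split between retrievability and record that the slide points to.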

L2 · Open Problem

The semantic phantom

A belief matching no committed base fact stays readable because it lives in a derived representation — a phantom-read analogue.

L2 · Open Problem

Why no classical answer fits

Sync maintenance (materialized view): correct, but LLM on write path.
Snapshot semantics: summaries declare an episode-LSN; nobody ships it.
OCC: validate watermark before commit — but actions can’t abort.
You cannot unsend an email; abort must become block before side effect.
Dependency tracking + isolation level + agent architecture, all at once.
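
One way to read "abort must become block before side effect", as a sketch. All names are hypothetical, and the check is deliberately coarse: any episode newer than the summary's watermark blocks the action, where real dependency tracking would block only on relevant ones.

```python
class StaleBelief(Exception):
    """Raised instead of acting on a summary that lags the episode log."""

class MemoryStore:
    def __init__(self):
        self.episodes = []       # committed base facts
        self.summary = None      # derived representation
        self.summary_lsn = 0     # watermark: newest episode the summary covers

    @property
    def episode_lsn(self):
        return len(self.episodes)

    def append_episode(self, text):
        self.episodes.append(text)

    def rebuild_summary(self, text):
        self.summary = text
        self.summary_lsn = self.episode_lsn

def act(store, side_effect):
    """Validate the watermark first; an email cannot be unsent afterwards."""
    belief, watermark = store.summary, store.summary_lsn
    if watermark < store.episode_lsn:
        raise StaleBelief(
            f"summary covers LSN {watermark}, log is at {store.episode_lsn}"
        )
    return side_effect(belief)

sent = []                        # stands in for an outbound email queue
store = MemoryStore()
store.append_episode("contact is Marcus")
store.rebuild_summary("contact: Marcus")
act(store, sent.append)          # summary is current, so the action proceeds

store.append_episode("new contact is Dana")
try:
    act(store, sent.append)      # summary now lags the log
    blocked = False
except StaleBelief:
    blocked = True

print(sent, blocked)             # ['contact: Marcus'] True
```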

An agent that cannot say why it believes something is just a cache with opinions.

— Week 11 lecture notes, DATA 2027

Checkpoint · Discussion

Before you leave

Why can valid-time and transaction-time AS OF queries disagree for the same date?
Where exactly can Mem0 lose a fact despite a durable episode log?
Can belief-snapshot isolation kill semantic phantoms without a synchronous LLM write?

Readings

Read before Thursday

Mem0 — Chhikara et al., arXiv:2504.19413. Read §3 as a write-path spec.
Zep — Rasmussen et al., arXiv:2501.13956. Map t_valid/t_invalid onto SQL:2011.
MemGPT — Packer et al., arXiv:2310.08560. Read the OS metaphor adversarially.