So Daniel sent us this one, and I'll read it out. He writes: AI memory frameworks often promise an AI that never forgets, but in practice, an AI that remembers every mundane detail would be unhelpful and counterproductive. The real utility comes from intelligent distillation — saving what's relevant, like user preferences, working patterns, and key decisions, while letting noise decay. Equally important is reconciling conflicts when new memories update or contradict old ones. This mirrors how human memory works: we don't remember conversations verbatim, we consolidate and update our mental models. He wants us to dig into what processes actually exist for intelligent distillation, relevance scoring, memory decay, deduplication, and conflict reconciliation. How do frameworks like mem0, Letta, Zep, and others approach the "what's worth remembering" problem? And are any of them doing something genuinely smart, or is most of this space just naive append-only stores with similarity search slapped on top?
That last question is the one I've been sitting with, because the honest answer is: it's a spectrum. Some of these frameworks are doing genuinely clever things. But the baseline — the thing most production deployments are actually running — is pretty embarrassing once you look at it closely.
Let's start there then. What does the naive baseline actually look like, and why is it so bad?
So the simplest implementation, and what basically every tutorial walks you through, is append-only vector storage. Every conversation chunk gets embedded and dropped into a vector database — Pinecone, Chroma, Weaviate, take your pick — and at query time you do a cosine similarity search to pull back the top-k most similar chunks. It's retrieval-augmented generation applied to conversation history.
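To make the baseline concrete, here's a minimal toy sketch of that append-only pattern — hand-made two-dimensional vectors and a hand-rolled cosine function stand in for a real embedding model and vector database; nothing here is any specific framework's API:

```python
import math

class AppendOnlyStore:
    """Toy version of the naive baseline: every chunk is appended,
    nothing is ever merged, updated, or deleted."""

    def __init__(self):
        self.entries = []  # (vector, text) pairs, append-only

    def add(self, vector, text):
        self.entries.append((vector, text))

    def top_k(self, query, k=3):
        # Rank every stored chunk by cosine similarity to the query.
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0
        ranked = sorted(self.entries, key=lambda e: cos(query, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]

store = AppendOnlyStore()
# Fifty near-identical "prefers Python" facts, then one contradiction.
for _ in range(50):
    store.add([1.0, 0.0], "User prefers Python")
store.add([0.9, 0.1], "User switched to Go")

print(store.top_k([1.0, 0.0], k=3))
# → ['User prefers Python', 'User prefers Python', 'User prefers Python']
```

The stale fact dominates retrieval: the store has fifty-one entries, no notion of which is current, and the one correct fact never makes the top-k.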
Which sounds reasonable on the surface.
It sounds reasonable until you think about what happens over time. Say a user mentions "I prefer Python" in fifty separate conversations. You now have fifty nearly-identical facts in your store. No deduplication, no consolidation — just fifty copies. And then in month three they say "I switched to Go." Now you have fifty-one facts, and fifty of them are wrong, but they all have equal retrieval weight. The store has no concept of which fact is current.
So the system is confidently wrong, but at scale.
And there's a third problem that's more subtle. Semantic similarity is not the same as relevance. If a user mentioned six months ago that their shoes run narrow, that fact has near-zero semantic overlap with "I want to return these shoes" — different vocabulary, different framing. A similarity search won't surface it. But it's exactly the context a returns agent needs.
That's the unknown unknowns problem. You can't retrieve what you don't know to look for.
Zep actually published a really sharp blog post on this. Their framing is: agent-controlled retrieval — where the agent calls a tool to look something up — fundamentally cannot solve this class of problem. Because there's no signal in the current message that would prompt the agent to go searching for shoe-width preferences. You'd have to already know the fact exists to think to look for it.
So what's Zep's actual solution there?
Deterministic context assembly. Instead of waiting for the agent to decide what it needs, Zep pre-assembles a context block before the LLM runs — no tool calls required. Their temporal knowledge graph pre-connects disparate facts, so order details, sizing complaints, brand preferences, return patterns — they all arrive as one coherent block on every agent turn.
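A rough sketch of what deterministic assembly looks like in shape, as opposed to agent-controlled retrieval — the fact store, field names, and formatting here are all invented for illustration, not Zep's actual API:

```python
# Hypothetical pre-connected facts per user; in Zep these would come
# out of the temporal knowledge graph, not a hard-coded dict.
USER_FACTS = {
    "robbie": [
        ("order", "Order #1234: trail runners, size 10"),
        ("sizing", "Reported that this brand runs narrow"),
        ("preference", "Prefers wide-toe-box shoes"),
    ],
}

def assemble_context(user_id, message):
    """Build the context block deterministically, before the model runs.
    No tool call is needed, and no keyword in `message` has to match
    the stored facts for them to be included."""
    facts = USER_FACTS.get(user_id, [])
    lines = [f"[{kind}] {text}" for kind, text in facts]
    return "RELEVANT USER CONTEXT:\n" + "\n".join(lines) + f"\n\nUSER MESSAGE: {message}"

block = assemble_context("robbie", "I want to return these shoes")
print(block)
```

Note that "runs narrow" arrives in the context even though nothing in "I want to return these shoes" would have prompted the agent to search for it — that's the whole point of assembling before the model runs.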
Before we go deep on Zep, I want to understand the architectural split here, because it seems like the whole space is dividing into two camps. You've got vector store plus LLM-as-judge on one side, and temporal knowledge graphs on the other.
That's the right framing. And those two camps have genuinely different trade-offs. Let me take them in order because they're worth understanding properly.
By the way — today's script is coming to us courtesy of Claude Sonnet four point six, which I think is fitting given we're talking about AI systems that are supposed to remember things. Hopefully it remembers what it's doing.
Ha. No pressure.
Okay, so — mem0. That's the vector-plus-LLM-judge camp. Walk me through what they're actually doing.
Mem0 is the most popular memory framework by GitHub stars — sitting around fifty-two point nine thousand at last count, Y Combinator S24 backed, with an accepted research paper at ECAI. Their architecture is a two-phase pipeline. Phase one is extraction: they ingest the latest exchange, a rolling summary, and the most recent messages simultaneously, and use an LLM to extract candidate memories from that combined context. The rolling summary refreshes asynchronously in the background so inference doesn't stall.
And phase two is where the interesting part happens.
Phase two is the update step, and this is genuinely more sophisticated than naive append-only. Each new candidate memory gets compared against the most similar entries already in the vector database, and then an LLM makes a decision — one of four operations: ADD a new memory, UPDATE an existing one, DELETE something that's been contradicted, or NOOP if nothing needs to change. That four-operation pipeline is their core mechanism for conflict resolution and deduplication.
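The shape of that four-operation pipeline can be sketched like this — with the important caveat that the decision function below is a crude rule-based stand-in for what is, in mem0, an LLM judgment call:

```python
def decide_operation(candidate, existing):
    """Stand-in for the LLM judge: in mem0 this decision is made by a
    model, not by rules. Returns one of ADD / UPDATE / DELETE / NOOP
    plus the index of the affected memory (or None)."""
    for i, fact in enumerate(existing):
        if fact == candidate:
            return ("NOOP", i)
        if fact.split()[0] == candidate.split()[0]:  # same subject, new claim
            if "no longer" in candidate:
                return ("DELETE", i)
            return ("UPDATE", i)
    return ("ADD", None)

def apply(candidate, memories):
    op, idx = decide_operation(candidate, memories)
    if op == "ADD":
        memories.append(candidate)
    elif op == "UPDATE":
        memories[idx] = candidate
    elif op == "DELETE":
        memories.pop(idx)
    return op

memories = []
print(apply("language preference: Python", memories))   # → ADD
print(apply("language preference: Python", memories))   # → NOOP
print(apply("language preference: Go", memories))       # → UPDATE
print(memories)                                         # → ['language preference: Go']
```

The contrast with the append-only baseline is the UPDATE branch: the contradicting fact replaces the stale one instead of accumulating beside it.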
So the LLM is essentially acting as an editor, deciding what goes in the permanent record.
Right. And they have a graph variant called Mem0g that takes it further — stores memories as a directed labeled knowledge graph, runs an entity extractor, generates labeled edges between entities, and has a conflict detector that flags overlapping or contradictory nodes. An LLM-powered update resolver then decides whether to add, merge, invalidate, or skip each graph element.
Their performance numbers are pretty striking. Sixty-six point nine percent accuracy on LoCoMo, ninety-one percent lower p95 latency versus full-context — one point four four seconds versus seventeen point one two seconds — and ninety percent fewer tokens. That's not nothing.
The latency and token numbers are real and they matter a lot in production. But the accuracy story is where it gets uncomfortable. Because mem0 also reports the full-context baseline on LoCoMo — just stuff everything into the context window — and that baseline scores seventy-two point nine percent. Which is higher than even the sixty-eight point four percent mem0 reports for its graph variant, Mem0g.
Wait. The dumbest possible approach — just shoving everything in — beats the sophisticated memory system on their own benchmark?
On that benchmark, yes. And this is the benchmark mem0 chose to publish as evidence of state-of-the-art performance. Now, the full-context approach takes seventeen seconds at p95 and uses twenty-six thousand tokens, which is completely impractical in production — seventeen-second p95 latency means abandoned tickets, dead voice lines, developers who stop using your tool. So the trade-off is real. But if your specialized memory system is less accurate than the naive baseline in lab conditions, that's a problem for the benchmark, not a vindication of the system.
And then Zep came in and said the benchmark was run wrong anyway.
Zep published a detailed rebuttal. They alleged that mem0 used an incorrect user model for Zep — assigning the user role to both conversation participants — and appended timestamps to messages instead of using Zep's dedicated created-at field, which breaks temporal reasoning. And they ran searches sequentially instead of in parallel, artificially inflating Zep's latency numbers. When Zep ran the LoCoMo benchmark with what they say is a correct implementation, they claim seventy-five point one four percent — a roughly ten percent relative improvement over Mem0g.
So we have two companies fighting over a benchmark that may not be measuring the right thing anyway.
The deeper problem is that LoCoMo conversations average sixteen thousand to twenty-six thousand tokens — which fits comfortably inside modern context windows. So you're benchmarking memory systems on a task that doesn't actually require memory. It's like testing a compression algorithm on files that are already small enough to send uncompressed.
What's the better benchmark?
LongMemEval, which Zep prefers. That one uses conversations around a hundred and fifteen thousand tokens and requires genuine temporal reasoning — not just "what did the user say" but "what was true when, and how has it changed." On LongMemEval with GPT-4o, Zep scores seventy-one point two percent against a full-context baseline of sixty point two percent — a relative improvement of roughly eighteen percent. And their median latency is two point five eight seconds versus twenty-eight point nine seconds for full-context. That's better than a ten-times speedup with a meaningful accuracy gain. That's a more credible result.
Okay, so let's actually dig into what Zep is doing architecturally, because I think the temporal graph approach is genuinely different in kind, not just in degree.
Zep is built on Graphiti, their open-source temporal context graph engine — twenty-four point nine thousand GitHub stars, arxiv paper 2501.13956. The core concept is temporal fact management. Every fact in the graph — every edge — has a validity window: when it became true, and when it was superseded. When information changes, old facts are invalidated, not deleted. The history is preserved.
Give me the concrete example because that distinction matters a lot.
The example from their docs: Robbie says in September, "I only wear Adidas shoes, I love them." That creates facts: Robbie only wears Adidas, Robbie strongly favors Adidas. Then in October, Robbie says "My shoes fell apart, I'll be wearing Nike going forward." The old Adidas facts get invalidated with timestamps. New Nike facts are added. Now the agent can reason: Robbie used to love Adidas, switched to Nike after a bad experience, and the switch happened in October. That's temporal reasoning a vector store simply cannot do. A vector store would have both facts floating around with equal weight and no way to know which is current.
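A minimal sketch of that invalidation mechanic — this is the concept only, not Graphiti's actual schema or API; field names and the single-subject simplification are invented:

```python
from datetime import datetime

class TemporalFactStore:
    """Facts carry a validity window. Superseding a fact marks the old
    one invalid with a timestamp instead of deleting it, so the history
    is preserved and queryable."""

    def __init__(self):
        self.facts = []  # dicts: subject, claim, valid_from, invalid_at

    def assert_fact(self, subject, claim, at):
        # Invalidate any currently-valid fact about the same subject.
        for f in self.facts:
            if f["subject"] == subject and f["invalid_at"] is None:
                f["invalid_at"] = at
        self.facts.append({"subject": subject, "claim": claim,
                           "valid_from": at, "invalid_at": None})

    def current(self, subject):
        return [f["claim"] for f in self.facts
                if f["subject"] == subject and f["invalid_at"] is None]

    def history(self, subject):
        return [(f["claim"], f["valid_from"], f["invalid_at"])
                for f in self.facts if f["subject"] == subject]

store = TemporalFactStore()
store.assert_fact("robbie.shoes", "wears Adidas", datetime(2024, 9, 1))
store.assert_fact("robbie.shoes", "wears Nike", datetime(2024, 10, 5))

print(store.current("robbie.shoes"))   # → ['wears Nike']
print(store.history("robbie.shoes"))   # Adidas fact survives, invalidated in October
```

That `history` call is the capability a plain vector store lacks: the agent can answer both "what's true now" and "what changed, and when."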
And the retrieval is different too — it's not just similarity search.
Graphiti uses hybrid retrieval: semantic embeddings combined with BM25 keyword search combined with graph traversal. So you get the fuzzy matching of embeddings, the precision of keyword search, and the relational structure of graph traversal all at once. They also did fifty experiments varying their edge-limit and node-limit retrieval parameters, and the results are instructive. At the minimal config — five edges, two nodes — you get sixty-nine point six two percent accuracy at a hundred and forty-nine milliseconds. At their default of fifteen and five, seventy-seven point zero six percent at a hundred and ninety-nine milliseconds. At the maximum config of thirty and thirty, eighty point three two percent at a hundred and eighty-nine milliseconds but with two thousand tokens.
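To illustrate why the three signals combine usefully, here's a toy score-fusion sketch. Graphiti's real ranking is more involved (reciprocal-rank fusion, rerankers), and every weight and number below is invented — the point is only that a graph-proximity signal can surface a fact the embedding similarity alone would bury:

```python
def hybrid_score(semantic, keyword, graph_distance,
                 w_sem=0.4, w_kw=0.2, w_graph=0.4):
    """Illustrative linear fusion of three retrieval signals.
    Closer in the graph => higher score; 1/(1+hops) maps hop count
    into (0, 1]."""
    graph_score = 1.0 / (1.0 + graph_distance)
    return w_sem * semantic + w_kw * keyword + w_graph * graph_score

candidates = [
    # (fact, cosine similarity, normalised keyword score, hops from query entity)
    ("Robbie's shoes run narrow", 0.12, 0.05, 1),
    ("Unrelated chat about weather", 0.35, 0.00, 4),
    ("Return requested for order #1234", 0.30, 0.60, 2),
]

ranked = sorted(candidates, key=lambda c: hybrid_score(c[1], c[2], c[3]),
                reverse=True)
print([fact for fact, *_ in ranked])
```

Notice the narrow-shoes fact outranks the higher-similarity weather chatter purely because it sits one hop from the query entity in the graph — that's the relational signal doing work that embeddings can't.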
The diminishing returns curve is steep.
Very steep. Going from the minimal config to twenty and twenty buys you ten point four percentage points for four times the tokens. Going from twenty and twenty to thirty and thirty buys you zero point two six points for one point five times the tokens. And at the minimal config, nearly one in four questions had insufficient context. So there's a real design decision there about where you want to sit on that curve.
Let's talk about Letta, because I think they're doing the most philosophically interesting thing in the space.
Letta — formerly MemGPT — draws explicit inspiration from operating system memory management. The original MemGPT paper from October 2023 proposed treating LLMs like operating systems: just as an OS uses virtual memory to give programs the appearance of abundant memory by paging data between RAM and disk, MemGPT manages memory tiers to give LLMs the appearance of unlimited context. But the really novel thing they've done more recently is sleep-time compute.
Which is exactly what it sounds like.
It is. They published the sleep-time paper in April 2025. The architecture is a two-agent system. The primary agent handles conversations — it's a fast model, something like GPT-4o-mini, and critically, it does not have tools to edit its own core memory. Then there's a sleep-time agent that runs asynchronously during downtime, has access to edit the primary agent's memory blocks, and uses a stronger, slower model — GPT-4.1, Claude 3.7 Sonnet, something in that class.
So the expensive thinking happens when no one's waiting for a response.
Right. And the sleep-time agent's job is consolidation — reorganizing, deduplicating, cleaning up memories that formed incrementally during conversations. Because incremental formation is inherently messy. You learn things piecemeal, in whatever order the user happens to mention them, with whatever phrasing they happen to use. The sleep-time agent can step back and produce clean, structured, well-organized memories from that raw material.
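A toy sketch of the consolidation idea — in Letta the sleep-time agent is a stronger LLM rewriting memory blocks; here a trivial normalise-and-dedup pass stands in for that, just to show the shape of a background cleanup step:

```python
def consolidate(raw_memories):
    """Stand-in for a sleep-time consolidation pass: collapse the
    near-duplicate, inconsistently phrased memories that accumulate
    during incremental formation into one clean list."""
    seen = set()
    cleaned = []
    for m in raw_memories:
        key = m.strip().lower()   # normalise before comparing
        if key not in seen:
            seen.add(key)
            cleaned.append(m.strip())
    return cleaned

# Messy incremental memories formed across several conversations.
raw = [
    "user works at a fintech startup",
    "User works at a fintech startup",
    "  user works at a fintech startup ",
    "prefers async communication",
]
cleaned = consolidate(raw)
print(cleaned)
# → ['user works at a fintech startup', 'prefers async communication']
```

The architectural point is *when* this runs: asynchronously, during downtime, with no user waiting — so the expensive model and the slow reorganisation don't cost you conversation latency.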
This is the thing that maps most directly to how human memory actually works. We don't consolidate during the conversation — we do it during sleep. Literally.
The neuroscience parallel is real. Memory consolidation during sleep isn't just a metaphor — it's a fundamental mechanism of how the hippocampus transfers information to long-term cortical storage. Letta is implementing something structurally similar: a background process that runs during downtime to reorganize what was learned during active use.
And the core memory architecture — the always-in-context blocks — is distinct from the sleep-time layer.
Letta's memory blocks are sections of the agent's context window that are always visible. No retrieval required. They're prepended to the agent's prompt in a structured format — a human block for what the agent knows about the user, a persona block for the agent's own identity, and custom blocks for whatever the application needs. The key insight is that agents can autonomously read and update these blocks using built-in memory tools. The agent is the author of its own memory state.
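The always-in-context idea can be sketched like this — block labels follow the human/persona convention described above, but the functions and prompt format are invented for illustration and differ from Letta's actual SDK:

```python
# Hypothetical memory blocks; Letta's real block objects carry more metadata.
memory_blocks = {
    "persona": "I am a concise coding assistant.",
    "human": "Name: Daniel. Prefers Go since March.",
}

def build_prompt(user_message):
    """Blocks are always prepended to the prompt -- no retrieval step,
    so the agent can never fail to 'find' its core memory."""
    header = "\n".join(f"<{label}>\n{content}\n</{label}>"
                       for label, content in memory_blocks.items())
    return f"{header}\n\n{user_message}"

def memory_replace(label, old, new):
    """Sketch of a self-edit tool the agent itself can call,
    making the agent the author of its own memory state."""
    memory_blocks[label] = memory_blocks[label].replace(old, new)

memory_replace("human", "Prefers Go", "Prefers Rust")
prompt = build_prompt("What language should I use?")
print(prompt)
```

Because the blocks live inside every prompt, an edit like the one above changes what the agent "knows" on the very next turn, with no index to refresh.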
Which is fundamentally different from external memory systems where something else is deciding what to store.
And then there's the finding that should embarrass everyone in the space. Letta tested what they call a filesystem agent — it just stores conversation histories in a flat file. On LoCoMo, it scored seventy-four point zero percent. Beating specialized memory libraries.
I keep coming back to that result. If a flat file beats your sophisticated memory system, what does that say?
It says the benchmark is broken, primarily. A flat file fails catastrophically at scale — you can't efficiently search a ten-gigabyte conversation history file. But it also says that for the task LoCoMo is actually testing, the complexity of sophisticated memory systems isn't justified. The honest question is: at what scale and use-case complexity does sophisticated memory start paying off? And I don't think anyone has a clean answer.
Let's talk about LangMem, because it's doing something different from all three of these — it's more of a primitives approach.
LangMem is from LangChain, natively integrated with LangGraph. Rather than being an opinionated system, it gives developers composable primitives. They map to the cognitive science taxonomy: semantic memory for facts and knowledge, episodic memory for past experiences, procedural memory for system behavior — rules that evolve through feedback.
The procedural memory piece is the one I find most interesting.
It's genuinely novel. LangMem can optimize the agent's own system prompt based on experience. If the agent consistently fails at a certain type of task, the prompt optimizer rewrites the instructions to improve future behavior. It's the AI equivalent of updating your mental model of how to do a job — not just storing new facts, but changing how you approach problems.
And they have this profile versus collection distinction that's worth unpacking.
A profile is a single document representing current state — user preferences, goals. When new information arrives, it updates the existing document rather than creating a new record. A collection is individual memory records that have to be reconciled against each other. The profile approach is cleaner for anything where you only care about the latest state. The collection approach is more powerful but introduces the reconciliation problem — you have to decide whether to delete, invalidate, update, or consolidate when new information arrives.
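The profile-versus-collection distinction in miniature — a generic sketch, not LangMem's API; the invalidation strategy in `reconcile` is one of the four choices mentioned above, picked arbitrarily for illustration:

```python
# Profile: one document representing current state. New information
# overwrites in place -- no reconciliation problem exists.
profile = {"language": "Python", "editor": "vim"}
profile["language"] = "Go"

# Collection: individual records that must be reconciled on conflict.
collection = [
    {"fact": "prefers Python", "status": "valid"},
    {"fact": "prefers dark mode", "status": "valid"},
]

def reconcile(collection, new_fact, contradicts):
    """On conflict you must choose: delete, invalidate, update, or keep
    both. Here we invalidate, preserving history like a temporal store."""
    for rec in collection:
        if rec["fact"] in contradicts:
            rec["status"] = "invalidated"
    collection.append({"fact": new_fact, "status": "valid"})

reconcile(collection, "prefers Go", contradicts={"prefers Python"})
print(profile)                                                # latest state only
print([r["fact"] for r in collection if r["status"] == "valid"])
```

The trade-off is visible in the code: the profile is trivially simple but keeps no history; the collection keeps history but forces you to write — and get right — a reconciliation policy.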
LangMem is also the most explicit about acknowledging the relevance problem. They say outright that memory relevance is more than semantic similarity — it should combine similarity with importance, and with memory strength as a function of how recently and frequently it was used.
They acknowledge it more explicitly than anyone else. But — and this is important — the implementation is left to the developer. They define the concept, they don't give you a working decay function. No framework does. Nobody has a principled, non-LLM-based way to score memory importance. Nobody implements forgetting curves. Nobody does time-based decay in any rigorous sense.
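Since no framework ships one, here is what a minimal decay function *could* look like — an exponential forgetting curve where reinforcement slows decay. Every constant here is a made-up starting point, and this is a sketch of the missing piece, not anyone's implementation:

```python
import math

def memory_strength(base_importance, hours_since_access, access_count,
                    half_life_hours=72.0):
    """Exponential forgetting curve with reinforcement: memories that
    are accessed often decay more slowly (longer effective half-life).
    All constants are illustrative, not tuned."""
    effective_half_life = half_life_hours * (1 + math.log1p(access_count))
    decay = 0.5 ** (hours_since_access / effective_half_life)
    return base_importance * decay

# A frequently reinforced memory touched an hour ago vs. a memory
# untouched for three months.
fresh = memory_strength(1.0, hours_since_access=1, access_count=5)
stale = memory_strength(1.0, hours_since_access=24 * 90, access_count=0)
print(round(fresh, 3), stale)
```

A retrieval layer could multiply similarity scores by this strength, so stale unreinforced facts sink in the rankings instead of competing at full weight forever — which is exactly the failure mode in the fifty-copies example earlier.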
That feels like the central unsolved problem. The "what's worth remembering" question.
It is the central unsolved problem. And when you look at how each framework approaches it, you see the same pattern everywhere. Mem0 uses an LLM to make the ADD/UPDATE/DELETE/NOOP decision — which means the quality of memory distillation is entirely dependent on the quality of the LLM's judgment, and that judgment is opaque and hard to audit. Zep extracts everything into the graph and relies on retrieval quality to surface what's relevant. Letta lets the agent self-edit its own memory blocks, which is the agent using its own judgment. LangMem acknowledges the multi-factor relevance problem and hands it back to the developer. It's LLM judgment calls all the way down, with no principled framework underneath.
What's actually missing that nobody has built yet?
A few things. First, principled importance scoring — something that doesn't just rely on the LLM to decide what matters, but has a structured model of importance that can be audited and tuned. Second, true memory decay — actual forgetting curves where memories that haven't been accessed or reinforced in a long time lose retrieval weight. Human memory has this, and it's a feature, not a bug. Third, cross-user memory — what happens when the same fact is relevant across many users? A product bug that affects everyone should be a shared memory, not a thousand separate private copies. Fourth, memory provenance for debugging — when the AI says something wrong because of a bad memory, how do you trace it back to the source and fix it? Zep's bi-temporal tracking with full provenance is the closest thing to this, but it's not a complete solution.
And then there's the privacy dimension, which I don't think gets enough attention.
An AI that remembers everything creates serious problems. What happens when a user wants to be forgotten? What if the system surfaces something said in a moment of distress, in a context where it's inappropriate? Mem0 is SOC 2 and HIPAA compliant with bring-your-own-key encryption. Zep is SOC 2 Type II certified. The compliance boxes are getting checked. But the behavioral question — when should an AI choose not to surface a sensitive memory even when it's technically relevant — is completely unaddressed by any framework. That's a product design problem, and it's genuinely hard.
Let me try to synthesize what's actually smart versus what's marketing in this space, because I think that's what Daniel is really asking.
Fair. What's genuinely smart: Zep's temporal fact invalidation — not deleting old facts but marking them as superseded with timestamps — is a real architectural insight that enables a class of temporal reasoning vector stores simply cannot do. Letta's sleep-time compute is genuinely novel and maps to something real about how memory consolidation works. Mem0's four-operation update pipeline is a real improvement over naive append-only, even if it's LLM-dependent. And Zep's "unknown unknowns" framing — the insight that agents can't retrieve context they don't know to look for, and the architectural response of deterministic assembly — is a genuine contribution.
And what's mostly marketing?
Claims of state-of-the-art performance on LoCoMo, which is a benchmark where the full-context baseline beats the specialized memory system. "Memory compression engine" language that describes LLM-based extraction with a vector store. And the idea that any of these systems truly mirrors how human memory works — they're all much simpler than biological memory consolidation. Human memory isn't just storage with smart retrieval — it's a reconstructive process, it's emotionally weighted, it's deeply integrated with identity and context in ways none of these frameworks touch.
So where does that leave someone who's actually building an agent system today and trying to make sensible choices?
The practical answer depends on what your agent needs to do. If you need temporal reasoning — tracking how facts change over time, what was true when, how entities relate to each other — Graphiti and Zep's approach is the most architecturally capable. The hybrid retrieval and bi-temporal tracking are real advantages. If you need something more straightforward and you're comfortable with LLM-as-judge for conflict resolution, mem0's pipeline is production-tested and the latency numbers are real. If you're building agents that need persistent identity across long sessions — the kind of agent that's supposed to know you over months or years — Letta's memory block architecture and sleep-time consolidation is the most thoughtful design. And if you're building on LangGraph and want flexibility to compose your own approach, LangMem gives you the best primitives.
But in all cases, don't benchmark on LoCoMo and call it a day.
Please don't benchmark on LoCoMo and call it a day. LongMemEval is more rigorous — it has genuine temporal reasoning requirements and conversations that don't fit in a context window. Even then, design your evaluation around your actual use case, because no current benchmark captures the full complexity of what production memory systems need to do.
The thing I keep coming back to is that the benchmark wars between these frameworks are actually revealing something important about the state of the field. The fact that they're fighting over LoCoMo — a benchmark with known flaws, where the naive baseline wins — suggests the field doesn't yet have consensus on what good memory even looks like.
That's the right read. The benchmark disagreement is downstream of a conceptual disagreement. What is the goal of an AI memory system? Is it accuracy on a fixed evaluation set? Is it latency in production? Is it the quality of the agent's behavior over a six-month relationship with a user? Those are different objectives, and different frameworks are optimizing for different things without always being explicit about it.
Daniel's framing in the prompt is actually the sharpest articulation I've seen of what the real goal should be. It's not an AI that never forgets. It's an AI that remembers the right things, lets the noise decay, and reconciles conflicts when the world changes. That's a much harder problem than storage.
And the honest answer to whether anyone has solved it is: partially. The pieces are there — temporal graphs, LLM-based conflict resolution, sleep-time consolidation, principled profile-versus-collection distinctions. But nobody has assembled them into a complete system with principled importance scoring, real decay functions, and provenance that lets you debug and audit what the AI believes and why.
That's the next frontier.
It's the next frontier, and I suspect the frameworks that get there first are going to look less like database tools and more like cognitive architectures. The interesting question is whether that happens at the framework layer or whether the model providers build it in natively.
Alright. Real quick practical takeaway before we wrap: if you're evaluating these frameworks today, what's the one thing to test that most people skip?
Test the conflict resolution path, not just the happy path. Create a user profile, establish some facts, then update those facts, and check whether the old facts are gone or still floating around with equal weight. That single test will tell you more about a framework's actual memory quality than any benchmark score.
That's a good one. Okay — big thanks to Modal for the GPU credits that keep this whole pipeline running. Thanks as always to our producer Hilbert Flumingtop. This has been My Weird Prompts. If you want to find us, we're at myweirdprompts dot com. Until next time.
See you then.