Why Every Voice AI Agent Has Amnesia
Sarah calls back
Sarah calls Acme Insurance’s AI agent about a water damage claim she started last week. “Hi, I’m calling about my water damage claim from last week.” The agent pauses. Not for dramatic effect. It genuinely has no idea who Sarah is.
LiveKit connects the call. Deepgram or Whisper transcribes the audio into text. The agent framework spins up an AgentSession with a fresh, empty ChatContext. Who is Sarah? What claim? What water damage? So begins a familiar cascade: an Auth0 token exchange adds ~200ms, a Postgres query to look up her account adds ~30ms, a Pinecone vector retrieval adds ~150ms (and it’s probabilistic — maybe it returns last week’s water damage conversation, maybe it returns a dental claim from March), and a Redis session cache check adds another round trip — assuming Redis hasn’t restarted since Tuesday, in which case that session is just gone. Four network hops, ~400ms of latency before the agent can say “Welcome back, Sarah.” And there’s a real chance the Pinecone retrieval pulled the wrong context entirely.
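That cascade is easy to sketch. In the snippet below, every service client is a hypothetical stand-in (nothing here is a real SDK call) and the latencies are the rough figures above; the point is that the hops are sequential, so the costs add up before the agent can speak:

```python
import asyncio

# Hypothetical stand-ins for the four services; latencies are the rough
# figures from the text, not measurements.
HOP_LATENCY_MS = {
    "auth0_token_exchange": 200,        # who is this caller, cryptographically?
    "postgres_account_lookup": 30,      # which account?
    "pinecone_context_retrieval": 150,  # which conversation? (probabilistic)
    "redis_session_check": 5,           # assuming Redis hasn't restarted
}

async def fake_hop(name: str) -> int:
    """Stand-in for a network call: just wait out the latency."""
    ms = HOP_LATENCY_MS[name]
    await asyncio.sleep(ms / 1000)
    return ms

async def greet_returning_caller() -> int:
    # Sequential by necessity: each hop needs the previous hop's result.
    total = 0
    for hop in HOP_LATENCY_MS:
        total += await fake_hop(hop)
    return total

total_ms = asyncio.run(greet_returning_caller())
print(f"~{total_ms} ms before 'Welcome back, Sarah'")
```

None of those hops can be parallelized away, because each one depends on the output of the one before it.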
This isn’t a hypothetical. This is how every production voice AI agent works today. And if you’ve built one, you already know it.
The six-service problem
Building a production voice agent in 2026 means assembling a Frankenstein stack. LiveKit or Twilio for real-time media transport. PostgreSQL for persistent user data. Redis for ephemeral session state. Pinecone or Chroma for “memory” via RAG. Auth0 or Clerk for identity and authentication. Datadog or Grafana for observability. That’s six services, often spanning three languages, connected by 2,500+ lines of glue code that nobody wants to own.
The LiveKit Agents framework is genuinely good at what it does. It handles WebRTC, audio routing, and STT/TTS pipelines with real craft. But its own documentation confirms the limitation: AgentSession has no built-in cross-session persistence. The userdata dictionary is session-scoped. The moment a call ends, everything the agent learned during that conversation vanishes. The next time Sarah calls, the agent is a stranger again.
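To make "session-scoped" concrete, here is a deliberately fake stand-in (FakeAgentSession is our illustration, not the real LiveKit class) that models the lifecycle:

```python
from dataclasses import dataclass, field

# Not the real LiveKit API: a minimal stand-in showing why session-scoped
# state forgets everything between calls.
@dataclass
class FakeAgentSession:
    userdata: dict = field(default_factory=dict)  # lives only as long as the call

def handle_call(utterance: str) -> FakeAgentSession:
    session = FakeAgentSession()                # fresh, empty context every call
    session.userdata["last_topic"] = utterance
    return session                              # state dies when the call ends

first_call = handle_call("water damage claim for my basement")
second_call = handle_call("hi, I'm calling back about my claim")

# The second session knows nothing the first one learned:
print(second_call.userdata)  # {'last_topic': "hi, I'm calling back about my claim"}
```

Whatever `first_call` learned is unreachable from `second_call`; persistence has to come from somewhere outside the session object.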
So developers bolt on memory. They add a vector database for conversation recall, a relational database for structured data, a cache for fast session lookups. Each one solves a narrow problem. None of them solve the actual problem.
And that’s where things get genuinely painful.
You have memory solutions. You don’t have a memory system.
Let’s walk through the standard toolkit and be honest about what each one actually gives you.
Vector databases and RAG (Pinecone, Chroma, Weaviate) are the most popular approach to agent memory right now. The idea is appealing: embed past conversations as vectors, then retrieve “relevant” context via semantic similarity when the user calls back. But semantic similarity doesn’t understand which conversation you want. “Water damage claim for my basement” and “flood damage to my car” sit close together in embedding space. The retrieval is probabilistic. You don’t get the right context — you get the most similar context, and you hope those are the same thing. You can’t do “give me every conversation with user X, ordered by date.” That’s not what vector search does. It’s non-deterministic, hard to debug, and when it’s wrong, it’s wrong in ways that are difficult to even detect, let alone reproduce.
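The failure is easy to demonstrate with a toy. The bag-of-words "embedding" below is far cruder than a real model, but the retrieval logic (return the most similar stored text) is the same, and so is the failure mode:

```python
import math
from collections import Counter

# Toy "embedding": bag-of-words vectors, so similarity is just shared
# vocabulary. Real embedding models are far better, but the failure mode
# is identical: retrieval returns the MOST SIMILAR text, not the RIGHT one.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb)

memory = [
    "water damage claim for my basement last week",  # the claim Sarah means
    "flood damage claim for my car on the highway",  # a different claim entirely
]

# Sarah's claim is the basement one, but she happens to say "flood":
query = "I'm calling about the flood damage claim"
best = max(memory, key=lambda doc: cosine(embed(query), embed(doc)))
print(best)  # retrieves the car claim: highest similarity, wrong conversation
```

One word of phrasing drift and the nearest neighbor flips to the wrong conversation; no amount of index tuning changes what "nearest" means.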
PostgreSQL with an application server gives you deterministic queries. You can absolutely look up Sarah’s exact claim by ID, retrieve her full conversation history in order, and join it with her account data. But it’s a separate service. Every call requires a network hop — 30ms if your database is in the same region, more if it isn’t. You need connection pool configuration, schema migrations, deployment choreography. Your agent logic runs in one process; your state lives in another. That separation means you’re constantly serializing state, shipping it over the network, and deserializing it on the other side. It works. But it’s accidental complexity that has nothing to do with making your agent smarter.
Redis for session state is fast, no question. Sub-millisecond reads within the same datacenter. But Redis is a cache, not a database, and the distinction matters in production. Managed Redis instances do routine maintenance restarts — your cloud provider’s SLA allows for it. When that happens, any session data that wasn’t persisted elsewhere is gone. There’s no row-level security. No audit trail by default. No way to query across sessions without building your own indexing layer on top. You’re building a database on top of a cache, and you’re doing it without the guarantees that databases provide.
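The gap is easy to model. In this sketch, plain dicts stand in for Redis and Postgres; nothing here is a real client library:

```python
# Plain dicts stand in for Redis (fast, volatile) and Postgres (durable).
cache = {}      # Redis-like session store
database = {}   # Postgres-like durable store

def save_session(user_id: str, state: dict, persist: bool) -> None:
    cache[user_id] = state
    if persist:
        database[user_id] = state  # the dual-write glue you have to remember

save_session("sarah", {"claim": "water damage"}, persist=True)
save_session("raj", {"claim": "windshield"}, persist=False)

# Friday afternoon: the managed Redis instance does a maintenance restart.
cache.clear()

print(cache.get("sarah"))     # None: every live session is gone
print(database.get("sarah"))  # {'claim': 'water damage'}: the persisted copy survives
print(database.get("raj"))    # None: nobody wrote the persist path for raj
```

The dual write is exactly the "database on top of a cache" glue: forget it once, on one code path, and that user's session is unrecoverable after the next restart.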
The deeper issue isn’t that any of these tools are bad. They’re each good at their narrow purpose. The issue is that agent memory has two requirements that pull in opposite directions: it needs to be deterministic (give me exactly this user’s data, not something semantically similar) and it needs to be integrated (accessible from within the agent’s own execution context, not across a network boundary). The standard stack gives you deterministic-but-remote (Postgres) or fast-but-volatile (Redis) or integrated-but-probabilistic (vector RAG). Pick two. Or, more accurately, pick all three and write the glue code to hold them together.
That glue code is where the real cost lives. Not in the cloud bills — in the engineering time, the on-call pages, the bugs that only manifest when one of six services hiccups at 2 AM.
Failure modes
These aren’t dramatic. They’re mundane. That’s what makes them so corrosive.
The Redis restart. You deploy a voice agent for a healthcare startup. It works well for three weeks. Friday afternoon, your managed Redis instance does routine maintenance — a restart that’s well within the provider’s SLA. You don’t notice immediately because your monitoring is in Datadog, your agent is in LiveKit, and your sessions are in Redis. Three dashboards, three services, no single pane of glass. Monday morning: support tickets. “The agent asked me to re-explain my symptoms even though I called yesterday.” Every session from the Friday maintenance window is gone. There’s no audit trail in Redis, so you can’t even enumerate which users were affected. You spend Tuesday writing a script to cross-reference LiveKit call logs with Postgres user records to figure out who needs to be contacted. The technical fix takes an hour. The trust repair takes months.
The wrong retrieval. A customer calls about their auto insurance claim. Your agent dutifully queries Pinecone for relevant conversation history. Semantic search returns chunks from a previous conversation — high cosine similarity, looks right. But “water damage to basement” and “car damaged in a flood” are close neighbors in embedding space. The agent confidently references the wrong claim, citing details from a conversation the customer never had. The customer corrects it, confused. The agent apologizes and starts over. But the next customer doesn’t correct it — they assume the agent knows something they don’t. You find out when someone posts about it on Twitter. You can’t reproduce the bug because the retrieval is non-deterministic; running the same query again returns different chunks depending on index state.
The HIPAA audit. An auditor asks for a complete access log for Patient 12847’s data over the last 90 days. Straightforward request. Except the data lives across five systems: conversation transcripts in Postgres, session metadata in Redis (which doesn’t keep access logs by default), embedded conversation chunks in Pinecone, authentication events in Auth0, and call recordings in LiveKit. Five systems, five log formats, five timestamp conventions. You spend two weeks building a merge script. The auditor asks why you can’t produce a unified access log. You explain your architecture. The auditor’s expression tells you everything you need to know.
What if memory was first-class?
These problems share a root cause: agent memory is an afterthought, bolted onto a media transport layer that was never designed to manage state. What if it wasn’t?
Cosmictron takes a different approach. Instead of assembling six services and writing glue code between them, you define your agent’s memory as a SQL table that lives in the same process as your agent logic. Not a vector embedding. Not a cache. A real, queryable, durable table with row-level security built in.
Here’s what defining conversation memory looks like:
#[table(name = "conversations")]
pub struct ConversationTurn {
    #[primary_key]
    #[auto_inc]
    pub id: u64,
    pub user_id: Identity,
    pub role: String,
    pub content: String,
    pub timestamp: u64,
}
Persisting a turn when the user speaks:
#[reducer]
fn on_user_message(ctx: &ReducerContext, text: String) {
    db::insert(ConversationTurn {
        id: 0, // placeholder; #[auto_inc] assigns the real id on insert
        user_id: ctx.sender,
        role: "user".into(),
        content: text,
        timestamp: ctx.timestamp,
    });
}
And when Sarah calls back next week, recalling her exact conversation history:
history = await db.query(
    "SELECT content FROM conversations "
    "WHERE user_id = $1 ORDER BY timestamp DESC LIMIT 20",
    [session.user_id],
)
# Microseconds. Deterministic. RLS-protected. Same process.
No network hop to Postgres. No probabilistic retrieval from Pinecone. No volatile session in Redis. The query runs in microseconds because the data is co-located with the agent logic. It’s deterministic because it’s SQL, not semantic similarity. It’s secure because row-level security is structural — ctx.sender is a cryptographic identity, not a JWT you hope hasn’t expired.
One binary. Real-time media transport, persistent agent memory, row-level security, and built-in observability. No glue code. No six-service deployment choreography. No three-dashboard fire drills at 2 AM.
We’re not claiming this solves every problem in voice AI. Media transport is hard. Speech recognition is hard. LLM orchestration is hard. But agent memory shouldn’t be hard. It should be a table.
Try it
Cosmictron is a commercially licensed agent platform with a free self-hosted Developer tier. We’re building the real-time AI platform that treats memory as a first-class primitive, not a retrieval pipeline. If you’re tired of assembling six services just to give your agent the ability to remember a conversation, join the waitlist or talk to our team.