Industry Benchmark

LoCoMo

Industry-standard conversational memory benchmark.

What LoCoMo Is

LoCoMo measures multi-session conversational memory across 10 dialogues and about 300 questions. It comes from mem0ai/memory-benchmarks. Questions fall into four types:

  • Single-hop: direct factual recall from prior turns.
  • Multi-hop: joins facts across sessions.
  • Temporal: reasons about order, recency, and time.
  • Open-domain: mixes memory retrieval with broader answering.

Reference Scores

System      Single-hop   Multi-hop   Temporal   Overall
Mem0 v3     97%          93%         93%        91.6%
Hindsight   n/a          n/a         n/a        SOTA
GBrain      not-run      not-run     not-run    not-run
Quaid       50%          0%          0%         20%

Reference values come from published materials. The Hindsight per-type LoCoMo breakdown is not shown here, so its row lists only the overall SOTA claim. GBrain has no public LoCoMo run.
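As a rough illustration of how per-type and overall columns like these can be derived from judged answers, here is a minimal Python sketch. It assumes the overall score is a question-weighted accuracy across all four question types; the harness's actual aggregation is not described on this page. The rows and function names are hypothetical and are not benchmark output.

```python
from collections import defaultdict

# Hypothetical judged results: (question_type, judged_correct).
# Made-up rows for illustration only, not output from any real run.
judged = [
    ("single-hop", True),
    ("single-hop", True),
    ("multi-hop", True),
    ("multi-hop", False),
    ("temporal", False),
    ("open-domain", True),
]


def accuracy_by_type(rows):
    """Return {question_type: fraction of questions judged correct}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for qtype, ok in rows:
        totals[qtype] += 1
        correct[qtype] += int(ok)
    return {qtype: correct[qtype] / totals[qtype] for qtype in totals}


def overall_accuracy(rows):
    """Question-weighted accuracy over all rows (assumed aggregation)."""
    return sum(int(ok) for _, ok in rows) / len(rows)


per_type = accuracy_by_type(judged)
overall = overall_accuracy(judged)

for qtype, acc in per_type.items():
    print(f"{qtype}: {acc:.0%}")
print(f"overall: {overall:.1%}")
```

Under this assumption the overall figure also depends on open-domain questions, which have no column in the table, so it need not equal the average of the three per-type columns shown.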

Why Quaid Scores Will Start Lower

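Quaid is doc-native today. The current adapter stores whole conversation turns as documents instead of extracting and maintaining distilled facts. A multi-hop or temporal question then depends on retrieving several raw turns and reconciling them at answer time, which limits multi-hop and temporal performance on the first run.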
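The toy Python sketch below contrasts the two storage styles. It is not Quaid's retrieval code and not the design from issue #105: the naive keyword overlap stands in for document retrieval, and the `Turn` and `Fact` types, the relation names, and the sample conversation are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Turn:
    """A whole conversation turn stored as one document (doc-native style)."""
    session: int
    text: str


@dataclass
class Fact:
    """A distilled fact with the session it was learned in (hypothetical schema)."""
    subject: str
    relation: str
    value: str
    session: int


def words(text):
    """Lowercased alphabetic tokens of a string."""
    return set(re.findall(r"[a-z]+", text.lower()))


def search_turns(turns, question):
    """Naive stand-in for document retrieval: any shared word counts as a hit."""
    q = words(question)
    return [t for t in turns if q & words(t.text)]


def latest_value(facts, subject, relation):
    """Most recent value for (subject, relation), by session number."""
    matches = [f for f in facts if f.subject == subject and f.relation == relation]
    return max(matches, key=lambda f: f.session).value if matches else None


def city_of_pet(facts, pet):
    """Two-hop lookup: pet -> owner -> owner's latest city."""
    owners = [f.subject for f in facts if f.relation == "owns_pet" and f.value == pet]
    return latest_value(facts, owners[0], "lives_in") if owners else None


turns = [
    Turn(1, "I live in Porto and just adopted a dog named Pixel."),
    Turn(4, "We moved to Lisbon last week."),
]
facts = [
    Fact("Ana", "owns_pet", "Pixel", 1),
    Fact("Ana", "lives_in", "Porto", 1),
    Fact("Ana", "lives_in", "Lisbon", 4),
]

hits = search_turns(turns, "Which city does Pixel live in?")
city = city_of_pet(facts, "Pixel")
print([t.session for t in hits], city)
```

In this toy case the turn search returns only the session 1 turn, which carries the outdated city, while the fact lookup joins ownership with the most recent residence. Real retrieval is more capable than word overlap, so treat this as an illustration of the failure mode, not a measurement.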

The roadmap item that closes this gap is the conversation memory feature in issue #105. Once Quaid can ingest turns as durable memory facts, this benchmark becomes a direct before-and-after measurement.

How To Run

OPENAI_API_KEY=sk-... bash benchmarks/locomo/run.sh

Current Status

measured, v0.23.0, 2026-06-22

LLM judge required. Default answerer and judge model: GPT-4o.
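For readers unfamiliar with LLM-as-judge scoring, the Python sketch below shows the general shape of the step: build a grading prompt from the question, the reference answer, and the system's answer, then map the judge's reply to a boolean. The prompt wording and both function names are hypothetical, not taken from the benchmark harness, and the call to the judge model is deliberately left out.

```python
def build_judge_prompt(question, reference, predicted):
    """Assemble a grading prompt (hypothetical wording, not the harness's)."""
    return (
        "You are grading an answer from a conversational memory benchmark.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {predicted}\n"
        "Reply with exactly one word: CORRECT or WRONG."
    )


def parse_verdict(reply):
    """Map the judge's reply to True/False; fail loudly on anything else."""
    token = reply.strip().upper()
    if token.startswith("CORRECT"):
        return True
    if token.startswith("WRONG"):
        return False
    raise ValueError(f"unrecognized judge reply: {reply!r}")


prompt = build_judge_prompt(
    "Which city did Ana move to?",
    "Lisbon",
    "She moved to Lisbon.",
)
print(prompt)
print(parse_verdict(" correct\n"))
```

Raising on an unrecognized reply, instead of silently counting it as wrong, keeps judge formatting failures visible and separate from genuine misses.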