LoCoMo
Industry-standard conversational memory benchmark.
What LoCoMo Is
LoCoMo measures multi-session conversational memory across 10 dialogues and about 300 questions. It comes from mem0ai/memory-benchmarks. Questions fall into four types:
- Single-hop: direct factual recall from prior turns.
- Multi-hop: joins facts across sessions.
- Temporal: reasons about order, recency, and time.
- Open-domain: mixes memory retrieval with broader answering.
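To make the per-type scoring concrete, here is a minimal Python sketch that turns graded answers into one accuracy figure per question type plus an overall figure. The `(question_type, is_correct)` input shape, the type labels, and the micro-averaged overall score are assumptions for illustration; they are not taken from the benchmark's actual output format.

```python
from collections import defaultdict

# Hypothetical labels, named after the four question types listed above.
QUESTION_TYPES = ("single-hop", "multi-hop", "temporal", "open-domain")


def accuracy_by_type(results):
    """Return {question_type: fraction correct} plus an "overall" entry.

    `results` is an iterable of (question_type, is_correct) pairs. This
    shape is an assumption, not the benchmark's real result format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for question_type, is_correct in results:
        if question_type not in QUESTION_TYPES:
            raise ValueError(f"unknown question type: {question_type}")
        total[question_type] += 1
        correct[question_type] += int(is_correct)

    scores = {t: correct[t] / total[t] for t in total}
    n = sum(total.values())
    # Assumed micro-average: every question counts once, so larger types weigh more.
    scores["overall"] = sum(correct.values()) / n if n else 0.0
    return scores


results = [
    ("single-hop", True),
    ("single-hop", False),
    ("multi-hop", False),
    ("temporal", True),
]
print(accuracy_by_type(results))
```

With a micro-average, the overall figure can sit below every displayed per-type figure when a type that is not displayed scores lower.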
Reference Scores
| System | Single-hop | Multi-hop | Temporal | Overall |
|---|---|---|---|---|
| Mem0 v3 | 97% | 93% | 93% | 91.6% |
| Hindsight | n/a | n/a | n/a | SOTA |
| GBrain | not-run | not-run | not-run | not-run |
| Quaid | 50% | 0% | 0% | 20% |
Reference values come from published materials. Hindsight's per-type LoCoMo breakdown is not shown here, and GBrain has no public LoCoMo run.
Why Quaid Scores Will Start Lower
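Quaid is doc-native today: the current adapter stores whole conversation turns as documents instead of extracting and maintaining distilled facts. Joining facts across sessions and working out which statement is most recent must then happen over raw turns at query time, which limits multi-hop and temporal performance on the first run.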
The roadmap item that closes this gap is the conversation memory feature in issue #105. Once Quaid can ingest turns as durable memory facts, this benchmark becomes a direct before-and-after measurement.
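The difference between storing whole turns and maintaining distilled facts can be sketched in a few lines of Python. Everything here is hypothetical: `TurnStore`, `FactStore`, the `home_city` attribute, and the hand-written facts are illustrative stand-ins, not Quaid's adapter or the design in issue #105.

```python
from dataclasses import dataclass, field


@dataclass
class TurnStore:
    """Doc-native storage: every turn is kept verbatim as its own document."""
    docs: list = field(default_factory=list)

    def ingest(self, session, speaker, text):
        self.docs.append({"session": session, "speaker": speaker, "text": text})

    def search(self, keyword):
        # Naive keyword match; returns every turn that mentions the keyword.
        return [d["text"] for d in self.docs if keyword.lower() in d["text"].lower()]


@dataclass
class FactStore:
    """Distilled storage: one current value per (subject, attribute)."""
    facts: dict = field(default_factory=dict)

    def ingest(self, session, subject, attribute, value):
        key = (subject, attribute)
        previous = self.facts.get(key)
        # Keep the value from the most recent session so stale facts are replaced.
        if previous is None or session >= previous["session"]:
            self.facts[key] = {"value": value, "session": session}

    def lookup(self, subject, attribute):
        fact = self.facts.get((subject, attribute))
        return fact["value"] if fact else None


turns = TurnStore()
turns.ingest(1, "Alice", "I live in Berlin.")
turns.ingest(4, "Alice", "I just moved, so now I live in Lisbon.")

facts = FactStore()
# Extraction is done by hand here; a real system would derive facts from the turns.
facts.ingest(1, "Alice", "home_city", "Berlin")
facts.ingest(4, "Alice", "home_city", "Lisbon")

print(turns.search("live in"))             # both turns match; recency is unresolved
print(facts.lookup("Alice", "home_city"))  # only the latest value is kept
```

In the turn store, a temporal question such as "Where does Alice live now?" retrieves two conflicting statements and leaves the answerer to sort them out. The fact store resolves recency at ingest time instead.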
How To Run
OPENAI_API_KEY=sk-... bash benchmarks/locomo/run.sh
Current Status
LLM judge required. Default answerer and judge model: GPT-4o.
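For readers unfamiliar with LLM-as-judge grading, the following Python sketch shows the general shape of the step: build a grading prompt, send it to a model, and map the reply to a pass or fail. The prompt wording, the function names, and the `call_model` parameter are all assumptions; the actual prompt and client code used by `run.sh` are not described here. A stand-in model replaces the real API call so the sketch runs offline.

```python
def build_judge_prompt(question, reference, candidate):
    """Assemble a grading prompt; the wording is illustrative only."""
    return (
        "You are grading a memory benchmark answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Reply with exactly one word: CORRECT or WRONG."
    )


def parse_verdict(reply):
    """Map the judge's reply to True/False; anything unexpected counts as wrong."""
    return reply.strip().upper().startswith("CORRECT")


def judge(question, reference, candidate, call_model):
    """Grade one answer. `call_model` is any callable mapping a prompt to text."""
    prompt = build_judge_prompt(question, reference, candidate)
    return parse_verdict(call_model(prompt))


def fake_model(prompt):
    # Stand-in for a real LLM call (hypothetical): marks the answer correct
    # when the reference text appears inside the candidate text.
    fields = dict(line.split(": ", 1) for line in prompt.splitlines() if ": " in line)
    reference = fields["Reference answer"].lower()
    candidate = fields["Candidate answer"].lower()
    return "CORRECT" if reference in candidate else "WRONG"


print(judge("Where does Alice live?", "Lisbon", "She lives in Lisbon now.", fake_model))
print(judge("Where does Alice live?", "Lisbon", "She lives in Berlin.", fake_model))
```

Passing the model in as a callable keeps the grading logic testable without network access; swapping `fake_model` for a function that calls a hosted model would not change `judge` itself.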