Industry Benchmark

LoCoMo

Industry-standard conversational memory benchmark.

What LoCoMo Is

LoCoMo measures multi-session conversational memory across 10 dialogues and about 300 questions. It comes from mem0ai/memory-benchmarks.

Single-hop: direct factual recall from prior turns.
Multi-hop: joins facts across sessions.
Temporal: reasons about order, recency, and time.
Open-domain: mixes memory retrieval with broader answering.

Reference Scores

System	Single-hop	Multi-hop	Temporal	Overall
Mem0 v3	97%	93%	93%	91.6%
Hindsight	n/a	n/a	n/a	SOTA
GBrain	not-run	not-run	not-run	not-run
Quaid	50%	0%	0%	20%

Reference values for the other systems come from their published materials. Hindsight's per-type LoCoMo breakdown is not shown here, so only its overall standing is listed. GBrain has no public LoCoMo run.
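As a reading aid for the table, here is a minimal sketch of how per-type and overall accuracy could be derived from judged answers. It assumes the overall figure is the fraction of all questions judged correct, which is not confirmed by this page. The function name `score_by_category` and the sample verdicts are hypothetical and are not real benchmark output.

```python
from collections import defaultdict


def score_by_category(results):
    """Aggregate judged answers into per-category and overall accuracy.

    `results` is an iterable of (category, is_correct) pairs.
    Assumption: overall accuracy is question-weighted, not a mean of categories.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        correct[category] += int(is_correct)
    per_category = {c: correct[c] / totals[c] for c in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return per_category, overall


# Made-up verdicts for illustration only.
sample = (
    [("single-hop", True)] * 3 + [("single-hop", False)]
    + [("multi-hop", True), ("multi-hop", False)]
    + [("temporal", False)] * 2
    + [("open-domain", True), ("open-domain", False)]
)
per_category, overall = score_by_category(sample)
print(per_category, overall)
```

Under this assumption the overall value depends on how many questions fall in each category, so it need not equal the average of the per-type columns.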

Why Quaid Scores Will Start Lower

Quaid is doc-native today. The current adapter stores whole conversation turns as documents instead of extracting and maintaining distilled facts. That limits multi-hop and temporal performance on the first run.
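The difference between storing whole turns and maintaining distilled facts can be illustrated with a toy sketch. Everything here is hypothetical: the `Turn` and `Fact` types, the `upsert` helper, and the sample dialogue are illustrative assumptions, not Quaid's actual data model or adapter.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    session: int
    speaker: str
    text: str


@dataclass
class Fact:
    subject: str
    attribute: str
    value: str
    session: int


# Doc-native storage: each conversation turn is kept verbatim as a document.
turn_store = [
    Turn(1, "Alice", "I just moved to Lisbon."),
    Turn(4, "Alice", "We relocated to Oslo last week."),
]


def upsert(store, fact):
    """Keep one fact per (subject, attribute); a later session replaces an earlier one."""
    key = (fact.subject, fact.attribute)
    current = store.get(key)
    if current is None or fact.session >= current.session:
        store[key] = fact


# Fact-based storage: hypothetical facts extracted from the same turns.
fact_store = {}
upsert(fact_store, Fact("Alice", "city", "Lisbon", 1))
upsert(fact_store, Fact("Alice", "city", "Oslo", 4))

# "Where does Alice live now?" against each store.
candidates = [t.text for t in turn_store if t.speaker == "Alice"]
answer = fact_store[("Alice", "city")].value
print(candidates)
print(answer)
```

In the turn store both statements come back and the answerer has to resolve which one is current. In the fact store the newer session has already replaced the older value, which is the kind of maintenance that temporal questions reward.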

The roadmap item that closes this gap is the conversation memory feature in issue #105. Once Quaid can ingest turns as durable memory facts, this benchmark becomes a direct before-and-after measurement.

How To Run

OPENAI_API_KEY=sk-... bash benchmarks/locomo/run.sh

Current Status

Measured on v0.23.0, 2026-06-22.

LLM judge required. Default answerer and judge model: GPT-4o.
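For readers unfamiliar with LLM-judged scoring, here is a minimal sketch of the two pieces around the model call: building a grading prompt and parsing the verdict. The prompt wording and the function names `build_judge_prompt` and `parse_verdict` are assumptions for illustration; the actual prompt used by benchmarks/locomo is not shown on this page, and no API call is made here.

```python
def build_judge_prompt(question, gold, predicted):
    """Assemble a grading prompt (hypothetical wording) for the judge model."""
    return (
        "You are grading a memory benchmark answer.\n"
        f"Question: {question}\n"
        f"Gold answer: {gold}\n"
        f"Predicted answer: {predicted}\n"
        "Reply with exactly one word: CORRECT or WRONG."
    )


def parse_verdict(reply):
    """Return True only when the judge reply is an unambiguous CORRECT."""
    word = reply.strip().upper().rstrip(".")
    return word == "CORRECT"


prompt = build_judge_prompt("Where does Alice live now?", "Oslo", "Lisbon")
print(prompt)
print(parse_verdict(" correct.\n"))
print(parse_verdict("WRONG"))
```

Treating anything other than a clean CORRECT as a miss is a conservative choice: a malformed judge reply lowers the score rather than inflating it.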