Industry Benchmark

LongMemEval

ICLR 2025. 500 questions, 6 types, per-question corpus.

What LongMemEval Is

LongMemEval tests long-term conversational memory with six question types covering cross-session recall, temporal reasoning, memory updates, user facts, assistant facts, and preferences.

The benchmark is designed for systems that can turn messy dialogue into durable memory, then retrieve the right fact under question-specific context.

Key Difference From LoCoMo

Each question has its own set of haystack sessions (up to 53), ingested fresh for that question only. This gives true isolation: memory built for one question cannot pollute the answer to another.

LoCoMo: Shared corpus. All questions hit the same memory store.
LongMemEval: Per-question corpus. Every question starts from a fresh ingest.
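
The per-question isolation described above can be sketched as a loop that builds a new store for every question. This is a minimal illustration, not the benchmark harness itself: `InMemoryStore`, `run_isolated`, and the keyword search are hypothetical stand-ins, and the `question` and `haystack_sessions` field names are assumed for the example.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Toy stand-in for a memory backend (hypothetical, not a real API)."""
    docs: list[str] = field(default_factory=list)

    def ingest(self, session: list[str]) -> None:
        self.docs.extend(session)

    def search(self, query: str) -> list[str]:
        # Naive keyword match; a real system would use an index or embeddings.
        terms = query.lower().split()
        return [d for d in self.docs if any(t in d.lower() for t in terms)]


def run_isolated(questions: list[dict]) -> list[list[str]]:
    """Build a fresh store per question so nothing carries over."""
    results = []
    for q in questions:
        store = InMemoryStore()  # fresh ingest for this question only
        for session in q["haystack_sessions"]:
            store.ingest(session)
        results.append(store.search(q["question"]))
    return results


# Two questions with the same query but different haystacks (assumed layout).
questions = [
    {"question": "dog", "haystack_sessions": [["My dog is named Rex."]]},
    {"question": "dog", "haystack_sessions": [["I have no pets."]]},
]
hits = run_isolated(questions)
```

With a shared store, the second question would also retrieve the "Rex" turn from the first question's haystack. With a fresh store per question, it retrieves nothing.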

Reference Scores

Type	Count	Mem0 v3	GBrain	Quaid
multi-session	133	~93%	not-run	0%
temporal-reasoning	133	93%	not-run	0%
knowledge-update	78	~90%	not-run	0%
single-session-user	70	~95%	not-run	0%
single-session-assistant	56	100%	not-run	0%
single-session-preference	30	~90%	not-run	0%
Overall	500	93.4%	not-run	0%

Per-type LongMemEval references are rounded from published benchmark materials. GBrain has no public LongMemEval run.
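
As a sanity check, the overall Mem0 v3 figure can be reproduced as a count-weighted average of the per-type scores. The sketch below copies the counts and rounded percentages from the table. Because several inputs are approximate, agreement to one decimal place is the most that can be expected.

```python
# (question count, reference score in percent) per type, copied from the table.
mem0_v3 = {
    "multi-session": (133, 93),
    "temporal-reasoning": (133, 93),
    "knowledge-update": (78, 90),
    "single-session-user": (70, 95),
    "single-session-assistant": (56, 100),
    "single-session-preference": (30, 90),
}

total = sum(count for count, _ in mem0_v3.values())
weighted = sum(count * pct for count, pct in mem0_v3.values()) / total

print(total, round(weighted, 1))  # prints "500 93.4"
```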

Why Quaid Is Pending

The gap is the same as for LoCoMo. Quaid currently stores raw conversation turns as documents instead of extracting durable facts, so it is at a structural disadvantage on this benchmark until the memory layer changes.

Issue #105 closes this gap by adding fact extraction on top of raw turn storage.
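
To illustrate why raw turn storage struggles with question types such as knowledge-update, here is a toy comparison. It is only a sketch: the source does not describe how Issue #105 implements extraction, and the regex rule and the `home_city` attribute are hypothetical stand-ins for whatever extractor is actually used.

```python
import re

turns = [
    "I live in Boston.",
    "What is the weather like today?",
    "Actually, I live in Denver now.",
]

# Raw turn storage: every turn is its own document, so a lookup for
# "live in" returns both the stale and the current statement.
raw_hits = [t for t in turns if "live in" in t]

# Fact extraction (toy rule): keep one value per attribute, newest wins.
CITY = re.compile(r"I live in (\w+)")
facts: dict[str, str] = {}
for turn in turns:
    match = CITY.search(turn)
    if match:
        facts["home_city"] = match.group(1)

print(len(raw_hits), facts)
```

The raw store leaves the answerer to decide between two conflicting turns, while the fact store holds only the latest value.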

How To Run

# Full run
OPENAI_API_KEY=sk-... bash benchmarks/longmemeval/run.sh
# Shorter run, capped at 50 questions
MAX_QUESTIONS=50 OPENAI_API_KEY=sk-... bash benchmarks/longmemeval/run.sh

Current Status

measured, v0.23.0, 2026-06-22

LLM judge required. Default answerer and judge model: GPT-4o.
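
A minimal sketch of what an LLM-judge step might look like, assuming a yes/no grading prompt. The prompt wording, `build_judge_prompt`, `judge`, and the injected `ask_model` callable are all hypothetical. The harness in `benchmarks/longmemeval/run.sh` may do this differently.

```python
from typing import Callable


def build_judge_prompt(question: str, reference: str, response: str) -> str:
    """Assemble a yes/no grading prompt (wording is an assumption)."""
    return (
        "You are grading a memory benchmark answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model response: {response}\n"
        "Is the model response correct? Answer yes or no only."
    )


def judge(
    question: str,
    reference: str,
    response: str,
    ask_model: Callable[[str], str],
) -> bool:
    """Return True when the judge model's reply starts with 'yes'."""
    reply = ask_model(build_judge_prompt(question, reference, response))
    return reply.strip().lower().startswith("yes")


# Stand-in for a real model call, so the sketch runs without an API key.
def fake_model(prompt: str) -> str:
    return "Yes"


verdict = judge("Where does the user live?", "Denver", "Denver", fake_model)
```

Passing the model call in as a function keeps the grading logic testable offline. In a real run, `ask_model` would wrap a request to the configured judge model.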