Industry Benchmark

LongMemEval

ICLR 2025. 500 questions, 6 types, per-question corpus.

What LongMemEval Is

LongMemEval tests long-term conversational memory with six question types covering cross-session recall, temporal reasoning, memory updates, user facts, assistant facts, and preferences.

The benchmark is designed for systems that can turn messy dialogue into durable memory, then retrieve the right fact under question-specific context.

Key Difference From LoCoMo

Each question comes with its own haystack of up to 53 sessions, ingested fresh for that question only. This isolates every question: memory built for one question cannot pollute another.

LoCoMo: Shared corpus. All questions hit the same memory store.
LongMemEval: Per-question corpus. Every question starts from a fresh ingest.
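The per-question isolation described above can be sketched as a loop that builds a new store for every question. This is a minimal illustration, not the benchmark harness: `Question`, `MemoryStore`, and `run_isolated` are hypothetical names, and the record shape is an assumption rather than the real dataset schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Question:
    # Hypothetical record shape; the real dataset fields may differ.
    question_id: str
    text: str
    haystack_sessions: list[list[str]]  # each session is a list of turns


@dataclass
class MemoryStore:
    # Stand-in for whatever memory backend is under test.
    documents: list[str] = field(default_factory=list)

    def ingest(self, sessions: list[list[str]]) -> None:
        for session in sessions:
            self.documents.extend(session)


def run_isolated(questions: list[Question]) -> dict[str, int]:
    """Build a fresh store per question and report how many turns each one saw."""
    seen: dict[str, int] = {}
    for q in questions:
        store = MemoryStore()  # fresh store: nothing carries over between questions
        store.ingest(q.haystack_sessions)
        seen[q.question_id] = len(store.documents)
    return seen


questions = [
    Question("q1", "Where did I move?", [["I moved to Lisbon."], ["I like tea."]]),
    Question("q2", "What is my dog's name?", [["My dog is called Pixel."]]),
]
print(run_isolated(questions))  # {'q1': 2, 'q2': 1}
```

A shared-corpus setup would instead create one `MemoryStore` outside the loop, so `q2` would see the turns ingested for `q1` as well.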

Reference Scores

Type                       Count  Mem0 v3  GBrain   Quaid
multi-session              133    ~93%     not-run  0%
temporal-reasoning         133    93%      not-run  0%
knowledge-update           78     ~90%     not-run  0%
single-session-user        70     ~95%     not-run  0%
single-session-assistant   56     100%     not-run  0%
single-session-preference  30     ~90%     not-run  0%
Overall                    500    93.4%    not-run  0%

Per-type LongMemEval references are rounded from published benchmark materials. GBrain has no public LongMemEval run.
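As a consistency check on the table, the per-type counts should sum to 500, and the count-weighted mean of the Mem0 v3 per-type figures should land near the reported overall score. The sketch below uses only the numbers from the table. Because several per-type values are marked approximate, the agreement is expected to hold only to rounding.

```python
# (type, question count, Mem0 v3 reference accuracy in percent), copied from the table.
rows = [
    ("multi-session", 133, 93.0),
    ("temporal-reasoning", 133, 93.0),
    ("knowledge-update", 78, 90.0),
    ("single-session-user", 70, 95.0),
    ("single-session-assistant", 56, 100.0),
    ("single-session-preference", 30, 90.0),
]

total = sum(count for _, count, _ in rows)
# Weight each type's accuracy by how many questions it contributes.
weighted = sum(count * acc for _, count, acc in rows) / total

print(total)               # 500
print(round(weighted, 1))  # 93.4
```

The weighted mean matters here because the types are unevenly sized: the two 133-question types together account for more than half of the overall score.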

Why Quaid Is Pending

The gap is the same as on LoCoMo. Quaid currently stores raw conversation turns as documents instead of extracting durable facts, so the benchmark remains disproportionately hard for it until the memory layer changes.

Issue #105 closes this gap by adding fact extraction on top of raw turn storage.
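To illustrate why raw turn storage struggles on a type such as knowledge-update, here is a toy comparison. It is not Quaid's implementation and not the design in Issue #105: the turns, the `(subject, attribute)` key scheme, and the hand-written `extracted` list (a stand-in for a real extraction step) are all assumptions made for the example.

```python
# Two turns from different sessions; the second supersedes the first.
turns = [
    ("2024-01-05", "user", "I live in Berlin."),
    ("2024-03-10", "user", "I just moved to Lisbon."),
]

# Raw storage: each turn is an independent document, so a lookup about the
# user's city can surface both the stale and the current statement.
raw_hits = [text for _, _, text in turns if "Berlin" in text or "Lisbon" in text]

# Fact storage: one slot per (subject, attribute); later facts overwrite earlier ones.
extracted = [  # hand-written stand-in for an extraction step
    ("2024-01-05", ("user", "home_city"), "Berlin"),
    ("2024-03-10", ("user", "home_city"), "Lisbon"),
]
facts: dict[tuple[str, str], str] = {}
for _date, key, value in sorted(extracted):  # apply in chronological order
    facts[key] = value

print(len(raw_hits))                 # 2
print(facts[("user", "home_city")])  # Lisbon
```

With raw turns, the answerer has to work out which of the two hits is current. With extracted facts, the update has already been resolved at ingest time.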

How To Run

# Full run
OPENAI_API_KEY=sk-... bash benchmarks/longmemeval/run.sh
# Capped run: at most 50 questions
MAX_QUESTIONS=50 OPENAI_API_KEY=sk-... bash benchmarks/longmemeval/run.sh

Current Status

measured
v0.23.0
2026-06-22

LLM judge required. Default answerer and judge model: GPT-4o.
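Free-form answers rarely match a reference string exactly, which is why an LLM judge is needed to grade them. The sketch below shows the general shape of such a judge under stated assumptions: the prompt wording, the `judge` function, and the yes/no parsing are hypothetical and are not taken from the benchmark scripts. The model call is passed in as a plain callable, so the example runs offline with a stub. A real run would route `ask` to the configured judge model.

```python
from typing import Callable

# Hypothetical prompt; the wording used by the actual harness may differ.
JUDGE_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {answer}\n"
    "Does the model answer agree with the reference? Reply yes or no."
)


def judge(question: str, reference: str, answer: str,
          ask: Callable[[str], str]) -> bool:
    """Return True when the judge model's reply starts with 'yes'."""
    prompt = JUDGE_PROMPT.format(question=question, reference=reference, answer=answer)
    return ask(prompt).strip().lower().startswith("yes")


def stub_judge(prompt: str) -> str:
    # Offline stand-in for an LLM call: approves answers that mention Lisbon.
    model_answer = prompt.split("Model answer:")[1]
    return "yes" if "Lisbon" in model_answer else "no"


cases = [
    ("Where do I live?", "Lisbon", "You live in Lisbon now."),
    ("Where do I live?", "Lisbon", "You live in Berlin."),
]
verdicts = [judge(q, ref, ans, stub_judge) for q, ref, ans in cases]
accuracy = sum(verdicts) / len(verdicts)
print(verdicts, accuracy)  # [True, False] 0.5
```

Keeping the model call behind a callable makes it possible to test the scoring logic without network access or an API key.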