LongMemEval
ICLR 2025. 500 questions, 6 types, per-question corpus.
What LongMemEval Is
LongMemEval tests long-term conversational memory with six question types covering cross-session recall, temporal reasoning, memory updates, user facts, assistant facts, and preferences.
The benchmark is designed for systems that can turn messy dialogue into durable memory, then retrieve the right fact under question-specific context.
Key Difference From LoCoMo
Each question has its own haystack of up to 53 sessions, ingested fresh for that question only. That gives true isolation: no memory from one question can pollute another.
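A minimal sketch of what per-question isolation can look like in a harness. It is illustrative only: `MemoryStore`, `run_question`, the keyword search, and the `haystack_sessions` / `question` field names are assumptions made for this example, not the actual benchmark runner or any system's API.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Hypothetical stand-in for the memory system under test."""
    sessions: list = field(default_factory=list)

    def ingest(self, session: str) -> None:
        self.sessions.append(session)

    def search(self, query: str) -> list:
        # Naive keyword match; a real system would use its own retrieval.
        terms = query.lower().split()
        return [s for s in self.sessions if any(t in s.lower() for t in terms)]


def run_question(question: dict) -> list:
    # A fresh store per question: nothing from earlier questions can leak in.
    store = MemoryStore()
    for session in question["haystack_sessions"]:
        store.ingest(session)
    return store.search(question["question"])


# Toy data (assumed shape): each question carries its own haystack.
questions = [
    {"question": "dog name",
     "haystack_sessions": ["My dog is called Biscuit.", "I like tea."]},
    {"question": "dog breed",
     "haystack_sessions": ["I moved to Lisbon last year."]},
]
results = [run_question(q) for q in questions]
```

The second question mentions the dog too, but its store never saw the first question's sessions, so it retrieves nothing. A shared store would have returned the Biscuit session for both.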
Reference Scores
| Type | Count | Mem0 v3 | GBrain | Quaid |
|---|---|---|---|---|
| multi-session | 133 | ~93% | not-run | 0% |
| temporal-reasoning | 133 | 93% | not-run | 0% |
| knowledge-update | 78 | ~90% | not-run | 0% |
| single-session-user | 70 | ~95% | not-run | 0% |
| single-session-assistant | 56 | 100% | not-run | 0% |
| single-session-preference | 30 | ~90% | not-run | 0% |
| Overall | 500 | 93.4% | not-run | 0% |
Per-type LongMemEval references are rounded from published benchmark materials. GBrain has no public LongMemEval run.
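As a sanity check, the Overall row can be approximated as the count-weighted mean of the per-type rows. The sketch below uses the rounded Mem0 v3 figures from the table, so it only shows that the rounded values are consistent with the reported overall score; it is not how the published number was computed.

```python
# (question count, approximate accuracy) per type, copied from the table above.
mem0_v3 = {
    "multi-session": (133, 0.93),
    "temporal-reasoning": (133, 0.93),
    "knowledge-update": (78, 0.90),
    "single-session-user": (70, 0.95),
    "single-session-assistant": (56, 1.00),
    "single-session-preference": (30, 0.90),
}

total = sum(count for count, _ in mem0_v3.values())
# Weight each type's accuracy by its share of the questions.
weighted = sum(count * acc for count, acc in mem0_v3.values()) / total

print(f"{total} questions, weighted accuracy {weighted * 100:.1f}%")
```

The counts sum to 500 and the weighted mean lands within rounding of the 93.4% in the Overall row.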
Why Quaid Is Pending
Same gap as LoCoMo. Quaid currently stores raw conversation turns as documents instead of extracting durable facts, so the benchmark stays unfairly hard until the memory layer changes.
Issue #105 closes this by adding fact extraction on top of raw turn storage.
How To Run
OPENAI_API_KEY=sk-... bash benchmarks/longmemeval/run.sh
MAX_QUESTIONS=50 OPENAI_API_KEY=sk-... bash benchmarks/longmemeval/run.sh
Current Status
LLM judge required. Default answerer and judge model: GPT-4o.