LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

Abstract

Recent large language model (LLM)-driven chat assistant systems have integrated memory components to track user-assistant chat histories, enabling more accurate and personalized responses. However, their long-term memory capabilities in sustained interactions remain underexplored. This paper introduces LongMemEval, a comprehensive benchmark designed to evaluate five core long-term memory abilities of chat assistants: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. With 500 meticulously curated questions embedded within freely scalable user-assistant chat histories, LongMemEval presents a significant challenge to existing long-term memory systems: commercial chat assistants and long-context LLMs show a 30% accuracy drop when memorizing information across sustained interactions. We then present a unified framework that breaks down long-term memory design into four design choices across the indexing, retrieval, and reading stages. Building on key experimental insights, we propose several memory designs, including session decomposition for optimizing value granularity, fact-augmented key expansion for enhancing the index structure, and time-aware query expansion for refining the search scope. Experimental results show that these optimizations greatly improve both memory recall and downstream question answering on LongMemEval. Overall, our study provides valuable resources and guidance for advancing the long-term memory capabilities of LLM-based chat assistants, paving the way toward more personalized and reliable conversational AI.
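The three stages named in the abstract (indexing, retrieval, reading) can be illustrated with a toy sketch. This is not the authors' implementation. The names MemoryItem, index_session, retrieve, and build_reader_prompt are hypothetical, plain token overlap stands in for a real retriever, the facts attached to each key are written by hand rather than extracted by a model, and the time range is passed in directly rather than inferred from the query.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class MemoryItem:
    key: str         # text used for matching (round text plus facts)
    value: str       # text handed to the reader (one user-assistant round)
    timestamp: date  # date of the session the round came from


def tokenize(text):
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    return {w.strip(".,?!:").lower() for w in text.split()} - {""}


def index_session(rounds, facts, timestamp):
    """Indexing: store each round as its own value (session decomposition)
    and append a fact string to its key (fact-augmented key expansion).
    `facts` is parallel to `rounds` and hand-written here (assumption)."""
    return [
        MemoryItem(key=f"{r} {f}", value=r, timestamp=timestamp)
        for r, f in zip(rounds, facts)
    ]


def retrieve(memory, query, start=None, end=None, k=2):
    """Retrieval: narrow by an optional time range, then rank by overlap."""
    q = tokenize(query)
    pool = [
        m for m in memory
        if (start is None or m.timestamp >= start)
        and (end is None or m.timestamp <= end)
    ]
    ranked = sorted(pool, key=lambda m: len(q & tokenize(m.key)), reverse=True)
    return ranked[:k]


def build_reader_prompt(items, query):
    """Reading: lay out retrieved values in time order ahead of the question."""
    context = "\n".join(
        f"[{m.timestamp.isoformat()}] {m.value}"
        for m in sorted(items, key=lambda m: m.timestamp)
    )
    return f"History:\n{context}\n\nQuestion: {query}"


# Usage with two invented sessions (illustrative data only).
memory = []
memory += index_session(
    ["User: I adopted a beagle named Max. Assistant: Congratulations!"],
    ["user pet dog beagle Max"],
    date(2023, 3, 1),
)
memory += index_session(
    ["User: We moved to Denver last week. Assistant: Hope the move went well."],
    ["user city residence Denver"],
    date(2023, 6, 10),
)
question = "Which city does the user live in?"
hits = retrieve(memory, question, start=date(2023, 5, 1), k=1)
prompt = build_reader_prompt(hits, question)
print(prompt)
```

In this sketch the fact string "user city residence Denver" gives the key terms that the raw round lacks, and the time filter removes the earlier session before ranking. Those are the roles the abstract assigns to key expansion and time-aware query expansion, shown here in a deliberately simplified form.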
