LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory
October 14, 2024
Authors: Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, Dong Yu
cs.AI
Abstract
Recent large language model (LLM)-driven chat assistant systems have
integrated memory components to track user-assistant chat histories, enabling
more accurate and personalized responses. However, their long-term memory
capabilities in sustained interactions remain underexplored. This paper
introduces LongMemEval, a comprehensive benchmark designed to evaluate five
core long-term memory abilities of chat assistants: information extraction,
multi-session reasoning, temporal reasoning, knowledge updates, and abstention.
With 500 meticulously curated questions embedded within freely scalable
user-assistant chat histories, LongMemEval presents a significant challenge to
existing long-term memory systems, with commercial chat assistants and
long-context LLMs showing a 30% accuracy drop in memorizing information across
sustained interactions. We then present a unified framework that breaks down
the long-term memory design into four design choices across the indexing,
retrieval, and reading stages. Built upon key experimental insights, we propose
several memory designs including session decomposition for optimizing value
granularity, fact-augmented key expansion for enhancing the index structure,
and time-aware query expansion for refining the search scope. Experimental
results show that these optimizations greatly improve both memory recall and
downstream question answering on LongMemEval. Overall, our study provides
valuable resources and guidance for advancing the long-term memory capabilities
of LLM-based chat assistants, paving the way toward more personalized and
reliable conversational AI.
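
To make the indexing, retrieval, and reading stages concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each session is split into individual rounds (session decomposition), that each key is the round text concatenated with extracted facts (fact-augmented key expansion), and that the search is narrowed by a date range (time-aware narrowing). The names `MemoryItem`, `index_sessions`, `retrieve`, `build_prompt`, and `toy_facts` are hypothetical, the fact extractor is a stub, the time range is supplied by the caller rather than inferred from the query, and token overlap stands in for a real retriever.

```python
from dataclasses import dataclass
from datetime import date
import re


@dataclass
class MemoryItem:
    key: str         # text that is searched (value plus any extracted facts)
    value: str       # text handed to the reader model
    timestamp: date  # date of the session the value came from


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def index_sessions(sessions, extract_facts):
    """Indexing: split each session into rounds and expand each key with facts."""
    items = []
    for session_date, rounds in sessions:
        for round_text in rounds:                  # session decomposition
            facts = extract_facts(round_text)      # hypothetical extractor
            key = " ".join([round_text, *facts])   # fact-augmented key
            items.append(MemoryItem(key, round_text, session_date))
    return items


def retrieve(items, query, time_range=None, top_k=2):
    """Retrieval: optional time filter, then rank by token overlap with the key."""
    if time_range is not None:                     # time-aware narrowing
        start, end = time_range
        items = [m for m in items if start <= m.timestamp <= end]
    q = tokenize(query)
    ranked = sorted(items, key=lambda m: len(q & tokenize(m.key)), reverse=True)
    return ranked[:top_k]


def build_prompt(hits, query):
    """Reading: present retrieved values with their dates ahead of the question."""
    lines = [f"[{m.timestamp.isoformat()}] {m.value}" for m in hits]
    return "\n".join(lines + [f"Question: {query}"])


def toy_facts(text):
    """Stub extractor (assumption): a real system would likely use an LLM here."""
    return ["user owns a dog"] if "beagle" in text else []


if __name__ == "__main__":
    sessions = [
        (date(2024, 3, 1), ["I adopted a beagle named Milo.",
                            "Can you suggest a pasta recipe?"]),
        (date(2024, 6, 10), ["I moved to Seattle last week."]),
    ]
    items = index_sessions(sessions, toy_facts)
    question = "What is the name of my dog?"
    print(build_prompt(retrieve(items, question, top_k=1), question))
```

In this toy data the question shares no token with the raw round text ("named" versus "name"), so the match comes only from the fact appended to the key. That is the kind of gap key expansion is meant to close.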