LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

Abstract

Recent large language model (LLM)-driven chat assistant systems have integrated memory components to track user-assistant chat histories, enabling more accurate and personalized responses. However, their long-term memory capabilities in sustained interactions remain underexplored. This paper introduces LongMemEval, a comprehensive benchmark designed to evaluate five core long-term memory abilities of chat assistants: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. With 500 meticulously curated questions embedded within freely scalable user-assistant chat histories, LongMemEval presents a significant challenge to existing long-term memory systems, with commercial chat assistants and long-context LLMs showing a 30% accuracy drop on memorizing information across sustained interactions. We then present a unified framework that breaks down the long-term memory design into four design choices across the indexing, retrieval, and reading stages. Built upon key experimental insights, we propose several memory designs, including session decomposition for optimizing value granularity, fact-augmented key expansion for enhancing the index structure, and time-aware query expansion for refining the search scope. Experimental results show that these optimizations greatly improve both memory recall and downstream question answering on LongMemEval. Overall, our study provides valuable resources and guidance for advancing the long-term memory capabilities of LLM-based chat assistants, paving the way toward more personalized and reliable conversational AI.
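To make the indexing and retrieval stages more concrete, the following is a minimal Python sketch of how the three named designs could fit together. It is an illustration under stated assumptions, not the authors' implementation: the abstract only names the techniques, so `MemoryItem`, `extract_facts`, `index_session`, and `retrieve` are hypothetical names, the fact extractor is a trivial keyword filter standing in for an LLM, and the token-overlap ranking stands in for a real retriever.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Set


@dataclass
class MemoryItem:
    key: str         # text used for matching (round text plus extracted facts)
    value: str       # text that would be handed to the reader model
    timestamp: date  # date of the session this round came from


def extract_facts(text: str) -> str:
    """Hypothetical stand-in for an LLM fact extractor: keeps longer words."""
    return " ".join(w for w in text.lower().split() if len(w) > 4)


def index_session(rounds: List[str], when: date) -> List[MemoryItem]:
    """Session decomposition: each round becomes its own value.
    Key expansion: the key is the round text plus its extracted facts."""
    return [
        MemoryItem(key=f"{r} {extract_facts(r)}", value=r, timestamp=when)
        for r in rounds
    ]


def tokens(text: str) -> Set[str]:
    """Lowercased word set with basic punctuation removed."""
    return set(text.lower().replace("?", " ").replace(".", " ").split())


def retrieve(
    memory: List[MemoryItem],
    query: str,
    start: Optional[date] = None,
    end: Optional[date] = None,
    top_k: int = 1,
) -> List[MemoryItem]:
    """Time-aware retrieval: restrict to a date range, then rank by overlap."""
    candidates = [
        m for m in memory
        if (start is None or m.timestamp >= start)
        and (end is None or m.timestamp <= end)
    ]
    q = tokens(query)
    ranked = sorted(candidates, key=lambda m: len(q & tokens(m.key)), reverse=True)
    return ranked[:top_k]


# Two toy sessions (invented example data, not taken from the benchmark).
memory = index_session(
    ["I adopted a beagle named Max.", "My favorite tea is oolong."],
    date(2024, 1, 5),
) + index_session(
    ["I moved to Seattle last week.", "I adopted a kitten named Luna."],
    date(2024, 3, 10),
)

# The date range is assumed to have been inferred from the question.
hits = retrieve(
    memory,
    "Which pet have I adopted?",
    start=date(2024, 3, 1),
    end=date(2024, 3, 31),
)
print(hits[0].value)
```

In this toy setup, the January and March adoption rounds match the query equally well on token overlap, so the date range is what selects the March round. That is the intuition behind narrowing the search scope by time; a real system would derive the range from the question rather than receive it as an argument.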
