L-CiteEval: Do Long-Context Models Truly Leverage Context for Responding?
October 3, 2024
Authors: Zecheng Tang, Keyan Zhou, Juntao Li, Baibei Ji, Jianye Hou, Min Zhang
cs.AI
Abstract
Long-context models (LCMs) have made remarkable strides in recent years,
offering users great convenience for handling tasks that involve long context,
such as document summarization. As the community increasingly prioritizes the
faithfulness of generated results, merely ensuring the accuracy of LCM outputs
is insufficient, as it is quite challenging for humans to verify the results
from the extremely lengthy context. Yet, although some efforts have been made
to assess whether LCMs truly respond based on the context, these works are
either limited to specific tasks or rely heavily on external evaluation
resources like GPT-4. In this work, we introduce L-CiteEval, a comprehensive
multi-task
benchmark for long-context understanding with citations, aiming to evaluate
both the understanding capability and faithfulness of LCMs. L-CiteEval covers
11 tasks from diverse domains, spanning context lengths from 8K to 48K, and
provides a fully automated evaluation suite. Through testing with 11
cutting-edge closed-source and open-source LCMs, we find that although these
models show minor differences in their generated results, open-source models
substantially trail behind their closed-source counterparts in terms of
citation accuracy and recall. This suggests that current open-source LCMs are
prone to responding based on their inherent knowledge rather than the given
context, posing a significant risk to the user experience in practical
applications. We also evaluate the RAG approach and observe that RAG can
significantly improve the faithfulness of LCMs, albeit with a slight decrease
in the generation quality. Furthermore, we discover a correlation between the
attention mechanisms of LCMs and the citation generation process.
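As a reading aid, below is a minimal sketch of how an automated citation-faithfulness check of the kind the abstract describes (citation recall and accuracy, scored without an external judge like GPT-4) can be computed. Everything here is an illustrative assumption rather than the paper's actual evaluation suite: the function names, the chunk-id citation format, and the pluggable `entails` judge (in practice an NLI model) are all hypothetical.

```python
# Illustrative sketch (NOT the authors' code): citation recall/precision
# over a model answer whose sentences cite context chunks by id.
# `entails(premise, hypothesis) -> bool` stands in for an NLI judge.
from typing import Callable, List

def citation_recall(statements: List[str],
                    citations: List[List[int]],
                    chunks: List[str],
                    entails: Callable[[str, str], bool]) -> float:
    """Fraction of answer sentences fully supported by their cited chunks."""
    supported = 0
    for sent, cite_ids in zip(statements, citations):
        premise = " ".join(chunks[i] for i in cite_ids)
        if cite_ids and entails(premise, sent):
            supported += 1
    return supported / max(len(statements), 1)

def citation_precision(statements: List[str],
                       citations: List[List[int]],
                       chunks: List[str],
                       entails: Callable[[str, str], bool]) -> float:
    """Fraction of individual citations that are actually relevant:
    a citation counts if its chunk alone entails the sentence, or if
    removing it breaks entailment of the full cited set."""
    precise, total = 0, 0
    for sent, cite_ids in zip(statements, citations):
        full = " ".join(chunks[j] for j in cite_ids)
        for i in cite_ids:
            total += 1
            rest = " ".join(chunks[j] for j in cite_ids if j != i)
            if entails(chunks[i], sent) or (
                    entails(full, sent) and not entails(rest, sent)):
                precise += 1
    return precise / max(total, 1)

if __name__ == "__main__":
    # Toy demo with a trivial substring "judge"; a real setup would
    # plug in a local NLI model here instead.
    chunks = ["The Eiffel Tower is in Paris.", "It was completed in 1889."]
    statements = ["The Eiffel Tower is in Paris."]
    citations = [[0]]
    naive = lambda premise, hyp: hyp.lower() in premise.lower()
    print(citation_recall(statements, citations, chunks, naive))     # 1.0
    print(citation_precision(statements, citations, chunks, naive))  # 1.0
```

Metrics of this shape are what make an evaluation suite fully automatic: once a local entailment judge replaces the toy lambda, faithfulness can be scored without any external resource such as GPT-4.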