L-CiteEval: Do Long-Context Models Truly Leverage Context for Responding?
October 3, 2024
Authors: Zecheng Tang, Keyan Zhou, Juntao Li, Baibei Ji, Jianye Hou, Min Zhang
cs.AI
Abstract
Long-context models (LCMs) have made remarkable strides in recent years,
offering users great convenience for handling tasks that involve long context,
such as document summarization. As the community increasingly prioritizes the
faithfulness of generated results, merely ensuring the accuracy of LCM outputs
is insufficient, as it is quite challenging for humans to verify the results
from the extremely lengthy context. Yet, although some efforts have been made
to assess whether LCMs truly respond based on the context, these works are
either limited to specific tasks or rely heavily on external evaluation
resources like GPT-4. In this work, we introduce L-CiteEval, a comprehensive
multi-task
benchmark for long-context understanding with citations, aiming to evaluate
both the understanding capability and faithfulness of LCMs. L-CiteEval covers
11 tasks from diverse domains, spanning context lengths from 8K to 48K, and
provides a fully automated evaluation suite. Through testing with 11
cutting-edge closed-source and open-source LCMs, we find that although these
models show minor differences in their generated results, open-source models
substantially trail behind their closed-source counterparts in terms of
citation accuracy and recall. This suggests that current open-source LCMs are
prone to responding based on their inherent knowledge rather than the given
context, posing a significant risk to the user experience in practical
applications. We also evaluate the RAG approach and observe that RAG can
significantly improve the faithfulness of LCMs, albeit with a slight decrease
in the generation quality. Furthermore, we discover a correlation between the
attention mechanisms of LCMs and the citation generation process.
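As a reading aid, below is a minimal sketch of how an automated citation-faithfulness check of the kind the abstract describes (citation recall and accuracy, scored without an external judge like GPT-4) can be computed. Everything here is an illustrative assumption rather than the paper's actual evaluation suite: the function names, the chunk-id citation format, and the pluggable `entails` judge (in practice an NLI model) are all hypothetical.

```python
# Illustrative sketch (NOT the authors' code): citation recall/precision
# over a model answer whose sentences cite context chunks by id.
# `entails(premise, hypothesis) -> bool` stands in for an NLI judge.
from typing import Callable, List

def citation_recall(statements: List[str],
                    citations: List[List[int]],
                    chunks: List[str],
                    entails: Callable[[str, str], bool]) -> float:
    """Fraction of answer sentences fully supported by their cited chunks."""
    supported = 0
    for sent, cite_ids in zip(statements, citations):
        premise = " ".join(chunks[i] for i in cite_ids)
        if cite_ids and entails(premise, sent):
            supported += 1
    return supported / max(len(statements), 1)

def citation_precision(statements: List[str],
                       citations: List[List[int]],
                       chunks: List[str],
                       entails: Callable[[str, str], bool]) -> float:
    """Fraction of individual citations that are actually relevant:
    a citation counts if its chunk alone entails the sentence, or if
    removing it breaks entailment of the full cited set."""
    precise, total = 0, 0
    for sent, cite_ids in zip(statements, citations):
        full = " ".join(chunks[j] for j in cite_ids)
        for i in cite_ids:
            total += 1
            rest = " ".join(chunks[j] for j in cite_ids if j != i)
            if entails(chunks[i], sent) or (
                    entails(full, sent) and not entails(rest, sent)):
                precise += 1
    return precise / max(total, 1)

if __name__ == "__main__":
    # Toy demo with a trivial substring "judge"; a real setup would
    # plug in a local NLI model here instead.
    chunks = ["The Eiffel Tower is in Paris.", "It was completed in 1889."]
    statements = ["The Eiffel Tower is in Paris."]
    citations = [[0]]
    naive = lambda premise, hyp: hyp.lower() in premise.lower()
    print(citation_recall(statements, citations, chunks, naive))     # 1.0
    print(citation_precision(statements, citations, chunks, naive))  # 1.0
```

Metrics of this shape are what make an evaluation suite fully automatic: once a local entailment judge replaces the toy lambda, faithfulness can be scored without any external resource such as GPT-4.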