Unstructured Evidence Attribution for Long Context Query Focused Summarization
February 20, 2025
Authors: Dustin Wright, Zain Muhammad Mujahid, Lu Wang, Isabelle Augenstein, David Jurgens
cs.AI
Abstract
Large language models (LLMs) are capable of generating coherent summaries
from very long contexts given a user query. Extracting and properly citing
evidence spans could help improve the transparency and reliability of these
summaries. At the same time, LLMs suffer from positional biases in terms of
which information they understand and attend to, which could affect evidence
citation. Whereas previous work has focused on evidence citation with
predefined levels of granularity (e.g. sentence, paragraph, document, etc.), we
propose the task of long-context query focused summarization with unstructured
evidence citation. We show how existing systems struggle to generate and
properly cite unstructured evidence from their context, and that evidence tends
to be "lost-in-the-middle". To help mitigate this, we create the Summaries with
Unstructured Evidence Text dataset (SUnsET), a synthetic dataset generated
using a novel domain-agnostic pipeline which can be used as supervision to
adapt LLMs to this task. We demonstrate across 5 LLMs of different sizes and 4
datasets with varying document types and lengths that LLMs adapted with SUnsET
data generate more relevant and factually consistent evidence than their base
models, extract evidence from more diverse locations in their context, and can
generate more relevant and consistent summaries.
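The core of the task above is attributing summary claims to unstructured evidence spans quoted verbatim from a long context, and checking where in the context those spans come from (the "lost-in-the-middle" analysis). A minimal sketch of such a check is shown below; the function name and report format are illustrative assumptions, not part of the paper's released code.

```python
def verify_evidence_spans(context: str, spans: list[str]) -> list[dict]:
    """For each cited evidence span, check whether it appears verbatim in
    the source context and, if so, record its relative position
    (0.0 = start of context, 1.0 = end), useful for spotting positional bias."""
    results = []
    for span in spans:
        idx = context.find(span)  # -1 if the span is not a verbatim quote
        results.append({
            "span": span,
            "supported": idx != -1,
            # Relative midpoint of the span within the context, or None.
            "position": (idx + len(span) / 2) / len(context) if idx != -1 else None,
        })
    return results
```

For example, a fabricated span that never occurs in the context is flagged as unsupported, while a verbatim quote is reported with its relative location, so the distribution of positions across many summaries reveals whether evidence clusters at the start and end of the context.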