Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-OASIS
November 29, 2024
Authors: Alessandro Scirè, Andrei Stefan Bejgu, Simone Tedeschi, Karim Ghonim, Federico Martelli, Roberto Navigli
cs.AI
Abstract
After the introduction of Large Language Models (LLMs), there have been
substantial improvements in the performance of Natural Language Generation
(NLG) tasks, including Text Summarization and Machine Translation. However,
LLMs still produce outputs containing hallucinations, that is, content not
grounded in factual information. Therefore, developing methods to assess the
factuality of LLMs has become urgent.
Indeed, resources for factuality evaluation have recently emerged. Valuable
as they are, these resources suffer from one or more of the following
limitations: (i) they are tailored to a specific task or domain; (ii) they are
limited in size, thereby preventing the training of new factuality evaluators;
(iii) they are designed for simpler verification tasks, such as claim
verification.
To address these issues, we introduce LLM-Oasis, to the best of our knowledge
the largest resource for training end-to-end factuality evaluators. LLM-Oasis
is constructed by extracting claims from Wikipedia, falsifying a subset of
these claims, and generating pairs of factual and unfactual texts. We then rely
on human annotators to both validate the quality of our dataset and to create a
gold standard test set for benchmarking factuality evaluation systems.
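The construction loop described above (extract claims, falsify a subset, regenerate paired texts) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names (`falsify`, `build_pairs`) and the string-based falsification are hypothetical stand-ins; in LLM-Oasis the extraction, falsification, and rewriting steps are performed by LLMs.

```python
from dataclasses import dataclass

@dataclass
class TextPair:
    factual: str      # text regenerated from the original claims
    unfactual: str    # text regenerated around a falsified claim

def falsify(claim: str) -> str:
    # Hypothetical falsification step: the real resource prompts an LLM
    # to alter a key fact (entity, date, number). Here we only negate the
    # first "is" for illustration.
    if " is " in claim:
        return claim.replace("is", "is not", 1)
    return claim + " (falsified)"

def build_pairs(passages, extract_claims, rewrite):
    """Sketch of the pairing loop:
    1. extract claims from a Wikipedia passage,
    2. falsify one of them,
    3. rewrite both the original and the altered claim sets into texts."""
    pairs = []
    for passage in passages:
        claims = extract_claims(passage)
        if not claims:
            continue
        altered = [falsify(claims[0])] + claims[1:]
        pairs.append(TextPair(factual=rewrite(claims),
                              unfactual=rewrite(altered)))
    return pairs
```

With a trivial one-claim extractor and a join-based rewriter, each passage yields one factual/unfactual pair sharing everything except the falsified fact.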
Our experiments demonstrate that LLM-Oasis presents a significant challenge
for state-of-the-art LLMs, with GPT-4o achieving up to 60% accuracy in our
proposed end-to-end factuality evaluation task, highlighting its potential to
drive future research in the field.
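The end-to-end task the accuracy figure refers to is binary: a system reads a raw text and decides whether it is factual or unfactual. A minimal sketch of scoring such a system on the gold test set (the `predict` callable and the tuple-based test-set format are assumptions for illustration, not the paper's interface):

```python
def evaluate_accuracy(predict, test_set):
    """End-to-end factuality evaluation, sketched: `predict` maps a text
    to True (factual) or False (unfactual); accuracy is the fraction of
    gold (text, label) pairs the system gets right."""
    correct = sum(predict(text) == label for text, label in test_set)
    return correct / len(test_set)
```

Under this metric, a score of 60% (as reported for GPT-4o) is only modestly above the 50% expected from random guessing on a balanced set of factual/unfactual pairs, which is what makes the benchmark challenging.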