Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-OASIS
November 29, 2024
Authors: Alessandro Scirè, Andrei Stefan Bejgu, Simone Tedeschi, Karim Ghonim, Federico Martelli, Roberto Navigli
cs.AI
Abstract
After the introduction of Large Language Models (LLMs), there have been
substantial improvements in the performance of Natural Language Generation
(NLG) tasks, including Text Summarization and Machine Translation. However,
LLMs still produce outputs containing hallucinations, that is, content not
grounded in factual information. Therefore, developing methods to assess the
factuality of LLMs has become urgent.
Indeed, resources for factuality evaluation have recently emerged. Although
challenging, these resources face one or more of the following limitations: (i)
they are tailored to a specific task or domain; (ii) they are limited in size,
thereby preventing the training of new factuality evaluators; (iii) they are
designed for simpler verification tasks, such as claim verification.
To address these issues, we introduce LLM-Oasis, to the best of our knowledge
the largest resource for training end-to-end factuality evaluators. LLM-Oasis
is constructed by extracting claims from Wikipedia, falsifying a subset of
these claims, and generating pairs of factual and unfactual texts. We then rely
on human annotators to both validate the quality of our dataset and to create a
gold standard test set for benchmarking factuality evaluation systems.
Our experiments demonstrate that LLM-Oasis presents a significant challenge
for state-of-the-art LLMs, with GPT-4o achieving up to 60% accuracy in our
proposed end-to-end factuality evaluation task, highlighting its potential to
drive future research in the field.
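The end-to-end task described in the abstract can be framed as binary classification over pairs of factual and unfactual texts, scored by accuracy. The sketch below illustrates that framing only; `predict_factual` is a hypothetical stand-in for an actual evaluator (e.g. a prompted LLM) and is not part of the paper's method or released code.

```python
# Sketch of the end-to-end factuality evaluation setup: each example is a
# (factual text, unfactual text) pair, and an evaluator must label each
# text correctly. The toy keyword rule below exists only to make the
# sketch runnable; real systems would query an LLM or trained classifier.

def predict_factual(text: str) -> bool:
    # Placeholder evaluator -- NOT the paper's approach.
    return "not" not in text.lower()

def pairwise_accuracy(pairs: list[tuple[str, str]]) -> float:
    """Fraction of texts labeled correctly across all (factual, unfactual) pairs."""
    correct = 0
    for factual_text, unfactual_text in pairs:
        correct += predict_factual(factual_text) is True
        correct += predict_factual(unfactual_text) is False
    return correct / (2 * len(pairs))

# Toy illustrative pairs (not drawn from LLM-Oasis).
pairs = [
    ("The Eiffel Tower is in Paris.", "The Eiffel Tower is not in Paris."),
    ("Water boils at 100 degrees C at sea level.",
     "Water does not boil at 100 degrees C at sea level."),
]
print(pairwise_accuracy(pairs))  # 1.0 for this toy heuristic on these examples
```

Under this framing, the paper's reported figure means GPT-4o labels at most about 60% of texts correctly on the gold test set, leaving substantial headroom for future evaluators.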