

How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild

February 18, 2025
作者: Saad Obaid ul Islam, Anne Lauscher, Goran Glavaš
cs.AI

Abstract

In the age of misinformation, hallucination -- the tendency of Large Language Models (LLMs) to generate non-factual or unfaithful responses -- represents the main risk for their global utility. Despite LLMs becoming increasingly multilingual, the vast majority of research on detecting and quantifying LLM hallucination is (a) English-centric and (b) focused on machine translation (MT) and summarization, tasks that are less common "in the wild" than open information seeking. In contrast, we aim to quantify the extent of LLM hallucination across languages in knowledge-intensive long-form question answering. To this end, we train a multilingual hallucination detection model and conduct a large-scale study across 30 languages and 6 open-source LLM families. We start from an English hallucination detection dataset and rely on MT to generate (noisy) training data in other languages. We also manually annotate gold data for five high-resource languages; we then demonstrate, for these languages, that the estimates of hallucination rates are similar between silver (LLM-generated) and gold test sets, validating the use of silver data for estimating hallucination rates for other languages. For the final rate estimation, we build a knowledge-intensive QA dataset for 30 languages with LLM-generated prompts and Wikipedia articles as references. We find that, while LLMs generate longer responses with more hallucinated tokens for higher-resource languages, there is no correlation between the length-normalized hallucination rates of languages and their digital representation. Further, we find that smaller LLMs exhibit higher hallucination rates than larger models.
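
To make the reported quantities concrete, the sketch below shows one plausible way to compute a length-normalized hallucination rate per language and check its rank correlation with a proxy for digital representation. This is a minimal illustration, not the paper's code: all numbers, the `digital_repr` proxy, and the helper function are hypothetical assumptions.

```python
from scipy.stats import spearmanr

def length_normalized_hallucination_rate(hallucinated_tokens, total_tokens):
    """Share of hallucinated tokens among all generated tokens for one language.

    Both arguments are per-response token counts; the rate is length-normalized
    because it divides by the total number of generated tokens instead of
    counting hallucinated responses."""
    return sum(hallucinated_tokens) / max(sum(total_tokens), 1)

# Hypothetical per-language statistics (NOT from the paper): per-response counts
# of hallucinated and total tokens, plus a crude proxy for each language's
# digital representation (e.g., number of Wikipedia articles).
languages = {
    "en": {"hallucinated": [12, 3, 7], "total": [410, 380, 395], "digital_repr": 6_800_000},
    "hi": {"hallucinated": [10, 6, 8], "total": [220, 240, 210], "digital_repr": 160_000},
    "sw": {"hallucinated": [9, 5, 4],  "total": [150, 170, 160], "digital_repr": 80_000},
}

rates, repr_sizes = [], []
for lang, stats in languages.items():
    rate = length_normalized_hallucination_rate(stats["hallucinated"], stats["total"])
    rates.append(rate)
    repr_sizes.append(stats["digital_repr"])
    print(f"{lang}: length-normalized hallucination rate = {rate:.3f}")

# Rank correlation between per-language rates and digital representation;
# the abstract reports that no such correlation is observed.
rho, p_value = spearmanr(rates, repr_sizes)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```

A rank correlation (Spearman) is used here only because the abstract makes an ordinal claim about resource level; the statistic actually used in the paper may differ.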
