How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild
February 18, 2025
Authors: Saad Obaid ul Islam, Anne Lauscher, Goran Glavaš
cs.AI
Abstract
In the age of misinformation, hallucination -- the tendency of Large Language
Models (LLMs) to generate non-factual or unfaithful responses -- represents the
main risk for their global utility. Despite LLMs becoming increasingly
multilingual, the vast majority of research on detecting and quantifying LLM
hallucination is (a) English-centric and (b) focused on machine translation (MT)
and summarization, tasks that are less common "in the wild" than open
information seeking. In contrast, we aim to quantify the extent of LLM
hallucination across languages in knowledge-intensive long-form question
answering. To this end, we train a multilingual hallucination detection model
and conduct a large-scale study across 30 languages and 6 open-source LLM
families. We start from an English hallucination detection dataset and rely on
MT to generate (noisy) training data in other languages. We also manually
annotate gold data for five high-resource languages; we then demonstrate, for
these languages, that the estimates of hallucination rates are similar between
silver (LLM-generated) and gold test sets, validating the use of silver data
for estimating hallucination rates for other languages. For the final estimation of
hallucination rates, we build a knowledge-intensive QA dataset for 30 languages with
LLM-generated prompts and Wikipedia articles as references. We find that, while
LLMs generate longer responses with more hallucinated tokens for
higher-resource languages, there is no correlation between length-normalized
hallucination rates of languages and their digital representation. Further, we
find that smaller LLMs exhibit larger hallucination rates than larger models.
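To make the abstract's central quantity concrete, below is a minimal sketch of how a length-normalized hallucination rate and its correlation with a language's digital representation could be computed. The function names, the token-level hallucination labels, and the use of Wikipedia article counts as a resource proxy are illustrative assumptions, not the paper's actual pipeline or data.

```python
from typing import Dict, List, Tuple

from scipy.stats import spearmanr  # rank correlation from SciPy


def length_normalized_rate(token_labels: List[int]) -> float:
    """Fraction of response tokens flagged as hallucinated (1 = hallucinated, 0 = supported)."""
    if not token_labels:
        return 0.0
    return sum(token_labels) / len(token_labels)


def correlate_with_resources(
    rates: Dict[str, float],         # language code -> mean length-normalized hallucination rate
    resource_proxy: Dict[str, int],  # language code -> assumed digital-representation proxy
) -> Tuple[float, float]:
    """Spearman correlation between per-language rates and a resource proxy."""
    langs = sorted(set(rates) & set(resource_proxy))
    rho, p_value = spearmanr(
        [rates[lang] for lang in langs],
        [resource_proxy[lang] for lang in langs],
    )
    return rho, p_value


if __name__ == "__main__":
    # Toy numbers, purely illustrative: per-language averages of the token-level rate
    # and Wikipedia article counts standing in for digital representation.
    toy_rates = {"en": 0.12, "de": 0.14, "sw": 0.13, "yo": 0.11}
    toy_resources = {"en": 6_800_000, "de": 2_900_000, "sw": 80_000, "yo": 34_000}
    rho, p = correlate_with_resources(toy_rates, toy_resources)
    print(f"Spearman rho = {rho:.2f}, p = {p:.2f}")
```

Normalizing by response length is what separates the paper's finding from the raw counts: longer answers in high-resource languages contain more hallucinated tokens in absolute terms, but the per-token rate need not track resource availability.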