Wikipedia in the Era of LLMs: Evolution and Risks
March 4, 2025
Authors: Siming Huang, Yuliang Xu, Mingmeng Geng, Yao Wan, Dongping Chen
cs.AI
Abstract
In this paper, we present a thorough analysis of the impact of Large Language
Models (LLMs) on Wikipedia, examining the evolution of Wikipedia through
existing data and using simulations to explore potential risks. We begin by
analyzing page views and article content to study Wikipedia's recent changes
and assess the impact of LLMs. Subsequently, we evaluate how LLMs affect
various Natural Language Processing (NLP) tasks related to Wikipedia, including
machine translation and retrieval-augmented generation (RAG). Our findings and
simulation results reveal that Wikipedia articles have been influenced by LLMs,
with an impact of approximately 1%-2% in certain categories. If the machine
translation benchmark based on Wikipedia is influenced by LLMs, the scores of
the models may become inflated, and the comparative results among models might
shift as well. Moreover, the effectiveness of RAG might decrease if the
knowledge base becomes polluted by LLM-generated content. While LLMs have not
yet fully changed Wikipedia's language and knowledge structures, we believe
that our empirical findings signal the need for careful consideration of
potential future risks.