AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge
December 18, 2024
Authors: Xiaobao Wu, Liangming Pan, Yuxi Xie, Ruiwen Zhou, Shuai Zhao, Yubo Ma, Mingzhe Du, Rui Mao, Anh Tuan Luu, William Yang Wang
cs.AI
Abstract
Data contamination hinders fair LLM evaluation by introducing test data into
newer models' training sets. Existing studies solve this challenge by updating
benchmarks with newly collected data. However, they fail to guarantee
contamination-free evaluation as the newly collected data may contain
pre-existing knowledge, and their benchmark updates rely on intensive human
labor. To address these issues, in this paper we propose AntiLeak-Bench, an
automated anti-leakage benchmarking framework. Instead of simply using newly
collected data, we construct samples with explicitly new knowledge absent from
LLMs' training sets, which thus ensures strictly contamination-free evaluation.
We further design a fully automated workflow to build and update our benchmark
without human labor. This significantly reduces the cost of benchmark
maintenance to accommodate emerging LLMs. Through extensive experiments, we
highlight that data contamination likely exists before LLMs' cutoff time and
demonstrate AntiLeak-Bench effectively overcomes this challenge.
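The core idea of the abstract — building test samples only from knowledge that emerged strictly after a model's training-data cutoff, so it cannot have leaked into the training set — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the fact-update records, their fields, and the cutoff date are all hypothetical placeholders.

```python
from datetime import date

def select_post_cutoff_facts(fact_updates, model_cutoff):
    """Keep only fact updates dated strictly after the model's
    training-data cutoff; these cannot appear in its training set."""
    return [f for f in fact_updates if f["updated_on"] > model_cutoff]

# Hypothetical knowledge-update records (e.g., revisions to a
# knowledge base), each with the date the fact changed.
facts = [
    {"subject": "Country X", "relation": "head of state",
     "object": "Person A", "updated_on": date(2023, 6, 1)},
    {"subject": "Team Y", "relation": "champion of",
     "object": "League Z", "updated_on": date(2024, 11, 5)},
]

# Assumed cutoff for some model; only the later update survives
# and can serve as a strictly contamination-free test sample.
fresh = select_post_cutoff_facts(facts, model_cutoff=date(2024, 4, 30))
```

Because the filter is a pure date comparison over automatically harvested updates, re-running it with a newer model's cutoff refreshes the benchmark without human labor, which is the maintenance-cost point the abstract makes.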