Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon
February 11, 2025
Authors: Nurit Cohen-Inger, Yehonatan Elisha, Bracha Shapira, Lior Rokach, Seffi Cohen
cs.AI
Abstract
Large language models (LLMs) often appear to excel on public benchmarks, but
these high scores may mask an overreliance on dataset-specific surface cues
rather than true language understanding. We introduce the Chameleon Benchmark
Overfit Detector (C-BOD), a meta-evaluation framework that systematically
distorts benchmark prompts via a parametric transformation and detects
overfitting of LLMs. By rephrasing inputs while preserving their semantic
content and labels, C-BOD exposes whether a model's performance is driven by
memorized patterns. Evaluated on the MMLU benchmark using 26 leading LLMs, our
method reveals an average performance degradation of 2.15% under modest
perturbations, with 20 out of 26 models exhibiting statistically significant
differences. Notably, models with higher baseline accuracy exhibit larger
performance differences under perturbation, and larger LLMs tend to be more
sensitive to rephrasings, indicating that both cases may over-rely on fixed
prompt patterns. In contrast, the Llama family and models with lower baseline
accuracy show insignificant degradation, suggesting reduced dependency on
superficial cues. Moreover, C-BOD's dataset- and model-agnostic design allows
easy integration into training pipelines to promote more robust language
understanding. Our findings challenge the community to look beyond leaderboard
scores and prioritize resilience and generalization in LLM evaluation.
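
To make the protocol described above concrete, below is a minimal sketch (not the authors' released code) of a C-BOD-style meta-evaluation loop: score each benchmark item on its original prompt and on a meaning-preserving rephrasing, report the accuracy degradation, and test whether the per-item flips are statistically significant. The `model.answer` API, the `rephrase` transformation, and the benchmark item format are hypothetical stand-ins, and the exact McNemar-style test is one reasonable choice for the significance check rather than necessarily the paper's exact procedure.

```python
# Sketch of a C-BOD-style overfit check: baseline vs. rephrased accuracy plus
# a paired significance test. All interfaces below are illustrative assumptions.
from scipy.stats import binomtest


def c_bod_eval(model, benchmark, rephrase):
    """Compare accuracy on original vs. rephrased prompts for one model.

    benchmark: iterable of dicts like {"prompt": str, "label": str}
    rephrase:  callable that rewrites a prompt while preserving meaning/label
    model:     object with a hypothetical .answer(prompt) -> str method
    """
    orig_correct, reph_correct = [], []
    for item in benchmark:
        orig_correct.append(model.answer(item["prompt"]) == item["label"])
        reph_correct.append(model.answer(rephrase(item["prompt"])) == item["label"])

    n = len(orig_correct)
    orig_acc = sum(orig_correct) / n
    reph_acc = sum(reph_correct) / n

    # Discordant pairs: items answered correctly in only one of the two
    # conditions. An exact McNemar-style test asks whether flips are symmetric.
    b = sum(o and not r for o, r in zip(orig_correct, reph_correct))
    c = sum(r and not o for o, r in zip(orig_correct, reph_correct))
    p_value = binomtest(b, b + c, 0.5).pvalue if (b + c) > 0 else 1.0

    return {
        "original_acc": orig_acc,
        "rephrased_acc": reph_acc,
        "degradation": orig_acc - reph_acc,   # positive => worse after rephrasing
        "p_value": p_value,
    }
```

Running such a loop over a suite of models would yield, per model, a degradation figure comparable to the paper's reported average of 2.15% on MMLU, together with a p-value indicating whether the drop is statistically significant.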