LINGOLY-TOO: Disentangling Memorisation from Reasoning with Linguistic Templatisation and Orthographic Obfuscation
March 4, 2025
Authors: Jude Khouja, Karolina Korgul, Simi Hellsten, Lingyi Yang, Vlad Neacs, Harry Mayne, Ryan Kearns, Andrew Bean, Adam Mahdi
cs.AI
Abstract
Effective evaluation of the reasoning capabilities of large language models
(LLMs) is susceptible to overestimation due to data exposure of evaluation
benchmarks. We introduce a framework for producing linguistic reasoning
problems that reduces the effect of memorisation in model performance estimates
and apply this framework to develop LINGOLY-TOO, a challenging evaluation
benchmark for linguistic reasoning. By developing orthographic templates, we
dynamically obfuscate the writing systems of real languages to generate
numerous question variations. These variations preserve the reasoning steps
required for each solution while reducing the likelihood of specific problem
instances appearing in model training data. Our experiments demonstrate that
frontier models, including OpenAI o1-preview and DeepSeek R1, struggle with
advanced reasoning. Our analysis also shows that LLMs exhibit noticeable
variance in accuracy across permutations of the same problem, and on average
perform better on questions appearing in their original orthography. Our
findings highlight the opaque nature of response generation in LLMs and provide
evidence that prior data exposure contributes to overestimating the reasoning
capabilities of frontier models.
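The obfuscation idea described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's actual templating pipeline: it assumes obfuscation amounts to applying a seeded bijective substitution over a hypothetical grapheme inventory, so that each variant changes surface forms while preserving word structure (and thus the reasoning steps). The inventory and example text are invented for illustration.

```python
import random


def make_obfuscation_map(graphemes, seed=0):
    """Build a bijective substitution over a language's grapheme inventory.

    Using a random permutation guarantees the mapping is one-to-one, so
    distinct graphemes stay distinct and the internal structure of words
    is preserved while surface forms change.
    """
    rng = random.Random(seed)  # seeded, so each variant is reproducible
    shuffled = list(graphemes)
    rng.shuffle(shuffled)
    return dict(zip(graphemes, shuffled))


def obfuscate(text, mapping):
    """Rewrite text grapheme by grapheme; characters outside the
    inventory (spaces, punctuation) pass through unchanged."""
    return "".join(mapping.get(ch, ch) for ch in text)


# Hypothetical inventory and problem text, for illustration only.
inventory = list("abcdefghijklmnopqrstuvwxyz")
mapping = make_obfuscation_map(inventory, seed=42)
variant = obfuscate("kata means 'house'", mapping)
```

Varying the seed yields many question variants from one template, which is what lowers the chance that any specific instance appeared verbatim in training data.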