In Case You Missed It: ARC 'Challenge' Is Not That Challenging

Abstract

ARC Challenge appears more difficult than ARC Easy for modern LLMs primarily because of an evaluation setup that prevents direct comparison of answer choices, not because of inherent complexity. Although some researchers have quietly shifted to a more appropriate scheme over the last year, the implications of this change have yet to be widely acknowledged. We highlight this overlooked shift, show how similar evaluation practices falsely imply reasoning deficits in other benchmarks, and demonstrate that fairer methods dramatically reduce performance gaps (e.g., on SIQA) and even yield superhuman results (OpenBookQA). In doing so, we reveal how evaluation shapes perceived difficulty and offer guidelines to ensure that multiple-choice evaluations accurately reflect actual model capabilities.
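The abstract does not spell out the two evaluation setups, so the following is only a rough sketch of one plausible reading: in the first setup each answer is scored on its own and the model never sees the alternatives, while in the second all choices are listed in a single prompt and the model picks a letter. The function names, the prompt templates, and the `score(context, continuation)` callable (standing in for a model's log-likelihood) are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Sequence

# Hypothetical scorer type: returns how likely `continuation` is after `context`
# (for a real model this would be a log-likelihood).
Scorer = Callable[[str, str], float]

LETTERS = "ABCDEFGH"


def build_joint_prompt(question: str, choices: Sequence[str]) -> str:
    """List every choice in one prompt so the model can compare them directly."""
    lines = [f"Question: {question}"]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def pick_isolated(question: str, choices: Sequence[str], score: Scorer) -> int:
    """Score each answer text separately; the alternatives are never shown."""
    context = f"Question: {question}\nAnswer:"
    scores = [score(context, " " + choice) for choice in choices]
    return max(range(len(choices)), key=scores.__getitem__)


def pick_joint(question: str, choices: Sequence[str], score: Scorer) -> int:
    """Show all choices at once and score only the answer letters."""
    prompt = build_joint_prompt(question, choices)
    scores = [score(prompt, " " + LETTERS[i]) for i in range(len(choices))]
    return max(range(len(choices)), key=scores.__getitem__)
```

Under this reading, the isolated setup forces the model to judge each answer without context from the others, which is the kind of restriction the abstract attributes the apparent difficulty to. Any real comparison would replace the placeholder scorer with an actual model call.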
