In Case You Missed It: ARC 'Challenge' Is Not That Challenging

Summary

ARC Challenge appears more difficult than ARC Easy for modern LLMs primarily because of an evaluation setup that prevents direct comparison of answer choices, not because of inherent complexity. Although some researchers have quietly shifted to a more appropriate scheme over the last year, the implications of this change have yet to be widely acknowledged. We highlight this overlooked shift, show how similar evaluation practices falsely imply reasoning deficits in other benchmarks, and demonstrate that fairer methods dramatically reduce performance gaps (e.g., on SIQA) and even yield superhuman results (OpenBookQA). In doing so, we reveal how evaluation shapes perceived difficulty and offer guidelines to ensure that multiple-choice evaluations accurately reflect actual model capabilities.
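
To make the contrast concrete, here is a minimal sketch of the two kinds of evaluation setup the summary alludes to: one that scores each answer choice in isolation, and one that shows all choices in the prompt so they can be compared directly. The prompt templates, the function names, and the `score` callable (assumed to return a log-probability of a continuation given a prompt) are illustrative assumptions, not the exact protocol used by the authors or by any particular evaluation harness.

```python
from typing import Callable, Sequence

# Hypothetical scorer: returns log P(continuation | prompt) under some model.
Scorer = Callable[[str, str], float]


def predict_separately(question: str, options: Sequence[str], score: Scorer) -> int:
    """Score each option on its own; the model never sees the alternatives."""
    prompt = f"Question: {question}\nAnswer:"  # assumed template
    scores = [score(prompt, " " + option) for option in options]
    return max(range(len(options)), key=scores.__getitem__)


def predict_with_options(question: str, options: Sequence[str], score: Scorer) -> int:
    """List every option in the prompt and score only the answer labels."""
    labels = "ABCDEFGH"[: len(options)]
    listed = "\n".join(f"{label}. {option}" for label, option in zip(labels, options))
    prompt = f"Question: {question}\n{listed}\nAnswer:"  # assumed template
    scores = [score(prompt, " " + label) for label in labels]
    return max(range(len(options)), key=scores.__getitem__)
```

Both functions return the index of the predicted option, so accuracy under either setup can be computed from the same question set and the same `score` function. Only the second setup lets the model weigh the choices against each other, which is the property the summary identifies as the main driver of the apparent difficulty gap.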
