In Case You Missed It: ARC 'Challenge' Is Not That Challenging

Abstract

ARC Challenge appears more difficult than ARC Easy for modern LLMs primarily because of an evaluation setup that prevents direct comparison of answer choices, not because of inherent complexity. Although some researchers have quietly shifted to a more appropriate scheme over the last year, the implications of this change have yet to be widely acknowledged. We highlight this overlooked shift, show how similar evaluation practices falsely imply reasoning deficits in other benchmarks, and demonstrate that fairer methods dramatically reduce performance gaps (e.g., on SIQA) and even yield superhuman results (OpenBookQA). In doing so, we reveal how evaluation shapes perceived difficulty and offer guidelines to ensure that multiple-choice evaluations accurately reflect actual model capabilities.
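To make the contrast concrete, here is a minimal sketch of the two ways a multiple-choice question might be presented to a model. It assumes that the setup which "prevents direct comparison" scores each candidate answer in isolation, while the fairer scheme lists all options in a single prompt. The function names, the prompt templates, and the `score` callback are hypothetical illustrations, not taken from the paper or from any specific evaluation library.

```python
from typing import Callable, List

LABELS = "ABCDEFGH"  # hypothetical option labels; supports up to 8 choices


def build_separate_prompts(question: str, choices: List[str]) -> List[str]:
    """One prompt per choice: the model never sees the competing options."""
    return [f"Question: {question}\nAnswer: {choice}" for choice in choices]


def build_joint_prompt(question: str, choices: List[str]) -> str:
    """A single prompt listing every option, so choices can be compared directly."""
    lines = [f"Question: {question}"]
    lines += [f"{LABELS[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def pick_separately(
    question: str, choices: List[str], score: Callable[[str], float]
) -> int:
    """Return the index of the choice whose isolated prompt scores highest.

    `score` stands in for whatever a real harness would use
    (for example, a model log-likelihood); it is an assumption here.
    """
    scores = [score(p) for p in build_separate_prompts(question, choices)]
    return max(range(len(choices)), key=scores.__getitem__)
```

In the isolated setting, a question such as "Which of these is the largest?" cannot be answered reliably, because no single prompt contains the alternatives. The joint prompt removes that handicap, which is the kind of difference in setup the abstract attributes the apparent difficulty to.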

