In Case You Missed It: ARC 'Challenge' Is Not That Challenging

Abstract

ARC Challenge appears more difficult than ARC Easy for modern LLMs primarily because of an evaluation setup that prevents direct comparison of answer choices, not because of inherent complexity. Although some researchers have quietly shifted to a more appropriate scheme over the last year, the implications of this change have yet to be widely acknowledged. We highlight this overlooked shift, show how similar evaluation practices falsely imply reasoning deficits in other benchmarks, and demonstrate that fairer methods dramatically reduce performance gaps (e.g., on SIQA) and even yield superhuman results (OpenBookQA). In doing so, we reveal how evaluation shapes perceived difficulty and offer guidelines to ensure that multiple-choice evaluations accurately reflect actual model capabilities.
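The contrast the abstract draws can be made concrete with a small sketch. It assumes one setup shows the model each candidate answer in isolation and scores it separately, while the other lists every option in a single prompt so they can be compared directly. The function names, prompt templates, and sample question below are hypothetical illustrations, not taken from the paper or from any evaluation harness.

```python
from string import ascii_uppercase


def build_separate_prompts(question: str, choices: list[str]) -> list[str]:
    """One prompt per choice: each prompt contains the question and a single
    candidate answer, so the competing options are never visible together."""
    return [f"Question: {question}\nAnswer: {choice}" for choice in choices]


def build_joint_prompt(question: str, choices: list[str]) -> tuple[str, list[str]]:
    """A single prompt listing every option under a letter label, so the
    options can be compared directly. Returns the prompt and the labels."""
    labels = list(ascii_uppercase[: len(choices)])
    lines = [f"Question: {question}"]
    lines += [f"{label}. {choice}" for label, choice in zip(labels, choices)]
    lines.append("Answer:")
    return "\n".join(lines), labels


def argmax(scores: list[float]) -> int:
    """Index of the highest score (e.g. per-choice or per-label likelihoods)."""
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Invented example: the question is only answerable by comparing options.
    question = "Which object has the greatest mass?"
    choices = ["a feather", "a paperclip", "a bowling ball", "a coin"]

    for prompt in build_separate_prompts(question, choices):
        print(prompt, end="\n\n")

    joint_prompt, labels = build_joint_prompt(question, choices)
    print(joint_prompt)
```

In the first format a model would be asked to score "a coin" without knowing that "a bowling ball" is also on offer; in the second, it would score the labels A to D with all four options in view. How those scores are obtained (and normalized) is left out here because the abstract does not specify it.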
