Not All LLM Reasoners Are Created Equal

Abstract

We study the depth of the grade-school math (GSM) problem-solving capabilities of LLMs. To this end, we evaluate their performance on pairs of existing math word problems chained together so that the answer to the second problem depends on correctly answering the first. Our findings reveal a significant reasoning gap in most LLMs, that is, a difference in performance between solving the compositional pairs and solving each question independently. This gap is more pronounced in smaller, more cost-efficient, and math-specialized models. Moreover, instruction-tuning recipes and code generation have varying effects across LLM sizes, while finetuning on GSM can lead to task overfitting. Our analysis indicates that large reasoning gaps are caused not by test-set leakage but by distraction from the additional context and poor second-hop reasoning. Overall, LLMs exhibit systematic differences in their reasoning abilities, despite what their performance on standard benchmarks indicates.
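To make the notion of a reasoning gap concrete, the following is a minimal sketch of how such a gap could be computed from per-pair evaluation results. It rests on an assumption not spelled out in the abstract: the independent-solving baseline is taken to be the product of the two standalone accuracies, and the gap is the compositional accuracy minus that baseline. The names `PairResult`, `accuracy`, and `reasoning_gap` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class PairResult:
    """Outcome for one chained problem pair (hypothetical record layout)."""
    q1_correct: bool        # first problem answered correctly on its own
    q2_correct: bool        # second problem answered correctly on its own
    composed_correct: bool  # final answer correct when Q2 depends on Q1's answer


def accuracy(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def reasoning_gap(results: List[PairResult]) -> float:
    """Compositional accuracy minus the assumed independent baseline.

    Assumption: the baseline is acc(Q1) * acc(Q2), i.e. what one would
    expect if the two problems were solved independently of each other.
    A negative value means the model does worse on chained pairs than
    its standalone accuracies suggest.
    """
    s1 = accuracy(r.q1_correct for r in results)
    s2 = accuracy(r.q2_correct for r in results)
    s_comp = accuracy(r.composed_correct for r in results)
    return s_comp - s1 * s2


# Toy data, invented purely for illustration (not results from the paper).
toy_results = [
    PairResult(True, True, True),
    PairResult(True, True, False),
    PairResult(True, False, False),
    PairResult(False, True, False),
]
gap = reasoning_gap(toy_results)
```

With the toy data above, both standalone accuracies are 0.75 and the compositional accuracy is 0.25, so the gap under this assumed definition is 0.25 - 0.5625 = -0.3125.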
