To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

Abstract

Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for what kinds of tasks is this extra "thinking" really helpful? To analyze this, we conducted a quantitative meta-analysis covering over 100 papers using CoT and ran our own evaluations of 20 datasets across 14 models. Our results show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks. On MMLU, directly generating the answer without CoT yields almost the same accuracy as CoT unless the question or the model's response contains an equals sign, which indicates symbolic operations and reasoning. Following this finding, we analyze the behavior of CoT on these problems by separating planning and execution and comparing against tool-augmented LLMs. Much of CoT's gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver. Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.
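As a rough illustration of what applying CoT selectively could look like, the sketch below routes a question to a CoT prompt only when it contains an equals sign, echoing the MMLU observation above. This is not the authors' method: the function names `should_use_cot` and `build_prompt`, the prompt wording, and the choice to inspect only the question (and not the model's response) are all assumptions made for illustration.

```python
def should_use_cot(question: str) -> bool:
    """Hypothetical heuristic: treat an equals sign as a signal of symbolic reasoning."""
    return "=" in question


def build_prompt(question: str) -> str:
    """Build a CoT prompt for symbolic-looking questions, a direct-answer prompt otherwise.

    The prompt suffixes are illustrative assumptions, not taken from the paper.
    """
    if should_use_cot(question):
        return f"{question}\nLet's think step by step."
    return f"{question}\nAnswer directly with the final answer only."


if __name__ == "__main__":
    for q in ("Solve 2x + 3 = 7 for x.", "Which river flows through Paris?"):
        print(build_prompt(q))
        print("---")
```

A real system would likely need a richer signal than a single character, but the sketch shows where the inference savings would come from: questions that skip the CoT branch produce much shorter generations.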
