To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
September 18, 2024
Authors: Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, Greg Durrett
cs.AI
Abstract
Chain-of-thought (CoT) via prompting is the de facto method for eliciting
reasoning capabilities from large language models (LLMs). But for what kinds of
tasks is this extra "thinking" really helpful? To analyze this, we conducted
a quantitative meta-analysis covering over 100 papers using CoT and ran our own
evaluations of 20 datasets across 14 models. Our results show that CoT gives
strong performance benefits primarily on tasks involving math or logic, with
much smaller gains on other types of tasks. On MMLU, directly generating the
answer without CoT leads to almost identical accuracy as CoT unless the
question or model's response contains an equals sign, indicating symbolic
operations and reasoning. Following this finding, we analyze the behavior of
CoT on these problems by separating planning and execution and comparing
against tool-augmented LLMs. Much of CoT's gain comes from improving symbolic
execution, but it underperforms relative to using a symbolic solver. Our
results indicate that CoT can be applied selectively, maintaining performance
while saving inference costs. Furthermore, they suggest a need to move beyond
prompt-based CoT to new paradigms that better leverage intermediate computation
across the whole range of LLM applications.
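The abstract's suggestion that CoT "can be applied selectively" can be illustrated with a minimal routing sketch. This is an assumption-laden toy, not the authors' implementation: it approximates "symbolic reasoning needed" by the paper's equals-sign signal, and the prompt templates are invented for illustration.

```python
# Toy sketch of selective CoT routing: invoke chain-of-thought prompting
# only when the question looks symbolic (here, contains an equals sign,
# per the paper's MMLU finding), otherwise answer directly to save
# inference cost. Templates and the heuristic are illustrative only.

DIRECT_PROMPT = "Answer the question directly.\n\nQ: {question}\nA:"
COT_PROMPT = (
    "Answer the question, reasoning step by step.\n\n"
    "Q: {question}\nA: Let's think step by step."
)

def build_prompt(question: str) -> str:
    """Choose a CoT prompt only for questions that suggest symbolic work."""
    needs_cot = "=" in question  # crude proxy for symbolic operations
    template = COT_PROMPT if needs_cot else DIRECT_PROMPT
    return template.format(question=question)

# A math-style question is routed to CoT; a factual one is answered directly.
print(build_prompt("If x = 3 + 4, what is x?"))
print(build_prompt("Who wrote Hamlet?"))
```

A real router would use a stronger signal than a substring check (e.g. a lightweight classifier over the question), but the cost argument is the same: skip the extra generated reasoning tokens when they are unlikely to help.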