Atom of Thoughts for Markov LLM Test-Time Scaling
February 17, 2025
Authors: Fengwei Teng, Zhaoyang Yu, Quan Shi, Jiayi Zhang, Chenglin Wu, Yuyu Luo
cs.AI
Abstract
Large Language Models (LLMs) achieve superior performance through
training-time scaling, and test-time scaling further enhances their
capabilities by conducting effective reasoning during inference. However, as
the scale of reasoning increases, existing test-time scaling methods suffer
from accumulated historical information, which not only wastes computational
resources but also interferes with effective reasoning. To address this issue,
we observe that progress on complex reasoning is often achieved by solving a
sequence of independent subquestions, each being self-contained and verifiable.
These subquestions are essentially atomic questions, relying primarily on their
current state rather than accumulated history, similar to the memoryless
transitions in a Markov process. Based on this observation, we propose Atom of
Thoughts (AoT), where each state transition in the reasoning process consists
of decomposing the current question into a dependency-based directed acyclic
graph and contracting its subquestions, forming a new atomic question state.
This iterative decomposition-contraction process continues until reaching
directly solvable atomic questions, naturally realizing Markov transitions
between question states. Furthermore, these atomic questions can be seamlessly
integrated into existing test-time scaling methods, enabling AoT to serve as a
plug-in enhancement for improving reasoning capabilities. Experiments across
six benchmarks demonstrate the effectiveness of AoT both as a standalone
framework and as a plug-in enhancement. Notably, on HotpotQA, when applied to
gpt-4o-mini, AoT achieves an 80.6% F1 score, surpassing o3-mini by 3.4% and
DeepSeek-R1 by 10.6%. The code will be available at
https://github.com/qixucen/atom.
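The iterative decomposition-contraction loop described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: real AoT would use an LLM to decompose a question into a dependency-based DAG of subquestions and to contract their answers, whereas here a toy summation task stands in, so `decompose` and `contract` are deterministic stand-ins. What the sketch does preserve is the Markov structure: each new state is built only from the current state's subresults, with no accumulated history.

```python
def is_atomic(state):
    """A question state is atomic when it can be answered directly.
    In the toy task, a single remaining number is the direct answer."""
    return len(state) == 1

def decompose(state):
    """Split the current question into independent, self-contained
    subquestions. Here: pair up adjacent numbers; each pair can be
    solved without reference to any other pair."""
    return [state[i:i + 2] for i in range(0, len(state), 2)]

def contract(subquestions):
    """Solve the subquestions and fold the answers into a new, smaller
    question state. The new state depends only on these results,
    mimicking a memoryless (Markov) transition."""
    return [sum(pair) for pair in subquestions]

def atom_of_thoughts(question):
    """Iterate decomposition and contraction until the question
    becomes atomic, then answer it directly."""
    state = question
    while not is_atomic(state):
        state = contract(decompose(state))  # one Markov transition
    return state[0]

print(atom_of_thoughts([2, 3, 4, 5]))  # → 14
```

Because each transition discards the history of how earlier subquestions were solved, the loop's memory footprint stays bounded by the current state, which is the property the abstract contrasts with history-accumulating test-time scaling methods.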