Atom of Thoughts for Markov LLM Test-Time Scaling

Abstract

Large Language Models (LLMs) achieve superior performance through training-time scaling, and test-time scaling further enhances their capabilities by conducting effective reasoning during inference. However, as the scale of reasoning increases, existing test-time scaling methods suffer from accumulated historical information, which not only wastes computational resources but also interferes with effective reasoning. To address this issue, we observe that progress in complex reasoning is often achieved by solving a sequence of independent subquestions, each of which is self-contained and verifiable. These subquestions are essentially atomic questions: they rely primarily on their current state rather than on accumulated history, similar to the memoryless transitions in a Markov process. Based on this observation, we propose Atom of Thoughts (AoT), in which each state transition in the reasoning process consists of decomposing the current question into a dependency-based directed acyclic graph and contracting its subquestions to form a new atomic question state. This iterative decomposition-contraction process continues until it reaches directly solvable atomic questions, naturally realizing Markov transitions between question states. Furthermore, these atomic questions can be seamlessly integrated into existing test-time scaling methods, enabling AoT to serve as a plug-in enhancement for improving reasoning capabilities. Experiments across six benchmarks demonstrate the effectiveness of AoT both as a standalone framework and as a plug-in enhancement. Notably, on HotpotQA, when applied to gpt-4o-mini, AoT achieves an 80.6% F1 score, surpassing o3-mini by 3.4% and DeepSeek-R1 by 10.6%. The code will be available at https://github.com/qixucen/atom.
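The decomposition-contraction loop described above can be illustrated with a small toy sketch. It is not the authors' implementation: in AoT an LLM performs both the decomposition and the contraction, whereas here the dependency graph is written by hand and each subquestion is a plain Python callable. The names `SubQuestion`, `contract`, `solve_markov`, and `example_state` are hypothetical and chosen only for illustration. The point of the sketch is the memoryless property: each step builds the next question state from the current state alone, folding the answers of independent subquestions into the remaining ones as known conditions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class SubQuestion:
    """A node of the dependency DAG (hypothetical structure, for illustration only)."""
    name: str
    deps: Tuple[str, ...]        # names of the subquestions this one depends on
    solve: Callable[..., float]  # computes the answer from the deps' answers, in order


def contract(state: Dict[str, SubQuestion]) -> Dict[str, SubQuestion]:
    """One state transition: answer the independent subquestions, then fold
    their answers into the dependent ones to form a smaller question state."""
    independent = {n: q for n, q in state.items() if not q.deps}
    if not independent:
        raise ValueError("no independent subquestion: the graph is not a valid DAG")
    answers = {n: q.solve() for n, q in independent.items()}

    new_state: Dict[str, SubQuestion] = {}
    for name, q in state.items():
        if name in independent:
            continue
        remaining = tuple(d for d in q.deps if d not in answers)

        def folded(*args: float, q=q, remaining=remaining) -> float:
            # Already-known answers become fixed conditions of the new question.
            later = dict(zip(remaining, args))
            return q.solve(*(answers[d] if d in answers else later[d] for d in q.deps))

        new_state[name] = SubQuestion(name, remaining, folded)
    return new_state


def solve_markov(state: Dict[str, SubQuestion], target: str) -> Tuple[float, List[int]]:
    """Contract repeatedly until the target question is directly solvable.
    Returns the answer and the size of the question state at each step."""
    trace = [len(state)]
    while state[target].deps:
        state = contract(state)  # depends only on the current state
        trace.append(len(state))
    return state[target].solve(), trace


def example_state() -> Dict[str, SubQuestion]:
    """Toy question: total = price * quantity + shipping (made-up numbers)."""
    nodes = [
        SubQuestion("price", (), lambda: 3.0),
        SubQuestion("quantity", (), lambda: 4.0),
        SubQuestion("shipping", (), lambda: 5.0),
        SubQuestion("subtotal", ("price", "quantity"), lambda p, n: p * n),
        SubQuestion("total", ("subtotal", "shipping"), lambda s, sh: s + sh),
    ]
    return {q.name: q for q in nodes}
```

Calling `solve_markov(example_state(), "total")` contracts the five-node graph to two nodes and then to one before answering. At no point does the loop consult earlier states, which is the property the abstract compares to a Markov process.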
