Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding

Abstract

Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks that require multiple steps. While prompt-based methods such as Chain-of-Thought (CoT) can improve LLM reasoning at inference time, optimizing reasoning capabilities during training remains challenging. We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution and optimizes it via variational approaches. LaTRO enables LLMs to concurrently improve both their reasoning process and their ability to evaluate reasoning quality, without requiring external feedback or reward models. We validate LaTRO through experiments on the GSM8K and ARC-Challenge datasets using multiple model architectures. On GSM8K, LaTRO improves zero-shot accuracy by an average of 12.5% over base models and 9.6% over supervised fine-tuning across Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B. Our findings suggest that pre-trained LLMs possess latent reasoning capabilities that can be unlocked and enhanced through our proposed optimization approach in a self-improvement manner. The LaTRO code is available at https://github.com/SalesforceAIResearch/LaTRO.
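
The framing of reasoning as a latent variable can be illustrated with a toy numeric sketch. The snippet below is not LaTRO's implementation and does not reproduce its training objective. It assumes a hypothetical discrete set of three rationales z with made-up probabilities, and checks the standard variational lower bound log p(y|x) >= E_q[log p(y|x,z)] - KL(q || p(z|x)). All names (`prior`, `likelihood`, `elbo`, `posterior`) and all numbers are illustrative assumptions.

```python
import math

def log_marginal(prior, likelihood):
    """log p(y|x) = log sum_z p(z|x) * p(y|x,z) over a discrete set of rationales z."""
    return math.log(sum(p * l for p, l in zip(prior, likelihood)))

def elbo(q, prior, likelihood):
    """E_q[log p(y|x,z)] - KL(q || prior), a lower bound on log p(y|x)."""
    expected_loglik = sum(qz * math.log(l) for qz, l in zip(q, likelihood) if qz > 0)
    kl = sum(qz * math.log(qz / p) for qz, p in zip(q, prior) if qz > 0)
    return expected_loglik - kl

def posterior(prior, likelihood):
    """Exact posterior p(z|x,y), proportional to p(z|x) * p(y|x,z)."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical numbers: three candidate rationales for a single question.
prior = [0.5, 0.3, 0.2]        # p(z|x): chance of sampling each rationale
likelihood = [0.1, 0.6, 0.9]   # p(y|x,z): chance of the correct answer given z

exact = log_marginal(prior, likelihood)
bound_at_prior = elbo(prior, prior, likelihood)
bound_at_posterior = elbo(posterior(prior, likelihood), prior, likelihood)

print(f"log p(y|x)           = {exact:.4f}")
print(f"bound with q = prior = {bound_at_prior:.4f}")
print(f"bound with q = post. = {bound_at_posterior:.4f}")
```

In this toy setting, shifting q toward rationales that make the correct answer more likely tightens the bound until it matches log p(y|x). That is one way to read the abstract's self-rewarding framing: the model's own likelihood of the answer given a rationale serves as the signal, so no external reward model is needed.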
