LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!

Abstract

Large reasoning models (LRMs) tackle complex reasoning problems by following long chains of thought (Long CoT) that incorporate reflection, backtracking, and self-validation. However, the training techniques and data requirements needed to elicit Long CoT remain poorly understood. In this work, we find that a large language model (LLM) can effectively learn Long CoT reasoning through data-efficient supervised fine-tuning (SFT) and parameter-efficient low-rank adaptation (LoRA). With just 17k Long CoT training samples, the Qwen2.5-32B-Instruct model achieves significant improvements on a wide range of math and coding benchmarks, including 56.7% (+40.0%) on AIME 2024 and 57.0% (+8.1%) on LiveCodeBench, competitive with the proprietary o1-preview model's scores of 44.6% and 59.1%. More importantly, we find that the structure of Long CoT is critical to the learning process, whereas the content of individual reasoning steps has minimal impact. Perturbations affecting content, such as training on incorrect samples or removing reasoning keywords, have little impact on performance. In contrast, structural modifications that disrupt logical consistency in the Long CoT, such as shuffling or deleting reasoning steps, significantly degrade accuracy. For example, a model trained on Long CoT samples with incorrect answers achieves only 3.2% lower accuracy than one trained on fully correct samples. These insights deepen our understanding of how to elicit reasoning capabilities in LLMs and highlight key considerations for efficiently training the next generation of reasoning models. This is the academic paper for our previously released Sky-T1-32B-Preview model. Code is available at https://github.com/NovaSky-AI/SkyThought.
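The structural perturbations mentioned in the abstract (shuffling or deleting reasoning steps) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `fraction` and `seed` parameters, and the convention that reasoning steps are separated by blank lines are all assumptions made for illustration.

```python
import random


def split_steps(cot: str) -> list[str]:
    """Split a chain of thought into steps on blank lines (assumed convention)."""
    return [s.strip() for s in cot.split("\n\n") if s.strip()]


def shuffle_steps(steps: list[str], fraction: float, seed: int = 0) -> list[str]:
    """Permute a randomly chosen fraction of steps among their own positions."""
    rng = random.Random(seed)
    k = round(len(steps) * fraction)
    positions = sorted(rng.sample(range(len(steps)), k))
    chosen = [steps[i] for i in positions]
    rng.shuffle(chosen)
    out = list(steps)
    for pos, step in zip(positions, chosen):
        out[pos] = step
    return out


def delete_steps(steps: list[str], fraction: float, seed: int = 0) -> list[str]:
    """Drop a randomly chosen fraction of steps, keeping the rest in order."""
    rng = random.Random(seed)
    k = round(len(steps) * fraction)
    drop = set(rng.sample(range(len(steps)), k))
    return [s for i, s in enumerate(steps) if i not in drop]


if __name__ == "__main__":
    # Hypothetical toy trace, not taken from the paper's dataset.
    cot = "Let x = 2.\n\nThen x + 3 = 5.\n\nWait, check: 2 + 3 = 5.\n\nSo the answer is 5."
    steps = split_steps(cot)
    print("\n\n".join(shuffle_steps(steps, fraction=1.0)))
    print("\n\n".join(delete_steps(steps, fraction=0.5)))
```

Both functions leave the text of each surviving step untouched and change only the order or presence of steps, which is the sense in which these perturbations target structure rather than content.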
