DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Abstract

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, numerous powerful and intriguing reasoning behaviors emerge naturally in DeepSeek-R1-Zero. However, it suffers from poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.
