S*: Test Time Scaling for Code Generation

Abstract

Increasing test-time compute for LLMs shows promise across domains but remains underexplored in code generation, despite extensive study in math. In this paper, we propose S*, the first hybrid test-time scaling framework that substantially improves the coverage and selection accuracy of generated code. S* extends the existing parallel scaling paradigm with sequential scaling to push performance boundaries. It further leverages a novel selection mechanism that adaptively generates distinguishing inputs for pairwise comparison and combines them with execution-grounded information to robustly identify correct solutions. We evaluate 12 Large Language Models and Large Reasoning Models and show that: (1) S* consistently improves performance across model families and sizes, enabling a 3B model to outperform GPT-4o-mini; (2) S* enables non-reasoning models to surpass reasoning models: GPT-4o-mini with S* outperforms o1-preview by 3.7% on LiveCodeBench; (3) S* further boosts state-of-the-art reasoning models: DeepSeek-R1-Distill-Qwen-32B with S* achieves 85.7% on LiveCodeBench, approaching o1 (high) at 88.5%. Code will be available at https://github.com/NovaSky-AI/SkyThought.
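To make the selection idea concrete, the following is a minimal Python sketch of pairwise comparison on distinguishing inputs. It is not the S* implementation. Everything in it is an assumption for illustration: candidates are plain Python functions standing in for generated programs, `find_distinguishing_input` searches a fixed probe list rather than asking a model to propose inputs, and `judge` is a hypothetical callable that decides which of two executed outputs to trust (the abstract does not specify how that decision is made, so the demo uses a reference oracle).

```python
from itertools import combinations
from typing import Callable, Iterable, List, Optional

# A candidate solution: here just a function from int to int (assumption).
Candidate = Callable[[int], int]
# Hypothetical judge: given (input, output_a, output_b), return 0 to prefer a, 1 to prefer b.
Judge = Callable[[int, int, int], int]


def find_distinguishing_input(a: Candidate, b: Candidate,
                              probe_inputs: Iterable[int]) -> Optional[int]:
    """Return the first probe input on which the two candidates disagree, if any."""
    for x in probe_inputs:
        if a(x) != b(x):
            return x
    return None


def select_by_pairwise_comparison(candidates: List[Candidate],
                                  probe_inputs: List[int],
                                  judge: Judge) -> int:
    """Return the index of the candidate that wins the most pairwise comparisons."""
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        x = find_distinguishing_input(candidates[i], candidates[j], probe_inputs)
        if x is None:
            # No probe separates the pair: treat it as a tie and credit both.
            wins[i] += 1
            wins[j] += 1
            continue
        # Execute both candidates on the distinguishing input and let the judge decide.
        out_i, out_j = candidates[i](x), candidates[j](x)
        winner = i if judge(x, out_i, out_j) == 0 else j
        wins[winner] += 1
    return max(range(len(candidates)), key=wins.__getitem__)


# Toy task (assumption): compute the absolute value of an integer.
def identity(x: int) -> int:       # wrong for negative inputs
    return x

def absolute(x: int) -> int:       # correct
    return abs(x)

def negate(x: int) -> int:         # wrong for positive inputs
    return -x

def oracle_judge(x: int, out_a: int, out_b: int) -> int:
    """Illustrative stand-in judge that checks the first output against a reference."""
    return 0 if out_a == abs(x) else 1


candidates = [identity, absolute, negate]
probes = [3, 0, -2]
best = select_by_pairwise_comparison(candidates, probes, oracle_judge)
print(candidates[best].__name__)
```

In this toy run the correct candidate, `absolute`, wins both of its comparisons and is selected. The point of the sketch is only that comparing candidates on inputs where their executed outputs differ gives the selector concrete evidence to work with. How S* generates those inputs and decides each comparison is described in the paper itself, not here.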
