
Multi-Turn Code Generation Through Single-Step Rewards

February 27, 2025
Authors: Arnav Kumar Jain, Gonzalo Gonzalez-Pumariega, Wayne Chen, Alexander M Rush, Wenting Zhao, Sanjiban Choudhury
cs.AI

Abstract

We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, muCode, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. muCode iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over the state-of-the-art baselines. We provide analysis of the design choices of the reward models and policy, and show the efficacy of muCode at utilizing the execution feedback. Our code is available at https://github.com/portal-cornell/muCode.
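The inference loop the abstract describes (a generator proposing candidates conditioned on execution feedback, a verifier selecting the best candidate each turn) can be illustrated with a toy sketch. This is not the muCode implementation: the generator, verifier, and execution environment below are hypothetical stand-ins operating on lists of integers rather than code, meant only to show the best-of-n, single-step-scored multi-turn structure.

```python
import random

random.seed(0)

TARGET = [1, 2, 3]  # ground-truth "program" for this toy task

def generate(feedback, n=8):
    """Toy generator: propose n candidates extending the feedback prefix.

    Stand-in for the learned generator, which conditions on the problem
    and prior execution feedback.
    """
    base = feedback if feedback is not None else []
    candidates = []
    for _ in range(n):
        cand = list(base)
        if len(cand) < len(TARGET):
            cand.append(random.randint(1, 3))  # random guess for next step
        candidates.append(cand)
    return candidates

def verify(candidate):
    """Toy verifier: score = length of the matching prefix positions."""
    return sum(a == b for a, b in zip(candidate, TARGET))

def execute(candidate):
    """Toy execution: feedback is the longest correct prefix, plus pass/fail."""
    prefix = []
    for a, b in zip(candidate, TARGET):
        if a != b:
            break
        prefix.append(a)
    return prefix, candidate == TARGET

def multi_turn_inference(max_turns=20):
    """Each turn: best-of-n generation scored by the single-step verifier."""
    feedback = None
    for _ in range(max_turns):
        best = max(generate(feedback), key=verify)  # verifier picks one
        feedback, solved = execute(best)            # environment responds
        if solved:
            return best
    return None

print(multi_turn_inference())
```

Because every candidate preserves the already-correct prefix from the last round's feedback, the loop mirrors the one-step-recoverability idea: from any intermediate state, a single well-scored generation step can make progress toward the correct solution.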

