

Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning

February 10, 2025
Authors: Jean Vassoyan, Nathanaël Beau, Roman Plaud
cs.AI

Abstract

The ability to achieve long-term goals is a key challenge in the current development of large language models (LLMs). To address this, pre-trained LLMs can be fine-tuned with reinforcement learning (RL) to explore solutions that optimize a given goal. However, exploration with LLMs is difficult, as a balance has to be struck between discovering new solutions and staying close enough to the pre-trained model, so as not to degrade basic capabilities. This is typically controlled with a Kullback-Leibler (KL) penalty. In this paper, we investigate the exploration dynamics of a small language model on a simple arithmetic task. We show how varying degrees of pre-training influence exploration and demonstrate the importance of "critical tokens" which have a dramatic impact on the final outcome. Consequently, we introduce a simple modification to the KL penalty that favors exploration on critical tokens, increasing the efficiency of the RL fine-tuning stage.
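The abstract does not spell out the exact form of the modified penalty, so the sketch below is only a minimal illustration of the general idea: a per-token KL(policy || reference) penalty that is down-weighted on tokens flagged as "critical," allowing the policy to drift further from the pre-trained model at those positions. The function name, the `critical_mask` input, and the `critical_scale` factor are assumptions for illustration, not the authors' implementation; how critical tokens are identified (e.g., by entropy or by their effect on the outcome) is likewise left as a modeling choice here.

```python
import torch
import torch.nn.functional as F


def token_weighted_kl_penalty(policy_logits, ref_logits, critical_mask,
                              beta=0.1, critical_scale=0.1):
    """Per-token KL penalty between the fine-tuned policy and the frozen
    pre-trained reference model, reduced on "critical" tokens so that the
    policy can explore more freely there.

    policy_logits, ref_logits: tensors of shape (batch, seq_len, vocab)
    critical_mask: bool tensor of shape (batch, seq_len), True where a
                   token is considered critical (hypothetical criterion).
    """
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)

    # KL(policy || reference) at each position, summed over the vocabulary.
    kl_per_token = (policy_logp.exp() * (policy_logp - ref_logp)).sum(dim=-1)

    # Shrink the penalty on critical tokens; keep it unchanged elsewhere.
    weights = torch.where(critical_mask,
                          torch.full_like(kl_per_token, critical_scale),
                          torch.ones_like(kl_per_token))

    return beta * (weights * kl_per_token).mean()
```

In a standard RL fine-tuning objective this term would simply replace the usual uniform KL penalty: the reward is maximized while the weighted KL keeps the policy near the reference model everywhere except at the down-weighted critical positions.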
