On Teacher Hacking in Language Model Distillation
February 4, 2025
Authors: Daniil Tiapkin, Daniele Calandriello, Johan Ferret, Sarah Perrin, Nino Vieillard, Alexandre Ramé, Mathieu Blondel
cs.AI
Abstract
Post-training of language models (LMs) increasingly relies on the following
two stages: (i) knowledge distillation, where the LM is trained to imitate a
larger teacher LM, and (ii) reinforcement learning from human feedback (RLHF),
where the LM is aligned by optimizing a reward model. In the second RLHF stage,
a well-known challenge is reward hacking, where the LM over-optimizes the
reward model. This phenomenon is in line with Goodhart's law and can lead to
degraded performance on the true objective. In this paper, we investigate
whether a similar phenomenon, which we call teacher hacking, can occur during
knowledge distillation. This could arise because the teacher LM is itself an
imperfect approximation of the true distribution. To study this, we propose a
controlled experimental setup involving: (i) an oracle LM representing the
ground-truth distribution, (ii) a teacher LM distilled from the oracle, and
(iii) a student LM distilled from the teacher. Our experiments reveal the
following insights. When using a fixed offline dataset for distillation,
teacher hacking occurs; moreover, we can detect it by observing when the
optimization process deviates from polynomial convergence laws. In contrast,
employing online data generation techniques effectively mitigates teacher
hacking. More precisely, we identify data diversity as the key factor in
preventing hacking. Overall, our findings provide a deeper understanding of the
benefits and limitations of distillation for building robust and efficient LMs.
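
To make the detection idea concrete, here is a minimal illustrative sketch (not the authors' code; the function names, thresholds, and synthetic training curves are all hypothetical). It tracks a proxy metric (student-to-teacher distance) against a golden metric (student-to-oracle distance), and flags training steps where the proxy's decay departs from a fitted polynomial (power-law) trend, which the paper identifies as a signature of teacher hacking under offline distillation:

```python
# Hypothetical monitoring sketch, assuming per-step estimates of two distances:
#   proxy  = distance between the student and the teacher LM
#   golden = distance between the student and the oracle (ground-truth) LM
# Teacher hacking is suspected when the proxy keeps improving while the golden
# metric worsens, and when the proxy stops following a power-law decay.
import numpy as np


def power_law_fit(steps, values):
    """Fit values ~ a * steps**(-b) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(steps), np.log(values), 1)
    return np.exp(intercept), -slope  # a, b


def deviates_from_power_law(steps, proxy, rel_tol=0.2):
    """Flag steps where the measured proxy departs from the fitted power law."""
    a, b = power_law_fit(steps, proxy)
    predicted = a * steps ** (-b)
    return np.abs(proxy - predicted) / predicted > rel_tol


def teacher_hacking_suspected(proxy, golden):
    """Heuristic: proxy still improving while the golden metric has rebounded."""
    return proxy[-1] < proxy[0] and golden[-1] > golden.min()


# Toy usage with synthetic curves (illustration only, not experimental data).
steps = np.arange(1, 101, dtype=float)
proxy = steps ** (-0.5)                       # student -> teacher distance
golden = steps ** (-0.4) + 0.002 * steps      # student -> oracle distance, eventually worsens

print("teacher hacking suspected:", teacher_hacking_suspected(proxy, golden))
print("steps deviating from power law:", int(deviates_from_power_law(steps, proxy).sum()))
```

The log-log fit is just one simple way to test for polynomial convergence; any goodness-of-fit measure against a power-law model would serve the same diagnostic purpose.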