Debate Helps Weak-to-Strong Generalization
January 21, 2025
Authors: Hao Lang, Fei Huang, Yongbin Li
cs.AI
Abstract
Common methods for aligning already-capable models with desired behavior rely
on the ability of humans to provide supervision. However, future superhuman
models will surpass the capability of humans. Therefore, humans will only be
able to weakly supervise superhuman models. This expected deficiency of human
evaluation would weaken the safety of future AI systems. Scalable oversight and
weak-to-strong generalization are two complementary approaches to tackle this
issue. In this paper, we attempt to combine the strengths of these two
approaches to further improve alignment. Specifically, we investigate ways of
improving human supervision with a strong pretrained model and then supervise
the strong model with enhanced weak human supervision. To make iterative
empirical progress, we consider an analogy: can we use a strong model to
improve weak model supervision and then use it to supervise the strong model?
We empirically test it by finetuning a small weak model on ground truth labels
with the additional help from a large strong model, and then finetuning the
strong model on labels generated by the weak model. We find that debate can
assist a weak model in extracting trustworthy information from an untrustworthy
strong model, which provides leverage as context on samples when training a
weak model. We also show that an ensemble of weak models helps exploit long
arguments generated by strong model debaters and obtain a more robust
supervision estimate. Extensive experiments on the OpenAI weak-to-strong NLP
benchmarks show that the combination approach leads to better alignment, which
indicates that debate has the potential to help weak-to-strong generalization.
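The abstract describes a three-stage procedure: strong-model debaters produce arguments that are given to a small weak model as extra context during finetuning on ground-truth labels, an ensemble of such weak models then produces supervision labels on held-out samples, and the large strong model is finally finetuned on those weak labels. The Python sketch below illustrates this flow under simplifying assumptions; it is not the authors' code. The names run_debate, finetune, Model, and Sample are hypothetical placeholders, the task is reduced to binary classification with probability-valued labels, and the caller is assumed to supply the actual LLM generation and finetuning callables.

# Minimal sketch of the combined debate + weak-to-strong pipeline described in the
# abstract. All names here (run_debate, finetune, Model, Sample) are illustrative
# assumptions: the paper finetunes LLMs, whereas the callables below stand in for
# whatever generation/training stack is actually used.

from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Tuple

# A "model" is abstracted as: prompt -> probability of the positive class.
Model = Callable[[str], float]

@dataclass
class Sample:
    question: str
    label: int  # ground-truth label, available only on the weak model's training split

def run_debate(strong_generate: Callable[[str], str], question: str, rounds: int = 2) -> str:
    """Two copies of the strong model argue opposing answers; the transcript is later
    given to the weak model as context, so the weak model can extract trustworthy
    information from an untrustworthy strong model."""
    transcript: List[str] = []
    for _ in range(rounds):
        for side in ("YES", "NO"):
            prompt = f"Argue {side} for: {question}\n" + "\n".join(transcript)
            transcript.append(strong_generate(prompt))
    return "\n".join(transcript)

def finetune_weak_ensemble(train: List[Sample], strong_generate, finetune,
                           n_models: int = 3) -> List[Model]:
    """Step 1: finetune several small weak models on ground-truth labels, with the
    strong model's debate transcript appended to each sample as context. Different
    seeds give an ensemble, which the abstract reports yields a more robust
    supervision estimate over long arguments."""
    augmented: List[Tuple[str, int]] = [
        (s.question + "\n[debate]\n" + run_debate(strong_generate, s.question), s.label)
        for s in train
    ]
    return [finetune(augmented, seed=i) for i in range(n_models)]

def weak_labels(ensemble: List[Model], questions: List[str]) -> List[float]:
    """Step 2: average the weak ensemble's predictions into soft labels for held-out data."""
    return [mean(m(q) for m in ensemble) for q in questions]

def train_strong_on_weak_labels(questions: List[str], ensemble: List[Model], finetune) -> Model:
    """Step 3: finetune the large strong model on the weak-generated labels;
    weak-to-strong generalization is then measured by how much ground-truth
    accuracy the strong model recovers."""
    return finetune(list(zip(questions, weak_labels(ensemble, questions))))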