Debate Helps Weak-to-Strong Generalization

Abstract

Common methods for aligning already-capable models with desired behavior rely on the ability of humans to provide supervision. However, future superhuman models will surpass human capabilities, so humans will only be able to supervise them weakly. This expected deficiency of human evaluation would weaken the safety of future AI systems. Scalable oversight and weak-to-strong generalization are two complementary approaches to this problem. In this paper, we attempt to combine the strengths of the two approaches to further improve alignment. Specifically, we investigate ways of improving human supervision with a strong pretrained model, and then supervise the strong model with the enhanced weak human supervision. To make iterative empirical progress, we consider an analogy: can we use a strong model to improve weak model supervision, and then use that supervision to supervise the strong model? We test this empirically by finetuning a small weak model on ground truth labels with additional help from a large strong model, and then finetuning the strong model on labels generated by the weak model. We find that debate can help a weak model extract trustworthy information from an untrustworthy strong model, which provides leverage as context on samples when training the weak model. We also show that an ensemble of weak models helps exploit the long arguments generated by strong model debaters and yields a more robust supervision estimate. Extensive experiments on the OpenAI weak-to-strong NLP benchmarks show that the combined approach leads to better alignment, which indicates that debate has the potential to help weak-to-strong generalization.
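
The ensemble step described above can be illustrated with a minimal sketch. It assumes each weak model is a callable that reads a question together with a debate transcript and returns a probability for the positive label, and that the ensemble simply averages those probabilities. The abstract does not specify the aggregation rule, so the averaging, the `WeakModel` type, and both function names are hypothetical.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: a weak model maps (question, debate transcript)
# to its estimated probability that the positive label is correct.
WeakModel = Callable[[str, str], float]


def ensemble_soft_label(models: Sequence[WeakModel], question: str, transcript: str) -> float:
    """Average the weak models' probabilities into one soft label.

    Plain averaging is an assumption made for illustration only.
    """
    if not models:
        raise ValueError("need at least one weak model")
    return mean(model(question, transcript) for model in models)


def build_weak_labels(
    models: Sequence[WeakModel], samples: Sequence[Tuple[str, str]]
) -> List[Tuple[str, float]]:
    """Return (question, soft_label) pairs that a strong model could later be finetuned on."""
    return [(q, ensemble_soft_label(models, q, t)) for q, t in samples]
```

In this sketch the soft labels stand in for the "supervision estimate" mentioned in the abstract; finetuning the strong model on them is left out because it depends on the training setup.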
