Great Models Think Alike and this Undermines AI Oversight

February 6, 2025
Authors: Shashwat Goel, Joschka Struber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, Jonas Geiping
cs.AI

Abstract

As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both these tasks, which we refer to as "AI Oversight". We study how model similarity affects both aspects of AI oversight by proposing a probabilistic metric for LM similarity based on overlap in model mistakes. Using this metric, we first show that LLM-as-a-judge scores favor models similar to the judge, generalizing recent self-preference results. Then, we study training on LM annotations, and find complementary knowledge between the weak supervisor and strong student model plays a crucial role in gains from "weak-to-strong generalization". As model capabilities increase, it becomes harder to find their mistakes, and we might defer more to AI oversight. However, we observe a concerning trend -- model mistakes are becoming more similar with increasing capabilities, pointing to risks from correlated failures. Our work underscores the importance of reporting and correcting for model similarity, especially in the emerging paradigm of AI oversight.
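
For intuition, the sketch below shows one simple way to quantify "similarity based on overlap in model mistakes": a chance-adjusted, Cohen's-kappa-style agreement computed over per-question correctness for two models on a shared benchmark. This is an illustrative assumption, not necessarily the paper's exact probabilistic metric; the function name and data shapes are hypothetical.

```python
import numpy as np

def error_overlap_similarity(correct_a, correct_b):
    """Chance-adjusted agreement between two models' mistake patterns.

    correct_a, correct_b: boolean arrays, True where each model answered
    a shared benchmark question correctly.

    Returns a value near 0 when the models' errors overlap no more than
    expected from their accuracies alone, and near 1 when their mistake
    patterns are nearly identical. (Illustrative only; the paper's metric
    may differ in detail.)
    """
    correct_a = np.asarray(correct_a, dtype=bool)
    correct_b = np.asarray(correct_b, dtype=bool)

    # Observed agreement: fraction of questions where both models are
    # right or both are wrong.
    observed = np.mean(correct_a == correct_b)

    # Agreement expected by chance from the two accuracies alone,
    # assuming independent errors.
    acc_a, acc_b = correct_a.mean(), correct_b.mean()
    expected = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)

    # Kappa-style normalization.
    return (observed - expected) / (1 - expected + 1e-12)

# Hypothetical usage: two fairly accurate models whose errors partly coincide.
rng = np.random.default_rng(0)
a = rng.random(1000) < 0.85
b = np.where(rng.random(1000) < 0.7, a, rng.random(1000) < 0.85)
print(error_overlap_similarity(a, b))
```

Under this kind of measure, a judge model and a candidate model with highly correlated mistakes would score as "similar", which is the quantity the paper argues inflates LLM-as-a-judge scores and shapes weak-to-strong gains.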

