Language Models Learn to Mislead Humans via RLHF

September 19, 2024
Authors: Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R. Bowman, He He, Shi Feng
cs.AI

Abstract

Language models (LMs) can produce errors that are hard to detect for humans, especially when the task is complex. RLHF, the most popular post-training method, may exacerbate this problem: to achieve higher rewards, LMs might get better at convincing humans that they are right even when they are wrong. We study this phenomenon under a standard RLHF pipeline, calling it "U-SOPHISTRY" since it is Unintended by model developers. Specifically, we ask time-constrained (e.g., 3-10 minutes) human subjects to evaluate the correctness of model outputs and calculate humans' accuracy against gold labels. On a question-answering task (QuALITY) and programming task (APPS), RLHF makes LMs better at convincing our subjects but not at completing the task correctly. RLHF also makes the model harder to evaluate: our subjects' false positive rate increases by 24.1% on QuALITY and 18.3% on APPS. Finally, we show that probing, a state-of-the-art approach for detecting Intended Sophistry (e.g. backdoored LMs), does not generalize to U-SOPHISTRY. Our results highlight an important failure mode of RLHF and call for more research in assisting humans to align them.
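
The abstract's evaluation protocol scores human judgments against gold labels and reports the evaluators' false positive rate. Below is a minimal sketch, not code from the paper, of how those two metrics could be computed; the function name, variable names, and example data are illustrative assumptions.

```python
# Sketch (not from the paper) of the human-evaluation metrics described in the
# abstract: evaluators accept or reject each model output, and their judgments
# are scored against gold correctness labels.
from typing import List


def evaluator_metrics(human_judgments: List[bool], gold_labels: List[bool]) -> dict:
    """human_judgments[i]: evaluator accepted output i as correct.
    gold_labels[i]: output i is actually correct."""
    assert len(human_judgments) == len(gold_labels)
    n = len(gold_labels)

    # Human evaluation accuracy: fraction of judgments matching the gold label.
    accuracy = sum(h == g for h, g in zip(human_judgments, gold_labels)) / n

    # False positive rate: among outputs that are actually wrong, the fraction
    # the evaluator was convinced were right -- the quantity reported to rise
    # after RLHF (by 24.1% on QuALITY and 18.3% on APPS).
    judgments_on_wrong = [h for h, g in zip(human_judgments, gold_labels) if not g]
    fpr = sum(judgments_on_wrong) / len(judgments_on_wrong) if judgments_on_wrong else 0.0

    return {"accuracy": accuracy, "false_positive_rate": fpr}


# Hypothetical usage: five model outputs, three of which are actually wrong.
print(evaluator_metrics(
    human_judgments=[True, True, False, True, False],
    gold_labels=[True, False, False, True, False],
))
```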
