Teaching Models to Balance Resisting and Accepting Persuasion
October 18, 2024
Authors: Elias Stengel-Eskin, Peter Hase, Mohit Bansal
cs.AI
Abstract
Large language models (LLMs) are susceptible to persuasion, which can pose
risks when models are faced with an adversarial interlocutor. We take a first
step towards defending models against persuasion while also arguing that
defense against adversarial (i.e. negative) persuasion is only half of the
equation: models should also be able to accept beneficial (i.e. positive)
persuasion to improve their answers. We show that optimizing models for only
one side results in poor performance on the other. In order to balance positive
and negative persuasion, we introduce Persuasion-Balanced Training (or PBT),
which leverages multi-agent recursive dialogue trees to create data and trains
models via preference optimization to accept persuasion when appropriate. PBT
consistently improves resistance to misinformation and resilience to being
challenged while also resulting in the best overall performance on holistic
data containing both positive and negative persuasion. Crucially, we show that
PBT models are better teammates in multi-agent debates. We find that without
PBT, pairs of stronger and weaker models have unstable performance, with the
order in which the models present their answers determining whether the team
obtains the stronger or weaker model's performance. PBT leads to better and
more stable results and less order dependence, with the stronger model
consistently pulling the weaker one up.
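The abstract does not spell out the data pipeline, but the core idea of turning recursive dialogue trees into preference data can be illustrated. Below is a minimal sketch, assuming each tree node stores one model reply with a binary correctness label; all names (`DialogueNode`, `collect_preference_pairs`) and the toy example are hypothetical, not the paper's released implementation, and the resulting pairs would feed a standard preference-optimization objective such as DPO.

```python
# A minimal sketch of PBT-style preference-pair construction, assuming
# binary per-turn correctness labels. All names here are illustrative,
# not taken from the paper's code.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DialogueNode:
    response: str                      # the model's reply at this turn
    correct: bool                      # does this reply land on the gold answer?
    children: List["DialogueNode"] = field(default_factory=list)

def collect_preference_pairs(node: DialogueNode,
                             pairs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Recursively pair sibling replies to the same persuasive message.

    The reply that ends on the correct answer is 'chosen': this rewards
    resisting persuasion when the model was already right (negative
    persuasion) and conceding when it was wrong (positive persuasion).
    """
    good = [c for c in node.children if c.correct]
    bad = [c for c in node.children if not c.correct]
    for g in good:
        for b in bad:
            pairs.append({"chosen": g.response, "rejected": b.response})
    for child in node.children:
        collect_preference_pairs(child, pairs)
    return pairs

# Toy tree: the model answers correctly, then an adversary pushes back.
root = DialogueNode(
    "The capital of France is Paris.", correct=True,
    children=[
        DialogueNode("I've double-checked: it is Paris.", correct=True),  # resists
        DialogueNode("You're right, it must be Lyon.", correct=False),    # capitulates
    ],
)
print(collect_preference_pairs(root, []))  # one (resist, capitulate) pair
```

Because the same pairing rule fires whether the parent turn was right or wrong, a single pass over the tree yields training signal for both cases the abstract describes: rejecting misinformation and accepting beneficial corrections.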