
Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse RL

October 16, 2024
Authors: Jared Joselowitz, Arjun Jagota, Satyapriya Krishna, Sonali Parbhoo
cs.AI

Abstract

Large language models (LLMs) trained with Reinforcement Learning from Human Feedback (RLHF) have demonstrated remarkable capabilities, but their underlying reward functions and decision-making processes remain opaque. This paper introduces a novel approach to interpreting LLMs by applying inverse reinforcement learning (IRL) to recover their implicit reward functions. We conduct experiments on toxicity-aligned LLMs of varying sizes, extracting reward models that achieve up to 80.40% accuracy in predicting human preferences. Our analysis reveals key insights into the non-identifiability of reward functions, the relationship between model size and interpretability, and potential pitfalls in the RLHF process. We demonstrate that IRL-derived reward models can be used to fine-tune new LLMs, resulting in comparable or improved performance on toxicity benchmarks. This work provides a new lens for understanding and improving LLM alignment, with implications for the responsible development and deployment of these powerful systems.
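The abstract does not spell out the IRL formulation used to recover the implicit reward function, so the following is only a minimal illustrative sketch, not the paper's method: it assumes the reward model is recovered by fitting a pairwise (Bradley-Terry) preference objective over responses that the toxicity-aligned LLM implicitly prefers. The class `RewardModel`, the helper `bradley_terry_loss`, the embedding dimension, and the random placeholder embeddings are all hypothetical.

```python
# Illustrative sketch (assumption, not the paper's exact pipeline): recover an
# implicit reward model from pairwise preferences via a Bradley-Terry objective.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a (prompt, response) embedding to a scalar reward."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.scorer(response_embedding).squeeze(-1)

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Maximize the probability that the preferred (e.g., less toxic) response
    # receives a higher reward than the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

reward_model = RewardModel()
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-4)

for _ in range(100):
    # Placeholder embeddings standing in for encoded (prompt, response) pairs
    # drawn from the aligned model's behaviour.
    chosen_emb = torch.randn(32, 768)
    rejected_emb = torch.randn(32, 768)

    loss = bradley_terry_loss(reward_model(chosen_emb), reward_model(rejected_emb))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In a full pipeline, the placeholder embeddings would be replaced by encodings of (prompt, response) pairs generated by the toxicity-aligned model, and the fitted reward model could then serve as the training signal for fine-tuning a new LLM, as the abstract describes.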
