
Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse RL

October 16, 2024
Authors: Jared Joselowitz, Arjun Jagota, Satyapriya Krishna, Sonali Parbhoo
cs.AI

Abstract

Large language models (LLMs) trained with Reinforcement Learning from Human Feedback (RLHF) have demonstrated remarkable capabilities, but their underlying reward functions and decision-making processes remain opaque. This paper introduces a novel approach to interpreting LLMs by applying inverse reinforcement learning (IRL) to recover their implicit reward functions. We conduct experiments on toxicity-aligned LLMs of varying sizes, extracting reward models that achieve up to 80.40% accuracy in predicting human preferences. Our analysis reveals key insights into the non-identifiability of reward functions, the relationship between model size and interpretability, and potential pitfalls in the RLHF process. We demonstrate that IRL-derived reward models can be used to fine-tune new LLMs, resulting in comparable or improved performance on toxicity benchmarks. This work provides a new lens for understanding and improving LLM alignment, with implications for the responsible development and deployment of these powerful systems.
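The abstract does not spell out the IRL formulation used to recover the implicit reward function, so the following is only a minimal illustrative sketch, not the paper's method: it assumes the reward model is recovered by fitting a pairwise (Bradley-Terry) preference objective over responses that the toxicity-aligned LLM implicitly prefers. The class `RewardModel`, the helper `bradley_terry_loss`, the embedding dimension, and the random placeholder embeddings are all hypothetical.

```python
# Illustrative sketch (assumption, not the paper's exact pipeline): recover an
# implicit reward model from pairwise preferences via a Bradley-Terry objective.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a (prompt, response) embedding to a scalar reward."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.scorer(response_embedding).squeeze(-1)

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Maximize the probability that the preferred (e.g., less toxic) response
    # receives a higher reward than the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

reward_model = RewardModel()
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-4)

for _ in range(100):
    # Placeholder embeddings standing in for encoded (prompt, response) pairs
    # drawn from the aligned model's behaviour.
    chosen_emb = torch.randn(32, 768)
    rejected_emb = torch.randn(32, 768)

    loss = bradley_terry_loss(reward_model(chosen_emb), reward_model(rejected_emb))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In a full pipeline, the placeholder embeddings would be replaced by encodings of (prompt, response) pairs generated by the toxicity-aligned model, and the fitted reward model could then serve as the training signal for fine-tuning a new LLM, as the abstract describes.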
