
Style over Substance: Failure Modes of LLM Judges in Alignment Benchmarking

September 23, 2024
Authors: Benjamin Feuer, Micah Goldblum, Teresa Datta, Sanjana Nambiar, Raz Besaleli, Samuel Dooley, Max Cembalest, John P. Dickerson
cs.AI

Abstract

The release of ChatGPT in November 2022 sparked an explosion of interest in post-training and an avalanche of new preference optimization (PO) methods. These methods claim superior alignment by virtue of better correspondence with human pairwise preferences, often measured by LLM judges. In this work, we attempt to answer the following question -- do LLM-judge preferences translate to progress on other, more concrete metrics for alignment, and if not, why not? We define a concrete metric for alignment, and introduce SOS-Bench, the largest standardized, reproducible LLM meta-benchmark to date. We find that (1) LLM-judgments do not correlate with concrete measures of safety, world knowledge, and instruction following; (2) LLM judges have powerful implicit biases, prioritizing style over factuality and safety; and (3) the supervised fine-tuning (SFT) stage of post-training, and not the PO stage, has the greatest impact on alignment, with data scaling and prompt diversity as the driving factors. Our codebase and complete results can be found at https://github.com/penfever/sos-bench.
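To make finding (1) concrete, the following is a minimal sketch (not taken from the paper's codebase) of how one might test whether LLM-judge win rates track a specific alignment metric such as a safety benchmark score, using Spearman rank correlation. The model names and scores are hypothetical placeholders, not results from SOS-Bench.

```python
# Sketch: rank-correlating hypothetical LLM-judge win rates against a
# hypothetical concrete safety score for the same set of models.
from scipy.stats import spearmanr

# Hypothetical per-model scores (placeholders, not real measurements)
judge_win_rate = {"model_a": 0.72, "model_b": 0.55, "model_c": 0.40, "model_d": 0.63}
safety_score = {"model_a": 0.48, "model_b": 0.61, "model_c": 0.66, "model_d": 0.52}

models = sorted(judge_win_rate)
rho, p_value = spearmanr(
    [judge_win_rate[m] for m in models],
    [safety_score[m] for m in models],
)
# A rho near zero (or negative) would indicate judge preferences do not
# translate into progress on the concrete metric.
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```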
