Style over Substance: Failure Modes of LLM Judges in Alignment Benchmarking

Abstract

The release of ChatGPT in November 2022 sparked an explosion of interest in post-training and an avalanche of new preference optimization (PO) methods. These methods claim superior alignment by virtue of better correspondence with human pairwise preferences, often measured by LLM judges. In this work, we attempt to answer the following question: do LLM-judge preferences translate to progress on other, more concrete metrics for alignment, and if not, why not? We define a concrete metric for alignment and introduce SOS-Bench, the largest standardized, reproducible LLM meta-benchmark to date. We find that (1) LLM judgments do not correlate with concrete measures of safety, world knowledge, and instruction following; (2) LLM judges have powerful implicit biases, prioritizing style over factuality and safety; and (3) the supervised fine-tuning (SFT) stage of post-training, and not the PO stage, has the greatest impact on alignment, with data scaling and prompt diversity as the driving factors. Our codebase and complete results can be found at https://github.com/penfever/sos-bench.
