RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style

Abstract

Reward models are critical in techniques such as Reinforcement Learning from Human Feedback (RLHF) and Inference Scaling Laws, where they guide language model alignment and select optimal responses. Despite their importance, existing reward model benchmarks often evaluate models by asking them to distinguish between responses generated by models of varying power. This approach fails to assess reward models on subtle but critical content changes and on variations in style, and as a result it correlates poorly with policy model performance. To this end, we introduce RM-Bench, a novel benchmark designed to evaluate reward models based on their sensitivity to subtle content differences and their resistance to style biases. Extensive experiments demonstrate that RM-Bench correlates strongly with policy model performance, making it a reliable reference for selecting reward models to align language models effectively. We evaluate nearly 40 reward models on RM-Bench. Our results reveal that, when faced with style bias interference, even state-of-the-art models achieve an average performance of only 46.6%, which falls short of random-level accuracy (50%). These findings highlight the significant room for improvement in current reward models. Related code and data are available at https://github.com/THU-KEG/RM-Bench.
