Challenges in Trustworthy Human Evaluation of Chatbots

December 5, 2024
Authors: Wenting Zhao, Alexander M. Rush, Tanya Goyal
cs.AI

Abstract

Open community-driven platforms like Chatbot Arena that collect user preference data from site visitors have gained a reputation as one of the most trustworthy publicly available benchmarks for LLM performance. While now standard, it is tricky to implement effective guardrails to collect high-quality annotations from humans. In this paper, we demonstrate that three sources of bad annotations, both malicious and otherwise, can corrupt the reliability of open leaderboard rankings. In particular, we show that only 10% of poor quality votes by apathetic (site visitors not appropriately incentivized to give correct votes) or adversarial (bad actors seeking to inflate the ranking of a target model) annotators can change the rankings of models by up to 5 places on the leaderboard. Finally, we discuss open challenges in ensuring high-quality human annotations.
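
The abstract does not spell out how votes are aggregated, but Chatbot Arena-style leaderboards compute Elo/Bradley-Terry-style ratings from pairwise preference votes. The sketch below is a toy illustration of the adversarial scenario only, not the paper's experimental setup: it simulates honest pairwise votes over a hypothetical pool of models, injects roughly 10% extra votes that always favor one target model, and compares the target's leaderboard position before and after. All model names, strengths, vote counts, and Elo parameters are invented for illustration.

```python
import random
from collections import defaultdict

random.seed(0)

# Hypothetical pool of models with made-up "true" strengths (not from the paper).
models = [f"model_{i}" for i in range(10)]
strength = {m: 0.3 * i for i, m in enumerate(models)}

def simulate_honest_votes(n_votes):
    """Honest pairwise votes: the stronger model wins with an Elo-style logistic probability."""
    votes = []
    for _ in range(n_votes):
        a, b = random.sample(models, 2)
        p_a_wins = 1.0 / (1.0 + 10.0 ** (strength[b] - strength[a]))
        winner, loser = (a, b) if random.random() < p_a_wins else (b, a)
        votes.append((winner, loser))
    return votes

def elo_ratings(votes, k=4.0, base=1000.0):
    """Sequential Elo updates over a stream of (winner, loser) votes."""
    rating = defaultdict(lambda: base)
    for winner, loser in votes:
        expected = 1.0 / (1.0 + 10.0 ** ((rating[loser] - rating[winner]) / 400.0))
        rating[winner] += k * (1.0 - expected)
        rating[loser] -= k * (1.0 - expected)
    return rating

def rank_of(model, rating):
    """1-based leaderboard position of `model` (higher rating = better)."""
    leaderboard = sorted(rating, key=rating.get, reverse=True)
    return leaderboard.index(model) + 1

honest = simulate_honest_votes(10_000)
target = "model_2"  # a weak model an adversary wants to promote

# Adversarial injection: ~10% additional votes that always declare the target the winner.
others = [m for m in models if m != target]
adversarial = [(target, random.choice(others)) for _ in range(len(honest) // 10)]
poisoned = honest + adversarial
random.shuffle(poisoned)

print("target rank, honest votes only:   ", rank_of(target, elo_ratings(honest)))
print("target rank, with adversarial 10%:", rank_of(target, elo_ratings(poisoned)))
```

Because the injected votes always name the target as the winner, its rating climbs and it typically gains several leaderboard positions in this toy run, which is the qualitative effect the paper quantifies; real leaderboards add sampling controls, deduplication, and more robust aggregation that this sketch omits.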
