
TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles

October 7, 2024
作者: Qingchen Yu, Shichao Song, Ke Fang, Yunfeng Shi, Zifan Zheng, Hanyu Wang, Simin Niu, Zhiyu Li
cs.AI

Abstract

As the application of Large Language Models (LLMs) expands, the demand for reliable evaluations increases. Existing LLM evaluation benchmarks primarily rely on static datasets, making it challenging to assess model performance in dynamic interactions with users. Moreover, these benchmarks often depend on specific background knowledge, complicating the measurement of a model's logical reasoning capabilities. Other dynamic evaluation methods based on strong models or manual effort may introduce biases and incur high costs and time demands, hindering large-scale application. To address these issues, we propose TurtleBench. TurtleBench collects real user guesses from an online Turtle Soup Puzzle platform that we developed. This approach allows for the relatively dynamic generation of evaluation datasets, mitigating the risk of model cheating while aligning assessments more closely with genuine user needs for reasoning capabilities, thus enhancing the reliability of evaluations. TurtleBench includes 1,532 user guesses, each annotated for correctness. Using this dataset, we thoroughly evaluated nine of the most advanced LLMs available today. Notably, the OpenAI o1 series models did not achieve leading results in these evaluations. We propose several hypotheses for further research, such as "the latent reasoning of o1 utilizes trivial Chain-of-Thought (CoT) techniques" and "increasing CoT length not only provides reasoning benefits but also incurs noise costs."
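The evaluation described above can be sketched as a simple scoring loop: each benchmark record pairs a user's yes/no guess about a puzzle with an annotated correctness label, and a model is scored on how often its own yes/no judgment matches the annotation. The snippet below is a minimal, hypothetical illustration of that setup; the field names ("story", "guess", "label") and the toy judge are assumptions for illustration, not the paper's actual schema or prompting method.

```python
def score_judgments(records, judge):
    """Return the fraction of records where the model's yes/no judgment
    matches the annotated correctness label."""
    correct = 0
    for record in records:
        verdict = judge(record["story"], record["guess"])  # "yes" or "no"
        if verdict == record["label"]:
            correct += 1
    return correct / len(records)

# Toy stand-in for an LLM judge (purely illustrative): answers "yes"
# iff any word of the guess appears verbatim in the puzzle story.
def toy_judge(story, guess):
    story_words = story.split()
    return "yes" if any(word in story_words for word in guess.split()) else "no"

# Two made-up records in the assumed schema.
records = [
    {"story": "the sailor ate seagull soup", "guess": "the soup contained seagull", "label": "yes"},
    {"story": "the sailor ate seagull soup", "guess": "he was poisoned", "label": "no"},
]

accuracy = score_judgments(records, toy_judge)
print(accuracy)  # → 1.0
```

In the paper's setting the judge would be an LLM prompted with the puzzle and the guess, and accuracy over all 1,532 annotated guesses would be the model's benchmark score.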
