Law of the Weakest Link: Cross Capabilities of Large Language Models

September 30, 2024
Authors: Ming Zhong, Aston Zhang, Xuewei Wang, Rui Hou, Wenhan Xiong, Chenguang Zhu, Zhengxing Chen, Liang Tan, Chloe Bi, Mike Lewis, Sravya Popuri, Sharan Narang, Melanie Kambadur, Dhruv Mahajan, Sergey Edunov, Jiawei Han, Laurens van der Maaten
cs.AI

Abstract

The development and evaluation of Large Language Models (LLMs) have largely focused on individual capabilities. However, this overlooks the intersection of multiple abilities across different types of expertise that are often required for real-world tasks, which we term cross capabilities. To systematically explore this concept, we first define seven core individual capabilities and then pair them to form seven common cross capabilities, each supported by a manually constructed taxonomy. Building on these definitions, we introduce CrossEval, a benchmark comprising 1,400 human-annotated prompts, with 100 prompts for each individual and cross capability. To ensure reliable evaluation, we involve expert annotators to assess 4,200 model responses, gathering 8,400 human ratings with detailed explanations to serve as reference examples. Our findings reveal that, in both static evaluations and attempts to enhance specific abilities, current LLMs consistently exhibit the "Law of the Weakest Link," where cross-capability performance is significantly constrained by the weakest component. Specifically, across 58 cross-capability scores from 17 models, 38 scores are lower than all individual capabilities, while 20 fall between strong and weak, but closer to the weaker ability. These results highlight the under-performance of LLMs in cross-capability tasks, making the identification and improvement of the weakest capabilities a critical priority for future research to optimize performance in complex, multi-dimensional scenarios.
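
To make the "Law of the Weakest Link" categorization above concrete, here is a minimal sketch (not the authors' released code) of how a cross-capability score could be classified against the two underlying individual-capability scores. The function name and the example score values are hypothetical and for illustration only.

```python
# Minimal sketch: classify a cross-capability score relative to the weaker
# and stronger of its two individual-capability scores, mirroring the
# categories reported in the abstract. All values are hypothetical.

def classify_cross_capability(score_a: float, score_b: float, cross: float) -> str:
    """Return which category the cross-capability score falls into."""
    weak, strong = min(score_a, score_b), max(score_a, score_b)
    if cross < weak:
        return "below both individual capabilities"
    if cross > strong:
        return "above both individual capabilities"
    # Falls between the two individual scores: report which one it is closer to.
    return ("between the two, closer to the weaker capability"
            if cross - weak <= strong - cross
            else "between the two, closer to the stronger capability")

# Hypothetical example: individual scores 72 and 85, cross-capability score 70.
print(classify_cross_capability(72.0, 85.0, 70.0))
# -> "below both individual capabilities"
```

Applied to each of the 58 cross-capability scores from the 17 evaluated models, this kind of tally is what yields the reported split of 38 scores below both individual capabilities and 20 in between but closer to the weaker one.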
