Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance
October 24, 2024
作者: Omer Nahum, Nitay Calderon, Orgad Keller, Idan Szpektor, Roi Reichart
cs.AI
Abstract
NLP benchmarks rely on standardized datasets for training and evaluating
models and are crucial for advancing the field. Traditionally, expert
annotations ensure high-quality labels; however, the cost of expert annotation
does not scale well with the growing demand for larger datasets required by
modern models. While crowd-sourcing provides a more scalable solution, it often
comes at the expense of annotation precision and consistency. Recent
advancements in large language models (LLMs) offer new opportunities to enhance
the annotation process, particularly for detecting label errors in existing
datasets. In this work, we consider the recent approach of LLM-as-a-judge,
leveraging an ensemble of LLMs to flag potentially mislabeled examples. Through
a case study of four datasets from the TRUE benchmark, covering different tasks
and domains, we empirically analyze the labeling quality of existing datasets,
and compare expert, crowd-sourced, and our LLM-based annotations in terms of
agreement, label quality, and efficiency, demonstrating the strengths and
limitations of each annotation method. Our findings reveal a substantial number
of label errors, which, when corrected, induce a significant upward shift in
reported model performance. This suggests that many of the LLMs' so-called
mistakes are due to label errors rather than genuine model failures.
Additionally, we discuss the implications of mislabeled data and propose
methods to mitigate them in training to improve model performance.
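The abstract does not spell out the judging pipeline, but the core idea of using an ensemble of LLM judges to flag potentially mislabeled examples can be illustrated with a minimal sketch. Everything below is an assumption for illustration, not the authors' protocol: each judge is a user-supplied callable wrapping one LLM that returns a predicted label, and `min_agreement` is a hypothetical confidence threshold.

```python
# Minimal sketch of ensemble-based label-error flagging (illustrative,
# not the paper's implementation). Each "judge" wraps one LLM and maps an
# example's text to a predicted label; judges and threshold are assumptions.
from collections import Counter
from typing import Callable, Sequence

Judge = Callable[[str], str]  # example text -> predicted label


def flag_potential_label_errors(
    examples: Sequence[tuple[str, str]],  # (text, gold_label) pairs
    judges: Sequence[Judge],
    min_agreement: float = 0.8,           # fraction of judges that must agree
) -> list[int]:
    """Return indices of examples whose dataset label conflicts with a
    strong majority of the LLM judges."""
    flagged = []
    for i, (text, gold_label) in enumerate(examples):
        votes = Counter(judge(text) for judge in judges)
        majority_label, count = votes.most_common(1)[0]
        # Flag only when the ensemble is confident AND disagrees with
        # the dataset's gold label.
        if majority_label != gold_label and count / len(judges) >= min_agreement:
            flagged.append(i)
    return flagged
```

Requiring a strong majority, rather than a single dissenting judge, keeps the flag rate low when individual judges are noisy; flagged examples could then be routed to experts for re-annotation, in line with the paper's comparison of expert, crowd-sourced, and LLM-based labels.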