Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance

October 24, 2024
作者: Omer Nahum, Nitay Calderon, Orgad Keller, Idan Szpektor, Roi Reichart
cs.AI

Abstract

NLP benchmarks rely on standardized datasets for training and evaluating models and are crucial for advancing the field. Traditionally, expert annotations ensure high-quality labels; however, the cost of expert annotation does not scale well with the growing demand for larger datasets required by modern models. While crowd-sourcing provides a more scalable solution, it often comes at the expense of annotation precision and consistency. Recent advancements in large language models (LLMs) offer new opportunities to enhance the annotation process, particularly for detecting label errors in existing datasets. In this work, we consider the recent approach of LLM-as-a-judge, leveraging an ensemble of LLMs to flag potentially mislabeled examples. Through a case study of four datasets from the TRUE benchmark, covering different tasks and domains, we empirically analyze the labeling quality of existing datasets and compare expert, crowd-sourced, and our LLM-based annotations in terms of agreement, label quality, and efficiency, demonstrating the strengths and limitations of each annotation method. Our findings reveal a substantial number of label errors, which, when corrected, induce a significant upward shift in reported model performance. This suggests that many of the LLMs' so-called mistakes are due to label errors rather than genuine model failures. Additionally, we discuss the implications of mislabeled data and propose methods for mitigating its effects during training to improve model performance.
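
To make the LLM-as-a-judge ensemble concrete, here is a minimal sketch of how flagging potentially mislabeled examples could work. This is not the authors' implementation: the `Example` fields, the `Judge` type, and the unanimity threshold `min_agreement` are all assumptions for illustration.

```python
# Minimal sketch (not the paper's code) of ensemble-based label-error
# flagging: each "judge" returns a binary factual-consistency verdict,
# and examples where the ensemble consensus contradicts the gold label
# are flagged for re-annotation.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Example:
    source: str       # grounding document
    claim: str        # text judged for factual consistency
    gold_label: int   # 1 = consistent, 0 = inconsistent (dataset's label)

# A judge maps (source, claim) -> 0/1; in practice each judge would wrap
# a different LLM prompted to act as a factual-consistency evaluator.
Judge = Callable[[str, str], int]

def flag_label_errors(
    examples: List[Example],
    judges: List[Judge],
    min_agreement: float = 1.0,  # fraction of judges that must dissent
) -> List[int]:
    """Return indices of examples whose gold label the ensemble disputes."""
    flagged = []
    for i, ex in enumerate(examples):
        votes = [judge(ex.source, ex.claim) for judge in judges]
        dissent = sum(v != ex.gold_label for v in votes) / len(votes)
        if dissent >= min_agreement:
            flagged.append(i)
    return flagged
```

Flagged indices could then be routed to expert re-annotation, or simply excluded or relabeled before fine-tuning, one plausible reading of the training-time mitigation strategies the abstract alludes to.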
