Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance

Abstract

NLP benchmarks rely on standardized datasets for training and evaluating models, and these datasets are crucial for advancing the field. Traditionally, expert annotation has ensured high-quality labels; however, its cost does not scale well with the growing demand for the larger datasets that modern models require. Crowd-sourcing offers a more scalable solution, but often at the expense of annotation precision and consistency. Recent advances in large language models (LLMs) offer new opportunities to enhance the annotation process, particularly for detecting label errors in existing datasets. In this work, we consider the recent LLM-as-a-judge approach and leverage an ensemble of LLMs to flag potentially mislabeled examples. Through a case study of four datasets from the TRUE benchmark, covering different tasks and domains, we empirically analyze the labeling quality of existing datasets and compare expert, crowd-sourced, and LLM-based annotations in terms of agreement, label quality, and efficiency, demonstrating the strengths and limitations of each annotation method. Our findings reveal a substantial number of label errors which, when corrected, induce a significant upward shift in reported model performance. This suggests that many of the so-called mistakes made by LLMs are due to label errors rather than genuine model failures. Additionally, we discuss the implications of mislabeled data and propose methods to mitigate them during training to improve model performance.
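The flagging idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `flag_label_errors`, the `min_agreement` threshold, and the toy votes are all hypothetical, and the judge predictions are hard-coded rather than obtained from real LLMs. The sketch assumes that an example is flagged when the majority vote of the judge ensemble contradicts the dataset label and the judges agree with one another strongly enough.

```python
from collections import Counter


def flag_label_errors(original_labels, judge_predictions, min_agreement=1.0):
    """Return indices of examples whose dataset label is contradicted by the ensemble.

    original_labels: one dataset label per example.
    judge_predictions: one list of judge votes per example (hypothetical LLM outputs).
    min_agreement: fraction of judges that must share the majority vote
                   (assumed threshold; keep it above 0.5 so ties cannot flag).
    """
    flagged = []
    for idx, (label, votes) in enumerate(zip(original_labels, judge_predictions)):
        majority, count = Counter(votes).most_common(1)[0]
        agreement = count / len(votes)
        # Flag only when the ensemble disagrees with the label AND is confident enough.
        if majority != label and agreement >= min_agreement:
            flagged.append(idx)
    return flagged


# Toy data: five examples with binary labels and three mock judges each.
original_labels = [1, 0, 1, 1, 0]
judge_predictions = [
    [1, 1, 1],  # judges agree with the label
    [1, 1, 1],  # judges unanimously contradict the label
    [1, 0, 1],  # majority agrees with the label
    [0, 0, 1],  # majority contradicts the label, but not unanimously
    [0, 0, 0],  # judges agree with the label
]

strict = flag_label_errors(original_labels, judge_predictions)
lenient = flag_label_errors(original_labels, judge_predictions, min_agreement=0.6)
print("flagged (unanimous):", strict)
print("flagged (>= 60% agreement):", lenient)
```

A stricter threshold flags fewer examples with higher confidence, while a looser one surfaces more candidates. In either case the flagged items are only candidates for re-annotation, not confirmed errors.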
