Are AI Detectors Good Enough? A Survey on Quality of Datasets With Machine-Generated Texts
October 18, 2024
作者: German Gritsai, Anastasia Voznyuk, Andrey Grabovoy, Yury Chekhovich
cs.AI
Abstract
The rapid development of autoregressive Large Language Models (LLMs) has
significantly improved the quality of generated texts, necessitating reliable
machine-generated text detectors. A huge number of detectors and collections
with AI fragments have emerged, and several detection methods have even shown
recognition quality of up to 99.9% according to the target metrics on such
collections. However, the quality of such detectors tends to drop dramatically
in the wild, raising the question: Are detectors actually highly trustworthy,
or do their high benchmark scores come from the poor quality of evaluation
datasets? In this paper, we emphasise the need for robust and high-quality
methods for evaluating generated data, in order to guard future models against
bias and poor generalisation ability. We present a systematic review of
datasets from competitions dedicated to AI-generated content detection and
propose methods for evaluating the quality of datasets containing AI-generated
fragments. In addition, we discuss the possibility of using high-quality
generated data to achieve two goals: improving the training of detection
models and improving the training datasets themselves. Our contribution aims
to facilitate a better understanding of the dynamics between human and machine
text, which will ultimately support the integrity of information in an
increasingly automated world.
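To make the abstract's central question concrete, here is a minimal, hypothetical sketch of one way a dataset-quality probe could work; it is an illustration of the general idea, not a method taken from the paper. The helper names `surface_features` and `shallow_artifact_score` are invented for this example. The intuition: if a classifier that sees only shallow surface statistics (text length, punctuation counts) can separate human from machine texts in a benchmark, detectors' high scores on that benchmark may reflect dataset artifacts rather than genuine detection ability.

```python
# Hypothetical sketch: probe a human-vs-AI text dataset for shallow
# artifacts. A surface-feature-only classifier scoring far above chance
# suggests the two classes are separable by dataset artifacts alone.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def surface_features(texts):
    """Length and punctuation counts only -- no semantic content."""
    return np.array(
        [[len(t), t.count(" "), t.count(","), t.count("."), t.count("\n")]
         for t in texts],
        dtype=float,
    )


def shallow_artifact_score(human_texts, ai_texts):
    """Cross-validated accuracy of a surface-feature-only classifier.

    Both arguments are lists of strings. Accuracy well above 0.5 is a
    warning sign that the dataset split is separable by artifacts.
    """
    X = surface_features(list(human_texts) + list(ai_texts))
    y = np.array([0] * len(human_texts) + [1] * len(ai_texts))
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()


# Example usage (with your own samples):
# score = shallow_artifact_score(human_samples, ai_samples)
# print(f"surface-feature accuracy: {score:.3f}")
```

A probe like this is cheap to run on any candidate benchmark before trusting detector scores reported on it; a near-chance result does not prove the dataset is clean, but a high result strongly suggests the reported 99.9%-style detection quality may be inflated by dataset construction artifacts.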