LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification
November 29, 2024
Authors: Taja Kuzman, Nikola Ljubešić
cs.AI
Abstract
With the ever-increasing number of news stories available online, classifying
them by topic, regardless of the language they are written in, has become
crucial for enhancing readers' access to relevant content. To address this
challenge, we propose a teacher-student framework based on large language
models (LLMs) for developing multilingual news classification models of
reasonable size with no need for manual data annotation. The framework employs
a Generative Pretrained Transformer (GPT) model as the teacher model to develop
an IPTC Media Topic training dataset through automatic annotation of news
articles in Slovenian, Croatian, Greek, and Catalan. The teacher model exhibits
a high zero-shot performance on all four languages. Its agreement with human
annotators is comparable to that between the human annotators themselves. To
mitigate the computational limitations associated with the requirement of
processing millions of texts daily, smaller BERT-like student models are
fine-tuned on the GPT-annotated dataset. These student models achieve high
performance comparable to the teacher model. Furthermore, we explore the impact
of the training data size on the performance of the student models and
investigate their monolingual, multilingual and zero-shot cross-lingual
capabilities. The findings indicate that student models can achieve high
performance with a relatively small number of training instances, and
demonstrate strong zero-shot cross-lingual abilities. Finally, we publish the
best-performing news topic classifier, enabling multilingual classification
with the top-level categories of the IPTC Media Topic schema.
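The framework described above is a two-stage pipeline: a teacher model labels raw news texts, and a smaller student model is trained on those silver labels. A minimal, self-contained sketch of that flow is below; the keyword-based teacher and bag-of-words student are hypothetical stand-ins for the GPT teacher and BERT-like students used in the paper, and the label set is a small illustrative subset of the top-level IPTC Media Topic categories.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for the GPT teacher: in the paper, a GPT model
# zero-shot-assigns one top-level IPTC Media Topic label per article.
def teacher_annotate(text):
    keywords = {
        "sport": ["match", "league", "tournament"],
        "politics": ["election", "parliament", "minister"],
        "economy": ["market", "inflation", "trade"],
    }
    lowered = text.lower()
    for label, words in keywords.items():
        if any(w in lowered for w in words):
            return label
    return "society"  # fallback label

# Stage 1: build a silver-labelled training set from unlabelled news.
unlabelled = [
    "The parliament passed the new election law.",
    "Inflation slowed as trade volumes recovered.",
    "The local team won the league match.",
]
silver_dataset = [(t, teacher_annotate(t)) for t in unlabelled]

# Stage 2: train a student on the silver labels. A toy bag-of-words
# vote counter replaces fine-tuning a BERT-like model here.
word_label_counts = defaultdict(Counter)
for text, label in silver_dataset:
    for word in text.lower().split():
        word_label_counts[word][label] += 1

def student_predict(text):
    votes = Counter()
    for word in text.lower().split():
        votes.update(word_label_counts[word])
    return votes.most_common(1)[0][0] if votes else "society"
```

In practice, stage 2 would fine-tune a multilingual encoder on the GPT-annotated dataset; the sketch only illustrates how the student learns exclusively from teacher-produced labels, with no manually annotated data anywhere in the loop.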