LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification
November 29, 2024
Authors: Taja Kuzman, Nikola Ljubešić
cs.AI
Abstract
With the ever-increasing number of news stories available online, classifying
them by topic, regardless of the language they are written in, has become
crucial for enhancing readers' access to relevant content. To address this
challenge, we propose a teacher-student framework based on large language
models (LLMs) for developing multilingual news classification models of
reasonable size with no need for manual data annotation. The framework employs
a Generative Pretrained Transformer (GPT) model as the teacher model to develop
an IPTC Media Topic training dataset through automatic annotation of news
articles in Slovenian, Croatian, Greek, and Catalan. The teacher model exhibits
a high zero-shot performance on all four languages. Its agreement with human
annotators is comparable to that between the human annotators themselves. To
mitigate the computational constraints of processing millions of texts daily,
smaller BERT-like student models are
fine-tuned on the GPT-annotated dataset. These student models achieve high
performance comparable to the teacher model. Furthermore, we explore the impact
of the training data size on the performance of the student models and
investigate their monolingual, multilingual, and zero-shot cross-lingual
capabilities. The findings indicate that student models can achieve high
performance with a relatively small number of training instances, and
demonstrate strong zero-shot cross-lingual abilities. Finally, we publish the
best-performing news topic classifier, enabling multilingual classification
with the top-level categories of the IPTC Media Topic schema.
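A minimal sketch may help make the described pipeline concrete: a GPT teacher labels unannotated multilingual news articles zero-shot with top-level IPTC Media Topic categories, and a smaller BERT-like student is then fine-tuned on those automatic labels. This is an illustrative reconstruction, not the authors' code: the model names (`gpt-4o`, `xlm-roberta-base`), the abbreviated label list, the prompt wording, and all hyperparameters are assumptions.

```python
# Sketch of the LLM teacher-student pipeline (illustrative, not the paper's code).
from openai import OpenAI
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Top-level IPTC Media Topic labels (subset shown for brevity).
LABELS = ["politics", "economy, business and finance", "sport",
          "arts, culture, entertainment and media"]
LABEL2ID = {label: i for i, label in enumerate(LABELS)}

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def teacher_annotate(text: str) -> str:
    """Step 1: the GPT teacher assigns a topic label zero-shot."""
    prompt = (f"Classify the news article into one of these IPTC Media Topic "
              f"categories: {', '.join(LABELS)}. "
              f"Answer with the category name only.\n\nArticle: {text}")
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder teacher model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

# Step 2: build a training dataset from unlabeled multilingual articles.
# Real runs should validate and normalize the teacher's outputs before use.
articles = ["<Slovenian article>", "<Croatian article>",
            "<Greek article>", "<Catalan article>"]  # placeholders
records = [{"text": t, "label": LABEL2ID[teacher_annotate(t)]}
           for t in articles]
dataset = Dataset.from_list(records)

# Step 3: fine-tune a smaller multilingual student on the GPT-annotated data.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(LABELS))

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="student-model", num_train_epochs=3),
    train_dataset=dataset.map(tokenize, batched=True),
    tokenizer=tokenizer,
)
trainer.train()
```

At inference time the fine-tuned student replaces the teacher entirely, which is what makes daily classification of millions of texts tractable; the released classifier referenced in the abstract can likewise be loaded with the standard `transformers` text-classification pipeline.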