Unraveling the Capabilities of Language Models in News Summarization

January 30, 2025
Authors: Abdurrahman Odabaşı, Göksel Biricik
cs.AI

Abstract

Given the recent introduction of multiple language models and the continuing demand for better performance on Natural Language Processing tasks, particularly summarization, this work provides a comprehensive benchmark of 20 recent language models, focusing on smaller ones, for the news summarization task. We systematically test the capabilities and effectiveness of these models in summarizing news articles written in different styles and drawn from three distinct datasets. Specifically, we focus on zero-shot and few-shot learning settings and apply a robust evaluation methodology that combines automatic metrics, human evaluation, and LLM-as-a-judge. Interestingly, including demonstration examples in the few-shot setting did not improve the models' performance and, in some cases, even lowered the quality of the generated summaries. This issue arises mainly from the poor quality of the gold summaries used as reference summaries, which negatively affects the models' performance. Furthermore, our results highlight the exceptional performance of GPT-3.5-Turbo and GPT-4, which generally dominate due to their advanced capabilities. However, among the public models evaluated, Qwen1.5-7B, SOLAR-10.7B-Instruct-v1.0, Meta-Llama-3-8B, and Zephyr-7B-Beta demonstrated promising results. These models showed significant potential, positioning them as competitive alternatives to large models for news summarization.
