LongKey: Keyphrase Extraction for Long Documents

November 26, 2024
Authors: Jeovane Honorio Alves, Radu State, Cinthia Obladen de Almendra Freitas, Jean Paul Barddal
cs.AI

Abstract

In an era of information overload, manually annotating the vast and growing corpus of documents and scholarly papers is increasingly impractical. Automated keyphrase extraction addresses this challenge by identifying representative terms within texts. However, most existing methods focus on short documents (up to 512 tokens), leaving a gap in processing long-context documents. In this paper, we introduce LongKey, a novel framework for extracting keyphrases from lengthy documents, which uses an encoder-based language model to capture extended text intricacies. LongKey uses a max-pooling embedder to enhance keyphrase candidate representation. Validated on the comprehensive LDKP datasets and six diverse, unseen datasets, LongKey consistently outperforms existing unsupervised and language model-based keyphrase extraction methods. Our findings demonstrate LongKey's versatility and superior performance, marking an advancement in keyphrase extraction for varied text lengths and domains.
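
The abstract does not describe how the max-pooling embedder works internally. One plausible reading, given here only as an illustrative sketch and not as the authors' implementation, is that every occurrence of a candidate keyphrase in the document gets its own embedding, and those embeddings are merged by an element-wise maximum into a single candidate representation. The function name `max_pool_candidates`, the lowercasing rule used to group occurrences, and the toy three-dimensional vectors are all assumptions made for this example.

```python
from collections import defaultdict


def max_pool_candidates(occurrences):
    """Merge per-occurrence embeddings of the same candidate by element-wise max.

    occurrences: iterable of (phrase, embedding) pairs, where each embedding is
    a list of floats of the same length (hypothetical inputs for illustration).
    """
    grouped = defaultdict(list)
    for phrase, embedding in occurrences:
        # Assumption: occurrences are grouped case-insensitively.
        grouped[phrase.lower()].append(embedding)
    # zip(*vectors) walks the grouped vectors one dimension at a time.
    return {
        phrase: [max(dim) for dim in zip(*vectors)]
        for phrase, vectors in grouped.items()
    }


# Toy embeddings standing in for encoder outputs.
occurrences = [
    ("keyphrase extraction", [0.1, 0.9, -0.3]),
    ("long documents", [0.4, 0.2, 0.0]),
    ("Keyphrase Extraction", [0.5, 0.1, -0.7]),
]
pooled = max_pool_candidates(occurrences)
print(pooled["keyphrase extraction"])  # [0.5, 0.9, -0.3]
```

Under this reading, a candidate that appears many times across a long document keeps the strongest signal from any of its occurrences in each dimension, rather than being represented by a single local context.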
