LongKey: Keyphrase Extraction for Long Documents
November 26, 2024
Authors: Jeovane Honorio Alves, Radu State, Cinthia Obladen de Almendra Freitas, Jean Paul Barddal
cs.AI
Abstract
In an era of information overload, manually annotating the vast and growing
corpus of documents and scholarly papers is increasingly impractical. Automated
keyphrase extraction addresses this challenge by identifying representative
terms within texts. However, most existing methods focus on short documents (up
to 512 tokens), leaving a gap in processing long-context documents. In this
paper, we introduce LongKey, a novel framework for extracting keyphrases from
lengthy documents, which uses an encoder-based language model to capture
extended text intricacies. LongKey uses a max-pooling embedder to enhance
keyphrase candidate representation. Validated on the comprehensive LDKP
datasets and six diverse, unseen datasets, LongKey consistently outperforms
existing unsupervised and language model-based keyphrase extraction methods.
Our findings demonstrate LongKey's versatility and superior performance,
marking an advancement in keyphrase extraction for varied text lengths and
domains.
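
The abstract names a max-pooling embedder but does not describe its details. The following is only a minimal sketch of the general idea, assuming that a candidate keyphrase occurs several times in a document and that its occurrences are merged by an element-wise maximum. The function names, the toy vectors, and the choice to average token vectors within each occurrence are illustrative assumptions, not taken from the paper.

```python
def mean_vector(vectors):
    """Average a list of equal-length vectors, dimension by dimension."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def max_pool_candidate(token_embeddings, occurrences):
    """Merge all occurrences of one candidate phrase into a single vector.

    token_embeddings: list of per-token vectors (stand-ins for encoder output).
    occurrences: list of (start, end) token spans, end exclusive.
    """
    # Assumption: each occurrence is summarized by the mean of its token vectors.
    span_vectors = [
        mean_vector(token_embeddings[start:end]) for start, end in occurrences
    ]
    # Element-wise max across occurrences keeps the largest value per dimension.
    return [max(column) for column in zip(*span_vectors)]


# Toy example: four tokens with 2-dimensional embeddings, and one candidate
# phrase that appears twice (tokens 0-1 and tokens 2-3).
token_embeddings = [[0.0, 1.0], [2.0, -1.0], [4.0, 3.0], [-2.0, 5.0]]
occurrences = [(0, 2), (2, 4)]
pooled = max_pool_candidate(token_embeddings, occurrences)
```

With max pooling, the merged vector has the same size however often the phrase appears, which is one plausible reason to prefer it for long documents where candidates repeat many times.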