LongKey: Keyphrase Extraction for Long Documents

November 26, 2024
Authors: Jeovane Honorio Alves, Radu State, Cinthia Obladen de Almendra Freitas, Jean Paul Barddal
cs.AI

Abstract

In an era of information overload, manually annotating the vast and growing corpus of documents and scholarly papers is increasingly impractical. Automated keyphrase extraction addresses this challenge by identifying representative terms within texts. However, most existing methods focus on short documents (up to 512 tokens), leaving a gap in processing long-context documents. In this paper, we introduce LongKey, a novel framework for extracting keyphrases from lengthy documents, which uses an encoder-based language model to capture extended text intricacies. LongKey uses a max-pooling embedder to enhance keyphrase candidate representation. Validated on the comprehensive LDKP datasets and six diverse, unseen datasets, LongKey consistently outperforms existing unsupervised and language model-based keyphrase extraction methods. Our findings demonstrate LongKey's versatility and superior performance, marking an advancement in keyphrase extraction for varied text lengths and domains.
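
The abstract does not spell out how the max-pooling embedder works, so the following is only a minimal sketch of one plausible reading: each candidate keyphrase may occur several times in a long document, and the embeddings of those occurrences are merged by an element-wise maximum into a single candidate representation. The function names (`max_pool`, `pool_candidates`), the case-insensitive grouping, and the toy three-dimensional vectors are assumptions made for illustration; they are not taken from the paper or its code.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def max_pool(vectors: Sequence[Vector]) -> Vector:
    """Element-wise maximum over a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    return [max(dims) for dims in zip(*vectors)]


def pool_candidates(occurrences: Sequence[Tuple[str, Vector]]) -> Dict[str, Vector]:
    """Group occurrence embeddings by candidate phrase and max-pool each group.

    Assumption: occurrences of the same phrase are matched case-insensitively.
    """
    grouped: Dict[str, List[Vector]] = defaultdict(list)
    for phrase, vec in occurrences:
        grouped[phrase.lower()].append(vec)
    return {phrase: max_pool(vecs) for phrase, vecs in grouped.items()}


# Toy data: hypothetical embeddings standing in for encoder outputs.
occurrences = [
    ("keyphrase extraction", [0.1, 0.9, -0.3]),
    ("long documents", [0.4, 0.2, 0.0]),
    ("Keyphrase Extraction", [0.7, 0.1, -0.5]),
]

pooled = pool_candidates(occurrences)
print(pooled["keyphrase extraction"])  # [0.7, 0.9, -0.3]
```

In this sketch a phrase that appears only once keeps its embedding unchanged, while a repeated phrase keeps the strongest activation per dimension across its occurrences. In a real system the pooled vectors would then be passed to a scoring step, which is not shown here.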

