LongKey: Keyphrase Extraction for Long Documents
November 26, 2024
Authors: Jeovane Honorio Alves, Radu State, Cinthia Obladen de Almendra Freitas, Jean Paul Barddal
cs.AI
Abstract
In an era of information overload, manually annotating the vast and growing
corpus of documents and scholarly papers is increasingly impractical. Automated
keyphrase extraction addresses this challenge by identifying representative
terms within texts. However, most existing methods focus on short documents (up
to 512 tokens), leaving a gap in processing long-context documents. In this
paper, we introduce LongKey, a novel framework for extracting keyphrases from
lengthy documents, which uses an encoder-based language model to capture
extended text intricacies. LongKey uses a max-pooling embedder to enhance
keyphrase candidate representation. Validated on the comprehensive LDKP
datasets and six diverse, unseen datasets, LongKey consistently outperforms
existing unsupervised and language model-based keyphrase extraction methods.
Our findings demonstrate LongKey's versatility and superior performance,
marking an advancement in keyphrase extraction for varied text lengths and
domains.
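
The abstract names a max-pooling embedder but does not spell out its mechanics. As a rough illustration only, the sketch below assumes that each keyphrase candidate may occur several times in a long document, that every occurrence has its own embedding vector, and that the occurrences are merged by an element-wise maximum. The function names (`max_pool`, `pool_candidates`) and the toy vectors are hypothetical and are not taken from the LongKey code.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def max_pool(vectors: Sequence[Vector]) -> List[float]:
    """Return the element-wise maximum over equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector to pool")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must share the same dimension")
    # zip(*vectors) groups the values of each dimension together.
    return [max(column) for column in zip(*vectors)]


def pool_candidates(occurrences: Dict[str, List[Vector]]) -> Dict[str, List[float]]:
    """Collapse all occurrence embeddings of each candidate into one vector.

    Assumption: one embedding per occurrence of a candidate in the document.
    """
    return {phrase: max_pool(vecs) for phrase, vecs in occurrences.items()}


# Toy, made-up embeddings (3 dimensions) for two candidates.
occurrences = {
    "keyphrase extraction": [[0.1, 0.9, -0.3], [0.4, 0.2, 0.0]],
    "long documents": [[-0.5, 0.3, 0.7]],
}
pooled = pool_candidates(occurrences)
print(pooled["keyphrase extraction"])
```

Under this assumption, a candidate seen only once keeps its single embedding unchanged, while a candidate seen many times keeps, per dimension, the strongest signal from any of its occurrences. In a real system the pooled vector would then be passed to a scoring or ranking step, which is outside the scope of this sketch.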