LongKey: Keyphrase Extraction for Long Documents

Abstract

In an era of information overload, manually annotating the vast and growing corpus of documents and scholarly papers is increasingly impractical. Automated keyphrase extraction addresses this challenge by identifying representative terms within texts. However, most existing methods focus on short documents (up to 512 tokens), leaving a gap in processing long-context documents. In this paper, we introduce LongKey, a novel framework for extracting keyphrases from lengthy documents, which uses an encoder-based language model to capture extended text intricacies. LongKey uses a max-pooling embedder to enhance keyphrase candidate representation. Validated on the comprehensive LDKP datasets and six diverse, unseen datasets, LongKey consistently outperforms existing unsupervised and language model-based keyphrase extraction methods. Our findings demonstrate LongKey's versatility and superior performance, marking an advancement in keyphrase extraction for varied text lengths and domains.
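
The max-pooling idea mentioned above can be illustrated with a minimal sketch. It assumes each candidate keyphrase may occur several times in a document, that every occurrence gets its own vector, and that the occurrence vectors are merged by an element-wise maximum. The abstract does not say how a single occurrence is embedded, so averaging token vectors is an assumption made here for illustration only. The function names and toy vectors are hypothetical stand-ins for real encoder outputs, not LongKey's actual implementation.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Average token vectors into one vector for a single occurrence (assumed step)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def max_pool(vectors: Sequence[Vector]) -> Vector:
    """Element-wise maximum across all occurrence vectors of one candidate."""
    return [max(column) for column in zip(*vectors)]


def candidate_embeddings(
    tokens: Sequence[str],
    token_vectors: Sequence[Vector],
    max_len: int = 2,
) -> Dict[Tuple[str, ...], Vector]:
    """Build one pooled vector per candidate n-gram (hypothetical helper)."""
    occurrences: Dict[Tuple[str, ...], List[Vector]] = {}
    for n in range(1, max_len + 1):
        for start in range(len(tokens) - n + 1):
            # Candidates are matched case-insensitively across the document.
            phrase = tuple(token.lower() for token in tokens[start:start + n])
            span = token_vectors[start:start + n]
            occurrences.setdefault(phrase, []).append(mean_vector(span))
    # Merge every occurrence of the same candidate into a single vector.
    return {phrase: max_pool(vecs) for phrase, vecs in occurrences.items()}


# Toy input: three tokens with two-dimensional "contextual" vectors.
tokens = ["long", "documents", "long"]
token_vectors = [[1.0, 0.0], [0.0, 2.0], [0.5, 3.0]]
embeddings = candidate_embeddings(tokens, token_vectors)
```

Here the candidate `("long",)` occurs twice, so its pooled vector takes the larger value in each dimension across both occurrences. A candidate that occurs once keeps its single occurrence vector unchanged.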
