Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models
January 8, 2025
Authors: Şaziye Betül Özateş, Tarık Emre Tıraş, Ece Elif Adak, Berat Doğan, Fatih Burak Karagöz, Efe Eren Genç, Esma F. Bilgin Taşdemir
cs.AI
Abstract
This paper introduces foundational resources and models for natural language
processing (NLP) of historical Turkish, a domain that has remained
underexplored in computational linguistics. We present the first named entity
recognition (NER) dataset, HisTR, and the first Universal Dependencies treebank,
OTA-BOUN, for a historical form of the Turkish language, along with
transformer-based models trained using these datasets for named entity
recognition, dependency parsing, and part-of-speech tagging tasks.
Additionally, we introduce the Ottoman Text Corpus (OTC), a clean corpus of
transliterated historical Turkish texts that spans a wide range of historical
periods. Our experimental results show significant improvements in the
computational analysis of historical Turkish, achieving promising results in
tasks that require understanding of historical linguistic structures. They also
highlight existing challenges, such as domain adaptation and language
variations across time periods. All of the presented resources and models are
made available at https://huggingface.co/bucolin to serve as a benchmark for
future progress in historical Turkish NLP.