Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models

January 8, 2025
Authors: Şaziye Betül Özateş, Tarık Emre Tıraş, Ece Elif Adak, Berat Doğan, Fatih Burak Karagöz, Efe Eren Genç, Esma F. Bilgin Taşdemir
cs.AI

Abstract

This paper introduces foundational resources and models for natural language processing (NLP) of historical Turkish, a domain that remains underexplored in computational linguistics. We present the first named entity recognition (NER) dataset, HisTR, and the first Universal Dependencies treebank, OTA-BOUN, for a historical form of the Turkish language, along with transformer-based models trained on these datasets for named entity recognition, dependency parsing, and part-of-speech tagging. Additionally, we introduce the Ottoman Text Corpus (OTC), a clean corpus of transliterated historical Turkish texts that spans a wide range of historical periods. Our experimental results show significant improvements in the computational analysis of historical Turkish, with promising results on tasks that require an understanding of historical linguistic structures. They also highlight open challenges, such as domain adaptation and language variation across time periods. All of the presented resources and models are available at https://huggingface.co/bucolin to serve as a benchmark for future progress in historical Turkish NLP.