Pretraining Language Models for Diachronic Linguistic Change Discovery
April 7, 2025
Authors: Elisabeth Fittschen, Sabrina Li, Tom Lippincott, Leshem Choshen, Craig Messner
cs.AI
Abstract
Large language models (LLMs) have shown potential as tools for scientific
discovery. This has engendered growing interest in their use in humanistic
disciplines, such as historical linguistics and literary studies. These fields
often construct arguments on the basis of delineations like genre, or more
inflexibly, time period. Although efforts have been made to restrict inference
to specific domains via fine-tuning or model editing, we posit that the only
true guarantee is domain-restricted pretraining -- typically, a data- and
compute-expensive proposition.
We show that efficient pretraining techniques can produce useful models over
corpora too large for easy manual inspection but too small for "typical" LLM
approaches. We employ a novel date-attribution pipeline in order to obtain a
temporally-segmented dataset of five 10-million-word slices. We train two
corresponding five-model batteries over these corpus segments: one
efficiently pretrained from scratch, and one parameter-efficiently
finetuned from Llama3-8B.
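The segmentation step described above can be sketched in miniature. This is a hypothetical illustration, not the authors' pipeline: `slice_corpus` and its parameters are invented names, and the word budgets are toy-sized stand-ins for the paper's five 10-million-word slices.

```python
from collections import defaultdict

def slice_corpus(docs, period_bounds, budget):
    """Bucket date-attributed documents into temporal slices.

    docs: iterable of (year, text) pairs, assumed already date-attributed.
    period_bounds: list of (start_year, end_year) ranges defining the slices.
    budget: maximum number of words admitted per slice.
    """
    slices = defaultdict(list)
    counts = defaultdict(int)
    for year, text in docs:
        for bounds in period_bounds:
            start, end = bounds
            if start <= year <= end:
                n_words = len(text.split())
                # Admit the document only if it fits the slice's word budget.
                if counts[bounds] + n_words <= budget:
                    slices[bounds].append(text)
                    counts[bounds] += n_words
                break  # each document belongs to at most one period
    return dict(slices)

# Toy usage: two periods, a 5-word budget per slice.
docs = [(1810, "a b c"), (1855, "d e"), (1812, "f g h i")]
print(slice_corpus(docs, [(1800, 1849), (1850, 1899)], budget=5))
```

Each resulting slice would then serve as the training corpus for one model in a battery, keeping every model's exposure strictly within its historical period.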
We find that the pretrained models are faster to train than the finetuned
baselines and that they better respect the historical divisions of our corpus.
Emphasizing speed and precision over ahistorical comprehensiveness enables a
number of novel approaches to hypothesis discovery and testing in our target
fields. Taking up diachronic linguistics as a testbed, we show that our method
enables the detection of a diverse set of phenomena, including en masse lexical
change, non-lexical (grammatical and morphological) change, and word sense
introduction/obsolescence. We provide a ready-to-use pipeline that allows
extension of our approach to other target fields with only minimal adaptation.
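One way to picture the kind of hypothesis discovery the abstract describes is to compare how surprising the same probe text is to models trained on successive time slices. The sketch below is an invented heuristic, not the paper's method: `classify_trend`, the ratio threshold, and the perplexity values are all hypothetical; in practice the scores would come from the period-specific pretrained models.

```python
def classify_trend(ppls, ratio=2.0):
    """Flag large shifts in model surprisal across time slices.

    ppls: perplexities of one probe sentence under models trained on
    successive slices, earliest first. A word entering the lexicon should
    be far more surprising to early models than to late ones, and vice
    versa for a word falling out of use.
    """
    first, last = ppls[0], ppls[-1]
    if first >= ratio * last:
        return "introduction"   # much less surprising in later periods
    if last >= ratio * first:
        return "obsolescence"   # much more surprising in later periods
    return "stable"

# Invented scores for a word entering the lexicon mid-corpus:
print(classify_trend([420.0, 310.0, 120.0, 60.0, 45.0]))  # → introduction
```

Because each battery model saw only its own slice, a strong monotonic shift like this is attributable to the historical period rather than to leakage from other eras, which is the guarantee the abstract argues finetuning alone cannot provide.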