
Pretraining Language Models for Diachronic Linguistic Change Discovery

April 7, 2025
Authors: Elisabeth Fittschen, Sabrina Li, Tom Lippincott, Leshem Choshen, Craig Messner
cs.AI

Abstract

Large language models (LLMs) have shown potential as tools for scientific discovery. This has engendered growing interest in their use in humanistic disciplines, such as historical linguistics and literary studies. These fields often construct arguments on the basis of delineations like genre, or more inflexibly, time period. Although efforts have been made to restrict inference to specific domains via fine-tuning or model editing, we posit that the only true guarantee is domain-restricted pretraining -- typically, a data- and compute-expensive proposition. We show that efficient pretraining techniques can produce useful models over corpora too large for easy manual inspection but too small for "typical" LLM approaches. We employ a novel date-attribution pipeline in order to obtain a temporally-segmented dataset of five 10-million-word slices. We train two corresponding five-model batteries over these corpus segments: one via efficient pretraining and one via parameter-efficient finetuning of Llama3-8B. We find that the pretrained models are faster to train than the finetuned baselines and that they better respect the historical divisions of our corpus. Emphasizing speed and precision over ahistorical comprehensiveness enables a number of novel approaches to hypothesis discovery and testing in our target fields. Taking up diachronic linguistics as a testbed, we show that our method enables the detection of a diverse set of phenomena, including en masse lexical change, non-lexical (grammatical and morphological) change, and word sense introduction/obsolescence. We provide a ready-to-use pipeline that allows extension of our approach to other target fields with only minimal adaptation.

