Tracking Universal Features Through Fine-Tuning and Model Merging
October 16, 2024
Authors: Niels Horn, Desmond Elliott
cs.AI
Abstract
We study how features emerge, disappear, and persist across models fine-tuned
on different domains of text. More specifically, we start from a base one-layer
Transformer language model that is trained on a combination of the BabyLM
corpus and a collection of Python code from The Stack. This base model is
adapted to two new domains of text, TinyStories and the Lua programming
language, respectively; these two fine-tuned models are then merged using
spherical linear interpolation. Our exploration aims to provide deeper
insights into the stability and transformation of features across typical
transfer-learning scenarios, using small-scale models and sparse
auto-encoders.
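The merging step the abstract describes, spherical linear interpolation (SLERP), interpolates between two weight vectors along the great-circle arc between them rather than along a straight line. A minimal sketch of this idea is shown below; the function name and the flattened-weight-vector interface are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def slerp(w0, w1, t=0.5, eps=1e-8):
    """Spherical linear interpolation between two flattened weight vectors.

    t=0 returns w0, t=1 returns w1; intermediate t follows the arc
    between the two directions. Falls back to linear interpolation
    when the vectors are nearly parallel (the angle is ~0).
    """
    # Angle between the two weight directions.
    v0 = w0 / (np.linalg.norm(w0) + eps)
    v1 = w1 / (np.linalg.norm(w1) + eps)
    dot = np.clip(np.dot(v0, v1), -1.0, 1.0)
    omega = np.arccos(dot)
    if omega < eps:
        # Nearly parallel: SLERP is numerically unstable, use LERP.
        return (1.0 - t) * w0 + t * w1
    so = np.sin(omega)
    # Standard SLERP weights: sin((1-t)Ω)/sinΩ and sin(tΩ)/sinΩ.
    return (np.sin((1.0 - t) * omega) / so) * w0 + (np.sin(t * omega) / so) * w1
```

In a model-merging setting, this would typically be applied per parameter tensor (flattened), producing a merged checkpoint whose features this paper then inspects with sparse auto-encoders.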