Zipfian Whitening

November 1, 2024
Authors: Sho Yokoi, Han Bao, Hiroto Kurita, Hidetoshi Shimodaira
cs.AI

Abstract

The word embedding space in neural models is skewed, and correcting this can improve task performance. We point out that most approaches for modeling, correcting, and measuring the symmetry of an embedding space implicitly assume that the word frequencies are uniform; in reality, word frequencies follow a highly non-uniform distribution, known as Zipf's law. Surprisingly, simply performing PCA whitening weighted by the empirical word frequency that follows Zipf's law significantly improves task performance, surpassing established baselines. From a theoretical perspective, both our approach and existing methods can be clearly categorized: word representations are distributed according to an exponential family with either uniform or Zipfian base measures. By adopting the latter approach, we can naturally emphasize informative low-frequency words in terms of their vector norm, which becomes evident from the information-geometric perspective, and in terms of the loss functions for imbalanced classification. Additionally, our theory corroborates that popular natural language processing methods, such as skip-gram negative sampling, WhiteningBERT, and headless language models, work well just because their word embeddings encode the empirical word frequency into the underlying probabilistic model.
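
To make the core idea concrete, below is a minimal sketch of PCA whitening weighted by the empirical word frequency, as described in the abstract. This is not the authors' reference implementation: the function name `zipfian_whitening` and the eigendecomposition route to computing the inverse square root of the covariance are illustrative assumptions.

```python
import numpy as np

def zipfian_whitening(embeddings: np.ndarray, freqs: np.ndarray) -> np.ndarray:
    """Whiten word embeddings under the empirical (Zipfian) word frequency.

    embeddings: (V, d) matrix of word vectors.
    freqs:      (V,)  empirical word frequencies (need not be normalized).
    Returns the whitened (V, d) matrix.
    """
    # Empirical unigram distribution p(w), replacing the implicit
    # uniform weighting of standard PCA whitening.
    p = freqs / freqs.sum()

    # Frequency-weighted mean: mu = sum_w p(w) e_w
    mu = p @ embeddings
    centered = embeddings - mu

    # Frequency-weighted covariance: Sigma = sum_w p(w)(e_w - mu)(e_w - mu)^T
    cov = centered.T @ (centered * p[:, None])

    # Whitening transform Sigma^{-1/2} via eigendecomposition
    # (an illustrative choice; any matrix square root would do).
    eigvals, eigvecs = np.linalg.eigh(cov)
    eps = 1e-12  # guard against numerically tiny eigenvalues
    W = eigvecs @ np.diag(1.0 / np.sqrt(np.maximum(eigvals, eps))) @ eigvecs.T
    return centered @ W
```

Given static embeddings and corpus unigram counts, the output has frequency-weighted mean zero and frequency-weighted covariance equal to the identity; setting `freqs` to all ones recovers ordinary uniform PCA whitening for comparison.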
