
Zipfian Whitening

November 1, 2024
Authors: Sho Yokoi, Han Bao, Hiroto Kurita, Hidetoshi Shimodaira
cs.AI

Abstract

The word embedding space in neural models is skewed, and correcting this can improve task performance. We point out that most approaches for modeling, correcting, and measuring the symmetry of an embedding space implicitly assume that the word frequencies are uniform; in reality, word frequencies follow a highly non-uniform distribution, known as Zipf's law. Surprisingly, simply performing PCA whitening weighted by the empirical word frequency that follows Zipf's law significantly improves task performance, surpassing established baselines. From a theoretical perspective, both our approach and existing methods can be clearly categorized: word representations are distributed according to an exponential family with either uniform or Zipfian base measures. By adopting the latter approach, we can naturally emphasize informative low-frequency words in terms of their vector norm, which becomes evident from the information-geometric perspective, and in terms of the loss functions for imbalanced classification. Additionally, our theory corroborates that popular natural language processing methods, such as skip-gram negative sampling, WhiteningBERT, and headless language models, work well just because their word embeddings encode the empirical word frequency into the underlying probabilistic model.
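To make the core idea concrete, below is a minimal NumPy sketch of PCA whitening weighted by the empirical word frequencies, as the abstract describes. The function name `zipfian_whitening` and the exact normalization details are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def zipfian_whitening(W, freqs, eps=1e-8):
    """Frequency-weighted PCA whitening of word embeddings (a sketch).

    W     : (V, d) array of word vectors.
    freqs : (V,) array of empirical word counts (Zipfian in practice).
    Returns the whitened (V, d) embedding matrix.
    """
    p = freqs / freqs.sum()                # empirical unigram distribution
    mu = p @ W                             # frequency-weighted mean, shape (d,)
    Wc = W - mu                            # center under the Zipfian measure
    cov = (Wc * p[:, None]).T @ Wc         # frequency-weighted covariance, (d, d)
    eigval, eigvec = np.linalg.eigh(cov)   # eigendecomposition of the symmetric covariance
    # PCA whitening: rotate onto the eigenbasis and rescale each axis by 1/sqrt(eigenvalue)
    return Wc @ eigvec / np.sqrt(eigval + eps)
```

Setting `freqs` to a uniform vector recovers ordinary PCA whitening (the uniform base measure, as in WhiteningBERT); supplying corpus counts gives the Zipfian variant that the abstract reports as outperforming such baselines.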
