Old Optimizer, New Norm: An Anthology
September 30, 2024
Authors: Jeremy Bernstein, Laker Newhouse
cs.AI
Abstract
Deep learning optimizers are often motivated through a mix of convex and approximate second-order theory. We select three such methods -- Adam, Shampoo and Prodigy -- and argue that each method can instead be understood as a squarely first-order method without convexity assumptions. In fact, after switching off exponential moving averages, each method is equivalent to steepest descent under a particular norm. By generalizing this observation, we chart a new design space for training algorithms. Different operator norms should be assigned to different tensors based on the role that the tensor plays within the network. For example, while linear and embedding layers may have the same weight space of R^{m \times n}, these layers play different roles and should be assigned different norms. We hope that this idea of carefully metrizing the neural architecture might lead to more stable, scalable and indeed faster training.
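
As a worked illustration of the "steepest descent under a particular norm" claim, here is a minimal sketch using standard definitions; the symbols g (gradient), lambda (a sharpness parameter), and the dagger (dual norm) are notation introduced here for illustration, not taken from the paper itself:

\[
\Delta w^{\star}
\;=\; \arg\min_{\Delta w}\Big( g^{\top}\Delta w + \tfrac{\lambda}{2}\,\lVert \Delta w\rVert^{2} \Big)
\;=\; -\,\frac{\lVert g\rVert^{\dagger}}{\lambda}\,\arg\max_{\lVert t\rVert = 1} g^{\top} t .
\]

For example, under the elementwise infinity norm the dual norm is the l1 norm and the maximizer is sign(g), so the update is -(\lVert g\rVert_{1}/\lambda)\,\mathrm{sign}(g): a form of sign descent, which is what Adam reduces to once its exponential moving averages are switched off.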