Old Optimizer, New Norm: An Anthology
September 30, 2024
Authors: Jeremy Bernstein, Laker Newhouse
cs.AI
Abstract
Deep learning optimizers are often motivated through a mix of convex and approximate second-order theory. We select three such methods -- Adam, Shampoo and Prodigy -- and argue that each method can instead be understood as a squarely first-order method without convexity assumptions. In fact, after switching off exponential moving averages, each method is equivalent to steepest descent under a particular norm. By generalizing this observation, we chart a new design space for training algorithms. Different operator norms should be assigned to different tensors based on the role that the tensor plays within the network. For example, while linear and embedding layers may have the same weight space of R^{m \times n}, these layers play different roles and should be assigned different norms. We hope that this idea of carefully metrizing the neural architecture might lead to more stable, scalable and indeed faster training.
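
As a worked illustration of the "steepest descent under a particular norm" claim, here is a minimal sketch using standard definitions; the symbols g (gradient), lambda (a sharpness parameter), and the dagger (dual norm) are notation introduced here for illustration, not taken from the paper itself:

\[
\Delta w^{\star}
\;=\; \arg\min_{\Delta w}\Big( g^{\top}\Delta w + \tfrac{\lambda}{2}\,\lVert \Delta w\rVert^{2} \Big)
\;=\; -\,\frac{\lVert g\rVert^{\dagger}}{\lambda}\,\arg\max_{\lVert t\rVert = 1} g^{\top} t .
\]

For example, under the elementwise infinity norm the dual norm is the l1 norm and the maximizer is sign(g), so the update is -(\lVert g\rVert_{1}/\lambda)\,\mathrm{sign}(g): a form of sign descent, which is what Adam reduces to once its exponential moving averages are switched off.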