Old Optimizer, New Norm: An Anthology

Abstract

Deep learning optimizers are often motivated through a mix of convex and approximate second-order theory. We select three such methods (Adam, Shampoo and Prodigy) and argue that each can instead be understood as a squarely first-order method without convexity assumptions. In fact, after switching off exponential moving averages, each method is equivalent to steepest descent under a particular norm. By generalizing this observation, we chart a new design space for training algorithms: different operator norms should be assigned to different tensors based on the role that each tensor plays within the network. For example, while linear and embedding layers may have the same weight space $\mathbb{R}^{m \times n}$, these layers play different roles and should be assigned different norms. We hope that this idea of carefully metrizing the neural architecture might lead to more stable, scalable and indeed faster training.
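As a rough illustration of the claim about switching off exponential moving averages, the sketch below checks the Adam case numerically. It assumes the standard Adam update with both decay rates set to zero and epsilon set to zero, so that the moment estimates collapse to the current gradient and its square. The function names `adam_step_no_ema` and `sign_descent_step` are hypothetical and chosen for this example only; they are not taken from the paper. Under these assumptions the update reduces to sign descent, whose direction is the steepest one within the unit ball of the infinity norm.

```python
import numpy as np

def adam_step_no_ema(grad, lr=1e-3, eps=0.0):
    """Adam update with beta1 = beta2 = 0 (an assumption made for this sketch).

    With no averaging, the first moment is the gradient itself and the
    second moment is its elementwise square. Bias correction is a no-op
    because 1 - 0**t == 1 for every step t >= 1.
    """
    m = grad           # first-moment estimate without EMA
    v = grad ** 2      # second-moment estimate without EMA
    # Note: with eps = 0 an exactly zero gradient entry would give 0/0.
    return -lr * m / (np.sqrt(v) + eps)

def sign_descent_step(grad, lr=1e-3):
    """Sign descent: move every coordinate by lr against the gradient's sign."""
    return -lr * np.sign(grad)

rng = np.random.default_rng(0)
g = rng.standard_normal((4, 3))   # stand-in gradient for a 4 x 3 weight matrix

# The two updates coincide entrywise (no gradient entry is exactly zero here).
assert np.allclose(adam_step_no_ema(g), sign_descent_step(g))

# Within the unit infinity-norm ball, -sign(g) attains the most negative
# inner product with g, namely minus the l1 norm of g.
best = np.sum(g * -np.sign(g))
assert np.isclose(best, -np.abs(g).sum())
for _ in range(1000):
    d = rng.uniform(-1.0, 1.0, size=g.shape)   # random point in the unit ball
    assert np.sum(g * d) >= best - 1e-12
```

The check covers only the update direction for a single tensor. The step-size scaling and the per-layer norm assignment discussed in the abstract are outside the scope of this sketch.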
