TinyFusion: Diffusion Transformers Learned Shallow

Abstract

Diffusion Transformers have demonstrated remarkable capabilities in image generation but are often over-parameterized, resulting in considerable inference overhead in real-world applications. In this work, we present TinyFusion, a depth pruning method designed to remove redundant layers from diffusion transformers via end-to-end learning. The core principle of our approach is to create a pruned model with high recoverability, allowing it to regain strong performance after fine-tuning. To accomplish this, we introduce a differentiable sampling technique to make pruning learnable, paired with a co-optimized parameter to simulate future fine-tuning. While prior works focus on minimizing loss or error after pruning, our method explicitly models and optimizes the post-fine-tuning performance of pruned models. Experimental results indicate that this learnable paradigm offers substantial benefits for layer pruning of diffusion transformers, surpassing existing importance-based and error-based methods. Additionally, TinyFusion exhibits strong generalization across diverse architectures, such as DiTs, MARs, and SiTs. Experiments with DiT-XL show that TinyFusion can craft a shallow diffusion transformer at less than 7% of the pre-training cost, achieving a 2× speedup with an FID score of 2.86 and outperforming competitors with comparable efficiency. Code is available at https://github.com/VainF/TinyFusion.
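
The abstract describes making layer pruning learnable through a differentiable sampling technique, but it does not say which relaxation is used. As a rough illustration only, the sketch below assumes a Gumbel-softmax relaxation over candidate layer masks, which is one common way to sample discrete choices differentiably. The names `candidate_masks`, `gumbel_softmax`, and `soft_layer_mask` are hypothetical and are not taken from the TinyFusion code. The sketch shows the forward pass only; in real training the logits would be learnable parameters updated by an autograd framework.

```python
import math
import random
from itertools import combinations


def candidate_masks(num_layers, num_keep):
    """Enumerate every binary mask that keeps `num_keep` of `num_layers` layers."""
    masks = []
    for kept in combinations(range(num_layers), num_keep):
        masks.append([1.0 if i in kept else 0.0 for i in range(num_layers)])
    return masks


def gumbel_softmax(logits, temperature, rng):
    """Draw a relaxed one-hot sample over categories (assumed relaxation)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    top = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def soft_layer_mask(logits, masks, temperature, rng):
    """Blend the candidate masks using the relaxed sample as weights."""
    weights = gumbel_softmax(logits, temperature, rng)
    num_layers = len(masks[0])
    return [sum(w * m[i] for w, m in zip(weights, masks)) for i in range(num_layers)]


rng = random.Random(0)
masks = candidate_masks(num_layers=4, num_keep=2)  # 6 candidate masks
logits = [0.0] * len(masks)  # hypothetical learnable scores, one per mask
soft_mask = soft_layer_mask(logits, masks, temperature=1.0, rng=rng)
```

Each entry of `soft_mask` lies between 0 and 1 and could scale the output of the corresponding layer; because every candidate keeps two layers, the entries sum to 2. Lowering `temperature` pushes the sample toward a single hard mask.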
