Presto! Distilling Steps and Layers for Accelerating Music Generation
October 7, 2024
Authors: Zachary Novack, Ge Zhu, Jonah Casebeer, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas J. Bryan
cs.AI
Abstract
Despite advances in diffusion-based text-to-music (TTM) methods, efficient, high-quality generation remains a challenge. We introduce Presto!, an approach to inference acceleration for score-based diffusion transformers via reducing both sampling steps and cost per step. To reduce steps, we develop a new score-based distribution matching distillation (DMD) method for the EDM-family of diffusion models, the first GAN-based distillation method for TTM. To reduce the cost per step, we develop a simple but powerful improvement to a recent layer distillation method that improves learning by better preserving hidden-state variance. Finally, we combine our step and layer distillation methods into a dual-faceted approach. We evaluate our step and layer distillation methods independently and show that each yields best-in-class performance. Our combined distillation method can generate high-quality outputs with improved diversity, accelerating our base model by 10-18x (230/435 ms latency for 32 seconds of mono/stereo 44.1 kHz audio, 15x faster than comparable SOTA) -- the fastest high-quality TTM to our knowledge. Sound examples can be found at https://presto-music.github.io/web/.
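
The abstract describes the layer-distillation improvement only at a high level (preserving hidden-state variance when transformer blocks are removed). As a purely illustrative sketch, and not the paper's actual method, the PyTorch snippet below shows one way a skipped block could pass its input through while rescaling it with a running estimate of the block's output/input standard-deviation ratio; the class name, buffer, and momentum parameter are all hypothetical.

```python
import torch
import torch.nn as nn


class SkipWithVarianceRescale(nn.Module):
    """Hypothetical wrapper: when a transformer block is dropped during layer
    distillation, pass the hidden state through but rescale it so its standard
    deviation tracks what the full block would have produced."""

    def __init__(self, block: nn.Module, keep: bool = True, momentum: float = 0.1):
        super().__init__()
        self.block = block
        self.keep = keep
        self.momentum = momentum
        # Running estimate of the block's output/input std ratio.
        self.register_buffer("std_ratio", torch.tensor(1.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.keep:
            y = self.block(x)
            with torch.no_grad():  # update the running std ratio from real activations
                ratio = y.std() / (x.std() + 1e-6)
                self.std_ratio.mul_(1.0 - self.momentum).add_(self.momentum * ratio)
            return y
        # Block skipped: identity pass-through, rescaled to keep hidden-state
        # variance close to that of the undistilled model.
        return x * self.std_ratio


# Usage sketch: wrap the blocks of a (hypothetical) stack and skip every other one.
blocks = nn.ModuleList(
    [SkipWithVarianceRescale(nn.Linear(64, 64), keep=(i % 2 == 0)) for i in range(4)]
)
x = torch.randn(2, 16, 64)
for blk in blocks:
    x = blk(x)
```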