Fast Video Generation with Sliding Tile Attention

February 6, 2025
Authors: Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhenghong Liu, Hao Zhang
cs.AI

Abstract

Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost -- when generating just a 5-second 720P video, attention alone takes 800 out of 945 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA leverages the observation that attention scores in pretrained video diffusion models predominantly concentrate within localized 3D windows. By sliding and attending over the local spatial-temporal region, STA eliminates redundancy from full attention. Unlike traditional token-wise sliding window attention (SWA), STA operates tile-by-tile with a novel hardware-aware sliding window design, preserving expressiveness while being hardware-efficient. With careful kernel-level optimizations, STA offers the first efficient 2D/3D sliding-window-like attention implementation, achieving 58.79% MFU. Specifically, STA accelerates attention by 2.8-17x over FlashAttention-2 (FA2) and 1.6-10x over FlashAttention-3 (FA3). On the leading video DiT, HunyuanVideo, STA reduces end-to-end latency from 945s (FA3) to 685s without quality degradation, requiring no training. Enabling finetuning further lowers latency to 268s with only a 0.09% drop on VBench.
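The tile-by-tile windowing the abstract contrasts with token-wise SWA can be illustrated by constructing the attention mask at tile granularity: every token in a query tile shares the same set of visible key tiles, namely those within a fixed window of tile coordinates around it. This is only a minimal sketch of that idea, not the paper's kernel (STA's speedup comes from a hardware-aware implementation that avoids materializing any mask); the function name `sta_tile_mask` and its parameters are hypothetical, chosen for illustration.

```python
import numpy as np

def sta_tile_mask(grid, tile, window):
    """Tile-granular sliding-window mask over a flattened 3D token grid.

    grid   : (T, H, W) token grid, each axis divisible by the tile size
    tile   : (tt, th, tw) tile size per axis
    window : (wt, wh, ww) window size measured in tiles (odd, centered)
    Returns a boolean (N, N) array over the row-major-flattened sequence,
    True where a query token may attend to a key token.
    """
    # 3D coordinate of every token, flattened in row-major (T, H, W) order
    coords = np.stack(
        np.meshgrid(*[np.arange(g) for g in grid], indexing="ij"), axis=-1
    ).reshape(-1, 3)
    # tile coordinate of every token: all tokens in a tile share this index,
    # which is what makes the mask tile-wise rather than token-wise
    tid = coords // np.array(tile)
    # a query tile attends to key tiles within +/- window//2 tiles per axis
    half = np.array(window) // 2
    diff = np.abs(tid[:, None, :] - tid[None, :, :])  # (N, N, 3)
    return (diff <= half).all(axis=-1)

# Toy sizes: a (6, 8, 8) latent grid with (2, 4, 4) tiles and a 3x3x3 tile window.
mask = sta_tile_mask(grid=(6, 8, 8), tile=(2, 4, 4), window=(3, 3, 3))
print(mask.shape, mask.mean())  # fraction of full attention actually computed
```

Because the visible set is constant within a tile, the kept entries form dense tile-sized blocks, which is the property that makes such a pattern amenable to block-wise kernels like FlashAttention, unlike the ragged per-token bands of classic SWA.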


February 10, 2025