LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity

Abstract

Text-to-video generation enhances content creation but is highly computationally intensive: the computational cost of Diffusion Transformers (DiTs) scales quadratically in the number of pixels. This makes minute-length video generation extremely expensive, limiting most existing models to generating videos of only 10-20 seconds in length. We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly in the number of pixels. For the first time, LinGen enables high-resolution minute-length video generation on a single GPU without compromising quality. It replaces the computationally dominant, quadratic-complexity block, self-attention, with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch. The MA-branch targets short-to-long-range correlations, combining a bidirectional Mamba2 block with our token rearrangement method, Rotary Major Scan, and our review tokens developed for long video generation. The TE-branch is a novel TEmporal Swin Attention block that focuses on temporal correlations between adjacent tokens and medium-range tokens. The MATE block addresses the adjacency preservation issue of Mamba and significantly improves the consistency of generated videos. Experimental results show that LinGen outperforms DiT in video quality (with a 75.6% win rate) while reducing FLOPs by up to 15× and latency by up to 11.5×. Furthermore, both automatic metrics and human evaluation demonstrate that our LinGen-4B yields video quality comparable to state-of-the-art models (with win rates of 50.5%, 52.1%, and 49.1% with respect to Gen-3, LumaLabs, and Kling, respectively). This paves the way to hour-length movie generation and real-time interactive video generation. We provide 68s video generation results and more examples on our project website: https://lineargen.github.io/.
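To make the quadratic-versus-linear scaling claim concrete, the following is a minimal toy cost model, not the paper's FLOPs accounting. The function names, the constant factors, and the `state_size` parameter are illustrative assumptions; only the growth rates (quadratic for full self-attention, linear for a linear-complexity token mixer) reflect the claim in the abstract.

```python
def attention_cost(num_tokens: int, dim: int) -> int:
    """Toy multiply-add count for full self-attention.

    Assumes ~N^2 * d for the query-key scores plus ~N^2 * d for the
    attention-weighted sum over values; projections are ignored.
    """
    return 2 * num_tokens * num_tokens * dim


def linear_mixer_cost(num_tokens: int, dim: int, state_size: int = 16) -> int:
    """Toy multiply-add count for a linear-complexity token mixer.

    Assumes a fixed amount of work per token (d * state_size);
    state_size is a hypothetical constant chosen for illustration.
    """
    return num_tokens * dim * state_size


def growth(cost_fn, num_tokens: int, dim: int, factor: int = 2) -> float:
    """How much the cost grows when the token count is multiplied by `factor`."""
    return cost_fn(num_tokens * factor, dim) / cost_fn(num_tokens, dim)


if __name__ == "__main__":
    dim = 64
    for n in (1_000, 10_000, 100_000):
        ratio = attention_cost(n, dim) / linear_mixer_cost(n, dim)
        print(f"tokens={n:>7}  attention/linear cost ratio = {ratio:.1f}")
    print("attention growth when tokens double:", growth(attention_cost, 1_000, dim))
    print("linear growth when tokens double:   ", growth(linear_mixer_cost, 1_000, dim))
```

Under these assumptions, the gap between the two costs widens in proportion to the token count, which is why the savings matter most for long, high-resolution videos. The 15× and 11.5× figures in the abstract come from the authors' measurements and are not reproduced by this sketch.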
