Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation

Abstract

We introduce Presto, a novel video diffusion model designed to generate 15-second videos with long-range coherence and rich content. Extending video generation methods so that they maintain scenario diversity over long durations is a significant challenge. To address it, we propose a Segmented Cross-Attention (SCA) strategy, which splits the hidden states into segments along the temporal dimension and lets each segment cross-attend to its corresponding sub-caption. SCA requires no additional parameters, so it can be incorporated seamlessly into current DiT-based architectures. To facilitate high-quality long video generation, we build the LongTake-HD dataset, which consists of 261k content-rich videos with scenario coherence, each annotated with an overall video caption and five progressive sub-captions. Experiments show that Presto achieves 78.5% on the VBench Semantic Score and 100% on Dynamic Degree, outperforming existing state-of-the-art video generation methods. These results indicate that Presto significantly enhances content richness, maintains long-range coherence, and captures intricate textual details. More details are available on our project page: https://presto-video.github.io/.
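The segmentation idea behind SCA can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: it assumes one token vector per frame, near-equal contiguous segments, and a single attention head with no learned projections, and the names `segment_bounds`, `cross_attend`, and `segmented_cross_attention` are hypothetical. The only point it shows is that each temporal segment attends solely to its own sub-caption while the same attention function is reused for every segment, so the segmentation itself adds no parameters.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned projections)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def segment_bounds(num_frames, num_segments):
    """Split range(num_frames) into contiguous, near-equal segments (an assumption)."""
    return [
        (i * num_frames // num_segments, (i + 1) * num_frames // num_segments)
        for i in range(num_segments)
    ]


def segmented_cross_attention(hidden, sub_captions):
    """Let each temporal segment of `hidden` attend only to its own sub-caption.

    hidden:       list of per-frame vectors, ordered in time.
    sub_captions: list of sub-captions, each a list of text-token vectors.
    """
    output = []
    bounds = segment_bounds(len(hidden), len(sub_captions))
    for (start, end), caption in zip(bounds, sub_captions):
        # The same attention function is reused for every segment.
        output.extend(cross_attend(hidden[start:end], caption, caption))
    return output
```

With 10 frames and five sub-captions, for example, frames 0-1 would attend to the first sub-caption, frames 2-3 to the second, and so on. In a real DiT block the queries, keys, and values would pass through the block's existing projection layers, which this sketch omits.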
