LVD-2M: A Long-take Video Dataset with Temporally Dense Captions

Abstract

The efficacy of video generation models depends heavily on the quality of their training datasets. Most previous video generation models were trained on short video clips, but there has recently been increasing interest in training long video generation models directly on longer videos. However, the lack of high-quality long videos impedes the advancement of long video generation. To promote research in long video generation, we desire a new dataset with four key features essential for training long video generation models: (1) long videos covering at least 10 seconds, (2) long-take videos without cuts, (3) large motion and diverse content, and (4) temporally dense captions. To achieve this, we introduce a new pipeline for selecting high-quality long-take videos and generating temporally dense captions. Specifically, we define a set of metrics to quantitatively assess video quality, including scene cuts, dynamic degree, and semantic-level quality, which enables us to filter high-quality long-take videos from a large number of source videos. Subsequently, we develop a hierarchical video captioning pipeline to annotate long videos with temporally dense captions. With this pipeline, we curate the first long-take video dataset, LVD-2M, comprising 2 million long-take videos, each covering more than 10 seconds and annotated with temporally dense captions. We further validate the effectiveness of LVD-2M by fine-tuning video generation models to generate long videos with dynamic motions. We believe our work will significantly contribute to future research in long video generation.
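As a rough illustration of the filtering step described in the abstract, the sketch below keeps only clips that meet the four stated criteria. It is not the authors' implementation: the `ClipStats` record, the function names, and the score thresholds are hypothetical, and the scores are assumed to be precomputed and normalized to [0, 1]. Only the 10-second minimum and the no-cuts requirement come from the abstract.

```python
from dataclasses import dataclass

# Only the 10-second minimum is taken from the abstract.
MIN_DURATION_S = 10.0
# Hypothetical placeholder thresholds, chosen purely for illustration.
MIN_DYNAMIC_DEGREE = 0.5
MIN_SEMANTIC_QUALITY = 0.5


@dataclass
class ClipStats:
    """Hypothetical per-clip measurements from upstream scorers."""
    duration_s: float        # clip length in seconds
    num_scene_cuts: int      # number of detected cuts inside the clip
    dynamic_degree: float    # motion score, assumed normalized to [0, 1]
    semantic_quality: float  # content score, assumed normalized to [0, 1]


def is_long_take_candidate(clip: ClipStats) -> bool:
    """Return True if the clip is long, uncut, dynamic, and of good quality."""
    return (
        clip.duration_s >= MIN_DURATION_S
        and clip.num_scene_cuts == 0
        and clip.dynamic_degree >= MIN_DYNAMIC_DEGREE
        and clip.semantic_quality >= MIN_SEMANTIC_QUALITY
    )


def filter_clips(clips):
    """Keep only the clips that pass every criterion, preserving order."""
    return [clip for clip in clips if is_long_take_candidate(clip)]
```

In practice each score would come from a separate model or detector, and the thresholds would be tuned against the source video pool rather than fixed in advance.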
