InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption

Abstract

Text-to-video generation has evolved rapidly in recent years, delivering remarkable results. Training typically relies on video-caption paired data, which plays a crucial role in enhancing generation performance. However, current video captions often suffer from insufficient detail, hallucinations, and imprecise motion depiction, which degrades the fidelity and consistency of generated videos. In this work, we propose a novel instance-aware structured caption framework, termed InstanceCap, to achieve instance-level and fine-grained video captioning for the first time. Based on this scheme, we design an auxiliary model cluster that converts the original video into instances to enhance instance fidelity. The video instances are then used to refine dense prompts into structured phrases, achieving concise yet precise descriptions. Furthermore, a 22K InstanceVid dataset is curated for training, and an enhancement pipeline tailored to the InstanceCap structure is proposed for inference. Experimental results demonstrate that the proposed InstanceCap significantly outperforms previous models, ensuring high fidelity between captions and videos while reducing hallucinations.
