ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models
November 16, 2024
Authors: Vipula Rawte, Sarthak Jain, Aarush Sinha, Garv Kaushik, Aman Bansal, Prathiksha Rumale Vishwanath, Samyak Rajesh Jain, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha, Amit P. Sheth, Amitava Das
cs.AI
Abstract
Latest developments in Large Multimodal Models (LMMs) have broadened their
capabilities to include video understanding. Specifically, Text-to-video (T2V)
models have made significant progress in quality, comprehension, and duration,
excelling at creating videos from simple textual prompts. Yet, they still
frequently produce hallucinated content that clearly signals the video is
AI-generated. We introduce ViBe: a large-scale Text-to-Video Benchmark of
hallucinated videos from T2V models. We identify five major types of
hallucination: Vanishing Subject, Numeric Variability, Temporal Dysmorphia,
Omission Error, and Physical Incongruity. Using 10 open-source T2V models, we
developed the first large-scale dataset of hallucinated videos, comprising
3,782 videos annotated by humans into these five categories. ViBe offers a
unique resource for evaluating the reliability of T2V models and provides a
foundation for improving hallucination detection and mitigation in video
generation. We establish classification as a baseline and present various
ensemble classifier configurations, with the TimeSFormer + CNN combination
yielding the best performance, achieving 0.345 accuracy and 0.342 F1 score.
This benchmark aims to drive the development of robust T2V models that produce
videos more accurately aligned with input prompts.
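The ensemble baseline mentioned above (TimeSformer + CNN) could be realized in several ways; the following is a minimal PyTorch sketch, not the authors' implementation, of one plausible design: a TimeSformer-style video encoder and a small per-frame CNN whose class logits are averaged to produce a five-way prediction over the hallucination categories. The feature dimensions (ts_dim=768, cnn_dim=256), module names, and logit-averaging fusion are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (assumptions, not the paper's implementation): ensembling a
# TimeSformer-style video encoder with a small per-frame CNN for 5-way
# hallucination classification.
import torch
import torch.nn as nn

NUM_CLASSES = 5  # Vanishing Subject, Numeric Variability, Temporal Dysmorphia,
                 # Omission Error, Physical Incongruity

class SimpleFrameCNN(nn.Module):
    """Tiny CNN applied per frame, then averaged over time (assumed design)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, video):                      # video: (B, T, 3, H, W)
        b, t, c, h, w = video.shape
        x = self.conv(video.view(b * t, c, h, w)).flatten(1)   # (B*T, 64)
        return self.proj(x).view(b, t, -1).mean(dim=1)         # (B, feat_dim)

class EnsembleClassifier(nn.Module):
    """Averages class logits from the two branches (one plausible fusion rule)."""
    def __init__(self, timesformer_backbone, ts_dim=768, cnn_dim=256):
        super().__init__()
        self.timesformer = timesformer_backbone    # any module: video -> (B, ts_dim)
        self.cnn = SimpleFrameCNN(cnn_dim)
        self.head_ts = nn.Linear(ts_dim, NUM_CLASSES)
        self.head_cnn = nn.Linear(cnn_dim, NUM_CLASSES)

    def forward(self, video):
        logits_ts = self.head_ts(self.timesformer(video))
        logits_cnn = self.head_cnn(self.cnn(video))
        return (logits_ts + logits_cnn) / 2        # simple logit averaging

# Usage with a stand-in backbone (replace with a real pretrained TimeSformer):
dummy_backbone = nn.Sequential(nn.Flatten(1), nn.LazyLinear(768))
model = EnsembleClassifier(dummy_backbone)
clip = torch.randn(2, 8, 3, 64, 64)               # (batch, frames, channels, H, W)
print(model(clip).shape)                           # torch.Size([2, 5])
```

Logit averaging is only one fusion choice; concatenating the two feature vectors before a shared linear head would be an equally reasonable variant of such an ensemble.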