ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models

Abstract

Recent developments in Large Multimodal Models (LMMs) have broadened their capabilities to include video understanding. In particular, text-to-video (T2V) models have made significant progress in quality, comprehension, and duration, and excel at creating videos from simple textual prompts. Yet they still frequently produce hallucinated content that clearly signals the video is AI-generated. We introduce ViBe, a large-scale Text-to-Video Benchmark of hallucinated videos from T2V models. We identify five major types of hallucination: Vanishing Subject, Numeric Variability, Temporal Dysmorphia, Omission Error, and Physical Incongruity. Using 10 open-source T2V models, we developed the first large-scale dataset of hallucinated videos, comprising 3,782 videos annotated by humans into these five categories. ViBe offers a unique resource for evaluating the reliability of T2V models and provides a foundation for improving hallucination detection and mitigation in video generation. We establish classification as a baseline and present various ensemble classifier configurations; the TimeSFormer + CNN combination performs best, achieving an accuracy of 0.345 and an F1 score of 0.342. This benchmark aims to drive the development of robust T2V models that produce videos more accurately aligned with input prompts.
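To make the reported accuracy and F1 figures concrete, the following is a minimal sketch of how such scores could be computed for a five-way hallucination classifier. It is not the authors' evaluation code: the function names are hypothetical, and macro averaging of F1 is an assumption, since the abstract does not say which averaging was used.

```python
from collections import Counter

# The five hallucination categories named in the abstract.
CATEGORIES = [
    "Vanishing Subject",
    "Numeric Variability",
    "Temporal Dysmorphia",
    "Omission Error",
    "Physical Incongruity",
]


def accuracy(y_true, y_pred):
    """Fraction of videos whose predicted category matches the human label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels=CATEGORIES):
    """Unweighted mean of per-category F1 (macro averaging is an assumption)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted category gains a false positive
            fn[t] += 1  # true category gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # A category with no true or predicted instances contributes 0.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Both functions take two equal-length lists of category names: the human annotations and the classifier's predictions. With five classes, uniform random guessing would yield an accuracy of about 0.2, which gives some context for the reported 0.345.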
