VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models

Abstract

Video generation has advanced significantly, yet evaluating these models remains a challenge. A comprehensive evaluation benchmark for video generation is indispensable for two reasons: 1) existing metrics do not fully align with human perception; 2) an ideal evaluation system should provide insights that inform future developments in video generation. To this end, we present VBench, a comprehensive benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions, each with tailored prompts and evaluation methods. VBench has several appealing properties: 1) Comprehensive Dimensions: VBench comprises 16 dimensions in video generation (e.g., subject identity inconsistency, motion smoothness, temporal flickering, and spatial relationship). These fine-grained evaluation metrics reveal individual models' strengths and weaknesses. 2) Human Alignment: We also provide a dataset of human preference annotations to validate our benchmark's alignment with human perception for each evaluation dimension. 3) Valuable Insights: We examine current models' abilities across various evaluation dimensions and content types. We also investigate the gaps between video and image generation models. 4) Versatile Benchmarking: VBench++ supports evaluating both text-to-video and image-to-video generation. We introduce a high-quality Image Suite with an adaptive aspect ratio to enable fair evaluations across different image-to-video generation settings. Beyond assessing technical quality, VBench++ evaluates the trustworthiness of video generative models, providing a more holistic view of model performance. 5) Full Open-Sourcing: We fully open-source VBench++ and continually add new video generation models to our leaderboard to drive forward the field of video generation.
