
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

November 20, 2024
Authors: Ziyang Luo, Haoning Wu, Dongxu Li, Jing Ma, Mohan Kankanhalli, Junnan Li
cs.AI

Abstract

Large multimodal models (LMMs) with advanced video analysis capabilities have recently garnered significant attention. However, most evaluations rely on traditional methods such as multiple-choice questions in benchmarks like VideoMME and LongVideoBench, which often lack the depth needed to capture the complex demands of real-world users. To address this limitation, and given the prohibitive cost and slow pace of human annotation for video tasks, we introduce VideoAutoArena, an arena-style benchmark inspired by the LMSYS Chatbot Arena framework and designed to automatically assess LMMs' video analysis abilities. VideoAutoArena utilizes user simulation to generate open-ended, adaptive questions that rigorously assess model performance in video understanding. The benchmark features an automated, scalable evaluation framework, incorporating a modified ELO Rating System for fair and continuous comparisons across multiple LMMs. To validate our automated judging system, we construct a 'gold standard' using a carefully curated subset of human annotations, demonstrating that our arena strongly aligns with human judgment while maintaining scalability. Additionally, we introduce a fault-driven evolution strategy, progressively increasing question complexity to push models toward handling more challenging video analysis scenarios. Experimental results demonstrate that VideoAutoArena effectively differentiates among state-of-the-art LMMs, providing insights into model strengths and areas for improvement. To further streamline our evaluation, we introduce VideoAutoBench as an auxiliary benchmark, where human annotators label winners in a subset of VideoAutoArena battles. We use GPT-4o as a judge to compare responses against these human-validated answers. Together, VideoAutoArena and VideoAutoBench offer a cost-effective and scalable framework for evaluating LMMs in user-centric video analysis.
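The abstract mentions a modified ELO Rating System for ranking models from pairwise battles but does not spell out the modification. The sketch below shows only the standard Elo update for a single battle between two models; the K-factor, starting ratings, and model names are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of an Elo-style rating update for pairwise model battles.
# The paper uses a *modified* ELO Rating System; this is the vanilla update
# for illustration only. K_FACTOR and model names are assumptions.

K_FACTOR = 32  # assumed step size for rating updates


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float) -> tuple[float, float]:
    """Update both ratings after one battle.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + K_FACTOR * (score_a - exp_a)
    new_b = rating_b + K_FACTOR * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


if __name__ == "__main__":
    # Hypothetical battle: "model_x" beats "model_y"; both start at 1200.
    ratings = {"model_x": 1200.0, "model_y": 1200.0}
    ratings["model_x"], ratings["model_y"] = update_elo(
        ratings["model_x"], ratings["model_y"], score_a=1.0
    )
    print(ratings)  # model_x gains points; model_y loses the same amount
```

In an arena setting, this update would be applied after each judged battle, so ratings evolve continuously as new models and new battles are added.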
