VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
November 20, 2024
Authors: Ziyang Luo, Haoning Wu, Dongxu Li, Jing Ma, Mohan Kankanhalli, Junnan Li
cs.AI
Abstract
Large multimodal models (LMMs) with advanced video analysis capabilities have
recently garnered significant attention. However, most evaluations rely on
traditional methods like multiple-choice questions in benchmarks such as
VideoMME and LongVideoBench, which often lack the depth needed to capture the
complex demands of real-world users. To address this limitation, and given the
prohibitive cost and slow pace of human annotation for video tasks, we
introduce VideoAutoArena, an arena-style benchmark inspired by LMSYS
Chatbot Arena's framework, designed to automatically assess LMMs' video
analysis abilities. VideoAutoArena utilizes user simulation to generate
open-ended, adaptive questions that rigorously assess model performance in
video understanding. The benchmark features an automated, scalable evaluation
framework, incorporating a modified ELO Rating System for fair and continuous
comparisons across multiple LMMs. To validate our automated judging system, we
construct a 'gold standard' using a carefully curated subset of human
annotations, demonstrating that our arena strongly aligns with human judgment
while maintaining scalability. Additionally, we introduce a fault-driven
evolution strategy, progressively increasing question complexity to push models
toward handling more challenging video analysis scenarios. Experimental results
demonstrate that VideoAutoArena effectively differentiates among
state-of-the-art LMMs, providing insights into model strengths and areas for
improvement. To further streamline our evaluation, we introduce VideoAutoBench
as an auxiliary benchmark, where human annotators label winners in a subset of
VideoAutoArena battles. We use GPT-4o as a judge to compare responses against
these human-validated answers. Together, VideoAutoArena and VideoAutoBench
offer a cost-effective and scalable framework for evaluating LMMs in
user-centric video analysis.
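
As a point of reference for the rating mechanism mentioned above, the following is a minimal sketch of a standard Elo update for pairwise model battles. The abstract does not specify how the paper's "modified ELO Rating System" differs from the standard formulation, so the function names, the K-factor of 32, and the starting rating of 1000 below are illustrative assumptions rather than the authors' implementation.

def expected_score(rating_a: float, rating_b: float) -> float:
    # Probability that model A beats model B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    # Update both ratings after one battle.
    # outcome_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    e_a = expected_score(rating_a, rating_b)
    rating_a += k * (outcome_a - e_a)
    rating_b += k * ((1.0 - outcome_a) - (1.0 - e_a))
    return rating_a, rating_b

# Example: two models start at 1000; model A wins one battle.
ra, rb = elo_update(1000.0, 1000.0, outcome_a=1.0)
print(ra, rb)  # A gains 16 points, B loses 16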