VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
November 20, 2024
Authors: Ziyang Luo, Haoning Wu, Dongxu Li, Jing Ma, Mohan Kankanhalli, Junnan Li
cs.AI
Abstract
Large multimodal models (LMMs) with advanced video analysis capabilities have
recently garnered significant attention. However, most evaluations rely on
traditional methods like multiple-choice questions in benchmarks such as
VideoMME and LongVideoBench, which often lack the depth needed to
capture the complex demands of real-world users. To address this limitation, and
because human annotation for video tasks is prohibitively costly and slow,
we introduce VideoAutoArena, an arena-style benchmark inspired by the LMSYS
Chatbot Arena framework, designed to automatically assess LMMs' video
analysis abilities. VideoAutoArena utilizes user simulation to generate
open-ended, adaptive questions that rigorously assess model performance in
video understanding. The benchmark features an automated, scalable evaluation
framework, incorporating a modified Elo rating system for fair and continuous
comparisons across multiple LMMs. To validate our automated judging system, we
construct a 'gold standard' using a carefully curated subset of human
annotations, demonstrating that our arena strongly aligns with human judgment
while maintaining scalability. Additionally, we introduce a fault-driven
evolution strategy, progressively increasing question complexity to push models
toward handling more challenging video analysis scenarios. Experimental results
demonstrate that VideoAutoArena effectively differentiates among
state-of-the-art LMMs, providing insights into model strengths and areas for
improvement. To further streamline our evaluation, we introduce VideoAutoBench
as an auxiliary benchmark, where human annotators label winners in a subset of
VideoAutoArena battles. We use GPT-4o as a judge to compare responses against
these human-validated answers. Together, VideoAutoArena and VideoAutoBench
offer a cost-effective and scalable framework for evaluating LMMs in
user-centric video analysis.
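
For readers unfamiliar with arena-style ranking, the following is a minimal sketch of the standard Elo update applied to pairwise battle outcomes. It is not the modified rating system the abstract refers to, whose details are not given here. The K-factor of 32, the starting rating of 1000, and the model names are illustrative assumptions only.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one battle.

    score_a is 1.0 if A wins, 0.0 if B wins, and 0.5 for a tie.
    k is an assumed K-factor, not a value taken from the paper.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def rate_battles(battles, initial: float = 1000.0, k: float = 32.0):
    """Apply Elo updates in order over (model_a, model_b, score_a) tuples."""
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, score_a in battles:
        ratings[model_a], ratings[model_b] = update_elo(
            ratings[model_a], ratings[model_b], score_a, k
        )
    return dict(ratings)


if __name__ == "__main__":
    # Hypothetical battle log: the judge's verdict is encoded as score_a.
    battles = [
        ("model_x", "model_y", 1.0),
        ("model_y", "model_z", 0.5),
        ("model_x", "model_z", 1.0),
    ]
    for name, rating in sorted(rate_battles(battles).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rating:.1f}")
```

In this sketch each battle moves the two ratings by equal and opposite amounts, so the total rating across models stays constant, and an upset win against a higher-rated model moves ratings more than an expected win.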