
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

November 20, 2024
Authors: Ziyang Luo, Haoning Wu, Dongxu Li, Jing Ma, Mohan Kankanhalli, Junnan Li
cs.AI

Abstract

Large multimodal models (LMMs) with advanced video analysis capabilities have recently garnered significant attention. However, most evaluations rely on traditional methods such as multiple-choice questions in benchmarks like VideoMME and LongVideoBench, which often lack the depth needed to capture the complex demands of real-world users. To address this limitation, and given the prohibitive cost and slow pace of human annotation for video tasks, we introduce VideoAutoArena, an arena-style benchmark inspired by the LMSYS Chatbot Arena framework and designed to automatically assess LMMs' video analysis abilities. VideoAutoArena uses user simulation to generate open-ended, adaptive questions that rigorously assess model performance in video understanding. The benchmark features an automated, scalable evaluation framework that incorporates a modified ELO rating system for fair, continuous comparisons across multiple LMMs. To validate our automated judging system, we construct a 'gold standard' from a carefully curated subset of human annotations, demonstrating that our arena aligns strongly with human judgment while remaining scalable. Additionally, we introduce a fault-driven evolution strategy that progressively increases question complexity, pushing models to handle more challenging video analysis scenarios. Experimental results demonstrate that VideoAutoArena effectively differentiates among state-of-the-art LMMs, providing insights into model strengths and areas for improvement. To further streamline evaluation, we introduce VideoAutoBench as an auxiliary benchmark, in which human annotators label the winners in a subset of VideoAutoArena battles; we then use GPT-4o as a judge to compare model responses against these human-validated answers. Together, VideoAutoArena and VideoAutoBench offer a cost-effective and scalable framework for evaluating LMMs in user-centric video analysis.
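To make the rating mechanism concrete, below is a minimal sketch of a standard Elo-style update as typically used for pairwise model battles. The abstract does not specify the details of the paper's modified rating system, so the constants (K-factor, base rating) and function names here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative Elo-style update for pairwise model battles (assumed, not the
# paper's exact modified rating system).

def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected win probability of model A against model B under Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_ratings(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one battle; score_a is 1 (A wins), 0.5 (tie), or 0."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: two models start at 1200; model A wins one automatically judged battle.
print(update_ratings(1200.0, 1200.0, score_a=1.0))  # (1216.0, 1184.0)
```

In an arena setting, such updates are applied continuously as new automatically judged battles arrive, which is what allows the leaderboard to compare many LMMs without re-running a fixed test set.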
