VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

Abstract

Large multimodal models (LMMs) with advanced video analysis capabilities have recently attracted significant attention. However, most evaluations rely on traditional methods such as multiple-choice questions in benchmarks like VideoMME and LongVideoBench, which often lack the depth needed to capture the complex demands of real-world users. To address this limitation, and because human annotation for video tasks is prohibitively costly and slow, we introduce VideoAutoArena, an arena-style benchmark inspired by the LMSYS Chatbot Arena framework and designed to automatically assess LMMs' video analysis abilities. VideoAutoArena uses user simulation to generate open-ended, adaptive questions that rigorously assess model performance in video understanding. The benchmark features an automated, scalable evaluation framework that incorporates a modified Elo rating system for fair and continuous comparisons across multiple LMMs. To validate our automated judging system, we construct a "gold standard" from a carefully curated subset of human annotations and show that our arena aligns strongly with human judgment while remaining scalable. Additionally, we introduce a fault-driven evolution strategy that progressively increases question complexity, pushing models toward more challenging video analysis scenarios. Experimental results demonstrate that VideoAutoArena effectively differentiates among state-of-the-art LMMs and provides insights into model strengths and areas for improvement. To further streamline our evaluation, we introduce VideoAutoBench as an auxiliary benchmark, in which human annotators label the winners in a subset of VideoAutoArena battles. We use GPT-4o as a judge to compare responses against these human-validated answers. Together, VideoAutoArena and VideoAutoBench offer a cost-effective and scalable framework for evaluating LMMs in user-centric video analysis.
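The abstract mentions a modified Elo rating system but does not say what the modification is. As a point of reference only, the following is a minimal sketch of the standard, unmodified Elo update applied to pairwise battles. The K-factor of 32, the initial rating of 1000, the function names, and the model names are all assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return updated ratings after one battle.

    score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
    k (assumed value) controls how far a single battle moves the ratings.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Standard Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


def run_arena(
    battles: Iterable[Tuple[str, str, float]],
    initial: float = 1000.0,
    k: float = 32.0,
) -> Dict[str, float]:
    """Process battles in order; each is (model_a, model_b, score_a)."""
    ratings: Dict[str, float] = {}
    for model_a, model_b, score_a in battles:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, score_a, k)
    return ratings


if __name__ == "__main__":
    # Hypothetical judge verdicts: model_x beats model_y, then model_y ties model_z.
    demo = [("model_x", "model_y", 1.0), ("model_y", "model_z", 0.5)]
    for name, rating in sorted(run_arena(demo).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rating:.1f}")
```

In an arena setting, each battle outcome would come from the automated judge's verdict on two model responses to the same simulated-user question. Because sequential Elo updates depend on battle order, arena leaderboards often average over shuffled orderings or fit ratings jointly. Whether VideoAutoArena does either is not stated in this abstract.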
