VisionArena: 230K Real World User-VLM Conversations with Preference Labels

December 11, 2024
Authors: Christopher Chou, Lisa Dunlap, Koki Mashita, Krishna Mandal, Trevor Darrell, Ion Stoica, Joseph E. Gonzalez, Wei-Lin Chiang
cs.AI

Abstract

With the growing adoption and capabilities of vision-language models (VLMs) comes the need for benchmarks that capture authentic user-VLM interactions. In response, we create VisionArena, a dataset of 230K real-world conversations between users and VLMs. Collected from Chatbot Arena - an open-source platform where users interact with VLMs and submit preference votes - VisionArena spans 73K unique users, 45 VLMs, and 138 languages. Our dataset contains three subsets: VisionArena-Chat, 200K single- and multi-turn conversations between a user and a VLM; VisionArena-Battle, 30K conversations comparing two anonymous VLMs with user preference votes; and VisionArena-Bench, an automatic benchmark of 500 diverse user prompts that efficiently approximates the live Chatbot Arena model rankings. Additionally, we highlight the types of questions asked by users, the influence of response style on preference, and areas where models often fail. We find open-ended tasks like captioning and humor are highly style-dependent, and current VLMs struggle with spatial reasoning and planning tasks. Lastly, we show that finetuning the same base model on VisionArena-Chat outperforms finetuning on Llava-Instruct-158K, with a 17-point gain on MMMU and a 46-point gain on the WildVision benchmark. Dataset at https://huggingface.co/lmarena-ai.
