VisionArena: 230K Real World User-VLM Conversations with Preference Labels
December 11, 2024
Authors: Christopher Chou, Lisa Dunlap, Koki Mashita, Krishna Mandal, Trevor Darrell, Ion Stoica, Joseph E. Gonzalez, Wei-Lin Chiang
cs.AI
Abstract
With the growing adoption and capabilities of vision-language models (VLMs)
comes the need for benchmarks that capture authentic user-VLM interactions. In
response, we create VisionArena, a dataset of 230K real-world conversations
between users and VLMs. Collected from Chatbot Arena - an open-source platform
where users interact with VLMs and submit preference votes - VisionArena spans
73K unique users, 45 VLMs, and 138 languages. Our dataset contains three
subsets: VisionArena-Chat, 200K single- and multi-turn conversations between a
user and a VLM; VisionArena-Battle, 30K conversations comparing two anonymous
VLMs with user preference votes; and VisionArena-Bench, an automatic benchmark
of 500 diverse user prompts that efficiently approximate the live Chatbot Arena
model rankings. Additionally, we highlight the types of questions asked by
users, the influence of response style on preference, and areas where models
often fail. We find open-ended tasks like captioning and humor are highly
style-dependent, and current VLMs struggle with spatial reasoning and planning
tasks. Lastly, we show finetuning the same base model on VisionArena-Chat
outperforms Llava-Instruct-158K, with a 17-point gain on MMMU and a 46-point
gain on the WildVision benchmark. Dataset at https://huggingface.co/lmarena-ai
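The three subsets are distributed through the Hugging Face organization linked above. As a rough illustration of how the Battle split's preference votes can be consumed, the sketch below loads the data with the `datasets` library and tallies simple per-model win rates. The repo ID `lmarena-ai/VisionArena-Battle` and the `model_a`/`model_b`/`winner` field names are assumptions based on the organization URL and common Chatbot Arena conventions, so check the dataset cards for the actual identifiers and schema.

```python
# Minimal sketch: load VisionArena-Battle and compute per-model win rates
# from the user preference votes. Repo ID and column names are assumed,
# not confirmed by the paper -- verify against the dataset card.
from collections import Counter

from datasets import load_dataset

battles = load_dataset("lmarena-ai/VisionArena-Battle", split="train")  # assumed repo ID

wins, appearances = Counter(), Counter()
for row in battles:
    model_a, model_b, winner = row["model_a"], row["model_b"], row["winner"]  # assumed fields
    appearances[model_a] += 1
    appearances[model_b] += 1
    if winner == "model_a":
        wins[model_a] += 1
    elif winner == "model_b":
        wins[model_b] += 1
    # ties / "both bad" votes count toward appearances but not wins

# Print the ten most-battled models with their raw win rates
for model, n in sorted(appearances.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{model}: {wins[model] / n:.2%} win rate over {n} battles")
```

A raw win rate like this ignores opponent strength; the paper's live-ranking comparison for VisionArena-Bench is based on the Chatbot Arena leaderboard methodology rather than this simple tally.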