VisionArena: 230K Real World User-VLM Conversations with Preference Labels
December 11, 2024
Authors: Christopher Chou, Lisa Dunlap, Koki Mashita, Krishna Mandal, Trevor Darrell, Ion Stoica, Joseph E. Gonzalez, Wei-Lin Chiang
cs.AI
Abstract
With the growing adoption and capabilities of vision-language models (VLMs)
comes the need for benchmarks that capture authentic user-VLM interactions. In
response, we create VisionArena, a dataset of 230K real-world conversations
between users and VLMs. Collected from Chatbot Arena - an open-source platform
where users interact with VLMs and submit preference votes - VisionArena spans
73K unique users, 45 VLMs, and 138 languages. Our dataset contains three
subsets: VisionArena-Chat, 200K single- and multi-turn conversations between a
user and a VLM; VisionArena-Battle, 30K conversations comparing two anonymous
VLMs with user preference votes; and VisionArena-Bench, an automatic benchmark
of 500 diverse user prompts that efficiently approximate the live Chatbot Arena
model rankings. Additionally, we highlight the types of questions asked by
users, the influence of response style on preference, and areas where models
often fail. We find open-ended tasks like captioning and humor are highly
style-dependent, and current VLMs struggle with spatial reasoning and planning
tasks. Lastly, we show finetuning the same base model on VisionArena-Chat
outperforms Llava-Instruct-158K, with a 17-point gain on MMMU and a 46-point
gain on the WildVision benchmark. Dataset at https://huggingface.co/lmarena-ai
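The three subsets are distributed through the Hugging Face organization linked above. As a rough illustration of how the Battle split's preference votes can be consumed, the sketch below loads the data with the `datasets` library and tallies simple per-model win rates. The repo ID `lmarena-ai/VisionArena-Battle` and the `model_a`/`model_b`/`winner` field names are assumptions based on the organization URL and common Chatbot Arena conventions, so check the dataset cards for the actual identifiers and schema.

```python
# Minimal sketch: load VisionArena-Battle and compute per-model win rates
# from the user preference votes. Repo ID and column names are assumed,
# not confirmed by the paper -- verify against the dataset card.
from collections import Counter

from datasets import load_dataset

battles = load_dataset("lmarena-ai/VisionArena-Battle", split="train")  # assumed repo ID

wins, appearances = Counter(), Counter()
for row in battles:
    model_a, model_b, winner = row["model_a"], row["model_b"], row["winner"]  # assumed fields
    appearances[model_a] += 1
    appearances[model_b] += 1
    if winner == "model_a":
        wins[model_a] += 1
    elif winner == "model_b":
        wins[model_b] += 1
    # ties / "both bad" votes count toward appearances but not wins

# Print the ten most-battled models with their raw win rates
for model, n in sorted(appearances.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{model}: {wins[model] / n:.2%} win rate over {n} battles")
```

A raw win rate like this ignores opponent strength; the paper's live-ranking comparison for VisionArena-Bench is based on the Chatbot Arena leaderboard methodology rather than this simple tally.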