InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback

February 20, 2025
Authors: Henry Hengyuan Zhao, Wenqi Pei, Yifei Tao, Haiyang Mei, Mike Zheng Shou
cs.AI

Abstract

Existing benchmarks do not test Large Multimodal Models (LMMs) on their interactive intelligence with human users, which is vital for developing general-purpose AI assistants. We design InterFeedback, an interactive framework that can be applied to any LMM and dataset to assess this ability autonomously. On top of this, we introduce InterFeedback-Bench, which evaluates interactive intelligence using two representative datasets, MMMU-Pro and MathVerse, to test 10 different open-source LMMs. Additionally, we present InterFeedback-Human, a newly collected dataset of 120 cases designed for manually testing interactive performance in leading models such as OpenAI-o1 and Claude-3.5-Sonnet. Our evaluation results show that even state-of-the-art LMMs (such as OpenAI-o1) correct their results through human feedback in fewer than 50% of cases. Our findings point to the need for methods that can enhance LMMs' ability to interpret and benefit from feedback.
