

SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories

March 11, 2025
Authors: Muzhi Zhu, Yuzhuo Tian, Hao Chen, Chunluan Zhou, Qingpei Guo, Yang Liu, Ming Yang, Chunhua Shen
cs.AI

Abstract

While MLLMs have demonstrated adequate image understanding capabilities, they still struggle with pixel-level comprehension, limiting their practical applications. Current evaluation tasks like VQA and visual grounding remain too coarse to assess fine-grained pixel comprehension accurately. Though segmentation is foundational for pixel-level understanding, existing methods often require MLLMs to generate implicit tokens, decoded through external pixel decoders. This approach disrupts the MLLM's text output space, potentially compromising language capabilities and reducing flexibility and extensibility, while failing to reflect the model's intrinsic pixel-level understanding. Thus, we introduce the Human-Like Mask Annotation Task (HLMAT), a new paradigm where MLLMs mimic human annotators using interactive segmentation tools. Modeling segmentation as a multi-step Markov Decision Process, HLMAT enables MLLMs to iteratively generate text-based click points, achieving high-quality masks without architectural changes or implicit tokens. Through this setup, we develop SegAgent, a model fine-tuned on human-like annotation trajectories, which achieves performance comparable to state-of-the-art (SOTA) methods and supports additional tasks like mask refinement and annotation filtering. HLMAT provides a protocol for assessing fine-grained pixel understanding in MLLMs and introduces a vision-centric, multi-step decision-making task that facilitates exploration of MLLMs' visual reasoning abilities. Our adaptations of the policy improvement method StaR and of PRM-guided tree search further enhance model robustness in complex segmentation tasks, laying a foundation for future advancements in fine-grained visual perception and multi-step decision-making for MLLMs.
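To make the multi-step formulation concrete, the sketch below shows one way the described annotation loop could look in code: the state holds the image, the referring text, and the current mask; the action is a text-based click proposed by the MLLM; the interactive segmentation tool plays the role of the environment and updates the mask. The interfaces `mllm.propose_click` and `seg_tool.apply_click` are hypothetical placeholders for illustration only; the paper's actual prompt format and tool API are not reproduced here.

```python
# Minimal sketch of an HLMAT-style annotation loop as a multi-step decision
# process. `mllm` and `seg_tool` are hypothetical stand-ins, not APIs from the
# paper or any specific library.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

import numpy as np


@dataclass
class AnnotationState:
    """State at step t: the image, the referring text, and the current mask."""
    image: np.ndarray
    text: str
    mask: Optional[np.ndarray] = None
    clicks: List[Tuple[int, int, bool]] = field(default_factory=list)


def annotate(mllm, seg_tool, image: np.ndarray, text: str, max_steps: int = 8):
    """Iteratively refine a mask by letting the MLLM emit text-based clicks."""
    state = AnnotationState(image=image, text=text)
    for _ in range(max_steps):
        # The MLLM reads the image, the text, and the current mask overlay,
        # then outputs a click as plain text, e.g. "positive point at (x, y)",
        # or signals that the current mask is good enough.
        x, y, positive, stop = mllm.propose_click(state)
        if stop:
            break
        state.clicks.append((x, y, positive))
        # The interactive segmentation tool (the "environment") turns the
        # accumulated clicks into an updated mask for the next step.
        state.mask = seg_tool.apply_click(image, state.clicks)
    return state.mask
```

Framing annotation this way keeps all model outputs in the ordinary text space, which is what allows the same loop to be reused for imitation on human annotation trajectories and, with a process reward model scoring intermediate states, for tree search over alternative click sequences.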

