UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface
March 3, 2025
Authors: Hao Tang, Chenwei Xie, Haiyang Wang, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, Liwei Wang
cs.AI
Abstract
Generalist models have achieved remarkable success in both language and
vision-language tasks, showcasing the potential of unified modeling. However,
effectively integrating fine-grained perception tasks like detection and
segmentation into these models remains a significant challenge. This is
primarily because these tasks often rely heavily on task-specific designs and
architectures that can complicate the modeling process. To address this
challenge, we present UFO, a framework that Unifies
Fine-grained visual perception tasks through an Open-ended
language interface. By transforming all perception targets into the language
space, UFO unifies object-level detection, pixel-level segmentation, and
image-level vision-language tasks into a single model. Additionally, we
introduce a novel embedding retrieval approach that relies solely on the
language interface to support segmentation tasks. Our framework bridges the gap
between fine-grained perception and vision-language tasks, significantly
simplifying architectural design and training strategies while achieving
comparable or superior performance to methods with intricate task-specific
designs. After multi-task training on five standard visual perception datasets,
UFO outperforms the previous state-of-the-art generalist models by 12.3 mAP
on COCO instance segmentation and 3.3 mIoU on ADE20K semantic segmentation.
Furthermore, our method seamlessly integrates with existing MLLMs, effectively
combining fine-grained perception capabilities with their advanced language
abilities, thereby enabling more challenging tasks such as reasoning
segmentation. Code and models will be publicly available.
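The abstract's "embedding retrieval" idea for segmentation can be illustrated with a minimal sketch. All names, shapes, and the thresholding rule below are illustrative assumptions, not the paper's actual implementation: the language model produces an embedding for a mask-related token, and that embedding is compared against per-pixel image features to retrieve a segmentation mask, so no dedicated mask decoder is needed.

```python
import numpy as np

# Hypothetical shapes: D = embedding dim, H x W = vision feature-map size.
D, H, W = 16, 8, 8
rng = np.random.default_rng(0)

# Per-pixel image features, assumed to come from a vision encoder.
image_feats = rng.normal(size=(H, W, D))

# Embedding the language model is assumed to emit for a special mask token.
mask_token_embed = rng.normal(size=(D,))

# Retrieval: dot-product similarity between the token embedding and every
# pixel feature gives a logit map; thresholding at 0 yields a binary mask.
logits = image_feats @ mask_token_embed   # (H, W) similarity map
mask = logits > 0.0                       # (H, W) boolean segmentation mask

print(mask.shape)  # (8, 8)
```

In this sketch the mask is read off purely from embedding similarity, which is the property that lets segmentation ride on the same open-ended language interface as detection and vision-language tasks.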