UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface

Abstract

Generalist models have achieved remarkable success in both language and vision-language tasks, showcasing the potential of unified modeling. However, effectively integrating fine-grained perception tasks such as detection and segmentation into these models remains a significant challenge, primarily because these tasks often rely heavily on task-specific designs and architectures that complicate the modeling process. To address this challenge, we present UFO, a framework that Unifies Fine-grained visual perception tasks through an Open-ended language interface. By transforming all perception targets into the language space, UFO unifies object-level detection, pixel-level segmentation, and image-level vision-language tasks in a single model. Additionally, we introduce a novel embedding retrieval approach that relies solely on the language interface to support segmentation tasks. Our framework bridges the gap between fine-grained perception and vision-language tasks, significantly simplifying architectural design and training strategies while achieving comparable or superior performance to methods with intricate task-specific designs. After multi-task training on five standard visual perception datasets, UFO outperforms the previous state-of-the-art generalist models by 12.3 mAP on COCO instance segmentation and 3.3 mIoU on ADE20K semantic segmentation. Furthermore, our method integrates seamlessly with existing multimodal large language models (MLLMs), effectively combining fine-grained perception capabilities with their advanced language abilities and thereby enabling more challenging tasks such as reasoning segmentation. Code and models will be publicly available.
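
The abstract does not spell out how embedding retrieval produces a segmentation mask. One plausible reading, offered only as an illustrative sketch and not as the authors' implementation, is that the model emits an embedding for a mask token and the mask is recovered by scoring every image-feature position against that embedding and keeping the positions that score highly. The function name `retrieve_mask`, the scaled dot-product similarity, the zero threshold, and the toy feature map below are all assumptions made for this example.

```python
from math import sqrt


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_mask(mask_embedding, pixel_features, threshold=0.0):
    """Hypothetical embedding retrieval: mark each position whose scaled
    dot-product similarity with the mask-token embedding exceeds `threshold`.

    mask_embedding: list of floats (assumed output embedding of a mask token)
    pixel_features: H x W grid of feature vectors (assumed image features)
    """
    scale = sqrt(len(mask_embedding))
    return [
        [1 if dot(mask_embedding, feat) / scale > threshold else 0 for feat in row]
        for row in pixel_features
    ]


# Toy 2x2 feature map with 2-dimensional features (made-up values).
features = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[-1.0, 0.0], [0.0, -1.0]],
]
mask_token = [1.0, 0.2]

mask = retrieve_mask(mask_token, features)
for row in mask:
    print(row)
```

In this toy case the top row of features points in roughly the same direction as the mask-token embedding and is selected, while the bottom row is not. Under this reading no separate mask decoder is needed, which would be consistent with the abstract's claim that segmentation relies solely on the language interface.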

