YOLOE: Real-Time Seeing Anything
March 10, 2025
Authors: Ao Wang, Lihao Liu, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding
cs.AI
Abstract
Object detection and segmentation are widely employed in computer vision applications, yet conventional models like the YOLO series, while efficient and accurate, are limited to predefined categories, which hinders adaptability in open scenarios. Recent open-set methods leverage text prompts, visual cues, or prompt-free paradigms to overcome this, but often compromise between performance and efficiency due to high computational demands or deployment complexity. In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. For text prompts, we propose the Re-parameterizable Region-Text Alignment (RepRTA) strategy. It refines pretrained textual embeddings via a re-parameterizable lightweight auxiliary network and enhances visual-textual alignment with zero inference and transfer overhead. For visual prompts, we present the Semantic-Activated Visual Prompt Encoder (SAVPE). It employs decoupled semantic and activation branches to yield improved visual embeddings and accuracy with minimal complexity. For the prompt-free scenario, we introduce the Lazy Region-Prompt Contrast (LRPC) strategy. It utilizes a built-in large vocabulary and a specialized embedding to identify all objects, avoiding dependency on costly language models. Extensive experiments show YOLOE's exceptional zero-shot performance and transferability with high inference efficiency and low training cost. Notably, on LVIS, with 3× lower training cost and a 1.4× inference speedup, YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP. When transferring to COCO, YOLOE-v8-L achieves gains of 0.6 AP^b and 0.4 AP^m over closed-set YOLOv8-L with nearly 4× less training time. Code and models are available at https://github.com/THU-MIG/yoloe.
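The abstract describes its three prompting mechanisms only at a high level. As a rough illustration of the RepRTA idea, the sketch below refines frozen text embeddings with a small auxiliary network during training and, once the prompt vocabulary is fixed, precomputes the refined embeddings so they can serve directly as classification-head weights with no inference-time cost. All module names, dimensions, and tensors here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of the RepRTA idea (names and sizes are assumptions, not
# the authors' code): a lightweight auxiliary network refines frozen text
# embeddings during training; once the prompt vocabulary is fixed, the refined
# embeddings are precomputed and folded into the classification head, so
# inference pays no extra cost.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxTextRefiner(nn.Module):
    """Lightweight auxiliary network applied to cached text embeddings."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # Residual refinement keeps the pretrained embedding as a strong prior.
        return F.normalize(text_emb + self.mlp(text_emb), dim=-1)

def region_text_logits(region_emb: torch.Tensor, class_emb: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity logits between N region embeddings and C class embeddings."""
    return F.normalize(region_emb, dim=-1) @ class_emb.t()  # (N, C)

refiner = AuxTextRefiner(dim=512)
cached_text = torch.randn(80, 512)   # frozen text-encoder embeddings for 80 prompts
regions = torch.randn(100, 512)      # region features produced by the detector

# Training: refine the cached embeddings on the fly.
train_logits = region_text_logits(regions, refiner(cached_text))

# Deployment ("re-parameterization"): run the refiner once, keep only its output
# as fixed head weights, and discard the auxiliary network.
with torch.no_grad():
    head_weights = refiner(cached_text)
deploy_logits = region_text_logits(regions, head_weights)
```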
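For visual prompts, the abstract mentions decoupled semantic and activation branches. A minimal sketch of that decoupling, assuming a mask-shaped visual prompt and arbitrary channel sizes, could look as follows; the exact branch design in SAVPE may differ.

```python
# Minimal sketch of a decoupled visual-prompt encoder in the spirit of SAVPE:
# a semantic branch produces prompt-agnostic features, an activation branch
# turns the visual prompt (here a binary mask) into spatial weights, and the
# prompt embedding is the weighted pooling of the semantic features. Channel
# sizes and layer choices are assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualPromptEncoder(nn.Module):
    def __init__(self, in_ch: int = 256, emb_dim: int = 512):
        super().__init__()
        self.semantic = nn.Conv2d(in_ch, emb_dim, kernel_size=1)              # prompt-agnostic
        self.activation = nn.Conv2d(in_ch + 1, 1, kernel_size=3, padding=1)   # prompt-aware

    def forward(self, feat: torch.Tensor, prompt_mask: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) image features; prompt_mask: (B, 1, H, W) visual prompt.
        sem = self.semantic(feat)                                     # (B, E, H, W)
        act = self.activation(torch.cat([feat, prompt_mask], dim=1))  # (B, 1, H, W)
        weights = F.softmax(act.flatten(2), dim=-1)                   # (B, 1, H*W)
        emb = (sem.flatten(2) * weights).sum(dim=-1)                  # (B, E)
        return F.normalize(emb, dim=-1)

encoder = VisualPromptEncoder()
prompt_embedding = encoder(torch.randn(2, 256, 64, 64), torch.rand(2, 1, 64, 64))
```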
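For the prompt-free case, LRPC is described as lazily matching regions against a built-in vocabulary via a specialized embedding instead of running a language model. A hedged sketch of that two-step filter-then-match pattern, with a hypothetical objectness threshold, is shown below.

```python
# Hedged sketch of the lazy region-prompt matching pattern described for LRPC:
# a single specialized embedding first flags regions likely to contain objects,
# and only those regions are scored against a large built-in vocabulary of
# category embeddings. The threshold and tensor names are hypothetical.
import torch
import torch.nn.functional as F

def lazy_region_prompt_contrast(region_emb, object_emb, vocab_emb, obj_thresh=0.25):
    # region_emb: (N, D) regions, object_emb: (D,) "any object" embedding,
    # vocab_emb: (V, D) built-in vocabulary embeddings.
    region_emb = F.normalize(region_emb, dim=-1)
    obj_score = region_emb @ F.normalize(object_emb, dim=0)          # (N,) objectness
    keep = obj_score > obj_thresh                                    # lazy filtering
    scores = region_emb[keep] @ F.normalize(vocab_emb, dim=-1).t()   # (N_kept, V)
    return keep.nonzero(as_tuple=True)[0], scores.argmax(dim=-1), scores.amax(dim=-1)

# Random tensors stand in for real model outputs; a trained model would produce
# calibrated scores, so the threshold here is only for illustration.
idx, labels, conf = lazy_region_prompt_contrast(
    torch.randn(300, 512), torch.randn(512), torch.randn(1200, 512), obj_thresh=0.0)
```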