EMOv2: Pushing 5M Vision Model Frontier
December 9, 2024
Authors: Jiangning Zhang, Teng Hu, Haoyang He, Zhucun Xue, Yabiao Wang, Chengjie Wang, Yong Liu, Xiangtai Li, Dacheng Tao
cs.AI
Abstract
This work focuses on developing parameter-efficient and lightweight models for dense prediction while trading off parameters, FLOPs, and performance. Our goal is to set a new frontier for 5M-magnitude lightweight models on various downstream tasks. The Inverted Residual Block (IRB) serves as the infrastructure for lightweight CNNs, but no counterpart has been recognized in attention-based designs. Our work rethinks the lightweight infrastructure of the efficient IRB and the practical components of the Transformer from a unified perspective, extending the CNN-based IRB to attention-based models and abstracting a one-residual Meta Mobile Block (MMBlock) for lightweight model design. Following a neat but effective design criterion, we deduce a modern Improved Inverted Residual Mobile Block (i2RMB) and improve a hierarchical Efficient MOdel (EMOv2) without elaborate, complex structures. Considering the imperceptible latency for mobile users downloading models over 4G/5G bandwidth while ensuring model performance, we investigate the performance upper limit of lightweight models at the 5M magnitude. Extensive experiments on various vision recognition, dense prediction, and image generation tasks demonstrate the superiority of EMOv2 over state-of-the-art methods: e.g., EMOv2-1M/2M/5M achieve 72.3, 75.8, and 79.4 Top-1 accuracy, significantly surpassing equal-order CNN- and attention-based models. Meanwhile, EMOv2-5M equipped with RetinaNet achieves 41.5 mAP on object detection, surpassing the previous EMO-5M by +2.6. With a more robust training recipe, EMOv2-5M eventually reaches 82.9 Top-1 accuracy, elevating the performance of 5M-magnitude models to a new level. Code is available at https://github.com/zhangzjn/EMOv2.
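To make the architectural idea concrete, below is a minimal, illustrative PyTorch sketch of a one-residual meta mobile block: a pointwise expansion, an interchangeable spatial token mixer (a depthwise convolution for the CNN case, self-attention for the Transformer case), and a pointwise projection, all wrapped by a single residual connection. This is not the official EMOv2/i2RMB implementation; the module name MetaMobileBlock, the mixer argument, and all hyperparameters are assumptions for illustration only.

```python
# Illustrative sketch of a one-residual "meta mobile block" as described in the
# abstract: expand -> token mixing -> project, wrapped by a single residual.
# NOT the official EMOv2 code; names and hyperparameters are assumptions.
import torch
import torch.nn as nn


class MetaMobileBlock(nn.Module):
    def __init__(self, dim: int, expand_ratio: int = 4,
                 mixer: str = "dwconv", num_heads: int = 4):
        super().__init__()
        hidden = dim * expand_ratio
        # Pointwise expansion, as in an inverted residual block.
        self.expand = nn.Sequential(
            nn.Conv2d(dim, hidden, 1), nn.BatchNorm2d(hidden), nn.SiLU())
        # Interchangeable spatial token mixer: depthwise conv (CNN-style)
        # or multi-head self-attention (Transformer-style).
        if mixer == "dwconv":
            self.mixer = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        elif mixer == "attention":
            self.mixer = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        else:
            raise ValueError(f"unknown mixer: {mixer}")
        self.mixer_type = mixer
        # Pointwise projection back to the input width.
        self.project = nn.Sequential(nn.Conv2d(hidden, dim, 1), nn.BatchNorm2d(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); one residual wraps the whole block.
        shortcut = x
        x = self.expand(x)
        if self.mixer_type == "attention":
            b, c, h, w = x.shape
            t = x.flatten(2).transpose(1, 2)   # (B, H*W, C) token sequence
            t, _ = self.mixer(t, t, t)         # self-attention token mixing
            x = t.transpose(1, 2).reshape(b, c, h, w)
        else:
            x = self.mixer(x)                  # depthwise-conv token mixing
        x = self.project(x)
        return x + shortcut


if __name__ == "__main__":
    blk = MetaMobileBlock(dim=32, mixer="attention")
    out = blk(torch.randn(1, 32, 14, 14))
    print(out.shape)  # torch.Size([1, 32, 14, 14])
```

Under this reading, instantiating the mixer with a depthwise convolution recovers an IRB-like CNN block, while the attention variant corresponds to the attention-based extension the abstract describes; the actual i2RMB design should be taken from the released code.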