
EMOv2: Pushing 5M Vision Model Frontier

December 9, 2024
Authors: Jiangning Zhang, Teng Hu, Haoyang He, Zhucun Xue, Yabiao Wang, Chengjie Wang, Yong Liu, Xiangtai Li, Dacheng Tao
cs.AI

Abstract

This work focuses on developing parameter-efficient and lightweight models for dense prediction while trading off parameters, FLOPs, and performance. Our goal is to set up a new frontier for 5M-magnitude lightweight models on various downstream tasks. The Inverted Residual Block (IRB) serves as the infrastructure for lightweight CNNs, but no counterpart has been recognized for attention-based designs. Our work rethinks the lightweight infrastructure of the efficient IRB and the practical components of the Transformer from a unified perspective, extending the CNN-based IRB to attention-based models and abstracting a one-residual Meta Mobile Block (MMBlock) for lightweight model design. Following a neat but effective design criterion, we deduce a modern Improved Inverted Residual Mobile Block (i2RMB) and build a hierarchical Efficient MOdel (EMOv2) without elaborate, complex structures. Considering the imperceptible latency for mobile users when downloading models under 4G/5G bandwidth, while ensuring model performance, we investigate the performance upper limit of lightweight models at the 5M magnitude. Extensive experiments on various vision recognition, dense prediction, and image generation tasks demonstrate the superiority of our EMOv2 over state-of-the-art methods; e.g., EMOv2-1M/2M/5M achieve 72.3, 75.8, and 79.4 Top-1 accuracy, significantly surpassing equal-order CNN- and attention-based models. Meanwhile, EMOv2-5M equipped with RetinaNet achieves 41.5 mAP on object detection, surpassing the previous EMO-5M by +2.6. When employing a more robust training recipe, our EMOv2-5M eventually achieves 82.9 Top-1 accuracy, elevating the performance of 5M-magnitude models to a new level. Code is available at https://github.com/zhangzjn/EMOv2.
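To make the MMBlock abstraction concrete, below is a minimal PyTorch sketch, not the authors' implementation, of the one-residual idea the abstract describes: expand channels, apply an interchangeable efficient spatial operator (a depth-wise convolution for an IRB-like block, or self-attention for an attention-based block), then project back, all inside a single residual connection. The class name MetaMobileBlockSketch and its parameters are illustrative assumptions; see the official repository for the actual i2RMB/EMOv2 code.

```python
# Minimal sketch (assumed structure, not the official EMOv2 code) of a
# one-residual Meta-Mobile-Block-style module: expansion -> efficient
# operator F -> projection, wrapped in a single residual connection.
import torch
import torch.nn as nn


class MetaMobileBlockSketch(nn.Module):
    def __init__(self, dim: int, expansion: int = 4,
                 use_attention: bool = False, num_heads: int = 4):
        super().__init__()
        hidden = dim * expansion
        # Expansion: 1x1 conv, analogous to the first layer of a Transformer FFN.
        self.expand = nn.Sequential(
            nn.Conv2d(dim, hidden, 1), nn.BatchNorm2d(hidden), nn.SiLU())
        # Interchangeable efficient operator F:
        # depth-wise conv (CNN view) or spatial self-attention (Transformer view).
        self.use_attention = use_attention
        if use_attention:
            self.op = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        else:
            self.op = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        # Projection: 1x1 conv, analogous to the second layer of an FFN.
        self.project = nn.Sequential(nn.Conv2d(hidden, dim, 1), nn.BatchNorm2d(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        y = self.expand(x)
        if self.use_attention:
            tokens = y.flatten(2).transpose(1, 2)        # (B, H*W, hidden)
            tokens, _ = self.op(tokens, tokens, tokens)  # spatial self-attention
            y = tokens.transpose(1, 2).reshape(b, -1, h, w)
        else:
            y = self.op(y)                               # depth-wise spatial mixing
        return x + self.project(y)                       # single residual connection


if __name__ == "__main__":
    x = torch.randn(2, 64, 14, 14)
    print(MetaMobileBlockSketch(64, use_attention=False)(x).shape)  # CNN-style block
    print(MetaMobileBlockSketch(64, use_attention=True)(x).shape)   # attention-style block
```

Switching the `use_attention` flag is what the unified perspective buys: the same residual skeleton instantiates either a classic inverted residual block or an attention-based block, which is the design space EMOv2's i2RMB refines.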

