
EMOv2: Pushing 5M Vision Model Frontier

December 9, 2024
Authors: Jiangning Zhang, Teng Hu, Haoyang He, Zhucun Xue, Yabiao Wang, Chengjie Wang, Yong Liu, Xiangtai Li, Dacheng Tao
cs.AI

Abstract

This work focuses on developing parameter-efficient and lightweight models for dense prediction while trading off parameters, FLOPs, and performance. Our goal is to set up the new frontier of 5M-magnitude lightweight models on various downstream tasks. The Inverted Residual Block (IRB) serves as the infrastructure for lightweight CNNs, but no counterpart has been recognized for attention-based designs. Our work rethinks the lightweight infrastructure of the efficient IRB and the practical components of the Transformer from a unified perspective, extending the CNN-based IRB to attention-based models and abstracting a one-residual Meta Mobile Block (MMBlock) for lightweight model design. Following a neat but effective design criterion, we deduce a modern Improved Inverted Residual Mobile Block (i2RMB) and build a hierarchical Efficient MOdel (EMOv2) with no elaborate complex structures. Considering the imperceptible latency for mobile users when downloading models under 4G/5G bandwidth, and ensuring model performance, we investigate the performance upper limit of lightweight models at the 5M magnitude. Extensive experiments on various vision recognition, dense prediction, and image generation tasks demonstrate the superiority of our EMOv2 over state-of-the-art methods; e.g., EMOv2-1M/2M/5M achieve 72.3, 75.8, and 79.4 Top-1 accuracy, significantly surpassing equal-order CNN- and attention-based models. Meanwhile, EMOv2-5M equipped with RetinaNet achieves 41.5 mAP on object detection, surpassing the previous EMO-5M by +2.6. When employing a more robust training recipe, our EMOv2-5M eventually achieves 82.9 Top-1 accuracy, elevating the performance of 5M-magnitude models to a new level. Code is available at https://github.com/zhangzjn/EMOv2.
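To make the one-residual Meta Mobile Block abstraction concrete, the sketch below shows the expand → token-mix → project pattern with a single residual connection, in PyTorch. It is a minimal illustration, not the authors' implementation: the class name `MetaMobileBlockSketch`, the expansion ratio, and the use of a depthwise convolution as the efficient mixing operator (in place of the attention-based mixer used by i2RMB) are all simplifying assumptions.

```python
# Minimal sketch of a one-residual Meta-Mobile-Block-style unit (assumptions noted above).
import torch
import torch.nn as nn


class MetaMobileBlockSketch(nn.Module):
    def __init__(self, dim: int, expansion: float = 4.0):
        super().__init__()
        hidden = int(dim * expansion)
        self.norm = nn.BatchNorm2d(dim)
        # Expansion: 1x1 conv, analogous to the IRB expansion / FFN up-projection.
        self.expand = nn.Conv2d(dim, hidden, kernel_size=1)
        # Efficient spatial mixing operator; a depthwise conv stands in for the
        # attention-based mixer that the paper's i2RMB employs.
        self.mixer = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)
        self.act = nn.SiLU()
        # Projection: 1x1 conv back to the input width (IRB shrink / FFN down-projection).
        self.project = nn.Conv2d(hidden, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Single residual connection wraps the whole expand -> mix -> project path.
        shortcut = x
        x = self.norm(x)
        x = self.act(self.expand(x))
        x = self.act(self.mixer(x))
        x = self.project(x)
        return x + shortcut


if __name__ == "__main__":
    block = MetaMobileBlockSketch(dim=64)
    out = block(torch.randn(1, 64, 56, 56))
    print(out.shape)  # torch.Size([1, 64, 56, 56])
```

Swapping the depthwise-conv mixer for a local or spanning attention operator, and stacking such blocks hierarchically, recovers the design space the abstract describes.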

