iFormer: Integrating ConvNet and Transformer for Mobile Application
January 26, 2025
Author: Chuanyang Zheng
cs.AI
Abstract
We present a new family of mobile hybrid vision networks, called iFormer,
with a focus on optimizing latency and accuracy on mobile applications. iFormer
effectively integrates the fast local representation capacity of convolution
with the efficient global modeling ability of self-attention. The local
interactions are derived from transforming a standard convolutional network,
i.e., ConvNeXt, to design a more lightweight mobile network. Our newly
introduced mobile modulation attention removes memory-intensive operations in
multi-head attention (MHA) and employs an efficient modulation mechanism to boost dynamic global
representational capacity. We conduct comprehensive experiments demonstrating
that iFormer outperforms existing lightweight networks across various tasks.
Notably, iFormer achieves an impressive Top-1 accuracy of 80.4% on ImageNet-1k
with a latency of only 1.10 ms on an iPhone 13, surpassing the recently
proposed MobileNetV4 under similar latency constraints. Additionally, our
method shows significant improvements in downstream tasks, including COCO
object detection, instance segmentation, and ADE20k semantic segmentation,
while still maintaining low latency on mobile devices for high-resolution
inputs in these scenarios.
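The abstract does not specify how the modulation mechanism works internally, but the general idea of "modulated attention" can be illustrated with a minimal NumPy sketch: a standard attention context is reweighted elementwise by a cheap, convolution-style local gate. All function and parameter names below (`modulation_attention`, `wq`, `wk`, `wv`, `wg`) are hypothetical illustrations, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def modulation_attention(x, wq, wk, wv, wg):
    """Hypothetical sketch: global attention context modulated by a local gate.

    x: (n, d) token features; wq/wk/wv/wg: (d, d) projection weights.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    # Global context per token via scaled dot-product attention.
    ctx = softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v
    # Cheap elementwise sigmoid gate derived from the local features.
    gate = 1.0 / (1.0 + np.exp(-(x @ wg)))
    # Elementwise modulation of the global context by the local gate.
    return ctx * gate
```

This sketches only the modulation principle; the paper's actual design additionally removes memory-intensive MHA operations, which this toy version does not attempt to reproduce.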