

Your ViT is Secretly an Image Segmentation Model

March 24, 2025
Authors: Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, Daan de Geus
cs.AI

Abstract

Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks. To apply single-scale ViTs to image segmentation, existing methods adopt a convolutional adapter to generate multi-scale features, a pixel decoder to fuse these features, and a Transformer decoder that uses the fused features to make predictions. In this paper, we show that the inductive biases introduced by these task-specific components can instead be learned by the ViT itself, given sufficiently large models and extensive pre-training. Based on these findings, we introduce the Encoder-only Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct image segmentation. With large-scale models and pre-training, EoMT obtains a segmentation accuracy similar to state-of-the-art models that use task-specific components. At the same time, EoMT is significantly faster than these methods due to its architectural simplicity, e.g., up to 4x faster with ViT-L. Across a range of model sizes, EoMT demonstrates an optimal balance between segmentation accuracy and prediction speed, suggesting that compute resources are better spent on scaling the ViT itself rather than adding architectural complexity. Code: https://www.tue-mps.org/eomt/.
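The abstract's core claim is architectural: instead of bolting a convolutional adapter, a pixel decoder, and a separate Transformer decoder onto a ViT, EoMT lets the plain encoder handle segmentation directly. Below is a minimal PyTorch sketch of that encoder-only idea, written from the abstract alone: a set of learnable query tokens is concatenated with the patch tokens, processed by standard ViT blocks, and mapped to per-query class and mask logits. All module names, dimensions, and the dot-product mask head are illustrative assumptions, not the authors' implementation; positional embeddings and large-scale pre-training (which the paper identifies as essential) are omitted for brevity.

```python
# Minimal sketch of an encoder-only mask Transformer in the spirit of EoMT.
# NOT the authors' code: layer sizes, the linear mask head, and all names are
# illustrative assumptions based only on the abstract.
import torch
import torch.nn as nn


class EncoderOnlyMaskTransformerSketch(nn.Module):
    def __init__(self, dim=1024, depth=24, heads=16, patch=16,
                 num_queries=100, num_classes=150):
        super().__init__()
        self.patch = patch
        self.num_queries = num_queries
        # Plain ViT ingredients: patch embedding + a stack of standard encoder blocks.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(block, depth)
        # Learnable query tokens stand in for the usual Transformer-decoder queries.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"
        self.mask_proj = nn.Linear(dim, dim)

    def forward(self, images):
        B, _, H, W = images.shape
        patch_tokens = self.patch_embed(images).flatten(2).transpose(1, 2)  # [B, N, C]
        queries = self.queries.unsqueeze(0).expand(B, -1, -1)               # [B, Q, C]
        # Joint self-attention over patch and query tokens: the ViT itself learns
        # the feature mixing that adapters, pixel decoders, and mask decoders
        # normally provide as hand-designed inductive biases.
        tokens = self.blocks(torch.cat([patch_tokens, queries], dim=1))
        patch_tokens = tokens[:, :-self.num_queries]
        queries = tokens[:, -self.num_queries:]
        class_logits = self.class_head(queries)                             # [B, Q, K+1]
        # Per-query mask logits as a dot product between projected queries
        # and patch tokens, reshaped to the patch grid.
        mask_logits = torch.einsum("bqc,bnc->bqn", self.mask_proj(queries), patch_tokens)
        mask_logits = mask_logits.reshape(B, -1, H // self.patch, W // self.patch)
        return class_logits, mask_logits


if __name__ == "__main__":
    # Small configuration just to exercise the forward pass.
    model = EncoderOnlyMaskTransformerSketch(dim=256, depth=4, heads=8)
    cls_logits, mask_logits = model(torch.randn(2, 3, 224, 224))
    print(cls_logits.shape, mask_logits.shape)  # [2, 100, 151], [2, 100, 14, 14]
```

The sketch makes the paper's efficiency argument concrete: the only components beyond the plain ViT are a query embedding and two linear heads, so nearly all compute goes into the encoder itself rather than task-specific machinery.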
