Your ViT is Secretly an Image Segmentation Model

Abstract

Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks. To apply single-scale ViTs to image segmentation, existing methods adopt a convolutional adapter to generate multi-scale features, a pixel decoder to fuse these features, and a Transformer decoder that uses the fused features to make predictions. In this paper, we show that the inductive biases introduced by these task-specific components can instead be learned by the ViT itself, given sufficiently large models and extensive pre-training. Based on these findings, we introduce the Encoder-only Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct image segmentation. With large-scale models and pre-training, EoMT obtains a segmentation accuracy similar to state-of-the-art models that use task-specific components. At the same time, EoMT is significantly faster than these methods due to its architectural simplicity, e.g., up to 4x faster with ViT-L. Across a range of model sizes, EoMT demonstrates an optimal balance between segmentation accuracy and prediction speed, suggesting that compute resources are better spent on scaling the ViT itself rather than adding architectural complexity. Code: https://www.tue-mps.org/eomt/.
