Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization

Abstract

Computer Vision (CV) has yet to fully achieve the zero-shot task generalization observed in Natural Language Processing (NLP), despite following many of the milestones established in NLP, such as large transformer models, extensive pre-training, and the auto-regression paradigm. In this paper, we explore the idea that CV's adoption of discrete, terminological task definitions (e.g., "image segmentation") may be a key barrier to zero-shot task generalization. Our hypothesis is that these terminological definitions prevent deep models from truly understanding previously seen tasks, and that models therefore struggle to generalize to novel tasks. To verify this, we introduce Explanatory Instructions, which provide an intuitive way to define CV task objectives through detailed linguistic transformations from input images to outputs. We create a large-scale dataset comprising 12 million "image input to explanatory instruction to output" triplets, and train an auto-regressive-based vision-language model (AR-based VLM) that takes both images and explanatory instructions as input. By learning to follow these instructions, the AR-based VLM achieves instruction-level zero-shot capabilities for previously seen tasks and demonstrates strong zero-shot generalization for unseen CV tasks. Code and dataset will be openly available on our GitHub repository.
