MMFactory: A Universal Solution Search Engine for Vision-Language Tasks

Abstract

With advances in foundational and vision-language models, and effective fine-tuning techniques, a large number of both general and special-purpose models have been developed for a variety of visual tasks. Despite the flexibility and accessibility of these models, no single model is able to handle all tasks and/or applications that may be envisioned by potential users. Recent approaches, such as visual programming and multimodal LLMs with integrated tools, aim to tackle complex visual tasks by way of program synthesis. However, such approaches overlook user constraints (e.g., performance/computational needs), produce test-time sample-specific solutions that are difficult to deploy, and sometimes require low-level instructions that may be beyond the abilities of a naive user. To address these limitations, we introduce MMFactory, a universal framework that includes model and metrics routing components, acting like a solution search engine across various available models. Based on a task description, a few sample input-output pairs, and (optionally) resource and/or performance constraints, MMFactory can suggest a diverse pool of programmatic solutions by instantiating and combining visio-lingual tools from its model repository. In addition to synthesizing these solutions, MMFactory also proposes metrics and benchmarks the solutions' performance/resource characteristics, allowing users to pick a solution that meets their unique design constraints. From the technical perspective, we also introduce a committee-based solution proposer that leverages multi-agent LLM conversation to generate executable, diverse, universal, and robust solutions for the user. Experimental results show that MMFactory outperforms existing methods by delivering state-of-the-art solutions tailored to user problem specifications. The project page is available at https://davidhalladay.github.io/mmfactory_demo.
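The final step described above, in which a user picks a solution from a benchmarked pool according to their own constraints, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Candidate` record, the `select` function, the solution names, and the accuracy and latency figures are hypothetical placeholders, not MMFactory's actual interface and not results reported by the authors.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """One benchmarked solution. All fields are hypothetical."""
    name: str
    accuracy: float   # score on the user's sample pairs, in [0, 1]
    latency_s: float  # mean seconds per sample


def select(pool: List[Candidate],
           min_accuracy: float = 0.0,
           max_latency_s: Optional[float] = None) -> List[Candidate]:
    """Return the candidates that meet the constraints, most accurate first."""
    feasible = [
        c for c in pool
        if c.accuracy >= min_accuracy
        and (max_latency_s is None or c.latency_s <= max_latency_s)
    ]
    # Sort by accuracy (descending); break ties by lower latency.
    return sorted(feasible, key=lambda c: (-c.accuracy, c.latency_s))


# Made-up pool: the names and numbers are placeholders, not measurements.
pool = [
    Candidate("large_mllm_only", accuracy=0.80, latency_s=4.0),
    Candidate("detector_plus_small_mllm", accuracy=0.70, latency_s=1.5),
    Candidate("captioner_plus_llm", accuracy=0.60, latency_s=0.8),
]

best = select(pool, min_accuracy=0.65, max_latency_s=2.0)
print([c.name for c in best])  # ['detector_plus_small_mllm']
```

With no constraints, `select(pool)` simply ranks the whole pool by accuracy; tightening either bound narrows the set a user would choose from.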
