MMFactory: A Universal Solution Search Engine for Vision-Language Tasks

Abstract

With advances in foundational and vision-language models, and effective fine-tuning techniques, a large number of both general and special-purpose models have been developed for a variety of visual tasks. Despite the flexibility and accessibility of these models, no single model is able to handle all tasks and/or applications that may be envisioned by potential users. Recent approaches, such as visual programming and multimodal LLMs with integrated tools, aim to tackle complex visual tasks by way of program synthesis. However, such approaches overlook user constraints (e.g., performance / computational needs), produce test-time sample-specific solutions that are difficult to deploy, and sometimes require low-level instructions that may be beyond the abilities of a naive user. To address these limitations, we introduce MMFactory, a universal framework that includes model and metrics routing components and acts like a solution search engine across various available models. Given a task description, a few sample input-output pairs, and (optionally) resource and/or performance constraints, MMFactory can suggest a diverse pool of programmatic solutions by instantiating and combining visio-lingual tools from its model repository. In addition to synthesizing these solutions, MMFactory also proposes metrics and benchmarks performance / resource characteristics, allowing users to pick a solution that meets their unique design constraints. From the technical perspective, we also introduce a committee-based solution proposer that leverages multi-agent LLM conversation to generate executable, diverse, universal, and robust solutions for the user. Experimental results show that MMFactory outperforms existing methods by delivering state-of-the-art solutions tailored to user problem specifications. The project page is available at https://davidhalladay.github.io/mmfactory_demo.
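
The final selection step described above, where a user picks a solution from the benchmarked pool under design constraints, can be illustrated with a minimal Python sketch. This is not MMFactory's actual interface: the `CandidateSolution` fields, the `pick_solution` helper, the solution names, and all numbers are hypothetical placeholders chosen only to show the idea of filtering a pool by resource and performance limits.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CandidateSolution:
    # Hypothetical schema; the abstract does not specify one.
    name: str
    accuracy: float            # score on the user's sample pairs, in [0, 1]
    seconds_per_sample: float  # measured compute cost per input


def pick_solution(
    pool: List[CandidateSolution],
    min_accuracy: float = 0.0,
    max_seconds: Optional[float] = None,
) -> Optional[CandidateSolution]:
    """Return the most accurate candidate that satisfies the constraints."""
    feasible = [
        s for s in pool
        if s.accuracy >= min_accuracy
        and (max_seconds is None or s.seconds_per_sample <= max_seconds)
    ]
    if not feasible:
        return None  # no candidate meets the user's constraints
    # Prefer higher accuracy; break ties by lower cost.
    return max(feasible, key=lambda s: (s.accuracy, -s.seconds_per_sample))


# Illustrative placeholder pool, not results from the paper.
pool = [
    CandidateSolution("single_small_vlm", 0.62, 0.4),
    CandidateSolution("detector_plus_llm", 0.71, 1.5),
    CandidateSolution("large_vlm_ensemble", 0.78, 6.0),
]

choice = pick_solution(pool, max_seconds=2.0)
print(choice.name if choice else "no feasible solution")
```

With a 2-second budget the sketch returns the mid-cost candidate; with no budget it returns the most accurate one. This mirrors, in simplified form, the trade-off the framework is described as exposing to the user.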
