MMFactory: A Universal Solution Search Engine for Vision-Language Tasks

Abstract

With advances in foundational and vision-language models, and effective fine-tuning techniques, a large number of both general and special-purpose models have been developed for a variety of visual tasks. Despite the flexibility and accessibility of these models, no single model is able to handle all tasks and/or applications that may be envisioned by potential users. Recent approaches, such as visual programming and multimodal LLMs with integrated tools, aim to tackle complex visual tasks by way of program synthesis. However, such approaches overlook user constraints (e.g., performance / computational needs), produce test-time sample-specific solutions that are difficult to deploy, and sometimes require low-level instructions that may be beyond the abilities of a naive user. To address these limitations, we introduce MMFactory, a universal framework that includes model and metrics routing components, acting like a solution search engine across various available models. Based on a task description, a few sample input-output pairs, and (optionally) resource and/or performance constraints, MMFactory can suggest a diverse pool of programmatic solutions by instantiating and combining visio-lingual tools from its model repository. In addition to synthesizing these solutions, MMFactory also proposes metrics and benchmarks performance / resource characteristics, allowing users to pick a solution that meets their unique design constraints. From the technical perspective, we also introduce a committee-based solution proposer that leverages multi-agent LLM conversation to generate executable, diverse, universal, and robust solutions for the user. Experimental results show that MMFactory outperforms existing methods by delivering state-of-the-art solutions tailored to user problem specifications. The project page is available at https://davidhalladay.github.io/mmfactory_demo.
