MMFactory: A Universal Solution Search Engine for Vision-Language Tasks

Abstract

With advances in foundation and vision-language models, and in effective fine-tuning techniques, a large number of both general and special-purpose models have been developed for a variety of visual tasks. Despite the flexibility and accessibility of these models, no single model can handle all the tasks and applications that potential users may envision. Recent approaches, such as visual programming and multimodal LLMs with integrated tools, aim to tackle complex visual tasks by way of program synthesis. However, such approaches overlook user constraints (e.g., performance or computational needs), produce test-time, sample-specific solutions that are difficult to deploy, and sometimes require low-level instructions that may be beyond the abilities of a naive user. To address these limitations, we introduce MMFactory, a universal framework that includes model and metrics routing components and acts like a solution search engine across the available models. Given a task description, a few sample input-output pairs, and (optionally) resource and/or performance constraints, MMFactory can suggest a diverse pool of programmatic solutions by instantiating and combining vision-language tools from its model repository. In addition to synthesizing these solutions, MMFactory proposes metrics and benchmarks their performance and resource characteristics, allowing users to pick a solution that meets their unique design constraints. From the technical perspective, we also introduce a committee-based solution proposer that leverages multi-agent LLM conversation to generate executable, diverse, universal, and robust solutions for the user. Experimental results show that MMFactory outperforms existing methods by delivering state-of-the-art solutions tailored to user problem specifications. The project page is available at https://davidhalladay.github.io/mmfactory_demo.
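The final step described above, where a user picks a solution that meets their constraints from a benchmarked pool, can be illustrated with a minimal sketch. This is not MMFactory's code or API: the class, function, field names, and solution names are hypothetical, and the accuracy and timing numbers are made-up placeholders rather than reported results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkedSolution:
    """A candidate program with measured characteristics (hypothetical fields)."""
    name: str
    accuracy: float            # score on the user's sample pairs, in [0, 1]
    seconds_per_sample: float  # measured compute cost per input


def pick_solution(pool, min_accuracy=0.0, max_seconds=None):
    """Return the most accurate solution that satisfies the constraints, or None."""
    feasible = [
        s for s in pool
        if s.accuracy >= min_accuracy
        and (max_seconds is None or s.seconds_per_sample <= max_seconds)
    ]
    if not feasible:
        return None
    # Prefer higher accuracy; break ties in favor of lower cost.
    return max(feasible, key=lambda s: (s.accuracy, -s.seconds_per_sample))


# Placeholder pool; the names and numbers are illustrative assumptions only.
pool = [
    BenchmarkedSolution("large_vlm_only", accuracy=0.82, seconds_per_sample=4.0),
    BenchmarkedSolution("detector_plus_llm", accuracy=0.78, seconds_per_sample=1.2),
    BenchmarkedSolution("small_vlm_only", accuracy=0.61, seconds_per_sample=0.3),
]

choice = pick_solution(pool, min_accuracy=0.7, max_seconds=2.0)
print(choice.name if choice else "no feasible solution")
```

With these placeholder values, the most accurate candidate is excluded by the time budget and the cheapest one by the accuracy floor, which is the kind of trade-off the benchmarked characteristics are meant to expose.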
