MMFactory: A Universal Solution Search Engine for Vision-Language Tasks

¹University of British Columbia, ²Vector Institute for AI, ³CIFAR AI Chair

Abstract

We introduce MMFactory, a universal framework that includes model and metrics routing components, acting like a solution search engine across various available models. Based on a task description, a few sample input-output pairs, and (optionally) resource and/or performance constraints, MMFactory can suggest a diverse pool of programmatic solutions by instantiating and combining visio-lingual tools (e.g., detection, segmentation, VLMs) from its model repository. In addition to synthesizing these solutions, MMFactory also proposes metrics and benchmarks performance and resource characteristics, allowing users to pick a solution that meets their unique design constraints. From a technical perspective, we also introduce a committee-based solution proposer that leverages multi-agent LLM conversation to generate executable, diverse, universal, and robust solutions for the user.

Methodology

Step-by-step process of MMFactory.

Multi-agent framework as the solution router.
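
The abstract describes a final step in which benchmarked solutions are presented so that users can pick one that meets their constraints. The following is a minimal sketch of that selection step only, not MMFactory's actual code or API. The names `BenchmarkedSolution` and `select_solutions`, the two measured fields, and all numbers in the example pool are hypothetical and chosen purely for illustration; they are not results from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BenchmarkedSolution:
    """One candidate solution with its measured characteristics (hypothetical)."""
    name: str
    accuracy: float       # score on the user's sample pairs, in [0, 1]
    cost_per_call: float  # resource cost in whatever unit the user tracks


def select_solutions(
    pool: List[BenchmarkedSolution],
    min_accuracy: Optional[float] = None,
    max_cost: Optional[float] = None,
) -> List[BenchmarkedSolution]:
    """Keep solutions that satisfy the optional constraints.

    The result is sorted by accuracy (highest first), with ties broken
    by lower cost.
    """
    feasible = [
        s for s in pool
        if (min_accuracy is None or s.accuracy >= min_accuracy)
        and (max_cost is None or s.cost_per_call <= max_cost)
    ]
    return sorted(feasible, key=lambda s: (-s.accuracy, s.cost_per_call))


# Illustrative pool with made-up numbers (assumption, not measured data).
pool = [
    BenchmarkedSolution("vlm_only", accuracy=0.82, cost_per_call=0.020),
    BenchmarkedSolution("detector_plus_vlm", accuracy=0.88, cost_per_call=0.035),
    BenchmarkedSolution("detector_only", accuracy=0.61, cost_per_call=0.002),
]

constrained = select_solutions(pool, min_accuracy=0.8, max_cost=0.03)
unconstrained = select_solutions(pool)
```

With both constraints set, only candidates inside the accuracy and cost budget remain; with no constraints, the whole pool is returned in ranked order, which mirrors the "optional" role of constraints described in the abstract.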

Acknowledgements

This work was funded, in part, by the Vector Institute for AI, Canada CIFAR AI Chairs, NSERC Canada Research Chair (CRC), and NSERC Discovery and Discovery Accelerator Supplement Grants. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, the Digital Research Alliance of Canada, companies sponsoring the Vector Institute, and Advanced Research Computing at the University of British Columbia. Additional hardware support was provided by the John R. Evans Leaders Fund CFI grant and by Compute Canada under the Resource Allocation Competition award.
