Link to paper

The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract

- The proposed system paradigm integrates ChatGPT with a pool of vision experts
- Defines and explores a comprehensive list of advanced vision tasks
- Textual prompt design allows language models to accept, associate, and process multimodal information
- Zero-shot experiments demonstrate effectiveness in addressing the specified capabilities
- Discusses and compares the system paradigm with an alternative approach

Paper Content

Introduction

- Recent years have seen significant advances in computer vision
- Different vision problems require different models
- One research direction is to combine vision and language modules
- Large language models (LLMs) have shown impressive dialogue capability
- NLP research has demonstrated the effectiveness of integrating external NLP tools with LLMs
- MM-REACT combines vision experts with ChatGPT for multimodal reasoning and action
- MM-REACT provides extra flexibility in module upgrades

Related work

- LLMs have strong chain-of-thought capabilities
- LLMs can use external NLP tools to solve problems
- LLM reasoning and action-taking have mostly been studied separately rather than together
- Recent studies have attempted to merge reasoning and action for LLMs
- MM-REACT uses vision tools as executable actions
- MM-REACT uses ChatGPT to determine which vision expert to invoke

User input

- ChatGPT only accepts text as input
- File paths are used to indicate non-text inputs
- Vision experts are used to understand image content from different perspectives

ChatGPT response

- ChatGPT is expected to provide two kinds of responses
- The key challenge is to set up a protocol for knowing when to invoke a vision expert
- The keyword “Assistant” is used to distinguish whether a vision expert is required
- ChatGPT is encouraged to show its thought process to highlight why an external tool is required

Vision experts

- Use regular expression matching to parse the expert name and file path
- Standardize outputs into text format
- Represent the output of a detection model as <object name, x1, y1, x2, y2>
- Add a text description to explain the numerical values
- Inject knowledge of the vision experts’ usages into the prefix

Extensibility

- Motivated by ReAct, which uses NLP tools
- Extended to the vision domain by replacing the non-text modality with a path string
- Can be extended to other modalities, such as speech and audio
- Can incorporate more tools by formatting their outputs in text format
- Performance can be enhanced by upgrading to a more powerful LLM

Experiments

Experiment setup

- Implemented MM-REACT based on the LangChain codebase and ReAct
- Accessed ChatGPT via the Azure API with a token length limit of 4096
- Utilized vision experts from Azure Cognitive Services APIs
- Expanded the toolset with customized tools for spatial understanding and image editing
- Examples of capabilities and application scenarios are shown in Figures 4-14
- Unfolded steps are shown in Figure 18
- The LLM is upgraded from ChatGPT to GPT-4 in Figures 23 and 24
- An image editing tool from X-Decoder is plugged in for Figure 25

Limitations

- Recognition capability in the wild is hard to evaluate with accuracy numbers due to the lack of annotated benchmarks
- Vision capability is limited by the integrated vision experts
- Knowledge is injected in the prefix, so it is limited by the context window
- Visual signals are converted to text for ChatGPT to understand
- Manual prompt engineering is required for MM-REACT

Conclusion

- MM-REACT is a system paradigm that combines multimodal reasoning and action to solve visual understanding problems....
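The response and vision-expert steps summarized above (a keyword that signals an expert is needed, regular expression parsing of the expert name and file path, and detection output rendered as text) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the request format `Assistant, <expert name>: <file path>`, the names `ACTION_PATTERN`, `ActionRequest`, `Detection`, `parse_action_request`, and `detections_to_text`, and the wording of the explanatory header are all assumptions made for the example.

```python
import re
from typing import List, NamedTuple, Optional

# Hypothetical request format (an assumption, not taken from the paper):
#   "Assistant, <expert name>: <file path>"
ACTION_PATTERN = re.compile(r"Assistant,\s*(?P<expert>[\w ]+?)\s*:\s*(?P<path>\S+)")


class ActionRequest(NamedTuple):
    expert: str  # name of the vision expert to invoke
    path: str    # file path standing in for the non-text input


class Detection(NamedTuple):
    name: str
    x1: int
    y1: int
    x2: int
    y2: int


def parse_action_request(response: str) -> Optional[ActionRequest]:
    """Return the requested expert and file path, or None if the
    response does not contain the keyword and is meant for the user."""
    match = ACTION_PATTERN.search(response)
    if match is None:
        return None
    return ActionRequest(match.group("expert"), match.group("path"))


def detections_to_text(detections: List[Detection]) -> str:
    """Serialize detections as <object name, x1, y1, x2, y2> lines,
    preceded by a sentence explaining the numbers.
    The corner convention in the header is an assumption."""
    header = (
        "Objects and their locations in the image, as "
        "<object name, x1, y1, x2, y2>, where (x1, y1) is the top-left "
        "corner and (x2, y2) is the bottom-right corner in pixels:"
    )
    rows = [f"<{d.name}, {d.x1}, {d.y1}, {d.x2}, {d.y2}>" for d in detections]
    return "\n".join([header, *rows])
```

In a loop built on this sketch, a response for which `parse_action_request` returns `None` would be shown to the user directly, while a parsed request would be dispatched to the named expert and the text from `detections_to_text` (or a similar formatter) appended to the conversation for the next ChatGPT turn.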