RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition

February 2024

Figure: Overall framework

Abstract

In this paper, we highlight the potential of combining retrieving and ranking with multi-modal large language models (MLLMs) to revolutionize perception tasks such as fine-grained recognition, zero-shot image recognition, and few-shot object recognition. Motivated by the limited zero-shot/few-shot performance of CLIP and MLLMs on fine-grained datasets, our RAR designs a pipeline that uses an MLLM to rank the retrieved results. Our proposed approach can be seamlessly integrated into various MLLMs for real-world applications where the variety and volume of categories continuously expand. Our method opens up new avenues for research in augmenting the MLLM's abilities with retrieval-augmented solutions and could benefit other tasks such as reasoning and generation in future work.
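The retrieve-then-rank idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the embedding memory, the cosine-similarity retrieval, the prompt wording, and the names `retrieve_top_k`, `build_ranking_prompt`, and `retrieve_and_rank` are all assumptions made for illustration, and the MLLM is replaced by a stub callable so the example runs without any model.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(
    query: Sequence[float], memory: Dict[str, Sequence[float]], k: int
) -> List[str]:
    """Return the k category names whose stored embeddings best match the query."""
    ranked = sorted(memory, key=lambda name: cosine(query, memory[name]), reverse=True)
    return ranked[:k]


def build_ranking_prompt(candidates: List[str]) -> str:
    """Hypothetical prompt wording; not taken from the paper."""
    listing = ", ".join(candidates)
    return (
        "Sort these candidate categories from most to least likely "
        f"for the image: {listing}"
    )


def retrieve_and_rank(
    query: Sequence[float],
    memory: Dict[str, Sequence[float]],
    k: int,
    ranker: Callable[[str, List[str]], List[str]],
) -> List[str]:
    """Retrieve k candidates, then let a ranker (standing in for an MLLM) reorder them."""
    candidates = retrieve_top_k(query, memory, k)
    prompt = build_ranking_prompt(candidates)
    return ranker(prompt, candidates)


# Toy data: hand-written 3-d vectors standing in for image/category embeddings.
memory = {
    "sparrow": [1.0, 0.0, 0.0],
    "finch": [0.9, 0.1, 0.0],
    "truck": [0.0, 0.0, 1.0],
}
query = [0.8, 0.2, 0.0]


def stub_ranker(prompt: str, candidates: List[str]) -> List[str]:
    """Placeholder for an MLLM call; it simply reverses the candidate order."""
    return list(reversed(candidates))


print(retrieve_and_rank(query, memory, 2, stub_ranker))
```

In a real system the stub would be replaced by a call to an actual MLLM, and the embeddings would come from an encoder such as CLIP. The split into a cheap retrieval step and a more expensive ranking step is what lets the candidate set grow without the ranker having to consider every category.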

Type

Conference paper

Publication

arXiv 2024

Jiaqi Wang 王佳琦

Research Scientist
Shanghai AI Laboratory

Jiaqi Wang is a Research Scientist at Shanghai AI Laboratory. His research interests focus on Multimodal Learning, Visual Perception, and AI Content Creation in both 2D and 3D open worlds.
