RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition

Overall framework


In this paper, we highlight the potential of combining retrieving and ranking with multi-modal large language models (MLLMs) to revolutionize perception tasks such as fine-grained recognition, zero-shot image recognition, and few-shot object recognition. Motivated by the limited zero-shot/few-shot performance of CLIP and MLLMs on fine-grained datasets, our RAR designs a pipeline that uses an MLLM to rank the retrieved results. Our proposed approach can be seamlessly integrated into various MLLMs for real-world applications where the variety and volume of categories continuously expand. Our method opens up new avenues for research in augmenting the MLLM’s abilities with retrieval-augmented solutions and could be beneficial for other tasks such as reasoning and generation in future work.
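The retrieve-then-rank idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the cosine-similarity retrieval over a label-to-embedding memory, and the `ranker` callable (a stand-in for prompting an MLLM with the retrieved candidates) are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Mapping, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query: Vector, memory: Mapping[str, Vector], k: int) -> List[str]:
    """Return the k category labels whose stored embeddings are closest to the query."""
    scored = sorted(memory.items(), key=lambda item: cosine(query, item[1]), reverse=True)
    return [label for label, _ in scored[:k]]


def retrieve_and_rank(
    query: Vector,
    memory: Mapping[str, Vector],
    ranker: Callable[[List[str]], List[str]],
    k: int = 5,
) -> List[str]:
    """Retrieve k candidate labels, then let `ranker` reorder them.

    `ranker` is a hypothetical placeholder for an MLLM call that receives the
    candidate labels and returns them in its preferred order.
    """
    candidates = retrieve_top_k(query, memory, k)
    # Keep only labels that were actually retrieved, in case the ranker
    # returns something outside the candidate set.
    ranked = [label for label in ranker(candidates) if label in candidates]
    return ranked or candidates


# Toy usage with made-up 2-D embeddings and a dummy ranker that reverses the list.
toy_memory = {"sparrow": [1.0, 0.0], "finch": [0.9, 0.1], "truck": [0.0, 1.0]}
toy_result = retrieve_and_rank([1.0, 0.0], toy_memory, ranker=lambda c: c[::-1], k=2)
print(toy_result)
```

In this sketch the retrieval step narrows a large label set to a short candidate list, and the final ordering is delegated to whatever model is plugged in as `ranker`, so new categories can be added by extending the memory alone.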

arXiv 2024
Jiaqi Wang 王佳琦
Research Scientist
Shanghai AI Laboratory

Jiaqi Wang is a Research Scientist at Shanghai AI Laboratory. His research interests focus on Multimodal Learning, Visual Perception, and AI Content Creation in both 2D and 3D open worlds.