Publications

InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

Long-CLIP: Unlocking the Long-Text Capability of CLIP

RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition

SongComposer: A Large Language Model for Lyric and Melody Composition in Song Generation

DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

GPT4Point: A Unified Framework for Point-Language Understanding and Generation

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

OneLLM: One Framework to Align All Modalities with Language

InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

VIGC: Visual Instruction Generation and Correction

HyperDreamer: Hyper-Realistic 3D Content Generation and Editing from a Single Image

V3Det: Vast Vocabulary Visual Detection Dataset

UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers

OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation

Multi-level Logit Distillation

BUOL: A Bottom-Up Framework with Occupancy-aware Lifting for Panoptic 3D Scene Reconstruction From A Single Image

Dense Distinct Query for End-to-End Object Detection

Voxurf: Voxel-based Efficient and Accurate Neural Surface Reconstruction

Self-supervised Action Representation Learning from Partial Spatio-Temporal Skeleton Sequences

Semi-Supervised Semantic Segmentation via Gentle Teaching Assistant

LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

Few-Shot Object Detection via Association and DIscrimination

Texture Memory-Augmented Deep Patch-Based Image Inpainting

CARAFE++: Unified Content-Aware ReAssembly of FEatures

Seesaw Loss for Long-Tailed Instance Segmentation

Side-Aware Boundary Localization for More Precise Object Detection

CARAFE: Content-Aware ReAssembly of FEatures

Region Proposal by Guided Anchoring

Hybrid Task Cascade for Instance Segmentation

Optimizing Video Object Detection via a Scale-Time Lattice