RegionGPT: Towards Region Understanding Vision Language Model

Qiushan Guo2, Shalini De Mello1, Hongxu Yin1, Wonmin Byeon1, Ka Chun Cheung1, Yizhou Yu2, Ping Luo2, Sifei Liu1
Nvidia1, The University of Hong Kong2
Work done during Qiushan's internship at Nvidia Research.

(We have cleaned the code. Pre-trained models will be released soon. Stay tuned. Thanks.)

Abstract

Vision language models (VLMs) have advanced rapidly through the integration of large language models (LLMs) with image-text pairs, yet they struggle with detailed regional visual understanding due to the limited spatial awareness of the vision encoder and the use of coarse-grained training data that lacks detailed, region-specific captions. To address this, we introduce RegionGPT (RGPT for short), a novel framework designed for complex region-level captioning and understanding. RGPT enhances the spatial awareness of regional representations through simple yet effective modifications to existing visual encoders in VLMs. We further improve performance on tasks requiring a specific output scope by integrating task-guided instruction prompts during both training and inference, while maintaining the model's versatility for general-purpose tasks. Additionally, we develop an automated region caption data generation pipeline, enriching the training set with detailed region-level captions. We demonstrate that a universal RGPT model can be applied effectively to, and significantly enhances performance across, a range of region-level tasks, including but not limited to complex region description, reasoning, object classification, and referring expression comprehension.

Problem Overview

We introduce RegionGPT, which enables complex region-level captioning, reasoning, classification, and referring expression comprehension for multimodal large language models. Users can input regions of interest of any shape, using <region> as a placeholder at any position within the instruction. These placeholders are subsequently replaced with semantic region-level embeddings that are fed into the language decoder.
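To make the placeholder mechanism concrete, the sketch below shows how <region> tokens might be spliced into the decoder's input embeddings. The function and tensor names (region_token_id, embed_tokens) are illustrative assumptions, not the released RegionGPT interface.

```python
# Hypothetical sketch of placeholder substitution; names are illustrative.
import torch

def build_input_embeddings(prompt_ids, region_embeds, region_token_id, embed_tokens):
    """Replace each <region> placeholder token with its region-level embedding.

    prompt_ids:    (seq_len,) token ids of the instruction, containing
                   region_token_id once per referenced region.
    region_embeds: (num_regions, hidden_dim) embeddings from the region branch,
                   in the order the placeholders appear.
    embed_tokens:  the language model's token-embedding layer.
    """
    text_embeds = embed_tokens(prompt_ids)          # (seq_len, hidden_dim)
    out, r = [], 0
    for tok, emb in zip(prompt_ids.tolist(), text_embeds):
        if tok == region_token_id:
            out.append(region_embeds[r])            # swap in the region feature
            r += 1
        else:
            out.append(emb)
    return torch.stack(out)                         # fed to the language decoder
```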


RegionGPT Architecture

Starting from a visual backbone, we extract low-resolution semantic features from an input image \(X_v\). A feature refinement module is then applied to obtain higher-resolution feature maps, which a patch merge module further merges to reduce the length of the image-level input sequence. The region features are obtained by averaging the features within the target region \(X_r\), provided as a separate input branch, via a mask pooling layer. Both the image-level and region-level features share the same connector to ensure semantic consistency. The example interactions demonstrate the model's capabilities in complex region-level description, reasoning, object classification, and referring expression comprehension.
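The following is a minimal PyTorch sketch of this region branch, assuming a generic upsampling-based feature refinement and a two-layer MLP connector; module choices and dimensions are illustrative assumptions, not the exact RegionGPT layers.

```python
# A minimal sketch of the region branch; layers and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionBranch(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        # Feature refinement: upsample the low-resolution backbone features.
        self.refine = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(vis_dim, vis_dim, kernel_size=3, padding=1),
        )
        # Connector shared by image-level and region-level features.
        self.connector = nn.Sequential(
            nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def mask_pool(self, feat, mask):
        """Average features inside a binary region mask.
        feat: (B, C, H, W) refined feature map; mask: (B, 1, H, W) in {0, 1}."""
        mask = mask.to(feat.dtype)
        pooled = (feat * mask).sum(dim=(2, 3)) / mask.sum(dim=(2, 3)).clamp(min=1)
        return pooled  # (B, C)

    def forward(self, backbone_feat, region_mask):
        refined = self.refine(backbone_feat)                 # higher-resolution features
        mask = F.interpolate(region_mask.float(),
                             size=refined.shape[-2:])        # match feature resolution
        region_feat = self.mask_pool(refined, mask)
        return self.connector(region_feat)                   # region embedding for <region>
```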

Model Architecture

VLM-assistant Region Caption Generation

We explore using an existing global-level image-captioning VLM, i.e., LLaVA, for region-specific tasks. The proposed pipeline consists of two stages.

In the first stage, we generate a global-level caption for the image using the VLM. This global description is then used as contextual information, included as text at the beginning of the prompt. In the second stage, by inputting the cropped region of interest (ROI), the VLM is prompted to describe the specific region represented by the image patch. We illustrate this approach with the following prompt:

In the context of the entire image, <GlobalCaption>, describe the close-up region in detail.

We further enhance our approach by incorporating human-annotated class names as an additional condition when prompting the VLM to describe the properties of the region:

In the context of the entire image, <GlobalCaption>, describe the <ClassName> in the close-up region in detail.
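A minimal sketch of this two-stage pipeline is given below. The vlm_chat wrapper, the cropping step, and the exact prompt assembly are assumptions for illustration, not the precise interface used to build the dataset.

```python
# Hypothetical sketch of the two-stage region caption generation pipeline.
from PIL import Image

def caption_region(vlm_chat, image_path, box, class_name=None):
    """Generate a detailed region caption conditioned on a global caption.

    vlm_chat(image, prompt) -> str is an assumed wrapper around a global-level
    VLM such as LLaVA; box is (left, top, right, bottom) in pixels.
    """
    image = Image.open(image_path).convert("RGB")

    # Stage 1: global-level caption used as context.
    global_caption = vlm_chat(image, "Describe this image in detail.")

    # Stage 2: describe the cropped ROI, conditioned on the global context
    # and, optionally, a human-annotated class name.
    roi = image.crop(box)
    target = (f"the {class_name} in the close-up region"
              if class_name else "the close-up region")
    prompt = (f"In the context of the entire image, {global_caption}, "
              f"describe {target} in detail.")
    return vlm_chat(roi, prompt)
```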

Data Pipeline

Qualitative Results

Complex Region-Level Reasoning

Object Classification & Referring Expression Comprehension


ChatBot Demo


Quantitative Results