RS-GPT4V: A Unified Multimodal Instruction-Following
Dataset for Remote Sensing Image Understanding

Linrui Xu, Central South University, Changsha, China
Ling Zhao, Central South University, Changsha, China
Wang Guo, Central South University, Changsha, China
Qiujun Li, Central South University, Changsha, China
Kewang Long, Central South University, Changsha, China
Kaiqi Zou, Central South University, Changsha, China
Yuhan Wang, Central South University, Changsha, China
Haifeng Li (corresponding author, email: [email protected]), Central South University, Changsha, China

Abstract

Remote sensing image (RSI) understanding is undergoing a profound paradigm shift driven by multimodal large language models (MLLMs): from the paradigm of learning a domain model (LaDM) to the paradigm of learning a pre-trained general foundation model followed by an adaptive domain model (LaGD). Under the new LaGD paradigm, older datasets such as RSI-CD and DOTA, which drove advances in RSI understanding over the last decade, are no longer suitable for the new tasks. We argue that a new dataset must be carefully designed to support tasks in the new paradigm, with the following features: (1) Generalization: training a model to learn knowledge shared among tasks and to adapt to different tasks; (2) Understanding complex scenes: training a model to understand the fine-grained attributes of the objects of interest and to describe a scene in detail in natural language; (3) Reasoning: training a model to perform high-level visual reasoning. In this paper, we design a high-quality, diversified, and unified multimodal instruction-following dataset for RSI understanding, produced with GPT-4V and existing datasets, which we call RS-GPT4V. To achieve generalization, RS-GPT4V uses (Question, Answer) pairs, derived from GPT-4V via instruction-following, to unify tasks such as captioning and localization. To achieve understanding of complex scenes, RS-GPT4V adopts a hierarchical instruction description: a local strategy describes the fine-grained attributes of the objects and their spatial relationships, and a global strategy integrates all the local information to yield a detailed instruction description. To achieve reasoning, RS-GPT4V provides multi-turn (Question, Answer) pairs that train a model’s reasoning ability.
The empirical results show that MLLMs fine-tuned on RS-GPT4V can describe fine-grained information and implicit knowledge in multiple complex remote sensing scenarios, and reason better than models fine-tuned on existing datasets. The source code and dataset are available at: https://github.com/GeoX-Lab/RS-GPT4V.

1 Introduction

In the LaDM paradigm, the datasets were designed for research tasks such as scene classification[1], object detection[2], image captioning[3], and visual question answering[4]. Research typically relies on designing and training models separately for each task, neglecting the potential commonalities and knowledge sharing between different tasks and datasets. Remote sensing interpretation datasets primarily consist of images and annotations, where images are consistent entity samples, while annotations are diverse, including labels, bounding boxes, and text. Additionally, existing datasets usually preset a limited number of scenes and target categories, mainly examining the model’s ability to recognize these specific categories, overlooking whether the model can deeply understand and reason about the complex relationships between various scenes and targets.

In the LaGD paradigm, significant achievements in language understanding and reasoning capabilities have been made in recent years, driven by the rapid development of large language models (LLMs) such as GPT-4[5] and LLaMA2[6]. To further harness the potential of LLMs, researchers have expanded their capabilities by integrating visual perception modules, thus forming MLLMs capable of understanding complex image scenes and performing visual reasoning. The training paradigm for MLLMs typically follows a pre-training and fine-tuning model, where visual language instruction fine-tuning plays a crucial role throughout the training process. When MLLMs undergo Supervised Fine-Tuning (SFT) with appropriate visual language instruction-following datasets, they demonstrate powerful functionalities. They can not only perceive, understand, and process visual information but also enhance their performance in executing complex task instructions. Visual language instruction fine-tuning provides the model with the ability to comprehend and follow visual content-related instructions, enabling it to generate responses that meet user interaction needs.

However, the existing remote sensing datasets and benchmarks present three challenges: (1) The diversity of annotations in remote sensing datasets limits the generalization capability of models. The inconsistency in annotation modalities makes it difficult for models to adapt to different tasks. (2) Remote sensing annotation data are insufficient to accurately describe the fine-grained attributes of objects in the area of interest and the structural information between these fine-grained attributes. (3) Existing methods can only perform one-loop recognition (OLR) and are unable to achieve multi-turn reasoning (MTR) to discover implicit knowledge.

Despite MLLMs demonstrating excellent performance in various fields, the diversity of annotations in existing remote sensing datasets limits their application. Moreover, current annotation data have limitations in providing detailed descriptions of RSI targets, particularly regarding fine-grained attributes, spatial relationships, and background knowledge, as shown in Figure 1. Therefore, this paper proposes a Unified RS Multimodal Instruction-Following Dataset (RS-GPT4V). This dataset uses the text modality to express different annotations as a unified foundation. By converting multiple tasks into language-understanding tasks, data unification can be achieved, further realizing task unification. RS-GPT4V aims to cover a wide range of scenarios and target categories and integrate various visual language tasks. The dataset adopts a unified (Question, Answer) format, supporting tasks such as image description, visual question answering, complex scene understanding, visual reasoning, and task planning.

RS-GPT4V is constructed through two key methods: Instruction-Annotation Adaption and Instruction-Response Generation. Instruction-Annotation Adaption converts existing visual language tasks into (Question, Answer) pairs using instruction templates. Instruction-Response Generation utilizes system prompts and advanced GPT-4V models to generate (Question, Answer) pairs based on existing annotation data. By using RS-GPT4V for supervised fine-tuning of MLLMs, the model can understand the relationships between objects in complex scenes and uncover implicit knowledge. For example, the model can use the contextual visual information of a ship’s wake to infer whether the target is stationary or moving.

Our contributions are listed as follows:

• We build a new high-quality, diversified, and unified multimodal instruction-following dataset for RSI understanding, called RS-GPT4V, which is produced with GPT-4V and existing datasets and uses the (Question, Answer) pair as its uniform format for the new paradigm. RS-GPT4V can be used to train and test models’ capabilities in generalization, complex scene understanding, and reasoning.

• We propose principles for designing such a dataset, such as uniformity, diversity, accuracy, and richness. These principles are general and could serve as guidance for the LaGD paradigm in the future.

• The empirical results show that MLLMs fine-tuned on RS-GPT4V can describe fine-grained information and implicit knowledge in multiple complex remote sensing scenarios, and reason better than models fine-tuned on existing datasets. We will release the dataset and code.

2 Related work

Remote Sensing Visual Language Datasets. RSI datasets are typically categorized by task types, such as classification[1], segmentation[7], detection[8], and change detection[9]. The primary distinction among these datasets lies in their annotation methods, with classification tasks using class indices and detection tasks using bounding boxes. With the advancement of vision-language models[10], researchers have developed remote sensing vision-language datasets, such as RS5M[11], SkyScript[12], and LHRS-Align[13], by integrating RSI with language descriptions. These datasets aim to achieve a unified annotation approach using language modalities, thereby enabling the development of vision-language models capable of handling various downstream tasks. While these datasets have demonstrated significant advantages in tasks like image captioning and scene classification, they have yet to achieve complete task unification, as models trained on these datasets are typically limited to specific tasks within the dataset. This work posits that the annotations for different tasks are essentially information carriers that can be expressed through language modalities. By learning the mapping relationship between images and language annotations, vision-language models have the potential to handle diverse tasks in a unified manner.

Remote Sensing Multimodal Instruction-Following Datasets. Although using language modalities to unify annotations across different tasks can facilitate task unification within datasets, enabling vision-language models to select downstream tasks autonomously remains a challenge. To address this issue, researchers have started incorporating task instructions into model training. Specifically, instruction-guided multimodal datasets introduce task-specific texts such as "describe the image," "locate objects in the image," or "answer the question briefly" alongside visual and language inputs. These texts guide the model in understanding the task type and its corresponding requirements. Through this common interface, the same model can be fine-tuned for various visual tasks, resulting in a general-purpose multimodal model capable of accepting any language and visual inputs and solving diverse visual tasks.[14]

Figure 1: Evolution of Remote Sensing Tasks and Data. The figure illustrates the progression from early remote sensing tasks using purely visual data to advanced tasks using visual language data. By adding instructions, these data can be transformed into multimodal instruction-following datasets.

Currently, notable multimodal instruction-following datasets in remote sensing include MMRS-1M[15], RS-instructions[16], and RS-Specialized-Instruct[17]. As shown in Figure 1, these datasets enhance existing vision-language remote sensing datasets like UCM-Captions[18], RSVQA[19], and DIOR[20] by adding instructional text, merging them into new multimodal instruction-following datasets. Additionally, datasets like HqDC-Instruct[17] and LHRS-Instruct[13] leverage large language models to construct new task instructions for multi-turn conversations, detailed descriptions, and complex reasoning. However, these datasets often lack systematic guiding principles. For instance, MMRS-1M[15] lacks complexity, making it difficult for models to handle complex semantic reasoning tasks. GeoChat’s[21] data generation process lacks supervision, leading to insufficient correctness and a higher likelihood of contextually inconsistent predictions. Therefore, this work aims to redesign the principles for constructing remote sensing instruction fine-tuning datasets and, under these guidelines, create a more unified, rich, and correct remote sensing multimodal instruction-following dataset.

3 RS-GPT4V Dataset

3.1 Design Principles and Characteristics

In constructing the RS-GPT4V Dataset, we adhered to several key design principles, as illustrated in Figure 2, to ensure that the remote sensing instruction-following dataset is not only of high quality but also broadly applicable to various visual and visual-language tasks. The design principles are summarized as follows:

Unity: Our dataset integrates multiple visual and visual-language tasks, such as image description, visual question answering, and visual dialogue. By adopting a unified data annotation, we ensure consistency and standardization, allowing a single model to be trained and evaluated across multiple tasks.

Diversity: To accommodate different remote sensing scenarios and research needs, the instruction-following dataset includes images from various scenes along with their corresponding instructions and questions. This diversity is reflected not only in the visual content but also in the types of tasks and model responses.[22]

Correctness: The matching of visual information with textual content must be precise. All images and text descriptions in the RS-GPT4V Dataset have been manually corrected to ensure error-free image-text question answering.[23]

Richness: In addition to basic visual and textual data, we have included supplementary background knowledge, such as the scientific background of related objects. This rich information helps the model understand and interpret the complex elements and relationships within the visual content better.

Complexity: Our dataset design includes multi-layered tasks and challenges, particularly in reasoning about object attributes and relationships. By designing tasks that require complex logic and reasoning capabilities, we enhance the model’s ability to handle intricate scenarios.[23, 22]

Robustness: The instruction-following dataset includes additional spatial negative samples to enhance the model’s robustness, ensuring stable performance across various remote sensing scenarios and reducing hallucinations.[24]

3.2 Principles-Driven Pipeline for RS-GPT4V Dataset Construction

During the construction of the RS-GPT4V Dataset, we adhered to the design principles[25, 23] outlined in Section 3.1 to ensure the dataset’s quality, accuracy, and diversity, as shown in Figure 3. The following describes the detailed pipeline for dataset construction and how these principles were applied within the pipeline.

Data Collection: We extracted data from multiple well-recognized remote sensing visual language datasets. These datasets cover a wide range of scenes, from natural landscapes to urban environments, providing a rich source of RSIs and related annotations. Although these datasets serve their respective purposes, they often lack the depth and complexity required for more demanding challenges such as multi-turn interaction. Each source dataset was carefully selected to cover different geographical regions and environmental conditions, ensuring data diversity. For example, some datasets contain high-resolution images of urban areas, while others focus on natural landscapes or agricultural regions. To construct a dataset that supports complex reasoning, we selected the DIOR[20] dataset, whose manually labeled coordinate annotations provide accurate coordinate hints. In contrast, classification datasets containing only a single category label are not suitable for prompting GPT-4V because of the difficulty and high error rate in image understanding.

Instruction-Response Generation: As shown in Figure 3, we adopted a hierarchical prompting method to generate instruction descriptions. First, we obtained fine-grained information about objects at a local level and combined this information with RSIs to systematically generate detailed instruction descriptions. To ensure that the generated instructions are detailed and accurate, each instruction and its corresponding response must accurately reflect the image content and possess sufficient complexity and diversity. Specifically, the coordinate information from the DIOR dataset was used to identify different types of objects in the image, such as tennis courts and basketball courts. To avoid the influence of non-contiguous objects on GPT-4V’s[5] counting ability, we ensured that labeled objects of the same category appear contiguously in the sequence. For example, the sequence would be "tennis court, tennis court, basketball court" rather than "tennis court, basketball court, tennis court". To enhance GPT-4V’s spatial understanding ability, we used rotated bounding boxes to represent the positions of objects. Using rotated bounding boxes instead of standard bounding boxes helps avoid coordinate overlap, enabling the model to better understand the fine-grained attributes and spatial relationships of objects. Meanwhile, the questions generated by GPT-4V should be sufficiently challenging, covering complex reasoning, world knowledge, explanatory answers, and multi-turn dialogues. Examples include "Please describe the basketball court in the image and its color," and "How many tennis courts are there in the image?". Combining the complex and diverse instruction information obtained at the local level with the RSIs, GPT-4V can generate more accurate and detailed responses, for example by identifying a basketball court in the image and generating a detailed description of its location, color, and surrounding environment.
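
The object-ordering step described above can be illustrated with a small sketch. It is not the authors' code: the record layout (a `category` string plus four `corners` of a rotated box) and the function names are assumptions made for illustration. It reorders annotations so that instances of one category are adjacent, then renders them as text hints that could accompany an image in a prompt.

```python
from collections import OrderedDict


def group_by_category(objects):
    """Reorder objects so that instances of the same category are adjacent.

    Categories keep the order of their first appearance, and instances keep
    their original relative order within a category.
    """
    groups = OrderedDict()
    for obj in objects:
        groups.setdefault(obj["category"], []).append(obj)
    return [obj for group in groups.values() for obj in group]


def format_object_hints(objects):
    """Render one text line per object: category plus rotated-box corners."""
    lines = []
    for obj in group_by_category(objects):
        corners = ", ".join(f"({x}, {y})" for x, y in obj["corners"])
        lines.append(f"{obj['category']}: [{corners}]")
    return "\n".join(lines)


# Hypothetical annotations; the four-corner layout is an assumed encoding
# of a rotated bounding box, not a format taken from the paper.
objects = [
    {"category": "tennis court", "corners": [(10, 10), (50, 20), (45, 60), (5, 50)]},
    {"category": "basketball court", "corners": [(100, 100), (140, 100), (140, 130), (100, 130)]},
    {"category": "tennis court", "corners": [(60, 10), (100, 20), (95, 60), (55, 50)]},
]

ordered = [o["category"] for o in group_by_category(objects)]
print(ordered)  # ['tennis court', 'tennis court', 'basketball court']
print(format_object_hints(objects))
```

Grouping by first appearance keeps the ordering deterministic without imposing an alphabetical order that the source does not mention.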

Instruction-Annotation Adaption: Adapting annotations from visual language datasets to instruction datasets is a crucial part of constructing the RS-GPT4V Dataset. By adding specific instructions, we transformed existing visual language datasets into new instruction datasets. For example, in image description tasks, the original dataset might only contain simple descriptions, and by adding instructions such as "Provide a one-sentence caption for this RSI," the dataset becomes more suitable for complex instruction-response tasks. Through this method, the RS-GPT4V Dataset not only ensures task diversity and accuracy at the data collection stage but also enhances dataset unity through meticulous instruction generation and annotation transformation steps.
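
A minimal sketch of this adaption step follows, assuming a simple dictionary record with `image`, `question`, and `answer` fields. The first caption instruction and the "answer the question briefly" suffix are quoted from this paper; the second caption instruction, the file names, and the sample annotations are hypothetical.

```python
import random

# The first template is quoted in the text; the second is a hypothetical variant.
CAPTION_INSTRUCTIONS = [
    "Provide a one-sentence caption for this RSI.",
    "Describe this remote sensing image in one sentence.",
]
VQA_SUFFIX = "Answer the question briefly."


def adapt_caption(image_id, caption, rng):
    """Wrap a plain caption annotation as a (Question, Answer) record."""
    return {
        "image": image_id,
        "question": rng.choice(CAPTION_INSTRUCTIONS),
        "answer": caption,
    }


def adapt_vqa(image_id, question, answer):
    """Append a task instruction to an existing VQA question."""
    return {
        "image": image_id,
        "question": f"{question} {VQA_SUFFIX}",
        "answer": answer,
    }


rng = random.Random(0)  # fixed seed so the template choice is reproducible
records = [
    adapt_caption("img_001.jpg", "Several planes are parked near the terminal.", rng),
    adapt_vqa("img_002.tif", "Is there a road in the image?", "yes"),
]
for record in records:
    print(record)
```

Because every task ends up in the same record shape, a single training loop can consume captioning and question-answering samples without task-specific heads.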

3.3 Overview of the RS-GPT4V Dataset

As shown in Table 1, the RS-GPT4V dataset comprises multiple tasks with specific RSIs and instruction annotations for tasks such as image captioning, visual question answering, visual grounding, and region-level descriptions. Key datasets include NWPU-Captions[26], RSICD[27], RSITMD[28], Sydney-Captions[18], UCM-Captions[18], RSVQA-LR[19], RSVQA-HR[19], FloodNet[29], RSIVQA[30], and DIOR-RSVG[31], among others. Overall, the dataset provides 91,937 training images with 991,206 question-answer pairs and 15,999 test images with 258,419 question-answer pairs, supporting detailed descriptions and complex reasoning capabilities across various visual and visual-language tasks.

Table 1: Details of the RS-GPT4V Dataset: Images and QA Pairs for Training and Testing

Task	Data Source	Train images	Train QA Pairs	Test images	Test QA Pairs
Image Captioning	NWPU-Captions	25200	125894	3150	1093
	RSICD	8734	17813	1093	1093
	RSITMD	4291	20096	-	-
	Sydney-Captions	497	2294	58	58
	UCM-Captions	1680	7999	210	210
Visual Question Answering	RSVQA-LR	572	57223	100	10004
	RSVQA-HR	6251	625340	2226	222684
	FloodNet	1448	4511	-	-
	RSIVQA	5401	19218	-	-
Visual Grounding	DIOR-RSVG	9466	19643	7936	18677
Region-level Captioning	DIOR-RSVG	9466	19643	-	-
Multi-turn Conversation	RS-GPT4V-Instruct	9466	62067	613	3987
Detailed Description	RS-GPT4V-Instruct	9465	9465	613	613
Total	-	91937	991206	15999	258419

The RS-GPT4V dataset comprises a total of 91,937 training images and 991,206 training instruction-answer pairs, with 15,999 images and 258,419 instruction-answer pairs in the test set. By integrating these datasets, the RS-GPT4V dataset features detailed annotations and complex reasoning capabilities. The structure of this dataset supports a variety of tasks, ranging from image description to multi-turn dialogues and detailed descriptions.
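
The totals quoted above can be re-derived from the rows of Table 1. The short check below copies the table values as they appear and treats the "-" entries as zero, which is an assumption about how empty test splits are counted.

```python
# Rows of Table 1: (task, source, train images, train QA, test images, test QA).
# None stands for a "-" entry and is counted as zero (assumption).
rows = [
    ("Image Captioning", "NWPU-Captions", 25200, 125894, 3150, 1093),
    ("Image Captioning", "RSICD", 8734, 17813, 1093, 1093),
    ("Image Captioning", "RSITMD", 4291, 20096, None, None),
    ("Image Captioning", "Sydney-Captions", 497, 2294, 58, 58),
    ("Image Captioning", "UCM-Captions", 1680, 7999, 210, 210),
    ("Visual Question Answering", "RSVQA-LR", 572, 57223, 100, 10004),
    ("Visual Question Answering", "RSVQA-HR", 6251, 625340, 2226, 222684),
    ("Visual Question Answering", "FloodNet", 1448, 4511, None, None),
    ("Visual Question Answering", "RSIVQA", 5401, 19218, None, None),
    ("Visual Grounding", "DIOR-RSVG", 9466, 19643, 7936, 18677),
    ("Region-level Captioning", "DIOR-RSVG", 9466, 19643, None, None),
    ("Multi-turn Conversation", "RS-GPT4V-Instruct", 9466, 62067, 613, 3987),
    ("Detailed Description", "RS-GPT4V-Instruct", 9465, 9465, 613, 613),
]

# Sum each numeric column (indices 2..5), counting missing entries as zero.
totals = tuple(sum(row[i] or 0 for row in rows) for i in range(2, 6))
print(totals)  # (91937, 991206, 15999, 258419)
```

Note that the image totals are plain column sums, so images shared between tasks (for example the DIOR-based rows) are counted once per task.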

4 Experiments

4.1 Experimental Setup

To evaluate the impact of different fine-tuning strategies on the RS-GPT4V dataset, we employed three fine-tuning methods to perform supervised fine-tuning based on the LLaVA-1.5-7B [32] model: Full-Parameter Fine-Tuning (Full-FT), LoRA Fine-Tuning, and MoE-LoRA Fine-Tuning. The objective is to compare the efficacy and performance of these fine-tuning methods in handling complex remote sensing tasks. All experiments were conducted on four NVIDIA A800-80G GPUs, with a global batch size of 64, an initial learning rate of 2e-4, and a total of 14,666 training steps, corresponding to 1 epoch. Additionally, the LoRA rank is set to 128, and the number of experts in MoE-LoRA is 4.

4.2 Benchmarks for RS-GPT4V

Full-FT: In the full-parameter fine-tuning strategy, we adjust all parameters of the LLaVA-1.5 model to fully adapt to the specific characteristics of RSIs. This comprehensive adjustment strategy optimizes all trainable parameters, enabling the model to handle complex remote sensing tasks more effectively.

LoRA: The LoRA[33] fine-tuning strategy aims to improve the model’s learning efficiency and reduce computational resource requirements by applying low-rank approximation optimization to specific parameters within the model. This strategy optimizes the structure of certain parameters, significantly reducing resource consumption and enabling training to be completed in a shorter time frame.
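
To make the resource saving concrete, the sketch below counts trainable parameters for one square projection matrix under full fine-tuning and under LoRA with the rank of 128 used in Section 4.1. The hidden size of 4096 is an assumption about the 7B language backbone, not a figure given in this paper, and which layers actually receive adapters is not specified here.

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole d_out x d_in weight is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters of a LoRA update B @ A.

    B has shape d_out x rank and A has shape rank x d_in; the base weight
    stays frozen and is not counted.
    """
    return rank * (d_out + d_in)


hidden = 4096  # assumed hidden size of the 7B backbone
rank = 128     # LoRA rank reported in Section 4.1

full = full_params(hidden, hidden)
lora = lora_params(hidden, hidden, rank)
print(full, lora, lora / full)  # 16777216 1048576 0.0625
```

Under these assumptions each adapted matrix trains about 6% of the parameters that full fine-tuning would update.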

MoE-LoRA: We employed the MoE-LoRA[34] method, which combines the Mixture of Experts (MoE) and Low-Rank Adaptation (LoRA) approaches. This method leverages the advantages of MoE in multi-task learning and the efficiency of LoRA in parameter fine-tuning. By designing multiple experts, each consisting of a pair of low-rank matrices, this approach maintains computational efficiency while significantly enhancing the model’s capability to handle multiple tasks.
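
The idea of experts built from pairs of low-rank matrices can be sketched as a single layer in plain Python. This is an illustrative reading, not the implementation of [34]: the dense softmax router, the zero initialization of B, and all class and function names are assumptions.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(n_rows, n_cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(n_cols)] for _ in range(n_rows)]


class MoELoRALinear:
    """Frozen weight W plus a router-weighted sum of low-rank updates B_i A_i."""

    def __init__(self, d_in, d_out, rank, num_experts, rng):
        self.W = rand_matrix(d_out, d_in, rng)             # frozen base weight
        self.router = rand_matrix(num_experts, d_in, rng)  # gate (assumed dense softmax)
        # Each expert is a pair (A, B): A is rank x d_in, B is d_out x rank.
        # B starts at zero, so the layer initially equals the base layer.
        self.experts = [
            (rand_matrix(rank, d_in, rng), [[0.0] * rank for _ in range(d_out)])
            for _ in range(num_experts)
        ]

    def forward(self, x):
        y = matvec(self.W, x)
        gates = softmax(matvec(self.router, x))
        for gate, (A, B) in zip(gates, self.experts):
            delta = matvec(B, matvec(A, x))  # low-rank update for this expert
            y = [yi + gate * di for yi, di in zip(y, delta)]
        return y


rng = random.Random(0)
layer = MoELoRALinear(d_in=8, d_out=6, rank=2, num_experts=4, rng=rng)
x = [1.0] * 8
y = layer.forward(x)
gates = softmax(matvec(layer.router, x))
print(len(y), round(sum(gates), 6))
```

Only the router and the (A, B) pairs would be trained in such a layer, which is how the combination keeps the parameter budget close to plain LoRA while letting different experts specialize on different tasks.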

4.3 Experiments and Results

4.3.1 Image Captioning Task

In our experimental analysis, we evaluated the three benchmark models (Full-FT, LoRA, MoE-LoRA) fine-tuned with RS-GPT4V on four remote sensing captioning datasets (UCM-Captions[18], RSICD[27], Sydney-Captions[18], NWPU-Captions[26]). As shown in Figure 4, the fine-tuned benchmark models generally demonstrated significant performance improvements, providing accurate and detailed descriptions of RSIs. However, the performance on the Sydney-Captions dataset was not as strong as on the other datasets. This discrepancy is primarily due to the relatively small scale of the Sydney-Captions dataset, which left the models unable to fully capture the specific characteristics of this dataset during training. Nonetheless, the performance of the fine-tuned models on the other datasets still indicates that the RS-GPT4V fine-tuning strategy has significant potential and utility in enhancing the automatic parsing and description of RSIs.

4.3.2 Visual Question Answering Task

In the performance analysis of the RSVQA-LR and RSVQA-HR datasets (including Test set 1 and Test set 2), as shown in Figure 5, the three benchmark models fine-tuned with RS-GPT4V demonstrated significant advantages in multiple key metrics. Although the Count metric in RSVQA-LR was slightly inferior to Bi-Modal[40] and SHRNet[41] models, the MoE-LoRA benchmark performed exceptionally well across most evaluation metrics, particularly excelling in Presence, Comparison, and Rural/Urban. In both test sets of RSVQA-HR, all three models generally outperformed other methods, with MoE-LoRA’s strong performance in high-resolution image analysis being particularly noteworthy. However, compared to Test set 1, the accuracy of all models decreased in Test set 2, potentially reflecting increased test set difficulty or changes in data distribution. Overall, despite the subpar performance in the Count metric, the excellent performance in Area and other advanced visual understanding tasks validates the effectiveness of the RS-GPT4V fine-tuning strategy in enhancing RSI understanding capabilities.

4.3.3 Visual Grounding Task

In the analysis of the visual grounding task results, as shown in Table 2, the performance of different models on the [email protected] metric revealed significant differences. Overall, the three benchmark models fine-tuned with RS-GPT4V, namely Full-FT, LoRA, and MoE-LoRA, demonstrated superior performance in this task, significantly outperforming the Qwen-vl-Chat[42] and LLaVA-1.5[43] models. General MLLMs showed poor performance in detection tasks, while the models fine-tuned with RS-GPT4V exhibited strong capabilities in visual grounding tasks. These results validate the efficacy and robustness of the RS-GPT4V fine-tuning strategy in enhancing RSI visual grounding tasks.

Table 2: Performance of Models on DIOR-RSVG. Compares the accuracy ([email protected]) of various models on the DIOR-RSVG dataset, using the DIOR dataset split for training and testing.

Method	[email protected]
Qwen-vl-Chat	25.05
LLaVA-1.5	9.52
Full-FT	36.31
LoRA	33.15
MoE-LoRA	37.86
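
Visual grounding accuracy is conventionally computed by counting a prediction as correct when its intersection-over-union (IoU) with the ground-truth box reaches a threshold. The sketch below illustrates that convention only; it assumes axis-aligned boxes given as (x1, y1, x2, y2), leaves the threshold as a parameter, and is not the evaluation code behind Table 2.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(predictions, ground_truths, threshold):
    """Percentage of predictions whose IoU with the paired ground truth
    is at least `threshold`."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return 100.0 * hits / len(ground_truths)


# Hypothetical boxes: the first prediction is exact, the second overlaps
# its target only slightly (IoU = 25 / 175).
predictions = [(0, 0, 10, 10), (0, 0, 10, 10)]
ground_truths = [(0, 0, 10, 10), (5, 5, 15, 15)]
print(accuracy_at_iou(predictions, ground_truths, threshold=0.5))  # 50.0
```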

Table 3: GPT-4V Performance Evaluation. GPT-4V scores the performance of various models on the RS-GPT4V-Instruct dataset, focusing on complex conversation and detailed description. Scores are restricted between 1 and 10, with higher scores awarded for answers containing more accurate information.

Method	complex reasoning & conversation	detailed description	All Scores
LLaVA-1.5	5.21	5.088	5.194
Qwen-vl-Chat	2.648	2.282	2.599
InternLM-XC2	5.312	4.392	5.189
Full-FT	6.27	6.53	6.304
LoRA	6.061	6.374	6.103
MoE-LoRA	6.108	6.468	6.156

4.3.4 Performance Evaluation of the RS-GPT4V-Instruct Dataset

In evaluating different models on the RS-GPT4V-Instruct dataset, we focused on two core metrics: complex reasoning & conversation and detailed description. These metrics aim to assess the models’ ability to understand and generate complex dialogues and accurately capture details. As shown in Table 3, Full-FT performed the best in this evaluation, demonstrating its superior understanding and generation capabilities. Meanwhile, MoE-LoRA and LoRA also showed strong performance in these metrics. In contrast, other models such as Qwen-vl-Chat, LLaVA-1.5, and InternLM-XC2 scored relatively lower, indicating their potential limitations in handling complex reasoning, conversation, and detailed descriptions in remote sensing tasks.
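
The source does not state how the "All Scores" column of Table 3 is aggregated. One reading that appears consistent with the reported numbers is an average of the two category scores weighted by the test QA counts in Table 1 (3,987 multi-turn conversation pairs and 613 detailed descriptions). The check below is an observation under that assumption, not a description of the authors' procedure.

```python
# Test QA counts for RS-GPT4V-Instruct, taken from Table 1.
N_CONVERSATION, N_DESCRIPTION = 3987, 613

# (conversation score, description score, reported "All Scores") from Table 3.
scores = {
    "LLaVA-1.5": (5.21, 5.088, 5.194),
    "Qwen-vl-Chat": (2.648, 2.282, 2.599),
    "InternLM-XC2": (5.312, 4.392, 5.189),
    "Full-FT": (6.27, 6.53, 6.304),
    "LoRA": (6.061, 6.374, 6.103),
    "MoE-LoRA": (6.108, 6.468, 6.156),
}


def weighted_average(conversation, description):
    """Average the two category scores, weighted by sample count (assumed)."""
    total = N_CONVERSATION + N_DESCRIPTION
    return (conversation * N_CONVERSATION + description * N_DESCRIPTION) / total


gaps = {}
for name, (conversation, description, reported) in scores.items():
    estimate = weighted_average(conversation, description)
    gaps[name] = abs(estimate - reported)
    print(f"{name}: weighted={estimate:.3f} reported={reported}")
```

Under this reading every row agrees with the reported value to within rounding of the published category scores, which would also explain why the overall score sits much closer to the conversation score than to the description score.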

4.4 Limitations and Discussion

Limitations: Although the RS-GPT4V dataset integrates multiple data sources and covers a wide range of remote sensing scenarios and target categories, its overall scale remains limited, particularly in specific application scenarios involving infrared and SAR modalities, where data support is insufficient. Despite the use of manual correction and advanced model generation to ensure data accuracy and high quality, issues of annotation errors and inconsistencies persist in large-scale datasets. Additionally, the models perform poorly in target detection tasks involving complex RSIs, especially when detecting small targets or targets within highly complex backgrounds.

Societal Impact: Training large-scale multimodal language models requires significant computational resources, which can lead to significant energy consumption and corresponding carbon emissions. In the long run, this high energy consumption model may put continuous pressure on the environment.

5 Conclusion

This study successfully constructed and thoroughly analyzed the RS-GPT4V dataset, a unified and multifunctional multimodal instruction-following dataset designed for remote sensing visual language tasks. By integrating a wide range of remote sensing scenarios and target categories, the RS-GPT4V dataset supports various tasks such as image description, visual question answering, complex scene understanding, visual reasoning, and task planning, using a unified (Question, Answer) format, significantly enhancing the dataset’s practicality. During its construction, we adhered to key design principles of unity, diversity, correctness, richness, complexity, and robustness, ensuring the dataset’s high quality and broad applicability to different visual and visual language tasks. Our experimental results demonstrate that fine-tuning with the RS-GPT4V dataset significantly improves the performance and generalization capabilities of multimodal large language models in executing complex task instructions. Overall, the release of the RS-GPT4V dataset provides strong support for remote sensing visual language research, and its design and implementation example will help advance remote sensing technology in various fields. In the future, we plan to expand the dataset to include other remote sensing domains, such as infrared and SAR modalities, further enhancing the models’ adaptability.

References

[1] Gui-Song Xia, Jingwen Hu, Fan Hu, Baoguang Shi, Xiang Bai, Yanfei Zhong, Liangpei Zhang, and Xiaoqiang Lu. Aid: A benchmark data set for performance evaluation of aerial scene classification. IEEE Transactions on Geoscience and Remote Sensing, 55(7):3965–3981, 2017.
[2] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015.
[3] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015.
[4] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
[5] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[6] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[7] Di Wang, Jing Zhang, Bo Du, Minqiang Xu, Lin Liu, Dacheng Tao, and Liangpei Zhang. Samrs: Scaling-up remote sensing segmentation dataset with segment anything model, 2023.
[8] ** Wang, Wenhui Diao, Jin Chen, Jihao Li, Yingchao Feng, Tao Xu, Martin Weinmann, Stefan Hinz, Cheng Wang, and Kun Fu. Fair1m: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery, 2021.
[9] Deyi Ji, Siqi Gao, Mingyuan Tao, Hongtao Lu, and Feng Zhao. Changenet: Multi-temporal asymmetric change detection dataset, 2024.
[10] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.
[11] Yang Long, Gui-Song Xia, Shengyang Li, Wen Yang, Michael Ying Yang, Xiao Xiang Zhu, Liangpei Zhang, and Deren Li. On creating benchmark dataset for aerial image interpretation: Reviews, guidances and million-aid. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14:4205–4230, 2021.
[12] Zhecheng Wang, Rajanie Prabha, Tianyuan Huang, Jiajun Wu, and Ram Rajagopal. Skyscript: A large and semantically diverse vision-language dataset for remote sensing, 2023.
[13] Dilxat Muhtar, Zhenshi Li, Feng Gu, Xueliang Zhang, and Pengfeng Xiao. Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model, 2024.
[14] Jiaxing Huang, Jingyi Zhang, Kai Jiang, Han Qiu, and Shijian Lu. Visual instruction tuning towards general-purpose multimodal model: A survey. arXiv preprint arXiv:2312.16602, 2023.
[15] Wei Zhang, Miaoxin Cai, Tong Zhang, Yin Zhuang, and Xuerui Mao. Earthgpt: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain. arXiv preprint arXiv:2401.16822, 2024.
[16] Yakoub Bazi, Laila Bashmal, Mohamad Mahmoud Al Rahhal, Riccardo Ricci, and Farid Melgani. Rs-llava: A large vision-language model for joint captioning and question answering in remote sensing imagery. Remote Sensing, 16(9), 2024.
[17] Chao Pang, Jiang Wu, Jiayu Li, Yi Liu, Jiaxing Sun, Weijia Li, Xingxing Weng, Shuai Wang, Litong Feng, Gui-Song Xia, and Conghui He. H2rsvlm: Towards helpful and honest remote sensing large vision language model, 2024.
[18] Bo Qu, Xuelong Li, Dacheng Tao, and Xiaoqiang Lu. Deep semantic understanding of high resolution remote sensing image. In 2016 International conference on computer, information and telecommunication systems (Cits), pages 1–5. IEEE, 2016.
[19] Sylvain Lobry, Diego Marcos, Jesse Murray, and Devis Tuia. Rsvqa: Visual question answering for remote sensing data. IEEE Transactions on Geoscience and Remote Sensing, 58(12):8555–8566, 2020.
[20] Ke Li, Gang Wan, Gong Cheng, Liqiu Meng, and Junwei Han. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS Journal of Photogrammetry and Remote Sensing, 159:296–307, January 2020.
[21] Kartik Kuckreja, Muhammad Sohail Danish, Muzammal Naseer, Abhijit Das, Salman Khan, and Fahad Shahbaz Khan. Geochat: Grounded large vision-language model for remote sensing. arXiv preprint arXiv:2311.15826, 2023.
[22] Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, et al. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792, 2023.
[23] Chen Li, Yixiao Ge, Dian Li, and Ying Shan. Vision-language instruction tuning: A review and analysis. arXiv preprint arXiv:2311.08172, 2023.
[24] Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity. arXiv preprint arXiv:2310.07704, 2023.
[25] Sondos Mahmoud Bsharat, Aidar Myrzakhan, and Zhiqiang Shen. Principled instructions are all you need for questioning llama-1/2, gpt-3.5/4. arXiv preprint arXiv:2312.16171, 2023.
[26] Qimin Cheng, Haiyan Huang, Yuan Xu, Yuzhuo Zhou, Huanying Li, and Zhongyuan Wang. Nwpu-captions dataset and mlca-net for remote sensing image captioning. IEEE Transactions on Geoscience and Remote Sensing, 60:1–19, 2022.
[27] Xiaoqiang Lu, Binqiang Wang, Xiangtao Zheng, and Xuelong Li. Exploring models and data for remote sensing image caption generation. IEEE Transactions on Geoscience and Remote Sensing, 56(4):2183–2195, 2017.
[28] Zhiqiang Yuan, Wenkai Zhang, Kun Fu, Xuan Li, Chubo Deng, Hongqi Wang, and Xian Sun. Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval. IEEE Transactions on Geoscience and Remote Sensing, 60:1–19, 2022.
[29] Maryam Rahnemoonfar, Tashnim Chowdhury, Argho Sarkar, Debvrat Varshney, Masoud Yari, and Robin Roberson Murphy. Floodnet: A high resolution aerial imagery dataset for post flood scene understanding. IEEE Access, 9:89644–89654, 2021.
[30] Sylvain Lobry, Diego Marcos, Jesse Murray, and Devis Tuia. Rsvqa: Visual question answering for remote sensing data. IEEE Transactions on Geoscience and Remote Sensing, 58(12):8555–8566, December 2020.
[31] Yang Zhan, Zhitong Xiong, and Yuan Yuan. Rsvg: Exploring data and models for visual grounding on remote sensing data. IEEE Transactions on Geoscience and Remote Sensing, 61:1–13, 2023.
[32] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
[33] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021.
[34] Qidong Liu, Xian Wu, Xiangyu Zhao, Yuanshao Zhu, Derong Xu, Feng Tian, and Yefeng Zheng. Moelora: An moe-based parameter efficient fine-tuning method for multi-task medical applications. arXiv preprint arXiv:2310.18339, 2023.
[35] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-**g Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002.
[36] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare Voss, editors, Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics.
[37] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics.
[38] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation, 2015.
[39] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evaluation, 2016.
[40] Yakoub Bazi, Mohamad Mahmoud Al Rahhal, Mohamed Lamine Mekhalfi, Mansour Abdulaziz Al Zuair, and Farid Melgani. Bi-modal transformer-based approach for visual question answering in remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing, 60:1–11, 2022.
[41] Zixiao Zhang, Licheng Jiao, Lingling Li, Xu Liu, Puhua Chen, Fang Liu, Yuxuan Li, and Zhicheng Guo. A spatial hierarchical reasoning network for remote sensing visual question answering. IEEE Transactions on Geoscience and Remote Sensing, 61:1–15, 2023.
[42] **ze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and **gren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
[43] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.

Appendix A Model Architecture

To comprehensively evaluate the RS-GPT4V dataset, we use LLaVA-1.5-7B as the pre-trained model. This model is pre-trained on rich vision-language data to bridge the gap between vision and language, and its trainable parameters are denoted by $\theta$. Specifically, $X_{v}$ is the input image, $X_{instruct}$ is the text instruction, $L$ is the sequence length of the answer $X_{a}$, and $X_{instruct,<i}$ and $X_{a,<i}$ are the instruction and answer tokens that precede the current prediction token $x_{i}$, where $i$ indexes the steps of text token generation. The probability of the entire target answer $X_{a}$ is computed as follows:

$$
p\left(X_{a} \mid X_{v}, X_{instruct}\right) = \prod_{i=1}^{L} p_{\theta}\left(x_{i} \mid X_{v}, X_{instruct,<i}, X_{a,<i}\right) \tag{1}
$$
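The factorization in Equation (1) can be illustrated with a minimal sketch. It assumes the per-step distributions are already available as plain dictionaries (in practice they would come from the language model's softmax output); the function name `answer_log_likelihood` and the toy tokens are hypothetical and not part of LLaVA or of this work. The sum of log-probabilities is used instead of the raw product for numerical stability.

```python
import math


def answer_log_likelihood(step_distributions, answer_tokens):
    """Return log p(X_a | X_v, X_instruct) as a sum of per-token log-probabilities.

    step_distributions[i] is a dict mapping candidate tokens to the probability
    assigned at step i, assumed to be already conditioned on the image, the
    instruction, and all answer tokens before step i.
    """
    if len(step_distributions) != len(answer_tokens):
        raise ValueError("need exactly one distribution per answer token")
    log_p = 0.0
    for dist, token in zip(step_distributions, answer_tokens):
        log_p += math.log(dist[token])  # log p_theta(x_i | context)
    return log_p


# Toy example (hypothetical values): a two-token answer, so L = 2.
toy_steps = [
    {"an": 0.5, "a": 0.5},
    {"airport": 0.8, "harbor": 0.2},
]
toy_answer = ["an", "airport"]

log_p = answer_log_likelihood(toy_steps, toy_answer)
prob = math.exp(log_p)  # product of the per-step probabilities
```

Negating `log_p` gives the negative log-likelihood that autoregressive training typically minimizes.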

As shown in Figure 6, LLaVA-1.5-7B employs CLIP-336px as the image encoder to retain more information from the original pixel space and uses Vicuna v1.5 as the language model. The model's multimodal connector adopts a two-layer MLP structure. During training, the image encoder first encodes the input image into a sequence of visual tokens. This token sequence is then passed through the two-layer MLP, which uses the GELU activation function, to map the visual tokens to the dimension of the language embedding space. The mapped image features are combined with the text instruction tokens to form the input to the large language model. This architecture allows LLaVA-1.5-7B to handle a range of remote sensing tasks, including region-level description, visual localization, visual question answering, complex reasoning, and image description. Through training on these tasks, the model learns to identify objects and scenes in images and to perform complex reasoning and description.
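The connector described above can be sketched in a few lines of dependency-free Python. This is an illustrative toy, not the LLaVA implementation: the dimensions, the random initialization, the helper names (`init_linear`, `project_visual_tokens`), the zero-valued instruction embeddings, and the choice to place image tokens before text tokens are all assumptions made for the example.

```python
import math
import random


def gelu(x):
    # Exact GELU: x times the standard normal CDF, computed with erf.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(vec, weight, bias):
    # weight has shape (out_dim, in_dim); returns a vector of length out_dim.
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def init_linear(in_dim, out_dim, rng):
    # Small random weights and zero bias (toy initialization).
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project_visual_tokens(visual_tokens, layer1, layer2):
    # Two-layer MLP: Linear -> GELU -> Linear, applied to each visual token.
    projected = []
    for token in visual_tokens:
        hidden = [gelu(h) for h in linear(token, *layer1)]
        projected.append(linear(hidden, *layer2))
    return projected


rng = random.Random(0)
VISION_DIM, LLM_DIM = 4, 6  # toy sizes; real models use much larger dimensions

layer1 = init_linear(VISION_DIM, LLM_DIM, rng)
layer2 = init_linear(LLM_DIM, LLM_DIM, rng)

# Three stand-in visual tokens, as if produced by the image encoder.
visual_tokens = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(3)]
image_embeds = project_visual_tokens(visual_tokens, layer1, layer2)

# Placeholder instruction embeddings (two tokens) in the same embedding space.
text_embeds = [[0.0] * LLM_DIM for _ in range(2)]

# Combined sequence fed to the language model (ordering is an assumption).
llm_input = image_embeds + text_embeds
```

After projection, every visual token has the same width as a text embedding, which is what lets the two sequences be concatenated into a single input.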

Appendix B RS-GPT4V-Instruct

We compared the RS-GPT4V-Instruct dataset, generated through Instruction-Response Generation, with several other remote sensing instruction datasets. As shown in Table 4, all of these instruction datasets support complex reasoning and multi-turn dialogues. However, RS-GPT4V-Instruct additionally undergoes a final correction step through manual verification, which improves its accuracy. Other datasets, such as GeoChat, LHRS-Instruct, and HqDC-Instruct, primarily adopt a two-stage process: the first stage uses multimodal large language models (MLLMs) to generate detailed descriptions and extract fine-grained information, while the second stage employs large language models (LLMs) for Instruction-Response Generation, complex reasoning, and multi-turn dialogues. Errors in the first-stage annotations tend to accumulate in subsequent processing. Because the second stage has no access to the visual signal, the generated multi-turn dialogues are limited to the annotation information produced in the first stage. In contrast, RS-GPT4V-Instruct integrates visual signals during generation, capturing object information and image background details beyond what the system prompts contain, which reduces the accumulation of annotation errors. This approach yields higher accuracy and makes the RS-GPT4V-Instruct dataset well suited to tasks that depend on image content.

Table 4: Comparison of RS-GPT4V-Instruct and Other Remote Sensing Multimodal Instruction-Following Datasets.

Instruction-Following Dataset

GeoChat

LHRS-Instruct

HqDC-Instruct

RS-GPT4V-Instruct

Data Source

DOTA

DIOR

FAIR1M

RSITMD

NWPU-Captions

LHRS-Align