Learning Geospatial Region Embedding with Heterogeneous Graph

Xingchen Zou¹ Jiani Huang²∗ Xixuan Hao¹ Yuhao Yang³ Haomin Wen⁴ Yibo Yan¹ Chao Huang³ Yuxuan Liang¹

¹ The Hong Kong University of Science and Technology (Guangzhou), ² The Hong Kong Polytechnic University, ³ University of Hong Kong, ⁴ Beijing Jiaotong University

{xzou428,xhao390}@connect.hkust-gz.edu.cn {jianihuang01,yanyibo70,chaohuang75}@gmail.com {wenhaomin}@bjtu.edu.cn {yuhao-yang,yuxliang}@outlook.com

∗ These authors contributed equally to this work. Y. Liang is the corresponding author. Email: [email protected]

Abstract

Learning effective geospatial embeddings is crucial for a series of geospatial applications such as city analytics and earth monitoring. However, learning comprehensive region representations presents two significant challenges: first, the deficiency of effective intra-region feature representation; and second, the difficulty of learning from intricate inter-region dependencies. In this paper, we present GeoHG, an effective heterogeneous graph structure for learning comprehensive region embeddings for various downstream tasks. Specifically, we tailor satellite image representation learning through geo-entity segmentation and point-of-interest (POI) integration for expressive intra-regional features. Furthermore, GeoHG unifies informative spatial interdependencies and socio-environmental attributes into a powerful heterogeneous graph to encourage explicit modeling of higher-order inter-regional relationships. The intra-regional features and inter-regional correlations are seamlessly integrated by a model-agnostic graph learning framework for diverse downstream tasks. Extensive experiments demonstrate the effectiveness of GeoHG in geo-prediction tasks compared to existing methods, even under extreme data scarcity (with just 5% of training data). With interpretable region representations, GeoHG exhibits strong generalization capabilities across regions. We will release code and data upon paper notification.

1 Introduction

Geospatial regions, as complex systems, witness the intertwining of natural laws and social dynamics, leading to mixed-order interactions among regions and entities. Recently, there has been a growing interest in leveraging deep learning methods to tackle various geospatial tasks [59, 41], such as regional indicator prediction [43, 42, 45, 14], earth monitoring [19, 25, 47], and geopolitics optimization [12, 31, 40]. Global geospatial regions exhibit intricate and multifaceted dynamics. Learning effective geospatial embeddings that capture the inherent characteristics of regions and their intricate mixed-order relationships lays the groundwork for these applications [55, 33, 19].

To learn effective representations for geographical regions, it is essential to capture two key characteristics: intra-region features and inter-region correlations [30, 53, 5]. As illustrated in Figure 1, (a) intra-region features encapsulate a region’s environmental information (e.g., vegetation coverage, and water resources) and societal information (e.g., colleges, or government institutions). (b) Inter-region correlations indicate the relationships between regions, extending beyond simple pairwise adjacency to capture higher-order dependencies. Specifically, we believe that the representation of a region can be influenced by multiple regions collectively, even if they are not spatially adjacent. For instance, regions along the same river basin may exhibit correlated climate and economic patterns despite lacking direct adjacency.

Figure 1: Illustration of region representation from intra-region and inter-region perspectives. (a) Intra-region features consist of environmental features and societal features. (b) Inter-region correlations, e.g., the second-order adjacent correlation between neighboring regions, and the high-order correlation between remote region groups induced by environmental and societal factors.

Although relevant works [35, 42, 21] have sought to learn better geospatial representations, concisely learning an effective and comprehensive geospatial embedding remains challenging. The challenges are two-fold. First, an effective representation for intra-region features remains unexplored. Existing studies opt for leveraging globally available satellite imagery to learn geospatial representations that can generate a unified embedding space for regions across the world [35, 42, 20]. These studies often employ vision-based encoders pre-trained on natural images to learn semantics from satellite imagery, but such encoders struggle due to the substantial differences between satellite and natural imagery. Second, comprehensively modeling the complex high-order inter-region relations in geospatial embeddings is challenging. In geo-space, inter-region relationships extend beyond second-order adjacency to encompass ternary and higher-order dependencies between regions and region groups. For instance, a certain region may be influenced by several non-adjacent regions due to shared geographic or socioeconomic functions, despite lacking physical proximity [30, 21]. While multi-view graphs have been explored to model complex geographic and socioeconomic relations, they rely heavily on specific data for construction and are limited to second-order perspectives [59, 16, 17]. Alternatively, knowledge graphs (KGs) [27] built from geographical data, such as POIs, aim to capture complex structures, but their complexity limits scalability and effectiveness in training [27, 17].

To address these challenges, we propose a novel Heterogeneous Graph structure with a learning framework for effective and mixed-order relation-aware Geospatial embedding (named GeoHG). Our framework leverages satellite images and POI information to effectively derive unified region representations encapsulating both intra-region and inter-region information from environmental and societal perspectives. To derive effective intra-region representations that generalize across global regions, we propose a novel satellite image encoding mechanism. Instead of directly using vanilla vision encoders pre-trained on natural images, which struggle with the domain gap, we perform semantic segmentation on the satellite images to effectively distinguish geo-entities like water bodies, vegetation, and man-made structures. Subsequently, we integrate the spatial coordinates with the geo-entities and extra POI information within each region to construct a comprehensive intra-regional feature representation. To efficiently capture high-order inter-region relations, we leverage the powerful mixed-order dependencies from the heterogeneous graph that reflects both environmental and societal aspects of geographical regions. The heterogeneous formulation integrates spatial attributes with explicit region-entity associations, representing the high-dimensional socio-environmental dependencies across regions in a unified framework. Finally, GeoHG integrates the intra-region and inter-region representations through a model-agnostic heterogeneous graph structure. Its compatibility with different models and algorithms allows it to be applied to various tasks with an effective geospatial representation.

To summarize, the contributions of this work are as follows:

• Novel Heterogeneous Graph Structure for Geospatial Embedding. We introduce a model-agnostic heterogeneous graph structure to integrate intra-region features and inter-region correlations in geospace efficiently. As far as we know, it is the first work that integrates comprehensive intra-region information with complex mixed-order inter-region relations in a heterogeneous graph for geospatial representation (Sec 3.3).

• Efficient and Explicit Intra-Region Embedding. We develop an effective approach to construct intra-region feature representation by leveraging entity segments from satellite images to extract environmental features and utilizing POI information to capture societal attribute features (Sec 3.1).

• Mixed-order Inter-region Relation-Aware Representation. We propose a novel method to construct interpretable heterogeneous graphs that explicitly capture mixed-order relations between regions. Our approach effectively represents the intricate multivariate relationships within the regions’ natural environment. Additionally, it captures the complex social attribute relationships among various types of social entities within geospatial regions (Sec 3.2).

• Empirical Evidence. Our empirical studies across various downstream tasks in different locations demonstrate the superior performance of GeoHG compared to various baseline models. It consistently outperforms existing methods and maintains substantial gains even in extremely low-data regimes (5% training data). Moreover, GeoHG outperforms conventional geospatial interpolation methods specifically designed for such data-scarce settings (Sec 4.1).

2 Related Work

Geospatial Embedding. Numerous prior studies [52, 57, 13, 35, 20] have attempted to address the challenge of geospatial embedding. For example, Zhang et al. [52] proposed a multi-view graph representation approach that considered POI data and mobility patterns to generate representations of regions. In a similar vein, Zhou et al. [57] employed a prompt learning method by leveraging both POI and mobility data. However, these approaches are limited to specific tasks and regions due to limited and non-comprehensive embeddings [53, 42]. To tackle this issue, there has been a growing interest in utilizing satellite imagery for geospatial embedding, as it offers easy accessibility and global coverage. For instance, PG-SimCLR [38] and UrbanCLIP [42] have demonstrated success in profiling urban regions through satellite images for various downstream tasks. Nevertheless, there remains a gap for these models to effectively interpret the complex and professional semantic meaning of satellite imagery. Moreover, the aforementioned methods overlook the crucial modeling of mixed-order inter-region relations, which could affect the comprehensiveness of geospatial embedding. Therefore, our proposed GeoHG initially employs a heterogeneous graph to capture these inter-region relationships while concurrently learning the explicit features of a given region.

Graph Neural Network for Geospatial Representation. Graph Neural Networks (GNNs) offer a succinct and scalable approach for modeling intricate geospaces with non-Euclidean characteristics [51, 26, 11, 34]. By propagating messages through edges, GNNs are capable of learning the geospatial correlations between regions and geo-entities effectively. To better represent these complex correlations, numerous studies have adopted a multi-view graph structure [22, 5, 3, 7]. In this framework, nodes in each view (e.g., distance, mobility, semantic) are linked with distinct edge vectors that represent the correlations between node pairs under specific perspectives. Differing from this approach, our work utilizes a Heterogeneous Graph (HG) to further represent mixed-order geospatial correlations. Although HG structures are frequently used in knowledge graphs [27, 17, 23] that often lack scalability and effective training [27, 17], our GeoHG innovatively revises HG to succinctly represent mixed-order correlations and group interactions between regions from both environmental and societal perspectives. Moreover, we devise a streamlined yet efficient pipeline for automatically constructing this heterogeneous graph from satellite imagery and POI information.

Multimodal Learning in GeoAI. Multimodal learning involves integrating data from various modalities to enhance model performance [1, 8, 6]. Geospatial multimodal learning, in particular, improves regional understanding by combining spatio-temporal (e.g., POI and road network), visual (e.g., urban imagery), and textual (e.g., social media data) modalities, resulting in more comprehensive and accurate geospatial representations [59, 55]. Traditional geospatial representation learning mostly relies on task-specific supervised methods [48, 55], which are limited by their dependence on domain expertise and inability to adapt to new tasks [59]. To overcome these limitations, recent studies [15, 52] focus on learning general geospatial embeddings from a multimodal perspective. Nonetheless, efficiently integrating multimodal data from intricate geospaces, while maintaining generalizability, poses a significant challenge [59]. In this study, we develop a holistic geospatial embedding learning framework grounded in graph theory to amalgamate visual insights and POI information of geospace while maintaining awareness of geospatial mixed-order relations.

3 Methodology

Geospatial Region Embedding: We aim to train a GeoHG model capable of generating embeddings for regions worldwide to facilitate a diverse set of downstream applications. To enable our proposed model to derive embeddings for arbitrary regions, we construct the GeoHG architecture using globally available satellite imagery and readily accessible POI data as input sources, serving as proxies for environmental and societal information, respectively.

Given a satellite imagery and POI dataset $D_{\rm satellite\_POI}=\left\{\left(S_{i},P_{i}\right)\right\}_{i=1}^{I}$ containing pairs of satellite image $S_{i}$ and POI data $P_{i}$ corresponding to the region of interest $R_{i}$, our GeoHG model W generates the embedding $E_{i}=\phi\left(S_{i},P_{i}\right)$, where $E_{i}\in\mathbb{R}^{d}$ is the respective region embedding and $\phi(\cdot)$ is an embedding function capturing semantic and contextual information of $R_{i}$. The generated embeddings $E_{i}$ encapsulate the intra-region features, containing environmental and societal information, and the higher-order inter-region dependencies with other regions. Notably, the satellite images $S_{i}$ and POI data $P_{i}$ can originate from arbitrary geographical regions worldwide, necessitating that the learned embeddings concisely yet comprehensively encode rich contextual information across a global scale, which poses a formidable challenge.

Our architecture comprises four major stages to obtain intra-regional features and inter-regional correlations and effectively integrate them, as illustrated in Figure 2: (1) intra-region feature representation, (2) inter-region correlation representation, (3) heterogeneous graph-based representation integration, and (4) pre-training and end-to-end training. We now present the details of each stage.

3.1 Intra-Region Feature Representation

To effectively encode spatial, environmental, and societal signals into intra-region representations, we extract multi-modal geo-contextual data including satellite imagery, POI information, and the location of the region of interest.

Spatial Position Embedding: We adopt a simple yet effective method to extract the spatial information of the region of interest, drawing inspiration from industrial conventions [42, 9]. Specifically, we divide the overall geospace into multiple $1\textrm{km}\times 1\textrm{km}$ spatial grids with coordinate system origin $(lon^{0},lat^{0})$ and use the abscissa $x_{R}$ and ordinate $y_{R}$ of the region of interest $R$ as spatial information $E_{pos}(x,y)$, as illustrated in Figure 3. Notably, $E_{pos}$ is directly related to the geo-coordinate $(lon,lat)$ at the centroid of the targeted region. For any region $R_{i}$ in the target space, we have:

$$E^{i}_{pos}=(lon^{i}-lon^{0},\,lat^{i}-lat^{0})\,\Delta\cdot D^{-1}\tag{1}$$

where $\Delta$ is the transform vector decided by the coordinate system. $D$ denotes the scale of the grids, and for $1\textrm{km}\times 1\textrm{km}$ grids, $D=1$ .
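Equation 1 can be illustrated with a short sketch. The paper does not specify $\Delta$ beyond saying it depends on the coordinate system, so the version below assumes an equirectangular approximation in which $\Delta$ holds per-axis degree-to-kilometre factors; the constant `KM_PER_DEGREE` and the function name `position_embedding` are hypothetical.

```python
import math

# Assumption: roughly 111.32 km per degree of latitude (equirectangular approximation).
KM_PER_DEGREE = 111.32


def position_embedding(lon, lat, lon0, lat0, grid_km=1.0):
    """Grid coordinates (x, y) of a region centroid relative to the origin (lon0, lat0)."""
    # Delta: assumed per-axis conversion from degrees to kilometres.
    delta = (KM_PER_DEGREE * math.cos(math.radians(lat0)), KM_PER_DEGREE)
    # D is the grid scale; D = 1 for 1 km x 1 km grids.
    x = (lon - lon0) * delta[0] / grid_km
    y = (lat - lat0) * delta[1] / grid_km
    return (x, y)
```

With this choice, a centroid one degree north of the origin lands about 111 grid cells away along the ordinate, while the abscissa shrinks with the cosine of the origin latitude.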

Environmental Feature Embedding: We utilize satellite imagery to mine the environmental features and expect the satellite imagery encoding to be efficient and concise. To achieve this, we draw inspiration from the European Space Agency (ESA) WorldCover dataset [50], elaborated in Appendix A, and design a segmentation-based process to encode the satellite images instead of the most commonly used visual encoders such as CNNs. Given a satellite image $S_{i}$ of a region of interest $R_{i}$, we conduct semantic segmentation based on ESA WorldCover to obtain a series of geo-entities $\{entity_{1},entity_{2},\ldots,entity_{J}\}$ within that region, such as developed areas, grass, and trees, and use their area proportions as the environmental feature embedding:

$$E_{env}=\{p_{ent_{1}},p_{ent_{2}},\ldots,p_{ent_{J}}\}\tag{2}$$

where $p_{ent_{j}}=\frac{A_{j}}{A_{R}}$ , and $A_{j}$ is the area occupied by $entity_{j}$ , $A_{R}$ is the area of region $R_{i}$ , $J$ is the number of environmental entity types. We discuss the motivation for using the segmentation-based approach instead of visual encoders and additional efficiency experiments in Appendix B.
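As a minimal sketch of Equation 2, the area proportions can be computed from a segmentation mask by counting pixels per class. This assumes every pixel covers the same ground area and that class labels are integers in `range(num_classes)`; the function name and the toy mask are illustrative and not taken from the paper.

```python
from collections import Counter


def environmental_embedding(mask, num_classes):
    """Area proportion of each geo-entity class in a 2D segmentation mask."""
    counts = Counter(label for row in mask for label in row)
    total = sum(len(row) for row in mask)  # stands in for the region area A_R
    return [counts.get(j, 0) / total for j in range(num_classes)]


# Toy 4x4 mask with three hypothetical classes (0, 1, 2).
toy_mask = [
    [0, 0, 1, 1],
    [0, 2, 1, 1],
    [0, 2, 2, 1],
    [0, 0, 2, 1],
]
print(environmental_embedding(toy_mask, 3))  # [0.375, 0.375, 0.25]
```

The resulting vector has one entry per entity class and sums to one whenever every pixel is labeled.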

Societal Feature Embedding: To effectively represent the societal features of a region, given the POI dataset $D_{POI}$ and the region of interest $R_{i}$, we first find all points $D^{R_{i}}_{POI}=(poi_{1},poi_{2},\ldots,poi_{j})$ located in $R_{i}$. Then, we compute the proportion of each of the $K$ POI categories in the region $R_{i}$ to get $E_{soc}^{\prime}$ as below:

$$E_{soc}^{\prime}=\{p_{poi_{1}},p_{poi_{2}},\ldots,p_{poi_{K}}\},\quad p_{poi_{k}}=\frac{|C_{k}\cap D_{POI}^{R_{i}}|}{|D_{POI}^{R_{i}}|}\tag{3}$$

where $K$ is the number of POI categories and $C_{k}=\{poi_{j}\in D_{POI}\mid\text{category}(poi_{j})=k\}$. However, directly employing the proportions of different POI categories $E_{soc}^{\prime}$ as societal features can still be problematic, since they cannot distinguish between areas with high POI density (usually with a higher degree of socialization) and those with relatively low POI density. Therefore, we transform $E_{soc}^{\prime}$ into the final societal feature embedding $E_{soc}$ by multiplying it by a social impact factor $f$ as below:

$$E_{soc}=f\cdot E_{soc}^{\prime},\quad f=\log\left(|D^{R_{i}}_{POI}|+1\right)\tag{4}$$
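Equations 3 and 4 can be sketched together as follows. The base of the logarithm is not stated in the text, so the natural logarithm is assumed here; the function name and the convention of returning zeros for a region without POIs are also assumptions.

```python
import math
from collections import Counter


def societal_embedding(poi_categories, num_categories):
    """Category proportions scaled by the social impact factor f = log(n + 1).

    poi_categories: category index of every POI located in one region.
    """
    n = len(poi_categories)
    if n == 0:
        # Assumption: a region without POIs gets an all-zero societal embedding.
        return [0.0] * num_categories
    counts = Counter(poi_categories)
    f = math.log(n + 1)  # natural log assumed
    return [f * counts.get(k, 0) / n for k in range(num_categories)]
```

Two regions with identical category proportions but different POI counts now receive embeddings of different magnitude, which is the distinction the factor $f$ is meant to restore.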

Finally, the intra-region feature embedding is generated by concatenating the spatial position embedding $E_{pos}$ , environmental feature embedding $E_{env}$ and societal feature embedding $E_{soc}$ :

$$E_{intra}=\mathrm{concat}[E_{pos},E_{env},E_{soc}]\tag{5}$$

3.2 Inter-Region Correlation Representation

Our objective is to model both pairwise second-order relationships between regions and their adjacent neighbors and higher-order dependencies with distant region groups exhibiting similar environmental or societal characteristics. Inspired by hypergraph theories [21, 4], elaborated in Appendix C.1, we employ a heterogeneous graph formulation, GeoHG, to represent these relationships explicitly. As shown in Figure 4, we take heterogeneous nodes in the graph structure as transfer nodes to achieve high-order message passing in the hypergraph. Given a pair of $1\textrm{km}\times 1\textrm{km}$ satellite image $S_{i}$ and POI data $P_{i}$ in $D_{\rm satellite\_POI}$, we construct an undirected weighted heterogeneous graph $\mathcal{G}_{i}$ containing three types of nodes: regional nodes, environmental entity nodes, and societal entity nodes. Specifically, regional nodes correspond to $1\textrm{km}\times 1\textrm{km}$ grid cells from the geospace; environmental entity nodes represent the $J$ distinct geo-entity classes detected from satellite image $S_{i}$, which are consistent with the intra-region environmental features in Section 3.1; societal entity nodes constitute the $K$ different POI categories. The overall structure of GeoHG is shown in Figure 2 and the details are discussed in Appendix 4. We then elaborate on how second-order and high-order information is obtained through GeoHG $\mathcal{G}_{i}=(\mathcal{V},\mathcal{E})$, respectively.

Second-Order Relation Representation: Second-order relations capture the pairwise spatial adjacency between regional nodes (grid cells) in $\mathcal{G}_{i}$. We construct undirected Region Nearby Region (RNR) edges $\mathcal{E}_{RNR}$ between regional nodes whose corresponding grid cells are spatially adjacent in a $3\times 3$ grid, thereby encoding these local second-order dependencies:

$$\mathcal{E}_{RNR,mn}=1,\quad\text{if }\mathcal{V}_{m}\text{ and }\mathcal{V}_{n}\text{ are adjacent in geospace}\tag{6}$$
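A minimal sketch of the RNR edge construction in Equation 6 is given below. It interprets "adjacent in a $3\times 3$ grid" as the 8-neighborhood of a cell, which is an assumption, and it identifies regional nodes by hypothetical integer grid coordinates.

```python
def rnr_edges(cells):
    """Undirected Region-Nearby-Region edges between 8-neighboring grid cells.

    cells: iterable of (x, y) integer grid coordinates of regional nodes.
    Returns a set of frozensets, each holding the two endpoints of one edge.
    """
    cell_set = set(cells)
    edges = set()
    for (x, y) in cell_set:
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                neighbor = (x + dx, y + dy)
                if neighbor != (x, y) and neighbor in cell_set:
                    edges.add(frozenset(((x, y), neighbor)))
    return edges


grid = [(x, y) for x in range(3) for y in range(3)]
print(len(rnr_edges(grid)))  # 20
```

For a full $3\times 3$ block this yields 6 horizontal, 6 vertical, and 8 diagonal edges, and the center cell is linked to all eight surrounding cells.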

High-Order Relation Representation: Unlike traditional multi-view graphs [7, 59], which represent complex correlations through additional second-order region connections, we explicitly model higher-order relations by leveraging the environmental entity nodes $\mathcal{V}_{env}$ and societal entity nodes $\mathcal{V}_{soc}$ in $\mathcal{G}_{i}$. Specifically, we construct heterogeneous weighted edges $\mathcal{E}_{ELR}$ and $\mathcal{E}_{SLR}$ through $\mathcal{V}_{env}$ and $\mathcal{V}_{soc}$ to associate each regional node with its constituent environmental entities (e.g., water, vegetation) and societal entities (e.g., educational, commercial POIs). For $p_{ent_{ij}}\geq\theta_{Env}$ or $f\cdot p_{poi_{ik}}\geq\theta_{Soc}$:

$$\mathcal{E}_{ELR,ij}=1\cdot p_{ent_{ij}},\quad\mathcal{E}_{SLR,ik}=1\cdot f\cdot p_{poi_{ik}}\tag{7}$$

where $p_{ent_{ij}}$ represents the area proportion of geo-entity $j$ at region $i$, $p_{poi_{ik}}$ denotes the quantity proportion of POI category $k$ at region $i$, and $f$ is the social impact factor defined in Equation 4. $\theta_{Env}$ and $\theta_{Soc}$ are hyperparameters that prune weak associations to keep the graph structure concise and efficient. This edge formulation allows encoding higher-order relationships between regions exhibiting similar environmental/societal attributes, even if they lack spatial proximity. By combining the second-order spatial adjacency edges with these higher-order hyperedge associations, $\mathcal{G}_{i}$ holistically represents the mixed-order relational patterns within the region.
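The thresholded edge construction in Equation 7 can be sketched as below. The same helper serves both edge types: pass $p_{ent_{ij}}$ values with $\theta_{Env}$ for ELR edges, or $f\cdot p_{poi_{ik}}$ values with $\theta_{Soc}$ for SLR edges. The function names, the region identifiers, and the threshold value 0.1 are all hypothetical.

```python
from itertools import combinations


def entity_edges(region_weights, threshold):
    """Weighted region-entity edges, kept only when the weight reaches the threshold.

    region_weights: {region_id: [weight of entity 0, weight of entity 1, ...]}
    Returns {(region_id, entity_index): weight}.
    """
    return {
        (rid, j): w
        for rid, weights in region_weights.items()
        for j, w in enumerate(weights)
        if w >= threshold
    }


def regions_linked_by_entity(edges):
    """Region pairs joined by a two-hop path through a shared entity node."""
    by_entity = {}
    for (rid, j) in edges:
        by_entity.setdefault(j, []).append(rid)
    pairs = set()
    for rids in by_entity.values():
        pairs.update(combinations(sorted(rids), 2))
    return pairs


# Three regions, three hypothetical entity classes, illustrative threshold 0.1.
env = {"r1": [0.6, 0.4, 0.0], "r2": [0.05, 0.15, 0.8], "r3": [0.5, 0.0, 0.5]}
elr = entity_edges(env, threshold=0.1)
```

Here `r1` and `r3` become two-hop neighbors through entity 0 regardless of where they sit on the grid, which is how an entity node acts as a hyperedge over a group of regions.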

3.3 Heterogeneous Graph-Based Representation Integration

Having derived intra-region features capturing regions’ intrinsic attributes and inter-regional dependencies encoded in the heterogeneous graph $\mathcal{G}_{i}$ , we employ a model-agnostic graph neural network framework to jointly reason over the intra-regional and inter-regional information. In the graph, each regional node $v$ in $\mathcal{G}_{i}$ is represented with an initial node feature $\mathbf{x}_{v}$ corresponding to its intra-regional feature $E_{intra}$ elaborated in Sec 3.1. Note that for simplicity, instead of utilizing hypergraph neural networks [4, 44, 39], we model hyperedges with the form of complete subgraphs in $\mathcal{G}_{i}$ . Therefore, we apply Heterogeneous Graph Neural Networks (HGNNs) over $\mathcal{G}_{i}$ to update the node representations by iteratively aggregating and transforming multi-hop neighborhood information:

$$\mathbf{x}_{v}^{(l+1)}=\text{HGNN}^{(l)}\left(\mathbf{x}_{v}^{(l)},\mathcal{N}(v),\mathcal{G}_{i}\right),\quad\mathbf{x}_{v}^{(l)}\in\mathbb{R}^{d}\tag{8}$$

where $\mathcal{N}(v)$ denotes the neighbors (spatial and higher-order) of node $v$ in $\mathcal{G}_{i}$. After $L$ layers of updates, the final node embedding $\mathbf{x}_{v}^{(L)}$ integrates cues from the intra-regional features as well as mixed-order inter-region dependencies, and serves as the comprehensive representation of the given region node. Our framework is model-agnostic, enabling flexible integration of any HGNN variant for neighborhood aggregation and easy extension to more expressive HGNN models.
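To make the multi-hop behaviour of Equation 8 concrete, the sketch below implements a deliberately simplified layer: a weighted mean over neighbors per edge type, averaged with the node's own feature. It has no learnable weights or nonlinearity and is not the HGNN used in the paper; all names are hypothetical and every node is assumed to carry a feature of the same length.

```python
def hetero_layer(x, typed_edges):
    """One round of per-edge-type weighted mean aggregation, averaged with self.

    x: {node: [float, ...]} features of equal length d.
    typed_edges: {edge_type: [(u, v, weight), ...]} undirected weighted edges.
    """
    d = len(next(iter(x.values())))
    new_x = {}
    for v in x:
        messages = [x[v]]  # keep the node's own feature
        for edges in typed_edges.values():
            acc, total = [0.0] * d, 0.0
            for a, b, w in edges:
                if v in (a, b):
                    u = b if a == v else a  # the other endpoint
                    acc = [s + w * xu for s, xu in zip(acc, x[u])]
                    total += w
            if total > 0:
                messages.append([s / total for s in acc])
        new_x[v] = [sum(vals) / len(messages) for vals in zip(*messages)]
    return new_x


# Regions A and C are not adjacent but share the entity node "water".
x0 = {"A": [1.0], "C": [0.0], "water": [0.0]}
edges = {"ELR": [("A", "water", 0.5), ("C", "water", 0.5)]}
x1 = hetero_layer(x0, edges)
x2 = hetero_layer(x1, edges)
print(x1["C"], x2["C"])  # [0.0] [0.125]
```

After one layer the signal from `A` has only reached the entity node; after two layers it arrives at `C`, illustrating why at least two layers are needed for entity nodes to relay information between distant regions.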

3.4 Pretraining and End-to-End Training

To effectively empower various geospatial tasks with the devised representations, we present two training strategies: self-supervised pre-training that learns generalizable region representations without task-specific labels, enabling efficient transfer to diverse downstream tasks; and end-to-end training that directly optimizes task-specific objectives for peak performance.

Self-Supervised Pre-training: Inspired by CLIP [29], we pre-train our model using a contrastive learning paradigm where we maximize the similarity between a region $\mathbf{R}_{i}$ and its corresponding positive region groups $\mathbf{C}_{i}$ , while minimizing its similarity with other region groups in the batch. Specifically, we sample regional nodes adjacent to $\mathbf{R}_{i}$ as well as nodes exhibiting similar intra-regional features to construct $\mathbf{C}_{i}$ . During training, we leverage a GNN model $\mathcal{M}_{pretrain}$ to obtain embeddings for $\mathbf{R}_{i}$ and all nodes in $\mathbf{C}_{i}$ . The embeddings of the sampled positive regional nodes are further pooled to obtain the unified positive region representation $\mathbf{e}_{j}$ . The process of generating embeddings for $\mathbf{R}_{i}$ and $\mathbf{C}_{i}$ is conducted as follows:

$$\mathbf{e}_{i}=\mathcal{M}_{pretrain}(\mathbf{X}_{i}),\quad\mathbf{e}_{j}=\operatorname{Pool}\left(\mathcal{M}_{pretrain}(\mathbf{X}_{j}),\,j\in\mathcal{C}_{i}\right)\tag{9}$$

We further define a similarity function $f_{\mathbf{score}}$ to measure the similarity between the representations $\mathbf{e}_{i}$ and $\mathbf{e}_{j}$; $f_{\mathbf{score}}$ can be a simple dot product or a more complex metric. We then employ an InfoNCE-based loss [28] to conduct contrastive learning:

$$\mathcal{L}_{pretrain}=\mathbb{E}\left[-\log\frac{\exp\left(f_{\mathbf{score}}\left(\mathbf{e}_{i},\mathbf{e}_{j}\right)\right)}{\sum_{\forall i,n}\exp\left(f_{\mathbf{score}}\left(\mathbf{e}_{i},\mathbf{e}_{n}\right)\right)}\right]\tag{10}$$

where $f_{\mathbf{score}}(\mathbf{e}_{i},\mathbf{e}_{j})$ represents the score of positive pairs while $f_{\mathbf{score}}\left(\mathbf{e}_{i},\mathbf{e}_{n}\right)$ refers to the scores of negative pairs. After obtaining the pre-trained model $\mathcal{M}_{pretrain}$ through self-supervised learning, we fine-tune it for different downstream tasks by adding three trainable linear layers. The weights of these additional linear layers are updated during training for the specific downstream task while the weights of $\mathcal{M}_{pretrain}$ itself remain fixed.
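The pooling in Equation 9 and the loss in Equation 10 can be sketched in plain Python as follows. Several choices here are assumptions rather than details given in the text: `Pool` is taken to be a mean, $f_{\mathbf{score}}$ a dot product, the pooled representations of other regions in the batch act as negatives, and the denominator includes the positive pair as in the standard InfoNCE formulation.

```python
import math


def mean_pool(vectors):
    """Assumed Pool operator: element-wise mean of the positive group's embeddings."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dot(a, b):
    """Assumed f_score: plain dot product."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, pooled_positives):
    """Batch InfoNCE: anchors[i] is paired with pooled_positives[i]."""
    losses = []
    for i, e_i in enumerate(anchors):
        scores = [dot(e_i, e_n) for e_n in pooled_positives]
        m = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_denominator - scores[i])
    return sum(losses) / len(losses)
```

When all embeddings are identical the loss equals the logarithm of the batch size, and it approaches zero as each anchor aligns with its own positive group and separates from the others.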

End-to-End Training: For a specific downstream task, we directly optimize all of GeoHG’s parameters from scratch using the task’s supervised signal. Taking the regional regression task as an illustrative example, we adopt the HGNN model $\mathcal{M}$ to obtain the embedding for a given region $i$, then feed it into a three-layer linear regression head. The parameters of $\mathcal{M}$ and the regression head are updated by minimizing the Mean Squared Error (MSE) loss on the training data, jointly learning task-specific region representations.

4 Experiments

Datasets & Baselines. The datasets used in this paper include satellite imagery, POI information and five region indicators (Population, GDP, Night Light, Carbon, $PM_{2.5}$) located in four representative cities in China: Beijing, Shanghai, Guangzhou and Shenzhen. The satellite images are collected according to geospatial grids and each covers a spatial area of 1 $km^{2}$. The entity segmentation results for satellite imagery are collected from the European Space Agency. We randomly split the data into 60% training, 20% validation, and 20% testing sets. For comparison, we select two classical methods (AutoEncoder [18] and ResNet-18 [10]), a state-of-the-art satellite imagery-based model (UrbanCLIP [42]), and multisource and multimodal approaches (UrbanVLP [9], PG-SimCLR [38] and GeoStructural [20]) for geospatial region embedding. Moreover, to assess generalization capability, we explore a variant of GeoHG that employs contrastive graph self-supervised learning (SSL) pre-training instead of end-to-end training, denoted GeoHG-SSL. Dataset, baseline and implementation details are provided in Appendices D and E.

4.1 Comparison with State-of-the-Art Methods

Table 1: Region indicators prediction results. The bold/underlined font means the best/the second-best result.

Methods		GeoHG		GeoHG-SSL		UrbanVLP		GeoStructural		PG-SimCLR		UrbanCLIP		ResNet-18		AutoEncoder
Metric		$\text{R}^{2}$	MAE	$\text{R}^{2}$	MAE	$\text{R}^{2}$	MAE	$\text{R}^{2}$	MAE	$\text{R}^{2}$	MAE	$\text{R}^{2}$	MAE	$\text{R}^{2}$	MAE	$\text{R}^{2}$	MAE
Beijing	Carbon	0.954	0.110	0.937	0.161	0.787	0.353	0.765	0.378	0.442	0.631	0.664	0.528	0.394	0.577	0.298	0.565
	Population	0.874	0.271	0.870	0.282	0.725	0.404	0.730	0.402	0.471	0.964	0.461	0.552	0.266	0.623	0.168	0.667
	GDP	0.647	0.336	0.644	0.331	0.586	0.416	0.617	0.401	0.277	0.768	0.355	0.539	0.285	0.699	0.171	0.822
	Night Light	0.901	0.239	0.900	0.244	0.531	0.394	0.488	0.429	0.369	0.404	0.420	0.457	0.348	0.576	0.276	0.643
	$PM_{2.5}$	0.971	0.064	0.970	0.065	0.641	0.484	0.694	0.306	0.398	0.624	0.533	0.556	0.341	0.589	0.209	0.659
Shanghai	Carbon	0.915	0.157	0.912	0.162	0.716	0.392	0.688	0.413	0.298	0.712	0.671	0.426	0.326	0.465	0.230	0.532
	Population	0.936	0.161	0.928	0.172	0.593	0.471	0.613	0.456	0.315	0.731	0.456	0.557	0.279	0.627	0.166	0.742
	GDP	0.778	0.323	0.767	0.331	0.310	0.595	0.377	0.553	0.294	0.767	0.326	0.587	0.289	0.711	0.197	0.844
	Night Light	0.898	0.222	0.891	0.234	0.457	0.494	0.442	0.517	0.308	0.566	0.387	0.511	0.244	0.571	0.164	0.617
	$PM_{2.5}$	0.866	0.120	0.836	0.150	0.486	0.497	0.527	0.398	0.303	0.617	0.444	0.518	0.292	0.642	0.243	0.691
Guangzhou	Carbon	0.885	0.219	0.884	0.209	0.698	0.385	0.681	0.497	0.422	0.708	0.585	0.444	0.375	0.515	0.254	0.570
	Population	0.871	0.244	0.855	0.255	0.665	0.441	0.687	0.433	0.303	0.954	0.533	0.567	0.274	0.671	0.195	0.753
	GDP	0.715	0.371	0.712	0.366	0.436	0.541	0.439	0.533	0.282	0.897	0.440	0.546	0.251	0.725	0.176	0.811
	Night Light	0.871	0.234	0.854	0.248	0.577	0.418	0.574	0.415	0.435	0.433	0.483	0.478	0.242	0.551	0.176	0.602
	$PM_{2.5}$	0.833	0.133	0.822	0.158	0.638	0.462	0.652	0.461	0.315	0.542	0.56	0.514	0.231	0.694	0.196	0.780
Shenzhen	Carbon	0.926	0.128	0.912	0.162	0.659	0.418	0.647	0.431	0.257	0.683	0.562	0.483	0.241	0.577	0.189	0.634
	Population	0.892	0.165	0.879	0.173	0.790	0.343	0.797	0.314	0.311	0.758	0.527	0.592	0.299	0.654	0.175	0.772
	GDP	0.798	0.297	0.767	0.331	0.532	0.448	0.517	0.455	0.307	0.895	0.508	0.464	0.234	0.817	0.119	0.884
	Night Light	0.942	0.149	0.939	0.155	0.457	0.459	0.445	0.488	0.454	0.358	0.387	0.511	0.243	0.543	0.166	0.608
	$PM_{2.5}$	0.906	0.116	0.905	0.117	0.566	0.494	0.597	0.451	0.323	0.613	0.430	0.586	0.273	0.598	0.149	0.645

We adopt Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the coefficient of determination ($R^{2}$) as evaluation metrics. A higher $R^{2}$ and lower MAE and RMSE indicate better model accuracy. We report the performance of each model in Table 1; the table with RMSE metrics is provided in Appendix F. From these tables, we have three key findings: 1) Both GeoHG and GeoHG-SSL outperform all competing baselines on the five indicators across the four cities. For instance, in Beijing GeoHG improves $R^{2}$ over the previous SOTA UrbanVLP [9] by 16.7, 14.9, 6.1, 37.0, and 33.0 percentage points on the Carbon, Population, GDP, Night Light and $PM_{2.5}$ tasks, respectively. Consistent trends are observed across other cities, underscoring GeoHG’s stable accuracy, versatility, and strong generalization capabilities for geospatial region embedding. 2) The end-to-end trained GeoHG outperforms its pre-trained version GeoHG-SSL in most tasks, while GeoHG-SSL performs well in multi-task transfer. 3) Multisource and multimodal approaches, i.e., UrbanVLP [9] and GeoStructural [20], largely surpass traditional satellite imagery-based models owing to their more comprehensive embedding views.
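For reference, the three metrics can be computed as in the sketch below, which uses their standard definitions; the function name and the toy values are illustrative and unrelated to the reported results.

```python
import math


def regression_metrics(y_true, y_pred):
    """MAE, RMSE and R^2 for paired lists of targets and predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Predicting the mean of the targets everywhere gives R^2 = 0.
print(regression_metrics([1.0, 2.0, 3.0, 4.0], [2.5, 2.5, 2.5, 2.5])["r2"])  # 0.0
```

An $R^{2}$ of 1 corresponds to perfect predictions, 0 to a constant predictor at the target mean, and negative values to predictions worse than that constant.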

4.2 Ablation Study & Interpretation Analysis

Effects of Core Components. To examine the effectiveness of each core component in our proposed framework, we conducted an ablation study based on the following variants for comparison: a) w/o Env, which excludes environment information from the satellite imagery for embedding. b) w/o Soc, which omits POI entities for embedding. c) w/o Pos, which does not utilize the corresponding location information of the regions. The $R^{2}$ results for five tasks in two cities, Guangzhou and Shanghai, are displayed in Figure 5. We can observe that removing location information markedly degrades performance across all tasks. Meanwhile, environmental features severely impact model accuracy on the PM2.5, Night Light, and GDP tasks. In contrast, excluding POI data slightly reduces performance on the Carbon task. Appendix F.3 shows ablation results for the other two cities.

Effect of Graph Construction. Our framework uses a heterogeneous graph with hyperedges between regions to reflect high-order geospatial relations. To validate the effectiveness of this high-order relation representation, we compare GeoHG against two variants: GeoHG-MLP, which discards the graph structure and relies only on the intra-region feature representation, employing a 3-layer MLP for regression; and GeoHG-Mono, which keeps edges between adjacent regions while discarding the hyperedges for high-order relations. The results in Figure 5 indicate that GeoHG benefits significantly from its efficient representation of complex mixed-order relations in geospace and from integrating intra-region and inter-region representations.

Interpretation of Mixed-order Geospatial Correlations. To investigate the power of the heterogeneous graph structure in capturing mixed-order geospatial relations, and inspired by GNNExplainer [46], we depict the learned geospatial dependencies on the Carbon Emission dataset in Guangzhou. Notably, we do not incorporate any external features (e.g., road network, human mobility data) to construct auxiliary edges in the graph. By selecting the top $N$ important nodes by their weight in the GNN prediction process, as illustrated in Figure 6, we observe that GeoHG effectively captures the high-order correlations between the target region and remote region groups through message passing within the heterogeneous graph structure. Details of GNNExplainer and the experiment results are given in Appendix F.5.

4.3 Analysis on Few-shot Learning and Data Efficiency

Applying geospatial models globally across tasks often requires large labeled datasets for supervised training - a process that is time-consuming, computationally expensive, and hindered by data scarcity. We therefore evaluate GeoHG’s performance under limited data regimes of 5%, 10%, and 20% available training samples, with results shown in Figure 7. Results demonstrate GeoHG’s strong data efficiency - with only 5% data, GeoHG outperforms previous SOTA methods like UrbanVLP [9] and StructuralGeo [20] trained on the entire training dataset across all tasks. Utilizing only 20% of the data, GeoHG suffers minor performance degradation of 4%, 0.5%, 2.01%, 1.7%, and 3.2% on the Carbon, GDP, Light, PM2.5, and Population tasks, respectively. These findings illustrate GeoHG’s promising potential for data-efficient global deployment across diverse geospatial prediction tasks.
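
A limited-data regime of this kind can be set up by drawing a fixed fraction of the training indices. The sketch below is one plausible way to do so and is an assumption on our part: the function name `subsample_training_set`, the uniform random sampling, and the seed are illustrative, and the exact sampling protocol used in the experiments is described in Appendix F.4.

```python
import random

def subsample_training_set(indices, fraction, seed=0):
    """Return a sorted, reproducible random subset holding `fraction` of `indices`."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    pool = list(indices)
    k = max(1, round(len(pool) * fraction))  # keep at least one sample
    rng = random.Random(seed)                # local generator, global state untouched
    return sorted(rng.sample(pool, k))

# Hypothetical training set of 1,000 regions evaluated at the three regimes.
subsets = {f: subsample_training_set(range(1000), f) for f in (0.05, 0.10, 0.20)}
```

Fixing the seed keeps the subsets identical across models, so differences in accuracy can be attributed to the models rather than to the sampled regions.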

Geospatial data interpolation methods like Inverse Distance Weighting (IDW) [58, 24] and Universal Kriging (UK) are employed to generate predictions for data-scarce regions. We evaluate GeoHG against these methods on population prediction in an enlarged area of Shenzhen, with only 5% visible data (863 points), tasked with inferring the remaining 16,401 regions (16,401 $km^{2}$ ), as illustrated in Figure 7. GeoHG accurately captures the true distribution, substantially outperforming IDW and UK which deviate significantly. This highlights that traditional methods, solely relying on spatial relationships for interpolation, fail to model the intricate socio-environmental dependencies critical for characterizing population distributions. In contrast, GeoHG’s effective geospatial representations and modeling of high-order relationships enable accurate data pattern learning from limited samples.
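
To make the contrast concrete, the following minimal sketch implements basic IDW. It is the textbook formulation rather than the adaptive variant of [24], and the function name `idw_predict` and the power parameter of 2 are assumptions made for illustration. The only inputs are coordinates and observed values, so no environmental or societal attribute can influence the estimate.

```python
import math

def idw_predict(known_points, query, power=2.0):
    """Inverse Distance Weighting estimate at `query`.

    known_points: list of ((x, y), value) pairs with observed values.
    query: (x, y) location to estimate.
    power: distance exponent; 2.0 is a common default, not a value from the paper.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in known_points:
        dist = math.hypot(x - query[0], y - query[1])
        if dist == 0.0:
            return value  # the query coincides with an observed point
        weight = 1.0 / dist ** power
        weighted_sum += weight * value
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("at least one known point is required")
    return weighted_sum / weight_total

# Two hypothetical observations; the midpoint receives equal weights.
observed = [((0.0, 0.0), 10.0), ((2.0, 0.0), 20.0)]
midpoint_estimate = idw_predict(observed, (1.0, 0.0))
```

At the midpoint the estimate is simply the average of the two observations (15.0), regardless of what land cover or POIs the queried region contains.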

5 Conclusion

In this paper, we propose GeoHG, a novel heterogeneous graph structure coupled with an efficient learning framework, specifically designed to generate informative geospatial embeddings for global regions. Our approach effectively captures comprehensive intra-region features from environmental and societal perspectives, as well as higher-order inter-region correlation through a heterogeneous graph formulation, and offers a seamless integration of these components within a model-agnostic graph structure. Extensive experiments across multiple datasets demonstrate GeoHG’s superior performance compared to existing methods. Notably, our method exhibits competitive performance even when the training data is significantly reduced. Due to the page limit, we provide more discussion in Appendix H, including the limitations, future directions and the social impact of our research.

References

Baltrušaitis et al. [2018] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence, 41(2):423–443, 2018.
Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
Du et al. [2019] Jiadi Du, Yunchao Zhang, Pengyang Wang, Jennifer Leopold, and Yanjie Fu. Beyond geo-first law: Learning spatial representations via integrated autocorrelations and complementarity. In 2019 IEEE International Conference on Data Mining (ICDM), pages 160–169. IEEE, 2019.
Feng et al. [2019] Yifan Feng, Haoxuan You, Zizhao Zhang, Rongrong Ji, and Yue Gao. Hypergraph neural networks. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 3558–3565, 2019.
Fu et al. [2019] Yanjie Fu, Pengyang Wang, Jiadi Du, Le Wu, and Xiaolin Li. Efficient region embedding with multi-view spatial networks: A perspective of locality-constrained spatial autocorrelations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 906–913, 2019.
Gao et al. [2020] Jing Gao, Peng Li, Zhikui Chen, and Jianing Zhang. A survey on deep learning for multimodal data fusion. Neural Computation, 32(5):829–864, 2020.
Geng et al. [2019] Xu Geng, Yaguang Li, Leye Wang, Lingyu Zhang, Qiang Yang, Jieping Ye, and Yan Liu. Spatiotemporal multi-graph convolution network for ride-hailing demand forecasting. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 3656–3663, 2019.
Guo et al. [2019] Wenzhong Guo, Jianwen Wang, and Shiping Wang. Deep multimodal representation learning: A survey. Ieee Access, 7:63373–63394, 2019.
Hao et al. [2024] Xixuan Hao, Wei Chen, Yibo Yan, Siru Zhong, Kun Wang, Qingsong Wen, and Yuxuan Liang. Urbanvlp: A multi-granularity vision-language pre-trained foundation model for urban indicator prediction. arXiv preprint arXiv:2403.16831, 2024.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
Hu et al. [2020] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118–22133, 2020.
Huang et al. [2019] Chao Huang, Chuxu Zhang, Jiashu Zhao, Xian Wu, Dawei Yin, and Nitesh Chawla. Mist: A multiview and multimodal spatial-temporal learning framework for citywide abnormal event forecasting. In The world wide web conference, pages 717–728, 2019.
Huang et al. [2023] Yingjing Huang, Fan Zhang, Yong Gao, Wei Tu, Fabio Duarte, Carlo Ratti, Diansheng Guo, and Yu Liu. Comprehensive urban space representation with varying numbers of street-level images. Computers, Environment and Urban Systems, 106:102043, 2023.
Jean et al. [2016] Neal Jean, Marshall Burke, Michael Xie, W Matthew Davis, David B Lobell, and Stefano Ermon. Combining satellite imagery and machine learning to predict poverty. Science, 353(6301):790–794, 2016.
Jenkins et al. [2019] Porter Jenkins, Ahmad Farag, Suhang Wang, and Zhenhui Li. Unsupervised representation learning of spatial data via multimodal embedding. In Proceedings of the 28th ACM international conference on information and knowledge management, pages 1993–2002, 2019.
Jiang and Luo [2022] Weiwei Jiang and Jiayun Luo. Graph neural network for traffic forecasting: A survey. Expert Systems with Applications, 207:117921, 2022.
Jin et al. [2023] Guangyin Jin, Yuxuan Liang, Yuchen Fang, Zezhi Shao, Jincai Huang, Junbo Zhang, and Yu Zheng. Spatio-temporal graph neural networks for predictive learning in urban computing: A survey. IEEE Transactions on Knowledge and Data Engineering, 2023.
Kramer [1991] Mark A Kramer. Nonlinear principal component analysis using autoassociative neural networks. AIChE journal, 37(2):233–243, 1991.
Lacoste et al. [2024] Alexandre Lacoste, Nils Lehmann, Pau Rodriguez, Evan Sherwin, Hannah Kerner, Björn Lütjens, Jeremy Irvin, David Dao, Hamed Alemohammad, Alexandre Drouin, et al. Geo-bench: Toward foundation models for earth monitoring. Advances in Neural Information Processing Systems, 36, 2024.
Li et al. [2022] Tong Li, Shiduo Xin, Yanxin Xi, Sasu Tarkoma, Pan Hui, and Yong Li. Predicting multi-level socioeconomic indicators from structural urban imagery. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 3282–3291, 2022.
Liang et al. [2022] Yuxuan Liang, Kun Ouyang, Yiwei Wang, Zheyi Pan, Yifang Yin, Hongyang Chen, Junbo Zhang, Yu Zheng, David S Rosenblum, and Roger Zimmermann. Mixed-order relation-aware recurrent neural networks for spatio-temporal forecasting. IEEE Transactions on Knowledge and Data Engineering, 2022.
Liu et al. [2023a] Hao Liu, Qingyu Guo, Hengshu Zhu, Yanjie Fu, Fuzhen Zhuang, Xiaojuan Ma, and Hui Xiong. Characterizing and forecasting urban vibrancy evolution: A multi-view graph mining perspective. ACM Transactions on Knowledge Discovery from Data, 17(5):68:1–68:24, February 2023a. ISSN 1556-4681. doi: 10.1145/3568683.
Liu et al. [2023b] Yu Liu, Jingtao Ding, Yanjie Fu, and Yong Li. Urbankg: An urban knowledge graph system. ACM Transactions on Intelligent Systems and Technology, 14(4):1–25, 2023b.
Lu and Wong [2008] George Y Lu and David W Wong. An adaptive inverse-distance weighting spatial interpolation technique. Computers & geosciences, 34(9):1044–1055, 2008.
Lütjens et al. [2019] Björn Lütjens, Lucas Liebenwein, and Katharina Kramer. Machine learning-based estimation of forest carbon stocks to increase transparency of forest preservation efforts. 2019 NeurIPS Workshop on Tackling Climate Change with AI (CCAI), 2019.
Ma et al. [2019] Yao Ma, Suhang Wang, Charu C Aggarwal, Dawei Yin, and Jiliang Tang. Multi-dimensional graph convolutional networks. In Proceedings of the 2019 siam international conference on data mining, pages 657–665. SIAM, 2019.
Ning et al. [2024] Yansong Ning, Hao Liu, Hao Wang, Zhenyu Zeng, and Hui Xiong. Uukg: unified urban knowledge graph dataset for urban spatiotemporal prediction. Advances in Neural Information Processing Systems, 36, 2024.
Oord et al. [2018] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
Ratcliffe [2005] Jerry H Ratcliffe. Detecting spatial movement of intra-region crime patterns over time. Journal of Quantitative Criminology, 21:103–123, 2005.
Robin and Acuto [2018] Enora Robin and Michele Acuto. Global urban policy and the geopolitics of urban data. Political Geography, 66:76–87, 2018.
Venter et al. [2022] Zander S Venter, David N Barton, Tirthankar Chakraborty, Trond Simensen, and Geethen Singh. Global 10 m land use land cover datasets: A comparison of dynamic world, world cover and esri land cover. Remote Sensing, 14(16):4101, 2022.
Vivanco Cepeda et al. [2024] Vicente Vivanco Cepeda, Gaurav Kumar Nayak, and Mubarak Shah. Geoclip: Clip-inspired alignment between locations and images for effective worldwide geo-localization. Advances in Neural Information Processing Systems, 36, 2024.
Wang et al. [2022] Xiao Wang, Deyu Bo, Chuan Shi, Shaohua Fan, Yanfang Ye, and S Yu Philip. A survey on heterogeneous graph embedding: methods, techniques, applications and sources. IEEE Transactions on Big Data, 9(2):415–436, 2022.
Wang et al. [2020] Zhecheng Wang, Haoyuan Li, and Ram Rajagopal. Urban2vec: Incorporating street view imagery and pois for multi-modal urban neighborhood embedding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 1013–1020, 2020.
Wei and Li [2024] Jing Wei and Zhanqing Li. Chinahighpm2.5: High-resolution and high-quality ground-level pm2.5 dataset for china (2000-2022), 2024. URL https://dx.doi.org/10.5281/zenodo.3539349.
Wu et al. [2020] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems, 32(1):4–24, 2020.
Xi et al. [2022] Yanxin Xi, Tong Li, Huandong Wang, Yong Li, Sasu Tarkoma, and Pan Hui. Beyond the first law of geography: Learning representations of satellite imagery by leveraging point-of-interests. In Proceedings of the ACM Web Conference 2022, pages 3308–3316, 2022.
Xia et al. [2022] Lianghao Xia, Chao Huang, and Chuxu Zhang. Self-supervised hypergraph transformer for recommender systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2100–2109, 2022.
Xie et al. [2016] Michael Xie, Neal Jean, Marshall Burke, David Lobell, and Stefano Ermon. Transfer learning from deep features for remote sensing and poverty mapping. In Proceedings of the AAAI conference on artificial intelligence, volume 30, 2016.
Xie et al. [2020] Peng Xie, Tianrui Li, Jia Liu, Shengdong Du, Xin Yang, and Junbo Zhang. Urban flow prediction from spatiotemporal data using machine learning: A survey. Information Fusion, 59:1–12, 2020.
Yan et al. [2024] Yibo Yan, Haomin Wen, Siru Zhong, Wei Chen, Haodong Chen, Qingsong Wen, Roger Zimmermann, and Yuxuan Liang. Urbanclip: Learning text-enhanced urban region profiling with contrastive language-image pretraining from the web. In Proceedings of the ACM on Web Conference 2024, pages 4006–4017, 2024.
Yang et al. [2021] Xin Yang, Qiuchi Xue, Xingxing Yang, Haodong Yin, Yunchao Qu, Xiang Li, and Jianjun Wu. A novel prediction model for the inbound passenger flow of urban rail transit. Information Sciences, 566:347–363, 2021.
Yang et al. [2022] Yuhao Yang, Chao Huang, Lianghao Xia, Yuxuan Liang, Yanwei Yu, and Chenliang Li. Multi-behavior hypergraph-enhanced transformer for sequential recommendation. In Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, pages 2263–2274, 2022.
Yeh et al. [2020] Christopher Yeh, Anthony Perez, Anne Driscoll, George Azzari, Zhongyi Tang, David Lobell, Stefano Ermon, and Marshall Burke. Using publicly available satellite imagery and deep learning to understand economic well-being in africa. Nature communications, 11(1):2583, 2020.
Ying et al. [2019] Zhitao Ying, Dylan Bourgeois, Jiaxuan You, Marinka Zitnik, and Jure Leskovec. Gnnexplainer: Generating explanations for graph neural networks. Advances in neural information processing systems, 32, 2019.
Yu et al. [2024] Sungduk Yu, Walter Hannah, Liran Peng, Jerry Lin, Mohamed Aziz Bhouri, Ritwik Gupta, Björn Lütjens, Justus C Will, Gunnar Behrens, Julius Busecke, et al. Climsim: A large multi-scale dataset for hybrid physics-ml climate emulation. Advances in Neural Information Processing Systems, 36, 2024.
Yuan et al. [2012] Jing Yuan, Yu Zheng, and Xing Xie. Discovering regions of different functions in a city using human mobility and pois. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 186–194, 2012.
Zanaga et al. [2021] Daniele Zanaga, Ruben Van De Kerchove, Wanda De Keersmaecker, Niels Souverijns, Carsten Brockmann, Ralf Quast, Jan Wevers, Alex Grosu, Audrey Paccini, Sylvain Vergnaud, Oliver Cartus, Maurizio Santoro, Steffen Fritz, Ivelina Georgieva, Myroslava Lesiv, Sarah Carter, Martin Herold, Linlin Li, Nandin-Erdene Tsendbazar, Fabrizio Ramoino, and Olivier Arino. Esa worldcover 10 m 2020 v100, October 2021. URL https://doi.org/10.5281/zenodo.5571936.
Zanaga et al. [2022] Daniele Zanaga, Ruben Van De Kerchove, Dirk Daems, Wanda De Keersmaecker, Carsten Brockmann, Grit Kirches, Jan Wevers, Oliver Cartus, Maurizio Santoro, Steffen Fritz, et al. Esa worldcover 10 m 2021 v200. 2022.
Zhang et al. [2019a] Chuxu Zhang, Dongjin Song, Chao Huang, Ananthram Swami, and Nitesh V Chawla. Heterogeneous graph neural network. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 793–803, 2019a.
Zhang et al. [2021] Mingyang Zhang, Tong Li, Yong Li, and Pan Hui. Multi-view joint graph representation learning for urban region embedding. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pages 4431–4437, 2021.
Zhang et al. [2019b] Yunchao Zhang, Yanjie Fu, Pengyang Wang, Xiaolin Li, and Yu Zheng. Unifying inter-region autocorrelation and intra-region structures for spatial embedding via collective adversarial learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1700–1708, 2019b.
Zhao et al. [2017] Naizhuo Zhao, Ying Liu, Guofeng Cao, Eric L Samson, and Jingqi Zhang. Forecasting china’s gdp at the pixel level using nighttime lights time series and population images. GIScience & Remote Sensing, 54(3):407–425, 2017.
Zheng [2015] Yu Zheng. Methodologies for cross-domain data fusion: An overview. IEEE transactions on big data, 1(1):16–34, 2015.
Zhong et al. [2022] X Zhong, Q Yan, and G Li. Development of time series of nighttime light dataset of china (2000–2020). Journal of Global Change Data & Discovery, 3:416–424, 2022.
Zhou et al. [2023] Silin Zhou, Dan He, Lisi Chen, Shuo Shang, and Peng Han. Heterogeneous region embedding with prompt learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 4981–4989, 2023.
Zimmerman et al. [1999] Dale Zimmerman, Claire Pavlik, Amy Ruggles, and Marc P Armstrong. An experimental comparison of ordinary and universal kriging and inverse distance weighting. Mathematical Geology, 31:375–390, 1999.
Zou et al. [2024] Xingchen Zou, Yibo Yan, Xixuan Hao, Yuehong Hu, Haomin Wen, Erdong Liu, Junbo Zhang, Yong Li, Tianrui Li, Yu Zheng, et al. Deep learning for cross-domain data fusion in urban computing: Taxonomy, advances, and outlook. arXiv preprint arXiv:2402.19348, 2024.

Supplementary for: “GeoHG: Learning Geospatial Region Embedding with Heterogeneous Graph”

We organize our supplementary document as follows:

A. More Introduction of ESA WorldCover Dataset
B. Motivation for Using Entity Segmentation for Satellite Imagery Encoding
  B.1 Introduction of semantic segmentation-based approach
  B.2 Validation experiment for segmentation-based approach
C. More Details of GeoHG
  C.1 Basic graph theory and our motivation
  C.2 Details of GeoHG structure
D. Dataset and Experiment Settings
E. Details of Baselines
F. More Details about Experiments Results
  F.1 RMSE metrics of experiment results
  F.2 Mean and standard deviation of metrics for GeoHG
  F.3 More ablation study results
  F.4 Details of data efficiency experiment
  F.5 Details of investigation of GeoHG with GNNExplainer
G. Qualitative Demonstration
H. More Discussion

Appendix A More Introduction of ESA WorldCover Dataset

The ESA WorldCover dataset (https://esa-worldcover.org/en) represents a groundbreaking advancement in land use/land cover mapping, offering freely accessible, high-resolution (10 m) global coverage based on satellite imagery. Inspired by the 2017 WorldCover conference (https://worldcover2017.esa.int/), the European Space Agency (ESA) launched the WorldCover project. Its primary accomplishment was the release, in October 2021, of a freely accessible global WorldCover dataset at 10 m resolution for the year 2020 [49, 50]. This dataset leverages satellite imagery from both Sentinel-1 and Sentinel-2 and encompasses 11 distinct geo-entity categories, shown in Figure 8. It has also undergone rigorous independent validation by Wageningen University (for statistical accuracy) and the International Institute for Applied Systems Analysis (IIASA) (for spatial accuracy), attaining a global overall accuracy of approximately 75% [32].

The WorldCover dataset is continuously updated and revised by ESA with a highly efficient processing pipeline and scalable infrastructure. The core model is trained on a total of 2,160,210 Sentinel-2 images and is able to process the whole world in less than 5 days [49, 50]. The revised version used in this paper was released on 28 October 2022 and raised the global overall accuracy to 76.7% [50]. This version is free of charge to the entire community and widely accepted by the United Nations Convention to Combat Desertification (UNCCD), the World Resources Institute (WRI), the Centre for International Forestry Research (CIFOR), the Food and Agriculture Organization (FAO) and the Organisation for Economic Co-operation and Development (OECD). We visualize 3 examples of the WorldCover data utilized in this paper in Figure 9.

Appendix B Motivation for Using Entity Segmentation for Satellite Imagery Encoding

B.1 Introduction of semantic segmentation-based approach

It is noticeable that conventional vision encoders are devised and trained on natural images, which significantly differ from satellite imagery. It is a formidable task to comprehend the intricate and specialized geo-semantic content present within satellite imagery.

Fortunately, satellite images exhibit a high degree of structural organization in terms of semantic information, unlike natural images that contain a myriad of diverse information. To effectively interpret a satellite image, one only needs to concentrate on the geological entities it encompasses, the spatial extent of these entities, and their respective positions within the geospace. Consequently, it is highly promising to employ an entity segmentation approach for encoding satellite imagery, as it can directly provide us with essential information regarding the entities and their spatial coverage.

In our proposed method, GeoHG, we employ an entity segmentation-based framework as the backbone for encoding satellite imagery, as opposed to the presently prevalent vision encoders, such as CNN and Transformer. To implement this approach, we utilize the ESA WorldCover Dataset, as discussed in Appendix A, due to its extensive validation by geographical researchers and demonstrated robust global generalization capabilities. As depicted in Figure 10, we directly retrieve the geo-entity segmentation results from the WorldCover Dataset for the input satellite imagery. Subsequently, we determine the proportion of geo-entities within the relevant region and construct the environmental feature embedding $E_{env}$ .
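
The proportion step can be sketched as follows. This is a simplified illustration under stated assumptions: the segmentation result for one region is taken to be a small 2-D grid of integer class labels, the function name `entity_proportions` and the class ids are hypothetical, and any further processing applied when building $E_{env}$ is not shown.

```python
from collections import Counter

def entity_proportions(mask, num_classes):
    """Return the share of pixels per class in a 2-D label mask (list of rows).

    Labels are assumed to lie in range(num_classes).
    """
    counts = Counter(label for row in mask for label in row)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("mask must contain at least one pixel")
    return [counts.get(c, 0) / total for c in range(num_classes)]

# Toy 2x4 mask with hypothetical class ids: 0 = tree cover, 1 = built-up, 2 = water.
toy_mask = [[0, 0, 1, 1],
            [0, 2, 1, 1]]
e_env = entity_proportions(toy_mask, num_classes=3)
```

For the toy mask the result is `[0.375, 0.5, 0.125]`: a fixed-length vector whose entries sum to one and can be read directly as "how much of this region is covered by each geo-entity".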

An additional advantage of our proposed approach lies in its ability to explicitly construct high-order connections among regions, geo-entities, and other regions based on the semantic content of satellite imagery. This is not feasible with traditional encoders. Moreover, since all segmentation results are readily available, this method significantly reduces computational overhead. We have observed that the segmentation-based approach dramatically decreases our training time. For example, the training time for one epoch on the Population dataset in Beijing was reduced from $\approx 25$ minutes to $\approx 0.2$ seconds.

B.2 Validation experiment for segmentation-based approach

To validate the effectiveness of our segmentation-based approach, we devise a validation experiment that predicts five indicators for 4 cities using the segmentation-based geo-entity proportion $E_{env}$ as the only input (named GeoSegment for convenience in Table 2). We compare it with the state-of-the-art LLM-enhanced satellite imagery encoder UrbanCLIP [42], the POI information-enhanced imagery encoder PG-SimCLR [38], and two traditional image encoders, ResNet-18 and AutoEncoder. These baselines are introduced in detail in Appendix E.

Table 2: Validation Experiment Results. The bold/underlined font means the best/the second-best result.

Dataset	Beijing
	Carbon			Population			GDP			Night Light			$PM_{2.5}$
Model	$R^{2}$	MAE	RMSE	$R^{2}$	MAE	RMSE	$R^{2}$	MAE	RMSE	$R^{2}$	MAE	RMSE	$R^{2}$	MAE	RMSE
Autoencoder	0.298	0.565	0.844	0.168	0.667	0.918	0.171	0.822	1.251	0.276	0.643	0.784	0.209	0.659	0.840
ResNet-18	0.394	0.577	0.805	0.266	0.623	0.857	0.285	0.699	1.024	0.348	0.576	0.734	0.341	0.589	0.821
PG-SimCLR	0.442	0.631	0.754	0.471	0.964	1.117	0.277	0.768	1.254	0.369	0.404	0.728	0.398	0.624	0.845
UrbanCLIP	0.664	\ul0.528	0.598	\ul0.461	\ul0.552	\ul0.669	\ul0.355	\ul0.539	\ul0.864	\ul0.420	0.457	\ul0.700	0.533	\ul0.556	0.699
GeoSegment	\ul0.504	0.513	\ul0.745	0.562	0.514	0.651	0.531	0.445	0.675	0.606	\ul0.470	0.629	\ul0.468	0.535	\ul0.761

Dataset	Shanghai
	Carbon			Population			GDP			Night Light			$PM_{2.5}$
Model	$R^{2}$	MAE	RMSE	$R^{2}$	MAE	RMSE	$R^{2}$	MAE	RMSE	$R^{2}$	MAE	RMSE	$R^{2}$	MAE	RMSE
Autoencoder	0.230	0.532	0.771	0.166	0.742	0.898	0.197	0.844	1.511	0.164	0.617	0.729	0.243	0.691	0.942
ResNet-18	0.326	0.465	0.763	0.279	0.627	0.797	0.289	0.711	1.137	0.244	0.571	0.756	0.292	0.642	0.954
PG-SimCLR	0.298	0.712	0.914	0.315	0.731	0.959	0.294	0.767	1.052	0.308	0.566	0.768	0.303	0.617	0.895
UrbanCLIP	\ul0.671	\ul0.426	\ul0.569	\ul0.461	\ul0.552	\ul0.748	\ul0.326	\ul0.587	\ul0.807	\ul0.387	\ul0.511	\ul0.709	\ul0.444	\ul0.518	\ul0.774
GeoSegment	0.791	0.242	0.452	0.812	0.264	0.396	0.578	0.473	0.650	0.684	0.415	0.559	0.656	0.356	0.582

Dataset	Guangzhou
	Carbon			Population			GDP			Night Light			$PM_{2.5}$
Model	$R^{2}$	MAE	RMSE	$R^{2}$	MAE	RMSE	$R^{2}$	MAE	RMSE	$R^{2}$	MAE	RMSE	$R^{2}$	MAE	RMSE
Autoencoder	0.254	0.570	0.733	0.195	0.753	0.928	0.176	0.811	1.463	0.176	0.602	0.798	0.196	0.780	0.903
ResNet-18	0.375	0.515	0.673	0.274	0.671	0.831	0.251	0.725	1.048	0.242	0.551	0.704	0.231	0.694	0.859
PG-SimCLR	0.422	0.708	0.708	0.303	0.954	0.972	0.282	0.897	1.264	0.435	0.433	\ul0.627	0.315	0.542	0.741
UrbanCLIP	0.585	0.444	0.603	\ul0.533	\ul0.567	\ul0.687	\ul0.440	\ul0.546	\ul0.762	\ul0.483	0.478	0.633	0.560	\ul0.514	0.694
GeoSegment	\ul0.490	\ul0.519	\ul0.657	0.685	0.409	0.563	0.536	0.488	0.681	0.618	\ul0.453	0.620	\ul0.493	0.435	\ul0.711

Dataset	Shenzhen
	Carbon			Population			GDP			Night Light			$PM_{2.5}$
Model	$R^{2}$	MAE	RMSE	$R^{2}$	MAE	RMSE	$R^{2}$	MAE	RMSE	$R^{2}$	MAE	RMSE	$R^{2}$	MAE	RMSE
Autoencoder	0.189	0.634	0.887	0.175	0.772	0.921	0.119	0.884	1.733	0.166	0.608	0.758	0.149	0.645	0.891
ResNet-18	0.241	0.577	0.726	0.299	0.654	0.855	0.234	0.817	1.125	0.243	0.543	0.719	0.273	0.598	0.826
PG-SimCLR	0.257	0.683	0.816	0.311	0.758	0.892	0.307	0.895	1.003	0.454	\ul0.488	\ul0.682	0.597	0.451	0.491
UrbanCLIP	0.562	0.483	0.571	\ul0.527	\ul0.592	\ul0.610	\ul0.508	\ul0.464	\ul0.693	\ul0.387	0.511	0.709	\ul0.460	\ul0.586	\ul0.762
GeoSegment	\ul0.542	\ul0.518	\ul0.698	0.602	0.438	0.602	0.658	0.411	0.587	0.738	0.386	0.512	0.471	0.480	0.752

From Table 2, our semantic segmentation-based approach attains the best $R^{2}$ in 14 of the 20 city-task combinations, including all five tasks in Shanghai. The exceptions are the Carbon and $PM_{2.5}$ tasks in Beijing, Guangzhou, and Shenzhen, where the state-of-the-art LLM-enhanced satellite imagery encoder UrbanCLIP [42] or the POI information-enhanced encoder PG-SimCLR [38] scores higher. Notably, GeoSegment achieves these results without relying on any additional information, such as LLM descriptions or POI data. This supports the effectiveness of our method for satellite imagery encoding.

Appendix C More Details of GeoHG

C.1 Basic graph theory and our motivation

Graph theory. Graphs provide a natural abstraction to represent structured relationships between entities (nodes) and their attributes (features). Within this framework, knowledge and information are organized via connectivity patterns encoded by edges. A key advantage of graph-structured representations is that nodes can reason about their representations not just from their attributes, but also by recursively aggregating information from their neighbors. This allows graphs to effectively capture complex relational dependencies and contextual patterns.

Graph Neural Networks (GNNs) have emerged as a powerful paradigm for learning representations on graph-structured data by leveraging the message-passing mechanism. The core idea of this mechanism involves each node recursively aggregating representation vectors from its local neighborhood, allowing it to accumulate information from an expanding neighborhood scope across iterations [37]. Formally, we can represent this process through the message-passing equation:

$$\begin{aligned} m_{u} &= \operatorname{Aggregate}\left(f_{v},\, v\in\mathcal{N}_{u}\right), \\ f_{u}^{\prime} &= \operatorname{Update}\left(m_{u},f_{u}\right), \end{aligned} \qquad (11)$$

where $f_{u}$ represents the original representation of node $u$ , $f_{u}^{\prime}$ denotes its new representation after one iteration, and $\mathcal{N}_{u}$ is its set of neighboring nodes. Through this iterative process, GNNs can effectively capture dependencies spanning the entire graph topology, enabling them to learn highly expressive representations that fuse localized features with broader structural context.
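
The Aggregate/Update scheme can be illustrated with a minimal sketch. Both functions are fixed, hand-picked choices here (a neighbor mean for Aggregate and an equal-weight average for Update); these are assumptions for illustration only, since practical GNNs use learned, parameterized functions. The function name `message_passing_step` and the toy graph are likewise hypothetical.

```python
def message_passing_step(features, neighbors):
    """One round of Aggregate/Update over all nodes.

    features: dict node -> list of floats (f_u).
    neighbors: dict node -> list of neighboring nodes (N_u).
    Aggregate = element-wise mean of neighbor features (assumed choice).
    Update = average of the node's own feature and the message (assumed choice).
    """
    updated = {}
    for u, f_u in features.items():
        nbrs = neighbors.get(u, [])
        if nbrs:
            m_u = [sum(features[v][k] for v in nbrs) / len(nbrs)
                   for k in range(len(f_u))]
        else:
            m_u = list(f_u)  # isolated node: fall back to its own feature
        updated[u] = [0.5 * (a + b) for a, b in zip(f_u, m_u)]
    return updated

# Path graph a - b - c with one-dimensional features.
features = {"a": [1.0], "b": [3.0], "c": [9.0]}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
after_one = message_passing_step(features, neighbors)
after_two = message_passing_step(after_one, neighbors)
```

After one step, node `a` only reflects its direct neighbor `b` (its value becomes 2.0). After the second step its value becomes 3.0, which already depends on the initial feature of `c`, two hops away. This is the expanding neighborhood scope described above.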

Hypergraph theory. Traditional graph models can only connect pairs of vertices within edges, thereby facing limitations in representing higher-order multi-way relationships beyond simple pairwise associations. In real-world scenarios, relationships between data typically go beyond simple pairwise links and involve complex multi-element patterns. For example, in a transportation network, a route often spans multiple cities or locations rather than just directly connecting two endpoints, which is difficult to capture using traditional graphs. To overcome this, hypergraph theory generalizes the graph formulation by introducing hyperedges that can connect any number of vertices, naturally facilitating the representation of higher-order multi-way relationships, defined as:

$$G=(V,E)\quad \text{where}\quad V=\{v_{1},v_{2},\ldots,v_{N}\},\quad E=\{\mathcal{H}_{1},\mathcal{H}_{2},\ldots,\mathcal{H}_{M}\} \qquad (12)$$

where $G$ represents a hypergraph and $V$ is the set of nodes. $E=\{\mathcal{H}_{1},\mathcal{H}_{2},\ldots,\mathcal{H}_{M}\}$ is the set of hyperedges representing connectivity among nodes, and each hyperedge $\mathcal{H}_{m}$ is a subset of $V$ . Revisiting the transportation route example, we can model each city/location as a node and the route passing through multiple locations as a hyperedge connecting all the corresponding nodes, precisely encoding the dependencies between the route and the groups of locations it traverses.
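
The transportation example can be written down directly as data. The sketch below stores each hyperedge as a set of nodes and derives the node-by-hyperedge incidence matrix, a common way to represent hypergraphs; the city names, route names, and the helper `incidence_matrix` are hypothetical and serve only as illustration.

```python
def incidence_matrix(nodes, hyperedges):
    """Rows follow `nodes`, columns follow `hyperedges`; entry is 1 if the node is in the hyperedge."""
    for name, members in hyperedges.items():
        unknown = set(members) - set(nodes)
        if unknown:
            raise ValueError(f"hyperedge {name!r} has unknown nodes: {sorted(unknown)}")
    return [[1 if v in members else 0 for members in hyperedges.values()]
            for v in nodes]

nodes = ["A", "B", "C", "D"]          # hypothetical cities (V)
hyperedges = {                         # hypothetical routes (E), each a subset of V
    "route_1": {"A", "B", "C"},
    "route_2": {"C", "D"},
}
H = incidence_matrix(nodes, hyperedges)
```

Here `route_1` links three cities with a single hyperedge, whereas an ordinary graph would need three pairwise edges and would lose the fact that they belong to one route. City `C` appears in both columns of `H`, so it is the node through which the two routes interact.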

Motivations for modeling geospatial representations with hypergraph. In geospatial representation modeling, the relationships between areas go far beyond simple spatial proximity and instead arise from the interplay of various environmental and societal factors. Climatic conditions, topography, population distribution, economic factors, and more exhibit intricate high-order influence patterns across different regions. Traditional methods fail to explicitly characterize this inherent high-order relational structure [7, 59]. Although multi-view graphs for geospace can incorporate additional connections between regions by introducing extra edges between graph nodes, this representation is inefficient and adds considerable complexity. Furthermore, it is questionable whether distant regions influence each other as directly as their edges in a multi-view graph suggest. For instance, a region might first degrade the shared water body of the geospace, which in turn affects a remote region with abundant water resources; the influence is mediated by the water body rather than transmitted through a direct region-to-region connection.

To effectively model such higher-order geospatial relations, we treat each geospatial region as a node and abstract the higher-order environmental/societal associations influencing multiple regions as hyperedges connecting all the relevant nodes, precisely capturing these intricate dependency patterns. Building upon this idea, we further propose GeoHG to encode high-order relationships and intricate characteristics between regions.

C.2 Details of GeoHG structure

We introduce a novel heterogeneous hypergraph representation, termed GeoHG, to encode the high-order relational structures within geospatial data. GeoHG comprises three types of nodes (regional nodes, environmental entity nodes, and societal entity nodes) along with the relations between them. Specifically, regional nodes correspond to $1km\times 1km$ grid cells from the geographic space; environmental entity nodes represent the 9 distinct geo-entity classes detected from satellite imagery; societal entity nodes represent the 14 point-of-interest (POI) categories within the regions.

We construct hyperedges connecting multiple regional nodes through the environmental and societal entity nodes, enabling the explicit modeling of higher-order inter-region relationships induced by environmental or societal factors. The semantics of the node types and their relationships within GeoHG are illustrated in Tables 3 and 4, respectively.

By introducing hyperedges that can associate arbitrary subsets of regional nodes, GeoHG effectively captures the complex high-order dependencies spanning environmental conditions, geographic entities, social dynamics, and their intricate interplay across different spatial regions. This enriched relational structure encodes valuable contextual signals that are integrated into the region representations learned by GeoHG, ultimately benefiting a wide range of downstream geospatial analytics tasks.

Table 3: Major entities in GeoHG

Entity	Num	Examples
Region	N	$1\textrm{km}\times 1\textrm{km}$ grid cells from the geospace. The number of region entities depends on the number of grids within a given geographical range.
Environment	9	Tree cover, Shrubland, Grassland, Cropland, Built-up, Bare/sparse vegetation, Permanent water bodies, Herbaceous wetland, Moss and lichen
Society	14	Food and Beverage, Transportation facilities, Shopping spend, Science, education and culture, Companies, Recreation, Financial Institutions, Tourist Attractions, Life services, Car related, Sports and fitness, Hotel accommodation, Healthcare, Commercial residence

Table 4: Major relations in GeoHG

Relation	Weight	Head & Tail Entity	Abbrev.
Region Nearby Region	-	(Region, Region)	RNR
Environmental Entity Locates at Region	the area occupied by entity	(Environmental Entity, Region)	ELR
Societal Entity Locates at Region	the transformed proportion of POI category	(Societal Entity, Region)	SLR
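To make the three relation types in Table 4 concrete, the sketch below assembles RNR, ELR, and SLR edge lists from per-region attributes. The toy regions and weights are hypothetical, and the 8-neighbourhood used for RNR is an assumption made for illustration rather than a detail confirmed by the text.

```python
from collections import defaultdict

# Hypothetical toy input: grid index -> entity weights (all values are made up).
regions = {
    (0, 0): {"env": {"Built-up": 0.7, "Tree cover": 0.3}, "soc": {"Food and Beverage": 0.5}},
    (0, 1): {"env": {"Built-up": 0.2, "Permanent water bodies": 0.8}, "soc": {}},
    (5, 5): {"env": {"Built-up": 0.9}, "soc": {"Food and Beverage": 0.4}},
}

def build_edges(regions):
    """Build typed edge lists keyed by the relation abbreviations of Table 4."""
    edges = defaultdict(list)
    for (r, c), attrs in regions.items():
        # RNR: unweighted links between neighbouring cells (8-neighbourhood assumed).
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                neighbour = (r + dr, c + dc)
                if (dr, dc) != (0, 0) and neighbour in regions:
                    edges["RNR"].append(((r, c), neighbour))
        # ELR: environmental entity -> region, weighted by the area the entity occupies.
        for entity, area in attrs["env"].items():
            if area > 0:
                edges["ELR"].append((entity, (r, c), area))
        # SLR: societal entity -> region, weighted by the (transformed) POI proportion.
        for entity, weight in attrs["soc"].items():
            if weight > 0:
                edges["SLR"].append((entity, (r, c), weight))
    return edges

def regions_linked_by(edges, relation, entity):
    """Regions attached to one entity node via ELR or SLR, i.e. the members of its hyperedge."""
    return sorted(region for name, region, _ in edges[relation] if name == entity)

edges = build_edges(regions)
print(regions_linked_by(edges, "ELR", "Built-up"))
```

In this toy case the distant cell `(5, 5)` has no RNR edge to the other two cells, yet it joins them through the shared "Built-up" entity node, which is how an entity node acts as a hyperedge over regions.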

Appendix D Dataset and Experiment Settings

Dataset Details. We evaluate five representative tasks in four cities in China: Beijing, Shanghai, Guangzhou and Shenzhen. The Population, GDP, and Night Light tasks reflect anthropogenic activities, whereas the Carbon and $PM_{2.5}$ tasks characterize the natural environment. We preprocess the data following [42]. The datasets are detailed below:

• Carbon Emissions: This dataset incorporates anthropogenic CO2 emission estimates for the year 2022 sourced from the Open Data Inventory (ODIAC, https://odin.opendatawatch.com/), spatially aligned with our 1 km² satellite image grids (emissions quantified in tons).

• Population: This dataset is obtained from the WorldPop (https://www.worldpop.org/) population distribution data for 2020, with counts representing the number of citizens per region.

• GDP: This dataset includes Gross Domestic Product (GDP) statistics reflecting China’s regional economic development patterns from Zhao et al. [54].

• Night Light: As a proxy for human activity intensity, a key driver of urban evolution, we use the 2020 nighttime light imagery data from Zhong et al. [56].

• $PM_{2.5}$: The $PM_{2.5}$ data is sourced from the ChinaHighPM2.5 dataset [36], which combines ground-based observations, atmospheric reanalysis, emission inventories, and other techniques to obtain seamless nationwide ground-level $PM_{2.5}$ data from 2000 to the present. It covers all of China at a spatial resolution of 1 km, with daily, monthly, and yearly temporal resolution, in units of $\mu g/m^{3}$.

Table 5: Dataset statistics.

Dataset	Coverage (bottom-left)	Coverage (top-right)	Coverage area ($km^{2}$)	Satellite images	POIs
Beijing	39.75°N, 116.03°E	40.15°N, 116.79°E	4,277	4,277	709,232
Shanghai	30.98°N, 121.10°E	31.51°N, 121.80°E	5,292	5,292	808,957
Guangzhou	22.94°N, 113.10°E	23.40°N, 113.68°E	8,540	8,540	805,997
Shenzhen	22.45°N, 113.75°E	22.84°N, 114.62°E	5,150	5,150	717,461
Shenzhen-enlarged	22.45°N, 113.75°E	23.84°N, 114.62°E	17,264	17,264	1,813,547

Implementation Details. We implement GeoHG, GeoHG-SSL and the baselines with PyTorch 3.9 on a single NVIDIA RTX A6000 with 24GB of memory. We use a 2-layer MLP as the regression head for prediction. GeoHG is trained using the Adam optimizer with a learning rate of 0.01. For the hyperedge gates $\theta_{Env}$ and $\theta_{Soc}$, we conduct grid searches over (0.2, 0.4, 0.6, 0.8) and (0.3, 0.6, 0.9, 1.2, 1.5), respectively. For the number of layers in the graph convolutional block, we test values from 1 to 3. Training GeoHG takes approximately 3 minutes per task.
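The search space described above contains 4 × 5 × 3 = 60 configurations. The following sketch enumerates it with an exhaustive grid search; `toy_evaluate` is a hypothetical stand-in for training GeoHG and returning a validation score, and the selection criterion (maximizing a score) is an assumption.

```python
from itertools import product

# Search space stated in the implementation details.
THETA_ENV = (0.2, 0.4, 0.6, 0.8)
THETA_SOC = (0.3, 0.6, 0.9, 1.2, 1.5)
NUM_LAYERS = (1, 2, 3)

def grid_search(evaluate):
    """Try every configuration and keep the one with the highest score."""
    best_cfg, best_score = None, float("-inf")
    for theta_env, theta_soc, num_layers in product(THETA_ENV, THETA_SOC, NUM_LAYERS):
        cfg = {"theta_env": theta_env, "theta_soc": theta_soc, "num_layers": num_layers}
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

def toy_evaluate(cfg):
    """Hypothetical placeholder objective (not a real model); peaks at (0.6, 0.9, 3)."""
    return (-abs(cfg["theta_env"] - 0.6)
            - abs(cfg["theta_soc"] - 0.9)
            + 0.01 * cfg["num_layers"])

num_configs = len(list(product(THETA_ENV, THETA_SOC, NUM_LAYERS)))
best_cfg, best_score = grid_search(toy_evaluate)
print(num_configs, best_cfg)
```

At roughly 3 minutes per training run, sweeping all 60 configurations for one task would take on the order of 3 hours under these assumptions.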

Final Settings of GeoHG. We introduce the best hyperparameter configurations for each task as below:

• For the Carbon dataset, the hypergate $\theta_{Env}$ is set as 0.6 while $\theta_{Soc}$ is set as 0.9. The number of layers of the graph convolution block is 3, and the dimension of the hidden layer is 64.

• For the Population dataset, the hypergate $\theta_{Env}$ is set as 0.2 while $\theta_{Soc}$ is set as 0.9. The number of layers of the graph convolution block is 3, and the dimension of the hidden layer is 64.

• For the GDP dataset, the hypergate $\theta_{Env}$ is set as 0.4 while $\theta_{Soc}$ is set as 1.2. The number of layers of the graph convolution block is 3, and the dimension of the hidden layer is 64.

• For the Night Light dataset, the hypergate $\theta_{Env}$ is set as 0.2 while $\theta_{Soc}$ is set as 0.9. The number of layers of the graph convolution block is 3, and the dimension of the hidden layer is 64.

• For the $PM_{2.5}$ dataset, the hypergate $\theta_{Env}$ is set as 0.8 while $\theta_{Soc}$ is set as 0.6. The number of layers of the graph convolution block is 3, and the dimension of the hidden layer is 64.

Evaluation Metrics. We employ Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the coefficient of determination ($R^{2}$) as evaluation metrics. They are calculated as follows:

\operatorname{MAE}(y,\hat{y})=\frac{1}{|y|}\sum_{i=1}^{|y|}\left|y_{i}-\hat{y}_{i}\right|

(13)

\operatorname{RMSE}(y,\hat{y})=\sqrt{\frac{1}{|y|}\sum_{i=1}^{|y|}\left(y_{i}-\hat{y}_{i}\right)^{2}}

(14)

R^{2}(y,\hat{y})=1-\frac{\sum_{i=1}^{|y|}\left(y_{i}-\hat{y}_{i}\right)^{2}}{\sum_{i=1}^{|y|}\left(y_{i}-\bar{y}\right)^{2}}

(15)

where $y_{i}$ and $\hat{y}_{i}$ are the ground-truth and predicted values for region $i$, $|y|$ is the number of evaluated regions, and $\bar{y}$ is the mean of the ground-truth values.
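The three metrics in Eqs. (13) to (15) can be sketched in a few lines of plain Python. The function names and the sample values are illustrative only; they are not results from the paper.

```python
import math

def mae(y, y_hat):
    """Mean Absolute Error, Eq. (13)."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Root Mean Squared Error, Eq. (14)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def r2(y, y_hat):
    """Coefficient of determination, Eq. (15)."""
    y_bar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))   # residual sum of squares
    ss_tot = sum((a - y_bar) ** 2 for a in y)              # total sum of squares
    return 1.0 - ss_res / ss_tot

# Illustrative values only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))
```

Note that $R^{2}$ is undefined when all ground-truth values are identical, since the denominator becomes zero; the sketch does not guard against that case.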

Appendix E Details of Baselines

Description & Settings. We compare against a variety of widely used traditional methods and prominent recent techniques. We select two classical methods (AutoEncoder [18] and ResNet-18 [10]) and four state-of-the-art methods for geospatial region embedding: two vision-based models (PG-SimCLR [38] and UrbanCLIP [42]) and two multi-modal approaches (UrbanVLP [9] and GeoStructural [20]). Moreover, we explore a variant of GeoHG that uses graph Self-Supervised Learning (SSL) pretraining instead of end-to-end training, denoted GeoHG-SSL. We describe these baselines as follows.

• AutoEncoder [18]: A neural network that extracts features from unlabeled satellite imagery by minimizing a reconstruction loss.

• ResNet-18 [10]: The well-known residual neural network, pretrained on the ImageNet dataset [2], which extracts visual features from satellite imagery directly by transferring the knowledge it acquired from natural images.

• UrbanCLIP [42]: A model incorporating a multi-modal Large Language Model (LLM) to enhance the encoding of satellite imagery. This model generates descriptive texts for satellite images using the LLM and then fuses these texts with the images through image-text contrastive learning to capture the complexity and diversity of geospatial areas.

• PG-SimCLR [38]: A contrastive learning framework that introduces societal information (i.e., POI) into geospatial region representation learning from satellite imagery.

• GeoStructural [20]: A graph-based framework that profiles geospatial regions by using street segments as the graph structure for adaptively fusing features from multi-level satellite and street-view images. For convenience, we refer to this method as GeoStructural.

• UrbanVLP [9]: A region embedding method based on contrastive learning, which incorporates satellite imagery, street-view images, and spatial position structure. This method is further enhanced by incorporating a Large Language Model (LLM) and GeoCLIP [33], resulting in improved robustness and performance.

Appendix F More Details about Experiments Results

F.1 RMSE metrics of experiment results

Due to page constraints in the main text, we present the complete version of our experimental results in Table 6, which additionally reports RMSE for each experiment. The RMSE results are consistent with the other performance metrics: GeoHG attains the lowest RMSE on every task in every city.

Table 6: Region indicators prediction results. The bold/underlined font means the best/the second-best result.

Methods		GeoHG		GeoHG-SSL		UrbanVLP		GeoStructural		PG-SimCLR		UrbanCLIP		ResNet-18		AutoEncoder
Metric		$\text{R}^{2}$	RMSE	$\text{R}^{2}$	RMSE	$\text{R}^{2}$	RMSE	$\text{R}^{2}$	RMSE	$\text{R}^{2}$	RMSE	$\text{R}^{2}$	RMSE	$\text{R}^{2}$	RMSE	$\text{R}^{2}$	RMSE
Beijing	Carbon	0.954	0.201	0.937	0.224	0.787	0.457	0.765	0.472	0.442	0.754	0.664	0.598	0.394	0.805	0.298	0.844
	Population	0.874	0.351	0.870	0.374	0.725	0.513	0.730	0.504	0.471	1.117	0.461	0.669	0.266	0.857	0.168	0.918
	GDP	0.647	0.567	0.644	0.581	0.586	0.650	0.617	0.609	0.277	1.254	0.355	0.864	0.285	1.024	0.171	1.251
	Night Light	0.901	0.311	0.900	0.353	0.531	0.629	0.488	0.657	0.369	0.728	0.420	0.700	0.348	0.734	0.276	0.784
	$PM_{2.5}$	0.971	0.160	0.970	0.169	0.641	0.594	0.694	0.482	0.398	0.845	0.533	0.699	0.341	0.821	0.209	0.840
Shanghai	Carbon	0.915	0.290	0.912	0.312	0.716	0.529	0.688	0.557	0.298	0.914	0.671	0.569	0.326	0.763	0.230	0.771
	Population	0.936	0.244	0.928	0.251	0.593	0.607	0.613	0.583	0.315	0.959	0.456	0.748	0.279	0.797	0.166	0.898
	GDP	0.778	0.468	0.767	0.477	0.310	0.816	0.377	0.721	0.294	1.052	0.326	0.807	0.289	1.137	0.197	1.511
	Night Light	0.898	0.311	0.891	0.347	0.457	0.667	0.442	0.685	0.308	0.768	0.387	0.709	0.244	0.756	0.164	0.729
	$PM_{2.5}$	0.866	0.379	0.836	0.394	0.486	0.654	0.527	0.592	0.303	0.895	0.444	0.774	0.292	0.954	0.243	0.942
Guangzhou	Carbon	0.885	0.336	0.884	0.371	0.698	0.514	0.681	0.529	0.422	0.708	0.585	0.603	0.375	0.673	0.254	0.733
	Population	0.871	0.244	0.855	0.255	0.665	0.441	0.687	0.433	0.303	0.972	0.533	0.687	0.274	0.831	0.195	0.928
	GDP	0.715	0.532	0.712	0.574	0.436	0.764	0.439	0.699	0.282	1.264	0.440	0.762	0.251	1.048	0.176	1.463
	Night Light	0.871	0.378	0.854	0.391	0.577	0.573	0.574	0.581	0.435	0.627	0.483	0.633	0.242	0.704	0.176	0.798
	$PM_{2.5}$	0.833	0.403	0.822	0.414	0.638	0.624	0.652	0.597	0.315	0.741	0.56	0.694	0.231	0.859	0.196	0.903
Shenzhen	Carbon	0.926	0.290	0.912	0.304	0.659	0.568	0.647	0.581	0.257	0.816	0.562	0.571	0.241	0.726	0.189	0.887
	Population	0.892	0.244	0.879	0.261	0.790	0.448	0.797	0.390	0.311	0.892	0.527	0.610	0.299	0.855	0.175	0.921
	GDP	0.798	0.468	0.767	0.489	0.532	0.676	0.517	0.699	0.307	1.003	0.508	0.693	0.234	1.125	0.119	1.733
	Night Light	0.942	0.245	0.939	0.268	0.457	0.667	0.445	0.682	0.454	0.588	0.387	0.709	0.243	0.719	0.166	0.758
	$PM_{2.5}$	0.906	0.298	0.905	0.331	0.566	0.598	0.597	0.491	0.323	0.883	0.430	0.762	0.273	0.826	0.149	0.891

F.2 Mean and standard deviation of metrics for GeoHG

Each method is executed five times, and we report the mean and standard deviation of $R^{2}$, MAE, and RMSE for GeoHG in Table 7.

Table 7: The mean and standard deviation of each metric over 5 runs for GeoHG.

Cities	Beijing			Shanghai
Metric	$R^{2}$	MAE	RMSE	$R^{2}$	MAE	RMSE
Carbon	0.954±0.002	0.110±0.002	0.201±0.002	0.915±0.001	0.157±0.001	0.290±0.007
Population	0.874±0.003	0.271±0.001	0.351±0.001	0.936±0.005	0.161±0.001	0.244±0.001
GDP	0.647±0.015	0.336±0.010	0.567±0.009	0.778±0.001	0.323±0.002	0.468±0.001
Night Light	0.901±0.014	0.239±0.001	0.311±0.002	0.898±0.004	0.222±0.003	0.311±0.004
$PM_{2.5}$	0.971±0.009	0.064±0.002	0.160±0.003	0.866±0.001	0.120±0.002	0.379±0.003
Cities	Guangzhou			Shenzhen
Metric	$R^{2}$	MAE	RMSE	$R^{2}$	MAE	RMSE
Carbon	0.885±0.001	0.219±0.001	0.336±0.001	0.926±0.001	0.128±0.001	0.290±0.001
Population	0.871±0.003	0.244±0.005	0.368±0.002	0.892±0.005	0.165±0.007	0.244±0.005
GDP	0.715±0.002	0.371±0.005	0.532±0.002	0.798±0.002	0.297±0.002	0.468±0.003
Night Light	0.871±0.001	0.234±0.002	0.378±0.001	0.942±0.001	0.149±0.001	0.245±0.001
$PM_{2.5}$	0.833±0.003	0.133±0.002	0.403±0.003	0.906±0.002	0.116±0.002	0.298±0.004

F.3 More ablation study results

The ablation results for Beijing and Shenzhen are illustrated in Figure 11. GeoHG-Mono preserves only the homogeneous graph structure and adjacency relationships between regions, while GeoHG-MLP discards the graph structure entirely. Similar to the results for Guangzhou and Shanghai in Figure 5, discarding the heterogeneous graph structure leads to severe performance degradation across all tasks. Furthermore, removing environmental, social, or location information also reduces model performance, but the degree of degradation caused by omitting each type of information varies across cities.

F.4 Details of data efficiency experiment

Table 8: Region indicators prediction results in the few-shot setting. The evaluation metric is $R^{2}$, and the test set constitutes a 20% random sample disjoint from the training data. The bold/underlined font means the best/the second-best result.

Methods		GeoHG			UrbanVLP			GeoStructural			UrbanCLIP
Available Data		20%	10%	5%	20%	10%	5%	20%	10%	5%	20%	10%	5%
Beijing	Carbon	0.954	0.933	0.890	\underline{0.719}	\underline{0.661}	\underline{0.647}	0.579	0.561	0.540	0.643	0.596	0.432
	Population	0.791	0.775	0.728	\underline{0.665}	\underline{0.635}	\underline{0.615}	0.620	0.604	0.522	0.568	0.557	0.518
	GDP	0.689	0.604	0.474	0.511	0.448	0.417	\underline{0.600}	\underline{0.547}	\underline{0.465}	0.401	0.380	0.273
	Night Light	0.871	0.857	0.833	0.439	0.420	\underline{0.405}	\underline{0.485}	\underline{0.456}	0.383	0.415	0.396	0.369
	$PM_{2.5}$	0.987	0.984	0.982	\underline{0.611}	\underline{0.569}	\underline{0.542}	0.598	0.536	0.461	0.450	0.403	0.381
Shanghai	Carbon	0.888	0.851	0.831	0.632	0.618	0.562	\underline{0.697}	\underline{0.657}	\underline{0.589}	0.668	0.598	0.526
	Population	0.923	0.894	0.884	0.556	0.521	0.454	\underline{0.573}	\underline{0.546}	\underline{0.498}	0.530	0.515	0.451
	GDP	0.733	0.665	0.656	0.236	0.228	0.198	\underline{0.351}	\underline{0.318}	\underline{0.297}	0.315	0.302	0.277
	Night Light	0.872	0.839	0.806	0.415	0.368	0.311	\underline{0.434}	\underline{0.394}	\underline{0.331}	0.379	0.345	0.312
	$PM_{2.5}$	0.817	0.785	0.743	0.457	0.415	0.354	\underline{0.501}	\underline{0.455}	\underline{0.416}	0.380	0.357	0.325
Guangzhou	Carbon	0.796	0.775	0.751	0.578	0.532	0.415	\underline{0.579}	\underline{0.574}	\underline{0.561}	0.468	0.377	0.320
	Population	0.841	0.828	0.789	0.605	0.579	0.441	\underline{0.682}	\underline{0.618}	\underline{0.596}	0.556	0.552	0.419
	GDP	0.693	0.627	0.608	0.385	0.352	0.273	\underline{0.415}	\underline{0.367}	\underline{0.336}	0.407	0.303	0.223
	Night Light	0.831	0.827	0.812	0.508	0.472	0.318	\underline{0.560}	\underline{0.529}	\underline{0.510}	0.470	0.461	0.379
	$PM_{2.5}$	0.766	0.758	0.752	0.558	0.493	0.381	\underline{0.581}	\underline{0.528}	\underline{0.400}	0.466	0.329	0.232
Shenzhen	Carbon	0.886	0.842	0.839	0.571	0.520	0.468	\underline{0.585}	\underline{0.582}	\underline{0.563}	0.478	0.463	0.390
	Population	0.891	0.889	0.870	0.605	0.579	0.543	\underline{0.691}	0.585	0.549	0.646	\underline{0.618}	\underline{0.559}
	GDP	0.782	0.781	0.767	\underline{0.503}	\underline{0.477}	\underline{0.448}	0.486	0.482	0.463	0.421	0.374	0.344
	Night Light	0.932	0.930	0.925	\underline{0.402}	0.354	0.228	0.383	\underline{0.380}	\underline{0.356}	0.312	0.270	0.228
	$PM_{2.5}$	0.905	0.888	0.851	0.497	0.403	0.311	\underline{0.527}	\underline{0.407}	\underline{0.350}	0.368	0.332	0.266

We report the performance of GeoHG and several baseline models in a data-limited setting in Table 8. Existing geospatial embedding models exhibit some few-shot learning capability, performing relatively well despite significantly reduced training data. GeoHG consistently outperforms the other models under the same training data configurations. Moreover, GeoHG can match or exceed the other models while using less data: on the Beijing Carbon task, for example, GeoHG trained on only 5% of the data ($R^{2}=0.890$) outperforms every baseline trained on 20% (the best being UrbanVLP at 0.719).
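The few-shot protocol can be sketched as follows: hold out a fixed 20% test set, then draw training subsets of decreasing size from the remaining regions. How the training fractions are measured (here, relative to all regions) and the fixed seed are assumptions for illustration; the function name `few_shot_split` is hypothetical.

```python
import random

def few_shot_split(region_ids, train_fraction, test_fraction=0.2, seed=0):
    """Return (train, test) lists that are disjoint; fractions are relative to all regions."""
    rng = random.Random(seed)          # fixed seed so every setting shares one test set
    ids = list(region_ids)
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    n_train = round(len(ids) * train_fraction)
    test = ids[:n_test]
    train = ids[n_test:n_test + n_train]
    return train, test

all_regions = range(1000)              # hypothetical region indices
splits = {frac: few_shot_split(all_regions, frac) for frac in (0.20, 0.10, 0.05)}
for frac, (train, test) in splits.items():
    print(frac, len(train), len(test))
```

Because the shuffle is seeded identically, the 5% training set is a prefix of the 10% set, which is a prefix of the 20% set, so the three settings differ only in how much data is available.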

F.5 Details of investigation of GeoHG with GNNExplainer

To validate GeoHG’s efficacy in capturing higher-order geospatial relationships, we employ GNNExplainer [46], a model-agnostic technique that interprets GNN predictions by identifying crucial subgraph structures and features. Specifically, for the carbon emission prediction task on region #(35,65) in Guangzhou, we use GNNExplainer to extract the top 10 most influential edges, as shown in Table 9. Similar results for region #(43,87) are shown in Table 10. Based on Table 9, we can derive the following key findings:

Table 9: Top 10 important edges for carbon emission predictions for region #(35,65) in Guangzhou. The underlined font is the regional nodes adjacent to the target region, and the bold font refers to the distant region nodes.

Source Node Type	Source Node Name	Target Node Type	Target Node Name	Importance
region	(35,65)	society	Food and Beverage	0.534
region	(35,65)	society	Shopping Mall	0.511
region	(35,65)	environment	Built-up	0.522
region	(35,65)	region	(34,65)	0.526
region	(35,65)	region	(36,65)	0.519
region	(35,65)	region	(35,64)	0.522
region	(35,65)	region	(34,64)	0.520
society	Food and Beverage	region	(40,109)	0.495
society	Food and Beverage	region	(47,49)	0.495

1) The target region exhibits strong associations with societal entity nodes like "Food and Beverage" and "Shopping Mall", which are typically carbon-intensive due to factors such as energy usage (cooking, refrigeration), transportation (goods/personnel movement), and packaging consumption.

2) Adjacent regions’ carbon emissions emerge as highly influential features, an intuitive pattern arising from spatial proximity and potential environmental spillovers between neighboring areas.

3) Notably, important hyperedges also connect the target region to relatively distant areas like #(40, 109) and #(47, 49). Despite geographical separation, these regions share similar social contexts, being linked to the "Food and Beverage" node, thereby providing relevant emission patterns to inform the prediction.

These interpretable insights validate GeoHG’s effectiveness in capturing the intricate high-order dependencies between environmental conditions, urbanization factors, economic activities, and their synergistic impact on carbon footprints across regions. By encoding such high-order interactions through graph structures, GeoHG offers a powerful inductive bias tailored for modeling complex geospatial phenomena governed by higher-order environment-society couplings.
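The distinction drawn in findings 2) and 3) can be checked mechanically on the edges of Table 9. The sketch below separates directly adjacent regions from distant regions that are reachable only through a shared entity node. It assumes region identifiers are grid indices and that adjacency means the 8-neighbourhood; both are assumptions for illustration.

```python
# Edges copied from Table 9 (target region #(35,65) in Guangzhou).
target = (35, 65)
target_entities = {"Food and Beverage", "Shopping Mall", "Built-up"}
region_edges = [
    ((35, 65), (34, 65), 0.526),
    ((35, 65), (36, 65), 0.519),
    ((35, 65), (35, 64), 0.522),
    ((35, 65), (34, 64), 0.520),
]
entity_edges = [
    ("Food and Beverage", (40, 109), 0.495),
    ("Food and Beverage", (47, 49), 0.495),
]

def is_adjacent(a, b):
    """True if two grid cells touch, including diagonally (assumed 8-neighbourhood)."""
    return a != b and max(abs(a[0] - b[0]), abs(a[1] - b[1])) == 1

# Regions linked to the target by a direct region-to-region edge.
adjacent = [dst for src, dst, _ in region_edges if src == target and is_adjacent(src, dst)]

# Regions reached in two hops: target -> shared entity node -> distant region.
distant_via_entity = [
    (entity, region)
    for entity, region, _ in entity_edges
    if entity in target_entities and not is_adjacent(target, region)
]

print(adjacent)
print(distant_via_entity)
```

Under these assumptions, all four region-to-region edges point to neighbouring cells, while both remaining edges reach non-neighbouring cells through the "Food and Beverage" node, matching the pattern described in the findings.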

Table 10: Top 10 important edges for carbon emission predictions for region #(43,87) in Guangzhou. The underlined font is the regional nodes adjacent to the target region, and the bold font refers to the distant region nodes.

Source Node Type	Source Node Name	Target Node Type	Target Node Name	Importance
region	(43,87)	society	Food and Beverage	0.496
region	(43,87)	environment	Built-up	0.495
region	(43,87)	environment	Tree	0.519
region	(43,87)	region	(42,87)	0.531
region	(43,87)	region	(44,87)	0.526
region	(43,87)	region	(43,86)	0.524
region	(43,87)	region	(43,88)	0.520
environment	Built-up	region	(48,54)	0.497
environment	Tree	region	(119,7)	0.495

Appendix G Qualitative Demonstration

In this section, we qualitatively evaluate the efficacy of our proposed approach. To begin with, we visually depict the regional dependency distribution by calculating the cosine similarity between embeddings of distinct regions, which are acquired through self-supervised training as outlined in Section 3.4. Moreover, we juxtapose the pertinent real-world information of regions exhibiting highly similar embeddings to scrutinize the capacity of our method to succinctly encapsulate the information and interrelationships within geospace.

Given the embedding $E_{A}$ of region A, the cosine similarity with any other region’s embedding can be calculated as follows:

S_{C}(E_{A},E_{B})=\cos(\theta)=\frac{\mathbf{E_{A}}\cdot\mathbf{E_{B}}}{\|\mathbf{E_{A}}\|\|\mathbf{E_{B}}\|}=\frac{\sum_{i=1}^{n}E_{Ai}E_{Bi}}{\sqrt{\sum_{i=1}^{n}E_{Ai}^{2}}\cdot\sqrt{\sum_{i=1}^{n}E_{Bi}^{2}}}

(16)

where $E_{Ai}$ and $E_{Bi}$ are the $i$-th components of vectors $E_{A}$ and $E_{B}$, respectively, and $n$ is the embedding dimension.
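Eq. (16) translates directly into code. The sketch below computes the cosine similarity between a query region and every other region and ranks the results, which is the quantity a similarity heat map would display. The embeddings are hypothetical 3-dimensional placeholders, not GeoHG-SSL outputs.

```python
import math

def cosine_similarity(e_a, e_b):
    """Cosine similarity between two equal-length embedding vectors, Eq. (16)."""
    dot = sum(a * b for a, b in zip(e_a, e_b))
    norm_a = math.sqrt(sum(a * a for a in e_a))
    norm_b = math.sqrt(sum(b * b for b in e_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0                     # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, embeddings):
    """Return (region, similarity) pairs for all other regions, most similar first."""
    scores = [
        (region, cosine_similarity(embeddings[query], emb))
        for region, emb in embeddings.items()
        if region != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical embeddings keyed by grid index.
embeddings = {
    (40, 58): [1.0, 2.0, 3.0],
    (40, 59): [2.0, 4.0, 6.0],    # same direction as the query
    (10, 10): [3.0, 0.0, -1.0],   # orthogonal to the query
}
ranking = rank_by_similarity((40, 58), embeddings)
print(ranking)
```

Cosine similarity depends only on the direction of the embeddings, so the second region scores 1 even though its vector has twice the magnitude of the query.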

In Figure 12, we present the visualization results of two randomly selected regions regarding their regional similarity within the GeoHG-SSL geospatial embedding of Guangzhou. For Region (40,58), we identify it as a farmland area situated near water on an island. According to the heat map, our model has naturally discovered other regions on the same island, Jiabaosha Island, with closer regions exhibiting higher similarity. Interestingly, our model has also identified some remote islands in the north with similar environmental and societal functions, such as Dazhouwei Island and Sandy Bay. Furthermore, the western part of Guangzhou city, which is predominantly a mountainous industrial area, shows lower similarity results, aligning with this geographical characteristic.

For Region (36,63), we find that it is an urban residential area with green land near a river, named Caohe Village. From the heat map, we can see that our model connects it to other residential areas in Guangzhou that exhibit similar characteristics. For example, Shilou Village and Lyv Village share similar traits, with residential settlements bordering flowing rivers.

It is worth noting that, thanks to the comprehensive representation of environmental, societal, and spatial information in the geospace and the specially designed graph structure, our model does not reduce the relationships between regions to second-order (pairwise) links based solely on intra-region feature similarity. Instead, the embeddings consistently reveal groups of regions related at a higher order, rather than isolated high-similarity points.

Appendix H More Discussion

Limitations and Future Directions. One limitation of our proposed method is that the complexity of the graph structure is directly proportional to the area size of the geospace, as the unit grid of the geospace is fixed to a $1km\times 1km$ square. This could present a challenge in worldwide large-scale applications, where the area of geospace may be extremely vast. In the future, we may explore the potential of modeling the geospace of the entire Earth through an adaptive grid partition and graph construction method. Furthermore, although we have emphasized and validated the efficacy and advantages of segmentation-based satellite imagery encoding, the upper limit of this approach is determined by the quality of segmentation results provided by third parties. Additionally, while we have demonstrated the effectiveness of our GeoHG in various real-world applications, further investigation is warranted to evaluate its performance in more diverse and complex scenarios.

Social Impacts. Efficient geospatial embedding holds considerable benefits for the broader geographic and spatial computing communities. By comprehensively representing intra-region features and inter-region correlations, our proposed GeoHG framework has significant potential to effect meaningful change across various domains, especially smart cities and geoscience. The interpretability and efficiency provided by GeoHG facilitate a deeper understanding of complex geospace and the underlying mixed-order correlations inherent in the space. This comprehensive representation of geospatial regions empowers stakeholders to monitor cities and environments more effectively and to make informed decisions that ultimately enhance individual and communal quality of life while fostering more resilient and sustainable environments.

Moreover, the remarkable data efficiency of GeoHG enables the community to investigate regional fine-grained climates with limited resources. For instance, in Section 4.3, we showcase its exceptional performance in predicting fine-grained ($1km\times 1km$) indicators for a large area ($16,401km^{2}$) using only 863 monitoring points. As human society progresses and the global environment changes, regional extreme climates, such as urban heat islands and local air pollution, continue to pose significant economic, environmental, and health challenges. Thus, effective fine-grained regional climate monitoring becomes increasingly important. Given the millions of lives lost annually due to local extreme heat and air pollution, we believe the enhanced geospatial region representation and data efficiency facilitated by our approach will support more effective human health protection measures and inform geopolitical decision-making, promoting improved urban and environmental management as well as extreme climate mitigation strategies.