Learning Geospatial Region Embedding with Heterogeneous Graph

Xingchen Zou1, Jiani Huang2∗, Xixuan Hao1, Yuhao Yang3, Haomin Wen4, Yibo Yan1, Chao Huang3, Yuxuan Liang1
1 The Hong Kong University of Science and Technology (Guangzhou), 2 The Hong Kong Polytechnic University, 3 University of Hong Kong, 4 Beijing Jiaotong University
{xzou428,xhao390}@connect.hkust-gz.edu.cn   {jianihuang01,yanyibo70,chaohuang75}@gmail.com   {wenhaomin}@bjtu.edu.cn   {yuhao-yang,yuxliang}@outlook.com
∗ These authors contributed equally to this work. Y. Liang is the corresponding author. Email: [email protected]
Abstract

Learning effective geospatial embeddings is crucial for a series of geospatial applications such as city analytics and earth monitoring. However, learning comprehensive region representations presents two significant challenges: first, the deficiency of effective intra-region feature representation; and second, the difficulty of learning from intricate inter-region dependencies. In this paper, we present GeoHG, an effective heterogeneous graph structure for learning comprehensive region embeddings for various downstream tasks. Specifically, we tailor satellite image representation learning through geo-entity segmentation and point-of-interest (POI) integration for expressive intra-regional features. Furthermore, GeoHG unifies informative spatial interdependencies and socio-environmental attributes into a powerful heterogeneous graph to encourage explicit modeling of higher-order inter-regional relationships. The intra-regional features and inter-regional correlations are seamlessly integrated by a model-agnostic graph learning framework for diverse downstream tasks. Extensive experiments demonstrate the effectiveness of GeoHG in geo-prediction tasks compared to existing methods, even under extreme data scarcity (with just 5% of training data). With interpretable region representations, GeoHG exhibits strong generalization capabilities across regions. We will release code and data upon paper notification.

1 Introduction

Geospatial regions, as complex systems, witness the intertwining of natural laws and social dynamics, leading to mixed-order interactions among regions and entities. Recently, there has been a growing interest in leveraging deep learning methods to tackle various geospatial tasks [59, 41], such as regional indicator prediction [43, 42, 45, 14], earth monitoring [19, 25, 47], and geopolitics optimization [12, 31, 40]. Global geospatial regions exhibit intricate and multifaceted dynamics. Learning effective geospatial embeddings that capture the inherent characteristics of regions and their intricate mixed-order relationships lays the groundwork for these applications [55, 33, 19].

To learn effective representations for geographical regions, it is essential to capture two key characteristics: intra-region features and inter-region correlations [30, 53, 5]. As illustrated in Figure 1, (a) intra-region features encapsulate a region’s environmental information (e.g., vegetation coverage, and water resources) and societal information (e.g., colleges, or government institutions). (b) Inter-region correlations indicate the relationships between regions, extending beyond simple pairwise adjacency to capture higher-order dependencies. Specifically, we believe that the representation of a region can be influenced by multiple regions collectively, even if they are not spatially adjacent. For instance, regions along the same river basin may exhibit correlated climate and economic patterns despite lacking direct adjacency.

Figure 1: Illustration of region representation from intra-region and inter-region perspectives. (a) Intra-region features consist of environmental features and societal features. (b) Inter-region correlations, e.g., the second-order adjacent correlation between neighboring regions, and the high-order correlation between remote region groups induced by environmental and societal factors.

Although relevant works [35, 42, 21] have sought to learn better geospatial representations, concisely learning an effective and comprehensive geospatial embedding remains challenging. The challenges are two-fold. First, an effective representation for intra-region features remains unexplored. Existing studies opt for leveraging globally available satellite imagery to learn geospatial representations that can generate a unified embedding space for regions across the world [35, 42, 20]. These studies often employ vision-based encoders pre-trained on natural images to learn semantics from satellite imagery, but such encoders struggle due to the substantial differences between satellite and natural imagery. Second, comprehensively modeling the complex high-order inter-region relations in geospatial embeddings is challenging. In geo-space, inter-region relationships extend beyond second-order adjacency to encompass ternary and higher-order dependencies between regions and region groups. For instance, a certain region may be influenced by several non-adjacent regions due to shared geographic or socioeconomic functions, despite lacking physical proximity [30, 21]. While multi-view graphs have been explored to model complex geographic and socioeconomic relations, their construction relies heavily on specific data, and they are limited to second-order perspectives [59, 16, 17]. Alternatively, knowledge graphs (KGs) [27] built from geographical data, such as POIs, aim to capture complex structures, but their complexity limits scalability and effectiveness in training [27, 17].

To address these challenges, we propose a novel Heterogeneous Graph structure with a learning framework for effective and mixed-order relation-aware Geospatial embedding (named GeoHG). Our framework leverages satellite images and POI information to effectively derive unified region representations encapsulating both intra-region and inter-region information from environmental and societal perspectives. To derive effective intra-region representations that generalize across global regions, we propose a novel satellite image encoding mechanism. Instead of directly using vanilla vision encoders pre-trained on natural images, which struggle with the domain gap, we perform semantic segmentation on the satellite images to effectively distinguish geo-entities like water bodies, vegetation, and man-made structures. Subsequently, we integrate the spatial coordinates with the geo-entities and extra POI information within each region to construct a comprehensive intra-regional feature representation. To efficiently capture high-order inter-region relations, we leverage the mixed-order dependencies encoded in a heterogeneous graph that reflects both environmental and societal aspects of geographical regions. The heterogeneous formulation integrates spatial attributes with explicit region-entity associations, representing the high-dimensional socio-environmental dependencies across regions in a unified framework. Finally, GeoHG integrates the intra-region and inter-region representations through a model-agnostic heterogeneous graph structure. Because the structure is model-agnostic, it can be combined with different models and algorithms for various downstream tasks.

To summarize, the contributions of this work are as follows:

  • Novel Heterogeneous Graph Structure for Geospatial Embedding. We introduce a model-agnostic heterogeneous graph structure to integrate intra-region features and inter-region correlations in geospace efficiently. As far as we know, it is the first work that integrates comprehensive intra-region information with complex mixed-order inter-region relations in a heterogeneous graph for geospatial representation (Sec 3.3).

  • Efficient and Explicit Intra-Region Embedding. We develop an effective approach to construct intra-region feature representation by leveraging entity segments from satellite images to extract environmental features and utilizing POI information to capture societal attribute features (Sec 3.1).

  • Mixed-order Inter-region Relation-Aware Representation. We propose a novel method to construct interpretable heterogeneous graphs that explicitly capture mixed-order relations between regions. Our approach effectively represents the intricate multivariate relationships within the regions’ natural environment. Additionally, it captures the complex social attribute relationships among various types of social entities within geospatial regions (Sec 3.2).

  • Empirical Evidence. Our empirical studies across various downstream tasks in different locations demonstrate the superior performance of GeoHG compared to various baseline models. It consistently outperforms existing methods and maintains substantial gains even in extremely low-data regimes (5% training data). Moreover, GeoHG outperforms conventional geospatial interpolation methods specifically designed for such data-scarce settings (Sec 4.1).

2 Related Work

Geospatial Embedding. Numerous prior studies [52, 57, 13, 35, 20] have attempted to address the challenge of geospatial embedding. For example, Zhang et al. [52] proposed a multi-view graph representation approach that considered POI data and mobility patterns to generate representations of regions. In a similar vein, Zhou et al. [57] employed a prompt learning method by leveraging both POI and mobility data. However, these approaches are limited to specific tasks and regions due to limited and non-comprehensive embeddings [53, 42]. To tackle this issue, there has been a growing interest in utilizing satellite imagery for geospatial embedding, as it offers easy accessibility and global coverage. For instance, PG-SimCLR [38] and UrbanCLIP [42] have demonstrated success in profiling urban regions through satellite images for various downstream tasks. Nevertheless, there remains a gap for these models to effectively interpret the complex and professional semantic meaning of satellite imagery. Moreover, the aforementioned methods overlook the crucial modeling of mixed-order inter-region relations, which could affect the comprehensiveness of geospatial embedding. Therefore, our proposed GeoHG initially employs a heterogeneous graph to capture these inter-region relationships while concurrently learning the explicit features of a given region.

Graph Neural Network for Geospatial Representation. Graph Neural Networks (GNNs) offer a succinct and scalable approach for modeling intricate geospaces with non-Euclidean characteristics [51, 26, 11, 34]. By propagating messages through edges, GNNs are capable of learning the geospatial correlations between regions and geo-entities effectively. To better represent these complex correlations, numerous studies have adopted a multi-view graph structure [22, 5, 3, 7]. In this framework, nodes in each view (e.g., distance, mobility, semantic) are linked with distinct edge vectors that represent the correlations between node pairs under specific perspectives. Differing from this approach, our work utilizes a Heterogeneous Graph (HG) to further represent mixed-order geospatial correlations. Although HG structures are frequently used in knowledge graphs [27, 17, 23] that often lack scalability and effective training [27, 17], our GeoHG innovatively revises HG to succinctly represent mixed-order correlations and group interactions between regions from both environmental and societal perspectives. Moreover, we devise a streamlined yet efficient pipeline for automatically constructing this heterogeneous graph from satellite imagery and POI information.

Multimodal Learning in GeoAI. Multimodal learning involves integrating data from various modalities to enhance model performance [1, 8, 6]. Geospatial multimodal learning, in particular, improves regional understanding by combining spatio-temporal (e.g., POI and road network), visual (e.g., urban imagery), and textual (e.g., social media data) modalities, resulting in more comprehensive and accurate geospatial representations [59, 55]. Traditional geospatial representation learning mostly relies on task-specific supervised methods [48, 55], which are limited by their dependence on domain expertise and inability to adapt to new tasks [59]. To overcome these limitations, recent studies [15, 52] focus on learning general geospatial embeddings from a multimodal perspective. Nonetheless, efficiently integrating multimodal data from intricate geospaces, while maintaining generalizability, poses a significant challenge [59]. In this study, we develop a holistic geospatial embedding learning framework grounded in graph theory to amalgamate visual insights and POI information of geospace while maintaining awareness of geospatial mixed-order relations.

3 Methodology

Geospatial Region Embedding: We aim to train a GeoHG model capable of generating embeddings for regions worldwide to facilitate a diverse set of downstream applications. To enable our proposed model to derive embeddings for arbitrary regions, we construct the GeoHG architecture using globally available satellite imagery and readily accessible POI data as input sources, serving as proxies for environmental and societal information, respectively.

Given the access to satellite imagery and POI dataset $D_{\rm satellite\_POI}=\{(S_i,P_i)\}_{i=1}^{I}$ containing pairs of satellite image $S_i$ and POI data $P_i$ corresponding to the region of interest $R_i$, our GeoHG model W generates embedding $E_i=\phi(S_i,P_i)$, where $E_i\in\mathbb{R}^{d}$, representing the respective region embedding. $\phi(\cdot)$ is an embedding function capturing semantic and contextual information of $R_i$. The generated embeddings $E_i$ encapsulate the intra-region features containing environmental and societal information and its higher-order inter-region dependency with other regions.
Notably, the satellite images $S_i$ and POI data $P_i$ can originate from arbitrary geographical regions worldwide, necessitating that the learned embeddings concisely yet comprehensively encode rich contextual information across a global scale, which poses a formidable challenge.

Figure 2: GeoHG Framework. (a) Given satellite images and POI information, we derive intra-region features based on geo-entities proportion, POI categories proportion, and location. (b) To model inter-region correlations, we construct a heterogeneous graph structure that reflects the mixed-order relationships from regions’ spatial, environmental and societal perspectives. Details are displayed in Appendix C.2. (c) We integrate the intra-region features and inter-region correlations through a model-agnostic graph architecture. (d) We propose two training strategies: self-supervised pretraining and end-to-end training on specific tasks.

Our architecture comprises four major stages to obtain intra-region features and inter-region correlations and effectively integrate them, as illustrated in Figure 2: intra-region feature representation, inter-region correlation representation, heterogeneous graph-based representation integration, and pre-training and end-to-end training. Now, we present the details of each stage.

3.1 Intra-Region Feature Representation

Figure 3: Spatial Grids.

To effectively encode spatial, environmental, and societal signals into intra-region representations, we extract multi-modal geo-contextual data including satellite imagery, POI information, and the location of the region of interest.

Spatial Position Embedding: We adopt a simple yet effective method to extract the spatial information of the region of interest, drawing inspiration from industrial conventions [42, 9]. Specifically, we divide the overall geospace into multiple $1\,\textrm{km}\times 1\,\textrm{km}$ spatial grids with coordinate system origin ($lon^{0}, lat^{0}$) and use the abscissa $x_R$ and ordinate $y_R$ of the region of interest $R$ as spatial information $E_{pos}(x,y)$, as illustrated in Figure 3. Notably, $E_{pos}$ is directly related to the geo-coordinate ($lon, lat$) at the centroid of the targeted region. For any region $R_i$ in the target space, we have:

$$E^{i}_{pos}=(lon^{i}-lon^{0},\ lat^{i}-lat^{0})\,\Delta \ast D^{-1} \tag{1}$$

where $\Delta$ is the transform vector decided by the coordinate system. $D$ denotes the scale of the grids, and for $1\,\textrm{km}\times 1\,\textrm{km}$ grids, $D=1$.
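To make Equation 1 concrete, the following is a minimal sketch of how a grid position could be computed from a region centroid. The paper does not specify the transform vector $\Delta$, so the sketch assumes a simple equirectangular approximation (about 111.32 km per degree of latitude, with longitude scaled by the cosine of the origin latitude). The function name `grid_position` and the constant `KM_PER_DEG_LAT` are hypothetical.

```python
import math

# Assumption: approximate length of one degree of latitude, in kilometres.
KM_PER_DEG_LAT = 111.32


def grid_position(lon, lat, lon0, lat0, grid_km=1.0):
    """Return (x, y) of a centroid in grid units, relative to the origin (lon0, lat0).

    Mirrors Eq. 1: offsets in degrees are scaled by a transform vector Delta
    (km per degree, assumed here) and divided by the grid scale D (grid_km).
    """
    # Assumed transform vector Delta = (km per degree lon, km per degree lat).
    delta = (KM_PER_DEG_LAT * math.cos(math.radians(lat0)), KM_PER_DEG_LAT)
    x = (lon - lon0) * delta[0] / grid_km
    y = (lat - lat0) * delta[1] / grid_km
    return x, y


# A centroid 0.01 degrees north of the origin lies roughly 1.11 grid cells up.
x, y = grid_position(113.0, 23.01, 113.0, 23.0)
print(x, round(y, 3))
```

Flooring `x` and `y` would give integer cell indices; whether the authors use continuous or integer coordinates is not stated, so the sketch leaves the values continuous.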

Environmental Feature Embedding: We utilize satellite imagery to mine the environmental features and expect the satellite imagery encoding to be efficient and concise. To achieve this, we draw inspiration from the European Space Agency (ESA) WorldCover dataset [50], elaborated in Appendix A, and design a segmentation-based process to encode the satellite images instead of the most commonly used visual encoders such as CNNs. Given a satellite image $S_i$ of a region of interest $R_i$, we conduct semantic segmentation to obtain a series of geo-entities $\{entity_1, entity_2, \ldots, entity_j\}_{j=1}^{J}$ within that region, such as developed areas, grass, and trees based on ESA WorldCover, and use their area proportions as the environmental feature embedding:

$$E_{env}=\{p_{ent_1},p_{ent_2},\ldots,p_{ent_j}\}_{j=1}^{J} \tag{2}$$

where $p_{ent_j}=\frac{A_j}{A_R}$, $A_j$ is the area occupied by $entity_j$, $A_R$ is the area of region $R_i$, and $J$ is the number of environmental entity types. We discuss the motivation for using the segmentation-based approach instead of visual encoders and additional efficiency experiments in Appendix B.
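The area proportions in Equation 2 can be illustrated with a short sketch. It assumes the segmentation output is a 2D grid of integer class ids and that every pixel covers the same ground area, so pixel counts stand in for $A_j$ and the total pixel count for $A_R$. The function name `env_embedding` and the toy mask are hypothetical.

```python
from collections import Counter


def env_embedding(seg_mask, num_classes):
    """Area proportion of each geo-entity class in one region (Eq. 2).

    seg_mask: 2D list of integer class ids in [0, num_classes).
    Assumes equal-area pixels, so pixel counts are proportional to area.
    """
    counts = Counter(label for row in seg_mask for label in row)
    total = sum(counts.values())
    return [counts.get(j, 0) / total for j in range(num_classes)]


# Toy 2x3 mask with three of four classes present.
mask = [[0, 0, 1],
        [2, 1, 0]]
p = env_embedding(mask, num_classes=4)
print(p)
```

Classes absent from a region simply receive a proportion of zero, which keeps the embedding length fixed at $J$ for every region.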

Societal Feature Embedding: To effectively represent the societal features of a region, given the POI dataset $D_{POI}$ and the region of interest $R_i$, we first find all points $D^{R_i}_{POI}$ $(poi_1, poi_2, \ldots, poi_j)$ that are located in $R_i$. Then, we compute the proportion of each of the $K$ POI categories in the region $R_i$ to get $E_{soc}'$ as below:

$$E_{soc}'=\{p_{poi_1},p_{poi_2},\ldots,p_{poi_k}\}_{k=1}^{K},\quad p_{poi_k}=|C_k\cap D_{POI}^{R_i}|\,/\,|D_{POI}^{R_i}| \tag{3}$$

where $K$ is the number of POI categories and $C_k=\{poi_j\in D_{POI}\mid \text{category}(poi_j)=k\}$. However, directly employing the proportions of different POI categories $E_{soc}'$ as societal features can still be problematic, since they cannot distinguish between areas with high POI density (usually with a higher degree of socialization) and those with relatively low POI density. Therefore, we transform $E_{soc}'$ into the final societal feature embedding $E_{soc}$ by multiplying it by a social impact factor $f$ as below:

$$E_{soc}=f\cdot E_{soc}',\quad f=\log(|D^{R_i}_{POI}|+1) \tag{4}$$
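The effect of the social impact factor can be seen in a small sketch of Equations 3 and 4. It reads the argument of the logarithm as the number of POIs in the region and assumes the natural logarithm, since the base is not specified. The function name `soc_embedding` and the zero vector returned for regions without POIs are assumptions for illustration.

```python
import math
from collections import Counter


def soc_embedding(poi_categories, num_categories):
    """Societal feature embedding of one region (Eqs. 3 and 4).

    poi_categories: list of category ids in [0, num_categories), one per POI
    located in the region.
    """
    n = len(poi_categories)
    if n == 0:
        # Assumption: a region with no POIs maps to an all-zero vector.
        return [0.0] * num_categories
    counts = Counter(poi_categories)
    f = math.log(n + 1)  # social impact factor; natural log assumed
    return [f * counts.get(k, 0) / n for k in range(num_categories)]


# Same category mix (50/50), very different POI density.
sparse = soc_embedding([0, 1], num_categories=2)
dense = soc_embedding([0] * 10 + [1] * 10, num_categories=2)
print(sparse, dense)
```

Both regions have identical proportions, yet the dense region receives larger feature values because $f$ grows with the POI count, which is the distinction the raw proportions alone cannot express.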

Finally, the intra-region feature embedding is generated by concatenating the spatial position embedding $E_{pos}$, environmental feature embedding $E_{env}$ and societal feature embedding $E_{soc}$:

$$E_{intra}=concat[E_{pos},E_{env},E_{soc}] \tag{5}$$
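As a quick check on Equation 5, the length of the intra-region embedding follows from its three parts: two grid coordinates, $J$ area proportions, and $K$ scaled POI proportions. The values of $J$ and $K$ used below are illustrative assumptions, not figures reported in this section.

```latex
E_{intra} \in \mathbb{R}^{2 + J + K},
\qquad \text{e.g. } J = 11,\ K = 14
\ \Rightarrow\ \dim(E_{intra}) = 2 + 11 + 14 = 27 .
```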

3.2 Inter-Region Correlation Representation

Figure 4: Illustration of high-order relations representation in GeoHG. Given a hypergraph (a) with high-order relations, we reconstruct it into a heterogeneous graph (b) by adding nodes as higher-order connection channels and get the final representation (c) in $\mathcal{G}$.

Our objective is to model both pairwise second-order relationships between regions and their adjacent neighbors and higher-order dependencies with distant region groups exhibiting similar environmental or societal characteristics. Inspired by hypergraph theories [21, 4], elaborated in Appendix C.1, we employ a heterogeneous graph formulation, GeoHG, to represent these relationships explicitly. As shown in Figure 4, we take heterogeneous nodes in the graph structure as transfer nodes to achieve high-order message passing in the hypergraph. Given a pair of $1\,\textrm{km}\times 1\,\textrm{km}$ satellite image $S_i$ and POI data $P_i$ in $D_{\rm satellite\_POI}$, we construct an undirected weighted heterogeneous graph $\mathcal{G}_i$, containing three types of nodes: regional nodes, environmental entity nodes, and societal entity nodes. Specifically, regional nodes correspond to $1\,\textrm{km}\times 1\,\textrm{km}$ grid cells from the geospace; environmental entity nodes represent the $J$ distinct geo-entity classes detected from satellite image $S_i$, which are consistent with intra-region environmental features in Section 3.1; societal entity nodes constitute the $K$ different POI categories. The overall structure of GeoHG is shown in Figure 2 and the details are discussed in Appendix 4.
We then elaborate on how second-order and high-order information is obtained through GeoHG $\mathcal{G}_i=(\mathcal{V},\mathcal{E})$, respectively.

Second-Order Relation Representation: Second-order relations capture the pairwise spatial adjacency between regional nodes (grid cells) in $\mathcal{G}_i$. We construct undirected Region Nearby Region (RNR) edges $\mathcal{E}_{RNR}$ between regional nodes whose corresponding grid cells are spatially adjacent in a $3\times 3$ grid, thereby encoding these local second-order dependencies.

$$\mathcal{E}_{RNR,mn}=1,\quad \text{if } \mathcal{V}_m \text{ and } \mathcal{V}_n \text{ are adjacent in geospace} \tag{6}$$
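A minimal sketch of the RNR edge construction in Equation 6 follows. It interprets "adjacent in a $3\times 3$ grid" as the 8-neighbourhood of a cell (sides and diagonals), which is an assumption about the intended definition. The function name `rnr_edges` is hypothetical.

```python
def rnr_edges(cells):
    """Build undirected Region Nearby Region edges between grid cells.

    cells: iterable of (x, y) integer grid indices of regional nodes.
    Two cells are linked when one lies in the 3x3 window centred on the
    other (8-neighbourhood, assumed). Each edge is stored once as a sorted pair.
    """
    cell_set = set(cells)
    edges = set()
    for (x, y) in cell_set:
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                if (dx, dy) == (0, 0):
                    continue  # skip the cell itself
                neighbor = (x + dx, y + dy)
                if neighbor in cell_set:
                    edges.add(tuple(sorted(((x, y), neighbor))))
    return edges


# A 2x2 block of cells: every pair is adjacent (4 sides + 2 diagonals).
print(len(rnr_edges([(0, 0), (0, 1), (1, 0), (1, 1)])))
```

All RNR edges carry weight 1, so a set of node pairs is enough to represent them; restricting the offsets to sides only would give the 4-neighbourhood variant.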

High-Order Relation Representation: Unlike traditional multi-view graphs [7, 59], which represent complex correlations through additional second-order region connections, we explicitly model higher-order relations by leveraging the environmental entity nodes $\mathcal{V}_{env}$ and societal entity nodes $\mathcal{V}_{soc}$ in $\mathcal{G}_i$. Specifically, we construct heterogeneous weighted edges $\mathcal{E}_{ELR}$ and $\mathcal{E}_{SLR}$ through $\mathcal{V}_{env}$ and $\mathcal{V}_{soc}$ to associate each regional node with its constituent environmental entities (e.g., water, vegetation) and societal entities (e.g., educational, commercial POIs). For $p_{ent_{ij}}\geq\theta_{Env}$ or $f\cdot p_{poi_{ik}}\geq\theta_{Soc}$:

$$\mathcal{E}_{ELR,ij}=1\cdot p_{ent_{ij}},\quad\mathcal{E}_{SLR,ik}=1\cdot f\cdot p_{poi_{ik}} \tag{7}$$

where $p_{ent_{ij}}$ represents the area proportion of geo-entity $j$ in region $i$, $p_{poi_{ik}}$ denotes the quantity proportion of POI category $k$ in region $i$, and $f$ is the social impact factor elaborated in Equation 4. $\theta_{Env}$ and $\theta_{Soc}$ are hyperparameters that filter out low-weight edges, keeping the graph structure concise and efficient. This edge formulation encodes higher-order relationships between regions exhibiting similar environmental/societal attributes, even if they lack spatial proximity: two regions linked to the same entity node are connected through that node regardless of their distance. By combining the second-order spatial adjacency edges with these higher-order hyperedge associations, $\mathcal{G}_{i}$ holistically represents the mixed-order relational patterns within the region.
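The edge construction in Equations 6 and 7 can be sketched in a few lines. This is a minimal illustration rather than the released implementation: the function names, the node-naming scheme, the 4-neighbour notion of grid adjacency, and the treatment of $f$ as a per-region scalar are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str, float]  # (source node, target node, weight)


def adjacency_edges(grid_shape: Tuple[int, int]) -> List[Edge]:
    """Region-to-region edges (Eq. 6) on a regular grid.

    Assumption: two cells are "adjacent" when they share a side.
    """
    rows, cols = grid_shape
    edges: List[Edge] = []
    for r in range(rows):
        for c in range(cols):
            # Link each cell to its right and lower neighbour once (undirected).
            for dr, dc in ((0, 1), (1, 0)):
                nr, nc = r + dr, c + dc
                if nr < rows and nc < cols:
                    edges.append((f"region_{r}_{c}", f"region_{nr}_{nc}", 1.0))
    return edges


def entity_edges(
    proportions: Dict[str, Dict[str, float]],
    threshold: float,
    impact: Optional[Dict[str, float]] = None,
) -> List[Edge]:
    """Weighted region-to-entity edges (Eq. 7).

    `proportions[region][entity]` is the area share (environmental) or the
    POI-count share (societal). `impact[region]` plays the role of f and
    defaults to 1 for environmental edges. An edge is kept only when its
    weight reaches the threshold.
    """
    edges: List[Edge] = []
    for region, shares in proportions.items():
        f = 1.0 if impact is None else impact[region]
        for entity, share in shares.items():
            weight = f * share
            if weight >= threshold:
                edges.append((region, entity, weight))
    return edges
```

Under these assumptions, two distant regions that both pass the threshold for the same entity (say, `water`) end up two hops apart through that entity node, which is how the graph expresses a higher-order relation without any direct region-to-region edge.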

3.3 Heterogeneous Graph-Based Representation Integration

Having derived intra-region features capturing regions’ intrinsic attributes and inter-regional dependencies encoded in the heterogeneous graph $\mathcal{G}_{i}$, we employ a model-agnostic graph neural network framework to jointly reason over the intra-regional and inter-regional information. In the graph, each regional node $v$ in $\mathcal{G}_{i}$ is represented with an initial node feature $\mathbf{x}_{v}$ corresponding to its intra-regional feature $E_{intra}$ elaborated in Sec. 3.1. Note that for simplicity, instead of utilizing hypergraph neural networks [4, 44, 39], we model hyperedges in the form of complete subgraphs in $\mathcal{G}_{i}$. Therefore, we apply Heterogeneous Graph Neural Networks (HGNNs) over $\mathcal{G}_{i}$ to update the node representations by iteratively aggregating and transforming multi-hop neighborhood information:

$$\mathbf{x}_{v}^{(l+1)}=\text{HGNN}^{(l)}\left(\mathbf{x}_{v}^{(l)},\mathcal{N}(v),\mathcal{G}_{i}\right),\quad\mathbf{x}_{v}^{(l)}\in\mathbb{R}^{d} \tag{8}$$

where $\mathcal{N}(v)$ denotes the neighbors (spatial and higher-order) of node $v$ in $\mathcal{G}_{i}$. After $L$ layers of updates, the final node embedding $\mathbf{x}_{v}^{(L)}$ integrates cues from the intra-regional features as well as mixed-order inter-region dependencies, and serves as the comprehensive representation of the given region node. Our framework is model-agnostic, enabling flexible integration of any HGNN variant for neighborhood aggregation and easy extension to more expressive HGNN models.
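To make the update in Equation 8 concrete, the following sketch implements one generic heterogeneous message-passing layer: a self transform plus a relation-specific transform of the weighted mean of each relation's neighbours, followed by ReLU. The framework is model-agnostic, so this particular aggregation rule is only one possible choice and is not claimed to be the HGNN used in the experiments. The names `hetero_layer`, `w_self`, and `w_rel`, and the assumption that entity nodes carry feature vectors of the same dimension as region nodes, are introduced here for illustration.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Edge = Tuple[str, str, float]  # (node u, node v, weight), treated as undirected


def matvec(w: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def hetero_layer(
    features: Dict[str, Vector],
    edges: Dict[str, List[Edge]],
    w_self: Matrix,
    w_rel: Dict[str, Matrix],
) -> Dict[str, Vector]:
    """One round of relation-wise weighted-mean aggregation followed by ReLU."""
    updated: Dict[str, Vector] = {}
    for node, x in features.items():
        out = matvec(w_self, x)
        for relation, triples in edges.items():
            # Collect (neighbour, weight) pairs of this node under this relation.
            nbrs = [(v, w) for u, v, w in triples if u == node]
            nbrs += [(u, w) for u, v, w in triples if v == node]
            total = sum(w for _, w in nbrs)
            if total == 0:
                continue  # no neighbours of this relation type
            agg = [
                sum(w * features[n][k] for n, w in nbrs) / total
                for k in range(len(x))
            ]
            msg = matvec(w_rel[relation], agg)
            out = [o + m for o, m in zip(out, msg)]
        updated[node] = [max(0.0, o) for o in out]
    return updated
```

Stacking two such layers lets a region receive information from a non-adjacent region that shares an entity node with it: the first layer writes the source region's features into the entity node, and the second layer passes them on to the other region.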

3.4 Pretraining and End-to-End Training

To effectively empower various geospatial tasks with the devised representations, we present two training strategies: self-supervised pre-training that learns generalizable region representations without task-specific labels, enabling efficient transfer to diverse downstream tasks; and end-to-end training that directly optimizes task-specific objectives for peak performance.

Self-Supervised Pre-training: Inspired by CLIP [29], we pre-train our model using a contrastive learning paradigm where we maximize the similarity between a region $\mathbf{R}_{i}$ and its corresponding positive region groups $\mathbf{C}_{i}$, while minimizing its similarity with other region groups in the batch. Specifically, we sample regional nodes adjacent to $\mathbf{R}_{i}$ as well as nodes exhibiting similar intra-regional features to construct $\mathbf{C}_{i}$. During training, we leverage a GNN model $\mathcal{M}_{pretrain}$ to obtain embeddings for $\mathbf{R}_{i}$ and all nodes in $\mathbf{C}_{i}$. The embeddings of the sampled positive regional nodes are further pooled to obtain the unified positive region representation $\mathbf{e}_{j}$. The process of generating embeddings for $\mathbf{R}_{i}$ and $\mathbf{C}_{i}$ is conducted as follows:

$$\mathbf{e}_{i}=\mathcal{M}_{pretrain}(\mathbf{X}_{i}),\quad\mathbf{e}_{j}=\operatorname{Pool}\left(\mathcal{M}_{pretrain}(\mathbf{X}_{j}),\,j\in\mathcal{C}_{i}\right) \tag{9}$$

We further define a similarity function $f_{\mathbf{score}}$ to measure the similarity between the representations $\mathbf{e}_{i}$ and $\mathbf{e}_{j}$. $f_{\mathbf{score}}$ can be a simple dot product or a more complex metric. We then employ an InfoNCE-based loss [28] to conduct contrastive learning:

$$\mathcal{L}_{pretrain}=\mathbb{E}\left[-\log\frac{\exp\left(f_{\mathbf{score}}\left(\mathbf{e}_{i},\mathbf{e}_{j}\right)\right)}{\sum_{\forall i,n}\exp\left(f_{\mathbf{score}}\left(\mathbf{e}_{i},\mathbf{e}_{n}\right)\right)}\right] \tag{10}$$

where $f_{\mathbf{score}}(\mathbf{e}_{i},\mathbf{e}_{j})$ represents the score of positive pairs while $f_{\mathbf{score}}\left(\mathbf{e}_{i},\mathbf{e}_{n}\right)$ refers to the scores of negative pairs. After obtaining the pre-trained model $\mathcal{M}_{pretrain}$ through self-supervised learning, we fine-tune it for different downstream tasks by adding three trainable linear layers. The weights of these additional linear layers are updated during training for the specific downstream task while the weights of $\mathcal{M}_{pretrain}$ itself remain fixed.
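As an illustration of Equations 9 and 10, the sketch below computes an InfoNCE-style loss over a batch, with mean pooling standing in for $\operatorname{Pool}$ and a dot product standing in for $f_{\mathbf{score}}$. Both choices are assumptions, as is the reading of the denominator as a per-anchor sum over every pooled group in the batch (the positive included), which is the common InfoNCE convention. The helper names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a group of embeddings into one vector (one choice of Pool)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity (one choice of f_score)."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors: List[Vector], groups: List[Vector]) -> float:
    """Mean InfoNCE loss over a batch.

    groups[i] is the pooled positive for anchors[i]; every other pooled
    group in the batch acts as a negative for that anchor.
    """
    losses = []
    for i, e_i in enumerate(anchors):
        scores = [dot(e_i, e_n) for e_n in groups]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(scores)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_denominator - scores[i])
    return sum(losses) / len(losses)
```

The loss decreases as each anchor scores higher with its own pooled group than with the groups belonging to other anchors in the batch.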

End-to-End Training: For a specific downstream task, we directly optimize all of GeoHG’s parameters from scratch using the task’s supervised signal. Taking the regional regression task as an illustrative example, we adopt the HGNN model $\mathcal{M}$ to obtain the embedding for a given region $i$, then feed it into a three-layer linear regression head. $\mathcal{M}$’s parameters are updated by minimizing the Mean Squared Error (MSE) loss on the training data, jointly learning task-specific region representations.

4 Experiments

Datasets & Baselines. The datasets used in this paper include satellite imagery, POI information and five region indicators (Population, GDP, Night Light, Carbon, $PM_{2.5}$) located in four representative cities in China: Beijing, Shanghai, Guangzhou and Shenzhen. The satellite images are collected according to geospatial grids, and each covers a spatial area of 1 $km^{2}$. The entity segmentation results for satellite imagery are collected from the European Space Agency. We randomly split the data into 60% training, 20% validation, and 20% testing sets. For comparison, we select two classical methods (AutoEncoder [18] and ResNet-18 [10]), a state-of-the-art satellite imagery-based model (UrbanCLIP [42]), and multisource and multimodal approaches (UrbanVLP [9], PG-SimCLR [38] and GeoStructural [20]) for geospatial region embedding. Moreover, for optimal generalization capability, we explore a variation of GeoHG that employs contrast-based Graph Self-Supervised Learning (SSL) pre-training instead of end-to-end training, denoted GeoHG-SSL. Dataset, baseline and implementation details are provided in Appendices D and E.
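The random 60/20/20 split described above could be reproduced along the following lines. This is a sketch only: the function name, the use of integer region ids, and the fixed seed are assumptions, and the actual splitting procedure may differ.

```python
import random
from typing import List, Sequence, Tuple


def split_regions(
    region_ids: Sequence[int], seed: int = 0
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle region ids and cut them into 60% / 20% / 20% partitions."""
    ids = list(region_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    n_train = int(0.6 * len(ids))
    n_val = int(0.2 * len(ids))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # remainder, so no region is dropped
    return train, val, test
```

Because regions are nodes of one shared graph, a split of this kind partitions the labels rather than the graph itself; whether unlabeled test regions remain visible as nodes during training is a design decision the sketch does not settle.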

4.1 Comparison with State-of-the-Art Methods

Table 1: Region indicators prediction results. The bold/underlined font means the best/the second-best result.
| City | Indicator | GeoHG $R^2$ | GeoHG MAE | GeoHG-SSL $R^2$ | GeoHG-SSL MAE | UrbanVLP $R^2$ | UrbanVLP MAE | GeoStructural $R^2$ | GeoStructural MAE | PG-SimCLR $R^2$ | PG-SimCLR MAE | UrbanCLIP $R^2$ | UrbanCLIP MAE | ResNet-18 $R^2$ | ResNet-18 MAE | AutoEncoder $R^2$ | AutoEncoder MAE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Beijing | Carbon | 0.954 | 0.110 | 0.937 | 0.161 | 0.787 | 0.353 | 0.765 | 0.378 | 0.442 | 0.631 | 0.664 | 0.528 | 0.394 | 0.577 | 0.298 | 0.565 |
| Beijing | Population | 0.874 | 0.271 | 0.870 | 0.282 | 0.725 | 0.404 | 0.730 | 0.402 | 0.471 | 0.964 | 0.461 | 0.552 | 0.266 | 0.623 | 0.168 | 0.667 |
| Beijing | GDP | 0.647 | 0.336 | 0.644 | 0.331 | 0.586 | 0.416 | 0.617 | 0.401 | 0.277 | 0.768 | 0.355 | 0.539 | 0.285 | 0.699 | 0.171 | 0.822 |
| Beijing | Night Light | 0.901 | 0.239 | 0.900 | 0.244 | 0.531 | 0.394 | 0.488 | 0.429 | 0.369 | 0.404 | 0.420 | 0.457 | 0.348 | 0.576 | 0.276 | 0.643 |
| Beijing | $PM_{2.5}$ | 0.971 | 0.064 | 0.970 | 0.065 | 0.641 | 0.484 | 0.694 | 0.306 | 0.398 | 0.624 | 0.533 | 0.556 | 0.341 | 0.589 | 0.209 | 0.659 |
| Shanghai | Carbon | 0.915 | 0.157 | 0.912 | 0.162 | 0.716 | 0.392 | 0.688 | 0.413 | 0.298 | 0.712 | 0.671 | 0.426 | 0.326 | 0.465 | 0.230 | 0.532 |
| Shanghai | Population | 0.936 | 0.161 | 0.928 | 0.172 | 0.593 | 0.471 | 0.613 | 0.456 | 0.315 | 0.731 | 0.456 | 0.557 | 0.279 | 0.627 | 0.166 | 0.742 |
| Shanghai | GDP | 0.778 | 0.323 | 0.767 | 0.331 | 0.310 | 0.595 | 0.377 | 0.553 | 0.294 | 0.767 | 0.326 | 0.587 | 0.289 | 0.711 | 0.197 | 0.844 |
| Shanghai | Night Light | 0.898 | 0.222 | 0.891 | 0.234 | 0.457 | 0.494 | 0.442 | 0.517 | 0.308 | 0.566 | 0.387 | 0.511 | 0.244 | 0.571 | 0.164 | 0.617 |
| Shanghai | $PM_{2.5}$ | 0.866 | 0.120 | 0.836 | 0.150 | 0.486 | 0.497 | 0.527 | 0.398 | 0.303 | 0.617 | 0.444 | 0.518 | 0.292 | 0.642 | 0.243 | 0.691 |
| Guangzhou | Carbon | 0.885 | 0.219 | 0.884 | 0.209 | 0.698 | 0.385 | 0.681 | 0.497 | 0.422 | 0.708 | 0.585 | 0.444 | 0.375 | 0.515 | 0.254 | 0.570 |
| Guangzhou | Population | 0.871 | 0.244 | 0.855 | 0.255 | 0.665 | 0.441 | 0.687 | 0.433 | 0.303 | 0.954 | 0.533 | 0.567 | 0.274 | 0.671 | 0.195 | 0.753 |
| Guangzhou | GDP | 0.715 | 0.371 | 0.712 | 0.366 | 0.436 | 0.541 | 0.439 | 0.533 | 0.282 | 0.897 | 0.440 | 0.546 | 0.251 | 0.725 | 0.176 | 0.811 |
| Guangzhou | Night Light | 0.871 | 0.234 | 0.854 | 0.248 | 0.577 | 0.418 | 0.574 | 0.415 | 0.435 | 0.433 | 0.483 | 0.478 | 0.242 | 0.551 | 0.176 | 0.602 |
| Guangzhou | $PM_{2.5}$ | 0.833 | 0.133 | 0.822 | 0.158 | 0.638 | 0.462 | 0.652 | 0.461 | 0.315 | 0.542 | 0.56 | 0.514 | 0.231 | 0.694 | 0.196 | 0.780 |
| Shenzhen | Carbon | 0.926 | 0.128 | 0.912 | 0.162 | 0.659 | 0.418 | 0.647 | 0.431 | 0.257 | 0.683 | 0.562 | 0.483 | 0.241 | 0.577 | 0.189 | 0.634 |
| Shenzhen | Population | 0.892 | 0.165 | 0.879 | 0.173 | 0.790 | 0.343 | 0.797 | 0.314 | 0.311 | 0.758 | 0.527 | 0.592 | 0.299 | 0.654 | 0.175 | 0.772 |
| Shenzhen | GDP | 0.798 | 0.297 | 0.767 | 0.331 | 0.532 | 0.448 | 0.517 | 0.455 | 0.307 | 0.895 | 0.508 | 0.464 | 0.234 | 0.817 | 0.119 | 0.884 |
| Shenzhen | Night Light | 0.942 | 0.149 | 0.939 | 0.155 | 0.457 | 0.459 | 0.445 | 0.488 | 0.454 | 0.358 | 0.387 | 0.511 | 0.243 | 0.543 | 0.166 | 0.608 |
| Shenzhen | $PM_{2.5}$ | 0.906 | 0.116 | 0.905 | 0.117 | 0.566 | 0.494 | 0.597 | 0.451 | 0.323 | 0.613 | 0.430 | 0.586 | 0.273 | 0.598 | 0.149 | 0.645 |

We adopt Mean Absolute Error (MAE), root mean squared error (RMSE), and coefficient of determination ($R^{2}$) as evaluation metrics. A higher $R^{2}$, along with lower MAE and RMSE values, indicates better model accuracy. We report the performance of each model in Table 1, and the table with RMSE metrics is provided in Appendix F. From these tables, we draw three key findings: 1) Both GeoHG and GeoHG-SSL outperform all competing baselines on the 5 indicator datasets across the 4 cities. For instance, in Beijing GeoHG surpasses UrbanVLP [9], the previous SOTA, with absolute $R^{2}$ gains of 16.7, 14.9, 6.1, 37.0 and 33.0 percentage points on the Carbon, Population, GDP, Night Light and $PM_{2.5}$ tasks, respectively. Consistent trends are observed across the other cities, underscoring GeoHG’s stable accuracy, versatility, and strong generalization capabilities for geospatial region embedding. 2) The end-to-end trained GeoHG outperforms its pre-trained version GeoHG-SSL on most tasks, while GeoHG-SSL transfers well across multiple tasks. 3) Multisource and multimodal approaches, e.g., UrbanVLP [9] and GeoStructural [20], largely surpass traditional satellite imagery-based models owing to their more comprehensive embedding views.
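For reference, the three metrics follow their standard definitions, sketched below in plain Python. The function names are illustrative, and the evaluation code used for Table 1 may rely on a library implementation instead.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error: average of |y - y_hat|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error: sqrt of the average squared residual."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Note that $R^{2}$ is unbounded below and equals 0 for a model that always predicts the mean of the targets, so the differences in $R^{2}$ reported in Table 1 are best read as absolute differences between scores.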

4.2 Ablation Study & Interpretation Analysis

Effects of Core Components. To examine the effectiveness of each core component in our proposed framework, we conduct an ablation study with the following variants: a) w/o Env, which excludes environmental information from the satellite imagery; b) w/o Soc, which omits POI entities; c) w/o Pos, which does not utilize the location information of the regions. The $R^{2}$ results for five tasks in two cities, Guangzhou and Shanghai, are displayed in Figure 5. We observe that removing location information markedly degrades performance across all tasks. Meanwhile, removing environmental features severely degrades model accuracy on the $PM_{2.5}$, Night Light, and GDP tasks. In contrast, excluding POI data only slightly reduces performance on the Carbon task. Appendix F.3 shows ablation results for the other two cities.

Figure 5: Results of Ablation Study on $R^{2}$ Metric.

Effect of Graph Construction. Our framework utilizes a heterogeneous graph with hyperedges between regions to reflect high-order geospatial relations. To validate the effectiveness of this high-order relation representation, we compare GeoHG against two variants: GeoHG-MLP, which discards the graph structure and relies only on intra-region feature representation, employing a 3-layer MLP for regression; and GeoHG-Mono, which keeps edges between adjacent regions while discarding the hyperedges for high-order relations. The results presented in Figure 5 indicate that GeoHG benefits significantly from its efficient representation of complex mixed-order relations in geospace and from the integration of intra-region and inter-region representations.

Figure 6: Visualization of Geospatial Correlations learned by GeoHG in Appendix F.5. Index $(x,y)$ of a region represents the numbering in the latitude and longitude directions.

Interpretation of Mixed-order Geospatial Correlations. To investigate the power of the heterogeneous graph structure in capturing mixed-order geospatial relations, inspired by GNNExplainer [46], we visualize the geospatial dependencies learned on the Carbon dataset in Guangzhou. Notably, we do not incorporate any external features (e.g., road network, human mobility data) to construct auxiliary edges in the graph. By selecting the top $N$ most important nodes according to their weight in the GNN prediction process, as illustrated in Figure 6, we observe that GeoHG effectively captures the high-order correlations between the target region and remote region groups through message passing within the heterogeneous graph structure. Details about GNNExplainer and the experiment results are provided in Appendix F.5.

4.3 Analysis on Few-shot Learning and Data Efficiency

Applying geospatial models globally across tasks often requires large labeled datasets for supervised training, a process that is time-consuming, computationally expensive, and hindered by data scarcity. We therefore evaluate GeoHG’s performance under limited data regimes of 5%, 10%, and 20% of the available training samples, with results shown in Figure 7. The results demonstrate GeoHG’s strong data efficiency: with only 5% of the data, GeoHG outperforms previous SOTA methods such as UrbanVLP [9] and GeoStructural [20] trained on the entire training dataset, across all tasks. Utilizing only 20% of the data, GeoHG suffers minor performance degradation of 4%, 0.5%, 2.01%, 1.7%, and 3.2% on the Carbon, GDP, Light, $PM_{2.5}$, and Population tasks, respectively. These findings illustrate GeoHG’s promising potential for data-efficient global deployment across diverse geospatial prediction tasks.

Figure 7: Left: Data Efficiency of GeoHG in $5$ region indicator prediction tasks. We take the average of $R^{2}$ results in 4 cities. Right: Visualization of Data Efficiency Result ($5\%$) in Population Prediction.

Geospatial data interpolation methods such as Inverse Distance Weighting (IDW) [58, 24] and Universal Kriging (UK) are commonly employed to generate predictions for data-scarce regions. We evaluate GeoHG against these methods on population prediction in an enlarged area of Shenzhen, with only 5% visible data (863 points), tasked with inferring the remaining 16,401 regions (16,401 $km^{2}$), as illustrated in Figure 7. GeoHG accurately captures the true distribution, substantially outperforming IDW and UK, both of which deviate significantly. This highlights that traditional methods, which rely solely on spatial relationships for interpolation, fail to model the intricate socio-environmental dependencies critical for characterizing population distributions. In contrast, GeoHG’s effective geospatial representations and modeling of high-order relationships enable accurate data pattern learning from limited samples.
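To show why a purely spatial interpolator cannot use socio-environmental context, a basic IDW predictor is sketched below. It is a textbook version with an assumed distance exponent of 2 and Euclidean distance on grid coordinates, and it is not necessarily the configuration of the IDW baseline in this comparison. The function name and argument layout are illustrative.

```python
import math
from typing import List, Tuple

Sample = Tuple[float, float, float]  # (x, y, observed value)


def idw_predict(
    known: List[Sample], query: Tuple[float, float], power: float = 2.0
) -> float:
    """Inverse Distance Weighting: a distance-weighted mean of known values."""
    numerator = 0.0
    denominator = 0.0
    for x, y, value in known:
        dist = math.hypot(query[0] - x, query[1] - y)
        if dist == 0.0:
            return value  # query coincides with an observed point
        weight = 1.0 / dist ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator
```

The prediction depends only on coordinates and observed values, so two query cells at the same distances from the observed points receive the same estimate even if one is dense commercial land and the other is water.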

5 Conclusion

In this paper, we propose GeoHG, a novel heterogeneous graph structure coupled with an efficient learning framework, specifically designed to generate informative geospatial embeddings for global regions. Our approach effectively captures comprehensive intra-region features from environmental and societal perspectives, as well as higher-order inter-region correlation through a heterogeneous graph formulation, and offers a seamless integration of these components within a model-agnostic graph structure. Extensive experiments across multiple datasets demonstrate GeoHG’s superior performance compared to existing methods. Notably, our method exhibits competitive performance even when the training data is significantly reduced. Due to the page limit, we provide more discussion in Appendix H, including the limitations, future directions and the social impact of our research.

References

  • Baltrušaitis et al. [2018] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence, 41(2):423–443, 2018.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
  • Du et al. [2019] Jiadi Du, Yunchao Zhang, Pengyang Wang, Jennifer Leopold, and Yanjie Fu. Beyond geo-first law: Learning spatial representations via integrated autocorrelations and complementarity. In 2019 IEEE International Conference on Data Mining (ICDM), pages 160–169. IEEE, 2019.
  • Feng et al. [2019] Yifan Feng, Haoxuan You, Zizhao Zhang, Rongrong Ji, and Yue Gao. Hypergraph neural networks. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 3558–3565, 2019.
  • Fu et al. [2019] Yanjie Fu, Pengyang Wang, Jiadi Du, Le Wu, and Xiaolin Li. Efficient region embedding with multi-view spatial networks: A perspective of locality-constrained spatial autocorrelations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 906–913, 2019.
  • Gao et al. [2020] Jing Gao, Peng Li, Zhikui Chen, and Jianing Zhang. A survey on deep learning for multimodal data fusion. Neural Computation, 32(5):829–864, 2020.
  • Geng et al. [2019] Xu Geng, Yaguang Li, Leye Wang, Lingyu Zhang, Qiang Yang, Jieping Ye, and Yan Liu. Spatiotemporal multi-graph convolution network for ride-hailing demand forecasting. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 3656–3663, 2019.
  • Guo et al. [2019] Wenzhong Guo, Jianwen Wang, and Shiping Wang. Deep multimodal representation learning: A survey. IEEE Access, 7:63373–63394, 2019.
  • Hao et al. [2024] Xixuan Hao, Wei Chen, Yibo Yan, Siru Zhong, Kun Wang, Qingsong Wen, and Yuxuan Liang. Urbanvlp: A multi-granularity vision-language pre-trained foundation model for urban indicator prediction. arXiv preprint arXiv:2403.16831, 2024.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Hu et al. [2020] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118–22133, 2020.
  • Huang et al. [2019] Chao Huang, Chuxu Zhang, Jiashu Zhao, Xian Wu, Dawei Yin, and Nitesh Chawla. Mist: A multiview and multimodal spatial-temporal learning framework for citywide abnormal event forecasting. In The world wide web conference, pages 717–728, 2019.
  • Huang et al. [2023] Yingjing Huang, Fan Zhang, Yong Gao, Wei Tu, Fabio Duarte, Carlo Ratti, Diansheng Guo, and Yu Liu. Comprehensive urban space representation with varying numbers of street-level images. Computers, Environment and Urban Systems, 106:102043, 2023.
  • Jean et al. [2016] Neal Jean, Marshall Burke, Michael Xie, W Matthew Davis, David B Lobell, and Stefano Ermon. Combining satellite imagery and machine learning to predict poverty. Science, 353(6301):790–794, 2016.
  • Jenkins et al. [2019] Porter Jenkins, Ahmad Farag, Suhang Wang, and Zhenhui Li. Unsupervised representation learning of spatial data via multimodal embedding. In Proceedings of the 28th ACM international conference on information and knowledge management, pages 1993–2002, 2019.
  • Jiang and Luo [2022] Weiwei Jiang and Jiayun Luo. Graph neural network for traffic forecasting: A survey. Expert Systems with Applications, 207:117921, 2022.
  • Jin et al. [2023] Guangyin Jin, Yuxuan Liang, Yuchen Fang, Zezhi Shao, Jincai Huang, Junbo Zhang, and Yu Zheng. Spatio-temporal graph neural networks for predictive learning in urban computing: A survey. IEEE Transactions on Knowledge and Data Engineering, 2023.
  • Kramer [1991] Mark A Kramer. Nonlinear principal component analysis using autoassociative neural networks. AIChE journal, 37(2):233–243, 1991.
  • Lacoste et al. [2024] Alexandre Lacoste, Nils Lehmann, Pau Rodriguez, Evan Sherwin, Hannah Kerner, Björn Lütjens, Jeremy Irvin, David Dao, Hamed Alemohammad, Alexandre Drouin, et al. Geo-bench: Toward foundation models for earth monitoring. Advances in Neural Information Processing Systems, 36, 2024.
  • Li et al. [2022] Tong Li, Shiduo Xin, Yanxin Xi, Sasu Tarkoma, Pan Hui, and Yong Li. Predicting multi-level socioeconomic indicators from structural urban imagery. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 3282–3291, 2022.
  • Liang et al. [2022] Yuxuan Liang, Kun Ouyang, Yiwei Wang, Zheyi Pan, Yifang Yin, Hongyang Chen, Junbo Zhang, Yu Zheng, David S Rosenblum, and Roger Zimmermann. Mixed-order relation-aware recurrent neural networks for spatio-temporal forecasting. IEEE Transactions on Knowledge and Data Engineering, 2022.
  • Liu et al. [2023a] Hao Liu, Qingyu Guo, Hengshu Zhu, Yanjie Fu, Fuzhen Zhuang, Xiaojuan Ma, and Hui Xiong. Characterizing and forecasting urban vibrancy evolution: A multi-view graph mining perspective. ACM Transactions on Knowledge Discovery from Data, 17(5):68:1–68:24, February 2023a. ISSN 1556-4681. doi: 10.1145/3568683.
  • Liu et al. [2023b] Yu Liu, Jingtao Ding, Yanjie Fu, and Yong Li. Urbankg: An urban knowledge graph system. ACM Transactions on Intelligent Systems and Technology, 14(4):1–25, 2023b.
  • Lu and Wong [2008] George Y Lu and David W Wong. An adaptive inverse-distance weighting spatial interpolation technique. Computers & geosciences, 34(9):1044–1055, 2008.
  • Lütjens et al. [2019] Björn Lütjens, Lucas Liebenwein, and Katharina Kramer. Machine learning-based estimation of forest carbon stocks to increase transparency of forest preservation efforts. 2019 NeurIPS Workshop on Tackling Climate Change with AI (CCAI), 2019.
  • Ma et al. [2019] Yao Ma, Suhang Wang, Charu C Aggarwal, Dawei Yin, and Jiliang Tang. Multi-dimensional graph convolutional networks. In Proceedings of the 2019 SIAM international conference on data mining, pages 657–665. SIAM, 2019.
  • Ning et al. [2024] Yansong Ning, Hao Liu, Hao Wang, Zhenyu Zeng, and Hui Xiong. Uukg: unified urban knowledge graph dataset for urban spatiotemporal prediction. Advances in Neural Information Processing Systems, 36, 2024.
  • Oord et al. [2018] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Ratcliffe [2005] Jerry H Ratcliffe. Detecting spatial movement of intra-region crime patterns over time. Journal of Quantitative Criminology, 21:103–123, 2005.
  • Robin and Acuto [2018] Enora Robin and Michele Acuto. Global urban policy and the geopolitics of urban data. Political Geography, 66:76–87, 2018.
  • Venter et al. [2022] Zander S Venter, David N Barton, Tirthankar Chakraborty, Trond Simensen, and Geethen Singh. Global 10 m land use land cover datasets: A comparison of dynamic world, world cover and esri land cover. Remote Sensing, 14(16):4101, 2022.
  • Vivanco Cepeda et al. [2024] Vicente Vivanco Cepeda, Gaurav Kumar Nayak, and Mubarak Shah. Geoclip: Clip-inspired alignment between locations and images for effective worldwide geo-localization. Advances in Neural Information Processing Systems, 36, 2024.
  • Wang et al. [2022] Xiao Wang, Deyu Bo, Chuan Shi, Shaohua Fan, Yanfang Ye, and Philip S Yu. A survey on heterogeneous graph embedding: methods, techniques, applications and sources. IEEE Transactions on Big Data, 9(2):415–436, 2022.
  • Wang et al. [2020] Zhecheng Wang, Haoyuan Li, and Ram Rajagopal. Urban2vec: Incorporating street view imagery and pois for multi-modal urban neighborhood embedding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 1013–1020, 2020.
  • Wei and Li [2024] Jing Wei and Zhanqing Li. ChinaHighPM2.5: High-resolution and high-quality ground-level PM2.5 dataset for China (2000-2022), 2024. URL https://dx.doi.org/10.5281/zenodo.3539349.
  • Wu et al. [2020] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems, 32(1):4–24, 2020.
  • Xi et al. [2022] Yanxin Xi, Tong Li, Huandong Wang, Yong Li, Sasu Tarkoma, and Pan Hui. Beyond the first law of geography: Learning representations of satellite imagery by leveraging point-of-interests. In Proceedings of the ACM Web Conference 2022, pages 3308–3316, 2022.
  • Xia et al. [2022] Lianghao Xia, Chao Huang, and Chuxu Zhang. Self-supervised hypergraph transformer for recommender systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2100–2109, 2022.
  • Xie et al. [2016] Michael Xie, Neal Jean, Marshall Burke, David Lobell, and Stefano Ermon. Transfer learning from deep features for remote sensing and poverty mapping. In Proceedings of the AAAI conference on artificial intelligence, volume 30, 2016.
  • Xie et al. [2020] Peng Xie, Tianrui Li, Jia Liu, Shengdong Du, Xin Yang, and Junbo Zhang. Urban flow prediction from spatiotemporal data using machine learning: A survey. Information Fusion, 59:1–12, 2020.
  • Yan et al. [2024] Yibo Yan, Haomin Wen, Siru Zhong, Wei Chen, Haodong Chen, Qingsong Wen, Roger Zimmermann, and Yuxuan Liang. Urbanclip: Learning text-enhanced urban region profiling with contrastive language-image pretraining from the web. In Proceedings of the ACM on Web Conference 2024, pages 4006–4017, 2024.
  • Yang et al. [2021] Xin Yang, Qiuchi Xue, Xingxing Yang, Haodong Yin, Yunchao Qu, Xiang Li, and Jianjun Wu. A novel prediction model for the inbound passenger flow of urban rail transit. Information Sciences, 566:347–363, 2021.
  • Yang et al. [2022] Yuhao Yang, Chao Huang, Lianghao Xia, Yuxuan Liang, Yanwei Yu, and Chenliang Li. Multi-behavior hypergraph-enhanced transformer for sequential recommendation. In Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, pages 2263–2274, 2022.
  • Yeh et al. [2020] Christopher Yeh, Anthony Perez, Anne Driscoll, George Azzari, Zhongyi Tang, David Lobell, Stefano Ermon, and Marshall Burke. Using publicly available satellite imagery and deep learning to understand economic well-being in africa. Nature communications, 11(1):2583, 2020.
  • Ying et al. [2019] Zhitao Ying, Dylan Bourgeois, Jiaxuan You, Marinka Zitnik, and Jure Leskovec. Gnnexplainer: Generating explanations for graph neural networks. Advances in neural information processing systems, 32, 2019.
  • Yu et al. [2024] Sungduk Yu, Walter Hannah, Liran Peng, Jerry Lin, Mohamed Aziz Bhouri, Ritwik Gupta, Björn Lütjens, Justus C Will, Gunnar Behrens, Julius Busecke, et al. Climsim: A large multi-scale dataset for hybrid physics-ml climate emulation. Advances in Neural Information Processing Systems, 36, 2024.
  • Yuan et al. [2012] Jing Yuan, Yu Zheng, and Xing Xie. Discovering regions of different functions in a city using human mobility and pois. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 186–194, 2012.
  • Zanaga et al. [2021] Daniele Zanaga, Ruben Van De Kerchove, Wanda De Keersmaecker, Niels Souverijns, Carsten Brockmann, Ralf Quast, Jan Wevers, Alex Grosu, Audrey Paccini, Sylvain Vergnaud, Oliver Cartus, Maurizio Santoro, Steffen Fritz, Ivelina Georgieva, Myroslava Lesiv, Sarah Carter, Martin Herold, Linlin Li, Nandin-Erdene Tsendbazar, Fabrizio Ramoino, and Olivier Arino. Esa worldcover 10 m 2020 v100, October 2021. URL https://doi.org/10.5281/zenodo.5571936.
  • Zanaga et al. [2022] Daniele Zanaga, Ruben Van De Kerchove, Dirk Daems, Wanda De Keersmaecker, Carsten Brockmann, Grit Kirches, Jan Wevers, Oliver Cartus, Maurizio Santoro, Steffen Fritz, et al. Esa worldcover 10 m 2021 v200. 2022.
  • Zhang et al. [2019a] Chuxu Zhang, Dongjin Song, Chao Huang, Ananthram Swami, and Nitesh V Chawla. Heterogeneous graph neural network. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 793–803, 2019a.
  • Zhang et al. [2021] Mingyang Zhang, Tong Li, Yong Li, and Pan Hui. Multi-view joint graph representation learning for urban region embedding. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pages 4431–4437, 2021.
  • Zhang et al. [2019b] Yunchao Zhang, Yanjie Fu, Pengyang Wang, Xiaolin Li, and Yu Zheng. Unifying inter-region autocorrelation and intra-region structures for spatial embedding via collective adversarial learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1700–1708, 2019b.
  • Zhao et al. [2017] Naizhuo Zhao, Ying Liu, Guofeng Cao, Eric L Samson, and Jingqi Zhang. Forecasting china’s gdp at the pixel level using nighttime lights time series and population images. GIScience & Remote Sensing, 54(3):407–425, 2017.
  • Zheng [2015] Yu Zheng. Methodologies for cross-domain data fusion: An overview. IEEE transactions on big data, 1(1):16–34, 2015.
  • Zhong et al. [2022] X Zhong, Q Yan, and G Li. Development of time series of nighttime light dataset of china (2000–2020). Journal of Global Change Data & Discovery, 3:416–424, 2022.
  • Zhou et al. [2023] Silin Zhou, Dan He, Lisi Chen, Shuo Shang, and Peng Han. Heterogeneous region embedding with prompt learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 4981–4989, 2023.
  • Zimmerman et al. [1999] Dale Zimmerman, Claire Pavlik, Amy Ruggles, and Marc P Armstrong. An experimental comparison of ordinary and universal kriging and inverse distance weighting. Mathematical Geology, 31:375–390, 1999.
  • Zou et al. [2024] Xingchen Zou, Yibo Yan, Xixuan Hao, Yuehong Hu, Haomin Wen, Erdong Liu, Junbo Zhang, Yong Li, Tianrui Li, Yu Zheng, et al. Deep learning for cross-domain data fusion in urban computing: Taxonomy, advances, and outlook. arXiv preprint arXiv:2402.19348, 2024.

Supplementary for: “GeoHG: Learning Geospatial Region Embedding with Heterogeneous Graph”

 

We organize our supplementary document as follows:

  A. More Introduction of ESA WorldCover Dataset

  B. Motivation for Using Entity Segmentation for Satellite Imagery Encoding

    B.1 Introduction of semantic segmentation-based approach

    B.2 Validation experiment for segmentation-based approach

  C. More Details of GeoHG

    C.1 Basic graph theory and our motivation

    C.2 Details of GeoHG structure

  D. Dataset and Experiment Settings

  E. Details of Baselines

  F. More Details about Experiments Results

    F.1 RMSE metrics of experiment results

    F.2 Mean and standard deviation of metrics for GeoHG

    F.3 More ablation study results

    F.4 Details of data efficiency experiment

    F.5 Details of investigation of GeoHG with GNNExplainer

  G. Qualitative Demonstration

  H. More Discussion

 

Appendix A More Introduction of ESA WorldCover Dataset

The ESA WorldCover dataset (https://esa-worldcover.org/en) represents a major advance in land use/land cover mapping, offering freely accessible, high-resolution (10 m) global coverage based on satellite imagery. Inspired by the 2017 WorldCover conference (https://worldcover2017.esa.int/), the European Space Agency (ESA) launched the WorldCover project, whose primary accomplishment was the release in October 2021 of a freely accessible global WorldCover dataset at 10 m resolution for the year 2020 [49, 50]. This dataset leverages satellite imagery from both Sentinel-1 and Sentinel-2 and encompasses 11 distinct geo-entity categories, shown in Figure 8. It has also undergone rigorous independent validation by Wageningen University (for statistical accuracy) and the International Institute for Applied Systems Analysis (IIASA) (for spatial accuracy), attaining a global overall accuracy of approximately 75% [32].

Figure 8: Visualization of ESA WorldCover Dataset.

The WorldCover dataset is continuously updated and revised by ESA with a highly efficient processing pipeline and scalable infrastructure. The core model is trained on a total of 2,160,210 Sentinel-2 images and is able to process the whole world in less than 5 days [49, 50]. The revised version we use in this paper was released on 28 October 2022 and raised the global overall accuracy to 76.7% [50]. This version is free of charge to the entire community and widely accepted by the United Nations Convention to Combat Desertification (UNCCD), the World Resources Institute (WRI), the Centre for International Forestry Research (CIFOR), the Food and Agriculture Organization (FAO) and the Organisation for Economic Co-operation and Development (OECD). We visualize 3 examples of the WorldCover data utilized in this paper in Figure 9.

Figure 9: Examples of Utilized Data in Our Paper.

Appendix B Motivation for Using Entity Segmentation for Satellite Imagery Encoding

B.1 Introduction of semantic segmentation-based approach

It is noticeable that conventional vision encoders are devised and trained on natural images, which significantly differ from satellite imagery. It is a formidable task to comprehend the intricate and specialized geo-semantic content present within satellite imagery.

Fortunately, satellite images exhibit a high degree of structural organization in terms of semantic information, unlike natural images that contain a myriad of diverse information. To effectively interpret a satellite image, one only needs to concentrate on the geological entities it encompasses, the spatial extent of these entities, and their respective positions within the geospace. Consequently, it is highly promising to employ an entity segmentation approach for encoding satellite imagery, as it can directly provide us with essential information regarding the entities and their spatial coverage.

In our proposed method, GeoHG, we employ an entity segmentation-based framework as the backbone for encoding satellite imagery, as opposed to the presently prevalent vision encoders, such as CNN and Transformer. To implement this approach, we utilize the ESA WorldCover Dataset, as discussed in Appendix A, due to its extensive validation by geographical researchers and demonstrated robust global generalization capabilities. As depicted in Figure 10, we directly retrieve the geo-entity segmentation results from the WorldCover Dataset for the input satellite imagery. Subsequently, we determine the proportion of geo-entities within the relevant region and construct the environmental feature embedding $E_{env}$.

Figure 10: Geo-entity Segmentation for Satellite Images based on WorldCover Dataset.
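The proportion computation described above can be sketched as follows. This is a minimal illustration rather than the released implementation: the function name `geo_entity_proportions` is hypothetical, and the sketch assumes that the segmentation map of a region has already been remapped to contiguous class indices 0..K-1 (the raw WorldCover class codes would need such a remapping first).

```python
import numpy as np

def geo_entity_proportions(seg_map, num_classes):
    """Return the fraction of pixels that each geo-entity class covers.

    seg_map: 2-D integer array of class indices in [0, num_classes).
    (Assumption: class codes were remapped to contiguous indices.)
    """
    counts = np.bincount(seg_map.ravel(), minlength=num_classes)
    return counts / counts.sum()

# Toy 4x4 "region" with three classes (hypothetical data).
seg_map = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 1, 1],
    [2, 2, 1, 1],
])
e_env = geo_entity_proportions(seg_map, num_classes=3)
print(e_env)  # class shares 0.25, 0.5 and 0.25
```

Because the vector only requires counting pixels of a precomputed segmentation, no image encoder has to be run during training, which is consistent with the reduction in training time reported in this appendix.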

An additional advantage of our proposed approach lies in its ability to explicitly construct high-order connections among regions, geo-entities, and other regions based on the semantic content of satellite imagery. This is not feasible with traditional encoders. Moreover, since all segmentation results are readily available, this method significantly reduces computational overhead. We have observed that the segmentation-based approach dramatically decreases our training time. For example, the training time for one epoch on the Population dataset in Beijing was reduced from $\approx 25$ minutes to $\approx 0.2$ seconds.

B.2 Validation experiment for segmentation-based approach

To validate the effectiveness of our segmentation-based approach, we devise a validation experiment that predicts five indicators for four cities using the segmentation-based geo-entity proportion $E_{env}$ as the only input (named GeoSegment for convenience in Table 2). We compare it against the state-of-the-art LLM-enhanced satellite imagery encoder UrbanCLIP [42], the POI information-enhanced imagery encoder PG-SimCLR [38], and two traditional image encoders, ResNet-18 and AutoEncoder. These baselines are introduced in detail in Appendix E.

Table 2: Validation Experiment Results. The bold/underlined font means the best/the second-best result.
Dataset: Beijing

| Model | Carbon $R^2$ | Carbon MAE | Carbon RMSE | Population $R^2$ | Population MAE | Population RMSE | GDP $R^2$ | GDP MAE | GDP RMSE | Night Light $R^2$ | Night Light MAE | Night Light RMSE | $PM_{2.5}$ $R^2$ | $PM_{2.5}$ MAE | $PM_{2.5}$ RMSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Autoencoder | 0.298 | 0.565 | 0.844 | 0.168 | 0.667 | 0.918 | 0.171 | 0.822 | 1.251 | 0.276 | 0.643 | 0.784 | 0.209 | 0.659 | 0.840 |
| ResNet-18 | 0.394 | 0.577 | 0.805 | 0.266 | 0.623 | 0.857 | 0.285 | 0.699 | 1.024 | 0.348 | 0.576 | 0.734 | 0.341 | 0.589 | 0.821 |
| PG-SimCLR | 0.442 | 0.631 | 0.754 | 0.471 | 0.964 | 1.117 | 0.277 | 0.768 | 1.254 | 0.369 | 0.404 | 0.728 | 0.398 | 0.624 | 0.845 |
| UrbanCLIP | 0.664 | $\underline{0.528}$ | 0.598 | $\underline{0.461}$ | $\underline{0.552}$ | $\underline{0.669}$ | $\underline{0.355}$ | $\underline{0.539}$ | $\underline{0.864}$ | $\underline{0.420}$ | 0.457 | $\underline{0.700}$ | 0.533 | $\underline{0.556}$ | 0.699 |
| GeoSegment | $\underline{0.504}$ | 0.513 | $\underline{0.745}$ | 0.562 | 0.514 | 0.651 | 0.531 | 0.445 | 0.675 | 0.606 | $\underline{0.470}$ | 0.629 | $\underline{0.468}$ | 0.535 | $\underline{0.761}$ |

Dataset: Shanghai

| Model | Carbon $R^2$ | Carbon MAE | Carbon RMSE | Population $R^2$ | Population MAE | Population RMSE | GDP $R^2$ | GDP MAE | GDP RMSE | Night Light $R^2$ | Night Light MAE | Night Light RMSE | $PM_{2.5}$ $R^2$ | $PM_{2.5}$ MAE | $PM_{2.5}$ RMSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Autoencoder | 0.230 | 0.532 | 0.771 | 0.166 | 0.742 | 0.898 | 0.197 | 0.844 | 1.511 | 0.164 | 0.617 | 0.729 | 0.243 | 0.691 | 0.942 |
| ResNet-18 | 0.326 | 0.465 | 0.763 | 0.279 | 0.627 | 0.797 | 0.289 | 0.711 | 1.137 | 0.244 | 0.571 | 0.756 | 0.292 | 0.642 | 0.954 |
| PG-SimCLR | 0.298 | 0.712 | 0.914 | 0.315 | 0.731 | 0.959 | 0.294 | 0.767 | 1.052 | 0.308 | 0.566 | 0.768 | 0.303 | 0.617 | 0.895 |
| UrbanCLIP | $\underline{0.671}$ | $\underline{0.426}$ | $\underline{0.569}$ | $\underline{0.461}$ | $\underline{0.552}$ | $\underline{0.748}$ | $\underline{0.326}$ | $\underline{0.587}$ | $\underline{0.807}$ | $\underline{0.387}$ | $\underline{0.511}$ | $\underline{0.709}$ | $\underline{0.444}$ | $\underline{0.518}$ | $\underline{0.774}$ |
| GeoSegment | 0.791 | 0.242 | 0.452 | 0.812 | 0.264 | 0.396 | 0.578 | 0.473 | 0.650 | 0.684 | 0.415 | 0.559 | 0.656 | 0.356 | 0.582 |

Dataset: Guangzhou

| Model | Carbon $R^2$ | Carbon MAE | Carbon RMSE | Population $R^2$ | Population MAE | Population RMSE | GDP $R^2$ | GDP MAE | GDP RMSE | Night Light $R^2$ | Night Light MAE | Night Light RMSE | $PM_{2.5}$ $R^2$ | $PM_{2.5}$ MAE | $PM_{2.5}$ RMSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Autoencoder | 0.254 | 0.570 | 0.733 | 0.195 | 0.753 | 0.928 | 0.176 | 0.811 | 1.463 | 0.176 | 0.602 | 0.798 | 0.196 | 0.780 | 0.903 |
| ResNet-18 | 0.375 | 0.515 | 0.673 | 0.274 | 0.671 | 0.831 | 0.251 | 0.725 | 1.048 | 0.242 | 0.551 | 0.704 | 0.231 | 0.694 | 0.859 |
| PG-SimCLR | 0.422 | 0.708 | 0.708 | 0.303 | 0.954 | 0.972 | 0.282 | 0.897 | 1.264 | 0.435 | 0.433 | $\underline{0.627}$ | 0.315 | 0.542 | 0.741 |
| UrbanCLIP | 0.585 | 0.444 | 0.603 | $\underline{0.533}$ | $\underline{0.567}$ | $\underline{0.687}$ | $\underline{0.440}$ | $\underline{0.546}$ | $\underline{0.762}$ | $\underline{0.483}$ | 0.478 | 0.633 | 0.560 | $\underline{0.514}$ | 0.694 |
| GeoSegment | $\underline{0.490}$ | $\underline{0.519}$ | $\underline{0.657}$ | 0.685 | 0.409 | 0.563 | 0.536 | 0.488 | 0.681 | 0.618 | $\underline{0.453}$ | 0.620 | $\underline{0.493}$ | 0.435 | $\underline{0.711}$ |

Dataset: Shenzhen

| Model | Carbon $R^2$ | Carbon MAE | Carbon RMSE | Population $R^2$ | Population MAE | Population RMSE | GDP $R^2$ | GDP MAE | GDP RMSE | Night Light $R^2$ | Night Light MAE | Night Light RMSE | $PM_{2.5}$ $R^2$ | $PM_{2.5}$ MAE | $PM_{2.5}$ RMSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Autoencoder | 0.189 | 0.634 | 0.887 | 0.175 | 0.772 | 0.921 | 0.119 | 0.884 | 1.733 | 0.166 | 0.608 | 0.758 | 0.149 | 0.645 | 0.891 |
| ResNet-18 | 0.241 | 0.577 | 0.726 | 0.299 | 0.654 | 0.855 | 0.234 | 0.817 | 1.125 | 0.243 | 0.543 | 0.719 | 0.273 | 0.598 | 0.826 |
| PG-SimCLR | 0.257 | 0.683 | 0.816 | 0.311 | 0.758 | 0.892 | 0.307 | 0.895 | 1.003 | 0.454 | $\underline{0.488}$ | $\underline{0.682}$ | 0.597 | 0.451 | 0.491 |
| UrbanCLIP | 0.562 | 0.483 | 0.571 | $\underline{0.527}$ | $\underline{0.592}$ | $\underline{0.610}$ | $\underline{0.508}$ | $\underline{0.464}$ | $\underline{0.693}$ | $\underline{0.387}$ | 0.511 | 0.709 | $\underline{0.460}$ | $\underline{0.586}$ | $\underline{0.762}$ |
| GeoSegment | $\underline{0.542}$ | $\underline{0.518}$ | $\underline{0.698}$ | 0.602 | 0.438 | 0.602 | 0.658 | 0.411 | 0.587 | 0.738 | 0.386 | 0.512 | 0.471 | 0.480 | 0.752 |

From Table 2, our semantic segmentation-based approach outperforms the baseline models on most tasks, including the state-of-the-art LLM-enhanced satellite imagery encoder UrbanCLIP [42] and the POI information-enhanced encoder PG-SimCLR [38]. The main exceptions are Carbon in Beijing, Guangzhou, and Shenzhen and $PM_{2.5}$ in Beijing and Guangzhou, where UrbanCLIP attains a higher $R^2$, and $PM_{2.5}$ in Shenzhen, where PG-SimCLR does. Notably, GeoSegment achieves these results without relying on any additional information, such as LLM descriptions or POI data, which supports the effectiveness of our method for satellite imagery encoding.

Appendix C More Details of GeoHG

C.1 Basic graph theory and our motivation

Graph theory. Graphs provide a natural abstraction to represent structured relationships between entities (nodes) and their attributes (features). Within this framework, knowledge and information are organized via connectivity patterns encoded by edges. A key advantage of graph-structured representations is that nodes can reason about their representations not just from their attributes, but also by recursively aggregating information from their neighbors. This allows graphs to effectively capture complex relational dependencies and contextual patterns.

Graph Neural Networks (GNNs) have emerged as a powerful paradigm for learning representations on graph-structured data by leveraging the message-passing mechanism. The core idea of this mechanism involves each node recursively aggregating representation vectors from its local neighborhood, allowing it to accumulate information from an expanding neighborhood scope across iterations [37]. Formally, we can represent this process through the message-passing equation:

$$
\begin{aligned}
m_u &= \operatorname{Aggregate}\left(f_v, v \in \mathcal{N}_u\right), \\
f_u^{\prime} &= \operatorname{Update}\left(m_u, f_u\right),
\end{aligned}
\tag{11}
$$

Here, $f_u$ represents the original representation of node $u$, $f_u^{\prime}$ denotes its new representation after one iteration, and $\mathcal{N}_u$ is its set of neighboring nodes. Through this iterative process, GNNs can effectively capture dependencies spanning the entire graph topology, enabling them to learn highly expressive representations that fuse localized features with broader structural context.
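One iteration of Equation (11) can be sketched as below. The choices here are illustrative assumptions rather than the operators used in GeoHG: Aggregate is taken to be a mean over neighbors, Update is a tanh of two linear maps, and the names `message_passing_step`, `weight_self` and `weight_msg` are hypothetical.

```python
import numpy as np

def message_passing_step(features, neighbors, weight_self, weight_msg):
    """Apply one Aggregate/Update round to every node.

    features:  (num_nodes, d_in) array holding f_u for each node u.
    neighbors: dict mapping node index u to a list of neighbor indices N_u.
    """
    d_in = features.shape[1]
    new_features = np.zeros((features.shape[0], weight_self.shape[1]))
    for u, nbrs in neighbors.items():
        # Aggregate: mean of neighbor representations (assumed operator).
        m_u = features[nbrs].mean(axis=0) if nbrs else np.zeros(d_in)
        # Update: combine the message m_u with the node's own state f_u.
        new_features[u] = np.tanh(features[u] @ weight_self + m_u @ weight_msg)
    return new_features

# Toy graph: node 0 is connected to nodes 1 and 2.
features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
neighbors = {0: [1, 2], 1: [0], 2: [0]}
identity = np.eye(2)
updated = message_passing_step(features, neighbors, identity, identity)
```

Stacking k such steps lets each node see information from nodes up to k hops away, which is the expanding neighborhood scope mentioned above.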

Hypergraph theory. Traditional graph models can only connect pairs of vertices within edges, thereby facing limitations in representing higher-order multi-way relationships beyond simple pairwise associations. In real-world scenarios, relationships between data typically go beyond simple pairwise links and involve complex multi-element patterns. For example, in a transportation network, a route often spans multiple cities or locations rather than just directly connecting two endpoints, which is difficult to capture using traditional graphs. To overcome this, hypergraph theory generalizes the graph formulation by introducing hyperedges that can connect any number of vertices, naturally facilitating the representation of higher-order multi-way relationships, defined as:

$$
G = (V, E) \quad \text{where} \quad V = \{v_1, v_2, \ldots, v_N\}, \; E = \{\mathcal{H}_1, \mathcal{H}_2, \ldots, \mathcal{H}_M\} \tag{12}
$$

where $G$ represents a hypergraph and $V$ is the set of nodes. $E = \{\mathcal{H}_1, \mathcal{H}_2, \ldots, \mathcal{H}_M\}$ is the set of hyperedges representing connectivity among nodes, and each hyperedge $\mathcal{H}_m$ is a subset of $V$. Revisiting the transportation route example, we can model each city/location as a node and the route passing through multiple locations as a hyperedge connecting all the corresponding nodes, precisely encoding the dependencies between the route and the groups of locations it traverses.
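The transportation example can be made concrete with an incidence matrix, a common way to store a hypergraph. The sketch below is illustrative only: the node labels, route names and the helper `incidence_matrix` are hypothetical and not part of GeoHG.

```python
import numpy as np

# Hypothetical locations (nodes) and routes (hyperedges).
nodes = ["A", "B", "C", "D"]
hyperedges = {
    "route_1": ["A", "B", "C"],  # one route spanning three locations
    "route_2": ["C", "D"],
}

def incidence_matrix(nodes, hyperedges):
    """Return H with H[i, j] = 1 if node i belongs to hyperedge j."""
    index = {v: i for i, v in enumerate(nodes)}
    H = np.zeros((len(nodes), len(hyperedges)), dtype=int)
    for j, members in enumerate(hyperedges.values()):
        for v in members:
            H[index[v], j] = 1
    return H

H = incidence_matrix(nodes, hyperedges)
node_degree = H.sum(axis=1)  # number of hyperedges each node belongs to
edge_size = H.sum(axis=0)    # number of nodes each hyperedge connects
```

An ordinary graph is the special case in which every column of `H` sums to exactly 2; `route_1` has size 3 and therefore cannot be expressed as a single pairwise edge.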

Motivations for modeling geospatial representations with hypergraph. In the context of geospatial representation modeling, the relationships between geospatial areas go far beyond simple spatial proximity, instead arising from the intricate interplay of various environmental and societal factors. Climatic conditions, topography, population distribution, economic factors, and more exhibit intricate high-order influence patterns across different regions. Traditional methods fail to explicitly characterize this inherent high-order relational structure [7, 59]. Although multi-view graphs for geospace can incorporate additional connections between regions by introducing extra edges between graph nodes, this approach comes with considerable complexity due to its inefficient representation. Furthermore, it is questionable whether distant regions influence each other as directly as their connections in a multi-view graph would suggest. For instance, a region might first degrade the water bodies shared across the geospace, which in turn affects a remote region with abundant water resources, rather than influencing that region through a direct connection.

To effectively model such higher-order geospatial relations, we treat each geospatial region as a node and abstract the higher-order environmental/societal associations influencing multiple regions as hyperedges connecting all the relevant nodes, precisely capturing these intricate dependency patterns. Building upon this idea, we further propose GeoHG to encode high-order relationships and intricate characteristics between regions.

C.2 Details of GeoHG structure

We introduce a novel heterogeneous hypergraph representation, termed GeoHG, to encode the high-order relational structures within geospatial data. GeoHG comprises three types of nodes: regional nodes, environmental entity nodes, and societal entity nodes, along with the relations between them. Specifically, regional nodes correspond to $1\,\mathrm{km} \times 1\,\mathrm{km}$ grid cells from the geographic space; environmental entity nodes represent the 9 distinct geo-entity classes detected from satellite imagery; societal entity nodes constitute the 14 different point-of-interest (POI) categories within the regions.

We construct hyperedges connecting multiple regional nodes through the environmental and societal entity nodes, enabling the explicit modeling of higher-order inter-region relationships induced by environmental or societal factors. The semantics of the node types and their relationships within GeoHG are illustrated in Tables 3 and 4, respectively.

By introducing hyperedges that can associate arbitrary subsets of regional nodes, GeoHG effectively captures the complex high-order dependencies spanning environmental conditions, geographic entities, social dynamics, and their intricate interplay across different spatial regions. This enriched relational structure encodes valuable contextual signals that are integrated into the region representations learned by GeoHG, ultimately benefiting a wide range of downstream geospatial analytics tasks.

Table 3: Major entities in GeoHG

| Entity | Num | Examples |
|---|---|---|
| Region | N | $1\,\mathrm{km} \times 1\,\mathrm{km}$ grid cells from the geospace. The number of region entities depends on the number of grids within a given geographical range. |
| Environment | 9 | Tree cover, Shrubland, Grassland, Cropland, Built-up, Bare/sparse vegetation, Permanent water bodies, Herbaceous wetland, Moss and lichen |
| Society | 14 | Food and Beverage, Transportation facilities, Shopping spend, Science, education and culture, Companies, Recreation, Financial Institutions, Tourist Attractions, Life services, Car related, Sports and fitness, Hotel accommodation, Healthcare, Commercial residence |

Table 4: Major relations in GeoHG

| Relation | Weight | Head & Tail Entity | Abbrev. |
|---|---|---|---|
| Region Nearby Region | - | (Region, Region) | RNR |
| Environmental Entity Locates at Region | the area occupied by entity | (Environmental Entity, Region) | ELR |
| Societal Entity Locates at Region | the transformed proportion of POI category | (Societal Entity, Region) | SLR |
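The three relation types can be assembled into weighted edge lists as sketched below. This is a simplified illustration under stated assumptions, not the authors' construction: regions are indexed row by row on a rectangular grid, "nearby" is taken to mean 4-neighborhood adjacency, the raw POI share stands in for the unspecified transformed proportion, and the function name `build_hetero_edges` is hypothetical.

```python
import numpy as np

def build_hetero_edges(grid_h, grid_w, env_area, poi_share):
    """Build (head, tail, weight) edge lists for RNR, ELR and SLR.

    env_area:  (num_regions, num_env) area of each environmental entity per region.
    poi_share: (num_regions, num_soc) share of each POI category per region.
    """
    edges = {"RNR": [], "ELR": [], "SLR": []}
    # RNR: link each grid cell to its right and lower neighbor, so every
    # undirected adjacent pair is stored once (assumed 4-neighborhood).
    for r in range(grid_h):
        for c in range(grid_w):
            u = r * grid_w + c
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < grid_h and cc < grid_w:
                    edges["RNR"].append((u, rr * grid_w + cc, None))
    # ELR / SLR: entity -> region edges for every non-zero entry.
    for relation, matrix in (("ELR", env_area), ("SLR", poi_share)):
        regions, entities = np.nonzero(matrix)
        for u, e in zip(regions, entities):
            edges[relation].append((int(e), int(u), float(matrix[u, e])))
    return edges

# Toy 2x2 grid with 2 environmental entities and 1 POI category.
env_area = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [0.2, 0.8]])
poi_share = np.array([[1.0], [0.0], [0.0], [0.5]])
edges = build_hetero_edges(2, 2, env_area, poi_share)
```

In this toy example, regions 0 and 3 are not spatially adjacent, yet both attach to the same POI entity node, so they become connected through it. This is the kind of higher-order link that the entity nodes are meant to provide.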

Appendix D Dataset and Experiment Settings

Dataset Details. We employ five representative tasks located in four cities in China: Beijing, Shanghai, Guangzhou and Shenzhen. Population, GDP, and Night Light tasks reflect anthropogenic activities, whereas Carbon and $PM_{2.5}$ tasks characterize the natural environment. We conduct data preprocessing according to [42]. The detailed information of datasets is listed below:

  • Carbon Emissions: This dataset incorporates anthropogenic CO2 emission estimates sourced from the Open Data Inventory (ODIAC) (https://odin.opendatawatch.com/) for the year 2022, spatially aligned with our 1 km$^2$ satellite image grids (emissions quantified in tons).

  • Population: This dataset is obtained from WorldPop (https://www.worldpop.org/) population distribution data for 2020, with counts representing the number of citizens per region.

  • GDP: This dataset includes Gross Domestic Product (GDP) statistics reflecting China’s regional economic development patterns from Zhao et al. [54].

  • Night Light: As a proxy for human activity intensity, a key driver of urban evolution, we leverage nighttime light imagery data from Zhong et al. [56] in 2020.

  • $PM_{2.5}$: The $PM_{2.5}$ dataset is sourced from the ChinaHighPM2.5 dataset [36]. This dataset combines ground-based observations, atmospheric reanalysis, emission inventories, and other techniques to obtain seamless nationwide ground-level PM2.5 data from 2000 to the present. It covers all of China at a spatial resolution of 1 km and at daily, monthly, and yearly temporal resolutions, with values in µg/m$^3$.

Table 5: Dataset statistics.
| Dataset | Coverage: Bottom-left | Coverage: Top-right | Coverage: Area ($km^2$) | Satellite Image | POI Information |
|---|---|---|---|---|---|
| Beijing | 39.75°N, 116.03°E | 40.15°N, 116.79°E | 4,277 | 4,277 | 709,232 |
| Shanghai | 30.98°N, 121.10°E | 31.51°N, 121.80°E | 5,292 | 5,292 | 808,957 |
| Guangzhou | 22.94°N, 113.10°E | 23.40°N, 113.68°E | 8,540 | 8,540 | 805,997 |
| Shenzhen | 22.45°N, 113.75°E | 22.84°N, 114.62°E | 5,150 | 5,150 | 717,461 |
| Shenzhen-enlarged | 22.45°N, 113.75°E | 23.84°N, 114.62°E | 17,264 | 17,264 | 1,813,547 |

Implementation Details. We implement GeoHG, GeoHG-SSL and baselines with PyTorch 3.9 on a single NVIDIA RTX A6000 with 24GB of memory. We use a 2-layer MLP as the regression head for prediction. GeoHG is trained using the Adam optimizer with a learning rate of 0.01. For the hyperedge gates $\theta_{Env}$ and $\theta_{Soc}$, we conduct grid searches over (0.2, 0.4, 0.6, 0.8) and (0.3, 0.6, 0.9, 1.2, 1.5), respectively. For the number of layers of the graph convolutional block, we test values from 1 to 3. Training GeoHG takes approximately 3 minutes per task.
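The search described above can be sketched as follows. The candidate values are taken from the text; everything else is an assumption for illustration: the three hyperparameters are searched jointly, `evaluate` stands for a hypothetical routine that trains GeoHG with the given configuration and returns a validation score where higher is better, and `toy_evaluate` is a placeholder so that the sketch runs on its own.

```python
from itertools import product

THETA_ENV_GRID = (0.2, 0.4, 0.6, 0.8)
THETA_SOC_GRID = (0.3, 0.6, 0.9, 1.2, 1.5)
NUM_LAYERS_GRID = (1, 2, 3)

def grid_search(evaluate):
    """Return the best (theta_env, theta_soc, num_layers) and its score."""
    best_config, best_score = None, float("-inf")
    for config in product(THETA_ENV_GRID, THETA_SOC_GRID, NUM_LAYERS_GRID):
        score = evaluate(*config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

def toy_evaluate(theta_env, theta_soc, num_layers):
    # Placeholder objective (not a real model): peaks at (0.6, 0.9, 3).
    return -abs(theta_env - 0.6) - abs(theta_soc - 0.9) - abs(num_layers - 3)

best_config, best_score = grid_search(toy_evaluate)
```

A joint search visits 4 × 5 × 3 = 60 configurations per task, which stays affordable when a single training run takes about 3 minutes.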

Final Settings of GeoHG. We introduce the best hyperparameter configurations for each task as below:

  • For the Carbon dataset, the hypergate $\theta_{Env}$ is set as 0.6 while $\theta_{Soc}$ is set as 0.9. The number of layers of the graph convolution block is 3, and the dimension of the hidden layer is 64.

  • For the Population dataset, the hypergate $\theta_{Env}$ is set as 0.2 while $\theta_{Soc}$ is set as 0.9. The number of layers of the graph convolution block is 3, and the dimension of the hidden layer is 64.

  • For the GDP dataset, the hypergate $\theta_{Env}$ is set as 0.4 while $\theta_{Soc}$ is set as 1.2. The number of layers of the graph convolution block is 3, and the dimension of the hidden layer is 64.

  • For the Night Light dataset, the hypergate $\theta_{Env}$ is set as 0.2 while $\theta_{Soc}$ is set as 0.9. The number of layers of the graph convolution block is 3, and the dimension of the hidden layer is 64.

  • For the $PM_{2.5}$ dataset, the hypergate $\theta_{Env}$ is set as 0.8 while $\theta_{Soc}$ is set as 0.6. The number of layers of the graph convolution block is 3, and the dimension of the hidden layer is 64.

Evaluation Metrics. We employ mean absolute error (MAE), root mean squared error (RMSE), and the coefficient of determination ($R^2$) as evaluation metrics. These metrics are calculated as follows:

\begin{align}
\operatorname{MAE}(y,\hat{y}) &= \frac{1}{|y|}\sum_{i=1}^{|y|}\left|y_{i}-\hat{y}_{i}\right|, \tag{13}\\
\operatorname{RMSE}(y,\hat{y}) &= \sqrt{\frac{1}{|y|}\sum_{i=1}^{|y|}\left(y_{i}-\hat{y}_{i}\right)^{2}}, \tag{14}\\
\operatorname{R^{2}} &= 1-\frac{\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}. \tag{15}
\end{align}
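As a concrete reading of Eqs. (13)-(15), the three metrics can be sketched in a few lines of plain Python. This is a minimal illustration rather than the evaluation code used in the experiments; the function names and the toy values are assumptions, and $n$ in Eq. (15) is taken to equal $|y|$.

```python
import math
from typing import Sequence


def mae(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Mean absolute error, Eq. (13)."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def rmse(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Root mean squared error, Eq. (14)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def r2(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Coefficient of determination, Eq. (15)."""
    y_bar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))  # residual sum of squares
    ss_tot = sum((a - y_bar) ** 2 for a in y)             # total sum of squares
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Toy values for illustration only (not from the paper).
    y_true = [1.0, 2.0, 3.0, 4.0]
    y_pred = [1.5, 2.0, 2.5, 4.0]
    print(mae(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))
```

Note that $R^2$ is undefined when all targets are identical, since the denominator in Eq. (15) is then zero; the sketch does not guard against that case.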

Appendix E Details of Baselines

Description & Settings. We compare against a range of widely used traditional methods and recent state-of-the-art techniques. We select two classical methods (AutoEncoder [18] and ResNet-18 [10]) and four state-of-the-art models for geospatial region embedding: two vision-based models (Urban2Vec [35] and UrbanCLIP [42]) and two multi-modal approaches (UrbanVLP [9] and GeoStructural [20]). Moreover, we explore a variant of GeoHG that uses graph self-supervised learning (SSL) pretraining instead of end-to-end training, denoted GeoHG-SSL. We describe these baselines as follows.

  • AutoEncoder [18]: A neural network that extracts features from unlabeled satellite imagery by training with a reconstruction loss.

  • ResNet-18 [10]: The well-known residual neural network, pretrained on the ImageNet dataset [2], directly extracts visual features from satellite imagery by leveraging the knowledge it has acquired from natural imagery.

  • UrbanCLIP [42]: A model incorporating a multi-modal Large Language Model (LLM) to enhance the encoding of satellite imagery. This model generates descriptive texts for satellite images using the LLM and then fuses these texts with the images through an image-text contrastive learning-based approach to capture the complexity and diversity of geospatial areas.

  • PG-SimCLR [38]: A contrastive learning framework that introduces societal information (i.e., POI) into geospatial region representation learning from satellite imagery.

  • GeoStructural [20]: A graph-based framework that profiles geospatial regions, using street segments as the graph structure to adaptively fuse features from multi-level satellite and street-view images. For convenience, we refer to this method as GeoStructural.

  • UrbanVLP [9]: A region embedding method based on contrastive learning, which incorporates satellite imagery, street-view images, and spatial position structure. This method is further enhanced by incorporating a Large Language Model (LLM) and GeoCLIP [33], resulting in improved robustness and performance.

Appendix F More Details about Experiments Results

F.1 RMSE metrics of experiment results

Owing to page constraints in the main text, Table 6 presents a complementary version of our experimental results that additionally reports RMSE for each experiment. The RMSE results are consistent with the other performance metrics, further supporting the overall advantage of our model.

Table 6: Region indicators prediction results. The bold/underlined font means the best/the second-best result.

| City | Dataset | GeoHG $R^2$ | GeoHG RMSE | GeoHG-SSL $R^2$ | GeoHG-SSL RMSE | UrbanVLP $R^2$ | UrbanVLP RMSE | GeoStructural $R^2$ | GeoStructural RMSE | PG-SimCLR $R^2$ | PG-SimCLR RMSE | UrbanCLIP $R^2$ | UrbanCLIP RMSE | ResNet-18 $R^2$ | ResNet-18 RMSE | AutoEncoder $R^2$ | AutoEncoder RMSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Beijing | Carbon | 0.954 | 0.201 | 0.937 | 0.224 | 0.787 | 0.457 | 0.765 | 0.472 | 0.442 | 0.754 | 0.664 | 0.598 | 0.394 | 0.805 | 0.298 | 0.844 |
| Beijing | Population | 0.874 | 0.351 | 0.870 | 0.374 | 0.725 | 0.513 | 0.730 | 0.504 | 0.471 | 1.117 | 0.461 | 0.669 | 0.266 | 0.857 | 0.168 | 0.918 |
| Beijing | GDP | 0.647 | 0.567 | 0.644 | 0.581 | 0.586 | 0.650 | 0.617 | 0.609 | 0.277 | 1.254 | 0.355 | 0.864 | 0.285 | 1.024 | 0.171 | 1.251 |
| Beijing | Night Light | 0.901 | 0.311 | 0.900 | 0.353 | 0.531 | 0.629 | 0.488 | 0.657 | 0.369 | 0.728 | 0.420 | 0.700 | 0.348 | 0.734 | 0.276 | 0.784 |
| Beijing | $PM_{2.5}$ | 0.971 | 0.160 | 0.970 | 0.169 | 0.641 | 0.594 | 0.694 | 0.482 | 0.398 | 0.845 | 0.533 | 0.699 | 0.341 | 0.821 | 0.209 | 0.840 |
| Shanghai | Carbon | 0.915 | 0.290 | 0.912 | 0.312 | 0.716 | 0.529 | 0.688 | 0.557 | 0.298 | 0.914 | 0.671 | 0.569 | 0.326 | 0.763 | 0.230 | 0.771 |
| Shanghai | Population | 0.936 | 0.244 | 0.928 | 0.251 | 0.593 | 0.607 | 0.613 | 0.583 | 0.315 | 0.959 | 0.456 | 0.748 | 0.279 | 0.797 | 0.166 | 0.898 |
| Shanghai | GDP | 0.778 | 0.468 | 0.767 | 0.477 | 0.310 | 0.816 | 0.377 | 0.721 | 0.294 | 1.052 | 0.326 | 0.807 | 0.289 | 1.137 | 0.197 | 1.511 |
| Shanghai | Night Light | 0.898 | 0.311 | 0.891 | 0.347 | 0.457 | 0.667 | 0.442 | 0.685 | 0.308 | 0.768 | 0.387 | 0.709 | 0.244 | 0.756 | 0.164 | 0.729 |
| Shanghai | $PM_{2.5}$ | 0.866 | 0.379 | 0.836 | 0.394 | 0.486 | 0.654 | 0.527 | 0.592 | 0.303 | 0.895 | 0.444 | 0.774 | 0.292 | 0.954 | 0.243 | 0.942 |
| Guangzhou | Carbon | 0.885 | 0.336 | 0.884 | 0.371 | 0.698 | 0.514 | 0.681 | 0.529 | 0.422 | 0.708 | 0.585 | 0.603 | 0.375 | 0.673 | 0.254 | 0.733 |
| Guangzhou | Population | 0.871 | 0.244 | 0.855 | 0.255 | 0.665 | 0.441 | 0.687 | 0.433 | 0.303 | 0.972 | 0.533 | 0.687 | 0.274 | 0.831 | 0.195 | 0.928 |
| Guangzhou | GDP | 0.715 | 0.532 | 0.712 | 0.574 | 0.436 | 0.764 | 0.439 | 0.699 | 0.282 | 1.264 | 0.440 | 0.762 | 0.251 | 1.048 | 0.176 | 1.463 |
| Guangzhou | Night Light | 0.871 | 0.378 | 0.854 | 0.391 | 0.577 | 0.573 | 0.574 | 0.581 | 0.435 | 0.627 | 0.483 | 0.633 | 0.242 | 0.704 | 0.176 | 0.798 |
| Guangzhou | $PM_{2.5}$ | 0.833 | 0.403 | 0.822 | 0.414 | 0.638 | 0.624 | 0.652 | 0.597 | 0.315 | 0.741 | 0.56 | 0.694 | 0.231 | 0.859 | 0.196 | 0.903 |
| Shenzhen | Carbon | 0.926 | 0.290 | 0.912 | 0.304 | 0.659 | 0.568 | 0.647 | 0.581 | 0.257 | 0.816 | 0.562 | 0.571 | 0.241 | 0.726 | 0.189 | 0.887 |
| Shenzhen | Population | 0.892 | 0.244 | 0.879 | 0.261 | 0.790 | 0.448 | 0.797 | 0.390 | 0.311 | 0.892 | 0.527 | 0.610 | 0.299 | 0.855 | 0.175 | 0.921 |
| Shenzhen | GDP | 0.798 | 0.468 | 0.767 | 0.489 | 0.532 | 0.676 | 0.517 | 0.699 | 0.307 | 1.003 | 0.508 | 0.693 | 0.234 | 1.125 | 0.119 | 1.733 |
| Shenzhen | Night Light | 0.942 | 0.245 | 0.939 | 0.268 | 0.457 | 0.667 | 0.445 | 0.682 | 0.454 | 0.588 | 0.387 | 0.709 | 0.243 | 0.719 | 0.166 | 0.758 |
| Shenzhen | $PM_{2.5}$ | 0.906 | 0.298 | 0.905 | 0.331 | 0.566 | 0.598 | 0.597 | 0.491 | 0.323 | 0.883 | 0.430 | 0.762 | 0.273 | 0.826 | 0.149 | 0.891 |

F.2 Mean and standard deviation of metrics for GeoHG

Each method is executed five times, and we report the mean and standard deviation of each metric ($R^2$, MAE, and RMSE) for GeoHG in Table 7.

Table 7: The mean and standard deviation of the metrics over 5 runs for GeoHG.

| Dataset | Beijing $R^2$ | Beijing MAE | Beijing RMSE | Shanghai $R^2$ | Shanghai MAE | Shanghai RMSE |
|---|---|---|---|---|---|---|
| Carbon | 0.954±0.002 | 0.11±0.002 | 0.201±0.002 | 0.915±0.001 | 0.157±0.001 | 0.29±0.007 |
| Population | 0.874±0.003 | 0.271±0.001 | 0.351±0.001 | 0.936±0.005 | 0.161±0.001 | 0.244±0.001 |
| GDP | 0.647±0.015 | 0.336±0.010 | 0.567±0.009 | 0.778±0.001 | 0.323±0.002 | 0.468±0.001 |
| Night Light | 0.901±0.014 | 0.239±0.001 | 0.311±0.002 | 0.898±0.004 | 0.222±0.003 | 0.311±0.004 |
| $PM_{2.5}$ | 0.971±0.009 | 0.064±0.002 | 0.16±0.003 | 0.866±0.001 | 0.12±0.002 | 0.3789±0.003 |

| Dataset | Guangzhou $R^2$ | Guangzhou MAE | Guangzhou RMSE | Shenzhen $R^2$ | Shenzhen MAE | Shenzhen RMSE |
|---|---|---|---|---|---|---|
| Carbon | 0.885±0.001 | 0.219±0.001 | 0.336±0.001 | 0.926±0.001 | 0.128±0.001 | 0.29±0.001 |
| Population | 0.871±0.003 | 0.244±0.005 | 0.368±0.002 | 0.892±0.005 | 0.165±0.007 | 0.244±0.005 |
| GDP | 0.715±0.002 | 0.371±0.005 | 0.532±0.002 | 0.798±0.002 | 0.297±0.002 | 0.468±0.003 |
| Night Light | 0.871±0.001 | 0.234±0.002 | 0.378±0.001 | 0.942±0.001 | 0.149±0.001 | 0.245±0.001 |
| $PM_{2.5}$ | 0.833±0.003 | 0.133±0.002 | 0.4034±0.003 | 0.906±0.002 | 0.116±0.002 | 0.298±0.004 |
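For reference, a "mean±std" entry of the kind reported in Table 7 can be computed from per-run scores as sketched below. The run values are made up for illustration, the helper name `summarize` is hypothetical, and the use of the sample standard deviation (dividing by $n-1$) is an assumption, since the text does not state which estimator is used.

```python
import statistics
from typing import Sequence


def summarize(runs: Sequence[float], ndigits: int = 3) -> str:
    """Format repeated-run scores as 'mean±std'.

    Uses the sample standard deviation (n - 1 in the denominator);
    this choice is an assumption, not something stated in the paper.
    """
    mean = statistics.mean(runs)
    std = statistics.stdev(runs)
    return f"{mean:.{ndigits}f}±{std:.{ndigits}f}"


if __name__ == "__main__":
    # Hypothetical R^2 scores from five runs (illustrative only).
    r2_runs = [0.90, 0.92, 0.91, 0.93, 0.89]
    print(summarize(r2_runs))  # 0.910±0.016
```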

F.3 More ablation study results

The ablation results for Beijing and Shenzhen are illustrated in Figure 11. GeoHG-Mono preserves only a homogeneous graph structure with adjacency relationships between regions, while GeoHG-MLP discards the graph structure entirely. Similar to the results for Guangzhou and Shanghai in Figure 5, discarding the heterogeneous graph structure leads to severe performance degradation across all tasks. Furthermore, removing environmental, social, or location information also reduces model performance, although the degree of degradation caused by omitting each type of information varies across cities.

Figure 11: Results of Ablation Study on $R^2$ Metric for Beijing and Shenzhen.

F.4 Details of data efficiency experiment

Table 8: Region indicators prediction results in the few-shot setting. The evaluation metric is $R^2$, and the test set constitutes a 20% random sample disjoint from the training data. The bold/underlined font means the best/the second-best result.

| City | Dataset | GeoHG 20% | GeoHG 10% | GeoHG 5% | UrbanVLP 20% | UrbanVLP 10% | UrbanVLP 5% | GeoStructural 20% | GeoStructural 10% | GeoStructural 5% | UrbanCLIP 20% | UrbanCLIP 10% | UrbanCLIP 5% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Beijing | Carbon | 0.954 | 0.933 | 0.890 | \ul{0.719} | \ul{0.661} | \ul{0.647} | 0.579 | 0.561 | 0.540 | 0.643 | 0.596 | 0.432 |
| Beijing | Population | 0.791 | 0.775 | 0.728 | \ul{0.665} | \ul{0.635} | \ul{0.615} | 0.620 | 0.604 | 0.522 | 0.568 | 0.557 | 0.518 |
| Beijing | GDP | 0.689 | 0.604 | 0.474 | 0.511 | 0.448 | 0.417 | \ul{0.600} | \ul{0.547} | \ul{0.465} | 0.401 | 0.380 | 0.273 |
| Beijing | Night Light | 0.871 | 0.857 | 0.833 | 0.439 | 0.420 | \ul{0.405} | \ul{0.485} | \ul{0.456} | 0.383 | 0.415 | 0.396 | 0.369 |
| Beijing | $PM_{2.5}$ | 0.987 | 0.984 | 0.982 | \ul{0.611} | \ul{0.569} | \ul{0.542} | 0.598 | 0.536 | 0.461 | 0.450 | 0.403 | 0.381 |
| Shanghai | Carbon | 0.888 | 0.851 | 0.831 | 0.632 | 0.618 | 0.562 | \ul{0.697} | \ul{0.657} | \ul{0.589} | 0.668 | 0.598 | 0.526 |
| Shanghai | Population | 0.923 | 0.894 | 0.884 | 0.556 | 0.521 | 0.454 | \ul{0.573} | \ul{0.546} | \ul{0.498} | 0.530 | 0.515 | 0.451 |
| Shanghai | GDP | 0.733 | 0.665 | 0.656 | 0.236 | 0.228 | 0.198 | \ul{0.351} | \ul{0.318} | \ul{0.297} | 0.315 | 0.302 | 0.277 |
| Shanghai | Night Light | 0.872 | 0.839 | 0.806 | 0.415 | 0.368 | 0.311 | \ul{0.434} | \ul{0.394} | \ul{0.331} | 0.379 | 0.345 | 0.312 |
| Shanghai | $PM_{2.5}$ | 0.817 | 0.785 | 0.743 | 0.457 | 0.415 | 0.354 | \ul{0.501} | \ul{0.455} | \ul{0.416} | 0.380 | 0.357 | 0.325 |
| Guangzhou | Carbon | 0.796 | 0.775 | 0.751 | 0.578 | 0.532 | 0.415 | \ul{0.579} | \ul{0.574} | \ul{0.561} | 0.468 | 0.377 | 0.320 |
| Guangzhou | Population | 0.841 | 0.828 | 0.789 | 0.605 | 0.579 | 0.441 | \ul{0.682} | \ul{0.618} | \ul{0.596} | 0.556 | 0.552 | 0.419 |
| Guangzhou | GDP | 0.693 | 0.627 | 0.608 | 0.385 | 0.352 | 0.273 | \ul{0.415} | \ul{0.367} | \ul{0.336} | 0.407 | 0.303 | 0.223 |
| Guangzhou | Night Light | 0.831 | 0.827 | 0.812 | 0.508 | 0.472 | 0.318 | \ul{0.560} | \ul{0.529} | \ul{0.510} | 0.470 | 0.461 | 0.379 |
| Guangzhou | $PM_{2.5}$ | 0.766 | 0.758 | 0.752 | 0.558 | 0.493 | 0.381 | \ul{0.581} | \ul{0.528} | \ul{0.400} | 0.466 | 0.329 | 0.232 |
| Shenzhen | Carbon | 0.886 | 0.842 | 0.839 | 0.571 | 0.520 | 0.468 | \ul{0.585} | \ul{0.582} | \ul{0.563} | 0.478 | 0.463 | 0.390 |
| Shenzhen | Population | 0.891 | 0.889 | 0.870 | 0.605 | 0.579 | 0.543 | \ul{0.691} | 0.585 | 0.549 | 0.646 | \ul{0.618} | \ul{0.559} |
| Shenzhen | GDP | 0.782 | 0.781 | 0.767 | \ul{0.503} | \ul{0.477} | \ul{0.448} | 0.486 | 0.482 | 0.463 | 0.421 | 0.374 | 0.344 |
| Shenzhen | Night Light | 0.932 | 0.930 | 0.925 | \ul{0.402} | 0.354 | 0.228 | 0.383 | \ul{0.380} | \ul{0.356} | 0.312 | 0.270 | 0.228 |
| Shenzhen | $PM_{2.5}$ | 0.905 | 0.888 | 0.851 | 0.497 | 0.403 | 0.311 | \ul{0.527} | \ul{0.407} | \ul{0.350} | 0.368 | 0.332 | 0.266 |

We report the performance of several baseline models in a data-limited setting in Table 8. Existing geospatial embedding models exhibit some few-shot learning capability, performing relatively well despite significantly reduced training data. Our model consistently outperforms the other models under the same training data configuration. Moreover, our model can match or exceed the performance of other models while using less data: for example, on the Beijing Carbon task, GeoHG with 5% of the data reaches an $R^2$ of 0.890, whereas UrbanVLP with 20% reaches 0.719.
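One plausible way to construct such a data-limited split is sketched below. It assumes that the "available data" percentage is measured against the full set of regions and that the 20% test set is drawn first so that it stays disjoint from the training data; the paper does not spell out these details, and the helper `few_shot_split` is hypothetical.

```python
import random
from typing import List, Tuple


def few_shot_split(
    n_regions: int,
    train_frac: float,
    test_frac: float = 0.2,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Return (train_idx, test_idx) with a fixed test set and a reduced training set.

    Assumption: both fractions are relative to the total number of regions.
    """
    if train_frac + test_frac > 1.0:
        raise ValueError("train_frac + test_frac must not exceed 1")
    idx = list(range(n_regions))
    random.Random(seed).shuffle(idx)  # reproducible shuffle
    n_test = round(n_regions * test_frac)
    n_train = round(n_regions * train_frac)
    test_idx = idx[:n_test]                    # held-out regions
    train_idx = idx[n_test:n_test + n_train]   # disjoint from the test regions
    return train_idx, test_idx


if __name__ == "__main__":
    for frac in (0.20, 0.10, 0.05):
        train, test = few_shot_split(1000, frac)
        print(frac, len(train), len(test))
```

Fixing the seed keeps the test regions identical across the 20%, 10%, and 5% settings, so that differences in score reflect only the amount of training data.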

F.5 Details of investigation of GeoHG with GNNExplainer

To validate GeoHG’s efficacy in capturing higher-order geospatial relationships, we employ GNNExplainer [46], a model-agnostic technique for interpreting GNN predictions by identifying crucial subgraph structures and features. Specifically, for the carbon emission prediction task on region #(35,65) in Guangzhou, we use GNNExplainer to extract the most influential edges, as shown in Table 9. Similar results for region #(43,87) are shown in Table 10. Based on Table 9, we can derive the following key findings:

Table 9: Top 10 important edges for carbon emission predictions for region #(35,65) in Guangzhou. The underlined font is the regional nodes adjacent to the target region, and the bold font refers to the distant region nodes.

| Source Node Type | Source Node Name | Target Node Type | Target Node Name | Importance |
|---|---|---|---|---|
| region | (35,65) | society | Food and Beverage | 0.534 |
| region | (35,65) | society | Shopping Mall | 0.511 |
| region | (35,65) | environment | Built-up | 0.522 |
| region | (35,65) | region | (34,65) | 0.526 |
| region | (35,65) | region | (36,65) | 0.519 |
| region | (35,65) | region | (35,64) | 0.522 |
| region | (35,65) | region | (34,64) | 0.520 |
| society | Food and Beverage | region | (40,109) | 0.495 |
| society | Food and Beverage | region | (47,49) | 0.495 |

  1) The target region exhibits strong associations with societal entity nodes like "Food and Beverage" and "Shopping Mall", which are typically carbon-intensive due to factors such as energy usage (cooking, refrigeration), transportation (goods/personnel movement), and packaging consumption.

  2) Adjacent regions’ carbon emissions emerge as highly influential features, an intuitive pattern arising from spatial proximity and potential environmental spillovers between neighboring areas.

  3) Notably, important hyperedges also connect the target region to relatively distant areas like #(40, 109) and #(47, 49). Despite geographical separation, these regions share similar social contexts, being linked to the "Food and Beverage" node, thereby providing relevant emission patterns to inform the prediction.

These interpretable insights support GeoHG’s effectiveness in capturing the intricate high-order dependencies between environmental conditions, urbanization factors, economic activities, and their synergistic impact on carbon footprints across regions. By encoding such high-order interactions through graph structures, GeoHG offers a powerful inductive bias tailored for modeling complex geospatial phenomena governed by higher-order environment-society couplings.
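The ranking step behind such a table can be illustrated independently of any explainer library. The sketch below assumes that an explainer has already assigned an importance score to every edge and simply selects the highest-scoring ones; the `Edge` type and `top_k_edges` helper are hypothetical, and the sample scores are a subset of those reported for region #(35,65).

```python
from typing import List, NamedTuple


class Edge(NamedTuple):
    source: str
    target: str
    importance: float  # score assumed to come from an explainer such as GNNExplainer


def top_k_edges(edges: List[Edge], k: int = 10) -> List[Edge]:
    """Return the k edges with the highest importance, most important first."""
    return sorted(edges, key=lambda e: e.importance, reverse=True)[:k]


# A few edges and scores taken from the table for region #(35,65).
EDGES = [
    Edge("region (35,65)", "society: Food and Beverage", 0.534),
    Edge("region (35,65)", "society: Shopping Mall", 0.511),
    Edge("region (35,65)", "environment: Built-up", 0.522),
    Edge("region (35,65)", "region (34,65)", 0.526),
    Edge("society: Food and Beverage", "region (40,109)", 0.495),
]

if __name__ == "__main__":
    for edge in top_k_edges(EDGES, k=3):
        print(f"{edge.source} -> {edge.target}: {edge.importance:.3f}")
```

Two-hop paths such as region (35,65) → "Food and Beverage" → region (40,109) are what connect the target region to distant regions in the third finding.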

Table 10: Top 10 important edges for carbon emission predictions for region #(43,87) in Guangzhou. The underlined font is the regional nodes adjacent to the target region, and the bold font refers to the distant region nodes.

| Source Node Type | Source Node Name | Target Node Type | Target Node Name | Importance |
|---|---|---|---|---|
| region | (43,87) | society | Food and Beverage | 0.496 |
| region | (43,87) | environment | Built-up | 0.495 |
| region | (43,87) | environment | Tree | 0.519 |
| region | (43,87) | region | (42,87) | 0.531 |
| region | (43,87) | region | (44,87) | 0.526 |
| region | (43,87) | region | (43,86) | 0.524 |
| region | (43,87) | region | (43,88) | 0.520 |
| environment | Built-up | region | (48,54) | 0.497 |
| environment | Tree | region | (119,7) | 0.495 |

Appendix G Qualitative Demonstration

In this section, we qualitatively evaluate the efficacy of our proposed approach. To begin with, we visually depict the regional dependency distribution by calculating the cosine similarity between embeddings of distinct regions, which are acquired through self-supervised training as outlined in Section 3.4. Moreover, we juxtapose the pertinent real-world information of regions exhibiting highly similar embeddings to scrutinize the capacity of our method to succinctly encapsulate the information and interrelationships within geospace.

Given the embedding $E_A$ of region A, the cosine similarity with any other region’s embedding can be calculated as follows:

$$S_{C}(E_{A},E_{B})=\cos(\theta)=\frac{\mathbf{E_{A}}\cdot\mathbf{E_{B}}}{\|\mathbf{E_{A}}\|\,\|\mathbf{E_{B}}\|}=\frac{\sum_{i=1}^{n}E_{Ai}E_{Bi}}{\sqrt{\sum_{i=1}^{n}E_{Ai}^{2}}\cdot\sqrt{\sum_{i=1}^{n}E_{Bi}^{2}}} \tag{16}$$

where $E_{Ai}$ and $E_{Bi}$ are the $i$-th components of vectors $E_A$ and $E_B$, respectively.

Figure 12: Visualization of Region Similarity in Geospatial Embedding of Guangzhou. Darker regions indicate higher similarity.

In Figure 12, we present the visualization results of two randomly selected regions regarding their regional similarity within the GeoHG-SSL geospatial embedding of Guangzhou. For Region (40,58), we identify it as a farmland area situated near water on an island. According to the heat map, our model has naturally discovered other regions on the same island, Jiabaosha Island, with closer regions exhibiting higher similarity. Interestingly, our model has also identified some remote islands in the north with similar environmental and societal functions, such as Dazhouwei Island and Sandy Bay. Furthermore, the western part of Guangzhou city, which is predominantly a mountainous industrial area, shows lower similarity results, aligning with this geographical characteristic.

For Region (36,63), we find that it is an urban residential area with green land near a river, named Caohe Village. From the heat map, we can see that our model connects it to other residential areas in Guangzhou exhibiting similar characteristics. For example, Shilou Village and Lyv Village share similar traits, with residential settlements bordering flowing rivers.

It is worth noting that, with the comprehensive representation of environmental, societal, and spatial information in the geospace and the specially designed graph structure, our model does not reduce inter-region relationships to second-order region pairs based solely on intra-region feature similarity. Instead, the embeddings consistently reveal groups of regions related through higher-order relationships, rather than isolated high-similarity points.

Appendix H More Discussion

Limitations and Future Directions. One limitation of our proposed method is that the complexity of the graph structure is directly proportional to the area size of the geospace, as the unit grid of the geospace is fixed to a $1\,\mathrm{km}\times 1\,\mathrm{km}$ square. This could present a challenge in worldwide large-scale applications, where the area of geospace may be extremely vast. In the future, we may explore the potential of modeling the geospace of the entire Earth through an adaptive grid partition and graph construction method. Furthermore, although we have emphasized and validated the efficacy and advantages of segmentation-based satellite imagery encoding, the upper limit of this approach is determined by the quality of segmentation results provided by third parties. Additionally, while we have demonstrated the effectiveness of our GeoHG in various real-world applications, further investigation is warranted to evaluate its performance in more diverse and complex scenarios.

Social Impacts. Efficient geospatial embedding holds considerable benefits for the broader geographic and spatial computing communities. By comprehensively representing intra-region features and inter-region correlations, our proposed GeoHG framework exhibits significant potential to effect meaningful change across various domains, especially within the realms of smart cities and geoscience. The interpretability and efficiency provided by GeoHG facilitate a deeper understanding of complex geospace and the underlying mixed-order correlations inherent in the space. This comprehensive representation of spatial regions empowers stakeholders to monitor cities and environments more effectively, thereby making informed decisions that ultimately enhance individual and communal quality of life while fostering more resilient and sustainable environments.

Moreover, the data efficiency of GeoHG enables the community to investigate regional fine-grained climates with limited resources. For instance, in Section 4.3, we showcase its strong performance in predicting fine-grained ($1\,\mathrm{km}\times 1\,\mathrm{km}$) indicators for a large area ($16{,}401\,\mathrm{km}^2$) using only 863 monitoring points. As human society progresses and the global environment changes, regional extreme climates, such as urban heat islands and local air pollution, continue to pose significant economic, environmental, and health challenges. Thus, effective fine-grained regional climate monitoring becomes increasingly important. Given the millions of lives lost annually to local extreme heat and air pollution, we believe the enhanced geospatial region representation and data efficiency facilitated by our approach will support more effective human health protection measures and inform geopolitical decision-making, promoting improved urban and environmental management as well as extreme climate mitigation strategies.