SE-VGAE: Unsupervised Disentangled Representation Learning for Interpretable Architectural Layout Design Graph Generation

Jielin Chen, Rudi Stouffs

The authors are with the Department of Architecture, National University of Singapore, Singapore. E-mail: [email protected], [email protected]. The computational work for this article was performed on resources of the National Supercomputing Centre, Singapore (https://www.nscc.sg). The data sources used in this study are also gratefully acknowledged. This research was supported by the President’s Graduate Fellowship of the National University of Singapore and the Singapore Data Science Consortium (SDSC) Dissertation Research Fellowship. (Corresponding authors: Jielin Chen and Rudi Stouffs.) Manuscript received June 17, 2024.
Abstract

Despite the suitability of graphs for capturing the relational structures inherent in architectural layout designs, there is a notable dearth of research on interpreting architectural design space using graph-based representation learning and exploring architectural design graph generation. Concurrently, disentangled representation learning in graph generation faces challenges such as node permutation invariance and representation expressiveness. To address these challenges, we introduce an unsupervised disentangled representation learning framework, Style-based Edge-augmented Variational Graph Auto-Encoder (SE-VGAE), aiming to generate architectural layouts in the form of attributed adjacency multi-graphs while prioritizing representation disentanglement. The framework is designed with three alternative pipelines, each integrating a transformer-based edge-augmented encoder, a latent space disentanglement module, and a style-based decoder. These components collectively facilitate the decomposition of latent factors influencing architectural layout graph generation, enhancing generation fidelity and diversity. We also provide insights into optimizing the framework by systematically exploring graph feature augmentation schemes and evaluating their effectiveness for disentangling architectural layout representation through extensive experiments. Additionally, we contribute a new large-scale benchmark dataset of architectural layout graphs extracted from real-world floor plan images to facilitate the exploration of graph data-based architectural design representation space interpretation. This study pioneers disentangled representation learning for architectural layout graph generation. The code and dataset of this study will be open-sourced.

Index Terms:
Graph representation learning, disentangled representation learning, graph generation, architectural design, attributed adjacency multi-graph, architectural layout.

I Introduction

Architectural design solutions inherently possess structured information with interdependent scopes, making architectural design data intrinsically relational and suitable for graph-based representations [1]. Graph-structured representations are optimal for accurately depicting complex geometric and semantic information in architectural designs [2], providing an abstract yet robust format for encoding design features and their interrelationships. This suitability extends from macro-level layouts to micro-level construction details, allowing for an interconnected view of design elements and illustrating their cohesive functioning. Architectural design space serves as a fundamental concept in design research [3, 4, 7, 5, 6], yet the representation of architectural design space and interpretation of corresponding design representation space using deep learning-based approaches [8], especially concerning structured non-Euclidean data like graphs, remains understudied. Most existing studies focus on Euclidean design data formats [9, 8, 10], leaving a gap in exploring graph-based representation learning and synthesis in architectural design.

Recent advancements in disentangled representation learning for neural network-based graph generation aim to extract distinct generative factors in observed graph data, crucial for understanding real-world graph distributions. Although studies have shown the potential of disentanglement in deep graph representation learning [11, 12, 13], challenges like overlooking permutation invariance and limited expressiveness remain. Additionally, while significant progress has been made in domains like molecule and protein generation [12, 14, 15, 16], the application of these techniques in architectural design graphs is still largely unexplored.

To bridge these significant research gaps, we propose the Style-based Edge-augmented Variational Graph Auto-Encoder (SE-VGAE), an unsupervised disentangled representation learning framework for decomposing latent generative factors of architectural layout design graphs represented in the form of attributed adjacency multi-graphs (AAMG). The framework includes three alternative disentanglement pipelines (Fig. 1). All three pipelines are composed of a transformer-based edge-augmented encoder with the permutation equivariance property to integrate both node and edge features, a style-based decoder with two sub-decoders (a node-decoder and an edge-decoder) incorporating a layer-wise stochasticity feature decoding strategy, and a latent space disentanglement module. The latter facilitates the decomposition of latent factors influencing architectural layout graph generation. The three alternative disentanglement modules are a vanilla VAE scheme serving as baseline, a Vector Quantisation (VQ) scheme modelling probability density functions through prototype vectors, and a node-edge co-disentanglement scheme to separate features at node, edge, and graph levels using three specialized sub-encoders.

To the best of our knowledge, this study is the first to generate architectural layout design graphs with a focus on representation disentanglement. We investigate various architectural layout graph feature augmentation schemes and their impact on model performance, exploring different framework setups and structural adjustments to understand their influence on interpreting and disentangling complexities in the architectural layout design graph data space. In addition to framework development, we introduce a novel benchmark large-scale architectural layout graph dataset featuring detailed node and edge attributes extracted from real-world floor plan images. Using this dataset, we uncover latent design patterns and relationships through our proposed disentangled graph representation learning schemes. Additionally, we explore interpreting latent architectural design representation space by extracting high-level structural information using graph data. The contribution of this study can be summarized as follows:

Figure 1: Overview of the proposed Style-based Edge-augmented Variational Graph Auto-Encoder (SE-VGAE) framework, together with three alternative pipelines, for the latent embedding space disentanglement of architectural layout design representation space
  • Introducing the SE-VGAE framework, a pioneering effort to generate architectural layout design graphs with disentanglement, which provides insights into optimizing model structures for interpreting architectural design graph spaces through varied implementations.

  • Systematically investigating various architectural layout graph feature augmentation schemes and their impact on graph generation and representation disentanglement. Extensive experiments elucidate the efficacy of different augmentation strategies in improving the model’s understanding of architectural layout graph complexities.

  • Offering a new benchmark large-scale architectural layout design graph dataset from real-world floor plan images. This dataset is a valuable resource for training and evaluating disentangled graph representation learning models, enabling researchers to explore latent design patterns and relationships and extract high-level structural information for architectural design data interpretation.

II Related Work

This section briefly reviews the research background pertinent to this study, encompassing disentangled graph representation learning, graph generation and evaluation, architectural layout design representation as graphs, and the relevant subject of architectural design representation space interpretation.

II-A Disentangled graph representation learning

Bengio et al. [17] define disentangled representation as the separation of distinct, independent, and informative generative factors in observed data, crucial for understanding real-world data distributions, including complex graph structures. This concept is essential in deep graph representation learning models, where it is beneficial to discern which latent variables influence specific graph generation properties. Studies using disentanglement-oriented neural networks have demonstrated potential in this area. Stoehr et al. [11] used $\beta$-Variational Autoencoders ($\beta$-VAE) [18] to discover generative parameters in graphs but neglected edge features and node order independence, compromising reconstruction fidelity. Guo et al. [12, 13] addressed some of these issues with NED-VAE, a framework with sub-encoders and sub-decoders for disentangling node, edge, and graph-level features, though they overlooked graph node permutation invariance and did not use highly expressive graph aggregation methods. The expressive power of graph representation learning is crucial for identifying and differentiating subtle variations within graph structures. Graph neural networks (GNNs) extend the Weisfeiler-Lehman (WL) isomorphism test by representing graphs as vectors in continuous space, capturing relationships between different topologies [19]. However, conventional GNNs are only as powerful as the 1-dimensional WL test [20, 21]. Zhang et al. [22] categorize efforts to enhance GNN expressiveness into three approaches: graph feature enhancement [23], graph topology enhancement [24, 25, 26, 21], and model architecture enhancement [21, 26]. While model architecture enhancement increases complexity and parameters, feature and topology enhancements are more lightweight and easier to implement. These enhancements include adding local and global topological information to each node and using random node attributes or positional information to improve representation [25, 27, 24, 28].

II-B Evaluating interpretable deep graph generation

Robust quantitative evaluation is crucial for graph generative modelling, focusing on the difference between learned and reference graph distributions. Evaluating these distributions is challenging due to the unique properties of graph data. Traditional methods calculate the statistical distribution distance between real and generated graphs but often overlook continuous node and edge features [29]. Recently, neural network classifier-based metrics have gained popularity for aligning learned and real graph distributions [30, 29]. However, metrics from image generation, which use task-specific neural networks, have limitations in graph generative modelling due to the adaptability issues of pre-trained graph neural networks. To address this, studies have shown that randomly initialized graph neural networks can effectively evaluate graph generative models without the need for further training [21, 20, 29].

II-C Representing architectural layout design as graphs

Architectural layout design can be naturally represented as graphs. Specifically, a floor plan layout can be converted into a dual graph, emphasizing space adjacency with nodes as spaces and edges as connectivity [31, 1]. These adjacency graphs can also embed three-dimensional architectural information [32]. Neural network-based graph generation is a burgeoning field explored in various domains like molecules, protein structures, and scene graphs [12, 14, 15, 16]. However, in architectural design research, while many studies focus on generating architectural data in Euclidean formats like images and 3D models [9, 8, 10], the specific task of generating architectural layout design graphs has been notably unexplored.

II-D Architectural design representation space interpretation

Design space is crucial in architectural design research, with its exploration often used to approximate the design process [3, 4, 7, 5, 6]. Previous studies primarily focused on superior exploration strategies, leaving the intrinsic structure of design representation space vague [4]. As design activities are made possible because of designers’ mental models of design representation spaces that designers constantly perceive and formulate [4, 5, 6], Chen & Stouffs [6, 8] promote two explicit models of design representation spaces: the sparse human-learned model and the compressed machine-learned model, arguing that designers may enhance design performance by interacting with simulated design representation spaces. In this context, converting architectural design data into machine-interpretable formats is necessary, requiring flexible representation learning schemes. Recent data-driven techniques have demonstrated the ability to interpret design concepts by converting data into vectors of neural activities [8, 33]. However, current studies on architectural design representation focus mainly on Euclidean data, with limited exploration of non-Euclidean data like graphs.

III Preliminary

This section introduces the relevant notations used in this study in Section III-A, the problem formulation of disentangled graph representation learning of architectural layout design graph generation in Section III-B, and an overview of our proposed approach in Section III-C.

III-A Notations

Formally, a graph is denoted as $G=\left(V,E\right)$, where $V$ is the set of nodes and $E$ is the set of edges. We denote an edge going from node $v_{i}\in V$ to node $v_{j}\in V$ as $\left(v_{i},v_{j}\right)\in E$. This study does not consider self-loop edges, namely a node connecting to itself. The number of nodes $n=\left|V\right|$ is called the order of graph $G$, and the number of edges $e=\left|E\right|$ is called the size of graph $G$. In this study, we consider multi-graphs, within which there can be more than one edge between a pair of nodes. Also, only undirected graphs are discussed, such that $\left(v_{i},v_{j}\right)\in E\Leftrightarrow\left(v_{j},v_{i}\right)\in E$.

A convenient way to represent a graph is through an adjacency matrix. An adjacency matrix $A$ for a multi-graph is a symmetric square matrix with $A_{u,v}=a$ if $\left(u,v\right)\in E$, where $a$ is the number of edges connecting nodes $u$ and $v$. To represent a graph $G$ with an adjacency matrix $A$, the node set $V$ of graph $G$ needs to be ordered beforehand so that every node indexes a particular row and column in the adjacency matrix $A$. There are $n!$ possible node orderings $\pi$ for a graph of order $n$. Thus, if we choose an ordering $\pi$, the graph can be represented by the corresponding adjacency matrix $A^{\pi}\in\mathbb{R}^{n\times n}$. The multiplicity of possible representations of a single graph necessitates training mechanisms for graph representation learning models that are invariant or equivariant to different node permutations of the same graph. That is to say, any arbitrary node permutations of the same graph should result in identical graph representations, and ideally, deep graph representation learning and generative modelling need to learn permutation-invariant graph distributions.
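The dependence of the adjacency matrix on the node ordering can be illustrated with a minimal sketch. It is not taken from the paper's implementation: the three-node example graph and the helper names `adjacency_matrix`, `permute`, and `degree_multiset` are assumptions made purely for illustration, using only the Python standard library.

```python
from itertools import permutations

def adjacency_matrix(n, edges):
    """Symmetric adjacency matrix of an undirected multi-graph.

    A repeated (u, v) pair denotes parallel edges; self-loops are rejected.
    """
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        if u == v:
            raise ValueError("self-loops are not considered")
        A[u][v] += 1
        A[v][u] += 1
    return A

def permute(A, pi):
    """Return A^pi: position i of the new ordering holds original node pi[i]."""
    n = len(A)
    return [[A[pi[i]][pi[j]] for j in range(n)] for i in range(n)]

def degree_multiset(A):
    """A permutation-invariant summary: the sorted list of node degrees."""
    return sorted(sum(row) for row in A)

# Hypothetical layout with three rooms; rooms 0 and 1 are connected twice.
A = adjacency_matrix(3, [(0, 1), (0, 1), (1, 2)])
orderings = list(permutations(range(3)))
distinct_matrices = {tuple(map(tuple, permute(A, pi))) for pi in orderings}
invariant_values = {tuple(degree_multiset(permute(A, pi))) for pi in orderings}
```

For this particular graph all $3!=6$ orderings yield different matrices, while the sorted degree list is the same for every ordering, which is the behaviour expected of a permutation-invariant readout.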

III-B Problem formulation

The learning objective of deep graph representation learning, especially for a graph generative model, is to maximize the likelihood $p\left(G\right)=\sum_{\pi}P\left(G,\pi\right)$. However, while graph encoding can abstract away the ordering of nodes with permutation-invariant node aggregation operations, graph decoding must establish certain node orderings as concrete expressions. Under relatively lenient conditions, graph decoders can attain permutation equivariance: when presented with a permuted graph, the graph generative model can produce correspondingly permuted graph representations. For sequential generation methods, attaining permutation equivariance is far from trivial, yet it can be straightforward for one-shot generation methods. This is usually done by predefining one or a series of node orderings (e.g., breadth-first search, depth-first search, node degree, or a family of canonical orderings) [30, 34]. Specifically, canonical ordering refers to systematically arranging the nodes of a graph according to specific rules or algorithms, resulting in a consistent and standardized ordering [34]. A family of canonical orderings can be predefined as $K=\left\{\pi_{1},\ldots,\pi_{k}\right\}$ and can be used to learn an evidence lower bound (ELBO) of $\sum_{\pi}P\left(G,\pi\right)$, namely $\sum_{\pi\in K}P\left(G,\pi\right)$, as $K$ is a strict subset of the full factorial range of node orderings and every term $P\left(G,\pi\right)$ is non-negative. It is also a tighter lower bound than any single arbitrary canonical ordering likelihood $P\left(G,\pi\right)$. Additionally, enlarging the size of $K$ can result in a tighter lower bound. Selecting an appropriately sized set $K$ can thus balance the tightness of the bound, which typically corresponds to improved model quality, against the computational costs involved.
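To make the notion of a family of orderings concrete, the following sketch computes three of the orderings mentioned above (breadth-first, depth-first, and descending node degree) for a small connected graph. It is an illustration under stated assumptions rather than the procedure used in this study: the example adjacency matrix is hypothetical, the traversals start from node 0 for brevity (a genuinely canonical ordering would need a start rule that does not depend on the input labelling), and ties are broken by node index.

```python
from collections import deque

def neighbours(A, u):
    """Indices of nodes sharing at least one edge with u, in ascending order."""
    return [v for v, a in enumerate(A[u]) if a > 0]

def bfs_order(A, start=0):
    """Breadth-first node ordering of a connected graph."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in neighbours(A, u):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return order

def dfs_order(A, start=0):
    """Depth-first node ordering of a connected graph (iterative)."""
    seen, order, stack = set(), [], [start]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        order.append(u)
        # Push in reverse so that lower-index neighbours are visited first.
        stack.extend(reversed(neighbours(A, u)))
    return order

def degree_order(A):
    """Nodes sorted by descending degree, ties broken by node index."""
    return sorted(range(len(A)), key=lambda u: (-sum(A[u]), u))

# Hypothetical 5-node layout graph with edges 0-1, 0-2, 1-3, 1-4.
A = [
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 1],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
]
family = {"bfs": bfs_order(A), "dfs": dfs_order(A), "degree": degree_order(A)}
```

Each entry of `family` plays the role of one ordering $\pi\in K$; training on several such orderings is what the bound $\sum_{\pi\in K}P\left(G,\pi\right)$ refers to.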

Meanwhile, a graph can have both node and edge attributes; such a graph is referred to as an attributed graph. An attributed graph is defined as $G=\left(V,E,X,A^{e}\right)$. The node feature matrix is denoted as $X\in\mathbb{R}^{n\times d}$, where we assume that the ordering of the nodes is consistent with the ordering in the corresponding adjacency matrix $A$, with $x_{v}\in\mathbb{R}^{d}$ denoting the feature vector of node $v$ and $d$ the dimension of the node attributes. The edge feature matrix is denoted as $A^{e}\in\mathbb{R}^{n\times n\times c}$, with $x_{u,v}^{e}\in\mathbb{R}^{c}$ representing the feature vector of edge $\left(u,v\right)\in E$ and $c$ the dimension of the edge attributes. The edge feature matrix can also be understood as the original adjacency matrix $A$ with an edge feature dimension of size $c$ added to each entry $A_{u,v}$.
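One possible way to lay out these tensors is sketched below. The encoding is an assumption for illustration only: edge channels are taken to count edges per hypothetical connection type (door, open passage), and node features are hypothetical one-hot room categories; the actual attributes of the dataset are not specified in this section.

```python
def edge_feature_tensor(n, typed_edges, c):
    """Build A^e of shape n x n x c from (u, v, type) triples.

    Channel k of entry (u, v) counts the edges of type k between u and v,
    so parallel edges of different types remain distinguishable.
    """
    Ae = [[[0] * c for _ in range(n)] for _ in range(n)]
    for u, v, k in typed_edges:
        Ae[u][v][k] += 1
        Ae[v][u][k] += 1  # undirected: keep the tensor symmetric
    return Ae

# Hypothetical encoding: edge type 0 = door, type 1 = open passage.
typed_edges = [(0, 1, 0), (0, 1, 1), (1, 2, 0)]
n, c = 3, 2
Ae = edge_feature_tensor(n, typed_edges, c)

# Node feature matrix X (n x d): hypothetical one-hot room categories, d = 2.
X = [[1, 0], [0, 1], [0, 1]]

# Summing over the channel axis recovers the plain multi-graph adjacency matrix.
A = [[sum(Ae[u][v]) for v in range(n)] for u in range(n)]
```

Under this encoding the plain adjacency matrix $A$ is obtained from $A^{e}$ by summing over the feature axis, which matches the reading of $A^{e}$ as $A$ with an added feature dimension.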

Given a set of observed graphs $D_{G}=\left\{G_{1},\ldots,G_{s}\right\}$ with underlying data distribution $p\left(G\right)$, where each graph $G_{i}$ may have a different order $n_{i}$ and size $e_{i}$, we have $G\sim p\left(G\right)$ for each graph $G$ in the dataset. The goal is to have a deep graph representation learning model that is able to learn a close enough estimation $p_{\mathrm{model}}\left(G\right)$ of the real graph distribution $p\left(G\right)$ without being constrained to a predetermined order or size of the graphs. Such a model would be capable of generating novel, previously unseen graphs of various orders and sizes drawn from the learned probabilistic model $p_{\mathrm{model}}\left(G\right)$.

The representation mapping encoders and decoders are essential for acquiring such graph representation learning models. Formally, a graph encoder, denoted as $f\left(z|G\right)$, maps a real discrete graph object to a dense, continuous vector $z$ in a low-dimensional stochastic latent space that follows a prior distribution $p\left(z\right)$; the graph encoder $f\left(z|G\right)$ outputs the parameters of the stochastic distribution. A graph decoder, denoted as $f\left(G|z\right)$, accepts a latent vector $z\sim p\left(z\right)$ sampled from the same stochastic distribution $p\left(z\right)$ and performs the inverse function of the graph encoder. Graphs and corresponding features are transformed into a continuous vector space during encoding. Translating continuous data representations back into discrete graph structures, including nodes and edges, is non-trivial. This reconstruction task can take various forms, ranging from the sequential generation of the nodes and edges of the graphs step by step to the one-shot generation of the adjacency matrices or edge lists. Sequential generation leverages local decision-making efficiently and is flexible when the number of nodes is unknown. However, it struggles with maintaining long-range dependencies, which can result in omitting crucial global graph properties. Conversely, one-shot generation can capture a graph’s global properties by simultaneously generating and refining the entire graph structure across multiple iterations [14]. This study adopts the one-shot generation method, as it can learn to map entire graphs into unified latent representations and generate an entire graph directly through a single-step sampling, allowing the extraction of crucial global graph properties without sacrificing computational efficiency.
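As a minimal sketch of what it means for the encoder to output the parameters of a stochastic distribution, the snippet below assumes the common VAE choice of a diagonal Gaussian posterior and a standard normal prior. This choice, the function names, and the example values are assumptions for illustration; the text above only states that the latent space follows a prior $p\left(z\right)$.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

# Hypothetical encoder outputs for a 2-dimensional latent space.
rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, -2.0]
z = reparameterize(mu, log_var, rng)     # latent code passed to the decoder
kl = kl_to_standard_normal(mu, log_var)  # regularizer pulling q(z|G) towards p(z)
```

In a one-shot setting, a decoder would map the single sample `z` to the full node and edge feature matrices in one pass; that mapping is model-specific and is not sketched here.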

The ultimate problem this study tries to solve is disentangling local and global graph generative dependencies, which can provide insights into architectural layout design graph topologies. Ideally, upon mastering a latent space that accurately represents the distribution of real graphs, one can sample a new latent code $z\sim p\left(z\right)$ from this space to control the characteristics of the generated graphs. Disentangled sampling can then be applied by segmenting the latent vector $z$ into distinct dimensions, with each dimension $z_{n}$ focusing on a unique property. As a result, altering a single latent dimension $z_{n}$ can induce specific property changes in the generated graphs, enabling precise manipulation of graph characteristics.
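Altering a single latent dimension while holding the others fixed is often called a latent traversal. The sketch below shows only the construction of the traversed codes; the helper name `latent_traversal` and the example values are hypothetical, and decoding each code into a graph would require a trained decoder, which is not part of this sketch.

```python
def latent_traversal(z, dim, values):
    """Copies of z in which only coordinate `dim` is replaced by each value."""
    variants = []
    for value in values:
        z_variant = list(z)   # copy so the original code is left untouched
        z_variant[dim] = value
        variants.append(z_variant)
    return variants

# Hypothetical 3-dimensional latent code; sweep the second coordinate.
z = [0.3, -1.2, 0.8]
sweep = latent_traversal(z, dim=1, values=[-2.0, 0.0, 2.0])
```

If the representation is well disentangled, decoding the codes in `sweep` should change one graph property at a time while leaving the others largely unaffected.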

III-C Method overview

This study introduces the Style-based Edge-augmented Variational Graph Auto-Encoder (SE-VGAE), a novel framework designed for unsupervised disentangled representation learning aimed at automatically decomposing latent generative factors within architectural layout design graphs represented in the form of attributed adjacency multi-graphs. The framework comprises three alternative disentanglement pipelines tailored for interpreting the layout design graph data space (refer to Fig. 1). Each pipeline consists of three primary components.

The first component features a transformer-based edge-augmented encoder designed with permutation equivariance to integrate both node and edge features. This encoder takes the node feature matrix and the edge feature matrix of an attributed adjacency multi-graph as input, producing updated node and edge embeddings integrating both local and global nodes and edges’ features as output. Addressing the node ordering challenge, we employ a selected family of canonical orderings, enabling the model to consider various orderings with distinct structural biases while circumventing the challenges associated with factorial permutations.

The latent space disentanglement module follows the edge-augmented encoder. It is essential for decomposing latent factors influencing the interpretation and disentanglement of architectural layout design graph representations. This module employs Graph Isomorphism Network (GIN) [20] layers as building blocks, known for their superior expressive power and ability to generalize the Weisfeiler-Lehman (WL) test. This module takes the node and edge embeddings as input and outputs latent code vectors, the specifics of which vary depending on the chosen module scheme. We leverage different disentanglement regularisation methods to guide the representation disentanglement process and promote independence among the learned latent variables. Specifically, we propose three alternative disentanglement module schemes: 1) a vanilla VAE scheme that outputs a single latent code vector embedding the entire input graph, 2) a Vector Quantization (VQ) scheme that models probability density functions through the distribution of prototype vectors, resulting in a quantized latent code vector embedding the entire input graph, and 3) a node-edge co-disentanglement scheme utilizing three specialized sub-encoders to separate features at node, edge, and graph levels, outputting three latent code vectors embedding node, edge, and graph level features of the input graph, respectively.
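The core operation of a vector quantization scheme, replacing a continuous latent vector by its nearest prototype, can be sketched as follows. This is the generic nearest-neighbour lookup used in VQ-style models, shown under the assumption of a squared Euclidean distance; the codebook values are hypothetical, and training details such as codebook updates are omitted because they are not described in this section.

```python
def quantize(z, codebook):
    """Return (index, prototype) of the codebook vector nearest to z."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)),
                key=lambda k: squared_distance(z, codebook[k]))
    return index, codebook[index]

# Hypothetical codebook of three 2-dimensional prototype vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
index, z_q = quantize([0.9, 0.8], codebook)
```

The quantized vector `z_q` then stands in for the continuous encoder output as the latent code of the input graph.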

The final component is a style-based decoder, which incorporates the layer-wise stochasticity feature decoding strategy [35] by introducing stochastic variations at different layers of the network. The decoder consists of two sub-decoders: a node-decoder and an edge-decoder. The node-decoder reconstructs node features by translating the provided latent representation back into the node-specific attributes of the graph, while the edge-decoder focuses on reconstructing edge features, converting the latent representation into meaningful edge attributes of the graph.

To the best of our knowledge, this study presents a pioneering effort in generating architectural layout design graphs with a primary focus on representation disentanglement. Unlike previous approaches, our work takes into account both node and edge features, as well as the critical issue of graph node permutation invariance. By integrating graph aggregation methods with high expressive power, we aim to acquire high-quality learned graph representations at various levels and, consequently, high effectiveness of feature disentanglement.

IV Style-based Edge-augmented Variational Graph Auto-Encoder

In this study, one essential aspect is striking a balance between expressiveness and efficiency when mapping from the spaces of adjacency matrices and node feature matrices to a condensed latent space, as well as the simultaneous generation of graph topology and node/edge attributes. Given that the topology of a graph $G$ can be conveniently represented through an edge feature matrix $A^{e}$ (adjacency matrix embedded with edge features) in tandem with a node feature matrix $X$, a prevalent approach is to model the distribution of these matrices in a unified, seamless process [14]. As the one-shot generation method can effectively handle global patterns of graphs, and it is vital to capture global patterns for interpreting architectural layout design graph data, we adopt the adjacency matrix-based one-shot generation approach. We propose a flexible variational autoencoder-based graph representation learning framework designed to learn latent variable distributions at node, edge, and graph levels. The model encodes a comprehensive range of architectural features and relationships inherent in the layout design graph data by simultaneously capturing the nuances at these different levels. Adopting a flexible VAE-based framework paves the way for further systematic implementation and evaluation of a series of structural interventions concerning the model structure. Concretely, we propose the Style-based Edge-augmented Variational Graph Auto-Encoder (SE-VGAE), together with three alternative pipelines, for the latent embedding space disentanglement of architectural layout design graph data, as shown in Fig. 1.

All three alternative pipelines comprise a transformer-based edge-augmented encoder, a latent space disentanglement module, and a style-based decoder. The edge-augmented encoder takes as input the node feature matrix $X$ and the edge feature matrix $A^{e}$ of an attributed adjacency multi-graph $G$ of an architectural layout design. The input graph undergoes pre-processing using a predefined family of canonical orderings, allowing the model to account for different orderings with unique structural biases while avoiding the computational challenges of the full space of factorially many permutations [34]. The encoder outputs the correspondingly updated node feature matrix $X'$ and edge feature matrix $A^{e'}$ with augmented node and edge embeddings that integrate the intricate relationships and features of the nodes and edges at both local and global levels. Details of the encoder component are further discussed in Section IV-A. The augmented node and edge feature matrices $X'$ and $A^{e'}$ then serve as input to the latent space disentanglement module. We propose three alternative latent space disentanglement modules, offering different disentanglement regularisation methods to guide the representation disentanglement process and promote independence among learned latent variables. We elaborate on the details of the disentanglement modules in Section IV-B. The disentanglement module then outputs one or three compressed latent code vectors $z$ of the given graph $G$. The specifics of the latent code vectors vary depending on the alternative module scheme. The latent code vector(s) $z$ are then used as the input of the style-based decoder, which generates a node feature matrix $X''$ and an edge feature matrix $A^{e''}$ as the final outputs. A more in-depth discussion of the style-based decoder is provided in Section IV-C.
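The canonical-ordering pre-processing mentioned above can be pictured with a small sketch. The text does not state which orderings make up the predefined family, so the two shown here (a breadth-first ordering and a degree-descending ordering) are assumptions chosen purely for illustration, and the helper names `bfs_order`, `degree_order`, and `permute` are hypothetical.

```python
from collections import deque

def bfs_order(adj, start):
    """Breadth-first ordering of the component containing `start`."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in sorted(adj[u]):          # visit neighbours by index for determinism
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return order

def degree_order(adj):
    """Nodes sorted by descending degree, ties broken by node index."""
    return sorted(adj, key=lambda u: (-len(adj[u]), u))

def permute(matrix, order):
    """Reorder the rows and columns of a square matrix by `order`."""
    return [[matrix[i][j] for j in order] for i in order]

# Toy layout (hypothetical): 0 = living room, 1 = kitchen, 2 = bedroom, 3 = bathroom
adj = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
A = [[1 if j in adj[i] else 0 for j in range(4)] for i in range(4)]

# One permuted copy of the adjacency matrix per ordering in the (assumed) family.
orderings = {"bfs": bfs_order(adj, 0), "degree": degree_order(adj)}
views = {name: permute(A, order) for name, order in orderings.items()}
```

Restricting the model to a handful of such orderings keeps the number of graph views small and fixed, instead of growing factorially with the number of nodes.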

IV-A Edge-augmented encoder

For the edge-augmented encoder of our proposed model frameworks, we leverage the Edge-augmented Graph Transformer (EGT) [36] as the backbone to integrate both node and edge features. The EGT backbone inherits the permutation equivariance characteristic from the original transformer mechanism [37] and employs global self-attention as its primary aggregation mechanism. This approach markedly differs from the conventional static, localized convolutional node aggregation, enabling the model to facilitate unconstrained long-range dynamic interactions between nodes. A key aspect of the transformer-based edge-augmented encoder is its ability to handle both node and edge features within a unified framework. The residual channels of the original transformer structure are utilized as node channels, while additional edge channels enable the graph’s edge information to evolve across different layers of the model, allowing the model to dynamically update and refine the representation of both nodes and edges through successive layers. As a result, the encoder continuously updates both node and edge embeddings at each layer.

The edge-augmented encoder, denoted as $f\left((X', A^{e'}) \mid (X, A^{e})\right)$, takes as input the node feature matrix $X \in \mathbb{R}^{n \times d}$ and the edge feature matrix $A^{e} \in \mathbb{R}^{n \times n \times c}$ of an attributed adjacency multi-graph $G = (V, E, X, A^{e})$, and produces the updated node feature matrix $X' \in \mathbb{R}^{n \times d}$ and edge feature matrix $A^{e'} \in \mathbb{R}^{n \times n \times c}$ with augmented node and edge embeddings, incorporating intricate relationships and features from both local and global levels of the nodes and edges. Specifically, $x_v \in \mathbb{R}^{d}$ denotes the feature vector of node $v$, where $d$ is the dimension of the node features, and $x_{u,v}^{e} \in \mathbb{R}^{c}$ represents the feature vector of edge $(u,v) \in E$, where $c$ is the dimension of the edge features.
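To make the tensor shapes concrete, the following sketch builds $X$ and $A^{e}$ for a toy four-room layout. The feature choices (area and two binary flags for nodes; a door channel and a shared-wall channel for edges) are illustrative assumptions and not the schema of the dataset used in this study.

```python
import numpy as np

n, d, c = 4, 3, 2  # nodes, node-feature dim, edge-feature dim (toy values)

# Node feature matrix X in R^{n x d}: one row x_v per room (hypothetical features).
X = np.array([
    [20.0, 1.0, 0.0],   # living room: area, is_public, is_wet
    [ 9.0, 1.0, 1.0],   # kitchen
    [12.0, 0.0, 0.0],   # bedroom
    [ 4.0, 0.0, 1.0],   # bathroom
])

# Edge feature tensor A^e in R^{n x n x c}: A_e[u, v] holds the vector x^e_{u,v}.
# Channel 0 = connected by a door, channel 1 = shares a wall (hypothetical).
A_e = np.zeros((n, n, c))
edges = {(0, 1): [1.0, 1.0], (0, 2): [1.0, 1.0], (2, 3): [1.0, 1.0], (1, 3): [0.0, 1.0]}
for (u, v), feat in edges.items():
    A_e[u, v] = A_e[v, u] = feat   # undirected graph: keep the tensor symmetric

# A plain adjacency matrix is recovered as "any edge channel is non-zero".
adjacency = (A_e != 0).any(axis=-1)
```

Storing several relation types as channels of one dense tensor is one simple way to represent a multi-graph in the adjacency-matrix form that one-shot generation requires.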

Concretely, at the $o$-th attention head of the $l$-th layer of the $L$-layer encoder, the attention mechanism is defined as follows:

$$\mathrm{Attn}\left(Q_n^{o,l}, K_n^{o,l}, V_n^{o,l}\right) = \mathrm{softmax}\left(\mathrm{clip}\left(\frac{Q_n^{o,l} \cdot \left(K_n^{o,l}\right)^{T}}{\sqrt{b_k}}\right) + E_e^{o,l}\right) \odot \sigma\left(G_e^{o,l}\right) \cdot V_n^{o,l}$$

Here, $Q_n^{o,l}, K_n^{o,l}, V_n^{o,l} \in \mathbb{R}^{n \times b_k}$ represent the queries, keys, and values obtained from linear transformations of node embeddings, with $Q_n^{o,l} \cdot \left(K_n^{o,l}\right)^{T} \in \mathbb{R}^{n \times n}$ denoting the dot product of $Q_n^{o,l}$ and $K_n^{o,l}$; $b_k = d/O$ is the dimension of the keys used for normalizing the dot product, and $O$ is the total number of attention heads. The normalized dot product is clipped to a certain range for better numerical stability ($[-5, +5]$ is used following [36]). $E_e^{o,l}, G_e^{o,l} \in \mathbb{R}^{n \times n}$ are the learned linear transformations of the edge embeddings. $E_e^{o,l}$ acts as a bias term added to the normalized dot product of the queries and keys of the node embeddings, enabling edge embeddings to influence node embedding attention, while $\sigma\left(G_e^{o,l}\right) \in \mathbb{R}^{n \times n}$ gates the softmax values before aggregation, regulating information flow between nodes. Here, $\odot$ is the element-wise product operation and $\cdot$ is matrix multiplication. With $O$ attention heads in total, we have,

$$O_e^{l} = \big\Vert_{o=1}^{O} \widehat{H}^{o,l}, \quad O_e^{l} \in \mathbb{R}^{n \times n \times O} \tag{1}$$

$$\text{where } \widehat{H}^{o,l} = \mathrm{clip}\left(\frac{Q_n^{o,l} \cdot \left(K_n^{o,l}\right)^{T}}{\sqrt{b_k}}\right) + E_e^{o,l} \tag{2}$$

$$A^{e,l+1} = \mathrm{LN}\left(\mathrm{FFN}\left(\mathrm{LN}\left(\widehat{A}^{e,l}\right)\right) + \widehat{A}^{e,l}\right) \tag{3}$$

$$\text{where } \widehat{A}^{e,l} = \widehat{O_e}^{l} + A^{e,l} \tag{4}$$

for edge embedding updates at the $l$-th layer, and

$$O_n^{l} = \big\Vert_{o=1}^{O} C^{o,l} \odot \sum^{n} \widehat{A}^{o,l} \cdot V_n^{o,l}, \quad O_n^{l} \in \mathbb{R}^{n \times b_k \times O} \tag{5}$$

$$\text{where } \widehat{A}^{o,l} = \mathrm{softmax}\left(\widehat{H}^{o,l}\right) \odot \sigma\left(G_e^{o,l}\right) \tag{6}$$

$$\text{and } C^{o,l} = \ln\left(1 + \sum^{n} \sigma\left(G_e^{o,l}\right)\right) \tag{7}$$

$$\widehat{O_n}^{l} = \mathrm{reshape}\left(O_n^{l}\right), \quad \widehat{O_n}^{l} \in \mathbb{R}^{n \times d} \tag{8}$$

$$X^{l+1} = \mathrm{LN}\left(\mathrm{FFN}\left(\mathrm{LN}\left(\widehat{X}^{l}\right)\right) + \widehat{X}^{l}\right) \tag{9}$$

$$\text{where } \widehat{X}^{l} = \widehat{O_n}^{l} + X^{l} \tag{10}$$

for node embedding updates at the $l$-th layer, where $\mathrm{LN}$ is layer normalization applied right before and after the attention mechanism, $\mathrm{FFN}$ is a feed-forward network layer for learnable linear transformation, and $\Vert$ refers to the concatenation operation. $C^{o,l}$ represents the logarithm of one plus the sum of sigmoid-transformed edge embeddings, scaling node centrality to enhance network sensitivity and expressiveness in identifying non-isomorphic (sub-)graphs through adaptive self-attention [36].
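The per-head computation in the equations above can be sketched in NumPy as follows. This is a minimal illustration, not the implementation used in this study: the queries, keys, values, and edge terms are random stand-ins for the learned linear projections, the sums $\sum^{n}$ are read as sums over the key (neighbour) index as in EGT [36], and the function name `egt_head` is hypothetical. The residual, LN, and FFN steps are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the row max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def egt_head(Q, K, V, E, G, clip_value=5.0):
    """One attention head. Q, K, V: (n, b_k); E, G: (n, n)."""
    b_k = Q.shape[-1]
    H_hat = np.clip(Q @ K.T / np.sqrt(b_k), -clip_value, clip_value) + E  # clipped scores plus edge bias
    gate = sigmoid(G)                                                     # edge gates in (0, 1)
    A_hat = softmax(H_hat, axis=-1) * gate                                # gated attention weights
    C = np.log1p(gate.sum(axis=-1, keepdims=True))                        # centrality scaler, shape (n, 1)
    return H_hat, C * (A_hat @ V)                                         # edge-channel and node-channel outputs

rng = np.random.default_rng(0)
n, O, b_k = 5, 2, 4          # nodes, heads, per-head key dim, so d = O * b_k = 8
heads = []
for _ in range(O):
    Q, K, V = (rng.normal(size=(n, b_k)) for _ in range(3))
    E, G = rng.normal(size=(n, n)), rng.normal(size=(n, n))
    heads.append(egt_head(Q, K, V, E, G))

O_e = np.stack([h for h, _ in heads], axis=-1)   # heads stacked for the edge channel: (n, n, O)
O_n = np.stack([o for _, o in heads], axis=-1)   # heads stacked for the node channel: (n, b_k, O)
O_n_hat = O_n.reshape(n, b_k * O)                # reshaped to (n, d)
```

Note that the same clipped score matrix feeds both channels: it is stacked across heads for the edge update and, after softmax and gating, used to aggregate values for the node update.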

For an input attributed adjacency multi-graph $G = (V, E, X, A^{e})$, the node feature embedding $\widehat{X}$ and the edge feature embedding $\widehat{A}^{e}$ are obtained through a series of learnable linear transformations of the original node feature matrix $X$ and edge feature matrix $A^{e}$, accommodating both continuous and discrete values. The edge feature embedding $\widehat{A}^{e}$ is further processed by adding the distance matrix $D^{m}$, where $D_{u,v}^{m} \in \{0, 1, \ldots, m\}$ is the shortest-path distance between nodes $u$ and $v$, clipped to $m$ hops whenever it exceeds $m$. A masking vector is employed in lieu of an edge feature for non-existing edges. The resulting node and edge embeddings are then forwarded to the latent space disentanglement module.
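As a rough illustration of the clipped distance matrix $D^{m}$, the sketch below runs a breadth-first search from every node and stops expanding once the hop count reaches $m$. Assigning the value $m$ to unreachable pairs is an assumption made here for completeness; the text only specifies clipping for distances that exceed $m$. The function name is hypothetical.

```python
from collections import deque

def clipped_distance_matrix(adj, m):
    """All-pairs shortest hop distances, clipped to m (unreachable pairs also get m)."""
    n = len(adj)
    D = [[m] * n for _ in range(n)]
    for s in range(n):
        D[s][s] = 0
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            if dist[u] == m:          # anything farther would be clipped to m anyway
                continue
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    D[s][v] = dist[v]
                    queue.append(v)
    return D

# Path graph 0-1-2-3 plus an isolated node 4
adj = [[1], [0, 2], [1, 3], [2], []]
D = clipped_distance_matrix(adj, m=2)
```

In this example nodes 0 and 3 are three hops apart, so their entry is clipped to 2.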

IV-B Latent space disentanglement modules

The major difference among the three alternative pipelines is the latent space disentanglement module between the transformer-based edge-augmented encoder and the style-based decoder. Generally, the latent space disentanglement module, denoted as $g\left(z \mid (X', A^{e'})\right)$, maps the augmented node and edge embeddings $X'$ and $A^{e'}$ to a dense, continuous vector $z$ of a low-dimensional stochastic latent space that follows a prior distribution $p(z)$; the disentanglement module $g\left(z \mid (X', A^{e'})\right)$ learns the parameters of the stochastic distribution. We propose three different disentanglement modules to achieve this task, all implemented in an unsupervised manner.

IV-B1 SE-VGAE with vanilla VAE module

The first option of the three proposed pipelines incorporates a traditional VAE scheme (Fig. 1-1), which serves as the baseline of our study. A Graph Isomorphism Network (GIN) [20] is utilized to integrate the augmented node and edge embeddings and produce two subsequent vectors: a mean vector $\mu$ and a standard deviation vector $\nu$. These vectors collectively contribute to the formation of the latent code vector $z$, which encapsulates the essential features and characteristics of both node and edge representations in a distilled form while capturing the underlying patterns and structures of the graph data in a condensed latent space. The obtained latent code vector $z$ is then fed into the decoder, which reconstructs the graph data from the latent representation, allowing the model to learn representations of the original graph data by comparing the original graph with its reconstructed counterpart. By employing this conventional VAE scheme as our baseline, we establish a fundamental framework against which we can compare the effectiveness and efficiency of the other proposed pipelines in disentangling the latent embedding space. This baseline provides a crucial reference point for evaluating the proposed alternative pipelines. Please refer to Appendix A for the pseudo-code of the vanilla VAE module.
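How $\mu$ and $\nu$ combine into $z$ is given in the pseudo-code of Appendix A; the sketch below only illustrates the standard VAE reparameterisation and KL regulariser, under the assumptions that the prior $p(z)$ is a standard normal and that $\nu$ holds positive standard deviations. The function names are hypothetical.

```python
import numpy as np

def sample_latent(mu, nu, rng):
    """Reparameterised sample z = mu + nu * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + nu * eps

def kl_to_standard_normal(mu, nu):
    """KL( N(mu, diag(nu^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(mu**2 + nu**2 - 1.0 - 2.0 * np.log(nu))

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 0.0])   # toy mean vector
nu = np.array([1.0, 0.5, 2.0])    # toy standard deviations (must be positive)
z = sample_latent(mu, nu, rng)
kl = kl_to_standard_normal(mu, nu)
```

Writing the sample as a deterministic function of $\mu$, $\nu$, and external noise is what lets gradients flow from the decoder's reconstruction loss back into the encoder.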

IV-B2 SE-VQ-VGAE with Vector Quantization module

The second framework option employs a Vector Quantization (VQ) scheme [38] (Fig. 1-2), a method particularly adept at modelling probability density functions through the distribution of prototype vectors. It operates by encoding values from a multidimensional vector space into a finite set of discrete values that exist within a subspace of lower dimensions, allowing for a more structured and compact representation of the graph data in the latent space. A critical aspect of the VQ scheme is that it outputs discrete latent codes rather than the continuous latent codes produced by the baseline scheme. This shift from continuous to discrete representation can simplify the latent space, potentially making it easier for the model to learn and capture the essential features of the graph data. The major difference between the VQ-based disentanglement module and the vanilla VAE module is the post-processing of the latent code $z$, namely projecting $z$ from the continuous latent space $Z$ into a discrete latent space $Z_k$. Specifically, after obtaining the latent code $z$ using the process provided in Algorithm 1, an intermediate embedding space $K \in \mathbb{R}^{k \times d}$, with $k$ being the size of the discrete latent space, is used to find the nearest neighbour of $z$, namely, its discrete counterpart $z_k$ in the discrete latent space $Z_k$. Concretely, we compute the posterior categorical distribution $q(z_k \mid z)$ with

$$q(z_k \mid z) = \begin{cases} 1, & k = \arg\min_{i} \lVert z - k_i \rVert_2 \\ 0, & \text{otherwise} \end{cases}$$

where $i \in \{1, 2, \ldots, k\}$. The quantised latent code $z_k$ is then used as the input of the subsequent decoder.
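The nearest-neighbour lookup can be sketched in a few lines of NumPy. The codebook values below are made up for illustration, the function name `quantise` is hypothetical, and how the codebook is trained is not covered by this sketch.

```python
import numpy as np

def quantise(z, codebook):
    """Return the index, vector, and one-hot posterior of the codebook entry nearest to z."""
    distances = np.linalg.norm(codebook - z, axis=1)   # ||z - k_i||_2 for every entry i
    index = int(np.argmin(distances))
    one_hot = np.zeros(len(codebook))
    one_hot[index] = 1.0                               # q(z_k | z) as a one-hot vector
    return index, codebook[index], one_hot

# Toy codebook K in R^{k x d} with k = 3 entries of dimension d = 2
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
z = np.array([0.9, 1.2])
index, z_k, q = quantise(z, codebook)
```

Here `z` lies closest to the second codebook entry, so `z_k` is `[1.0, 1.0]` and the posterior places all of its mass on that entry.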

By introducing the VQ scheme as the second option for latent embedding space disentanglement, we can explore the potential benefits of a discrete latent space in graph representation learning and the generation of architectural layout design. This approach not only provides an alternative perspective to the continuous latent space option but may also enhance our understanding of how different latent space representations can impact the overall performance and effectiveness of the model in capturing and interpreting architectural design graph data.

IV-B3 SE-NED-VGAE with Node-edge co-disentanglement module

The third option capitalizes on the node-edge co-disentanglement mechanism (NED) [12] (Fig. 1-3) to separate the intricate interplay of features at different levels of the graph using three specialized sub-encoders: a node encoder, an edge encoder, and a node-edge co-encoder. The node encoder focuses on extracting and understanding the features and characteristics unique to individual nodes within the graph, the edge encoder is responsible for capturing and representing the features and properties associated with the edges in the graph, and the node-edge co-encoder works to integrate and comprehend the combined information from both nodes and edges, thereby capturing the overall structural and relational dynamics of the graph. By employing this tripartite encoding strategy, the third option works to model the complex relationships between nodes and edges and disentangle the intertwined node and edge features within the graph. Please refer to Appendix A for the pseudo-code of the NED-based disentanglement module.

IV-C Style-based decoder

The style-based decoder of our proposed model frameworks is constructed by incorporating the layer-wise stochasticity mechanism, a feature decoding strategy initially proposed by StyleGAN [35, 40] for generating image data. This mechanism has proved effective in generating high-quality and diverse imagery by introducing stochastic variations at different layers of the network [39, 8]. By grafting the layer-wise stochasticity feature decoding strategy from StyleGAN into our proposed framework, we aim to enhance the model’s capability to generate rich, diverse, and realistic architectural graph representations of both node and edge features. This not only leverages the strengths of StyleGAN’s generative capabilities but also tailors them to the specific needs of graph representation learning, which is crucial for capturing the complexity and diversity inherent in the architectural layout design graph data.

Our proposed style-based decoder is designed to include two sub-decoders: a node decoder and an edge decoder. These two sub-decoders are composed of node-transposed (1D) and edge-transposed (2D) convolution layers to decode node and edge representations and generate the features for nodes and edges simultaneously. The node decoder is responsible for reconstructing the node features, taking the latent representation provided by the encoder and translating it back into the node-specific attributes of the graph. At the same time, the edge decoder focuses on the reconstruction of edge features, converting the latent representation provided by the encoder into meaningful edge attributes of the graph. Concretely, the node decoder, denoted as $f'_n\left(X'' \mid z_n\right)$, accepts a latent vector $z_n \sim p(z)$ and produces the reconstructed/generated node feature matrix $X'' \in \mathbb{R}^{n \times d}$, while the edge decoder, denoted as $f'_e\left(A^{e''} \mid z_e\right)$, accepts a latent vector $z_e \sim p(z)$ and outputs the reconstructed/generated edge feature matrix $A^{e''} \in \mathbb{R}^{n \times n \times c}$. The latent code vector $z$ can be either provided by the previous disentanglement module or sampled from the same learned stochastic distribution $p(z)$. Specifically, a predefined maximum number of nodes needs to be set in advance for the decoder to concurrently output two continuous and dense matrices, $A^{e,\pi''} \in \mathbb{R}^{n \times n \times c}$ and $\widehat{X}^{\pi''} \in \mathbb{R}^{n \times d}$ with a particular ordering $\pi$, which define the edge and node attributes of the reconstructed/generated graph $G''$.

Internally, both the node decoder and the edge decoder are composed of a non-linear 8-layer Multi-layer Perceptron (MLP) mapping network $f'_1$ and a synthesis network $f'_2$. Each sub-decoder takes the latent vector $z \in \mathbb{R}^{M}$ from the input latent space $Z$ of dimension $M$ and processes it through the mapping network $f'_1: Z \rightarrow W$ to obtain a transformed latent vector $w \in \mathbb{R}^{M}$ in an intermediate latent space $W$ of the same dimensionality. This structural configuration adheres to the established convention outlined in StyleGAN [35, 40]. The transformed latent vector $w$ is further processed with a learned affine transformation using a fully connected layer [35, 40]. The resulting affine-transformed versions of $w$ then serve as synthesis control factors by being fed to different convolution layers. Specifically, at each synthesis layer $l$ of the synthesis network $f'_2$, we have:

$$\widehat{x}_l = x_l \odot \widehat{w}_l \tag{11}$$

$$\text{where } \widehat{w}_l = \mathrm{Affine}(w) \tag{12}$$

and for the node decoder, we have:

$$X_{l+1}=\mathrm{Conv1D}\left(\widehat{X}_{l}\right) \tag{13}$$

where $X_{l},\widehat{X}_{l},X_{l+1}\in\mathbb{R}^{c\times n_{l}}$, with $n_{l}$ being the ‘resolution’ (i.e., number of ‘super nodes’) of the synthesized graph node feature matrix at layer $l$, and $\mathrm{Conv1D}$ refers to a 1-dimensional convolution layer. As for the edge decoder, we have:

$$A_{l+1}=\mathrm{Conv2D}\left(\widehat{A}_{l}\right) \tag{14}$$

where $A_{l},\widehat{A}_{l},A_{l+1}\in\mathbb{R}^{d\times n_{l}\times n_{l}}$, and $\mathrm{Conv2D}$ refers to a 2-dimensional convolution layer.
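
The modulation step in Eqs. 11 to 13 can be illustrated with a minimal NumPy sketch. It assumes that the affine output $\widehat{w}_{l}$ holds one coefficient per channel and is broadcast across the $n_{l}$ positions, and it uses a kernel-size-1 convolution as a stand-in for Conv1D. The helper names, shapes, and random weights are illustrative only and are not taken from the authors' implementation.

```python
import numpy as np

def affine(w, weight, bias):
    """Learned affine map (Eq. 12): latent w of size M -> one style value per channel."""
    return weight @ w + bias

def modulate(x, style):
    """Eq. 11: scale each channel of x (shape c x n_l) by its style coefficient.
    Broadcasting the style over positions is an assumption of this sketch."""
    return x * style[:, None]

def pointwise_conv1d(x, kernel):
    """Kernel-size-1 Conv1D (assumed stand-in for Eq. 13): mixes channels at every position."""
    return kernel @ x

rng = np.random.default_rng(0)
M, c, n_l = 8, 4, 5                      # hypothetical sizes
w = rng.normal(size=M)                   # output of the mapping network
x_l = rng.normal(size=(c, n_l))          # node feature map at layer l
A_w, A_b = rng.normal(size=(c, M)), np.ones(c)

x_hat = modulate(x_l, affine(w, A_w, A_b))
x_next = pointwise_conv1d(x_hat, rng.normal(size=(c, c)))
```

The edge decoder would apply the same modulation to a tensor of shape $d\times n_{l}\times n_{l}$ before a 2-dimensional convolution.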

V Model implementation and training

This section details the training datasets and the training implementation used in our experiments.

V-A Training datasets

TABLE I: Two sets of graph datasets with different numbers of architectural element categories
Number of categories | 6 | 25
Node labels | outdoor, room, stair | outdoor, room, stair, corridor, elevator, escalator, facilities, furniture, greenery, ladder, lavatory, parking, pillar, pool, terrace, skylight, slope, steps, void
Edge labels | wall, door, window | wall, door, window, cased opening, fence, movable partition

Examining the performance of the proposed disentangled graph representation learning framework requires access to large-scale layout graph databases with quasi-exhaustive coverage and enough instances to cover the diversity of the real-world architectural layout design data space. However, large-scale attributed adjacency graph datasets of real-world architectural designs are scarce in the current literature. Although graphs can be extracted more accurately from 3D building models, real-world 3D architectural design models are difficult to obtain in bulk. By contrast, traditional orthographic drawings of existing architectural designs, such as floor plans, are fairly easy to acquire from the internet, which makes it possible to extract attributed adjacency graphs from real-world floor plan images and construct large-scale architectural layout design graph datasets. In this study, we use the floor plan image parsing methods provided in [1] and [41] to extract attributed adjacency graphs from a customized repository of real-world architectural floor plan images of 159 architectural categories retrieved from ArchDaily®, a professional architectural design project website. The distribution of extracted layout graphs per architectural category is illustrated in Fig. 2. Specifically, we have curated two graph datasets with different numbers of architectural element categories (Table I): the first is constructed using the ensemble-based supervised floor plan parsing scheme provided by [1] with 6 architectural element categories, while the second is constructed using the semi-weakly-supervised scheme offered in [41] with 25 architectural element categories.

Figure 2: Distribution of attributed adjacency graphs per architectural category (159 categories in total). The vertical axis is scaled according to the logarithm of the number of images.

We define a concatenated set of baseline node features composed of three major components. 1) The node class represents the categorical classification of each node (as shown in Table I), providing essential information about the type or function of the corresponding element in the architectural layout graph; the class labels are transformed into one-hot encodings. 2) The space area ratio quantifies the proportion of the total layout area occupied by the original polygon of the node. 3) The normalized coordinates of the original space polygon centre offer spatial information about the positioning of each space within the overall architectural layout; the normalization ensures consistency and comparability across different graphs. Regarding edge labels, the two graph datasets with different numbers of element categories also provide different sets of edge labels, as shown in Table I. We transform the edge labels into one-hot encodings for training purposes, denoting the connections between two neighbouring nodes. This feature is instrumental in capturing the relationships and interactions among the various spaces of an architectural layout graph. Together, these node and edge features offer a rich and detailed representation of architectural layout data, enabling the proposed disentangled graph representation learning framework to learn and interpret the complex and nuanced characteristics of architectural layout designs.
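
As a rough illustration of how the three baseline node feature components could be concatenated, consider the following sketch. The function names, the use of the vertex mean as the polygon centre, and the normalization of coordinates by the layout bounding box are assumptions made for this example; the text above does not specify these details.

```python
# Node labels of the 6-category dataset (Table I).
NODE_CLASSES = ["outdoor", "room", "stair"]

def one_hot(label, classes):
    """One-hot encoding of a categorical label."""
    return [1.0 if label == c else 0.0 for c in classes]

def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    for i in range(len(vertices)):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % len(vertices)]
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0

def polygon_centre(vertices):
    """Mean of the vertices (assumed definition of the polygon centre)."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return sum(xs) / len(xs), sum(ys) / len(ys)

def node_features(label, vertices, total_area, bbox):
    """Concatenate class one-hot, area ratio, and normalized centre coordinates."""
    min_x, min_y, max_x, max_y = bbox  # layout bounding box (assumed normalizer)
    cx, cy = polygon_centre(vertices)
    return one_hot(label, NODE_CLASSES) + [
        polygon_area(vertices) / total_area,
        (cx - min_x) / (max_x - min_x),
        (cy - min_y) / (max_y - min_y),
    ]

# A 2 x 2 room inside a 4 x 4 layout.
feats = node_features("room", [(0, 0), (2, 0), (2, 2), (0, 2)],
                      total_area=16.0, bbox=(0, 0, 4, 4))
```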

With these tailored datasets, we embark on experiments to extract high-level structural features from the attributed adjacency graphs and probe into the potential for interpreting and navigating the latent design representation space that the architectural graphs may reveal using the disentangled graph representation learning frameworks proposed in this study. Please refer to Appendix D for the visualization of the graph extraction process, samples of attributed adjacency graphs and other relevant details.

V-B Feature augmentation

We experiment with a series of attributed adjacency multi-graph feature augmentation schemes and examine their impact on model performance.

V-B1 Augmentation with canonical ordering

Ideally, we seek to maximize the likelihood $p(G)=\sum_{\pi}P(G,\pi)$. However, this computation is infeasible, as the total number of node orderings is factorial in the number of nodes ($n!$). We deal with this issue by applying a selected set of canonical orderings $K=\{\pi_{1},\ldots,\pi_{k}\}$ to the input graphs, chosen according to two criteria: each ordering in the set leads to a unique permutation, ensuring that no two orderings encode the same structural information redundantly, and the selected orderings should collectively capture the essential variations in graph structures. While this approach simplifies the issue, it learns an approximated lower bound (ELBO) of the true likelihood $\sum_{\pi}P(G,\pi)$, namely $\sum_{\pi\in K}P(G,\pi)$. The quality of this approximation depends on how well the chosen set of canonical orderings $K$ represents the space of all orderings. Specifically, we chose the following canonical node orderings anchored in graph properties: 1) arranging nodes in descending order of node degree, 2) sorting by average neighbor degree, 3) sequencing by closeness centrality, and 4) sequencing by betweenness centrality.
Concretely, we have $p(G)\geq\sum_{\pi\in K}P(G,\pi)>P(G,\pi),\ \forall\pi\in K$. By maximizing $\log\sum_{\pi\in K}P(G,\pi)$ for a given graph $G$, we implicitly select the optimal combination of node orderings from the set $K$ and maximize the likelihood of observing $G$ under the learned distribution using this optimal ordering [34].
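
The sketch below shows how two of these canonical orderings (node degree and average neighbor degree) might be computed and applied to an adjacency matrix and a node feature matrix. The function names, the descending direction for average neighbor degree, and the stable tie-breaking by original node index are assumptions of this example. The closeness and betweenness centrality orderings would follow the same pattern with a different score and are omitted here.

```python
import numpy as np

def degree_order(adj):
    """Node indices sorted by descending degree (ties keep original order)."""
    deg = (adj > 0).sum(axis=1)
    return np.argsort(-deg, kind="stable")

def avg_neighbor_degree_order(adj):
    """Node indices sorted by descending average degree of their neighbors."""
    binary = (adj > 0).astype(float)
    deg = binary.sum(axis=1)
    # Isolated nodes get a score of 0 instead of dividing by zero.
    avg = np.divide(binary @ deg, deg, out=np.zeros_like(deg), where=deg > 0)
    return np.argsort(-avg, kind="stable")

def apply_ordering(adj, feats, perm):
    """Permute rows and columns of the adjacency matrix and rows of the features."""
    return adj[np.ix_(perm, perm)], feats[perm]

# Toy graph with edges (0,1), (1,2), (1,3), (2,3).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = np.arange(4, dtype=float)[:, None]   # placeholder node features

perm_deg = degree_order(A)
perm_and = avg_neighbor_degree_order(A)
A_deg, X_deg = apply_ordering(A, X, perm_deg)
```

Each ordering yields one permuted copy of the same graph, so a set $K$ of $k$ orderings provides up to $k$ training views per input graph.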

V-B2 Augmentation with positional encoding

Conventional graph message-passing approaches are usually unaware of the nodes’ different structural roles, as all nodes are treated equally when performing local operations. Despite the initial intuition that neural networks would discover these roles given deeper model structures, it has been shown that vanilla graph neural networks are ill-suited for this purpose and are blind to the existence of structural properties [42]. Positional encoding can therefore play a pivotal role in graph representation learning by embedding global positional information within individual nodes. This information is essential for distinguishing isomorphic nodes and edges, enhancing the model’s capacity to capture and represent complex graph structures and relationships. Hussain et al. [36] proposed a form of positional encoding that leverages the pre-calculated Singular Value Decomposition (SVD) of the graph adjacency matrix. It uses the largest $k$ singular values and their corresponding left and right singular vectors to construct the positional encoding. This approach offers a robust mechanism for encoding positional information across diverse graphs, serving as an absolute global coordinate system. This study explores the SVD-based node positional encoding scheme as an additional feature augmentation.
Concretely, given the adjacency matrix $A\in\mathbb{R}^{n\times n}$ of an input attributed adjacency multi-graph $G$, we have $A\overset{\mathrm{SVD}}{=}U\cdot S\cdot V^{H}$, with $U\in\mathbb{R}^{n\times n}$ and $V^{H}\in\mathbb{R}^{n\times n}$ being 2D unitary matrices and $H$ denoting the Hermitian transpose; the rows of $V^{H}$ are the eigenvectors of $A^{H}A$, while the columns of $U$ are the eigenvectors of $AA^{H}$. $S\in\mathbb{R}^{n\times n}$ is a diagonal matrix whose principal diagonal holds the singular values of $A$ sorted in descending order; this diagonal forms the 1D vector $s\in\mathbb{R}^{n}$. The SVD-based node positional encoding is then calculated as,

$$\widehat{\Gamma}=\mathrm{FFN}\left(\left(U\,\|\,{V^{H}}^{T}\right)\odot\sqrt{s}\right),\quad\widehat{\Gamma}\in\mathbb{R}^{n\times d} \tag{15}$$

with $\|$ being the concatenation operation along the columns, and the feed-forward network (FFN) layer providing a learnable projection before the positional encoding is integrated into the node features. This heuristic approach has been shown to yield improved results [36]. By incorporating the SVD positional encoding, we aim to enrich the node features with nuanced structural information regarding their relationships within the attributed adjacency multi-graphs. This enhancement aims to bolster the framework’s expressiveness in capturing position-sensitive layout design patterns, thereby improving its overall performance in disentangling and interpreting the architectural layout design graph data space.
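
A minimal NumPy sketch of the SVD-based positional encoding is given below. It follows one reading of Eq. 15 in which the left and right singular vectors of the top $k$ singular values are each scaled by $\sqrt{s}$ before concatenation. The function name and the random linear projection that stands in for the learnable FFN are hypothetical, and the signs of singular vectors returned by an SVD routine are not unique.

```python
import numpy as np

def svd_positional_encoding(adj, k):
    """Return an n x 2k encoding built from the top-k singular triplets of adj."""
    u, s, vh = np.linalg.svd(adj)        # singular values come sorted in descending order
    scale = np.sqrt(s[:k])
    left = u[:, :k] * scale              # left singular vectors scaled by sqrt(s)
    right = vh[:k, :].T * scale          # right singular vectors scaled by sqrt(s)
    return np.concatenate([left, right], axis=1)

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

k, d = 2, 3                              # hypothetical sizes
pe = svd_positional_encoding(A, k)       # shape (4, 2k)

# Stand-in for the learnable FFN projection (random weights, illustration only).
rng = np.random.default_rng(0)
gamma = pe @ rng.normal(size=(2 * k, d)) # shape (4, d)

# With k = n the two halves reproduce the adjacency matrix exactly.
full = svd_positional_encoding(A, A.shape[0])
reconstruction = full[:, :4] @ full[:, 4:].T
```

The reconstruction check illustrates why the $\sqrt{s}$ scaling is natural: the product of the two scaled halves recovers the (rank-$k$ approximation of the) adjacency matrix.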

V-B3 Augmentation with extra polygon vertices information

We also augment the node features with supplementary information on the coordinates of polygon vertices, incorporating the normalized coordinates of polygon boundary vertices as extra node features. The rationale behind this augmentation is to enrich the node representation with finer geometric details, which may improve the model’s capacity to comprehend and depict the complex nuances inherent in architectural layout designs. With this additional information, we further explore the model’s capability to capture and articulate the intricate architectural features embedded within the layout design graphs.

V-C Training implementation

We experiment with various training implementation schemes and disentanglement module variations of the proposed framework. The aim is to explore how these choices influence the model’s ability to learn and represent the complexities of architectural layout design graph data, and to assess their impact on representation disentanglement.

V-C1 Dimensionality of intermediate latent space

The latent space in graph representation learning serves as a compressed representation of the input data, capturing the essential features and patterns in a lower-dimensional form. The choice of dimension for this latent space is a balance between complexity and expressiveness. A higher-dimensional latent space can potentially capture more detailed and nuanced information about the graph, leading to richer and more accurate representations, but it also increases computational complexity. Conversely, a lower-dimensional latent space may be computationally more efficient but might not capture the full complexity of the data, potentially leading to less accurate representations.

We experiment with the dimensionality of the intermediate latent spaces $\textbf{Z}\in\mathbb{R}^{M}$ and $\textbf{W}\in\mathbb{R}^{M}$, and examine whether the latent space dimension $M$ significantly influences model performance. Our experimentation involves varying the dimensions of the latent spaces and observing the resulting effects on the model’s performance. Key performance metrics measuring fidelity and diversity are assessed across the different latent space dimensions. This enables us to determine the size of the latent space that best balances expressiveness and computational efficiency while maximizing the performance of the disentangled graph representation learning model.

V-C2 Number of architectural element label categories

We also take into account the number of architectural element label categories used during training, using the two curated training graph datasets, each featuring a distinct number of architectural element categories: one dataset encompasses 6 categories, while the other comprises 25. Our objective is to investigate whether a larger number of architectural element label categories provides more intricate and pertinent information for graph representation, potentially influencing the model’s performance. Through this evaluation, we aim to discern the impact of the number of label categories on the efficacy of the model and its ability to capture the nuances of architectural layout designs accurately.

V-C3 Latent space disentanglement module

We experiment with the different proposed disentanglement modules, including the vector quantization scheme, the node-edge co-disentanglement scheme, and the layer-wise stochasticity mechanism incorporated in the style-based decoder. Both the vector quantization scheme and the node-edge co-disentanglement scheme are evaluated against the baseline vanilla VAE scheme. To assess the performance of the style-based decoder, we compare it against a vanilla MLP decoder comprising 2 fully connected layers, which also includes two sub-decoders. The MLP node decoder performs the feature transformation $X''=l_{n}(pr(l^{nn}_{i\rightarrow n*d}(l_{n}(pr(l^{nn}_{M\rightarrow M}(z))))))$ and the MLP edge decoder has $A^{e''}=l_{n}(pr(l^{nn}_{i\rightarrow n^{2}*d}(l_{n}(pr(l^{nn}_{M\rightarrow M}(z))))))$. Each layer in the MLP stack is immediately followed by a PReLU activation layer ($pr$) [43] and a layer normalization layer ($l_{n}$) [44]. Through these experiments, we aim to analyze the effectiveness of different disentanglement modules and decoder architectures in capturing and representing the latent generative factors of architectural layout design graphs.
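
For orientation, the baseline MLP node decoder described above can be sketched in NumPy as two fully connected layers, each followed by PReLU and layer normalization, with the output reshaped to $n\times d$. The fixed PReLU slope of 0.25, the layer normalization without learned gain and bias, and the random weights are simplifying assumptions of this sketch rather than details given in the text.

```python
import numpy as np

def prelu(x, alpha=0.25):
    """PReLU with a fixed slope (the slope is learnable in practice)."""
    return np.where(x >= 0, x, alpha * x)

def layer_norm(x, eps=1e-5):
    """Layer normalization over the feature vector, without learned gain and bias."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def mlp_node_decoder(z, w1, b1, w2, b2, n, d):
    """Two-layer MLP decoder: M -> M -> n*d, reshaped to an n x d node feature matrix."""
    h = layer_norm(prelu(w1 @ z + b1))
    out = layer_norm(prelu(w2 @ h + b2))
    return out.reshape(n, d)

rng = np.random.default_rng(0)
M, n, d = 16, 5, 6                        # hypothetical sizes
z = rng.normal(size=M)
X_rec = mlp_node_decoder(
    z,
    rng.normal(size=(M, M)), np.zeros(M),
    rng.normal(size=(n * d, M)), np.zeros(n * d),
    n, d,
)
```

The edge sub-decoder would differ only in its output size, projecting to $n^{2}$ times the edge feature dimension before reshaping.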

V-D Losses

The design of appropriate loss functions is essential to ensure that the representation remains disentangled while retaining the information inherent in the data [45]. The components of the loss function L𝐿Litalic_L used in this study can be categorised into two parts based on their distinct purposes: reconstruction loss and disentanglement loss. The reconstruction loss is crucial in generation tasks for preserving data integrity by encouraging the accurate reconstruction of the original data, which ensures that learned disentangled representations are semantically meaningful. The disentanglement loss is specifically designed to enforce the separation of the representation, ensuring that each part of the disentangled representation corresponds to unique and independent aspects of the data. These two loss function parts work together to ensure a harmonious balance between maintaining the quality and integrity of the data and effectively achieving disentanglement.

Specifically, we have two respective reconstruction losses for the reconstruction of the node feature matrix and the edge feature matrix. For the node feature matrix reconstruction, we have:

$$L_{node}=-\sum_{v=1}^{n}\sum_{i=1}^{d}\left[\overline{x_{v}^{i}}\log(x_{v}^{i})+(1-\overline{x_{v}^{i}})\log(1-x_{v}^{i})\right] \tag{16}$$

where $\overline{x_{v}^{i}}$ is the reconstructed feature value at dimension $i$ of node $v$ and $x_{v}^{i}$ is the corresponding ground truth value. Similarly, for the edge feature matrix reconstruction, we have:

$$L_{edge}=-\sum_{u=1}^{n}\sum_{v=1}^{n}\sum_{i=1}^{c}\left[\overline{a_{u,v}^{i}}\log(a_{u,v}^{i})+(1-\overline{a_{u,v}^{i}})\log(1-a_{u,v}^{i})\right] \tag{17}$$

where $\overline{a_{u,v}^{i}}$ is the reconstructed feature value at dimension $i$ of edge $(u,v)$ and $a_{u,v}^{i}$ is the corresponding ground truth value. Consequently, we have the total reconstruction loss $L_{rec}$ as:

$$L_{rec}=L_{node}+L_{edge} \tag{18}$$
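
The reconstruction loss can be sketched in plain Python as a sum of element-wise binary cross-entropy terms. The sketch assumes the usual cross-entropy convention, in which the logarithm is applied to the reconstructed values and the ground-truth values act as weights, and it clamps predictions away from 0 and 1 for numerical safety. The function names and the clamp value are illustrative choices.

```python
import math

def bce(target, pred, eps=1e-7):
    """Summed binary cross-entropy between two equal-length feature vectors."""
    total = 0.0
    for t, p in zip(target, pred):
        p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total

def reconstruction_loss(x_true, x_pred, a_true, a_pred):
    """L_rec = L_node + L_edge for nested lists of shape n x d and n x n x c."""
    l_node = sum(bce(t, p) for t, p in zip(x_true, x_pred))
    l_edge = sum(
        bce(t, p)
        for row_t, row_p in zip(a_true, a_pred)
        for t, p in zip(row_t, row_p)
    )
    return l_node + l_edge

# One node with two features and one edge slot with one feature.
loss = reconstruction_loss([[1.0, 0.0]], [[0.5, 0.5]], [[[1.0]]], [[[0.5]]])
```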

Meanwhile, as the proposed frameworks are VAE-based, we adopt the Kullback-Leibler (KL) divergence to regularize the disentangled latent space by quantifying the distance between the estimated posterior distributions and the isotropic Gaussian prior, and we minimize the KL divergence loss $L_{KL}$ accordingly. Specifically, for the baseline framework with the vanilla VAE module, we have

\begin{align}
L_{KL} &= D_{KL}\left(\log q_{\theta}\left(z|(X,A^{e})\right)\,\|\,p(z)\right) \tag{19}\\
&= -\frac{1}{2M}\sum_{m=1}^{M}\left(1+z_{\sigma}^{m}-(z_{\mu}^{m})^{2}-e^{2z_{\sigma}^{m}}\right) \tag{20}
\end{align}

where $p(z)$ is the isotropic Gaussian prior $\mathcal{N}(0,\textbf{I})$, $q_{\theta}\left(z|(X,A^{e})\right)$ is the estimated posterior distribution, and $z_{\mu}^{m}$ and $z_{\sigma}^{m}$ are respectively the estimated mean value and log-variance at latent dimension $m$. For the framework with the NED-based disentanglement module, we have

\begin{align}
L_{KL}^{NED} &= D_{KL}\left(\log q_{\theta}\left(z_{graph}|(X,A^{e})\right)\,\|\,p(z)\right) \tag{21}\\
&\quad + D_{KL}\left(\log q_{\theta}\left(z_{node}|(X,A^{e})\right)\,\|\,p(z)\right) \tag{22}\\
&\quad + D_{KL}\left(\log q_{\theta}\left(z_{edge}|(X,A^{e})\right)\,\|\,p(z)\right) \tag{23}\\
&= -\frac{1}{2M}\cdot \tag{24}\\
&\quad \Big(\sum_{m=1}^{M}\left(1+z_{\sigma}^{graph,m}-(z_{\mu}^{graph,m})^{2}-e^{2z_{\sigma}^{graph,m}}\right) \tag{25}\\
&\quad +\sum_{m=1}^{M}\left(1+z_{\sigma}^{node,m}-(z_{\mu}^{node,m})^{2}-e^{2z_{\sigma}^{node,m}}\right) \tag{26}\\
&\quad +\sum_{m=1}^{M}\left(1+z_{\sigma}^{edge,m}-(z_{\mu}^{edge,m})^{2}-e^{2z_{\sigma}^{edge,m}}\right)\Big) \tag{27}
\end{align}
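
As a numerical illustration of the KL terms, the sketch below evaluates the textbook closed form of the KL divergence between a diagonal Gaussian and $\mathcal{N}(0,\textbf{I})$, averaged over the $M$ latent dimensions. It assumes the encoder outputs a log-variance, in which case the exponential term is $e^{\text{logvar}}$; if the encoder instead outputs a log standard deviation $\ell$, the summand becomes $1+2\ell-\mu^{2}-e^{2\ell}$. The function names are hypothetical.

```python
import math

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), averaged over the M latent dimensions."""
    m = len(mu)
    return -0.5 / m * sum(
        1.0 + lv - mu_i ** 2 - math.exp(lv) for mu_i, lv in zip(mu, logvar)
    )

def ned_kl(graph, node, edge):
    """Sum of the graph-, node-, and edge-level KL terms; each argument is (mu, logvar)."""
    return sum(kl_to_standard_normal(mu, lv) for mu, lv in (graph, node, edge))

zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])     # posterior equals the prior
shifted = kl_to_standard_normal([1.0, 1.0], [0.0, 0.0])  # unit shift in every dimension
total = ned_kl(([1.0, 1.0], [0.0, 0.0]),
               ([1.0, 1.0], [0.0, 0.0]),
               ([1.0, 1.0], [0.0, 0.0]))
```

The term vanishes when the posterior matches the prior and grows as the mean moves away from zero or the variance departs from one.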

For the Vector Quantization module, the distance between the embedding vectors $k$ and the latent codes $z$ is optimized using Mean Squared Error (MSE). Specifically, the VQ loss $L_{VQ}$ is composed of two parts: the dictionary learning component and the commitment loss component. The former aligns the embedding vectors $k$ towards the latent codes $z$, thereby refining the quantization dictionary. The latter guarantees that the latent codes $z$ reliably correspond to specific embeddings $k$ within the quantization dictionary, preventing unchecked dimensionality expansion of the intermediate embedding space $K$. Concretely, we have:

$$L_{VQ}=\sum_{i=1}^{K}(const(z_{k}^{i})-k^{i})^{2}+\sum_{i=1}^{K}(z_{k}^{i}-const(k^{i}))^{2} \qquad (28)$$

where $const$ denotes the operation of detaching a variable from the computational graph, which renders it a constant during optimization. The first term on the RHS of equation 28 is the dictionary learning loss, and the second term is the commitment loss. When both the Vector Quantization module and the Node-edge co-disentanglement module are implemented, we have:

$$
\begin{aligned}
L_{VQ}^{NED} &= \sum_{i=1}^{K}(const(z_{k}^{node+graph,i})-k^{node+graph,i})^{2} && (29)\\
&+\sum_{i=1}^{K}(z_{k}^{node+graph,i}-const(k^{node+graph,i}))^{2} && (30)\\
&+\sum_{i=1}^{K}(const(z_{k}^{edge+graph,i})-k^{edge+graph,i})^{2} && (31)\\
&+\sum_{i=1}^{K}(z_{k}^{edge+graph,i}-const(k^{edge+graph,i}))^{2} && (32)
\end{aligned}
$$

which incorporates the vector quantization of $z^{node+graph}$ and $z^{edge+graph}$, namely the node and edge-level latent codes fused with the graph-level latent code.
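
As a hedged aside, the role of $const$ in equation 28 can be read off from the gradients of a single summand. The derivation below assumes that $const$ acts as a stop-gradient operator (identity in the forward pass, zero derivative in the backward pass), which is how the description above reads; the names $\ell_{dict}^{i}$ and $\ell_{commit}^{i}$ are introduced here for illustration only.

$$
\begin{aligned}
\ell_{dict}^{i} &= (const(z_{k}^{i})-k^{i})^{2}, &\quad \frac{\partial \ell_{dict}^{i}}{\partial k^{i}} &= -2\,(z_{k}^{i}-k^{i}), &\quad \frac{\partial \ell_{dict}^{i}}{\partial z_{k}^{i}} &= 0,\\
\ell_{commit}^{i} &= (z_{k}^{i}-const(k^{i}))^{2}, &\quad \frac{\partial \ell_{commit}^{i}}{\partial z_{k}^{i}} &= 2\,(z_{k}^{i}-k^{i}), &\quad \frac{\partial \ell_{commit}^{i}}{\partial k^{i}} &= 0.
\end{aligned}
$$

Under this reading, a gradient-descent step moves $k^{i}$ towards $z_{k}^{i}$ through the dictionary learning term only, and moves $z_{k}^{i}$ towards $k^{i}$ through the commitment term only. The same reading applies to each pair of terms in equations 29 to 32.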

For different framework implementation schemes, we apply different combinations of the loss function components; details are given in Table II.

TABLE II: Different combinations of loss function components for various framework implementation schemes. “EA-encoder” stands for Edge-augmented encoder; “VAE”, “VQ”, and “NED” refer to the different disentanglement modules introduced in this study; “Style-decoder” indicates the style-based decoder; and “MLP-decoder” refers to the vanilla MLP decoder used for comparison.

| Framework scheme | Objective function |
| --- | --- |
| EA-encoder + VAE + Style-decoder | $L_{rec}+L_{KL}$ |
| EA-encoder + VQ + Style-decoder | $L_{rec}+L_{KL}+L_{VQ}$ |
| EA-encoder + NED + Style-decoder | $L_{rec}+L_{KL}^{NED}$ |
| EA-encoder + VQ + NED + Style-decoder | $L_{rec}+L_{KL}^{NED}+L_{VQ}$ |
| EA-encoder + VAE + MLP-decoder | $L_{rec}+L_{KL}$ |
| EA-encoder + VQ + MLP-decoder | $L_{rec}+L_{KL}+L_{VQ}$ |
| EA-encoder + NED + MLP-decoder | $L_{rec}+L_{KL}^{NED}$ |
| EA-encoder + VQ + NED + MLP-decoder | $L_{rec}+L_{KL}^{NED}+L_{VQ}$ |
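
The pattern in Table II can be summarised in a few lines of Python. This is only a sketch of the table's logic: the function name `objective_terms` and the string labels are hypothetical, and the snippet selects loss names rather than computing any loss.

```python
from typing import List


def objective_terms(vq: bool, ned: bool) -> List[str]:
    """Return the loss components that Table II lists for one framework scheme.

    The decoder type (style-based or MLP) is not an argument because Table II
    lists the same objective for both decoders.
    """
    terms = ["L_rec"]
    # NED replaces the plain KL term with the node-edge co-disentanglement KL term.
    terms.append("L_KL^NED" if ned else "L_KL")
    # VQ adds the vector quantization loss on top of the other terms.
    if vq:
        terms.append("L_VQ")
    return terms
```

For example, `objective_terms(vq=True, ned=False)` yields the three terms of the "EA-encoder + VQ" rows.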

VI Empirical Experiments

This section introduces our experimental setup and discusses our experiments’ quantitative and qualitative evaluation results.

VI-A Experimental setup

All experiments of this study are performed with PyTorch [46] using the following system setup:

  • Operating system: Red Hat Enterprise Linux release 8.4 (Ootpa).

  • CPU: AMD EPYC 7713P 64-Core Processor.

  • GPU: 1x NVIDIA A100-SXM4-40GB.

  • RAM: 500GB DDR4 ECC RAM

We initialize the model weights using the normal initialization method. The edge-augmented encoder comprises a total of 8.48 million trainable parameters. The vanilla VAE disentanglement module consists of 1.32 million trainable parameters, while the VQ-based disentanglement module and the NED-based disentanglement module have 1.58 million and 12.34 million trainable parameters, respectively. The style-based node decoder contains 14.04 million trainable parameters, whereas the style-based edge decoder has 29.34 million trainable parameters. The vanilla MLP node decoder comprises 1.71 million trainable parameters, while the MLP edge decoder comprises 50.89 million. We set the maximum number of nodes ($n$) to 128 in our experiments, a value calculated based on the characteristics of the training datasets.
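
As a rough arithmetic aid, the per-module counts above can be added up per pipeline. The sketch below assumes that a pipeline consists of the encoder, exactly one disentanglement module, and one node decoder plus one edge decoder, with no shared weights; the source does not state totals, so these sums are illustrative only, and the names used are hypothetical.

```python
# Trainable parameters in millions, copied from the paragraph above.
ENCODER = 8.48
DISENTANGLE = {"VAE": 1.32, "VQ": 1.58, "NED": 12.34}
DECODER = {
    "Style": {"node": 14.04, "edge": 29.34},
    "MLP": {"node": 1.71, "edge": 50.89},
}


def total_params_millions(module: str, decoder: str) -> float:
    """Sum the encoder, one disentanglement module, and both decoders.

    Assumption (not from the source): these four parts make up the whole
    pipeline and no parameters are shared between them.
    """
    dec = DECODER[decoder]
    total = ENCODER + DISENTANGLE[module] + dec["node"] + dec["edge"]
    return round(total, 2)
```

Under these assumptions, the VAE pipeline with the style-based decoder comes to about 53.18 million parameters, and the same pipeline with the MLP decoder to about 62.40 million, the difference being driven by the larger MLP edge decoder.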

VI-B Model Performance

This section presents our experimental results, using both quantitative and qualitative methods to assess the efficacy of the proposed frameworks.

VI-B1 Quantitative evaluation

For quantitative evaluation of the performance of the proposed frameworks, we adopt a series of domain-agnostic, scalable and expressive evaluation metrics recommended by Thompson et al. [29], tailored for easy and accurate evaluation and ranking of graph generative models. Specifically, Fréchet Distance (FD, or FID) [47] approximates the graph embeddings as continuous multivariate Gaussians with sample mean and covariance, and the distance between the two distributions is computed as an approximate measure of sample quality. Precision & Recall (P&R) [48] decouples a generator’s quality into two distinct values to detect mode collapse and mode dropping, constructing manifolds by extending a radius from each sample in a set to its $k^{th}$ nearest neighbour to form hyperspheres. The union of these hyperspheres represents a manifold: precision measures the percentage of generated samples within the real samples’ manifold, while recall measures the percentage of real samples within the generated samples’ manifold. The harmonic mean (“F1 PR”) of P&R, a scalar metric, can further provide meaningful decomposable values in experiments [49]. Density & Coverage (D&C) [50], developed as robust alternatives to P&R, differ in that they treat each sample’s hypersphere independently instead of creating a single manifold from the union of all hyperspheres for each set. Density is calculated based on the number of real hyperspheres that a generated sample falls within on average, while coverage is the percentage of real hyperspheres containing at least one generated sample. These hyperspheres are determined using the $k^{th}$ nearest neighbour method, as in P&R, and, just as with P&R, a scalar metric can be formed using the harmonic mean (“F1 DC”) of D&C for a comprehensive evaluation [49].
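
The hypersphere-based definitions of P&R and D&C above can be sketched in plain Python. This is a minimal illustration that follows the verbal definitions, not the evaluation code used in this study: it assumes the graph embeddings are already available as lists of floats, uses Euclidean distance with a strict inequality for sphere membership, and requires `k` to be smaller than each set size. The function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Point = Sequence[float]


def kth_nn_radius(points: List[Point], k: int) -> List[float]:
    """Distance from each point to its k-th nearest neighbour in the same set."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii


def prdc(real: List[Point], fake: List[Point], k: int = 5) -> Dict[str, float]:
    """Precision, recall, density and coverage from k-NN hyperspheres."""
    r_real = kth_nn_radius(real, k)
    r_fake = kth_nn_radius(fake, k)
    n_real, n_fake = len(real), len(fake)
    # d[i][j] is the distance between real sample i and generated sample j.
    d = [[math.dist(r, g) for g in fake] for r in real]

    # Precision: share of generated samples inside at least one real hypersphere.
    precision = sum(
        any(d[i][j] < r_real[i] for i in range(n_real)) for j in range(n_fake)
    ) / n_fake
    # Recall: share of real samples inside at least one generated hypersphere.
    recall = sum(
        any(d[i][j] < r_fake[j] for j in range(n_fake)) for i in range(n_real)
    ) / n_real
    # Density: average number of real hyperspheres containing a generated
    # sample, normalised by k.
    density = sum(
        d[i][j] < r_real[i] for i in range(n_real) for j in range(n_fake)
    ) / (k * n_fake)
    # Coverage: share of real hyperspheres containing at least one generated sample.
    coverage = sum(min(d[i]) < r_real[i] for i in range(n_real)) / n_real
    return {"precision": precision, "recall": recall,
            "density": density, "coverage": coverage}


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean used for the 'F1 PR' and 'F1 DC' scalars."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)
```

Note that density is not bounded by 1, which matches the density means above 1 reported in Table IV.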
Maximum Mean Discrepancy (MMD) [51] is a versatile measure used to quantify the dissimilarity between two sets of graphs regardless of the domain, and it can be paired with different kernel functions. The original Kernel Inception Distance (KID) applied a polynomial kernel in conjunction with MMD. MMD Linear employs a parameter-free linear kernel, offering a simpler approach, and the RBF kernel (MMD RBF) is also widely used for its effectiveness in capturing dissimilarities [52]. According to Thompson et al. [29], recall, coverage and F1 PR exhibit strong positive correlations with the diversity level, while precision and density are negatively correlated with diversity. Meanwhile, MMD RBF has slightly stronger correlations with both fidelity and diversity of generated samples compared to other metrics. In addition, MMD RBF and F1 PR are both capable of detecting changes in node and edge feature distributions.
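
To make the kernel choice concrete, the following sketch computes a squared MMD between two sets of embedding vectors with either a linear or an RBF kernel. It is an illustration under stated assumptions: it uses the simple biased estimator (averaging over all pairs, including a sample with itself), and the bandwidth `sigma` is a free parameter chosen here for illustration. The source does not specify which estimator or bandwidth was used.

```python
import math
from typing import Callable, List, Sequence

Vec = Sequence[float]
Kernel = Callable[[Vec, Vec], float]


def linear_kernel(x: Vec, y: Vec) -> float:
    """Parameter-free linear kernel: the dot product of the two vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(sigma: float) -> Kernel:
    """Build an RBF kernel with bandwidth sigma (an illustrative parameter)."""
    def kernel(x: Vec, y: Vec) -> float:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
        return math.exp(-sq_dist / (2.0 * sigma ** 2))
    return kernel


def mmd_squared(xs: List[Vec], ys: List[Vec], kernel: Kernel) -> float:
    """Biased estimate of squared MMD between the sample sets xs and ys."""
    def mean_kernel(a: List[Vec], b: List[Vec]) -> float:
        return sum(kernel(p, q) for p in a for q in b) / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

With the linear kernel, this estimator reduces to the squared distance between the two sample means, which is why it cannot see differences beyond the first moment; the RBF kernel is sensitive to higher-order differences at the cost of a bandwidth choice.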

We present the t-test statistic alongside the corresponding p-value to demonstrate the significance of the difference between the means of each pair of comparison groups (Table III and Table IV); a larger magnitude (absolute value) of the t-statistic suggests a higher likelihood that the observed difference between the group means is not a product of random chance. This statistical method provides a rigorous approach to assessing the significance of the differences observed in our data, ensuring our conclusions are robust and reliable.
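
The two-group comparison described above can be sketched with the Python standard library. The source does not say which variant of the t-test was used, so this sketch assumes Welch's unequal-variance form, and it approximates the p-value with a normal distribution, which is only reasonable for large groups such as metric values collected over many training epochs. The function name is hypothetical.

```python
import math
from statistics import NormalDist, mean, variance
from typing import Sequence, Tuple


def welch_t_test(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t-statistic, approximate two-sided p-value) for two groups.

    The sign of t follows mean(a) - mean(b), so a negative value means the
    first group has the lower mean.
    """
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / standard_error
    # Large-sample normal approximation to the t distribution (an assumption).
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p
```

For a metric where lower is better, such as FD, a negative t-statistic for the "True" group against the "False" group indicates that enabling the module lowers the metric on average.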

TABLE III: Report of t-test statistic of each evaluation metric, calculated based on all training epochs (Style: apply style-based decoder or the MLP-based counterpart; SVD: involve singular value decomposition-based positional encoding or not; NED: involve node-edge co-disentanglement or not; VQ: involve vector quantization or not; poly: involve polygon vertices information or not; label: number of architectural element label categories involved; z-dim: dimensions of latent codes; FD: lower is better; F1 PR: higher is better; F1 DC: higher is better; MMD Linear: lower is better; MMD RBF: lower is better.)

| Module | Setting | FD mean | FD t-statistic | F1 PR mean | F1 PR t-statistic | F1 DC mean | F1 DC t-statistic | MMD Linear mean | MMD Linear t-statistic | MMD RBF mean | MMD RBF t-statistic |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Style | True | 86.97 | -16; p=.00 | .004 | -13; p=.00 | .04 | -43; p=.00 | 18.58 | -10; p=.00 | .88 | 58.3; p=.00 |
| | False | 138.51 | | .02 | | .14 | | 30.93 | | .53 | |
| SVD | True | 114.24 | 2.96; p=.00 | .01 | 1.62; p=.11 | .09 | 5.81; p=.00 | 23.67 | 1.09; p=.28 | .69 | -3; p=.00 |
| | False | 106.42 | | .01 | | .08 | | 22.61 | | .72 | |
| NED | True | 112.87 | 1.93; p=.05 | .01 | -1; p=.28 | .08 | -6; p=.00 | 23.13 | .03; p=.98 | .73 | 7.72; p=.00 |
| | False | 107.76 | | .01 | | .09 | | 23.10 | | .68 | |
| VQ | True | 96.80 | -10; p=.00 | .004 | -14; p=.00 | .05 | -37; p=.00 | 22.57 | -1; p=.3 | .81 | 30.18; p=.00 |
| | False | 122.89 | | .022 | | .13 | | 23.64 | | .61 | |
| poly | True | 105.12 | -5; p=.00 | .01 | -10; p=.00 | .10 | 17.45; p=.00 | 22.87 | -7; p=.51 | .74 | 14.89; p=.00 |
| | False | 118.76 | | .02 | | .06 | | 23.53 | | .64 | |
| label | 6 | 103.07 | -7; p=.00 | .01 | -10; p=.00 | .09 | 7.38; p=.00 | 20.09 | -8; p=.00 | .68 | -9; p=.00 |
| | 25 | 122.47 | | .02 | | .07 | | 28.37 | | .75 | |
| z-dim | 512 | 110.87 | .80; p=.42 | .015 | 2.45; p=.01 | .08 | -11; p=.00 | 22.51 | -2; p=.06 | .69 | -6; p=.00 |
| | 1024 | 108.57 | | .011 | | .11 | | 24.49 | | .74 | |

TABLE IV: Report of t-test statistic of each evaluation metric, calculated based on all training epochs (Style: apply style-based decoder or the MLP-based counterpart; SVD: involve singular value decomposition-based positional encoding or not; NED: involve node-edge co-disentanglement or not; VQ: involve vector quantization or not; poly: involve polygon vertices information or not; label: number of architectural element label categories involved; z-dim: dimensions of latent codes; Precision: higher is better; Recall: higher is better; Density: higher is better; Coverage: higher is better.)

| Module | Setting | Precision mean | Precision t-statistic | Recall mean | Recall t-statistic | Density mean | Density t-statistic | Coverage mean | Coverage t-statistic |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Style | True | .98 | 9.78; p=.00 | .003 | -9.6; p=.00 | 1.87 | 9.04; p=.00 | .02 | -41; p=.00 |
| | False | .96 | | .01 | | 1.60 | | .08 | |
| SVD | True | .97 | -.1; p=.90 | .01 | .31; p=.75 | 1.76 | 1.04; p=.30 | .05 | 5.97; p=.00 |
| | False | .97 | | .01 | | 1.73 | | .04 | |
| NED | True | .97 | .75; p=.45 | .01 | .03; p=.98 | 1.73 | -1; p=.33 | .04 | -6; p=.00 |
| | False | .97 | | .01 | | 1.76 | | .05 | |
| VQ | True | .973 | 2.96; p=.00 | .003 | -11; p=.00 | 1.70 | -4; p=.00 | .02 | -36; p=.00 |
| | False | .966 | | .013 | | 1.79 | | .07 | |
| poly | True | .99 | 25.80; p=.00 | .005 | -8; p=.00 | 2.30 | 80.67; p=.00 | .06 | 13.86; p=.00 |
| | False | .93 | | .013 | | .81 | | .04 | |
| label | 6 | .968 | -2; p=.04 | .005 | -9; p=.00 | 1.87 | 13.66; p=.00 | .05 | 4.20; p=.00 |
| | 25 | .973 | | .014 | | 1.53 | | .04 | |
| z-dim | 512 | .96 | -12; p=.00 | .009 | 2.16; p=.03 | 1.53 | -28; p=.00 | .04 | -10; p=.00 |
| | 1024 | .99 | | .007 | | 2.24 | | .06 | |

The results highlight several key findings. Incorporating a style-based decoder significantly enhances fidelity, as evidenced by lower “FD” and “MMD Linear” values and higher “Precision” and “Density” values, but reduces diversity (higher “MMD RBF”, lower “Recall”, “Coverage”). Adding SVD positional encoding improves both fidelity and diversity, indicated by lower “MMD RBF” and higher “Coverage” values. Including polygon vertex coordinates enriches node features, boosting fidelity (lower “FD”, higher “F1 DC”, “Precision”, “Density”, “Coverage”) but reducing diversity (higher “MMD RBF”, lower “F1 PR”, “Recall”). Increasing architectural element label categories also enhances fidelity (higher “F1 PR”, “Precision”, “Recall”) at the expense of diversity (higher “MMD Linear”, “MMD RBF”, lower “F1 DC”, “Density”, “Coverage”). Implementing vector quantization improves fidelity (lower “FD”, higher “Precision”) but reduces diversity (higher “MMD RBF”, lower “Recall”, “Density”, “Coverage”). Increasing latent code dimensions enhances diversity (higher “F1 DC”, “Precision”, “Density”, “Coverage”) but slightly reduces fidelity (higher “MMD RBF”). Incorporating node-edge co-disentanglement significantly increases diversity (higher “F1 DC”, “Coverage”) but slightly impacts fidelity (higher “MMD RBF”). These findings underscore the delicate balance between fidelity and diversity in graph representation learning tasks for architectural layout graphs. More detailed documentation of the quantitative evaluation results can be found in Appendix B.

TABLE V: Summary of the impacts of different model structural or feature modules on graph representation learning performance

| Modules | Fidelity | Diversity |
| --- | --- | --- |
| +Style | + | - |
| +SVD | + | + |
| +NED | - | + |
| +VQ | + | - |
| +poly | + | - |
| +label | + | - |
| +zdim | - | + |

We summarize the impacts of various structural and feature interventions on the performance of the graph representation learning model in Table V, which provides a detailed overview of how different modifications to the model’s structure or its feature modules affect the fidelity and diversity levels of the learned graph representation. By organizing this information, we can more easily understand the complex interplay among various model implementation schemes concerning graph representation learning performance, offering a valuable reference point for elucidating the trade-offs and synergies inherent in different model design choices.

To further deepen our understanding of the impacts of various implementation choices on model performance, we systematically compare the effects of different combinations of model design choices on the performance metrics of the graph representation learning models. By applying this comparison across the diverse range of model design choices and their respective performance metrics, we aim to identify the sweet spot of graph representation model structure and feature interventions. We highlight the sweet spot combinations of graph representation model structural intervention and feature augmentation choices in Fig. 3 and Fig. 4. Additionally, we conduct one-way ANOVA across all possible combination groups to test the statistical significance of differences among the various model design choice combinations; the one-way ANOVA results of all group comparisons yield significant F statistics (please refer to Appendix B for more details), indicating that the differences in performance metrics across model design choice combinations, whether they involve structural modifications or feature enhancements, are statistically significant and not due to random variation.
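
For readers unfamiliar with the F statistic, the following standard-library sketch shows how a one-way ANOVA F value is obtained from several groups of metric values, one group per design choice combination. It is a textbook formulation offered for illustration, not the analysis script used in this study; the function name and the grouping are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def one_way_anova_f(groups: List[Sequence[float]]) -> float:
    """F statistic: between-group mean square over within-group mean square."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F value means that the spread between group means is large relative to the spread inside the groups; its significance is then judged against an F distribution with (k - 1, n - k) degrees of freedom.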

Fig. 3 and Fig. 4 jointly reveal how different model design choices affect graph representation learning performance with respect to both fidelity and diversity. Certain design choices have a dominant impact on particular performance metrics. Notably, the inclusion of a style-based decoder significantly improves the fidelity metrics “FD”, “MMD Linear”, and “Precision” together. These findings underscore the effectiveness of the layer-wise stochasticity mechanism in enhancing the fidelity of generated graphs. In contrast, the vector quantization module only positively impacts the “FD” and “MMD Linear” metrics, without significant improvement in “Precision”. This suggests that while the vector quantization mechanism contributes to fidelity, its effect is less pronounced than that of the layer-wise stochasticity mechanism. Interestingly, both the layer-wise stochasticity and vector quantization mechanisms negatively affect diversity-relevant metrics, including “MMD RBF”, “F1 PR”, “Recall”, and “Coverage”. This indicates that these mechanisms do not inherently contribute to the diversity of generated graphs.

The comparison further highlights a consistent trend across the diversity-relevant metrics “MMD RBF”, “F1 PR”, “Recall”, and “Coverage”: models incorporating the SVD encoding scheme consistently outperform the others. This observation underscores the critical role of positional encoding mechanisms, such as SVD, in facilitating diverse graph generation. It suggests that, by incorporating positional information into the learning process, graph representation learning models can more effectively capture spatial relationships and structural nuances within the graph data, thus enhancing the diversity of the generated graphs. Moreover, the significance of positional encoding extends beyond individual metrics, emphasising the importance of integrating positional encoding techniques when developing graph generation models.

Figure 3: Comparison of different graph representation model design choices and their corresponding metric values, including FID (FD), MMD Linear, MMD RBF, F1 PR, and F1 DC; the sweet spots for each metric are highlighted with dashed boxes.

Figure 4: Comparison of different graph representation model design choices and their corresponding metric values, including precision, density, recall, and coverage; the sweet spots for each metric are highlighted with dashed boxes.

VI-B2 Qualitative evaluation

We convert the generated attributed adjacency multigraph into graphical floor plans to clearly delineate the complex information within the edge and node feature matrices, facilitating a direct comparison of model-generated layouts. Such comparisons are crucial for assessing model fidelity, identifying strengths, and pinpointing areas for refinement. Detailed conversion steps are provided in Appendix E.

Based on the outcomes of our quantitative evaluation in Section VI-B1, we carefully select a series of models with representative framework setups. These models are chosen due to their distinct structural configurations and varied approaches to processing the graph data, offering a diverse perspective on model performance. To further analyse and visualise these models’ performance, we randomly sample 1000 latent codes z to ensure pseudo-exhaustive coverage of the distribution of the learned latent space for each selected model. These latent codes are high-dimensional vectors that represent the compressed, encoded information derived from the training graph data. To visualize these high-dimensional latent codes in a more interpretable manner, we use Uniform Manifold Approximation and Projection (UMAP) [53] to map the latent codes into a 2-dimensional space while preserving the essential structures and relationships, allowing us to observe patterns, clusters, and variations among the encoded representations of different models.

The exploration of spatial relationship patterns through the proposed graph representation learning framework, as depicted in the series of demonstrations (Fig. 5, Fig. 6, Fig. 7), reveals some insights into the capabilities of these models in disentangling complex architectural elements. This analysis aligns with the findings presented in the quantitative evaluation section (Section VI-B1).

Figure 5: Generated graph samples and their corresponding locations in the learned latent space using a trained framework with edge-augmented encoder, vanilla VAE disentanglement module, MLP-based decoder, SVD embeddings and 25 categories of architectural elements

Figure 6: Generated graph samples and their corresponding locations in the learned latent space using a trained framework with edge-augmented encoder, vector quantisation disentanglement module, MLP-based decoder, SVD embeddings and 6 categories of architectural elements

Figure 7: Generated graph samples and their corresponding locations in the learned latent space using a trained framework with edge-augmented encoder, vanilla VAE disentanglement module, MLP-based decoder, SVD embeddings, 25 categories of architectural elements, extra features of polygon vertices’ coordinates, and boosted dimensions of the latent space

Specifically, the model options with the SVD embeddings and extra architectural element categories show a high level of diverse clustering of learned latent graph patterns. The marked variation factor trends in nearby clusters indicate a rich and varied understanding of spatial relationships. This diversity in clustering highlights the model’s ability to capture a wide array of spatial patterns effectively (Fig. 5). Contrasting with the previous model, the same GNN setup, except for the employment of the VQ mechanism, has learned a more distinctly disentangled space. However, it exhibits fewer clusters and a lower level of diversity. This suggests a more focused but less varied understanding of spatial relationships (Fig. 6). Finally, the most complex setup, which includes SVD embeddings, an increased number of architectural element categories, features of polygon vertices’ coordinates, and boosted dimensions of the latent space, demonstrates both a high level of disentanglement and diversity, indicative of an advanced understanding and representation of layout design patterns (Fig. 7). More clustering visualization samples can be found in Appendix C.

Figure 8: Linear interpolation samples starting from the same latent code $z$ of the learned latent space of a trained framework with edge-augmented encoder, vanilla VAE disentanglement module, MLP-based decoder, SVD embeddings, 25 categories of architectural elements, extra features of polygon vertices’ coordinates, and boosted dimensions of the latent space

Figure 9: Linear interpolation samples starting from the same latent code $z$ of the learned latent space of a trained framework with edge-augmented encoder, vanilla VAE disentanglement module, MLP-based decoder, SVD embeddings, 6 categories of architectural elements, extra features of polygon vertices’ coordinates, and boosted dimensions of the latent space

Figure 10: Linear interpolation samples starting from the same latent code $z$ of the learned latent space of a trained framework with edge-augmented encoder, vanilla VAE disentanglement module, MLP-based decoder, SVD embeddings, and 25 categories of architectural elements

Meanwhile, when learning a latent code representation of a graph, we assume that each variable in the latent code corresponds to a certain factor or property used to generate the graphs’ edge and node attributes. Thus, by continuously changing the value of one variable while fixing the remaining variables, we can visualize the corresponding change in the generated graphs. Given the absence of a predefined list of layout design variables within the curated graph dataset and the impracticality of manually encoding such information, we opt for an unsupervised approach to assess the manipulation capabilities of the trained graph representation learning models. Specifically, we apply linear interpolation to the learned latent space of selected models, using a series of randomly generated latent code pairs to evaluate the disentanglement performance of our proposed framework. This simulates the graph feature manipulation process through linear interpolation between each pair of latent codes $z$. Fig. 8 illustrates a series of linear interpolation samples of the learned latent space of a framework composed of an edge-augmented encoder, a vanilla VAE disentanglement module, and an MLP-based decoder, trained with SVD embeddings, 25 categories of architectural elements, extra features of polygon vertices’ coordinates, and boosted dimensionality of the latent space. Distinct trends in graph feature alterations are evident when examining interpolation samples derived from diverse latent code pairs residing across various regions of the learned latent graph feature space, which has been enhanced in dimensionality. These trends encompass modifications in layout features such as the proportion of space area, orientation of room layout, density of spaces, and organizational flow within the layout.
Similarly, linear interpolation samples of the learned latent space of the same framework configuration but limited to 6 categories of architectural elements (Fig. 9) demonstrate similar identifiable trends in the modification of separable layout features. Likewise, Fig. 10 showcases samples derived from the learned latent space of the identical framework, albeit without additional polygon vertices’ coordinates and with a lower dimensionality of the latent space. While similar trends in layout feature manipulation are discernible, the generated layouts exhibit a tendency towards oversimplification and a lack of detailed complexity, likely due to the absence of supplementary polygon vertices’ coordinates and the reduced latent space dimensionality. More linear interpolation samples are demonstrated in Appendix C.
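
The two latent-space manipulations described above, interpolating between a pair of latent codes and varying a single latent variable while fixing the rest, can be sketched as follows. This is a minimal standard-library illustration: latent codes are drawn from a standard normal prior as an assumption, all function names are hypothetical, and the trained decoder that would turn each code into a layout graph is deliberately left out.

```python
import random
from typing import List


def sample_latent(dim: int, rng: random.Random) -> List[float]:
    """Draw one latent code from a standard normal prior (an assumption)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def lerp(z_a: List[float], z_b: List[float], t: float) -> List[float]:
    """Linear interpolation between two latent codes for t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def interpolation_path(z_a: List[float], z_b: List[float], steps: int) -> List[List[float]]:
    """Evenly spaced codes from z_a to z_b, endpoints included (steps >= 2)."""
    return [lerp(z_a, z_b, i / (steps - 1)) for i in range(steps)]


def traverse_dimension(z: List[float], index: int, values: List[float]) -> List[List[float]]:
    """Vary one latent variable while keeping all other variables fixed."""
    codes = []
    for value in values:
        z_new = list(z)
        z_new[index] = value
        codes.append(z_new)
    return codes
```

In use, one would sample a start code and several end codes, for example with `sample_latent(512, random.Random(0))`, build a path to each end code, and pass every code on the path to the trained decoder to obtain the sequence of layouts shown in figures such as Fig. 8.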

Exploring different configurations of the proposed framework demonstrates varied capabilities and performance levels in disentangling and interpreting the latent architectural layout design space. Certain configurations excel in disentangling spatial patterns, while others provide a richer diversity in the representation learning of layout features. The selection of a graph representation learning model setup plays a crucial role in balancing disentanglement and diversity. This underscores the significance of thoughtful model configuration in effectively interpreting architectural design data spaces.

VII Discussion and Future Works

Our empirical experiments have demonstrated the robustness and generalizability of our approaches in learning disentangled graph representations and interpreting the graph-based latent architectural design layout space. Meanwhile, this study’s extensive quantitative and qualitative experiments have also shed light on a few critical aspects concerning disentangled representation learning and deep generative modelling of graph data. These insights pave the way for a series of promising future research avenues, setting the stage for further exploration and refinement in this field.

VII-A Trade-off between disentanglement, fidelity, and diversity

One crucial aspect identified in our study is the trade-off between disentanglement, fidelity, and diversity of the learned architectural layout graph representations, which highlights the complexity of learning and disentangling such representations. The empirical experiments have shown that different structural modifications and feature enhancements significantly affect different aspects of graph representation learning performance. For instance, using a style-based decoder instead of a vanilla MLP-based decoder improves fidelity but does not positively impact diversity. Similarly, adding polygon vertex coordinates, increasing the number of architectural element label categories, and using a vector quantization mechanism for latent space disentanglement each improve fidelity at the cost of reduced diversity. Conversely, increasing the latent code dimensionality improves diversity but slightly compromises fidelity. Incorporating SVD-based positional encoding, however, enhances both fidelity and diversity. These findings underline the difficulty of balancing these aspects across implementations and the need for strategic model design to optimize performance across fidelity and diversity metrics.

The consistent positive effect of incorporating the SVD encoding scheme underscores the importance of integrating positional encoding techniques in developing architectural layout design graph generation models. Future work may therefore test different positional and structural encoding mechanisms to explore how the spatial relationships and features of layout graphs can be learned and interpreted more effectively. Such investigation could identify more efficient encoding schemes for capturing the nuances of layout design features, which are critical for generating more accurate, functional, and diverse architectural layout graphs.

Meanwhile, this study has yet to explore certain model implementation variations that may also be related to the trade-off issue, presenting opportunities for future research. These include adjusting the regularization coefficients for various loss terms and incorporating domain-specific knowledge into node ordering schemes. Exploring the regularization coefficients may optimize the model’s capacity to manage competing objectives, potentially improving its overall effectiveness concerning both fidelity and diversity. Additionally, different graph domains might see improvements from customized node orderings [34], and the application of domain-specific insights into standard orderings could also be potentially beneficial.

VII-B Evaluation metric effectiveness and suitability

Another key aspect identified is the consistency issue across different evaluation metrics. Our quantitative results indicate potential discrepancies between image and graph generation metrics. Specifically, while “FID (FD)” is widely used to assess both the fidelity and diversity of generated images, its effectiveness in measuring diversity within graph generation is questionable. This is illustrated by conflicting outcomes when comparing “FID (FD)” with other diversity-oriented metrics such as “MMD RBF”, “F1 PR”, “Recall”, and “Coverage”. The “MMD Linear” and “MMD RBF” measures also show potential conflicts, reflecting subtle differences in the data properties that each measure emphasizes.
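One way to see why the linear and RBF variants of MMD can disagree is a toy comparison on two samples that share the same mean but differ in spread. The sketch below is illustrative only and is not the evaluation code of this study: the function names, the kernel bandwidth `sigma=1.0`, and the one-dimensional samples are assumptions, and the estimator shown is the simple biased (V-statistic) form of squared MMD.

```python
import math


def linear_kernel(u, v):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(u, v, sigma=1.0):
    """Gaussian RBF kernel; the bandwidth sigma is an assumed value."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, kernel):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def mean_kernel(us, vs):
        return sum(kernel(u, v) for u in us for v in vs) / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Two toy 1-D samples with identical mean (0) but different spread.
narrow = [[-0.5], [0.5]]
wide = [[-3.0], [3.0]]

mmd_linear = mmd_squared(narrow, wide, linear_kernel)
mmd_rbf = mmd_squared(narrow, wide, rbf_kernel)
```

With a linear kernel, this estimate reduces to the squared distance between the two sample means, so `mmd_linear` is zero here even though the samples clearly differ, whereas `mmd_rbf` is strictly positive because the RBF kernel is also sensitive to differences beyond the mean.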

These discrepancies underscore the complexity of metric selection and the importance of choosing metrics that align with the specific research objectives of a given graph generation task. They also highlight the need for further research into metric effectiveness and suitability, so that the metrics employed provide meaningful insights and support the intended outcomes of the graph generation tasks.

VII-C Generalization across other graph generation domains

Exploring the applicability of the proposed framework across other graph generation domains represents an intriguing avenue for future research. While testing this framework in other contexts is feasible, it was beyond the scope of the current study. Extending this research to other domains could provide valuable insights into the framework’s versatility and effectiveness in different settings. Such investigations could further validate the framework’s broader utility and potentially uncover domain-specific challenges and opportunities for refinement. For instance, as our findings underscore the efficacy of the SVD encoding scheme and highlight the pivotal role of positional encoding techniques in enhancing the capabilities of architectural layout design graph generation models, future research may further investigate this aspect and extend our framework across various domains to ascertain its generalizability and efficiency in different settings.

VIII Conclusion

In conclusion, this study represents a pioneering effort to address significant research gaps in the domain of architectural layout design graph generation and graph-based design representation space interpretation. We have initiated the disentangled representation learning of architectural layout design graphs by introducing the Style-based Edge-augmented Variational Graph Auto-Encoder (SE-VGAE) framework. The proposed framework allows for a nuanced exploration of the complex interrelationships of different model design configurations, facilitating a deeper understanding of graph representation learning concerning the generation of architectural layout graphs.

Moreover, the introduction of a novel benchmark large-scale architectural layout design graph dataset marks another significant contribution. This dataset provides a comprehensive resource for training and evaluating graph generation models in this domain. This dataset enriches the field and sets a foundation for future research to explore and identify latent architectural design layout patterns and relationships.

Our study advances the theoretical understanding of graph-based architectural design and offers practical insights and tools for researchers and practitioners in relevant fields. The exploration of disentangled representation learning in the context of architectural layout design graphs illuminates an innovative path forward, suggesting that much can be gained by continuing to explore and refine techniques in this field. This work lays the groundwork for future explorations aimed at enhancing the robustness, accuracy, and diversity of graph generation models of architectural layout design and beyond.

Acknowledgments

The computational work for this article was performed on resources of the National Supercomputing Centre, Singapore (https://www.nscc.sg). The data sources used in this study are also gratefully acknowledged. This research was supported by the President’s Graduate Fellowship of the National University of Singapore and the Singapore Data Science Consortium (SDSC) Dissertation Research Fellowship.

References

  • [1] J. Chen and R. Stouffs, ‘Robust Attributed Adjacency Graph Extraction Using Floor Plan Images’, in POST-CARBON, Proceedings of the 27th International Conference of the Association for Computer-Aided Architectural Design Research in Asia (CAADRIA), 2022, vol. 2, pp. 385–394.
  • [2] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, ‘Geometric deep learning: going beyond euclidean data’, IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18–42, 2017.
  • [3] O. Akin, ‘The whittled design space’, AI EDAM, vol. 20, no. 2, pp. 83–88, 2006.
  • [4] R. F. Woodbury and A. L. Burrow, ‘Whither design space?’, AI EDAM, vol. 20, no. 2, pp. 63–82, 2006.
  • [5] N. Cross, Design thinking: Understanding how designers think and work. Bloomsbury Publishing, 2023.
  • [6] J. Chen and R. Stouffs, ‘The “Atlas” of Design Conceptual Space: A Design Thinking Framework with Cognitive and Computational Footings’, in Design Computing and Cognition’22, Springer, 2023, pp. 361–378.
  • [7] J. S. Gero and M. L. Maher, Modeling creativity and knowledge-based creative design. Psychology Press, 2013.
  • [8] J. Chen and R. Stouffs, ‘Deciphering the noisy landscape: architectural conceptual design space interpretation using disentangled representation learning’, Computer-Aided Civil and Infrastructure Engineering, vol. 38, no. 5, pp. 601–620, 2023.
  • [9] S. Chaillou, ‘Archigan: Artificial intelligence x architecture’, in Architectural Intelligence: Selected Papers from the 1st International Conference on Computational Design and Robotic Fabrication (CDRF 2019), 2020, pp. 117–127.
  • [10] I. Koh, ‘Voxel synthesis for architectural design’, in Design Computing and Cognition’20, Springer, 2022, pp. 297–316.
  • [11] N. Stoehr, E. Yilmaz, M. Brockschmidt, and J. Stuehmer, ‘Disentangling interpretable generative parameters of random and real-world graphs’, arXiv preprint arXiv:1910.05639, 2019.
  • [12] X. Guo, L. Zhao, Z. Qin, L. Wu, A. Shehu, and Y. Ye, ‘Interpretable deep graph generation with node-edge co-disentanglement’, in Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, 2020, pp. 1697–1707.
  • [13] Y. Du, X. Guo, H. Cao, Y. Ye, and L. Zhao, ‘Disentangled spatiotemporal graph generative models’, in Proceedings of the AAAI Conference on Artificial Intelligence, 2022, vol. 36, pp. 6541–6549.
  • [14] X. Guo and L. Zhao, ‘A systematic survey on deep generative models for graph generation’, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 5, pp. 5370–5390, 2022.
  • [15] M. Simonovsky and N. Komodakis, ‘Graphvae: Towards generation of small graphs using variational autoencoders’, in Artificial Neural Networks and Machine Learning–ICANN 2018: 27th International Conference on Artificial Neural Networks, Rhodes, Greece, October 4-7, 2018, Proceedings, Part I 27, 2018, pp. 412–422.
  • [16] T. Ma, J. Chen, and C. Xiao, ‘Constrained generation of semantically valid graphs via regularizing variational autoencoders’, Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [17] Y. Bengio, A. Courville, and P. Vincent, ‘Representation learning: A review and new perspectives’, IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
  • [18] D. P. Kingma and M. Welling, ‘Auto-Encoding Variational Bayes’, stat, vol. 1050, p. 1, 2014.
  • [19] P. Li and J. Leskovec, ‘The expressive power of graph neural networks’, Graph Neural Networks: Foundations, Frontiers, and Applications, pp. 63–98, 2022.
  • [20] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, ‘How powerful are graph neural networks?’, arXiv preprint arXiv:1810.00826, 2018.
  • [21] C. Morris et al., ‘Weisfeiler and leman go neural: Higher-order graph neural networks’, in Proceedings of the AAAI conference on artificial intelligence, 2019, vol. 33, pp. 4602–4609.
  • [22] B. Zhang et al., ‘The Expressive Power of Graph Neural Networks: A Survey’, arXiv preprint arXiv:2308.08235, 2023.
  • [23] P. Barcelo, E. V. Kostylev, M. Monet, J. Perez, J. L. Reutter, and J.-P. Silva, ‘The expressive power of graph neural networks as a query language’, ACM SIGMOD Record, vol. 49, no. 2, pp. 6–17, 2020.
  • [24] J. You, R. Ying, and J. Leskovec, ‘Position-aware graph neural networks’, in International conference on machine learning, 2019, pp. 7134–7143.
  • [25] R. Murphy, B. Srinivasan, V. Rao, and B. Ribeiro, ‘Relational pooling for graph representations’, in International Conference on Machine Learning, 2019, pp. 4663–4673.
  • [26] G. Bouritsas, F. Frasca, S. Zafeiriou, and M. M. Bronstein, ‘Improving graph neural network expressivity via subgraph isomorphism counting’, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, pp. 657–668, 2022.
  • [27] R. Sato, M. Yamada, and H. Kashima, ‘Random features strengthen graph neural networks’, in Proceedings of the 2021 SIAM international conference on data mining (SDM), 2021, pp. 333–341.
  • [28] B. Srinivasan and B. Ribeiro, ‘On the Equivalence between Positional Node Embeddings and Structural Graph Representations’, in International Conference on Learning Representations, 2019.
  • [29] R. Thompson, B. Knyazev, E. Ghalebi, J. Kim, and G. W. Taylor, ‘On Evaluation Metrics for Graph Generative Models’, in International Conference on Learning Representations, 2021.
  • [30] C.-C. Liu, H. Chan, K. Luk, and A. I. Borealis, ‘Auto-regressive graph generation modeling with improved evaluation methods’, in 33rd Conference on Neural Information Processing Systems. Vancouver, Canada, 2019.
  • [31] J. Park and A. Economou, ‘The Dirksen Grammar: A Generative Description of Mies van der Rohe’s Courthouse Design Language’, Nexus Network Journal, vol. 21, pp. 591–622, 2019.
  • [32] I. As, S. Pal, and P. Basu, ‘Artificial intelligence in architecture: Generating conceptual design via deep learning’, International Journal of Architectural Computing, vol. 16, no. 4, pp. 306–327, 2018.
  • [33] F. C. Kim, M. Johanes, and J. Huang, ‘Text2Form Diffusion: Framework for learning curated architectural vocabulary’, in 41st Conference on Education and Research in Computer Aided Architectural Design in Europe, eCAADe 2023, 2023, pp. 79–88.
  • [34] R. Liao et al., ‘Efficient graph generation with graph recurrent attention networks’, Advances in neural information processing systems, vol. 32, 2019.
  • [35] T. Karras, S. Laine, and T. Aila, ‘A style-based generator architecture for generative adversarial networks’, in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4401–4410.
  • [36] M. S. Hussain, M. J. Zaki, and D. Subramanian, ‘Global self-attention as a replacement for graph convolution’, in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 655–665.
  • [37] A. Vaswani et al., ‘Attention is all you need’, Advances in neural information processing systems, vol. 30, 2017.
  • [38] A. Van Den Oord, O. Vinyals, and Others, ‘Neural discrete representation learning’, Advances in neural information processing systems, vol. 30, 2017.
  • [39] Y. Shen, J. Gu, X. Tang, and B. Zhou, ‘Interpreting the latent space of gans for semantic face editing’, in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9243–9252.
  • [40] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, ‘Analyzing and improving the image quality of stylegan’, in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 8110–8119.
  • [41] J. Chen and R. Stouffs, ‘Floor Plan Image Segmentation Via Scribble-Based Semi-Weakly Supervised Learning: A Style and Category-Agnostic Approach’, Available at SSRN 4727643.
  • [42] Z. Chen, L. Chen, S. Villar, and J. Bruna, ‘Can graph neural networks count substructures?’, Advances in neural information processing systems, vol. 33, pp. 10383–10395, 2020.
  • [43] K. He, X. Zhang, S. Ren, and J. Sun, ‘Delving deep into rectifiers: Surpassing human-level performance on imagenet classification’, in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.
  • [44] J. L. Ba, J. R. Kiros, and G. E. Hinton, ‘Layer Normalization’, stat, vol. 1050, p. 21, 2016.
  • [45] X. Wang, H. Chen, S. Tang, Z. Wu, and W. Zhu, ‘Disentangled representation learning’, arXiv preprint arXiv:2211.11695, 2022.
  • [46] A. Paszke et al., ‘Pytorch: An imperative style, high-performance deep learning library’, Advances in neural information processing systems, vol. 32, 2019.
  • [47] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, ‘Gans trained by a two time-scale update rule converge to a local nash equilibrium’, Advances in neural information processing systems, vol. 30, 2017.
  • [48] T. Kynkäänniemi, T. Karras, S. Laine, J. Lehtinen, and T. Aila, ‘Improved precision and recall metric for assessing generative models’, Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [49] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet, ‘Are gans created equal? a large-scale study’, Advances in neural information processing systems, vol. 31, 2018.
  • [50] M. F. Naeem, S. J. Oh, Y. Uh, Y. Choi, and J. Yoo, ‘Reliable fidelity and diversity metrics for generative models’, in International Conference on Machine Learning, 2020, pp. 7176–7185.
  • [51] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola, ‘A kernel method for the two-sample-problem’, Advances in neural information processing systems, vol. 19, 2006.
  • [52] Q. Xu et al., ‘An empirical study on evaluation metrics of generative adversarial networks’, arXiv preprint arXiv:1806.07755, 2018.
  • [53] L. McInnes, J. Healy, N. Saul, and L. Großberger, ‘UMAP: Uniform Manifold Approximation and Projection’, Journal of Open Source Software, vol. 3, no. 29, 2018.
Jielin Chen is a PhD candidate in Architecture at the National University of Singapore. She obtained her MLA (Distinction) from the University of Hong Kong and a BEng in Urban Planning from Zhejiang University. Her research specializes in design computing, with a focus on computational methods for design representation space interpretation. She is dedicated to disentangling computational design representation and seeking innovative solutions to complex architectural design research challenges.
Rudi Stouffs is Associate Professor at the Department of Architecture and Assistant Dean (Research) at the College of Design and Engineering, National University of Singapore. He leads the Architectural and Urban Prototyping lab. His research interests include computational issues of description, modelling and representation for design, in the areas of shape recognition and design generation, building information modelling and analysis, virtual cities and digital twins.

Appendix A Methodology

The pseudo-code of the vanilla VAE module is demonstrated in Algorithm 1. It is worth noting that $z_{\sigma}$ is calculated as the log-variance for numerical stability reasons, and the standard deviation $\sigma$ is computed by $e^{\frac{z_{\sigma}}{2}}$, as $e^{\frac{z_{\sigma}}{2}} = e^{\frac{\log(\sigma^{2})}{2}} = \sigma$.
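The log-variance parameterization and the resulting sampling step can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the module used in this study: the name `reparameterize`, the two-dimensional example values, and the use of the standard-library random generator in place of a tensor library are all illustrative choices.

```python
import math
import random


def reparameterize(z_mu, z_logvar, rng):
    """Draw z = z_mu + exp(z_logvar / 2) * r with r ~ N(0, I)."""
    z = []
    for mu, logvar in zip(z_mu, z_logvar):
        sigma = math.exp(0.5 * logvar)  # exp(log(sigma^2) / 2) = sigma
        r = rng.gauss(0.0, 1.0)         # standard normal noise
        z.append(mu + sigma * r)
    return z


rng = random.Random(0)  # fixed seed so the sketch is repeatable
z_mu = [0.0, 1.0]
z_logvar = [math.log(4.0), math.log(0.25)]  # variances 4 and 0.25
sigmas = [math.exp(0.5 * lv) for lv in z_logvar]  # recovers 2 and 0.5
z = reparameterize(z_mu, z_logvar, rng)
```

Because the network output is interpreted as a log-variance, it may take any real value while the recovered standard deviation stays strictly positive, which is what makes this parameterization numerically convenient.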

Algorithm 1 Pseudo-code of the vanilla VAE module
1:  input: $X^{\prime}\in\mathbb{R}^{n\times d}, A^{e^{\prime}}\in\mathbb{R}^{n\times n\times c}$
2:  hyperparameter:
3:       dimension of latent space: $M$
4:  trainable parameters:
5:       $\epsilon\in\mathbb{R}$
6:       Edge feature mapping layer $f_{nn}^{e}:\mathbb{R}^{c}\rightarrow\mathbb{R}$
7:       Linear layer $l^{nn}$
8:       Parametric Rectified Linear Unit (PReLU) activation layer $pr$
9:       Layer normalization layer $l_{n}$
10:  training process:
11:       $\widehat{X}^{\prime}=(1+\epsilon)\odot X^{\prime}$
12:       $\widehat{A}^{e^{\prime}}=f_{nn}^{e}\left(A^{e^{\prime}}\right)$
13:       $\widehat{\widehat{X}}^{\prime}=\widehat{A}^{e^{\prime}}\cdot\widehat{X}^{\prime}$
14:       $\widehat{\widehat{\widehat{X}}}^{\prime}=\widehat{X}^{\prime}+\widehat{\widehat{X}}^{\prime}$
15:       $\widehat{\widehat{\widehat{X}}}^{\prime}=l_{n}\left(pr\left(l^{nn}_{d\rightarrow d}\left(\widehat{\widehat{\widehat{X}}}^{\prime}\right)\right)\right),\ \widehat{\widehat{\widehat{X}}}^{\prime}\in\mathbb{R}^{n\times d}$
16:       $\overline{z}=\sum^{n}\widehat{\widehat{\widehat{X}}}^{\prime}_{n,d},\ \overline{z}\in\mathbb{R}^{d}$
17:       $z_{\mu}=l_{n}\left(pr\left(l^{nn}_{M\rightarrow M}\left(l_{n}\left(pr\left(l^{nn}_{d\rightarrow M}\left(\overline{z}\right)\right)\right)\right)\right)\right),\ z_{\mu}\in\mathbb{R}^{M}$
18:       $z_{\sigma}=l_{n}\left(pr\left(l^{nn}_{M\rightarrow M}\left(l_{n}\left(pr\left(l^{nn}_{d\rightarrow M}\left(\overline{z}\right)\right)\right)\right)\right)\right),\ z_{\sigma}\in\mathbb{R}^{M}$
19:       $z=z_{\mu}+e^{\frac{z_{\sigma}}{2}}\odot r,\ z\sim\mathcal{N}\left(z_{\mu},e^{z_{\sigma}}\right)$,
20:                            $r\sim\mathcal{N}\left(0,\textbf{I}\right),\ r\in\mathbb{R}^{M}$
21:  return: $z$
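The aggregation part of Algorithm 1 (steps 11 to 14 and the sum readout of step 16) can be illustrated on a tiny graph. The sketch below is a simplification and not the trained module: the learned edge feature mapping $f_{nn}^{e}$ is replaced by a fixed dot product with a hypothetical weight vector `edge_weights`, the learned layers of step 15 are omitted, and the function name and the two-node example are assumptions made for illustration.

```python
def aggregate_node_features(x, a_e, edge_weights, eps):
    """Steps 11-14 and 16 of Algorithm 1, with fixed (not learned) edge weights."""
    n, d = len(x), len(x[0])
    # Step 11: scale node features by (1 + eps).
    x_hat = [[(1.0 + eps) * v for v in row] for row in x]
    # Step 12: collapse each c-dimensional edge feature to one scalar.
    a_hat = [[sum(w * f for w, f in zip(edge_weights, a_e[i][j]))
              for j in range(n)] for i in range(n)]
    # Step 13: neighbourhood aggregation (matrix product a_hat @ x_hat).
    x_agg = [[sum(a_hat[i][k] * x_hat[k][j] for k in range(n))
              for j in range(d)] for i in range(n)]
    # Step 14: residual sum of scaled and aggregated features.
    x_res = [[x_hat[i][j] + x_agg[i][j] for j in range(d)] for i in range(n)]
    # Step 16: sum over nodes to obtain a graph-level vector
    # (the learned layers of step 15 are omitted in this sketch).
    return [sum(x_res[i][j] for i in range(n)) for j in range(d)]


# Toy graph: 2 nodes, 2 node features, 2 edge feature channels.
x = [[1.0, 0.0], [0.0, 1.0]]
a_e = [[[0.0, 0.0], [1.0, 0.0]],
       [[1.0, 0.0], [0.0, 0.0]]]
z_bar = aggregate_node_features(x, a_e, edge_weights=[1.0, 0.5], eps=0.0)
```

In this simplified form, relabelling the nodes permutes the rows of the intermediate matrices but leaves the final sum unchanged, so the graph-level vector does not depend on the node ordering.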

The pseudo-code of the NED-based disentanglement module is demonstrated in Algorithm 2. Specifically, the node-edge co-encoder learns the mean $z_{\mu}^{graph}$ and standard deviation $z_{\sigma}^{graph}$ of the latent representation of the entire graph and the corresponding $z^{graph}\sim\mathcal{N}\left(z_{\mu}^{graph},e^{2z_{\sigma}^{graph}}\textbf{I}\right)$. Similarly, the node encoder learns the mean $z_{\mu}^{node}$ and standard deviation $z_{\sigma}^{node}$ of node representation and the corresponding $z^{node}\sim\mathcal{N}\left(z_{\mu}^{node},e^{2z_{\sigma}^{node}}\textbf{I}\right)$, and the edge encoder learns the mean $z_{\mu}^{edge}$ and standard deviation $z_{\sigma}^{edge}$ of edge representation and the corresponding $z^{edge}\sim\mathcal{N}\left(z_{\mu}^{edge},e^{2z_{\sigma}^{edge}}\textbf{I}\right)$. Following this, $z^{node}$ and $z^{edge}$ are respectively fused with $z^{graph}$ to generate $z^{node+graph}$ and $z^{edge+graph}$, which are subsequently inputted into the node and edge sub-decoders concurrently.

Algorithm 2 Pseudo code of the NED-based disentanglement module
1:  input: $X' \in \mathbb{R}^{n \times d}, A^{e'} \in \mathbb{R}^{n \times n \times c}$
2:  hyperparameter:
3:       dimension of latent space: $M$
4:  trainable parameters:
5:       $\epsilon \in \mathbb{R}$
6:       Edge feature mapping layer $f_{nn}^{e}: \mathbb{R}^{c} \rightarrow \mathbb{R}$
7:       Linear layer $l^{nn}$
8:       Parametric Rectified Linear Unit (PReLU) activation layer $pr$
9:       Layer normalization layer $l_{n}$
10:  training process:
11:       Obtain $\overline{z}$ using the process provided in Algorithm 1 up to line 16
12:       $z_{\mu}^{graph} = l_{n}(pr(l^{nn}_{M \rightarrow M}(l_{n}(pr(l^{nn}_{d \rightarrow M}(\overline{z})))))), z_{\mu}^{graph} \in \mathbb{R}^{M}$
13:       $z_{\sigma}^{graph} = l_{n}(pr(l^{nn}_{M \rightarrow M}(l_{n}(pr(l^{nn}_{d \rightarrow M}(\overline{z})))))), z_{\sigma}^{graph} \in \mathbb{R}^{M}$
14:       $z^{graph} = z_{\mu}^{graph} + e^{\frac{z_{\sigma}^{graph}}{2}} \odot r$,
15:                                $z^{graph} \sim \mathcal{N}(z_{\mu}^{graph}, e^{z_{\sigma}^{graph}} \mathbf{I})$,
16:                                $r \sim \mathcal{N}(0, \mathbf{I}), r \in \mathbb{R}^{M}$
17:       $\overline{z}^{node} = \sum^{n} X'_{n,d}, \overline{z}^{node} \in \mathbb{R}^{d}$
18:       $z_{\mu}^{node} = l_{n}(pr(l^{nn}_{M \rightarrow M}(l_{n}(pr(l^{nn}_{d \rightarrow M}(\overline{z}^{node}))))))$,
19:                                             $z_{\mu}^{node} \in \mathbb{R}^{M}$
20:       $z_{\sigma}^{node} = l_{n}(pr(l^{nn}_{M \rightarrow M}(l_{n}(pr(l^{nn}_{d \rightarrow M}(\overline{z}^{node}))))))$,
21:                                             $z_{\sigma}^{node} \in \mathbb{R}^{M}$
22:       $z^{node} = z_{\mu}^{node} + e^{\frac{z_{\sigma}^{node}}{2}} \odot r, z^{node} \sim \mathcal{N}(z_{\mu}^{node}, e^{z_{\sigma}^{node}} \mathbf{I})$,
23:                                             $r \sim \mathcal{N}(0, \mathbf{I}), r \in \mathbb{R}^{M}$
24:       $z^{node+graph} = l^{nn}_{2M \rightarrow M}(z^{node} || z^{graph})$,
25:                                             $z^{node+graph} \in \mathbb{R}^{M}$
26:       $\overline{z}^{edge} = l^{nn}_{n^{2} \rightarrow M}(flatten(f_{nn}^{e}(A^{e'})))$
27:       $z_{\mu}^{edge} = l_{n}(pr(l^{nn}_{M \rightarrow M}(l_{n}(pr(l^{nn}_{M \rightarrow M}(\overline{z}^{edge}))))))$,
28:                                             $z_{\mu}^{edge} \in \mathbb{R}^{M}$
29:       $z_{\sigma}^{edge} = l_{n}(pr(l^{nn}_{M \rightarrow M}(l_{n}(pr(l^{nn}_{M \rightarrow M}(\overline{z}^{edge}))))))$,
30:                                             $z_{\sigma}^{edge} \in \mathbb{R}^{M}$
31:       $z^{edge} = z_{\mu}^{edge} + e^{\frac{z_{\sigma}^{edge}}{2}} \odot r, z^{edge} \sim \mathcal{N}(z_{\mu}^{edge}, e^{z_{\sigma}^{edge}} \mathbf{I})$,
32:                                             $r \sim \mathcal{N}(0, \mathbf{I}), r \in \mathbb{R}^{M}$
33:       $z^{edge+graph} = l^{nn}_{2M \rightarrow M}(z^{edge} || z^{graph})$,
34:                                             $z^{edge+graph} \in \mathbb{R}^{M}$
35:  return: $z^{node+graph}, z^{edge+graph}$
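The sampling and fusion steps of Algorithm 2 can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: it assumes that $z_{\sigma}$ acts as a log-variance (as lines 14 and 15 of Algorithm 2 imply) and that the node and edge branches use separate fusion layers. The helper names `reparameterize`, `linear`, and `fuse`, the toy dimension `M = 4`, and all numeric values are hypothetical.

```python
import math
import random


def reparameterize(z_mu, z_logvar, rng):
    """Draw z = mu + exp(logvar / 2) * r with r ~ N(0, I), element-wise."""
    return [m + math.exp(lv / 2.0) * rng.gauss(0.0, 1.0)
            for m, lv in zip(z_mu, z_logvar)]


def linear(weights, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def fuse(z_part, z_graph, weights, bias):
    """Concatenate two M-dim codes (z_part || z_graph) and project 2M -> M."""
    return linear(weights, bias, z_part + z_graph)


def random_layer(rng, m):
    """Untrained stand-in for a 2M -> M linear layer (hypothetical weights)."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(2 * m)] for _ in range(m)]
    return w, [0.0] * m


M = 4  # hypothetical latent dimension
rng = random.Random(0)

# Hypothetical encoder outputs (means and log-variances) for each branch.
z_mu_graph, z_logvar_graph = [0.1, -0.2, 0.3, 0.0], [-1.0] * M
z_mu_node, z_logvar_node = [0.5, 0.4, -0.1, 0.2], [-2.0] * M
z_mu_edge, z_logvar_edge = [-0.3, 0.0, 0.6, 0.1], [-2.0] * M

# Sample the graph-, node-, and edge-level latent codes.
z_graph = reparameterize(z_mu_graph, z_logvar_graph, rng)
z_node = reparameterize(z_mu_node, z_logvar_node, rng)
z_edge = reparameterize(z_mu_edge, z_logvar_edge, rng)

# Fuse each branch with the graph-level code before decoding.
w_node, b_node = random_layer(rng, M)
w_edge, b_edge = random_layer(rng, M)
z_node_graph = fuse(z_node, z_graph, w_node, b_node)
z_edge_graph = fuse(z_edge, z_graph, w_edge, b_edge)
```

The two fused vectors play the role of $z^{node+graph}$ and $z^{edge+graph}$, which the node and edge sub-decoders would consume.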

Appendix B Quantitative evaluation

Our investigation indicates that using a style-based decoder (“Style” is True) instead of the vanilla MLP-based decoder (“Style” is False) results in significantly lower “FD” and “MMD Linear” values, together with higher “Precision” and “Density” values (Fig. 11, Fig. 12). These findings indicate that the style-based decoder improves the fidelity of the learned graph representations. Despite these gains in fidelity, there is no discernible impact on diversity, as evidenced by comparable “F1 PR”, “F1 DC”, “MMD RBF”, “Recall”, and “Coverage” scores between the two decoder types. The style-based decoder therefore contributes primarily to fidelity while exhibiting limited influence on diversity under the conditions of our study.

Figure 11: Comparison of style-based decoder and vanilla MLP-based decoder based on the ’FID’, ’KID’, ’F1 PR’, ’F1 DC’, ’MMD Linear’, and ’MMD RBF’ measures
Figure 12: Comparison of style-based decoder and vanilla MLP-based decoder based on the ’precision’, ’recall’, ’density’, and ’coverage’ measures

Our findings also reveal that incorporating singular value decomposition-based positional encoding (“SVD” is True) generally improves graph representation learning performance. With this augmentation, “MMD RBF” is significantly lower (Fig. 13), while “Coverage” is substantially higher (Fig. 14). The lower “MMD RBF” value indicates that the generated set is closer to the target set, which suggests improved fidelity and diversity, and the higher “Coverage” value likewise points to greater diversity within the learned representations. These results indicate that adding SVD positional encoding to node features can effectively enhance the model’s ability to learn graph representations with higher fidelity and diversity.
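To make the reading of the MMD scores concrete, the sketch below computes a (biased) squared maximum mean discrepancy between two sets of feature vectors with a linear and an RBF kernel. It is only an illustration under stated assumptions: the feature vectors are toy 2-D points, the bandwidth `sigma=1.0` is arbitrary, and the function names are hypothetical; this section does not specify the feature extractor or kernel settings used in the study.

```python
import math


def linear_kernel(a, b):
    """k(a, b) = <a, b>."""
    return sum(x * y for x, y in zip(a, b))


def rbf_kernel(a, b, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)); sigma is an assumed bandwidth."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2(xs, ys, kernel):
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    def mean_k(p, q):
        return sum(kernel(a, b) for a in p for b in q) / (len(p) * len(q))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)


# Toy "real" and "generated" feature sets (hypothetical values).
real = [[0.0, 0.0], [2.0, 0.0]]
generated = [[1.0, 1.0], [3.0, 1.0]]

mmd_linear = mmd2(real, generated, linear_kernel)
mmd_rbf = mmd2(real, generated, rbf_kernel)
```

With the linear kernel the estimate reduces to the squared distance between the two set means, so it is sensitive only to a shift in the average, whereas the RBF kernel also reacts to differences in the shape of the two distributions. In both cases a value closer to zero means the generated set is closer to the target set.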

Figure 13: Comparison of the intervention of SVD positional embedding based on the ’FID’, ’KID’, ’F1 PR’, ’F1 DC’, ’MMD Linear’, and ’MMD RBF’ measures
Figure 14: Comparison of the intervention of SVD positional embedding based on the ’precision’, ’recall’, ’density’, and ’coverage’ measures

Meanwhile, significant effects can also be observed when polygon vertex coordinate information is added to the node features (“poly” is True). With this enhancement, the “FD” value decreased notably, while the “F1 DC”, “Precision”, “Density”, and “Coverage” values increased significantly (Fig. 15, Fig. 16). This trend suggests that integrating polygon vertex coordinates into node features enhances the fidelity of the graph representations learned by the model. However, this improvement in fidelity comes with a trade-off: the “MMD RBF” value increases significantly, and the “F1 PR” and “Recall” values decrease, which suggests a reduction in the model’s ability to capture the full range of the target graph representation distribution, that is, a potential loss of diversity. This phenomenon is somewhat predictable. Including polygon vertex coordinates enriches the node representation with more pertinent information, enhancing the model’s ability to capture detailed features and improving fidelity. Yet this addition also substantially increases the dimensionality of the representation that the model needs to capture and thereby escalates the complexity of the representation learning task. Consequently, the model may struggle to learn and represent the entire scope of the data, potentially resulting in mode dropping, i.e., the model fails to represent certain modes or variations within the data distribution, which can diminish the diversity of the learned representations.
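The effect of mode dropping on the four nearest-neighbour measures can be illustrated with a small, self-contained sketch. It follows the commonly used k-nearest-neighbour definitions of precision, recall, density, and coverage; whether the study uses exactly these definitions, which feature space it uses, and which `k` it uses are assumptions here, and the 2-D points and function names are hypothetical.

```python
import math


def euclid(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbour in the same set."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(euclid(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii


def prdc(real, fake, k):
    """Return precision, recall, density, and coverage for two point sets."""
    real_r = knn_radii(real, k)
    fake_r = knn_radii(fake, k)
    # in_real[i][j]: generated sample j lies inside the k-NN ball of real sample i.
    in_real = [[euclid(r, f) <= rad for f in fake] for r, rad in zip(real, real_r)]
    # in_fake[j][i]: real sample i lies inside the k-NN ball of generated sample j.
    in_fake = [[euclid(f, r) <= rad for r in real] for f, rad in zip(fake, fake_r)]
    n_real, n_fake = len(real), len(fake)
    precision = sum(any(in_real[i][j] for i in range(n_real))
                    for j in range(n_fake)) / n_fake
    recall = sum(any(in_fake[j][i] for j in range(n_fake))
                 for i in range(n_real)) / n_real
    density = sum(sum(row) for row in in_real) / (k * n_fake)
    coverage = sum(any(row) for row in in_real) / n_real
    return {"precision": precision, "recall": recall,
            "density": density, "coverage": coverage}


# Two real modes; the generated samples reproduce only the first one.
real = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
fake = [[0.1, 0.1], [0.2, 0.9], [0.9, 0.2]]
scores = prdc(real, fake, k=2)
```

In this toy case every generated sample sits near real data, so precision and density stay high, while the missing second mode halves recall and coverage. This is the signature of mode dropping described above.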

Figure 15: Comparison of the intervention of extra information of polygon vertices coordinates based on the ’FID’, ’KID’, ’F1 PR’, ’F1 DC’, ’MMD Linear’, and ’MMD RBF’ measures
Figure 16: Comparison of the intervention of extra information of polygon vertices coordinates based on ’precision’, ’recall’, ’density’, and ’coverage’ measures

Similarly, implementing a vector quantization mechanism for latent space disentanglement (“VQ” is True) yielded a significant decrease in “FD” (Fig. 17) and a considerable increase in “Precision” (Fig. 18). This suggests that vector quantization enhances the fidelity of the graph representations learned by the model, pointing to a closer match between the generated and real graph data. However, we also observed a trade-off regarding diversity: the “MMD RBF” value increased significantly, and the “Recall”, “Density”, and “Coverage” values decreased markedly, suggesting a growing dissimilarity between the overall learned and target distributions and a diminished ability of the model to capture the full range and variety of the target distribution. Therefore, while the vector quantization mechanism improves fidelity, it also does so at the expense of diversity within the learned representations. This trade-off underscores the complexity involved in latent space disentanglement: enhancing one aspect of the model’s performance can inadvertently degrade another. Using vector quantization thus requires careful consideration of its effects on the diversity of the generated graph representations.

Figure 17: Comparison of the intervention of vector quantization mechanism based on the ’FID’, ’KID’, ’F1 PR’, ’F1 DC’, ’MMD Linear’, and ’MMD RBF’ measures
Figure 18: Comparison of the intervention of vector quantization mechanism based on the ’precision’, ’recall’, ’density’, and ’coverage’ measures

Regarding the number of architectural element label categories, our analysis reveals that as the number of categories used in training the graph representation learning model increases, the “F1 PR”, “Precision”, and “Recall” values increase significantly (Fig. 19, Fig. 20), suggesting that the model can achieve a higher fidelity level in graph representation learning. Nevertheless, this improvement in fidelity also comes at a cost to diversity, as evidenced by the substantial increase in “MMD Linear” and “MMD RBF” values, coupled with a notable decrease in “F1 DC”, “Density”, and “Coverage”. These results indicate that increasing the number of architectural element label categories can also lead to a trade-off between the fidelity and diversity of the learned graph representations. This phenomenon is somewhat predictable and resembles the effect of adding polygon vertex coordinates to node features: more label categories offer more detailed and relevant information for graph representation, yet they also make the representation learning task harder. Thus, balancing the enhanced fidelity against the increased risk of mode dropping is a critical aspect of model design and feature selection in graph representation learning. This balance is crucial for developing graph representation learning models that can effectively capture both the characteristics and the diversity of the architectural graph data they are designed to represent.

Figure 19: Comparison of the intervention of architectural element label categories based on the ’FID’, ’KID’, ’F1 PR’, ’F1 DC’, ’MMD Linear’, and ’MMD RBF’ measures
Figure 20: Comparison of the intervention of architectural element label categories based on the ’precision’, ’recall’, ’density’, and ’coverage’ measures

Moreover, evaluating the impact of the dimension of the latent codes z on learning high-dimensional representations of architectural design graphs reveals that increasing this dimension significantly improves the “F1 DC”, “Precision”, “Density”, and “Coverage” values (Fig. 21, Fig. 22). This enhancement indicates that the model learns graph representations with a better level of diversity; a higher-dimensional latent space may provide a more expansive and nuanced space in which to capture a wider range of the variations and complexities present in the architectural design graph data. However, this improvement in diversity comes with a trade-off in terms of fidelity, evidenced by an increase in the “MMD RBF” value, which suggests a slight degradation in the fidelity of the learned graph representations: the representations generated in the higher-dimensional latent space are somewhat less similar to the real data distribution. Therefore, while a higher latent code dimension enhances the model’s ability to capture diversity, it also appears to reduce the fidelity of the learned representations to some extent. This finding again highlights the delicate balance between diverse representation and high fidelity in graph representation learning, and it underscores the need to choose the latent space dimensionality carefully when designing models for architectural design graph data.

Figure 21: Comparison of the intervention of the dimension of latent code space based on the ’FID’, ’KID’, ’F1 PR’, ’F1 DC’, ’MMD Linear’, and ’MMD RBF’ measures
Figure 22: Comparison of the intervention of the dimension of latent code space based on the ’precision’, ’recall’, ’density’, and ’coverage’ measures

Similar to increasing the latent code dimensionality, incorporating a node-edge co-disentanglement mechanism into the graph representation learning model resulted in significant increases in both “F1 DC” and “Coverage” (Fig. 23, Fig. 24). This indicates an enhancement in the diversity of the graph representations learned by the model: the mechanism allows a more nuanced and detailed representation of the relationships and interactions between nodes and edges within the graph, which in turn helps the model capture a broader spectrum of the variations and intricacies inherent in the architectural design data. Yet this increased diversity comes at a certain cost to fidelity. The accompanying rise in the “MMD RBF” value means that the representations generated with the node-edge co-disentanglement mechanism are less congruent with the real data distribution, indicating a minor compromise in how accurately the model captures the details of the architectural designs. Implementing a node-edge co-disentanglement mechanism therefore also creates a trade-off between diversity and fidelity: it significantly enriches the diversity of the representations while slightly reducing their fidelity. This trade-off underscores the complexity of designing graph representation learning models, particularly in balancing the need to capture diverse architectural elements against maintaining high accuracy and precision.

Figure 23: Comparison of the intervention of incorporating the node-edge co-disentanglement mechanism based on the ’FID’, ’KID’, ’F1 PR’, ’F1 DC’, ’MMD Linear’, and ’MMD RBF’ measures
Figure 24: Comparison of the intervention of incorporating the node-edge co-disentanglement mechanism based on the ’precision’, ’recall’, ’density’, and ’coverage’ measures

To deepen our understanding of how various modelling choices affect graph representation learning performance, we compare all possible combination groups of model design choices using one-way ANOVA, which tests whether the differences among groups on each performance metric are statistically significant. The one-way ANOVA results of all group comparisons yield significant F statistics (Table VI), indicating that the differences in performance metrics across the model design choice combinations, whether they involve structural modifications or feature enhancements, are statistically significant and unlikely to be due to random variation.

TABLE VI: One-way ANOVA results of all possible combination groups of graph representation model structure and feature intervention choices over different graph representation learning model performance evaluation metrics
Metric | F statistic | p-value
FID | 12.23 | .00
MMD Linear | 6.08 | .00
MMD RBF | 107.75 | .00
F1 PR | 36.90 | .00
F1 DC | 123.18 | .00
Precision | 13.25 | .00
Recall | 22.47 | .00
Density | 103.09 | .00
Coverage | 122.20 | .00
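For readers less familiar with the test, the F statistic reported for each metric is the ratio of between-group to within-group variance. The sketch below computes it from scratch on made-up numbers; the group values and the function name `one_way_anova_f` are hypothetical and are not data from this study.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of sample groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of samples around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    ms_between = ss_between / (k - 1)       # df_between = k - 1
    ms_within = ss_within / (n_total - k)   # df_within = N - k
    return ms_between / ms_within


# Hypothetical metric values for three design-choice combinations.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_anova_f(groups)
```

A large F means the group means differ by more than the scatter within groups would explain. Converting F into a p-value requires the F distribution with (k - 1, N - k) degrees of freedom, which a statistics library such as SciPy (`scipy.stats.f_oneway`) provides.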

Appendix C Qualitative evaluation

Similar to the model with the VQ mechanism, the model with extra features of polygon vertices’ coordinates also demonstrates a significant level of disentanglement but with a comparable limitation in diversity (Fig. 25). The expansion of the latent space dimensions, combined with SVD embeddings and additional features, results in a boosted performance in terms of diversity while maintaining a high level of disentanglement. This setup seems to strike a balance between diverse clustering and clear disentanglement (Fig. 26).

Figure 25: Generated graph samples and their corresponding locations in the learned latent space using a trained framework with edge-augmented encoder, vector quantisation disentanglement module, MLP-based decoder, SVD embeddings, 6 categories of architectural elements and extra features of polygon vertices’ coordinates
Figure 26: Generated graph samples and their corresponding locations in the learned latent space using a trained framework with edge-augmented encoder, vanilla VAE disentanglement module, MLP-based decoder, SVD embeddings, 6 categories of architectural elements, extra features of polygon vertices’ coordinates, and boosted dimensions of the latent space

Additional linear interpolation samples are shown in Fig. 27, Fig. 28, Fig. 29, Fig. 30, Fig. 31, Fig. 32, Fig. 33, Fig. 34, and Fig. 35, corresponding to the discussion in Section VI-B2.
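Linear interpolation in latent space simply walks along the straight line between two latent codes and decodes each intermediate point. The sketch below shows only the interpolation step, with hypothetical function names and toy 2-D codes; the decoding of each code into a layout graph by the trained decoder is not shown.

```python
def lerp(z_start, z_end, t):
    """Point at fraction t in [0, 1] on the segment from z_start to z_end."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)]


def interpolation_path(z_start, z_end, steps):
    """Evenly spaced latent codes from z_start to z_end, endpoints included."""
    return [lerp(z_start, z_end, i / (steps - 1)) for i in range(steps)]


# Toy latent codes (hypothetical values); real codes would have M dimensions.
z_a = [0.0, 0.0]
z_b = [1.0, 2.0]
path = interpolation_path(z_a, z_b, steps=5)
# Each code in `path` would then be passed to the trained decoder.
```

Smooth, gradual changes in the decoded graphs along such a path are what the interpolation figures are meant to show.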

Figure 27: Linear interpolation samples starting from the same latent code $z$ of the learned latent space of a trained framework with edge-augmented encoder, vanilla VAE disentanglement module, MLP-based decoder, SVD embeddings, 25 categories of architectural elements, extra features of polygon vertices’ coordinates, and boosted dimensions of the latent space
Figure 28: Linear interpolation samples starting from the same latent code $z$ of the learned latent space of a trained framework with edge-augmented encoder, vanilla VAE disentanglement module, MLP-based decoder, SVD embeddings, 6 categories of architectural elements, extra features of polygon vertices’ coordinates, and boosted dimensions of the latent space
Figure 29: Linear interpolation samples starting from the same latent code $z$ of the learned latent space of a trained framework with edge-augmented encoder, vanilla VAE disentanglement module, MLP-based decoder, SVD embeddings, and 25 categories of architectural elements
Figure 30: Linear interpolation samples between pairs of randomly generated latent codes $z$ of the learned latent space of a trained framework with edge-augmented encoder, vanilla VAE disentanglement module, MLP-based decoder, SVD embeddings, 25 categories of architectural elements, extra features of polygon vertices’ coordinates, and boosted dimensions of the latent space
Figure 31: Linear interpolation samples between pairs of randomly generated latent codes $z$ of the learned latent space of a trained framework with edge-augmented encoder, vanilla VAE disentanglement module, MLP-based decoder, SVD embeddings, 6 categories of architectural elements, extra features of polygon vertices’ coordinates, and boosted dimensions of the latent space
Figure 32: Linear interpolation samples between pairs of randomly generated latent codes $z$ of the learned latent space of a trained framework with edge-augmented encoder, vanilla VAE disentanglement module, MLP-based decoder, SVD embeddings, and 25 categories of architectural elements
Figure 33: Linear interpolation samples of the learned latent space of a trained framework with edge-augmented encoder, vanilla VAE disentanglement module, MLP-based decoder, SVD embeddings, 25 categories of architectural elements, extra features of polygon vertices’ coordinates, and boosted dimensions of the latent space
Figure 34: Linear interpolation samples of the learned latent space of a trained framework with edge-augmented encoder, vanilla VAE disentanglement module, MLP-based decoder, SVD embeddings, and 25 categories of architectural elements
Figure 35: Linear interpolation samples of the learned latent space of a trained framework with edge-augmented encoder, vanilla VAE disentanglement module, MLP-based decoder, SVD embeddings, 6 categories of architectural elements, extra features of polygon vertices’ coordinates, and boosted dimensions of the latent space

Appendix D Attributed adjacency multi-graph datasets

Some floor plan image samples with corresponding parsing and attributed adjacency multi-graph (AAMG) extraction outputs are shown in Fig. 36. More AAMG samples extracted from the floor plan image repository are shown in Fig. 37. The graph datasets can also be used to explore other learning-based design tools or training tasks.

Figure 36: Randomly selected floor plan samples with corresponding parsing and AAMG extraction outputs
Figure 37: Selected AAMG samples of the training dataset with 25 architectural element categories

Appendix E Conversion from generated attributed adjacency multi-graphs to floor plans

Converting the generated attributed adjacency graph into graphical floor plans is a strategic step for the qualitative evaluation of model performance, particularly concerning the interpretation of graph data space. This conversion can be essential as graphical floor plans provide a clear and tangible representation of the complex information in the adjacency and node feature matrices. This visual form allows a more intuitive understanding of the graph’s spatial relationships and encoded architectural elements. By representing the generated graph data as graphical floor plans, qualitatively evaluating the model’s performance becomes significantly easier, enabling direct comparison between the model-generated graph layouts. This comparison is vital for assessing the fidelity of the model’s interpretation, identifying areas of strength, and pinpointing aspects that may require further refinement.

Typically, the task of automatically converting an attributed adjacency graph into practical floor plan layouts involves interpreting the graph (where nodes represent different spaces and edges represent various connections) and translating this abstract representation into a coherent, spatially accurate floor plan. This reverse-engineering process is generally non-trivial. One primary challenge in converting attributed adjacency graphs back into floor plans is the non-uniqueness of the task: a single graph can correspond to multiple feasible floor plan layouts, each varying in spatial arrangement while adhering to the same structural relationships and constraints defined by the graph. However, our generated graphs contain more detailed positional and area information for each space node, together with a specified space type. This level of detail significantly reduces the ambiguity typically associated with the conversion, providing a clearer blueprint for the corresponding floor plan layout. Despite the detailed information available in the graphs, a certain level of compromise might still be required to adapt the abstract graph data into functional floor plans. This involves balancing the rigid constraints of the graph with the practical considerations of architectural design.

For the task of converting attributed adjacency graphs back into floor plan layouts, our approach is designed to be straightforward and efficient, primarily serving the purpose of qualitative evaluation of model performance in interpreting graph data space. This simplified conversion process is not the primary focus of our study but rather a means to validate the effectiveness of our model (Fig. 38 and Fig. 39). We begin by extracting the centre point coordinates of all space nodes from the corresponding node feature matrix. Each space node is represented as a simple located point, streamlining the initial layout. In this first step, we exclude the outdoor node, which will be used as a reference point to generate different connection elements (such as walls, doors, windows, etc.) for spaces that are connected to the outdoor node. We utilize the area ratio data from the node feature matrix for space nodes requiring more specific area information. This information is used to convert points into rectangles with areas proportional to the provided ratios. This step excludes certain space nodes like elevators, staircases, and toilets. These spaces typically have standard occupied areas and do not require detailed area information for our purpose. Afterwards, we conduct a series of shape transformations based on the information provided by the corresponding adjacency matrix. If two rectangles overlap, we divide their overlapping area and adjust their boundaries accordingly. For rectangles representing spaces connected in the graph but not adjacent in the layout, we translate their sides to reflect these connections. The boundaries of the generated polygon shapes are then buffered to represent walls, providing a structural outline to the layout. This is followed by segmenting the adjacency sides of pairs of polygons connected by $k$ different edge types (other than “wall”) into $k+1$ segments. Each segment is then assigned a connection type like “door”, “window”, or “opening” if indicated by the adjacency matrix. This segmentation also applies to polygons connected to the outdoor node. After all these steps, the remaining space nodes whose occupied areas are usually standard will be plotted as rectangles with consistent areas for demonstration purposes.
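Three of the geometric steps described above (points to area-proportional rectangles, overlap division, and side segmentation) can be illustrated with a short sketch. This is not the implementation used in this study: the function names, the fixed aspect ratio, the total area of 100 units, the rule of cutting an overlap at its midline along the axis of smaller overlap, the equal-length segments, and the labelling of the remaining segment as "wall" are all assumptions made for illustration. Side translation for non-adjacent spaces and wall buffering are omitted.

```python
import math


def rect_from_centre(cx, cy, area_ratio, total_area=100.0, aspect=1.0):
    """Axis-aligned rectangle (xmin, ymin, xmax, ymax) centred on (cx, cy).

    Its area is area_ratio * total_area; aspect = width / height (assumed).
    """
    area = area_ratio * total_area
    width = math.sqrt(area * aspect)
    height = area / width
    return (cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2)


def split_overlap(a, b):
    """Trim two overlapping rectangles so they meet at the middle of the overlap.

    The cut runs along the axis with the smaller overlap (an assumed rule).
    Rectangles that do not overlap are returned unchanged.
    """
    ox = min(a[2], b[2]) - max(a[0], b[0])  # overlap extent in x
    oy = min(a[3], b[3]) - max(a[1], b[1])  # overlap extent in y
    if ox <= 0 or oy <= 0:
        return a, b
    a, b = list(a), list(b)
    if ox <= oy:
        mid = (max(a[0], b[0]) + min(a[2], b[2])) / 2
        if a[0] + a[2] <= b[0] + b[2]:  # a lies to the left of b
            a[2], b[0] = mid, mid
        else:
            a[0], b[2] = mid, mid
    else:
        mid = (max(a[1], b[1]) + min(a[3], b[3])) / 2
        if a[1] + a[3] <= b[1] + b[3]:  # a lies below b
            a[3], b[1] = mid, mid
        else:
            a[1], b[3] = mid, mid
    return tuple(a), tuple(b)


def segment_side(p0, p1, edge_types):
    """Split the shared side p0 -> p1 into k + 1 equal parts for k edge types.

    The first k parts take the given connection types; labelling the last
    part "wall" is an assumption of this sketch.
    """
    k = len(edge_types)
    labels = list(edge_types) + ["wall"]
    segments = []
    for i, label in enumerate(labels):
        t0, t1 = i / (k + 1), (i + 1) / (k + 1)
        start = (p0[0] + t0 * (p1[0] - p0[0]), p0[1] + t0 * (p1[1] - p0[1]))
        end = (p0[0] + t1 * (p1[0] - p0[0]), p0[1] + t1 * (p1[1] - p0[1]))
        segments.append((start, end, label))
    return segments


# Two hypothetical space nodes whose rectangles initially overlap in x.
living = rect_from_centre(2.0, 2.0, area_ratio=0.16)
bedroom = rect_from_centre(5.0, 2.0, area_ratio=0.16)
living, bedroom = split_overlap(living, bedroom)

# The shared vertical side, connected by k = 2 edge types (door and window).
shared_start = (living[2], max(living[1], bedroom[1]))
shared_end = (living[2], min(living[3], bedroom[3]))
parts = segment_side(shared_start, shared_end, ["door", "window"])
```

A geometry library with polygon buffering would be a natural choice for the wall-buffering step; the sketch stays with plain tuples so that each transformation remains easy to inspect.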

Figure 38: Converting attributed adjacency graphs with 6 distinct architectural design elements into floor plan layouts
Figure 39: Converting attributed adjacency graphs with 25 distinct architectural design elements into floor plan layouts