License: arXiv.org perpetual non-exclusive license
arXiv:2403.07124v2 [stat.ME] 20 Mar 2024

Stochastic gradient descent-based inference for dynamic network models with attractors

Hancong Pan
Department of Mathematics and Statistics
Boston University

Xiao**g Zhu
Department of Mathematics and Statistics
Boston University

Cantay Caliskan
Goergen Institute for Data Science
University of Rochester

Dino P. Christenson
Department of Political Science
Washington University in St. Louis

Konstantinos Spiliopoulos
Department of Mathematics and Statistics
Boston University

Dylan Walker
Argyros School of Business and Economics
Chapman University

Eric D. Kolaczyk
Department of Mathematics and Statistics
McGill University
E-mail: [email protected]
Abstract

In Coevolving Latent Space Networks with Attractors (CLSNA) models, nodes in a latent space represent social actors, and edges indicate their dynamic interactions. Attractors are added at the latent level to capture the notion of attractive and repulsive forces between nodes, borrowing from dynamical systems theory. However, the reliance of CLSNA on MCMC estimation makes scaling difficult, and the requirement for nodes to be present throughout the study period limits practical applications. We address these issues by (i) introducing a stochastic gradient descent (SGD) parameter estimation method, (ii) developing a novel approach for uncertainty quantification using SGD, and (iii) extending the model to allow nodes to join and leave over time. Simulation results show that our extensions result in little loss of accuracy compared to MCMC, but can scale to much larger networks. We apply our approach to the longitudinal social networks of members of US Congress on the social media platform X. Accounting for node dynamics overcomes selection bias in the network and uncovers uniquely and increasingly repulsive forces within the Republican Party.

Keywords: Longitudinal social networks; Attractors; Partisan polarization; Dynamic networks analysis; Co-evolving network model

1 Introduction

We consider the problem of modeling dynamic networks, a collection of network graphs $G_t$ indexed over times $t$. In particular, we focus on a specific class of temporal network models known as CLSNA models, first developed in Zhu et al. (2023). In a CLSNA model, the relations and interactions between nodes and certain attributes of the nodes influence each other as they co-evolve over time.

Latent space network models are a commonly used class of models for static networks (Hoff et al., 2002) where the probability of a relation between actors depends on the positions of individuals in an unobserved “social space” or latent space. Several dynamic latent space network models have been proposed in the literature, which involve embedding nodes of temporal networks into a latent Euclidean space, thus allowing each actor to have a temporal trajectory in the latent space. Earlier approaches modeling the transition and evolution of actors dictate that latent positions evolve over time in a Markov fashion (Sarkar and Moore, 2005; Sewell and Chen, 2015a, b, 2016). In the CLSNA model (Zhu et al., 2023), temporal evolution of latent positions depends on network connectivity via the presence of attractors (a concept that is fundamental to dynamical systems) at the latent level. The CLSNA model has been shown to be effective at disentangling positive (attractive) and negative (repulsive) forces among political elites and the public when applied to longitudinal social networks from the social media platforms X and Reddit (Zhu et al., 2023).

The CLSNA model has adeptly captured the nuances of polarization in American politics, offering valuable insights into the past decade. Despite its efficacy, the model’s potential could be further realized by advancing the inference methods to address the challenges of scalability. The challenges of scalability arise in two key aspects. The first aspect is an increasing number of nodes. The second is changes to sets of nodes over time.

The first aspect concerns the reliance on MCMC, as used in Zhu et al. (2023), for estimating model parameters and latent positions. This approach, although quite accurate, becomes prohibitively computationally expensive when scaling to networks with more than a couple hundred nodes. Various methods have been proposed to reduce computational costs for dynamic latent position network models. Sarkar and Moore (2005) introduced a two-stage procedure to optimize the likelihood: multidimensional scaling followed by a conjugate gradient update rule. Raftery et al. (2012) used the case-control likelihood approximation to speed up the estimation algorithm. Liu and Chen (2021) proposed a variational inference algorithm for estimating the dynamic latent positions of the network model.

For the second aspect, the original formulation of CLSNA models assumes that all nodes are present at all time points. In real-world dynamic networks, the sets of nodes tend to change over time. Sewell and Chen (2015b) develop a model for dynamic networks in which only a subset of edges is observed and the edges are missing at random. Zhu et al. (2023) handle the issue by pre-processing the data and keeping a subset of congress members who are present during the whole period of time studied.

In this paper, we directly address the above two challenges of scalability. Specifically, we (i) develop a Stochastic Gradient Descent (SGD) parameter estimation method that allows us to scale to larger networks; (ii) accompany that by a novel corresponding approach to variance estimation that builds on the notion of Laplace approximation; and (iii) implement these advances within an extension of the CLSNA framework that allows nodes to join and leave the network. Motivated by Laplace’s approximation we replace the true posterior probability distribution by a Gaussian, with the mean at the MAP solution and precision matrix determined by the observed Fisher information. Rue et al. (2009) developed the class of INLA algorithms to perform Laplace’s approximation when the precision matrix is sparse, for instance in the case of the conditional independence network structure in state-space models and Gaussian Markov Random Fields (GMRFs). However, in the class of CLSNA models that we study here, the assumption of sparsity does not remain applicable, necessitating the novel development we offer here.
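For reference, the Gaussian approximation mentioned above can be written in its standard textbook form. The symbols $x$ (all unknowns collected in one vector), $\hat{x}$ (the MAP solution), and $H$ (the precision matrix) are notation introduced here for illustration only and are not taken from the paper:

```latex
\pi(x \mid y) \;\approx\; \mathcal{N}\!\left(\hat{x},\, H^{-1}\right),
\qquad
\hat{x} = \arg\max_{x} \log \pi(x \mid y),
\qquad
H = -\nabla_{x}^{2} \log \pi(x \mid y)\,\Big|_{x = \hat{x}} .
```

Under this approximation, marginal posterior standard deviations correspond to the square roots of the diagonal entries of $H^{-1}$, which is why the structure of $H$ (sparse or dense) matters for computation.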

We use the Congressional Hashtag Network from the platform formerly known as Twitter, now referred to as “X”, as detailed in Zhu et al. (2023). The network was constructed from tweets by US congresspersons from 2010 to 2020. Each year, a binary network was formed, wherein nodes represented sitting members of Congress. An edge between two nodes indicated that their common hashtag usage exceeded the annual average among all pairs of congresspersons. We will revisit the X network to highlight the importance of the two aspects of scalability we propose in this paper: a flexible extended model that accommodates varying sets of nodes, and a fast and scalable SGD-based model inference method. We note that the simplicity and effectiveness of the proposed SGD-based inference and uncertainty quantification developed in this paper make it a potentially valuable methodology for generic statistical models. We plan to investigate this direction in future work.
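The edge rule described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: it assumes that "common hashtag usage" is summarized as a symmetric matrix of pairwise counts, and the function name `binary_network_from_counts` and the toy counts are hypothetical.

```python
import numpy as np

def binary_network_from_counts(common_counts):
    """Threshold a symmetric matrix of common-hashtag counts at the mean over all pairs."""
    counts = np.asarray(common_counts, dtype=float)
    n = counts.shape[0]
    upper = np.triu_indices(n, k=1)        # each unordered pair appears once
    threshold = counts[upper].mean()       # average over all pairs for the year
    adj = (counts > threshold).astype(int) # edge if usage exceeds the average
    np.fill_diagonal(adj, 0)               # no self-loops
    return adj, threshold

# Hypothetical counts for four members in one year (not real data).
counts = np.array([[0, 5, 1, 0],
                   [5, 0, 2, 0],
                   [1, 2, 0, 4],
                   [0, 0, 4, 0]])
adj, threshold = binary_network_from_counts(counts)
```

In this toy example the pairwise average is 2, so only the pairs with counts 5 and 4 become edges.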

The rest of the paper is organized as follows. Section 2 reviews the standard CLSNA model in Zhu et al. (2023) and presents the extended CLSNA model that enables us to accommodate real world dynamic networks with nodes entering and leaving the networks. Section 3 introduces the two-stage SGD-based inference algorithm: the SGD-based point estimation method and SGD-based variance estimation method. Section 4 provides results from simulations with factorial designs to assess the capability of the proposed algorithm under different settings (code for this paper can be found at https://github.com/KolaczykResearch/SGD4CLSNA). In Section 5, we report the results from fitting the extended model to the X Congressional Hashtag Network.

2 Extended CLSNA model

The CLSNA model of Zhu et al. (2023) assumes that all nodes are present at all time points. In real-world dynamic networks, the sets of nodes tend to change over time. Take the X Congressional Hashtag Network as an example. Figure 1 shows the numbers of re-elected and newly elected Democratic and Republican congress members in the X network. We observe a significant number of changes in the sets of nodes present at each time, because congress members serve a fixed-year term before they are considered for reelection and because newly elected or sitting congress members may decide to open an X account.

In order to use the original CLSNA model, Zhu et al. (2023) keep only the subset of congress members who were in office and on X during the whole study period. After applying this filtering criterion, the number of nodes was reduced from around 500 to 207.
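The filtering step described above amounts to keeping only the nodes observed at every time point. A minimal sketch, assuming presence is stored as a hypothetical boolean matrix `present` with one row per time point and one column per node:

```python
import numpy as np

# Hypothetical presence indicators: rows are time points, columns are nodes.
present = np.array([[1, 1, 0, 1],
                    [1, 1, 1, 1],
                    [1, 0, 1, 1]], dtype=bool)

always_present = present.all(axis=0)          # True only if present at every time point
kept_nodes = np.flatnonzero(always_present)   # indices retained by the filter
dropped = present.shape[1] - kept_nodes.size  # nodes lost to the filter
```

Here nodes 1 and 2 are discarded because each is missing at one time point, which illustrates how such a filter can remove a large share of the data.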

By developing a model that accommodates nodes entering and leaving the network, we can make inferences based on the full data set. As a result, we are able to obtain latent position estimates at each time and estimates of the attractive or repulsive forces between a given node and its neighbors (influencing its movement over time) that are unbiased by factors like time in office and early technology adoption.

Figure 1: Numbers of re-elected and newly elected Democratic and Republican congress members in the X hashtag network

2.1 Model Definition

Let $G_t = (V_t, E_t)$ be a network evolving in (discrete) time $t$, with vertex set $V_t$ and edge set $E_t$. We allow for $V_t$ to vary over time. Let $Y_t$ be the (random) adjacency matrix at time $t$ corresponding to $G_t$. Data is in the form of a time series of adjacency matrices $\{y_t : t = 1, \cdots, T\}$, where $y_{t,ij} = 1$ if there is an edge between node $i$ and node $j$ at time $t$ and $y_{t,ij} = 0$ otherwise.

We model $Y_t$ using a latent space approach, with attractors added in the latent level to capture the notions of attractive and repulsive forces. In addition, we want to accommodate nodes entering and leaving the network. Let $z_{t,i} \in \mathbb{R}^{p}$ be a time-indexed latent (i.e., unobserved) position for node $i$ in $p$-dimensional Euclidean space, and let $z_t = \{z_{t,i}\}$. Assume that each of the $N_t$ nodes falls into one of two groups, i.e., Democratic and Republican, with known node label $\pi(i) \in \mathcal{C}$ for node $i$, where $\mathcal{C} = \{1, 2\}$ is the set of group labels.

Formally, we define our model as follows:

$$Y_{t,ij} \mid p_{t,ij} \sim \mathrm{Bernoulli}(p_{t,ij}) \tag{1}$$

where at time $t = 1$,

$$\mathrm{logit}(p_{t,ij}) = \alpha - s(\bm{z}_{t,i}, \bm{z}_{t,j}) \tag{2}$$
$$Z_{t,i} \sim \mathrm{Normal}(0, \tau^{2} I_{p}), \tag{3}$$

and at time $t \geq 2$, if node $i$ is absent at time $t-1$,

$$\mathrm{logit}(p_{t,ij}) = \alpha - s(\bm{z}_{t,i}, \bm{z}_{t,j}) \tag{4}$$
$$Z_{t,i} \sim \mathrm{Normal}(\mu_{t,i}, \phi^{2} I_{p}) \tag{5}$$
$$\mu_{t,i} = \overline{z}^{\pi(i)}_{t-1}, \tag{6}$$

or else at time $t \geq 2$, if node $i$ is present at time $t-1$,

$$\mathrm{logit}(p_{t,ij}) = \alpha + \delta Y_{t-1,ij} - s(\bm{z}_{t,i}, \bm{z}_{t,j}) \tag{7}$$
$$Z_{t,i} \mid Z_{t-1,i} = z_{t-1,i} \sim \mathrm{Normal}(\mu_{t,i}, \sigma^{2} I_{p}) \tag{8}$$
$$\mu_{t,i} = z_{t-1,i} + \gamma^{\pi(i)}_{w} A_{i}^{w}(z_{t-1}, Y_{t-1}) + \gamma_{b} A_{i}^{b}(z_{t-1}, Y_{t-1}). \tag{9}$$

Here $s(\cdot,\cdot)$ is a similarity function, and $A_i^{w}$ and $A_i^{b}$ are the two attractor functions for node $i$ in $Y_{t-1}$. This application employs a two-group CLSNA model with attractors that mimic attractive and repulsive forces. It is a specific version of the general CLSNA model class: each node is assigned to one of two labeled groups, and movements are dictated by attractors shaped by neighboring nodes’ influence. Euclidean distance is used as the similarity function, $s(z_{t,i}, z_{t,j}) = \|z_{t,i} - z_{t,j}\|_{2}$, and the two attractors for node $i$ are defined as follows,

$$A_{i}^{w}(\bm{z}_{t-1}, Y_{t-1}) = \bar{\bm{z}}^{1}_{t-1,i} - \bm{z}_{t-1,i}, \qquad \bar{\bm{z}}^{1}_{t-1,i} = \frac{1}{|S_{1}^{(i)}|} \sum_{j \in S_{1}^{(i)}} \bm{z}_{t-1,j} \tag{10}$$
$$A_{i}^{b}(\bm{z}_{t-1}, Y_{t-1}) = \bar{\bm{z}}^{2}_{t-1,i} - \bm{z}_{t-1,i}, \qquad \bar{\bm{z}}^{2}_{t-1,i} = \frac{1}{|S_{2}^{(i)}|} \sum_{j \in S_{2}^{(i)}} \bm{z}_{t-1,j}, \tag{11}$$

which are the discrepancies of $\bm{z}_{t-1,i}$ from two local averages at time $t-1$. The sets $S_{1}^{(i)}$ and $S_{2}^{(i)}$ are defined based on group membership and network connectivity as follows:

  1. $S_{1}^{(i)} = \{j \in \mathcal{N} \setminus i \mid Y_{ij} = 1, \pi(i) = \pi(j)\}$, neighbors of node $i$ in the same group

  2. $S_{2}^{(i)} = \{j \in \mathcal{N} \setminus i \mid Y_{ij} = 1, \pi(i) \neq \pi(j)\}$, neighbors of node $i$ in a different group.

Note that even though it is not explicitly mentioned in the notation, the sets $S_{1}^{(i)}$ and $S_{2}^{(i)}$ also depend on time because the connectivity sets are allowed to change with time. When $S_{1}^{(i)}$ or $S_{2}^{(i)}$ or both are empty, we set $A_{i}^{w}(\bm{z}_{t-1}, Y_{t-1}) = 0$, or $A_{i}^{b}(\bm{z}_{t-1}, Y_{t-1}) = 0$, or both to zero, respectively. This implies that if node $i$ is not connected to any members of a specific group, it receives no force from that group.
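The attractor definitions in (10) and (11), including the empty-set convention, can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation; the function name `attractors` and the array layout (rows of `z_prev` are latent positions at time $t-1$, `y_prev` is the adjacency matrix at time $t-1$) are assumptions made here.

```python
import numpy as np

def attractors(z_prev, y_prev, labels, i):
    """Return (A_w, A_b) for node i from positions and edges at time t-1."""
    neighbors = y_prev[i].astype(bool)         # astype returns a copy of the row
    neighbors[i] = False                       # exclude node i itself
    same = neighbors & (labels == labels[i])   # S_1^(i): neighbors in the same group
    diff = neighbors & (labels != labels[i])   # S_2^(i): neighbors in the other group
    zeros = np.zeros_like(z_prev[i])
    # An empty neighbor set contributes no force (attractor set to zero).
    a_w = z_prev[same].mean(axis=0) - z_prev[i] if same.any() else zeros
    a_b = z_prev[diff].mean(axis=0) - z_prev[i] if diff.any() else zeros
    return a_w, a_b

# Toy example with four nodes in two groups (values are made up).
z_prev = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 4.0], [1.0, 1.0]])
labels = np.array([1, 1, 2, 2])
y_prev = np.array([[0, 1, 1, 0],
                   [1, 0, 0, 0],
                   [1, 0, 0, 0],
                   [0, 0, 0, 0]])
a_w0, a_b0 = attractors(z_prev, y_prev, labels, 0)   # node 0 has one neighbor per group
a_w3, a_b3 = attractors(z_prev, y_prev, labels, 3)   # node 3 is isolated
```

Node 0 is pulled toward (2, 0) within its group and toward (0, 4) across groups, while the isolated node 3 receives zero force from both.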

Note that we automatically recover the original CLSNA model when $V_t \equiv V$ is fixed over time. In this case, at time $t \geq 2$, a node $i$ is always present at time $t-1$, so Eqs. (4)-(6) are redundant.

In this proposed extended model, each node lies in a $p$-dimensional Euclidean latent space. The smaller the distance between two nodes in the latent space, the greater their probability of being connected, as in (1), (2), (4), and (7). The expressions in Eqs. (10) and (11) capture the discrepancy between the current latent position of node $i$ and the average position of its current neighbors in the sets $S_{1}^{(i)}$ and $S_{2}^{(i)}$, respectively. The corresponding parameters $\gamma^{w}_{1}, \gamma^{w}_{2}$, and $\gamma^{b}$ (written $\gamma^{\pi(i)}_{w}$ and $\gamma_{b}$ in Eq. (9)) represent attractive and repulsive forces, respectively. The CLSNA model allows the network connectivity to enter the temporal evolution of latent positions in the form of attractors. Specifically, in the proposed model the evolution of latent positions for each node $i$ from $t-1$ to $t$ is modeled by the normal transition distribution in Eq. (8), the mean vector of which, given in Eq. (9), depends not only on the latent position of the node itself at time $t-1$, but also on the two local averages, one from its neighbors in the same group, the other from its neighbors in a different group, as captured in (10) and (11). The strength of attraction and repulsion toward local averages is summarized by the attractor functions and the associated parameters. The parameter $\delta$ captures edge persistence. For $\delta > 0$, the probability of an edge at time $t$ will be increased when one exists already at time $t-1$.
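The link between latent distance, edge persistence, and edge probability can be illustrated with a small sketch. The function name `edge_probability` and its argument names are hypothetical; the persistence term `delta * y_prev` is included only when an edge indicator from time $t-1$ is supplied, and the defaults drop it.

```python
import numpy as np

def edge_probability(z_i, z_j, alpha, delta=0.0, y_prev=0):
    """P(Y_{t,ij} = 1) with logit(p) = alpha + delta * y_prev - ||z_i - z_j||_2."""
    dist = np.linalg.norm(np.asarray(z_i, dtype=float) - np.asarray(z_j, dtype=float))
    eta = alpha + delta * y_prev - dist        # linear predictor on the logit scale
    return 1.0 / (1.0 + np.exp(-eta))          # inverse logit

# Two nodes at distance 5 with alpha = 5: the logit is 0, so p = 0.5.
p_new = edge_probability([0, 0], [3, 4], alpha=5.0)
# Same pair with an existing edge at t-1 and delta = 1: the logit rises to 1.
p_persist = edge_probability([0, 0], [3, 4], alpha=5.0, delta=1.0, y_prev=1)
# Moving the nodes closer together also raises the probability.
p_close = edge_probability([0, 0], [0, 1], alpha=5.0)
```

With these made-up values, both a positive $\delta$ with an existing edge and a shorter latent distance increase the connection probability relative to the baseline of 0.5.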

The extended model is a natural extension of the original model. If there are no nodes entering or leaving the networks, the extended model is identical to the original model of Zhu et al. (2023). When the sets of nodes change over time, the model classifies nodes into two categories and models the two types of nodes separately:

  1. For actors who are “retained” from time $t-1$ to $t$, the manner in which network connectivity influences the temporal evolution of latent positions remains consistent with the original model: the evolution of latent positions for the node $i$ from $t-1$ to $t$ is modeled by the normal transition distribution, the mean vector of which is a linear combination of the latent position of itself at time $t-1$ and the mean of both its neighbors in the same group and its neighbors in a different group.

  2. For an actor that is present at time $t$ but is absent at time $t-1$, we cannot make an educated guess of its current position at time $t$ because of its absence at time $t-1$. For this reason we choose to model the prior distribution of its latent position as a normal distribution, the mean vector of which is the mean of all members in the same group at $t-1$, which we denote by $\overline{z}^{\pi(i)}_{t-1}$.
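The two cases above can be combined into one function that returns the prior mean $\mu_{t,i}$ of a node's position, following Eq. (6) for entering nodes and Eq. (9) for retained nodes. This is a sketch under assumptions made here: `present_prev` is a hypothetical boolean vector marking which nodes existed at $t-1$, rows of `z_prev` for absent nodes hold NaN placeholders, `gamma_w` is a dictionary keyed by group label, and at least one member of every group is present at $t-1$.

```python
import numpy as np

def prior_mean(i, z_prev, y_prev, labels, present_prev, gamma_w, gamma_b):
    """Mean of Z_{t,i} given positions and edges at time t-1."""
    group = labels == labels[i]
    if not present_prev[i]:
        # Entering node: mean of all same-group members present at t-1, Eq. (6).
        return z_prev[group & present_prev].mean(axis=0)
    # Retained node: attractor-driven mean, Eq. (9).
    nbrs = y_prev[i].astype(bool) & present_prev
    nbrs[i] = False
    same, diff = nbrs & group, nbrs & ~group
    a_w = z_prev[same].mean(axis=0) - z_prev[i] if same.any() else 0.0
    a_b = z_prev[diff].mean(axis=0) - z_prev[i] if diff.any() else 0.0
    return z_prev[i] + gamma_w[int(labels[i])] * a_w + gamma_b * a_b

# Toy example: node 3 enters at time t, so its row at t-1 is a placeholder.
z_prev = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 4.0], [np.nan, np.nan]])
labels = np.array([1, 1, 2, 2])
present_prev = np.array([True, True, True, False])
y_prev = np.array([[0, 1, 1, 0],
                   [1, 0, 0, 0],
                   [1, 0, 0, 0],
                   [0, 0, 0, 0]])
gamma_w, gamma_b = {1: 0.5, 2: 0.25}, -0.5   # made-up coefficient values

mu_retained = prior_mean(0, z_prev, y_prev, labels, present_prev, gamma_w, gamma_b)
mu_entering = prior_mean(3, z_prev, y_prev, labels, present_prev, gamma_w, gamma_b)
```

In this example the retained node 0 moves toward its same-group neighbor and away from its other-group neighbor, while the entering node 3 is centered on the only group-2 member present at $t-1$.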

3 SGD-based Inference Method

We develop a two-stage algorithm to make inferences based on the posterior $\pi(\theta, Z_{1:T} \mid Y_{1:T})$, where $\theta = (\alpha, \delta, \gamma^{w}, \gamma^{b})$. At the first stage we use SGD to compute a point estimate. At the second stage, we estimate marginal posterior standard deviations by a novel approach to quadratic approximation of the log-posterior density, which is also based on SGD.

3.1 Point Estimate

The posterior distribution of latent positions and parameters is

$$\begin{aligned}
&\pi(Z_{1:T}, \theta \mid Y_{1:T}) \\
&\quad\propto P(Y_{1:T}, Z_{1:T} \mid \theta)\, \pi(\theta) \\
&\quad\propto \left( \prod_{t=2}^{T} P(Z_t \mid Z_{t-1}, Y_{t-1})\, P(Y_t \mid Z_t, Y_{t-1}) \right) P(Y_1 \mid Z_1)\, P(Z_1)\, \pi(\theta) \\
&\quad= \left( \prod_{t=2}^{T} \prod_{i=1}^{N_t} \left( P(Z_{t,i} \mid Z_{t-1}, Y_{t-1}) \prod_{j : j \neq i} P(Y_{t,ij} \mid Z_{t,i}, Z_{t,j}, Y_{t-1,ij}) \right) \right) \cdot P(Y_1 \mid Z_1)\, P(Z_1)\, \pi(\theta).
\end{aligned} \tag{12}$$

The log posterior distribution (i.e., taking the log of (12)), therefore, can be written as sums of simpler functions. To optimize the posterior function with SGD, at each step, we randomly sample different terms from these sums. The key steps are:

  1. Initialize $Z_{1:T}, \theta$.

  2. Randomly sample indices of terms from each of the sums in the log-posterior, and evaluate the sampled summands.

  3. Update $Z_{1:T}, \theta$ by a stochastic gradient descent step using the output of step 2.

Repeat steps 2-3 until a convergence criterion is met. In our implementation, we used gradient descent with momentum when the sample size was small enough for full gradients to be computationally feasible, and stochastic gradient descent with momentum when the sample size was large; see Polyak (1964).
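The sampling idea in steps 1-3 can be illustrated on a toy objective. The sketch below is not the CLSNA posterior: it assumes a one-dimensional objective $\sum_i f_i(w)$ with $f_i(w) = -\tfrac{1}{2}(w - a_i)^2$, and the names `sgd_momentum` and `term_grads` are hypothetical. It ascends the sum using a rescaled mini-batch gradient with a heavy-ball momentum term.

```python
import numpy as np

def sgd_momentum(term_grads, n_terms, w0, step, beta=0.9, batch=32,
                 n_iter=2000, seed=0):
    """Maximize sum_i f_i(w) from randomly sampled summands (toy sketch)."""
    rng = np.random.default_rng(seed)
    w, v = float(w0), 0.0
    for _ in range(n_iter):
        # Step 2: sample indices of summands without replacement.
        idx = rng.choice(n_terms, size=batch, replace=False)
        # Rescale so the mini-batch gradient is unbiased for the full sum.
        g = (n_terms / batch) * term_grads(w, idx).sum()
        # Step 3: heavy-ball momentum update (ascent direction).
        v = beta * v + g
        w += step * v
    return w

rng = np.random.default_rng(1)
a = rng.normal(loc=2.0, scale=1.0, size=500)

def term_grads(w, idx):
    # Gradient of f_i(w) = -0.5 * (w - a_i)^2 for the sampled terms.
    return a[idx] - w

# The maximizer of the full sum is the sample mean of a.
w_hat = sgd_momentum(term_grads, a.size, w0=0.0, step=2e-5)
```

The rescaling by `n_terms / batch` keeps the expected update equal to the full-gradient update, so the step size has the same meaning for any batch size.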

Input: Network time-series $Y_{1:T}$. Current values $\theta^{(k)}$, $Z_{1:T}^{(k)}$ for parameters and latent positions.
      $\lambda$: step size for $Z_{1:T}$, $\lambda$: step size for $\theta$
      Output: $\theta^{(k+1)}, Z_{1:T}^{(k+1)}$

1: Set $\theta$, $Z_{1:T}$ to be $\theta^{(k)}$, $Z_{1:T}^{(k)}$.
2: Uniformly sample the terms (factors) from the posterior distribution:
$$\prod_{t=2}^{T} \prod_{i=1}^{N} P(Z_{t,i} \mid Z_{t-1}, Y_{t-1}) \prod_{j:\, j \neq i} P(Y_{t,ij} \mid Z_{t,i}, Z_{t,j}, Y_{t-1,ij})\; P(Y_1 \mid Z_1)\, P(Z_1)\, \pi(\theta) \tag{13}$$
Denote the sampled terms by $p_i$. Define $\pi_{SGD}(\theta, Z_{1:T} \mid Y_{1:T})$ to be $\prod_i p_i$.
3: Take a SGD step (see also Remark 3.2)
$$Z_{1:T}^{(k+1)} = Z_{1:T}^{(k)} - \lambda \left(\frac{\partial \log \pi_{SGD}(\theta^{(k)}, Z_{1:T}^{(k)} \mid Y_{1:T})}{\partial Z_{1:T}}\right)$$
$$\theta^{(k+1)} = \theta^{(k)} - \lambda \left(\frac{\partial \log \pi_{SGD}(\theta^{(k)}, Z_{1:T}^{(k)} \mid Y_{1:T})}{\partial \theta}\right) \tag{14}$$
Algorithm 1 SGDstep

Formally, Algorithm 1 is the main building block for the SGD-based point estimation method, which we present as pseudo-code in Algorithm 2.

Input: Network time-series $Y_{1:T}$. Initial values $\theta^{init}$, $Z_{1:T}^{init}$ for parameters and latent positions.
      $M$: maximum number of iterations, $\lambda$: step size for $Z_{1:T}$, $\lambda$: step size for $\theta$
      Output: $\theta^{(k)}, Z_{1:T}^{(k)}$

1: Set initial values $\theta^{(0)}$, $Z_{1:T}^{(0)}$ to be $\theta^{init}$, $Z_{1:T}^{init}$.
2: for $k \leftarrow 0$ to $M$ do
3:     Take a SGD step to optimize the posterior distribution: $(\theta^{(k+1)}, Z_{1:T}^{(k+1)})$ = SGDstep($\theta^{(k)}, Z_{1:T}^{(k)}$)
4:     Stop if the stopping criterion is met
5: end for
Algorithm 2 SGD-based Point estimation Method
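To make the structure of Algorithms 1 and 2 concrete, the following is a minimal sketch on a deliberately simplified model. It assumes a single static network with edge log-odds $\alpha - \lVert z_i - z_j \rVert$, which is not the CLSNA likelihood (no attractors, no edge persistence, no priors). All function names, step sizes, and the fixed iteration count are hypothetical choices. The sketch ascends the log-likelihood, which is equivalent to descending its negative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_lik(alpha, Z, Y, pairs):
    """Sum of log P(Y_ij | z_i, z_j) over the given dyads."""
    i, j = pairs[:, 0], pairs[:, 1]
    eta = alpha - np.linalg.norm(Z[i] - Z[j], axis=1)
    return np.sum(Y[i, j] * eta - np.logaddexp(0.0, eta))

def grad_log_lik(alpha, Z, Y, pairs):
    """Gradient of log_lik with respect to alpha and Z."""
    i, j = pairs[:, 0], pairs[:, 1]
    diff = Z[i] - Z[j]
    dist = np.linalg.norm(diff, axis=1)
    resid = Y[i, j] - sigmoid(alpha - dist)   # d loglik / d eta
    unit = diff / dist[:, None]               # d dist / d z_i
    g_Z = np.zeros_like(Z)
    np.add.at(g_Z, i, -resid[:, None] * unit)
    np.add.at(g_Z, j, resid[:, None] * unit)
    return resid.sum(), g_Z

def sgd_step(alpha, Z, Y, rng, batch=64, step_theta=1e-3, step_Z=1e-2):
    """One stochastic step on uniformly sampled dyad terms."""
    n = Z.shape[0]
    i = rng.integers(0, n, size=batch)
    j = rng.integers(0, n, size=batch)
    pairs = np.stack([i, j], axis=1)[i != j]  # drop self-pairs
    g_alpha, g_Z = grad_log_lik(alpha, Z, Y, pairs)
    return alpha + step_theta * g_alpha, Z + step_Z * g_Z

# Simulate a small network from the toy model.
rng = np.random.default_rng(0)
n, p = 30, 2
Z_true = rng.normal(size=(n, p))
D = np.linalg.norm(Z_true[:, None, :] - Z_true[None, :, :], axis=2)
Y = (rng.random((n, n)) < sigmoid(1.0 - D)).astype(float)
all_pairs = np.array([(i, j) for i in range(n) for j in range(n) if i != j])

# Outer loop in the spirit of Algorithm 2, with a fixed iteration budget.
alpha, Z = 0.0, rng.normal(scale=0.1, size=(n, p))
ll_start = log_lik(alpha, Z, Y, all_pairs)
for k in range(2000):
    alpha, Z = sgd_step(alpha, Z, Y, rng)
ll_end = log_lik(alpha, Z, Y, all_pairs)
```

Separate step sizes for the global parameter and the latent positions are used here because their gradients aggregate different numbers of sampled terms.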
Remark 3.1.

In our work using gradient descent, we encountered a recurring challenge. For the majority of time points, the algorithm performs as expected, closely matching the true underlying values in simulation. However, at a few time points an anomalous sign inversion occurs in one dimension of the latent position. In particular, if $z_{t,i} = [z_{t,i}^{1}, z_{t,i}^{2}, \ldots, z_{t,i}^{p}]$ is the true value, it is estimated as $\tilde{z}_{t,i} = [-z_{t,i}^{1}, z_{t,i}^{2}, \ldots, z_{t,i}^{p}]$, even though the estimates at all other time points match the truth.

This problem is rooted in the complex, non-convex landscape of the posterior, where the optimizer sometimes settles in local modes. The issue is similar to label switching in the dynamic stochastic block model literature, and it may not be solvable without an extra assumption, e.g., that most of the nodes do not change group across two different time steps (Matias and Miele, 2017).

The resolution that we found to work well is to start with a CLSNA model whose latent variables live in a higher-dimensional space than the one we intend to use. That is, we aim to fit a model with $Z \in \mathbb{R}^{p}$, but we intentionally first choose $Z \in \mathbb{R}^{q}$ with $q > p$. Then, we use the first $p$ principal components of the vector of latent position variables for $Z_{1:T}$. This method was the most effective of several alternatives attempted to remedy local modes. Intuitively, one can think of this approach as creating a path between modes, thus allowing SGD to travel between modes in the higher dimension. In both the simulated and real data examples, we set $p$ to 2 and chose $q$ as 3. We begin by fitting a model with $Z$ in $\mathbb{R}^{3}$, and then apply principal component analysis (PCA) to the resulting latent positions, thereby compressing $Z$ into $\mathbb{R}^{2}$ while preserving as much information as possible. This reduced form is then used as the initial value for fitting a model with $Z$ in $\mathbb{R}^{2}$. Although this approach doubles the cost of fitting the model and adds a PCA step, it provides more accurate and faster convergence towards the true values, as it navigates past local optima that could hinder convergence in a single-stage fitting process.
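The dimension-reduction step described in this remark can be sketched as follows. The sketch assumes the fitted positions are stored as an array of shape `(T, N, q)` and that PCA is applied to the positions pooled over all time points and nodes; both the layout and the pooling are assumptions made for illustration, and `pca_reduce` is a hypothetical helper name.

```python
import numpy as np

def pca_reduce(Z_q, p):
    """Project latent positions of shape (T, N, q) onto their first p PCs.

    Pools all time points and nodes into one (T*N, q) matrix, centers it,
    and returns the scores on the leading p principal directions,
    reshaped back to (T, N, p).
    """
    T, N, q = Z_q.shape
    flat = Z_q.reshape(T * N, q)
    centered = flat - flat.mean(axis=0)
    # Rows of Vt are principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return (centered @ Vt[:p].T).reshape(T, N, p)
```

The output of `pca_reduce(Z_fit_3d, 2)` would then serve as the initial value for the second, two-dimensional fit, where `Z_fit_3d` stands for the positions from the first-stage fit.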

Remark 3.2.

Motivated by the work in Hinton (2012), at the initial stage of Algorithm 1 in Eq.(14) we use the sign of the gradient of the global parameters (+1/-1), i.e., we replace (14) by

$$\theta^{(k+1)} = \theta^{(k)} - \lambda\, \text{sign}\left(\frac{\partial \log \pi_{SGD}(\theta^{(k)}, Z_{1:T}^{(k)} \mid Y_{1:T})}{\partial \theta}\right).$$

This is because the magnitude and variance of these gradient terms are much larger than the magnitude and variance of the gradients of the latent positions; using only the sign leads to more stability. As we approach the end of the training phase, we switch to Eq. (14) for better accuracy.
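The two-phase update in this remark can be sketched on a toy problem. The code below assumes a badly scaled quadratic loss standing in for the negative log-posterior, and a hypothetical iteration index `switch_at` at which the sign update is replaced by the plain gradient update; the source does not specify when the switch occurs.

```python
import numpy as np

def theta_step(theta, grad_loss, step, use_sign):
    """Sign-based update early on, plain gradient update afterwards."""
    direction = np.sign(grad_loss) if use_sign else grad_loss
    return theta - step * direction

def fit(grad_fn, theta0, step, n_iter, switch_at):
    theta = np.array(theta0, dtype=float)
    for k in range(n_iter):
        theta = theta_step(theta, grad_fn(theta), step,
                           use_sign=(k < switch_at))
    return theta

# Toy loss 0.5 * sum(curv * (theta - target)^2) with unequal curvatures.
curv = np.array([50.0, 1.0])
target = np.array([1.0, 2.0])
grad_fn = lambda th: curv * (th - target)

theta_hat = fit(grad_fn, [5.0, -5.0], step=0.01, n_iter=1600, switch_at=600)
```

During the sign phase every coordinate moves by exactly `step` per iteration regardless of gradient scale, which is what makes the early updates insensitive to large or noisy gradient magnitudes.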

3.2 Variance Estimation

Under an MCMC framework, both point estimates for $\theta$ and corresponding uncertainties can be estimated using samples drawn from the posterior. However, our proposed SGD approach above only gives a point estimate for $\theta$, without quantifying uncertainty.

We propose a novel approach to estimate the posterior variance for parameters of interest. It is a method for approximate Bayesian inference based on Laplace’s method. Laplace’s approximation, when applied in scenarios like state-space models and Gaussian Markov Random Fields, leverages the sparsity of the precision matrix as detailed by Rue et al. (2009). However, in more general contexts, such as CLSNA, this assumption of sparsity is not always valid. In response to this, our proposed approach builds upon Laplace’s foundational principles but adapts to the challenges of non-sparse structures. Specifically, it transforms the variance estimation problem into an optimization problem that can also be solved with SGD.

In this section, we present the novel variance estimation algorithm that we propose.

3.2.1 Some Useful Results of Multivariate Normal Distribution

Our variance estimation method exploits the properties of the conditional distributions of a multivariate Gaussian distribution. Let $x$ follow a multivariate normal distribution, $x \sim \mathcal{N}(\mu, \Sigma)$. Then the conditional distribution of a subset vector $x_1$, given its complement vector $x_2$, is also multivariate normal, $x_1 \mid x_2 \sim \mathcal{N}(\mu_{1|2}, \Sigma_{1|2})$, with the conditional mean and covariance given by $\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$ and $\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$, respectively.
The block-wise mean and covariance matrices are defined as $\mu = \left[\begin{smallmatrix}\mu_1 \\ \mu_2\end{smallmatrix}\right]$ and $\Sigma = \left[\begin{smallmatrix}\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22}\end{smallmatrix}\right]$.

More specifically, the covariance of the conditional distribution $\text{Cov}(x_1 \mid x_2)$ is a constant as a function of $x_2$, indicating that the shape of the conditional distribution is independent of the specific value of $x_2$. Additionally, writing $p(x_1 \mid x_2)$ as the p.d.f. of the conditional distribution of $x_1 \mid x_2$, the mean $\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$ is the expectation, median, and mode of the distribution. The maximum of the p.d.f., $p(x_1 = \mu_{1|2} \mid x_2)$, is a constant as a function of $x_2$.

Therefore, we can recover the shape of the marginal distribution of $x_2$ from the joint distribution by varying $x_2$ and following the line $x_1 = \mu_{1|2}$.

Theorem 3.1.

Given the multivariate Gaussian distribution setting presented above, write $p(x_1, x_2)$ for the p.d.f. of the joint distribution of $x_1, x_2$, and $p(x_2)$ for the p.d.f. of the marginal distribution of $x_2$. Then, we have $p(x_2) \propto p(x_1 = \mu_{1|2}, x_2)$.

The proof of the theorem is provided in the supplementary file. In particular, the marginal distribution is proportional to the joint distribution evaluated on the curve of the conditional mean of $x_1$ given $x_2$. Theorem 3.1 thus provides an expression for the shape of the marginal distribution in terms of the joint distribution, when the joint distribution is a multivariate normal distribution.
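Theorem 3.1 can be checked numerically on a small example. The sketch below uses an arbitrary three-dimensional Gaussian, chosen here purely for illustration, in which $x_1$ is the first two coordinates and $x_2$ is the last. It evaluates the gap between the log marginal of $x_2$ and the log joint along the conditional-mean curve at several values of $x_2$; proportionality means this gap should not depend on $x_2$.

```python
import numpy as np

def mvn_logpdf(x, mu, Sigma):
    """Log-density of a multivariate normal at x."""
    d = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = d @ np.linalg.solve(Sigma, d)
    return -0.5 * (mu.size * np.log(2.0 * np.pi) + logdet + quad)

# Illustrative parameters: x1 = coordinates 0 and 1, x2 = coordinate 2.
mu = np.array([1.0, -0.5, 2.0])
Sigma = np.array([[2.0, 0.3, 0.5],
                  [0.3, 1.0, -0.4],
                  [0.5, -0.4, 1.5]])

def log_marginal_minus_log_joint(x2):
    # Conditional mean mu_{1|2} = mu_1 + Sigma_12 Sigma_22^{-1} (x2 - mu_2).
    mu_1g2 = mu[:2] + Sigma[:2, 2] / Sigma[2, 2] * (x2 - mu[2])
    log_joint = mvn_logpdf(np.append(mu_1g2, x2), mu, Sigma)
    log_marg = mvn_logpdf(np.array([x2]), mu[2:], Sigma[2:, 2:])
    return log_marg - log_joint

# The gap is the log of the proportionality constant at each x2.
gaps = np.array([log_marginal_minus_log_joint(v)
                 for v in (-1.0, 0.0, 2.0, 3.5)])
```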

Next, we present an important result demonstrating that, for a normal distribution, the ideas of Theorem 3.1 yield a formula for the variance of $x_2$ that can then be turned into a practical SGD-based algorithm for uncertainty quantification; see Subsection 3.2.2. In particular, recall that a univariate normal distribution is fully described by two parameters, the mean and the variance. By evaluating the density at two distinct points, we can deduce these parameters. Figure 2 visualizes this idea, which leads to Corollary 3.1.1 containing the expression for the variance.

[Figure 2 shows the joint density $p(x_1, x_2)$, the marginal density $p(x_2)$, the curve corresponding to $(\mu_{1|2}, x_2)$, and the two marked points $(\mu_2, p(\mu_1, \mu_2))$ and $(\mu_2 + \eta, p(\mu_{1|2}, \mu_2 + \eta))$, with axes $x$ and $y$.]
Figure 2: Variance Estimation in Bivariate Normal Distribution
Corollary 3.1.1.

Consider the multivariate Gaussian distribution setting presented above, and let $x_2$ be scalar. Then, for some fixed $\tilde{x}_2 \neq \mu_2$,

$$\text{Var}(x_2) = \frac{1}{2} \frac{(\mu_2 - \tilde{x}_2)^2}{\log p(x_1 = \mu_1, x_2 = \mu_2) - \log p(x_1 = \mu_{1|2}, x_2 = \tilde{x}_2)} \tag{15}$$

The proof of the corollary is provided in the supplementary file.
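One way to see where the factor of one half in (15) comes from is to combine Theorem 3.1 with the log-density of a univariate normal. The lines below are a sketch of that reasoning rather than the proof in the supplementary file; $c$ and $c'$ are constants introduced here for illustration, and the last line uses the fact that $\mu_{1|2} = \mu_1$ when $x_2 = \mu_2$.

```latex
\begin{aligned}
\log p(x_2) &= c - \frac{(x_2 - \mu_2)^2}{2\,\text{Var}(x_2)}, \\
\log p(x_1 = \mu_{1|2}, x_2) &= c' - \frac{(x_2 - \mu_2)^2}{2\,\text{Var}(x_2)}
  \quad \text{(by Theorem 3.1)}, \\
\log p(x_1 = \mu_1, x_2 = \mu_2) - \log p(x_1 = \mu_{1|2}, x_2 = \tilde{x}_2)
  &= \frac{(\tilde{x}_2 - \mu_2)^2}{2\,\text{Var}(x_2)}.
\end{aligned}
```

Solving the last line for $\text{Var}(x_2)$ recovers the expression in (15).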

Beyond just the variance represented by the diagonal elements of the covariance matrix, there’s also a formula for the off-diagonal elements. This allows us to reconstruct a single column of the covariance matrix.

Corollary 3.1.2.

Consider the multivariate Gaussian distribution setting presented above. Then, for some fixed $\tilde{x}_2 \neq \mu_2$,

$$\Sigma_{12} = \frac{\Sigma_{22}}{(\tilde{x}_2 - \mu_2)} (\mu_{1|2} - \mu_1),$$

where $\mu_{1|2}$ denotes the conditional mean of $x_1$ given $x_2 = \tilde{x}_2$.
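The covariance formula can be checked without ever reading $\Sigma_{12}$ directly. In the sketch below, the conditional mean is obtained as the maximizer of the joint density over $x_1$ with $x_2$ held at $\tilde{x}_2$, computed from the precision matrix; the formula is then applied to that maximizer. The parameter values and the helper name `constrained_mode` are illustrative assumptions, and $\Sigma_{22}$ is taken as known here, whereas in practice it would come from (15).

```python
import numpy as np

# Illustrative parameters: x1 = coordinates 0 and 1, x2 = coordinate 2.
mu = np.array([1.0, -0.5, 2.0])
Sigma = np.array([[2.0, 0.3, 0.5],
                  [0.3, 1.0, -0.4],
                  [0.5, -0.4, 1.5]])
Lam = np.linalg.inv(Sigma)  # precision matrix

def constrained_mode(x2_tilde):
    """argmax over x1 of the joint density with x2 fixed at x2_tilde."""
    # Zero x1-gradient of the quadratic form:
    # Lam_11 (x1 - mu_1) = -Lam_12 (x2_tilde - mu_2).
    rhs = -Lam[:2, 2] * (x2_tilde - mu[2])
    return mu[:2] + np.linalg.solve(Lam[:2, :2], rhs)

x2_tilde = mu[2] + 0.4
mu_1g2 = constrained_mode(x2_tilde)

# Corollary 3.1.2 applied to the constrained maximizer.
Sigma_12_hat = Sigma[2, 2] / (x2_tilde - mu[2]) * (mu_1g2 - mu[:2])
```

Because the constrained maximizer of a Gaussian joint density equals the conditional mean, an optimizer that only has access to the joint density is enough to recover the off-diagonal entries.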

The methodologies established for a generic normal distribution $p(x_1, x_2)$ are now adapted to estimate variance in the posterior distribution $\pi(\cdot \mid Y)$, effectively applying the same theoretical concepts to the Bayesian analysis under a Laplace approximation.

3.2.2 The Variance Estimation Algorithm

Corollary 3.1.1 and Corollary 3.1.2 are the primary components of our proposed SGD-based method for variance estimation of a single parameter, as detailed in Algorithm 3. The method involves two stages. First, we compute a point estimate using the SGD parameter estimation method. Second, starting from this point estimate, we redo the optimization under the constraint that the parameter whose variance is to be estimated is held fixed at its point estimate shifted by a perturbation of size $\eta$. By Corollary 3.1.1 and Corollary 3.1.2, we can then estimate the variance and covariances of this parameter from the change in the log-posterior and the changes in the other parameters resulting from the perturbation.
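The two-stage procedure can be sketched end to end on a toy target. The code below assumes an exactly Gaussian log-density standing in for the log-posterior, and uses full-gradient ascent rather than stochastic steps so that the result is deterministic; the parameter values, step size, iteration count, and the helper `ascend` are all illustrative assumptions.

```python
import numpy as np

def ascend(grad, x0, free, step=0.1, n_iter=2000):
    """Gradient ascent that moves only the coordinates flagged in `free`."""
    x = np.array(x0, dtype=float)
    for _ in range(n_iter):
        x[free] += step * grad(x)[free]
    return x

# Toy "posterior": a Gaussian log-density, up to an additive constant.
mu = np.array([1.0, -0.5, 2.0])
Sigma = np.array([[2.0, 0.3, 0.5],
                  [0.3, 1.0, -0.4],
                  [0.5, -0.4, 1.5]])
Lam = np.linalg.inv(Sigma)
logp = lambda x: -0.5 * (x - mu) @ Lam @ (x - mu)
grad = lambda x: -Lam @ (x - mu)

k = 2        # index of the parameter of interest
eta = 0.5    # perturbation size

# Stage 1: find the mode with all coordinates free.
all_free = np.ones(3, dtype=bool)
mode = ascend(grad, np.zeros(3), all_free)

# Stage 2: pin coordinate k at mode + eta and re-optimize the rest.
start = mode.copy()
start[k] += eta
free = all_free.copy()
free[k] = False
pert = ascend(grad, start, free)

# Variance from the drop in log-density, covariances from the shifts.
var_hat = 0.5 * eta**2 / (logp(mode) - logp(pert))
cov_hat = var_hat / eta * (pert - mode)   # estimates column k of Sigma
```

Entry `k` of `cov_hat` equals `var_hat` by construction, so a single perturbed re-optimization yields one full column of the covariance matrix.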

Input: Network time-series $Y_{1:T}$. The mode of $\log \pi(Z, \theta \mid Y)$: $\theta^{*}$ denoting the mode for all hyperparameters $(\alpha, \delta, \gamma^{w}, \gamma^{b})$, $Z_{1:T}^{*}$ for latent positions.
      $M$: maximum number of iterations, $\lambda$: step size for $Z_{1:T}$, $\lambda$: step size for $\theta$, $\eta_{\alpha}$: perturbation size for $\alpha$
      Output: $\widehat{\text{Var}(\alpha \mid Y)}$, the posterior variance of $\alpha$ (where $\alpha$ is the intercept term in Eq. 2)

1: Set initial values $\theta^{(0)}$, $Z_{1:T}^{(0)}$ to be $\theta^{*}$, $Z_{1:T}^{*}$.
2: for $k \leftarrow 0$ to $M$ do
3:     At the onset of each iteration, the parameter of interest is anchored at a fixed perturbation, denoted $\eta_{\alpha}$, from its modal value. Specifically, we set $\alpha^{(k)} = \alpha^{*} + \eta_{\alpha}$. This positions the tuple $(\alpha^{(k)}, \delta^{(k)}, \gamma^{w,(k)}, \gamma^{b,(k)})$ at $(\alpha^{*} + \eta_{\alpha}, \delta^{(k)}, \gamma^{w,(k)}, \gamma^{b,(k)})$.
4:     Take a SGD step to optimize the posterior distribution: $(\theta^{(k+1)}, Z_{1:T}^{(k+1)})$ = SGDstep($\theta^{(k)}, Z_{1:T}^{(k)}$)
5:     Stop if the stopping criterion is met and let $k$ be the corresponding iteration.
6: end for
7:
$$\widehat{\text{Var}(\alpha \mid Y)} = \frac{1}{2} \frac{\eta_{\alpha}^{2}}{\log \pi(\theta^{(0)}, Z^{(0)}_{1:T} \mid Y_{1:T}) - \log \pi(\theta^{(k+1)}, Z^{(k+1)}_{1:T} \mid Y_{1:T})}$$
8:
$$\widehat{\text{Cov}(\alpha, \theta_i \mid Y)} = \frac{\widehat{\text{Var}(\alpha \mid Y)}}{\eta_{\alpha}} (\theta_i^{(k+1)} - \theta_i^{(0)})$$
9:
Cov(α,Z1:T|Y)^=Var(α|Y)^ηα(Z1:T(k+1)Z1:T(0))^Cov𝛼conditionalsubscript𝑍:1𝑇𝑌^Varconditional𝛼𝑌subscript𝜂𝛼superscriptsubscript𝑍:1𝑇𝑘1superscriptsubscript𝑍:1𝑇0\widehat{\text{Cov}(\alpha,Z_{1:T}|Y)}=\frac{\widehat{\text{Var}(\alpha|Y)}}{% \eta_{\alpha}}(Z_{1:T}^{(k+1)}-Z_{1:T}^{(0)})over^ start_ARG Cov ( italic_α , italic_Z start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT | italic_Y ) end_ARG = divide start_ARG over^ start_ARG Var ( italic_α | italic_Y ) end_ARG end_ARG start_ARG italic_η start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT end_ARG ( italic_Z start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k + 1 ) end_POSTSUPERSCRIPT - italic_Z start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT )
Cov(α,Z1:T|Y)Cov𝛼conditionalsubscript𝑍:1𝑇𝑌\text{Cov}(\alpha,Z_{1:T}|Y)Cov ( italic_α , italic_Z start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT | italic_Y ) denote the covariance vector between the 1-dimensional random variable α𝛼\alphaitalic_α and the vector random variable Z1:Tsubscript𝑍:1𝑇Z_{1:T}italic_Z start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT.
Algorithm 3 SGD-based Var/Cov Estimation Method for a single parameter
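
The estimators in Algorithm 3 can be sanity-checked on a toy problem where the answer is known. The sketch below runs the same anchor-then-step loop on a Gaussian log-posterior with a known covariance matrix. Everything in it is an assumption made for illustration: the names `log_post` and `var_cov_single`, the toy covariance `sigma`, the reading of $\pi$ as the log-posterior, and the use of a full-gradient ascent step in place of a stochastic one. It is not the authors' implementation.

```python
import numpy as np


def log_post(x, mu, prec):
    """Toy log-posterior (up to a constant): Gaussian with mean mu and precision prec."""
    d = x - mu
    return -0.5 * d @ prec @ d


def var_cov_single(mu, prec, idx, eta, lr=0.1, tol=1e-10, max_iter=100_000):
    """Estimate Var(x[idx]) and Cov(x[idx], x) by anchoring x[idx] at mode + eta."""
    x0 = mu.copy()                        # posterior mode (known exactly in this toy)
    x = x0.copy()
    for _ in range(max_iter):
        x[idx] = x0[idx] + eta            # step 3: anchor the parameter of interest
        step = -lr * (prec @ (x - mu))    # step 4: gradient-ascent step on log_post
        step[idx] = 0.0                   # the anchored coordinate is held fixed
        x = x + step
        if np.max(np.abs(step)) < tol:    # step 5: stop once updates are tiny
            break
    drop = log_post(x0, mu, prec) - log_post(x, mu, prec)
    var = 0.5 * eta**2 / drop             # step 7
    cov = var / eta * (x - x0)            # steps 8-9 (entry idx equals var)
    return var, cov


sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 0.5]])       # toy "true" posterior covariance
mu = np.array([1.0, -0.5, 0.25])
var_hat, cov_hat = var_cov_single(mu, np.linalg.inv(sigma), idx=0, eta=0.1)
print(var_hat, cov_hat)                   # compare with sigma[0, 0] and sigma[:, 0]
```

For an exactly quadratic log-posterior the recovered values match the first column of `sigma` up to optimization tolerance, whatever the size of `eta`. For a non-quadratic posterior the match is only approximate.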

Algorithm 3 details an SGD-based approach for estimating the variance of a single parameter, in this instance $\alpha$. To extend this estimation to all parameters of interest, we apply Algorithm 3 to each parameter in turn. Since we are able to recover the covariance matrix, we are also able to estimate the variance of functions of the parameters. As an illustration, consider $\text{Var}(\gamma^{w}-\gamma^{b})$. This estimate is obtained by applying Algorithm 3 sequentially to each component parameter. First, the algorithm is applied to $\gamma^{w}$, yielding the estimates $\widehat{\text{Var}(\gamma^{w}|Y)}$ and $\widehat{\text{Cov}(\gamma^{w},\gamma^{b}|Y)}$. A similar application to $\gamma^{b}$ then provides $\widehat{\text{Var}(\gamma^{b}|Y)}$ and an alternative estimate of $\text{Cov}(\gamma^{w},\gamma^{b}|Y)$.
The variance of the composite parameter $\gamma^{w}-\gamma^{b}$ is then approximated by $\widehat{\text{Var}(\gamma^{w}|Y)}+\widehat{\text{Var}(\gamma^{b}|Y)}-2\widehat{\text{Cov}(\gamma^{w},\gamma^{b}|Y)}$. Given that the function is not exactly quadratic, the two covariance estimates may differ. To mitigate this, one can average the two estimates for a more robust result.
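
The combination step described above can be sketched in a few lines. The helper names `symmetrize` and `var_linear_combo` and all numbers are hypothetical, chosen only to show the arithmetic; they are not results from the paper. Row $i$ of `cov_rows` is assumed to hold the output of one Algorithm 3 run anchored at parameter $i$, so the two off-diagonal entries are the two competing covariance estimates.

```python
import numpy as np


def symmetrize(cov_rows):
    """Average the two estimates of each covariance: (C + C^T) / 2."""
    c = np.asarray(cov_rows, dtype=float)
    return 0.5 * (c + c.T)


def var_linear_combo(cov, a):
    """Variance of the linear combination a^T theta under covariance matrix cov."""
    a = np.asarray(a, dtype=float)
    return float(a @ cov @ a)


# Made-up Algorithm 3 outputs, ordered as (gamma_w, gamma_b).
cov_rows = [[0.040, 0.010],   # run anchored at gamma_w: Var(gamma_w), Cov(gamma_w, gamma_b)
            [0.012, 0.030]]   # run anchored at gamma_b: Cov(gamma_b, gamma_w), Var(gamma_b)
cov = symmetrize(cov_rows)
var_diff = var_linear_combo(cov, [1.0, -1.0])   # Var(gamma_w - gamma_b)
print(var_diff)
```

With the coefficient vector `[1, -1]` this reproduces the formula in the text, using the averaged covariance; other linear functions of the parameters only need a different coefficient vector.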

4 Simulation studies

We simulate two networks, one with $n=100$ actors and one with $n=1000$ actors. For each of them, we simulate two parameter settings, one for positive attraction or ‘flocking’ ($\gamma^{b}\geq 0$) and the other for repulsion ($\gamma^{b}\leq 0$) or ‘polarization’ between two groups over time. For each setting we simulate $20$ data sets with $T=10$ time points.

Flocking setting: $\alpha=1$, $\delta=2$, $\gamma^{w}=0.25$, $\gamma^{b}=0.5$

| | $\hat{\alpha}$ | $\widehat{\text{Var}(\alpha)}$ | $\hat{\delta}$ | $\widehat{\text{Var}(\delta)}$ | $\hat{\gamma^{w}}$ | $\widehat{\text{Var}(\gamma^{w})}$ | $\hat{\gamma^{b}}$ | $\widehat{\text{Var}(\gamma^{b})}$ |
|---|---|---|---|---|---|---|---|---|
| $n=100$ | 0.817 (0.029) | 0.025 (0.001) | 1.934 (0.023) | 0.024 (0.001) | 0.202 (0.135) | 0.125 (0.026) | 0.489 (0.14) | 0.117 (0.011) |
| $n=1000$ | 0.972 (0.003) | 0.003 ($<0.001$) | 1.992 (0.003) | 0.002 ($<0.001$) | 0.25 (0.028) | 0.03 (0.001) | 0.498 (0.030) | 0.028 (0.001) |

Polarization setting: $\alpha=1$, $\delta=3$, $\gamma^{w}=0.45$, $\gamma^{b}=-0.5$

| | $\hat{\alpha}$ | $\widehat{\text{Var}(\alpha)}$ | $\hat{\delta}$ | $\widehat{\text{Var}(\delta)}$ | $\hat{\gamma^{w}}$ | $\widehat{\text{Var}(\gamma^{w})}$ | $\hat{\gamma^{b}}$ | $\widehat{\text{Var}(\gamma^{b})}$ |
|---|---|---|---|---|---|---|---|---|
| $n=100$ | 0.825 (0.043) | 0.038 (0.001) | 2.868 (0.045) | 0.038 (0.001) | 0.302 (0.041) | 0.035 (0.003) | -0.54 (0.035) | 0.03 (0.004) |
| $n=1000$ | 0.971 (0.003) | 0.004 ($<0.001$) | 2.976 (0.003) | 0.004 ($<0.001$) | 0.433 (0.018) | 0.015 (0.004) | -0.509 (0.017) | 0.014 (0.004) |

Table 1: Posterior-based mean (empirical standard deviation) for point estimation (Algorithm 2) and variance estimation (Algorithm 3) in flocking and polarization settings, based on $n=100$ and $n=1000$ nodes, with $T=10$ time points.

We compare the mean of the point estimates over the $20$ simulations with the truth, and the mean of the estimated variances with the standard deviation of the point estimates over the $20$ simulations.

With $n=100$ nodes, Table 1 shows that the point estimates are reasonably accurate compared with the truth. In the flocking setting, all estimates are slightly biased to be smaller than the truth. In the polarization setting, all estimates except for the repulsion between groups ($\gamma^{b}$) are slightly biased to be smaller than the truth. The variance estimates are reasonably accurate and are generally slightly smaller than the true standard deviation. The empirical standard deviations, shown in parentheses in Table 1, are obtained from 20 MCMC trials in each parameter setting.

Table 1 shows that with $n=1000$ nodes the point estimates are more accurate than with $n=100$ nodes, both in the flocking setting and in the polarization setting, and similarly for the variance estimates.

We notice that the estimated standard deviation slightly underestimates the true standard deviation. One reason is that the posterior function can have more than one local minimum. This class of variance estimation methods approximates the contribution of a single local minimum to the total variance, whereas every local minimum contributes to the total variance. In our experiments, however, this did not result in serious issues.

We also compare the computational efficiency of the proposed SGD algorithm with that of the MCMC algorithm. With both running on the same CPU resources and $n=100$ nodes, the MCMC algorithm took about an hour to obtain 50,000 samples, while the proposed SGD-based algorithm took fewer than 5 minutes. In the X data analysis in the following section, on the reduced data set with 200 nodes, MCMC took about four hours (CPU), while the proposed SGD-based algorithm took fewer than 10 minutes (GPU). In addition, the proposed SGD-based algorithm took about 20 minutes when $n=1000$ (GPU).

5 X Data Analysis

5.1 Exploratory Analysis

The X Congressional Hashtag Networks dataset in Zhu et al. (2023) was based on 843,907 tweets from 796 US Congresspersons’ accounts from 2010 to 2020. Yearly binary networks were built from the tweets. Nodes represent members, and an edge indicates that two members used common hashtags more often than the year’s average. The nodes in the yearly networks vary as Congress members’ X participation changes due to reelections, late joins, or early departures (see Figure 1). Figure 3 shows the temporal evolution of edge density within the Democratic and Republican parties individually, alongside the inter-party edge density and the edge density across the entire network. Similar trends were observed in Zhu et al. (2023): connections increased for the first four years for all types of edges. From 2015 to 2020, edge density within the Democratic party continued to rise, while inter-party edge density decayed slowly and edge density within the Republican party dropped sharply.

Figure 3: Temporal evolution of edge density within the Democratic and Republican parties, the inter-party edge density and the edge density among all members
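
The four edge densities plotted in Figure 3 are simple to compute from a yearly adjacency matrix and a vector of party labels. The following is a minimal sketch under stated assumptions: the function `edge_density`, the 4-node network, and the labels are made up for illustration, and the network is taken to be undirected and binary as described above.

```python
import numpy as np


def edge_density(adj, rows, cols=None):
    """Edge density of an undirected binary network.

    With cols=None: density among the nodes in `rows` (unordered pairs).
    Otherwise: density between the disjoint node sets `rows` and `cols`.
    """
    adj = np.asarray(adj)
    rows = np.asarray(rows, dtype=int)
    if cols is None:
        m = rows.size
        if m < 2:
            return float("nan")               # no pairs can be formed
        sub = adj[np.ix_(rows, rows)]
        return float(np.triu(sub, k=1).sum() / (m * (m - 1) / 2))
    cols = np.asarray(cols, dtype=int)
    if rows.size == 0 or cols.size == 0:
        return float("nan")
    return float(adj[np.ix_(rows, cols)].mean())


# Toy yearly network: nodes 0-1 labelled "D", nodes 2-3 labelled "R" (made-up data).
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 0],
                [1, 0, 0, 0],
                [0, 0, 0, 0]])
party = np.array(["D", "D", "R", "R"])
dem, rep = np.flatnonzero(party == "D"), np.flatnonzero(party == "R")
densities = {
    "within_D": edge_density(adj, dem),
    "within_R": edge_density(adj, rep),
    "between": edge_density(adj, dem, rep),
    "overall": edge_density(adj, np.arange(len(party))),
}
print(densities)
```

Because the node set changes from year to year, the label vector and index sets would have to be rebuilt for each yearly network before calling the function.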

Figure 1 shows the numbers of re-elected and newly elected Democratic and Republican members of Congress in the X network. We can see an upward trend in the number of X users in both parties. The number of Democratic X users grows more steadily, with fewer fluctuations, than the number of Republican X users. Notably, Republicans outnumber Democrats in the X network in 9 out of 11 years.

5.2 Implementation Details

We use gradient descent with momentum and choose different learning hyperparameters for the latent position parameters and the global parameters. With gradient descent, we use all the terms in the log posterior function instead of taking a sampled posterior function. We stop running gradient descent when the parameter updates drop below a threshold of $\epsilon=10^{-4}$, a condition we check every 100 gradient steps. Priors for $\alpha$ and $\delta$ were chosen to be $\mathcal{N}(0,100)$ to keep them flat and uninformative. We chose the priors for $\gamma^{w}$ and $\gamma^{b}$ to be $\mathcal{N}(0.5,100)$ and $\mathcal{N}(-0.5,100)$, respectively, to reflect the prior belief of polarization; these are also quite uninformative given the large variance. We fix $\tau^{2}$ at 10, $\sigma^{2}$ at 1 and $\phi^{2}$ at 10.
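
The optimization loop described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function `momentum_descent`, the learning rates, the momentum coefficient `beta`, and the toy quadratic objective are all assumptions, and "parameter updates drop below $\epsilon$" is read here as the largest absolute entry of the most recent update.

```python
import numpy as np


def momentum_descent(grad_fn, z, theta, lr_z=0.05, lr_theta=0.01, beta=0.9,
                     eps=1e-4, check_every=100, max_steps=100_000):
    """Full-gradient descent with momentum and separate learning rates for two groups."""
    z = np.array(z, dtype=float)
    theta = np.array(theta, dtype=float)
    vz, vt = np.zeros_like(z), np.zeros_like(theta)
    for step in range(1, max_steps + 1):
        gz, gt = grad_fn(z, theta)            # gradient of the negative log-posterior
        vz = beta * vz - lr_z * gz            # momentum update, latent positions
        vt = beta * vt - lr_theta * gt        # momentum update, global parameters
        z += vz
        theta += vt
        if step % check_every == 0:           # convergence is checked every 100 steps
            if max(np.max(np.abs(vz)), np.max(np.abs(vt))) < eps:
                break
    return z, theta, step


def toy_grad(z, theta):
    """Gradient of the toy objective 0.5*||z - 1||^2 + 2*||theta + 2||^2."""
    return z - 1.0, 4.0 * (theta + 2.0)


z_hat, theta_hat, n_steps = momentum_descent(toy_grad, np.zeros(3), np.zeros(2))
print(z_hat, theta_hat, n_steps)
```

Keeping one velocity buffer and one learning rate per parameter group is the design choice that lets the latent positions and the global parameters move at different speeds while sharing a single stopping rule.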

5.3 Results

Figure 4: The point estimates of the latent positions with the extended model. Blue: Democrats; red: Republicans. At each time, the two parties consistently occupy different halves of the space. The Democrats were flocking during the whole time. Conversely, the Republicans initially displayed a similar flocking behavior, but gradually began to disperse around year 2016.
| | $\hat{\alpha}$ | $\hat{\delta}$ | $\hat{\gamma}_{1}^{w}$ | $\hat{\gamma}_{2}^{w}$ | $\hat{\gamma}^{b}$ |
|---|---|---|---|---|---|
| Mean | 3.2 | 1.08 | 0.34 | -0.11 | -0.22 |
| SD | 0.011 | 0.010 | 0.017 | 0.013 | 0.0085 |

Table 2: Subscript 1 denotes Democrats and subscript 2 denotes Republicans. The point estimates and the posterior standard deviations of the global parameters in the extended model fitted to the full X data set spanning from 2010 to 2020. The between-group attraction is -0.22, indicating that there is polarization between the two parties. The within-group coefficient is 0.34 for Democrats and -0.11 for Republicans, indicating that the Democrats were flocking, while the Republicans were polarizing within their own group.

We fit our model with time-invariant parameters to the full X data set from 2010 to 2020. The point estimates and the posterior standard deviations of the model parameters are provided in Table 2. Point estimates were produced using the SGD methodology of Section 3.1, and standard deviation estimates were produced using the methodology of Section 3.2. In the supplementary file, we present two diagnostic tests for the variance estimation method proposed in Section 3.2; neither shows evidence that its underlying assumptions are violated for this data set.

The point estimate of the persistence parameter is 1.08, implying that an edge present at time $t-1$ is more likely to also appear at time $t$. The between-group attraction is -0.22, indicating that there is polarization between the two parties. The within-group coefficient is 0.34 for Democrats and -0.11 for Republicans, indicating that the Democrats were flocking, while the Republicans were polarizing within their own group. Importantly, we note that in the original model in Zhu et al. (2023), only members that are present at all times are kept. There, the estimates of within-group attraction indicated that, while both parties had moved away from one another, they generally flocked to their own. By incorporating all members into the extended model, we effectively eliminate the potential for selection bias, ensuring a more accurate and comprehensive analysis of the data. As a result of this approach, we have uncovered that the force within the Republican group is negative, in contrast to Zhu et al. (2023).
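
To see how far these estimates are from zero relative to their uncertainty, one can form approximate 95% intervals from the means and standard deviations in Table 2. The sketch below assumes a Gaussian approximation to the posterior, so that the interval is mean $\pm 1.96$ SD; that assumption, and the dictionary keys, are ours and serve only as an illustration.

```python
# Means and posterior standard deviations copied from Table 2.
means = {"alpha": 3.2, "delta": 1.08, "gamma_w_dem": 0.34,
         "gamma_w_rep": -0.11, "gamma_b": -0.22}
sds = {"alpha": 0.011, "delta": 0.010, "gamma_w_dem": 0.017,
       "gamma_w_rep": 0.013, "gamma_b": 0.0085}

Z95 = 1.96  # standard normal quantile for a two-sided 95% interval

intervals = {name: (means[name] - Z95 * sds[name], means[name] + Z95 * sds[name])
             for name in means}

for name, (low, high) in intervals.items():
    excludes_zero = low > 0 or high < 0
    print(f"{name}: [{low:.3f}, {high:.3f}]  excludes zero: {excludes_zero}")
```

Under this approximation the intervals for both within-group coefficients and for the between-group coefficient lie entirely on one side of zero.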

Figure 4 shows the point estimates of latent positions for each member of Congress in the X hashtag networks. At each time, the two parties consistently occupy different halves of the space. The Democrats were flocking during the whole time. Conversely, the Republicans initially displayed a similar flocking behavior, but gradually began to disperse after year 2017.

Figure 5: The trajectory of the mean of the latent positions of the members in each party. Lighter colors show results obtained from the reduced data set and darker colors show results obtained from the full data set. In both cases, the average positions of the two parties gradually and consistently approached each other from 2011 to 2016; after 2016, both took a U-turn and moved apart.

Figure 5 shows the trajectory of the mean of the latent positions of the members in each party. The lighter color represents results obtained from the reduced data set used in the original analysis of Zhu et al. (2023), while the darker color represents results obtained with our model and inference techniques from the full data set. In both cases, the average positions of the two parties gradually and consistently approached each other from 2011 to 2016. After 2016, however, both took a U-turn and moved apart. The analysis based on the full data set also suggests that the Democrats were in fact noticeably closer to the Republicans just before 2016 than suggested by the analysis based on the reduced data.

Figure 6: Comparison of attraction parameters from MCMC method on the reduced data set and the SGD-based method on the reduced and full data set. The MCMC method and the SGD-based method yield very similar point estimates and variance estimates for the attraction parameters on the reduced data set. On the full data set, the model concludes that there is a large attractive force within the Democratic Party. By contrast, we find that within the Republican Party the force is repulsive.

Figure 6 compares the attraction parameters with those found in Zhu et al. (2023). The MCMC and the SGD-based methods yield close point estimates and variance estimates for the attraction parameters on the reduced data set. The point estimates obtained from the MCMC method fall within the 95% confidence interval of the SGD-based method. On the full data set, the model concludes that there is a large attractive force within the Democratic Party. By contrast, we find that within the Republican Party the force is repulsive.

Figure 7: Evolution of posterior means and 95% CI for Democrats and Republicans, and between-group attraction/repulsion in X Congressional Hashtag Networks, the values of which at each time period in the horizontal axis are obtained by fitting the model using separately parameterized networks in the corresponding time periods.

To analyze changes in attraction and repulsion, we fit a series of models in which a change-point varies from 2011 to 2019. The resulting fitted models with different change-points were compared using the Bayesian Information Criterion (BIC), and the one with the lowest BIC value was selected. We identified 2012 as the change-point year (see the supplementary file for more details). Figure 7 illustrates the evolution of within-group attraction and repulsion for Democrats and Republicans, and of between-group attraction and repulsion. The within-Republican coefficient is positive in the first time period and negative in the second time period, from 2012 to 2020. This suggests that polarization among Republican members started to rise within the period 2012-2020. The within-Democrat coefficient is positive in both time periods, although its magnitude decreases somewhat in the second time period. The between-group coefficient (orange bars) is not statistically significant at the 95% confidence level from 2010 to 2012 and is negative in the second time period, from 2012 to 2020.
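
The selection rule used here, fit one model per candidate change-point and keep the one with the lowest BIC, can be sketched generically. The code below uses the standard definition $\text{BIC}=k\ln n-2\ln\hat{L}$. The function names, the candidate labels, and every number are placeholders invented for illustration; how the likelihood and the parameter count are defined for the CLSNA model is covered in the supplementary file and is not reproduced here.

```python
import math


def bic(log_lik, n_params, n_obs):
    """Standard BIC: k * ln(n) - 2 * ln(L_hat). Lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_lik


def select_change_point(fits, n_obs):
    """fits maps each candidate to (maximized log-likelihood, number of parameters)."""
    scores = {cand: bic(ll, k, n_obs) for cand, (ll, k) in fits.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Placeholder candidates and values, not taken from the paper.
toy_fits = {"A": (-1050.0, 6), "B": (-1000.0, 8), "C": (-998.0, 12)}
best, scores = select_change_point(toy_fits, n_obs=500)
print(best, scores)
```

In the toy example, candidate "C" has the highest log-likelihood but loses to "B" once the penalty for its extra parameters is applied, which is the trade-off BIC is designed to make.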

6 Discussion

The extended CLSNA model proposed in this paper, together with the SGD-based inference method, successfully addresses the challenge of scalability to large networks with nodes entering and leaving the network. In particular, (a) the introduction of the SGD parameter estimation method, (b) the development of the novel variance estimation approach for approximate Bayesian inference, which transforms variance estimation into an optimization problem solvable with SGD, and (c) the extension of the model to allow dynamic node participation have significantly improved the CLSNA model’s utility for analyzing dynamic networks. When applied to longitudinal social networks on X, the model mitigates selection bias and reveals a previously concealed negative force within the Republican party.

The SGD parameter estimation method and the variance estimation method are general approaches to approximate Bayesian inference. Therefore, they can be combined with other strategies to make inference faster. For example, they can be used with case-control log-likelihood approximation as in Raftery et al. (2012). This can be achieved by using stratified sampling to draw the edge probability terms during the stochastic sampling process.
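
One simple way to picture the stratified sampling mentioned above is a two-stratum estimator: keep every dyad with an edge (the cases), draw a random subset of the dyads without an edge (the controls), and reweight the controls so that the sum stays unbiased. The sketch below is only an illustration under that assumption. The function name and the toy data are hypothetical, and the scheme in Raftery et al. (2012) stratifies the controls further, which is not reproduced here.

```python
import numpy as np


def case_control_loglik(terms, y, n_controls, rng):
    """Estimate sum(terms) from all edges plus a reweighted sample of non-edges."""
    terms = np.asarray(terms, dtype=float)
    y = np.asarray(y).astype(bool)
    case_sum = terms[y].sum()                     # every edge term is kept
    control_idx = np.flatnonzero(~y)
    if control_idx.size == 0:
        return float(case_sum)
    m = min(n_controls, control_idx.size)
    sampled = rng.choice(control_idx, size=m, replace=False)
    weight = control_idx.size / m                 # inverse sampling fraction
    return float(case_sum + weight * terms[sampled].sum())


# Toy dyads with made-up edge probabilities; terms are Bernoulli log-likelihoods.
rng = np.random.default_rng(0)
p = rng.uniform(0.01, 0.1, size=1000)
y = rng.random(1000) < p
terms = np.where(y, np.log(p), np.log1p(-p))

full = terms.sum()
approx = case_control_loglik(terms, y, n_controls=200, rng=rng)
print(full, approx)
```

Because sparse networks have far more non-edges than edges, sampling only the controls removes most of the per-step cost while leaving the informative edge terms untouched.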

In future work we plan to incorporate observed explanatory variable terms into the edge probability. For instance, a linear predictor based on explanatory variables can be integrated into the edge probability function, see Hoff (2007). In this way, we can more effectively separate the latent effect from the fixed effect and accurately capture the repulsive and attractive force among the different actors.

Acknowledgement

This research was supported by US NSF-DMS 2311500, NSF SES-2120115, Canadian NSERC RGPIN-2023-03566 and NSERC DGDND-2023-03566.

References

  • Hinton (2012) Hinton, G. (2012) Neural networks for machine learning. Coursera. URL: https://www.coursera.org/course/neuralnets.
  • Hoff (2007) Hoff, P. (2007) Modeling homophily and stochastic equivalence in symmetric relational data. Advances in Neural Information Processing Systems, 20.
  • Hoff et al. (2002) Hoff, P. D., Raftery, A. E. and Handcock, M. S. (2002) Latent space approaches to social network analysis. Journal of the American Statistical Association, 97, 1090–1098.
  • Liu and Chen (2021) Liu, Y. and Chen, Y. (2021) Variational inference for latent space models for dynamic networks. arXiv preprint arXiv:2105.14093.
  • Matias and Miele (2017) Matias, C. and Miele, V. (2017) Statistical clustering of temporal networks through a dynamic stochastic block model. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 79, 1119–1141.
  • Polyak (1964) Polyak, B. T. (1964) Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4, 1–17.
  • Raftery et al. (2012) Raftery, A. E., Niu, X., Hoff, P. D. and Yeung, K. Y. (2012) Fast inference for the latent space network model using a case-control approximate likelihood. Journal of Computational and Graphical Statistics, 21, 901–919.
  • Rue et al. (2009) Rue, H., Martino, S. and Chopin, N. (2009) Approximate bayesian inference for latent gaussian models by using integrated nested laplace approximations. Journal of the Royal Statistical Society: Series B (statistical methodology), 71, 319–392.
  • Sarkar and Moore (2005) Sarkar, P. and Moore, A. W. (2005) Dynamic social network analysis using latent space models. SIGKDD Explorations, 7, 31–40.
  • Sewell and Chen (2015a) Sewell, D. K. and Chen, Y. (2015a) Analysis of the formation of the structure of social networks by using latent space models for ranked dynamic networks. Journal of the Royal Statistical Society: Series C: Applied Statistics, 611–633.
  • Sewell and Chen (2015b) — (2015b) Latent space models for dynamic networks. Journal of the American Statistical Association, 110, 1646–1657.
  • Sewell and Chen (2016) — (2016) Latent space models for dynamic networks with weighted edges. Social Networks, 44, 105–116.
  • Zhu et al. (2023) Zhu, X., Caliskan, C., Christenson, D. P., Spiliopoulos, K., Walker, D. and Kolaczyk, E. D. (2023) Disentangling positive and negative partisanship in social media interactions using a coevolving latent space network with attractors model. Journal of the Royal Statistical Society Series A: Statistics in Society. URL: https://doi.org/10.1093/jrsssa/qnad008. Qnad008.

Supplementary Information for ‘Stochastic gradient descent-based inference for dynamic network models with attractors’

This supplementary document contains two sections. In Section 7 we present the proofs of the theoretical results that appear in the main body of the paper. In Section 8 we present some additional statistical tests and diagnostics to back up the validity of the numerical results reported in the main body of the paper.

7 Proofs

Theorem 7.1.

Given the multivariate Gaussian distribution setting presented above, write $p(x_{1},x_{2})$ for the p.d.f. of the joint distribution of $x_{1},x_{2}$ and $p(x_{2})$ for the p.d.f. of the marginal distribution of $x_{2}$. Then, we have $p(x_{2})\propto p(x_{1}=\mu_{1|2},x_{2})$.
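
The statement can be checked numerically before reading the proof. The sketch below evaluates the ratio $p(\mu_{1|2},x_{2})/p(x_{2})$ at several random values of $x_{2}$ and confirms that it does not depend on $x_{2}$. The mean vector, covariance matrix, block split, and helper names are assumptions chosen for illustration, with $x_{1}$ one-dimensional and $x_{2}$ two-dimensional.

```python
import numpy as np


def gauss_pdf(x, mu, cov):
    """Density of a multivariate Gaussian N(mu, cov) evaluated at x."""
    d = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    k = d.size
    quad = d @ np.linalg.solve(cov, d)
    norm = np.sqrt((2.0 * np.pi) ** k * np.linalg.det(cov))
    return float(np.exp(-0.5 * quad) / norm)


mu = np.array([0.5, -1.0, 2.0])
cov = np.array([[2.0, 0.6, 0.3],
                [0.6, 1.0, 0.2],
                [0.3, 0.2, 0.5]])
i1, i2 = [0], [1, 2]                       # index blocks for x1 and x2
S12 = cov[np.ix_(i1, i2)]
S22 = cov[np.ix_(i2, i2)]


def ratio(x2):
    """p(mu_{1|2}, x2) / p(x2), which the theorem says is constant in x2."""
    mu_cond = mu[i1] + S12 @ np.linalg.solve(S22, x2 - mu[i2])
    joint = gauss_pdf(np.concatenate([mu_cond, x2]), mu, cov)
    marginal = gauss_pdf(x2, mu[i2], S22)
    return joint / marginal


rng = np.random.default_rng(1)
ratios = [ratio(rng.normal(size=2)) for _ in range(5)]
print(ratios)
```

The common value of the ratios is the conditional density evaluated at its own mean, which is the constant that appears at the end of the proof.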

Proof.

Let us explicitly state the conditional mean and covariance matrix for our given multivariate Gaussian distribution. The conditional mean of x1subscript𝑥1x_{1}italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT given x2subscript𝑥2x_{2}italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, denoted μ1|2subscript𝜇conditional12\mu_{1|2}italic_μ start_POSTSUBSCRIPT 1 | 2 end_POSTSUBSCRIPT, is:

μ1|2=μ1+Σ12Σ221(x2μ2).subscript𝜇conditional12subscript𝜇1subscriptΣ12superscriptsubscriptΣ221subscript𝑥2subscript𝜇2\mu_{1|2}=\mu_{1}+\Sigma_{12}\Sigma_{22}^{-1}(x_{2}-\mu_{2}).italic_μ start_POSTSUBSCRIPT 1 | 2 end_POSTSUBSCRIPT = italic_μ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + roman_Σ start_POSTSUBSCRIPT 12 end_POSTSUBSCRIPT roman_Σ start_POSTSUBSCRIPT 22 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT - italic_μ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) .

Additionally, the conditional covariance matrix, denoted Σ1|2subscriptΣconditional12\Sigma_{1|2}roman_Σ start_POSTSUBSCRIPT 1 | 2 end_POSTSUBSCRIPT, is described as:

Σ1|2=Σ11Σ12Σ221Σ21.subscriptΣconditional12subscriptΣ11subscriptΣ12superscriptsubscriptΣ221subscriptΣ21\Sigma_{1|2}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.roman_Σ start_POSTSUBSCRIPT 1 | 2 end_POSTSUBSCRIPT = roman_Σ start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT - roman_Σ start_POSTSUBSCRIPT 12 end_POSTSUBSCRIPT roman_Σ start_POSTSUBSCRIPT 22 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_Σ start_POSTSUBSCRIPT 21 end_POSTSUBSCRIPT .

The joint distribution, p(x1,x2)𝑝subscript𝑥1subscript𝑥2p(x_{1},x_{2})italic_p ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ), for the multivariate Gaussian is:

p(x1,x2)=1(2π)k2|Σ|12exp(12(xμ)TΣ1(xμ)).𝑝subscript𝑥1subscript𝑥21superscript2𝜋𝑘2superscriptΣ1212superscript𝑥𝜇𝑇superscriptΣ1𝑥𝜇p(x_{1},x_{2})=\frac{1}{(2\pi)^{\frac{k}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left(-% \frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right).italic_p ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) = divide start_ARG 1 end_ARG start_ARG ( 2 italic_π ) start_POSTSUPERSCRIPT divide start_ARG italic_k end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT | roman_Σ | start_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT end_ARG roman_exp ( - divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( italic_x - italic_μ ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT roman_Σ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_x - italic_μ ) ) .

The conditional distributions p(x1|x2)𝑝conditionalsubscript𝑥1subscript𝑥2p(x_{1}|x_{2})italic_p ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT | italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) is:

p(x1|x2)=1(2π)n2|Σ1|2|12exp(12(x1μ1|2)TΣ1|21(x1μ1|2)).𝑝conditionalsubscript𝑥1subscript𝑥21superscript2𝜋𝑛2superscriptsubscriptΣconditional121212superscriptsubscript𝑥1subscript𝜇conditional12𝑇superscriptsubscriptΣconditional121subscript𝑥1subscript𝜇conditional12p(x_{1}|x_{2})=\frac{1}{(2\pi)^{\frac{n}{2}}|\Sigma_{1|2}|^{\frac{1}{2}}}\exp% \left(-\frac{1}{2}(x_{1}-\mu_{1|2})^{T}\Sigma_{1|2}^{-1}(x_{1}-\mu_{1|2})% \right).italic_p ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT | italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) = divide start_ARG 1 end_ARG start_ARG ( 2 italic_π ) start_POSTSUPERSCRIPT divide start_ARG italic_n end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT | roman_Σ start_POSTSUBSCRIPT 1 | 2 end_POSTSUBSCRIPT | start_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT end_ARG roman_exp ( - divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_μ start_POSTSUBSCRIPT 1 | 2 end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT roman_Σ start_POSTSUBSCRIPT 1 | 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_μ start_POSTSUBSCRIPT 1 | 2 end_POSTSUBSCRIPT ) ) .

By the definition of conditional probabilities, we have:

\[
p(x_2) = \frac{p(x_1, x_2)}{p(x_1|x_2)}.
\]

Inserting $x_1 = \mu_{1|2}$ into this relation, we get:

\[
p(x_2) = \frac{p(\mu_{1|2}, x_2)}{p(\mu_{1|2}|x_2)}.
\]

Since $p(\mu_{1|2}|x_2) = \frac{1}{(2\pi)^{\frac{n}{2}} |\Sigma_{1|2}|^{\frac{1}{2}}}$ and $\Sigma_{1|2}$ is constant as a function of $x_2$, we get that

\[
p(x_2) = C\, p(\mu_{1|2}, x_2),
\]

where $C = (2\pi)^{\frac{n}{2}} |\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|^{\frac{1}{2}}$, a constant as a function of $x_2$. ∎
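The identity $p(x_2) = C\,p(\mu_{1|2}, x_2)$ can be checked numerically. The following is a minimal sketch for the bivariate case with scalar $x_1$ and $x_2$ (so $n = 1$); the means and covariance entries are illustrative values chosen for this example, not quantities from the paper.

```python
import math

# Hypothetical bivariate Gaussian with scalar x1 and x2; all numbers are illustrative.
mu1, mu2 = 0.5, -1.0
s11, s12, s22 = 2.0, 0.6, 1.5  # entries of Sigma (s21 = s12)

def joint_logpdf(x1, x2):
    """Log density of N((mu1, mu2), Sigma) at (x1, x2)."""
    det = s11 * s22 - s12 ** 2
    d1, d2 = x1 - mu1, x2 - mu2
    quad = (s22 * d1 ** 2 - 2.0 * s12 * d1 * d2 + s11 * d2 ** 2) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

def marginal_logpdf_x2(x2):
    """Log density of the marginal N(mu2, s22) at x2."""
    return -0.5 * math.log(2.0 * math.pi * s22) - (x2 - mu2) ** 2 / (2.0 * s22)

def cond_mean_x1(x2):
    """Conditional mean mu_{1|2} of x1 given x2."""
    return mu1 + s12 / s22 * (x2 - mu2)

# log C with n = 1: C = (2*pi)^(1/2) * |s11 - s12 * s22^{-1} * s21|^(1/2)
log_C = 0.5 * math.log(2.0 * math.pi) + 0.5 * math.log(s11 - s12 ** 2 / s22)

def theorem_gap(x2):
    """log p(x2) - [log C + log p(mu_{1|2}, x2)]; zero if the identity holds."""
    return marginal_logpdf_x2(x2) - (log_C + joint_logpdf(cond_mean_x1(x2), x2))

max_gap = max(abs(theorem_gap(x2)) for x2 in (-3.0, -1.0, 0.0, 2.5))
print(max_gap)
```

Under these assumptions the gap should vanish up to floating-point rounding at every test point, which is what the theorem asserts.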

Corollary 7.1.1.

Consider the multivariate Gaussian distribution setting presented above, and let $x_2$ be scalar. Then, for some fixed $\tilde{x}_2 \neq \mu_2$,

\[
\text{Var}(x_2) = \frac{1}{2}\frac{(\mu_2 - \tilde{x}_2)^2}{\log p(x_1=\mu_1, x_2=\mu_2) - \log p(x_1=\mu_{1|2}, x_2=\tilde{x}_2)}. \tag{16}
\]
Proof.

Recall that the p.d.f. of a univariate Gaussian distribution $x_2 \sim \mathcal{N}(\mu_2, \text{Var}(x_2))$ is given by:

\[
p(x_2) = \frac{1}{\sqrt{2\pi\,\text{Var}(x_2)}} \exp\left(-\frac{(x_2-\mu_2)^2}{2\,\text{Var}(x_2)}\right). \tag{17}
\]

Taking the logarithm, we obtain:

\[
\log p(x_2) = -\frac{1}{2}\log(2\pi\,\text{Var}(x_2)) - \frac{(x_2-\mu_2)^2}{2\,\text{Var}(x_2)}. \tag{18}
\]

From Theorem 7.1, we deduce that $\log p(x_2)$ and $\log p(x_1=\mu_{1|2}, x_2)$ differ by a constant since $p(x_2) \propto p(x_1=\mu_{1|2}, x_2)$. Considering two distinct points, $x_2=\mu_2$ and $x_2=\tilde{x}_2$:

For $x_2=\mu_2$:

\[
\log p(x_2=\mu_2) = -\frac{1}{2}\log(2\pi\,\text{Var}(x_2)). \tag{19}
\]

For $x_2=\tilde{x}_2$:

\[
\log p(x_2=\tilde{x}_2) = -\frac{1}{2}\log(2\pi\,\text{Var}(x_2)) - \frac{(\tilde{x}_2-\mu_2)^2}{2\,\text{Var}(x_2)}. \tag{20}
\]

Define the difference of the corresponding joint log densities as:

\[
\Delta = \log p(x_1=\mu_1, x_2=\mu_2) - \log p(x_1=\mu_{1|2}, x_2=\tilde{x}_2). \tag{21}
\]

Since $\mu_{1|2} = \mu_1$ at $x_2 = \mu_2$, Theorem 7.1 implies that $\Delta = \log p(x_2=\mu_2) - \log p(x_2=\tilde{x}_2)$, because the constant $\log C$ cancels. Subtracting (20) from (19), we get:

\[
\Delta = \frac{(\mu_2-\tilde{x}_2)^2}{2\,\text{Var}(x_2)}. \tag{22}
\]

Rearranging gives:

\[
\text{Var}(x_2) = \frac{(\mu_2-\tilde{x}_2)^2}{2\left(\log p(x_1=\mu_1, x_2=\mu_2) - \log p(x_1=\mu_{1|2}, x_2=\tilde{x}_2)\right)}, \tag{23}
\]

completing the proof of the corollary. ∎
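As an illustration of the corollary, the sketch below recovers $\text{Var}(x_2)$ from only two evaluations of the joint log density, as in equation (16). It assumes a hypothetical bivariate Gaussian with illustrative parameters; in this toy setting $\mu_{1|2}$ is available in closed form, whereas in practice it would come from an optimizer.

```python
import math

# Hypothetical bivariate Gaussian; all numbers are illustrative assumptions.
mu1, mu2 = 0.5, -1.0
s11, s12, s22 = 2.0, 0.6, 1.5  # true Var(x2) is s22

def joint_logpdf(x1, x2):
    """Log density of N((mu1, mu2), Sigma) at (x1, x2)."""
    det = s11 * s22 - s12 ** 2
    d1, d2 = x1 - mu1, x2 - mu2
    quad = (s22 * d1 ** 2 - 2.0 * s12 * d1 * d2 + s11 * d2 ** 2) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

def variance_from_two_points(x2_tilde):
    """Equation (16): variance of x2 from the joint log density at two points."""
    mu1_given_2 = mu1 + s12 / s22 * (x2_tilde - mu2)  # conditional mean of x1
    delta = joint_logpdf(mu1, mu2) - joint_logpdf(mu1_given_2, x2_tilde)
    return 0.5 * (mu2 - x2_tilde) ** 2 / delta

# The estimate should not depend on which x2_tilde != mu2 is chosen.
estimates = [variance_from_two_points(mu2 + eta) for eta in (-0.5, 0.1, 2.0)]
print(estimates)
```

Each estimate should agree with `s22` up to rounding, regardless of the offset used.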

8 Supplementary Results in X Data Analysis

8.1 BIC values

In our analysis, we used the Bayesian Information Criterion (BIC) to select the best change-point among the competing models. For the X platform dataset, we evaluated the BIC values for potential single change-points from 2012 onwards, as presented in Table 3. The model with a change-point in 2012 yielded the lowest BIC value.

Change-point year   2012    2013    2014    2015    2016    2017    2018    2019
BIC                 220138  220163  220201  220189  220212  220203  220234  220224
Table 3: BIC values for competing models with different change-points for the X platform data.
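The selection rule described above amounts to taking the candidate year with the smallest BIC. A minimal sketch using the values in Table 3 (the variable names are illustrative):

```python
# BIC values copied from Table 3, keyed by candidate change-point year.
bic_by_year = {
    2012: 220138, 2013: 220163, 2014: 220201, 2015: 220189,
    2016: 220212, 2017: 220203, 2018: 220234, 2019: 220224,
}

# Lower BIC is preferred, so select the year that minimizes it.
best_year = min(bic_by_year, key=bic_by_year.get)

# Gap of each candidate relative to the selected model.
delta_bic = {year: bic - bic_by_year[best_year] for year, bic in bic_by_year.items()}
print(best_year, delta_bic)
```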

8.2 Diagnostics

Note that the variance estimation algorithm is similar to a quadratic approximation to the log-posterior density. We performed diagnostics to evaluate the assumption that the log likelihood function can be approximated by a quadratic function in a small neighborhood centered around the mode.

8.2.1 Normality test

We rewrite equation (20) as follows:

\[
\max_{\theta_1=\theta_1^{*}-\eta} \log\pi(Z,\theta|Y) = \max_{\theta_1=\theta_1^{*}} \log\pi(Z,\theta|Y) - \frac{1}{2}\frac{\eta^2}{\widehat{\text{Var}(\theta_1|Y)}}, \tag{24}
\]

and if our assumptions are correct, then $\widehat{\text{Var}(\theta_1|Y)}$ should be a constant no matter the choice of $\eta$. We vary $\eta$ to obtain a plot of $\max_{\theta_1=\theta_1^{*}-\eta} \log\pi(Z,\theta|Y)$ as a function of $\eta$ to test whether this assumption is valid. If this assumption holds, $\max_{\theta_1=\theta_1^{*}-\eta} \log\pi(Z,\theta|Y)$ should be approximately a quadratic function of $\eta$.

We choose $\theta_1$ to be $\gamma^{w}_{R}$, and set $\eta$ to be -0.03, -0.02, -0.01, 0.01, 0.02, 0.03. In Figure 8 we plot $\max_{\theta_1=\theta_1^{*}-\eta} \log\pi(Z,\theta|Y)$ as a function of $\eta$ (left) and $\eta^2$ (right). We can see that $\max_{\theta_1=\theta_1^{*}-\eta} \log\pi(Z,\theta|Y)$ is approximately quadratic as a function of $\eta$ and linear as a function of $\eta^2$. The diagnostic plot does not show evidence of violating the underlying assumptions of the variance estimation method.

Figure 8: We choose $\theta_1$ to be $\gamma^{w}_{R}$, and set $\eta$ to be -0.03, -0.02, -0.01, 0.01, 0.02, 0.03. We plot $\max_{\theta_1=\theta_1^{*}-\eta} \log\pi(Z,\theta|Y)$ as a function of $\eta$ (left) and $\eta^2$ (right).
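The logic of this diagnostic can be sketched on a toy problem. The code below is an illustration under stated assumptions, not the procedure used for the X data: a hypothetical bivariate Gaussian stands in for the posterior (with `x2` in the role of $\theta_1$ and `x1` in the role of all remaining unknowns), and a simple ternary search stands in for the optimizer that maximizes over the remaining unknowns. Solving equation (24) for the variance at each $\eta$ should then give the same value every time.

```python
import math

# Hypothetical stand-in for the posterior: x2 plays the role of theta_1,
# x1 the role of all other unknowns. All numbers are illustrative.
mu1, mu2 = 0.5, -1.0
s11, s12, s22 = 2.0, 0.6, 1.5  # true Var(x2) is s22

def joint_logpdf(x1, x2):
    """Log density of N((mu1, mu2), Sigma) at (x1, x2)."""
    det = s11 * s22 - s12 ** 2
    d1, d2 = x1 - mu1, x2 - mu2
    quad = (s22 * d1 ** 2 - 2.0 * s12 * d1 * d2 + s11 * d2 ** 2) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

def profile_logpdf(x2, lo=-50.0, hi=50.0, iters=200):
    """Maximize joint_logpdf over x1 for fixed x2 by ternary search.

    The search is valid because the log density is concave in x1; it is only
    a placeholder for whatever optimizer is used in practice.
    """
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if joint_logpdf(m1, x2) < joint_logpdf(m2, x2):
            lo = m1
        else:
            hi = m2
    return joint_logpdf(0.5 * (lo + hi), x2)

def variance_estimate(eta):
    """Solve equation (24) for the variance at a single offset eta."""
    drop = profile_logpdf(mu2) - profile_logpdf(mu2 - eta)
    return 0.5 * eta ** 2 / drop

etas = [-0.03, -0.02, -0.01, 0.01, 0.02, 0.03]
estimates = [variance_estimate(eta) for eta in etas]
spread = max(estimates) - min(estimates)
print(estimates, spread)
```

For an exactly Gaussian target the spread is zero up to numerical error; for a real posterior, a large spread across $\eta$ would indicate that the quadratic approximation is poor in the neighborhood considered.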

8.2.2 Linearity test

Figure 9: We choose $\theta_1$ to be $\gamma^{w}_{R}$, and set $\eta$ to be -0.03, -0.02, -0.01, 0.01, 0.02, 0.03. We calculate $\frac{\mu(X_j|X_1=\mu_1+\eta)-\mu_j}{\eta}$ for each latent position parameter under each $\eta$. The latent position parameters are ordered by the mean of their slopes. We plot the mean, maximum, and minimum of the slopes (left) and the slopes of the 95% of latent position parameters that remain after removing the 5% in the tails with very large or very small slopes (right).

From Theorem 7.1, we see that the conditional mean of $X_{2:n}|X_1$ is linear as a function of $X_1$ under the Gaussian assumption, i.e.,

\[
\mu(X_{2:n}|X_1=\mu_1+\eta) = \mu_{2:n} + \Sigma_{21}\Sigma_{11}^{-1}\eta. \tag{25}
\]

We vary $\eta$ to obtain the slope $\frac{\mu(X_j|X_1=\mu_1+\eta)-\mu_j}{\eta}$ for each $X_j$ with different $\eta$ to test whether this assumption is valid. If this assumption holds, $\frac{\mu(X_j|X_1=\mu_1+\eta)-\mu_j}{\eta}$ should be a constant regardless of the choice of $\eta$ for each $X_j$.

We use the results of the previous test, setting $\theta_1$ to be $\gamma^{w}_{R}$ and letting $\eta$ take values in the set $\{-0.03, -0.02, -0.01, 0.01, 0.02, 0.03\}$. We calculate $\frac{\mu(X_j|X_1=\mu_1+\eta)-\mu_j}{\eta}$ for each latent position parameter and each $\eta$. In Figure 9 we order the latent position parameters by the mean of their slopes and plot the mean, maximum, and minimum of the slopes (left) and the slopes of the 95% of latent position parameters that remain after removing the 5% in the tails with very large or very small slopes (right). We can see that the slopes, i.e., the correlated changes in each latent position parameter when we change $\gamma^{w}_{R}$, are close under different choices of $\eta$. Therefore the diagnostic plot does not show substantial evidence of violating the underlying assumptions of the variance estimation method.
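The slope check can likewise be sketched on a toy problem. The example below assumes a hypothetical bivariate Gaussian in which `x1` is the parameter being shifted and `xj` is a single other parameter; the conditional mean of `xj` is located numerically by ternary search (a placeholder for the actual optimizer) and the resulting slopes are compared with $\Sigma_{21}\Sigma_{11}^{-1}$ from equation (25). All numerical values are illustrative.

```python
import math

# Hypothetical bivariate Gaussian: x1 is the shifted parameter, xj another one.
mu1, muj = 0.5, -1.0
s11, sj1, sjj = 2.0, 0.6, 1.5  # illustrative covariance entries

def joint_logpdf(x1, xj):
    """Log density of N((mu1, muj), Sigma) at (x1, xj)."""
    det = s11 * sjj - sj1 ** 2
    d1, dj = x1 - mu1, xj - muj
    quad = (sjj * d1 ** 2 - 2.0 * sj1 * d1 * dj + s11 * dj ** 2) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

def conditional_mode_xj(x1, lo=-50.0, hi=50.0, iters=200):
    """Argmax over xj of the joint log density for fixed x1 (ternary search).

    For a Gaussian the conditional mode equals the conditional mean.
    """
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if joint_logpdf(x1, m1) < joint_logpdf(x1, m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

etas = [-0.03, -0.02, -0.01, 0.01, 0.02, 0.03]
# Slope (mu(Xj | X1 = mu1 + eta) - mu_j) / eta for each offset eta.
slopes = [(conditional_mode_xj(mu1 + eta) - muj) / eta for eta in etas]
expected_slope = sj1 / s11  # Sigma_21 * Sigma_11^{-1} in the scalar case
print(slopes, expected_slope)
```

In this Gaussian setting the slopes coincide with `expected_slope` up to optimizer tolerance for every $\eta$; slopes that drift systematically with $\eta$ would suggest a departure from the linearity implied by the Gaussian assumption.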