A Theoretical Framework for an Efficient Normalizing Flow-Based Solution to the Schrödinger Equation

Daniel Freedman¹ Eyal Rozenberg¹,² Alex Bronstein²
¹Verily Research ²Technion - Israel Institute of Technology

Abstract

A central problem in quantum mechanics involves solving the Electronic Schrödinger Equation for a molecule or material. The Variational Monte Carlo approach to this problem approximates a particular variational objective via sampling, and then optimizes this approximated objective over a chosen parameterized family of wavefunctions, known as the ansatz. Recently neural networks have been used as the ansatz, with accompanying success. However, sampling from such wavefunctions has required the use of a Markov Chain Monte Carlo approach, which is inherently inefficient. In this work, we propose a solution to this problem via an ansatz which is cheap to sample from, yet satisfies the requisite quantum mechanical properties. We prove that a normalizing flow using the following two essential ingredients satisfies our requirements: (a) a base distribution which is constructed from Determinantal Point Processes; (b) flow layers which are equivariant to a particular subgroup of the permutation group. We then show how to construct both continuous and discrete normalizing flows which satisfy the requisite equivariance. We further demonstrate the manner in which the non-smooth nature (“cusps”) of the wavefunction may be captured, and how the framework may be generalized to provide induction across multiple molecules. The resulting theoretical framework entails an efficient approach to solving the Electronic Schrödinger Equation.

1 Introduction

The Electronic Schrödinger Equation A central problem in quantum mechanics involves solving the Electronic Schrödinger Equation to compute the ground state energy and wavefunction of a molecule or material. This problem has manifold applications in chemistry, condensed matter physics, and materials science. A standard computational approach to this problem is based on Variational Monte Carlo (Ceperley and Alder, 1986; Austin et al., 2012; Gubernatis et al., 2016; Foulkes et al., 2001; Needs et al., 2009): a particular variational objective is approximated via sampling, and the approximated objective is optimized over a family of wavefunctions, yielding an upper bound on the ground state energy. The heart of this method is the wavefunction family, also known as the ansatz; recent work has proposed using neural networks as a flexible ansatz, and has achieved very high quality results, which we now describe further.

Neural Network Ansätze We begin by noting that various works have used neural networks as the ansätze in the case of pure-spin systems (sometimes also referred to as “discrete space systems”), for example (Carleo and Troyer, 2017; Deng et al., 2017; Gao and Duan, 2017; Levine et al., 2019; Sharir et al., 2022; Passetti et al., 2023). In terms of continuous space problems of the sort that interest us, DeepWF (Han et al., 2019) bases its ansatz on the classical Slater-Jastrow formalism, but learns both the symmetric and antisymmetric parts; the latter contains only two-electron terms, limiting the accuracy. PauliNet (Hermann et al., 2020; Schätzle et al., 2021) also bases its ansatz on the Slater-Jastrow-Backflow form, but does so in a way that captures many-electron interactions, while respecting permutation-equivariance; this, as well as the inclusion of cusp terms, leads to much higher accuracy (e.g. 97.3% of the correlation energy for boron atoms). FermiNet (Pfau et al., 2020; Spencer et al., 2020) attains still higher accuracy (e.g. 99.8% of the correlation energy for boron atoms) by using an appropriately designed neural network to represent the entire wavefunction, which contains a generalization of Slater determinants to account for all-electron interactions. A hybrid solution which improves upon both PauliNet and FermiNet is presented in (Gerard et al., 2022). Techniques for learning / induction across several molecules or materials at once are presented in (Gao and Günnemann, 2023; Scherbela et al., 2024; Gerard et al., 2024). We briefly mention applications to periodic systems (Wilson et al., 2022; Li et al., 2022; Pescia et al., 2022; Cassella et al., 2023); techniques that use Diffusion Monte Carlo (Wilson et al., 2021; Ren et al., 2023); and methods that deal with excited states (Entwistle et al., 2023; Pfau et al., 2023; Naito et al., 2023). Finally, we mention two works which use normalizing flows (Thiede et al., 2022; Saleh et al., 2023). Both are limited in their applicability, as the former is restricted to one-dimensional systems by construction, while the latter makes use of the flows in a non-standard way and thus cannot scale past systems with a few electrons.

Goals and Contributions In order to be able to apply the Variational Monte Carlo formalism to the ansätze just described, such as PauliNet or FermiNet, one must be able to sample from the densities corresponding to the wavefunctions given by their neural networks. In general, this is only possible using Markov Chain Monte Carlo (MCMC) techniques such as Langevin Monte Carlo (Umrigar et al., 1993) or any of several variations. The issue with using such MCMC approaches to sampling is that they are inherently time-consuming: each sample is itself the solution of a stochastic differential equation as time goes to infinity. The main goal of this paper is to solve the problem of sampling inefficiency, thereby yielding faster algorithms for solving the Electronic Schrödinger Equation. We achieve this goal by specifying a wavefunction ansatz which is easy to sample from, yet satisfies the requisite quantum mechanical properties. The ansatz is based on normalizing flows which, unlike those of (Thiede et al., 2022; Saleh et al., 2023), are general and can be applied to a space of any dimensionality. We provide the following contributions:

• We establish that such an ansatz can be instantiated as a normalizing flow with these characteristics: (a) its base distribution is symmetric under permutations, and vanishes for identical electrons; (b) the flow transformation is equivariant to a particular subgroup of the permutation group.

• We show that the base distribution can be constructed using a particular combination of Determinantal Point Processes.

• We construct both continuous and discrete normalizing flows obeying the requisite equivariance.

• We provide a training regimen based on standard stochastic gradient descent.

• We show how to accommodate cusps, which encapsulate non-smooth aspects of the wavefunction.

• We generalize the framework so that induction across multiple molecules may be accommodated, while including the necessary additional invariances, in particular rigid motion invariance.

2 Problem Setup

2.1 Goals

The Setting Our overall goal is to compute the ground state wavefunction and energy of a molecule given its molecular parameters and spin multiplicity. Denote $x_{i}=(r_{i},s_{i})$ to be the pair consisting of the position and spin for the $i^{th}$ electron; $x$ will denote the entire ordered list $(x_{1},\dots,x_{n})$ , with corresponding definitions for $r$ and $s$ . We specify wavefunctions as $\psi(x)$ ; due to the fact that electrons are Fermions, valid wavefunctions must be antisymmetric, that is if $\pi\in\mathbb{S}_{n}$ is a permutation, then

\psi(\pi x)=(-1)^{\pi}\psi(x)

(1)

where as usual, $(-1)^{\pi}$ is shorthand for $(-1)^{N(\pi)}$ where $N(\pi)$ is the minimal number of transpositions (pairwise swaps) needed to produce $\pi$ .

Let $R_{I}$ and $Z_{I}$ denote the position and atomic number of the $I^{th}$ nucleus, and let the Laplacian for the $i^{th}$ electron be $\Delta_{i}=\frac{\partial^{2}}{\partial r_{i1}^{2}}+\frac{\partial^{2}}{\partial r_{i2}^{2}}+\frac{\partial^{2}}{\partial r_{i3}^{2}}$ ; then the Hamiltonian is given by

H=-\frac{1}{2}\underbrace{\sum_{i}\Delta_{i}}_{\Delta}+\underbrace{\sum_{i>j}\frac{1}{\|r_{i}-r_{j}\|}-\sum_{i,I}\frac{Z_{I}}{\|r_{i}-R_{I}\|}+\sum_{I>J}\frac{Z_{I}Z_{J}}{\|R_{I}-R_{J}\|}}_{V(x)}=-\tfrac{1}{2}\Delta+V(x)

(2)
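
To make the potential term $V(x)$ in Equation (2) concrete, the following is a minimal sketch that evaluates it directly from the three sums. The function name `potential_energy`, the use of plain tuples for positions, and the helium-like test configuration are illustrative assumptions; atomic units are assumed throughout.

```python
import math

def potential_energy(electrons, nuclei, charges):
    """Evaluate V from Equation (2) in atomic units (illustrative helper).

    electrons: list of 3-tuples r_i; nuclei: list of 3-tuples R_I;
    charges: list of atomic numbers Z_I, in the same order as nuclei.
    """
    v = 0.0
    # Electron-electron repulsion: sum over pairs i > j.
    for i in range(len(electrons)):
        for j in range(i):
            v += 1.0 / math.dist(electrons[i], electrons[j])
    # Electron-nucleus attraction: sum over all pairs (i, I).
    for r in electrons:
        for big_r, z in zip(nuclei, charges):
            v -= z / math.dist(r, big_r)
    # Nucleus-nucleus repulsion: sum over pairs I > J.
    for a in range(len(nuclei)):
        for b in range(a):
            v += charges[a] * charges[b] / math.dist(nuclei[a], nuclei[b])
    return v

# Hypothetical helium-like configuration: one nucleus (Z = 2) at the origin,
# two electrons at unit distance on opposite sides of it.
v_example = potential_energy(
    [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)], [2]
)
print(v_example)
```

In this configuration the three contributions are $1/2$ (electron-electron), $-2-2$ (electron-nucleus), and $0$ (a single nucleus), so the sketch can be checked by hand.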

Our goal is to compute the ground state wavefunction, which we denote as $\psi_{0}(x)$ and corresponding ground state energy $E_{0}$ . They may be computed using the variational principle:

\psi_{0}=\operatorname*{argmin}_{\psi\in\Psi}\frac{\langle\psi|H|\psi\rangle}{\langle\psi|\psi\rangle}\qquad\text{and}\qquad E_{0}=\frac{\langle\psi_{0}|H|\psi_{0}\rangle}{\langle\psi_{0}|\psi_{0}\rangle}

(3)

where $\Psi$ is the set of all possible valid wavefunctions, and $H$ is the Hamiltonian. If we specify the wavefunction ansatz as a neural network with parameters $\theta$ , this becomes

\theta^{*}=\operatorname*{argmin}_{\theta}\frac{\langle\psi(\cdot;\theta)|H|\psi(\cdot;\theta)\rangle}{\langle\psi(\cdot;\theta)|\psi(\cdot;\theta)\rangle}\qquad\text{and}\qquad E^{*}=\frac{\langle\psi(\cdot;\theta^{*})|H|\psi(\cdot;\theta^{*})\rangle}{\langle\psi(\cdot;\theta^{*})|\psi(\cdot;\theta^{*})\rangle}\geq E_{0}

(4)

That is, we compute an upper bound $E^{*}$ to the ground state energy $E_{0}$ . The more expressive the ansatz, the tighter the bound will be.

Variational Monte Carlo The issue with the formulation to this point is the need to compute the inner products in Equations (3) and (4), which correspond to very high-dimensional integrals. A standard solution to this problem is based on a Monte Carlo scheme. To begin with, let us define the local energy as

\mathcal{E}(x)\equiv\frac{H\psi(x)}{\psi(x)}=-\frac{\Delta\psi(x)}{2\psi(x)}+V(x)\qquad\text{and}\qquad\mathcal{E}_{r}(x)=\text{Real}\{\mathcal{E}(x)\}

(5)

In this case, one can simplify the minimand in Equation (3) (see Appendix A) as

\frac{\langle\psi|H|\psi\rangle}{\langle\psi|\psi\rangle}\,=\,\mathbb{E}_{x\sim\rho(\cdot)}\left[\mathcal{E}_{r}(x)\right]\,\approx\,\frac{1}{K}\sum_{k=1}^{K}\mathcal{E}_{r}\left(x^{(k)}\right)\qquad\text{with}\qquad\rho(x)=\frac{|\psi(x)|^{2}}{\langle\psi|\psi\rangle}

(6)

where the $x^{(k)}$ are sampled from $\rho(x)=|\psi(x)|^{2}/\langle\psi|\psi\rangle$ .
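
As an illustration of the estimator in Equation (6), the sketch below applies it to a deliberately simple case that is not treated in the text: a single electron in a hydrogen atom with the trial wavefunction $\psi(r)=e^{-\alpha\|r\|}$ (spin ignored). For this choice the local energy has the closed form $-\alpha^{2}/2+(\alpha-1)/\|r\|$ , and $|\psi|^{2}$ can be sampled directly because its radial part is a Gamma distribution with shape 3 and scale $1/(2\alpha)$ . The function names and the parameter values are assumptions made for the example.

```python
import random

def local_energy(radius, alpha):
    """Closed-form local energy for psi = exp(-alpha * |r|) in a hydrogen atom."""
    return -0.5 * alpha ** 2 + (alpha - 1.0) / radius

def vmc_energy(alpha, num_samples, rng):
    """Monte Carlo average of the local energy, as in Equation (6)."""
    total = 0.0
    for _ in range(num_samples):
        # |psi|^2 is proportional to exp(-2 alpha r); including the r^2 volume
        # factor, the radius follows a Gamma(shape=3, scale=1/(2 alpha)) law.
        radius = rng.gammavariate(3.0, 1.0 / (2.0 * alpha))
        total += local_energy(radius, alpha)
    return total / num_samples

rng = random.Random(0)
energy_exact_ansatz = vmc_energy(1.0, 1000, rng)    # alpha = 1 is the true ground state
energy_other_ansatz = vmc_energy(0.8, 20000, rng)   # a suboptimal alpha
print(energy_exact_ansatz, energy_other_ansatz)
```

For $\alpha=1$ the local energy is constant (the trial function is the exact ground state), so every sample returns the same value; for other values of $\alpha$ the average fluctuates around $\alpha^{2}/2-\alpha$ , which lies above the ground state energy, in line with the variational bound in Equation (4).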

2.2 The General Approach

As enumerated in Section 1, a number of recent works have followed the above approach using a variety of neural networks as the ansatz for the wavefunction $\psi(\cdot;\theta)$ . In order to do so, one must be able to sample from $\rho(x;\theta)=|\psi(x;\theta)|^{2}/\langle\psi|\psi\rangle$ ; and as the networks are quite general, the only feasible method for sampling is a Markov Chain Monte Carlo technique such as Langevin Monte Carlo (Umrigar et al., 1993) or any of several variations. These techniques can be time-consuming, as each sample is itself the solution of a stochastic differential equation as time goes to infinity.

A solution to this problem presents itself if we can somehow specify a wavefunction $\psi(x)$ which is easy to sample from. We are interested in wavefunctions which satisfy the following three properties:

(W1)

There is an explicit functional form for the wavefunction $\psi(x)$ .
(W2)

$\psi$ is antisymmetric.
(W3)

We can sample non-iteratively (in constant time) from $|\psi(\cdot)|^{2}$ .

The first two properties are necessary for any form of Variational Monte Carlo: (W1) allows us to evaluate the local energy $\mathcal{E}_{r}$ in (5) for use in (6); and (W2) is required for valid electronic (Fermionic) wavefunctions. But (W3) is the new ingredient: if we have a family of wavefunctions $\psi$ satisfying (W1)-(W3), then solving the minimization in (4) via the Monte Carlo approach in (6) will be considerably accelerated, as each sample will only require constant time to generate. We add a fourth property, which is not strictly necessary but is both desirable and will prove useful:

(W4)

$\psi$ is normalized, that is $\int|\psi(x)|^{2}dx=1$ .

It turns out that generating such wavefunctions is possible using the following procedure:

Theorem 1.

Let $\rho(\cdot)$ be a probability density function which we can sample from in constant time. Let $\rho(\cdot)$ satisfy two additional properties:

(D1)

$\rho(x)$ is symmetric: $\rho(\pi x)=\rho(x)$ for all permutations $\pi\in\mathbb{S}_{n}$ .
(D2)

$\rho(x)=0$ if $x_{i}=x_{j}$ for any $i,j$ .

Finally, let $\kappa(x)$ be a complex function which satisfies $|\kappa(x)|=1\,\,\forall x$ , and is nearly antisymmetric:

\kappa(\pi x)=\begin{cases}(-1)^{\pi}\kappa(x)&\text{if }x_{i}\neq x_{j}\text{ for all }i,j\\ \bar{\kappa}&\text{otherwise}\end{cases}

(7)

where $\bar{\kappa}\in\mathbb{C}$ is an arbitrary value with $|\bar{\kappa}|=1$ . Then $\psi$ satisfies (W1)-(W4) if and only if $\psi$ can be written as $\psi(x)=\kappa(x)\sqrt{\rho(x)}$ with $\kappa$ and $\rho$ satisfying the above-stated properties.

Proof: See Appendix B.
The general idea expressed in Theorem 1 is that we can build the wavefunction $\psi$ out of an easy-to-sample-from density function satisfying additional properties (D1)-(D2); and a nearly antisymmetric phase function $\kappa$ . In what follows, we will show how to construct both of these ingredients. But before doing so, we take a short detour to address the most important practical scenario, that of fixed spin multiplicity.

2.3 Fixed Spin Multiplicity

Notation As in most approaches to this problem, we assume that the spin multiplicity of the molecule is specified, which is equivalent to fixing the number of spin up and spin down electrons, denoted $n_{u}$ and $n_{d}$ respectively, with $n_{u}+n_{d}=n$ . Define the canonical spin vector to be given by $\bar{s}=[\uparrow,\dots,\uparrow,\downarrow,\dots,\downarrow]$ , i.e. the first $n_{u}$ are $\uparrow$ , the last $n_{d}$ are $\downarrow$ . We let the sets of indices of up and down spin electrons for the canonical spin vector be denoted by $\mathcal{N}_{u}=\{1,\dots,n_{u}\}$ and $\mathcal{N}_{d}=\{n_{u}+1,\dots,n\}$ . Finally, we will be interested in the subgroup of permutations in which a permutation is applied separately to spin-up and spin-down electrons. We denote this subgroup by

\mathbb{G}\equiv\mathbb{S}_{\mathcal{N}_{u}}\times\mathbb{S}_{\mathcal{N}_{d}}\quad\quad(\mathbb{G}\text{ is a subgroup of }\mathbb{S}_{n})

(8)

Specification of the Density In the case of fixed spin multiplicity, the specification of the density $\rho(x)$ is simplified:

Theorem 2.

Given a configuration $x=(r,s)$ , let a permutation which maps the spin vector $s$ to the canonical spin vector $\bar{s}$ be given by $\bar{\pi}_{s}$ , i.e. $\bar{s}=\bar{\pi}_{s}s$ . Let $\bar{\rho}(r)$ be a density function on electron positions (i.e. no spins) satisfying

(R1)

$\bar{\rho}$ is $\mathbb{G}$ -invariant: $\bar{\rho}(\pi r)=\bar{\rho}(r)\text{ for all }\pi\in\mathbb{G}$
(R2)

$\bar{\rho}(r)=0\text{ if }r_{i}=r_{j},\text{ for }i,j\in\mathcal{N}_{u}\text{ or }i,j\in\mathcal{N}_{d}$

A density $\rho(x)=\rho(r,s)$ satisfies conditions (D1)-(D2) in Theorem 1 if and only if it may be written as $\rho(r,s)=\bar{\rho}(\bar{\pi}_{s}r)$ for a density $\bar{\rho}(r)$ satisfying conditions (R1) and (R2).

Proof: See Appendix C.

To summarize: in the case of fixed spin multiplicity, specifying a wavefunction $\psi$ satisfying our desired conditions (W1)-(W4) is equivalent to specifying a density $\bar{\rho}(r)$ satisfying conditions (R1)-(R2); and then applying the transformations given in Theorems 1 and 2 to map from $\bar{\rho}$ to $\psi$ .¹ Therefore, henceforth we will focus exclusively on specifying densities $\bar{\rho}(r)$ satisfying conditions (R1)-(R2). To avoid unnecessary notational complexity, we will drop the bars and simply write $\rho(r)$ .

¹ We have for the moment ignored the issue of the phase $\kappa$ , which we return to in Sections 3.6 and 4.1.
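
The map $\rho(r,s)=\bar{\rho}(\bar{\pi}_{s}r)$ from Theorem 2 can be sketched in a few lines. The version below is a toy under stated assumptions: positions are one-dimensional, spins are encoded as the strings `"u"` and `"d"`, a stable sort is used as one valid choice of $\bar{\pi}_{s}$ , and `toy_rho_bar` is a hypothetical unnormalized function for $n_{u}=1$ , $n_{d}=2$ that satisfies (R1)-(R2) but is not a density proposed in this work.

```python
import math

def canonical_order(positions, spins):
    """Reorder positions so that all 'u' electrons precede all 'd' electrons.

    The stable sort is one valid choice of the permutation pi_bar_s.
    """
    order = sorted(range(len(spins)), key=lambda i: 0 if spins[i] == "u" else 1)
    return [positions[i] for i in order]

def density_with_spin(rho_bar, positions, spins):
    """rho(r, s) = rho_bar(pi_bar_s r), following Theorem 2."""
    return rho_bar(canonical_order(positions, spins))

def toy_rho_bar(r):
    """Hypothetical unnormalized example for n_u = 1, n_d = 2 in one dimension.

    It is symmetric in the two spin-down positions (R1) and vanishes when
    they coincide (R2).
    """
    up, d1, d2 = r
    return math.exp(-up * up) * math.exp(-(d1 * d1 + d2 * d2)) * (d1 - d2) ** 2

# The same three (position, spin) pairs listed in two different orders.
value_a = density_with_spin(toy_rho_bar, [0.3, -1.0, 0.7], ["d", "u", "d"])
value_b = density_with_spin(toy_rho_bar, [0.7, 0.3, -1.0], ["d", "d", "u"])
print(value_a, value_b)
```

Because $\bar{\rho}$ is $\mathbb{G}$ -invariant, any other permutation that brings the spins into canonical order would give the same value, which is why the particular choice of $\bar{\pi}_{s}$ does not matter.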

3 Using Normalizing Flows to Construct the Wavefunction Ansatz

3.1 Sufficient Properties of the Normalizing Flow’s Base Density and Transformation

Our goal is to use a normalizing flow to construct the density $\rho(r)$ . Let $D$ be the ambient dimension (i.e. $D=3$ ) and $n$ be the number of electrons. The relevant vectors will live in the space $\mathbb{R}^{Dn}$ construed as the Cartesian product $\mathbb{R}^{D}\times\dots\times\mathbb{R}^{D}$ (which is of course isomorphic to $\mathbb{R}^{Dn}$ ). A normalizing flow will consist of two ingredients: (1) a base random variable $z$ , which lives in $\mathbb{R}^{Dn}$ , and is described by the density $\rho_{z}(z)$ ; (2) an invertible transformation $T:\mathbb{R}^{Dn}\to\mathbb{R}^{Dn}$ , such that $r=T(z)$ . In this case, the density $\rho(r)$ is the push-forward of $\rho_{z}$ along $T$ , and is given by the change of variables formula

\rho(r)=\rho_{z}(T^{-1}(r))|\det J_{T^{-1}}(r)|

(9)
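
A one-dimensional sanity check of the change of variables formula in Equation (9) is sketched below. The affine map $T(z)=\text{scale}\cdot z+\text{shift}$ and the standard normal base density are assumptions chosen because the push-forward is then known in closed form (a normal density with mean `shift` and standard deviation `|scale|`); they are not the flow or base density used in this work.

```python
import math

def standard_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def pushforward_density(r, scale, shift):
    """Density of r = T(z) = scale * z + shift with z ~ N(0, 1), via Equation (9)."""
    z = (r - shift) / scale      # T^{-1}(r)
    jac_inv = 1.0 / scale        # Jacobian of T^{-1} (a scalar in one dimension)
    return standard_normal_pdf(z) * abs(jac_inv)

def normal_pdf(r, mean, std):
    """Closed-form reference used only to check the result."""
    return math.exp(-0.5 * ((r - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

flow_value = pushforward_density(1.3, scale=-2.0, shift=0.5)
reference_value = normal_pdf(1.3, mean=0.5, std=2.0)
print(flow_value, reference_value)
```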

Recall that we would like our density $\rho(r)$ to satisfy conditions (R1)-(R2) laid out in Theorem 2. The following theorem establishes conditions for this to occur:

Theorem 3.

Suppose that we have a normalizing flow, whose base density $\rho_{z}$ satisfies properties (R1) and (R2) from Theorem 2, and whose transformation $T$ is $\mathbb{G}$ -equivariant. Then the density resulting from the normalizing flow will satisfy properties (R1) and (R2).

Proof: See Appendix D.
Armed with this key result, we now set out to design the base density $\rho_{z}$ and transformation $T$ which satisfy the conditions of Theorem 3.

3.2 The Base Density via Determinantal Point Processes

In most cases in machine learning, the base density for a normalizing flow is taken to be a standard distribution, most often a Gaussian. In our case, we require that the base density have certain special properties, namely (R1) and (R2) from Theorem 2. It turns out that Determinantal Point Processes (DPPs) have just the properties we require. In particular, we are interested in the class of DPPs known as Projection DPPs (Gautier et al., 2019; Lavancier et al., 2015), which can be specified as follows. We will let $y$ specify a generic point in $\mathbb{R}^{D}$ . Let $h_{k}:\mathbb{R}^{D}\to\mathbb{R}$ for $k=1,\dots,n$ be a set of $n$ functions which are orthonormal, that is $\langle h_{i},h_{j}\rangle=\int_{\mathbb{R}^{D}}h_{i}(y)h_{j}(y)dy=\delta_{ij}$ . Let $H(y)$ be the column vector composed by stacking the individual functions $h_{i}(y)$ and define the kernel function as $K(y,y^{\prime})=H(y)^{T}H(y^{\prime})$ . Then for a given collection of $n$ points in $\mathbb{R}^{D}$ , that is $r=(r_{1},\dots,r_{n})$ , we define the $n\times n$ kernel matrix $\mathbf{K}_{n}(r)$ , from which the density of the Projection DPP may be specified:

\mathbf{K}_{n}(r)=\begin{bmatrix}K(r_{1},r_{1})&\dots&K(r_{1},r_{n})\\ \vdots&\ddots&\vdots\\ K(r_{n},r_{1})&\dots&K(r_{n},r_{n})\end{bmatrix}\qquad\Rightarrow\qquad\rho_{dpp}(r;n)=\frac{1}{n!}\det\mathbf{K}_{n}(r)

(10)

Since $\mathbf{K}_{n}(r)$ is positive semi-definite, it follows that its determinant is non-negative so that $\rho_{dpp}(r;n)$ is non-negative, as desired. A proof that $\rho_{dpp}(r;n)$ is properly normalized (i.e. integrates to $1$ ) can be found, for example, in Proposition 2.10 of (Johansson, 2006).

Given the notion of a Projection DPP, we may define the base density as follows. As above, let the base random variable be $z$ , where $z$ can be broken into spin-up and spin-down pieces, denoted $z_{u}$ and $z_{d}$ . (Specifically, $z_{u}$ and $z_{d}$ are the parts of $z$ corresponding to electrons in $\mathcal{N}_{u}$ and $\mathcal{N}_{d}$ , respectively.) The base density can then be constructed by taking

\rho_{z}(z)=\rho_{dpp}(z_{u};n_{u})\rho_{dpp}(z_{d};n_{d})

(11)

That is, $z_{u}$ and $z_{d}$ are chosen from two independent Projection DPPs. We then have the following theorem:

Theorem 4.

Let $\rho_{z}$ be the density specified in Equation (11). Then $\rho_{z}$ satisfies conditions (R1) and (R2) from Theorem 2.

Proof: See Appendix E.
We therefore have an explicit form for the base density from Equations (10) and (11). Furthermore, sampling from the base density amounts to sampling from two independent Projection DPPs. A sampling procedure for Projection DPPs is specified in Appendix F.
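
The Projection DPP density of Equation (10) can be evaluated with a short sketch. For simplicity it assumes $D=1$ and takes the functions $h_{k}$ to be the first $n$ normalized Hermite functions, which are orthonormal on the real line; both choices are illustrative assumptions rather than the construction used in this work, and the sampling procedure of Appendix F is not reproduced here.

```python
import math

def hermite_functions(y, n):
    """First n normalized Hermite functions at y (orthonormal on the real line)."""
    h = [math.pi ** -0.25 * math.exp(-0.5 * y * y)]
    if n > 1:
        h.append(math.sqrt(2.0) * y * h[0])
    for k in range(2, n):
        # Standard three-term recurrence for normalized Hermite functions.
        h.append(math.sqrt(2.0 / k) * y * h[k - 1] - math.sqrt((k - 1) / k) * h[k - 2])
    return h[:n]

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda i: abs(a[i][c]))
        if a[pivot][c] == 0.0:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            det = -det
        det *= a[c][c]
        for i in range(c + 1, n):
            factor = a[i][c] / a[c][c]
            for j in range(c, n):
                a[i][j] -= factor * a[c][j]
    return det

def dpp_density(points):
    """rho_dpp(r; n) = det K_n(r) / n!, with K(y, y') = H(y)^T H(y')."""
    n = len(points)
    feats = [hermite_functions(y, n) for y in points]  # feats[i][k] = h_k(r_i)
    kernel = [[sum(p * q for p, q in zip(feats[i], feats[j])) for j in range(n)]
              for i in range(n)]
    return determinant(kernel) / math.factorial(n)

print(dpp_density([0.3, -1.1, 0.8]))   # generic configuration
print(dpp_density([-1.1, 0.8, 0.3]))   # same points, permuted
print(dpp_density([0.5, 0.5, -1.0]))   # two coincident points
```

The three printed values illustrate the properties needed for a single spin block: the density is unchanged when the points are permuted, and it vanishes when two points coincide. The base density of Equation (11) would then be the product of two such factors, one for $z_{u}$ and one for $z_{d}$ .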

3.3 $\mathbb{G}$ -Equivariant Layers

As noted in Section 3.1, we require the normalizing flow transformation to be $\mathbb{G}$ -equivariant. Of course, chaining together many layers which are each $\mathbb{G}$ -equivariant results in an overall transformation which is also $\mathbb{G}$ -equivariant. Now, suppose that a particular layer $\ell$ can be written as

r^{\ell+1}=T^{\ell}(r^{\ell})

(12)

where $r^{\ell}=(r_{1}^{\ell},\dots,r_{n}^{\ell})$ and likewise for $r^{\ell+1}$ . We will need to see the action on the spin-up and spin-down electrons separately, so we denote $r_{u}^{\ell}=(r_{i}^{\ell})_{i\in\mathcal{N}_{u}}$ and $r_{d}^{\ell}=(r_{i}^{\ell})_{i\in\mathcal{N}_{d}}$ ; and we may write

r_{u}^{\ell+1}=T_{u}^{\ell}(r_{u}^{\ell},r_{d}^{\ell})\quad\text{and}\quad r_{d}^{\ell+1}=T_{d}^{\ell}(r_{u}^{\ell},r_{d}^{\ell})

(13)

For notational convenience, we use $\alpha\in\{u,d\}$ to denote the spin, and the complement of the spin is given by ${\hat{\alpha}}$ (i.e. if $\alpha=u$ then ${\hat{\alpha}}=d$ and vice-versa). Then we have the following theorem:

Theorem 5.

The transformation $T^{\ell}$ is $\mathbb{G}$ -equivariant if and only if

T_{\alpha}^{\ell}(\pi_{\alpha}r_{\alpha}^{\ell},\pi_{\hat{\alpha}}r_{\hat{\alpha}}^{\ell})=\pi_{\alpha}T_{\alpha}^{\ell}(r_{\alpha}^{\ell},r_{\hat{\alpha}}^{\ell})\qquad\alpha\in\{u,d\}

(14)

That is, $T_{\alpha}^{\ell}$ is equivariant with respect to $r_{\alpha}^{\ell}$ , and invariant with respect to $r_{\hat{\alpha}}^{\ell}$ .

Proof: See Appendix G.
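
The condition in Equation (14) is easy to test numerically. The sketch below does so for a hypothetical toy map on one-dimensional positions, in which each electron is updated using only its own coordinate and permutation-invariant summaries (means) of the two spin blocks. The map `toy_layer` is an assumption made purely to illustrate the equivariance check; it is not one of the layers constructed in this work, and its invertibility is not addressed.

```python
import math

def mean(values):
    return sum(values) / len(values)

def permute(values, perm):
    """Apply a permutation given as a list of 0-based source indices."""
    return [values[p] for p in perm]

def toy_layer(r_same, r_other):
    """Toy T_alpha: equivariant in r_same, invariant in r_other."""
    m_same, m_other = mean(r_same), mean(r_other)
    return [x + 0.1 * math.tanh(x - m_same) + 0.05 * math.sin(m_other)
            for x in r_same]

r_u, r_d = [0.4, -1.2, 0.9], [0.3, -0.6]
pi_u, pi_d = [2, 0, 1], [1, 0]

# Left-hand side of Equation (14): permute the inputs, then apply the layer.
lhs = toy_layer(permute(r_u, pi_u), permute(r_d, pi_d))
# Right-hand side: apply the layer, then permute the same-spin outputs.
rhs = permute(toy_layer(r_u, r_d), pi_u)

max_gap = max(abs(a - b) for a, b in zip(lhs, rhs))
print(max_gap)
```

The two sides agree up to floating-point rounding, since permuting a list can change the order in which its mean is accumulated.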

We now show how to specify continuous and discrete normalizing flows satisfying Theorem 5.

3.4 Continuous Normalizing Flows

According to Theorem 3, we are required to find a transformation which is $\mathbb{G}$ -equivariant. We now show this can be achieved via a continuous normalizing flow. We specify this flow via the ordinary differential equation (ODE)

\frac{dv}{dt}=\Gamma_{t}(v),\quad\text{with}\quad v(0)=z\sim\rho_{z}(\cdot)\quad\text{and}\quad r=v(1)

(15)

That is, the transformation $r=T(z)$ is derived as follows: the initial condition is sampled from the base density; and $r$ is obtained by integrating the ODE forward to time $t=1$ . The $t$ -dependence of $\Gamma$ is indicated via a subscript for notational convenience. We then have the following theorem:

Theorem 6.

Let the transformation $r=T(z)$ be specified as in Equation (15). Then $T$ is $\mathbb{G}$ -equivariant if $\Gamma_{t}$ is $\mathbb{G}$ -equivariant for all $t$ .

Proof: See Appendix H.

It therefore suffices to design a $\mathbb{G}$ -equivariant function $\Gamma_{t}$ . Let us break this down by spin: from Theorem 5, we know that this implies that for all $t$ , we have that $\Gamma_{t}(\pi_{\alpha}r_{\alpha},\pi_{\hat{\alpha}}r_{\hat{\alpha}})=\pi_{\alpha}\Gamma_{t}(r_{\alpha},r_{\hat{\alpha}})\text{ for }\alpha\in\{u,d\}$ . We show in Appendix K how to implement a layer of $\Gamma$ with a combination of multihead attention, fully connected layers, and linear projections ( $\Gamma$ can be composed of many such layers).

Continuous normalizing flows are elegant; however, they can present some numerical difficulties. In particular, the issue of ODE stiffness frequently arises in deep learning pipelines involving continuous normalizing flows. Thus, we now present an alternative method, based on discrete normalizing flows.

3.5 Discrete Normalizing Flows

Our goal is now to design functions $T_{u}^{\ell}$ and $T_{d}^{\ell}$ which satisfy Equation (14), and for which the overall transformation $T^{\ell}=(T_{u}^{\ell},T_{d}^{\ell})$ is invertible. The layer we propose here is designed not to sacrifice expressivity, especially when compared to many layers which are designed for discrete normalizing flows. In particular, the main issue will be to show that the expressivity can be retained even with the joint requirements of invertibility and $\mathbb{G}$ -equivariance. We note that the kind of transformation we propose below is not generally used for normalizing flows, as the determinant of its Jacobian is not fast to compute; however, this is not an issue in our case, as the dimensions of the spaces we are dealing with are relatively small. For a more detailed discussion, see Appendix I.

To solve this problem, we introduce the Split Subspace Layer; we note that this layer may be of broader interest in machine learning, independent of the current setting. As before, we take $D$ to represent the ambient spatial dimension; in our case, $D=3$ . A key parameter for the $\ell^{th}$ layer will be the orthogonal matrix $\Lambda_{\alpha}^{\ell}\in O(D)$ ; in particular, we divide this matrix into 2 pieces

\Lambda_{\alpha}^{\ell}=[\beta_{\alpha}^{\ell},\xi_{\alpha}^{\ell}]\quad\text{with }\beta_{\alpha}^{\ell}\in\mathbb{R}^{D\times D_{\beta}}\quad\text{ and }\quad\xi_{\alpha}^{\ell}\in\mathbb{R}^{D\times(D-D_{\beta})}

(16)

That is, $\beta_{\alpha}^{\ell}$ represents the first $D_{\beta}$ columns of $\Lambda_{\alpha}^{\ell}$ , and $\xi_{\alpha}^{\ell}$ represents the final $D-D_{\beta}$ columns. For each electron $i$ , we compute the inner product of its coordinates with $\beta_{\alpha}^{\ell}$ , i.e.

\gamma_{\alpha,i}^{\ell}=(\beta_{\alpha}^{\ell})^{T}r_{\alpha,i}^{\ell}\quad\text{so that }\gamma_{\alpha,i}^{\ell}\in\mathbb{R}^{D_{\beta}}

(17)

We can collect the individual vectors $\gamma_{\alpha,i}^{\ell}$ into a list $\gamma_{\alpha}^{\ell}=(\gamma_{\alpha,i}^{\ell})_{i\in{\mathcal{N}_{\alpha}}}$ . Given this, we define the Split Subspace Layer $T_{\alpha}^{\ell}$ on a per-electron basis by

r_{\alpha,i}^{\ell+1}=T_{\alpha,i}^{\ell}(r_{\alpha}^{\ell},r_{\hat{\alpha}}^{\ell})=r_{\alpha,i}^{\ell}+\xi_{\alpha}^{\ell}\varphi_{\alpha,i}^{\ell}(\gamma_{\alpha}^{\ell},\gamma_{\hat{\alpha}}^{\ell})\quad\text{with}\quad\varphi_{\alpha,i}^{\ell}(\gamma_{\alpha}^{\ell},\gamma_{\hat{\alpha}}^{\ell})\in\mathbb{R}^{D-D_{\beta}}

(18)

where $\varphi_{\alpha}^{\ell}$ is a network, and $\varphi_{\alpha,i}^{\ell}$ is the part of (the output of) $\varphi_{\alpha}^{\ell}$ corresponding to the $i^{th}$ electron. The layer is referred to as the Split Subspace Layer due to the fact that its input is one subspace of $\mathbb{R}^{D}$ , given by $\beta_{\alpha}^{\ell}$ ; whereas its output is in the orthogonal complement of this subspace, given by $\xi_{\alpha}^{\ell}$ .

The main ingredient of the layer is the network $\varphi_{\alpha}^{\ell}$ . We now establish two things: (1) the layer is invertible for any choice of $\varphi_{\alpha}^{\ell}$ ; and (2) conditions on $\varphi_{\alpha}^{\ell}$ under which $T_{\alpha}^{\ell}$ is $\mathbb{G}$ -equivariant.

Theorem 7.

Let $T^{\ell}$ be a Split Subspace Layer, as given in Equation (18). Then $T^{\ell}$ is invertible. In particular, let $\underline{\gamma}_{\alpha,i}^{\ell+1}=(\beta_{\alpha}^{\ell})^{T}r_{\alpha,i}^{\ell+1}$ ; then the inverse of the layer is given by

r_{\alpha,i}^{\ell}=r_{\alpha,i}^{\ell+1}-\xi_{\alpha}^{\ell}\varphi_{\alpha,i}^{\ell}(\underline{\gamma}_{\alpha}^{\ell+1},\underline{\gamma}_{\hat{\alpha}}^{\ell+1})

(19)

Furthermore, the layer $T^{\ell}$ is $\mathbb{G}$ -equivariant if

\varphi_{\alpha}^{\ell}(\pi_{\alpha}\gamma_{\alpha}^{\ell},\pi_{\hat{\alpha}}\gamma_{\hat{\alpha}}^{\ell})=\pi_{\alpha}\varphi_{\alpha}^{\ell}(\gamma_{\alpha}^{\ell},\gamma_{\hat{\alpha}}^{\ell})

(20)

i.e. if $\varphi_{\alpha}^{\ell}(\gamma_{\alpha}^{\ell},\gamma_{\hat{\alpha}}^{\ell})$ is equivariant with respect to permutations on $\gamma_{\alpha}^{\ell}$ and invariant with respect to permutations on $\gamma_{\hat{\alpha}}^{\ell}$ .

Proof: See Appendix J.
The Split Subspace Layer therefore depends on implementation of the network $\varphi_{\alpha}^{\ell}$ so that it satisfies Equation (20). We show in Appendix K how $\varphi_{\alpha}^{\ell}$ can be implemented with a combination of multihead attention, fully connected layers, and linear projections. We specify a more general version of the Split Subspace Layer in Appendix L.
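
The forward map of Equation (18) and the inverse of Equation (19) can be sketched as follows for $D=3$ and $D_{\beta}=1$ . Several pieces are assumptions made for illustration: the orthogonal matrix is taken to be a rotation about the $z$ -axis, the same hand-written function `phi` stands in for the networks $\varphi_{u}^{\ell}$ and $\varphi_{d}^{\ell}$ (it is not the attention-based architecture of Appendix K), and the helper names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def make_basis(theta):
    """Split the columns of a z-axis rotation into beta (1 column) and xi (2 columns)."""
    c, s = math.cos(theta), math.sin(theta)
    beta = (c, s, 0.0)
    xi = ((-s, c, 0.0), (0.0, 0.0, 1.0))
    return beta, xi

def phi(gamma_same, gamma_other):
    """Toy stand-in for the network: equivariant in gamma_same, invariant in gamma_other.

    Returns one vector in R^(D - D_beta) = R^2 per same-spin electron.
    """
    m_same = sum(gamma_same) / len(gamma_same)
    m_other = sum(gamma_other) / len(gamma_other)
    return [(math.tanh(g - m_same), math.sin(g + m_other)) for g in gamma_same]

def shift(r, xi, coeffs, sign):
    """Return r + sign * xi @ coeffs, a displacement inside the xi subspace."""
    return tuple(r[d] + sign * (xi[0][d] * coeffs[0] + xi[1][d] * coeffs[1])
                 for d in range(3))

def split_subspace_layer(r_u, r_d, basis_u, basis_d, sign=1.0):
    """sign=+1 gives Equation (18); sign=-1 gives the inverse, Equation (19)."""
    (beta_u, xi_u), (beta_d, xi_d) = basis_u, basis_d
    gamma_u = [dot(beta_u, r) for r in r_u]   # Equation (17), spin up
    gamma_d = [dot(beta_d, r) for r in r_d]   # Equation (17), spin down
    new_u = [shift(r, xi_u, c, sign) for r, c in zip(r_u, phi(gamma_u, gamma_d))]
    new_d = [shift(r, xi_d, c, sign) for r, c in zip(r_d, phi(gamma_d, gamma_u))]
    return new_u, new_d

r_u = [(0.2, -0.5, 1.0), (1.1, 0.3, -0.7)]
r_d = [(-0.9, 0.4, 0.1)]
basis_u, basis_d = make_basis(0.3), make_basis(1.2)

fwd_u, fwd_d = split_subspace_layer(r_u, r_d, basis_u, basis_d, sign=1.0)
back_u, back_d = split_subspace_layer(fwd_u, fwd_d, basis_u, basis_d, sign=-1.0)
roundtrip_error = max(abs(a - b)
                      for p, q in zip(r_u + r_d, back_u + back_d)
                      for a, b in zip(p, q))
print(roundtrip_error)
```

The inverse works because the displacement lies in the span of $\xi_{\alpha}^{\ell}$ , which is orthogonal to $\beta_{\alpha}^{\ell}$ ; the projections $\gamma$ are therefore the same before and after the layer, and the inverse can recompute exactly the shift that was applied.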

3.6 Training via SGD

Log Domain: Density In order to avoid numerical issues, it is best to operate in the log domain. Suppose that

\psi(r)=e^{q(r)+iw(r)}\quad\Leftrightarrow\quad q(r)=\tfrac{1}{2}\log\rho(r)\quad\text{and}\quad w(r)=\text{atan2}\left(\kappa_{i}(r),\kappa_{r}(r)\right)

(21)

where $\kappa_{r}(r)$ and $\kappa_{i}(r)$ are the real and imaginary parts of the phase $\kappa(r)$ , respectively; and atan2 is the “full” arctangent.

The log-density $q(r;\theta)$ may be computed for both continuous and discrete normalizing flows, where we now introduce the parameters $\theta$ of the network explicitly. Consider a sample $z$ chosen from the base density $\rho_{z}(z)$ , and in analogy to $q(r)$ , define $q_{z}(z)=\tfrac{1}{2}\log\rho_{z}(z)$ . Now, in the case of a continuous normalizing flow, let $v(t)$ satisfy Equation (15); then $q(r;\theta)$ can be computed (Chen et al., 2018) by solving the ODE

\frac{da}{dt}=-\tfrac{1}{2}\text{Trace}\left(\frac{\partial\Gamma_{t}}{\partial v}(v(t);\theta)\right)\quad\text{with}\quad a(0)=q_{z}(z)\quad\text{and}\quad q(r;\theta)=a(1)

(22)

which is the continuous analogue of the change of variables formula. In the case of a discrete normalizing flow, fix the following notation: $r^{0}=z$ , $r=r^{L+1}$ , and $T=T^{L}\circ\dots\circ T^{0}$ . Then we may use a logarithmic version of the standard change of variables formula (9):

q(r;\theta)=q_{z}\left(T^{-1}(r;\theta)\right)+\frac{1}{2}\sum_{\ell=0}^{L}\log\left|\det J_{(T^{\ell})^{-1}}(r^{\ell+1};\theta)\right|

(23)
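
For the continuous case, the sketch below integrates the flow ODE of Equation (15) together with the instantaneous change of variables for $\log\rho$ from (Chen et al., 2018), and halves the result at the end to obtain $q=\tfrac{1}{2}\log\rho$ . It is a one-dimensional toy under stated assumptions: the vector field is the linear map $\Gamma_{t}(v)=cv$ (so that the trace reduces to the scalar $c$ and the answer is known in closed form), the base density is a standard normal rather than a DPP, and a fixed-step fourth-order Runge-Kutta scheme is used in place of an adaptive solver.

```python
import math

def flow_log_density(z, log_rho_z, field, field_dv, steps=100):
    """Integrate dv/dt = field(t, v) and d(log rho)/dt = -field_dv(t, v) from t=0 to 1.

    Returns (r, log rho(r)). In one dimension the Jacobian trace is just
    the scalar derivative field_dv.
    """
    v, a = z, log_rho_z
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        # Classical RK4 stages for the joint state (v, a).
        k1v, k1a = field(t, v), -field_dv(t, v)
        v2 = v + 0.5 * dt * k1v
        k2v, k2a = field(t + 0.5 * dt, v2), -field_dv(t + 0.5 * dt, v2)
        v3 = v + 0.5 * dt * k2v
        k3v, k3a = field(t + 0.5 * dt, v3), -field_dv(t + 0.5 * dt, v3)
        v4 = v + dt * k3v
        k4v, k4a = field(t + dt, v4), -field_dv(t + dt, v4)
        v += dt / 6.0 * (k1v + 2.0 * k2v + 2.0 * k3v + k4v)
        a += dt / 6.0 * (k1a + 2.0 * k2a + 2.0 * k3a + k4a)
    return v, a

c = 0.7                                   # hypothetical field strength
z0 = 0.4                                  # a fixed "sample" from the base density
log_rho_z0 = -0.5 * z0 * z0 - 0.5 * math.log(2.0 * math.pi)

r1, log_rho_r1 = flow_log_density(z0, log_rho_z0, lambda t, v: c * v, lambda t, v: c)
q_r1 = 0.5 * log_rho_r1                   # q = (1/2) log rho, as in Equation (21)
print(r1, q_r1)
```

For this linear field the exact solution is $r=ze^{c}$ and $\log\rho(r)=\log\rho_{z}(z)-c$ , which gives a direct check on the integrator.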

Log Domain: Gradient of the Objective Recall that our goal in finding an approximation to the ground state wavefunction is to solve the optimization problem in Equation (4). Using Equation (6) and noting that $\langle\psi(\cdot;\theta)|\psi(\cdot;\theta)\rangle=1$ since $\rho(\cdot;\theta)$ is normalized, we may write the objective function to be minimized as

\mathcal{L}(\theta)=\langle\psi(\cdot;\theta)|H|\psi(\cdot;\theta)\rangle\,=\,\mathbb{E}_{r\sim\rho(\cdot;\theta)}\left[\mathcal{E}_{r}(r;\theta)\right]\,\approx\,\frac{1}{K}\sum_{k=1}^{K}\mathcal{E}_{r}\left(r^{(k)};\theta\right)

(24)

with samples $r^{(k)}\sim\rho(\cdot;\theta)$ . Then we have the following theorem, which shows that the local energy can be written entirely as a function of $q(r;\theta)$ and the potential $V(r)$ , so that the phase $w(r;\theta)$ does not appear; and furthermore gives the gradient of the objective function $\mathcal{L}(\theta)$ .

Theorem 8.

The local energy can be written as

\mathcal{E}_{r}(r;\theta)=-\tfrac{1}{2}\Delta_{r}q(r;\theta)-\tfrac{1}{2}\|\nabla_{r}q(r;\theta)\|^{2}+V(r)

(25)

In particular, the local energy is independent of the phase $w(r;\theta)$ . Furthermore, let

\Omega(r;\theta)=\nabla_{\theta}\mathcal{E}_{r}(r;\theta)+2\mathcal{E}_{r}(r;\theta)\nabla_{\theta}q(r;\theta)

(26)

Then the gradient of the loss function may be written as

\nabla_{\theta}\mathcal{L}(\theta)=\mathbb{E}_{r\sim\rho(\cdot;\theta)}\left[\Omega(r;\theta)\right]\approx\frac{1}{K}\sum_{k=1}^{K}\Omega\left(r^{(k)};\theta\right)

(27)

with samples $r^{(k)}\sim\rho(\cdot;\theta)$ .

Proof: See Appendix M.
Thus, in order to optimize the objective in Equation (24), we may use gradient descent using the estimate for the gradient in Equation (27). A detailed version of the optimization routine is given in Appendix N.
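
The gradient estimator of Equations (26)-(27) can be exercised on a toy problem with a single scalar parameter. The sketch below assumes a one-electron hydrogen atom with normalized density $\rho(r;\alpha)=(\alpha^{3}/\pi)e^{-2\alpha\|r\|}$ , so that $q=\tfrac{1}{2}\log\rho$ , Equation (25) gives $\mathcal{E}_{r}=-\alpha^{2}/2+(\alpha-1)/\|r\|$ , and both $\nabla_{\alpha}\mathcal{E}_{r}$ and $\nabla_{\alpha}q$ are available in closed form. This setting, the function names, and the step size, batch size, and iteration count are all assumptions for illustration; the full optimization routine is the one given in Appendix N.

```python
import random

def sample_radius(alpha, rng):
    """Direct (non-iterative) sample of |r| under rho(.; alpha)."""
    return rng.gammavariate(3.0, 1.0 / (2.0 * alpha))

def omega(radius, alpha):
    """Per-sample gradient term from Equation (26) for the toy density."""
    energy = -0.5 * alpha ** 2 + (alpha - 1.0) / radius   # Equation (25)
    d_energy = -alpha + 1.0 / radius                      # d E_r / d alpha
    d_q = 1.5 / alpha - radius                            # d q / d alpha
    return d_energy + 2.0 * energy * d_q

def train(alpha, steps, batch, learning_rate, rng):
    """Plain stochastic gradient descent using the estimate in Equation (27)."""
    for _ in range(steps):
        grad = sum(omega(sample_radius(alpha, rng), alpha) for _ in range(batch)) / batch
        alpha -= learning_rate * grad
    return alpha

rng = random.Random(0)
alpha_final = train(alpha=0.6, steps=100, batch=2000, learning_rate=0.2, rng=rng)
print(alpha_final)
```

For this toy problem the exact objective is $\mathcal{L}(\alpha)=\alpha^{2}/2-\alpha$ , whose minimizer is $\alpha=1$ , so the iterates should settle near that value up to sampling noise.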

4 Further Details: Phase, Cusps, and Induction

4.1 The Phase

Since the Hamiltonian is time-reversal invariant and Hermitian, its eigenvalues are real and its eigenfunctions can be chosen to be real. Since the ground-state wavefunction we are looking for can thus be taken to be real, the phase can be taken to belong to the two element set $\{0,\pi\}$ . Given that we now know how to solve for an approximation to the density $\rho_{0}(r)$ corresponding to the ground state wavefunction, we now show one way of assigning the phase so that the resulting ground state wavefunction $\psi_{0}(r)$ is appropriately antisymmetric.

Theorem 9.

Let $\rho_{0}(r)$ be the density for the ground state wavefunction. Let $\prec$ be a strict total order on $\mathbb{R}^{D}$ , and define the set

\mathcal{R}=\{r=(r_{1},\dots r_{n}):r_{1}\prec r_{2}\prec\dots\prec r_{n_{u}}% \,\,\text{ and }\,\,r_{n_{u}+1}\prec r_{n_{u}+2}\prec\dots\prec r_{n}\}

(28)

For any $r$ with $r_{i}\neq r_{j}$ for all $i\neq j$ , define the permutation ${\pi_{\prec}(r)}\in\mathbb{G}$ by the requirement ${\pi_{\prec}(r)}r\in\mathcal{R}$ . Then a valid antisymmetric ground state wavefunction is given by

\psi_{0}(r)=\begin{cases}(-1)^{\pi_{\prec}(r)}\sqrt{\rho_{0}(r)}&\text{if }r_{% i}\neq r_{j}\,\,\forall i,j\\ 0&\text{otherwise}\end{cases}

(29)

Proof: See Appendix O.
Thus, given the density $\rho_{0}$ , we can use Theorem 9 to easily compute the ground state wavefunction $\psi_{0}$ . A question remains: what is the strict total order $\prec$ ? Any choice is valid, but the simplest thing to do is to use lexicographic ordering on the coordinates of the two points in $\mathbb{R}^{D}$ that are being compared.
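The phase assignment of Theorem 9 with lexicographic ordering can be sketched as follows. This is a minimal illustration under stated assumptions: electrons are stored as tuples with the spin-up block first, `rho0` is any user-supplied density, and the names `permutation_sign`, `phase_sign` and `psi0` are hypothetical.

```python
import math

def permutation_sign(perm):
    # Sign of a permutation given as a list of indices, via cycle counting.
    seen = [False] * len(perm)
    sign = 1
    for start in range(len(perm)):
        if seen[start]:
            continue
        length = 0
        j = start
        while not seen[j]:
            seen[j] = True
            j = perm[j]
            length += 1
        if length % 2 == 0:  # a cycle of even length is an odd permutation
            sign = -sign
    return sign

def phase_sign(r, n_up):
    # r: list of n electron positions (tuples of D floats); the first n_up are spin-up.
    # Returns 0 if any two electrons coincide, following Equation (29) as stated.
    if len(set(r)) < len(r):
        return 0
    sign = 1
    for block in (range(0, n_up), range(n_up, len(r))):
        idx = list(block)
        # Python compares tuples lexicographically, which is the order suggested above.
        order = sorted(range(len(idx)), key=lambda k: r[idx[k]])
        sign *= permutation_sign(order)
    return sign

def psi0(r, n_up, rho0):
    # Equation (29): signed square root of the density.
    return phase_sign(r, n_up) * math.sqrt(rho0(r))
```

A permutation and its inverse have the same sign, so it does not matter whether the sorting permutation or its inverse is used. Swapping two electrons within the same spin block flips the returned sign, which is the antisymmetry required of $\psi_{0}$.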

4.2 Incorporating Cusps

Electron-Electron Cusps Wavefunctions are known to have certain non-smooth properties, known as cusps. In particular, the gradient of the wavefunction should exhibit a discontinuity when two electrons coincide. One way to incorporate such gradient discontinuities is via the introduction of terms which depend on the distance between electrons (Pfau et al., 2020); as the distance is itself a continuous but non-smooth function of the electron positions, using distances can allow us to model such cusps. In the case of the discrete normalizing flow, our goal will be to design a layer which incorporates the inter-electron distances directly. Given the requirements of a normalizing flow, the challenge is to enforce invertibility for such a layer. We have the following result:

Theorem 10.

Let the set of distances be given by $\delta^{\ell}=\left\{\delta_{ij}^{\ell}\right\}_{i<j}$ where $\delta_{ij}^{\ell}=\|r_{i}^{\ell}-r_{j}^{\ell}\|$ . Given a layer of the form

r_{i}^{\ell+1}=\Theta^{\ell}(\delta^{\ell};\theta)\,r_{i}^{\ell}+t^{\ell}(% \delta^{\ell};\theta)\qquad\text{with }\Theta^{\ell}(\delta^{\ell};\theta)\in O% (D)\text{ and }t^{\ell}(\delta^{\ell};\theta)\in\mathbb{R}^{D}

(30)

Then the layer is both $\mathbb{G}$ -equivariant and invertible.

Proof: See Appendix P.
The essence of this layer is to rotate all electrons in a given configuration $r=(r_{1},\dots,r_{n})$ by the same rotation matrix $\Theta^{\ell}$ and to translate them by the same vector $t^{\ell}$ ; the rotation matrix and translation vector are both functions of the configuration $r$ entirely through the distances $\delta^{\ell}$ . The latter fact is crucial, as it means that different configurations $r$ are treated differently, which gives the layer expressivity. An implementation of this layer based on a Deep Set architecture (Zaheer et al., 2017) is given in Appendix Q.
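To make the invertibility argument concrete, here is a minimal NumPy sketch of a layer of the form in Equation (30). It is not the Deep Set implementation of Appendix Q: the pooled features, the Cayley parameterization of the orthogonal matrix, and the weight matrices `W_rot` and `W_trans` are all assumptions chosen for brevity.

```python
import numpy as np

def rigid_params(r, W_rot, W_trans):
    # Pairwise distances delta_ij for i < j.
    n, D = r.shape
    iu = np.triu_indices(n, k=1)
    delta = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)[iu]
    # Pooling over pairs makes the features invariant to any relabeling of electrons.
    feats = np.array([delta.sum(), (delta ** 2).sum(), np.exp(-delta).sum()]) / len(delta)
    # Cayley transform of a skew-symmetric matrix gives an orthogonal matrix
    # (only determinant +1 elements, i.e. a subset of O(D)).
    A = np.zeros((D, D))
    A[np.triu_indices(D, k=1)] = W_rot @ feats
    A = A - A.T
    Theta = np.linalg.solve(np.eye(D) + A, np.eye(D) - A)
    t = W_trans @ feats
    return Theta, t

def forward(r, W_rot, W_trans):
    Theta, t = rigid_params(r, W_rot, W_trans)
    return r @ Theta.T + t  # row i is Theta r_i + t, as in Equation (30)

def inverse(r_next, W_rot, W_trans):
    # A rigid motion preserves distances, so Theta and t can be recomputed
    # from the output configuration and then undone.
    Theta, t = rigid_params(r_next, W_rot, W_trans)
    return (r_next - t) @ Theta

rng = np.random.default_rng(0)
n, D = 5, 3
W_rot = rng.normal(size=(D * (D - 1) // 2, 3))
W_trans = rng.normal(size=(D, 3))
r = rng.normal(size=(n, D))
r_next = forward(r, W_rot, W_trans)
```

The inverse never needs to store $\Theta^{\ell}$ or $t^{\ell}$: because the output distances equal the input distances, both can be recovered from $r^{\ell+1}$ alone.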

It is also known that the gradient of the wavefunction should exhibit a discontinuity when an electron and nucleus coincide. The treatment is similar, and is given in Appendix R.

4.3 Induction Across Multiple Molecules

In an effort to accelerate the ground state computation, we may try to learn the ground state wavefunctions and energies for an entire class of molecules simultaneously, as in (Gao and Günnemann, 2023; Scherbela et al., 2024; Gerard et al., 2024). In particular, the molecular parameters are given by $R=(R_{1},\dots,R_{N})$ , the nuclear positions; and $Z=(Z_{1},\dots,Z_{N})$ , the atomic numbers of each nucleus. Then our goal is to learn a function of the form $\psi_{0}(x;R,Z)$ , i.e. a ground state wavefunction which is explicitly parameterized by the molecular parameters. This entails computing the density $\rho(r;R,Z)$ . However, this latter task is made more complicated by the fact that two new invariances are required:

	$\displaystyle\rho(r;\pi R,\pi Z)=\rho(r;R,Z)\quad\text{for }\pi\in\mathbb{S}_{% N}\quad\text{(nuclear permutation invariance)}$		(31)
	$\displaystyle\rho(\tau r;\tau R,Z)=\rho(r;R,Z)\quad\text{for }\tau\in E(D)% \quad\text{(joint rigid motion invariance)}$		(32)

We henceforth assume that the nuclei have their center of mass at the origin, i.e. $\bar{R}=\frac{1}{N}\sum_{I=1}^{N}R_{I}=0$ ; this removes the need to deal with translations, which generally require special (and uninteresting) treatment, e.g. see (Satorras et al., 2021). Thus, Equation (32) becomes

\rho(\Theta r;\Theta R,Z)=\rho(r;R,Z)\quad\text{for }\Theta\in O(D)\quad\text{% (joint rotation invariance)}

(33)

We now show that densities satisfying Equations (31) and (33) can be realized via a variation of the continuous normalizing flow we have introduced in Section 3.4:

Theorem 11.

Let $\bar{R}=\frac{1}{N}\sum_{I=1}^{N}R_{I}=0$ . Consider a continuous normalizing flow of the form $dv/dt=\Gamma_{t}(v;R,Z)$ with $v(0)=z\sim\rho_{z}(\cdot)$ and $r=v(1)$ . Let the function $\Gamma_{t}$ be invariant with respect to nuclear permutations and equivariant with respect to joint rotations, i.e. for all $t$

\Gamma_{t}(v;\pi R,\pi Z)=\Gamma_{t}(v;R,Z)\,\,\forall\pi\in\mathbb{S}_{N}% \hskip 21.33955pt\Gamma_{t}(\Theta v;\Theta R,Z)=\Theta\Gamma_{t}(v;R,Z)\,\,% \forall\Theta\in O(D)

(34)

Furthermore, suppose that the base density is invariant with respect to rotations, $\rho_{z}(\Theta z)=\rho_{z}(z)$ for $\Theta\in O(D)$ . Then the resulting density $\rho(r;R,Z)$ satisfies Equations (31) and (33).

Proof: See Appendix S.
First, we note that the base density in Equation (11) can be made invariant to rotations by constructing the relevant Projection DPP from a kernel function $K(y,y^{\prime})=H(y)^{T}H(y^{\prime})$ , where the functions $h_{i}(y)$ are obtained by taking arbitrary rotationally-invariant functions $\tilde{h}_{i}(y)$ and orthogonalizing them with Gram-Schmidt; e.g. one may use Gaussians of varying bandwidths, $\tilde{h}_{i}(y)=e^{-\|y\|^{2}/\sigma_{i}^{2}}$ .
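The construction just described can be sketched numerically. The example below assumes $D=3$ and three illustrative bandwidths, and it replaces explicit Gram-Schmidt with a Cholesky factorization of the Gram matrix, which yields the same orthonormal functions; the names `H` and `dpp_density` are hypothetical.

```python
import math
import numpy as np

D = 3
sigmas = np.array([1.0, 2.0, 4.0])  # illustrative bandwidths (an assumption)
a = 1.0 / sigmas ** 2

# Gram matrix of the raw Gaussians under the L2 inner product on R^D:
# the integral of exp(-(a_i + a_j) |y|^2) dy equals (pi / (a_i + a_j))^(D/2).
gram = (np.pi / (a[:, None] + a[None, :])) ** (D / 2)
L = np.linalg.cholesky(gram)  # gram = L L^T

def H(y):
    # Orthonormalized features h(y) = L^{-1} h_tilde(y); this agrees with
    # Gram-Schmidt applied to the raw Gaussians in the given order.
    h_tilde = np.exp(-a * np.dot(y, y))
    return np.linalg.solve(L, h_tilde)

def dpp_density(points):
    # Projection DPP density det[K(y_i, y_j)] / n!, with K(y, y') = H(y)^T H(y').
    feats = np.stack([H(y) for y in points])
    K = feats @ feats.T
    return np.linalg.det(K) / math.factorial(len(points))
```

Because each $\tilde{h}_{i}$ depends on $y$ only through $\|y\|$, applying the same orthogonal matrix to every point leaves `dpp_density` unchanged, which is the rotation invariance required of the base density.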

Now, we turn to the construction of $\Gamma_{t}$ . Recall from Theorem 6 that $\Gamma_{t}(\cdot;R,Z)$ must be $\mathbb{G}$ -equivariant for all $t$ . Furthermore, we have already noted that $\mathbb{G}$ -equivariant functions may be constructed using a combination of standard pieces: multihead attention, fully connected layers, and linear projections. It would be nice if we were able to use this result while also incorporating the extra conditions in Equation (34). We now show that this is possible:

Theorem 12.

Let $\phi_{t}(v;R,Z)$ be a function which is $\mathbb{G}$ -equivariant with respect to $v$ , i.e. $\phi_{t}(gv;R,Z)=g\phi_{t}(v;R,Z)$ for $g\in\mathbb{G}$ . Let $\omega_{t}(v;R,Z)$ be a function whose output is itself a rotation, i.e. $\omega_{t}(v;R,Z)\in O(D)$ . Let $\omega_{t}$ be $\mathbb{G}$ -invariant with respect to $v$ , and $O(D)$ -equivariant jointly with respect to $v$ and $R$ , i.e. $\omega_{t}(\Theta v;\Theta R,Z)=\Theta\omega_{t}(v;R,Z)$ . Finally, let both $\phi_{t}$ and $\omega_{t}$ be permutation-invariant jointly with respect to $R$ and $Z$ , i.e. $\phi_{t}(v;\pi R,\pi Z)=\phi_{t}(v;R,Z)$ and likewise for $\omega_{t}$ . Then the function

\Gamma_{t}(v;R,Z)=\zeta\phi_{t}(\zeta^{-1}v;\zeta^{-1}R,Z)\qquad\text{where}% \qquad\zeta=\omega_{t}(v;R,Z)

(35)

satisfies the properties in Equation (34) and is $\mathbb{G}$ -equivariant with respect to $v$ .

Proof: See Appendix T.
We can use the previously mentioned recipe in Appendix K in order to construct a $\mathbb{G}$ -equivariant $\phi_{t}$ , with an extra path in the network for the $R,Z$ dependence, based on either Deep Set or a Transformer architecture with pooling to gain the requisite invariance. The function $\omega_{t}$ can be constructed by using an $E(D)$ Equivariant Graph Neural Network (Satorras et al., 2021) whose output is a rotation matrix, similar to what is done in (Kaba et al., 2023). More detailed information is contained in Appendix U.
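The canonicalization in Equation (35) can be illustrated with a small NumPy sketch. Everything here is a hand-written stand-in rather than the networks of Appendices K and U: `omega` orthonormalizes a few equivariant vectors with a sign-fixed QR factorization instead of using an Equivariant Graph Neural Network, `phi` is a toy $\mathbb{G}$-equivariant map that is deliberately not rotation-equivariant, and the time dependence is omitted.

```python
import numpy as np

def omega(v, R, Z):
    # Build D vectors that rotate with the inputs: invariant weights (norms, charges)
    # times positions, summed over electrons and nuclei for permutation invariance.
    D = v.shape[1]
    cols = []
    for k in range(D):
        w_v = np.sum(v ** 2, axis=1) ** k
        w_R = Z * np.sum(R ** 2, axis=1) ** k
        cols.append(w_v @ v + w_R @ R)
    U = np.stack(cols, axis=1)  # columns transform as Theta @ U
    Q, T = np.linalg.qr(U)
    # With a positive diagonal on the triangular factor the QR factorization is
    # unique, which gives omega(Theta v, Theta R, Z) = Theta omega(v, R, Z).
    return Q * np.sign(np.diag(T))

def phi(v, R, Z, n_up):
    # Toy G-equivariant map: each electron sees its own position plus pooled
    # summaries of its own spin block, the other block, and the nuclei.
    nuc = (Z[:, None] * np.tanh(R)).sum(axis=0)
    up, down = v[:n_up], v[n_up:]
    out_up = np.tanh(up) + 0.5 * up.mean(axis=0) - 0.3 * down.mean(axis=0) + 0.1 * nuc
    out_down = np.tanh(down) + 0.5 * down.mean(axis=0) - 0.3 * up.mean(axis=0) + 0.1 * nuc
    return np.concatenate([out_up, out_down])

def gamma(v, R, Z, n_up):
    # Equation (35): rotate into the canonical frame, apply phi, rotate back.
    zeta = omega(v, R, Z)
    v_c, R_c = v @ zeta, R @ zeta  # rows are zeta^{-1} v_i and zeta^{-1} R_I
    return phi(v_c, R_c, Z, n_up) @ zeta.T

rng = np.random.default_rng(1)
v = rng.normal(size=(4, 3))
R = rng.normal(size=(3, 3))
R -= R.mean(axis=0)  # nuclei centered at the origin
Z = np.array([1.0, 6.0, 8.0])
out = gamma(v, R, Z, n_up=2)
```

The sketch assumes the matrix of equivariant vectors has full rank, which holds for generic configurations; a learned $\omega_{t}$ would need its own safeguard for degenerate inputs.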

5 Concluding Remarks, Limitations, and Future Work

We have demonstrated a theoretical framework for efficiently solving the Electronic Schrödinger Equation using normalizing flows. Using these flows allows us to sample efficiently from the wavefunction, thereby side-stepping the need for time-consuming MCMC approaches to sampling. The framework’s construction does not easily admit extensions to either diffusion models (Yang et al., 2023) or flow-matching (Lipman et al., 2022), both of which are very powerful and useful techniques. Future work will focus on adapting the framework to accommodate one or both of these methods.

References

Austin et al. (2012) Brian M Austin, Dmitry Yu Zubarev, and William A Lester Jr. Quantum monte carlo and related approaches. Chemical reviews, 112(1):263–288, 2012.
Carleo and Troyer (2017) Giuseppe Carleo and Matthias Troyer. Solving the quantum many-body problem with artificial neural networks. Science, 355(6325):602–606, 2017.
Cassella et al. (2023) Gino Cassella, Halvard Sutterud, Sam Azadi, ND Drummond, David Pfau, James S Spencer, and W Matthew C Foulkes. Discovering quantum phase transitions with fermionic neural networks. Physical Review Letters, 130(3):036401, 2023.
Ceperley and Alder (1986) David Ceperley and Berni Alder. Quantum monte carlo. Science, 231(4738):555–560, 1986.
Chen et al. (2018) Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. Advances in neural information processing systems, 31, 2018.
Deng et al. (2017) Dong-Ling Deng, Xiaopeng Li, and S Das Sarma. Quantum entanglement in neural network states. Physical Review X, 7(2):021021, 2017.
Entwistle et al. (2023) Michael T Entwistle, Zeno Schätzle, Paolo A Erdman, Jan Hermann, and Frank Noé. Electronic excited states in deep variational monte carlo. Nature Communications, 14(1):274, 2023.
Foulkes et al. (2001) William MC Foulkes, Lubos Mitas, RJ Needs, and Guna Rajagopal. Quantum monte carlo simulations of solids. Reviews of Modern Physics, 73(1):33, 2001.
Gao and Günnemann (2023) Nicholas Gao and Stephan Günnemann. Generalizing neural wave functions. In International Conference on Machine Learning, pages 10708–10726. PMLR, 2023.
Gao and Duan (2017) Xun Gao and Lu-Ming Duan. Efficient representation of quantum many-body states with deep neural networks. Nature communications, 8(1):662, 2017.
Gautier et al. (2019) Guillaume Gautier, Rémi Bardenet, and Michal Valko. On two ways to use determinantal point processes for monte carlo integration. Advances in Neural Information Processing Systems, 32, 2019.
Gerard et al. (2022) Leon Gerard, Michael Scherbela, Philipp Marquetand, and Philipp Grohs. Gold-standard solutions to the schrödinger equation using deep learning: How much physics do we need? Advances in Neural Information Processing Systems, 35:10282–10294, 2022.
Gerard et al. (2024) Leon Gerard, Michael Scherbela, Halvard Sutterud, Matthew Foulkes, and Philipp Grohs. Transferable neural wavefunctions for solids. arXiv preprint arXiv:2405.07599, 2024.
Gubernatis et al. (2016) James Gubernatis, Naoki Kawashima, and Philipp Werner. Quantum Monte Carlo Methods. Cambridge University Press, 2016.
Han et al. (2019) Jiequn Han, Linfeng Zhang, and E Weinan. Solving many-electron schrödinger equation using deep neural networks. Journal of Computational Physics, 399:108929, 2019.
Hermann et al. (2020) Jan Hermann, Zeno Schätzle, and Frank Noé. Deep-neural-network solution of the electronic schrödinger equation. Nature Chemistry, 12(10):891–897, 2020.
Johansson (2006) Kurt Johansson. Random matrices and determinantal processes. In Les Houches, volume 83, pages 1–56. Elsevier, 2006.
Kaba et al. (2023) Sékou-Oumar Kaba, Arnab Kumar Mondal, Yan Zhang, Yoshua Bengio, and Siamak Ravanbakhsh. Equivariance with learned canonicalization functions. In International Conference on Machine Learning, pages 15546–15566. PMLR, 2023.
Köhler et al. (2020) Jonas Köhler, Leon Klein, and Frank Noé. Equivariant flows: exact likelihood generative learning for symmetric densities. In International conference on machine learning, pages 5361–5370. PMLR, 2020.
Lavancier et al. (2015) Frédéric Lavancier, Jesper Møller, and Ege Rubak. Determinantal point process models and statistical inference. Journal of the Royal Statistical Society Series B: Statistical Methodology, 77(4):853–877, 2015.
Levine et al. (2019) Yoav Levine, Or Sharir, Nadav Cohen, and Amnon Shashua. Quantum entanglement in deep learning architectures. Physical review letters, 122(6):065301, 2019.
Li et al. (2022) Xiang Li, Zhe Li, and Ji Chen. Ab initio calculation of real solids via neural network ansatz. Nature Communications, 13(1):7895, 2022.
Lipman et al. (2022) Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2022.
Naito et al. (2023) Tomoya Naito, Hisashi Naito, Koji Hashimoto, et al. Multi-body wave function of ground and low-lying excited states using unornamented deep neural networks. Physical Review Research, 5(3):033189, 2023.
Needs et al. (2009) Richard J Needs, Michael D Towler, Neil D Drummond, and P López Ríos. Continuum variational and diffusion quantum monte carlo calculations. Journal of Physics: Condensed Matter, 22(2):023201, 2009.
Passetti et al. (2023) Giacomo Passetti, Damian Hofmann, Pit Neitemeier, Lukas Grunwald, Michael A Sentef, and Dante M Kennes. Can neural quantum states learn volume-law ground states? Physical Review Letters, 131(3):036502, 2023.
Pescia et al. (2022) Gabriel Pescia, Jiequn Han, Alessandro Lovato, Jianfeng Lu, and Giuseppe Carleo. Neural-network quantum states for periodic systems in continuous space. Physical Review Research, 4(2):023138, 2022.
Pfau et al. (2020) David Pfau, James S Spencer, Alexander GDG Matthews, and W Matthew C Foulkes. Ab initio solution of the many-electron schrödinger equation with deep neural networks. Physical Review Research, 2(3):033429, 2020.
Pfau et al. (2023) David Pfau, Simon Axelrod, Halvard Sutterud, Ingrid von Glehn, and James S Spencer. Natural quantum monte carlo computation of excited states. arXiv preprint arXiv:2308.16848, 2023.
Ren et al. (2023) Weiluo Ren, Weizhong Fu, Xiaojie Wu, and Ji Chen. Towards the ground state of molecules via diffusion monte carlo on neural networks. Nature Communications, 14(1):1860, 2023.
Saleh et al. (2023) Yahya Saleh, Álvaro Fernández Corral, Armin Iske, Jochen Küpper, and Andrey Yachmenev. Computing excited states of molecules using normalizing flows. arXiv preprint arXiv:2308.16468, 2023.
Satorras et al. (2021) Víctor Garcia Satorras, Emiel Hoogeboom, and Max Welling. E(n) equivariant graph neural networks. In International conference on machine learning, pages 9323–9332. PMLR, 2021.
Schätzle et al. (2021) Zeno Schätzle, Jan Hermann, and Frank Noé. Convergence to the fixed-node limit in deep variational monte carlo. The Journal of Chemical Physics, 154(12), 2021.
Scherbela et al. (2024) Michael Scherbela, Leon Gerard, and Philipp Grohs. Towards a transferable fermionic neural wavefunction for molecules. Nature Communications, 15(1):120, 2024.
Sharir et al. (2022) Or Sharir, Amnon Shashua, and Giuseppe Carleo. Neural tensor contractions and the expressive power of deep neural quantum states. Physical Review B, 106(20):205136, 2022.
Spencer et al. (2020) James S Spencer, David Pfau, Aleksandar Botev, and W Matthew C Foulkes. Better, faster fermionic neural networks. arXiv preprint arXiv:2011.07125, 2020.
Thiede et al. (2022) Luca Thiede, Chong Sun, and Alán Aspuru-Guzik. Waveflow: Enforcing boundary conditions in smooth normalizing flows with application to fermionic wave functions. arXiv preprint arXiv:2211.14839, 2022.
Umrigar et al. (1993) CJ Umrigar, MP Nightingale, and KJ Runge. A diffusion monte carlo algorithm with very small time-step errors. The Journal of chemical physics, 99(4):2865–2890, 1993.
Wilson et al. (2021) Max Wilson, Nicholas Gao, Filip Wudarski, Eleanor Rieffel, and Norm M Tubman. Simulations of state-of-the-art fermionic neural network wave functions with diffusion monte carlo. arXiv preprint arXiv:2103.12570, 2021.
Wilson et al. (2022) Max Wilson, Saverio Moroni, Markus Holzmann, Nicholas Gao, Filip Wudarski, Tejs Vegge, and Arghya Bhowmik. Wave function ansatz (but periodic) networks and the homogeneous electron gas. arXiv preprint arXiv:2202.04622, 2022.
Yang et al. (2023) Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4):1–39, 2023.
Zaheer et al. (2017) Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. Advances in neural information processing systems, 30, 2017.

Appendix A Derivation of Equation (6)

Recall that the local energy is defined as

\mathcal{E}(x)\equiv\frac{H\psi(x)}{\psi(x)}=-\frac{\Delta\psi(x)}{2\psi(x)}+V% (x)

(36)

with

\mathcal{E}_{r}(x)=\text{Real}\{\mathcal{E}(x)\}

(37)

In this case, one may write

$\displaystyle\frac{\langle\psi|H|\psi\rangle}{\langle\psi|\psi\rangle}$	$\displaystyle=\text{Real}\left\{\frac{\langle\psi|H|\psi\rangle}{\langle\psi|\psi\rangle}\right\}$
	$\displaystyle=\text{Real}\left\{\frac{\int\psi^{*}(x)H\psi(x)dx}{\langle\psi|\psi\rangle}\right\}$
	$\displaystyle=\frac{1}{\langle\psi|\psi\rangle}\int\text{Real}\left\{\psi^{*}(x)H\psi(x)\right\}dx$
	$\displaystyle=\frac{1}{\langle\psi|\psi\rangle}\int\text{Real}\left\{\frac{\psi(x)}{\psi(x)}\psi^{*}(x)H\psi(x)\right\}dx$
	$\displaystyle=\frac{1}{\langle\psi|\psi\rangle}\int\text{Real}\left\{|\psi(x)|^{2}\frac{H\psi(x)}{\psi(x)}\right\}dx$
	$\displaystyle=\int\text{Real}\left\{\frac{H\psi(x)}{\psi(x)}\right\}\frac{|\psi(x)|^{2}}{\langle\psi|\psi\rangle}dx$
	$\displaystyle=\int\mathcal{E}_{r}(x)\rho(x)dx$
	$\displaystyle=\mathbb{E}_{x\sim\rho(\cdot)}\left[\mathcal{E}_{r}(x)\right]$
	$\displaystyle\approx\frac{1}{K}\sum_{k=1}^{K}\mathcal{E}_{r}\left(x^{(k)}\right)$	(38)

where the $x^{(k)}$ are sampled from $\rho(\cdot)=|\psi(\cdot)|^{2}/\langle\psi|\psi\rangle$ . Note that in the first line, we have used the fact that $H$ is a symmetric operator so that the quadratic form is real; in the third line, the fact that $\langle\psi|\psi\rangle$ is real; and in the sixth line, the fact that $|\psi(x)|^{2}$ is real.

Appendix B Proof of Theorem 1

Theorem.

Let $\rho(\cdot)$ be a probability density function which we can sample from in constant time. Let $\rho(\cdot)$ satisfy two additional properties:

(D1)

$\rho(x)$ is symmetric: $\rho(\pi x)=\rho(x)$ for all permutations $\pi\in\mathbb{S}_{n}$ .
(D2)

$\rho(x)=0$ if $x_{i}=x_{j}$ for any $i,j$ .

Finally, let $\kappa(x)$ be a complex function which satisfies $|\kappa(x)|=1\,\,\forall x$ , and is nearly antisymmetric:

\kappa(\pi x)=\begin{cases}(-1)^{\pi}\kappa(x)&\text{if }x_{i}\neq x_{j}\text{% for all }i,j\\ \bar{\kappa}&\text{otherwise}\end{cases}

Proof.

Suppose that $\psi(x)=\kappa(x)\sqrt{\rho(x)}$ ; let us prove each of the properties (W1)-(W4).

(W1) The functional form for $\psi(x)$ is just $\kappa(x)\sqrt{\rho(x)}$ , which we know explicitly.

(W2) Antisymmetry of $\psi$ : we break down by cases. Suppose that $x$ is such that $x_{i}\neq x_{j}\text{ for all }i,j$ . Then:

$\displaystyle\psi(\pi x)$	$\displaystyle=\kappa(\pi x)\sqrt{\rho(\pi x)}$
	$\displaystyle=(-1)^{\pi}\kappa(x)\sqrt{\rho(x)}$
	$\displaystyle=(-1)^{\pi}\psi(x)$	(39)

where in the second line we have used the two facts that $\kappa$ is antisymmetric and $\rho$ is symmetric. Now, suppose that $x_{i}=x_{j}$ for some $i,j$ :

$\displaystyle\psi(\pi x)$	$\displaystyle=\kappa(\pi x)\sqrt{\rho(\pi x)}$
	$\displaystyle=\bar{\kappa}\sqrt{\rho(x)}$
	$\displaystyle=0$	(40)

where in the third line, we have used (D2). But this is precisely what is required for an antisymmetric function: if $a(x)$ is antisymmetric and $x_{i}=x_{j}$ for some $i,j$ , then $\pi_{ij}x=x$ , where $\pi_{ij}$ is the permutation which flips $i$ and $j$ , so that

a(\pi_{ij}x)=a(x)\,\Rightarrow\,-a(x)=a(x)\,\Rightarrow\,a(x)=0\,\Rightarrow\,% a(\pi x)=(-1)^{\pi}a(x)=0

(41)

where we have used the fact that $(-1)^{\pi_{ij}}=-1$ , since only one flip is required.

(W3) We can sample in constant time from $\rho(x)=|\psi(x)|^{2}$ by assumption.

(W4) $\psi$ is normalized:

\int|\psi(x)|^{2}dx=\int|\kappa(x)|^{2}\rho(x)dx=\int\rho(x)dx=1

(42)

since $\rho(x)$ is a probability density function.

Thus, we have proved the forward direction.

Now, let us assume properties (W1)-(W4). We can always express a complex number $c$ in terms of a magnitude and a phase; in particular, we may write $c=me^{i\nu}$ , where $m\geq 0$ is a real number, and $\nu\in[0,2\pi)$ . (W1) tells us that we have an explicit form for the complex-valued function $\psi(x)$ ; thus, we know that

\psi(x)=m(x)e^{i\nu(x)}\equiv\sqrt{\rho(x)}\kappa(x)

(43)

with $|\kappa(x)|=1$ , where we have used the fact that $m(x)\geq 0$ . Note that $\rho(x)=|\psi(x)|^{2}$ . As $\rho(x)\geq 0$ , and

\int\rho(x)dx=\int|\psi(x)|^{2}dx=1

(44)

by (W4), $\rho(x)$ is a density. Furthermore, by (W3) $\rho(x)=|\psi(x)|^{2}$ may be sampled in constant time. Finally, by (W2), $\psi(x)$ is antisymmetric; thus,

		$\displaystyle\psi(\pi x)=(-1)^{\pi}\psi(x)\quad\forall\pi,x$
	$\displaystyle\Leftrightarrow\quad$	$\displaystyle\sqrt{\rho(\pi x)}\kappa(\pi x)=(-1)^{\pi}\sqrt{\rho(x)}\kappa(x)% \quad\forall\pi,x$		(45)

In cases where $\pi x=x$ and $(-1)^{\pi}=-1$ , then we have that

	$\displaystyle\sqrt{\rho(\pi x)}\kappa(\pi x)=(-1)^{\pi}\sqrt{\rho(x)}\kappa(x)$
$\displaystyle\Leftrightarrow\quad$	$\displaystyle\sqrt{\rho(x)}\kappa(x)=-\sqrt{\rho(x)}\kappa(x)$
$\displaystyle\Leftrightarrow\quad$	$\displaystyle\rho(x)=0$	(46)

where the third line follows from the fact that $\kappa(x)\neq 0$ since $|\kappa(x)|=1$ . However, note that $\pi x=x$ and $(-1)^{\pi}=-1$ is equivalent to $x_{i}=x_{j}$ for some $i,j$ . Thus, we have established (D2). Furthermore, in such cases we can take $\kappa(x)=\bar{\kappa}$ , since it plays no role. In all other cases, i.e. where $x_{i}\neq x_{j}\text{ for all }i,j$ , we have that

		$\displaystyle\sqrt{\rho(\pi x)}\kappa(\pi x)=(-1)^{\pi}\sqrt{\rho(x)}\kappa(x)% \quad\forall\pi,x\text{ such that }x_{i}\neq x_{j}\text{ for all }i,j$
	$\displaystyle\Leftrightarrow\quad$	$\displaystyle\sqrt{\rho(\pi x)}=\sqrt{\rho(x)}\quad\text{and}\quad\kappa(\pi x% )=(-1)^{\pi}\kappa(x)\quad\forall\pi,x\text{ such that }x_{i}\neq x_{j}\text{ % for all }i,j$		(47)

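where the second line follows by taking the modulus of both sides: since $|\kappa|=1$ , this gives $\sqrt{\rho(\pi x)}=\sqrt{\rho(x)}$ , and dividing by this common value wherever it is nonzero gives the condition on $\kappa$ (where $\rho(x)=0$ , the value of $\kappa$ plays no role). The relation $\kappa(\pi x)=(-1)^{\pi}\kappa(x)$ establishes the remainder of the nearly antisymmetric character of $\kappa$ . Finally,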

\sqrt{\rho(\pi x)}=\sqrt{\rho(x)}\quad\Leftrightarrow\quad\rho(\pi x)=\rho(x)% \quad\forall\pi,x\text{ such that }x_{i}\neq x_{j}\text{ for all }i,j

(48)

since $\rho(x)\geq 0$ . This shows that $\rho(x)$ is symmetric for the case $x_{i}\neq x_{j}\text{ for all }i,j$ . Indeed $\rho(x)$ is symmetric for all $x$ , including those for which $x_{i}=x_{j}$ for some $i,j$ , as in the latter case we have shown that $\rho(x)=0$ . This establishes (D1) and completes the proof. ∎

Appendix C Proof of Theorem 2

Theorem.

(R1)

$\bar{\rho}$ is $\mathbb{G}$ -invariant: $\bar{\rho}(\pi r)=\bar{\rho}(r)\text{ for all }\pi\in\mathbb{G}$
(R2)

$\bar{\rho}(r)=0\text{ if }r_{i}=r_{j},\text{ for }i,j\in\mathcal{N}_{u}\text{ % or }i,j\in\mathcal{N}_{d}$

Proof.

Let us prove the forward direction: assume a density $\rho(x)=\rho(r,s)$ satisfying conditions (D1) and (D2), and we will show that it can be written as $\rho(r,s)=\bar{\rho}(\bar{\pi}_{s}r)$ for $\bar{\rho}$ satisfying conditions (R1) and (R2). Let $\bar{r}=\bar{\pi}_{s}r$ and $\bar{x}=\bar{\pi}_{s}x=(\bar{r},\bar{s})$ . In this case, we have that

\rho(x)=\rho(\bar{\pi}_{s}x)=\rho(\bar{x})

(49)

where the first equality comes from our requirement that $\rho(x)$ satisfy condition (D1), and the second equality from the definition of $\bar{x}$ . As a result, it is sufficient for us to focus on constructing a density $\rho(\bar{x})=\rho(\bar{r},\bar{s})$ , i.e. a density where the spins are in canonical order. As $\bar{s}$ is fixed as the canonical ordering, we may suppress it, writing $\rho(\bar{x})=\rho(\bar{r},\bar{s})\equiv\bar{\rho}(\bar{r})$ for a function $\bar{\rho}$ . Now, $\bar{\rho}$ must satisfy condition (D1); however, the only permutations that are relevant are those that preserve the canonical spin ordering $\bar{s}$ . More specifically, the relevant permutations $\pi$ are those for which $\pi\bar{s}=\bar{s}$ ; it is easy to see that those permutations form the group $\mathbb{S}_{\mathcal{N}_{u}}\times\mathbb{S}_{\mathcal{N}_{d}}=\mathbb{G}$ . Thus, we must have that

\bar{\rho}(\pi\bar{r})=\bar{\rho}(\bar{r})\text{ for all }\pi\in\mathbb{G}

(50)

That is, $\bar{\rho}$ is $\mathbb{G}$ -invariant, which is condition (R1).

Now let us turn to condition (D2), which states that $\rho(x)=0$ if $x_{i}=x_{j}$ for any $i,j$ . The requirement $x_{i}=x_{j}$ implies that both $r_{i}=r_{j}$ and $s_{i}=s_{j}$ ; and the condition $s_{i}=s_{j}$ is equivalent to $i,j\in\mathcal{N}_{u}$ or $i,j\in\mathcal{N}_{d}$ . Thus, condition (D2) is equivalent to

\bar{\rho}(\bar{r})=0\text{ if }\bar{r}_{i}=\bar{r}_{j},\text{ for }i,j\in% \mathcal{N}_{u}\text{ or }i,j\in\mathcal{N}_{d}

(51)

which is simply condition (R2).

Thus, we have proven conditions (R1) and (R2) must hold. Finally, using Equation (49) and the definitions of $\bar{\rho}$ and $\bar{r}$ , we have that

\rho(x)=\rho(\bar{x})=\bar{\rho}(\bar{r})=\bar{\rho}(\bar{\pi}_{s}r)

(52)

which completes the proof of the forward direction.

Now, let us prove the reverse direction: assume a density $\bar{\rho}$ satisfying conditions (R1) and (R2), and we will show that $\rho(r,s)=\bar{\rho}(\bar{\pi}_{s}r)$ satisfies conditions (D1) and (D2). Let us begin by computing $\bar{\pi}_{\pi s}$ , which will prove useful in what follows. $\bar{\pi}_{\pi s}$ is defined by $\bar{s}=\bar{\pi}_{\pi s}\pi s$ . However, we also know that $\bar{s}=\bar{\pi}_{s}s$ ; setting these equal gives

\bar{\pi}_{\pi s}\pi s=\bar{\pi}_{s}s\quad\Rightarrow\quad\bar{\pi}_{\pi s}\pi% =\bar{\pi}_{s}\hat{\pi}

(53)

where $\hat{\pi}$ is some permutation that leaves $s$ unchanged, i.e. such that $\hat{\pi}s=s$ . Thus

\bar{\pi}_{\pi s}=\bar{\pi}_{s}\hat{\pi}\pi^{-1}

(54)

However, we know that $\bar{s}=\bar{\pi}_{s}s$ so that $s=\bar{\pi}_{s}^{-1}\bar{s}$ . Using the fact that $\hat{\pi}s=s$ gives

\hat{\pi}\bar{\pi}_{s}^{-1}\bar{s}=\bar{\pi}_{s}^{-1}\bar{s}\quad\Rightarrow% \quad\hat{\pi}\bar{\pi}_{s}^{-1}=\bar{\pi}_{s}^{-1}\breve{\pi}

(55)

where $\breve{\pi}$ is some permutation that leaves $\bar{s}$ unchanged; which precisely implies that $\breve{\pi}\in\mathbb{G}$ . Rearranging gives

\hat{\pi}=\bar{\pi}_{s}^{-1}\breve{\pi}\bar{\pi}_{s}\quad\Rightarrow\quad\bar{% \pi}_{\pi s}=\bar{\pi}_{s}\bar{\pi}_{s}^{-1}\breve{\pi}\bar{\pi}_{s}\pi^{-1}=% \breve{\pi}\bar{\pi}_{s}\pi^{-1}

(56)

Now, for any permutation $\pi$ , we have that

\rho(\pi x)=\bar{\rho}(\bar{\pi}_{\pi s}\pi r)=\bar{\rho}(\breve{\pi}\bar{\pi}% _{s}\pi^{-1}\pi r)=\bar{\rho}(\breve{\pi}(\bar{\pi}_{s}r))=\bar{\rho}(\bar{\pi% }_{s}r)=\rho(x)

(57)

where in the second-to-last equality, we have used the fact that $\breve{\pi}\in\mathbb{G}$ , and that $\bar{\rho}$ is $\mathbb{G}$ -invariant by (R1). Thus, we have established property (D1), i.e. that $\rho$ is symmetric.

Now, turning to condition (D2), let us consider an $x$ such that $x_{i}=x_{j}$ for a particular $i,j$ . Continuing to use the notation $\bar{r}=\bar{\pi}_{s}r$ , this implies that $\bar{r}_{i}=\bar{r}_{j}$ for either $i,j\in\mathcal{N}_{u}$ or $i,j\in\mathcal{N}_{d}$ . Thus, by condition (R2), we have that $\bar{\rho}(\bar{r})=0$ . But then

\rho(x)=\bar{\rho}(\bar{\pi}_{s}r)=\bar{\rho}(\bar{r})=0

(58)

which is precisely condition (D2); this completes the proof. ∎
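The reverse direction can be checked numerically with a minimal sketch. The function `rho_bar` below is a toy, unnormalized stand-in satisfying (R1) and (R2), spins are encoded as $+1$ and $-1$, and the canonical permutation $\bar{\pi}_{s}$ is taken to be a stable sort that places spin-up electrons first; all of these are assumptions made for illustration.

```python
import math

def canonical_perm(s):
    # Indices that move all spin-up electrons (+1) ahead of spin-down ones (-1);
    # sorted() is stable, so the relative order within each spin is kept.
    return sorted(range(len(s)), key=lambda i: -s[i])

def rho_bar(r_bar, n_up):
    # Toy G-invariant function that vanishes when two same-spin electrons coincide.
    value = math.exp(-sum(c * c for p in r_bar for c in p))
    for block in (r_bar[:n_up], r_bar[n_up:]):
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                value *= math.dist(block[i], block[j]) ** 2
    return value

def rho(r, s):
    # rho(r, s) = rho_bar(pi_bar_s r): reorder positions into canonical spin order.
    order = canonical_perm(s)
    n_up = sum(1 for spin in s if spin > 0)
    return rho_bar([r[i] for i in order], n_up)
```

Applying any joint permutation to the positions and spins leaves `rho` unchanged, matching (D1), and making two same-spin electrons coincide gives exactly zero, matching (D2).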

Appendix D Proof of Theorem 3

Theorem.

Proof.

Let us begin by proving that the density resulting from the normalizing flow will satisfy condition (R1). Theorem 1 in (Köhler et al., 2020) states the following: “Let $\rho$ be a density on $\mathbb{R}^{m}$ which is $\mathbb{G}$ -invariant and $\mathbb{G}>\mathbb{H}$ . If $f$ is an $\mathbb{H}$ -equivariant diffeomorphism, then $\rho_{f}$ , the push-forward of $\rho$ along $f$ , is $\mathbb{H}$ -invariant.” In our instance, we may take $\mathbb{H}=\mathbb{G}$ , and thereby have established that the density resulting from the normalizing flow is $\mathbb{G}$ -invariant, thus satisfying condition (R1).

We now turn to proving that the density resulting from the normalizing flow will satisfy condition (R2). Suppose that the random variable for the base density is given by $z$ , with density $\rho_{z}$ ; and the normalizing flow is given by transformation $T$ , i.e. $r=T(z)$ . Then by the change of variables formula, we know that the density of $r$ is given by $\rho_{r}(r)=\rho_{z}(T^{-1}(r))|\det J_{T^{-1}}(r)|$ . Now, we are interested in the case when $r_{i}=r_{j}$ for $i,j\in\mathcal{N}_{u}$ (the case of $\mathcal{N}_{d}$ is identical). Let $\pi_{ij}\in\mathbb{G}$ be the permutation whose only action is to flip the coordinates of electrons $i$ and $j$ . Given that $r_{i}=r_{j}$ , then by definition we have that $\pi_{ij}r=r$ . In this case, we have that

z\equiv T^{-1}(r)=T^{-1}(\pi_{ij}r)=\pi_{ij}T^{-1}(r)

(59)

where the latter equality is due to the $\mathbb{G}$ -equivariance of $T^{-1}$ , which follows straightforwardly from the $\mathbb{G}$ -equivariance of $T$ . Rearranging the above, we have that

T^{-1}(r)=\pi_{ij}^{-1}z=\pi_{ij}z

(60)

where the second equality is due to the fact that $\pi_{ij}^{-1}=\pi_{ij}$ , as $\pi_{ij}$ simply flips electrons $i$ and $j$ . However, $z=T^{-1}(r)$ , so combining Equations (59) and (60) gives $z=\pi_{ij}z$ . Plugging this into the equation for the change of variables gives

\rho_{r}(r)=\rho_{z}(T^{-1}(r))|\det J_{T^{-1}}(r)|=\rho_{z}(z)|\det J_{T^{-1}% }(r)|

(61)

But we know that $z$ is such that $z=\pi_{ij}z$ , which means that $z_{i}=z_{j}$ ; and for such $z$ ’s, we know that $\rho_{z}(z)=0$ , by the assumption of condition (R2) for the base density. Thus, plugging back into Equation (61) gives

\rho_{r}(r)=0\cdot|\det J_{T^{-1}}(r)|=0

(62)

as desired. ∎

Appendix E Proof of Theorem 4

Theorem.

Let $\rho_{z}(z)=\rho_{dpp}(z_{u};n_{u})\rho_{dpp}(z_{d};n_{d})$ . Then $\rho_{z}$ satisfies conditions (R1) and (R2) from Theorem 2.

Proof.

Let us begin with property (R1): we would like to prove that $\rho_{z}$ is $\mathbb{G}$ -invariant. Let $\pi\in\mathbb{G}$ ; as $\mathbb{G}=\mathbb{S}_{\mathcal{N}_{u}}\times\mathbb{S}_{\mathcal{N}_{d}}$ , we may write the permutation $\pi=\pi_{u}\otimes\pi_{d}$ , where $\pi_{u}$ is a permutation which applies to the indices in $\mathcal{N}_{u}$ , and similarly for $\pi_{d}$ and the indices in $\mathcal{N}_{d}$ . Thus,

\rho_{z}(\pi z)=\rho_{dpp}(\pi_{u}z_{u};n_{u})\rho_{dpp}(\pi_{d}z_{d};n_{d})

(63)

Now, recall that the Projection DPP’s density is defined by

\mathbf{K}_{n}(r)=\begin{bmatrix}K(r_{1},r_{1})&\dots&K(r_{1},r_{n})\\ \vdots&\ddots&\vdots\\ K(r_{n},r_{1})&\dots&K(r_{n},r_{n})\end{bmatrix}\qquad\Rightarrow\qquad\rho_{% dpp}(r;n)=\frac{1}{n!}\det\mathbf{K}_{n}(r)

(64)

Thus, we must compute $\mathbf{K}_{n}(\pi r)$ for a permutation $\pi$ . We may represent the action of $\pi$ on a vector of length $n$ by an $n\times n$ matrix $P_{\pi}$ . It is then straightforward to see that

\mathbf{K}_{n}(\pi r)=P_{\pi}\mathbf{K}_{n}(r)P_{\pi}^{T}

(65)

and thus that

\det\mathbf{K}_{n}(\pi r)=\det\left(P_{\pi}\mathbf{K}_{n}(r)P_{\pi}^{T}\right)% =\det\left(P_{\pi}^{T}P_{\pi}\mathbf{K}_{n}(r)\right)=\det(P_{\pi})^{2}\det% \mathbf{K}_{n}(r)=\det\mathbf{K}_{n}(r)

(66)

where the second equality uses the cyclic property of the determinant; the third equality that a determinant of products is the product of determinants; and the fourth equality that the determinant of a permutation matrix $P_{\pi}$ is $\pm 1$ . Thus, we have that

\rho_{dpp}(\pi_{u}z_{u};n_{u})=\frac{1}{n_{u}!}\det\mathbf{K}_{n_{u}}(\pi_{u}z% _{u})=\frac{1}{n_{u}!}\det\mathbf{K}_{n_{u}}(z_{u})=\rho_{dpp}(z_{u};n_{u})

(67)

Likewise, $\rho_{dpp}(\pi_{d}z_{d};n_{d})=\rho_{dpp}(z_{d};n_{d})$ . This gives finally that $\rho_{z}(\pi z)=\rho_{z}(z)$ , establishing that $\rho_{z}$ is $\mathbb{G}$ -invariant, i.e. satisfies condition (R1).

We now turn to condition (R2): we would like to prove that $\rho_{z}(z)=0\text{ if }z_{i}=z_{j},\text{ for }i,j\in\mathcal{N}_{u}\text{ or% }i,j\in\mathcal{N}_{d}$ . Let us focus on the case of spin-up electrons, i.e. $i,j\in\mathcal{N}_{u}$ ; the spin-down case will follow analogously. We know that

\rho_{dpp}(z_{u};n_{u})=\frac{1}{n_{u}!}\det\mathbf{K}_{n_{u}}(z_{u})

(68)

Given the definition of the matrix $\mathbf{K}_{n_{u}}(z_{u})$ , it is straightforward to see that if $z_{i}=z_{j}$ , then $\mathbf{K}_{n_{u}}(z_{u})$ has identical columns for $i$ and $j$ . However, a matrix with two identical columns is rank deficient, and therefore has determinant $0$ . Thus, we have that $\rho_{dpp}(z_{u};n_{u})=0$ so that $\rho_{z}(z)=0$ , establishing that $\rho_{z}$ satisfies condition (R2). ∎
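Both conditions can be checked numerically on a small example. The sketch below is illustrative only: it assumes one-dimensional points on $[0,1]$ and a projection kernel built from three Fourier functions, neither of which is prescribed by the theorem, and it evaluates $\rho_{dpp}$ under all permutations and at a configuration with two coincident points.

```python
import math
from itertools import permutations

def basis(x):
    """Three orthonormal functions on [0, 1] (an illustrative assumption)."""
    s = math.sqrt(2.0)
    return [1.0, s * math.cos(2 * math.pi * x), s * math.sin(2 * math.pi * x)]

def kernel(x, y):
    """Projection kernel K(x, y) = sum_k h_k(x) h_k(y)."""
    return sum(a * b for a, b in zip(basis(x), basis(y)))

def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in m]
    n, d = len(m), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(m[i][c]))
        if abs(m[p][c]) < 1e-14:
            return 0.0
        if p != c:
            m[c], m[p] = m[p], m[c]
            d = -d
        d *= m[c][c]
        for i in range(c + 1, n):
            f = m[i][c] / m[c][c]
            for j in range(c, n):
                m[i][j] -= f * m[c][j]
    return d

def rho_dpp(points):
    """Projection DPP density: det of the kernel matrix divided by n!."""
    n = len(points)
    k = [[kernel(a, b) for b in points] for a in points]
    return det(k) / math.factorial(n)

z = [0.11, 0.47, 0.80]
base = rho_dpp(z)
perm_values = [rho_dpp(list(p)) for p in permutations(z)]  # condition (R1)
coincident = rho_dpp([0.11, 0.47, 0.11])                   # condition (R2)
```

Every entry of `perm_values` should agree with `base` up to rounding, and `coincident` should vanish up to rounding.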

Appendix F Sampling Procedure for Projection DPPs

In order to sample from a Projection DPP, we may follow the procedure outlined in (Lavancier et al., 2015), which we reproduce in Algorithm 1. We note that the speed of the sampling algorithm is largely unimportant, as one may draw as many samples as one would like offline, prior to (and independently of) the process of minimizing the variational objective.

Algorithm 1 Sampling from a Projection Determinantal Point Process

Require: $n$ , $H(y)$
sample $r_{n}$ from the distribution with density $\rho_{n}(y)=\frac{1}{n}\|H(y)\|^{2}$
$e_{1}\leftarrow H(r_{n})/\|H(r_{n})\|$
for $i=n-1$ to $1$ do
    sample $r_{i}$ from the distribution with density $\rho_{i}(y)=\frac{1}{i}\|H(y)\|^{2}-\frac{1}{i}\sum_{j=1}^{n-i}|e_{j}^{T}H(y)|^{2}$
    $c_{i}\leftarrow H(r_{i})-\sum_{j=1}^{n-i}\left(e_{j}^{T}H(r_{i})\right)e_{j}$
    $e_{n-i+1}\leftarrow c_{i}/\|c_{i}\|$
end for
return $r=(r_{1},\dots,r_{n})$

In order to sample from $\rho_{i}(y)$ , one may use rejection sampling; for further details, see (Lavancier et al., 2015). Note that the algorithm can be generalized in a straightforward fashion to a complex orthonormal basis $H(x)$ by replacing all transposes with Hermitian transposes.
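A minimal sketch of the sampling loop with rejection sampling is given below. It is not the implementation of (Lavancier et al., 2015): the one-dimensional domain $[0,1]$, the three-function Fourier basis, the uniform proposal and the constant `bound` are all assumptions made for illustration, and the orthogonalization step is plain Gram-Schmidt (the full projections onto the previous $e_{j}$ are subtracted).

```python
import math
import random

def H(y):
    """Vector of n = 3 orthonormal functions on [0, 1] (illustrative choice)."""
    s = math.sqrt(2.0)
    return [1.0, s * math.cos(2 * math.pi * y), s * math.sin(2 * math.pi * y)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def residual(y, es):
    """||H(y)||^2 minus the squared projections onto the vectors in es."""
    h = H(y)
    return dot(h, h) - sum(dot(e, h) ** 2 for e in es)

def sample_projection_dpp(n, rng, bound):
    """Draw n points; `bound` must upper-bound ||H(y)||^2 on [0, 1]."""
    es, points = [], []
    for i in range(n, 0, -1):
        # Rejection sampling from rho_i(y) = residual(y, es) / i with a uniform
        # proposal; the 1/i normalisation cancels in the acceptance ratio.
        while True:
            y = rng.random()
            if rng.random() * bound < residual(y, es):
                break
        points.append(y)
        # Gram-Schmidt step: remove the components along the previous e_j.
        h = H(y)
        c = [hk - sum(dot(e, h) * e[k] for e in es) for k, hk in enumerate(h)]
        norm = math.sqrt(dot(c, c))
        es.append([ck / norm for ck in c])
    return points[::-1]  # reorder as (r_1, ..., r_n)

rng = random.Random(0)
sample = sample_projection_dpp(3, rng, bound=3.0)
```

For this particular basis $\|H(y)\|^{2}=3$ everywhere, so the first point is always accepted; later points are rejected more often near the points already drawn, which is the repulsion characteristic of a DPP.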

Appendix G Proof of Theorem 5

Theorem.

The transformation $T^{\ell}$ is $\mathbb{G}$ -equivariant if and only if

T_{\alpha}^{\ell}(\pi_{\alpha}r_{\alpha}^{\ell},\pi_{\hat{\alpha}}r_{\hat{% \alpha}}^{\ell})=\pi_{\alpha}T_{\alpha}^{\ell}(r_{\alpha}^{\ell},r_{\hat{% \alpha}}^{\ell})\qquad\alpha\in\{u,d\}

That is, $T_{\alpha}^{\ell}$ is equivariant with respect to $r_{\alpha}^{\ell}$ , and invariant with respect to $r_{\hat{\alpha}}^{\ell}$ .

Proof.

Let us begin with the forward direction: suppose that $T^{\ell}$ is $\mathbb{G}$ -equivariant. Let $\pi\in\mathbb{G}$ ; as $\mathbb{G}=\mathbb{S}_{\mathcal{N}_{u}}\times\mathbb{S}_{\mathcal{N}_{d}}$ , we may write the permutation $\pi=\pi_{u}\otimes\pi_{d}$ , where $\pi_{u}$ is a permutation which applies to the indices in $\mathcal{N}_{u}$ , and similarly for $\pi_{d}$ and the indices in $\mathcal{N}_{d}$ . Then $\mathbb{G}$ -equivariance of $T^{\ell}$ implies

T^{\ell}(\pi r^{\ell})=\pi T^{\ell}(r^{\ell})=\pi r^{\ell+1}

(69)

Now, let us break this down by spin. Note that

\pi r^{\ell+1}=(\pi_{u}r_{u}^{\ell+1},\pi_{d}r_{d}^{\ell+1})

(70)

and also

T^{\ell}(\pi r^{\ell})=(T_{u}^{\ell}(\pi_{u}r_{u}^{\ell},\pi_{d}r_{d}^{\ell})% \,,\,T_{d}^{\ell}(\pi_{u}r_{u}^{\ell},\pi_{d}r_{d}^{\ell}))

(71)

But Equation (69) says that $\pi r^{\ell+1}=T^{\ell}(\pi r^{\ell})$ , so we may combine the last two equations to give

T_{u}^{\ell}(\pi_{u}r_{u}^{\ell},\pi_{d}r_{d}^{\ell})=\pi_{u}T_{u}^{\ell}(r_{u% }^{\ell},r_{d}^{\ell})\qquad\text{and}\qquad T_{d}^{\ell}(\pi_{u}r_{u}^{\ell},% \pi_{d}r_{d}^{\ell})=\pi_{d}T_{d}^{\ell}(r_{u}^{\ell},r_{d}^{\ell})

(72)

In words, $T_{u}^{\ell}$ is equivariant with respect to $\pi_{u}$ , and invariant with respect to $\pi_{d}$ ; and the reverse is true for $T_{d}^{\ell}$ . For notational convenience, we use $\alpha\in\{u,d\}$ to denote the spin, and the complement of the spin is given by ${\hat{\alpha}}$ (i.e. if $\alpha=u$ then ${\hat{\alpha}}=d$ ). In this case, we may summarize Equation (72) as

T_{\alpha}^{\ell}(\pi_{\alpha}r_{\alpha}^{\ell},\pi_{\hat{\alpha}}r_{\hat{% \alpha}}^{\ell})=\pi_{\alpha}T_{\alpha}^{\ell}(r_{\alpha}^{\ell},r_{\hat{% \alpha}}^{\ell})\qquad\alpha\in\{u,d\}

(73)

which completes the proof for the forward direction.

Now, suppose that $T_{\alpha}^{\ell}(\pi_{\alpha}r_{\alpha}^{\ell},\pi_{\hat{\alpha}}r_{\hat{% \alpha}}^{\ell})=\pi_{\alpha}T_{\alpha}^{\ell}(r_{\alpha}^{\ell},r_{\hat{% \alpha}}^{\ell})$ . For a given permutation $\pi=\pi_{u}\otimes\pi_{d}$ , we have

T^{\ell}(\pi r^{\ell})=(T_{u}^{\ell}(\pi_{u}r_{u}^{\ell},\pi_{d}r_{d}^{\ell})\,,\,T_{d}^{\ell}(\pi_{u}r_{u}^{\ell},\pi_{d}r_{d}^{\ell}))=(\pi_{u}T_{u}^{\ell}(r_{u}^{\ell},r_{d}^{\ell})\,,\,\pi_{d}T_{d}^{\ell}(r_{u}^{\ell},r_{d}^{\ell}))=\pi T^{\ell}(r^{\ell})

(74)

so that $T^{\ell}$ is $\mathbb{G}$ -equivariant, as desired. This completes the proof for the reverse direction. ∎

Appendix H Proof of Theorem 6

Theorem.

Let the transformation $r=T(z)$ be specified by the ODE

\frac{dv}{dt}=\Gamma(v),\quad\text{with}\quad v(0)=z\sim\rho_{z}(\cdot)\quad% \text{and}\quad r=v(1)

Then $T$ is $\mathbb{G}$ -equivariant if $\Gamma$ is $\mathbb{G}$ -equivariant.

Proof.

The result follows directly from Theorem 2 in (Köhler et al., 2020). ∎
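The statement can be illustrated numerically. The sketch below is an assumption-laden toy, not the construction used in the paper: positions are scalars, there are two spin-up and two spin-down electrons, the vector field `gamma` is a hand-picked $\mathbb{G}$ -equivariant function, and the ODE is integrated with forward Euler. Since every Euler step is itself an equivariant map, the discretized flow inherits the equivariance.

```python
import math

UP, DOWN = [0, 1], [2, 3]  # index sets N_u and N_d (illustrative sizes)

def gamma(v):
    """A hand-picked G-equivariant vector field on scalar 'positions'."""
    mu = sum(v[i] for i in UP) / len(UP)
    md = sum(v[i] for i in DOWN) / len(DOWN)
    out = [0.0] * len(v)
    for i in UP:
        out[i] = math.tanh(v[i] - mu) + 0.3 * math.sin(md)
    for i in DOWN:
        out[i] = math.tanh(v[i] - md) - 0.3 * math.sin(mu)
    return out

def flow(z, steps=200):
    """Forward-Euler approximation of r = v(1) for dv/dt = gamma(v), v(0) = z."""
    v, dt = list(z), 1.0 / steps
    for _ in range(steps):
        g = gamma(v)
        v = [vi + dt * gi for vi, gi in zip(v, g)]
    return v

def permute(x, perm):
    """Apply a permutation: entry i of the result is x[perm[i]]."""
    return [x[p] for p in perm]

z = [0.2, -1.0, 0.7, 1.5]
perm = [1, 0, 3, 2]  # swaps the two spin-up and the two spin-down entries
lhs = flow(permute(z, perm))   # T(pi z)
rhs = permute(flow(z), perm)   # pi T(z)
max_gap = max(abs(a - b) for a, b in zip(lhs, rhs))
```

Here `max_gap` measures the violation of $T(\pi z)=\pi T(z)$ and should be zero up to rounding.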

Appendix I The Complexity of Discrete Normalizing Flows

The limiting factor in the complexity of the discrete normalizing flow is the computation of determinants; unlike traditional normalizing flows, we make no effort to accelerate the computation of the determinant of the Jacobian, which allows us to use more expressive layers. In particular, the relevant space is of dimension $Dn$ , where $D=3$ and $n$ is on the order of tens of electrons for small molecules. Thus, the overall dimension of the space is in the low hundreds.

We note that the cost of computing the determinant of the Jacobian is cubic in the dimension; for a low-dimensional space this is acceptable. Furthermore, popular methods based on neural networks, such as FermiNet (Pfau et al., 2020) and PauliNet (Hermann et al., 2020), use determinants explicitly in their ansätze, so that they have similar complexity. However, these methods use Markov Chain Monte Carlo sampling, so that they incur extra overhead from having to sample by solving for the limit of a stochastic differential equation, which our method avoids.

Appendix J Proof of Theorem 7

Theorem.

Let $T^{\ell}$ be a Split Subspace Layer, as given by

r_{\alpha,i}^{\ell+1}=T_{\alpha,i}^{\ell}(r_{\alpha}^{\ell},r_{\hat{\alpha}}^{% \ell})=r_{\alpha,i}^{\ell}+\xi_{\alpha}^{\ell}\varphi_{\alpha,i}^{\ell}(\gamma% _{\alpha}^{\ell},\gamma_{\hat{\alpha}}^{\ell})\quad\text{with}\quad\varphi_{% \alpha,i}^{\ell}(\gamma_{\alpha}^{\ell},\gamma_{\hat{\alpha}}^{\ell})\in% \mathbb{R}^{D-D_{\beta}}

Then $T^{\ell}$ is invertible. In particular, let $\underline{\gamma}_{\alpha,i}^{\ell+1}=(\beta_{\alpha}^{\ell})^{T}r_{\alpha,i}% ^{\ell+1}$ ; then the inverse of the layer is given by

r_{\alpha,i}^{\ell}=r_{\alpha,i}^{\ell+1}-\xi_{\alpha}^{\ell}\varphi_{\alpha,i% }^{\ell}(\underline{\gamma}_{\alpha}^{\ell+1},\underline{\gamma}_{\hat{\alpha}% }^{\ell+1})

Furthermore, the layer $T^{\ell}$ is $\mathbb{G}$ -equivariant if

\varphi_{\alpha}^{\ell}(\pi_{\alpha}\gamma_{\alpha}^{\ell},\pi_{\hat{\alpha}}% \gamma_{\hat{\alpha}}^{\ell})=\pi_{\alpha}\varphi_{\alpha}^{\ell}(\gamma_{% \alpha}^{\ell},\gamma_{\hat{\alpha}}^{\ell})

Proof.

Let us first prove the layer’s inverse. First, note that $\underline{\gamma}_{\alpha,i}^{\ell+1}$ can be computed entirely from variables in layer ${\ell+1}$ . Also note that $\underline{\gamma}_{\alpha,i}^{\ell+1}\neq\gamma_{\alpha,i}^{\ell+1}$ , since $\gamma_{\alpha,i}^{\ell+1}=(\beta_{\alpha}^{\ell+1})^{T}r_{\alpha,i}^{\ell+1}$ - i.e. $\underline{\gamma}_{\alpha,i}^{\ell+1}$ uses $\beta_{\alpha}^{\ell}$ , while $\gamma_{\alpha,i}^{\ell+1}$ uses $\beta_{\alpha}^{\ell+1}$ . Now, we show that

	$\displaystyle\underline{\gamma}_{\alpha,i}^{\ell+1}$	$\displaystyle=(\beta_{\alpha}^{\ell})^{T}r_{\alpha,i}^{\ell+1}$
		$\displaystyle=(\beta_{\alpha}^{\ell})^{T}\left(r_{\alpha,i}^{\ell}+\xi_{\alpha% }^{\ell}\varphi_{\alpha,i}^{\ell}(\gamma_{\alpha}^{\ell},\gamma_{\hat{\alpha}}% ^{\ell})\right)$
		$\displaystyle=(\beta_{\alpha}^{\ell})^{T}r_{\alpha,i}^{\ell}+(\beta_{\alpha}^{% \ell})^{T}\xi_{\alpha}^{\ell}\varphi_{\alpha,i}^{\ell}(\gamma_{\alpha}^{\ell},% \gamma_{\hat{\alpha}}^{\ell})$
		$\displaystyle=\gamma_{\alpha,i}^{\ell}+0=\gamma_{\alpha,i}^{\ell}$

where the equality in the last line follows from the fact that $\xi_{\alpha}^{\ell}$ is the orthogonal complement of $\beta_{\alpha}^{\ell}$ , so that $(\beta_{\alpha}^{\ell})^{T}\xi_{\alpha}^{\ell}=0$ . That is, the Split Subspace Layer has the nice property that it preserves projections onto the subspace given by $\beta_{\alpha}^{\ell}$ .

Given this, the inverse is straightforwardly computed by rearranging Equation (18):

r_{\alpha,i}^{\ell}=r_{\alpha,i}^{\ell+1}-\xi_{\alpha}^{\ell}\varphi_{\alpha,i% }^{\ell}(\gamma_{\alpha}^{\ell},\gamma_{\hat{\alpha}}^{\ell})=r_{\alpha,i}^{% \ell+1}-\xi_{\alpha}^{\ell}\varphi_{\alpha,i}^{\ell}(\underline{\gamma}_{% \alpha}^{\ell+1},\underline{\gamma}_{\hat{\alpha}}^{\ell+1})

(75)

Note that everything on the right-hand side depends on variables from layer ${\ell+1}$ , as desired. Thus, we have shown the layer in Equation (18) is invertible regardless of the form of the network $\varphi_{\alpha}^{\ell}$ .

Now let us turn to proving the layer’s $\mathbb{G}$ -equivariance. Recall that the conditions for $\mathbb{G}$ -equivariance are given by Equation (14), which we can combine with Equation (18):

	$\displaystyle T_{\alpha}^{\ell}(\pi_{\alpha}r_{\alpha}^{\ell},\pi_{\hat{\alpha}}r_{\hat{\alpha}}^{\ell})=\pi_{\alpha}T_{\alpha}^{\ell}(r_{\alpha}^{\ell},r_{\hat{\alpha}}^{\ell})$
$\displaystyle\Leftrightarrow\quad$	$\displaystyle T_{\alpha,i}^{\ell}(\pi_{\alpha}r_{\alpha}^{\ell},\pi_{\hat{\alpha}}r_{\hat{\alpha}}^{\ell})=T_{\alpha,\pi_{\alpha}(i)}^{\ell}(r_{\alpha}^{\ell},r_{\hat{\alpha}}^{\ell})$
$\displaystyle\Leftrightarrow\quad$	$\displaystyle r_{\alpha,\pi_{\alpha}(i)}^{\ell}+\xi_{\alpha}^{\ell}\varphi_{\alpha,i}^{\ell}(\pi_{\alpha}\gamma_{\alpha}^{\ell},\pi_{\hat{\alpha}}\gamma_{\hat{\alpha}}^{\ell})=r_{\alpha,\pi_{\alpha}(i)}^{\ell}+\xi_{\alpha}^{\ell}\varphi_{\alpha,\pi_{\alpha}(i)}^{\ell}(\gamma_{\alpha}^{\ell},\gamma_{\hat{\alpha}}^{\ell})$
$\displaystyle\Leftrightarrow\quad$	$\displaystyle\varphi_{\alpha,i}^{\ell}(\pi_{\alpha}\gamma_{\alpha}^{\ell},\pi_{\hat{\alpha}}\gamma_{\hat{\alpha}}^{\ell})=\varphi_{\alpha,\pi_{\alpha}(i)}^{\ell}(\gamma_{\alpha}^{\ell},\gamma_{\hat{\alpha}}^{\ell})$
$\displaystyle\Leftrightarrow\quad$	$\displaystyle\varphi_{\alpha}^{\ell}(\pi_{\alpha}\gamma_{\alpha}^{\ell},\pi_{\hat{\alpha}}\gamma_{\hat{\alpha}}^{\ell})=\pi_{\alpha}\varphi_{\alpha}^{\ell}(\gamma_{\alpha}^{\ell},\gamma_{\hat{\alpha}}^{\ell})$	(76)

where $\pi_{\alpha}(i)$ indicates the index that electron $i$ is moved to under the permutation $\pi_{\alpha}$ ; and the fourth line follows from the fact that the previous statement must be true for all possible outputs of $\varphi_{\alpha}^{\ell}$ . This completes the proof. ∎
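The inverse formula can be exercised on a small example. The following sketch makes several simplifying assumptions that are not part of the theorem: $D=3$ and $D_{\beta}=1$ , a fixed axis-aligned choice of $\beta$ and $\xi$ shared by both spins, two spin-up electrons and one spin-down electron, and an arbitrary hand-written function `phi` in place of a neural network.

```python
import math

BETA = (0.0, 0.0, 1.0)                    # single column of beta (D_beta = 1)
XI = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))   # two columns of xi, orthogonal to BETA

def project(r):
    """gamma_i = beta^T r_i for every electron of one spin."""
    return [sum(b * x for b, x in zip(BETA, ri)) for ri in r]

def phi(g_same, g_other):
    """Arbitrary smooth map: equivariant in g_same, invariant in g_other."""
    m_same = sum(g_same) / len(g_same)
    m_other = sum(g_other) / len(g_other)
    return [(math.sin(g) + m_same, math.cos(g) * m_other) for g in g_same]

def shift(r, g_same, g_other, sign):
    """Return r_i + sign * xi * phi_i for every electron of one spin."""
    out = []
    for ri, p in zip(r, phi(g_same, g_other)):
        out.append(tuple(x + sign * (p[0] * XI[0][k] + p[1] * XI[1][k])
                         for k, x in enumerate(ri)))
    return out

def layer(r_up, r_dn, sign=1.0):
    """Forward layer for sign=+1. The same formula with sign=-1 inverts it,
    because the projections gamma are left unchanged by the layer."""
    g_up, g_dn = project(r_up), project(r_dn)
    return shift(r_up, g_up, g_dn, sign), shift(r_dn, g_dn, g_up, sign)

r_up = [(0.1, 0.2, 0.3), (-0.4, 0.5, 0.9)]
r_dn = [(0.7, -0.6, -0.2)]
new_up, new_dn = layer(r_up, r_dn)
back_up, back_dn = layer(new_up, new_dn, sign=-1.0)
```

The round trip `back_up`, `back_dn` should reproduce `r_up`, `r_dn` up to rounding, regardless of how `phi` is chosen.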

Appendix K Implementation of the $\mathbb{G}$ -Equivariant Layer

As we have seen, invertibility places no special restrictions on the form of $\varphi_{\alpha}^{\ell}$ . With regard to the conditions imposed by $\mathbb{G}$ -equivariance in Equation (20), there are several ways to achieve them. We propose the following method, as it uses standard off-the-shelf architectures; we use the variables $\zeta_{\alpha,i}^{\ell}$ to represent intermediate quantities.

1. Lifting: Map each value $\gamma_{\alpha,i}^{\ell}$ from dimension $D_{\beta}$ to dimension $D_{\zeta}$ :

\zeta_{\alpha,i}^{\ell}=W_{\alpha}\gamma_{\alpha,i}^{\ell}

(77)

where there are two matrices $W_{\alpha}$ of dimension $D_{\zeta}\times D_{\beta}$ , one for each spin $\alpha\in\{u,d\}$ .

2. Multihead Attention: We have two Multihead Attention (MHA) layers $\tau_{\alpha}^{\ell}$ , one for each spin. Each MHA takes as input the list $\zeta_{\alpha}^{\ell}=\{\zeta_{\alpha,i}^{\ell}\}_{i\in{\mathcal{N}_{\alpha}}}$ . The output of the MHA is then

\zeta_{\alpha}^{\ell}\leftarrow\tau_{\alpha}^{\ell}(\zeta_{\alpha}^{\ell})

(78)

3. Fully Connected Layer Per Spin: There are two fully connected layers $\mu_{\alpha}^{\ell}$ , one for each spin. The layer is applied per electron, with the same layer being applied to all electrons of a given spin:

\zeta_{\alpha,i}^{\ell}\leftarrow\mu_{\alpha}^{\ell}(\zeta_{\alpha,i}^{\ell})\quad\text{for }i\in{\mathcal{N}_{\alpha}}

(79)

4. Average: Form the average values: $\bar{\zeta}_{\alpha}^{\ell}=\frac{1}{{n_{\alpha}}}\sum_{i\in{\mathcal{N}_{\alpha}}}\zeta_{\alpha,i}^{\ell}$ .

5. Fully Connected Layer with Spin Mixing: We have two fully connected layers $\hat{\mu}_{\alpha}^{\ell}$ , one for each spin. Then:

\varphi_{\alpha,i}^{\ell}(\gamma_{\alpha}^{\ell},\gamma_{\hat{\alpha}}^{\ell})=\hat{\mu}_{\alpha}^{\ell}(\texttt{CAT}(\zeta_{\alpha,i}^{\ell}\,,\,\bar{\zeta}_{\hat{\alpha}}^{\ell}))\quad\text{for }i\in{\mathcal{N}_{\alpha}}

(80)

The output of the MLPs $\hat{\mu}_{\alpha}^{\ell}$ is of dimension $D-D_{\beta}$ .

Due to the permutation-equivariance of Multihead Attention, the $\mathbb{G}$ -equivariance follows naturally. Some comments are in order:

•

We can choose $D_{\beta}\in\{1,\dots,D-1\}$ . Since in our case $D=3$ , this gives us exactly two choices: $D_{\beta}=1$ or $D_{\beta}=2$ .
•

The fully connected layers should use smooth activation functions, i.e. not ReLU. There are many smooth substitutes for ReLU-like activations, such as Swish (of which SiLU is a special case) or softplus.
•

To achieve orthogonalization, i.e. to ensure that $\xi_{\alpha}^{\ell}$ is itself orthonormal and is also orthogonal to $\beta_{\alpha}^{\ell}$ , it is important to use a smooth procedure. Gram-Schmidt may be employed for this purpose: an initial (e.g. random) set of vectors is chosen, which is then orthonormalized by the procedure.
•

In the special case of Helium, there are only 2 electrons: one which is spin-up, and the other which is spin-down. In this case, the requirement that $\varphi_{\alpha}^{\ell}(\gamma_{\alpha}^{\ell},\gamma_{\hat{\alpha}}^{\ell})$ be equivariant with respect to permutations of $\gamma_{\alpha}^{\ell}$ is trivially satisfied; likewise, the requirement that $\varphi_{\alpha}^{\ell}(\gamma_{\alpha}^{\ell},\gamma_{\hat{\alpha}}^{\ell})$ be invariant with respect to permutations of $\gamma_{\hat{\alpha}}^{\ell}$ is also trivially satisfied. As a result, the Multihead Attention layers $\tau_{\alpha}^{\ell}$ may be replaced by the identity, with everything else remaining the same.
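The orthogonalization mentioned in the comments above can be sketched as follows. This is a minimal illustration under stated assumptions: $D=3$ and $D_{\beta}=1$ , the initial vectors are drawn at random with a fixed seed, and the modified form of Gram-Schmidt is used; how the initial vectors would be parameterized or trained is not specified here.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gram_schmidt(vectors):
    """Orthonormalise a list of linearly independent vectors, in order."""
    basis = []
    for v in vectors:
        w = list(v)
        for e in basis:
            c = dot(e, w)  # modified Gram-Schmidt: project the running vector
            w = [wk - c * ek for wk, ek in zip(w, e)]
        norm = math.sqrt(dot(w, w))
        basis.append([wk / norm for wk in w])
    return basis

rng = random.Random(1)
D, D_beta = 3, 1
raw = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(D)]
frame = gram_schmidt(raw)
beta, xi = frame[:D_beta], frame[D_beta:]  # columns of beta, then columns of xi
```

The first $D_{\beta}$ orthonormal vectors play the role of the columns of $\beta_{\alpha}^{\ell}$ and the remaining $D-D_{\beta}$ those of $\xi_{\alpha}^{\ell}$ , so that $(\beta_{\alpha}^{\ell})^{T}\xi_{\alpha}^{\ell}=0$ holds by construction.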

Appendix L A Generalized Variant of the Split Subspace Layer

We note that a generalized variant of the Split Subspace Layer is as follows:

r_{\alpha,i}^{\ell+1}=T_{\alpha,i}^{\ell}(r_{\alpha}^{\ell},r_{\hat{\alpha}}^{% \ell})=\beta_{\alpha}^{\ell}\eta_{\alpha,i}^{\ell}(\gamma_{\alpha}^{\ell},% \gamma_{\hat{\alpha}}^{\ell})+\xi_{\alpha}^{\ell}\varphi_{\alpha,i}^{\ell}(% \gamma_{\alpha}^{\ell},\gamma_{\hat{\alpha}}^{\ell})

(81)

where both $\varphi$ and $\eta$ satisfy the conditions in (20), and $\eta$ is explicitly invertible in the sense that the system of equations $y_{\alpha,i}=\eta_{\alpha,i}^{\ell}(\gamma_{\alpha}^{\ell},\gamma_{\hat{\alpha}}^{\ell})$ for all $\alpha,i$ may be inverted to solve for all values of $\gamma_{\alpha,i}^{\ell}$ . An example of such an $\eta$ is given by $\eta_{\alpha,i}^{\ell}(\gamma_{\alpha}^{\ell},\gamma_{\hat{\alpha}}^{\ell})=f(A\gamma_{\alpha,i}^{\ell}+\sum_{j\neq i}B\gamma_{\alpha,j}^{\ell}+\sum_{j}C\gamma_{{\hat{\alpha}},j}^{\ell})$ for $D_{\beta}\times D_{\beta}$ matrices $A,B,C$ and an invertible nonlinearity $f:\mathbb{R}^{D_{\beta}}\to\mathbb{R}^{D_{\beta}}$ (such as the cube of each element).

Appendix M Proof of Theorem 8

Theorem.

The local energy can be written as

\mathcal{E}_{r}(r;\theta)=-\tfrac{1}{2}\Delta_{r}q(r;\theta)-\tfrac{1}{2}\|% \nabla_{r}q(r;\theta)\|^{2}+V(r)

In particular, the local energy is independent of the phase $w(r;\theta)$ . Furthermore, let

\Omega(r;\theta)=\nabla_{\theta}\mathcal{E}_{r}(r;\theta)+2\mathcal{E}_{r}(r;% \theta)\nabla_{\theta}q(r;\theta)

Then the gradient of the loss function may be written as

\nabla_{\theta}\mathcal{L}(\theta)=\mathbb{E}_{r\sim\rho(\cdot;\theta)}\left[% \Omega(r;\theta)\right]\approx\frac{1}{K}\sum_{k=1}^{K}\Omega\left(r^{(k)};% \theta\right)

with samples $r^{(k)}\sim\rho(\cdot;\theta)$ .

Proof.

For the moment, we suppress $\theta$ for convenience. Recall that the overall (i.e. complex) local energy is defined by

\mathcal{E}(r)=\frac{H\psi(r)}{\psi(r)}=-\frac{\Delta\psi(r)}{2\psi(r)}+V(r)

(82)

Let $r_{j,d}$ be the $d^{th}$ component of the position vector of the $j^{th}$ electron, $d=1,\dots,D$ . Then plugging in $\psi(r)=e^{q(r)+iw(r)}$ , we have that

\frac{\partial\psi}{\partial r_{j,d}}=e^{q(r)+iw(r)}\left(\frac{\partial q}{% \partial r_{j,d}}+i\frac{\partial w}{\partial r_{j,d}}\right)=\psi(r)\left(% \frac{\partial q}{\partial r_{j,d}}+i\frac{\partial w}{\partial r_{j,d}}\right)

(83)

and

	$\displaystyle\frac{\partial^{2}\psi}{\partial r_{j,d}^{2}}$	$\displaystyle=e^{q(r)+iw(r)}\left(\frac{\partial^{2}q}{\partial r_{j,d}^{2}}+i% \frac{\partial^{2}w}{\partial r_{j,d}^{2}}\right)+e^{q(r)+iw(r)}\left(\frac{% \partial q}{\partial r_{j,d}}+i\frac{\partial w}{\partial r_{j,d}}\right)^{2}$
		$\displaystyle=\psi(r)\left[\left(\frac{\partial^{2}q}{\partial r_{j,d}^{2}}+% \left(\frac{\partial q}{\partial r_{j,d}}\right)^{2}-\left(\frac{\partial w}{% \partial r_{j,d}}\right)^{2}\right)+i\left(\frac{\partial^{2}w}{\partial r_{j,% d}^{2}}+2\frac{\partial q}{\partial r_{j,d}}\frac{\partial w}{\partial r_{j,d}% }\right)\right]$		(84)

With the appropriate summation, this immediately yields

\mathcal{E}(r)=-\frac{1}{2}\sum_{j=1}^{n}\left[\left(\Delta_{j}q+\|\nabla_{j}q% \|^{2}-\|\nabla_{j}w\|^{2}\right)+i\left(\Delta_{j}w+2\nabla_{j}q\cdot\nabla_{% j}w\right)\right]+V(r)

(85)

so that its real part simplifies to

	$\displaystyle\mathcal{E}_{r}(r)$	$\displaystyle=\frac{1}{2}\sum_{j=1}^{n}\left(\|\nabla_{j}w\|^{2}-\Delta_{j}q-\|\nabla_{j}q\|^{2}\right)+V(r)$
		$\displaystyle=\tfrac{1}{2}\left(\|\nabla w\|^{2}-\Delta q-\|\nabla q\|^{2}\right)+V(r)$		(86)

Now, it is known that since the Hamiltonian is time-reversal invariant and Hermitian, its eigenvalues are real and its eigenfunctions may be chosen to be real. Since the ground-state wavefunction we are looking for is real, the phase $w(r)$ can be taken to belong to the two element set $\{0,\pi\}$ , where $w(r)=0$ corresponds to positive values of the wavefunction $\psi(r)$ , and $w(r)=\pi$ to negative values of $\psi(r)$ . Thus, wherever the sign of $\psi(r)$ does not change, $w(r)$ is constant, and therefore $\|\nabla w(r)\|^{2}=0$ .

We are then left to consider the case when the sign of $\psi(r)$ flips, and therefore there is a discontinuity in $w(r)$ ; this occurs precisely where $\psi(r)=0$ . However, recall from Equation (24)

\mathcal{L}(\theta)=\langle\psi(\cdot;\theta)|H|\psi(\cdot;\theta)\rangle\,=\,% \mathbb{E}_{r\sim\rho(\cdot;\theta)}\left[\mathcal{E}_{r}(r;\theta)\right]

(87)

When $\psi(r)=0$ then $\rho(r)=0$ ; thus, samples where there is a discontinuity are never selected. We may therefore set the local energy at such values of $r$ to any value we wish, without affecting the value of $\mathcal{L}(\theta)$ . In particular, we are free to set $\|\nabla w(r)\|=0$ at such points. In conclusion, then, we have demonstrated that

\mathcal{E}_{r}(r;\theta)=-\tfrac{1}{2}\Delta_{r}q(r;\theta)-\tfrac{1}{2}\|% \nabla_{r}q(r;\theta)\|^{2}+V(r)

(88)

which is independent of the phase $w(r)$ .

Turning to the second part of the theorem, we note that

\mathcal{L}(\theta)\,=\,\mathbb{E}_{r\sim\rho(\cdot;\theta)}\left[\mathcal{E}_% {r}(r;\theta)\right]\,=\,\int\mathcal{E}_{r}(r;\theta)\rho(r;\theta)dr

(89)

so that

\nabla_{\theta}\mathcal{L}(\theta)\,=\,\int\left(\nabla_{\theta}\mathcal{E}_{r% }(r;\theta)\rho(r;\theta)+\mathcal{E}_{r}(r;\theta)\nabla_{\theta}\rho(r;% \theta)\right)dr

(90)

However, since $q(r;\theta)=\frac{1}{2}\log\rho(r;\theta)$ , we have $\nabla_{\theta}q(r;\theta)=\nabla_{\theta}\rho(r;\theta)/(2\rho(r;\theta))$ , or equivalently $\nabla_{\theta}\rho(r;\theta)=2\rho(r;\theta)\nabla_{\theta}q(r;\theta)$ . Plugging this in gives

$\displaystyle\nabla_{\theta}\mathcal{L}(\theta)$	$\displaystyle\,=\,\int\left(\nabla_{\theta}\mathcal{E}_{r}(r;\theta)\rho(r;% \theta)+2\mathcal{E}_{r}(r;\theta)\rho(r;\theta)\nabla_{\theta}q(r;\theta)% \right)dr$
	$\displaystyle\,=\,\int\rho(r;\theta)\left(\nabla_{\theta}\mathcal{E}_{r}(r;% \theta)+2\mathcal{E}_{r}(r;\theta)\nabla_{\theta}q(r;\theta)\right)dr$
	$\displaystyle\,=\,\mathbb{E}_{r\sim\rho(\cdot;\theta)}\left[\Omega(r;\theta)\right]$	(91)

where $\Omega(r;\theta)=\nabla_{\theta}\mathcal{E}_{r}(r;\theta)+2\mathcal{E}_{r}(r;% \theta)\nabla_{\theta}q(r;\theta)$ . ∎
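As a sanity check of the local energy formula, consider the hydrogen atom in atomic units, which is an example chosen here for illustration and not taken from the paper. With $\psi(r)=e^{-\|r\|}$ we have $q(r)=-\|r\|$ up to a constant and $V(r)=-1/\|r\|$ , so that $-\tfrac{1}{2}\Delta q-\tfrac{1}{2}\|\nabla q\|^{2}+V=\tfrac{1}{\|r\|}-\tfrac{1}{2}-\tfrac{1}{\|r\|}=-\tfrac{1}{2}$ at every point. The sketch below reproduces this with central finite differences standing in for auto-differentiation.

```python
import math

def q(r):
    """Log-amplitude of the hydrogen 1s state, psi = exp(-|r|), up to a constant."""
    return -math.sqrt(sum(x * x for x in r))

def potential(r):
    """Coulomb attraction to a unit-charge nucleus at the origin (atomic units)."""
    return -1.0 / math.sqrt(sum(x * x for x in r))

def grad_and_laplacian(f, r, h=1e-4):
    """Central finite differences for the gradient and Laplacian of f at r."""
    f0, grad, lap = f(r), [], 0.0
    for k in range(len(r)):
        plus, minus = list(r), list(r)
        plus[k] += h
        minus[k] -= h
        fp, fm = f(plus), f(minus)
        grad.append((fp - fm) / (2 * h))
        lap += (fp - 2 * f0 + fm) / (h * h)
    return grad, lap

def local_energy(r):
    """E_r = -1/2 Laplacian(q) - 1/2 |grad q|^2 + V."""
    grad, lap = grad_and_laplacian(q, r)
    return -0.5 * lap - 0.5 * sum(g * g for g in grad) + potential(r)

test_points = [(0.3, -0.4, 0.5), (1.0, 2.0, -0.5), (0.1, 0.1, 0.2)]
energies = [local_energy(r) for r in test_points]
```

Each entry of `energies` should be close to the exact ground state energy of $-\tfrac{1}{2}$ , independently of the evaluation point, as expected for an exact eigenfunction.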

Appendix N Optimization of the Objective Function

In order to optimize the objective in Equation (24), we use the procedure in Algorithm 2, which is specified for the discrete normalizing flow; the procedure for the continuous normalizing flow will be similar. Note that we initially sample a large number $K_{large}$ of samples from the base density; we emphasize that this step can be performed entirely offline, and does not entail additional computational complexity.

Algorithm 2 Computation of Ground State Wavefunction and Energy

Require: base log-density $q_{z}(\cdot)$ , normalizing flow $\{T^{\ell}(\cdot;\theta)\}_{\ell=0}^{L}$ , potential $V(\cdot)$ , learning rate $\epsilon$
sample $\mathcal{Z}=\left\{z^{(k)}\right\}_{k=1}^{K_{large}}$ for $z^{(k)}\sim\rho_{z}(\cdot)$ and $K_{large}$ a very large number of samples
take $q(r;\theta)$ from (23) and use auto-differentiation to compute $\nabla_{r}q(r;\theta)$ and $\Delta_{r}q(r;\theta)$
using $\nabla_{r}q(r;\theta)$ and $\Delta_{r}q(r;\theta)$ , compute $\mathcal{E}_{r}(r;\theta)$ from (25)
using auto-differentiation, compute the function $\Omega(r;\theta)$ as in (26)
initialize $\theta$ , e.g. using Xavier initialization
while not converged do
    sample $K$ values of $z^{(k)}$ from $\mathcal{Z}$
    compute $r^{(k)}=T(z^{(k)};\theta)$ using $T=T^{L}\circ\dots\circ T^{0}$
    compute the energy $E=\frac{1}{K}\sum_{k=1}^{K}\mathcal{E}_{r}(r^{(k)};\theta)$
    compute the gradient $g=\frac{1}{K}\sum_{k=1}^{K}\Omega(r^{(k)};\theta)$
    take $\theta\leftarrow\theta-\epsilon g$
end while
return $E$ , $\theta$
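The structure of the optimization loop can be illustrated on a toy problem that admits closed-form expressions. Everything specific in the sketch below is an assumption made for illustration: a single particle in one dimension with $V(r)=\tfrac{1}{2}r^{2}$ , a standard normal base density, a one-parameter flow $r=sz$ (so $\theta=s$ and $\rho(\cdot;\theta)$ is a zero-mean Gaussian with standard deviation $s$ ), and hand-derived derivatives in place of auto-differentiation. The exact ground state has $s^{2}=\tfrac{1}{2}$ and energy $\tfrac{1}{2}$ .

```python
import math
import random

def local_energy(r, s):
    """E_r for rho = N(0, s^2), q = 0.5 * log(rho), V(r) = 0.5 * r^2."""
    lap_q = -1.0 / (2 * s * s)
    grad_q = -r / (2 * s * s)
    return -0.5 * lap_q - 0.5 * grad_q ** 2 + 0.5 * r * r

def omega(r, s):
    """Omega = dE_r/ds + 2 * E_r * dq/ds, with derivatives taken by hand."""
    d_energy = -1.0 / (2 * s ** 3) + r * r / (2 * s ** 5)
    d_q = r * r / (2 * s ** 3) - 1.0 / (2 * s)
    return d_energy + 2.0 * local_energy(r, s) * d_q

rng = random.Random(0)
pool = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # offline base samples Z

s, lr, K = 1.0, 0.1, 500  # initial parameter, learning rate, batch size
for _ in range(300):
    batch = [s * z for z in rng.sample(pool, K)]           # r = T(z; theta)
    energy = sum(local_energy(r, s) for r in batch) / K    # estimate of E
    grad = sum(omega(r, s) for r in batch) / K             # estimate of g
    s -= lr * grad                                         # gradient step
```

After the loop, `energy` should be close to $\tfrac{1}{2}$ and `s * s` close to $\tfrac{1}{2}$ , up to stochastic fluctuations from the finite batch size.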

Appendix O Proof of Theorem 9

Theorem.

Let $\rho_{0}(r)$ be the density for the ground state wavefunction. Let $\prec$ be a strict total order on $\mathbb{R}^{D}$ , and define the set

\mathcal{R}=\{r=(r_{1},\dots r_{n}):r_{1}\prec r_{2}\prec\dots\prec r_{n_{u}}% \,\,\text{ and }\,\,r_{n_{u}+1}\prec r_{n_{u}+2}\prec\dots\prec r_{n}\}

For any $r$ with $r_{i}\neq r_{j}$ for all $i\neq j$ , define the permutation ${\pi_{\prec}(r)}\in\mathbb{G}$ by ${\pi_{\prec}(r)}r\in\mathcal{R}$ . Then a valid antisymmetric ground state wavefunction is given by

\psi_{0}(r)=\begin{cases}(-1)^{\pi_{\prec}(r)}\sqrt{\rho_{0}(r)}&\text{if }r_{% i}\neq r_{j}\,\,\forall i,j\\ 0&\text{otherwise}\end{cases}

Proof.

We begin by noting that the set $\mathcal{R}$ contains the spin-up electrons in ascending order, according to the ordering relation $\prec$ , and the spin-down electrons also in ascending order. First, consider the case of $r$ for which $r_{i}=r_{j}$ for some pair of electrons $i$ and $j$ ; in this case, $\psi_{0}(r)=0$ , as is required by antisymmetry. Next, consider the case of $r$ for which $r_{i}\neq r_{j}\,\,\forall i,j$ . In this case, for any permutation $\pi\in\mathbb{G}$ we have that

\psi_{0}(\pi r)=(-1)^{\pi_{\prec}(\pi r)}\sqrt{\rho_{0}(\pi r)}

(92)

However, recall that ${\pi_{\prec}(r)}$ is defined by

{\pi_{\prec}(r)}r\in\mathcal{R}

(93)

Therefore, $\pi_{\prec}(\pi r)$ is defined by

\pi_{\prec}(\pi r)\pi r\in\mathcal{R}

(94)

Comparing the latter two equations, we see that

\pi_{\prec}(\pi r)\pi={\pi_{\prec}(r)}\qquad\Rightarrow\qquad\pi_{\prec}(\pi r% )={\pi_{\prec}(r)}\pi^{-1}

(95)

Furthermore, we know that as $\rho_{0}(r)$ is the density for the ground state wavefunction, it must satisfy property (D1) of Theorem 2, namely it must be $\mathbb{G}$ -invariant; therefore, we must have that

\rho_{0}(\pi r)=\rho_{0}(r)

(96)

Plugging Equations (95) and (96) into (92) gives

$\displaystyle\psi_{0}(\pi r)$	$\displaystyle=(-1)^{{\pi_{\prec}(r)}\pi^{-1}}\sqrt{\rho_{0}(r)}$
	$\displaystyle=(-1)^{\pi}(-1)^{\pi_{\prec}(r)}\sqrt{\rho_{0}(r)}$
	$\displaystyle=(-1)^{\pi}\psi_{0}(r)$	(97)

where in the second line, we have used the facts that $(-1)^{\pi_{a}\pi_{b}}=(-1)^{\pi_{a}}(-1)^{\pi_{b}}$ ; and that $(-1)^{\pi^{-1}}=(-1)^{\pi}$ . But Equation (97) is exactly the antisymmetry property we desire, and so we have completed the proof.

Finally, we note that $\psi_{0}(r)>0$ for $r\in\mathcal{R}$ ; this is an arbitrary choice, and we could have equally well defined a second ground state wavefunction $\tilde{\psi}_{0}$ with $\tilde{\psi}_{0}(r)<0$ for $r\in\mathcal{R}$ . It is easy to see that in this case, $\tilde{\psi}_{0}(r)=-\psi_{0}(r)$ for all $r$ . However, this is not surprising: either $\psi_{0}$ or $-\psi_{0}$ may be taken as an eigenfunction of $H$ , as eigenfunctions are only defined up to sign. ∎
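The construction can be made concrete as follows. The sketch assumes lexicographic order on coordinate tuples as the strict total order $\prec$ , and uses a toy $\mathbb{G}$ -invariant function `rho0` that is only a placeholder and not an actual ground state density; the sign is obtained as the parity of the permutation that sorts each spin block.

```python
import math

def parity(perm):
    """Sign (+1 or -1) of a permutation given as a list of indices."""
    seen, sign = [False] * len(perm), 1
    for i in range(len(perm)):
        if seen[i]:
            continue
        j, length = i, 0
        while not seen[j]:
            seen[j] = True
            j = perm[j]
            length += 1
        if length % 2 == 0:  # a cycle of even length is an odd permutation
            sign = -sign
    return sign

def sorting_sign(r, n_up):
    """Parity of the permutation that sorts each spin block of r."""
    up = sorted(range(n_up), key=lambda i: r[i])
    down = sorted(range(n_up, len(r)), key=lambda i: r[i])
    return parity(up + down)

def psi0(r, n_up, rho0):
    """Antisymmetric function built from a G-invariant density rho0."""
    if len(set(r)) < len(r):  # some r_i = r_j
        return 0.0
    return sorting_sign(r, n_up) * math.sqrt(rho0(r))

def rho0(r):
    """Placeholder G-invariant density (unnormalised Gaussian)."""
    return math.exp(-sum(x * x for ri in r for x in ri))

r = [(0.1, 0.2, 0.3), (-0.5, 0.0, 0.4), (0.9, 0.1, -0.2)]  # two up, one down
swapped = [r[1], r[0], r[2]]                               # swap the spin-up pair
value, value_swapped = psi0(r, 2, rho0), psi0(swapped, 2, rho0)
```

Swapping the two spin-up electrons should flip the sign, i.e. `value_swapped` equals `-value` up to rounding.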

Appendix P Proof of Theorem 10

Theorem.

Let the set of distances be given by $\delta^{\ell}=\left\{\delta_{ij}^{\ell}\right\}_{i<j}$ where $\delta_{ij}^{\ell}=\|r_{i}^{\ell}-r_{j}^{\ell}\|$ . Given a layer of the form

r_{i}^{\ell+1}=\Theta^{\ell}(\delta^{\ell};\theta)\,r_{i}^{\ell}+t^{\ell}(% \delta^{\ell};\theta)\qquad\text{with }\Theta^{\ell}(\delta^{\ell};\theta)\in O% (D)\text{ and }t^{\ell}(\delta^{\ell};\theta)\in\mathbb{R}^{D}

Then the layer is both $\mathbb{G}$ -equivariant and invertible.

Proof.

Let us begin with invertibility. We may compute the inter-electron distances at layer ${\ell+1}$ :

$\displaystyle\delta_{ij}^{\ell+1}$	$\displaystyle=\|r_{i}^{\ell+1}-r_{j}^{\ell+1}\|$
	$\displaystyle=\|\Theta^{\ell}(\delta^{\ell};\theta)\,r_{i}^{\ell}+t^{\ell}(\delta^{\ell};\theta)-\Theta^{\ell}(\delta^{\ell};\theta)\,r_{j}^{\ell}-t^{\ell}(\delta^{\ell};\theta)\|$
	$\displaystyle=\|\Theta^{\ell}(\delta^{\ell};\theta)(r_{i}^{\ell}-r_{j}^{\ell})\|=\|r_{i}^{\ell}-r_{j}^{\ell}\|=\delta_{ij}^{\ell}$	(98)

where the third line holds since $\Theta^{\ell}(\delta^{\ell};\theta)\in O(D)$ . That is, since we are rotating and translating all of the electrons with the same rotation matrix and translation vector, the inter-electron distances are preserved. As a result, the inverse is simply

	$\displaystyle r_{i}^{\ell}$	$\displaystyle=\Theta^{\ell}(\delta^{\ell};\theta)^{-1}\,(r_{i}^{\ell+1}-t^{% \ell}(\delta^{\ell};\theta))$
		$\displaystyle=\Theta^{\ell}(\delta^{\ell+1};\theta)^{T}\,(r_{i}^{\ell+1}-t^{% \ell}(\delta^{\ell+1};\theta))$		(99)

where we have used the fact that for an orthogonal matrix, $\Theta^{-1}=\Theta^{T}$ , together with $\delta^{\ell+1}=\delta^{\ell}$ from Equation (98). Note that all of the arguments on the right-hand side of the equation depend only on quantities from layer ${\ell+1}$ , as desired.

Having established invertibility, let us turn to $\mathbb{G}$ -equivariance. Let $\pi\in\mathbb{G}$ , and denote the layer by $r^{\ell+1}=Q(r^{\ell})$ , so that $r_{i}^{\ell+1}=Q_{i}(r^{\ell})$ . Note that since $\delta^{\ell}$ is the set of distances, we have that $\pi\delta^{\ell}=\delta^{\ell}$ : a set is inherently unordered, and therefore is unaffected by permutations. Then we have that

$\displaystyle Q_{i}(\pi r^{\ell})$	$\displaystyle=\Theta^{\ell}(\pi\delta^{\ell};\theta)\,r_{\pi(i)}^{\ell}+t^{% \ell}(\pi\delta^{\ell};\theta)$
	$\displaystyle=\Theta^{\ell}(\delta^{\ell};\theta)\,r_{\pi(i)}^{\ell}+t^{\ell}(% \delta^{\ell};\theta)$
	$\displaystyle=Q_{\pi(i)}(r^{\ell})$	(100)

so that $Q(\pi r^{\ell})=\pi Q(r^{\ell})$ , as desired. ∎

Appendix Q Implementation of the Electron-Electron Cusp Layer

Recall from Equation (30) that the network must be a function of the set of inter-electron distances $\delta^{\ell}$ . Using multihead attention will be inefficient, as we must apply it to all pairs of electrons, leading to quartic complexity. Instead, we propose the following Deep Set (Zaheer et al., 2017) style layer:

1. MLP Per Electron Pair: Apply the same Multilayer Perceptron $\eta^{\ell}$ to each electron pair individually:

\zeta_{ij}^{\ell}=\eta^{\ell}(\delta_{ij}^{\ell})\quad\text{for all }i<j

(101)

2. Average: Form the average value: $\bar{\zeta}^{\ell}=\frac{1}{\frac{1}{2}n(n-1)}\sum_{i<j}\zeta_{ij}^{\ell}$ .

3. Overall MLP: Apply a Multilayer Perceptron $\hat{\eta}^{\ell}$ to the average:

\bar{\zeta}^{\ell}\leftarrow\hat{\eta}^{\ell}(\bar{\zeta}^{\ell})

(102)

The output should be of dimension $D^{2}+D$ , which is equal to $12$ when $D=3$ .

4. Split into Rotation and Translation:

$\displaystyle t^{\ell}$	$\displaystyle=\text{First $D$ components of }\bar{\zeta}^{\ell}$
$\displaystyle A^{\ell}$	$\displaystyle=\text{Last $D^{2}$ components of }\bar{\zeta}^{\ell},\text{ reshaped into a $D\times D$ matrix}$	(103)
$\displaystyle B^{\ell}$	$\displaystyle=A^{\ell}-(A^{\ell})^{T},\text{ a skew-symmetric matrix}$
$\displaystyle\Theta^{\ell}$	$\displaystyle=\exp(B^{\ell}),\text{ using the matrix exponential}$

Notes:

•

The reason we parameterize the rotation as an exponential of a skew-symmetric matrix is so that the layer can effectively be a residual-style layer: if we choose $A^{\ell}=0$ and $t^{\ell}=0$ , then we recover $r_{i}^{\ell+1}=r_{i}^{\ell}$ . (This is harder if we use a rotation matrix directly, as the identity transformation $r_{i}^{\ell+1}=r_{i}^{\ell}$ is only recovered if $\Theta^{\ell}=I$ , which is harder to achieve.)
•

It is proposed to use one such layer, or a very small number of such layers, somewhere near the beginning of the flow. The work of incorporating the cusps in the appropriate manner can then be performed by subsequent layers.
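A compact sketch of such a layer for $D=3$ is given below. The two MLPs are replaced by fixed toy functions (`tanh` per pair and a set of sines for the overall map), which are stand-ins chosen for illustration only; the matrix exponential of the skew-symmetric matrix is evaluated in closed form with Rodrigues' formula, which is valid in three dimensions.

```python
import math

def distances(r):
    """Inter-electron distances delta_ij for i < j."""
    return [math.dist(r[i], r[j]) for i in range(len(r)) for j in range(i + 1, len(r))]

def expm_skew(b):
    """exp(B) for a 3x3 skew-symmetric B via Rodrigues' formula."""
    w = (b[2][1], b[0][2], b[1][0])
    th = math.sqrt(sum(x * x for x in w))
    eye = [[float(i == j) for j in range(3)] for i in range(3)]
    if th < 1e-12:
        return eye
    b2 = [[sum(b[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
    c1, c2 = math.sin(th) / th, (1 - math.cos(th)) / th ** 2
    return [[eye[i][j] + c1 * b[i][j] + c2 * b2[i][j] for j in range(3)] for i in range(3)]

def rotation_and_translation(delta):
    """Toy stand-in for the two MLPs acting on the set of distances."""
    zbar = sum(math.tanh(d) for d in delta) / len(delta)  # per-pair map, then average
    out = [math.sin((k + 1) * zbar) for k in range(12)]   # D^2 + D = 12 outputs
    t, a = out[:3], [out[3:6], out[6:9], out[9:12]]
    b = [[a[i][j] - a[j][i] for j in range(3)] for i in range(3)]  # skew-symmetric
    return expm_skew(b), t

def forward(r):
    """r_i -> Theta r_i + t, with Theta and t computed from the distances."""
    theta, t = rotation_and_translation(distances(r))
    return [tuple(sum(theta[i][k] * ri[k] for k in range(3)) + t[i] for i in range(3))
            for ri in r]

def inverse(r_next):
    """r_i = Theta^T (r_i' - t); the distances of r_next equal those of r."""
    theta, t = rotation_and_translation(distances(r_next))
    return [tuple(sum(theta[k][i] * (ri[k] - t[k]) for k in range(3)) for i in range(3))
            for ri in r_next]

r = [(0.1, 0.2, 0.3), (-0.5, 0.0, 0.4), (0.9, 0.1, -0.2)]
r_next = forward(r)
back = inverse(r_next)
```

The inverse only uses quantities available at layer $\ell+1$ : the distances are recomputed from `r_next`, and `back` should match `r` up to rounding.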

Appendix R Electron-Nuclear Cusps

It is also known that the gradient of the wavefunction should exhibit a discontinuity when an electron and nucleus coincide. As in the case of electron-electron cusps, we may treat this by incorporating the electron-nuclear distances directly; we may design our layer exactly analogously to the electron-electron cusp layer, with one main caveat: to preserve invertibility, we can only deal with a single nucleus at a time. In particular, for a given nucleus $I$ with position $R_{I}$ , let $\delta_{I}^{\ell}=\left\{\delta_{iI}^{\ell}\right\}_{i=1}^{n}$ with $\delta_{iI}^{\ell}=\|r_{i}^{\ell}-R_{I}\|$ . Then the layer looks like

r_{i}^{\ell+1}=\Theta^{\ell}(\delta_{I}^{\ell};\theta)\,(r_{i}^{\ell}-R_{I})+R% _{I}

(104)

Note in the above that only the rotation matrix is parameterized, and the translation vector is fixed. We must include one such layer for each nucleus $I$ .

Appendix S Proof of Theorem 11

Theorem.

\Gamma_{t}(v;\pi R,\pi Z)=\Gamma_{t}(v;R,Z)\,\,\forall\pi\in\mathbb{S}_{N}\qquad\Gamma_{t}(\Theta v;\Theta R,Z)=\Theta\Gamma_{t}(v;R,Z)\,\,\forall\Theta\in O(D)

Proof.

Let us first consider permutation invariance, i.e. Equation (31). Let $r$ be produced by solving the flow

dv/dt=\Gamma_{t}(v;R,Z)\text{ with }v(0)=z\sim\rho_{z}(\cdot)\text{ and }r=v(1)

(105)

Consider a permutation $\pi$ on the nuclei, and let $\tilde{r}$ be the resulting electronic positions. Then $\tilde{r}$ is produced by solving the flow

$$d\tilde{v}/dt=\Gamma_{t}(\tilde{v};\pi R,\pi Z)\text{ with }\tilde{v}(0)=z\sim\rho_{z}(\cdot)\text{ and }\tilde{r}=\tilde{v}(1)\tag{106}$$

However, we know that $\Gamma_{t}(\tilde{v};\pi R,\pi Z)=\Gamma_{t}(\tilde{v};R,Z)$ . Thus, $\tilde{r}$ is given by

$$d\tilde{v}/dt=\Gamma_{t}(\tilde{v};R,Z)\text{ with }\tilde{v}(0)=z\sim\rho_{z}(\cdot)\text{ and }\tilde{r}=\tilde{v}(1)\tag{107}$$

which is precisely equivalent to the equation for $r$ ; thus $\tilde{r}=r$ , i.e. the random variables representing the electronic positions are identical in both cases. Thus, their distributions must be equal: $\rho(r;\pi R,\pi Z)=\rho(r;R,Z)$ , so Equation (31) is established.

Let us now turn to joint rotation invariance, i.e. Equation (33). As we know that $\Gamma_{t}$ satisfies rotation equivariance, i.e. $\Gamma_{t}(\Theta v;\Theta R,Z)=\Theta\Gamma_{t}(v;R,Z)$ , we may apply Theorems 1 and 2 from (Köhler et al., 2020) (noting that $R$ is irrelevant for the flow, which is entirely in $v$ ). This yields immediately that $\rho(\Theta r;\Theta R,Z)=\rho(r;R,Z)$ , so Equation (33) is established. ∎
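The two sample-level facts used in this proof can be checked on a toy example. The sketch below assumes a hand-written vector field `field` (a hypothetical stand-in for $\Gamma_{t}$ , chosen only because it satisfies both conditions of the theorem) in $D=2$ , integrated with forward Euler. It checks the flow map only; the statement about densities additionally relies on the base density, as in the cited theorems.

```python
import math

def field(v, R, Z):
    """Toy Gamma_t: each electron is pulled toward every nucleus, weighted by Z_I.
    Summing over nuclei makes it invariant to their ordering; using only relative
    vectors and distances makes it equivariant under joint rotations of v and R."""
    out = []
    for x, y in v:
        fx = fy = 0.0
        for (Rx, Ry), z in zip(R, Z):
            dx, dy = Rx - x, Ry - y
            w = z / (1.0 + dx * dx + dy * dy)
            fx += w * dx
            fy += w * dy
        out.append((fx, fy))
    return out

def flow(z0, R, Z, steps=50):
    """Forward Euler integration of dv/dt = Gamma(v; R, Z) from t=0 to t=1."""
    v, h = list(z0), 1.0 / steps
    for _ in range(steps):
        g = field(v, R, Z)
        v = [(x + h * gx, y + h * gy) for (x, y), (gx, gy) in zip(v, g)]
    return v
```

Relabelling the nuclei leaves `flow(z0, R, Z)` unchanged, and rotating both `z0` and `R` by the same $\Theta$ rotates the output by $\Theta$ , up to floating-point error.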

Appendix T Proof of Theorem 12

Theorem.

$$\Gamma_{t}(v;R,Z)=\zeta\phi_{t}(\zeta^{-1}v;\zeta^{-1}R,Z)\qquad\text{where}\qquad\zeta=\omega_{t}(v;R,Z)$$

satisfies the properties in Equation (34) and is $\mathbb{G}$ -equivariant with respect to $v$ .

Proof.

Let us begin with the first condition in Equation (34); namely, we wish to show that $\Gamma_{t}(v;\pi R,\pi Z)=\Gamma_{t}(v;R,Z)$ . We use tildes to denote the variables after the permutation $\pi$ has been applied. Thus,

$$\tilde{\zeta}=\omega_{t}(v;\pi R,\pi Z)=\omega_{t}(v;R,Z)=\zeta\tag{108}$$

where we have used the fact that $\omega_{t}$ is permutation-invariant jointly with respect to $R$ and $Z$ . Then

$$
\begin{aligned}
\Gamma_{t}(v;\pi R,\pi Z)&=\tilde{\zeta}\phi_{t}(\tilde{\zeta}^{-1}v;\tilde{\zeta}^{-1}\pi R,\pi Z)\\
&=\zeta\phi_{t}(\zeta^{-1}v;\zeta^{-1}\pi R,\pi Z)\\
&=\zeta\phi_{t}(\zeta^{-1}v;\pi\zeta^{-1}R,\pi Z)\\
&=\zeta\phi_{t}(\zeta^{-1}v;\zeta^{-1}R,Z)\\
&=\Gamma_{t}(v;R,Z)
\end{aligned}
\tag{109}
$$

where in the second line we have used the fact that $\tilde{\zeta}=\zeta$ ; in the third line, the fact that the operation of applying an identical rotation to a list of vectors commutes with a permutation applied to that list of vectors; and in the fourth line, the fact that $\phi_{t}$ is permutation-invariant jointly with respect to $R$ and $Z$ . We have thus established the first condition in Equation (34).

Now let us turn to the second condition in Equation (34), that is we need to show that $\Gamma_{t}(\Theta v;\Theta R,Z)=\Theta\Gamma_{t}(v;R,Z)$ . We have that

$$\tilde{\zeta}=\omega_{t}(\Theta v;\Theta R,Z)=\Theta\omega_{t}(v;R,Z)=\Theta\zeta\tag{110}$$

where we have used the fact that $\omega_{t}$ is $O(D)$ -equivariant jointly with respect to $v$ and $R$ . Then

$$
\begin{aligned}
\Gamma_{t}(\Theta v;\Theta R,Z)&=\tilde{\zeta}\phi_{t}(\tilde{\zeta}^{-1}\Theta v;\tilde{\zeta}^{-1}\Theta R,Z)\\
&=\Theta\zeta\phi_{t}(\zeta^{-1}\Theta^{-1}\Theta v;\zeta^{-1}\Theta^{-1}\Theta R,Z)\\
&=\Theta\zeta\phi_{t}(\zeta^{-1}v;\zeta^{-1}R,Z)\\
&=\Theta\Gamma_{t}(v;R,Z)
\end{aligned}
\tag{111}
$$

as desired.

Finally, let us turn to demonstrating the $\mathbb{G}$ -equivariance of $\Gamma_{t}$ with respect to $v$ . Let $g\in\mathbb{G}$ ; then we have that

$$\tilde{\zeta}=\omega_{t}(gv;R,Z)=\omega_{t}(v;R,Z)=\zeta\tag{112}$$

where we have used the fact that $\omega_{t}$ is $\mathbb{G}$ -invariant with respect to $v$ . Then

$$
\begin{aligned}
\Gamma_{t}(gv;R,Z)&=\tilde{\zeta}\phi_{t}(\tilde{\zeta}^{-1}gv;\tilde{\zeta}^{-1}R,Z)\\
&=\zeta\phi_{t}(\zeta^{-1}gv;\zeta^{-1}R,Z)\\
&=\zeta\phi_{t}(g\zeta^{-1}v;\zeta^{-1}R,Z)\\
&=\zeta g\phi_{t}(\zeta^{-1}v;\zeta^{-1}R,Z)\\
&=g\zeta\phi_{t}(\zeta^{-1}v;\zeta^{-1}R,Z)\\
&=g\Gamma_{t}(v;R,Z)
\end{aligned}
\tag{113}
$$

where in the second line we have used the fact that $\tilde{\zeta}=\zeta$ ; in the third and fifth lines, the fact that the operation of applying an identical rotation to a list of vectors commutes with a permutation applied to that list of vectors; and in the fourth line, the fact that $\phi_{t}$ is $\mathbb{G}$ -equivariant with respect to $v$ . This completes the proof. ∎
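The algebra behind the second condition can be checked numerically. The sketch below works in $D=2$ and assumes a deliberately non-equivariant stand-in `phi` for $\phi_{t}$ ; the frame $\zeta$ is passed in explicitly, so that the transformed frame $\Theta\zeta$ of Equation (110) is supplied by hand rather than produced by a network. All function names are illustrative.

```python
import math

def matvec(M, p):
    return (M[0][0] * p[0] + M[0][1] * p[1], M[1][0] * p[0] + M[1][1] * p[1])

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(M):
    return [[M[0][0], M[1][0]], [M[0][1], M[1][1]]]

def phi(v, R, Z):
    """Arbitrary stand-in for phi_t: deliberately NOT rotation-equivariant.
    It treats every electron identically and sums over nuclei."""
    s = sum(z * Rx for (Rx, _), z in zip(R, Z))
    return [(math.sin(x) + s, x * y) for x, y in v]

def gamma(v, R, Z, zeta):
    """Gamma_t = zeta phi_t(zeta^{-1} v; zeta^{-1} R, Z), with zeta passed in."""
    zinv = transpose(zeta)  # inverse of an orthogonal matrix is its transpose
    local = phi([matvec(zinv, p) for p in v], [matvec(zinv, p) for p in R], Z)
    return [matvec(zeta, q) for q in local]
```

Evaluating `gamma` on $(\Theta v,\Theta R)$ with frame $\Theta\zeta$ gives $\Theta$ applied to `gamma(v, R, Z, zeta)`, even though `phi` alone does not commute with $\Theta$ : the inputs seen by `phi` are identical in both cases.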

Appendix U Implementation of Continuous Normalizing Flow for Multiple Molecules

We must implement both networks mentioned in Theorem 12: the functions $\phi_{t}$ and $\omega_{t}$ . The function $\phi_{t}$ is $\mathbb{G}$ -equivariant, so we may use the general recipe described in Appendix K; however, it has two additional properties: it depends on both $R$ and $Z$ , and it must be permutation-invariant jointly with respect to these two variables. Therefore, the following minor modification may be made to the recipe described in Appendix K (noting that the notation changes slightly: we no longer have layers $\ell$ , since the flow is continuous, and we replace the variables $\gamma_{\alpha,i}^{\ell}$ with $v_{\alpha,i}$ ). We compute a Deep Set (Zaheer et al., 2017) function on $R,Z$ , i.e. on the inputs $\{(R_{I},Z_{I})\}$ ; the output of this function is permutation-invariant by construction. This output is then fed into the Fully Connected Layer with Spin Mixing as an extra input. An alternative to the Deep Set approach is to apply a transformer to $R,Z$ , where each token is the pair $(R_{I},Z_{I})$ , and then apply an averaging step at the end; this will also produce a permutation-invariant function.
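As a minimal sketch of the Deep Set idea, the code below embeds each pair $(R_{I},Z_{I})$ , sums the embeddings over nuclei, and applies an outer map. The functions `element_features` and `deep_set`, and the fixed feature maps inside them, are hypothetical stand-ins for learned networks.

```python
import math

def element_features(R_I, Z_I):
    """Hypothetical per-nucleus embedding (stand-in for a learned network)."""
    x, y, z = R_I
    return (Z_I, Z_I * x, Z_I * y, Z_I * z, math.tanh(x * x + y * y + z * z))

def deep_set(R, Z):
    """Sum-pool the per-nucleus features, then apply an outer map."""
    feats = [element_features(R_I, Z_I) for R_I, Z_I in zip(R, Z)]
    # Pooling by summation removes any dependence on the order of the nuclei.
    pooled = [math.fsum(column) for column in zip(*feats)]
    return [math.tanh(0.1 * p) for p in pooled]
```

Reordering the nuclei, with $R$ and $Z$ permuted together, leaves the output unchanged; this summary vector is what would be passed to the Fully Connected Layer with Spin Mixing as an extra input.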

In order to implement the function $\omega_{t}$ , recall that its output is a rotation matrix. Furthermore, it is $\mathbb{G}$ -invariant in $v$ ; $O(D)$ -equivariant with respect to $v$ and $R$ jointly; and permutation-invariant with respect to $R$ and $Z$ jointly. We may use an EGNN architecture (Satorras et al., 2021) jointly on electrons and nuclei. In the EGNN:

• The positions of the electrons and nuclei are initialized as $v$ and $R$ respectively.
• The hidden vectors of the electrons and nuclei are initialized in order to encode two things:
  1. Whether the vertex corresponds to an electron or a nucleus.
  2. Properties of the vertex: (a) in the case of an electron, whether the spin is up or down; (b) in the case of a nucleus, the atomic number $Z_{I}$ .
  This encoding can be achieved via combining one-hot vectors with linear projections of varying dimensionalities.

For each of the $D$ final layers of the EGNN, one may then take the position vectors for that layer and form an average over them; this yields a total of $D$ new vectors. These $D$ vectors are clearly $\mathbb{G}$ -invariant in $v$ , as reordering within spins does not matter; permutation-invariant in $R$ and $Z$ jointly; and $O(D)$ -equivariant with respect to $v$ and $R$ jointly, by the built-in equivariance properties of EGNNs. We then take these $D$ vectors, and perform Gram-Schmidt on them to obtain a rotation matrix $\Theta$ , noting that Gram-Schmidt retains the equivariance property. A similar idea is discussed in (Kaba et al., 2023). This completes the implementation.
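The pooling and Gram-Schmidt step can be sketched without an EGNN. The code below assumes $D=3$ and replaces the averaged EGNN position vectors with a hypothetical function `pooled_vectors`, which is built by hand to have the same symmetry properties (it is not an EGNN and ignores spins and atomic numbers). It then orthonormalizes the result with classical Gram-Schmidt.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gram_schmidt(vectors):
    """Classical Gram-Schmidt; returns orthonormal vectors (the columns of Theta)."""
    basis = []
    for u in vectors:
        w = list(u)
        for e in basis:
            c = dot(u, e)
            w = [wi - c * ei for wi, ei in zip(w, e)]
        norm = math.sqrt(dot(w, w))
        basis.append([wi / norm for wi in w])
    return basis

def pooled_vectors(v, R, D=3):
    """Hypothetical stand-in for averaging the positions of the last D EGNN layers:
    the k-th vector is the mean of |p - c|^k (p - c) over all particles p,
    where c is the centroid of all electrons and nuclei."""
    pts = list(v) + list(R)
    c = [sum(p[i] for p in pts) / len(pts) for i in range(D)]
    rel = [[p[i] - c[i] for i in range(D)] for p in pts]
    return [[sum(math.sqrt(dot(q, q)) ** k * q[i] for q in rel) / len(rel)
             for i in range(D)]
            for k in range(1, D + 1)]
```

For generic inputs, `gram_schmidt(pooled_vectors(v, R))` is orthonormal, is unchanged when the electrons are reordered, and is mapped to $\Theta$ times itself when $v$ and $R$ are both transformed by an orthogonal $\Theta$ . The sketch assumes the pooled vectors are linearly independent; a degenerate configuration would make the normalization fail.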

$$
\begin{aligned}
\frac{\langle\psi|H|\psi\rangle}{\langle\psi|\psi\rangle}
&=\text{Real}\left\{\frac{\langle\psi|H|\psi\rangle}{\langle\psi|\psi\rangle}\right\}\\
&=\text{Real}\left\{\frac{\int\psi^{*}(x)H\psi(x)dx}{\langle\psi|\psi\rangle}\right\}\\
&=\frac{1}{\langle\psi|\psi\rangle}\int\text{Real}\left\{\psi^{*}(x)H\psi(x)\right\}dx\\
&=\frac{1}{\langle\psi|\psi\rangle}\int\text{Real}\left\{\frac{\psi(x)}{\psi(x)}\psi^{*}(x)H\psi(x)\right\}dx\\
&=\frac{1}{\langle\psi|\psi\rangle}\int\text{Real}\left\{|\psi(x)|^{2}\frac{H\psi(x)}{\psi(x)}\right\}dx\\
&=\int\text{Real}\left\{\frac{H\psi(x)}{\psi(x)}\right\}\frac{|\psi(x)|^{2}}{\langle\psi|\psi\rangle}dx\\
&=\int\mathcal{E}_{r}(x)\rho(x)dx\\
&=\mathbb{E}_{x\sim\rho(\cdot)}\left[\mathcal{E}_{r}(x)\right]\\
&\approx\frac{1}{K}\sum_{k=1}^{K}\mathcal{E}_{r}\left(x^{(k)}\right)
\end{aligned}
\tag{38}
$$

$$
\begin{aligned}
\delta_{ij}^{\ell+1}&=\|r_{i}^{\ell+1}-r_{j}^{\ell+1}\|\\
&=\|\Theta^{\ell}(\delta^{\ell};\theta)\,r_{i}^{\ell}+t^{\ell}(\delta^{\ell};\theta)-\Theta^{\ell}(\delta^{\ell};\theta)\,r_{j}^{\ell}-t^{\ell}(\delta^{\ell};\theta)\|\\
&=\|\Theta^{\ell}(\delta^{\ell};\theta)(r_{i}^{\ell}-r_{j}^{\ell})\|=\|r_{i}^{\ell}-r_{j}^{\ell}\|=\delta_{ij}^{\ell}
\end{aligned}
\tag{98}
$$
