\jyear

2022

[1]\fnmShihao \surShao

[1]\fnmQinghua \surCui

1]\orgdivDepartment of Biomedical Informatics, State Key Laboratory of Vascular Homeostasis and Remodeling, School of Basic Medical Sciences, \orgnamePeking University, \orgaddress\cityBeijing, \postcode100191, \countryChina

2]\orgdivSchool of Electronics Engineering and Computer Science, \orgnamePeking University, \orgaddress\cityBeijing, \postcode100871, \countryChina

FreeCG: Free the Design Space of Clebsch–Gordan Transform for machine learning force field

[email protected] (S. S.) \fnmHaoran \surGeng [email protected] (Q. C.)

Abstract

The Clebsch–Gordan Transform (CG transform) effectively encodes many-body interactions. Many studies have demonstrated its accuracy in depicting atomic environments, although this comes at a high computational cost. This cost is hard to reduce because the CG transform layer must remain permutation equivariant, which limits its design space. We show that implementing the CG transform layer on permutation-invariant inputs allows complete freedom in the design of this layer without affecting symmetry. Building on this premise, we create a CG transform layer that operates on permutation-invariant abstract edges generated from real edge information. We introduce a group CG transform with sparse paths, abstract edges shuffling, and an attention enhancer to form a powerful and efficient CG transform layer. Our method, FreeCG, achieves state-of-the-art (SoTA) results in force prediction on MD17, rMD17, and MD22, and in property prediction on QM9, with notable improvements. It introduces a novel paradigm for carrying out efficient and expressive CG transforms in future geometric neural network designs.

keywords:

Group Equivariance; Tensor Product; Irreducible Representation; Machine Learning Force Field

1 Introduction

Accurate modelling of molecular force fields is of great importance for drug development amaro2018drug ; das2021drug ; chen2024design , materials science zepeda2017probing ; liu2024layer , chemical reaction kinetics zeng2020complex ; meuwly2021machine , and nanotechnology srivastava2021recent ; wang2023novel , among others. Density Functional Theory (DFT) kohn1965self and other ab initio methods martin2020electronic ; ceperley1980ground ; bartlett2007coupled demonstrate excellent precision but require intensive computational resources, which severely limits their use for many-body systems burke2012perspective ; jones2015density ; cohen2008insights . Classical force fields are cheap but do not offer the same level of precision lindorff2010improved ; brooks2009charmm . Machine Learning Force Fields (MLFFs) cui2024geometry ; wang2023quinnet ; wang2024enhancing ; musaelian2023learning ; batzner20223 ; drautz2019atomic ; batatia2022mace ; tholke2021equivariant ; schutt2018schnet ; chmiela2017machine offer a satisfying trade-off between accuracy and efficiency: they are expected to be as accurate as DFT or other high-accuracy references, but with orders-of-magnitude speedup.

Kernel-based methods chmiela2017machine ; drautz2019atomic are the starting point of MLFFs. sGDML chmiela2017machine introduces the properties of conservative fields into MLFFs, namely, the negative partial derivative of the predicted energy with respect to the atom positions is taken as the predicted force. Subsequently, attention shifted to deep neural networks. Message-Passing Neural Networks (MPNNs) have achieved SoTA results on several molecular dynamics datasets schutt2018schnet ; schutt2021equivariant . Group theory and group representation theory play important roles in the design of MPNNs for MLFFs. An intuitive idea is to maintain rotation and translation equivariance in the design of the neural network. For example, we naturally expect the predicted forces to transform together with the input molecule, while the energy remains unchanged. Graph neural networks that obey this property are called Equivariant Graph Neural Networks (EGNNs) thomas2018tensor ; satorras2021n ; gasteiger2020directional . On top of that, several works utilize the powerful transformer structure vaswani2017attention and report satisfying results fuchs2020se ; liao2022equiformer . To better model many-body interactions, irreducible representations (irreps) are adopted to represent high-order geometric objects. The CG transform is used to translate between different irreps. The use of irreps significantly enhances the expressivity of models. Several works process and aggregate geometric information by leveraging such high-degree irreps, which yields a significant performance boost batatia2022mace ; batzner20223 ; musaelian2023learning ; gasteiger2021gemnet ; thomas2018tensor .

However, the benefit of high-degree irreps and the CG transform upon them comes at the cost of heavy computational overhead. Irreps are extensions of scalars and vectors, and in this sense the CG transform extends the dot product. The higher the degree of the irreps in the CG transform, the greater the computational demands. The requirement of permutation equivariance makes this burden hard to alleviate. EGNNs require each node to receive information from neighbouring atoms together with the edges linking them, and the heavy computation of the CG transform happens for every neighbouring atom and edge. We therefore cannot naïvely remove the computation for some neighbours, as this would break permutation equivariance. Moreover, the narrowness of the design space prevents us from freely constructing the CG transform layer. We need to operate on each neighbouring atom in the same way (e.g., previous works typically use one shared Multi-Layer Perceptron (MLP), operating on scalar features of the edge, to produce the weights for each computation between the central atom and a neighbouring one batzner20223 ; musaelian2023learning ). To confront this challenge, we propose FreeCG. The model combines geometric features from the edges surrounding each atom. We call the different aggregated edge geometric features abstract edges, which are permutation invariant. By the transitivity of invariance, we show that the CG transform on these abstract edges is always permutation invariant, regardless of its concrete design, and does not affect the permutation equivariance of the layer, thus being free of the burdens above. Furthermore, the abstract edges are constructed from different real edges, so they contain refined features of them, which improves the expressive power of the model. The invariant nature of abstract edges allows us to assign different weights to different abstract edges, instead of weights computed by the same MLP.

We put abstract edges into groups and operate on each group individually to further decrease the computational demands. Works that keep ${\text{E}}(3)$ equivariance are more expensive batzner20223 ; musaelian2023learning , since they require an extra parity argument taking the value $1$ or $-1$, which doubles the number of irreps. Instead, we select an efficient set of paths for the CG transform so that we maintain ${\text{E}}(3)$ equivariance while being more efficient than keeping SE(3) equivariance. Abstract edges shuffling, inspired by zhang2018shufflenet , is also applied to combine irreps features. The abstract edges are then plugged back into the self-attention calculation to improve the quality of the attention scores. The operations above are made possible by the equivariance proposition about the abstract edges.

To evaluate FreeCG, following previous works wang2024enhancing ; wang2023quinnet ; batzner20223 ; musaelian2023learning , we collect standard force field prediction benchmarks: MD17 chmiela2017machine , revised MD17 (rMD17) Christensen2020 , and MD22 chmiela2023accurate . To further examine the generalization of FreeCG to molecular property prediction, we also conduct experiments on QM9 ruddigkeit2012enumeration ; ramakrishnan2014quantum . We follow the conventional training/validation/test splits and report our results against several SoTA methods on these datasets. Remarkably, FreeCG outperforms other methods for force prediction on most molecules, often by large margins. Ablation studies are also conducted to validate the effectiveness of each proposed module and to evaluate the sensitivity to the hyper-parameters. To examine the efficiency of our modifications to the CG transform, we provide theoretical numbers and measure the speed w.r.t. the group number and sparsity of the CG transform. The speed and size of the overall FreeCG model are also benchmarked, which demonstrates the efficiency of the whole model.

The contributions of our work are summarized as follows:

• We reveal two major issues in current EGNNs with CG transform: heavy computational demands and a narrow design space.

• We propose to leverage permutation-invariant abstract edges and, through our proposition, completely free the design space of the CG transform.

• We propose FreeCG, comprising three main components: group CG transform with sparse path, abstract edges shuffling, and attention enhancer, which together yield an information-rich and efficient model with high-degree CG transform (see Fig. 1).

• Experiments on the small-molecule datasets MD17 and rMD17, the large-molecule dataset MD22, and the molecular property dataset QM9 reveal the SoTA performance of FreeCG.

• We benchmark the speed and memory usage of our proposed modules and of the overall FreeCG, demonstrating their efficiency.

• Since the design space of the CG transform is unrestricted, our approach presents a new paradigm for designing CG transforms in future research, extending beyond the design in this work.

2 Results

2.1 Background

Group, equivariance and invariance. Permutations, rotations, and translations form different groups in group theory. Formally, a set with a binary operation $(G,*)$ is said to be a group if and only if the following conditions hold: 1) $g_{1}*g_{2}\in G$ for any $g_{1},g_{2}\in G$ (closure); 2) $(g_{1}*g_{2})*g_{3}=g_{1}*(g_{2}*g_{3})$ for any $g_{1},g_{2},g_{3}\in G$ (associativity); 3) there exists an element $e\in G$ such that $g*e=e*g=g$ for any $g\in G$ (identity element $e$); 4) for each $g\in G$ there is an element $g^{\prime}\in G$ such that $g*g^{\prime}=g^{\prime}*g=e$ (inverse element $g^{\prime}$). According to representation theory, the group elements $g\in G$ can be represented as linear transformations $\mathcal{P}_{V}(g)\in GL(V)$ on a vector space $V$. A function $f:X\to Y$, where $X$ and $Y$ are vector spaces, is said to be $G$-equivariant if and only if $f(\mathcal{P}_{X}(g)x)=\mathcal{P}_{Y}(g)f(x)$ for any $g\in G$. $G$-invariance is the special case in which $\mathcal{P}_{Y}(g)$ is the identity matrix. Permutation equivariance and ${\text{E}}(3)$-equivariance are two properties that each layer of our model obeys. Permutation equivariance means that reindexing the input node or edge features reindexes the output of a layer in the same way. ${\text{E}}(3)$-equivariance covers rotation, translation, and reflection. The translation part is explicitly guaranteed by considering only the relative positions of atoms, so we consider ${\text{O}}(3)$-equivariance, where translations are omitted. It is intuitive that directional features should change correspondingly when the whole molecule is rotated or reflected.
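The equivariance and invariance conditions above can be checked numerically. The following is a minimal sketch, assuming NumPy and two toy functions of our own choosing (`f_equivariant` and `f_invariant` are illustrative names, not part of any model): a random element of O(3) is applied before and after each function, and the two results are compared.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(rng):
    """Draw an element of O(3) via the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    return q

def f_equivariant(x):
    # Scale a vector by its own length: the output transforms like the input.
    return np.linalg.norm(x) * x

def f_invariant(x):
    # Vector length: unchanged by any rotation or reflection.
    return np.linalg.norm(x)

g = random_orthogonal(rng)   # plays the role of P_X(g) = P_Y(g)
x = rng.normal(size=3)

# f(P_X(g) x) versus P_Y(g) f(x); both gaps should be at round-off level.
equivariance_gap = np.abs(f_equivariant(g @ x) - g @ f_equivariant(x)).max()
invariance_gap = abs(f_invariant(g @ x) - f_invariant(x))
print(equivariance_gap, invariance_gap)
```

For the invariant function the output representation is the identity, which is why no matrix appears on the right-hand side of the second comparison.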

Tensor, irreps and CG transform. Tensors are high-dimensional generalizations of scalars, vectors, and matrices. Scalars and vectors are both special cases of Cartesian tensors. The tensor product generates high-rank tensors from low-rank ones. Formally, tensors are the results of the tensor product of several vectors and covectors. In our context, it is not essential to distinguish between vectors and covectors. Tensor representations of a group can be further decomposed into a direct sum of irreps. For example, the representation of ${\rm SO}(3)$ (which omits reflections compared to ${\text{O}}(3)$) on the 9-dimensional space obtained from the tensor product of two $3\times 3$ rotation matrices can be decomposed into $1\times 1$ ($l=0$), $3\times 3$ ($l=1$), and $5\times 5$ ($l=2$) irreps, which are called Wigner-D matrices. In EGNNs, we often project the distance vector between atoms onto the unit sphere $S^{2}$ centered at the central atom. $S^{2}$ is homeomorphic to the quotient space SO(3)$/$SO(2), so it also has its own irreps, e.g., the $l=0$ scalar and the $l=1$ vector. $S^{2}$ irreps are the main features we maintain in our model, where an irrep of degree $l$ has $2l+1$ elements, which are often indexed by $m$. To combine these features, we can calculate the tensor product between them, and the results can, again, be decomposed into irreps. This process is the CG transform, which utilizes CG coefficients to perform the transformation. For instance, $A^{1,l_{2}l_{3}\mapsto l_{1}}_{m_{1}}=\sum_{m_{2},m_{3}}C^{l_{1}l_{2}l_{3}}_{m_{1}m_{2}m_{3}}A^{2,l_{2}}_{m_{2}}A^{3,l_{3}}_{m_{3}}$, where $A^{l}$ are $S^{2}$ irreps, $m$ indexes the elements of the irreps, and $C$ denotes the CG coefficients. To satisfy ${\text{O}}(3)$-equivariance, we consider an additional variable, the parity $p$, which takes the value $1$ or $-1$. Irreps with $p=-1$ change sign when the space is reflected, and those with $p=1$ remain unchanged. The above formula of the CG transform becomes:

	$\displaystyle A^{1,l_{2}p_{2}l_{3}p_{3}\mapsto l_{1}p_{1}}_{m_{1}}=\mathbbm{1}_{(p_{1}=p_{2}p_{3})}\sum_{m_{2},m_{3}}C^{l_{1}l_{2}l_{3}}_{m_{1}m_{2}m_{3}}A^{2,l_{2}p_{2}}_{m_{2}}A^{3,l_{3}p_{3}}_{m_{3}}$		(1)

where $\mathbbm{1}_{(expression)}$ is the indicator function, outputting $1$ if $expression$ is true and $0$ otherwise. Given a vector (an $l=1$ $S^{2}$ irrep), we can lift it to irreps of arbitrary degree $l$ and parity $p=(-1)^{l}$ via the real spherical harmonics $(Y^{l}_{m=1},...,Y^{l}_{m=2l+1})$. For further details about group theory, we refer interested readers to related books and papers zee2016group ; raczka1986theory ; thomas2018tensor ; jeevanjee2011introduction ; cohen2018spherical .
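A familiar Cartesian instance of the CG transform is the coupling of two vectors. The sketch below, assuming NumPy, splits the 9-component outer product of two $l=1$ vectors into its $l=0$ (dot product), $l=1$ (cross product), and $l=2$ (symmetric traceless) parts. These agree with the CG outputs only up to normalization and a change of basis, so this is an illustration of the decomposition and of the parity rule $p_{1}=p_{2}p_{3}$, not the spherical-basis formula itself.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_rotation(rng):
    """Random proper rotation (det = +1) from a QR decomposition."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]  # flipping one column turns a reflection into a rotation
    return q

def couple_vectors(a, b):
    """Split the 9-component outer product a b^T into l = 0, 1, 2 parts."""
    outer = np.outer(a, b)
    l0 = np.trace(outer)                                  # 1 component (dot product)
    l1 = np.cross(a, b)                                   # 3 components (antisymmetric part)
    l2 = 0.5 * (outer + outer.T) - l0 / 3.0 * np.eye(3)   # 5 independent components
    return l0, l1, l2

a, b = rng.normal(size=3), rng.normal(size=3)
rot = random_rotation(rng)

l0, l1, l2 = couple_vectors(a, b)
r0, r1, r2 = couple_vectors(rot @ a, rot @ b)   # rotate both inputs
i0, i1, i2 = couple_vectors(-a, -b)             # invert both inputs (p = -1 each)

# Each part transforms with its own representation under rotation.
rotation_ok = (np.isclose(r0, l0) and np.allclose(r1, rot @ l1)
               and np.allclose(r2, rot @ l2 @ rot.T))
# Under inversion every output is unchanged: output parity (+1) = (-1) * (-1).
parity_ok = np.isclose(i0, l0) and np.allclose(i1, l1) and np.allclose(i2, l2)
print(rotation_ok, parity_ok)
```

The component counts $1+3+5=9$ match the decomposition of the 9-dimensional product space mentioned above.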

2.2 Problem analysis

The task of force field prediction can be formalised as follows: given a set of atoms with their positions and atom types $\{\bm{X},\bm{Z}\}$, the neural network $f_{\theta}$ with parameters $\theta$ aims to predict the energy, from which the predicted force on each atom is derived. In each layer of NequIP batzner20223 , messages from neighboring atoms are aggregated and combined with the features of the central atom. The messages are created via the CG transform between irreps. Here, we revisit the critical step of constructing messages to a central atom $a$ in NequIP:

	$\displaystyle\mathcal{L}^{l_{e}p_{e}l_{n}p_{n}\mapsto l_{o}p_{o}}_{acm_{o}}(\bm{X},\bm{N})=\mathbbm{1}_{(p_{o}=p_{e}p_{n})}\sum_{m_{e}m_{n}}C^{l_{o}l_{e}l_{n}}_{m_{o}m_{e}m_{n}}\sum_{b\in\mathcal{N}(a)}(R(\lVert r_{ab}\rVert)^{l_{o}l_{e}l_{n}}_{c})Y^{l_{e}}_{m_{e}}(\frac{r_{ab}}{\lVert r_{ab}\rVert})N^{l_{n}p_{n}}_{bcm_{n}}$		(2)

where $\mathcal{N}(a)$ is the set of neighboring atoms of atom $a$, $R$ is an MLP, $\lVert*\rVert$ is the Euclidean norm, $N_{b}$ denotes the features of node $b$, and $r_{ab}$ is the vector pointing from atom $a$ to atom $b$. Consider the vector function form of Eq. 2: $\bm{\mathcal{L}}^{l_{e}p_{e}l_{n}p_{n}\mapsto l_{o}p_{o}}_{cm_{o}}=(\mathcal{L}^{l_{e}p_{e}l_{n}p_{n}\mapsto l_{o}p_{o}}_{1cm_{o}},\mathcal{L}^{l_{e}p_{e}l_{n}p_{n}\mapsto l_{o}p_{o}}_{2cm_{o}},...)$, which is permutation equivariant w.r.t. permutation operations acting on $\bm{X}$ and $\bm{N}$. Formally, this means $\bm{\mathcal{L}}^{l_{e}p_{e}l_{n}p_{n}\mapsto l_{o}p_{o}}_{cm_{o}}(\mathcal{P}_{\bm{X}}\bm{X},\mathcal{P}_{\bm{N}}\bm{N})=\mathcal{P}_{\bm{\mathcal{L}}}\bm{\mathcal{L}}^{l_{e}p_{e}l_{n}p_{n}\mapsto l_{o}p_{o}}_{cm_{o}}(\bm{X},\bm{N})$. Put simply, if we exchange the indices of two atoms, for example 1 and 2, and feed them into the function $\bm{\mathcal{L}}^{l_{e}p_{e}l_{n}p_{n}\mapsto l_{o}p_{o}}_{cm_{o}}$, the result equals that of directly exchanging indices 1 and 2 in the output of $\bm{\mathcal{L}}^{l_{e}p_{e}l_{n}p_{n}\mapsto l_{o}p_{o}}_{cm_{o}}$, which is $(\mathcal{L}^{l_{e}p_{e}l_{n}p_{n}\mapsto l_{o}p_{o}}_{2cm_{o}},\mathcal{L}^{l_{e}p_{e}l_{n}p_{n}\mapsto l_{o}p_{o}}_{1cm_{o}},...)$. This property is simple but very important for molecular neural networks, since the properties of a molecule should not depend on the order in which its atoms are listed.
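The permutation equivariance of Eq. 2 can be illustrated with a stripped-down message function. The sketch below, assuming NumPy, keeps only the $l_{e}=1$ edge direction and a scalar node feature and replaces the MLP $R$ by a fixed radial function (`radial` and `messages` are hypothetical stand-ins, not the NequIP implementation). Because the same function is applied to every neighbor and the results are summed, permuting the atoms permutes the output rows in the same way.

```python
import numpy as np

def radial(d):
    # Stand-in for the MLP R: one shared function for every neighbor.
    return np.exp(-d)

def messages(X, N, cutoff=5.0):
    """Sum over neighbors b of radial(|r_ab|) * unit(r_ab) * N_b for each atom a."""
    n = X.shape[0]
    out = np.zeros((n, 3))
    for a in range(n):
        for b in range(n):
            if a == b:
                continue
            r_ab = X[b] - X[a]          # vector pointing from atom a to atom b
            d = np.linalg.norm(r_ab)
            if d < cutoff:
                out[a] += radial(d) * (r_ab / d) * N[b]
    return out

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 3))   # atom positions
N = rng.normal(size=5)        # scalar node features
perm = rng.permutation(5)

lhs = messages(X[perm], N[perm])   # permute inputs, then compute
rhs = messages(X, N)[perm]         # compute, then permute outputs
max_gap = np.abs(lhs - rhs).max()
print(max_gap)
```

Dropping one neighbor from the inner loop, or giving `radial` different parameters for different values of `b`, would make `lhs` and `rhs` disagree, which is the constraint discussed in the problems that follow.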

Most works take this property for granted. However, permutation equivariance is important yet fragile. It restricts the design space to a very small scope and makes the network scale poorly as the number of neighbors rises. Specifically, it brings the following issues:

Problem 1.

The CG transform layer scales as $\mathcal{O}(\max\limits_{i}{\rm card}(\mathcal{N}(i)))$ , where ${\rm card}(X)$ is the number of elements in set $X$ . One cannot arbitrarily remove calculations for a specific neighboring atom because it would break the permutation equivariance.

Problem 2.

The design space is limited for maintaining permutation equivariance. For example, in Eq. 2, the formulation and the parameters of $R$ should be the same across different neighboring atoms, thus forbidding the design for complicated CG transform layers.

Problem 1 brings heavy computational demands, as the CG transform itself is very time-consuming compared to the dot product and element-wise multiplication. We provide a detailed analysis of the efficiency of the CG transform in the Method section. On the other hand, the narrow design space described in Problem 2 makes it hard to design a highly expressive CG transform layer, as only a limited set of structures maintains permutation equivariance. To address these problems, we aim to free the CG transform in message passing from the constraints of permutation equivariance without compromising the overall equivariance of the network. Here, we leverage a simple and useful mathematical property. Consider a function $h$ that can be written as:

	$\displaystyle h(x)=h^{{}^{\prime}}(h_{1}(x),h_{2}(x),...)$		(3)

If the $h_{*}(x)$ are all $G$-invariant, then, no matter how we design $h^{{}^{\prime}}$, the overall function $h$ is $G$-invariant, that is, $G$-equivariant with the identity as its output representation. The proof is simple, as:

	$\displaystyle h^{{}^{\prime}}(h_{1}(\mathcal{P}_{X}(g)x),h_{2}(\mathcal{P}_{X}(g)x),...)=\mathcal{P}_{h}(e)h^{{}^{\prime}}(h_{1}(x),h_{2}(x),...)$		(4)

Invariant components do not affect the equivariance of the layer. Thus, we can first obtain a set of invariant functions $h_{*}$ and then freely design the function $h^{{}^{\prime}}$ of them.
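The proposition can be illustrated with a small numerical experiment. In the sketch below, assuming NumPy, `invariants` computes three permutation-invariant summaries of a set of vectors, and `h_prime` is an arbitrary function with random, unshared weights (all names and the specific functions are our own illustrative choices). However `h_prime` is designed, the composition stays permutation invariant.

```python
import numpy as np

rng = np.random.default_rng(3)

def invariants(x):
    """Three permutation-invariant summaries h_1, h_2, h_3 of a set of row vectors."""
    norms = np.linalg.norm(x, axis=1)
    return np.array([norms.sum(), norms.max(), (norms ** 2).mean()])

# Arbitrary, unshared weights: nothing here is constrained by symmetry.
W = rng.normal(size=(4, 3))

def h_prime(z):
    # Any function of the invariants works; this one is deliberately ad hoc.
    return np.tanh(W @ z) * np.arange(1, 5)

def h(x):
    return h_prime(invariants(x))

x = rng.normal(size=(6, 3))
perm = rng.permutation(6)
invariance_gap = np.abs(h(x[perm]) - h(x)).max()
print(invariance_gap)
```

The symmetry is enforced once, inside `invariants`, so replacing `h_prime` by any other function leaves `invariance_gap` at round-off level.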

2.3 FreeCG

Abstract edges. The above proposition presents an elegant way to solve Problems 1 and 2. The idea is to put the CG transform inside the function $h$, and by the conclusion above, we can completely free the design space of the CG transform. The first step is to construct the permutation-invariant functions $h_{*}$. To emphasize the geometric information, we want these $h_{*}$’s to be aggregations of edge features. We call the $h_{*}$’s abstract edges. For the concrete design, we take the transformer architecture in VisNet wang2024enhancing as an efficient tool to construct abstract edges. The detailed information of VisNet is in the Appendix. In VisNet, each edge maintains high-degree features $E_{ij}=E^{l=1}_{ij}\oplus E^{l=2}_{ij}$ consisting of irreps $E^{l}_{ij}=Y^{l}(r_{ij}/\lVert r_{ij}\rVert)$. These features do not depend on the layer index $L$. The computed attention $a^{L}_{ij,t}$ is multiplied with each edge. Their sum $\hat{E}^{L}_{i,t}=\sum_{ij\in\mathcal{E}(i)}a^{L}_{ij,t}E^{L}_{ij}$ forms an abstract edge, where we omit the degree $l$, and $L$ is the index of the layer. $t$ denotes the index of the $t$-th abstract edge in $(\hat{E}^{L}_{i,t=1},\hat{E}^{L}_{i,t=2},...)$. In the original VisNet, it was used to update the geometric feature $d\overline{E}^{L+1}_{i}=\hat{E}^{L}_{i}+o^{L,1}_{i}\cdot{\rm Linear}(\overline{E}^{L}_{i})$, where ${\rm Linear}$ is a fully-connected linear operation that acts across the dimension of $t$ and thus does not break the equivariance, and $o^{L,1}_{i}$ is a variable generated from the node feature, as we will introduce in the Method section. We leverage the fact that $\hat{E}^{L}$ fits the requirements of $h_{*}$ in our proposition, take them as abstract edges, and propose methods to construct the CG transform function $h$ upon them. The proof that each abstract edge meets the requirement for $h_{*}$, namely, that it is permutation invariant, is in the Method section.
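The construction of abstract edges as attention-weighted sums of edge features can be sketched as follows, assuming NumPy. For brevity only the $l=1$ part of $E_{ij}$ is kept, and the attention $a_{ij,t}$ is replaced by a hypothetical per-edge score computed from the distance with made-up parameters `w` and `b`; the actual attention is produced by the transformer layer and is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(4)

def silu(x):
    return x / (1.0 + np.exp(-x))

def abstract_edges(r, w, b):
    """r: (n_neighbors, 3) vectors from the central atom; w, b: (T,) hypothetical parameters."""
    d = np.linalg.norm(r, axis=1)       # (n,) edge lengths
    E = r / d[:, None]                  # (n, 3) l = 1 edge features
    a = silu(np.outer(d, w) + b)        # (n, T) per-edge scores, one column per abstract edge
    return a.T @ E                      # (T, 3): E_hat[t] = sum_j a[j, t] * E[j]

n, T = 7, 4
r = rng.normal(size=(n, 3))
w, b = rng.normal(size=T), rng.normal(size=T)

E_hat = abstract_edges(r, w, b)

# Reordering the neighbors leaves every abstract edge unchanged.
perm = rng.permutation(n)
perm_gap = np.abs(abstract_edges(r[perm], w, b) - E_hat).max()

# Applying an orthogonal matrix to the inputs transforms the abstract edges the same way.
g, _ = np.linalg.qr(rng.normal(size=(3, 3)))
o3_gap = np.abs(abstract_edges(r @ g.T, w, b) - E_hat @ g.T).max()
print(perm_gap, o3_gap)
```

The number of abstract edges `T` is fixed by the parameter shapes and does not grow with the number of neighbors `n`.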

Group CG transform. The number of abstract edges is chosen by us, so the complexity of computing the CG transform in Problem 1 is bounded by a constant. The proposition above gives us enough freedom to construct the CG transform function $h$, expanding the design space to the maximum and alleviating Problem 2. The idea is to use the CG transform to replace the updating mechanism of $\overline{E}$ in VisNet. A naïve attempt is to directly take the CG transform between $\overline{E}^{L}$ and $\hat{E}^{L}$ to acquire $\overline{E}^{L+1}$. However, we want to further decrease the $O(T^{2})$ time complexity of the CG transform, where $T$ is the number of abstract edges, even though it is a constant. Leveraging the unlimited freedom in constructing $h$, and taking inspiration from group convolution krizhevsky2012imagenet , we propose the group CG transform (where group is distinct from the notion in group theory). We first split the abstract edges of $\overline{E}^{L}$ and $\hat{E}^{L}$ into groups, where each abstract edge index belongs to some group $U_{g}$, the integer $g$ ranges from $1$ to $G$, and $G$ is a hyper-parameter for the total number of groups. Then a group CG transform acts as:

	$\displaystyle d\overline{E}^{L+1,l_{o},p_{o}}_{i,t_{o}m_{o}}=\mathbbm{1}_{(p_{o}=p_{1}p_{2})}\sum_{l_{1},l_{2}}\sum_{m_{1},m_{2}}C^{l_{o},l_{1},l_{2}}_{m_{o}m_{1}m_{2}}\sum_{t_{1},t_{2}\in U_{g}}W^{l_{o},l_{1},l_{2}}_{t_{o}t_{1}t_{2}}o^{L,1}_{i}{\rm Linear}(\overline{E}^{L}_{i})^{l_{1},p_{1}}_{t_{1}m_{1}}\hat{E}^{L,l_{2},p_{2}}_{i,t_{2}m_{2}}$		(5)

where $t_{o}\in U_{g}$. The group CG transform decreases the time complexity to $O(T^{2}/G)$. The parameters $W$ of the CG transform are also worth emphasizing. They need not be shared across different abstract edges $t$ to preserve permutation equivariance, and we do not need to apply the same MLP to each edge to calculate weights. Thus, we directly assign different weights $W$ to different abstract edges to enhance the expressive ability of the model. In contrast to previous methods, we also save the computational cost of calculating weights for each edge.
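The stated reduction from $O(T^{2})$ to $O(T^{2}/G)$ can be reproduced by counting, under the assumption that the cost is measured by the number of input pairs $(t_{1},t_{2})$ that are coupled and that the $T$ abstract edges are split into $G$ equal contiguous groups. The helper `pair_count` below is a hypothetical name used only for this illustration.

```python
def pair_count(T, G):
    """Number of (t1, t2) pairs coupled when T abstract edges are split into G equal groups."""
    assert T % G == 0, "assumes equal-sized groups"
    size = T // G
    groups = [range(g * size, (g + 1) * size) for g in range(G)]
    # Only pairs whose two indices fall in the same group U_g are coupled.
    return sum(1 for U in groups for t1 in U for t2 in U)

T = 8
counts = {G: pair_count(T, G) for G in (1, 2, 4)}
print(counts)
```

With $G=1$ every pair is coupled, giving $T^{2}$ pairs; each of the $G$ groups contributes $(T/G)^{2}$ pairs, so the total is $T^{2}/G$.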

Sparse path. Typically, ensuring SO(3) equivariance is considered more efficient than ensuring O(3) equivariance, because for O(3) we often need to consider both $p=1$ and $p=-1$ for a single $l$, so the total computation is quadrupled and the memory usage is doubled. Here we propose a method that keeps O(3) equivariance while being even more efficient than SO(3). We only keep $(l=1,p=-1)$ and $(l=2,p=1)$, which are the same parities as those obtained by directly using spherical harmonics. This suffices for each output irrep to contain information from both input irreps through the CG transform, as $(l=1,p=-1)*(l=2,p=1)\mapsto(l=1,p=-1)$, $(l=2,p=1)*(l=1,p=-1)\mapsto(l=1,p=-1)$, $(l=2,p=1)*(l=2,p=1)\mapsto(l=2,p=1)$, and $(l=1,p=-1)*(l=1,p=-1)\mapsto(l=2,p=1)$. There are only 4 paths, in contrast to 8 paths for SO(3), so the layer is O(3) equivariant yet more efficient than an SO(3)-equivariant one. We show the paths in Fig. 2(a).
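The path counts can be verified by enumeration. The sketch below uses only the Python standard library: it lists all couplings among the two retained irreps that satisfy the triangle rule $|l_{1}-l_{2}|\leq l_{o}\leq l_{1}+l_{2}$ (the SO(3) count, ignoring parity) and then keeps those that also satisfy the parity rule $p_{o}=p_{1}p_{2}$ of Eq. 1.

```python
from itertools import product

# The two irreps kept by the sparse path, written as (l, p).
irreps = [(1, -1), (2, 1)]

def triangle_ok(l1, l2, lo):
    # CG coefficients vanish unless |l1 - l2| <= lo <= l1 + l2.
    return abs(l1 - l2) <= lo <= l1 + l2

# SO(3): parity is ignored, only the triangle rule restricts the paths.
so3_paths = [(a, b, o) for a, b, o in product(irreps, repeat=3)
             if triangle_ok(a[0], b[0], o[0])]

# O(3): the output parity must also equal the product of the input parities.
o3_paths = [(a, b, o) for a, b, o in so3_paths if o[1] == a[1] * b[1]]

print(len(so3_paths), len(o3_paths))
for a, b, o in o3_paths:
    print(f"{a} * {b} -> {o}")
```

The parity rule removes exactly half of the paths, and each surviving output irrep still receives contributions involving both input irreps.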

Abstract edges shuffling. Inspired by ShuffleNet zhang2018shufflenet , we also shuffle the abstract edges so that information is exchanged across groups. We shuffle all the abstract edges. Specifically, we increase the indices of all irreps by $1.5*T/G$. If an index exceeds $T$, we start counting from 1 again. Theoretically, the shuffling strategy can be arbitrary, as long as the same strategy is maintained for each layer during every inference. This process is also depicted in Fig. 2(b). The ablation on different strategies is shown in the ablation study.
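The shuffling rule amounts to a cyclic shift of the abstract-edge index. A minimal sketch, assuming NumPy, 0-based indices with wrap-around by modulo (the text counts from 1), contiguous groups, and a value of $T/G$ for which $1.5*T/G$ is an integer; `shuffle_abstract_edges` is a hypothetical helper name.

```python
import numpy as np

def shuffle_abstract_edges(E, G):
    """Cyclically move the abstract edge at index t to index (t + 1.5*T/G) mod T."""
    T = E.shape[0]
    assert T % (2 * G) == 0, "assumes 1.5 * T / G is an integer"
    shift = 3 * T // (2 * G)
    return np.roll(E, shift, axis=0)

T, G = 8, 2
labels = np.arange(T)                  # label each abstract edge by its original index
shuffled = shuffle_abstract_edges(labels, G)
print(shuffled)

group_of = labels // (T // G)          # original group of every abstract edge
first_group_sources = set(group_of[shuffled[: T // G]].tolist())
print(first_group_sources)
```

Because the shift is one and a half group widths, each group after shuffling holds edges from two different original groups, which is what lets the next group CG transform mix information across groups.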

Abstract edges enhance self-attention. The Transformer integrates information from neighboring atoms in molecular tasks through the self-attention mechanism, which aims to capture relations between atoms exhibiting strong interatomic correlations. To better utilize abstract edge information, we leverage it to augment the generation of attention scores. To calculate the self-attention, the node scalar features are processed to generate the query $Q$, key $K$, and value $V$ for each atom. Then, the self-attention is computed as $A_{ij}=\mathop{\rm SiLU}(Q_{i}\odot K_{j})$, where $\odot$ represents the dot product and SiLU is the activation function. Note that VisNet differs from other transformer-based models in that $A_{ij}$ is passed through SiLU instead of a Softmax across different $j$. We integrate the information of abstract edges by:

	$\displaystyle A_{ij}=\mathop{\rm SiLU}(Q_{i}\odot K_{j}+\max\limits_{t}(\overline{E}^{L}_{j,t}\odot E_{ij}))$		(6)

where $\max\limits_{t}(\overline{E}^{L}_{j,t}\odot E_{ij})$ means that the dot product of each abstract edge $\overline{E}^{L}_{j,t}$ with the real edge features $E_{ij}$ is computed, and the maximum value over the abstract edges is taken. There is no $L$ superscript on $E_{ij}$ because these features are the same across layers. We take this term as an additional contribution to the self-attention, as it measures how much the abstract edges contain the information of the edge linking atoms $i$ and $j$.
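Eq. 6 can be written out for a single pair of atoms as follows. This is a minimal sketch assuming NumPy, with the edge features restricted to a 3-component $l=1$ vector and with made-up numbers for all inputs; `enhanced_attention` is a hypothetical helper name rather than the model's code.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def enhanced_attention(Q_i, K_j, E_bar_j, E_ij):
    """SiLU(Q_i . K_j + max_t(E_bar_j[t] . E_ij)) for one pair of atoms (i, j)."""
    edge_term = np.max(E_bar_j @ E_ij)   # dot product with every abstract edge, then max over t
    return silu(Q_i @ K_j + edge_term)

Q_i = np.array([1.0, 0.0])
K_j = np.array([0.5, 2.0])
E_bar_j = np.array([[1.0, 0.0, 0.0],     # abstract edge t = 1 of atom j
                    [0.0, 1.0, 0.0]])    # abstract edge t = 2 of atom j
E_ij = np.array([0.0, 1.0, 0.0])         # real edge feature between i and j

score = enhanced_attention(Q_i, K_j, E_bar_j, E_ij)   # SiLU(0.5 + 1.0)

# A rotation by 90 degrees about z applied to all geometric inputs leaves the score unchanged.
rot = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
score_rot = enhanced_attention(Q_i, K_j, E_bar_j @ rot.T, rot @ E_ij)
print(score, score_rot)
```

The added term is a dot product between features that transform in the same way, so the attention score stays invariant, and it is largest when some abstract edge of atom $j$ aligns with the real edge $ij$.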

2.4 Experiments

To evaluate the performance of FreeCG, we collect molecular force field datasets on which we compare our method with other SoTA MLFFs: the small-molecule dataset MD17 chmiela2017machine with its revised version, rMD17 Christensen2020 , and the large-molecule dataset MD22 chmiela2023accurate . To test the generalization capacity of the proposed FreeCG, we also evaluate its performance on a standard molecular property prediction dataset, QM9 ruddigkeit2012enumeration ; ramakrishnan2014quantum . We take popular SoTA models into the comparison, including sGDML chmiela2017machine , SchNet schutt2018schnet , DimeNet gasteiger2020directional , SphereNet liu2021spherical , PaxNet zhang2022efficient , PaiNN schutt2021equivariant , SpookyNet unke2021spookynet , ET tholke2021equivariant , GemNet gasteiger2021gemnet , ComENet wang2022comenet , NequIP batzner20223 , UNiTE qiao2022informing , SO3KRATES frank2022so3krates , MACE batatia2022mace , Allegro musaelian2023learning , BOTNet batatia2022design , VisNet wang2024enhancing , and QuinNet wang2023quinnet . The ablations on each component and the corresponding hyper-parameters of FreeCG are also presented. A comprehensive introduction to each dataset and the detailed training settings are in the Method section.

Table 1: Performances on MD17 dataset.

Molecule	SchNet	DimeNet	PaiNN	SpookyNet	ET	GemNet	NequIP	SO3KRATES	VisNet	QuinNet	FreeCG
Energy Prediction
Aspirin	0.37	0.204	0.167	0.151	0.123	-	0.131	0.139	0.116	0.119	0.110
Ethanol	0.08	0.064	0.064	0.052	0.052	-	0.051	0.052	0.051	0.050	0.049
Malondialdehyde	0.13	0.104	0.091	0.079	0.077	-	0.076	0.077	0.075	0.078	0.094
Naphthalene	0.16	0.122	0.116	0.116	0.085	-	0.113	0.115	0.085	0.101	0.083
Salicylic acid	0.20	0.134	0.116	0.114	0.093	-	0.106	0.016	0.092	0.101	0.090
Toluene	0.12	0.102	0.095	0.094	0.074	-	0.092	0.095	0.074	0.080	0.076
Uracil	0.14	0.115	0.106	0.105	0.095	-	0.104	0.103	0.095	0.096	0.097
Force Prediction
Aspirin	1.35	0.499	0.338	0.258	0.253	0.217	0.184	0.236	0.155	0.145	0.122
Ethanol	0.39	0.230	0.224	0.094	0.109	0.085	0.071	0.096	0.060	0.060	0.053
Malondialdehyde	0.66	0.383	0.319	0.167	0.169	0.155	0.129	0.147	0.100	0.097	0.095
Naphthalene	0.58	0.215	0.077	0.089	0.061	0.051	0.039	0.074	0.039	0.039	0.034
Salicylic acid	0.85	0.374	0.195	0.180	0.129	0.125	0.090	0.145	0.084	0.080	0.070
Toluene	0.57	0.216	0.094	0.087	0.067	0.060	0.046	0.073	0.039	0.039	0.035
Uracil	0.56	0.301	0.139	0.119	0.095	0.097	0.076	0.111	0.062	0.062	0.059

The results are reported in mean absolute error (MAE). The energy and force are measured in kcal/mol and kcal/mol/Å, respectively. The best numbers are marked in bold.

Table 2: Performances on rMD17 dataset.

Molecule	UNiTE	GemNet	NequIP	MACE	Allegro	BOTNet	VisNet	QuinNet	FreeCG
Energy Prediction
Aspirin	0.055	-	0.0530	0.0507	0.0530	0.0530	0.0445	0.0486	0.0530
Azobenzene	0.025	-	0.0161	0.0277	0.0277	0.0161	0.0156	0.0394	0.0217
Benzene	0.002	-	0.0009	0.0092	0.0069	0.0007	0.0007	0.0096	0.0107
Ethanol	0.014	-	0.0092	0.0032	0.0092	0.0092	0.0078	0.0096	0.0087
Malonaldehyde	0.025	-	0.0184	0.0185	0.0138	0.185	0.0132	0.0168	0.0146
Naphthalene	0.011	-	0.0046	0.1153	0.0046	0.0046	0.0057	0.0174	0.0118
Paracetamol	0.044	-	0.0323	0.0300	0.0346	0.0300	0.0258	0.0362	0.0392
Salicylic acid	0.017	-	0.0161	0.0208	0.0208	0.0185	0.0161	0.033	0.0233
Toluene	0.010	-	0.0069	0.0115	0.0092	0.0069	0.0059	0.0139	0.0334
Uracil	0.013	-	0.0092	0.0115	0.0138	0.0092	0.0069	0.0149	0.0116
Force Prediction
Aspirin	0.175	0.2191	0.1891	0.1522	0.1684	0.1900	0.1520	0.1429	0.1212
Azobenzene	0.097	-	0.0669	0.0692	0.0600	0.0761	0.0585	0.0513	0.0486
Benzene	0.017	0.0115	0.0069	0.0069	0.0046	0.0069	0.0056	0.0047	0.0056
Ethanol	0.085	0.083	0.0646	0.0484	0.0484	0.0738	0.0522	0.0516	0.0438
Malonaldehyde	0.152	0.1522	0.0118	0.0946	0.0830	0.1338	0.0893	0.0875	0.0802
Naphthalene	0.060	0.0438	0.0300	0.0369	0.0208	0.0415	0.0291	0.0242	0.0228
Paracetamol	0.164	-	0.1361	0.1107	0.1130	0.1338	0.1029	0.0979	0.0840
Salicylic acid	0.088	0.1222	0.0922	0.0715	0.0669	0.0992	0.0795	0.0771	0.0648
Toluene	0.058	0.0507	0.0369	0.0350	0.0415	0.0438	0.0264	0.0244	0.0239
Uracil	0.088	0.0876	0.0669	0.0484	0.0415	0.0738	0.0495	0.0487	0.0446

The results are reported in MAE. The energy and force are measured in kcal/mol and kcal/mol/Å, respectively. The best numbers are marked in bold.

Table 3: Performances on MD22 dataset.

Molecule	sGDML	ViSNet	ViSNet-Improper	ViSNet-LSRM	MACE	QuinNet	FreeCG
Energy Prediction
Ac-Ala3-NHMe	0.391	0.0636	0.0546	0.0673	0.0631	0.0840	0.507
AT-AT	0.720	0.0708	0.0668	0.0780	0.108	0.144	0.0665
AT-AT-CG-CG	1.42	0.196	0.197	0.118	0.154	0.379	0.254
DHA	1.29	0.0741	0.0700	0.0897	0.135	0.118	0.0761
Buckyball catcher	1.17	0.508	0.537	0.319	0.489	0.563	0.512
Stachyose	4.00	0.0915	0.0882	0.104	0.122	0.226	0.183
Double-walled nanotube	4.00	0.800	0.601	1.81	1.67	1.81	0.543
Force Prediction
Ac-Ala3-NHMe	0.790	0.0830	0.0709	0.0942	0.0876	0.0681	0.0531
AT-AT	0.690	0.0812	0.0776	0.0781	0.0992	0.0687	0.0634
AT-AT-CG-CG	0.700	0.148	0.139	0.1064	0.1153	0.1273	0.1252
DHA	0.750	0.0598	0.0554	0.0598	0.0646	0.0515	0.0507
Buckyball catcher	0.680	0.184	0.201	0.1026	0.0853	0.1091	0.1783
Stachyose	0.680	0.0879	0.0802	0.0767	0.0876	0.0543	0.612
Double-walled nanotube	0.520	0.362	0.292	0.3391	0.2767	0.2473	0.2449

The results are reported in MAE. The energy and force are measured in kcal/mol and kcal/mol/Å, respectively. The best numbers are marked in bold. Note that the energy MAE is calculated without dividing by the total number of atoms, unlike wang2023quinnet .

Table 4: Molecular property prediction on QM9 dataset.

Target	Unit	SchNet	EGNN	DimeNet++	PaiNN	SphereNet	PaxNet	ET	ComENet	ViSNet	FreeCG
$\mu$	mD	33	29	29.7	12	24.5	10.8	11	24.5	9.5	11.4
$\alpha$	m $a^{3}_{0}$	235	71	43.5	45	44.9	44.7	59	45.2	41.1	38.2
$\epsilon_{HOMO}$	meV	41	29	24.6	27.6	22.8	22.8	20.3	23.1	17.3	16.6
$\epsilon_{LUMO}$	meV	34	25	19.5	20.4	18.9	19.2	17.5	19.8	14.8	13.5
$U_{0}$	meV	14	11	6.32	5.85	6.26	5.9	6.15	6.59	4.23	4.11
$U$	meV	19	12	6.28	5.83	6.36	5.92	6.38	6.82	4.25	4.51

The results are reported in MAE. The best numbers are marked in bold.

Dynamics on small molecules. MD17 is a widely used molecular dynamics benchmark for small molecules. FreeCG outperforms the other methods in all force prediction tasks. It also significantly decreases the force prediction error for the hardest-to-predict molecule in this dataset, aspirin, by 15%. Remarkably, our method also decreases the MAE by over 10% for ethanol, naphthalene, and salicylic acid. FreeCG shows no particular preference for molecule size: it performs strongly on both aspirin (180.2 g/mol) and ethanol (46.1 g/mol). The energy prediction is also competitive with other SoTA methods. rMD17 is the revised version of MD17, in which the trajectories were recomputed with higher accuracy. The force prediction accuracy of FreeCG still leads on the majority of molecules. It improves the force results over the baseline model, VisNet, on all molecules except benzene, and achieves SoTA results on more molecules. Note that the accuracy on benzene is already extremely high with previous models. The results for MD17 and rMD17 are reported in Tab. 1 and 2, respectively.
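The relative reductions quoted above can be recomputed from the force MAE values in Tab. 1. The short sketch below uses plain Python and takes, for each molecule, the lowest force MAE among the previous models in Tab. 1 as the reference; this choice of reference is our assumption about how the percentages are defined.

```python
# Force MAE (kcal/mol/Å) from Tab. 1: (best previous model, FreeCG).
force_mae = {
    "aspirin": (0.145, 0.122),
    "ethanol": (0.060, 0.053),
    "naphthalene": (0.039, 0.034),
    "salicylic acid": (0.080, 0.070),
}

# Relative reduction = (previous - ours) / previous.
reduction = {name: (prev - ours) / prev for name, (prev, ours) in force_mae.items()}

for name, value in reduction.items():
    print(f"{name}: {100 * value:.1f}%")
```

Under this definition, the reduction for aspirin is slightly above 15%, and the other three molecules are each above 10%.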

Dynamics on large molecules. MD22 is a benchmark of large molecules adopted by several studies wang2024enhancing ; wang2023quinnet ; li2023long . As shown in Tab. 3, FreeCG also performs well on large-scale data. It leads in most force prediction tracks and shows comparable results for energy prediction. Remarkably, the MAE reductions for both energy and force prediction on Ac-Ala3-NHMe are around 20%. The other models do not perform consistently well in force prediction, while VisNet-LSRM exhibits strong performance in energy prediction. It is also unsurprising that all modern deep neural network-based methods outperform sGDML, a classical kernel method.

Molecular property prediction. To examine how well FreeCG generalizes to molecular property prediction, we adopt QM9 as a standard benchmark for this task. FreeCG performs best on most properties, and VisNet is second best on most measures. Although both methods were proposed as MLFFs, they are even more competitive than the other methods on molecular property prediction tasks.

Efficiency benchmarking. The Chignolin dataset wang2023aimd comprises nearly 10,000 conformations of a 166-atom mini-protein and serves as a benchmark for memory usage and inference speed. We compare the inference speed and memory usage of FreeCG with VisNet, NequIP, and Allegro. The results are shown in Tab. 3. FreeCG adds little extra time and memory cost compared with the baseline model, VisNet. It is also more efficient in both memory and speed than the other two CG transform-based methods, NequIP and Allegro. Overall, the results demonstrate the efficiency of FreeCG. The number of groups in the group CG transform also affects the inference speed. Fig. 4 shows the theoretical number of paths and the actual inference time for different group numbers. A computational analysis of the CG transform is given in the Methods section.
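As a rough illustration of why the group number matters for cost, the sketch below counts channel pairs for one pair of input irreps. It assumes that a fully connected CG transform connects every channel of the first input with every channel of the second input inside the same group, and that output-channel multiplicity is ignored. The function name `count_paths` and this counting convention are assumptions for illustration, not the exact accounting behind Fig. 4.

```python
def count_paths(num_channels: int, num_groups: int) -> int:
    """Count channel pairs connected by a grouped, fully connected product.

    Assumption: every channel of the first input is paired with every channel
    of the second input inside the same group, and with none outside it.
    """
    if num_channels % num_groups != 0:
        raise ValueError("num_channels must be divisible by num_groups")
    per_group = num_channels // num_groups
    return num_groups * per_group * per_group


# C = 256 channels, the latent dimension listed for MD17 in Tab. 7.
for groups in (1, 8, 32):
    print(groups, count_paths(256, groups))
```

Under this convention the count scales as $C^{2}/G$, so more groups give fewer paths, at the price of less mixing between channels inside a single layer.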

Ablation study. We conduct ablations on the modules we propose, as well as on the strategies for abstract edges shuffling. The results are shown in Tab. 5, which reveals that each module contributes to the final score of FreeCG. In the final implementation of abstract edges shuffling, we shift the index of each abstract edge by $1.5*T/G$ . We also study the influence of the shuffling strategy by comparing shifts of $0.5*T/G$ , $1.0*T/G$ , and $1.5*T/G$ ; the results show that $1.5*T/G$ works best. The number of groups is also evaluated, and a small number of groups appears to be a good choice.
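The index shift used for shuffling can be sketched in a few lines. The version below assumes that indices wrap around modulo $T$, so that the shift is a permutation of the abstract edges; the wrap-around and the helper names `shuffle_abstract_edges` and `source_groups` are assumptions made for this illustration.

```python
def shuffle_abstract_edges(edges, num_groups, factor=1.5):
    """Cyclically shift abstract-edge indices by factor * T / G.

    Assumption: indices wrap around modulo T, so the result is a permutation.
    """
    total = len(edges)
    shift = int(factor * total / num_groups)
    shuffled = [None] * total
    for t, edge in enumerate(edges):
        shuffled[(t + shift) % total] = edge
    return shuffled


def source_groups(shuffled_ids, num_groups):
    """For each group after shuffling, list the original groups it draws from."""
    size = len(shuffled_ids) // num_groups
    return [
        sorted({idx // size for idx in shuffled_ids[g * size:(g + 1) * size]})
        for g in range(num_groups)
    ]


ids = list(range(16))  # T = 16 abstract edges, labelled by their original index
mixed = shuffle_abstract_edges(ids, num_groups=4, factor=1.5)
print(source_groups(mixed, 4))
```

In this toy setting, a shift with a half-group offset places edges from two different original groups into every new group, whereas a whole-group shift moves each group intact.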

3 Discussion

This work proposes FreeCG, a geometric neural network that frees the design space of the CG transform. We identify two main issues in designing CG transform-based neural networks: 1) the computational overhead and 2) the limited design space of the CG transform layer. We analyse and show that both problems are rooted in the mathematical constraint imposed by permutation equivariance. By proposing and leveraging a proposition, we bypass this constraint by building the CG transform layer on permutation-invariant abstract edges. On top of this free design platform, we propose the group CG transform, sparse path, abstract edges shuffling, and attention enhancer to obtain a highly expressive and efficient MLFF model. We conduct experiments on various data types and tasks, e.g., force prediction for small and large molecules and molecular property prediction. The results show that FreeCG achieves SoTA performance in force and property prediction. The speed and memory demands are also tested on the Chignolin dataset.

Beyond this, the proposed CG transform design paradigm is also applicable to the future design of CG transform-based neural networks. The proposition shows that once permutation-invariant mathematical objects are created, the CG transform built on them can be designed freely. Both the way to create permutation-invariant objects and the way to design the CG transform layer upon them can be pushed well beyond what we do in this work. Thus, it also points to a paradigm for expressive and efficient CG transform-based neural network design in the future.

4 Methods

4.1 Experimental settings

We conduct all the experiments under the same software and hardware settings. The machine is equipped with an Intel® Xeon® Gold 6330 CPU @ 2.00GHz and an NVIDIA Tesla A100 80G GPU. We run the experiments for each molecule on a single GPU. PyTorch 1.10.0 is used as the basic machine learning Python library. For the CG transform operations, we adopt e3nn 0.5.1. Matplotlib 3.0.3 is used for plotting. The details are listed in Tab. 6. For the training/validation/test splits, we follow previous works wang2024enhancing ; wang2023quinnet . We select the model to be evaluated on the test set based on its performance on the validation set. If the model does not improve for a given number of epochs, we terminate the training and select the checkpoint with the best validation score. As in previous works, Exponential Moving Average (EMA) is adopted to generate the model weights. The hyperparameters and detailed training configurations are reported in Tab. 7.
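The checkpoint selection and weight averaging described above can be sketched as follows. This is a minimal illustration, not the training code: the class `EarlyStopper`, the function `ema_update`, the patience value, and the decay value are all hypothetical.

```python
class EarlyStopper:
    """Track the best validation loss and stop after `patience` stale epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.stale = 0

    def update(self, epoch, val_loss):
        """Record one validation result; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss, self.best_epoch, self.stale = val_loss, epoch, 0
        else:
            self.stale += 1
        return self.stale >= self.patience


def ema_update(ema_weights, weights, decay=0.999):
    """Blend the running average with the current weights (decay is hypothetical)."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


stopper = EarlyStopper(patience=2)
for epoch, loss in enumerate([0.9, 0.5, 0.6, 0.7, 0.4]):
    if stopper.update(epoch, loss):
        break
print(stopper.best_epoch, stopper.best_loss)
```

In the toy run, training stops once two consecutive epochs fail to improve on the best validation loss, and the checkpoint of the best epoch is the one that would be evaluated on the test set.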

4.2 Model implementation

Here we show how FreeCG is built upon VisNet. This section provides detailed explanations of the implementation details, ensuring FreeCG can be replicated effectively.

Input layer. The input consists of the atom coordinates and types $\{\bm{X}=(r_{1},r_{2},...,r_{N}),\bm{Z}=(z_{1},z_{2},...,z_{N})\}$ , where $r\in\mathbbm{R}^{3}$ denotes the Cartesian coordinates of an atom and $z$ its atom type (atomic number). We first embed the atom types into the latent space and take the embeddings as the first layer’s node features $h_{i}={\rm embedding}(z_{i})\in\mathbbm{R}^{C}$ , where $C$ is the dimension of the latent space. For each atom $i$ , we only consider the neighbouring atoms $\mathcal{N}(i)$ within a given radius. We maintain the distance vectors $r_{ij}$ from the central atom to the neighbouring atoms and lift them to $(l=1,p=-1)$ and $(l=2,p=1)$ irreps $E_{ij}\in\mathbbm{R}^{3+5}$ via real spherical harmonics applied to the unit vector, $E_{ij}=Y^{l}(r_{ij}/\lVert r_{ij}\rVert)$ ; we also calculate the corresponding Euclidean norm $\lVert r_{ij}\rVert$ . The Euclidean norms are then converted to high-dimensional scalar features (edge attributes) $f_{ij}={\rm RBF}(\lVert r_{ij}\rVert)\in\mathbbm{R}^{C}$ by radial basis functions (RBFs). We also maintain zero-initialized abstract edges $\overline{E}^{L=0}_{i}={\bm{0}}$ for each node, to be updated in the following layers. We assign the same number of abstract edges as the dimension of the latent features, so that no additional operations are required to align the dimensions.
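The lifting of a distance vector to $3+5$ components and the radial expansion of its norm can be sketched as below. The sketch assumes orthonormal real spherical harmonics with $m$ ordered from $-l$ to $l$ and a Gaussian basis for the RBF; both the normalization and the Gaussian choice are assumptions for illustration, and libraries such as e3nn may use different conventions.

```python
import math


def lift_to_irreps(vec):
    """Map a 3D vector to l=1 (3 components) and l=2 (5 components) real SH.

    Assumption: orthonormal real spherical harmonics, m ordered from -l to l.
    """
    norm = math.sqrt(sum(c * c for c in vec))
    x, y, z = (c / norm for c in vec)
    c1 = math.sqrt(3.0 / (4.0 * math.pi))
    c2 = 0.5 * math.sqrt(15.0 / math.pi)
    c20 = 0.25 * math.sqrt(5.0 / math.pi)
    l1 = [c1 * y, c1 * z, c1 * x]
    l2 = [c2 * x * y, c2 * y * z, c20 * (3.0 * z * z - 1.0),
          c2 * x * z, 0.5 * c2 * (x * x - y * y)]
    return l1 + l2, norm


def gaussian_rbf(distance, num_basis, cutoff):
    """Expand a distance on evenly spaced Gaussians (a hypothetical RBF choice)."""
    width = cutoff / (num_basis - 1)
    return [math.exp(-((distance - k * width) ** 2) / (2.0 * width ** 2))
            for k in range(num_basis)]


e_ij, dist = lift_to_irreps([1.0, 2.0, 2.0])
f_ij = gaussian_rbf(dist, num_basis=8, cutoff=5.0)
print(len(e_ij), dist, len(f_ij))
```

A convenient sanity check for this normalization is that the squared components of each degree $l$ sum to $(2l+1)/(4\pi)$ for any input direction.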

Intermediate layers. Here, we use a superscript $L$ to denote the index of the layer to which the features belong. The message passing between atoms is implemented by a transformer architecture. For each atom $i$ , the neighbouring atoms $j\in\mathcal{N}(i)$ send messages to $i$ , and the messages are aggregated to update the information of $i$ . The query, key, and value of the node features are first calculated: $q_{i}=f_{q}(h_{i})$ , $k_{j}=f_{k}(h_{j})$ , $v_{j}=f_{v}(h_{j})$ . The edge attributes are also converted to auxiliary terms $dk_{j}=f_{dk}(f_{ij})$ and $dv_{j}=f_{dv}(f_{ij})$ to modulate the keys and values of atoms. Here the functions $f$ are all fully-connected linear operations. Then we calculate the self-attention from $i$ to $j$ , which is

a_{ij}={\rm SiLU}\bigg({\rm Cutoff}(\lVert r_{ij}\rVert)\,q_{i}k_{j}dk_{j}+{\rm AttEnhancer}(r_{ij},\overline{E}^{L}_{j})\bigg)

(7)

where ${\rm Cutoff}(\cdot)$ is a cosine cutoff function and ${\rm AttEnhancer}(\cdot)$ is the proposed attention enhancer module, whose details we formulate next. First, recall the dimensions $\overline{E}^{L}_{i}\in\mathbbm{R}^{C*8}$ and $r_{ij}\in\mathbbm{R}^{8}$ . Each of the $C$ abstract edges undergoes a dot product with $r_{ij}$ , and the highest value among them is the output of ${\rm AttEnhancer}$ . In other words,

{\rm AttEnhancer}(\overline{E}^{L}_{i},r_{ij})=\max_{C}(\overline{E}^{L}_{i}\odot r_{ij})

(8)

as introduced in Eq. (6). The values are then multiplied by $dv$ and the attention:

\hat{v}_{j\mapsto i}=v_{j}\cdot dv_{j}\cdot a_{ij}

(9)

$\hat{v}_{j\mapsto i}$ then undergoes two different fully-connected operations to generate two coefficients $s_{1}$ and $s_{2}$ , which are used to generate the abstract edges:

\hat{E}^{L}_{j\mapsto i}=\overline{E}^{L}_{i}\cdot s_{1}+E_{ij}\cdot s_{2}

(10)

These terms, together with $\hat{v}_{j\mapsto i}$ , are aggregated by summation:

\hat{E}^{L}_{i}=\sum_{j\in\mathcal{N}(i)}\hat{E}^{L}_{j\mapsto i}

(11)

\hat{v}^{L}_{i}=\sum_{j\in\mathcal{N}(i)}\hat{v}^{L}_{j\mapsto i}

(12)
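Eqs. (7) to (12) can be sketched for a single central atom as follows. The sketch is a simplification under stated assumptions: the product $q_{i}k_{j}dk_{j}$ is taken per channel, the enhancer uses the central atom's abstract edges as in Eq. (8), and the two fully-connected maps producing $s_{1}$ and $s_{2}$ are replaced by hypothetical fixed scalings. All function and key names are illustrative.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def cosine_cutoff(d, r_cut):
    # Smoothly decays from 1 at d = 0 to 0 at d = r_cut.
    return 0.5 * (math.cos(math.pi * d / r_cut) + 1.0) if d < r_cut else 0.0


def att_enhancer(abstract_edges, e_ij):
    # Eq. (8): largest dot product between any abstract edge and the lifted edge.
    return max(sum(a * b for a, b in zip(edge, e_ij)) for edge in abstract_edges)


def aggregate(q_i, abstract_i, neighbours, r_cut=5.0):
    """neighbours: dicts with keys k, v, dk, dv (length C), e (length 8), dist."""
    num_channels, dim = len(q_i), len(abstract_i[0])
    v_hat = [0.0] * num_channels
    e_hat = [[0.0] * dim for _ in range(num_channels)]
    for nb in neighbours:
        enh = att_enhancer(abstract_i, nb["e"])
        cut = cosine_cutoff(nb["dist"], r_cut)
        for c in range(num_channels):
            a = silu(cut * q_i[c] * nb["k"][c] * nb["dk"][c] + enh)  # Eq. (7)
            msg = nb["v"][c] * nb["dv"][c] * a                       # Eq. (9)
            s1, s2 = msg, 0.5 * msg  # hypothetical stand-ins for two linear maps
            v_hat[c] += msg                                          # Eq. (12)
            for m in range(dim):
                # Eq. (10), summed over neighbours as in Eq. (11)
                e_hat[c][m] += abstract_i[c][m] * s1 + nb["e"][m] * s2
    return v_hat, e_hat


abstract_i = [[0.0] * 8 for _ in range(2)]  # zero-initialized, as in the first layer
neighbour = {"k": [2.0, 2.0], "v": [1.0, 1.0], "dk": [1.0, 1.0], "dv": [1.0, 1.0],
             "e": [1.0] + [0.0] * 7, "dist": 2.5}
v_hat, e_hat = aggregate([1.0, 1.0], abstract_i, [neighbour])
print(v_hat[0], e_hat[0][0])
```

With zero-initialized abstract edges the enhancer contributes nothing, so the first layer reduces to the cutoff-weighted attention alone.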

$\hat{v}^{L}_{i}$ is then mapped to three variables for further operations:

o^{L,1}_{i},o^{L,2}_{i},o^{L,3}_{i}={\rm Linear}(\hat{v}^{L}_{i})

(13)

$\hat{E}^{L}_{i}\in\mathbbm{R}^{C*8}$ and $\overline{E}^{L}_{i}\in\mathbbm{R}^{C*8}$ are used for the following group CG transform and abstract edges shuffling. First, $\overline{E}^{L}_{i}$ undergoes a fully-connected operation along the $C$ dimension and is multiplied by $o^{L,1}_{i}$ , giving $o^{L,1}_{i}\cdot{\rm Linear}(\overline{E}^{L}_{i})$ . This term and $\hat{E}^{L}_{i}$ are then each divided into $G$ groups along the $C$ dimension, yielding $\hat{E}^{L}_{i,t\in G_{g}}\in\mathbbm{R}^{\frac{C}{G}*8}$ and $(o^{L,1}_{i}\cdot{\rm Linear}(\overline{E}^{L}_{i}))_{t\in G_{g}}\in\mathbbm{R}^{\frac{C}{G}*8}$ . Within each group, we perform the CG transform between the two variables in fully connected form with learnable weights, and concatenate the results to generate $d\overline{E}^{\prime L+1}_{i}$ before shuffling, as shown in Eq. (5). For the shuffling strategy, we add $\frac{3C}{2G}$ to each index of the abstract edges $d\overline{E}^{\prime L+1}_{i}$ . The result is then added to $\hat{E}^{L}_{i}$ to form a residual structure:

d\overline{E}^{L+1}_{i}={\rm shuffle}(d\overline{E}^{\prime L+1}_{i})+\hat{E}^{L}_{i}

(14)

where $d\overline{E}^{L+1}_{i}$ is added to $\overline{E}^{L}_{i}$ to obtain $\overline{E}^{L+1}_{i}$ . Next, we update $h$ and $f$ . We first show the update for $h$ :

dh^{L+1}_{i}=h^{L}_{i}+\bigg({\rm Linear_{1}}(\overline{E}^{L}_{i})\odot{\rm Linear_{2}}(\overline{E}^{L}_{i})\bigg)\cdot o^{L,2}_{i}+o^{L,3}_{i}

(15)

To update $f$ , we follow VisNet to leverage rejection of vectors:

df^{L+1}_{ij}=f^{L}_{ij}+{\rm RejCalc_{trg}}(\overline{E}^{L}_{i},r_{ij})\odot{\rm RejCalc_{src}}(\overline{E}^{L}_{i},r_{ij})\cdot{\rm SiLU}({\rm Linear}(f^{L}_{ij}))

(16)

where the rejection calculation module ${\rm RejCalc}$ is:

{\rm RejCalc_{mode}}(a,b)=a-({\rm Linear_{mode}}(a)\odot b)\cdot b

(17)
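The geometric operation behind Eq. (17) is the rejection of one vector from another, i.e. removing the component of $a$ that lies along $b$. The sketch below shows it for plain 3D vectors, assuming that ${\rm Linear_{mode}}$ is the identity and that $b$ is normalized to unit length; in the model the inputs are abstract edges rather than 3D vectors, so this is only an illustration of the idea.

```python
import math


def rejection(a, b):
    """Remove from `a` its component along the direction of `b`."""
    norm_b = math.sqrt(sum(x * x for x in b))
    unit_b = [x / norm_b for x in b]
    along = sum(x * y for x, y in zip(a, unit_b))  # projection coefficient
    return [x - along * u for x, u in zip(a, unit_b)]


rej = rejection([1.0, 2.0, 3.0], [0.0, 0.0, 2.0])
print(rej)
```

The result is orthogonal to $b$ by construction, which is what makes it a useful complement to the dot-product features.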

The updated $\overline{E}^{L+1}$ , $h^{L+1}$ , and $f^{L+1}$ are fed into the next layer.

Output layers. The output layers differ depending on the task the model performs. We introduce the details for each task.

Force field prediction. Our model describes an energy-conserving force field, which means that we derive the force from the predicted potential energy. Following VisNet wang2024enhancing and PaiNN schutt2021equivariant , we predict the potential energy of the molecule via an equivariant gated module:

h^{L+1}_{i},r^{L+1}_{i}={\rm MLP}\bigg({\rm Concat}(h,\lVert{\rm Linear_{1}}(\overline{E}^{L}_{i})\rVert)\bigg)

(18)

where ${\rm MLP}$ is a multi-layer perceptron with one hidden layer. There is one more step to update $\overline{E}^{L+1}_{i}$ :

\overline{E}^{L+1}_{i}={\rm Linear_{2}}(\overline{E}^{L}_{i})\cdot r^{L+1}_{i}

(19)

We stack this module twice. Finally, the total energy of the molecule is the sum of the last-layer node features $h^{\bm{L}}$ :

y=\sum_{i}h^{\bm{L}}_{i}

(20)

and the force is the negative gradient of the total energy:

F_{i}=-\nabla_{r_{i}}y

(21)
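The relation between energy and force can be illustrated without any learning framework. The sketch below uses a hypothetical toy energy (one harmonic bond) and central finite differences in place of automatic differentiation, which is what would normally be used in practice; the function names and the toy potential are assumptions.

```python
import math


def energy(positions, k=1.0, r0=1.0):
    """Toy energy: one harmonic bond between atoms 0 and 1 (hypothetical)."""
    d = math.dist(positions[0], positions[1])
    return 0.5 * k * (d - r0) ** 2


def forces(energy_fn, positions, h=1e-5):
    """Approximate F_i = -dE/dr_i with central finite differences."""
    result = []
    for i, pos in enumerate(positions):
        f_i = []
        for axis in range(len(pos)):
            plus = [list(p) for p in positions]
            minus = [list(p) for p in positions]
            plus[i][axis] += h
            minus[i][axis] -= h
            f_i.append(-(energy_fn(plus) - energy_fn(minus)) / (2.0 * h))
        result.append(f_i)
    return result


pos = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]
f = forces(energy, pos)
print(f)
```

Because the forces are derived from a single scalar energy, they sum to zero over the molecule for this translation-invariant toy potential, which is one practical benefit of the energy-conserving formulation.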

Property prediction. The calculations for properties in QM9 follow the same procedure as energy prediction in force field prediction, with the exception of molecular dipole and electronic spatial extent. We first need to calculate the center of mass $r_{c}$ , which is:

r_{c}=\frac{\sum_{i}m_{i}\cdot r_{i}}{\sum_{i}m_{i}}

(22)

For molecular dipole, the formula is:

\mu=\left\|\sum_{i}\left(\overline{E}_{i}^{\bm{L}}+h_{i}^{\bm{L}}(r_{i}-r_{c})\right)\right\|

(23)

and for electronic spatial extent:

\langle R^{2}\rangle=\sum_{i}h_{i}^{\bm{L}}\lVert r_{i}-r_{c}\rVert^{2}

(24)

It suffices to change the output head for different tasks.
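The two special output heads can be sketched as below. The per-atom scalars and 3D vectors stand in for the last-layer features $h_{i}^{\bm{L}}$ and the vector part of $\overline{E}_{i}^{\bm{L}}$; treating the latter as a plain 3D vector, grouping both terms under the sum, and using the squared distance for $\langle R^{2}\rangle$ are assumptions based on the usual definitions of these quantities.

```python
import math


def center_of_mass(masses, positions):
    total = sum(masses)
    return [sum(m * p[a] for m, p in zip(masses, positions)) / total
            for a in range(3)]


def dipole_magnitude(scalars, vectors, masses, positions):
    """Norm of sum_i (vector_i + scalar_i * (r_i - r_c))."""
    r_c = center_of_mass(masses, positions)
    total = [0.0, 0.0, 0.0]
    for s, v, p in zip(scalars, vectors, positions):
        for a in range(3):
            total[a] += v[a] + s * (p[a] - r_c[a])
    return math.sqrt(sum(t * t for t in total))


def spatial_extent(scalars, masses, positions):
    """sum_i scalar_i * |r_i - r_c|^2 (squared distance assumed)."""
    r_c = center_of_mass(masses, positions)
    return sum(s * sum((p[a] - r_c[a]) ** 2 for a in range(3))
               for s, p in zip(scalars, positions))


positions = [[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0]]
masses = [1.0, 1.0]
zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(dipole_magnitude([1.0, -1.0], zeros, masses, positions))
print(spatial_extent([1.0, 1.0], masses, positions))
```

In the toy example the scalars play the role of opposite point charges, so the dipole magnitude equals the charge times the separation.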

4.3 Datasets details

MD17 and rMD17. Both are molecular dynamics datasets for small molecules. MD17 chmiela2017machine was proposed by Chmiela, S. et al. and contains ab initio-level molecular dynamics trajectories. Four types of information are provided in the dataset: 1) atomic numbers, 2) atomic positions, 3) molecular energy, and 4) the force on each atom. To alleviate the noise in the trajectory computation, Christensen, A. S. et al. proposed the revised MD17 Christensen2020 , in which the molecular trajectories are calculated at the PBE/def2-SVP level of theory. The tight SCF convergence and dense DFT integration grid further guarantee the accuracy of the calculated trajectories.

MD22. Compared with MD17 and rMD17, MD22 comprises large molecules ranging from 42 to 370 atoms. The trajectories are sampled between 400 K and 500 K at a resolution of 1 fs. The energy and force labels are acquired at the PBE+MBD level of theory. In the original paper chmiela2023accurate , the root mean squared test error of force prediction is controlled to be around 1 kcal/mol/Å. Thus, the training data sizes vary across molecules. Generally, the larger the molecule, the smaller the training data size.

QM9. QM9 consists of around 130,000 molecules with 19 property regression tasks. It is a subset of the GDB-17 database ruddigkeit2012enumeration . The data are calculated at the B3LYP/6-31G(2df,p) DFT level of accuracy. Since the properties have different attributes, we adopt different output heads for them, as discussed in Sec. 4.2.

Chignolin. The AIMD-Chig dataset consists of 2 million conformations of the 166-atom protein Chignolin, sampled at the M06-2X/6-31G* DFT level. There are around 10,000 conformations, which cover folded, unfolded, and metastable states. We take this dataset as our efficiency benchmark, following wang2024enhancing .

4.4 Proof of the permutation invariance of abstract edges

According to Sec. 4.2, we first recall the last step for generating abstract edges:

\hat{E}^{L}_{i}=\sum_{j\in\mathcal{N}(i)}\hat{E}^{L}_{j\mapsto i}=\sum_{p\in P}\sum_{j\in\mathcal{N}(i)}\frac{\mathcal{P}(p)\hat{E}^{L}_{j\mapsto i}}{{\rm Card}(P)}

(25)

where $P$ is the set of all permutation operations; here we omit the subscript of $\mathcal{P}$ indicating the specific space it acts on. Note that we are proving that the abstract edges of each atom are permutation invariant, so that the CG transform can be designed freely per atom; the permutation is therefore applied to $j$ but not to $i$ . Since the sum runs over all neighbours, averaging over all permutation operations leaves it unchanged, and thus the last step is permutation invariant. It then suffices to show that each of the previous steps is at least permutation equivariant. It further suffices to show that they are permutation equivariant w.r.t. a single index swap, as every permutation can be composed of several swaps. If we exchange, without loss of generality, indices $x$ and $y$ , then the $a_{ij}$ in which $x$ or $y$ appears as the subscript $j$ are exchanged with each other, and so are $\hat{v}_{j\mapsto i}$ and $\hat{E}^{L}_{j\mapsto i}$ . Thus, the remaining steps are equivariant w.r.t. a single swap, and hence permutation equivariant. Therefore, we conclude that the abstract edges are permutation invariant.
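The argument can be checked numerically on a toy example: if each message depends only on the pair $(i,j)$ and the messages are summed, every reordering of the neighbours gives the same result up to floating-point rounding. The message function below is hypothetical and scalar-valued; it only serves to illustrate the structure of the proof.

```python
import itertools
import math


def message(center, neighbour):
    """Hypothetical per-neighbour message; it only sees the pair (i, j)."""
    return math.tanh(center * neighbour) + neighbour ** 2


def sum_messages(center, neighbours):
    """Sum aggregation over neighbours, as in Eqs. (11) and (12)."""
    return sum(message(center, n) for n in neighbours)


neighbours = [0.3, -1.2, 2.0, 0.7]
reference = sum_messages(0.5, neighbours)
max_dev = max(abs(sum_messages(0.5, list(p)) - reference)
              for p in itertools.permutations(neighbours))
print(max_dev)
```

The check enumerates all 24 orderings of four neighbours; the largest deviation from the reference stays at the level of rounding error.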

4.5 Analysis on the efficiency of CG transform

The CG transform comprises two steps: 1) the tensor product between two irreps, and 2) the decomposition of the output tensor into irreps. These transforms are in fact quadratic homogeneous polynomials. For convenience, we discuss the SO(3) group here. Recall the CG transform formula:

C^{l_{c}}_{m_{c}}=\sum_{m_{a},m_{b}}CA^{l_{a}}_{m_{a}}B^{l_{b}}_{m_{b}}

(26)

where $m_{a}+m_{b}=m_{c}$ . For example, if we take a single multiplication or addition as one basic operation, then generating an $l=2$ irrep from two $l=1$ irreps consumes 1 basic operation for each of $m=\pm 2$ , 3 for each of $m=\pm 1$ , and 5 for $m=0$ , i.e. $2\times 1+2\times 3+5=13$ basic operations in total. A good way to interpret irreps is as generalizations of vectors and scalars. A dot product between two vectors consumes only 5 basic operations, compared with the 13 above. The CG transform is thus considerably more expensive. The number of basic operations for the CG transform between each pair of irreps is shown in Tab. 8.
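The counting in this example can be reproduced with a short script. It assumes that every pair $(m_{a},m_{b})$ with $m_{a}+m_{b}=m_{c}$ contributes one multiplication, and that summing $n$ terms costs $n-1$ additions; vanishing coefficients are not skipped, so for some degree combinations the count is an upper bound. The function name is illustrative.

```python
def basic_operations(l_a, l_b, l_c):
    """Count multiplications and additions for one (l_a, l_b) -> l_c transform.

    Assumption: every (m_a, m_b) pair with m_a + m_b = m_c contributes a term,
    i.e. vanishing coefficients are not skipped.
    """
    total = 0
    for m_c in range(-l_c, l_c + 1):
        terms = sum(1 for m_a in range(-l_a, l_a + 1)
                    if -l_b <= m_c - m_a <= l_b)
        if terms:
            total += 2 * terms - 1  # `terms` multiplications, `terms - 1` additions
    return total


print(basic_operations(1, 1, 2))  # two l=1 irreps producing an l=2 irrep
print(basic_operations(1, 1, 0))  # scalar output, comparable to a dot product
```

For two $l=1$ inputs, the script returns the 13 operations quoted above for an $l=2$ output and 5 for a scalar output, matching the cost of a 3D dot product.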

Author contributions

S. S. initiated and conceived the study, conducted all experiments, and wrote the manuscript under the guidance of Q. C. H. G. discussed the project and set up the experimental platform. Q. C. supervised the project. All authors reviewed and approved the final manuscript.

Data availability

All datasets are accessible for free on the internet. MD17: [http://www.quantum-machine.org/gdml/data/npz]. rMD17: [https://archive.materialscloud.org/record/file?filename=rmd17.tar.bz2&record_id=466]. MD22: [http://www.quantum-machine.org/gdml/data/npz]. QM9: [https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/molnet_publish/qm9.zip]. Chignolin: [https://github.com/microsoft/AI2BMD/tree/ViSNet/chignolin_data].

Code availability

The code for reproduction will be made publicly available upon official publication.

References

(1) Amaro, R. E. & Mulholland, A. J. Multiscale methods in drug design bridge chemical and biological complexity in the search for cures. Nature Reviews Chemistry 2, 0148 (2018).
(2) Das, P. et al. Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations. Nature Biomedical Engineering 5, 613–623 (2021).
(3) Chen, S. et al. Design of target specific peptide inhibitors using generative deep learning and molecular dynamics simulations. Nature Communications 15, 1611 (2024).
(4) Zepeda-Ruiz, L. A., Stukowski, A., Oppelstrup, T. & Bulatov, V. V. Probing the limits of metal plasticity with molecular dynamics simulations. Nature 550, 492–495 (2017).
(5) Liu, M. et al. Layer-by-layer phase transformation in ti3o5 revealed by machine-learning molecular dynamics simulations. Nature Communications 15, 3079 (2024).
(6) Zeng, J., Cao, L., Xu, M., Zhu, T. & Zhang, J. Z. Complex reaction processes in combustion unraveled by neural network-based molecular dynamics simulation. Nature communications 11, 5713 (2020).
(7) Meuwly, M. Machine learning for chemical reactions. Chemical Reviews 121, 10218–10239 (2021).
(8) Srivastava, I., Kotia, A., Ghosh, S. K. & Ali, M. K. A. Recent advances of molecular dynamics simulations in nanotribology. Journal of Molecular Liquids 335, 116154 (2021).
(9) Wang, Z., Zhu, J. & Li, S. Novel strategy for reducing the minimum miscible pressure in a co2–oil system using nonionic surfactant: Insights from molecular dynamics simulations. Applied Energy 352, 121966 (2023).
(10) Kohn, W. & Sham, L. J. Self-consistent equations including exchange and correlation effects. Physical review 140, A1133 (1965).
(11) Martin, R. M. Electronic structure: basic theory and practical methods (Cambridge university press, 2020).
(12) Ceperley, D. M. & Alder, B. J. Ground state of the electron gas by a stochastic method. Physical review letters 45, 566 (1980).
(13) Bartlett, R. J. & Musiał, M. Coupled-cluster theory in quantum chemistry. Reviews of Modern Physics 79, 291 (2007).
(14) Burke, K. Perspective on density functional theory. The Journal of chemical physics 136 (2012).
(15) Jones, R. O. Density functional theory: Its origins, rise to prominence, and future. Reviews of modern physics 87, 897 (2015).
(16) Cohen, A. J., Mori-Sánchez, P. & Yang, W. Insights into current limitations of density functional theory. Science 321, 792–794 (2008).
(17) Lindorff-Larsen, K. et al. Improved side-chain torsion potentials for the amber ff99sb protein force field. Proteins: Structure, Function, and Bioinformatics 78, 1950–1958 (2010).
(18) Brooks, B. R. et al. Charmm: the biomolecular simulation program. Journal of computational chemistry 30, 1545–1614 (2009).
(19) Cui, T. et al. Geometry-enhanced pretraining on interatomic potentials. Nature Machine Intelligence 1–9 (2024).
(20) Wang, Z., Liu, G., Zhou, Y., Wang, T. & Shao, B. Quinnet: efficiently incorporating quintuple interactions into geometric deep learning force fields. In Proceedings of the 37th International Conference on Neural Information Processing Systems, 77043–77055 (2023).
(21) Wang, Y. et al. Enhancing geometric representations for molecules with equivariant vector-scalar interactive message passing. Nature Communications 15, 313 (2024).
(22) Musaelian, A. et al. Learning local equivariant representations for large-scale atomistic dynamics. Nature Communications 14, 579 (2023).
(23) Batzner, S. et al. E (3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nature communications 13, 2453 (2022).
(24) Drautz, R. Atomic cluster expansion for accurate and transferable interatomic potentials. Physical Review B 99, 014104 (2019).
(25) Batatia, I., Kovacs, D. P., Simm, G., Ortner, C. & Csányi, G. Mace: Higher order equivariant message passing neural networks for fast and accurate force fields. Advances in Neural Information Processing Systems 35, 11423–11436 (2022).
(26) Thölke, P. & De Fabritiis, G. Equivariant transformers for neural network based molecular potentials. In International Conference on Learning Representations (2021).
(27) Schütt, K. T., Sauceda, H. E., Kindermans, P.-J., Tkatchenko, A. & Müller, K.-R. Schnet–a deep learning architecture for molecules and materials. The Journal of Chemical Physics 148 (2018).
(28) Chmiela, S. et al. Machine learning of accurate energy-conserving molecular force fields. Science advances 3, e1603015 (2017).
(29) Schütt, K., Unke, O. & Gastegger, M. Equivariant message passing for the prediction of tensorial properties and molecular spectra. In International Conference on Machine Learning, 9377–9388 (PMLR, 2021).
(30) Thomas, N. et al. Tensor field networks: Rotation-and translation-equivariant neural networks for 3d point clouds. arXiv preprint arXiv:1802.08219 (2018).
(31) Satorras, V. G., Hoogeboom, E. & Welling, M. E (n) equivariant graph neural networks. In International conference on machine learning, 9323–9332 (PMLR, 2021).
(32) Gasteiger, J., Groß, J. & Günnemann, S. Directional message passing for molecular graphs. arXiv preprint arXiv:2003.03123 (2020).
(33) Vaswani, A. et al. Attention is all you need. Advances in neural information processing systems 30 (2017).
(34) Fuchs, F., Worrall, D., Fischer, V. & Welling, M. Se (3)-transformers: 3d roto-translation equivariant attention networks. Advances in neural information processing systems 33, 1970–1981 (2020).
(35) Liao, Y.-L. & Smidt, T. Equiformer: Equivariant graph attention transformer for 3d atomistic graphs. arXiv preprint arXiv:2206.11990 (2022).
(36) Gasteiger, J., Becker, F. & Günnemann, S. Gemnet: Universal directional graph neural networks for molecules. Advances in Neural Information Processing Systems 34, 6790–6802 (2021).
(37) Zhang, X., Zhou, X., Lin, M. & Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE conference on computer vision and pattern recognition, 6848–6856 (2018).
(38) Christensen, A. S. & Von Lilienfeld, O. A. On the role of gradients for machine learning of molecular energies and forces. Machine Learning: Science and Technology 1, 045018 (2020).
(39) Chmiela, S. et al. Accurate global machine learning force fields for molecules with hundreds of atoms. Science Advances 9, eadf0873 (2023).
(40) Ruddigkeit, L., Van Deursen, R., Blum, L. C. & Reymond, J.-L. Enumeration of 166 billion organic small molecules in the chemical universe database gdb-17. Journal of chemical information and modeling 52, 2864–2875 (2012).
(41) Ramakrishnan, R., Dral, P. O., Rupp, M. & Von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Scientific data 1, 1–7 (2014).
(42) Zee, A. Group theory in a nutshell for physicists, vol. 17 (Princeton University Press, 2016).
(43) Raczka, R. & Barut, A. O. Theory of group representations and applications (World Scientific Publishing Company, 1986).
(44) Jeevanjee, N. An introduction to tensors and group theory for physicists (Springer, 2011).
(45) Cohen, T. S., Geiger, M., Köhler, J. & Welling, M. Spherical cnns. arXiv preprint arXiv:1801.10130 (2018).
(46) Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25 (2012).
(47) Liu, Y. et al. Spherical message passing for 3d graph networks. arXiv preprint arXiv:2102.05013 (2021).
(48) Zhang, S., Liu, Y. & Xie, L. Efficient and accurate physics-aware multiplex graph neural networks for 3d small molecules and macromolecule complexes. arXiv preprint arXiv:2206.02789 (2022).
(49) Unke, O. T. et al. Spookynet: Learning force fields with electronic degrees of freedom and nonlocal effects. Nature communications 12, 7273 (2021).
(50) Wang, L., Liu, Y., Lin, Y., Liu, H. & Ji, S. Comenet: Towards complete and efficient message passing for 3d molecular graphs. Advances in Neural Information Processing Systems 35, 650–664 (2022).
(51) Qiao, Z. et al. Informing geometric deep learning with electronic interactions to accelerate quantum chemistry. Proceedings of the National Academy of Sciences 119, e2205221119 (2022).
(52) Frank, T., Unke, O. & Müller, K.-R. So3krates: Equivariant attention for interactions on arbitrary length-scales in molecular systems. Advances in Neural Information Processing Systems 35, 29400–29413 (2022).
(53) Batatia, I. et al. The design space of e (3)-equivariant atom-centered interatomic potentials. arXiv preprint arXiv:2205.06643 (2022).
(54) Li, Y. et al. Long-short-range message-passing: A physics-informed framework to capture non-local interaction for scalable molecular dynamics simulation. arXiv preprint arXiv:2304.13542 (2023).
(55) Wang, T., He, X., Li, M., Shao, B. & Liu, T.-Y. Aimd-chig: Exploring the conformational space of a 166-atom protein chignolin with ab initio molecular dynamics. Scientific Data 10, 549 (2023).

Figure 1: Main problems and FreeCG overview. a. Permutation equivariance requires performing the CG transform per atom with consistent settings, which limits the available design space. b. We construct abstract edges that are invariant under permutations, on which the CG transform is applied. This approach ensures that the entire layer is always permutation equivariant, maximizing the available design space. c. The architecture of a single layer of FreeCG. The self-attention mechanism generates abstract edges through a permutation-invariant process. The abstract edges are also used to enhance the quality of the attention score, a module denoted as Attention Enhancer. d. The group CG transform organizes abstract edges into groups and performs the CG transform on each group. We adopt sparse path for the CG transform, enabling lower computational demands while maintaining stronger O(3) equivariance. Abstract edges shuffling improves the information exchange between different irreps. The details of sparse path and abstract edges shuffling are given in Fig. 2.

Table 5: Ablation on different modules.

Method	Aspirin
Method	Val loss	Energy	Force
VisNet	-	0.116	0.155
+ Group tensor product
32 groups	0.0509	0.123	0.144
8 groups	0.0416	0.112	0.129
+ Group shuffling
1-group shuffle	0.0401	0.112	0.128
0.5-group shuffle	0.0396	0.110	0.128
1.5-group shuffle	0.0384	0.111	0.125
+ Attention Enhancer	0.0345	0.110	0.122

Abstract edges shuffling and Attention Enhancer are added upon the best choices of the above modules, with respect to the validation loss.

Table 6: Hardware and software settings.

Hardware		Software
CPU	GPU	Neural Network	Equivariance	Plotting
Intel® Xeon® Gold 6330 CPU @ 2.00GHz	NVIDIA Tesla A100	PyTorch 1.10.0	e3nn 0.5.1	Matplotlib 3.0.3

Table 7: Hyperparameters for each dataset.

Hyperparameter	MD17	rMD17	MD22	QM9
initial learning rate	4e-4, 2e-4	2e-4	2e-4, 1e-4	1e-4
Learning rate decay factor	0.8
Learning rate decay patience	30	30	30	15
Learning rate warmup step	1000	1000	1000	10000
Optimizer	AdamW ( $\beta(0.9,0.999)$ )
Epoch	3000	3000	3000	1500
batch size	4	4	4	32
Number of layers	9
Cutoff	5.0, 4.0	5.0	5.0, 4.0	5.0
Force/Energy loss weights	0.95/0.05	0.95/0.05	0.95/0.05	-
Dimension of latent feature	256	256	256	512
Number of groups	8

Table 8: The basic operation number for each type of CG transform.

$l_{o}=0$	0	1	2	$l_{o}=1$	0	1	2	$l_{o}=2$	0	1	2
0	1	-	-	0	-	3	-	0	-	-	5
1	-	3	-	1	3	6	9	1	-	9	12
2	-	-	5	2	-	9	12	2	5	12	19

$l_{o}$ denotes the output degree. The column and row numbers denote the degrees of the two input irreps, respectively. The cyan blocks denote the operations in normal neural networks, while the others are for high-order CG transforms.