License: arXiv.org perpetual non-exclusive license
arXiv:2310.02508v2 [cs.LG] 27 Dec 2023

Ophiuchus: Scalable Modeling of Protein Structures through Hierarchical Coarse-graining SO(3)-Equivariant Autoencoders

Allan dos Santos Costa
Center for Bits and Atoms
MIT Media Lab, Molecular Machines

Ilan Mitnikov
Center for Bits and Atoms
MIT Media Lab, Molecular Machines

Mario Geiger
Atomic Architects
MIT Research Laboratory of Electronics

Manvitha Ponnapati
Center for Bits and Atoms
MIT Media Lab, Molecular Machines

Tess Smidt
Atomic Architects
MIT Research Laboratory of Electronics

Joseph Jacobson
Center for Bits and Atoms
MIT Media Lab, Molecular Machines

Abstract

Three-dimensional native states of natural proteins display recurring and hierarchical patterns. Yet, traditional graph-based modeling of protein structures is often limited to operating at a single fine-grained resolution, and lacks the hourglass neural architectures needed to learn those high-level building blocks. We narrow this gap by introducing Ophiuchus, an SO(3)-equivariant coarse-graining model that efficiently operates on all-atom protein structures. Our model departs from current approaches that employ graph modeling, instead focusing on local convolutional coarsening to model sequence-motif interactions with efficient time complexity in protein length. We measure the reconstruction capabilities of Ophiuchus across different compression rates, and compare it to existing models. We examine the learned latent space and demonstrate its utility through conformational interpolation. Finally, we leverage denoising diffusion probabilistic models (DDPM) in the latent space to efficiently sample protein structures. Our experiments demonstrate Ophiuchus to be a scalable basis for efficient protein modeling and generation.

1 Introduction

Proteins form the basis of all biological processes and understanding them is critical to biological discovery, medical research and drug development. Their three-dimensional structures often display modular organization across multiple scales, making them promising candidates for modeling in motif-based design spaces [Bystroff & Baker (1998); Mackenzie & Grigoryan (2017); Swanson et al. (2022)]. Harnessing these coarser, lower-frequency building blocks is of great relevance to the investigation of the mechanisms behind protein evolution, folding and dynamics [Mackenzie et al. (2016)], and may be instrumental in enabling more efficient computation on protein structural data through coarse and latent variable modeling [Kmiecik et al. (2016); Ramaswamy et al. (2021)].

Recent developments in deep learning architectures applied to protein sequences and structures demonstrate the remarkable capabilities of neural models in the domain of protein modeling and design [Jumper et al. (2021); Baek et al. (2021b); Ingraham et al. (2022); Watson et al. (2022)]. Still, current state-of-the-art architectures lack the structure and mechanisms to directly learn and operate on modular protein blocks.

To fill this gap, we introduce Ophiuchus, a deep SO(3)-equivariant model that captures joint encodings of sequence-structure motifs of all-atom protein structures. Our model is a novel autoencoder that uses one-dimensional sequence convolutions on geometric features to learn coarsened representations of proteins. Ophiuchus outperforms existing SO(3)-equivariant autoencoders [Fu et al. (2023)] on the protein reconstruction task. We present extensive ablations of model performance across different autoencoder layouts and compression settings. We demonstrate that our model learns a robust and structured representation of protein structures by learning a denoising diffusion probabilistic model (DDPM) [Ho et al. (2020)] in the latent space. We find Ophiuchus to enable significantly faster sampling of protein structures, as compared to existing diffusion models [Wu et al. (2022a); Yim et al. (2023); Watson et al. (2023)], while producing unconditional samples of comparable quality and diversity.

Figure 1: Coarsening a Three-Dimensional Sequence. (a) Each residue is represented with its $\mathbf{C}_{\alpha}$ atom position and geometric features that encode its label and the positions of its other atoms. ($\ast$) Side-chains with non-orderable atoms are encoded in a permutation-invariant way. (b) Our proposed model uses roto-translation equivariant convolutions to coarsen these positions and geometric features. (c) We train deep autoencoders to reconstruct all-atom protein structures directly in three dimensions.

Our main contributions are summarized as follows:

  • Novel Autoencoder: We introduce a novel SO(3)-equivariant autoencoder for protein sequence and all-atom structure representation. We propose novel learning algorithms for coarsening and refining protein representations, leveraging irreducible representations of SO(3) to efficiently model geometric information. We demonstrate the power of our latent space through unsupervised clustering and latent interpolation.

  • Extensive Ablation: We offer an in-depth examination of our architecture through extensive ablation across different protein lengths, coarsening resolutions and model sizes. We study the trade-off of producing a coarsened representation of a protein at different resolutions and the recoverability of its sequence and structure.

  • Latent Diffusion: We explore a novel generative approach to proteins by performing latent diffusion on geometric feature representations. We train diffusion models for multiple resolutions, and provide diverse benchmarks to assess sample quality. To the best of our knowledge, this is the first generative model to directly produce all-atom structures of proteins.

2 Background and Related Work

2.1 Modularity and Hierarchy in Proteins

Protein sequences and structures display significant degrees of modularity. [Vallat et al. (2015)] introduces a library of common super-secondary structural motifs (Smotifs), while [Mackenzie et al. (2016)] shows protein structural space to be efficiently describable by small tertiary alphabets (TERMs). Motif-based methods have been successfully used in protein folding and design [Bystroff & Baker (1998); Li et al. (2022)]. Inspired by this hierarchical nature of proteins, our proposed model learns coarse-grained representations of protein structures.

2.2 Symmetries in Neural Architecture for Biomolecules

Learning algorithms greatly benefit from proactively exploiting symmetry structures present in their data domain [Bronstein et al. (2021); Smidt (2021)]. In this work, we investigate three relevant symmetries for the domain of protein structures:

Euclidean Equivariance of Coordinates and Feature Representations. Neural models equipped with roto-translational (Euclidean) invariance or equivariance have been shown to outperform competitors in molecular and point cloud tasks [Townshend et al. (2022); Miller et al. (2020); Deng et al. (2021)]. Similar results have been extensively reported across different structural tasks of protein modeling [Liu et al. (2022); Jing et al. (2021)]. Our proposed model takes advantage of Euclidean equivariance both in processing of coordinates and in its internal feature representations, which are composed of scalars and higher order geometric tensors [Thomas et al. (2018); Weiler et al. (2018)].

Translation Equivariance of Sequence. One-dimensional Convolutional Neural Networks (CNNs) have been demonstrated to successfully model protein sequences across a variety of tasks [Karydis (2017); Hou et al. (2018); Lee et al. (2019); Yang et al. (2022)]. These models capture sequence-motif representations that are equivariant to translation of the sequence. However, sequential convolution is less common in architectures for protein structures, which are often cast as Graph Neural Networks (GNNs) [Zhang et al. (2021)]. Notably, [Fan et al. (2022)] proposes a CNN to model the regularity of one-dimensional sequences along with three-dimensional structures, but they restrict their layout to coarsening. In this work, we further integrate geometry into sequence by directly using three-dimensional vector feature representations and transformations in 1D convolutions. We use this CNN to investigate an autoencoding approach to protein structures.

Permutation Invariances of Atomic Order. In order to capture the permutable ordering of atoms, neural models of molecules are often implemented with permutation-invariant GNNs [Wieder et al. (2020)]. Nevertheless, protein structures are sequentially ordered, and most standard side-chain heavy atoms are readily orderable, with the exception of four residues [Jumper et al. (2021)]. We use this fact to design an efficient approach to directly model all-atom protein structures, introducing a method to parse atomic positions in parallel channels as roto-translational equivariant feature representations.

2.3 Unsupervised Learning of Proteins

Unsupervised techniques for capturing protein sequence and structure have witnessed remarkable advancements in recent years [Lin et al. (2023); Elnaggar et al. (2023); Zhang et al. (2022)]. Amongst unsupervised methods, autoencoder models learn to produce efficient low-dimensional representations using an informational bottleneck. These models have been successfully deployed to diverse protein tasks of modeling and sampling [Eguchi et al. (2020); Lin et al. (2021); Wang et al. (2022); Mansoor et al. (2023); Visani et al. (2023)], and have received renewed attention for enabling the learning of coarse representations of molecules [Wang & Gómez-Bombarelli (2019); Yang & Gómez-Bombarelli (2023); Winter et al. (2021); Wehmeyer & Noé (2018); Ramaswamy et al. (2021)]. However, existing three-dimensional autoencoders for proteins do not have the structure or mechanisms to explore the extent to which coarsening is possible in proteins. In this work, we fill this gap with extensive experiments on an autoencoder for deep protein representation coarsening.

2.4 Denoising Diffusion for Proteins

Denoising Diffusion Probabilistic Models (DDPM) [Sohl-Dickstein et al. (2015); Ho et al. (2020)] have found widespread adoption through diverse architectures for generative sampling of protein structures. Chroma [Ingraham et al. (2022)] trains random graph message passing through roto-translational invariant features, while RFDiffusion [Watson et al. (2022)] fine-tunes pretrained folding model RoseTTAFold [Baek et al. (2021a)] to denoising, employing SE(3)-equivariant transformers in structural decoding [Fuchs et al. (2020)]. [Yim et al. (2023); Anand & Achim (2022)] generalize denoising diffusion to frames of reference, employing Invariant Point Attention [Jumper et al. (2021)] to model three dimensions, while FoldingDiff [Wu et al. (2022a)] explores denoising in angular space. More recently, [Fu et al. (2023)] proposes a latent diffusion model on coarsened representations learned through Equivariant Graph Neural Networks (EGNN) [Satorras et al. (2022)]. In contrast, our model uses roto-translation equivariant features to produce increasingly richer structural representations from autoencoded sequence and coordinates. We propose a novel latent diffusion model that samples directly in this space for generating protein structures.

3 The Ophiuchus Architecture

We represent a protein as a sequence of $N$ residues, each with an anchor position $\mathbf{P}\in\mathbb{R}^{1\times 3}$ and a tensor of irreducible representations of SO(3) $\mathbf{V}^{0:l_{\max}}$, where $\mathbf{V}^{l}\in\mathbb{R}^{d\times(2l+1)}$ and degree $l\in[0,l_{\max}]$. A residue state is defined as $(\mathbf{P},\mathbf{V}^{0:l_{\max}})$. These representations are directly produced from sequence labels and all-atom positions, as we describe in the Atom Encoder/Decoder sections. To capture the diverse interactions within a protein, we propose three main components. Self-Interaction learns geometric representations of each residue independently, modeling local interactions of atoms within a single residue. Sequence Convolution simultaneously updates sequential segments of residues, modeling inter-residue interactions between sequence neighbors. Finally, Spatial Convolution employs message-passing of geometric features to model interactions of residues that are nearby in 3D space. We compose these three modules to build an hourglass architecture.
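As a rough illustration of the residue state defined above, the following NumPy sketch stores an anchor and $d$ channels per degree, truncated at $l_{\max}=1$ so that rotations act through an ordinary $3\times 3$ matrix. The dictionary layout and the names `random_rotation`, `make_residue_state` and `transform_state` are assumptions made for this example and are not taken from the Ophiuchus code base.

```python
import numpy as np

def random_rotation(rng):
    """Sample a proper rotation matrix (det = +1) via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))      # fix column signs of the factorization
    if np.linalg.det(q) < 0:         # flip one axis to land in SO(3)
        q[:, 0] = -q[:, 0]
    return q

def make_residue_state(rng, d=4):
    """Residue state (P, V^{0:1}): one anchor plus d channels per degree."""
    return {
        "P": rng.normal(size=(3,)),      # anchor position
        "V0": rng.normal(size=(d,)),     # l = 0: d scalars (2l + 1 = 1)
        "V1": rng.normal(size=(d, 3)),   # l = 1: d vectors (2l + 1 = 3)
    }

def transform_state(state, rot, shift):
    """Apply a rigid motion: only the anchor P feels the translation."""
    return {
        "P": state["P"] @ rot.T + shift,
        "V0": state["V0"].copy(),        # scalars are invariant
        "V1": state["V1"] @ rot.T,       # vectors rotate but do not translate
    }
```

Degrees $l\geq 2$ would transform under the corresponding Wigner-D matrices instead of `rot`; they are omitted here only to keep the sketch short.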

Figure 2: Building Blocks of Ophiuchus: (a) Atom Encoder and (b) Atom Decoder enable the model to directly take and produce atomic coordinates. (c) Self-Interaction updates representations internally, across different vector orders $l$. (d) Spatial Convolution interacts spatial neighbors. (e) Sequence Convolution and (f) Transpose Sequence Convolution communicate sequence neighbors and produce coarser and finer representations, respectively. (g) Hourglass model: we compose those modules to build encoder and decoder models, stacking them into an autoencoder.

3.1 All-Atom Atom Encoder and Decoder

Given a particular $i$-th residue, let $\mathbf{R}_{i}\in\mathcal{R}$ denote its residue label, $\mathbf{P}_{i}^{\alpha}\in\mathbb{R}^{1\times 3}$ denote the global position of its alpha carbon ($\mathbf{C}_{\alpha}$), and $\mathbf{P}^{\ast}_{i}\in\mathbb{R}^{n\times 3}$ the position of all $n$ other atoms relative to $\mathbf{C}_{\alpha}$. We produce initial residue representations $(\mathbf{P},\mathbf{V}^{0:l_{\max}})_{i}$ by setting anchors $\mathbf{P}_{i}=\mathbf{P}^{\alpha}_{i}$, scalars $\mathbf{V}_{i}^{l=0}=\textrm{Embed}(\mathbf{R}_{i})$, and geometric vectors $\mathbf{V}_{i}^{l>0}$ to explicitly encode relative atomic positions $\mathbf{P}^{\ast}_{i}$.

In particular, provided the residue label $\mathbf{R}$, the heavy atoms of most standard protein residues are readily put in a canonical order, enabling direct treatment of atom positions as a stack of signals on SO(3): $\mathbf{V}^{l=1}=\mathbf{P}^{*}$. However, some of the standard residues present two-permutations within their atom configurations, in which pairs of atoms have ordering indices ($v$, $u$) that may be exchanged (Appendix A.1). To handle these cases, we instead use geometric vectors to encode the center $\mathbf{V}^{l=1}_{\textrm{center}}=\frac{1}{2}(\mathbf{P}^{*}_{v}+\mathbf{P}^{*}_{u})$ and the unsigned difference $\mathbf{V}^{l=2}_{\textrm{diff}}=Y_{2}(\frac{1}{2}(\mathbf{P}^{*}_{v}-\mathbf{P}^{*}_{u}))$ between the positions of the pair, where $Y_{2}$ is a spherical harmonics projector of degree $l=2$. This signal is invariant to corresponding atomic two-flips, while still directly carrying information about positioning and angularity. To invert this encoding, we invert a signal of degree $l=2$ into two arbitrarily ordered vectors of degree $l=1$. Please refer to Appendix A.2 for further details.

Input: $\mathbf{C}_{\alpha}$ Position $\mathbf{P}^{\alpha}\in\mathbb{R}^{1\times 3}$
Input: All-Atom Relative Positions $\mathbf{P}^{\ast}\in\mathbb{R}^{n\times 3}$
Input: Residue Label $\mathbf{R}\in\mathcal{R}$
Output: Latent Representation $(\mathbf{P},\mathbf{V}^{l=0:2})$
$\mathbf{P}\leftarrow\mathbf{P}^{\alpha}$
$\mathbf{V}^{l=0}\leftarrow\textrm{Embed}(\mathbf{R})$
$\mathbf{V}^{l=1}_{\textrm{ordered}}\leftarrow\textrm{GetOrderablePositions}(\mathbf{R},\mathbf{P}^{\ast})$
$\mathbf{P}^{*}_{v},\mathbf{P}^{*}_{u}\leftarrow\textrm{GetUnorderablePositionPairs}(\mathbf{R},\mathbf{P}^{\ast})$
$\mathbf{V}^{l=1}_{\textrm{center}}\leftarrow\frac{1}{2}(\mathbf{P}^{*}_{v}+\mathbf{P}^{*}_{u})$
$\mathbf{V}^{l=2}_{\textrm{diff}}\leftarrow Y_{2}\big(\frac{1}{2}(\mathbf{P}^{*}_{v}-\mathbf{P}^{*}_{u})\big)$
$\mathbf{V}^{l=0:2}\leftarrow\mathbf{V}^{l=1}_{\textrm{ordered}}\oplus\mathbf{V}^{l=1}_{\textrm{center}}\oplus\mathbf{V}^{l=2}_{\textrm{diff}}$
return $(\mathbf{P},\mathbf{V}^{l=0:2})$
Algorithm 1 All-Atom Encoding
Input: Latent Representation $(\mathbf{P},\mathbf{V}^{l=0:2})$
Output: $\mathbf{C}_{\alpha}$ Position $\mathbf{P}^{\alpha}\in\mathbb{R}^{1\times 3}$
Output: All-Atom Relative Positions $\mathbf{P}^{\ast}\in\mathbb{R}^{|\mathcal{R}|\times n\times 3}$
Output: Residue Label Logits $\bm{\ell}\in\mathbb{R}^{|\mathcal{R}|}$
$\mathbf{P}^{\alpha}\leftarrow\mathbf{P}$
$\bm{\ell}\leftarrow\textrm{LogSoftmax}\big(\textrm{Linear}(\mathbf{V}^{l=0})\big)$
$\hat{\mathbf{V}}^{l=1}_{\textrm{ordered}}\leftarrow\textrm{Linear}(\mathbf{V}^{l=1})$
$\hat{\mathbf{V}}^{l=1}_{\textrm{center}},\hat{\mathbf{V}}^{l=2}_{\textrm{diff}}\leftarrow\textrm{Linear}(\mathbf{V}^{l=0:2})$
$\Delta\mathbf{P}_{v,u}\leftarrow\textrm{Eigendecompose}\big(\hat{\mathbf{V}}^{l=2}_{\textrm{diff}}\big)$
$\mathbf{P}^{*}_{v},\mathbf{P}^{*}_{u}\leftarrow\hat{\mathbf{V}}^{l=1}_{\textrm{center}}+\Delta\mathbf{P}_{v,u},\ \hat{\mathbf{V}}^{l=1}_{\textrm{center}}-\Delta\mathbf{P}_{v,u}$
$\mathbf{P}^{\ast}\leftarrow\mathbf{P}^{\ast}\oplus\mathbf{P}^{*}_{v}\oplus\mathbf{P}^{*}_{u}$
return $(\mathbf{P}^{\alpha},\mathbf{P}^{\ast},\bm{\ell})$
Algorithm 2 All-Atom Decoding

This processing makes Ophiuchus strictly blind to ordering flips of permutable atoms, while still enabling it to operate directly on all atoms in an efficient, stacked representation. In Appendix A.1, we illustrate how this approach correctly handles the geometry of side-chain atoms.
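The flip-invariance of the pair encoding, and its inversion by eigendecomposition, can be checked numerically with a small NumPy sketch. It rests on one assumption: in place of the degree-2 spherical harmonics projector, it uses the $3\times 3$ symmetric traceless matrix $\mathbf{d}\mathbf{d}^{\top}-\frac{1}{3}|\mathbf{d}|^{2}\mathbf{I}$ of the half-difference $\mathbf{d}$, which carries the same five degrees of freedom in a Cartesian basis. The function names `encode_pair` and `decode_pair` are hypothetical.

```python
import numpy as np

def encode_pair(p_v, p_u):
    """Encode two interchangeable atom positions (relative to C-alpha)."""
    center = 0.5 * (p_v + p_u)                  # l = 1 signal, symmetric in (v, u)
    d = 0.5 * (p_v - p_u)                       # signed half-difference
    # Symmetric traceless matrix: quadratic in d, so d and -d give the same value.
    diff = np.outer(d, d) - (d @ d) / 3.0 * np.eye(3)
    return center, diff

def decode_pair(center, diff):
    """Recover the two positions, in arbitrary order."""
    eigvals, eigvecs = np.linalg.eigh(diff)     # eigenvalues in ascending order
    lam, axis = eigvals[-1], eigvecs[:, -1]     # top eigenpair
    # For diff = d d^T - |d|^2 / 3 I the top eigenvalue equals 2 |d|^2 / 3.
    delta = np.sqrt(max(1.5 * lam, 0.0)) * axis
    return center + delta, center - delta

rng = np.random.default_rng(0)
p_v, p_u = rng.normal(size=3), rng.normal(size=3)
center, diff = encode_pair(p_v, p_u)
center_flipped, diff_flipped = encode_pair(p_u, p_v)
print(np.allclose(center, center_flipped), np.allclose(diff, diff_flipped))
```

The sign of the recovered eigenvector is arbitrary, which is why the decoder can only return the pair up to ordering; this matches the "arbitrarily ordered" output described in the text.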

3.2 Self-Interaction

Our Self-Interaction is designed to model the internal interactions of atoms within each residue. This transformation updates the feature vectors $\mathbf{V}^{0:l_{\max}}_{i}$ centered at the same residue $i$. Importantly, it blends feature vectors $\mathbf{V}^{l}$ of varying degrees $l$ by employing tensor products of the features with themselves. We offer two implementations of these tensor products to cater to different computational needs. Our Self-Interaction module draws inspiration from MACE [Batatia et al. (2022)]. For a comprehensive explanation, please refer to Appendix A.3.

Input: Latent Representation $(\mathbf{P},\mathbf{V}^{0:l_{\max}})$
$\mathbf{V}^{0:l_{\max}}\leftarrow\mathbf{V}^{0:l_{\max}}\oplus\left(\mathbf{V}^{0:l_{\max}}\right)^{\otimes 2}$  $\triangleright$ Tensor Square and Concatenate
$\mathbf{V}^{0:l_{\max}}\leftarrow\mathrm{Linear}(\mathbf{V}^{0:l_{\max}})$  $\triangleright$ Update features
$\mathbf{V}^{0:l_{\max}}\leftarrow\textrm{MLP}(\mathbf{V}^{l=0})\cdot\mathbf{V}^{0:l_{\max}}$  $\triangleright$ Gate Activation Function
return $(\mathbf{P},\mathbf{V}^{0:l_{\max}})$
Algorithm 3 Self-Interaction
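
The three steps of Algorithm 3 can be sketched in NumPy for the restricted case $l_{\max}=1$, where the tensor square reduces to products of scalars, dot products and cross products. This is a simplified illustration under stated assumptions: degree-2 outputs of the tensor square are dropped, the gate MLP is a single linear layer followed by a sigmoid, and the weight layout in `init_params` is hypothetical rather than the parameterization used by the authors.

```python
import numpy as np

def init_params(rng, d):
    """Random weights for the hypothetical lmax = 1 layer below."""
    n_in = d + 2 * d * d                      # input channels plus tensor-square channels
    return {
        "W0": rng.normal(size=(n_in, d)) / d,     # mixes scalar channels
        "W1": rng.normal(size=(n_in, d)) / d,     # mixes vector channels
        "Wg": rng.normal(size=(d, 2 * d)) / d,    # one-layer gate network
    }

def self_interaction(v0, v1, params):
    """v0: (d,) scalars and v1: (d, 3) vectors of a single residue."""
    d = v0.shape[0]
    # Tensor square, keeping only the l = 0 and l = 1 outputs.
    s_ss = np.outer(v0, v0).reshape(-1)                               # 0 x 0 -> 0
    s_vv = (v1 @ v1.T).reshape(-1)                                    # 1 x 1 -> 0 (dot)
    v_sv = (v0[:, None, None] * v1[None, :, :]).reshape(-1, 3)        # 0 x 1 -> 1
    v_vv = np.cross(v1[:, None, :], v1[None, :, :]).reshape(-1, 3)    # 1 x 1 -> 1 (cross)
    # Concatenate with the input, then mix channels within each degree.
    cat0 = np.concatenate([v0, s_ss, s_vv])                           # (d + 2 d^2,)
    cat1 = np.concatenate([v1, v_sv, v_vv], axis=0)                   # (d + 2 d^2, 3)
    out0 = cat0 @ params["W0"]                                        # (d,)
    out1 = params["W1"].T @ cat1                                      # (d, 3)
    # Gate: rotation-invariant scalars rescale every output channel.
    gates = 1.0 / (1.0 + np.exp(-(out0 @ params["Wg"])))              # (2 d,)
    return gates[:d] * out0, gates[d:, None] * out1
```

Because every operation is either invariant (dot products, scalar products) or rotates with its inputs (scaled vectors, cross products under proper rotations), rotating `v1` before the call rotates the returned vectors by the same matrix and leaves the returned scalars unchanged.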

3.3 Sequence Convolution

To take advantage of the sequential nature of proteins, we propose a one-dimensional, roto-translational equivariant convolutional layer for acting on geometric features and positions of sequence neighbors. Given a kernel window size $K$ and stride $S$, we concatenate representations $\mathbf{V}^{0:l_{\max}}_{i-\frac{K}{2}:i+\frac{K}{2}}$ with the same $l$ value. Additionally, we include normalized relative vectors between anchoring positions $\mathbf{P}_{i-\frac{K}{2}:i+\frac{K}{2}}$. Following conventional CNN architectures, this concatenated representation undergoes a linear transformation. The scalars in the resulting representation are then converted into weights, which are used to combine window coordinates into a new coordinate. To ensure translational equivariance, these weights are constrained to sum to one.

Input: Window of Latent Representations $(\mathbf{P}, \mathbf{V}^{0:l_{\max}})_{1:K}$
$w_{1:K} \leftarrow \textrm{Softmax}\big(\textrm{MLP}(\mathbf{V}^{0}_{1:K})\big)$   $\triangleright$ Such that $\sum_{k} w_{k} = 1$
$\mathbf{P} \leftarrow \sum_{k=1}^{K} w_{k} \mathbf{P}_{k}$   $\triangleright$ Coarsen coordinates
$\tilde{\mathbf{V}} \leftarrow \bigoplus_{K} \mathbf{V}^{0:l_{\max}}_{1:K}$   $\triangleright$ Stack features
$\tilde{\mathbf{P}} \leftarrow \bigoplus_{i=1,j=1}^{K,K} Y(\mathbf{P}_{i} - \mathbf{P}_{j})$   $\triangleright$ Stack relative vectors
$\mathbf{V}^{0:l_{\max}} \leftarrow \textrm{Linear}\big(\tilde{\mathbf{V}} \oplus \tilde{\mathbf{P}}\big)$   $\triangleright$ Coarsen features and vectors
return $(\mathbf{P}, \mathbf{V}^{0:l_{\max}})$
Algorithm 4 Sequence Convolution
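The coordinate-coarsening step of Algorithm 4 can be illustrated with a minimal sketch in plain Python. It covers only the weighted averaging of positions, not the feature path. The per-residue `logits` are a hypothetical stand-in for the output of the MLP on the scalar features, and the unpadded sliding window is an assumption made for illustration.

```python
import math

def softmax(values):
    """Numerically stable softmax; the returned weights sum to one."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def coarsen_positions(positions, logits, kernel=5, stride=3):
    """Return one weighted-average position per window of `kernel` residues.

    positions: list of (x, y, z) tuples, one per residue.
    logits: one float per residue (hypothetical stand-in for MLP(V^0)).
    """
    coarse = []
    for start in range(0, len(positions) - kernel + 1, stride):
        window = positions[start:start + kernel]
        weights = softmax(logits[start:start + kernel])
        coarse.append(tuple(
            sum(w * p[axis] for w, p in zip(weights, window))
            for axis in range(3)
        ))
    return coarse

# Toy chain of 11 residues and arbitrary logits.
chain = [(float(i), math.sin(i), math.cos(i)) for i in range(11)]
logits = [0.1 * i for i in range(11)]
coarse = coarsen_positions(chain, logits)

# Translating every input by the same offset translates every output by it.
offset = (10.0, -2.0, 0.5)
shifted_chain = [tuple(c + o for c, o in zip(p, offset)) for p in chain]
coarse_shifted = coarsen_positions(shifted_chain, logits)
print(len(coarse), coarse[0], coarse_shifted[0])
```

Because the weights depend only on scalar (rotation- and translation-invariant) inputs and sum to one, the coarse coordinate is a convex combination of the window positions, which is what makes the operation commute with translations.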

When $S > 1$, sequence convolutions reduce the dimensionality along the sequence axis, yielding mixed representations and coarse coordinates. To reverse this procedure, we introduce a transpose convolution algorithm that uses its $\mathbf{V}^{l=1}$ features to spawn coordinates. For further details, please refer to Appendix A.6.

3.4 Spatial Convolution

To capture interactions of residues that are close in three-dimensional space, we introduce the Spatial Convolution. This operation updates representations and positions through message passing among the $k$ nearest spatial neighbors. Message representations incorporate SO(3) signals from the vector difference between neighbor coordinates, and we aggregate messages with a permutation-invariant mean. After aggregation, we linearly transform the vector representations into an update for the coordinates.

Input: Latent Representations $(\mathbf{P}, \mathbf{V}^{0:l_{\max}})_{1:N}$
Input: Output Node Index $i$
$(\tilde{\mathbf{P}}, \tilde{\mathbf{V}}^{0:l_{\max}})_{1:k} \leftarrow k\textrm{-Nearest-Neighbors}(\mathbf{P}_{i}, \mathbf{P}_{1:N})$
$R_{1:k},\; \phi_{1:k} \leftarrow \textrm{Embed}(\|\tilde{\mathbf{P}}_{1:k} - \mathbf{P}_{i}\|_{2}),\; Y(\tilde{\mathbf{P}}_{1:k} - \mathbf{P}_{i})$   $\triangleright$ Edge Features
$\tilde{\mathbf{V}}^{0:l_{\max}}_{1:k} \leftarrow \textrm{MLP}(R_{k}) \cdot \big(\textrm{Linear}(\tilde{\mathbf{V}}^{0:l_{\max}}_{1:k}) + \textrm{Linear}(\phi_{1:k})\big)$   $\triangleright$ Prepare messages
$\mathbf{V}^{0:l_{\max}} \leftarrow \textrm{Linear}\Big(\mathbf{V}^{0:l_{\max}}_{i} + \frac{1}{k}\big(\sum_{k} \tilde{\mathbf{V}}^{0:l_{\max}}_{k}\big)\Big)$   $\triangleright$ Aggregate and update
$\mathbf{P} \leftarrow \mathbf{P}_{i} + \textrm{Linear}(\mathbf{V}^{l=1})$   $\triangleright$ Update positions
return $(\mathbf{P}, \mathbf{V}^{0:l_{\max}})$
Algorithm 5 Spatial Convolution
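To make the structure of Algorithm 5 concrete, the following is a heavily simplified sketch in plain Python restricted to $l=1$ (vector) features. The scalar weights `a`, `b`, `c` and the exponential radial gate are hypothetical stand-ins for the learned Linear and MLP maps, and excluding the node itself from its neighbor list is an assumption. It is not the e3nn-jax implementation, only an illustration of why the update is rotation equivariant.

```python
import math

def nearest_neighbors(i, positions, k):
    """Indices of the k nodes closest to node i (node i itself excluded)."""
    others = [j for j in range(len(positions)) if j != i]
    others.sort(key=lambda j: math.dist(positions[i], positions[j]))
    return others[:k]

def spatial_update(i, positions, vectors, k=3, a=0.5, b=0.3, c=0.1):
    """One simplified message-passing step for node i.

    a, b, c are hypothetical scalars replacing the learned Linear layers.
    """
    p_i = positions[i]
    neighbors = nearest_neighbors(i, positions, k)
    mean_msg = [0.0, 0.0, 0.0]
    for j in neighbors:
        rel = [pj - pi for pj, pi in zip(positions[j], p_i)]
        dist = math.sqrt(sum(r * r for r in rel))
        gate = math.exp(-dist)  # radial gate, stand-in for MLP(R)
        for axis in range(3):
            # Neighbor vector feature plus unit direction (l=1 harmonic up to scale).
            message = gate * (a * vectors[j][axis] + b * rel[axis] / dist)
            mean_msg[axis] += message / len(neighbors)  # permutation-invariant mean
    new_v = [vectors[i][axis] + mean_msg[axis] for axis in range(3)]
    new_p = [p_i[axis] + c * new_v[axis] for axis in range(3)]
    return new_p, new_v

def rotate_z90(v):
    """Rotate a 3D vector by 90 degrees about the z axis."""
    return (-v[1], v[0], v[2])

points = [(0.0, 0.0, 0.0), (1.0, 0.2, 0.0), (0.3, 1.5, 0.1),
          (-0.8, 0.4, 1.1), (2.0, 2.0, 2.0)]
feats = [(0.1, 0.0, 0.2), (0.0, 1.0, 0.0), (0.5, -0.5, 0.0),
         (0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]

p_out, v_out = spatial_update(0, points, feats)
p_rot, v_rot = spatial_update(0, [rotate_z90(p) for p in points],
                              [rotate_z90(v) for v in feats])
print(rotate_z90(p_out), p_rot)
```

Rotating the inputs and then updating gives the same result as updating and then rotating, because every term in the message is either a scalar function of distances or a vector that rotates with the input.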

3.5 Deep Coarsening Autoencoder

We compose Spatial Convolution, Self-Interaction and Sequence Convolution modules to define a coarsening/refining block. The block mixes representations across all relevant axes of the domain while producing coarsened, downsampled positions and mixed embeddings. The reverse result, finer positions and decoupled embeddings, is achieved by replacing the standard Sequence Convolution with its transpose counterpart. When employing sequence convolutions of stride $S > 1$, we increase the dimensionality of the feature representation according to a rescaling factor hyperparameter $\rho$. We stack $L$ coarsening blocks to build a deep neural encoder $\mathcal{E}$ (Alg. 7), and symmetrically $L$ refining blocks to build a decoder $\mathcal{D}$ (Alg. 8).
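As a small illustration of the rescaling factor $\rho$, the sketch below grows the channel count layer by layer. Rounding down after each multiplication is an assumption (the text does not state the rounding rule), although it does reproduce the Channels/Layer lists reported in Tables 1 and 2. The function name is hypothetical.

```python
import math

def channel_schedule(base, rho, num_entries):
    """Channel count per layer, multiplied by rho and rounded down at each step."""
    widths = [base]
    for _ in range(num_entries - 1):
        widths.append(math.floor(widths[-1] * rho))
    return widths

schedule_15 = channel_schedule(16, 1.5, 5)
schedule_17 = channel_schedule(16, 1.7, 5)
schedule_20 = channel_schedule(16, 2.0, 5)
print(schedule_15)  # [16, 24, 36, 54, 81]
print(schedule_17)  # [16, 27, 45, 76, 129]
print(schedule_20)  # [16, 32, 64, 128, 256]
```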

3.6 Autoencoder Reconstruction Losses

We use a number of reconstruction losses to ensure good quality of produced proteins.

Vector Map Loss. We train the model by directly comparing internal three-dimensional vector difference maps. Let $V(\mathbf{P})$ denote the internal vector map between all atoms $\mathbf{P}$ in our data, that is, $V(\mathbf{P})^{i,j} = (\mathbf{P}^{i} - \mathbf{P}^{j}) \in \mathbb{R}^{3}$. We define the vector map loss as $\mathcal{L}_{\textrm{VectorMap}} = \textrm{HuberLoss}(V(\mathbf{P}), V(\hat{\mathbf{P}}))$ [Huber (1992)]. When computing this loss, an additional stage is employed for processing permutation symmetry breaks. More details can be found in Appendix B.1.
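A minimal sketch of the vector map loss in plain Python follows. The Huber threshold `delta=1.0` and the mean reduction over all components are assumptions, and the permutation-symmetry stage of Appendix B.1 is not modeled.

```python
def huber(x, delta=1.0):
    """Huber penalty: quadratic near zero, linear beyond delta."""
    magnitude = abs(x)
    if magnitude > delta:
        return delta * (magnitude - 0.5 * delta)
    return 0.5 * x * x

def vector_map(points):
    """All pairwise difference vectors V[i][j] = P_i - P_j."""
    return [[tuple(a - b for a, b in zip(pi, pj)) for pj in points]
            for pi in points]

def vector_map_loss(pred, target, delta=1.0):
    """Mean Huber penalty over every component of the two vector maps."""
    vp, vt = vector_map(pred), vector_map(target)
    n = len(pred)
    total = sum(huber(vp[i][j][k] - vt[i][j][k], delta)
                for i in range(n) for j in range(n) for k in range(3))
    return total / (3 * n * n)

truth = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
translated = [(x + 4.0, y - 1.0, z + 2.0) for x, y, z in truth]
perturbed = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 3.0)]

loss_translated = vector_map_loss(translated, truth)
loss_perturbed = vector_map_loss(perturbed, truth)
print(loss_translated, loss_perturbed)
```

Since the map is built from differences of positions, a global translation of the prediction leaves the loss unchanged, while any change in relative geometry is penalized.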

Residue Label Cross Entropy Loss. We train the model to predict logits $\bm{\ell}$ over alphabet $\mathcal{R}$ for each residue. We use the cross entropy between predicted logits and ground-truth labels: $\mathcal{L}_{\textrm{CrossEntropy}} = \textrm{CrossEntropy}(\bm{\ell}, \mathbf{R})$.

Chemistry Losses. We incorporate $L_{2}$ norm-based losses for comparing bonds, angles and dihedrals between prediction and ground truth. For non-bonded atoms, a clash loss is evaluated using standard van der Waals atomic radii (Appendix B.2).

Please refer to Appendix B for further details on the loss.

3.7 Latent Diffusion

We train an SO(3)-equivariant DDPM [Ho et al. (2020); Sohl-Dickstein et al. (2015)] on the latent space of our autoencoder. We pre-train an autoencoder and transform each protein $\mathbf{X}$ from the dataset into a geometric tensor of irreducible representations of SO(3): $\mathbf{Z} = \mathcal{E}(\mathbf{X})$. We attach a diffusion process of $T$ steps on the latent variables, making $\mathbf{Z}_{0} = \mathbf{Z}$ and $\mathbf{Z}_{T} = \mathcal{N}(0, 1)$. We follow the parameterization described in [Salimans & Ho (2022)], and train a denoising model to reconstruct the original data $\mathbf{Z}_{0}$ from its noised version $\mathbf{Z}_{t}$:

$$\mathcal{L}_{\textrm{diffusion}} = \mathbb{E}_{\epsilon, t}\big[\, w(\lambda_{t}) \,\|\hat{\mathbf{Z}}_{0}(\mathbf{Z}_{t}) - \mathbf{Z}_{0}\|^{2}_{2} \,\big]$$

In order to ensure that bottleneck representations $\mathbf{Z}_{0}$ are well-behaved for generation purposes, we regularize the latent space of our autoencoder (Appendix B.3). We build a denoising network with $L_{D}$ layers of Self-Interaction. Please refer to Appendix F for further details.
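The diffusion objective above can be sketched on a flattened latent vector as follows. The cosine noise schedule and the truncated-SNR weighting $w(\lambda_t) = \max(e^{\lambda_t}, 1)$ with $\lambda_t = \log(\alpha_t^2 / \sigma_t^2)$ are options discussed by Salimans & Ho (2022); which schedule and weighting Ophiuchus uses is not stated in this section, so both are assumptions here. The latent is treated as a plain list of floats rather than a tensor of irreducible representations, and all function names are hypothetical.

```python
import math
import random

def alpha_sigma(t):
    """Cosine schedule for t in (0, 1]; alpha^2 + sigma^2 = 1."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)

def noise_latent(z0, t, rng):
    """Forward process: Z_t = alpha_t * Z_0 + sigma_t * eps."""
    alpha, sigma = alpha_sigma(t)
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    return [alpha * z + sigma * e for z, e in zip(z0, eps)]

def truncated_snr_weight(t):
    """w(lambda_t) = max(SNR, 1) with SNR = alpha^2 / sigma^2 (assumed weighting)."""
    alpha, sigma = alpha_sigma(t)
    return max(alpha ** 2 / sigma ** 2, 1.0)

def diffusion_loss(z0_hat, z0, t):
    """Weighted squared error between the denoised estimate and the clean latent."""
    squared_error = sum((a - b) ** 2 for a, b in zip(z0_hat, z0))
    return truncated_snr_weight(t) * squared_error

rng = random.Random(0)
z0 = [0.3, -1.2, 0.8, 0.0]
t = 0.25
zt = noise_latent(z0, t, rng)

loss_oracle = diffusion_loss(z0, z0, t)     # a perfect denoiser returns Z_0
loss_identity = diffusion_loss(zt, z0, t)   # a denoiser that returns Z_t unchanged
print(loss_oracle, loss_identity)
```

In training, `t` would be sampled at random for each example and the loss averaged over a batch, with the denoising network supplying the estimate of the clean latent.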

3.8 Implementation

We train all models on single-GPU A6000 and GTX-1080 Ti machines. We implement Ophiuchus in JAX using the Python libraries e3nn-jax [Geiger & Smidt (2022)] and Haiku [Hennigan et al. (2020)].

4 Methods and Results

4.1 Autoencoder Architecture Comparison

We compare Ophiuchus to the architecture proposed in [Fu et al. (2023)], which uses the EGNN-based architecture [Satorras et al. (2022)] for autoencoding protein backbones. To the best of our knowledge, this is the only other model that attempts protein reconstruction in three dimensions with roto-translation equivariant networks. For demonstration purposes, we curate a small dataset of protein $\mathbf{C}_{\alpha}$-backbones from the PDB with lengths between 16 and 64 and a resolution of at most 1.5 Å, resulting in 1267 proteins. We split the data into train, validation and test sets with ratio [0.8, 0.1, 0.1]. In Table 1, we report the test performance at the best validation step, while avoiding over-fitting during training.

Table 1: Reconstruction from single feature bottleneck

| Model | Downsampling Factor | Channels/Layer | # Params [1e6] | C$\alpha$-RMSD (Å) ↓ | Residue Acc. (%) ↑ |
|---|---|---|---|---|---|
| EGNN | 2 | [32, 32] | 0.68 | 1.01 | 88 |
| EGNN | 4 | [32, 32, 48] | 1.54 | 1.12 | 80 |
| EGNN | 8 | [32, 32, 48, 72] | 3.30 | 2.06 | 73 |
| EGNN | 16 | [32, 32, 48, 72, 108] | 6.99 | 11.4 | 25 |
| Ophiuchus | 2 | [5, 7] | 0.018 | 0.11 | 98 |
| Ophiuchus | 4 | [5, 7, 10] | 0.026 | 0.14 | 97 |
| Ophiuchus | 8 | [5, 7, 10, 15] | 0.049 | 0.36 | 79 |
| Ophiuchus | 16 | [5, 7, 10, 15, 22] | 0.068 | 0.43 | 37 |

We find that Ophiuchus vastly outperforms the EGNN-based architecture. Ophiuchus recovers the protein sequence and $\mathbf{C}_{\alpha}$ backbone with significantly better accuracy, while using orders of magnitude fewer parameters. Refer to Appendix C for further details.

4.2 Architecture Ablation

To investigate the effects of different architecture layouts and coarsening rates, we train different instantiations of Ophiuchus to coarsen representations of contiguous 160-sized protein fragments from the Protein Data Bank (PDB) [Berman et al. (2000)]. We filter out entries with a reported resolution value above 2.5 Å, and total sequence lengths larger than 512. We also ensure proteins in the dataset have the same chirality. For ablations, the sequence convolution uses a kernel size of 5 and stride 3, the channel rescale factor per layer is one of {1.5, 1.7, 2.0}, and the number of downsampling layers ranges from 3 to 5. The initial residue representation uses 16 channels, where each channel is composed of one scalar ($l = 0$) and one 3D vector ($l = 1$). All experiments were repeated 3 times with different initialization seeds and data shuffles.
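To relate kernel size and stride to coarsening rates, the following sketch computes how many input residues a single output position covers after a stack of strided one-dimensional convolutions. It assumes unpadded convolutions and uses the standard receptive-field recurrence. With kernel 5 and stride 3 the values are 5, 17, 53, 161 and 485, which line up with the downsampling factors 17, 53 and (approximately) 160 in Table 2 and with the 485-residue model of Section 4.3. How these counts map onto the paper's stated number of downsampling layers is not specified here, so that correspondence is an assumption.

```python
def receptive_field(kernel, stride, num_convs):
    """Input residues covered by one output after num_convs unpadded strided convolutions."""
    field, jump = 1, 1
    for _ in range(num_convs):
        field += (kernel - 1) * jump  # each layer widens the field by (K - 1) input jumps
        jump *= stride                # distance in residues between adjacent outputs
    return field

fields = [receptive_field(5, 3, n) for n in range(1, 6)]
print(fields)  # [5, 17, 53, 161, 485]
```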

Table 2: Ophiuchus Recovery from Compressed Representations - All-Atom

| Downsampling Factor | Channels/Layer | # Params [1e6] | C$\alpha$-RMSD (Å) ↓ | All-Atom RMSD (Å) ↓ | GDT-TS ↑ | GDT-HA ↑ | Residue Acc. (%) ↑ |
|---|---|---|---|---|---|---|---|
| 17 | [16, 24, 36] | 0.34 | 0.90 ± 0.20 | 0.68 ± 0.08 | 94 ± 3 | 76 ± 4 | 97 ± 2 |
| 17 | [16, 27, 45] | 0.38 | 0.89 ± 0.21 | 0.70 ± 0.09 | 94 ± 3 | 77 ± 5 | 98 ± 1 |
| 17 | [16, 32, 64] | 0.49 | 1.02 ± 0.25 | 0.72 ± 0.09 | 92 ± 4 | 73 ± 5 | 98 ± 1 |
| 53 | [16, 24, 36, 54] | 0.49 | 1.03 ± 0.18 | 0.83 ± 0.10 | 91 ± 3 | 72 ± 5 | 60 ± 4 |
| 53 | [16, 27, 45, 76] | 0.67 | 0.92 ± 0.19 | 0.77 ± 0.09 | 93 ± 3 | 75 ± 5 | 66 ± 4 |
| 53 | [16, 32, 64, 128] | 1.26 | 1.25 ± 0.32 | 0.80 ± 0.16 | 86 ± 5 | 65 ± 6 | 67 ± 5 |
| 160 | [16, 24, 36, 54, 81] | 0.77 | 1.67 ± 0.24 | 1.81 ± 0.16 | 77 ± 4 | 54 ± 5 | 17 ± 3 |
| 160 | [16, 27, 45, 76, 129] | 1.34 | 1.39 ± 0.23 | 1.51 ± 0.17 | 83 ± 4 | 60 ± 5 | 17 ± 3 |
| 160 | [16, 32, 64, 128, 256] | 3.77 | 1.21 ± 0.25 | 1.03 ± 0.15 | 87 ± 5 | 65 ± 6 | 27 ± 4 |

In our experiments we find a trade-off between domain coarsening factor and reconstruction performance. In Table 2 we show that although residue recovery suffers from large downsampling factors, structure recovery rates remain comparable between various settings. Moreover, we find that we are able to model all atoms in proteins, as opposed to only C$\alpha$ atoms (as commonly done), and still recover the structure with high precision. These results demonstrate that protein data can be captured effectively and efficiently using sequence-modular geometric representations. We directly utilize the learned compact latent space as shown below by various examples. For further ablation analysis please refer to Appendix D.

Figure 3: Protein Latent Diffusion. (a) We attach a diffusion process to Ophiuchus representations and learn a denoising network to sample embeddings of SO(3) that decode to protein structures. (b) Random samples from 485-length backbone model.

4.3 Latent Conformational Interpolation

To demonstrate the power of Ophiuchus’s geometric latent space, we show smooth interpolation between two states of a protein structure without explicit latent regularization (as opposed to [Ramaswamy et al. (2021)]). We use the PDBFlex dataset [Hrabe et al. (2016)] and pick pairs of flexible proteins. Conformational snapshots of these proteins are used as the endpoints for the interpolation. We train a large Ophiuchus reconstruction model on general PDB data. The model coarsens up to 485-residue proteins into a single high-order geometric representation using 6 convolutional downsampling layers, each with kernel size 5 and stride 3. The endpoint structures are compressed into a single geometric representation, which enables direct latent interpolation in feature space.

We compare the results of linear interpolation in the latent space against linear interpolation in the coordinate domain (Fig. 11). To determine chemical validity of intermediate states, we scan protein data to quantify average bond lengths and inter-bond angles. We calculate the $L_{2}$ deviation from these averages for bonds and angles of interpolated structures. Additionally, we measure atomic clashes by counting collisions of van der Waals radii of non-bonded atoms (Fig. 11). Although the latent and autoencoder-reconstructed interpolations perform worse than direct interpolation near the original data points, we find that only the latent interpolation structures maintain a consistent profile of chemical validity throughout the trajectory, while direct interpolation in the coordinate domain disrupts it significantly. This demonstrates that the learned latent space compactly and smoothly represents protein conformations.
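The failure mode of coordinate-domain interpolation can be seen in a toy example. The sketch below linearly interpolates between a straight four-atom chain and a copy of it rotated by 90 degrees, then measures the $L_{2}$ deviation of bond lengths from a reference value. The chain, the 3.8 Å reference spacing (a typical distance between consecutive C$\alpha$ atoms) and the function names are illustrative assumptions. The same `lerp` could be applied to two latent codes, but decoding them requires a trained decoder, which is not shown.

```python
import math

def lerp(a, b, t):
    """Linear interpolation between two equal-length vectors."""
    return tuple((1 - t) * x + t * y for x, y in zip(a, b))

def bond_lengths(coords):
    """Distances between consecutive atoms in the chain."""
    return [math.dist(coords[i], coords[i + 1]) for i in range(len(coords) - 1)]

def bond_deviation(coords, reference=3.8):
    """L2 deviation of all bond lengths from the reference length."""
    return math.sqrt(sum((b - reference) ** 2 for b in bond_lengths(coords)))

# Two endpoint conformations: the same rigid chain in two orientations.
start = [(3.8 * i, 0.0, 0.0) for i in range(4)]
end = [(0.0, 3.8 * i, 0.0) for i in range(4)]

# Midpoint of a straight-line path in coordinate space.
midpoint = [lerp(p, q, 0.5) for p, q in zip(start, end)]

print(bond_deviation(start), bond_deviation(midpoint), bond_deviation(end))
```

Both endpoints have ideal bond lengths, yet the coordinate-space midpoint shortens every bond by a factor of $\sqrt{2}$, which is the kind of chemical distortion the bond and angle deviations above are designed to detect.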

4.4 Latent Diffusion Experiments and Benchmarks

Our ablation study (Tables 2 and 4) shows successful recovery of backbone structure of large proteins even for large coarsening rates. However, we find that for sequence reconstruction, larger models and longer training times are required. During inference, all-atom models rely on the decoding of the sequence, making it significantly harder for models to resolve all-atom generation. Due to computational constraints, we investigate all-atom latent diffusion models for short sequence lengths, and focus on backbone models for large proteins. We train the all-atom models with mini-proteins of sequences shorter than 64 residues, leveraging the MiniProtein scaffolds dataset produced by [Cao et al. (2022)]. In this regime, our model is precise and successfully reconstructs sequence and all-atom positions. We also instantiate an Ophiuchus model for generating the backbone trace for large proteins of length 485. For that, we train our model on the PDB data curated by [Yim et al. (2023)]. We compare the quality of diffusion samples from our model to RFDiffusion (T=50) and FrameDiff (N=500 and noise=0.1) samples of similar lengths. We generated 500 unconditional samples from each of the models for evaluation.

In Table 3 we compare sampling the latent space of Ophiuchus to existing models. We find that our model performs comparably in terms of different generated sample metrics, while enabling orders of magnitude faster sampling for proteins. For all comparisons we run all models on a single RTX6000 GPU. Please refer to Appendix F for more details.

Table 3: Comparison to different diffusion models.

| Model | Dataset | Sampling Time (s) ↓ | scRMSD (< 2Å) ↑ | scTM (>0.5) ↑ | Diversity ↑ |
|---|---|---|---|---|---|
| FrameDiff [Yim et al. (2023)] | PDB | 8.6 | 0.17 | 0.81 | 0.42 |
| RFDiffusion [Trippe et al. (2023)] | PDB + AlphaFold DB | 50 | 0.79 | 0.99 | 0.64 |
| Ophiuchus-64 All-Atom | MiniProtein | 0.15 | 0.32 | 0.56 | 0.72 |
| Ophiuchus-485 Backbone | PDB | 0.46 | 0.18 | 0.36 | 0.39 |

5 Conclusion and Future Work

In this work, we introduced a new autoencoder model for protein structure and sequence representation. Through extensive ablation on its architecture, we quantified the trade-offs between model complexity and representation quality. We demonstrated the power of our learned representations in latent interpolation, and investigated its usage as basis for efficient latent generation of backbone and all-atom protein structures. Our studies suggest Ophiuchus to provide a strong foundation for constructing state-of-the-art protein neural architectures. In future work, we will investigate scaling Ophiuchus representations and generation to larger proteins and additional molecular domains.

References

  • Anand & Achim (2022) Namrata Anand and Tudor Achim. Protein structure and sequence generation with equivariant denoising diffusion probabilistic models, 2022.
  • Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016.
  • Baek et al. (2021a) Minkyung Baek, Frank DiMaio, Ivan Anishchenko, Justas Dauparas, Sergey Ovchinnikov, Gyu Rie Lee, Jue Wang, Qian Cong, Lisa N. Kinch, R. Dustin Schaeffer, Claudia Millán, Hahnbeom Park, Carson Adams, Caleb R. Glassman, Andy DeGiovanni, Jose H. Pereira, Andria V. Rodrigues, Alberdina A. van Dijk, Ana C. Ebrecht, Diederik J. Opperman, Theo Sagmeister, Christoph Buhlheller, Tea Pavkov-Keller, Manoj K. Rathinaswamy, Udit Dalwadi, Calvin K. Yip, John E. Burke, K. Christopher Garcia, Nick V. Grishin, Paul D. Adams, Randy J. Read, and David Baker. Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373(6557):871–876, August 2021a. doi: 10.1126/science.abj8754. URL https://doi.org/10.1126/science.abj8754.
  • Baek et al. (2021b) Minkyung Baek, Frank DiMaio, Ivan Anishchenko, Justas Dauparas, Sergey Ovchinnikov, Gyu Rie Lee, Jue Wang, Qian Cong, Lisa N Kinch, R Dustin Schaeffer, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373(6557):871–876, 2021b.
  • Batatia et al. (2022) Ilyes Batatia, David Peter Kovacs, Gregor N. C. Simm, Christoph Ortner, and Gabor Csanyi. MACE: Higher order equivariant message passing neural networks for fast and accurate force fields. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=YPpSngE-ZU.
  • Bentley (1975) Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
  • Berman et al. (2000) Helen M Berman, John Westbrook, Zukang Feng, Gary Gilliland, Talapady N Bhat, Helge Weissig, Ilya N Shindyalov, and Philip E Bourne. The protein data bank. Nucleic acids research, 28(1):235–242, 2000.
  • Bronstein et al. (2021) Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Velickovic. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. CoRR, abs/2104.13478, 2021. URL https://arxiv.org/abs/2104.13478.
  • Bystroff & Baker (1998) Christopher Bystroff and David Baker. Prediction of local structure in proteins using a library of sequence-structure motifs. Journal of molecular biology, 281(3):565–577, 1998.
  • Cao et al. (2022) Longxing Cao, Brian Coventry, Inna Goreshnik, Buwei Huang, William Sheffler, Joon Sung Park, Kevin M Jude, Iva Marković, Rameshwar U Kadam, Koen HG Verschueren, et al. Design of protein-binding proteins from the target structure alone. Nature, 605(7910):551–560, 2022.
  • Deng et al. (2021) Congyue Deng, Or Litany, Yueqi Duan, Adrien Poulenard, Andrea Tagliasacchi, and Leonidas Guibas. Vector neurons: A general framework for so(3)-equivariant networks, 2021.
  • Eguchi et al. (2020) Raphael R. Eguchi, Christian A. Choe, and Po-Ssu Huang. Ig-VAE: Generative modeling of protein structure by direct 3d coordinate generation. PLOS Computational Biology, August 2020. doi: 10.1101/2020.08.07.242347. URL https://doi.org/10.1101/2020.08.07.242347.
  • Elfwing et al. (2017) Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning, 2017.
  • Elnaggar et al. (2023) Ahmed Elnaggar, Hazem Essam, Wafaa Salah-Eldin, Walid Moustafa, Mohamed Elkerdawy, Charlotte Rochereau, and Burkhard Rost. Ankh: Optimized protein language model unlocks general-purpose modelling, 2023.
  • Fan et al. (2022) Hehe Fan, Zhangyang Wang, Yi Yang, and Mohan Kankanhalli. Continuous-discrete convolution for geometry-sequence modeling in proteins. In The Eleventh International Conference on Learning Representations, 2022.
  • Fu et al. (2023) Cong Fu, Keqiang Yan, Limei Wang, Wing Yee Au, Michael McThrow, Tao Komikado, Koji Maruhashi, Kanji Uchino, Xiaoning Qian, and Shuiwang Ji. A latent diffusion model for protein structure generation, 2023.
  • Fuchs et al. (2020) Fabian B. Fuchs, Daniel E. Worrall, Volker Fischer, and Max Welling. Se(3)-transformers: 3d roto-translation equivariant attention networks, 2020.
  • Geiger & Smidt (2022) Mario Geiger and Tess Smidt. e3nn: Euclidean neural networks, 2022.
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.
  • Hennigan et al. (2020) Tom Hennigan, Trevor Cai, Tamara Norman, Lena Martens, and Igor Babuschkin. Haiku: Sonnet for JAX. 2020. URL http://github.com/deepmind/dm-haiku.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020.
  • Hou et al. (2018) Jie Hou, Badri Adhikari, and Jianlin Cheng. Deepsf: deep convolutional neural network for mapping protein sequences to folds. Bioinformatics, 34(8):1295–1303, 2018.
  • Hrabe et al. (2016) Thomas Hrabe, Zhanwen Li, Mayya Sedova, Piotr Rotkiewicz, Lukasz Jaroszewski, and Adam Godzik. Pdbflex: exploring flexibility in protein structures. Nucleic acids research, 44(D1):D423–D428, 2016.
  • Huber (1992) Peter J Huber. Robust estimation of a location parameter. In Breakthroughs in statistics: Methodology and distribution, pp.  492–518. Springer, 1992.
  • Ingraham et al. (2022) John Ingraham, Max Baranov, Zak Costello, Vincent Frappier, Ahmed Ismail, Shan Tie, Wujie Wang, Vincent Xue, Fritz Obermeyer, Andrew Beam, and Gevorg Grigoryan. Illuminating protein space with a programmable generative model. biorxiv, December 2022. doi: 10.1101/2022.12.01.518682. URL https://doi.org/10.1101/2022.12.01.518682.
  • Jing et al. (2021) Bowen Jing, Stephan Eismann, Patricia Suriana, Raphael J. L. Townshend, and Ron Dror. Learning from protein structure with geometric vector perceptrons, 2021.
  • Jumper et al. (2021) John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021.
  • Karydis (2017) Thrasyvoulos Karydis. Learning hierarchical motif embeddings for protein engineering. PhD thesis, Massachusetts Institute of Technology, 2017.
  • King & Koes (2021) Jonathan Edward King and David Ryan Koes. Sidechainnet: An all-atom protein structure dataset for machine learning. Proteins: Structure, Function, and Bioinformatics, 89(11):1489–1496, 2021.
  • Kmiecik et al. (2016) Sebastian Kmiecik, Dominik Gront, Michal Kolinski, Lukasz Wieteska, Aleksandra Elzbieta Dawid, and Andrzej Kolinski. Coarse-grained protein models and their applications. Chemical reviews, 116(14):7898–7936, 2016.
  • Lee et al. (2019) Ingoo Lee, Jongsoo Keum, and Hojung Nam. Deepconv-dti: Prediction of drug-target interactions via deep learning with convolution on protein sequences. PLoS computational biology, 15(6):e1007129, 2019.
  • Li et al. (2022) Alex J. Li, Vikram Sundar, Gevorg Grigoryan, and Amy E. Keating. Terminator: A neural framework for structure-based protein design using tertiary repeating motifs, 2022.
  • Liao & Smidt (2023) Yi-Lun Liao and Tess Smidt. Equiformer: Equivariant graph attention transformer for 3d atomistic graphs, 2023.
  • Lin & AlQuraishi (2023) Yeqing Lin and Mohammed AlQuraishi. Generating novel, designable, and diverse protein structures by equivariantly diffusing oriented residue clouds, 2023.
  • Lin et al. (2021) Zeming Lin, Tom Sercu, Yann LeCun, and Alexander Rives. Deep generative models create new and diverse protein structures. In Machine Learning for Structural Biology Workshop, NeurIPS, 2021.
  • Lin et al. (2023) Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023.
  • Liu et al. (2022) David D Liu, Ligia Melo, Allan Costa, Martin Vögele, Raphael JL Townshend, and Ron O Dror. Euclidean transformers for macromolecular structures: Lessons learned. 2022 ICML Workshop on Computational Biology, 2022.
  • Mackenzie & Grigoryan (2017) Craig O Mackenzie and Gevorg Grigoryan. Protein structural motifs in prediction and design. Current opinion in structural biology, 44:161–167, 2017.
  • Mackenzie et al. (2016) Craig O. Mackenzie, Jianfu Zhou, and Gevorg Grigoryan. Tertiary alphabet for the observable protein structural universe. Proceedings of the National Academy of Sciences, 113(47), November 2016. doi: 10.1073/pnas.1607178113. URL https://doi.org/10.1073/pnas.1607178113.
  • Mansoor et al. (2023) Sanaa Mansoor, Minkyung Baek, Hahnbeom Park, Gyu Rie Lee, and David Baker. Protein ensemble generation through variational autoencoder latent space sampling. bioRxiv, pp.  2023–08, 2023.
  • Miller et al. (2020) Benjamin Kurt Miller, Mario Geiger, Tess E. Smidt, and Frank Noé. Relevance of rotationally equivariant convolutions for predicting molecular properties. CoRR, abs/2008.08461, 2020. URL https://arxiv.org/abs/2008.08461.
  • Ramaswamy et al. (2021) Venkata K Ramaswamy, Samuel C Musson, Chris G Willcocks, and Matteo T Degiacomi. Deep learning protein conformational space with convolutions and latent interpolations. Physical Review X, 11(1):011052, 2021.
  • Salimans & Ho (2022) Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models, 2022.
  • Satorras et al. (2022) Victor Garcia Satorras, Emiel Hoogeboom, and Max Welling. E(n) equivariant graph neural networks, 2022.
  • Smidt (2021) Tess E Smidt. Euclidean symmetry and equivariance in machine learning. Trends in Chemistry, 3(2):82–85, 2021.
  • Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics, 2015.
  • Swanson et al. (2022) Sebastian Swanson, Venkatesh Sivaraman, Gevorg Grigoryan, and Amy E Keating. Tertiary motifs as building blocks for the design of protein-binding peptides. Protein Science, 31(6):e4322, 2022.
  • Thomas et al. (2018) Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: Rotation- and translation-equivariant neural networks for 3d point clouds, 2018.
  • Townshend et al. (2022) Raphael J. L. Townshend, Martin Vögele, Patricia Suriana, Alexander Derry, Alexander Powers, Yianni Laloudakis, Sidhika Balachandar, Bowen Jing, Brandon Anderson, Stephan Eismann, Risi Kondor, Russ B. Altman, and Ron O. Dror. Atom3d: Tasks on molecules in three dimensions, 2022.
  • Trippe et al. (2023) Brian L. Trippe, Jason Yim, Doug Tischer, David Baker, Tamara Broderick, Regina Barzilay, and Tommi Jaakkola. Diffusion probabilistic modeling of protein backbones in 3d for the motif-scaffolding problem, 2023.
  • Vallat et al. (2015) Brinda Vallat, Carlos Madrid-Aliste, and Andras Fiser. Modularity of protein folds as a tool for template-free modeling of structures. PLoS computational biology, 11(8):e1004419, 2015.
  • Visani et al. (2023) Gian Marco Visani, Michael N. Pun, Arman Angaji, and Armita Nourmohammad. Holographic-(v)ae: an end-to-end so(3)-equivariant (variational) autoencoder in fourier space, 2023.
  • Wang & Gómez-Bombarelli (2019) Wujie Wang and Rafael Gómez-Bombarelli. Coarse-graining auto-encoders for molecular dynamics. npj Computational Materials, 5(1):125, 2019.
  • Wang et al. (2022) Wujie Wang, Minkai Xu, Chen Cai, Benjamin Kurt Miller, Tess Smidt, Yusu Wang, Jian Tang, and Rafael Gómez-Bombarelli. Generative coarse-graining of molecular conformations, 2022.
  • Watson et al. (2022) Joseph L. Watson, David Juergens, Nathaniel R. Bennett, Brian L. Trippe, Jason Yim, Helen E. Eisenach, Woody Ahern, Andrew J. Borst, Robert J. Ragotte, Lukas F. Milles, Basile I. M. Wicky, Nikita Hanikel, Samuel J. Pellock, Alexis Courbet, William Sheffler, Jue Wang, Preetham Venkatesh, Isaac Sappington, Susana Vázquez Torres, Anna Lauko, Valentin De Bortoli, Emile Mathieu, Regina Barzilay, Tommi S. Jaakkola, Frank DiMaio, Minkyung Baek, and David Baker. Broadly applicable and accurate protein design by integrating structure prediction networks and diffusion generative models. bioRxiv, 2022. doi: 10.1101/2022.12.09.519842. URL https://www.biorxiv.org/content/early/2022/12/10/2022.12.09.519842.
  • Watson et al. (2023) Joseph L Watson, David Juergens, Nathaniel R Bennett, Brian L Trippe, Jason Yim, Helen E Eisenach, Woody Ahern, Andrew J Borst, Robert J Ragotte, Lukas F Milles, et al. De novo design of protein structure and function with rfdiffusion. Nature, pp.  1–3, 2023.
  • Wehmeyer & Noé (2018) Christoph Wehmeyer and Frank Noé. Time-lagged autoencoders: Deep learning of slow collective variables for molecular kinetics. The Journal of Chemical Physics, 148(24), mar 2018. doi: 10.1063/1.5011399. URL https://doi.org/10.1063%2F1.5011399.
  • Weiler et al. (2018) Maurice Weiler, Mario Geiger, Max Welling, Wouter Boomsma, and Taco Cohen. 3d steerable cnns: Learning rotationally equivariant features in volumetric data, 2018.
  • Wieder et al. (2020) Oliver Wieder, Stefan Kohlbacher, Mélaine Kuenemann, Arthur Garon, Pierre Ducrot, Thomas Seidel, and Thierry Langer. A compact review of molecular property prediction with graph neural networks. Drug Discovery Today: Technologies, 37:1–12, 2020.
  • Winter et al. (2021) Robin Winter, Frank Noé, and Djork-Arné Clevert. Auto-encoding molecular conformations, 2021.
  • Wu et al. (2022a) Kevin E. Wu, Kevin K. Yang, Rianne van den Berg, James Y. Zou, Alex X. Lu, and Ava P. Amini. Protein structure generation via folding diffusion, 2022a.
  • Wu et al. (2022b) Ruidong Wu, Fan Ding, Rui Wang, Rui Shen, Xiwen Zhang, Shitong Luo, Chenpeng Su, Zuofan Wu, Qi Xie, Bonnie Berger, et al. High-resolution de novo structure prediction from primary sequence. BioRxiv, pp.  2022–07, 2022b.
  • Yang et al. (2022) Kevin K Yang, Nicolo Fusi, and Alex X Lu. Convolutions are competitive with transformers for protein sequence pretraining. bioRxiv, pp.  2022–05, 2022.
  • Yang & Gómez-Bombarelli (2023) Soojung Yang and Rafael Gómez-Bombarelli. Chemically transferable generative backmapping of coarse-grained proteins, 2023.
  • Yim et al. (2023) Jason Yim, Brian L. Trippe, Valentin De Bortoli, Emile Mathieu, Arnaud Doucet, Regina Barzilay, and Tommi Jaakkola. Se(3) diffusion model with application to protein backbone generation, 2023.
  • Zhang et al. (2021) Xiao-Meng Zhang, Li Liang, Lin Liu, and Ming-Jing Tang. Graph neural networks and their current applications in bioinformatics. Frontiers in genetics, 12:690049, 2021.
  • Zhang et al. (2022) Zuobai Zhang, Minghao Xu, Arian Jamasb, Vijil Chenthamarakshan, Aurelie Lozano, Payel Das, and Jian Tang. Protein representation learning by geometric structure pretraining. arXiv preprint arXiv:2203.06125, 2022.

Appendix

Appendix A Architecture Details

A.1 All-Atom Representation

A canonical ordering of the atoms of each residue enables the local geometry to be described in a stacked array representation, where each feature channel corresponds to an atom. To directly encode positions, we stack the 3D coordinates of each atom. Each coordinate vector behaves as an irreducible representation of $SO(3)$ of degree $l=1$. The atomic coordinates are taken relative to the $\mathbf{C}_{\alpha}$ of each residue. In practice, to implement this ordering we follow the atom14 tensor format of SidechainNet [King & Koes (2021)], where an array $P\in\mathbb{R}^{14\times 3}$ contains the atomic positions per residue. In Ophiuchus, we rearrange this encoding: one of the 14 entries, the $\mathbf{C}_{\alpha}$ coordinate, is used as the absolute position; the 13 remaining 3D vectors are centered at the $\mathbf{C}_{\alpha}$ and used as geometric features. The geometric features of residues with fewer than 14 atoms are zero-padded (Figure 4).

Still, four of the standard residues (Aspartic Acid, Glutamic Acid, Phenylalanine and Tyrosine) have at most two pairs of atoms that are interchangeable, due to the presence of $180^{\circ}$-rotation symmetries [Jumper et al. (2021)]. In Figure 5, we show how stacking their relative positions leads to representations that differ even when the same structure occurs across different rotations of a side-chain. To solve this issue, instead of stacking two 3D vectors $(\mathbf{P}^{*}_{u},\mathbf{P}^{*}_{v})$, our method uses a single $l=1$ vector $\mathbf{V}^{l=1}_{\textrm{center}}=\frac{1}{2}(\mathbf{P}^{*}_{v}+\mathbf{P}^{*}_{u})$. The mean makes this feature invariant to the $(v,u)$ atomic permutations, and the resulting vector points to the midpoint between the two atoms. To fully describe the positioning, the difference $(\mathbf{P}^{*}_{v}-\mathbf{P}^{*}_{u})$ must be encoded as well. For that, we use a single $l=2$ feature $\mathbf{V}^{l=2}_{\textrm{diff}}=Y_{l=2}\big(\frac{1}{2}(\mathbf{P}^{*}_{v}-\mathbf{P}^{*}_{u})\big)$. This feature is produced by projecting the difference of positions into a degree $l=2$ spherical harmonics basis. Let $x\in\mathbb{R}^{3}$ denote a 3D vector. Then its projection into a feature of degree $l=2$ is defined as:

$$Y_{2}(x)=\left[\sqrt{15}\cdot x_{0}\cdot x_{2},\;\sqrt{15}\cdot x_{0}\cdot x_{1},\;\frac{\sqrt{5}}{2}\left(-x_{0}^{2}+2x_{1}^{2}-x_{2}^{2}\right),\;\sqrt{15}\cdot x_{1}\cdot x_{2},\;\frac{\sqrt{15}}{2}\cdot\left(-x_{0}^{2}+x_{2}^{2}\right)\right]$$

where each dimension of the resulting term is indexed by the order $m\in[-l,l]=[-2,2]$, for degree $l=2$. We note that for $m\in\{-2,-1,1\}$, two components of $x$ directly multiply, while for $m\in\{0,2\}$ only squared terms of $x$ are present. In both cases, the terms are invariant to flipping the sign of $x$, such that $Y_{2}(x)=Y_{2}(-x)$. Equivalently, $\mathbf{V}^{l=2}_{\textrm{diff}}$ is invariant to reordering of the two atoms $(v,u)\rightarrow(u,v)$:

$$\mathbf{V}^{l=2}_{\textrm{diff}}=Y_{l=2}\left(\frac{1}{2}(\mathbf{P}^{*}_{v}-\mathbf{P}^{*}_{u})\right)=Y_{l=2}\left(\frac{1}{2}(\mathbf{P}^{*}_{u}-\mathbf{P}^{*}_{v})\right)$$

In Figure 5, we compare the geometric latent space of a network that uses this permutation-invariant encoding with one that uses naive stacking of atomic positions $\mathbf{P}^{*}$. We find that Ophiuchus correctly maps the geometry of the data, while direct stacking leads to representations that do not reflect the underlying symmetries.
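The permutation invariance of this encoding can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation: the helper names `y2` and `encode_pair` and the example coordinates are illustrative assumptions, and `y2` simply transcribes the degree $l=2$ formula given earlier in this section.

```python
import math

def y2(x):
    """Degree l=2 projection of a 3D vector, transcribed from the formula in A.1."""
    x0, x1, x2 = x
    return [
        math.sqrt(15) * x0 * x2,
        math.sqrt(15) * x0 * x1,
        math.sqrt(5) / 2 * (-x0**2 + 2 * x1**2 - x2**2),
        math.sqrt(15) * x1 * x2,
        math.sqrt(15) / 2 * (-x0**2 + x2**2),
    ]

def encode_pair(p_v, p_u):
    """Encode two interchangeable atoms as (V_center, V_diff).

    p_v, p_u: positions relative to the C-alpha (hypothetical values below).
    """
    center = [0.5 * (a + b) for a, b in zip(p_v, p_u)]
    half_diff = [0.5 * (a - b) for a, b in zip(p_v, p_u)]
    return center, y2(half_diff)

# Hypothetical relative positions of a permutable pair of side-chain atoms.
p_v = [1.2, -0.4, 0.7]
p_u = [0.3, 0.9, -1.1]

enc_vu = encode_pair(p_v, p_u)  # atoms listed as (v, u)
enc_uv = encode_pair(p_u, p_v)  # same atoms listed as (u, v)

# Naive stacking depends on the order in which the atoms are listed.
stack_vu = p_v + p_u
stack_uv = p_u + p_v
```

Here `enc_vu` and `enc_uv` agree component by component, while `stack_vu` and `stack_uv` do not, which is the behavior contrasted in Figure 5.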

Figure 4: Geometric Representations of Protein Residues. (a) We leverage the bases of spherical harmonics to represent signals on SO(3). (b-d) Examples of individual protein residues encoded in spherical harmonics bases: the residue label is directly encoded in a degree $l=0$ representation as a one-hot vector; atom positions in a canonical ordering are encoded as $l=1$ features; additionally, unorderable atom positions are encoded as $l=1$ and $l=2$ features that are invariant to their permutation flips. In the figures, we displace $l=2$ features for illustrative purposes; in practice, the signal is processed as centered in the $\mathbf{C}_{\alpha}$. (b) Tryptophan is the largest residue we consider, utilizing all dimensions of $n_{\textrm{max}}=13$ atoms in the input $\mathbf{V}^{l=1}_{\textrm{ordered}}$. (c) Tyrosine needs padding for $\mathbf{V}^{l=1}$, but produces two $\mathbf{V}^{l=2}$ from its two pairs of permutable atoms. (d) Aspartic Acid has a single pair of permutable atoms, and its $\mathbf{V}^{l=2}$ is padded.
Figure 5: Permutation Invariance of Side-chain Atoms in Stacked Geometric Representations. (a) We provide an example in which we rotate the side-chain of Tyrosine by $2\pi$ radians while keeping the ordering of atoms fixed. Note that the structure is the same after rotation by $\pi$ radians. (b) An SO(3)-equivariant network may stack atomic relative positions in a canonical order. However, because of the permutation symmetries, the naive stacked representation will lead to latent representations that are different even when the data is geometrically the same. To demonstrate this, we plot two components ($m=-1$ and $m=0$) of an internal $\mathbf{V}^{l=1}$ feature, while rotating the side-chain positions by $2\pi$ radians. This network represents the structures rotated by $\theta=0$ (red) and $\theta=\pi$ (cyan) differently despite having exactly the same geometric features. (c-d) Plots of internal $\mathbf{V}^{l=1,2}$ of Ophiuchus, which encodes the position of permutable pairs jointly as an $l=1$ center of symmetry and an $l=2$ difference of positions, resulting in the same representation for structures rotated by $\theta=0$ (red) and $\theta=\pi$ (cyan).

A.2 All-Atom Decoding

Given a latent representation at the residue-level, $(\mathbf{P},\mathbf{V}^{0:l_{\max}})$, we take $\mathbf{P}$ directly as the $\mathbf{C}_{\alpha}$ position of the residue. The hidden scalar representations $\mathbf{V}^{l=0}$ are transformed into categorical logits $\bm{\ell}\in\mathbb{R}^{|\mathcal{R}|}$ to predict the probabilities of residue label $\hat{\mathbf{R}}$. To decode the side-chain atoms, Ophiuchus produces relative positions to $\mathbf{C}_{\alpha}$ for all atoms of each residue. During training, we enforce the residue label to be the ground truth. During inference, we output the residue label corresponding to the largest logit value.

Relative coordinates are produced directly from geometric features. We linearly project $\mathbf{V}^{l=1}$ to obtain the relative position vectors of orderable atoms $\hat{\mathbf{V}}^{l=1}_{\textrm{ordered}}$. To decode positions $(\hat{\mathbf{P}}^{*}_{v},\hat{\mathbf{P}}^{*}_{u})$ for an unorderable pair of atoms $(v,u)$, we linearly project $\mathbf{V}^{l>0}$ to predict $\hat{\mathbf{V}}^{l=1}_{\textrm{center}}$ and $\hat{\mathbf{V}}^{l=2}_{\textrm{diff}}$. To produce two relative positions out of $\hat{\mathbf{V}}^{l=2}_{\textrm{diff}}$, we determine the rotation axis around which the feature rotates the least by taking the left-eigenvector with smallest eigenvalue of $\mathcal{X}^{l=2}\hat{\mathbf{V}}^{l=2}_{\textrm{diff}}$, where $\mathcal{X}^{l=2}$ is the generator of the irreducible representations of degree $l=2$. We illustrate this procedure in Figure 6 and explain the process in detail in the caption. This method is effective because the sign of the resulting axis is inherently ambiguous, which matches our requirement that the two vectors be unorderable.
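To make the idea of recovering an unordered pair from an $l=2$ feature concrete, the sketch below decodes a feature that lies exactly on the image of $Y_{2}$. It deliberately swaps in a simpler technique than the generator-based eigenproblem described above: it maps the five components to the symmetric traceless matrix $Q=xx^{\top}-\frac{|x|^{2}}{3}I$ and takes its principal eigenvector by power iteration. For on-manifold inputs both approaches should identify the same axis, but this is an illustrative assumption, not the authors' code; all function names and example values are hypothetical.

```python
import math

def y2(x):
    """Degree l=2 projection of a 3D vector (formula from A.1)."""
    x0, x1, x2 = x
    return [
        math.sqrt(15) * x0 * x2,
        math.sqrt(15) * x0 * x1,
        math.sqrt(5) / 2 * (-x0**2 + 2 * x1**2 - x2**2),
        math.sqrt(15) * x1 * x2,
        math.sqrt(15) / 2 * (-x0**2 + x2**2),
    ]

def y2_to_matrix(c):
    """Symmetric traceless matrix Q = x x^T - (|x|^2 / 3) I from the five components."""
    s15 = math.sqrt(15)
    q01, q02, q12 = c[1] / s15, c[0] / s15, c[3] / s15
    q11 = 2.0 * c[2] / (3.0 * math.sqrt(5))
    d = 2.0 * c[4] / s15  # equals Q22 - Q00
    q00 = 0.5 * (-q11 - d)
    q22 = 0.5 * (-q11 + d)
    return [[q00, q01, q02], [q01, q11, q12], [q02, q12, q22]]

def principal_axis(q, iters=200):
    """Largest eigenvalue and unit eigenvector of a nonzero symmetric 3x3 matrix."""
    # Shifting by the Frobenius norm makes every eigenvalue non-negative.
    shift = math.sqrt(sum(v * v for row in q for v in row))
    a = [[q[i][j] + (shift if i == j else 0.0) for j in range(3)] for i in range(3)]
    v = [1.0, 0.5, 0.25]  # arbitrary start, assumed not orthogonal to the axis
    for _ in range(iters):
        w = [sum(a[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]
    lam = sum(v[i] * q[i][j] * v[j] for i in range(3) for j in range(3))
    return lam, v

def decode_pair(center, v_diff):
    """Recover two unordered positions center +/- delta from (V_center, V_diff)."""
    lam, axis = principal_axis(y2_to_matrix(v_diff))
    radius = math.sqrt(max(1.5 * lam, 0.0))  # on the manifold, lam = 2 |x|^2 / 3
    delta = [radius * t for t in axis]
    return ([c + d for c, d in zip(center, delta)],
            [c - d for c, d in zip(center, delta)])

# Hypothetical permutable pair, encoded as in A.1 and then decoded.
p_v, p_u = [1.2, -0.4, 0.7], [0.3, 0.9, -1.1]
center = [0.5 * (a + b) for a, b in zip(p_v, p_u)]
v_diff = y2([0.5 * (a - b) for a, b in zip(p_v, p_u)])
p_a, p_b = decode_pair(center, v_diff)
```

The decoded pair `(p_a, p_b)` matches `(p_v, p_u)` only up to ordering, since the sign of the recovered axis carries no information.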

Figure 6: Decoding Permutable Atoms. Sketch of how to decode a double-sided arrow ($\mathbf{V}^{l=2}$ signals) into two unordered vectors ($\hat{\mathbf{P}}^{*}_{u}$, $\hat{\mathbf{P}}^{*}_{v}$). The image of $Y^{l=2}:\mathbb{R}^{3}\to\mathbb{R}^{5}$ can be viewed as a 3-dimensional manifold embedded in a 5-dimensional space. We exploit the fact that the points on that manifold are unaffected by a rotation around their pre-image vector. To extend the definition to all the points of $\mathbb{R}^{5}$ (i.e. also outside of the manifold), we look for the axis of rotation with the smallest impact on the 5D entry. For that, we compute the left-eigenvector with the smallest eigenvalue of the action of the generators on the input point: $\mathcal{X}^{l=2}\hat{\mathbf{V}}^{l=2}_{\textrm{diff}}$. The resulting axis is used as a relative position $\pm\Delta\mathbf{P}^{*}_{v,u}$ between the two atoms, and is used to recover atomic positions through $\mathbf{P}^{*}_{v}=\mathbf{V}^{l=1}_{\textrm{center}}+\Delta\mathbf{P}^{*}_{v,u}$ and $\mathbf{P}^{*}_{u}=\mathbf{V}^{l=1}_{\textrm{center}}-\Delta\mathbf{P}^{*}_{v,u}$.
Figure 7: Building Blocks of Self-Interaction. (a) Self-Interaction updates only SO(3)-equivariant features, which are represented as $D$ channels, each with vectors of degree $l$ up to degree $l_{\textrm{max}}$. (b) A roto-translational equivariant linear layer transforms only within the same degree $l$. (c-d) We use the tensor square operation to interact features across different degrees $l$. We employ two instantiations of this operation. (c) The Self-Interaction in autoencoder models applies the square operation within the same channel of the representation, interacting features with themselves across $l$ of the same channel dimension $d\in[0,D]$. (d) The Self-Interaction in diffusion models chunks the representation in groups of channels before the square operation. It is more expressive, but incurs a higher computational cost.

A.3 Details of Self-Interaction

The objective of our Self-Interaction module is to operate exclusively on the geometric features $\mathbf{V}^{0:l_{\max}}$, while mixing irreducible representations across different values of $l$. To accomplish this, we calculate the tensor product of the representation with itself; this operation is termed the "tensor square" and is denoted by $\left(\mathbf{V}^{0:l_{\max}}\right)^{\otimes 2}$. As the number of channels grows, the computational cost of the tensor square increases quadratically. To reduce this cost, we instead perform the square operation channel-wise, or by chunks of channels. Figures 7.c and 7.d illustrate these operations. After obtaining the squares from the chunked or individual channels, the results are concatenated to generate an updated representation $\mathbf{V}^{0:l_{\max}}$, which is transformed through a learnable linear layer to the output dimensionality.
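The cost argument can be illustrated with a rough count. The sketch below uses the number of ordered channel pairs multiplied together as a proxy for cost; it ignores the per-pair cost, which depends on $l_{\max}$, and the function name and channel counts are hypothetical rather than taken from the paper.

```python
def channel_pairs(num_channels, chunk_size=None):
    """Count ordered channel pairs multiplied by a tensor square.

    chunk_size=None: full square over all channels (D * D pairs).
    chunk_size=1:    channel-wise square (D pairs).
    chunk_size=C:    square within each chunk of C channels ((D / C) * C * C pairs).
    """
    if chunk_size is None:
        return num_channels ** 2
    if num_channels % chunk_size != 0:
        raise ValueError("chunk_size must divide num_channels")
    return (num_channels // chunk_size) * chunk_size ** 2

D = 64  # hypothetical number of channels
full = channel_pairs(D)           # 4096
channel_wise = channel_pairs(D, 1)  # 64
chunked = channel_pairs(D, 8)     # 512
```

Under this proxy, the chunked variant grows as $D\cdot C$ rather than $D^{2}$, which sits between the channel-wise and the full square.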

A.4 Non-Linearities

To incorporate non-linearities in our geometric representations $\mathbf{V}^{l=0:l_{\textrm{max}}}$, we employ a roto-translation equivariant gate mechanism similar to the one described in Equiformer [Liao & Smidt (2023)]. This mechanism is present at the last step of Self-Interaction and in the message preparation step of Spatial Convolution (Figure 2). In both cases, we implement the activation by first isolating the scalar representations $\mathbf{V}^{l=0}$ and transforming them through a standard multilayer perceptron (MLP). We use the SiLU activation function [Elfwing et al. (2017)] after each layer of the MLP. The output vector provides one scalar per channel of $\mathbf{V}^{l=0:l_{\textrm{max}}}$, and each scalar is multiplied into its corresponding channel.
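A gate of this kind can be sketched in a few lines. The example below is an assumption-laden illustration, not the Ophiuchus or Equiformer implementation: the MLP is reduced to a single dense layer with fixed, made-up weights, the features contain only $l=0$ and $l=1$ channels, and the names `gate` and `rotate_l1` are hypothetical. It checks that gating commutes with a rotation of the $l=1$ vectors, because the gates depend only on invariant scalars.

```python
import math

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def gate(features, weights, biases):
    """Scale every channel by a gate computed from the invariant l=0 scalars.

    features: dict mapping degree l to a list of channels, each a list of 2l+1 floats.
    weights, biases: one dense layer (hypothetical, fixed here) mapping the
    l=0 scalars to one gate value per channel across all degrees.
    """
    scalars = [channel[0] for channel in features[0]]
    n_gates = sum(len(channels) for channels in features.values())
    gates = [
        silu(sum(w * s for w, s in zip(weights[g], scalars)) + biases[g])
        for g in range(n_gates)
    ]
    gated, g = {}, 0
    for l in sorted(features):
        gated[l] = []
        for channel in features[l]:
            gated[l].append([gates[g] * c for c in channel])
            g += 1
    return gated

def rotate_l1(features, theta):
    """Rotate the l=1 channels about the third axis; l=0 scalars are unchanged."""
    c, s = math.cos(theta), math.sin(theta)
    rotated = dict(features)
    rotated[1] = [[c * v[0] - s * v[1], s * v[0] + c * v[1], v[2]] for v in features[1]]
    return rotated

# Two scalar channels and one vector channel, with made-up values.
features = {0: [[0.3], [-1.0]], 1: [[1.0, 2.0, -0.5]]}
weights = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]  # 3 gates x 2 scalars
biases = [0.0, 0.1, -0.1]

gate_then_rotate = rotate_l1(gate(features, weights, biases), 0.7)
rotate_then_gate = gate(rotate_l1(features, 0.7), weights, biases)
```

Both orders of operations give the same result up to floating-point rounding, which is the equivariance property the gate is designed to preserve.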

Figure 8: Convolutions of Ophiuchus. (a) In the Spatial Convolution, we update the feature representation $\mathbf{V}_{i}^{0:l_{\textrm{max}}}$ and position $\mathbf{P}_{i}$ of a node $i$ by first aggregating messages from its $k$ nearest-neighbors. The message is composed out of the neighbor features $\mathbf{\tilde{V}}_{1:k}^{0:l_{\textrm{max}}}$ and the relative position between the nodes $\mathbf{P}_{i}-\tilde{\mathbf{P}}_{1:k}$. After updating the features $\mathbf{V}_{i}^{0:l_{\textrm{max}}}$, we project vectors $\mathbf{V}_{i}^{l=1}$ to predict an update to the position $\mathbf{P}_{i}$. (b) In the Sequence Convolution, we concatenate the feature representations of sequence neighbors $\mathbf{\tilde{V}}_{1:K}^{0:l_{\textrm{max}}}$ along with spherical harmonic signals of their relative positions $\mathbf{P}_{i}-\mathbf{P}_{j}$, for $i\in[1,K]$, $j\in[1,K]$. The concatenated feature vector is then linearly projected to the output dimensionality. To produce a coarsened position, each $\mathbf{\tilde{V}}_{1:K}^{0:l_{\textrm{max}}}$ produces a score that is used as weight in summing the original positions $\mathbf{P}_{1:K}$. The Softmax function is necessary to ensure the sum of the weights is normalized.

A.5 Roto-Translation Equivariance of Sequence Convolution

A Sequence Convolution kernel takes in $K$ coordinates $\mathbf{P}_{1:K}$ to produce a single coordinate $\mathbf{P}=\sum_{i=1}^{K}w_{i}\mathbf{P}_{i}$. We show that these weights need to be normalized in order for translation equivariance to be satisfied. Let $T$ denote a 3D translation vector, then translation equivariance requires:

$$(\mathbf{P}+T)=\sum_{i=1}^{K}w_{i}(\mathbf{P}_{i}+T)=\sum_{i=1}^{K}w_{i}\mathbf{P}_{i}+\sum_{i=1}^{K}w_{i}T\;\;\rightarrow\;\;\sum_{i=1}^{K}w_{i}=1$$

Rotation equivariance is immediately satisfied since the sum of 3D vectors is a rotation equivariant operation. Let R𝑅Ritalic_R denote a rotation matrix. Then,

(R𝐏)=Ri=1Kwi(𝐏i)=i=1Kwi(R𝐏i)𝑅𝐏𝑅superscriptsubscript𝑖1𝐾subscript𝑤𝑖subscript𝐏𝑖superscriptsubscript𝑖1𝐾subscript𝑤𝑖𝑅subscript𝐏𝑖(R\mathbf{P})=R\sum_{i=1}^{K}w_{i}(\mathbf{P}_{i})=\sum_{i=1}^{K}w_{i}(R% \mathbf{P}_{i})( italic_R bold_P ) = italic_R ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_R bold_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )
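The two identities above can be checked numerically. The following is a minimal sketch, not the authors' implementation: the positions, scores, shift and rotation angle are made-up values, and the scores are assumed to be rotation-invariant scalars so that the same weights apply before and after the transformation.

```python
import math

def softmax(scores):
    """Numerically stable softmax; outputs are positive and sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def coarsen(positions, scores):
    """Weighted sum of K positions using softmax-normalized weights."""
    weights = softmax(scores)
    return [sum(w * p[axis] for w, p in zip(weights, positions)) for axis in range(3)]

def translate(p, t):
    """Shift a 3D point by the vector t."""
    return [a + b for a, b in zip(p, t)]

def rotate_z(p, angle):
    """Rotate a 3D point about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * p[0] - s * p[1], s * p[0] + c * p[1], p[2]]

# Hypothetical window of K = 3 positions and invariant scores.
positions = [[0.0, 0.0, 0.0], [1.5, 0.2, -0.3], [2.9, 1.1, 0.4]]
scores = [0.3, -1.2, 2.0]
shift, angle = [10.0, -4.0, 2.5], 0.7

center = coarsen(positions, scores)
moved = coarsen([translate(p, shift) for p in positions], scores)
rotated = coarsen([rotate_z(p, angle) for p in positions], scores)
```

With normalized weights, `moved` agrees with `translate(center, shift)` and `rotated` agrees with `rotate_z(center, angle)` up to floating-point error. Replacing `softmax` with unnormalized weights would break the first equality but not the second.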

A.6 Transpose Sequence Convolution

Given a single coarse anchor position and feature representation $(\mathbf{P},\mathbf{V}^{l=0:l_{\textrm{max}}})$, we first map $\mathbf{V}^{l=0:l_{\textrm{max}}}$ into a new representation, reshaping it into $K$ chunks of features. We then project $\mathbf{V}^{l=1}$ and produce $K$ relative position vectors $\Delta\mathbf{P}_{1:K}$, which are summed with the original position $\mathbf{P}$ to produce $K$ new coordinates.

Input: Kernel Size $K$
Input: Latent Representations $(\mathbf{P},\mathbf{V}^{0:l_{\max}})$
$\Delta\mathbf{P}_{1:K}\leftarrow\textrm{Linear}\left(\mathbf{V}^{l=1}\right)$   $\triangleright$ Predict $K$ relative positions
$\mathbf{P}_{1:K}\leftarrow\mathbf{P}+\Delta\mathbf{P}_{1:K}$
$\mathbf{V}^{l=0:l_{\textrm{max}}}_{1:K}\leftarrow\textrm{Reshape}_{K}\left(\textrm{Linear}(\mathbf{V}^{l=0:l_{\textrm{max}}})\right)$   $\triangleright$ Split the channels in $K$ chunks
return $(\mathbf{P},\mathbf{V}^{0:l_{\max}})_{1:K}$
Algorithm 6 Transpose Sequence Convolution

This procedure generates windows of $K$ representations and positions. These windows may intersect in the decoded output. We resolve those superpositions by taking the average position and average representations within intersections.
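The averaging of overlapping windows can be sketched as follows for positions only. This is an illustrative sketch under stated assumptions: the function name is hypothetical, the relative vectors are passed in rather than predicted by a linear layer, and the placement of window `a` at sequence sites `a * stride + k` is our assumption about how windows are laid out.

```python
def transpose_sequence_positions(anchors, offsets, stride):
    """Expand each coarse anchor into K positions and average overlaps.

    anchors: list of coarse positions, each [x, y, z]
    offsets: offsets[a][k] is the k-th relative vector for anchor a
    stride:  number of sequence sites between consecutive windows (assumed layout)
    """
    kernel = len(offsets[0])
    length = (len(anchors) - 1) * stride + kernel
    sums = [[0.0, 0.0, 0.0] for _ in range(length)]
    counts = [0] * length
    for a, (anchor, window) in enumerate(zip(anchors, offsets)):
        for k, delta in enumerate(window):
            site = a * stride + k
            for axis in range(3):
                sums[site][axis] += anchor[axis] + delta[axis]
            counts[site] += 1
    # Sites covered by several windows receive the mean of their candidates.
    return [[value / count for value in total] for total, count in zip(sums, counts)]

# Two anchors, K = 3, stride 2: the windows share the middle site.
anchors = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
offsets = [
    [[-1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    [[-0.8, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
]
decoded = transpose_sequence_positions(anchors, offsets, stride=2)
```

Here the third site receives candidates at x = 1.0 and x = 1.2 and is decoded at their mean. The same averaging would apply to the feature chunks.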

A.7 Layer Normalization and Residual Connections

Training deep models can be challenging due to vanishing or exploding gradients. We employ layer normalization [Ba et al. (2016)] and residual connections [He et al. (2015)] in order to tackle those challenges. We incorporate layer normalization and residual connections at the end of every Self-Interaction and every convolution. To keep roto-translation equivariance, we use the layer normalization described in Equiformer [Liao & Smidt (2023)], which rescales the signals of each degree $l$ independently within a representation by using the root mean square value of the vectors. We found both residuals and layer norms to be critical in training deep models for large proteins.
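The idea behind the root-mean-square rescaling can be sketched as below. This is a simplified illustration, not the Equiformer implementation: learnable scales, the separate treatment of scalar ($l=0$) features and the exact normalization constants are omitted, and the function name is hypothetical. Because the scale factor depends only on vector norms, which are rotation invariant, the rescaling commutes with rotations.

```python
import math

def rms_layer_norm(channels, eps=1e-6):
    """Rescale all channels of one degree l by the RMS of their norms.

    channels: list of channels, each a list of 2l + 1 components.
    """
    mean_square = sum(sum(c * c for c in v) for v in channels) / len(channels)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [[c * scale for c in v] for v in channels]

# Two hypothetical l = 1 channels with very different magnitudes.
vectors = [[3.0, 0.0, 4.0], [0.0, 1.0, 0.0]]
normalized = rms_layer_norm(vectors)
```

After the call, the mean squared norm of the channels is close to one while their relative directions are unchanged.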

A.8 Encoder-Decoder

Below we describe the encoder/decoder algorithm using the building blocks previously introduced.

Input: $\mathbf{C}_{\alpha}$ Position $\mathbf{P}^{\alpha}\in\mathbb{R}^{1\times 3}$
Input: All-Atom Relative Positions $\mathbf{P}^{\ast}\in\mathbb{R}^{n\times 3}$
Input: Residue Label $\mathbf{R}\in\mathcal{R}$
Output: Latent Representation $(\mathbf{P},\mathbf{V}^{l=0:l_{\textrm{max}}})$
$(\mathbf{P},\mathbf{V}^{0:2})\leftarrow\textrm{All-Atom Encoding}(\mathbf{P}^{\alpha},\mathbf{P}^{\ast},\mathbf{R})$   $\triangleright$ Initial Residue Encoding
$(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})\leftarrow\textrm{Linear}(\mathbf{P},\mathbf{V}^{0:2})$
for $i\leftarrow 1$ to $L$ do
       $(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})\leftarrow\textrm{Self-Interaction}(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})$
       $(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})\leftarrow\textrm{Spatial Convolution}(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})$
       $(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})\leftarrow\textrm{Sequence Convolution}(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})$
end for
return $(\mathbf{P},\mathbf{V}^{0:l_{\max}})$
Algorithm 7 Encoder
Input: Latent Representation $(\mathbf{P},\mathbf{V}^{0:l_{\max}})$
Output: Protein Structure and Sequence Logits $(\hat{\mathbf{P}}^{\alpha},\hat{\mathbf{P}}^{\ast},\bm{\ell})$
for $i\leftarrow 1$ to $L$ do
       $(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})\leftarrow\textrm{Self-Interaction}(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})$
       $(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})\leftarrow\textrm{Spatial Convolution}(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})$
       $(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})\leftarrow\textrm{Transpose Sequence Convolution}(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})$
end for
$(\mathbf{P},\mathbf{V}^{0:2})\leftarrow\textrm{Linear}(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})$
$(\hat{\mathbf{P}}^{\alpha},\hat{\mathbf{P}}^{\ast},\bm{\ell})\leftarrow\textrm{All-Atom Decoding}(\mathbf{P},\mathbf{V}^{0:2})$   $\triangleright$ Final Protein Decoding
return $(\hat{\mathbf{P}}^{\alpha},\hat{\mathbf{P}}^{\ast},\bm{\ell})$
Algorithm 8 Decoder

A.9 Variable Sequence Length

We handle proteins of different sequence length by setting a maximum size for the model input. During training, proteins that are larger than the maximum size are cropped, and those that are smaller are padded. The boundary residue is given a special token that labels it as the end of the protein. For inference, we crop the tail of the output after the sequence position where this special token is predicted.
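The cropping, padding and trimming steps can be sketched as follows. This is a minimal illustration on residue labels only: the token names `<end>` and `<pad>` and both function names are hypothetical, and placing the end token in the slot right after the last residue is our reading of the description above.

```python
END = "<end>"  # hypothetical end-of-protein token
PAD = "<pad>"  # hypothetical padding token

def fit_to_length(residues, max_len):
    """Crop or pad a residue sequence to exactly max_len entries."""
    if len(residues) >= max_len:
        # Proteins at or above the maximum size are cropped.
        return list(residues[:max_len])
    # Shorter proteins get the end token, then padding up to max_len.
    marked = list(residues) + [END]
    return marked + [PAD] * (max_len - len(marked))

def trim_output(predicted):
    """Drop everything from the first predicted end token onwards."""
    if END in predicted:
        return predicted[:predicted.index(END)]
    return list(predicted)

padded = fit_to_length(list("ACD"), 5)
cropped = fit_to_length(list("ACDEFG"), 4)
trimmed = trim_output(["A", "C", END, "G"])
```

In this sketch a protein that fills the whole input carries no end token, so `trim_output` returns it unchanged.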

A.10 Time Complexity

We analyse the time complexity of a forward pass through the Ophiuchus autoencoder. Let the list $(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})_{N}$ be an arbitrary latent state with $N$ positions and $N$ geometric representations of dimensionality $D$, where for each $d\in[0,D]$ there are geometric features of degree $l\in[0,l_{\textrm{max}}]$. Note that the size of a geometric representation grows as $O(D\cdot l^{2}_{\textrm{max}})$ in memory.

  • The cost of an SO(3)-equivariant linear layer (Fig. 7.b) is $O(D^{2}\cdot l_{\textrm{max}}^{3})$.

  • The cost of a channel-wise tensor square operation (Fig. 7.c) is $O(D\cdot l^{4}_{\textrm{max}})$.

  • In Self-Interaction (Alg. 3), we use a tensor square and project it using a linear layer. The time complexity is given by $O\left(N\cdot(D^{2}\cdot l_{\textrm{max}}^{3}+D\cdot l_{\textrm{max}}^{4})\right)$ for the whole protein.

  • In Spatial Convolution (Alg. 5), a node aggregates geometric messages from $k$ of its neighbors. Resolving the $k$ nearest neighbors can be efficiently done in $O\left((N+k)\log N\right)$ through the k-d tree data structure [Bentley (1975)]. For each residue, its $k$ neighbors prepare messages through linear layers, at total cost $O(N\cdot k\cdot D^{2}\cdot l^{3}_{\textrm{max}})$.

  • In a Sequence Convolution (Alg. 4), a kernel stacks $K$ geometric representations of dimensionality $D$ and linearly maps them to a new feature of dimensionality $\rho\cdot D$, where $\rho$ is a rescaling factor, yielding $O((K\cdot D)\cdot(\rho\cdot D)\cdot l^{3}_{\textrm{max}})=O(K\cdot\rho\cdot D^{2}\cdot l^{3}_{\textrm{max}})$. With length $N$ and stride $S$, the total cost is $O\left(\frac{N}{S}\cdot K\cdot\rho\cdot D^{2}\cdot l^{3}_{\textrm{max}}\right)$.

  • The cost of an Ophiuchus Block is the sum of the terms above,

    $$O\left(D\cdot l_{\textrm{max}}^{3}\cdot N\cdot\left(\frac{K}{S}\cdot\rho D+kD\right)\right)$$

  • An Autoencoder (Alg. 7,8) that uses stride $S$ convolutions for coarsening uses $L=O\left(\log_{S}(N)\right)$ layers to reduce a protein of size $N$ into a single representation. At depth $i$, the dimensionality is given by $D_{i}=\rho^{i}D$ and the sequence length is given by $N_{i}=\frac{N}{S^{i}}$. The time complexity of our Autoencoder is given by the geometric sum:

    $$O\left(\sum_{i}^{\log_{S}(N)}\left(l_{\textrm{max}}^{3}\rho^{i}D\frac{N}{S^{i}}(K\rho^{i+1}D+k\rho^{i}D)\right)\right)=O\left(Nl_{\textrm{max}}^{3}(K\rho+k)D^{2}\sum_{i}^{\log_{S}(N)}\left(\frac{\rho^{2}}{S}\right)^{i}\right)$$

    We are interested in the dependence on the length $N$ of a protein; therefore, we keep only the relevant parameters. Summing the geometric series and using the identity $x^{\log_{b}(a)}=a^{\log_{b}(x)}$ we get:

    $$=O\left(N\frac{(\frac{\rho^{2}}{S})^{\log_{S}(N)+1}-1}{\frac{\rho^{2}}{S}-1}\right)=\begin{cases}O\left(N^{1+\log_{S}\rho^{2}}\right)&\text{for }\rho^{2}/S>1\\ O\left(N\log_{S}N\right)&\text{for }\rho^{2}/S=1\\ O\left(N\right)&\text{for }\rho^{2}/S<1\end{cases}$$

    In most of our experiments we operate in the $\rho^{2}/S<1$ regime.
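The behaviour of the geometric sum in the three regimes can be explored numerically. The sketch below is illustrative only: it evaluates $N\sum_{i}(\rho^{2}/S)^{i}$ with all constant factors ($l_{\textrm{max}}$, $K$, $k$, $D$) dropped, and the function name and the use of ceiling division for lengths that are not powers of $S$ are our assumptions.

```python
def autoencoder_cost(n, stride, rho):
    """Evaluate N * sum_i (rho**2 / stride)**i over all coarsening depths.

    Depth 0 is the full-length protein; each step divides the sequence
    length by the stride until a single position remains.
    """
    ratio = rho ** 2 / stride
    total, length, depth = 0.0, n, 0
    while True:
        total += n * ratio ** depth
        if length <= 1:
            break
        length = -(-length // stride)  # ceiling division
        depth += 1
    return total

# rho**2 / S < 1: cost per residue stays bounded (linear regime).
linear = autoencoder_cost(4 ** 8, stride=4, rho=1) / 4 ** 8
# rho**2 / S = 1: every depth contributes N, giving N * (log_S N + 1).
log_linear = autoencoder_cost(4 ** 5, stride=4, rho=2)
# rho**2 / S > 1: cost per residue keeps growing with N.
small = autoencoder_cost(2 ** 5, stride=2, rho=2) / 2 ** 5
large = autoencoder_cost(2 ** 10, stride=2, rho=2) / 2 ** 10
```

In the first regime the per-residue cost is bounded by $1/(1-\rho^{2}/S)$ regardless of $N$, which is why the deepest, widest layers do not dominate the total.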

Appendix B Loss Details

B.1 Details on Vector Map Loss

The Huber Loss [Huber (1992)] behaves linearly for large inputs, and quadratically for small ones. It is defined as:

$$\text{HuberLoss}(y,f(x))=\begin{cases}\frac{1}{2}(y-f(x))^{2}&\text{if }|y-f(x)|\leq\delta,\\ \delta\cdot|y-f(x)|-\frac{1}{2}\delta^{2}&\text{otherwise.}\end{cases}$$

We found it to significantly improve training stability for large models compared to mean squared error. We use $\delta=0.5$ for all our experiments.
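For a single scalar residual, the definition above translates directly into code. A minimal sketch, with the function name chosen for illustration and $\delta=0.5$ as the default:

```python
def huber_loss(y, prediction, delta=0.5):
    """Huber loss for one scalar target and prediction."""
    error = abs(y - prediction)
    if error <= delta:
        # Quadratic branch for small errors.
        return 0.5 * error ** 2
    # Linear branch for large errors; both branches agree at error == delta.
    return delta * error - 0.5 * delta ** 2

small_error = huber_loss(1.0, 0.8)  # quadratic branch
large_error = huber_loss(3.0, 0.0)  # linear branch
```

The linear branch bounds the gradient magnitude by $\delta$, which limits the influence of large residuals compared to mean squared error.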

The vector map loss measures differences of internal vector maps $V(\mathbf{P})-V(\hat{\mathbf{P}})$ between predicted and ground-truth positions, where $V(\mathbf{P})^{i,j}=(\mathbf{P}^{i}-\mathbf{P}^{j})\in\mathbb{R}^{3}$. Our output algorithm for decoding atoms produces arbitrary symmetry breaks (Appendix A.2) for positions $\mathbf{P}_{v}^{*}$ and $\mathbf{P}_{u}^{*}$ of atoms that are not orderable. Because of that, a loss on the vector map is not directly applicable to the output of our model, since the order of the model output might differ from the order of the ground truth data. To solve that, we consider both possible orderings of permutable atoms, and choose the one that minimizes the loss. Solving for the optimal ordering is not feasible for the system as a whole, since the number of permutations to be considered scales exponentially with $N$. Instead, we first compute a vector map loss internal to each residue. We consider the alternative order of permutable atoms, and choose the candidate that minimizes this local loss. This ordering is used for the rest of our losses.
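The per-residue ordering choice can be sketched as below for one pair of permutable atoms. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the loss is averaged over all vector components, and the ground-truth atoms (rather than the predicted ones) are the ones reordered.

```python
def huber(error, delta=0.5):
    """Huber penalty of a scalar error."""
    magnitude = abs(error)
    if magnitude <= delta:
        return 0.5 * magnitude ** 2
    return delta * magnitude - 0.5 * delta ** 2

def vector_map(positions):
    """All pairwise difference vectors V[i][j] = P[i] - P[j]."""
    return [[[a - b for a, b in zip(pi, pj)] for pj in positions] for pi in positions]

def vector_map_loss(pred, truth, delta=0.5):
    """Mean Huber penalty over the components of V(truth) - V(pred)."""
    vp, vt = vector_map(pred), vector_map(truth)
    n = len(pred)
    total = sum(huber(vt[i][j][c] - vp[i][j][c], delta)
                for i in range(n) for j in range(n) for c in range(3))
    return total / (n * n * 3)

def resolve_ordering(pred, truth, swap):
    """Return the ordering of truth (original or swapped) closest to pred."""
    u, v = swap
    swapped = list(truth)
    swapped[u], swapped[v] = swapped[v], swapped[u]
    return min((truth, swapped), key=lambda candidate: vector_map_loss(pred, candidate))

# Hypothetical residue with three atoms; atoms 1 and 2 are permutable.
truth = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
pred = [[0.0, 0.0, 0.0], [0.1, 1.0, 0.0], [1.0, 0.1, 0.0]]
target = resolve_ordering(pred, truth, swap=(1, 2))
```

Only two candidates are compared per permutable pair, so the cost of this step grows linearly with the number of residues.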

B.2 Chemical Losses

We consider bonds, interbond angles, dihedral angles and steric clashes when computing a loss for chemical validity. Let $\mathbf{P}\in\mathbb{R}^{N_{\textrm{Atom}}\times 3}$ be the list of atom positions in ground truth data. We denote $\hat{\mathbf{P}}\in\mathbb{R}^{N_{\textrm{Atom}}\times 3}$ as the list of atom positions predicted by the model. For each chemical interaction, we precompute indices of atoms that perform the interaction. For example, for bonds we precompute a list of pairs of atoms that are bonded according to the chemical profile of each residue. Our chemical losses then take the form:

$$\mathcal{L}_{\textrm{Bonds}}=\frac{1}{|\mathcal{B}|}\sum_{(v,u)\in\mathcal{B}}\Big\lVert\;\lVert\hat{\mathbf{P}}_{v}-\hat{\mathbf{P}}_{u}\rVert_{2}-\lVert\mathbf{P}_{v}-\mathbf{P}_{u}\rVert_{2}\;\Big\rVert^{2}_{2}$$

where $\mathcal{B}$ is a list of pairs of indices of atoms that form a bond. We compare the distance between bonded atoms in prediction and ground truth data.
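The bond term can be sketched as follows, assuming positions are given as lists of 3D coordinates and `bonds` is the precomputed list of index pairs. The function name and the example coordinates are illustrative.

```python
import math

def bond_loss(pred, truth, bonds):
    """Mean squared difference between predicted and true bond lengths."""
    total = 0.0
    for v, u in bonds:
        predicted_length = math.dist(pred[v], pred[u])
        true_length = math.dist(truth[v], truth[u])
        total += (predicted_length - true_length) ** 2
    return total / len(bonds)

# One bond that is 0.5 too long in the prediction.
truth = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]
pred = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
loss = bond_loss(pred, truth, bonds=[(0, 1)])
```

Since only distances enter the loss, it is invariant to rigid motions of either structure.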

$$\mathcal{L}_{\textrm{Angles}}=\frac{1}{|\mathcal{A}|}\sum_{(v,u,p)\in\mathcal{A}}\lVert\alpha(\hat{\mathbf{P}}_{v},\hat{\mathbf{P}}_{u},\hat{\mathbf{P}}_{p})-\alpha\left(\mathbf{P}_{v},\mathbf{P}_{u},\mathbf{P}_{p}\right)\rVert^{2}_{2}$$

where $\mathcal{A}$ is a list of 3-tuples of indices of atoms that are connected through bonds. The tuple takes the form $(v,u,p)$ where $u$ is connected to $v$ and to $p$. Here, the function $\alpha(\cdot,\cdot,\cdot)$ measures the angle in radians between positions of atoms that are connected through bonds.

$$\mathcal{L}_{\textrm{Dihedrals}}=\frac{1}{|\mathcal{D}|}\sum_{(v,u,p,q)\in\mathcal{D}}\lVert\tau(\hat{\mathbf{P}}_{v},\hat{\mathbf{P}}_{u},\hat{\mathbf{P}}_{p},\hat{\mathbf{P}}_{q})-\tau(\mathbf{P}_{v},\mathbf{P}_{u},\mathbf{P}_{p},\mathbf{P}_{q})\rVert^{2}_{2}$$

where $\mathcal{D}$ is a list of 4-tuples of indices of atoms that are connected by bonds, that is, $(v,u,p,q)$ where $(v,u)$, $(u,p)$ and $(p,q)$ are connected by bonds. Here, the function $\tau(\cdot,\cdot,\cdot,\cdot)$ measures the dihedral angle in radians.
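One possible realization of $\alpha$ and $\tau$ is sketched below. The helper names are hypothetical, the dihedral uses the common `atan2` formulation with a sign convention chosen for illustration, and the losses take raw differences of angles exactly as in the formulas above, without any wrap-around handling.

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def angle(pv, pu, pp):
    """Angle in radians at the central atom u between bonds u-v and u-p."""
    a, b = sub(pv, pu), sub(pp, pu)
    cosine = dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))
    return math.acos(max(-1.0, min(1.0, cosine)))  # clamp against round-off

def dihedral(pv, pu, pp, pq):
    """Signed dihedral angle in radians about the u-p bond."""
    b1, b2, b3 = sub(pu, pv), sub(pp, pu), sub(pq, pp)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    scale = math.sqrt(dot(b2, b2))
    m1 = cross(n1, [c / scale for c in b2])
    return math.atan2(dot(m1, n2), dot(n1, n2))

def mean_squared_difference(measure, pred, truth, index_tuples):
    """Mean squared difference of an internal coordinate over index tuples."""
    total = 0.0
    for indices in index_tuples:
        predicted = measure(*(pred[i] for i in indices))
        reference = measure(*(truth[i] for i in indices))
        total += (predicted - reference) ** 2
    return total / len(index_tuples)

# Hypothetical four-atom chain in a planar trans arrangement.
atoms = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-1.0, 1.0, 0.0]]
right_angle = angle(atoms[0], atoms[1], atoms[2])
trans = dihedral(*atoms)
```

The angle loss is then `mean_squared_difference(angle, pred, truth, triples)` and the dihedral loss is `mean_squared_difference(dihedral, pred, truth, quadruples)`.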

$$\mathcal{L}_{\textrm{Clashes}}=\frac{1}{|\mathcal{C}|}\sum_{(v,u)\in\mathcal{C}}H(r_{v}+r_{u}-\lVert\hat{\mathbf{P}}_{v}-\hat{\mathbf{P}}_{u}\rVert_{2})$$

where $H$ is a smooth and differentiable Heaviside-like step function, $\mathcal{C}$ is a list of pairs of indices of atoms that are not bonded, and $(r_{v},r_{u})$ are the van der Waals radii of atoms $v$, $u$.
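The clash term can be sketched as below. The exact form of $H$ is not specified here, so a logistic function with a hypothetical `sharpness` parameter stands in for it; this choice, the function names and the example radii are assumptions made for illustration only.

```python
import math

def smooth_step(x, sharpness=10.0):
    """Logistic stand-in for the smooth Heaviside-like function H (assumed)."""
    z = sharpness * x
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for strongly negative arguments
    return e / (1.0 + e)

def clash_loss(pred, radii, nonbonded):
    """Mean smooth-step penalty on van der Waals overlaps of non-bonded pairs."""
    total = 0.0
    for v, u in nonbonded:
        overlap = radii[v] + radii[u] - math.dist(pred[v], pred[u])
        total += smooth_step(overlap)
    return total / len(nonbonded)

radii = [1.7, 1.7]  # illustrative radii, not taken from the paper
overlapping = clash_loss([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]], radii, [(0, 1)])
separated = clash_loss([[0.0, 0.0, 0.0], [6.0, 0.0, 0.0]], radii, [(0, 1)])
```

The penalty approaches one when the two spheres overlap and decays towards zero once the atoms are further apart than the sum of their radii.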

B.3 Regularization

When training autoencoder models for latent diffusion, we regularize the learned latent space so that representations are amenable to the relevant range scales of the source distribution $\mathcal{N}(0,1)$. Let $\mathbf{V}^{l}_{i}$ denote the $i$-th channel of a vector representation $\mathbf{V}^{l}$. We regularize the autoencoder latent space by optimizing radial and angular components of our vectors:

$$\mathcal{L}_{\textrm{reg}}=\sum_{i}\textrm{ReLU}(1-\lVert\mathbf{V}_{i}^{l}\rVert^{1}_{1})+\sum_{i}\sum_{j\neq i}(\mathbf{V}_{i}^{l}\cdot\mathbf{V}_{j}^{l})$$

The first term penalizes vector magnitudes larger than one, and the second term induces vectors to spread angularly. We find these regularizations to significantly help training of the denoising diffusion model.
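
A minimal sketch of this regularizer, transcribing the printed formula for one representation $\mathbf{V}^{l}$ stored as a list of 3D channel vectors. The function name `latent_regularizer` and the plain-list layout are illustrative assumptions, not the authors' implementation. Two conventions are taken literally from the formula and should be checked against the reference code: the norm is read as the L1 norm (subscript 1), and the hinge term $\textrm{ReLU}(1-\|\mathbf{V}_{i}^{l}\|)$ as printed is nonzero only when a channel's norm falls below one.

```python
def latent_regularizer(V):
    """Regularizer for one vector representation.

    V: list of C channels, each a 3D vector (x, y, z).
    Hypothetical helper that follows the printed formula term by term.
    """
    # Radial term, as printed: sum_i ReLU(1 - ||V_i||_1)
    radial = sum(max(0.0, 1.0 - sum(abs(c) for c in v)) for v in V)

    # Angular term: sum of dot products over all ordered pairs i != j.
    # Aligned channels increase it, opposed channels decrease it.
    angular = sum(
        sum(a * b for a, b in zip(V[i], V[j]))
        for i in range(len(V))
        for j in range(len(V))
        if i != j
    )
    return radial + angular


# Two orthogonal unit channels: both terms vanish.
print(latent_regularizer([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]))
# Two identical short channels: both terms contribute.
print(latent_regularizer([(0.5, 0.0, 0.0), (0.5, 0.0, 0.0)]))
```

In a training setup this quantity would be computed per bottleneck representation and added to the reconstruction loss; the relative weighting is not specified here.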

B.4 Total Loss

We weight the different losses in our pipeline. For the standard training of the autoencoder, we use weights:

$$\mathcal{L} = 10\cdot\mathcal{L}_{\textrm{VectorMap}} + \mathcal{L}_{\textrm{CrossEntropy}} + 0.1\cdot\mathcal{L}_{\textrm{Bonds}} + 0.1\cdot\mathcal{L}_{\textrm{Angles}} + 0.1\cdot\mathcal{L}_{\textrm{Dihedrals}} + 10\cdot\mathcal{L}_{\textrm{Clashes}}$$

For fine-tuning the model, we increase the weight of chemical losses significantly:

$$\mathcal{L} = 10\cdot\mathcal{L}_{\textrm{VectorMap}} + \mathcal{L}_{\textrm{CrossEntropy}} + 1.0\cdot\mathcal{L}_{\textrm{Bonds}} + 1.0\cdot\mathcal{L}_{\textrm{Angles}} + 1.0\cdot\mathcal{L}_{\textrm{Dihedrals}} + 100\cdot\mathcal{L}_{\textrm{Clashes}}$$

We find that high weight values for chemical losses at early training may hurt the model convergence, in particular for models that operate on large lengths.
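
The two weightings can be written as plain configuration tables. The sketch below only transcribes the coefficients from the two equations in this section; the dictionary keys and the `total_loss` helper are hypothetical names chosen for illustration, and the individual loss values are assumed to be computed elsewhere.

```python
# Coefficients transcribed from the standard-training equation.
STANDARD_WEIGHTS = {
    "vector_map": 10.0,
    "cross_entropy": 1.0,
    "bonds": 0.1,
    "angles": 0.1,
    "dihedrals": 0.1,
    "clashes": 10.0,
}

# Fine-tuning raises only the chemical terms (bonds, angles, dihedrals, clashes).
FINETUNE_WEIGHTS = {
    **STANDARD_WEIGHTS,
    "bonds": 1.0,
    "angles": 1.0,
    "dihedrals": 1.0,
    "clashes": 100.0,
}


def total_loss(losses, weights):
    """Weighted sum of named loss terms (hypothetical helper)."""
    return sum(weights[name] * losses[name] for name in weights)


# With every term equal to 1.0, the totals are just the sums of the weights.
unit_losses = {name: 1.0 for name in STANDARD_WEIGHTS}
print(total_loss(unit_losses, STANDARD_WEIGHTS))
print(total_loss(unit_losses, FINETUNE_WEIGHTS))
```

Keeping the two tables separate makes the switch from standard training to fine-tuning a single argument change.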

Appendix C Details on Autoencoder Comparison

We implement the comparison EGNN model in its original form, following [Satorras et al. (2022)]. We use kernel size $K=3$ and stride $S=2$ for downsampling and upsampling, and follow the procedure described in [Fu et al. (2023)]. For this comparison, we train the models to minimize the residue label cross entropy, and the vector map loss.

Ophiuchus significantly outperforms the EGNN-based architecture. The standard EGNN is SO(3)-equivariant with respect to its positions; however, it models features with SO(3)-invariant representations. As part of its message passing, EGNN uses relative vectors between nodes to update positions. When positions are downsampled, however, the total number of relative vectors available shrinks quadratically, making it increasingly challenging to recover coordinates. Our method instead uses SO(3)-equivariant feature representations, and is able to keep 3D vector information in features as it coarsens positions. Thus, with very few parameters our model is able to encode and recover protein structures.
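
To make the quadratic reduction concrete, the sketch below traces how many pairwise relative vectors remain as a sequence is coarsened with $K=3$ and $S=2$. It rests on two assumptions not stated in the text: downsampling is unpadded, so a layer maps length $n$ to $\lfloor (n-K)/S \rfloor + 1$, and relative vectors are counted over all unordered node pairs, $n(n-1)/2$, which is the fully connected upper bound. All function names are illustrative.

```python
def downsampled_length(n, kernel=3, stride=2):
    """Output length of one unpadded strided coarsening layer (assumption)."""
    if n < kernel:
        raise ValueError("sequence shorter than kernel")
    return (n - kernel) // stride + 1


def num_relative_vectors(n):
    """Unordered node pairs, an upper bound on available relative vectors."""
    return n * (n - 1) // 2


def coarsening_trace(n, num_layers, kernel=3, stride=2):
    """List of (length, pair count) before and after each layer."""
    trace = [(n, num_relative_vectors(n))]
    for _ in range(num_layers):
        n = downsampled_length(n, kernel, stride)
        trace.append((n, num_relative_vectors(n)))
    return trace


for length, pairs in coarsening_trace(128, 3):
    print(length, pairs)
```

Each layer roughly halves the length and so cuts the number of pairs by about a factor of four, which is the information loss that invariant-feature models must absorb.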

Appendix D More on Ablation

In addition to the all-atom ablation found in Table 2, we conducted a similar ablation study on models trained on backbone-only atoms, as shown in Table 4. We found that backbone-only models performed slightly better on backbone reconstruction. Furthermore, in Fig. 9.a we compare the relative vector loss between ground truth coordinates and the coordinates reconstructed from the autoencoder with respect to different downsampling factors. We average over different trials and channel rescale factors. As expected, we find that structure reconstruction accuracy is better for lower downsampling factors. In Fig. 9.b we similarly plot the residue recovery accuracy with respect to different downsampling factors. Again, we find the expected result: residue recovery is better for lower downsampling factors.

Notably, the relative change in structure recovery accuracy across downsampling factors is much lower than the relative change in residue recovery accuracy. This suggests our model was able to learn a compressed prior for structure much more efficiently than for sequence, which coincides with the common knowledge that sequence has high redundancy in biological proteins. In Fig. 9.c we compare the structure reconstruction accuracy across different channel rescaling factors. Interestingly, we find that for larger rescaling factors the structure reconstruction accuracy is slightly lower.

However, since we trained for only 10 epochs, it is likely that models with a larger rescaling factor, which have more parameters, would take somewhat longer to achieve similar results. Finally, in Fig. 9.d we compare the residue recovery accuracy across different rescaling factors. We see that for higher rescaling factors we get a higher residue recovery rate. This suggests that sequence recovery is highly dependent on the number of model parameters and is not easily captured by efficient structural models.

Figure 9: Ablation Training Curves. We plot metrics across 10 training epochs for our ablated models from Table 2. (a-b) compares models across downsampling factors and highlights the tradeoff between downsampling and reconstruction. (c-d) compares different rescaling factors for fixed downsampling at 160.
Table 4: Recovery rates from bottleneck representations - Backbone only
| Factor | Channels/Layer | # Params [1e6] ↓ | Cα-RMSD (Å) ↓ | GDT-TS ↑ | GDT-HA ↑ |
|---|---|---|---|---|---|
| 17 | [16, 24, 36] | 0.34 | 0.81 ± 0.31 | 96 ± 3 | 81 ± 5 |
| 17 | [16, 27, 45] | 0.38 | 0.99 ± 0.45 | 95 ± 3 | 81 ± 6 |
| 17 | [16, 32, 64] | 0.49 | 1.03 ± 0.42 | 92 ± 4 | 74 ± 6 |
| 53 | [16, 24, 36, 54] | 0.49 | 0.99 ± 0.38 | 92 ± 5 | 74 ± 8 |
| 53 | [16, 27, 45, 76] | 0.67 | 1.08 ± 0.40 | 91 ± 6 | 71 ± 8 |
| 53 | [16, 32, 64, 128] | 1.26 | 1.02 ± 0.64 | 92 ± 9 | 75 ± 11 |
| 160 | [16, 24, 36, 54, 81] | 0.77 | 1.33 ± 0.42 | 84 ± 7 | 63 ± 8 |
| 160 | [16, 27, 45, 76, 129] | 1.34 | 1.11 ± 0.29 | 89 ± 4 | 69 ± 7 |
| 160 | [16, 32, 64, 128, 256] | 3.77 | 0.90 ± 0.44 | 94 ± 7 | 77 ± 9 |

Appendix E Latent Space Analysis

E.1 Visualization of Latent Space

To visualize the learned latent space of Ophiuchus, we forward 50k samples from the training set through a large 485-length model. We collect the scalar component of bottleneck representations, and use t-SNE to produce 2D points for coordinates. We similarly produce 3D points and use those for coloring. The result is visualized in Figure 10. Visual inspection of neighboring points reveals unsupervised clustering of similar folds and sequences.

Figure 10: Latent Space Analysis: we qualitatively examine the unsupervised clustering capabilities of our representations through t-SNE.
Figure 11: Latent Conformational Interpolation. Top: Comparison of chemical validity metrics across interpolated structures of Nuclear Factor of Activated T-Cell (NFAT) Protein (PDB ID: 1S9K and PDB ID: 2O93). Bottom: Comparison of chemical validity metrics across interpolated structures of Maltose-binding periplasmic protein (PDB ID: 4JKM and PDB ID: 6LF3). For both proteins, we plot results for interpolations on the original data space, on the latent space, and on the autoencoder-reconstructed data space.

Appendix F Latent Diffusion Details

We train all Ophiuchus diffusion models with learning rate $\textrm{lr}=1\times 10^{-3}$ for 10,000 epochs. For denoising networks we use Self-Interactions with the chunked-channel tensor square operation (Alg. 3).

Our tested models are trained on two different datasets. The MiniProtein scaffolds dataset, introduced by [Cao et al. (2022)], consists of 66k all-atom structures with sequence lengths between 50 and 65, and comprises diverse folds and sequences across 5 secondary structure topologies. We also train a model on the data curated by [Yim et al. (2023)], which consists of approximately 22k proteins, to compare Ophiuchus to the performance of FrameDiff and RFDiffusion in backbone generation for proteins up to 485 residues.

F.1 Self-Consistency Scores

To compute the scTM scores, we recover 8 sequences using ProteinMPNN for 500 sampled backbones from the tested diffusion models. We used a sampling temperature of 0.1 for ProteinMPNN. Unlike the original work, where the sequences were folded using AlphaFold2, we use OmegaFold [Wu et al. (2022b)], similar to [Lin & AlQuraishi (2023)]. The predicted structures are aligned to the original sampled backbones, and TM-Score and RMSD are calculated for each alignment. To calculate the diversity measurement, we hierarchically clustered samples using MaxCluster. Diversity is defined as the number of clusters divided by the total number of samples, as described in [Yim et al. (2023)].
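
The two summary statistics used here reduce to simple ratios. The sketch below assumes that per-backbone scTM scores and cluster assignments have already been obtained (for example from the ProteinMPNN, OmegaFold and MaxCluster steps above, which are not reproduced); the function names and the toy inputs are hypothetical. The 0.5 threshold matches the scTM cutoff reported for the MiniProtein model.

```python
def designable_fraction(sctm_scores, threshold=0.5):
    """Fraction of sampled backbones with scTM strictly above the threshold."""
    return sum(score > threshold for score in sctm_scores) / len(sctm_scores)


def diversity(cluster_labels):
    """Number of distinct clusters divided by the number of samples."""
    return len(set(cluster_labels)) / len(cluster_labels)


# Toy inputs for illustration only, not results from the paper.
toy_scores = [0.2, 0.6, 0.9, 0.5]
toy_clusters = [0, 0, 1, 2]
print(designable_fraction(toy_scores))
print(diversity(toy_clusters))
```

Diversity equals 1 when every sample falls in its own cluster and approaches 0 when all samples collapse into a single cluster.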

F.2 MiniProtein Model

In Figure 13 we show generated samples from our miniprotein model, and compare the marginal distributions of our predictions and ground truth data. In Figure 14.b we show the distribution of TM-scores for joint sampling of sequence and all-atom structure by the diffusion model. We produce marginal distributions of generated samples (Fig. 14.e) and find them to successfully approximate the densities of ground truth data. To test the robustness of joint sampling of structure and sequence, we compute self-consistency TM-Scores [Trippe et al. (2023)]. 54% of our sampled backbones have scTM scores > 0.5, compared to 77.6% of samples from RFDiffusion. We also sample 100 proteins of 50-65 amino acids in 0.21s, compared to 11s taken by RFDiffusion on an RTX A6000.

F.3 Backbone Model

We include metrics on designability in Figure 12.

Figure 12: Designability of Sampled Backbones. (a) To analyze the designability of our sampled backbones, we show a plot of scTM vs. pLDDT. (b) To analyze the composition of secondary structures in the samples, we show a plot of helix percentage vs. beta-sheet percentage in each sample.
Figure 13: All-Atom Latent Diffusion. (a) Random samples from an all-atom MiniProtein model. (b) Random Samples from MiniProtein backbone model. (d) Ramachandran plots of sampled (left) and ground (right) distributions. (e) Comparison of Cα-Cα distances and all-atom bond lengths between learned (red) and ground (blue) distributions.
Figure 14: Self-Consistency Template Modeling Scores for Ophiuchus Diffusion. (a) Distribution of scTM scores for 500 sampled backbones. (b) Distribution of TM-Scores of jointly sampled backbones and sequences from an all-atom diffusion model and corresponding OmegaFold models. (c) Distribution of average pLDDT scores for 500 sampled backbones. (d) TM-Score between sampled backbone (green) and OmegaFold structure (blue). (e) TM-Score between sampled backbone (green) and the most confident OmegaFold structure of the sequence recovered from ProteinMPNN (blue).

Appendix G Visualization of All-Atom Reconstructions

Figure 15: Reconstruction of 128-length all-atom proteins. Model used for visualization reconstructs all-atom protein structures from coarse representations of 120 scalars and 120 3D vectors.

Appendix H Visualization of Backbone Reconstructions

Figure 16: Reconstruction of 485-length protein backbones. Model used for visualization reconstructs large backbones from coarse representations of 160 scalars and 160 3D vectors.

Appendix I Visualization of Random Backbone Samples

Figure 17: Random Samples from an Ophiuchus 485-length Backbone Model. Model used for visualization samples SO(3) representations of 160 scalars and 160 3D vectors. We measure 0.46 seconds to sample a protein backbone.