Importance Weighted Expectation-Maximization for Protein Sequence Design

Zhenqiao Song    Lei Li
Abstract

Designing protein sequences with desired biological function is crucial in biology and chemistry. Recent machine learning methods use a surrogate sequence-function model to replace the expensive wet-lab validation. How can we efficiently generate diverse and novel protein sequences with high fitness? In this paper, we propose IsEM-Pro, an approach to generate protein sequences towards a given fitness criterion. At its core, IsEM-Pro is a latent generative model, augmented by combinatorial structure features from separately learned Markov random fields (MRFs). We develop a Monte Carlo Expectation-Maximization (MCEM) method to learn the model. During inference, sampling from its latent space enhances diversity while its MRF features guide the exploration in high-fitness regions. Experiments on eight protein sequence design tasks show that our IsEM-Pro outperforms the previous best methods by at least 55% on average fitness score and generates more diverse and novel protein sequences.

Machine Learning, ICML

1 Introduction

Protein engineering aims to discover protein variants with desired biological function, such as fluorescence intensity (Biswas et al., 2021), enzyme activity (Fox et al., 2007), and therapeutic efficacy (Lagassé et al., 2017). Protein sequences embody their function through spontaneous folding of amino-acid sequences into three-dimensional structures (Go, 1983; Chothia, 1984; Starr & Thornton, 2017). The mapping from protein sequence to functional property forms a protein fitness landscape that characterizes the protein functional levels, such as the capability to catalyze a reaction or bind a specific ligand (Romero & Arnold, 2009; Ren et al., 2022). Traditional approaches to designing a specific protein with a desired fitness objective involve obtaining protein variants by random mutagenesis (Labrou, 2010) or recombination in laboratory experiments (Ma et al., 2003). These variants are screened and selected in wet-lab experiments (Arnold, 1998) as illustrated in Figure 2.

(a) Fujiyama landscape.

(b) Badlands landscape.

Figure 1: Protein fitness landscape (distribution of any specific functional property for proteins). A protein may exhibit a single-peaked fitness landscape (Fujiyama landscape (a)) or a multi-peaked landscape (Badlands landscape (b)) (Kauffman & Weinberger, 1989). In the Fujiyama landscape, any method could perform well. However, for the rougher Badlands landscape, previous methods get trapped in a poor local optimum, while our proposed IsEM-Pro can climb much closer to the global optimum through iterative sampling in the latent space.
Figure 2: Workflow of traditional protein sequence design. We aim to accelerate this process by directly generating desirable sequences.

However, these approaches require iterative cycles of random mutagenesis and wet-lab validation, which are both costly and time-intensive. Recent machine learning methods attempt to build a surrogate model of the protein fitness landscape to accelerate expensive wet-lab screening (Luo et al., 2021; Meier et al., 2021). How can we efficiently discover satisfactory proteins over the exponentially large discrete space? Ideal protein molecules should be novel, diverse and exhibit high fitness. On one hand, designing novel and diverse protein sequences can uncover new functions and also lead to functional diversification (Singh et al., 2016). On the other hand, in addition to the evolutionary pressure for a protein to conserve specific positions of its sequence for function, the diversification of protein sequences can avoid undesired inter-domain association such as misfolding (Wright et al., 2005).

In this paper, we propose an Importance sampling based Expectation-Maximization (EM) method to efficiently design novel, diverse and desirable Protein sequences (IsEM-Pro). Specifically, we introduce a latent variable in the generative model to capture the inter-dependencies in protein sequences. Sampling in latent space leads to more diverse candidates and can escape from locally optimal fitness regions. Instead of using standard variational inference models such as the variational auto-encoder (VAE) (Kingma & Welling, 2013), we leverage importance sampling inside the EM algorithm to learn the latent generative model. As illustrated in Figure 1, our approach can navigate through multiple local optima and yield better overall performance. We further incorporate the combinatorial structure of amino acids in protein sequences using Markov random fields (MRFs). This structure guides the model towards higher-fitness regions of the landscape, leading to a faster uphill path to desired proteins.

We carry out extensive experiments on eight protein sequence design tasks and compare the proposed method with previous strong baselines. The contributions of this paper are listed as follows:

  • We propose a structure-enhanced latent generative model for protein sequence design.

  • We develop an efficient method to learn the proposed generative model, based on importance sampling inside the EM algorithm.

  • Experiments on eight protein datasets with different objectives demonstrate that our IsEM-Pro generates protein sequences with at least 55% higher average fitness score and higher diversity and novelty than previous best methods. Further analyses show that the protein sequences designed by our model can fold stably, giving empirical evidence that our proposed IsEM-Pro has the ability to generate real proteins.

2 Background

In this section, we review protein sequence design and basic variational inference.

2.1 Protein Sequence Design upon Wild-Type

The protein sequence design problem is to search for a sequence with a desired property in the sequence space $\mathcal{V}^{L}$, where $\mathcal{V}$ denotes the vocabulary of amino acids and $L$ denotes the desired sequence length. The target is to find a protein sequence with the highest fitness given by a protein fitness function $f:\mathcal{V}^{L}\rightarrow\mathbb{R}$, which can be measured through wet-lab experiments. Wild-type refers to a protein occurring in nature. Evolutionary search based methods are widely used (Bloom & Arnold, 2009; Arnold, 2018; Angermueller et al., 2019; Ren et al., 2022). They use the wild-type sequence as the starting point during iterative search. In this paper, we do not focus on modification of the wild-type sequence, but aim to efficiently generate novel and diverse sequences with improved protein functional properties.

2.2 Monte Carlo Expectation-Maximization

A latent generative model assumes data $x$ (e.g. a protein sequence) is generated from a latent variable $z$. A classic algorithm to learn a latent generative model is Monte Carlo expectation-maximization (MCEM). The optimization procedure for maximizing the log marginal likelihood is to alternate between an expectation step (E-step) and a maximization step (M-step) (Neal & Hinton, 1998; Jordan et al., 1999). EM directly targets the log marginal likelihood of an observation $x$ by involving a variational distribution $q_{\phi}(z)$:

$$\begin{split}\log p_{\theta}(x)&=E_{q_{\phi}(z)}[\log p_{\theta}(x,z)-\log q_{\phi}(z)]\\ &\quad+D_{KL}(q_{\phi}(z)||p_{\theta}(z|x))\end{split}\tag{1}$$

where $p_{\theta}(z|x)$ is the true posterior distribution and $p_{\theta}(x,z)=p_{\theta}(x|z)p_{\theta}(z)$ is the joint distribution, composed of the conditional likelihood $p_{\theta}(x|z)$ and the prior $p_{\theta}(z)$. In MCEM, the E-step samples a set of $z$ from $q_{\phi}(z)$ to estimate the expectation in Equation 1 using the Monte Carlo method, and then the M-step fits the model parameters $\theta$ by maximizing the Monte Carlo estimate (Levine & Casella, 2001). It can be proved that this process will never decrease the log marginal likelihood (Dieng & Paisley, 2019). We will develop our method based on this MCEM framework.
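
As a small numerical illustration of Equation 1, the sketch below checks the decomposition on a toy model with a three-valued discrete latent variable and estimates the expectation term by Monte Carlo sampling, as the E-step does. All probabilities are made-up numbers chosen for illustration; none come from the paper.

```python
import numpy as np

# Toy latent model (hypothetical numbers): z in {0, 1, 2}, one fixed observation x.
prior = np.array([0.5, 0.3, 0.2])      # p(z)
lik = np.array([0.10, 0.40, 0.05])     # p(x | z) for the fixed x
joint = prior * lik                    # p(x, z)
log_px = np.log(joint.sum())           # log marginal likelihood log p(x)
posterior = joint / joint.sum()        # true posterior p(z | x)

q = np.array([0.2, 0.5, 0.3])          # arbitrary variational distribution q(z)

# Exact terms on the right-hand side of Equation 1.
elbo = np.sum(q * (np.log(joint) - np.log(q)))      # E_q[log p(x, z) - log q(z)]
kl = np.sum(q * (np.log(q) - np.log(posterior)))    # D_KL(q(z) || p(z | x))

# Monte Carlo estimate of the expectation term from samples z ~ q(z).
rng = np.random.default_rng(0)
samples = rng.choice(3, size=20000, p=q)
elbo_mc = np.mean(np.log(joint[samples]) - np.log(q[samples]))
```

Here `elbo + kl` equals `log_px` up to floating-point error, and `elbo_mc` approaches `elbo` as the number of samples grows.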

3 Proposed Method: IsEM-Pro

In this section, we describe our method in detail. We will first present the probabilistic model and its learning algorithm. To make the learning more efficient, we describe how to uncover and use the potential constraints conveyed in the protein sequences.

3.1 Problem Formulation

Our goal is to search over a space of discrete protein sequences $\mathcal{V}^{L}$ (where $\mathcal{V}$ consists of 20 amino acids and $L$ is the sequence length) for a sequence $x\in\mathcal{V}^{L}$ that maximizes a given fitness function $f:\mathcal{V}^{L}\rightarrow\mathbb{R}$. Let $y=f(x)$ denote the fitness value. Given a predefined threshold $\lambda$, we define a conditional likelihood function:

$$P(\mathcal{S}|x)=\begin{cases}1, & f(x)\geq\lambda,\\ 0, & \text{otherwise}\end{cases}\tag{2}$$

where $\mathcal{S}$ represents the event that the fitness of $x$ is ideal ($y\geq\lambda$). Using $P_{d}(x)$ to denote the protein distribution in nature, we assume we have access to a set of observations of $x$ drawn from it. We also assume a class of generative models $P_{\theta}(x)$ that can be trained with these samples and can approximate $P_{d}$ well. Since the search space is exponentially large ($O(20^{L})$), random search would be time-intensive. We formulate the protein design problem as generating satisfactory sequences from the posterior distribution $P_{\theta}(x|\mathcal{S})$:

$$P_{\theta}(x|\mathcal{S})=\frac{P_{\theta}(x)P(\mathcal{S}|x)}{P_{\theta}(\mathcal{S})}\tag{3}$$

where $P_{\theta}(\mathcal{S})=\int_{x}P_{\theta}(x)P(\mathcal{S}|x)dx$ is a normalization constant which does not depend on $x$. Protein sequences generated from $P_{\theta}(x|\mathcal{S})$ are not only more likely to be real proteins, but also have higher functional scores (i.e., fitness). The higher $\lambda$ is, the higher the fitness of the discovered protein sequences.
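
To make Equations 2 and 3 concrete, the following sketch enumerates a tiny sequence space and computes the posterior exactly. The three-letter alphabet, the fitness function (count of `A` residues) and the uniform prior are all hypothetical stand-ins chosen so that the normalizer can be computed by brute force, which is not feasible for the real space of size $20^{L}$.

```python
import itertools
import numpy as np

vocab = "ACD"   # hypothetical 3-letter alphabet instead of the 20 amino acids
L = 3           # toy sequence length
sequences = ["".join(s) for s in itertools.product(vocab, repeat=L)]

def fitness(seq):
    # Hypothetical fitness function: number of 'A' residues in the sequence.
    return seq.count("A")

# Hypothetical prior P_theta(x): uniform over all 3^3 = 27 sequences.
prior = np.full(len(sequences), 1.0 / len(sequences))

def posterior(lam):
    # Equation 2: indicator that the fitness reaches the threshold.
    indicator = np.array([1.0 if fitness(s) >= lam else 0.0 for s in sequences])
    # P_theta(S): the normalizer, here an exact sum over the whole space.
    p_s = np.sum(prior * indicator)
    # Equation 3: prior times indicator, renormalized.
    return prior * indicator / p_s, p_s

post, p_s = posterior(lam=2)
```

With `lam=2`, only the 7 sequences containing at least two `A` residues keep non-zero posterior mass, so `p_s` is 7/27.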

3.2 Probabilistic Model

Directly generating satisfactory sequences from the posterior distribution $P_{\theta}(x|\mathcal{S})$ is highly efficient compared with random search over the exponentially large discrete space. However, realizing this idea is difficult, as computing $P_{\theta}(\mathcal{S})=\int_{x}P(\mathcal{S}|x)P_{\theta}(x)dx$ requires an integration over all possible $x$, which is intractable. Instead, we propose to learn a proposal distribution $Q_{\phi}(x)$ with learnable parameter $\phi$ to approximate $P_{\theta}(x|\mathcal{S})$. Following Brookes et al. (2019), to find the optimal $\phi$ of the proposal distribution, we choose to minimize the KL divergence between the posterior distribution $P_{\theta}(x|\mathcal{S})$ and the proposal distribution $Q_{\phi}(x)$:

$$\begin{split}\phi^{*}&=\operatorname*{argmax}_{\phi}-D_{KL}(P_{\theta}(x|\mathcal{S})||Q_{\phi}(x))\\ &=\operatorname*{argmax}_{\phi}E_{P_{\theta}(x|\mathcal{S})}\log Q_{\phi}(x)+\mathcal{H}(P_{\theta})\end{split}\tag{4}$$

where $\mathcal{H}(P_{\theta})=-E_{P_{\theta}(x|\mathcal{S})}\log P_{\theta}(x|\mathcal{S})$ is the entropy of $P_{\theta}(x|\mathcal{S})$ and can be dropped because it does not depend on $\phi$.
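
For readers who want the intermediate step behind Equation 4, the expansion of the negative KL divergence follows directly from its definition:

```latex
\begin{aligned}
-D_{KL}(P_{\theta}(x|\mathcal{S})||Q_{\phi}(x))
&= -E_{P_{\theta}(x|\mathcal{S})}\left[\log P_{\theta}(x|\mathcal{S})-\log Q_{\phi}(x)\right] \\
&= E_{P_{\theta}(x|\mathcal{S})}\log Q_{\phi}(x)-E_{P_{\theta}(x|\mathcal{S})}\log P_{\theta}(x|\mathcal{S}) \\
&= E_{P_{\theta}(x|\mathcal{S})}\log Q_{\phi}(x)+\mathcal{H}(P_{\theta})
\end{aligned}
```

Only the first term involves $\phi$, so maximizing it is equivalent to minimizing the KL divergence.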

Diversity is a key consideration in our protein design procedure, which not only satisfies the diverse nature of species, but also can reduce undesired inter-domain misfolding (Wright et al., 2005). In order to promote the diversity of the designed protein sequences, we introduce a latent variable $z$ into our model to capture the high-order dependencies among amino acids in protein sequences. Thus our final goal is to maximize the log marginal likelihood $\log Q_{\phi}(x)$ of sequence $x$ from the posterior distribution $P_{\theta}(x|\mathcal{S})$, integrating over $z$:

$$\begin{split}\mathcal{L}&=E_{P_{\theta}(x|\mathcal{S})}\log Q_{\phi}(x)\\ &=E_{P_{\theta}(x|\mathcal{S})}\{E_{R_{\omega}(z)}[\log Q_{\phi}(x,z)-\log R_{\omega}(z)]\\ &\quad+D_{KL}(R_{\omega}(z)||Q_{\phi}(z|x))\}\\ &=E_{P_{\theta}(x|\mathcal{S})}\mathcal{F}(R_{\omega}(z),\phi)\end{split}\tag{5}$$

where the second equality is the EM objective defined in Equation 1 with approximate posterior distribution $R_{\omega}(z)$. The parameter $\omega$ is jointly learned with $\phi$ by maximizing the above expectation.

For the joint distribution $Q_{\phi}(x,z)=P(z)Q_{\phi}(x|z)$, we use a standard normal distribution for $P(z)$ and a Transformer decoder (Vaswani et al., 2017) for $Q_{\phi}(x|z)$, which is augmented by the combinatorial structure features introduced in Section 3.4.

3.3 Importance Weighted EM

To maximize the objective defined in Equation 5, we learn our proposal distribution $Q_{\phi}$ through importance sampling based EM, whose sampling procedure and iterative optimization process can lead to a better estimate (Figure 1 (b)), resulting in novel and diverse proteins with higher fitness.

Since we cannot directly sample from $P_{\theta}(x|\mathcal{S})$ due to the intractable integration factor $P_{\theta}(\mathcal{S})$, we choose to approximate the expectation using importance sampling with proposal distribution $Q_{\phi}(x)$ (Neal, 2001; Dieng & Paisley, 2019) as used in Brookes et al. (2019). Because $\mathcal{S}$ is only conditioned on $x$, we have $P_{\theta}(x,z|\mathcal{S})=\frac{P_{\theta}(x,z)P(\mathcal{S}|x)}{P_{\theta}(\mathcal{S})}$. We assume the latent variable $z$ in $P_{\theta}(x,z)$ and $Q_{\phi}(x,z)$ is defined on the same latent space $\mathcal{Z}$ and has the same prior $P(z)$. Then the final objective in Equation 5 can be reformulated as:

$$\begin{split}\mathcal{L}&=E_{Q_{\phi}(x)}\frac{P_{\theta}(x|\mathcal{S})}{Q_{\phi}(x)}\mathcal{F}(R_{\omega}(z),\phi)\\ &=E_{Q_{\phi}(x,z)}\frac{P_{\theta}(x,z|\mathcal{S})}{Q_{\phi}(x,z)}\mathcal{F}(R_{\omega}(z),\phi)\\ &=\frac{1}{\mathcal{C}}E_{Q_{\phi}(x,z)}\frac{P_{\theta}(x|z)}{Q_{\phi}(x|z)}P(\mathcal{S}|x)\mathcal{F}(R_{\omega}(z),\phi)\end{split}\tag{6}$$

The first equality is due to importance sampling. The second equality holds because $\mathcal{F}(R_{\omega}(z),\phi)$ is an integration over $z$ and therefore does not depend on $z$, so we can use the trick $E_{p(x)}[f(x)]=E_{p(x,y)}[f(x)]$ proposed by Brookes et al. (2019). In the third equality, $\mathcal{C}=P_{\theta}(\mathcal{S})$ is a constant which does not depend on $\phi$ and $\omega$, and will be dropped in the following learning process.
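
The third equality in Equation 6 can be spelled out by factorizing both joint distributions and using the shared-prior assumption stated above, so that $P(z)$ cancels:

```latex
\begin{aligned}
\frac{P_{\theta}(x,z|\mathcal{S})}{Q_{\phi}(x,z)}
&= \frac{P_{\theta}(x,z)P(\mathcal{S}|x)}{P_{\theta}(\mathcal{S})\,Q_{\phi}(x,z)} \\
&= \frac{P_{\theta}(x|z)P(z)P(\mathcal{S}|x)}{P_{\theta}(\mathcal{S})\,Q_{\phi}(x|z)P(z)} \\
&= \frac{1}{\mathcal{C}}\,\frac{P_{\theta}(x|z)}{Q_{\phi}(x|z)}\,P(\mathcal{S}|x),
\qquad \mathcal{C}=P_{\theta}(\mathcal{S})
\end{aligned}
```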

We use importance sampling based EM to approximate the above objective (Hastings, 1970; Levine & Casella, 2001) with joint samples $(x_{n},z_{n})\sim Q_{\phi}(x,z)$. Specifically, at iteration $t$, the optimization process can be reformulated as:
E-step:

$$\mathcal{L}_{t}=\frac{\sum_{n=1}^{N}w(x_{n},z_{n})\mathcal{F}(R_{\omega}(z_{n}),\phi)}{\sum_{n=1}^{N}w(x_{n},z_{n})}\tag{7}$$

M-step:

$$\phi^{(t+1)}=\operatorname*{argmax}_{\phi}\mathcal{L}_{t}\tag{8}$$

where $w(x_{n},z_{n})=\frac{P_{\theta}(x_{n}|z_{n})}{Q_{\phi^{(t)}}(x_{n}|z_{n})}P(\mathcal{S}|x_{n})$ is the unnormalized importance weight, and $N$ is the sample size.

Through the combined sampling procedure and iterative optimization process, we can obtain a good proposal distribution $Q_{\phi}$, from which we can generate satisfactory sequences at a small time cost.
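
The iteration in Equations 7 and 8 can be sketched on a toy problem. The code below is a deliberately simplified illustration and not the paper's implementation: it drops the latent variable $z$, replaces the Transformer decoder with a categorical proposal over an enumerated three-letter sequence space, and uses $\log Q_{\phi}(x)$ in place of $\mathcal{F}$, so that the M-step has a closed form. The alphabet, fitness function, prior, threshold, sample size and smoothing constant are all hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
vocab, L = "ACD", 3                                  # hypothetical toy alphabet and length
seqs = ["".join(s) for s in itertools.product(vocab, repeat=L)]
fit = np.array([s.count("A") for s in seqs], dtype=float)   # hypothetical fitness
prior = np.full(len(seqs), 1.0 / len(seqs))          # stands in for P_theta(x)
lam = 2.0                                            # fitness threshold lambda

def iw_em_step(q, n_samples=2000, smoothing=1e-3):
    # E-step: sample x_n from the current proposal and weight each sample.
    idx = rng.choice(len(seqs), size=n_samples, p=q)
    w = prior[idx] / q[idx] * (fit[idx] >= lam)      # unnormalized importance weights
    if w.sum() == 0:                                 # no sample reached the threshold
        return q
    w_norm = w / w.sum()                             # self-normalization as in Equation 7
    # M-step: weighted maximum likelihood for a categorical distribution
    # is the normalized sum of weights per outcome.
    counts = np.bincount(idx, weights=w_norm, minlength=len(seqs))
    q_new = counts + smoothing / len(seqs)           # small smoothing keeps full support
    return q_new / q_new.sum()

q = prior.copy()                                     # start from the prior
for _ in range(5):
    q = iw_em_step(q)

mass_on_good = q[fit >= lam].sum()                   # proposal mass on high-fitness sequences
```

After a few iterations almost all of the proposal mass sits on sequences with fitness at or above the threshold, which is the behavior the weighted objective is designed to produce. In the full method the M-step is a gradient-based update of the neural model rather than a closed-form count.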

3.4 Guiding Model Climbing through Combinatorial Structure

As shown in previous work, the combinatorial structure of amino acids in protein sequences can be learned from a generative graphical model, Markov random fields (MRFs), fitted on the sequences from the same family (Hopf et al., 2017; Luo et al., 2021). These structure constraints are the results of the evolutionary process under natural selection and may reveal clues on which amino-acid combinations are more favorable than others. Thus we incorporate these features into our model to guide it towards higher-fitness regions of the landscape and find desired protein sequences faster.

Given a protein sequence $x=(x_{1},x_{2},..,x_{L})$ with $L$ amino acids, the generative model generates it with likelihood $P_{L}(x)=\frac{\exp(E(x))}{Z}$, where $Z=\int_{x}\exp(E(x))dx$ is a normalization constant and $E(x)$ is the corresponding energy function, which is defined as the sum of all pairwise constraints and single-site constraints as follows:

$$E(x)=\sum_{i=1}^{L}\varepsilon_{i}(x_{i})+\sum_{i=1}^{L}\sum_{j=1,j\neq i}^{L}\varepsilon_{ij}(x_{i},x_{j})\tag{9}$$

where $\varepsilon_{i}(x_{i})$ denotes the single-site constraint of $x_{i}$ at position $i$ and $\varepsilon_{ij}(x_{i},x_{j})$ denotes the pairwise constraint of $x_{i}$ and $x_{j}$ at positions $i$, $j$. The above graphical model is illustrated in Figure 3.
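
As an illustration of Equation 9, the energy of an integer-encoded sequence can be computed from arrays of single-site and pairwise terms. The array layout (`eps_single` of shape `(L, V)`, `eps_pair` of shape `(L, L, V, V)`) and the random parameter values below are assumptions made for this sketch; they are not the storage format of CCMpred or of the paper's code.

```python
import numpy as np

def mrf_energy(x, eps_single, eps_pair):
    """Energy E(x) of Equation 9 for an integer-encoded sequence x of length L.

    eps_single: array of shape (L, V), single-site terms eps_i(a).
    eps_pair:   array of shape (L, L, V, V), pairwise terms eps_ij(a, b).
    """
    x = np.asarray(x)
    pos = np.arange(len(x))
    # Sum of eps_i(x_i) over all positions.
    single = eps_single[pos, x].sum()
    # pair_matrix[i, j] = eps_ij(x_i, x_j) for every ordered pair (i, j).
    pair_matrix = eps_pair[pos[:, None], pos[None, :], x[:, None], x[None, :]]
    # Equation 9 sums over j != i, so the diagonal is excluded.
    pair = pair_matrix.sum() - np.trace(pair_matrix)
    return single + pair

# Hypothetical random parameters for a length-4 sequence over 20 amino acids.
rng = np.random.default_rng(0)
L, V = 4, 20
eps_single = rng.normal(size=(L, V))
eps_pair = rng.normal(size=(L, L, V, V))
x = rng.integers(0, V, size=L)
energy = mrf_energy(x, eps_single, eps_pair)
```

The vectorized indexing avoids an explicit double loop over positions, which matters because the pairwise term grows quadratically with sequence length.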

We train the model following CCMpred (Seemayer et al., 2014), using a pseudo-likelihood $\hat{P}_L(x)$ (provided in Appendix E) combined with ridge regularization to make the learning of $P_L(x)$ easier. Unlike CCMpred, we additionally add a lasso regularizer to the training objective to make the graph sparse, with regularization coefficients set to the same values as those of the ridge regularizer:

$$\begin{split}L&=\sum_{x}\log\hat{P}_{L}(x)-\mathcal{L}(\varepsilon)-\mathcal{R}(\varepsilon)\\ \mathcal{L}(\varepsilon)&=\lambda_{\text{single}}\sum_{i=1}^{L}\|\varepsilon_{i}\|_{1}^{1}+\lambda_{\text{pair}}\sum_{i,j=1,i\neq j}\|\varepsilon_{ij}\|_{1}^{1}\\ \mathcal{R}(\varepsilon)&=\lambda_{\text{single}}\sum_{i=1}^{L}\|\varepsilon_{i}\|_{2}^{2}+\lambda_{\text{pair}}\sum_{i,j=1,i\neq j}\|\varepsilon_{ij}\|_{2}^{2}\end{split} \tag{10}$$

where $\varepsilon_i=[\varepsilon_i(a_1),\varepsilon_i(a_2),\dots,\varepsilon_i(a_{20})]$ is the vector of the single-site constraints of the 20 amino acids at position $i$, and $\varepsilon_{ij}=[\varepsilon_{ij}(a_1,a_2),\varepsilon_{ij}(a_1,a_3),\dots,\varepsilon_{ij}(a_L,a_{L-1})]$ is the vector of all possible pairwise constraints at positions $i$ and $j$. Following Kamisetty et al. (2013), we set $\lambda_{\text{single}}=1$ and $\lambda_{\text{pair}}=0.2*(L-1)$.
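The two penalty terms of Eq. (10) can be sketched as follows. This is a minimal NumPy illustration, not the CCMpred implementation: the array layout and the name `regularizers` are assumptions, and the pseudo-likelihood term itself (Appendix E) is not reproduced here.

```python
import numpy as np

def regularizers(single, pair, lam_single=1.0, lam_pair=None):
    """Lasso and ridge penalties of Eq. (10).

    single : array (L, A) of single-site constraints      (assumed layout)
    pair   : array (L, L, A, A) of pairwise constraints   (assumed layout)
    """
    L = single.shape[0]
    if lam_pair is None:
        lam_pair = 0.2 * (L - 1)           # setting stated in the text
    off_diag = ~np.eye(L, dtype=bool)      # keep only position pairs with i != j
    pair_off = pair[off_diag]              # shape (L * (L - 1), A, A)
    lasso = lam_single * np.abs(single).sum() + lam_pair * np.abs(pair_off).sum()
    ridge = lam_single * (single ** 2).sum() + lam_pair * (pair_off ** 2).sum()
    return float(lasso), float(ridge)

# Toy check with all-ones parameters (hypothetical values).
lasso, ridge = regularizers(np.ones((3, 20)), np.ones((3, 3, 20, 20)))
```

The training objective would then be the summed log pseudo-likelihood minus both returned penalties.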

Figure 3: Learning combinatorial structure constraints through Markov random fields.

After training the MRFs, we can encode a protein sequence $x$ with the learned constraints. Specifically, we first encode the $i$-th amino acid by concatenating its single-site constraint with all of its possible pairwise constraints:

$$\begin{split}&\boldsymbol{\varepsilon_{i}}(x_{i})=[\varepsilon_{i}(x_{i}),\varepsilon_{i1}(x_{i},a_{1\boldsymbol{\cdot}}),\dots,\varepsilon_{iL}(x_{i},a_{L\boldsymbol{\cdot}})]\\ &\varepsilon_{ij}(x_{i},a_{j\boldsymbol{\cdot}})=[\varepsilon_{ij}(x_{i},a_{1}),\varepsilon_{ij}(x_{i},a_{2}),\dots,\varepsilon_{ij}(x_{i},a_{20})]\end{split} \tag{11}$$

where $\varepsilon_{ij}(x_i,a_{j\boldsymbol{\cdot}})$ gathers the pairwise constraints between $x_i$ and each of the 20 amino acids at any position $j\neq i$. Then we map $\boldsymbol{\varepsilon_i}(x_i)$ to the amino-acid embedding space with a trainable parameter $W_\varepsilon$, and add the mapped vector to the original amino-acid embedding $e(x_i)$ to get the final feature vector used as our model input:

$$\begin{split}&\hat{e}(x_{i})=e(x_{i})+W_{\varepsilon}*\boldsymbol{\varepsilon_{i}}(x_{i})\\ &H_{0}=\tilde{z},\quad H_{i}=\hat{e}(x_{i-1})\;\text{for}\;1\leq i<L\end{split} \tag{12}$$

which means that, for the autoregressive Transformer decoder, the first token input is set to the sampled latent vector $\tilde{z}$ and the input at every other position is set to the combinatorial structure augmented feature vector of the previous token.
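The feature construction of Eqs. (11) and (12) can be sketched with NumPy as below. Several details are assumptions made only for illustration: positions are 0-indexed, the block for $j=i$ is zero-filled so that every feature vector has length $1+20L$, and `single`, `pair`, `emb`, and `W` are hypothetical arrays standing in for the learned constraints, the embedding table $e(\cdot)$, and $W_\varepsilon$.

```python
import numpy as np

def mrf_feature(i, a, single, pair):
    """Eq. (11): [eps_i(a), eps_i1(a, .), ..., eps_iL(a, .)], length 1 + L * A."""
    L, A = single.shape
    blocks = [np.array([single[i, a]])]
    for j in range(L):
        # Assumption: the j == i block carries no pairwise term and is zero-filled.
        blocks.append(np.zeros(A) if j == i else pair[i, j, a, :])
    return np.concatenate(blocks)

def decoder_inputs(x, z, emb, W, single, pair):
    """Eq. (12) with 0-indexed tokens: H_0 = z, H_i = e_hat(x_{i-1}) for 1 <= i < L."""
    H = [z]
    for i in range(len(x) - 1):
        H.append(emb[x[i]] + W @ mrf_feature(i, x[i], single, pair))
    return np.stack(H)

rng = np.random.default_rng(0)
L_toy, A, d = 4, 20, 8                      # hypothetical toy sizes
single = rng.normal(size=(L_toy, A))
pair = rng.normal(size=(L_toy, L_toy, A, A))
emb = rng.normal(size=(A, d))               # stands in for e(.)
W = rng.normal(size=(d, 1 + L_toy * A))     # stands in for W_eps
z = rng.normal(size=d)                      # stands in for the sampled latent vector
x = [3, 0, 17, 5]
H = decoder_inputs(x, z, emb, W, single, pair)   # shape (L, d)
```

The decoder thus receives $L$ input vectors: the latent vector followed by the augmented embeddings of the first $L-1$ tokens.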

At iteration $t$ in MCEM, the learning process for the combinatorial structure enhanced latent generative model becomes:
E-step:

$$\mathcal{L}_{t}=\frac{\sum_{n=1}^{N}w(x_{n},z_{n})\,\mathcal{F}(R_{\omega}(z_{n}),\phi;\boldsymbol{\varepsilon})}{\sum_{n=1}^{N}w(x_{n},z_{n})} \tag{13}$$

M-step:

$$\phi^{(t+1)}=\operatorname*{argmax}_{\phi}\mathcal{L}_{t} \tag{14}$$

$\boldsymbol{\varepsilon}$ is fixed during latent generative model learning, and we omit it in the following parts to keep the description simple. The overall learning algorithm is given in Appendix B.2.
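The E-step quantity in Eq. (13) is a self-normalized importance-weighted average. A minimal NumPy sketch follows, assuming the log-weights $\log w(x_n,z_n)$ and the per-sample values of $\mathcal{F}$ have already been computed elsewhere. The name `weighted_objective` and the log-space stabilization are illustrative choices, not details taken from the paper.

```python
import numpy as np

def weighted_objective(log_w, f_values):
    """Self-normalized importance-weighted average, in the form of Eq. (13).

    log_w    : log importance weights, one per sample (assumed precomputed)
    f_values : per-sample values of F, one per sample (assumed precomputed)
    """
    log_w = np.asarray(log_w, dtype=float)
    f_values = np.asarray(f_values, dtype=float)
    w = np.exp(log_w - log_w.max())   # subtracting the max leaves the ratio unchanged
    w_norm = w / w.sum()              # normalized weights sum to one
    return float(np.dot(w_norm, f_values))

# Toy example with hypothetical numbers: weights 1, 1, 2 and values 0, 4, 8.
value = weighted_objective(np.log([1.0, 1.0, 2.0]), [0.0, 4.0, 8.0])
```

Because only weight ratios matter, the weights need to be known only up to a common constant, and the M-step of Eq. (14) then maximizes this quantity over $\phi$.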

4 Experiments

In this section, we conduct extensive experiments to validate the effectiveness of the proposed IsEM-Pro on protein sequence design tasks.

4.1 Implementation Details

Our model is built on the Transformer (Vaswani et al., 2017) with a 6-layer encoder initialized by ESM-2 (Lin et al., 2022; checkpoint: https://dl.fbaipublicfiles.com/fairesm/models/esm2_t6_8M_UR50D.pt) and a randomly initialized 2-layer decoder. The encoder parameters are fixed during training, so the MRFs features are incorporated only in the decoder. The model hidden size and feed-forward hidden size are set to 320 and 1280 respectively, as in ESM-2. We use the [CLS] representation from the last encoder layer to calculate the mean and variance vectors of the latent variable through a single-layer mapping. The sampled latent vector is then used as the first token input of the decoder. The latent vector size is correspondingly set to 320. We first train a VAE model as $P_\theta$ for 30 epochs, and $\phi^{(0)}$ is initialized by $\theta$. The number of iterations in the importance sampling based VEM is set to 10. The protein combinatorial structure constraints $\boldsymbol{\varepsilon}$ are learned on the training sequences of each dataset instead of real multiple sequence alignments (MSAs) to keep the comparison fair.

The mini-batch size and learning rate are set to 4,096 tokens and 1e-5 respectively. The model is trained on 1 NVIDIA RTX A6000 GPU card. We use the Adam algorithm (Kingma & Ba, 2014) as the optimizer, with a linear warm-up over the first 4,000 steps and linear decay for later steps. We randomly split each dataset into training/validation sets with a ratio of 9:1. We run all experiments five times and report the average scores. More experimental settings are given in Appendix B.1.
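The warm-up and decay schedule described above can be sketched as a plain function. The peak rate (1e-5) and warm-up length (4,000 steps) come from the text; the total number of steps and the decay to exactly zero are assumptions made for illustration, since neither is stated here.

```python
def learning_rate(step, peak_lr=1e-5, warmup_steps=4000, total_steps=100_000):
    """Linear warm-up to peak_lr, then linear decay.

    total_steps and the final value of zero are hypothetical choices.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)

# Rates at a few representative steps.
schedule = [learning_rate(s) for s in (0, 2000, 4000, 52_000, 100_000)]
```

Under these assumptions the rate peaks at step 4,000 and is halved again at step 52,000, midway through the decay phase.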

At inference time, we design protein sequences by taking the wild-type as the encoder input, and the latent vector is sampled from the prior distribution $P(z)$. The sequences are decoded using top-5 sampling. The candidate number is set to K=128, following the setting of Jain et al. (2022) on the GFP dataset.
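Top-5 sampling restricts each decoding step to the five highest-scoring amino acids and samples among them after renormalization. The sketch below shows one decoding step under that standard interpretation; `sample_top_k` and its arguments are hypothetical names and are not taken from the released code.

```python
import numpy as np

def sample_top_k(logits, k=5, rng=None):
    """Sample one token index from the k highest-scoring entries of `logits`."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=float)
    top = np.argpartition(logits, -k)[-k:]        # indices of the k largest logits
    p = np.exp(logits[top] - logits[top].max())   # softmax restricted to the top k
    p /= p.sum()
    return int(rng.choice(top, p=p))

rng = np.random.default_rng(0)
logits = np.arange(20, dtype=float)   # toy scores over a 20-letter alphabet
draws = {sample_top_k(logits, k=5, rng=rng) for _ in range(200)}
```

Every drawn index lies among the five largest logits, so low-probability residues are never emitted while some randomness is retained across the K candidates.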

4.2 Datasets

Following Ren et al. (2022), we evaluate our method on the following eight protein engineering benchmarks:
(1) Green Fluorescent Protein (avGFP): The goal is to design sequences with higher log-fluorescence intensity values. We collect data following Sarkisyan et al. (2016). (2) Adeno-Associated Viruses (AAV): The target is to generate an amino-acid segment (positions 561-588) with higher gene therapeutic efficiency. We collect data following Bryant et al. (2021). (3) TEM-1 $\beta$-Lactamase (TEM): The goal is to design sequences with high thermodynamic stability. We merge the data from Firnberg et al. (2014). (4) Ubiquitination Factor Ube4b (E4B): The objective is to design sequences with higher enzyme activity. We gather data following Starita et al. (2013). (5) Aliphatic Amide Hydrolase (AMIE): The goal is to produce amidase sequences with higher enzyme activity. We merge data following Wrenbeck et al. (2017). (6) Levoglucosan Kinase (LGK): The target is to optimize LGK protein sequences for improved enzyme activity. We collect data following Klesmith et al. (2015). (7) Poly(A)-binding Protein (Pab1): The goal is to design sequences with higher binding fitness to multiple adenosine monophosphates. We gather data following Melamed et al. (2013). (8) SUMO E2 Conjugase (UBE2I): We aim to find human SUMO E2 conjugase variants with a higher growth rescue rate. Data are obtained following Weile et al. (2017). Detailed data statistics, including protein sequence length, data size and data source, are provided in Appendix A.

| Models | avGFP | AAV | TEM | E4B | AMIE | LGK | Pab1 | UBE2I | Average |
|---|---|---|---|---|---|---|---|---|---|
| CMA-ES | 4.492 | -3.417 | 0.375 | -0.768 | -8.224 | -0.077 | 0.164 | 2.461 | -0.624 |
| FBGAN | 1.251 | -4.227 | 0.006 | 0.369 | -2.410 | -1.206 | 0.029 | 0.208 | -0.747 |
| DbAS | 3.548 | 4.327 | 0.003 | -1.286 | -2.658 | -1.148 | 1.524 | 3.088 | 0.924 |
| CbAS | 3.550 | 4.336 | 0.106 | -1.000 | -1.306 | -0.362 | 1.842 | 3.263 | 1.303 |
| PEX | 3.764 | 3.265 | 0.121 | 5.019 | -0.474 | 0.007 | 1.153 | 1.995 | 1.856 |
| GFlowNet-AL | 5.062 | 1.205 | 1.552 | 3.155 | 0.059 | 0.027 | 2.168 | 3.576 | 2.101 |
| ESM-Search | 2.610 | -5.099 | 0.148 | -1.860 | -2.351 | -0.029 | 1.406 | 3.244 | -0.241 |
| IsEM-Pro | **6.185** | **4.813** | **1.850** | **5.737** | **0.062** | **0.035** | **2.923** | **4.536** | **3.267** |
| – w/o ESM | 1.214 | -4.313 | 0.005 | -1.352 | -6.376 | -0.225 | 0.072 | 1.843 | -1.141 |
| – w/o ISEM | 4.708 | 1.130 | 0.708 | 0.046 | -2.335 | -0.077 | 1.913 | 0.475 | 1.342 |
| – w/o MRFs | 4.376 | 1.008 | 0.952 | 0.045 | -1.771 | -0.012 | 1.652 | 2.418 | 1.083 |
| – w/o LV | 4.274 | 2.251 | 0.078 | -1.612 | -2.266 | -0.931 | 0.041 | -0.262 | 0.196 |

Table 1: Maximum fitness scores (MFS) of all methods on eight datasets. Higher values indicate better functional properties in the dataset.

| Models | avGFP | AAV | TEM | E4B | AMIE | LGK | Pab1 | UBE2I | Average |
|---|---|---|---|---|---|---|---|---|---|
| CMA-ES | **225.12** | 23.50 | 261.60 | 86.81 | 283.90 | 317.08 | 61.16 | 140.92 | 175.01 |
| FBGAN | 0.64 | 8.31 | 0.46 | 3.87 | 33.87 | 17.35 | 3.07 | 3.00 | 8.82 |
| DbAS | 3.04 | 3.00 | 3.67 | 5.94 | 1.32 | 2.30 | 4.05 | 11.80 | 4.33 |
| CbAS | 1.31 | 3.01 | 7.03 | 7.09 | 6.01 | 6.15 | 9.86 | 22.73 | 8.23 |
| PEX | 6.83 | 4.35 | 10.26 | 5.22 | 7.56 | 13.24 | 5.33 | 10.32 | 7.88 |
| GFlowNet-AL | 224.78 | **25.57** | **266.43** | 43.62 | 219.84 | 212.25 | 37.13 | 49.79 | 134.92 |
| ESM-Search | 3.79 | 3.58 | 11.56 | 3.82 | 3.83 | 3.78 | 5.71 | 6.59 | 5.33 |
| IsEM-Pro | 218.62 | 22.92 | 202.09 | **91.35** | **293.30** | **405.99** | **68.27** | 122.66 | **178.15** |
| – w/o ESM | 204.21 | 13.87 | 194.78 | 7.90 | 276.88 | 362.98 | 3.30 | 119.87 | 147.96 |
| – w/o ISEM | 122.15 | 22.91 | 70.12 | 86.17 | 145.74 | 169.17 | 60.22 | 13.05 | 153.09 |
| – w/o MRFs | 217.35 | 17.02 | 225.88 | 84.64 | 268.07 | 381.66 | 66.26 | **143.29** | 175.52 |
| – w/o LV | 22.70 | 5.26 | 10.12 | 5.24 | 8.65 | 20.28 | 0.67 | 2.98 | 9.48 |

Table 2: Diversity scores of all models on eight datasets. Higher values indicate more diverse protein sequences.

| Models | avGFP | AAV | TEM | E4B | AMIE | LGK | Pab1 | UBE2I | Average |
|---|---|---|---|---|---|---|---|---|---|
| CMA-ES | 221.55 | 22.73 | 269.25 | 93.78 | 256.07 | 415.63 | 59.35 | 128.10 | 183.30 |
| FBGAN | 0.05 | 2.76 | 0.08 | 0.63 | 57.87 | 39.36 | 0.75 | 0.80 | 12.43 |
| DbAS | 1.01 | 3.01 | 1.47 | 1.09 | 1.12 | 1.63 | 1.64 | 2.05 | 1.66 |
| CbAS | 4.02 | 3.03 | 2.06 | 1.90 | 1.33 | 1.09 | 2.71 | 2.95 | 1.92 |
| PEX | 3.59 | 1.88 | 8.57 | 4.08 | 4.50 | 10.53 | 3.63 | 10.24 | 5.87 |
| GFlowNet-AL | 221.95 | 22.83 | 266.99 | 86.78 | 316.79 | 412.58 | 61.33 | 143.28 | 191.56 |
| ESM-Search | 1.50 | 1.83 | 7.32 | 1.21 | 0.92 | 0.91 | 5.46 | 6.14 | 3.16 |
| IsEM-Pro | **226.31** | **23.81** | **270.27** | **96.57** | **332.23** | **420.93** | **70.09** | **153.27** | **199.18** |
| – w/o ESM | 198.08 | 9.25 | 176.20 | 3.79 | 264.36 | 340.62 | 1.49 | 110.53 | 138.04 |
| – w/o ISEM | 195.85 | 16.36 | 244.05 | 85.81 | 306.22 | 382.12 | 53.01 | 99.51 | 180.40 |
| – w/o MRFs | 207.59 | 9.53 | 221.49 | 90.81 | 244.14 | 327.14 | 58.44 | 129.75 | 161.11 |
| – w/o LV | 16.67 | 0.89 | 4.39 | 0.48 | 3.98 | 9.64 | 0.21 | 1.76 | 4.75 |

Table 3: Novelty scores of all models on eight datasets. Higher values indicate more novel protein sequences.

Figure 4: Approximate KL divergence on eight protein datasets. Panels: avGFP and E4B; AAV and TEM; AMIE and LGK; Pab1 and UBE2I.

4.3 Baseline Models

We compare our method against the following representative baselines: (1) CMA-ES (Hansen & Ostermeier, 2001) is a widely used evolutionary search algorithm. (2) FBGAN (Gupta & Zou, 2019) is a feedback-loop architecture built on a GAN generative model. (3) DbAS (Brookes & Listgarten, 2018) is a probabilistic modeling framework that uses an adaptive sampling algorithm. (4) CbAS (Brookes et al., 2019) improves on DbAS by conditioning on the desired properties. (5) PEX (Ren et al., 2022) is a model-guided sequence design algorithm using proximal exploration. (6) GFlowNet-AL (Jain et al., 2022) applies GFlowNet to design biological sequences. We use the implementations of CMA-ES, DbAS and CbAS provided in Trabucco et al. (2022); for the other baselines, we use their released code. To better analyze the influence of different components in our model, we also conduct the following ablation tests: (1) IsEM-Pro-w/o-ESM removes ESM-2 as the encoder initialization. (2) IsEM-Pro-w/o-ISEM removes the iterative optimization process. (3) IsEM-Pro-w/o-MRFs removes the MRFs features and the iterative optimization process. (4) IsEM-Pro-w/o-LV removes the latent variable, the MRFs features and the iterative optimization process. (5) ESM-Search samples sequences from the softmax distribution obtained by finetuning ESM-2 on the protein datasets and taking the wild-type as input.

4.4 Evaluation Metrics

We use three automatic metrics to evaluate the performance of the designed sequences: (1) MFS: Maximum fitness score. The oracle model adopted to evaluate MFS is described in Appendix B.1; (2) Diversity proposed by Jain et al. (2022) is used to evaluate how different the designed candidates are from each other; (3) Novelty proposed by Jain et al. (2022) is used to evaluate how different the proposed candidates are from the sequences in training data.
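The diversity and novelty metrics can be sketched as follows. This is an illustration under explicit assumptions: sequences are compared with Levenshtein edit distance, diversity is the mean pairwise distance among candidates, and novelty is the mean distance from each candidate to its nearest training sequence. The authoritative definitions are those of Jain et al. (2022), and the function names here are hypothetical.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def diversity(candidates):
    """Mean pairwise edit distance among candidates (assumed definition)."""
    pairs = list(combinations(candidates, 2))
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)

def novelty(candidates, training):
    """Mean distance from each candidate to its nearest training sequence (assumed)."""
    return sum(min(edit_distance(c, t) for t in training)
               for c in candidates) / len(candidates)

# Toy sequences, not real proteins.
div = diversity(["AAAA", "AAAB", "AABB"])
nov = novelty(["AAAB", "CCCC"], ["AAAA"])
```

Under these definitions both scores are bounded by the sequence length, which is consistent with the per-dataset scales reported in Tables 2 and 3.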

4.5 Main Results

Tables 1, 2 and 3 report the maximum fitness scores, diversity scores and novelty scores of all models, respectively.

IsEM-Pro achieves the highest fitness scores on all protein families and outperforms the previous best method GFlowNet-AL by 55% on average (Table 1). The reasons are twofold. On the one hand, the importance sampling based VEM helps our model navigate to a better region instead of getting trapped in a poor local optimum. On the other hand, the combinatorial structure features help to recognize the preferred mutation patterns, which have a higher success rate under natural selection pressure, potentially leading to sequences with higher fitness scores.
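The 55% figure can be checked directly from the Average column of Table 1. The short computation below only restates that arithmetic; the variable names are illustrative.

```python
isem_pro_avg = 3.267      # Table 1, IsEM-Pro, Average column
gflownet_al_avg = 2.101   # Table 1, GFlowNet-AL, Average column

# Relative improvement of the average maximum fitness score.
relative_gain = (isem_pro_avg - gflownet_al_avg) / gflownet_al_avg
```

The ratio evaluates to roughly 0.555, in line with the reported improvement of 55%.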

IsEM-Pro achieves the highest average diversity score over the eight tasks (Table 2). Our model attains the highest diversity on 4 out of 8 tasks, while GFlowNet-AL does so on 2 and CMA-ES on 1. This indicates that although combinatorial structure constraints provide guidance toward preferred protein patterns, they might also limit the sequence design to these patterns to some extent. The latent variable can capture complex inter-dependencies among amino acids, which benefits more diverse protein design.

IsEM-Pro designs more novel protein sequences on all datasets (Table 3). Our model achieves the highest novelty scores on all datasets because more new samples are involved during the importance sampling based iterative optimization process, which is beneficial for more novel protein design.

4.6 Ablation Study

The bottom halves of Tables 1, 2 and 3 report the results of the ablation tests. IsEM-Pro-w/o-MRFs improves the average diversity and novelty scores by as much as 33x compared with IsEM-Pro-w/o-LV, which demonstrates that introducing a latent variable significantly helps to generate diverse proteins. IsEM-Pro-w/o-MRFs achieves higher maximum fitness scores than IsEM-Pro-w/o-ESM on all datasets, validating that adopting a pretrained protein language model as the encoder helps to design more satisfactory protein sequences. However, directly finetuning ESM-2 to sample candidates (ESM-Search) drops 1.3 points in average fitness score compared with taking ESM-2 as an encoder (IsEM-Pro-w/o-MRFs), demonstrating that ESM-2 is not suitable for direct sequence design. Incorporating combinatorial structure features further improves the fitness of the designed proteins (IsEM-Pro-w/o-ISEM vs. IsEM-Pro-w/o-MRFs), and on top of this, learning the proposal distribution by importance sampling based VEM further promotes more desirable, diverse and novel protein generation.
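The ablation figures quoted above follow from the Average columns of Tables 1, 2 and 3. The snippet below simply recomputes them; the variable names are illustrative.

```python
# Average-column values copied from Tables 1-3.
diversity_wo_mrfs, diversity_wo_lv = 175.52, 9.48   # Table 2
novelty_wo_mrfs, novelty_wo_lv = 161.11, 4.75       # Table 3
mfs_wo_mrfs, mfs_esm_search = 1.083, -0.241         # Table 1

diversity_ratio = diversity_wo_mrfs / diversity_wo_lv   # gain from the latent variable
novelty_ratio = novelty_wo_mrfs / novelty_wo_lv         # gain from the latent variable
fitness_drop = mfs_wo_mrfs - mfs_esm_search             # ESM-2 as encoder vs. ESM-Search
```

The novelty ratio is the larger of the two and corresponds to the quoted "as much as 33x", while the fitness gap corresponds to the quoted 1.3 points.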

5 Analysis

In this section, we conduct further analyses to demonstrate the effectiveness of our method from different aspects.

Figure 5: 3-D visualization of a designed sequence of green fluorescent protein. (a) 3-D visualization of the sequence with the highest fitness. (b) Superposition of 5 fluorescent proteins in the top-5 templates.

| Methods | MFS | Diversity | Novelty |
|---|---|---|---|
| Latent-Add | 4.102 | 218.26 | 209.85 |
| Latent-Memory | 4.040 | **219.00** | 211.38 |
| IsEM-Pro | **6.185** | 218.62 | **226.31** |

Table 4: Results of different schemes of introducing a latent variable with a pretrained encoder, evaluated on the avGFP dataset.

5.1 Approximate KL Divergence

To validate how close the proposal distribution $Q_\phi(x)$ is to the posterior distribution $P_\theta(x|\mathcal{S})$, we calculate the KL divergence between them through Monte Carlo approximation. We choose the KL divergence because we have proven in Lemma C.1 (provided in Appendix C.2) that the sampling difference between these two distributions can be bounded under this divergence. We leverage an unbiased and low-variance estimator (proof shown in Appendix C.1) to approximate the KL divergence as follows:

$$D_{KL}(Q_{\phi}(x)\,||\,P_{\theta}(x|\mathcal{S}))=E_{Q_{\phi}(x)}[r(x)-1-\log r(x)] \tag{15}$$

where $r(x)=\frac{P_{\theta}(x|\mathcal{S})}{Q_{\phi}(x)}$. The approximate KL divergence on the eight protein datasets over 1k-10k samples is illustrated in Figure 4. From the figure, we can see that the variance of the KL divergence is very small across different sample sizes for all datasets. Besides, the KL divergence finally arrives at a small value, such as 0.018 for E4B and 0.033 for avGFP. This gives empirical evidence that sampling from the final $Q_\phi(x)$ differs only slightly from sampling from the posterior distribution $P_\theta(x|\mathcal{S})$.
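The estimator in Eq. (15) can be illustrated on a toy discrete distribution where the exact KL divergence is available for comparison. The distributions `q` and `p` below are hypothetical stand-ins for $Q_\phi(x)$ and $P_\theta(x|\mathcal{S})$, and `kl_estimate` is an illustrative name.

```python
import numpy as np

def kl_estimate(log_p, log_q):
    """Monte Carlo estimate of KL(Q || P) from samples x ~ Q, following Eq. (15).

    log_p, log_q : log P(x) and log Q(x) evaluated at the same samples drawn from Q
    """
    log_r = np.asarray(log_p) - np.asarray(log_q)     # log r(x) = log P(x) - log Q(x)
    return float(np.mean(np.exp(log_r) - 1.0 - log_r))

rng = np.random.default_rng(0)
q = np.array([0.5, 0.3, 0.2])   # toy proposal distribution (hypothetical)
p = np.array([0.4, 0.4, 0.2])   # toy target distribution (hypothetical)
samples = rng.choice(3, size=10_000, p=q)

estimate = kl_estimate(np.log(p[samples]), np.log(q[samples]))
exact = float(np.sum(q * np.log(q / p)))
```

Each summand $r-1-\log r$ is non-negative, so the estimate can never be negative, and the $r-1$ term has zero mean under $Q$ when $P$ is normalized, which leaves the expectation equal to the KL divergence.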

5.2 Effect of VAE Implementation Method

Next, we study the effect of different schemes for introducing a latent variable alongside a pretrained encoder. Some works have tried adding the latent representation to the original embedding layer (Latent-Add) or using it as an additional memory (Latent-Memory) when adopting a pretrained language model as the encoder (Li et al., 2020). We also implement our model with these two schemes and evaluate the model performance on the avGFP dataset. Table 4 shows that our method, which takes the latent representation as the first token input of the decoder, achieves higher fitness and novelty scores, with only a mild decrease in diversity.

5.3 Case Study

To gain insight into the quality of the designed proteins, we analyze the generated avGFP sequence with the highest fitness in detail using the Phyre2 tool (Kelley et al., 2015). Figure 5(a) illustrates that the generated variant can fold stably. According to the software, the most similar protein is Cytochrome b562 integral fusion with enhanced green fluorescent protein (EGFP) (Edwards et al., 2008). Using this protein as the template, 227 residues (96% of the candidate sequence) have been modeled with 100.0% confidence. Details are given in Appendix D. Figure 5(b) visualizes the superposition of the top-5 most similar templates to our sequence in the Protein Data Bank, which are all fluorescent proteins and show highly consistent structure in most regions, validating that our model can design a real fluorescent protein.

6 Related Work

Machine Learning for Protein Fitness Landscape Prediction. Machine learning has been increasingly used for modeling the protein fitness landscape, which is crucial for protein engineering. Some works leverage co-evolution information from multiple sequence alignments to predict fitness scores (Kamisetty et al., 2013; Luo et al., 2021). Melamed et al. (2013) propose to construct a deep latent generative model to capture higher-order mutations. Meier et al. (2021) propose to use pretrained protein language models to enable zero-shot prediction. The learned protein landscape models can be used to replace the expensive wet-lab validation when screening large numbers of designed sequences (Rao et al., 2019; Ren et al., 2022).

Methods for Protein Sequence Design. Protein sequence design has been studied with a wide variety of methods, including traditional directed evolution (Arnold, 1998; Dalby, 2011; Packer & Liu, 2015; Arnold, 2018) and machine learning methods. The main machine learning algorithms include reinforcement learning (Angermueller et al., 2019; Jain et al., 2022), Bayesian optimization (Belanger et al., 2019; Moss et al., 2020; Terayama et al., 2021), search using deep generative models (Brookes & Listgarten, 2018; Brookes et al., 2019; Madani et al., 2020; Kumar & Levine, 2020; Das et al., 2021; Hoffman et al., 2022; Melnyk et al., 2021; Ren et al., 2022), adaptive evolution methods (Hansen, 2006; Swersky et al., 2020; Sinai et al., 2020), and likelihood-free inference (Zhang et al., 2021).

Extending Brookes et al. (2019), we propose importance sampling based Monte Carlo EM to learn a latent generative model that is enhanced by combinatorial structure features of the protein space. The whole framework not only helps the generative model climb to a better region in either a Fujiyama or a Badlands landscape (Kauffman & Weinberger, 1989), but also significantly promotes design diversity and novelty.

7 Conclusion

This paper proposes IsEM-Pro, a latent generative model for protein sequence design, which incorporates additional combinatorial structure features learned by MRFs. We use importance weighted EM to learn the model, which can not only enhance design diversity and novelty, but also lead to protein sequences with higher fitness. Experimental results on eight protein sequence design tasks show that our method outperforms several strong baselines on all metrics.

References

  • Angermueller et al. (2019) Angermueller, C., Dohan, D., Belanger, D., Deshpande, R., Murphy, K., and Colwell, L. Model-based reinforcement learning for biological sequence design. In International conference on learning representations, 2019.
  • Arnold (1998) Arnold, F. H. Design by directed evolution. Accounts of chemical research, 31(3):125–131, 1998.
  • Arnold (2018) Arnold, F. H. Directed evolution: bringing new chemistry to life. Angewandte Chemie International Edition, 57(16):4143–4148, 2018.
  • Belanger et al. (2019) Belanger, D., Vora, S., Mariet, Z., Deshpande, R., Dohan, D., Angermueller, C., Murphy, K., Chapelle, O., and Colwell, L. Biological sequences design using batched bayesian optimization. 2019.
  • Biswas et al. (2021) Biswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M., and Church, G. M. Low-n protein engineering with data-efficient deep learning. Nature methods, 18(4):389–396, 2021.
  • Bloom & Arnold (2009) Bloom, J. D. and Arnold, F. H. In the light of directed evolution: pathways of adaptive protein evolution. Proceedings of the National Academy of Sciences, 106(supplement_1):9995–10000, 2009.
  • Brookes et al. (2019) Brookes, D., Park, H., and Listgarten, J. Conditioning by adaptive sampling for robust design. In International conference on machine learning, pp. 773–782. PMLR, 2019.
  • Brookes & Listgarten (2018) Brookes, D. H. and Listgarten, J. Design by adaptive sampling. arXiv preprint arXiv:1810.03714, 2018.
  • Bryant et al. (2021) Bryant, D. H., Bashir, A., Sinai, S., Jain, N. K., Ogden, P. J., Riley, P. F., Church, G. M., Colwell, L. J., and Kelsic, E. D. Deep diversification of an aav capsid protein by machine learning. Nature Biotechnology, 39(6):691–696, 2021.
  • Chothia (1984) Chothia, C. Principles that determine the structure of proteins. Annual review of biochemistry, 53(1):537–572, 1984.
  • Csiszár & Körner (2011) Csiszár, I. and Körner, J. Information theory: coding theorems for discrete memoryless systems. Cambridge University Press, 2011.
  • Dalby (2011) Dalby, P. A. Strategy and success for the directed evolution of enzymes. Current opinion in structural biology, 21(4):473–480, 2011.
  • Das et al. (2021) Das, P., Sercu, T., Wadhawan, K., Padhi, I., Gehrmann, S., Cipcigan, F., Chenthamarakshan, V., Strobelt, H., Dos Santos, C., Chen, P.-Y., et al. Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations. Nature Biomedical Engineering, 5(6):613–623, 2021.
  • Dieng & Paisley (2019) Dieng, A. B. and Paisley, J. Reweighted expectation maximization. arXiv preprint arXiv:1906.05850, 2019.
  • Edwards et al. (2008) Edwards, W. R., Busse, K., Allemann, R. K., and Jones, D. D. Linking the functions of unrelated proteins using a novel directed evolution domain insertion method. Nucleic acids research, 36(13):e78–e78, 2008.
  • Firnberg et al. (2014) Firnberg, E., Labonte, J. W., Gray, J. J., and Ostermeier, M. A comprehensive, high-resolution map of a gene’s fitness landscape. Molecular biology and evolution, 31(6):1581–1592, 2014.
  • Fox et al. (2007) Fox, R. J., Davis, S. C., Mundorff, E. C., Newman, L. M., Gavrilovic, V., Ma, S. K., Chung, L. M., Ching, C., Tam, S., Muley, S., et al. Improving catalytic function by prosar-driven enzyme evolution. Nature biotechnology, 25(3):338–344, 2007.
  • Go (1983) Go, N. Theoretical studies of protein folding. Annual review of biophysics and bioengineering, 12(1):183–210, 1983.
  • Gupta & Zou (2019) Gupta, A. and Zou, J. Feedback gan for dna optimizes protein functions. Nature Machine Intelligence, 1(2):105–111, 2019.
  • Hansen (2006) Hansen, N. The cma evolution strategy: a comparing review. Towards a new evolutionary computation, pp.  75–102, 2006.
  • Hansen & Ostermeier (2001) Hansen, N. and Ostermeier, A. Completely derandomized self-adaptation in evolution strategies. Evolutionary computation, 9(2):159–195, 2001.
  • Hastings (1970) Hastings, W. K. Monte carlo sampling methods using markov chains and their applications. 1970.
  • Higgins et al. (2016) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-vae: Learning basic visual concepts with a constrained variational framework. 2016.
  • Hoffman et al. (2022) Hoffman, S. C., Chenthamarakshan, V., Wadhawan, K., Chen, P.-Y., and Das, P. Optimizing molecules using efficient queries from property evaluations. Nature Machine Intelligence, 4(1):21–31, 2022.
  • Hopf et al. (2017) Hopf, T. A., Ingraham, J. B., Poelwijk, F. J., Schärfe, C. P., Springer, M., Sander, C., and Marks, D. S. Mutation effects predicted from sequence co-variation. Nature biotechnology, 35(2):128–135, 2017.
  • Jain et al. (2022) Jain, M., Bengio, E., Hernandez-Garcia, A., Rector-Brooks, J., Dossou, B. F., Ekbote, C. A., Fu, J., Zhang, T., Kilgour, M., Zhang, D., et al. Biological sequence design with gflownets. In International Conference on Machine Learning, pp. 9786–9801. PMLR, 2022.
  • Jordan et al. (1999) Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
  • Kamisetty et al. (2013) Kamisetty, H., Ovchinnikov, S., and Baker, D. Assessing the utility of coevolution-based residue–residue contact predictions in a sequence-and structure-rich era. Proceedings of the National Academy of Sciences, 110(39):15674–15679, 2013.
  • Kauffman & Weinberger (1989) Kauffman, S. A. and Weinberger, E. D. The nk model of rugged fitness landscapes and its application to maturation of the immune response. Journal of theoretical biology, 141(2):211–245, 1989.
  • Kelley et al. (2015) Kelley, L. A., Mezulis, S., Yates, C. M., Wass, M. N., and Sternberg, M. J. The phyre2 web portal for protein modeling, prediction and analysis. Nature protocols, 10(6):845–858, 2015.
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. 2014.
  • Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Klesmith et al. (2015) Klesmith, J. R., Bacik, J.-P., Michalczyk, R., and Whitehead, T. A. Comprehensive sequence-flux mapping of a levoglucosan utilization pathway in e. coli. ACS synthetic biology, 4(11):1235–1243, 2015.
  • Kumar & Levine (2020) Kumar, A. and Levine, S. Model inversion networks for model-based optimization. Advances in Neural Information Processing Systems, 33:5126–5137, 2020.
  • Labrou (2010) Labrou, N. E. Random mutagenesis methods for in vitro directed enzyme evolution. Current Protein and Peptide Science, 11(1):91–100, 2010.
  • Lagassé et al. (2017) Lagassé, H. D., Alexaki, A., Simhadri, V. L., Katagiri, N. H., Jankowski, W., Sauna, Z. E., and Kimchi-Sarfaty, C. Recent advances in (therapeutic protein) drug development. F1000Research, 6, 2017.
  • Levine & Casella (2001) Levine, R. A. and Casella, G. Implementations of the monte carlo em algorithm. Journal of Computational and Graphical Statistics, 10(3):422–439, 2001.
  • Li et al. (2020) Li, C., Gao, X., Li, Y., Peng, B., Li, X., Zhang, Y., and Gao, J. Optimus: Organizing sentences via pre-trained modeling of a latent space. arXiv preprint arXiv:2004.04092, 2020.
  • Lin et al. (2022) Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., dos Santos Costa, A., Fazel-Zarandi, M., Sercu, T., Candido, S., et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv, 2022.
  • Luo et al. (2021) Luo, Y., Jiang, G., Yu, T., Liu, Y., Vo, L., Ding, H., Su, Y., Qian, W. W., Zhao, H., and Peng, J. Ecnet is an evolutionary context-integrated deep learning framework for protein engineering. Nature communications, 12(1):1–14, 2021.
  • Ma et al. (2003) Ma, J. K., Drake, P. M., and Christou, P. The production of recombinant pharmaceutical proteins in plants. Nature reviews genetics, 4(10):794–805, 2003.
  • Madani et al. (2020) Madani, A., McCann, B., Naik, N., Keskar, N. S., Anand, N., Eguchi, R. R., Huang, P.-S., and Socher, R. Progen: Language modeling for protein generation. arXiv preprint arXiv:2004.03497, 2020.
  • Meier et al. (2021) Meier, J., Rao, R., Verkuil, R., Liu, J., Sercu, T., and Rives, A. Language models enable zero-shot prediction of the effects of mutations on protein function. Advances in Neural Information Processing Systems, 34:29287–29303, 2021.
  • Melamed et al. (2013) Melamed, D., Young, D. L., Gamble, C. E., Miller, C. R., and Fields, S. Deep mutational scanning of an rrm domain of the saccharomyces cerevisiae poly (a)-binding protein. Rna, 19(11):1537–1551, 2013.
  • Melnyk et al. (2021) Melnyk, I., Das, P., Chenthamarakshan, V., and Lozano, A. Benchmarking deep generative models for diverse antibody sequence design. arXiv preprint arXiv:2111.06801, 2021.
  • Moss et al. (2020) Moss, H., Leslie, D., Beck, D., Gonzalez, J., and Rayson, P. Boss: Bayesian optimization over string spaces. Advances in neural information processing systems, 33:15476–15486, 2020.
  • Neal (2001) Neal, R. M. Annealed importance sampling. Statistics and computing, 11(2):125–139, 2001.
  • Neal & Hinton (1998) Neal, R. M. and Hinton, G. E. A view of the em algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models, pp.  355–368. Springer, 1998.
  • Packer & Liu (2015) Packer, M. S. and Liu, D. R. Methods for the directed evolution of proteins. Nature Reviews Genetics, 16(7):379–394, 2015.
  • Rao et al. (2019) Rao, R., Bhattacharya, N., Thomas, N., Duan, Y., Chen, P., Canny, J., Abbeel, P., and Song, Y. Evaluating protein transfer learning with tape. Advances in neural information processing systems, 32, 2019.
  • Ren et al. (2022) Ren, Z., Li, J., Ding, F., Zhou, Y., Ma, J., and Peng, J. Proximal exploration for model-guided protein sequence design. bioRxiv, 2022.
  • Rives et al. (2021) Rives, A., Meier, J., Sercu, T., Goyal, S., Lin, Z., Liu, J., Guo, D., Ott, M., Zitnick, C. L., Ma, J., et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 118(15):e2016239118, 2021.
  • Romero & Arnold (2009) Romero, P. A. and Arnold, F. H. Exploring protein fitness landscapes by directed evolution. Nature reviews Molecular cell biology, 10(12):866–876, 2009.
  • Sarkisyan et al. (2016) Sarkisyan, K. S., Bolotin, D. A., Meer, M. V., Usmanova, D. R., Mishin, A. S., Sharonov, G. V., Ivankov, D. N., Bozhanova, N. G., Baranov, M. S., Soylemez, O., et al. Local fitness landscape of the green fluorescent protein. Nature, 533(7603):397–401, 2016.
  • Seemayer et al. (2014) Seemayer, S., Gruber, M., and Söding, J. Ccmpred—fast and precise prediction of protein residue–residue contacts from correlated mutations. Bioinformatics, 30(21):3128–3130, 2014.
  • Sinai et al. (2020) Sinai, S., Wang, R., Whatley, A., Slocum, S., Locane, E., and Kelsic, E. D. Adalead: A simple and robust adaptive greedy search algorithm for sequence design. arXiv preprint arXiv:2010.02141, 2020.
  • Singh et al. (2016) Singh, A., Pandey, A., Srivastava, A. K., Tran, L.-S. P., and Pandey, G. K. Plant protein phosphatases 2c: from genomic diversity to functional multiplicity and importance in stress management. Critical Reviews in Biotechnology, 36(6):1023–1035, 2016.
  • Starita et al. (2013) Starita, L. M., Pruneda, J. N., Lo, R. S., Fowler, D. M., Kim, H. J., Hiatt, J. B., Shendure, J., Brzovic, P. S., Fields, S., and Klevit, R. E. Activity-enhancing mutations in an e3 ubiquitin ligase identified by high-throughput mutagenesis. Proceedings of the National Academy of Sciences, 110(14):E1263–E1272, 2013.
  • Starr & Thornton (2017) Starr, T. N. and Thornton, J. W. Exploring protein sequence–function landscapes. Nature biotechnology, 35(2):125–126, 2017.
  • Swersky et al. (2020) Swersky, K., Rubanova, Y., Dohan, D., and Murphy, K. Amortized bayesian optimization over discrete spaces. In Conference on Uncertainty in Artificial Intelligence, pp. 769–778. PMLR, 2020.
  • Terayama et al. (2021) Terayama, K., Sumita, M., Tamura, R., and Tsuda, K. Black-box optimization for automated discovery. Accounts of Chemical Research, 54(6):1334–1346, 2021.
  • Trabucco et al. (2022) Trabucco, B., Geng, X., Kumar, A., and Levine, S. Design-bench: Benchmarks for data-driven offline model-based optimization. In International Conference on Machine Learning, pp. 21658–21676. PMLR, 2022.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Weile et al. (2017) Weile, J., Sun, S., Cote, A. G., Knapp, J., Verby, M., Mellor, J. C., Wu, Y., Pons, C., Wong, C., van Lieshout, N., et al. A framework for exhaustively mapping functional missense variants. Molecular systems biology, 13(12):957, 2017.
  • Wrenbeck et al. (2017) Wrenbeck, E. E., Azouz, L. R., and Whitehead, T. A. Single-mutation fitness landscapes for an enzyme on multiple substrates reveal specificity is globally encoded. Nature communications, 8(1):1–10, 2017.
  • Wright et al. (2005) Wright, C. F., Teichmann, S. A., Clarke, J., and Dobson, C. M. The importance of sequence diversity in the aggregation and evolution of proteins. Nature, 438(7069):878–881, 2005.
  • Zhang et al. (2021) Zhang, D., Fu, J., Bengio, Y., and Courville, A. Unifying likelihood-free inference with black-box sequence design and beyond. arXiv preprint arXiv:2110.03372, 2021.

Appendix A Data Statistics

We provide detailed data statistics in the following table, including protein sequence length, data size and data source. We have checked and cleaned the data to make sure that they do not contain personally identifiable information or offensive content.

| Protein | Length | Size | Data Source |
|---|---|---|---|
| avGFP | 237 | 49,855 | https://figshare.com/articles/dataset/Local_fitness_landscape_of_the_green_fluorescent_protein |
| AAV | 28 | 296,914 | https://github.com/churchlab/Deep_diversification_AAV |
| TEM | 286 | 17,238 | https://github.com/facebookresearch/esm/tree/main/examples/data |
| E4B | 102 | 91,033 | https://figshare.com/articles/dataset |
| AMIE | 341 | 6,631 | https://figshare.com/articles/dataset/Normalized_fitness_values_for_AmiE_selections/3505901/2 |
| LGK | 439 | 8,069 | https://figshare.com/articles/dataset |
| Pab1 | 75 | 36,522 | https://figshare.com/articles/dataset |
| UBE2I | 159 | 5,355 | http://dalai.mshri.on.ca/ jweile/projects/dmsData/ |
Table 5: Detailed statistics of the eight protein datasets.

Appendix B More Implementation Details

B.1 Additional Experimental Settings

We apply an annealing schedule to the KL term during the training of $P_{\theta}(x)$, following $\beta$-VAE (Higgins et al., 2016), to prevent posterior collapse. Specifically, the KL term coefficient starts from $0$ and is gradually increased to $1.0$ as training proceeds. At each iteration of the importance sampling based EM learning process, the number of samples drawn from the current $Q_{\phi}(x,z)$ is set to $10\%$ of the training data size.
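The KL annealing described above can be sketched as follows. This is a minimal illustration that assumes a linear ramp over a fixed number of warm-up steps; the text only states that the coefficient grows gradually from 0 to 1.0, so the linear shape, the `warmup_steps` parameter and both function names are assumptions.

```python
def kl_weight(step: int, warmup_steps: int) -> float:
    """KL coefficient: 0 at step 0, rising linearly to 1.0 at `warmup_steps`.

    The linear shape is an assumption; the paper only says "gradually increased".
    """
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, max(0.0, step / warmup_steps))


def annealed_vae_loss(recon_loss: float, kl: float, step: int, warmup_steps: int) -> float:
    """Negative ELBO with the KL term scaled by the annealed coefficient."""
    return recon_loss + kl_weight(step, warmup_steps) * kl
```

Early in training the loss is dominated by reconstruction, which gives the decoder time to start using the latent variable before the full KL penalty applies.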

Following Ren et al. (2022), we construct the oracle model $f(x)$ by adopting the features produced by ESM-1b (Rives et al., 2021) with dimension $1280$ and finetuning an Attention1D decoder to predict the fitness values. Since Brookes et al. (2019) state that the results are insensitive to $\lambda$ when it is set in the range of the [$50$, $100$]-th percentile of the fitness scores in the training set, we set $\lambda$ to the $50$-th percentile of the fitness values in the training data to accommodate more diversity.
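As a small illustration of the threshold choice, the sketch below computes $\lambda$ as a percentile of the training fitness values. The helper names are hypothetical, the linear-interpolation percentile is one common convention rather than a detail taken from the paper, and treating the desired set $\mathcal{S}$ as $f(x)\geq\lambda$ is an assumption.

```python
import math


def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between order statistics."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def fitness_threshold(train_fitness, q=50.0):
    """Threshold lambda as the q-th percentile of training fitness (q=50 in the text)."""
    return percentile(train_fitness, q)


def in_desired_set(fitness, lam):
    """Assumed membership test for the desired set S: fitness at or above lambda."""
    return fitness >= lam
```

A lower percentile admits more sequences into the desired set, which is the stated reason for picking the bottom of the insensitive range.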

B.2 Importance Weighted EM Learning Algorithm

Algorithm 1 Importance Sampling based Expectation-Maximization Training
Require:
    $\boldsymbol{\varepsilon}$: separately learned combinatorial structure features through MRFs
    $P_{\theta}(x;\boldsymbol{\varepsilon})$: VAE model trained on the protein sequences incorporating $\boldsymbol{\varepsilon}$
    $T$: number of iterations for importance sampling based EM learning
    $N$: number of samples at each iteration during EM learning
Ensure: Final proposal model $Q_{\phi^{(T)}}(x;\boldsymbol{\varepsilon})$
set $Q_{\phi^{(0)}}(x|z;\boldsymbol{\varepsilon})=P_{\theta}(x|z;\boldsymbol{\varepsilon})$, $Q_{\phi^{(0)}}(z|x;\boldsymbol{\varepsilon})=P_{\theta}(z|x;\boldsymbol{\varepsilon})$
for $t=0$ to $T-1$ do
    sample $N$ pairs of $(x^{(t)},z^{(t)})\sim Q_{\phi^{(t)}}(x,z;\boldsymbol{\varepsilon})$
    for minibatch in $\{(x^{(t)},z^{(t)})\}_{i=1}^{N}$ do
        Calculate expectation using Monte Carlo approximation defined in Equation 13 in E-step
        Maximize the Monte Carlo approximation to update $\phi^{(t+1)}$ in Equation 14 in M-step
    end for
end for
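
The loop structure of Algorithm 1 can be illustrated on a toy problem. The sketch below is not the authors' implementation: it replaces the VAE by a categorical proposal over a handful of items, drops the latent variable $z$ and the minibatch loop, and uses a closed-form weighted maximum-likelihood update in place of gradient steps. Equations 13 and 14 are not reproduced in this section, so the self-normalised weight $P(x)P(\mathcal{S}|x)/Q(x)$, the indicator form of $P(\mathcal{S}|x)$, the `smoothing` term and the function name are all assumptions made for illustration.

```python
import random


def importance_weighted_em(prior, fitness, lam, num_iters=5, num_samples=2000,
                           smoothing=1e-3, seed=0):
    """Toy importance-sampling EM over a small discrete space (hypothetical helper).

    prior[x]   : P(x), probability of item x under the pre-trained model
    fitness[x] : oracle score f(x); P(S|x) is assumed to be 1 if f(x) >= lam, else 0
    Returns the final categorical proposal Q as a list of probabilities.
    """
    rng = random.Random(seed)
    support = list(range(len(prior)))
    q = list(prior)  # initialise the proposal from the pre-trained model
    for _ in range(num_iters):
        # Draw N samples from the current proposal.
        xs = rng.choices(support, weights=q, k=num_samples)
        # E-step: unnormalised importance weights P(x) P(S|x) / Q(x).
        weights = [prior[x] * (1.0 if fitness[x] >= lam else 0.0) / q[x] for x in xs]
        total = sum(weights)
        if total == 0.0:
            continue  # no sample fell in S; keep the current proposal
        # M-step: weighted maximum likelihood for a categorical distribution.
        counts = [0.0] * len(prior)
        for x, w in zip(xs, weights):
            counts[x] += w / total
        # Mix in a little prior mass so that no probability collapses to zero
        # (an addition for numerical safety, not part of Algorithm 1).
        q = [(1.0 - smoothing) * c + smoothing * p for c, p in zip(counts, prior)]
    return q
```

With a uniform prior over four items and two of them above the threshold, the returned proposal places nearly all of its mass on those two items, mirroring how the proposal is pulled towards the posterior $P_{\theta}(x|\mathcal{S})$.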

Appendix C Approximate KL Divergence

C.1 Proof of the Unbiased and Low-Variance Estimator

Letting $r(x)=\frac{P_{\theta}(x|\mathcal{S})}{Q_{\phi}(x)}$ and noting that $E_{Q_{\phi}(x)}[r(x)]=\sum_{x}P_{\theta}(x|\mathcal{S})=1$, so that the term $r(x)-1$ has zero mean, we have:

$$E_{Q_{\phi}(x)}[(r(x)-1)-\log r(x)]=E_{Q_{\phi}(x)}\left[\log\frac{Q_{\phi}(x)}{P_{\theta}(x|\mathcal{S})}\right]=D_{KL}(Q_{\phi}(x)||P_{\theta}(x|\mathcal{S}))$$ (16)

Therefore, this estimator for KL divergence is unbiased.

Let $f(x)=(x-1)-\log x$. Since $f(x)$ is a convex function that achieves its minimum value at $x=1$, we have:

$$(x-1)-\log x\geq f(1)=0$$ (17)

Thus, $(r(x)-1)-\log r(x)$ is always greater than or equal to $0$. In contrast, in the original KL divergence, $\log\frac{Q_{\phi}(x)}{P_{\theta}(x|\mathcal{S})}=-\log r(x)$ would be negative for half of the samples. Therefore, the Monte Carlo estimate of $E_{Q_{\phi}(x)}[(r(x)-1)-\log r(x)]$ has lower variance than that of the original form.
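
The two estimators can be compared numerically on a small discrete example. The sketch below is illustrative only: the distributions, sample size and function names are assumptions, and the variance reduction it shows is for a case where $Q$ and $P$ are close, which is the regime discussed here.

```python
import math
import random
import statistics


def kl_sample_terms(p, q, num_samples=10000, seed=0):
    """Per-sample terms of the naive and the non-negative KL(Q||P) estimators.

    Samples are drawn from q; r = p[x] / q[x] is the density ratio.
    """
    rng = random.Random(seed)
    xs = rng.choices(range(len(q)), weights=q, k=num_samples)
    naive = [-math.log(p[x] / q[x]) for x in xs]
    low_var = [(p[x] / q[x] - 1.0) - math.log(p[x] / q[x]) for x in xs]
    return naive, low_var


def exact_kl(q, p):
    """Exact KL(Q||P) for discrete distributions given as lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


q = [0.5, 0.3, 0.2]    # proposal (toy values)
p = [0.45, 0.35, 0.2]  # target (toy values)
naive, low_var = kl_sample_terms(p, q)
print("exact   :", exact_kl(q, p))
print("naive   :", statistics.fmean(naive), "variance", statistics.variance(naive))
print("low-var :", statistics.fmean(low_var), "variance", statistics.variance(low_var))
```

Both sample means target the same exact value, while every term of the second estimator is non-negative, which removes the cancellation between positive and negative terms that inflates the variance of the first.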

C.2 Theoretical Understanding

We can prove that, when the KL divergence is acceptably small, sampling from the proposal distribution $Q_{\phi}(x)$ differs from sampling from the posterior distribution $P_{\theta}(x|\mathcal{S})$ by a bounded sampling error.

Lemma C.1.

If the KL divergence between two distributions $P$ and $Q$ is less than a small positive value $\delta$, then the sampling probability difference between $P$ and $Q$ will be bounded by $\sqrt{2\delta}$ for each sample $x$.

Proof.

Let $\delta(P,Q)$ be the total variation distance between $P$ and $Q$. By Pinsker's inequality (Csiszár & Körner, 2011), we have:

$$\frac{1}{2}\sum_{x}|P(x)-Q(x)|=\delta(P,Q)\leq\sqrt{\frac{1}{2}D_{KL}(P||Q)}<\sqrt{\frac{1}{2}\delta}$$ (18)

The first equality holds when the measurable space is discrete, as is the case in this paper. Pinsker's inequality (the second relation) is tight if and only if $P=Q$, in which case there is no difference between sampling from $P$ and $Q$.

Multiplying both sides by $2$, we get:

$$\sum_{x}|P(x)-Q(x)|<\sqrt{2\delta}$$ (19)

When $\delta$ approaches $0$, the sampling difference between $P(x)$ and $Q(x)$ becomes negligible.
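
The bound can be checked numerically. The sketch below uses toy distributions and hypothetical helper names; it evaluates the total absolute difference between two discrete distributions and compares it with $\sqrt{2\,D_{KL}(P||Q)}$.

```python
import math


def kl_divergence(p, q):
    """KL(P||Q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def l1_distance(p, q):
    """Sum over x of |P(x) - Q(x)|, i.e. twice the total variation distance."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


def pinsker_l1_bound(delta):
    """Upper bound sqrt(2 * delta) on the L1 distance when the KL is at most delta."""
    return math.sqrt(2.0 * delta)


p = [0.5, 0.3, 0.2]    # toy values
q = [0.45, 0.35, 0.2]  # toy values
print(l1_distance(p, q), "<=", pinsker_l1_bound(kl_divergence(p, q)))
```

Because total variation is symmetric, the same bound applies whichever direction of the KL divergence is measured. Plugging in the KL values reported in Section 5.1, $\delta=0.018$ (E4B) gives a bound of about $0.19$ and $\delta=0.033$ (avGFP) gives about $0.26$ on the summed absolute difference.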

Appendix D Case Study

Figure 6 illustrates the complete sequence and secondary structure analysis of our designed avGFP protein compared with Cytochrome b562 integral fusion with enhanced green fluorescent protein (EGFP). The figure shows substantial overlap between our designed protein sequence and chain B of Cytochrome b562 integral fusion with EGFP. This gives empirical evidence that the green fluorescent protein generated by our model is highly likely to be a real protein when compared with known proteins. However, whether the designed sequences can accelerate wet-lab experiments still needs more exploration, as the computational analysis cannot be fully trusted.

Figure 6: Complete sequence and secondary structure comparison between the designed sequence of avGFP and chain B of Cytochrome b562 integral fusion with enhanced green fluorescent protein. Green and blue parts represent the alpha helix and beta strand, respectively.

Appendix E Pseudo-Likelihood for Combinatorial Structure Learning

We train the Markov random fields using a pseudo-likelihood, as in CCMpred (Seemayer et al., 2014), additionally combined with Lasso and Ridge regularization. The pseudo-likelihood is given in the following equation:

$$\begin{split}\hat{P}_{L}(x)&=\log\prod_{i=1}^{L}P(x_{i}|x_{1},x_{2},...,x_{i-1},x_{i+1},...,x_{L},\varepsilon)\\ &=\sum_{i=1}^{L}\log\frac{\exp(\varepsilon_{i}(x_{i})+\sum_{j=1,j\neq i}^{L}\varepsilon_{ij}(x_{i},x_{j}))}{\sum_{c\in\mathcal{V}}\exp(\varepsilon_{i}(c)+\sum_{j=1,j\neq i}^{L}\varepsilon_{ij}(c,x_{j}))}\\ &=\sum_{i=1}^{L}\{\varepsilon_{i}(x_{i})+\sum_{j=1,j\neq i}^{L}\varepsilon_{ij}(x_{i},x_{j})-\log Z_{i}\}\\ Z_{i}&=\sum_{c\in\mathcal{V}}\exp(\varepsilon_{i}(c)+\sum_{j=1,j\neq i}^{L}\varepsilon_{ij}(c,x_{j}))\end{split}$$ (20)

where $\mathcal{V}$ denotes the vocabulary of $20$ amino acids.
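
Equation 20 can be evaluated directly for a single sequence. The sketch below is a plain-Python illustration, not the CCMpred implementation: the nested-list parameter layout and the function name are assumptions, and the Lasso and Ridge penalties are omitted because their strengths are not given here.

```python
import math


def log_pseudo_likelihood(seq, eps_single, eps_pair):
    """Log pseudo-likelihood of one sequence under a pairwise MRF (Equation 20).

    seq                 : list of amino-acid indices, each in [0, V)
    eps_single[i][c]    : single-site potential for residue c at position i
    eps_pair[i][j][c][d]: pairwise potential for residue c at i and residue d at j
    (This parameter layout is an assumption made for the sketch.)
    """
    length = len(seq)
    vocab = len(eps_single[0])
    total = 0.0
    for i in range(length):
        # Score of every candidate residue c at position i, neighbours held fixed.
        scores = [
            eps_single[i][c]
            + sum(eps_pair[i][j][c][seq[j]] for j in range(length) if j != i)
            for c in range(vocab)
        ]
        # log Z_i computed with the log-sum-exp trick for numerical stability.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += scores[seq[i]] - log_z
    return total
```

With all potentials set to zero, each conditional distribution is uniform over the vocabulary, so a sequence of length $L$ scores $-L\log 20$; this is a convenient sanity check before fitting the parameters.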