AdaNovo: Adaptive De Novo Peptide Sequencing with
Conditional Mutual Information

Jun Xia Shaorong Chen **gbo Zhou Tianze Ling Wenjie Du Sizhe Liu Stan Z. Li

Abstract

Tandem mass spectrometry has played a pivotal role in advancing proteomics, enabling the analysis of protein composition in biological samples. Despite the development of various deep learning methods for identifying amino acid sequences (peptides) responsible for observed spectra, challenges persist in de novo peptide sequencing. Firstly, prior methods struggle to identify amino acids with post-translational modifications (PTMs) due to their lower frequency in training data compared to canonical amino acids, further resulting in decreased peptide-level identification precision. Secondly, diverse types of noise and missing peaks in mass spectra reduce the reliability of training data (peptide-spectrum matches, PSMs). To address these challenges, we propose AdaNovo, a novel framework that calculates conditional mutual information (CMI) between the spectrum and each amino acid/peptide, using CMI for adaptive model training. Extensive experiments demonstrate AdaNovo’s state-of-the-art performance on a 9-species benchmark, where the peptides in the training set are almost completely disjoint from the peptides of the test sets. Moreover, AdaNovo excels in identifying amino acids with PTMs and exhibits robustness against data noise. The supplementary materials contain the official code.

Machine Learning, ICML

1 Introduction

Refer to caption — Figure 1: Comparisons of various *de novo* sequencing methods in terms of amino acid-level precision. ‘G’ and ‘A’ denote Glycine and Alanine, respectively. Both of them are canonical amino acids. ‘M(+15.99)’ and ‘Q(+.98)’ represent oxidation of methionine and deamidation of glutamin, both of which are modified amino acids (the amino acids with PTMs). The results are for the human dataset, which is one of 9-species benchmark (Tran et al., 2017).

Tandem mass spectrometry is a high-throughput tool to identify and quantify proteins in biological samples. However, the precise determination of protein content from observed mass spectra at scale remains a formidable challenge. Central to this challenge is the spectrum identification problem, wherein we are presented with an observed mass spectrum and the corresponding precursor information (mass and charge of the peptide), and our task is to predict the peptide (amino acid sequence) responsible for generating the spectrum. Currently, spectrum identification is most commonly solved using database search, where the observed spectra from the mass spectrometer are compared to theoretical spectra generated by database of known protein sequences. Software algorithms match experimental spectra to theoretical spectra from the database, report the best-scoring peptide-spectrum match (PSM) per spectrum.

However, database search relies on a pre-defined database, preventing the identification of unexpected peptide sequences, such as those originated from genetic variation. Additionally, a database cannot be leveraged for the analysis of some types of immunopeptidomics data (VanDuijn et al., 2017), in antibody sequencing (Tran et al., 2017) or in vaccine development (Mayer & Impens, 2021). Also, the task of constructing a precise database for metaproteomic analyses, including those related to the human microbiome or environmental samples, is deemed impossible (Muth et al., 2013). All of these limitations necessitate de novo peptide sequencing from the observed mass spectra without using prior knowledge in the form of a peptide sequence database.

Since the early 1990s, de novo methods based on the graph theory (Bartels, 1990; Frank, 2009), Hidden Markov Model (Fischer et al., 2005), or dynamic programming (Dančík et al., 1999; Ma et al., 2003; Frank & Pevzner, 2005) were developed to score peptide sequences against observed spectra. With the rise of deep learning, some researchers train the deep neural networks using PSMs (Tran et al., 2017; Qiao et al., 2021; Yilmaz et al., 2022), where they regard the spectra and matching peptides as the inputs and labels, respectively. And then, the trained models are expected to identify never-before-seen peptides. Although these methods have achieved notable progress, as shown in Figure 1, we observe that they struggle to identify the amino acids with PTMs, further leading to low amino acid-level and peptide-level precision. However, the identification of amino acids with PTMs holds significant biological importance because PTMs plays a pivotal role in elucidating protein function and studying disease mechanisms (Deribe et al., 2010).

On the other hand, some of the expected peaks in mass spectra may be missing due to instrument malfunction or multiple cleavage events occurring on the same peptide, and some additional peaks may undesirably appear in the spectrum, created by instrument noise or non-peptide molecules in the biological samples. All of these make the spectra and peptides labels for training being poorly matched.

To address above issues, we propose a new framework, AdaNovo, to calculate the conditional mutual information (CMI) between the spectrum and each amino acid in the matching peptide. This can measures the importance of different target amino acids by their dependence on the source spectrum. Based on the amino acid-level CMI, we obtain the PSM-level CMI between the spectrum and the entire peptide to measure the matching level of each spectrum-peptide pair in the training PSM data. Subsequently, we design an adaptive training approach based on both the amino acid- and PSM-level CMI, which adaptively re-weights the training losses of the corresponding amino acids.

We conduct the training and evaluation of our model on the widely-used 9-species datasets and observe that AdaNovo outperforms state-of-the-art methods in predicting never-before-seen peptide sequences and demonstrate significantly higher precision in identifying the amino acids with PTMs.

2 Background

Proteomics research focuses on large-scale studies to characterize the proteome, the entire set of proteins, in a living organism. Tandem mass spectrometry (MS), as the mainstream high-throughput technique to identify protein sequences, plays an essential role in proteomics research. As shown in Figure 2, in a standard identification workflow of shotgun proteomics (Wolters et al., 2001), proteins undergo initial digestion by enzymes, yielding a mixture of peptides. A tandem mass spectrometer measures mass-to-charge (m/z) ratios of each peptides in a two-scan process. During the first scan (MS1), the mass-to-charge (m/z) ratios of intact peptides, also known as precursors, are measured. Following this, peptides undergo fragmentation, and the resulting fragments are analyzed in a subsequent scan. In the second scan (MS2), peptides are fragmented at random locations along the peptide backbone, generating peaks corresponding to prefixes (b-ions) and suffixes (y-ions) of the peptide, each associated with a specific charge state. Consequently, the MS2 spectrum comprises a collection of peaks. Each peak is characterized by an m/z value and an associated intensity. The intensity, though unitless, is directly proportional to the number of ions contributing to the observed peak. The m/z value is measured with remarkable precision, while the intensity is measured with comparatively lower precision. The core of the above pipeline is the spectrum identification problem, where we aim to predict the peptide sequence responsible for generating the observed MS2 spectrum and the corresponding precursor information (mass and charge of the peptide)

3 Related Work

Early de novo sequencing methods used dynamic programming to score peptide sequences against each observed spectrum. PEAKS (Ma et al., 2003) uses a sophisticated dynamic programming algorithm to compute the best sequences whose fragment ions can best interpret the peaks in the MS2 spectrum. Graph-based algorithms, such as Sherenga (Dančík et al., 1999) and pNovo (Taylor & Johnson, 2001), first translated the spectrum into a “spectrum graph” where nodes in the graph correspond to peaks in the spectrum and two nodes are connected by an edge if the mass difference between the two corresponding peaks is equal to the mass of an amino acid. The de novo peptide sequencing problem is thus cast as finding the path in the resulting graph.

Recently, machine learning (Fischer et al., 2005; Frank & Pevzner, 2005) have been introduced into de novo peptide sequencing and significantly improved the accuracy. The PepNovo (Frank & Pevzner, 2005) algorithm present a novel scoring method, which uses a probabilistic network whose structure reflects the chemical and physical rules that govern the peptide fragmentation. The Novor algorithm (Ma, 2015) achieved improved performance by using large decision trees as score function in a dynamic programming algorithm.

The first deep neural network method for de novo peptide sequencing, DeepNovo (Tran et al., 2017), treats the de novo sequencing task as an image caption task and combines CNN with LSTM to predict the sequence. SMSNet (Karunratanakul et al., 2019) is a hybrid approach which leverages a multi-step Sequence-Mask-Search strategy and adopts the encoder-decoder architecture, basically formulating peptide sequencing as a spectra-to-peptide language translation problem. PointNovo (Qiao et al., 2021) adopts an order invariant network structure for peptide sequencing, which focuses specifically on high-resolution mass spectrometry data. Similar to SMSNet (Karunratanakul et al., 2019), Casanovo (Yilmaz et al., 2022) frames the problem as a language translation problem and employs a transformer framework that has been widely used to process and predict sequences.

Although de novo methods have achieved notable progress, we observe that they have difficulty in identifying the amino acids with PTMs because these amino acids occur much less frequently in datasets compared to other common amino acids, making it challenging for the model to learn. Additionally, mass spectrometry data contains a significant amount of noise typically originated from the electronic fluctuations in the instruments and other molecules in the biological samples. In other words, there are plenty of noise peaks mixing together with the real ions. All of these make the peptides labels being less reliable. These issues limit the predictive accuracy and widespread use of de novo methods. The AdaNovo model proposed in this paper effectively alleviates both of them.

4 Methods

4.1 Task Formulation

Formally, we denote mass spectrum peaks as $\mathbf{x}=\left\{\left(m_{i},I_{i}\right)\right\}_{i=1}^{M}$ , where each peak $\left(m_{i},I_{i}\right)$ forms a 2-tuple representing the m/z and intensity value, and $M$ is the number of peaks that can be varied across different mass spectra. Also, we denote the precursor as $\mathbf{z}=\left\{\left(m_{prec},c_{prec}\right)\right\}$ , consisting of the total mass $m_{prec}\in\mathbb{R}$ and charge state $c_{prec}\in\left\{1,2,\dots,10\right\}$ of the spectrum. Additionally, we represent the peptide sequence as $\mathbf{y}=\left\{\left(y_{1},y_{2},\dots,y_{N}\right)\right\}$ , where $N$ is the peptide length and can be varied across different peptides. $\mathbf{y}_{<j}$ means the previous amino acids sequence appearing before the index $j$ in the peptide $\mathbf{y}$ . The de novo peptide sequencing models are designed to predict the probability of each amino acid:

P(\mathbf{y}\mid\mathbf{x},\mathbf{z};\theta)=\prod_{j=1}^{N}p\left(y_{j}\mid% \mathbf{y}_{<j},\mathbf{x},\mathbf{z};\theta\right),

(1)

where $j$ is the index of each amino acid position in the peptide sequence and $\theta$ is the model parameter. In general, previous models (Tran et al., 2017; Yilmaz et al., 2022; Qiao et al., 2021) are optimized using the cross-entropy (CE) loss during training:

\mathcal{L}_{\mathrm{CE}}(\theta)=-\sum_{j=1}^{N}\log p\left(y_{j}\mid\mathbf{% y}_{<j},\mathbf{x},\mathbf{z};\theta\right).

(2)

During inference, the de novo sequencing models typically predict the probabilities of target amino acids in an autoregressive manner and generate hypotheses using heuristic search algorithms like beam search (SCIENCE, 1977).

4.2 Model Architectures

As shown in Figure 3, AdaNovo consists of a mass spectrum encoder (MS Encoder) and two peptide decoders (Peptide Decoder #1 and Peptide Decoder #2). All of these models are built on the Transformer (Vaswani et al., 2017).

In order to feed the MS peaks to MS Encoder, we regard each mass spectrum peak $\left(m_{i},I_{i}\right)$ as a ‘word’ in natural language processing and obtain its embedding by individually embed its m/z value and intensity value before combining them through summation. Specifically, we employ a fixed, sinusoidal embedding (Vaswani et al., 2017) to project m/z value $m_{i}$ to a $d$ dimensional vector $f_{i}$ ,

f_{i}=\left\{\begin{array}[]{ll}\sin\left(m_{i}/\left(\frac{\lambda_{\max}}{% \lambda_{\min}}\left(\frac{\lambda_{\min}}{2\pi}\right)^{2i/d}\right)\right),&% \text{ for }i\leq d/2\\ \cos\left(m_{i}/\left(\frac{\lambda_{\max}}{\lambda_{\min}}\left(\frac{\lambda% _{\min}}{2\pi}\right)^{2i/d}\right)\right),&\text{ for }i>d/2\end{array}\right.

(3)

where $\lambda_{\max}$ = 10,000 and $\lambda_{\min}$ = 0.001. The input embeddings furnish a detailed portrayal of high-precision m/z information. Analogous to the consideration of relative positions in the initial transformer model (Vaswani et al., 2017), these embeddings potentially facilitate the model’s attention to m/z variations between peaks. Such attention to detail is crucial for the accurate identification of amino acids within the peptide sequence. The intensity, measured with less precision compared to the m/z value, undergoes embedding by projection into d dimensions through a linear layer. Subsequently, the m/z and intensity embeddings are amalgamated through summation, resulting in the generation of the input peak embedding. Kindly note that the mass spectrum peaks are permutation invariant, i.e., the order in which the peaks appear in the spectrum does not affect the identification results. Therefore, it is unnecessary to account for an extra positional embedding (Ke et al., 2021) like natural language processing when feed the peaks into transformer.

Similarly, for the precursor $\mathbf{z}=\left\{\left(m_{prec},c_{prec}\right)\right\}$ to be fed into the Peptide Decoder #1, we employ the same sinusoidal embedding for $m_{prec}$ as the m/z above and an embedding layer to embed $c_{prec}$ . Finally, we obtain the input precursor embedding by summarizing the above 2 embeddings. As for the peptide sequence, the amino acid vocabulary encompasses the 20 canonical amino acids, along with post-translationally modified versions of three among them (oxidation of methionine and deamidation of asparagine or glutamine). Additionally, a special stop token signals the end of decoding, resulting in a total of 24 tokens. Peptide Decoder #1 and Peptide Decoder #2 undergo autoregressive training, wherein they receive the preceding amino acid sequence $\mathbf{y}_{<j}$ prior to amino acid $j$ during the prediction process for the identity of amino acid $j$ . However, different from Peptide Decoder #1, Peptide Decoder #2 exclusively utilizes $\mathbf{y}_{<j}$ as input because we want to calculate the conditional probability $p(y_{j}\mid\mathbf{y}_{<j})$ , which is the prerequisite for calculating the conditional mutual information between the mass spectrum ( $\mathbf{x}$ and $\mathbf{z}$ ) and amino acid $y_{j}$ .

4.3 Training Strategies

The training strategies consist of amino acid-level (§ 4.3.1) and PSM-level adaptive training (§ 4.3.2), which we elaborate on below.

4.3.1 Amino acid-level Adaptive Training.

As mentioned above, previous de novo sequencing models struggle to identify amino acids with PTMs because they occur much less frequently in datasets compared to other canonical amino acids, making it challenging for the model to learn. Therefore, we expect to emphasize the amino acids with PTMs to improve the models’ ability in identifying them. This resembles the up-sampling methods in long-tailed classification where researchers emphasize samples from the tail class during training (Zhang et al., 2023; Ren et al., 2018). We also compare with these alternative methods in Section 5.7. On the other hand, when predict the amino acid with PTMs $y_{j}$ , we should rely more on mass spectrometry data (x and z) and less on the historical predictions of previous amino acids $y_{<j}$ because the mass shifts resulting from PTMs are only manifested in the mass spectrometry data. This motivates us to measure the mutual information (MI) between each target amino acid and mass spectrum conditioned on previous amino acids, i.e., conditional mutual information (CMI) (Wyner, 1978) between each target amino acid and mass spectrum,

	$\displaystyle\operatorname{CMI}(\mathbf{x},\mathbf{z};y_{j})$	$\displaystyle=\operatorname{MI}\left(\mathbf{x},\mathbf{z};y_{j}\mid\mathbf{y}% _{<j}\right)$
		$\displaystyle=\log\left(\frac{p\left(y_{j},\mathbf{x},\mathbf{z}\mid\mathbf{y}% _{<j}\right)}{p\left(y_{j}\mid\mathbf{y}_{<j}\right)\cdot p\left(\mathbf{x},% \mathbf{z}\mid\mathbf{y}_{<j}\right)}\right).$

However, it is computationally impractical to calculate the $\operatorname{CMI}$ with above definition. To address this, we proceed to enhance its computational tractability by decomposing the conditional joint distribution,

	$\displaystyle\operatorname{CMI}(\mathbf{x},\mathbf{z};y_{j})$	$\displaystyle=\operatorname{MI}\left(\mathbf{x},\mathbf{z};y_{j}\mid\mathbf{y}% _{<j}\right)$
		$\displaystyle=\log\left(\frac{p\left(y_{j},\mathbf{x},\mathbf{z}\mid\mathbf{y}% _{<j}\right)}{p\left(y_{j}\mid\mathbf{y}_{<j}\right)\cdot p\left(\mathbf{x},% \mathbf{z}\mid\mathbf{y}_{<j}\right)}\right).$
		$\displaystyle=\log\left(\frac{p\left(y_{j}\mid\mathbf{x},\mathbf{z},\mathbf{y}% _{<j}\right)\cdot p\left(\mathbf{x},\mathbf{z}\mid\mathbf{y}_{<j}\right)}{p% \left(y_{j}\mid\mathbf{y}_{<j}\right)\cdot p\left(\mathbf{x},\mathbf{z}\mid% \mathbf{y}_{<j}\right)}\right)$
		$\displaystyle=\log\left(\frac{p\left(y_{j}\mid\mathbf{x},\mathbf{z},\mathbf{y}% _{<j}\right)}{p\left(y_{j}\mid\mathbf{y}_{<j}\right)}\right).$

In this way, the $\operatorname{CMI}(\mathbf{x},\mathbf{z};y_{j})$ can be obtained with $p\left(y_{j}\mid\mathbf{x},\mathbf{z},\mathbf{y}_{<j}\right)$ and $p\left(y_{j}\mid\mathbf{y}_{<j}\right)$ , which are the output of the Peptide Decoder #1 and Peptide Decoder #2, respectively. Moreover, to reduce the variances and stabilize the distribution of the amino acid-level CMI in each peptide, we normalize the CMI values in the peptide and then scale the normalized values to obtain the amino acid-level training weight $w_{j}^{aa}$ for $y_{j}$ ,

w_{j}^{aa}=\max\left\{0,s_{1}\cdot\frac{\operatorname{CMI}\left(\mathbf{x},% \mathbf{z};y_{j}\right)-\mu^{aa}}{\sigma^{aa}}+1\right\},

(4)

where $\mu^{aa}$ and $\sigma^{aa}$ are the mean values and the standard deviations of all the CMI values in each peptide, and $s_{1}$ is a hyperparameter that controls the effect of amino acid-level adaptive training.

4.3.2 PSM-level Adaptive Training.

As we introduced before, the training PSMs samples are of different matching levels because of the signal noise and missing peaks. To alleviate the negative effect of poorly matched mass spectrometry and peptide pairs and encourage the well-matched pairs, we adopt the mutual information between them as a measure of matching levels. Formally,

$\displaystyle\operatorname{MI}(\mathbf{x},\mathbf{z};\mathbf{y})$	$\displaystyle=\frac{1}{\|\mathbf{y}\|}\log\left(\frac{p(\mathbf{x},\mathbf{z},% \mathbf{y})}{p(\mathbf{x},\mathbf{z})\cdot p(\mathbf{y})}\right)$	(5)
	$\displaystyle=\frac{1}{\|\mathbf{y}\|}\log\left(\frac{p(\mathbf{y}\mid\mathbf{x}% ,\mathbf{z})}{p(\mathbf{y})}\right)$
	$\displaystyle=\frac{1}{\|\mathbf{y}\|}\log\left(\frac{\prod_{j}p\left(y_{j}\mid% \mathbf{x},\mathbf{z},\mathbf{y}_{<j}\right)}{\prod_{j}p\left(y_{j}\mid\mathbf% {y}_{<j}\right)}\right)$
	$\displaystyle=\frac{1}{\|\mathbf{y}\|}\sum_{j}\log\left(\frac{p\left(y_{j}\mid% \mathbf{x},\mathbf{z},\mathbf{y}_{<j}\right)}{p\left(y_{j}\mid\mathbf{y}_{<j}% \right)}\right)$
	$\displaystyle=\frac{1}{\|\mathbf{y}\|}\sum_{j}\operatorname{CMI}(\mathbf{x},% \mathbf{z};y_{j}).$

In other words, the mutual information can be derived by averaging all the amino acid-level $\operatorname{CMI}(\mathbf{x},\mathbf{z};y_{j})$ over the peptide. Similarly, we normalize all the $\operatorname{MI}$ values across all the PSMs in each mini-batch and then scale the normalized values to obtain the PSM-level training weight $w^{psm}$ ,

w^{psm}=\max\left\{0,s_{2}\cdot\frac{\operatorname{MI}(\mathbf{x},\mathbf{z};% \mathbf{y})-\mu^{psm}}{\sigma^{psm}}+1\right\},

(6)

where $\mu^{psm}$ and $\sigma^{psm}$ are the mean values and the standard deviations of the MI values of all the PSMs in each minibatch, and $s_{2}$ is a hyperparameter that controls the effect of PSM-level adaptive training.

4.3.3 Adaptive Training Loss

In our adaptive training method, we re-weight each target amino acid $y_{j}$ with the following loss,

\mathcal{L}_{1}(\theta_{1})=-\sum_{j=1}^{N}w_{j}\log p\left(y_{j}\mid\mathbf{y% }_{<j},\mathbf{x},\mathbf{z};\theta_{1}\right),

(7)

where $\theta_{1}$ are the parameters of MS Encoder and Peptide Decoder #1, and

w_{j}=w_{j}^{aa}\cdot w^{psm}.

(8)

Additionally, Peptide Decoder #2 is trained with the following loss,

\mathcal{L}_{2}(\theta_{2})=-\sum_{j=1}^{N}\log p\left(y_{j}\mid\mathbf{y}_{<j% };\theta_{2}\right),

(9)

where $\theta_{2}$ are the parameters of Peptide Decoder #2. The overall training loss is,

\mathcal{L}_{\tt{Ada}}(\theta_{1},\theta_{2})=\mathcal{L}_{1}(\theta_{1})+% \mathcal{L}_{2}(\theta_{2}).

(10)

4.4 Inference

In the inference phase, we only use MS Encoder and Peptide Decoder #1 to predict the peptide. Specifically, we feed the mass spectrometry data to the encoder MS Encoder and the decoder Peptide Decoder #1 predicts the highest-scoring amino acid for each peptide sequence position. The decoder is then fed its preceding amino acid predictions at each decoding step. The decoding process concludes upon predicting the stop token or reaching the predefined maximum peptide length of $\ell=100$ amino acids. We discuss the computational overhead in Section 5.9.

4.5 Precursor m/z filtering

In de novo peptide sequencing, it’s crucial that the relative difference between the total mass of the predicted peptide ( $m_{\text{pred}}$ ) and the observed precursor mass ( $m_{\text{prec}}$ ) remains below a specified threshold value $\epsilon$ for the predicted sequence to be considered plausible. This requirement is expressed as $\Delta m_{ppm}=\frac{\left|m_{\text{prec}}-m_{\text{pred}}\right|\times 10^{6}% }{m_{\text{prec}}}<\epsilon$ . To ensure adherence to this constraint, we not only incorporate precursor information into the model’s learning process but also filter out peptide predictions that don’t meet this criterion. The threshold value $\epsilon$ is determined based on the precursor mass error tolerance used in the database search to establish ground truth peptide sequences for the test data.

5 Experiments

Table 1: Empirical comparison of de novo sequencing models. The table lists the peptide-level and amino acid-level precision of three competing models on all nine benchmark cross-validation folds. Each fold’s test set contains spectra from a single species. Kindly note that peptide-level performance measures are the primary quantifier of the model’s practical utility because the goal is to assign a complete peptide sequence to each spectrum. The best and the second best results are highlighted bold and underlined, respectively.

Species	Peptide-level precision				Amino acid-level precision
Species	DeepNovo	PointNovo	Casanovo	AdaNovo	DeepNovo	PointNovo	Casanovo	AdaNovo
Mouse	0.286	0.355	0.449	0.467	0.623	0.626	0.612	0.646
Human	0.293	0.351	0.343	0.373	0.610	0.606	0.585	0.618
Yeast	0.462	0.534	0.568	0.593	0.750	0.779	0.753	0.793
M. mazei	0.422	0.478	0.474	0.496	0.694	0.712	0.686	0.728
Honeybee	0.330	0.396	0.422	0.431	0.630	0.644	0.640	0.650
Tomato	0.454	0.513	0.463	0.530	0.731	0.733	0.720	0.740
Rice bean	0.436	0.511	0.549	0.546	0.679	0.730	0.727	0.719
Bacillus	0.449	0.518	0.513	0.528	0.742	0.768	0.718	0.739
Clam bacteria	0.253	0.298	0.347	0.372	0.602	0.589	0.617	0.642

5.1 Datasets

To assess AdaNovo’s performance, we employ the nine-species benchmark initially introduced by DeepNovo (Tran et al., 2017). This dataset amalgamates approximately 1.5 million mass spectra from nine distinct experiments, each employing the same instrument to scrutinize peptides from diverse species. Each spectrum is associated with a ground-truth peptide sequence, which comes from database search identification with a standard false discovery rate (FDR) set at 1%. Following the methodology of previous works (Tran et al., 2017; Qiao et al., 2021), we adopt a leave-one-out cross-validation framework. This entails training a model on eight species and testing it on the species held out for each of the nine species. The training set is split 90/10 for training and validation. This framework facilitates the testing of the model on peptide samples that have never been encountered before. Cross-species testing is of paramount importance for de novo sequencing models since practical applications often demand these models to excel in handling mass spectra featuring peptide sequences that have never been observed before.

5.2 Evaluation Metrics

In our assessment of model predictions, we employ precision calculated at both the amino acid and peptide levels, following methodologies presented by previous works (Ma et al., 2003; Tran et al., 2017). These precision metrics serve as performance measures, gauging the quality of a given model’s predictions based on coverage over the test set. For each spectrum, we compare the predicted sequence to the ground truth peptide obtained from the database search.

Consistent with DeepNovo (Tran et al., 2017), our approach to amino acid-level measures begins by calculating the number $N_{\text{match}}^{a}$ of matched amino acid predictions. These are defined as predicted amino acids that (1) exhibit a mass difference of $<0.1\mathrm{Da}$ from the corresponding ground truth amino acid and (2) have either a prefix or suffix with a mass difference of no more than $0.5\mathrm{Da}$ from the corresponding amino acid sequence in the ground truth peptide. Amino acid-level precision is then defined as $N_{\text{match }}^{a}/N_{\text{pred }}^{a}$ , where $N_{\text{pred}}^{a}$ represents the number of predicted amino acids. Similarly, amino acids with PTMs identification precision can be formulated as $N_{\text{match }}^{ptm}/N_{\text{pred }}^{ptm}$ , where $N_{\text{match }}^{ptm}$ and $N_{\text{pred}}^{ptm}$ denote the number of matched amino acids with PTMs and predicted amino acids with PTMs, respectively.

For peptide predictions, a predicted peptide is deemed a correct match only if all of its amino acids are matched. In a collection of $N_{\text{orig }}^{p}$ spectra, if our model provides predictions for a subset of $N_{\text{pred }}^{p}$ and accurately predicts $N_{\text{match }}^{p}$ peptides, coverage is defined as $N_{\text{pred }}^{p}/N_{\text{orig }}^{p}$ . Peptide-level precision is calculated as $N_{\text{match }}^{p}/N_{\text{pred }}^{p}$ .

To construct a precision-coverage curve, predictions are sorted based on the confidence score provided by the model. Amino acid-level confidence scores are derived by applying a softmax to the transformer decoder’s output, serving as a proxy for the probability of each predicted amino acid occurring at a specific position along the peptide sequence. AdaNovo provides amino acid-level confidence scores directly, and we utilize the mean score across all amino acids as a peptide-level confidence score.

5.3 Baselines

We compare AdaNovo with previous de novo peptide sequencing methods including DeepNovo (Tran et al., 2017), Casanovo (Yilmaz et al., 2022) and PointNovo (Qiao et al., 2021). We reproduce the results of Casanovo with the settings and hypermeters of the original paper and report the published results of DeepNovo and PointNovo as their pre-trained weights are unavailable.

5.4 Experimental Settings

The models in our AdaNovo are with 9 layers, embedding size $d$ = 512, and 8 attention heads. We train the models with a batchsize of 32 PSMs and $10^{-5}$ weight decay. The learning rate is linearly increased from zero to $5\times 10^{-4}$ in 100 $k$ warm-up steps, followed by a cosine shaped decay. We train the models for 30 epochs and pick the model weights from the epoch with the lowest validation loss for testing. The hypermeters $s_{1}$ and $s_{2}$ are tuned within the set $\left\{0.05,0.1,0.3\right\}$ .

5.5 Main Results

AdaNovo outperforms state-of-the-art methods. As can be observed in Table 1, AdaNovo outperforms competitive models on most (8 out of 9) species in peptide-level precision compared to DeepNovo, PointNovo and CasaNovo. The peptide-level precision coverage curves (Figure 4) show that AdaNovo consistently outperforms Casanovo over a range of peptide confidence thresholds. This trend is also reflected by the area under the curve (AUC) metric. At amino acid-level, AdaNovo outperforms baselines on most datasets. As shown in Figure 4, the point on the AdaNovo curve corresponding to the filter lies above the casanovo precision-coverage curve, and Adanovo’s AUC consistently exceeds Casanovo’s.

Table 2: Empirical comparison of de novo sequencing models in terms of identifying amino acids with PTMs. The best and the second best results are highlighted bold and underlined, respectively.

Species	PTMs precision
Species	DeepNovo	PointNovo	Casanovo	AdaNovo
Human	0.369	0.415	0.398	0.483
Rice bean	0.644	0.653	0.646	0.689
Clam bacteria	0.510	0.526	0.508	0.575
Bacillus	0.483	0.524	0.470	0.565

AdaNovo can accurately identify the amino acids with PTMs. As demonstrated in Table 2, we compare AdaNovo with other methods in terms of identifying amino acids with PTMs because AdaNovo is designed to accurately identify the amino acids with PTMs. The results in the table indicate that AdaNovo exceeds other competitors by significant margins in identifying amino acids with PTMs, verifying the effectiveness of the amino acid-level adaptive training strategy.

5.6 Ablation Study

Table 3: Ablations on amino acid-level (AA-level) and peptide-level adaptive training strategies. The results are for the Human test set, which is one of 9-species benchmark (Tran et al., 2017).

Model	AA. Prec.	Peptide Prec.	PTMs Prec.
Casanovo	0.585	0.343	0.300
AdaNovo (w/o PSM-level MI)	0.607	0.360	0.478
AdaNovo (w/o AA-level CMI)	0.594	0.349	0.314
AdaNovo	0.618	0.373	0.483

Ablations on amino acid-level and peptide-level adaptive training strategies. To investigate the influence of the amino acid-level and peptide-level adaptive training strategies, we remove each of them from AdaNovo. The results shown in Table 3 indicate that both modules are necessary and effective for the AdaNovo model. Moreover, when we remove the AA-level training strategy in AdaNovo, the precision of the amino acids with PTMs identification drops significantly because the amino acid-level training strategy is designed for identifying amino acids with PTMs. Additionally, the PSM-level training strategy is designed for robustness against data noise, which we verify via the following experiments.

Table 4: Models’ Performance on mass spectrum dataset with synthetic noise. The results are for the Clam bacteria test set, which is one of 9-species benchmark (Tran et al., 2017).

Model	AA. Prec.	Peptide Prec.
CasaNovo	0.582	0.297
AdaNovo (w/o PSM-level MI)	0.586	0.311
AdaNovo (w/o AA-level CMI)	0.614	0.335
AdaNovo	0.621	0.342

Performance on mass spectra with synthetic noise. To verify the effectiveness of the PSM-level adaptive training strategy, we randomly choose 20% spectrum in the training datasets, and add synthetic noise peaks or remove original peaks with higher intensity values. We report the results in Table 4, from which we can observe that the performance would degrade sharply when we remove the PSM-level training strategy. This indicates that PSM-level adaptive training strategy can enhance models’ robustness against data noise in mass spectrum.

5.7 Comparisons with Alternative Methods for identifying amino acids with PTMs

Table 5: Comparisons with alternative methods in terms of identifying amino acids with PTMs. All results are for the yeast test set, which is one of 9-species benchmark (Tran et al., 2017).

Model	AA. Prec.	Peptide Prec.
Casanovo	0.753	0.568
+ Re-weight	0.762	0.576
+ Focal loss	0.745	0.543
AdaNovo (w/o PSM-level MI)	0.784	0.582
AdaNovo	0.793	0.593

In this section, we show the performance of AdaNovo only with amino acid-level loss (denoted as ‘ AdaNovo w/o PSM-level MI’) and compare to some alternative methods in terms of identifying amino acids with PTMs. The first alternative is to re-weight each amino acid $y_{j}$ with the following function,

w_{j}=\frac{N_{total}}{N_{y_{j}}},

(11)

where $N_{total}$ and $N_{y_{j}}$ represent the total number of amino acids and the number of amino acids in the $y_{j}$ category in the dataset, respectively. The second alternative is the focal loss (Lin et al., 2017), we replace the cross entropy loss of Casanovo (Yilmaz et al., 2022) with the focal loss,

\mathcal{L}=-(1-\alpha p(y_{j}\mid\mathbf{x},\mathbf{z},\mathbf{y}_{<j}))^{% \gamma}\log p(y_{j}\mid\mathbf{x},\mathbf{z},\mathbf{y}_{<j}),

where $\alpha$ and $\gamma$ are hyperparameters to adjust the loss weight. The results shown in Table 5 indicate that both AdaNovo and the first alternative can help improve Casanovo’s ability. Additionally, AdaNovo outperforms the alternatives by a notable margin probably because the training and testing datasets are derived from different species, there exists a significant difference in the distribution of PTMs quantities. Also, AdaNovo is inspired by the domain knowledge that the mass shift of PTMs only manifests in the mass spectra, thus shows superiority over the re-weighting methods in long-tailed classification.

5.8 Sensitivity Analysis

The effects of two hyperparameters s1 and s2, which determines the influence of amino acid-level and PSM-level training strategy can be seen in Appendix 5.

5.9 Costs of Computing and Storage

The comparison between AdaNovo and Casanovo regarding model parameters and runtime can be found in Appendix 6.

6 Conclusion and Future Work

In this paper, we discern challenges in existing methods related to identification of the amino acids with PTMs, exacerbated by spectrum data noise stemming from instrument malfunctions and contaminants. These challenges contribute to reduced precision in identification. To address these issues, we introduce a novel approach involving the calculation of conditional mutual information between the spectrum and each amino acid, followed by a re-weighting of each amino acid. Extensive experiments on widely-utilized 9-species datasets affirm that AdaNovo surpasses previous de novo sequencing methods, showcasing superior performance in both amino acid- and peptide-level precision. Notably, AdaNovo exhibits a distinct advantage in identifying amino acids with PTMs. In the future, we plan to train the AdaNovo model on more extensive PSM data, positioning it as a foundational model for mass spectrum-based proteomics.

7 Impact Statements

This paper presents work whose goal is to advance the field of machine learning for protein sequencing. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Bartels (1990) Bartels, C. Fast algorithm for peptide sequencing by mass spectroscopy. Biomedical & environmental mass spectrometry, 19(6):363–368, 1990.
Dančík et al. (1999) Dančík, V., Addona, T. A., Clauser, K. R., Vath, J. E., and Pevzner, P. A. De novo peptide sequencing via tandem mass spectrometry. Journal of computational biology, 6(3-4):327–342, 1999.
Deribe et al. (2010) Deribe, Y. L., Pawson, T., and Dikic, I. Post-translational modifications in signal integration. Nature structural & molecular biology, 17(6):666–672, 2010.
Fischer et al. (2005) Fischer, B., Roth, V., Roos, F., Grossmann, J., Baginsky, S., Widmayer, P., Gruissem, W., and Buhmann, J. M. Novohmm: a hidden markov model for de novo peptide sequencing. Analytical chemistry, 77(22):7265–7273, 2005.
Frank & Pevzner (2005) Frank, A. and Pevzner, P. Pepnovo: de novo peptide sequencing via probabilistic network modeling. Analytical chemistry, 77(4):964–973, 2005.
Frank (2009) Frank, A. M. Predicting intensity ranks of peptide fragment ions. Journal of proteome research, 8(5):2226–2240, 2009.
Karunratanakul et al. (2019) Karunratanakul, K., Tang, H.-Y., Speicher, D. W., Chuangsuwanich, E., and Sriswasdi, S. Uncovering thousands of new peptides with sequence-mask-search hybrid de novo peptide sequencing framework. Molecular & Cellular Proteomics, 18(12):2478–2491, 2019.
Ke et al. (2021) Ke, G., He, D., and Liu, T.-Y. Rethinking positional encoding in language pre-training. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=09-528y2Fgf.
Lin et al. (2017) Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988, 2017.
Ma (2015) Ma, B. Novor: real-time peptide de novo sequencing software. Journal of the American Society for Mass Spectrometry, 26(11):1885–1894, 2015.
Ma et al. (2003) Ma, B., Zhang, K., Hendrie, C., Liang, C., Li, M., Doherty-Kirby, A., and Lajoie, G. Peaks: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid communications in mass spectrometry, 17(20):2337–2342, 2003.
Mayer & Impens (2021) Mayer, R. L. and Impens, F. Immunopeptidomics for next-generation bacterial vaccine development. Trends in microbiology, 29(11):1034–1045, 2021.
Muth et al. (2013) Muth, T., Benndorf, D., Reichl, U., Rapp, E., and Martens, L. Searching for a needle in a stack of needles: challenges in metaproteomics data analysis. Molecular BioSystems, 9(4):578–585, 2013.
Qiao et al. (2021) Qiao, R., Tran, N. H., Xin, L., Chen, X., Li, M., Shan, B., and Ghodsi, A. Computationally instrument-resolution-independent de novo peptide sequencing for high-resolution devices. Nature Machine Intelligence, 3(5):420–425, 2021.
Ren et al. (2018) Ren, M., Zeng, W., Yang, B., and Urtasun, R. Learning to reweight examples for robust deep learning. In International conference on machine learning, pp. 4334–4343. PMLR, 2018.
SCIENCE (1977) SCIENCE, C.-M. U. P. P. D. O. C. Speech Understanding Systems. Summary of Results of the Five-Year Research Effort at Carnegie-Mellon University. 1977.
Taylor & Johnson (2001) Taylor, J. A. and Johnson, R. S. Implementation and uses of automated de novo peptide sequencing by tandem mass spectrometry. Analytical chemistry, 73(11):2594–2604, 2001.
Tran et al. (2017) Tran, N. H., Zhang, X., Xin, L., Shan, B., and Li, M. De novo peptide sequencing by deep learning. Proceedings of the National Academy of Sciences, 114(31):8247–8252, 2017.
VanDuijn et al. (2017) VanDuijn, M. M., Dekker, L. J., van IJcken, W. F., Sillevis Smitt, P. A., and Luider, T. M. Immune repertoire after immunization as seen by next-generation sequencing and proteomics. Frontiers in Immunology, 8:1286, 2017.
Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017.
Vizcaíno et al. (2014) Vizcaíno, J. A., Deutsch, E. W., Wang, R., Csordas, A., Reisinger, F., Ríos, D., Dianes, J. A., Sun, Z., Farrah, T., Bandeira, N., et al. Proteomexchange provides globally coordinated proteomics data submission and dissemination. Nature biotechnology, 32(3):223–226, 2014.
Wolters et al. (2001) Wolters, D. A., Washburn, M. P., and Yates, J. R. An automated multidimensional protein identification technology for shotgun proteomics. Analytical chemistry, 73(23):5683–5690, 2001.
Wyner (1978) Wyner, A. D. A definition of conditional mutual information for arbitrary ensembles. Information and Control, 38(1):51–59, 1978.
Yilmaz et al. (2022) Yilmaz, M., Fondrie, W., Bittremieux, W., Oh, S., and Noble, W. S. De novo mass spectrometry peptide sequencing with a transformer model. In International Conference on Machine Learning, pp. 25514–25522. PMLR, 2022.
Zhang et al. (2023) Zhang, Y., Kang, B., Hooi, B., Yan, S., and Feng, J. Deep long-tailed learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.

Appendix A Costs of Computing and Storage

In this part, we compare AdaNovo with Casanovo in terms of the number of model parameters, training time and inference time. The results shown in Table 6. Compared to casanovo, AdaNovo introduced Peptide Decoder #2, resulting in a 40.04% increase in parameter count (from 47.35M to 66.31M). Similarly, under the same hardware settings (1 A100-SXM4-80GB and 32 CPU), training time increased by 7.3% (from 63.27M to 67.92M). However, the inference of AdaNovo is more efficient than CasaNovo.

Table 6: Comparisons with competitive methods in terms of computational overhead. The training and inference time are evaluated on Honeybee dataset, which is one of 9-species benchmark (Tran et al., 2017).

Model	#Params (M)	Training time (h)	Inference time (h)
Casanovo	47.35	63.27	8.42
AdaNovo	66.31	67.92	6.02

Appendix B Sensitivity Analysis

In this section, we investigate the effects of the two hyperparameters $s_{1}$ and $s_{2}$ , which determines the influence of amino acid-level and PSM-level training strategy. As shown in Figure 5, we tune both $s_{1}$ and $s_{2}$ within the range $[0.05,0.1,0.3]$ and observe that the values of these two hyperparameters significantly affect the final performance of the model. Additionally, the optimal hyperparameters vary across different models, indicating differences in noise and the distributions of amino acids with PTMs among different datasets. It is necessary to finely adjust the values of $s_{1}$ and $s_{2}$ based on the dataset, representing the balance between amino acid-level and PSM-level training strategies.

$\displaystyle\operatorname{MI}(\mathbf{x},\mathbf{z};\mathbf{y})$	$\displaystyle=\frac{1}{\|\mathbf{y}\|}\log\left(\frac{p(\mathbf{x},\mathbf{z},% \mathbf{y})}{p(\mathbf{x},\mathbf{z})\cdot p(\mathbf{y})}\right)$	(5)
	$\displaystyle=\frac{1}{\|\mathbf{y}\|}\log\left(\frac{p(\mathbf{y}\mid\mathbf{x}% ,\mathbf{z})}{p(\mathbf{y})}\right)$
	$\displaystyle=\frac{1}{\|\mathbf{y}\|}\log\left(\frac{\prod_{j}p\left(y_{j}\mid% \mathbf{x},\mathbf{z},\mathbf{y}_{<j}\right)}{\prod_{j}p\left(y_{j}\mid\mathbf% {y}_{<j}\right)}\right)$
	$\displaystyle=\frac{1}{\|\mathbf{y}\|}\sum_{j}\log\left(\frac{p\left(y_{j}\mid% \mathbf{x},\mathbf{z},\mathbf{y}_{<j}\right)}{p\left(y_{j}\mid\mathbf{y}_{<j}% \right)}\right)$
	$\displaystyle=\frac{1}{\|\mathbf{y}\|}\sum_{j}\operatorname{CMI}(\mathbf{x},% \mathbf{z};y_{j}).$

AdaNovo: Adaptive De Novo Peptide Sequencing with Conditional Mutual Information