License: arXiv.org perpetual non-exclusive license
arXiv:2403.07013v2 [q-bio.QM] 15 Mar 2024

AdaNovo: Adaptive De Novo Peptide Sequencing with
Conditional Mutual Information

Jun Xia    Shaorong Chen    **gbo Zhou    Tianze Ling    Wenjie Du    Sizhe Liu    Stan Z. Li
Abstract

Tandem mass spectrometry has played a pivotal role in advancing proteomics, enabling the analysis of protein composition in biological samples. Despite the development of various deep learning methods for identifying amino acid sequences (peptides) responsible for observed spectra, challenges persist in de novo peptide sequencing. Firstly, prior methods struggle to identify amino acids with post-translational modifications (PTMs) due to their lower frequency in training data compared to canonical amino acids, further resulting in decreased peptide-level identification precision. Secondly, diverse types of noise and missing peaks in mass spectra reduce the reliability of training data (peptide-spectrum matches, PSMs). To address these challenges, we propose AdaNovo, a novel framework that calculates conditional mutual information (CMI) between the spectrum and each amino acid/peptide, using CMI for adaptive model training. Extensive experiments demonstrate AdaNovo’s state-of-the-art performance on a 9-species benchmark, where the peptides in the training set are almost completely disjoint from the peptides of the test sets. Moreover, AdaNovo excels in identifying amino acids with PTMs and exhibits robustness against data noise. The supplementary materials contain the official code.

Machine Learning, ICML

1 Introduction

Refer to caption
Figure 1: Comparisons of various de novo sequencing methods in terms of amino acid-level precision. ‘G’ and ‘A’ denote Glycine and Alanine, respectively. Both of them are canonical amino acids. ‘M(+15.99)’ and ‘Q(+.98)’ represent oxidation of methionine and deamidation of glutamin, both of which are modified amino acids (the amino acids with PTMs). The results are for the human dataset, which is one of 9-species benchmark (Tran et al., 2017).
Refer to caption
Figure 2: The identification workflow of shotgun proteomics  (Wolters et al., 2001). The spectrum identification task this work study is to produce the peptide sequence (e.g., ATASPPRQK) for the observed spectrum. In the spectrum, peaks representing b- and y-ions of the associated peptide are highlighted in color, while grey peaks indicate unexpected fragmentation events or noise. The spectrum annotation are created using ProteomeXchange (Vizcaíno et al., 2014).

Tandem mass spectrometry is a high-throughput tool to identify and quantify proteins in biological samples. However, the precise determination of protein content from observed mass spectra at scale remains a formidable challenge. Central to this challenge is the spectrum identification problem, wherein we are presented with an observed mass spectrum and the corresponding precursor information (mass and charge of the peptide), and our task is to predict the peptide (amino acid sequence) responsible for generating the spectrum. Currently, spectrum identification is most commonly solved using database search, where the observed spectra from the mass spectrometer are compared to theoretical spectra generated by database of known protein sequences. Software algorithms match experimental spectra to theoretical spectra from the database, report the best-scoring peptide-spectrum match (PSM) per spectrum.

However, database search relies on a pre-defined database, preventing the identification of unexpected peptide sequences, such as those originated from genetic variation. Additionally, a database cannot be leveraged for the analysis of some types of immunopeptidomics data (VanDuijn et al., 2017), in antibody sequencing (Tran et al., 2017) or in vaccine development (Mayer & Impens, 2021). Also, the task of constructing a precise database for metaproteomic analyses, including those related to the human microbiome or environmental samples, is deemed impossible (Muth et al., 2013). All of these limitations necessitate de novo peptide sequencing from the observed mass spectra without using prior knowledge in the form of a peptide sequence database.

Since the early 1990s, de novo methods based on the graph theory (Bartels, 1990; Frank, 2009), Hidden Markov Model (Fischer et al., 2005), or dynamic programming (Dančík et al., 1999; Ma et al., 2003; Frank & Pevzner, 2005) were developed to score peptide sequences against observed spectra. With the rise of deep learning, some researchers train the deep neural networks using PSMs (Tran et al., 2017; Qiao et al., 2021; Yilmaz et al., 2022), where they regard the spectra and matching peptides as the inputs and labels, respectively. And then, the trained models are expected to identify never-before-seen peptides. Although these methods have achieved notable progress, as shown in Figure 1, we observe that they struggle to identify the amino acids with PTMs, further leading to low amino acid-level and peptide-level precision. However, the identification of amino acids with PTMs holds significant biological importance because PTMs plays a pivotal role in elucidating protein function and studying disease mechanisms (Deribe et al., 2010).

On the other hand, some of the expected peaks in mass spectra may be missing due to instrument malfunction or multiple cleavage events occurring on the same peptide, and some additional peaks may undesirably appear in the spectrum, created by instrument noise or non-peptide molecules in the biological samples. All of these make the spectra and peptides labels for training being poorly matched.

To address above issues, we propose a new framework, AdaNovo, to calculate the conditional mutual information (CMI) between the spectrum and each amino acid in the matching peptide. This can measures the importance of different target amino acids by their dependence on the source spectrum. Based on the amino acid-level CMI, we obtain the PSM-level CMI between the spectrum and the entire peptide to measure the matching level of each spectrum-peptide pair in the training PSM data. Subsequently, we design an adaptive training approach based on both the amino acid- and PSM-level CMI, which adaptively re-weights the training losses of the corresponding amino acids.

We conduct the training and evaluation of our model on the widely-used 9-species datasets and observe that AdaNovo outperforms state-of-the-art methods in predicting never-before-seen peptide sequences and demonstrate significantly higher precision in identifying the amino acids with PTMs.

2 Background

Proteomics research focuses on large-scale studies to characterize the proteome, the entire set of proteins, in a living organism. Tandem mass spectrometry (MS), as the mainstream high-throughput technique to identify protein sequences, plays an essential role in proteomics research. As shown in Figure 2, in a standard identification workflow of shotgun proteomics  (Wolters et al., 2001), proteins undergo initial digestion by enzymes, yielding a mixture of peptides. A tandem mass spectrometer measures mass-to-charge (m/z) ratios of each peptides in a two-scan process. During the first scan (MS1), the mass-to-charge (m/z) ratios of intact peptides, also known as precursors, are measured. Following this, peptides undergo fragmentation, and the resulting fragments are analyzed in a subsequent scan. In the second scan (MS2), peptides are fragmented at random locations along the peptide backbone, generating peaks corresponding to prefixes (b-ions) and suffixes (y-ions) of the peptide, each associated with a specific charge state. Consequently, the MS2 spectrum comprises a collection of peaks. Each peak is characterized by an m/z value and an associated intensity. The intensity, though unitless, is directly proportional to the number of ions contributing to the observed peak. The m/z value is measured with remarkable precision, while the intensity is measured with comparatively lower precision. The core of the above pipeline is the spectrum identification problem, where we aim to predict the peptide sequence responsible for generating the observed MS2 spectrum and the corresponding precursor information (mass and charge of the peptide)

3 Related Work

Early de novo sequencing methods used dynamic programming to score peptide sequences against each observed spectrum. PEAKS (Ma et al., 2003) uses a sophisticated dynamic programming algorithm to compute the best sequences whose fragment ions can best interpret the peaks in the MS2 spectrum. Graph-based algorithms, such as Sherenga (Dančík et al., 1999) and pNovo (Taylor & Johnson, 2001), first translated the spectrum into a “spectrum graph” where nodes in the graph correspond to peaks in the spectrum and two nodes are connected by an edge if the mass difference between the two corresponding peaks is equal to the mass of an amino acid. The de novo peptide sequencing problem is thus cast as finding the path in the resulting graph.

Recently, machine learning (Fischer et al., 2005; Frank & Pevzner, 2005) have been introduced into de novo peptide sequencing and significantly improved the accuracy. The PepNovo (Frank & Pevzner, 2005) algorithm present a novel scoring method, which uses a probabilistic network whose structure reflects the chemical and physical rules that govern the peptide fragmentation. The Novor algorithm (Ma, 2015) achieved improved performance by using large decision trees as score function in a dynamic programming algorithm.

The first deep neural network method for de novo peptide sequencing, DeepNovo (Tran et al., 2017), treats the de novo sequencing task as an image caption task and combines CNN with LSTM to predict the sequence. SMSNet (Karunratanakul et al., 2019) is a hybrid approach which leverages a multi-step Sequence-Mask-Search strategy and adopts the encoder-decoder architecture, basically formulating peptide sequencing as a spectra-to-peptide language translation problem. PointNovo (Qiao et al., 2021) adopts an order invariant network structure for peptide sequencing, which focuses specifically on high-resolution mass spectrometry data. Similar to SMSNet (Karunratanakul et al., 2019), Casanovo (Yilmaz et al., 2022) frames the problem as a language translation problem and employs a transformer framework that has been widely used to process and predict sequences.

Although de novo methods have achieved notable progress, we observe that they have difficulty in identifying the amino acids with PTMs because these amino acids occur much less frequently in datasets compared to other common amino acids, making it challenging for the model to learn. Additionally, mass spectrometry data contains a significant amount of noise typically originated from the electronic fluctuations in the instruments and other molecules in the biological samples. In other words, there are plenty of noise peaks mixing together with the real ions. All of these make the peptides labels being less reliable. These issues limit the predictive accuracy and widespread use of de novo methods. The AdaNovo model proposed in this paper effectively alleviates both of them.

4 Methods

4.1 Task Formulation

Formally, we denote mass spectrum peaks as 𝐱={(mi,Ii)}i=1M𝐱superscriptsubscriptsubscript𝑚𝑖subscript𝐼𝑖𝑖1𝑀\mathbf{x}=\left\{\left(m_{i},I_{i}\right)\right\}_{i=1}^{M}bold_x = { ( italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT, where each peak (mi,Ii)subscript𝑚𝑖subscript𝐼𝑖\left(m_{i},I_{i}\right)( italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) forms a 2-tuple representing the m/z and intensity value, and M𝑀Mitalic_M is the number of peaks that can be varied across different mass spectra. Also, we denote the precursor as 𝐳={(mprec,cprec)}𝐳subscript𝑚𝑝𝑟𝑒𝑐subscript𝑐𝑝𝑟𝑒𝑐\mathbf{z}=\left\{\left(m_{prec},c_{prec}\right)\right\}bold_z = { ( italic_m start_POSTSUBSCRIPT italic_p italic_r italic_e italic_c end_POSTSUBSCRIPT , italic_c start_POSTSUBSCRIPT italic_p italic_r italic_e italic_c end_POSTSUBSCRIPT ) }, consisting of the total mass mprecsubscript𝑚𝑝𝑟𝑒𝑐m_{prec}\in\mathbb{R}italic_m start_POSTSUBSCRIPT italic_p italic_r italic_e italic_c end_POSTSUBSCRIPT ∈ blackboard_R and charge state cprec{1,2,,10}subscript𝑐𝑝𝑟𝑒𝑐1210c_{prec}\in\left\{1,2,\dots,10\right\}italic_c start_POSTSUBSCRIPT italic_p italic_r italic_e italic_c end_POSTSUBSCRIPT ∈ { 1 , 2 , … , 10 } of the spectrum. Additionally, we represent the peptide sequence as 𝐲={(y1,y2,,yN)}𝐲subscript𝑦1subscript𝑦2subscript𝑦𝑁\mathbf{y}=\left\{\left(y_{1},y_{2},\dots,y_{N}\right)\right\}bold_y = { ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_y start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) }, where N𝑁Nitalic_N is the peptide length and can be varied across different peptides. 𝐲<jsubscript𝐲absent𝑗\mathbf{y}_{<j}bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT means the previous amino acids sequence appearing before the index j𝑗jitalic_j in the peptide 𝐲𝐲\mathbf{y}bold_y. The de novo peptide sequencing models are designed to predict the probability of each amino acid:

P(𝐲𝐱,𝐳;θ)=j=1Np(yj𝐲<j,𝐱,𝐳;θ),𝑃conditional𝐲𝐱𝐳𝜃superscriptsubscriptproduct𝑗1𝑁𝑝conditionalsubscript𝑦𝑗subscript𝐲absent𝑗𝐱𝐳𝜃P(\mathbf{y}\mid\mathbf{x},\mathbf{z};\theta)=\prod_{j=1}^{N}p\left(y_{j}\mid% \mathbf{y}_{<j},\mathbf{x},\mathbf{z};\theta\right),italic_P ( bold_y ∣ bold_x , bold_z ; italic_θ ) = ∏ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_p ( italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT , bold_x , bold_z ; italic_θ ) , (1)

where j𝑗jitalic_j is the index of each amino acid position in the peptide sequence and θ𝜃\thetaitalic_θ is the model parameter. In general, previous models (Tran et al., 2017; Yilmaz et al., 2022; Qiao et al., 2021) are optimized using the cross-entropy (CE) loss during training:

CE(θ)=j=1Nlogp(yj𝐲<j,𝐱,𝐳;θ).subscriptCE𝜃superscriptsubscript𝑗1𝑁𝑝conditionalsubscript𝑦𝑗subscript𝐲absent𝑗𝐱𝐳𝜃\mathcal{L}_{\mathrm{CE}}(\theta)=-\sum_{j=1}^{N}\log p\left(y_{j}\mid\mathbf{% y}_{<j},\mathbf{x},\mathbf{z};\theta\right).caligraphic_L start_POSTSUBSCRIPT roman_CE end_POSTSUBSCRIPT ( italic_θ ) = - ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT roman_log italic_p ( italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT , bold_x , bold_z ; italic_θ ) . (2)

During inference, the de novo sequencing models typically predict the probabilities of target amino acids in an autoregressive manner and generate hypotheses using heuristic search algorithms like beam search (SCIENCE, 1977).

4.2 Model Architectures

As shown in Figure 3, AdaNovo consists of a mass spectrum encoder (MS Encoder) and two peptide decoders (Peptide Decoder #1 and Peptide Decoder #2). All of these models are built on the Transformer (Vaswani et al., 2017).

In order to feed the MS peaks to MS Encoder, we regard each mass spectrum peak (mi,Ii)subscript𝑚𝑖subscript𝐼𝑖\left(m_{i},I_{i}\right)( italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) as a ‘word’ in natural language processing and obtain its embedding by individually embed its m/z value and intensity value before combining them through summation. Specifically, we employ a fixed, sinusoidal embedding (Vaswani et al., 2017) to project m/z value misubscript𝑚𝑖m_{i}italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT to a d𝑑ditalic_d dimensional vector fisubscript𝑓𝑖f_{i}italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT,

fi={sin(mi/(λmaxλmin(λmin2π)2i/d)), for id/2cos(mi/(λmaxλmin(λmin2π)2i/d)), for i>d/2subscript𝑓𝑖casessubscript𝑚𝑖subscript𝜆subscript𝜆superscriptsubscript𝜆2𝜋2𝑖𝑑 for 𝑖𝑑2subscript𝑚𝑖subscript𝜆subscript𝜆superscriptsubscript𝜆2𝜋2𝑖𝑑 for 𝑖𝑑2f_{i}=\left\{\begin{array}[]{ll}\sin\left(m_{i}/\left(\frac{\lambda_{\max}}{% \lambda_{\min}}\left(\frac{\lambda_{\min}}{2\pi}\right)^{2i/d}\right)\right),&% \text{ for }i\leq d/2\\ \cos\left(m_{i}/\left(\frac{\lambda_{\max}}{\lambda_{\min}}\left(\frac{\lambda% _{\min}}{2\pi}\right)^{2i/d}\right)\right),&\text{ for }i>d/2\end{array}\right.italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = { start_ARRAY start_ROW start_CELL roman_sin ( italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT / ( divide start_ARG italic_λ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT end_ARG start_ARG italic_λ start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT end_ARG ( divide start_ARG italic_λ start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT end_ARG start_ARG 2 italic_π end_ARG ) start_POSTSUPERSCRIPT 2 italic_i / italic_d end_POSTSUPERSCRIPT ) ) , end_CELL start_CELL for italic_i ≤ italic_d / 2 end_CELL end_ROW start_ROW start_CELL roman_cos ( italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT / ( divide start_ARG italic_λ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT end_ARG start_ARG italic_λ start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT end_ARG ( divide start_ARG italic_λ start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT end_ARG start_ARG 2 italic_π end_ARG ) start_POSTSUPERSCRIPT 2 italic_i / italic_d end_POSTSUPERSCRIPT ) ) , end_CELL start_CELL for italic_i > italic_d / 2 end_CELL end_ROW end_ARRAY (3)

where λmaxsubscript𝜆\lambda_{\max}italic_λ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT = 10,000 and λminsubscript𝜆\lambda_{\min}italic_λ start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT = 0.001. The input embeddings furnish a detailed portrayal of high-precision m/z information. Analogous to the consideration of relative positions in the initial transformer model (Vaswani et al., 2017), these embeddings potentially facilitate the model’s attention to m/z variations between peaks. Such attention to detail is crucial for the accurate identification of amino acids within the peptide sequence. The intensity, measured with less precision compared to the m/z value, undergoes embedding by projection into d dimensions through a linear layer. Subsequently, the m/z and intensity embeddings are amalgamated through summation, resulting in the generation of the input peak embedding. Kindly note that the mass spectrum peaks are permutation invariant, i.e., the order in which the peaks appear in the spectrum does not affect the identification results. Therefore, it is unnecessary to account for an extra positional embedding (Ke et al., 2021) like natural language processing when feed the peaks into transformer.

Similarly, for the precursor 𝐳={(mprec,cprec)}𝐳subscript𝑚𝑝𝑟𝑒𝑐subscript𝑐𝑝𝑟𝑒𝑐\mathbf{z}=\left\{\left(m_{prec},c_{prec}\right)\right\}bold_z = { ( italic_m start_POSTSUBSCRIPT italic_p italic_r italic_e italic_c end_POSTSUBSCRIPT , italic_c start_POSTSUBSCRIPT italic_p italic_r italic_e italic_c end_POSTSUBSCRIPT ) } to be fed into the Peptide Decoder #1, we employ the same sinusoidal embedding for mprecsubscript𝑚𝑝𝑟𝑒𝑐m_{prec}italic_m start_POSTSUBSCRIPT italic_p italic_r italic_e italic_c end_POSTSUBSCRIPT as the m/z above and an embedding layer to embed cprecsubscript𝑐𝑝𝑟𝑒𝑐c_{prec}italic_c start_POSTSUBSCRIPT italic_p italic_r italic_e italic_c end_POSTSUBSCRIPT. Finally, we obtain the input precursor embedding by summarizing the above 2 embeddings. As for the peptide sequence, the amino acid vocabulary encompasses the 20 canonical amino acids, along with post-translationally modified versions of three among them (oxidation of methionine and deamidation of asparagine or glutamine). Additionally, a special stop token signals the end of decoding, resulting in a total of 24 tokens. Peptide Decoder #1 and Peptide Decoder #2 undergo autoregressive training, wherein they receive the preceding amino acid sequence 𝐲<jsubscript𝐲absent𝑗\mathbf{y}_{<j}bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT prior to amino acid j𝑗jitalic_j during the prediction process for the identity of amino acid j𝑗jitalic_j. However, different from Peptide Decoder #1, Peptide Decoder #2 exclusively utilizes 𝐲<jsubscript𝐲absent𝑗\mathbf{y}_{<j}bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT as input because we want to calculate the conditional probability p(yj𝐲<j)𝑝conditionalsubscript𝑦𝑗subscript𝐲absent𝑗p(y_{j}\mid\mathbf{y}_{<j})italic_p ( italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT ), which is the prerequisite for calculating the conditional mutual information between the mass spectrum (𝐱𝐱\mathbf{x}bold_x and 𝐳𝐳\mathbf{z}bold_z) and amino acid yjsubscript𝑦𝑗y_{j}italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT.

Refer to caption
Figure 3: Schematic diagram of AdaNovo framework.

4.3 Training Strategies

The training strategies consist of amino acid-level (§ 4.3.1) and PSM-level adaptive training (§ 4.3.2), which we elaborate on below.

4.3.1 Amino acid-level Adaptive Training.

As mentioned above, previous de novo sequencing models struggle to identify amino acids with PTMs because they occur much less frequently in datasets compared to other canonical amino acids, making it challenging for the model to learn. Therefore, we expect to emphasize the amino acids with PTMs to improve the models’ ability in identifying them. This resembles the up-sampling methods in long-tailed classification where researchers emphasize samples from the tail class during training (Zhang et al., 2023; Ren et al., 2018). We also compare with these alternative methods in Section 5.7. On the other hand, when predict the amino acid with PTMs yjsubscript𝑦𝑗y_{j}italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, we should rely more on mass spectrometry data (x and z) and less on the historical predictions of previous amino acids y<jsubscript𝑦absent𝑗y_{<j}italic_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT because the mass shifts resulting from PTMs are only manifested in the mass spectrometry data. This motivates us to measure the mutual information (MI) between each target amino acid and mass spectrum conditioned on previous amino acids, i.e., conditional mutual information (CMI) (Wyner, 1978) between each target amino acid and mass spectrum,

CMI(𝐱,𝐳;yj)CMI𝐱𝐳subscript𝑦𝑗\displaystyle\operatorname{CMI}(\mathbf{x},\mathbf{z};y_{j})roman_CMI ( bold_x , bold_z ; italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) =MI(𝐱,𝐳;yj𝐲<j)absentMI𝐱𝐳conditionalsubscript𝑦𝑗subscript𝐲absent𝑗\displaystyle=\operatorname{MI}\left(\mathbf{x},\mathbf{z};y_{j}\mid\mathbf{y}% _{<j}\right)= roman_MI ( bold_x , bold_z ; italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT )
=log(p(yj,𝐱,𝐳𝐲<j)p(yj𝐲<j)p(𝐱,𝐳𝐲<j)).absent𝑝subscript𝑦𝑗𝐱conditional𝐳subscript𝐲absent𝑗𝑝conditionalsubscript𝑦𝑗subscript𝐲absent𝑗𝑝𝐱conditional𝐳subscript𝐲absent𝑗\displaystyle=\log\left(\frac{p\left(y_{j},\mathbf{x},\mathbf{z}\mid\mathbf{y}% _{<j}\right)}{p\left(y_{j}\mid\mathbf{y}_{<j}\right)\cdot p\left(\mathbf{x},% \mathbf{z}\mid\mathbf{y}_{<j}\right)}\right).= roman_log ( divide start_ARG italic_p ( italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , bold_x , bold_z ∣ bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT ) end_ARG start_ARG italic_p ( italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT ) ⋅ italic_p ( bold_x , bold_z ∣ bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT ) end_ARG ) .

However, it is computationally impractical to calculate the CMICMI\operatorname{CMI}roman_CMI with above definition. To address this, we proceed to enhance its computational tractability by decomposing the conditional joint distribution,

CMI(𝐱,𝐳;yj)CMI𝐱𝐳subscript𝑦𝑗\displaystyle\operatorname{CMI}(\mathbf{x},\mathbf{z};y_{j})roman_CMI ( bold_x , bold_z ; italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) =MI(𝐱,𝐳;yj𝐲<j)absentMI𝐱𝐳conditionalsubscript𝑦𝑗subscript𝐲absent𝑗\displaystyle=\operatorname{MI}\left(\mathbf{x},\mathbf{z};y_{j}\mid\mathbf{y}% _{<j}\right)= roman_MI ( bold_x , bold_z ; italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT )
=log(p(yj,𝐱,𝐳𝐲<j)p(yj𝐲<j)p(𝐱,𝐳𝐲<j)).absent𝑝subscript𝑦𝑗𝐱conditional𝐳subscript𝐲absent𝑗𝑝conditionalsubscript𝑦𝑗subscript𝐲absent𝑗𝑝𝐱conditional𝐳subscript𝐲absent𝑗\displaystyle=\log\left(\frac{p\left(y_{j},\mathbf{x},\mathbf{z}\mid\mathbf{y}% _{<j}\right)}{p\left(y_{j}\mid\mathbf{y}_{<j}\right)\cdot p\left(\mathbf{x},% \mathbf{z}\mid\mathbf{y}_{<j}\right)}\right).= roman_log ( divide start_ARG italic_p ( italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , bold_x , bold_z ∣ bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT ) end_ARG start_ARG italic_p ( italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT ) ⋅ italic_p ( bold_x , bold_z ∣ bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT ) end_ARG ) .
=log(p(yj𝐱,𝐳,𝐲<j)p(𝐱,𝐳𝐲<j)p(yj𝐲<j)p(𝐱,𝐳𝐲<j))absent𝑝conditionalsubscript𝑦𝑗𝐱𝐳subscript𝐲absent𝑗𝑝𝐱conditional𝐳subscript𝐲absent𝑗𝑝conditionalsubscript𝑦𝑗subscript𝐲absent𝑗𝑝𝐱conditional𝐳subscript𝐲absent𝑗\displaystyle=\log\left(\frac{p\left(y_{j}\mid\mathbf{x},\mathbf{z},\mathbf{y}% _{<j}\right)\cdot p\left(\mathbf{x},\mathbf{z}\mid\mathbf{y}_{<j}\right)}{p% \left(y_{j}\mid\mathbf{y}_{<j}\right)\cdot p\left(\mathbf{x},\mathbf{z}\mid% \mathbf{y}_{<j}\right)}\right)= roman_log ( divide start_ARG italic_p ( italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_x , bold_z , bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT ) ⋅ italic_p ( bold_x , bold_z ∣ bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT ) end_ARG start_ARG italic_p ( italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT ) ⋅ italic_p ( bold_x , bold_z ∣ bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT ) end_ARG )
=log(p(yj𝐱,𝐳,𝐲<j)p(yj𝐲<j)).absent𝑝conditionalsubscript𝑦𝑗𝐱𝐳subscript𝐲absent𝑗𝑝conditionalsubscript𝑦𝑗subscript𝐲absent𝑗\displaystyle=\log\left(\frac{p\left(y_{j}\mid\mathbf{x},\mathbf{z},\mathbf{y}% _{<j}\right)}{p\left(y_{j}\mid\mathbf{y}_{<j}\right)}\right).= roman_log ( divide start_ARG italic_p ( italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_x , bold_z , bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT ) end_ARG start_ARG italic_p ( italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT ) end_ARG ) .

In this way, the CMI(𝐱,𝐳;yj)CMI𝐱𝐳subscript𝑦𝑗\operatorname{CMI}(\mathbf{x},\mathbf{z};y_{j})roman_CMI ( bold_x , bold_z ; italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) can be obtained with p(yj𝐱,𝐳,𝐲<j)𝑝conditionalsubscript𝑦𝑗𝐱𝐳subscript𝐲absent𝑗p\left(y_{j}\mid\mathbf{x},\mathbf{z},\mathbf{y}_{<j}\right)italic_p ( italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_x , bold_z , bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT ) and p(yj𝐲<j)𝑝conditionalsubscript𝑦𝑗subscript𝐲absent𝑗p\left(y_{j}\mid\mathbf{y}_{<j}\right)italic_p ( italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT ), which are the output of the Peptide Decoder #1 and Peptide Decoder #2, respectively. Moreover, to reduce the variances and stabilize the distribution of the amino acid-level CMI in each peptide, we normalize the CMI values in the peptide and then scale the normalized values to obtain the amino acid-level training weight wjaasuperscriptsubscript𝑤𝑗𝑎𝑎w_{j}^{aa}italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a italic_a end_POSTSUPERSCRIPT for yjsubscript𝑦𝑗y_{j}italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT,

wjaa=max{0,s1CMI(𝐱,𝐳;yj)μaaσaa+1},superscriptsubscript𝑤𝑗𝑎𝑎0subscript𝑠1CMI𝐱𝐳subscript𝑦𝑗superscript𝜇𝑎𝑎superscript𝜎𝑎𝑎1w_{j}^{aa}=\max\left\{0,s_{1}\cdot\frac{\operatorname{CMI}\left(\mathbf{x},% \mathbf{z};y_{j}\right)-\mu^{aa}}{\sigma^{aa}}+1\right\},italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a italic_a end_POSTSUPERSCRIPT = roman_max { 0 , italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋅ divide start_ARG roman_CMI ( bold_x , bold_z ; italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) - italic_μ start_POSTSUPERSCRIPT italic_a italic_a end_POSTSUPERSCRIPT end_ARG start_ARG italic_σ start_POSTSUPERSCRIPT italic_a italic_a end_POSTSUPERSCRIPT end_ARG + 1 } , (4)

where μaasuperscript𝜇𝑎𝑎\mu^{aa}italic_μ start_POSTSUPERSCRIPT italic_a italic_a end_POSTSUPERSCRIPT and σaasuperscript𝜎𝑎𝑎\sigma^{aa}italic_σ start_POSTSUPERSCRIPT italic_a italic_a end_POSTSUPERSCRIPT are the mean values and the standard deviations of all the CMI values in each peptide, and s1subscript𝑠1s_{1}italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT is a hyperparameter that controls the effect of amino acid-level adaptive training.

4.3.2 PSM-level Adaptive Training.

As we introduced before, the training PSMs samples are of different matching levels because of the signal noise and missing peaks. To alleviate the negative effect of poorly matched mass spectrometry and peptide pairs and encourage the well-matched pairs, we adopt the mutual information between them as a measure of matching levels. Formally,

MI(𝐱,𝐳;𝐲)MI𝐱𝐳𝐲\displaystyle\operatorname{MI}(\mathbf{x},\mathbf{z};\mathbf{y})roman_MI ( bold_x , bold_z ; bold_y ) =1|𝐲|log(p(𝐱,𝐳,𝐲)p(𝐱,𝐳)p(𝐲))absent1𝐲𝑝𝐱𝐳𝐲𝑝𝐱𝐳𝑝𝐲\displaystyle=\frac{1}{|\mathbf{y}|}\log\left(\frac{p(\mathbf{x},\mathbf{z},% \mathbf{y})}{p(\mathbf{x},\mathbf{z})\cdot p(\mathbf{y})}\right)= divide start_ARG 1 end_ARG start_ARG | bold_y | end_ARG roman_log ( divide start_ARG italic_p ( bold_x , bold_z , bold_y ) end_ARG start_ARG italic_p ( bold_x , bold_z ) ⋅ italic_p ( bold_y ) end_ARG ) (5)
=1|𝐲|log(p(𝐲𝐱,𝐳)p(𝐲))absent1𝐲𝑝conditional𝐲𝐱𝐳𝑝𝐲\displaystyle=\frac{1}{|\mathbf{y}|}\log\left(\frac{p(\mathbf{y}\mid\mathbf{x}% ,\mathbf{z})}{p(\mathbf{y})}\right)= divide start_ARG 1 end_ARG start_ARG | bold_y | end_ARG roman_log ( divide start_ARG italic_p ( bold_y ∣ bold_x , bold_z ) end_ARG start_ARG italic_p ( bold_y ) end_ARG )
=1|𝐲|log(jp(yj𝐱,𝐳,𝐲<j)jp(yj𝐲<j))absent1𝐲subscriptproduct𝑗𝑝conditionalsubscript𝑦𝑗𝐱𝐳subscript𝐲absent𝑗subscriptproduct𝑗𝑝conditionalsubscript𝑦𝑗subscript𝐲absent𝑗\displaystyle=\frac{1}{|\mathbf{y}|}\log\left(\frac{\prod_{j}p\left(y_{j}\mid% \mathbf{x},\mathbf{z},\mathbf{y}_{<j}\right)}{\prod_{j}p\left(y_{j}\mid\mathbf% {y}_{<j}\right)}\right)= divide start_ARG 1 end_ARG start_ARG | bold_y | end_ARG roman_log ( divide start_ARG ∏ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_p ( italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_x , bold_z , bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT ) end_ARG start_ARG ∏ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_p ( italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT ) end_ARG )
=1|𝐲|jlog(p(yj𝐱,𝐳,𝐲<j)p(yj𝐲<j))absent1𝐲subscript𝑗𝑝conditionalsubscript𝑦𝑗𝐱𝐳subscript𝐲absent𝑗𝑝conditionalsubscript𝑦𝑗subscript𝐲absent𝑗\displaystyle=\frac{1}{|\mathbf{y}|}\sum_{j}\log\left(\frac{p\left(y_{j}\mid% \mathbf{x},\mathbf{z},\mathbf{y}_{<j}\right)}{p\left(y_{j}\mid\mathbf{y}_{<j}% \right)}\right)= divide start_ARG 1 end_ARG start_ARG | bold_y | end_ARG ∑ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT roman_log ( divide start_ARG italic_p ( italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_x , bold_z , bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT ) end_ARG start_ARG italic_p ( italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT ) end_ARG )
=1|𝐲|jCMI(𝐱,𝐳;yj).absent1𝐲subscript𝑗CMI𝐱𝐳subscript𝑦𝑗\displaystyle=\frac{1}{|\mathbf{y}|}\sum_{j}\operatorname{CMI}(\mathbf{x},% \mathbf{z};y_{j}).= divide start_ARG 1 end_ARG start_ARG | bold_y | end_ARG ∑ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT roman_CMI ( bold_x , bold_z ; italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) .

In other words, the mutual information can be derived by averaging all the amino acid-level CMI(𝐱,𝐳;yj)CMI𝐱𝐳subscript𝑦𝑗\operatorname{CMI}(\mathbf{x},\mathbf{z};y_{j})roman_CMI ( bold_x , bold_z ; italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) over the peptide. Similarly, we normalize all the MIMI\operatorname{MI}roman_MI values across all the PSMs in each mini-batch and then scale the normalized values to obtain the PSM-level training weight wpsmsuperscript𝑤𝑝𝑠𝑚w^{psm}italic_w start_POSTSUPERSCRIPT italic_p italic_s italic_m end_POSTSUPERSCRIPT,

wpsm=max{0,s2MI(𝐱,𝐳;𝐲)μpsmσpsm+1},superscript𝑤𝑝𝑠𝑚0subscript𝑠2MI𝐱𝐳𝐲superscript𝜇𝑝𝑠𝑚superscript𝜎𝑝𝑠𝑚1w^{psm}=\max\left\{0,s_{2}\cdot\frac{\operatorname{MI}(\mathbf{x},\mathbf{z};% \mathbf{y})-\mu^{psm}}{\sigma^{psm}}+1\right\},italic_w start_POSTSUPERSCRIPT italic_p italic_s italic_m end_POSTSUPERSCRIPT = roman_max { 0 , italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ⋅ divide start_ARG roman_MI ( bold_x , bold_z ; bold_y ) - italic_μ start_POSTSUPERSCRIPT italic_p italic_s italic_m end_POSTSUPERSCRIPT end_ARG start_ARG italic_σ start_POSTSUPERSCRIPT italic_p italic_s italic_m end_POSTSUPERSCRIPT end_ARG + 1 } , (6)

where μpsmsuperscript𝜇𝑝𝑠𝑚\mu^{psm}italic_μ start_POSTSUPERSCRIPT italic_p italic_s italic_m end_POSTSUPERSCRIPT and σpsmsuperscript𝜎𝑝𝑠𝑚\sigma^{psm}italic_σ start_POSTSUPERSCRIPT italic_p italic_s italic_m end_POSTSUPERSCRIPT are the mean values and the standard deviations of the MI values of all the PSMs in each minibatch, and s2subscript𝑠2s_{2}italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is a hyperparameter that controls the effect of PSM-level adaptive training.

4.3.3 Adaptive Training Loss

In our adaptive training method, we re-weight each target amino acid yjsubscript𝑦𝑗y_{j}italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT with the following loss,

1(θ1)=j=1Nwjlogp(yj𝐲<j,𝐱,𝐳;θ1),subscript1subscript𝜃1superscriptsubscript𝑗1𝑁subscript𝑤𝑗𝑝conditionalsubscript𝑦𝑗subscript𝐲absent𝑗𝐱𝐳subscript𝜃1\mathcal{L}_{1}(\theta_{1})=-\sum_{j=1}^{N}w_{j}\log p\left(y_{j}\mid\mathbf{y% }_{<j},\mathbf{x},\mathbf{z};\theta_{1}\right),caligraphic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) = - ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT roman_log italic_p ( italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT , bold_x , bold_z ; italic_θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) , (7)

where θ1subscript𝜃1\theta_{1}italic_θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT are the parameters of MS Encoder and Peptide Decoder #1, and

wj=wjaawpsm.subscript𝑤𝑗superscriptsubscript𝑤𝑗𝑎𝑎superscript𝑤𝑝𝑠𝑚w_{j}=w_{j}^{aa}\cdot w^{psm}.italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a italic_a end_POSTSUPERSCRIPT ⋅ italic_w start_POSTSUPERSCRIPT italic_p italic_s italic_m end_POSTSUPERSCRIPT . (8)

Additionally, Peptide Decoder #2 is trained with the following loss,

2(θ2)=j=1Nlogp(yj𝐲<j;θ2),subscript2subscript𝜃2superscriptsubscript𝑗1𝑁𝑝conditionalsubscript𝑦𝑗subscript𝐲absent𝑗subscript𝜃2\mathcal{L}_{2}(\theta_{2})=-\sum_{j=1}^{N}\log p\left(y_{j}\mid\mathbf{y}_{<j% };\theta_{2}\right),caligraphic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) = - ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT roman_log italic_p ( italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT ; italic_θ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) , (9)

where θ2subscript𝜃2\theta_{2}italic_θ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT are the parameters of Peptide Decoder #2. The overall training loss is,

𝙰𝚍𝚊(θ1,θ2)=1(θ1)+2(θ2).subscript𝙰𝚍𝚊subscript𝜃1subscript𝜃2subscript1subscript𝜃1subscript2subscript𝜃2\mathcal{L}_{\tt{Ada}}(\theta_{1},\theta_{2})=\mathcal{L}_{1}(\theta_{1})+% \mathcal{L}_{2}(\theta_{2}).caligraphic_L start_POSTSUBSCRIPT typewriter_Ada end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) = caligraphic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) + caligraphic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) . (10)

4.4 Inference

In the inference phase, we only use MS Encoder and Peptide Decoder #1 to predict the peptide. Specifically, we feed the mass spectrometry data to the encoder MS Encoder and the decoder Peptide Decoder #1 predicts the highest-scoring amino acid for each peptide sequence position. The decoder is then fed its preceding amino acid predictions at each decoding step. The decoding process concludes upon predicting the stop token or reaching the predefined maximum peptide length of =100100\ell=100roman_ℓ = 100 amino acids. We discuss the computational overhead in Section 5.9.

4.5 Precursor m/z filtering

In de novo peptide sequencing, it’s crucial that the relative difference between the total mass of the predicted peptide (mpredsubscript𝑚predm_{\text{pred}}italic_m start_POSTSUBSCRIPT pred end_POSTSUBSCRIPT) and the observed precursor mass (mprecsubscript𝑚precm_{\text{prec}}italic_m start_POSTSUBSCRIPT prec end_POSTSUBSCRIPT) remains below a specified threshold value ϵitalic-ϵ\epsilonitalic_ϵ for the predicted sequence to be considered plausible. This requirement is expressed as Δmppm=|mprecmpred|×106mprec<ϵΔsubscript𝑚𝑝𝑝𝑚subscript𝑚precsubscript𝑚predsuperscript106subscript𝑚precitalic-ϵ\Delta m_{ppm}=\frac{\left|m_{\text{prec}}-m_{\text{pred}}\right|\times 10^{6}% }{m_{\text{prec}}}<\epsilonroman_Δ italic_m start_POSTSUBSCRIPT italic_p italic_p italic_m end_POSTSUBSCRIPT = divide start_ARG | italic_m start_POSTSUBSCRIPT prec end_POSTSUBSCRIPT - italic_m start_POSTSUBSCRIPT pred end_POSTSUBSCRIPT | × 10 start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT end_ARG start_ARG italic_m start_POSTSUBSCRIPT prec end_POSTSUBSCRIPT end_ARG < italic_ϵ. To ensure adherence to this constraint, we not only incorporate precursor information into the model’s learning process but also filter out peptide predictions that don’t meet this criterion. The threshold value ϵitalic-ϵ\epsilonitalic_ϵ is determined based on the precursor mass error tolerance used in the database search to establish ground truth peptide sequences for the test data.

5 Experiments

Table 1: Empirical comparison of de novo sequencing models. The table lists the peptide-level and amino acid-level precision of three competing models on all nine benchmark cross-validation folds. Each fold’s test set contains spectra from a single species. Kindly note that peptide-level performance measures are the primary quantifier of the model’s practical utility because the goal is to assign a complete peptide sequence to each spectrum. The best and the second best results are highlighted bold and underlined, respectively.
Species Peptide-level precision Amino acid-level precision
DeepNovo PointNovo Casanovo AdaNovo DeepNovo PointNovo Casanovo AdaNovo
Mouse 0.286 0.355 0.449 0.467 0.623 0.626 0.612 0.646
Human 0.293 0.351 0.343 0.373 0.610 0.606 0.585 0.618
Yeast 0.462 0.534 0.568 0.593 0.750 0.779 0.753 0.793
M. mazei 0.422 0.478 0.474 0.496 0.694 0.712 0.686 0.728
Honeybee 0.330 0.396 0.422 0.431 0.630 0.644 0.640 0.650
Tomato 0.454 0.513 0.463 0.530 0.731 0.733 0.720 0.740
Rice bean 0.436 0.511 0.549 0.546 0.679 0.730 0.727 0.719
Bacillus 0.449 0.518 0.513 0.528 0.742 0.768 0.718 0.739
Clam bacteria 0.253 0.298 0.347 0.372 0.602 0.589 0.617 0.642

5.1 Datasets

To assess AdaNovo’s performance, we employ the nine-species benchmark initially introduced by DeepNovo (Tran et al., 2017). This dataset amalgamates approximately 1.5 million mass spectra from nine distinct experiments, each employing the same instrument to scrutinize peptides from diverse species. Each spectrum is associated with a ground-truth peptide sequence, which comes from database search identification with a standard false discovery rate (FDR) set at 1%. Following the methodology of previous works (Tran et al., 2017; Qiao et al., 2021), we adopt a leave-one-out cross-validation framework. This entails training a model on eight species and testing it on the species held out for each of the nine species. The training set is split 90/10 for training and validation. This framework facilitates the testing of the model on peptide samples that have never been encountered before. Cross-species testing is of paramount importance for de novo sequencing models since practical applications often demand these models to excel in handling mass spectra featuring peptide sequences that have never been observed before.

5.2 Evaluation Metrics

In our assessment of model predictions, we employ precision calculated at both the amino acid and peptide levels, following methodologies presented by previous works (Ma et al., 2003; Tran et al., 2017). These precision metrics serve as performance measures, gauging the quality of a given model’s predictions based on coverage over the test set. For each spectrum, we compare the predicted sequence to the ground truth peptide obtained from the database search.

Consistent with DeepNovo (Tran et al., 2017), our approach to amino acid-level measures begins by calculating the number Nmatchasuperscriptsubscript𝑁match𝑎N_{\text{match}}^{a}italic_N start_POSTSUBSCRIPT match end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT of matched amino acid predictions. These are defined as predicted amino acids that (1) exhibit a mass difference of <0.1Daabsent0.1Da<0.1\mathrm{Da}< 0.1 roman_Da from the corresponding ground truth amino acid and (2) have either a prefix or suffix with a mass difference of no more than 0.5Da0.5Da0.5\mathrm{Da}0.5 roman_Da from the corresponding amino acid sequence in the ground truth peptide. Amino acid-level precision is then defined as Nmatch a/Npred asuperscriptsubscript𝑁match 𝑎superscriptsubscript𝑁pred 𝑎N_{\text{match }}^{a}/N_{\text{pred }}^{a}italic_N start_POSTSUBSCRIPT match end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT / italic_N start_POSTSUBSCRIPT pred end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT, where Npredasuperscriptsubscript𝑁pred𝑎N_{\text{pred}}^{a}italic_N start_POSTSUBSCRIPT pred end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT represents the number of predicted amino acids. Similarly, amino acids with PTMs identification precision can be formulated as Nmatch ptm/Npred ptmsuperscriptsubscript𝑁match 𝑝𝑡𝑚superscriptsubscript𝑁pred 𝑝𝑡𝑚N_{\text{match }}^{ptm}/N_{\text{pred }}^{ptm}italic_N start_POSTSUBSCRIPT match end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p italic_t italic_m end_POSTSUPERSCRIPT / italic_N start_POSTSUBSCRIPT pred end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p italic_t italic_m end_POSTSUPERSCRIPT, where Nmatch ptmsuperscriptsubscript𝑁match 𝑝𝑡𝑚N_{\text{match }}^{ptm}italic_N start_POSTSUBSCRIPT match end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p italic_t italic_m end_POSTSUPERSCRIPT and Npredptmsuperscriptsubscript𝑁pred𝑝𝑡𝑚N_{\text{pred}}^{ptm}italic_N start_POSTSUBSCRIPT pred end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p italic_t italic_m end_POSTSUPERSCRIPT denote the number of matched amino acids with PTMs and predicted amino acids with PTMs, respectively.

For peptide predictions, a predicted peptide is deemed a correct match only if all of its amino acids are matched. In a collection of Norig psuperscriptsubscript𝑁orig 𝑝N_{\text{orig }}^{p}italic_N start_POSTSUBSCRIPT orig end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT spectra, if our model provides predictions for a subset of Npred psuperscriptsubscript𝑁pred 𝑝N_{\text{pred }}^{p}italic_N start_POSTSUBSCRIPT pred end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT and accurately predicts Nmatch psuperscriptsubscript𝑁match 𝑝N_{\text{match }}^{p}italic_N start_POSTSUBSCRIPT match end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT peptides, coverage is defined as Npred p/Norig psuperscriptsubscript𝑁pred 𝑝superscriptsubscript𝑁orig 𝑝N_{\text{pred }}^{p}/N_{\text{orig }}^{p}italic_N start_POSTSUBSCRIPT pred end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT / italic_N start_POSTSUBSCRIPT orig end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT. Peptide-level precision is calculated as Nmatch p/Npred psuperscriptsubscript𝑁match 𝑝superscriptsubscript𝑁pred 𝑝N_{\text{match }}^{p}/N_{\text{pred }}^{p}italic_N start_POSTSUBSCRIPT match end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT / italic_N start_POSTSUBSCRIPT pred end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT.

To construct a precision-coverage curve, predictions are sorted based on the confidence score provided by the model. Amino acid-level confidence scores are derived by applying a softmax to the transformer decoder’s output, serving as a proxy for the probability of each predicted amino acid occurring at a specific position along the peptide sequence. AdaNovo provides amino acid-level confidence scores directly, and we utilize the mean score across all amino acids as a peptide-level confidence score.

5.3 Baselines

We compare AdaNovo with previous de novo peptide sequencing methods including DeepNovo (Tran et al., 2017), Casanovo (Yilmaz et al., 2022) and PointNovo (Qiao et al., 2021). We reproduce the results of Casanovo with the settings and hypermeters of the original paper and report the published results of DeepNovo and PointNovo as their pre-trained weights are unavailable.

5.4 Experimental Settings

The models in our AdaNovo are with 9 layers, embedding size d𝑑ditalic_d = 512, and 8 attention heads. We train the models with a batchsize of 32 PSMs and 105superscript10510^{-5}10 start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT weight decay. The learning rate is linearly increased from zero to 5×1045superscript1045\times 10^{-4}5 × 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT in 100k𝑘kitalic_k warm-up steps, followed by a cosine shaped decay. We train the models for 30 epochs and pick the model weights from the epoch with the lowest validation loss for testing. The hypermeters s1subscript𝑠1s_{1}italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and s2subscript𝑠2s_{2}italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT are tuned within the set {0.05,0.1,0.3}0.050.10.3\left\{0.05,0.1,0.3\right\}{ 0.05 , 0.1 , 0.3 }.

Refer to caption
(a) Human (Peptide-level)
Refer to caption
(b) Human (AA-level)
Refer to caption
(c) Clam bacteria (Peptide-level)
Refer to caption
(d) Clam bacteria (AA-level)
Figure 4: Precision-coverage curves for AdaNovo and Casanovo (AA-level: Amino acid-level). Peptide curves are generated by arranging predicted peptides based on their confidence scores. In the case of amino acid-level curves, all amino acids within a specific peptide are assigned equal scores. Both at the amino acid and peptide levels, peptides that meet the precursor m/z filtering criteria are prioritized over those that do not. Similarly, the ranking is applied to all amino acids within peptides that pass the precursor m/z filter compared to those that do not. The transition between unfiltered and filtered entries is denoted by a red star on each curve.

5.5 Main Results

AdaNovo outperforms state-of-the-art methods. As can be observed in Table 1, AdaNovo outperforms competitive models on most (8 out of 9) species in peptide-level precision compared to DeepNovo, PointNovo and CasaNovo. The peptide-level precision coverage curves (Figure 4) show that AdaNovo consistently outperforms Casanovo over a range of peptide confidence thresholds. This trend is also reflected by the area under the curve (AUC) metric. At amino acid-level, AdaNovo outperforms baselines on most datasets. As shown in Figure 4, the point on the AdaNovo curve corresponding to the filter lies above the casanovo precision-coverage curve, and Adanovo’s AUC consistently exceeds Casanovo’s.

Table 2: Empirical comparison of de novo sequencing models in terms of identifying amino acids with PTMs. The best and the second best results are highlighted bold and underlined, respectively.
Species PTMs precision
DeepNovo PointNovo Casanovo AdaNovo
Human 0.369 0.415 0.398 0.483
Rice bean 0.644 0.653 0.646 0.689
Clam bacteria 0.510 0.526 0.508 0.575
Bacillus 0.483 0.524 0.470 0.565

AdaNovo can accurately identify the amino acids with PTMs. As demonstrated in Table 2, we compare AdaNovo with other methods in terms of identifying amino acids with PTMs because AdaNovo is designed to accurately identify the amino acids with PTMs. The results in the table indicate that AdaNovo exceeds other competitors by significant margins in identifying amino acids with PTMs, verifying the effectiveness of the amino acid-level adaptive training strategy.

5.6 Ablation Study

Table 3: Ablations on amino acid-level (AA-level) and peptide-level adaptive training strategies. The results are for the Human test set, which is one of 9-species benchmark (Tran et al., 2017).
Model AA. Prec. Peptide Prec. PTMs Prec.
Casanovo 0.585 0.343 0.300
AdaNovo (w/o PSM-level MI) 0.607 0.360 0.478
AdaNovo (w/o AA-level CMI) 0.594 0.349 0.314
AdaNovo 0.618 0.373 0.483

Ablations on amino acid-level and peptide-level adaptive training strategies. To investigate the influence of the amino acid-level and peptide-level adaptive training strategies, we remove each of them from AdaNovo. The results shown in Table 3 indicate that both modules are necessary and effective for the AdaNovo model. Moreover, when we remove the AA-level training strategy in AdaNovo, the precision of the amino acids with PTMs identification drops significantly because the amino acid-level training strategy is designed for identifying amino acids with PTMs. Additionally, the PSM-level training strategy is designed for robustness against data noise, which we verify via the following experiments.

Table 4: Models’ Performance on mass spectrum dataset with synthetic noise. The results are for the Clam bacteria test set, which is one of 9-species benchmark (Tran et al., 2017).
Model AA. Prec. Peptide Prec.
CasaNovo 0.582 0.297
AdaNovo (w/o PSM-level MI) 0.586 0.311
AdaNovo (w/o AA-level CMI) 0.614 0.335
AdaNovo 0.621 0.342

Performance on mass spectra with synthetic noise. To verify the effectiveness of the PSM-level adaptive training strategy, we randomly choose 20% spectrum in the training datasets, and add synthetic noise peaks or remove original peaks with higher intensity values. We report the results in Table 4, from which we can observe that the performance would degrade sharply when we remove the PSM-level training strategy. This indicates that PSM-level adaptive training strategy can enhance models’ robustness against data noise in mass spectrum.

5.7 Comparisons with Alternative Methods for identifying amino acids with PTMs

Table 5: Comparisons with alternative methods in terms of identifying amino acids with PTMs. All results are for the yeast test set, which is one of 9-species benchmark (Tran et al., 2017).
Model AA. Prec. Peptide Prec.
Casanovo 0.753 0.568
  + Re-weight 0.762 0.576
  + Focal loss 0.745 0.543
AdaNovo (w/o PSM-level MI) 0.784 0.582
AdaNovo 0.793 0.593

In this section, we show the performance of AdaNovo only with amino acid-level loss (denoted as ‘ AdaNovo w/o PSM-level MI’) and compare to some alternative methods in terms of identifying amino acids with PTMs. The first alternative is to re-weight each amino acid yjsubscript𝑦𝑗y_{j}italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT with the following function,

wj=NtotalNyj,subscript𝑤𝑗subscript𝑁𝑡𝑜𝑡𝑎𝑙subscript𝑁subscript𝑦𝑗w_{j}=\frac{N_{total}}{N_{y_{j}}},italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = divide start_ARG italic_N start_POSTSUBSCRIPT italic_t italic_o italic_t italic_a italic_l end_POSTSUBSCRIPT end_ARG start_ARG italic_N start_POSTSUBSCRIPT italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG , (11)

where Ntotalsubscript𝑁𝑡𝑜𝑡𝑎𝑙N_{total}italic_N start_POSTSUBSCRIPT italic_t italic_o italic_t italic_a italic_l end_POSTSUBSCRIPT and Nyjsubscript𝑁subscript𝑦𝑗N_{y_{j}}italic_N start_POSTSUBSCRIPT italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT represent the total number of amino acids and the number of amino acids in the yjsubscript𝑦𝑗y_{j}italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT category in the dataset, respectively. The second alternative is the focal loss (Lin et al., 2017), we replace the cross entropy loss of Casanovo (Yilmaz et al., 2022) with the focal loss,

=(1αp(yj𝐱,𝐳,𝐲<j))γlogp(yj𝐱,𝐳,𝐲<j),superscript1𝛼𝑝conditionalsubscript𝑦𝑗𝐱𝐳subscript𝐲absent𝑗𝛾𝑝conditionalsubscript𝑦𝑗𝐱𝐳subscript𝐲absent𝑗\mathcal{L}=-(1-\alpha p(y_{j}\mid\mathbf{x},\mathbf{z},\mathbf{y}_{<j}))^{% \gamma}\log p(y_{j}\mid\mathbf{x},\mathbf{z},\mathbf{y}_{<j}),caligraphic_L = - ( 1 - italic_α italic_p ( italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_x , bold_z , bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT ) ) start_POSTSUPERSCRIPT italic_γ end_POSTSUPERSCRIPT roman_log italic_p ( italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_x , bold_z , bold_y start_POSTSUBSCRIPT < italic_j end_POSTSUBSCRIPT ) ,

where α𝛼\alphaitalic_α and γ𝛾\gammaitalic_γ are hyperparameters to adjust the loss weight. The results shown in Table 5 indicate that both AdaNovo and the first alternative can help improve Casanovo’s ability. Additionally, AdaNovo outperforms the alternatives by a notable margin probably because the training and testing datasets are derived from different species, there exists a significant difference in the distribution of PTMs quantities. Also, AdaNovo is inspired by the domain knowledge that the mass shift of PTMs only manifests in the mass spectra, thus shows superiority over the re-weighting methods in long-tailed classification.

5.8 Sensitivity Analysis

The effects of two hyperparameters s1 and s2, which determines the influence of amino acid-level and PSM-level training strategy can be seen in Appendix 5.

5.9 Costs of Computing and Storage

The comparison between AdaNovo and Casanovo regarding model parameters and runtime can be found in Appendix 6.

6 Conclusion and Future Work

In this paper, we discern challenges in existing methods related to identification of the amino acids with PTMs, exacerbated by spectrum data noise stemming from instrument malfunctions and contaminants. These challenges contribute to reduced precision in identification. To address these issues, we introduce a novel approach involving the calculation of conditional mutual information between the spectrum and each amino acid, followed by a re-weighting of each amino acid. Extensive experiments on widely-utilized 9-species datasets affirm that AdaNovo surpasses previous de novo sequencing methods, showcasing superior performance in both amino acid- and peptide-level precision. Notably, AdaNovo exhibits a distinct advantage in identifying amino acids with PTMs. In the future, we plan to train the AdaNovo model on more extensive PSM data, positioning it as a foundational model for mass spectrum-based proteomics.

7 Impact Statements

This paper presents work whose goal is to advance the field of machine learning for protein sequencing. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

  • Bartels (1990) Bartels, C. Fast algorithm for peptide sequencing by mass spectroscopy. Biomedical & environmental mass spectrometry, 19(6):363–368, 1990.
  • Dančík et al. (1999) Dančík, V., Addona, T. A., Clauser, K. R., Vath, J. E., and Pevzner, P. A. De novo peptide sequencing via tandem mass spectrometry. Journal of computational biology, 6(3-4):327–342, 1999.
  • Deribe et al. (2010) Deribe, Y. L., Pawson, T., and Dikic, I. Post-translational modifications in signal integration. Nature structural & molecular biology, 17(6):666–672, 2010.
  • Fischer et al. (2005) Fischer, B., Roth, V., Roos, F., Grossmann, J., Baginsky, S., Widmayer, P., Gruissem, W., and Buhmann, J. M. Novohmm: a hidden markov model for de novo peptide sequencing. Analytical chemistry, 77(22):7265–7273, 2005.
  • Frank & Pevzner (2005) Frank, A. and Pevzner, P. Pepnovo: de novo peptide sequencing via probabilistic network modeling. Analytical chemistry, 77(4):964–973, 2005.
  • Frank (2009) Frank, A. M. Predicting intensity ranks of peptide fragment ions. Journal of proteome research, 8(5):2226–2240, 2009.
  • Karunratanakul et al. (2019) Karunratanakul, K., Tang, H.-Y., Speicher, D. W., Chuangsuwanich, E., and Sriswasdi, S. Uncovering thousands of new peptides with sequence-mask-search hybrid de novo peptide sequencing framework. Molecular & Cellular Proteomics, 18(12):2478–2491, 2019.
  • Ke et al. (2021) Ke, G., He, D., and Liu, T.-Y. Rethinking positional encoding in language pre-training. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=09-528y2Fgf.
  • Lin et al. (2017) Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp.  2980–2988, 2017.
  • Ma (2015) Ma, B. Novor: real-time peptide de novo sequencing software. Journal of the American Society for Mass Spectrometry, 26(11):1885–1894, 2015.
  • Ma et al. (2003) Ma, B., Zhang, K., Hendrie, C., Liang, C., Li, M., Doherty-Kirby, A., and Lajoie, G. Peaks: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid communications in mass spectrometry, 17(20):2337–2342, 2003.
  • Mayer & Impens (2021) Mayer, R. L. and Impens, F. Immunopeptidomics for next-generation bacterial vaccine development. Trends in microbiology, 29(11):1034–1045, 2021.
  • Muth et al. (2013) Muth, T., Benndorf, D., Reichl, U., Rapp, E., and Martens, L. Searching for a needle in a stack of needles: challenges in metaproteomics data analysis. Molecular BioSystems, 9(4):578–585, 2013.
  • Qiao et al. (2021) Qiao, R., Tran, N. H., Xin, L., Chen, X., Li, M., Shan, B., and Ghodsi, A. Computationally instrument-resolution-independent de novo peptide sequencing for high-resolution devices. Nature Machine Intelligence, 3(5):420–425, 2021.
  • Ren et al. (2018) Ren, M., Zeng, W., Yang, B., and Urtasun, R. Learning to reweight examples for robust deep learning. In International conference on machine learning, pp.  4334–4343. PMLR, 2018.
  • SCIENCE (1977) SCIENCE, C.-M. U. P. P. D. O. C. Speech Understanding Systems. Summary of Results of the Five-Year Research Effort at Carnegie-Mellon University. 1977.
  • Taylor & Johnson (2001) Taylor, J. A. and Johnson, R. S. Implementation and uses of automated de novo peptide sequencing by tandem mass spectrometry. Analytical chemistry, 73(11):2594–2604, 2001.
  • Tran et al. (2017) Tran, N. H., Zhang, X., Xin, L., Shan, B., and Li, M. De novo peptide sequencing by deep learning. Proceedings of the National Academy of Sciences, 114(31):8247–8252, 2017.
  • VanDuijn et al. (2017) VanDuijn, M. M., Dekker, L. J., van IJcken, W. F., Sillevis Smitt, P. A., and Luider, T. M. Immune repertoire after immunization as seen by next-generation sequencing and proteomics. Frontiers in Immunology, 8:1286, 2017.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Vizcaíno et al. (2014) Vizcaíno, J. A., Deutsch, E. W., Wang, R., Csordas, A., Reisinger, F., Ríos, D., Dianes, J. A., Sun, Z., Farrah, T., Bandeira, N., et al. Proteomexchange provides globally coordinated proteomics data submission and dissemination. Nature biotechnology, 32(3):223–226, 2014.
  • Wolters et al. (2001) Wolters, D. A., Washburn, M. P., and Yates, J. R. An automated multidimensional protein identification technology for shotgun proteomics. Analytical chemistry, 73(23):5683–5690, 2001.
  • Wyner (1978) Wyner, A. D. A definition of conditional mutual information for arbitrary ensembles. Information and Control, 38(1):51–59, 1978.
  • Yilmaz et al. (2022) Yilmaz, M., Fondrie, W., Bittremieux, W., Oh, S., and Noble, W. S. De novo mass spectrometry peptide sequencing with a transformer model. In International Conference on Machine Learning, pp.  25514–25522. PMLR, 2022.
  • Zhang et al. (2023) Zhang, Y., Kang, B., Hooi, B., Yan, S., and Feng, J. Deep long-tailed learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.

Appendix A Costs of Computing and Storage

In this part, we compare AdaNovo with Casanovo in terms of the number of model parameters, training time and inference time. The results shown in Table  6. Compared to casanovo, AdaNovo introduced Peptide Decoder #2, resulting in a 40.04% increase in parameter count (from 47.35M to 66.31M). Similarly, under the same hardware settings (1 A100-SXM4-80GB and 32 CPU), training time increased by 7.3% (from 63.27M to 67.92M). However, the inference of AdaNovo is more efficient than CasaNovo.

Table 6: Comparisons with competitive methods in terms of computational overhead. The training and inference time are evaluated on Honeybee dataset, which is one of 9-species benchmark (Tran et al., 2017).
Model #Params (M) Training time (h) Inference time (h)
Casanovo 47.35 63.27 8.42
AdaNovo 66.31 67.92 6.02

Appendix B Sensitivity Analysis

In this section, we investigate the effects of the two hyperparameters s1subscript𝑠1s_{1}italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and s2subscript𝑠2s_{2}italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, which determines the influence of amino acid-level and PSM-level training strategy. As shown in Figure 5, we tune both s1subscript𝑠1s_{1}italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and s2subscript𝑠2s_{2}italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT within the range [0.05,0.1,0.3]0.050.10.3[0.05,0.1,0.3][ 0.05 , 0.1 , 0.3 ] and observe that the values of these two hyperparameters significantly affect the final performance of the model. Additionally, the optimal hyperparameters vary across different models, indicating differences in noise and the distributions of amino acids with PTMs among different datasets. It is necessary to finely adjust the values of s1subscript𝑠1s_{1}italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and s2subscript𝑠2s_{2}italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT based on the dataset, representing the balance between amino acid-level and PSM-level training strategies.

Refer to caption
(a) Human (Peptide-level)
Refer to caption
(b) Human (AA-level)
Figure 5: The effects of the two hyperparameters s1subscript𝑠1s_{1}italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and s2subscript𝑠2s_{2}italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT for adanovo. On the left are the peptide precision of the AdaNovo under different hyperparameter settings; on the right are the corresponding amino acid precision