FusionDTI: Fine-grained Binding Discovery with Token-level Fusion for Drug-Target Interaction

Zhaohan Meng
University of Glasgow
Zaiqiao Meng
University of Glasgow
Iadh Ounis
University of Glasgow
Corresponding Author.
Abstract

Predicting drug-target interaction (DTI) is critical in the drug discovery process. Despite remarkable advances in recent DTI models through the integration of representations from diverse drug and target encoders, such models often struggle to capture the fine-grained interactions between drugs and proteins, i.e. the binding of specific drug atoms (or substructures) and key amino acids of proteins, which is crucial for understanding the binding mechanisms and optimising drug design. To address this issue, this paper introduces a novel model, called FusionDTI, which uses a token-level Fusion module to effectively learn fine-grained information for Drug-Target Interaction. In particular, our FusionDTI model uses the SELFIES representation of drugs to mitigate sequence fragment invalidation and incorporates the structure-aware (SA) vocabulary of target proteins to address the limitation of amino acid sequences in structural information. It additionally leverages pre-trained language models, extensively trained on large-scale biomedical datasets, as encoders to capture the complex information of drugs and targets. Experiments on three well-known benchmark datasets show that our proposed FusionDTI model achieves the best performance in DTI prediction compared with seven existing state-of-the-art baselines. Furthermore, our case study indicates that FusionDTI could highlight the potential binding sites, enhancing the explainability of the DTI prediction. The complete code and datasets are available at: https://github.com/ZhaohanM/FusionDTI.

1 Introduction

The task of predicting drug-target interactions (DTI) plays a pivotal role in the drug discovery process, as it helps identify potential therapeutic effects of drugs on biological targets, facilitating the development of effective treatments [2]. DTI fundamentally relies on the binding of specific drug atoms (or substructures) and key amino acids of proteins [32]. In particular, each binding site is an interaction between a single amino acid and a single drug atom, which we refer to as a fine-grained interaction. For instance, Figure 1 B demonstrates the interaction between HIV-1 protease and the drug lopinavir. A critical component of this interaction is the formation of a hydrogen bond between a ketone group in lopinavir (represented in the SELFIES [19] notation as [C][=O]) and the side chain of an aspartate residue Asp25 (i.e. Dd) within the protease [5, 7]. Therefore, capturing such fine-grained interaction information during the fusion of drug and target representations is crucial for building effective DTI prediction models [44, 42, 27, 47].

Figure 1: A. An illustration of the FusionDTI model, which contains frozen encoders, the fusion module, and the classifier. The TF module focuses on fine-grained interactions between tokens within and across sequences. B. A token-level interaction instance of HIV-1 protease and lopinavir. Lopinavir forms a hydrogen bond with residue Dd (Asp25) in the active site of the protease via its ketone group ([C][=O]). C. The attention map of the TF module visualises the weights between tokens, indicating the contribution of each drug atom and residue to the final prediction result.

To obtain representations of drugs and targets for the DTI task, some previous studies [20, 25] have used graph neural networks (GNNs) or convolutional neural networks (CNNs) that operate over a fixed-size window, which can lose contextual information, especially when the drug and target sequences are long. These models directly concatenate the representations to make predictions without considering fine-grained interactions. More recently, some computational models [16, 3] have employed a fusion module (e.g. the Deep Interactive Inference Network (DIIN) [14] or the Bilinear Attention Network (BAN) [18]) to obtain fine-grained interaction information, together with the 3-mer approach, which binds three amino acids together as a target binding site to address the lack of structural information in the amino acid sequence. While useful for highlighting possible regions of interaction, these models do not offer the granularity needed to gauge the specifics of binding sites, as each binding site only contains one residue [32]. Therefore, obtaining contextual representations of drugs and targets and capturing fine-grained interaction information for DTI remains challenging.

To address these challenges, we propose a novel model (called FusionDTI) with a Token-level Fusion (TF) module for effectively learning fine-grained interactions between drugs and targets. In particular, our FusionDTI model utilises two pre-trained language models (PLMs): Saprot [33] as the protein encoder, which integrates residue tokens with structure tokens, and SELFormer [46] as the drug encoder, which ensures that each drug is valid and contains structural information. To effectively learn fine-grained information from these contextual representations of drugs and targets, we explore two strategies for the TF module, i.e. Bilinear Attention Network (BAN) [18] and Cross Attention Network (CAN) [21, 37], to find the best approach for integrating the rich contextual embeddings derived from Saprot and SELFormer. We conduct a comprehensive performance comparison against seven existing state-of-the-art DTI prediction models. The results show that our proposed model achieves about 6% accuracy improvement over the best baseline on the BindingDB dataset. The main contributions of our study are as follows:

  • We propose FusionDTI, a novel model that leverages PLMs to encode drug SELFIES and protein residue and structure for rich semantic representations and uses the token-level fusion to obtain fine-grained interaction information between drugs and targets effectively.

  • We compare two TF modules, CAN and BAN, and analyse the influence of fusion scales based on FusionDTI, demonstrating that CAN is superior for DTI prediction in terms of both effectiveness and efficiency.

  • We conduct a case study of three drug-target pairs using FusionDTI to evaluate whether potential binding sites are highlighted, which supports the explainability of the DTI prediction.

2 Related Work

2.1 Drug-target Interaction Prediction

DTI prediction serves as an important step in the process of drug discovery [10]. Traditional biomedical measurements from wet experiments are reliable but have a notably high cost and a time-consuming development cycle, preventing their application to large-scale data [49]. In contrast, identifying high-confidence DTI pairs with computational models markedly narrows down the search scope of drug candidate libraries and aims to identify the drugs most likely to bind to a target. Support vector machine (SVM) [9] and random forest (RF) [15] are two traditional computational models for DTI that concatenate ECFP4 fingerprints [29] and PSC features [6]. Later works focused on representation learning approaches, such as CNNs and GNNs [20, 25]. For example, DeepConv-DTI [20] employed CNNs and a global max-pooling layer to extract local protein sequence patterns. GraphDTA [25] used GNNs for drug graph encoding and CNNs for protein sequence encoding. More recently, MolTrans [16] introduced an adaptation of the transformer for encoding, further enhanced by a DIIN module [14] to learn fine-grained interactions. DrugBAN [3] incorporated a deep BAN [18] framework with domain adaptation to facilitate explicit pairwise fine-grained interaction learning between drugs and targets. In addition, BioT5 [26] has been proposed as a comprehensive pre-training framework that enriches cross-modal integration in biology and can be applied to the DTI task. Despite these advances, existing models still lack an effective way to capture fine-grained interaction information in DTI.

2.2 Drug and Protein Representation

For drug molecules, most existing methods represent the input using the Simplified Molecular Input Line Entry System (SMILES) [39, 40]. However, SMILES suffers from numerous problems in terms of validity and robustness, and some valuable information about the drug structure may be lost. This can prevent the model from efficiently mining the knowledge hidden in the data and reduce its predictive performance [19]. In particular, SMILES fragments are often invalid and inconsistent with the substructural information of the drug. To address the limitations of SMILES, we apply SELFIES [19], a string-based representation that circumvents the robustness issue and always generates valid molecular graphs for each character.

Regarding proteins, the conventional approach uses amino acid sequences as model inputs [16, 3], overlooking the crucial structural information of the protein. Saprot [33] addresses this with its SA vocabulary, which pairs each residue of the amino acid sequence with a 3D geometric feature obtained by encoding the structure information of the protein using Foldseek [35]. This combination offers richer protein representations, contributing to the discovery of fine-grained interactions. Our proposed model employs SELFIES for drug encoding and the Saprot SA encoding for proteins to generate semantic representations for both drugs and targets.

2.3 Molecular and Protein Language Models

Molecular language models, which are trained on large-scale molecular corpora to capture the subtleties of chemical structures and their biological activities, have set new standards in encoding chemical compounds into meaningful representations [45, 30]. For example, ChemBERTa-2 [1] used RoBERTa-based architectures to capture intricate molecular patterns, significantly enhancing the precision of property prediction. Subsequently, MoLFormer [31] focused on leveraging the self-attention mechanism to interpret the complex, non-linear interactions within molecules, while SELFormer [46] employed SELFIES, ensuring valid and interpretable chemical structures.

Protein language models have revolutionized the way we understand and represent protein sequences, offering richer semantic representations [11, 22, 33]. These models leverage the vast corpus of biological sequence data, learning intricate patterns and features that define protein functionality and interactions. ProtBERT [11] and ESM [22] applied a transformer architecture to protein sequences, capturing the complex relationships between amino acids. Saprot [33] further enhanced this approach by integrating the SA vocabulary to provide protein structure information. Furthermore, SaprotHub [34] offers a platform that enables biologists to train, deploy, and share protein models efficiently. Importantly, our FusionDTI is flexible enough to use any of them as a protein encoder.

3 Methodology

3.1 Model Architecture

Given a sequence-based input drug-target pair, the DTI prediction task aims to predict an interaction probability score $p \in [0,1]$ for the pair, which is typically achieved by learning a joint representation $\mathbf{F}$ from the given sequence-based inputs. To address the DTI task and effectively capture fine-grained interactions, we propose a novel model, called FusionDTI, which is a bi-encoder model [23] with a fusion module that fuses the representations of drugs and targets. The overall framework of FusionDTI is illustrated in Figure 1 A. In general, FusionDTI takes sequence-based inputs of drugs and targets, which are encoded into token-level representation vectors by two frozen encoders. Then, a fusion module fuses the representations to capture fine-grained binding information for a final prediction through a prediction head.

Input: The initial inputs of drugs and targets are string-based representations. For protein $\mathcal{P}$, the SA vocabulary [33, 35] is employed, where each residue is replaced by one of the 441 tokens of the SA vocabulary, each of which binds an amino acid to a 3D geometric feature to address the lack of structural information in amino acid sequences. For drug $\mathcal{D}$, as mentioned in the previous section, we use SELFIES, a formal syntax that always generates valid molecular graphs [19]. We provide the steps and code for obtaining SA and SELFIES sequences in Appendix 6.3.
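To make the SA input concrete, the following is a minimal sketch of how residue letters and structure letters could be paired into SA tokens. The two 21-symbol alphabets (20 letters plus a `#` placeholder, giving 21 × 21 = 441 tokens) and the helper names `build_sa_vocab` and `to_sa_tokens` are assumptions made for illustration; the actual pipeline is the one described in Appendix 6.3.

```python
# Assumed alphabets (not given in the text): 20 amino acids plus "#" as a
# placeholder, and 20 lower-case structure letters plus "#".
AA_ALPHABET = "ACDEFGHIKLMNPQRSTVWY#"
STRUCT_ALPHABET = "acdefghiklmnpqrstvwy#"


def build_sa_vocab():
    """Every (residue, structure) pair is one SA token: 21 * 21 = 441."""
    return [aa + st for aa in AA_ALPHABET for st in STRUCT_ALPHABET]


def to_sa_tokens(aa_seq, struct_seq):
    """Pair each residue with the structure letter at the same position."""
    if len(aa_seq) != len(struct_seq):
        raise ValueError("sequences must have equal length")
    vocab = set(build_sa_vocab())
    tokens = [aa + st.lower() for aa, st in zip(aa_seq, struct_seq)]
    unknown = [t for t in tokens if t not in vocab]
    if unknown:
        raise ValueError(f"tokens outside the SA vocabulary: {unknown}")
    return tokens


tokens = to_sa_tokens("MDV", "DPV")
print(tokens)  # ['Md', 'Dp', 'Vv']
```

Under this convention, a token such as Dd reads as an aspartate residue paired with the structure letter d, which matches the notation used for Asp25 in Figure 1.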

Encoder: The proposed model contains two frozen encoders, Saprot [33] and SELFormer [46], which generate a protein representation $\mathbf{P}$ and a drug representation $\mathbf{D}$, respectively. It is of note that FusionDTI is flexible enough to easily replace the encoders with other advanced PLMs. Furthermore, $\mathbf{D}$ and $\mathbf{P}$ are stored in memory for later-stage online training.
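Because the encoders are frozen, each sequence only needs to be encoded once. The sketch below illustrates one way such in-memory storage could work; the class `FrozenEncoderCache` and the stand-in `toy_encoder` are hypothetical and do not reflect the actual Saprot or SELFormer interfaces.

```python
class FrozenEncoderCache:
    """Memoise token-level embeddings of a frozen encoder by input string."""

    def __init__(self, encode_fn):
        self.encode_fn = encode_fn   # hypothetical: str -> list of vectors
        self.store = {}
        self.calls = 0               # number of real encoder invocations

    def get(self, sequence):
        if sequence not in self.store:
            self.store[sequence] = self.encode_fn(sequence)
            self.calls += 1
        return self.store[sequence]


def toy_encoder(sequence):
    # Stand-in for a pre-trained language model: one 2-d vector per character.
    return [[float(ord(ch)), float(i)] for i, ch in enumerate(sequence)]


cache = FrozenEncoderCache(toy_encoder)
first = cache.get("[C][=O]")
second = cache.get("[C][=O]")   # served from memory, encoder not called again
print(cache.calls, len(first))
```

Only the fusion module and the classifier then need gradient updates during training, since the cached representations never change.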

Fusion module: In developing FusionDTI, we have investigated two options for the fusion module, BAN and CAN, as indicated in Figure 2. CAN fuses each pair into $\mathbf{D}^{*}$ and $\mathbf{P}^{*}$ and then concatenates them into a single $\mathbf{F}$ that carries fine-grained binding information. For BAN, we first obtain the bilinear attention map and then generate $\mathbf{F}$ through the bilinear pooling layer.

Prediction head: Finally, we obtain the probability score $p$ of the DTI prediction by a multilayer perceptron (MLP) classifier trained with the binary cross-entropy loss, i.e. $p = \operatorname{MLP}(\mathbf{F})$.
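As an illustration of the prediction head, the sketch below computes $p = \operatorname{MLP}(\mathbf{F})$ with a single hidden layer and evaluates the binary cross-entropy loss using NumPy. The layer sizes, the ReLU hidden activation, and the random weights are assumptions chosen for the example, not the configuration used in the paper.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def mlp_head(F, W1, b1, W2, b2):
    """p = MLP(F): one hidden ReLU layer, then a sigmoid output in [0, 1]."""
    hidden = np.maximum(F @ W1 + b1, 0.0)
    return sigmoid(hidden @ W2 + b2).ravel()


def bce_loss(p, y, eps=1e-7):
    """Binary cross-entropy averaged over the batch."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


rng = np.random.default_rng(0)
batch, dim, hidden_dim = 4, 8, 16          # illustrative sizes
F = rng.normal(size=(batch, dim))          # joint representations
W1, b1 = rng.normal(size=(dim, hidden_dim)) * 0.1, np.zeros(hidden_dim)
W2, b2 = rng.normal(size=(hidden_dim, 1)) * 0.1, np.zeros(1)
y = np.array([1.0, 0.0, 1.0, 0.0])         # interaction labels

p = mlp_head(F, W1, b1, W2, b2)
loss = bce_loss(p, y)
print(p.shape, loss)
```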

Since the encoders and the fusion module constitute the key components of our FusionDTI model, we will describe them in detail in the following subsections.

3.2 Drug and Protein Encoders

Employing sequences with detailed biological functions and structures is a critical step in exploring the fine-grained binding of drugs and targets. For drugs, SMILES is the most commonly used input sequence but suffers from invalid sequence segments and potential loss of structural information [19]. To address these limitations, we transform SMILES into SELFIES, a formal grammar that generates a valid molecular graph for each element [19]. In addition, to address the lack of structural information in amino acid sequences, we utilise the SA sequence of targets, which pairs each amino acid with a structure token obtained by Foldseek [35].

PLMs based on transformers have shown promising achievements in the biomedical domain, since they attend to contextual information and are pre-trained on large-scale biomedical databases. Therefore, we utilise Saprot [33] as a protein encoder to encode the protein input $\mathcal{P}$, which covers both the SA sequence and the amino acid sequence. Meanwhile, SELFormer [46] is used as our drug encoder to encode the drug SELFIES input $\mathcal{D}$. The encoded protein representation $\mathbf{P}$ and drug representation $\mathbf{D}$ are then used as inputs for the later fusion module (Subsection 3.3). These rich contextual representations ensure that we can explore the fine-grained binding information effectively. To further justify this, we also compare our encoders with other existing protein language models (such as ESM-2b [22]) and molecular language models (such as MoLFormer [31] and ChemBERTa-2 [1]), and the results can be found in Section 4.7.

3.3 Fusion Module

Figure 2: BAN: In step 1, the bilinear attention map matrix is obtained by bilinear interaction modelling via transformation matrices. In step 2, the joint representation $\mathbf{F}$ is generated using the attention map by bilinear pooling via the shared transformation matrices $\mathbf{U}$ and $\mathbf{V}$. CAN: It fuses protein and drug representations through multi-head self-attention and cross-attention. The fused representations $\mathbf{P}^{*}$ and $\mathbf{D}^{*}$ are then concatenated into $\mathbf{F}$ after mean pooling.

In order to capture the fine-grained binding information between a drug and a target, our FusionDTI model applies a fusion module to learn token-level interactions between the token representations of drugs and targets encoded by their respective encoders. As shown in Figure 2, two fusion modules inspired by the recent literature [3, 43] are investigated to fuse representations: the Bilinear Attention Network [18] and the Cross Attention Network [21, 37].

3.3.1 Bilinear Attention Network (BAN)

Motivated by DrugBAN [3], our model considers BAN [18] as one option for the fusion module to learn pairwise fine-grained interactions between drug $\mathbf{D} \in \mathbb{R}^{M \times \phi}$ and target $\mathbf{P} \in \mathbb{R}^{N \times \rho}$; we denote this variant as FusionDTI-BAN. As indicated in Figure 2, bilinear attention maps are first obtained by bilinear interaction modelling to capture pairwise weights (step 1), and the bilinear pooling layer then extracts a joint representation $\mathbf{F}$ (step 2). The equation for BAN is shown below:

$$\mathbf{F} = \operatorname{BAN}(\mathbf{P}, \mathbf{D}; Att) = \mathrm{SumPool}\big(\sigma(\mathbf{P}^{\top}\mathbf{U}) \cdot Att \cdot \sigma(\mathbf{D}^{\top}\mathbf{V}),\, s\big), \tag{1}$$

where $\mathbf{U} \in \mathbb{R}^{N \times K}$ and $\mathbf{V} \in \mathbb{R}^{M \times K}$ are transformation matrices for the representations. $\mathrm{SumPool}$ performs a one-dimensional, non-overlapping sum pooling operation with stride $s$, and $\sigma(\cdot)$ denotes a non-linear activation function, here $\mathrm{ReLU}(\cdot)$. $Att \in \mathbb{R}^{\rho \times \phi}$ represents the bilinear attention maps, computed using the Hadamard product and matrix-matrix multiplication, and is defined as:

$$Att = \big((\mathbf{1} \cdot \mathbf{q}^{\top}) \circ \sigma(\mathbf{P}^{\top}\mathbf{U})\big) \cdot \sigma(\mathbf{V}^{\top}\mathbf{D}), \tag{2}$$

Here, $\mathbf{1} \in \mathbb{R}^{\rho}$ is a fixed all-ones vector, $\mathbf{q} \in \mathbb{R}^{K}$ is a learnable weight vector and $\circ$ denotes the Hadamard product. In this way, pairwise interactions contribute sub-structural pairs to the prediction.

BAN captures the token-level interactions between the protein and drug representations without considering the relationships within each sequence itself, which may limit its ability to understand deeper contextual dependencies.
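The BAN computation can be sketched in NumPy using the shapes stated above. Two points are assumptions of this sketch rather than statements from the text: Eq. 1 is read column by column, so that entry $k$ of the pooled vector is $\sigma(\mathbf{P}^{\top}\mathbf{U})_{k}^{\top} \cdot Att \cdot \sigma(\mathbf{D}^{\top}\mathbf{V})_{k}$ (the usual bilinear pooling form), and $K$ is taken to be divisible by the stride $s$. All sizes and weights are illustrative.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def ban_fusion(P, D, U, V, q, s):
    """Bilinear attention map (Eq. 2) followed by bilinear pooling (Eq. 1)."""
    P_proj = relu(P.T @ U)                 # (rho, K)
    D_proj = relu(D.T @ V)                 # (phi, K)
    # Eq. 2: (1 q^T) ∘ sigma(P^T U) rescales column k of P_proj by q[k].
    att = (P_proj * q) @ D_proj.T          # (rho, phi)
    # Eq. 1 read per column k: f_k = P_proj[:, k]^T @ att @ D_proj[:, k].
    f = np.einsum("ik,ij,jk->k", P_proj, att, D_proj)   # (K,)
    # Non-overlapping 1-D sum pooling with stride s.
    return f.reshape(-1, s).sum(axis=1), att


rng = np.random.default_rng(0)
N, rho, M, phi, K, s = 6, 5, 4, 3, 8, 2    # illustrative sizes
P = rng.normal(size=(N, rho))
D = rng.normal(size=(M, phi))
U = rng.normal(size=(N, K))
V = rng.normal(size=(M, K))
q = rng.normal(size=K)

F, att = ban_fusion(P, D, U, V, q, s)
print(F.shape, att.shape)  # (4,) (5, 3)
```

The returned attention map has one weight per (protein token, drug token) pair, which is the quantity that can be visualised to inspect pairwise contributions.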

3.3.2 Cross Attention Network (CAN)

Inspired by ProST [43], we also consider CAN as our fusion module to learn fine-grained interaction information of drugs and targets. We denote our FusionDTI model that uses a CAN fusion module as FusionDTI-CAN. By processing $\mathbf{D} \in \mathbb{R}^{m \times h}$ and $\mathbf{P} \in \mathbb{R}^{n \times h}$ separately, the fused drug $\mathbf{D}^{*} \in \mathbb{R}^{m \times h}$ and target $\mathbf{P}^{*} \in \mathbb{R}^{n \times h}$ representations are obtained. To synthesise the fine-grained joint representation $\mathbf{F}$, we employ a pooling aggregation strategy for $\mathbf{D}^{*}$ and $\mathbf{P}^{*}$ independently and then concatenate them as shown in Figure 2. The process is delineated by the following equation:

$$\mathbf{F} = \mathrm{Concat}\big((\mathrm{MeanPool}(\mathbf{D}^{*}, \text{dim}=1),\ \mathrm{MeanPool}(\mathbf{P}^{*}, \text{dim}=1)),\ 1\big), \tag{3}$$

where $\mathrm{MeanPool}$ calculates the element-wise mean of all tokens across the sequence dimension, and $\mathrm{Concat}$ denotes the concatenation of the resulting mean vectors. In this context, the multi-head self-attention and cross-attention mechanisms are used to refine the representations of each residue and atom as below:

$$\mathbf{D}^{*} = \frac{1}{2}\left[\textit{MHA}(\mathbf{Q}_{d}, \mathbf{K}_{d}, \mathbf{V}_{d}) + \textit{MHA}(\mathbf{Q}_{p}, \mathbf{K}_{d}, \mathbf{V}_{d})\right], \tag{4}$$
$$\mathbf{P}^{*} = \frac{1}{2}\left[\textit{MHA}(\mathbf{Q}_{p}, \mathbf{K}_{p}, \mathbf{V}_{p}) + \textit{MHA}(\mathbf{Q}_{d}, \mathbf{K}_{p}, \mathbf{V}_{p})\right], \tag{5}$$

where $\mathbf{Q}_{d}, \mathbf{K}_{d}, \mathbf{V}_{d} \in \mathbb{R}^{m \times h}$ and $\mathbf{Q}_{p}, \mathbf{K}_{p}, \mathbf{V}_{p} \in \mathbb{R}^{n \times h}$ are the queries, keys and values for the drug and the target protein, respectively, and MHA denotes the multi-head attention mechanism. Two distinct sets of projection matrices guide the attention mechanism as follows:

$$\mathbf{Q}_{d} = \mathbf{D}\mathbf{W}_{q}^{d}, \quad \mathbf{K}_{d} = \mathbf{D}\mathbf{W}_{k}^{d}, \quad \mathbf{V}_{d} = \mathbf{D}\mathbf{W}_{v}^{d}, \tag{6}$$
$$\mathbf{Q}_{p} = \mathbf{P}\mathbf{W}_{q}^{p}, \quad \mathbf{K}_{p} = \mathbf{P}\mathbf{W}_{k}^{p}, \quad \mathbf{V}_{p} = \mathbf{P}\mathbf{W}_{v}^{p}, \tag{7}$$

Here, the projection matrices $\mathbf{W}_{q}^{d}, \mathbf{W}_{k}^{d}, \mathbf{W}_{v}^{d} \in \mathbb{R}^{h \times h}$ and $\mathbf{W}_{q}^{p}, \mathbf{W}_{k}^{p}, \mathbf{W}_{v}^{p} \in \mathbb{R}^{h \times h}$ are used to derive the queries, keys and values, respectively.

In summary, our CAN module combines multi-head self-attention and cross-attention mechanisms to capture dependencies within individual sequences and between different sequences for a more nuanced understanding of interactions. In the results of Sections 4.3 and 4.5, we analyse and compare these two fusion strategies and different fusion scales in detail.
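The following NumPy sketch illustrates the CAN pipeline for a single unbatched drug-target pair: the projections of Eqs. 6 and 7, the averaged self- and cross-attention terms, and the mean pooling with concatenation of Eq. 3. One choice here is an assumption made for shape consistency and is not confirmed by the text: since the two terms being averaged must have the same number of rows, each fused representation takes its queries from the sequence being updated (drug queries for $\mathbf{D}^{*}$, protein queries for $\mathbf{P}^{*}$) and attends to the keys and values of either sequence. The number of heads, the sizes, and the random weights are illustrative.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def mha(Q, K, V, heads):
    """Scaled dot-product attention over `heads` equal feature slices."""
    h = Q.shape[1]
    d = h // heads
    out = np.empty((Q.shape[0], h))
    for i in range(heads):
        sl = slice(i * d, (i + 1) * d)
        weights = softmax(Q[:, sl] @ K[:, sl].T / np.sqrt(d))
        out[:, sl] = weights @ V[:, sl]
    return out


def can_fusion(D, P, Wd, Wp, heads=2):
    """Wd and Wp are (Wq, Wk, Wv) triples for the drug and the protein."""
    Qd, Kd, Vd = (D @ W for W in Wd)       # Eq. 6
    Qp, Kp, Vp = (P @ W for W in Wp)       # Eq. 7
    # Assumption: queries come from the sequence being updated, so the
    # self- and cross-attention terms have matching shapes.
    D_star = 0.5 * (mha(Qd, Kd, Vd, heads) + mha(Qd, Kp, Vp, heads))  # (m, h)
    P_star = 0.5 * (mha(Qp, Kp, Vp, heads) + mha(Qp, Kd, Vd, heads))  # (n, h)
    # Eq. 3 without a batch axis: mean over tokens, then concatenation.
    return np.concatenate([D_star.mean(axis=0), P_star.mean(axis=0)])


rng = np.random.default_rng(0)
m, n, h = 5, 7, 8                          # illustrative sizes
D = rng.normal(size=(m, h))
P = rng.normal(size=(n, h))
Wd = [rng.normal(size=(h, h)) * 0.1 for _ in range(3)]
Wp = [rng.normal(size=(h, h)) * 0.1 for _ in range(3)]

F = can_fusion(D, P, Wd, Wp)
print(F.shape)  # (16,)
```

The joint vector has length $2h$ regardless of the sequence lengths $m$ and $n$, which is what allows a fixed-size classifier to follow the fusion module.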

4 Experimental Setup and Results

4.1 Datasets and Baselines

Three public DTI datasets, namely BindingDB [13], BioSNAP [48] and Human [24, 8], are used for evaluation, where each dataset is randomly split into training, validation and test sets with a 7:1:2 ratio. Since DTI is a binary classification task, we use AUROC (area under the receiver operating characteristic curve) [3, 16] and AUPRC (area under the precision-recall curve) [20, 25] as the major metrics to evaluate a model's performance.
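The evaluation protocol can be illustrated with a short, dependency-free sketch. The function names, the fixed seed, and the toy labels and scores are hypothetical; AUPRC is estimated here by average precision, which is one common estimator and an assumption of this sketch (ties between scores are not treated specially in it).

```python
import random


def split_7_1_2(items, seed=0):
    """Shuffle and split into train/validation/test with a 7:1:2 ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 7 // 10
    n_val = len(items) // 10
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auprc(labels, scores):
    """Average precision: mean precision at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


train, val, test = split_7_1_2(range(100))
print(len(train), len(val), len(test))  # 70 10 20

labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
print(auroc(labels, scores))  # 0.5
print(auprc(labels, scores))
```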

We compare FusionDTI with seven baseline models in the DTI prediction task. These models include two traditional machine learning methods, SVM [9] and Random Forest (RF) [15], as well as four deep learning methods: DeepConv-DTI [20], GraphDTA [25], MolTrans [16] and DrugBAN [3]. The latter four models employ the same two-stage process, whereby the drug and target features are first extracted by specialised encoders and then integrated for prediction. In addition, we include the BioT5 [26] model, a biomedical pre-trained language model that can directly predict DTI. Further details regarding the datasets, baseline models, and the methodology for generating drug SELFIES and protein SA sequences are provided in Appendix 6.3.

4.2 Effectiveness Evaluation for DTI Prediction

Table 1: Performance comparison of FusionDTI and the baselines on the BindingDB, Human and BioSNAP datasets (**Best**, *Second Best*).

| Method | BindingDB AUROC | BindingDB AUPRC | BindingDB Accuracy | Human AUROC | Human AUPRC | BioSNAP AUROC | BioSNAP AUPRC | BioSNAP Accuracy |
|---|---|---|---|---|---|---|---|---|
| SVM | .939 ± .001 | .928 ± .002 | .825 ± .004 | .940 ± .006 | .920 ± .009 | .862 ± .007 | .864 ± .004 | .777 ± .011 |
| RF | .942 ± .011 | .921 ± .016 | .880 ± .012 | .952 ± .011 | .953 ± .010 | .860 ± .005 | .886 ± .005 | .804 ± .005 |
| DeepConv-DTI | .945 ± .002 | .925 ± .005 | .882 ± .007 | .980 ± .002 | .981 ± .002 | .886 ± .006 | .890 ± .006 | .805 ± .009 |
| GraphDTA | .951 ± .002 | .934 ± .002 | .888 ± .005 | .981 ± .001 | .982 ± .002 | .887 ± .008 | .890 ± .007 | .800 ± .007 |
| MolTrans | .952 ± .002 | .936 ± .001 | .887 ± .006 | .980 ± .002 | .978 ± .003 | .895 ± .004 | .897 ± .005 | .825 ± .010 |
| DrugBAN | .960 ± .001 | .948 ± .002 | .904 ± .004 | .982 ± .002 | .980 ± .003 | .903 ± .005 | .902 ± .004 | .834 ± .008 |
| BioT5 | .963 ± .001 | .952 ± .001 | .907 ± .003 | *.989 ± .001* | *.985 ± .002* | *.937 ± .001* | *.937 ± .004* | *.874 ± .001* |
| FusionDTI-BAN | *.975 ± .002* | *.976 ± .002* | *.933 ± .003* | .984 ± .002 | .984 ± .003 | .923 ± .002 | .921 ± .002 | .856 ± .001 |
| FusionDTI-CAN | **.989 ± .002** | **.990 ± .002** | **.961 ± .002** | **.991 ± .002** | **.989 ± .002** | **.951 ± .002** | **.951 ± .002** | **.889 ± .002** |
Figure 3: Performance comparison of two fusion strategies: BAN and CAN on BindingDB.
Figure 4: Time comparison on the BindingDB, Human and BioSNAP datasets.

We start by comparing our FusionDTI model (FusionDTI-CAN and FusionDTI-BAN) with seven existing state-of-the-art baselines for DTI prediction on three widely used datasets. Table 1 reports the comparative results. Overall, FusionDTI-CAN performs best on every metric across all three datasets. Its results are particularly strong on the BindingDB dataset, where it achieves an AUROC of 0.989, an AUPRC of 0.990, and an accuracy of 96.1%. Note that the main difference between FusionDTI-CAN and the other models is the fusion strategy. Furthermore, although FusionDTI-BAN and DrugBAN share the same BAN module, FusionDTI-BAN performs better across all three datasets. These results highlight not only the marked improvement of FusionDTI over other models on the BindingDB dataset but also its effectiveness in capturing fine-grained information on DTI. We consider the fine-grained interactions for each drug-target pair in the DTI prediction task, which is why FusionDTI uses the token-level fusion module. Our FusionDTI method is highly aligned with biomedical pathways, in which the binding process relates to a specific atom or substructure interacting with a residue. Therefore, fine-grained interaction information effectively improves the performance of models in predicting DTI.

4.3 Comparison of the BAN and CAN Fusion Modules

Two fusion strategies are available, BAN and CAN, so determining which one works better is a key step in establishing FusionDTI’s prediction effectiveness. We perform a fair comparison using the same encoders, classifier and dataset. As shown in Figure 3, we compare BAN and CAN by employing two linear layers to adjust the feature dimensions of the drug and target representations. As the feature dimension increases, the performance of FusionDTI-CAN continues to rise, while that of FusionDTI-BAN reaches a plateau. At a feature dimension of 512, both variants attain their peaks, with an AUC of 0.989 and 0.967, respectively. These results indicate that the CAN module is better suited to DTI prediction and to capturing fine-grained interaction information. In contrast, BAN may not fully capture fine-grained binding information between proteins and drugs, such as the specific interactions between drug atoms and residues. These findings therefore suggest that the CAN strategy is more effective and more adaptable to the complexities of DTI prediction, especially as the feature dimension scales.
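For intuition about what a token-level attention fusion computes, the following is a minimal sketch of single-head scaled dot-product cross-attention in plain Python, in which drug tokens act as queries over protein tokens. It is a generic illustration and not the CAN module itself: the function names, the toy vectors and the single-head, projection-free setup are all assumptions made for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (illustrative only).

    queries: m x h list (e.g. drug tokens)
    keys, values: n x h lists (e.g. protein tokens)
    Returns (fused, weights) with shapes m x h and m x n.
    """
    h = len(queries[0])
    scale = 1.0 / math.sqrt(h)
    weights = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights.append(softmax(scores))
    fused = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(h)]
        for row in weights
    ]
    return fused, weights


# Toy input (made-up numbers): 2 drug tokens, 3 protein tokens, hidden size 2.
drug_tokens = [[1.0, 0.0], [0.0, 1.0]]
protein_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused_drug, attn = cross_attention(drug_tokens, protein_tokens, protein_tokens)
```

Each row of `attn` is a probability distribution over protein tokens for one drug token, which is the kind of per-token weight that an attention map visualises.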

4.4 Efficiency Analysis

Efficiency is crucial for computational models, particularly when handling large-scale datasets in drug discovery. Our proposed model stores the drug and target representations in memory for later online training. As evidenced by Figure 4, FusionDTI-CAN and FusionDTI-BAN with pre-encoded representations process the BindingDB dataset much faster than the non-pre-encoded models, approximately 45 minutes and 220 minutes, respectively. This stark difference highlights the advantage of pre-encoding, which eliminates the need for real-time data processing and increases the overall throughput. While FusionDTI-BAN and DrugBAN have the same fusion module, the pre-encoded FusionDTI-BAN runs faster and predicts more accurately, as shown in Table 1. In addition, FusionDTI-BAN runs faster than FusionDTI-CAN, indicating that the BAN fusion module is more efficient. Ultimately, FusionDTI-BAN with pre-encoded data stands out as a highly efficient approach, offering substantial benefits in scenarios involving large-scale data. We further analyse the time complexity in Appendix 6.6.
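The idea of pre-encoding can be illustrated with a small caching sketch: each unique drug or target sequence is encoded once, and every later epoch reads the stored representation. The encoder below is a hypothetical stand-in rather than a real PLM, and the class name `PreEncodedStore` and the toy pairs are assumptions for illustration only.

```python
def toy_encoder(sequence):
    """Hypothetical stand-in for a frozen PLM: one small vector per character."""
    return [[float(ord(ch) % 7), float(len(sequence))] for ch in sequence]


class PreEncodedStore:
    """Caches encoder outputs so each unique sequence is encoded only once."""

    def __init__(self, encoder):
        self.encoder = encoder
        self.cache = {}
        self.encoder_calls = 0

    def get(self, sequence):
        if sequence not in self.cache:
            self.cache[sequence] = self.encoder(sequence)
            self.encoder_calls += 1
        return self.cache[sequence]


# Made-up drug-target pairs; the same drug and target recur across pairs.
pairs = [("[C][=O]", "MdKv"), ("[C][=O]", "AaCc"), ("[C][C]", "MdKv")]
drug_store = PreEncodedStore(toy_encoder)
target_store = PreEncodedStore(toy_encoder)

for epoch in range(3):
    for drug, target in pairs:
        drug_repr = drug_store.get(drug)
        target_repr = target_store.get(target)
        # A fusion module and classifier would consume the two
        # representations here; the encoders are never called again.
```

In this toy run the encoder is invoked twice per store, although nine pair visits occur over three epochs, which is where the saving comes from when the encoders are large.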

4.5 Ablation Study

Figure 5: Performance evaluation of fusion scales.
Figure 6: Performance comparison of protein encoders.
Figure 7: Performance comparison of drug encoders.

The fine-grained interaction of drug and target representations is critical in DTI, as it directly affects the model’s ability to infer potential binding sites. In FusionDTI, this interaction is handled by the CAN module, which markedly improves the predictive accuracy by capturing fine-grained interaction information between drugs and targets. Table 2 shows the impact of the CAN module on the prediction performance on the BindingDB dataset. When the fusion module is omitted, the model achieves an AUC of 0.954 and an accuracy of 0.894. With the CAN module, the AUC increases to 0.989 and the accuracy reaches 0.961. This highlights the effectiveness of the CAN module in improving the inference ability of FusionDTI. Additionally, in Table 3, we compare the performance of two aggregation strategies within the CAN module. The pooling strategy outperforms the CLS-based aggregation, achieving an AUC and AUPRC of 0.989 and 0.990, respectively. This comparison highlights the effectiveness of pooling in aggregating contextual information. Thus, integrating a CAN module, particularly with a pooling aggregation strategy, is shown to be essential for making confident and accurate predictions.

Table 2: Ablation study of FusionDTI on the BindingDB dataset.
CAN AUC AUPRC Accuracy
× 0.954 0.963 0.894
✓ 0.989 0.990 0.961
Table 3: Comparison of aggregation strategies for CAN on the BindingDB dataset.
Aggregation AUC AUPRC Accuracy
CLS 0.982 0.983 0.956
Pooling 0.989 0.990 0.961
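The two aggregation strategies compared in Table 3 can be sketched as follows. This is an illustrative sketch rather than the released implementation: the function names are hypothetical, and the optional padding mask in the mean-pooling variant is an assumption that the text does not state.

```python
def cls_aggregate(token_vectors):
    """Use the first token's vector (the [CLS] position) as the summary."""
    return list(token_vectors[0])


def mean_pool(token_vectors, mask=None):
    """Average token vectors, skipping positions where mask is 0 (padding)."""
    if mask is None:
        mask = [1] * len(token_vectors)
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("mask removes every token")
    dim = len(kept[0])
    return [sum(vec[j] for vec in kept) / len(kept) for j in range(dim)]


# Toy fused representation: 3 real tokens followed by 1 padding position.
fused = [[0.0, 2.0], [2.0, 4.0], [4.0, 0.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
cls_summary = cls_aggregate(fused)     # depends on the first position only
mean_summary = mean_pool(fused, mask)  # draws on every unmasked token
```

The CLS summary reads a single position, whereas mean pooling combines every unmasked token, which is one plausible reason why pooling retains more contextual information.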

4.6 Analysis of Fusion Scales

In assessing fusion representations, it is critical to determine whether more fine-grained modelling enhances the predictive performance. Thus, we define a grouping function with the parameter g (group size) that averages the tokens in each group before the CAN fusion module. The parameter g, the number of tokens per group, controls the granularity of the attention mechanism. Specifically, when g is set to 1, the fusion operates at the token level, where each token is considered independently. When g is set to 512, the fusion operates at the global level. The fusion scale can be controlled flexibly for the drug and protein representations, provided that the token length is divisible by the group size. As shown in Figure 5, as the number of tokens per group increases from 1 to 512 (the maximum token length), the performance of FusionDTI decreases accordingly. This aligns with the biomedical rules governing drug-protein interactions, where the principal factor influencing binding is the interplay between the key atoms or substructures in the drug and the primary residues in the protein. In addition, the CAN module consistently outperforms BAN at the various scale settings, indicating that CAN better accesses the information between the drug and target. Consequently, these results support the view that the more detailed the interaction information obtained by the fusion module, the greater the benefit to the model’s prediction performance.
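A minimal sketch of such a grouping function is given below, assuming that each token is a plain list of floats and that groups are formed from consecutive tokens. The name `group_tokens` and the consecutive-token grouping are assumptions for illustration.

```python
def group_tokens(token_vectors, g):
    """Average every g consecutive token vectors into one group vector."""
    length = len(token_vectors)
    if g < 1 or length % g != 0:
        raise ValueError("token length must be divisible by group size g")
    dim = len(token_vectors[0])
    grouped = []
    for start in range(0, length, g):
        chunk = token_vectors[start:start + g]
        grouped.append([sum(vec[j] for vec in chunk) / g for j in range(dim)])
    return grouped


# Toy sequence of 4 tokens with hidden size 2 (made-up numbers).
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
token_level = group_tokens(tokens, 1)   # unchanged: one group per token
pair_level = group_tokens(tokens, 2)    # two groups of two tokens
global_level = group_tokens(tokens, 4)  # one group: the sequence mean
```

With g equal to the sequence length, the fusion module sees a single averaged vector per sequence, so no per-token interaction can be recovered.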

4.7 Evaluation of PLMs Encoding

The protein encoder and drug encoder are fundamental to the token-level fusion of representations, as they are responsible for generating the fine-grained representations from which interaction information is learned. Our proposed model employs two PLMs to encode the two biomedical entities, the drug and the protein. For the protein encoders, Figure 6 compares the performance of two protein encoders (Saprot [33] and ESM-2b [22]) in combination with three different drug encoders: ChemBERTa-2 [1], SELFormer [46] and MoLFormer [31]. From the figure, we find that Saprot consistently outperforms ESM-2b when combined with each of the three drug encoders. As can be seen in Figure 7, SELFormer achieves the best performance in encoding the drug sequences among the three advanced drug encoders. Notably, the top-performing combination is Saprot and SELFormer, hence our proposed FusionDTI uses them as the protein and drug encoders, respectively.

4.8 Case Study

A further strength of FusionDTI is that it enables explainability, which is critical for drug design efforts, by visualising each token’s contribution to the final prediction through cross-attention maps. To compare with the DrugBAN model, we examine the same three DTI pairs from the Protein Data Bank (PDB) [4]: EZL - 6QL2 [17], 9YA - 5W8L [28] and EJ4 - 4N6H [12]. As shown in Table 4, our proposed model predicts additional binding sites (in bold) that are evidenced by PDB [4], in comparison to the DrugBAN model. For instance, for the interaction of the drug EZL with the target 6QL2, our proposed model highlights potential binding sites, visualised with BertViz [38], as illustrated in Figure 8. Our CAN module is effective in capturing fine-grained binding information at the token level. In particular, we address the lack of structural information in protein sequences by employing the SA vocabulary, which matches each residue to a corresponding 3D feature via Foldseek [35]. This study highlights the effectiveness of FusionDTI in enhancing performance on the DTI task, thereby supporting more targeted and efficient drug development efforts. In Section 6.5 of the Appendix, we present three pairs of prediction visualisations.

Table 4: FusionDTI predictions: Bold represents new predictions versus DrugBAN.
Drug-Target Interactions
EZL - 6QL2:
1. sulfonamide oxygen - Leu198, Thr199 and Trp209;
2. amino group - His94, His96, His119 and Thr199;
3. benzothiazole ring - Leu198, Thr200, Tyr131, and Pro201;
4. ethoxy group - Gln135;
9YA - 5W8L:
1. amino group of sulfonamide - Asp140, Glu191;
2. sulfonamide oxygen - Asp140, Ile141 and Val139;
3. carboxylic acid oxygens - Arg168, His192, Asp194 and Thr247;
4. biphenyl rings - Arg105, Asn137 and Pro138;
5. hydrophobic contact - Ala237, Tyr238 and Leu322;
EJ4 - 4N6H:
1. basic nitrogen of ligand - Asp128;
2. hydrophobic pocket - Tyr308, Ile304 and Tyr129;
3. water molecules - Tyr129, Met132, Trp274, Tyr308 and Lys214;
Figure 8: EZL - 6QL2: Fine-grained interactions via attention visualization.

5 Conclusions

With the rapid increase of new diseases and the urgent need for innovative drugs, it is critical to capture and gauge fine-grained interactions, since the binding of specific drug atoms to the main amino acids is key to the DTI task. Despite some achievements, fine-grained interaction information is still not captured effectively. To address this challenge, we introduce FusionDTI, which uses token-level fusion to effectively obtain fine-grained interaction information between drugs and targets.
Limitations: Even if our proposed model identifies potentially useful DTI, these predictions need to be validated by wet-lab experiments, a time-consuming and expensive process.
Potential impacts: We have shown that FusionDTI is effective and efficient in screening for possible DTI in large-scale data as well as in locating potential binding sites in the process of drug design. However, it is not directly applicable to human medical therapy and other biomedical interactions because it lacks clinical validation and regulatory approval for medical use.
For future studies, we aim to investigate token-level interaction in more detail and to apply it to other biomedical scenarios, such as drug-drug interactions and protein-protein interactions.

References

  • Ahmad et al. [2022] Walid Ahmad, Elana Simon, Seyone Chithrananda, Gabriel Grand, and Bharath Ramsundar. Chemberta-2: Towards chemical foundation models. arXiv preprint arXiv:2209.01712, 2022.
  • Askr et al. [2023] Heba Askr, Enas Elgeldawi, Heba Aboul Ella, Yaseen AMM Elshaier, Mamdouh M Gomaa, and Aboul Ella Hassanien. Deep learning in drug discovery: an integrative review and future challenges. Artificial Intelligence Review, 56(7):5975–6037, 2023.
  • Bai et al. [2023] Peizhen Bai, Filip Miljković, Bino John, and Haiping Lu. Interpretable bilinear attention network with domain adaptation improves drug–target prediction. Nature Machine Intelligence, 5(2):126–136, 2023.
  • Berman et al. [2007] Helen Berman, Kim Henrick, Haruki Nakamura, and John L Markley. The worldwide protein data bank (wwpdb): ensuring a single, uniform archive of pdb data. Nucleic acids research, 35(suppl_1):D301–D303, 2007.
  • Brik and Wong [2003] Ashraf Brik and Chi-Huey Wong. Hiv-1 protease: mechanism and drug discovery. Organic & biomolecular chemistry, 1(1):5–14, 2003.
  • Cao et al. [2013] Dong-Sheng Cao, Qing-Song Xu, and Yi-Zeng Liang. propy: a tool to generate various modes of chou’s pseaac. Bioinformatics, 29(7):960–962, 2013.
  • Chandwani and Shuter [2008] Ashish Chandwani and Jonathan Shuter. Lopinavir/ritonavir in the treatment of hiv-1 infection: a review. Therapeutics and clinical risk management, 4(5):1023–1033, 2008.
  • Chen et al. [2020] Lifan Chen, Xiaoqin Tan, Dingyan Wang, Feisheng Zhong, Xiaohong Liu, Tianbiao Yang, Xiaomin Luo, Kaixian Chen, Hualiang Jiang, and Mingyue Zheng. Transformercpi: improving compound–protein interaction prediction by sequence-based deep learning with self-attention mechanism and label reversal experiments. Bioinformatics, 36(16):4406–4414, 2020.
  • Cortes and Vapnik [1995] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20:273–297, 1995.
  • Dara et al. [2022] Suresh Dara, Swetha Dhamercherla, Surender Singh Jadav, CH Madhu Babu, and Mohamed Jawed Ahsan. Machine learning in drug discovery: a review. Artificial Intelligence Review, 55(3):1947–1999, 2022.
  • Elnaggar et al. [2021] Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rehawi, Wang Yu, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, Debsindhu Bhowmik, and Burkhard Rost. Prottrans: Towards cracking the language of lifes code through self-supervised deep learning and high performance computing. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2021. doi: 10.1109/TPAMI.2021.3095381.
  • Fenalti et al. [2014] Gustavo Fenalti, Patrick M Giguere, Vsevolod Katritch, Xi-Ping Huang, Aaron A Thompson, Vadim Cherezov, Bryan L Roth, and Raymond C Stevens. Molecular control of δ-opioid receptor signalling. Nature, 506(7487):191–196, 2014.
  • Gilson et al. [2016] Michael K Gilson, Tiqing Liu, Michael Baitaluk, George Nicola, Linda Hwang, and Jenny Chong. Bindingdb in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic acids research, 44(D1):D1045–D1053, 2016.
  • Gong et al. [2018] Yichen Gong, Heng Luo, and Jian Zhang. Natural language inference over interaction space. International Conference on Learning Representations, 2018.
  • Ho [1995] Tin Kam Ho. Random decision forests. In Proceedings of 3rd international conference on document analysis and recognition, volume 1, pages 278–282. IEEE, 1995.
  • Huang et al. [2021] Kexin Huang, Cao Xiao, Lucas M Glass, and Jimeng Sun. Moltrans: molecular interaction transformer for drug–target interaction prediction. Bioinformatics, 37(6):830–836, 2021.
  • Kazokaitė et al. [2019] Justina Kazokaitė, Visvaldas Kairys, Joana Smirnovienė, Alexey Smirnov, Elena Manakova, Martti Tolvanen, Seppo Parkkila, and Daumantas Matulis. Engineered carbonic anhydrase vi-mimic enzyme switched the structure and affinities of inhibitors. Scientific reports, 9(1):12710, 2019.
  • Kim et al. [2018] Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear attention networks. Advances in neural information processing systems, 31, 2018.
  • Krenn et al. [2022] Mario Krenn, Qianxiang Ai, Senja Barthel, Nessa Carson, Angelo Frei, Nathan C Frey, Pascal Friederich, Théophile Gaudin, Alberto Alexander Gayle, Kevin Maik Jablonka, et al. Selfies and the future of molecular string representations. Patterns, 3(10), 2022.
  • Lee et al. [2019] Ingoo Lee, Jongsoo Keum, and Hojung Nam. Deepconv-dti: Prediction of drug-target interactions via deep learning with convolution on protein sequences. PLoS computational biology, 15(6):e1007129, 2019.
  • Li et al. [2021] Peizhao Li, Jiuxiang Gu, Jason Kuen, Vlad I Morariu, Handong Zhao, Rajiv Jain, Varun Manjunatha, and Hongfu Liu. Selfdoc: Self-supervised document representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5652–5660, 2021.
  • Lin et al. [2023] Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023.
  • Liu et al. [2021] Fangyu Liu, Yunlong Jiao, Jordan Massiah, Emine Yilmaz, and Serhii Havrylov. Trans-encoder: Unsupervised sentence-pair modelling through self-and mutual-distillations. In International Conference on Learning Representations, 2021.
  • Liu et al. [2015] Hui Liu, Jianjiang Sun, Jihong Guan, Jie Zheng, and Shuigeng Zhou. Improving compound–protein interaction prediction by building up highly credible negative samples. Bioinformatics, 31(12):i221–i229, 2015.
  • Nguyen et al. [2021] Thin Nguyen, Hang Le, Thomas P Quinn, Tri Nguyen, Thuc Duy Le, and Svetha Venkatesh. Graphdta: predicting drug–target binding affinity with graph neural networks. Bioinformatics, 37(8):1140–1147, 2021.
  • Pei et al. [2023] Qizhi Pei, Wei Zhang, Jinhua Zhu, Kehan Wu, Kaiyuan Gao, Lijun Wu, Yingce Xia, and Rui Yan. BioT5: Enriching cross-modal integration in biology with chemical knowledge and natural language associations. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1102–1123, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.70.
  • Peng et al. [2024] Lihong Peng, Xin Liu, Long Yang, Longlong Liu, Zongzheng Bai, Min Chen, Xu Lu, and Libo Nie. Bindti: A bi-directional intention network for drug-target interaction identification based on attention mechanisms. IEEE Journal of Biomedical and Health Informatics, 2024.
  • Rai et al. [2017] Ganesha Rai, Kyle R Brimacombe, Bryan T Mott, Daniel J Urban, Xin Hu, Shyh-Ming Yang, Tobie D Lee, Dorian M Cheff, Jennifer Kouznetsova, Gloria A Benavides, et al. Discovery and optimization of potent, cell-active pyrazole-based inhibitors of lactate dehydrogenase (ldh). Journal of medicinal chemistry, 60(22):9184–9204, 2017.
  • Rogers and Hahn [2010] David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of chemical information and modeling, 50(5):742–754, 2010.
  • Rong et al. [2020] Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. Self-supervised graph transformer on large-scale molecular data. Advances in neural information processing systems, 33:12559–12571, 2020.
  • Ross et al. [2022] Jerret Ross, Brian Belgodere, Vijil Chenthamarakshan, Inkit Padhi, Youssef Mroueh, and Payel Das. Large-scale chemical language representations capture molecular structure and properties. Nature Machine Intelligence, 4(12):1256–1264, 2022.
  • Schenone et al. [2013] Monica Schenone, Vlado Dančík, Bridget K Wagner, and Paul A Clemons. Target identification and mechanism of action in chemical biology and drug discovery. Nature chemical biology, 9(4):232–240, 2013.
  • Su et al. [2023] Jin Su, Chenchen Han, Yuyang Zhou, Junjie Shan, Xibin Zhou, and Fajie Yuan. Saprot: protein language modeling with structure-aware vocabulary. Advances in neural information processing systems, pages 2023–10, 2023.
  • Su et al. [2024] Jin Su, Zhikai Li, Chenchen Han, Yuyang Zhou, Junjie Shan, Xibin Zhou, Dacheng Ma, The OPMC, Sergey Ovchinnikov, and Fajie Yuan. Saprothub: Making protein modeling accessible to all biologists. bioRxiv, pages 2024–05, 2024.
  • Van Kempen et al. [2024] Michel Van Kempen, Stephanie S Kim, Charlotte Tumescheit, Milot Mirdita, Jeongjae Lee, Cameron LM Gilchrist, Johannes Söding, and Martin Steinegger. Fast and accurate protein structure search with foldseek. Nature Biotechnology, 42(2):243–246, 2024.
  • Varadi et al. [2022] Mihaly Varadi, Stephen Anyango, Mandar Deshpande, Sreenath Nair, Cindy Natassia, Galabina Yordanova, David Yuan, Oana Stroe, Gemma Wood, Agata Laydon, et al. Alphafold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic acids research, 50(D1):D439–D444, 2022.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Vig [2019] Jesse Vig. A multiscale visualization of attention in the transformer model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 37–42, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-3007. URL https://www.aclweb.org/anthology/P19-3007.
  • Weininger [1988] David Weininger. Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of chemical information and computer sciences, 28(1):31–36, 1988.
  • Weininger et al. [1989] David Weininger, Arthur Weininger, and Joseph L Weininger. Smiles. 2. algorithm for generation of unique smiles notation. Journal of chemical information and computer sciences, 29(2):97–101, 1989.
  • Wishart et al. [2008] David S Wishart, Craig Knox, An Chi Guo, Dean Cheng, Savita Shrivastava, Dan Tzur, Bijaya Gautam, and Murtaza Hassanali. Drugbank: a knowledgebase for drugs, drug actions and drug targets. Nucleic acids research, 36(suppl_1):D901–D906, 2008.
  • Wu et al. [2022] Yifan Wu, Min Gao, Min Zeng, Jie Zhang, and Min Li. Bridgedpi: a novel graph neural network for predicting drug–protein interactions. Bioinformatics, 38(9):2571–2578, 2022.
  • Xu et al. [2023] Minghao Xu, Xinyu Yuan, Santiago Miret, and Jian Tang. Protst: Multi-modality learning of protein sequences and biomedical texts. In International Conference on Machine Learning, pages 38749–38767. PMLR, 2023.
  • Yazdani-Jahromi et al. [2022] Mehdi Yazdani-Jahromi, Niloofar Yousefi, Aida Tayebi, Elayaraja Kolanthai, Craig J Neal, Sudipta Seal, and Ozlem Ozmen Garibay. Attentionsitedti: an interpretable graph-based model for drug-target interaction prediction using nlp sentence-level relation classification. Briefings in Bioinformatics, 23(4):bbac272, 2022.
  • Ying et al. [2021] Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and Tie-Yan Liu. Do transformers really perform badly for graph representation? Advances in neural information processing systems, 34:28877–28888, 2021.
  • Yüksel et al. [2023] Atakan Yüksel, Erva Ulusoy, Atabey Ünlü, and Tunca Doğan. Selformer: molecular representation learning via selfies language models. Machine Learning: Science and Technology, 4(2):025035, 2023.
  • Zeng et al. [2024] Xiaoting Zeng, Weilin Chen, and Baiying Lei. Cat-dti: cross-attention and transformer network with domain adaptation for drug-target interaction prediction. BMC bioinformatics, 25(1):141, 2024.
  • Zitnik et al. [2018] Marinka Zitnik, Rok Sosic, and Jure Leskovec. Biosnap datasets: Stanford biomedical network dataset collection. http://snap.stanford.edu/biodata, 2018.
  • Zitnik et al. [2019] Marinka Zitnik, Francis Nguyen, Bo Wang, Jure Leskovec, Anna Goldenberg, and Michael M Hoffman. Machine learning for integrating data in biology and medicine: Principles, practice, and opportunities. Information Fusion, 50:71–91, 2019.

6 Appendix

6.1 Hyperparameter of FusionDTI

FusionDTI is implemented in Python 3.8 with the PyTorch framework (1.12.1; https://pytorch.org/). The computing device we use is an NVIDIA GeForce RTX 3090. In the "Experimental Setup and Results" section, we only present experimental results based on the BindingDB dataset, as the performance trends are identical on the BioSNAP and Human datasets. Table 5 shows the parameters of the FusionDTI model and Table 6 lists the notations used in this paper with descriptions.

Table 5: Configuration Parameters
Module Hyperparameter Value
Mini-batch Batch size 64 (options: 64, 128)
Drug Encoder PLM HUBioDataLab/SELFormer
Protein Encoder PLM westlake-repl/SaProt_650M_AF2
BAN Heads of bilinear attention 3
Bilinear embedding size 512 (options: 32, 64, 128, 256, 512, 768)
Sum pooling window size 2
CAN Attention heads 8
Hidden dimension 512 (options: 32, 64, 128, 256, 512, 768)
Integration strategies Mean pooling (options: Mean pooling, CLS)
Group size 1 (options: from 1 to 512)
MLP Hidden layer sizes (1024, 512, 256)
Activation ReLU (options: Tanh, ReLU)
Solver AdamW (options: AdamW, Adam, RMSprop, Adadelta, LBFGS)
Learning rate scheduler CosineAnnealingLR (options: CosineAnnealingLR, StepLR, ExponentialLR)
Initial learning rate 1e-4 (options: from 1e-3 to 1e-6)
Maximum epoch 200
Table 6: Notations and Descriptions
Notations Description
$\mathbf{D}$ Drug feature
$\mathbf{P}$ Target feature
$\mathbf{q} \in \mathbb{R}^{K}$ Weight vector for bilinear transformation
$Att \in \mathbb{R}^{\rho \times \phi}$ Bilinear attention maps in BAN
$\mathbf{U} \in \mathbb{R}^{N \times K}$ Transformation matrix for drug features
$\mathbf{V} \in \mathbb{R}^{M \times K}$ Transformation matrix for target features
$\mathbf{g}$ The number of tokens per group
$\mathbf{D}^{*} \in \mathbb{R}^{m \times h}$ Fused drug representations in token-level interaction
$\mathbf{P}^{*} \in \mathbb{R}^{n \times h}$ Fused target representations in token-level interaction
$\mathbf{Q}_{d}, \mathbf{K}_{d}, \mathbf{V}_{d} \in \mathbb{R}^{m \times h}$ Queries, keys, and values for the drug in token-level interaction
$\mathbf{Q}_{p}, \mathbf{K}_{p}, \mathbf{V}_{p} \in \mathbb{R}^{n \times h}$ Queries, keys, and values for the target in token-level interaction
$\mathbf{W}_{q}^{d}, \mathbf{W}_{k}^{d}, \mathbf{W}_{v}^{d} \in \mathbb{R}^{H \times h}$ Projection matrices for drug queries, keys, and values
$\mathbf{W}_{q}^{p}, \mathbf{W}_{k}^{p}, \mathbf{W}_{v}^{p} \in \mathbb{R}^{h \times h}$ Projection matrices for target queries, keys, and values
$\mathbf{F}$ Drug-target joint representation
$p \in [0,1]$ Output interaction probability
$H$ Number of attention heads in token-level interaction
$m, n$ Sequence lengths for drug and protein, respectively
$h$ Hidden dimension in token-level interaction

6.2 Dataset Sources

All the data used in this paper are from public sources. The statistics of the experimental datasets are presented in Table 7.

  1. The BindingDB [13] dataset is a web-accessible database of experimentally validated binding affinities, focusing primarily on the interactions of small drug-like molecules and proteins. The BindingDB source is found at https://www.bindingdb.org/bind/index.jsp.

  2. The BioSNAP [48] dataset is created from the DrugBank database [41]. It is a balanced dataset with validated positive interactions and an equal number of negative samples randomly obtained from unseen pairs. The BioSNAP source is found at https://github.com/kexinhuang12345/MolTrans.

  3. The Human [24, 8] dataset includes highly credible negative samples. The balanced version of the Human dataset contains the same number of positive and negative samples. The Human source is found at https://github.com/lifanchen-simm/transformerCPI.

Table 7: Dataset Statistics
Dataset # Drugs # Proteins # Interactions
BindingDB 14,643 2,623 49,199
BioSNAP 4,510 2,181 27,464
Human 2,726 2,001 6,728

6.3 How to Obtain the Structure-aware (SA) Sequence of a Protein and the SELFIES of a Drug?

To obtain the SA sequence of a protein, the first step is to obtain UniProt IDs from the UniProt website using information such as the amino acid sequence or protein name, and then save these IDs in a comma-delimited text file. Subsequently, we use the UniProt IDs to fetch the corresponding 3D structure files (.cif) from AlphaFold DB [36]. The SA sequence of the protein can then be generated from each 3D structure file using Foldseek.

For drugs, SELFIES strings can be derived from SMILES strings. This conversion requires specific Python packages; once they are installed, the SELFIES strings can be generated through appropriate scripts. For more detailed procedures, including the necessary code, please refer to our submission file.

Notably, we provide the generation code for SA vocabulary and SELFIES in our GitHub.
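To make the SA format concrete, the sketch below shows how an SA sequence pairs each residue letter with a lowercase structure letter, as in the token Dd used for an aspartate residue in the Introduction. It assumes the per-residue structure letters have already been produced by a tool such as Foldseek; the function names, the example strings and the omission of any special tokens are assumptions for illustration, and this is not the generation code from the repository.

```python
def build_sa_sequence(residues, structure_tokens):
    """Interleave residue letters with lowercase structure letters."""
    if len(residues) != len(structure_tokens):
        raise ValueError("need exactly one structure token per residue")
    return "".join(
        aa.upper() + st.lower() for aa, st in zip(residues, structure_tokens)
    )


def split_sa_sequence(sa_sequence):
    """Split an SA sequence back into two-character tokens."""
    if len(sa_sequence) % 2 != 0:
        raise ValueError("SA sequence must have even length")
    return [sa_sequence[i:i + 2] for i in range(0, len(sa_sequence), 2)]


# Made-up three-residue example: amino acids M, D, A with structure letters v, d, p.
residues = "MDA"
structure_tokens = "vdp"
sa_sequence = build_sa_sequence(residues, structure_tokens)
sa_tokens = split_sa_sequence(sa_sequence)
```

Because each SA token carries one residue and its local structural state, token-level attention over SA tokens can be mapped back to individual residues.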

6.4 Baselines

We compare the performance of FusionDTI with the following seven models on the DTI task.

  1. Support Vector Machine [9] on the concatenated fingerprint ECFP4 [29] (extended connectivity fingerprint, up to four bonds) and PSC [6] (pseudo-amino acid composition) features.

  2. Random Forest [15] on the concatenated fingerprint ECFP4 and PSC features.

  3. DeepConv-DTI [20] uses a fully connected neural network to encode the ECFP4 drug fingerprint and a CNN along with a global max-pooling layer to extract features from the protein sequences. Then the drug and protein features are concatenated and fed into a fully connected neural network for the final prediction.

  4. GraphDTA [25] uses a GNN to encode the drug molecular graphs and a CNN to encode the protein sequences. The derived vectors of the drug and protein representations are directly concatenated for interaction prediction.

  5. MolTrans [16] uses a transformer architecture to encode the drugs and proteins. Then a CNN-based fusion module is adopted to capture DTI interactions.

  6. DrugBAN [3] uses a Graph Convolution Network and a 1D CNN to encode the drug and protein sequences. Then a bilinear attention network [18] is adopted to learn pairwise interactions between the drug and protein. The resulting joint representation is decoded by a fully connected neural network.

  7. BioT5 [26] is a cross-modal model in biology with chemical knowledge and natural language associations.

6.5 Case Study

The top three predictions (PDB IDs: 6QL2 [17], 5W8L [28] and 4N6H [12]) of co-crystalized ligands are drawn from the Protein Data Bank (PDB) [4]. Following the setup of the DrugBAN case study, we only choose X-ray structures with a resolution greater than 2.5 Å that correspond to human proteins. In addition, the co-crystalized ligands are required to have pIC50 ≤ 100 nM and must not be part of the training dataset. As shown in Figure 9, we summarise all drug-target interactions predicted by DrugBAN and FusionDTI for the three sample pairs in the case study.

Figure 9: FusionDTI predictions: EZL - 6QL2, 9YA - 5W8L and EJ4 - 4N6H

6.6 Time Complexity Analysis

Table 8: Time complexity and parameters comparison of BAN and CAN.
Fusion module Complexity (O) Parameters
BAN $O(\rho \cdot \phi \cdot K)$ 790k
CAN $O(m \cdot n \cdot h)$ 1572k

Each PLM encoder produces representations with a fixed feature dimension, but this dimension may differ from one encoder to another. Therefore, in order to fuse the protein and drug representations, we use two linear layers to project both representations to a common feature dimension equal to the token length (512).
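The projection step can be illustrated with a small dependency-free sketch. The encoder output sizes (768 and 1280) and the token counts are hypothetical values chosen only for illustration, and the randomly initialised weights stand in for learned parameters; in a PyTorch model the two layers would typically be `torch.nn.Linear` modules.

```python
import random


def make_linear(in_dim: int, out_dim: int, rng: random.Random):
    """Create (weight, bias) for a linear layer with small random weights."""
    weight = [[rng.uniform(-0.02, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def linear(tokens, layer):
    """Apply y = W x + b independently to every token vector."""
    weight, bias = layer
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]


SHARED_DIM = 512                   # common feature dimension stated in the text
DRUG_DIM, PROTEIN_DIM = 768, 1280  # hypothetical encoder output sizes

rng = random.Random(0)
drug_proj = make_linear(DRUG_DIM, SHARED_DIM, rng)
protein_proj = make_linear(PROTEIN_DIM, SHARED_DIM, rng)

# Two drug tokens and three protein tokens with random features.
drug_tokens = [[rng.gauss(0, 1) for _ in range(DRUG_DIM)] for _ in range(2)]
protein_tokens = [[rng.gauss(0, 1) for _ in range(PROTEIN_DIM)] for _ in range(3)]

drug_shared = linear(drug_tokens, drug_proj)
protein_shared = linear(protein_tokens, protein_proj)
```

After projection, every drug and protein token lives in the same 512-dimensional space, so the fusion module can compare tokens from the two modalities directly.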

The time complexity of BAN depends on the computation of the bilinear interaction maps. The bilinear attention involves a Hadamard product and further matrix operations, as given in Equation (2). Computing $U^{T}P$ and $V^{T}D$ requires $O(N \cdot \rho \cdot K)$ and $O(M \cdot \phi \cdot K)$ operations, respectively. Here, $K$ denotes the dimensionality of the transformation, i.e. the rank of the feature space into which the protein and drug features are projected. When the token length equals the feature dimension and the transformation dimension is twice that size, the overall time complexity is $O(\rho \cdot \phi \cdot K)$.

For the token-level interaction in the DTI task, the time complexity is also largely determined by the attention mechanism. As before, we assume that the token length equals the feature dimension for both the drug and the protein. With multi-head attention ($H = 8$ heads), the complexity of computing the queries, keys, and values in Equations (6) and (7), as well as the softmax attention weights, is $O(H \cdot n \cdot m \cdot h)$, where $m$ and $n$ are the token lengths of the drug and the protein, respectively, and $h$ is the hidden dimension. Since each head contributes its own set of computations and the attention mechanism operates over all tokens, the $m \cdot n$ term (stemming from the softmax operation across the token length) becomes significant. Because $H$ is a fixed constant, this leads to a total time complexity of $O(m \cdot n \cdot h)$ per batch for the attention mechanism.

From the above analysis of the two fusion strategies, the time complexity of CAN is lower than that of BAN for the same input protein and drug features. BAN is markedly affected by the transformation dimension $K$: when $K$ is larger than the token length and the feature dimension, the time complexity of BAN is higher than that of CAN. However, counting parameters with the PyTorch package, we observe that BAN has fewer parameters than CAN, as shown in Table 8.
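To make the comparison concrete, the sketch below evaluates the leading-order terms from Table 8 under the stated condition that the token length and the feature dimension are both 512. Constants and lower-order terms are dropped, and the values of $K$ other than 1024 are hypothetical, so the numbers indicate a trend rather than measured runtimes.

```python
def ban_cost(rho: int, phi: int, k: int) -> int:
    """Leading-order term for BAN from Table 8: rho * phi * K."""
    return rho * phi * k


def can_cost(m: int, n: int, h: int) -> int:
    """Leading-order term for CAN from Table 8: m * n * h."""
    return m * n * h


SIZE = 512  # token length and feature dimension, as stated in the text

# Ratio of the two terms for a few transformation dimensions K.
ratios = {
    k: ban_cost(SIZE, SIZE, k) / can_cost(SIZE, SIZE, SIZE)
    for k in (256, 512, 1024)
}

for k, ratio in ratios.items():
    print(f"K={k}: BAN/CAN = {ratio:.2f}")
```

Under these assumptions the ratio reduces to $K / 512$, so BAN costs twice as much as CAN when $K$ is twice the feature dimension, and the two terms coincide when $K$ equals it.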