Multiple Kronecker RLS fusion-based link propagation for drug-side effect prediction

Yuqing Qian
Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, P.R. China

Ziyu Zheng
Department of Mathematical Sciences, University of Nottingham Ningbo, P.R. China

Prayag Tiwari*
School of Information Technology, Halmstad University, Sweden

Yijie Ding*
Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, P.R. China

Quan Zou
Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, P.R. China

*Corresponding author.

Abstract

Drug-side effect prediction has become an essential area of research in the field of pharmacology. As the use of medications continues to rise, so does the importance of understanding and mitigating the potential risks associated with them. At present, researchers have turned to data-driven methods to predict drug-side effects. Drug-side effect prediction is a link prediction problem, and the related data can be described from various perspectives. To process these kinds of data, a multi-view method, called Multiple Kronecker RLS fusion-based link propagation (MKronRLSF-LP), is proposed. MKronRLSF-LP extends the Kron-RLS by finding the consensus partitions and multiple graph Laplacian constraints in the multi-view setting. Both of these multi-view settings contribute to a higher quality result. Extensive experiments have been conducted on drug-side effect datasets, and our empirical results provide evidence that our approach is effective and robust.

1 Introduction

Pharmacovigilance is critical to drug safety and surveillance. The field of pharmacovigilance plays a crucial role in public health by continuously monitoring and evaluating the safety profile of drugs. Pharmacovigilance involves collecting and analyzing data from various sources, including health care professionals (Yang et al., 2016), patients, regulatory authorities, and pharmaceutical companies. These data are then used to identify possible side effects and assess their severity and frequency (Da Silva & Krishnamurthy, 2016; Galeano et al., 2020). Traditionally, drug-side effects were primarily identified through spontaneous reporting systems, where health care professionals and patients reported adverse events to regulatory authorities. However, this approach has limitations, such as underreporting and delayed detection.

To overcome these limitations, researchers have turned to data-driven methods to find drug-side effects. With the advent of electronic health records, large-scale databases containing valuable information on medication usage and patient outcomes have become available. These databases have allowed researchers to analyze vast amounts of data to identify patterns between drugs and side effects.

One of the most commonly used approaches to drug-side effect prediction is the model-based method. Model-based methods use statistical and machine learning techniques to extract knowledge from large datasets. By analyzing patterns in the data, researchers can identify potential drug-side effects and their associated risk factors. Pauwels et al. (2011) predicted the side effects of drugs (Pau’s method) from drug chemical substructures by applying K-nearest neighbor (KNN), support vector machine (SVM), ordinary canonical correlation analysis (OCCA) and sparse canonical correlation analysis (SCCA); their experimental results suggest that SCCA performs best. Sayaka et al. (2012) utilized SCCA to associate targeted proteins with side effects (Miz’s method). Liu et al. (2012) predicted drug side effects (Liu’s method) using SVM and multivariate information, such as the phenotypic characteristics, chemical structures, and biological properties of the drug. Cheng et al. (2013) proposed a phenotypic network inference classifier to associate drugs with side effects (Cheng’s method). NDDSA (Shabani-Mashcool et al., 2020) models the drug-side effect prediction problem using a bipartite graph and applies a resource allocation method to find new links. MKL-LGC (Ding et al., 2018) integrates multiple kernels to describe the diversified information of drugs and side effects. These kernels are then combined using an optimized linear weighting algorithm. The Local and Global Consistency algorithm (LGC) is used to estimate new potential associations based on the integrated kernel information.

Deep learning techniques (Xu et al., 2022) have been increasingly used to predict drug side effects in recent years. These methods leverage the power of neural networks to analyze complex relationships between drugs, genes, and proteins. In SDPred (Zhao et al., 2022), chemical-chemical associations, chemical substructure, drug target information, word representations of drug molecular substructures, semantic similarity of side effects, and drug side effect associations are integrated. To learn drug-side effect pair representation vectors from different interaction maps, SDPred uses a CNN module. Drug interaction profile similarity (DIPA) made the largest contribution. GCRS (Xuan et al., 2022) builds a complex deep-learning structure to fuse and learn the specific topologies, common topologies and pairwise attributes from multiple drug-side effect heterogeneous graphs. Drug-side effect heterogeneous graphs are constructed using drug-side effect associations, drug-disease associations and drug chemical substructures. Based on a graph attention network, Zhao et al. (2021) developed a prediction model for drug-side effect frequencies that integrated information on similarity, known drug-side effect frequencies, and word embeddings. The above deep learning-based methods are a kind of pairwise learning. To keep the samples balanced, these methods select positive samples from trusted databases and negative samples by random sampling. Such a treatment loses some information and introduces noise into the labels.

Drug-side effect prediction is a classic link prediction problem (Yuan et al., 2019). To solve this kind of problem, many multi-view methods have been proposed in recent years (Ding et al., 2021; 2016; Cichonska et al., 2018). Based on the information fusion at different stages of the training process, multi-view methods can roughly be divided into three categories: early fusion, late fusion and fusion during the training phase. Fig. 1 illustrates our taxonomy of multi-view learning method literature.

In early fusion techniques, the views are combined before the training process is performed. Multiple kernel learning (MKL) (Wang et al., 2023b; Cichonska et al., 2018; Nascimento et al., 2016) is a typical early fusion technique. For each view, it computes one or more kernels, and then learns the optimal kernel from the base kernels. For example, MKL-KroneckerRLS (Ding et al., 2019) combines diversified information using Centered Kernel Alignment-based Multiple Kernel Learning (CKA-MKL). Based on the optimal kernel, Kronecker regularized least squares (Kron-RLS) was used to classify drug-side effect pairs. It must be noted that the performance of these methods relies heavily on the optimal view, which may be redundant or miss some key information. In late fusion techniques, a different model for each view is trained separately, and a weighted combination is later taken as the final model. For instance, in Zhang et al. (2016), an ensemble model was constructed by integrating multiple methods, each providing a unique view. The model incorporates Liu et al. (2012), Cheng et al. (2013), an Integrated Neighbour-based Method (INBM), and a Restricted Boltzmann Machine-based Method (RBMBM). Each model is trained independently, and the final partition is the weighted average of the base partitions. Late fusion allows for individual modeling of inherently different views, providing flexibility and an advantage when dealing with diverse data. However, its drawback is the delayed coupling of information, which limits the extent to which each model can benefit from the information provided by other views.

Figure 1: Taxonomy of multi-view learning framework literature. Note: "Partition" commonly refers to the learned result. This concept is more commonly found in classification and clustering tasks (Liu et al., 2023; Bruno & Marchand-Maillet, 2009; Wang et al., 2019). (a) Early fusion: the views are combined before the training process is performed; (b) Late fusion: a different model for each view is separately trained and then a combination is taken as the final partition; (c) Fusion during the training phase: it has some degree of freedom to model the views differently but to also ensure that information from other views is exploited during the training phase.

A third category is fusion during the training phase, which combines the benefits of both fusion types. It fuses multiple views at the partition level and enables the model to explore all views while still being allowed to model each view differently. This framework has been applied to classification models (Houthuys & Suykens, 2021; Houthuys et al., 2018; Qian et al., 2022b; Xie & Sun, 2020) and clustering models (Lv et al., 2021; Houthuys et al., 2018; Wang et al., 2023a). By exploring consensus or complementary information from multiple views, multi-view methods can achieve better performance than single-view methods. The consensus principle seeks agreement among the views. For instance, Wang et al. (2019) maximized the alignment between the consensus partition (clustering matrix) and the weighted combination of base partitions.

In this work, we apply this technique to the Kron-RLS algorithm because of its fast and scalable nature. The proposed method is named Multiple Kronecker RLS fusion-based link propagation (MKronRLSF-LP). Our work’s main contributions are listed as follows:

(1) We extend Kron-RLS to the multiple information fusion setting by introducing a consensus partition and a multiple graph Laplacian constraint. Specifically, we generate multiple partitions by standard Kron-RLS and adaptively learn a weight for each partition to control its contribution to the shared partition. The aim is to fuse the partitions while still allowing some flexibility in modeling each single source of information. Furthermore, multiple graph Laplacian regularization is adopted to boost the performance of semi-supervised learning. Both settings co-evolve toward better performance.

(2) To fuse the features of multiple information sources more reasonably, we design an iterative optimization algorithm to effectively fuse multiple Kron-RLS submodels and obtain the final predictive model of drug-side effects. Throughout the optimization, we avoid explicit computation of any pairwise matrices, which makes our method suitable for solving problems in large pairwise spaces.

(3) The proposed method can address the general link prediction problem; it is empirically tested on four real drug-side effect datasets, which are highly sparse. The results show that MKronRLSF-LP can achieve excellent classification results and outperform other competitive methods.

The rest of this paper is organized as follows. Section 2 provides a description of the drug-side effect prediction problem. Section 3 reviews related work about MKronRLSF-LP. Section 4 comprehensively presents the proposed MKronRLSF-LP. After reporting the experimental results in Section 5, we conclude this paper and mention future work in Section 6.

2 Problem description

Identification of drug-side effects is an example of the link prediction problem, which has the aim of predicting how likely it is that there is a link between two arbitrary nodes in a network. This problem can also be seen as a recommendation system (Jiang et al., 2019; Fan et al., 2021) task.

Let the drug nodes and side effect nodes of a network be $\displaystyle{\mathbb{D}}=\left\{{{d_{1}},{d_{2}},\ldots,{d_{N}}}\right\}$ and $\displaystyle{\mathbb{S}}=\left\{{{s_{1}},{s_{2}},\ldots,{s_{M}}}\right\}$ , respectively. We denote the number of drug and side effect nodes by $N$ and $M$ , respectively.

We define an adjacency matrix ${\bm{F}}\in{{\mathbb{R}}^{N\times M}}$ to represent the associations between drugs and side effects. Each element of ${\bm{F}}$ is defined as $\displaystyle{\bm{F}}_{i,j}=1$ if the node pair $(d_{i},s_{j})$ is linked and $\displaystyle{\bm{F}}_{i,j}=0$ otherwise.

Link prediction aims to predict whether a link exists for a node pair $\left({{d_{i}},{s_{j}}}\right)\in{\mathbb{D}}\times{\mathbb{S}}$ whose state is unknown. Thus, it is a classification problem. Most methods use regression algorithms to predict a score (ranging from 0 to 1), which we call the link confidence. Then, a class of 0 or 1 is assigned by comparing the predicted score with a threshold. Higher link confidence indicates a greater probability of the link existing, while lower values indicate the opposite. We define a new matrix $\hat{\bm{F}}$ , which is estimated by the prediction model. Each element $\hat{\bm{F}}_{i,j}$ represents the predicted link confidence for the node pair $(d_{i},s_{j})$ . Figure 5 summarizes the link prediction problem discussed in this paper.

3 Related work

3.1 Regularized Least Squares

The objective function of Regularized Least Squares (RLS) regression is:

\mathop{\arg\min}\limits_{f}\ \frac{1}{2}\left\|{{\bm{F}}-f\left({\bm{K}}\right)}\right\|_{F}^{2}+\frac{\lambda}{2}\left\|f\right\|_{K}^{2},

(1)

where $\lambda$ is a regularization parameter and ${\left\|f\right\|_{K}}$ denotes the RKHS norm (Kailath, 1971) of $f\left(\cdot\right)$ . The prediction function $f\left(\cdot\right)$ is defined as:

\displaystyle f\left({\bm{K}}\right)={\bm{K}}{\bm{a}},

(2)

where ${\bm{a}}$ is the solution of the model and ${\bm{K}}$ is a kernel matrix with elements

{{\bm{K}}_{i,j}}=k\left({{d_{i}},{d_{j}}}\right)\left({i,j=1,\ldots,N}\right),

(3)

and $k$ represents the kernel function.

By formulating the stationary points of Equation 1 and eliminating the unknown parameters ${\bm{a}}$ , the following solution is obtained

\hat{\bm{F}}={\bm{K}}{\left({{\bm{K}}+\lambda{\bm{I}}_{N}}\right)^{-1}}{\bm{F}}.

(4)

There is only one kind of feature space considered in this model. In the drug-side effect identification problem, there are two feature spaces: the drug space and the side effect space.
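The closed-form solution in Equation 4 can be illustrated with a minimal NumPy sketch. The function name `rls_predict`, the linear kernel, and the toy data are assumptions made for illustration only; they are not part of the original method description.

```python
import numpy as np

def rls_predict(K, F, lam):
    """Closed-form RLS prediction F_hat = K (K + lam * I)^{-1} F (Equation 4)."""
    n = K.shape[0]
    # Solve (K + lam * I) A = F rather than forming the inverse explicitly.
    A = np.linalg.solve(K + lam * np.eye(n), F)
    return K @ A

# Hypothetical toy data: 4 drugs with 5 random features, 3 side effects.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))
K = X @ X.T                                   # linear kernel (positive semi-definite)
F = rng.integers(0, 2, size=(4, 3)).astype(float)

F_hat = rls_predict(K, F, lam=1.0)            # predicted link confidences
```

Solving the linear system is generally preferable to computing the inverse, but both give the same prediction up to rounding.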

3.2 Kronecker Regularized Least Squares

Combining the kernels of the two spaces into a single large kernel that directly relates drug-side effect pairs is a better option. The Kronecker product kernel (Hue & Vert, 2010) is used for this. Given the drug kernel ${\bm{K}}_{D}$ and side effect kernel ${\bm{K}}_{S}$ , we have the Kronecker product kernel

{\bm{K}}={{\bm{K}}_{S}}\otimes{{\bm{K}}_{D}},

(5)

where $\otimes$ indicates the Kronecker product (Laub, 2004). By applying the Kronecker product kernel to RLS, the objective function of Kronecker Regularized Least Squares (Kron-RLS) is obtained:

\mathop{\arg\min}\limits_{f}\ \frac{1}{2}\left\|{\text{vec}\left({\bm{F}}\right)-f\left({\bm{K}}\right)}\right\|_{F}^{2}+\frac{\lambda}{2}\left\|f\right\|_{K}^{2},

(6)

where $\text{vec}\left(\cdot\right)$ is the vectorization operator. By setting the derivative of Equation 6 w.r.t. ${\bm{a}}$ to zero, we obtain:

{\bm{a}}={\left({{\bm{K}}+\lambda{\bm{I}}_{NM}}\right)^{-1}}\text{vec}\left({\bm{F}}\right).

(7)

Computing this solution directly requires inverting $\left({{\bm{K}}+\lambda{\bm{I}}_{NM}}\right)$ , a matrix of size $NM\times NM$ , whose time complexity is $O\left({{N^{3}}{M^{3}}}\right)$ . A well-known property of the Kronecker product (Raymond & Kashima, 2010; Laub, 2004) is therefore used to compute the inverse efficiently.

Kernel matrices (Liu et al., 2023; Pekalska & Haasdonk, 2008) are positive semi-definite, so they admit the eigendecompositions ${{\bm{K}}_{D}}={{\bm{V}}_{D}}{\bm{\Lambda}_{D}}{\bm{V}}_{D}^{T}$ and ${{\bm{K}}_{S}}={{\bm{V}}_{S}}{\bm{\Lambda}_{S}}{\bm{V}}_{S}^{T}$ . According to the theorem (Raymond & Kashima, 2010; Laub, 2004), the eigenvectors of the Kronecker product kernel ${\bm{K}}$ are ${\bm{V}}={{\bm{V}}_{S}}\otimes{{\bm{V}}_{D}}$ . Define the matrix $\bm{\Lambda}\in{\mathbb{R}}^{N\times M}$ by ${\bm{\Lambda}_{i,j}}={\left[{{\bm{\Lambda}_{D}}}\right]_{i,i}}\times{\left[{{\bm{\Lambda}_{S}}}\right]_{j,j}}$ , so that $\text{diag}\left(\text{vec}\left(\bm{\Lambda}\right)\right)={\bm{\Lambda}_{S}}\otimes{\bm{\Lambda}_{D}}$ . The eigenvalues of ${\bm{K}}$ are then $\text{diag}\left(\text{vec}\left(\bm{\Lambda}\right)\right)$ . The matrix ${{\bm{K}}+\lambda{\bm{I}}_{NM}}$ has the same eigenvectors ${\bm{V}}$ and eigenvalues $\text{diag}\left(\text{vec}\left(\bm{\Lambda}+\lambda\bm{1}\right)\right)$ , where $\bm{1}$ is the $N\times M$ all-ones matrix. Then, the prediction $\text{vec}\left(\hat{\bm{F}}\right)={\bm{K}}{\bm{a}}$ , with ${\bm{a}}$ given by Equation 7, can be written as:

{\bm{K}}{\left({{\bm{K}}+\lambda{\bm{I}}_{NM}}\right)^{-1}}\text{vec}\left({\bm{F}}\right)={\bm{V}}\text{diag}\left(\text{vec}\left(\bm{\Lambda}\right)\right){{\bm{V}}^{T}}{\bm{V}}\text{diag}{\left(\text{vec}\left({\bm{\Lambda}+\lambda\bm{1}}\right)\right)^{-1}}{{\bm{V}}^{T}}\text{vec}\left({\bm{F}}\right).

(8)

Since ${\bm{V}}^{T}{\bm{V}}={\bm{I}}_{NM}$ and $\text{diag}\left(\text{vec}\left(\bm{\Lambda}\right)\right)\text{diag}{\left(\text{vec}\left({\bm{\Lambda}+\lambda\bm{1}}\right)\right)^{-1}}$ is also a diagonal matrix, we further simplify Equation 8 and get

{\bm{K}}{\left({{\bm{K}}+\lambda{\bm{I}}_{NM}}\right)^{-1}}\text{vec}\left({\bm{F}}\right)={\bm{V}}\text{diag}\left(\text{vec}\left({\bm{J}}\right)\right){\bm{V}}^{T}\text{vec}\left({\bm{F}}\right),

(9)

where the matrix ${\bm{J}}\in{\mathbb{R}}^{N\times M}$ is defined elementwise as

{{\bm{J}}_{i,j}}=\frac{{{\bm{\Lambda}_{i,j}}}}{{{\bm{\Lambda}_{i,j}}+\lambda}}.

(10)

Using the vec-trick ( $\left({{\bm{A}}\otimes{\bm{B}}}\right)\text{vec}\left({\bm{C}}\right)=\text{vec}\left({{\bm{B}}{\bm{C}}{{\bm{A}}^{T}}}\right)$ ), we further simplify Equation 9. Then, we get

\hat{\bm{F}}={{\bm{V}}_{D}}\left({{\bm{J}}\odot\left({{{\bm{V}}_{D}^{T}}{\bm{F}}{{\bm{V}}_{S}}}\right)}\right){{\bm{V}}_{S}^{T}},

(11)

where $\odot$ represents the Hadamard product. The computational time of this optimization method is $O\left({{N^{3}}+{M^{3}}}\right)$ , which is much less than $O\left({{N^{3}}{M^{3}}}\right)$ .
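The eigendecomposition shortcut can be checked numerically on a small problem. The sketch below is an illustration under stated assumptions: the function names and toy kernels are hypothetical, $\text{vec}\left(\cdot\right)$ is taken to be column-major (which is what the vec-trick identity requires), and the spectral filter is indexed as `J[i, j] = lam_D[i] * lam_S[j] / (lam_D[i] * lam_S[j] + lam)` so that it has the same $N\times M$ shape as ${\bm{F}}$ .

```python
import numpy as np

def kron_rls_fast(K_D, K_S, F, lam):
    """Kron-RLS prediction using only the two small eigendecompositions."""
    lam_D, V_D = np.linalg.eigh(K_D)      # K_D = V_D diag(lam_D) V_D^T
    lam_S, V_S = np.linalg.eigh(K_S)      # K_S = V_S diag(lam_S) V_S^T
    Lam = np.outer(lam_D, lam_S)          # N x M grid of pairwise eigenvalues
    J = Lam / (Lam + lam)                 # elementwise spectral filter
    return V_D @ (J * (V_D.T @ F @ V_S)) @ V_S.T

def kron_rls_naive(K_D, K_S, F, lam):
    """Reference version that forms the NM x NM kernel (small problems only)."""
    K = np.kron(K_S, K_D)
    vec_F = F.reshape(-1, order="F")      # column-major vectorization
    a = np.linalg.solve(K + lam * np.eye(K.shape[0]), vec_F)
    return (K @ a).reshape(F.shape, order="F")

# Hypothetical toy problem: N = 5 drugs, M = 4 side effects.
rng = np.random.default_rng(1)
X_D = rng.normal(size=(5, 6))
X_S = rng.normal(size=(4, 6))
K_D = X_D @ X_D.T                         # positive semi-definite drug kernel
K_S = X_S @ X_S.T                         # positive semi-definite side effect kernel
F = rng.integers(0, 2, size=(5, 4)).astype(float)

F_fast = kron_rls_fast(K_D, K_S, F, lam=1.0)
F_naive = kron_rls_naive(K_D, K_S, F, lam=1.0)
```

On this toy problem the two predictions should agree up to floating-point error, while the fast version never builds a matrix larger than $N\times N$ or $M\times M$ .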

3.3 Kronecker Regularized Least Squares with Multiple Kernel Learning

Kron-RLS is a kind of kernel method. It can be difficult for nonexpert users to choose an appropriate kernel. To address such limitations, Multiple Kernel Learning (MKL) (Gönen & Alpaydın, 2011) is proposed. Since kernels in MKL can naturally correspond to different views, MKL has been applied with great success to cope with the multi-view data (Wang et al., 2021; Xu et al., 2021; Guo et al., 2021; Qian et al., 2022a; Wang et al., 2023b) by combining kernels appropriately.

Suppose we are given predefined base kernels $\left\{{{\bm{K}}_{D}^{i}}\right\}_{i=1}^{{P}}$ and $\left\{{{\bm{K}}_{S}^{j}}\right\}_{j=1}^{{Q}}$ from the drug feature space and the side effect feature space, respectively. These kernels can be built from different types or views. The optimal kernel is obtained as a linear combination of the base kernels:

{\bm{K}}_{D}^{opt}=\sum\limits_{i=1}^{{P}}{w^{i}{\bm{K}}_{D}^{i}}.

(12)

Usually, an additional constraint is imposed on the corresponding combination coefficient $w$ to control its structure:

\sum\limits_{i=1}^{{P}}{w^{i}}=1,w^{i}\geq 0,i=1,\ldots,{P}.

(13)

The optimal side effect kernel ${\bm{K}}_{S}^{opt}$ is defined analogously and is omitted here.

Based on the MKL method, Ding et al. (2019) and Nascimento et al. (2016) developed Kron-RLS based MKL methods, called Kron-RLS with CKA-MKL and Kron-RLS with selfMKL, respectively. Kron-RLS with CKA-MKL combines diversified information using Centered Kernel Alignment-based Multiple Kernel Learning (CKA-MKL). In Kron-RLS with selfMKL, the weights indicating the importance of individual kernels are calculated automatically to select the more relevant kernels. The final decision function of both methods is given by:

{\text{vec}}\left({{\hat{\bm{F}}}}\right)=\left({{\bm{K}}_{S}^{opt}\otimes{\bm{K}}_{D}^{opt}}\right){\left({{\bm{K}}_{S}^{opt}\otimes{\bm{K}}_{D}^{opt}+\lambda{\bm{I}}_{NM}}\right)^{-1}}{\text{vec}}\left({\bm{F}}\right).

(14)

4 Proposed method

Existing multi-view fusion methods based on Kron-RLS all follow the MKL framework. These methods learn the optimal pairwise kernel as a linear combination of a set of base kernels. All views are fused prior to training, and information is not shared during the training phase. This is a typical early fusion technique. Our proposal addresses this limitation by fusing multi-view information in a consensus partition. Compared with the MKL framework, the advantage of the proposed method is that it allows the sub-partitions a certain degree of freedom to model each single source of information. Further, multiple graph Laplacian regularization is introduced into the consensus partition to boost performance. Fig. 2 illustrates the main procedure of MKronRLSF-LP.

4.1 The construction of kernel matrix

Kron-RLS is a kind of kernel method. We construct drug kernels using five different kinds of functions.

Gaussian Interaction Profile (GIP):

{\left[{{{\bm{K}}_{GIP,D}}}\right]_{i,j}}=\exp\left({-\gamma{{\left\|{{d_{i}}-{d_{j}}}\right\|}^{2}}}\right),

(15)

where $\gamma$ is the Gaussian kernel bandwidth and $\gamma=1$ .

Cosine Similarity (COS):

{\left[{{{\bm{K}}_{COS,D}}}\right]_{i,j}}=\frac{{d_{i}^{T}{d_{j}}}}{{\left|{{d_{i}}}\right|\left|{{d_{j}}}\right|}}.

(16)

Correlation coefficient (Corr):

{\left[{{{\bm{K}}_{Corr,D}}}\right]_{i,j}}=\frac{{\text{Cov}\left({{d_{i}},{d_{j}}}\right)}}{{\sqrt{\text{Var}\left({{d_{i}}}\right)\text{Var}\left({{d_{j}}}\right)}}}.

(17)

Normalized Mutual Information (NMI):

{\left[{{{\bm{K}}_{NMI,D}}}\right]_{i,j}}=\frac{{\text{Q}\left({{d_{i}},{d_{j}}}\right)}}{{\sqrt{\text{H}\left({{d_{i}}}\right)\text{H}\left({{d_{j}}}\right)}}},

(18)

where $\text{Q}\left({{d_{i}},{d_{j}}}\right)$ is the mutual information of $d_{i}$ and $d_{j}$ . $\text{H}\left({{d_{i}}}\right)$ and $\text{H}\left({{d_{j}}}\right)$ are the entropies of $d_{i}$ and $d_{j}$ , respectively.

Neural Tangent Kernel (NTK):

{\left[{{{\bm{K}}_{NTK,D}}}\right]_{i,j}}={{\mathbb{E}}_{\theta\sim w}}\left[{{f_{NTK}}\left({\theta,{d_{i}}}\right),{f_{NTK}}\left({\theta,{d_{j}}}\right)}\right],

(19)

where $f_{NTK}$ is a fully connected neural network and $\theta$ is the collection of parameters in this network.

Similarly, we construct the side effect kernels ( ${\bm{K}}_{GIP,S}$ , ${\bm{K}}_{COS,S}$ , ${\bm{K}}_{Corr,S}$ , ${\bm{K}}_{NMI,S}$ , ${\bm{K}}_{NTK,S}$ ) in side effect space.
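Three of the kernels above can be sketched in a few lines of NumPy. This is an illustration only: it assumes that $d_{i}$ is the interaction profile of drug $i$ (the $i$-th row of ${\bm{F}}$ ), which the text does not state explicitly, and the function names and toy matrix are hypothetical. The NMI and NTK kernels are omitted because they need additional modelling choices.

```python
import numpy as np

def gip_kernel(X, gamma=1.0):
    """Gaussian interaction profile kernel between the rows of X (Equation 15)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # squared distances
    return np.exp(-gamma * np.maximum(d2, 0.0))        # clip rounding noise

def cosine_kernel(X):
    """Cosine similarity between the rows of X (Equation 16)."""
    norms = np.linalg.norm(X, axis=1)
    norms[norms == 0] = 1.0                            # guard all-zero profiles
    Xn = X / norms[:, None]
    return Xn @ Xn.T

def corr_kernel(X):
    """Pearson correlation between the rows of X (Equation 17).
    Rows with zero variance yield NaN and need separate handling."""
    return np.corrcoef(X)

# Hypothetical toy adjacency matrix: 3 drugs x 4 side effects.
F = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)

K_gip = gip_kernel(F)
K_cos = cosine_kernel(F)
K_corr = corr_kernel(F)
```

Under the same assumption, the side effect kernels would follow by applying the functions to the transposed matrix, for example `gip_kernel(F.T)`.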

4.2 The MKronRLSF-LP model

Let us define two sets of base kernels:

	${{{\mathbb{K}}}_{D}}=\left\{{{\bm{K}}_{D}^{1},\ldots,{\bm{K}}_{D}^{P}}\right\},$		(20a)
	${{{\mathbb{K}}}_{S}}=\left\{{{\bm{K}}_{S}^{1},\ldots,{\bm{K}}_{S}^{Q}}\right\},$		(20b)

where $P$ and $Q$ represent the numbers of drug and side effect kernels, respectively. Based on ${\mathbb{K}}_{D}$ and ${\mathbb{K}}_{S}$ , we can get a set of pairwise kernels:

{{\mathbb{K}}}=\left\{{{{\bm{K}}^{1}}={\bm{K}}_{S}^{1}\otimes{\bm{K}}_{D}^{1},\ldots,{{\bm{K}}^{V}}={\bm{K}}_{S}^{Q}\otimes{\bm{K}}_{D}^{P}}\right\},

(21)

where $V$ denotes the numbers of base pairwise kernels. Obviously, $V$ is equal to $P\times Q$ .

By using multiple partitions, we can manipulate multiple views in a partition space, which enhances the robustness of the model. The following ensemble KronRLS model is obtained

\mathop{\arg\min}\limits_{{{\bm{a}}^{v}}}\ \sum\limits_{v=1}^{V}{\left({\frac{1}{2}\left\|{\text{vec}\left({\bm{F}}\right)-{{\bm{K}}^{v}}{{\bm{a}}^{v}}}\right\|_{2}^{2}+\frac{{{\lambda^{v}}}}{2}{{\bm{a}}^{v}}^{T}{{\bm{K}}^{v}}{{\bm{a}}^{v}}}\right)}.

(22)

In multi-view methods, the consensus principle establishes consistency between partitions from different views. However, these partitions contribute with varying degrees of importance to the final prediction, so they should not be fused without discrimination. To account for this, we introduce a consensus partition, denoted by $\hat{\bm{F}}$ . It is a weighted linear combination of partitions $\hat{\bm{F}}_{v}$ from multiple distinct views. A weight ${\bm{w}}_{v}$ is introduced for each view $v$ to characterize its importance; it is calculated based on the training error. To prevent sparse weight vectors, we employ $\left\|\cdot\right\|_{2}^{2}$ to smooth the weights. Then, we have the following optimization problem

	$\displaystyle\mathop{\arg\min}\limits_{\hat{\bm{F}},{{\bm{a}}^{v}},{{\bm{w}}}}$	$\displaystyle\ \frac{1}{2}\left\|{\text{vec}\left({\hat{\bm{F}}}\right)-\sum\limits_{v=1}^{V}{{{\bm{w}}_{v}}{{\bm{K}}^{v}}{{\bm{a}}^{v}}}}\right\|_{2}^{2}+{\mu}\sum\limits_{v=1}^{V}{\left({\frac{{\bm{w}}_{v}}{2}\left\|{\text{vec}\left({\bm{F}}\right)-{{\bm{K}}^{v}}{{\bm{a}}^{v}}}\right\|_{2}^{2}+\frac{{{\lambda^{v}}}}{2}{{\bm{a}}^{v}}^{T}{{\bm{K}}^{v}}{{\bm{a}}^{v}}}\right)}+\frac{1}{2}\beta\left\|{\bm{w}}\right\|_{2}^{2}$		(23)
		$\displaystyle s.t.\ \sum\limits_{v=1}^{V}{{{\bm{w}}_{v}}}=1,\ {{\bm{w}}_{v}}\geq 0,\ v=1,\ldots,V.$

In Equation 23, we observe that the consensus partition $\hat{\bm{F}}$ fits the adjacency matrix ${\bm{F}}$ only through an indirect path. Some zeros in ${\bm{F}}$ are false zeros, that is, links in the network that have not yet been observed. Hence, we must avoid overfitting the observed matrix ${\bm{F}}$ . Inspired by manifold learning, we use the Laplacian operator, which mitigates overfitting and noise while preserving the original data structure and keeping nodes with common labels closely associated. This approach is simple, and empirical evidence confirms its effective performance (Pang & Cheung, 2017; Chao & Sun, 2019; Jiang et al., 2023). Here, we apply multiple graph Laplacian regularization to Equation 23, which can effectively explore multiple different views and boost the performance of $\hat{\bm{F}}$ . Specifically, the Kronecker product Laplacian matrix is calculated from the optimal drug and side effect similarity matrices, which are weighted linear combinations of multiple related kernel matrices. The weight of each kernel is adaptively optimized during the training process, which reduces the impact of noisy or less relevant graphs. The optimization problem for MKronRLSF-LP can be formulated as:

$\displaystyle\mathop{\arg\min}\limits_{\hat{\bm{F}},{{\bm{a}}^{v}},{{\bm{w}}},{\bm{\theta}_{D}},{\bm{\theta}_{S}}}$	$\displaystyle\frac{1}{2}\left\|{\text{vec}\left({\hat{\bm{F}}}\right)-\sum\limits_{v=1}^{V}{{{\bm{w}}_{v}}{{\bm{K}}^{v}}{{\bm{a}}^{v}}}}\right\|_{2}^{2}+\mu\sum\limits_{v=1}^{V}{\left({\frac{{{{\bm{w}}_{v}}}}{2}\left\|{\text{vec}\left({\bm{F}}\right)-{{\bm{K}}^{v}}{{\bm{a}}^{v}}}\right\|_{2}^{2}+\frac{{{\lambda^{v}}}}{2}{{\bm{a}}^{v}}^{T}{{\bm{K}}^{v}}{{\bm{a}}^{v}}}\right)}+\frac{1}{2}\beta\left\|{\bm{w}}\right\|_{2}^{2}$	(24)
	$\displaystyle+\frac{1}{2}\sigma\,\text{vec}{\left({\hat{\bm{F}}}\right)^{T}}{\bm{L}}\,\text{vec}\left({\hat{\bm{F}}}\right)$
$\displaystyle s.t.$	$\displaystyle\sum\limits_{v=1}^{V}{{{\bm{w}}_{v}}}=1,\ {{\bm{w}}_{v}}\geq 0,\ v=1,\ldots,V,$
	$\displaystyle{\bm{L}}={\bm{I}}_{NM}-\left({{\bm{H}}_{S}^{-0.5}{\bm{K}}_{S}^{*}{\bm{H}}_{S}^{-0.5}}\right)\otimes\left({{\bm{H}}_{D}^{-0.5}{\bm{K}}_{D}^{*}{\bm{H}}_{D}^{-0.5}}\right),$
	$\displaystyle{\bm{K}}_{S}^{*}=\sum\limits_{i=1}^{Q}{{{\left[{\bm{\theta}_{S}}\right]}^{\varepsilon}_{i}}{\bm{K}}_{S}^{i}},\ {\bm{K}}_{D}^{*}=\sum\limits_{i=1}^{P}{{{\left[{\bm{\theta}_{D}}\right]}^{\varepsilon}_{i}}{\bm{K}}_{D}^{i}},$
	$\displaystyle\sum\limits_{i=1}^{Q}{\left[{\bm{\theta}_{S}}\right]_{i}}=1,\ {\left[{\bm{\theta}_{S}}\right]_{i}}\geq 0,\ i=1,\ldots,Q,\ \sum\limits_{i=1}^{P}{\left[{\bm{\theta}_{D}}\right]_{i}}=1,\ {\left[{\bm{\theta}_{D}}\right]_{i}}\geq 0,\ i=1,\ldots,P,$

where ${\bm{L}}$ is a normalized Laplacian matrix, and ${\bm{H}}_{S}$ and ${\bm{H}}_{D}$ are diagonal matrices whose $j$ th diagonal elements are $\sum\nolimits_{k}{{{\left[{{\bm{K}}_{S}^{*}}\right]}_{j,k}}}$ and $\sum\nolimits_{k}{{{\left[{{\bm{K}}_{D}^{*}}\right]}_{j,k}}}$ , respectively. The exponent $\varepsilon>1$ guarantees that each graph makes a particular contribution to the Laplacian matrix.
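The Laplacian penalty can be evaluated without ever forming the $NM\times NM$ matrix ${\bm{L}}$ , using the same vec-trick as in Section 3.2. The following NumPy sketch is an illustration under assumptions: the helper names, toy kernels, and fixed weights are hypothetical, the kernels are assumed to have positive row sums, and the sketch only evaluates the penalty for given weights rather than reproducing the optimization algorithm of Appendix A.1.

```python
import numpy as np

def combine_kernels(kernels, theta, eps=2.0):
    """Weighted kernel K* = sum_i theta_i^eps * K^i."""
    return sum((t ** eps) * K for t, K in zip(theta, kernels))

def normalize_similarity(K):
    """H^{-0.5} K H^{-0.5}, with H the diagonal matrix of row sums of K."""
    h_inv_sqrt = 1.0 / np.sqrt(K.sum(axis=1))
    return K * h_inv_sqrt[:, None] * h_inv_sqrt[None, :]

def laplacian_penalty(F_hat, S_D, S_S):
    """vec(F_hat)^T (I - S_S kron S_D) vec(F_hat), computed on small matrices only."""
    return np.sum(F_hat * F_hat) - np.sum(F_hat * (S_D @ F_hat @ S_S.T))

def toy_kernel(rng, n):
    """Hypothetical Gaussian kernel on random points (entries in (0, 1])."""
    X = rng.normal(size=(n, 3))
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2)

rng = np.random.default_rng(2)
drug_kernels = [toy_kernel(rng, 4), toy_kernel(rng, 4)]   # P = 2, N = 4
side_kernels = [toy_kernel(rng, 3), toy_kernel(rng, 3)]   # Q = 2, M = 3

S_D = normalize_similarity(combine_kernels(drug_kernels, [0.5, 0.5]))
S_S = normalize_similarity(combine_kernels(side_kernels, [0.7, 0.3]))

F_hat = rng.random((4, 3))                                # stand-in consensus partition
penalty = laplacian_penalty(F_hat, S_D, S_S)
```

For symmetric kernels with non-negative entries, the normalized similarities have eigenvalues in $[-1,1]$ , so the penalty is non-negative.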

Due to the lack of space, we present the optimization algorithm for Equation 24 in Appendix Section A.1.

5 Experiments

In this section, the performance of MKronRLSF-LP is shown, and we make comparisons with baseline methods and other drug-side effect predictors.

5.1 Dataset

Table 1: Summary of the real drug-side effect datasets.

Name	Drug	Side effect	Associations	Sparsity	Reference
Liu	832	1385	59205	94.86%	(Cheng et al., 2013)
Pau	888	1385	61102	95.03%	(Pauwels et al., 2011)
Miz	658	1339	49051	94.43%	(Sayaka et al., 2012)
Luo	708	4192	80164	97.30%	(Luo et al., 2017)

Four real drug-side effect datasets are used to assess the effectiveness of our proposed method. The Pau dataset is derived from the SIDER database (Kuhn et al., 2010), which contains information about drugs and their recorded side effects. The Miz dataset includes information about drug-protein interactions and drug-side effect interactions, obtained from the DrugBank (Wishart et al., 2006) and SIDER databases, respectively. There were 658 drugs with both targeted protein and side effect information. Additionally, Liu et al. mapped drugs in SIDER to DrugBank 3.0 (Knox et al., 2010), resulting in a final dataset of 832 drugs and 1385 side effects. The Luo dataset has a large number of side effects and was extracted from SIDER 2.0. Table 1 summarizes information about the datasets. We can see that all four datasets are sparse. In other words, there are far fewer positive samples than negative samples. Thus, drug-side effect prediction can be viewed as a classification problem with extremely imbalanced data.
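The sparsity column of Table 1 follows directly from the other columns as one minus the fraction of drug-side effect pairs that are known associations. A short sketch (the function and variable names are illustrative, and the counts are copied from Table 1):

```python
# (drugs, side effects, known associations), copied from Table 1.
datasets = {
    "Liu": (832, 1385, 59205),
    "Pau": (888, 1385, 61102),
    "Miz": (658, 1339, 49051),
    "Luo": (708, 4192, 80164),
}

def sparsity(n_drugs, n_side_effects, n_associations):
    """Fraction of drug-side effect pairs with no recorded association."""
    return 1.0 - n_associations / (n_drugs * n_side_effects)

for name, counts in datasets.items():
    print(f"{name}: {100 * sparsity(*counts):.2f}%")
```

The printed values reproduce the Sparsity column of Table 1 after rounding to two decimals.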

5.2 Parameter setting

In this paper, the objective function in Equation 24 contains the following regularization parameters: $\mu$ , $\beta$ , $\sigma$ , $\varepsilon$ and ${{\lambda^{v}},v=1,\ldots,V}$ . To find the combination of regularization parameters that gives MKronRLSF-LP the best performance, a grid search is performed on the Pau dataset. The parameters with the best AUPR are selected.

We first select ${{\lambda^{v}},v=1,\ldots,V}$ by fitting a single-view Kron-RLS model with the corresponding pairwise kernel. Each parameter $\lambda^{v}$ is selected from the range $2^{-5}$ to $2^{5}$ with step $2^{1}$ . The optimal parameters $\lambda^{v}$ are shown in Table 4. According to a previous study (Shi et al., 2019), the performance is not affected by parameter $\varepsilon$ , so it is set to 2. Then, we fix ${{\lambda^{v}},v=1,\ldots,V}$ at the best values and tune $\mu$ , $\beta$ , $\sigma$ within the range $2^{-10}$ to $2^{0}$ with step $2^{1}$ . The optimal regularization parameters are $\mu=2^{-7}$ , $\beta=2^{0}$ and $\sigma=2^{-8}$ .

5.3 Baseline methods

In this work, we compare MKronRLSF-LP with the following baseline methods: BSV, Comm Kron-RLS (Perrone & Cooper, 1995), Kron-RLS+CKA-MKL (Ding et al., 2019), Kron-RLS+pairwiseMKL (Cichonska et al., 2018), Kron-RLS+self-MKL (Nascimento et al., 2016), MvGRLP (Ding et al., 2021) and MvGCN (Fu et al., 2022). Due to the lack of space, we present details of these baseline methods in Appendix Section A.3. For a fair comparison, these baseline methods are given the same input as our method. To achieve their best performance, we also adopt 5-fold CV on the Pau dataset to tune their parameters.

5.4 Threshold finding

Because MKronRLSF-LP and the baseline methods only output regression values, we apply a threshold finding operation. For a given validation set in the five-fold cross-validation (5-fold CV) procedure, we collect the labels and their corresponding predicted scores. Then, we obtain the optimal threshold by maximizing the $F_{score}$ on the predicted scores and labels from this validation set. The trend of $F_{score}$ , $Recall$ and $Precision$ with different thresholds over the four datasets is shown in Fig. 6. As the prediction threshold rises, $Recall$ falls, whereas $Precision$ tends to rise. The $F_{score}$ is the harmonic mean of $Recall$ and $Precision$ . It thus symmetrically represents both $Recall$ and $Precision$ in one metric. Table 5 summarizes the thresholds of the different baseline methods on the different datasets.
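The threshold search can be sketched in plain Python as follows. This is a minimal illustration, assuming that a pair is predicted positive when its score is greater than or equal to the threshold and that the candidate thresholds are the distinct predicted scores; the function names and toy data are hypothetical.

```python
def f_score(labels, scores, threshold):
    """F-score of the binary predictions obtained with `score >= threshold`."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(labels, scores):
    """Candidate threshold (one of the observed scores) with the highest F-score."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f_score(labels, scores, t))

# Hypothetical validation fold: 3 positive and 5 negative pairs.
labels = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2, 0.1, 0.05]

threshold = best_threshold(labels, scores)
print(threshold, round(f_score(labels, scores, threshold), 3))  # prints: 0.6 0.857
```

This exhaustive scan is quadratic in the number of validation pairs; on large folds, a single pass over the sorted scores would be a more practical design.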

5.5 Comparison with baseline methods

We conduct 5-fold CV to evaluate the performance of our method against the baseline methods. To provide a fair and comprehensive comparison, each algorithm is run 10 times with different cross-validation splits, and the mean values and standard deviations are reported in Table 3. The best single view is $K_{GIP,D}\otimes K_{NTK,S}$ , which is selected by 5-fold CV on the Pau dataset.
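A repeated 5-fold protocol of this kind can be sketched as follows. The generator `repeated_kfold` is a hypothetical helper, and we assume here that the items being split are the entries of the drug-side effect association matrix (indexed 0 to `n_items - 1`); the source does not state how the folds are formed.

```python
import numpy as np


def repeated_kfold(n_items, n_splits=5, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated K-fold CV."""
    rng = np.random.default_rng(seed)
    for r in range(n_repeats):
        # A fresh random permutation gives a different split per repeat.
        perm = rng.permutation(n_items)
        folds = np.array_split(perm, n_splits)
        for k in range(n_splits):
            test_idx = folds[k]
            train_idx = np.concatenate(
                [folds[j] for j in range(n_splits) if j != k]
            )
            yield r, k, train_idx, test_idx
```

Each metric is then averaged over the folds of a repeat, and the mean and standard deviation are taken across the 10 repeats.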

First, we observe that the proposed method has the best AUPR and $F_{score}$ on all datasets. In particular, the proposed method has a higher AUPR and $F_{score}$ than BSV on all datasets, which indicates the benefit of using multiple views. The simple coupling frameworks BSV and Comm perform well on the Pau dataset. However, they do not perform as well on the other datasets, which indicates that simple fusion schemes are sensitive to the dataset and not robust. Furthermore, Kron-RLS+pairwiseMKL achieves the highest AUC of 95.01%, 95.02% and 94.70% on the Liu, Pau and Miz datasets, respectively. These are slight improvements of 0.23%, 0.21% and 0.23% over our method, respectively. As discussed in Section 5.1, drug-side effect prediction is an extremely imbalanced classification problem. The AUC can be interpreted as the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative instance. Under such imbalance, this ranking probability is dominated by the large number of negatives, so the AUC is a less informative metric than AUPR for predicting drug-side effects.
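The ranking interpretation of the AUC can be made concrete with a short sketch. The function `auc_by_ranking` is a hypothetical name; it compares every positive-negative pair directly (ties count as one half), so it is meant for small illustrative inputs rather than full datasets.

```python
import numpy as np


def auc_by_ranking(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # diff[i, j] = score of positive i minus score of negative j.
    diff = pos[:, None] - neg[None, :]
    # A correctly ordered pair counts 1, a tie counts 0.5.
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))
```

Because every negative contributes to the pair count, a model can reach a high AUC while still producing many false positives among its top-scored pairs, which is what AUPR penalizes.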

Another interesting observation is that MKronRLSF-LP outperforms the other MKL-based methods in the comparison. For example, it exceeds the best MKL method (CKA-MKL) by 2.1%, 2.32%, 1.43% and 2.51% in terms of AUPR on the Liu, Pau, Miz and Luo datasets, respectively. These results verify the effectiveness of the consensus partition and the multiple graph Laplacian constraint.

For a more thorough analysis and reliable conclusions, we use post-hoc test statistics to statistically assess the different metrics shown in Table 3. Fig. 3 shows the results of these tests visualized as Critical Difference diagrams. These results show that MKronRLSF-LP is ranked significantly better than all other methods in terms of AUPR, $Recall$ and $F_{score}$ . In addition, MKronRLSF-LP is inferior only to Kron-RLS+pairwiseMKL in terms of AUC and only to Kron-RLS+CKA-MKL in terms of $Precision$ . MvGCN is ranked worse than our method. It is also worth mentioning that there is insufficient statistical evidence that MvGCN performs better than the model-based methods. MvGCN uses a shallow GCN to avoid over-smoothing. A shallow GCN (Miao et al., 2021) can capture only the local neighbourhood information of nodes, so the global features of the network are not fully explored. This results in inaccurate embedding vectors.

In summary, the above experimental results demonstrate the superior prediction performance of MKronRLSF-LP over the baseline methods. We attribute this superiority to three aspects: (1) the consensus partition is derived through joint fusion of weighted multiple partitions; (2) MKronRLSF-LP uses multiple graph Laplacian regularization to constrain the consensus predicted value $\hat{\bm{F}}$ , which makes the consensus partition robust; (3) unlike existing MKL methods, MKronRLSF-LP fuses multiple pairwise kernels at the partition level. Together, these three factors contribute to the improvement in prediction performance.

5.6 Ablation study

To validate the benefits of jointly applying the consensus partition and the multiple graph Laplacian constraint, we conduct an ablation study by excluding one component at a time. First, we construct a Kron-RLS model on each pairwise kernel separately. Each partition is learned independently, so the result can be regarded as an ensemble Kron-RLS whose objective function is Equation 22. The results should be consistent across views, and heterogeneous views have varying degrees of importance in the final prediction. Therefore, we introduce a consensus partition $\hat{\bm{F}}$ , which is a weighted linear combination of the base partitions (as shown in Equation 23). To further improve the performance and robustness of the model, we apply multiple graph Laplacian constraints to the consensus partition. Finally, the objective function of MKronRLSF-LP (Equation 24) is obtained. The results of the ablation study are shown in Fig. 4. It can be observed that both the consensus partition and the multiple graph Laplacian constraint help MKronRLSF-LP achieve the best results.

5.7 Comparisons of computational speed

To demonstrate the efficiency of MKronRLSF-LP, we compare it with the baseline methods in terms of computational speed. All methods except MvGCN are run on a PC equipped with an Intel Core i7-13700 and 16GB RAM. Because MvGCN is a deep learning-based method, it is run on a workstation equipped with an NVIDIA GeForce RTX 3090 GPU. Each method is run 10 times, and the mean running time is reported. The results are shown in Table 2. They do not include the kernel calculation time.

As expected, learning from multiple views takes more time than learning from only one view (BSV). Also, since MKronRLSF-LP fuses multiple views at the partition level, it requires more running time than Kron-RLS+CKA-MKL and Kron-RLS+self-MKL. Another observation is that MKronRLSF-LP is much faster than Kron-RLS+pairwiseMKL. This can be explained by the time complexity of the two methods, which in both cases is dominated by inverting the pairwise kernels. In our optimization algorithm, we use eigendecomposition techniques to compute the approximate inverse, so the time complexity of our method is $O((P+I_{ter})N^{3}+(Q+I_{ter})M^{3})$ . In contrast, Kron-RLS+pairwiseMKL solves the system with the conjugate gradient approach, which iteratively improves the result by performing matrix-vector products. Hence, Kron-RLS+pairwiseMKL runs in $O(I_{ter}PQ(N^{2}M+M^{2}N))$ . When MvGCN is applied to the Luo dataset, its running time exceeds 2 hours. This is because MvGCN uses a self-supervised learning strategy based on deep graph infomax (DGI) to initialize node embeddings, and DGI takes a very long time to run when the bipartite network contains many nodes.

Table 2: Mean running time (in seconds) of baseline methods on four datasets.

Methods	Pau	Liu	Miz	Luo
BSV	0.79	0.83	0.68	5.84
Comm Kron-RLS	19.38	20.95	18.39	148.60
Kron-RLS+CKA-MKL	2.69	2.18	2.36	13.13
Kron-RLS+pairwiseMKL	1583.67	1483.26	1364.21	-
Kron-RLS+self-MKL	12.21	13.05	12.09	155.85
MvGRLP	8.94	8.37	7.23	58.53
MvGCN	305.44	329.43	343.50	-
MKronRLSF-LP	50.55	43.9	35.83	280

Note: "-" indicates that the method took more than 2 hours to run.

5.8 Comparison with other drug-side effect predictors

We also compare the proposed drug-side effect prediction method with state-of-the-art methods. Tables 6, 7, 8 and 9 present the results of 5-fold CV in terms of AUPR, AUC, $Recall$ , $Precision$ and $F_{score}$ on the four datasets, respectively. We have highlighted the best results in bold and underlined the second-best results.

MKronRLSF-LP achieves the highest AUPR and $F_{score}$ on all datasets. In the problem of drug-side effect prediction, AUPR and $F_{score}$ are more desirable metrics (Ezzat et al., 2017; Li et al., 2021). Therefore, we conclude that our method outperforms the other assessed methods. GCRS (Xuan et al., 2022) and SDPred (Zhao et al., 2022) are deep learning-based methods. GCRS constructs multiple heterogeneous graphs and multi-layer convolutional neural networks with attribute-level attention to predict drug-side effect pair nodes. SDPred fuses multiple sources of side information (including drug chemical structures, drug target, drug word, side effect semantic similarity, side effect word) by feature concatenation and adopts CNN and MLP for prediction tasks. However, GCRS and SDPred perform poorly on the Luo dataset; this is probably because they are pairwise learning methods that use random negative sampling to construct the training set. Random negative sampling cannot guarantee the reliability and quality of the negative sample pairs, which results in a certain loss of information (Zhang et al., 2015; Ali & Aittokallio, 2019). The ensemble model (Zhang et al., 2016) combines Liu’s method (Liu et al., 2012), Cheng’s method (Cheng et al., 2013), INBM and RBM by the average scoring rule. The results of the ensemble model are clearly better than those of its sub-models on the four datasets.

6 Conclusion

This paper presents MKronRLSF-LP for drug-side effect prediction. The MKronRLSF-LP method addresses the general problem of multi-view fusion-based link prediction by using the consensus partition and the multiple graph Laplacian constraint. MKronRLSF-LP allows some freedom to model each view differently and learns combination weights for the views to find the consensus partition. Each view’s weight is dynamically learned and plays a crucial role in exploring consensus information. Because Laplacian regularization is found to enhance semi-supervised learning performance, a multiple graph Laplacian regularization term is added to the objective function. Finally, we present an efficient alternating optimization algorithm. The results of our experiments indicate that the proposed method is superior in classification performance to the baseline algorithms and current drug-side effect predictors.

Acknowledgments

This work is supported in part by the National Natural Science Foundation of China (NSFC 62172076, 62250028 and U22A2038), the Zhejiang Provincial Natural Science Foundation of China (Grant No. LY23F020003), and the Municipal Government of Quzhou (Grant No. 2023D036).

References

Ali & Aittokallio (2019) Mehreen Ali and Tero Aittokallio. Machine learning and feature selection for drug response prediction in precision oncology applications. Biophysical reviews, 11(1):31–39, 2019.
Bruno & Marchand-Maillet (2009) Eric Bruno and Stéphane Marchand-Maillet. Multiview clustering: a late fusion approach using latent models. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pp. 736–737, 2009.
Byrd et al. (1999) Richard H Byrd, Mary E Hribar, and Jorge Nocedal. An interior point algorithm for large-scale nonlinear programming. SIAM Journal on Optimization, 9(4):877–900, 1999.
Chao & Sun (2019) Guoqing Chao and Shiliang Sun. Semi-supervised multi-view maximum entropy discrimination with expectation laplacian regularization. Information Fusion, 45:296–306, 2019.
Cheng et al. (2013) F. Cheng, W. Li, X. Wang, Y. Zhou, Z. Wu, J. Shen, and Y. Tang. Adverse drug events: database construction and in silico prediction. Journal of Chemical Information & Modeling, 53(4):744–752, 2013.
Cichonska et al. (2018) Anna Cichonska, Tapio Pahikkala, Sandor Szedmak, Heli Julkunen, Antti Airola, Markus Heinonen, Tero Aittokallio, and Juho Rousu. Learning with multiple pairwise kernels for drug bioactivity prediction. Bioinformatics, 34(13):i509–i518, 2018.
Da Silva & Krishnamurthy (2016) Brianna A Da Silva and Mahesh Krishnamurthy. The alarming reality of medication error: a patient case and review of pennsylvania and national data. Journal of community hospital internal medicine perspectives, 6(4):31758, 2016.
Ding et al. (2016) Yijie Ding, Jijun Tang, and Fei Guo. Predicting protein-protein interactions via multivariate mutual information of protein sequences. BMC bioinformatics, 17(1):1–13, 2016.
Ding et al. (2018) Yijie Ding, Jijun Tang, and Fei Guo. Identification of drug-side effect association via semisupervised model and multiple kernel learning. IEEE journal of biomedical and health informatics, 23(6):2619–2632, 2018.
Ding et al. (2019) Yijie Ding, Jijun Tang, and Fei Guo. Identification of drug-side effect association via multiple information integration with centered kernel alignment. Neurocomputing, 325:211–224, 2019.
Ding et al. (2021) Yijie Ding, Jijun Tang, and Fei Guo. Identification of drug-target interactions via multi-view graph regularized link propagation model. Neurocomputing, 461:618–631, 2021.
Ezzat et al. (2017) Ali Ezzat, Peilin Zhao, Min Wu, Xiao-Li Li, and Chee-Keong Kwoh. Drug-target interaction prediction with graph regularized matrix factorization. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 14(3):646–656, 2017. doi: 10.1109/TCBB.2016.2530062.
Fan et al. (2021) Haoyi Fan, Fengbin Zhang, Yuxuan Wei, Zuoyong Li, Changqing Zou, Yue Gao, and Qionghai Dai. Heterogeneous hypergraph variational autoencoder for link prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8):4125–4138, 2021.
Fu et al. (2022) Haitao Fu, Feng Huang, Xuan Liu, Yang Qiu, and Wen Zhang. Mvgcn: data integration through multi-view graph convolutional network for predicting links in biomedical bipartite networks. Bioinformatics, 38(2):426–434, 2022.
Galeano et al. (2020) Diego Galeano, Shantao Li, Mark Gerstein, and Alberto Paccanaro. Predicting the frequencies of drug side effects. Nature communications, 11(1):1–14, 2020.
Gönen & Alpaydın (2011) Mehmet Gönen and Ethem Alpaydın. Multiple kernel learning algorithms. The Journal of Machine Learning Research, 12:2211–2268, 2011.
Guo et al. (2021) Xiaoyi Guo, Wei Zhou, Bin Shi, Xiaohua Wang, Aiyan Du, Yijie Ding, Jijun Tang, and Fei Guo. An efficient multiple kernel support vector regression model for assessing dry weight of hemodialysis patients. Current Bioinformatics, 16(2):284–293, 2021.
Houthuys & Suykens (2021) Lynn Houthuys and Johan AK Suykens. Tensor-based restricted kernel machines for multi-view classification. Information Fusion, 68:54–66, 2021.
Houthuys et al. (2018) Lynn Houthuys, Rocco Langone, and Johan AK Suykens. Multi-view kernel spectral clustering. Information Fusion, 44:46–56, 2018.
Hue & Vert (2010) Martial Hue and Jean-Philippe Vert. On learning with kernels for unordered pairs. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 463–470, 2010.
Jiang et al. (2023) Bingbing Jiang, Chenglong Zhang, Yan Zhong, Yi Liu, Yingwei Zhang, Xingyu Wu, and Weiguo Sheng. Adaptive collaborative fusion for multi-view semi-supervised classification. Information Fusion, 96:37–50, 2023.
Jiang et al. (2019) Shuhui Jiang, Zhengming Ding, and Yun Fu. Heterogeneous recommendation via deep low-rank sparse collective factorization. IEEE transactions on pattern analysis and machine intelligence, 42(5):1097–1111, 2019.
Kailath (1971) Thomas Kailath. Rkhs approach to detection and estimation problems–i: Deterministic signals in gaussian noise. IEEE Transactions on Information Theory, 17(5):530–549, 1971.
Knox et al. (2010) Craig Knox, Vivian Law, Timothy Jewison, Philip Liu, Son Ly, Alex Frolkis, Allison Pon, Kelly Banco, Christine Mak, Vanessa Neveu, et al. Drugbank 3.0: a comprehensive resource for ‘omics’ research on drugs. Nucleic acids research, 39(suppl_1):D1035–D1041, 2010.
Kuhn et al. (2010) Michael Kuhn, Monica Campillos, Ivica Letunic, Lars Juhl Jensen, and Peer Bork. A side effect resource to capture phenotypic effects of drugs. Molecular systems biology, 6(1):343, 2010.
Laub (2004) Alan J Laub. Matrix analysis for scientists and engineers. SIAM, 2004.
Li et al. (2021) Tianjiao Li, Xing-Ming Zhao, and Limin Li. Co-vae: Drug-target binding affinity prediction by co-regularized variational autoencoders. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):8861–8873, 2021.
Liu et al. (2023) Jiyuan Liu, Xinwang Liu, Yuexiang Yang, Qing Liao, and Yuanqing Xia. Contrastive multi-view kernel learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
Liu et al. (2012) Mei Liu, Yonghui Wu, Yukun Chen, Jingchun Sun, Zhongming Zhao, Xue-wen Chen, Michael Edwin Matheny, and Hua Xu. Large-scale prediction of adverse drug reactions using chemical, biological, and phenotypic properties of drugs. Journal of the American Medical Informatics Association, 19(e1):e28–e35, 2012.
Luo et al. (2017) Yunan Luo, Xinbin Zhao, Jingtian Zhou, Jinglin Yang, Yanqing Zhang, Wenhua Kuang, Jian Peng, Ligong Chen, and Jianyang Zeng. A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information. Nature communications, 8(1):1–13, 2017.
Lv et al. (2021) Juncheng Lv, Zhao Kang, Boyu Wang, Luping Ji, and Zenglin Xu. Multi-view subspace clustering via partition fusion. Information Sciences, 560:410–423, 2021.
Miao et al. (2021) Xupeng Miao, Wentao Zhang, Yingxia Shao, Bin Cui, Lei Chen, Ce Zhang, and Jiawei Jiang. Lasagne: A multi-layer graph convolutional network framework via node-aware deep architecture. IEEE Transactions on Knowledge and Data Engineering, 35(2):1721–1733, 2021.
Nascimento et al. (2016) André CA Nascimento, Ricardo BC Prudêncio, and Ivan G Costa. A multiple kernel learning algorithm for drug-target interaction prediction. BMC bioinformatics, 17:1–16, 2016.
Nocedal & Wright (2006) Jorge Nocedal and Stephen J Wright. Quadratic programming. Numerical optimization, pp. 448–492, 2006.
Pang & Cheung (2017) Jiahao Pang and Gene Cheung. Graph laplacian regularization for image denoising: Analysis in the continuous domain. IEEE Transactions on Image Processing, 26(4):1770–1785, 2017.
Pauwels et al. (2011) E. Pauwels, V. Stoven, and Y. Yamanishi. Predicting drug side-effect profiles: a chemical fragment-based approach. Bmc Bioinformatics, 12(1):169, 2011.
Pekalska & Haasdonk (2008) Elżbieta Pękalska and Bernard Haasdonk. Kernel discriminant analysis for positive definite and indefinite kernels. IEEE transactions on pattern analysis and machine intelligence, 31(6):1017–1032, 2008.
Perrone & Cooper (1995) Michael P Perrone and Leon N Cooper. When networks disagree: Ensemble methods for hybrid neural networks. In How We Learn; How We Remember: Toward An Understanding Of Brain And Neural Systems: Selected Papers of Leon N Cooper, pp. 342–358. World Scientific, 1995.
Qian et al. (2022a) Yuqing Qian, Yijie Ding, Quan Zou, and Fei Guo. Identification of drug-side effect association via restricted boltzmann machines with penalized term. Briefings in Bioinformatics, 23(6):bbac458, 2022a.
Qian et al. (2022b) Yuqing Qian, Yijie Ding, Quan Zou, and Fei Guo. Multi-view kernel sparse representation for identification of membrane protein types. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 20(2):1234–1245, 2022b.
Raymond & Kashima (2010) Rudy Raymond and Hisashi Kashima. Fast and scalable algorithms for semi-supervised link prediction on static and dynamic graphs. In Joint european conference on machine learning and knowledge discovery in databases, pp. 131–147. Springer, 2010.
Sayaka et al. (2012) M. Sayaka, P. Edouard, S Véronique, G. Susumu, and Y. Yoshihiro. Relating drug–protein interaction network with drug side effects. Bioinformatics, 2012.
Shabani-Mashcool et al. (2020) S. Shabani-Mashcool, S. A. Marashi, and S. Gharaghani. Nddsa: A network- and domain-based method for predicting drug-side effect associations. Information Processing & Management, 57(6):102357, 2020.
Shi et al. (2019) Caijuan Shi, Changyu Duan, Zhibin Gu, Qi Tian, Gaoyun An, and Ruizhen Zhao. Semi-supervised feature selection analysis with structured multi-view sparse regularization. Neurocomputing, 330:412–424, 2019.
Wang et al. (2023a) Dexian Wang, Tianrui Li, Wei Huang, Zhipeng Luo, Ping Deng, Pengfei Zhang, and Minbo Ma. A multi-view clustering algorithm based on deep semi-nmf. Information Fusion, pp. 101884, 2023a.
Wang et al. (2019) Siwei Wang, Xinwang Liu, En Zhu, Chang Tang, Jiyuan Liu, Jingtao Hu, Jingyuan Xia, and Jianping Yin. Multi-view clustering via late fusion alignment maximization. In IJCAI, pp. 3778–3784, 2019.
Wang et al. (2021) Tinghua Wang, Lin Zhang, and Wenyu Hu. Bridging deep and multiple kernel learning: A review. Information Fusion, 67:3–13, 2021.
Wang et al. (2023b) Yizheng Wang, Yixiao Zhai, Yijie Ding, and Quan Zou. Sbsm-pro: Support bio-sequence machine for proteins. arXiv preprint arXiv:2308.10275, 2023b. doi: 10.48550/arXiv.2308.10275.
Wishart et al. (2006) David S Wishart, Craig Knox, An Chi Guo, Savita Shrivastava, Murtaza Hassanali, Paul Stothard, Zhan Chang, and Jennifer Woolsey. Drugbank: a comprehensive resource for in silico drug discovery and exploration. Nucleic acids research, 34(suppl_1):D668–D672, 2006.
Xie & Sun (2020) Xijiong Xie and Shiliang Sun. General multi-view semi-supervised least squares support vector machines with multi-manifold regularization. Information Fusion, 62:63–72, 2020.
Xu et al. (2021) Lixiang Xu, Lu Bai, Jin Xiao, Qi Liu, Enhong Chen, Xiaofeng Wang, and Yuanyan Tang. Multiple graph kernel learning based on gmdh-type neural network. Information Fusion, 66:100–110, 2021.
Xu et al. (2022) Xianyu Xu, Ling Yue, Bingchun Li, Ying Liu, Yuan Wang, Wenjuan Zhang, and Lin Wang. Dsgat: predicting frequencies of drug side effects by graph attention networks. Briefings in Bioinformatics, 23(2):bbab586, 2022.
Xuan et al. (2022) Ping Xuan, Meng Wang, Yong Liu, Dong Wang, Tiangang Zhang, and Toshiya Nakaguchi. Integrating specific and common topologies of heterogeneous graphs and pairwise attributes for drug-related side effect prediction. Briefings in Bioinformatics, 23(3):bbac126, 2022.
Yang et al. (2016) Bo Yang, Hongbin Pei, Hechang Chen, Jiming Liu, and Shang Xia. Characterizing and discovering spatiotemporal social contact patterns for healthcare. IEEE transactions on pattern analysis and machine intelligence, 39(8):1532–1546, 2016.
Yuan et al. (2019) Weiwei Yuan, Kangya He, Donghai Guan, Li Zhou, and Chenliang Li. Graph kernel based link prediction for signed social networks. Information Fusion, 46:1–10, 2019.
Zha et al. (2009) Zheng-Jun Zha, Tao Mei, Jingdong Wang, Zengfu Wang, and Xian-Sheng Hua. Graph-based semi-supervised learning with multiple labels. Journal of Visual Communication and Image Representation, 20(2):97–103, 2009.
Zhang et al. (2015) Ping Zhang, Fei Wang, Jianying Hu, and Robert Sorrentino. Label propagation prediction of drug-drug interactions based on clinical side effects. Scientific reports, 5(1):12339, 2015.
Zhang et al. (2019) Si Zhang, Hanghang Tong, Jiejun Xu, and Ross Maciejewski. Graph convolutional networks: a comprehensive review. Computational Social Networks, 6(1):1–23, 2019.
Zhang et al. (2016) Wen Zhang, Hua Zou, Longqiang Luo, Qianchao Liu, Weijian Wu, and Wenyi Xiao. Predicting potential side effects of drugs by recommender methods and ensemble learning. Neurocomputing, 173:979–987, 2016.
Zhao et al. (2021) Haochen Zhao, Kai Zheng, Yaohang Li, and Jianxin Wang. A novel graph attention model for predicting frequencies of drug–side effects from multi-view data. Briefings in Bioinformatics, 22(6):bbab239, 2021.
Zhao et al. (2022) Haochen Zhao, Shaokai Wang, Kai Zheng, Qichang Zhao, Feng Zhu, and Jianxin Wang. A similarity-based deep learning approach for determining the frequencies of drug side effects. Briefings in Bioinformatics, 23(1):bbab449, 2022.

Appendix A Appendix

A.1 Optimization

Equation 24 is difficult and time-consuming to solve because it contains multiple variables and large pairwise matrices. In this section, we divide the original problem into five subproblems and develop an iterative algorithm to optimize them. We avoid explicit computation of any pairwise matrix throughout the optimization, which makes our method suitable for problems in large pairwise spaces.

$\hat{\bm{F}}$ -subproblem: we fix ${{\bm{a}}^{v}}$ , ${\bm{w}}$ , ${\bm{\theta}_{D}}$ and ${\bm{\theta}_{S}}$ to optimize the variable $\hat{\bm{F}}$ . Let ${\bm{A}}={\bm{H}}_{S}^{-0.5}{\bm{K}}_{S}^{*}{\bm{H}}_{S}^{-0.5}$ , ${\bm{B}}={\bm{H}}_{D}^{-0.5}{\bm{K}}_{D}^{*}{\bm{H}}_{D}^{-0.5}$ and $\text{vec}\left(\hat{\bm{F}}^{v}\right)={{\bm{K}}^{v}}{{\bm{a}}^{v}}$ . Then, the optimization model of $\hat{\bm{F}}$ is as follows:

$$
\begin{aligned}
\mathop{\arg\min}\limits_{\hat{\bm{F}}}\quad & \frac{1}{2}\left\|\text{vec}\left(\hat{\bm{F}}\right)-\sum\limits_{v=1}^{V}{\bm{w}}_{v}\text{vec}\left(\hat{\bm{F}}^{v}\right)\right\|_{2}^{2}+\frac{1}{2}\sigma\,\text{vec}\left(\hat{\bm{F}}\right)^{T}{\bm{L}}\,\text{vec}\left(\hat{\bm{F}}\right) \\
s.t.\quad & {\bm{L}}={\bm{I}}_{NM}-{\bm{A}}\otimes{\bm{B}}.
\end{aligned}
\tag{25}
$$

Setting the derivative of Equation 25 w.r.t. $\hat{\bm{F}}$ to zero, the solution of $\hat{\bm{F}}$ can be obtained:

$$
\text{vec}\left(\hat{\bm{F}}\right)=\left(\left(1+\sigma\right){\bm{I}}_{NM}-\sigma{\bm{A}}\otimes{\bm{B}}\right)^{-1}\left(\sum\nolimits_{v=1}^{V}{\bm{w}}_{v}\text{vec}\left(\hat{\bm{F}}^{v}\right)\right). \tag{26}
$$
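For completeness, the intermediate step between Equation 25 and Equation 26 can be written out as follows. This is our own expansion, assuming that ${\bm{L}}$ is symmetric (which holds when ${\bm{A}}$ and ${\bm{B}}$ are symmetric normalized kernel matrices), so that the gradient of the quadratic term is $\sigma{\bm{L}}\,\text{vec}(\hat{\bm{F}})$.

```latex
\begin{aligned}
&\left(\text{vec}\left(\hat{\bm{F}}\right)-\sum_{v=1}^{V}{\bm{w}}_{v}\,\text{vec}\left(\hat{\bm{F}}^{v}\right)\right)
  +\sigma{\bm{L}}\,\text{vec}\left(\hat{\bm{F}}\right)=\bm{0} \\
\Longrightarrow\quad
&\left({\bm{I}}_{NM}+\sigma{\bm{L}}\right)\text{vec}\left(\hat{\bm{F}}\right)
  =\sum_{v=1}^{V}{\bm{w}}_{v}\,\text{vec}\left(\hat{\bm{F}}^{v}\right), \\
&{\bm{I}}_{NM}+\sigma{\bm{L}}
  ={\bm{I}}_{NM}+\sigma\left({\bm{I}}_{NM}-{\bm{A}}\otimes{\bm{B}}\right)
  =\left(1+\sigma\right){\bm{I}}_{NM}-\sigma{\bm{A}}\otimes{\bm{B}}.
\end{aligned}
```

Inverting the last matrix gives the closed form for $\text{vec}(\hat{\bm{F}})$.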

Notice that computing the inverse matrix on the right-hand side of Equation 26 directly needs too much time and memory. Therefore, we use eigendecomposition techniques to compute the approximate inverse. Let ${{\bm{V}}_{A}}{\bm{\Lambda}_{A}}{\bm{V}}_{A}^{T}$ and ${{\bm{V}}_{B}}{\bm{\Lambda}_{B}}{\bm{V}}_{B}^{T}$ be the eigendecompositions of the matrices ${\bm{A}}$ and ${\bm{B}}$ , respectively. Define the matrix ${\bm{U}}$ by ${{\bm{U}}_{i,j}}={\left[{{\bm{\Lambda}_{A}}}\right]_{i,i}}\times{\left[{{\bm{\Lambda}_{B}}}\right]_{j,j}}$ . By the theorem of Raymond & Kashima (2010), the Kronecker product matrix ${\bm{A}}\otimes{\bm{B}}$ can be eigendecomposed as $\left({\bm{V}}_{A}\otimes{\bm{V}}_{B}\right)\text{diag}\left(\text{vec}\left({\bm{U}}\right)\right)\left({\bm{V}}_{A}\otimes{\bm{V}}_{B}\right)^{T}$ . Substituting this into Equation 26, we can write the inverse matrix in Equation 26 as

$$
\left(\left(1+\sigma\right){\bm{I}}_{NM}-\sigma{\bm{A}}\otimes{\bm{B}}\right)^{-1}=\left(\left(1+\sigma\right){\bm{I}}_{NM}-\sigma\left({\bm{V}}_{A}\otimes{\bm{V}}_{B}\right)\text{diag}\left(\text{vec}\left({\bm{U}}\right)\right)\left({\bm{V}}_{A}\otimes{\bm{V}}_{B}\right)^{T}\right)^{-1}. \tag{27}
$$

Since it holds that $\left({\bm{V}}_{A}\otimes{\bm{V}}_{B}\right)\left({\bm{V}}_{A}\otimes{\bm{V}}_{B}\right)^{T}={\bm{I}}_{NM}$ , Equation 27 can be transformed into

$$
\left(\left(1+\sigma\right){\bm{I}}_{NM}-\sigma{\bm{A}}\otimes{\bm{B}}\right)^{-1}=\left({\bm{V}}_{A}\otimes{\bm{V}}_{B}\right)\left(\left(1+\sigma\right){\bm{I}}_{NM}-\sigma\,\text{diag}\left(\text{vec}\left({\bm{U}}\right)\right)\right)^{-1}\left({\bm{V}}_{A}\otimes{\bm{V}}_{B}\right)^{T}. \tag{28}
$$

Notice that the inner inverse on the right-hand side of Equation 28 is a diagonal matrix, whose entries can be collected in the matrix ${\bm{W}}$ :

$$
{\bm{W}}_{i,j}=\left(1+\sigma-\sigma{\bm{U}}_{i,j}\right)^{-1}. \tag{29}
$$

So, we can further rewrite Equation 26 as

$$
\text{vec}\left(\hat{\bm{F}}\right)=\left({\bm{V}}_{A}\otimes{\bm{V}}_{B}\right)\text{diag}\left(\text{vec}\left({\bm{W}}\right)\right)\left({\bm{V}}_{A}\otimes{\bm{V}}_{B}\right)^{T}\left(\sum\nolimits_{v=1}^{V}{\bm{w}}_{v}\text{vec}\left(\hat{\bm{F}}^{v}\right)\right). \tag{30}
$$

Applying the vec-trick to remove the vectorization, we obtain the solution

$$
\hat{\bm{F}}={\bm{V}}_{B}\left({\bm{W}}\odot\left({\bm{V}}_{B}^{T}\left(\sum\nolimits_{v=1}^{V}{\bm{w}}_{v}\hat{\bm{F}}^{v}\right){\bm{V}}_{A}\right)\right){\bm{V}}_{A}^{T}. \tag{31}
$$
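The closed-form update of Equation 31 can be sketched in NumPy as follows. This is an illustrative version, not the authors' implementation: the function name `update_consensus` is hypothetical, ${\bm{A}}$ ($M\times M$) and ${\bm{B}}$ ($N\times N$) are assumed symmetric, each $\hat{\bm{F}}^{v}$ is stored as an $N\times M$ array, and vec is taken as column-stacking. Under these assumptions the entries of ${\bm{W}}$ are laid out so that rows follow the eigenvalues of ${\bm{B}}$ and columns those of ${\bm{A}}$, matching the shape of $\hat{\bm{F}}$.

```python
import numpy as np


def update_consensus(F_views, w, A, B, sigma):
    """Consensus update without forming the NM x NM Kronecker matrix.

    F_views: list of (N, M) base partitions; w: view weights;
    A: (M, M) symmetric; B: (N, N) symmetric; sigma: Laplacian weight.
    """
    lam_A, V_A = np.linalg.eigh(A)  # A = V_A diag(lam_A) V_A^T
    lam_B, V_B = np.linalg.eigh(B)  # B = V_B diag(lam_B) V_B^T
    # Weighted sum of the base partitions.
    F_sum = sum(w_v * F_v for w_v, F_v in zip(w, F_views))
    # W[i, j] = 1 / (1 + sigma - sigma * lam_B[i] * lam_A[j]).
    W = 1.0 / (1.0 + sigma - sigma * np.outer(lam_B, lam_A))
    # Rotate into the eigenbasis, rescale elementwise, rotate back.
    return V_B @ (W * (V_B.T @ F_sum @ V_A)) @ V_A.T
```

The two eigendecompositions cost $O(N^{3}+M^{3})$, whereas solving the $NM\times NM$ system of Equation 26 directly would be far more expensive for realistic $N$ and $M$.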

${\bm{w}}$ -subproblem: we fix all the variables except ${\bm{w}}$ . The formula is as follows:

$$
\begin{aligned}
\mathop{\arg\min}\limits_{\bm{w}}\quad & \frac{1}{2}\left\|\hat{\bm{F}}-\sum\limits_{v=1}^{V}{\bm{w}}_{v}\hat{\bm{F}}^{v}\right\|_{F}^{2}+\mu\sum\limits_{v=1}^{V}\left(\frac{{\bm{w}}_{v}}{2}\left\|{\bm{F}}-\hat{\bm{F}}^{v}\right\|_{F}^{2}\right)+\frac{1}{2}\beta\left\|{\bm{w}}\right\|_{2}^{2} \\
s.t.\quad & \sum\limits_{v=1}^{V}{\bm{w}}_{v}=1,\ {\bm{w}}_{v}\geq 0,\ v=1,\ldots,V.
\end{aligned}
\tag{32}
$$

Problem 32 can be simplified as a standard quadratic programming problem (Nocedal & Wright, 2006)

$$
\begin{aligned}
\mathop{\arg\min}\limits_{\bm{w}}\quad & {\bm{w}}^{T}{\bm{G}}{\bm{w}}-{\bm{w}}^{T}{\bm{h}} \\
s.t.\quad & \sum\limits_{v=1}^{V}{\bm{w}}_{v}=1,\ {\bm{w}}_{v}\geq 0,\ v=1,\ldots,V.
\end{aligned}
\tag{33}
$$

where ${\bm{G}}\in{{\mathbb{R}}^{V\times V}}$ has the elements

$$
{\bm{G}}_{i,j}=\left\{\begin{array}{ll}\frac{1}{2}\text{trace}\left(\left(\hat{\bm{F}}^{i}\right)^{T}\hat{\bm{F}}^{j}\right), & {\rm if}\ i\neq j,\\ \frac{1}{2}\text{trace}\left(\left(\hat{\bm{F}}^{i}\right)^{T}\hat{\bm{F}}^{j}\right)+\frac{1}{2}\beta, & {\rm if}\ i=j.\end{array}\right. \tag{34}
$$

${\bm{h}}$ is a vector with

$$
{\bm{h}}_{i}=\text{trace}\left(\hat{\bm{F}}^{T}\hat{\bm{F}}^{i}\right)-\frac{\mu}{2}\left\|{\bm{F}}-\hat{\bm{F}}^{i}\right\|_{F}^{2}. \tag{35}
$$

Equation 33 is solved with the interior-point optimization algorithm (Byrd et al., 1999).
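The construction of ${\bm{G}}$ and ${\bm{h}}$ and a simple solver for Equation 33 can be sketched as follows. Note the substitution: instead of the interior-point algorithm cited above, this sketch uses projected gradient descent with a sort-based Euclidean projection onto the simplex, chosen only because it fits in a few lines of NumPy. All function names are hypothetical, and the step size assumes ${\bm{G}}$ has a positive largest eigenvalue (for example $\beta>0$).

```python
import numpy as np


def build_qp(F_hat, F, F_views, mu, beta):
    """Assemble G (Equation 34) and h (Equation 35)."""
    V = len(F_views)
    # T[i, j] = trace((F^i)^T F^j), computed as an elementwise sum.
    T = np.array([[np.sum(Fi * Fj) for Fj in F_views] for Fi in F_views])
    G = 0.5 * T + 0.5 * beta * np.eye(V)
    h = np.array([
        np.sum(F_hat * Fi) - 0.5 * mu * np.sum((F - Fi) ** 2)
        for Fi in F_views
    ])
    return G, h


def project_simplex(v):
    """Euclidean projection of v onto {w : sum(w) = 1, w >= 0}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    tau = css[rho] / (rho + 1)
    return np.maximum(v - tau, 0.0)


def solve_weights(G, h, n_iter=5000):
    """Minimize w^T G w - w^T h over the simplex (projected gradient)."""
    w = np.full(len(h), 1.0 / len(h))
    # Gradient is 2 G w - h; its Lipschitz constant is 2 * lambda_max(G).
    step = 1.0 / (2.0 * np.linalg.eigvalsh(G)[-1])
    for _ in range(n_iter):
        w = project_simplex(w - step * (2.0 * G @ w - h))
    return w
```

Since ${\bm{G}}$ is a Gram matrix plus $\frac{1}{2}\beta{\bm{I}}$, the problem is convex, so any solver that respects the simplex constraints should reach the same minimizer.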

$\bm{\theta}_{D}$ -subproblem: With all the variables except $\bm{\theta}_{D}$ fixed, the formula can be written as

$$
\begin{aligned}
\mathop{\arg\min}\limits_{\bm{\theta}_{D}}\quad & \frac{1}{2}\sigma\,\text{vec}\left(\hat{\bm{F}}\right)^{T}{\bm{L}}\,\text{vec}\left(\hat{\bm{F}}\right) \\
s.t.\quad & {\bm{L}}={\bm{I}}_{NM}-\left({\bm{H}}_{S}^{-0.5}{\bm{K}}_{S}^{*}{\bm{H}}_{S}^{-0.5}\right)\otimes\left({\bm{H}}_{D}^{-0.5}{\bm{K}}_{D}^{*}{\bm{H}}_{D}^{-0.5}\right), \\
& {\bm{K}}_{D}^{*}=\sum\limits_{i=1}^{P}\left[\bm{\theta}_{D}\right]_{i}^{\varepsilon}{\bm{K}}_{D}^{i},\ \sum\limits_{i=1}^{P}\left[\bm{\theta}_{D}\right]_{i}=1,\ \left[\bm{\theta}_{D}\right]_{i}\geq 0,\ i=1,\ldots,P.
\end{aligned}
\tag{36}
$$

Let ${\bm{A}}={{\bm{H}}_{S}^{-0.5}{\bm{K}}_{S}^{*}{\bm{H}}_{S}^{-0.5}}$ and ${\bm{B}}^{i}={{\bm{H}}_{D}^{-0.5}{\bm{K}}_{D}^{i}{\bm{H}}_{D}^{-0.5}}$ . Substituting ${\bm{L}}$ in Equation 36 with ${\bm{A}}$ and ${\bm{B}}^{i}$ , and dropping the term that does not depend on $\bm{\theta}_{D}$ , the objective function 36 can be written as

$$
\begin{aligned}
\mathop{\arg\min}\limits_{\bm{\theta}_{D}}\quad & -\frac{1}{2}\sigma\,\text{vec}\left(\hat{\bm{F}}\right)^{T}\sum\limits_{i=1}^{P}\left[\bm{\theta}_{D}\right]_{i}^{\varepsilon}\left({\bm{A}}\otimes{\bm{B}}^{i}\right)\text{vec}\left(\hat{\bm{F}}\right) \\
s.t.\quad & \sum\limits_{i=1}^{P}\left[\bm{\theta}_{D}\right]_{i}=1,\ \left[\bm{\theta}_{D}\right]_{i}\geq 0,\ i=1,\ldots,P.
\end{aligned}
\tag{37}
$$

Further, introducing the Lagrange multiplier $\xi$ , the objective function 37 can be converted to a Lagrange function:

$$
\text{Lag}\left(\bm{\theta}_{D},\xi\right)=-\frac{1}{2}\sigma\,\text{vec}\left(\hat{\bm{F}}\right)^{T}\sum\limits_{i=1}^{P}\left[\bm{\theta}_{D}\right]_{i}^{\varepsilon}\left({\bm{A}}\otimes{\bm{B}}^{i}\right)\text{vec}\left(\hat{\bm{F}}\right)-\xi\left(\sum\limits_{i=1}^{P}\left[\bm{\theta}_{D}\right]_{i}-1\right). \tag{38}
$$

Setting the derivatives of Equation 38 w.r.t. $\bm{\theta}_{D}$ and $\xi$ to zero, we obtain the following solution:

$$
\left[\bm{\theta}_{D}\right]_{i}=\frac{\left(\text{vec}\left(\hat{\bm{F}}\right)^{T}\left({\bm{A}}\otimes{\bm{B}}^{i}\right)\text{vec}\left(\hat{\bm{F}}\right)\right)^{\frac{1}{1-\varepsilon}}}{\sum\limits_{j=1}^{P}\left(\text{vec}\left(\hat{\bm{F}}\right)^{T}\left({\bm{A}}\otimes{\bm{B}}^{j}\right)\text{vec}\left(\hat{\bm{F}}\right)\right)^{\frac{1}{1-\varepsilon}}}. \tag{39}
$$

Using the vec-trick operation, the solution can be written as

\left[{\bm{\theta}_{D}}\right]_{i}=\frac{\text{trace}\left(\hat{\bm{F}}^{T}{\bm{B}}^{i}\hat{\bm{F}}{\bm{A}}^{T}\right)^{\frac{1}{1-\varepsilon}}}{\sum\limits_{j=1}^{P}\text{trace}\left(\hat{\bm{F}}^{T}{\bm{B}}^{j}\hat{\bm{F}}{\bm{A}}^{T}\right)^{\frac{1}{1-\varepsilon}}}

(40)
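As a numerical illustration of Equations 39 and 40, the following NumPy sketch checks on small random matrices that the quadratic form with the Kronecker product equals the trace form, and then evaluates the normalized weights. The function name `kernel_weights`, the matrix sizes, the random positive definite matrices and the value of $\varepsilon$ are assumptions made only for this example; they are not taken from the released code.

```python
import numpy as np

def kernel_weights(F_hat, A, B_list, eps):
    """Evaluate weights of the form in Equation 40.

    Assumes eps != 1 and that every trace term is positive.
    """
    # trace(F^T B^i F A^T) equals vec(F)^T (A kron B^i) vec(F) for column-major vec
    scores = np.array([np.trace(F_hat.T @ B @ F_hat @ A.T) for B in B_list])
    powered = scores ** (1.0 / (1.0 - eps))
    return powered / powered.sum()

rng = np.random.default_rng(0)
N, M, P = 4, 3, 2  # hypothetical numbers of drugs, side effects and drug kernels

def random_pd(n):
    # Random symmetric positive definite matrix standing in for a normalized kernel
    X = rng.random((n, n))
    return X @ X.T + np.eye(n)

F_hat = rng.random((N, M))
A = random_pd(M)                            # side-effect matrix (M x M)
B_list = [random_pd(N) for _ in range(P)]   # drug matrices (N x N)

# Check the vec-trick identity for the first kernel without relying on Equation 40
vec_F = F_hat.reshape(-1, order="F")        # column-major vectorization
quad_form = vec_F @ np.kron(A, B_list[0]) @ vec_F
trace_form = np.trace(F_hat.T @ B_list[0] @ F_hat @ A.T)
identity_gap = abs(quad_form - trace_form)

theta_D = kernel_weights(F_hat, A, B_list, eps=0.5)  # eps chosen arbitrarily
```

The trace form only multiplies $N\times N$, $N\times M$ and $M\times M$ matrices, so the $NM\times NM$ Kronecker product never has to be stored outside of this check.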

$\bm{\theta}_{S}$ -subproblem: The solution for $\bm{\theta}_{S}$ is derived in the same way as that for $\bm{\theta}_{D}$ . Here, the optimization process is omitted and we directly give the solution

\left[{\bm{\theta}_{S}}\right]_{i}=\frac{\text{trace}\left(\hat{\bm{F}}^{T}{\bm{B}}\hat{\bm{F}}\left({\bm{A}}^{i}\right)^{T}\right)^{\frac{1}{1-\varepsilon}}}{\sum\limits_{j=1}^{Q}\text{trace}\left(\hat{\bm{F}}^{T}{\bm{B}}\hat{\bm{F}}\left({\bm{A}}^{j}\right)^{T}\right)^{\frac{1}{1-\varepsilon}}}

(41)

where ${\bm{B}}={{\bm{H}}_{D}^{-0.5}{\bm{K}}_{D}^{*}{\bm{H}}_{D}^{-0.5}}$ and ${\bm{A}}^{i}={{\bm{H}}_{S}^{-0.5}{\bm{K}}_{S}^{i}{\bm{H}}_{S}^{-0.5}}$ .

${{\bm{a}}^{v}}$ -subproblem: By dropping all terms that do not depend on ${{\bm{a}}^{v}}$ , we have

\mathop{\arg\min}\limits_{{\bm{a}}^{v}}\ \frac{1}{2}\left\|{\mathop{\rm vec}\nolimits}\left(\hat{\bm{F}}\right)-\sum\limits_{i=1}^{V}{\bm{w}}_{i}{\bm{K}}^{i}{\bm{a}}^{i}\right\|_{2}^{2}+\mu\left(\frac{{\bm{w}}_{v}}{2}\left\|{\mathop{\rm vec}\nolimits}\left({\bm{F}}\right)-{\bm{K}}^{v}{\bm{a}}^{v}\right\|_{2}^{2}+\frac{\lambda^{v}}{2}\left({\bm{a}}^{v}\right)^{T}{\bm{K}}^{v}{\bm{a}}^{v}\right).

(42)

It can be observed from the objective function 42 that, when training the parameter ${\bm{a}}^{v}$ , the other views ${\bm{K}}^{i}$ with weights ${\bm{w}}_{i}$ are taken into consideration. Therefore, the partitions are not trained completely separately; information is shared among them.

Setting the derivative of the objective in problem 42 w.r.t. ${{\bm{a}}^{v}}$ to zero, we get

\left({\bm{K}}^{v}+\frac{\lambda^{v}}{1+\mu{\bm{w}}_{v}}{\bm{I}}_{NM}\right){\bm{a}}^{v}=\frac{1}{1+\mu{\bm{w}}_{v}}\left({\mathop{\rm vec}\nolimits}\left(\hat{\bm{F}}\right)-\sum\limits_{i=1,i\neq v}^{V}{\bm{w}}_{i}{\bm{K}}^{i}{\bm{a}}^{i}+\mu{\bm{w}}_{v}{\mathop{\rm vec}\nolimits}\left({\bm{F}}\right)\right)

(43)

Letting ${\bm{W}}=\hat{\bm{F}}-\sum\limits_{i=1,i\neq v}^{V}{\bm{w}}_{i}\hat{\bm{F}}^{i}+\mu{\bm{w}}_{v}{\bm{F}}$ , with ${\mathop{\rm vec}\nolimits}\left(\hat{\bm{F}}^{i}\right)={\bm{K}}^{i}{\bm{a}}^{i}$ , Equation 43 can be written as

{\bm{a}}^{v}=\frac{1}{1+\mu{\bm{w}}_{v}}\left({\bm{K}}^{v}+\frac{\lambda^{v}}{1+\mu{\bm{w}}_{v}}{\bm{I}}_{NM}\right)^{-1}\text{vec}\left({\bm{W}}\right).

(44)

We can observe that the form of Equation 44 is similar to that of Equation 7. Therefore, we use eigendecomposition and the vec-trick operation to compute ${\bm{a}}^{v}$ efficiently.
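The computation behind Equation 44 can be sketched in NumPy as follows. The sketch assumes that the pairwise kernel of view $v$ is a Kronecker product ${\bm{K}}^{v}={\bm{K}}_{S}\otimes{\bm{K}}_{D}$ of a symmetric side-effect kernel and a symmetric drug kernel, which is the standard Kron-RLS setting; the function name `solve_kron_rls`, the sizes and the parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def solve_kron_rls(K_D, K_S, W, shift):
    """Solve (K_S kron K_D + shift * I) a = vec(W) without forming the Kronecker product.

    K_D (N x N) and K_S (M x M) are assumed symmetric; W is N x M; vec is column-major.
    """
    lam_D, Q_D = np.linalg.eigh(K_D)
    lam_S, Q_S = np.linalg.eigh(K_S)
    # In the joint eigenbasis the system is diagonal with entries lam_D[i] * lam_S[j] + shift
    W_tilde = Q_D.T @ W @ Q_S
    C = W_tilde / (np.outer(lam_D, lam_S) + shift)
    # Rotate back and vectorize
    return (Q_D @ C @ Q_S.T).reshape(-1, order="F")

rng = np.random.default_rng(1)
N, M = 5, 4                                 # hypothetical numbers of drugs and side effects
X_D, X_S = rng.random((N, N)), rng.random((M, M))
K_D, K_S = X_D @ X_D.T, X_S @ X_S.T         # random symmetric PSD stand-ins for kernels
W = rng.random((N, M))                      # stand-in for the matrix W of Equation 44
mu, w_v, lam_v = 0.5, 0.25, 1.0             # arbitrary illustrative values

scale = 1.0 + mu * w_v
a_v = solve_kron_rls(K_D, K_S, W, lam_v / scale) / scale

# Dense reference solution, feasible only for small N * M
K_full = np.kron(K_S, K_D)
a_dense = np.linalg.solve(K_full + (lam_v / scale) * np.eye(N * M),
                          W.reshape(-1, order="F")) / scale
```

Under these assumptions the eigendecomposition route costs on the order of $N^{3}+M^{3}$ operations, whereas the dense solve scales with $(NM)^{3}$.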

We summarize the complete optimization process for problem 24 in Algorithm 1.

Input: The link matrix $\bm{F}$; the regularization parameters $\mu$, $\beta$, $\sigma$, $\varepsilon$ and ${\lambda^{v}},v=1,\ldots,V$;

Output: The predicted link matrix $\hat{\bm{F}}$;

1 Compute the two base kernel sets ${\mathbb{K}}_{D}$ and ${\mathbb{K}}_{S}$ by Equations 20a and 20b;

2 Initialize ${{\bm{a}}^{v}},v=1,\ldots,V$ by single view Kron-RLS; ${{\bm{w}}_{v}}=1/V,v=1,\ldots,V$; $\bm{\theta}_{D}^{i}=1/P,i=1,\ldots,P$; $\bm{\theta}_{S}^{i}=1/Q,i=1,\ldots,Q$;

3 while not converged do

4 Update $\hat{\bm{F}}$ by solving subproblem 25;

5 Update ${\bm{w}}$ by solving subproblem 32;

6 Update $\bm{\theta}_{D}$ by solving subproblem 36;

7 Update $\bm{\theta}_{S}$ by Equation 41;

8 for $i=1$ to $V$ do

9 Update ${\bm{a}}^{i}$ by solving subproblem 42;

10 end for

11 end while

Algorithm 1 Optimization for MKronRLSF-LP.

A.2 Measurements

Because drug-side effect prediction is an extremely imbalanced classification problem and we do not want the prediction model to recommend incorrect predictions, we use the following evaluation metrics:

	$Recall=\frac{{TP}}{{TP+FN}},$		(45a)
	$Precision=\frac{{TP}}{{TP+FP}},$		(45b)
	${F_{score}}=2\times\frac{{Precision\times Recall}}{{Precision+Recall}},$		(45c)

where $TP$ , $FN$ , $FP$ and $TN$ are the numbers of true-positive, false-negative, false-positive and true-negative samples, respectively. The area under the ROC curve (AUC) and the area under the precision-recall curve (AUPR) are also used to measure predictive accuracy, because they are the most commonly used evaluation metrics in biomedical link prediction. The precision-recall curve shows the tradeoff between precision and recall at different thresholds. $F_{score}$ is calculated from Precision and Recall. The highest possible value of $F_{score}$ is 1, indicating perfect precision and recall, and the lowest possible value is 0, attained when either precision or recall is zero. AUC can be interpreted as the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative instance (Li et al., 2021). When negatives vastly outnumber positives, AUC can remain high even if many of the top-ranked predictions are false positives, whereas precision-based measures penalize such errors directly. Therefore, we consider AUPR and $F_{score}$ more desirable metrics (Ezzat et al., 2017; Li et al., 2021).
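The following minimal NumPy sketch illustrates Equations 45a to 45c and the ranking interpretation of AUC on a toy example. The function names, the labels, the scores and the threshold are invented for illustration and are not part of the experiments reported here.

```python
import numpy as np

def threshold_metrics(y_true, scores, threshold):
    """Recall, Precision and F-score (Equations 45a-45c) at a fixed decision threshold."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(scores) >= threshold
    tp = int(np.sum(y_pred & y_true))
    fp = int(np.sum(y_pred & ~y_true))
    fn = int(np.sum(~y_pred & y_true))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f_score

def auc_by_ranking(y_true, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Toy example: 2 positives and 4 negatives
y_true = [1, 1, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1]

recall, precision, f_score = threshold_metrics(y_true, scores, threshold=0.5)
auc = auc_by_ranking(y_true, scores)
print(recall, precision, f_score, auc)  # 0.5 0.5 0.5 0.875
```

In this toy case seven of the eight positive-negative pairs are ordered correctly, so AUC is 0.875, while only one of the two predictions above the threshold is correct.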

A.3 Baseline methods

$\bullet$ Best single view (BSV): Kron-RLS is applied to the best single view, i.e., the one with the maximum AUPR.

$\bullet$ Committee Kron-RLS (Comm Kron-RLS) (Perrone & Cooper, 1995): Each view is trained by Kron-RLS separately, and the final classifier is a weighted average.

$\bullet$ Kron-RLS with Centered Kernel Alignment-based Multiple Kernel Learning (Kron-RLS+CKA-MKL) (Ding et al., 2019): Multiple kernels from the drug space and side effect space are linearly weighted by the optimized CKA-MKL. Finally, Kron-RLS is employed on the optimal kernels.

$\bullet$ Kron-RLS with pairwise Multiple Kernel Learning (Kron-RLS+pairwiseMKL) (Cichonska et al., 2018): First, it constructs multiple pairwise kernels. Then, the mixture weights of the pairwise kernels are determined by CKA-MKL. Finally, it learns the Kron-RLS function based on the optimal pairwise kernel.

$\bullet$ Kron-RLS with self-weighted multiple kernel learning (Kron-RLS+self-MKL) (Nascimento et al., 2016): The optimal drug and side effect kernels are linearly weighted combinations of the multiple base kernels. The weights are assigned to each kernel automatically.

$\bullet$ Multi-view graph regularized link propagation model (MvGRLP) (Ding et al., 2021): This is an extension of the graph model (Zha et al., 2009). To fuse multi-view information, multi-view Laplacian regularization is introduced to constrain the predicted values.

$\bullet$ Multi-view graph convolution network (MvGCN) (Fu et al., 2022): This extends the GCN (Zhang et al., 2019) from a single view to multi-view by combining the embeddings of multiple neighborhood information aggregation layers in each view.

A.4 Code and Data Available

The code and data are available at https://github.com/QYuQing/MKronRLSF-LP.

A.5 Figures

A.6 Tables

Table 3: Prediction performance comparison of baseline methods on four datasets.

Dataset	Methods	AUPR(%)	AUC(%)	$Recall$ (%)	$Precision$ (%)	$F_{score}$ (%)
Liu	BSV	60.12±1.12	93.22±1.63	58.77±0.33	59.09±0.49	58.52±0.23
Liu	Comm Kron-RLS	65.63±1.95	94.11±1.45	61.63±0.33	61.9±1.37	61.57±1.65
Liu	Kron-RLS+CKA-MKL	65.92±0.43	92.51±0.08	62.11±0.43	63.09±0.56	62.59±0.41
Liu	Kron-RLS+pairwiseMKL	62.03±0.44	95.01±0.06	65.39±0.24	54.46±0.30	59.43±0.21
Liu	Kron-RLS+self-MKL	65.02±0.47	92.1±0.10	60.97±0.57	63.12±0.61	62.03±0.52
Liu	MvGRLP	66.32±0.45	94.29±0.08	63.56±0.46	60.87±0.62	62.18±0.39
Liu	MvGCN	62.69±1.81	94.01±0.87	60.81±0.37	60.33±1.31	60.48±1.15
Liu	MKronRLSF-LP	68.02±0.44	94.78±0.13	65.18±0.93	61.27±1.08	63.02±0.43
Pau	BSV	65.26±0.98	94.57±0.34	62.54±0.5	60.77±1.27	60.65±0.73
Pau	Comm Kron-RLS	65.63±0.36	94.78±0.13	64.01±0.38	60.05±0.49	61.01±0.27
Pau	Kron-RLS+CKA-MKL	65.49±0.37	92.39±0.13	61.65±0.40	63.22±0.51	62.42±0.27
Pau	Kron-RLS+pairwiseMKL	63.48±0.39	95.02±0.07	78.1±0.26	45.01±0.48	57.11±0.36
Pau	Kron-RLS+self-MKL	64.11±1.75	91.94±0.25	62.37±0.29	60.97±1.57	61.65±0.79
Pau	MvGRLP	66.17±0.32	94.42±0.07	62.18±0.38	61.95±0.45	62.06±0.22
Pau	MvGCN	63.51±1.43	94.08±0.49	63.21±0.69	57.94±1.34	60.4±1.78
Pau	MKronRLSF-LP	67.81±0.37	94.81±0.18	65.72±3.58	60.65±3.75	62.87±0.48
Miz	BSV	56.58±2.33	90.71±2.06	62.76±0.69	53.94±2.31	55.39±2.33
Miz	Comm Kron-RLS	58.08±1.07	91.36±1.25	62.37±0.81	55.16±1.99	56.54±1.77
Miz	Kron-RLS+CKA-MKL	66.92±0.44	92.58±0.14	62.62±0.52	64.3±0.46	61.45±0.44
Miz	Kron-RLS+pairwiseMKL	62.13±0.29	94.70±0.11	63.78±0.47	56.26±0.42	59.79±0.30
Miz	Kron-RLS+self-MKL	65.84±0.43	92.06±0.16	63.63±0.48	61.77±0.52	60.68±0.43
Miz	MvGRLP	66.68±0.35	94.10±0.12	63.46±0.43	61.82±0.30	62.63±0.29
Miz	MvGCN	62.17±1.90	93.35±1.73	59.54±0.43	60.74±1.78	59.76±1.95
Miz	MKronRLSF-LP	68.35±0.38	94.47±0.09	65.15±2.77	62.10±3.19	63.45±0.53
Luo	BSV	60.40±0.40	94.40±0.11	58.28±0.41	58.68±0.46	58.48±0.39
Luo	Comm Kron-RLS	54.19±1.36	91.92±4.01	57.64±2.46	53.16±1.97	52.99±1.54
Luo	Kron-RLS+CKA-MKL	60.87±0.36	92.03±0.15	55.55±0.34	64.15±0.46	59.54±0.36
Luo	Kron-RLS+pairwiseMKL	50.29±0.29	94.37±0.10	55.66±0.39	45.97±0.39	50.35±0.31
Luo	Kron-RLS+self-MKL	22.29±1.57	79.74±1.62	56.62±1.47	20.91±1.64	28.23±1.15
Luo	MvGRLP	61.76±0.45	94.08±0.07	58.70±0.40	60.05±0.61	58.37±0.42
Luo	MvGCN	61.18±0.41	94.54±0.1	57.94±0.37	61.26±0.48	51.07±0.38
Luo	MKronRLSF-LP	63.32±0.58	94.07±0.14	59.43±0.95	61.58±1.22	60.47±0.39

Table 4: The optimal parameters $\lambda^{v}$ obtained with the single view Kron-RLS model (based on the relative pairwise kernel).

$\otimes$	${\bm{K}}_{GIP,S}$	${\bm{K}}_{GIP,S}$	${\bm{K}}_{GIP,S}$	${\bm{K}}_{GIP,S}$	${\bm{K}}_{GIP,S}$
${\bm{K}}_{GIP,D}$	$2^{0}$	$2^{2}$	$2^{2}$	$2^{1}$	$2^{1}$
${\bm{K}}_{COS,D}$	$2^{2}$	$2^{3}$	$2^{3}$	$2^{2}$	$2^{-2}$
${\bm{K}}_{Corr,D}$	$2^{3}$	$2^{3}$	$2^{4}$	$2^{2}$	$2^{4}$
${\bm{K}}_{MI,D}$	$2^{0}$	$2^{1}$	$2^{1}$	$2^{0}$	$2^{-1}$
${\bm{K}}_{NTK,D}$	$2^{2}$	$2^{4}$	$2^{3}$	$2^{1}$	$2^{1}$

Table 5: Summary of the threshold of baseline methods on four datasets.

Methods	Liu	Pau	Miz	Luo
BSV	0.145	0.146	0.142	0.128
Comm Kron-RLS	0.205	0.204	0.192	0.183
Kron-RLS+CKA-MKL	0.100	0.106	0.099	0.102
Kron-RLS+pairwiseMKL	0.149	0.159	0.101	0.107
Kron-RLS+self-MKL	0.119	0.116	0.113	0.129
MvGRLP	0.090	0.091	0.094	0.085
MvGCN	0.225	0.237	0.208	0.197
MKronRLSF-LP	0.177	0.168	0.179	0.149

Table 6: Prediction performance comparison of other drug-side effect predictors on the Liu dataset.

Methods	AUPR(%)	AUC(%)	$Recall$ (%)	$Precision$ (%)	$F_{score}$ (%)
Liu’s method	28.0	90.7	67.5	34.0	45.2
Cheng’s method	59.2	92.2	59.0	55.7	56.9
RBMBM	61.6	94.1	61.5	57.4	59.4
INBM	64.1	93.4	60.7	60.4	60.6
Ensemble model	66.1	94.8	62.3	61.1	61.7
MKL-LGC^a	67.0	95.1	-	-	-
NDDSA with sschem^b	60.5	94.1	57.9	56.4	57.1
NDDSA without sschem^b	60.4	94.0	57.4	56.8	57.1
MKronRLSF-LP	68.2	94.7	63.8	62.5	63.1

• - represents not available; the bold and underlined values represent the best and second-best performance measures in each column, respectively;
• a and b indicate that the results are derived from (Ding et al., 2018) and (Shabani-Mashcool et al., 2020), respectively.

Table 7: Prediction performance comparison of other drug-side effect predictors on the Pau dataset.

Methods	AUPR(%)	AUC(%)	$Recall$ (%)	$Precision$ (%)	$F_{score}$ (%)
Pau’s method^a	38.9	89.7	51.7	36.1	42.5
Liu’s method	34.7	92.1	64.6	40.0	49.5
Cheng’s method	58.8	82.3	58.3	55.0	56.6
RBMBM	61.3	94.1	60.8	57.7	59.2
INBM	64.1	93.4	60.8	60.5	60.7
Ensemble model	66.0	94.9	62.4	61.2	61.6
MKL-LGC^b	66.8	95.2	-	-	-
NDDSA with sschem^c	60.3	94.2	59.3	54.9	57.0
NDDSA without sschem^c	60.3	94.1	58.2	55.9	57.0
MKronRLSF-LP	67.9	94.7	63.4	62.9	63.2

• - represents not available; the bold and underlined values represent the best and second-best performance measures in each column, respectively;
• a, b and c indicate that the results are derived from (Zhang et al., 2016), (Ding et al., 2018) and (Shabani-Mashcool et al., 2020), respectively.

Table 8: Prediction performance comparison of other drug-side effect predictors on the Miz dataset.

Methods	AUPR(%)	AUC(%)	$Recall$ (%)	$Precision$ (%)	$F_{score}$ (%)
Miz’s method^a	41.2	89.0	52.7	38.7	44.6
Liu’s method	36.3	91.8	64.0	41.5	50.5
Cheng’s method	56.0	92.3	58.4	56.8	57.6
RBMBM	61.7	93.9	60.5	58.8	59.6
INBM	64.6	93.2	61.6	60.5	61.1
Ensemble model	66.6	94.6	62.4	61.9	62.2
MKL-LGC^b	67.3	94.8	-	-	-
NDDSA with sschem^c	60.6	93.9	58.8	56.3	57.5
NDDSA without sschem^c	60.7	93.6	60.0	55.5	57.6
MKronRLSF-LP	68.5	94.5	63.0	64.2	63.6

• - represents not available; the bold and underlined values represent the best and second-best performance measures in each column, respectively;
• a, b and c indicate that the results are derived from (Zhang et al., 2016), (Ding et al., 2018) and (Shabani-Mashcool et al., 2020), respectively.

Table 9: Prediction performance comparison of other drug-side effect predictors on the Luo dataset.

Methods	AUPR(%)	AUC(%)	$Recall$ (%)	$Precision$ (%)	$F_{score}$ (%)
Liu’s method	39.4	93.5	59.6	48.3	53.3
Cheng’s method	53.2	90.9	53.1	52.3	52.7
RBMBM	55.1	93.5	56.1	54.3	55.1
INBM	57.3	91.7	55.8	56.7	56.2
Ensemble model	58.6	93.9	46.1	68.4	55.1
MKL-LGC	61.7	94.6	-	-	-
NDDSA with sschem^a	53.1	94.2	47.6	57.3	52.0
NDDSA without sschem^a	44.5	93.7	44.7	47.8	46.2
GCRS^b	27.2	95.7	-	-	-
SDPred	22.6	94.6	-	-	-
MKronRLSF-LP	63.5	94.1	59.2	61.9	60.5

• - represents not available; the bold and underlined values represent the best and second-best performance measures in each column, respectively;
• a and b indicate that the results are derived from (Shabani-Mashcool et al., 2020) and (Xuan et al., 2022), respectively.