CN110993121A


Info

Publication number: CN110993121A
Application number: CN201911239229.XA
Authority: CN
Inventors: 谢茂强; 刘嘉晖; 刘帆; 金旭; 王琳; 黄亚楼
Original assignee: Nankai University
Current assignee: Nankai University
Priority date: 2019-12-06
Filing date: 2019-12-06
Publication date: 2020-04-10

Abstract

Translated from Chinese

A drug association prediction method based on a dual collaborative linear manifold, belonging to the fields of data mining and bioinformatics, comprising: 1) constructing an initial target association matrix between drug-drug nodes from drug-drug association data; 2) constructing auxiliary association matrices between drug-target protein nodes and between target protein-target protein nodes from drug-target protein association data and target protein-target protein association data; 3) taking the manifolds of the initial target association matrix and the auxiliary matrices as input and constructing a dual collaborative linear manifold learning model; 4) according to the consistency principle, enriching the target matrix through iterative updates, selecting the associations with higher scores, and regarding the corresponding two drugs as associated, which completes the drug association prediction task. The invention uses manifolds to measure the correlation of data and uses collaborative learning to make full use of the consistency within the network. It can effectively improve prediction accuracy and is suitable for drug-drug association prediction.

Description


A drug association prediction method based on a dual collaborative linear manifold

Technical Field

The invention relates to the fields of data mining and bioinformatics, and provides a method for predicting associations between nodes in a heterogeneous network.

Background

In recent years, link prediction in heterogeneous networks has received wide attention. Many related problems are under active study, including friend recommendation in social networks, protein-protein interaction prediction, and airline network reconstruction. Many methods have been proposed that use the topology and interconnections of heterogeneous networks to improve link prediction. They fall into three main categories: similarity-based algorithms, path-based algorithms, and matrix factorization-based algorithms. Similarity-based methods measure the correlation between nodes by computing a similarity score. These algorithms are computationally cheap, but their prediction accuracy is low because they cannot fully exploit the global structure of the known network. Path-based algorithms typically use network topology and node attributes for link prediction, but learning the global structure of the network with them is computationally too expensive. Matrix factorization-based algorithms can extract latent features from the known network for link prediction. However, existing factorization-based algorithms can neither fully use the information in the auxiliary networks nor preserve the useful information during feature integration.

Recently, manifold learning for link prediction has become increasingly popular in machine learning and pattern recognition. The basic idea of manifold learning is to project the data from the original high-dimensional space into a low-dimensional space, so that more latent information can be learned and the essential structural features of the original data are reproduced. It outperforms algorithms that consider only the original feature space. After the dimensionality transformation, redundant features of the original feature space are removed and the relationships between nodes are reconstructed to better characterize their similarity. Because it measures distances between data points effectively, manifold learning is widely used in knowledge representation models. In "Feature extraction using two-dimensional maximum embedding difference", M. Wan et al. proposed the two-dimensional maximum embedding difference (2DMED) method, which combines graph embedding and the difference criterion for image feature extraction. In "Manifold regularized matrix factorization for drug-drug interaction prediction", W. Zhang et al. introduced a manifold regularization method based on drug features, which projects drugs from the interaction space into a low-dimensional space to predict drug-drug interactions.

In summary, none of the existing link prediction methods for heterogeneous networks considers the consistency between the target network and the auxiliary networks, which leads to lower prediction performance.

Summary of the Invention

The main purpose of the invention is to solve the problem that existing drug association prediction techniques cannot make full use of the structural information of the associations between heterogeneous nodes, and to provide a method that predicts drug associations for nodes in a heterogeneous network based on the manifold consistency between the target network and the auxiliary networks. The invention implements a drug association prediction method based on a dual collaborative linear manifold. It optimizes node similarity by jointly using the manifold consistency embedded between the target network and the auxiliary networks, and achieves good association prediction results.

Technical solution of the invention

A drug association prediction method based on a dual collaborative linear manifold, whose prediction results can be applied to data mining or drug screening, specifically comprising the following steps:

Step 1: Construct an initial target association matrix between drug-drug nodes from the drug-drug association data.

Step 2: Construct auxiliary association matrices between drug-target protein nodes and between target protein-target protein nodes from the drug-target protein association data and the target protein-target protein association data.

Step 3: Take the manifolds of the initial target association matrix and the auxiliary association matrices as input and construct a dual collaborative linear manifold learning model.

Step 4: According to the consistency principle, enrich the target association matrix through iterative updates, select the associations with higher scores, and regard the corresponding two drugs as associated, which completes the drug association prediction task.

Further, the method involves three association matrices, defined as follows: from the association data between drugs, between drugs and target proteins, and between target proteins, the drug-drug, drug-target protein, and target protein-target protein association matrices are constructed respectively. For any pair, the entry is marked 1 if a known association exists and 0 otherwise.

The dual collaborative linear manifold learning model consists of three parts: a consistency manifold information term, a prior knowledge constraint term, and a sparsity constraint.

Further, the consistency manifold information term (its formula is not preserved in this text) constrains the manifolds enriched with bidirectional knowledge: while as much information as possible is added, the difference between the two manifolds is jointly kept to a minimum.

Further, the prior knowledge constraint term (its formula is not preserved in this text) constrains the drug-drug association matrix and the target protein-target protein association matrix obtained during enrichment: while they are enriched as much as possible, they are kept consistent with the prior knowledge.

Further, the sparsity constraint is used to control the complexity of the model.

Advantages and beneficial effects of the invention:

The invention better captures the correlation between data points and makes full use of the consistency within the network to predict associations between nodes in a heterogeneous network. In addition, the technique applies a low-rank constraint and incorporates prior knowledge to overcome the sparsity of heterogeneous networks, so it maintains stable performance on networks with many missing links or unobserved interactions. Compared with the current state of the art, the invention significantly improves the accuracy of predicting unknown links in different practical applications, including social rating networks, gene-phenotype association prediction, and drug-drug interaction networks, and it therefore generalizes to some extent.

Description of the Drawings

Figure 1 is a flowchart of the drug association prediction method based on a dual collaborative linear manifold.

Figure 2 is a schematic diagram of the method for extending a local manifold to a global manifold.

Figure 3 is a schematic diagram of the dual collaborative linear manifold model.

Figure 4 is a schematic diagram of the ROC curve.

Detailed Description

To make the objectives, technical solutions, and advantages of the invention clearer, the embodiments of the invention are described in further detail below.

Example 1:

A drug association prediction method based on a dual collaborative linear manifold; see Figure 1 for the flowchart and Figure 3 for the model diagram. The method includes the following steps:

Step 101: Construct the association matrix between drug-drug nodes from the drug-drug association data retrieved from the DrugBank database.

Step 102: Introduce the auxiliary network data retrieved from the DrugBank and HPRD databases, and construct the auxiliary association matrix between drug-target protein nodes and the auxiliary association matrix between target protein-target protein nodes from the target protein-target protein association data and the drug-target protein association data.

In step 101, the drug-drug association matrix is the target network T (758×758). In step 102, the drug-target protein association matrix is the association matrix R (758×473), and the target protein-target protein association matrix is the auxiliary matrix A (473×473). The three networks together form an integrated heterogeneous network containing 758 drugs and 473 target proteins. If two network nodes are associated, the corresponding matrix element is 1; otherwise it is 0.
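The construction of T, R, and A described above can be sketched as follows. This is a minimal illustration, not the patent's own code: the helper name `build_association_matrix`, the toy index pairs, and the choice to store drug-drug and protein-protein associations symmetrically are all assumptions made for the example. Only the matrix sizes and the 1/0 marking rule come from the text.

```python
import numpy as np

def build_association_matrix(pairs, n_rows, n_cols, symmetric=False):
    """Return a 0/1 matrix with 1 at every known (row, col) association."""
    m = np.zeros((n_rows, n_cols), dtype=np.int8)
    for i, j in pairs:
        m[i, j] = 1
        if symmetric:          # assumption: undirected associations
            m[j, i] = 1
    return m

# Sizes follow the embodiment; the index pairs are hypothetical toy data.
n_drugs, n_targets = 758, 473
ddi_pairs = [(0, 1), (2, 5)]      # drug-drug
dti_pairs = [(0, 3), (2, 472)]    # drug-target protein
ppi_pairs = [(3, 4)]              # target protein-target protein

T = build_association_matrix(ddi_pairs, n_drugs, n_drugs, symmetric=True)
R = build_association_matrix(dti_pairs, n_drugs, n_targets)
A = build_association_matrix(ppi_pairs, n_targets, n_targets, symmetric=True)
```

In practice the pairs would come from mapping DrugBank and HPRD identifiers to row and column indices after the alignment step.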

Step 103: Take the manifolds of the initial target association matrix and the auxiliary association matrices as input and construct the dual collaborative linear manifold learning model.

The dual collaborative linear manifold learning model consists of three parts: a consistency manifold information term, a prior knowledge constraint term, and a sparsity constraint.

The consistency manifold information term of the collaborative linear manifold model (its formula is not preserved in this text) constrains the manifolds enriched with bidirectional knowledge: while as much information as possible is added, the difference between the two manifolds is jointly kept to a minimum.

The prior knowledge constraint term (its formula is not preserved in this text) constrains the drug-drug association matrix and the target protein-target protein association matrix obtained during enrichment: while they are enriched as much as possible, they are kept consistent with the prior knowledge.

The sparsity constraint is used to control the complexity of the model.

The linear manifold learning draws on the locally linear embedding method and the sparse subspace clustering method.

Locally linear embedding assumes a data point x_i sampled from the data set X, with φ(i) denoting its k nearest neighbors. Each point x_i can be approximated as a linear combination of the points in φ(i). The coefficient ω_ij is the linear weight of the point x_j ∈ φ(i) used to reconstruct x_i, as given by Equation (1) (the formula image is not preserved in this text).
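For orientation, the standard reconstruction objective of locally linear embedding is shown here, written with the symbols defined above. It is the textbook form and is only assumed to correspond to the patent's Equation (1), whose image is not available; in particular, whether the patent keeps the sum-to-one constraint is not known.

```latex
\min_{\omega} \; \sum_{i} \Bigl\| x_i - \sum_{j \in \varphi(i)} \omega_{ij}\, x_j \Bigr\|_2^2
\qquad \text{s.t.} \quad \sum_{j \in \varphi(i)} \omega_{ij} = 1 \quad \text{for all } i
```

Each ω_ij is nonzero only for neighbors j ∈ φ(i), so the weights describe the local linear structure around x_i.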

Further, sparse subspace clustering is integrated with the manifold to extend the local neighborhood to the global space and take full advantage of manifold learning, as shown in Figure 2. The known drug-drug associations are embedded as prior knowledge to reconstruct the drug-target protein interconnection matrix, and a fixed weight matrix W is introduced to control the influence of the prior knowledge. This gives the prior constraint on T (its formula is not preserved in this text).

Here T⁽⁰⁾ is the adjacency matrix of the known drug association information: an entry is 1 if the association is known and 0 otherwise. To preserve the known drug associations, the entry of W is 0.8 where an association between the two drugs exists and 0.2 where it does not.
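The weighting rule for W can be written down directly. In the sketch below, the 0.8 and 0.2 values come from the text, while the function names and the squared, element-wise weighted penalty are assumptions: the patent's actual prior constraint formula is not available, so `weighted_prior_penalty` only illustrates how such a W is commonly used.

```python
import numpy as np

def prior_weight_matrix(T0, w_known=0.8, w_unknown=0.2):
    """W is 0.8 where T0 marks a known association and 0.2 elsewhere."""
    return np.where(T0 == 1, w_known, w_unknown)

def weighted_prior_penalty(T, T0, W):
    """Hypothetical prior term: squared Frobenius norm of W * (T - T0).

    This form is an assumption for illustration, not the patent's formula.
    """
    return float(np.sum((W * (T - T0)) ** 2))

# Toy 3-drug example with one known association (drugs 0 and 1).
T0 = np.array([[0, 1, 0],
               [1, 0, 0],
               [0, 0, 0]])
W = prior_weight_matrix(T0)

# A candidate enriched matrix: deviating from a known link costs more
# than proposing a score for an unobserved pair.
T = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.0],
              [0.5, 0.0, 0.0]])
penalty = weighted_prior_penalty(T, T0, W)
```

The larger weight on known entries keeps the learned matrix close to the observed associations while leaving unobserved entries freer to change.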

Using the prior knowledge as a constraint on the loss function (the prior constraint) makes the manifold constraint more flexible and yields a robust solution. Because the target network is sparse, the nuclear norm of T, denoted ‖T‖_*, is used as a low-rank constraint. This gives the loss function of linear manifold learning (its formula is not preserved in this text), in which α and γ are hyperparameters that balance the weights of the prior constraint and the low-rank constraint.

Similarly, the missing association information between target proteins can be inferred by linear manifold learning, which gives an analogous loss function on the auxiliary network A (its formula is not preserved in this text).

The idea of local invariance states that linear manifolds learned from different directions yield similar results. Replacing the manifold constraint on the drug association network and the manifold constraint on the target protein association network with a collaborative constraint on the drug and target protein networks gives the final loss function (its formula is not preserved in this text).

Here W₁ and W₂ are the prior knowledge weighting matrices of the drug network and the target protein network, respectively. α, β, and γ are three hyperparameters with values 1000, 1000, and 0.5, respectively. Through collaborative manifold learning, the final drug-drug association matrix is obtained after convergence, which yields the potential drug association information.

Further, the collaborative linear manifold learning model is optimized as follows.

First, the target protein association matrix is fixed and the drug association matrix is updated. The loss function of the drug association matrix T is written in a form whose formula, together with the definitions that follow "where", is not preserved in this text.

According to the gradient optimization method, an iterative equation for the drug association matrix is obtained (formula not preserved in this text).

Ignoring the term f(T^(k-1)), which does not depend on T, the iterative equation is rewritten in an equivalent form (formula not preserved in this text), and the iterative equation for T is then obtained with the SVT (singular value thresholding) method (formula not preserved in this text).
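Since the SVT update formula itself is missing, the following sketch shows only the generic singular value thresholding operator, which is the proximal operator of the nuclear norm. It is not the patent's exact iteration: the function names, the threshold `tau`, and the toy matrix are assumptions for illustration.

```python
import numpy as np

def nuclear_norm(X):
    """Sum of the singular values of X."""
    return float(np.linalg.svd(X, compute_uv=False).sum())

def svt(X, tau):
    """Singular value thresholding: shrink every singular value by tau,
    clipping at zero, then rebuild the matrix."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (U * s_shrunk) @ Vt   # same as U @ diag(s_shrunk) @ Vt

# Toy example: singular values 3.0, 1.0, 0.2 become 2.5, 0.5, 0.0.
X = np.diag([3.0, 1.0, 0.2])
Y = svt(X, tau=0.5)
```

Small singular values are removed entirely, which is how a nuclear norm penalty lowers the rank of the estimate. In a gradient-based scheme such as the one described, the operator would typically be applied to the matrix obtained after a gradient step, with a threshold tied to the step size and γ.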

First, the update step size of T is computed. Combined with the optimality condition, the step size is updated according to a conditional rule (the condition, the update formula, and the auxiliary definitions are not preserved in this text). After the update, the final step size of the K-th iteration of the target network T is obtained.

Similarly, using the gradient optimization method, the drug association matrix is fixed and the target protein association matrix is updated, which gives the step size of the K-th iteration of the auxiliary network A.

Further, the two step sizes are combined to obtain the update equations for T and A (formulas not preserved in this text).

Step 104: According to the consistency principle, enrich the target matrix through iterative updates. If the relationship value between two drugs is greater than a specific threshold, the drugs are considered associated, which gives the final prediction result.

T and A are updated iteratively until the model converges, and the final similarity matrix P is solved for and used to predict the missing link information in the target network T.

Here P_ik is the element in row i and column k of the matrix P. It represents the correlation between drug i and target protein k: the larger the value, the stronger the correlation. With the threshold set to M = 0.8, drug i and target protein k are considered associated when P_ik > M.
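The thresholding rule can be illustrated with a short sketch. The threshold M = 0.8 comes from the text; the function name, the toy score matrix, and the decision to report only pairs not already marked as known are assumptions added for the example.

```python
import numpy as np

def predict_new_associations(P, known, threshold=0.8):
    """Return (i, k, score) for entries with P[i, k] > threshold that are
    not already known, sorted by descending score."""
    mask = (P > threshold) & (known == 0)
    rows, cols = np.nonzero(mask)
    order = np.argsort(-P[rows, cols], kind="stable")
    return [(int(rows[n]), int(cols[n]), float(P[rows[n], cols[n]]))
            for n in order]

# Toy converged score matrix and the known associations it started from.
P = np.array([[0.10, 0.95, 0.85],
              [0.95, 0.05, 0.30],
              [0.85, 0.30, 0.20]])
known = np.array([[0, 1, 0],
                  [1, 0, 0],
                  [0, 0, 0]])

hits = predict_new_associations(P, known)
```

Here the pair (0, 1) scores above the threshold but is already known, so only the pair between nodes 0 and 2 is reported as a new prediction (once per direction, because the toy matrix is symmetric).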

In summary, through steps 101 to 104 above, the invention models and solves the drug-target protein association relationship and achieves good drug-target protein association prediction results.

The feasibility of the method is verified below with a concrete experiment. The example uses real drug-drug, drug-target protein, and target protein-target protein associations retrieved from the DrugBank and HPRD databases. The two data sets are public and available at https://www.drugbank.ca and http://hprd.org/index_html. The data sets and the experimental results are described in turn below.

The data comprise three kinds of associations: FDA-approved drug-drug interactions (DDI), drug-target protein interactions (DTI), and target protein-target protein associations (PPI). The DDI data set contains 828 drugs and 14746 drug-drug association pairs, the DTI data set contains 473 target proteins and 2416 drug-target protein association pairs, and the PPI data set contains 1098 target protein-target protein association pairs. After preprocessing and aligning the three kinds of association data, we obtained a 758×758 known DDI association matrix, a 758×473 DTI association matrix, and a 473×473 PPI association matrix. The DDI matrix contains 5926 association pairs, the DTI matrix 2416, and the PPI matrix 616. These matrices correspond to T, R, and A in the dual collaborative linear manifold model.

To evaluate the performance of the proposed dual collaborative linear manifold model (Collaborative Linear Manifold Learning for Link Prediction in Heterogeneous Networks, CLML), it is compared with three baselines: the neighbor set information method, denoted NSI [1]; the matrix completion link prediction method, denoted MCLP [2]; and the manifold regularized matrix factorization method, denoted MRMF [3]. For NSI, the coefficient λ of the balance constraint term is set to 0.1. For MCLP, the three parameters α, β, and γ are set to 0.1, 0.1, and 0.3, respectively. For MRMF, the coefficient γ of the constraint term is set to 0.25 and μ is set to 1.

The method uses 5-fold cross-validation to split the data into training and test sets and uses AUC [4], the area under the ROC curve, as the evaluation metric.
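A 5-fold split over known links might look like the sketch below. The text only states that 5-fold cross-validation is used; how links are held out is not described, so the scheme here (hide one fifth of the known links of a symmetric matrix in each fold and test on them) is an assumption, as are the function name and the toy data.

```python
import numpy as np

def five_fold_link_splits(T0, n_folds=5, seed=0):
    """Yield (train_matrix, held_out_pairs) for each fold.

    Assumes T0 is a symmetric 0/1 matrix; each known link (i < j) is
    hidden in exactly one fold.
    """
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(np.triu(T0, k=1))   # each link once
    order = rng.permutation(len(rows))
    for fold in np.array_split(order, n_folds):
        train = T0.copy()
        train[rows[fold], cols[fold]] = 0       # hide held-out links
        train[cols[fold], rows[fold]] = 0       # keep the matrix symmetric
        held_out = list(zip(rows[fold].tolist(), cols[fold].tolist()))
        yield train, held_out

# Toy symmetric network with 10 known links among 6 nodes.
pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 4),
         (2, 5), (3, 4), (3, 5), (4, 5), (0, 5)]
T0 = np.zeros((6, 6), dtype=int)
for i, j in pairs:
    T0[i, j] = T0[j, i] = 1

splits = list(five_fold_link_splits(T0))
```

Each training matrix would be passed to the model, and the scores of the held-out pairs, together with sampled unknown pairs, would be used for evaluation.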

AUC is the area under the ROC curve, shown in Figure 4. ROC stands for receiver operating characteristic: the curve is drawn over a series of different binary classification thresholds, with the true positive rate on the vertical axis and the false positive rate on the horizontal axis. The larger the area under the curve, the better the classification.
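As a small worked example, AUC can be computed without drawing the curve, using its equivalence to the probability that a randomly chosen positive pair is scored above a randomly chosen negative pair (ties counted as one half), as discussed in [4]. The function name and the toy labels and scores are illustrative assumptions.

```python
import numpy as np

def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    with ties counted as one half."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

# Two true links (scores 0.9, 0.4) and two non-links (scores 0.5, 0.1):
# three of the four positive/negative pairs are ordered correctly.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc = auc_score(labels, scores)   # 0.75
```

The pairwise comparison uses memory proportional to the number of positives times the number of negatives, so for large test sets a rank-based implementation would be preferable.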

The experimental results of each method are shown in Table 1:

Table 1 (the table is not preserved in this text)

The experimental results of the proposed CLML algorithm and the baseline algorithms show that the method based on the dual collaborative linear manifold model outperforms the traditional NSI method, which indicates that mining latent features through matrix completion makes better use of the information in the matrix and improves model performance. CLML also outperforms all the other baseline algorithms, which indicates that both the structural information captured by manifold learning and the further use of auxiliary information by the dual collaborative manifold improve model performance. It also shows that the bidirectional collaborative linear manifold constraint designed in this method is effective for the functional module mining problem in drug association prediction.

In summary, the proposed drug association prediction method based on a dual collaborative linear manifold outperforms the baseline algorithms. This further shows that considering the linear manifold information among the drug-drug, target protein-target protein, and drug-target protein associations improves performance on the functional module mining task in drug association prediction.

Those skilled in the art will understand that the drawings are only schematic diagrams of a preferred embodiment.

The above are only preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

References

[1] B. Zhu, Y. Xia, An information-theoretic model for link prediction in complex networks, Sci. Rep. 5 (2015) 13707.

[2] M. Gao, L. Chen, B. Li, et al., A link prediction algorithm based on low-rank matrix completion, Appl. Intell. 48(12) (2018) 4531–4550.

[3] W. Zhang, Y. Chen, D. Li, et al., Manifold regularized matrix factorization for drug-drug interaction prediction, J. Biomed. Inf. 88 (2018) 90–97.

[4] J.A. Hanley, B.J. McNeil, The meaning and use of the area under a receiver operating characteristic (ROC) curve, Radiology 143(1) (1982) 29–36.

Claims


1. A drug association prediction method based on a dual collaborative linear manifold, characterized in that the prediction results of the method can be applied to data mining or drug screening, the method specifically comprising the following steps:

2. The drug association prediction method based on a dual collaborative linear manifold according to claim 1, characterized in that the method involves three association matrices, defined as follows: from the association data between drugs, between drugs and target proteins, and between target proteins, the drug-drug, drug-target protein, and target protein-target protein association matrices are constructed respectively; for any pair, the entry is marked 1 if a known association exists and 0 otherwise.

3. The drug association prediction method based on a dual collaborative linear manifold according to claim 1, characterized in that the dual collaborative linear manifold learning model consists of three parts: a consistency manifold information term, a prior knowledge constraint term, and a sparsity constraint.

4. The drug association prediction method based on a dual collaborative linear manifold according to claim 3, characterized in that the consistency manifold information term is:

5. The drug association prediction method based on a dual collaborative linear manifold according to claim 3, characterized in that the prior knowledge constraint term is:

6. The drug association prediction method based on a dual collaborative linear manifold according to claim 3, characterized in that the sparsity constraint is used to control the complexity of the model.