





Technical Field
The present invention relates to the technical field of machine learning, and in particular to a method and system for automatically generating decision rules for drug analysis.
Background
At present, machine learning theory and techniques are widely used in drug analysis, for example to predict whether a drug is hepatotoxic or acutely toxic. Taking hepatotoxicity prediction as an example, a machine learning model fitted to existing drug data can predict whether a new drug is hepatotoxic to humans. This not only saves the considerable manpower and material resources that biomedical experiments require, but also, to a certain extent, avoids experimental risks and ethical issues.
Drug analysis requires that a machine learning model provide decision rules (the basis for its decisions) along with its predictions. However, traditional machine learning models have simple structures; their performance quickly reaches an upper limit and their prediction accuracy is low. Ensemble learning and deep learning methods perform better but are poorly interpretable.
Therefore, how to provide a method and system that automatically generate decision rules for drug analysis, producing easy-to-understand decision rules for otherwise uninterpretable machine learning models, has become an urgent technical problem.
Summary of the Invention
The technical problem to be solved by the present invention is to provide a method and system for automatically generating decision rules for drug analysis, so as to generate easy-to-understand decision rules for uninterpretable machine learning models.
In a first aspect, the present invention provides a method for automatically generating decision rules for drug analysis, comprising the following steps:
Step S10: obtain drug data, extract drug features from the drug data, and construct a data set;
Step S20: compute the variance, data complexity, and permutation importance of each drug feature in the data set, and screen important features based on the variance, data complexity, and permutation importance;
Step S30: from the important features, select feature subsets each containing n (n = 1, 2, 3, ...) features, and compute the permutation importance of each feature subset;
Step S40: in order of permutation importance, search each feature subset for the minimum feature change required to flip the prediction result, thereby generating category boundaries;
Step S50: describe the category boundaries as decision rules using Bayes' formula.
Further, in step S10, the drug data include at least molecular fingerprints and physicochemical properties; the physicochemical properties include at least molecular weight, hydrophobicity parameters, and solubility.
Further, step S20 is specifically:
computing the variance, data complexity, and permutation importance of each drug feature in the data set, summing and ranking these three scores for each drug feature, and screening the important features accordingly.
Further, in step S30, the permutation importance is computed as follows:
with the feature values of the feature subset left unshuffled, compute a first prediction performance of the machine learning model; randomly shuffle the feature values of all features in the subset simultaneously and compute a second prediction performance; the first prediction performance minus the second prediction performance is the permutation importance of the feature subset.
Further, step S40 specifically comprises:
Step S41: select feature subsets one by one according to their priority, where a smaller value of n means a higher priority;
Step S42: based on counterfactual theory, search each feature subset, in order of permutation importance, for the minimum feature change required to flip the prediction result, thereby generating category boundaries.
In a second aspect, the present invention provides a system for automatically generating decision rules for drug analysis, comprising the following modules:
a drug feature extraction module, configured to obtain drug data, extract drug features from the drug data, and construct a data set;
an important feature screening module, configured to compute the variance, data complexity, and permutation importance of each drug feature in the data set, and to screen important features based on the variance, data complexity, and permutation importance;
a feature subset construction module, configured to select, from the important features, feature subsets each containing n (n = 1, 2, 3, ...) features, and to compute the permutation importance of each feature subset;
a category boundary generation module, configured to search each feature subset, in order of permutation importance, for the minimum feature change required to flip the prediction result, thereby generating category boundaries;
a decision rule generation module, configured to describe the category boundaries as decision rules using Bayes' formula.
Further, in the drug feature extraction module, the drug data include at least molecular fingerprints and physicochemical properties; the physicochemical properties include at least molecular weight, hydrophobicity parameters, and solubility.
Further, the important feature screening module is specifically configured to:
compute the variance, data complexity, and permutation importance of each drug feature in the data set, sum and rank these three scores for each drug feature, and screen the important features accordingly.
Further, in the feature subset construction module, the permutation importance is computed as follows:
with the feature values of the feature subset left unshuffled, compute a first prediction performance of the machine learning model; randomly shuffle the feature values of all features in the subset simultaneously and compute a second prediction performance; the first prediction performance minus the second prediction performance is the permutation importance of the feature subset.
Further, the category boundary generation module specifically comprises:
a feature subset selection unit, configured to select feature subsets one by one according to their priority, where a smaller value of n means a higher priority;
a minimum feature change search unit, configured to search each feature subset, based on counterfactual theory and in order of permutation importance, for the minimum feature change required to flip the prediction result, thereby generating category boundaries.
The advantages of the present invention are as follows:
1. Drug features are extracted from the drug data, category boundaries are generated from the relationship between the drug features and the prediction results, and the category boundaries are then described as decision rules using Bayes' formula. Only the drug features and the prediction results are needed to generate decision rules, independently of the machine learning model, so easy-to-understand decision rules can be generated for uninterpretable machine learning models.
2. Important features are screened by variance, data complexity, and permutation importance, combining multiple feature evaluation methods for more accurate dimensionality reduction. Feature subsets containing different numbers of features are then selected from these important features, and their permutation importance is computed. This fully accounts for the relationships among important features, better reflects real conditions, and helps generate more accurate decision rules.
Description of the Drawings
The present invention is further described below with reference to the accompanying drawings and embodiments.
Fig. 1 is a flowchart of a method for automatically generating decision rules for drug analysis according to the present invention.
Fig. 2 is a schematic structural diagram of a system for automatically generating decision rules for drug analysis according to the present invention.
Fig. 3 is a schematic diagram of drug feature extraction according to the present invention.
Fig. 4 is a schematic diagram of important feature screening according to the present invention.
Fig. 5 is a schematic diagram of the permutation importance computation according to the present invention.
Fig. 6 is a schematic diagram of the category boundary search according to the present invention.
Detailed Description
The general idea of the technical solutions in the embodiments of the present application is as follows: category boundaries are generated from the relationship between drug features and prediction results, and the category boundaries are then described as decision rules using Bayes' formula; this can generate decision rules for any machine learning model and is therefore a general method. Important features are screened by variance, data complexity, and permutation importance for more accurate dimensionality reduction; feature subsets containing different numbers of features are selected from these important features and their permutation importance is computed, fully accounting for the relationships among important features so as to obtain more accurate decision rules.
Referring to Figs. 1 to 6, a preferred embodiment of the method for automatically generating decision rules for drug analysis according to the present invention comprises the following steps:
Step S10: obtain drug data in SDF format from the PubChem database, extract drug features from the drug data, and construct a data set. After the drug data are obtained, label data are also set for each drug according to the drug analysis task; for example, in a human hepatotoxicity prediction task, the label is 1 or -1, indicating whether the drug is hepatotoxic.
Step S20: compute the variance, data complexity, and permutation importance of each drug feature in the data set, and screen important features based on the variance, data complexity, and permutation importance.
Step S30: from the important features, select feature subsets each containing n (n = 1, 2, 3, ...) features, and compute the permutation importance of each feature subset.
Step S40: in order of permutation importance, search each feature subset for the minimum feature change required to flip the prediction result, thereby generating category boundaries.
Step S50: describe the category boundaries as decision rules using Bayes' formula.
A category boundary describes the difference between the two classes of samples on a feature subset, e.g. "when Label = 1, F3 < 10 and F5 > 20". Such feature-and-threshold descriptions cannot be used directly as decision rules: first, "F3 < 10, F5 > 20" is obtained from a subset of the samples and does not directly represent a global pattern; second, "when Label = 1, F3 < 10 and F5 > 20" is not directly equivalent to "if F3 < 10 and F5 > 20 then Label = 1".
For each category boundary description, first check how well it matches the existing samples, i.e. the proportion of samples of that class that satisfy the description. For example, denote Label = 1 as event A, Label taking any other value as event Ā, and satisfying "F3 < 10, F5 > 20" as event B; then P(B|A) is the proportion of samples with Label = 1 that satisfy "F3 < 10, F5 > 20".
The final decision rule must infer the value of Label from the category boundary, i.e. infer A from B, so it must be checked whether B → A can be derived from A → B. First compute P(A|B) and P(Ā|B); if P(A|B) > P(Ā|B), then event B favors inferring event A rather than Ā, and the decision rule B → A is obtained, i.e. "if F3 < 10 and F5 > 20 then Label = 1". All candidate rules A → B are processed in descending order of P(B|A): the larger P(B|A), the more samples match the category boundary and the higher its priority; boundaries with P(B|A) below a preset threshold are not used. P(A|B) is computed by Bayes' formula:
P(A|B) = P(B|A) · P(A) / P(B).
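The Bayes step above can be sketched as follows. This is a minimal illustration with made-up samples; the helper name `rule_probabilities` and the example boundary predicate are assumptions for the sketch, not the patent's implementation.

```python
def rule_probabilities(samples, labels, predicate, positive=1):
    """Estimate P(B|A) and P(A|B) for boundary predicate B and class A from data."""
    n = len(samples)
    positives = sum(1 for t in labels if t == positive)
    matches = sum(1 for s in samples if predicate(s))
    both = sum(1 for s, t in zip(samples, labels) if t == positive and predicate(s))
    p_a = positives / n
    p_b = matches / n
    p_b_given_a = both / positives if positives else 0.0
    # Bayes' formula: P(A|B) = P(B|A) * P(A) / P(B)
    p_a_given_b = p_b_given_a * p_a / p_b if p_b else 0.0
    return p_b_given_a, p_a_given_b

# Samples given as (F3, F5) pairs; boundary description: F3 < 10 and F5 > 20.
samples = [(5, 25), (8, 30), (12, 15), (9, 22)]
labels = [1, 1, -1, 1]
boundary = lambda s: s[0] < 10 and s[1] > 20
print(rule_probabilities(samples, labels, boundary))  # (1.0, 1.0)
```

The rule "if F3 < 10 and F5 > 20 then Label = 1" would be kept here because P(A|B) exceeds P(Ā|B) = 1 - P(A|B).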
After the above steps, a series of decision rules is obtained from the machine learning model.
Step S60: output the set of decision rules, use it in place of the original machine learning model to predict drugs, and take the matching decision rules as the decision basis. Drug-domain experts can analyze the decision rules and summarize drug characteristics, making drug screening more targeted.
In step S10, the drug data include at least molecular fingerprints and physicochemical properties; the physicochemical properties include at least molecular weight, hydrophobicity parameters, and solubility.
The molecular fingerprints and physicochemical properties are extracted as follows: the chemical structure of each drug is converted into a numerical vector with the ChemmineR toolkit to obtain its molecular fingerprint, where each dimension of the vector is 1 or 0, indicating whether a specific chemical substructure is present; the physicochemical properties of each drug are computed with tools such as OpenBabel, ChemmineR, and JoeLib together with related databases, and are expressed as real numbers.
Step S20 is specifically:
compute the variance, data complexity, and permutation importance of each drug feature in the data set, sum and rank these three scores for each drug feature, and screen the important features accordingly.
The larger the variance, the more dispersed the feature values and the more important the drug feature. The variance is computed as
σ² = (1/n) Σᵢ (xᵢ - μ)², i = 1, ..., n,
where n is the number of samples and μ is the mean of the feature values.
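The variance scoring described above can be sketched in a few lines. This is a minimal sketch assuming the data set is a list of samples, each a list of feature values; the function name is illustrative.

```python
def feature_variances(samples):
    """Population variance of each feature column: (1/n) * sum((x - mu)^2)."""
    n = len(samples)
    n_features = len(samples[0])
    variances = []
    for j in range(n_features):
        column = [s[j] for s in samples]
        mu = sum(column) / n
        variances.append(sum((x - mu) ** 2 for x in column) / n)
    return variances

# The first feature varies across samples; the second is constant (variance 0).
samples = [[1.0, 5.0], [3.0, 5.0], [5.0, 5.0]]
print(feature_variances(samples))  # [2.6666666666666665, 0.0]
```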
Data complexity reflects the distribution of the samples and can measure how much information a feature column contains. Maximal feature efficiency (mfe) reflects a feature's ability to distinguish different classes; the main idea is to analyze how much the samples of different classes overlap on a given feature. Let c1 and c2 denote the two classes, fᵢ the i-th feature column, and N the number of samples. Then Rᵢ, the interval on fᵢ where classes c1 and c2 overlap, is computed as:
Rᵢ = [Max(min(c1, fᵢ), min(c2, fᵢ)), Min(max(c1, fᵢ), max(c2, fᵢ))];
The ratio of the number of samples that do not fall into this interval to the total number of samples is the mfe indicator (data complexity):
mfe = (1/N) Σⱼ I(xⱼ ∉ Rᵢ), j = 1, ..., N,
where I(·) is an indicator function that equals 1 when its condition is true and 0 otherwise, and xⱼ is the value of sample j on feature fᵢ. The value of mfe ranges over [0, 1]: when all samples fall into the overlap interval, the two classes are hard to distinguish and mfe = 0; when the two classes have no overlap interval, or no sample falls into it, the class boundary is clear and mfe = 1. The larger the mfe, the more important the feature.
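The mfe computation above can be sketched as follows; a minimal sketch assuming each class is given as a plain list of values on one feature, with an illustrative function name.

```python
def mfe(values_c1, values_c2):
    """Fraction of samples outside the overlap interval of the two classes."""
    lo = max(min(values_c1), min(values_c2))  # larger of the class minima
    hi = min(max(values_c1), max(values_c2))  # smaller of the class maxima
    all_values = values_c1 + values_c2
    outside = sum(1 for v in all_values if v < lo or v > hi)
    return outside / len(all_values)

# Well-separated classes give mfe = 1; fully overlapping ranges give mfe = 0.
print(mfe([1, 2, 3], [7, 8, 9]))  # 1.0
print(mfe([1, 2, 9], [1, 5, 9]))  # 0.0
```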
The permutation importance is computed as follows: after the machine learning model is trained, its prediction performance is P; all feature values of the i-th feature column are then randomly shuffled and the prediction is repeated, giving performance P′ᵢ. Then Pᵢ = P - P′ᵢ is the permutation importance of the i-th feature column; the larger Pᵢ, the more important the i-th feature column.
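The single-feature permutation importance described above can be sketched as follows. The tiny stand-in model and the function names are illustrative assumptions, not the patent's model.

```python
import random

def accuracy(predict, X, y):
    """Prediction performance P: fraction of correctly predicted samples."""
    return sum(predict(x) == t for x, t in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, col, seed=0):
    """P minus the accuracy after randomly shuffling the values of column `col`."""
    p = accuracy(predict, X, y)
    rng = random.Random(seed)
    shuffled = [row[col] for row in X]
    rng.shuffle(shuffled)
    X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
    return p - accuracy(predict, X_perm, y)

# Labels depend only on column 0, so shuffling column 1 leaves accuracy unchanged.
X = [[0, 5], [1, 5], [0, 5], [1, 5]]
y = [-1, 1, -1, 1]
predict = lambda x: 1 if x[0] > 0 else -1
print(permutation_importance(predict, X, y, col=1))  # 0.0
```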
In step S30, the permutation importance is computed as follows:
with the feature values of the feature subset left unshuffled, compute a first prediction performance of the machine learning model; randomly shuffle the feature values of all features in the subset simultaneously and compute a second prediction performance; the first prediction performance minus the second prediction performance is the permutation importance of the feature subset. The larger the performance drop, the larger the permutation importance. The computation is illustrated in Fig. 5, taking [F3, Fn] as an example.
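The subset variant described above extends the single-feature case by shuffling every column of the subset at the same time. A minimal sketch, under the assumption (one reading of "shuffled simultaneously") that each column in the subset is shuffled independently; names are illustrative.

```python
import random

def accuracy(predict, X, y):
    return sum(predict(x) == t for x, t in zip(X, y)) / len(y)

def subset_permutation_importance(predict, X, y, cols, seed=0):
    """First prediction performance minus performance after shuffling all `cols`."""
    first = accuracy(predict, X, y)
    rng = random.Random(seed)
    X_perm = [row[:] for row in X]
    for col in cols:
        values = [row[col] for row in X_perm]
        rng.shuffle(values)
        for row, v in zip(X_perm, values):
            row[col] = v
    return first - accuracy(predict, X_perm, y)

# A model that relies on both features: shuffling the subset [0, 1] can only hurt.
X = [[0, 0], [1, 1], [0, 0], [1, 1]]
y = [-1, 1, -1, 1]
predict = lambda x: 1 if x[0] + x[1] > 1 else -1
print(subset_permutation_importance(predict, X, y, cols=[0, 1]))
```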
Step S40 specifically comprises:
Step S41: select feature subsets one by one according to their priority; a smaller value of n means a higher priority, because a decision rule involving fewer important features is easier to understand and use, so feature subsets with fewer features are considered first.
Step S42: based on counterfactual theory, search each feature subset, in order of permutation importance, for the minimum feature change required to flip the prediction result, thereby generating category boundaries, e.g. "when Label = 1, F3 < 10 and F5 > 20". When two samples are, in a broad sense, counterfactual instances of each other, the machine learning model predicts them differently. The difference between pairs of samples on the selected feature subset is computed; among pairs with different predictions, the smaller the difference, the closer the pair lies to the category boundary. A category boundary describes the difference between the two classes of samples on the feature subset.
Let F be a feature subset selected in step S41, |F| the number of features in it, f(x) the prediction of the machine learning model to be explained for a sample (assumed to be -1 or 1), xᵢ,F the feature values of the i-th sample on the subset F, and xᵢ,F⁽ᵏ⁾ the value of the k-th feature of xᵢ,F. For a selected pair of samples with different predictions, their difference on the selected feature subset is computed as
S(i, j, F) = Σₖ |xᵢ,F⁽ᵏ⁾ - xⱼ,F⁽ᵏ⁾|, k = 1, ..., |F|, subject to f(xᵢ) ≠ f(xⱼ);
the smaller this difference, the better.
The difference between two samples on the selected feature subset is subject to two requirements. First, the fewer features involved the better, i.e. the smaller |F| the better: fewer features mean fewer conditions in the class description, which yields shorter decision rules. Second, on the involved subset, the smaller the difference in feature values the better: if only a tiny change in a feature value flips the prediction, the neighborhood of that value is very close to the boundary between the two classes. All possible values of S(i, j, F) are computed and sorted, the pairs (i, j, F) with the smallest values are selected, and the averages of the corresponding xᵢ,F and xⱼ,F form the candidate category boundary descriptions.
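The pairwise boundary search above can be sketched as follows: for pairs with opposite predictions, compute the difference S on the subset, keep the smallest, and take the midpoint of the best pair's feature values as a candidate boundary description. A minimal sketch with illustrative names and data; here only the single best pair is kept rather than several.

```python
def subset_difference(x1, x2, subset):
    """S(i, j, F): summed absolute difference on the feature subset."""
    return sum(abs(x1[k] - x2[k]) for k in subset)

def candidate_boundary(X, preds, subset):
    """Midpoint of the closest counterfactual pair on `subset`."""
    best = None
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            if preds[i] == preds[j]:
                continue  # only counterfactual pairs (predictions differ)
            s = subset_difference(X[i], X[j], subset)
            if best is None or s < best[0]:
                best = (s, i, j)
    s, i, j = best
    # Average the pair's feature values on the subset as the boundary description.
    return {k: (X[i][k] + X[j][k]) / 2 for k in subset}

# Samples given as [F0, F1]; the closest opposite-prediction pair straddles
# thresholds near F0 = 10 and F1 = 20.
X = [[9.0, 21.0], [11.0, 19.0], [2.0, 30.0]]
preds = [1, -1, 1]
print(candidate_boundary(X, preds, subset=[0, 1]))  # {0: 10.0, 1: 20.0}
```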
The steps above use samples whose predictions differ from that of the current sample as counterfactual samples. However, some data sets have few samples, or the two classes differ greatly in sample count; the category boundaries obtained by the steps above may then be few in number or generalize poorly. In that case, category boundaries can be searched for by adding perturbations to the feature subset of a single sample, as shown in Fig. 6. The specific steps are as follows.
Select a feature subset F, add to the feature values of sample xᵢ on F the minimum perturbation that flips the prediction result, and compute the feature change before and after:
S(i, F, Δxᵢ,F) = Σₖ |Δxᵢ,F⁽ᵏ⁾|, k = 1, ..., |F|.
On a feature subset containing multiple features, for the k-th feature Fₖ in F, first compute its maximum and minimum values, denoted Fₖ,max and Fₖ,min, and then determine the search range of the perturbation: the original value of Fₖ plus the perturbation should lie in the interval [Fₖ,min - αA, Fₖ,max + αA], where A = Fₖ,max - Fₖ,min and α = 0.2. Once the range is determined, the perturbation can be searched with a fixed step size or with a greedy strategy. The perturbation on F is determined by grid search, i.e. the values of the other features are held fixed while the perturbation of one feature value is searched.
Compute all S(i, F, Δxᵢ,F), sort them, and select the entries (i, F, Δxᵢ,F) with the smallest values; the corresponding xᵢ + Δxᵢ,F are the candidate category boundary descriptions.
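The single-sample perturbation search above can be sketched for one feature: scan a fixed-step grid over [Fₖ,min - 0.2A, Fₖ,max + 0.2A] with the other features held fixed, and return the smallest perturbation that flips the prediction. The stand-in model, step size, and names are illustrative assumptions.

```python
def min_flip_perturbation(predict, x, col, col_values, step=0.5, alpha=0.2):
    """Smallest delta on feature `col` (within the alpha-widened range) that
    flips predict(x); returns None if no grid point flips the prediction."""
    f_min, f_max = min(col_values), max(col_values)
    a = f_max - f_min
    lo, hi = f_min - alpha * a, f_max + alpha * a
    original = predict(x)
    best = None
    v = lo
    while v <= hi:
        delta = v - x[col]
        candidate = x[:col] + [v] + x[col + 1:]
        if predict(candidate) != original and (best is None or abs(delta) < abs(best)):
            best = delta
        v += step
    return best

# Stand-in model with a threshold at F0 = 10; the sample sits just below it.
predict = lambda x: 1 if x[0] > 10 else -1
x = [9.0, 21.0]
print(min_flip_perturbation(predict, x, col=0, col_values=[2.0, 9.0, 11.0]))
```

The printed delta is the smallest grid step pushing F0 just past the threshold at 10, i.e. roughly 1.2 with this data.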
A preferred embodiment of the system for automatically generating decision rules for drug analysis according to the present invention comprises the following modules:
a drug feature extraction module, configured to obtain drug data in SDF format from the PubChem database, extract drug features from the drug data, and construct a data set; after the drug data are obtained, label data are also set for each drug according to the drug analysis task, e.g. in a human hepatotoxicity prediction task the label is 1 or -1, indicating whether the drug is hepatotoxic;
an important feature screening module, configured to compute the variance, data complexity, and permutation importance of each drug feature in the data set, and to screen important features based on the variance, data complexity, and permutation importance;
a feature subset construction module, configured to select, from the important features, feature subsets each containing n (n = 1, 2, 3, ...) features, and to compute the permutation importance of each feature subset;
a category boundary generation module, configured to search each feature subset, in order of permutation importance, for the minimum feature change required to flip the prediction result, thereby generating category boundaries;
a decision rule generation module, configured to describe the category boundaries as decision rules using Bayes' formula.
A category boundary describes the difference between the two classes of samples on a feature subset, e.g. "when Label = 1, F3 < 10 and F5 > 20". Such feature-and-threshold descriptions cannot be used directly as decision rules: first, "F3 < 10, F5 > 20" is obtained from a subset of the samples and does not directly represent a global pattern; second, "when Label = 1, F3 < 10 and F5 > 20" is not directly equivalent to "if F3 < 10 and F5 > 20 then Label = 1".
For each category boundary description, first check how well it matches the existing samples, i.e. the proportion of samples of that class that satisfy the description. For example, denote Label = 1 as event A, Label taking any other value as event Ā, and satisfying "F3 < 10, F5 > 20" as event B; then P(B|A) is the proportion of samples with Label = 1 that satisfy "F3 < 10, F5 > 20".
The final decision rule must infer the value of Label from the category boundary, i.e. infer A from B, so it must be checked whether B → A can be derived from A → B. First compute P(A|B) and P(Ā|B); if P(A|B) > P(Ā|B), then event B favors inferring event A rather than Ā, and the decision rule B → A is obtained, i.e. "if F3 < 10 and F5 > 20 then Label = 1". All candidate rules A → B are processed in descending order of P(B|A): the larger P(B|A), the more samples match the category boundary and the higher its priority; boundaries with P(B|A) below a preset threshold are not used. P(A|B) is computed by Bayes' formula:
P(A|B) = P(B|A) · P(A) / P(B).
After the above steps, a series of decision rules is obtained from the machine learning model.
The set of decision rules is then output and used in place of the original machine learning model to predict drugs, with the matching decision rules taken as the decision basis; drug-domain experts can analyze the decision rules and summarize drug characteristics, making drug screening more targeted.
所述药物特征提取模块中,所述药物数据至少包括分子指纹以及物理化学性质;所述物理化学性质至少包括分子量、疏水参数以及溶解度。In the drug feature extraction module, the drug data at least includes molecular fingerprints and physicochemical properties; the physicochemical properties at least include molecular weight, hydrophobic parameters, and solubility.
所述分子指纹以及物理化学性质的提取过程如下:利用工具包ChemmineR将药物的化学结构转化为数值向量,得到药物的分子指纹;向量的每一维为1或0,分别表示特定的化学子结构是否存在;使用OpenBabel、ChemmineR和JoeLib等工具和相关数据库,计算药物的物理化学性质,并表示成实数。The extraction process of the molecular fingerprint and physicochemical properties is as follows: using the toolkit ChemmineR to convert the chemical structure of the drug into a numerical vector to obtain the molecular fingerprint of the drug; each dimension of the vector is 1 or 0, representing a specific chemical substructure, respectively. Does it exist? Use tools such as OpenBabel, ChemmineR and JoeLib and related databases to calculate the physicochemical properties of drugs and express them as real numbers.
所述重要特征筛选模块具体为:The important feature screening module is specifically:
计算所述数据集中各药物特征的方差、数据复杂度以及排列重要性,将各所述药物特征的方差、数据复杂度以及排列重要性进行求和并排序,进而筛选重要特征。Calculate the variance, data complexity, and arrangement importance of each drug feature in the data set, sum and sort the variance, data complexity, and arrangement importance of each drug feature, and then screen important features.
The larger the variance, the more dispersed the feature values and the more important the drug feature. The variance is computed as:

σ² = (1/n) · Σᵢ₌₁ⁿ (xᵢ − μ)²

where n is the number of samples and μ is the mean.
Data complexity reflects the distribution of the samples and measures the amount of information carried by a feature column. Maximal feature efficiency (mfe) reflects a feature's ability to distinguish the classes; the core idea is to measure how much the samples of different classes overlap on that feature. Let c1 and c2 denote the two classes, fᵢ the i-th feature column, and N the number of samples. Then Rᵢ is the overlap interval of c1 and c2 on fᵢ, computed as:
Rᵢ = [max(min(c1, fᵢ), min(c2, fᵢ)), min(max(c1, fᵢ), max(c2, fᵢ))]
The mfe indicator (data complexity) is the ratio of the number of samples falling outside this interval to the total number of samples:

mfe(fᵢ) = (1/N) · Σⱼ₌₁ᴺ I(xⱼ,ᵢ ∉ Rᵢ)

where I(·) is the indicator function, equal to 1 when the condition is true and 0 otherwise. mfe ranges over [0, 1]: when all samples fall inside the overlap interval, the two classes are hard to distinguish and mfe = 0; when the two classes do not overlap, or no sample falls inside the overlap interval, the class boundary is clear and mfe = 1. The larger the mfe, the more important the feature.
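A small sketch of the mfe computation (toy data; classes labeled 1 and −1 as assumed above):

```python
import numpy as np

def mfe(feature, labels):
    """Maximal feature efficiency: share of samples outside the
    class-overlap interval R_i of one feature column."""
    c1 = feature[labels == 1]
    c2 = feature[labels == -1]
    lo = max(c1.min(), c2.min())   # lower end of the overlap interval
    hi = min(c1.max(), c2.max())   # upper end of the overlap interval
    if lo > hi:                    # the classes do not overlap at all
        return 1.0
    inside = (feature >= lo) & (feature <= hi)
    return 1.0 - inside.sum() / feature.size

x = np.array([1.0, 2.0, 3.0, 8.0, 9.0, 10.0])
y = np.array([1, 1, 1, -1, -1, -1])
print(mfe(x, y))   # classes span [1,3] and [8,10], no overlap -> 1.0
```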
Permutation importance is computed as follows: after the machine learning model is trained, its prediction performance is P; the values of the i-th feature column are then randomly shuffled and the performance is measured again as P'ᵢ. The permutation importance of the i-th feature is Pᵢ = P − P'ᵢ; the larger Pᵢ, the more important the i-th feature.
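The drop P − P'ᵢ can be sketched with a toy stand-in for the trained model (the model and data below are illustrative only, not the patent's):

```python
import numpy as np

def predict(X):
    """Toy stand-in for a trained black-box model: only column 0 matters."""
    return (X[:, 0] > 0).astype(int)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = predict(X)                       # labels the model reproduces exactly

base = (predict(X) == y).mean()      # performance P (here 1.0)

importances = []
for i in range(X.shape[1]):
    Xp = X.copy()
    rng.shuffle(Xp[:, i])            # destroy the information in column i
    p_i = (predict(Xp) == y).mean()  # performance P'_i
    importances.append(base - p_i)   # permutation importance P - P'_i

# only column 0 carries signal, so its performance drop dominates
print(importances)
```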
In the feature subset construction module, the permutation importance of a subset is computed as follows:
first, compute the model's prediction performance with the feature values of the subset left intact (the first prediction performance); then randomly shuffle the values of all features in the subset simultaneously and compute the model's prediction performance again (the second prediction performance). The permutation importance of the feature subset is the first performance minus the second; the larger the performance drop, the larger the permutation importance. Figure 5 illustrates the computation, taking [F3, Fn] as an example.
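A sketch of this subset variant, shuffling every column of the subset at once; the XOR-style toy model (hypothetical, for illustration) shows why jointly important features can look unimportant individually:

```python
import numpy as np

def predict(X):
    """Toy black-box model whose output depends on features 0 and 1 jointly."""
    return ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

def subset_importance(X, y, cols, rng):
    """Shuffle every column in `cols` simultaneously; return the drop P - P'."""
    base = (predict(X) == y).mean()
    Xp = X.copy()
    for c in cols:
        rng.shuffle(Xp[:, c])
    return base - (predict(Xp) == y).mean()

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = predict(X)

drop_pair = subset_importance(X, y, [0, 1], rng)  # joint shuffle of the subset
drop_lone = subset_importance(X, y, [2], rng)     # irrelevant feature
print(drop_pair, drop_lone)
```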
The category boundary generation module specifically includes:
a feature subset selection unit, which selects feature subsets in order of priority: the smaller n is, the higher the priority, because decision rules involving fewer important features are easier to understand and use, so subsets with fewer features are considered first;
a minimum feature change search unit, which, based on counterfactual theory, searches each feature subset in order of permutation importance for the minimum feature change that flips the prediction, thereby generating a category boundary, e.g. "when Label=1, F3<10, F5>20". When two samples are counterfactual instances of each other in the broad sense, the machine learning model predicts different classes for them. The difference between a pair of samples on the selected feature subset is computed; for pairs with different predictions, the smaller the difference, the closer the pair lies to the category boundary. The category boundary describes how the two classes of samples differ on this feature subset.
Let F be a feature subset selected in step S41, |F| the number of its features, and f(x) the prediction of the machine learning model to be explained for a sample (assumed to be −1 or 1). Let x_{i,F} denote the feature values of the i-th sample on the subset F, and x_{i,F}^k the value of its k-th feature. For a selected pair of samples with f(x_i) ≠ f(x_j), their difference on the selected subset is computed (the smaller, the better) as:

S(i, j, F) = Σₖ₌₁^|F| |x_{i,F}^k − x_{j,F}^k|
The difference between two samples on the selected feature subset must satisfy two requirements: first, the fewer features involved, the better, i.e. the smaller |F| the better; second, on those features the difference in values should be as small as possible, i.e. the smaller S(i, j, F) the better. Fewer features mean fewer conditions delimiting the classes, which yields shorter decision rules; if only a small change in feature values flips the prediction, those values lie very close to the boundary between the two classes. All possible S(i, j, F) are computed; after sorting, the pairs (i, j, F) with the smallest values are selected, and the averages of the corresponding x_{i,F} and x_{j,F} give candidate category boundary descriptions.
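The pairwise search can be sketched as follows (toy data; `pairwise_boundary_candidates` is a hypothetical helper name, not from the patent). It scores every oppositely-predicted pair by S(i, j, F) and averages the closest pairs into candidate boundaries:

```python
import numpy as np

def pairwise_boundary_candidates(X, preds, F, top=1):
    """Score every pair with opposite predictions by S(i, j, F) and
    return midpoints of the closest pairs as candidate boundaries."""
    scored = []
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            if preds[i] != preds[j]:                  # counterfactual pair
                s = np.abs(X[i, F] - X[j, F]).sum()   # S(i, j, F)
                scored.append((s, i, j))
    scored.sort()
    return [(X[i, F] + X[j, F]) / 2 for _, i, j in scored[:top]]

X = np.array([[9.0, 22.0], [11.0, 19.0], [2.0, 40.0], [30.0, 1.0]])
preds = np.array([1, -1, 1, -1])
mids = pairwise_boundary_candidates(X, preds, F=[0, 1])
print(mids)   # closest opposite pair is samples 0 and 1 -> midpoint [10.0, 20.5]
```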
The steps above use samples whose predictions differ from the current sample as counterfactual samples. However, when a data set has few samples, or the two classes have very unequal sample counts, the category boundaries obtained this way may be few in number or generalize poorly. In that case, category boundaries can be searched by adding perturbations to the feature subset of a single sample, as shown in Figure 6. The specific steps are:
select a feature subset F and add to the feature values of sample x_i on F the minimum perturbation Δx_{i,F} that flips the prediction, i.e. f(x_i + Δx_{i,F}) ≠ f(x_i); the magnitude of the change before and after is:

S(i, F, Δx_{i,F}) = Σₖ₌₁^|F| |Δx_{i,F}^k|
On a feature subset containing several features, for the k-th feature Fₖ in F, first compute its maximum and minimum, Fₖ,max and Fₖ,min, and then determine the search range of the perturbation: after adding the perturbation, the value of Fₖ must lie in the interval [Fₖ,min − αA, Fₖ,max + αA], where A = Fₖ,max − Fₖ,min and α = 0.2. Within this range the perturbation can be searched at a fixed step size or with a greedy strategy. The perturbation of F is determined by grid search: the values of the other features are held fixed while the perturbation of one feature's value is searched.
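A sketch of the single-sample perturbation search on one feature, assuming a toy black-box model with its decision boundary at 10 (model, names, and step size are illustrative):

```python
import numpy as np

def predict(x):
    """Toy black-box model on one feature with its boundary at 10."""
    return 1 if x[0] < 10 else -1

def minimal_flip_perturbation(x, k, f_min, f_max, alpha=0.2, step=0.5):
    """Grid-search the perturbation of feature k over
    [f_min - alpha*A, f_max + alpha*A] and return the smallest
    |delta| that flips the prediction (None if nothing flips it)."""
    A = f_max - f_min
    base = predict(x)
    best = None
    v = f_min - alpha * A
    while v <= f_max + alpha * A:
        xp = x.copy()
        xp[k] = v
        if predict(xp) != base:
            delta = v - x[k]
            if best is None or abs(delta) < abs(best):
                best = delta
        v += step
    return best

x = np.array([8.0])
delta = minimal_flip_perturbation(x, 0, f_min=0.0, f_max=20.0)
print(delta)   # moving the feature from 8 to 10 flips the prediction -> 2.0
```

For multi-feature subsets, the same scan is nested per feature while the others are held fixed, matching the grid-search description above.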
Compute all S(i, F, Δx_{i,F}); after sorting, select the (i, F, Δx_{i,F}) with the smallest values; the corresponding x_i + Δx_{i,F} are the candidate category boundary descriptions.
In summary, the advantages of the present invention are:
1. Drug features are extracted from the drug data, category boundaries are generated from the relationship between drug features and prediction results, and the boundaries are then described as decision rules via Bayes' formula. Only the drug features and prediction results are needed to generate decision rules, independently of the machine learning model itself, so easy-to-understand decision rules can be generated for otherwise uninterpretable models.
2. Important features are screened by variance, data complexity, and permutation importance, combining several feature evaluation methods for more accurate dimensionality reduction. Feature subsets containing different numbers of features are then built from these important features, and the permutation importance of each subset is computed, fully accounting for the interactions between important features. This is closer to real conditions and helps generate more accurate decision rules.
Although specific embodiments of the present invention have been described above, those skilled in the art should understand that the described embodiments are merely illustrative and do not limit the scope of the invention; equivalent modifications and variations made in accordance with the spirit of the present invention shall fall within the scope of protection of the claims of the present invention.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210422918.XA | 2022-04-21 | 2022-04-21 | Automatic decision rule generation method and system applied to drug analysis |
| Publication Number | Publication Date |
|---|---|
| CN114927239A | 2022-08-19 |
| CN114927239B | 2024-07-02 |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107273912A | 2017-05-10 | 2017-10-20 | Chongqing University of Posts and Telecommunications | An active learning method based on three-way decision theory |
| EP3573068A1 | 2018-05-24 | 2019-11-27 | Siemens Healthcare GmbH | System and method for an automated clinical decision support system |
| CN110534190A | 2018-05-24 | 2019-12-03 | Siemens Healthcare GmbH | System and method for an automated clinical decision support system |
| US11127488B1 | 2020-09-25 | 2021-09-21 | Accenture Global Solutions Limited | Machine learning systems for automated pharmaceutical molecule screening and scoring |
| Title |
|---|
| M.P.K. Webb and D. Sidebotham: "Bayes' formula: a powerful but counterintuitive tool for medical decision-making", BJA Education, 19 April 2020 |
| Weiping Lin et al.: "An enhanced cascade-based deep forest model for drug combination prediction", Briefings in Bioinformatics, 10 March 2022 |
| Liang Li et al.: "Applications and challenges of artificial intelligence in drug discovery", Progress in Pharmaceutical Sciences, no. 01, 25 January 2020 |
| Xu Juanjuan et al.: "Machine-learning-based analysis of the efficacy of typical drugs against COVID-19", Chinese Journal of Hospital Pharmacy, no. 11, 21 April 2020 |