Ensemble-learning-based method for predicting zinc-binding sites in proteins under class imbalance
Technical field
The present invention relates to an ensemble-learning-based method for predicting zinc-binding sites in proteins under class imbalance. It uses an ensemble classification model to identify zinc-binding sites in an imbalanced classification setting, and belongs to the interdisciplinary field of proteomics and computer science.
Background art
With the completion of the Human Genome Project, life science has fully entered the post-genomic era, and the proteins expressed by genes have become one of the important research topics in the life sciences and natural sciences. Proteins are the basic organic matter of which cells are composed and the material basis of life, and they play a decisive role in biological processes. This decisive role, however, usually cannot be fulfilled by a single protein alone: in most cases a protein must interact with other proteins or with ligands to carry out a specific biological function.
In the cell, proteins are the agents and executors of vital activities and perform specific key functions by interacting with ligands, for example in DNA synthesis, signal transduction, transcriptional activation of genes, metabolic processes and antiviral defence. In addition, protein interactions are of great value for the treatment of various diseases, in particular those involving invasion by viral proteins such as those of the Ebola virus: they can reveal the pathogenesis of certain diseases and guide the discovery of drug targets and the development of new drugs.
Metal ions bound to proteins act as cofactors that are decisive for the biological function of the protein and even for some life processes. Zinc is the second most abundant metal ion in living organisms, after iron, and plays an important regulatory role in growth and development, disease control, DNA synthesis and so on. Zinc deficiency can lead to diseases such as age-related degenerative diseases, malignant tumours and Wilson's disease. Zinc also plays an important part in ageing, apoptosis, immune function and oxidative stress. It is by binding to proteins that zinc ions exert biological functions such as catalysis, structure stabilization and coordination.
Zinc-binding sites in proteins are identified mainly by biochemical experiments. Although these experimental methods can determine the interaction sites between a protein and zinc ions, they are costly, time-consuming and laborious. Moreover, different experiments impose different constraints and rest on different principles, so their results contain a certain number of false negatives and false positives. Relying on experimental techniques alone to uncover the biological significance of these data therefore falls far short of what the development of biology requires.
With the development of information technology and the emergence of massive biological data, using computational methods such as data mining and machine learning algorithms to identify zinc-binding sites automatically has become an inevitable trend. Such methods are cheap and fast, can compensate for the shortcomings of experiments, and can directly support and guide costly experimental measurements of interactions.
Predicting zinc-binding sites is a binary classification problem. True binding sites are few and non-binding sites make up a very high proportion, so the task is a typical imbalanced classification problem. Existing prediction methods build classification models with data mining and similar techniques but treat the two classes equally and ignore the imbalance of the data, which results in low prediction precision for zinc-binding sites. Studying the imbalance in zinc-binding site prediction and improving the classification accuracy of the minority class is therefore of considerable research significance.
Summary of the invention
The purpose of the present invention is to provide an ensemble-learning-based method for predicting zinc-binding sites in proteins under class imbalance, addressing the imbalanced classification problem that arises in zinc-binding site prediction.
To solve the above technical problem, the present invention adopts the following technical solution:
An ensemble-learning-based method for predicting zinc-binding sites in proteins under class imbalance, comprising the following steps:
Step 1: preprocess the protein source data according to the characteristics of zinc-binding sites;
Step 2: balance the imbalanced zinc-binding site data by random down-sampling to obtain several balanced sub-datasets;
Step 3: on each balanced sub-dataset, select discriminative biochemical features of the protein and represent them as feature vectors;
Step 4: feed the feature vectors to the base classifier, a support vector machine (SVM), compute sample weights, then build a sample-weighted probabilistic neural network model, and finally integrate the SVM base classifier with the sample-weighted probabilistic neural network model to obtain the prediction model;
Step 5: use the prediction model obtained in Step 4 to identify the zinc-binding sites in the target samples.
In Step 1, the preprocessing removes the following noisy data:
(1) peptide chains with homology higher than 70% are removed;
(2) duplicate chains, short protein chains, and erroneous or unreliable data are discarded;
(3) chains are removed so that the sequence redundancy is less than 20%.
In Step 2, the balancing is performed by random down-sampling of the majority-class samples: each draw extracts as many samples as there are minority-class samples, yielding several balanced sub-datasets. The majority-class samples are the non-binding sites, and the minority-class samples are the zinc-binding sites.
In Step 3, the discriminative biochemical features comprise the position-specific scoring matrix (PSSM), the conservation score and RW-GRMTP (relative weight of gapless real matches to pseudocounts). The position-specific scoring matrix is normalized and processed with a histogram and a sliding window to obtain a 20-dimensional vector; the 20-dimensional conservation score is converted into a single value; RW-GRMTP is normalized to obtain a 2-dimensional vector; together these form a 23-dimensional feature vector.
In Step 4, an SVM base classifier is trained on each balanced sub-dataset, and the prediction error rate e_j and the importance weight α_j of each classification model are computed according to formula (1) and formula (2), respectively;
Here the full dataset is D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, where x_i ∈ X, X is the instance space of the classification problem, y_i ∈ {1, -1}, i = 1, 2, …, n, and n is the number of samples. w_mi is the sample weight, with initial value 1/n, i.e. w_1 = (w_11, w_12, ..., w_1n) with w_1i = 1/n for i = 1, 2, …, n and m = 1, 2. The SVM base classifier is trained on each of the k balanced datasets, giving k classification results C_svm_j(x), j = 1, …, k.
The current sample weights are then computed and normalized: if a sample is classified correctly, its weight is decreased; if it is misclassified, its weight is increased. The calculation is given by formula (3).
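Formulas (1) to (3) are not reproduced in this text. The following is a minimal sketch of the weighting scheme described above, assuming the standard AdaBoost forms: the error rate is the summed weight of misclassified samples, the model weight is 0.5·ln((1 − e)/e), and the sample weights are rescaled exponentially and renormalized. These forms, and the function names `weighted_error`, `classifier_weight` and `update_weights`, are assumptions for illustration and are not taken from the source.

```python
import math

def weighted_error(weights, y_true, y_pred):
    """Assumed form of formula (1): summed weight of misclassified samples."""
    return sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)

def classifier_weight(error, eps=1e-10):
    """Assumed form of formula (2): alpha = 0.5 * ln((1 - e) / e)."""
    error = min(max(error, eps), 1.0 - eps)  # keep the logarithm finite
    return 0.5 * math.log((1.0 - error) / error)

def update_weights(weights, y_true, y_pred, alpha):
    """Assumed form of formula (3): shrink weights of correctly classified
    samples, grow weights of misclassified ones, then normalize to sum 1."""
    raw = [w * math.exp(-alpha * t * p)
           for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return [r / total for r in raw]

# Labels are +1 / -1, as in the definition of D above.
y_true = [1, 1, -1, -1]
y_pred = [1, -1, -1, -1]                   # the second sample is misclassified
w1 = [1.0 / len(y_true)] * len(y_true)     # initial weights 1/n
e = weighted_error(w1, y_true, y_pred)
alpha = classifier_weight(e)
w2 = update_weights(w1, y_true, y_pred, alpha)
```

In this toy run the misclassified sample ends up with a larger weight than the three correctly classified ones, which is the behaviour the text describes.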
To build the sample-weighted probabilistic neural network model, the protein feature data are weighted and the weighted sample data are used as input to a probabilistic neural network, which makes the prediction. This method is denoted SWPNN, and its prediction result is SWPNN(x).
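The text does not spell out how the sample weights enter the probabilistic neural network. As one possible reading, the sketch below scales each training sample's Gaussian kernel contribution by its weight and returns the difference between the weighted class densities, so that a positive value means the positive class. The function name `swpnn_score`, the Gaussian kernel and the per-class normalization are assumptions; `spread` corresponds to the smoothing parameter of the same name mentioned for SWPNN.

```python
import math

def swpnn_score(x, train_x, train_y, sample_weights, spread=1.0):
    """Sample-weighted Parzen-window score (illustrative reading of SWPNN).

    Returns a value > 0 for the positive class (+1) and < 0 for the
    negative class (-1). Each training sample contributes a Gaussian
    kernel scaled by its sample weight (an assumption, see lead-in).
    """
    scores = {1: 0.0, -1: 0.0}
    totals = {1: 0.0, -1: 0.0}
    for xi, yi, wi in zip(train_x, train_y, sample_weights):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
        scores[yi] += wi * math.exp(-sq_dist / (2.0 * spread ** 2))
        totals[yi] += wi
    # Normalize by the total weight of each class to get weighted densities.
    pos = scores[1] / totals[1] if totals[1] > 0 else 0.0
    neg = scores[-1] / totals[-1] if totals[-1] > 0 else 0.0
    return pos - neg

train_x = [[0.0, 0.0], [1.0, 1.0]]
train_y = [1, -1]
weights = [0.5, 0.5]
near_positive = swpnn_score([0.0, 0.0], train_x, train_y, weights)
near_negative = swpnn_score([1.0, 1.0], train_x, train_y, weights)
```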
The SVM base classifier and the sample-weighted probabilistic neural network model are integrated to obtain the prediction model SSWPNN, SSWPNN = {SVM, SWPNN, kernelopt, spread, f}, where kernelopt and spread are the parameters of the SVM and SWPNN classifiers, respectively, and f is defined by formula (4). The corresponding weight β_j is computed from the error rate at the same time;
Here δ is a threshold, and C_svm_j(x) and SWPNN(x) are the classification results of the SVM and SWPNN classifiers, respectively: a value greater than 0 predicts a positive-class sample and a value less than 0 a negative-class sample. If the value of SVM(x) is positive but below the threshold δ and SWPNN(x) predicts a negative example, the integrated prediction is negative; in all other cases the SVM(x) result is taken as the final decision.
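The decision rule just described can be sketched as follows. The function name `fuse` is hypothetical, and treating an SVM value of exactly 0 as negative is an assumption, since the text only covers values above or below 0.

```python
def fuse(svm_value, swpnn_value, delta):
    """Combine SVM and SWPNN outputs as described for the function f.

    A weakly positive SVM output (0 < value < delta) is overruled when
    SWPNN predicts the negative class; otherwise the SVM decision stands.
    """
    if 0 < svm_value < delta and swpnn_value < 0:
        return -1
    # Assumption: a value of exactly 0 is treated as negative.
    return 1 if svm_value > 0 else -1

overruled = fuse(0.2, -0.5, delta=0.5)   # weak SVM positive, SWPNN negative
confident = fuse(0.8, -0.5, delta=0.5)   # SVM positive above the threshold
agreed = fuse(0.2, 0.3, delta=0.5)       # weak SVM positive, SWPNN agrees
negative = fuse(-0.4, 0.9, delta=0.5)    # SVM negative, SWPNN is ignored
```

Only the first case changes the SVM decision, so SWPNN acts purely as a veto on low-confidence positive predictions.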
In Step 5, the integrated model SSWPNN is applied to the entire test dataset, giving different classification results; these results are then combined by weighted integration to identify the zinc-binding sites in the target samples, as given by formula (5).
Beneficial effects:
From a machine-learning perspective, the present invention addresses the identification of zinc-binding sites in proteins under class imbalance and proposes a novel ensemble-learning-based prediction method. It effectively handles zinc-binding site prediction in the imbalanced classification setting and achieves a certain prediction accuracy. With suitable extension, the invention can also be applied to predicting the protein binding sites of other types of metal ions.
Brief description of the drawings
The present invention is further described below with reference to the accompanying drawings and specific embodiments, from which the above and other advantages of the invention will become clearer.
Fig. 1 is the overall framework of the method of the present invention.
Fig. 2 is the framework of the zinc-binding site classifier based on the SVM and SWPNN models.
Fig. 3 is a flow chart of the prediction process of the SSWPNN classifier.
Detailed description of the embodiments
The present invention can be better understood from the following embodiments.
The overall flow of the invention is shown in Fig. 1.
The present invention addresses zinc-binding site prediction on imbalanced datasets. Down-sampling is used to balance the data. Ensemble techniques are used to build a classifier model based on a support vector machine and a sample-weighted probabilistic neural network, and the model is used to classify and identify zinc-binding sites. The specific implementation steps are as follows:
1. Balancing
The zinc-binding sites are called the minority-class samples (negative-class samples); the non-binding sites are called the majority-class samples (positive-class samples). The majority-class samples are randomly down-sampled without replacement; to avoid the loss of useful majority-class information that random down-sampling may cause, the sampling without replacement is repeated multiple times over the full dataset. Each draw extracts as many majority-class samples as there are minority-class samples, so the majority class is divided into k subsets, and each subset is combined with the minority-class samples to form the balanced datasets D_1, D_2, …, D_k. The process is described by Algorithm 1:
Algorithm 1: data balancing
Input: protein sequence sample data D
Output: balanced sub-datasets D_1, D_2, …, D_k
BEGIN
    Divide(D)    // split D into MinoritySample and MajoritySample
    N = CountUp(MinoritySample)    // number of minority-class samples
    for i = 1 to k
        ExtractedSample_i = RandomExtract(MajoritySample, N)    // draw N majority-class samples
        D_i = Merge(MinoritySample, ExtractedSample_i)
        MajoritySample = MajoritySample - ExtractedSample_i    // sampling without replacement
    end for
END
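Algorithm 1 can be sketched in Python as follows. The function name `build_balanced_sets`, the `seed` parameter and the error raised when the majority class is too small are assumptions added for illustration. Shuffling once and cutting disjoint slices is equivalent to repeatedly drawing N samples and removing them from the pool.

```python
import random

def build_balanced_sets(minority, majority, k, seed=0):
    """Return k balanced datasets, each holding all minority samples plus
    an equally sized, disjoint random subset of the majority samples."""
    n = len(minority)
    if k * n > len(majority):
        raise ValueError("majority class too small for k disjoint subsets")
    rng = random.Random(seed)
    pool = list(majority)
    rng.shuffle(pool)  # a random order makes each slice a random draw
    # Disjoint slices correspond to sampling without replacement.
    return [list(minority) + pool[i * n:(i + 1) * n] for i in range(k)]

minority = ["site_a", "site_b"]           # toy stand-ins for minority samples
majority = list(range(10))                # toy stand-ins for majority samples
balanced = build_balanced_sets(minority, majority, k=3)
drawn = [item for d in balanced for item in d[len(minority):]]
```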
2. Feature representation
Discriminative biochemical features are selected: the position-specific scoring matrix (PSSM), the conservation score and RW-GRMTP (relative weight of gapless real matches to pseudocounts). They are represented as features and assembled into a set of feature vectors. The position-specific scoring matrix is normalized and processed with a histogram and a sliding window to obtain a 20-dimensional vector; the 20-dimensional conservation score is converted into a single value; RW-GRMTP is normalized to obtain a 2-dimensional vector; together these form a 23-dimensional feature vector.
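The way the three feature groups add up to 23 dimensions (20 + 1 + 2) can be sketched as follows. The text does not give the normalization or the histogram and sliding-window procedure, so this sketch assumes a logistic normalization of the PSSM scores and a plain average over a window of rows centred on the residue; the conservation value and the RW-GRMTP pair are taken as already computed. All function names and the `half_width` parameter are hypothetical.

```python
import math

def logistic(v):
    """Assumed normalization: map a raw PSSM score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def pssm_window_feature(pssm, center, half_width=3):
    """Average the normalized PSSM rows in a window around `center`.

    `pssm` is a list of rows with 20 scores each (one row per residue).
    The window is clipped at the chain ends. This averaging stands in
    for the unspecified histogram and sliding-window processing.
    """
    lo = max(0, center - half_width)
    hi = min(len(pssm), center + half_width + 1)
    rows = pssm[lo:hi]
    return [sum(logistic(r[c]) for r in rows) / len(rows) for c in range(20)]

def build_feature_vector(pssm, center, conservation_value, rw_grmtp_pair):
    """Concatenate 20 PSSM values, 1 conservation value and 2 RW-GRMTP values."""
    features = (pssm_window_feature(pssm, center)
                + [conservation_value]
                + list(rw_grmtp_pair))
    if len(features) != 23:
        raise ValueError("expected a 23-dimensional feature vector")
    return features

toy_pssm = [[0.0] * 20 for _ in range(5)]   # five residues, all scores zero
vec = build_feature_vector(toy_pssm, center=2, conservation_value=0.4,
                           rw_grmtp_pair=(0.1, 0.9))
```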
3. Integrating the support vector machine and the sample-weighted probabilistic neural network model
The SVM base classifier is trained first and the samples are weighted according to its classification results; for the "hard" samples that lie near the class boundary and are easily misclassified, a weighted probabilistic neural network model is trained.
Let the full dataset be D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, where x_i ∈ X, X is the instance space of the classification problem, y_i ∈ {1, -1}, i = 1, 2, …, n, and n is the number of samples. The process is as follows:
Step 1: train an SVM classifier on each balanced sub-dataset;
The SVM base classifier is trained on each of the k balanced sub-datasets with 5-fold cross-validation, giving k classification results C_svm_j(x), j = 1, …, k. The prediction error rate is denoted e_j and the importance weight of the classification model α_j; they are computed by formulas (1) and (2). In formula (1), w_mi is the sample weight, with initial value 1/n, i.e. w_1 = (w_11, w_12, ..., w_1n) with w_1i = 1/n for i = 1, 2, …, n and m = 1, 2.
Step 2: compute and normalize the current sample weights;
After the first round of prediction by the SVM base classifier, a sample that was classified correctly has its weight lowered in the next round of prediction, whereas a sample that was misclassified has its weight raised. The sample weight function is given by formula (3).
Step 3: train the sample-weighted PNN predictor, SWPNN;
The feature sample data are weighted with the weights computed in Step 2 and a weighted probabilistic neural network model is trained. The proposed method is denoted SWPNN and its prediction result is SWPNN(x). The framework of the zinc-binding site classifier based on the SVM and SWPNN models is shown in Fig. 2.
Step 4: integrate the SVM base classifier with the sample-weighted SWPNN classifier;
Integrating the SVM base classifier with the sample-weighted probabilistic neural network model gives a new prediction method, SSWPNN, i.e. SSWPNN = {SVM, SWPNN, kernelopt, spread, f}, where kernelopt and spread are the parameters of the SVM and SWPNN classifiers, respectively, and f is defined by formula (4). The corresponding weight β_j (the weight of this base classifier in the final classifier) is computed from the error rate at the same time.
Here δ is a threshold, and C_svm_j(x) and SWPNN(x) are the classification results of the SVM and SWPNN classifiers, respectively: a value greater than 0 predicts a positive-class sample and a value less than 0 a negative-class sample. If the value of SVM(x) is positive but small, i.e. below the threshold δ, and SWPNN(x) predicts a negative example, the integrated prediction is negative; in all other cases the SVM(x) result is taken as the final decision.
Step 5: the integrated model SSWPNN from Step 4 is applied to the entire dataset, giving different classification results; the results are then combined by weighted integration using formula (5) to identify the zinc-binding sites. The framework model is shown in Fig. 3.
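Formula (5) is not reproduced in this text. A common form of weighted integration, assumed here, is the sign of the β-weighted sum of the k model outputs. The function name `weighted_vote` is hypothetical, and resolving a tie (a sum of exactly 0) as negative is likewise an assumption.

```python
def weighted_vote(predictions, betas):
    """Assumed form of formula (5): sign of the beta-weighted sum of the
    +1 / -1 predictions of the k SSWPNN models for one residue."""
    total = sum(b * p for p, b in zip(predictions, betas))
    # Assumption: a tie (total == 0) is resolved as the negative class.
    return 1 if total > 0 else -1

# Three models disagree; the outcome depends on their weights.
strong_minority = weighted_vote([1, -1, -1], betas=[0.9, 0.3, 0.3])
weak_minority = weighted_vote([1, -1, -1], betas=[0.5, 0.3, 0.3])
```

With these toy weights a single well-weighted model can outvote two weaker ones, which is the point of weighting the models by their error rates rather than using a simple majority.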
The method was tested on a dataset of 392 protein chains and compared with four existing methods (meta-ZincPrediction, ZincExplorer, zincFinder, zincPred). Both in overall prediction performance on the four residue types Cys, His, Glu and Asp (CHED) and in prediction performance on each individual residue type, the method of the present invention outperforms the other methods.
The present invention provides the idea and method of an ensemble-learning-based zinc-binding site prediction method under class imbalance. There are many ways and means of implementing this technical solution, and the above is only a preferred embodiment of the invention. It should be noted that those of ordinary skill in the art can make several improvements and refinements without departing from the principle of the invention, and these improvements and refinements shall also be regarded as falling within the scope of protection of the invention. Any component not specified in this embodiment can be implemented with existing technology.