Summary of the Invention
The purpose of the present invention is to provide a method for building a computer-aided lung cancer detection model using a support vector machine.
To achieve the above purpose, the present invention adopts the following technical solution:
A method for building a computer-aided lung cancer detection model using a support vector machine, comprising:
Obtaining bronchial epithelial cells from a subject and extracting the subject's epithelial cell RNA;
Constructing a double-stranded cDNA library from the subject's epithelial cell RNA and sequencing it;
Aligning the sequencing results to a reference genome, and selecting the subject's significantly different genes and significantly different variations according to the alignment results;
Taking, as sample data, the vector composed of the subject's significantly different genes and significantly different variations together with a (0,1) value indicating whether the subject has lung cancer;
Setting, for a support vector machine model that uses a radial basis kernel function, the initial value and the adjustment interval of the covariance of the radial basis kernel function and of the penalty factor, respectively;
Randomly dividing the sample data into a training set and a prediction set, training the support vector machine model with the radial basis kernel function multiple times on the training set and the prediction set, adjusting the covariance and the penalty factor respectively according to whether the results on the prediction set are correct, and taking the support vector machine model with the radial basis kernel function, after adjustment of the covariance and the penalty factor, as the computer-aided lung cancer detection model.
Preferably, constructing a double-stranded cDNA library from each subject's epithelial cell RNA and sequencing it comprises:
Testing whether the purity, concentration and integrity of the subject's epithelial cell RNA are acceptable;
For subject epithelial cell RNA that passes the test, removing the rRNA from the RNA, randomly fragmenting the mRNA, synthesizing and purifying double-stranded cDNA, performing end repair and adapter ligation on the purified double-stranded cDNA, performing fragment size selection, and finally performing PCR amplification to construct the double-stranded cDNA library, which is then sequenced.
Preferably, selecting the significantly different genes according to the alignment results comprises:
Calculating the RPKM value of each gene of each subject according to the alignment results;
Building an RPKM matrix of gene expression levels for the subject group, in which the first column holds the gene name and related information and each column from the second column onward holds the RPKM values of the corresponding genes for one subject;
For each row of the RPKM matrix, performing a t-test between the groups defined by whether the subjects have lung cancer, to obtain a vector of P values for the expression levels of the different genes;
Selecting the significantly different genes according to the magnitudes of the values in the P value vector.
Preferably, selecting the significantly different variations according to the alignment results comprises:
Performing variation detection based on SNP calling according to the alignment results, to obtain the variation information of each subject;
Building a variation information matrix for the subject group, in which each column holds the variation information of one subject and each row represents one specific variation;
For each row of the variation information matrix, performing a chi-square test between the groups defined by whether the subjects have lung cancer, to obtain a vector of P values for the different variations;
Selecting the significantly different variations according to the magnitudes of the values in the P value vector.
Preferably, performing variation detection based on SNP calling according to the alignment results to obtain the variation information of the subject comprises:
Performing variation detection based on SNP calling according to the alignment results, to obtain variation information in VCF format that records gene mutations, the VCF-format variation information comprising annotation information beginning with # and mutation information consisting of multiple columns of variation information.
Preferably, after aligning the sequencing results to the reference genome and selecting the subject's significantly different genes and significantly different variations according to the alignment results, and before taking the vector composed of the subject's significantly different genes and significantly different variations together with the (0,1) value indicating whether the subject has lung cancer as sample data, the method further comprises:
Performing principal component analysis on the significantly different genes and significantly different variations selected according to the alignment results; selecting from them, respectively, a set of principal components whose members are strongly uncorrelated and which can represent the sequencing results; and taking the genes and variations corresponding to that set of principal components as the selected significantly different genes and significantly different variations.
Preferably, the prediction set accounts for 10% to 50% of the sample data.
Preferably, the initial values of the covariance and of the penalty factor range from 0.1 to 5, and the adjustment interval ranges from 0.1 to 0.5.
Preferably, the method further comprises:
Increasing the number of subjects and updating the sample data;
Randomly dividing the updated sample data into a training set and a prediction set, training the support vector machine model with the radial basis kernel function multiple times on the training set and the prediction set, and adjusting the covariance and the penalty factor respectively according to whether the results on the prediction set are correct.
The beneficial effects of the present invention are as follows:
The computer-aided lung cancer detection model established by the technical solution of the present invention has a detection rate above 90% and an error rate not higher than 5%. Detection accuracy is high, detection and adjustment consume few computing resources, and the method is adaptable and easy to implement.
Detailed Description of the Embodiments
To illustrate the present invention more clearly, it is further described below with reference to preferred embodiments and the accompanying drawings. Similar components are indicated with the same reference numerals in the drawings. Those skilled in the art will appreciate that what is specifically described below is illustrative rather than restrictive and should not be used to limit the scope of protection of the present invention.
As shown in Figure 1, the method provided in this embodiment for building a computer-aided lung cancer detection model using a support vector machine comprises:
Obtaining bronchial epithelial cells from a subject and extracting the subject's epithelial cell RNA;
Constructing a double-stranded cDNA library from the subject's epithelial cell RNA and sequencing it;
Aligning the sequencing results to a reference genome, and selecting the subject's significantly different genes and significantly different variations according to the alignment results;
Taking, as sample data, the vector composed of the subject's significantly different genes and significantly different variations together with a (0,1) value indicating whether the subject has lung cancer;
Setting, for a support vector machine model that uses a radial basis kernel function, the initial value and the adjustment interval of the covariance of the radial basis kernel function and of the penalty factor, respectively;
Randomly dividing the sample data into a training set and a prediction set, training the support vector machine model with the radial basis kernel function multiple times on the training set and the prediction set, adjusting the covariance and the penalty factor respectively according to whether the results on the prediction set are correct, and taking the support vector machine model with the radial basis kernel function, after adjustment of the covariance and the penalty factor, as the computer-aided lung cancer detection model.
In a specific implementation, constructing a double-stranded cDNA library from each subject's epithelial cell RNA and sequencing it comprises:
Testing whether the purity, concentration and integrity of the subject's epithelial cell RNA are acceptable;
For subject epithelial cell RNA that passes the test, removing the rRNA from the RNA, randomly fragmenting the mRNA, synthesizing and purifying double-stranded cDNA, performing end repair and adapter ligation on the purified double-stranded cDNA, performing fragment size selection, and finally performing PCR amplification to construct the double-stranded cDNA library, which is then sequenced.
In this embodiment, when the operations are carried out by computer, the insert fragment size and the concentration of the double-stranded cDNA library are quantified once library construction is complete. After the library passes inspection, it is sequenced on the instrument to obtain paired-end sequencing results in Fastq format (RawData). The adapters and indexes added during library construction are removed from the RawData files, and low-quality sequencing data are removed, to obtain CleanData in Fastq format. Then, for the step of "aligning the sequencing results to a reference genome", the CleanData of each individual subject is aligned to the reference genome with alignment software to obtain the alignment results. Here, the reference genome is an industry-standard human genome sequence file, such as hg19/GRCh37, hg38/GRCh38 or the YanHuang (YH) genome.
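The removal of low-quality sequencing data described above might be sketched as follows. This is only an illustration under stated assumptions: the Phred+33 quality encoding, the mean-quality criterion and the threshold of 20 are hypothetical choices, not values given in this embodiment, and adapter and index trimming is not shown.

```python
from typing import Iterable, Iterator, List, Tuple

# One Fastq record: header line, sequence, separator line, quality string.
FastqRecord = Tuple[str, str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield 4-line Fastq records from an iterable of text lines."""
    buf: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line and not buf:
            continue  # skip blank lines between records
        buf.append(line)
        if len(buf) == 4:
            yield (buf[0], buf[1], buf[2], buf[3])
            buf = []


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 encoding is assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_low_quality(records: Iterable[FastqRecord],
                       min_mean_quality: float = 20.0) -> List[FastqRecord]:
    """Keep reads whose mean Phred score reaches the (hypothetical) threshold."""
    return [r for r in records if mean_phred(r[3]) >= min_mean_quality]


# Toy RawData: read1 has high quality ('I' = Phred 40), read2 low ('!' = Phred 0).
raw_lines = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "ACGT", "+", "!!!!",
]
clean_records = filter_low_quality(parse_fastq(raw_lines))
```

In practice this step is normally done with dedicated trimming software; the sketch only makes the filtering criterion concrete.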
In a specific implementation, selecting the significantly different genes according to the alignment results comprises:
Calculating the RPKM value of each gene of each subject according to the alignment results;
Building an RPKM matrix of gene expression levels for the subject group, in which the first column holds the gene name and related information and each column from the second column onward holds the RPKM values of the corresponding genes for one subject;
For each row of the RPKM matrix, performing a t-test between the groups defined by whether the subjects have lung cancer, to obtain a vector of P values for the expression levels of the different genes;
Selecting the significantly different genes according to the magnitudes of the values in the P value vector.
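The row-wise test over the RPKM matrix can be sketched as below. The sketch computes Welch's two-sample t statistic for each gene and ranks genes by its absolute value. Using Welch's variant is an assumption, since the text only says "t-test". Converting the statistic to a P value needs the t-distribution, which the Python standard library lacks, so that conversion is omitted here (a statistics package such as SciPy provides it). The gene names and values are made up for illustration.

```python
from math import copysign, inf, sqrt
from statistics import mean, variance
from typing import Dict, List, Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's two-sample t statistic; each group needs at least 2 values."""
    diff = mean(a) - mean(b)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        # Both groups are constant: no evidence if equal, infinite otherwise.
        return 0.0 if diff == 0.0 else copysign(inf, diff)
    return diff / se


def rank_genes(rpkm_rows: Dict[str, List[float]],
               has_cancer: List[int]) -> List[Tuple[str, float]]:
    """Return (gene, t) pairs sorted by |t|, largest first.

    rpkm_rows maps a gene name to one RPKM value per subject;
    has_cancer holds the matching 0/1 label per subject.
    """
    ranked = []
    for gene, values in rpkm_rows.items():
        cancer = [v for v, y in zip(values, has_cancer) if y == 1]
        control = [v for v, y in zip(values, has_cancer) if y == 0]
        ranked.append((gene, welch_t(cancer, control)))
    return sorted(ranked, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical RPKM matrix: 3 controls followed by 3 lung cancer subjects.
labels = [0, 0, 0, 1, 1, 1]
matrix = {
    "GENE_A": [10.0, 12.0, 11.0, 30.0, 32.0, 31.0],  # clearly different
    "GENE_B": [5.0, 6.0, 5.5, 5.2, 5.8, 5.6],        # essentially unchanged
}
ranked_genes = rank_genes(matrix, labels)
```

Genes at the top of the ranking correspond to the smallest P values, so a cutoff on the P value vector would select the same leading genes.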
In a specific implementation, selecting the significantly different variations according to the alignment results comprises:
Performing variation detection based on SNP calling according to the alignment results, to obtain the variation information of each subject;
Building a variation information matrix for the subject group, in which each column holds the variation information of one subject and each row represents one specific variation; the A, T, G and C base information of a variation may also be converted into numerical scores (for example A to 0, T to 1, and so on);
For each row of the variation information matrix, performing a chi-square test between the groups defined by whether the subjects have lung cancer, to obtain a vector of P values for the different variations;
Selecting the significantly different variations according to the magnitudes of the values in the P value vector.
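As a sketch of the chi-square step, the example below tests one row of the variation information matrix against lung cancer status using a 2x2 contingency table. It assumes each entry of the row is coded 1 if the subject carries the variation and 0 otherwise, which is a simplification of the base-level coding mentioned above. For one degree of freedom, the P value can be obtained from the complementary error function. No continuity correction is applied.

```python
from math import erfc, sqrt
from typing import List, Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square statistic and P value for the table [[a, b], [c, d]].

    With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # an empty row or column carries no information
    chi2 = n * (a * d - b * c) ** 2 / denom
    return chi2, erfc(sqrt(chi2 / 2.0))


def variant_p_value(carries_variant: List[int], has_cancer: List[int]) -> float:
    """P value for one variation row (0/1 per subject) against 0/1 labels."""
    a = sum(1 for v, y in zip(carries_variant, has_cancer) if v == 1 and y == 1)
    b = sum(1 for v, y in zip(carries_variant, has_cancer) if v == 0 and y == 1)
    c = sum(1 for v, y in zip(carries_variant, has_cancer) if v == 1 and y == 0)
    d = sum(1 for v, y in zip(carries_variant, has_cancer) if v == 0 and y == 0)
    return chi_square_2x2(a, b, c, d)[1]


# Hypothetical group: 10 lung cancer subjects followed by 10 controls.
labels = [1] * 10 + [0] * 10
row = [1] * 8 + [0] * 2 + [1] * 1 + [0] * 9  # 8 of 10 cases, 1 of 10 controls
p_value = variant_p_value(row, labels)
```

Applying `variant_p_value` to every row yields the P value vector from which the significantly different variations are selected.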
Here, RPKM (Reads Per Kilobase per Million): the RPKM of a gene reflects the relative expression level of that gene; the higher the value, the higher the expression level. RPKM is the number of reads mapped to the gene divided by the total number of reads mapped to the genome (in millions) and by the RNA length (in kilobases). The formula is:
$$\mathrm{RPKM} = \frac{\text{total exon reads}}{\text{mapped reads (millions)} \times \text{exon length (KB)}}$$
where total exon reads / mapped reads (millions) expresses the share of all mapped reads that map to the corresponding gene, and exon length (KB) is the RNA length in kilobases.
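The RPKM definition above can be checked with a few lines of code. The function and argument names are illustrative only, and the numbers in the example are made up.

```python
def rpkm(exon_reads: int, total_mapped_reads: int, exon_length_bp: int) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    if total_mapped_reads <= 0 or exon_length_bp <= 0:
        raise ValueError("mapped read count and length must be positive")
    mapped_millions = total_mapped_reads / 1e6   # reads, in millions
    length_kb = exon_length_bp / 1e3             # length, in kilobases
    return exon_reads / (mapped_millions * length_kb)


# 500 reads on a 2,000-base gene in a library of 20 million mapped reads:
# 500 / (20 * 2) = 12.5
example_rpkm = rpkm(500, 20_000_000, 2_000)
```

Doubling the sequencing depth doubles both the gene's read count and the library size, so the value stays the same; this is what makes RPKM comparable across samples.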
In a specific implementation, performing variation detection based on SNP calling according to the alignment results to obtain the variation information of the subject comprises:
Performing variation detection based on SNP calling according to the alignment results, to obtain variation information in VCF format that records gene mutations, the VCF-format variation information comprising annotation information beginning with # and mutation information consisting of multiple columns of variation information.
In this embodiment, SNP calling refers to the process of detecting single-nucleotide variations (SNPs) at different sites on the chromosomes. As an example of a variation: if the reference base at position 10255 on chromosome 1 is A but the base actually measured is G, this is a variation. The VCF format is a text format used in bioinformatics specifically to record gene mutations and represent mutation information; it can represent single nucleotide polymorphisms (SNP), insertions/deletions (indel), copy number variants and structural variants (CNV), and so on. A VCF file has two main parts. One part, whose lines begin with #, is annotation information describing the source of the file, the time it was generated, and so on. The other part is the mutation information, which is the main body of the VCF file; each line gives the specific information of one variation and usually has dozens of columns. Software available for the SNP calling process includes GATK, Altalas, Samtools, Freebayes and others.
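The two-part layout of a VCF file described above can be illustrated with a small reader. This is a minimal sketch that relies only on the standard fixed columns of the format (CHROM, POS, ID, REF, ALT, and so on, separated by tabs); it ignores the INFO and per-sample columns, and the example content is invented to match the chromosome 1, position 10255 example in the text.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def parse_vcf(lines: Iterable[str]) -> Tuple[List[str], Optional[List[str]], List[Dict]]:
    """Split VCF text into annotation lines, the column header and records."""
    annotations: List[str] = []
    header: Optional[List[str]] = None
    records: List[Dict] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            annotations.append(line)                # file-level annotation
        elif line.startswith("#"):
            header = line.lstrip("#").split("\t")   # column names
        else:
            fields = line.split("\t")
            records.append({
                "chrom": fields[0],
                "pos": int(fields[1]),
                "ref": fields[3],
                "alt": fields[4].split(","),        # several ALT alleles possible
            })
    return annotations, header, records


# Invented example: reference base A at chr1:10255, observed base G.
vcf_text = [
    "##fileformat=VCFv4.2",
    "##source=hypothetical_caller",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t10255\t.\tA\tG\t50\tPASS\t.",
]
vcf_annotations, vcf_header, vcf_records = parse_vcf(vcf_text)
```

Each parsed record corresponds to one row of the variation information matrix, with one such file (or one sample column) per subject.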
In a specific implementation, after aligning the sequencing results to the reference genome and selecting the subject's significantly different genes and significantly different variations according to the alignment results, and before taking the vector composed of the subject's significantly different genes and significantly different variations together with the (0,1) value indicating whether the subject has lung cancer as sample data, the method further comprises:
Performing principal component analysis on the significantly different genes and significantly different variations selected according to the alignment results; selecting from them, respectively, a set of principal components whose members are strongly uncorrelated and which can represent the sequencing results; and taking the genes and variations corresponding to that set of principal components as the selected significantly different genes and significantly different variations.
In this embodiment, the support vector machine model with a radial basis kernel function maps the sample data into a high-dimensional feature space by a nonlinear method and finds the optimal hyperplane in that space. The decision function of the support vector machine model is
$$f(x) = \operatorname{sgn}\left(\sum_{i} \alpha_i^{*}\, y_i\, k(x, x_i) + b^{*}\right)$$
where $\alpha_i^{*}$ is the component of $\alpha^{*}$ corresponding to $x_i$, $y_i$ is the class label of $x_i$, and $b^{*}$ is the value at which the positive sample nearest to the hyperplane and the negative sample nearest to the hyperplane are equidistant from it. Because a high-dimensional mapping is used, this embodiment adopts the radial basis function (RBF) as the kernel function of the support vector machine model. The radial basis kernel function is defined as $k(x, x_i) = \exp(-\lVert x - x_i \rVert^2 / 2\sigma^2)$. Using the radial basis kernel function has the following advantages: 1. the mapped space is infinite-dimensional, so all sample data can be mapped so as to be separable; 2. computation is fast, and dedicated libraries are available for it; 3. the normal distribution matches the actual distribution of sequencing data. The model has two key parameters, namely the covariance σ and the penalty factor. The covariance σ determines the complexity of the feature subspace, and the penalty factor adjusts the balance between the confidence interval and the empirical risk of the support vector machine model, so selecting suitable values for these parameters is very important. In a specific implementation, the initial values of the covariance and of the penalty factor range from 0.1 to 5, the adjustment interval ranges from 0.1 to 0.5, and the optimal values of the covariance and the penalty factor are obtained by iterative adjustment.
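The radial basis kernel and the form of the decision function can be made concrete with a short sketch. The support vectors, coefficients and bias below are hypothetical numbers chosen for illustration, not trained values; in a real model they come out of training. The labels are written as -1/+1, as is conventional for the decision function, so the (0,1) values of the sample data would need to be mapped accordingly.

```python
from math import exp
from typing import List, Sequence


def rbf_kernel(x: Sequence[float], xi: Sequence[float], sigma: float) -> float:
    """k(x, xi) = exp(-||x - xi||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, xi))
    return exp(-squared_distance / (2.0 * sigma ** 2))


def decide(x: Sequence[float],
           support_vectors: List[Sequence[float]],
           labels: List[int],
           alphas: List[float],
           bias: float,
           sigma: float) -> int:
    """Sign of sum_i alpha_i * y_i * k(x, x_i) + b, returned as +1 or -1."""
    score = sum(a * y * rbf_kernel(x, sv, sigma)
                for a, y, sv in zip(alphas, labels, support_vectors)) + bias
    return 1 if score >= 0 else -1


# Hypothetical model with one support vector per class.
svs = [[0.0, 0.0], [2.0, 2.0]]
ys = [-1, 1]
coeffs = [1.0, 1.0]
near_positive = decide([1.9, 2.1], svs, ys, coeffs, bias=0.0, sigma=1.0)
near_negative = decide([0.1, 0.0], svs, ys, coeffs, bias=0.0, sigma=1.0)
```

A smaller σ makes the kernel fall off faster with distance, so each support vector influences a narrower neighbourhood; this is the sense in which σ controls the complexity of the decision boundary.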
In a specific implementation, the prediction set accounts for 10% to 50% of the sample data.
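A random division that respects this 10% to 50% range might look like the sketch below. The function name, the default fraction of 0.3 and the optional seed are illustrative assumptions.

```python
import random
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_samples(samples: Sequence[T],
                  prediction_fraction: float = 0.3,
                  seed: Optional[int] = None) -> Tuple[List[T], List[T]]:
    """Randomly divide samples into (training set, prediction set)."""
    if not 0.1 <= prediction_fraction <= 0.5:
        raise ValueError("prediction set must be 10% to 50% of the sample data")
    rng = random.Random(seed)          # a seed makes the split reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_prediction = max(1, round(len(shuffled) * prediction_fraction))
    return shuffled[n_prediction:], shuffled[:n_prediction]


# 20 hypothetical subjects, identified here only by an index.
training_set, prediction_set = split_samples(range(20), 0.25, seed=1)
```

Calling the function repeatedly with different seeds gives the repeated random divisions used across training rounds.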
In a specific implementation, the method further comprises:
Increasing the number of subjects and updating the sample data;
Randomly dividing the updated sample data into a training set and a prediction set, training the support vector machine model with the radial basis kernel function multiple times on the training set and the prediction set, and adjusting the covariance and the penalty factor respectively according to whether the results on the prediction set are correct.
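One possible reading of the iterative adjustment is a grid search: step the covariance and the penalty factor through their ranges at the chosen adjustment interval and keep the pair that predicts the prediction set best. The sketch below takes that reading as an assumption. `evaluate` is a hypothetical callback that would train the support vector machine with the given parameters and return its accuracy on the prediction set; here it is replaced by a toy function so the example is self-contained.

```python
from typing import Callable, List, Optional, Tuple


def value_grid(start: float, stop: float, step: float) -> List[float]:
    """Values from start to stop (inclusive) at the given adjustment interval."""
    count = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(count + 1)]


def adjust_parameters(evaluate: Callable[[float, float], float],
                      sigma_values: List[float],
                      penalty_values: List[float]) -> Tuple[float, float, float]:
    """Return (sigma, penalty, accuracy) with the highest prediction accuracy."""
    best: Optional[Tuple[float, float, float]] = None
    for sigma in sigma_values:
        for penalty in penalty_values:
            accuracy = evaluate(sigma, penalty)
            if best is None or accuracy > best[2]:
                best = (sigma, penalty, accuracy)
    if best is None:
        raise ValueError("parameter grids must not be empty")
    return best


def toy_accuracy(sigma: float, penalty: float) -> float:
    """Stand-in for 'train, then score on the prediction set' (made up)."""
    return 1.0 - 0.1 * abs(sigma - 1.5) - 0.1 * abs(penalty - 2.0)


grid = value_grid(0.5, 3.0, 0.5)   # start and step lie within the stated ranges
best_sigma, best_penalty, best_accuracy = adjust_parameters(toy_accuracy, grid, grid)
```

When the sample data are updated with more subjects, the same search can simply be rerun on the new random division.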
The terms and concepts involved in this embodiment are explained as follows:
Transcriptome: the set of all transcription products in a cell at a given time point and under a given physiological condition, including messenger RNA (mRNA; RNA is ribonucleic acid), ribosomal RNA (rRNA), transfer RNA and non-coding RNA. mRNA is the object used in this embodiment.
RNA-Seq: sequencing of transcriptome RNA based on second-generation sequencing technology. In practice, because mRNA is unstable and easily degraded, it is generally reverse-transcribed into stable cDNA for sequencing (although the newer SMS methods may sequence RNA directly). Compared with gene chips, RNA-Seq does not require probes synthesized against known sequences and can discover rare mutations and the like.
SNP: single nucleotide polymorphism, which mainly refers to DNA sequence polymorphism caused by variation of a single nucleotide at the genome level. It is the most common type of heritable human variation, accounting for more than 90% of all known polymorphisms. SNPs are widespread in the human genome, occurring on average once in every 500 to 1000 base pairs, with an estimated total of 3,000,000 or more.
Gene expression profiling: a method in molecular biology for measuring gene expression in cells (including whether a specific gene is expressed, its expression abundance, and differential expression across tissues, developmental stages and physiological states) using cDNA, expressed sequence tags (EST) or oligonucleotide chips. Put simply, it describes the differences in expression level between genes.
RPKM (Reads Per Kilobase per Million) and FPKM (Fragments Per Kilobase per Million): the principles of RPKM and FPKM are similar; the difference is that FPKM counts DNA fragments. For example, in Illumina paired-end RNA-seq, a pair of (two) reads corresponds to one DNA fragment. With RPKM (FPKM), one can compare the relative expression of two genes in the same sample, or of the same gene in different samples. Dividing the read count of each RNA by its length (in units of 1000 bases) makes it possible to compare the relative expression of different genes within the same sample. Similarly, "per million reads" is introduced because different samples may be sequenced to different depths, and the deeper the sequencing, the more reads are obtained. Dividing the result by the size of the respective library therefore gives a good measure of the relative expression of the same gene in two different samples.
Support vector machine (SVM): a relatively new trainable modeling method in machine learning. SVM is built on the VC dimension theory and the structural risk minimization principle of statistical learning theory. Based on limited sample information, it seeks the best compromise between the complexity of the model (that is, the learning accuracy on specific training samples) and its learning ability (that is, the ability to identify arbitrary samples without error), in order to obtain the best generalization ability. By learning from many samples, the best kernel parameters and function are obtained. The main ideas of SVM can be summarized in two points. First, it applies to linearly separable samples and also to linearly inseparable samples: for the linearly inseparable case, a nonlinear mapping algorithm converts samples that are linearly inseparable in the low-dimensional input space into a high-dimensional feature space in which they become linearly separable, which makes it possible to analyze the nonlinear characteristics of the samples with a linear algorithm in the high-dimensional feature space. Second, based on structural risk minimization theory, it constructs the optimal separating hyperplane in the feature space, so that the learner is globally optimized and the expected risk over the whole sample space satisfies a certain upper bound with a certain probability. Its greatest advantage is that it can handle data with thousands of dimensions, such as the changes in expression level of thousands of genes and the mutations involved in this embodiment. Finally, a variable set containing the most important "feature values" can be obtained, including the genomic variations and gene mutations carried.
In the description of the present invention, it should be noted that orientation or positional relationships indicated by terms such as "upper" and "lower" are based on those shown in the drawings. They are used only to facilitate and simplify the description of the present invention, and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they should therefore not be understood as limiting the present invention. Unless otherwise expressly specified and limited, the terms "mounted", "connected" and "connection" are to be understood broadly: for example, a connection may be fixed, detachable or integral; it may be mechanical or electrical; it may be direct, indirect through an intermediary, or an internal communication between two elements. Those of ordinary skill in the art can understand the specific meanings of the above terms in the present invention according to the specific circumstances.
It should also be noted that, in the description of the present invention, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, method, article or device. In the absence of further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that comprises the element.
Obviously, the above embodiments of the present invention are merely examples given to illustrate the invention clearly, and do not limit the ways in which the invention may be implemented. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is not possible to list all embodiments exhaustively here; any obvious change or variation derived from the technical solution of the present invention remains within the scope of protection of the present invention.