The method that storehouse order-checking is identified is built with the mononucleotide polymorphism site of Bayesian statistic based on enzyme actionTechnical field
The present invention relates to and a kind of build the method that storehouse order-checking is identified with the mononucleotide polymorphism site of Bayesian statistic based on enzyme action.It is specially and builds storehouse list end (single-end) order-checking based on enzyme action or single nucleotide polymorphism (SNP) site that double; two end (pair-end) order-checking obtains carries out a kind of special Bayesian statistic inspection, thus the genotypic method of accurate identification SNP;When lacking with reference to genome sequence, reliable statistics meaning can be provided for SNP inspection.The method belongs to bioinformatics technique field.This has great importance for the research of non-mode biology and the study on accuracy of genotype identification lacking reference sequences.
Background technology
SNP (SingleNucleotidePolymorphisms) single nucleotide polymorphism, refers to the variation of single core thuja acid on genome, and its quantity is a lot, and in human genome, probably every 1000 bases just have a SNP, rich polymorphism.SNP is by the ideal mark of all kinds of molecular biology research, as built genetic map, and gene type, molecular mark, disease forecasting, the field such as medication guide.
Nowadays, second filial generation DNA sequencing technology is the sequencing technologies of a kind of high flux low cost, and ultimate principle is the order-checking of synthesis limit, limit.For solexa sequence measurement, first with physical method, DNA is interrupted at random, then at fragment two ends plus given joint, joint has amplimer sequence.During order-checking, archaeal dna polymerase synthesizes the complementary strand of fragment to be measured, reads base sequence by the fluorescence signal detected entrained by newly synthesized base, thus obtaining the sequence of fragment to be measured.
Second filial generation sequencing technologies has been widely used for many fields of bioscience, particularly the polymorphism between one species Different Individual of research.It is checked order by individuality that tradition finds the method for SNP marker, obtains short reads, then passes through short sequence alignment program and these short reads comparisons are returned reference sequences, thus obtaining the SNP information that order-checking is individual.Common flow process has (general procedure is as shown in Figure 1): use BWA software that reads comparison is returned reference sequences, uses SAMtools software processes comparison result to find SNP site [1,2];Use SOAP software that reads comparison is returned reference sequences, use SOAPsnp software processes comparison result to find SNP site [3,4].For there being the species of reference sequences can carry out the lookup of SNP marker very easily, but for those non-mode biologies, substantially it is absent from reference sequences.And when without reference to sequence, tradition is found the method for SNP marker and be there is technical bottleneck.
RAD sequencing technologies have employed new storehouse mode of building (enzyme action builds storehouse), its order-checking detailed process is as shown in Figure 2, the specific site of DNA is cut off with restricted enzyme, with physical method, the DNA molecular after enzyme action is interrupted at random again, the DNA molecular of length-specific is selected by agarose gel DNA isolation technics, then specific amplification joint and sequence measuring joints are added at select DNA end, thus building upper machine library to carry out high-flux sequence.
Wherein RAD sequence measurement is method well known in the art, for instance be referred to document [5,6,7,8].SNP site obtains successfully in a lot of fields to utilize RAD sequencing technologies to identify, but till the present invention occurs, method used is typically all and utilizes empirical value to carry out screening and filter.Such as, by two kinds of base depth scales in the 0.25-0.75(low depth base degree of depth in document [6]: the high depth base degree of depth) between site judge into heterozygous sites, ratio judgement below 0.1 becomes site of isozygotying.This method is not statistically significant, and the impact being simultaneously subject to other extraneous factors is relatively larger, and such as order-checking total amount, the SNP genotype accuracy that qualification obtains cannot ensure.Document [9] improves authentication method on the basis of empirical value method, uses method of maximum likelihood to carry out the correction of loci gene type, but its greatest problem is in that to determine the error rate in statistical method.
Document:
1.Li,H.andR.Durbin,FastandaccurateshortreadalignmentwithBurrows-Wheelertransform.Bioinformatics,2009.25(14):p.1754-60.
2.Li,H.,etal.,TheSequenceAlignment/MapformatandSAMtools.Bioinformatics,2009.25(16):p.2078-9.
3.Li,R.,etal.,SNPdetectionformassivelyparallelwhole-genomeresequencing.GenomeRes,2009.19(6):p.1124-32.
4.Li,R.,etal.,SOAP:shortoligonucleotidealignmentprogram.Bioinformatics,2008.24(5):p.713-4.
5.Houston,R.D.,etal.,CharacterisationofQTL-linkedandgenome-widerestrictionsite-associatedDNA(RAD)markersinfarmedAtlanticsalmon.BMCGenomics,2012.13(1):p.244.
6.Scaglione,D.,etal.,RADtagsequencingasasourceofSNPmarkersinCynaracardunculusL.BMCGenomics,2012.13(1):p.3.
7.Davey,J.W.,etal.,SpecialfeaturesofRADSequencingdata:implicationsforgenotyping.MolEcol,2012.
8.Dasmahapatra,K.K.,etal.,Butterflygenomerevealspromiscuousexchangeofmimicryadaptationsamongspecies.Nature,2012.
9.Hohenlohe,P.A.,etal.,PopulationgenomicsofparalleladaptationinthreespinesticklebackusingsequencedRADtags.PLoSGenet.6(2):p.e1000862.
Summary of the invention
It is an object of the invention to provide and a kind of build the method that storehouse order-checking is identified with the mononucleotide polymorphism site of Bayesian statistic based on enzyme action;It is a kind of to build, based on enzyme action, the sequencing data that storehouse order-checking (RAD sequencing technologies) obtains by processing, searching mononucleotide polymorphism site between individual internal or individuality, and gives the technical scheme of statistical test.
The purpose of the present invention is achieved through the following technical solutions:
A kind of building the method that storehouse order-checking is identified with the mononucleotide polymorphism site of Bayesian statistic based on enzyme action, its step is as follows:
1) after the sequencing result obtaining RAD high throughput sequencing technologies, it is filtered removing underproof sequencing sequence to RAD enzyme action end sequencing sequence.
Wherein, RAD high throughput sequencing technologies can be IlluminaGA sequencing technologies, it is also possible to for other high throughput sequencing technologies existing.
Described underproof sequencing sequence is the sequence that sequencing quality exceedes the 50% of whole piece series number lower than the base number of predetermined low quality threshold value, and does not have the sequence of enzyme action feature.
2) sequencing sequence according to order-checking genes of individuals group enzyme action one end, utilizes the full same sex of sequence to generate the information of each individual heap, and calculates each sequence heap order-checking depth information.Such as, the sequencing sequence information of the enzyme action one end after being filtered by each individuality is as the key of Hash, and the value of Hash points to a chained list, for depositing the sequence information of the other end, and calculates order-checking depth information.Available any programming language realizes this process.
3) comparison between two will be carried out by internal all sequences heap one by one, and cluster heap to determine intraindividual candidate's heterozygosis SNP site.
For 3) cluster result, the cluster result of only one of which heap shows to check order in fragment in enzyme action one end to be absent from heterozygous sites, and only the cluster results of two heaps show there is heterozygous sites on fragment is checked order in enzyme action one end.
4) all sequences heap in Different Individual is carried out comparison between two, cluster heap to determine interindividual candidate SNP locus.
For 4) cluster result, only one of which heap cluster result show that two individualities are absent from SNP site, have two cluster results piled to show to exist between individuality SNP site.
Due to follow-up use Bayes statistical method, so each heap or each cluster result are not carried out depth-type filtration herein, it is one of the advantage of the method, retains more SNP site as far as possible.
5) utilize Bayes statistical method that genotypic depth informations various in each candidate SNP locus are analyzed, identify the accuracy of candidate SNP.Owing to lacking the prior probability that reference sequences occurs with various bases on each site, so the actually measured base error rate in each site cannot be obtained.Therefore, this Bayes statistical method uses the exhaustive all error rates being likely to occur of walking method, then selects so that genotype exists the probability the highest situation genotype as this SNP site.Concrete formula is as follows with calculation procedure:
For each candidate SNP locus, can there is base possible in 4, i.e. any in " ATCG " or multiple, the base type of the definition frequency of occurrences the highest (degree of depth is maximum) is G1, and the corresponding degree of depth is N1, all the other bases of definition of successively decreasing successively, i.e. genotype Gi(i=1,2,3,4), degree of depth Ni(i=1,2,3,4).Biologically, general species only there will be two kinds of base types on a SNP site, for instance sequencing data shows that A or T(N occurs in this SNP siteA≥NT), then this site must be isozygoty or two kinds of genotype of heterozygosis.Therefore this bayes method only detects both the above genotype and at the condition lower probability that error rate is ε is:
Wherein N=N1+N2+N3+N4,。
Various genotypic posterior probability are:
Because not having sequence and early-stage Study data, error rate ε cannot determine value, but document [10,11] report IlluminaGA checks order error rate about 1%.Herein, ε is set from 0.01%-5%, step pitch 0.01%.The posterior probability of final utilization is:
P(Nij)Final=max (P (Nij,ε)) i, j ∈ 1,2,3,4}, ε ∈ [0.01%, 5%].
If P is (Nij)FinalIt is not less than 0.95, it was shown that the genotype of this SNP site is ij, is otherwise defined as the data (missing data) that cannot judge.
Technical scheme have employed bioinformatic analysis method, process RAD(restriction-siteassociatedDNA) sequencing data, find the SNP site information in RAD order-checking fragment, bayes method is utilized to identify SNP genotype, lack the bottleneck of reference sequences breaking through non-mode biology, while reducing cost, obtain result accurately.Being firstly introduced Bayes statistical method when identifying SNP site genotype, compared with the method for empirical value before, statistical significance significantly improves, and accuracy also promotes accordingly.
Document:
10.Li,Y.,etal.,Stateoftheartdenovoassemblyofhumangenomesfrommassivelyparallelsequencingdata.HumGenomics,2010.4(4):p.271-7.
11.Xie,W.,etal.,Parent-independentgenotypingforconstructinganultrahigh-densitylinkagemapbasedonpopulationsequencing.ProcNatlAcadSciUSA.107(23):p.10578-83.
Accompanying drawing explanation
Fig. 1 is the Principle of Process figure that tradition finds SNP site method;
Fig. 2 is the order-checking detailed process schematic diagram of RAD sequencing technologies;In figure, (A) digestion with restriction enzyme genomic DNA, and plus P1 joint, each P1 joint contains different sequence labels;(B) with the sample mix of different P1 joints together, interrupt;(C) top connection P2 is added;(D) amplification enrichment RADtags;
Fig. 3 is the example figure of RAD order-checking;
The cluster process figure that Fig. 4 makes a living in heaps, uses EcorI restricted enzyme in legend;
Fig. 5 is enzyme action one end sequence information schematic diagram in heap;
Fig. 6 for internal in heap and individual between SNP site find schematic flow sheet;
Fig. 7 is the example figure of candidate SNP base type and depth information, figure has 20 candidate SNP locus the base type in 15 individualities and depth information respectively, " C | 9 " represents that this site C measures 9 times, " C | 9:T | 3 " represent this site C measure 9 times, T measure 3 times.
Fig. 8 is the example figure of the genotype results of SNP site after Bayesian statistic, in figure, " a " represents homozygous genotype two kinds different respectively with " b ", " h " represents heterozygous genotypes, such as " a " represents AA, and " b " represents CC, and " h " represents AC, in " x1:x2:x3:x4 ", x1 represents the posterior probability isozygotied, x2 represents error rate values when posterior probability of isozygotying is maximum, and x3 represents the posterior probability of heterozygosis, and x4 represents error rate values when heterozygosis posterior probability is maximum.
Fig. 9 utilizes the deletion condition of data after empirical value method and Bayes statistical method.
Figure 10 is the statistic result of empirical value method and Bayes statistical method different loci.
Figure 11 is the result that the site that random choose empirical value method is different from Bayes statistical method carries out utilizing sanger sequence verification.
Detailed description of the invention
The technical characterstic of the present invention is expanded on further below in conjunction with accompanying drawing and specific embodiment.
As in figure 2 it is shown, RAD order-checking is checked order the difference is that needing to utilize restricted enzyme complete degestion genome before adding joint with routine high-throughput, then plus the special joint of RAD, after continue storehouse process to build storehouse with routine identical.Fig. 3 is the example figure of RAD enzyme action end sequencing.Show in figure 3 and use restricted enzyme Ecor1, identify the palindrome of " G^AATTC " on DNA molecular, and between G and A, DNA molecular is cut off, DNA molecular physical method after enzyme action is broken into short sequence fragment, and add top connection at the DNA fragmentation two ends containing enzyme action end sequence, PCR is enriched with and carries out single end (single-end) order-checking, and sequencing reading length is generally 100nt, it is also possible to for 50nt.
Building, based on enzyme action, the method that storehouse order-checking is identified with the mononucleotide polymorphism site of Bayesian statistic, its step is as follows:
1) after the sequencing result obtaining RAD high throughput sequencing technologies, it is filtered removing underproof sequencing sequence to the double; two end sequencing sequence of RAD.
Wherein, RAD high throughput sequencing technologies can be IlluminaGA sequencing technologies, it is also possible to for other high throughput sequencing technologies existing.
Described underproof sequencing sequence is the sequence that sequencing quality exceedes the 50% of whole piece series number lower than the base number of predetermined low quality threshold value, and does not have the sequence of enzyme action feature.
Low quality threshold value is determined by concrete sequencing technologies and order-checking environment, for instance be set as that single base sequencing quality is lower than 20;In sequencing sequence, the uncertain base of sequencing result (N in IlluminaGA sequencing result) number exceedes 10% of whole piece sequencing sequence base number and thinks defective sequence;Except sample joint sequence, test, with other, the exogenous array comparison introduced, such as various terminal sequence.If sequence exists exogenous array, think defective sequence;In the sequencing sequence of enzyme action one end, if initial several bases are not enzyme action end sequences, filter out (such as restricted enzyme Ecor1, sequencing sequence starts if not " AATTC " then filters out whole sequencing sequence).
2) sequencing sequence according to order-checking genes of individuals group enzyme action one end, utilizes the full same sex of sequence to generate the information of each individual heap.Detailed process is as shown in Figure 4.In heap (Stack), the sequence information of enzyme action one end can preserve in the way of Fig. 5, and in Figure 5, what first row represented is the sequence information of enzyme action one end;What secondary series represented is number of times that enzyme action one terminal sequence is sequenced, namely check order depth information;3rd row are the ID of this heap, for uniquely determining heap.
3) individual internal heap compares, if the situation in Fig. 6 does not occur in cluster, it was shown that do not have heterozygosis SNP above individual these sequences internal;If depositing situation in figure 6, it was shown that individual inside exists heterozygosis SNP.
4) when between individuality, heap compares between two, if there is Fig. 6 (a) situation, it was shown that there is the SNP that isozygotys between two individualities;If there is Fig. 6 (b) situation, it was shown that there is heterozygosis SNP between two individualities.
5) base type and the respective depth information of each candidate SNP locus are obtained after piling cluster and compare end, as Fig. 7 represents that above-mentioned information collects.
6) utilize Bayes statistical method that base type and the degree of depth are analyzed, such as Fig. 8, obtain final SNP genotype results.
Embodiment data:
These 141 individualities, including 139 F2 plant and two parents, are carried out RAD-PE order-checking by calabash courd F2 informative population genetic map project.(illustrate: the offspring that male parent and hybridization of female parent generate makes the offspring that F1, F1 selfing generates be F2;Although using RAD-PE order-checking, but analyze and only use enzyme action terminal sequence)
Material source: Zhejiang Academy of Agricultural Science.
Embodiment concrete operations flow process:
Check order the sequencing data obtained by two parent RAD-PE, according to sequencing quality value, the content of N, and whether contain enzyme action end sequence and is filtered, removes underproof sequencing sequence, and the valid data statistics obtained is as shown in table 1.
Table 1: calabash courd RAD order-checking valid data statistics
| Title | Use data volume (bp) | Title | Use data volume (bp) | Title | Use data volume (bp) |
| Male parent | 585,377,540 | F2-46 | 3,800,775 | F2-93 | 1,556,104 5 --> |
| Maternal | 423,794,746 | F2-47 | 2,522,407 | F2-94 | 1,651,259 |
| F2-1 | 3,114,771 | F2-48 | 4,636,152 | F2-95 | 3,213,147 |
| F2-2 | 2,302,730 | F2-49 | 3,737,623 | F2-96 | 2,202,354 |
| F2-3 | 537,822 | F2-50 | 647,499 | F2-97 | 1,956,440 |
| F2-4 | 1,650,925 | F2-51 | 3,678,334 | F2-98 | 1,112,431 |
| F2-5 | 2,824,708 | F2-52 | 2,153,996 | F2-99 | 1,086,168 |
| F2-6 | 579,177 | F2-53 | 7,029,889 | F2-100 | 1,705,836 |
| F2-7 | 2,805,093 | F2-54 | 2,315,687 | F2-101 | 2,311,919 |
| F2-8 | 2,234,442 | F2-55 | 4,116,520 | F2-102 | 1,445,671 |
| F2-9 | 1,814,510 | F2-56 | 1,554,335 | F2-103 | 5,292,536 |
| F2-10 | 2,063,581 | F2-57 | 1,912,949 | F2-104 | 371,736 |
| F2-11 | 437,393 | F2-58 | 4,513,981 | F2-105 | 5,528,190 |
| F2-12 | 1,114,627 | F2-59 | 4,660,158 | F2-106 | 2,286,908 |
| F2-13 | 292,168 | F2-60 | 2,963,600 | F2-107 | 2,977,378 |
| F2-14 | L981.379 | F2-61 | 1.912.346 | F2-108 | 1.113.885 |
| F2-15 | L710.808 | F2-62 | 2.198.112 | F2-109 | 2.358.577 |
| F2-16 | L666.317 | F2-63 | 2.149.228 | F2-110 | 2.014.988 |
| F2-17 | 3.837.185 | F2-64 | 2.907.400 | F2-111 | 5.021.837 |
| F2-18 | 2.705.794 | F2-65 | 2565399 | F2-112 | L687.183 |
| F2-19 | L641.718 | F2-66 | 1.802.757 | F2-113 | 1.454.774 |
| F2-20 | 4.181.837 | F2-67 | 6.136.789 | F2-114 | L187.993 |
| F2-21 | 2.167.926 | F2-68 | 5.106.060 | F2-115 | 917.204 |
| F2-22 | 8.967 | F2-69 | 5.492.357 | F2-116 | 673.176 |
| F2-23 | L936.761 | F2-70 | 4.925.717 | F2-117 | 903.357 |
| F2-24 | 4.907.028 | F2-71 | 2.016.103 | F2-118 | 1.252.469 |
| F2-25 | 2.641.269 | F2-72 | 4.495.767 | F2-119 | L066.660 |
| F2-26 | L344.809 | F2-73 | 957.643 | F2-120 | 83.426 |
| F2-27 | 2.184.764 | F2-74 | 3.193.347 | F2-121 | 624.005 |
| F2-28 | 2.312.351 | F2-75 | 30.335 | F2-122 | 4.246.910 |
| F2-29 | L318.322 | F2-76 | 2.067.906 | F2-123 | 824.013 |
| F2-30 | L830.247 | F2-77 | 200.856 | F2-124 | 3.322.863 |
| F2-31 | 358.911 | F2-78 | 6.978.303 | F2-125 | 89.336 |
| F2-32 | L450.039 | F2-79 | 5.309.200 | F2-126 | 367.005 |
| F2-33 | L767.194 | F2-80 | 3.081.537 | F2-127 | 1.707.758 |
| F2-34 | L587.589 | F2-81 | 3.071.09l | F2-128 | 2.385.919 |
| F2-35 | L008.970 | F2-82 | 1.223.914 | F2-129 | 2.786.068 |
| F2-36 | 2.877.974 | F2-83 | 5.586.662 | F2-130 | 890.661 |
| F2-37 | L268211 | F2-84 | 1.880.660 | F2-131 | 1.980.472 6--> |
| F2-38 | The skilful .973 of 6.O | F2-85 | 2.620.672 | F2-132 | 3.920.370 |
| F2-39 | 3.219262 | F2-86 | 5.992.662 | F2-133 | 500.349 |
| F2-40 | 2.241.103 | F2-87 | 5.636.602 | F2-134 | 2.150.765 |
| F2-41 | 3.610.730 | F2-88 | 544.490 | F2-135 | L115.366 |
| F2-42 | L641.063 | F2-89 | 4.897.907 | F2-136 | 1.306.934 |
| F2-43 | 2.382.923 | F2-90 | 2.355.593 | F2-137 | 616.728 |
| F2-44 | 2.598.428 | F2-91 | 2.506.27l | F2-138 | 949.308 |
| F2-45 | 692.675 | F2-92 | 841.285 | F2-139 | 1.014.677 |
Sequencing sequence according to order-checking genes of individuals group enzyme action one end, utilizes the full same sex of sequence to generate the information of two parent's heaps respectively.Obtain 3913 candidate SNP labellings of F2 colony, base type and similar Fig. 7 of corresponding depth profile (data volume is too big, it is impossible to show).It is utilized respectively current conventional empirical value method (degree of depth is not less than 6, heterozygosis criterion 0.25-0.75) with Bayes statistical method, candidate SNP labelling to be identified.Result after identifying using two kinds of methods respectively carries out the statistics of missing data, such as Fig. 9, it was shown that both there are differences, but not notable.But each loci gene type being analyzed, it has been found that have 11l, 005 site is accredited as in bayes method isozygotys, but is accredited as heterozygosis in empirical value;Discovery has 79,401 sites to be accredited as heterozygosis in bayes method, but is accredited as in empirical value and isozygotys, and claiming both the above site is uncertain site.100 uncertain sites of random choose carry out sanger method order-checking, it has been found that what in uncertain site, bayes method was correct occupies 77%, are significantly higher than empirical value method (3 examples selected in Figure 11).
This result shows that utilizing RAD sequencing data to carry out bayes method when SNP identifies be a kind of statistical means reliably.