HK1239899B


Info

Publication number: HK1239899B
Application number: HK17113305.4A
Authority: HK
Inventors: 卢煜明; 赵慧君; 陈君赐
Original assignee: 香港中文大学
Priority date: 2007-07-23
Filing date: 2017-12-13
Publication date: 2021-02-05

Description

Priority Declaration

This application claims priority to, and is a non-provisional application of, U.S. Provisional Application No. 60/951,438 (Attorney Docket No. 016285-005200US), filed on July 23, 2007, entitled "DETERMINING A NUCLEIC ACID SEQUENCE IMBALANCE," the entire contents of which are hereby incorporated by reference for all purposes.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is also related to a concurrently filed non-provisional application entitled "DETERMINING A NUCLEIC ACID SEQUENCE IMBALANCE" (Attorney Docket No. 016285-005210US), the entire contents of which are hereby incorporated by reference for all purposes.

Field of the Invention

The present invention generally relates to the diagnostic testing of fetal chromosomal aneuploidy by determining imbalances between different nucleic acid sequences, and more particularly to the identification of trisomy 21 (Down syndrome) and other chromosomal aneuploidies by testing a maternal sample (e.g., blood).

Background of the Invention

Fetal chromosomal aneuploidy results from the presence of an abnormal dose of a chromosome or chromosomal region. The abnormal dose can be abnormally high, such as the presence of an extra chromosome 21 or chromosomal region in trisomy 21, or abnormally low, such as the absence of a copy of the X chromosome in Turner syndrome.

Conventional prenatal diagnostic methods for fetal chromosomal aneuploidy, such as trisomy 21, involve sampling fetal material by invasive procedures such as amniocentesis or chorionic villus sampling, which pose a finite risk of fetal loss. Non-invasive procedures, such as screening by ultrasonography or biochemical markers, have been used to risk-stratify pregnant women before definitive invasive diagnostic procedures. However, these screening methods typically measure epiphenomena associated with the chromosomal aneuploidy, such as trisomy 21, rather than the core chromosomal abnormality. Their diagnostic accuracy is therefore suboptimal, and they have other disadvantages, such as being strongly influenced by gestational age.

The discovery of circulating cell-free fetal DNA in maternal plasma in 1997 offered new possibilities for non-invasive prenatal diagnosis (Lo, YMD and Chiu, RWK 2007 Nat Rev Genet 8, 71-77). Although this approach has been readily applied to the prenatal diagnosis of sex-linked disorders (Costa, JM et al. 2002 N Engl J Med 346, 1502) and certain single-gene disorders (Lo, YMD et al. 1998 N Engl J Med 339, 1734-1738), its application to the prenatal detection of fetal chromosomal aneuploidy remains a considerable challenge (Lo, YMD and Chiu, RWK 2007, supra). First, fetal nucleic acids coexist in maternal plasma with a high background of nucleic acids of maternal origin, which often interferes with the analysis of the fetal nucleic acids (Lo, YMD et al. 1998 Am J Hum Genet 62, 768-775). Second, fetal nucleic acids circulate in maternal plasma predominantly in a cell-free form, which makes it difficult to derive dosage information for genes or chromosomes of the fetal genome.

Significant developments that overcome these challenges have been made in recent years (Benachi, A & Costa, JM 2007 Lancet 369, 440-442). One approach detects fetal-specific nucleic acids in maternal plasma, thereby overcoming the problem of maternal background interference (Lo, YMD and Chiu, RWK 2007, supra). The dosage of chromosome 21 is inferred from the ratio of polymorphic alleles in placenta-derived DNA/RNA molecules. However, this approach is less accurate when the sample contains a lower amount of the targeted nucleic acid, and it is applicable only to fetuses that are heterozygous for the targeted polymorphism, which is only a subset of the population if a single polymorphism is used.

Dhallan et al. (Dhallan, R, et al. 2007, supra, Dhallan, R, et al. 2007 Lancet 369, 474-481) described an alternative strategy of enriching the proportion of circulating fetal DNA by adding formaldehyde to maternal plasma. The proportion of chromosome 21 sequences contributed by the fetus in maternal plasma was determined by assessing, for single nucleotide polymorphisms (SNPs) on chromosome 21, the ratio of paternally inherited fetal-specific alleles to non-fetal-specific alleles. SNP ratios were similarly computed for a reference chromosome. An imbalance of fetal chromosome 21 was then inferred by detecting a statistically significant difference between the SNP ratios for chromosome 21 and those for the reference chromosome, where significance was defined using a fixed p-value of less than or equal to 0.05. To ensure high population coverage, more than 500 SNPs were targeted per chromosome. However, there has been controversy over the effectiveness of formaldehyde in enriching fetal DNA to a high proportion (Chung, GTY, et al. 2005 Clin Chem 51, 655-658), and the reproducibility of the method therefore needs further evaluation. In addition, because each fetus and mother would be informative for a different number of SNPs on each chromosome, the power of the statistical test used to compare SNP ratios would vary from case to case (Lo YMD & Chiu, RWK. 2007 Lancet 369, 1997). Furthermore, because these approaches depend on the detection of genetic polymorphisms, they are limited to fetuses that are heterozygous for those polymorphisms.

Using polymerase chain reaction (PCR) and DNA quantification of a chromosome 21 locus and a reference locus in amniocyte cultures obtained from trisomy 21 and euploid fetuses, Zimmermann et al. (2002 Clin Chem 48, 362-363) were able to distinguish between the two groups of fetuses based on the 1.5-fold increase in chromosome 21 DNA sequences in the amniocyte cultures from trisomy 21 fetuses. Because a 2-fold difference in DNA template concentration corresponds to a difference of only one threshold cycle (Ct), discriminating a 1.5-fold difference is at the limit of conventional real-time PCR. An alternative strategy is needed to achieve a finer degree of quantitative discrimination.
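The arithmetic behind the Ct argument can be illustrated with a short calculation. This is only a sketch: the function name `delta_ct` and the assumption of perfect doubling per cycle (amplification efficiency of 1.0) are illustrative and are not taken from the cited studies.

```python
import math


def delta_ct(fold_change: float, efficiency: float = 1.0) -> float:
    """Return the difference in threshold cycles between two templates
    whose starting concentrations differ by `fold_change`.

    `efficiency` = 1.0 assumes the product exactly doubles each cycle
    (an idealisation; real assays are usually somewhat less efficient).
    """
    if fold_change <= 0 or efficiency <= 0:
        raise ValueError("fold_change and efficiency must be positive")
    return math.log(fold_change) / math.log(1.0 + efficiency)


# A 2-fold difference is one full cycle; a 1.5-fold difference is
# only a fraction of a cycle under the same assumption.
print(delta_ct(2.0))
print(round(delta_ct(1.5), 3))
```

Under this idealisation a 1.5-fold difference amounts to roughly 0.58 of a cycle, which is why it is hard to resolve with conventional real-time PCR.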

Digital PCR has been developed for the detection of allelic ratio skewing in nucleic acid samples (Chang, HW et al. 2002 J Natl Cancer Inst 94, 1697-1703). Digital PCR is an amplification-based nucleic acid analysis technique that requires a sample containing nucleic acids to be distributed into a large number of discrete samples, each containing on average no more than about one target sequence. In digital PCR, specific nucleic acid targets are amplified with sequence-specific primers to generate specific amplicons. The nucleic acid loci to be targeted, and the type or panel of sequence-specific primers to be included in the reactions, are determined or selected before the nucleic acid analysis.
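The requirement that each discrete sample contain on average no more than about one target can be illustrated numerically. The sketch below assumes that template molecules are distributed among wells according to a Poisson distribution, which is a common modelling assumption for digital PCR and is not stated in the text above; the function name `well_occupancy` is hypothetical.

```python
import math


def well_occupancy(mean_targets_per_well: float) -> dict:
    """Fractions of wells expected to hold zero, exactly one, or more
    than one target molecule, assuming Poisson-distributed templates."""
    lam = mean_targets_per_well
    if lam < 0:
        raise ValueError("mean_targets_per_well must be non-negative")
    p_empty = math.exp(-lam)          # P(k = 0)
    p_single = lam * math.exp(-lam)   # P(k = 1)
    return {
        "empty": p_empty,
        "single": p_single,
        "multiple": 1.0 - p_empty - p_single,
    }


# Compare a dilute loading with the upper bound of about one target per well.
for lam in (0.3, 1.0):
    print(lam, well_occupancy(lam))
```

Lower mean loadings leave more wells empty but reduce the share of wells that contain several targets, which is the trade-off behind the dilution step.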

Clinically, digital PCR has been shown to be useful for detecting loss of heterozygosity (LOH) in tumor DNA samples (Zhou, W. et al. 2002 Lancet 359, 219-225). To analyze digital PCR results, previous studies used sequential probability ratio testing (SPRT) to classify the experimental results as indicating the presence or absence of LOH in a sample (El Karoui et al. 2006 Stat Med 25, 3124-3133).

In the methods used in previous studies, the amount of data collected from digital PCR was quite low. Accuracy could therefore be compromised by the small number of data points and by typical statistical fluctuations.

It is therefore desirable for non-invasive tests to have high sensitivity and specificity, so as to minimize false negatives and false positives, respectively. However, fetal DNA is present at a low absolute concentration and represents a minor portion of all DNA sequences in maternal plasma and serum. It is therefore also desirable to have methods that allow the non-invasive detection of fetal chromosomal aneuploidy by maximizing the amount of genetic information that can be inferred from the limited amount of fetal nucleic acids that exist as a minor population in a biological sample containing maternal background nucleic acids.

SUMMARY OF THE INVENTION

Embodiments of the present invention provide methods, systems, and apparatus for determining whether a nucleic acid sequence imbalance (e.g., a chromosome imbalance) exists in a biological sample obtained from a pregnant woman. This determination may be made using a parameter of the amount of a clinically relevant chromosomal region relative to other, non-clinically relevant chromosomal regions (background regions) in the biological sample. In one aspect, the amount of a chromosome is determined by sequencing nucleic acid molecules in a maternal sample, such as urine, plasma, serum, or another suitable biological sample. The nucleic acid molecules in the biological sample are sequenced such that a fraction of the genome is sequenced. To determine whether a change relative to a reference quantity (i.e., an imbalance) exists, one or more cutoff values are chosen, for example, with regard to the ratio of the amounts of two chromosomal regions (or sets of chromosomal regions).

According to one exemplary embodiment, a biological sample received from a pregnant woman is analyzed to perform a prenatal diagnosis of fetal chromosomal aneuploidy. The biological sample includes nucleic acid molecules. A portion of the nucleic acid molecules contained in the biological sample is sequenced. In one aspect, the amount of genetic information obtained is sufficient for an accurate diagnosis yet not excessive, so as to contain costs and the amount of input biological sample required.

Based on the sequencing, a first amount of a first chromosome is determined from sequences identified as originating from the first chromosome. A second amount of one or more second chromosomes is determined from sequences identified as originating from one of the second chromosomes. A parameter derived from the first amount and the second amount is then compared to one or more cutoff values. Based on the comparison, a classification of whether a fetal chromosomal aneuploidy exists for the first chromosome is determined. The sequencing advantageously maximizes the amount of genetic information that can be inferred from the limited amount of fetal nucleic acids that exist as a minor population in a biological sample containing maternal background nucleic acids.
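The determine-amounts, compute-parameter, compare-to-cutoff sequence can be sketched as follows. This is a minimal illustration rather than the claimed method: the function names, the choice of "percentage of all aligned tags" as the parameter, the chromosome labels, and the cutoff value are all assumptions made for the example.

```python
from collections import Counter


def chromosome_percentage(tag_chromosomes, target="chr21"):
    """Percentage of sequenced tags assigned to `target`.

    `tag_chromosomes` is an iterable with one chromosome label per
    aligned tag. The first amount is the count for `target`; the second
    amount is the count for all other chromosomes.
    """
    counts = Counter(tag_chromosomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no sequenced tags supplied")
    return 100.0 * counts[target] / total


def classify(parameter, cutoff):
    """Compare the parameter with a single cutoff value."""
    return "aneuploid" if parameter > cutoff else "euploid"


# Hypothetical toy data: 3 of 100 tags align to chromosome 21.
tags = ["chr21"] * 3 + ["chr1"] * 60 + ["chr2"] * 37
pct = chromosome_percentage(tags)
print(pct, classify(pct, cutoff=2.0))  # the cutoff here is purely illustrative
```

In practice a cutoff would be chosen from reference samples or from the expected fetal DNA percentage rather than fixed by hand as in this toy example.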

According to another exemplary embodiment, a biological sample received from a pregnant woman is analyzed to perform a prenatal diagnosis of fetal chromosomal aneuploidy. The biological sample includes nucleic acid molecules. The percentage of fetal DNA in the biological sample is determined. Based on this percentage and on a desired accuracy, the number N of sequences to be analyzed is calculated. At least N of the nucleic acid molecules contained in the biological sample are randomly sequenced.
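One way such a number N might be estimated is sketched below. The text does not give a formula at this point, so everything here is an assumption made for illustration: the one-sided normal-approximation sample-size formula for a proportion, the baseline chromosome 21 proportion `p_euploid`, and the error rates `alpha` and `power` are hypothetical choices, and the function name `required_sequences` is invented.

```python
import math
from statistics import NormalDist


def required_sequences(fetal_fraction, p_euploid=0.013, alpha=0.01, power=0.99):
    """Approximate number of sequenced tags needed to tell a trisomic
    from a euploid proportion of tags on the chromosome of interest.

    fetal_fraction : fraction of DNA in the sample that is fetal (0-1]
    p_euploid      : assumed proportion of tags from the chromosome in a
                     euploid pregnancy (illustrative default)
    alpha, power   : one-sided false-positive rate and detection power
    """
    if not 0.0 < fetal_fraction <= 1.0:
        raise ValueError("fetal_fraction must be in (0, 1]")
    p0 = p_euploid
    # A trisomic fetus adds half a fetal share of extra chromosome copies;
    # renormalise because the total amount of DNA grows as well.
    extra = p0 * fetal_fraction / 2.0
    p1 = (p0 + extra) / (1.0 + extra)
    z_a = NormalDist().inv_cdf(1.0 - alpha)
    z_b = NormalDist().inv_cdf(power)
    numerator = z_a * math.sqrt(p0 * (1 - p0)) + z_b * math.sqrt(p1 * (1 - p1))
    return math.ceil((numerator / (p1 - p0)) ** 2)


for f in (0.05, 0.10, 0.25):
    print(f, required_sequences(f))
```

The qualitative behaviour is the point of the sketch: the lower the fetal DNA percentage, the more sequences must be counted to reach the same accuracy.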

Based on the random sequencing, a first amount of a first chromosome is determined from sequences identified as originating from the first chromosome. A second amount of one or more second chromosomes is determined from sequences identified as originating from one of the second chromosomes. A parameter derived from the first amount and the second amount is then compared to one or more cutoff values. Based on the comparison, a classification of whether a fetal chromosomal aneuploidy exists for the first chromosome is determined. The random sequencing advantageously maximizes the amount of genetic information that can be inferred from the limited amount of fetal nucleic acids that exist as a minor population in a sample containing maternal background nucleic acids.

Other embodiments of the present invention are directed to systems and computer-readable media associated with the methods described herein.

A better understanding of the features and advantages of the present invention may be gained by reference to the following detailed description and the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart of a method 100, according to an embodiment of the present invention, for performing prenatal diagnosis of fetal chromosomal aneuploidy in a biological sample obtained from a pregnant female subject.

FIG. 2 is a flowchart of a method 200, according to an embodiment of the present invention, for performing prenatal diagnosis of fetal chromosomal aneuploidy using random sequencing.

FIG. 3A shows a plot of the percentage representation of chromosome 21 sequences in maternal plasma samples involving trisomy 21 or euploid fetuses, according to an embodiment of the present invention.

FIG. 3B shows the correlation between the fractional fetal DNA concentrations in maternal plasma determined by massively parallel sequencing and by microfluidics digital PCR, according to an embodiment of the present invention.

FIG. 4A shows a plot of the percentage representation of aligned sequences per chromosome, according to an embodiment of the present invention.

FIG. 4B shows a plot of the difference (%) in percentage representation per chromosome between the trisomy 21 case and the euploid case shown in FIG. 4A.

FIG. 5 shows the correlation between the degree of over-representation of chromosome 21 sequences and the fractional fetal DNA concentration in maternal plasma involving trisomy 21 fetuses, according to an embodiment of the present invention.

FIG. 6 shows a table of the portion of the human genome that was analyzed according to an embodiment of the present invention. T21 denotes a sample obtained from a pregnancy involving a trisomy 21 fetus.

FIG. 7 shows a table of the number of sequences required to differentiate euploid from trisomy 21 fetuses, according to an embodiment of the present invention.

FIG. 8A shows a table of the first ten starting positions of sequenced tags aligned to chromosome 21, according to an embodiment of the present invention.

FIG. 8B shows a table of the first ten starting positions of sequenced tags aligned to chromosome 22, according to an embodiment of the present invention.

FIG. 9 shows a block diagram of an exemplary computer apparatus usable with systems and methods according to embodiments of the present invention.

Definitions

As used herein, the term "biological sample" refers to any sample that is taken from a subject (e.g., a human, such as a pregnant woman) and contains one or more nucleic acid molecules of interest.

The term "nucleic acid" or "polynucleotide" refers to a deoxyribonucleic acid (DNA) or ribonucleic acid (RNA), and polymers thereof, in either single- or double-stranded form. Unless specifically limited, the term encompasses nucleic acids containing known analogs of natural nucleotides that have binding properties similar to those of the reference nucleic acid and are metabolized in a manner similar to naturally occurring nucleotides. Unless otherwise indicated, a particular nucleic acid sequence also implicitly encompasses conservatively modified variants thereof (e.g., degenerate codon substitutions), alleles, orthologs, SNPs, and complementary sequences, as well as the sequence explicitly indicated. Specifically, degenerate codon substitutions may be achieved by generating sequences in which the third position of one or more selected (or all) codons is substituted with mixed-base and/or deoxyinosine residues (Batzer et al., Nucleic Acid Res. 19:5081 (1991); Ohtsuka et al., J. Biol. Chem. 260:2605-2608 (1985); and Rossolini et al., Mol. Cell. Probes 8:91-98 (1994)). The term nucleic acid is used interchangeably with gene, cDNA, mRNA, small noncoding RNA, microRNA (miRNA), Piwi-interacting RNA, and short hairpin RNA (shRNA) encoded by a gene or locus.

The term "gene" means the segment of DNA involved in producing a polypeptide chain. It may include regions preceding and following the coding region (leader and trailer), as well as intervening sequences (introns) between individual coding segments (exons).

As used herein, the term "reaction" refers to any process involving a chemical, enzymatic, or physical action that is indicative of the presence or absence of a particular polynucleotide sequence of interest. An example of a "reaction" is an amplification reaction such as a polymerase chain reaction (PCR). Another example of a "reaction" is a sequencing reaction, either by synthesis or by ligation. An "informative reaction" is one that indicates the presence of one or more particular polynucleotide sequences of interest; in one case, only one sequence of interest is present. As used herein, the term "well" refers to a reaction at a predetermined location within a confined structure, e.g., a well-shaped vial, a cell, or a chamber in a PCR array.

As used herein, the term "clinically relevant nucleic acid sequence" can refer to a polynucleotide sequence corresponding to a segment of a larger genomic sequence whose potential imbalance is being tested, or to the larger genomic sequence itself. One example is the sequence of chromosome 21. Other examples include chromosomes 18, 13, X, and Y. Yet other examples include mutated genetic sequences, genetic polymorphisms, or copy number variations that a fetus may inherit from one or both of its parents. Yet other examples include sequences that are mutated, deleted, or amplified in a malignant tumor, e.g., sequences in which loss of heterozygosity or gene duplication has occurred. In some embodiments, multiple clinically relevant nucleic acid sequences, or equivalently multiple markers of a clinically relevant nucleic acid sequence, can be used to provide data for detecting an imbalance. For example, data from five non-consecutive sequences on chromosome 21 can be used in an additive fashion to determine a possible chromosome 21 imbalance, effectively reducing the required sample volume to one-fifth.

As used herein, the term "background nucleic acid sequence" refers to a nucleic acid sequence whose normal ratio to the clinically relevant nucleic acid sequence is known, for example a 1:1 ratio. As one example, the background nucleic acid sequence and the clinically relevant nucleic acid sequence are two alleles from the same chromosome that are distinct due to heterozygosity. In another example, the background nucleic acid sequence is one allele that is heterozygous to another allele, and that other allele is the clinically relevant nucleic acid sequence. Moreover, some of the background nucleic acid sequences and clinically relevant nucleic acid sequences may each come from different individuals.

As used herein, the term "reference nucleic acid sequence" refers to a nucleic acid sequence whose average concentration per reaction is known or, equivalently, has been measured.

As used herein, the term "overrepresented nucleic acid sequence" refers to the nucleic acid sequence, among two sequences of interest (e.g., a clinically relevant sequence and a background sequence), that is more abundant than the other sequence in a biological sample.

As used herein, the term "based on" means "based at least in part on," and refers to one value (or result) being used in the determination of another value, as occurs in the relationship between an input of a method and the output of that method. As used herein, the term "derive" also refers to the relationship between an input of a method and the output of that method, as occurs when the derivation is the calculation of a formula.

As used herein, the term "quantitative data" means data that are obtained from one or more reactions and that provide one or more numerical values. For example, the number of wells that show a fluorescent marker for a particular sequence is quantitative data.

As used herein, the term "parameter" means a numerical value that characterizes a quantitative data set and/or a numerical relationship between quantitative data sets. For example, a ratio (or a function of a ratio) between a first amount of a first nucleic acid sequence and a second amount of a second nucleic acid sequence is a parameter.

As used herein, the term "cutoff value" means a numerical value used to arbitrate between two or more states of classification (e.g., diseased and non-diseased) for a biological sample. For example, if a parameter is greater than the cutoff value, the quantitative data are assigned to a first classification (e.g., diseased state); if the parameter is less than the cutoff value, the quantitative data are assigned to a different classification (e.g., non-diseased state).

As used herein, the term "imbalance" means any significant deviation, as defined by at least one cutoff value, of the quantity of the clinically relevant nucleic acid sequence from a reference quantity. For example, the reference quantity could be a ratio of 3/5, in which case an imbalance would exist if the measured ratio were 1:1.

As used herein, the term "chromosomal aneuploidy" means a variation in the quantitative amount of a chromosome from that of a diploid genome. The variation may be a gain or a loss. It may involve the whole of one chromosome or a region of a chromosome.

As used herein, the term "random sequencing" refers to sequencing in which the nucleic acid fragments sequenced have not been specifically identified or targeted before the sequencing procedure. Sequence-specific primers that target specific gene loci are not required. The pool of nucleic acids sequenced varies from sample to sample, and even from analysis to analysis of the same sample. The identities of the sequenced nucleic acids are revealed only by the sequencing output generated. In some embodiments of the present invention, the random sequencing may be preceded by procedures that enrich a biological sample for particular populations of nucleic acid molecules sharing certain common features. In one embodiment, each fragment in the biological sample has an equal probability of being sequenced.

As used herein, the term "fraction of the human genome" or "portion of the human genome" refers to less than 100% of the nucleotide sequence of the human genome, which comprises approximately 3 billion nucleotide base pairs. In the context of sequencing, the term refers to less than 1-fold coverage of the nucleotide sequence of the human genome. The term may be expressed as a percentage or as an absolute number of nucleotides/base pairs. As one example of use, the term may refer to the actual amount of sequencing performed. Embodiments may determine the minimum sequenced fraction of the human genome required to obtain an accurate diagnosis. As another example of use, the term may refer to the amount of sequenced data used to derive a parameter or amount for disease classification.
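The relationship between the number of sequenced tags and the fraction of the genome they represent is simple arithmetic, sketched below. The genome size of 3 billion base pairs comes from the definition above; the tag count and tag length in the example are hypothetical, and the function name `genome_fraction` is invented for illustration. Overlap between tags is ignored, so the result is an upper bound on the distinct fraction covered.

```python
HUMAN_GENOME_BP = 3_000_000_000  # approximate size stated in the definition


def genome_fraction(n_tags, tag_length_bp, genome_bp=HUMAN_GENOME_BP):
    """Fold coverage represented by `n_tags` tags of `tag_length_bp` bases.

    Values below 1.0 correspond to sequencing only a fraction of the
    genome (less than 1-fold coverage).
    """
    if n_tags < 0 or tag_length_bp <= 0 or genome_bp <= 0:
        raise ValueError("counts and lengths must be positive")
    return n_tags * tag_length_bp / genome_bp


# Hypothetical run: two million tags of 36 bases each.
fraction = genome_fraction(2_000_000, 36)
print(f"{fraction:.3%} of the genome")
```

Expressed this way, a run of a few million short tags corresponds to only a few percent of the genome, well under 1-fold coverage.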

本文所用术语“被测序的标签”意来自核酸分子的任何部分或全部的被测序的核苷酸串(string)。例如，被测序的标签可以是来自核酸片段的被测序的一短串核苷酸，位于核酸片段两端的一短串核苷酸，或存在于生物样品中的完整核酸片段的测序。核酸片段是更大的核酸分子的任何部分。片段(如基因)可以与更大核酸分子的其他部分分离地存在(即不连接)。As used herein, the term "sequenced tag" means a sequenced string of nucleotides from any part or all of a nucleic acid molecule. For example, a sequenced tag can be a short sequenced string of nucleotides from a nucleic acid fragment, a short string of nucleotides at either end of a nucleic acid fragment, or the sequencing of an entire nucleic acid fragment present in a biological sample. A nucleic acid fragment is any portion of a larger nucleic acid molecule. A fragment (e.g., a gene) can exist separately (i.e., not connected) from the rest of the larger nucleic acid molecule.

发明详述Detailed Description of the Invention

本发明的实施方案提供了，确定与非患病状态相比，临床相关染色体的存在增加还是减少(患病状态)的方法、系统和装置。这种确定可以通过利用与生物样品中其他非临床相关染色体区(背景区)有关的临床相关染色体区的量的参数来进行。对生物样品的核酸分子进行测序，以便对基因组部分进行测序，并可以由测序结果确定量。选择一个或多个截止值，用于确定是否存在与参照量相比的变化(即失衡)，例如，关于两个染色体区(或染色体区组)的量的比值。Embodiments of the present invention provide, determine that compared with non-ill state, the existence of clinically relevant chromosome increases or reduces the method, system and apparatus of (ill state).This determination can be carried out by utilizing the parameter of the amount in the clinically relevant chromosome district relevant with other non-clinical relevant chromosome districts (background area) in biological sample.The nucleic acid molecule of biological sample is checked order, so that genome part is checked order, and can be determined by sequencing result amount.Select one or more cut-off values, for determining whether there is the variation (i.e. imbalance) compared with reference amount, for example, about the ratio of the amount of two chromosome districts (or chromosome district group).

在参照量中所检测的变化可以是，与其他非临床相关序列相比的，与临床相关核酸序列有关的任何偏差(向上或向下)。因此，参照状态可以是任何比值或其他量(如除了1-1对应外)，并且如通过一个或多个截止值所确定的，表示变化的测量状态可以是不同于参考量的任何比值或其他量。The change detected in the reference amount can be any deviation (upward or downward) relative to a clinically relevant nucleic acid sequence compared to other non-clinically relevant sequences. Thus, the reference state can be any ratio or other quantity (e.g., other than a 1-1 correspondence), and the measured state representing a change can be any ratio or other quantity different from the reference amount, as determined by one or more cutoff values.

The clinically relevant chromosomal region (also called a clinically relevant nucleic acid sequence) and the background nucleic acid sequence may come from a first type of cells and from one or more second types of cells. For example, fetal nucleic acid sequences originating from fetal/placental cells are present in a biological sample, such as maternal plasma, which contains a background of maternal nucleic acid sequences originating from maternal cells. In one embodiment, the cutoff value is determined based at least in part on the percentage of the first type of cells in the biological sample. Note that the percentage of fetal sequences in a sample may be determined from any fetal-derived locus and is not limited to measuring the clinically relevant nucleic acid sequences. In another embodiment, the cutoff value is determined based at least in part on the percentage of tumor sequences in a biological sample, such as plasma, serum, saliva or urine, which contains a background of nucleic acid sequences derived from the non-malignant cells within the body.

I. General Method

In step 110, a biological sample from the pregnant woman is received. The biological sample may be plasma, urine, serum, or any other suitable sample. The sample contains nucleic acid molecules from the fetus and from the pregnant woman. For example, the nucleic acid molecules may be fragments of chromosomes.

In step 120, at least a portion of the plurality of nucleic acid molecules contained in the biological sample is sequenced. The portion sequenced represents a fraction of the human genome. In one embodiment, the nucleic acid molecules are fragments of their respective chromosomes. One end (e.g., 35 base pairs (bp)), both ends, or the entire fragment may be sequenced. All of the nucleic acid molecules in the sample may be sequenced, or just a subset may be sequenced. This subset may be randomly chosen, as described in more detail below.

In one embodiment, the sequencing is done using massively parallel sequencing. Massively parallel sequencing, such as that achievable on the 454 platform (Roche) (Margulies, M. et al. 2005 Nature 437, 376-380), the Illumina Genome Analyzer (or Solexa platform), the SOLiD System (Applied Biosystems), the Helicos True Single Molecule DNA sequencing technology (Harris TD et al. 2008 Science 320, 106-109), the single molecule, real-time (SMRT™) technology of Pacific Biosciences, and nanopore sequencing (Soni GV and Meller A. 2007 Clin Chem 53: 1996-2001), allows many nucleic acid molecules isolated from a specimen to be sequenced in parallel at high orders of multiplexing (Dear, Brief Funct Genomic Proteomic 2003; 1: 397-416). Each of these platforms sequences clonally expanded or even non-amplified single molecules of nucleic acid fragments.

Because a large number of sequencing reads, on the order of hundreds of thousands to millions or even possibly hundreds of millions or billions, are generated from each sample in each run, the resultant reads form a representative profile of the mix of nucleic acid species in the original specimen. For example, the haplotype, transcriptome and methylation profiles of the sequenced reads resemble those of the original specimen (Brenner et al. Nat Biotech 2000; 18: 630-634; Taylor et al. Cancer Res 2007; 67: 8511-8518). Because of the large sampling of sequences from each specimen, the number of identical sequences, such as that generated from sequencing a nucleic acid pool at several folds of coverage or high redundancy, is also a good quantitative representation of the count of a particular nucleic acid species or locus in the original sample.

In step 130, based on the sequencing (e.g., data from the sequencing), a first amount of a first chromosome (e.g., the clinically relevant chromosome) is determined. The first amount is determined from sequences identified as originating from the first chromosome. For example, a bioinformatics procedure may then be used to locate each of these DNA sequences in the human genome. A proportion of such sequences may be discarded from subsequent analysis because they lie in repeat regions of the human genome or in regions subject to inter-individual variation, such as copy number variation. An amount of the chromosome of interest and of one or more other chromosomes may thus be determined.

In step 140, based on the sequencing, a second amount of one or more second chromosomes is determined from sequences identified as originating from one of the second chromosomes. In one embodiment, the second chromosomes are all of the chromosomes other than the first one (i.e., the one being tested). In another embodiment, the second chromosome is a single other chromosome.

There are a number of ways of determining the amount of a chromosome, including but not limited to counting the number of sequenced tags, the number of sequenced nucleotides (base pairs), or the accumulated length of sequenced nucleotides (base pairs) originating from a particular chromosome or chromosomal region.
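As a minimal sketch of these counting options, the following Python tallies both the number of sequenced tags and the cumulative sequenced length per chromosome. The input layout (a list of chromosome and read-length pairs), the function name, and the toy values are assumptions made for illustration; they are not part of the described method.

```python
from collections import defaultdict

def chromosome_amounts(tags):
    """Tally per-chromosome amounts from aligned sequenced tags.

    `tags` is an iterable of (chromosome, sequenced_length_bp) pairs;
    this input layout is an assumption made for illustration.
    """
    tag_counts = defaultdict(int)   # number of sequenced tags per chromosome
    base_counts = defaultdict(int)  # cumulative sequenced nucleotides (bp) per chromosome
    for chrom, length_bp in tags:
        tag_counts[chrom] += 1
        base_counts[chrom] += length_bp
    return dict(tag_counts), dict(base_counts)

# Hypothetical toy input: five 36 bp tags.
example_tags = [("chr21", 36), ("chr1", 36), ("chr1", 36), ("chr21", 36), ("chr2", 36)]
tag_counts, base_counts = chromosome_amounts(example_tags)
```

Either tally could serve as the "amount" in the later steps; with fixed-length reads the two differ only by a constant factor.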

In another embodiment, rules may be imposed on the results of the sequencing to determine what gets counted. In one aspect, an amount may be obtained based on a proportion of the sequenced output. For example, sequencing output corresponding to nucleic acid fragments of a specified size range could be selected after the bioinformatics analysis. Examples of size ranges are about <300 bp, <200 bp or <100 bp.

In step 150, a parameter is determined from the first amount and the second amount. The parameter may be, for example, a simple ratio of the first amount to the second amount, or the ratio of the first amount to the sum of the second amount and the first amount. In one aspect, each amount could be an argument to a function or to separate functions, and a ratio of these separate functions may then be taken. One skilled in the art will appreciate the number of different suitable parameters.

In one embodiment, a parameter (e.g., a fractional representation) of a chromosome potentially involved in a chromosomal aneuploidy, e.g., chromosome 21, chromosome 18 or chromosome 13, may then be calculated from the results of the bioinformatics procedure. The fractional representation may be obtained based on the amount of all of the sequences (e.g., some measure of all of the chromosomes, including the clinically relevant chromosome) or of a particular subset of chromosomes (e.g., just one chromosome other than the one being tested).
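The two example parameters named for step 150 can be sketched as follows. The function names and the toy counts are hypothetical, and other parameters may be equally suitable.

```python
def simple_ratio(first_amount, second_amount):
    """Ratio of the first amount to the second amount."""
    return first_amount / second_amount

def fractional_representation(first_amount, second_amount):
    """Ratio of the first amount to the sum of the first and second amounts."""
    return first_amount / (first_amount + second_amount)

# Hypothetical counts: 2 tags from the tested chromosome, 98 from the others.
ratio = simple_ratio(2, 98)
fraction = fractional_representation(2, 98)
```

When the second amount covers all other chromosomes, the fractional representation is simply the share of all counted tags that map to the tested chromosome.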

In step 150, the parameter is compared with one or more cutoff values. The cutoff values may be determined in any number of suitable ways. Such ways include the Bayesian-type likelihood method, the sequential probability ratio test (SPRT), false discovery, confidence intervals, and the receiver operating characteristic (ROC). Examples of applications of these methods and of sample-specific methods are described in the concurrently filed application "DETERMINING A NUCLEIC ACID SEQUENCE IMBALANCE" (Attorney Docket No. 016285-005210US), which is incorporated by reference.

In one embodiment, the parameter (e.g., the fractional representation of the clinically relevant chromosome) is then compared with a reference range established in pregnancies involving normal (i.e., euploid) fetuses. In some variants of the procedure, the reference range (i.e., the cutoff values) may be adjusted according to the fractional concentration (f) of fetal DNA in the particular maternal plasma sample. The value of f can be determined from the sequencing dataset, e.g., using sequences mappable to the Y chromosome if the fetus is male. The value of f may also be determined in a separate analysis, e.g., using fetal epigenetic markers (Chan KCA et al. 2006 Clin Chem 52, 2211-8) or from the analysis of single nucleotide polymorphisms.
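One way to see why the reference range may depend on f is the following sketch. It rests on a simplifying assumption that is not stated in the text: maternal DNA contributes two copies of the tested chromosome, DNA from a trisomic fetus contributes three, and the effect on the total tag count is ignored. The function name is hypothetical.

```python
def expected_trisomy_fold_change(fetal_fraction):
    """Expected dose of the tested chromosome in a trisomic pregnancy,
    relative to a euploid pregnancy, for a given fetal DNA fraction f.

    Simplifying assumption: the maternal share (1 - f) carries 2 copies
    and the fetal share f carries 3 copies of the chromosome.
    """
    return ((1 - fetal_fraction) * 2 + fetal_fraction * 3) / 2

# With 10% fetal DNA the expected increase is only about 5%.
fold_change = expected_trisomy_fold_change(0.10)
```

Under this assumption the fold change equals 1 + f/2, so a lower fetal fraction produces a smaller shift that a cutoff must still resolve.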

In step 160, based on the comparison, a classification of whether a fetal chromosomal aneuploidy exists for the first chromosome is determined. In one embodiment, the classification is a definitive yes or no. In another embodiment, the classification may be unclassifiable or uncertain. In yet another embodiment, the classification may be a score that is to be interpreted at a later date, for example, by a doctor.
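The comparison and classification steps could be sketched as below. The z-score against a euploid reference set is a swapped-in technique chosen for illustration; it is not among the cutoff methods the text lists, and the threshold of 3, the function name, and the reference values are all hypothetical.

```python
from statistics import mean, stdev

def classify(parameter, euploid_reference, z_cutoff=3.0):
    """Compare a parameter with a reference range built from euploid cases.

    The z-score rule and the default cutoff are assumptions for this sketch.
    """
    mu = mean(euploid_reference)
    sigma = stdev(euploid_reference)
    z = (parameter - mu) / sigma
    if z > z_cutoff:
        return "overrepresented", z
    if z < -z_cutoff:
        return "underrepresented", z
    return "within reference range", z

# Hypothetical fractional representations from five euploid pregnancies.
reference = [0.0130, 0.0131, 0.0129, 0.0130, 0.0132]
label_high, z_high = classify(0.0140, reference)
label_normal, z_normal = classify(0.0130, reference)
```

Returning the score alongside the label leaves room for the embodiment in which the result is interpreted later rather than reported as a definitive yes or no.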

II. Sequencing, Alignment, and Determining Amounts

As mentioned above, only a fraction of the genome is sequenced. In one aspect, even when a pool of nucleic acids in a specimen is sequenced at less than 100% genomic coverage rather than at several folds of coverage, and most nucleic acid species among the captured molecules are sequenced only once, dosage imbalance of a particular chromosome or chromosomal region can still be determined quantitatively. In other words, the dosage imbalance of the chromosome or chromosomal region is inferred from the percentage representation of that locus among the other mappable sequenced tags of the specimen.

This contrasts with situations in which the same pool of nucleic acids is sequenced multiple times to achieve high redundancy or several folds of coverage, whereby each nucleic acid species is sequenced multiple times. In such situations, the number of times a particular nucleic acid species has been sequenced relative to that of another species correlates with their relative concentrations in the original sample. The sequencing cost increases with the number of folds of coverage required to achieve an accurate representation of the nucleic acid species.

In one example, a proportion of such sequences would come from the chromosome involved in an aneuploidy, such as chromosome 21 in this illustrative example. Yet other sequences from such a sequencing exercise would be derived from the other chromosomes. By taking into account the relative size of chromosome 21 compared with the other chromosomes, one could obtain a normalized frequency, within a reference range, of chromosome 21-specific sequences from such a sequencing exercise. If the fetus has trisomy 21, the normalized frequency of chromosome 21-derived sequences from such a sequencing exercise will increase, thus allowing the detection of trisomy 21. The degree of change in the normalized frequency will depend on the fractional concentration of fetal nucleic acids in the analyzed sample.
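A minimal sketch of such a size-normalized frequency follows. Dividing the observed share of tags by the chromosome's share of the genome is one plausible normalization, assumed here for illustration; the function name and the rounded sizes are hypothetical placeholders, not reference-genome values.

```python
def normalized_frequency(chrom_tags, total_tags, chrom_size_bp, genome_size_bp):
    """Observed share of tags from a chromosome divided by the share
    expected from its size (an assumed normalization for this sketch)."""
    observed = chrom_tags / total_tags
    expected = chrom_size_bp / genome_size_bp
    return observed / expected

# Placeholder numbers: 15 of 1000 tags map to a chromosome that makes up
# 45 Mb of a 3,000 Mb genome, so the observed and expected shares match.
freq = normalized_frequency(15, 1000, 45_000_000, 3_000_000_000)
```

A value near 1 means the chromosome is represented as its size predicts; in practice the reference range would come from non-diseased samples rather than from size alone.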

In one embodiment, we used the Illumina Genome Analyzer for single-end sequencing of human genomic DNA and human plasma DNA samples. The Illumina Genome Analyzer sequences clonally expanded single DNA molecules captured on a solid surface termed a flow cell. Each flow cell has 8 lanes for the sequencing of 8 individual specimens or pools of specimens. Each lane can generate about 200 Mb of sequence, which is only a fraction of the 3 billion base pairs of sequence in the human genome. Each genomic DNA or plasma DNA sample was sequenced using one lane of a flow cell. The short sequence tags generated were aligned to the human reference genome sequence, and the chromosomal origin was noted. The total number of individual sequenced tags aligned to each chromosome was tabulated and compared with the relative size of each chromosome as expected from the reference human genome or from non-disease representative specimens. Chromosome gains or losses were then identified.

The described approach is only one exemplification of the presently described gene/chromosome dosage strategy. Alternatively, paired-end sequencing could be performed. Instead of comparing the length of the sequenced fragments with that expected in the reference genome, as described by Campbell et al. (Nat Genet 2008; 40: 722-729), the number of aligned sequenced tags is counted and sorted according to chromosomal location. Gains or losses of chromosomal regions or whole chromosomes are determined by comparing the tag counts with the expected chromosome size in the reference genome or with that of a non-disease representative specimen. Because paired-end sequencing allows the size of the original nucleic acid fragment to be deduced, one example is to focus on counting the paired sequenced tags that correspond to nucleic acid fragments of a specified size, such as <300 bp, <200 bp or <100 bp.

In another embodiment, the fraction of the nucleic acid pool that is sequenced in a run is further sub-selected prior to sequencing. For example, hybridization-based techniques such as oligonucleotide arrays could be used to first sub-select nucleic acid sequences from certain chromosomes, e.g., a potentially aneuploid chromosome and other chromosomes not involved in the aneuploidy tested. Another example is that a certain sub-population of nucleic acid sequences from the sample pool is sub-selected or enriched prior to sequencing. For example, as discussed above, it has been reported that fetal DNA molecules in maternal plasma consist of shorter fragments than the maternal background DNA molecules (Chan et al. Clin Chem 2004; 50: 88-92). Thus, one may use one or more methods known to those of skill in the art to fractionate the nucleic acid sequences in the sample according to molecule size, e.g., by gel electrophoresis, by size exclusion columns, or by a microfluidics-based approach. Alternatively, in the example of analyzing cell-free fetal DNA in maternal plasma, the fetal nucleic acid portion could be enriched by a method that suppresses the maternal background, such as the addition of formaldehyde (Dhallan et al. JAMA 2004; 291: 1114-9). In one embodiment, a portion or subset of the pre-selected pool of nucleic acids is sequenced randomly.

Other single molecule sequencing strategies could similarly be used in this application, such as the Roche 454 platform, the Applied Biosystems SOLiD platform, the Helicos True Single Molecule DNA sequencing technology, the single molecule, real-time (SMRT™) technology of Pacific Biosciences, and nanopore sequencing.

III. Determining Amounts of Chromosomes from Sequencing Output

After the massively parallel sequencing, bioinformatics analysis was performed to locate the chromosomal origin of the sequenced tags. After this procedure, tags identified as originating from the potentially aneuploid chromosome, i.e., chromosome 21 in this study, are compared quantitatively with all of the sequenced tags or with tags originating from one or more chromosomes not involved in the aneuploidy. The relationship between the sequencing output from chromosome 21 and that from the other, non-21 chromosomes of a test specimen is compared with cutoff values derived with the methods described in the section above to determine whether the specimen was obtained from a pregnancy involving a euploid or a trisomy 21 fetus.

A number of different amounts, including but not limited to the following, could be derived from the sequenced tags. For example, the number of sequenced tags aligned to a particular chromosome, i.e., the absolute count, could be compared with the absolute count of sequenced tags aligned to other chromosomes. Alternatively, the fractional count of sequenced tags from chromosome 21, taken with reference to all or some of the other sequenced tags, could be compared with that of other non-aneuploid chromosomes. In the present experiment, because 36 bp were sequenced from each DNA fragment, the number of nucleotides sequenced from a particular chromosome could easily be derived by multiplying the sequenced tag count by 36 bp.

Furthermore, because each maternal plasma specimen was sequenced using only one flow cell, which can sequence only a fraction of the human genome, most of the maternal plasma DNA fragment species would, statistically, each have been sequenced only once, generating one sequenced tag count. In other words, the nucleic acid fragments present in the maternal plasma specimen were sequenced at less than 1-fold coverage. Thus, for any particular chromosome, the total number of sequenced nucleotides mostly corresponds to the amount, proportion or length of the part of that chromosome that has been sequenced. Hence, the quantitative determination of the representation of the potentially aneuploid chromosome could be derived from the fraction of the number, or equivalent length, of nucleotides sequenced from that chromosome, with reference to a similarly derived quantity for the other chromosomes.
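The "less than 1-fold coverage" statement can be checked with simple arithmetic. The sketch below uses the figures quoted in the text (about 200 Mb of output per lane, a genome of about 3 billion base pairs, and 36 bp tags); treating a single lane's output as the amount sequenced per sample is an assumption of this illustration, as are the variable names.

```python
lane_output_bp = 200_000_000      # ~200 Mb of sequence per lane, as quoted in Section II
genome_size_bp = 3_000_000_000    # ~3 billion base pairs in the human genome
read_length_bp = 36               # 36 bp sequenced from each fragment

# Average fold coverage of the genome from one lane (about 0.067-fold).
coverage = lane_output_bp / genome_size_bp

# Approximate number of 36 bp tags that make up that output.
approx_tags = lane_output_bp // read_length_bp
```

At roughly one-fifteenth of the genome per sample, most fragment species are expected to be seen at most once, which is why tag counts rather than repeat observations carry the quantitative signal.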

IV. Enrichment of Nucleic Acid Pools for Sequencing

As mentioned above and established in the Examples section below, only a portion of the human genome needs to be sequenced to differentiate trisomy 21 from euploid cases. Thus, it would be possible and cost-effective to enrich the pool of nucleic acids to be sequenced prior to random sequencing of a fraction of the enriched pool. For example, fetal DNA molecules in maternal plasma consist of shorter fragments than the maternal background DNA molecules (Chan et al. Clin Chem 2004; 50: 88-92). Thus, one may use one or more methods known to those of skill in the art to fractionate the nucleic acid sequences in the sample according to molecule size, e.g., by gel electrophoresis, by size exclusion columns, or by a microfluidics-based approach.

Alternatively, in the example of analyzing cell-free fetal DNA in maternal plasma, the fetal nucleic acid portion could be enriched by a method that suppresses the maternal background, such as the addition of formaldehyde (Dhallan et al. JAMA 2004; 291: 1114-9). The proportion of fetal-derived sequences would be enriched in a nucleic acid pool composed of shorter fragments. According to FIG. 7, the number of sequenced tags required for differentiating euploid from trisomy 21 cases decreases as the fractional fetal DNA concentration increases.

Alternatively, sequences originating from a potentially aneuploid chromosome and from one or more chromosomes not involved in the aneuploidy could be enriched by hybridization techniques, for example on oligonucleotide microarrays. The enriched pool of nucleic acids would then be subjected to random sequencing. This would reduce the sequencing cost.

V. Random Sequencing

FIG. 2 is a flowchart of a method 200 for performing prenatal diagnosis of a fetal chromosomal aneuploidy using random sequencing according to an embodiment of the present invention. In one aspect of the massively parallel sequencing approach, representative data from all of the chromosomes may be generated at the same time. The origin of a particular fragment is not selected ahead of time. The sequencing is done at random, and a database search is then performed to see where a particular fragment comes from. This contrasts with situations in which a specific fragment from chromosome 21 and another specific fragment from chromosome 1 are amplified.

In step 210, a biological sample from the pregnant woman is received. In step 220, the number N of sequences to be analyzed is calculated for a desired accuracy. In one embodiment, the percentage of fetal DNA in the biological sample is first identified. This may be done by any suitable means known to one skilled in the art. The identification may simply be reading a value that was measured by another entity. In this embodiment, the calculation of the number N of sequences to be analyzed is based on the percentage. For example, the number of sequences needed to be analyzed increases when the fetal DNA percentage drops and can decrease when the fetal DNA percentage rises. The number N may be a fixed number or a relative number, such as a percentage. In another embodiment, one could sequence a number N that is known to be adequate for accurate disease diagnosis. The number N could be made sufficient even in pregnancies with fetal DNA concentrations at the lower end of the normal range.

In step 230, at least N of the plurality of nucleic acid molecules contained in the biological sample are randomly sequenced. A feature of this approach is that the nucleic acids to be sequenced are not specifically identified or targeted before sample analysis, i.e., sequencing. Sequence-specific primers targeting specific gene loci are not needed for the sequencing. The pool of nucleic acids sequenced varies from sample to sample and even from analysis to analysis for the same sample. Furthermore, as described below (FIG. 6), the amount of sequencing output required for case diagnosis could vary between the tested specimens and the reference population. These aspects are in marked contrast to most molecular diagnostic approaches, such as those based on fluorescence in situ hybridization, quantitative fluorescence PCR, quantitative real-time PCR, digital PCR, comparative genomic hybridization, and microarray comparative genomic hybridization, in which the gene loci to be targeted must be determined in advance, thus requiring the use of locus-specific primers or of locus-specific probe sets or panels.

In one embodiment, random sequencing is performed on DNA fragments present in the plasma of a pregnant woman, and genomic sequences that originally came from either the fetus or the mother are obtained. Random sequencing involves sampling (sequencing) a random portion of the nucleic acid molecules present in the biological sample. Because the sequencing is random, a different subset (fraction) of the nucleic acid molecules (and thus of the genome) may be sequenced in each analysis. Embodiments will work even when this subset varies from sample to sample and from analysis to analysis. Examples of the fraction are about 0.1%, 0.5%, 1%, 5%, 10%, 20% or 30% of the genome. In other embodiments, the fraction is at least any one of these values.
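The idea that a different random subset is sequenced in each analysis can be illustrated with a toy simulation. This is only a sketch: the pool composition, the function name, and the use of Python's `random` module as a stand-in for physical sampling are assumptions, not part of the described method.

```python
import random

def random_subset(molecules, fraction, seed=None):
    """Draw a random subset of a pool without replacement.

    `molecules` is a list of labels; `fraction` is the share to sample.
    Stands in for random sequencing in this toy illustration.
    """
    rng = random.Random(seed)
    n = max(1, round(len(molecules) * fraction))
    return rng.sample(molecules, n)

# Hypothetical pool: 15 molecules from chr21 and 985 from other chromosomes.
pool = ["chr21"] * 15 + ["other"] * 985
subset_a = random_subset(pool, 0.10, seed=1)
subset_b = random_subset(pool, 0.10, seed=2)
```

The two subsets generally contain different molecules, yet each is expected to reflect the pool's composition, which is the property the method relies on when the sequenced fraction varies between analyses.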

The remaining steps 240-270 may be performed in a manner similar to method 100.

VI. Post-Sequencing Selection of Pools of Sequenced Tags

As described in Examples II and III below, a subset of the sequenced data is sufficient to distinguish trisomy 21 from euploid cases. The subset of sequenced data could be the proportion of sequenced tags that pass certain quality parameters. For example, in Example II, sequenced tags that aligned uniquely to the repeat-masked reference human genome were used. Alternatively, one may sequence a representative pool of nucleic acid fragments from all of the chromosomes but focus on the comparison between data relevant to the potentially aneuploid chromosome and data relevant to a number of non-aneuploid chromosomes.

Alternatively, a subset of the sequencing output comprising the sequenced tags generated from nucleic acid fragments within a specified size window in the original specimen could be sub-selected during the post-sequencing analysis. For example, with the Illumina Genome Analyzer, one could use paired-end sequencing, in which the two ends of each nucleic acid fragment are sequenced. The sequenced data from each paired end are then aligned to the reference human genome sequence. The distance, or number of nucleotides, spanning the two ends could then be deduced, as could the whole length of the original nucleic acid fragment. Alternatively, sequencing platforms such as the 454 platform, and possibly some single molecule sequencing techniques, can sequence the full length of short nucleic acid fragments, for example 20 bp. In this manner, the actual length of the nucleic acid fragment is known immediately from the sequenced data.
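A minimal sketch of this post-sequencing size selection follows. It assumes each aligned read pair is summarized by the 1-based, inclusive coordinates of its outermost aligned bases on one chromosome; the record layout, field names, and function names are hypothetical and do not correspond to any particular aligner's output.

```python
def fragment_length(left_start, right_end):
    """Length of the original fragment spanned by a read pair, from the
    1-based inclusive coordinates of its outermost aligned bases."""
    return right_end - left_start + 1

def select_short(pairs, max_length_bp=300):
    """Keep read pairs whose inferred fragment is shorter than the limit."""
    return [
        p for p in pairs
        if fragment_length(p["left_start"], p["right_end"]) < max_length_bp
    ]

# Hypothetical aligned pairs: one 150 bp fragment and one 400 bp fragment.
pairs = [
    {"chrom": "chr21", "left_start": 1000, "right_end": 1149},
    {"chrom": "chr1", "left_start": 5000, "right_end": 5399},
]
short_pairs = select_short(pairs, max_length_bp=300)
```

Because the selection happens on the data rather than on the physical sample, the size limit can be changed and the analysis rerun without further laboratory work.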

Such paired-end analysis is also possible using other sequencing platforms, e.g., the Applied Biosystems SOLiD system. For the Roche 454 platform, because of its increased read length compared with other massively parallel sequencing systems, it is also possible to determine the length of a fragment from its complete sequence.

Focusing the data analysis on the subset of sequenced tags corresponding to short nucleic acid fragments in the original maternal plasma specimen is advantageous because the dataset would effectively be enriched with DNA sequences derived from the fetus. This is because fetal DNA molecules in maternal plasma consist of shorter fragments than the maternal background DNA molecules (Chan et al. Clin Chem 2004; 50: 88-92). According to FIG. 7, the number of sequenced tags required for differentiating euploid from trisomy 21 cases decreases as the fractional fetal DNA concentration increases.

Post-sequencing selection of subsets of the nucleic acid pool differs from other nucleic acid enrichment strategies that are performed before sample analysis, such as gel electrophoresis or size-exclusion columns for selecting nucleic acid molecules of a particular size, which require the enriched pool to be physically separated from the background pool of nucleic acids. Physical procedures introduce more experimental steps and may therefore be prone to problems such as contamination. Post-sequencing in silico selection of subsets of the sequencing output also allows the selection to be varied according to the sensitivity and specificity required for disease determination.

The bioinformatics, computational, and statistical approaches used to determine whether a maternal plasma sample was obtained from a pregnant woman carrying a trisomy 21 fetus or a euploid fetus can be compiled into a computer program product that determines parameters from the sequencing output. Running the computer program involves determining a quantitative amount for the potentially aneuploid chromosome and the amount(s) for one or more other chromosomes. A parameter is determined and compared with appropriate cutoff values to determine whether a fetal chromosomal aneuploidy exists for the potentially aneuploid chromosome.
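As a minimal sketch of that program flow, the function below forms a parameter from two amounts and compares it with a cutoff. The function names, the choice of a simple ratio as the parameter, and the cutoff values in the usage lines are illustrative assumptions; the text does not fix a particular parameter or cutoff here.

```python
def chromosome_parameter(target_amount, other_amount):
    """Ratio of the amount for the potentially aneuploid chromosome to the
    amount for one or more other chromosomes (illustrative parameter)."""
    if other_amount <= 0:
        raise ValueError("other_amount must be positive")
    return target_amount / other_amount


def exceeds_cutoff(parameter, cutoff):
    """True when the parameter is above the cutoff, which under the stated
    assumptions would flag over-representation of the target chromosome."""
    return parameter > cutoff


# Illustrative counts and cutoff only; these are not values from the study.
parameter = chromosome_parameter(target_amount=15, other_amount=1000)
flagged = exceeds_cutoff(parameter, cutoff=0.014)
```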

Examples

The following examples are offered to illustrate, but not to limit, the claimed invention.

I. Prenatal Diagnosis of Fetal Trisomy 21

Eight pregnant women were recruited for the study. All were in the first or second trimester of pregnancy and had singleton pregnancies. Four of them were each carrying a trisomy 21 fetus, and the other four were each carrying a euploid fetus. Twenty milliliters of peripheral venous blood was collected from each subject. Maternal plasma was harvested after centrifugation at 1600 × g for 10 minutes and was centrifuged again at 16,000 × g for 10 minutes. DNA was then extracted from 5-10 ml of each plasma sample. The maternal plasma DNA was used for massively parallel sequencing on the Illumina Genome Analyzer according to the manufacturer's instructions. The technicians performing the sequencing were blinded to the fetal diagnoses during sequencing and sequence data analysis.

Briefly, approximately 50 ng of maternal plasma DNA was used to prepare the DNA library. It is possible to start with smaller amounts, such as 15 ng or 10 ng of maternal plasma DNA. The maternal plasma DNA fragments were blunt-ended and ligated to Solexa adaptors, and fragments of 150-300 bp were selected by gel purification. Alternatively, the blunt-ended, adaptor-ligated maternal plasma DNA fragments can be passed through columns (e.g., AMPure, Agencourt) to remove unligated adaptors, without size selection before cluster generation. The adaptor-ligated DNA was hybridized to the surface of flow cells, DNA clusters were generated using an Illumina cluster station, and 36 cycles of sequencing followed on an Illumina Genome Analyzer. DNA from each maternal plasma sample was sequenced in one flow cell. Sequenced reads were compiled using the Solexa Analysis Pipeline. All reads were then aligned to the repeat-masked reference human genome sequence, the NCBI 36 assembly (GenBank accession numbers: NC_000001 to NC_000024), using the Eland application.

In this study, to reduce the complexity of the data analysis, only sequences that mapped to a unique location in the repeat-masked human genome reference were considered further. Alternatively, other subsets of the sequence data, or the entire set, can be used. The total number of uniquely mappable sequences was counted for each sample. The number of sequences uniquely aligned to chromosome 21 was expressed as a proportion of the total count of aligned sequences for each sample. Because maternal plasma contains fetal DNA in a background of DNA of maternal origin, a trisomy 21 fetus contributes extra sequenced tags from chromosome 21, owing to the additional copy of chromosome 21 in the fetal genome. Hence, the percentage of chromosome 21 sequences in maternal plasma from a pregnancy with a trisomy 21 fetus is higher than that from a pregnancy with a euploid fetus. The analysis does not require targeting fetal-specific sequences. Nor does it require prior physical separation of fetal nucleic acids from maternal nucleic acids, or distinguishing or identifying fetal sequences from maternal sequences after sequencing.
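The counting step above amounts to a simple proportion. The sketch below assumes the uniquely aligned tags are available as a list of chromosome labels; the function name and the `"chr21"` label format are assumptions made for the example.

```python
from collections import Counter


def percent_representation(tag_chromosomes, target="chr21"):
    """Percentage of uniquely aligned tags that map to the target chromosome.

    tag_chromosomes: one chromosome label per uniquely aligned tag.
    """
    counts = Counter(tag_chromosomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no aligned tags supplied")
    return 100.0 * counts[target] / total
```

Note that the calculation uses only chromosome labels: no tag is ever classified as fetal or maternal.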

Figure 3A shows the percentage of sequences mapped to chromosome 21 (the percentage representation of chromosome 21) for each of the eight maternal plasma DNA samples. The percentage representation of chromosome 21 was markedly higher in maternal plasma from trisomy 21 pregnancies than from euploid pregnancies. These data suggest that noninvasive prenatal diagnosis of fetal aneuploidy can be achieved by determining the percentage representation of the aneuploid chromosome in comparison with that of a reference population. Alternatively, over-representation of chromosome 21 can be detected by comparing the experimentally obtained percentage representation of chromosome 21 with the percentage representation of chromosome 21 sequences expected for a euploid human genome. This can be done with or without masking the repeat regions in the human genome.

Five of the eight pregnant women were each carrying a male fetus. Sequences mapped to the Y chromosome can be regarded as fetal-specific. The percentage of sequences mapped to the Y chromosome was used to calculate the fractional fetal DNA concentration in the original maternal plasma sample. The fractional fetal DNA concentration was also determined by microfluidic digital PCR involving the paralogous genes zinc finger protein, X-linked (ZFX) and zinc finger protein, Y-linked (ZFY).

Figure 3B shows the correlation between the fractional fetal DNA concentration inferred from the percentage representation of the Y chromosome by sequencing and that determined by ZFY/ZFX microfluidic digital PCR. The fractional fetal DNA concentrations in maternal plasma determined by the two methods were positively correlated. The correlation coefficient (r) was 0.917 by Pearson correlation analysis.

The percentages of maternal plasma DNA sequences aligned to each of the 24 chromosomes (22 autosomes plus the X and Y chromosomes) are shown in Figure 4A for two representative cases. One pregnant woman was carrying a trisomy 21 fetus, and the other was carrying a euploid fetus. The percentage representation of sequences mapped to chromosome 21 was higher in the woman carrying the trisomy 21 fetus than in the woman carrying the normal fetus.

The differences (%) in the percentage representation of each chromosome between the maternal plasma DNA samples of the two cases above are shown in Figure 4B. The percentage difference for a particular chromosome is calculated using the following formula:

Percentage difference (%) = (P₂₁ - P_E) / P_E × 100%, where

P₂₁ = the percentage of plasma DNA sequences aligned to the particular chromosome in the pregnant woman carrying a trisomy 21 fetus; and

P_E = the percentage of plasma DNA sequences aligned to the particular chromosome in the pregnant woman carrying a euploid fetus.

As shown in Figure 4B, chromosome 21 sequences were over-represented by 11% in the plasma of the pregnant woman carrying a trisomy 21 fetus compared with the pregnant woman carrying a euploid fetus. For sequences aligned to the other chromosomes, the differences between the two cases were within 5%. Because the percentage representation of chromosome 21 is increased in trisomy 21 compared with euploid maternal plasma samples, the difference (%) can alternatively be called the degree of over-representation of chromosome 21 sequences. Besides the difference (%) and the absolute difference between the percentage representations of chromosome 21, the ratio of the counts in the test and reference samples can also be calculated; this ratio likewise indicates the degree of over-representation of chromosome 21 in trisomy 21 compared with euploid samples.
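The three comparison measures mentioned above (percentage difference, absolute difference, and ratio) can be written down directly. This is a small sketch with assumed function names; the input values in the usage lines are illustrative, not data from Figure 4.

```python
def percent_difference(p_t21, p_e):
    """(P21 - PE) / PE x 100, following the formula in the text."""
    return (p_t21 - p_e) / p_e * 100.0


def absolute_difference(p_t21, p_e):
    """Difference between the two percentage representations."""
    return p_t21 - p_e


def representation_ratio(test_value, reference_value):
    """Ratio of the test sample value to the reference sample value."""
    return test_value / reference_value


# Illustrative inputs: a chromosome at 1.10% in the test case, 1.00% in the reference.
diff_pct = percent_difference(1.10, 1.00)
ratio = representation_ratio(1.10, 1.00)
```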

For the four pregnant women each carrying a euploid fetus, a mean of 1.345% of their plasma DNA sequences aligned to chromosome 21. Of the four pregnant women carrying a trisomy 21 fetus, three had male fetuses. The percentage representation of chromosome 21 was calculated for each of these three cases. The difference (%) in the percentage representation of chromosome 21 for each of the three trisomy 21 cases was then determined, as described above, relative to the mean percentage representation of chromosome 21 derived from the four euploid cases. In other words, the mean of the four cases carrying a euploid fetus was used as the reference in this calculation. The fractional fetal DNA concentrations of the three male trisomy 21 cases were inferred from their respective percentage representations of Y chromosome sequences.

The correlation between the degree of over-representation of chromosome 21 sequences and the fractional fetal DNA concentration is shown in Figure 5. There was a significant positive correlation between the two parameters. The correlation coefficient (r) was 0.898 by Pearson correlation analysis. These results indicate that the degree of over-representation of chromosome 21 sequences in maternal plasma is related to the fractional concentration of fetal DNA in the maternal plasma sample. Thus, cutoff values for the degree of chromosome 21 sequence over-representation that take the fractional fetal DNA concentration into account can be determined to identify pregnancies involving trisomy 21 fetuses.

The fractional concentration of fetal DNA in maternal plasma can also be determined separately from the sequencing run. For example, the Y chromosome DNA concentration can be determined beforehand using real-time PCR, microfluidic PCR, or mass spectrometry. We have shown in Figure 3B that there is a good correlation between the fetal DNA concentration estimated from the Y chromosome count generated during the sequencing run and the ZFY/ZFX ratio generated outside the sequencing run. In fact, the fetal DNA concentration can be determined using loci other than the Y chromosome, which makes the approach applicable to female fetuses. For example, Chan et al. showed that fetal-derived methylated RASSF1A sequences can be detected in the plasma of pregnant women against a background of maternally derived unmethylated RASSF1A sequences (Chan et al. Clin Chem 2006; 52: 2211-8). The fractional fetal DNA concentration can thus be determined by dividing the amount of methylated RASSF1A sequences by the amount of total (methylated and unmethylated) RASSF1A sequences.
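The RASSF1A calculation in the last sentence above is a single division. The sketch below assumes the two amounts have already been measured in the same units (for example, copies per millilitre of plasma); the function name is an assumption.

```python
def fetal_fraction_rassf1a(methylated, unmethylated):
    """Fractional fetal DNA concentration from RASSF1A amounts.

    methylated:   amount of methylated (fetal-derived) RASSF1A sequences
    unmethylated: amount of unmethylated (maternal-derived) RASSF1A sequences
    Returns methylated / (methylated + unmethylated).
    """
    total = methylated + unmethylated
    if total <= 0:
        raise ValueError("total RASSF1A amount must be positive")
    return methylated / total
```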

Maternal plasma is expected to be preferable to maternal serum for practicing the invention, because DNA is released from maternal blood cells during blood clotting. Thus, if serum is used, the fractional concentration of fetal DNA is expected to be lower in maternal serum than in maternal plasma. In other words, if maternal serum is used, more sequences are expected to be needed to diagnose a fetal chromosomal aneuploidy than with a plasma sample obtained from the same pregnant woman at the same time.

Yet another way of determining the fractional concentration of fetal DNA is to quantify polymorphic differences between the pregnant woman and the fetus (Dhallan R, et al. 2007 Lancet, 369, 474-481). An example of this method is to target polymorphic sites at which the pregnant woman is homozygous and the fetus is heterozygous. The amount of the fetal-specific allele is compared with the amount of the common allele to determine the fractional concentration of fetal DNA.

Existing techniques for detecting chromosomal aberrations, including comparative genomic hybridization, microarray comparative genomic hybridization, and quantitative real-time polymerase chain reaction, detect and quantify one or more specific sequences. In contrast, massively parallel sequencing does not depend on the detection or analysis of a predetermined or predefined set of DNA sequences. A random representative fraction of the DNA molecules in the sample pool is sequenced. The numbers of sequenced tags aligned to various chromosomal regions are compared between samples that do and do not contain the DNA species of interest. Chromosomal aberrations are revealed by differences in the number (or percentage) of sequences aligned to any given chromosomal region in the samples.

In another embodiment, sequencing of plasma cell-free DNA can be used to detect chromosomal aberrations in plasma DNA for the detection of a specific cancer. Different cancers have a characteristic set of chromosomal aberrations. Changes (amplifications and deletions) in multiple chromosomal regions may be used. Thus, the proportion of sequences aligned to amplified regions would increase, and the proportion of sequences aligned to deleted regions would decrease. The percentage representation of each chromosome can be compared with the size of the corresponding chromosome in a reference genome, expressed as the percentage of the whole genome that any given chromosome represents. Direct comparisons, or comparisons with a reference chromosome, may also be used.

II. Sequencing Only a Fraction of the Human Genome

In the experiment described in Example I above, maternal plasma DNA from each individual sample was sequenced using only one flow cell. The number of sequenced tags generated from each tested sample by the sequencing run is shown in Figure 6. T21 denotes a sample obtained from a pregnancy involving a trisomy 21 fetus.

Because 36 bp of each sequenced maternal plasma DNA fragment was sequenced, the number of nucleotides (base pairs) sequenced from each sample can be determined by multiplying the count of sequenced tags by 36 bp; this figure is also shown in Figure 6. Because the human genome contains approximately 3 billion base pairs, the amount of sequence data generated from each maternal plasma sample corresponds to only about 10% to 13% of the genome.

Furthermore, in this study, as described in Example I above, only the uniquely mappable sequenced tags, termed U0 in the nomenclature of the Eland software, were used to demonstrate the over-representation of chromosome 21 sequences in the maternal plasma samples from each of the pregnancies with a trisomy 21 fetus. As shown in Figure 6, the U0 sequences are only a subset of all the sequenced tags generated from each sample and correspond to an even smaller proportion, about 2%, of the human genome. These data indicate that sequencing only a portion of the human genomic sequences present in the tested sample is sufficient for diagnosing fetal aneuploidy.

III. Determination of the Number of Sequences Required

The sequencing result of the plasma DNA from a pregnant woman carrying a euploid male fetus was used for this analysis. The number of sequenced tags that could be mapped to the reference human genome sequence without mismatches was 1,990,000. Subsets of sequences were randomly chosen from these 1,990,000 tags, and the percentage of sequences aligned to chromosome 21 was calculated within each subset. The number of sequences in the subsets was varied from 60,000 to 540,000. For each subset size, multiple subsets of the same number of sequenced tags were compiled by randomly selecting sequenced tags from the total pool until no other combination was possible. The mean percentage of sequences aligned to chromosome 21, and its standard deviation (SD), were then calculated from the multiple subsets for each subset size. These data were compared across subset sizes to determine the effect of subset size on the distribution of the percentage of sequences aligned to chromosome 21. The 5th and 95th percentiles of the percentages were then calculated from the mean and SD.
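The resampling procedure above can be sketched as follows. Two points are assumptions rather than details stated in the text: "until no other combination was possible" is read here as splitting the shuffled pool into non-overlapping subsets, and the 5th and 95th percentiles are derived from the mean and SD with a normal approximation (mean ± 1.645 × SD).

```python
import random
import statistics

Z_95 = 1.645  # 95th-percentile z-score of the standard normal distribution


def subset_statistics(is_chr21, subset_size, seed=0):
    """Mean, SD, 5th and 95th percentiles of the chr21 percentage across subsets.

    is_chr21: one boolean per mismatch-free aligned tag, True if on chr21.
    The pool is shuffled and cut into disjoint subsets of `subset_size` tags.
    """
    pool = list(is_chr21)
    random.Random(seed).shuffle(pool)
    n_subsets = len(pool) // subset_size
    if n_subsets < 2:
        raise ValueError("need at least two subsets to estimate an SD")
    percents = []
    for i in range(n_subsets):
        chunk = pool[i * subset_size:(i + 1) * subset_size]
        percents.append(100.0 * sum(chunk) / subset_size)
    mean = statistics.mean(percents)
    sd = statistics.stdev(percents)
    return mean, sd, mean - Z_95 * sd, mean + Z_95 * sd
```

Repeating the call for several values of `subset_size` gives the kind of size-versus-spread comparison that the paragraph describes.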

When a pregnant woman is carrying a trisomy 21 fetus, the sequenced tags aligned to chromosome 21 should be over-represented in maternal plasma because of the extra dose of chromosome 21 from the fetus. The degree of over-representation depends on the fetal DNA percentage in the maternal plasma DNA sample and is given by the following equation:

Per_T21 = Per_Eu × (1 + f/2), where

Per_T21 is the percentage of sequences aligned to chromosome 21 in a woman carrying a trisomy 21 fetus;

Per_Eu is the percentage of sequences aligned to chromosome 21 in a woman carrying a euploid fetus; and

f is the fetal DNA percentage in maternal plasma DNA, entered into the equation as a fraction (for example, 0.1 for 10%).
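A worked instance of the equation, as a short sketch: the function name is an assumption, the euploid value of 1.345% is the mean reported for the four euploid cases in Example I, and the 10% fetal DNA percentage is chosen only for illustration.

```python
def expected_t21_percent(per_eu, fetal_fraction):
    """Per_T21 = Per_Eu * (1 + f/2), with f given as a fraction (0.10 = 10%)."""
    return per_eu * (1.0 + fetal_fraction / 2.0)


# With Per_Eu = 1.345% and f = 0.10, the expected value is 1.345 * 1.05.
per_t21 = expected_t21_percent(1.345, 0.10)
```

The factor f/2 reflects that a trisomic fetus contributes three copies of chromosome 21 instead of two, so the fetal share of chromosome 21 sequences rises by one half.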

As shown in Figure 7, the SD of the percentage of sequences aligned to chromosome 21 decreases as the number of sequences in each subset increases. Consequently, the interval between the 5th and 95th percentiles narrows as the number of sequences in each subset increases. When the 5%-95% intervals of the euploid and trisomy 21 cases do not overlap, the two groups of cases can be distinguished with an accuracy greater than 95%.
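The non-overlap criterion above can be expressed as a small check. This sketch assumes the percentile bounds are taken as mean ± 1.645 × SD (a normal approximation); the SD values in the test of the idea are illustrative and would in practice come from the subset analysis at a given subset size.

```python
Z_95 = 1.645  # 95th-percentile z-score of the standard normal distribution


def percentile_interval(mean, sd):
    """Approximate 5th and 95th percentiles from a mean and SD."""
    return mean - Z_95 * sd, mean + Z_95 * sd


def distinguishable(eu_mean, eu_sd, t21_mean, t21_sd):
    """True when the trisomy 21 interval lies entirely above the euploid one."""
    _, eu_upper = percentile_interval(eu_mean, eu_sd)
    t21_lower, _ = percentile_interval(t21_mean, t21_sd)
    return t21_lower > eu_upper
```

A smaller SD (larger subsets) or a larger gap between the means (higher fetal DNA percentage) both make the check more likely to pass.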

As shown in Figure 7, the minimum subset size needed to distinguish trisomy 21 cases from euploid cases depends on the fetal DNA percentage. The minimum subset sizes were 120,000, 180,000, and 540,000 sequences for fetal DNA percentages of 20%, 10%, and 5%, respectively. In other words, when a maternal plasma DNA sample contains 20% fetal DNA, 120,000 sequences need to be analyzed to determine whether the fetus has trisomy 21. When the fetal DNA percentage falls to 5%, the number of sequences that need to be analyzed rises to 540,000.

Because the data were generated with 36-base-pair sequencing, 120,000, 180,000, and 540,000 sequences correspond to 0.14%, 0.22%, and 0.65% of the human genome, respectively. Because the lower range of fetal DNA concentrations in maternal plasma obtained from early pregnancies has been reported to be about 5% (Lo, YMD et al. 1998 Am J Hum Genet 62, 768-775), sequencing about 0.6% of the human genome may represent the minimum amount of sequencing required to diagnose fetal chromosomal aneuploidy with at least 95% accuracy in any pregnancy.
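The conversion from tag counts to a percentage of the genome is straightforward arithmetic, sketched below. The 36 bp read length and the approximate genome size of 3 billion base pairs are the figures given in the text; the function and constant names are assumptions.

```python
READ_LENGTH_BP = 36
GENOME_SIZE_BP = 3_000_000_000  # approximate size of the human genome


def percent_of_genome(tag_count, read_length=READ_LENGTH_BP,
                      genome_size=GENOME_SIZE_BP):
    """Sequenced bases as a percentage of the genome size."""
    return 100.0 * tag_count * read_length / genome_size


# 120,000, 180,000 and 540,000 tags give 0.144%, 0.216% and 0.648%,
# matching the rounded values of 0.14%, 0.22% and 0.65% quoted in the text.
percentages = [percent_of_genome(n) for n in (120_000, 180_000, 540_000)]
```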

IV. Random Sequencing

To illustrate that the sequenced DNA fragments were randomly selected during the sequencing run, we took the sequenced tags generated from the eight maternal plasma samples analyzed in Example I. For each maternal plasma sample, we determined the start position, relative to the reference human genome sequence (NCBI assembly 36), of each 36 bp sequenced tag that aligned uniquely to chromosome 21 without mismatches. We then sorted the start positions of the aligned sequenced tags from each sample in ascending order. We performed a similar analysis for chromosome 22. For illustration, the first ten start positions on chromosome 21 and on chromosome 22 for each maternal plasma sample are shown in Figures 8A and 8B, respectively. These tables show that the sequenced pools of DNA fragments differed between samples.
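The listing step above (sort the start positions and report the first ten) can be sketched as follows. The representation of each tag as a `(chromosome, start)` tuple and the function names are assumptions made for the example.

```python
def first_start_positions(tags, chrom, n=10):
    """First n start positions, in ascending order, of tags on one chromosome.

    tags: iterable of (chromosome, start_position) tuples for tags that
    aligned uniquely and without mismatches.
    """
    starts = sorted(start for c, start in tags if c == chrom)
    return starts[:n]


def shared_positions(sample_a, sample_b, chrom, n=10):
    """Start positions that two samples have in common among their first n."""
    a = set(first_start_positions(sample_a, chrom, n))
    b = set(first_start_positions(sample_b, chrom, n))
    return sorted(a & b)
```

If the fragments are sampled at random, `shared_positions` would be expected to return few or no positions for any two samples.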

Any of the software components or functions described in this application may be implemented as software code to be executed by a processor, using any suitable computer language such as Java, C++, or Perl and using, for example, conventional or object-oriented techniques. The software code may be stored as a series of instructions or commands on a computer-readable medium for storage and/or transmission. Suitable media include random access memory (RAM), read-only memory (ROM), magnetic media such as a hard drive or a floppy disk, optical media such as a compact disc (CD) or DVD (digital versatile disc), flash memory, and the like. The computer-readable medium may be any combination of such storage or transmission devices.

Such programs may also be encoded and transmitted using carrier signals adapted for transmission over wired, optical, and/or wireless networks conforming to a variety of protocols, including the Internet. A computer-readable medium according to an embodiment of the present invention may therefore be created using a data signal encoded with such programs. Computer-readable media encoded with the program code may be packaged with a compatible device or provided separately from other devices (e.g., via Internet download). Any such computer-readable medium may reside on or within a single computer program product (e.g., a hard drive or an entire computer system), and may be present on or within different computer program products within a system or network. A computer system may include a monitor, printer, or other suitable display for presenting any of the results mentioned herein to a user.

An example of a computer system is shown in Figure 9. The subsystems shown in Figure 9 are interconnected via a system bus 975. Additional subsystems are shown, such as a printer 974, a keyboard 978, a hard disk 979, and a monitor 976 coupled to a display adapter 982. Peripherals and input/output (I/O) devices, which couple to an I/O controller 971, can be connected to the computer system by any number of means known in the art, such as a serial port 977. For example, the serial port 977 or an external interface 981 can be used to connect the computer apparatus to a wide area network such as the Internet, to a mouse input device, or to a scanner. The interconnection via the system bus allows the central processor 973 to communicate with each subsystem and to control the execution of instructions from the system memory 972 or the hard disk 979, as well as the exchange of information between subsystems. The system memory 972 and/or the hard disk 979 may embody a computer-readable medium.

The above description of exemplary embodiments of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form described, and many modifications and variations are possible in light of the teaching above. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, thereby enabling others skilled in the art to best utilize the invention in various embodiments and with various modifications as are suited to the particular use contemplated.

All publications, patents, and patent applications cited herein are hereby incorporated by reference in their entirety for all purposes.