CN103710336B

Info

Publication number: CN103710336B
Application number: CN201210379402.8A
Authority: CN
Inventors: 祝珍珍; 黄文潘; 章文蔚; 陈茂山; 张艳艳
Original assignee: BGI Technology Solutions Co Ltd
Current assignee: BGI Technology Solutions Co Ltd
Priority date: 2012-09-29
Filing date: 2012-09-29
Publication date: 2017-02-22
Anticipated expiration: 2032-09-29
Also published as: WO2014048185A1; CN103710336A

Abstract

Translated from Chinese

The present invention provides a method for enriching transcripts from an RNA sample and uses thereof. The method comprises treating the RNA sample with an enrichment reagent so as to enrich transcripts, wherein the enrichment reagent has 5'-monophosphate exonuclease activity and the transcripts are RNA molecules having a cap structure or a triphosphate group at their 5' ends. Transcripts can be enriched effectively using this method.

Description

Method for enriching transcripts from an RNA sample and uses thereof

Technical Field

The present invention relates to the field of biotechnology, and in particular to a method for enriching transcripts from an RNA sample and uses thereof. More specifically, it relates to a method for enriching transcripts from an RNA sample, a method for constructing a sequencing library, a sequencing library, a method for sequencing a nucleic acid sample, a method for determining a transcription start site (TSS), an enrichment reagent for enriching transcripts from an RNA sample, an apparatus for constructing a sequencing library, a nucleic acid sample sequencing device, and a system for determining TSSs.

Background

Transcription of a gene begins when RNA polymerase binds to the promoter of the DNA template. Elongation then proceeds from the transcription start site (abbreviated herein as TSS) and ultimately produces a complete RNA. Every RNA molecule in an organism begins at a TSS, so studying TSSs by high-throughput sequencing helps to infer promoter positions and structures across the whole genome and thereby to gain a global view of the transcriptional regulatory network. TSS studies also help to correct existing gene annotations and to discover new genes.

However, current approaches to studying TSSs still leave room for improvement.

Summary of the Invention

The present invention aims to solve at least one of the above technical problems at least to some extent, or at least to provide a useful commercial alternative. To this end, one object of the invention is to provide a means of effectively enriching transcripts and thereby effectively determining TSSs.

The present invention was completed on the basis of the following findings of the inventors:

At present, high-throughput sequencing methods for studying TSSs generally target RNA with a cap structure and capture the 5' end of RNA molecules by CAGE or RACE. Common examples are deepCAGE, PEAT, deep-RACE, nanoCAGE and CAGEscan. Of these, deepCAGE, PEAT, deep-RACE and CAGEscan require cumbersome steps such as enzymatic digestion, demand large amounts of RNA, and produce short sequencing reads (about 20 nt). They are suitable only for capped RNA and cannot be used to study the TSSs of prokaryotic RNA, which lacks a cap structure. nanoCAGE is simpler to perform and needs less RNA, but it too applies only to capped RNA, and its data contain a relatively large number of false positives. The inventors found that a 5'-monophosphate exonuclease specifically degrades RNA bearing a 5' monophosphate while leaving intact the RNA molecules that bear a 5' cap or a 5' triphosphate. The enzyme can therefore be used to enrich transcripts, so that high-throughput TSS sequencing can be applied to both eukaryotic and prokaryotic RNA, with the advantages of simple operation, high accuracy and low cost.

In a first aspect, the present invention provides a method for enriching transcripts from an RNA sample. According to an embodiment of the invention, the method comprises treating the RNA sample with an enrichment reagent so as to enrich transcripts, wherein the enrichment reagent has 5'-monophosphate exonuclease activity and the transcripts are RNA molecules having a cap structure or a 5' triphosphate at their 5' ends. Because a 5'-monophosphate exonuclease specifically degrades RNA bearing a 5' monophosphate without degrading intact RNA molecules bearing a 5' cap or a 5' triphosphate, an enrichment reagent with this activity enriches transcripts effectively. The method can therefore be applied to high-throughput TSS sequencing of both eukaryotic and prokaryotic RNA, and has the advantages of simple operation, high accuracy and low cost.
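
The selection rule behind this first aspect can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the enzyme is treated as ideal (every 5'-monophosphate RNA is degraded, every capped or triphosphorylated RNA survives), and the names `FivePrimeEnd`, `RnaMolecule` and `enrich_transcripts`, as well as the three example molecules, are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from enum import Enum


class FivePrimeEnd(Enum):
    CAP = "cap"                      # typical of eukaryotic primary transcripts
    TRIPHOSPHATE = "triphosphate"    # typical of prokaryotic primary transcripts
    MONOPHOSPHATE = "monophosphate"  # processed or degraded RNA


@dataclass(frozen=True)
class RnaMolecule:
    name: str
    five_prime_end: FivePrimeEnd


def enrich_transcripts(sample):
    """Return the molecules that survive an idealized 5'-monophosphate exonuclease."""
    protected = {FivePrimeEnd.CAP, FivePrimeEnd.TRIPHOSPHATE}
    return [rna for rna in sample if rna.five_prime_end in protected]


# Illustrative input only; these molecules are not data from the patent.
sample = [
    RnaMolecule("capped_mRNA", FivePrimeEnd.CAP),
    RnaMolecule("bacterial_primary_transcript", FivePrimeEnd.TRIPHOSPHATE),
    RnaMolecule("processed_fragment", FivePrimeEnd.MONOPHOSPHATE),
]
enriched = enrich_transcripts(sample)
print([rna.name for rna in enriched])
```

In this model the capped and the triphosphorylated molecule are both retained, which is why a single treatment can serve eukaryotic and prokaryotic samples alike.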

In a second aspect, the present invention provides a method for constructing a sequencing library. According to an embodiment of the invention, the method comprises: enriching transcripts from an RNA sample according to the method described above; removing the 5' cap structure or 5' triphosphate of the transcripts to obtain transcripts lacking the 5' cap structure or 5' triphosphate; ligating an RNA adapter to the 5' end of those transcripts to obtain adapter-ligated transcripts; reverse-transcribing the adapter-ligated transcripts to obtain cDNA corresponding to the transcripts; amplifying the cDNA to obtain an amplification product; and constructing a sequencing library based on the amplification product. This method effectively constructs a sequencing library from the transcripts enriched from a nucleic acid sample. It can therefore be applied to high-throughput TSS sequencing of both eukaryotic and prokaryotic RNA, and has the advantages of simple operation, high accuracy and low cost.

In a third aspect, the present invention provides a sequencing library, characterized in that it is constructed by the method described above. With this sequencing library, RNA transcripts can be sequenced effectively. The library can be applied to high-throughput TSS sequencing of both eukaryotic and prokaryotic RNA, and has the advantages of simple operation, high accuracy and low cost.

In a fourth aspect, the present invention provides a method for sequencing a nucleic acid sample. According to an embodiment of the invention, the method comprises: constructing a sequencing library according to the method described above; and sequencing the sequencing library to obtain a sequencing result. With this method, RNA transcripts can be sequenced effectively. The method can be applied to high-throughput TSS sequencing of both eukaryotic and prokaryotic RNA, and has the advantages of simple operation, high accuracy and low cost.

In a fifth aspect, the present invention provides a method for determining TSSs. According to an embodiment of the invention, the method comprises: extracting an RNA sample from a host; obtaining, by the method described above, a sequencing result consisting of a plurality of sequencing reads; and determining the TSSs based on the sequencing result. With this method, transcription start sites can be determined effectively. The method can be applied to high-throughput TSS sequencing of both eukaryotic and prokaryotic RNA, and has the advantages of simple operation, high accuracy and low cost.
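
One way the last step, determining TSSs from the sequencing result, might be approached is sketched below in Python. This is not the analysis pipeline of the patent, which is not detailed in this passage. The sketch assumes that each read has already been aligned and that, because the adapter is ligated to the 5' end of the transcript, the 5' end of each alignment marks a candidate TSS. The function names, the tuple layout `(chrom, start, end, strand)` with 1-based inclusive coordinates, and the `min_reads` threshold are all hypothetical.

```python
from collections import Counter


def five_prime_position(start, end, strand):
    """Return the 5' end of an aligned read (1-based, inclusive coordinates)."""
    return start if strand == "+" else end


def call_tss(alignments, min_reads=2):
    """Count read 5' ends per (chrom, strand, position) and keep supported sites.

    min_reads is an arbitrary illustrative cutoff, not a value from the patent.
    """
    counts = Counter(
        (chrom, strand, five_prime_position(start, end, strand))
        for chrom, start, end, strand in alignments
    )
    return {site: n for site, n in counts.items() if n >= min_reads}


# Made-up alignments for illustration only.
alignments = [
    ("chr1", 100, 149, "+"),
    ("chr1", 100, 149, "+"),
    ("chr1", 100, 139, "+"),
    ("chr1", 300, 349, "-"),
    ("chr1", 310, 349, "-"),
    ("chr1", 500, 549, "+"),
]
candidates = call_tss(alignments, min_reads=2)
for site, n in sorted(candidates.items()):
    print(site, n)
```

On the minus strand the 5' end is the rightmost aligned base, which is why the helper switches between `start` and `end`. The site supported by a single read is dropped by the threshold.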

In a sixth aspect, the present invention provides an enrichment reagent for enriching transcripts from an RNA sample. According to an embodiment of the invention, the enrichment reagent has 5'-monophosphate exonuclease activity. With this enrichment reagent, transcripts can be enriched effectively. The reagent can therefore be applied to high-throughput TSS sequencing of both eukaryotic and prokaryotic RNA, and has the advantages of simple operation, high accuracy and low cost.

In a seventh aspect, the present invention provides an apparatus for constructing a sequencing library. According to an embodiment of the invention, the apparatus comprises: a transcript enrichment unit, in which the enrichment reagent described above is provided so as to enrich transcripts from an RNA sample; an end-trimming unit, connected to the transcript enrichment unit and adapted to remove the 5' cap structure or 5' triphosphate of the transcripts to obtain transcripts lacking the 5' cap structure or 5' triphosphate; an RNA adapter ligation unit, connected to the end-trimming unit and adapted to ligate an RNA adapter to the 5' end of those transcripts to obtain adapter-ligated transcripts; a reverse transcription unit, connected to the RNA adapter ligation unit and adapted to reverse-transcribe the adapter-ligated transcripts to obtain cDNA corresponding to the transcripts; an amplification unit, connected to the reverse transcription unit and adapted to amplify the cDNA to obtain an amplification product; and a library construction unit, connected to the amplification unit and adapted to construct a sequencing library based on the amplification product. With this apparatus, a sequencing library can be constructed effectively from the transcripts enriched from a nucleic acid sample. The apparatus can therefore be applied to high-throughput TSS sequencing of both eukaryotic and prokaryotic RNA, and has the advantages of simple operation, high accuracy and low cost.

In an eighth aspect, the present invention provides a nucleic acid sample sequencing device, characterized in that it comprises: a library construction apparatus, which is the apparatus described above, for constructing a sequencing library from a nucleic acid sample; and a sequencing apparatus, connected to the library construction apparatus and adapted to sequence the sequencing library to obtain a sequencing result. With this device, RNA transcripts can be sequenced effectively. The device can be applied to high-throughput TSS sequencing of both eukaryotic and prokaryotic RNA, and has the advantages of simple operation, high accuracy and low cost.

In a ninth aspect, the present invention provides a system for determining TSSs. According to an embodiment of the invention, the system comprises: a sample extraction device for extracting an RNA sample from a host; a nucleic acid sample sequencing device, connected to the sample extraction device, which is the nucleic acid sample sequencing device described above and sequences the RNA sample to obtain a sequencing result consisting of a plurality of sequencing reads; and a TSS determination device, connected to the sequencing device and adapted to determine TSSs based on the sequencing result. According to embodiments of the invention, this system can effectively determine the TSSs in a nucleic acid sample.

Additional aspects and advantages of the invention will be given in part in the description that follows, and in part will become apparent from that description or be learned through practice of the invention.

Brief Description of the Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the drawings, in which:

Figure 1 shows a schematic flow chart of a method for constructing a sequencing library according to an embodiment of the present invention;

Figure 2 shows a schematic flow chart of the informatics analysis for determining TSS sequences according to an embodiment of the present invention;

Figure 3 shows a schematic diagram of a system for determining TSSs according to an embodiment of the present invention;

Figure 4 shows a schematic diagram of a nucleic acid sample sequencing device according to an embodiment of the present invention;

Figure 5 shows a schematic diagram of an apparatus for constructing a sequencing library according to an embodiment of the present invention;

Figure 6 shows a schematic diagram of a device for determining TSSs according to an embodiment of the present invention;

Figure 7 shows the genomic distribution of the filtered TSSs according to an embodiment of the present invention. The upper and lower panels show the TSS distributions for the human RNA sample and the Escherichia coli RNA sample, respectively. Position 0 is the start of the gene coding region, and transcription initiates upstream of it; as the figure shows, most of the sequences fall upstream of the coding region;

Figure 8 shows the TSS profiles of eight human RNA samples according to an embodiment of the present invention, from which the distribution of TSSs in the different samples can be seen;

Figure 9 shows the base distribution upstream of the TSS according to an embodiment of the present invention, where position 1 on the horizontal axis corresponds to the TSS, which is predominantly a purine (A/G). The upper panel shows the base distribution upstream of the TSS for the human RNA sample, with a distinct GC-rich region, which is also the main promoter type in eukaryotes; the lower panel shows the base distribution upstream of the TSS for the E. coli RNA sample, where a typical TATA box can also be found in the upstream -10 region;

Figure 10 shows the length distribution of the 5'UTR, that is, the distance from the TSS to the coding region, according to an embodiment of the present invention. The upper panel shows the 5'UTR length distribution for the human RNA sample, and the lower panel shows the 5'UTR length distribution for the E. coli RNA sample;

Figure 11 shows a correlation analysis, which provides an assessment of the reliability of the experimental results and the stability of the procedure. The upper panel shows two replicates of the human RNA sample, and the lower panel shows two replicates of the E. coli RNA sample; and

Figure 12 shows a schematic diagram of gene prediction results according to an embodiment of the present invention. The upper panel shows the TSS distribution of two human genes, NM_018997 and NM_031901, both of which undergo alternative splicing. Red vertical lines indicate the filtered TSSs, black vertical lines indicate the sequences obtained before filtering, blue horizontal lines represent the exons of a gene, and yellow horizontal lines represent its introns. The lower panel shows the TSS distribution of an E. coli operon. Prokaryotic genes have no introns, so only blue horizontal lines representing genes appear, and the four genes of this operon share a single TSS.
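
The quantity plotted in Figure 10, the 5'UTR length taken as the distance from the TSS to the start of the coding region, can be written as a short Python helper. This is a sketch under assumptions that are not stated in the patent: coordinates are 1-based genomic positions, the distance is measured on the genome (so any intron inside the 5'UTR would be counted), and the function name and the example coordinates are hypothetical.

```python
def five_prime_utr_length(tss, cds_start, strand):
    """Genomic distance from the TSS to the first base of the coding region.

    On the "+" strand the TSS has the smaller coordinate; on the "-" strand
    it has the larger one. A TSS downstream of the coding start is rejected.
    """
    length = cds_start - tss if strand == "+" else tss - cds_start
    if length < 0:
        raise ValueError("TSS lies downstream of the coding start")
    return length


# Made-up coordinates for illustration only.
plus_example = five_prime_utr_length(1000, 1120, "+")
minus_example = five_prime_utr_length(5000, 4950, "-")
print(plus_example, minus_example)
```

Collecting this value for every gene with an assigned TSS would give a length distribution of the kind the figure describes.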

Detailed Description

Embodiments of the present invention are described in detail below, and examples of the embodiments are illustrated in the drawings, in which identical or similar reference numerals denote identical or similar elements, or elements having identical or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary. They are intended to explain the invention and are not to be construed as limiting it.

In the present invention, unless otherwise expressly specified and limited, the terms "mounted", "coupled", "connected" and "fixed" are to be understood broadly. A connection may, for example, be fixed, detachable or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediate medium, or an internal communication between two elements. A person of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the circumstances. In addition, the terms "upstream" and "downstream" as used herein are defined with respect to the 5'-to-3' direction.

In a first aspect, the present invention provides a method for enriching transcripts from an RNA sample. According to an embodiment of the invention, the method comprises treating the RNA sample with an enrichment reagent so as to enrich transcripts, wherein the enrichment reagent has 5'-monophosphate exonuclease activity and the transcripts are RNA molecules having a cap structure or a triphosphate at their 5' ends. According to embodiments of the invention, examples of enzymes having 5'-monophosphate exonuclease activity may include the exoribonuclease XRN-1, Terminator™ 5'-Phosphate-Dependent Exonuclease, or TAKARA™ Alkaline Phosphatase. Because a 5'-monophosphate exonuclease specifically degrades RNA bearing a 5' monophosphate without degrading intact RNA molecules bearing a 5' cap or a 5' triphosphate, an enrichment reagent with this activity enriches transcripts effectively. The method can therefore be applied to high-throughput TSS sequencing of both eukaryotic and prokaryotic RNA, and has the advantages of simple operation, high accuracy and low cost.

According to an embodiment of the invention, the method for enriching transcripts from an RNA sample may use any enrichment reagent having 5'-monophosphate exonuclease activity. According to embodiments of the invention, examples of enzymes having 5'-monophosphate exonuclease activity may include the exoribonuclease XRN-1, Terminator™ 5'-Phosphate-Dependent Exonuclease, or TAKARA™ Alkaline Phosphatase. According to one embodiment, the enrichment reagent contains DNase I, which further improves the specificity and efficiency with which 5'-monophosphate RNA is degraded and thereby further improves the efficiency of transcript enrichment. According to one embodiment, the enrichment reagent may further contain a buffer and a soluble salt, so as to further increase the enzymatic activity of DNase I. According to one embodiment, the pH of the enrichment reagent is 8.0. According to one embodiment, the buffer is Tris-HCl, and the soluble salt is at least one selected from sodium chloride and magnesium chloride. According to one embodiment, the RNA sample is treated with the enrichment reagent at 30 degrees Celsius. The efficiency of enriching transcripts with the enrichment reagent according to embodiments of the invention can thereby be further improved.

In a second aspect, the present invention provides a method for constructing a sequencing library. Referring to Figure 1, according to an embodiment of the invention, the method for constructing a sequencing library comprises:

S100 (transcript enrichment): transcripts are enriched from an RNA sample according to the method described above. This step has been described in detail above and is not repeated here.

S200 (end trimming): the 5' cap structure or 5' triphosphate of the transcripts is removed to obtain transcripts lacking the 5' cap structure or 5' triphosphate. According to one embodiment of the invention, the 5' cap structure or 5' triphosphate is removed with an end-trimming reagent that has tobacco acid pyrophosphatase (TAP) activity. According to one embodiment, the end-trimming reagent comprises tobacco acid pyrophosphatase, a soluble salt, EDTA, β-mercaptoethanol and Triton X-100. According to one embodiment, the soluble salt is sodium acetate. According to one embodiment, the pH of the end-trimming reagent is 7.5. This further improves the end trimming of the RNA, that is, the 5' cap structure or 5' triphosphate of the transcripts is removed effectively, which improves the efficiency of sequencing library construction.

S300 (adapter ligation): an RNA adapter is ligated to the 5' end of the transcripts from which the 5' cap structure or 5' triphosphate has been removed, to obtain adapter-ligated transcripts. According to one embodiment of the invention, the RNA adapter is ligated with a ligation reagent that has T4 RNA ligase activity. According to one embodiment, the ligation reagent comprises T4 RNA ligase, a buffer, a soluble salt and dithiothreitol. According to one embodiment, the pH of the ligation reagent is 7.5. According to one embodiment, the buffer is Tris-HCl. According to one embodiment, the soluble salt is magnesium chloride. According to one embodiment, the ligation is carried out at 30 degrees Celsius. This improves the efficiency of adapter ligation and thereby the efficiency of sequencing library construction.

S400 (reverse transcription): the adapter-ligated transcripts are reverse-transcribed to obtain cDNA corresponding to the transcripts. According to an embodiment of the invention, the reverse transcription primer carries at its end a sequence corresponding to the RNA adapter, so that the resulting cDNA also carries an adapter at its end, which facilitates subsequent library construction and sequencing. As used herein, "corresponding to the RNA adapter" means that the sequence contained in the reverse transcription primer can pair with the RNA adapter and support an amplification reaction, yielding cDNA with adapters at both ends. For example, one of the two reverse transcription primers contains the same sequence as one of the RNA adapters, while the other contains a sequence complementary to the other RNA adapter. According to one embodiment of the invention, an oligonucleotide having the sequence shown in SEQ ID NO: 1 is used as the reverse transcription primer. According to one embodiment, at least one N in the reverse transcription primer (SEQ ID NO: 1) carries a phosphorothioate modification, which prevents the primer from being degraded by nucleases. According to one embodiment, the penultimate N in the reverse transcription primer (SEQ ID NO: 1) carries a phosphorothioate modification.

S500 (amplification): the cDNA is amplified to obtain an amplification product. A person skilled in the art may amplify by any known method, for example conventional PCR, using primers designed according to the adapter sequences.

S600 (library construction): a sequencing library is constructed based on the amplification product. A person skilled in the art can prepare the library from the amplification product according to the sequencing method to be used, with reference to the operating instructions provided by the manufacturer; the details are not repeated here. It should be noted that the amplification product obtained by the method of the present invention can be used with Illumina HiSeq 2000, Genome Analyzer, the SOLiD sequencing system, Ion Torrent, Ion Proton, 454, the PacBio RS sequencing system, Helicos tSMS technology and nanopore sequencing technology, so that high-throughput sequencing can be achieved.
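
The reason steps S100 to S300 are performed in this order can be traced with a small Python sketch that follows only the 5' end state of each RNA. It is a toy model under stated assumptions: every enzymatic step is treated as complete, adapter ligation is assumed to act only on 5'-monophosphate ends, and the function names and example molecules are hypothetical rather than taken from the patent.

```python
def s100_enrich(rnas):
    """S100: the exonuclease removes RNAs that carry a 5' monophosphate."""
    return [(name, end) for name, end in rnas if end != "monophosphate"]


def s200_trim(rnas):
    """S200: cap or triphosphate is converted to a ligatable 5' monophosphate."""
    return [
        (name, "monophosphate" if end in ("cap", "triphosphate") else end)
        for name, end in rnas
    ]


def s300_ligate(rnas):
    """S300: the RNA adapter is joined only to 5'-monophosphate ends (assumed)."""
    return [
        (name, "adapter" if end == "monophosphate" else end)
        for name, end in rnas
    ]


def prepare_for_reverse_transcription(rnas):
    """Apply S100, S200 and S300 in order."""
    for step in (s100_enrich, s200_trim, s300_ligate):
        rnas = step(rnas)
    return rnas


# Illustrative molecules only.
sample = [
    ("capped_mRNA", "cap"),
    ("primary_transcript", "triphosphate"),
    ("degraded_fragment", "monophosphate"),
]
tagged = prepare_for_reverse_transcription(sample)
without_enrichment = s300_ligate(s200_trim(sample))
print(tagged)
print(without_enrichment)
```

In this model, skipping S100 lets the degraded fragment receive an adapter as well, so its 5' end would later be indistinguishable from a genuine transcription start; with S100 in place only the two primary transcripts are tagged.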

Thus, with this method, a sequencing library can be effectively constructed from the transcripts enriched from a nucleic acid sample. The method is therefore applicable to high-throughput sequencing of the TSS of both eukaryotic and prokaryotic RNA, and has the advantages of simple operation, high accuracy and low cost. In addition, it should be noted that a product purification step may optionally be included between the above processing steps. According to an embodiment of the present invention, RNA may be purified by phenol/chloroform/isoamyl alcohol extraction (volume ratio 25:24:1) followed by ethanol precipitation. The purpose is to remove the enzyme from the reaction mixture so that it does not interfere with the reaction in the next step. Ethanol precipitation also retains small transcripts such as microRNA, so that TSS information for this portion of non-coding RNA can be obtained, which helps in understanding the state of transcriptional regulation.
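The 25:24:1 volume ratio mentioned above can be turned into component volumes with simple arithmetic. The following is a minimal sketch only; the 500 μL total volume and the function name `pci_volumes` are illustrative assumptions and do not come from the patent text.

```python
def pci_volumes(total_ul, ratio=(25, 24, 1)):
    """Split a total volume into phenol, chloroform and isoamyl alcohol parts.

    `ratio` is the volume ratio phenol:chloroform:isoamyl alcohol (25:24:1 in the text).
    """
    parts = sum(ratio)
    names = ("phenol", "chloroform", "isoamyl_alcohol")
    return {name: total_ul * r / parts for name, r in zip(names, ratio)}


# Hypothetical example: 500 uL of extraction mixture.
volumes = pci_volumes(500.0)
```

For a 500 μL mixture, this gives 250 μL phenol, 240 μL chloroform and 10 μL isoamyl alcohol.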

In a fourth aspect, the present invention provides a method for sequencing a nucleic acid sample. According to an embodiment of the present invention, the method includes: constructing a sequencing library according to the method described above; and sequencing the sequencing library to obtain a sequencing result. With this method, RNA transcripts can be sequenced effectively. The method is applicable to high-throughput sequencing of the TSS of both eukaryotic and prokaryotic RNA, and has the advantages of simple operation, high accuracy and low cost. According to an embodiment of the present invention, the sequencing is performed using at least one of Illumina Hiseq2000, Genome Analyzer, the SOLiD sequencing system, Ion Torrent, Ion Proton, 454, the PacBio RS sequencing system, Helicos tSMS technology and nanopore sequencing technology. The high-throughput, deep-sequencing characteristics of these platforms can thus be exploited to further improve sequencing efficiency. In one embodiment of the present invention, the sequencing is performed using Illumina Hiseq2000.

In a fifth aspect, the present invention provides a method for determining a TSS. According to an embodiment of the present invention, the method includes: extracting an RNA sample from a host; obtaining, by the method described above, a sequencing result consisting of a plurality of sequencing reads; and determining the TSS based on the sequencing result. With this method, the TSS in a nucleic acid sample can be determined effectively.

According to one embodiment of the present invention, the RNA sample is at least a portion of the total RNA of the host. According to an embodiment of the present invention, the host may be a eukaryote, such as a human, or a prokaryote, such as Escherichia coli.

According to one embodiment of the present invention, determining the TSS based on the sequencing result further includes: aligning the sequencing data with a reference sequence;

and determining the transcription start site based on the alignment result,

wherein the reference sequence contains at least a portion of the 5'-UTR sequence of a predetermined gene, the sequencing read that aligns to the reference sequence and lies furthest upstream on the reference sequence is selected as the positive sequence, and the first base of the positive sequence is determined to be the transcription start site. The term "predetermined gene" as used here means that a range of genes that may be included has been set in advance on the reference genome; these genes may be known, or they may be unknown genes inferred by bioinformatics. According to an embodiment of the present invention, the length of the reference sequence is not particularly limited. According to an embodiment of the present invention, the reference sequence contains at least the translation start site of the predetermined gene and a sequence of predetermined length upstream of it. Because the transcription start site lies upstream of the translation start site, the transcription start site can be included in the reference sequence by choosing the length of the reference sequence appropriately. For example, according to an embodiment of the present invention, for a prokaryotic host, the reference sequence contains the nucleic acid sequence between the translation start site of the predetermined gene and the site 700 bp upstream of that translation start site; for a eukaryotic host, the reference sequence contains the nucleic acid sequence between the translation start site of the predetermined gene and the site 5000 bp upstream of that translation start site.
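The reference window and the "most upstream read" rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the inventors' implementation: the `Read` class, the function names, the 1-based inclusive coordinates and the clamping of the window to the genome boundaries are all hypothetical choices. Only the 700 bp and 5000 bp upstream lengths come from the text.

```python
from dataclasses import dataclass

# Upstream lengths given in the text for the two host types.
UPSTREAM_BP = {"prokaryote": 700, "eukaryote": 5000}


@dataclass(frozen=True)
class Read:
    five_prime_pos: int  # 1-based genomic coordinate of the read's first (5') base
    strand: str          # "+" or "-"


def reference_window(translation_start, strand, host, genome_length):
    """Return the (start, end) window, 1-based inclusive, that spans the
    translation start site and the upstream region for the given host type."""
    span = UPSTREAM_BP[host]
    if strand == "+":
        return max(1, translation_start - span), translation_start
    # On the minus strand, "upstream" means higher genomic coordinates.
    return translation_start, min(genome_length, translation_start + span)


def call_tss(reads, window, strand):
    """Pick the most upstream read inside the window and return the position
    of its first base as the candidate TSS, or None if no read qualifies."""
    start, end = window
    hits = [r.five_prime_pos for r in reads
            if r.strand == strand and start <= r.five_prime_pos <= end]
    if not hits:
        return None
    return min(hits) if strand == "+" else max(hits)


# Hypothetical example: a plus-strand prokaryotic gene whose translation starts at 1000.
window = reference_window(1000, "+", "prokaryote", 5000)
reads = [Read(950, "+"), Read(420, "+"), Read(200, "+"), Read(310, "-")]
candidate = call_tss(reads, window, "+")
```

In this example the window is (300, 1000); the read at position 200 falls outside it and the minus-strand read is ignored, so the candidate TSS is position 420.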

According to one embodiment of the present invention, the alignment may be performed with SOAPAlignment. In the present invention, the short-read mapping program soapalignment v2.2 is used to align the clean reads obtained by high-throughput sequencing to the reference genome and to the reference gene sequences separately, with no base mismatches allowed. The reference genome sequence and the reference gene sequences can be obtained from public databases.
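To make the "no mismatches allowed" criterion concrete, the sketch below uses plain exact string matching on both strands. It is a toy stand-in for illustration only and does not reproduce how soapalignment works internally; the function names and the example sequences are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def exact_hits(read, reference):
    """Return (position, strand) for every exact, zero-mismatch match.

    `position` is the 0-based start of the match on the forward strand of
    `reference`; strand "-" means the read's reverse complement matched there.
    """
    hits = []
    for strand, query in (("+", read), ("-", reverse_complement(read))):
        pos = reference.find(query)
        while pos != -1:
            hits.append((pos, strand))
            pos = reference.find(query, pos + 1)
    return hits


# Hypothetical toy reference and reads.
reference = "GGATCCAAGTTTC"
forward_hit = exact_hits("CCAAG", reference)   # matches the forward strand
reverse_hit = exact_hits("GAAAC", reference)   # matches as reverse complement
mismatch = exact_hits("CCTAG", reference)      # one mismatch, so no hit
```

A read with even a single mismatch returns an empty list, which mirrors the strict criterion stated in the text.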

According to one embodiment of the present invention, the method further includes screening the positive sequences, wherein the screening criterion is that the number of positive sequences is at least N times the average number of sequencing reads aligned within the predetermined gene, where N is a real number greater than 1, and preferably a real number of at least 10. According to an embodiment of the present invention, after alignment, the alignment results may first be screened to obtain reliable TSS information. The screening works as follows. The first position at which clean reads align to the gene (the segment corresponding to the predetermined gene) is taken as the raw TSS. However, some of these reads may have aligned inside the gene and would give false-positive TSSs, so further filtering is needed. Because the method enriches reads at the 5' end of the gene, the number of reads at a true TSS will be higher than the average number of reads falling inside the gene. A fold threshold N is therefore introduced between the two: a candidate TSS is accepted as a true TSS only if its read count is at least N times the average number of reads inside the corresponding gene. According to an embodiment of the present invention, N may be a real number of at least 10.
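The fold-threshold screening described above can be sketched as a single comparison. This is an illustrative sketch only: the function name is hypothetical, and treating the "average number of reads inside the gene" as the mean of a list of per-position read counts is an assumption, since the text does not define how the average is taken.

```python
def passes_fold_filter(tss_reads, internal_counts, n_fold=10.0):
    """Return True if the read count at a candidate TSS is at least `n_fold`
    times the mean read count observed inside the gene.

    `internal_counts` holds the read counts inside the gene (assumed here to
    be per-position counts); `n_fold` is the multiple N from the text.
    """
    if not internal_counts:
        raise ValueError("internal_counts must not be empty")
    if n_fold <= 1:
        raise ValueError("N must be greater than 1")
    mean_internal = sum(internal_counts) / len(internal_counts)
    return tss_reads >= n_fold * mean_internal


# Hypothetical counts: mean internal coverage is 4 reads, so N = 10 requires 40 reads.
accepted = passes_fold_filter(120, [2, 4, 6])
rejected = passes_fold_filter(30, [2, 4, 6])
```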

According to one embodiment of the present invention, the method further includes performing a chi-square test on the screening results. According to one embodiment of the present invention, a chi-square test value of 3.84 or above corresponds to a confidence level greater than 95%. According to an embodiment of the present invention, after filtering, the method uses a chi-square test to verify the reliability of the filtered results. Specifically, building on the previous embodiment, the mean and the standard deviation of the fold values of all TSSs are calculated and, after standardization, the chi-square value is calculated by a formula (not reproduced in this text). According to the chi-square table, the chi-square value at a confidence level of 0.95 is 3.84. Therefore, to obtain TSSs with a reliability greater than 95%, the chi-square value calculated by the formula must be greater than 3.84.
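The 3.84 cutoff quoted above matches the 95% critical value of a chi-square distribution with one degree of freedom. The sketch below checks that numerically. The one-degree-of-freedom assumption is inferred from the cutoff and is not stated in the text, and the patent's own chi-square formula is not reproduced in the source, so only the thresholding step is shown; the function names are hypothetical.

```python
import math


def chi2_cdf_1dof(x):
    """CDF of the chi-square distribution with one degree of freedom.

    A chi-square variable with one degree of freedom is the square of a
    standard normal variable, so P(X <= x) = erf(sqrt(x / 2)).
    """
    if x <= 0:
        return 0.0
    return math.erf(math.sqrt(x / 2.0))


def is_reliable(chi2_value, critical=3.84):
    """Apply the cutoff from the text: keep a TSS only if its chi-square
    value exceeds the critical value."""
    return chi2_value > critical


confidence_at_cutoff = chi2_cdf_1dof(3.84)
```

Under this assumption, `confidence_at_cutoff` is very close to 0.95, which is consistent with the statement that values above 3.84 correspond to a confidence level greater than 95%.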

In addition, according to an embodiment of the present invention, after the sequencing result is obtained, the method may further include a step of removing unqualified reads from the sequencing reads to obtain clean reads. According to an embodiment of the present invention, unqualified reads include:

Reads in which the number of bases with sequencing quality below a given threshold exceeds 50% of the bases in the read. The low-quality threshold depends on the specific sequencing technology and sequencing environment;

Reads in which the number of undetermined bases (such as N in Illumina Hiseq2000 sequencing results) exceeds 10% of the bases in the read;

Reads containing exogenous sequence. Apart from the sample adapter sequence, reads are compared against other exogenous sequences introduced during the experiment, such as various adapter sequences, and a read that contains such an exogenous sequence is regarded as unqualified.

The sequence data that remain after unqualified reads have been removed from the raw sequence data are referred to as clean reads. They serve as the basis for subsequent analysis, which improves the validity of that analysis.
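The three criteria for unqualified reads listed above can be combined into one filter function. The following is a minimal sketch under stated assumptions: the function name, the representation of qualities as a list of integer scores, and the example threshold of 20 are hypothetical. The text itself says only that the low-quality threshold depends on the sequencing technology and environment.

```python
def is_clean(seq, quals, min_quality, adapters=(),
             max_low_frac=0.5, max_n_frac=0.1):
    """Return True if a read passes all three filters described in the text.

    seq          -- read sequence
    quals        -- per-base quality scores, same length as seq
    min_quality  -- platform-dependent low-quality threshold (an assumption here)
    adapters     -- exogenous sequences (e.g. adapters) that must not occur in the read
    """
    length = len(seq)
    if length == 0 or len(quals) != length:
        return False
    # 1) More than 50% of bases below the quality threshold.
    low_quality = sum(1 for q in quals if q < min_quality)
    if low_quality > max_low_frac * length:
        return False
    # 2) More than 10% undetermined bases (N).
    if seq.upper().count("N") > max_n_frac * length:
        return False
    # 3) Any exogenous sequence present in the read.
    return not any(a and a in seq for a in adapters)


# Hypothetical reads, each 10 bases long, with a quality threshold of 20.
good = is_clean("ACGTACGTAC", [30] * 10, 20)
low_q = is_clean("ACGTACGTAC", [10] * 6 + [30] * 4, 20)
too_many_n = is_clean("ANNTACGTAC", [30] * 10, 20)
has_adapter = is_clean("ACGTACGTAC", [30] * 10, 20, adapters=("GTAC",))
```

Because the text says "exceeds", a read with exactly 50% low-quality bases or exactly 10% N bases is kept in this sketch.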

In addition, after the chi-square test, a series of bioinformatics analyses are performed on the results verified as reliable, such as:

1) Classification of TSSs (transcription start sites): According to an embodiment of the present invention, the screened TSSs can be divided into two classes. The first class consists of TSSs that align to the genome and have a corresponding gene annotation; these are called annotated TSSs. The second class consists of TSSs that align to the genome but have no annotated gene information nearby; these are called unannotated TSSs and can be used for predicting new genes.
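The two-class split described in item 1) can be sketched as an interval lookup. This is an illustration under assumptions: the text does not say how close a gene must be to count as "nearby", so the `max_distance` parameter, the function name and the example gene are hypothetical.

```python
def classify_tss(tss_pos, genes, max_distance):
    """Classify a TSS as annotated or unannotated.

    genes        -- list of (name, start, end) tuples, 1-based inclusive coordinates
    max_distance -- how far from a gene a TSS may lie and still be assigned to it
                    (a hypothetical cutoff; the text does not give one)
    Returns ("annotated", gene_name) or ("unannotated", None).
    """
    for name, start, end in genes:
        if start - max_distance <= tss_pos <= end + max_distance:
            return ("annotated", name)
    return ("unannotated", None)


# Hypothetical annotation with a single gene.
genes = [("geneA", 1000, 2000)]
near_gene = classify_tss(950, genes, max_distance=100)
far_from_genes = classify_tss(5000, genes, max_distance=100)
```

TSSs that come back as unannotated are the ones the text proposes as starting points for new gene prediction.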

2) TSS annotation: TSSs that fall on known genes are annotated, including the expression level of the TSS, the position of the TSS and the corresponding gene annotation information.

3) Construction of a TSS map: According to an embodiment of the present invention, the TSSs found for a given species by this method can be displayed graphically as a TSS map, from which the position and expression level of each TSS can be read directly. Differences in TSS expression and distribution between samples can also be seen.

4) Identification of promoter regions and statistics on 5' UTR length.
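For the 5' UTR length statistics mentioned in item 4), one simple reading is the distance between a TSS and the translation start site of the same gene. The sketch below computes that distance from genomic coordinates. It is an assumption-laden illustration: the function name is hypothetical, and the calculation ignores introns, so for eukaryotic genes with introns in the 5' UTR it would overestimate the length of the mature UTR.

```python
def five_prime_utr_length(tss, translation_start, strand="+"):
    """Number of bases between the TSS and the translation start site.

    Coordinates are genomic positions; the translation start base itself is
    not counted. Raises ValueError if the TSS lies downstream of the
    translation start site.
    """
    if strand == "+":
        length = translation_start - tss
    else:
        length = tss - translation_start
    if length < 0:
        raise ValueError("TSS lies downstream of the translation start site")
    return length


# Hypothetical examples on each strand.
plus_len = five_prime_utr_length(420, 1000, "+")
minus_len = five_prime_utr_length(1650, 1000, "-")
lengths = [plus_len, minus_len]
mean_length = sum(lengths) / len(lengths)
```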

5) Reproducibility analysis: According to an embodiment of the present invention, a correlation analysis of the results of two parallel experiments provides an assessment of the reliability of the experimental results and of the stability of the procedure.
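The correlation analysis in item 5) could, for example, compare the read counts of the same TSSs in two parallel experiments. The sketch below uses the Pearson correlation coefficient; the text does not name a specific correlation measure, so Pearson is an assumption, and the function name and the counts are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical read counts for the same four TSSs in two replicates.
replicate_1 = [10, 20, 30, 40]
replicate_2 = [20, 40, 60, 80]
r = pearson_r(replicate_1, replicate_2)
```

A coefficient close to 1 would indicate that the two parallel experiments agree well.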

6) New gene prediction: According to an embodiment of the present invention, for TSSs with no reference gene nearby, the sequences around these TSSs can be extracted for gene prediction. glimmer is used for prediction in prokaryotes and genscan in eukaryotes.

7) Data visualization: According to an embodiment of the present invention, the analysis results can be used to plot and inspect the TSS distribution of a gene or region of interest.

In a sixth aspect, the present invention provides an enrichment reagent for enriching transcripts from an RNA sample. According to an embodiment of the present invention, the enrichment reagent has 5'-monophosphate exonuclease activity. With this enrichment reagent, transcripts can be enriched effectively. The reagent is therefore applicable to high-throughput sequencing of the TSS of both eukaryotic and prokaryotic RNA, and has the advantages of simple operation, high accuracy and low cost. According to one embodiment of the present invention, the enrichment reagent contains DNase I. This can further improve the specificity and efficiency with which RNA bearing a 5' monophosphate is degraded, and thereby further improve the efficiency of the transcript enrichment method. According to one embodiment of the present invention, the enrichment reagent may further contain a buffer and a soluble salt, to further increase the enzymatic activity of DNase I. According to one embodiment of the present invention, the pH of the enrichment reagent is 8.0. According to one embodiment of the present invention, the buffer is Tris-HCl and the soluble salt is at least one selected from sodium chloride and magnesium chloride. According to one embodiment of the present invention, the RNA sample is treated with the enrichment reagent at 30 degrees Celsius. This can further improve the efficiency of transcript enrichment with the enrichment reagent according to embodiments of the present invention. According to embodiments of the present invention, examples of enzymes having 5'-monophosphate exonuclease activity may include: the exoribonuclease XRN-1, Terminator™ 5'-phosphate-dependent exonuclease, or TAKARA™ alkaline phosphatase.

In a seventh aspect, the present invention provides a device for constructing a sequencing library. Referring to FIG. 5, according to an embodiment of the present invention, the device includes: a transcript enrichment unit 211, an end trimming unit 212, an RNA adapter ligation unit 213, a reverse transcription unit 214, an amplification unit 215 and a library construction unit 216. According to an embodiment of the present invention, the transcript enrichment unit 211 is provided with the enrichment reagent described above, so as to enrich transcripts from an RNA sample; the end trimming unit 212 is connected to the transcript enrichment unit 211 and is adapted to remove the 5' cap structure or 5' triphosphate of the transcript, so as to obtain a transcript from which the 5' cap structure or 5' triphosphate has been removed; the RNA adapter ligation unit 213 is connected to the end trimming unit 212 and is adapted to ligate an RNA adapter to the 5' end of the transcript from which the 5' cap structure or 5' triphosphate has been removed, so as to obtain a transcript ligated to the RNA adapter; the reverse transcription unit 214 is connected to the RNA adapter ligation unit 213 and is adapted to reverse-transcribe the transcript ligated to the RNA adapter, so as to obtain cDNA corresponding to the transcript; the amplification unit 215 is connected to the reverse transcription unit 214 and is adapted to amplify the cDNA, so as to obtain an amplification product; and the library construction unit 216 is connected to the amplification unit 215 and is adapted to construct a sequencing library based on the amplification product. With this device, a sequencing library can be effectively constructed from the transcripts enriched from a nucleic acid sample. The device is therefore applicable to high-throughput sequencing of the TSS of both eukaryotic and prokaryotic RNA, and has the advantages of simple operation, high accuracy and low cost. According to one embodiment of the present invention, the end trimming unit 212 is provided with an end trimming reagent, wherein the end trimming reagent has tobacco acid pyrophosphatase activity. According to one embodiment of the present invention, the trimming reagent comprises: tobacco acid pyrophosphatase, a soluble salt, EDTA, β-mercaptoethanol and Triton X-100. According to one embodiment of the present invention, the soluble salt is sodium acetate. According to one embodiment of the present invention, the pH of the trimming reagent is 7.5. According to one embodiment of the present invention, the reverse transcription unit 214 is provided with an oligonucleotide having the sequence shown in SEQ ID NO: 1 as the reverse transcription primer. According to one embodiment of the present invention, at least one N in the reverse transcription primer is phosphorothioate-modified. According to one embodiment of the present invention, the penultimate N in the reverse transcription primer is phosphorothioate-modified. According to one embodiment of the present invention, the RNA adapter ligation unit 213 is provided with a ligation reagent, wherein the ligation reagent has T4 RNA ligase activity. According to one embodiment of the present invention, the ligation reagent comprises: T4 RNA ligase, a buffer, a soluble salt and dithiothreitol. According to one embodiment of the present invention, the pH of the ligation reagent is 7.5. According to one embodiment of the present invention, the buffer is Tris-HCl. According to one embodiment of the present invention, the soluble salt is magnesium chloride.

In an eighth aspect, the present invention provides an apparatus for sequencing a nucleic acid sample. Referring to FIG. 4, according to an embodiment of the present invention, the apparatus includes: a library construction device 210, which is the device described above, for constructing a sequencing library from a nucleic acid sample; and a sequencing device 220, which is connected to the library construction device 210 and is adapted to sequence the sequencing library to obtain a sequencing result. With this apparatus, RNA transcripts can be sequenced effectively. The apparatus is applicable to high-throughput sequencing of the TSS of both eukaryotic and prokaryotic RNA, and has the advantages of simple operation, high accuracy and low cost. According to an embodiment of the present invention, the sequencing device is at least one of Illumina Hiseq2000, Genome Analyzer, the SOLiD sequencing system, Ion Torrent, Ion Proton, 454, the PacBio RS sequencing system, the Helicos tSMS system and a nanopore sequencing system.

In a ninth aspect, the present invention provides a system for determining a TSS. Referring to FIG. 3, according to an embodiment of the present invention, the system includes: a sample extraction apparatus 100 for extracting an RNA sample from a host; a nucleic acid sample sequencing apparatus 200, which is connected to the sample extraction apparatus and is the nucleic acid sample sequencing apparatus described above, for sequencing the RNA sample to obtain a sequencing result consisting of a plurality of sequencing reads; and a TSS determination apparatus 300, which is connected to the sequencing apparatus 200 and is adapted to determine the TSS based on the sequencing result. According to an embodiment of the present invention, the TSS in a nucleic acid sample can be determined effectively with this system. Referring to FIG. 6, according to one embodiment of the present invention, the TSS determination apparatus further includes: an alignment device 310 for aligning the sequencing data with a reference sequence; and a determination device 320 adapted to determine the TSS based on the alignment result, wherein the reference sequence contains at least a portion of the 5'-UTR sequence of a predetermined gene, and the determination device 320 is adapted to select, as the positive sequence, the sequencing read that aligns to the segment corresponding to the predetermined gene and lies closest to the 5' end of that segment, and to determine the first base of the positive sequence to be the transcription start site. According to one embodiment of the present invention, the alignment device is adapted to perform the alignment with SOAPAlignment. According to one embodiment of the present invention, the determination device further includes a screening unit adapted to screen the positive sequences, wherein the screening criterion is that the number of positive sequences is at least N times the average number of reads inside the segment corresponding to the predetermined gene, where N is a real number greater than 1, and preferably a real number of at least 10. According to one embodiment of the present invention, the determination device further includes a test unit adapted to perform a chi-square test on the screening results. According to one embodiment of the present invention, the test value of the chi-square test is 3.84 or above, corresponding to a confidence level greater than 95%.

The term "predetermined gene" as used in the present invention should be understood broadly. It may refer to any known gene, or to a nucleic acid sequence predicted by known methods to encode a protein.

The solutions of the present invention are explained below with reference to examples. Those skilled in the art will understand that the following examples serve only to illustrate the present invention and should not be regarded as limiting its scope. Where no specific technique or condition is indicated in an example, the techniques or conditions described in the literature of the field are followed (for example, Molecular Cloning: A Laboratory Manual, third edition, by J. Sambrook et al., translated by Huang Peitang et al., Science Press), or the product instructions are followed. Reagents and instruments for which no manufacturer is indicated are conventional, commercially available products, obtainable for example from Illumina.

General method

The methods used in the examples mainly comprise TSS library construction and post-sequencing analysis. TSS library construction mainly includes the following steps:

(1) Take a total RNA sample, digest it with DNase I, and purify the digested RNA by ethanol precipitation;

(2) Mix the RNA obtained in (1) with Reagent I and allow it to react, to enrich RNA bearing a 5' cap or 5' triphosphate;

(3) Purify the RNA obtained in (2) by phenol/chloroform/isoamyl alcohol (25:24:1) extraction;

(4) Mix the RNA purified in (3) with Reagent II and allow it to react, to remove the 5' cap or 5' triphosphate and leave a 5' monophosphate;

(5) Purify the RNA obtained in (4) by phenol/chloroform/isoamyl alcohol (25:24:1) extraction;

(6) Add the RNA adapter to the RNA from (5), mix with Reagent III and allow it to react, to obtain RNA with the adapter attached at its 5' end;

(7) Reverse-transcribe the RNA from (6) with the specific reverse transcription primer to obtain cDNA with specific adapter sequences at both ends, and purify it with magnetic beads;

(8) Amplify the adapter-flanked cDNA fragments obtained in (7) by polymerase chain reaction (PCR), and purify the PCR product with magnetic beads;

(9) Determine the library concentration and fragment size using an Agilent Bioanalyzer 2100 and Q-PCR.

In step (1), the amount of total RNA is 5 μg.

In step (2), Reagent I contains: 1 μL of 5' monophosphate exonuclease (1 U/μL), 50 mM buffer salt and 2 mM to 100 mM soluble salt, at pH 8.0, with water as the solvent. The buffer salt in Reagent I is Tris-HCl. The soluble salt in Reagent I is sodium chloride or magnesium chloride. In step (2), the RNA is mixed and reacted with Reagent I at 30°C.

In step (4), Reagent II contains: 0.2 μL of tobacco acid pyrophosphatase (10 U/μL), 50 mM soluble salt at pH 6.0, 1 mM EDTA, 0.1% β-mercaptoethanol and 0.01% Triton X-100, with water as the solvent. The soluble salt in Reagent II is sodium acetate. The sample is mixed and reacted with Reagent II at 37°C.

In step (6), Reagent III contains: 1 μL of T4 RNA ligase 1, 50 mM buffer salt, 10 mM soluble salt and 1 mM dithiothreitol, at pH 7.5, with water as the solvent. The buffer salt in Reagent III is Tris-HCl. The soluble salt in Reagent III is magnesium chloride. In step (6), the RNA is mixed and reacted with Reagent III at 20°C.
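The final concentrations given for Reagents I to III can be converted into pipetting volumes with the dilution relation C1·V1 = C2·V2. The sketch below does this for Reagent III. It is illustrative only: the text specifies final concentrations but neither the reaction volume nor the stock concentrations, so the 20 μL reaction volume and the stock concentrations used here are hypothetical, as is the function name.

```python
def stock_volume_ul(final_mM, stock_mM, reaction_ul):
    """Volume of stock solution (uL) needed to reach `final_mM` in a reaction
    of `reaction_ul`, from the dilution relation C1 * V1 = C2 * V2."""
    if stock_mM <= final_mM:
        raise ValueError("stock must be more concentrated than the final solution")
    return final_mM * reaction_ul / stock_mM


# Final concentrations from the text for Reagent III.
FINAL_MM = {"Tris-HCl": 50.0, "MgCl2": 10.0, "DTT": 1.0}

# Hypothetical stock concentrations and reaction volume (not from the text).
STOCK_MM = {"Tris-HCl": 1000.0, "MgCl2": 100.0, "DTT": 100.0}
REACTION_UL = 20.0

volumes_ul = {name: stock_volume_ul(FINAL_MM[name], STOCK_MM[name], REACTION_UL)
              for name in FINAL_MM}
```

Under these assumed stocks, a 20 μL reaction would take 1 μL of Tris-HCl, 2 μL of magnesium chloride and 0.2 μL of dithiothreitol stock, with the remainder made up by enzyme, RNA, adapter and water.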

The sequence of the specific reverse transcription primer used in step (7) is: 5'-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNN-3', in which the penultimate N is phosphorothioate-modified.
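The structure of this primer, a fixed adapter-derived part followed by six random bases with a modification at the penultimate N, can be checked with a few lines of code. This is a sketch for illustration: writing the phosphorothioate modification as a `*` placed after the modified base is an assumed notation (a convention commonly used when ordering oligonucleotides), not something specified in the text, and the names below are hypothetical.

```python
ADAPTER_PART = "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT"  # fixed part of the primer from the text
RANDOM_PART = "N" * 6                                 # six random bases at the 3' end
RT_PRIMER = ADAPTER_PART + RANDOM_PART


def mark_phosphorothioate(primer, index):
    """Return the primer with '*' inserted after the base at 0-based `index`.

    The '*' marker is an assumed notation for a phosphorothioate modification.
    """
    if not 0 <= index < len(primer):
        raise IndexError("index outside primer")
    return primer[:index + 1] + "*" + primer[index + 1:]


penultimate = len(RT_PRIMER) - 2          # 0-based index of the penultimate base
marked = mark_phosphorothioate(RT_PRIMER, penultimate)
```

The primer is 40 nt long (34 fixed bases plus 6 random bases), and the marked form ends in `NNNNN*N`.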

After steps (3) and (5), the RNA is purified by phenol/chloroform/isoamyl alcohol extraction followed by ethanol precipitation. The purpose is to remove the enzyme from the reaction mixture so that it does not interfere with the reaction in the next step. Ethanol precipitation also retains small transcripts such as microRNA, so that TSS information for this portion of non-coding RNA can be obtained, which helps in understanding the state of transcriptional regulation.

Referring to FIG. 2, bioinformatics analysis is performed on the data generated by sequencing the TSS library, including the following steps:

(1) Filtering the sequencing reads;

In the present invention, after the high-throughput sequencing reads are received, they are filtered to remove unqualified reads. The high-throughput sequencing technology may be Illumina Hiseq2000 sequencing or any other existing high-throughput sequencing technology.

Unqualified reads include the following. A read is unqualified if the number of bases with sequencing quality below a given threshold exceeds 50% of the bases in the read; the low-quality threshold depends on the specific sequencing technology and sequencing environment. A read is unqualified if the number of undetermined bases (such as N in Illumina Hiseq2000 sequencing results) exceeds 10% of the bases in the read. Apart from the sample adapter sequence, reads are also compared against other exogenous sequences introduced during the experiment, such as various adapter sequences, and a read that contains such an exogenous sequence is unqualified. The sequence data that remain after unqualified reads have been removed from the raw sequence data are referred to as clean reads and serve as the basis for subsequent analysis.

(2) Aligning the clean reads to the reference sequences;

In the present invention, a short-read mapping program, soapalignment v2.2, is used to align the clean reads obtained by high-throughput sequencing to the reference genome and to the reference gene sequences, respectively, with no base mismatches allowed. The reference genome sequence and the reference gene sequences can be obtained from public databases.

(3) After alignment, the alignment results are first screened to obtain reliable TSS information. The screening method is as follows. The first genomic position to which a clean read aligns is taken as a raw TSS. Some of these reads, however, may align inside a gene and produce false-positive TSSs, so further filtering is required. Because the method enriches reads at the 5' end of genes, the number of reads at a true TSS should be higher than the average number of reads falling inside the gene. A fold factor N is therefore introduced between the two: a candidate TSS is accepted as a true TSS only if its read count reaches N times the average read count inside the corresponding gene.
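The fold-based screening can be sketched as below. The text does not specify how the gene-internal average is computed, so this sketch assumes it is the mean read count over the positions inside the gene that have at least one read start, and it assumes forward-strand coordinates. The function name `filter_tss` and the default `fold=5.0` are hypothetical.

```python
from collections import Counter


def filter_tss(read_starts, gene_start, gene_end, fold=5.0):
    """Keep candidate TSS positions whose read count is at least
    `fold` times the mean read count at gene-internal positions.

    read_starts -- 5' alignment positions of the reads assigned to one gene
    gene_start, gene_end -- inclusive gene coordinates (forward strand assumed)
    fold -- the multiple N from the text (value here is an assumption)
    """
    counts = Counter(read_starts)
    internal = [c for pos, c in counts.items() if gene_start <= pos <= gene_end]
    # With no gene-internal reads the mean is 0, so every candidate passes.
    mean_internal = sum(internal) / len(internal) if internal else 0.0
    return {pos: c for pos, c in counts.items() if c >= fold * mean_internal}


# Usage: 60 reads start at position 80, a few start inside the gene (100..200).
starts = [80] * 60 + [90] * 4 + [120] * 2 + [150] + [180] * 3
accepted = filter_tss(starts, gene_start=100, gene_end=200, fold=5.0)
```

In this toy input the gene-internal mean is 2 reads per position, so with `fold=5.0` only positions with at least 10 reads survive.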

(4) After filtering, a chi-square test is used to verify the reliability of the filtered results: the chi-square statistic should be greater than 3.84, which corresponds to a confidence level above 95%.
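The value 3.84 is the 95% critical value of the chi-square distribution with one degree of freedom. The text does not state which counts enter the test, so the following is only one plausible sketch: a two-category goodness-of-fit test that compares the reads at a candidate TSS with the reads at gene-internal positions, against the null hypothesis that reads are spread evenly over all of these positions. The function names and the choice of null model are assumptions.

```python
CHI2_CRITICAL_1DF_95 = 3.84  # threshold quoted in the text


def chi_square_tss(tss_reads, internal_reads, internal_positions):
    """Chi-square statistic (1 degree of freedom) for a candidate TSS.

    Null hypothesis (assumed here): reads are distributed uniformly over
    the TSS position plus `internal_positions` gene-internal positions.
    """
    total = tss_reads + internal_reads
    if total <= 0 or internal_positions < 1:
        raise ValueError("need at least one read and one internal position")
    expected_tss = total / (internal_positions + 1)
    expected_internal = total - expected_tss
    return ((tss_reads - expected_tss) ** 2 / expected_tss
            + (internal_reads - expected_internal) ** 2 / expected_internal)


def is_reliable_tss(tss_reads, internal_reads, internal_positions):
    """True if the TSS is enriched and the statistic exceeds 3.84."""
    expected_tss = (tss_reads + internal_reads) / (internal_positions + 1)
    stat = chi_square_tss(tss_reads, internal_reads, internal_positions)
    return tss_reads > expected_tss and stat > CHI2_CRITICAL_1DF_95
```

The extra check `tss_reads > expected_tss` is a design choice of this sketch: the chi-square statistic is also large when a position is depleted, and only enrichment supports a TSS call.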

(5) After the chi-square test, a series of bioinformatics analyses is performed on the results verified as reliable, such as:

Example 1: Transcription start site sequence analysis of a human RNA sample and an Escherichia coli RNA sample

The human RNA sample (sample 1) was purchased from Agilent; the Escherichia coli RNA (sample 2) was extracted from E. coli cultured to the logarithmic growth phase. 1-5 μg of total RNA was digested with DNase I and purified by ethanol precipitation. The purified RNA was mixed and reacted with reagent I to enrich intact RNA carrying a 5' cap or a 5' triphosphate. After purification by phenol/chloroform/isoamyl alcohol extraction, the RNA was mixed and reacted with reagent II, which removes the 5' cap or triphosphate and leaves a 5' monophosphate, and was again purified by phenol/chloroform/isoamyl alcohol extraction. The 5'-monophosphate RNA was then mixed and reacted with reagent III and an RNA adapter so that the adapter was added to the 5' end of the RNA. Using a specific reverse transcription primer, the 5'-adapter-ligated RNA was reverse transcribed into cDNA with fixed sequences at both ends. The cDNA product was purified with magnetic beads, the resulting cDNA fragments were amplified by polymerase chain reaction (PCR), the PCR product was purified with magnetic beads, and the library was sequenced on an Illumina Hiseq2000.

Following the information analysis pipeline of the general method, a set of TSSs was obtained by screening. Figure 7 shows the distribution of the screened TSSs on the genome; the upper and lower panels are the TSS distributions for the human RNA sample and the E. coli RNA sample, respectively. Position 0 is the start of the gene coding region, and the transcription start sites lie upstream of it. As the figure shows, most of the reads fall upstream of the gene coding region.

In addition, a series of analyses was performed on this TSS information in this example.

The first analysis is TSS classification. The screened TSSs are divided into two categories. TSSs that align to the genome and have a corresponding gene annotation are called annotated TSSs. TSSs that align to the genome but have no annotated gene information in their vicinity are called unannotated TSSs and can be used to predict new genes.
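A minimal sketch of this classification follows. The text does not define how close a gene must be to count as "in the vicinity", so the `window` parameter (in bases upstream of the gene start) and its default value are hypothetical, as is the function name `classify_tss`. Forward-strand coordinates are assumed for simplicity.

```python
def classify_tss(tss_positions, genes, window=500):
    """Split TSS positions into annotated and unannotated groups.

    tss_positions -- iterable of genomic positions
    genes -- list of (name, start, end) tuples, forward strand assumed
    window -- hypothetical upstream distance within which a gene counts
              as the annotation for a TSS
    Returns (annotated, unannotated): a dict position -> gene names,
    and a list of positions with no nearby gene.
    """
    annotated, unannotated = {}, []
    for pos in tss_positions:
        hits = [name for name, start, end in genes
                if start - window <= pos <= end]
        if hits:
            annotated[pos] = hits
        else:
            unannotated.append(pos)
    return annotated, unannotated


# Usage with a single toy gene spanning 1000..2000.
annotated, unannotated = classify_tss([900, 1500, 5000], [("geneA", 1000, 2000)])
```

The unannotated positions are the ones that would be passed on to gene prediction.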

The second analysis is TSS annotation. Here mainly the TSSs that fall on known genes are annotated, including the expression level of each TSS, its position, and the corresponding gene annotation information. A TSS map is then constructed: the inventors display the TSSs found by this method for the same species graphically to form a TSS map, from which the position and expression level of each TSS can be read directly. Differences in TSS expression and distribution between samples can also be seen. Figure 8 shows the TSS maps of 8 human samples, from which the distribution of TSSs in the different samples can be seen.

Next come the search for promoter regions and the statistics of 5'UTR length. Figure 9 shows the base distribution upstream of the TSS, where position 1 on the horizontal axis corresponds to the TSS itself, which is predominantly a purine (A/G). The upper panel shows the base distribution upstream of human TSSs, with a clear GC-rich region, which is also the main promoter type in eukaryotes. The lower panel shows the base distribution for E. coli, where a typical TATA box can be found in the -10 region upstream of the TSS. Figure 10 shows the length distribution of the 5'UTR, that is, the distance from the TSS to the coding region, for human (upper panel) and E. coli (lower panel). The length of the 5'UTR affects gene function, and eukaryotic 5'UTRs are longer than prokaryotic ones.
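The 5'UTR length statistic can be sketched as below, taking the definition given in the text (distance from the TSS to the start of the coding region). This sketch measures genomic distance only and ignores any introns within the 5'UTR, which is a simplifying assumption. The function names and the `bin_size` parameter are hypothetical.

```python
from collections import Counter


def utr5_length(tss, cds_start, strand="+"):
    """Genomic distance from the TSS to the coding-region start.

    On the '+' strand the TSS lies at a smaller coordinate than the CDS
    start; on the '-' strand it lies at a larger one.
    """
    length = cds_start - tss if strand == "+" else tss - cds_start
    if length < 0:
        raise ValueError("TSS lies downstream of the coding-region start")
    return length


def length_histogram(lengths, bin_size=50):
    """Count lengths per bin; keys are the lower edge of each bin."""
    return Counter((n // bin_size) * bin_size for n in lengths)


# Usage: three toy (tss, cds_start, strand) records.
records = [(950, 1000, "+"), (2120, 2000, "-"), (4990, 5000, "+")]
lengths = [utr5_length(t, c, s) for t, c, s in records]
histogram = length_histogram(lengths)
```

Plotting such a histogram separately for each species would give a length distribution of the kind described for Figure 10.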

In this example, a correlation analysis was also performed on the results of two replicate experiments to assess the reliability of the experimental results and the stability of the procedure. As shown in Figure 11, the closer the correlation between two replicate experiments on the same sample is to 1, the higher the reproducibility.
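The text does not say which correlation measure was used. As an illustration, the sketch below computes the Pearson correlation coefficient between the per-TSS read counts of two replicates; the choice of Pearson correlation, the function name, and the toy counts are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Usage: read counts at the same TSS positions in two replicate libraries.
replicate_1 = [60, 12, 35, 8, 100]
replicate_2 = [58, 15, 33, 9, 104]
r = pearson(replicate_1, replicate_2)
```

A value of `r` close to 1 indicates that the two replicates agree closely on relative TSS read counts.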

For TSSs with no reference gene found nearby, the sequences around these TSSs are extracted for gene prediction: glimmer is used for E. coli and genscan for human. Finally, using the analysis results, the TSS distribution of genes or regions of interest is plotted for inspection. As shown in Figure 12, the upper panel shows the TSS distribution of two human genes, NM_018997 and NM_031901, both of which undergo alternative splicing. In the figure, red vertical lines mark the screened TSSs, black vertical lines mark the reads obtained before filtering, blue horizontal lines represent the exons of the gene, and yellow horizontal lines represent its introns. The lower panel shows the TSS distribution of an E. coli operon; prokaryotic genes have no introns, so only the blue horizontal lines representing the genes appear, and the four genes of this operon share a single TSS. The TSSs screened by the inventors lie upstream of the genes, which supports their reliability.

The description of the present invention has been given for purposes of illustration and description, and is not intended to be exhaustive or to limit the invention to the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to better explain the principles and practical application of the invention, and to enable those of ordinary skill in the art to understand the invention and to design various embodiments with various modifications suited to a particular use.

In this specification, a description that refers to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. Such references do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described above, it should be understood that these embodiments are exemplary and are not to be construed as limiting the invention. Those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the invention without departing from its principles and spirit.