CN109841265B

Movatterモバイル変換

Info

Publication number: CN109841265B
Application number: CN201910131751.XA
Authority: CN
Inventors: 汪小我; 方欢; 徐江平
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2019-02-22
Filing date: 2019-02-22
Publication date: 2021-09-21
Anticipated expiration: 2039-02-22
Also published as: CN109841265A

Abstract

本发明提供了一种使用片段化模式确定血浆游离核酸分子组织来源的方法和系统及应用。该方法通过将核酸分子的测序结果与多个参考序列进行比对处理，分别确定与所述参考序列对应的已知基因的频谱参数；然后基于预定组织中多个已知基因的表达量信息，以及多个已知基因的所述频谱参数，确定所述核酸分子与所述预定组织的相关性；其中，所述频谱参数是通过确定参考序列上位点的核小体保护分数，确定多个核小体周期作为第一参数以及与所述多个核小体周期对应的傅里叶强度作为第二参数，从而确定所述参考序列对应所述已知基因的所述频谱参数。将本发明所提供的方法应用于癌症的检测中，能够更加灵敏和准确。The present invention provides a method, system and application for determining the tissue origin of plasma free nucleic acid molecules using fragmentation patterns. In the method, the sequence results of nucleic acid molecules are compared with multiple reference sequences, and the spectral parameters of the known genes corresponding to the reference sequences are respectively determined; and the spectral parameters of a plurality of known genes, to determine the correlation between the nucleic acid molecule and the predetermined tissue; wherein, the spectral parameters are determined by determining the nucleosome protection score of the site on the reference sequence to determine a plurality of nuclear The corpuscle period is used as the first parameter and the Fourier intensities corresponding to the plurality of nucleosome periods are used as the second parameter, so as to determine the spectral parameter of the known gene corresponding to the reference sequence. Applying the method provided by the present invention to the detection of cancer can be more sensitive and accurate.

Description

Method and system for determining tissue source of plasma free nucleic acid molecules by using fragmentation mode and application

Technical Field

The invention relates to the field of biological information, in particular to a method and a system for determining a plasma free nucleic acid molecule tissue source by using a fragmentation mode and application of the method and the system.

Background

Cancer, a general term for malignant tumors and hematological cancers, is a malignant disease in which uncontrolled proliferation of cells occurs with metastasis. The incidence of cancer is high, and the survival rate of cancer at the late stage is low, so that the early discovery, early diagnosis and early treatment of cancer are particularly important.

The liquid biopsy technology is different from the solid tumor tissue in the operation for testing and detecting, and is a method for safely, conveniently and accurately detecting the cancer by drawing blood. The main principle is as follows: plasma free dna (cfdna) is derived from a fragmented genome released following apoptosis in various tissues and organs of the human body. When cancer occurs, the cfDNA contains a genomic fragment of cancer cells. In addition, the half-life of cfDNA is about ten minutes to several hours, which makes it possible to take blood to monitor cancer.

However, when blood is drawn for cancer monitoring, the accuracy of predicting tissue lesions needs to be further improved.

Disclosure of Invention

The present invention is directed to solving, at least to some extent, one of the technical problems in the related art. To this end, it is an object of the present invention to provide a method for determining the association of a nucleic acid molecule with a predetermined tissue, a system for determining the association of a nucleic acid molecule with a predetermined tissue, and a method for determining the origin of a plasma-free nucleic acid tissue.

The present invention is obtained based on the following findings:

the pattern of genomic fragmentation during apoptosis is generally thought to be related to its chromatin structure. The fragmentation pattern, as the name implies, is a fragment endpoint pattern generated when a nuclease breaks DNA into fragments of different lengths within a cell. Genomic regions with open chromatin order are more prone to being disrupted by binding nucleases than regions with tight chromatin order; naked DNA, which does not bind any proteins, is more easily disrupted by nucleases, while regions protected by nucleosomes, transcription factors, etc., are less easily disrupted. Therefore, nucleosome arrangement of each gene region can be deduced from the fragmentation pattern of the genome. The arrangement of nucleosomes on the genome is extremely periodic, and according to the current research of MNase-seq, the distance between adjacent nucleosomes is about 185bp, including a core winding region of about 145bp and a junction region of about 40 bp.

The nucleosome arrangement of the gene promoter region can be deduced by using the gene expression data of each tissue cell in the public data set. Specifically, the higher the gene expression, the closer the nucleosomes of the promoter region are arranged, whereas the lower the gene expression, the looser the nucleosomes of the promoter region are arranged. Cancer was screened using fragmentation patterns of cfDNA. Currently in the field of cancer liquid biopsy, most studies are focused on cancer-specific mutations, copy number variations and methylation patterns in cfDNA, and since 2015, research methods to detect cancer using cfDNA fragmentation patterns began to emerge. Because different tissue cells have different chromatin structures and genome fragmentation patterns during apoptosis are different, mixed nucleosome arrangement is deduced by detecting the fragmentation patterns in cfDNA, and the tissue source distribution of cfDNA can be decoupled by combining the nucleosome arrangement deduced from gene expression data of each tissue cell in a public data set, so that whether a subject contains cancer tissues or not is screened.

Therefore, the invention provides the following technical scheme:

according to a first aspect of the invention, there is provided a method of determining the relatedness of a nucleic acid molecule to a predetermined tissue, comprising: aligning the sequencing result of the nucleic acid molecule with a plurality of reference sequences, wherein the sequencing result is composed of a plurality of sequencing reads, and each of the plurality of reference sequences is determined based on the sequence of one of a plurality of known genes; respectively determining the frequency spectrum parameters of the known genes corresponding to the reference sequences aiming at each of the plurality of reference sequences based on the comparison processing result; determining a correlation of the nucleic acid molecule with the predetermined tissue based on the expression amount information of the plurality of known genes in the predetermined tissue and the spectral parameters of the plurality of known genes; wherein the spectral parameters are determined for each of the reference sequences by: separately determining a nucleosome protection score for each site for at least a portion of the sites of the reference sequence; determining a first parameter and a second parameter based on the nucleosome protection score of each locus, the first parameter characterizing a plurality of nucleosome cycles, the second parameter characterizing Fourier intensities corresponding to the plurality of nucleosome cycles, respectively; determining the spectral parameters of the reference sequence corresponding to the known genes based on the first and second parameters.

The method for determining the correlation of a nucleic acid molecule with a predetermined tissue provided by the invention can judge the correlation of the nucleic acid molecule with the predetermined tissue. By studying the association of the same nucleic acid molecule with a predetermined tissue in different individuals, different states can be judged. For example, the method can be used for early cancer screening of the population, and can also be used for predicting and monitoring the recurrence risk of cancer patients and intervening diagnosis and treatment of the cancer patients. And may not only be used to detect a variety of cancers, the type of cancer that may be detected depends on which tissue expression profile data is gathered.

According to an embodiment of the present invention, the method for determining the association of a nucleic acid molecule with a predetermined tissue as described above may further comprise the following technical features:

in some embodiments of the invention, the nucleic acid molecule is a plasma-free nucleic acid molecule.

In some embodiments of the invention, the reference sequence is 5000-11000 bp in length. The length of the reference sequence should be greater than 25 nucleosome cycles, too short a length would result in inaccurate estimation of the second parameter, while too long a reference sequence would introduce regions unrelated to gene expression, thereby reducing the accuracy of the method.

In some embodiments of the invention, the reference sequence comprises the sequence of the known gene.

In some embodiments of the invention, the reference sequence is determined based on the transcription start site of the known gene. The vicinity of the transcription initiation site of the gene is rich in a promoter and a transcription factor binding site, and nucleosome characteristics of the gene are closely related to the expression quantity of the gene, so that the region near the transcription initiation site is selected as a reference sequence, which is favorable for determining the relevance of a nucleic acid molecule and a predetermined tissue.

In some embodiments of the invention, the sequencing is performed by high throughput sequencing.

In some embodiments of the invention, the known genes include all of the genes GRCh37 in Ensembl 75 th edition.

In some embodiments of the invention, for each locus, the nucleosome protection score is determined by:

(1) determining an alignment window based on the sites, the alignment window containing the sites;

(2) determining the sequencing reads that match the alignment window;

(3) distinguishing the sequencing reads into a first type sequencing read and a second type sequencing read, wherein the first type sequencing read completely covers the alignment window, and the second type sequencing read does not completely cover the alignment window;

(4) determining the nucleosome protection score based on the number of the first type of sequencing reads and the number of the second type of sequencing reads.

In some embodiments of the present invention, the length of the alignment window is 100-140. The window size should be consistent with the length of the nucleosome core histone combined DNA, generally 120bp is selected, the too short can mistake the protective site of the transcription factor as the nucleosome protective site, and the too long can lose part of the closely arranged nucleosome sites.

In some embodiments of the invention, the site is located in the middle of the alignment window.

In some embodiments of the invention, in step (4), the nucleosome protection score is positively correlated with the number of the first type of sequencing reads and negatively correlated with the number of the second type of sequencing reads.

In some embodiments of the invention, in step (4), the nucleosome protection score is determined by the formula number of first type sequencing reads-the second type sequencing reads.

In some embodiments of the present invention, based on the locus information of the reference sequence and the nucleosome protection score corresponding to the locus information, a plurality of nucleosome periods corresponding to the reference sequence are determined as a first parameter, and fourier intensities corresponding to the plurality of nucleosome periods are determined as a second parameter.

In some embodiments of the invention, the first and second parameters are determined by fourier transform.

In some embodiments of the invention, the first and second parameters are determined by:

(a) performing smooth correction on the basis of the site information of the reference sequence and a nucleosome protection score corresponding to the site information, and then drawing to obtain a first graph, wherein the x axis is the distance between the site and the gene transcription start site, and the y axis is the corrected nucleosome protection score;

(b) performing Fourier transform on the first graph obtained in the step (a) to obtain a second graph, wherein the x axis of the second graph is the nucleosome period, and the y axis of the second graph is the Fourier intensity corresponding to the nucleosome period.

In some embodiments of the invention, the spectral parameter is determined based on the following formula:

wherein T represents the nucleosome cycle, F (T) represents the Fourier intensity corresponding to the nucleosome cycle T, a represents the lower limit of the nucleosome cycle, and b represents the upper limit of the nucleosome cycle.

In some embodiments of the present invention, a 168 and b 208. This interval is obtained by maximizing the association of healthy human nucleic acid molecules with blood tissue, representing the range of nucleosome cycles most associated with gene expression, and other possible parameters include: a is equal to 120,170, b is equal to 205,212. In some embodiments of the invention, after determining the parameters a, b, the spectral parameters are related only to the first and second parameters.

In some embodiments of the invention, the correlation is determined by the following equation:

where ρ is_iRepresenting the Pearson correlation coefficient of said nucleic acid molecule with a predetermined tissue i,

representing said spectral parameter, A, of said nucleic acid molecule in a plurality of known genes_iRepresents the expression amount of the predetermined tissue i in the plurality of known genes, Var represents variance, and Cov represents covariance.

In some embodiments of the invention, the predetermined organization is selected from: liver, gallbladder, spleen, lung, kidney, bladder, esophagus, stomach, small intestine, colon, rectum, duodenum, appendix, pancreatic islet, salivary gland, tonsil, thyroid, parathyroid, adrenal gland, breast, ovary, fallopian tube, uterus, endometrium, cervix, prostate, foreskin, testis, epididymis, seminal vesicle, skin, adipose tissue, cerebral cortex, eye, cardiac muscle, skeletal muscle, smooth muscle, lymph node, blood, bone marrow.

According to a second aspect of the present invention, there is provided a system for determining the relatedness of a nucleic acid molecule to a predetermined tissue, comprising: an alignment device for aligning a sequencing result of the nucleic acid molecule with a plurality of reference sequences, wherein the sequencing result is composed of a plurality of sequencing reads, and each of the plurality of reference sequences is determined based on a sequence of one of a plurality of known genes; the spectrum parameter determining device is connected with the comparing device and respectively determines the spectrum parameters of the known genes corresponding to the reference sequences aiming at each of the plurality of reference sequences; a correlation determination device connected to the spectrum parameter determination device, the correlation determination device determining a correlation of the nucleic acid molecule with the predetermined tissue based on information on expression amounts of a plurality of the known genes in the predetermined tissue and the spectrum parameters of the plurality of the known genes; wherein the spectrum parameter determination device determines, for each of the reference sequences, by: a nucleosome protection score determining unit, wherein the nucleosome protection score determining unit respectively determines the nucleosome protection score of each locus aiming at least one part of the loci of the reference sequence; a parameter determining unit connected to the nucleosome protection score determining unit, the parameter determining unit determining a first parameter and a second parameter based on the nucleosome protection score of each locus, the first parameter characterizing a plurality of nucleosome periods, the second parameter characterizing fourier intensities corresponding to the plurality of nucleosome periods, respectively; and the spectrum parameter calculation unit is connected with the parameter determination unit and determines the spectrum parameters of the known genes corresponding to the reference sequence based on the first parameters and the second parameters.

According to an embodiment of the present invention, the system for determining the association of a nucleic acid molecule with a predetermined tissue as described above may further comprise the following technical features:

in some embodiments of the invention, in the above system, the nucleic acid molecule is a plasma-free nucleic acid molecule;

in some embodiments of the present invention, in the above system, the length of the reference sequence is 5000-11000 bp.

In some embodiments of the invention, in the above system, the reference sequence comprises a sequence of the known gene.

In some embodiments of the invention, in the above system, the reference sequence is determined based on the transcription start site of the known gene.

In some embodiments of the invention, in the above system, the sequencing is performed by high throughput sequencing. In some embodiments of the invention, in the above system, the first parameter and the second parameter are determined by fourier transform.

In some embodiments of the present invention, in the above system, the first parameter and the second parameter are determined by:

wherein T represents the nucleosome cycle, F (T) represents the Fourier intensity corresponding to the nucleosome cycle T, a represents the lower limit of the nucleosome cycle, and b represents the upper limit of the nucleosome cycle; in some embodiments of the present invention, in the above system, a is 168 and b is 208.

In some embodiments of the present invention, in the above system, the correlation is determined by the following formula:

In some embodiments of the invention, in the above system, for each locus, the nucleosome protection score is determined by:

(2) determining the sequencing reads that match the alignment window;

In some embodiments of the present invention, in the above system, the alignment window is 100-140 bp in length.

In some embodiments of the invention, in the above system, the site is located in the middle of the alignment window.

In some embodiments of the invention, in the above system, in step (4), the nucleosome protection score is in positive correlation with the number of the first type of sequencing reads and in negative correlation with the number of the second type of sequencing reads.

In some embodiments of the invention, in the above system, in step (4), the nucleosome protection score is determined by the formula number of first type sequencing reads-number of the second type sequencing reads.

In some embodiments of the present invention, in the above system, based on the locus information of the reference sequence and the nucleosome protection score corresponding to the locus information, a plurality of nucleosome cycles corresponding to the reference sequence are determined as the first parameter, and fourier intensities corresponding to the nucleosome cycles are determined as the second parameter.

According to a third aspect of the present invention, there is provided a method of determining the source of blood cell-free nucleic acid tissue, comprising: (i) determining the association of said blood cell-free nucleic acid with a predetermined tissue according to the method of any embodiment of the first aspect of the invention; (ii) ranking the correlations to determine the tissue origin of the plasma free nucleic acids.

According to an embodiment of the present invention, the above method for determining the source of blood cell-free nucleic acid tissue may further comprise the following technical features:

in some embodiments of the invention, the ranking obtained in step (ii) is compared to a control ranking in order to determine the source of the organization of which the ranking varies.

In some embodiments of the invention, the control ranking is determined for a genetically normal individual or an individual with a known status.

In some embodiments of the invention, the ranking is by high to low, and the tissue with the highest ranking is selected as the important monitoring tissue.

The beneficial effects obtained by the invention are as follows: compared with the existing technology for screening cancers by using cfDNA fragmentation mode, the liquid biopsy analysis method for screening pan-cancer can more fully utilize cfDNA fragmentation mode information, has higher correlation with gene expression, and can more sensitively and accurately detect various cancers when the sequencing is deeper.

Drawings

Fig. 1 is a technical schematic diagram provided according to an embodiment of the present invention, which illustrates a key difference between the spectral weighted nucleosome periodic characteristics of the invention and the prior art.

FIG. 2 is a correlation scatterplot provided in accordance with an embodiment of the present invention, showing that the spectral-weighted nucleosome cycle characteristics used in the present invention correlate with gene expression more strongly than in the prior art.

FIG. 3 is a graph of sequencing depth versus correlation according to an embodiment of the present invention, demonstrating that the correlation of the spectrally weighted nucleosome cycle characteristics used in the present invention with gene expression increases with increasing sequencing depth, which is not the case in the prior art.

FIG. 4 shows the results of 49 cancer samples tested using the present invention and the prior art according to an embodiment of the present invention. When sequencing is shallow, neither method is sufficient for screening; when the sequencing is deeper, the real cancer tissues can be detected more sensitively by using the method.

Fig. 5 shows the result of downsampling analysis using the present invention and the prior art according to an embodiment of the present invention.

FIG. 6 is a schematic diagram of a system for determining the relatedness of a nucleic acid molecule to a predetermined tissue according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.

Method for determining the relatedness of a nucleic acid molecule to a predetermined tissue and method for determining the relatedness of a nucleic acid molecule to a predetermined tissueSystem for controlling a power supply

According to one aspect of the invention, there is provided a method of determining the relatedness of a nucleic acid molecule to a predetermined tissue, comprising: aligning the sequencing result of the nucleic acid molecule with a plurality of reference sequences, wherein the sequencing result is composed of a plurality of sequencing reads, and each of the plurality of reference sequences is determined based on the sequence of one of a plurality of known genes; respectively determining the frequency spectrum parameters of the known genes corresponding to the reference sequences aiming at each of the plurality of reference sequences based on the comparison processing result; determining a correlation of the nucleic acid molecule with the predetermined tissue based on the expression amount information of the plurality of known genes in the predetermined tissue and the spectral parameters of the plurality of known genes; wherein the spectral parameters are determined for each of the reference sequences by: separately determining a nucleosome protection score for each site for at least a portion of the sites of the reference sequence; determining a first parameter and a second parameter based on the nucleosome protection score of each locus, the first parameter characterizing a plurality of nucleosome cycles, the second parameter characterizing Fourier intensities corresponding to the plurality of nucleosome cycles, respectively; determining the spectral parameters of the reference sequence corresponding to the known genes based on the first and second parameters.

Herein, determining the association of a nucleic acid molecule with a predetermined tissue refers to: by processing and analyzing the fragmentation pattern of the nucleic acid molecules, the sizes of the correlations between nucleosome characteristics of the nucleic acid molecules and gene expression of different tissues are obtained, and then by comparing the sizes of the correlations, the relationship between the nucleic acid molecules and the tissues can be judged. For example, the likelihood that the nucleic acid molecules are derived from these tissues can be determined; for example, it can be determined from which tissues the components are more abundant by comparing the relatedness of these nucleic acid molecules to different tissues.

As used herein, the term "reference sequence" refers to a nucleic acid sequence that is capable of being aligned to characterize the genetic or genomic origin, length, etc. of the nucleic acid molecule. According to embodiments of the invention, the reference sequence may be a gene or genome of known nucleic acid sequence. By aligning the sequencing result of the nucleic acid molecule to a plurality of reference sequences, the fragmentation pattern information of the nucleic acid molecule, such as the genome position, the length and the like, is obtained by means of the information of the reference sequences, and the information can be used for indicating different nucleic acid molecules. These reference sequences may be in the form of a genome or gene, etc. For example, the reference sequence can be the sequence of a known gene. The length of the reference sequence can be 5000-11000 bp, for example 10000bp, according to different genes. The reference sequence can extend to about 5000-11000 bp downstream from the transcription start site of a known gene. These known genes may be all of the genes GRCh37 inEnsembl 75 th edition.

The term "fragmentation pattern" refers to a fragmentation pattern exhibited by a nucleic acid molecule, such as at which positions fragmentation occurs and thus fragmentation, etc. For example, intracellular nucleases can break DNA into fragments of varying lengths, thereby creating different fragmentation patterns. For example, genomic regions with open chromatin alignment are more prone to bind nucleases and thus are disrupted than regions with tight chromatin alignment. As another example, naked DNA that does not bind to any protein is more easily disrupted by nucleases, while regions protected by nucleosomes, transcription factors, etc. are less easily disrupted. Analysis of this fragmentation pattern information can be used to indicate different nucleic acid molecules.

According to embodiments of the invention, the nucleic acid molecule may be an extracellular nucleic acid molecule. For example, it may be plasma-free DNA (cfDNA). Plasma free DNA is derived from a fragmented genome released after apoptosis in various tissues and organs of the human body. Analysis of fragmentation pattern information for the fragmented genomes in these plasma-free DNAs can be used to indicate different nucleic acid molecules. The nucleic acid molecules in these plasma-free DNA are derived from various tissues of the human body. Through the analysis of the fragmentation pattern information of different nucleic acid molecules, the correlation between the nucleic acid molecules and different tissues can be determined, so that the strength of the correlation between the nucleic acid molecules in the plasma free DNA and the tissues can be analyzed.

In obtaining the sequencing results of nucleic acid molecules, it can be performed by high throughput sequencing. When the nucleic acid molecules are sequenced to obtain a sequencing result, the nucleic acid molecules can be obtained by preparing a sequencing library of the nucleic acid sequence of a sample to be detected and performing computer sequencing. According to the embodiment of the invention, the sequencing library can be prepared according to the requirements of the selected sequencing method, the sequencing method can be selected according to the different selected sequencing platforms, but is not limited to Hisq2000/2500 sequencing platform of Illumina, Ion Torrent platform of Life Technologies, BGISEQ platform of BGI, the sequencing mode is double-end sequencing, and the obtained off-line data is the sequencing read fragment, which is called reads (reads). In preparing a sequencing library, a single-stranded DNA full genome library, a conventional double-stranded DNA full genome library, a pre-BS full genome methylation library, and the like can be used.

Herein, "aligning" in "aligning the sequencing result of the nucleic acid molecule with a plurality of reference sequences" means matching. For specific alignment, known alignment software can be used, such as bowtie2, SOAP, BWA, TeraMap, etc., which is not limited in this application. In the alignment process, according to the setting of alignment parameters, at most n base mismatches (mismatches) are allowed for a pair or a read, for example, n is set to 1 or 2, if more than n bases in a read are mismatched, it is considered that the pair of reads cannot be aligned to a reference sequence, or if all the mismatched n bases are located in one read of the pair of reads, it is considered that the read in the pair of reads cannot be aligned to the reference sequence. Meanwhile, the method can provide software such as samtools, bedtools and the like to extract information such as the end points, the lengths and the like of the DNA fragments.

Herein, the term "nucleosome protection score", i.e., WPS (window protection score), characterizes the degree to which each site of the genome is protected by nucleosomes, with a greater value indicating a greater likelihood of nucleosomes being present at that site. The WPS value is influenced by the number of nucleic acid molecules on the site comparison, the random effect is large, the correction can be carried out through two steps of smoothing and mean value reduction, and the corrected WPS is used for subsequent analysis.

Herein, the term "spectral parameters" refers to the spectral weighted nucleosome period swp (spectral weighted period), which is obtained by transforming the oscillogram of the nucleosome protective fraction into the spectral space through fourier transform and performing periodic decomposition, and represents the average distance between nucleosomes in a segment of genomic region. A larger value indicates a looser nucleosome arrangement within the region, whereas a smaller SWP indicates a tighter nucleosome arrangement.

According to an embodiment of the invention, the nucleosome protection score for each site can be determined by: (1) determining an alignment window based on the sites, the alignment window containing the sites; (2) determining the sequencing reads that match the alignment window; (3) distinguishing the sequencing reads into a first type sequencing read and a second type sequencing read, wherein the first type sequencing read completely covers the alignment window, and the second type sequencing read does not completely cover the alignment window; (4) determining the nucleosome protection score based on the number of the first type of sequencing reads and the number of the second type of sequencing reads.

The length of the alignment window can be 100-140 bp for each site. For example, it may be 100bp to 130bp, 110bp to 120bp, or the like. Each site is located in the middle of the alignment window. For example, when the alignment window is selected to be approximately 120bp, 60bp upstream of the site and 60bp downstream of the site may be selected as the corresponding alignment windows. And aiming at the comparison window of each site, determining the sequencing reads which can completely cover the comparison window as the first type of sequencing reads, and determining the sequencing reads which do not completely cover the comparison window as the second type of sequencing reads. By overlapping, it is meant that the 5 'end and the 3' end of the sequencing reads are aligned with or outside of the alignment window. Complete coverage, i.e., the 5 'end and 3' end of the sequencing reads are outside of or aligned with the alignment window. Incomplete coverage means that the 5 'end of the sequencing read falls within the alignment window, or the 3' end of the sequencing read falls within the alignment window. According to an embodiment of the invention, the nucleosome protection score of each locus is positively correlated to the number of the first type of sequencing reads and negatively correlated to the number of the second type of sequencing reads. In at least some embodiments, the difference in the number of sequencing reads of the first type and the number of sequencing reads of the second type can be utilized as a nucleosome protection score for each locus.

In one embodiment, the nucleosome protective score of a site is calculated using the following method: considering that the end point position of a certain fragment after being aligned to the genome is [ lo, hi ] in a window [ i-d, i + d ] which is long by taking i as the center (2d +1) for each site i of the genome, and if the intersection of [ lo, hi ] and [ i-d, i + d ] is [ i-d, i + d ], the fragment is considered to completely cover the window; conversely, if the intersection of [ lo, hi ] and [ i-d, i + d ] is not equal to [ i-d, i + d ] and is not null, then the fragment is deemed to not completely cover the window. If there are m segments that completely cover the window and n segments that do not completely cover the window, then WPS is m-n. By analogy, the WPS value of any position of the genome can be calculated.

Further, a plurality of nucleosome periods corresponding to the reference sequence can be determined as a first parameter according to the locus information of the reference sequence and the nucleosome protection score corresponding to the locus information, and the fourier strength corresponding to the nucleosome period is determined as a second parameter. Herein, each locus corresponds to a nucleosome protection score, and as mentioned above, the nucleosome protection score corresponding to each locus may be the same or different, and shows a waveform diagram with high and low in a section of genome region. The peak in the nucleosome protective fraction oscillogram is taken as the nucleosome locus, and the distance between adjacent nucleosomes is the nucleosome period. In a genome region, a nucleosome protection score oscillogram is subjected to fourier transform, which can be understood as that a plurality of nucleosomes with uneven density are subjected to cycle decomposition, so that a plurality of nucleosome cycles and corresponding cycle ratios thereof can be obtained, and the term "first parameter" is used for representing the plurality of nucleosome cycles. The term "second parameter" is used to characterize the fourier intensities, i.e. the period fractions, corresponding to the plurality of nucleosome periods.

In at least some embodiments, the respective first and second parameters may be determined by fourier transformation. In at least some embodiments, the first and second parameters may be determined by: (a) mapping based on the site information of the reference sequence and the nucleosome protection score corresponding to the site information to obtain a first graph, wherein the x axis is the distance between the site and the gene transcription starting site, and the y axis is the corrected nucleosome protection score; (b) performing Fourier transform on the first graph obtained in the step (a) to obtain a second graph, wherein the x axis of the second graph is the nucleosome period, and the y axis of the second graph is the Fourier intensity corresponding to the nucleosome period.

Also, the present invention provides a system for determining the relatedness of a nucleic acid molecule to a predetermined tissue, as shown in FIG. 6, comprising: the nucleic acid sequencing device comprises an alignment device, a spectrum parameter determining device and a correlation determining device, wherein the spectrum parameter determining device is connected with the alignment device, the correlation determining device is connected with the spectrum parameter determining device, and the alignment device is used for performing alignment processing on a sequencing result of the nucleic acid molecule and a plurality of reference sequences, wherein the sequencing result is composed of a plurality of sequencing reads, and each of the plurality of reference sequences is determined based on the sequence of one of a plurality of known genes; the spectrum parameter determination device determines, for each of the plurality of reference sequences, a spectrum parameter of the known gene corresponding to the reference sequence; the spectrum parameter determination device determines the spectrum parameter respectively by the following units for each of the reference sequences: a nucleosome protection score determining unit, wherein the nucleosome protection score determining unit respectively determines the nucleosome protection score of each locus aiming at least one part of the loci of the reference sequence; a parameter determining unit connected to the nucleosome protection score determining unit, the parameter determining unit determining a first parameter and a second parameter based on the nucleosome protection score of each locus, the first parameter characterizing a plurality of nucleosome periods, the second parameter characterizing fourier intensities corresponding to the plurality of nucleosome periods, respectively; and the spectrum parameter calculation unit is connected with the parameter determination unit and determines the spectrum parameters of the known genes corresponding to the reference sequence based on the first parameters and the second parameters.

The above description of the advantages and technical features of the method for determining the association of a nucleic acid molecule with a predetermined tissue is also applicable to the system for determining the association of a nucleic acid molecule with a predetermined tissue provided by the present invention, and will not be described herein again.

The method for determining the correlation between the nucleic acid molecule and the predetermined tissue or the system for determining the correlation between the nucleic acid molecule and the predetermined tissue provided by the invention can be used for analyzing the state of the predetermined tissue, for example, whether the predetermined tissue has cancer or other abnormalities.

In some embodiments, the present invention provides a method of determining the tissue origin of plasma free nucleic acids comprising: (i) determining the association of said plasma free nucleic acids with a predetermined tissue according to the method of determining the association of nucleic acid molecules with a predetermined tissue of the present invention; (ii) ranking the correlations to determine the tissue origin of the plasma free nucleic acids. The ranking of the relevance of a nucleic acid molecule to a predetermined tissue is obtained as a control ranking by analyzing normal individuals or individuals with known status. And (3) comparing the ranking obtained in the step (ii) with the control ranking to determine the tissue source of the ranking change. In some preferred embodiments, the ranking is performed from high to low in relevance, and the tissue with the highest ranking is selected as the important monitoring tissue.

Taking cfDNA as an example, compared with the existing technology for screening cancers by using a cfDNA fragmentation mode, the method provided by the invention is used for detecting the abnormality of the tissue, the method provided by the invention can more fully utilize the information of the cfDNA fragmentation mode, the cycle of nucleosomes is used as a first parameter, the Fourier intensities corresponding to a plurality of cycles of nucleosomes are used as a second parameter, the first parameter and the second parameter are related, the spectrum parameter of the known gene corresponding to the nucleic acid molecule is obtained, the correlation coefficient with the gene expression is larger, and a plurality of cancers can be detected more sensitively and accurately when the sequencing is deeper.

The scheme of the invention will be explained with reference to the examples. It will be appreciated by those skilled in the art that the following examples are illustrative of the invention only and should not be taken as limiting the scope of the invention. The examples, where specific techniques or conditions are not indicated, are to be construed according to the techniques or conditions described in the literature in the art or according to the product specifications. The reagents or instruments used are not indicated by the manufacturer, and are all conventional products commercially available.

Example 1

Example 1 takes cfDNA as an example, providing a method for detecting related spectral parameters in cfDNA obtained by blood drawing, which specifically comprises the following steps:

for the test subject, it is only necessary to draw about 10ml of peripheral blood. Separating plasma from peripheral blood of a subject, extracting cfDNA, establishing a library by whole genome and sequencing at both ends to obtain a cfDNA sequence, matching the obtained cfDNA sequence to a genome to obtain fragmentation mode information such as genome position, length and the like of a cfDNA fragment source, further calculating a nucleosome protection score (WPS) of the whole genome with single base resolution, and finally obtaining a frequency spectrum weighted nucleosome cycle (SWP) (namely frequency spectrum parameters) of each gene in the cfDNA. The method specifically comprises the following steps:

1. separating plasma: peripheral blood is obtained from a test subject, the blood is firstly separated by centrifugation at 4 ℃ and 1600g for 10 minutes, then the upper plasma is transferred to an empty tube, the plasma is centrifuged at 4 ℃ and 16000g for 10 minutes to completely remove cell components, and finally the supernatant obtained after two steps of centrifugation is taken for subsequent experiments or is placed in a frozen storage at-80 ℃. Incomplete plasma separation directly causes genome pollution of blood cells in the extracted cfDNA, so two-step centrifugation is generally suitable, and in addition, the genome pollution phenomenon of the cfDNA can also be caused due to overlong freezing time of a plasma sample.

2. Extracting cfDNA: using QIAGEN

The Circulating Nucleic Acid Kit extracts cfDNA.

3. Library construction and sequencing: double-strand library construction was performed using the ThruPLEX-FD and ThruPLEX DNA-seq kits and paired-end sequencing was performed using the sequencing platforms HiSeq 2000 and NextSeq 500.

4. The cfDNA obtained by sequencing was aligned to the GRCh37 genome using alignment software bowtie2, and information such as fragment endpoint, length, etc. was extracted using samtools software.

5. Calculating nucleosome protection score (WPS): considering that the end point position of a certain fragment after being aligned to the genome is [ lo, hi ] in a window [ i-d, i + d ] which is long by taking i as the center (2d +1) for each site i of the genome, and if the intersection of [ lo, hi ] and [ i-d, i + d ] is [ i-d, i + d ], the fragment is considered to completely cover the window; conversely, if the intersection of [ lo, hi ] and [ i-d, i + d ] is not equal to [ i-d, i + d ] and is not null, then the fragment is deemed to not completely cover the window. If there are m segments that completely cover the window and n segments that do not completely cover the window, then WPS is m-n. By analogy, the WPS value of any position of the genome can be calculated. In this example, take d-60 to match the size of a single nucleosome and calculate WPS using a fragment of length [120,180 ].

6. Calculating spectral parameters (SWP): taking a 10kb region downstream of the Transcription initiation Site (TSS) of the gene, namely: if the gene is located in the positive strand of the genome, [ TSS, TSS +10000 ]); if the gene is located on the negative strand of the genome, (TSS-10000, TSS)]. Then, WPS (as defined as WPS) at each position in the gene region

) Performing fast Fourier transform to extract frequency domain information of nucleosomes to obtain Fourier intensity under each period T

The spectral parameters of the gene are:

in order to maximize the correlation between SWP and gene expression, a is 168 and b is 208. Meanwhile, the present embodiment provides an algorithm for decoupling cfDNA tissue source distribution based on the aforementioned SWP features. Because the expression of the genes has a negative correlation with the SWP, the method calculates the correlation coefficient of the cfDNA weighted nucleosome frequency spectrum of the test subject and the gene expression of each tissue cell by means of a large amount of gene expression data in a public data set and sorts the correlation coefficient, and then the source tissue sorting of the cfDNA of the test subject can be obtained.

In the method of calculating cfDNA tissue origin distribution, the tissue type of the expression data used determines the traceable cfDNA-derived tissue type, and thus the detectable cancer type. In THE invention, transcript _ rna _ tissue.tsv.zip and transcript _ rna _ celline.tsv in THE THE HUMAN PROTEIN ATLAS database are used as gene expression data sets of each tissue cell, THE maximum transcript expression value is calculated as THE expression value of THE gene, and THE zero value in THE expression value matrix is replaced by 0.04 to take THE logarithm as THE logarithmic expression matrix.

Further, the spectral parameters of cfDNA are recorded as

M is the number of genes. Recording the logarithmic expression matrix in each tissue cell as A, A in the ith row and jth column of A_ijRepresenting the expression value of the gene j in the ith tissue, i is more than or equal to 1 and less than or equal to N, j is more than or equal to 1 and less than or equal to M, N is the number of samples, and marking the ith row in A as A_i＝(A_i1,A_i2,…,A_iM). Then, the Pearson correlation coefficient of cfDNA with any tissue i is

Then

I.e. the ranking of tissue i in the cfDNA origin tissue ordering, and the origin tissue ordering of cfDNA is r ═ (r)₁,r₂,…r_N)。

To elaborate the differences of the spectral parameters compared to the prior methods, this example provides figures 1, 2, 3 detailing that improved nucleosome characteristics have a stronger correlation with gene expression and that the correlation increases with increasing sequencing depth.

FIG. 1 shows nucleosome protection scores of three genes with high, medium and low expression, nucleosome characteristics before and after improvement and Pearson's correlation coefficient with logarithmic expression value. The conventional way to calculate the nucleosome characteristics is to calculate the mean value of the Fourier intensity at 193-199bp (dashed line in the middle graph). The FFT intensity main peak of the low-expression gene ENSG00000066468 is 199bp, the nucleosome arrangement is loose, the frequency spectrum parameter is 197.1, the existing nucleosome is characterized by 1114992, and the logarithmic expression value is-4.82; the FFT intensity main peak of the middle expression gene ENSG00000001617 is 193bp, the arrangement is tighter relative to the nucleosome of the low expression gene, the frequency spectrum parameter is 193.6, the characteristic of the existing nucleosome is 3767714, and the logarithmic expression value is 1.74; the FFT intensity of the high expression gene ENSG00000003756 has a plurality of main peaks, the nucleosomes are arranged loosely and densely, the spectrum parameter is 185.0, the existing nucleosomes are characterized by 205691, and the logarithmic expression value is 5.04. Pearson correlation coefficients calculated by using nucleosome characteristics and logarithmic expression values of the three genes before and after improvement are-0.06 and-0.91 respectively, which shows that the improved nucleosome characteristics (spectrum parameters) have stronger correlation with gene expression and can better keep the high and low order information of the gene expression.

Figure 2 shows the nucleosome characteristics before and after improvement of three healthy human cfDNA sequencing samples (BH01, IH01, IH02, from the GEO public dataset GSE71378) and their correlation with gene expression in the blood cell line U-937. Wherein, the abscissa of the left three graphs is the existing nucleosome characteristic, namely 193-199bp average Fourier intensity, the abscissa of the right three graphs is the improved nucleosome characteristic, namely the spectrum parameter SWP, the ordinate of the six graphs is the logarithmic expression value of the U-937 cell line, and each point represents a gene. A. Graph B shows the results of BH01, with linear regression R before and after improvement being 0.36 and 0.46, respectively; C. graph D shows the results of IH01, with linear regression R before and after improvement being 0.25 and 0.42, respectively; E. graph F shows the results of IH02, with linear regression R before and after improvement being 0.15 and 0.37, respectively. It can be seen that the improved spectrally weighted nucleosome characteristics are more correlated with gene expression.

Figure 3 shows the effect of sequencing depth on the correlation of nucleosome characteristics with gene expression before and after improvement. The abscissa of the four graphs is the sequencing depth of cfDNA, the ordinate is the minimum of the negative correlation coefficient of nucleosome characteristics with gene expression, the left two graphs are the existing method, i.e., using the existing nucleosome characteristics, and the right two graphs are the improved method, i.e., using the improved nucleosome characteristics. The specific implementation mode is as follows: from the common data sets GSE71378 and SRA438908, 65 cfDNA sequencing samples were taken and their sequencing depth and correlation coefficient with gene expression were calculated and plotted as the true data in the A, B plot (open diamonds). Meanwhile, for each cfDNA sample, nucleosome characteristics of all genes were randomly replaced 100 times, and then correlation analysis was performed with gene expression of each tissue to find the minimum value of negative correlation coefficient, which was plotted as replacement data (solid dots) in A, B graph. Because the correlation between nucleosome arrangement and gene expression is disturbed by the displacement data, the correlation coefficient does not exceed-0.05 in both methods, while the correlation coefficient obtained by the improved method in the real data has a negative correlation with the sequencing depth.

To further verify this negative correlation, 11 samples (BH01, IC15, IC17, IC20, IC35, IC37, SRR3819936, SRR3819937, SRR3819938, SRR3819939, SRR3819940) that were sequenced deeper were selected from 65 cfDNA samples for downsampling analysis, the original sequencing reads were randomly sampled at a step size of 5% and from 5% to 95%, each ratio was repeated 50 times and plotted in the C, D diagram of fig. 3, and downsampling results for the same samples were connected by solid lines. The results prove that the correlation of the improved nucleosome characteristics and the gene expression is enhanced along with the increase of the sequencing depth, while the correlation of the existing nucleosome characteristics and the gene expression is mostly not the same phenomenon, and a few samples have a weakening trend even along with the increase of the sequencing depth. With the continuous reduction of sequencing cost and the continuous deepening of sequencing depth, the improved method has greater potential in the field of cancer detection.

Example 2

Embodiments provide a detection method for screening for potential cancer using the aforementioned cfDNA tissue source distribution. Comparing healthy people and detecting the tissue source sequencing of cfDNA of the testee, and finding out the tissue with abnormal sequencing change, namely the potential canceration tissue of the testee. The tissue source sequence of the subject is compared with that of a healthy person, and the tissue with large rank change is found out to be the potential canceration tissue.

The cfDNA whole genome sequencing sample (BH01) of the healthy person in the public data set GSE71378 is processed according to the method of theembodiment 1 to obtain the source tissue sequencing r of the cfDNA of the healthy person_healthy＝(r₁,r₂,…,r_N). Also, the same procedure was carried out for unknown cfDNA samples to obtain

Calculating the difference between an unknown cfDNA sample and a healthy person's cfDNA sample

Finally, unknown cfDNA samplesThe potential cancer tissue of this is ranked rank (Δ r), and the largest ranking change of the same tissue is considered as the final ranking of the tissue. In this example, 40 tissues are used, specifically including liver, gallbladder, spleen, lung, kidney, bladder, esophagus, stomach, small intestine, colon, rectum, duodenum, appendix, pancreas islet, salivary gland, tonsil, thyroid, parathyroid, adrenal gland, breast, ovary, fallopian tube, uterus, endometrium, cervix, prostate, foreskin, testis, epididymis, seminal vesicle, skin, adipose tissue, cerebral cortex, eye, myocardium, skeletal muscle, smooth muscle, lymph node, blood, and bone marrow.

The invention was implemented in cfDNA WGS data of 49 cancer patients in public datasets GSE71378 and SRA438908, specifically including 15 major classes of cancers, bladder, breast, colorectal, esophageal, head and neck, kidney, liver, lung, ovarian, pancreatic, prostate, skin, testicular, uterine, and gastric, with 10 subjects sequenced deeper (greater than 8 genomes) and the remaining 39 samples sequenced to no more than 3 genomes. The ranking of real cancer tissues in the predicted tissue using the modified method (triangles) and the existing method (circles), respectively, is shown in fig. 4, with the ranking of the same sample in both methods connected by solid lines. Of the 10 high depth cfDNA samples, 9 samples were better than the existing method using the improved method, and the improved method increased the first 3 accuracy from 10% to 50% and the first 10 accuracy from 50% to 90%. In 39 low depth samples, however, the overall effect of both methods was insufficient for cancer detection due to insufficient information content. To further explore the sequencing depth of cfDNA suitable for applying the present invention, 9 high depth cfDNA samples (excluding SRR3819938 samples that are poorly predicted by both methods) were subjected to downsampling analysis, the results are plotted in fig. 5, the black solid line is the improved method, the gray dashed line is the existing method, and the black dashed line is the baseline ranked 10. The sub-icons are entitled sample number and sequencing depth, the abscissa is the down-sampling proportion, the ordinate is the prediction ranking, the original sequencing reads are also randomly sampled according to the proportion of 5% to 95% with the step length being 5%, each proportion is repeated for 50 times, and the maximum value and the minimum value of the variance line respectively represent 25% quantiles and 75% quantiles. Similar to the results in fig. 3, the improved method tended to stabilize as the sequencing depth increased, and when the sequencing depth was higher than 8 x, the true cancer tissue of the other samples was always ranked top 10, except for a slight perturbation in the predicted ranking ofIC 20. On the other hand, the conventional method cannot improve the sequencing depth, but IC17 and IC37 deteriorate with the increase in the sequencing depth, which is consistent with the conclusion of fig. 3.

From the results, the invention uses more comprehensive nucleosome arrangement information, initiates the frequency spectrum weighting nucleosome periodic characteristics, and has higher accuracy rate for detecting various cancer species.

In the description of the present invention, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; may be mechanically coupled, may be electrically coupled or may be in communication with each other; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

Translated fromChinese

将所述核酸分子的测序结果与多个参考序列进行比对处理，其中，所述测序结果由多个测序读段构成，所述多个参考序列的每一个分别是基于多个已知基因之一的序列确定的；The sequencing result of the nucleic acid molecule is compared with a plurality of reference sequences, wherein the sequencing result is composed of a plurality of sequencing reads, each of the plurality of reference sequences is based on a plurality of known genes. A sequence is determined;

基于所述比对处理的结果，针对所述多个参考序列的每一个，分别确定所述核酸分子在所述多个已知基因的频谱参数；Based on the result of the alignment process, for each of the plurality of reference sequences, respectively determining the spectral parameters of the nucleic acid molecule in the plurality of known genes;

基于所述预定组织中多个所述已知基因的表达量信息，以及所述核酸分子在所述多个已知基因的频谱参数，确定所述核酸分子与所述预定组织的相关性；determining the correlation between the nucleic acid molecule and the predetermined tissue based on the expression information of the plurality of known genes in the predetermined tissue and the spectral parameters of the nucleic acid molecule in the plurality of known genes;

其中，所述频谱参数是针对所述参考序列的每一个，分别通过下列步骤确定的：Wherein, the spectrum parameters are determined respectively through the following steps for each of the reference sequences:

针对所述参考序列的至少一部分位点，分别确定每个位点的核小体保护分数；For at least a portion of the sites of the reference sequence, determine the nucleosome protection score for each site separately;

基于所述每个位点的核小体保护分数，确定第一参数和第二参数，所述第一参数表征多个相邻核小体的间距，即多个核小体周期，所述第二参数表征分别与所述多个核小体周期对应的傅里叶强度；Based on the nucleosome protection score of each site, a first parameter and a second parameter are determined. The two parameters represent the Fourier intensities corresponding to the multiple nucleosome periods respectively;

基于所述第一参数和所述第二参数，确定所述参考序列对应所述已知基因的所述频谱参数，determining the spectral parameter of the reference sequence corresponding to the known gene based on the first parameter and the second parameter,

所述第一参数和第二参数是通过下列步骤确定的：The first parameter and the second parameter are determined by the following steps:

(a)基于所述参考序列的位点信息以及所述位点信息对应的核小体保护分数进行平滑矫正后作图得到第一图，其中，x轴为位点与基因转录起始位点的距离，y轴为矫正后的核小体保护分数；(a) Based on the site information of the reference sequence and the nucleosome protection score corresponding to the site information, the first map is obtained after smoothing and correction, wherein the x-axis is the site and the gene transcription start site distance, the y-axis is the corrected nucleosome protection fraction;

(b)将步骤(a)中得到的所述第一图进行傅里叶变换，得到第二图，所述第二图的x轴为核小体周期，y轴为与核小体周期对应的傅里叶强度；(b) Fourier transform is performed on the first graph obtained in step (a) to obtain a second graph, where the x-axis of the second graph is the nucleosome period, and the y-axis is the period corresponding to the nucleosome The Fourier intensity of ;

所述频谱参数是基于下列公式确定的:The spectral parameters are determined based on the following formula:

其中，T代表核小体周期，F(T)代表核小体周期T对应的傅里叶强度，a代表核小体周期的下限，b代表核小体周期的上限；Among them, T represents the nucleosome period, F(T) represents the Fourier intensity corresponding to the nucleosome period T, a represents the lower limit of the nucleosome period, and b represents the upper limit of the nucleosome period;

所述相关性是通过下列公式确定的：The correlation is determined by the following formula:

其中，ρ_i代表所述核酸分子与预定组织i的皮尔逊相关系数，

代表所述核酸分子在多个已知基因的所述频谱参数，A_i代表预定组织i在所述多个已知基因的表达量，Var代表方差，Cov代表协方差。where ρ_i represents the Pearson correlation coefficient between the nucleic acid molecule and the predetermined tissue i,

represents the spectral parameters of the nucleic acid molecule in a plurality of known genes, A_i represents the expression level of the predetermined tissue i in the plurality of known genes, Var represents the variance, and Cov represents the covariance.

2.根据权利要求1所述的方法，其特征在于，所述核酸分子为游离于细胞外的核酸分子；2. The method according to claim 1, wherein the nucleic acid molecule is an extracellular nucleic acid molecule;

所述参考序列长度为5000～11000bp；The length of the reference sequence is 5000-11000 bp;

所述参考序列包括所述已知基因的序列；the reference sequence includes the sequence of the known gene;

所述参考序列是基于所述已知基因的转录起始位点确定的；The reference sequence is determined based on the transcription start site of the known gene;

所述测序是通过高通量测序进行的；the sequencing is performed by high-throughput sequencing;

所述已知基因包括Ensembl第75版中GRCh37的所有基因。The known genes include all genes of GRCh37 in Ensembl 75th edition.

3.根据权利要求1所述的方法，其特征在于，针对每个位点，所述核小体保护分数是通过下列步骤确定的：3. The method of claim 1, wherein, for each site, the nucleosome protection fraction is determined by the following steps:

(1)基于所述位点，确定比对窗口，所述比对窗口含有所述位点；(1) determining an alignment window based on the site, the alignment window containing the site;

(2)确定与所述比对窗口匹配的所述测序读段；(2) determining the sequencing reads that match the alignment window;

(3)将所述测序读段区分为第一类型测序读段和第二类型测序读段，所述第一类型测序读段完全覆盖所述比对窗口，所述第二类型测序读段不完全覆盖所述比对窗口；(3) Distinguishing the sequencing reads into first-type sequencing reads and second-type sequencing reads, the first-type sequencing reads completely cover the alignment window, and the second-type sequencing reads do not completely covering the alignment window;

(4)基于所述第一类型测序读段的数目和所述第二类型测序读段的数目，确定所述核小体保护分数；(4) determining the nucleosome protection score based on the number of the first type of sequencing reads and the number of the second type of sequencing reads;

所述比对窗口长度为100～140bp；The length of the alignment window is 100-140 bp;

所述位点位于所述比对窗口的中部；the site is located in the middle of the alignment window;

在步骤(4)中，所述核小体保护分数与所述第一类型测序读段的数目呈正相关，与所述第二类型测序读段的数目呈负相关；In step (4), the nucleosome protection score is positively correlated with the number of the first type of sequencing reads, and negatively correlated with the number of the second type of sequencing reads;

在步骤(4)中，所述核小体保护分数是通过公式第一类型测序读段的数目-所述第二类型测序读段的数目确定的；In step (4), the nucleosome protection score is determined by the formula: the number of the first type of sequencing reads - the number of the second type of sequencing reads;

基于所述参考序列的位点信息以及所述位点信息对应的核小体保护分数，确定所述参考序列对应的多个核小体周期作为第一参数，以及所述多个核小体周期对应的傅里叶强度作为第二参数。Based on the site information of the reference sequence and the nucleosome protection score corresponding to the site information, multiple nucleosome cycles corresponding to the reference sequence are determined as the first parameter, and the multiple nucleosome cycles are determined as the first parameter. The corresponding Fourier intensity is taken as the second parameter.

4.根据权利要求1所述的方法，其特征在于，4. The method of claim 1, wherein

a＝168，b＝208；a=168, b=208;

在确定参数a、b后，所述频谱参数仅与所述第一、第二参数有关。After the parameters a and b are determined, the spectral parameters are only related to the first and second parameters.

5.根据权利要求1所述的方法，其特征在于，5. The method of claim 1, wherein

所述预定组织选自：肝、胆、脾、肺、肾、膀胱、食管、胃、小肠、结肠、直肠、十二指肠、阑尾、胰岛、唾液腺、扁桃体、甲状腺、甲状旁腺、肾上腺、乳腺、卵巢、输卵管、子宫、子宫内膜、宫颈、前列腺、包皮、睾丸、附睾、精囊、皮肤、脂肪组织、大脑皮层、眼、心肌、骨骼肌、平滑肌、淋巴结、血液、骨髓中的至少一种组织。The predetermined tissue is selected from the group consisting of liver, gallbladder, spleen, lung, kidney, bladder, esophagus, stomach, small intestine, colon, rectum, duodenum, appendix, pancreatic islet, salivary gland, tonsil, thyroid, parathyroid, adrenal gland, At least one of breast, ovary, fallopian tube, uterus, endometrium, cervix, prostate, foreskin, testis, epididymis, seminal vesicle, skin, adipose tissue, cerebral cortex, eye, cardiac muscle, skeletal muscle, smooth muscle, lymph node, blood, bone marrow kind of organization.

6.一种确定核酸分子与预定组织相关性的系统，其特征在于，包括：6. A system for determining the relevance of nucleic acid molecules to predetermined tissues, comprising:

比对装置，所述比对装置用于将所述核酸分子的测序结果与多个参考序列进行比对处理，其中，所述测序结果由多个测序读段构成，所述多个参考序列的每一个分别是基于多个已知基因之一的序列确定的；Alignment device, which is used for aligning the sequencing result of the nucleic acid molecule with a plurality of reference sequences, wherein the sequencing result is composed of a plurality of sequencing reads, and the sequence of the plurality of reference sequences is Each is determined based on the sequence of one of a plurality of known genes, respectively;

频谱参数确定装置，所述频谱参数确定装置与所述比对装置相连，所述频谱参数确定装置针对所述多个参考序列的每一个，分别确定所述核酸分子在所述多个已知基因的频谱参数；A spectral parameter determination device, the spectral parameter determination device is connected to the comparison device, and the spectral parameter determination device respectively determines, for each of the plurality of reference sequences, that the nucleic acid molecule is in the plurality of known genes spectrum parameters;

相关性确定装置，所述相关性确定装置与所述频谱参数确定装置相连，所述相关性确定装置基于所述预定组织中多个所述已知基因的表达量信息，以及所述核酸分子在所述多个已知基因的频谱参数，确定所述核酸分子与所述预定组织的相关性；A correlation determination device, the correlation determination device is connected to the spectral parameter determination device, and the correlation determination device is based on the expression level information of a plurality of the known genes in the predetermined tissue, and the nucleic acid molecule in the Spectral parameters of the plurality of known genes to determine the correlation between the nucleic acid molecule and the predetermined tissue;

其中，所述频谱参数确定装置针对所述参考序列的每一个，分别通过如下单元确定所述频谱参数：Wherein, for each of the reference sequences, the device for determining the spectral parameters determines the spectral parameters through the following units:

核小体保护分数确定单元，所述核小体保护分数确定单元针对所述参考序列的至少一部分位点，分别确定每个位点的核小体保护分数；a nucleosome protection score determination unit, the nucleosome protection score determination unit respectively determines the nucleosome protection score of each site for at least a part of the sites of the reference sequence;

参数确定单元，所述参数确定单元与所述核小体保护分数确定单元相连，所述参数确定单元基于所述每个位点的核小体保护分数，确定第一参数和第二参数，所述第一参数表征多个核小体周期，所述第二参数表征分别与所述多个核小体周期对应的傅里叶强度；A parameter determination unit, the parameter determination unit is connected with the nucleosome protection score determination unit, and the parameter determination unit determines the first parameter and the second parameter based on the nucleosome protection score of each site, so The first parameter represents a plurality of nucleosome periods, and the second parameter represents Fourier intensities corresponding to the plurality of nucleosome periods respectively;

频谱参数计算单元，所述频谱参数计算单元与所述参数确定单元相连，所述频谱参数计算单元基于所述第一参数和所述第二参数，确定所述参考序列对应所述已知基因的所述频谱参数，A spectral parameter calculation unit, the spectral parameter calculation unit is connected to the parameter determination unit, and the spectral parameter calculation unit determines the reference sequence corresponding to the known gene based on the first parameter and the second parameter. the spectral parameters,

(b)将步骤步骤(a)中得到的所述第一图进行傅里叶变换，得到第二图，所述第二图的x轴为核小体周期，y轴为与核小体周期对应的傅里叶强度；(b) Fourier transform is performed on the first graph obtained in step (a) to obtain a second graph, the x-axis of the second graph is the nucleosome period, and the y-axis is the nucleosome period the corresponding Fourier intensity;

7.根据权利要求6所述的系统，其特征在于，所述核酸分子为游离于细胞外的核酸分子；7. The system according to claim 6, wherein the nucleic acid molecule is a nucleic acid molecule free from extracellular;

所述参考序列是基于所述已知基因的转录起始位点确定的；所述测序是通过高通量测序进行的；The reference sequence is determined based on the transcription initiation site of the known gene; the sequencing is performed by high-throughput sequencing;

a＝168，b＝208。a=168, b=208.

8.根据权利要求6所述的系统，其特征在于，针对每个位点，所述核小体保护分数是通过下列步骤确定的：8. The system of claim 6, wherein, for each site, the nucleosome protection fraction is determined by the following steps:

基于所述参考序列的位点信息以及所述位点信息对应的核小体保护分数，确定所述多个核小体周期作为第一参数，以及与所述多个核小体周期对应的傅里叶强度作为第二参数。Based on the site information of the reference sequence and the nucleosome protection score corresponding to the site information, the plurality of nucleosome periods are determined as the first parameters, and the Futher corresponding to the plurality of nucleosome periods is determined. The leaf strength is used as the second parameter.

9.一种确定血浆游离核酸组织来源的方法，其特征在于，包括：9. A method for determining the source of plasma free nucleic acid tissue, comprising:

(i)根据权利要求1～5中任一项所述的方法，确定所述血浆游离核酸与预定组织的相关性；(i) the method according to any one of claims 1 to 5, determining the correlation between the plasma-free nucleic acid and a predetermined tissue;

(ii)将所述相关性进行排序，以便确定所述血浆游离核酸的组织来源；(ii) ranking the correlations in order to determine the tissue origin of the plasma free nucleic acid;

将步骤(ii)中得到的排名与对照排名进行比较，以便确定排名变化的组织来源；comparing the ranking obtained in step (ii) with a control ranking in order to determine the organizational source of the change in ranking;

所述对照排名是基于正常个体或者具有已知状态的个体确定的；The control ranking is determined based on normal individuals or individuals with known status;

所述排序是按照由高到低进行的，并且选择排序升高的组织作为重点监控组织。The ranking is performed from high to low, and an organization with a higher ranking is selected as a key monitoring organization.