Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art. To this end, it is an object of the present invention to provide a method for determining the association of a nucleic acid molecule with a predetermined tissue, a system for determining the association of a nucleic acid molecule with a predetermined tissue, and a method for determining the origin of a plasma-free nucleic acid tissue.
The present invention is obtained based on the following findings:
the pattern of genomic fragmentation during apoptosis is generally thought to be related to its chromatin structure. The fragmentation pattern, as the name implies, is a fragment endpoint pattern generated when a nuclease breaks DNA into fragments of different lengths within a cell. Genomic regions with open chromatin order are more prone to being disrupted by binding nucleases than regions with tight chromatin order; naked DNA, which does not bind any proteins, is more easily disrupted by nucleases, while regions protected by nucleosomes, transcription factors, etc., are less easily disrupted. Therefore, nucleosome arrangement of each gene region can be deduced from the fragmentation pattern of the genome. The arrangement of nucleosomes on the genome is extremely periodic, and according to the current research of MNase-seq, the distance between adjacent nucleosomes is about 185bp, including a core winding region of about 145bp and a junction region of about 40 bp.
The nucleosome arrangement of the gene promoter region can be deduced by using the gene expression data of each tissue cell in the public data set. Specifically, the higher the gene expression, the closer the nucleosomes of the promoter region are arranged, whereas the lower the gene expression, the looser the nucleosomes of the promoter region are arranged. Cancer was screened using fragmentation patterns of cfDNA. Currently in the field of cancer liquid biopsy, most studies are focused on cancer-specific mutations, copy number variations and methylation patterns in cfDNA, and since 2015, research methods to detect cancer using cfDNA fragmentation patterns began to emerge. Because different tissue cells have different chromatin structures and genome fragmentation patterns during apoptosis are different, mixed nucleosome arrangement is deduced by detecting the fragmentation patterns in cfDNA, and the tissue source distribution of cfDNA can be decoupled by combining the nucleosome arrangement deduced from gene expression data of each tissue cell in a public data set, so that whether a subject contains cancer tissues or not is screened.
Therefore, the invention provides the following technical scheme:
according to a first aspect of the invention, there is provided a method of determining the relatedness of a nucleic acid molecule to a predetermined tissue, comprising: aligning the sequencing result of the nucleic acid molecule with a plurality of reference sequences, wherein the sequencing result is composed of a plurality of sequencing reads, and each of the plurality of reference sequences is determined based on the sequence of one of a plurality of known genes; respectively determining the frequency spectrum parameters of the known genes corresponding to the reference sequences aiming at each of the plurality of reference sequences based on the comparison processing result; determining a correlation of the nucleic acid molecule with the predetermined tissue based on the expression amount information of the plurality of known genes in the predetermined tissue and the spectral parameters of the plurality of known genes; wherein the spectral parameters are determined for each of the reference sequences by: separately determining a nucleosome protection score for each site for at least a portion of the sites of the reference sequence; determining a first parameter and a second parameter based on the nucleosome protection score of each locus, the first parameter characterizing a plurality of nucleosome cycles, the second parameter characterizing Fourier intensities corresponding to the plurality of nucleosome cycles, respectively; determining the spectral parameters of the reference sequence corresponding to the known genes based on the first and second parameters.
The method for determining the correlation of a nucleic acid molecule with a predetermined tissue provided by the invention can judge the correlation of the nucleic acid molecule with the predetermined tissue. By studying the association of the same nucleic acid molecule with a predetermined tissue in different individuals, different states can be judged. For example, the method can be used for early cancer screening of the population, and can also be used for predicting and monitoring the recurrence risk of cancer patients and intervening diagnosis and treatment of the cancer patients. And may not only be used to detect a variety of cancers, the type of cancer that may be detected depends on which tissue expression profile data is gathered.
According to an embodiment of the present invention, the method for determining the association of a nucleic acid molecule with a predetermined tissue as described above may further comprise the following technical features:
in some embodiments of the invention, the nucleic acid molecule is a plasma-free nucleic acid molecule.
In some embodiments of the invention, the reference sequence is 5000-11000 bp in length. The length of the reference sequence should be greater than 25 nucleosome cycles, too short a length would result in inaccurate estimation of the second parameter, while too long a reference sequence would introduce regions unrelated to gene expression, thereby reducing the accuracy of the method.
In some embodiments of the invention, the reference sequence comprises the sequence of the known gene.
In some embodiments of the invention, the reference sequence is determined based on the transcription start site of the known gene. The vicinity of the transcription initiation site of the gene is rich in a promoter and a transcription factor binding site, and nucleosome characteristics of the gene are closely related to the expression quantity of the gene, so that the region near the transcription initiation site is selected as a reference sequence, which is favorable for determining the relevance of a nucleic acid molecule and a predetermined tissue.
In some embodiments of the invention, the sequencing is performed by high throughput sequencing.
In some embodiments of the invention, the known genes include all of the genes GRCh37 in Ensembl 75 th edition.
In some embodiments of the invention, for each locus, the nucleosome protection score is determined by:
(1) determining an alignment window based on the sites, the alignment window containing the sites;
(2) determining the sequencing reads that match the alignment window;
(3) distinguishing the sequencing reads into a first type sequencing read and a second type sequencing read, wherein the first type sequencing read completely covers the alignment window, and the second type sequencing read does not completely cover the alignment window;
(4) determining the nucleosome protection score based on the number of the first type of sequencing reads and the number of the second type of sequencing reads.
In some embodiments of the present invention, the length of the alignment window is 100-140. The window size should be consistent with the length of the nucleosome core histone combined DNA, generally 120bp is selected, the too short can mistake the protective site of the transcription factor as the nucleosome protective site, and the too long can lose part of the closely arranged nucleosome sites.
In some embodiments of the invention, the site is located in the middle of the alignment window.
In some embodiments of the invention, in step (4), the nucleosome protection score is positively correlated with the number of the first type of sequencing reads and negatively correlated with the number of the second type of sequencing reads.
In some embodiments of the invention, in step (4), the nucleosome protection score is determined by the formula number of first type sequencing reads-the second type sequencing reads.
In some embodiments of the present invention, based on the locus information of the reference sequence and the nucleosome protection score corresponding to the locus information, a plurality of nucleosome periods corresponding to the reference sequence are determined as a first parameter, and fourier intensities corresponding to the plurality of nucleosome periods are determined as a second parameter.
In some embodiments of the invention, the first and second parameters are determined by fourier transform.
In some embodiments of the invention, the first and second parameters are determined by:
(a) performing smooth correction on the basis of the site information of the reference sequence and a nucleosome protection score corresponding to the site information, and then drawing to obtain a first graph, wherein the x axis is the distance between the site and the gene transcription start site, and the y axis is the corrected nucleosome protection score;
(b) performing Fourier transform on the first graph obtained in the step (a) to obtain a second graph, wherein the x axis of the second graph is the nucleosome period, and the y axis of the second graph is the Fourier intensity corresponding to the nucleosome period.
In some embodiments of the invention, the spectral parameter is determined based on the following formula:
wherein T represents the nucleosome cycle, F (T) represents the Fourier intensity corresponding to the nucleosome cycle T, a represents the lower limit of the nucleosome cycle, and b represents the upper limit of the nucleosome cycle.
In some embodiments of the present invention, a 168 and b 208. This interval is obtained by maximizing the association of healthy human nucleic acid molecules with blood tissue, representing the range of nucleosome cycles most associated with gene expression, and other possible parameters include: a is equal to 120,170, b is equal to 205,212. In some embodiments of the invention, after determining the parameters a, b, the spectral parameters are related only to the first and second parameters.
In some embodiments of the invention, the correlation is determined by the following equation:
where ρ is
iRepresenting the Pearson correlation coefficient of said nucleic acid molecule with a predetermined tissue i,
representing said spectral parameter, A, of said nucleic acid molecule in a plurality of known genes
iRepresents the expression amount of the predetermined tissue i in the plurality of known genes, Var represents variance, and Cov represents covariance.
In some embodiments of the invention, the predetermined organization is selected from: liver, gallbladder, spleen, lung, kidney, bladder, esophagus, stomach, small intestine, colon, rectum, duodenum, appendix, pancreatic islet, salivary gland, tonsil, thyroid, parathyroid, adrenal gland, breast, ovary, fallopian tube, uterus, endometrium, cervix, prostate, foreskin, testis, epididymis, seminal vesicle, skin, adipose tissue, cerebral cortex, eye, cardiac muscle, skeletal muscle, smooth muscle, lymph node, blood, bone marrow.
According to a second aspect of the present invention, there is provided a system for determining the relatedness of a nucleic acid molecule to a predetermined tissue, comprising: an alignment device for aligning a sequencing result of the nucleic acid molecule with a plurality of reference sequences, wherein the sequencing result is composed of a plurality of sequencing reads, and each of the plurality of reference sequences is determined based on a sequence of one of a plurality of known genes; the spectrum parameter determining device is connected with the comparing device and respectively determines the spectrum parameters of the known genes corresponding to the reference sequences aiming at each of the plurality of reference sequences; a correlation determination device connected to the spectrum parameter determination device, the correlation determination device determining a correlation of the nucleic acid molecule with the predetermined tissue based on information on expression amounts of a plurality of the known genes in the predetermined tissue and the spectrum parameters of the plurality of the known genes; wherein the spectrum parameter determination device determines, for each of the reference sequences, by: a nucleosome protection score determining unit, wherein the nucleosome protection score determining unit respectively determines the nucleosome protection score of each locus aiming at least one part of the loci of the reference sequence; a parameter determining unit connected to the nucleosome protection score determining unit, the parameter determining unit determining a first parameter and a second parameter based on the nucleosome protection score of each locus, the first parameter characterizing a plurality of nucleosome periods, the second parameter characterizing fourier intensities corresponding to the plurality of nucleosome periods, respectively; and the spectrum parameter calculation unit is connected with the parameter determination unit and determines the spectrum parameters of the known genes corresponding to the reference sequence based on the first parameters and the second parameters.
According to an embodiment of the present invention, the system for determining the association of a nucleic acid molecule with a predetermined tissue as described above may further comprise the following technical features:
in some embodiments of the invention, in the above system, the nucleic acid molecule is a plasma-free nucleic acid molecule;
in some embodiments of the present invention, in the above system, the length of the reference sequence is 5000-11000 bp.
In some embodiments of the invention, in the above system, the reference sequence comprises a sequence of the known gene.
In some embodiments of the invention, in the above system, the reference sequence is determined based on the transcription start site of the known gene.
In some embodiments of the invention, in the above system, the sequencing is performed by high throughput sequencing. In some embodiments of the invention, in the above system, the first parameter and the second parameter are determined by fourier transform.
In some embodiments of the present invention, in the above system, the first parameter and the second parameter are determined by:
(a) performing smooth correction on the basis of the site information of the reference sequence and a nucleosome protection score corresponding to the site information, and then drawing to obtain a first graph, wherein the x axis is the distance between the site and the gene transcription start site, and the y axis is the corrected nucleosome protection score;
(b) performing Fourier transform on the first graph obtained in the step (a) to obtain a second graph, wherein the x axis of the second graph is the nucleosome period, and the y axis of the second graph is the Fourier intensity corresponding to the nucleosome period.
In some embodiments of the invention, the spectral parameter is determined based on the following formula:
wherein T represents the nucleosome cycle, F (T) represents the Fourier intensity corresponding to the nucleosome cycle T, a represents the lower limit of the nucleosome cycle, and b represents the upper limit of the nucleosome cycle; in some embodiments of the present invention, in the above system, a is 168 and b is 208.
In some embodiments of the present invention, in the above system, the correlation is determined by the following formula:
where ρ is
iRepresenting the Pearson correlation coefficient of said nucleic acid molecule with a predetermined tissue i,
representing said spectral parameter, A, of said nucleic acid molecule in a plurality of known genes
iRepresents the expression amount of the predetermined tissue i in the plurality of known genes, Var represents variance, and Cov represents covariance.
In some embodiments of the invention, in the above system, for each locus, the nucleosome protection score is determined by:
(1) determining an alignment window based on the sites, the alignment window containing the sites;
(2) determining the sequencing reads that match the alignment window;
(3) distinguishing the sequencing reads into a first type sequencing read and a second type sequencing read, wherein the first type sequencing read completely covers the alignment window, and the second type sequencing read does not completely cover the alignment window;
(4) determining the nucleosome protection score based on the number of the first type of sequencing reads and the number of the second type of sequencing reads.
In some embodiments of the present invention, in the above system, the alignment window is 100-140 bp in length.
In some embodiments of the invention, in the above system, the site is located in the middle of the alignment window.
In some embodiments of the invention, in the above system, in step (4), the nucleosome protection score is in positive correlation with the number of the first type of sequencing reads and in negative correlation with the number of the second type of sequencing reads.
In some embodiments of the invention, in the above system, in step (4), the nucleosome protection score is determined by the formula number of first type sequencing reads-number of the second type sequencing reads.
In some embodiments of the present invention, in the above system, based on the locus information of the reference sequence and the nucleosome protection score corresponding to the locus information, a plurality of nucleosome cycles corresponding to the reference sequence are determined as the first parameter, and fourier intensities corresponding to the nucleosome cycles are determined as the second parameter.
According to a third aspect of the present invention, there is provided a method of determining the source of blood cell-free nucleic acid tissue, comprising: (i) determining the association of said blood cell-free nucleic acid with a predetermined tissue according to the method of any embodiment of the first aspect of the invention; (ii) ranking the correlations to determine the tissue origin of the plasma free nucleic acids.
According to an embodiment of the present invention, the above method for determining the source of blood cell-free nucleic acid tissue may further comprise the following technical features:
in some embodiments of the invention, the ranking obtained in step (ii) is compared to a control ranking in order to determine the source of the organization of which the ranking varies.
In some embodiments of the invention, the control ranking is determined for a genetically normal individual or an individual with a known status.
In some embodiments of the invention, the ranking is by high to low, and the tissue with the highest ranking is selected as the important monitoring tissue.
The beneficial effects obtained by the invention are as follows: compared with the existing technology for screening cancers by using cfDNA fragmentation mode, the liquid biopsy analysis method for screening pan-cancer can more fully utilize cfDNA fragmentation mode information, has higher correlation with gene expression, and can more sensitively and accurately detect various cancers when the sequencing is deeper.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
Method for determining the relatedness of a nucleic acid molecule to a predetermined tissue and method for determining the relatedness of a nucleic acid molecule to a predetermined tissueSystem for controlling a power supply
According to one aspect of the invention, there is provided a method of determining the relatedness of a nucleic acid molecule to a predetermined tissue, comprising: aligning the sequencing result of the nucleic acid molecule with a plurality of reference sequences, wherein the sequencing result is composed of a plurality of sequencing reads, and each of the plurality of reference sequences is determined based on the sequence of one of a plurality of known genes; respectively determining the frequency spectrum parameters of the known genes corresponding to the reference sequences aiming at each of the plurality of reference sequences based on the comparison processing result; determining a correlation of the nucleic acid molecule with the predetermined tissue based on the expression amount information of the plurality of known genes in the predetermined tissue and the spectral parameters of the plurality of known genes; wherein the spectral parameters are determined for each of the reference sequences by: separately determining a nucleosome protection score for each site for at least a portion of the sites of the reference sequence; determining a first parameter and a second parameter based on the nucleosome protection score of each locus, the first parameter characterizing a plurality of nucleosome cycles, the second parameter characterizing Fourier intensities corresponding to the plurality of nucleosome cycles, respectively; determining the spectral parameters of the reference sequence corresponding to the known genes based on the first and second parameters.
Herein, determining the association of a nucleic acid molecule with a predetermined tissue refers to: by processing and analyzing the fragmentation pattern of the nucleic acid molecules, the sizes of the correlations between nucleosome characteristics of the nucleic acid molecules and gene expression of different tissues are obtained, and then by comparing the sizes of the correlations, the relationship between the nucleic acid molecules and the tissues can be judged. For example, the likelihood that the nucleic acid molecules are derived from these tissues can be determined; for example, it can be determined from which tissues the components are more abundant by comparing the relatedness of these nucleic acid molecules to different tissues.
As used herein, the term "reference sequence" refers to a nucleic acid sequence that is capable of being aligned to characterize the genetic or genomic origin, length, etc. of the nucleic acid molecule. According to embodiments of the invention, the reference sequence may be a gene or genome of known nucleic acid sequence. By aligning the sequencing result of the nucleic acid molecule to a plurality of reference sequences, the fragmentation pattern information of the nucleic acid molecule, such as the genome position, the length and the like, is obtained by means of the information of the reference sequences, and the information can be used for indicating different nucleic acid molecules. These reference sequences may be in the form of a genome or gene, etc. For example, the reference sequence can be the sequence of a known gene. The length of the reference sequence can be 5000-11000 bp, for example 10000bp, according to different genes. The reference sequence can extend to about 5000-11000 bp downstream from the transcription start site of a known gene. These known genes may be all of the genes GRCh37 inEnsembl 75 th edition.
The term "fragmentation pattern" refers to a fragmentation pattern exhibited by a nucleic acid molecule, such as at which positions fragmentation occurs and thus fragmentation, etc. For example, intracellular nucleases can break DNA into fragments of varying lengths, thereby creating different fragmentation patterns. For example, genomic regions with open chromatin alignment are more prone to bind nucleases and thus are disrupted than regions with tight chromatin alignment. As another example, naked DNA that does not bind to any protein is more easily disrupted by nucleases, while regions protected by nucleosomes, transcription factors, etc. are less easily disrupted. Analysis of this fragmentation pattern information can be used to indicate different nucleic acid molecules.
According to embodiments of the invention, the nucleic acid molecule may be an extracellular nucleic acid molecule. For example, it may be plasma-free DNA (cfDNA). Plasma free DNA is derived from a fragmented genome released after apoptosis in various tissues and organs of the human body. Analysis of fragmentation pattern information for the fragmented genomes in these plasma-free DNAs can be used to indicate different nucleic acid molecules. The nucleic acid molecules in these plasma-free DNA are derived from various tissues of the human body. Through the analysis of the fragmentation pattern information of different nucleic acid molecules, the correlation between the nucleic acid molecules and different tissues can be determined, so that the strength of the correlation between the nucleic acid molecules in the plasma free DNA and the tissues can be analyzed.
In obtaining the sequencing results of nucleic acid molecules, it can be performed by high throughput sequencing. When the nucleic acid molecules are sequenced to obtain a sequencing result, the nucleic acid molecules can be obtained by preparing a sequencing library of the nucleic acid sequence of a sample to be detected and performing computer sequencing. According to the embodiment of the invention, the sequencing library can be prepared according to the requirements of the selected sequencing method, the sequencing method can be selected according to the different selected sequencing platforms, but is not limited to Hisq2000/2500 sequencing platform of Illumina, Ion Torrent platform of Life Technologies, BGISEQ platform of BGI, the sequencing mode is double-end sequencing, and the obtained off-line data is the sequencing read fragment, which is called reads (reads). In preparing a sequencing library, a single-stranded DNA full genome library, a conventional double-stranded DNA full genome library, a pre-BS full genome methylation library, and the like can be used.
Herein, "aligning" in "aligning the sequencing result of the nucleic acid molecule with a plurality of reference sequences" means matching. For specific alignment, known alignment software can be used, such as bowtie2, SOAP, BWA, TeraMap, etc., which is not limited in this application. In the alignment process, according to the setting of alignment parameters, at most n base mismatches (mismatches) are allowed for a pair or a read, for example, n is set to 1 or 2, if more than n bases in a read are mismatched, it is considered that the pair of reads cannot be aligned to a reference sequence, or if all the mismatched n bases are located in one read of the pair of reads, it is considered that the read in the pair of reads cannot be aligned to the reference sequence. Meanwhile, the method can provide software such as samtools, bedtools and the like to extract information such as the end points, the lengths and the like of the DNA fragments.
Herein, the term "nucleosome protection score", i.e., WPS (window protection score), characterizes the degree to which each site of the genome is protected by nucleosomes, with a greater value indicating a greater likelihood of nucleosomes being present at that site. The WPS value is influenced by the number of nucleic acid molecules on the site comparison, the random effect is large, the correction can be carried out through two steps of smoothing and mean value reduction, and the corrected WPS is used for subsequent analysis.
Herein, the term "spectral parameters" refers to the spectral weighted nucleosome period swp (spectral weighted period), which is obtained by transforming the oscillogram of the nucleosome protective fraction into the spectral space through fourier transform and performing periodic decomposition, and represents the average distance between nucleosomes in a segment of genomic region. A larger value indicates a looser nucleosome arrangement within the region, whereas a smaller SWP indicates a tighter nucleosome arrangement.
According to an embodiment of the invention, the nucleosome protection score for each site can be determined by: (1) determining an alignment window based on the sites, the alignment window containing the sites; (2) determining the sequencing reads that match the alignment window; (3) distinguishing the sequencing reads into a first type sequencing read and a second type sequencing read, wherein the first type sequencing read completely covers the alignment window, and the second type sequencing read does not completely cover the alignment window; (4) determining the nucleosome protection score based on the number of the first type of sequencing reads and the number of the second type of sequencing reads.
The length of the alignment window can be 100-140 bp for each site. For example, it may be 100bp to 130bp, 110bp to 120bp, or the like. Each site is located in the middle of the alignment window. For example, when the alignment window is selected to be approximately 120bp, 60bp upstream of the site and 60bp downstream of the site may be selected as the corresponding alignment windows. And aiming at the comparison window of each site, determining the sequencing reads which can completely cover the comparison window as the first type of sequencing reads, and determining the sequencing reads which do not completely cover the comparison window as the second type of sequencing reads. By overlapping, it is meant that the 5 'end and the 3' end of the sequencing reads are aligned with or outside of the alignment window. Complete coverage, i.e., the 5 'end and 3' end of the sequencing reads are outside of or aligned with the alignment window. Incomplete coverage means that the 5 'end of the sequencing read falls within the alignment window, or the 3' end of the sequencing read falls within the alignment window. According to an embodiment of the invention, the nucleosome protection score of each locus is positively correlated to the number of the first type of sequencing reads and negatively correlated to the number of the second type of sequencing reads. In at least some embodiments, the difference in the number of sequencing reads of the first type and the number of sequencing reads of the second type can be utilized as a nucleosome protection score for each locus.
In one embodiment, the nucleosome protective score of a site is calculated using the following method: considering that the end point position of a certain fragment after being aligned to the genome is [ lo, hi ] in a window [ i-d, i + d ] which is long by taking i as the center (2d +1) for each site i of the genome, and if the intersection of [ lo, hi ] and [ i-d, i + d ] is [ i-d, i + d ], the fragment is considered to completely cover the window; conversely, if the intersection of [ lo, hi ] and [ i-d, i + d ] is not equal to [ i-d, i + d ] and is not null, then the fragment is deemed to not completely cover the window. If there are m segments that completely cover the window and n segments that do not completely cover the window, then WPS is m-n. By analogy, the WPS value of any position of the genome can be calculated.
Further, a plurality of nucleosome periods corresponding to the reference sequence can be determined as a first parameter according to the locus information of the reference sequence and the nucleosome protection score corresponding to the locus information, and the fourier strength corresponding to the nucleosome period is determined as a second parameter. Herein, each locus corresponds to a nucleosome protection score, and as mentioned above, the nucleosome protection score corresponding to each locus may be the same or different, and shows a waveform diagram with high and low in a section of genome region. The peak in the nucleosome protective fraction oscillogram is taken as the nucleosome locus, and the distance between adjacent nucleosomes is the nucleosome period. In a genome region, a nucleosome protection score oscillogram is subjected to fourier transform, which can be understood as that a plurality of nucleosomes with uneven density are subjected to cycle decomposition, so that a plurality of nucleosome cycles and corresponding cycle ratios thereof can be obtained, and the term "first parameter" is used for representing the plurality of nucleosome cycles. The term "second parameter" is used to characterize the fourier intensities, i.e. the period fractions, corresponding to the plurality of nucleosome periods.
In at least some embodiments, the respective first and second parameters may be determined by fourier transformation. In at least some embodiments, the first and second parameters may be determined by: (a) mapping based on the site information of the reference sequence and the nucleosome protection score corresponding to the site information to obtain a first graph, wherein the x axis is the distance between the site and the gene transcription starting site, and the y axis is the corrected nucleosome protection score; (b) performing Fourier transform on the first graph obtained in the step (a) to obtain a second graph, wherein the x axis of the second graph is the nucleosome period, and the y axis of the second graph is the Fourier intensity corresponding to the nucleosome period.
Also, the present invention provides a system for determining the relatedness of a nucleic acid molecule to a predetermined tissue, as shown in FIG. 6, comprising: the nucleic acid sequencing device comprises an alignment device, a spectrum parameter determining device and a correlation determining device, wherein the spectrum parameter determining device is connected with the alignment device, the correlation determining device is connected with the spectrum parameter determining device, and the alignment device is used for performing alignment processing on a sequencing result of the nucleic acid molecule and a plurality of reference sequences, wherein the sequencing result is composed of a plurality of sequencing reads, and each of the plurality of reference sequences is determined based on the sequence of one of a plurality of known genes; the spectrum parameter determination device determines, for each of the plurality of reference sequences, a spectrum parameter of the known gene corresponding to the reference sequence; the spectrum parameter determination device determines the spectrum parameter respectively by the following units for each of the reference sequences: a nucleosome protection score determining unit, wherein the nucleosome protection score determining unit respectively determines the nucleosome protection score of each locus aiming at least one part of the loci of the reference sequence; a parameter determining unit connected to the nucleosome protection score determining unit, the parameter determining unit determining a first parameter and a second parameter based on the nucleosome protection score of each locus, the first parameter characterizing a plurality of nucleosome periods, the second parameter characterizing fourier intensities corresponding to the plurality of nucleosome periods, respectively; and the spectrum parameter calculation unit is connected with the parameter determination unit and determines the spectrum parameters of the known genes corresponding to the reference sequence based on the first parameters and the second parameters.
The above description of the advantages and technical features of the method for determining the association of a nucleic acid molecule with a predetermined tissue is also applicable to the system for determining the association of a nucleic acid molecule with a predetermined tissue provided by the present invention, and will not be described herein again.
The method for determining the correlation between the nucleic acid molecule and the predetermined tissue or the system for determining the correlation between the nucleic acid molecule and the predetermined tissue provided by the invention can be used for analyzing the state of the predetermined tissue, for example, whether the predetermined tissue has cancer or other abnormalities.
In some embodiments, the present invention provides a method of determining the tissue origin of plasma free nucleic acids comprising: (i) determining the association of said plasma free nucleic acids with a predetermined tissue according to the method of determining the association of nucleic acid molecules with a predetermined tissue of the present invention; (ii) ranking the correlations to determine the tissue origin of the plasma free nucleic acids. The ranking of the relevance of a nucleic acid molecule to a predetermined tissue is obtained as a control ranking by analyzing normal individuals or individuals with known status. And (3) comparing the ranking obtained in the step (ii) with the control ranking to determine the tissue source of the ranking change. In some preferred embodiments, the ranking is performed from high to low in relevance, and the tissue with the highest ranking is selected as the important monitoring tissue.
Taking cfDNA as an example, compared with the existing technology for screening cancers by using a cfDNA fragmentation mode, the method provided by the invention is used for detecting the abnormality of the tissue, the method provided by the invention can more fully utilize the information of the cfDNA fragmentation mode, the cycle of nucleosomes is used as a first parameter, the Fourier intensities corresponding to a plurality of cycles of nucleosomes are used as a second parameter, the first parameter and the second parameter are related, the spectrum parameter of the known gene corresponding to the nucleic acid molecule is obtained, the correlation coefficient with the gene expression is larger, and a plurality of cancers can be detected more sensitively and accurately when the sequencing is deeper.
The scheme of the invention will be explained with reference to the examples. It will be appreciated by those skilled in the art that the following examples are illustrative of the invention only and should not be taken as limiting the scope of the invention. The examples, where specific techniques or conditions are not indicated, are to be construed according to the techniques or conditions described in the literature in the art or according to the product specifications. The reagents or instruments used are not indicated by the manufacturer, and are all conventional products commercially available.
Example 1
Example 1 takes cfDNA as an example, providing a method for detecting related spectral parameters in cfDNA obtained by blood drawing, which specifically comprises the following steps:
for the test subject, it is only necessary to draw about 10ml of peripheral blood. Separating plasma from peripheral blood of a subject, extracting cfDNA, establishing a library by whole genome and sequencing at both ends to obtain a cfDNA sequence, matching the obtained cfDNA sequence to a genome to obtain fragmentation mode information such as genome position, length and the like of a cfDNA fragment source, further calculating a nucleosome protection score (WPS) of the whole genome with single base resolution, and finally obtaining a frequency spectrum weighted nucleosome cycle (SWP) (namely frequency spectrum parameters) of each gene in the cfDNA. The method specifically comprises the following steps:
1. separating plasma: peripheral blood is obtained from a test subject, the blood is firstly separated by centrifugation at 4 ℃ and 1600g for 10 minutes, then the upper plasma is transferred to an empty tube, the plasma is centrifuged at 4 ℃ and 16000g for 10 minutes to completely remove cell components, and finally the supernatant obtained after two steps of centrifugation is taken for subsequent experiments or is placed in a frozen storage at-80 ℃. Incomplete plasma separation directly causes genome pollution of blood cells in the extracted cfDNA, so two-step centrifugation is generally suitable, and in addition, the genome pollution phenomenon of the cfDNA can also be caused due to overlong freezing time of a plasma sample.
2. Extracting cfDNA: using QIAGEN
The Circulating Nucleic Acid Kit extracts cfDNA.
3. Library construction and sequencing: double-strand library construction was performed using the ThruPLEX-FD and ThruPLEX DNA-seq kits and paired-end sequencing was performed using the sequencing platforms HiSeq 2000 and NextSeq 500.
4. The cfDNA obtained by sequencing was aligned to the GRCh37 genome using alignment software bowtie2, and information such as fragment endpoint, length, etc. was extracted using samtools software.
5. Calculating nucleosome protection score (WPS): considering that the end point position of a certain fragment after being aligned to the genome is [ lo, hi ] in a window [ i-d, i + d ] which is long by taking i as the center (2d +1) for each site i of the genome, and if the intersection of [ lo, hi ] and [ i-d, i + d ] is [ i-d, i + d ], the fragment is considered to completely cover the window; conversely, if the intersection of [ lo, hi ] and [ i-d, i + d ] is not equal to [ i-d, i + d ] and is not null, then the fragment is deemed to not completely cover the window. If there are m segments that completely cover the window and n segments that do not completely cover the window, then WPS is m-n. By analogy, the WPS value of any position of the genome can be calculated. In this example, take d-60 to match the size of a single nucleosome and calculate WPS using a fragment of length [120,180 ].
6. Calculating spectral parameters (SWP): taking a 10kb region downstream of the Transcription initiation Site (TSS) of the gene, namely: if the gene is located in the positive strand of the genome, [ TSS, TSS +10000 ]); if the gene is located on the negative strand of the genome, (TSS-10000, TSS)]. Then, WPS (as defined as WPS) at each position in the gene region
) Performing fast Fourier transform to extract frequency domain information of nucleosomes to obtain Fourier intensity under each period T
The spectral parameters of the gene are:
in order to maximize the correlation between SWP and gene expression, a is 168 and b is 208. Meanwhile, the present embodiment provides an algorithm for decoupling cfDNA tissue source distribution based on the aforementioned SWP features. Because the expression of the genes has a negative correlation with the SWP, the method calculates the correlation coefficient of the cfDNA weighted nucleosome frequency spectrum of the test subject and the gene expression of each tissue cell by means of a large amount of gene expression data in a public data set and sorts the correlation coefficient, and then the source tissue sorting of the cfDNA of the test subject can be obtained.
In the method of calculating cfDNA tissue origin distribution, the tissue type of the expression data used determines the traceable cfDNA-derived tissue type, and thus the detectable cancer type. In THE invention, transcript _ rna _ tissue.tsv.zip and transcript _ rna _ celline.tsv in THE THE HUMAN PROTEIN ATLAS database are used as gene expression data sets of each tissue cell, THE maximum transcript expression value is calculated as THE expression value of THE gene, and THE zero value in THE expression value matrix is replaced by 0.04 to take THE logarithm as THE logarithmic expression matrix.
Further, the spectral parameters of cfDNA are recorded as
M is the number of genes. Recording the logarithmic expression matrix in each tissue cell as A, A in the ith row and jth column of A
ijRepresenting the expression value of the gene j in the ith tissue, i is more than or equal to 1 and less than or equal to N, j is more than or equal to 1 and less than or equal to M, N is the number of samples, and marking the ith row in A as A
i=(A
i1,A
i2,…,A
iM). Then, the Pearson correlation coefficient of cfDNA with any tissue i is
Then
I.e. the ranking of tissue i in the cfDNA origin tissue ordering, and the origin tissue ordering of cfDNA is r ═ (r)
1,r
2,…r
N)。
To elaborate the differences of the spectral parameters compared to the prior methods, this example provides figures 1, 2, 3 detailing that improved nucleosome characteristics have a stronger correlation with gene expression and that the correlation increases with increasing sequencing depth.
FIG. 1 shows nucleosome protection scores of three genes with high, medium and low expression, nucleosome characteristics before and after improvement and Pearson's correlation coefficient with logarithmic expression value. The conventional way to calculate the nucleosome characteristics is to calculate the mean value of the Fourier intensity at 193-199bp (dashed line in the middle graph). The FFT intensity main peak of the low-expression gene ENSG00000066468 is 199bp, the nucleosome arrangement is loose, the frequency spectrum parameter is 197.1, the existing nucleosome is characterized by 1114992, and the logarithmic expression value is-4.82; the FFT intensity main peak of the middle expression gene ENSG00000001617 is 193bp, the arrangement is tighter relative to the nucleosome of the low expression gene, the frequency spectrum parameter is 193.6, the characteristic of the existing nucleosome is 3767714, and the logarithmic expression value is 1.74; the FFT intensity of the high expression gene ENSG00000003756 has a plurality of main peaks, the nucleosomes are arranged loosely and densely, the spectrum parameter is 185.0, the existing nucleosomes are characterized by 205691, and the logarithmic expression value is 5.04. Pearson correlation coefficients calculated by using nucleosome characteristics and logarithmic expression values of the three genes before and after improvement are-0.06 and-0.91 respectively, which shows that the improved nucleosome characteristics (spectrum parameters) have stronger correlation with gene expression and can better keep the high and low order information of the gene expression.
Figure 2 shows the nucleosome characteristics before and after improvement of three healthy human cfDNA sequencing samples (BH01, IH01, IH02, from the GEO public dataset GSE71378) and their correlation with gene expression in the blood cell line U-937. Wherein, the abscissa of the left three graphs is the existing nucleosome characteristic, namely 193-199bp average Fourier intensity, the abscissa of the right three graphs is the improved nucleosome characteristic, namely the spectrum parameter SWP, the ordinate of the six graphs is the logarithmic expression value of the U-937 cell line, and each point represents a gene. A. Graph B shows the results of BH01, with linear regression R before and after improvement being 0.36 and 0.46, respectively; C. graph D shows the results of IH01, with linear regression R before and after improvement being 0.25 and 0.42, respectively; E. graph F shows the results of IH02, with linear regression R before and after improvement being 0.15 and 0.37, respectively. It can be seen that the improved spectrally weighted nucleosome characteristics are more correlated with gene expression.
Figure 3 shows the effect of sequencing depth on the correlation of nucleosome characteristics with gene expression before and after improvement. The abscissa of the four graphs is the sequencing depth of cfDNA, the ordinate is the minimum of the negative correlation coefficient of nucleosome characteristics with gene expression, the left two graphs are the existing method, i.e., using the existing nucleosome characteristics, and the right two graphs are the improved method, i.e., using the improved nucleosome characteristics. The specific implementation mode is as follows: from the common data sets GSE71378 and SRA438908, 65 cfDNA sequencing samples were taken and their sequencing depth and correlation coefficient with gene expression were calculated and plotted as the true data in the A, B plot (open diamonds). Meanwhile, for each cfDNA sample, nucleosome characteristics of all genes were randomly replaced 100 times, and then correlation analysis was performed with gene expression of each tissue to find the minimum value of negative correlation coefficient, which was plotted as replacement data (solid dots) in A, B graph. Because the correlation between nucleosome arrangement and gene expression is disturbed by the displacement data, the correlation coefficient does not exceed-0.05 in both methods, while the correlation coefficient obtained by the improved method in the real data has a negative correlation with the sequencing depth.
To further verify this negative correlation, 11 samples (BH01, IC15, IC17, IC20, IC35, IC37, SRR3819936, SRR3819937, SRR3819938, SRR3819939, SRR3819940) that were sequenced deeper were selected from 65 cfDNA samples for downsampling analysis, the original sequencing reads were randomly sampled at a step size of 5% and from 5% to 95%, each ratio was repeated 50 times and plotted in the C, D diagram of fig. 3, and downsampling results for the same samples were connected by solid lines. The results prove that the correlation of the improved nucleosome characteristics and the gene expression is enhanced along with the increase of the sequencing depth, while the correlation of the existing nucleosome characteristics and the gene expression is mostly not the same phenomenon, and a few samples have a weakening trend even along with the increase of the sequencing depth. With the continuous reduction of sequencing cost and the continuous deepening of sequencing depth, the improved method has greater potential in the field of cancer detection.
Example 2
Embodiments provide a detection method for screening for potential cancer using the aforementioned cfDNA tissue source distribution. Comparing healthy people and detecting the tissue source sequencing of cfDNA of the testee, and finding out the tissue with abnormal sequencing change, namely the potential canceration tissue of the testee. The tissue source sequence of the subject is compared with that of a healthy person, and the tissue with large rank change is found out to be the potential canceration tissue.
The cfDNA whole genome sequencing sample (BH01) of the healthy person in the public data set GSE71378 is processed according to the method of the
embodiment 1 to obtain the source tissue sequencing r of the cfDNA of the healthy person
healthy=(r
1,r
2,…,r
N). Also, the same procedure was carried out for unknown cfDNA samples to obtain
Calculating the difference between an unknown cfDNA sample and a healthy person's cfDNA sample
Finally, unknown cfDNA samplesThe potential cancer tissue of this is ranked rank (Δ r), and the largest ranking change of the same tissue is considered as the final ranking of the tissue. In this example, 40 tissues are used, specifically including liver, gallbladder, spleen, lung, kidney, bladder, esophagus, stomach, small intestine, colon, rectum, duodenum, appendix, pancreas islet, salivary gland, tonsil, thyroid, parathyroid, adrenal gland, breast, ovary, fallopian tube, uterus, endometrium, cervix, prostate, foreskin, testis, epididymis, seminal vesicle, skin, adipose tissue, cerebral cortex, eye, myocardium, skeletal muscle, smooth muscle, lymph node, blood, and bone marrow.
The invention was implemented in cfDNA WGS data of 49 cancer patients in public datasets GSE71378 and SRA438908, specifically including 15 major classes of cancers, bladder, breast, colorectal, esophageal, head and neck, kidney, liver, lung, ovarian, pancreatic, prostate, skin, testicular, uterine, and gastric, with 10 subjects sequenced deeper (greater than 8 genomes) and the remaining 39 samples sequenced to no more than 3 genomes. The ranking of real cancer tissues in the predicted tissue using the modified method (triangles) and the existing method (circles), respectively, is shown in fig. 4, with the ranking of the same sample in both methods connected by solid lines. Of the 10 high depth cfDNA samples, 9 samples were better than the existing method using the improved method, and the improved method increased the first 3 accuracy from 10% to 50% and the first 10 accuracy from 50% to 90%. In 39 low depth samples, however, the overall effect of both methods was insufficient for cancer detection due to insufficient information content. To further explore the sequencing depth of cfDNA suitable for applying the present invention, 9 high depth cfDNA samples (excluding SRR3819938 samples that are poorly predicted by both methods) were subjected to downsampling analysis, the results are plotted in fig. 5, the black solid line is the improved method, the gray dashed line is the existing method, and the black dashed line is the baseline ranked 10. The sub-icons are entitled sample number and sequencing depth, the abscissa is the down-sampling proportion, the ordinate is the prediction ranking, the original sequencing reads are also randomly sampled according to the proportion of 5% to 95% with the step length being 5%, each proportion is repeated for 50 times, and the maximum value and the minimum value of the variance line respectively represent 25% quantiles and 75% quantiles. Similar to the results in fig. 3, the improved method tended to stabilize as the sequencing depth increased, and when the sequencing depth was higher than 8 x, the true cancer tissue of the other samples was always ranked top 10, except for a slight perturbation in the predicted ranking ofIC 20. On the other hand, the conventional method cannot improve the sequencing depth, but IC17 and IC37 deteriorate with the increase in the sequencing depth, which is consistent with the conclusion of fig. 3.
From the results, the invention uses more comprehensive nucleosome arrangement information, initiates the frequency spectrum weighting nucleosome periodic characteristics, and has higher accuracy rate for detecting various cancer species.
In the description of the present invention, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; may be mechanically coupled, may be electrically coupled or may be in communication with each other; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.