METHOD OF DETERMINING THE LIKELIHOOD OF A DISEASE BY COMBINING BIOMARKERS AND IMAGING
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of priority of US Provisional Patent Application No. 63/570,041, filed March 26, 2024, which is incorporated by reference herein in its entirety for all purposes.
BACKGROUND
[0002] The accurate and timely diagnosis of diseases stands as a cornerstone in modern healthcare, facilitating effective treatment strategies and improving patient outcomes. Traditional diagnostic methods often rely on observable symptoms or invasive procedures, which may delay diagnosis.
[0003] Moreover, such conventional diagnostic approaches frequently encounter limitations in terms of sensitivity, specificity, and accessibility. Furthermore, the complexity of certain diseases necessitates innovative methodologies capable of integrating diverse sources of information to enhance diagnostic accuracy.
[0004] The present inventors have addressed this need by providing improved methods for determining the likelihood of a disease presence in subjects. This disclosure relates to methods for determining the likelihood of whether a subject has a disease. The methods involve analyzing both biomarkers present in a sample from a subject and imaging data from that same subject.
SUMMARY
[0005] Described herein are methods that determine the likelihood of whether a subject has a disease. The methods involve analyzing both biomarkers present in a sample from a subject and imaging data from that same subject.
[0006] In one aspect, the present invention leverages a combination of multiple data types, including biomarker data, imaging data, and optionally other clinical data. By integrating these different sources of data, the invention provides a more comprehensive and reliable assessment of disease likelihood.
[0007] For example, while non-invasive interrogation of blood-based biomarkers has proved valuable in early disease detection for cancer and other indications such as neurological and immune disorders, imaging is usually needed to confirm the presence of the disease. On the other hand, while imaging approaches are considered the gold standard for acute detection of cancer and other diseases, early disease detection can be challenging when tumor sizes are small or the resolution of imaging does not allow for diagnosis. Combining biomarker analysis with imaging provides a more sensitive and specific diagnostic approach. Moreover, such data can be used in a combined analysis, for example, using artificial intelligence-based approaches to train models for improved diagnosis and/or to increase the confidence of a determination relative to a determination based on either the imaging data or the biomarker data alone.
[0008] Such approaches have utility across a broad range of medical domains, including oncology, cardiology, and neurology. By enabling early detection and prognostication, the methods described herein can facilitate timely interventions, where necessary, and ultimately improve patient outcomes.
[0009] In one aspect, the invention provides a method of determining a likelihood of whether a subject has a disease, comprising: (i) providing a sample from the subject and analyzing the sample to quantify a level of one or more biomarkers in the sample to provide biomarker data associated with the subject; (ii) obtaining imaging data of the subject, wherein the imaging data comprise data obtained from one or more imaging modalities; and (iii) performing a combined analysis of the biomarker data and the imaging data using a computer processor to provide a likelihood of whether the subject has the disease; wherein step (i) can be performed before, after, or at the same time as step (ii).
[0010] In some embodiments, the one or more biomarkers comprise cell free nucleic acids. In some embodiments, the cell free nucleic acids comprise cell free DNA. In some embodiments, the biomarker data comprises sequence data. In some embodiments, the biomarker data comprises data on the presence, absence and/or frequency of: (i) genetic variants present on the cell free nucleic acids, such as single nucleotide variations (SNVs), indels, copy number variations (CNVs) and/or gene fusions; and/or (ii) epigenetic modifications of the cell free nucleic acids, such as nucleic acid methylation. In some embodiments, the biomarker data comprises data on the fragmentation pattern of the cell free nucleic acids, such as the length of the nucleic acids and/or the mapping location of the termini of the cell free nucleic acids to a reference genome. In some embodiments, the one or more biomarkers comprise circulating proteins indicative of the presence or progression of the disease. In some embodiments, the biomarker data is analyzed using machine learning algorithms.
[0011] In some embodiments, the one or more imaging modalities comprise computed tomography (CT) scans; magnetic resonance imaging (MRI) scans; positron emission tomography (PET) scans; ultrasound scans; and/or X-rays. In some embodiments, the imaging data is analyzed using machine learning algorithms. In some embodiments, the combined analysis includes correlating the biomarker data with the imaging data to identify features indicative of the disease. In some embodiments, the combined analysis is performed using machine learning algorithms, wherein the machine learning algorithm provides the likelihood of whether the subject has the disease. In some embodiments, the combined analysis comprises weighting the biomarker data and the imaging data based on their respective diagnostic performance metrics. Such methods can be used to optimize the determination of the likelihood of whether the subject has the disease. In some embodiments, the combined analysis includes feature extraction, feature selection, dimensionality reduction, or data fusion techniques to identify relevant biomarker-imaging associations for disease likelihood determination. In some embodiments, the combined analysis considers temporal changes in the biomarker data and the imaging data to provide a dynamic assessment of disease likelihood or disease progression.
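As a purely illustrative sketch of weighting the biomarker data and the imaging data by their respective diagnostic performance metrics, the following Python example fuses two modality-specific probabilities into a single likelihood. The `ModalityScore` type, the `fuse_scores` function, and the use of a validation AUC above chance (0.5) as the weight are assumptions made for illustration only; this disclosure does not prescribe a particular weighting scheme.

```python
from dataclasses import dataclass


@dataclass
class ModalityScore:
    """Hypothetical container: a per-modality disease probability and its weight source."""
    probability: float  # modality-specific probability of disease, in [0, 1]
    auc: float          # assumed performance metric, e.g. AUC on a validation cohort


def fuse_scores(biomarker: ModalityScore, imaging: ModalityScore) -> float:
    """Weighted average of two probabilities; each weight is the AUC above chance."""
    w_bio = max(biomarker.auc - 0.5, 0.0)
    w_img = max(imaging.auc - 0.5, 0.0)
    total = w_bio + w_img
    if total == 0.0:
        # Neither modality beats chance: fall back to an unweighted mean.
        return (biomarker.probability + imaging.probability) / 2.0
    return (w_bio * biomarker.probability + w_img * imaging.probability) / total


if __name__ == "__main__":
    combined = fuse_scores(ModalityScore(0.8, 0.9), ModalityScore(0.4, 0.7))
    print(f"combined likelihood: {combined:.3f}")
```

In this sketch the better-performing modality pulls the combined likelihood toward its own estimate. A trained machine learning model could replace the fixed weighted average.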
[0012] In some embodiments, the likelihood of whether the subject has the disease is determined using statistical models and/or pattern recognition techniques. In some embodiments, the likelihood of whether the subject has the disease is outputted as a numerical value, probability value, or classification label. In some embodiments, the likelihood of whether the subject has the disease is determined based on a comparison of the combined analysis results with reference data obtained from a cohort of subjects with known disease status. In some embodiments, the likelihood of whether the subject has the disease is adjusted based on additional clinical information, demographic factors, and/or risk factors associated with the disease. In some embodiments, the method further comprises comparing the likelihood to a predetermined threshold and classifying the subject as having the disease if the likelihood meets the predetermined threshold.
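By way of a non-limiting sketch, the adjustment of a likelihood using additional risk information, followed by comparison with a predetermined threshold, might be expressed as below. The odds-form update, the `adjust_for_risk` and `classify` function names, and the example numbers are assumptions chosen for illustration; other adjustment schemes are equally possible.

```python
def adjust_for_risk(likelihood: float, likelihood_ratio: float) -> float:
    """Scale the odds of disease by an assumed risk-factor likelihood ratio."""
    if not 0.0 < likelihood < 1.0:
        raise ValueError("likelihood must lie strictly between 0 and 1")
    if likelihood_ratio <= 0.0:
        raise ValueError("likelihood_ratio must be positive")
    odds = likelihood / (1.0 - likelihood)
    adjusted_odds = odds * likelihood_ratio
    return adjusted_odds / (1.0 + adjusted_odds)


def classify(likelihood: float, threshold: float) -> str:
    """Return a classification label; the threshold is 'met' when likelihood >= threshold."""
    return "disease" if likelihood >= threshold else "no disease"


if __name__ == "__main__":
    adjusted = adjust_for_risk(0.2, 4.0)  # hypothetical inputs
    print(adjusted, classify(adjusted, threshold=0.5))
```

The output here combines a probability value and a classification label, two of the output forms mentioned above.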
[0013] In some embodiments, the disease is selected from the group consisting of cancer, cardiovascular diseases, neurological disorders, autoimmune diseases, infectious diseases, and metabolic disorders. In some embodiments, the disease is cancer. In some embodiments, the sample and the imaging data are obtained from the same subject during a single diagnostic session or from multiple diagnostic sessions.
[0014] In some embodiments, results of the methods disclosed herein are used as an input to generate a report. The report may be in a paper or electronic format. For example, the determined likelihood of whether a subject has a disease, or information derived therefrom, can be displayed directly in such a report. Alternatively, or additionally, diagnostic information or therapeutic recommendations which are at least in part based on the methods disclosed herein can be included in the report.
[0015] The various steps of the methods disclosed herein may be carried out at the same or different times, in the same or different geographical locations, e.g. countries, and/or by the same or different people.
[0016] Additional advantages will be set forth in part in the description which follows or may be learned by practice. The advantages will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0017] The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate certain embodiments, and together with the written description, serve to explain certain principles of the methods, computer readable media, and systems disclosed herein. The description provided herein is better understood when read in conjunction with the accompanying drawings which are included by way of example and not by way of limitation. It will be understood that like reference numerals identify like components throughout the drawings, unless the context indicates otherwise. It will also be understood that some or all of the figures may be schematic representations for purposes of illustration and do not necessarily depict the actual relative sizes or locations of the elements shown.
[0018] FIG. 1 is a schematic diagram of an example of a system suitable for use with some embodiments of the disclosure.
DETAILED DESCRIPTION
[0019] Reference will now be made in detail to certain embodiments of the invention. While the invention will be described in conjunction with such embodiments, it will be understood that they are not intended to limit the invention to those embodiments. On the contrary, the invention is intended to cover all alternatives, modifications, and equivalents, which may be included within the invention as defined by the appended claims.
[0020] Before describing the present teachings in detail, it is to be understood that the disclosure is not limited to specific compositions or process steps, as such may vary. It should be noted that, as used in this specification and the appended claims, the singular forms “a”, “an” and “the” include plural references unless the context clearly dictates otherwise. Thus, for example, reference to “a nucleic acid” includes a plurality of nucleic acids.
[0021] Numeric ranges are inclusive of the numbers defining the range. Measured and measurable values are understood to be approximate, taking into account significant digits and the error associated with the measurement.
[0022] Unless specifically noted in the above specification, embodiments in the specification that recite “comprising” various components are also contemplated as “consisting of” or “consisting essentially of” the recited components.
[0023] The section headings used herein are for organizational purposes and are not to be construed as limiting the disclosed subject matter in any way.
[0024] All patents, patent applications, websites, other publications or documents and the like cited herein whether supra or infra, are expressly incorporated by reference in their entirety for all purposes to the same extent as if each individual item were specifically and individually indicated to be so incorporated by reference. If different versions of a publication, website or the like are published at different times, the version most recently published at the effective filing date of the application is meant, unless otherwise indicated.
Samples
[0025] The invention relates to methods of determining a likelihood of whether a subject has a disease. In particular, the methods of the invention involve providing a sample from the subject and analyzing the sample to quantify a level of one or more biomarkers in the sample to provide biomarker data associated with the subject. In some cases, the sample comprises nucleic acid, such as DNA, for example cell free DNA (cfDNA). In some cases, the sample comprises nucleic acid, such as RNA.
[0026] The sample can be any biological sample isolated from a subject. The sample can be a bodily sample. Samples can include body tissues, such as known or suspected solid tumors, whole blood, platelets, serum, plasma, stool, red blood cells, white blood cells or leukocytes, endothelial cells, tissue biopsies, cerebrospinal fluid, synovial fluid, lymphatic fluid, ascites fluid, interstitial or extracellular fluid, the fluid in spaces between cells, including gingival crevicular fluid, bone marrow, pleural effusions, saliva, mucus, sputum, semen, sweat, and urine. Samples are preferably body fluids, particularly blood and fractions thereof. A sample can be in the form originally isolated from a subject or can have been subjected to further processing to remove components, such as cells, or enrich for one component relative to another. A sample can be isolated or obtained from a subject and transported to a site of sample analysis. The sample may be preserved and shipped at a desirable temperature, e.g., room temperature, 4°C, -20°C, or -80°C. A sample can be isolated or obtained from a subject at the site of the sample analysis.
[0027] The sample may be plasma. The volume of plasma can depend on the desired read depth for sequenced regions. Exemplary volumes are 0.4-40 ml, 5-20 ml, and 10-20 ml. For example, the volume can be 0.5 ml, 1 ml, 5 ml, 10 ml, 20 ml, 30 ml, or 40 ml. A volume of sampled plasma may be 5 to 20 ml.
[0028] A sample can comprise various amounts of nucleic acid that contain genome equivalents. For example, a sample of about 30 ng DNA can contain about 10,000 (10^4) haploid human genome equivalents and, in the case of cell free DNA (cfDNA), about 200 billion (2 x 10^11) individual polynucleotide molecules. Similarly, a sample of about 100 ng of DNA can contain about 30,000 haploid human genome equivalents and, in the case of cfDNA, about 600 billion (6 x 10^11) individual molecules.
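The approximate figures above can be reproduced with a short back-of-the-envelope calculation. The sketch below assumes a haploid human genome mass of about 3.3 pg, an average double-stranded DNA molar mass of about 650 g/mol per base pair, and a mean cfDNA fragment length of 168 base pairs; these constants and the function names are illustrative assumptions rather than values stated in this disclosure.

```python
AVOGADRO = 6.02214076e23      # molecules per mole
HAPLOID_GENOME_PG = 3.3       # assumed mass of one haploid human genome, in picograms
BP_MOLAR_MASS = 650.0         # assumed average molar mass of one base pair, in g/mol


def haploid_genome_equivalents(mass_ng: float) -> float:
    """Approximate number of haploid human genome equivalents in a DNA mass."""
    return mass_ng * 1000.0 / HAPLOID_GENOME_PG  # convert ng to pg, divide by genome mass


def cfdna_molecules(mass_ng: float, mean_length_bp: float = 168.0) -> float:
    """Approximate number of cfDNA fragments in a DNA mass, given a mean fragment length."""
    grams = mass_ng * 1e-9
    moles = grams / (mean_length_bp * BP_MOLAR_MASS)
    return moles * AVOGADRO


if __name__ == "__main__":
    for mass in (30.0, 100.0):
        print(mass, round(haploid_genome_equivalents(mass)), f"{cfdna_molecules(mass):.2e}")
```

Under these assumptions, 30 ng corresponds to roughly 9,000 haploid genome equivalents and on the order of 10^11 fragments, consistent with the rounded values given above.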
[0029] A sample can comprise nucleic acids from different sources, e.g., cellular DNA and cell-free DNA of the same subject, or cellular DNA and cell-free DNA of different subjects. A sample can comprise nucleic acids carrying mutations. For example, a sample can comprise DNA carrying germline mutations and/or somatic mutations. Germline mutations refer to mutations existing in germline DNA of a subject. Somatic mutations refer to mutations originating in somatic cells of a subject, e.g., cancer cells. A sample can comprise DNA carrying cancer-associated mutations (e.g., cancer-associated somatic mutations). A sample can comprise an epigenetic variant (i.e. a chemical or protein modification), wherein the epigenetic variant is associated with the presence of a genetic variant such as a cancer-associated mutation. In some embodiments, the sample comprises an epigenetic variant associated with the presence of a genetic variant, wherein the sample does not comprise the genetic variant.
[0030] The sample may comprise cell free nucleic acids, such as cfDNA. The cfDNA may be obtained from the subject, for example as described above. For example, the sample for analysis may be plasma or serum containing cell-free nucleic acids. “Cell-free DNA,” “cfDNA molecules,” or “cfDNA”, for example, include DNA molecules that naturally occur in a subject in extracellular form (e.g., in blood, serum, plasma, or other bodily fluids such as lymph, cerebrospinal fluid, urine, or sputum). While the cfDNA originally existed in a cell or cells in a large complex biological organism, e.g., a mammal, it has undergone release from the cell(s) in vivo into a fluid found in the organism, and may be obtained by obtaining a sample of the fluid without the need to perform an in vitro cell lysis step. In other words, cell-free nucleic acids or cfDNA are nucleic acids or cfDNA not contained within or otherwise bound to a cell in vivo. Cell-free nucleic acids include DNA, RNA, and hybrids thereof, including cfDNA derived from genomic DNA, mitochondrial DNA, siRNA, miRNA, circulating RNA (cRNA), tRNA, rRNA, small nucleolar RNA (snoRNA), Piwi-interacting RNA (piRNA), long non-coding RNA (long ncRNA), or fragments of any of these. Cell-free nucleic acids can be double-stranded, single-stranded, or a hybrid thereof. A cell-free nucleic acid can be released into bodily fluid through secretion or cell death processes, e.g., cellular necrosis and apoptosis. Some cell-free nucleic acids are released into bodily fluid from cancer cells, e.g., circulating tumor DNA (ctDNA). Others are released from healthy cells. In some embodiments, cfDNA is cell-free fetal DNA (cffDNA). In some embodiments, cell free nucleic acids are produced by tumor cells. In some embodiments, cell free nucleic acids are produced by a mixture of tumor cells and non-tumor cells.
[0031] Exemplary amounts of cell-free nucleic acids in a sample before amplification range from about 1 fg to about 1 µg, e.g., 1 pg to 200 ng, 1 ng to 100 ng, or 10 ng to 1000 ng. For example, the amount can be up to about 600 ng, up to about 500 ng, up to about 400 ng, up to about 300 ng, up to about 200 ng, up to about 100 ng, up to about 50 ng, or up to about 20 ng of cell-free nucleic acid molecules. The amount can be at least 1 fg, at least 10 fg, at least 100 fg, at least 1 pg, at least 10 pg, at least 100 pg, at least 1 ng, at least 10 ng, at least 100 ng, at least 150 ng, or at least 200 ng of cell-free nucleic acid molecules. The amount can be up to 1 femtogram (fg), 10 fg, 100 fg, 1 picogram (pg), 10 pg, 100 pg, 1 ng, 10 ng, 100 ng, 150 ng, or 200 ng of cell-free nucleic acid molecules. The method can comprise obtaining 1 femtogram (fg) to 200 ng cell-free nucleic acid molecules from samples.
[0032] Cell-free nucleic acids have an exemplary size distribution of about 100-500 nucleotides, with molecules of 110 to about 230 nucleotides representing about 90% of molecules, with a mode of about 168 nucleotides and a second minor peak in a range between 240 to 440 nucleotides.
[0033] Cell-free nucleic acids can be isolated from bodily fluids through a fractionation step in which cell-free nucleic acids, as found in solution, are separated from intact cells and other non-soluble components of the bodily fluid. Fractionation may include techniques such as centrifugation or filtration. Alternatively, cells in bodily fluids can be lysed and cell-free and cellular nucleic acids processed together. Generally, after addition of buffers and wash steps, nucleic acids can be precipitated with an alcohol. Further clean up steps may be used such as silica-based columns to remove contaminants or salts. Nonspecific bulk carrier nucleic acids, such as Cot-1 DNA, DNA or protein for bisulfite sequencing, hybridization, and/or ligation, may be added throughout the reaction to optimize certain aspects of the procedure such as yield.
[0034] After such processing, samples can include various forms of nucleic acid including double stranded DNA, single stranded DNA and single stranded RNA. In some embodiments, single stranded DNA and RNA can be converted to double stranded forms so they are included in subsequent processing and analysis steps.
Subjects and diseases
[0035] The subject may be a human. The subject may be a mammal, an animal, a primate, rodent (including mice and rats), or other common laboratory, domestic, companion, service or agricultural animal, for example a rabbit, dog, cat, horse, cow, sheep, goat or pig.
[0036] The subject may have or be suspected of having a disease. The disease can be any disease for which there are known biomarkers and which is detectable by imaging. The disease may be cancer, a cardiovascular disease, a neurological disorder, an autoimmune disease, an infectious disease, or a metabolic disorder. In some embodiments, the disclosed methods can be used in the early detection of cancer.
[0037] The subject may exhibit one or more risk factors associated with the disease. The risk factors will depend on the disease in question. However, exemplary risk factors include age, specific germline genetic variants, diet, physical inactivity, gender, tobacco use, alcohol consumption, body weight, blood pressure and familial history of the disease. The subject may have had the disease in the past, but have since recovered or be in remission.
[0038] The types of cancers that may be detected may include blood cancers, brain cancers, lung cancers, skin cancers, nose cancers, throat cancers, liver cancers, bone cancers, lymphomas, pancreatic cancers, skin cancers, bowel cancers, rectal cancers, thyroid cancers, bladder cancers, kidney cancers, mouth cancers, stomach cancers, solid state tumors, heterogeneous tumors, homogenous tumors and the like. Specific examples of such cancers include biliary tract cancer, bladder cancer, transitional cell carcinoma, urothelial carcinoma, brain cancer, gliomas, astrocytomas, breast carcinoma, metaplastic carcinoma, cervical cancer, cervical squamous cell carcinoma, rectal cancer, colorectal carcinoma, colon cancer, hereditary nonpolyposis colorectal cancer, colorectal adenocarcinomas, gastrointestinal stromal tumors (GISTs), endometrial carcinoma, endometrial stromal sarcomas, esophageal cancer, esophageal squamous cell carcinoma, esophageal adenocarcinoma, ocular melanoma, uveal melanoma, gallbladder carcinomas, gallbladder adenocarcinoma, renal cell carcinoma, clear cell renal cell carcinoma, transitional cell carcinoma, urothelial carcinomas, Wilms tumor, leukemia, acute lymphocytic leukemia (ALL), acute myeloid leukemia (AML), chronic lymphocytic leukemia (CLL), chronic myeloid leukemia (CML), chronic myelomonocytic leukemia (CMML), liver cancer, liver carcinoma, hepatoma, hepatocellular carcinoma, cholangiocarcinoma, hepatoblastoma, lung cancer, non-small cell lung cancer (NSCLC), mesothelioma, B-cell lymphomas, non-Hodgkin lymphoma, diffuse large B-cell lymphoma, mantle cell lymphoma, T cell lymphomas, non-Hodgkin lymphoma, precursor T-lymphoblastic lymphoma/leukemia, peripheral T cell lymphomas, multiple myeloma, nasopharyngeal carcinoma (NPC), neuroblastoma, oropharyngeal cancer, oral cavity squamous cell carcinomas, osteosarcoma, ovarian carcinoma, pancreatic cancer, pancreatic ductal adenocarcinoma, pseudopapillary neoplasms, acinar cell carcinomas, prostate cancer, prostate adenocarcinoma, skin cancer, melanoma, malignant melanoma, cutaneous melanoma, small intestine carcinomas, stomach cancer, gastric carcinoma, gastrointestinal stromal tumor (GIST), uterine cancer, or uterine sarcoma.
[0039] Due to the imaging component, the methods of the present invention are particularly useful in determining the likelihood of cancers which involve the formation of solid tumors, such as breast cancer, lung cancer, colorectal cancer, prostate cancer and pancreatic cancer. Nevertheless, the claimed methods also have utility in determining the likelihood of whether a subject has a cancer which does not involve solid tumor formation (such as blood cancers) because imaging data can be used to identify the secondary effects of these cancers, such as infections, organ enlargement (e.g., liver or spleen enlargement), or bone involvement. Performing a combined analysis of such imaging data with the corresponding biomarker data can therefore provide an advantageous method of determining a likelihood of whether a subject has a disease. Examples of combinations of biomarkers and imaging methods that can be used to detect cancer using the methods of the invention are summarized in Table 1 shown below.
TABLE 1
[0040] The methods of the invention can also be used for determining the likelihood of whether a subject has a cardiovascular disease. Such cardiovascular diseases include coronary artery disease, myocardial infarction, heart failure and pulmonary embolism. Such diseases are detectable through various imaging modalities, for example coronary artery disease is detectable through any of coronary angiography, coronary computed tomography angiography, and myocardial perfusion imaging. Similarly, echocardiography, cardiac MRI, and nuclear imaging can be used to detect myocardial infarction. Examples of combinations of biomarkers and imaging methods that can be used to detect cardiovascular diseases using the methods of the invention are summarized in Table 2 shown below.
TABLE 2
[0041] The methods of the invention can also be used for determining the likelihood of whether a subject has a neurological disease. Such neurological diseases include Alzheimer's disease, multiple sclerosis, stroke, Parkinson's disease and brain tumors. For example, the disease may be Alzheimer's disease. Such neurological diseases are detectable through various imaging modalities, for example Alzheimer's disease is detectable through MRI. Examples of combinations of biomarkers and imaging methods that can be used to detect neurological diseases using the methods of the invention are summarized in Table 3 below.
TABLE 3
[0042] The methods of the invention can also be used for determining the likelihood of whether a subject has an autoimmune disease. Such autoimmune diseases include rheumatoid arthritis, systemic lupus erythematosus, multiple sclerosis, Hashimoto's thyroiditis and Type 1 diabetes. The disease may be systemic lupus erythematosus. Examples of combinations of biomarkers and imaging methods that can be used to detect autoimmune diseases using the methods of the invention are summarized in Table 4 shown below.
TABLE 4
[0043] The methods of the invention can also be used for determining the likelihood of whether a subject has an infectious disease. Such infectious diseases include tuberculosis, pneumonia, HIV, hepatitis C and malaria. Such infectious diseases are detectable through various imaging modalities, for example tuberculosis and pneumonia are detectable through chest X-rays and CT scans. Examples of combinations of biomarkers and imaging methods that can be used to detect infectious diseases using the methods of the invention are summarized in Table 5 shown below.
TABLE 5
[0044] The methods of the invention can also be used for determining the likelihood of whether a subject has a metabolic disease. Such metabolic diseases include type 2 diabetes mellitus, non-alcoholic fatty liver disease (NAFLD), hyperthyroidism, hypothyroidism and adrenal insufficiency. Such metabolic diseases are detectable through various imaging modalities, for example type 2 diabetes mellitus is detectable through MRI. Examples of combinations of biomarkers and imaging methods that can be used to detect metabolic diseases using the methods of the invention are summarized in Table 6 shown below.
TABLE 6
[0045] The methods disclosed herein can be used to detect minimal residual disease (MRD). Minimal residual disease (MRD) assays are diagnostic tests used to detect and quantify residual cancer cells that may remain in a patient's body after treatment, even when no clinical evidence of disease is present. The disclosed methods are particularly useful in MRD applications because these assays require a high level of sensitivity to detect the disease at the earliest possible opportunity to improve patient outcome. The use of both biomarker data and imaging data results in a particularly high sensitivity which can result in the detection of disease at an earlier time compared to the use of the biomarker data or imaging data alone.
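One simple way to see why two data types can raise sensitivity is the following worked sketch. It assumes, purely for illustration, that a subject is called positive when either the biomarker test or the imaging test is positive, and that the two tests err independently given the true disease status. Neither assumption is required by the disclosed methods, and the function name and example values are hypothetical.

```python
from typing import Tuple


def either_positive_performance(
    sens_a: float, spec_a: float, sens_b: float, spec_b: float
) -> Tuple[float, float]:
    """Sensitivity and specificity of an 'either test positive' rule.

    Assumes the two tests are conditionally independent given disease status.
    """
    # Disease is missed only if both tests miss it.
    sensitivity = 1.0 - (1.0 - sens_a) * (1.0 - sens_b)
    # A healthy subject is cleared only if both tests are negative.
    specificity = spec_a * spec_b
    return sensitivity, specificity


if __name__ == "__main__":
    sens, spec = either_positive_performance(0.70, 0.95, 0.60, 0.90)
    print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

Under these assumptions the combined sensitivity exceeds that of either test alone, at some cost in specificity. A weighted or model-based combined analysis may mitigate that cost.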
Biomarkers
[0046] The relevant biomarker data is dependent on the disease in question. Biomarkers known to be associated with different diseases are widely available in the literature and can also be found on publicly available databases. For example, databases include MarkerDB (https://markerdb.ca/), The Human Protein Atlas (https://www.proteinatlas.org/), The Clinical Proteomic Tumor Analysis Consortium (CPTAC) Data Portal (https://proteomics.cancer.gov/data-portal), and The Cancer Genome Atlas (https://www.cancer.gov/tcga).
[0047] The biomarker can be any measurable indicator of the presence or absence of a disease, or an indicator of the progression or the extent of a disease. The biomarker can be, for example, a nucleic acid, a protein, or a metabolite.
[0048] The biomarker can be a nucleic acid biomarker. Exemplary nucleic acid biomarkers include cell free nucleic acids, such as those present in plasma samples. Nucleic acid biomarkers include genetic variants detected through the analysis of nucleic acids (e.g. cfDNA), such as single nucleotide variations (SNVs), indels, copy number variations (CNVs) and/or gene fusions. For example, an SNV corresponding to the KRAS G12D mutation is commonly observed in pancreatic adenocarcinoma and can thus be used as a biomarker for pancreatic cancer. Similarly, genetic variants corresponding to EGFR exon 19 deletions and the L858R mutation in EGFR are prevalent in non-small cell lung cancer (NSCLC). These variants can therefore be used individually or in combination as biomarkers for NSCLC.
[0049] Nucleic acid biomarkers also include epigenetic variants detected through the analysis of nucleic acids (e.g. cfDNA), such as nucleic acid methylation. Nucleic acid methylation may include DNA methylation, such as 5-methylcytosine (5mC) and 5-hydroxymethylcytosine (5hmC). Nucleic acid methylation can also include RNA methylation such as N6-methyladenosine (m6A), 5-methylcytosine (5mC) and N1-methyladenosine (m1A). Nucleic acid methylation is a known biomarker for a wide variety of diseases. For example, hypermethylation of the promoter region of the CDKN2A gene is a biomarker for various cancers, such as lung, breast, and pancreatic cancer. In addition to the methylation level at specific sites, global DNA methylation patterns in cfDNA can also serve as biomarkers for cancer. Detection and quantification of methylated cfDNA, e.g. in blood samples, can therefore be used as a biomarker for cancer.
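As an illustrative sketch of quantifying a methylation level at specific sites, the example below computes the fraction of methylated reads at a single CpG site and a pooled fraction across a region, as might be derived from bisulfite sequencing read counts. The function names and the input format (pairs of methylated and unmethylated read counts) are assumptions for illustration and do not describe a specific assay of this disclosure.

```python
from typing import Iterable, Optional, Tuple


def methylation_fraction(methylated: int, unmethylated: int) -> Optional[float]:
    """Fraction of reads supporting methylation at one site; None if the site has no coverage."""
    total = methylated + unmethylated
    if total == 0:
        return None
    return methylated / total


def region_methylation(sites: Iterable[Tuple[int, int]]) -> Optional[float]:
    """Pooled methylation fraction over several sites, e.g. CpGs in a promoter region."""
    methylated_total = 0
    unmethylated_total = 0
    for methylated, unmethylated in sites:
        methylated_total += methylated
        unmethylated_total += unmethylated
    return methylation_fraction(methylated_total, unmethylated_total)


if __name__ == "__main__":
    promoter_sites = [(8, 2), (0, 10), (5, 5)]  # hypothetical read counts per CpG
    print(region_methylation(promoter_sites))
```

A region-level value of this kind could serve as one numeric feature in the biomarker data supplied to the combined analysis.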
[0050] Nucleic acid biomarkers also include the fragmentation pattern of cell free nucleic acids, such as the length of the nucleic acids and/or the mapping location of the termini of the cell free nucleic acids to a reference genome. Abnormal patterns of cfDNA fragmentation can indicate the presence of cancerous cells. Cells, including tumor cells, release fragmented DNA into the bloodstream, and the size distribution of cfDNA fragments can vary between cancer patients and healthy individuals. The fragmentation pattern of cfDNA can therefore provide a biomarker for cancer.
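A fragmentation pattern can be summarized numerically in many ways. The minimal sketch below derives three length-based features from a list of cfDNA fragment lengths: the modal length, the fraction of fragments of 110 to 230 nucleotides, and the fraction of 240 to 440 nucleotides, mirroring the exemplary size ranges given earlier in this description. The function name, the feature names, and the choice of these three features are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, Sequence


def fragment_length_features(lengths: Sequence[int]) -> Dict[str, float]:
    """Summarize cfDNA fragment lengths (in nucleotides) as simple numeric features."""
    if not lengths:
        raise ValueError("at least one fragment length is required")
    counts = Counter(lengths)
    highest = max(counts.values())
    # Break ties deterministically by taking the shortest of the most frequent lengths.
    mode = min(length for length, count in counts.items() if count == highest)
    n = len(lengths)
    main_peak = sum(1 for length in lengths if 110 <= length <= 230)
    minor_peak = sum(1 for length in lengths if 240 <= length <= 440)
    return {
        "mode_length": float(mode),
        "fraction_110_230": main_peak / n,
        "fraction_240_440": minor_peak / n,
    }


if __name__ == "__main__":
    example = [168, 168, 167, 150, 300, 120, 500]  # hypothetical fragment lengths
    print(fragment_length_features(example))
```

Features such as these could be compared between a subject and a reference cohort, or passed to a classifier alongside imaging-derived features.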
[0051] Nucleic acid biomarkers also include nucleic acids of a specific sequence and/or source, such as nucleic acids derived from an infectious agent. Nucleic acid biomarkers also include RNA (e.g. cfRNA), including the expression level of one or more genes.
[0052] In some embodiments, the method comprises preparing one or more sequencing libraries from the nucleic acids in the sample, wherein the one or more sequencing libraries are then subjected to nucleic acid sequencing to obtain the biomarker data. Methods of nucleic acid library preparation are described elsewhere herein. Alternatively, nucleic acid biomarkers can be analyzed using PCR (e.g. real-time PCR, digital PCR or endpoint PCR), Fluorescence In Situ Hybridization (FISH) or CRISPR-Cas-based detection (e.g., SHERLOCK or DETECTR methods).
[0053] The biomarker may be a protein biomarker. The sample analyzed to quantify the level of the protein biomarker may be selected from blood, serum, plasma, a tissue biopsy, urine or cerebrospinal fluid. Examples of protein biomarkers which can be used in the disclosed methods include CRP (e.g. for cardiovascular disease and infections), PSA (e.g. for prostate cancer), troponin (e.g. for myocardial infarction), BNP (e.g. for heart failure), NT-proBNP (e.g. for heart failure), HbA1c (e.g. for diabetes mellitus), and TSH (e.g. for hypothyroidism or hyperthyroidism). Examples of protein biomarkers which can be used in the disclosed methods for determining a likelihood of whether a subject has a cancer include PSA (e.g. for prostate cancer), CA125 (e.g. for ovarian cancer), CEA (e.g. for colorectal cancer), AFP (e.g. for liver cancer), CA 19-9 (e.g. for pancreatic cancer), CA 15-3 (e.g. for breast cancer), CA27.29 (e.g. for breast cancer), HE4 (e.g. for ovarian cancer), and NSE (e.g. for neuroendocrine tumors, such as small cell lung cancer).
[0054] The quantification of the level of protein biomarkers, including determining whether the protein is present or absent, can be performed by a variety of methods including enzyme-linked immunosorbent assay (ELISA), western blotting (immunoblotting), immunoassays, mass spectrometry, lateral flow assays, and protein microarrays.
[0055] The biomarker may be a metabolite. The sample analysed to quantify the level of the metabolite may be selected from blood, serum, plasma, tissue biopsies, urine, saliva or cerebrospinal fluid. Examples of metabolites which can be used in the disclosed methods include glucose (e.g. for diabetes mellitus), creatinine (e.g. for kidney disease), and cholesterol (e.g. for cardiovascular disease). The quantification of the level of metabolites, including determining whether the metabolite is present or absent, can be performed by a variety of methods which will depend on the metabolite in question. For example, glucose can be detected by using a blood glucose meter, using the glucose oxidase method and/or using the hexokinase method. For example, creatinine can be detected by using the Jaffe reaction method and/or using an enzymatic method with creatininase. Similarly, cholesterol can be detected using an enzymatic method using cholesterol oxidase.
[0056] The biomarker can comprise methylation of cell-free DNA (cfDNA), for example, in stool-based samples. This can include the analysis of gene-specific methylation patterns such as LASS4, LRRC4, and PPP2R5C. Such methylation patterns are particularly useful for the detection of colorectal cancer. Methods for quantifying these methylation patterns include quantitative PCR (qPCR) and bisulfite sequencing, optionally performed in combination with fecal hemoglobin detection (e.g., fecal immunochemical tests). For example, combining cfDNA methylation patterns with colonoscopy imaging can significantly improve sensitivity for early-stage colorectal cancer detection. Furthermore, methylation biomarkers can complement CT colonography to identify lesions missed by imaging alone.
[0057] The biomarker can include glycosaminoglycans (GAGs), for instance, chondroitin sulfate (CS), heparan sulfate (HS), and hyaluronic acid (HA), e.g. in blood samples. Alterations in GAG levels have been associated with diseases such as cancer, osteoarthritis, and liver fibrosis. Detection and quantification methods can include labeling and separation techniques such as ultra-high-performance liquid chromatography (UHPLC), followed by quantification using electrospray ionization triple-quadrupole mass spectrometry (ESI-MS/MS). Elevated levels of specific GAGs can indicate aggressive tumor phenotypes and metastatic potential, and when combined with, e.g. MRI or PET imaging, can enhance diagnostic accuracy and monitoring of tumor progression.
[0058] The biomarker may be represented by the serum spectral profile, e.g. generated through infrared spectroscopy analysis of blood serum. Spectral differences detected using infrared spectroscopy can serve as biomarkers indicative of cancers, metabolic disorders, and inflammatory diseases. Fourier-transform infrared spectroscopy (FTIR) can be used for serum spectral profiling. Integration with, e.g. MRI or PET imaging can help localize disease processes identified through spectral anomalies, improving diagnostic precision.
[0059] The biomarker can also be a combined biomarker, for example integrating both DNA (e.g., cell-free DNA methylation profiles) and protein biomarkers. Such combined approaches are particularly effective for cancer detection and monitoring, including breast, colorectal, lung, and pancreatic cancers. Machine learning classifiers, such as random forests or neural networks, can analyze and integrate multimodal biomarker data to provide enhanced predictive accuracy. When integrated with, e.g. mammography or CT imaging, these multimodal biomarkers substantially enhance early tumor detection and reduce false positives.
[0060] The biomarker can include exosomal RNA, such as that isolated from blood or urine samples. In some embodiments, this can involve the detection of RNA markers such as ERG, SPDEF, and PCA3. These RNA markers are particularly useful for detecting prostate cancer. Quantification methods include RT-qPCR, digital PCR, and RNA sequencing. Combined with e.g. multiparametric MRI, exosomal RNA biomarkers can effectively differentiate aggressive prostate tumors from benign conditions.
[0061] The biomarker can include histone modifications on circulating nucleosomes in bodily fluids. These biomarkers can be used for the detection of cancers, autoimmune diseases, and inflammatory conditions. Specific histone modifications, such as H3K27ac marking active enhancers and H3K4me3 marking active promoters, can be quantified through sandwich immunoassays, chromatin immunoprecipitation (ChIP)-based sequencing, or mass spectrometry. When used in conjunction with, e.g. PET imaging, these biomarkers can facilitate the early identification of inflammatory and malignant tissues.
[0062] Microbial DNA, particularly circulating microbial DNA detected in blood samples, can be used in the detection of cancers such as colorectal, pancreatic, and liver cancers. Detection can involve sequencing methodologies such as metagenomic sequencing or targeted sequencing, with subsequent analysis employing machine learning algorithms to predict disease presence or absence. For example, combining microbial DNA biomarkers with endoscopic ultrasound imaging can provide superior localization and staging of gastrointestinal cancers.
[0063] The biomarker can include surface biomarkers on extracellular vesicles (EVs) in bodily fluids, particularly blood. The biomarkers can be used for the detection of cancers including prostate, breast, lung, and colorectal cancers. Detection methods include size-exclusion chromatography for EV enrichment, immunoaffinity capture using specific antibodies, and detection of EV-surface protein biomarkers with antibodies conjugated to dsDNA oligos, employing proximity ligation and qPCR-based detection. Integration of EV biomarkers with, e.g. PET/CT imaging can enhance tumor detection sensitivity and aid precise tumor localization.
[0064] The biomarker can include cfDNA associated with active chromatin regions. Fragmentation patterns of such cfDNA can indicate gene expression profiles relevant for cancer detection, particularly lung, colorectal, and breast cancers. Methods for quantifying fragmentation patterns include sequencing techniques like whole-genome sequencing (WGS) or targeted sequencing. These biomarkers combined with, e.g. chest CT or mammography enhance early detection and characterization of lung and breast cancers.
[0065] The biomarker can include microRNAs (miRNA) present in bodily fluids such as blood. These can be used for detecting cancers (e.g., lung, colorectal, breast), cardiovascular diseases, and neurological disorders. Methods to quantify miRNA include RT-qPCR, digital PCR, and next-generation sequencing (NGS). Integration of miRNA profiles with, e.g. MRI or PET scans allows for improved specificity in distinguishing malignant from benign lesions.
[0066] Biomarkers can include structural variants identified from cfDNA, such as those in blood or other bodily fluids. These biomarkers can be indicative of cancers, particularly breast, ovarian, lung, and colorectal cancers. Structural genomic alterations can be quantified using sequencing-based methodologies, such as whole-genome sequencing and targeted sequencing panels. Combining structural variant analysis with, e.g. CT or MRI imaging facilitates early detection and accurate staging of cancer.
[0067] Biomarkers can include small non-coding RNAs (ncRNAs), particularly cancer-associated small ncRNAs detected in bodily fluids such as blood. These biomarkers can be used for the detection of cancer, including lung, breast, prostate, and colorectal cancers. Methods for quantifying small ncRNAs include RT-qPCR, digital PCR, and RNA sequencing.
[0068] Biomarkers can include proteins within extracellular vesicle fractions isolated from bodily fluids, such as blood. Specific examples of such proteins include CD63, CD81, and CD9, which are tetraspanins commonly enriched in extracellular vesicles and serve as general markers of vesicle populations. Additionally, disease-specific biomarkers include Glypican-1 (GPC1), which is associated with pancreatic cancer, and prostate-specific membrane antigen (PSMA), indicative of prostate cancer. For neurodegenerative diseases, proteins like tau, amyloid-beta, and alpha-synuclein are detectable in extracellular vesicles, serving as biomarkers for Alzheimer's and Parkinson's diseases. Similarly, cardiovascular biomarkers, including proteins such as cardiac troponin T (cTnT), natriuretic peptides (BNP), and endothelial markers like vascular cell adhesion molecule 1 (VCAM-1), can be used as biomarkers for cardiovascular injury or dysfunction. These proteins can be identified and quantified using mass spectrometry-based proteomics and subsequently monitored using enzyme-linked immunosorbent assays (ELISA).
[0069] Biomarkers can include a combination of metabolites and lipids, which can be detected by liquid chromatography-mass spectrometry (LC-MS) and flow injection mass spectrometry (FI-MS). These can serve as biomarkers for metabolic disorders, cardiovascular diseases, and various cancers. Comprehensive metabolic profiling through these methodologies allows for precise disease detection and monitoring.
[0070] Biomarkers can include epigenetic modifications on cfDNA associated with nucleosomes. These modifications can be analyzed through chromatin immunoprecipitation-based sequencing (ChIP-seq), and can serve as biomarkers to probe genomic features indicative of cancers (e.g., colorectal, prostate, breast), autoimmune diseases, and neurological disorders. These methods can be used in the analysis of genomic regions including promoters, enhancers, and gene bodies.
[0071] The combination of circulating epithelial cells (CECs) and circulating tumor DNA (ctDNA) can be used as biomarkers for detecting advanced adenomas (AA) and/or colorectal cancer (CRC). Detection methods for CECs include capture from whole blood using anti-EpCAM antibodies combined with immunofluorescent staining, microscopy-based image capture, and AI-enabled identification and quantification specific to gastrointestinal epithelial cells. ctDNA quantification can involve digital PCR, qPCR, or next-generation sequencing, as described elsewhere herein.
Biomarker data
[0072] Biomarker data may include any data relating to the presence or absence of the biomarker. The biomarker data can include data relating to the presence or absence of multiple biomarkers. The biomarker data can be a numerical value, probability value, or classification label. The biomarker data can be binary, such as a value indicating the presence or absence of the biomarker, or whether or not the detected level of the one or more biomarkers meets a predefined threshold. The biomarker data may be a likelihood of whether a subject has the disease based at least in part on the level of one or more biomarkers within the sample and without consideration of the imaging data.
[0073] When the biomarker is a nucleic acid biomarker, the biomarker data may be the frequency of genetic variants and/or epigenetic modifications. The biomarker data may be sequence data which comprises information about the level of one or more biomarkers in the sample (e.g. frequency of genetic variant and/or RNA expression level). The biomarker data may be data derived from sequence data which comprises information about the level of one or more biomarkers in the sample.
[0074] The biomarker data may comprise data from multiple types of biomarkers, such as nucleic acid biomarkers and protein biomarkers. The biomarker data may comprise data from multiple types of nucleic acid biomarkers, such as cfDNA biomarkers and RNA biomarkers. In some embodiments, the biomarker data can be obtained from sequencing, PCR or other types of amplification, immunoassays, proximity extension assays, proximity ligation assays, microarray, mass spectrometry, fluorescence anisotropy, flow cytometry, and surface plasmon resonance. In some embodiments, the number of biomarker types combined with imaging data depends on the size of the tumor obtained from the imaging and on the cancer type. If the size of the tumor is closer to the tumor size threshold, then fewer biomarker types can be combined with imaging data. If the size of the tumor is far below the tumor size threshold, then more biomarker types can be combined with imaging data. In some embodiments, when the tumor size is between 1 and 2 cm, then the number of biomarker types analyzed can be at least 1 or at least 2. For example, when the tumor size is 1 cm, two types of biomarkers are analyzed and the biomarker types can be methylation and somatic variants. In some embodiments, when the tumor size is less than 1 cm, the number of biomarker types analyzed can be at least 3, at least 4 or at least 5. For example, if the tumor size is about 0.5 cm, then three or four types of biomarkers are analyzed and the biomarker types can be methylation, somatic variants, EVs and miRNA.
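By way of non-limiting illustration only, the relationship between tumor size and the number of biomarker types can be sketched as a simple rule. The function name `min_biomarker_types` is hypothetical, the sketch uses only the lowest "at least" values given above (at least 3 below 1 cm, at least 1 from 1 to 2 cm), and tumor sizes above 2 cm are treated as unspecified because no value is given for them.

```python
def min_biomarker_types(tumor_size_cm):
    """Return an illustrative minimum number of biomarker types.

    Uses the lowest 'at least' values from the description; returns None
    where the description gives no value (sizes above 2 cm).
    """
    if tumor_size_cm < 0:
        raise ValueError("tumor size cannot be negative")
    if tumor_size_cm < 1.0:
        return 3   # e.g. methylation, somatic variants and EVs
    if tumor_size_cm <= 2.0:
        return 1   # at least 1 or at least 2 in the description
    return None    # not specified in the description
```

For instance, `min_biomarker_types(0.5)` gives 3, consistent with the example of a tumor of about 0.5 cm being analysed with three or four biomarker types.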
[0075] The biomarker data may comprise data derived from machine learning methods. Such methods include logistic regression, random forest, support vector machines, neural networks, gradient boosting machines, K-Nearest Neighbors, decision trees, Gaussian processes, and dimensionality reduction techniques.
Imaging
[0076] Any imaging modality which is appropriate for detecting the presence or absence of the disease in question can be used in the methods of the invention. Exemplary imaging modalities include X-ray radiography, Computed Tomography (CT or CAT scan), Magnetic Resonance Imaging (MRI), Positron Emission Tomography (PET scan), ultrasound imaging (sonography), fluoroscopy, mammography, endoscopy, angiography, nuclear medicine imaging (including SPECT and PET-CT), optical coherence tomography (OCT), magnetic resonance angiography (MRA), dual-energy X-ray absorptiometry (DEXA or DXA scan), thermography, and radionuclide scintigraphy.
[0077] Imaging modalities which can be used in detecting the presence or absence of cancer include X-ray radiography (e.g. for lung cancer), Computed Tomography (CT) scans (e.g. for lung cancer, pancreatic cancer or colorectal cancer), magnetic resonance imaging (MRI) (e.g. for brain tumors, breast cancer or prostate cancer), Positron Emission Tomography (PET) (e.g. for lung cancer, lymphoma, or colorectal cancer), and ultrasound (e.g. for breast cancer, thyroid cancer, or ovarian cancer).
[0078] The imaging data obtained from the one or more imaging modalities can include digital images generated from those imaging modalities, or data derived from those digital images. The imaging data analysis will depend on the imaging modality. Exemplary processes by which imaging data can be extracted from the output of the imaging modality can include one or more of image processing, feature extraction, segmentation, and classification.
[0079] As an initial step, the images may undergo processing steps to enhance quality and/or remove artifacts. This may include noise reduction, image registration (aligning images from different modalities or different time points), and/or normalization. Computational algorithms can then be used to extract quantitative features from the images. These features may include intensity histograms, texture features, shape descriptors, and spatial relationships between structures. Feature extraction techniques can vary depending on the modality and the specific diagnostic task. Segmentation can be used for partitioning the image into meaningful regions or objects of interest. This step may be important for isolating structures or abnormalities from surrounding tissues. Various segmentation algorithms which can be used include thresholding, region growing, edge detection, and machine learning-based approaches. Encoder-decoder deep architecture may also be used for the detection of structures within imaging data. Encoder-decoder deep architecture may include at least one encoder, that takes imaging data as an input, and at least one decoder that is able to predict the subsequent output based on the encoded imaging data. Encoder-decoder deep architecture may be used, for example, in the detection or segmentation of solid tumours in imaging data. Once features are extracted, classification algorithms can be employed to differentiate between normal and abnormal findings or to classify different types of abnormalities. Machine learning techniques which can be used for this purpose include support vector machines, random forests, neural networks, and convolutional neural networks (CNNs). These algorithms can be trained on labelled datasets to learn patterns indicative of specific diseases or conditions.
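By way of non-limiting illustration only, the normalization, thresholding-based segmentation, and feature extraction steps described above can be sketched on a small two-dimensional image held as nested lists. The function names and the threshold level of 0.5 are hypothetical, and practical implementations would typically rely on dedicated image-processing libraries.

```python
def normalize(image):
    """Rescale pixel intensities to the range 0..1 (min-max normalization)."""
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant image
    return [[(p - lo) / span for p in row] for row in image]


def threshold_segment(image, level=0.5):
    """Return a binary mask: 1 where intensity >= level, else 0."""
    return [[1 if p >= level else 0 for p in row] for row in image]


def region_features(image, mask):
    """Extract simple quantitative features from the masked region."""
    values = [p for row, mrow in zip(image, mask)
              for p, m in zip(row, mrow) if m]
    area = len(values)
    mean_intensity = sum(values) / area if area else 0.0
    return {"area": area, "mean_intensity": mean_intensity}
```

The resulting feature dictionary is one possible form of imaging data that could be passed to a classification step.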
[0080] Computational algorithms may use a combination of detection and classification algorithms for the analysis of imaging data. Such algorithms may use a combination of deep neural networks and 3D morphological analysis. Such machine learning algorithms include End to End ML/AI tech-based CADe/CADx software used in eyonis™, developed by Median Technologies (Sophia-Antipolis, France). Machine learning algorithms can be used to analyse the output from imaging modalities. Such exemplary processes can include training a machine learning model with medical images associated with a known diagnosis. Exemplary algorithms which can be used include convolutional neural networks (CNNs), support vector machines (SVMs), random forests, and deep learning architectures. The trained models can optionally be evaluated using separate validation datasets to assess their performance, e.g. by using metrics such as accuracy, sensitivity, specificity, area under the receiver operating characteristic curve (AUC-ROC), precision, F-score and run time. Optionally, the machine learning models may undergo optimization to improve performance, which may involve fine-tuning hyperparameters, adjusting network architectures, or employing techniques such as data augmentation to increase dataset size.
[0081] Imaging data may be analysed by machine learning algorithms, such as CNNs. Such machine learning algorithms can act as independent readers (IR) which determine diagnosis based on imaging data through a binary output. The output of an independent reader may be combined with a further independent reader, such as a healthcare professional, for double reading (DR) of imaging data. Combination of two IRs in DR facilitates a more accurate diagnosis. Cases in which IRs disagree on output can be further analysed by an adjudicator that determines the diagnosis, such as a further healthcare professional. The DR may be combined with biomarker data to confirm diagnosis in a combined analysis to determine diagnosis. Such machine learning algorithms include Mia™ developed by Kheiron Medical Technologies (London, UK), for instance for analyzing mammograms for the detection of breast cancer. The imaging data used in the methods of the present invention can be any data which has been obtained from one or more of the above processes. For example, the imaging data can be a probability value of the presence of an abnormal finding, such as based on a classification step, such as that described above. For example, the imaging data can be a binary output of whether there is an abnormal finding, such as based on a classification step, such as that described above. The imaging data may be a likelihood of whether a subject has the disease based on any of the steps described above, but without the integration of the biomarker data.
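By way of non-limiting illustration only, the double reading logic described above, in which two independent readers each give a binary output and an adjudicator resolves disagreements, can be sketched as follows. The function name `double_read` and the representation of the adjudicator as a callable are assumptions made for the purpose of the sketch.

```python
def double_read(reader_a, reader_b, adjudicator=None):
    """Combine two binary independent-reader outputs.

    reader_a, reader_b: True for an abnormal finding, False otherwise.
    adjudicator: hypothetical callable returning the final binary decision,
    consulted only when the two readers disagree.
    """
    if reader_a == reader_b:
        return reader_a
    if adjudicator is None:
        raise ValueError("readers disagree and no adjudicator was supplied")
    return adjudicator()
```

The binary result is one possible form of imaging data that can subsequently be combined with biomarker data.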
Combined analysis
[0082] Methods described herein comprise a step of performing a combined analysis of the biomarker data and the imaging data using a computer processor to provide a likelihood of whether the subject has the disease. The biomarker data and the imaging data input into this combined analysis can be any form of biomarker data or imaging data as described elsewhere herein.
[0083] For example, the biomarker data input into the combined analysis may comprise data comprising the level of one or more biomarkers, such as the presence or absence of the one or more biomarkers, or a likelihood thereof. Similarly, the imaging data input into the combined analysis may comprise data comprising the presence or absence of an abnormality, or a likelihood thereof. In other cases, the biomarker data input into the combined analysis may comprise a likelihood of whether a subject has a disease based at least in part on the quantification of the level of one or more biomarkers in the sample but without consideration of the imaging data. The imaging data input into the combined analysis may comprise a likelihood of whether a subject has a disease based at least in part on the output of the one or more imaging modalities, but without consideration of the biomarker data.
[0084] The combined analysis may be performed using computational techniques such as machine learning algorithms, statistical modelling, or pattern recognition methods. These computational techniques integrate information from both biomarker and imaging data to assess the likelihood of the subject having the disease.
[0085] Machine learning algorithms which can be used in the combined analysis include support vector machines (SVM), random forests, or neural networks (e.g. residual neural networks). Such methods are particularly useful in the combined analysis of heterogeneous data sources like biomarker and imaging data. Statistical approaches which can be used in the combined analysis include regression analysis, Bayesian methods, or multivariate analysis. Such methods can be applied to model the joint distribution of biomarker and imaging features. These can therefore be used to capture the complex interactions between different data modalities and quantify their combined influence on the disease likelihood. Pattern recognition methods which can be used in the combined analysis include principal component analysis (PCA), independent component analysis (ICA), or clustering algorithms. Such methods can be used to identify latent structures or clusters within the combined dataset. By uncovering hidden patterns, these methods can help identify the underlying factors contributing to the disease. Further dimension reduction machine learning methods which can be used include multidimensional scaling, UMAP (uniform manifold approximation and projection), and t-distributed stochastic neighbour embedding (t-SNE).
[0086] For example, PCA can be used to reduce the dimensionality of biomarker and imaging data. Each dataset's features can be transformed into a set of principal components, capturing the maximum variance. By combining these components, the datasets are integrated into a unified representation, enabling efficient joint analysis. This integration enhances pattern recognition, facilitates disease diagnosis, and provides insights into the relationships between biomarkers and imaging features. PCA can therefore aid in identifying crucial biomarkers and imaging characteristics, contributing to more accurate disease classification and prognosis.
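By way of non-limiting illustration only, the reduction of each dataset to principal components and the combination of those components into a unified representation can be sketched as follows. For brevity the sketch extracts only the first principal component of each dataset by power iteration. All function names are hypothetical, and practical implementations would typically use an established numerical library and retain more than one component.

```python
import math


def center(rows):
    """Subtract the per-feature mean from every row."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def first_component(rows, iterations=100):
    """Return (direction, scores) of the first principal component.

    Power iteration on the covariance structure of the centered data. The
    starting vector is arbitrary; a start exactly orthogonal to the
    leading component would not converge to it.
    """
    data = center(rows)
    dim = len(data[0])
    v = [1.0 / (j + 1) for j in range(dim)]
    start_norm = math.sqrt(sum(c * c for c in v))
    v = [c / start_norm for c in v]
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in data]
        w_new = [sum(s * row[j] for s, row in zip(scores, data))
                 for j in range(dim)]
        norm = math.sqrt(sum(c * c for c in w_new))
        if norm == 0.0:
            break  # no variance along the current direction
        v = [c / norm for c in w_new]
    scores = [sum(x * w for x, w in zip(row, v)) for row in data]
    return v, scores


def fuse_first_components(biomarker_rows, imaging_rows):
    """Combine per-subject component scores into one representation."""
    if len(biomarker_rows) != len(imaging_rows):
        raise ValueError("both datasets must describe the same subjects")
    _, bio_scores = first_component(biomarker_rows)
    _, img_scores = first_component(imaging_rows)
    return [[b, i] for b, i in zip(bio_scores, img_scores)]
```

Each row of the fused output holds one biomarker-derived score and one imaging-derived score for the same subject, which can then be passed to a joint classifier.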
[0087] ICA can be used to separate mixed signals into statistically independent components. In the context of combined biomarker and imaging data analysis, ICA can isolate underlying sources contributing to both types of data, revealing hidden patterns or features. For instance, in neuroimaging studies, ICA can separate brain activity into meaningful components, potentially associating biomarkers with specific brain regions or functions. By jointly analyzing biomarker and imaging data with ICA, insights can be made into the interactions between biological markers and anatomical or functional characteristics, facilitating disease diagnosis or treatment assessment.
[0088] Clustering algorithms can be applied to combine biomarker data and imaging data by grouping similar samples based on their features. For instance, using k-means clustering, samples with similar biomarker profiles and imaging characteristics can be clustered together, aiding in identifying distinct disease subtypes or stages.
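By way of non-limiting illustration only, k-means clustering of samples described by combined biomarker and imaging features can be sketched as follows. The sketch assumes each sample is a list of already-scaled numeric features and that initial centroids are supplied by the caller; the function name `kmeans` and these conventions are hypothetical.

```python
def kmeans(points, initial_centroids, iterations=20):
    """Plain k-means: returns (labels, centroids).

    points: one list of numeric features per sample, e.g. biomarker
    levels followed by imaging features.
    """
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2
                                  for p, c in zip(pt, centroids[k])))
            for pt in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids
```

Samples sharing a label have similar combined profiles and could, for example, be reviewed as candidate disease subtypes or stages.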
[0089] The combined analysis may use supervised machine learning algorithms. Supervised machine learning algorithms can integrate biomarker and imaging data by training models to predict disease outcomes or classifications based on labelled examples. Using labelled samples, the algorithms can learn relationships between biomarkers and imaging features indicative of disease presence, progression, or response to treatment. These models can then generalize to new, unlabelled data, aiding in disease diagnosis, prognosis, and/or treatment planning. Techniques such as convolutional neural networks (CNNs) can directly process imaging data, while traditional classifiers like random forests or support vector machines incorporate biomarker data alongside imaging data.
[0090] The combined analysis may use unsupervised machine learning algorithms. Unsupervised machine learning algorithms can integrate biomarker and imaging data by identifying patterns and structures without explicit labels. Techniques like clustering (e.g., k-means) can group similar data points, revealing associations between biomarker data and imaging data. Dimensionality reduction methods (e.g., PCA) can be used to extract information from high-dimensional data, aiding visualization and feature selection. Moreover, generative models (e.g., Variational Autoencoders) can be used to learn underlying data distributions, enabling synthesis of new samples for data augmentation or anomaly detection.
[0091] The integration of the biomarker data and imaging data may involve weighting different features based on their predictive value. For example, if a biomarker was known to have strong predictive value for a particular disease, the corresponding data could be given a higher weighting. This integration may involve considering interactions between biomarker data and imaging data. For example, if both data comprised features predictive of the same disease, the determined likelihood of the subject having that disease could be increased by more than the contribution of each of the corresponding imaging data and biomarker data alone.
[0092] The combined analysis may therefore comprise a step of data integration of the biomarker data and imaging data which may include the steps of feature fusion, cross-modal correlation and/or hierarchical modelling.
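By way of non-limiting illustration only, weighting of features and an interaction between biomarker data and imaging data can be sketched with a logistic model containing an interaction term. The function name `combined_likelihood` and all weight values are hypothetical; in practice such weights would be learned from labelled data.

```python
import math


def combined_likelihood(biomarker, imaging,
                        w_biomarker=2.0, w_imaging=1.0,
                        w_interaction=1.5, bias=-3.0):
    """Logistic combination of two inputs with an interaction term.

    biomarker, imaging: numeric inputs, e.g. 0/1 findings or scores.
    All weights are hypothetical illustration values.
    """
    z = (bias
         + w_biomarker * biomarker
         + w_imaging * imaging
         + w_interaction * biomarker * imaging)  # joint-evidence term
    return 1.0 / (1.0 + math.exp(-z))
```

With these illustrative weights the biomarker input is weighted more heavily than the imaging input, and the interaction term raises the log-odds when both inputs are positive by more than the sum of the individual contributions.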
[0093] In feature fusion, features can be extracted from biomarker data and imaging data and combined to form a unified feature space. Such methods can enable the algorithm to capture complementary information from different data modalities, enhancing the discriminative power of the model. In cross-modal correlation, correlations between biomarker data and imaging data can be used to identify associations predictive of the disease. Accordingly, by jointly analyzing biomarker and imaging data, the algorithm can exploit synergies between different types of information, leading to more accurate predictions than predictions based on either type of data alone. Ensemble algorithms can also be used in the disclosed methods. Ensemble algorithms can be used to fuse biomarker data and imaging data by combining multiple models trained on each data type separately. For example, a random forest model can be trained on biomarker data, while a convolutional neural network (CNN) can be trained on imaging data. Predictions from each model can then be combined. Such an approach utilizes the strengths of both data types to enhance diagnostic accuracy.
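By way of non-limiting illustration only, feature fusion and the combination of predictions from separately trained models can be sketched as follows. The sketch assumes that a biomarker model and an imaging model have each already produced a probability; the function names and the default equal weighting are hypothetical.

```python
def fuse_features(biomarker_features, imaging_features):
    """Feature fusion: concatenate both modalities into one vector."""
    return list(biomarker_features) + list(imaging_features)


def fuse_predictions(p_biomarker, p_imaging, weight_biomarker=0.5):
    """Ensemble fusion: weighted average of two model probabilities."""
    for p in (p_biomarker, p_imaging):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie between 0 and 1")
    if not 0.0 <= weight_biomarker <= 1.0:
        raise ValueError("weight must lie between 0 and 1")
    return (weight_biomarker * p_biomarker
            + (1.0 - weight_biomarker) * p_imaging)
```

A weighted average is only one way of combining model outputs; voting or a further model trained on the individual predictions could be used instead.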
[0094] The algorithms used for the computational analysis may be subjected to validation and/or optimization procedures. For example, the algorithms can be subjected to cross-validation. In such methods, the algorithms can be validated using techniques such as k-fold cross-validation to assess their performance on independent datasets and mitigate overfitting. For example, the algorithms can be subjected to hyperparameter tuning. In such methods, parameters of the algorithm, such as regularization coefficients or kernel bandwidths, can be optimized through techniques like grid search or Bayesian optimization to maximize predictive accuracy. For example, the algorithms can be subjected to model interpretation. In such methods, the interpretability of the combined analysis model can be enhanced through techniques such as feature importance ranking, model visualization, or SHAP (SHapley Additive exPlanations) values. Such methods may enable clinicians to understand the underlying mechanisms driving the predicted disease likelihood and interpret the results in a clinically meaningful way.
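By way of non-limiting illustration only, the partitioning of samples for k-fold cross-validation can be sketched as follows. The function name `k_fold_indices` is hypothetical, and the sketch splits samples in their given order, whereas practical use would typically shuffle or stratify the samples first.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (train_indices, test_indices) pairs for k folds.

    Every sample appears in exactly one test fold; fold sizes differ by
    at most one sample.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    sizes = [n_samples // k + (1 if i < n_samples % k else 0)
             for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((train, test))
        start = stop
    return folds
```

A model would be trained on each set of training indices and evaluated on the corresponding test indices, with the per-fold results averaged.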
[0095] Exemplary validation metrics which can be used include accuracy, sensitivity, specificity, area under the receiver operating characteristic curve (AUC-ROC), precision, F-score and run time.
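By way of non-limiting illustration only, several of the validation metrics listed above can be computed from binary labels and predictions as sketched below. The function names are hypothetical; labels use 1 for disease present and 0 for disease absent, and the AUC is computed by direct pairwise comparison, which is adequate only for small datasets.

```python
def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, precision and F-score."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    return {
        "accuracy": _ratio(tp + tn, len(y_true)),
        "sensitivity": sensitivity,
        "specificity": _ratio(tn, tn + fp),
        "precision": precision,
        "f_score": _ratio(2 * precision * sensitivity,
                          precision + sensitivity),
    }


def auc_roc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return _ratio(wins, len(pos) * len(neg))
```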
[0096] Further clinical data may also be integrated into the combined analysis to improve the determination of the likelihood of whether a subject has a disease. Such additional data may include the subject’s medical history. Patterns in medical history may suggest predispositions to certain diseases or provide clues about the underlying cause of symptoms. Such additional data may include symptom presentation. Information about a subject's symptoms, including their onset, duration, severity, progression, and associated factors, can aid in diagnosing a disease. Certain combinations of symptoms, known as symptom clusters, may be indicative of specific diseases or conditions. Such additional data may include physical examination findings. Such physical examination may involve the systematic assessment of a subject's body, including vital signs, appearance, palpation, auscultation, and other clinical maneuvers. Abnormal physical examination findings, such as abnormal lung sounds, skin lesions, enlarged lymph nodes, or neurological deficits, can provide useful diagnostic indications.
Likelihood determination of the disease
[0097] The combined analysis provides a likelihood of whether the subject has the disease. Such a likelihood output can take a variety of forms. The likelihood may be a likelihood score or a probability estimate indicating the probability that the subject has the disease. The output could be a single probability value ranging from 0 to 1, where 0 indicates no likelihood of the disease and 1 indicates absolute certainty of the disease presence. Alternatively, the output might be a binary classification (e.g., disease present or absent) based on a predefined threshold probability. In addition to providing a likelihood score, the output may also estimate the uncertainty associated with the prediction. This uncertainty quantification may aid a clinician in understanding the reliability of the prediction and in guiding subsequent clinical actions. Methods for estimating uncertainty may include confidence intervals, Bayesian inference, or ensemble techniques that assess the variability of predictions across different models or data subsets.
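By way of non-limiting illustration only, a likelihood output comprising a probability, a binary classification based on a predefined threshold, and an ensemble-based uncertainty estimate can be sketched as follows. The function name `summarize_likelihood`, the threshold of 0.5, and the use of the standard deviation across models as the uncertainty measure are assumptions made for the sketch.

```python
from statistics import mean, stdev


def summarize_likelihood(model_probabilities, threshold=0.5):
    """Summarise per-model probabilities into a single likelihood output.

    model_probabilities: probabilities (0..1) from different models or
    data subsets. threshold is a hypothetical predefined cut-off.
    """
    if not model_probabilities:
        raise ValueError("at least one probability is required")
    if any(not 0.0 <= p <= 1.0 for p in model_probabilities):
        raise ValueError("probabilities must lie between 0 and 1")
    likelihood = mean(model_probabilities)
    # Spread across models as a simple measure of prediction variability.
    spread = (stdev(model_probabilities)
              if len(model_probabilities) > 1 else 0.0)
    return {
        "likelihood": likelihood,
        "uncertainty": spread,
        "classification": ("disease present" if likelihood >= threshold
                           else "disease absent"),
    }
```

A large spread relative to the distance between the likelihood and the threshold could, for example, prompt additional confirmatory testing.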
[0098] The likelihood determination may be presented in any format that is interpretable and actionable by healthcare professionals. This could involve visual representations such as probability histograms, heatmaps, or decision curves, which provide insights into the distribution of likelihood scores across different patient populations or disease subtypes. Additionally, the output might include a summary describing the factors behind the likelihood determination, such as the biomarkers or imaging features driving the prediction.
[0099] The generated likelihood score may also be contextualized within the clinical setting to guide further diagnostic or therapeutic decisions. For instance, the output might also recommend additional confirmatory tests or suggest personalized treatment strategies based on the estimated likelihood of disease presence. The interpretation of the likelihood determination may also incorporate the specific characteristics of the patient, such as age, medical history, and risk factors, to provide tailored clinical guidance. The likelihood determination may also be considered as part of an iterative process, where additional clinical data or patient follow-up may be incorporated to refine the prediction. For example, the likelihood score may be used to monitor disease progression, wherein the disclosed method is carried out at multiple timepoints and the likelihood outcome is compared between the different timepoints. Such a method can be used to evaluate treatment response and/or adjust therapeutic interventions accordingly. Such a feedback loop can facilitate continuous learning and improvement of the diagnostic algorithm.
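The comparison of likelihood outcomes between timepoints can be sketched as follows. This is a minimal illustration only: the function name, the ISO-formatted dates used as keys, the example scores, and the 0.10 minimum-change value are all hypothetical and would need to be chosen for a given clinical setting.

```python
def likelihood_trend(scores_by_date, min_change=0.10):
    """Label the change between the earliest and latest likelihood scores.

    `scores_by_date` maps ISO dates (YYYY-MM-DD, so they sort chronologically)
    to likelihood scores. `min_change` is an assumed minimum difference below
    which the score is treated as stable.
    """
    if len(scores_by_date) < 2:
        raise ValueError("at least two timepoints are required")
    ordered = sorted(scores_by_date.items())
    first_score = ordered[0][1]
    last_score = ordered[-1][1]
    delta = last_score - first_score
    if delta >= min_change:
        label = "increasing"
    elif delta <= -min_change:
        label = "decreasing"
    else:
        label = "stable"
    return label, delta


# Illustrative scores before and after a hypothetical treatment.
label, delta = likelihood_trend({"2024-01-10": 0.80, "2024-04-10": 0.35})
```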
Nucleic acid library preparation
[0100] Biomarkers may be nucleic acid biomarkers and the analysis of such nucleic acid biomarkers may be performed using nucleic acid sequencing. Methods which can be used in such nucleic acid sequencing are outlined below.
[0101] Double-stranded nucleic acids, e.g. DNA molecules in a sample, and single-stranded nucleic acid molecules converted to double-stranded molecules, can be linked to adapters at either one end or both ends. DNA ligase and adapters can be added to ligate DNA molecules (e.g. cfDNA) in the sample with an adapter on one or both ends, i.e. to form adapted DNA. As used herein, “adapter” refers to short nucleic acids (e.g., less than about 500, less than about 100 or less than about 50 nucleotides in length, or 20-30, 20-40, 30-50, 30-60, 40-60, 40-70, 50-60, 50-70, 20-500, or 30-100 bases from end to end) that are typically at least partially double-stranded and can be ligated to the end of a given sample nucleic acid molecule. In some instances, two adapters can be ligated to a single sample nucleic acid molecule, with one adapter ligated to each end of the sample nucleic acid molecule.
[0102] Adapters can include nucleic acid primer binding sites to permit amplification of a sample nucleic acid molecule flanked by adapters at both ends, and/or a sequencing primer binding site, including primer binding sites for sequencing applications, such as various next generation sequencing (NGS) applications. Adapters can include a sequence for hybridizing to a solid support, e.g., a flow cell sequence. Adapters can also include binding sites for capture probes, such as an oligonucleotide attached to a flow cell support or the like. Adapters can also include sample indexes and/or molecular barcodes. These are typically positioned relative to amplification primer and sequencing primer binding sites, such that the sample index and/or molecular barcode is included in amplicons and sequencing reads of a given nucleic acid molecule. Adapters of the same or different sequence can be linked to the respective ends of a sample nucleic acid molecule. In some cases, adapters of the same or different sequence are linked to the respective ends of the nucleic acid molecule except that the sample index and/or molecular barcode differs in its sequence. In some embodiments, the one or more sequencing libraries are obtained using a single-stranded sequencing library preparation method. In some embodiments, analysis may comprise sequencing one or more sequencing libraries obtained using a double-stranded sequencing library preparation method. Further
[0103] In some embodiments, the nucleic acid molecules of the sample may be tagged with sample indexes and/or molecular barcodes (referred to generally as “tags”). Tags can form part of an adapter.
[0104] Tags can be molecules, such as nucleic acids, containing information that indicates a feature of the molecule with which the tag is associated. For example, molecules can bear a sample tag or sample index (which distinguishes molecules in one sample from those in a different sample), a partition or sub-partition tag (which distinguishes molecules in one partition or sub-partition from those in a different partition or sub-partition) and/or a molecular tag/molecular barcode/barcode (which distinguishes different molecules from one another, in both unique and non-unique tagging scenarios). In certain embodiments, a tag can comprise one or a combination of barcodes.
[0105] Optionally, adapters may contain a partition-specific barcode and/or a molecular barcode. As used herein, the term “barcode” refers to a nucleic acid molecule having a particular nucleotide sequence, or to the nucleotide sequence itself, depending on context. A barcode can have, for example, between 10 and 100 nucleotides. A collection of barcodes can have degenerate sequences or can have sequences having a certain Hamming distance, as desired for the specific purpose. So, for example, a molecular barcode can be comprised of one barcode or a combination of two barcodes, each attached to different ends of a molecule. Additionally, or alternatively, for different partitions or different samples, different sets of molecular barcodes can be used such that the barcodes serve as a molecular tag through their individual sequences and also serve to identify the partition or sample to which they correspond based on the set of which they are a member.
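To illustrate what a collection of barcodes "having a certain Hamming distance" means, the following sketch computes the Hamming distance (the number of positions at which two equal-length sequences differ) and the minimum pairwise distance over a small set. The three 10-nucleotide sequences are invented for illustration and are not barcodes from the disclosure.

```python
from itertools import combinations


def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_hamming(barcodes):
    """Return the smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


# Hypothetical 10-nucleotide barcodes.
barcodes = ["ACGTACGTAC", "TGCATGCATG", "ACGTTGCAAC"]
min_distance = min_pairwise_hamming(barcodes)
```

A larger minimum distance means that more sequencing errors are needed before one barcode could be mistaken for another.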
[0106] In some embodiments, the tags may have additional functions, for example the tags can be used to index sample sources or used as unique molecular identifiers (which can be used to improve the quality of sequencing data by differentiating sequencing errors from mutations, for example as in Kinde et al., Proc Nat’l Acad Sci USA 108: 9530-9535 (2011), Kou et al., PLoS ONE 11: e0146638 (2016)) or used as non-unique molecular identifiers, for example as described in US Pat. No. 9,598,731. Similarly, in some embodiments, the tags may have additional functions, for example the tags can be used to index sample sources or used as non-unique molecular identifiers (which can be used to improve the quality of sequencing data by differentiating sequencing errors from mutations).
[0107] Tags may be incorporated into or otherwise joined to adapters by chemical synthesis, ligation (e.g., as described above, e.g. by blunt-end ligation or sticky-end ligation), or overlap extension polymerase chain reaction (PCR), among other methods. Such adapters are ultimately joined to the target nucleic acid molecule. In other embodiments, one or more rounds of amplification cycles (e.g., PCR amplification) may be applied to introduce sample indexes to a nucleic acid using conventional nucleic acid amplification methods. The amplifications may be conducted in one or more reaction mixtures (e.g., a plurality of microwells in an array). Molecular barcodes, partition barcodes and/or sample indexes may be introduced simultaneously, or in any sequential order. In some embodiments, molecular barcodes and/or sample indexes are introduced prior to and/or after sequence capturing steps, if present, are performed. In some embodiments, only the molecular barcodes are introduced prior to probe capturing and the sample indexes are introduced after sequence capturing steps are performed. In some embodiments, both the molecular barcodes and the sample indexes are introduced prior to performing probe-based sequence capturing steps, if present. In some embodiments, the sample indexes are introduced after sequence capturing steps are performed, if present. In some embodiments, sample indexes are incorporated through overlap extension polymerase chain reaction (PCR).
[0108] In some embodiments, the tags may be located at one end or at both ends of the sample nucleic acids. In some embodiments, tags are predetermined or random or semirandom sequences. In some embodiments, the tag(s) may together be less than about 500, 200, 100, 50, 20, 10, 9, 8, 7, 6, or 5 nucleotides in length. Typically, tags are about 5 to 20 or 6 to 15 nucleotides in length. The tags may be linked to sample nucleic acids randomly or non-randomly.
[0109] In some embodiments, each sample is distinctly tagged with a sample index or a combination of sample indexes. In some embodiments, each nucleic acid molecule of a sample or subsample is uniquely tagged with a molecular barcode or a combination of molecular barcodes. In other embodiments, a plurality of molecular barcodes may be used such that molecular barcodes are not necessarily unique to one another in the plurality (e.g., non-unique molecular barcodes). In these embodiments, molecular barcodes are generally attached (e.g., by ligation) to individual nucleic acid molecules such that the combination of the molecular barcode and the sequence of the sample nucleic acid that it is attached to creates a unique sequence that may be individually tracked. Detection of non-unique molecular barcodes in combination with endogenous sequence information typically allows for the assignment of a unique identity to a particular molecule.
Endogenous sequence information includes the beginning (start) and/or end (stop) genomic location/position corresponding to the sequence of the original nucleic acid molecule in the sample, start and stop genomic positions corresponding to the sequence of the original nucleic acid molecule in the sample, the beginning (start) and/or end (stop) genomic location/position of the sequence read that is mapped to the reference sequence, start and stop genomic positions of the sequence read that is mapped to the reference sequence, sub-sequences of sequence reads at one or both ends, length of sequence reads, and/or length of the original nucleic acid molecule in the sample. In some embodiments, the beginning region comprises the first 5, the first 10, the first 15, the first 20, the first 25, the first 30 or at least the first 30 base positions at the 5' end of the sequencing read that align to the reference sequence. In some embodiments, the end region comprises the last 5, the last 10, the last 15, the last 20, the last 25, the last 30 or at least the last 30 base positions at the 3' end of the sequencing read that align to the reference sequence. The length, or number of base pairs, of an individual sequence read is also optionally used to assign a unique identity to a given molecule. As described herein, fragments from a single strand of nucleic acid, having been assigned a unique identity, may thereby permit subsequent identification of fragments from the parent strand and/or a complementary strand.
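The assignment of an identity from a non-unique molecular barcode combined with endogenous sequence information can be sketched as a grouping step. In this minimal illustration, reads that share the same barcode and the same mapped start and stop positions are treated as deriving from one original molecule. The record fields (`name`, `barcode`, `chrom`, `start`, `stop`) and the example values are hypothetical and do not correspond to any particular file format or pipeline.

```python
from collections import defaultdict


def group_reads(reads):
    """Group read names by (barcode, chromosome, start, stop).

    Each key stands in for the identity assigned to one original molecule.
    """
    families = defaultdict(list)
    for read in reads:
        key = (read["barcode"], read["chrom"], read["start"], read["stop"])
        families[key].append(read["name"])
    return dict(families)


# Hypothetical mapped reads.
reads = [
    {"name": "r1", "barcode": "ACGTA", "chrom": "chr1", "start": 100, "stop": 266},
    {"name": "r2", "barcode": "ACGTA", "chrom": "chr1", "start": 100, "stop": 266},
    # Same coordinates as r1/r2 but a different barcode: a different molecule.
    {"name": "r3", "barcode": "TTGCA", "chrom": "chr1", "start": 100, "stop": 266},
    # Same barcode as r1/r2 but different coordinates: a different molecule.
    {"name": "r4", "barcode": "ACGTA", "chrom": "chr1", "start": 150, "stop": 320},
]
families = group_reads(reads)
```

Here the barcode alone would not separate r1 from r4, and the coordinates alone would not separate r1 from r3; the combination distinguishes all three molecules.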
[0110] In certain embodiments, the number of different tags used to uniquely identify a number of molecules, z, in a class can be between any of 2*z, 3*z, 4*z, 5*z, 6*z, 7*z, 8*z, 9*z, 10*z, 11*z, 12*z, 13*z, 14*z, 15*z, 16*z, 17*z, 18*z, 19*z, 20*z or 100*z (e.g., lower limit) and any of 100,000*z, 10,000*z, 1000*z or 100*z (e.g., upper limit). In some embodiments, molecular barcodes are introduced at an expected ratio of a set of identifiers (e.g., a combination of unique or non-unique molecular barcodes) to molecules in a sample. One example format uses from about 2 to about 1,000,000 different molecular barcode sequences, or from about 5 to about 150 different molecular barcode sequences, or from about 20 to about 50 different molecular barcode sequences, ligated to both ends of a target molecule. Alternatively, from about 25 to about 1,000,000 different molecular barcode sequences may be used. For example, 20-50 x 20-50 molecular barcode sequences (i.e., one of the 20-50 different molecular barcode sequences can be attached to each end of the target molecule) can be used. Such numbers of identifiers are typically sufficient for different molecules having the same start and stop points to have a high probability of receiving different combinations of identifiers.
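The statement that such numbers of identifiers give a high probability of distinct combinations can be illustrated with a short calculation. The sketch below assumes, purely for illustration, that each molecule receives one of the available barcode combinations uniformly at random and independently of the others; real ligation efficiencies may deviate from this. Under that assumption, the probability that n molecules sharing the same start and stop points all receive different combinations out of k possible combinations is the product of (k - i)/k for i from 0 to n - 1.

```python
def prob_all_distinct(n_molecules, n_combinations):
    """Probability that all molecules receive different identifier combinations.

    Assumes uniform, independent assignment of combinations (an illustrative
    simplification, not a property stated in the disclosure).
    """
    if n_molecules > n_combinations:
        return 0.0
    p = 1.0
    for i in range(n_molecules):
        p *= (n_combinations - i) / n_combinations
    return p


# 20 x 20 and 50 x 50 barcodes, one barcode attached to each end of the molecule.
combos_low = 20 * 20
combos_high = 50 * 50
p_two_low = prob_all_distinct(2, combos_low)
p_five_high = prob_all_distinct(5, combos_high)
```

With 400 combinations, two molecules with identical start and stop points receive different combinations with probability 399/400 under this model, and the probability rises further as the number of combinations grows.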
[0111] In some embodiments, the assignment of unique or non-unique molecular barcodes in reactions is performed using methods and systems described in, for example, U.S. Patent Application Nos. 2001/0053519, 2003/0152490, and 2011/0160078, and U.S. Patent Nos. 6,582,908, 7,537,898, 9,598,731, and 9,902,992. Alternatively, in some embodiments, different nucleic acid molecules of a sample may be identified using only endogenous sequence information (e.g., start and/or stop positions, sub-sequences of one or both ends of a sequence, and/or lengths). The addition of tags (e.g. sample indexes, partition and/or sub-partition tags and/or molecular barcodes) to nucleic acids can be done through amplification, wherein the tags are comprised in primers used for amplification.
[0112] In some embodiments, the nucleic acids are ligated to adapters comprising molecular barcodes. These molecular barcodes (optionally in combination with endogenous sequence information) can then be used when analyzing the sequencing data to group sequence reads deriving from the same parent nucleic acids (i.e. those nucleic acids prior to any amplification). The grouped sequence reads can then be analysed, for example, to determine a consensus sequence for parent nucleic acids. This approach is particularly suited to methods which identify the presence or absence of genetic variants in addition to the modification status of the nucleic acids. This is because the determination of a consensus sequence can distinguish true genetic variants from sequencing and/or amplification errors.
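One simple way to picture the determination of a consensus sequence from grouped reads is a per-position majority vote, sketched below. This is an illustrative simplification that assumes the reads in a group are already aligned and of equal length; it is not presented as the consensus method of the disclosure, and the example sequences are invented.

```python
from collections import Counter


def consensus(reads):
    """Return the per-position majority base across equal-length, aligned reads.

    Ties are resolved in favour of the base seen first at that position,
    which is an arbitrary choice made for this sketch.
    """
    if not reads:
        raise ValueError("at least one read is required")
    length = len(reads[0])
    if any(len(r) != length for r in reads):
        raise ValueError("reads must have equal length")
    bases = []
    for column in zip(*reads):
        base, _count = Counter(column).most_common(1)[0]
        bases.append(base)
    return "".join(bases)


# One read carries an isolated mismatch at the fourth position.
error_family = ["ACGTTAGC", "ACGTTAGC", "ACGATAGC"]
# Every read carries the same change at the fourth position.
variant_family = ["ACGATAGC", "ACGATAGC", "ACGATAGC"]
error_consensus = consensus(error_family)
variant_consensus = consensus(variant_family)
```

A mismatch seen in only one read of a group is voted out, which is consistent with it being a sequencing or amplification error, whereas a change present in every read of the group survives into the consensus.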
Amplification
[0113] Sample nucleic acids flanked by adapters can be amplified by PCR and/or other amplification methods. Amplification is typically primed by primers binding to primer binding sites in adapters flanking a DNA molecule to be amplified. Amplification methods can involve cycles of denaturation, annealing and extension, resulting from thermocycling or can be isothermal as in transcription-mediated amplification. Other amplification methods include the ligase chain reaction, strand displacement amplification, nucleic acid sequence-based amplification, and self-sustained sequence-based replication.
[0114] In some embodiments, the present methods perform dsDNA ligations with T-tailed and C-tailed adapters when the sample nucleic acids have been subjected to A-tailing, e.g. using T4 polymerase or Klenow large fragment. This increases the efficiency of ligation and results in amplification of at least 50, 60, 70 or 80% of double stranded nucleic acids. Such methods can increase the amount or number of amplified molecules relative to control methods performed with T-tailed adapters alone by at least 10, 15 or 20%.
[0115] Amplification may be performed before or after any sequence capture step. In some embodiments, the ligating occurs before or simultaneously with amplification. In some embodiments, amplification is primed by primer binding to primer binding site(s) in the adapter(s).
[0116] Targeted amplification may also be used to enrich for nucleic acid biomarkers which are predictive of disease. Such targeted amplification can use target-specific primers which target amplicons comprising variant positions associated with a disease (e.g. sequence-variable target regions). In some embodiments, the methods disclosed herein can be used for the detection of minimal residual disease (MRD). In such embodiments, the primers may be personalized to the subject such that they target amplicons comprising genetic variants previously identified in that subject, for example in a tumor sample. Accordingly, the methods disclosed herein can comprise a tumor-informed MRD assay. Alternatively, the methods disclosed herein can comprise a tumor-naive MRD assay. In such embodiments, the targeted amplicons may comprise genetic variants known to be biomarkers, but wherein the primers are not personalized to the specific subject. In some instances, a hybrid approach may be taken wherein a portion of the target amplicons comprise genetic variants previously identified in that subject, and a portion of the target amplicons are not personalized to the specific subject.
Capturing using capture probes
[0117] Nucleic acids may be subject to a sequence capture step, in which molecules having target sequences are captured for subsequent analysis. This allows nucleic acids derived from target regions of the genome to be isolated and analysed, thus avoiding the need for whole genome analysis. In some embodiments, capture is performed after an amplification step. In some embodiments, capture is performed before an amplification step. Target sequences may include genomic regions containing biomarkers known to be associated with a disease, such as cancer.
[0118] Capture may be performed using any suitable approach known in the art. Target capture can involve use of a bait set comprising oligonucleotide baits labeled with a capture group, such as the examples noted below. The probes can have sequences selected to tile across a panel of regions, such as genes. Such bait sets are combined with a sample under conditions that allow hybridization of the target molecules with the baits. Then, captured molecules are isolated using the capture group. For example, a biotin capture group can be captured by bead-based streptavidin. Such methods are further described in, for example, U.S. patent 9,850,523.
[0119] Capture groups include, without limitation, biotin, avidin, streptavidin, a nucleic acid comprising a particular nucleotide sequence, a hapten recognized by an antibody, and magnetically attractable particles. The capture group can be a member of a binding pair, such as biotin/streptavidin or hapten/antibody. In some embodiments, a capture group that is attached to an analyte is captured by its binding partner, which is attached to an isolatable moiety, such as a magnetically attractable particle or a large particle that can be sedimented through centrifugation. The capture group can be any type of molecule that allows affinity separation of nucleic acids bearing the capture group from nucleic acids lacking the capture group. Exemplary capture groups are biotin, which allows affinity separation by binding to streptavidin linked or linkable to a solid phase, and an oligonucleotide, which allows affinity separation through binding to a complementary oligonucleotide linked or linkable to a solid phase.
[0120] In some embodiments, the methods herein comprise capturing nucleic acids comprising nucleic acid biomarkers, such as those described elsewhere herein. Such nucleic acid biomarkers can include epigenetic and/or sequence-variable target regions. In some embodiments, the methods herein comprise capturing nucleic acids comprising epigenetic target regions, such as differentially methylated regions. Such regions may be captured from a sample (e.g., a subsample) that has undergone attachment of adapters, derivatization, partitioning, and/or amplification. Enriching for or capturing DNA comprising epigenetic and/or sequence-variable target regions may comprise contacting the DNA with a set of target-specific probes. The set of target-specific probes may have any of the features described herein for sets of target-specific probes, including but not limited to in the embodiments set forth above and the sections relating to probes below.
[0121] The capturing step may be performed using conditions suitable for specific nucleic acid hybridization, which generally depend to some extent on features of the probes such as length, base composition, etc. Those skilled in the art will be familiar with appropriate conditions given general knowledge in the art regarding nucleic acid hybridization. In some embodiments, complexes of target-specific probes and DNA are formed.
[0122] In some embodiments, methods described herein comprise capturing a plurality of sets of target regions. The target regions may comprise intronic regions or VDJ regions that may comprise rearrangements. The target regions may comprise epigenetic target regions, which may show differences in methylation levels depending on whether they originated from a tumor or from healthy cells. The target regions may comprise sequence-variable regions, which may show differences in sequence, other than rearrangements, depending on whether they originated from a tumor or from healthy cells. The target regions may comprise both epigenetic target regions and sequence-variable regions. The capturing step produces a captured set of DNA molecules. In some embodiments, the DNA molecules corresponding to the sequence-variable target region set are captured at a greater capture yield in the captured set of DNA molecules than DNA molecules corresponding to the epigenetic target region set. In some embodiments, a method described herein comprises contacting DNA with a set of target-specific probes, wherein the set of target-specific probes is configured to capture cfDNA corresponding to the sequence-variable target region set at a greater capture yield than DNA corresponding to the epigenetic target region set. For additional discussion of capturing steps, capture yields, and related aspects, see WO2020/160414.
[0123] It can be beneficial to capture DNA corresponding to the sequence-variable target region set at a greater capture yield than DNA corresponding to the epigenetic target region set because a greater depth of sequencing may be necessary to analyse the sequence-variable target regions with sufficient confidence or accuracy than may be necessary to analyse the epigenetic target regions. The volume of data needed to determine fragmentation patterns (e.g., to test for perturbation of transcription start sites or CTCF binding sites) or methylation status is generally less than the volume of data needed to determine the presence or absence of genetic variants, such as cancer-related sequence mutations. Capturing the target region sets at different yields can facilitate sequencing the target regions to different depths of sequencing in the same sequencing run (e.g., using a pooled mixture and/or in the same sequencing cell). Accordingly, in some embodiments, sequence variable regions are sequenced at a greater depth (e.g. at least on average 2 x greater, 5 x greater, 10 x greater or 100 x greater) than epigenetic variable regions.
[0124] In some embodiments, amplification is performed before the capturing step. In some embodiments, amplification is performed after the capturing step. In some embodiments, an amplification step is performed before and after the capturing step. In some embodiments, the methods further comprise sequencing the captured DNA to different degrees of sequencing depth for the epigenetic and sequence-variable target region sets and for rearrangements, consistent with the discussion herein.
[0125] In some embodiments, a capturing step is performed with probes for a sequence-variable target region set and probes for an epigenetic target region set in the same vessel at the same time, e.g., the probes for the sequence-variable and epigenetic target region sets are in the same composition. This approach provides a relatively streamlined workflow. In some embodiments, the concentration of the probes for the sequence-variable target region set is greater than the concentration of the probes for the epigenetic target region set.
[0126] Alternatively, a capturing step is performed with a sequence-variable target region probe set in a first vessel and with an epigenetic target region probe set in a second vessel, or a contacting step is performed with a sequence-variable target region probe set at a first time and a first vessel and an epigenetic target region probe set at a second time before or after the first time. This approach allows for preparation of separate first and second compositions comprising captured DNA corresponding to a sequence-variable target region set and captured DNA corresponding to an epigenetic target region set. The compositions can be processed separately as desired (e.g., partitioning based on epigenetic modification, such as nucleic acid methylation). These can then be pooled in appropriate proportions to provide material for further processing and analysis such as sequencing.
[0127] The capturing step may be performed using variant-specific hybridization, for example wherein the variant is a biomarker for a disease. Variant-specific hybridization-based enrichment of nucleic acids involves selectively capturing target sequences containing specific genetic variants (which can be biomarkers). This process utilizes probes designed to complement the variant sequences of interest. Upon hybridization, these probes bind to their complementary sequences, enabling the isolation of the desired genetic variants from a heterogeneous mixture of nucleic acids. By exploiting the specificity of hybridization, this method allows for the enrichment of variant-containing nucleic acids relative to non-variant-containing nucleic acids from the same genomic region. This technique can be used to enhance the sensitivity of detecting genetic variants (including biomarkers).
[0128] In some embodiments, the capturing step may involve the use of a personalized capture panel. Such personalized capture panels may target genomic regions comprising genetic variants previously identified in that subject, for example in a tumor sample. In other embodiments, the capture panel is not personalized to the subject. In some embodiments, a portion of the capture panel is personalized to the subject and a portion of the capture panel is not personalized to the subject.
[0129] To improve the likelihood of detecting genomic regions of interest and, optionally, tumor-indicating mutations, the sections of DNA sequenced may comprise a panel of genes or genomic sections that comprise known genomic regions. Selection of a limited section for sequencing (e.g., a limited panel) can reduce the total sequencing needed (e.g., a total amount of nucleotides sequenced). A sequencing panel can target a plurality of different genes or regions, for example, to detect a single cancer, a set of cancers, or all cancers. Alternatively, DNA may be sequenced by whole genome sequencing (WGS) or other unbiased sequencing method without the use of a sequencing panel. Examples of suitable panels and targets for use in panels can be found in the epigenetic targets described in International Application WO2020/160414, filed January 31, 2020, which is incorporated by reference in its entirety.
[0130] In some aspects, a panel that targets a plurality of different genes or genomic regions (e.g., CHIP genes, transcriptional factor binding regions, distal regulatory elements (DREs), repetitive elements, intron-exon junctions, transcriptional start sites (TSSs), and/or the like) is selected such that a determined proportion of subjects having a cancer exhibits a genetic variant or tumor marker in one or more different genes in the panel. The panel may be selected to limit a region for sequencing to a fixed number of base pairs. The panel may be selected to sequence a desired amount of DNA. The panel may be further selected to achieve a desired sequence read depth. The panel may be selected to achieve a desired sequence read depth or sequence read coverage for an amount of sequenced base pairs. The panel may be selected to achieve a theoretical sensitivity, a theoretical specificity, and/or a theoretical accuracy for detecting one or more genetic variants in a sample.
[0131] Genes included in this panel may comprise one or more of: ATM, ATR, BAP1, BARD1, BRCA1, BRCA2, BRIP1, CDK12, CHEK1, CHEK2, FANCA, FANCL, HDAC2, MRE11, NBN, PALB2, RAD50, RAD51, RAD51B, RAD51C, RAD51D, RAD54L, XRCC2, XRCC3, DNMT3A, TP53, LRP1B, KRAS, MARCH11, TAC1, TCF21, SHOX2, p16, Casp8, CDH13, MGMT, MLH1, MSH2, TSLC1, APC, DKK1, DKK3, LKB1, WIF1, RUNX3, GATA4, GATA5, PAX5, E-Cadherin, H-Cadherin, VIM, SEPT9, CYCD2, TFPI2, GATA4, RARB2, p16INK4a, APC, NDRG4, HLTF, HPP1, hMLH1, RASSF1A, IGFBP3, ITGA4, PIK3CA, ERBB2 (HER2), BRCA1/2, NTRK1/2/3, MSI-High, ESR1, ATM, HRR, FGFR2/3, IDH1, KRAS, NRAS, BRAF, KIT, PDGFRA, EGFR, ALK, ROS1, MET, TMB, or RET.
[0132] Probes for detecting the panel of regions can include those for detecting genomic regions of interest (hotspot regions) as well as nucleosome-aware probes (e.g., KRAS codons 12 and 13) and may be designed to optimize capture based on analysis of cfDNA coverage and fragment size variation impacted by nucleosome binding patterns and GC sequence composition. Regions used herein can also include non-hotspot regions optimized based on nucleosome positions and GC models. The panel can comprise a plurality of subpanels, including subpanels for identifying tissue of origin (e.g., use of published literature to define 50-100 baits representing genes with the most diverse transcription profiles across tissues (not necessarily promoters)), whole genome scaffold (e.g., for identifying ultra-conservative genomic content and tiling sparsely across chromosomes with a handful of probes for copy number baselining purposes), and transcription start site (TSS)/CpG islands (e.g., for capturing differentially methylated regions (DMRs), for example in promoters of tumor suppressor genes (e.g., SEPT9/VIM in colorectal cancer)). In some embodiments, markers for a tissue of origin are tissue-specific epigenetic markers.
[0133] Some examples of listings of genomic locations of interest may be found in Table 7 and Table 8. In some embodiments, genomic locations used in the methods of the present disclosure comprise at least a portion of at least 5, at least 10, at least 15, at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, at least 50, at least 55, at least 60, at least 65, at least 70, at least 75, at least 80, at least 85, at least 90, at least 95, or 97 of the genes of Table 7. In some embodiments, genomic locations used in the methods of the present disclosure comprise at least 5, at least 10, at least 15, at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, at least 50, at least 55, at least 60, at least 65, or 70 of the SNVs of Table 7. In some embodiments, genomic locations used in the methods of the present disclosure comprise at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, or 18 of the CNVs of Table 7. In some embodiments, genomic locations used in the methods of the present disclosure comprise at least 1, at least 2, at least 3, at least 4, at least 5, or 6 of the fusions of Table 7. In some embodiments, genomic locations used in the methods of the present disclosure comprise at least a portion of at least 1, at least 2, or 3 of the indels of Table 7. In some embodiments, genomic locations used in the methods of the present disclosure comprise at least a portion of at least 5, at least 10, at least 15, at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, at least 50, at least 55, at least 60, at least 65, at least 70, at least 75, at least 80, at least 85, at least 90, at least 95, at least 100, at least 105, at least 110, or 115 of the genes of Table 8. 
In some embodiments, genomic locations used in the methods of the present disclosure comprise at least 5, at least 10, at least 15, at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, at least 50, at least 55, at least 60, at least 65, at least 70, or 73 of the SNVs of Table 8. In some embodiments, genomic locations used in the methods of the present disclosure comprise at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, or 18 of the CNVs of Table 8. In some embodiments, genomic locations used in the methods of the present disclosure comprise at least 1, at least 2, at least 3, at least 4, at least 5, or 6 of the fusions of Table 8. In some embodiments, genomic locations used in the methods of the present disclosure comprise at least a portion of at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, or 18 of the indels of Table 8. Each of these genomic locations of interest may be identified as a backbone region or hot-spot region for a given bait set panel. An example of a listing of hot-spot genomic locations of interest may be found in Table 9. In some embodiments, genomic locations used in the methods of the present disclosure comprise at least a portion of at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, or at least 20 of the genes of Table 9. 
Each hot-spot genomic location is listed with several characteristics, including the associated gene, chromosome on which it resides, the start and stop position of the genome representing the gene’s locus, the length of the gene’s locus in base pairs, the exons covered by the gene, and the critical feature (e.g., type of mutation) that a given genomic location of interest may seek to capture.
TABLE 7
TABLE 8
TABLE 9
[0134] In some embodiments, the one or more regions in the panel comprise one or more loci from one or a plurality of genes for detecting residual cancer after surgery. This detection can be earlier than is possible for existing methods of cancer detection. In some embodiments, the one or more genomic locations in the panel comprise one or more loci from one or a plurality of genes for detecting cancer in a high-risk patient population. For example, smokers have much higher rates of lung cancer than the general population. Moreover, smokers can develop other lung conditions that make cancer detection more difficult, such as the development of irregular nodules in the lungs. In some embodiments, the methods described herein detect cancer in high-risk patients earlier than is possible for existing methods of cancer detection.
[0135] A genomic location may be selected for inclusion in a sequencing panel based on a number of subjects with a cancer that have a tumor marker in that gene or region. A genomic location may be selected for inclusion in a sequencing panel based on prevalence of subjects with a cancer and a tumor marker present in that gene. Presence of a tumor marker in a region may be indicative of a subject having cancer.
[0136] In some instances, the panel may be selected using information from one or more databases. The information regarding a cancer may be derived from cancer tumor biopsies or cfDNA assays. A database may comprise information describing a population of sequenced tumor samples. A database may comprise information about mRNA expression in tumor samples. A database may comprise information about regulatory elements or genomic regions in tumor samples. The information relating to the sequenced tumor samples may include the frequency of various genetic variants and describe the genes or regions in which the genetic variants occur. The genetic variants may be tumor markers. A non-limiting example of such a database is COSMIC. COSMIC is a catalogue of somatic mutations found in various cancers. For a particular cancer, COSMIC ranks genes based on frequency of mutation. A gene may be selected for inclusion in a panel based on having a high frequency of mutation within that gene. For instance, COSMIC indicates that 33% of a population of sequenced breast cancer samples have a mutation in TP53 and 22% of a population of sampled breast cancers have a mutation in KRAS. Other ranked genes, including APC, have mutations found only in about 4% of a population of sequenced breast cancer samples. TP53 and KRAS may be included in a sequencing panel based on having relatively high frequency among sampled breast cancers (compared to APC, for example, which occurs at a frequency of about 4%). COSMIC is provided as a non-limiting example; however, any database or set of information may be used that associates a cancer with a tumor marker located in a gene or genetic region. In another example, as provided by COSMIC, of 1156 biliary tract cancer samples, 380 samples (33%) carried mutations in TP53. Several other genes, such as APC, have mutations in 4-8% of all samples. 
Thus, TP53 may be selected for inclusion in the panel based on a relatively high frequency in a population of biliary tract cancer samples.
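By way of non-limiting illustration, the frequency-based selection described above can be sketched in Python. The function names and the 20% cutoff are assumptions made for this sketch only and are not part of the disclosed methods. The TP53 counts are the biliary tract figures given above; the APC count is a hypothetical value chosen to fall within the stated 4-8% range.

```python
from typing import Dict, List, Tuple


def mutation_frequency(mutated: int, total: int) -> float:
    """Fraction of sequenced tumor samples carrying a mutation in a gene."""
    if total <= 0:
        raise ValueError("total number of samples must be positive")
    return mutated / total


def select_genes(counts: Dict[str, Tuple[int, int]], min_frequency: float) -> List[str]:
    """Return genes whose mutation frequency meets the cutoff, highest first."""
    freqs = {gene: mutation_frequency(m, n) for gene, (m, n) in counts.items()}
    selected = [gene for gene, f in freqs.items() if f >= min_frequency]
    return sorted(selected, key=lambda gene: freqs[gene], reverse=True)


# TP53: 380 of 1156 biliary tract samples (from the text above).
# APC: 70 of 1156 is a hypothetical count within the stated 4-8% range.
counts = {"TP53": (380, 1156), "APC": (70, 1156)}

# The 20% cutoff is an arbitrary illustrative threshold.
panel_genes = select_genes(counts, min_frequency=0.20)
print(panel_genes)
```

In practice, the cutoff could be chosen per cancer type, and the counts could be drawn from any database that associates a cancer with tumor markers in a gene or region.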
[0137] A gene or genomic section may be selected for a panel where the frequency of a tumor marker is significantly greater in sampled tumor tissue or circulating tumor DNA than found in a given background population. A combination of genomic locations may be selected for inclusion in a panel such that at least a majority of subjects having a cancer may have a tumor marker or genomic region present in at least one of the genomic locations or genes in the panel. The combination of genomic locations may be selected based on data indicating that, for a particular cancer or set of cancers, a majority of subjects have one or more tumor markers in one or more of the selected regions. For example, to detect cancer 1, a panel comprising regions A, B, C, and/or D may be selected based on data indicating that 90% of subjects with cancer 1 have a tumor marker in regions A, B, C, and/or D of the panel. Alternatively, tumor markers may be shown to occur independently in two or more regions in subjects having a cancer such that, combined, a tumor marker in the two or more regions is present in a majority of a population of subjects having a cancer. For example, to detect cancer 2, a panel comprising regions X, Y, and Z may be selected based on data indicating that 90% of subjects have a tumor marker in one or more regions, and in 30% of such subjects a tumor marker is detected only in region X, while tumor markers are detected only in regions Y and/or Z for the remainder of the subjects for whom a tumor marker was detected. Tumor markers present in one or more genomic locations previously shown to be associated with one or more cancers may be indicative of or predictive of a subject having cancer if a tumor marker is detected in one or more of those regions 50% or more of the time. 
Computational approaches such as models employing conditional probabilities of detecting cancer given a cancer frequency for a set of tumor markers within one or more regions may be used to predict which regions, alone or in combination, may be predictive of cancer. Other approaches for panel selection involve the use of databases describing information from studies employing comprehensive genomic profiling of tumors with large panels and/or whole genome sequencing (WGS, RNA-seq, ChIP-seq, bisulfite sequencing, ATAC-seq, and others). Information gleaned from literature may also describe pathways commonly affected and mutated in certain cancers. Panel selection may be further informed by the use of ontologies describing genetic information.
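As one non-limiting sketch of how a combination of regions might be chosen so that a target proportion of subjects has a tumor marker in at least one panel region, the Python example below uses a simple greedy coverage heuristic. This heuristic is an assumption of the sketch and is a stand-in for, not an implementation of, the conditional probability models mentioned above. The subject identifiers and per-subject region sets are hypothetical data loosely patterned on the cancer 2 example.

```python
from typing import Dict, List, Set


def panel_coverage(panel: Set[str], subject_regions: Dict[str, Set[str]]) -> float:
    """Fraction of subjects with a tumor marker in at least one panel region."""
    if not subject_regions:
        raise ValueError("at least one subject is required")
    covered = sum(1 for regions in subject_regions.values() if regions & panel)
    return covered / len(subject_regions)


def greedy_panel(subject_regions: Dict[str, Set[str]], target_coverage: float) -> List[str]:
    """Add the region covering the most uncovered subjects until the target is met."""
    total = len(subject_regions)
    uncovered = set(subject_regions)
    panel: List[str] = []
    while uncovered and (total - len(uncovered)) / total < target_coverage:
        tally: Dict[str, int] = {}
        for subject in uncovered:
            for region in subject_regions[subject]:
                tally[region] = tally.get(region, 0) + 1
        if not tally:
            break  # remaining subjects have no marker in any candidate region
        # Sorting first makes ties resolve alphabetically (deterministic output).
        best = max(sorted(tally), key=lambda region: tally[region])
        panel.append(best)
        uncovered = {s for s in uncovered if best not in subject_regions[s]}
    return panel


# Hypothetical cohort: regions in which each subject carries a tumor marker.
subjects = {
    "s01": {"X"}, "s02": {"X"}, "s03": {"X"},
    "s04": {"Y"}, "s05": {"Y"}, "s06": {"Y"}, "s07": {"Y"},
    "s08": {"Z"}, "s09": {"Z"},
    "s10": set(),  # no marker detected in any candidate region
}

panel = greedy_panel(subjects, target_coverage=0.9)
print(panel, panel_coverage(set(panel), subjects))
```

A greedy heuristic does not guarantee the smallest possible panel; it is used here only because it is short and easy to follow.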
[0138] Genes included in the panel for sequencing can include the fully transcribed region, the promoter region, enhancer regions, regulatory elements, and/or downstream sequence. To further increase the likelihood of detecting tumor-indicating mutations, only exons may be included in the panel. The panel can comprise all exons of a selected gene, or only one or more of the exons of a selected gene. The panel may comprise exons from each of a plurality of different genes. The panel may comprise at least one exon from each of the plurality of different genes.
[0139] In some aspects, a panel of exons from each of a plurality of different genes is selected such that a determined proportion of subjects having a cancer exhibit a genetic variant in at least one exon in the panel of exons.
[0140] At least one full exon from each different gene in a panel of genes may be sequenced. The sequenced panel may comprise exons from a plurality of genes. The panel may comprise exons from 2 to 100 different genes, from 2 to 70 genes, from 2 to 50 genes, from 2 to 30 genes, from 2 to 15 genes, or from 2 to 10 genes.
[0141] A selected panel may comprise a varying number of exons. The panel may comprise from 2 to 3000 exons. The panel may comprise from 2 to 1000 exons. The panel may comprise from 2 to 500 exons. The panel may comprise from 2 to 100 exons. The panel may comprise from 2 to 50 exons. The panel may comprise no more than 300 exons. The panel may comprise no more than 200 exons. The panel may comprise no more than 100 exons. The panel may comprise no more than 50 exons. The panel may comprise no more than 40 exons. The panel may comprise no more than 30 exons. The panel may comprise no more than 25 exons. The panel may comprise no more than 20 exons. The panel may comprise no more than 15 exons. The panel may comprise no more than 10 exons. The panel may comprise no more than 9 exons. The panel may comprise no more than 8 exons. The panel may comprise no more than 7 exons.
[0142] The panel may comprise one or more exons from a plurality of different genes. The panel may comprise one or more exons from each of a proportion of the plurality of different genes. The panel may comprise at least two exons from each of at least 25%, 50%, 75% or 90% of the different genes. The panel may comprise at least three exons from each of at least 25%, 50%, 75% or 90% of the different genes. The panel may comprise at least four exons from each of at least 25%, 50%, 75% or 90% of the different genes.
[0143] The size of the sequencing panel may vary. A sequencing panel may be made larger or smaller (in terms of nucleotide size) depending on several factors including, for example, the total amount of nucleotides sequenced or a number of unique molecules sequenced for a particular region in the panel. The sequencing panel can be sized 5 kb to 50 kb. The sequencing panel can be 10 kb to 30 kb in size. The sequencing panel can be 12 kb to 20 kb in size. The sequencing panel can be 12 kb to 60 kb in size. The sequencing panel can be at least 10 kb, 12 kb, 15 kb, 20 kb, 25 kb, 30 kb, 35 kb, 40 kb, 45 kb, 50 kb, 60 kb, 70 kb, 80 kb, 90 kb, 100 kb, 110 kb, 120 kb, 130 kb, 140 kb, or 150 kb in size. The sequencing panel may be less than 100 kb, 90 kb, 80 kb, 70 kb, 60 kb, or 50 kb in size.
[0144] The panel selected for sequencing can comprise at least 1, 5, 10, 15, 20, 25, 30, 40, 50, 60, 80, or 100 genomic locations (e.g., that each include genomic regions of interest). In some cases, the genomic locations in the panel are selected such that the size of the locations is relatively small. In some cases, the regions in the panel have a size of about 10 kb or less, about 8 kb or less, about 6 kb or less, about 5 kb or less, about 4 kb or less, about 3 kb or less, about 2.5 kb or less, about 2 kb or less, about 1.5 kb or less, or about 1 kb or less. In some cases, the genomic locations in the panel have a size from about 0.5 kb to about 10 kb, from about 0.5 kb to about 6 kb, from about 1 kb to about 11 kb, from about 1 kb to about 15 kb, from about 1 kb to about 20 kb, from about 0.1 kb to about 10 kb, or from about 0.2 kb to about 1 kb. For example, the regions in the panel can have a size from about 0.1 kb to about 5 kb.
[0145] The panel selected herein can allow for deep sequencing that is sufficient to detect low-frequency genetic variants (e.g., in cell-free nucleic acid molecules obtained from a sample). An amount of genetic variants in a sample may be referred to in terms of the minor allele frequency for a given genetic variant. The minor allele frequency may refer to the frequency at which minor alleles (e.g., not the most common allele) occur in a given population of nucleic acids, such as a sample. Genetic variants at a low minor allele frequency may have a relatively low frequency of presence in a sample. In some cases, the panel allows for detection of genetic variants at a minor allele frequency of at least 0.0001%, 0.001%, 0.005%, 0.01%, 0.05%, 0.1%, or 0.5%. The panel can allow for detection of genetic variants at a minor allele frequency of 0.001% or greater. The panel can allow for detection of genetic variants at a minor allele frequency of 0.01% or greater. The panel can allow for detection of a genetic variant present in a sample at a frequency as low as 0.0001%, 0.001%, 0.005%, 0.01%, 0.025%, 0.05%, 0.075%, 0.1%, 0.25%, 0.5%, 0.75%, or 1.0%. The panel can allow for detection of tumor markers present in a sample at a frequency of at least 0.0001%, 0.001%, 0.005%, 0.01%, 0.025%, 0.05%, 0.075%, 0.1%, 0.25%, 0.5%, 0.75%, or 1.0%. The panel can allow for detection of tumor markers at a frequency in a sample as low as 1.0%. The panel can allow for detection of tumor markers at a frequency in a sample as low as 0.75%. The panel can allow for detection of tumor markers at a frequency in a sample as low as 0.5%. The panel can allow for detection of tumor markers at a frequency in a sample as low as 0.25%. The panel can allow for detection of tumor markers at a frequency in a sample as low as 0.1%. The panel can allow for detection of tumor markers at a frequency in a sample as low as 0.075%. 
The panel can allow for detection of tumor markers at a frequency in a sample as low as 0.05%. The panel can allow for detection of tumor markers at a frequency in a sample as low as 0.025%. The panel can allow for detection of tumor markers at a frequency in a sample as low as 0.01%. The panel can allow for detection of tumor markers at a frequency in a sample as low as 0.005%. The panel can allow for detection of tumor markers at a frequency in a sample as low as 0.001%. The panel can allow for detection of tumor markers at a frequency in a sample as low as 0.0001%. The panel can allow for detection of tumor markers in sequenced cfDNA at a frequency in a sample as low as 1.0% to 0.0001%. The panel can allow for detection of tumor markers in sequenced cfDNA at a frequency in a sample as low as 0.01% to 0.0001%.
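To illustrate why sequencing depth matters for low-frequency variants, the following Python sketch computes a minor allele frequency from molecule counts and estimates the probability of observing at least a given number of variant molecules at a given depth. The binomial sampling model (independent molecules, no sequencing error, no duplicate molecules), the function names, and the example depths are all assumptions of this sketch and are not taken from the disclosure.

```python
import math


def minor_allele_frequency(variant_molecules: int, total_molecules: int) -> float:
    """Fraction of molecules at a position that carry the minor allele."""
    if total_molecules <= 0:
        raise ValueError("total_molecules must be positive")
    return variant_molecules / total_molecules


def detection_probability(depth: int, maf: float, min_variant_molecules: int = 1) -> float:
    """P(at least `min_variant_molecules` variant molecules among `depth` molecules).

    Assumes each sequenced molecule independently carries the variant with
    probability `maf` (a simple binomial model; hypothetical simplification).
    """
    p_fewer = sum(
        math.comb(depth, i) * maf**i * (1.0 - maf) ** (depth - i)
        for i in range(min_variant_molecules)
    )
    return 1.0 - p_fewer


# A variant at 0.01% minor allele frequency, sampled at two illustrative depths.
maf = minor_allele_frequency(1, 10_000)  # 0.0001, i.e., 0.01%
p_shallow = detection_probability(depth=1_000, maf=maf)
p_deep = detection_probability(depth=30_000, maf=maf)
print(f"depth 1,000: {p_shallow:.3f}; depth 30,000: {p_deep:.3f}")
```

Under these assumptions, a smaller panel permits more unique molecules to be sequenced per position for a fixed sequencing budget, which raises the chance that a rare variant molecule is sampled at all.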
[0146] A genetic variant can be exhibited in a percentage of a population of subjects who have a disease (e.g., cancer). In some cases, at least 1%, 2%, 3%, 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, or 99% of a population having the cancer exhibit one or more genetic variants in at least one of the regions in the panel. For example, at least 80% of a population having the cancer may exhibit one or more genetic variants in at least one of the genomic positions in the panel.
[0147] The panel can comprise one or more locations comprising genomic regions of interest from each of one or more genes. In some cases, the panel can comprise one or more locations comprising genomic regions of interest from each of at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50, or 80 genes. In some cases, the panel can comprise one or more locations comprising genomic regions of interest from each of at most 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50, or 80 genes. In some cases, the panel can comprise one or more locations comprising genomic regions of interest from each of from about 1 to about 80, from 1 to about 50, from about 3 to about 40, from 5 to about 30, or from 10 to about 20 different genes.
[0148] The locations comprising genomic regions in the panel can be selected so that one or more epigenetically modified regions are detected. The one or more epigenetically modified regions can be acetylated, methylated, ubiquitylated, phosphorylated, sumoylated, ribosylated, and/or citrullinated. For example, the regions in the panel can be selected so that one or more methylated regions are detected. In some embodiments, a genomic region of the panel may comprise one or more of the following genes: DNMT3A, TP53, LRP1B, KRAS, MARCH11, TAC1, TCF21, SHOX2, p16, Casp8, CDH13, MGMT, MLH1, MSH2, TSLC1, APC, DKK1, DKK3, LKB1, WIF1, RUNX3, GATA4, GATA5, PAX5, E-Cadherin, H-Cadherin, VIM, SEPT9, CYCD2, TFPI2, GATA4, RARB2, p16INK4a, APC, NDRG4, HLTF, HPP1, hMLH1, RASSF1A, IGFBP3, ITGA4, PIK3CA, ERBB2 (HER2), BRCA1/2, NTRK1/2/3, MSI-High, ESR1, ATM, HRR, FGFR2/3, IDH1, KRAS, NRAS, BRAF, KIT, PDGFRA, EGFR, ALK, ROS1, MET, TMB, or RET.
[0149] The regions in the panel can be selected so that they comprise sequences differentially transcribed across one or more tissues. In some cases, the locations comprising genomic regions can comprise sequences transcribed in certain tissues at a higher level compared to other tissues. For example, the locations comprising genomic regions can comprise sequences transcribed in certain tissues but not in other tissues.
[0150] The genomic locations in the panel can comprise coding and/or non-coding sequences. For example, the genomic locations in the panel can comprise one or more sequences in exons, introns, promoters, 3’ untranslated regions, 5’ untranslated regions, regulatory elements, transcription start sites, and/or splice sites. In some cases, the regions in the panel can comprise other non-coding sequences, including pseudogenes, repeat sequences, transposons, viral elements, and telomeres. In some cases, the genomic locations in the panel can comprise sequences in non-coding RNA, e.g., ribosomal RNA, transfer RNA, Piwi-interacting RNA, orphan non-coding RNA, and microRNA.
[0151] The genomic locations in the panel can be selected to detect (diagnose) a cancer with a desired level of sensitivity (e.g., through the detection of one or more genetic variants). For example, the regions in the panel can be selected to detect the cancer (e.g., through the detection of one or more genetic variants) with a sensitivity of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. The genomic locations in the panel can be selected to detect the cancer with a sensitivity of 100%.
[0152] The genomic locations in the panel can be selected to detect (diagnose) a cancer with a desired level of specificity (e.g., through the detection of one or more genetic variants). For example, the genomic locations in the panel can be selected to detect cancer (e.g., through the detection of one or more genetic variants) with a specificity of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. The genomic locations in the panel can be selected to detect the one or more genetic variants with a specificity of 100%.
[0153] The genomic locations in the panel can be selected to detect (diagnose) a cancer with a desired positive predictive value. Positive predictive value can be increased by increasing sensitivity (e.g., chance of an actual positive being detected) and/or specificity (e.g., chance of not mistaking an actual negative for a positive). As a non-limiting example, genomic locations in the panel can be selected to detect the one or more genetic variants with a positive predictive value of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. The regions in the panel can be selected to detect the one or more genetic variants with a positive predictive value of 100%.
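The dependence of positive predictive value on sensitivity and specificity can be illustrated with the standard relationship derived from Bayes' rule, which also involves the prevalence of the disease in the tested population. The Python sketch below is illustrative only; the function name and the example values for sensitivity, specificity, and prevalence are hypothetical and are not performance figures from the disclosure.

```python
def positive_predictive_value(sensitivity: float, specificity: float, prevalence: float) -> float:
    """PPV = true positives / (true positives + false positives), per tested subject."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    if true_positive_rate + false_positive_rate == 0.0:
        raise ValueError("no positive calls are possible with these inputs")
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Hypothetical values: 90% sensitivity, 1% prevalence, two specificities.
ppv_spec_99 = positive_predictive_value(sensitivity=0.90, specificity=0.99, prevalence=0.01)
ppv_spec_999 = positive_predictive_value(sensitivity=0.90, specificity=0.999, prevalence=0.01)
print(f"PPV at 99% specificity: {ppv_spec_99:.3f}")
print(f"PPV at 99.9% specificity: {ppv_spec_999:.3f}")
```

With these illustrative inputs, raising specificity has a large effect on positive predictive value because, at low prevalence, most positive calls would otherwise come from subjects without the disease.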
[0154] The genomic locations in the panel can be selected to detect (diagnose) a cancer with a desired accuracy. As used herein, the term “accuracy” may refer to the ability of a test to discriminate between a disease condition (e.g., cancer) and healthy condition. Accuracy can be quantified using measures such as sensitivity and specificity, predictive values, likelihood ratios, the area under the ROC curve, Youden’s index and/or diagnostic odds ratio.
[0155] Accuracy may be presented as a percentage, which refers to a ratio between the number of tests giving a correct result and the total number of tests performed. The regions in the panel can be selected to detect cancer with an accuracy of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. The genomic locations in the panel can be selected to detect cancer with an accuracy of 100%.
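Several of the accuracy measures named above can be computed from the four counts of a two-by-two table of test results. The Python sketch below uses the standard definitions of these measures; the class and function names and the example counts are hypothetical and chosen only for illustration. The area under the ROC curve is omitted because it requires scores across multiple decision thresholds rather than a single table.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # diseased subjects called positive
    fp: int  # healthy subjects called positive
    tn: int  # healthy subjects called negative
    fn: int  # diseased subjects called negative


def diagnostic_metrics(c: ConfusionCounts) -> Dict[str, float]:
    """Standard accuracy measures; raises ZeroDivisionError if a denominator is zero."""
    sensitivity = c.tp / (c.tp + c.fn)
    specificity = c.tn / (c.tn + c.fp)
    return {
        # Correct results divided by all tests performed.
        "accuracy": (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden_index": sensitivity + specificity - 1.0,
        "lr_positive": sensitivity / (1.0 - specificity),
        "lr_negative": (1.0 - sensitivity) / specificity,
        "diagnostic_odds_ratio": (c.tp * c.tn) / (c.fp * c.fn),
    }


# Hypothetical study: 100 subjects with cancer and 200 without.
metrics = diagnostic_metrics(ConfusionCounts(tp=90, fp=20, tn=180, fn=10))
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```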
[0156] A panel may be selected to be highly sensitive and detect low frequency genetic variants. For instance, a panel may be selected such that a genetic variant or tumor marker present in a sample at a frequency as low as 0.01%, 0.05%, or 0.001% may be detected at a sensitivity of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. Genomic locations in a panel may be selected to detect a tumor marker present at a frequency of 1% or less in a sample with a sensitivity of 70% or greater. A panel may be selected to detect a tumor marker at a frequency in a sample as low as 0.1% with a sensitivity of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. A panel may be selected to detect a tumor marker at a frequency in a sample as low as 0.01% with a sensitivity of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. A panel may be selected to detect a tumor marker at a frequency in a sample as low as 0.001% with a sensitivity of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%.
[0157] A panel may be selected to be highly specific and detect low frequency genetic variants. For instance, a panel may be selected such that a genetic variant or tumor marker present in a sample at a frequency as low as 0.01%, 0.05%, or 0.001% may be detected at a specificity of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. Genomic locations in a panel may be selected to detect a tumor marker present at a frequency of 1% or less in a sample with a specificity of 70% or greater. A panel may be selected to detect a tumor marker at a frequency in a sample as low as 0.1% with a specificity of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. A panel may be selected to detect a tumor marker at a frequency in a sample as low as 0.01% with a specificity of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. A panel may be selected to detect a tumor marker at a frequency in a sample as low as 0.001% with a specificity of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%.
[0158] A panel may be selected to be highly accurate and detect low frequency genetic variants. A panel may be selected such that a genetic variant or tumor marker present in a sample at a frequency as low as 0.01%, 0.05%, or 0.001% may be detected at an accuracy of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. Genomic locations in a panel may be selected to detect a tumor marker present at a frequency of 1% or less in a sample with an accuracy of 70% or greater. A panel may be selected to detect a tumor marker at a frequency in a sample as low as 0.1% with an accuracy of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. A panel may be selected to detect a tumor marker at a frequency in a sample as low as 0.01% with an accuracy of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. A panel may be selected to detect a tumor marker at a frequency in a sample as low as 0.001% with an accuracy of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%.
[0159] A panel may be selected to be highly predictive and detect low frequency genetic variants. A panel may be selected such that a genetic variant or tumor marker present in a sample at a frequency as low as 0.01%, 0.05%, or 0.001% may have a positive predictive value of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%.
[0160] The concentration of probes or baits used in the panel may be increased (2 to 6 ng/µL) to capture more nucleic acid molecules within a sample. The concentration of probes or baits used in the panel may be at least 2 ng/µL, 3 ng/µL, 4 ng/µL, 5 ng/µL, 6 ng/µL, or greater. The concentration of probes may be about 2 ng/µL to about 3 ng/µL, about 2 ng/µL to about 4 ng/µL, about 2 ng/µL to about 5 ng/µL, or about 2 ng/µL to about 6 ng/µL. The concentration of probes or baits used in the panel may be 2 ng/µL or more to 6 ng/µL or less. In some instances, this may allow for more molecules within a biological sample to be analyzed, thereby enabling lower frequency alleles to be detected.
[0161] In an embodiment, utilizing the sequencing pipeline, the panel may be subjected to one or more of: whole-genome bisulfite sequencing (WGBS) interrogating genome-wide methylation patterns, whole-genome sequencing (WGS), and/or targeted sequencing approaches interrogating copy-number variants (CNVs) and single-nucleotide variants (SNVs).
[0162] Genetic and/or epigenetic information obtained from DNA of the subject can be combined with imaging data, as described herein, to provide a determination of whether a subject has a cancer or a likelihood that the subject has a cancer. Detailed descriptions of how to analyze cell free human DNA for both genetic and epigenetic variants associated with cancer can be found in US provisional patent application 62/799637, which is herein incorporated by reference in its entirety. Additional guidance for analyzing cell free DNA for detecting cancer can be found in, among other places, US Patent 9834822, PCT application WO2018064629A1, and PCT application WO2017106768A1.
[0163] Various embodiments include the step of sequencing DNA (e.g., cfDNA) for the purpose of detecting genetic variants in genes associated with cancer. Various embodiments also include the step of sequencing DNA (e.g., cfDNA) for the purpose of detecting epigenetic variants in genes associated with cancer, including, but not limited to, DNA sequences that are differentially methylated in cancerous and noncancerous cells and nucleosomal fragmentation patterns such as those described in US published patent application US2017/0211143.
[0164] In some embodiments, a captured set of nucleic acids, e.g., comprising DNA (such as cfDNA), is provided. With respect to the disclosed methods, the captured set of DNA may be provided, e.g., following capturing, and/or separating steps as described herein. The captured set may comprise DNA corresponding to one or both of a sequence-variable target region set and an epigenetic target region set. In some embodiments, the captured set comprises DNA corresponding to a sequence-variable target region set, and an epigenetic target region set. In all embodiments described herein involving a sequence-variable target region set and an epigenetic target region set, the sequence-variable target region set comprises regions not present in the epigenetic target region set and vice versa, although in some instances a fraction of the regions may overlap (e.g., a fraction of genomic positions may be represented in both target region sets).
[0165] Methylation target region set
[0166] In some embodiments, an epigenetic target region set is captured. The epigenetic target region set may comprise one or more types of target regions likely to differentiate DNA from neoplastic (e.g., tumor or cancer) cells and from healthy cells, e.g., non-neoplastic circulating cells. The epigenetic target region set can be analyzed in various ways, including methods that do not depend on a high degree of accuracy in sequence determination of specific nucleotides within a target. Exemplary types of such regions are discussed in detail herein. In some embodiments, methods according to the disclosure comprise determining whether cfDNA molecules corresponding to the epigenetic target region set comprise or indicate cancer-associated epigenetic modifications (e.g., hypermethylation in one or more hypermethylation variable target regions; one or more perturbations of CTCF binding; and/or one or more perturbations of transcription start sites) and/or copy number variations (e.g., focal amplifications). Such analyses can be conducted by sequencing and require less data (e.g., number of sequence reads or depth of sequencing coverage) than determining the presence or absence of a sequence mutation such as a base substitution, insertion, or deletion. The epigenetic target region set may also comprise one or more control regions, e.g., as described herein.
[0167] In some embodiments, the epigenetic target region set has a footprint of at least 100 kb, e.g., at least 200 kb, at least 300 kb, or at least 400 kb. In some embodiments, the epigenetic target region set has a footprint in the range of 100-1000 kb, e.g., 100-200 kb, 200-300 kb, 300-400 kb, 400-500 kb, 500-600 kb, 600-700 kb, 700-800 kb, 800-900 kb, or 900-1000 kb.
[0168] In some embodiments, the sequencing panel comprises a panel of genomic regions that show no or negligible methylation signal when analyzing cell-free DNA (cfDNA) from healthy individuals (e.g., in blood) but exhibit detectable methylation when analyzing cfDNA from individuals with cancer. Such regions are characterized by low background methylation levels in healthy populations, thereby providing an enhanced contrast that facilitates sensitive detection of tumor-derived DNA. In some embodiments, the sequencing panel comprises a panel of genomic regions that show no or negligible methylation signal for a particular cell or tissue type (e.g., lung tissue) when analyzing cfDNA from healthy individuals (e.g., in blood) but exhibit detectable methylation when analyzing cfDNA from individuals with a disease associated with that particular cell or tissue type (e.g., if the tissue type is lung, then the disease can be lung cancer or a pulmonary disorder).
[0169] Selection of genomic regions for such a sequencing panel can comprise identification through comparative methylation profiling between healthy and cancerous cfDNA samples. The panel may include genomic loci that remain consistently unmethylated or minimally methylated across diverse healthy populations, while consistently acquiring methylation marks in cancer, or a specific type of cancer.
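A minimal sketch of such comparative profiling is shown below in Python. It assumes, for illustration only, that each region has a per-sample methylation fraction between 0 and 1, and it keeps regions whose methylation stays at or below a background cutoff in every healthy sample while reaching a minimum level in every cancer sample. The region names, sample values, and both cutoffs are hypothetical and are not taken from the disclosure.

```python
from typing import Dict, List


def select_low_background_regions(
    healthy: Dict[str, List[float]],
    cancer: Dict[str, List[float]],
    max_healthy: float = 0.05,
    min_cancer: float = 0.30,
) -> List[str]:
    """Regions that are consistently unmethylated in healthy cfDNA and
    consistently methylated in cancer cfDNA (cutoffs are illustrative)."""
    selected = []
    for region, healthy_values in healthy.items():
        cancer_values = cancer.get(region)
        if not healthy_values or not cancer_values:
            continue  # region lacks data in one of the cohorts
        # "Consistently" is modeled with max/min across samples, not the mean.
        low_background = max(healthy_values) <= max_healthy
        methylated_in_cancer = min(cancer_values) >= min_cancer
        if low_background and methylated_in_cancer:
            selected.append(region)
    return selected


# Hypothetical methylation fractions per region, one value per sample.
healthy_cfdna = {
    "region_A": [0.00, 0.01, 0.02],
    "region_B": [0.01, 0.40, 0.02],  # one healthy sample shows high background
    "region_C": [0.00, 0.00, 0.01],
}
cancer_cfdna = {
    "region_A": [0.45, 0.60, 0.38],
    "region_B": [0.70, 0.65, 0.80],
    "region_C": [0.02, 0.50, 0.01],  # methylated in only one cancer sample
}

panel_regions = select_low_background_regions(healthy_cfdna, cancer_cfdna)
print(panel_regions)
```

A less strict variant could use a percentile or a statistical test in place of the maximum and minimum, which would tolerate occasional outlier samples.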
[0170] Examples of genomic regions suitable for inclusion in such sequencing panels may include loci proximal to oncogene promoters, enhancers, or distal regulatory elements that are epigenetically modified exclusively during carcinogenesis. Additionally, regions encompassing tumor suppressor genes, particularly those silenced by promoter methylation in cancer cells, may also be included. The sequencing panel may further encompass intergenic regions identified through computational modeling and experimental validation to confirm negligible methylation in healthy cfDNA, yet consistently altered in cancer.
[0171] In some embodiments, the epigenetic target regions comprise differentially methylated regions (DMRs). In some embodiments, differentially methylated regions comprise a region of DNA having a detectably different degree of methylation in at least one cell or tissue type relative to the degree of methylation in the same region of DNA from at least one other cell or tissue type; or having a detectably different degree of methylation in at least one cell or tissue type obtained from a subject having a disease or disorder relative to the degree of methylation in the same region of DNA in the same cell or tissue type obtained from a healthy subject. In some embodiments, a DMR has a detectably higher degree of methylation (e.g., a hypermethylated region) in at least one cell or tissue type relative to the degree of methylation in the same region of DNA from at least one other cell or tissue type or from the same cell or tissue type from a healthy subject. In some embodiments, a DMR has a detectably lower degree of methylation (e.g., a hypomethylated region) in at least one cell or tissue type relative to the degree of methylation in the same region of DNA from at least one other cell or tissue type or from the same cell or tissue type from a healthy subject.
[0172] Hypermethylation variable target regions
[0173] In some embodiments, the epigenetic target region set comprises one or more hypermethylation variable target regions. In general, hypermethylation variable target regions refer to regions where an increase in the level of observed methylation indicates an increased likelihood that a sample (e.g., of cfDNA) contains DNA produced by neoplastic cells, such as tumor or cancer cells. For example, hypermethylation of promoters of tumor suppressor genes has been observed repeatedly. See, e.g., Kang et al., Genome Biol. 18:53 (2017) and references cited therein.
[0174] An extensive discussion of methylation variable target regions in colorectal cancer is provided in Lam et al., Biochim Biophys Acta. 1866:106-20 (2016). These include VIM, SEPT9, ITGA4, OSM4, GATA4 and NDRG4. An exemplary set of hypermethylation variable target regions comprising the genes or portions thereof based on the colorectal cancer (CRC) studies is provided in Table 10. Many of these genes likely have relevance to cancers beyond colorectal cancer; for example, TP53 is widely recognized as a critically important tumor suppressor and hypermethylation-based inactivation of this gene may be a common oncogenic mechanism.
Table 10. Exemplary hypermethylation target regions (genes or portions thereof) based on CRC studies.
[0175] In some embodiments, the hypermethylation variable target regions comprise a plurality of genes or portions thereof listed in Table 10, e.g., at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 100% of the genes or portions thereof listed in Table 10. For example, for each locus included as a target region, there may be one or more probes with a hybridization site that binds between the transcription start site and the stop codon (the last stop codon for genes that are alternatively spliced) of the gene. In some embodiments, the one or more probes bind within 300 bp upstream and/or downstream of the genes or portions thereof listed in Table 10, e.g., within 200 or 100 bp.
[0176] Methylation variable target regions in various types of lung cancer are discussed in detail, e.g., in Ooki et al., Clin. Cancer Res. 23:7141-52 (2017); Belinsky, Annu. Rev. Physiol. 77:453-74 (2015); Hulbert et al., Clin. Cancer Res. 23:1998-2005 (2017); Shi et al., BMC Genomics 18:901 (2017); Schneider et al., BMC Cancer. 11:102 (2011); Lissa et al., Transl Lung Cancer Res 5(5):492-504 (2016); Skvortsova et al., Br. J. Cancer. 94(10):1492-1495 (2006); Kim et al., Cancer Res. 61:3419-3424 (2001); Furonaka et al., Pathology International 55:303-309 (2005); Gomes et al., Rev. Port. Pneumol. 20:20-30 (2014); Kim et al., Oncogene. 20:1765-70 (2001); Hopkins-Donaldson et al., Cell Death Differ. 10:356-64 (2003); Kikuchi et al., Clin. Cancer Res. 11:2954-61 (2005); Heller et al., Oncogene 25:959-968 (2006); Licchesi et al., Carcinogenesis. 29:895-904 (2008); Guo et al., Clin. Cancer Res. 10:7917-24 (2004); Palmisano et al., Cancer Res. 63:4620-4625 (2003); and Toyooka et al., Cancer Res. 61:4556-4560 (2001).
[0177] An exemplary set of hypermethylation variable target regions comprising genes or portions thereof based on the lung cancer studies is provided in Table 11. Many of these genes likely have relevance to cancers beyond lung cancer; for example, Casp8 (Caspase 8) is a key enzyme in programmed cell death and hypermethylation-based inactivation of this gene may be a common oncogenic mechanism not limited to lung cancer. Additionally, a number of genes appear in both Tables 10 and 11, indicating generality.
Table 11. Exemplary hypermethylation target regions (genes or portions thereof) based on lung cancer studies
[0178] Any of the foregoing embodiments concerning target regions identified in Table 11 may be combined with any of the embodiments described above concerning target regions identified in Table 10. In some embodiments, the hypermethylation variable target regions comprise a plurality of genes or portions thereof listed in Table 10 or Table 11, e.g., at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 100% of the genes or portions thereof listed in Table 10 or Table 11.
[0179] Additional hypermethylation target regions may be obtained, e.g., from the Cancer Genome Atlas. Kang et al., Genome Biology 18:53 (2017), describe construction of a probabilistic method called Cancer Locator using hypermethylation target regions from breast, colon, kidney, liver, and lung. In some embodiments, the hypermethylation target regions can be specific to one or more types of cancer. Accordingly, in some embodiments, the hypermethylation target regions include one, two, three, four, or five subsets of hypermethylation target regions that collectively show hypermethylation in one, two, three, four, or five of breast, colon, kidney, liver, and lung cancers.
[0180] Hypomethylation variable target regions
[0181] Global hypomethylation is a commonly observed phenomenon in various cancers. See, e.g., Hon et al., Genome Res. 22:246-258 (2012) (breast cancer); Ehrlich, Epigenomics 1:239-259 (2009) (review article noting observations of hypomethylation in colon, ovarian, prostate, leukemia, hepatocellular, and cervical cancers). For example, regions such as repeated elements, e.g., LINE1 elements, Alu elements, centromeric tandem repeats, pericentromeric tandem repeats, and satellite DNA, and intergenic regions that are ordinarily methylated in healthy cells may show reduced methylation in tumor cells. Accordingly, in some embodiments, the epigenetic target region set includes hypomethylation variable target regions, where a decrease in the level of observed methylation indicates an increased likelihood that a sample (e.g., of cfDNA) contains DNA produced by neoplastic cells, such as tumor or cancer cells.
[0182] In some embodiments, hypomethylation variable target regions include repeated elements and/or intergenic regions. In some embodiments, repeated elements include one, two, three, four, or five of LINE1 elements, Alu elements, centromeric tandem repeats, pericentromeric tandem repeats, and/or satellite DNA.
[0183] Exemplary specific genomic regions that show cancer-associated hypomethylation include nucleotides 8403565-8953708 and 151104701-151106035 of human chromosome 1, e.g., according to the hg19 or hg38 human genome construct. In some embodiments, the hypomethylation variable target regions overlap or comprise one or both of these regions.
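By way of non-limiting illustration, the following sketch shows one way to test whether a candidate target region overlaps either of the chromosome 1 regions recited above. The `Region` type, the function names, and the use of 1-based inclusive coordinates are assumptions made for the example only and are not part of the disclosed methods.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    """A genomic interval; coordinates are assumed 1-based and inclusive."""
    chrom: str
    start: int
    end: int


# The two chromosome 1 intervals recited in the text.
HYPOMETHYLATED_REGIONS = [
    Region("chr1", 8403565, 8953708),
    Region("chr1", 151104701, 151106035),
]


def overlaps(a: Region, b: Region) -> bool:
    """Return True if the two intervals share at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def overlapping_hypo_regions(target: Region) -> List[Region]:
    """List the recited hypomethylated regions that the target overlaps."""
    return [r for r in HYPOMETHYLATED_REGIONS if overlaps(target, r)]
```

A target that merely touches the last base of a recited interval is treated as overlapping under this inclusive convention; a half-open convention would require adjusting the comparison.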
[0184] CTCF binding regions
[0185] CTCF is a DNA-binding protein that contributes to chromatin organization and often colocalizes with cohesin. Perturbation of CTCF binding sites has been reported in a variety of different cancers. See, e.g., Katainen et al., Nature Genetics, doi:10.1038/ng.3335, published online 8 June 2015; Guo et al., Nat. Commun. 9:1520 (2018). CTCF binding results in recognizable patterns in cfDNA that can be detected by sequencing, e.g., through fragment length analysis. For example, details regarding sequencing-based fragment length analysis are provided in Snyder et al., Cell 164:57-68 (2016); WO 2018/009723; and US20170211143A1, each of which is incorporated herein by reference.
[0186] Thus, perturbations of CTCF binding result in variation in the fragmentation patterns of cfDNA. As such, CTCF binding sites represent a type of fragmentation variable target regions.
[0187] There are many known CTCF binding sites. See, e.g., the CTCFBSDB (CTCF Binding Site Database), available on the Internet at insulatordb.uthsc.edu/; Cuddapah et al., Genome Res. 19:24-32 (2009); Martin et al., Nat. Struct. Mol. Biol. 18:708-14 (2011); Rhee et al., Cell. 147:1408-19 (2011), each of which is incorporated by reference. Exemplary CTCF binding sites are at nucleotides 56014955-56016161 on chromosome 8 and nucleotides 95359169-95360473 on chromosome 13, e.g., according to the hg19 or hg38 human genome construct.
[0188] Accordingly, in some embodiments, the epigenetic target region set includes CTCF binding regions. In some embodiments, the CTCF binding regions comprise at least 10, 20, 50, 100, 200, or 500 CTCF binding regions, or 10-20, 20-50, 50-100, 100-200, 200-500, or 500-1000 CTCF binding regions, e.g., such as CTCF binding regions described above or in one or more of CTCFBSDB or the Cuddapah et al., Martin et al., or Rhee et al. articles cited above.
[0189] In some embodiments, at least some of the CTCF sites can be methylated or unmethylated, wherein the methylation state is correlated with whether or not the cell is a cancer cell. In some embodiments, the epigenetic target region set comprises at least 100 bp, at least 200 bp, at least 300 bp, at least 400 bp, at least 500 bp, at least 750 bp, or at least 1000 bp upstream and/or downstream regions of the CTCF binding sites.
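As a non-limiting illustration of extending a binding site by upstream and downstream flanking sequence, the sketch below pads an interval by a chosen number of base pairs on each side. The function name is hypothetical, symmetric padding is an assumption (the text also permits upstream-only or downstream-only flanks), and coordinates are assumed 1-based and inclusive.

```python
from typing import Tuple


def flanked_interval(start: int, end: int, flank_bp: int) -> Tuple[int, int]:
    """Extend [start, end] by flank_bp on both sides, clamping at position 1."""
    if flank_bp < 0:
        raise ValueError("flank_bp must be non-negative")
    if start > end:
        raise ValueError("start must not exceed end")
    return max(1, start - flank_bp), end + flank_bp


# Example: the exemplary chromosome 8 CTCF site with 500 bp flanks.
chr8_site_with_flanks = flanked_interval(56014955, 56016161, 500)
```

The upper bound is not clamped to the chromosome length here; a fuller implementation would consult the length of the relevant chromosome in the chosen genome build.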
[0190] Transcription start sites
[0191] Transcription start sites may also show perturbations in neoplastic cells. For example, nucleosome organization at various transcription start sites in healthy cells of the hematopoietic lineage — which contributes substantially to cfDNA in healthy individuals — may differ from nucleosome organization at those transcription start sites in neoplastic cells. This results in different cfDNA patterns that can be detected by sequencing, for example, as discussed generally in Snyder et al., Cell 164:57-68 (2016); WO 2018/009723; and US20170211143A1.
[0192] Thus, perturbations of transcription start sites also result in variation in the fragmentation patterns of cfDNA. As such, transcription start sites also represent a type of fragmentation variable target regions.
[0193] Human transcriptional start sites are available from DBTSS (DataBase of Human Transcription Start Sites), available on the Internet at dbtss.hgc.jp and described in Yamashita et al., Nucleic Acids Res. 34(Database issue): D86-D89 (2006), which is incorporated herein by reference.
[0194] Accordingly, in some embodiments, the epigenetic target region set includes transcriptional start sites. In some embodiments, the transcriptional start sites comprise at least 10, 20, 50, 100, 200, or 500 transcriptional start sites, or 10-20, 20-50, 50-100, 100-200, 200-500, or 500-1000 transcriptional start sites, e.g., such as transcriptional start sites listed in DBTSS. In some embodiments, at least some of the transcription start sites can be methylated or unmethylated, wherein the methylation state is correlated with whether or not the cell is a cancer cell. In some embodiments, the epigenetic target region set comprises at least 100 bp, at least 200 bp, at least 300 bp, at least 400 bp, at least 500 bp, at least 750 bp, or at least 1000 bp upstream and/or downstream regions of the transcription start sites.
[0195] Methylation control regions
[0196] It can be useful to include control regions to facilitate data validation. In some embodiments, the epigenetic target region set includes control regions that are expected to be methylated or unmethylated in essentially all samples, regardless of whether the DNA is derived from a cancer cell or a normal cell. In some embodiments, the epigenetic target region set includes control hypomethylated regions that are expected to be hypomethylated in essentially all samples. In some embodiments, the epigenetic target region set includes control hypermethylated regions that are expected to be hypermethylated in essentially all samples.
[0197] Copy number variations; focal amplifications
[0198] Although copy number variations such as focal amplifications are somatic mutations, they can be detected by sequencing based on read frequency in a manner analogous to approaches for detecting certain epigenetic changes such as changes in methylation. As such, regions that may show copy number variations such as focal amplifications in cancer can be included in the epigenetic target region set and may comprise one or more of AR, BRAF, CCND1, CCND2, CCNE1, CDK4, CDK6, EGFR, ERBB2, FGFR1, FGFR2, KIT, KRAS, MET, MYC, PDGFRA, PIK3CA, and RAF1. For example, in some embodiments, the epigenetic target region set comprises at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, or 18 of the foregoing targets.
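By way of non-limiting illustration, detection based on read frequency can be sketched as a comparison of per-gene read counts in a sample against counts from a reference, followed by thresholding. The function names, the median normalization, the pseudocount, the threshold of 1.0 (roughly a two-fold gain), and the read counts in the usage example are all assumptions chosen for the example, not parameters taken from this disclosure.

```python
import math
from statistics import median
from typing import Dict, List


def log2_copy_ratios(sample_counts: Dict[str, int],
                     reference_counts: Dict[str, int],
                     pseudocount: float = 0.5) -> Dict[str, float]:
    """Per-gene log2(sample/reference) read ratio, centered on the median gene."""
    raw = {
        gene: (sample_counts[gene] + pseudocount)
        / (reference_counts[gene] + pseudocount)
        for gene in sample_counts
    }
    center = median(raw.values())  # assumes most genes are copy-neutral
    return {gene: math.log2(ratio / center) for gene, ratio in raw.items()}


def call_amplifications(ratios: Dict[str, float],
                        threshold: float = 1.0) -> List[str]:
    """Return genes whose centered log2 ratio meets the (hypothetical) threshold."""
    return sorted(gene for gene, value in ratios.items() if value >= threshold)


# Usage with made-up read counts for three of the listed genes.
sample = {"ERBB2": 4000, "EGFR": 1000, "MYC": 1000}
reference = {"ERBB2": 1000, "EGFR": 1000, "MYC": 1000}
ratios = log2_copy_ratios(sample, reference)
amplified = call_amplifications(ratios)
```

Centering on the median rather than on total read count keeps a single strongly amplified gene from depressing the apparent ratio of every other gene.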
[0199] Sequencing
[0200] Sample nucleic acids can be sequenced by a variety of sequencing methods. Sequencing methods include, for example, Sanger sequencing, high-throughput sequencing, pyrosequencing, sequencing-by-synthesis, single-molecule sequencing, nanopore sequencing, semiconductor sequencing, sequencing-by-ligation, sequencing-by-hybridization, Digital Gene Expression (Helicos), Next generation sequencing (NGS), Single Molecule Sequencing by Synthesis (SMSS) (Helicos), massively-parallel sequencing, Clonal Single Molecule Array (Solexa), shotgun sequencing, Ion Torrent, Oxford Nanopore, Roche Genia, Maxam-Gilbert sequencing, primer walking, and sequencing using PacBio, SOLiD, Ion Torrent, or Nanopore platforms. Sequencing reactions can be performed in a variety of sample processing units, which may include multiple lanes, multiple channels, multiple wells, or other means of processing multiple sample sets substantially simultaneously. The sample processing unit can also include multiple sample chambers to enable processing of multiple runs simultaneously.
[0201] Simultaneous sequencing reactions may be performed using multiplex sequencing. In some cases, cell-free nucleic acids may be sequenced with at least, for example, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 50000, or 100,000 sequencing reactions. In other cases, cell-free nucleic acids may be sequenced with less than, for example, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 50000, or 100,000 sequencing reactions. Sequencing reactions may be performed sequentially or simultaneously. Subsequent data analysis may be performed on all or part of the sequencing reactions. In some cases, data analysis may be performed on at least, for example, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 50000, or 100,000 sequencing reactions. In other cases, data analysis may be performed on less than, for example, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 50000, or 100,000 sequencing reactions. An exemplary read depth is 1,000-50,000 or 1,000-10,000 or 1,000-20,000 reads per locus (base).
[0202] Epigenetic analysis using conversion-based procedures
[0203] The biomarker data used in the methods of the present invention may comprise epigenetic data, such as nucleic acid methylation. Various sequencing methods which can be used to determine the methylation status are known in the art. Single-nucleotide resolving assays to detect nucleoside methylation often require a conversion of the modified nucleosides or corresponding unmodified nucleosides to change their base pairing specificity. The conversion is then detected by sequencing. Examples of such methods include bisulfite, oxidative bisulfite, and Tet-assisted bisulfite conversion, EM-seq, TAPS and TAPSβ conversion, and ACE-seq. See, e.g., Moss et al., Nat Commun. 2018; 9:5068; Booth et al., Science 2012; 336:934-937; Yu et al., Cell 2012; 149:1368-80; Liu et al., Nature Biotechnology 2019; 37:424-429; Schutsky et al., Nature Biotechnology 2018; 36:1083-1090; and Vaisvila et al., Genome Research 2021; 31(7):1280-1289.
[0204] When using conversion-based methods for methylation analysis, the methods can comprise the use of adapters comprising quality control nucleosides. The use of quality control nucleosides in the adapters can advantageously be used with enzymatic conversion procedures which convert the base pairing specificity of modified nucleosides (e.g., DM-seq conversion comprising adding a protective group (such as a carboxymethyl group) to unmodified cytosines, and deaminating 5mC, such as using an APOBEC enzyme) or enzymatic conversion procedures which convert the base pairing specificity of unmodified nucleosides. For example, in some embodiments, when a molecule comprising adapters containing two or more quality control nucleosides is exposed to a conversion procedure selected to change the base pairing specificity of quality control nucleosides, the base-pairing specificity of a first portion (e.g., at least one) of the quality control nucleosides is changed but the base-pairing specificity of a second portion (e.g., at least one) of the quality control nucleosides in the adapter is unaffected, which can indicate suboptimal conversion. The use of quality control nucleosides in the adapters can advantageously be used to predict/infer/indicate false negative detection and/or identification of modified nucleosides in the DNA sample (i.e., incorrectly identifying a base as being unmodified) and/or false positive detection and/or identification of modified nucleosides in the DNA sample (i.e., incorrectly identifying a base as being modified). Quality control nucleosides as described herein for use to detect the occurrence of false positive detection of modified nucleosides may be referred to as “false positive quality control nucleosides”. Quality control nucleosides as described herein for use to detect the occurrence of false negative detection of modified nucleosides may be referred to as “false negative quality control nucleosides”. A nucleoside having a modification status that means that its base pairing specificity is not changed when exposed to a particular conversion procedure may in some cases be referred to as a “protected” nucleoside or as having a “protected modification status” or similar.
[0205] In the case of detecting false negatives using conversion procedures which convert the base pairing specificity of modified nucleosides, the quality control nucleosides in the adapters may comprise modified nucleosides such that the conversion efficiency of the conversion procedure/sub-optimal conversion can be measured, and thus the frequency of false negatives predicted. Sub-optimal conversion refers to conversion of fewer than all nucleosides of the type that the reagent used in a conversion procedure normally converts; for example, a sub-optimal conversion by a deaminase as in DM-seq results in conversion of some but not all 5mCs to thymine. Sub-optimal conversion may also be referred to as incomplete conversion in the sense that some nucleosides (modified or unmodified) that should have been converted by the conversion procedure in a complete reaction were not actually converted.
[0206] In the case of detecting false positives using conversion procedures which convert the base pairing specificity of unmodified nucleosides, the quality control nucleosides in the adapters may comprise modified nucleosides such that the erroneous conversion frequency of modified nucleosides can be measured, and thus the frequency of false positives predicted. Erroneous conversion refers to conversion of a nucleoside other than the nucleosides that are typically converted by a conversion procedure. Conversion of a methylated cytosine by a conversion method that typically converts only unmodified cytosines is an example of erroneous conversion.
[0207] In the case of detecting false positives using conversion procedures which convert the base pairing specificity of modified nucleosides, the quality control nucleosides in the adapters may comprise unmodified nucleosides such that the erroneous conversion frequency of unmodified nucleosides can be measured, and thus the frequency of false positives predicted.
[0208] In the case of detecting false positives using conversion procedures which convert the base pairing specificity of unmodified nucleosides, the quality control nucleosides in the adapters may comprise unmodified nucleosides such that the conversion efficiency of the conversion procedure/sub-optimal conversion can be measured, and thus the frequency of false positives predicted.
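As a non-limiting illustration of how adapter quality control nucleosides might be summarized numerically, the sketch below considers only the case of a conversion procedure that converts modified nucleosides (as in DM-seq). Modified quality control positions give the conversion efficiency, and unmodified quality control positions give the erroneous conversion frequency. The class, field, and function names are hypothetical, and treating these frequencies directly as predicted false negative and false positive rates is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class QcCounts:
    """Observed outcomes at adapter quality control positions (hypothetical layout)."""
    modified_converted: int    # modified QC nucleosides read as converted
    modified_total: int        # all modified QC nucleosides observed
    unmodified_converted: int  # unmodified QC nucleosides read as converted
    unmodified_total: int      # all unmodified QC nucleosides observed


def qc_rates(counts: QcCounts) -> Dict[str, float]:
    """Summarize conversion quality for a procedure that converts modified nucleosides."""
    if counts.modified_total <= 0 or counts.unmodified_total <= 0:
        raise ValueError("QC totals must be positive")
    efficiency = counts.modified_converted / counts.modified_total
    erroneous = counts.unmodified_converted / counts.unmodified_total
    return {
        "conversion_efficiency": efficiency,
        # Unconverted modified bases would be read as unmodified.
        "predicted_false_negative_rate": 1.0 - efficiency,
        # Converted unmodified bases would be read as modified.
        "predicted_false_positive_rate": erroneous,
    }


# Usage with made-up counts.
rates = qc_rates(QcCounts(980, 1000, 5, 1000))
```

For a procedure that instead converts unmodified nucleosides, the roles of the two quality control classes would be exchanged.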
[0209] There are various methods of detecting and/or identifying modified nucleosides that rely on a conversion procedure that changes the base-pairing specificity of a nucleoside, based on the modification status of the nucleosides. These changes of base-pairing specificity can then be detected, and thus the modification status of the nucleoside inferred, by sequencing.
[0210] In some cases, the conversion procedure used in the methods of the disclosure is one that changes the base pairing specificity of a modified nucleoside (e.g., methylated cytosine), but does not change the base pairing specificity of the corresponding unmodified nucleoside (e.g., cytosine) or does not change the base pairing specificity of any unmodified nucleoside (e.g., cytosine, adenosine, guanosine and thymidine (or uracil)). Advantages of methods that do not convert the base-pairing specificity of unmodified nucleosides include reduced loss of sequence complexity, higher sequencing efficiency and reduced alignment losses. Additionally, methods such as DM-seq may in some cases be preferred over methods such as bisulfite sequencing and EM-seq because they are less destructive (especially important for low yield samples such as cfDNA) and do not require denaturation, meaning that non-conversion errors are theoretically more likely to be random. In methods that require denaturation for conversion, failure to denature a DNA molecule will result in non-conversion of all bases in the DNA molecule. As biological changes in methylation are predominantly concerted within a localized region of interest, this non-random (localized) non-conversion can appear as false negatives (non-methylated regions). Random non-conversion can maximally affect a low percentage of bases within a region, and thus the specificity of methylation change detection can be maximized (reducing false positives) by placing a threshold on the percentage of bases within a region that are methylated/non-methylated. Hence, in some cases, a conversion procedure that does not involve denaturation is preferred.
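The thresholding described above can be illustrated, in a non-limiting way, by calling a region methylated only when a minimum fraction of its assessed cytosines are read as methylated. The function name and the 0.8 threshold are assumptions for the example; no particular threshold is specified in this disclosure.

```python
from typing import Sequence


def region_is_methylated(calls: Sequence[bool],
                         min_fraction: float = 0.8) -> bool:
    """Call a region methylated if enough of its assessed positions are methylated.

    calls: one boolean per assessed cytosine in the region (True = read as methylated).
    min_fraction: hypothetical threshold on the methylated fraction.
    """
    if not calls:
        raise ValueError("region has no assessed positions")
    if not 0.0 <= min_fraction <= 1.0:
        raise ValueError("min_fraction must be between 0 and 1")
    methylated_fraction = sum(calls) / len(calls)
    return methylated_fraction >= min_fraction
```

Under this scheme a single randomly unconverted position in a ten-position region (nine of ten read as methylated) does not change the region-level call, whereas molecule-wide non-conversion would.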
[0211] In some embodiments, an adapter comprises a first quality control nucleoside with a first modification status (e.g., modified, such as methylated) and a second quality control nucleoside with a second modification status (e.g., unmodified). Such adapters can be used to detect both suboptimal conversion and erroneous conversion.
[0212] In some cases, the conversion procedure used in the methods of the disclosure is one that changes the base pairing specificity of an unmodified nucleoside (e.g., cytosine), but does not change the base pairing specificity of the corresponding modified nucleoside (e.g., methylated cytosine). The skilled person can select a suitable method according to their needs, including which nucleoside modifications are to be detected and/or identified.
[0213] In some embodiments, the conversion procedure converts modified nucleosides. In some embodiments, the conversion procedure which converts modified nucleosides comprises enzymatic conversion, such as DM-seq, for example, as described in WO2023/288222A1. In DM-seq, unmodified cytosines in the DNA are enzymatically protected from a subsequent deamination step wherein 5mC in 5mCpG is converted to T. The enzymatically protected unmodified (e.g., unmethylated) cytosines are not converted and are read as “C” during sequencing. Cytosines that are read as thymines (in a CpG context) are identified as methylated cytosines in the DNA.
[0214] Thus, when this type of conversion is used, the first nucleobase comprises unmodified (such as unmethylated) cytosine, and the second nucleobase comprises modified (such as methylated) cytosine. Sequencing of the converted DNA identifies positions that are read as cytosine as being unmodified C positions. Meanwhile, positions that are read as T are identified as being T or 5mC. Performing DM-seq conversion thus facilitates identifying positions containing 5mC using the sequence reads obtained. Hence, in these embodiments, the quality control nucleosides in the adapters used in the method comprise unmodified (unmethylated) cytosines.
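By way of non-limiting illustration, the read interpretation described above can be sketched as a small classification function. Comparing each read base with a reference base in order to separate a genuine T from a converted 5mC is an assumption added for the example, as are the function name and the returned labels.

```python
def classify_dmseq_position(ref_base: str, read_base: str,
                            in_cpg_context: bool) -> str:
    """Interpret one sequenced position after DM-seq style conversion.

    ref_base: base in an assumed reference sequence at this position.
    read_base: base observed in the sequence read.
    in_cpg_context: whether the reference position is the C of a CpG.
    """
    ref_base, read_base = ref_base.upper(), read_base.upper()
    if read_base == "C":
        # Protected, unmodified cytosines are not deaminated and read as C.
        return "unmodified C"
    if read_base == "T":
        if ref_base == "C" and in_cpg_context:
            # A CpG cytosine read as T is identified as methylated.
            return "5mC"
        if ref_base == "T":
            return "T"
        # Reference C outside a CpG read as T: not resolved by this sketch.
        return "ambiguous"
    return "other"
```

Without a reference, a position read as T can only be reported as "T or 5mC", as stated in the text; a C-to-T sequence variant at a CpG would also be indistinguishable from 5mC in this simplified form.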
[0215] Exemplary cytosine deaminases for use herein include APOBEC enzymes, for example, APOBEC3A. Generally, AID/APOBEC family DNA deaminase enzymes such as APOBEC3A (A3A) are used to deaminate (unprotected) unmodified cytosine and 5mC. For an exemplary description of APOBEC conversion, see, e.g., Schutsky et al., Nature Biotechnology 2018; 36:1083-1090.
[0216] The enzymatic protection of unmodified cytosines in the DNA comprises addition of a protective group to the unmodified cytosines. Such protective groups can comprise an alkyl group, an alkyne group, a carboxyl group, a carboxyalkyl group, an amino group, a hydroxymethyl group, a glucosyl group, a glucosylhydroxymethyl group, an isopropyl group, or a dye. For example, DNA can be treated with a methyltransferase, such as a CpG-specific methyltransferase, which adds the protective group to unmodified cytosines. The term methyltransferase is used broadly herein to refer to enzymes capable of transferring a methyl or substituted methyl (e.g., carboxymethyl) to a substrate (e.g., a cytosine in a nucleic acid). In some embodiments, the DNA is contacted with a CpG-specific DNA methyltransferase (MTase), such as a CpG-specific carboxymethyltransferase (CxMTase), and a substituted methyl donor, such as a carboxymethyl donor (e.g., carboxymethyl-S-adenosyl-L-methionine). See, e.g., WO2021/236778A2. In particular embodiments, the CxMTase can facilitate the addition of a protective carboxymethyl group to an unmethylated cytosine. In some embodiments, the unmethylated cytosine is unmodified cytosine. The carboxymethyl group can prevent deamination of the cytosine during a deamination step (such as a deamination step using an APOBEC enzyme, such as A3A). Substituted methyl or carboxymethyl donors useful in the disclosed methods include, but are not limited to, S-adenosyl-L-methionine (SAM) analogs, optionally wherein the SAM analog is carboxy-S-adenosyl-L-methionine (CxSAM). SAM analogs are described, for example, in WO2022/197593A1. The MTase may be, for example, a CpG methyltransferase from Spiroplasma sp. strain MQ1 (M.SssI), DNA-methyltransferase 1 (DNMT1), DNA-methyltransferase 3 alpha (DNMT3A), DNA-methyltransferase 3 beta (DNMT3B), or DNA adenine methyltransferase (Dam). The CxMTase may be a CpG methyltransferase from Mycoplasma penetrans (M.MpeI). In a particular embodiment, the methyltransferase enzyme is a variant of M.MpeI, wherein the amino acid corresponding to position 374 is R or K, or a sequence at least 90%, at least 92%, at least 94%, at least 96%, at least 97%, at least 98%, or at least 99% identical thereto, optionally wherein the amino acid corresponding to position 374 is R or K.
[0217] In one embodiment, the methyltransferase enzyme is a variant of M.MpeI having an N374R substitution or an N374K substitution. The methyltransferase having an N374R substitution or an N374K substitution can further comprise one or more amino acid substitutions selected from a) substitution of one or both residues T300 and E305 with S, A, G, Q, D, or N; b) substitution of one or more residues A323, N306, and Y299 with a positively charged amino acid selected from K, R or H; and/or c) substitution of S323 with A, G, K, R or H, which may enhance the activity of the enzyme.
[0218] Optionally, the conversion procedure further includes enzymatic protection of 5hmCs, such as by glucosylation of the 5hmCs (e.g., using βGT) or by carbamoylation of the 5hmCs (e.g., using 5-hydroxymethylcytosine carbamoyltransferase), in the DNA prior to the deamination of unprotected modified cytosines. In this method, 5hmC can be protected from conversion, for example through glucosylation using β-glucosyltransferase (βGT), forming 5-glucosylhydroxymethylcytosine (5ghmC), or through carbamoylation using 5-hydroxymethylcytosine carbamoyltransferase, forming 5cmC. Examples thereof are described, for example, in Yu et al., Cell 2012; 149:1368-80, and in Yang et al., Bioprotocol, 2023; 12(17): e4496. Glucosylation or carbamoylation of 5hmC can reduce or eliminate deamination of 5hmC by a deaminase such as APOBEC3A. Treatment with an MTase or CxMTase then adds a protecting group to unmodified (unmethylated) cytosines in the DNA. 5mC (but not protected, unmodified cytosine and not 5ghmC or 5cmC) is then deaminated (converted to T in the case of 5mC) by treatment with a deaminase, for example, an APOBEC enzyme (such as APOBEC3A). Sequencing of the converted DNA identifies positions that are read as cytosine as being either 5hmC or unmodified C positions. Meanwhile, positions that are read as T are identified as being T or 5mC. Performing DM-seq conversion with glucosylation of 5hmC on a sample as described herein thus facilitates distinguishing positions containing unmodified C or 5hmC on the one hand from positions containing 5mC using the sequence reads obtained. Hence, in these embodiments, the quality control nucleosides in the adapters used in the method may comprise both unmodified cytosine and 5hmC. This allows the efficiency of each of the two steps to be determined separately. For example, if sequencing of the adapter indicates that both the 5mC and the 5hmC nucleoside(s) have converted base-pairing specificity, this indicates that the 5hmC-protecting step was ineffective. If the 5mC nucleoside(s) do not have converted base-pairing specificity, this indicates that (at least) the DM-seq process was ineffective. If the 5mC nucleosides, but not the 5hmC nucleosides, in the adapter have converted base-pairing specificity, then both steps were effective.
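The three outcomes just described can be expressed, purely as a non-limiting illustration, as a small decision function over the observed state of the 5mC and 5hmC quality control nucleosides in an adapter. The function name and the returned strings are hypothetical.

```python
def interpret_two_step_qc(mc_converted: bool, hmc_converted: bool) -> str:
    """Interpret adapter QC results for 5hmC protection followed by deamination.

    mc_converted: the 5mC QC nucleoside(s) show converted base-pairing specificity.
    hmc_converted: the 5hmC QC nucleoside(s) show converted base-pairing specificity.
    """
    if not mc_converted:
        # 5mC should have been deaminated; the protection step cannot be
        # assessed independently in this case.
        return "deamination (DM-seq) step ineffective"
    if hmc_converted:
        # 5mC converted as expected, but 5hmC was not protected.
        return "5hmC-protection step ineffective"
    return "both steps effective"
```

When the 5mC nucleosides are unconverted, the state of the 5hmC nucleosides is uninformative about the protection step, which is why the text qualifies that case with "(at least)".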
[0219] In addition to controlling for sub-optimal conversion of modified nucleosides, quality control nucleosides in the adapters can also be used to predict false positives (i.e., nucleosides erroneously classified as being modified). In this case, the quality control nucleosides in the adapters comprise, for appropriate conversion procedures, unmodified C. If sequencing of the adapter indicates that quality control nucleoside(s) have converted base-pairing specificity, this indicates that the unmodified base (e.g., unmodified C) has been erroneously converted. This information can then be used to predict false positive detection of modified nucleosides (e.g., modified C) in the DNA sample.
[0220] In particular embodiments, methods of the present disclosure have utility in providing a quality control method for the identification of methylated cytosines which are present in any sequence context (i.e., CpG and CpH cytosines). Methylated CpH or non-CpG cytosines are infrequent and thus require high levels of sensitivity to reliably detect. Additionally, methylated CpGs that co-locate with methylated non-CpGs cannot be detected by methods that use methylation status of non-CpG cytosines as an indicator of sub-optimal molecular conversion. The methods of the present disclosure achieve this by providing quality control nucleosides which are known to have a particular modification status, and thus provide a reliable measure of the frequency of erroneous conversion and/or sub-optimal conversion.
[0221] In some embodiments, methods of the present disclosure comprise analysis of sequence variations and/or fragmentation patterns, and do not exclude adapted DNA with sub-optimal or erroneous conversion of quality control nucleosides from analysis of sequence variations and/or fragmentation patterns. For example, the methods can comprise detecting the presence or absence of sequence variations and/or determining fragmentation patterns, wherein adapted DNA comprising quality control nucleosides indicative of sub-optimal or erroneous conversion of quality control nucleosides is included in detecting the presence or absence of sequence variations and/or determining fragmentation patterns. In this way, the present methods can reduce the likelihood of false negatives and/or false positives in detecting modified nucleosides (e.g., 5mC) by excluding adapted DNA unsuitable for that purpose due to sub-optimal or erroneous conversion, while retaining such adapted DNA for analyses of sequence variations and/or fragmentation patterns (which are not impacted by suboptimal or erroneous conversion) and therefore avoiding impacting sensitivity.
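As a non-limiting illustration of this selective exclusion, the sketch below routes each adapted molecule to the analyses for which it remains suitable. The `Molecule` type, its fields, and the function name are hypothetical; how the quality control status of a molecule is determined is outside the scope of the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Molecule:
    """An adapted DNA molecule with its adapter QC outcome (hypothetical layout)."""
    molecule_id: str
    qc_passed: bool  # False if adapter QC indicates sub-optimal or erroneous conversion


def route_molecules(molecules: Iterable[Molecule]) -> Tuple[List[str], List[str]]:
    """Return (ids for methylation analysis, ids for variant/fragmentation analysis)."""
    molecules = list(molecules)
    # Only molecules with acceptable conversion inform modified-nucleoside calls.
    methylation_ids = [m.molecule_id for m in molecules if m.qc_passed]
    # Every molecule is retained for sequence variation and fragmentation analysis.
    variant_fragment_ids = [m.molecule_id for m in molecules]
    return methylation_ids, variant_fragment_ids


# Usage with made-up molecules.
meth_ids, var_ids = route_molecules([
    Molecule("m1", True),
    Molecule("m2", False),
    Molecule("m3", True),
])
```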
[0222] In some embodiments, use of the disclosed quality control methods comprises: (a) ligating the DNA to oligonucleotide adapters, wherein the adapters comprise quality control nucleosides, wherein the quality control nucleosides have the same nucleoside identity and the same or a different modification status to modified nucleosides to be detected in the DNA, and wherein the modification status of the quality control nucleosides is known; (b) subjecting the adapted DNA, or a subsample thereof, to a conversion procedure that changes the base pairing specificity of the quality control nucleosides or does not change the base pairing specificity of the quality control nucleosides, depending on the modification status of the nucleosides, wherein the conversion procedure comprises deamination of unmodified cytosines, and wherein the conversion procedure is selected to (i) change the base pairing specificity of adapted DNA nucleosides having the same nucleoside identity and modification status as quality control nucleosides in the adapters, and not change the base pairing specificity of adapted DNA nucleosides having the same nucleoside identity as quality control nucleosides in the adapters but a different modification status; and/or (ii) not change the base pairing specificity of adapted DNA nucleosides having the same nucleoside identity and modification status as quality control nucleosides in the adapters, and change the base pairing specificity of adapted DNA nucleosides having the same nucleoside identity as quality control nucleosides in the adapters but a different modification status; (c) sequencing the adapted DNA after conversion step (b); (d) using the sequence data obtained in step (c) to determine base pairing specificity conversion of the quality control nucleosides in the adapters; and (e) using the base pairing specificity conversion of the quality control nucleosides in the adapters as a quality control measure for conversion step (b), wherein sub-optimal conversion of adapter quality control nucleosides following a conversion procedure of step (b)(i) and/or erroneous conversion of adapter quality control nucleosides following a conversion procedure of step (b)(ii) predicts false negative and/or false positive detection of modified nucleosides in the DNA sample.
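As a non-limiting illustration of steps (d) and (e), the following Python sketch computes the observed conversion rate at adapter quality control positions and flags a sample accordingly. It assumes a deamination-type procedure in which unmodified cytosines should be read as "T" and modified cytosines should remain "C". The function names, the input representation (a list of bases observed at quality control positions), and the 0.99 and 0.01 thresholds are assumptions made for illustration and are not taken from the disclosure.

```python
def conversion_rate(observed_bases, unconverted="C", converted="T"):
    """Fraction of informative quality control positions read as the converted base."""
    informative = [b for b in observed_bases if b in (unconverted, converted)]
    if not informative:
        raise ValueError("no informative quality control bases")
    return sum(b == converted for b in informative) / len(informative)


def qc_flags(unmod_qc_bases, mod_qc_bases, min_unmod_rate=0.99, max_mod_rate=0.01):
    """Flag sub-optimal or erroneous conversion (thresholds are hypothetical)."""
    unmod_rate = conversion_rate(unmod_qc_bases)  # should be close to 1
    mod_rate = conversion_rate(mod_qc_bases)      # should be close to 0
    return {
        "suboptimal_conversion": unmod_rate < min_unmod_rate,
        "erroneous_conversion": mod_rate > max_mod_rate,
    }
```

In such a sketch, a flagged sample could be excluded from modified-nucleoside detection while still being retained for analysis of sequence variations and/or fragmentation patterns.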
[0223] In some embodiments of the disclosed methods, the quality control conversion procedure is selected to change the base pairing specificity of unmodified quality control nucleosides in the adapters, but not the base pairing specificity of DNA sample nucleosides having the same nucleoside identity but a different modification status. In some such embodiments, suboptimal conversion of the unmodified quality control nucleosides predicts false negative detection of DNA sample nucleosides having the same nucleoside identity and modification status as the quality control nucleosides or a different modification status and the same change in base pairing specificity on exposure to the conversion procedure. In some such embodiments, suboptimal conversion of the unmodified quality control nucleosides predicts false positive detection of DNA sample nucleosides having the same nucleoside identity and a different modification status as the quality control nucleosides or a different modification status and the same change in base pairing specificity on exposure to the conversion procedure.
[0224] In other embodiments of the disclosed methods, the quality control conversion procedure is selected to not change the base pairing specificity of modified quality control nucleosides in the adapters, and to change the base pairing specificity of DNA sample nucleosides having the same nucleoside identity but no modification. In some such embodiments, erroneous conversion of the modified quality control nucleosides predicts false negative detection of DNA sample nucleosides having the same nucleoside identity and modification status as the quality control nucleosides or a different modification status and the same change in base pairing specificity on exposure to the conversion procedure. In some such embodiments, erroneous conversion of the modified quality control nucleosides predicts false positive detection of DNA sample nucleosides having the same nucleoside identity as the quality control nucleosides but no modification or a different modification status and the same change in base pairing specificity on exposure to the conversion procedure.
[0225] In some embodiments, the quality control nucleosides in the adapters comprise unmodified cytosine. In some embodiments, the quality control nucleosides in the adapters comprise modified cytosine. In some such embodiments, the quality control nucleosides in the adapters comprise 5-methylcytosine (5mC) and/or 5-hydroxymethyl-cytosine (5hmC). In some embodiments, the quality control nucleosides in the adapters comprise 5-methylcytosine (5mC). In some embodiments, the quality control nucleosides in the adapters comprise 5-hydroxymethyl-cytosine (5hmC).
[0226] Thus, also provided herein are methods wherein the conversion procedure comprises deamination of unmodified nucleosides, such as unmodified cytosines. In some embodiments, the conversion procedure comprises enzymatic conversion of unmodified nucleosides, such as unmodified cytosines, using a non-specific, modification-sensitive double-stranded DNA deaminase, e.g., as in SEM-seq. See, e.g., Vaisvila et al. (2023) Discovery of novel DNA cytosine deaminase activities enables a nondestructive single-enzyme methylation sequencing method for base resolution high-coverage methylome mapping of cell-free and ultra-low input DNA. bioRxiv; DOI: 10.1101/2023.06.29.547047, available at https://www.biorxiv.org/content/10.1101/2023.06.29.547047v1. SEM-seq employs a non-specific, modification-sensitive double-stranded DNA deaminase (MsddA) in a nondestructive single-enzyme 5-methylcytosine sequencing (SEM-seq) method that deaminates unmodified cytosines. Accordingly, SEM-seq does not require the TET2 and T4-BGT or 5-hydroxymethylcytosine carbamoyltransferase protection and denaturing steps that are of use, e.g., in APOBEC3A-based protocols. Additionally, MsddA does not deaminate 5-formylated cytosines (5fC) or 5-carboxylated cytosines (5caC). In SEM-seq, unmodified cytosines in the DNA are deaminated to uracil and are read as “T” during sequencing. Modified cytosines (e.g., 5mC) are not converted and are read as “C” during sequencing. Cytosines that are read as thymines are identified as unmodified (e.g., unmethylated) cytosines or as thymines in the DNA. Performing SEM-seq conversion thus facilitates identifying positions containing 5mC using the sequence reads obtained. In some embodiments, the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA comprises enzymatic conversion of the first nucleobase using MsddA. 
Optionally, however, in some embodiments of the disclosed methods wherein the conversion procedure deaminates unmodified nucleosides (such as unmodified cytosines), the method further comprises enzymatic protection of at least one type of modified nucleoside (such as modified cytosines, such as 5mC and/or 5hmC) in the DNA prior to deamination of unprotected unmodified nucleosides (such as unprotected unmodified cytosines). In some embodiments, the at least one type of modified nucleoside is 5mC. In some embodiments, enzymatic protection of 5mC comprises converting a 5mC to carboxylcytosine. For example, converting a 5mC to carboxylcytosine can comprise contacting the 5mC with a TET enzyme, such as TET1, TET2, or TET3, or any suitable TET enzyme disclosed herein. In some embodiments, the at least one type of modified nucleoside is 5hmC. In some embodiments, the enzymatic protection of 5hmCs in the DNA prior to the deamination of unmodified cytosines comprises glucosylation of the 5hmCs, such as described herein.
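The read-out logic of a deamination-based conversion such as that described above can be sketched in Python as follows. This is a simplified illustration only: it assumes a converted read aligned to the same strand of a reference without insertions or deletions, and the function name and output labels are hypothetical. It also does not distinguish a converted unmodified cytosine from a genuine C-to-T sequence variant, which a practical pipeline would need to address.

```python
def call_cytosines(reference, read):
    """Classify each reference cytosine from a converted, gap-free aligned read."""
    if len(reference) != len(read):
        raise ValueError("reference and read must have equal length")
    calls = {}
    for pos, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue  # only reference cytosines are informative here
        if read_base == "C":
            calls[pos] = "modified"    # resisted deamination (e.g., 5mC)
        elif read_base == "T":
            calls[pos] = "unmodified"  # deaminated to uracil, read as T
        else:
            calls[pos] = "ambiguous"   # mismatch not explained by conversion
    return calls
```

For example, under these assumptions, `call_cytosines("ACGTCC", "ACGTTA")` labels position 1 as modified, position 4 as unmodified, and position 5 as ambiguous.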
[0227] Also provided herein are methods in which alternative base conversion schemes are used. For example, unmethylated cytosines can be left intact (such as through being protected, such as using a method disclosed herein) while methylated cytosines and hydroxymethylcytosines are converted to a base read as a thymine (e.g., uracil, thymine, or dihydrouracil).
[0228] In some embodiments, converting a modified (such as methylated or hydroxymethylated) cytosine in at least one first or second strand to a thymine or a base read as thymine comprises oxidizing a hydroxymethyl cytosine, e.g., the hydroxymethyl cytosine is oxidized to formylcytosine. In some embodiments, oxidizing the hydroxymethyl cytosine to formylcytosine comprises contacting the hydroxymethyl cytosine with a ruthenate, such as potassium perruthenate (KRuO4).
[0229] In some embodiments, the modified cytosine is converted to thymine, uracil, or dihydrouracil. In any such embodiments, amplification methods may comprise uracil- and/or dihydrouracil-tolerant amplification methods, such as PCR using a uracil- and/or dihydrouracil-tolerant DNA polymerase.
[0230] In some embodiments, the method comprises converting a formylcytosine and/or a methylcytosine to carboxylcytosine as part of converting the modified cytosine in at least one first or second strand to a thymine or a base read as thymine. For example, converting the formylcytosine and/or the methylcytosine to carboxylcytosine can comprise contacting the formylcytosine and/or the methylcytosine with a TET enzyme, such as TET1, TET2, or TET3. In some embodiments, the method comprises reducing the carboxylcytosine as part of converting the modified cytosine in at least one first or second strand to a thymine or a base read as thymine, and/or the carboxylcytosine is reduced to dihydrouracil. In some embodiments, reducing the carboxylcytosine comprises contacting the carboxylcytosine with a borane or borohydride reducing agent.
[0231] In some embodiments, the borane or borohydride reducing agent comprises pyridine borane, 2-picoline borane, borane, tert-butylamine borane, ammonia borane, sodium borohydride, sodium cyanoborohydride (NaBH3CN), lithium borohydride (LiBH4), ethylenediamine borane, dimethylamine borane, sodium triacetoxyborohydride, morpholine borane, 4-methylmorpholine borane, trimethylamine borane, dicyclohexylamine borane, or a salt thereof. In other embodiments, the reducing agent comprises lithium aluminum hydride, sodium amalgam, amalgam, sulfur dioxide, dithionate, thiosulfate, iodide, hydrogen peroxide, hydrazine, diisobutylaluminum hydride, oxalic acid, carbon monoxide, cyanide, ascorbic acid, formic acid, dithiothreitol, beta-mercaptoethanol, or any combination thereof.
[0232] Various TET enzymes may be used in the disclosed methods as appropriate. In some embodiments, the one or more TET enzymes comprise TETv. TETv is described in US Patent 10,260,088 and its sequence is SEQ ID NO: 1 therein. In some embodiments, the one or more TET enzymes comprise TETcd. TETcd is described in US Patent 10,260,088 and its sequence is SEQ ID NO: 3 therein. In some embodiments, the one or more TET enzymes comprise TET1. In some embodiments, the one or more TET enzymes comprise TET2. TET2 may be expressed and used as a fragment comprising TET2 residues 1129-1480 joined to TET2 residues 1844-1936 by a linker as described, e.g., in US Patent 10,961,525. In some embodiments, the one or more TET enzymes comprise TET1 and TET2. In some embodiments, the one or more TET enzymes comprise a V1900 TET mutant, such as a V1900A, V1900C, V1900G, V1900I, or V1900P TET mutant. In some embodiments, the one or more TET enzymes comprise a V1900 TET2 mutant, such as a V1900A, V1900C, V1900G, V1900I, or V1900P TET2 mutant. It can be beneficial to use a TET enzyme that maximizes formation of 5-carboxylcytosine (5-caC) relative to less oxidized modified cytosines, particularly 5-formylcytosine, because 5-caC is not a substrate for enzymatic deamination, e.g., by APOBEC enzymes such as APOBEC3A. Maximizing formation of 5-caC thus reduces the risk of false calls in which a base is identified as unmethylated because it underwent deamination even though it was methylated (or hydroxymethylated) in the original sample. Accordingly, in some embodiments, the TET enzyme comprises a mutation that increases formation of 5-caC. Exemplary mutations are set forth above. “A mutation that increases formation of 5-caC” means that the TET enzyme having the mutation produces more 5-caC than a TET enzyme that lacks the mutation but is otherwise identical. 
5-caC production can be measured as described, e.g., in Liu et al., Nat Chem Biol 13: 181-187 (2017) (see Online Methods section, TET reactions in vitro subsection, “driving” conditions). Any variants and/or mutants described in Liu et al. (2017) can be used in the disclosed methods as appropriate.
[0233] In some embodiments, the one or more TET enzymes comprise a TET2 enzyme comprising a T1372S mutation, such as TET2-CS-T1372S and TET2-CD-T1372S. A TET2 comprising a T1372S mutation is described in US Patent 10,961,525 and may be expressed and used as a fragment comprising TET2 residues 1129-1480 joined to TET2 residues 1844-1936 by a linker. Position 1372 of TET2 corresponds to position 258 of SEQ ID NO: 21 (wild type TET2 catalytic domain) of US Patent 10,961,525. Thus, the sequence of a T1372S TET2 catalytic domain may be obtained by changing the threonine at position 258 of SEQ ID NO: 21 of US Patent 10,961,525 to serine. TET2 comprising a T1372S mutation is also described in Liu et al., Nat Chem Biol. 2017 February; 13(2): 181-187. As demonstrated in Liu et al., TET2 comprising a T1372S mutation can more efficiently oxidize 5mC to produce 5-carboxylcytosine (5caC) than other versions of TET2 such as TET2 lacking a T1372S mutation.
[0234] Provided herein is a method comprising contacting DNA with a TET2 enzyme comprising a T1372S mutation to oxidize 5-methylcytosine (5mC) and/or 5-hydroxymethylcytosine (5hmC) present in the DNA to 5-carboxylcytosine (5caC), subsequently contacting at least a portion of the DNA with a substituted borane reducing agent, thereby converting 5caC in the DNA to dihydrouracil (DHU), thereby producing treated DNA, and sequencing at least a portion of the treated DNA.
Epigenetic analysis using partitioning-based procedures
[0235] Methylation analysis can also be performed using methylation-specific binding agents, such as methyl-binding domain (MBD) based partitioning. MBD proteins, such as MBD2 or MeCP2, are proteins that have a high affinity for methylated CpG sites in DNA. These MBD proteins can be used in binding assays to partition nucleic acid based on its methylation status. The nucleic acids can be incubated with an MBD protein or MBD-containing complex under conditions that allow for the specific binding of the MBD to methylated CpG sites. Unmethylated CpG sites are not recognized by the MBD and remain unbound. The bound nucleic acids can then be partitioned to provide a hypermethylated partition (the bound nucleic acids) and a hypomethylated partition (the unbound nucleic acids). The bound nucleic acids can be further partitioned using washes of increasing stringency. One or more of the resulting partitions can thus be subjected to nucleic acid sequencing and the methylation status of the nucleic acids can be inferred from the partition that they were present in.
[0236] In some embodiments, the partitions are differentially tagged and then recombined before dividing the sample into first and second aliquots, followed by subsequent steps of methods described herein. In some embodiments, the sample that is divided into the first and second aliquots is a partition, such as a hypomethylated partition, and the second aliquot is combined with at least one other partition, such as a hypermethylated partition, before undergoing enrichment and/or other steps of the method.
[0237] In some instances, a heterogeneous nucleic acid sample is partitioned into two or more partitions (e.g., at least 3, 4, 5, 6 or 7 partitions). In some embodiments, each partition is differentially tagged. Tagged partitions can then be pooled together for collective sample prep and/or sequencing. The partitioning-tagging-pooling steps can occur more than once, with each round of partitioning occurring based on a different characteristic and tagged using differential tags that are distinguished from other partitions and partitioning means.
[0238] Examples of characteristics that can be used for partitioning include sequence length, methylation level, nucleosome binding, sequence mismatch, immunoprecipitation, and/or proteins that bind to DNA. Resulting partitions can include one or more of the following nucleic acid forms: single-stranded DNA (ssDNA), double-stranded DNA (dsDNA), shorter DNA fragments and longer DNA fragments. In some embodiments, a heterogeneous population of nucleic acids is partitioned into nucleic acids with one or more epigenetic modifications and without the one or more epigenetic modifications. Examples of epigenetic modifications include presence or absence of methylation; level of methylation; type of methylation (e.g., 5-methylcytosine versus other types of methylation, such as adenine methylation and/or cytosine hydroxymethylation); and association and level of association with one or more proteins, such as histones. Alternatively, or additionally, a heterogeneous population of nucleic acids can be partitioned into nucleic acid molecules associated with nucleosomes and nucleic acid molecules devoid of nucleosomes. Alternatively, or additionally, a heterogeneous population of nucleic acids may be partitioned into single-stranded DNA (ssDNA) and double-stranded DNA (dsDNA). Alternatively, or additionally, a heterogeneous population of nucleic acids may be partitioned based on nucleic acid length (e.g., molecules of up to 160 bp and molecules having a length of greater than 160 bp).
[0239] In some instances, each partition (representative of a different nucleic acid form) is differentially labelled, and the partitions are pooled together prior to sequencing. In other instances, the different forms are separately sequenced.
[0240] Samples can include nucleic acids varying in modifications including postreplication modifications to nucleotides and binding, usually noncovalently, to one or more proteins.
[0241] In an embodiment, the population of nucleic acids is one obtained from a serum, plasma or blood sample from a subject suspected of having neoplasia, a tumor, or cancer or previously diagnosed with neoplasia, a tumor, or cancer. The population of nucleic acids includes nucleic acids having varying levels of methylation. Methylation can occur from any one or more post-replication or transcriptional modifications. Postreplication modifications include modifications of the nucleotide cytosine, particularly at the 5-position of the nucleobase, e.g., 5-methylcytosine, 5-hydroxymethylcytosine, 5-formylcytosine and 5-carboxylcytosine.
[0242] In some embodiments, the nucleic acids in the original population can be single-stranded and/or double-stranded. Partitioning based on single- versus double-strandedness of the nucleic acids can be accomplished by, e.g., using labelled capture probes to partition ssDNA and using double-stranded adapters to partition dsDNA.
[0243] The affinity agents can be antibodies with the desired specificity, natural binding partners or variants thereof (Bock et al., Nat Biotech 28: 1106-1114 (2010); Song et al., Nat Biotech 29: 68-72 (2011)), or artificial peptides selected e.g., by phage display to have specificity to a given target. Examples of capture moieties contemplated herein include methyl binding domain (MBDs) and methyl binding proteins (MBPs) as described herein.
[0244] Likewise, partitioning of different forms of nucleic acids can be performed using histone binding proteins which can separate nucleic acids bound to histones from free or unbound nucleic acids. Examples of histone binding proteins that can be used in the methods disclosed herein include RBBP4 (RbAp48) and SANT domain peptides.
[0245] Although for some affinity agents and modifications, binding to the agent may occur in an essentially all-or-none manner depending on whether a nucleic acid bears a modification, the separation may be one of degree. In such instances, nucleic acids overrepresented in a modification bind to the agent to a greater extent than nucleic acids underrepresented in the modification. Alternatively, nucleic acids having modifications may bind in an all-or-nothing manner, and various levels of modifications may then be sequentially eluted from the binding agent.
[0246] For example, in some embodiments, partitioning can be binary or based on degree/level of modifications. For example, all methylated fragments can be partitioned from unmethylated fragments using methyl-binding domain proteins (e.g., MethylMiner Methylated DNA Enrichment Kit (Thermo Fisher Scientific)). Subsequently, additional partitioning may involve eluting fragments having different levels of methylation by adjusting the salt concentration in a solution with the methyl-binding domain and bound fragments. As salt concentration increases, fragments having greater methylation levels are eluted.
[0247] In some instances, the final partitions are representatives of nucleic acids having different extents of modifications (overrepresentative or underrepresentative of modifications). Overrepresentation and underrepresentation can be defined by the number of modifications borne by a nucleic acid relative to the median number of modifications per strand in a population. For example, if the median number of 5-methylcytosine residues in nucleic acid in a sample is 2, a nucleic acid including more than two 5-methylcytosine residues is overrepresented in this modification and a nucleic acid with 1 or zero 5-methylcytosine residues is underrepresented. The effect of the affinity separation is to enrich for nucleic acids overrepresented in a modification in a bound phase and for nucleic acids underrepresented in a modification in an unbound phase (i.e., in solution). The nucleic acids in the bound phase can be eluted before subsequent processing.
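The median-based definition above can be illustrated with a short Python sketch. The function name, the per-molecule count input, and the label for molecules exactly at the median are assumptions made for illustration; the disclosure does not prescribe how molecules at the median are labeled.

```python
from statistics import median


def classify_by_median(mod_counts):
    """Label each molecule relative to the population median modification count."""
    med = median(mod_counts)
    labels = []
    for count in mod_counts:
        if count > med:
            labels.append("overrepresented")
        elif count < med:
            labels.append("underrepresented")
        else:
            labels.append("at_median")  # hypothetical label for ties
    return med, labels
```

With per-molecule 5-methylcytosine counts of 0, 1, 2, 2, 3, and 5, the median is 2, so the molecules with 3 and 5 residues are labeled overrepresented and those with 0 and 1 are labeled underrepresented, consistent with the example in the text.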
[0248] When using MethylMiner Methylated DNA Enrichment Kit (Thermo Fisher Scientific), various levels of methylation can be partitioned using sequential elutions. For example, a hypomethylated partition (e.g., no methylation) can be separated from a methylated partition by contacting the nucleic acid population with the MBD from the kit, which is attached to magnetic beads. The beads are used to separate out the methylated nucleic acids from the non-methylated nucleic acids. Subsequently, one or more elution steps are performed sequentially to elute nucleic acids having different levels of methylation. For example, a first set of methylated nucleic acids can be eluted at a salt concentration of 160 mM or higher, e.g., at least 200 mM, 300 mM, 400 mM, 500 mM, 600 mM, 700 mM, 800 mM, 900 mM, 1000 mM, or 2000 mM. After such methylated nucleic acids are eluted, magnetic separation is once again used to separate higher level of methylated nucleic acids from those with lower level of methylation. The elution and magnetic separation steps can repeat themselves to create various partitions such as a hypomethylated partition (e.g., representative of no methylation), a methylated partition (representative of low level of methylation), and a hypermethylated partition (representative of high level of methylation).
[0249] In some methods, nucleic acids bound to an agent used for affinity separation are subjected to a wash step. The wash step washes off nucleic acids weakly bound to the affinity agent. Such nucleic acids can be enriched in nucleic acids having the modification to an extent close to the mean or median (i.e., intermediate between nucleic acids remaining bound to the solid phase and nucleic acids not binding to the solid phase on initial contacting of the sample with the agent).
[0250] The affinity separation results in at least two, and sometimes three or more partitions of nucleic acids with different extents of a modification. While the partitions are still separate, the nucleic acids of at least one partition, and usually two or three (or more) partitions are linked to nucleic acid tags, usually provided as components of adapters, with the nucleic acids in different partitions receiving different tags that distinguish members of one partition from another. The tags linked to nucleic acid molecules of the same partition can be the same or different from one another. But if different from one another, the tags may have part of their code in common so as to identify the molecules to which they are attached as being of a particular partition. For further details regarding partitioning nucleic acid samples based on characteristics such as methylation, see WO2018/119452, which is incorporated herein by reference.
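The differential tagging and pooling described above can be sketched in Python as follows. The tag sequences, the fixed four-base tag length, the partition names, and the function names are all hypothetical and chosen only to show how a partition-specific tag allows a pooled molecule to be traced back to its partition after sequencing.

```python
# Hypothetical partition-specific tags (not sequences from the disclosure).
PARTITION_TAGS = {"hypo": "ACGT", "intermediate": "TGCA", "hyper": "GATC"}
TAG_LENGTH = 4


def tag_and_pool(partitions):
    """Prefix each molecule with its partition tag and pool all molecules."""
    pooled = []
    for name, molecules in partitions.items():
        tag = PARTITION_TAGS[name]
        pooled.extend(tag + seq for seq in molecules)
    return pooled


def assign_partition(tagged_read):
    """Recover the partition of origin and the insert from a tagged read."""
    tag_to_name = {tag: name for name, tag in PARTITION_TAGS.items()}
    tag, insert = tagged_read[:TAG_LENGTH], tagged_read[TAG_LENGTH:]
    return tag_to_name[tag], insert
```

A practical implementation would also need to tolerate sequencing errors in the tag, for example by using tags separated by a minimum edit distance.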
[0251] In some embodiments, the nucleic acid molecules can be fractionated into different partitions based on the nucleic acid molecules that are bound to a specific protein or a fragment thereof and those that are not bound to that specific protein or fragment thereof.
[0252] Nucleic acid molecules can be fractionated based on DNA-protein binding. Protein-DNA complexes can be fractionated based on a specific property of a protein. Examples of such properties include various epitopes, modifications (e.g., histone methylation or acetylation) or enzymatic activity. Examples of proteins which may bind to DNA and serve as a basis for fractionation may include, but are not limited to, protein A and protein G. Any suitable method can be used to fractionate the nucleic acid molecules based on protein bound regions. Examples of methods used to fractionate nucleic acid molecules based on protein bound regions include, but are not limited to, SDS-PAGE, chromatin-immuno-precipitation (ChIP), heparin chromatography, and asymmetrical field flow fractionation (AF4).
[0253] In some embodiments, partitioning of the nucleic acids is performed by contacting the nucleic acids with a methylation binding domain (“MBD”) of a methylation binding protein (“MBP”). MBD binds to 5-methylcytosine (5mC). MBD is coupled to paramagnetic beads, such as Dynabeads® M-280 Streptavidin via a biotin linker. Partitioning into fractions with different extents of methylation can be performed by eluting fractions by increasing the NaCl concentration.
[0254] Examples of MBPs contemplated herein include, but are not limited to: (a) MeCP2 is a protein preferentially binding to 5-methyl-cytosine over unmodified cytosine; (b) RPL26, PRP8 and the DNA mismatch repair protein MSH6 preferentially bind to 5-hydroxymethyl-cytosine over unmodified cytosine; (c) FOXK1, FOXK2, FOXP1, FOXP4 and FOXI3 preferably bind to 5-formyl-cytosine over unmodified cytosine (Iurlaro et al., Genome Biol. 14: R119 (2013)); or (d) Antibodies specific to one or more methylated nucleotide bases.
[0255] In general, elution is a function of number of methylated sites per molecule, with molecules having more methylation eluting under increased salt concentrations. To elute the DNA into distinct populations based on the extent of methylation, one can use a series of elution buffers of increasing NaCl concentration. Salt concentration can range from about 100 mM to about 2500 mM NaCl. In one embodiment, the process results in three (3) partitions. Molecules are contacted with a solution at a first salt concentration and comprising a molecule comprising a methyl binding domain, which molecule can be attached to a capture moiety, such as streptavidin. At the first salt concentration a population of molecules will bind to the MBD and a population will remain unbound. The unbound population can be separated as a “hypomethylated” population. For example, a first partition representative of the hypomethylated form of DNA is that which remains unbound at a low salt concentration, e.g., 100 mM or 160 mM. A second partition representative of intermediate methylated DNA is eluted using an intermediate salt concentration, e.g., between 100 mM and 2000 mM concentration. This is also separated from the sample. A third partition representative of hypermethylated form of DNA is eluted using a high salt concentration, e.g., at least about 2000 mM.
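The three-partition embodiment above can be summarized in a short Python sketch that maps the NaCl concentration at which a molecule is found in solution to a partition label. The function name and the default 160 mM and 2000 mM boundaries are illustrative assumptions drawn from the example values in the text; actual cut-offs would depend on the binding agent and protocol.

```python
def partition_label(release_salt_mm, low_mm=160, high_mm=2000):
    """Assign a partition from the NaCl concentration (mM) at which a
    molecule is unbound or eluted; boundaries are example values only."""
    if release_salt_mm <= low_mm:
        return "hypomethylated"       # remains unbound at low salt
    if release_salt_mm < high_mm:
        return "intermediate"         # eluted at intermediate salt
    return "hypermethylated"          # eluted only at high salt
```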
[0256] In some embodiments, e.g., wherein an epigenetic target region set is captured, sample DNA (e.g., between 1 and 300 ng) is mixed with an appropriate amount of methyl binding domain (MBD) buffer (the amount of MBD buffer depends on the amount of DNA used) and magnetic beads conjugated with MBD proteins and incubated overnight. Methylated DNA (hypermethylated DNA) binds the MBD protein on the magnetic beads during this incubation. Non-methylated (hypomethylated DNA) or less methylated DNA (intermediately methylated) is washed away from the beads with buffers containing increasing concentrations of salt. For example, one, two, or more fractions containing non-methylated, hypomethylated, and/or intermediately methylated DNA may be obtained from such washes. Finally, a high salt buffer is used to elute the heavily methylated DNA (hypermethylated DNA) from the MBD protein. In some embodiments, these washes result in three partitions (hypomethylated partition, intermediately methylated partition and hypermethylated partition) of DNA having increasing levels of methylation.
[0257] In some embodiments, the three partitions of DNA are desalted and concentrated in preparation for the enzymatic steps of library preparation.
[0258] In some embodiments, the methylation signature of molecules can be determined by treating the sample with one or more methylation sensitive restriction enzymes (MSRE) and/or methylation dependent restriction enzymes (MDRE). In some embodiments, any of the above methods can be used either alone or in combination, to determine the methylation signature of the molecules.
Applications
[0259] The methods disclosed herein allow for determining a likelihood of whether a subject has a disease. One useful exemplary application of the methods of the disclosure is using the likelihood determination in diagnosing and prognosing cancer or other genetic diseases or conditions, e.g., determining the presence or absence of a cancer in a subject.
[0260] Hence, in some embodiments, methods described herein comprise identifying or predicting the presence or absence of a tumor (or neoplastic cells, or cancer cells), determining the likelihood that a test subject has a tumor or cancer, and/or characterizing a tumor, neoplastic cells or cancer as described herein.
[0261] Cancer and Other Diseases; Cell type quantification
[0262] The present methods can be used to diagnose presence of a condition, e.g., cancer or precancer, in a subject, to characterize a condition (such as to determine a cancer stage or to determine heterogeneity of a cancer), to monitor a subject’s response to receiving a treatment for a condition (such as a response to a chemotherapeutic or immunotherapeutic), to assess prognosis of a subject (such as to predict a survival outcome in a subject having a cancer), to determine a subject’s risk of developing a condition, to predict a subsequent course of a condition in a subject, to determine metastasis or recurrence of a cancer in a subject (or a risk of cancer metastasis or recurrence), and/or to monitor a subject’s health as part of a preventative health monitoring program (such as to determine whether and/or when a subject is in need of further diagnostic screening). The present disclosure can also be useful in determining the efficacy of a particular treatment option. Successful treatment options may increase the amount of rare mutations detected in a subject's blood, as more cancer cells may die and shed nucleic acids (e.g., DNA). In other examples, this may not occur. In another example, certain treatment options may be correlated with genetic profiles of cancers over time. This correlation may be useful in selecting a therapy. In some embodiments, target regions are analyzed to determine whether they show methylation characteristics of tumor cells or cells that do not ordinarily contribute significantly to cfDNA. 
In some embodiments, successful treatment options may result in changes in levels of different immune cell types (including rare immune cell types), and/or increases in the amount of target proteins, copy number variation, rare mutations, and/or cancer-related epigenetic signatures (such as hypermethylated regions or hypomethylated regions) detected in, e.g., a sample from a subject, such as detected in a subject's blood (such as in DNA isolated from a buffy coat sample or any other sample comprising cells, such as in a blood sample (e.g., a whole blood sample, a plasma sample, a buffy coat sample, a leukapheresis sample, or a PBMC sample) from the subject) if the treatment is successful as more cancer cells may die and shed DNA, or, e.g., if a successful treatment results in an increase or decrease in the quantity of a specific protein in the blood and an unsuccessful treatment results in no change.
[0263] Additionally, if a cancer is observed to be in remission after treatment, the present methods can be used to monitor the likelihood of residual disease or the likelihood of recurrence of disease. In some embodiments, the present methods are used for screening for a cancer, such as a metastasis, or in a method for screening cancer, such as in a method of detecting the presence or absence of a metastasis. For example, the sample can be a sample from a subject who has or has not been previously diagnosed with cancer. In some embodiments, one or more, such as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more samples are collected from a subject as described herein, such as before and/or after the subject is diagnosed with a cancer. In some embodiments, the subject may or may not have cancer. In some embodiments, the subject may or may not have an early-stage cancer. In some embodiments, the subject has one or more risk factors for cancer, such as tobacco use (e.g., smoking), being overweight or obese, having a high body mass index (BMI), being of advanced age, poor nutrition, high alcohol consumption, or a family history of cancer.
[0264] In some embodiments, the subject has used tobacco, e.g., for at least 1, 5, 10, or 15 years. In some embodiments, the subject has a high BMI, e.g., a BMI of 25 or greater, 26 or greater, 27 or greater, 28 or greater, 29 or greater, or 30 or greater. In some embodiments, the subject is at least 40, 45, 50, 55, 60, 65, 70, 75, or 80 years old. In some embodiments, the subject has poor nutrition, e.g., high consumption of one or more of red meat and/or processed meat, trans fat, saturated fat, and refined sugars, and/or low consumption of fruits and vegetables, complex carbohydrates, and/or unsaturated fats. High and low consumption can be defined, e.g., as exceeding or falling below, respectively, recommendations in Dietary Guidelines for Americans 2020-2025, available at dietaryguidelines.gov/sites/default/files/2021-03/Dietary_Guidelines_for_Americans-2020-2025.pdf. In some embodiments, the subject has high alcohol consumption, e.g., at least three, four, or five drinks per day on average (where a drink is about one ounce or 30 mL of 80-proof hard liquor or the equivalent). In some embodiments, the subject has a family history of cancer, e.g., at least one, two, or three blood relatives were previously diagnosed with cancer. In some embodiments, the relatives are at least third-degree relatives (e.g., great-grandparent, great aunt or uncle, first cousin), at least second-degree relatives (e.g., grandparent, aunt or uncle, or half-sibling), or first-degree relatives (e.g., parent or full sibling).
[0265] Furthermore, in some embodiments, the one or more methods described in the present disclosure may be used to assist in the treatment of a type of cancer. In some embodiments, the methods and systems disclosed herein may be used to identify customized or targeted therapies to treat a given disease or condition in patients based on the classification of a nucleic acid variant as being of somatic or germline origin. Typically, the disease under consideration is a type of cancer, such as any referred to herein. The types and number of cancers that may be detected may include blood cancers, brain cancers, lung cancers, skin cancers, nose cancers, throat cancers, liver cancers, bone cancers, lymphomas, pancreatic cancers, skin cancers, bowel cancers, rectal cancers, thyroid cancers, bladder cancers, kidney cancers, mouth cancers, stomach cancers, solid state tumors, heterogeneous tumors, homogenous tumors and the like. Non-limiting examples of such cancers include biliary tract cancer, bladder cancer, transitional cell carcinoma, urothelial carcinoma, brain cancer, gliomas, astrocytomas, breast cancer, metaplastic carcinoma, cervical cancer, cervical squamous cell carcinoma, rectal cancer, colorectal carcinoma, colon cancer, hereditary nonpolyposis colorectal cancer, colorectal adenocarcinomas, gastrointestinal stromal tumors (GISTs), endometrial carcinoma, endometrial stromal sarcomas, esophageal cancer, esophageal squamous cell carcinoma, esophageal adenocarcinoma, ocular melanoma, uveal melanoma, gallbladder carcinomas, gallbladder adenocarcinoma, renal cell carcinoma, clear cell renal cell carcinoma, transitional cell carcinoma, urothelial carcinomas, Wilms tumor, leukemia, acute lymphocytic leukemia (ALL), acute myeloid leukemia (AML), chronic lymphocytic leukemia (CLL), chronic myeloid leukemia (CML), chronic myelomonocytic leukemia (CMML), liver cancer, liver carcinoma, hepatoma, hepatocellular carcinoma, cholangiocarcinoma, hepatoblastoma, Lung cancer, 
non-small cell lung cancer (NSCLC), mesothelioma, B-cell lymphomas, non-Hodgkin lymphoma, diffuse large B-cell lymphoma, mantle cell lymphoma, T cell lymphomas, non-Hodgkin lymphoma, precursor T-lymphoblastic lymphoma/leukemia, peripheral T cell lymphomas, multiple myeloma, nasopharyngeal carcinoma (NPC), neuroblastoma, oropharyngeal cancer, oral cavity squamous cell carcinomas, osteosarcoma, ovarian carcinoma, pancreatic cancer, pancreatic ductal adenocarcinoma, pseudopapillary neoplasms, acinar cell carcinomas, prostate cancer, prostate adenocarcinoma, skin cancer, melanoma, malignant melanoma, cutaneous melanoma, small intestine carcinomas, stomach cancer, gastric carcinoma, gastrointestinal stromal tumor (GIST), uterine cancer, or uterine sarcoma.
[0266] In some embodiments, the cancer is a type of cancer that is not a hematological cancer, e.g., a solid tumor cancer such as a carcinoma, adenocarcinoma, or sarcoma. Type and/or stage of cancer can be detected from genetic variations including mutations, rare mutations, indels, rearrangements, copy number variations, transversions, translocations, recombinations, inversions, deletions, aneuploidy, partial aneuploidy, polyploidy, chromosomal instability, chromosomal structure alterations, gene fusions, chromosome fusions, gene truncations, gene amplification, gene duplications, chromosomal lesions, DNA lesions, abnormal changes in nucleic acid chemical modifications, abnormal changes in epigenetic patterns, such as 5mC and 5mC profiles. Hence, the present methods can in some cases be used in combination with methods used to detect other genetic/epigenetic variations, e.g., in a method of detecting or characterizing a cancer or other methods described herein.
[0267] In some embodiments, a method described herein comprises identifying the presence of target regions and/or DNA produced by a tumor (or neoplastic cells, or cancer cells) or by precancer cells. In some embodiments, a method described herein comprises determining the level of target regions and/or identifying the presence of DNA produced by a tumor (or neoplastic cells, or cancer cells) or by precancer cells. In some embodiments, determining the level of target regions comprises determining either an increased level or decreased level of target regions, wherein the increased or decreased level of target regions is determined by comparing the level of target regions with a threshold level/value.
[0268] Genetic and/or epigenetic data can also be used for characterizing a specific form of cancer. Cancers are often heterogeneous in both composition and staging. Genetic and/or epigenetic profile data may allow characterization of specific sub-types of cancer that may be important in the diagnosis or treatment of that specific sub-type. This information may also provide a subject or practitioner clues regarding the prognosis of a specific type of cancer and allow either a subject or practitioner to adapt treatment options in accord with the progress of the disease. Some cancers can progress to become more aggressive and genetically unstable. Other cancers may remain benign, inactive or dormant. The system and methods of this disclosure may be useful in determining disease progression.
[0269] Further, the methods of the disclosure may be used to characterize the heterogeneity of an abnormal condition in a subject. Such methods can include, e.g., generating a genetic and/or epigenetic profile of nucleic acids (e.g., cfDNA) derived from the subject, wherein the genetic and/or epigenetic profile comprises a plurality of data resulting from copy number variation and rare mutation analyses. In some embodiments, an abnormal condition is cancer, e.g., as described herein. In some embodiments, the abnormal condition may be one resulting in a heterogeneous genomic population. In the example of cancer, some tumors are known to comprise tumor cells in different stages of the cancer. In other examples, heterogeneity may comprise multiple foci of disease, such as where one or more foci (such as one or more tumor foci) are the result of metastases that have spread from a primary site of a cancer. The tissue(s) of origin can be useful for identifying organs affected by the cancer, including the primary cancer and/or metastatic tumors.
[0270] The present methods can also be used to quantify levels of different cell types, such as immune cell types, including rare immune cell types, such as activated lymphocytes and myeloid cells at particular stages of differentiation. Such quantification can be based on the numbers of molecules corresponding to a given cell type in a sample. In some embodiments, the sequencing comprises generating a plurality of sequencing reads. Sequence information obtained in the present methods may comprise sequence reads of the nucleic acids generated by a nucleic acid sequencer. In some embodiments, the nucleic acid sequencer performs pyrosequencing, single-molecule sequencing, nanopore sequencing, semiconductor sequencing, sequencing-by-synthesis, 5-letter sequencing, 6-letter sequencing, sequencing-by-ligation or sequencing-by-hybridization on the nucleic acids to generate sequencing reads. In some embodiments, the method further comprises mapping the plurality of sequence reads to one or more reference sequences to generate mapped sequence reads. In some embodiments, the method further comprises grouping the sequence reads into families of sequence reads, each family comprising sequence reads generated from a nucleic acid in the sample. In some embodiments, the methods comprise determining the likelihood that the subject from which the sample was obtained has cancer or precancer, or has a metastasis, that is related to changes in proportions of types of immune cells. In some embodiments, the methods comprise processing the mapped sequence reads to determine the likelihood that the subject has cancer or precancer.
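By way of illustration only, the grouping of mapped sequence reads into families may be sketched as follows. The disclosure does not specify how a family is keyed; the sketch assumes, hypothetically, that reads sharing the same mapped coordinates and the same molecular barcode derive from one original nucleic acid. The names MappedRead, umi, and group_into_families are illustrative and not part of the disclosure.

```python
from collections import defaultdict
from typing import NamedTuple


class MappedRead(NamedTuple):
    read_id: str
    chrom: str
    start: int
    end: int
    umi: str  # hypothetical molecular barcode attached before amplification


def group_into_families(reads: list[MappedRead]) -> dict[tuple, list[str]]:
    """Group reads that share mapped coordinates and barcode into one family.

    Assumption: (chrom, start, end, umi) identifies the original molecule.
    """
    families: dict[tuple, list[str]] = defaultdict(list)
    for read in reads:
        key = (read.chrom, read.start, read.end, read.umi)
        families[key].append(read.read_id)
    return dict(families)


reads = [
    MappedRead("r1", "chr1", 100, 266, "ACGT"),
    MappedRead("r2", "chr1", 100, 266, "ACGT"),  # same molecule as r1
    MappedRead("r3", "chr1", 100, 266, "TTAG"),  # same position, other barcode
]
families = group_into_families(reads)
```

Under these assumptions, the number of families (rather than the number of reads) may serve as the count of original molecules for a given region.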
[0271] The present methods can be used to generate a profile, fingerprint, or set of data that is a summation of genetic and/or epigenetic information derived from different cells in a heterogeneous disease. This set of data may comprise copy number variation, epigenetic variation, and mutation analyses alone or in combination.
[0272] The present methods can be used to diagnose, prognose, monitor or observe cancers, or other diseases. In some embodiments, the methods herein do not involve diagnosing, prognosing, or monitoring a fetus and as such are not directed to non-invasive prenatal testing. In other embodiments, these methodologies may be employed in a pregnant subject to diagnose, prognose, monitor or observe cancers or other diseases in an unborn subject whose DNA and other polynucleotides may co-circulate with maternal molecules.
[0273] Non-limiting examples of other genetic-based diseases, disorders, or conditions that are optionally evaluated using the methods and systems disclosed herein include achondroplasia, alpha-1 antitrypsin deficiency, antiphospholipid syndrome, autism, autosomal dominant polycystic kidney disease, Charcot-Marie-Tooth (CMT), cri du chat, Crohn's disease, cystic fibrosis, Dercum disease, Down syndrome, Duane syndrome, Duchenne muscular dystrophy, Factor V Leiden thrombophilia, familial hypercholesterolemia, familial Mediterranean fever, fragile X syndrome, Gaucher disease, hemochromatosis, hemophilia, holoprosencephaly, Huntington's disease, Klinefelter syndrome, Marfan syndrome, myotonic dystrophy, neurofibromatosis, Noonan syndrome, osteogenesis imperfecta, Parkinson's disease, phenylketonuria, Poland anomaly, porphyria, progeria, retinitis pigmentosa, severe combined immunodeficiency (SCID), sickle cell disease, spinal muscular atrophy, Tay-Sachs, thalassemia, trimethylaminuria, Turner syndrome, velocardiofacial syndrome, WAGR syndrome, Wilson disease, or the like.
[0274] In some embodiments, the methods disclosed herein can be used to quantify the level of DNA damage present in the original DNA sample through quantification of the regions synthesized during end repair. This is because the level of end repair depends in part on the amount of DNA damage (e.g., gaps, nicks, and overhangs) present in the DNA, since this damage can act as the priming sites for synthesis during end repair (see Figures 1-6 and the corresponding descriptions).
[0275] In some embodiments, the method further comprises calculating a synthesis index which is a quantitative measure of the regions synthesized in the end repair. The synthesis index may be on a molecule level and/or a sample level. The synthesis index may be the proportion of sequencing data which corresponds to synthesized regions. In some embodiments, the method further comprises comparing the synthesis index to one or more reference values to classify the DNA sample. The classification may be whether the DNA sample derives from a subject with or without cancer. The reference values may be derived from one or more control DNA samples which are known to have specific properties, such as being derived from a subject known to have cancer, e.g., a specific type of cancer. The reference values may be obtained by performing the method used to obtain the synthesis index on control samples (i.e., using the same end repair, ligation and sequencing methods).
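As a non-limiting sketch, a synthesis index of the kind described above may be computed as a proportion of sequenced bases attributed to end-repair synthesis, at the molecule level or the sample level. The data structure, the per-base accounting, and the direction of the comparison to the reference value (a higher index treated as cancer-like) are assumptions made for illustration only; the disclosure does not fix them.

```python
from dataclasses import dataclass


@dataclass
class Molecule:
    total_bases: int        # sequenced bases observed for this molecule
    synthesized_bases: int  # bases attributed to end-repair synthesis


def molecule_synthesis_index(m: Molecule) -> float:
    """Proportion of one molecule's sequenced bases that were synthesized."""
    if m.total_bases <= 0:
        raise ValueError("total_bases must be positive")
    return m.synthesized_bases / m.total_bases


def sample_synthesis_index(molecules: list[Molecule]) -> float:
    """Proportion of all sequenced bases in the sample that were synthesized."""
    total = sum(m.total_bases for m in molecules)
    if total <= 0:
        raise ValueError("sample has no sequenced bases")
    return sum(m.synthesized_bases for m in molecules) / total


def classify_by_reference(index: float, reference_value: float) -> str:
    """Hypothetical rule: an index at or above the reference is 'cancer-like'.

    The reference value would come from control samples processed with the
    same end repair, ligation, and sequencing methods.
    """
    return "cancer-like" if index >= reference_value else "non-cancer-like"


sample = [Molecule(100, 10), Molecule(300, 30)]
index = sample_synthesis_index(sample)  # 40 synthesized of 400 total bases
label = classify_by_reference(index, reference_value=0.05)
```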
[0276] In some embodiments, the sample is obtained from a subject who was previously diagnosed with a cancer and received one or more previous cancer treatments. In some embodiments, the sample is obtained at one or more preselected time points following the one or more previous cancer treatments. In some embodiments, a method described herein comprises detecting a presence or absence of DNA originating or derived from a tumor cell at a preselected timepoint following a previous cancer treatment of a subject previously diagnosed with cancer using a set of sequence information obtained as described herein. The method may further comprise determining a cancer recurrence score that is indicative of the presence or absence of the DNA originating or derived from the tumor cell for the subject.
[0277] Where a cancer recurrence score is determined, it may further be used to determine a cancer recurrence status. The cancer recurrence status may be at risk for cancer recurrence, e.g., when the cancer recurrence score is above a predetermined threshold. The cancer recurrence status may be at low or lower risk for cancer recurrence, e.g., when the cancer recurrence score is below a predetermined threshold. In particular embodiments, a cancer recurrence score equal to the predetermined threshold may result in a cancer recurrence status of either at risk for cancer recurrence or at low or lower risk for cancer recurrence.
[0278] In some embodiments, a cancer recurrence score is compared with a predetermined cancer recurrence threshold, and the subject is classified as a candidate for a subsequent cancer treatment when the cancer recurrence score is above the cancer recurrence threshold or not a candidate for therapy when the cancer recurrence score is below the cancer recurrence threshold. In particular embodiments, a cancer recurrence score equal to the cancer recurrence threshold may result in classification as either a candidate for a subsequent cancer treatment or not a candidate for therapy.
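The threshold comparisons in the two preceding paragraphs may be sketched as follows. The sketch reads the lower-risk status as corresponding to a score below the threshold, and exposes the handling of a score equal to the threshold as a parameter, since either outcome is contemplated. The function and parameter names are hypothetical.

```python
def recurrence_status(score: float, threshold: float,
                      tie_is_at_risk: bool = True) -> str:
    """Map a cancer recurrence score to a recurrence status.

    A score equal to the threshold may be assigned to either status;
    tie_is_at_risk selects which (an illustrative design choice).
    """
    if score > threshold:
        return "at risk"
    if score < threshold:
        return "lower risk"
    return "at risk" if tie_is_at_risk else "lower risk"


def is_treatment_candidate(score: float, threshold: float,
                           tie_is_candidate: bool = True) -> bool:
    """Classify the subject as a candidate for a subsequent cancer treatment."""
    if score == threshold:
        return tie_is_candidate
    return score > threshold
```

For example, with a threshold of 0.5, a score of 0.8 would give a status of "at risk" and candidate classification, while a score of 0.2 would give "lower risk" and no candidate classification.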
[0279] The present methods can also be used to quantify levels of different cell types, such as immune cell types, including rare immune cell types, such as activated lymphocytes and myeloid cells at particular stages of differentiation. Such quantification can be based on the numbers of molecules corresponding to a given cell type in a sample. Sequence information obtained in the present methods may comprise sequence reads of the nucleic acids generated by a nucleic acid sequencer. In some embodiments, the nucleic acid sequencer performs pyrosequencing, single-molecule sequencing, nanopore sequencing, semiconductor sequencing, sequencing-by-synthesis, 5-letter sequencing, 6-letter sequencing, sequencing-by-ligation or sequencing-by-hybridization on the nucleic acids to generate sequencing reads. In some embodiments, the method further comprises grouping the sequence reads into families of sequence reads, each family comprising sequence reads generated from a nucleic acid in the sample. In some embodiments, the methods comprise determining the likelihood that the subject from which the sample was obtained has cancer, precancer, an infection, transplant rejection, or other disease or disorder that is related to changes in proportions of types of immune cells. Comparisons of immune cell identities and/or immune cell quantities/proportions between two or more samples collected from a subject at two different time points can allow for monitoring of one or more aspects of a condition in the subject over time, such as a response of the subject to a treatment, the severity of the condition (such as a cancer stage) in the subject, a recurrence of the condition (such as a cancer), and/or the subject’s risk of developing the condition (such as a cancer).
[0280] The methods discussed above may further comprise any compatible feature or features set forth elsewhere herein, including in the section regarding methods of determining a risk of cancer recurrence in a subject and/or classifying a subject as being a candidate for a subsequent cancer treatment.
[0281] Methods of determining a risk of cancer recurrence in a test subject and/or classifying a subject as being a candidate for a subsequent cancer treatment
[0282] In some embodiments, a method provided herein is or comprises a method of determining a risk of cancer recurrence in a subject. In some embodiments, a method provided herein is or comprises a method of detecting the presence or absence of a metastasis in a subject. In some embodiments, a method provided herein is or comprises a method of classifying a subject as being a candidate for a subsequent cancer treatment.
[0283] Any of such methods may comprise collecting a sample (such as DNA, such as DNA originating or derived from a tumor cell) from the subject diagnosed with the cancer at one or more preselected timepoints following one or more previous cancer treatments to the subject. The subject may be any of the subjects described herein. The sample may comprise chromatin, cfDNA, or other cell materials. The sample, such as the DNA sample, may be a tissue sample. The DNA may be DNA, such as cfDNA, from a blood sample (e.g., a whole blood sample, a buffy coat sample, a leukapheresis sample, or a PBMC sample). The DNA may comprise DNA obtained from a tissue sample or a liquid sample.
[0284] Any of such methods may comprise capturing a plurality of sets of target regions from DNA from the subject, wherein the plurality of target region sets comprises a sequence-variable target region set and an epigenetic target region set, whereby a captured set of DNA molecules is produced. The capturing step or steps may be performed according to any of the embodiments described elsewhere herein.
[0285] In any of such methods, the previous cancer treatment may comprise surgery, administration of a therapeutic composition, and/or chemotherapy. Any of such methods may comprise sequencing the captured DNA molecules, whereby a set of sequence information is produced. The captured DNA molecules of the sequence-variable target region set may be sequenced to a greater depth of sequencing than the captured DNA molecules of the epigenetic target region set.
[0286] Any of such methods may comprise detecting a presence or absence of DNA, such as cfDNA, originating or derived from a tumor cell at a preselected timepoint using the set of sequence information. The detection of the presence or absence of DNA originating or derived from a tumor cell may be performed according to any of the embodiments thereof described elsewhere herein.
[0287] Methods of determining a risk of cancer recurrence in a subject may comprise determining a cancer recurrence score that is indicative of the presence or absence, or amount, of the DNA, such as genomic regions of interest and target regions, originating or derived from the tumor cell for the subject. The cancer recurrence score may further be used to determine a cancer recurrence status. The cancer recurrence status may be at risk for cancer recurrence, e.g., when the cancer recurrence score is above a predetermined threshold. The cancer recurrence status may be at low or lower risk for cancer recurrence, e.g., when the cancer recurrence score is below a predetermined threshold. In particular embodiments, a cancer recurrence score equal to the predetermined threshold may result in a cancer recurrence status of either at risk for cancer recurrence or at low or lower risk for cancer recurrence.
[0288] Methods of detecting the presence or absence of metastasis in a subject may comprise comparing the presence or level of a tissue-specific cell material to the presence or level of the tissue-specific cell material obtained from the subject at a different time, a reference level of the tissue-specific cell material, or to a comparator cell material. Methods herein may comprise additional steps to determine whether a metastasis is present.
[0289] Methods of classifying a subject as being a candidate for a subsequent cancer treatment may comprise comparing the cancer recurrence score of the subject with a predetermined cancer recurrence threshold, thereby classifying the subject as a candidate for the subsequent cancer treatment when the cancer recurrence score is above the cancer recurrence threshold or not a candidate for therapy when the cancer recurrence score is below the cancer recurrence threshold. In particular embodiments, a cancer recurrence score equal to the cancer recurrence threshold may result in classification as either a candidate for a subsequent cancer treatment or not a candidate for therapy. In some embodiments, the subsequent cancer treatment comprises chemotherapy or administration of a therapeutic composition.
[0290] Any of such methods may comprise determining a disease-free survival (DFS) period for the subject based on the cancer recurrence score; for example, the DFS period may be 1 year, 2 years, 3 years, 4 years, 5 years, or 10 years.
[0291] In some embodiments, sequence-variable target region sequences are obtained, and determining the cancer recurrence score may comprise determining at least a first subscore indicative of the levels of particular immune cell types and/or the amount of SNVs, insertions/deletions, CNVs and/or fusions present in sequence-variable target region sequences.
[0292] In some embodiments, a number of mutations in the sequence-variable target regions chosen from 1, 2, 3, 4, or 5 is sufficient for the first subscore to result in a cancer recurrence score classified as positive for cancer recurrence. In some embodiments, the number of mutations is chosen from 1, 2, or 3.
[0293] In some embodiments, epigenetic target region sequences are obtained, and determining the cancer recurrence score comprises determining a second subscore indicative of the amount of molecules (obtained from the epigenetic target region sequences) that represent an epigenetic state different from DNA found in a corresponding sample from a healthy subject (e.g., DNA, such as cfDNA, found in a blood sample (e.g., a whole blood sample, a buffy coat sample, a leukapheresis sample, or a PBMC sample) from a healthy subject, or DNA found in a tissue sample from a healthy subject where the tissue sample is of the same type of tissue as was obtained from the subject). These abnormal molecules (i.e., molecules with an epigenetic state different from DNA found in a corresponding sample from a healthy subject) may be consistent with epigenetic changes associated with cancer (such as with a metastasis), e.g., methylation of hypermethylation variable target regions and/or perturbed fragmentation of fragmentation variable target regions, where “perturbed” means different from DNA found in a corresponding sample from a healthy subject.
[0294] In some embodiments, a proportion of molecules corresponding to the hypermethylation variable target region set and/or fragmentation variable target region set that indicate hypermethylation in the hypermethylation variable target region set and/or abnormal fragmentation in the fragmentation variable target region set greater than or equal to a value in the range of 0.001%-10% is sufficient for the subscore to be classified as positive for cancer recurrence. The range may be 0.001%-1%, 0.005%-1%, 0.01%-5%, 0.01%-2%, or 0.01%-1%.
[0295] In some embodiments, any of such methods may comprise determining a fraction of tumor DNA from the fraction of molecules in the set of sequence information that indicate one or more features indicative of origination from a tumor cell. This may be done for molecules corresponding to some or all of the target regions, e.g., including one or more of hypermethylation variable target regions, hypomethylation variable target regions, and fragmentation variable target regions (hypermethylation of a hypermethylation variable target region and/or abnormal fragmentation of a fragmentation variable target region may be considered indicative of origination from a tumor cell). This may be done for molecules corresponding to sequence variable target regions, e.g., molecules comprising alterations consistent with cancer, such as SNVs, indels, CNVs, and/or fusions. The fraction of tumor DNA may be determined based on a combination of molecules corresponding to epigenetic target regions and molecules corresponding to sequence-variable target regions.
[0296] Determination of a cancer recurrence score may be based at least in part on the fraction of tumor DNA, wherein a fraction of tumor DNA greater than a threshold in the range of 10⁻¹¹ to 1 or 10⁻¹⁰ to 1 is sufficient for the cancer recurrence score to be classified as positive for cancer recurrence. In some embodiments, a fraction of tumor DNA greater than or equal to a threshold in the range of 10⁻¹⁰ to 10⁻⁹, 10⁻⁹ to 10⁻⁸, 10⁻⁸ to 10⁻⁷, 10⁻⁷ to 10⁻⁶, 10⁻⁶ to 10⁻⁵, 10⁻⁵ to 10⁻⁴, 10⁻⁴ to 10⁻³, 10⁻³ to 10⁻², or 10⁻² to 10⁻¹ is sufficient for the cancer recurrence score to be classified as positive for cancer recurrence. In some embodiments, a fraction of tumor DNA greater than a threshold of at least 10⁻⁷ is sufficient for the cancer recurrence score to be classified as positive for cancer recurrence. A determination that a fraction of tumor DNA is greater than a threshold, such as a threshold corresponding to any of the foregoing embodiments, may be made based on a cumulative probability. For example, the sample may be considered positive if the cumulative probability that the tumor fraction is greater than a threshold in any of the foregoing ranges exceeds a probability threshold of at least 0.5, 0.75, 0.9, 0.95, 0.98, 0.99, 0.995, or 0.999. In some embodiments, the probability threshold is at least 0.95, such as 0.99.
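One way a cumulative-probability determination of this kind might be made is sketched below. The disclosure does not specify the statistical model; the sketch assumes, for illustration only, a binomial likelihood for k tumor-like molecules observed among n molecules and a prior that is uniform over a log-spaced grid of candidate tumor fractions. All function and parameter names are hypothetical.

```python
import math


def prob_tumor_fraction_exceeds(k: int, n: int, threshold: float,
                                lo: float = 1e-11, hi: float = 1.0,
                                grid_points: int = 2001) -> float:
    """Posterior probability that the tumor fraction is above `threshold`.

    Assumptions (not from the disclosure): binomial likelihood and a prior
    uniform over a log-spaced grid of fractions between `lo` and `hi`.
    """
    log_lo, log_hi = math.log10(lo), math.log10(hi)
    step = (log_hi - log_lo) / (grid_points - 1)
    grid = [10 ** (log_lo + step * i) for i in range(grid_points)]

    # Log-likelihood at each grid point; clip below 1 so log1p stays finite.
    log_like = []
    for f in grid:
        f = min(f, 1.0 - 1e-12)
        log_like.append(k * math.log(f) + (n - k) * math.log1p(-f))

    # Normalize in a numerically stable way (subtract the maximum first).
    peak = max(log_like)
    weights = [math.exp(v - peak) for v in log_like]
    total = sum(weights)
    above = sum(w for f, w in zip(grid, weights) if f > threshold)
    return above / total


def is_positive(cumulative_probability: float,
                probability_threshold: float = 0.95) -> bool:
    """Positive when the cumulative probability exceeds the threshold."""
    return cumulative_probability > probability_threshold


p_high = prob_tumor_fraction_exceeds(k=50, n=100_000, threshold=1e-7)
p_low = prob_tumor_fraction_exceeds(k=0, n=1_000_000, threshold=1e-3)
```

In this sketch, 50 tumor-like molecules among 100,000 place essentially all posterior weight above a tumor-fraction threshold of 10⁻⁷, whereas zero tumor-like molecules among 1,000,000 place essentially none above 10⁻³.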
[0297] In some embodiments, the set of sequence information comprises sequence-variable target region sequences and epigenetic target region sequences, and determining the cancer recurrence score comprises determining a subscore indicative of the amount of SNVs, insertions/deletions, CNVs and/or fusions present in sequence-variable target region sequences and a subscore indicative of the amount of abnormal molecules in epigenetic target region sequences, and combining the subscores to provide the cancer recurrence score. Where the subscores are combined, they may be combined by applying a threshold to each subscore independently (e.g., greater than a predetermined number of mutations (e.g., > 1) in sequence-variable target regions, and greater than a predetermined fraction of abnormal molecules (i.e., molecules with an epigenetic state different from the DNA found in a corresponding sample from a healthy subject; e.g., tumor) in epigenetic target regions), or by training a machine learning classifier to determine status based on a plurality of positive and negative training samples.
[0298] In some embodiments, the set of sequence information comprises sequence-variable target region sequences and epigenetic target region sequences, and determining the cancer recurrence score comprises determining a first subscore indicative of the levels of particular immune cell types, a second subscore indicative of the amount of SNVs, insertions/deletions, CNVs and/or fusions present in sequence-variable target region sequences and a third subscore indicative of the amount of abnormal molecules in epigenetic target region sequences, and combining the first, second, and third subscores to provide the cancer recurrence score. Where the subscores are combined, they may be combined by applying a threshold to each subscore independently (e.g., greater than a predetermined number of mutations in sequence-variable target regions, and greater than a predetermined fraction of abnormal molecules (i.e., molecules with an epigenetic state different from the DNA found in a corresponding sample from a healthy subject; e.g., tumor) in epigenetic target regions), or by training a machine learning classifier to determine status based on a plurality of positive and negative training samples.
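The independent-threshold approach to combining subscores may be sketched as follows for the two-subscore case. The default cutoffs (one mutation; an abnormal-molecule fraction of 0.01%) are illustrative values chosen from within the ranges recited above, and whether one or all subscores must pass is left as a parameter because the disclosure does not fix it. The names Subscores and combine_by_thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subscores:
    mutation_count: int       # mutations in sequence-variable target regions
    abnormal_fraction: float  # fraction of abnormal epigenetic molecules


def combine_by_thresholds(s: Subscores,
                          min_mutations: int = 1,
                          min_abnormal_fraction: float = 1e-4,
                          require_all: bool = False) -> bool:
    """Apply a threshold to each subscore independently, then combine.

    require_all=False: any passing subscore yields a positive call.
    require_all=True:  every subscore must pass (illustrative alternative).
    """
    passed = (
        s.mutation_count >= min_mutations,
        s.abnormal_fraction >= min_abnormal_fraction,
    )
    return all(passed) if require_all else any(passed)


positive_any = combine_by_thresholds(Subscores(2, 0.0))
positive_all = combine_by_thresholds(Subscores(2, 0.0), require_all=True)
```

A trained machine learning classifier, the alternative named above, would replace the fixed cutoffs with a decision function learned from positive and negative training samples.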
[0299] In some embodiments, a value for the combined score in the range of -4 to 2 or -3 to 1 is sufficient for the cancer recurrence score to be classified as positive for cancer recurrence.
[0300] In any embodiment where a cancer recurrence score is classified as positive for cancer recurrence, the cancer recurrence status of the subject may be at risk for cancer recurrence and/or the subject may be classified as a candidate for a subsequent cancer treatment. In some embodiments, the cancer is any one of the types of cancer described elsewhere herein, e.g., colorectal cancer.
[0301] Methods of monitoring a cancer in a subject over time; sample collection at two or more time points
[0302] In some embodiments, the present methods can be used to monitor one or more aspects of a condition in a subject over time, such as a subject’s response to receiving a treatment for a condition (such as a response to a chemotherapeutic or immunotherapeutic), the severity of the condition (such as a cancer stage) in the subject, a recurrence of the condition (such as a cancer), and/or the subject’s risk of developing the condition (such as a cancer) and/or to monitor a subject’s health as part of a preventative health monitoring program (such as to determine whether and/or when a subject is in need of further diagnostic screening). In some embodiments, monitoring comprises analysis of at least two samples collected from a subject at two or more different time points as described herein.
[0303] The methods according to the present disclosure can be useful in predicting a subject’s response to a particular treatment option, such as over a period of time. As described elsewhere herein, successful treatment options may increase the amount of cancer-associated DNA sequences detected in a subject's blood because, if the treatment is successful, more cancer cells may die and shed DNA. In such examples, certain treatment options may be correlated with genetic profiles of cancers over time. This correlation may be useful in selecting a therapy. In some embodiments, successful treatment options may result in an increase or decrease in the levels of different immune cell types (including rare immune cell types), and/or an increase or decrease in the levels of a specific protein or proteins and/or a specific DNA sequence (e.g., of a CDR3), such as in the blood, and an unsuccessful treatment may result in no change. In other examples, this may not occur.
[0304] As disclosed herein, in some embodiments, quantities of each of a plurality of cell types, such as immune cell types, are determined based on sequencing and analysis (such as determination of epigenetic and/or genomic signatures) of DNA isolated from at least one sample comprising cells (such as a tissue sample or a blood sample, e.g., a whole blood sample, a buffy coat sample, a leukapheresis sample, or a PBMC sample) from a subject. In some embodiments, differences in levels and/or presence of particular genetic and/or epigenetic signatures in DNA isolated from blood samples from a subject can be used to quantify cell types, such as immune cell types, within the sample. Thus, a comparison of the disclosed genetic and/or epigenetic signatures in DNA isolated from blood samples collected from a subject at two or more time points can be used to monitor changes in cell type quantities in the subject under different conditions (such as prior to and after a treatment), or over time (e.g., as part of a preventative health monitoring program). The disclosed methods can include evaluating (such as quantifying) and/or interpreting cell types (such as immune cell types) present in one or more samples (such as a tissue sample or a blood sample, e.g., a whole blood sample, a buffy coat sample, a leukapheresis sample, or a PBMC sample) collected from a subject at one or more timepoints in comparison to a selected baseline value or reference standard (or a selected set of baseline values or reference standards). A baseline value or reference standard may be a quantity of cell types measured in one or more samples (such as an average quantity or range of quantities of cell types present in at least two samples) collected from the subject at one or more time points, such as prior to receiving a treatment, prior to diagnosis of a condition (such as a cancer), or as part of a preventative health monitoring program. 
A baseline value or reference standard may be a quantity of cell types measured in one or more samples (such as an average quantity or range of quantities of cell types present in at least two samples) collected at one or more timepoints from one or more subjects that do not have the condition (such as a healthy subject that does not have a cancer), one or more subjects that responded favorably to the treatment, or one or more subjects that have not received the treatment. In certain embodiments, the baseline value or reference standard utilized is a standard or profile derived from a single reference subject. In other embodiments, the baseline value or reference standard utilized is a standard or profile derived from averaged data from multiple reference subjects. The reference standard, in various embodiments, can be a single value, a mean, an average, a numerical mean or range of numerical means, a numerical pattern, or a graphical pattern created from the cell type quantity data derived from a single reference subject or from multiple reference subjects. Selection of the particular baseline values or reference standards, or selection of the one or more reference subjects, depends upon the use to which the methods described herein are to be put by, for example, a research scientist or a clinician (such as a physician).
[0305] In some embodiments, one or more samples (such as a tissue sample or a blood sample, e.g., a whole blood sample, a buffy coat sample, a leukapheresis sample, or a PBMC sample) may be collected from a subject at two or more timepoints, to assess changes in cell types (such as changes in quantities of cell types) between the two or more timepoints. In some embodiments, a sample collected at a first time point is a tissue sample or a blood sample, and a sample collected at a subsequent time point (such as a second time point) is a blood sample. In some embodiments, a sample collected at a first time point is a tissue sample and a sample collected at a subsequent time point (such as a second time point) is a blood sample. By monitoring cell types and identifying differences between cell types in samples collected from a subject at two or more timepoints, the present methods can be used, for example, to determine the presence or absence of a condition (such as a cancer), a response of the subject to a treatment, one or more characteristics of a condition (such as a cancer stage) in the subject, recurrence of a condition (such as a cancer), and/or a subject’s risk of developing a condition (such as a cancer). Thus, in some embodiments, methods are provided wherein quantities of cell types present in at least one sample (such as at least one tissue sample and/or at least one blood sample, e.g., a whole blood sample, buffy coat sample, leukapheresis sample, or PBMC sample) collected from a subject at one or more timepoints (such as prior to receiving a treatment) are compared to quantities of cell types present in at least one sample collected from the subject at one or more different time points (such as after receiving the treatment).
The disclosed methods can allow for patient-specific monitoring, such that, for example, differences in cell type quantities between samples collected from the subject at different timepoints may indicate changes (such as presence or absence of a condition, response to a treatment, a prognosis, or the like) that are significant with respect to the subject but may yet fall within a normal range of a general healthy population.
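By way of illustration only, the subject-specific comparison of cell type quantities against a baseline derived from the same subject can be sketched in Python as follows. The function names, the z-score criterion, the threshold of 2.0, and the cell type fractions are assumptions introduced for this sketch and are not taken from the disclosure; any other comparison against a baseline value or reference standard could be substituted.

```python
from statistics import mean, stdev


def baseline_from_samples(samples):
    """Per-cell-type mean and sample standard deviation across baseline samples.

    `samples` is a list of dicts mapping cell type -> measured quantity.
    At least two baseline samples are needed for a standard deviation.
    """
    cell_types = samples[0].keys()
    return {
        ct: (mean(s[ct] for s in samples), stdev(s[ct] for s in samples))
        for ct in cell_types
    }


def flag_changes(baseline, followup, z_threshold=2.0):
    """Return cell types whose follow-up quantity deviates from the subject's
    own baseline by at least `z_threshold` standard deviations (assumed rule)."""
    flagged = {}
    for ct, (mu, sd) in baseline.items():
        if sd == 0:
            continue  # no variation at baseline; z-score undefined
        z = (followup[ct] - mu) / sd
        if abs(z) >= z_threshold:
            flagged[ct] = round(z, 2)
    return flagged


# Hypothetical cell type fractions, for illustration only.
pre_treatment = [
    {"regulatory T cells": 0.040, "NK cells": 0.100},
    {"regulatory T cells": 0.042, "NK cells": 0.110},
    {"regulatory T cells": 0.038, "NK cells": 0.090},
]
post_treatment = {"regulatory T cells": 0.060, "NK cells": 0.105}

baseline = baseline_from_samples(pre_treatment)
changes = flag_changes(baseline, post_treatment)
print(changes)
```

In this sketch a change is flagged relative to the subject's own prior measurements, so a value may be flagged even if it would fall within a normal range of a general healthy population.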
[0306] As disclosed herein, methods are provided for monitoring one or more aspects of a condition in a subject over time, such as but not limited to, a subject’s response to receiving a treatment for a condition (such as a response to a chemotherapeutic or immunotherapeutic). In certain embodiments, one or more samples is collected from the subject at least 1-10, at least 1-5, at least 2-5, or at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 15, or at least 20 time points prior to the subject receiving the treatment. In certain embodiments, one or more samples is collected from the subject at least 1-10, at least 1-5, at least 2-5, or at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 15, or at least 20 time points after the subject has received the treatment. Sample collection from a subject can be ongoing during and/or after treatment to monitor the subject’s response to the treatment.
[0307] In some embodiments, samples are not collected from a subject prior to diagnosis of a condition (such as a cancer) or prior to receiving a treatment. In such embodiments, wherein the response of a subject to a treatment, or the course or stage of a condition (such as a cancer) in the subject is being monitored over time, cell types are compared between samples taken at least 2-10, at least 2-5, at least 3-6, or at least 2, such as at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 15, or at least 20 time points collected after the subject has been diagnosed and/or after the subject has received the treatment. Sample collection from a subject can be ongoing during and/or after treatment to monitor the subject’s response to the treatment.
[0308] In some embodiments of the disclosed methods, one or more samples (such as one or more tissue, whole blood, buffy coat, leukapheresis, or PBMC samples) is collected from a subject at least once per year, such as about 1-12 times or about 2-6 times, such as about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, or 12 times per year. In other embodiments, one or more samples is collected from the subject less than once per year, such as about once every 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, or 24 months. In some embodiments, one or more samples is collected from the subject about once every 1-5 years or about once every 1-2 years, such as about every 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, or 5 years.
[0309] In other embodiments of the disclosed methods, one or more samples (such as one or more tissue samples or blood samples, e.g., one or more buffy coat samples, whole blood samples, leukapheresis samples, or PBMC samples) are collected from a subject at least once per week, such as on 1-4 days, 1-2 days, or on 1, 2, 3, 4, 5, 6, or 7 days per week. In certain embodiments, one or more samples is collected from the subject at least once per month, such as 1-15 times, 1-10 times, 2-5 times, or 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, or 15 times per month. In other embodiments, one or more samples is collected from the subject every month, every 2 months, every 3 months, every 4 months, every 5 months, every 6 months, every 7 months, every 8 months, every 9 months, every 10 months, every 11 months, or every 12 months. In some embodiments, one or more samples is collected from the subject at least once per day, such as 1, 2, 3, 4, 5, or 6 times per day. Selection of the one or more sample collection timepoints (e.g., the frequency of sample collection), or of the number of samples to be collected at each timepoint, depends upon the use to which the methods described herein are to be put by, for example, a research scientist or a clinician (such as a physician).
Therapies and Related Administration
[0310] In certain embodiments, the methods disclosed herein relate to identifying and administering therapies, such as customized therapies, to patients. In some embodiments, determination of the levels of particular immune cell types, including rare immune cell types, facilitates selection of appropriate treatment. In some embodiments, the patient or subject has a given disease, disorder or condition, e.g., any of the cancers or other conditions described elsewhere herein. Essentially any cancer therapy (e.g., surgical therapy, radiation therapy, chemotherapy, immunotherapy, and/or the like) may be included as part of these methods. In certain embodiments, the therapy administered to a subject comprises at least one chemotherapy drug. In some embodiments, the chemotherapy drug may comprise alkylating agents (for example, but not limited to, Chlorambucil, Cyclophosphamide, Cisplatin and Carboplatin), nitrosoureas (for example, but not limited to, Carmustine and Lomustine), anti-metabolites (for example, but not limited to, Fluorouracil, Methotrexate and Fludarabine), plant alkaloids and natural products (for example, but not limited to, Vincristine, Paclitaxel and Topotecan), antitumor antibiotics (for example, but not limited to, Bleomycin, Doxorubicin and Mitoxantrone), hormonal agents (for example, but not limited to, Prednisone, Dexamethasone, Tamoxifen and Leuprolide) and biological response modifiers (for example, but not limited to, Herceptin, Avastin, Erbitux and Rituxan). In some embodiments, the chemotherapy administered to a subject may comprise FOLFOX or FOLFIRI. In certain embodiments, a therapy may be administered to a subject that comprises at least one PARP inhibitor. In some embodiments, the therapies are PARP inhibitors, such as Olaparib (LYNPARZA®), Rucaparib (RUBRACA®), Niraparib (ZEJULA®), and Talazoparib (TALZENNA®).
These may be used for treating cancers having alterations in BRCA1, BRCA2, ATM, BARD1, BRIP1, CDK12, CHEK1, CHEK2, FANCL, PALB2, RAD51B, RAD51C, RAD51D and RAD54L, and/or in other genes associated with homologous recombination repair (HRR). Typically, therapies include at least one immunotherapy (or an immunotherapeutic agent). Immunotherapy refers generally to methods of enhancing an immune response against a given cancer type. In certain embodiments, immunotherapy refers to methods of enhancing a T cell response against a tumor or cancer.
[0311] In some embodiments, therapy is customized based on the status of a nucleic acid variant as being of somatic or germline origin. In some embodiments, essentially any cancer therapy (e.g., surgical therapy, radiation therapy, chemotherapy, immunotherapy, and/or the like) may be included as part of these methods. Customized therapies can include at least one immunotherapy (or an immunotherapeutic agent). Immunotherapy refers generally to methods of enhancing an immune response against a given cancer type. In certain embodiments, immunotherapy refers to methods of enhancing a T cell response against a tumor or cancer.
[0312] In some embodiments, the immunotherapy or immunotherapeutic agent targets an immune checkpoint molecule. Certain tumors are able to evade the immune system by co-opting an immune checkpoint pathway. Thus, targeting immune checkpoints has emerged as an effective approach for countering a tumor’s ability to evade the immune system and activating anti-tumor immunity against certain cancers. Pardoll, Nature Reviews Cancer, 2012, 12:252-264.
[0313] In some embodiments, the treatment comprises immunotherapies and/or immune checkpoint inhibitors (ICIs). Immunotherapies are treatments with one or more agents that act to stimulate the immune system so as to kill or at least to inhibit growth of cancer cells, and preferably to reduce further growth of the cancer, reduce the size of the cancer and/or eliminate the cancer. Some such agents bind to a target present on cancer cells; some bind to a target present on immune cells and not on cancer cells; some bind to a target present on both cancer cells and immune cells. Such agents include, but are not limited to, checkpoint inhibitors and/or antibodies. Checkpoint inhibitors are inhibitors of pathways of the immune system that maintain self-tolerance and modulate the duration and amplitude of physiological immune responses in peripheral tissues to minimize collateral tissue damage (see, e.g., Pardoll, Nature Reviews Cancer 12, 252-264 (2012)). Exemplary agents include antibodies against any of PD-1, PD-2, PD-L1, PD-L2, CTLA-4, OX40, B7.1, B7He, LAG3, CD137, KIR, CCR5, CD27, CD40, or CD47. Other exemplary agents include proinflammatory cytokines, such as IL-1β, IL-6, and TNF-α. Other exemplary agents are T-cells activated against a tumor, such as T-cells activated by expressing a chimeric antigen targeting a tumor antigen recognized by the T-cell. In some embodiments, anti-PD-1 or anti-PD-L1 therapies comprise pembrolizumab (KEYTRUDA®), nivolumab (OPDIVO®), cemiplimab (LIBTAYO®), atezolizumab (TECENTRIQ®), durvalumab (IMFINZI®), and avelumab (BAVENCIO®). These therapies may be used to treat patients identified as having high microsatellite instability (MSI) status or high tumor mutational burden (TMB).
[0314] In certain embodiments, the immune checkpoint molecule is an inhibitory molecule that reduces a signal involved in the T cell response to antigen. For example, CTLA4 is expressed on T cells and plays a role in downregulating T cell activation by binding to CD80 (aka B7.1) or CD86 (aka B7.2) on antigen presenting cells. PD-1 is another inhibitory checkpoint molecule that is expressed on T cells. PD-1 limits the activity of T cells in peripheral tissues during an inflammatory response. In addition, the ligand for PD-1 (PD-L1 or PD-L2) is commonly upregulated on the surface of many different tumors, resulting in the downregulation of anti-tumor immune responses in the tumor microenvironment. In certain embodiments, the inhibitory immune checkpoint molecule is CTLA4 or PD-1. In other embodiments, the inhibitory immune checkpoint molecule is a ligand for PD-1, such as PD-L1 or PD-L2. In other embodiments, the inhibitory immune checkpoint molecule is a ligand for CTLA4, such as CD80 or CD86. In other embodiments, the inhibitory immune checkpoint molecule is lymphocyte activation gene 3 (LAG3), killer cell immunoglobulin like receptor (KIR), T cell membrane protein 3 (TIM3), galectin 9 (GAL9), or adenosine A2a receptor (A2aR).
[0315] Antagonists that target these immune checkpoint molecules can be used to enhance antigen-specific T cell responses against certain cancers. Accordingly, in certain embodiments, the immunotherapy or immunotherapeutic agent is an antagonist of an inhibitory immune checkpoint molecule. In certain embodiments, the inhibitory immune checkpoint molecule is PD-1. In certain embodiments, the inhibitory immune checkpoint molecule is PD-L1. In certain embodiments, the antagonist of the inhibitory immune checkpoint molecule is an antibody (e.g., a monoclonal antibody). In certain embodiments, the antibody or monoclonal antibody is an anti-CTLA4, anti-PD-1, anti-PD-L1, or anti-PD-L2 antibody. In certain embodiments, the antibody is a monoclonal anti-PD-1 antibody. In some embodiments, the antibody is a monoclonal anti-PD-L1 antibody. In certain embodiments, the monoclonal antibody is a combination of an anti-CTLA4 antibody and an anti-PD-1 antibody, an anti-CTLA4 antibody and an anti-PD-L1 antibody, or an anti-PD-L1 antibody and an anti-PD-1 antibody. In certain embodiments, the anti-PD-1 antibody is one or more of pembrolizumab (Keytruda®) or nivolumab (Opdivo®). In certain embodiments, the anti-CTLA4 antibody is ipilimumab (Yervoy®). In certain embodiments, the anti-PD-L1 antibody is one or more of atezolizumab (Tecentriq®), avelumab (Bavencio®), or durvalumab (Imfinzi®).
[0316] In certain embodiments, the immunotherapy or immunotherapeutic agent is an antagonist (e.g., antibody) against CD80, CD86, LAG3, KIR, TIM3, GAL9, or A2aR. In other embodiments, the antagonist is a soluble version of the inhibitory immune checkpoint molecule, such as a soluble fusion protein comprising the extracellular domain of the inhibitory immune checkpoint molecule and an Fc domain of an antibody. In certain embodiments, the soluble fusion protein comprises the extracellular domain of CTLA4, PD-1, PD-L1, or PD-L2. In some embodiments, the soluble fusion protein comprises the extracellular domain of CD80, CD86, LAG3, KIR, TIM3, GAL9, or A2aR. In one embodiment, the soluble fusion protein comprises the extracellular domain of PD-L2 or LAG3.
[0317] In some embodiments, the therapies target mutated forms of the EGFR protein.
Such therapies can include osimertinib (TAGRISSO®), erlotinib (TARCEVA®), and gefitinib (IRESSA®).
[0318] Therapies can include one or more targeted therapies, including abemaciclib (VERZENIO®), abiraterone acetate (ZYTIGA®), acalabrutinib (CALQUENCE®), adagrasib (KRAZATI®), ado-trastuzumab emtansine (KADCYLA®), afatinib dimaleate (GILOTRIF®), alectinib (ALECENSA®), alemtuzumab (CAMPATH®), alitretinoin (PANRETIN®), alpelisib (PIQRAY®), amivantamab-vmjw (RYBREVANT®), anastrozole (ARIMIDEX®), apalutamide (ERLEADA®), asciminib hydrochloride (SCEMBLIX®), atezolizumab (TECENTRIQ®), avapritinib (AYVAKIT®), avelumab (BAVENCIO®), axicabtagene ciloleucel (YESCARTA®), axitinib (INLYTA®), belinostat (BELEODAQ®), belzutifan (WELIREG®), bevacizumab (AVASTIN®), bexarotene (TARGRETIN®), binimetinib (MEKTOVI®), blinatumomab (BLINCYTO®), bortezomib (VELCADE®), bosutinib (BOSULIF®), brentuximab vedotin (ADCETRIS®), brexucabtagene autoleucel (TECARTUS®), brigatinib (ALUNBRIG®), cabazitaxel (JEVTANA®), cabozantinib-s-malate (CABOMETYX®), cabozantinib-s-malate (COMETRIQ®), capmatinib hydrochloride (TABRECTA®), carfilzomib (KYPROLIS®), cemiplimab-rwlc (LIBTAYO®), ceritinib (ZYKADIA®), cetuximab (ERBITUX®), ciltacabtagene autoleucel (CARVYKTI®), cobimetinib fumarate (COTELLIC®), copanlisib hydrochloride (ALIQOPA®), crizotinib (XALKORI®), dabrafenib (TAFINLAR®), dabrafenib mesylate (TAFINLAR®), dacomitinib (VIZIMPRO®), daratumumab (DARZALEX®), daratumumab and hyaluronidase-fihj (DARZALEX FASPRO®), darolutamide (NUBEQA®), dasatinib (SPRYCEL®), denileukin diftitox (ONTAK®), denosumab (XGEVA®), dinutuximab (UNITUXIN®), dostarlimab-gxly (JEMPERLI®), durvalumab (IMFINZI®), duvelisib (COPIKTRA®), elacestrant dihydrochloride (ORSERDU®), elotuzumab (EMPLICITI®), enasidenib mesylate (IDHIFA®), encorafenib (BRAFTOVI®), enfortumab vedotin-ejfv (PADCEV®), entrectinib (ROZLYTREK®), enzalutamide (XTANDI®), erdafitinib (BALVERSA®), erlotinib hydrochloride (TARCEVA®), everolimus (AFINITOR®), exemestane (AROMASIN®), fam-trastuzumab deruxtecan-nxki (ENHERTU®), fedratinib hydrochloride 
(INREBIC®), fulvestrant (FASLODEX®), futibatinib (LYTGOBI®), gefitinib (IRESSA®), gemtuzumab ozogamicin (MYLOTARG®), gilteritinib fumarate (XOSPATA®), glasdegib maleate (DAURISMO®), ibritumomab tiuxetan (ZEVALIN®), ibrutinib (IMBRUVICA®), idecabtagene vicleucel (ABECMA®), idelalisib (ZYDELIG®), imatinib mesylate (GLEEVEC®), infigratinib phosphate (TRUSELTIQ®), inotuzumab ozogamicin (BESPONSA®), iobenguane I 131 (AZEDRA®), ipilimumab (YERVOY®), isatuximab-irfc (SARCLISA®), ivosidenib (TIBSOVO®), ixazomib citrate (NINLARO®), lanreotide acetate (SOMATULINE DEPOT®), lapatinib ditosylate (TYKERB®), larotrectinib sulfate (VITRAKVI®), lenvatinib mesylate (LENVIMA®), letrozole (FEMARA®), lisocabtagene maraleucel (BREYANZI®), loncastuximab tesirine-lpyl (ZYNLONTA®), lorlatinib (LORBRENA®), lutetium Lu 177 vipivotide tetraxetan (PLUVICTO®), lutetium Lu 177-dotatate (LUTATHERA®), margetuximab-cmkb (MARGENZA®), midostaurin (RYDAPT®), mirvetuximab soravtansine-gynx (ELAHERE®), mobocertinib succinate (EXKIVITY®), mogamulizumab-kpkc (POTELIGEO®), mosunetuzumab-axgb (LUNSUMIO®), moxetumomab pasudotox-tdfk (LUMOXITI®), naxitamab-gqgk (DANYELZA®), necitumumab (PORTRAZZA®), neratinib maleate (NERLYNX®), nilotinib (TASIGNA®), niraparib tosylate monohydrate (ZEJULA®), nivolumab (OPDIVO®), nivolumab and relatlimab-rmbw (OPDUALAG®), obinutuzumab (GAZYVA®), ofatumumab (ARZERRA®), olaparib (LYNPARZA®), olutasidenib (REZLIDHIA®), osimertinib mesylate (TAGRISSO®), pacritinib citrate (VONJO®), palbociclib (IBRANCE®), panitumumab (VECTIBIX®), pazopanib hydrochloride (VOTRIENT®), pembrolizumab (KEYTRUDA®), pemigatinib (PEMAZYRE®), pertuzumab (PERJETA®), pertuzumab, trastuzumab, and hyaluronidase-zzxf (PHESGO®), pexidartinib hydrochloride (TURALIO®), pirtobrutinib (JAYPIRCA®), polatuzumab vedotin-piiq (POLIVY®), ponatinib hydrochloride (ICLUSIG®), pralatrexate (FOLOTYN®), pralsetinib (GAVRETO®), radium 223 dichloride (XOFIGO®), ramucirumab (CYRAMZA®), regorafenib (STIVARGA®), 
retifanlimab-dlwr (ZYNYZ®), ribociclib (KISQALI®), ripretinib (QINLOCK®), rituximab (RITUXAN®), rituximab and hyaluronidase human (RITUXAN HYCELA®), romidepsin (ISTODAX®), rucaparib camsylate (RUBRACA®), ruxolitinib phosphate (JAKAFI®), sacituzumab govitecan-hziy (TRODELVY®), selinexor (XPOVIO®), selpercatinib (RETEVMO®), selumetinib sulfate (KOSELUGO®), siltuximab (SYLVANT®), sirolimus protein-bound particles (FYARRO®), sonidegib (ODOMZO®), sorafenib tosylate (NEXAVAR®), sotorasib (LUMAKRAS®), sunitinib malate (SUTENT®), tafasitamab-cxix (MONJUVI®), tagraxofusp-erzs (ELZONRIS®), talazoparib tosylate (TALZENNA®), tamoxifen citrate (SOLTAMOX®), tazemetostat hydrobromide (TAZVERIK®), tebentafusp-tebn (KIMMTRAK®), teclistamab-cqyv (TECVAYLI®), temsirolimus (TORISEL®), tepotinib hydrochloride (TEPMETKO®), tisagenlecleucel (KYMRIAH®), tisotumab vedotin-tftv (TIVDAK®), tivozanib hydrochloride (FOTIVDA®), toremifene (FARESTON®), trametinib (MEKINIST®), trametinib dimethyl sulfoxide (MEKINIST®), trastuzumab (HERCEPTIN®), tremelimumab-actl (IMJUDO®), tretinoin (VESANOID®), tucatinib (TUKYSA®), vandetanib (CAPRELSA®), vemurafenib (ZELBORAF®), venetoclax (VENCLEXTA®), vismodegib (ERIVEDGE®), vorinostat (ZOLINZA®), zanubrutinib (BRUKINSA®), and/or ziv-aflibercept (ZALTRAP®).
[0319] Table 12 provides an exemplary list of drugs used to treat cancers with mutations observed in target genes associated with certain cancer types. In certain embodiments, the subject has a cancer of a type listed in Table 12 including a mutation in one or more target genes listed in Table 12 for that cancer type, and the therapy administered to the subject comprises the drug listed in Table 12 for that cancer type and mutation.
TABLE 12. Exemplary drugs
[0320] In some embodiments, the methods described herein can be used to treat patients by (i) detecting one or more mutations in the one or more target genes listed in Table 6; and (ii) administering the corresponding one or more drugs listed in Table 6. In some embodiments, these therapies may be used alone or in combination with other therapies to treat a disease. In certain embodiments, the immune checkpoint molecule is a co-stimulatory molecule that amplifies a signal involved in a T cell response to an antigen. For example, CD28 is a co-stimulatory receptor expressed on T cells. When a T cell binds to antigen through its T cell receptor, CD28 binds to CD80 (aka B7.1) or CD86 (aka B7.2) on antigen-presenting cells to amplify T cell receptor signaling and promote T cell activation. Because CD28 binds to the same ligands (CD80 and CD86) as CTLA4, CTLA4 is able to counteract or regulate the co-stimulatory signaling mediated by CD28. In certain embodiments, the immune checkpoint molecule is a co-stimulatory molecule selected from CD28, inducible T cell co-stimulator (ICOS), CD137, OX40, or CD27. In other embodiments, the immune checkpoint molecule is a ligand of a co-stimulatory molecule, including, for example, CD80, CD86, B7RP1, B7-H3, B7-H4, CD137L, OX40L, or CD70. Agonists that target these co-stimulatory checkpoint molecules can be used to enhance antigen-specific T cell responses against certain cancers. Accordingly, in certain embodiments, the immunotherapy or immunotherapeutic agent is an agonist of a co-stimulatory checkpoint molecule. In certain embodiments, the agonist of the co-stimulatory checkpoint molecule is an agonist antibody and preferably is a monoclonal antibody. In certain embodiments, the agonist antibody or monoclonal antibody is an anti-CD28 antibody. In other embodiments, the agonist antibody or monoclonal antibody is an anti-ICOS, anti-CD137, anti-OX40, or anti-CD27 antibody. 
In other embodiments, the agonist antibody or monoclonal antibody is an anti-CD80, anti-CD86, anti-B7RP1, anti-B7-H3, anti-B7-H4, anti-CD137L, anti-OX40L, or anti-CD70 antibody.
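By way of illustration only, the step of detecting one or more mutations in target genes and identifying the corresponding drugs can be sketched in Python as a simple lookup. The mapping below is not a reproduction of any table of this disclosure; it is assembled solely from the gene and drug examples given in the preceding paragraphs (EGFR-targeted therapies and PARP inhibitors for HRR-associated genes), and the names `THERAPY_BY_GENE` and `candidate_therapies` are hypothetical.

```python
# Illustrative gene -> drug mapping built only from examples named in the
# text above; it is NOT the content of the referenced table (assumption).
EGFR_THERAPIES = ["osimertinib", "erlotinib", "gefitinib"]
PARP_INHIBITORS = ["olaparib", "rucaparib", "niraparib", "talazoparib"]
HRR_GENES = [
    "BRCA1", "BRCA2", "ATM", "BARD1", "BRIP1", "CDK12", "CHEK1",
    "CHEK2", "FANCL", "PALB2", "RAD51B", "RAD51C", "RAD51D", "RAD54L",
]

THERAPY_BY_GENE = {"EGFR": EGFR_THERAPIES}
for gene in HRR_GENES:
    THERAPY_BY_GENE[gene] = PARP_INHIBITORS


def candidate_therapies(mutated_genes):
    """Map each mutated gene that has an entry to its candidate drugs.

    Genes without an entry are omitted rather than guessed at.
    """
    return {
        gene: list(THERAPY_BY_GENE[gene])
        for gene in mutated_genes
        if gene in THERAPY_BY_GENE
    }


# Hypothetical detection result for one subject.
result = candidate_therapies(["EGFR", "BRCA2", "TP53"])
print(result)
```

A lookup of this kind only proposes candidates; selection and administration of a therapy remain with the clinician.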
[0321] The methods provided herein provide a deeper understanding of the changes in DNA and proteins that cause cancer, allowing the identification of biomarkers and design of treatments that target these proteins. Such treatments may include small-molecule drugs or monoclonal antibodies. In some embodiments, the biomarker may include an epigenetic signature, such as a methylation state, methylation score and/or DNA fragmentation pattern/score. In some embodiments, the epigenetic signature can be determined for one or more regions including, but not limited to, transcription start sites, promoter regions, CTCF binding regions and regulatory protein binding regions. In some embodiments, the epigenetic signature is determined for one or more regions including, but not limited to, transcription start sites, promoter regions, intergenic regions and/or intronic regions that are associated with one or more genes listed in Table 7. The methods may also improve biomarker testing in individuals suffering from disease and help determine if the individual is a candidate for a certain drug or combination of drugs based on the presence or absence of the biomarker. Additionally, the methods can improve identification of mutations that contribute to the development of resistance to targeted therapy. Consequently, the analysis techniques may reduce unnecessary or untimely therapeutic interventions, patient suffering, and patient mortality.
[0322] In certain embodiments, the status of a nucleic acid variant from a sample from a subject as being of somatic or germline origin may be compared with a database of comparator results from a reference population to identify customized or targeted therapies for that subject. Typically, the reference population includes patients with the same cancer or disease type as the subject and/or patients who are receiving, or who have received, the same therapy as the subject. A customized or targeted therapy (or therapies) may be identified when the nucleic acid variant and the comparator results satisfy certain classification criteria (e.g., are a substantial or an approximate match).
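By way of illustration only, the comparison of a subject's variant status against a database of comparator results can be sketched in Python as follows. The record fields, the exact-match classification criterion (same cancer type, same gene, same somatic or germline origin), the ranking by frequency, and all data values are assumptions made for this sketch; the disclosure does not prescribe a particular data model or matching rule.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ComparatorRecord:
    """One comparator result from a reference population (hypothetical schema)."""
    cancer_type: str
    gene: str
    origin: str   # "somatic" or "germline"
    therapy: str


def matching_therapies(subject, reference_db):
    """Return therapies from comparator records that match the subject,
    ordered from most to least frequent among the matching records."""
    key = (subject["cancer_type"], subject["gene"], subject["origin"])
    counts = Counter(
        rec.therapy
        for rec in reference_db
        if (rec.cancer_type, rec.gene, rec.origin) == key
    )
    return [therapy for therapy, _ in counts.most_common()]


# Placeholder data, for illustration only.
reference_db = [
    ComparatorRecord("cancer type X", "GENE1", "somatic", "therapy A"),
    ComparatorRecord("cancer type X", "GENE1", "somatic", "therapy A"),
    ComparatorRecord("cancer type X", "GENE1", "somatic", "therapy B"),
    ComparatorRecord("cancer type X", "GENE1", "germline", "therapy C"),
    ComparatorRecord("cancer type Y", "GENE1", "somatic", "therapy D"),
]
subject = {"cancer_type": "cancer type X", "gene": "GENE1", "origin": "somatic"}

ranked = matching_therapies(subject, reference_db)
print(ranked)
```

An approximate match could be implemented by relaxing the comparison key, for example by matching on gene and origin only.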
[0323] In certain embodiments, the customized therapies described herein are typically administered parenterally (e.g., intravenously or subcutaneously). Pharmaceutical compositions containing an immunotherapeutic agent are typically administered intravenously. Certain therapeutic agents are administered orally. However, customized therapies (e.g., immunotherapeutic agents, etc.) may also be administered by any method known in the art, for example, buccal, sublingual, rectal, vaginal, intraurethral, topical, intraocular, intranasal, and/or intraauricular, which administration may include tablets, capsules, granules, aqueous suspensions, gels, sprays, suppositories, salves, ointments, or the like.
[0324] In some embodiments, therapy is customized based on the status of a nucleic acid variant as being of somatic or germline origin. In some embodiments, determination of the levels of particular cell types, e.g., immune cell types, including rare immune cell types, facilitates selection of appropriate treatment.
[0325] The present methods can be used to diagnose the presence of a condition, e.g., cancer or precancer, in a subject, to characterize a condition (such as to determine a cancer stage or heterogeneity of a cancer), to monitor a subject’s response to receiving a treatment for a condition (such as a response to a chemotherapeutic or immunotherapeutic), to assess prognosis of a subject (such as to predict a survival outcome in a subject having a cancer), to determine a subject’s risk of developing a condition, to predict a subsequent course of a condition in a subject, to determine metastasis or recurrence of a cancer in a subject (or a risk of cancer metastasis or recurrence), and/or to monitor a subject’s health as part of a preventative health monitoring program (such as to determine whether and/or when a subject is in need of further diagnostic screening). The methods according to the present disclosure can also be useful in predicting a subject’s response to a particular treatment option. Successful treatment options may increase the amount of copy number variation, rare mutations, and/or cancer-related epigenetic signatures (such as hypermethylated regions or hypomethylated regions) detected in a subject’s blood (such as in DNA isolated from a buffy coat sample or any other sample comprising cells, such as a blood sample (e.g., a whole blood sample, a buffy coat sample, a leukapheresis sample, or a PBMC sample) from the subject), because more cancer cells may die and shed DNA when the treatment is successful. Alternatively, a successful treatment may result in an increase or decrease in the quantity of a specific immune cell type in the blood, whereas an unsuccessful treatment results in no change. In other examples, this may not occur. In another example, certain treatment options may be correlated with genetic profiles of cancers over time. This correlation may be useful in selecting a therapy for a subject. 
In some embodiments, determination of the metastasis site facilitates selection of appropriate treatment.
[0326] Thus, in some embodiments, quantities of each of one or more of a particular genetic and/or epigenetic signature (e.g., quantities of fusions, indels, SNPs, CNVs, and/or rare mutations, and/or cancer-related epigenetic signatures (such as specific (e.g., DMRs) or global hypermethylated or hypomethylated regions, and/or fragmentation variable regions)) in DNA from a subject’s blood (such as in DNA (e.g., cfDNA) isolated from a blood sample (e.g., a whole blood sample) from the subject) are determined based on sequencing and analysis. In some embodiments, quantities of each of a plurality of cell types, such as immune cell types, are determined based on sequencing and analysis (such as determination of epigenetic and/or genomic signatures) of DNA isolated from at least one sample comprising cells (such as a blood sample, e.g., a whole blood sample, a buffy coat sample, a leukapheresis sample, or a PBMC sample) from a subject. The plurality of immune cell types can include, but is not limited to, macrophages (including M1 macrophages and M2 macrophages); activated B cells (including regulatory B cells, memory B cells and plasma cells); T cell subsets, such as central memory T cells, naive-like T cells, and activated T cells (including cytotoxic T cells, regulatory T cells (Tregs), CD4 effector memory T cells, CD4 central memory T cells, CD8 effector memory T cells, and CD8 central memory T cells); immature myeloid cells (including myeloid-derived suppressor cells (MDSCs), low-density neutrophils, immature neutrophils, and immature granulocytes); and natural killer (NK) cells. As disclosed herein, differences in levels and/or presence of particular genetic and/or epigenetic signatures in DNA isolated from blood samples from a subject can be used to quantify cell types, such as immune cell types, within the sample. 
Thus, a comparison of one or more genetic and/or epigenetic signatures in DNA isolated from blood samples collected from a subject at two or more time points can be used to monitor changes in the one or more signatures and/or the one or more cell type quantities in the subject under different conditions (such as prior to and after a treatment), or over time (e.g., as part of a preventative health monitoring program).
[0327] In some embodiments, therapy is customized based on the status of a detected nucleic acid variant as being of somatic or germline origin. In some embodiments, essentially any cancer therapy (e.g., surgical therapy, radiation therapy, chemotherapy, and/or the like) may be included as part of these methods. Typically, customized therapies include at least one immunotherapy (or an immunotherapeutic agent). Immunotherapy refers generally to methods of enhancing an immune response against a given cancer type. In certain embodiments, immunotherapy refers to methods of enhancing a T cell response against a tumor or cancer. Therapies can function by helping the immune system destroy cancer cells. For example, certain targeted therapies may mark cancer cells so that the immune system can destroy them. Other targeted therapies may help the immune system work more effectively against cancer. Yet other therapies may stop cancer cells from growing, for example, by interfering with cancer cell surface markers and thereby preventing the cells from dividing. Additionally, therapies can inhibit signals that promote angiogenesis. Such angiogenesis inhibitors prevent blood supply to the tumor, thereby preventing tumor growth. Other targeted therapies can deliver toxic substances to the tumor. Examples include monoclonal antibodies combined with toxins, chemotherapy, or radiation. Some targeted therapies induce apoptosis or deprive the cancer of hormones.
[0328] In certain embodiments, the status of a nucleic acid variant from a sample from a subject as being of somatic or germline origin may be compared with a database of comparator results from a reference population to identify customized or targeted therapies for that subject. Typically, the reference population includes patients with the same cancer or disease type as the subject and/or patients who are receiving, or who have received, the same therapy as the subject. A customized or targeted therapy (or therapies) may be identified when the nucleic acid variant and the comparator results satisfy certain classification criteria (e.g., are a substantial or an approximate match).
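One way such a comparator lookup might be sketched in Python is shown below. The record layout (`ComparatorRecord`), the function `candidate_therapies`, and the placeholder labels such as `GENE_A` and `therapy_1` are all assumptions made for illustration. The sketch also uses an exact match on gene, origin, and disease type as the classification criterion, which is a simplification of the "substantial or approximate match" contemplated above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ComparatorRecord:
    """One hypothetical comparator result from a reference population."""
    gene: str
    origin: str        # "somatic" or "germline"
    disease_type: str
    therapy: str


def candidate_therapies(gene: str, origin: str, disease_type: str,
                        database: List[ComparatorRecord]) -> List[str]:
    """Collect therapies from records matching the subject's variant (exact match assumed)."""
    found: List[str] = []
    for record in database:
        matches = (record.gene, record.origin, record.disease_type) == (gene, origin, disease_type)
        if matches and record.therapy not in found:
            found.append(record.therapy)  # keep first-seen order, skip duplicates
    return found


# Placeholder database entries (illustrative only).
database = [
    ComparatorRecord("GENE_A", "somatic", "disease_X", "therapy_1"),
    ComparatorRecord("GENE_A", "germline", "disease_X", "therapy_2"),
    ComparatorRecord("GENE_A", "somatic", "disease_X", "therapy_3"),
    ComparatorRecord("GENE_A", "somatic", "disease_X", "therapy_1"),
]

therapies = candidate_therapies("GENE_A", "somatic", "disease_X", database)
print(therapies)
```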
[0329] The disclosed methods can include evaluating (such as quantifying) and/or interpreting at least one cell material released from a potential metastasis site (such as at least one cell material in a sample from a subject) and/or cell types that contribute to DNA, such as cfDNA, in one or more samples collected from a subject at one or more timepoints in comparison to a selected baseline value or reference standard (or a selected set of baseline values or reference standards). A baseline value or reference standard may be a presence or level of at least one cell material and/or a quantity of cell types measured in one or more samples (such as an average quantity or range of quantities of cell types present in at least two samples) collected from the subject at one or more time points, such as prior to receiving a treatment, prior to diagnosis of a condition (such as a cancer), or as part of a preventative health monitoring program. A baseline value or reference standard may be a presence or level of at least one cell material and/or a quantity of cell types measured with respect to one or more samples (such as an average quantity or range of quantities of cell types present in at least two samples) collected at one or more timepoints from one or more subjects that do not have the condition (such as a healthy subject that does not have a cancer), one or more subjects that responded favorably to the treatment, or one or more subjects that have not received the treatment. In certain embodiments, the baseline value or reference standard utilized is a standard or profile derived from a single reference subject. In other embodiments, the baseline value or reference standard utilized is a standard or profile derived from averaged data from multiple reference subjects. The reference standard, in various embodiments, can be a single value, a mean, an average, a numerical mean or range of numerical means, a numerical pattern, or a graphical pattern created from the cell type quantity data derived from a single reference subject or from multiple reference subjects. Selection of the particular baseline values or reference standards, or selection of the one or more reference subjects, depends upon the use to which the methods described herein are to be put by, for example, a research scientist or a clinician (such as a physician).
[0330] The disclosed methods can include evaluating (such as quantifying) and/or interpreting one or more genetic and/or epigenetic signatures, and/or one or more cell types (such as one or more immune cell types), present in one or more samples (e.g., in DNA, such as cfDNA, from a blood sample (e.g., a whole blood sample, a buffy coat sample, a leukapheresis sample, or a PBMC sample)) collected from a subject at one or more timepoints in comparison to a selected baseline value or reference standard (or a selected set of baseline values or reference standards). A baseline value or reference standard may be a quantity of copy number variation, rare mutations, cancer-related epigenetic signatures (such as hypermethylated regions or hypomethylated regions), and/or cell types measured in one or more samples (such as an average quantity or range of quantities of such signatures present in at least two samples) collected from the subject at one or more time points, such as prior to receiving a treatment, prior to diagnosis of a condition (such as a cancer), or as part of a preventative health monitoring program. A baseline value or reference standard may be a quantity of, e.g., copy number variation, rare mutations, cancer-related epigenetic signatures (such as hypermethylated regions or hypomethylated regions), and/or cell types measured in one or more samples (such as an average quantity or range of quantities of such signatures and/or cell types present in at least two samples) collected at one or more timepoints from one or more subjects that do not have the condition (such as a healthy subject that does not have a cancer), one or more subjects that responded favorably to the treatment, or one or more subjects that have not received the treatment.
[0331] In certain embodiments, the baseline value or reference standard utilized is a standard or profile derived from a single reference subject. In other embodiments, the baseline value or reference standard utilized is a standard or profile derived from averaged data from multiple reference subjects. The reference standard, in various embodiments, can be a single value, a mean, an average, a numerical mean or range of numerical means, a numerical pattern, or a graphical pattern created from the genetic and/or epigenetic signature quantity data derived from a single reference subject or from multiple reference subjects. Selection of the particular baseline values or reference standards, or selection of the one or more reference subjects, depends upon the use to which the methods described herein are to be put by, for example, a research scientist or a clinician (such as a physician).
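To illustrate how a reference standard expressed as a mean and a range might be built and applied, the following Python sketch derives both from a set of reference quantities and classifies a new measurement against the range. The helper names `build_reference` and `compare_to_reference` and the numeric values are hypothetical; the disclosure does not prescribe this particular form.

```python
from statistics import mean
from typing import Dict, Sequence


def build_reference(quantities: Sequence[float]) -> Dict[str, float]:
    """Summarize reference quantities as a mean and an observed range."""
    if not quantities:
        raise ValueError("at least one reference quantity is required")
    return {"mean": mean(quantities), "low": min(quantities), "high": max(quantities)}


def compare_to_reference(value: float, reference: Dict[str, float]) -> str:
    """Report whether a measured quantity is below, within, or above the reference range."""
    if value < reference["low"]:
        return "below"
    if value > reference["high"]:
        return "above"
    return "within"


# Hypothetical quantities from three reference subjects (illustrative only).
reference = build_reference([10.0, 12.0, 14.0])
print(reference["mean"], compare_to_reference(15.0, reference))
```

The same helpers could be applied to quantities from a single reference subject or from the subject's own earlier samples, depending on which baseline is selected.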
[0332] In some embodiments, one or more samples comprising cells (such as a buffy coat sample or any other sample comprising cells, such as a blood sample (e.g., a whole blood sample, a leukapheresis sample, or a PBMC sample)) may be collected from a subject at two or more timepoints to assess changes in cell types (such as changes in quantities of cell types) between the timepoints. By monitoring cell types and identifying differences between cell types in samples collected from a subject at two or more timepoints, the present methods can be used, for example, to determine the presence or absence of a condition (such as a cancer), a response of the subject to a treatment, one or more characteristics of a condition (such as a cancer stage) in the subject, recurrence of a condition (such as a cancer), and/or a subject’s risk of developing a condition (such as a cancer). Thus, in some embodiments, methods are provided wherein quantities of cell types present in at least one sample (such as at least one whole blood sample, buffy coat sample, leukapheresis sample, or PBMC sample) collected from a subject at one or more timepoints (such as prior to receiving a treatment) are compared to quantities of cell types present in at least one sample collected from the subject at one or more different time points (such as after receiving the treatment). The disclosed methods can allow for patient-specific monitoring, such that, for example, differences in cell type quantities between samples collected from the subject at different timepoints may indicate changes (such as presence or absence of a condition, response to a treatment, a prognosis, or the like) that are significant with respect to the subject but may yet fall within a normal range of a general healthy population.
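The patient-specific monitoring idea above can be sketched in Python as follows. This is only one possible approach under stated assumptions: the function `assess_change`, the z-score criterion, the threshold of 2.0, and all numeric values are hypothetical choices for illustration and are not taken from the disclosure.

```python
from statistics import mean, stdev
from typing import Dict, Sequence, Tuple


def assess_change(own_baseline: Sequence[float], new_value: float,
                  population_range: Tuple[float, float],
                  z_threshold: float = 2.0) -> Dict[str, bool]:
    """Compare a new cell type quantity with the subject's own baseline and a population range."""
    if len(own_baseline) < 2:
        raise ValueError("at least two baseline samples are needed")
    center = mean(own_baseline)
    spread = stdev(own_baseline)  # sample standard deviation of the subject's baseline
    if spread == 0:
        differs = new_value != center
    else:
        differs = abs(new_value - center) / spread > z_threshold
    low, high = population_range
    return {
        "differs_from_own_baseline": differs,
        "within_population_range": low <= new_value <= high,
    }


# Hypothetical quantities (illustrative only): the new value departs from the
# subject's own baseline yet stays inside the assumed healthy-population range.
result = assess_change([20.0, 21.0, 19.0, 20.0], 26.0, (10.0, 40.0))
print(result)
```

In this example both flags are true, which corresponds to a change that is notable for the individual subject even though it would not stand out against a general healthy population.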
[0333] In some embodiments, methods are provided for monitoring a response (such as a change in disease state, such as a presence or absence of a metastasis in a subject, such as measured by assessing a presence or level of at least one cell material released from a potential metastasis site in a sample from the subject) of a subject to a treatment (such as a chemotherapy or an immunotherapy). In certain embodiments, one or more samples are collected from the subject at least 1-10, at least 1-5, at least 2-5, or at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 15, or at least 20 time points prior to the subject receiving the treatment. In certain embodiments, one or more samples are collected from the subject at least 1-10, at least 1-5, at least 2-5, or at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 15, or at least 20 time points after the subject has received the treatment. Sample collection from a subject can be ongoing during and/or after treatment to monitor the subject’s response to the treatment.
[0334] In some embodiments, samples are not collected from a subject prior to diagnosis of a condition (such as a cancer) or prior to receiving a treatment. In such embodiments, wherein the response of a subject to a treatment or the course or stage of a condition (such as a cancer) in the subject is being monitored over time, genetic and/or epigenetic signatures, and/or cell types are compared between samples taken at least 2-10, at least 2-5, at least 3-6, or at least 2, such as at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 15, or at least 20 time points collected after the subject has been diagnosed and/or after the subject has received the treatment. Sample collection from a subject can be ongoing during and/or after treatment to monitor the subject’s response to the treatment.
[0335] In some embodiments of the disclosed methods, one or more samples are collected from a subject at least once per year, such as about 1-12 times or about 2-6 times, such as about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, or 12 times per year. In other embodiments, one or more samples are collected from the subject less than once per year, such as about once every 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, or 24 months. In some embodiments, one or more samples are collected from the subject about once every 1-5 years or about once every 1-2 years, such as about every 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, or 5 years.
[0336] In other embodiments of the disclosed methods, one or more samples (such as one or more whole blood, buffy coat, leukapheresis, or PBMC samples) are collected from a subject at least once per week, such as on 1-4 days, 1-2 days, or on 1, 2, 3, 4, 5, 6, or 7 days per week. In certain embodiments, one or more samples are collected from the subject at least once per month, such as 1-15 times, 1-10 times, 2-5 times, or 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, or 15 times per month. In other embodiments, one or more samples are collected from the subject every month, every 2 months, every 3 months, every 4 months, every 5 months, every 6 months, every 7 months, every 8 months, every 9 months, every 10 months, every 11 months, or every 12 months. In some embodiments, one or more samples are collected from the subject at least once per day, such as 1, 2, 3, 4, 5, or 6 times per day. Selection of the one or more sample collection timepoints (e.g., the frequency of sample collection), or of the number of samples to be collected at each timepoint, depends upon the use to which the methods described herein are to be put by, for example, a research scientist or a clinician (such as a physician).
[0337] In certain embodiments, the customized therapies described herein are typically administered parenterally (e.g., intravenously or subcutaneously). Pharmaceutical compositions containing an immunotherapeutic agent are typically administered intravenously. Certain therapeutic agents are administered orally. However, customized therapies (e.g., immunotherapeutic agents, etc.) may also be administered by methods such as, for example, buccal, sublingual, rectal, vaginal, intraurethral, topical, intraocular, intranasal, and/or intraauricular, which administration may include tablets, capsules, granules, aqueous suspensions, gels, sprays, suppositories, salves, ointments, or the like.
[0338] Therapeutic options for treating specific genetic-based diseases, disorders, or conditions, other than cancer, are generally well-known to those of ordinary skill in the art and will be apparent given the particular disease, disorder, or condition under consideration.
[0339] Computer systems
[0340] Methods of the present disclosure can be implemented using, or with the aid of, computer systems. FIG. 1 shows a computer system 101 that is programmed or otherwise configured to implement the methods of the present disclosure. The computer system 101 can regulate various aspects of sample preparation, sequencing, and/or analysis. In some examples, the computer system 101 is configured to perform sample preparation and sample analysis, including (where applicable) nucleic acid sequencing, e.g., according to any of the methods disclosed herein.
[0341] The computer system 101 includes a central processing unit (CPU, also "processor" and "computer processor" herein) 105, which can be a single-core or multi-core processor, or a plurality of processors for parallel processing. The computer system 101 also includes memory or memory location 110 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 115 (e.g., hard disk), communication interface 120 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 125, such as cache, other memory, data storage, and/or electronic display adapters. The memory 110, storage unit 115, interface 120, and peripheral devices 125 are in communication with the CPU 105 through a communication network or bus (solid lines), such as a motherboard. The storage unit 115 can be a data storage unit (or data repository) for storing data. The computer system 101 can be operatively coupled to a computer network 130 with the aid of the communication interface 120. The computer network 130 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet. The computer network 130 in some cases is a telecommunication and/or data network. The computer network 130 can include one or more computer servers, which can enable distributed computing, such as cloud computing. The computer network 130, in some cases with the aid of the computer system 101, can implement a peer-to-peer network, which may enable devices coupled to the computer system 101 to behave as a client or a server.
[0342] The CPU 105 can execute a sequence of machine-readable instructions, which can be embodied in a program or software. The instructions may be stored in a memory location, such as the memory 110. Examples of operations performed by the CPU 105 can include fetch, decode, execute, and writeback.
[0343] The storage unit 115 can store files, such as drivers, libraries, and saved programs. The storage unit 115 can store programs generated by users and recorded sessions, as well as output(s) associated with the programs. The storage unit 115 can store user data, e.g., user preferences and user programs. The computer system 101 in some cases can include one or more additional data storage units that are external to the computer system 101, such as located on a remote server that is in communication with the computer system 101 through an intranet or the Internet. Data may be transferred from one location to another using, for example, a communication network or physical data transfer (e.g., using a hard drive, thumb drive, or other data storage mechanism).
[0344] The computer system 101 can communicate with one or more remote computer systems through the network 130. For example, the computer system 101 can communicate with a remote computer system of a user (e.g., operator). Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PCs (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, smartphones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants. The user can access the computer system 101 via the network 130.
[0345] Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 101, such as, for example, on the memory 110 or electronic storage unit 115. The machine executable or machine-readable code can be provided in the form of software. During use, the code can be executed by the processor 105. In some cases, the code can be retrieved from the storage unit 115 and stored on the memory 110 for ready access by the processor 105. In some situations, the electronic storage unit 115 can be precluded, and machine-executable instructions are stored on memory 110.
[0346] In an aspect, the present disclosure provides a non-transitory computer-readable medium comprising computer-executable instructions which, when executed by at least one electronic processor, perform at least a portion of a method described herein. For example, the method may comprise a method of determining a likelihood of whether a subject has a disease, comprising: (i) providing a sample from the subject and analyzing the sample to quantify a level of one or more biomarkers in the sample to provide biomarker data associated with the subject; (ii) obtaining imaging data of the subject, wherein the imaging data comprise data obtained from one or more imaging modalities; and (iii) performing a combined analysis of the biomarker data and the imaging data using a computer processor to provide a likelihood of whether the subject has the disease; wherein step (i) can be performed before, after, or at the same time as step (ii).
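As a hedged illustration of step (iii), the Python sketch below combines biomarker data and imaging data into a single likelihood using a logistic model. The logistic form, the function name `combined_likelihood`, the feature names, and the weight and bias values are all assumptions made for this example; in practice such parameters would come from a trained model, and the disclosure does not limit the combined analysis to this technique.

```python
import math
from typing import Dict


def combined_likelihood(biomarker_data: Dict[str, float],
                        imaging_data: Dict[str, float],
                        weights: Dict[str, float],
                        bias: float) -> float:
    """Map biomarker and imaging features to a probability with a logistic function."""
    # Merge the two data sources; feature names are assumed not to collide.
    features = {**biomarker_data, **imaging_data}
    score = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical features and parameters (illustrative only, not a trained model).
biomarker_data = {"methylation_score": 1.0}
imaging_data = {"nodule_diameter_mm": 10.0}
weights = {"methylation_score": 2.0, "nodule_diameter_mm": 0.1}
bias = -3.0

likelihood = combined_likelihood(biomarker_data, imaging_data, weights, bias)
print(round(likelihood, 3))
```

Because the function takes the two data sources as separate arguments, the order in which the biomarker data and the imaging data were obtained does not affect the result, consistent with step (i) being performed before, after, or at the same time as step (ii).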
[0347] The code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code or can be compiled during runtime. The code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.
[0348] Aspects of the systems and methods provided herein, such as the computer system 101, can be embodied in programming. Various aspects of the technology may be thought of as "products" or "articles of manufacture" typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine-readable medium. Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. "Storage" type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming.
[0349] All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server. Thus, another type of media that may bear the software elements includes optical, electrical, and electromagnetic waves, such as those used across physical interfaces between local devices, through wired and optical landline networks, and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links, or the like, also may be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible "storage" media, terms such as computer or machine "readable medium" refer to any medium that participates in providing instructions to a processor for execution.
[0350] Hence, a machine-readable medium, such as computer-executable code, may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings. Volatile storage media include dynamic memory, such as main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
[0351] The computer system 101 can include or be in communication with an electronic display that comprises a user interface (UI) for providing, for example, one or more results of sample analysis. Examples of UIs include, without limitation, a graphical user interface (GUI) and web-based user interface.
[0352] Additional details relating to computer systems and networks, databases, and computer program products are also provided in, for example, Peterson, Computer Networks: A Systems Approach, Morgan Kaufmann, 5th Ed. (2011), Kurose, Computer Networking: A Top-Down Approach, Pearson, 7th Ed. (2016), Elmasri, Fundamentals of Database Systems, Addison Wesley, 6th Ed. (2010), Coronel, Database Systems: Design, Implementation, & Management, Cengage Learning, 11th Ed. (2014), Tucker, Programming Languages, McGraw-Hill Science/Engineering/Math, 2nd Ed. (2006), and Rhoton, Cloud Computing Architected: Solution Design Handbook, Recursive Press (2011), each of which is hereby incorporated by reference in its entirety.
[0353] While preferred embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. It is not intended that the disclosure be limited by the specific examples provided within the specification. While the disclosure has been described with reference to the aforementioned specification, the descriptions and illustrations of the embodiments herein are not meant to be construed in a limiting sense. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the disclosure. Furthermore, it shall be understood that all aspects of the disclosure are not limited to the specific depictions, configurations or relative proportions set forth herein which depend upon a variety of conditions and variables. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure. It is therefore contemplated that the disclosure shall also cover any such alternatives, modifications, variations or equivalents. It is intended that the following claims define the scope of the disclosure and that methods and structures within the scope of these claims and their equivalents be covered thereby.
[0354] While the foregoing disclosure has been described in some detail by way of illustration and example for purposes of clarity and understanding, it will be clear to one of ordinary skill in the art from a reading of this disclosure that various changes in form and detail can be made without departing from the true scope of the disclosure and may be practiced within the scope of the appended claims. For example, all the methods, systems, computer readable media, and/or component features, steps, elements, or other aspects thereof can be used in various combinations.