Genomics is an interdisciplinary field of molecular biology focusing on the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete set of DNA, including all of its genes as well as its hierarchical, three-dimensional structural configuration.[1][2][3][4] In contrast to genetics, which refers to the study of individual genes and their roles in inheritance, genomics aims at the collective characterization and quantification of all of an organism's genes, their interrelations and influence on the organism.[5] Genes may direct the production of proteins with the assistance of enzymes and messenger molecules. In turn, proteins make up body structures such as organs and tissues, control chemical reactions, and carry signals between cells. Genomics also involves the sequencing and analysis of genomes through the use of high-throughput DNA sequencing and bioinformatics to assemble and analyze the function and structure of entire genomes.[6][7] Advances in genomics have triggered a revolution in discovery-based research and systems biology to facilitate understanding of even the most complex biological systems such as the brain.[8]
The field also includes studies of intragenomic (within the genome) phenomena such as epistasis (effect of one gene on another), pleiotropy (one gene affecting more than one trait), heterosis (hybrid vigour), and other interactions between loci and alleles within the genome.[9]
The term derives from the Greek ΓΕΝ[10] gen, "gene" (gamma, epsilon, nu, epsilon), meaning "become, create, creation, birth", and subsequent variants: genealogy, genesis, genetics, genic, genomere, genotype, genus etc. While the word genome (from the German Genom, attributed to Hans Winkler) was in use in English as early as 1926,[11] the term genomics was coined by Tom Roderick, a geneticist at the Jackson Laboratory (Bar Harbor, Maine), over beers with James E. Womack, Tom Shows and Stephen O'Brien at a meeting held in Maryland on the mapping of the human genome in 1986.[12] It was used first as the name for a new journal and then for a whole new scientific discipline.[13]
Frederick Sanger and Walter Gilbert shared half of the 1980 Nobel Prize in Chemistry for independently developing methods for the sequencing of DNA.
In addition to his seminal work on the amino acid sequence of insulin, Frederick Sanger and his colleagues played a key role in the development of DNA sequencing techniques that enabled the establishment of comprehensive genome sequencing projects.[9] In 1975, he and Alan Coulson published a sequencing procedure using DNA polymerase with radiolabelled nucleotides that he called the Plus and Minus technique.[21][22] This involved two closely related methods that generated short oligonucleotides with defined 3' termini. These could be fractionated by electrophoresis on a polyacrylamide gel (called polyacrylamide gel electrophoresis) and visualised using autoradiography. The procedure could sequence up to 80 nucleotides in one go and was a big improvement, but was still very laborious. Nevertheless, in 1977 his group was able to sequence most of the 5,386 nucleotides of the single-stranded bacteriophage φX174, completing the first fully sequenced DNA-based genome.[23] The refinement of the Plus and Minus method resulted in the chain-termination, or Sanger, method (see below), which formed the basis of the techniques of DNA sequencing, genome mapping, data storage, and bioinformatic analysis most widely used in the following quarter-century of research.[24][25] In the same year, Walter Gilbert and Allan Maxam of Harvard University independently developed the Maxam-Gilbert method (also known as the chemical method) of DNA sequencing, a less efficient method involving the preferential cleavage of DNA at known bases.[26][27] For their groundbreaking work in the sequencing of nucleic acids, Gilbert and Sanger shared half the 1980 Nobel Prize in Chemistry with Paul Berg (recombinant DNA).
The advent of these technologies resulted in a rapid intensification in the scope and speed of completion of genome sequencing projects. The first complete genome sequence of a eukaryotic organelle, the human mitochondrion (16,568 bp, about 16.6 kb [kilobase]), was reported in 1981,[28] and the first chloroplast genomes followed in 1986.[29][30] In 1992, the first eukaryotic chromosome, chromosome III of brewer's yeast Saccharomyces cerevisiae (315 kb), was sequenced.[31] The first free-living organism to be sequenced was Haemophilus influenzae (1.8 Mb [megabase]) in 1995.[32] The following year a consortium of researchers from laboratories across North America, Europe, and Japan announced the completion of the first complete genome sequence of a eukaryote, S. cerevisiae (12.1 Mb), and since then genomes have continued being sequenced at an exponentially growing pace.[33] As of October 2011, complete sequences are available for 2,719 viruses, 1,115 archaea and bacteria, and 36 eukaryotes, of which about half are fungi.[34][35]
The number of genome projects has increased as technological improvements continue to lower the cost of sequencing. (A) Exponential growth of genome sequence databases since 1995. (B) The cost in US Dollars (USD) to sequence one million bases. (C) The cost in USD to sequence a 3,000 Mb (human-sized) genome on a log-transformed scale.
A rough draft of the human genome was completed by the Human Genome Project in early 2001, creating much fanfare.[41] This project, completed in 2003, sequenced the entire genome for one specific person, and by 2007 this sequence was declared "finished" (less than one error in 20,000 bases and all chromosomes assembled).[41] In the years since then, the genomes of many other individuals have been sequenced, partly under the auspices of the 1000 Genomes Project, which announced the sequencing of 1,092 genomes in October 2012.[42] Completion of this project was made possible by the development of dramatically more efficient sequencing technologies and required the commitment of significant bioinformatics resources from a large international collaboration.[43] The continued analysis of human genomic data has profound political and social repercussions for human societies.[44]
The English-language neologism omics informally refers to a field of study in biology ending in -omics, such as genomics, proteomics or metabolomics. The related suffix -ome is used to address the objects of study of such fields, such as the genome, proteome, or metabolome (lipidome) respectively. The suffix -ome as used in molecular biology refers to a totality of some sort; similarly, omics has come to refer generally to the study of large, comprehensive biological data sets. While the growth in the use of the term has led some scientists (Jonathan Eisen, among others[45]) to claim that it has been oversold,[46] it reflects the change in orientation towards the quantitative analysis of the complete or near-complete assortment of all the constituents of a system.[47] In the study of symbioses, for example, researchers who were once limited to the study of a single gene product can now simultaneously compare the total complement of several types of biological molecules.[48][49]
After an organism has been selected, genome projects involve three components: the sequencing of DNA, the assembly of that sequence to create a representation of the original chromosome, and the annotation and analysis of that representation.[9]
Overview of a genome project. First, the genome must be selected, which involves several factors including cost and relevance. Second, the sequence is generated and assembled at a given sequencing center (such as BGI or DOE JGI). Third, the genome sequence is annotated at several levels: DNA, protein, gene pathways, or comparatively.
Historically, sequencing was done in sequencing centers, centralized facilities (ranging from large independent institutions such as the Joint Genome Institute, which sequence dozens of terabases a year, to local molecular biology core facilities) which contain research laboratories with the costly instrumentation and technical support necessary. As sequencing technology continues to improve, however, a new generation of effective fast-turnaround benchtop sequencers has come within reach of the average academic laboratory.[50][51] On the whole, genome sequencing approaches fall into two broad categories, shotgun and high-throughput (or next-generation) sequencing.[9]
Shotgun sequencing is a sequencing method designed for analysis of DNA sequences longer than 1000 base pairs, up to and including entire chromosomes.[52] It is named by analogy with the rapidly expanding, quasi-random firing pattern of a shotgun. Since gel electrophoresis sequencing can only be used for fairly short sequences (100 to 1000 base pairs), longer DNA sequences must be broken into random small segments which are then sequenced to obtain reads. Multiple overlapping reads for the target DNA are obtained by performing several rounds of this fragmentation and sequencing. Computer programs then use the overlapping ends of different reads to assemble them into a continuous sequence.[52][53] Shotgun sequencing is a random sampling process, requiring over-sampling to ensure a given nucleotide is represented in the reconstructed sequence; the average number of reads by which a genome is over-sampled is referred to as coverage.[54]
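The notion of coverage can be made concrete with a small calculation. The following is a minimal sketch, not part of the cited sources: it assumes the common simplification that reads land uniformly at random, so that the number of reads covering a base is approximately Poisson-distributed. The function names and the example numbers are hypothetical.

```python
import math


def mean_coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average number of reads covering each base: c = N * L / G."""
    return num_reads * read_length / genome_length


def expected_fraction_uncovered(coverage: float) -> float:
    """Assumed Poisson model of read placement: P(base covered by zero reads) = e^(-c)."""
    return math.exp(-coverage)


# Hypothetical project: 20,000 reads of 500 bases from a 1 Mb genome.
c = mean_coverage(num_reads=20_000, read_length=500, genome_length=1_000_000)
print(f"mean coverage: {c:.1f}x")
print(f"expected uncovered fraction: {expected_fraction_uncovered(c):.2e}")
```

Under this model, raising the coverage sharply reduces the expected fraction of bases that no read samples, which is why over-sampling is required.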
For much of its history, the technology underlying shotgun sequencing was the classical chain-termination method or 'Sanger method', which is based on the selective incorporation of chain-terminating dideoxynucleotides by DNA polymerase during in vitro DNA replication.[23][55] Recently, shotgun sequencing has been supplanted by high-throughput sequencing methods, especially for large-scale, automated genome analyses. However, the Sanger method remains in wide use, primarily for smaller-scale projects and for obtaining especially long contiguous DNA sequence reads (>500 nucleotides).[56] Chain-termination methods require a single-stranded DNA template, a DNA primer, a DNA polymerase, normal deoxynucleoside triphosphates (dNTPs), and modified nucleotides (dideoxyNTPs) that terminate DNA strand elongation. These chain-terminating nucleotides lack a 3'-OH group required for the formation of a phosphodiester bond between two nucleotides, causing DNA polymerase to cease extension of DNA when a ddNTP is incorporated. The ddNTPs may be radioactively or fluorescently labelled for detection in DNA sequencers.[9] Typically, these machines can sequence up to 96 DNA samples in a single batch (run) in up to 48 runs a day.[57]
The high demand for low-cost sequencing has driven the development of high-throughput sequencing technologies that parallelize the sequencing process, producing thousands or millions of sequences at once.[58][59] High-throughput sequencing is intended to lower the cost of DNA sequencing beyond what is possible with standard dye-terminator methods. In ultra-high-throughput sequencing, as many as 500,000 sequencing-by-synthesis operations may be run in parallel.[60][61]
Illumina Genome Analyzer II System. Illumina technologies have set the standard for high-throughput massively parallel sequencing.[50]
The Illumina dye sequencing method is based on reversible dye-terminators and was developed in 1996 at the Geneva Biomedical Research Institute by Pascal Mayer and Laurent Farinelli.[62] In this method, DNA molecules and primers are first attached on a slide and amplified with polymerase so that local clonal colonies, initially coined "DNA colonies", are formed. To determine the sequence, four types of reversible terminator bases (RT-bases) are added and non-incorporated nucleotides are washed away. Unlike pyrosequencing, the DNA chains are extended one nucleotide at a time and image acquisition can be performed at a delayed moment, allowing for very large arrays of DNA colonies to be captured by sequential images taken from a single camera. Decoupling the enzymatic reaction and the image capture allows for optimal throughput and theoretically unlimited sequencing capacity; with an optimal configuration, the ultimate throughput of the instrument depends only on the A/D conversion rate of the camera. The camera takes images of the fluorescently labeled nucleotides, then the dye along with the terminal 3' blocker is chemically removed from the DNA, allowing the next cycle.[63]
An alternative approach, ion semiconductor sequencing, is based on standard DNA replication chemistry. This technology measures the release of a hydrogen ion each time a base is incorporated. A microwell containing template DNA is flooded with a single nucleotide; if the nucleotide is complementary to the template strand, it will be incorporated and a hydrogen ion will be released. This release triggers an ISFET ion sensor. If a homopolymer is present in the template sequence, multiple nucleotides will be incorporated in a single flood cycle, and the detected electrical signal will be proportionally higher.[64]
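The flood-and-detect cycle described above can be illustrated with a toy simulation. This is a simplified sketch under stated assumptions, not a model of any real instrument: the flow order "TACG", the function names, and the idealized noise-free signal (exactly one unit per incorporated base) are all hypothetical, and strand orientation is ignored.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flow_signals(template: str, flow_order: str = "TACG", max_cycles: int = 100):
    """Return one (nucleotide, signal) pair per flood.

    The signal is the number of bases incorporated in that flood, so a
    homopolymer run in the template yields a proportionally larger value.
    """
    pos = 0
    signals = []
    for _ in range(max_cycles):
        for nuc in flow_order:
            if pos >= len(template):
                return signals
            count = 0
            # Keep incorporating while the flooded nucleotide pairs with the template.
            while pos < len(template) and COMPLEMENT[template[pos]] == nuc:
                count += 1
                pos += 1
            signals.append((nuc, count))
    return signals


def call_bases(signals) -> str:
    """Reconstruct the synthesized strand from the per-flood signals."""
    return "".join(nuc * count for nuc, count in signals)


signals = flow_signals("AAGTC")
print(signals)              # first flood of T gives a signal of 2 for the "AA" run
print(call_bases(signals))  # strand complementary to the template
```

Floods with a signal of zero correspond to nucleotides that did not pair with the next template base; a signal of two or more marks a homopolymer.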
Overlapping reads form contigs; contigs and gaps of known length form scaffolds.
Paired end reads of next generation sequencing data mapped to a reference genome.
Multiple, fragmented sequence reads must be assembled together on the basis of their overlapping areas.
Sequence assembly refers to aligning and merging fragments of a much longer DNA sequence in order to reconstruct the original sequence.[9] This is needed as current DNA sequencing technology cannot read whole genomes as a continuous sequence, but rather reads small pieces of between 20 and 1000 bases, depending on the technology used. Third-generation sequencing technologies such as PacBio or Oxford Nanopore routinely generate sequencing reads 10-100 kb in length; however, they have a high error rate at approximately 1 percent.[65][66] Typically the short fragments, called reads, result from shotgun sequencing genomic DNA, or gene transcripts (ESTs).[9]
Assembly can be broadly categorized into two approaches: de novo assembly, for genomes which are not similar to any sequenced in the past, and comparative assembly, which uses the existing sequence of a closely related organism as a reference during assembly.[54] Relative to comparative assembly, de novo assembly is computationally difficult (NP-hard), making it less favourable for short-read NGS technologies. Within the de novo assembly paradigm there are two primary strategies: Eulerian path strategies and overlap-layout-consensus (OLC) strategies. OLC strategies ultimately try to create a Hamiltonian path through an overlap graph, which is an NP-hard problem. Eulerian path strategies are computationally more tractable because they try to find an Eulerian path through a de Bruijn graph.[54]
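The Eulerian path strategy can be sketched in a few lines. The example below is a toy illustration, not a production assembler: it assumes error-free reads, a single strand, and a genome in which every (k-1)-mer is unique, so that the Eulerian path exists and is unambiguous. The function names and the example reads are hypothetical; the path is found with Hierholzer's algorithm.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer in the reads contributes one edge."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists in the graph."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    nodes = set(graph) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in sorted(nodes) if len(graph.get(n, [])) - in_deg[n] == 1),
        min(nodes),
    )
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence along the Eulerian path of the de Bruijn graph."""
    path = eulerian_path(de_bruijn_graph(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGCG", "GCGTGC", "GTGCA"]  # overlapping reads from a toy genome
print(assemble(reads, k=4))
```

Because every edge (k-mer) is visited exactly once, the running time is linear in the number of k-mers, in contrast to the Hamiltonian path formulation used by OLC strategies. Real genomes contain repeats that make the path ambiguous, which this sketch does not address.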
The DNA sequence assembly alone is of little value without additional analysis.[9] Genome annotation is the process of attaching biological information to sequences, and consists of three main steps:[68]
identifying portions of the genome that do not code for proteins
identifying elements on the genome, a process called gene prediction
attaching biological information to these elements.
Automatic annotation tools try to perform these steps in silico, as opposed to manual annotation (a.k.a. curation) which involves human expertise and potential experimental verification.[69] Ideally, these approaches co-exist and complement each other in the same annotation pipeline (also see below).
Traditionally, the basic level of annotation uses BLAST to find similarities and then annotates genomes based on homologues.[9] More recently, additional information has been added to the annotation platform. The additional information allows manual annotators to deconvolute discrepancies between genes that are given the same annotation. Some databases use genome context information, similarity scores, experimental data, and integrations of other resources to provide genome annotations through their Subsystems approach. Other databases (e.g. Ensembl) rely on both curated data sources as well as a range of software tools in their automated genome annotation pipeline.[70] Structural annotation consists of the identification of genomic elements, primarily ORFs and their localisation, or gene structure. Functional annotation consists of attaching biological information to genomic elements.
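As a rough illustration of the simplest form of structural annotation, the sketch below scans a DNA string for open reading frames (ORFs). It is a deliberately naive example and not the method of any particular annotation pipeline: it assumes the standard start codon ATG and stop codons TAA, TAG and TGA, searches only the three forward-strand frames, and ignores introns and alternative start codons. The function name and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    Coordinates are 0-based and `end` is exclusive; the codon count used for
    the `min_codons` filter includes both the start and the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


dna = "CCATGAAATTTTAGGG"
for start, end, frame in find_orfs(dna):
    print(frame, start, end, dna[start:end])
```

Functional annotation would then attach biological information to each reported interval, for example by searching its translation against known proteins.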
The need for reproducibility and efficient management of the large amount of data associated with genome projects means that computational pipelines have important applications in genomics.[71]
Functional genomics is a field of molecular biology that attempts to make use of the vast wealth of data produced by genomic projects (such as genome sequencing projects) to describe gene (and protein) functions and interactions. Functional genomics focuses on the dynamic aspects such as gene transcription, translation, and protein–protein interactions, as opposed to the static aspects of the genomic information such as DNA sequence or structures. Functional genomics attempts to answer questions about the function of DNA at the levels of genes, RNA transcripts, and protein products. A key characteristic of functional genomics studies is their genome-wide approach to these questions, generally involving high-throughput methods rather than a more traditional "gene-by-gene" approach.
A major branch of genomics is still concerned with sequencing the genomes of various organisms, but the knowledge of full genomes has created the possibility for the field of functional genomics, mainly concerned with patterns of gene expression during various conditions. The most important tools here are microarrays and bioinformatics.
An example of a protein structure determined by the Midwest Center for Structural Genomics
Structural genomics seeks to describe the 3-dimensional structure of every protein encoded by a given genome.[72][73] This genome-based approach allows for a high-throughput method of structure determination by a combination of experimental and modeling approaches. The principal difference between structural genomics and traditional structural prediction is that structural genomics attempts to determine the structure of every protein encoded by the genome, rather than focusing on one particular protein. With full-genome sequences available, structure prediction can be done more quickly through a combination of experimental and modeling approaches, especially because the availability of large numbers of sequenced genomes and previously solved protein structures allows scientists to model protein structure on the structures of previously solved homologs. Structural genomics involves taking a large number of approaches to structure determination, including experimental methods using genomic sequences or modeling-based approaches based on sequence or structural homology to a protein of known structure, or based on chemical and physical principles for a protein with no homology to any known structure. As opposed to traditional structural biology, the determination of a protein structure through a structural genomics effort often (but not always) comes before anything is known regarding the protein function. This raises new challenges in structural bioinformatics, i.e. determining protein function from its 3D structure.[74]
Epigenomics is the study of the complete set of epigenetic modifications on the genetic material of a cell, known as the epigenome.[75] Epigenetic modifications are reversible modifications on a cell's DNA or histones that affect gene expression without altering the DNA sequence (Russell 2010 p. 475). Two of the most characterized epigenetic modifications are DNA methylation and histone modification.[76] Epigenetic modifications play an important role in gene expression and regulation, and are involved in numerous cellular processes such as differentiation/development[77] and tumorigenesis.[75] The study of epigenetics on a global level has been made possible only recently through the adaptation of genomic high-throughput assays.[78]
Environmental Shotgun Sequencing (ESS) is a key technique in metagenomics. (A) Sampling from habitat; (B) filtering particles, typically by size; (C) Lysis and DNA extraction; (D) cloning and library construction; (E) sequencing the clones; (F) sequence assembly into contigs and scaffolds.
Metagenomics is the study of metagenomes, genetic material recovered directly from environmental samples. The broad field may also be referred to as environmental genomics, ecogenomics or community genomics. While traditional microbiology and microbial genome sequencing rely upon cultivated clonal cultures, early environmental gene sequencing cloned specific genes (often the 16S rRNA gene) to produce a profile of diversity in a natural sample. Such work revealed that the vast majority of microbial biodiversity had been missed by cultivation-based methods.[79] Recent studies use "shotgun" Sanger sequencing or massively parallel pyrosequencing to get largely unbiased samples of all genes from all the members of the sampled communities.[80] Because of its power to reveal the previously hidden diversity of microscopic life, metagenomics offers a powerful lens for viewing the microbial world that has the potential to revolutionize understanding of the entire living world.[81][82]
Bacteriophages have played and continue to play a key role in bacterial genetics and molecular biology. Historically, they were used to define gene structure and gene regulation. The first genome to be sequenced was also that of a bacteriophage. However, bacteriophage research did not lead the genomics revolution, which is clearly dominated by bacterial genomics. Only very recently has the study of bacteriophage genomes become prominent, thereby enabling researchers to understand the mechanisms underlying phage evolution. Bacteriophage genome sequences can be obtained through direct sequencing of isolated bacteriophages, but can also be derived as part of microbial genomes. Analysis of bacterial genomes has shown that a substantial amount of microbial DNA consists of prophage sequences and prophage-like elements.[83] Detailed database mining of these sequences offers insights into the role of prophages in shaping the bacterial genome. Overall, this method verified many known bacteriophage groups, making it a useful tool for predicting the relationships of prophages from bacterial genomes.[84][85]
At present there are 24 cyanobacteria for which a total genome sequence is available. Fifteen of these cyanobacteria come from the marine environment. These are six Prochlorococcus strains, seven marine Synechococcus strains, Trichodesmium erythraeum IMS101 and Crocosphaera watsonii WH8501. Several studies have demonstrated how these sequences could be used very successfully to infer important ecological and physiological characteristics of marine cyanobacteria. Many more genome projects are currently in progress, among them further Prochlorococcus and marine Synechococcus isolates, Acaryochloris and Prochloron, the N2-fixing filamentous cyanobacteria Nodularia spumigena, Lyngbya aestuarii and Lyngbya majuscula, as well as bacteriophages infecting marine cyanobacteria. Thus, the growing body of genome information can also be tapped in a more general way to address global problems by applying a comparative approach. Some new and exciting examples of progress in this field are the identification of genes for regulatory RNAs, insights into the evolutionary origin of photosynthesis, or estimation of the contribution of horizontal gene transfer to the genomes that have been analyzed.[86]
Schematic karyogram of a human, providing a simplified overview of the human genome. It is a graphical representation of the idealized human diploid karyotype, with annotated bands and sub-bands. It shows dark and white regions on G banding. Each row is vertically aligned at centromere level. It shows 22 homologous autosomal chromosome pairs, both the female (XX) and male (XY) versions of the two sex chromosomes, as well as the mitochondrial genome (at bottom left).
Next-generation genomic technologies allow clinicians and biomedical researchers to drastically increase the amount of genomic data collected on large study populations.[87] When combined with new informatics approaches that integrate many kinds of data with genomic data in disease research, this allows researchers to better understand the genetic bases of drug response and disease.[88][89]
Early efforts to apply the genome to medicine included those by a Stanford team led by Euan Ashley, who developed the first tools for the medical interpretation of a human genome.[90][91][92] The Genomes2People research program at Brigham and Women's Hospital, Broad Institute and Harvard Medical School was established in 2012 to conduct empirical research in translating genomics into health. Brigham and Women's Hospital opened a Preventive Genomics Clinic in August 2019, with Massachusetts General Hospital following a month later.[93][94] The All of Us research program aims to collect genome sequence data from 1 million participants to become a critical component of the precision medicine research platform,[95] and the UK Biobank initiative has studied more than 500,000 individuals with deep genomic and phenotypic data.[96]
Conservationists can use the information gathered by genomic sequencing in order to better evaluate genetic factors key to species conservation, such as the genetic diversity of a population or whether an individual is heterozygous for a recessive inherited genetic disorder.[100] By using genomic data to evaluate the effects of evolutionary processes and to detect patterns in variation throughout a given population, conservationists can formulate plans to aid a given species without as many variables left unknown as those unaddressed by standard genetic approaches.[101]
Satzinger H (March 2008). "Theodor and Marcella Boveri: chromosomes and cytoplasm in heredity and development". Nature Reviews. Genetics. 9 (3): 231–238. doi:10.1038/nrg2311. PMID 18268510. S2CID 15829893.
Cremer T, Cremer C (2006). "Rise, fall and resurrection of chromosome territories: a historical perspective. Part I. The rise of chromosome territories". European Journal of Histochemistry. 50 (3): 161–176. PMID 16920639.
Concepts of genetics (10th ed.). San Francisco: Pearson Education. 2012. ISBN 978-0-321-72412-0.
Culver KW, Labow MA (8 November 2002). "Genomics". In Robinson R (ed.). Genetics. Macmillan Science Library. Macmillan Reference USA. ISBN 978-0-02-865606-9.
Kaiser O, Bartels D, Bekel T, Goesmann A, Kespohl S, Pühler A, Meyer F (December 2003). "Whole genome shotgun sequencing guided by bioinformatics pipelines--an optimized approach for an established technique". Journal of Biotechnology. 106 (2–3): 121–133. doi:10.1016/j.jbiotec.2003.08.008. PMID 14651855.
Darden L, Tabery J (2010). "Molecular Biology". In Zalta EN (ed.). The Stanford Encyclopedia of Philosophy (Fall 2010 ed.).
Anderson S, Bankier AT, Barrell BG, de Bruijn MH, Coulson AR, Drouin J, et al. (April 1981). "Sequence and organization of the human mitochondrial genome". Nature. 290 (5806): 457–465. Bibcode:1981Natur.290..457A. doi:10.1038/290457a0. PMID 7219534. S2CID 4355527.
Yue GH, Lo LC, Zhu ZY, Lin G, Feng F (April 2006). "The complete nucleotide sequence of the mitochondrial genome of Tetraodon nigroviridis". DNA Sequence. 17 (2): 115–121. doi:10.1080/10425170600700378. PMID 17076253. S2CID 21797344.
Sanger F, Coulson AR (May 1975). "A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase". Journal of Molecular Biology. 94 (3): 441–448. doi:10.1016/0022-2836(75)90213-2. PMID 1100841.
US 20050100900, Kawashima EH, Farinelli L, Mayer P, "Method of nucleic acid amplification", published 12 May 2005, issued 26 July 2011, assigned to Solexa Ltd Great Britain.
Laird PW (March 2010). "Principles and challenges of genomewide DNA methylation analysis". Nature Reviews. Genetics. 11 (3): 191–203. doi:10.1038/nrg2732. PMID 20125086. S2CID 6780101.
Luikart G, England PR, Tallmon D, Jordan S, Taberlet P (December 2003). "The power and promise of population genomics: from genotyping to genome typing". Nature Reviews. Genetics. 4 (12): 981–94. doi:10.1038/nrg1226. PMID 14631358. S2CID 8516357.
Allendorf FW, Hohenlohe PA, Luikart G (October 2010). "Genomics and the future of conservation genetics". Nature Reviews. Genetics. 11 (10): 697–709. doi:10.1038/nrg2844. PMID 20847747. S2CID 10811958.