- Review Article
Annotating non-coding regions of the genome
Nature Reviews Genetics volume 11, pages 559–571 (2010)
Key Points
Most of the human genome consists of DNA that does not code for proteins.
Annotating functional regions in the non-coding genome involves two complementary analysis techniques: comparative analysis, which involves examining DNA sequences, and functional analysis, which involves examining the output of functional genomics experiments.
With the exponential increase in DNA sequence data, it is now possible to compare sequences within a single human haplotype, between cell types in a single person, across the human population and between species. Integrating the analysis across all these scales is useful.
There are two main methods of sequence comparison: scanning for regions of high sequence similarity above some operational threshold, and building statistical models of sequence families. Model-based sequence analysis can incorporate more biological knowledge than sequence similarity scans and provide more refined results.
The output of most high-throughput functional genomics experiments can be treated as a continuous signal mapped onto the genome and analysed with a standardized signal processing approach.
Signal processing involves smoothing the raw signal, then thresholding and segmenting the signal into discrete annotated blocks.
Integration of multiple types of signals generates a progression of more and more complex annotations; these smaller annotations are clustered into groups and then into functional networks that begin to represent the state of biological knowledge about the genome.
A chronic problem with annotation based on functional genomics data is the lack of sufficient validation by more low-throughput methods.
Techniques such as paired-end sequencing and chromosome conformation capture (and its descendants) enable annotation of connectivity between elements and necessitate a move beyond the one-dimensional signal approach to annotation.
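The first sequence-comparison method named in the key points, scanning for regions of high similarity above an operational threshold, can be illustrated with a small sliding-window scan. This is only a sketch under stated assumptions: the two sequences are taken to be already aligned to equal length (gaps written as "-"), and the function name `identity_windows`, the window size and the identity cutoff are hypothetical choices for illustration, not values from the article.

```python
def identity_windows(seq_a, seq_b, window=10, min_identity=0.9):
    """Return merged half-open (start, end) intervals whose windowed
    identity is at least min_identity.

    Assumes seq_a and seq_b are already aligned to the same length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to equal length")

    # 1 where the aligned columns match (gap columns never count as matches).
    matches = [int(a == b and a != "-") for a, b in zip(seq_a, seq_b)]

    hits = []
    running = sum(matches[:window])  # match count in the first window
    for start in range(len(matches) - window + 1):
        if start > 0:
            # Slide the window one column to the right.
            running += matches[start + window - 1] - matches[start - 1]
        if running / window >= min_identity:
            if hits and start <= hits[-1][1]:
                # Overlaps or touches the previous hit: extend it.
                hits[-1] = (hits[-1][0], start + window)
            else:
                hits.append((start, start + window))
    return hits


if __name__ == "__main__":
    a = "ACGTACGTACGTAAAA"
    b = "ACGTACGTACGTCCCC"
    print(identity_windows(a, b, window=4, min_identity=1.0))
```

A model-based approach, the second method in the key points, would replace the fixed cutoff with a statistical model of the sequence family; that is beyond what this sketch shows.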
Abstract
Most of the human genome consists of non-protein-coding DNA. Recently, progress has been made in annotating these non-coding regions through the interpretation of functional genomics experiments and comparative sequence analysis. One can conceptualize functional genomics analysis as involving a sequence of steps: turning the output of an experiment into a 'signal' at each base pair of the genome; smoothing this signal and segmenting it into small blocks of initial annotation; and then clustering these small blocks into larger derived annotations and networks. Finally, one can relate functional genomics annotations to conserved units and measures of conservation derived from comparative sequence analysis.
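The sequence of steps described in the abstract (per-base signal, smoothing, segmenting into blocks, then grouping blocks) can be sketched in a few lines. This is a minimal illustration, not the authors' method: the moving-average smoother, the fixed threshold and the gap-based merge used here as a stand-in for clustering are all assumptions, and the function names are hypothetical.

```python
def smooth(signal, half_width=1):
    """Moving-average smoothing; the window is truncated at the ends."""
    n = len(signal)
    out = []
    for i in range(n):
        lo = max(0, i - half_width)
        hi = min(n, i + half_width + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def segment(signal, threshold):
    """Return half-open (start, end) blocks where signal >= threshold."""
    blocks = []
    start = None
    for i, value in enumerate(signal):
        if value >= threshold and start is None:
            start = i                      # a block opens here
        elif value < threshold and start is not None:
            blocks.append((start, i))      # the block closes before i
            start = None
    if start is not None:
        blocks.append((start, len(signal)))
    return blocks


def cluster(blocks, max_gap):
    """Merge sorted blocks separated by at most max_gap positions.

    A simple stand-in for the clustering step; real analyses may group
    blocks by other criteria.
    """
    merged = []
    for start, end in blocks:
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    raw = [0, 0, 3, 3, 3, 0, 0, 0, 3, 0]   # toy per-base signal
    print(segment(raw, 2.0))               # raw signal keeps the lone spike
    print(segment(smooth(raw, 1), 2.0))    # smoothing suppresses it
    print(cluster(segment(raw, 2.0), 3))   # nearby blocks merged into one
```

Half-open intervals are used so that block lengths are simply `end - start`; the threshold and window width would in practice be chosen per experiment.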
Acknowledgements
The authors thank members of the Gerstein laboratory for helpful discussions and careful reading of the manuscript. We acknowledge support from the US NIH and from the Albert L. Williams Professorship funds.
Author information
Authors and Affiliations
Program in Computational Biology and Bioinformatics, Yale University, New Haven, 06520, Connecticut, USA
Roger P. Alexander, Gang Fang & Mark B. Gerstein
Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, 06520, Connecticut, USA
Roger P. Alexander, Gang Fang, Joel Rozowsky & Mark B. Gerstein
Department of Genetics, Stanford University, Stanford, 94305, California, USA
Michael Snyder
Department of Computer Science, Yale University, New Haven, 06520, Connecticut, USA
Mark B. Gerstein
Corresponding author
Correspondence to Mark B. Gerstein.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Related links
FURTHER INFORMATION
Berkeley Drosophila Genome Project
Glossary
- Targeted exome sequencing
A technique that involves filtering genomic DNA by capturing regions of interest (often protein-coding exons) on a microarray, then sequencing the captured DNA using next-generation techniques.
- Structural variants
Chromosomal rearrangements (deletions, duplications, novel sequence insertions or inversions) that are inherited and polymorphic across the human population. Structural variants are by definition longer than SNPs and can be hundreds of thousands of base pairs long.
- Copy-number variants
Structural variants that arise from deletion or duplication and thus lead to a change in copy number of the underlying region of the genome.
- Segmental duplication
The operational definition of a segmental duplication rests on finding two regions in the same genome ranging in length from a thousand to several million nucleotides with at least 90% sequence identity. Segmental duplications are inherited but not necessarily polymorphic across the human population.
- Pseudogenes
Copies of protein-coding genes with mutations that disrupt their coding sequence and demolish their original protein-coding function.
- Syntenic blocks
Segments that align between genome sequences from two species and that are believed to define an orthologous relationship.
- DNA-based transposons
Transposable DNA elements that rely on a transposase enzyme to excise themselves from one region of the genome and insert themselves into a different region, without increasing in copy number.
- RNA-based retrotransposons
Transposable elements generated when reverse transcriptase enzymes copy RNA elements into DNA and insert the DNA copies back into the genome.
- Duplicated pseudogenes
Pseudogenes that result from whole-genome or segmental duplications, in which one copy maintains its ancestral function and the other copy degrades into a pseudogene.
- Processed pseudogenes
Pseudogenes that arise when the mRNA of a parent gene is retrotranscribed back into DNA and inserted into the genome.
- Unitary pseudogenes
A rare class of pseudogene in which a single-copy parent gene becomes non-functional.
- Chromatin immunoprecipitation
(ChIP.) A technique for identifying potential regulatory sequences that are bound by the protein of interest. Soluble DNA–chromatin extracts (complexes of DNA and protein) are isolated by using antibodies that recognize specific DNA-binding proteins. In ChIP–chip, the ChIP step is followed by microarray analysis, whereas in ChIP–seq, it is followed by sequencing.
- Tiling arrays
A class of microarray in which probes of a specific length and spacing provide uniform coverage of an entire genome or portion of a genome to a desired resolution.
- RNA sequencing
The use of high-throughput sequencing of RNA that has been reverse-transcribed into DNA to characterize the set of RNA transcripts produced by a cell.
- Smoothing
The process of filtering noise from a signal by removing fine-scale variation.
- Thresholding
The process of discretizing a continuous signal by choosing a signal value above which the signal is considered 'on' or 'active' and below which the signal is considered 'off' or 'inactive'.
- Segmenting
The result of thresholding in signal processing — that is, segments are those regions defined as 'on' or 'active' after discretization of the signal.
- Heterochromatin
Highly compact and therefore inactive regions of the genome. Largely composed of repetitive DNA, heterochromatin forms dark bands after Giemsa staining.
- Euchromatin
The lightly staining regions of the genome that are generally decondensed during interphase and contain transcriptionally active regions.
- Fosmid
A low-copy vector for the construction of stable genomic libraries that uses the Escherichia coli F-factor origin of replication. Each fosmid clone can store ∼40 kb of library DNA. Cloned sequences are more stable in fosmids than in high-copy vectors.
- Specificity
A measure of the proportion of true negatives correctly identified as such (for example, the percentage of healthy people who are identified as not having a disease).
- Regulatory forests
Regions of the genome that are enriched with binding sites for regulatory factors, such as transcription factors.
- Principal components analysis
A statistical method used to simplify data sets by transforming a series of correlated variables into a smaller number of uncorrelated factors.
- Non-allelic homologous recombination
Recombination between segmental duplications that leads to local duplication, deletion or inversion of genome sequence.
- Ultraconserved elements
Operationally defined as non-coding elements that are hundreds of base pairs long and 100% identical across human, mouse and rat genomes.
- Sensitivity
A measure of the proportion of true positives that are correctly identified as such (for example, the percentage of sick people who are identified as having a disease).
- Paired-end sequencing
Determination of the sequence at both ends of a fragment of DNA of known size.
- Chromosome conformation capture
A technique used to study the long-distance interactions between genomic regions, which in turn can be used to study the three-dimensional architecture of chromosomes within a cell nucleus.
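The glossary definitions of sensitivity and specificity can be made concrete with a short calculation. The sketch below assumes that each candidate element has a predicted label (annotated or not) and a validated label from an independent experiment; the function name and the toy labels are hypothetical.

```python
def sensitivity_specificity(predicted, actual):
    """Return (sensitivity, specificity) for paired boolean labels.

    sensitivity = TP / (TP + FN): proportion of true positives found.
    specificity = TN / (TN + FP): proportion of true negatives found.
    """
    if len(predicted) != len(actual):
        raise ValueError("label lists must have the same length")

    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    validated = [True, True, True, True, False, False, False, False]
    annotated = [True, True, True, False, True, False, False, False]
    print(sensitivity_specificity(annotated, validated))
```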
About this article
Cite this article
Alexander, R., Fang, G., Rozowsky, J. et al. Annotating non-coding regions of the genome. Nat Rev Genet 11, 559–571 (2010). https://doi.org/10.1038/nrg2814