Review
Nat Rev Genet. 2016 Dec;17(12):758-772.
doi: 10.1038/nrg.2016.119. Epub 2016 Oct 24.

The state of play in higher eukaryote gene annotation

Jonathan M Mudge et al. Nat Rev Genet. 2016 Dec.

Abstract

A genome sequence is worthless if it cannot be deciphered; therefore, efforts to describe - or 'annotate' - genes began as soon as DNA sequences became available. Whereas early work focused on individual protein-coding genes, the modern genomic ocean is a complex maelstrom of alternative splicing, non-coding transcription and pseudogenes. Scientists - from clinicians to evolutionary biologists - need to navigate these waters, and this has led to the design of high-throughput, computationally driven annotation projects. The catalogues that are being produced are key resources for genome exploration, especially as they become integrated with expression, epigenomic and variation data sets. Their creation, however, remains challenging.

Figures

Figure 1. A modern view of the genomic landscape
This hypothetical diagram illustrates the major types of genes and transcripts found in eukaryotic genomes. Two protein-coding genes are illustrated, (a) and (b). Coding sequences (CDS) are shown as open green boxes, untranslated regions as filled red boxes. Whereas locus (b) appears to generate a single CDS transcript, locus (a) generates two distinct protein isoforms through the differential incorporation of a central exon. Locus (a) also has an associated retained-intron transcript, while an additional ‘read-through’ transcript incorporates exons from both (a) and (b); this read-through transcript is subject to nonsense-mediated decay (NMD; unfilled lilac boxes). Gene (c) is a long non-coding RNA (lncRNA) with two transcripts (red boxes), and three small RNAs are also transcribed from within one of its introns (open blue boxes). Loci (d) and (e) are unprocessed (filled green boxes) and processed (grey box) pseudogenes respectively. Locus (d) is transcribed. A series of promoter regions (filled grey ovals) and enhancer regions (open ovals) are indicated. Promoters are associated with transcription start sites (TSSs) for the various loci, whereas enhancers are found some distance from the gene or genes they regulate.
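The two protein isoforms of locus (a) differ by one central exon. As a minimal sketch of how such a difference might be detected from exon coordinates, the snippet below compares two transcript models. The `Transcript` class, the transcript names and all coordinates are hypothetical and do not come from any annotation database.

```python
from dataclasses import dataclass
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive, same strand assumed


@dataclass(frozen=True)
class Transcript:
    name: str               # hypothetical identifier
    biotype: str            # e.g. "protein_coding", "retained_intron", "lncRNA"
    exons: Tuple[Exon, ...]  # sorted by start coordinate


def cassette_exons(longer: Transcript, shorter: Transcript) -> List[Exon]:
    """Return internal exons of `longer` that `shorter` skips entirely."""
    if not shorter.exons:
        return []
    span_start = shorter.exons[0][0]
    span_end = shorter.exons[-1][1]
    skipped = []
    for start, end in longer.exons[1:-1]:  # first and last exons are not cassettes
        # The exon must lie inside the genomic span of the shorter transcript...
        inside_span = span_start < start and end < span_end
        # ...and must not overlap any exon that the shorter transcript uses.
        overlaps = any(s <= end and start <= e for s, e in shorter.exons)
        if inside_span and not overlaps:
            skipped.append((start, end))
    return skipped


# Toy models of the two isoforms of locus (a); coordinates are made up.
iso_with_exon = Transcript("a-1", "protein_coding", ((100, 200), (300, 380), (500, 650)))
iso_without_exon = Transcript("a-2", "protein_coding", ((100, 200), (500, 650)))

print(cassette_exons(iso_with_exon, iso_without_exon))
```

This comparison only covers the simple exon-skipping case. Retained introns, read-through transcripts and alternative splice sites would each need their own checks.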
Figure 2. The core annotation workflows for different gene types
These workflows illustrate general annotation principles, not the specific pipelines of any particular genebuild.
a) Protein-coding genes within reference genomes were largely annotated based on the computational genomic alignment of Sanger-sequenced transcripts and protein-coding sequences, followed by manual annotation via interface tools such as Zmap, WebApollo, Artemis or the Integrative Genomics Viewer. Transcripts were typically taken from GenBank, proteins from Swiss-Prot.
b) Protein-coding genes within non-reference genomes are usually annotated based on fewer resources; here, RNA sequencing (RNA-seq) data are used in combination with protein homology information extrapolated from a closely-related genome. RNA-seq pipelines for read alignment include STAR and TopHat, whereas model creation is commonly performed by Cufflinks.
c) Long non-coding RNA (lncRNA) structures can be annotated in a similar manner to protein-coding transcripts as for (a) and (b), although coding potential must be ruled out. This is typically done by examining sequence conservation with PhyloCSF or using experimental datasets such as mass spectrometry or ribosome profiling. Here, 5’ Cap Analysis of Gene Expression (CAGE) and polyA-seq data are also incorporated to obtain true transcript endpoints. Designated lncRNA pipelines include PLAR.
d) Small RNAs are typically added to genebuilds by mining repositories such as RFAM or miRBase. However, these entries can be used to search for additional loci based on homology.
e) Pseudogene annotation is based on identification of loci with protein-homology to either paralogous or orthologous protein-coding genes. Computational annotation pipelines include PseudoPipe, although manual annotation is more accurate. Finally, all annotation methods can be thwarted by the existence of sequence gaps in the genome assembly (right-angled arrow).
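The lncRNA workflow in (c) depends on ruling out coding potential. A rough sketch of that filtering step is shown below, assuming each candidate already carries a conservation-based coding score and flags for experimental support. The `Candidate` fields, the score scale and the cutoff of 0.0 are illustrative assumptions, not the output format or recommended threshold of PhyloCSF or any other tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str                    # hypothetical transcript identifier
    coding_score: float          # assumed score; higher means more protein-like
    has_peptide_support: bool    # e.g. peptides detected by mass spectrometry
    has_ribosome_support: bool   # e.g. a ribosome profiling signal


def is_putative_lncrna(c: Candidate, score_cutoff: float = 0.0) -> bool:
    """Keep a candidate only if no line of evidence suggests it is coding."""
    if c.coding_score > score_cutoff:  # conservation pattern looks coding
        return False
    # Any experimental hint of translation also disqualifies the candidate.
    return not (c.has_peptide_support or c.has_ribosome_support)


def filter_lncrnas(candidates: List[Candidate], score_cutoff: float = 0.0) -> List[str]:
    """Return the names of candidates that pass the non-coding filter."""
    return [c.name for c in candidates if is_putative_lncrna(c, score_cutoff)]


# Toy candidates with made-up scores and evidence flags.
candidates = [
    Candidate("cand1", -35.2, False, False),  # no coding evidence
    Candidate("cand2", 120.7, False, False),  # conserved like a coding sequence
    Candidate("cand3", -10.0, True, False),   # peptide evidence
]

print(filter_lncrnas(candidates))
```

A real pipeline would weigh these evidence types rather than applying a hard cutoff, and would also use CAGE and polyA-seq data to fix transcript endpoints, as the caption notes.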
Figure 3. High-level strategies for gene annotation projects
This schematic details the annotation pathways for reference and novel genomes. Coding sequences (CDS) are outlined in green, nonsense-mediated decay (NMD) is shown in purple and untranslated regions (UTRs) are filled in red. The core evidence sets used at each stage are listed, although their availability and incorporation vary across different projects. The types of evidence used for reference genebuilds have evolved over time: RNA sequencing (RNA-seq) has replaced Sanger sequencing, conservation-based methodologies have become more powerful and proteogenomic datasets are now available. By contrast, novel genebuilds are constructed based on RNA-seq and/or ab initio modelling, in combination with the projection of annotation from other species (known as liftover) and the usage of other species evidence sets. In fact, certain novel genebuilds such as pig and rat now incorporate a modest amount of manual annotation, and could perhaps be described as ‘intermediate’ in status between ‘novel’ and ‘reference’. Furthermore, such genebuilds have also been improved by community annotation; this process typically follows the manual annotation workflows for reference genomes, although at a smaller scale. While all reference genebuilds are ‘mature’ in our view, progress into the ‘extended genebuild’ phase is most advanced for human. A promoter is indicated by the blue circle, an enhancer is indicated by the orange circle, and binding sites for transcription factors (TFs) or RNA-binding proteins (RBPs) are shown as orange triangles. Gene expression can be analyzed on any genebuild regardless of quality, although it is more effective when applied to accurate transcript catalogues. Clearly, the results of expression analyses have the potential to reciprocally improve the efficacy of genebuilds, although it remains to be seen how this will be achieved in practice (indicated by ‘?’).
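The liftover step mentioned above can be pictured as coordinate projection through a whole-genome alignment. The toy sketch below assumes a list of ungapped, same-strand aligned blocks given as hypothetical `(src_start, src_end, dst_start)` tuples. Real liftover tools work from chained alignments and also handle strand changes, gaps and partial mappings, none of which is modelled here.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]      # (start, end), inclusive
Block = Tuple[int, int, int]    # (src_start, src_end, dst_start), ungapped


def project_interval(start: int, end: int, blocks: List[Block]) -> Optional[Interval]:
    """Project one interval if it lies wholly within a single aligned block."""
    for src_start, src_end, dst_start in blocks:
        if src_start <= start and end <= src_end:
            offset = dst_start - src_start  # constant shift within an ungapped block
            return (start + offset, end + offset)
    return None  # unaligned, or the interval straddles a block boundary


def project_transcript(exons: List[Interval], blocks: List[Block]) -> Optional[List[Interval]]:
    """Project every exon; give up on the transcript if any exon fails."""
    projected = [project_interval(s, e, blocks) for s, e in exons]
    if any(p is None for p in projected):
        return None
    return projected


# Made-up alignment blocks between a source and a target assembly.
blocks = [(1000, 2000, 5000), (2500, 3000, 6400)]

print(project_transcript([(1100, 1200), (2600, 2700)], blocks))
print(project_transcript([(1900, 2100), (2600, 2700)], blocks))
```

Rejecting a transcript when any exon fails to map is a deliberately conservative choice for the sketch. A projected model would still need checking against RNA-seq or protein evidence in the target species.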
Figure 4. Transcriptional complexity in the NRIP1 locus
a) Capture Hi-C indicates that the nuclear receptor interacting protein 1 (NRIP1) locus on human chromosome 21 forms a loop with a previously unannotated region nearby. Pacific Biosciences (PacBio) CaptureSeq data could be aligned here (R. Johnson, personal communication), leading to the annotation of lncRNA OTTHUMG00000488671 in GENCODE.
b) A long non-coding RNA (lncRNA) transcription start site (TSS) falls within an ENCODE-defined enhancer (red and orange blocks; processed by Ensembl). Three transcription factor binding (TFB) regions (E2F1, E2F4 and E2F6) co-localize based on ENCODE chromatin immunoprecipitation followed by sequencing (ChIP-seq) data. In combination, these data suggest an ‘extended gene model’ for NRIP1, which may aid the interpretation of three genome-wide association study (GWAS) signals linked to Crohn’s disease (rs2823286, rs1297265 and rs1736020; shown as asterisks) as previously noted by Mifsud et al.
c) NRIP1 contains one transcript in RefSeq and 6 in GENCODE. The coding sequence (CDS; shown as an open green box) has Swiss-Prot support, and a PhyloCSF conservation signal. (The untranslated regions (UTRs) are shown as filled red boxes.)
d) Two distinct first exons of NRIP1 are annotated, both supported by 5’ Cap Analysis of Gene Expression (CAGE) data. RNA-seq from Uhlen et al. indicates differential expression, with usage of the upstream exon apparently limited to bone marrow (and adipose; not shown). This TSS is dominant in white blood cells, which are bone-marrow-derived. RNA-seq and CAGE support a more general expression profile for the downstream first exon, with evidence of TSS variability.