Review

Nat Rev Genet. 2016 Dec;17(12):758-772.

doi: 10.1038/nrg.2016.119. Epub 2016 Oct 24.

The state of play in higher eukaryote gene annotation

Jonathan M Mudge¹, Jennifer Harrow¹ ²

PMID: 27773922
PMCID: PMC5876476
DOI: 10.1038/nrg.2016.119

Jonathan M Mudge et al. Nat Rev Genet. 2016 Dec.

Affiliations

¹ Department of Computational Genomics, Wellcome Trust Sanger Institute, Hinxton CB10 1SA, UK.
² Illumina Cambridge Ltd, Chesterford Research Park, Little Chesterford, Saffron Walden CB10 1XL, UK.

Abstract

A genome sequence is worthless if it cannot be deciphered; therefore, efforts to describe - or 'annotate' - genes began as soon as DNA sequences became available. Whereas early work focused on individual protein-coding genes, the modern genomic ocean is a complex maelstrom of alternative splicing, non-coding transcription and pseudogenes. Scientists - from clinicians to evolutionary biologists - need to navigate these waters, and this has led to the design of high-throughput, computationally driven annotation projects. The catalogues that are being produced are key resources for genome exploration, especially as they become integrated with expression, epigenomic and variation data sets. Their creation, however, remains challenging.

Figures

**Figure 1. A modern view of the genomic landscape**
This hypothetical diagram illustrates the major types of genes and transcripts found in eukaryotic genomes. Two protein-coding genes are illustrated, (a) and (b). Coding sequences (CDS) are shown as open green boxes, untranslated regions as filled red boxes. Whereas locus (b) appears to generate a single CDS transcript, locus (a) generates two distinct protein isoforms through the differential incorporation of a central exon. Locus (a) also has an associated retained intron, while an additional ‘read-through’ transcript incorporates exons from (a) and (b). This read-through transcript is subject to nonsense-mediated decay (NMD; unfilled lilac boxes). Gene (c) is a long non-coding RNA (lncRNA) with two transcripts (red boxes); three small RNAs are also transcribed from within one of its introns (open blue boxes). Loci (d) and (e) are unprocessed (filled green boxes) and processed (grey box) pseudogenes, respectively. Locus (d) is transcribed. A series of promoter regions (filled grey ovals) and enhancer regions (open ovals) are indicated. Promoters are associated with transcription start sites (TSSs) for the various loci, whereas enhancers are found some distance from the gene or genes they regulate.

**Figure 2. The core annotation workflows for different gene types**
These workflows illustrate general annotation principles, not the specific pipelines of any particular genebuild.
a) Protein-coding genes within reference genomes were largely annotated based on the computational genomic alignment of Sanger-sequenced transcripts and protein-coding sequences, followed by manual annotation via interface tools such as Zmap, WebApollo, Artemis or the Integrative Genomics Viewer. Transcripts were typically taken from GenBank, proteins from Swiss-Prot.
b) Protein-coding genes within non-reference genomes are usually annotated based on fewer resources; here, RNA sequencing (RNA-seq) data are used in combination with protein homology information extrapolated from a closely related genome. RNA-seq pipelines for read alignment include STAR and TopHat, whereas model creation is commonly performed by Cufflinks.
c) Long non-coding RNA (lncRNA) structures can be annotated in a similar manner to protein-coding transcripts, as for (a) and (b), although coding potential must be ruled out. This is typically done by examining sequence conservation with PhyloCSF or using experimental datasets such as mass spectrometry or ribosome profiling. Here, 5’ Cap Analysis of Gene Expression (CAGE) and polyA-seq data are also incorporated to obtain true transcript endpoints. Designated lncRNA pipelines include PLAR.
d) Small RNAs are typically added to genebuilds by mining repositories such as RFAM or miRBase. However, these entries can be used to search for additional loci based on homology.
e) Pseudogene annotation is based on identification of loci with protein homology to either paralogous or orthologous protein-coding genes. Computational annotation pipelines include PseudoPipe, although manual annotation is more accurate.
Finally, all annotation methods can be thwarted by the existence of sequence gaps in the genome assembly (right-angled arrow).

**Figure 3. High-level strategies for gene annotation projects**
This schematic details the annotation pathways for reference and novel genomes. Coding sequences (CDS) are outlined in green, nonsense-mediated decay (NMD) is shown in purple and untranslated regions (UTRs) are filled in red. The core evidence sets used at each stage are listed, although their availability and incorporation vary across different projects. The types of evidence used for reference genebuilds have evolved over time: RNA sequencing (RNA-seq) has replaced Sanger sequencing, conservation-based methodologies have become more powerful and proteogenomic datasets are now available. By contrast, novel genebuilds are constructed based on RNA-seq and/or ab initio modelling, in combination with the projection of annotation from other species (known as liftover) and the usage of other species evidence sets. In fact, certain novel genebuilds such as pig and rat now incorporate a modest amount of manual annotation, and could perhaps be described as ‘intermediate’ in status between ‘novel’ and ‘reference’. Furthermore, such genebuilds have also been improved by community annotation; this process typically follows the manual annotation workflows for reference genomes, although at a smaller scale. While all reference genebuilds are ‘mature’ in our view, progress into the ‘extended genebuild’ phase is most advanced for human. A promoter is indicated by the blue circle, an enhancer is indicated by the orange circle, and binding sites for transcription factors (TFs) or RNA-binding proteins (RBPs) are shown as orange triangles. Gene expression can be analyzed on any genebuild regardless of quality, although it is more effective when applied to accurate transcript catalogues. Clearly, the results of expression analyses have the potential to reciprocally improve the efficacy of genebuilds, although it remains to be seen how this will be achieved in practice (indicated by ‘?’).

**Figure 4. Transcriptional complexity in the *NRIP1* locus**
a) Capture Hi-C indicates that the nuclear receptor interacting protein 1 (*NRIP1*) locus on human chromosome 21 forms a loop with a previously unannotated region nearby. Pacific Biosciences (PacBio) CaptureSeq data could be aligned here (R. Johnson, personal communication), leading to the annotation of lncRNA OTTHUMG00000488671 in GENCODE.
b) A long non-coding RNA (lncRNA) transcription start site (TSS) falls within an ENCODE-defined enhancer (red and orange blocks; processed by Ensembl). Three transcription factor binding (TFB) regions (E2F1, E2F4 and E2F6) co-localize based on ENCODE chromatin immunoprecipitation followed by sequencing (ChIP-seq) data. In combination, these data suggest an ‘extended gene model’ for *NRIP1*, which may aid the interpretation of three genome-wide association study (GWAS) signals linked to Crohn’s disease (rs2823286, rs1297265 and rs1736020; shown as asterisks), as previously noted by Mifsud et al.
c) *NRIP1* contains one transcript in RefSeq and six in GENCODE. The coding sequence (CDS; shown as an open green box) has Swiss-Prot support and a PhyloCSF conservation signal. The untranslated regions (UTRs) are shown as filled red boxes.
d) Two distinct first exons of *NRIP1* are annotated, both supported by 5’ Cap Analysis of Gene Expression (CAGE) data. RNA-seq from Uhlen et al. indicates differential expression, with usage of the upstream exon apparently limited to bone marrow (and adipose; not shown). This TSS is dominant in white blood cells, which are bone-marrow-derived. RNA-seq and CAGE support a more general expression profile for the downstream first exon, with evidence of TSS variability.

Cited by

Recognition of the polycistronic nature of human genes is critical to understanding the genotype-phenotype relationship.
Brunet MA, Levesque SA, Hunting DJ, Cohen AA, Roucou X. Genome Res. 2018 May;28(5):609-624. doi: 10.1101/gr.230938.117. Epub 2018 Apr 6. PMID: 29626081. Free PMC article. Review.

De novo assembly and annotation of the Patagonian toothfish (Dissostichus eleginoides) genome.
Ryder D, Stone D, Minardi D, Riley A, Avant J, Cross L, Soeffker M, Davidson D, Newman A, Thomson P, Darby C, van Aerle R. BMC Genomics. 2024 Mar 4;25(1):233. doi: 10.1186/s12864-024-10141-4. PMID: 38438840. Free PMC article.

ToxCodAn-Genome: an automated pipeline for toxin-gene annotation in genome assembly of venomous lineages.
Nachtigall PG, Durham AM, Rokyta DR, Junqueira-de-Azevedo ILM. Gigascience. 2024 Jan 2;13:giad116. doi: 10.1093/gigascience/giad116. PMID: 38241143. Free PMC article.

TEx-MST: tissue expression profiles of MANE select transcripts.
Tung KF, Lin WC. Database (Oxford). 2022 Sep 28;2022:baac089. doi: 10.1093/database/baac089. PMID: 36170113. Free PMC article.

Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci.
Mudge JM, Jungreis I, Hunt T, Gonzalez JM, Wright JC, Kay M, Davidson C, Fitzgerald S, Seal R, Tweedie S, He L, Waterhouse RM, Li Y, Bruford E, Choudhary JS, Frankish A, Kellis M. Genome Res. 2019 Dec;29(12):2073-2087. doi: 10.1101/gr.246462.118. Epub 2019 Sep 19. PMID: 31537640. Free PMC article.

References

1. Harrow J, et al. GENCODE: The reference human genome annotation for The ENCODE Project. Genome Res. 2012;22:1760–74.
2. Kim VN, Han J, Siomi MC. Biogenesis of small RNAs in animals. Nat Rev Mol Cell Biol. 2009;10:126–39.
3. Andersson L, et al. Coordinated international action to accelerate genome-to-phenome with FAANG, the Functional Annotation of Animal Genomes project. Genome Biol. 2015;16:57.
4. O'Leary NA, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44:D733–45.
5. McGarvey KM, et al. Mouse genome annotation by the RefSeq project. Mamm Genome. 2015;26:379–90.

Grants and funding

U41 HG007234/HG/NHGRI NIH HHS/United States
