Gene whose sequence partially overlaps the reading frame of another gene
An overlapping gene (or OLG)[1][2] is a gene whose expressible nucleotide sequence partially overlaps with the expressible nucleotide sequence of another gene.[3] In this way, a nucleotide sequence may contribute to the function of one or more gene products. Overlapping genes are present in, and are a fundamental feature of, both cellular and viral genomes.[2] The current definition of an overlapping gene varies significantly between eukaryotes, prokaryotes, and viruses.[2] In prokaryotes and viruses, the overlap must be between coding sequences rather than mRNA transcripts, and is defined when these coding sequences share a nucleotide on either the same or opposite strands. In eukaryotes, gene overlap is almost always defined as mRNA transcript overlap. Specifically, a gene overlap in eukaryotes is defined when at least one nucleotide is shared between the boundaries of the primary mRNA transcripts of two or more genes, such that a DNA base mutation at any point of the overlapping region would affect the transcripts of all genes involved. This definition includes 5′ and 3′ untranslated regions (UTRs) along with introns.
Overprinting refers to a type of overlap in which all or part of the sequence of one gene is read in an alternate reading frame from another gene at the same locus.[4] The alternative open reading frames (ORFs) are thought to be created by critical nucleotide substitutions within an expressible pre-existing gene, which can be induced to express a novel protein while still preserving the function of the original gene.[5] Overprinting has been hypothesized as a mechanism for de novo emergence of new genes from existing sequences, either older genes or previously non-coding regions of the genome.[6] It is believed that most overlapping genes evolved in part due to this mechanism, suggesting that each overlap is composed of one ancestral gene and one novel gene.[7] Overprinting is also believed to be a source of novel proteins, as de novo proteins coded by these novel genes usually lack remote homologs in databases.[8] Overprinted genes are particularly common features of the genomic organization of viruses, likely because overprinting greatly increases the number of potential expressible genes from a small set of viral genetic information.[9] It is likely that overprinting is responsible for the generation of numerous novel proteins by viruses over the course of their evolutionary history.
Tandem out-of-phase overlap of the human mitochondrial genes ATP8 (+1 frame, in red) and ATP6 (+3 frame, in blue)[10]
Genes may overlap in a variety of ways and can be classified by their positions relative to each other.[3][11][12][13][14]
Unidirectional or tandem overlap: the 3' end of one gene overlaps with the 5' end of another gene on the same strand. This arrangement can be symbolized with the notation → → where arrows indicate the reading frame from start to end.
Convergent or end-on overlap: the 3' ends of the two genes overlap on opposite strands. This can be written as → ←.
Divergent or tail-on overlap: the 5' ends of the two genes overlap on opposite strands. This can be written as ← →.
In-phase overlap occurs when the shared sequences use the same reading frame. This is also known as "phase 0". Unidirectional genes with phase 0 overlap are not considered distinct genes, but rather alternative start sites of the same gene.
Out-of-phase overlap occurs when the shared sequences use different reading frames. This can occur in "phase 1" or "phase 2", depending on whether the reading frames are offset by 1 or 2 nucleotides. Because a codon is three nucleotides long, an offset of three nucleotides is an in-phase, phase 0 frame.
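The positional and phase classes above can be expressed as a short Python sketch. It is only an illustration under stated assumptions: the `Gene` record, its field names, and the 1-based inclusive coordinate convention are hypothetical, phase is reported only for same-strand overlaps, and a gene nested entirely inside another on the opposite strand is not treated as a separate class. Which offset is called "phase 1" and which "phase 2" depends on the reference gene; here the offset of the downstream gene is measured relative to the upstream one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gene:
    # Hypothetical record; field names are illustrative assumptions.
    name: str
    start: int   # leftmost coordinate on the forward strand (1-based, inclusive)
    end: int     # rightmost coordinate on the forward strand (1-based, inclusive)
    strand: str  # "+" or "-"


def overlap_length(a: Gene, b: Gene) -> int:
    """Number of nucleotides shared by the two genes (0 if they do not overlap)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def direction(a: Gene, b: Gene) -> Optional[str]:
    """Classify an overlap as unidirectional, convergent or divergent."""
    if overlap_length(a, b) == 0:
        return None
    if a.strand == b.strand:
        return "unidirectional"  # -> ->
    left = min(a, b, key=lambda g: g.start)
    # A left-hand gene on "+" reads toward the right-hand gene, so 3' ends meet (-> <-).
    # Otherwise the two genes read away from each other and 5' ends meet (<- ->).
    return "convergent" if left.strand == "+" else "divergent"


def phase(a: Gene, b: Gene) -> Optional[int]:
    """Reading-frame offset (0, 1 or 2) for a same-strand overlap, else None."""
    if a.strand != b.strand or overlap_length(a, b) == 0:
        return None
    # The first codon begins at `start` on "+" and at `end` on "-".
    first_a, first_b = (a.start, b.start) if a.strand == "+" else (a.end, b.end)
    return abs(first_a - first_b) % 3


# Toy coordinates, not taken from any real genome.
gene_a = Gene("geneA", 1, 30, "+")
gene_b = Gene("geneB", 26, 60, "+")
gene_c = Gene("geneC", 25, 60, "-")
print(direction(gene_a, gene_b), phase(gene_a, gene_b))  # unidirectional 1
print(direction(gene_a, gene_c))                         # convergent
```

An offset of 0 from `phase` corresponds to the in-phase case, which the text above treats as alternative start sites of one gene rather than as two distinct genes.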
Studies on overlapping genes suggest that their evolution can be summarized in two possible models.[4] In one model, the two proteins encoded by their respective overlapping genes evolve under similar selection pressures. The proteins and the overlap region are highly conserved when there is strong selection against amino acid change. Overlapping genes are reasoned to evolve under strict constraints because a single nucleotide substitution can alter the structure and function of both proteins simultaneously. A study on the hepatitis B virus (HBV), whose DNA genome contains numerous overlapping genes, showed that the mean number of synonymous nucleotide substitutions per site in overlapping coding regions was significantly lower than that of non-overlapping regions.[15] The same study showed that some of these overlapping regions and their proteins can diverge significantly from the original when there is weak selection against amino acid change. The spacer domain of the polymerase and the pre-S1 region of a surface protein of HBV, for example, had only 30% and 40% conserved amino acids, respectively.[15] However, these overlap regions are known to be less important for replication than the overlap regions that are highly conserved among different HBV strains, which are absolutely essential for the process.
The second model suggests that the two proteins and their respective overlapping genes evolve under opposite selection pressures: one frame experiences positive selection while the other is under purifying selection. In tombusviruses, the proteins p19 and p22 are encoded by overlapping genes that form a 549 nt coding region, and p19 is shown to be under positive selection while p22 is under purifying selection.[16] Additional examples are mentioned in studies involving overlapping genes of the Sendai virus,[17] potato leafroll virus,[18] and human parvovirus B19.[19] This phenomenon of overlapping genes experiencing different selection pressures is suggested to be a consequence of a high rate of nucleotide substitution with different effects on the two frames; the substitutions may be predominantly non-synonymous for one frame while being mostly synonymous for the other frame.[4]
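The point that one substitution can be synonymous in one frame and non-synonymous in the other can be checked with a small Python sketch. It assumes the standard genetic code and a same-strand overlap; the function names and the ten-nucleotide toy sequence are hypothetical and are not taken from any of the viruses mentioned above.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_at(seq, pos, frame):
    """Codon of reading frame `frame` (0, 1 or 2) covering 0-based `pos`, or None."""
    start = pos - (pos - frame) % 3
    codon = seq[start:start + 3]
    return codon if start >= 0 and len(codon) == 3 else None


def effect(seq, pos, new_base, frame):
    """Classify a single substitution as seen from one reading frame."""
    old = codon_at(seq, pos, frame)
    if old is None:
        return "outside frame"
    mutated = seq[:pos] + new_base + seq[pos + 1:]
    new = codon_at(mutated, pos, frame)
    same_residue = CODON_TABLE[old] == CODON_TABLE[new]
    return "synonymous" if same_residue else "non-synonymous"


# Toy sequence: frame 0 reads ATG GCT GAA, frame 1 reads TGG CTG AAC.
toy = "ATGGCTGAAC"
print(effect(toy, 5, "C", frame=0))  # synonymous     (GCT -> GCC, both Ala)
print(effect(toy, 5, "C", frame=1))  # non-synonymous (CTG -> CCG, Leu -> Pro)
```

The same nucleotide sits at the third codon position in one frame and at the second position in the other, which is why the two frames can experience the same substitution so differently.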
Overlapping genes are particularly common in rapidly evolving genomes, such as those of viruses, bacteria, and mitochondria. They may originate in three ways:[20]
By extension of an existing open reading frame (ORF) downstream into a contiguous gene due to the loss of a stop codon;
By extension of an existing ORF upstream into a contiguous gene due to loss of an initiation codon;
By generation of a novel ORF within an existing one due to a point mutation.
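The first of these routes, loss of a stop codon, can be illustrated with a toy Python sketch. The sequence, the helper names `orf_span` and `shared_nt`, and the use of 0-based half-open spans are assumptions made for the example only; it considers the forward strand and the standard stop codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_span(seq, start):
    """Half-open (start, end) span from `start` through the first in-frame stop codon.

    Returns None if no in-frame stop codon is found.
    """
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return (start, i + 3)
    return None


def shared_nt(a, b):
    """Number of nucleotides shared by two half-open spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


# Toy locus: an upstream ORF starting at 0 and a downstream ORF starting at 10,
# which lies in a different reading frame.
wild_type = "ATGAAATAACATGCCCTAAGCTGA"
mutant = wild_type[:6] + "C" + wild_type[7:]  # TAA -> CAA removes the upstream stop

for label, seq in (("wild type", wild_type), ("mutant", mutant)):
    upstream, downstream = orf_span(seq, 0), orf_span(seq, 10)
    print(label, upstream, downstream, shared_nt(upstream, downstream))
# wild type (0, 9) (10, 19) 0
# mutant (0, 24) (10, 19) 9
```

In the mutant, the upstream ORF reads through its former stop codon until the next in-frame stop and now shares nine nucleotides with the downstream ORF, in a different frame.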
The use of the same nucleotide sequence to encode multiple genes may provide an evolutionary advantage due to reduction in genome size and due to the opportunity for transcriptional and translational co-regulation of the overlapping genes.[12][21][22][23] Gene overlaps introduce novel evolutionary constraints on the sequences of the overlap regions.[14][24]
A cladogram indicating the likely evolutionary trajectory of the gene-dense pX region in human T-lymphotropic virus 1 (HTLV1), a deltaretrovirus associated with blood cancers. This region contains numerous overlapping genes, several of which likely originated de novo through overprinting.[9]
In 1977, Pierre-Paul Grassé proposed that one of the genes in an overlapping pair could have originated de novo by mutations that introduce novel ORFs in alternate reading frames; he described the mechanism as overprinting.[25]: 231 It was later substantiated by Susumu Ohno, who identified a candidate gene that may have arisen by this mechanism.[26] Some de novo genes originating in this way may not remain overlapping, but subfunctionalize following gene duplication,[6] contributing to the prevalence of orphan genes. Which member of an overlapping gene pair is younger can be identified bioinformatically either by a more restricted phylogenetic distribution, or by less optimized codon usage.[9][27][28] Younger members of the pair tend to have higher intrinsic structural disorder than older members, but the older members are also more disordered than other proteins, presumably as a way of alleviating the increased evolutionary constraints posed by overlap.[27] Overlaps are more likely to originate in proteins that already have high disorder.[27]
Overlapping genes in the bacteriophage ΦX174 genome. There are 11 genes in this genome (A, A*, B-H, J, K). Genes B, K, E overlap with genes A, C, D.[29]
Overlapping genes occur in all domains of life, though with varying frequencies. They are especially common in viral genomes.
The existence of overlapping genes was first identified in the virus ΦX174, whose genome was the first DNA genome ever sequenced, by Frederick Sanger in 1977.[29] Previous analysis of ΦX174, a small single-stranded DNA bacteriophage that infects the bacterium Escherichia coli, suggested that the proteins produced during infection required coding sequences longer than the measured length of its genome.[31] Analysis of the fully sequenced 5386-nucleotide genome showed that the virus possessed extensive overlap between coding regions, revealing that some genes (like genes D and E) were translated from the same DNA sequences but in different reading frames.[29][31] An alternative start site within the genome replication gene A of ΦX174 was shown to express a truncated protein with an identical coding sequence to the C-terminus of the original A protein but possessing a different function.[32][33] It was concluded that other undiscovered sites of polypeptide synthesis could be hidden throughout the genome due to overlapping genes. An identified de novo gene at another overlapping gene locus was shown to express a novel protein that induces lysis of E. coli by inhibiting biosynthesis of its cell wall,[56] suggesting that de novo protein creation through overprinting can be a significant factor in the evolution of pathogenicity of viruses.[4] Another example is the ORF3d gene in the SARS-CoV-2 virus.[1][34]
Overlapping genes are particularly common in viral genomes.[9] Some studies attribute this observation to selective pressure toward small genome sizes mediated by the physical constraints of packaging the genome in a viral capsid, particularly one of icosahedral geometry.[35] However, other studies dispute this conclusion and argue that the distribution of overlaps in viral genomes is more likely to reflect overprinting as the evolutionary origin of overlapping viral genes.[36] Overprinting is a common source of de novo genes in viruses.[28]
The proportion of viruses with overlapping coding sequences within their genomes varies.[2] Fewer than a quarter of double-stranded RNA viruses contain them, while almost three-quarters of Retroviridae and viruses with single-stranded DNA genomes contain overlapping coding sequences.[37] Segmented viruses in particular, that is, viruses whose genome is split into separate pieces and packaged either all in the same capsid or in separate capsids, are more likely to contain an overlapping sequence than non-segmented viruses.[37] RNA viruses have fewer overlapping genes than DNA viruses, which possess lower mutation rates and less restrictive genome sizes.[37][38] The lower mutation rate of DNA viruses facilitates greater genomic novelty and evolutionary exploration within a structurally constrained genome and may be the primary driver of the evolution of overlapping genes.[39][40]
Studies of overprinted viral genes suggest that their protein products tend to be accessory proteins which are not essential to viral proliferation, but contribute to pathogenicity. Overprinted proteins often have unusual amino acid distributions and high levels of intrinsic disorder.[41] In some cases overprinted proteins do have well-defined, but novel, three-dimensional structures;[42] one example is the RNA silencing suppressor p19 found in tombusviruses, which has both a novel protein fold and a novel binding mode in recognizing siRNAs.[28][30][43]
Estimates of gene overlap in bacterial genomes typically find that around one third of bacterial genes are overlapped, though usually only by a few base pairs.[12][44][45] Most studies of overlap in bacterial genomes find evidence that overlap serves a function in gene regulation, permitting the overlapped genes to be transcriptionally and translationally co-regulated.[12][23] In prokaryotic genomes, unidirectional overlaps are most common, possibly due to the tendency of adjacent prokaryotic genes to share orientation.[12][14][11] Among unidirectional overlaps, long overlaps are more commonly read with a one-nucleotide offset in reading frame (i.e., phase 1) and short overlaps are more commonly read in phase 2.[45][46] Long overlaps of greater than 60 base pairs are more common for convergent genes; however, putative long overlaps have very high rates of misannotation.[47] Robustly validated examples of long overlaps in bacterial genomes are rare; in the well-studied model organism Escherichia coli, only four gene pairs are well validated as having long, overprinted overlaps.[48]
Compared to prokaryotic genomes, eukaryotic genomes are often poorly annotated and thus identifying genuine overlaps is relatively challenging.[28] However, examples of validated gene overlaps have been documented in a variety of eukaryotic organisms, including mammals such as mice and humans.[49][50][51][52] Eukaryotes differ from prokaryotes in distribution of overlap types: while unidirectional (i.e., same-strand) overlaps are most common in prokaryotes, opposite or antiparallel-strand overlaps are more common in eukaryotes. Among the opposite-strand overlaps, convergent orientation is most common.[50] Most studies of eukaryotic gene overlap have found that overlapping genes are extensively subject to genomic reorganization even in closely related species, and thus the presence of an overlap is not always well-conserved.[51][53] Overlap with older or less taxonomically restricted genes is also a common feature of genes likely to have originated de novo in a given eukaryotic lineage.[51][54][55]
The precise functions of overlapping genes seem to vary across the domains of life, but several experiments have shown that they are important for virus lifecycles through proper protein expression and stoichiometry[56] as well as playing a role in proper protein folding.[57] A version of bacteriophage ΦX174 has also been created in which all gene overlaps were removed,[58] showing that they are not necessary for replication.
The retention and evolution of overlapping genes within viruses may also be due to capsid size limitations.[59] Dramatic viability loss was observed in viruses with genomes engineered to be longer than the wild-type genome.[60] Increasing the single-stranded DNA genome length of ΦX174 by >1% results in almost complete loss of infectivity, believed to be the result of the strict physical constraints imposed by the finite capsid volume.[61] Studies on adeno-associated viruses as gene delivery vectors showed that viral packaging is constrained by genetic cargo size limits, requiring the use of multiple vectors to deliver large human genes such as CFTR.[62][63] Therefore, it is suggested that overlapping genes evolved as a means to overcome these physical constraints, increasing genetic diversity by using only the existing sequence rather than increasing genome length.
Standardized methods such as genome annotation may be inappropriate for the detection of overlapping genes, as they rely on already curated genes, while overlapping genes are generally overlooked and can have atypical sequence composition.[2][64][65][66] Genome annotation standards are also often biased against feature overlaps, such as genes entirely contained within another gene.[67] Furthermore, some bioinformatics pipelines, such as the RAST pipeline, markedly penalize overlaps between predicted ORFs.[68] However, rapid advancement of genome-scale protein and RNA measurement tools, along with increasingly advanced prediction algorithms, has revealed an avalanche of overlapping genes and ORFs within numerous genomes.[2] Proteogenomic methods have been essential in discovering numerous overlapping genes and include a combination of techniques such as bottom-up proteomics, ribosome profiling, DNA sequencing, and perturbation. RNA sequencing is also used to identify genomic regions containing overlapping transcripts. It has been used to identify 180,000 alternate ORFs within previously annotated coding regions found in humans.[69] Newly discovered ORFs such as these are verified using a variety of reverse genetics techniques, such as CRISPR-Cas9 and catalytically dead Cas9 (dCas9) disruption.[70][71][72] Attempts at proof-by-synthesis are also performed to show beyond doubt the absence of any undiscovered overlapping genes.[73]
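To make the annotation problem concrete, the following Python sketch performs a naive scan for ATG-to-stop ORFs in the three forward reading frames and reports pairs in different frames that share nucleotides. It is a simplified assumption-laden illustration, not the method of any pipeline or study cited above: the function names, the `min_length` threshold, and the toy sequence are hypothetical, the reverse strand is ignored, and real discovery work relies on experimental evidence such as ribosome profiling and proteomics.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_length=9):
    """Return (start, end, frame) for each ATG-to-stop ORF on the forward strand.

    Spans are 0-based and half-open; `min_length` is in nucleotides and
    includes the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_length:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


def overlapping_pairs(orfs):
    """Pairs of ORFs in different frames that share at least one nucleotide."""
    pairs = []
    for i, (s1, e1, f1) in enumerate(orfs):
        for s2, e2, f2 in orfs[i + 1:]:
            if f1 != f2 and min(e1, e2) > max(s1, s2):
                pairs.append(((s1, e1, f1), (s2, e2, f2)))
    return pairs


toy = "ATGAAACAACATGCCCTAAGCTGA"  # toy sequence, not from a real genome
print(find_orfs(toy))                     # [(0, 24, 0), (10, 19, 1)]
print(overlapping_pairs(find_orfs(toy)))  # [((0, 24, 0), (10, 19, 1))]
```

An annotation step that keeps only the longest ORF at a locus, or that penalizes overlapping predictions, would discard the shorter frame 1 ORF in this toy case even though it is a complete reading frame.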
Gibbs A, Keese PK (19 October 1995). "In search of the origins of viral genes". Molecular Basis of Virus Evolution. Cambridge University Press. pp. 76–90. doi:10.1017/cbo9780511661686.008. ISBN 978-0-521-45533-6.
Anderson S, Bankier AT, Barrell BG, de Bruijn MH, Coulson AR, Drouin J, Eperon IC, Nierlich DP, Roe BA, Sanger F, Schreier PH, Smith AJ, Staden R, Young IG (April 1981). "Sequence and organization of the human mitochondrial genome". Nature. 290 (5806): 457–465. Bibcode:1981Natur.290..457A. doi:10.1038/290457a0. PMID 7219534. S2CID 4355527.
Normark S., Bergstrom S., Edlund T., Grundstrom T., Jaurin B., Lindberg F.P., Olsson O. (1983). "Overlapping genes". Annual Review of Genetics. 17: 499–525. doi:10.1146/annurev.ge.17.120183.002435. PMID 6198955.
Rogozin IB, Spiridonov AN, Sorokin AV, Wolf YI, Jordan I, Tatusov RL, Koonin EV (May 2002). "Purifying and directional selection in overlapping prokaryotic genes". Trends in Genetics. 18 (5): 228–232. doi:10.1016/S0168-9525(02)02649-5. PMID 12047938.
Fujii Y, Kiyotani K, Yoshida T, Sakaguchi T (2001). "Conserved and non-conserved regions in the Sendai virus genome: Evolution of a gene possessing overlapping reading frames". Virus Genes. 22 (1): 47–52. doi:10.1023/a:1008130318633. ISSN 0920-8569. PMID 11210938. S2CID 12869504.