Gene duplication (or chromosomal duplication or gene amplification) is a major mechanism through which new genetic material is generated during molecular evolution. It can be defined as any duplication of a region of DNA that contains a gene. Gene duplications can arise as products of several types of errors in DNA replication and repair machinery as well as through fortuitous capture by selfish genetic elements. Common sources of gene duplications include ectopic recombination, retrotransposition events, aneuploidy, polyploidy, and replication slippage.[1]
Duplications arise from an event termed unequal crossing-over, which occurs during meiosis between misaligned homologous chromosomes. The chance of it happening depends on how many repetitive elements the two chromosomes share. The products of this recombination are a duplication at the site of the exchange and a reciprocal deletion. Ectopic recombination is typically mediated by sequence similarity at the duplicate breakpoints, which form direct repeats. Repetitive genetic elements such as transposable elements offer one source of repetitive DNA that can facilitate recombination, and they are often found at duplication breakpoints in plants and mammals.[2]
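The reciprocal outcome of unequal crossing-over can be illustrated with a toy model. The sketch below is only an illustration under simplifying assumptions: chromosomes are plain Python lists of labelled segments, the labels and the `unequal_crossover` helper are hypothetical, and the misalignment is imposed by choosing different breakpoints on the two homologs.

```python
def unequal_crossover(chrom_a, chrom_b, break_a, break_b):
    """Exchange chromosome arms after index break_a on chrom_a and break_b on chrom_b.

    When break_a != break_b the homologs are misaligned, so the two
    recombinant products differ in length.
    """
    recombinant_1 = chrom_a[:break_a + 1] + chrom_b[break_b + 1:]
    recombinant_2 = chrom_b[:break_b + 1] + chrom_a[break_a + 1:]
    return recombinant_1, recombinant_2


# Two identical homologs: a gene flanked by direct repeats (hypothetical labels).
homolog = ["left", "repeat", "gene", "repeat", "right"]

# The second repeat of one homolog (index 3) pairs with the first repeat
# of the other (index 1), and the exchange happens there.
duplicated, deleted = unequal_crossover(homolog, homolog, break_a=3, break_b=1)

print(duplicated)  # carries two copies of the gene
print(deleted)     # carries no copy of the gene
```

One product gains a copy of the gene and the other loses it, which matches the duplication and reciprocal deletion described above.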
Replication slippage is an error in DNA replication that can produce duplications of short genetic sequences. During replication, DNA polymerase begins to copy the DNA. At some point during the replication process, the polymerase dissociates from the DNA and replication stalls. When the polymerase reattaches to the DNA strand, it aligns the replicating strand to an incorrect position and incidentally copies the same section more than once. Replication slippage is also often facilitated by repetitive sequences, but requires only a few bases of similarity.[citation needed]
Retrotransposons, mainly L1, can occasionally act on cellular mRNA. Transcripts are reverse transcribed to DNA and inserted at random places in the genome, creating retrogenes. The resulting sequences usually lack introns and often contain poly(A) sequences that are also integrated into the genome. Many retrogenes display changes in gene regulation in comparison to their parental gene sequences, which sometimes results in novel functions. Retrogenes can move between different chromosomes to shape chromosomal evolution.[3]
Aneuploidy occurs when nondisjunction at a single chromosome results in an abnormal number of chromosomes. Aneuploidy is often harmful and in mammals regularly leads to spontaneous abortions (miscarriages). Some aneuploid individuals are viable, for example trisomy 21 in humans, which leads to Down syndrome. Aneuploidy often alters gene dosage in ways that are detrimental to the organism; therefore, it is unlikely to spread through populations.
Polyploidy, or whole genome duplication, is a product of nondisjunction during meiosis that results in additional copies of the entire genome. Polyploidy is common in plants, but it has also occurred in animals, with two rounds of whole genome duplication (2R event) in the vertebrate lineage leading to humans.[4] It has also occurred in the hemiascomycete yeasts about 100 million years ago.[5][6]
After a whole genome duplication, there is a relatively short period of genome instability, extensive gene loss, elevated levels of nucleotide substitution and regulatory network rewiring.[7][8] In addition, gene dosage effects play a significant role.[9] Thus, most duplicates are lost within a short period; however, a considerable fraction of duplicates survive.[10] Interestingly, genes involved in regulation are preferentially retained.[11][12] Furthermore, retention of regulatory genes, most notably the Hox genes, has led to adaptive innovation.
Rapid evolution and functional divergence have been observed at the level of the transcription of duplicated genes, usually by point mutations in short transcription factor binding motifs.[13][14] Furthermore, rapid evolution of protein phosphorylation motifs, usually embedded within rapidly evolving intrinsically disordered regions, is another contributing factor for survival and rapid adaptation/neofunctionalization of duplicate genes.[15] Thus, a link seems to exist between gene regulation (at least at the post-translational level) and genome evolution.[15]
Polyploidy is also a well-known source of speciation, as offspring, which have different numbers of chromosomes compared to the parent species, are often unable to interbreed with non-polyploid organisms. Whole genome duplications are thought to be less detrimental than aneuploidy, as the relative dosage of individual genes should be the same.
Comparisons of genomes demonstrate that gene duplications are common in most species investigated. This is indicated by variable copy numbers (copy number variation) in the genome of humans[16][17] or fruit flies.[18] However, it has been difficult to measure the rate at which such duplications occur. Recent studies yielded a first direct estimate of the genome-wide rate of gene duplication in C. elegans, the first multicellular eukaryote for which such an estimate became available. The gene duplication rate in C. elegans is on the order of 10⁻⁷ duplications/gene/generation; that is, in a population of 10 million worms, about one is expected to carry a new duplication of any given gene each generation. This rate is two orders of magnitude greater than the spontaneous rate of point mutation per nucleotide site in this species.[19] Older (indirect) studies reported locus-specific duplication rates in bacteria, Drosophila, and humans ranging from 10⁻³ to 10⁻⁷/gene/generation.[20][21][22]
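The arithmetic behind these figures can be checked in a few lines. This is a minimal sketch, not part of the cited studies: the function names are made up for illustration, the rate is only the order-of-magnitude value quoted above, and the point mutation rate is simply what "two orders of magnitude" lower implies.

```python
# Order-of-magnitude rate quoted in the text (duplications per gene per generation).
DUPLICATION_RATE = 1e-7


def expected_new_duplications(population_size, genes_considered=1,
                              rate=DUPLICATION_RATE):
    """Expected number of new duplication events arising in one generation."""
    return population_size * genes_considered * rate


def implied_point_mutation_rate(rate=DUPLICATION_RATE, orders_of_magnitude=2):
    """Per-site point mutation rate implied by 'two orders of magnitude' lower."""
    return rate / 10 ** orders_of_magnitude


# For a single gene in a population of 10 million worms.
per_gene = expected_new_duplications(10_000_000)
print(f"expected duplications of one gene per generation: {per_gene:.2f}")
print(f"implied point mutation rate per site: {implied_point_mutation_rate():.0e}")
```

Passing a larger `genes_considered` scales the expectation linearly, which is why genome-wide duplication events are far more frequent than events at any single locus.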
Gene duplications are an essential source of genetic novelty that can lead to evolutionary innovation. Duplication creates genetic redundancy, where the second copy of the gene is often free from selective pressure (that is, mutations of it have no deleterious effects on its host organism). If one copy of a gene experiences a mutation that affects its original function, the second copy can serve as a 'spare part' and continue to function correctly. Thus, duplicate genes accumulate mutations faster than a functional single-copy gene, over generations of organisms, and it is possible for one of the two copies to develop a new and different function. Some examples of such neofunctionalization are the apparent mutation of a duplicated digestive gene in a family of ice fish into an antifreeze gene, duplication leading to a novel snake venom gene,[23] and the synthesis of 1 beta-hydroxytestosterone in pigs.[24]
Gene duplication is believed to play a major role in evolution; this stance has been held by members of the scientific community for over 100 years.[25] Susumu Ohno was one of the most famous developers of this theory in his classic book Evolution by gene duplication (1970).[26] Ohno argued that gene duplication is the most important evolutionary force since the emergence of the universal common ancestor.[27] Major genome duplication events can be quite common. It is believed that the entire yeast genome underwent duplication about 100 million years ago.[28] Plants are the most prolific genome duplicators. For example, wheat is hexaploid (a kind of polyploid), meaning that it has six copies of its genome.
Another possible fate for duplicate genes is that both copies are equally free to accumulate degenerative mutations, so long as any defects are complemented by the other copy. This leads to a neutral "subfunctionalization" (a process of constructive neutral evolution) or DDC (duplication-degeneration-complementation) model,[29][30] in which the functionality of the original gene is distributed between the two copies. Neither gene can be lost, as both now perform important non-redundant functions, but ultimately neither is able to achieve novel functionality.
Subfunctionalization can occur through neutral processes in which mutations accumulate with no detrimental or beneficial effects. However, in some cases subfunctionalization can occur with clear adaptive benefits. If an ancestral gene is pleiotropic and performs two functions, often neither one of these two functions can be changed without affecting the other function. In this way, partitioning the ancestral functions into two separate genes can allow for adaptive specialization of subfunctions, thereby providing an adaptive benefit.[31]
Often the resulting genomic variation leads to gene dosage dependent neurological disorders such as Rett-like syndrome and Pelizaeus–Merzbacher disease.[32] Such detrimental mutations are likely to be lost from the population and will not be preserved or develop novel functions. However, many duplications are, in fact, not detrimental or beneficial, and these neutral sequences may be lost or may spread through the population through random fluctuations via genetic drift.
The two genes that exist after a gene duplication event are called paralogs and usually code for proteins with a similar function and/or structure. By contrast, orthologous genes are present in different species and are each originally derived from the same ancestral sequence. (See Homology of sequences in genetics.)
It is important (but often difficult) to differentiate between paralogs and orthologs in biological research. Experiments on human gene function can often be carried out on other species if a homolog to a human gene can be found in the genome of that species, but only if the homolog is orthologous. If they are paralogs and resulted from a gene duplication event, their functions are likely to be too different. One or more copies of duplicated genes that constitute a gene family may be affected by insertion of transposable elements, which causes significant variation between their sequences and may eventually become responsible for divergent evolution. This may also reduce the chance and the rate of gene conversion between the duplicates, because their sequences share less or no similarity.
Paralogs can be identified in single genomes through a sequence comparison of all annotated gene models to one another. Such a comparison can be performed on translated amino acid sequences (e.g. BLASTp, tBLASTx) to identify ancient duplications or on DNA nucleotide sequences (e.g. BLASTn, megablast) to identify more recent duplications. Most studies to identify gene duplications require reciprocal-best-hits or fuzzy reciprocal-best-hits, where each paralog must be the other's single best match in a sequence comparison.[33]
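The strict reciprocal-best-hit criterion can be sketched as follows. This is an illustrative outline rather than the procedure of any particular study: the input is assumed to be a list of `(query, subject, score)` tuples already parsed from an all-against-all comparison, and the gene names, scores, and function names are hypothetical.

```python
def best_hits(hits):
    """Map each query gene to its highest-scoring non-self subject."""
    best = {}
    for query, subject, score in hits:
        if query == subject:
            continue  # a gene always matches itself; ignore self-hits
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(hits):
    """Return sorted gene pairs in which each gene is the other's best hit."""
    best = best_hits(hits)
    pairs = set()
    for gene_a, gene_b in best.items():
        if best.get(gene_b) == gene_a:
            pairs.add(tuple(sorted((gene_a, gene_b))))
    return sorted(pairs)


# Hypothetical all-against-all scores within one genome.
hits = [
    ("geneA", "geneA", 500.0),
    ("geneA", "geneB", 310.0),
    ("geneA", "geneC", 120.0),
    ("geneB", "geneA", 305.0),
    ("geneB", "geneC", 90.0),
    ("geneC", "geneA", 118.0),
    ("geneC", "geneB", 95.0),
]

print(reciprocal_best_hits(hits))
```

Here `geneC` is not reported, because its best match (`geneA`) prefers `geneB`. A "fuzzy" variant would relax the equality test, for example by accepting any hit within some margin of the top score.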
Most gene duplications exist as low copy repeats (LCRs), rather than highly repetitive sequences like transposable elements. They are mostly found in pericentromeric, subtelomeric and interstitial regions of a chromosome. Many LCRs, due to their size (>1 kb), similarity, and orientation, are highly susceptible to duplications and deletions.
Technologies such as genomic microarrays, also called array comparative genomic hybridization (array CGH), are used to detect chromosomal abnormalities, such as microduplications, in a high throughput fashion from genomic DNA samples. In particular, DNA microarray technology can simultaneously monitor the expression levels of thousands of genes across many treatments or experimental conditions, greatly facilitating the evolutionary studies of gene regulation after gene duplication or speciation.[34][35]
Gene duplications can also be identified through the use of next-generation sequencing platforms. The simplest means to identify duplications in genomic resequencing data is through the use of paired-end sequencing reads. Tandem duplications are indicated by sequencing read pairs which map in abnormal orientations. Through a combination of increased sequence coverage and abnormal mapping orientation, it is possible to identify duplications in genomic sequencing data.
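The combination of the two signals can be sketched in code. The example below is a simplified illustration, not a production variant caller. It assumes a standard paired-end library in which normal pairs face inward (left read on the forward strand, right read on the reverse strand), so pairs spanning a tandem duplication junction appear everted. The `ReadPair` type, the thresholds, and the depth values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    left_pos: int      # leftmost mapped coordinate of the pair
    left_strand: str   # '+' or '-'
    right_pos: int     # coordinate of the rightmost read
    right_strand: str  # '+' or '-'


def is_everted(pair):
    """True when the reads point away from each other (abnormal orientation)."""
    return pair.left_strand == "-" and pair.right_strand == "+"


def candidate_tandem_duplication(pairs, region_depth, genome_depth,
                                 min_everted=2, min_depth_ratio=1.3):
    """Return (start, end) of a candidate duplication, or None.

    Requires both signals: enough everted pairs and elevated coverage.
    The default thresholds are illustrative assumptions only.
    """
    everted = [p for p in pairs if is_everted(p)]
    if len(everted) < min_everted:
        return None
    if region_depth / genome_depth < min_depth_ratio:
        return None
    start = min(p.left_pos for p in everted)
    end = max(p.right_pos for p in everted)
    return start, end


pairs = [
    ReadPair(1000, "+", 1300, "-"),  # normal inward-facing pair
    ReadPair(2050, "-", 4900, "+"),  # everted pair
    ReadPair(2010, "-", 4950, "+"),  # everted pair
    ReadPair(5000, "+", 5300, "-"),  # normal inward-facing pair
]

print(candidate_tandem_duplication(pairs, region_depth=45.0, genome_depth=30.0))
```

Requiring both signals reduces false positives: everted pairs alone can result from mapping errors in repeats, and elevated coverage alone does not show where the extra copy sits.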
The International System for Human Cytogenomic Nomenclature (ISCN) is an international standard for human chromosome nomenclature, which includes band names, symbols and abbreviated terms used in the description of human chromosomes and chromosome abnormalities. Abbreviations include dup for duplications of parts of a chromosome.[36] For example, dup(17p12) causes Charcot–Marie–Tooth disease type 1A.[37]
Gene duplication does not necessarily constitute a lasting change in a species' genome. In fact, such changes often don't last past the initial host organism. From the perspective of molecular genetics, gene amplification is one of many ways in which a gene can be overexpressed. Genetic amplification can occur artificially, as with the use of the polymerase chain reaction technique to amplify short strands of DNA in vitro using enzymes, or it can occur naturally, as described above. If it's a natural duplication, it can still take place in a somatic cell, rather than a germline cell (which would be necessary for a lasting evolutionary change).
Duplications of oncogenes are a common cause of many types of cancer. In such cases the genetic duplication occurs in a somatic cell and affects only the genome of the cancer cells themselves, not the entire organism, much less any subsequent offspring. Recent comprehensive patient-level classification and quantification of driver events in TCGA cohorts revealed that there are on average 12 driver events per tumor, of which 1.5 are amplifications of oncogenes.[38]
Cancer type | Associated gene amplifications | Prevalence of amplification in cancer type (percent) |
---|---|---|
Breast cancer | MYC | 20%[39] |
Breast cancer | ERBB2 (HER2) | 20%[39] |
Breast cancer | CCND1 (Cyclin D1) | 15–20%[39] |
Breast cancer | FGFR1 | 12%[39] |
Breast cancer | FGFR2 | 12%[39] |
Cervical cancer | MYC | 25–50%[39] |
Cervical cancer | ERBB2 | 20%[39] |
Colorectal cancer | HRAS | 30%[39] |
Colorectal cancer | KRAS | 20%[39] |
Colorectal cancer | MYB | 15–20%[39] |
Esophageal cancer | MYC | 40%[39] |
Esophageal cancer | CCND1 | 25%[39] |
Esophageal cancer | MDM2 | 13%[39] |
Gastric cancer | CCNE (Cyclin E) | 15%[39] |
Gastric cancer | KRAS | 10%[39] |
Gastric cancer | MET | 10%[39] |
Glioblastoma | ERBB1 (EGFR) | 33–50%[39] |
Glioblastoma | CDK4 | 15%[39] |
Head and neck cancer | CCND1 | 50%[39] |
Head and neck cancer | ERBB1 | 10%[39] |
Head and neck cancer | MYC | 7–10%[39] |
Hepatocellular cancer | CCND1 | 13%[39] |
Neuroblastoma | MYCN | 20–25%[39] |
Ovarian cancer | MYC | 20–30%[39] |
Ovarian cancer | ERBB2 | 15–30%[39] |
Ovarian cancer | AKT2 | 12%[39] |
Sarcoma | MDM2 | 10–30%[39] |
Sarcoma | CDK4 | 10%[39] |
Small cell lung cancer | MYC | 15–20%[39] |
Whole-genome duplications are also frequent in cancers, detected in 30% to 36% of tumors from the most common cancer types.[40][41] Their exact role in carcinogenesis is unclear, but in some cases they lead to a loss of chromatin segregation, which causes chromatin conformation changes that in turn produce oncogenic epigenetic and transcriptional modifications.[42]