A multiplesequence alignment of five mammalianhistone H1 proteins Sequences are theamino acids for residues 120-180 of the proteins. Residues that are conserved across all sequences are highlighted in grey. Below each site (i.e., position) of the protein sequence alignment is a key denoting conserved sites (*), sites withconservative replacements (:), sites with semi-conservative replacements (.), and sites withnon-conservative replacements ( ).[1]
Over many generations, nucleic acid sequences in thegenome of anevolutionary lineage can gradually change over time due to random mutations anddeletions.[9][10] Sequences may also recombine or be deleted due tochromosomal rearrangements. Conserved sequences are sequences which persist in the genome despite such forces, and have slower rates of mutation than the background mutation rate.[11]
In coding sequences, the nucleic acid and amino acid sequence may be conserved to different extents, as the degeneracy of thegenetic code means thatsynonymous mutations in a coding sequence do not affect the amino acid sequence of its protein product.[15]
The nucleic acid sequence of a protein coding gene may also be conserved by other selective pressures. Thecodon usage bias in some organisms may restrict the types of synonymous mutations in a sequence. Nucleic acid sequences that causesecondary structure in the mRNA of a coding gene may be selected against, as some structures may negatively affect translation, or conserved where the mRNA also acts as a functional non-coding RNA.[19][20]
Non-coding sequences important forgene regulation, such as the binding or recognition sites ofribosomes andtranscription factors, may be conserved within a genome. For example, thepromoter of a conserved gene oroperon may also be conserved. As with proteins, nucleic acids that are important for the structure and function ofnon-coding RNA (ncRNA) can also be conserved. However, sequence conservation in ncRNAs is generally poor compared to protein-coding sequences, andbase pairs that contribute to structure or function are often conserved instead.[21][22]
Conserved sequences may be identified byhomology search, using tools such asBLAST,HMMER,OrthologR,[25] and Infernal.[26] Homology search tools may take an individual nucleic acid or protein sequence as input, or use statistical models generated frommultiple sequence alignments of known related sequences. Statistical models such asprofile-HMMs, and RNA covariance models which also incorporate structural information,[27] can be helpful when searching for more distantly related sequences. Input sequences are then aligned against a database of sequences from related individuals or other species. The resulting alignments are then scored based on the number of matching amino acids or bases, and the number of gaps or deletions generated by the alignment. Acceptable conservative substitutions may be identified using substitution matrices such asPAM andBLOSUM. Highly scoring alignments are assumed to be from homologous sequences. The conservation of a sequence may then be inferred by detection of highly similar homologs over a broad phylogenetic range.[28]
A sequence logo for the LexA-binding motif ofgram-positive bacteria. As theadenosine at position 5 is highly conserved, it appears larger than other characters.[29]
Multiple sequence alignments can be used to visualise conserved sequences. TheCLUSTAL format includes a plain-text key to annotate conserved columns of the alignment, denoting conserved sequence (*), conservative mutations (:), semi-conservative mutations (.), and non-conservative mutations ( )[30] Sequence logos can also show conserved sequence by representing the proportions of characters at each point in the alignment by height.[29]
This image from the ECR browser[31] shows the result of aligning different vertebrate genomes to the human genome at the conservedOTX2 gene. Top: Gene annotations ofexons andintrons of the OTX2 gene. For each genome, sequence similarity (%) compared to the human genome is plotted. Tracks show thezebrafish,dog,chicken,western clawed frog,opossum,mouse,rhesus macaque andchimpanzee genomes. The peaks show regions of high sequence similarity across all genomes, showing that this sequence is highly conserved.
Whole genome alignments (WGAs) may also be used to identify highly conserved regions across species. Currently the accuracy andscalability of WGA tools remains limited due to the computational complexity of dealing with rearrangements, repeat regions and the large size of many eukaryotic genomes.[32] However, WGAs of 30 or more closely related bacteria (prokaryotes) are now increasingly feasible.[33][34]
Other approaches use measurements of conservation based onstatistical tests that attempt to identify sequences which mutate differently to an expected background (neutral) mutation rate.
The GERP (Genomic Evolutionary Rate Profiling) framework scores conservation of genetic sequences across species. This approach estimates the rate of neutral mutation in a set of species from a multiple sequence alignment, and then identifies regions of the sequence that exhibit fewer mutations than expected. These regions are then assigned scores based on the difference between the observed mutation rate and expected background mutation rate. A high GERP score then indicates a highly conserved sequence.[35][36]
LIST[37][38] (Local Identity and Shared Taxa) is based on the assumption that variations observed in species closely related to human are more significant when assessing conservation compared to those in distantly related species. Thus, LIST utilizes the local alignment identity around each position to identify relevant sequences in the multiple sequence alignment (MSA) and then it estimates conservation based on the taxonomy distances of these sequences to human. Unlike other tools, LIST ignores the count/frequency of variations in the MSA.
Aminode[39] combines multiple alignments with phylogenetic analysis to analyze changes in homologous proteins and produce a plot that indicates the local rates of evolutionary changes. This approach identifies the Evolutionarily Constrained Regions in a protein, which are segments that are subject topurifying selection and are typically critical for normal protein function.
Other approaches such as PhyloP and PhyloHMM incorporatestatistical phylogenetics methods to compareprobability distributions of substitution rates, which allows the detection of both conservation and accelerated mutation. First, a background probability distribution is generated of the number of substitutions expected to occur for a column in a multiple sequence alignment, based on aphylogenetic tree. The estimated evolutionary relationships between the species of interest are used to calculate the significance of any substitutions (i.e. a substitution between two closely related species may be less likely to occur than distantly related ones, and therefore more significant). To detect conservation, a probability distribution is calculated for a subset of the multiple sequence alignment, and compared to the background distribution using a statistical test such as alikelihood-ratio test orscore test.P-values generated from comparing the two distributions are then used to identify conserved regions. PhyloHMM useshidden Markov models to generate probability distributions. The PhyloP software package compares probability distributions using alikelihood-ratio test orscore test, as well as using a GERP-like scoring system.[40][41][42]
The most highly conserved genes are those that can be found in all organisms. These consist mainly of thencRNAs and proteins required fortranscription andtranslation, which are assumed to have been conserved from thelast universal common ancestor of all life.[49]
Sets of conserved sequences are often used for generatingphylogenetic trees, as it can be assumed that organisms with similar sequences are closely related.[52] The choice of sequences may vary depending on the taxonomic scope of the study. For example, the most highly conserved genes such as the 16S RNA and other ribosomal sequences are useful for reconstructing deep phylogenetic relationships and identifying bacterialphyla inmetagenomics studies.[53][54] Sequences that are conserved within aclade but undergo some mutations, such ashousekeeping genes, can be used to study species relationships.[55][56][57] Theinternal transcribed spacer (ITS) region, which is required for spacing conserved rRNA genes but undergoes rapid evolution, is commonly used to classifyfungi and strains of rapidly evolving bacteria.[58][59][60][61]
As highly conserved sequences often have important biological functions, they can be useful a starting point for identifying the cause ofgenetic diseases. Manycongenital metabolic disorders andLysosomal storage diseases are the result of changes to individual conserved genes, resulting in missing or faulty enzymes that are the underlying cause of the symptoms of the disease. Genetic diseases may be predicted by identifying sequences that are conserved between humans and lab organisms such asmice[62] orfruit flies,[63] and studying the effects ofknock-outs of these genes.[64]Genome-wide association studies can also be used to identify variation in conserved sequences associated with disease or health outcomes. More than two dozen novel potential susceptibility loci have been discovered for Alzehimer's disease.[65][66]
Identifying conserved sequences can be used to discover and predict functional sequences such as genes.[67] Conserved sequences with a known function, such as protein domains, can also be used to predict the function of a sequence. Databases of conserved protein domains such asPfam and theConserved Domain Database can be used to annotate functional domains in predicted protein coding genes.[68]
^Isenbarger, Thomas A.; Carr, Christopher E.; Johnson, Sarah Stewart; Finney, Michael; Church, George M.; Gilbert, Walter; Zuber, Maria T.; Ruvkun, Gary (14 October 2008). "The Most Conserved Genome Segments for Life Detection on Earth and Other Planets".Origins of Life and Evolution of Biospheres.38 (6):517–533.Bibcode:2008OLEB...38..517I.doi:10.1007/s11084-008-9148-z.PMID18853276.S2CID15707806.
^Hug, Laura A.; Baker, Brett J.; Anantharaman, Karthik; Brown, Christopher T.; Probst, Alexander J.; Castelle, Cindy J.; Butterfield, Cristina N.; Hernsdorf, Alex W.; Amano, Yuki; Ise, Kotaro; Suzuki, Yohey; Dudek, Natasha; Relman, David A.; Finstad, Kari M.; Amundson, Ronald; Thomas, Brian C.; Banfield, Jillian F. (11 April 2016)."A new view of the tree of life".Nature Microbiology.1 (5): 16048.doi:10.1038/nmicrobiol.2016.48.PMID27572647.
^Schoch, C. L.; Seifert, K. A.; Huhndorf, S.; Robert, V.; Spouge, J. L.; Levesque, C. A.; Chen, W.; Bolchacova, E.; Voigt, K.; Crous, P. W.; Miller, A. N.; Wingfield, M. J.; Aime, M. C.; An, K.-D.; Bai, F.-Y.; Barreto, R. W.; Begerow, D.; Bergeron, M.-J.; Blackwell, M.; Boekhout, T.; Bogale, M.; Boonyuen, N.; Burgaz, A. R.; Buyck, B.; Cai, L.; Cai, Q.; Cardinali, G.; Chaverri, P.; Coppins, B. J.; Crespo, A.; Cubas, P.; Cummings, C.; Damm, U.; de Beer, Z. W.; de Hoog, G. S.; Del-Prado, R.; Dentinger, B.; Dieguez-Uribeondo, J.; Divakar, P. K.; Douglas, B.; Duenas, M.; Duong, T. A.; Eberhardt, U.; Edwards, J. E.; Elshahed, M. S.; Fliegerova, K.; Furtado, M.; Garcia, M. A.; Ge, Z.-W.; Griffith, G. W.; Griffiths, K.; Groenewald, J. Z.; Groenewald, M.; Grube, M.; Gryzenhout, M.; Guo, L.-D.; Hagen, F.; Hambleton, S.; Hamelin, R. C.; Hansen, K.; Harrold, P.; Heller, G.; Herrera, C.; Hirayama, K.; Hirooka, Y.; Ho, H.-M.; Hoffmann, K.; Hofstetter, V.; Hognabba, F.; Hollingsworth, P. M.; Hong, S.-B.; Hosaka, K.; Houbraken, J.; Hughes, K.; Huhtinen, S.; Hyde, K. D.; James, T.; Johnson, E. M.; Johnson, J. E.; Johnston, P. R.; Jones, E. B. G.; Kelly, L. J.; Kirk, P. M.; Knapp, D. G.; Koljalg, U.; Kovacs, G. M.; Kurtzman, C. P.; Landvik, S.; Leavitt, S. D.; Liggenstoffer, A. S.; Liimatainen, K.; Lombard, L.; Luangsa-ard, J. J.; Lumbsch, H. T.; Maganti, H.; Maharachchikumbura, S. S. N.; Martin, M. P.; May, T. W.; McTaggart, A. R.; Methven, A. S.; Meyer, W.; Moncalvo, J.-M.; Mongkolsamrit, S.; Nagy, L. G.; Nilsson, R. H.; Niskanen, T.; Nyilasi, I.; Okada, G.; Okane, I.; Olariaga, I.; Otte, J.; Papp, T.; Park, D.; Petkovits, T.; Pino-Bodas, R.; Quaedvlieg, W.; Raja, H. A.; Redecker, D.; Rintoul, T. L.; Ruibal, C.; Sarmiento-Ramirez, J. M.; Schmitt, I.; Schussler, A.; Shearer, C.; Sotome, K.; Stefani, F. O. P.; Stenroos, S.; Stielow, B.; Stockinger, H.; Suetrong, S.; Suh, S.-O.; Sung, G.-H.; Suzuki, M.; Tanaka, K.; Tedersoo, L.; Telleria, M. T.; Tretter, E.; Untereiner, W. A.; Urbina, H.; Vagvolgyi, C.; Vialle, A.; Vu, T. D.; Walther, G.; Wang, Q.-M.; Wang, Y.; Weir, B. S.; Weiss, M.; White, M. M.; Xu, J.; Yahr, R.; Yang, Z. L.; Yurkov, A.; Zamora, J.-C.; Zhang, N.; Zhuang, W.-Y.; Schindel, D. (27 March 2012)."Nuclear ribosomal internal transcribed spacer (ITS) region as a universal DNA barcode marker for Fungi".Proceedings of the National Academy of Sciences.109 (16):6241–6246.doi:10.1073/pnas.1117018109.PMC3341068.PMID22454494.
^Ge, Dongliang; Fellay, Jacques; Thompson, Alexander J.; Simon, Jason S.; Shianna, Kevin V.; Urban, Thomas J.; Heinzen, Erin L.; Qiu, Ping; Bertelsen, Arthur H.; Muir, Andrew J.; Sulkowski, Mark; McHutchison, John G.; Goldstein, David B. (16 August 2009). "Genetic variation in IL28B predicts hepatitis C treatment-induced viral clearance".Nature.461 (7262):399–401.Bibcode:2009Natur.461..399G.doi:10.1038/nature08309.PMID19684573.S2CID1707096.
^Marchler-Bauer, A.; Lu, S.; Anderson, J. B.; Chitsaz, F.; Derbyshire, M. K.; DeWeese-Scott, C.; Fong, J. H.; Geer, L. Y.; Geer, R. C.; Gonzales, N. R.; Gwadz, M.; Hurwitz, D. I.; Jackson, J. D.; Ke, Z.; Lanczycki, C. J.; Lu, F.; Marchler, G. H.; Mullokandov, M.; Omelchenko, M. V.; Robertson, C. L.; Song, J. S.; Thanki, N.; Yamashita, R. A.; Zhang, D.; Zhang, N.; Zheng, C.; Bryant, S. H. (24 November 2010)."CDD: a Conserved Domain Database for the functional annotation of proteins".Nucleic Acids Research.39 (Database):D225 –D229.doi:10.1093/nar/gkq1189.PMC3013737.PMID21109532.