A protein superfamily is the largest grouping (clade) of proteins for which common ancestry can be inferred (see homology). Usually this common ancestry is inferred from structural alignment[1] and mechanistic similarity, even if no sequence similarity is evident.[2] Sequence homology can then be deduced even if not apparent (due to low sequence similarity). Superfamilies typically contain several protein families which show sequence similarity within each family. The term protein clan is commonly used for protease and glycosyl hydrolase superfamilies based on the MEROPS and CAZy classification systems.[2][3]
Above, secondary structural conservation of 80 members of the PA protease clan (superfamily). H indicates α-helix, E indicates β-sheet, L indicates loop. Below, sequence conservation for the same alignment. Arrows indicate catalytic triad residues. Aligned on the basis of structure by DALI.
Superfamilies of proteins are identified using a number of methods. Closely related members can be identified by methods different from those needed to group the most evolutionarily divergent members.
A sequence alignment of mammalian histone proteins. The similarity of the sequences implies that they evolved by gene duplication. Residues that are conserved across all sequences are highlighted in grey. Below the protein sequences is a key denoting:[4]
Historically, the similarity of different amino acid sequences has been the most common method of inferring homology.[5] Sequence similarity is considered a good predictor of relatedness, since similar sequences are more likely the result of gene duplication and divergent evolution, rather than the result of convergent evolution. Amino acid sequence is typically more conserved than DNA sequence (due to the degenerate genetic code), so it is a more sensitive detection method. Since some of the amino acids have similar properties (e.g., charge, hydrophobicity, size), conservative mutations that interchange them are often neutral to function. The most conserved sequence regions of a protein often correspond to functionally important regions like catalytic sites and binding sites, since these regions are less tolerant to sequence changes.
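The distinction between identical residues and conservative substitutions can be illustrated with a minimal Python sketch. The property groups below are a simplified, hypothetical grouping chosen for illustration (real analyses use substitution matrices such as BLOSUM62), and the two aligned sequences are invented toy data.

```python
# Simplified, hypothetical grouping of amino acids by broad property.
# This is an assumption for illustration, not a standard classification.
PROPERTY_GROUPS = [
    set("AVLIMC"),  # small or aliphatic, hydrophobic
    set("FWY"),     # aromatic
    set("KRH"),     # positively charged
    set("DE"),      # negatively charged
    set("STNQ"),    # polar, uncharged
    set("G"),
    set("P"),
]


def classify_column(a, b):
    """Classify one column of a pairwise alignment."""
    if a == "-" or b == "-":
        return "gap"
    if a == b:
        return "identical"
    for group in PROPERTY_GROUPS:
        if a in group and b in group:
            return "conservative"
    return "non-conservative"


def similarity_summary(seq1, seq2):
    """Return per-class column counts, fractional identity and similarity.

    Both sequences must already be aligned (equal length, '-' for gaps).
    Fractions are computed over columns that contain no gap.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    counts = {"identical": 0, "conservative": 0,
              "non-conservative": 0, "gap": 0}
    for a, b in zip(seq1.upper(), seq2.upper()):
        counts[classify_column(a, b)] += 1
    aligned = len(seq1) - counts["gap"]
    if aligned == 0:
        return counts, 0.0, 0.0
    identity = counts["identical"] / aligned
    similarity = (counts["identical"] + counts["conservative"]) / aligned
    return counts, identity, similarity


# Toy aligned pair (invented sequences).
counts, identity, similarity = similarity_summary("MKVLD-E", "MRILEAE")
print(counts, identity, similarity)
```

In this toy pair only half of the ungapped columns are identical, yet every ungapped column is either identical or a conservative substitution, which is why similarity scores that account for residue properties can detect relatedness that raw identity understates.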
Using sequence similarity to infer homology has several limitations. There is no minimum level of sequence similarity guaranteed to produce identical structures. Over long periods of evolution, related proteins may show no detectable sequence similarity to one another. Sequences with many insertions and deletions can also be difficult to align, which makes it hard to identify the homologous sequence regions. In the PA clan of proteases, for example, not a single residue is conserved through the superfamily, not even those in the catalytic triad. Conversely, the individual families that make up a superfamily are defined on the basis of their sequence alignment, for example the C04 protease family within the PA clan.
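What it means for "not a single residue" to be conserved can be made concrete with a short Python sketch that lists the fully conserved columns of a multiple alignment. The function name and the toy alignments are hypothetical and invented for illustration; they are not real PA clan sequences.

```python
def conserved_columns(alignment):
    """Return 0-based indices of columns where every sequence carries
    the same residue and no sequence has a gap ('-')."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have equal length")
    conserved = []
    for i in range(length):
        column = {seq[i] for seq in alignment}
        if len(column) == 1 and "-" not in column:
            conserved.append(i)
    return conserved


# A toy family-level alignment: some columns are fully conserved.
family = ["HDSGA", "HESGT", "HD-GA"]
print(conserved_columns(family))

# A toy superfamily-level alignment: no column is fully conserved.
superfamily = ["HDS", "CEA", "SKT"]
print(conserved_columns(superfamily))
```

An empty result for a set of homologous proteins does not rule out common ancestry; it only shows that strict sequence conservation is too blunt a criterion at that evolutionary distance.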
Nevertheless, sequence similarity is the most commonly used form of evidence to infer relatedness, since the number of known sequences vastly outnumbers the number of known tertiary structures.[6] In the absence of structural information, sequence similarity constrains the limits of which proteins can be assigned to a superfamily.[6]
Structure is much more evolutionarily conserved than sequence, such that proteins with highly similar structures can have entirely different sequences.[7] Over very long evolutionary timescales, very few residues show detectable amino acid sequence conservation; however, secondary structural elements and tertiary structural motifs are highly conserved. Some protein dynamics[8] and conformational changes of the protein structure may also be conserved, as is seen in the serpin superfamily.[9] Consequently, protein tertiary structure can be used to detect homology between proteins even when no evidence of relatedness remains in their sequences. Structural alignment programs, such as DALI, use the 3D structure of a protein of interest to find proteins with similar folds.[10] However, on rare occasions, related proteins may evolve to be structurally dissimilar[11] and relatedness can only be inferred by other methods.[12][13][14]
The catalytic mechanism of enzymes within a superfamily is commonly conserved, although substrate specificity may be significantly different.[15] Catalytic residues also tend to occur in the same order in the protein sequence.[16] For the families within the PA clan of proteases, although there has been divergent evolution of the catalytic triad residues used to perform catalysis, all members use a similar mechanism to perform covalent, nucleophilic catalysis on proteins, peptides or amino acids.[17] However, mechanism alone is not sufficient to infer relatedness. Some catalytic mechanisms have been convergently evolved multiple times independently, and so form separate superfamilies,[18][19][20] and some superfamilies display a range of different (though often chemically similar) mechanisms.[15][21]
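The point about residue order can be sketched in Python by sorting catalytic roles by their position in the sequence. The residue numbers below are commonly cited literature values for chymotrypsin, TEV protease and subtilisin; they are not taken from this article and should be treated as assumptions, as should the helper name `role_order`.

```python
def role_order(triad):
    """Return the catalytic roles sorted from N- to C-terminus.

    `triad` maps a role name to a (residue, position) pair.
    """
    ordered = sorted(triad.items(), key=lambda item: item[1][1])
    return tuple(role for role, _site in ordered)


# Assumed residue numbering (commonly cited values, not from this article).
chymotrypsin = {"base": ("His", 57), "acid": ("Asp", 102),
                "nucleophile": ("Ser", 195)}   # PA clan, serine nucleophile
tev_protease = {"base": ("His", 46), "acid": ("Asp", 81),
                "nucleophile": ("Cys", 151)}   # PA clan, cysteine nucleophile
subtilisin = {"acid": ("Asp", 32), "base": ("His", 64),
              "nucleophile": ("Ser", 221)}     # separate superfamily

for name, triad in [("chymotrypsin", chymotrypsin),
                    ("TEV protease", tev_protease),
                    ("subtilisin", subtilisin)]:
    print(name, role_order(triad))
```

Under these assumptions the two PA clan enzymes share the order base, acid, nucleophile even though their nucleophiles differ, while subtilisin uses the same chemistry with the acid and base swapped in sequence order, which is consistent with convergent evolution of the mechanism.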
Protein superfamilies represent the current limits of our ability to identify common ancestry.[22] They are the largest evolutionary grouping based on direct evidence that is currently possible. They are therefore amongst the most ancient evolutionary events currently studied. Some superfamilies have members present in all kingdoms of life, indicating that the last common ancestor of that superfamily was in the last universal common ancestor of all life (LUCA).[23]
Superfamily members may be in different species, with the ancestral protein being the form of the protein that existed in the ancestral species (orthology). Conversely, the proteins may be in the same species, but evolved from a single protein whose gene was duplicated in the genome (paralogy).
A majority of proteins contain multiple domains. Between 66 and 80% of eukaryotic proteins have multiple domains while about 40-60% of prokaryotic proteins have multiple domains.[5] Over time, many of the superfamilies of domains have mixed together. In fact, it is very rare to find "consistently isolated superfamilies".[5][1] When domains do combine, the N- to C-terminal domain order (the "domain architecture") is typically well conserved. Additionally, the number of domain combinations seen in nature is small compared to the number of possibilities, suggesting that selection acts on all combinations.[5]
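The comparison between observed and possible domain combinations can be illustrated with a small Python sketch. The domain labels A to D and the list of architectures are hypothetical toy data, and counting adjacent ordered pairs is only one simple way to summarise domain architecture.

```python
def adjacent_pairs(proteins):
    """Collect every ordered pair of neighbouring domains (N- to C-terminal)."""
    pairs = set()
    for architecture in proteins:
        pairs.update(zip(architecture, architecture[1:]))
    return pairs


def architecture_summary(proteins):
    """Compare observed adjacent domain pairs with all possible ordered pairs."""
    domains = {d for architecture in proteins for d in architecture}
    observed = adjacent_pairs(proteins)
    # All ordered pairs, including a domain followed by itself.
    possible = len(domains) ** 2
    both_orders = {frozenset(p) for p in observed
                   if p[0] != p[1] and (p[1], p[0]) in observed}
    return {
        "domains": len(domains),
        "observed_pairs": len(observed),
        "possible_pairs": possible,
        "pairs_seen_in_both_orders": len(both_orders),
    }


# Hypothetical architectures, each written from N- to C-terminus.
toy_proteins = [
    ("A", "B", "C"),
    ("B", "C"),
    ("A", "B", "C"),
    ("D", "C"),
    ("C",),
]
print(architecture_summary(toy_proteins))
```

In this toy set only 3 of 16 possible ordered pairs occur and no pair appears in both orders, mirroring the two observations in the text: observed combinations are a small fraction of those possible, and domain order tends to be conserved.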
Members share a large α8β8 barrel structure. It is one of the most common protein folds and the monophylicity of this superfamily is still contested.[37][38]
^ Li D, Zhang L, Yin H, Xu H, Satkoski Trask J, Smith DG, Li Y, Yang M, Zhu Q (June 2014). "Evolution of primate α and θ defensins revealed by analysis of genomes". Molecular Biology Reports. 41 (6): 3859–66. doi:10.1007/s11033-014-3253-z. PMID 24557891. S2CID 14936647.
^ a b Dessailly, Benoit H.; Dawson, Natalie L.; Das, Sayoni; Orengo, Christine A. (2017), "Function Diversity within Folds and Superfamilies", From Protein Structure to Function with Bioinformatics, Springer Netherlands, pp. 295–325, doi:10.1007/978-94-024-1069-3_9, ISBN 978-94-024-1067-9
^ Coutinho PM, Deleury E, Davies GJ, Henrissat B (April 2003). "An evolving hierarchical family classification for glycosyltransferases". Journal of Molecular Biology. 328 (2): 307–17. doi:10.1016/S0022-2836(03)00307-3. PMID 12691742.
^ Nardini M, Dijkstra BW (December 1999). "Alpha/beta hydrolase fold enzymes: the family keeps growing". Current Opinion in Structural Biology. 9 (6): 732–7. doi:10.1016/S0959-440X(99)00037-8. PMID 10607665.
^ Mohamed MF, Hollfelder F (January 2013). "Efficient, crosswise catalytic promiscuity among enzymes that catalyze phosphoryl transfer". Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics. 1834 (1): 417–24. doi:10.1016/j.bbapap.2012.07.015. PMID 22885024.
^ Branden C, Tooze J (1999). Introduction to protein structure (2nd ed.). New York: Garland Pub. ISBN 978-0-8153-2305-1.
^ Bolognesi M, Onesti S, Gatti G, Coda A, Ascenzi P, Brunori M (February 1989). "Aplysia limacina myoglobin. Crystallographic analysis at 1.6 A resolution". Journal of Molecular Biology. 205 (3): 529–44. doi:10.1016/0022-2836(89)90224-6. PMID 2926816.
^ Bork P, Holm L, Sander C (September 1994). "The immunoglobulin fold. Structural classification, sequence patterns and common core". Journal of Molecular Biology. 242 (4): 309–20. doi:10.1006/jmbi.1994.1582. PMID 7932691.
^ Nagano N, Orengo CA, Thornton JM (August 2002). "One fold with many functions: the evolutionary relationships between TIM barrel families based on their sequences, structures and functions". Journal of Molecular Biology. 321 (5): 741–65. doi:10.1016/s0022-2836(02)00649-6. PMID 12206759.
^ Farber G (1993). "An α/β-barrel full of evolutionary trouble". Current Opinion in Structural Biology. 3 (3): 409–412. doi:10.1016/S0959-440X(05)80114-9.