Using CLUSTAL for multiple sequence alignments
- PMID:8743695
- DOI: 10.1016/s0076-6879(96)66024-8
Abstract
We have tested CLUSTAL W in a wide variety of situations, and it is capable of handling some very difficult protein alignment problems. If the data set consists of enough closely related sequences so that the first alignments are accurate, then CLUSTAL W will usually find an alignment that is very close to ideal. Problems can still occur if the data set includes sequences of greatly different lengths or if some sequences include long regions that are impossible to align with the rest of the data set. Trying to balance the need for long insertions and deletions in some alignments with the need to avoid them in others is still a problem. The default values for our parameters were tested empirically using test cases of sets of globular proteins where some information as to the correct alignment was available. The parameter values may not be very appropriate with nonglobular proteins. We have argued that using one weight matrix and two gap penalties is too simplistic to be of general use in the most difficult cases. We have replaced these parameters with a large number of new parameters designed primarily to help encourage gaps in loop regions. Although these new parameters are largely heuristic in nature, they perform surprisingly well and are simple to implement. The underlying speed of the progressive alignment approach is not adversely affected. The disadvantage is that the parameter space is now huge; the number of possible combinations of parameters is more than can easily be examined by hand. We justify this by asking the user to treat CLUSTAL W as a data exploration tool rather than as a definitive analysis method. It is not sensible to automatically derive multiple alignments and to trust particular algorithms as being capable of always getting the correct answer. One must examine the alignments closely, especially in conjunction with the underlying phylogenetic tree (or estimate of it) and try varying some of the parameters. 
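To give a feel for why the parameter space cannot easily be examined by hand, the following minimal sketch enumerates every combination of just three parameters. The parameter names and candidate values are illustrative assumptions, not CLUSTAL W defaults or its actual option names; the point is only that the number of settings multiplies with each parameter added.

```python
from itertools import product

# Hypothetical candidate values, chosen only for illustration.
matrix_series = ["PAM", "BLOSUM"]
gap_open = [5.0, 10.0, 15.0]
gap_extend = [0.05, 0.1, 0.2, 0.5]

# One dictionary per combination of the three parameters.
settings = [
    {"matrix": m, "gap_open": go, "gap_extend": ge}
    for m, go, ge in product(matrix_series, gap_open, gap_extend)
]

# The count is the product of the list lengths: 2 * 3 * 4.
print(len(settings))
```

Even this small grid gives 24 alignments to inspect, which is why the abstract recommends exploring a few settings alongside the guide tree rather than searching exhaustively.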
Outliers (sequences that have no close relatives) should be aligned carefully, as should fragments of sequences. By default, the program automatically delays the alignment of any sequence that is less than 40% identical to every other sequence until all other sequences are aligned; the user can change this cutoff from a menu. It may be useful to build up an alignment of closely related sequences first and then to add the more distant relatives one at a time or in batches, using the profile alignments and weighting scheme described earlier and perhaps using a variety of parameter settings. We give one example using SH2 domains. SH2 domains are widespread in eukaryotic signalling proteins, where they function in the recognition of phosphotyrosine-containing peptides. In the chapter by Bork and Gibson ([11], this volume), BLAST and pattern/profile searches were used to extract the set of known SH2 domains and to search for new members. (Profiles used in database searches are conceptually very similar to the profiles used in CLUSTAL W: see chapters [11] and [13] for profile search methods.) The profile searches detected SH2 domains in the JAK family of protein tyrosine kinases, which had been thought not to contain SH2 domains. Although the JAK family SH2 domains are rather divergent, they have the necessary core structural residues as well as the critical positively charged residue that binds phosphotyrosine, leaving no doubt that they are bona fide SH2 domains. The five new JAK family SH2 domains were added sequentially to the existing alignment of 65 SH2 domains using the CLUSTAL W profile alignment option. Figure 6 shows part of the resulting alignment. Despite their divergent sequences, the new SH2 domains have been aligned nearly perfectly with the old set. No insertions were placed in the original SH2 domains. In this example, the profile alignment procedure has produced better results than a one-step full alignment of all 70 SH2 domains, and in considerably less time.
(ABSTRACT TRUNCATED)
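The rule for delaying divergent sequences described in the abstract can be sketched as follows. This is a minimal illustration, not the CLUSTAL W implementation: the function name `split_by_divergence`, the sequence labels, and the identity values are all hypothetical, and the pairwise percent identities are assumed to have been computed already.

```python
def split_by_divergence(names, identity, cutoff=40.0):
    """Partition sequences into those aligned first and those delayed.

    identity[i][j] is the percent identity between sequences i and j
    (a symmetric matrix; the diagonal is ignored). A sequence is delayed
    when its best identity to any other sequence is below the cutoff.
    """
    first, delayed = [], []
    for i, name in enumerate(names):
        # Best match of sequence i against every other sequence.
        best = max(
            (identity[i][j] for j in range(len(names)) if j != i),
            default=0.0,
        )
        if best < cutoff:
            delayed.append(name)
        else:
            first.append(name)
    return first, delayed


# Made-up example: three related sequences and one divergent outlier.
names = ["seqA", "seqB", "seqC", "outlier"]
identity = [
    [100.0, 72.0, 55.0, 28.0],
    [72.0, 100.0, 61.0, 31.0],
    [55.0, 61.0, 100.0, 25.0],
    [28.0, 31.0, 25.0, 100.0],
]

first, delayed = split_by_divergence(names, identity)
print("align first:", first)
print("delay:", delayed)
```

In this made-up data the outlier's best match is 31%, below the 40% cutoff, so it would be held back and added once the other three sequences are aligned. Raising or lowering `cutoff` mirrors changing the threshold from the program's menu.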