Modeling Site Heterogeneity with Posterior Mean Site Frequency Profiles Accelerates Accurate Phylogenomic Estimation
- PMID:28950365
- DOI: 10.1093/sysbio/syx068
Abstract
Proteins have distinct structural and functional constraints at different sites that lead to site-specific preferences for particular amino acid residues as the sequences evolve. Heterogeneity in the amino acid substitution process between sites is not modeled by commonly used empirical amino acid exchange matrices. Such model misspecification can lead to artefacts in phylogenetic estimation such as long-branch attraction. Although sophisticated site-heterogeneous mixture models have been developed to address this problem in both Bayesian and maximum likelihood (ML) frameworks, their formidable computational time and memory usage severely limit their use in large phylogenomic analyses. Here we propose a posterior mean site frequency (PMSF) method as a rapid and efficient approximation to full empirical profile mixture models for ML analysis. The PMSF approach assigns a conditional mean amino acid frequency profile to each site calculated based on a mixture model fitted to the data using a preliminary guide tree. These PMSF profiles can then be used for in-depth tree-searching in place of the full mixture model. Compared with widely used empirical mixture models with $k$ classes, our implementation of PMSF in IQ-TREE (http://www.iqtree.org) speeds up the computation by approximately $k$/1.5-fold and requires a small fraction of the RAM. Furthermore, this speedup allows, for the first time, full nonparametric bootstrap analyses to be conducted under complex site-heterogeneous models on large concatenated data matrices. Our simulations and empirical data analyses demonstrate that PMSF can effectively ameliorate long-branch attraction artefacts. In some empirical and simulation settings PMSF provided more accurate estimates of phylogenies than the mixture models from which they derive.
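The conditional mean profile described in the abstract can be illustrated with a small sketch. It assumes the usual reading of a posterior mean under a profile mixture: each class's frequency vector is weighted by the posterior probability that the site belongs to that class, given the fitted class weights and the per-class site likelihoods on the guide tree. The function names, the toy three-state alphabet, and all numbers below are hypothetical and chosen only for illustration; this is not IQ-TREE's code, and the per-class site likelihoods are taken as given rather than computed from a tree.

```python
def pmsf_profile(class_weights, class_profiles, site_class_likelihoods):
    """Return a posterior mean frequency profile for a single site.

    class_weights          -- fitted mixture weights, one per class (sum to 1)
    class_profiles         -- one frequency vector per class
    site_class_likelihoods -- likelihood of this site under each class,
                              assumed to be precomputed on the guide tree
    """
    # Joint weight of each class for this site: weight times site likelihood.
    joint = [w * lik for w, lik in zip(class_weights, site_class_likelihoods)]
    total = sum(joint)
    if total <= 0:
        raise ValueError("site likelihood must be positive under some class")

    # Posterior probability that the site belongs to each class.
    posterior = [j / total for j in joint]

    # Posterior-weighted average of the class profiles, state by state.
    n_states = len(class_profiles[0])
    return [
        sum(p * profile[state] for p, profile in zip(posterior, class_profiles))
        for state in range(n_states)
    ]


def approximate_speedup(k):
    """Approximate speedup over a k-class mixture, per the abstract's k/1.5."""
    return k / 1.5


# Toy example: two classes over a three-state alphabet (hypothetical values).
weights = [0.5, 0.5]
profiles = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
likelihoods = [0.03, 0.01]  # the site fits the first class better

site_profile = pmsf_profile(weights, profiles, likelihoods)
print(site_profile)
print(approximate_speedup(60))
```

Because the site is three times more likely under the first class, its profile is pulled toward that class's frequencies while remaining a valid frequency vector. Once one fixed profile is stored per site, the tree search no longer needs to evaluate every class at every site, which is consistent with the reported speedup of roughly $k$/1.5 and the reduced memory use.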