Movatterモバイル変換


[0]ホーム

URL:


Skip to Main Content
Advertisement
Oxford Academic
Search
Molecular Biology and Evolution
Society for Molecular Biology and Evolution
Close
Search
Journal Article

A Bayesian Mixture Model for Across-Site Heterogeneities in the Amino-Acid Replacement Process

Nicolas Lartillot,
Nicolas Lartillot
Canadian Institute for Advanced Research, Département de Biochimie, Université de Montréal, Montréal, Québec Canada
Search for other works by this author on:
Hervé Philippe
Hervé Philippe
Canadian Institute for Advanced Research, Département de Biochimie, Université de Montréal, Montréal, Québec Canada
Search for other works by this author on:
Molecular Biology and Evolution, Volume 21, Issue 6, June 2004, Pages 1095–1109,https://doi.org/10.1093/molbev/msh112
Published:
01 June 2004
Article history
Accepted:
10 February 2004
Published:
01 June 2004
Search
Close
Search

Abstract

Most current models of sequence evolution assume that all sites of a protein evolve under the same substitution process, characterized by a 20 × 20 substitution matrix. Here, we propose to relax this assumption by developing a Bayesian mixture model that allows the amino-acid replacement pattern at different sites of a protein alignment to be described by distinct substitution processes. Our model, named CAT, assumes the existence of distinct processes (or classes) differing by their equilibrium frequencies over the 20 residues. Through the use of a Dirichlet process prior, the total number of classes and their respective amino-acid profiles, as well as the affiliations of each site to a given class, are all free variables of the model. In this way, the CAT model is able to adapt to the complexity actually present in the data, and it yields an estimate of the substitutional heterogeneity through the posterior mean number of classes. We show that a significant level of heterogeneity is present in the substitution patterns of proteins, and that the standard one-matrix model fails to account for this heterogeneity. By evaluating the Bayes factor, we demonstrate that the standard model is outperformed by CAT on all of the data sets which we analyzed. Altogether, these results suggest that the complexity of the pattern of substitution of real sequences is better captured by the CAT model, offering the possibility of studying its impact on phylogenetic reconstruction and its connections with structure-function determinants.

Introduction

Probabilistic methods are widely used in phylogenetic reconstruction. Their main advantage, compared to methods such as maximum parsimony, is to make all assumptions underlying the reconstruction explicit, while providing powerful and general techniques for validating those assumptions (Swofford et al. 1996;Sullivan and Swofford 2001). Given a stochastic model of evolution, these methods allow computation of the probability of observing the available data, conditional on a phylogenetic hypothesis (specified by a topology, branch lengths plus some other parameters). This probability is used as a measure of the likelihood of the corresponding hypothesis, and one then invokes the maximum likelihood (ML) principle, choosing the hypothesis for which the probability of observing the data is maximal.

Maximum likelihood is intuitively appealing, and mathematical theorems guarantee the asymptotic consistency of the method (Wald 1949). However, when applied to models that are too rich in parameters, ML can lead to over-fitting artifacts. This is true, for instance, when models with site-specific parameters are considered, a problem sometimes referred to as the “infinitely many parameter trap” (Felsenstein 2004). In practice, this limitation on the number of parameters restricts the range of models that the phylogeneticist, seeking more realism, would like to explore. An alternative probabilistic paradigm, the Bayesian method, has been introduced recently in the field of molecular phylogeny (Li 1996;Yang and Rannala 1997;Larget and Simon 1999;Huelsenbeck and Ronquist 2001). It was initially proposed as a practical alternative to ML, mainly on the grounds that it offers a natural measure of uncertainty, thereby avoiding the costly method of bootstrap resampling (Larget and Simon 1999). However, another advantage of Bayes, which is in our opinion much more fundamental, lies in its greater flexibility with respect to the model's dimensionality (Huelsenbeck et al. 2002;Rannala 2002). In contrast to ML, and through a different treatment of the nuisance parameters which are not directly estimated but integrated away, Bayes is able to deal with much higher dimensional models while offering several methods to test these models in the light of available data (Jeffreys 1935;Jaynes 2003). This flexibility opens new avenues of investigation, as it makes it possible to build more realistic models of sequence evolution by adjusting the dimensionality so that it reflects the complexity actually displayed by the data, rather than the limitations inherent in the method.

One of the ways by which restrictions are traditionally imposed on model dimensionality is by assuming that sites of an alignment are independent and identically distributed: that is, they are considered as independent realizations of the same substitution process, running along the branches of the underlying evolutionary tree (Felsenstein 1981). In certain circumstances, as in the case of pseudogenes, this assumption might be valid, but more often different nucleotides or different codons of a gene will evolve under very different selection pressures. The assumption of identical distribution is partially relaxed in the “rates across sites” models (Yang 1993,1994), where the rate of evolution at each site is a random variable drawn from a gamma distribution. Such models yield a major improvement in statistical adequacy over the uniform rate model when applied to both nucleotide and protein data sets (Yang 1996). However, they still assume that all the remaining parameters of the model of evolution—i.e., the equilibrium frequencies and the relative rates of substitution among nucleotides or amino acids—are the same at all sites.

In practice, all these parameters are summarized into a 4 × 4 or 20 × 20 rate matrix. In the case of amino-acid alignments, empirical matrices are generally used, which have first been obtained by counting pairs of amino acids at homologous positions in large sets of aligned proteins (Dayhoff, Eck, and Park 1972;Dayhoff, Schwartz, and Orcutt 1978;Jones, Taylor, and Thornton 1992). Matrices optimized by ML have also been proposed for mitochondrial (Adachi and Hasegawa 1996), chloroplast (Adachi et al. 2000), and nuclear (Whelan and Goldman 2001) proteins. More recently, an alternative, faster method of optimization has been introduced (Muller, Spang, and Vingron 2002), and a new method generalizing the use of empirical matrices has been proposed (Goldman and Whelan 2002).

A number of models for possible heterogeneity in the substitution pattern at distinct sites have already been proposed, both for nucleotide data and amino-acid data. In the case of nucleotides, the transition/transversion ratio was modeled as a site-specific random variable (Huelsenbeck and Nielsen 1999). As for proteins, a first approach has been proposed, in which substitutional heterogeneity is introduced through a set of eight to ten predefined categories, based on secondary structure and solvent accessibility considerations (Goldman, Thorne, and Jones 1996;Thorne, Goldman, and Jones 1996;Goldman, Thorne, and Jones 1998;Liò and Goldman 1999). Each category has its own rate matrix, optimized by ML on real data sets. This model was shown to be significantly supported by real sequences, yet it does not address the question of the extent of heterogeneity actually present in the data. Furthermore, it makes specific hypotheses about its determining factors. An alternative method has been proposed in which no prior constraints are specified between the substitution processes and other features of the protein, like the secondary structure (Koshi and Goldstein 1998;Koshi, Mindell, and Goldstein 1999;Koshi and Goldstein 2001). However, the substitution processes themselves are constrained to conform to a prior biochemical model. Although this approach was generalized (Dimmic, Mindell, and Goldstein 2000;Soyer et al. 2002), the total number of categories is predetermined and is kept small, still for dimensionality reasons. A more radical approach was taken by Bruno (Bruno 1996;Halpern and Bruno 1998), through a model in which the equilibrium frequencies of the 20 amino acids are distinct at each site of the data set. The resulting model seems to capture important features of the substitution process along protein sequences, but it requires a large number of taxa in order for the statistics at each column to be significant.

As mentioned above, the flexibility of the Bayesian paradigm with respect to model dimensionality makes it possible to build models assuming high levels of heterogeneity. Yet, this does not tell how the number of parameters can be adjusted properly to match the signal present in the data. A possibility is to use mixture models in which the dimensionality is not fixed a priori and is itself a free parameter of the inference. Such mixture models are being introduced in many fields of applied statistics (Escobar and West 1995;Green and Richardson 1998), including bioinformatics (Eskin, Grundy, and Singer 2001;Broet, Richardson, and Radvanyi 2002), but they have not yet been applied to molecular phylogenetics.

Here we propose a mixture model, CAT, which generalizes most of the previous approaches (Bruno 1996;Koshi and Goldstein 1998;Dimmic, Mindell, and Goldstein 2000). The model allows for a numberK of classes, each of which is characterized by its own set of equilibrium frequencies, and lets each site “choose” the class under which its substitutional history is to be described. The model can be constrained, with the number of classes fixed to one, as in the standard one-matrix model, or such that each site is described by its own class. Alternatively, we can use a Dirichlet process prior (Ferguson 1973;Antoniak 1974) on the space of equilibrium frequencies, to let the total number of classes be a free variable of the inference. The posterior mean value is then a measure of the substitutional heterogeneity actually present in the alignment. We have implemented this model in a Markov chain Monte Carlo (MCMC) framework, allowing joint inference to be performed simultaneously on all the parameters of the model, including the mixture parameters, the rates at each site, the branch lengths, and the topology of the underlying phylogenetic tree. Using this model, we have conducted inferences on large real data sets and found that, in all cases, the level of heterogeneity is much higher than has been accounted for by previous mixture models. In addition, we show that the standard model based on one single empirical matrix conditioning the substitution process at all sites is outperformed by CAT.

Materials and Methods

Data and Trees

We used three alignments of amino-acid sequences as follows:

EF30-627

These data were obtained as a subset of a larger alignment of eukaryotic elongation factor 2 sequences. Thirty taxa were chosen to represent the diversity of the eukaryotic lineages, and their aligned sequences were retrieved. We removed all columns for which gaps were present or data were missing, leaving a total of 627 sites. A phylogenetic reconstruction under the JTT+F model, performed with the implementation described below, yielded a tree which we used as the fixed topology under which the most time-consuming inferences were conducted. This tree was identical to the posterior consensus tree obtained with the MrBayes 3.0 program (Huelsenbeck and Ronquist 2001) using the JTT matrix, a Dirichlet prior with a concentration parameter of 1 on the equilibrium frequencies, and an Invariant + Γ rate prior modeled by 16 discrete rate categories.

Ek55-1525

These data are a concatenation of the sequences of four cytoplasmic proteins (actin, EF-1α, α and β tubulins), sampled across 55 eukaryotic species (Baldauf et al. 2000). The alignment was kindly provided by Sandra Baldauf. We removed the diplomonads and trichomonads from our analysis because of their long branches. When constrained, the topology was fixed as inBaldauf et al. (2000).

Mt45-3596

The complete coding sequences of the mitochondrial genomes of 45 mammals were aligned with each other, and the ambiguous regions of the alignment were removed, yielding a data matrix of 45 × 3,596 amino acids. The MrBayes program was run on this alignment, using themtREV empirical matrix, with four discrete Γ-categories. The resulting majority-rule consensus (using theallcompat option) was used as the fixed topology.

Notation and Parameters

The available data are in the form of an alignment ofP amino-acid sequences, of lengthN. Leti index the columnsCi, or sites, andj the branches. All of the models used in this work assume that the sequences are related by an unknown, unrooted phylogenetic treet, with branch lengths βj ≥ 0,j = 1, 2, … , 2P − 3. Sites have their own rates of substitutionri > 0, such that

\(_{\mathit{N}}^{1}\)
\({\sum}_{\mathit{i}{=}1}^{\mathit{N}}\)
ri = 1. Gaps are treated as missing data.

Markovian Model of Amino-Acid Replacement

For all models, substitutions occur according to a Markov process running along the branches of the tree. Such a Markov process is characterized by a rate matrixQ = [Qlm] that can be expressed in terms of 20 stationary probabilities, or equilibrium frequencies, (πl), 1 ≤l ≤ 20,
\({\sum}^{20}_{\mathit{l}{=}1}\)
πl = 1 and a set of relative rates, or exchangeability parameters, (ρlm), 1 ≤l,m ≤ 20, according to the relation
formula
The process is assumed to be reversible, ρlm = ρml, and the matrix has been scaled so that the substitution rate is one, using the normalization constant
formula
With this scaling, branch lengths are measured in expected number of substitutions per site. FromQ, one obtains the transition probability matrixP(ν) = [Plm(ν)], specifying the probability that amino-acidl changes intom over an evolutionary distance of ν, asP(ν) =eνQ.

Mixture Modeling

We proposed a model that is site-heterogeneous with respect to the substitution process, and which we called CAT (because in effect, it classifies sites intocategories). Under the CAT model, sites are distributed according to a mixture ofK distinct classes. Each class is characterized by its own substitution matrixQk, and the class to which each site belongs is specified by an allocation variablezi ∈ [1..K]. The vectorz = (zi)i∈[1..N] is called the allocation vector.

For simplicity, we consider only mixtures of matrices {Qk}k=1..K having all the same set of relative rates ρ, but different stationary probabilities πk, so that the mixtures are actually defined on the space of stationary probabilities (the π-space):
formula
where, in order to constrain the substitution rate, we invoke a normalization constant
formula
With this normalization, the branch lengths have the same meaning under all classes; i.e., they are measured in expected number of substitutions. Because a class is entirely characterized by its π-vector, we call the latter itsprofile.

The relative rates ρlm can be fixed to pre-specified values, allowing different models to be specified. The mathematically simplest model is obtained by setting all the relative rates equal to unity: sites are described by a mixture of Poisson processes. Alternative models were also considered, based on the relative rates of known empirical matrices: JTT (Jones, Taylor, and Thornton 1992) andmtREV (Adachi and Hasegawa 1996), depending on the data set under study. The corresponding models were called CAT-Poisson, CAT-JTT, and CAT-mtREV, respectively.

The relative rates can as well be considered as free parameters of the inference (CAT-GTR model). Because they are defined up to a multiplicative constant, which cancels out upon normalization of the rate matrix, we impose the formal constraint that
formula

Once site-specific rates, substitution matrices, and allocation variables are known, the likelihood at each site is computed using Felsenstein's pruning algorithm (Felsenstein 1981). In the case where the processes are Poisson, it is possible to perform mathematically equivalent computations by recoding the process at each site in such a way that all nonobserved amino acids at a given column are collectively considered as one single state. The stationary probability of this new state is the sum of the stationary probabilities of all non-observed amino acids. Felsenstein's pruning algorithm is inS2P, whereS is the number of states, andP the number of taxa. For amino-acid data without the recoding,S = 20. Under the recoding,S will be in general less than 20, and for highly conserved alignments it can often be as low as 2 or 3, which yields a significant increase in computational speed (up to a 50-fold increase in speed was observed in the case of the eukaryotic data set Ek55-1525).

Priors

We used uninformative priors ont, β,r and ρ, as follows:

  • tUniform over topologies

  • βjUniform[0, βmax], with βmax = 100

  • rDirichlet(1, 1, … , 1)

    The resulting marginal distribution at each site is a γ distribution, with an α parameter of 1.

  • ρ ∼Dirichlet(1, 1, … , 1)

In addition, we defined a prior on the mixtures, using a Dirichlet process (DP) (Ferguson 1973;Antoniak 1974). A DP prior is parameterized by a concentration parameter α, and a base distributionG0(π) defined on the π-space, the space of stationary probabilities. It can be described by the following procedure for randomly drawing a configuration from the prior (Neal 2000):

  • Draw the number of classesK, together with a N-vector of allocation variablesz, according top(z,K | α).

  • Draw K values i.i.d. in the π-space: ∀k, πkG0

p(z, K | α) can be expressed as
formula
where ηk stands for the ocupancy number of classk (i.e., the number of sites allocated to classk). Integrating this expression overz yields the marginal prior distribution on the number of classes,p(K | α), which has a complicated form, not reported here (Antoniak 1974;Escobar and West 1995). An intuitive formulation of its dependence on α is that, given α, the marginal prior probability that two different columns taken at random belong to the same class is 1/(1 + α) (Neal 2000). Thus, α defines the level of heterogeneity a priori assumed, with higher values favoring a larger number of classes. We considered α as a free parameter, with a flat prior between two extreme values, and zero outside.
formula
We set αmin to 0.001, and αmax to 1,000. This prior on α induces a smooth marginal prior distribution onK, allowing for the posterior mean number of classes to be determined principally by the data. Finally, we used a flat Dirichlet prior as the base distribution on the sets of stationary probabilities:
formula

In particular cases, one can also dispense with the DP prior and constrain the value ofK. Two cases were considered:

  • The one-matrix model:K = 1. This corresponds to the usual models: Poisson, JTT,mtREV and, when the relative rates ρlm are considered as free parameters, GTR. The stationary probabilities of the single substitution matrix can be either free or fixed to their empirical estimates. The models using empirical estimates are called Poisson+F, JTT+F, WAG+F, ormtREV+F, depending on the set of relative rates that are used (in the case of the GTR model, stationary probabilities are always considered as free parameters).

  • The maximally heterogeneous model:K =N, whereN is the number of sites. Each site evolves under its own substitution process. TheN sets of 20 stationary probabilities are considered unknown parameters (19N free parameters). The resulting model is called MAX, and under equal relative rates is similar to the model proposed byBruno (1996).

Implementation and Monte Carlo Sampling

Gibbs Sampling of the Dirichlet Process

To visualize how a DP prior mixture model works, one can make explicit the probability of all possible allocations of a given site, conditional on the rest of the data. For any given sitei, we denote collectively byzi and πi the allocation variables and the stationary probabilities at all other sites. In addition, for any classk, ηk,−i stands for the number of sites other thani allocated to classk (i.e., the number ofj ∈ [1..N],ji such thatzj =k). Then, one has (Neal 2000):
formula
Here,Z is the appropriate normalizing constant that makes these probabilities sum to one. The dependence ofp(Ci | π), the likelihood at the ith. column, on all parameters other than the stationary probabilities π (like the topology and the branch lengths) has been omitted for notational simplicity.

These two equations immediately suggest a Gibbs sampling algorithm, in which each site is taken in turn, and its allocation variable is reassessed according to these conditional probabilities. Note that the total number of classesK can change through this update. Thus, if the site had a class on its own, but is re-allocated to another already existing class,K will decrease by one. In the reverse situation, if sitei were initially allocated to a classk such that ηk,−i > 0, and ends up allocated to a new class, thenK will increase. In all other cases,K will remain constant. Upon iteration, and when combined with other updates (see below), this algorithm allowsK to fluctuate across the whole range [1..N].

What determines the equilibrium level reached byK? First, one can see from equation 9 that, not surprisingly, the choice among alternative allocations of sitei to already existing classes is driven by the relative likelihoods of these allocations. In contrast, equation 10 shows that the probability of letting the site have a class on its own is mainly determined by how theprior expectation of the site's likelihood over all possible sets of stationary probabilities compares with likelihoods under already existing classes. This amounts to comparing the performance of anon-fitted new profile to the currently available ones,which have already been fitted to the rest of the data. This asymmetry makes it difficult for a site to induce the creation of a new class, unless the pattern at the corresponding column of the alignment is sufficiently distinct from all other columns. Finally, the weights (ηz,−i in equation 9, or α in equation 10) represent the net variations of the prior factorP(z, K | α). That the weight is α in equation 10 makes it apparent that α has a direct influence on the propensity of proposing new classes, and thus on the stationary value ofK.

In our MCMC sampler, we rely on a slightly modified version of this update mechanism (Neal 2000), as follows:

  • SwitchMode: In addition to theK currently existing classes, κ new classes are created, with profiles drawn fromG0. One site (i) is then chosen at random, and for everyk ∈ [1,K + κ], sitei is allocated to classk, the corresponding likelihood for sitei (
    \(\mathit{L}^{\mathit{k}}_{\mathit{i}}\)
    ) is computed, and a Gibbs sampling overk is performed based on the probabilities
    formula
    wherewk = ηk,−i, if ηk,−i > 0, andwk = α / κ otherwise. Different values of κ do not change the stationary distribution, but they yield faster convergence and better mixing. We empirically set κ to 5.

SwitchMode is used in combination with two other operators, one updating the stationary probabilities of one class at a time (StationaryMove) while keepingz and α constant, and another one proposing a new value for α.

  • StationaryMove: one site is chosen at random, an update of the profile of the class to which it is allocated is proposed according to a Dirichlet distribution centered on the current value, with a tuning parameter of λS.

  • Alpha: A random real value δ = λα * (U − 0.5), with a tuning parameter λα > 0, is added to α, with back-reflection in [αmin, αmax], if required. The Hastings ratio equals 1, and the likelihood does not depend on α, so that the only nontrivial factor involved in this Metropolis update is the resulting change inP(z,K | α), yielding the acceptance probability:
    formula
    where α and α* stand for the current and the proposed values of α, respectively.

Topology and Branch Lengths

We used the Global and Local algorithms proposed byLarget and Simon (1999). In addition, we also devised a node-sliding operator, as well as three operators proposing new values for the branch lengths, while leaving the topology invariant:

  • NodeSliding: we randomly pick an internal branch of the unrooted tree and then designate its two nodesu andv. The other two neighbors ofu are calleda andb, and those ofv are calledc andd, with equal probability. We then slide the branchau along the segmentbuvd, by a distancel = λN * (U − 0.5), whereUUniform[0, 1], and λN > 0, is a tuning parameter. Ifl > 0, the move is made towardd, and towardb otherwise. In addition, if the move pushes the branch out of the range defined byb andd, the excess is reflected back into the required interval. The Hastings ratio for this proposal is 1.

  • OneBranch: one branch of the unrooted tree is chosen at random, and its length is multiplied by a random factorr =eλ0(U−0.5), whereUUniform[0, 1], and λ0 > 0, is a tuning parameter. The Hastings ratio equalsr.

  • AllBranch: all branch lengths are updated simultaneously, each βj being multiplied by a distinct random factorrj =

    \(\mathit{e}^{{\lambda}_{\mathit{A}}(\mathit{U}_{\mathit{j}}{-}0.5)}\)
    ⁠, where λA is a tuning parameter. The Hastings ratio is ∏jrj.

  • Homothetic: all branch lengths are updated simultaneously, as in OneBranch, all being multiplied by the same random factorr =

    \(\mathit{e}^{{\lambda}_{\mathit{H}}(\mathit{U}{-}0.5)}\)
    ⁠. The Hastings ratio isr2P−3.

Rates Across Sites

As inLarget and Simon (1999), we propose a new set of values according to a Dirichlet distribution centered on the current parameter value. We found that this Monte Carlo operator was more efficient if applied only on a subset of the rate vector. Specifically, a small numberq of sites,i1,i2,..iq, are chosen at random, and a new set of values, according to a Dirichlet distribution restricted oni1,i2iq, with weights

\({\lambda}_{\mathit{D}}\mathit{r}_{\mathit{i}_{1}}\)
⁠,
\({\lambda}_{\mathit{D}}\mathit{r}_{\mathit{i}_{2}}\)
, … ,
\({\lambda}_{\mathit{D}}\mathit{r}_{\mathit{i}_{\mathit{q}}}\)
, was proposed as un update. λD is a tuning parameter. Empirically, we found thatq = 10 gives a good mixing.

General MCMC Settings

During sampling, the different components of the hypothesis vector were updated separately, in random alternation, according to a grid of weights (wm)1≤mM, whereM is the total number of operators. Specifically, definingW =

\({\sum}^{\mathit{M}}_{\mathit{m}{=}1}\)
wm, we call acycle a series ofW iterations. For each iteration,m is drawn at random, according tomwm/W, and the corresponding operator is called to act on the current state. The weights (wm) were determined empirically, like the tuning parameters, to optimize the mixing of the Markov chain. Both the weights and the tuning parameters are dependent on the model, as well as on the data set under which the inference is conducted (see table S1 in the Supplementary Material online).

For each run, the convergence was assessed by checking for the absence of long-term trends in a series of key monitor functions. Specifically, we monitored the log likelihood, the tree length, the entropy of the rate distribution, the number of classes, and the average profile entropy (class-occupancy weighted average). In addition, in most cases, at least two independent runs were performed, starting from different points of the parameter space taken at random, and their marginal properties (majority-rule consensus tree, class number, and composition) were compared.

Under CAT-Poisson, CAT-JTT, CAT-mtREV, and CAT-GTR, we did a total of 600,000 update cycles. The first 100,000 points were discarded, and we subsampled every 50 cycles after burn in. Under JTT+F and MTREV+F, the only free parameters are the site-specific rates of substitution and the branch lengths, which makes convergence much faster, so that we only did a total of 100,000 update cycles, removed the first 20,000, and subsampled every 10 cycles. Under GTR, we did a total of 200,000 update cycles, removed the first 40,000, and subsampled every 10 cycles.

Clustering

To identify the classes that are stable across a MCMC run, we pooled all the classes of all the points of the run (burn-in discarded), and defined clusters based on the degree of similarity between the corresponding profiles. Specifically, for each pair of classes (h, k), we computed the quadratic distance between their profiles,
formula
and considered that two profiles belong to the same cluster if the distance separating them is below a predefined thresholddmax = 0.01.

Note that any given point of the sample may have several of its classes contained in some clusters, while it may not be represented in other clusters. A given cluster is referred to as astable cluster if one and only one class is affiliated to this cluster for more than 80% of the points of the sample. For each stable cluster, one can compute the following:

  • its mean occupancy number: specifically, the sum of all its classes' occupancy number divided by the sample size.

  • mean profile: occupancy number weighted average of its classes' profiles.

A stable cluster can be interpreted as a class that can be identified unequivocally across the MCMC sampling, and thus, we will refer to stable clusters directly asstable classes.

We developed software for principal component analysis. Plots were visualized usingGnuplot (http://www.gnuplot.info). We wrote a program for visualizing profiles using a representation akin to sequence logos (Schneider and Stephens 1990), translating sets of stationary probabilities into PostScript files.

Posterior Predictive Resampling

One of the ways to evaluate the performances of alternative models is to compare their predictions on real and simulated data (Gelman, Meng, and Stern 1996). For instance, in the present context, we wish to compare the number of classes inferred by CAT on real data with that obtained on data simulated under CAT itself, or under a standard model, such as JTT. For these comparisons to be meaningful, it is important to simulate the replica under sensible parameter values. In the maximum likelihood framework, one would choose the ML estimate. In a Bayesian analysis, it is more customary to sample the replica from the posterior predictive distribution (Rubin 1984;Gelman, Meng, and Stern 1996;Gelman et al. 2004). Given a data setD, and a modelM, parameterized byh ∈ Ω, this distribution is defined as:
formula
where
formula
is the posterior probability distribution over the parameters, induced by the data.

In practice, a collection ofR replicas are obtained using the following procedure: first, run a MCMC under modelM, with dataD, discard the burn-in, and takeR points regularly spaced in the remaining part of the chain (hr)1≤rR. Next, for eachr, simulate a replicaDr underhr.

Numerical Evaluation of the Bayes Factor by Thermodynamic Integration

The most common Bayesian method of model comparison consists in computing the Bayes factor (Jeffreys 1935,1961;Jaynes 2003). The Bayes factor in favor of modelM1 against modelM0, given the dataD, is defined as
formula
P(D | Mi),i = 0, 1, is the likelihood averaged over the prior (ormarginal likelihood):
formula
whereh stands for the parameter vector, and Ω for the whole parameter space. ModelM1 will be supported over modelM0 if
\(\mathit{B}^{1}_{0}\)
is greater than 1, which amounts to choosing the model that has the higher marginal likelihood. To compute Bayes factors, we used a numerical method based on an analogy with thermodynamics; it is calledthermodynamic integration (Ogata 1989). The details of this method, also known aspath sampling (Gelman 1998), are explained in the Appendix.

When more than two models are being compared, one can exploit the additivity property of the logarithm of the Bayes factor, i.e., ∀M0,M1,M2, ln

\(\mathit{B}^{2}_{1}\)
= ln
\(\mathit{B}^{2}_{0}\)
− ln
\(\mathit{B}^{2}_{1}\)
. Therefore, one only needs to compute the Bayes factors of each of the models with respect to the same reference modelM0, and compare their values directly. In the present study, Poisson+F was used as the reference.

All source code files and alignments are available upon request.

Results

We first conducted inferences under the CAT-Poisson Model (i.e., a free number of Poisson processes), on the elongation factor 2 data set, EF30-627. The topology was constrained according to the posterior consensus tree found by running our program on this data set using the JTT+F model (fig. 1A), while all other parameters (branch lengths, site-specific rates, class number and profiles, as well as the allocations of sites to modes) were left unconstrained.

Figure 2A shows the evolution of the value ofK, the number of classes, during the elongation of two independent MCMC chains. The chains were initialized atK = 1 andK =N, respectively. Irrespective of the starting point,K converged to a well-defined interval centered on 28 (28.4 ± 3.5). The histograms of the frequencies estimated from the two independent chains are very similar (fig. 2B), the corresponding 95% credibility interval being [22,35] and [21,36]. Substitutional homogeneity across sites is thus rejected by our model (p(K = 1 | D) ≃ 0). On the other hand, the maximally heterogeneous situation is also excluded by the present analysis (p(K =N | D) ≃ 0). In fact, the mean posterior number of classes, as inferred from CAT-Poisson, is low, when compared to the total number of columns of the alignment (an average of one class per 22 sites). It should be stressed that, even with as few as 28 classes, the model is dealing with a relatively high number of additional parameters, compared to JTT+F (specifically, 28 × 19 = 532 free continuous parameters for the profiles, andN = 627 discrete parameters for the allocation vector, whereN is the length of the alignment).

We performed the same analysis without constraining the topology. We found a topology (fig. 1B) similar to that obtained using the JTT+F model (fig. 1A). A few differences are present, however: under CAT-Poisson, the choanoflagellateMonosiga is found within the metazoan clade, and not as its sister group. In addition,Dictyostelium is clustered with diplomonads and trichomonads. Finally, stramenopiles became a sister group of the alveolates, instead of the diplomonads and trichomonads. In spite of these differences, however, the mean number of classes (〈K〉 = 27.9 ± 3.6) is close to that obtained when the topology is constrained, suggesting that the analysis performed under CAT-Poisson is robust against phylogenetic uncertainty.

Interestingly, the inferred total length of the tree is strongly model dependent: under the constrained topology, we observed a posterior mean number of 8.13 ± 0.21 substitutions per site under CAT-Poisson, versus 7.46 ± 0.14 under JTT+F (table 1).

Stable Class Identification

Any two points of a given Markov chain obtained under the CAT model may differ in all respects (class number, site to class allocations, class specific profiles, as well as branch lengths and site-specific rates), and there is no a priori obvious way of identifying a given class throughout the sample. On the other hand, there may be some stable patterns. To identify them, we took 1,000 regularly spaced points from a given run and pooled all the classes observed at each of these points. Each class is characterized by its profile, and can be assimilated to a vector in a 20-dimensional space. A principal component analysis (PCA) on the set of pooled classes is shown infigure 3A. Two kinds of patterns can be seen: first, a broad set of dots scattered more or less homogeneously across a large region in the center of the projection, and second, a series of 10 to 12 dense clouds. To further characterize this distribution, we used a clustering method (seeMaterials and Methods), and identified a series of 11 stable classes (i.e., classes that can be unequivocally identified across the sample). The mean profile and the mean occupation number of each identified stable class is shown infigure 3D, and a projection of their profiles back onto the PCA (fig. 3A, large crosses) clearly shows that these stable classes correspond to the dense clouds.

The profiles of the stable classes are biochemically reasonable (fig. 3D). For instance, there are classes with negatively charged residues (D and E), with positively charged residues (K and R), or with aromatic (F and Y) residues. Note the diversity of alternative classes belonging to the same biochemical category: there are two hydrophobic classes, one with the amino acids I and V, and one favoring L and M. Finally, some amino acids are found in several stable classes, like S, which belongs to both (A,S) and (S,T) classes; the most extreme example is that of alanine (A), which is present in five different classes, (A,S), (A,G), (A,P), (A,C,S,T,V), and (A,D,E,G,K,Q,N,S,T,V). Thus, according to these results, the same amino acid can undergo different types of substitutions depending on the context. The same cluster analysis was performed without constraining the topology and gave essentially identical results (see fig. S1, in the Supplementary Material online).

The CAT-Poisson model was applied to data sets of larger size, consisting in the concatenation of four nuclear proteins from 55 eukaryotes (Ek55-1525), and of the mitochondrial proteins of 45 mammals (Mt45-3596). We found significant support in favor of heterogeneity in all cases (table 1), as shown both by the posterior mean number of classes (26 and 35) and by the number of stable classes identified by clustering (14 and 22).

Biochemically reasonable classes are found in all cases (see figures S2 and S3 in the Supplementary Material online), and they bear many similarities with each other across the three data sets, in particular between EF30-627 and Ek55-1525, suggesting that the number and the composition of the classes inferred by CAT-Poisson correspond to generic properties of the substitution patterns in proteins. More significant differences are observed between mitochondrial proteins and the other two data sets. This might reflect that mitochondrial proteins are mostly transmembrane proteins, whereas the proteins included in the two other data matrices are exclusively cytosolic factors. This is in agreement with the fact that themtREV is markedly different from other empirical substitution matrices (Adachi and Hasegawa 1996).

Posterior Predictive Checks

We compared the inferences conducted on EF30-627 with the results obtained from data sets of the same size, but simulated under various conditions. First, for data simulated under one single Poisson process and analyzed under CAT-Poisson, one class was recovered with significant probability (P = 0.65). Next, we simulated data under CAT-Poisson, drawing the parameters of the simulation from the posterior predictive distribution (10 replica, seeMaterials and Methods). The posterior mean number of classes found on the simulated data sets (〈K〉 = 23.1 ± 2.7) is slightly lower than the value found on the real data set (〈K〉 = 28.4 ± 3.5), suggesting that, as an estimate of the number of classes, 〈K〉 is biased downward. However, the number of identified stable classes is very similar (KSM = 11.20 ± 1.08 versusKSM = 11), as well as the underlying profiles (not shown). A PCA projection of the classes obtained in one of these simulations (fig. 3B) does not display significant differences with the analysis performed on the real data set (fig. 3A). Thus, in spite of a slight bias, the mixtures inferred by CAT-Poisson appear to be quite robust to posterior predictive resampling.

In contrast, when the same experiment is performed on alignments that have been obtained by simulation under the JTT+F model, a much lower level of heterogeneity is detected (〈K〉 = 13.7 ± 1.6, n = 10 replica). Upon clustering, an average ofKSM = 6.8 ± 0.8 stable classes is obtained. The profiles of these stable classes display significant differences with those observed for real data (fig. 3C andE): for instance, there is only one class containing A, S, and T, in place of the two (A,S) and (S,T) classes found on real sequences, and the (F,Y) class has disappeared altogether.

CAT Inferences Using Non-Poissonian Rate Matrices

The previous experiment suggests that the information contained in the JTT matrix is able to account for only part of the substitutional heterogeneity across sites present in real data. To investigate this problem further, we set the relative rates underlying our mixture model equal to those of the JTT matrix. (in the case of mitochondrial proteins, we used themtREV coefficients instead). Our argument is that, if one single empirical matrix (JTT) is sufficient to account for the substitutional heterogeneity of a particular data set, then an inference conducted under the CAT-JTT model on this data set should yield a nonvanishing posterior probability of having one single class.

We performed a CAT-JTT inference both on EF30-627 and on a data set drawn from the JTT+F induced posterior predictive distribution. Whereas the probability of recovering a single class is high in the case of the simulated data (P = 0.6), it is virtually 0 for real data (fig. 4, andtable 2). Surprisingly, the posterior mean number of classes is even greater with CAT-JTT (〈K〉 = 51.6 ± 8.3) than with CAT-Poisson (〈K〉 = 28.4 ± 3.5), although when a cluster-analysis is performed, a lower number of stable classes than in CAT is obtained (KSM = 5). The same analysis was conducted on Ek55-1525, and on the mitochondrial data set Mt45-3596, (in the latter case, usingmtREV as the empirical matrix). In both cases, the posterior probability of the data being described by one single class is virtually zero, and the posterior mean number of classes is also higher under CAT-JTT or CAT-mtREV (table 2), than under CAT-Poisson (table 1).

In a last experiment, the relative rates were considered as unknown parameters (although they were still constrained to be the same for all classes). The results were qualitatively the same as with CAT-JTT or CAT-mtREV, with a posterior mean number of classes of 〈K〉 = 47 ± 7 inferred on EF30-627 (fig. 4). Thus, even under non-uniform relative rates, a significant part of the heterogeneity among sites has to be further accounted for by forcing the mixture to contain more than one class. In other words, models based on one single matrix are far from capturing the substitutional heterogeneity of real data to its full extent.

Bayesian Model Comparison

Bayes factor evaluation is the most common method of model comparison in Bayesian inference (Gelman 1998;Jaynes 2003), and has already been applied in phylogenetics (Suchard, Weiss, and Sinsheimer 2001). Using the numerical method calledthermodynamic integration (Ogata 1989, see Appendix), we computed the Bayes factor between Poisson+F, taken as the reference model, and CAT-Poisson, MAX-Poisson, and a series of empirical matrix models (JTT+F, WAG+F, andmtREV+F;table 3). In the case of EF30-627, we also included the general model GTR in the analysis.

Our results first allow comparison of the one-matrix models among themselves: as expected (Whelan and Goldman 2001), WAG+F performs better (andmtREV+F worse) than JTT+F on cytosolic proteins (EF30-627 amd Ek55-1525). Interestingly, JTT+F and WAG+F are better than GTR on EF30-627, suggesting that the relative rates of known empirical matrices are close to optimal for this data set. Irrespective of their relative performances, however, the one-matrix models are all outperformed by CAT-Poisson. For example, the support in favor of CAT-Poisson with respect to WAG+ F on EF30-627 is of 1,932 − 1,760 = 172 natural units of logarithm, which corresponds to 0.27 natural log units per site. This means that, on average, each column ise0.27 = 1.32 times better explained by CAT-Poisson than by WAG+F. In contrast to CAT-Poisson, MAX-Poisson does not seem to be favored over one-matrix models: for all three data sets, MAX-Poisson was by far the worst-performing model after Poisson+F (table 3).

Discussion

Characterizing Substitutional Heterogeneity Using a Dirichlet Process Prior Model

We have developed a Bayesian model of sequence evolution, CAT, that can account for substitutional heterogeneity across the sites of protein sequences. In contrast to already existing mixture models designed for the same purpose (Goldman, Thorne, and Jones 1996;Thorne, Goldman, and Jones 1996;Goldman, Thorne, and Jones 1998;Koshi and Goldstein 1998;Koshi, Mindell, and Goldstein 1999;Liò and Goldman 1999;Dimmic, Mindell, and Goldstein 2000;Koshi and Goldstein 2001), all of which have a fixed number of classes, the number of substitutional classes in CAT is itself a free parameter. In this way, it will adapt to the substitutional complexity of the underlying data set, and it provides an estimate of this complexity through the posterior mean number of classes.

Inferences using the CAT model on real protein data sets uncover a high level of heterogeneity, much higher than what is assumed by all other mixture models that have been proposed thus far. To give a numerical estimate, under CAT-Poisson, we always found a posterior mean number of classes greater than 20, whereas in all other models, the number of classes is always set to a value lower than 10. Moreover, the difference between the posterior mean number of classes found upon posterior predictive resampling and the estimate obtained on the real data set (23.1 ± 2.7 versus 28.4 ± 3.5) reveals the existence of a bias, suggesting that the actual level of heterogeneity is probably even higher than what we have observed.

When looking at the classes' profiles and occupancy numbers, one sees that there actually exist markedly different kinds of classes. On the one hand, using a simple clustering algorithm, we were able to identify a restricted set of classes that are stable along a given Markov chain, and are represented by many sites across the sequences. The profiles of these classes (fig. 3D) correspond to reasonable biochemical features, although they are more diverse and more refined than the basic biochemical categories generally used. Strikingly, most of them are markedly peaked, giving a significant weight to only two or three amino acids, suggesting the presence of a strong, site-specific, selection pressure at the amino-acid level, which would be one of the dominant forces shaping the substitution process. Another feature is that a given amino acid can belong to several distinct classes, indicating that context-dependent amino-acid exchangeability is an important aspect of protein evolution. Note that this property cannot be accounted for by the standard one-matrix models. Similar observations have been made previously, using a parsimony-based approach (Lopez 1997).

On the other hand, the difference between the posterior mean number of classes and the number of stable classes (28.4 ± 3.5 versus 11 in the case of EF30-627), reveals the existence of a large series of classes which are not easily identifiable. Closer examination indicates that most of these classes have a low occupancy number; some of them appear in a transitory fashion along a given run, and others, although persistent, do not always have the same sites affiliated to them. These classes should be interpreted with caution: they might not correspond to actual substitutional categories, and they might instead be considered as formal mathematical objects, relevant only through their marginal average influence on the substitution processes operating at the corresponding sites.

The fact that some classes are stable, while some other are not, offers an interesting insight into the flexibility of the Bayesian mixture models such as CAT-Poisson: obviously, the set of unstable classes, meaningful only through their average influence on the site-specific substitution processes, is a typically Bayesian feature, and it does not have an equivalent in existing heterogeneous models implemented in the ML framework. Incidentally, this suggests that the level of heterogeneity that these ML models assume should be compared not to the posterior mean number of classes found under CAT-Poisson, but rather, to the number of stable classes. This number provides a much more conservative estimate of the level of heterogeneity; yet, we found it to be consistently higher than 10, ranging from 11 to 22, confirming that the heterogeneity has been underestimated thus far.

Comparison with Other Heterogeneous Models

Thorne, Goldman, and co-workers propose to associate amino-acid replacement patterns with protein secondary structure and solvent accessibility (Goldman, Thorne, and Jones 1996;Thorne, Goldman, and Jones 1996;Goldman, Thorne, and Jones 1998;Liò and Goldman 1999). In practice, their model implements a mixture of 10 classes, corresponding to five types of secondary structure and two levels of solvent accessibility. Using parametric bootstrap, they show that their mixture model improves significantly on the standard homogeneous model. Their approach is very different from ours, in that we did not pre-specify what could determine the variations of the substitution pattern among sites. The results seem to be accordingly quite distinct: this is particularly obvious in the composition of the 10 substitutional classes estimated by Thorne, Goldman, and co-workers, which are all characterized by very broad distributions on all 20 amino acids, not as peaked as the classes proposed by CAT. To further compare the two approaches, we made a preliminary investigation of the correlation between the secondary structure and the substitution profiles inferred by CAT-Poisson at each site of the elongation factor 2 data set: we identified the sites that were allocated to one of the stable classes at a frequency higher than 80%, and we looked at their distribution across the sequence (see table S2 in the Supplementary Material online). We found that although α-helices are enriched in classes such as (K,R) and (D,E), and β-sheets are enriched in classes like (I,V) and (F,Y), all of the classes but one are represented in the two kinds of secondary structure, as well as in the remaining parts of the sequence. As for solvent accessibility, correlations were uncovered, but also here, most of the classes are represented in both buried and exposed categories. Thus, the heterogeneity uncovered in the present work cannot be explained only in terms of secondary structure and solvent accessibility. More generally, it seems that our approach and that of Thorne, Goldman, and co-workers do not work at the same level: CAT would focus on local, specific, biochemical requirements, which can differ widely even between two adjacent residues, belonging to the same secondary structure, whereas the model of Thorne, Goldman, and co-workers averages over this low-level heterogeneity, to capture more global modulations of the amino-acid replacement pattern, correlated to local secondary structure determinants.

The CAT model is much more closely related to the model proposed by Koshi and Goldstein (Koshi and Goldstein 1998;Koshi, Mindell, and Goldstein 1999;Koshi and Goldstein 2001). In a first version of their model, these authors imposed restriction on the set of possible amino-acid profiles, by defining their model exclusively in terms of hydrophobicity and size. These two physical-chemical properties probably play an important role in the shaping of the classes that we have obtained, but they are not sufficient to account for their diversity. In particular, we have found classes that are more clearly defined by the electric charge (D,E), or by the aromaticity (F,Y) (fig. 3D). This restriction was relieved in subsequent work (Dimmic, Mindell, and Goldstein 2000;Soyer et al. 2002), and the resulting model is more similar to CAT, except for the limitation in the total number of classes, which was set to 5. Another potentially important difference is that Koshi and Goldstein's model assumes that all sites belonging to the same class have also the same rate of substitution, while according to the CAT model, substitutional classes and site-specific overall rates of substitution are a priori independent variables. The fact that all stable classes that we observed on real data contain both fast and slow evolving sites (not shown) tends to give support to our prior choice.

Evaluation of the Model's Performance

The models investigated in the present work were compared through the evaluation of the Bayes factor. Note that the Bayes factor, as the ratio of the marginal likelihoods under two alternative models, could seem at first to be tantamount to a likelihood ratio. However, it is not equivalent to it. For instance, JTT+F, being nested within GTR, would necessarily yield a lowermaximal likelihood than GTR. Yet, itsmarginal likelihood is larger than that of GTR (lnBF = 81 natural log units in favor of JTT+F;table 3). This illustrates that the Bayes factor implicitely penalizes higher dimensional models. In fact, this penalty can be estimated asymptotically, yielding the Bayesian Information Criterion (Schwartz 1978):
formula
wherek is the difference of the number of parameters of the two models, andN is the number of columns of the alignment. This formula is valid only for i.i.d models, and it cannot be used to compare, for instance, GTR with CAT. However, in the general case, the Bayes factor can always be considered as a tradeoff between the relative likelihood scores of the two models and the informational cost of their respective parameterization.

The Bayes factor favors CAT-Poisson over standard one-matrix models, for all of the data sets which we have analyzed. It should be emphasized that an empirical matrix such as JTT has a significant prior advantage over CAT-Poisson, because it incorporates external information bearing on biochemical realism. In contrast, the CAT-Poisson model is completely naive, in the sense that the prior is totally uninformative. Nevertheless, this lack of prior biochemical knowledge does not seem to impair the performance of CAT-Poisson, presumably because this latter model is able to extract a substantial amount of equivalent knowledge from the data set, as confirmed by the biochemical relevance of the classes' profiles. Importantly, JTT+F is rejected even on quite small data sets, like EF2, which implies that the amount of information sufficient for CAT to outperform JTT+F is low. To verify this, we computed the Bayes factor under a smaller data set, made of the first third of EF30-627 (209 sites), and with a subset of five species. In this case, we observed that the Bayes factor was now in favor of JTT+F (lnB = −6.9 natural units), indicating that the amount of information necessary for CAT-Poisson to be fit is not reached on this highly reduced data matrix.

In contrast to CAT-Poisson, MAX-Poisson is rejected when compared to JTT+F. A plausible explanation is that MAX is penalized by its very large number of parameters (19 for each site), which cannot all be correctly determined by the information contained in the data.

Sensitivity to the Prior Distribution and to the Model's Assumptions

It is known that sensitivity to the prior is an important issue when dealing with parameter-rich models (Huelsenbeck et al. 2002;Rannala 2002). In the present work, there are two main directions along which the prior should be tested. First, the prior onK, the number of classes, is currently defined through the prior on α, the mixture concentration parameter. For α, we have only tried a flat prior distribution, although we could test other possibilities. More generally, we could dispense with the Dirichlet process prior, and work on more general mixture models that make it possible to work directly with the prior onK (Green and Richardson 1998). Second, we could also test the base distribution on the π-space. Thus far, we have chosen a flat Dirichlet distribution, but this could be generalized to any Dirichlet distribution by varying the center of the distribution, or by modulating its concentration parameter.

The assumption that the substitution process at each site is Poisson is another potential limitation; in particular, it does not take the codon structure into account. In principle, it is easy to dispense with this simplifying assumption. Our current implementation already allows the use of any predefined set of relative rates. Similarly, one could imagine a version of the CAT model formulated at the codon level, and in which a single set of relative rates of codon substitution would be combined with class-specific amino-acid acceptance profiles. However, the computations under unequal relative rates of exchange between amino acids can be as much as 50 times slower than under a Poisson process, a factor which is even greater in the case of the codon models. Further algorithmic development is therefore needed in order to proceed in this direction.

Impact on Phylogenetic Inference

Inferences about class number and composition conducted on EF30-627 under a fixed tree are very similar to those where the phylogeny is also a free parameter (see fig. S1 in the Supplementary Material online). Likewise, we observed that the number and the profiles of the classes inferred for the mitochondrial data set were not very sensitive to the exact tree used as a constraint (not shown). This suggests that our analysis is robust with respect to the phylogeny, a fact which is not surprising, as similar conclusions have been reached by authors studying the sensitivity of rate across site parameters (Yang 1995), or of tests of model comparison (Posada and Crandall 2001), to the underlying phylogeny.

Conversely, however, the models investigated in this work differ in some of their predictions, in particular, in their estimates of the evolutionary distance between sequences and of the phylogenetic relationships. First, the tree length under CAT-Poisson is 10% to 20% higher than under JTT+F, suggesting that mutational saturation of the sequences could be better accounted for by CAT-Poisson than by JTT+F. This would not be completely surprising (Miyamoto and Fitch 1996). Mutational saturation—i.e., a high number of convergences and reversions throughout the sequences—will be all the more frequent as the substitution process at any given site is, on average, confined to a very restricted set of amino acids. This seems to be exactly the kind of situation inferred by CAT-Poisson. Importantly, these homoplasies will be more easily missed by models that underestimate site-specific restrictions of the set of admissible residues. It is quite plausible that JTT+F leads to such an underestimation, and this would explain the discrepancies observed between JTT+F and CAT-Poisson, concerning the number of substitutions along the tree.

Second, there are a few differences in the phylogenetic reconstruction obtained under CAT-Poisson and under JTT+F (fig. 1). One of the groupings found under CAT-Poisson, the monophyly of chromalveolates (stramenopiles + alveolates), is reasonable, and has been advocated by other studies (Baldauf et al. 2000;Fast et al. 2001;Bapteste et al. 2002), whereas others, like the polyphyly of amoebas and the position of the choanoflagellateMonosiga within the animal clade, are more questionable (Bapteste et al. 2002;Lang et al. 2002). More work is still needed to evaluate the performances of CAT in phylogenetic reconstruction, an issue we are currently investigating. The present results already illustrate that a better statistical fit does not necessarily imply a more reliable tree (Sullivan and Swofford 2001). In any case, CAT might allow better inference of evolutionary distances, and for that reason, it could be used for molecular datings. In addition, it can be a promising tool for analyzing the relationships between site-specific substitution patterns and structure-function determinants.

Appendix: Numerical Evaluation of the Bayes Factor by Thermodynamic Integration

In what follows, we consider two models,M0 andM1, which are defined on the same set of parameters,h ∈ Ω (note that this condition is purely formal, because parameters specific to one of the models, sayM0, can be included in the parameter vector of the other model, even if they are not involved in the computation of the likelihood underM1). The Bayes factor in favor of modelM0, with respect to modelM1, is defined as the ratio of the marginal likelihoods:
formula
A parallel can be drawn between Bayesian inference and statistical physics as follows. Given one of the models,Mi, define an energy function underMi as
formula
According to Bayes's theorem
formula
wherekT = 1, andZi =P(D | Mi) plays here the role of a normalization factor. Thus, the energy is defined so that the posterior distribution is a Boltzmann distribution at a “temperature”kT = 1: this distribution can be sampled from using the classical MCMC methods, and the resulting Markov chain will thus be equivalent to a physical system at thermal equilibrium, fluctuating between all its accessible microscopic states. The free energy of such a system is related to the normalization factorZi, which is called the partition function in physics, by the relation
formula
BecauseZi is identical to the marginal likelihood, one has
formula
Hence, choosing the model of largest marginal likelihood amounts to choosing that of lowest free energy.
Let us now define the β-energy as
formula
and the corresponding β-probability as
formula
normalized by
formula

Note that for β = 0 (β = 1),pβ reduces to the posterior density underM0 (M1). Thus, the set (pβ)0≤β≤1 defines a continuous path in the space of probability distributions, connecting the posterior distributions underM0 andM1.

As before, we can define
formula
Taking the derivative ofFβ with respect to β yields
formula
where <·>β stands for the expectation with respect topβ (or β-average).
This yields the integral formula
formula
which suggest a simple method for estimating lnB01 (Ogata 1989): for any β ∈ [0, 1], a sample frompβ(h) can be obtained by running a MCMC. On such a sample, say (
\(\mathit{h}^{\mathit{l}}_{{\beta}}\)
)1≤lL, β-averages can be computed using the usual asymptotic relation:
formula
In particular, an estimate of <E1E0 >β can be computed in this way for any given value of β. Repeating this computation for a series of values of β regularly spaced over the unit interval, and approximating the integral by the Simpson procedure, yields an estimate of lnB01. Evidently, the quality of the estimate will depend both on the length of each MCMC run and on the number of points of the interpolation. In the present work, we found that six points (β = 0, 0.2, 0.4, 0.6, 0.8, 1.0) were sufficient to give a precision of about 10%. As for the length of the MCMC run, 100,000 cycles turn out to be sufficient, yielding a precision of 1% on the estimate of <E1E0 >β.

Supplementary Material

The following materials relevant to this article are available online:table 1. Settings of the MCMC runs;table 2. Distribution of the sites allocated to stable classes, according to secondary structure and solvent accessibility;fig. 1. CAT-Poisson analysis on the elongation factor data set EF30-627, under constrained (A,C) and unconstrained (B,D) topology;fig. 2. CAT-POISSON analysis on the eukaryotic data set Ek55-1525.fig. 3. CAT-Poisson analysis on the mammal mitochondrial data set Mt45-3596.

Associate Editor

Majority rule consensus phylogenies obtained with the models JTT+F (A) and CAT-Poisson (B). Branch lengths are proportional to the expected number of substitutions per site. Node support values are equal to the posterior probabilities
Fig. 1.

Majority rule consensus phylogenies obtained with the models JTT+F (A) and CAT-Poisson (B). Branch lengths are proportional to the expected number of substitutions per site. Node support values are equal to the posterior probabilities

A. Traceplot showing the evolution of K, the number of modes, during the elongation of two independent MCMCs (only the first 500 cycles are shown). The initial conditions were K = 1 (circles) and K = N (squares), with N standing for the number of sites. B. Histograms displaying the frequencies at which each value of K was observed along the two chains referred to in A, starting from K = 1 (solid lines), and K = N (dashed lines)
Fig. 2.

A. Traceplot showing the evolution ofK, the number of modes, during the elongation of two independent MCMCs (only the first 500 cycles are shown). The initial conditions wereK = 1 (circles) andK =N (squares), withN standing for the number of sites.B. Histograms displaying the frequencies at which each value ofK was observed along the two chains referred to inA, starting fromK = 1 (solid lines), andK =N (dashed lines)

A, B, C. Principal Component analysis of the profiles obtained under the CAT model, using A, real sequences (EF30-627); B, data simulated under CAT (SimuCAT30-627); C, data simulated under JTT (SimuJTT30-627). To make comparison easier, the profiles in B and C are projected onto the 2-dimensional subspace of R20 defined by the principal components computed in A. First axis accounts for 17.3%, and second axis for 12.7%, of the total variance in A and B, and 15.3% and 11.3% of the variance in C. D. Profiles of the 11 stable modes obtained with EF30-627. The one-letter amino-acid code is used. Font height is proportional to the frequency of the corresponding amino-acid. Modes are reported on panels A and B. E. Profiles of the six stable modes obtained with JTTSimu30-627. Modes are reported on panel C
Fig. 3.

A,B,C. Principal Component analysis of the profiles obtained under the CAT model, usingA, real sequences (EF30-627);B, data simulated under CAT (SimuCAT30-627);C, data simulated under JTT (SimuJTT30-627). To make comparison easier, the profiles inB andC are projected onto the 2-dimensional subspace ofR20 defined by the principal components computed inA. First axis accounts for 17.3%, and second axis for 12.7%, of the total variance inA andB, and 15.3% and 11.3% of the variance inC.D. Profiles of the 11 stable modes obtained with EF30-627. The one-letter amino-acid code is used. Font height is proportional to the frequency of the corresponding amino-acid. Modes are reported on panelsA andB.E. Profiles of the six stable modes obtained with JTTSimu30-627. Modes are reported on panelC

Histograms showing the estimated posterior probability distribution of K, the number of modes under the CAT-JTT for data simulated under JTT+F (dashed lines), under CAT-JTT for the real data (EF30-627, solid lines), and under CAT-GTR for the real data (dotted lines)
Fig. 4.

Histograms showing the estimated posterior probability distribution ofK, the number of modes under the CAT-JTT for data simulated under JTT+F (dashed lines), under CAT-JTT for the real data (EF30-627, solid lines), and under CAT-GTR for the real data (dotted lines)

Table 1

Estimates of Class Number and Tree Length Under CAT-Poisson, MAX-Poisson and JTT+F.

CAT-Poisson
MAX-PoissonJTT+F
Data Set〈α〉aKbKSMcTLCATdTLMAXeTLJTT+Ff
EF30-6276.3 ± 1.628.4 ± 3.5118.13 ± 0.217.06 ± 0.147.46 ± 0.14
Ek55-15254.7 ± 1.326.4 ± 3.7145.36 ± 0.104.82 ± 0.074.78 ± 0.08
Mt45-35965.4 ± 1.135.3 ± 3.2227.54 ± 0.105.90 ± 0.055.91 ± 0.05
CAT-Poisson
MAX-PoissonJTT+F
Data Set〈α〉aKbKSMcTLCATdTLMAXeTLJTT+Ff
EF30-6276.3 ± 1.628.4 ± 3.5118.13 ± 0.217.06 ± 0.147.46 ± 0.14
Ek55-15254.7 ± 1.326.4 ± 3.7145.36 ± 0.104.82 ± 0.074.78 ± 0.08
Mt45-35965.4 ± 1.135.3 ± 3.2227.54 ± 0.105.90 ± 0.055.91 ± 0.05

a Mean posterior value of the Dirichlet Prior concentration parameter.

b Mean posterior number of classes.

c Number of stable classes detected by clustering.

d Mean posterior tree length under CAT.

e Mean posterior tree length under MAX-Poisson.

f Mean posterior tree length under JTT+F.

Table 1

Estimates of Class Number and Tree Length Under CAT-Poisson, MAX-Poisson and JTT+F.

CAT-Poisson
MAX-PoissonJTT+F
Data Set〈α〉aKbKSMcTLCATdTLMAXeTLJTT+Ff
EF30-6276.3 ± 1.628.4 ± 3.5118.13 ± 0.217.06 ± 0.147.46 ± 0.14
Ek55-15254.7 ± 1.326.4 ± 3.7145.36 ± 0.104.82 ± 0.074.78 ± 0.08
Mt45-35965.4 ± 1.135.3 ± 3.2227.54 ± 0.105.90 ± 0.055.91 ± 0.05
CAT-Poisson
MAX-PoissonJTT+F
Data Set〈α〉aKbKSMcTLCATdTLMAXeTLJTT+Ff
EF30-6276.3 ± 1.628.4 ± 3.5118.13 ± 0.217.06 ± 0.147.46 ± 0.14
Ek55-15254.7 ± 1.326.4 ± 3.7145.36 ± 0.104.82 ± 0.074.78 ± 0.08
Mt45-35965.4 ± 1.135.3 ± 3.2227.54 ± 0.105.90 ± 0.055.91 ± 0.05

a Mean posterior value of the Dirichlet Prior concentration parameter.

b Mean posterior number of classes.

c Number of stable classes detected by clustering.

d Mean posterior tree length under CAT.

e Mean posterior tree length under MAX-Poisson.

f Mean posterior tree length under JTT+F.

Table 2

Estimates of the Number of Classes Under CAT-JTT and CAT-mtREV.

Data SetP(K = 1)KKSM
SimuJTT30-6270.601.7 ± 1.11
EF30-627051.6 ± 8.35
SimuJTT55-15250.641.8 ± 1.41
Ek55-1525045.7 ± 6.014
Mt45-3596070.9 ± 6.616
Data SetP(K = 1)KKSM
SimuJTT30-6270.601.7 ± 1.11
EF30-627051.6 ± 8.35
SimuJTT55-15250.641.8 ± 1.41
Ek55-1525045.7 ± 6.014
Mt45-3596070.9 ± 6.616
Table 2

Estimates of the Number of Classes Under CAT-JTT and CAT-mtREV.

Data SetP(K = 1)KKSM
SimuJTT30-6270.601.7 ± 1.11
EF30-627051.6 ± 8.35
SimuJTT55-15250.641.8 ± 1.41
Ek55-1525045.7 ± 6.014
Mt45-3596070.9 ± 6.616
Data SetP(K = 1)KKSM
SimuJTT30-6270.601.7 ± 1.11
EF30-627051.6 ± 8.35
SimuJTT55-15250.641.8 ± 1.41
Ek55-1525045.7 ± 6.014
Mt45-3596070.9 ± 6.616
Table 3

Model Comparison. Natural Logarithm of the Bayes Factor Between Poisson+F and All Other Models.

Data SetJTT+FWAG+FmtREV+FGTRCAT-PoissonMAX-Poisson
EF30-6271,6311,7601,3741,5501,932807
Ek55-15253,3243,6282,8483,8081,112
Mt45-359610,0898,97010,94312,389855
Data SetJTT+FWAG+FmtREV+FGTRCAT-PoissonMAX-Poisson
EF30-6271,6311,7601,3741,5501,932807
Ek55-15253,3243,6282,8483,8081,112
Mt45-359610,0898,97010,94312,389855
Table 3

Model Comparison. Natural Logarithm of the Bayes Factor Between Poisson+F and All Other Models.

Data SetJTT+FWAG+FmtREV+FGTRCAT-PoissonMAX-Poisson
EF30-6271,6311,7601,3741,5501,932807
Ek55-15253,3243,6282,8483,8081,112
Mt45-359610,0898,97010,94312,389855
Data SetJTT+FWAG+FmtREV+FGTRCAT-PoissonMAX-Poisson
EF30-6271,6311,7601,3741,5501,932807
Ek55-15253,3243,6282,8483,8081,112
Mt45-359610,0898,97010,94312,389855

We are grateful to David Bryant, Olivier Gascuel, Franz Lang, Jeffrey Thorne, and two anonymous referees for their useful comments on the manuscript. We wish to thank Sandra Baldauf for making available the alignment of four eukaryotic proteins, and Béatrice Philippe for her help in determining the relations between classes and secondary structure on the elongation factor data set. This work was supported by the Canadian Research Chair, and by a starting fund from the University of Montréal.

Literature Cited

Adachi, J., and M. Hasegawa.

1996
. Model of amino-acid substitution in proteins encoded by mitochondrial DNA.
J. Mol. Evol.
42
:
459
-468.

Adachi, J., P. J. Wadell, W. Martin, and M. Hasegawa.

2000
. Plastid genome phylogeny and a model of amino-acid substitution for proteins encoded by chloroplast DNA.
J. Mol. Evol.
50
:
348
-358.

Antoniak, C. E.

1974
. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems.
Ann. Statistics
2
:
1152
-1174.

Baldauf, S. L., A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle.

2000
. A kingdom-level phylogeny of eukaryotes based on combined protein data.
Science
290
:
972
-977.

Bapteste, E., H. Brinkmann, J. A. Lee, D. V. Moore, C. W. Sensen, P. Gordon, L. Durufle, T. Gaasterland, P. Lopez, M. Muller, and H. Philippe.

2002
. The analysis of 100 genes supports the grouping of three highly divergent amoebae: Dictyostelium, Entamoeba, and Mastigamoeba.
Proc. Natl. Acad. Sci. USA
99
:
1414
-1419.

Broet, P., S. Richardson, and F. Radvanyi.

2002
. Bayesian hierarchical model for identifying changes in gene expression from microarray experiments.
J. Comp. Biol.
9
:
671
-683.

Bruno, W. J.

1996
. Modeling residue usage in aligned protein sequences via maximum likelihood.
Mol. Biol. Evol.
13
:
1368
-1374.

Dayhoff, M., R. V. Eck, and C. M. Park.

1972
. A model of evolutionary change in proteins. Pp. 88–89In M. Dayhoff, ed., Atlas of Protein Sequence and Structure, National Biomedical Research Foundation, Washington, D.C.

Dayhoff, M., R. Schwartz, and B. Orcutt.

1978
. A model of evolutionary change in proteins. Pp. 345–352In M. Dayhoff, ed., Atlas of Protein Sequence and Structure, National Biomedical Research Foundation, Washington, D.C.

Dimmic, M. W., D. P. Mindell, and R. A. Goldstein.

2000
. Modeling evolution at the protein level using an adjustable amino acid fitness model.
Pac. Symp. Biocomput.
5
:
18
-29.

Escobar, M., and M. West.

1995
. Bayesian density estimation and inference using mixtures.
J. Am. Stat. Assoc.
90
:
577
-588.

Eskin, E., W. N. Grundy, and Y. Singer.

2001
. Using mixtures of common ancestors for estimating the probabilities of discrete events in biological sequences.
Bioinformatics
17
:
S65
-S73.

Fast, N. M., J. C. Kissinger, D. S. Roos, and P. J. Keeling.

2001
. Nuclear-encoded, plastid-targeted genes suggest a single common origin for apicomplexan and dinoflagellate plastids.
Mol. Biol. Evol.
18
:
418
-426.

Felsenstein, J.

1981
. Evolutionary trees from DNA sequences: a maximum likelihood approach.
J. Mol. Evol.
17
:
368
-376.

Felsenstein, J.

2004
. Inferring phylogenies. Sinauer Associates Inc., Sunderland, Mass.

Ferguson, T.

1973
. A Bayesian analysis of some nonparametric problems.
Ann. Statistics
1
:
209
-230.

Gelman, A.

1998
. Simulating normalizing constants: from importance sampling to bridge sampling to path sampling.
Stat. Sci.
13
:
163
-185.

Gelman, A., X. L. Meng, and H. Stern.

1996
. Posterior predicive assessment of model fitness via realised discrepancies.
Statistica Sinica
6
:
733
-807.

Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin.

2004
. Bayesian Data Analysis, 2nd edition. Chapman and Hall/CRC.

Goldman, N., J. Thorne, and D. Jones.

1998
. Assessing the impact of secondary structure and solvent accessibility on protein evolution.
Genetics
149
:
445
-458.

Goldman, N., J. L. Thorne, and D. T. Jones.

1996
. Using evolutionary trees in protein secondary structure prediction and other comparative sequence analyses.
J. Mol. Biol.
263
:
196
-208.

Goldman, N., and S. Whelan.

2002
. A novel use of equilibrium frequencies in models of sequence evolution.
Mol. Biol. Evol.
19
:
1821
-1831.

Green, P. J., and S. Richardson.

1998
. Modelling heterogeneity with and without the Dirichlet process. Technical report, University of Bristol, Bristol, U.K.

Halpern, A. L., and W. J. Bruno.

1998
. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies.
Mol. Biol. Evol.
15
:
910
-917.

Huelsenbeck, J. P., B. Larget, R. E. Miller., and F. Ronquist.

2002
. Potential applications and pitfalls of Bayesian inference of phylogeny.
Syst. Biol.
51
:
673
-688.

Huelsenbeck, J. P., and R. Nielsen.

1999
. Variation in the pattern of nucleotide substitution across sites.
J. Mol. Evol.
48
:
86
-93.

Huelsenbeck, J. P., and F. Ronquist.

2001
. MrBayes: Bayesian inference of phylogenetic trees.
Bioinformatics
17
:
754
-755.

Jaynes, E.

2003
. Probability Theory. The logic of science. Cambridge University Press, Cambridge, U.K.

Jeffreys, H.

1935
. Some tests of significance, treated by the theory of probability.
Proc. Camb. Phil. Soc.
31
:
203
-222.

Jeffreys, H.

1961
. Theory of Probability. Oxford University Press.

Jones, D. T., W. R. Taylor, and J. M. Thornton.

1992
. The rapid generation of mutation data matrices from protein sequences.
Cabios
8
:
275
-282.

Kass, R., and A. Raftery.

1995
. Bayes factors and model uncertainty.
J. Am. Stat. Assoc.
90
:
773
-795.

Koshi, J. M., and R. A. Goldstein.

1998
. Models of natural mutations including site heterogeneity.
Proteins
32
:
289
-295.

Koshi, J. M., and R. A. Goldstein.

2001
. Analyzing site heterogeneity during protein evolution. Pac. Symp. Biocomput. pp. 191–202.

Koshi, J. M., D. P. Mindell, and R. A. Goldstein.

1999
. Using physical-chemistry–based substitution models in phylogenetic analyses of HIV-1 subtypes.
Mol. Biol. Evol.
16
:
173
-179.

Lang, B. F., C. O'Kelly, T. Nerad, M. W. Gray, and G. Burger.

2002
. The closest unicellular relatives of animals.
Curr. Biol.
12
:
1773
-1778.

Larget, B., and D. Simon.

1999
. Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees.
Mol. Biol. Evol.
16
:
750
-759.

Li, S.

1996
. Phylogenetic tree construction using Markov chain Monte Carlo. PhD dissertation, Ohio State University, Columbus, Ohio.

Liò, P., and N. Goldman.

1999
. Using protein structural information in evolutionary inference: transmembrane proteins.
Mol. Biol. Evol.
16
:
1696
-1710.

Lopez, P.

1997
. Analyse phylogénétique de grands alignements de protéines: vers une classification des sites? Master degree dissertation, Université Paris XI, Paris, France.

Miyamoto, M. M., and W. M. Fitch.

1996
. Constraints on protein evolution and the age of Eubacteria/Eukaryote split.
Syst. Biol.
45
:
568
-575.

Muller, T., R. Spang, and M. Vingron.

2002
. Estimating amino-acid substitution models: a comparison of Dayhoff's estimator, the resolvent approach and a maximum likelihood method.
Mol. Biol. Evol.
19
:
8
-13.

Neal, R. M.

2000
. Markov chain sampling methods for Dirichlet process mixture models.
J. Comput. Graphical. Stat.
9
:
249
-265.

Ogata, Y.

1989
. A Monte Carlo method for high dimensional integration.
Numerishe Mathematik
55
:
137
-157.

Posada, D., and K. Crandall.

2001
. Selecting the best-fit model of nucleotide substitution.
Syst. Biol.
50
:
580
-601.

Rannala, B.

2002
. Identifiability of parameters in MCMC Bayesian inference of phylogeny.
Syst. Biol.
51
:
754
-760.

Rubin, D. B.

1984
. Bayesianly justifiable and relevant frequency calculations for the applied statistician.
Ann. Stat.
4
:
1151
-1172.

Schneider, T. D., and R. M. Stephens.

1990
. Sequence logos: a new way to display consensus sequences.
Nucleic Acids Res.
18
:
6097
-6100.

Schwartz, G.

1978
. Estimating the dimension of a model.
Ann. Statistics
6
:
461
-464.

Soyer, O., M. W. Dimmic, R. R. Neubig, and R. A. Goldstein.

2002
. Using evolutionary methods to study G-protein coupled receptors. Pac. Symp. Biocomput. pp. 625–636.

Suchard, M., R. Weiss, and J. Sinsheimer.

2001
. Bayesian selection of continuous-time Markov chain evolutionary models.
Mol. Biol. Evol.
18
:
1001
-1013.

Sullivan, J., and D. L. Swofford.

2001
. Should we use model-based methods for phylogenetic inference when we know that assumptions about among-site variation and nucleotide substitution pattern are violated?
Syst. Biol.
50
:
723
-729.

Swofford, D., G. P. Olsen, P. J. Waddell, and D. M. Hillis.

1996
. Phylogenetic inference. In Molecular Systematics. Sinauer Associates, Sunderland, Mass.

Thorne, J. L., N. Goldman, and D. T. Jones.

1996
. Combining protein evolution and secondary structure.
Mol. Biol. Evol.
13
:
666
-673.

Wald, A.

1949
. Note on the consistency of maximumm likelihood.
Ann. Math. Stat.
20
:
595
-601.

Whelan, S., and N. Goldman.

2001
. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach.
Mol. Biol. Evol.
18
:
691
-699.

Yang, Z.

1993
. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites.
Mol. Biol. Evol.
10
:
1396
-1401.

Yang, Z.

1994
. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods.
J. Mol. Evol.
39
:
306
-314.

Yang, Z.

1995
. A space-time process model for the evolution of DNA sequences.
Genetics
139
:
993
-1005.

Yang, Z.

1996
. Among site variation and its impact on phylogenetic analyses.
Trends Ecol. Evol.
11
:
367
-370.

Yang, Z., and B. Rannala.

1997
. Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method.
Mol. Biol. Evol.
14
:
717
-724.

Molecular Biology and Evolution vol. 21 no. 6 © Society for Molecular Biology and Evolution 2004; all rights reserved.
Issue Section:
Original Article
  • Supplementary data

    Advertisement

    Citations

    Views

    16,313

    Altmetric

    Metrics
    Total Views16,313
    13,747Pageviews
    2,566PDF Downloads
    Since 11/1/2016
    Month:Total Views:
    November 20161
    January 201713
    February 201758
    March 201767
    April 201764
    May 201770
    June 201754
    July 201715
    August 201732
    September 201763
    October 201756
    November 201761
    December 2017218
    January 2018301
    February 2018285
    March 2018378
    April 2018269
    May 2018357
    June 2018311
    July 2018299
    August 2018305
    September 2018319
    October 2018330
    November 2018391
    December 2018267
    January 2019176
    February 2019331
    March 2019366
    April 2019320
    May 2019316
    June 2019263
    July 2019304
    August 2019281
    September 2019250
    October 2019281
    November 2019238
    December 2019239
    January 2020195
    February 2020118
    March 2020108
    April 2020134
    May 2020104
    June 2020102
    July 2020182
    August 2020138
    September 2020138
    October 2020110
    November 2020141
    December 2020105
    January 2021125
    February 2021101
    March 2021165
    April 2021154
    May 2021193
    June 2021155
    July 2021124
    August 2021126
    September 2021142
    October 2021147
    November 2021163
    December 2021147
    January 2022108
    February 2022103
    March 2022131
    April 2022108
    May 2022133
    June 2022138
    July 202290
    August 2022112
    September 2022110
    October 2022112
    November 2022148
    December 2022139
    January 2023104
    February 2023148
    March 2023169
    April 2023160
    May 2023135
    June 202393
    July 2023113
    August 202386
    September 202395
    October 2023159
    November 2023122
    December 2023139
    January 2024165
    February 2024153
    March 2024213
    April 2024165
    May 2024147
    June 202492
    July 2024153
    August 2024139
    September 2024117
    October 2024115
    November 2024102
    December 2024101
    January 2025109
    February 2025118
    March 202597
    April 2025106
    Citations
    Powered by Dimensions
    1,146Web of Science
    Altmetrics
    ×

    Email alerts

    New journal issues alert

    To set up an email alert, pleasesign in to your personal account, orregister

    Sign in

    Personal account

    • Sign in with email/username & password
    • Get email alerts
    • Save searches
    • Purchase content
    • Activate your purchase/trial code
    • Add your ORCID iD

    Journal article activity alert

    To set up an email alert, pleasesign in to your personal account, orregister

    Sign in

    Personal account

    • Sign in with email/username & password
    • Get email alerts
    • Save searches
    • Purchase content
    • Activate your purchase/trial code
    • Add your ORCID iD
    Advertisement
    Advertisement
    Advertisement
    Molecular Biology and Evolution
    • Online ISSN 1537-1719
    • Copyright © 2025 Society for Molecular Biology and Evolution
    Close
    Close
    This Feature Is Available To Subscribers Only

    Sign In orCreate an Account

    Close

    This PDF is available to Subscribers Only

    View Article Abstract & Purchase Options

    For full access to this pdf, sign in to an existing account, or purchase an annual subscription.

    Close

    [8]ページ先頭

    ©2009-2025 Movatter.jp