Most current models of sequence evolution assume that all sites of a protein evolve under the same substitution process, characterized by a 20 × 20 substitution matrix. Here, we propose to relax this assumption by developing a Bayesian mixture model that allows the amino-acid replacement pattern at different sites of a protein alignment to be described by distinct substitution processes. Our model, named CAT, assumes the existence of distinct processes (or classes) differing by their equilibrium frequencies over the 20 residues. Through the use of a Dirichlet process prior, the total number of classes and their respective amino-acid profiles, as well as the affiliations of each site to a given class, are all free variables of the model. In this way, the CAT model is able to adapt to the complexity actually present in the data, and it yields an estimate of the substitutional heterogeneity through the posterior mean number of classes. We show that a significant level of heterogeneity is present in the substitution patterns of proteins, and that the standard one-matrix model fails to account for this heterogeneity. By evaluating the Bayes factor, we demonstrate that the standard model is outperformed by CAT on all of the data sets which we analyzed. Altogether, these results suggest that the complexity of the pattern of substitution of real sequences is better captured by the CAT model, offering the possibility of studying its impact on phylogenetic reconstruction and its connections with structure-function determinants.
Probabilistic methods are widely used in phylogenetic reconstruction. Their main advantage, compared to methods such as maximum parsimony, is to make all assumptions underlying the reconstruction explicit, while providing powerful and general techniques for validating those assumptions (Swofford et al. 1996;Sullivan and Swofford 2001). Given a stochastic model of evolution, these methods allow computation of the probability of observing the available data, conditional on a phylogenetic hypothesis (specified by a topology, branch lengths plus some other parameters). This probability is used as a measure of the likelihood of the corresponding hypothesis, and one then invokes the maximum likelihood (ML) principle, choosing the hypothesis for which the probability of observing the data is maximal.
Maximum likelihood is intuitively appealing, and mathematical theorems guarantee the asymptotic consistency of the method (Wald 1949). However, when applied to models that are too rich in parameters, ML can lead to over-fitting artifacts. This is true, for instance, when models with site-specific parameters are considered, a problem sometimes referred to as the “infinitely many parameter trap” (Felsenstein 2004). In practice, this limitation on the number of parameters restricts the range of models that the phylogeneticist, seeking more realism, would like to explore. An alternative probabilistic paradigm, the Bayesian method, has been introduced recently in the field of molecular phylogeny (Li 1996;Yang and Rannala 1997;Larget and Simon 1999;Huelsenbeck and Ronquist 2001). It was initially proposed as a practical alternative to ML, mainly on the grounds that it offers a natural measure of uncertainty, thereby avoiding the costly method of bootstrap resampling (Larget and Simon 1999). However, another advantage of Bayes, which is in our opinion much more fundamental, lies in its greater flexibility with respect to the model's dimensionality (Huelsenbeck et al. 2002;Rannala 2002). In contrast to ML, and through a different treatment of the nuisance parameters which are not directly estimated but integrated away, Bayes is able to deal with much higher dimensional models while offering several methods to test these models in the light of available data (Jeffreys 1935;Jaynes 2003). This flexibility opens new avenues of investigation, as it makes it possible to build more realistic models of sequence evolution by adjusting the dimensionality so that it reflects the complexity actually displayed by the data, rather than the limitations inherent in the method.
One of the ways by which restrictions are traditionally imposed on model dimensionality is by assuming that sites of an alignment are independent and identically distributed: that is, they are considered as independent realizations of the same substitution process, running along the branches of the underlying evolutionary tree (Felsenstein 1981). In certain circumstances, as in the case of pseudogenes, this assumption might be valid, but more often different nucleotides or different codons of a gene will evolve under very different selection pressures. The assumption of identical distribution is partially relaxed in the “rates across sites” models (Yang 1993,1994), where the rate of evolution at each site is a random variable drawn from a gamma distribution. Such models yield a major improvement in statistical adequacy over the uniform rate model when applied to both nucleotide and protein data sets (Yang 1996). However, they still assume that all the remaining parameters of the model of evolution—i.e., the equilibrium frequencies and the relative rates of substitution among nucleotides or amino acids—are the same at all sites.
In practice, all these parameters are summarized into a 4 × 4 or 20 × 20 rate matrix. In the case of amino-acid alignments, empirical matrices are generally used, which have first been obtained by counting pairs of amino acids at homologous positions in large sets of aligned proteins (Dayhoff, Eck, and Park 1972;Dayhoff, Schwartz, and Orcutt 1978;Jones, Taylor, and Thornton 1992). Matrices optimized by ML have also been proposed for mitochondrial (Adachi and Hasegawa 1996), chloroplast (Adachi et al. 2000), and nuclear (Whelan and Goldman 2001) proteins. More recently, an alternative, faster method of optimization has been introduced (Muller, Spang, and Vingron 2002), and a new method generalizing the use of empirical matrices has been proposed (Goldman and Whelan 2002).
A number of models for possible heterogeneity in the substitution pattern at distinct sites have already been proposed, both for nucleotide data and amino-acid data. In the case of nucleotides, the transition/transversion ratio was modeled as a site-specific random variable (Huelsenbeck and Nielsen 1999). As for proteins, a first approach has been proposed, in which substitutional heterogeneity is introduced through a set of eight to ten predefined categories, based on secondary structure and solvent accessibility considerations (Goldman, Thorne, and Jones 1996;Thorne, Goldman, and Jones 1996;Goldman, Thorne, and Jones 1998;Liò and Goldman 1999). Each category has its own rate matrix, optimized by ML on real data sets. This model was shown to be significantly supported by real sequences, yet it does not address the question of the extent of heterogeneity actually present in the data. Furthermore, it makes specific hypotheses about its determining factors. An alternative method has been proposed in which no prior constraints are specified between the substitution processes and other features of the protein, like the secondary structure (Koshi and Goldstein 1998;Koshi, Mindell, and Goldstein 1999;Koshi and Goldstein 2001). However, the substitution processes themselves are constrained to conform to a prior biochemical model. Although this approach was generalized (Dimmic, Mindell, and Goldstein 2000;Soyer et al. 2002), the total number of categories is predetermined and is kept small, still for dimensionality reasons. A more radical approach was taken by Bruno (Bruno 1996;Halpern and Bruno 1998), through a model in which the equilibrium frequencies of the 20 amino acids are distinct at each site of the data set. The resulting model seems to capture important features of the substitution process along protein sequences, but it requires a large number of taxa in order for the statistics at each column to be significant.
As mentioned above, the flexibility of the Bayesian paradigm with respect to model dimensionality makes it possible to build models assuming high levels of heterogeneity. Yet, this does not tell how the number of parameters can be adjusted properly to match the signal present in the data. A possibility is to use mixture models in which the dimensionality is not fixed a priori and is itself a free parameter of the inference. Such mixture models are being introduced in many fields of applied statistics (Escobar and West 1995;Green and Richardson 1998), including bioinformatics (Eskin, Grundy, and Singer 2001;Broet, Richardson, and Radvanyi 2002), but they have not yet been applied to molecular phylogenetics.
Here we propose a mixture model, CAT, which generalizes most of the previous approaches (Bruno 1996;Koshi and Goldstein 1998;Dimmic, Mindell, and Goldstein 2000). The model allows for a numberK of classes, each of which is characterized by its own set of equilibrium frequencies, and lets each site “choose” the class under which its substitutional history is to be described. The model can be constrained, with the number of classes fixed to one, as in the standard one-matrix model, or such that each site is described by its own class. Alternatively, we can use a Dirichlet process prior (Ferguson 1973;Antoniak 1974) on the space of equilibrium frequencies, to let the total number of classes be a free variable of the inference. The posterior mean value is then a measure of the substitutional heterogeneity actually present in the alignment. We have implemented this model in a Markov chain Monte Carlo (MCMC) framework, allowing joint inference to be performed simultaneously on all the parameters of the model, including the mixture parameters, the rates at each site, the branch lengths, and the topology of the underlying phylogenetic tree. Using this model, we have conducted inferences on large real data sets and found that, in all cases, the level of heterogeneity is much higher than has been accounted for by previous mixture models. In addition, we show that the standard model based on one single empirical matrix conditioning the substitution process at all sites is outperformed by CAT.
We used three alignments of amino-acid sequences as follows:
These data were obtained as a subset of a larger alignment of eukaryotic elongation factor 2 sequences. Thirty taxa were chosen to represent the diversity of the eukaryotic lineages, and their aligned sequences were retrieved. We removed all columns for which gaps were present or data were missing, leaving a total of 627 sites. A phylogenetic reconstruction under the JTT+F model, performed with the implementation described below, yielded a tree which we used as the fixed topology under which the most time-consuming inferences were conducted. This tree was identical to the posterior consensus tree obtained with the MrBayes 3.0 program (Huelsenbeck and Ronquist 2001) using the JTT matrix, a Dirichlet prior with a concentration parameter of 1 on the equilibrium frequencies, and an Invariant + Γ rate prior modeled by 16 discrete rate categories.
These data are a concatenation of the sequences of four cytoplasmic proteins (actin, EF-1α, α and β tubulins), sampled across 55 eukaryotic species (Baldauf et al. 2000). The alignment was kindly provided by Sandra Baldauf. We removed the diplomonads and trichomonads from our analysis because of their long branches. When constrained, the topology was fixed as inBaldauf et al. (2000).
The complete coding sequences of the mitochondrial genomes of 45 mammals were aligned with each other, and the ambiguous regions of the alignment were removed, yielding a data matrix of 45 × 3,596 amino acids. The MrBayes program was run on this alignment, using themtREV empirical matrix, with four discrete Γ-categories. The resulting majority-rule consensus (using theallcompat option) was used as the fixed topology.
The available data are in the form of an alignment ofP amino-acid sequences, of lengthN. Leti index the columnsCi, or sites, andj the branches. All of the models used in this work assume that the sequences are related by an unknown, unrooted phylogenetic treet, with branch lengths βj ≥ 0,j = 1, 2, … , 2P − 3. Sites have their own rates of substitutionri > 0, such that
We proposed a model that is site-heterogeneous with respect to the substitution process, and which we called CAT (because in effect, it classifies sites intocategories). Under the CAT model, sites are distributed according to a mixture ofK distinct classes. Each class is characterized by its own substitution matrixQk, and the class to which each site belongs is specified by an allocation variablezi ∈ [1..K]. The vectorz = (zi)i∈[1..N] is called the allocation vector.
The relative rates ρlm can be fixed to pre-specified values, allowing different models to be specified. The mathematically simplest model is obtained by setting all the relative rates equal to unity: sites are described by a mixture of Poisson processes. Alternative models were also considered, based on the relative rates of known empirical matrices: JTT (Jones, Taylor, and Thornton 1992) andmtREV (Adachi and Hasegawa 1996), depending on the data set under study. The corresponding models were called CAT-Poisson, CAT-JTT, and CAT-mtREV, respectively.
Once site-specific rates, substitution matrices, and allocation variables are known, the likelihood at each site is computed using Felsenstein's pruning algorithm (Felsenstein 1981). In the case where the processes are Poisson, it is possible to perform mathematically equivalent computations by recoding the process at each site in such a way that all nonobserved amino acids at a given column are collectively considered as one single state. The stationary probability of this new state is the sum of the stationary probabilities of all non-observed amino acids. Felsenstein's pruning algorithm is inS2P, whereS is the number of states, andP the number of taxa. For amino-acid data without the recoding,S = 20. Under the recoding,S will be in general less than 20, and for highly conserved alignments it can often be as low as 2 or 3, which yields a significant increase in computational speed (up to a 50-fold increase in speed was observed in the case of the eukaryotic data set Ek55-1525).
We used uninformative priors ont, β,r and ρ, as follows:
t ∼Uniform over topologies
βj ∼Uniform[0, βmax], with βmax = 100
r ∼Dirichlet(1, 1, … , 1)
The resulting marginal distribution at each site is a γ distribution, with an α parameter of 1.
ρ ∼Dirichlet(1, 1, … , 1)
In addition, we defined a prior on the mixtures, using a Dirichlet process (DP) (Ferguson 1973;Antoniak 1974). A DP prior is parameterized by a concentration parameter α, and a base distributionG0(π) defined on the π-space, the space of stationary probabilities. It can be described by the following procedure for randomly drawing a configuration from the prior (Neal 2000):
Draw the number of classesK, together with a N-vector of allocation variablesz, according top(z,K | α).
Draw K values i.i.d. in the π-space: ∀k, πk ∼G0
In particular cases, one can also dispense with the DP prior and constrain the value ofK. Two cases were considered:
The one-matrix model:K = 1. This corresponds to the usual models: Poisson, JTT,mtREV and, when the relative rates ρlm are considered as free parameters, GTR. The stationary probabilities of the single substitution matrix can be either free or fixed to their empirical estimates. The models using empirical estimates are called Poisson+F, JTT+F, WAG+F, ormtREV+F, depending on the set of relative rates that are used (in the case of the GTR model, stationary probabilities are always considered as free parameters).
The maximally heterogeneous model:K =N, whereN is the number of sites. Each site evolves under its own substitution process. TheN sets of 20 stationary probabilities are considered unknown parameters (19N free parameters). The resulting model is called MAX, and under equal relative rates is similar to the model proposed byBruno (1996).
These two equations immediately suggest a Gibbs sampling algorithm, in which each site is taken in turn, and its allocation variable is reassessed according to these conditional probabilities. Note that the total number of classesK can change through this update. Thus, if the site had a class on its own, but is re-allocated to another already existing class,K will decrease by one. In the reverse situation, if sitei were initially allocated to a classk such that ηk,−i > 0, and ends up allocated to a new class, thenK will increase. In all other cases,K will remain constant. Upon iteration, and when combined with other updates (see below), this algorithm allowsK to fluctuate across the whole range [1..N].
What determines the equilibrium level reached byK? First, one can see from equation 9 that, not surprisingly, the choice among alternative allocations of sitei to already existing classes is driven by the relative likelihoods of these allocations. In contrast, equation 10 shows that the probability of letting the site have a class on its own is mainly determined by how theprior expectation of the site's likelihood over all possible sets of stationary probabilities compares with likelihoods under already existing classes. This amounts to comparing the performance of anon-fitted new profile to the currently available ones,which have already been fitted to the rest of the data. This asymmetry makes it difficult for a site to induce the creation of a new class, unless the pattern at the corresponding column of the alignment is sufficiently distinct from all other columns. Finally, the weights (ηz,−i in equation 9, or α in equation 10) represent the net variations of the prior factorP(z, K | α). That the weight is α in equation 10 makes it apparent that α has a direct influence on the propensity of proposing new classes, and thus on the stationary value ofK.
In our MCMC sampler, we rely on a slightly modified version of this update mechanism (Neal 2000), as follows:
SwitchMode is used in combination with two other operators, one updating the stationary probabilities of one class at a time (StationaryMove) while keepingz and α constant, and another one proposing a new value for α.
StationaryMove: one site is chosen at random, an update of the profile of the class to which it is allocated is proposed according to a Dirichlet distribution centered on the current value, with a tuning parameter of λS.
We used the Global and Local algorithms proposed byLarget and Simon (1999). In addition, we also devised a node-sliding operator, as well as three operators proposing new values for the branch lengths, while leaving the topology invariant:
NodeSliding: we randomly pick an internal branch of the unrooted tree and then designate its two nodesu andv. The other two neighbors ofu are calleda andb, and those ofv are calledc andd, with equal probability. We then slide the brancha −u along the segmentb −u −v −d, by a distancel = λN * (U − 0.5), whereU ∼Uniform[0, 1], and λN > 0, is a tuning parameter. Ifl > 0, the move is made towardd, and towardb otherwise. In addition, if the move pushes the branch out of the range defined byb andd, the excess is reflected back into the required interval. The Hastings ratio for this proposal is 1.
OneBranch: one branch of the unrooted tree is chosen at random, and its length is multiplied by a random factorr =eλ0(U−0.5), whereU ∼Uniform[0, 1], and λ0 > 0, is a tuning parameter. The Hastings ratio equalsr.
AllBranch: all branch lengths are updated simultaneously, each βj being multiplied by a distinct random factorrj =
Homothetic: all branch lengths are updated simultaneously, as in OneBranch, all being multiplied by the same random factorr =
As inLarget and Simon (1999), we propose a new set of values according to a Dirichlet distribution centered on the current parameter value. We found that this Monte Carlo operator was more efficient if applied only on a subset of the rate vector. Specifically, a small numberq of sites,i1,i2,..iq, are chosen at random, and a new set of values, according to a Dirichlet distribution restricted oni1,i2 …iq, with weights
During sampling, the different components of the hypothesis vector were updated separately, in random alternation, according to a grid of weights (wm)1≤m≤M, whereM is the total number of operators. Specifically, definingW =
For each run, the convergence was assessed by checking for the absence of long-term trends in a series of key monitor functions. Specifically, we monitored the log likelihood, the tree length, the entropy of the rate distribution, the number of classes, and the average profile entropy (class-occupancy weighted average). In addition, in most cases, at least two independent runs were performed, starting from different points of the parameter space taken at random, and their marginal properties (majority-rule consensus tree, class number, and composition) were compared.
Under CAT-Poisson, CAT-JTT, CAT-mtREV, and CAT-GTR, we did a total of 600,000 update cycles. The first 100,000 points were discarded, and we subsampled every 50 cycles after burn in. Under JTT+F and MTREV+F, the only free parameters are the site-specific rates of substitution and the branch lengths, which makes convergence much faster, so that we only did a total of 100,000 update cycles, removed the first 20,000, and subsampled every 10 cycles. Under GTR, we did a total of 200,000 update cycles, removed the first 40,000, and subsampled every 10 cycles.
Note that any given point of the sample may have several of its classes contained in some clusters, while it may not be represented in other clusters. A given cluster is referred to as astable cluster if one and only one class is affiliated to this cluster for more than 80% of the points of the sample. For each stable cluster, one can compute the following:
its mean occupancy number: specifically, the sum of all its classes' occupancy number divided by the sample size.
mean profile: occupancy number weighted average of its classes' profiles.
A stable cluster can be interpreted as a class that can be identified unequivocally across the MCMC sampling, and thus, we will refer to stable clusters directly asstable classes.
We developed software for principal component analysis. Plots were visualized usingGnuplot (http://www.gnuplot.info). We wrote a program for visualizing profiles using a representation akin to sequence logos (Schneider and Stephens 1990), translating sets of stationary probabilities into PostScript files.
In practice, a collection ofR replicas are obtained using the following procedure: first, run a MCMC under modelM, with dataD, discard the burn-in, and takeR points regularly spaced in the remaining part of the chain (hr)1≤r≤R. Next, for eachr, simulate a replicaDr underhr.
When more than two models are being compared, one can exploit the additivity property of the logarithm of the Bayes factor, i.e., ∀M0,M1,M2, ln
All source code files and alignments are available upon request.
We first conducted inferences under the CAT-Poisson Model (i.e., a free number of Poisson processes), on the elongation factor 2 data set, EF30-627. The topology was constrained according to the posterior consensus tree found by running our program on this data set using the JTT+F model (fig. 1A), while all other parameters (branch lengths, site-specific rates, class number and profiles, as well as the allocations of sites to modes) were left unconstrained.
Figure 2A shows the evolution of the value ofK, the number of classes, during the elongation of two independent MCMC chains. The chains were initialized atK = 1 andK =N, respectively. Irrespective of the starting point,K converged to a well-defined interval centered on 28 (28.4 ± 3.5). The histograms of the frequencies estimated from the two independent chains are very similar (fig. 2B), the corresponding 95% credibility interval being [22,35] and [21,36]. Substitutional homogeneity across sites is thus rejected by our model (p(K = 1 | D) ≃ 0). On the other hand, the maximally heterogeneous situation is also excluded by the present analysis (p(K =N | D) ≃ 0). In fact, the mean posterior number of classes, as inferred from CAT-Poisson, is low, when compared to the total number of columns of the alignment (an average of one class per 22 sites). It should be stressed that, even with as few as 28 classes, the model is dealing with a relatively high number of additional parameters, compared to JTT+F (specifically, 28 × 19 = 532 free continuous parameters for the profiles, andN = 627 discrete parameters for the allocation vector, whereN is the length of the alignment).
We performed the same analysis without constraining the topology. We found a topology (fig. 1B) similar to that obtained using the JTT+F model (fig. 1A). A few differences are present, however: under CAT-Poisson, the choanoflagellateMonosiga is found within the metazoan clade, and not as its sister group. In addition,Dictyostelium is clustered with diplomonads and trichomonads. Finally, stramenopiles became a sister group of the alveolates, instead of the diplomonads and trichomonads. In spite of these differences, however, the mean number of classes (〈K〉 = 27.9 ± 3.6) is close to that obtained when the topology is constrained, suggesting that the analysis performed under CAT-Poisson is robust against phylogenetic uncertainty.
Interestingly, the inferred total length of the tree is strongly model dependent: under the constrained topology, we observed a posterior mean number of 8.13 ± 0.21 substitutions per site under CAT-Poisson, versus 7.46 ± 0.14 under JTT+F (table 1).
Any two points of a given Markov chain obtained under the CAT model may differ in all respects (class number, site to class allocations, class specific profiles, as well as branch lengths and site-specific rates), and there is no a priori obvious way of identifying a given class throughout the sample. On the other hand, there may be some stable patterns. To identify them, we took 1,000 regularly spaced points from a given run and pooled all the classes observed at each of these points. Each class is characterized by its profile, and can be assimilated to a vector in a 20-dimensional space. A principal component analysis (PCA) on the set of pooled classes is shown infigure 3A. Two kinds of patterns can be seen: first, a broad set of dots scattered more or less homogeneously across a large region in the center of the projection, and second, a series of 10 to 12 dense clouds. To further characterize this distribution, we used a clustering method (seeMaterials and Methods), and identified a series of 11 stable classes (i.e., classes that can be unequivocally identified across the sample). The mean profile and the mean occupation number of each identified stable class is shown infigure 3D, and a projection of their profiles back onto the PCA (fig. 3A, large crosses) clearly shows that these stable classes correspond to the dense clouds.
The profiles of the stable classes are biochemically reasonable (fig. 3D). For instance, there are classes with negatively charged residues (D and E), with positively charged residues (K and R), or with aromatic (F and Y) residues. Note the diversity of alternative classes belonging to the same biochemical category: there are two hydrophobic classes, one with the amino acids I and V, and one favoring L and M. Finally, some amino acids are found in several stable classes, like S, which belongs to both (A,S) and (S,T) classes; the most extreme example is that of alanine (A), which is present in five different classes, (A,S), (A,G), (A,P), (A,C,S,T,V), and (A,D,E,G,K,Q,N,S,T,V). Thus, according to these results, the same amino acid can undergo different types of substitutions depending on the context. The same cluster analysis was performed without constraining the topology and gave essentially identical results (see fig. S1, in the Supplementary Material online).
The CAT-Poisson model was applied to data sets of larger size, consisting in the concatenation of four nuclear proteins from 55 eukaryotes (Ek55-1525), and of the mitochondrial proteins of 45 mammals (Mt45-3596). We found significant support in favor of heterogeneity in all cases (table 1), as shown both by the posterior mean number of classes (26 and 35) and by the number of stable classes identified by clustering (14 and 22).
Biochemically reasonable classes are found in all cases (see figures S2 and S3 in the Supplementary Material online), and they bear many similarities with each other across the three data sets, in particular between EF30-627 and Ek55-1525, suggesting that the number and the composition of the classes inferred by CAT-Poisson correspond to generic properties of the substitution patterns in proteins. More significant differences are observed between mitochondrial proteins and the other two data sets. This might reflect that mitochondrial proteins are mostly transmembrane proteins, whereas the proteins included in the two other data matrices are exclusively cytosolic factors. This is in agreement with the fact that themtREV is markedly different from other empirical substitution matrices (Adachi and Hasegawa 1996).
We compared the inferences conducted on EF30-627 with the results obtained from data sets of the same size, but simulated under various conditions. First, for data simulated under one single Poisson process and analyzed under CAT-Poisson, one class was recovered with significant probability (P = 0.65). Next, we simulated data under CAT-Poisson, drawing the parameters of the simulation from the posterior predictive distribution (10 replica, seeMaterials and Methods). The posterior mean number of classes found on the simulated data sets (〈K〉 = 23.1 ± 2.7) is slightly lower than the value found on the real data set (〈K〉 = 28.4 ± 3.5), suggesting that, as an estimate of the number of classes, 〈K〉 is biased downward. However, the number of identified stable classes is very similar (KSM = 11.20 ± 1.08 versusKSM = 11), as well as the underlying profiles (not shown). A PCA projection of the classes obtained in one of these simulations (fig. 3B) does not display significant differences with the analysis performed on the real data set (fig. 3A). Thus, in spite of a slight bias, the mixtures inferred by CAT-Poisson appear to be quite robust to posterior predictive resampling.
In contrast, when the same experiment is performed on alignments that have been obtained by simulation under the JTT+F model, a much lower level of heterogeneity is detected (〈K〉 = 13.7 ± 1.6, n = 10 replica). Upon clustering, an average ofKSM = 6.8 ± 0.8 stable classes is obtained. The profiles of these stable classes display significant differences with those observed for real data (fig. 3C andE): for instance, there is only one class containing A, S, and T, in place of the two (A,S) and (S,T) classes found on real sequences, and the (F,Y) class has disappeared altogether.
The previous experiment suggests that the information contained in the JTT matrix is able to account for only part of the substitutional heterogeneity across sites present in real data. To investigate this problem further, we set the relative rates underlying our mixture model equal to those of the JTT matrix. (in the case of mitochondrial proteins, we used themtREV coefficients instead). Our argument is that, if one single empirical matrix (JTT) is sufficient to account for the substitutional heterogeneity of a particular data set, then an inference conducted under the CAT-JTT model on this data set should yield a nonvanishing posterior probability of having one single class.
We performed a CAT-JTT inference both on EF30-627 and on a data set drawn from the JTT+F induced posterior predictive distribution. Whereas the probability of recovering a single class is high in the case of the simulated data (P = 0.6), it is virtually 0 for real data (fig. 4, andtable 2). Surprisingly, the posterior mean number of classes is even greater with CAT-JTT (〈K〉 = 51.6 ± 8.3) than with CAT-Poisson (〈K〉 = 28.4 ± 3.5), although when a cluster-analysis is performed, a lower number of stable classes than in CAT is obtained (KSM = 5). The same analysis was conducted on Ek55-1525, and on the mitochondrial data set Mt45-3596, (in the latter case, usingmtREV as the empirical matrix). In both cases, the posterior probability of the data being described by one single class is virtually zero, and the posterior mean number of classes is also higher under CAT-JTT or CAT-mtREV (table 2), than under CAT-Poisson (table 1).
In a last experiment, the relative rates were considered as unknown parameters (although they were still constrained to be the same for all classes). The results were qualitatively the same as with CAT-JTT or CAT-mtREV, with a posterior mean number of classes of 〈K〉 = 47 ± 7 inferred on EF30-627 (fig. 4). Thus, even under non-uniform relative rates, a significant part of the heterogeneity among sites has to be further accounted for by forcing the mixture to contain more than one class. In other words, models based on one single matrix are far from capturing the substitutional heterogeneity of real data to its full extent.
Bayes factor evaluation is the most common method of model comparison in Bayesian inference (Gelman 1998;Jaynes 2003), and has already been applied in phylogenetics (Suchard, Weiss, and Sinsheimer 2001). Using the numerical method calledthermodynamic integration (Ogata 1989, see Appendix), we computed the Bayes factor between Poisson+F, taken as the reference model, and CAT-Poisson, MAX-Poisson, and a series of empirical matrix models (JTT+F, WAG+F, andmtREV+F;table 3). In the case of EF30-627, we also included the general model GTR in the analysis.
Our results first allow comparison of the one-matrix models among themselves: as expected (Whelan and Goldman 2001), WAG+F performs better (andmtREV+F worse) than JTT+F on cytosolic proteins (EF30-627 amd Ek55-1525). Interestingly, JTT+F and WAG+F are better than GTR on EF30-627, suggesting that the relative rates of known empirical matrices are close to optimal for this data set. Irrespective of their relative performances, however, the one-matrix models are all outperformed by CAT-Poisson. For example, the support in favor of CAT-Poisson with respect to WAG+ F on EF30-627 is of 1,932 − 1,760 = 172 natural units of logarithm, which corresponds to 0.27 natural log units per site. This means that, on average, each column ise0.27 = 1.32 times better explained by CAT-Poisson than by WAG+F. In contrast to CAT-Poisson, MAX-Poisson does not seem to be favored over one-matrix models: for all three data sets, MAX-Poisson was by far the worst-performing model after Poisson+F (table 3).
We have developed a Bayesian model of sequence evolution, CAT, that can account for substitutional heterogeneity across the sites of protein sequences. In contrast to already existing mixture models designed for the same purpose (Goldman, Thorne, and Jones 1996;Thorne, Goldman, and Jones 1996;Goldman, Thorne, and Jones 1998;Koshi and Goldstein 1998;Koshi, Mindell, and Goldstein 1999;Liò and Goldman 1999;Dimmic, Mindell, and Goldstein 2000;Koshi and Goldstein 2001), all of which have a fixed number of classes, the number of substitutional classes in CAT is itself a free parameter. In this way, it will adapt to the substitutional complexity of the underlying data set, and it provides an estimate of this complexity through the posterior mean number of classes.
Inferences using the CAT model on real protein data sets uncover a high level of heterogeneity, much higher than what is assumed by all other mixture models that have been proposed thus far. To give a numerical estimate, under CAT-Poisson, we always found a posterior mean number of classes greater than 20, whereas in all other models, the number of classes is always set to a value lower than 10. Moreover, the difference between the posterior mean number of classes found upon posterior predictive resampling and the estimate obtained on the real data set (23.1 ± 2.7 versus 28.4 ± 3.5) reveals the existence of a bias, suggesting that the actual level of heterogeneity is probably even higher than what we have observed.
When looking at the classes' profiles and occupancy numbers, one sees that there actually exist markedly different kinds of classes. On the one hand, using a simple clustering algorithm, we were able to identify a restricted set of classes that are stable along a given Markov chain, and are represented by many sites across the sequences. The profiles of these classes (fig. 3D) correspond to reasonable biochemical features, although they are more diverse and more refined than the basic biochemical categories generally used. Strikingly, most of them are markedly peaked, giving a significant weight to only two or three amino acids, suggesting the presence of a strong, site-specific, selection pressure at the amino-acid level, which would be one of the dominant forces shaping the substitution process. Another feature is that a given amino acid can belong to several distinct classes, indicating that context-dependent amino-acid exchangeability is an important aspect of protein evolution. Note that this property cannot be accounted for by the standard one-matrix models. Similar observations have been made previously, using a parsimony-based approach (Lopez 1997).
On the other hand, the difference between the posterior mean number of classes and the number of stable classes (28.4 ± 3.5 versus 11 in the case of EF30-627), reveals the existence of a large series of classes which are not easily identifiable. Closer examination indicates that most of these classes have a low occupancy number; some of them appear in a transitory fashion along a given run, and others, although persistent, do not always have the same sites affiliated to them. These classes should be interpreted with caution: they might not correspond to actual substitutional categories, and they might instead be considered as formal mathematical objects, relevant only through their marginal average influence on the substitution processes operating at the corresponding sites.
The fact that some classes are stable, while some other are not, offers an interesting insight into the flexibility of the Bayesian mixture models such as CAT-Poisson: obviously, the set of unstable classes, meaningful only through their average influence on the site-specific substitution processes, is a typically Bayesian feature, and it does not have an equivalent in existing heterogeneous models implemented in the ML framework. Incidentally, this suggests that the level of heterogeneity that these ML models assume should be compared not to the posterior mean number of classes found under CAT-Poisson, but rather, to the number of stable classes. This number provides a much more conservative estimate of the level of heterogeneity; yet, we found it to be consistently higher than 10, ranging from 11 to 22, confirming that the heterogeneity has been underestimated thus far.
Thorne, Goldman, and co-workers propose to associate amino-acid replacement patterns with protein secondary structure and solvent accessibility (Goldman, Thorne, and Jones 1996;Thorne, Goldman, and Jones 1996;Goldman, Thorne, and Jones 1998;Liò and Goldman 1999). In practice, their model implements a mixture of 10 classes, corresponding to five types of secondary structure and two levels of solvent accessibility. Using parametric bootstrap, they show that their mixture model improves significantly on the standard homogeneous model. Their approach is very different from ours, in that we did not pre-specify what could determine the variations of the substitution pattern among sites. The results seem to be accordingly quite distinct: this is particularly obvious in the composition of the 10 substitutional classes estimated by Thorne, Goldman, and co-workers, which are all characterized by very broad distributions on all 20 amino acids, not as peaked as the classes proposed by CAT. To further compare the two approaches, we made a preliminary investigation of the correlation between the secondary structure and the substitution profiles inferred by CAT-Poisson at each site of the elongation factor 2 data set: we identified the sites that were allocated to one of the stable classes at a frequency higher than 80%, and we looked at their distribution across the sequence (see table S2 in the Supplementary Material online). We found that although α-helices are enriched in classes such as (K,R) and (D,E), and β-sheets are enriched in classes like (I,V) and (F,Y), all of the classes but one are represented in the two kinds of secondary structure, as well as in the remaining parts of the sequence. As for solvent accessibility, correlations were uncovered, but also here, most of the classes are represented in both buried and exposed categories. Thus, the heterogeneity uncovered in the present work cannot be explained only in terms of secondary structure and solvent accessibility. More generally, it seems that our approach and that of Thorne, Goldman, and co-workers do not work at the same level: CAT would focus on local, specific, biochemical requirements, which can differ widely even between two adjacent residues, belonging to the same secondary structure, whereas the model of Thorne, Goldman, and co-workers averages over this low-level heterogeneity, to capture more global modulations of the amino-acid replacement pattern, correlated to local secondary structure determinants.
The CAT model is much more closely related to the model proposed by Koshi and Goldstein (Koshi and Goldstein 1998;Koshi, Mindell, and Goldstein 1999;Koshi and Goldstein 2001). In a first version of their model, these authors imposed restriction on the set of possible amino-acid profiles, by defining their model exclusively in terms of hydrophobicity and size. These two physical-chemical properties probably play an important role in the shaping of the classes that we have obtained, but they are not sufficient to account for their diversity. In particular, we have found classes that are more clearly defined by the electric charge (D,E), or by the aromaticity (F,Y) (fig. 3D). This restriction was relieved in subsequent work (Dimmic, Mindell, and Goldstein 2000;Soyer et al. 2002), and the resulting model is more similar to CAT, except for the limitation in the total number of classes, which was set to 5. Another potentially important difference is that Koshi and Goldstein's model assumes that all sites belonging to the same class have also the same rate of substitution, while according to the CAT model, substitutional classes and site-specific overall rates of substitution are a priori independent variables. The fact that all stable classes that we observed on real data contain both fast and slow evolving sites (not shown) tends to give support to our prior choice.
The Bayes factor favors CAT-Poisson over standard one-matrix models, for all of the data sets which we have analyzed. It should be emphasized that an empirical matrix such as JTT has a significant prior advantage over CAT-Poisson, because it incorporates external information bearing on biochemical realism. In contrast, the CAT-Poisson model is completely naive, in the sense that the prior is totally uninformative. Nevertheless, this lack of prior biochemical knowledge does not seem to impair the performance of CAT-Poisson, presumably because this latter model is able to extract a substantial amount of equivalent knowledge from the data set, as confirmed by the biochemical relevance of the classes' profiles. Importantly, JTT+F is rejected even on quite small data sets, like EF2, which implies that the amount of information sufficient for CAT to outperform JTT+F is low. To verify this, we computed the Bayes factor under a smaller data set, made of the first third of EF30-627 (209 sites), and with a subset of five species. In this case, we observed that the Bayes factor was now in favor of JTT+F (lnB = −6.9 natural units), indicating that the amount of information necessary for CAT-Poisson to be fit is not reached on this highly reduced data matrix.
In contrast to CAT-Poisson, MAX-Poisson is rejected when compared to JTT+F. A plausible explanation is that MAX is penalized by its very large number of parameters (19 for each site), which cannot all be correctly determined by the information contained in the data.
It is known that sensitivity to the prior is an important issue when dealing with parameter-rich models (Huelsenbeck et al. 2002;Rannala 2002). In the present work, there are two main directions along which the prior should be tested. First, the prior onK, the number of classes, is currently defined through the prior on α, the mixture concentration parameter. For α, we have only tried a flat prior distribution, although we could test other possibilities. More generally, we could dispense with the Dirichlet process prior, and work on more general mixture models that make it possible to work directly with the prior onK (Green and Richardson 1998). Second, we could also test the base distribution on the π-space. Thus far, we have chosen a flat Dirichlet distribution, but this could be generalized to any Dirichlet distribution by varying the center of the distribution, or by modulating its concentration parameter.
The assumption that the substitution process at each site is Poisson is another potential limitation; in particular, it does not take the codon structure into account. In principle, it is easy to dispense with this simplifying assumption. Our current implementation already allows the use of any predefined set of relative rates. Similarly, one could imagine a version of the CAT model formulated at the codon level, and in which a single set of relative rates of codon substitution would be combined with class-specific amino-acid acceptance profiles. However, the computations under unequal relative rates of exchange between amino acids can be as much as 50 times slower than under a Poisson process, a factor which is even greater in the case of the codon models. Further algorithmic development is therefore needed in order to proceed in this direction.
Inferences about class number and composition conducted on EF30-627 under a fixed tree are very similar to those where the phylogeny is also a free parameter (see fig. S1 in the Supplementary Material online). Likewise, we observed that the number and the profiles of the classes inferred for the mitochondrial data set were not very sensitive to the exact tree used as a constraint (not shown). This suggests that our analysis is robust with respect to the phylogeny, a fact which is not surprising, as similar conclusions have been reached by authors studying the sensitivity of rate across site parameters (Yang 1995), or of tests of model comparison (Posada and Crandall 2001), to the underlying phylogeny.
Conversely, however, the models investigated in this work differ in some of their predictions, in particular, in their estimates of the evolutionary distance between sequences and of the phylogenetic relationships. First, the tree length under CAT-Poisson is 10% to 20% higher than under JTT+F, suggesting that mutational saturation of the sequences could be better accounted for by CAT-Poisson than by JTT+F. This would not be completely surprising (Miyamoto and Fitch 1996). Mutational saturation—i.e., a high number of convergences and reversions throughout the sequences—will be all the more frequent as the substitution process at any given site is, on average, confined to a very restricted set of amino acids. This seems to be exactly the kind of situation inferred by CAT-Poisson. Importantly, these homoplasies will be more easily missed by models that underestimate site-specific restrictions of the set of admissible residues. It is quite plausible that JTT+F leads to such an underestimation, and this would explain the discrepancies observed between JTT+F and CAT-Poisson, concerning the number of substitutions along the tree.
Second, there are a few differences in the phylogenetic reconstruction obtained under CAT-Poisson and under JTT+F (fig. 1). One of the groupings found under CAT-Poisson, the monophyly of chromalveolates (stramenopiles + alveolates), is reasonable, and has been advocated by other studies (Baldauf et al. 2000;Fast et al. 2001;Bapteste et al. 2002), whereas others, like the polyphyly of amoebas and the position of the choanoflagellateMonosiga within the animal clade, are more questionable (Bapteste et al. 2002;Lang et al. 2002). More work is still needed to evaluate the performances of CAT in phylogenetic reconstruction, an issue we are currently investigating. The present results already illustrate that a better statistical fit does not necessarily imply a more reliable tree (Sullivan and Swofford 2001). In any case, CAT might allow better inference of evolutionary distances, and for that reason, it could be used for molecular datings. In addition, it can be a promising tool for analyzing the relationships between site-specific substitution patterns and structure-function determinants.
Note that for β = 0 (β = 1),pβ reduces to the posterior density underM0 (M1). Thus, the set (pβ)0≤β≤1 defines a continuous path in the space of probability distributions, connecting the posterior distributions underM0 andM1.
The following materials relevant to this article are available online:table 1. Settings of the MCMC runs;table 2. Distribution of the sites allocated to stable classes, according to secondary structure and solvent accessibility;fig. 1. CAT-Poisson analysis on the elongation factor data set EF30-627, under constrained (A,C) and unconstrained (B,D) topology;fig. 2. CAT-POISSON analysis on the eukaryotic data set Ek55-1525.fig. 3. CAT-Poisson analysis on the mammal mitochondrial data set Mt45-3596.
Associate Editor
Majority rule consensus phylogenies obtained with the models JTT+F (A) and CAT-Poisson (B). Branch lengths are proportional to the expected number of substitutions per site. Node support values are equal to the posterior probabilities
A. Traceplot showing the evolution ofK, the number of modes, during the elongation of two independent MCMCs (only the first 500 cycles are shown). The initial conditions wereK = 1 (circles) andK =N (squares), withN standing for the number of sites.B. Histograms displaying the frequencies at which each value ofK was observed along the two chains referred to inA, starting fromK = 1 (solid lines), andK =N (dashed lines)
A,B,C. Principal Component analysis of the profiles obtained under the CAT model, usingA, real sequences (EF30-627);B, data simulated under CAT (SimuCAT30-627);C, data simulated under JTT (SimuJTT30-627). To make comparison easier, the profiles inB andC are projected onto the 2-dimensional subspace ofR20 defined by the principal components computed inA. First axis accounts for 17.3%, and second axis for 12.7%, of the total variance inA andB, and 15.3% and 11.3% of the variance inC.D. Profiles of the 11 stable modes obtained with EF30-627. The one-letter amino-acid code is used. Font height is proportional to the frequency of the corresponding amino-acid. Modes are reported on panelsA andB.E. Profiles of the six stable modes obtained with JTTSimu30-627. Modes are reported on panelC
Histograms showing the estimated posterior probability distribution ofK, the number of modes under the CAT-JTT for data simulated under JTT+F (dashed lines), under CAT-JTT for the real data (EF30-627, solid lines), and under CAT-GTR for the real data (dotted lines)
Estimates of Class Number and Tree Length Under CAT-Poisson, MAX-Poisson and JTT+F.
CAT-Poisson | MAX-Poisson | JTT+F | |||||||
---|---|---|---|---|---|---|---|---|---|
Data Set | 〈α〉a | 〈K〉b | KSMc | 〈TL〉CATd | 〈TL〉MAXe | 〈TL〉JTT+Ff | |||
EF30-627 | 6.3 ± 1.6 | 28.4 ± 3.5 | 11 | 8.13 ± 0.21 | 7.06 ± 0.14 | 7.46 ± 0.14 | |||
Ek55-1525 | 4.7 ± 1.3 | 26.4 ± 3.7 | 14 | 5.36 ± 0.10 | 4.82 ± 0.07 | 4.78 ± 0.08 | |||
Mt45-3596 | 5.4 ± 1.1 | 35.3 ± 3.2 | 22 | 7.54 ± 0.10 | 5.90 ± 0.05 | 5.91 ± 0.05 |
CAT-Poisson | MAX-Poisson | JTT+F | |||||||
---|---|---|---|---|---|---|---|---|---|
Data Set | 〈α〉a | 〈K〉b | KSMc | 〈TL〉CATd | 〈TL〉MAXe | 〈TL〉JTT+Ff | |||
EF30-627 | 6.3 ± 1.6 | 28.4 ± 3.5 | 11 | 8.13 ± 0.21 | 7.06 ± 0.14 | 7.46 ± 0.14 | |||
Ek55-1525 | 4.7 ± 1.3 | 26.4 ± 3.7 | 14 | 5.36 ± 0.10 | 4.82 ± 0.07 | 4.78 ± 0.08 | |||
Mt45-3596 | 5.4 ± 1.1 | 35.3 ± 3.2 | 22 | 7.54 ± 0.10 | 5.90 ± 0.05 | 5.91 ± 0.05 |
a Mean posterior value of the Dirichlet Prior concentration parameter.
b Mean posterior number of classes.
c Number of stable classes detected by clustering.
d Mean posterior tree length under CAT.
e Mean posterior tree length under MAX-Poisson.
f Mean posterior tree length under JTT+F.
Estimates of Class Number and Tree Length Under CAT-Poisson, MAX-Poisson and JTT+F.
CAT-Poisson | MAX-Poisson | JTT+F | |||||||
---|---|---|---|---|---|---|---|---|---|
Data Set | 〈α〉a | 〈K〉b | KSMc | 〈TL〉CATd | 〈TL〉MAXe | 〈TL〉JTT+Ff | |||
EF30-627 | 6.3 ± 1.6 | 28.4 ± 3.5 | 11 | 8.13 ± 0.21 | 7.06 ± 0.14 | 7.46 ± 0.14 | |||
Ek55-1525 | 4.7 ± 1.3 | 26.4 ± 3.7 | 14 | 5.36 ± 0.10 | 4.82 ± 0.07 | 4.78 ± 0.08 | |||
Mt45-3596 | 5.4 ± 1.1 | 35.3 ± 3.2 | 22 | 7.54 ± 0.10 | 5.90 ± 0.05 | 5.91 ± 0.05 |
CAT-Poisson | MAX-Poisson | JTT+F | |||||||
---|---|---|---|---|---|---|---|---|---|
Data Set | 〈α〉a | 〈K〉b | KSMc | 〈TL〉CATd | 〈TL〉MAXe | 〈TL〉JTT+Ff | |||
EF30-627 | 6.3 ± 1.6 | 28.4 ± 3.5 | 11 | 8.13 ± 0.21 | 7.06 ± 0.14 | 7.46 ± 0.14 | |||
Ek55-1525 | 4.7 ± 1.3 | 26.4 ± 3.7 | 14 | 5.36 ± 0.10 | 4.82 ± 0.07 | 4.78 ± 0.08 | |||
Mt45-3596 | 5.4 ± 1.1 | 35.3 ± 3.2 | 22 | 7.54 ± 0.10 | 5.90 ± 0.05 | 5.91 ± 0.05 |
a Mean posterior value of the Dirichlet Prior concentration parameter.
b Mean posterior number of classes.
c Number of stable classes detected by clustering.
d Mean posterior tree length under CAT.
e Mean posterior tree length under MAX-Poisson.
f Mean posterior tree length under JTT+F.
Data Set | P(K = 1) | 〈K〉 | KSM |
---|---|---|---|
SimuJTT30-627 | 0.60 | 1.7 ± 1.1 | 1 |
EF30-627 | 0 | 51.6 ± 8.3 | 5 |
SimuJTT55-1525 | 0.64 | 1.8 ± 1.4 | 1 |
Ek55-1525 | 0 | 45.7 ± 6.0 | 14 |
Mt45-3596 | 0 | 70.9 ± 6.6 | 16 |
Data Set | P(K = 1) | 〈K〉 | KSM |
---|---|---|---|
SimuJTT30-627 | 0.60 | 1.7 ± 1.1 | 1 |
EF30-627 | 0 | 51.6 ± 8.3 | 5 |
SimuJTT55-1525 | 0.64 | 1.8 ± 1.4 | 1 |
Ek55-1525 | 0 | 45.7 ± 6.0 | 14 |
Mt45-3596 | 0 | 70.9 ± 6.6 | 16 |
Data Set | P(K = 1) | 〈K〉 | KSM |
---|---|---|---|
SimuJTT30-627 | 0.60 | 1.7 ± 1.1 | 1 |
EF30-627 | 0 | 51.6 ± 8.3 | 5 |
SimuJTT55-1525 | 0.64 | 1.8 ± 1.4 | 1 |
Ek55-1525 | 0 | 45.7 ± 6.0 | 14 |
Mt45-3596 | 0 | 70.9 ± 6.6 | 16 |
Data Set | P(K = 1) | 〈K〉 | KSM |
---|---|---|---|
SimuJTT30-627 | 0.60 | 1.7 ± 1.1 | 1 |
EF30-627 | 0 | 51.6 ± 8.3 | 5 |
SimuJTT55-1525 | 0.64 | 1.8 ± 1.4 | 1 |
Ek55-1525 | 0 | 45.7 ± 6.0 | 14 |
Mt45-3596 | 0 | 70.9 ± 6.6 | 16 |
Model Comparison. Natural Logarithm of the Bayes Factor Between Poisson+F and All Other Models.
Data Set | JTT+F | WAG+F | mtREV+F | GTR | CAT-Poisson | MAX-Poisson |
---|---|---|---|---|---|---|
EF30-627 | 1,631 | 1,760 | 1,374 | 1,550 | 1,932 | 807 |
Ek55-1525 | 3,324 | 3,628 | 2,848 | — | 3,808 | 1,112 |
Mt45-3596 | 10,089 | 8,970 | 10,943 | — | 12,389 | 855 |
Data Set | JTT+F | WAG+F | mtREV+F | GTR | CAT-Poisson | MAX-Poisson |
---|---|---|---|---|---|---|
EF30-627 | 1,631 | 1,760 | 1,374 | 1,550 | 1,932 | 807 |
Ek55-1525 | 3,324 | 3,628 | 2,848 | — | 3,808 | 1,112 |
Mt45-3596 | 10,089 | 8,970 | 10,943 | — | 12,389 | 855 |
Model Comparison. Natural Logarithm of the Bayes Factor Between Poisson+F and All Other Models.
Data Set | JTT+F | WAG+F | mtREV+F | GTR | CAT-Poisson | MAX-Poisson |
---|---|---|---|---|---|---|
EF30-627 | 1,631 | 1,760 | 1,374 | 1,550 | 1,932 | 807 |
Ek55-1525 | 3,324 | 3,628 | 2,848 | — | 3,808 | 1,112 |
Mt45-3596 | 10,089 | 8,970 | 10,943 | — | 12,389 | 855 |
Data Set | JTT+F | WAG+F | mtREV+F | GTR | CAT-Poisson | MAX-Poisson |
---|---|---|---|---|---|---|
EF30-627 | 1,631 | 1,760 | 1,374 | 1,550 | 1,932 | 807 |
Ek55-1525 | 3,324 | 3,628 | 2,848 | — | 3,808 | 1,112 |
Mt45-3596 | 10,089 | 8,970 | 10,943 | — | 12,389 | 855 |
We are grateful to David Bryant, Olivier Gascuel, Franz Lang, Jeffrey Thorne, and two anonymous referees for their useful comments on the manuscript. We wish to thank Sandra Baldauf for making available the alignment of four eukaryotic proteins, and Béatrice Philippe for her help in determining the relations between classes and secondary structure on the elongation factor data set. This work was supported by the Canadian Research Chair, and by a starting fund from the University of Montréal.
Adachi, J., and M. Hasegawa.
Adachi, J., P. J. Wadell, W. Martin, and M. Hasegawa.
Antoniak, C. E.
Baldauf, S. L., A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle.
Bapteste, E., H. Brinkmann, J. A. Lee, D. V. Moore, C. W. Sensen, P. Gordon, L. Durufle, T. Gaasterland, P. Lopez, M. Muller, and H. Philippe.
Broet, P., S. Richardson, and F. Radvanyi.
Bruno, W. J.
Dayhoff, M., R. V. Eck, and C. M. Park.
Dayhoff, M., R. Schwartz, and B. Orcutt.
Dimmic, M. W., D. P. Mindell, and R. A. Goldstein.
Escobar, M., and M. West.
Eskin, E., W. N. Grundy, and Y. Singer.
Fast, N. M., J. C. Kissinger, D. S. Roos, and P. J. Keeling.
Felsenstein, J.
Gelman, A.
Gelman, A., X. L. Meng, and H. Stern.
Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin.
Goldman, N., J. Thorne, and D. Jones.
Goldman, N., J. L. Thorne, and D. T. Jones.
Goldman, N., and S. Whelan.
Green, P. J., and S. Richardson.
Halpern, A. L., and W. J. Bruno.
Huelsenbeck, J. P., B. Larget, R. E. Miller., and F. Ronquist.
Huelsenbeck, J. P., and R. Nielsen.
Huelsenbeck, J. P., and F. Ronquist.
Jaynes, E.
Jeffreys, H.
Jones, D. T., W. R. Taylor, and J. M. Thornton.
Koshi, J. M., and R. A. Goldstein.
Koshi, J. M., and R. A. Goldstein.
Koshi, J. M., D. P. Mindell, and R. A. Goldstein.
Lang, B. F., C. O'Kelly, T. Nerad, M. W. Gray, and G. Burger.
Larget, B., and D. Simon.
Li, S.
Liò, P., and N. Goldman.
Lopez, P.
Miyamoto, M. M., and W. M. Fitch.
Muller, T., R. Spang, and M. Vingron.
Neal, R. M.
Ogata, Y.
Posada, D., and K. Crandall.
Rannala, B.
Rubin, D. B.
Schneider, T. D., and R. M. Stephens.
Soyer, O., M. W. Dimmic, R. R. Neubig, and R. A. Goldstein.
Suchard, M., R. Weiss, and J. Sinsheimer.
Sullivan, J., and D. L. Swofford.
Swofford, D., G. P. Olsen, P. J. Waddell, and D. M. Hillis.
Thorne, J. L., N. Goldman, and D. T. Jones.
Whelan, S., and N. Goldman.
Yang, Z.
Yang, Z.
Yang, Z.
Month: | Total Views: |
---|---|
November 2016 | 1 |
January 2017 | 13 |
February 2017 | 58 |
March 2017 | 67 |
April 2017 | 64 |
May 2017 | 70 |
June 2017 | 54 |
July 2017 | 15 |
August 2017 | 32 |
September 2017 | 63 |
October 2017 | 56 |
November 2017 | 61 |
December 2017 | 218 |
January 2018 | 301 |
February 2018 | 285 |
March 2018 | 378 |
April 2018 | 269 |
May 2018 | 357 |
June 2018 | 311 |
July 2018 | 299 |
August 2018 | 305 |
September 2018 | 319 |
October 2018 | 330 |
November 2018 | 391 |
December 2018 | 267 |
January 2019 | 176 |
February 2019 | 331 |
March 2019 | 366 |
April 2019 | 320 |
May 2019 | 316 |
June 2019 | 263 |
July 2019 | 304 |
August 2019 | 281 |
September 2019 | 250 |
October 2019 | 281 |
November 2019 | 238 |
December 2019 | 239 |
January 2020 | 195 |
February 2020 | 118 |
March 2020 | 108 |
April 2020 | 134 |
May 2020 | 104 |
June 2020 | 102 |
July 2020 | 182 |
August 2020 | 138 |
September 2020 | 138 |
October 2020 | 110 |
November 2020 | 141 |
December 2020 | 105 |
January 2021 | 125 |
February 2021 | 101 |
March 2021 | 165 |
April 2021 | 154 |
May 2021 | 193 |
June 2021 | 155 |
July 2021 | 124 |
August 2021 | 126 |
September 2021 | 142 |
October 2021 | 147 |
November 2021 | 163 |
December 2021 | 147 |
January 2022 | 108 |
February 2022 | 103 |
March 2022 | 131 |
April 2022 | 108 |
May 2022 | 133 |
June 2022 | 138 |
July 2022 | 90 |
August 2022 | 112 |
September 2022 | 110 |
October 2022 | 112 |
November 2022 | 148 |
December 2022 | 139 |
January 2023 | 104 |
February 2023 | 148 |
March 2023 | 169 |
April 2023 | 160 |
May 2023 | 135 |
June 2023 | 93 |
July 2023 | 113 |
August 2023 | 86 |
September 2023 | 95 |
October 2023 | 159 |
November 2023 | 122 |
December 2023 | 139 |
January 2024 | 165 |
February 2024 | 153 |
March 2024 | 213 |
April 2024 | 165 |
May 2024 | 147 |
June 2024 | 92 |
July 2024 | 153 |
August 2024 | 139 |
September 2024 | 117 |
October 2024 | 115 |
November 2024 | 102 |
December 2024 | 101 |
January 2025 | 109 |
February 2025 | 118 |
March 2025 | 97 |
April 2025 | 106 |
Oxford University Press is a department of the University of Oxford. It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide
This PDF is available to Subscribers Only
View Article Abstract & Purchase OptionsFor full access to this pdf, sign in to an existing account, or purchase an annual subscription.