Proceedings of the National Academy of Sciences of the United States of America

Prevalence of positive selection among nearly neutral amino acid replacements in Drosophila

Stanley A Sawyer*, John Parsch, Zhi Zhang, Daniel L Hartl‡,§
*Department of Mathematics, Washington University, St. Louis, MO 63130;
Section of Evolutionary Biology, Department of Biology II, University of Munich, 82152 Munich, Germany; and
Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138
§To whom correspondence should be addressed. E-mail: dhartl@oeb.harvard.edu

Contributed by Daniel L. Hartl, February 21, 2007

#

Author contributions: S.A.S. and J.P. contributed equally to this work; S.A.S., J.P., and D.L.H. designed research; J.P. and Z.Z. performed research; S.A.S. contributed new reagents/analytic tools; S.A.S. and D.L.H. analyzed data; and S.A.S., J.P., and D.L.H. wrote the paper.

Series information

Inaugural Article

Received 2007 Feb 7; Issue date 2007 Apr 17.

© 2007 by The National Academy of Sciences of the USA
PMCID: PMC1871816  PMID: 17409186
See commentary "Profile of Daniel L. Hartl" on page 9111.

Abstract

We have estimated the selective effects of amino acid replacements in natural populations by comparing levels of polymorphism in 91 genes in African populations of Drosophila melanogaster with their divergence from Drosophila simulans. The genes include about equal numbers whose level of expression in adults is greater in males, greater in females, or approximately equal in the sexes. Markov chain Monte Carlo methods were used to sample key parameters in the stationary distribution of polymorphism and divergence in a model in which the selective effect of each nonsynonymous mutation is regarded as a random sample from some underlying normal distribution whose mean may differ from one gene to the next. Our analysis suggests that ≈95% of all nonsynonymous mutations that could contribute to polymorphism or divergence are deleterious, and that the average proportion of deleterious amino acid polymorphisms in samples is ≈70%. On the other hand, ≈95% of fixed differences between species are positively selected, although the scaled selection coefficient (N_e s) is very small. We estimate that ≈46% of amino acid replacements have N_e s < 2, ≈84% have N_e s < 4, and ≈99% have N_e s < 7. Although positive selection among amino acid differences between species seems pervasive, most of the selective effects could be regarded as nearly neutral. There are significant differences in selection between sex-biased and unbiased genes, which relate primarily to the mean of the distributions of mutational effects and the fraction of slightly deleterious and weakly beneficial mutations that are fixed.

Keywords: McDonald–Kreitman test, polymorphism and divergence, protein evolution


Synonymy in the genetic code results in a natural periodicity in which the third nucleotide of many codons is only weakly constrained because any of two or more nucleotides at this position specify the same amino acid in the polypeptide chain. Fourfold degenerate codons allow any nucleotide at the third position, whereas twofold degenerate codons treat either both pyrimidine nucleotides or both purine nucleotides as synonymous. Of the 20 common amino acids, the codons for 12 are twofold degenerate at the third position, 1 is threefold degenerate (isoleucine, which allows U, C, or A at the third position), and 8 are fourfold degenerate. (In this tabulation, leucine, serine, and arginine are each counted twice because each is specified by six codons.) In a typical coding sequence with a GC content of 50% the average codon degeneracy is 3.
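The tabulation of third-position degeneracy can be checked directly against the standard genetic code. The following is a minimal sketch; the function name `third_position_classes` and the compact 64-character encoding of the code table are illustrative choices, not part of the original analysis.

```python
from collections import Counter

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")


def third_position_classes():
    """Count synonymous third-position classes, keyed by class size."""
    sizes = Counter()
    for i in range(4):
        for j in range(4):
            # The four codons that share the same first two nucleotides.
            start = 16 * i + 4 * j
            box = AMINO[start:start + 4]
            for amino_acid, n in Counter(box).items():
                if amino_acid != "*":  # stop codons are not tabulated
                    sizes[n] += 1
    return dict(sizes)


classes = third_position_classes()
print(classes)
```

Classes of size 2, 3, and 4 correspond to the twofold, threefold, and fourfold categories in the text; classes of size 1 are the nondegenerate codons for methionine and tryptophan.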

The high level of synonymy in the genetic code is a boon to population genomics, because the synonymous sites in a coding sequence serve as a sort of internal control for historical and demographic factors affecting a population, relatively free of selective constraint. Because nonsynonymous sites in the same coding sequence share the same history and demography as the synonymous sites, but may be subject to greater selective constraints or even positive selection, comparisons between nonsynonymous sites and synonymous sites can potentially reveal the magnitude and direction of selection pressures operating on the nonsynonymous sites.

An early application of this approach compared the frequency spectrum of polymorphic nonsynonymous sites with that of synonymous sites among sequences encoding 6-phosphogluconate dehydrogenase in a sample of the enteric bacterium Escherichia coli (1). An excess of low-frequency nonsynonymous polymorphisms suggested that most amino acid polymorphisms in this enzyme are very slightly deleterious, with a selection coefficient on the order of 6–26 times the reciprocal of the effective population size. No more than half of all amino acid polymorphisms in the enzyme could be considered as selectively neutral.

An important extension of this approach came from McDonald and Kreitman (2), who compared polymorphisms within species to divergence between species. This approach avoided any need to estimate the allele-frequency spectrum of polymorphisms, while taking advantage of evolutionary changes through time. First applied to the Adh gene encoding alcohol dehydrogenase in three species of the Drosophila melanogaster species subgroup, the approach yielded evidence that a significant proportion of amino acid replacements between species are driven by positive selection. Explicit expressions for the expected values in comparisons of polymorphism and divergence were soon developed based on a sampling theory for the independent infinite-sites model with selection (3). Application of this theory to the Drosophila Adh data again suggested small selection coefficients, on the order of five times the reciprocal of the effective population size, and that the number of amino acids in the enzyme that are susceptible to favorable mutation at any one time ranges from 2 to 23.

One limitation of the McDonald–Kreitman test is that, for the sample sizes typically available, the statistical test for homogeneity in a 2 × 2 table is relatively lacking in power. Another limitation is that such data often include one or more cells whose entry is 0. Thus there has been an effort to examine polymorphism-divergence data across multiple genes to estimate α, defined as the fraction of amino acid fixations driven by positive selection (4,5). Maximum-likelihood approaches yield estimates of α of 25% ± 20% across several species of Drosophila (6,7). This approach assumes that harmful mutations are so drastically deleterious, and beneficial mutations so strongly favored, that their fate is settled so rapidly by selection that they cannot contribute significantly to the level of amino acid polymorphism. Considerable evidence suggests that this assumption is not correct (1,5,8–10). To the extent that mildly deleterious and mildly favorable nonsynonymous substitutions contribute to amino acid polymorphisms, the estimate of α is biased downward. The assumption of fluctuating selection leads to somewhat higher estimates (11).
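To make the quantity α concrete, the sketch below computes the simple single-table point estimate α = 1 − (D_s P_n)/(D_n P_s) that is commonly used with McDonald–Kreitman counts. This is not the maximum-likelihood procedure cited above; it assumes that all polymorphisms are neutral, and the function name `mk_alpha` and the counts are hypothetical.

```python
def mk_alpha(dn, ds, pn, ps):
    """Point estimate of alpha from one McDonald-Kreitman table.

    dn, ds: nonsynonymous and synonymous fixed differences
    pn, ps: nonsynonymous and synonymous polymorphisms
    """
    if dn == 0 or ps == 0:
        # A zero cell in the denominator leaves the estimate undefined.
        raise ValueError("alpha is undefined when dn or ps is zero")
    return 1.0 - (ds * pn) / (dn * ps)


# Hypothetical counts, chosen only for illustration.
alpha = mk_alpha(dn=20, ds=40, pn=10, ps=40)
print(alpha)
```

The explicit error for a zero cell mirrors the limitation noted in the text: tables from single genes often contain zeros, which is one motivation for pooling information across genes.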

Quite another approach to the analysis of polymorphism and divergence makes use of population genetics theory (3) to estimate the values of the parameters governing mutation, selection, and random genetic drift at independent nucleotide sites (12). The intuitive appeal of this approach is that it avoids the artificial dichotomy between what is selectively neutral and what is not, but rather focuses on the actual estimates of the selection coefficients that emerge from the analysis. In this model, the expected value of each cell in a McDonald–Kreitman table can be shown to be an independent Poisson random variable (3), and the parameters governing mutation, selection, random genetic drift, and time since species divergence can be estimated by Markov chain Monte Carlo simulation using a hierarchical Bayesian model (12). In the original formulation, each nonsynonymous substitution likely to contribute to polymorphism or divergence in a particular gene is assumed to have the same selective effect, but these values can differ from one gene to the next. The selective effect is scaled according to the diploid effective population number, which is to say that it is estimated as some multiple of N_e s, where s is the conventional selection coefficient and N_e is the diploid effective population size. This approach is reliable provided that the species being compared are sufficiently closely related that multiple nucleotide substitutions at the same site, or synonymous sites mutating to nonsynonymous sites or vice versa, can be ignored (13).

The assumption that each nonsynonymous substitution in a gene has the same selective effect is obviously artificial, but it served the original purpose of estimating the distribution of the scaled selection coefficient among genes (12). A more sophisticated and biologically realistic model was introduced by Sawyer et al. (9). In this model, the selective effect of each nonsynonymous mutation likely to contribute to polymorphism or divergence is regarded as a random sample from some underlying normal distribution whose mean but not variance may differ from one gene to the next. The spirit of the model is analogous to that of analysis of variance, in which different “treatments” (in this case, genes) have different “effects” (in this case, mean selective effects). The assumption that the underlying distributions are Gaussian is natural in a continuous-time model of selection (14) given the implications of the Central Limit Theorem, but plausible alternatives should also eventually be considered.

Changes in demographics can confound the interpretation of polymorphism and divergence (2,5,15). For example, a rapid dramatic increase in the effective population number will result in the selective elimination of some deleterious nonsynonymous polymorphisms that might previously have remained polymorphic, thereby reducing the nonsynonymous polymorphisms without affecting nonsynonymous divergence. Demographics need to be considered for the sibling species D. melanogaster and Drosophila simulans, which appear to have expanded their range out of Africa ≈10,000–15,000 years ago (16,17), probably with an accompanying population bottleneck followed by an expansion (18).

Hence, for Drosophila the ideal polymorphism data would seem to be that derived from African populations. As it happens, Pröschel et al. (19) have recently acquired such data for a large set of genes. These data afford a valuable opportunity to apply the Sawyer model (9) to estimate values of great interest in population genomics, including the fraction of amino acid polymorphisms that are deleterious, the fraction of amino acid differences between related species that are nearly neutral or positively selected, and the distribution of selection coefficients among new mutations likely to become polymorphic or among mutations that are fixed. In this article we present the results of the analysis. The principal inferences are that the majority of amino acid polymorphisms within Drosophila species are mildly deleterious but that a large fraction of amino acid differences between species are driven by positive selection. However, the magnitude of selection that needs to be postulated to explain the data is extremely small, usually >2 but <10 times the reciprocal of the effective population size. These results are predicated on the assumption that most synonymous polymorphisms and fixed differences are selectively neutral or nearly neutral, and so they pertain only to amino acid substitutions and not to nucleotide substitutions in noncoding DNA.

Results

Data.

The Pröschel et al. (19) data consist of the coding sequences of up to 12 alleles of each of 91 genes in samples of D. melanogaster derived from Lake Kariba, Zimbabwe (20). Among these genes are 33 that are male-biased in their expression, 28 that are female-biased, and 30 that are equally expressed in the sexes (unbiased). Sex-biased expression means at least a 2-fold expression difference between males and females (or between testes and ovaries) as estimated in microarray experiments (21–23), and unbiased expression means a ratio of expression in the range 0.75–1.25 (19). These polymorphism data were compared with divergence from a highly inbred line of D. simulans from Chapel Hill, North Carolina (24) to estimate α, the proportion of amino acid replacements subject to positive selection (6), and the distribution of scaled selection coefficients across genes (12), to test for differential selection between sex-biased and unbiased genes (19). Here, we describe and apply a model that relaxes the assumption that the selection coefficient is identical for all amino acid substitutions in each gene. This model allows us to estimate quantitatively the distribution of selection coefficients within and among loci and the fraction of amino acid replacements between species that are selectively neutral or nearly neutral.

Random-Effects Model of Selection.

For the sake of generality, consider a set of aligned coding sequences without gaps representing m alleles sampled from one species and n alleles sampled from the orthologous gene in a related species. The species are assumed to be sufficiently closely related that multiple substitutions of the same nucleotide are unlikely. We shall disregard all codons that are monomorphic across both samples and classify the others into one of four categories: synonymous divergent (both samples monomorphic but differ in a synonymous codon), synonymous polymorphic (one or both samples polymorphic for a synonymous codon), replacement divergent (both samples monomorphic but differ in a nonsynonymous codon), or replacement polymorphic (one or both samples polymorphic for a nonsynonymous codon). These four counts form a 2 × 2 McDonald–Kreitman table (2) for the alleles of any one locus, and for any set of k loci they form a group of k such tables.
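The four-way classification can be sketched in a few lines. This is a simplified illustration rather than the procedure used for the data: it treats each codon as a unit, labels a codon as a replacement whenever more than one amino acid is encoded across the two samples, and does not decompose codons that carry several changes. The names `classify_codon` and `mk_counts` and the toy alignment are hypothetical.

```python
from collections import Counter

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def classify_codon(sample1, sample2):
    """Return None for a codon monomorphic across both samples,
    otherwise a (kind, status) pair for the McDonald-Kreitman table."""
    s1, s2 = set(sample1), set(sample2)
    if len(s1 | s2) == 1:
        return None
    amino_acids = {CODE[codon] for codon in s1 | s2}
    kind = "synonymous" if len(amino_acids) == 1 else "replacement"
    both_monomorphic = len(s1) == 1 and len(s2) == 1
    status = "divergence" if both_monomorphic else "polymorphism"
    return kind, status


def mk_counts(columns):
    """Tally the four cells over a list of (sample1, sample2) codon columns."""
    counts = Counter()
    for sample1, sample2 in columns:
        label = classify_codon(sample1, sample2)
        if label is not None:
            counts[label] += 1
    return counts


# Toy alignment: three alleles from one species, one from the other.
columns = [
    (["GCT", "GCT", "GCT"], ["GCT"]),  # monomorphic, disregarded
    (["GCT", "GCT", "GCC"], ["GCT"]),  # Ala/Ala within sample 1
    (["AAA", "AAA", "AAA"], ["AGA"]),  # Lys vs Arg between samples
    (["GAT", "GAT", "GAT"], ["GAC"]),  # Asp vs Asp between samples
    (["TTA", "TTT", "TTA"], ["TTA"]),  # Leu/Phe within sample 1
]
table = mk_counts(columns)
print(table)
```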

We assume that all synonymous substitutions are selectively neutral or nearly neutral but that nonsynonymous substitutions are each potentially subject to selection. Our objective is to estimate the distribution of selection coefficients of the nonsynonymous substitutions at each locus. We assume a population of constant and finite size reproducing continuously in time, so that the appropriate measure of relative fitness is the malthusian parameter defined as the natural logarithm of the Darwinian fitness (14).

Suppose that at the ith locus the distribution of selection coefficients is normal with mean γ_i and variance σ_w², where the within-locus variance σ_w² is the same for each locus. Symbolically, we can write the distribution of selection coefficients for new mutations at the ith locus as

γ_i + σ_w N(0, 1)   [1]

where N(0, 1) is a random sample from a standard normal distribution. Across loci, the selection coefficients have a normal distribution with variance σ_b² + σ_w², where σ_b² is the between-locus variance and is equal to the variance among the γ_i values. These assumptions imply that the selection coefficients for two new mutations at the same locus are normally distributed with variance σ_b² + σ_w² and covariance σ_b². The γ values are scaled according to two times the diploid effective population size, hence γ/2 = N_e s, where N_e is the diploid effective population number and s is the conventional selection coefficient.
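The variance and covariance implied by the random-effects structure can be checked by simulation. The sketch below draws a locus mean from N(μ, σ_b) and then two mutational effects at that locus as the locus mean plus σ_w times a standard normal deviate. The parameter values are the point estimates quoted later in the text, used here only as convenient inputs; the function names are illustrative.

```python
import random


def simulate_pairs(mu, sigma_b, sigma_w, n_loci, seed=1):
    """Draw two mutational effects at each of n_loci simulated loci."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_loci):
        gamma_i = rng.gauss(mu, sigma_b)              # locus mean
        y1 = gamma_i + sigma_w * rng.gauss(0.0, 1.0)  # first mutation
        y2 = gamma_i + sigma_w * rng.gauss(0.0, 1.0)  # second mutation
        pairs.append((y1, y2))
    return pairs


def variance_and_covariance(pairs):
    """Sample variance of the first effect and covariance of the pair."""
    n = len(pairs)
    mean1 = sum(a for a, _ in pairs) / n
    mean2 = sum(b for _, b in pairs) / n
    var1 = sum((a - mean1) ** 2 for a, _ in pairs) / (n - 1)
    cov = sum((a - mean1) * (b - mean2) for a, b in pairs) / (n - 1)
    return var1, cov


pairs = simulate_pairs(mu=-5.7, sigma_b=2.1, sigma_w=3.5, n_loci=200_000)
variance, covariance = variance_and_covariance(pairs)
print(variance, 2.1 ** 2 + 3.5 ** 2)  # compare with sigma_b^2 + sigma_w^2
print(covariance, 2.1 ** 2)           # compare with sigma_b^2
```

The shared locus mean is what induces the covariance: two mutations at the same locus differ only through their independent within-locus deviations.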

At equilibrium in a large population, each nonsynonymous substitution that is polymorphic at the ith locus can be described by a pair (y, p) where y is drawn from a normal distribution with mean γ_i (the mean scaled selection coefficient) and p is the proportion of the population that carries the nonancestral nucleotide at the site. The pairs (y, p) are the points of a 2D Poisson random field on (−∞, +∞) × (0, 1). Likewise, if we define T as the time since divergence of the two species and assume that T is large enough for mutation-selection-drift equilibrium to have been attained, the selection coefficients of nonsynonymous fixed differences between the species form a Poisson random field on (−∞, +∞). Similar considerations apply to polymorphic and divergent synonymous sites, except that the selection coefficients are all set to 0.

For nonsynonymous polymorphic sites, the mean density of the Poisson random field is given by

[Equation 2: image not reproduced]

where θ_{r,i} is the rate of mutation to nonsynonymous nucleotides that have a reasonable chance of becoming polymorphic or fixed. The magnitude of θ_{r,i} is scaled by 4N_e, where N_e is the diploid effective population size. We stipulate that the only mutations under consideration are those that have a reasonable chance of becoming polymorphic or fixed because the polymorphism and divergence of samples are uninformative about mutations whose deleterious effects are so severe that they are very unlikely ever to be present in a sample. In Eq. 2, φ(y, γ_i, σ_w) is a normal probability density function for a random variable y with mean γ_i and variance σ_w².

For nonsynonymous fixed differences, the mean density of the Poisson random field is given by

[Equation 3: image not reproduced]

where T is the time in generations since species divergence, scaled by two times the diploid effective population size. Eqs. 2 and 3 are extensions of Wright's formulas (25), for which a different proof was given in Sawyer and Hartl (3) as part of a derivation of formulas for Poisson random fields.

The corresponding mean densities of the Poisson random fields for polymorphism and divergence of synonymous sites are, respectively:

[Equation 4: image not reproduced]
[Equation 5: image not reproduced]

where θ_{s,i} is the rate of mutation to synonymous nucleotides, scaled by four times the diploid effective population size. Because all synonymous mutations are assumed to be selectively neutral and so have a reasonable chance of becoming polymorphic, θ_{s,i} includes all synonymous mutations.

Eqs. 2–5 imply that each of the cells in a McDonald–Kreitman table for one locus has a Poisson-distributed count whose mean is shown in Table 1, where a_1(m) = 1 + 1/2 + 1/3 + … + 1/(m − 1) and

[Equation 6, defining G_0: image not reproduced]
[Equation 7, defining G_1: image not reproduced]
[Equation 8, defining G_2: image not reproduced]

In Table 1, the term involving G_0 counts the number of nonsynonymous substitutions that are already fixed between species, whereas the terms involving G_1 count the number of nonsynonymous substitutions that are fixed differences in the sample but polymorphic in one or both populations. Similarly, the term involving G_2 counts the number of nonsynonymous substitutions that are polymorphic in the sample and in the population.

Table 1.

Expected counts for polymorphism and divergence

 | Divergence | Polymorphism
Synonymous | θ_{s,i}(T + 1/m + 1/n) | θ_{s,i}[a_1(m) + a_1(n)]
Replacement | θ_{r,i}[G_0(γ_i, σ_w)T + G_1(γ_i, σ_w, m) + G_1(γ_i, σ_w, n)] | θ_{r,i}[G_2(γ_i, σ_w, m) + G_2(γ_i, σ_w, n)]
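The synonymous row of Table 1 involves only elementary functions and can be evaluated directly. In the sketch below the values of θ_s and T are hypothetical; the sample sizes m = 12 and n = 1 are chosen to resemble a design with up to 12 alleles from one species and a single line from the other. The replacement row is not computed because it depends on G_0, G_1, and G_2.

```python
def a1(m):
    """a1(m) = 1 + 1/2 + ... + 1/(m - 1); the empty sum a1(1) is 0."""
    return sum(1.0 / j for j in range(1, m))


def expected_synonymous(theta_s, T, m, n):
    """Expected synonymous divergence and polymorphism counts (Table 1)."""
    divergence = theta_s * (T + 1.0 / m + 1.0 / n)
    polymorphism = theta_s * (a1(m) + a1(n))
    return divergence, polymorphism


# theta_s and T are hypothetical values chosen for illustration.
divergence, polymorphism = expected_synonymous(theta_s=5.0, T=10.0, m=12, n=1)
print(divergence, polymorphism)
```

With a single allele in the second sample, a1(1) = 0, so that sample contributes nothing to the expected polymorphism count, whereas its 1/n term contributes fully to apparent divergence.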

For polymorphism-divergence data across a set of k loci, each of the k loci has three locus-specific parameters (θ_{r,i}, θ_{s,i}, and γ_i). The model also has four global parameters (T, σ_w, σ_b, and μ), where μ is the average selection coefficient of new nonsynonymous mutations across loci. These 3k + 4 = 277 parameters in the Pröschel et al. data (19) were estimated by means of sampling from Monte Carlo Markov chains whose stationary distributions simulate those of the mutation-selection-drift process (26,27). Details are described in Methods.

Distribution of Selection Coefficients Among New Mutations.

In the model, values of the scaled selection coefficient at the ith locus are assumed to be drawn from a normal distribution N(γ_i, σ_w) with mean γ_i and standard deviation σ_w. Across loci, the γ_i are distributed as N(μ, σ_b). Estimates of these parameters were obtained from the average across 200,000 samples from each of 10 subchains and are presented here as multiples of the diploid population size N_e.

Fig. 1 shows the distributions of the estimated scaled selection coefficients (γ_i/2 = N_e s) for genes whose expression in mature flies is male-biased (red), female-biased (green), or unbiased (blue). Scaled to the diploid population size, the global parameter estimates are μ = −5.7 ± 15.5 and σ_b = 2.1 ± 2.2, and within each locus the standard deviation is estimated as σ_w = 3.5 ± 5.7. The inferred distributions in Fig. 1 support the commonplace belief that most nonsynonymous mutations are deleterious. The nonsynonymous mutations included in Fig. 1 are only a subset of all nonsynonymous mutations, however. They include only those that could reach a high enough frequency in a population to have a reasonable chance of being included in a relatively small sample. Excluded from Fig. 1 are what must be a very large number of nonsynonymous mutations whose deleterious effects are so severe that there is essentially no chance of their becoming polymorphic.

Fig. 1.


Inferred distribution of scaled selection coefficients N_e s among new nonsynonymous mutations that could plausibly become polymorphic or fixed, where N_e is the diploid effective population number and s is the conventional selection coefficient. N_e s corresponds to the parameter γ/2 in Eq. 1. The mutational distributions exclude all mutations that are lethal or sterile and those with selective effects that are so deleterious as to preclude their becoming polymorphic. The distributions are based on an analysis of 33 genes whose expression is male-biased (red), 28 genes whose expression is female-biased (green), and 30 genes with approximately equal expression in adults of both sexes (blue).

Proportion of Amino Acid Polymorphisms That Are Deleterious.

Fig. 2 shows the estimated mean proportion of nonsynonymous substitutions that are positively selected (beneficial). The proportions differ widely among new mutations (N), polymorphisms present in the samples (S), or fixed differences between the species (F). The error bars denote the 95% confidence interval on the estimate of the mean. The results are shown separately for genes with unbiased adult expression (blue), female-biased expression (green), male-biased expression (red), and all genes combined (gold).

Fig. 2.


Estimated proportion of positively selected nonsynonymous mutations among new mutations (N), sample polymorphisms (S), and fixed differences (F) between D. melanogaster and D. simulans. New mutations include only those that could plausibly become polymorphic or fixed. The error bars are the 95% credible intervals around the means. Blue bars indicate genes expressed approximately equally in adults of both sexes; green bars indicate genes with female-biased expression; red bars indicate genes with male-biased expression; and gold bars indicate all genes combined.

For all 91 genes taken together, the fraction of new nonsynonymous mutations that are deleterious averages 0.94 ± 0.01. The preponderance of deleterious new mutations reflects the estimate of μ = −5.7 ± 15.5 for the average selection coefficient of new mutations across loci.
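As a rough plausibility check, the proportion of new mutations with a negative selection coefficient can be computed by plugging the quoted point estimates into a normal distribution. This back-of-the-envelope sketch is not the calculation behind the reported 0.94 ± 0.01, which is averaged over the sampled parameter values for each gene; it only shows that the point estimates imply a figure of the same order.

```python
from math import sqrt
from statistics import NormalDist

# Point estimates quoted in the text, in units of N_e s.
mu, sigma_b, sigma_w = -5.7, 2.1, 3.5

# Within a single locus whose mean equals mu, the spread is sigma_w alone.
within_locus = NormalDist(mu, sigma_w).cdf(0.0)

# Across loci, the between-locus and within-locus variances add.
across_loci = NormalDist(mu, sqrt(sigma_b ** 2 + sigma_w ** 2)).cdf(0.0)

print(round(within_locus, 3), round(across_loci, 3))
```

Both plug-in values fall between roughly 0.92 and 0.95, which is in line with the reported fraction.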

Our analysis also implies that many of the deleterious nonsynonymous mutations that become polymorphic in the population attain allele frequencies sufficiently high that they account for a significant proportion of the polymorphisms observed in samples. In Fig. 2, among all 91 genes, the expected average proportion of deleterious amino acid polymorphisms in samples is 0.70 ± 0.06. These results again support the widely held belief that most amino acid polymorphisms are deleterious and are maintained in the population by recurrent mutation.

In contrast, while the vast majority of new nonsynonymous mutations and most amino acid polymorphisms are inferred to be deleterious, the model also implies that most amino acid fixations between species are positively selected. In Fig. 2, among all genes taken together, the average proportion of fixed differences that are positively selected is 0.94 ± 0.20.

Weak Positive Selection for Amino Acid Differences Between Species.

Although our analysis implies that most amino acid replacements between D. melanogaster and D. simulans are associated with positive selection, the selection coefficients are very small. The means and standard deviations of the distribution of the scaled selection coefficients of fixed differences for male-biased, female-biased, and unbiased genes are 2.5 ± 0.3, 2.5 ± 0.5, and 2.4 ± 0.4, respectively. These are scaled according to the diploid effective population size, which in the Drosophila species considered here is thought to be on the order of 10^6 (28,29). The unscaled mean selection coefficients among fixed amino acid replacements are therefore on the order of s = 2.5 × 10^−6.

Comparison of Genes with Sex-Biased or Unbiased Expression.

A previous analysis of these data emphasized evidence for apparently greater selection among genes that are sex-biased in their expression (19). Our model provides a somewhat more nuanced breakdown as to the source of the differences. The comparisons are shown in Table 2, which summarizes the mean values for various features of the data and compares the 33 male-biased genes and the 28 female-biased genes with the 30 unbiased genes. Each P value is based on a null model composed of 10,000 random permutations of the data comparing genes with either male-biased or female-biased expression against genes whose expression is unbiased between the sexes.

Table 2.

Comparison of sex-biased and unbiased genes

Feature | Male-biased expression | Female-biased expression | Unbiased expression
Mean γ of estimated mutational distribution | −5.5 (P = 0.02) | −5.4 (P = 0.02) | −6.2
Proportion of new mutations with N_e s > 0 | 0.066 (P = 0.06) | 0.076 (P = 0.01) | 0.048
Proportion of sample polymorphisms with N_e s > 0 | 0.316 (P = 0.097) | 0.341 (P = 0.042) | 0.247
Proportion of fixed differences with N_e s > 0 | 0.952 (P = 0.34) | 0.919 (P = 0.77) | 0.940
Mean N_e s of fixed differences | 2.6 (P = 0.07) | 2.5 (P = 0.22) | 2.4
Mean proportion of fixed mutations with −1 < N_e s < 1* | 0.209 (P = 0.02) | 0.210 (P = 0.03) | 0.239

Each P value is based on the results of 10,000 random permutations comparing either male-biased genes or female-biased genes with unbiased genes (genes whose expression does not differ between the sexes).

*Based on sampling from the stationary Markov chain Monte Carlo distribution.
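A permutation test of the kind used for the P values in Table 2 can be sketched as follows. The test statistic (absolute difference in group means), the two-sided form, and the per-gene values are assumptions made for illustration; the text specifies only that 10,000 random permutations were used.

```python
import random


def permutation_p_value(group_a, group_b, n_perm=10_000, seed=0):
    """Two-sided permutation P value for a difference in group means."""
    rng = random.Random(seed)
    k = len(group_a)
    observed = abs(sum(group_a) / k - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        mean_a = sum(pooled[:k]) / k
        mean_b = sum(pooled[k:]) / (len(pooled) - k)
        if abs(mean_a - mean_b) >= observed:
            hits += 1
    return hits / n_perm


# Hypothetical per-gene values for a biased and an unbiased class.
biased = [10, 11, 12, 13, 14, 15]
unbiased = [0, 1, 2, 3, 4, 5]
p_value = permutation_p_value(biased, unbiased)
print(p_value)
```

Because the null distribution is built by shuffling gene labels, the test makes no distributional assumption about the per-gene estimates themselves.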

Interestingly, the difference between the sex-biased genes and the unbiased genes is not reflected in the proportion of fixed differences that are positively selected (N_e s > 0). The differences in the sex-biased genes are mainly in the mean of the mutational distributions and the smaller fraction of slightly deleterious and weakly beneficial mutations that are fixed. For instance, in comparison with unbiased genes, male-biased genes have a significantly higher mean N_e s of the estimated mutational distribution and a significantly lower proportion of nearly neutral (−1 < N_e s < 1) fixed differences (Table 2). Furthermore, the male-biased genes have a higher overall fraction of positively selected (N_e s > 0) polymorphisms and a greater mean value of N_e s among fixed differences, although in these comparisons the differences are marginally significant. The female-biased genes show similar patterns when compared with the unbiased genes (Table 2).

Deleterious and Nearly Neutral Amino Acid Replacements.

The histograms in Fig. 2 implying a prevalence of positive selection at first seem at odds with the hypothesis that many amino acid replacements fixed between species are nearly neutral (30–32). But the distinction between “near neutrality” and “weak positive selection” is somewhat arbitrary. To approach the issue quantitatively, we estimated the expected proportion of fixed amino acid replacements in which the scaled selection coefficient is smaller than some fixed value of N_e s. For each gene the estimated normal density function of the distribution of scaled selection coefficients among new mutations, weighted by the probability of fixation, was numerically integrated from −∞ to N_e s for a fixed value of N_e s. The results are shown in Fig. 3 for genes whose expression is male-biased (red), female-biased (green), or unbiased (blue). The proportion of fixed differences that are slightly deleterious (N_e s < 0) is by no means negligible. It ranges from 0.02 to 0.11 and across all genes has a mean and 95% confidence interval of 0.05 ± 0.02. Likewise a significant proportion of fixed differences show weak positive selection (0 < N_e s < 1), across all genes averaging 0.17 ± 0.04 (data not shown).

Fig. 3.


Estimated proportion of fixed amino acid replacements between D. melanogaster and D. simulans whose scaled selection coefficient is less than various specified values of N_e s, ordered by rank among all 91 genes. N_e is the diploid effective population number and s is the conventional selection coefficient. Blue dots indicate genes expressed approximately equally in adults of both sexes; green dots indicate genes with female-biased expression; and red dots indicate genes with male-biased expression.

Across all genes, the average proportion of fixed differences that are positively selected (N_e s > 0) is 0.95 ± 0.02. Positive selection is therefore prevalent. On the other hand, the scaled selection coefficients are very small. For the values of N_e s given in Fig. 3, the means and standard deviations of the estimated proportion of fixed differences that have scaled selection coefficients smaller than N_e s are given by 0.22 ± 0.05 (N_e s = 1), 0.46 ± 0.08 (N_e s = 2), 0.68 ± 0.07 (N_e s = 3), and 0.83 ± 0.05 (N_e s = 4). Because the proportion of amino acid replacements that are nearly neutral depends on what value of N_e s is chosen as an upper limit, one could argue that anywhere from 22% to 83% of fixed amino acid replacements are nearly neutral. This issue is examined further in Discussion.
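The weighting-and-integration step behind these proportions can be sketched for a single hypothetical locus. The sketch assumes the standard diffusion result that a new mutation with scaled effect x = N_e s fixes with probability proportional to 4x/(1 − e^(−4x)); this weight is an assumption here, because the equations defining the fixation term are not reproduced in this text. The locus mean and standard deviation are the quoted point estimates μ and σ_w, used as plug-in values, so the output is only indicative of the per-gene averages reported above.

```python
import math


def fixation_weight(x):
    """Relative fixation probability for x = N_e s (equals 1 when x = 0)."""
    if abs(x) < 1e-12:
        return 1.0
    return 4.0 * x / -math.expm1(-4.0 * x)


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fixed_cdf(upper, mean, sd, step=0.001):
    """Proportion of fixed differences with N_e s below `upper`,
    by midpoint-rule integration of density times fixation weight."""
    lo, hi = mean - 10.0 * sd, mean + 12.0 * sd
    n_steps = int(round((hi - lo) / step))
    below = total = 0.0
    for i in range(n_steps):
        x = lo + (i + 0.5) * step
        w = normal_pdf(x, mean, sd) * fixation_weight(x)
        total += w
        if x < upper:
            below += w
    return below / total


mean, sd = -5.7, 3.5  # plug-in values in units of N_e s
proportions = {x: fixed_cdf(x, mean, sd) for x in (0, 1, 2, 3, 4)}
for x, value in proportions.items():
    print(x, round(value, 2))
```

Although the mutational distribution is centered well below zero, the exponential suppression of deleterious fixations moves nearly all of the weight to small positive values of N_e s.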

Define CDF(N_e s) as the cumulative density function of fixed amino acid replacements whose scaled selection coefficient is smaller than N_e s, and α(N_e s) as the proportion of fixed amino acid replacements whose scaled selection coefficient is greater than N_e s. Fig. 4 shows these functions as estimated from the present data. About 50% of all amino acid replacements have N_e s < 2, >80% have N_e s < 4, and 99% have N_e s < 7. Correspondingly, α(2) = 0.54, α(4) = 0.16, and α(7) = 0.01. These estimates contrast with that of α = 0.25 ± 0.20 (6), which assumes three classes of nonsynonymous mutations (deleterious, neutral, and beneficial) with deleterious mutations being so deleterious and beneficial mutations being so beneficial that neither class contributes significantly to polymorphism. Our model takes slightly deleterious and weakly beneficial mutations into account, and as shown in Fig. 2 these classes of mutations do contribute substantially to amino acid polymorphisms. The estimate α = 0.25 corresponds roughly to α(1.15) in Fig. 4, hence it implies a threshold for near-neutral effects of N_e s = 1.15.

Fig. 4.


Inferred cumulative density function (CDF) of the scaled selection coefficients among fixed amino acid replacements in 91 genes between D. melanogaster and D. simulans. N_e is the diploid effective population number, and s is the conventional selection coefficient. CDF(N_e s) is the average proportion of amino acid differences whose scaled selection coefficient is smaller than N_e s, and α(N_e s) is the proportion of amino acid differences whose scaled selection coefficient is greater than N_e s.

Discussion

Our model makes a number of assumptions that should be emphasized. The theory assumes mutation-selection-drift equilibrium, invokes diffusion theory, stipulates independence between nucleotide sites, and posits additivity of fitness effects of mutations at different nucleotide sites. The first assumption could be undermined by demographic factors such as population bottlenecks or expansions, the second could be compromised by very strong positive selection, the third may be challenged for genes in regions of the genome with reduced recombination, and the fourth could be subverted by potential epistatic effects of nonsynonymous mutations in the same gene (9). Additional study is needed to determine how robust the model may be to small departures from the assumptions.

Several features of the results give some reassurance because they support plausible notions and other evidence that most nonsynonymous mutations and many nonsynonymous polymorphisms are deleterious (1,5,8–10). Our analysis implies that some 19 of 20 new amino acid replacements are deleterious with an average fitness reduction on the order of five times the reciprocal of the effective population size. These estimates pertain only to the subset of nonsynonymous mutations whose effects are not so severe as to preclude their becoming polymorphic, but they support other evidence that selection against deleterious mutations plays a key role in shaping patterns of genetic variation in Drosophila (33). Likewise, we estimate that ≈7 of 10 amino acid replacements that are polymorphic in samples are deleterious.

One feature of our results that might occasion some surprise is the high proportion of amino acid fixations between species that show positive selection, ≈95% in our data. This finding seems to reflect what Wallace (34) called the “overwhelming odds against the less fit.” It can be appreciated quantitatively by noting that a new mutation with Nes = 2 is eight times more likely to be fixed than one with Nes = 0 and ≈3,000 times more likely to be fixed than one with Nes = −2. There would be a preponderance of deleterious fixations if beneficial mutations were vanishingly rare. But for mutations with selective effects near neutrality, Fisher (35) argued from analogy that the proportion of beneficial mutations should actually be close to one-half:

“The conformity of these statistical requirements with common experience will be perceived by comparison with the mechanical adaptation of an instrument, such as the microscope, when adjusted for distinct vision. If we imagine a derangement of the system by moving a little each of the lenses, either longitudinally or transversely, or by twisting through an angle, by altering the refractive index and transparency of the different components, or the curvature, or the polish of the interfaces, it is sufficiently obvious that any large derangement will have a very small probability of improving the adjustment, while in the case of alterations much less than the smallest of those intentionally effected by the maker or the operator, the chance of improvement should be almost exactly half.”

This inference also follows from the assumption of a normal distribution of selection coefficients, because adjacent small intervals of the same width on opposite sides of Nes = 0 will have approximately equal areas.
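
The fixation-probability ratios quoted above (eightfold relative to a neutral mutation, and roughly 3,000-fold relative to a mutation with Nes = −2) can be checked numerically. The sketch below assumes the standard diffusion approximation for a new mutation with additive effects, under which the probability of fixation relative to a neutral allele is 4Nes / (1 − exp(−4Nes)). The function name is ours and is not taken from the authors' software.

```python
import math

def relative_fixation(nes):
    """Fixation probability of a new mutation relative to a neutral one.

    Assumes the diffusion approximation with additive selection:
    ratio = 4*Nes / (1 - exp(-4*Nes)), with the neutral limit equal to 1.
    """
    if nes == 0:
        return 1.0
    x = 4.0 * nes
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for small |x|
    return x / -math.expm1(-x)

advantage_vs_neutral = relative_fixation(2.0) / relative_fixation(0.0)
advantage_vs_deleterious = relative_fixation(2.0) / relative_fixation(-2.0)

print(f"Nes = 2 versus Nes = 0:  {advantage_vs_neutral:.2f}")
print(f"Nes = 2 versus Nes = -2: {advantage_vs_deleterious:.0f}")
```

Under this approximation the second ratio is exactly exp(8), which is close to the ≈3,000 cited in the text.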

The results in Fig. 4 might give satisfaction to both selectionists and nearly neutralists. On the one hand, ≈95% of the fixed amino acid replacements are positively selected; on the other hand, most of the selection coefficients are small (average Nes ≈ 2.5). As emphasized by Nei (32), the fate of mutations with such a small selective advantage will be determined in large part by random genetic drift. Nevertheless when a large number of sites are examined (>58,000 nonsynonymous sites in the present case), the statistical signal of weak positive selection is evident. These results suggest that, across the genome as a whole, weak positive selection plays an important role in the evolution of protein sequences.

What fraction of amino acid replacements should be considered as nearly neutral is a matter of definition. Ohta (36) has stressed that the key feature of nearly neutral mutations is that their fate in the population depends on both selection and random genetic drift and has suggested that an absolute value of Nes < 2 would be suitable as a definition. For our data, this threshold implies that ≈46% of fixed amino acid replacements are selectively nearly neutral. One might also regard a mutation as selectively nearly neutral if its probability of fixation were <10 times that of a truly neutral allele; with this definition Nes = 2.5 and the proportion of fixed amino acid replacements that are selectively nearly neutral is 58%. A threshold of Nes = 4 yields a proportion of selectively nearly neutral amino acid fixations of 0.84. Nei (32) has given reasons why a much larger threshold could be defended. If Nes = 7 the proportion of nearly neutral amino acid replacements becomes 0.9878, and for Nes = 10 it becomes 0.9996. Our model also explicitly assumes that all synonymous polymorphisms and replacements are neutral or nearly neutral. Because our model as presently formulated pertains only to coding regions, which are very sparse in complex genomes such as the human genome, the model is uninformative with regard to the selective effects of mutations in introns and other noncoding regions.
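
The fixation-probability definition of near neutrality mentioned above (a fixation probability less than 10 times that of a neutral allele) can be turned into a threshold on Nes by solving for the point at which the relative fixation probability equals 10. The sketch below again assumes the diffusion approximation 4Nes / (1 − exp(−4Nes)) for additive mutations; the function names and the bisection bounds are illustrative choices of ours.

```python
import math

def relative_fixation(nes):
    """Fixation probability relative to a neutral mutation (diffusion
    approximation, additive selection; assumed here for illustration)."""
    if nes == 0:
        return 1.0
    x = 4.0 * nes
    return x / -math.expm1(-x)

def threshold_for_ratio(target, lo=0.0, hi=50.0, tol=1e-10):
    """Find Nes > 0 at which relative_fixation(Nes) == target (target > 1).

    Uses bisection, which is valid because the ratio increases
    monotonically with Nes on the bracketing interval.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if relative_fixation(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

nes_10x = threshold_for_ratio(10.0)
print(f"Nes at which fixation is 10x neutral: {nes_10x:.3f}")
```

The solution lies very close to Nes = 2.5, in agreement with the value given in the text.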

What might be the molecular mechanism behind extremely small selective effects of amino acid replacements? There is no definitive evidence, but DePristo et al. (37) have suggested a model of protein evolution in which many amino acid replacements result in very small differences in protein stability, aggregation, or degradation. Their model is based on the observation that many native proteins have a free energy of folding equivalent to only a few hydrogen bonds. Most amino acid replacements are assumed to be approximately additive with respect to their effects on stability, aggregation, or degradation, and within broad limits are selectively nearly neutral. Outside these limits increased instability results in greater aggregation and degradation and a lower equilibrium concentration of active protein, whereas increased stability results in resistance to degradation and a greater concentration of active protein. The effect of any amino acid replacement therefore depends on its context. What is slightly deleterious in one genetic background may be mildly beneficial in another. However, most amino acid replacements with small effects are expected to be deleterious. It has been noted that the low frequencies of most amino acid polymorphisms in natural populations of E. coli and Salmonella enterica imply that the mutations are slightly deleterious (38), and in the context of the stability-aggregation-degradation model it is of interest that virtually all of these are physically located in regions of high solvent accessibility on the “outside” of the molecule (39).

The mapping of stability and aggregation onto fitness implies that amino acid replacements would show epistasis at the level of fitness even though they may be additive in their contribution to the free energy of folding. The model of selection presented here does not capture these interaction effects on fitness, nor does it capture the potential context dependence of amino acid replacements. Any model that takes such interactions into account might have to be quite protein-specific. Our model is more generic and may instead be thought of as estimating the “effective” selection among nonsynonymous mutations in a set of ideal loci in which all nucleotide sites are independent and all selective effects constant and additive, and whose levels of polymorphism and divergence are similar to those observed among the actual loci.

Methods

In principle, after initialization, each step of the Monte Carlo Markov chain in the (3k + 4)-dimensional parameter space of vectors (γi, θs,j, θr,j, T, σw, μ, σb) could be composed of a series of Metropolis-random-walk (40) or Gibbs-sampler (41) substeps, with each substep updating a single 1D or 2D component of the vector of parameters. The structure of the model is such that θs,j and θr,j have Gibbs-sampler updates based on gamma distributions, and (μ, σb) together can be updated by using a 2D inverse-gamma-normal Gibbs update (27). The other components would be updated by using Metropolis random-walk steps. Updating σw is the most time-consuming step in this algorithm because each update requires the numerical calculation of up to 4k double integrals.
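
The general pattern of alternating a gamma-based Gibbs substep with a Metropolis random-walk substep can be sketched on a toy problem. The example below is not the authors' model: it assumes hypothetical counts drawn from a Poisson distribution with mean theta * exp(g), a gamma prior on theta, and a standard normal prior on g, chosen only because theta then has a conjugate gamma update while g requires a Metropolis step. All names and the data are illustrative.

```python
import math
import random

def run_chain(counts, n_iter=20000, step=0.3, a=1.0, b=1.0, seed=1):
    """Toy sampler: counts[i] ~ Poisson(theta * exp(g)),
    theta ~ Gamma(shape=a, rate=b), g ~ Normal(0, 1)."""
    rng = random.Random(seed)
    n, total = len(counts), sum(counts)
    theta, g = 1.0, 0.0
    accepted = 0
    rates = []

    def log_target_g(value, theta_now):
        # log of p(g | theta, counts), up to an additive constant
        return (total * value - n * theta_now * math.exp(value)
                - 0.5 * value * value)

    for _ in range(n_iter):
        # Gibbs substep: theta | g, counts ~ Gamma(a + total, rate b + n*exp(g))
        rate = b + n * math.exp(g)
        theta = rng.gammavariate(a + total, 1.0 / rate)  # 2nd argument is a scale

        # Metropolis random-walk substep for g (symmetric normal proposal)
        proposal = g + rng.gauss(0.0, step)
        log_ratio = log_target_g(proposal, theta) - log_target_g(g, theta)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            g = proposal
            accepted += 1

        rates.append(theta * math.exp(g))

    return rates, accepted / n_iter

toy_counts = [3, 5, 4, 6, 2, 5, 4, 3]  # hypothetical data
rates, acceptance = run_chain(toy_counts)
kept = rates[2000:]  # discard an initial burn-in
posterior_mean_rate = sum(kept) / len(kept)
print(f"acceptance = {acceptance:.2f}, mean rate = {posterior_mean_rate:.2f}")
```

In this toy model theta and g are strongly correlated in the posterior because only their product is well determined by the data, which is one simple way that component-wise updates can become highly autocorrelated.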

In practice, the method described above took an extremely long time to converge, with some data sets converging to different distributions depending on the initial point. The reason was that updates of (μ, σw) in particular, and to a lesser extent (μ, σw, σb, γi), were highly autocorrelated. To address this, a long run of the process described above was used to estimate a joint covariance matrix for (μ, σw, σb). The Metropolis update for σw was then replaced by a joint Metropolis update of (μ, σw, σb) based on a 3D normal distribution with a larger step. A linear or skew transformation of the γi was made at the same time corresponding to the change in (μ, σb). The resulting (k + 3)-dimensional update is not of Metropolis–Hastings form because it is defined by a singular motion in (k + 3) dimensions, but it does satisfy the detailed balance condition (42,43) and hence preserves the posterior likelihood. The resulting process converged to the same distribution independent of starting position. Trace plots of the hyperparameters (μ, σw, σb) appeared highly random.
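
The idea of using a pilot run to estimate a covariance matrix and then proposing correlated parameters jointly can be illustrated in two dimensions. The sketch below is a simplified analogue, not the authors' update: it uses an ordinary symmetric random-walk Metropolis proposal on an assumed bivariate normal target with correlation 0.95, and it omits the accompanying transformation of the γi. All function names, step sizes, and the target are hypothetical.

```python
import math
import random

RHO = 0.95  # assumed correlation of the toy target

def log_target(x, y):
    # Bivariate normal, unit variances, correlation RHO (up to a constant)
    return -(x * x - 2.0 * RHO * x * y + y * y) / (2.0 * (1.0 - RHO * RHO))

def accept(rng, log_ratio):
    return rng.random() < math.exp(min(0.0, log_ratio))

def pilot_run(rng, n_iter, step):
    """Component-wise random-walk Metropolis; returns the visited states."""
    x, y = 0.0, 0.0
    states = []
    for _ in range(n_iter):
        nx = x + rng.gauss(0.0, step)
        if accept(rng, log_target(nx, y) - log_target(x, y)):
            x = nx
        ny = y + rng.gauss(0.0, step)
        if accept(rng, log_target(x, ny) - log_target(x, y)):
            y = ny
        states.append((x, y))
    return states

def covariance(states):
    """Sample variances and covariance of a list of (x, y) pairs."""
    n = len(states)
    mx = sum(s[0] for s in states) / n
    my = sum(s[1] for s in states) / n
    cxx = sum((s[0] - mx) ** 2 for s in states) / (n - 1)
    cyy = sum((s[1] - my) ** 2 for s in states) / (n - 1)
    cxy = sum((s[0] - mx) * (s[1] - my) for s in states) / (n - 1)
    return cxx, cxy, cyy

def joint_run(rng, start, cov, n_iter, scale):
    """Joint random-walk Metropolis with proposal covariance scale^2 * cov."""
    cxx, cxy, cyy = cov
    # Cholesky factor of the 2x2 covariance matrix
    l11 = math.sqrt(cxx)
    l21 = cxy / l11
    l22 = math.sqrt(cyy - l21 * l21)
    x, y = start
    states, accepted = [], 0
    for _ in range(n_iter):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        nx = x + scale * l11 * z1
        ny = y + scale * (l21 * z1 + l22 * z2)
        if accept(rng, log_target(nx, ny) - log_target(x, y)):
            x, y = nx, ny
            accepted += 1
        states.append((x, y))
    return states, accepted / n_iter

rng = random.Random(7)
pilot = pilot_run(rng, n_iter=5000, step=0.5)
pilot_cov = covariance(pilot)
samples, acceptance = joint_run(rng, pilot[-1], pilot_cov, n_iter=20000, scale=1.7)
sxx, sxy, syy = covariance(samples)
correlation = sxy / math.sqrt(sxx * syy)
print(f"joint acceptance = {acceptance:.2f}, sample correlation = {correlation:.2f}")
```

Because the joint proposal is aligned with the ridge of the target, it can take larger steps than the component-wise updates while still being accepted at a reasonable rate.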

The results are based on 10 consecutive subchains of 200,000 samples each after a burn-in of 1 million iterations. Samples were taken every 10 iterations to reduce autocorrelation, so there was a total of 21,000,000 iterations. Acceptance proportions for the Metropolis random-walk component updates ranged from 0.17 to 0.32. The Gelman et al. (27) statistic for convergence ranged from 1.000 to 1.020.
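
The iteration count follows directly from the run settings, and a convergence statistic of the kind reported can be computed from parallel subchains. The sketch below assumes that the statistic in question is the potential scale reduction factor described by Gelman et al.; the function is a basic version of that formula, and the chains it is applied to are synthetic draws used only to show the calculation, not the authors' output.

```python
import math
import random

BURN_IN = 1_000_000
SUBCHAINS = 10
SAMPLES_PER_SUBCHAIN = 200_000
THINNING = 10  # one sample kept every 10 iterations

total_iterations = BURN_IN + SUBCHAINS * SAMPLES_PER_SUBCHAIN * THINNING
print(f"total iterations: {total_iterations:,}")

def potential_scale_reduction(chains):
    """Basic potential scale reduction factor for equal-length chains
    of a single scalar parameter."""
    m = len(chains)
    n = len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # Between-chain variance B and mean within-chain variance W
    b = n * sum((mu - grand) ** 2 for mu in means) / (m - 1)
    w = sum(sum((v - mu) ** 2 for v in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_plus = (n - 1) / n * w + b / n
    return math.sqrt(var_plus / w)

# Synthetic, independent draws standing in for well-mixed subchains
rng = random.Random(0)
toy_chains = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(4)]
r_hat = potential_scale_reduction(toy_chains)
print(f"potential scale reduction factor: {r_hat:.3f}")
```

Values close to 1, such as the 1.000 to 1.020 range reported above, indicate that between-chain and within-chain variability are in agreement.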

Acknowledgments

We thank Tomoko Ohta and Masatoshi Nei for their careful reading and helpful comments on the manuscript. This work was supported by National Institutes of Health Grants GM68465 and GM61351 (to D.L.H.), National Science Foundation Grant DMS-0107420 (to S.A.S.), and Deutsche Forschungsgemeinschaft Grant PA 903/2 (to J.P.).

Footnotes

This contribution is part of the special series of Inaugural Articles by members of the National Academy of Sciences elected on May 3, 2005.

The authors declare no conflict of interest.

References

  • 1. Sawyer SA, Dykhuizen DE, Hartl DL. Proc Natl Acad Sci USA. 1987;84:6225–6228. doi: 10.1073/pnas.84.17.6225.
  • 2. McDonald JH, Kreitman M. Nature. 1991;351:652–654. doi: 10.1038/351652a0.
  • 3. Sawyer SA, Hartl DL. Genetics. 1992;132:1161–1176. doi: 10.1093/genetics/132.4.1161.
  • 4. Smith NGC, Eyre-Walker A. Nature. 2002;415:1022–1024. doi: 10.1038/4151022a.
  • 5. Fay JC, Wyckoff GJ, Wu CI. Nature. 2002;415:1024–1026. doi: 10.1038/4151024a.
  • 6. Bierne N, Eyre-Walker A. Mol Biol Evol. 2004;21:1350–1360. doi: 10.1093/molbev/msh134.
  • 7. Shapiro JA, Huang W, Zhang C, Hubisz MJ, Lu J, Turissini DA, Fang S, Wang H-Y, Hudson RR, Nielsen R, et al. Proc Natl Acad Sci USA. 2007;104:2271–2276. doi: 10.1073/pnas.0610385104.
  • 8. Fay JC, Wyckoff GJ, Wu C-I. Genetics. 2001;158:1227–1234. doi: 10.1093/genetics/158.3.1227.
  • 9. Sawyer SA, Kulathinal R, Bustamante CD, Hartl DL. J Mol Evol. 2003;57:S154–S164. doi: 10.1007/s00239-003-0022-3.
  • 10. Bustamante CD, Fledel-Alon A, Williamson S, Nielsen R, Hubisz MT, Glanowski S, Tanenbaum DM, White TJ, Sninsky JJ, Hernandez RD, et al. Nature. 2006;437:1153–1157. doi: 10.1038/nature04240.
  • 11. Mustonen V, Lässig M. Proc Natl Acad Sci USA. 2007;104:2277–2282. doi: 10.1073/pnas.0607105104.
  • 12. Bustamante C, Nielsen R, Sawyer SA, Olsen KM, Purugganan MD, Hartl DL. Nature. 2002;416:531–534. doi: 10.1038/416531a.
  • 13. Whittam TS, Nei M. Nature. 1991;354:115–116.
  • 14. Hartl DL, Clark AG. Principles of Population Genetics. Sunderland, MA: Sinauer; 2007.
  • 15. Eyre-Walker A. Genetics. 2002;162:2017–2024. doi: 10.1093/genetics/162.4.2017.
  • 16. David JR, Capy P. Trends Genet. 1988;4:106–111. doi: 10.1016/0168-9525(88)90098-4.
  • 17. Lachaise D, Cariou M, David JR, Lemeunier F, Tsacas L, Ashburner M. In: Evolutionary Biology. Hecht MK, Wallace B, Prance GT, editors. Vol 22. New York: Plenum; 1988. pp. 159–227.
  • 18. Li H, Stephan W. PLoS Genet. 2006;2:e166. doi: 10.1371/journal.pgen.0020166.
  • 19. Pröschel M, Zhang Z, Parsch J. Genetics. 2006;174:893–900. doi: 10.1534/genetics.106.058008.
  • 20. Glinka S, Ometto L, Mousset S, Stephan W, De Lorenzo D. Genetics. 2003;165:1269–1278. doi: 10.1093/genetics/165.3.1269.
  • 21. Parisi M, Nuttall R, Naiman D, Bouffard G, Malley J, Andrews J, Eastman S, Oliver B. Science. 2003;299:697–700. doi: 10.1126/science.1079190.
  • 22. Ranz JM, Castillo-Davis CI, Meiklejohn CD, Hartl DL. Science. 2003;300:1742–1745. doi: 10.1126/science.1085881.
  • 23. Gibson G, Riley-Berger R, Harshman L, Kopp A, Vacha S, Nuzhdin S, Wayne M. Genetics. 2004;167:1791–1799. doi: 10.1534/genetics.104.026583.
  • 24. Meiklejohn CD, Kim Y, Hartl DL, Parsch J. Genetics. 2004;168:265–279. doi: 10.1534/genetics.103.025494.
  • 25. Wright S. Proc Natl Acad Sci USA. 1938;24:253–259. doi: 10.1073/pnas.24.7.253.
  • 26. Gilks R, Richardson S, Spiegelhalter DJ. Markov Chain Monte Carlo in Practice. London: Chapman & Hall; 1996.
  • 27. Gelman A, Carlin JS, Stern HS, Rubin DB. Bayesian Data Analysis. Boca Raton, FL: CRC; 2003.
  • 28. Akashi H. Genetics. 1995;139:1067–1076. doi: 10.1093/genetics/139.2.1067.
  • 29. Akashi H. Genetics. 1996;144:1297–1307. doi: 10.1093/genetics/144.3.1297.
  • 30. Ohta T. Nature. 1973;246:96–98. doi: 10.1038/246096a0.
  • 31. Ohta T. Annu Rev Ecol Syst. 1992;23:263–286.
  • 32. Nei M. Mol Biol Evol. 2005;22:2318–2342. doi: 10.1093/molbev/msi242.
  • 33. Haag-Liautard C, Dorris M, Maside XR, Macaskill S, Halligan DL, Charlesworth B, Keightley PD. Nature. 2007;445:82–85. doi: 10.1038/nature05388.
  • 34. Wallace AR. Nat Sci. 1892;1:749–750.
  • 35. Fisher RA. The Genetical Theory of Natural Selection. Oxford: Oxford Univ Press; 1930.
  • 36. Ohta T. Proc Natl Acad Sci USA. 2002;99:16134–16137. doi: 10.1073/pnas.252626899.
  • 37. DePristo MA, Weinreich DM, Hartl DL. Nat Rev Genet. 2005;6:678–687. doi: 10.1038/nrg1672.
  • 38. Hartl DL, Boyd EF, Bustamante CD, Sawyer SA. In: Genomics and Proteomics. Suhai S, editor. New York: Plenum; 2000. pp. 37–49.
  • 39. Bustamante CD, Townsend JP, Hartl DL. Mol Biol Evol. 2000;17:301–308. doi: 10.1093/oxfordjournals.molbev.a026310.
  • 40. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. J Chem Phys. 1953;21:1087–1091.
  • 41. Geman S, Geman D. IEEE Trans Pattern Anal Machine Intelligence. 1984;6:721–741. doi: 10.1109/tpami.1984.4767596.
  • 42. Chib S, Greenberg E. Am Stat. 1995;49:327–335.
  • 43. Liu JS. Monte Carlo Strategies in Scientific Computing. New York: Springer; 2001.

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences
