Movatterモバイル変換


[0]ホーム

URL:


Skip to main content
NCBI home page
Search in PMCSearch
As a library, NLM provides access to scientific literature. Inclusion in an NLM database does not imply endorsement of, or agreement with, the contents by NLM or the National Institutes of Health.
Learn more:PMC Disclaimer | PMC Copyright Notice
Molecular Biology and Evolution logo

Testing for Ancient Admixture between Closely Related Populations

Eric Y Durand1,*,Nick Patterson2,David Reich2,3,Montgomery Slatkin1
1Department of Integrative Biology, University of California, Berkeley
2Broad Institute of MIT and Harvard
3Department of Genetics, Harvard Medical School
*

Corresponding author: E-mail:eric.durand@berkeley.edu.

Associate editor: Jonathan Pritchard

Issue date 2011 Aug.

© The Author 2011. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
PMCID: PMC3144383  PMID:21325092

Abstract

One enduring question in evolutionary biology is the extent of archaic admixture in the genomes of present-day populations. In this paper, we present a test for ancient admixture that exploits the asymmetry in the frequencies of the two nonconcordant gene trees in a three-population tree. This test was first applied to detect interbreeding between Neandertals and modern humans. We derive the analytic expectation of a test statistic, called theD statistic, which is sensitive to asymmetry under alternative demographic scenarios. We show that theD statistic is insensitive to some demographic assumptions such as ancestral population sizes and requires only the assumption that the ancestral populations were randomly mating. An important aspect ofD statistics is that they can be used to detect archaic admixture even when no archaic sample is available. We explore the effect of sequencing error on the false-positive rate of the test for admixture, and we show how to estimate the proportion of archaic ancestry in the genomes of present-day populations. We also investigate a model of subdivision in ancestral populations that can result inD statistics that indicate recent admixture.

Keywords: admixture, gene genealogies, lineage sorting

Introduction

Detecting ancient admixture and estimating the extent of archaic ancestry in the genomes of present-day populations are of major importance for many aspects of evolutionary biology (Barton 2001). Admixture occurs during secondary contact between previously isolated populations, and it can play an important role in the speciation process (Barton and Hewitt 1985;Barton 2001). Among anthropologists, there is a long-standing debate on the relationships between modern humans and different groups of archaic humans such as Neandertals. A recent analysis of a draft sequence of the Neandertal genome suggested that 1–4% of genomes from people of Eurasian ancestry are derived from Neandertal (Green et al. 2010). However, there is still no consensus on the overall extent of interbreeding between archaic and modern humans (for a review, seeWall and Hammer 2006).

Methods for detecting and estimating archaic admixture often rely on DNA sequence data from extent populations only. One class of such methods relies on the signature that ancient admixture leaves on patterns of linkage disequilibrium (Plagnol and Wall 2006;Wall et al. 2009). Although they have the advantage of not relying on scarce or damaged ancient DNA samples, they make simplifying assumptions about the demographic history of extant populations that are not testable in general (Wall and Hammer 2006). Another class of methods models genetic introgression using spatially explicit simulations. They aim to make predictions about the proportion of introgression in present-day genomes if ancient admixture had occurred. These methods use either forward-in-time computer simulations (Currat and Excoffier 2004) or deterministic modeling (Forhan et al. 2008).

Recently, direct comparisons of DNA sequences between extant and archaic populations have been made possible by the sequencing of ancient DNA samples. In humans, this line of research began with a comparison of mitochondrial DNA between modern humans and Neandertals (Krings et al. 1997;Serre et al. 2004;Green et al. 2008). It was then extended to fragments of Neandertal nuclear DNA (Green et al. 2006;Noonan et al. 2006;Krause et al. 2007;Lalueza-Fox et al. 2007); the full sequence of Neandertals was finally analyzed inGreen et al. (2010). Although data from extinct populations are still limited, direct comparison of ancient and modern sequences has the potential to directly determine the proportion of present-day genomes inherited from archaic populations (Wall 2000).Green et al. (2010) developed a formal test for admixture based on the direct comparison of DNA sequences from Neandertals and modern human populations.

Green et al. (2010) defined a new statistic, called theD statistic, to test for admixture between three closely related populations. The expectation ofD was derived under a simple model of admixture with constant population size and was used to estimate the proportion of Neandertal ancestry in modern humans (Green et al. 2010, Supplementary Online Material 19). However,Green et al. (2010) noted that their test relies on the assumption that the population ancestral to the sampled populations was randomly mating. In this paper, we extend the results ofGreen et al. (2010) by deriving the expectation ofD under a model where the ancestral populations are not randomly mating. We also study the statistical properties ofD. In addition, we explore the effects of sequencing error and ascertainment bias onD, and we show thatD can indicate past admixture even when the admixing population is not sampled.

In a first section, we recall the definition of theD statistic and why it is expected to detect admixture in the case of randomly mating ancestral populations. Then we derive the expectation ofD under a model of admixture with arbitrarily varying ancestral population sizes, which generalizes the model of admixture proposed inGreen et al. (2010). The derivation reveals a strategy to estimate the proportion of archaic introgression in present-day genomes. In a next section, we derive the expectation ofD in a model where the ancestral populations are not randomly mating and show that ancient population structure can confound the test for admixture. Then we determine the effect of sequencing error onD, and we study the power of the test for admixture on synthetic data. WhenD is computed from genotype data, we determine the impact of ascertainment bias on the test for admixture. AlthoughGreen et al. (2010) definedD to compare the Neandertal genome with the sequences of modern humans, we show thatD statistics can be used to test for admixture when no archaic sample is available. Finally, we apply the test to the Neandertal data published inGreen et al. (2010).

Materials and Methods

A Four-Taxon Statistic to Test for Admixture

Assume that we have sequenced one chromosome from two present-day populations and denote by P1 and P2 those populations. Furthermore, assume that we have sampled one chromosome from an archaic population, which we denote as P3, and one chromosome from an outgroup population, denoted O. Suppose that the four sequences were aligned without error. The null hypothesis that we wish to test is a demographic scenario in which P1 and P2 descend from a common ancestral population that diverged from the ancestors of P3 at an earlier time, without any gene flow between P3 and P1 or P2 after they split. The alternative hypothesis is that P3 exchanged genes with P1 or P2 after these two populations diverged.

We first restrict to positions in the genome where we have coverage for P1, P2, P3, and O. We denote the outgroup allele as “A” and restrict our analysis to biallelic sites at which P1 and P2 differ and the alternative allele “B” is seen in P3. At these sites, we have observed two copies of both alleles, making it less likely that the patterns we analyze have arisen because of sequencing error.

For the ordered set {P1, P2, P3, O}, we call the two allelic configurations of interest “ABBA” or “BABA.” The pattern ABBA refers to biallelic sites where P1 has the outgroup allele and P2 and P3 share the derived copy. The pattern BABA corresponds to sites where P1 and P3 share the derived allele and P2 has the outgroup allele.Green et al. (2010) defined a statistic corresponding to the difference in the counts of ABBA and BABA sites across then base pairs for which we have data of all four samples, normalized by the total number of observations. In this statistic,CABBA(i) andCBABA(i) are indicator variables; they can be 0 or 1 depending on whether an ABBA or a BABA pattern is seen at basei.Green et al. (2010) denoted this statistic byD, and we have

graphic file with name molbiolevolmsr048fx1_ht.jpg(1)

We further denote byS(P1, P2, P3, O) the numerator ofD(P1, P2, P3, O).

Under the null hypothesis that P1 and P2 descend from a common ancestral population that diverged at an earlier time from the ancestral population of P3, and if the ancestral population of P1, P2, and P3 was panmictic, then derived alleles in P3 should match derived alleles in P1 and P2 equally often. This is because the patterns ABBA and BABA can only arise from gene trees that are nonconcordant with the population tree of P1, P2, and P3. Under the null hypothesis, the two nonconcordant gene trees should occur with equal frequencies (Tajima 1983;Hudson 1983), andD should equal zero. There are three classes of events that can produce a significant deviation from the null hypothesis. First, P3 exchanged genes with P1 or P2. Then the population ancestral to P1, P2, and P3 may have been structured in such a way that one of the two non-concordant gene trees occurs more often than the other. Alternatively, P1 or P2 could have received genes from an unsampled ghost population that we denote as PG. Note that PG needs to be at least as diverged as P3 from (P1, P2) forD to differ significantly from zero (supplementary fig. S1,Supplementary Material online). Also note that gene flow between P1 and P2, or between P3 and the ancestor of P1 and P2, is not expected to produce a deviation from the null hypothesis.

AlthoughD statistics are primarily designed to be applied on sequence data, they can be readily computed on single nucleotide polymorphism (SNP) data. Assume thatn SNPs were genotyped in populations P1, P2, P3, and O. Denote byp^ij the observed frequency of SNPi in population Pj (j = 4 denotes population O). We have

graphic file with name molbiolevolmsr048fx2_ht.jpg(2)

Thus, computingD on genotype data is straightforward. However, the way SNPs were ascertained can bias estimates of allele frequencies in the different populations (Eberle and Kruglyak 2000;Kuhner et al. 2000;Clark et al. 2005).

A Genealogical Argument for the Expected Frequencies of ABBA and BABA Sites

Assume that a single nucleotide substitution occurred on the gene genealogy representing the ancestry of the four samples (P1, P2, P3, and O). There are three possible topologies of the gene genealogy, assuming that O is the outgroup. Each of these topologies has four branches on which a mutation can create a derived allele in P1, P2, and P3.Figure 1 presents the 12 possible configurations for the four samples P1, P2, P3, and O, assuming that O carries the ancestral allele. The pattern BABA (P1 and P3 have the derived nucleotide) is consistent only with tree IIc. The length of the internal branch represents the time during which a mutation can produce the BABA pattern. Similarly, the pattern ABBA (P2 and P3 have the derived nucleotide) is consistent only with tree IIIc, and the length of that internal branch is the time during which mutation can create ABBA. The probability that a randomly chosen site will have either pattern is the probability of the appropriate topology multiplied by the expected branch length multiplied by the mutation rate,μ. Thus, to derive the expected frequencies of patterns ABBA and BABA, we need to compute the probabilities of the corresponding tree topologies and average lengths of the relevant branches under the demographic scenario relating P1, P2, and P3.

FIG. 1.

FIG. 1.

There are three possible topologies of the gene genealogies, I, II, and III, that can relate samples P1, P2, and P3, and the outgroup O. Assuming that any polymorphisms reflect a single historical mutation, ABBA sites can occur only on a type III gene genealogy due to a mutation that occurs on the branch ancestral to samples P2 and P3 (IIIc), and BABA sites can occur only on a type II gene genealogy due to a mutation on the branch ancestral to samples P1 and P3 (IIc).

In what follows, we derive the expected frequencies of patterns ABBA and BABA under a simple demographic model of ancient admixture. We use this model to show that the test is not confounded by demographic events other than admixture, assuming that the population ancestral to P1, P2, and P3 was panmictic.

Instantaneous Unidirectional Admixture

In this model, we assume that there was a single episode of admixture at timetGF in the past (t = 0 being the present) from population P3 to P2 after the separation of populations P1 and P2 (fig. 2). With probabilityf, the P2 lineage was derived from a lineage from P3. The parameterf represents the fraction of the genomes in P2 that originated from P3. We define the divergence time of P1 and P2 populations astP2 >tGF. We denote bytP3 >tP2 the divergence time of P3 and the population ancestral to P1 and P2, denoted P(12). At generationt in the past, the effective population size of the P3 population isN3(t); the effective size of P(12) isN(12)(t); and the effective size of the population ancestral to P1, P2, and P3, denoted P(123), isN(123)(t). All the populations are assumed to be unstructured (i.e., they are randomly mating). We denote by IUA the instantaneous unidirectional admixture model.

FIG. 2.

FIG. 2.

Model of IUA. P1 and P2 split at timetP2. P3 splits from the ancestral population of P1 and P2, denoted P(12), at timetP3. A single episode of admixture takes place from P3 to P2 at timetGF. We denote byf the admixture proportion, which is the proportion of P3 ancestry in P2 individuals at timetGF. We denote byN3(t),N(12)(t), andN(123)(t) the size of populations P3, P(12), and P(123) at generationt in the past. The sizes of populations P1 and P2 do not influenceD statistics.

Expected Counts of ABBA and BABA under the Model of Instantaneous Admixture

Under the IUA model, there are three classes of events that can produce the patterns ABBA and BABA. The first class corresponds to the case where the P2 lineage did not originate from P3. It occurs with probability 1 −f. In the second class of events, the P2 lineage traces its ancestry in P3 (probabilityf), but the P2 and P3 lineages did not coalesce beforetP3. The third class corresponds to the case where the P2 lineage originated from P3 (probabilityf) and the P2 and P3 lineages coalesced beforetP3. The first two classes produce patterns ABBA and BABA with equal frequencies. This is because, assuming that P(123) was panmictic, the P3 lineage coalesces first with probability 1/3. The third class of event produces only ABBA because the P2 and P3 lineages coalesced first. The probabilities of ABBA and BABA are derived in Appendix 1 for a model with arbitrary varying population size in P3 and P(12) and constant population size in P(123).

Green et al. (2010) studied the simplified case where population size is constant. When we setN3(t) =N(12)(t) =N(123) =N in Appendix 1, we find, in accordance withGreen et al. (2010), that

graphic file with name molbiolevolmsr048fx3_ht.jpg(3)

and

graphic file with name molbiolevolmsr048fx4_ht.jpg(4)

whereμ denotes the per base mutation rate. In this case, the expectation ofD reduces to

graphic file with name molbiolevolmsr048fx5_ht.jpg(5)

Note that the test statistic equals 0 when there is no admixture (f = 0). If there is admixture (f > 0),D tends to one whentP3tGF becomes large. This is because the P2 and P3 lineages have more time to coalesce astP3tGF increases. Finally, in the case of constant ancestral population size,D tends to 0 when the effective population size becomes large. This is explained by the fact that whenN is large, then the probability that the P2 and P3 lineages coalesce in P3 is small.

A Test for Admixture Insensitive to Many Demographic Assumptions

Under the IUA model,equation (A1.5) shows thatE[D(P1, P2, P3, O)] = 0 if and only iff = 0 ortGF =tP3, regardless of the population size fluctuations of P1, P2, P3, P(12), and P(123) (as long as population sizes stay finite) and regardless of the times of population divergences and admixture (as long astGF <tP3). Note that a model withtGF =tP3 is equivalent to a model without admixture (seefig. 2). Thus, under the IUA model assumptions, a significant deviation from 0 ofD(P1, P2, P3, O) indicates that P3 exchanged genes with P1 (D(P1, P2, P3, O) < 0) or P2 (D(P1, P2, P3, O) > 0). Although we modeled admixture with an instantaneous episode of gene flow, this conclusion holds for ongoing migration between P3 and P2 or P1.

Sites involved in the computation ofD are likely to be in linkage disequilibrium. Therefore, a simple binomial test to assess whetherD significantly differs from zero is not appropriate. Instead, the test significance can be assessed using a standard block jackknife procedure (Efron 1981). In this procedure, one removes blocks of adjacent sites one at a time. The size of blocks should be chosen to be larger than the extent of linkage disequilibrium. By computing the variance of theD statistic over the sequencesM times leaving each block of the sequence in turn, and then multiplying byM and taking the square root, we can obtain an approximately normally distributed standard error using the theory of the jackknife (Reich et al. 2009).

Constraining the Admixture Proportion

D statistics depend on the demographic parameters in a complex way. As such, they are not good candidates to estimate demographic parameters. However, Appendix 1 shows that the numerator ofD, denotedS, is a much simpler function of demographic parameters; for a model with arbitrary varying population sizes, we have

graphic file with name molbiolevolmsr048fx6_ht.jpg(6)

wheretP(123) is the expected time of coalescence of two lineages in P(123) andt* is the expected time of coalescence of two lineages in P3, given that they coalesced betweentGF andtP3. The timetP(123) is equal to 2N(123) in Appendix 1 where we assumed constant population size in P(123). Assume now that we have sampled a second lineage in P3. Denote by P3,1 and P3,2 the two lineages sampled in P3. It is straightforward to see thatS(P1, P3,1, P3,2, O) is equal toS(P1, P2, P3, O) withf = 1 andtGF = 0 because populations P2 and P3 are identical with such parameter values. Therefore,

graphic file with name molbiolevolmsr048fx7_ht.jpg(7)

wheret** is the expected time of coalescence of two lineages in P3, given that they coalesced betweent = 0 andtP3. Thus,S(P1, P2, P3, O)/S(P1, P3,1, P3,2, O) is an upper bound forf. In the case of constant ancestral population size, we have

graphic file with name molbiolevolmsr048fx8_ht.jpg(8)

If we assume thattGF is small compared withtP3, then one can estimatef byInline graphic, which is a conservative minimum.

We note here that the strategy to estimate the admixture proportion can be extended to more populations. Assume that a sample from a sister population of P3 is available. Denote by P4 this population and assume that it diverged from P3 before the admixture event (tP4 >tGF, wheretP4 is the time of divergence of P3 and P4) (supplementary fig. S2,Supplementary Material online). ThenInline graphic is an unbiased estimate of the admixture proportion independent of population sizes and divergence times.

Testing for Admixture without Archaic Sample

In practice, it may be very common that no sample from the admixing archaic population is available. On the other hand, it may be fairly easy to sample from other extant populations. Assume that a sample from an outgroup to P1 and P2 is available and denote by P0 this population. Assume that P0 did not exchange genes with P1 or P2. LettP0 be the time of divergence of P0 and the population ancestral to P1 and P2 (seesupplementary fig. S3,Supplementary Material online). We can then deriveD(P2, P1, P0, O) following the same methodology as in Appendix 1. In the case where population size is constant and equal toN in all ancestral populations, we have

graphic file with name molbiolevolmsr048fx11_ht.jpg(9)

It is remarkable thatequation (9) does not depend on the time of admixture,tGF, which is likely to be unknown. It does depend, however, on the time of divergence of the archaic population,tP3. Using two samples from P0 or P1, one can constrain the admixture proportion by using the same strategy as the one using two samples from P3.

Ancestral Subdivision

Here we derive the expected frequencies of patterns ABBA and BABA under a simple model in which the ancestral population of P1, P2, and P3 is not panmictic. We assume that the ancestral populations P(12) and P(123) were structured in two random mating subpopulations (fig. 3). We denote by P(12),1 and P(12),2 the subpopulations in P(12) and by P(123),1 and P(123),2 in P(123). We assume that subpopulations exchange migrants at symmetrical ratem per generation. At timeT in the past, subpopulations merge into one panmictic population. We assume constant population sizeN in all ancestral populations. A similar model was proposed inSlatkin and Pollack (2008).Slatkin and Pollack (2008) showed that, for large values ofT, the nonconcordant gene tree compatible with pattern ABBA occurred with higher frequency than the other nonconcordant gene tree. For small values ofm, its frequency is even higher than the frequency of the concordant gene tree. We use AS to denote the ancestral subdivision model.

FIG. 3.

FIG. 3.

Model of ancient subdivision (AS). P1 and P2 definitively split at timetP2. The ancestral population of P1 and P2 is subdivided in two subpopulations, denoted P(12),1 and P(12),2. The two subpopulations exchange migrants at ratem. P3 splits at timetP3. The subdivision persists in the ancestral population of P(12),1, P(12),2, and P3. We denote by P(123),1 the population ancestral to P(12),1 and by P(123),2 the ancestral population of P(12),2. These two subpopulations exchange migrants at ratem. Subpopulations merge at timeT >tP3. All population sizes are constant and equal toN.

Five-State Markov Chain in P(12)

To analyze the ancestry of the P1 and P2 lineages in P(12), we use a Markov chain method similar to the one developed bySlatkin and Pollack (2008). The states of the Markov chain are as follows: state 1, P1 is in P(12),1 and P2 is in P(12),2 (initial state); state 2, P1 is in P(12),2 and P2 is in P(12),1; state 3, P1 and P2 are both in P(12),1; state 4, P1 and P2 are both in P(12),2, and state 5, P1 and P2 coalesced (absorbing state).

To write the transition probabilities, we assume thatm is small enough that only one lineage can migrate to another subpopulation per generation. Moreover, coalescent and migration events cannot occur at the same generation. Assuming constant population sizeN in P(12),1 and P(12),2, the nonzero off-diagonal elements of the transition matrixP(12) are

graphic file with name molbiolevolmsr048fx12_ht.jpg

At timetP3, the distribution of states isπ(12)(tP3) =π(12)(tP2)(P(12))tP3tP2, whereπ(12)(tP2) = {1,0,0,0,0}.

Thirteen-State Markov Chain in P(123)

To model the ancestry of P1, P2, and P3 in P(123), we use a 13-state Markov chain with states defined by state 1, P3, P2 | P1; state 2, P1, P3 | P2; state 3, P1, P2 | P3; state 4, P1, P2, P3; state 5, (P2, P3) | P1; state 6, (P1, P3) | P2; state 7, (P1, P2) | P3; state 8, (P2, P3), P1; state 9, (P1, P3), P2; state 10, (P1, P2), P3; state 11, ((P2, P3), P1); state 12, ((P1, P3), P2), and state 13, ((P1, P2), P3). We used the symbol “|” to denote that lineages are in different subpopulations and brackets denote coalescence events. For example, state 6 takes into account the case where P1 and P3 lineages have coalesced, and the resulting ancestral lineage is in a different subpopulation than the P2 lineage.

Again we assume thatm is small enough that only one lineage can migrate to another subpopulation per generation and that coalescent and migration events cannot occur at the same generation. Assuming constant population sizeN in P(123),1 and P(123),2, the nonzero off-diagonal elements of the transition matrixP(123) are

graphic file with name molbiolevolmsr048fx13_ht.jpg

At timeT, the distribution of states isπ(123)(tP3) =π(123)(tP3)(P(123))TtP3, whereπ(123)(tP3) = {π1(12),π2(12),π3(12),π4(12),0,0,π5(12),0,0,0,0,0,0}.

Events (coalescence and/or migration) in P(12) only affect the expected value ofD by changing the starting probabilities in P(123). Appendix 2 presents the details of all the possible cases leading to ABBA and BABA patterns at timeT, where all subpopulations merge into a single random mating population.

The frequencies of patterns ABBA and BABA have no simple analytical expressions under the AS model. However, they can be easily computed numerically for any set of parameter values.

Expected Coalescence Time Conditional on Gene Tree Topology

In this section, we derive the expected coalescence time of P2 and P3, given that they coalesced first, under the IUA and the AS models. This is a quantity of interest because it determines the length of the internal branch of tree topology III (fig. 1), and therefore the frequency of pattern ABBA. We use these derivations to illustrate the confounding effect of ancestral subdivision on the test for admixture.

We denote by τ23(IUA) the expected coalescence time of P2 and P3, given that they coalesced first, under the IUA model with constant ancestral population size. It can be decomposed as the sum of two terms, depending on whether the P2 lineage originated from P3 or not:

graphic file with name molbiolevolmsr048fx14_ht.jpg(10)

wheret* is the expected time of coalescence of P2 and P3 given that they coalesce betweentGF andtP3 (Appendix 1,eq. A1.3).

Then we derive the conditional expected time of coalescence of P2 and P3 under the AS model, denotedτ23(AS). We start by conditioning on whether coalescence occurs before timeT:

graphic file with name molbiolevolmsr048fx15_ht.jpg(11)

The first term is equal to the expected time for the Markov chain to first hit state 5, 8, or 11, conditional on this time being lower thanT; we compute it using standard Markov chain theory. It can be shown that this conditional time tends to 4N + 1/(2m) whenT grows to infinity, which is equal to the expected coalescence time of two individuals in different demes in a two-island model (Slatkin 1991). Because the ancestral population is assumed to be panmictic after timeT, the second term ofequation (10) is equal to

graphic file with name molbiolevolmsr048fx16_ht.jpg(12)

Although there is no simple analytical expression forτ23(AS), it can be computed numerically for any set of parameter values.

Effect of Sequencing Error onD Statistics

In this section, we study the effect of sequencing error onD. In what follows, we use capital letters to denote the observed count of a pattern and small letters to denote the true counts in the absence of sequencing error. We assume that sequencing error is uniform along the sequence. We denote bye1,e2,e3, ande4 the probability that a base was incorrectly read in population P1, P2, P3, and O, respectively. Leteij be the probability that a site was read with an error in populations Pi and Pj at the same time (j = 4 denotes population O). We neglect terms of higher order. A sequencing error on O will cause the reads for P1, P2, and P3 to be mislabeled. For example, if O was sequenced with an error at a site where the true pattern is “aaaa”, we would read pattern “BBBA.” Thus, for the two patterns of interest, we have

graphic file with name molbiolevolmsr048fx17_ht.jpg(13)

and

graphic file with name molbiolevolmsr048fx18_ht.jpg(14)

assuming that error rates are small.

False-Positive Rate in the Presence of Sequencing Error

Here we look at the effect of sequencing error on the false-positive rate of the test for admixture. We assume a null model with no admixture and randomly mating ancestral populations. In this model, the true expected counts verifynabba =nbaba andnabaa =nbaaa. The second equality assumes that the mutation rates in P1 and P2 are equal. Under these assumptions, we have

graphic file with name molbiolevolmsr048fx19_ht.jpg(15)

The effect of sequencing error onD(P1, P2, P3, O) is then

graphic file with name molbiolevolmsr048fx20_ht.jpg(16)

If the sequencing error rates are the same for the four samples, then sequencing error will have no influence on the false-positive rate of the test as long as they are small enough that third-order terms can be ignored.

To illustrate the effect of different sequencing error rates on the false-positive rate of the test, note that in the case where the outgroup divergence time,tO, is large compared with the divergence time of P3, the two dominant patterns will be bbba and aaaa. We considered a case where 98% of sites were aaaa, 0.02% abba and baba, and 1.97% bbba. All other configurations represented 0.01% of sites. We took 109 sites in total. This settings roughly correspond to a population tree where P1 and P2 are modern human populations, P3 is a Neandertal population, and Chimpanzee is the outgroup. Indeed,Green et al. (2010) estimated that the time of divergence of Neandertal and modern humans was less than 440,000 years before present, which is small compared with divergence time of modern humans and chimpanzee.

Under the simplifying assumption that sites are independent, true pattern counts abba and baba follow a binomial distribution with parametersp = 0.5 andn =nabba +nbaba. In this case, the standard deviation (SD) ofD is equal toInline graphic.

Synthetic Data Analysis

Power of the Test for Admixture

In order to assess the power of the test for admixture, we simulated data under different demographic scenarios. To simulate counts for ABBA and BABA patterns, we simulated trees using thems software (Hudson 2002). We then placed a mutation on the tree; the branch it fell on was selected with probability proportional to its length. The descendants of the mutated branch then define which pattern was generated. Under this simulation scheme, the number of independentms replicates corresponds to the number of unlinked polymorphic sites in the three populations P1, P2, and P3. For all the demographic scenarios described below, we varied this number between 10,000 and 1,000,000.

We simulated counts for ABBA and BABA patterns under the IUA model. The time of divergence of P1 and P2 was set totP2 = 3,000 generations. P3 diverged from the ancestral population P(123) at timetP3 = 12,000 generations. Admixture occurred at timetGF = 2,500 generations. We considered different cases for the ancestral population sizes: constant sizeN = 10,000 in all populations; a bottleneck in P3 prior to the admixture event; and a bottleneck in P(12). Both bottlenecks involved a 100-fold reduction in population size and lasted for 1,000 generations. They both started at 3,500 generations in the past. We also considered a case where P1 and P2 exchanged migrants at rate 2Nm = 0.5 betweent = 0 andtP2. We varied the admixture proportionf from 0 (no admixture) to 0.1 (10% introgression from P3 into P2).

Because replicates are independent,D is approximately normally distributed, and its SD can be computed from the binomial distribution:Inline graphic Each simulation class was repeated 1,000 times. We computed the power of the test for admixture as the proportions of the 1,000 repetitions were |D| was larger than twice its standard error. For the simulations without admixture, we expect that roughly 95% of the simulatedD statistics will be comprised between plus or minus twice the SD.

Effect of Ascertainment Bias

When computed on genotype data, the way SNPs were ascertained can biasD statistics. Typically, an ascertainment scheme consists of two phases. First, SNPs are discovered from the genetic material of a small group of individuals, called the discovery panel. Then the discovered SNPs are typed in a larger sample (Eberle and Kruglyak 2000). The extent to which the ascertainment scheme will biasD statistics depends on many parameters, such as the genetic composition of the discovery panel and its relationship to the larger samples. Ascertainment bias can be corrected when one assumes that the discovery panel is a random sample from the studied population (Nielsen et al. 2004). However, in the context ofD statistics, the relationship between the discovery sample and the four typed populations P1, P2, P3, and O can be complex.

To illustrate the effect of a simple ascertainment scheme onD statistics, we simulated genotype data under the IUA model using 100,000 independent replicates ofms (Hudson 2002). The time of divergence of P1 and P2 was set totP2 = 3,000 generations. P3 diverged from the ancestral population P(123) at timetP3 = 12,000 generations. Admixture occurred at timetGF = 2,500 generations. All population sizes were set equal toN = 10,000. We simulated 100 individuals in P1 and P2 and 1 individual in P3. To simulate ascertainment bias, we subsampled individuals in P1 or P2 and kept only sites that were variable in those individuals. In total, we explored five subsampling scenarios: 1) 2 individuals from P1; 2) 20 individuals from P1; 3) 2 individuals from P2; 4) 20 individuals from P2; and 5) no subsampling (no ascertainment bias). Each of these five subsampling scenarios was repeated forf = 0 (no admixture) andf = 0.05 (5% introgression from P3 into P2).

Application to Neandertal Data

The Neandertal genome was recently sequenced to ∼1× coverage, producing a total of ∼4 billon base pairs (Green et al. 2010).Green et al. (2010) analyzed data from five present-day human males from the CEPH-Human Genome Diversity Project panel (French, Han, Papuan, San, and Yoruba) that were sequenced to ∼5× coverage. For each pair of modern humans P1 and P2,Green et al. (2010) computedD(P1, P2, Neandertal, Chimpanzee). The chimpanzee individual was represented by the reference chimpanzee genome (PanTro2).

Results

Ancient Subdivision Confounds the Test for Admixture

We computedD(P1, P2, P3, O) under the IUA and the AS models (Appendices 1 and 2) for parameter valuestGF = 2,500,tP2 = 3,000,tP3 = 12,000,T = 25,000 (times in generations), and a constant ancestral population size equal toN = 10,000.Figure 4A displaysD statistics as a function off (IUA) and 2Nm (AS). We also computed the expected time of coalescence of P2 and P3 given that they coalesced first.Figure 4B displaysτ23(IUA) as a function off andτ23(AS) as a function of 2Nm. We used a computer algebra program to compute numerically quantities under the AS model.

FIG. 4.

FIG. 4.

D statistics and expected coalescence time of P2 and P3 under the IUA and AS models. We assumed parameter values oftGF = 2,500,tP2 = 3,000,tP3 = 12,000,T = 25,000 (times in generations), and a constant ancestral population size equal toN = 10,000. (A)D(P1, P2, P3, O) as a function off (IUA) and 2Nm (AS). (B) Expected internal branch length (IBL) for topologies where P2 and P3 coalesced first as a function off (IUA) and 2Nm (AS). The rescaling for 2Nm on (A) and (B) was chosen so that 2Nm andf vary in the same range.

TheD statistic curves intersect, which shows that both the IUA and the AS models can be made to fit data if P3 shares more derived allele with P1 or P2. Therefore, the test for admixture based onD alone will always be confounded by the presence of ancestral subdivision.

More generally,figure 4B shows that the IUA and the AS models predict the same expected coalescence times for some parameter values. Thus, any test for admixture solely based on expected coalescence times between one lineage from different populations will be confounded by nonrandom mating in ancestral populations.

Robustness to Sequencing Error

We assume a background sequencing error of 0.001 in each sample.Figure 5 plotsD againste1e2, keepinge1 +e2 = 0.002 ande3 =e4 = 0.001. If we assume that the sequencing error rate in two populations is the product of error rates in each of the two populations (i.e., errors are independent in the two populations), then a difference in sequencing error ratese1e2 = 0.00011 will cause the test for admixture to produce false-positive results. If we assume that the probability to sequence two populations with an error at the same base is only one order of magnitude lower than the error rate in one population, then differences in error rates of 0.00008 will create false positives. In the worst-case scenario, where sequencing error in two populations is of the same order of magnitude as sequencing error in one populations, thene1e2 = 0.000045 is enough to create false positives.

FIG. 5.

FIG. 5.

D(P1, P2, P3, O) as a function of the sequencing error differencee1e2. The dotted curve assumes that the probability to have an error in two populations is equal to the product of sequencing errors in the two populations (eij =ei ×ej). The dashed curve corresponds toeij = min(ei,ej)/10, and the solid line assumeseij = min(ei,ej). The horizontal dashed line corresponds to twice the standard error ofD, assuming that sites are independent.

Power of the Test for Admixture

Figure 6 reports the power analysis for the simulations of instantaneous admixture with constant population size, a bottleneck in P3, a bottleneck in P(12), and migration between P1 and P2. Adding a bottleneck in P3 increased the power of the test substantially; this is because, if the P2 lineage originated from P3, the bottleneck increases the probability that the P2 and P3 lineages coalesce beforetP3. Adding a bottleneck in P(12) improved the power of the test even more. This is because, in the case where admixture did not happen (eq. A1.1), the probability that the P1 and P2 lineages do not coalesce beforetP3 is decreased by the bottleneck. Therefore, this case contributes less to the denominator ofD. Migration between P1 and P2 decreased the power of the test for admixture.

FIG. 6.

FIG. 6.

Power of the test for admixture. We simulated (A) 10,000 unlinked polymorphic sites and (B) 100,000 unlinked polymorphic sites. Dots: no bottleneck and no migration between P1 and P2. Circles: bottleneck in the ancestral population of P1 and P2. Grey triangles: bottleneck in P3. Squares: ongoing migration between P1 and P2 at rate 2Nm = 0.5. All bottlenecks involved a 100-fold reduction in population size and lasted 1,000 generations. They started at 3,500 generations in the past.

For the simulations without admixture (f = 0),figure 6 reports the false-positive rate of the test for admixture at a 5% level. In the case of constant population size, the false-positive rate was 0.050 for 10,000 sites and 0.046 for 100,000 sites. This is an expected result asD is approximately normally distributed for independent sites. Plus or minus twice the SD represents roughly a 95% confidence interval forD. As predicted, adding bottlenecks in the ancestral populations, or migration between P1 and P2, did not significantly affect the false-positive rate of the test.

Robustness to Ascertainment Bias

For data simulated without admixture (f = 0),D(P1, P2, P3, O) did not differ significantly from 0 regardless of the ascertainment scheme used. More precisely, we obtained the following values forD(P1, P2, P3, O) under each scheme: 1) −7.2 × 10−4 (SD = 5.7 × 10−3); 2) 6.2 × 10−4 (SD = 3.4 × 10−3); 3) −2.5 × 10−5 (SD = 5.5 × 10−3); 4) 3.8 × 10−4 (SD = 3.4 × 10−3); and 5) −7.9 × 10−5 (SD = 3.5 × 10−3).

The IUA model with 5% admixture (f = 0.05) predictsD(P1, P2, P3, O) = 0.053 for the demographic parameters we used (eq. A1.5). The simulated data yielded the following values: 1) 0.021 (SD = 4.8 × 10−3); 2) 0.024 (SD = 3.4 × 10−3); 3) 0.037 (SD = 6.0 × 10−3); 4) 0.048 (SD = 3.5 × 10−3); and 5) 0.053 (SD = 3.5 × 10−3).

Therefore, the simple ascertainment scheme we used here did not affect the false-positive rate of the test for admixture. However, it lowered the value ofD(P1, P2, P3, O) for scenarios with 5% admixture. We checked that this result was consistent for other admixture proportions (not reported). The difference between the true value and the observed one was larger when the discovery sample was taken in P1.

Application to Neandertal

Testing for Admixture between Neandertals and Modern Humans

Table 1 showsD statistics fromGreen et al. (2010).D(P1, P2, P3, O) were significantly positive if P1 is an African individual (San or Yoruba) and P2 is a non-African individual (French, Han, or Papuan). When comparing two African individuals (P1 = San and P2 = Yoruba) or two non-African individuals (P1 and P2 = French, Han, or Papuan),D did not differ significantly from 0. Furthermore,D statistics involving one African and one non-African did not differ significantly from one another. Significance was assessed using a standard block jackknife method (Efron 1981). AD value was considered to be significantly different from 0 when its absolute value was greater than three times its standard error. TheD statistics are compatible with a scenario where Neandertals admixed with modern humans outside of Africa, before Europeans, East Asians, and Melanesians split (Green et al. 2010).

Table 1.

Neandertals Share More Genetic Variants With Non-Africans than With Africans (Green et al. 2010).

P1P2D(P1, P2, N, C), %
African–AfricanSanYoruba−0.1 ± 0.4
African–non-AfricanSanFrench4.2 ± 0.4
SanPapuan3.9 ± 0.5
SanHan5.0 ± 0.5
YorubaFrench4.5 ± 0.4
YorubaPapuan4.4 ± 0.6
YorubaHan5.3 ± 0.5
Non-African–non-AfricanFrenchPapuan0.1 ± 0.5
FrenchHan1.0 ± 0.6
PapuanHan0.7 ± 0.6

We note thatGreen et al. (2010) used about 7,475,200 biallelic sites for theD statistics analysis (Green et al. 2010, SOM 15). Therefore, we expect the power of the test to be extremely high.Green et al. (2010) also computedD statistics using genotype data from the CEPH-Human Genome Diversity Project panel.D statistics were lower when computed on SNPs than when computed on sequencing data, consistent with the predicted bias of a simple ascertainment scheme. However, it is remarkable that although raw values ofD statistics changed, the general conclusion that non-Africans were significantly closer to Neandertals than Africans held.

Estimating the Proportion of Neandertal Ancestry in Non-Africans

Taking advantage of the fact that bones from different Neandertal individuals were available,Green et al. (2010) estimated the proportion of Neandertal ancestry in non-Africans as

graphic file with name molbiolevolmsr048fx23_ht.jpg

where Afr is an African sample (San or Yoruba), OOA is a non-African sample (French, Han, or Papuan), and Nea1 and Nea2 are two Neandertal individuals. TheS statistics were averaged over the different possible combinations of African and non-African samples. Using the standard errors ofS statistics (estimated by block jackknife), this yielded an estimate off^=14%. However,equation (7) shows that this is a conservative minimum.Green et al. (2010) estimated the population divergence time of Neandertals and modern humans to betN ≈ 270,000 − 440,000 years before present. The time of gene flow has to occur after modern humans went out of Africa, an event that took place an estimatedtOOA ≈ 45,000 − 100,000 years before present (Forster and Matsumura 2005). Assuming a constant population size, the bias in the estimator off is most important whentNtOOA is small. In the worst-case scenario, the bias is equal to (270,000 − 100,000)/100,000 = 0.063. Therefore, a corrected estimate off in the case of constant ancestral population size isf* = 1–6%.

The data are also compatible with a scenario of ancestral subdivision. UsingtP3 = 12,000 generations,tP2 = 3,000 generations, and a constant population size ofN = 10,000, migration rates 1.0 < 2Nm < 2.0 and 15,000 <T < 25,000 generations were compatible with the data. However, we note that the ancestral subdivision has to last for thousands of generations in order forD to be on the order of 5% (supplementary fig. S4,Supplementary Material online). Such subdivision would require that local populations and the geographic barriers that separate them persist for very long times, much longer than seems reasonable for highly mobile and adaptable hominins.

Discussion

Asymmetry in three population gene trees is known to occur in the presence of gene flow (Meng and Kubatko 2009) or ancestral subdivision (Slatkin and Pollack 2008). Here we show that asymmetry can be exploited to design a test for admixture between three closely related populations. The test requires only the assumption that the ancestral population is randomly mating; it is robust to other demographic assumptions such as variations in population size. Moreover, the test is robust to sequencing error if one assumes that the error rate is the same in the sequences analyzed. However, if sequencing error rates differ, then sequencing error can create false positives and that is more likely to be true if sequencing errors are not independent. Finally, we illustrated how to constrain the admixture proportion in a simple model of admixture.

Duration and Direction of Gene Flow

The model of admixture as a one-way instantaneous gene flow event is not intended to be realistic. Instead, instantaneous gene flow is the simplest possible admixture model that enables to derive theD statistic expectation and estimate the admixture proportion. In the more realistic case of ongoing migration,D statistics can be computed numerically using the theory of Markov chains, as illustrated in the ancient structure model.

Introgression is often asymmetrical between hybridizing species (Barton and Hewitt 1985;Orive and Barton 2002). When an expanding population colonizes a new habitat and hybridizes with previous residents, breeding events at the front of expansion can result in the substantial introduction of genes in the expanding population (Currat et al. 2008). As a consequence, detectable ancient admixture is more likely to be found in extant populations when introgression with archaic populations occurred during a range expansion, as is thought to be the case for modern humans and Neandertals (Currat et al. 2008;Green et al. 2010).

However, there are also cases known where detectable introgression occurs from the invasive population to the resident species (Hastings et al. 2005). We note that derivingD statistics when gene flow is bidirectional or in the other direction is not more difficult, and the same methodology can be applied. If there was gene flow from the extant population to the archaic one, then anyD statistic involving individuals only from the extant populations would not be affected. This provides a framework to test for the direction of gene flow. Taking advantage that Eurasians are more closely related to some African populations than to others,Green et al. (2010) used a similar argument to rule out substantial gene flow from modern humans to Neandertals.

Effect of Ascertainment Bias

AlthoughD statistics are primarily designed for sequence data, they can be readily applied to genotype data. We showed that the false-positive rate of the test for admixture was not affected by a very simplified ascertainment scheme where we subsampled individuals in either P1 or P2. This is because, in this case, the bias in allele frequencies introduced by the ascertainment compensated itself in patterns ABBA and BABA. However, even this very simple ascertainment scheme lowers the power of the test because the ascertainment biases sampled SNPs toward higher frequencies, increasing the counts of pattern ABBA and BABA to a similar extent, and therefore increasing the denominator ofD.

We caution that we used an oversimplified ascertainment scheme. More complex scenarios can introduce bias in an unforeseen way. For example, if the discovery sample is from neither P1 nor P2, then the bias will depend on the genealogical relationships between the population where SNPs were discovered and P1, P2, and P3. However, we note that the main conclusion regarding the test for admixture ofGreen et al. (2010) held when applied to genotype data. Still, we stress that the test is likely to be more powerful using sequencing data. An investigator wishing to apply the test using genotype data should try to correct allele frequencies for ascertainment bias. Several approaches have been proposed to do so (e.g.,Clark et al. 2005).

Effect of Recurrent Mutation

We assumed that each polymorphism was created by a single mutation. If mutations were recurrent, then different copies of the derived allele would arise independently. Hence, many of the gene genealogies infigure 1 that we ignored in computing the expectation ofD would have to be accounted for. If the outgroup diverged from the lineage leading to the other populations a long time in the past, as it is the case for modern humans, Neandertals, and chimpanzees, recurrent mutations are likely to occur on the long branch leading to the outgroup. As a consequence, genealogies Ib, IId, and IIIb would create the pattern BABA and genealogies Ia, IIa, and IIId would create the pattern ABBA (seefig. 1). If there is no gene flow from P3 to P1 or P2, and if the mutation rate is the same in the three populations, then Ia = Ib, IId = IIId, and IIIb = IIa (where the equal sign denotes that corresponding trees occur with equal frequencies). Therefore, recurrent mutations are not expected to increase the false-positive rate of the test for Neandertal admixture. In the case where P1, P2, P3, and O are more diverged from each other (e.g., modern humans, chimpanzee, gorilla, and orangutan), recurrent mutations are likely to occur on other branches as well. Recurrent mutations on the terminal branches leading to P1 and P2 could potentially increase the false-positive rate of the test for admixture.

In the presence of gene flow, it is difficult to predict whether recurrent mutation increases or decreasesD. The exact result depends on details of the demographic model, but the effect is proportional to the probability that a second mutation occurs and the constant of proportionality is less than one because many of the gene genealogies still have no effect onD. Still, recurrent mutations can affect the power of the test for admixture.

We also note that differing mutation rates in the P1 and P2 populations since their divergence are not expected to biasD. The reason for this is that by restricting to sites where P3 is derived relative to the outgroup O, we are restricting the analysis to mutations that arose prior to divergence of the ancestral populations of P1, P2, and P3.

Effect of Mapping Errors

We assumed that the sequences from P1, P2, P3, and O were aligned to a common sequence without error. Depending on the nature of the sampled populations and the sequencing technology, different alignment strategies may be used.Green et al. (2010) aligned the Neandertal sequence both to the reference human and to the chimpanzee genomes (hg18 and panTro2). Different aligners were used depending on the technology used to sequence the Neandertal and modern humans genomes used in the study. Because of the variety of the alignment strategies, it is difficult to model the effect of mapping error onD statistics in a general framework.

The extent to whichD statistics will be affected depends on which sample is more prone to mapping error. Let us assume that all sequences were aligned to the outgroup O. Mapping errors in the P3 sequence are likely to have a negligible effect on the false-positive rate of the test for admixture. This is because mismapping in P3 will equally affect patterns ABBA and BABA. However, alignment error in P3 can affect the power of the test if it artificially increases the counts of patterns ABBA and BABA. Mapping errors in P1 or P2 can have a strong effect on the false-positive rate of the test. Mismapping in P1 will cause P2 and P3 to match each other artificially too often (therefore increasing the count of the ABBA pattern). Thus, it is crucial to test the effect of mapping error in P1 or P2. One possible approach is to stratify the data into categories based on the read coverage in sequences P1 and P2 and to computeD(P1, P2, P3, O) in each category. If mapping is not an issue, thenD should remain stable in categories where the coverage is high enough.

Effect of Natural Selection

We implicitly assumed selective neutrality in all our analyses. For natural selection to bias the test for admixture, it must increase the number of derived alleles shared by P3 and P1 or P2. The ABBA and BABA patterns inGreen et al. (2010) were distributed genome wide. Therefore, it seems unlikely that the observedD statistics inGreen et al. (2010) were biased by natural selection.

However, when the data consist of a few loci, natural selection could potentially biasD statistics by increasing the number of derived alleles in, say, P2. It would result in P2 matching P3 more often than P1, even if there was no admixture. The exact effect of natural selection onD statistics remains a subject of further investigation.

Distinguishing between Gene Flow and AS

We showed thatD statistics do not allow us to distinguish between the models of admixture and ancestral subdivision because ancestral subdivision and admixture can produce the same expected coalescence time between two individuals from different populations. However, it is known that ancestral population subdivision results in a higher than expected variance in coalescence times (Wall et al. 2009). Therefore, ancestral subdivision is likely to result in more variation in gene tree depth when using several samples from the extant population. This in turn will affect the site frequency spectrum of the extant population (Harpending et al. 1998). Designing a statistic to distinguish between the two scenarios will require using more than one sample per population.

Concluding Remarks

Inference of ancient admixture is usually limited by the lack of archaic DNA sequence data. Most methods so far have focused on detecting the expected signature of ancient admixture in extant genomes. Here we studied the behavior of a test statistic for ancient admixture based on the direct comparison of DNA sequences of three closely related populations. Our derivation of this test statistic under alternative demographic models showed how to robustly estimate the archaic contribution to extant populations. We demonstrated that our approach can also be applied when no archaic population has been sampled. This paper illustrates the advantage of using one sample per population, which greatly simplifies inference by removing the influence of many demographic assumptions that are not testable in most situations.

Supplementary Material

Supplementary figures S1S4 are available atMolecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

Supplementary Data

Acknowledgments

We thank Anna-Sapfo Malaspinas for helpful discussions that helped improve the manuscript. E.Y.D and M.S were supported by a US National Institute of Health grant (R01-GM40282).

Appendix A. Derivation ofD under the Model of Gene Flow

Here we present the detailed derivation ofD(P1, P2, P3, O) under the IUA model. For simplicity of notations, we assume that the population size in P(123) is constant and equal toN(123). However, the derivations can be easily extended for arbitrary varying population size in P(123). There are three classes of event that can produce patterns ABBA and BABA.

  1. The P2lineage traces its ancestry through the P(12)side of the phylogeny (probability 1−f), and between tP2 and tP3,the P1 and P2 lineages do not coalesceInline graphic
    graphic file with name molbiolevolmsr048fx25_ht.jpg(A1.1)
  2. The P2 lineage traces its ancestry through the P3 side of the phylogeny (probability f), and between tGF and tP3, the two lineages do not coalesceInline graphic
    graphic file with name molbiolevolmsr048fx27_ht.jpg(A1.2)
  3. The P2 lineage traces its ancestry through the P3 side of the phylogeny (probability f), and between tGF and tP3, the two lineages coalesce. This history creates gene genealogies only of type III and results only in ABBA sites (never BABA sites in the absence of recurrent mutation in the same genealogy). The probability that there is a coalescence beforetP3 isInline graphic. Once the coalescence occurs, the ancestral lineage cannot coalesce with P1 beforetP3. AftertP3, the average coalesce time is 2N(123). Therefore, the expected length of the internal branch istP3+2N(123)(tGF+t), wheret is the expected coalescence time between P2 and P3, given that the coalescence occurs beforetP3. A short analysis shows that
    graphic file with name molbiolevolmsr048fx29_ht.jpg(A1.3)
    Thus,
    graphic file with name molbiolevolmsr048fx30_ht.jpg(A1.4)
    The overall probability of ABBA and BABA is obtained by adding, and the mutation rate parameterμ cancels. Therefore, Appendix B

Here we present the detailed derivation ofD(P1, P2, P3, O) under the AS model. At timeT, there are six possibilities depending on the states of the Markov chain running in P(123).

Appendix B

  1. No coalescence event occurred

This case corresponds to states 1, 2, 3, and 4 of the Markov chain in P(123). It contributes equally to ABBA and BABA:

graphic file with name molbiolevolmsr048fx31_ht.jpg(A2.1)
  • 2. P1 and P2 lineages coalesced

This case does not contribute to ABBA and BABA.

  • 3. P2 and P3 lineages coalesced, but not the P1 lineage
    graphic file with name molbiolevolmsr048fx32_ht.jpg(A1.5)

This case corresponds to states 5 and 8 of the Markov chain in P(123). It does not contribute to BABA. Its contribution to ABBA is

graphic file with name molbiolevolmsr048fx33_ht.jpg(A2.2)

wheret23* is the expected time of coalescence of the P2 and P3 lineages given that coalescence occurs betweentP3 andT.

  • 4. P3 and P1 lineages coalesced, but not the P2 lineage

This case corresponds to states 6 and 9 of the Markov chain in P(123). It does not contribute to ABBA. Its contribution to BABA is, similarly to case (3),

graphic file with name molbiolevolmsr048fx34_ht.jpg(A2.3)
  • 5. P3 and P2 lineages coalesced first and then with the P1 lineage.

This case corresponds to state 11 of the Markov chain in P(123). It does not contribute to BABA. Its contribution to ABBA is

graphic file with name molbiolevolmsr048fx35_ht.jpg(A2.4)

wheret1** is the expected time of coalescence of the P1 lineage, given that coalescence occurs betweent23* andT.

  • 6. P3 and P1 lineages coalesced first and then with the P2 lineage.

This case corresponds to state 12 of the Markov chain in P(123). It does not contribute to ABBA. Similarly to case (5), its contribution to BABA is

graphic file with name molbiolevolmsr048fx36_ht.jpg(A2.5)

wheret2** is the expected time of coalescence of the P2 lineage, given that coalescence occurs betweent13* andT.

The conditional expected coalescence times,t23*,t13*,t1**, andt2** are computed using standard Markov chain theory as the expected times of first hit of corresponding states given that the hit occurred before timeT. We have

graphic file with name molbiolevolmsr048fx37_ht.jpg(A2.6)

The other conditional expected coalescence times are computed similarly. There is no simple analytic expression forD.

References

  1. Barton N. The role of hybridization in evolution. Mol Ecol. 2001;10:551–568. doi: 10.1046/j.1365-294x.2001.01216.x. [DOI] [PubMed] [Google Scholar]
  2. Barton N, Hewitt G. Analysis of hybrid zones. Annu Rev Ecol Syst. 1985;16:113–148. [Google Scholar]
  3. Clark AG, Hubisz MJ, Bustamante CD, Williamson SH, Nielsen R. Ascertainment bias in studies of human genome-wide polymorphism. Genome Res. 2005;15(11):1496–1502. doi: 10.1101/gr.4107905. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Currat M, Excoffier L. Modern humans did not admix with Neanderthals during their range expansion into Europe. PLoS Biol. 2004;2(12):e421. doi: 10.1371/journal.pbio.0020421. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Currat M, Ruedi M, Petit RJ, Excoffier L. The hidden side of invasions: massive introgression by local genes. Evolution. 2008;62(8):1908–1920. doi: 10.1111/j.1558-5646.2008.00413.x. [DOI] [PubMed] [Google Scholar]
  6. Eberle M, Kruglyak L. An analysis of strategies for discovery of single-nucleotide polymorphisms. Genet Epidemiol. 2000;19(Suppl 1):S29–S35. doi: 10.1002/1098-2272(2000)19:1+<::AID-GEPI5>3.0.CO;2-P. [DOI] [PubMed] [Google Scholar]
  7. Efron B. Nonparametric estimates of standard error: the jackknife, the bootstrap and other methods. Biometrika. 1981;68(3):589–599. [Google Scholar]
  8. Forhan G, Martiel J, Blum M. A deterministic model of admixture and genetic introgression: the case of Neanderthal and Cro-Magnon. Math Biosci. 2008;216:71–76. doi: 10.1016/j.mbs.2008.08.004. [DOI] [PubMed] [Google Scholar]
  9. Forster P, Matsumura S. Evolution—did early humans go north or south? Science. 2005;308(5724):965–966. doi: 10.1126/science.1113261. [DOI] [PubMed] [Google Scholar]
  10. Green RE, Krause J, Briggs AW, et al. A draft sequence of the Neandertal genome. Science. (56 co-authors). 2010;328(5979):710–722. doi: 10.1126/science.1188021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Green RE, Krause J, Ptak SE, et al. Analysis of one million base pairs of Neanderthal DNA. Nature. (11 co-authors). 2006;444(7117):330–336. doi: 10.1038/nature05336. [DOI] [PubMed] [Google Scholar]
  12. Green RE, Malaspinas AS, Krause J, et al. A complete Neandertal mitochondrial genome sequence determined by high-throughput sequencing. Cell. (25 co-authors). 2008;134(3):416–426. doi: 10.1016/j.cell.2008.06.021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Harpending HC, Batzer MA, Gurven M, Jorde LB, Rogers AR, Sherry ST. Genetic traces of ancient demography. Proc Natl Acad Sci U S A. 1998;95(4):1961–1967. doi: 10.1073/pnas.95.4.1961. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Hastings A, Cuddington K, Davies K. The spatial spread of invasions: new developments in theory and evidence. Ecol Lett. 2005;8(1):91–101. [Google Scholar]
  15. Hudson RR. Testing the constant-rate neutral allele model with protein-sequence data. Evolution. 1983;37(1):203–217. doi: 10.1111/j.1558-5646.1983.tb05528.x. [DOI] [PubMed] [Google Scholar]
  16. Hudson RR. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002;18(2):337–338. doi: 10.1093/bioinformatics/18.2.337. [DOI] [PubMed] [Google Scholar]
  17. Krause J, Lalueza-Fox C, Orlando L, et al. The derived FOXP2 variant of modern humans was shared with Neandertals. Curr Biol. (13 co-authors). 2007;17(21):1908–1912. doi: 10.1016/j.cub.2007.10.008. [DOI] [PubMed] [Google Scholar]
  18. Krings M, Stone A, Schmitz RW, Krainitzki H, Stoneking M, Pääbo S. Neandertal DNA sequences and the origin of modern humans. Cell. 1997;90(1):19–30. doi: 10.1016/s0092-8674(00)80310-4. [DOI] [PubMed] [Google Scholar]
  19. Kuhner MK, Beerli P, Yamato J, Felsenstein J. Usefulness of single nucleotide polymorphism data for estimating population parameters. Genetics. 2000;156(1):439–447. doi: 10.1093/genetics/156.1.439. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Lalueza-Fox C, Rompler H, Caramelli D, et al. A melanocortin 1 receptor allele suggests varying pigmentation among Neanderthals. Science. (17 co-authors). 2007;318(5855):1453–1455. doi: 10.1126/science.1147417. [DOI] [PubMed] [Google Scholar]
  21. Meng C, Kubatko LS. Detecting hybrid speciation in the presence of incomplete lineage sorting using gene tree incongruence: a model. Theor Popul Biol. 2009;75(1):35–45. doi: 10.1016/j.tpb.2008.10.004. [DOI] [PubMed] [Google Scholar]
  22. Nielsen R, Hubisz M, Clark A. Reconstituting the frequency spectrum of ascertained single-nucleotide polymorphism data. Genetics. 2004;168(4):2373–2382. doi: 10.1534/genetics.104.031039. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Noonan JP, Coop G, Kudaravalli S, et al. Sequencing and analysis of Neanderthal genomic DNA. Science. (11 co-authors). 2006;314(5802):1113–1118. doi: 10.1126/science.1131412. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Orive ME, Barton NH. Associations between cytoplasmic and nuclear loci in hybridizing populations. Genetics. 2002;162(3):1469–1485. doi: 10.1093/genetics/162.3.1469. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Plagnol V, Wall JD. Possible ancestral structure in human populations. PLoS Genet. 2006;2(7):e105. doi: 10.1371/journal.pgen.0020105. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Reich D, Thangaraj K, Patterson N, Price AL, Singh L. Reconstructing Indian population history. Nature. 2009;461(7263):489–494. doi: 10.1038/nature08365. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Serre D, et al. No evidence of Neandertal mtDNA contribution to early modern humans. PLoS Biol. 2004;2(3):E57. doi: 10.1371/journal.pbio.0020057. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Slatkin M. Inbreeding coefficients and coalescence times. Genet Res. 1991;58(2):167–175. doi: 10.1017/s0016672300029827. [DOI] [PubMed] [Google Scholar]
  29. Slatkin M, Pollack JL. Subdivision in an ancestral species creates asymmetry in gene trees. Mol Biol Evol. 2008;25(10):2241–2246. doi: 10.1093/molbev/msn172. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Tajima F. Evolutionary relationship of DNA sequences in finite populations. Genetics. 1983;105:437–460. doi: 10.1093/genetics/105.2.437. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Wall JD. Detecting ancient admixture in humans using sequence polymorphism data. Genetics. 2000;154(3):1271–1279. doi: 10.1093/genetics/154.3.1271. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Wall JD, Hammer MF. Archaic admixture in the human genome. Curr Opin Genet Dev. 2006;16(6):606–610. doi: 10.1016/j.gde.2006.09.006. [DOI] [PubMed] [Google Scholar]
  33. Wall JD, Lohmueller KE, Plagnol V. Detecting ancient admixture and estimating demographic parameters in multiple human populations. Mol Biol Evol. 2009;26(8):1823–1827. doi: 10.1093/molbev/msp096. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Data

Articles from Molecular Biology and Evolution are provided here courtesy ofOxford University Press

ACTIONS

RESOURCES


[8]ページ先頭

©2009-2025 Movatter.jp