. 2011 Feb 15;28(8):2239–2252. doi:10.1093/molbev/msr048

Testing for Ancient Admixture between Closely Related Populations

Eric Y Durand^1,^*,Nick Patterson²,David Reich^2,³,Montgomery Slatkin¹

¹Department of Integrative Biology, University of California, Berkeley

²Broad Institute of MIT and Harvard

³Department of Genetics, Harvard Medical School

Corresponding author: E-mail:eric.durand@berkeley.edu.

Associate editor: Jonathan Pritchard

Issue date 2011 Aug.

PMC Copyright notice

PMCID: PMC3144383 PMID:21325092

Abstract

One enduring question in evolutionary biology is the extent of archaic admixture in the genomes of present-day populations. In this paper, we present a test for ancient admixture that exploits the asymmetry in the frequencies of the two nonconcordant gene trees in a three-population tree. This test was first applied to detect interbreeding between Neandertals and modern humans. We derive the analytic expectation of a test statistic, called theD statistic, which is sensitive to asymmetry under alternative demographic scenarios. We show that theD statistic is insensitive to some demographic assumptions such as ancestral population sizes and requires only the assumption that the ancestral populations were randomly mating. An important aspect ofD statistics is that they can be used to detect archaic admixture even when no archaic sample is available. We explore the effect of sequencing error on the false-positive rate of the test for admixture, and we show how to estimate the proportion of archaic ancestry in the genomes of present-day populations. We also investigate a model of subdivision in ancestral populations that can result inD statistics that indicate recent admixture.

Keywords: admixture, gene genealogies, lineage sorting

Introduction

Detecting ancient admixture and estimating the extent of archaic ancestry in the genomes of present-day populations are of major importance for many aspects of evolutionary biology (Barton 2001). Admixture occurs during secondary contact between previously isolated populations, and it can play an important role in the speciation process (Barton and Hewitt 1985;Barton 2001). Among anthropologists, there is a long-standing debate on the relationships between modern humans and different groups of archaic humans such as Neandertals. A recent analysis of a draft sequence of the Neandertal genome suggested that 1–4% of genomes from people of Eurasian ancestry are derived from Neandertal (Green et al. 2010). However, there is still no consensus on the overall extent of interbreeding between archaic and modern humans (for a review, seeWall and Hammer 2006).

Methods for detecting and estimating archaic admixture often rely on DNA sequence data from extent populations only. One class of such methods relies on the signature that ancient admixture leaves on patterns of linkage disequilibrium (Plagnol and Wall 2006;Wall et al. 2009). Although they have the advantage of not relying on scarce or damaged ancient DNA samples, they make simplifying assumptions about the demographic history of extant populations that are not testable in general (Wall and Hammer 2006). Another class of methods models genetic introgression using spatially explicit simulations. They aim to make predictions about the proportion of introgression in present-day genomes if ancient admixture had occurred. These methods use either forward-in-time computer simulations (Currat and Excoffier 2004) or deterministic modeling (Forhan et al. 2008).

Recently, direct comparisons of DNA sequences between extant and archaic populations have been made possible by the sequencing of ancient DNA samples. In humans, this line of research began with a comparison of mitochondrial DNA between modern humans and Neandertals (Krings et al. 1997;Serre et al. 2004;Green et al. 2008). It was then extended to fragments of Neandertal nuclear DNA (Green et al. 2006;Noonan et al. 2006;Krause et al. 2007;Lalueza-Fox et al. 2007); the full sequence of Neandertals was finally analyzed inGreen et al. (2010). Although data from extinct populations are still limited, direct comparison of ancient and modern sequences has the potential to directly determine the proportion of present-day genomes inherited from archaic populations (Wall 2000).Green et al. (2010) developed a formal test for admixture based on the direct comparison of DNA sequences from Neandertals and modern human populations.

Green et al. (2010) defined a new statistic, called theD statistic, to test for admixture between three closely related populations. The expectation ofD was derived under a simple model of admixture with constant population size and was used to estimate the proportion of Neandertal ancestry in modern humans (Green et al. 2010, Supplementary Online Material 19). However,Green et al. (2010) noted that their test relies on the assumption that the population ancestral to the sampled populations was randomly mating. In this paper, we extend the results ofGreen et al. (2010) by deriving the expectation ofD under a model where the ancestral populations are not randomly mating. We also study the statistical properties ofD. In addition, we explore the effects of sequencing error and ascertainment bias onD, and we show thatD can indicate past admixture even when the admixing population is not sampled.

In a first section, we recall the definition of theD statistic and why it is expected to detect admixture in the case of randomly mating ancestral populations. Then we derive the expectation ofD under a model of admixture with arbitrarily varying ancestral population sizes, which generalizes the model of admixture proposed inGreen et al. (2010). The derivation reveals a strategy to estimate the proportion of archaic introgression in present-day genomes. In a next section, we derive the expectation ofD in a model where the ancestral populations are not randomly mating and show that ancient population structure can confound the test for admixture. Then we determine the effect of sequencing error onD, and we study the power of the test for admixture on synthetic data. WhenD is computed from genotype data, we determine the impact of ascertainment bias on the test for admixture. AlthoughGreen et al. (2010) definedD to compare the Neandertal genome with the sequences of modern humans, we show thatD statistics can be used to test for admixture when no archaic sample is available. Finally, we apply the test to the Neandertal data published inGreen et al. (2010).

Materials and Methods

A Four-Taxon Statistic to Test for Admixture

Assume that we have sequenced one chromosome from two present-day populations and denote by P₁ and P₂ those populations. Furthermore, assume that we have sampled one chromosome from an archaic population, which we denote as P₃, and one chromosome from an outgroup population, denoted O. Suppose that the four sequences were aligned without error. The null hypothesis that we wish to test is a demographic scenario in which P₁ and P₂ descend from a common ancestral population that diverged from the ancestors of P₃ at an earlier time, without any gene flow between P₃ and P₁ or P₂ after they split. The alternative hypothesis is that P₃ exchanged genes with P₁ or P₂ after these two populations diverged.

We first restrict to positions in the genome where we have coverage for P₁, P₂, P₃, and O. We denote the outgroup allele as “A” and restrict our analysis to biallelic sites at which P₁ and P₂ differ and the alternative allele “B” is seen in P₃. At these sites, we have observed two copies of both alleles, making it less likely that the patterns we analyze have arisen because of sequencing error.

For the ordered set {P₁, P₂, P₃, O}, we call the two allelic configurations of interest “ABBA” or “BABA.” The pattern ABBA refers to biallelic sites where P₁ has the outgroup allele and P₂ and P₃ share the derived copy. The pattern BABA corresponds to sites where P₁ and P₃ share the derived allele and P₂ has the outgroup allele.Green et al. (2010) defined a statistic corresponding to the difference in the counts of ABBA and BABA sites across then base pairs for which we have data of all four samples, normalized by the total number of observations. In this statistic,C_ABBA(i) andC_BABA(i) are indicator variables; they can be 0 or 1 depending on whether an ABBA or a BABA pattern is seen at basei.Green et al. (2010) denoted this statistic byD, and we have

graphic file with name molbiolevolmsr048fx1_ht.jpg

(1)

We further denote byS(P₁, P₂, P₃, O) the numerator ofD(P₁, P₂, P₃, O).

Under the null hypothesis that P₁ and P₂ descend from a common ancestral population that diverged at an earlier time from the ancestral population of P₃, and if the ancestral population of P₁, P₂, and P₃ was panmictic, then derived alleles in P₃ should match derived alleles in P₁ and P₂ equally often. This is because the patterns ABBA and BABA can only arise from gene trees that are nonconcordant with the population tree of P₁, P₂, and P₃. Under the null hypothesis, the two nonconcordant gene trees should occur with equal frequencies (Tajima 1983;Hudson 1983), andD should equal zero. There are three classes of events that can produce a significant deviation from the null hypothesis. First, P₃ exchanged genes with P₁ or P₂. Then the population ancestral to P₁, P₂, and P₃ may have been structured in such a way that one of the two non-concordant gene trees occurs more often than the other. Alternatively, P₁ or P₂ could have received genes from an unsampled ghost population that we denote as P_G. Note that P_G needs to be at least as diverged as P₃ from (P₁, P₂) forD to differ significantly from zero (supplementary fig. S1,Supplementary Material online). Also note that gene flow between P₁ and P₂, or between P₃ and the ancestor of P₁ and P₂, is not expected to produce a deviation from the null hypothesis.

AlthoughD statistics are primarily designed to be applied on sequence data, they can be readily computed on single nucleotide polymorphism (SNP) data. Assume thatn SNPs were genotyped in populations P₁, P₂, P₃, and O. Denote by ${\hat{p}}_{i j}$ the observed frequency of SNPi in population P_j (j = 4 denotes population O). We have

graphic file with name molbiolevolmsr048fx2_ht.jpg

(2)

Thus, computingD on genotype data is straightforward. However, the way SNPs were ascertained can bias estimates of allele frequencies in the different populations (Eberle and Kruglyak 2000;Kuhner et al. 2000;Clark et al. 2005).

A Genealogical Argument for the Expected Frequencies of ABBA and BABA Sites

Assume that a single nucleotide substitution occurred on the gene genealogy representing the ancestry of the four samples (P₁, P₂, P₃, and O). There are three possible topologies of the gene genealogy, assuming that O is the outgroup. Each of these topologies has four branches on which a mutation can create a derived allele in P₁, P₂, and P₃.Figure 1 presents the 12 possible configurations for the four samples P₁, P₂, P₃, and O, assuming that O carries the ancestral allele. The pattern BABA (P₁ and P₃ have the derived nucleotide) is consistent only with tree IIc. The length of the internal branch represents the time during which a mutation can produce the BABA pattern. Similarly, the pattern ABBA (P₂ and P₃ have the derived nucleotide) is consistent only with tree IIIc, and the length of that internal branch is the time during which mutation can create ABBA. The probability that a randomly chosen site will have either pattern is the probability of the appropriate topology multiplied by the expected branch length multiplied by the mutation rate,μ. Thus, to derive the expected frequencies of patterns ABBA and BABA, we need to compute the probabilities of the corresponding tree topologies and average lengths of the relevant branches under the demographic scenario relating P₁, P₂, and P₃.

FIG. 1. — There are three possible topologies of the gene genealogies, I, II, and III, that can relate samples P₁, P₂, and P₃, and the outgroup O. Assuming that any polymorphisms reflect a single historical mutation, ABBA sites can occur only on a type III gene genealogy due to a mutation that occurs on the branch ancestral to samples P₂ and P₃ (IIIc), and BABA sites can occur only on a type II gene genealogy due to a mutation on the branch ancestral to samples P₁ and P₃ (IIc).

In what follows, we derive the expected frequencies of patterns ABBA and BABA under a simple demographic model of ancient admixture. We use this model to show that the test is not confounded by demographic events other than admixture, assuming that the population ancestral to P₁, P₂, and P₃ was panmictic.

Instantaneous Unidirectional Admixture

In this model, we assume that there was a single episode of admixture at timet_GF in the past (t = 0 being the present) from population P₃ to P₂ after the separation of populations P₁ and P₂ (fig. 2). With probabilityf, the P₂ lineage was derived from a lineage from P₃. The parameterf represents the fraction of the genomes in P₂ that originated from P₃. We define the divergence time of P₁ and P₂ populations ast_P2 >t_GF. We denote byt_P3 >t_P2 the divergence time of P₃ and the population ancestral to P₁ and P₂, denoted P₍₁₂₎. At generationt in the past, the effective population size of the P₃ population isN₃(t); the effective size of P₍₁₂₎ isN₍₁₂₎(t); and the effective size of the population ancestral to P₁, P₂, and P₃, denoted P₍₁₂₃₎, isN₍₁₂₃₎(t). All the populations are assumed to be unstructured (i.e., they are randomly mating). We denote by IUA the instantaneous unidirectional admixture model.

FIG. 2. — Model of IUA. P₁ and P₂ split at timet_P2. P₃ splits from the ancestral population of P₁ and P₂, denoted P₍₁₂₎, at timet_P3. A single episode of admixture takes place from P₃ to P₂ at timet_GF. We denote byf the admixture proportion, which is the proportion of P₃ ancestry in P₂ individuals at timet_GF. We denote byN₃(t),N₍₁₂₎(t), andN₍₁₂₃₎(t) the size of populations P₃, P₍₁₂₎, and P₍₁₂₃₎ at generationt in the past. The sizes of populations P₁ and P₂ do not influenceD statistics.

Expected Counts of ABBA and BABA under the Model of Instantaneous Admixture

Under the IUA model, there are three classes of events that can produce the patterns ABBA and BABA. The first class corresponds to the case where the P₂ lineage did not originate from P₃. It occurs with probability 1 −f. In the second class of events, the P₂ lineage traces its ancestry in P₃ (probabilityf), but the P₂ and P₃ lineages did not coalesce beforet_P3. The third class corresponds to the case where the P₂ lineage originated from P₃ (probabilityf) and the P₂ and P₃ lineages coalesced beforet_P3. The first two classes produce patterns ABBA and BABA with equal frequencies. This is because, assuming that P₍₁₂₃₎ was panmictic, the P₃ lineage coalesces first with probability 1/3. The third class of event produces only ABBA because the P₂ and P₃ lineages coalesced first. The probabilities of ABBA and BABA are derived in Appendix 1 for a model with arbitrary varying population size in P₃ and P₍₁₂₎ and constant population size in P₍₁₂₃₎.

Green et al. (2010) studied the simplified case where population size is constant. When we setN₃(t) =N₍₁₂₎(t) =N₍₁₂₃₎ =N in Appendix 1, we find, in accordance withGreen et al. (2010), that

graphic file with name molbiolevolmsr048fx3_ht.jpg

(3)

and

graphic file with name molbiolevolmsr048fx4_ht.jpg

(4)

whereμ denotes the per base mutation rate. In this case, the expectation ofD reduces to

graphic file with name molbiolevolmsr048fx5_ht.jpg

(5)

Note that the test statistic equals 0 when there is no admixture (f = 0). If there is admixture (f > 0),D tends to one whent_P3 –t_GF becomes large. This is because the P₂ and P₃ lineages have more time to coalesce ast_P3 –t_GF increases. Finally, in the case of constant ancestral population size,D tends to 0 when the effective population size becomes large. This is explained by the fact that whenN is large, then the probability that the P₂ and P₃ lineages coalesce in P₃ is small.

A Test for Admixture Insensitive to Many Demographic Assumptions

Under the IUA model,equation (A1.5) shows thatE[D(P₁, P₂, P₃, O)] = 0 if and only iff = 0 ort_GF =t_P3, regardless of the population size fluctuations of P₁, P₂, P₃, P₍₁₂₎, and P₍₁₂₃₎ (as long as population sizes stay finite) and regardless of the times of population divergences and admixture (as long ast_GF <t_P3). Note that a model witht_GF =t_P3 is equivalent to a model without admixture (seefig. 2). Thus, under the IUA model assumptions, a significant deviation from 0 ofD(P₁, P₂, P₃, O) indicates that P₃ exchanged genes with P₁ (D(P₁, P₂, P₃, O) < 0) or P₂ (D(P₁, P₂, P₃, O) > 0). Although we modeled admixture with an instantaneous episode of gene flow, this conclusion holds for ongoing migration between P₃ and P₂ or P₁.

Sites involved in the computation ofD are likely to be in linkage disequilibrium. Therefore, a simple binomial test to assess whetherD significantly differs from zero is not appropriate. Instead, the test significance can be assessed using a standard block jackknife procedure (Efron 1981). In this procedure, one removes blocks of adjacent sites one at a time. The size of blocks should be chosen to be larger than the extent of linkage disequilibrium. By computing the variance of theD statistic over the sequencesM times leaving each block of the sequence in turn, and then multiplying byM and taking the square root, we can obtain an approximately normally distributed standard error using the theory of the jackknife (Reich et al. 2009).

Constraining the Admixture Proportion

D statistics depend on the demographic parameters in a complex way. As such, they are not good candidates to estimate demographic parameters. However, Appendix 1 shows that the numerator ofD, denotedS, is a much simpler function of demographic parameters; for a model with arbitrary varying population sizes, we have

graphic file with name molbiolevolmsr048fx6_ht.jpg

(6)

where ${\bar{t}}_{P(123)}$ is the expected time of coalescence of two lineages in P₍₁₂₃₎ andt* is the expected time of coalescence of two lineages in P₃, given that they coalesced betweent_GF andt_P3. The time ${\bar{t}}_{P(123)}$ is equal to 2N₍₁₂₃₎ in Appendix 1 where we assumed constant population size in P₍₁₂₃₎. Assume now that we have sampled a second lineage in P₃. Denote by P_3,1 and P_3,2 the two lineages sampled in P₃. It is straightforward to see thatS(P₁, P_3,1, P_3,2, O) is equal toS(P₁, P₂, P₃, O) withf = 1 andt_GF = 0 because populations P₂ and P₃ are identical with such parameter values. Therefore,

graphic file with name molbiolevolmsr048fx7_ht.jpg

(7)

wheret** is the expected time of coalescence of two lineages in P₃, given that they coalesced betweent = 0 andt_P3. Thus,S(P₁, P₂, P₃, O)/S(P₁, P_3,1, P_3,2, O) is an upper bound forf. In the case of constant ancestral population size, we have

graphic file with name molbiolevolmsr048fx8_ht.jpg

(8)

If we assume thatt_GF is small compared witht_P3, then one can estimatef by Inline graphic , which is a conservative minimum.

We note here that the strategy to estimate the admixture proportion can be extended to more populations. Assume that a sample from a sister population of P₃ is available. Denote by P₄ this population and assume that it diverged from P₃ before the admixture event (t_P4 >t_GF, wheret_P4 is the time of divergence of P₃ and P₄) (supplementary fig. S2,Supplementary Material online). Then Inline graphic is an unbiased estimate of the admixture proportion independent of population sizes and divergence times.

Testing for Admixture without Archaic Sample

In practice, it may be very common that no sample from the admixing archaic population is available. On the other hand, it may be fairly easy to sample from other extant populations. Assume that a sample from an outgroup to P₁ and P₂ is available and denote by P₀ this population. Assume that P₀ did not exchange genes with P₁ or P₂. Lett_P0 be the time of divergence of P₀ and the population ancestral to P₁ and P₂ (seesupplementary fig. S3,Supplementary Material online). We can then deriveD(P₂, P₁, P₀, O) following the same methodology as in Appendix 1. In the case where population size is constant and equal toN in all ancestral populations, we have

graphic file with name molbiolevolmsr048fx11_ht.jpg

(9)

It is remarkable thatequation (9) does not depend on the time of admixture,t_GF, which is likely to be unknown. It does depend, however, on the time of divergence of the archaic population,t_P3. Using two samples from P₀ or P₁, one can constrain the admixture proportion by using the same strategy as the one using two samples from P₃.

Ancestral Subdivision

Here we derive the expected frequencies of patterns ABBA and BABA under a simple model in which the ancestral population of P₁, P₂, and P₃ is not panmictic. We assume that the ancestral populations P₍₁₂₎ and P₍₁₂₃₎ were structured in two random mating subpopulations (fig. 3). We denote by P_(12),1 and P_(12),2 the subpopulations in P₍₁₂₎ and by P_(123),1 and P_(123),2 in P₍₁₂₃₎. We assume that subpopulations exchange migrants at symmetrical ratem per generation. At timeT in the past, subpopulations merge into one panmictic population. We assume constant population sizeN in all ancestral populations. A similar model was proposed inSlatkin and Pollack (2008).Slatkin and Pollack (2008) showed that, for large values ofT, the nonconcordant gene tree compatible with pattern ABBA occurred with higher frequency than the other nonconcordant gene tree. For small values ofm, its frequency is even higher than the frequency of the concordant gene tree. We use AS to denote the ancestral subdivision model.

FIG. 3. — Model of ancient subdivision (AS). P₁ and P₂ definitively split at timet_P2. The ancestral population of P₁ and P₂ is subdivided in two subpopulations, denoted P_(12),1 and P_(12),2. The two subpopulations exchange migrants at ratem. P₃ splits at timet_P3. The subdivision persists in the ancestral population of P_(12),1, P_(12),2, and P₃. We denote by P_(123),1 the population ancestral to P_(12),1 and by P_(123),2 the ancestral population of P_(12),2. These two subpopulations exchange migrants at ratem. Subpopulations merge at timeT >t_P3. All population sizes are constant and equal toN.

Five-State Markov Chain in P₍₁₂₎

To analyze the ancestry of the P₁ and P₂ lineages in P₍₁₂₎, we use a Markov chain method similar to the one developed bySlatkin and Pollack (2008). The states of the Markov chain are as follows: state 1, P₁ is in P_(12),1 and P₂ is in P_(12),2 (initial state); state 2, P₁ is in P_(12),2 and P₂ is in P_(12),1; state 3, P₁ and P₂ are both in P_(12),1; state 4, P₁ and P₂ are both in P_(12),2, and state 5, P₁ and P₂ coalesced (absorbing state).

To write the transition probabilities, we assume thatm is small enough that only one lineage can migrate to another subpopulation per generation. Moreover, coalescent and migration events cannot occur at the same generation. Assuming constant population sizeN in P_(12),1 and P_(12),2, the nonzero off-diagonal elements of the transition matrixP⁽¹²⁾ are

graphic file with name molbiolevolmsr048fx12_ht.jpg

At timet_P3, the distribution of states isπ⁽¹²⁾(t_P3) =π⁽¹²⁾(t_P2)(P⁽¹²⁾)^{t_P3 −t_P2}, whereπ⁽¹²⁾(t_P2) = {1,0,0,0,0}.

Thirteen-State Markov Chain in P₍₁₂₃₎

To model the ancestry of P₁, P₂, and P₃ in P₍₁₂₃₎, we use a 13-state Markov chain with states defined by state 1, P₃, P₂ | P₁; state 2, P₁, P₃ | P₂; state 3, P₁, P₂ | P₃; state 4, P₁, P₂, P₃; state 5, (P₂, P₃) | P₁; state 6, (P₁, P₃) | P₂; state 7, (P₁, P₂) | P₃; state 8, (P₂, P₃), P₁; state 9, (P₁, P₃), P₂; state 10, (P₁, P₂), P₃; state 11, ((P₂, P₃), P₁); state 12, ((P₁, P₃), P₂), and state 13, ((P₁, P₂), P₃). We used the symbol “|” to denote that lineages are in different subpopulations and brackets denote coalescence events. For example, state 6 takes into account the case where P₁ and P₃ lineages have coalesced, and the resulting ancestral lineage is in a different subpopulation than the P₂ lineage.

Again we assume thatm is small enough that only one lineage can migrate to another subpopulation per generation and that coalescent and migration events cannot occur at the same generation. Assuming constant population sizeN in P_(123),1 and P_(123),2, the nonzero off-diagonal elements of the transition matrixP⁽¹²³⁾ are

graphic file with name molbiolevolmsr048fx13_ht.jpg

At timeT, the distribution of states isπ⁽¹²³⁾(t_P3) =π⁽¹²³⁾(t_P3)(P⁽¹²³⁾)^{T −t_P3}, whereπ⁽¹²³⁾(t_P3) = {π₁⁽¹²⁾,π₂⁽¹²⁾,π₃⁽¹²⁾,π₄⁽¹²⁾,0,0,π₅⁽¹²⁾,0,0,0,0,0,0}.

Events (coalescence and/or migration) in P₍₁₂₎ only affect the expected value ofD by changing the starting probabilities in P₍₁₂₃₎. Appendix 2 presents the details of all the possible cases leading to ABBA and BABA patterns at timeT, where all subpopulations merge into a single random mating population.

The frequencies of patterns ABBA and BABA have no simple analytical expressions under the AS model. However, they can be easily computed numerically for any set of parameter values.

Expected Coalescence Time Conditional on Gene Tree Topology

In this section, we derive the expected coalescence time of P₂ and P₃, given that they coalesced first, under the IUA and the AS models. This is a quantity of interest because it determines the length of the internal branch of tree topology III (fig. 1), and therefore the frequency of pattern ABBA. We use these derivations to illustrate the confounding effect of ancestral subdivision on the test for admixture.

We denote by τ₂₃^(IUA) the expected coalescence time of P₂ and P₃, given that they coalesced first, under the IUA model with constant ancestral population size. It can be decomposed as the sum of two terms, depending on whether the P₂ lineage originated from P₃ or not:

graphic file with name molbiolevolmsr048fx14_ht.jpg

(10)

wheret* is the expected time of coalescence of P₂ and P₃ given that they coalesce betweent_GF andt_P3 (Appendix 1,eq. A1.3).

Then we derive the conditional expected time of coalescence of P₂ and P₃ under the AS model, denotedτ₂₃^(AS). We start by conditioning on whether coalescence occurs before timeT:

graphic file with name molbiolevolmsr048fx15_ht.jpg

(11)

The first term is equal to the expected time for the Markov chain to first hit state 5, 8, or 11, conditional on this time being lower thanT; we compute it using standard Markov chain theory. It can be shown that this conditional time tends to 4N + 1/(2m) whenT grows to infinity, which is equal to the expected coalescence time of two individuals in different demes in a two-island model (Slatkin 1991). Because the ancestral population is assumed to be panmictic after timeT, the second term ofequation (10) is equal to

graphic file with name molbiolevolmsr048fx16_ht.jpg

(12)

Although there is no simple analytical expression forτ₂₃^(AS), it can be computed numerically for any set of parameter values.

Effect of Sequencing Error onD Statistics

In this section, we study the effect of sequencing error onD. In what follows, we use capital letters to denote the observed count of a pattern and small letters to denote the true counts in the absence of sequencing error. We assume that sequencing error is uniform along the sequence. We denote bye₁,e₂,e₃, ande₄ the probability that a base was incorrectly read in population P₁, P₂, P₃, and O, respectively. Lete_ij be the probability that a site was read with an error in populations P_i and P_j at the same time (j = 4 denotes population O). We neglect terms of higher order. A sequencing error on O will cause the reads for P₁, P₂, and P₃ to be mislabeled. For example, if O was sequenced with an error at a site where the true pattern is “aaaa”, we would read pattern “BBBA.” Thus, for the two patterns of interest, we have

graphic file with name molbiolevolmsr048fx17_ht.jpg

(13)

and

graphic file with name molbiolevolmsr048fx18_ht.jpg

(14)

assuming that error rates are small.

False-Positive Rate in the Presence of Sequencing Error

Here we look at the effect of sequencing error on the false-positive rate of the test for admixture. We assume a null model with no admixture and randomly mating ancestral populations. In this model, the true expected counts verifyn_abba =n_baba andn_abaa =n_baaa. The second equality assumes that the mutation rates in P₁ and P₂ are equal. Under these assumptions, we have

graphic file with name molbiolevolmsr048fx19_ht.jpg

(15)

The effect of sequencing error onD(P₁, P₂, P₃, O) is then

graphic file with name molbiolevolmsr048fx20_ht.jpg

(16)

If the sequencing error rates are the same for the four samples, then sequencing error will have no influence on the false-positive rate of the test as long as they are small enough that third-order terms can be ignored.

To illustrate the effect of different sequencing error rates on the false-positive rate of the test, note that in the case where the outgroup divergence time,t_O, is large compared with the divergence time of P₃, the two dominant patterns will be bbba and aaaa. We considered a case where 98% of sites were aaaa, 0.02% abba and baba, and 1.97% bbba. All other configurations represented 0.01% of sites. We took 10⁹ sites in total. This settings roughly correspond to a population tree where P₁ and P₂ are modern human populations, P₃ is a Neandertal population, and Chimpanzee is the outgroup. Indeed,Green et al. (2010) estimated that the time of divergence of Neandertal and modern humans was less than 440,000 years before present, which is small compared with divergence time of modern humans and chimpanzee.

Under the simplifying assumption that sites are independent, true pattern counts abba and baba follow a binomial distribution with parametersp = 0.5 andn =n_abba +n_baba. In this case, the standard deviation (SD) ofD is equal to Inline graphic .

Synthetic Data Analysis

Power of the Test for Admixture

In order to assess the power of the test for admixture, we simulated data under different demographic scenarios. To simulate counts for ABBA and BABA patterns, we simulated trees using thems software (Hudson 2002). We then placed a mutation on the tree; the branch it fell on was selected with probability proportional to its length. The descendants of the mutated branch then define which pattern was generated. Under this simulation scheme, the number of independentms replicates corresponds to the number of unlinked polymorphic sites in the three populations P₁, P₂, and P₃. For all the demographic scenarios described below, we varied this number between 10,000 and 1,000,000.

We simulated counts for ABBA and BABA patterns under the IUA model. The time of divergence of P₁ and P₂ was set tot_P2 = 3,000 generations. P₃ diverged from the ancestral population P₍₁₂₃₎ at timet_P3 = 12,000 generations. Admixture occurred at timet_GF = 2,500 generations. We considered different cases for the ancestral population sizes: constant sizeN = 10,000 in all populations; a bottleneck in P₃ prior to the admixture event; and a bottleneck in P₍₁₂₎. Both bottlenecks involved a 100-fold reduction in population size and lasted for 1,000 generations. They both started at 3,500 generations in the past. We also considered a case where P₁ and P₂ exchanged migrants at rate 2Nm = 0.5 betweent = 0 andt_P2. We varied the admixture proportionf from 0 (no admixture) to 0.1 (10% introgression from P₃ into P₂).

Because replicates are independent,D is approximately normally distributed, and its SD can be computed from the binomial distribution: Inline graphic Each simulation class was repeated 1,000 times. We computed the power of the test for admixture as the proportions of the 1,000 repetitions were |D| was larger than twice its standard error. For the simulations without admixture, we expect that roughly 95% of the simulatedD statistics will be comprised between plus or minus twice the SD.

Effect of Ascertainment Bias

When computed on genotype data, the way SNPs were ascertained can biasD statistics. Typically, an ascertainment scheme consists of two phases. First, SNPs are discovered from the genetic material of a small group of individuals, called the discovery panel. Then the discovered SNPs are typed in a larger sample (Eberle and Kruglyak 2000). The extent to which the ascertainment scheme will biasD statistics depends on many parameters, such as the genetic composition of the discovery panel and its relationship to the larger samples. Ascertainment bias can be corrected when one assumes that the discovery panel is a random sample from the studied population (Nielsen et al. 2004). However, in the context ofD statistics, the relationship between the discovery sample and the four typed populations P₁, P₂, P₃, and O can be complex.

To illustrate the effect of a simple ascertainment scheme onD statistics, we simulated genotype data under the IUA model using 100,000 independent replicates ofms (Hudson 2002). The time of divergence of P₁ and P₂ was set tot_P2 = 3,000 generations. P₃ diverged from the ancestral population P₍₁₂₃₎ at timet_P3 = 12,000 generations. Admixture occurred at timet_GF = 2,500 generations. All population sizes were set equal toN = 10,000. We simulated 100 individuals in P₁ and P₂ and 1 individual in P₃. To simulate ascertainment bias, we subsampled individuals in P₁ or P₂ and kept only sites that were variable in those individuals. In total, we explored five subsampling scenarios: 1) 2 individuals from P₁; 2) 20 individuals from P₁; 3) 2 individuals from P₂; 4) 20 individuals from P₂; and 5) no subsampling (no ascertainment bias). Each of these five subsampling scenarios was repeated forf = 0 (no admixture) andf = 0.05 (5% introgression from P₃ into P₂).

Application to Neandertal Data

The Neandertal genome was recently sequenced to ∼1× coverage, producing a total of ∼4 billon base pairs (Green et al. 2010).Green et al. (2010) analyzed data from five present-day human males from the CEPH-Human Genome Diversity Project panel (French, Han, Papuan, San, and Yoruba) that were sequenced to ∼5× coverage. For each pair of modern humans P₁ and P₂,Green et al. (2010) computedD(P₁, P₂, Neandertal, Chimpanzee). The chimpanzee individual was represented by the reference chimpanzee genome (PanTro2).

Results

Ancient Subdivision Confounds the Test for Admixture

We computedD(P₁, P₂, P₃, O) under the IUA and the AS models (Appendices 1 and 2) for parameter valuest_GF = 2,500,t_P2 = 3,000,t_P3 = 12,000,T = 25,000 (times in generations), and a constant ancestral population size equal toN = 10,000.Figure 4A displaysD statistics as a function off (IUA) and 2Nm (AS). We also computed the expected time of coalescence of P₂ and P₃ given that they coalesced first.Figure 4B displaysτ₂₃^(IUA) as a function off andτ₂₃^(AS) as a function of 2Nm. We used a computer algebra program to compute numerically quantities under the AS model.

FIG. 4. — D statistics and expected coalescence time of P₂ and P₃ under the IUA and AS models. We assumed parameter values oft_GF = 2,500,t_P2 = 3,000,t_P3 = 12,000,T = 25,000 (times in generations), and a constant ancestral population size equal toN = 10,000. (A)D(P₁, P₂, P₃, O) as a function off (IUA) and 2Nm (AS). (B) Expected internal branch length (IBL) for topologies where P₂ and P₃ coalesced first as a function off (IUA) and 2Nm (AS). The rescaling for 2Nm on (A) and (B) was chosen so that 2Nm andf vary in the same range.

TheD statistic curves intersect, which shows that both the IUA and the AS models can be made to fit data if P₃ shares more derived allele with P₁ or P₂. Therefore, the test for admixture based onD alone will always be confounded by the presence of ancestral subdivision.

More generally,figure 4B shows that the IUA and the AS models predict the same expected coalescence times for some parameter values. Thus, any test for admixture solely based on expected coalescence times between one lineage from different populations will be confounded by nonrandom mating in ancestral populations.

Robustness to Sequencing Error

We assume a background sequencing error of 0.001 in each sample.Figure 5 plotsD againste₁ −e₂, keepinge₁ +e₂ = 0.002 ande₃ =e₄ = 0.001. If we assume that the sequencing error rate in two populations is the product of error rates in each of the two populations (i.e., errors are independent in the two populations), then a difference in sequencing error ratese₁ −e₂ = 0.00011 will cause the test for admixture to produce false-positive results. If we assume that the probability to sequence two populations with an error at the same base is only one order of magnitude lower than the error rate in one population, then differences in error rates of 0.00008 will create false positives. In the worst-case scenario, where sequencing error in two populations is of the same order of magnitude as sequencing error in one populations, thene₁ −e₂ = 0.000045 is enough to create false positives.

FIG. 5. — D(P₁, P₂, P₃, O) as a function of the sequencing error differencee₁ −e₂. The dotted curve assumes that the probability to have an error in two populations is equal to the product of sequencing errors in the two populations (*e_ij* =*e_i* ×*e_j*). The dashed curve corresponds toe_ij = min(*e_i*,*e_j*)/10, and the solid line assumese_ij = min(*e_i*,*e_j*). The horizontal dashed line corresponds to twice the standard error ofD, assuming that sites are independent.

Power of the Test for Admixture

Figure 6 reports the power analysis for the simulations of instantaneous admixture with constant population size, a bottleneck in P₃, a bottleneck in P₍₁₂₎, and migration between P₁ and P₂. Adding a bottleneck in P₃ increased the power of the test substantially; this is because, if the P₂ lineage originated from P₃, the bottleneck increases the probability that the P₂ and P₃ lineages coalesce beforet_P3. Adding a bottleneck in P₍₁₂₎ improved the power of the test even more. This is because, in the case where admixture did not happen (eq. A1.1), the probability that the P₁ and P₂ lineages do not coalesce beforet_P3 is decreased by the bottleneck. Therefore, this case contributes less to the denominator ofD. Migration between P₁ and P₂ decreased the power of the test for admixture.

FIG. 6. — Power of the test for admixture. We simulated (A) 10,000 unlinked polymorphic sites and (B) 100,000 unlinked polymorphic sites. Dots: no bottleneck and no migration between P₁ and P₂. Circles: bottleneck in the ancestral population of P₁ and P₂. Grey triangles: bottleneck in P₃. Squares: ongoing migration between P₁ and P₂ at rate 2Nm = 0.5. All bottlenecks involved a 100-fold reduction in population size and lasted 1,000 generations. They started at 3,500 generations in the past.

For the simulations without admixture (f = 0),figure 6 reports the false-positive rate of the test for admixture at a 5% level. In the case of constant population size, the false-positive rate was 0.050 for 10,000 sites and 0.046 for 100,000 sites. This is an expected result asD is approximately normally distributed for independent sites. Plus or minus twice the SD represents roughly a 95% confidence interval forD. As predicted, adding bottlenecks in the ancestral populations, or migration between P₁ and P₂, did not significantly affect the false-positive rate of the test.

Robustness to Ascertainment Bias

For data simulated without admixture (f = 0),D(P₁, P₂, P₃, O) did not differ significantly from 0 regardless of the ascertainment scheme used. More precisely, we obtained the following values forD(P₁, P₂, P₃, O) under each scheme: 1) −7.2 × 10⁻⁴ (SD = 5.7 × 10⁻³); 2) 6.2 × 10⁻⁴ (SD = 3.4 × 10⁻³); 3) −2.5 × 10⁻⁵ (SD = 5.5 × 10⁻³); 4) 3.8 × 10⁻⁴ (SD = 3.4 × 10⁻³); and 5) −7.9 × 10⁻⁵ (SD = 3.5 × 10⁻³).

The IUA model with 5% admixture (f = 0.05) predictsD(P₁, P₂, P₃, O) = 0.053 for the demographic parameters we used (eq. A1.5). The simulated data yielded the following values: 1) 0.021 (SD = 4.8 × 10⁻³); 2) 0.024 (SD = 3.4 × 10⁻³); 3) 0.037 (SD = 6.0 × 10⁻³); 4) 0.048 (SD = 3.5 × 10⁻³); and 5) 0.053 (SD = 3.5 × 10⁻³).

Therefore, the simple ascertainment scheme we used here did not affect the false-positive rate of the test for admixture. However, it lowered the value ofD(P₁, P₂, P₃, O) for scenarios with 5% admixture. We checked that this result was consistent for other admixture proportions (not reported). The difference between the true value and the observed one was larger when the discovery sample was taken in P₁.

Application to Neandertal

Testing for Admixture between Neandertals and Modern Humans

Table 1 showsD statistics fromGreen et al. (2010).D(P₁, P₂, P₃, O) were significantly positive if P₁ is an African individual (San or Yoruba) and P₂ is a non-African individual (French, Han, or Papuan). When comparing two African individuals (P₁ = San and P₂ = Yoruba) or two non-African individuals (P₁ and P₂ = French, Han, or Papuan),D did not differ significantly from 0. Furthermore,D statistics involving one African and one non-African did not differ significantly from one another. Significance was assessed using a standard block jackknife method (Efron 1981). AD value was considered to be significantly different from 0 when its absolute value was greater than three times its standard error. TheD statistics are compatible with a scenario where Neandertals admixed with modern humans outside of Africa, before Europeans, East Asians, and Melanesians split (Green et al. 2010).

Table 1.

Neandertals Share More Genetic Variants With Non-Africans than With Africans (Green et al. 2010).

	P₁	P₂	D(P₁, P₂, N, C), %
African–African	San	Yoruba	−0.1 ± 0.4
African–non-African	San	French	4.2 ± 0.4
	San	Papuan	3.9 ± 0.5
	San	Han	5.0 ± 0.5
	Yoruba	French	4.5 ± 0.4
	Yoruba	Papuan	4.4 ± 0.6
	Yoruba	Han	5.3 ± 0.5
Non-African–non-African	French	Papuan	0.1 ± 0.5
	French	Han	1.0 ± 0.6
	Papuan	Han	0.7 ± 0.6

Open in a new tab

We note thatGreen et al. (2010) used about 7,475,200 biallelic sites for theD statistics analysis (Green et al. 2010, SOM 15). Therefore, we expect the power of the test to be extremely high.Green et al. (2010) also computedD statistics using genotype data from the CEPH-Human Genome Diversity Project panel.D statistics were lower when computed on SNPs than when computed on sequencing data, consistent with the predicted bias of a simple ascertainment scheme. However, it is remarkable that although raw values ofD statistics changed, the general conclusion that non-Africans were significantly closer to Neandertals than Africans held.

Estimating the Proportion of Neandertal Ancestry in Non-Africans

Taking advantage of the fact that bones from different Neandertal individuals were available,Green et al. (2010) estimated the proportion of Neandertal ancestry in non-Africans as

graphic file with name molbiolevolmsr048fx23_ht.jpg

where Afr is an African sample (San or Yoruba), OOA is a non-African sample (French, Han, or Papuan), and Nea₁ and Nea₂ are two Neandertal individuals. TheS statistics were averaged over the different possible combinations of African and non-African samples. Using the standard errors ofS statistics (estimated by block jackknife), this yielded an estimate of $\hat{f} = 1 - 4 %$ . However,equation (7) shows that this is a conservative minimum.Green et al. (2010) estimated the population divergence time of Neandertals and modern humans to bet_N ≈ 270,000 − 440,000 years before present. The time of gene flow has to occur after modern humans went out of Africa, an event that took place an estimatedt_OOA ≈ 45,000 − 100,000 years before present (Forster and Matsumura 2005). Assuming a constant population size, the bias in the estimator off is most important whent_N −t_OOA is small. In the worst-case scenario, the bias is equal to (270,000 − 100,000)/100,000 = 0.063. Therefore, a corrected estimate off in the case of constant ancestral population size isf* = 1–6%.

The data are also compatible with a scenario of ancestral subdivision. Usingt_P3 = 12,000 generations,t_P2 = 3,000 generations, and a constant population size ofN = 10,000, migration rates 1.0 < 2Nm < 2.0 and 15,000 <T < 25,000 generations were compatible with the data. However, we note that the ancestral subdivision has to last for thousands of generations in order forD to be on the order of 5% (supplementary fig. S4,Supplementary Material online). Such subdivision would require that local populations and the geographic barriers that separate them persist for very long times, much longer than seems reasonable for highly mobile and adaptable hominins.

Discussion

Asymmetry in three population gene trees is known to occur in the presence of gene flow (Meng and Kubatko 2009) or ancestral subdivision (Slatkin and Pollack 2008). Here we show that asymmetry can be exploited to design a test for admixture between three closely related populations. The test requires only the assumption that the ancestral population is randomly mating; it is robust to other demographic assumptions such as variations in population size. Moreover, the test is robust to sequencing error if one assumes that the error rate is the same in the sequences analyzed. However, if sequencing error rates differ, then sequencing error can create false positives and that is more likely to be true if sequencing errors are not independent. Finally, we illustrated how to constrain the admixture proportion in a simple model of admixture.

Duration and Direction of Gene Flow

The model of admixture as a one-way instantaneous gene flow event is not intended to be realistic. Instead, instantaneous gene flow is the simplest possible admixture model that enables to derive theD statistic expectation and estimate the admixture proportion. In the more realistic case of ongoing migration,D statistics can be computed numerically using the theory of Markov chains, as illustrated in the ancient structure model.

Introgression is often asymmetrical between hybridizing species (Barton and Hewitt 1985;Orive and Barton 2002). When an expanding population colonizes a new habitat and hybridizes with previous residents, breeding events at the front of expansion can result in the substantial introduction of genes in the expanding population (Currat et al. 2008). As a consequence, detectable ancient admixture is more likely to be found in extant populations when introgression with archaic populations occurred during a range expansion, as is thought to be the case for modern humans and Neandertals (Currat et al. 2008;Green et al. 2010).

However, there are also cases known where detectable introgression occurs from the invasive population to the resident species (Hastings et al. 2005). We note that derivingD statistics when gene flow is bidirectional or in the other direction is not more difficult, and the same methodology can be applied. If there was gene flow from the extant population to the archaic one, then anyD statistic involving individuals only from the extant populations would not be affected. This provides a framework to test for the direction of gene flow. Taking advantage that Eurasians are more closely related to some African populations than to others,Green et al. (2010) used a similar argument to rule out substantial gene flow from modern humans to Neandertals.

Effect of Ascertainment Bias

AlthoughD statistics are primarily designed for sequence data, they can be readily applied to genotype data. We showed that the false-positive rate of the test for admixture was not affected by a very simplified ascertainment scheme where we subsampled individuals in either P₁ or P₂. This is because, in this case, the bias in allele frequencies introduced by the ascertainment compensated itself in patterns ABBA and BABA. However, even this very simple ascertainment scheme lowers the power of the test because the ascertainment biases sampled SNPs toward higher frequencies, increasing the counts of pattern ABBA and BABA to a similar extent, and therefore increasing the denominator ofD.

We caution that we used an oversimplified ascertainment scheme. More complex scenarios can introduce bias in an unforeseen way. For example, if the discovery sample is from neither P₁ nor P₂, then the bias will depend on the genealogical relationships between the population where SNPs were discovered and P₁, P₂, and P₃. However, we note that the main conclusion regarding the test for admixture ofGreen et al. (2010) held when applied to genotype data. Still, we stress that the test is likely to be more powerful using sequencing data. An investigator wishing to apply the test using genotype data should try to correct allele frequencies for ascertainment bias. Several approaches have been proposed to do so (e.g.,Clark et al. 2005).

Effect of Recurrent Mutation

We assumed that each polymorphism was created by a single mutation. If mutations were recurrent, then different copies of the derived allele would arise independently. Hence, many of the gene genealogies infigure 1 that we ignored in computing the expectation ofD would have to be accounted for. If the outgroup diverged from the lineage leading to the other populations a long time in the past, as it is the case for modern humans, Neandertals, and chimpanzees, recurrent mutations are likely to occur on the long branch leading to the outgroup. As a consequence, genealogies Ib, IId, and IIIb would create the pattern BABA and genealogies Ia, IIa, and IIId would create the pattern ABBA (seefig. 1). If there is no gene flow from P₃ to P₁ or P₂, and if the mutation rate is the same in the three populations, then Ia = Ib, IId = IIId, and IIIb = IIa (where the equal sign denotes that corresponding trees occur with equal frequencies). Therefore, recurrent mutations are not expected to increase the false-positive rate of the test for Neandertal admixture. In the case where P₁, P₂, P₃, and O are more diverged from each other (e.g., modern humans, chimpanzee, gorilla, and orangutan), recurrent mutations are likely to occur on other branches as well. Recurrent mutations on the terminal branches leading to P₁ and P₂ could potentially increase the false-positive rate of the test for admixture.

In the presence of gene flow, it is difficult to predict whether recurrent mutation increases or decreasesD. The exact result depends on details of the demographic model, but the effect is proportional to the probability that a second mutation occurs and the constant of proportionality is less than one because many of the gene genealogies still have no effect onD. Still, recurrent mutations can affect the power of the test for admixture.

We also note that differing mutation rates in the P₁ and P₂ populations since their divergence are not expected to biasD. The reason for this is that by restricting to sites where P₃ is derived relative to the outgroup O, we are restricting the analysis to mutations that arose prior to divergence of the ancestral populations of P₁, P₂, and P₃.

Effect of Mapping Errors

We assumed that the sequences from P₁, P₂, P₃, and O were aligned to a common sequence without error. Depending on the nature of the sampled populations and the sequencing technology, different alignment strategies may be used.Green et al. (2010) aligned the Neandertal sequence both to the reference human and to the chimpanzee genomes (hg18 and panTro2). Different aligners were used depending on the technology used to sequence the Neandertal and modern humans genomes used in the study. Because of the variety of the alignment strategies, it is difficult to model the effect of mapping error onD statistics in a general framework.

The extent to whichD statistics will be affected depends on which sample is more prone to mapping error. Let us assume that all sequences were aligned to the outgroup O. Mapping errors in the P₃ sequence are likely to have a negligible effect on the false-positive rate of the test for admixture. This is because mismapping in P₃ will equally affect patterns ABBA and BABA. However, alignment error in P₃ can affect the power of the test if it artificially increases the counts of patterns ABBA and BABA. Mapping errors in P₁ or P₂ can have a strong effect on the false-positive rate of the test. Mismapping in P₁ will cause P₂ and P₃ to match each other artificially too often (therefore increasing the count of the ABBA pattern). Thus, it is crucial to test the effect of mapping error in P₁ or P₂. One possible approach is to stratify the data into categories based on the read coverage in sequences P₁ and P₂ and to computeD(P₁, P₂, P₃, O) in each category. If mapping is not an issue, thenD should remain stable in categories where the coverage is high enough.

Effect of Natural Selection

We implicitly assumed selective neutrality in all our analyses. For natural selection to bias the test for admixture, it must increase the number of derived alleles shared by P₃ and P₁ or P₂. The ABBA and BABA patterns inGreen et al. (2010) were distributed genome wide. Therefore, it seems unlikely that the observedD statistics inGreen et al. (2010) were biased by natural selection.

However, when the data consist of a few loci, natural selection could potentially biasD statistics by increasing the number of derived alleles in, say, P₂. It would result in P₂ matching P₃ more often than P₁, even if there was no admixture. The exact effect of natural selection onD statistics remains a subject of further investigation.

Distinguishing between Gene Flow and AS

We showed thatD statistics do not allow us to distinguish between the models of admixture and ancestral subdivision because ancestral subdivision and admixture can produce the same expected coalescence time between two individuals from different populations. However, it is known that ancestral population subdivision results in a higher than expected variance in coalescence times (Wall et al. 2009). Therefore, ancestral subdivision is likely to result in more variation in gene tree depth when using several samples from the extant population. This in turn will affect the site frequency spectrum of the extant population (Harpending et al. 1998). Designing a statistic to distinguish between the two scenarios will require using more than one sample per population.

Concluding Remarks

Inference of ancient admixture is usually limited by the lack of archaic DNA sequence data. Most methods so far have focused on detecting the expected signature of ancient admixture in extant genomes. Here we studied the behavior of a test statistic for ancient admixture based on the direct comparison of DNA sequences of three closely related populations. Our derivation of this test statistic under alternative demographic models showed how to robustly estimate the archaic contribution to extant populations. We demonstrated that our approach can also be applied when no archaic population has been sampled. This paper illustrates the advantage of using one sample per population, which greatly simplifies inference by removing the influence of many demographic assumptions that are not testable in most situations.

Supplementary Material

Supplementary figures S1–S4 are available atMolecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

Supplementary Data

supp_28_8_2239__index.html^{(811B, html)}

Acknowledgments

We thank Anna-Sapfo Malaspinas for helpful discussions that helped improve the manuscript. E.Y.D and M.S were supported by a US National Institute of Health grant (R01-GM40282).

Appendix A. Derivation ofD under the Model of Gene Flow

Here we present the detailed derivation ofD(P₁, P₂, P₃, O) under the IUA model. For simplicity of notations, we assume that the population size in P₍₁₂₃₎ is constant and equal toN₍₁₂₃₎. However, the derivations can be easily extended for arbitrary varying population size in P₍₁₂₃₎. There are three classes of event that can produce patterns ABBA and BABA.

The P₂lineage traces its ancestry through the P₍₁₂₎side of the phylogeny (probability 1−f), and between t_P2 and t_P3,the P₁ and P₂ lineages do not coalesce
(A1.1)
The P₂ lineage traces its ancestry through the P₃ side of the phylogeny (probability f), and between t_GF and t_P3, the two lineages do not coalesce
(A1.2)
The P₂ lineage traces its ancestry through the P₃ side of the phylogeny (probability f), and between t_GF and t_P3, the two lineages coalesce. This history creates gene genealogies only of type III and results only in ABBA sites (never BABA sites in the absence of recurrent mutation in the same genealogy). The probability that there is a coalescence beforet_P3 is. Once the coalescence occurs, the ancestral lineage cannot coalesce with P₁ beforet_P3. Aftert_P3, the average coalesce time is 2N₍₁₂₃₎. Therefore, the expected length of the internal branch is $t_{P3} + 2 N_{(123)} - (t_{GF} + \bar{t})$ , where $\bar{t}$ is the expected coalescence time between P₂ and P₃, given that the coalescence occurs beforet_P3. A short analysis shows that
(A1.3)
Thus,
(A1.4)
The overall probability of ABBA and BABA is obtained by adding, and the mutation rate parameterμ cancels. Therefore, Appendix B

Here we present the detailed derivation ofD(P₁, P₂, P₃, O) under the AS model. At timeT, there are six possibilities depending on the states of the Markov chain running in P₍₁₂₃₎.

Appendix B

No coalescence event occurred

This case corresponds to states 1, 2, 3, and 4 of the Markov chain in P₍₁₂₃₎. It contributes equally to ABBA and BABA:

graphic file with name molbiolevolmsr048fx31_ht.jpg

(A2.1)

2. P₁ and P₂ lineages coalesced

This case does not contribute to ABBA and BABA.

3. P₂ and P₃ lineages coalesced, but not the P₁ lineage
(A1.5)

This case corresponds to states 5 and 8 of the Markov chain in P₍₁₂₃₎. It does not contribute to BABA. Its contribution to ABBA is

graphic file with name molbiolevolmsr048fx33_ht.jpg

(A2.2)

wheret₂₃^* is the expected time of coalescence of the P₂ and P₃ lineages given that coalescence occurs betweent_P3 andT.

4. P₃ and P₁ lineages coalesced, but not the P₂ lineage

This case corresponds to states 6 and 9 of the Markov chain in P₍₁₂₃₎. It does not contribute to ABBA. Its contribution to BABA is, similarly to case (3),

graphic file with name molbiolevolmsr048fx34_ht.jpg

(A2.3)

5. P₃ and P₂ lineages coalesced first and then with the P₁ lineage.

This case corresponds to state 11 of the Markov chain in P₍₁₂₃₎. It does not contribute to BABA. Its contribution to ABBA is

graphic file with name molbiolevolmsr048fx35_ht.jpg

(A2.4)

wheret₁^** is the expected time of coalescence of the P₁ lineage, given that coalescence occurs betweent₂₃^* andT.

6. P₃ and P₁ lineages coalesced first and then with the P₂ lineage.

This case corresponds to state 12 of the Markov chain in P₍₁₂₃₎. It does not contribute to ABBA. Similarly to case (5), its contribution to BABA is

graphic file with name molbiolevolmsr048fx36_ht.jpg

(A2.5)

wheret₂^** is the expected time of coalescence of the P₂ lineage, given that coalescence occurs betweent₁₃^* andT.

The conditional expected coalescence times,t₂₃^*,t₁₃^*,t₁^**, andt₂^** are computed using standard Markov chain theory as the expected times of first hit of corresponding states given that the hit occurred before timeT. We have

graphic file with name molbiolevolmsr048fx37_ht.jpg

(A2.6)

The other conditional expected coalescence times are computed similarly. There is no simple analytic expression forD.

References

Barton N. The role of hybridization in evolution. Mol Ecol. 2001;10:551–568. doi: 10.1046/j.1365-294x.2001.01216.x. [DOI] [PubMed] [Google Scholar]
Barton N, Hewitt G. Analysis of hybrid zones. Annu Rev Ecol Syst. 1985;16:113–148. [Google Scholar]
Clark AG, Hubisz MJ, Bustamante CD, Williamson SH, Nielsen R. Ascertainment bias in studies of human genome-wide polymorphism. Genome Res. 2005;15(11):1496–1502. doi: 10.1101/gr.4107905. [DOI] [PMC free article] [PubMed] [Google Scholar]
Currat M, Excoffier L. Modern humans did not admix with Neanderthals during their range expansion into Europe. PLoS Biol. 2004;2(12):e421. doi: 10.1371/journal.pbio.0020421. [DOI] [PMC free article] [PubMed] [Google Scholar]
Currat M, Ruedi M, Petit RJ, Excoffier L. The hidden side of invasions: massive introgression by local genes. Evolution. 2008;62(8):1908–1920. doi: 10.1111/j.1558-5646.2008.00413.x. [DOI] [PubMed] [Google Scholar]
Eberle M, Kruglyak L. An analysis of strategies for discovery of single-nucleotide polymorphisms. Genet Epidemiol. 2000;19(Suppl 1):S29–S35. doi: 10.1002/1098-2272(2000)19:1+<::AID-GEPI5>3.0.CO;2-P. [DOI] [PubMed] [Google Scholar]
Efron B. Nonparametric estimates of standard error: the jackknife, the bootstrap and other methods. Biometrika. 1981;68(3):589–599. [Google Scholar]
Forhan G, Martiel J, Blum M. A deterministic model of admixture and genetic introgression: the case of Neanderthal and Cro-Magnon. Math Biosci. 2008;216:71–76. doi: 10.1016/j.mbs.2008.08.004. [DOI] [PubMed] [Google Scholar]
Forster P, Matsumura S. Evolution—did early humans go north or south? Science. 2005;308(5724):965–966. doi: 10.1126/science.1113261. [DOI] [PubMed] [Google Scholar]
Green RE, Krause J, Briggs AW, et al. A draft sequence of the Neandertal genome. Science. (56 co-authors). 2010;328(5979):710–722. doi: 10.1126/science.1188021. [DOI] [PMC free article] [PubMed] [Google Scholar]
Green RE, Krause J, Ptak SE, et al. Analysis of one million base pairs of Neanderthal DNA. Nature. (11 co-authors). 2006;444(7117):330–336. doi: 10.1038/nature05336. [DOI] [PubMed] [Google Scholar]
Green RE, Malaspinas AS, Krause J, et al. A complete Neandertal mitochondrial genome sequence determined by high-throughput sequencing. Cell. (25 co-authors). 2008;134(3):416–426. doi: 10.1016/j.cell.2008.06.021. [DOI] [PMC free article] [PubMed] [Google Scholar]
Harpending HC, Batzer MA, Gurven M, Jorde LB, Rogers AR, Sherry ST. Genetic traces of ancient demography. Proc Natl Acad Sci U S A. 1998;95(4):1961–1967. doi: 10.1073/pnas.95.4.1961. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hastings A, Cuddington K, Davies K. The spatial spread of invasions: new developments in theory and evidence. Ecol Lett. 2005;8(1):91–101. [Google Scholar]
Hudson RR. Testing the constant-rate neutral allele model with protein-sequence data. Evolution. 1983;37(1):203–217. doi: 10.1111/j.1558-5646.1983.tb05528.x. [DOI] [PubMed] [Google Scholar]
Hudson RR. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002;18(2):337–338. doi: 10.1093/bioinformatics/18.2.337. [DOI] [PubMed] [Google Scholar]
Krause J, Lalueza-Fox C, Orlando L, et al. The derived FOXP2 variant of modern humans was shared with Neandertals. Curr Biol. (13 co-authors). 2007;17(21):1908–1912. doi: 10.1016/j.cub.2007.10.008. [DOI] [PubMed] [Google Scholar]
Krings M, Stone A, Schmitz RW, Krainitzki H, Stoneking M, Pääbo S. Neandertal DNA sequences and the origin of modern humans. Cell. 1997;90(1):19–30. doi: 10.1016/s0092-8674(00)80310-4. [DOI] [PubMed] [Google Scholar]
Kuhner MK, Beerli P, Yamato J, Felsenstein J. Usefulness of single nucleotide polymorphism data for estimating population parameters. Genetics. 2000;156(1):439–447. doi: 10.1093/genetics/156.1.439. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lalueza-Fox C, Rompler H, Caramelli D, et al. A melanocortin 1 receptor allele suggests varying pigmentation among Neanderthals. Science. (17 co-authors). 2007;318(5855):1453–1455. doi: 10.1126/science.1147417. [DOI] [PubMed] [Google Scholar]
Meng C, Kubatko LS. Detecting hybrid speciation in the presence of incomplete lineage sorting using gene tree incongruence: a model. Theor Popul Biol. 2009;75(1):35–45. doi: 10.1016/j.tpb.2008.10.004. [DOI] [PubMed] [Google Scholar]
Nielsen R, Hubisz M, Clark A. Reconstituting the frequency spectrum of ascertained single-nucleotide polymorphism data. Genetics. 2004;168(4):2373–2382. doi: 10.1534/genetics.104.031039. [DOI] [PMC free article] [PubMed] [Google Scholar]
Noonan JP, Coop G, Kudaravalli S, et al. Sequencing and analysis of Neanderthal genomic DNA. Science. (11 co-authors). 2006;314(5802):1113–1118. doi: 10.1126/science.1131412. [DOI] [PMC free article] [PubMed] [Google Scholar]
Orive ME, Barton NH. Associations between cytoplasmic and nuclear loci in hybridizing populations. Genetics. 2002;162(3):1469–1485. doi: 10.1093/genetics/162.3.1469. [DOI] [PMC free article] [PubMed] [Google Scholar]
Plagnol V, Wall JD. Possible ancestral structure in human populations. PLoS Genet. 2006;2(7):e105. doi: 10.1371/journal.pgen.0020105. [DOI] [PMC free article] [PubMed] [Google Scholar]
Reich D, Thangaraj K, Patterson N, Price AL, Singh L. Reconstructing Indian population history. Nature. 2009;461(7263):489–494. doi: 10.1038/nature08365. [DOI] [PMC free article] [PubMed] [Google Scholar]
Serre D, et al. No evidence of Neandertal mtDNA contribution to early modern humans. PLoS Biol. 2004;2(3):E57. doi: 10.1371/journal.pbio.0020057. [DOI] [PMC free article] [PubMed] [Google Scholar]
Slatkin M. Inbreeding coefficients and coalescence times. Genet Res. 1991;58(2):167–175. doi: 10.1017/s0016672300029827. [DOI] [PubMed] [Google Scholar]
Slatkin M, Pollack JL. Subdivision in an ancestral species creates asymmetry in gene trees. Mol Biol Evol. 2008;25(10):2241–2246. doi: 10.1093/molbev/msn172. [DOI] [PMC free article] [PubMed] [Google Scholar]
Tajima F. Evolutionary relationship of DNA sequences in finite populations. Genetics. 1983;105:437–460. doi: 10.1093/genetics/105.2.437. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wall JD. Detecting ancient admixture in humans using sequence polymorphism data. Genetics. 2000;154(3):1271–1279. doi: 10.1093/genetics/154.3.1271. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wall JD, Hammer MF. Archaic admixture in the human genome. Curr Opin Genet Dev. 2006;16(6):606–610. doi: 10.1016/j.gde.2006.09.006. [DOI] [PubMed] [Google Scholar]
Wall JD, Lohmueller KE, Plagnol V. Detecting ancient admixture and estimating demographic parameters in multiple human populations. Mol Biol Evol. 2009;26(8):1823–1827. doi: 10.1093/molbev/msp096. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Data

supp_28_8_2239__index.html^{(811B, html)}

supp_msr048_Supplementary_Information.pdf^{(92.7KB, pdf)}

Movatterモバイル変換

PERMALINK

Testing for Ancient Admixture between Closely Related Populations

Eric Y Durand

Nick Patterson

David Reich

Montgomery Slatkin

Abstract

Introduction

Materials and Methods

A Four-Taxon Statistic to Test for Admixture

A Genealogical Argument for the Expected Frequencies of ABBA and BABA Sites

FIG. 1.

Instantaneous Unidirectional Admixture

FIG. 2.

Expected Counts of ABBA and BABA under the Model of Instantaneous Admixture

A Test for Admixture Insensitive to Many Demographic Assumptions

Constraining the Admixture Proportion

Testing for Admixture without Archaic Sample

Ancestral Subdivision

FIG. 3.

Five-State Markov Chain in P(12)

Thirteen-State Markov Chain in P(123)

Expected Coalescence Time Conditional on Gene Tree Topology

Effect of Sequencing Error onD Statistics

False-Positive Rate in the Presence of Sequencing Error

Synthetic Data Analysis

Power of the Test for Admixture

Effect of Ascertainment Bias

Application to Neandertal Data

Results

Ancient Subdivision Confounds the Test for Admixture

FIG. 4.

Robustness to Sequencing Error

FIG. 5.

Power of the Test for Admixture

FIG. 6.

Robustness to Ascertainment Bias

Application to Neandertal

Testing for Admixture between Neandertals and Modern Humans

Table 1.

Estimating the Proportion of Neandertal Ancestry in Non-Africans

Discussion

Duration and Direction of Gene Flow

Effect of Ascertainment Bias

Effect of Recurrent Mutation

Effect of Mapping Errors

Effect of Natural Selection

Distinguishing between Gene Flow and AS

Concluding Remarks

Supplementary Material

Acknowledgments

Appendix A. Derivation ofD under the Model of Gene Flow

Appendix B

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

Five-State Markov Chain in P₍₁₂₎

Thirteen-State Markov Chain in P₍₁₂₃₎