. 2019 Sep 13;15(9):e1008384. doi:10.1371/journal.pgen.1008384

An approximate full-likelihood method for inferring selection and allele frequency trajectories from DNA sequence data

Aaron J Stern^1,^*,Peter R Wilton²,Rasmus Nielsen^2,³

Editor:Ryan D Hernandez⁴

¹Graduate Group in Computation Biology, University of California, Berkeley, Berkeley, California, United States of America

²Department of Integrative Biology, University of California, Berkeley, Berkeley, California, United States of America

³Department of Statistics, University of California, Berkeley, Berkeley, California, United States of America

⁴UCSF, UNITED STATES

The authors have declared that no competing interests exist.

^✉

* E-mail:ajstern@berkeley.edu

Roles

Aaron J Stern:Conceptualization, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

Peter R Wilton:Software, Writing – original draft

Rasmus Nielsen:Conceptualization, Methodology, Project administration, Resources, Supervision, Writing – original draft, Writing – review & editing

Ryan D Hernandez:Editor

Received 2019 Mar 29; Accepted 2019 Aug 26; Collection date 2019 Sep.

This is an open access article distributed under the terms of theCreative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

PMC Copyright notice

PMCID: PMC6760815 PMID:31518343

Abstract

Most current methods for detecting natural selection from DNA sequence data are limited in that they are either based on summary statistics or a composite likelihood, and as a consequence, do not make full use of the information available in DNA sequence data. We here present a new importance sampling approach for approximating the full likelihood function for the selection coefficient. Our methodCLUES treats the ancestral recombination graph (ARG) as a latent variable that is integrated out using previously published Markov Chain Monte Carlo (MCMC) methods. The method can be used for detecting selection, estimating selection coefficients, testing models of changes in the strength of selection, estimating the time of the start of a selective sweep, and for inferring the allele frequency trajectory of a selected or neutral allele. We perform extensive simulations to evaluate the method and show that it uniformly improves power to detect selection compared to current popular methods such as nSL and SDS, and can provide reliable inferences of allele frequency trajectories under many conditions. We also explore the potential of our method to detect extremely recent changes in the strength of selection. We use the method to infer the past allele frequency trajectory for a lactase persistence SNP (MCM6) in Europeans. We also infer the trajectory of a SNP (EDAR) in Han Chinese, finding evidence that this allele’s age is much older than previously claimed. We also study a set of 11 pigmentation-associated variants. Several genes show evidence of strong selection particularly within the last 5,000 years, includingASIP,KITLG, andTYR. However, selection onOCA2/HERC2 seems to be much older and, in contrast to previous claims, we find no evidence of selection onTYRP1.

Author summary

Current methods to study natural selection using modern population genomic data are limited in their power and flexibility. Here, we present a new method to infer natural selection that builds on recent methodological advances in estimating genome-wide genealogies. By using importance sampling we are able to efficiently estimate the likelihood function of the selection coefficient. We show our method improves power to test for selection over competing methods across a diverse range of scenarios, and also accurately infers the selection coefficient. We also demonstrate a novel capability of our model, using it to infer the allele’s frequency over time. We validate these results with a study of a lactase persistence SNP in Europeans, and also study a SNP atEDAR, as well as a set of 11 pigmentation-associated variants.

Introduction

Direct observation of the change in allele frequency over time (the allele frequency trajectory) allows one to make powerful inferences regarding whether selection acted on the allele [1,2]. However, outside of certain contexts such as experimental evolution of viruses or bacteria [3–6] or analyses of ancient DNA samples [7,8], in most cases such direct observations of allele frequencies at multiple points in the history of a population are unavailable. Instead, selection must be inferred from contemporary, modern data. A wide variety of methods have been developed to detect selection based on patterns observed from modern DNA sequences (e.g. [9–11]).

The hitch-hiking effect provides a key signature of selection in modern datasets [12,13]. Hitch-hiking causes aberrations in the spatial pattern of genetic diversity, including the site frequency spectrum (SFS) [14,15] and the pattern of haplotype homozygosity [9]. Methods designed to detect these aberrations are particularly useful in the setting where a single population is surveyed, and the only information available is variation within this single population.

The most familiar methods for detecting selection are based on linear functions of the SFS, such as Tajima’sD, Fu and Li’sD, or Fay and Wu’sH [14,16,17]. An advantage of SFS-based methods is that they do not require the data to be phased. However, these methods have several limitations: they tend to confound selection with other non-equilibrium conditions, such as a fluctuating population size [10,18]; they are not suitable for estimating parameters such as the value of the selection coefficients; significance can usually only be established using an empirical null distribution; and crucially, these methods do not incorporate any features of the haplotype structure.

To make fuller use of information provided by phased sequence data, a number of methods have incorporated summary statistics based on haplotype structure. In a broad sense, these methods are based on calculations of haplotype similarity in a window around some core site of interest [9]. Several methods have adapted this general concept to specifically detect ongoing selection [11,19,20]. More recently, [21] showed that the density of singletons surrounding a focal SNP can be a powerful signal of extremely recent selection in large cohorts. In addition to recent and ongoing selection, it has been demonstrated that these methods have compelling advantages to detecting selection from standing variation [20–22]. However, these methods share the major limitation of SFS-based method in that they are not suitable for parametric inference and it is unclear how to establish significance without use of an empirical null model.

Recently, supervised machine learning methods have been proposed as an alternative to traditional summary-statistic based methods (see e.g., [23]). Standard machine learning techniques applied to population genomic data afford some major advantages over methods based on summary-statistics: standard techniques can produce accurate classifiers based on summaries of the data that live in much higher-dimensional space than the aforementioned summary statistics, and these techniques often encompass a wide space of classification functions that are often non-linear (see e.g. [24,25]). Some studies have demonstrated these methods can have improved robustness to demographic model mis-specification [22,26]. Although these methods can potentially detect complex patterns left by selection, they demand training on large data sets which typically are simulated using models that may not accurately correspond to the empirical data.

In contrast to the aforementioned methods, one might aim to develop a full likelihood method which would take into account the full data set, rather than merely summary statistics. A common strategy for obtaining the full likelihood has been to find the distribution of the genealogy under selection. For example, Krone and Neuhauser described the distribution of the coalescence tree of a locus under weak selection and no recombination [27]. Alternatively, one can describe how the genealogy depends on the trajectory of the derived allele (first described by [28]), and in turn how the trajectory depends on selection. To this end, Coop and Griffiths [29] developed a sampling method for approximating the full likelihood of the selection coefficient. Their method uses sampling to marginalize out two layers of latent variables: the allele frequency trajectory and genealogy of the locus. To estimate the likelihood function, they perform random sampling of both the trajectory, and the genealogy conditioned on the trajectory. Unfortunately, selection likelihood methods that consider the both coalescence and recombination are generally considered computationally intractable.

Composite likelihood methods (see e.g. [10,30]) are able to approximate the likelihood function using tractable expressions for the frequency distribution of a neutral site linked to the selected site [15,31]. These methods approximate the joint distribution of frequencies observed at linked sites as the product of their marginals. These approaches can be applied to test for selection, and estimate the strength of selection. The approximations made by composite likelihood methods are more accurate under strong selection (arguably beyond the strength of most recent selection in humans), and thus have less power to detect weak selection—although to some extent low power to detect weak selection is a natural outcome of any selection method.

Approximate Bayesian computation (ABC) and rejection sampling methods approximate the likelihood function by simulation. One advantage over the composite likelihood approach is that ABC can capture dependencies between linked neutral sites. For example, methods have been used to jointly infer the strength and timing of selection acting on a locus and determine whether a sweep occurred from ade novovs standing variant [32–35]. However, a major disadvantage of such approaches is that the amount of simulation necessary to obtain an accurate estimate grows dramatically with the dimensionality of the model parameters. There is an additional tradeoff between information utilized from the data and computational burden; as the number of the summary statistics used increases, the number of simulations required to approximate the likelihood at a fixed parameter value also increases (for a discussion, see e.g. [36]).

The method we present in this paper draws inspiration from the Coop & Griffiths method [29], and has several key similarities: our method produces a likelihood; involves integrating out the allele frequency trajectory and genealogy, i.e., the aforementioned two hidden layers; and both methods account for selection by modeling how allele frequency changes depend on selection. However, there are several key differences between this method and our approach: while Coop and Griffiths assume no recombination of the locus, our method is based on the coalescent with recombination (i.e. the ancestral recombination graph or ARG) [37]. Also, whereas Coop & Griffiths simulate random trajectories, we use dynamic programming algorithms similar to those used in Hidden Markov Models (HMMs) to completely marginalize the latent trajectory. The hidden states represent allele frequencies and the emission probabilities are coalescence probabilities. While the framework is not a traditional HMM because the process is time-inhomogenous and the emission space changes with time, the similarities with traditional HMMs are, nonetheless, so significant that we will refer to this as an HMM. Lastly, our method uses a novel importance sampling scheme that allows us to sample ARGs assuming a neutral prior, and find the likelihood function at arbitrary values ofs; this drastically reduces the amount of ARG sampling necessary.

Furthermore, the new method is, to our knowledge, the first that is capable of inferring the allele frequency trajectories for models with recombination and selection using only modern data. We are able to accomplish this task using the aforementioned Markovian structure of both coalescence and the trajectory, forming a HMM over these two hidden states and solving for the posterior marginals of each hidden allele frequency state over time. Recently, Edge & Coop proposed a method to reconstruct changes to polygenic scores over time via such estimates of the local trees, but their method is not suitable for estimating allele frequency changes or selection at individual loci [38].

Materials and methods

Overview

We begin with an overview of our method for jointly inferring selection and the allele frequency trajectory, which we summarize inFig 1. Our method begins with input in the form of phased SNP data from a linked genomic region (Fig 1A), although technically, it is also possible to use unphased data, and sample possible phasings. While the method generalizes to arbitrary sample size, we recommend usingn = 25 − 100 diploid individuals when using ARGweaver as done in this study, as ARGweaver runtime increases quadratically with sample size. The method also generalizes to arbitrarily long regions, although we recommend using regions of 10² − 10³ kb, roughly the size of many LD blocks in the human genome [39].

Next, we sample the genealogy at the selected site from its posterior distribution, assuming a neutral model (Fig 1B). By sampling over this distribution, we marginalize out the hidden coalescence events, the first of two latent variables or “hidden layers” in our model. Specifically, we sample the full ancestral recombination graph (ARG) of the input haplotypes. The ARG is a graph that summarizes all of the common ancestry and recombination events that have occurred within the sample. We sample ARGs rather than gene trees in order to account for recombination, and to incorporate information from sites in long-range linkage disequilibrium with the selected site. Then we extract the genealogy at the site of interest (the “local tree”) and from here on, this is the only component of the ARG that goes into our subsequent calculations. To perform ARG sampling, we choose to use ARGweaver [37], which is the only currently available method to sample the posterior ARG. In practice, it is possible and straightforward to adapt this method to other ARG inference methods designed for larger samples, but sampling the posterior yields beneficial statistical properties (see “Importance Sampling” underMaterials and methods).

Then, for each local tree we have sampled, we form a hidden Markov model (HMM,Fig 1C) indexed in time according to the discretization chosen for ARGweaver; in this HMM, observed states are coalescence times in this local tree and hidden states are the selected allele’s frequency trajectory over time (i.e., the second hidden layer of our overall model). We use a discrete-time model of the coalescent process to match the model used by ARGweaver, so that the length of the HMM is of manageable, finite length. Emission probabilities (i.e., coalescence probabilities) depend both on the allele frequency and the current coalescent state. Hence, the model is time-inhomogenous as the coalescent state changes through time. However, the dependence structure is otherwise identical to traditional HMMs and all the usual dynamic programming algorithms apply. The transition probabilities of allele frequencies depend on the selection coefficients, the parameter we are ultimately interested in estimating. Marginalizing out the allele frequency trajectory from the HMM yields the probability of the sample local tree as a function ofs. To obtain the likelihood function ofs, we perform importance sampling over all sample trees, reweighting their coalescent probabilities and summing them up. This approach allows us to use trees sampled exclusively under a prior of selective neutrality (s = 0) to calculate the likelihood function at arbitrary values ofs. In other words, this approach allows us to minimize the amount of ARG sampling necessary to estimate the likelihood function, which is notable because ARG sampling is generally the most computationally intensive step of our method.

Finally, we can analyze the results to test for selection or estimate the selection coefficient (Fig 1D). Additionally, we show that we can decode the HMMs depicted inFig 1C and use them to obtain a posterior estimate of the allele frequency trajectory (Fig 1E).

A glossary to accompany the following derivations and description of the method is available inS1 Text.

Coalescent model for a site under selection

First, let us consider how the distribution of the local treeT at a site under selection depends on the frequency trajectory of an allele at that site. We assume that the tree is labeled, i.e. we know which branches subtend each allele. We also assume the tree to be compatible with the infinite sites assumption, i.e. that there is at most one mutation event that has occurred at the focal site, and thus the site is bi-allelic. We model the likelihood of the tree using a structured coalescent; moving backwards in time from the time of sampling until the time of the mutation, lineages can only coalesce with other lineages that subtend the same allele, and the coalescence rate within the derived and ancestral classes depends on both the derived allele frequencyX(t) and the effective population sizeN(t), both indexed by the timet ≥ 0 in coalescent units before the present day. Proceeding back in time, lineages coalesce freely after the time of mutation, and the coalescence rate depends only onN(t). In the rest of this section we treat the trajectoryX(t) as known, but in practice the trajectory is hidden and highly stochastic; in a later section we develop a hidden Markov model to efficiently integrate outX(t).

We use a discrete-time model of the coalescent employed also by ARGweaver [37]. That is, we only observe the coalescent process at a set ofK discrete timepoints {t₁, …,t_K}, and also make the additional assumption that all lineages must coalesce byt_K. (Typicallyt_K is set to ∼100 ×N_e, implying coalescence would be extremely unlikely to occur aftert_K, and hence this assumption is very reasonable). Henceforth, using this discretization we also discretizeX andN; we assumeX(t) =X_i fort ∈ (t_i,t_i+1], andN(t) =N_i fort ∈ (t_i,t_i+1].

We useC to track the number of lineages remaining at these timepoints leading back into the past; as long as we keep track of the number of lineages belonging to each of the allelic classes, by exchangeability of lineages within an allelic class, we can model the likelihood function in the usual way, as independent of the topology given the waiting times. Hence, we define three simultaneous, related processesC = (C^der,C^anc,C^mix). The processesC^der andC^anc refer to coalescence within the derived and ancestral classes during the time going back from the time of sampling to the time of the mutation. The mixed processC^mix refers to coalescence going backwards from the time of the mutation. We call it the mixed process because it includes un-coalesced lineages fromC^anc, as well as the lineage ancestral to all derived lineages. Assuming the infinite sites model,C^mix will have one additional lineage relative toC^anc at the time of the mutation, and will eventually reachC^anc =C^mix once that lineage coalesces with one of the other lineages in the ancestral class. InFig 2 andTable 1, we illustrate the lines-of-descent process in the these three classes.

Table 1. (Companion toFig 2) The numbers of derived, ancestral, and mixed lineages at each time point.

Numbers with unshaded cells factor into the likelihood calculation, whereas numbers with shaded cells do not.

Time	0	1	2	3	4	5
C^der	4	3	2	1	1	1
C^anc	3	3	2	2	2	1
C^mix	4	4	3	3	2	1

Open in a new tab

We model the probability of transitioning fromC_i →C_i+1 lineages during some time interval [t_i,t_i+1] using a simple variation of Tavare’s formula for the exact distribution of the number of lines of descent remaining aftert generations [40]. We use Tavare’s formula in order to model the coalescent at discrete timepoints, allowing multiple coalescences at each epoch.

We write the probability ofC given the trajectoryX (note this is distinct from the full likelihood ofs) as

\begin{matrix} P (C ∣ X, N) = \prod_{i = 0}^{K - 1} P (C_{i + 1} ∣ C_{i}, X_{i}, N_{i}) \end{matrix}

(1)

More precisely, in terms of the derived, ancestral, and mixed processes,

\begin{matrix} P (C ∣ X, N) = \prod_{i = 0}^{i^{*} - 1} P (C_{i + 1}^{der} ∣ C_{i}^{der}, X_{i}, N_{i}) P (C_{i + 1}^{anc} ∣ C_{i}^{anc}, X_{i}, N_{i}) \times \prod_{i = i^{*}}^{K - 1} P (C_{i + 1}^{mix} ∣ C_{i}^{mix}, N_{i}) \end{matrix}

(2)

wherei* ≔ max{i:X_i > 0} denotes the index of the epoch during which the allele arose via mutation. Naturally, the mixed process—which we only keep track of while the derived allele is nonexistent—does not depend onX. We can write the transition probabilities using a variation of Tavare’s formula for the transition probabilities of the number of lines of descent [40]; in place of the effective population size, we substitute the size of a allelic classz^class in order to reflect the coalescence rate within an allelic class:

P (C_{i + 1}^{class} = b ∣ C_{i}^{class} = a, Z_{i} = z_{i}^{class}) = \sum_{k = b}^{a} {exp (\frac{- (\begin{matrix} k \\ 2 \end{matrix})}{2 z_{i}} (t_{i + 1} - t_{i})) \frac{(2 k - 1) {(- 1)}^{k - b}}{b! (k - b)! (k + b - 1)} \prod_{l = 0}^{k - 1} \frac{(b + l) (a - l)}{(a + l)}}

(3)

where

\begin{matrix} z_{i}^{class} = {\begin{matrix} N_{i} X_{i} & : class = der \\ N_{i} (1 - X_{i}) & : class = anc \\ N_{i} & : class = mix \end{matrix} \end{matrix}

(4)

We note that this formula is known to be computationally unstable for large values ofC, large values ofN, and/or small values of Δt_i =t_i+1 −t_i; under such conditions, the asymptotic distribution of $C_{i + 1}^{class} ∣ C_{i}^{class} = a$ (wherea is, e.g., the number of derived lines of descent present att_i) takes on a normal distribution [41]:

\begin{matrix} C_{i + 1}^{class} ∣ C_{i}^{class} = a \sim N (μ (Δ t), σ^{2} (Δ t)) \end{matrix}

(5)

where

\begin{matrix} μ (Δ t) = \frac{2 η}{Δ τ} \end{matrix}

(6)

and

\begin{matrix} σ^{2} (Δ t) = 2 η / Δ τ {(η + β)}^{2} (1 + η / (η + β) - η / α - η / (α + β) - 2 η) β^{- 2} \end{matrix}

(7)

and

\begin{matrix} Δ τ = Δ t / (2 z_{i}^{class}) \end{matrix}

(8)

whereα =aΔτ/2,β = −Δτ/2, andη =αβ/[α(e^β − 1) +βe^b] [41]. In practice, for samples ofn = 50 haplotypes under constantN_e = 10⁴, we find this approximation is unnecessary; however, for the same sample size under a European demographic model, which exhibits very large recentN_e, we find it necessary to use this approximation during the roughly 10³ generations preceding the present day, prior to which Δt is sufficiently large that we change over to Tavare’s exact formula [42].

Marginalizing the hidden allele frequency states

In the previous sections we showed how we obtain $P (C ∣ X)$ and $P (X ∣ s)$ . Here we illustrate how to model coalescence in the two allelic classes when the trajectoryX is as a random latent variable. The probability ofC givens is thus

\begin{matrix} P (C ∣ s) = \sum_{x \in X} P (C ∣ X = x) P (X = x ∣ s) . \end{matrix}

(9)

Naively, this involves a prohibitively large sum overd^K−1 terms in $X$ , the space of possible trajectories, whered represents the number of possible values that the allele frequency can take. But due to the conditional independence of the likelihood, we can calculate the likelihood much faster using a recursion similar to the forward and backward algorithms commonly deployed on HMMs; we show the derivations of these recursions inS1 Text. In essence, this algorithm works by iteratively marginalizing out the allele frequency in each epoch; the backward recursion marginalizes frequency during earlier epochs first and later epochs last, and the forward recursions marginalizes in reverse order. At each epochi the recursions yieldb_i(x_i) andf_i(x_i); these correspond to $P (C_{1 : i} ∣ X_{i}, N_{i - 1})$ and $P (C_{i + 1 : K - 1}, X_{i} ∣ X_{i}, N_{i})$ , whereC_a:b =C_a,C_a+1, …,C_b.

Consequently, the two recursions can be used together to obtain the the posterior probability of the allele frequency during theith epochX_i,

\begin{matrix} P (X_{i} = x_{i} ∣ C, s) = \frac{b_{i} (x_{i}) f_{i} (x_{i})}{\sum_{x_{i}^{'}} b_{i} (x_{i}^{'}) f_{i} (x_{i}^{'})} \end{matrix}

(10)

which gives the posterior marginal ofX_i using the familiar forward-backward algorithm.

Importance sampling to estimate the likelihood function

The above formulas pertain immediately only to the case in which the local tree is observed directly and without noise. In practical settings, the local tree is hidden to us and we must integrate over the space of possible local trees using sampling methods. Here we describe a novel importance sampling method to reweight posterior samples of the ARG to approximate the likelihood function of selection. Although we uses to express the argument of the likelihood function, we use this as shorthand for estimating the likelihood function of arbitrarily complex parameters; for example, one could estimate the selection coefficients, as well as the time of selection’s onset,t_s, before which the allele behaved neutrally.

We are given haplotype dataD representingn haplotypes withl segregating sites. We wish to useD to infer the maximum-likelihood value ofs for some sitek ∈ {1, 2, …,l}, wherel is the number of sites in the region, assuming that all other sites are selectively neutral (i.e.s_j = 0 ∀j ∈ {1, 2, …,k − 1,k + 1, …,l}). In other words, we restrict ourselves to testing simple hypotheses of the form “sitek has selection coefficients_k and all of its flanking sites are selectively neutral.”

The likelihood ofs under the data can be expressed as the expected value of the likelihood of the ARG $G$ given the dataD, with respect to the distribution of $G$ givens:

\begin{matrix} L (s) = E_{G ∣ s} [P (D ∣ G, s)] \end{matrix}

(11)

At this stage, we introduceG, the discrete-time approximation of $G$ (discussed in more detail by [37]), and we assume

\begin{matrix} L (s) = E_{G ∣ s} [P (D ∣ G, s)] \end{matrix}

(12)

By importance sampling, we are able to express the expectation over an alternative distributionq(G), as long as $P (G, D ∣ s) > 0 \Rightarrow q (G) > 0$ . Notice that this implies we can conduct sampling underq(G) once, and reweight these samples for arbitrary values ofs without having to conduct additional sampling. In other words, approximatingL(s) using importance sampling does not require sampling under each value ofs at which you want to approximateL(s).

In this paper we specifically consider the importance sampling proposal distribution $q (G) = P (G ∣ D, s = 0)$ , which corresponds to the posterior ARG assuming a neutral model. Later, we evaluate the performance of the estimator using the Markov chain Monte Carlo method ARGweaver, which samples from the posterior [37]. One can obtain the importance sampling estimate of the full likelihoodL(s) by expressingEq 12 as an expectation over a different distribution, i.e. the posterior distribution of the ARG (assuming selective neutrality):

\begin{matrix} L (s) = E_{G ∣ s} [P (D ∣ G, s)] = E_{G ∣ D, s = 0} [\frac{P (D ∣ G, s) P (G ∣ s)}{P (G ∣ D, s = 0)}] \end{matrix}

(13)

We can expressEq 13 using the Monte Carlo approximation

\begin{matrix} \hat{L} (s) = \frac{1}{M} \sum_{m = 1}^{M} P (D ∣ G^{(m)}, s) \frac{P (G^{(m)} ∣ s)}{P (G^{(m)} ∣ D, s = 0)} \to L (s) \end{matrix}

(14)

where $G^{(m)} \sim P (G ∣ D, s = 0), m = 1, 2, \dots, M$ , and “→”, here and in the following, means that the left-hand side converges almost surely to the right-hand side asM goes to infinity, assuming that a Law of Large Numbers for ergodic processes holds (the Birkhoff–Khinchin theorem).

Hence, if we sample genealogies from the posterior under selective neutrality, that is, $G^{(m)} \sim P (G ∣ D, s = 0), m = 1, 2, \dots, M$ (whereM is the number of ARGs sampled), then the right-hand side ofEq 14 can be used as a Monte Carlo estimator of the likelihood function. However, in practice this estimator is highly unstable. However, a more stable estimator of the likelihood ratio $\frac{L (s)}{L (s = 0)}$ can be derived. We can divide throughEq 13 by $L (s = 0) = P (D ∣ s = 0)$ to get

\begin{matrix} \frac{L (s)}{L (s = 0)} = E_{G ∣ D, s = 0} [\frac{P (D, G ∣ s)}{P (D, G ∣ s = 0)}] \end{matrix}

(15)

Because we assume the data are conditionally independent of selection given the full ARG, we can simplify this as

\begin{matrix} \frac{L (s)}{L (s = 0)} = E_{G ∣ D, s = 0} [\frac{P (D ∣ G) P (G ∣ s)}{P (D ∣ G) P (G ∣ s = 0)}] = E_{G ∣ D, s = 0} [\frac{P (G ∣ s)}{P (G ∣ s = 0)}] \end{matrix}

(16)

A key development in our method is that although we sample the ARG of the entire sequence, we only calculate likelihoods using the marginal tree at the selected site, which we will callG_k. First, let us defineG_\k as the rest of the ARG omitting the local tree at sitek,G_k. Consequently,G is equivalent to (G_k,G_\k).

We make a key assumption that, for differing sweep parameterss ands′

\begin{matrix} P (G_{\ k} ∣ G_{k}, s) \approx P (G_{\ k} ∣ G_{k}, s^{'}) \end{matrix}

(17)

That is, we assume thatG_\k is approximately conditionally independent ofs given the marginal tree at the selected site,G_k. Thus, we can reduceEq 16 to

\begin{matrix} \frac{L (s)}{L (s = 0)} & = E_{G ∣ D, s = 0} [\frac{P (G ∣ s)}{P (G ∣ s = 0)}] \\ = E_{G ∣ D, s = 0} [\frac{P (G_{\ k} ∣ G_{k}, s)}{P (G_{\ k} ∣ G_{k}, s = 0)} \frac{P (G_{k} ∣ s)}{P (G_{k} ∣ s = 0)}] \\ \approx E_{G ∣ D, s = 0} [\frac{P (G_{k} ∣ s)}{P (G_{k} ∣ s = 0)}] \end{matrix}

which suggests the following importance sampling estimator using genealogies sampled from ARGweaver will converge almost surely to a close approximation to the likelihood ratio:

\begin{matrix} \hat{LR} (s) = \frac{1}{M} \sum_{m = 1}^{M} \frac{P (G_{k}^{(m)} ∣ s)}{P (G_{k}^{(m)} ∣ s = 0)} \to E_{G ∣ D, s = 0} [\frac{P (G_{k} ∣ s)}{P (G_{k} ∣ s = 0)}] \approx \frac{L (s)}{L (s = 0)} \end{matrix}

(18)

where $G^{(m)} \sim P (G | D, s = 0)$ form = 1, 2, …,M.

Finally, due to exchangeability of lineages within the derived and ancestral allelic classes, we can assume

\begin{matrix} P (G_{k} ∣ s) \propto P (C_{k} ∣ s) \Rightarrow \hat{LR} (s) = \frac{1}{M} \sum_{m = 1}^{M} Ω^{(m)} (s) \end{matrix}

(19)

where

\begin{matrix} Ω^{(m)} (s) ≔ \frac{P (C_{k}^{(m)} ∣ s)}{P (C_{k}^{(m)} ∣ s = 0)} \end{matrix}

(20)

denotes the summand of the importance sampling estimator.

We can maximize the likelihood ratio over different values ofs to obtain the maximum-likelihood estimate ofs

\begin{matrix} \hat{s} = {argmax}_{s} \hat{LR} (s) \end{matrix}

(21)

Finally, we show inS1 Text that an importance sampling estimate ofπ(x_i∣D,s), the posterior marginal of the allele frequency at timepointi,X_i, is given by

\begin{matrix} \hat{π} (x_{i} ∣ D, s) ≔ \frac{\sum_{m = 1}^{M} P (X_{i} ∣ C_{k}^{(m)}, s) Ω^{(m)} (s)}{\sum_{m = 1}^{M} Ω^{(m)} (s)} \end{matrix}

(22)

where in the summand we use the posterior marginal established inEq 10. In practice, we fix $s = \hat{s}$ . A concern is, therefore, that this estimator does not take uncertainty in the estimate ofs into account. This problem can be addressed by using a Bayesian approach, which we demonstrate briefly inS1 Text.

The method is implemented in package namedCLUES, available for download athttps://github.com/35ajstern/clues, with accompanying documentation currently provided inS1 Text. In this paper and the currect software release we assume positive directional selection with an additive effect on fitness, our method can be easily extended to general dominance relationships as well as negative selection.

Simulations

To evaluate the power ofCLUES to determine whether a site has been subject to selection, we simulated a dataset ofn = 25 diploid individuals under two different demographic models; (1) a model of constant effective population size (N = 10⁴), and (2) a model of European (CEU) demography [43]. We performed both sets of simulations using the programdiscoal [44]. We setμ = 2r = 2.5 × 10⁻⁸ mut/bp/gen,L = 1 × 10⁵ bp or 2 × 10⁵ bp for the constant-size and CEU models, respectively, and simulated conditional on a variety of present-day frequencies and selection coefficients, the latter of which we ranged from weak to strong values. Under each condition, we simulated 100 independent iterations. We also sampled 1 ancient haplotype; because ARGweaver, which we used subsequently to sample the posterior ARG, does not incorporate any information about ancestral/derived states, it is best practice to add an ancient individual or outgroup to help polarize the the alleles. In practical settings where the ancestral state is unknown, ARGWeaver accomodates specification of missing data on the ancient haplotype. For the constant-size and CEU models, we used ancient sampling dates of 2 × 10⁴ and 1.6 × 10⁴ generations before present, respectively. Becausediscoal can only simulate piecewise-constant population sizes, we specified population sizes to take on the value of their harmonic mean over the epoch, calculated from the original CEU model. Commands to run simulations of trajectories, local trees, and haplotypes are described inS1 Text.

Importantly, we conditioned simulations on the site of interest segregating at a particular frequency in the present day. Hence, when we considered the power to discriminate between neutral and selected alleles, we controlled the present-day frequency to be equal in both of these cases. Avoiding this step would otherwise upwardly bias estimates of the statistical power, due simply to the tendency for selected alleles to segregate at higher frequencies than neutral alleles [45]. (If the allele frequency in itself is also of interest, this part of the likelihood could trivially be added at a later stage, by simply using the stationary distribution of the allele frequency; see “Allele frequency transition probabilities” inS1 Text). We then simulate the allele frequency backwards in time, from the present-day frequency, until the allele reaches a frequency of 0. Simulators such asdiscoal achieve this by using the conditional Wright-Fisher diffusion (see e.g. [46]). In the case where effective population size changes over time, running conditional simulations requires additional considerations because the probability of a mutation entering the population scales approximately linearly with population size. Naively sampling the trajectory backwards in time will therefore produce a bias, unless trajectories where the mutation occurs whileN_e is low are somehow penalized. Thus, approaches such as reweighting sample trajectories using importance sampling have been used to correct this bias [47]. The programdiscoal implements a similar bias-correcting scheme using rejection sampling that rejects trajectories where the mutation occurs whileN_e is low with higher probability than trajectories where the mutation occurs whileN_e is high.

Next, we inferred the posterior ARG given the sequence data we simulated using ARGweaver [37]. This method works by proposing adjustments to an initial ARG, and randomly accepting or rejecting these proposals based on calculations of the prior probability of the proposed ARG, as well as its likelihood given the sequence data. Because the prior probability is based on the effective population size, we specified the same effective population size in the prior as we used to generate the sequence data. We found it important to adjust the proposal mechanism of ARGweaver; specifically, we adjusted resample window size and the number of resamples per window to achieve an acceptance rate of about 30-70%. In total, we sampled 3 × 10³ ARGs for each simulation, discarding the first 1 × 10³ as a burn-in period, and subsequently thinning the remaining samples to reduce the computational burden of downstream analyses; we used a thinning rate of 100 samples, resulting inM = 20 approximately independent samples. Reducing the thinning rate would increase accuracy and convergence of the inference at the cost of additional computation to calculate the likelihood of each additional sample tree. Commands to conduct ARG-sampling and local tree extraction are described inS1 Text.

Using utilities in the ARGweaver package, we extracted local trees at the selected site (at the center of the locus) from these sample ARGs. We then analyze this final set of trees usingCLUES. We also analyzed the same sequence data using nSL,H₁₂, and Tajima’sD [14,19,48]. The nSL method is essentially equivalent to iHS [11], except nSL does not require specifying a genetic map; despite this, these methods have been shown to have very similar statistical power with a slight advantage of nSL under some conditions.H₁₂ is a method to calculate haplotype homozygosity merging the two most common haplogroups; thus, it is a test for selection that is robust to the origin of a sweep, i.e. whether it is hard or soft. Tajima’sD is a site frequency spectrum-based statistic which is sensitive to skews in the frequency distribution of linked alleles caused by hitchhiking on the partially swept selected allele. We used scripts provided by [22] to calculateD andH₁₂, using a window size of 100kb centered on the selected site. We compare testing for selection under these methods by comparing their power curves under both the constantN_e and CEU demography models (Figs3 and4).

Fig 3 — Rows correspond to simulations conditioned on the same present-day allele frequency (A-C: 25%; D-F: 50%; G-I: 75%), and columns correspond to simulations with the same value ofs (A,D,G: 0.001; B,E,H: 0.003; C,F,I: 0.01). Simulations were performed under a model of constant effective population size (N_e = 10⁴) using a locus of 100kb,n = 25 diploid individuals andμ = 2.5 × 10⁻⁸ mut/bp/gen,r = 1.25 × 10⁻⁸ recombinations/bp/gen.

Fig 4 — Rows correspond to simulations conditioned on the same present-day allele frequency (A-C: 25%; D-F: 50%; G-I: 75%), and columns correspond to simulations with the same value ofs (A,D,G: 0.001; B,E,H: 0.003; C,F,I: 0.01). Simulations were performed under a model of European demography using a locus of 200kb,n = 25 diploid individuals andμ = 2.5 × 10⁻⁸ mut/bp/gen,r = 1.25 × 10⁻⁸ recombinations/bp/gen.

We also conducted a similar simulation study for detecting recent selection starting 100 generations ago. We simulated under the same CEU demographic model as previously described, but instead samplen = 50 diploids. We conducted ARG sampling and thinning as previously described, but in our analysis of the sample trees usingCLUES, we calculated the likelihood for models of selection wheres = 0 up until 100 generations ago, ands ≥ 0 from that point until the present day. This sweep from standing variation (SSV) model differs from the hard sweep model we used previously, which assumess is constant throughout history. Instead of optimizing the likelihood function only with respect tos, we optimized with respect to two parameters,s andt_s, jointly; heret_s represents the time of the onset of selection.

Results

Testing for selection

We found that across all scenarios,CLUES matches or exceeds the statistical power of the other methods evaluated (Figs3 and4). As expected, all methods had highest power under large values of both the selection coefficient and the derived allele frequency (Fig 3I). Under these conditions,CLUES had 100% power at the 1% significance threshhold; the next most powerful method, nSL, had 68% power at the same significance level.CLUES also demonstrated improvement in power under weak selection; as the selection coefficient was decreased, nSL retained about 20% power whens = 0.003 and <5% power whens = 0.001, and Tajima’sD andH₁₂ retained <5% power under boths = 0.001, 0.003 (Fig 3G and 3H). By contrast,CLUES retained approximately 45% and 90% power unders = 0.001, 0.003, respectively. We conclude thatCLUES has high power across a wide regime of selection strengths, and has notably improved power over standard methods under weaker values ofs.

We also considered the effect of present-day allele frequency on statistical power. Previous studies have shown a strong dependence of power on current allele frequency, with methods such as nSl and iHs having highest power at allele frequencies in the 70-90% range (see e.g. [11]). We tested for selection at alleles ranging in present day frequency from 25% to 75%, and whileCLUES showed the expected pattern of increasing power with frequency, it also improved on the performance of other methods at lower frequencies. For example, under strong selection (s = 0.01), the power ofCLUES changed from 100% to 90% to 85% as the frequency is decreased from 75% to 50% to 25% (Fig 3C, 3F and 3I). By contrast, the power of the next most powerful method,H₁₂, dropped from approximately 65% to 45% to 15% (Fig 3C, 3F and 3I). Under moderate selection (s = 0.003), these effects were even more drastic, with the power ofCLUES and nSL (the next most powerful method in this regime) changing from 90% to 60% to 50% and 20% to 5% to <5%, respectively. We conclude thatCLUES has high power compared to standard methods across a wide range of allele frequencies, with the most major improvements in performance occurring when the derived allele is at lower frequencies (<50%). We found that using the approximation due to Griffiths (Eq 5, [41]) decreased power ofCLUES by increasing variability of the null distribution of the likelihood ratios. Hence, for testing under nonequilibrium demography we used the exact lines-of-descent probabilities (Eq 3). By contrast, as we will later show, we found the approximation given byEq 5 fort ∈ [0, 1000] to improve estimation of allele frequency trajectories under this demographic model.

We also considered the same testing procedure under non-equilibrium demography, simulating under the previously described model of CEU demography (Figs3 and4). We found in general reduced power to detect selection under this regime relative to the constant population size regime (Fig 4I, cf.Fig 3I), consistent with the well-known confounding of expanding population size with selection [10]. Nonetheless,CLUES demonstrated improved power relative to the competing methods across a wide range of selection coefficients (Fig 4C, 4F and 4I), as well as across a wide range of derived allele frequencies (Fig 4G, 4H and 4I).

Estimating selection coefficients

Using the simulations from the previous section to study statistical power in testing for selection, we used our estimate of the likelihood surface fors to estimate the value of the selection coefficient via maximum likelihood (seeEq 21), restricted to 0 ≤s ≤ 0.5, as we only calculate transition probabilities for this range ofs. We obtained selection coefficient estimates under importance sampling using ARGweaver (Fig 5), as well as selection coefficient estimates based on the true local tree observed directly (S1 Fig). Generally, the estimates are approximately unbiased. For example, the mean estimates ofs = 0, 1 × 10⁻³, 3 × 10⁻³, 1 × 10⁻² were approximately $\bar{\hat{s}} = 1.9 \times 10^{- 4}, 9.6 \times 10^{- 4}, 3.2 \times 10^{- 3}, 1.3 \times 10^{- 2}$ when the present day frequency was fixed to 75% (Fig 5A). Relative to inference when the true tree is observed, we found that the importance sampling estimates had increased variance, reflecting uncertainty in the tree. For example, we saw increased variability in the importance sampling vs. true tree estimates under constant population size (Fig 5A vs.S1A Fig), as well as under CEU demography (Fig 5B vs.S1B Fig). This pattern is consistent with the additional uncertainty ins when the local tree is not observed directly. Notably, we found that importance sampling under a model of CEU demography yields estimates with a slight bias towards lower values ofs, especially under strong selection (e.g.s = 0.01).

Fig 5 — A: Constant population size. B: Tennessen CEU model. Marker color denotes present-day allele frequency (25/50/75% correspond to yellow/red/purple, respectively). Horizontal dashed lines denote the true value of the selection coefficient.

Inferring allele frequency trajectories

Using the same simulations and importance sampling estimates we obtained in the previous sections, we decoded the hidden Markov model (HMM) described in the section Materials & Methods. Specifically, we take $\hat{s}$ , the maximum likelihood estimate ofs, and plug it into the posterior marginal (Eq 10) to obtain a probabilistic estimate of the allele frequency during a particular epoch; we do this independently for each epoch in our discrete-time model. To get a point estimate, we choose to use the posterior marginal mean; i.e., for each epoch, we choose the mean of the posterior marginal distribution. We illustrate the accuracy of these allele frequency trajectory estimates assuming the true local tree is observed and under importance sampling when the true tree is unknown inFig 6. We find that estimates of the allele frequency trajectory are generally unbiased for both true trees (Fig 6A and 6B) and importance sampling (Fig 6C and 6D), with increased variance in the trajectory estimates in the importance sampling setting. We also illustrated variability in true vs. inferred trajectories controlling fors (S6 Fig, here settings = 0).

Whereas inference tended to be relatively accurate for high-frequency alleles (Fig 6B and 6D), when the derived allele was simulated conditioned on lower frequencies (e.g. 25%,Fig 6C), estimates tend to be downwardly biased. We tracked this bias to a lack of convergence in ARGweaver; specifically, we found that across different demographic scenarios and selection coefficients, ARGweaver can drastically overestimate the occurrence of very recent coalescences (in our case, in the last 100 generations; seeS5 Fig). Under constant population size, we see a nearly 7-fold excess in the number of recent coalescences inferred by ARGweaver. Naturally, this bias will affect estimates for low-frequency alleles more strongly, as fewer lineages subtend the derived allele, and thus a larger proportion of them are susceptible to this bias.

Because recombination rates vary substantially throughout the genomes of humans and other organisms, we also evaluated the accuracy of the estimates assumingμ =r, larger than theμ = 2ρ setting we used in the other simulations, and estimation accuracy to be robust to this increase in recombination rate (S2 Fig).

We also examined trajectory inference under non-equilibrium demography; i.e., the aforementioned model of CEU demography (S3 Fig). Under the CEU model, we found trajectory estimates to have increased variance under importance sampling vs. true trees, but also a slight downward bias in estimating the selection coefficient under strong selection (i.e.s = 0.01; seeFig 5B,S3D Fig). As this bias does not occur under the true trees (S1B andS3B Figs), we inspected the posterior trees sampled by ARGweaver for patterns consistent with this bias. We found that under this demographic model in particular, ARGweaver tends to under-sample trees with short times to most recent common ancestor (TMRCAs; seeS4 Fig). For reference, nearly 60% of runs under constantN_e contained even a single sample tree that had a TMRCA less than or equal to that of the true TMRCA (S4A Fig). By comparison, unders = 0.01 and CEU demography, only 11% of ARGweaver runs met this criterion (S4B Fig). Some bias is to be expected, as trees were sampled under a posterior distribution that assumes selective neutrality; however, these results suggest that, if ARGweaver is sampling from the true posterior assuming selective neutrality, then importance sampling estimates (of the selection coefficient, for example) will at least have much higher variance under the CEU model than under constant population size.

We further investigated whether uncertainty ins due to importance sampling variance drove the downward bias when estimating strong selection (Fig 5B andS3D Fig). First, we obtained importance sampling estimates of the trajectory fixings to its true value (S7A Fig). If uncertainty ins were the cause of the bias, then fixing the true value ofs ought to correct for bias due to uncertainty. While we observe less bias in the estimates when fixing the true value ofs, the bias is not totally eliminated. We observe a similar reduction in the bias of estimates under neutrality when we fixs = 0 (seeS6B, S6E and S6H Fig vs.S6C, S6F and S6I Fig). Thus, we conclude the bias is due to a lack of convergence in ARGweaver, which appears to be exacerbated in settings where strong selection is combined with non-equilibrium demography.

We also investigated whether incorporating uncertainty in the estimate ofs, rather than fixing $s = \hat{s}$ , would improve the accuracy of trajectory inference. One strategy for modeling uncertainty ins is to apply a prior distribution tos. We found that marginalizing outs with respect to its posterior distribution (assuming a uniform prior ons) did not have a noticeable effect on inference for large values ofs (S7B Fig). This result is concordant with our observation that for large values ofs, the likelihood surface peaks so strongly that the posterior remains tightly concentrated around the MLE $\hat{s}$ . Hence, applying a prior distribution tos does not appear to be an adequate strategy to model uncertainty ins.

Inferring extremely recent selection

We applied our likelihood model of a sweep from a standing variant (SSV) to two types of datasets: selection from a standing variant starting 100 generations ago and selection with constants (includings = 0), both described in ‘Simulations’ under Materials and Methods. We inferred trajectories under the best case scenario where the true trees are observed (Fig 7A and 7B). We found that overall the method inferred the trajectory, as well as the strength and timing of selection, with highest accuracy when selection is strong (e.g.s = 0.03 inFig 7A and 7B). However, we found that ass took on smaller values (s = 0.01), many combinations ofs andt_s had very similar likelihood (Fig 7B), and thus estimates ofs,t_s, and the allele frequency trajectory tended to be noisier than under very strong selection (Fig 7A and 7B). Adding the extra parametert_s did not cause overfitting when inferring the trajectories of hard sweeps (Fig 7A). We also found good power to distinguish between hard vs. soft sweeps (i.e. sweeps from a standing variant), as apparent in the trajectories inferred inFig 7A. We calculated statistical power to test for a hard sweep using the statistic $max_{s, t_{s}} {L (s, t_{s})} / max_{s} {L (s, t_{s} = \infty)}$ ; intuitively, this statistic is the ratio of the highest likelihood under any model with a SSV (t_s ≠ ∞) to the highest likelihood of any hard sweep (t_s = ∞). At the 1% significance level we found 60% and 100% power to distinguish soft vs. hard sweeps withs = 0.01, 0.03, respectively.

Fig 7 — (A) Trajectories inferred from true trees under both hard sweeps and recent selection on a standing variant (i.e. soft sweeps) when boths and time of selection onset are unknown. (B) The log-likelihood surface for joint inference ofs and onset of selection, averaged over 100 simulations, taking the true tree as observed. (C) ROC curves (using importance sampling) illustrating performance of tests between selection from a standing variant where onset of selection occurs 100 generations ago. We condition on a present day frequency of 50%.

We also performed importance sampling using ARGweaver and evaluated the power of the importance sampling estimates to detect recent selection vs. neutrality (Fig 7C). Instead of comparing our method to nSL, which is not designed to detect signals of extremely recent selection, we compared to Singleton Density Score (SDS; [21]), as well asH₁₂ and Tajima’sD. We found that for lower values ofs, all methods had generally low power. AlthoughCLUES exhibited fairly high power (44%) to detect very strong recent selection (s = 0.03) —even outperforming SDS—we found thatH₁₂ has about the same power (45%) in this particular case. The lower power (<5%) of SDS is consistent with the fact that the method was explicitly designed to have high power for large datasets (n > 1000 for selection coefficients of this magnitude). Although we demonstrate thatCLUES has substantial power to detect extremely recent selection, we found that importance sampling point estimates ofs,t_s, and the trajectory were highly vulnerable to biases in the distribution sampled by ARGweaver (S5 Fig). Specifically, we found that across various demographic and selection conditions, ARGweaver samples trees with substantially more recent coalescent events than in the true trees. Specifically, under the European demographic model with the settings used here to study recent selection, we find ARGweaver samples about a 4-fold excess of recent coalescent events (S5B Fig). Clearly, this bias would produce a false signature of recent selection under neutral conditions. Thus, we did not further explore importance sampling estimates ofs and the trajectory under the recent selection model. We conclude that potential ARG-sampling methods that avoid this bias will improve upon power to detect recent selection, as well as point estimates of the strength, timing of selection, and the allele frequency trajectory.

Fine-mapping the selected site

Strongly selected alleles tend to have high levels of LD to linked neutral sites, and thus many methods to detect selection are limited in their ability to determine the exact site under selection. To assess whether the likelihood ratio statistic produced byCLUES can be used to fine-map the selected site, we ranCLUES at linked neutral sites in a locus centered on a site under positive selection. Simulations were identical to those used in the simulation study under constantN_e = 10⁴, with a present-day selected allele frequency of 75%,s = 0.01, and 100 independent simulations. We chose sites with the maximal squared correlation coefficientr² to the selected allele, such thatr² did not exceed a threshold value, and vary that threshold from 0.50 to 0.99 (Fig 8). We found that when the true tree is observed (or sampled with high accuracy), the likelihood ratio statistic identifies the selected site correctly in a head-to-head test with the neutral linked site with 85% accuracy even whenr² ≤ 0.99; this quantity reaches 100% forr² ≤ 0.50 (Fig 8A). When the likelihood ratio is estimated using importance sampling via ARGweaver, the accuracy declines to about 50% and 85%, respectively (Fig 8A). Because the exact causal site may not be known in many studies, we also investigated how the estimate of the selection coefficient,s, depends onr² between the site analyzed and the site under selection. We estimates given the true tree, and find that, on average, estimates ofs decline withr², such that forr² ≤ 0.50, the mean estimate ofs at these neutral sites is less than 20% of the true value ofs at the causal site (Fig 8B).

Fig 8 — (A) Probability of correctly identifying the selected site in a head-to-head test with a linked neutral site. Vertical bars represent 95% CIs estimated by plug-in bootstrap. (B) Estimates of the selection coefficient at the causal vs. linked neutral sites, given the true tree. Mean estimates are represented by blue hash marks.

Effects of background selection

We also assessed the effects of linked selection on inference of selection at a particular site. In particular, we consider the effects of background selection (BGS) on inference of positive selection on a linked site (S8 Fig). We performed forward simulations in SLiM 3 [49], assuming a model of constantN_e = 10³, a locus of length 1Mb,n = 25 diploid individuals andμ = 2.5 × 10⁻⁸ mut/bp/gen,r = 1.25 × 10⁻⁸ recombinations/bp/gen. We let mutations be neutral with probability 90% and (negatively) selected with probability 10%. We simulated under deleterious effects ofs = 0, −0.001, −0.003 and -0.01, performing 100 independent replicates under each case. To simulate selection at a focal allele, 100 generations prior to the present day, we choose a random neutral allele conditional on its frequency falling in the interval [0.005,0.015] and its position falling in the interval [4 × 10⁵, 6 × 10⁵]bp (to ensure it is somewhat centered in the 1Mb region), and then endow it with a selection coefficient ofs = 0.15. Prior to this timepoint, we perform a burn-in phase of 19900 generations with only neutral and deleterious mutation. From the sampled haplotypes, we perform importance sampling using ARGweaver to estimate the likelihood of the selection coefficient.

We find that our method is quite robust to BGS; e.g., the median estimate of the selection coefficient is approximately unbiased (mean $\hat{s} = 0.13$ ) as the strength of BGS is increased (S8 Fig). Also, regardless of BGS strength there is nearly perfect power to detect positive selection when comparing to neutral simulations (S8 Fig). We note that the strength of selection on the beneficial allele investigated here is somewhat strong; for weaker selection on a beneficial allele, inference with sites under BGS may be a more significant determinant of the beneficial allele’s trajectory (see e.g. [50]). We also note that our simulations assume a model of equilibrium demography, but under non-equilbrium conditions (e.g., rapid population size expansion) BGS has a magnified effect on neutral diversity, which may further bias estimates of selection [51].

Effects of demographic model misspecification

To explore the effects of demographic model misspecification on inference of selection, we ranCLUES on datasets simulated under a model of European demography (described earlier inMethods), using a mismatched model of constantN_e = 10⁴ to calculate the likelihood, and compare them to calculations under the true demographic model (S9 Fig). We report likelihood ratios (S9A and S9C Fig) as well as estimates of the selection coefficients (S9B and S9D Fig) both given the true tree (S9A and S9B Fig) and approximated via importance sampling using ARGweaver (S9C and S9D Fig). We find that statistical power to detect selection, especially strong selection, is not substantially impeded by model misspecification (S9A and S9C Fig). We do, however, find upward bias in estimates of the selection coefficient whens is 0 or close to 0 (S9B and S9D Fig).

Analysis of a lactase persistence SNP

To assess performance ofCLUES on empirical data, we applied our method to study selection acting on the SNP rs4988235 in theMCM6 gene, known to regulate the neighboringLCT gene and affect the lactase persistence trait. The derived allele (A) current segregates at approximately 72% in the 1000 Genomes Phase 3 reference panel (British in England and Scotland, henceforth GBR; seeS10A Fig). We conducted sampling in ARGweaver assuming a model of European demography [43], using a 300kbp region centered around the focal SNP and polarizing alleles using the genomes of three ancient individuals (Altai Neandertal, Denisova, and Vindija Neandertal [52–54]). We sampledM = 200 ARGs, extracted local trees using tools in the ARGweaver package, and conducted importance sampling to estimate likelihood surfaces and trajectories usingCLUES.

We found very strong evidence for selection on rs4988235 (s = 0.0161, logLR = 131.82). The trajectory as well as the value of the selection coefficient inferred byCLUES are consistent with previous estimates of the trajectory ands = 0.018 due to Mathieson and Mathieson (2018), illustrated inFig 9 [55]. Their method incorporates genomic times series spanning thousands of generations using an HMM-based approach, where hidden states are population-wide allele frequencies, observed states are genotypes of sampled ancient individuals, and transition probabilities are governed by the selection coefficient. Our approach, by contrast, does not utilize any ancient/timecourse data except for the 3 aforementioned ancient individuals, which we use to simply polarize the derived and ancestral states of each allele.

Fig 9 — Black curve is the posterior median allele frequency, whereas gray areas are a 95% posterior interval. The red surface is the posterior of the frequency trajectory within Steppe ancestry conditioned on an ancient DNA time series, adapted from [55].

Analysis ofEDAR

Another canonical example of a common SNP under selection in humans in rs3827760 (see e.g. [56]). This SNP is a non-synonymous mutation that is present at high frequency in East Asian populations (e.g. 94% in Han Chinese [CHB] in the 1000 Genomes database), intermediate-high frequency in Central and South America, and low frequency in other geographical regionsS10 Fig. This variant is associated with a number of traits, including tooth shape and hair straightness [57,58]. To estimate selection on this SNP, we conducted sampling in ARGweaver assuming a model of East Asian demography [59], using a 300kbp region centered around the focal SNP and polarizing alleles using the genomes of three ancient individuals (Altai Neandertal, Denisova, and Vindija Neandertal [52–54]). We sampledM = 200 ARGs, extracted local trees using tools in the ARGweaver package, and conducted importance sampling to estimate likelihood surfaces and trajectories usingCLUES. We estimate that rs3827760 has undergone selection withs = 0.0047, corresponding to an allele age of roughly 45kya (S11 Fig). Our estimates are in stark contrast to some previous estimates obtained using ABC methods, which estimate > 30-fold stronger selection on this SNP, and an allele age of 1.4-6.9kya [32]. Our results are consistent with ancient DNA evidence, which suggest the derived allele to have originated prior to 7.5kya [8].

Analysis of pigmentation alleles

Using the same GBR panel from 1000 Genomes Phase 3, we analyzed a set of SNPs associated with pigmentation-related traits, some of which were previously identified as likely targets of recent selection [21]. We conducted sampling in ARGweaver assuming a model of European demography, using a 300kbp region centered around the focal SNP and samplingM = 200 approximately iid ARGs. We ranCLUES and estimated likelihood surfaces and allele frequency trajectories for these SNPs (Fig 10). We found significant concordance between the SDS values and our likelihood ratio statistics paired for each SNP (p = 1.7 × 10⁻³, Spearman one-sided) [21]. We also illustrated the geographical distribution of these SNPs among diverse populations (S12 Fig) using GGV [60].

We found several signals of very strong selection acting on rs619865 (ASIP,s ≈ 0.10,Fig 10I), rs12821256 (KITLG,s ≈ 0.016,Fig 10H), and rs1393350 (TYR,s ≈ 0.011,Fig 10J); these SNPs are significantly associated with freckling, blonde hair color, and freckling and blue/green eye color, respectively [61–63]. Interestingly, these SNPs all demonstrated a signal of selection mostly concentrated in the last ∼5 kya. The geographical distribution of the frequency of these SNPs shows that the derived version of these variants are mostly concentrated in European populations, with minimal sharing with populations located in Africa and Asia (S12I, S12H and S12J Fig). For example,TYR andKITLG segregate at a frequency ∼20% in several European populations and have a frequency close to 0% in African and East Asian populations (S12J Fig). These three SNPs are the only ones in this set of SNPs which have a frequency of nearly 0% across the African populations surveyed, with the exception of OCA2/HERC2 (S12A, S12H, S12I and S12J Fig), consistent with our evidence for recent selection at these loci. The frequencies of these variants in GBR ranges from ∼10-20%; by contrast, the only other variant in this set with comparable frequency in GBR (13%), rs35264875 (TPCN2), we find inconclusive evidence of selection (Fig 10F), consistent with its comparably even geographical distribution relative to the aforementioned SNPs atASIP, KITLG, andTYR (S12F Fig).

At rs12896399 (SLC24A4,Fig 10B), a SNP identified to be significantly associated with hair color [62], we found strong evidence for moderate selection (s ≈ 0.005). This result is consistent with a previous analysis that suggested positive selection acted on this allele in Out-of-Africa (OoA) populations, based on its high allele differentiation relative to a YRI panel, and low haplotype diversity within CEU individuals [63]. Our results, paired with the apparent low levels of differentiation between European and Asian populations relative to differentiation between OoA populations and African populations at this locus (S12B Fig) are consistent with our estimate that selection acted onSLC24A4 as early as ∼30 kya, during the OoA bottleneck as inferred by [43,59].

Notably, we find moderate evidence for selection on rs12913832 (OCA2/HERC2,Fig 10A,S13 Fig), a SNP previously shown to be causal for blue-brown eye color [64] and significantly associated with hair color [62]. This gene exhibits abberantly high differentiation across populations [65], consistent with a model of local adaptation of eye color. Compared to previous estimates based on ancient DNA samples [66], we estimate substantially weaker selection acting on this gene (s ≈ 0.002 vs.s ≈ 0.04), and we find no evidence to support a recent increase in selection acting on this SNP (i.e., our method found a hard sweep to have higher likelihood than a SSV). Our estimate of moderate selection and lack of a recent change in the selection coefficient imply that selection on OCA2/HERC2 began at least ∼50 kya, roughly the time of the start of the OoA bottleneck estimated by [43,59]. Our analysis suggests that selection onOCA2/HERC2 may have begun much earlier than previously suggested [66]. We also note that the aforementioned rs12913832 (OCA2/HERC2)—as well as rs2153271 (BNC2), a SNP which is significantly associated with freckling (Fig 10G)—occur in high-frequency archaic haplotypes [67]. While our method is not explicitly designed to control for population structure between archaic and modern human lineages, we do find moderate evidence for selection on both of these SNPs.

One surprising result is that we found no signal of selection acting at rs13289810 (TYRP1,s ≈ 0,Fig 10E). In Europeans,TYRP1 is associated with hair and eye pigmentation [68–71]. Some analyses of European populations have indicated evidence for positive selection onTYRP1 [56,63,69]. Our results temper these claims, and appear consistent with the fairly even geographical distribution of rs13289810 frequency across European, African, and Middle Eastern populations (S12E Fig).

Discussion

We have developed an approach to use modern population genomic data to approximate the full likelihood of selection acting on a locus. We use this approach to test for and estimate the strength and timing of selection, as well as estimate the full allele frequency trajectory. The method is effective across a span of selection coefficients (s = 0 − 0.01), derived allele frequencies (f = 25% − 75%), and under multiple demographic models.

Our method draws on previously published methods to estimate the ancestral recombination graph (ARG). We chose to use ARGweaver because it is the only currently available method for sampling the posterior of the ARG; as shown in our derivation of the importance sampling estimates, we rely on sampling from the posterior in order to make rigorous guarantees regarding convergence and consistency of our estimators. Intuitively, it is important to model the uncertainty in the local tree in order to marginalize out this latent variable. We showed that estimates of the selection coefficient and the trajectory are generally accurate, barring scenarios where importance sampling is inefficient, or ARGweaver produces a bias in the inferred trees. In light of these biases, under certain conditions—primarily when the derived allele is at low frequencies (≤25%)—importance sampling using ARGweaver trees has limited power to detect selection.

Another important limitation of ARGweaver is its computational cost; in order to study selection on short timescales, large sample sizes are necessary, often on the order of thousands of individuals [21]. The runtime of ARGweaver grows dramatically with increasing sample size; not only does the cost of the individual sampling steps increase with sample size, but also so does the size of the state space, necessitating more samples be taken in order to achieve convergence to the stationary distribution.

However, we see potential to make use of recent advances in inference of local trees in order to further advance approximate full-likelihood methods to infer selection (see e.g., [72–75]; it is worth noting that some of these methods, such as [75], do not infer the ARG in a strict sense, but rather the sequence of local trees along a recombining locus). A major benefit of these methods is that they are far more scalable than ARGweaver, and hence offer more potential to study selection on short, punctuated timescales. However, they also possess several limitations: firstly, several of these methods only infer topologies, rather than branch lengths [73,74]. While it is possible to infer branch lengths condition on topology estimates, it is unclear how accurate these estimates would be. In contrast, methods that infer branch lengths along with topology entail a slight tradeoff in their scalability [72,75]. Another limitation of these methods is that they only yield a point estimate of the local tree, rather than estimating uncertainty in the tree. Nonetheless, it may be feasible to quantify uncertainty in the local tree using a jackknife approach where the local tree is inferred over random subsets of the individuals.

It may also be possible to make use of recent advances in inferring pairwise coalescence times (e.g., [76]) to build an approximation to the full likelihood. Recently, Albers & McVean proposed a composite likelihood method to estimate allele age by “sandwiching” the age using identity-by-descent tracts at the site of interest [77]. However, their method does not extend to inferring how the allele frequency changed over time, and does not explicitly model selection.

Currently our method assumes correct knowledge of the demographic history. The effects of latent or mis-specified population structure on inference of selection are well known (e.g., [78]), but in future work one might try to determine the exact effects of mis-specification of effective population size on both inferring the local tree, and inferring selection conditional on the local tree. One approach to dealing with this is to extend the importance sampling approach we use to correct for selection to additionally correct for demography, when ARG sampling is performed under a mis-specified demographic model.

Furthermore, many aspects of our model of selection (e.g. coalescence, allele frequency transitions) assume a panmictic population. To extend our model to more complex demographic models and/or linked selection (i.e., allowing multiple sites to be subject to selection) would entail drastically increased computational cost (e.g., marginalizing allele frequencies corresponding to each population, rather than the allele frequency in a single population). Using a deterministic approximation of the allele frequency trajectory would circumvent this issue, but it would also raise new issues, such as how to model allele frequencies whens ≈ 0.

Despite its limitations, the method presented here provides the first close approximation to a full likelihood function for the selection coefficient under simple models. As demonstrated by our simulations, full likelihood methods have the potential to greatly improve power to detect selection and estimate the strength of selection under a variety of conditions. It also provides a rigorous and accurate method for estimating allele frequency trajectories, and is the first to achieve so using modern data. As methods for inferring ARGs improve in the future, so too will the derived methods for detecting and quantifying selection and inferring allele frequency changes.

Supporting information

S1 Fig. Selection coefficients inferred directly from the true local trees.

Left: constant population size (N_e = 10⁴). Right: Tennessen CEU demographic model. Marker color denotes present-day allele frequency (25/50/75% correspond to yellow/red/purple, respectively).

(EPS)

Click here for additional data file.^{(4.1MB, eps)}

S2 Fig. Allele frequency trajectories inferred from ARGweaver local trees whenμ =r.

We setμ = 2.5 × 10⁻⁸ mut/bp/gen,r = 2.5 × 10⁻⁸ recombinations/bp/gen and fix the present day allele frequency toX₀ = 50% Stepwise trajectories are inferred, dashed trajectories are the ground truth. Vertical bars denote the 25-75th percentile range of estimates. For each condition we show 20 randomly selected simulations and their corresponding inferences. All data simulated under a demographic model with constant sizeN = 10⁴.

(EPS)

Click here for additional data file.^{(3.4MB, eps)}

S3 Fig. Inferring allele frequency trajectories under CEU demography.

Trajectories were inferred from true local trees (top row) and importance sampling on ARGweaver local trees (bottom row). Columns correspond to different present-day allele frequencies (left: 50%, right: 75%). For each condition we show 20 randomly selected simulations (dashed, translucent lines) and their corresponding inferences (piecewise constant curves; dots and vertical bars indicate the median and 25-75 percentiles of estimates, respectively). The gray box indicates the timing of the bottleneck, occurring approximately 920-2040 generations ago. Simulations were done under the European demographic model described in Methods and Materials using a locus of 200kb,n = 25 diploid individuals andμ = 2.5 × 10⁻⁸ mut/bp/gen,r = 1.25 × 10⁻⁸ recombinations/bp/gen.

(EPS)

Click here for additional data file.^{(32.4MB, eps)}

S4 Fig. ARGweaver proposes less accurate trees under non-equilibrium demography.

Left: constantN_e = 10⁴. Right: Tennessen CEU demographic model. We found inS2 Fig that importance sampling using ARGweaver tends to underestimate the selection coefficient under a model of CEU demography. To demonstrate that the proposal distribution for sampling the local tree is the source of this bias, we use TMRCA of the local tree as a heuristic the locus’s selection coefficient. For the sake of argument, we postulate that as one decreases the TMRCA of a local tree, the maximum-likelihood estimate of the selection coefficient strictly descreases. If so, then if the minimum value of the sampled TMRCAs is greater than the true TMRCA, then this instance of the importance sampling estimate will underestimate the selection coefficient. Hence, one can measure importance sampling efficiency by looking at the probability that the minimum value of the sampled TMRCAs is less than the true TMRCA. This is shown graphically by the proportion of points that fall in the red upper triangle. The selection coefficientss = 0, 0.001, 0.003, 0.01 are indicated by purple, blue, green, and orange, respectively. Simulations were done under the European demographic model described in Methods and Materials using a locus of 200kb,n = 25 diploid individuals andμ = 2.5 × 10⁻⁸ mut/bp/gen,r = 1.25 × 10⁻⁸ recombinations/bp/gen. We fixed the present-day allele frequency to be 75%.

(EPS)

Click here for additional data file.^{(3.9MB, eps)}

S5 Fig. ARGweaver infers an excess of recent coalescences.

As a diagnostic for the trees outputted by ARGweaver, we compared the amount of coalescence in the sample trees vs. the local trees. Leta_i,s be the number of coalescent events during epochi in thesth replicate of the simulation. We calculate

\begin{matrix} e_{i} = \frac{\frac{1}{M} \sum_{s = 1}^{S} \sum_{m = 1}^{M} a_{i, s}^{(m)}}{\sum_{s = 1}^{S} a_{i, s}} \end{matrix}

as an estimate of the “fold excess” of coalescence, where

a_{i, s}^{(m)}

denotes themth ARGweaver sample ofa_i,s. Notice that if the sample trees closely approximate the true tree, thene_i ≈ 0. The dashed line indicates no excess, i.e., no bias in the estimates. We find that ARGweaver can have a nearly 4× excess of inferred coalescence events in the most recent epochs (e.g. [0, 100] generations ago). Here we simulated recent selection starting 100 generations ago under a model of European demography withn = 50 diploid individuals, a physical length of 200kb, andμ = 2.5 × 10⁻⁸ mut/bp/gen,r = 1.25 × 10⁻⁸ mut/bp/gen. We condition on the variant having a present-day frequency of 50%.

(EPS)

Click here for additional data file.^{(1.4MB, eps)}

S6 Fig. Performance of trajectory inference across replicates.

We illustrated inferred vs. true trajectories, holding selection coefficient constant across the replicates. In this case, we choses = 0 in order to maximize the amount of variability in the trajectories of the replicates. Rows correspond to simulations conditioned on different present-day allele frequencies (0.25: A-C,0.50: D-F,0.75: G-I). Left column (A,D,G): trajectories inferred using the true local tree. Middle column (B,E,H): trajectories inferred using importance sampling, fixing the selection coefficient to the ground truth (s = 0). Right column (C,F,I): trajectories inferred under importance sampling and estimatings. Simulations were done under the constant size model described in Methods and Materials using a locus of 100kb,n = 25 diploid individuals andμ = 2.5 × 10⁻⁸ mut/bp/gen,r = 1.25 × 10⁻⁸ recombinations/bp/gen.

(EPS)

Click here for additional data file.^{(8.3MB, eps)}

S7 Fig. Effect of uncertainty ins on trajectory inference.

A: trajectories inferred using importance sampling (i.e.Eq 22), fixing $\hat{s} = s$ . B: trajectories inferred using importance sampling and integrating over a uniform prior ons (seeS1 Text for formulae). Simulations were done under the European demographic model described in Methods and Materials using a locus of 200kb,n = 25 diploid individuals andμ = 2.5 × 10⁻⁸ mut/bp/gen,r = 1.25 × 10⁻⁸ recombinations/bp/gen.

(EPS)

Click here for additional data file.^{(12.4MB, eps)}

S8 Fig. Effect of background selection on inference of selection.

Estimates of the selection coefficient of a selected allele (s = 0.1) linked to sites under background selection of varying strength. Simulations were done using forward simulations assuming constantN_e = 10³ using a locus of 1Mb,n = 25 diploid individuals andμ = 2.5 × 10⁻⁸ mut/bp/gen,r = 1.25 × 10⁻⁸ recombinations/bp/gen. Mean estimates are represented by blue hash marks.

(EPS)

Click here for additional data file.^{(2.7MB, eps)}

S9 Fig. Effect of model misspecification on inference of selection.

Likelihood ratios (A,C) and selection coefficient estimates (B,D) calculated given the true tree (A,B) and via importance sampling using ARGweaver (C,D). Blue violin plots represent estimates using the correct (European) demographic model, whereas orange plots represent estimates using a model of constantN_e = 10⁴. Simulations were done under the European demographic model described in Methods and Materials using a locus of 200kb,n = 25 diploid individuals andμ = 2.5 × 10⁻⁸ mut/bp/gen,r = 1.25 × 10⁻⁸ recombinations/bp/gen.

(TIF)

Click here for additional data file.^{(11.1MB, tif)}

S10 Fig. Geographical distribution of pigmentation SNPs.

Population-wide allele frequencies of (A) rs4988235 (MCM6) and (B) rs3827760 (EDAR) plotted geographically using GGV.

(EPS)

Click here for additional data file.^{(2.6MB, eps)}

S11 Fig. Allele frequency trajectory of rs3827760 (EDAR) in CHB.

Frequency trajectory inferred using importance sampling under a model of CHB demography.

(EPS)

Click here for additional data file.^{(717.8KB, eps)}

S12 Fig. Geographical distribution of pigmentation SNPs.

Population-wide allele frequencies of pigmentation SNPs fromFig 10 plotted geographically using GGV.

(EPS)

Click here for additional data file.^{(8.6MB, eps)}

S13 Fig. Allele frequency trajectory estimate of rs12913832 (OCA2/HERC2).

The same trajectory estimate as inFig 10A withx-axis limits extended to illustrate earlier history of the allele.

(EPS)

Click here for additional data file.^{(1.4MB, eps)}

S1 Text. Appendix and commands to reproduce simulations and analyses.

(PDF)

Click here for additional data file.^{(230.7KB, pdf)}

Acknowledgments

We thank Melissa Hubisz and Andrew Kern for extensive help with the software packages ARGweaver and Discoal, respectively. We also thank Graham Coop, Michael Edge, Priya Moorjani, Yun Song, as well as Vladimir Shchur and other members of the Nielsen Lab, for helpful discussions.

Data Availability

Procedures to simulate data are documented inS1 Text. Other data is available freely through 1000 Genomes (Phase 3),http://www.internationalgenome.org/data-portal/sample.

Funding Statement

RN received grant R01GM116044, National Institutes of Healthhttps://www.nih.gov/. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1.Watterson GA. Testing Selection at a Single Locus. Biometrics. 1982;38(2):323–331. 10.2307/2530446 [DOI] [PubMed] [Google Scholar]
2.Mathieson I, McVean G. Estimating selection coefficients in spatially structured populations from time series data of allele frequencies. Genetics. 2013;193(3):973–984. 10.1534/genetics.112.147611 [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Williamson EG, Slatkin M. Using Maximum Likelihood to Estimate Population Size From Temporal Changes in Allele Frequencies. 1999;. [DOI] [PMC free article] [PubMed]
4.Bollback JP, York TL, Nielsen R. Estimation of 2Nes from temporal allele frequency data. Genetics. 2008;179(1):497–502. 10.1534/genetics.107.085019 [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Lang GI, Rice DP, Hickman MJ, Sodergren E, Weinstock GM, Botstein D, et al. Pervasive genetic hitchhiking and clonal interference in forty evolving yeast populations. Nature. 2013;500(7464):57110.1038/nature12344 [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Good BH, McDonald MJ, Barrick JE, Lenski RE, Desai MM. The dynamics of molecular evolution over 60,000 generations. Nature. 2017;551(7678):4510.1038/nature24287 [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Lazaridis I, Patterson N, Mittnik Alissaet al. Ancient human genomes suggest three ancestral populations for present-day Europeans. Nature. 2014;513(7518):409–13. 10.1038/nature13673 [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Mathieson I, Lazaridis I, Rohland N, Mallick S, Patterson N, Roodenberg SA, et al. Genome-wide patterns of selection in. Nature. 2015;528(7583):499–503. 10.1038/nature16152 [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Sabeti PC, Reich DE, Higgins JM, Levine HZP, Richter DJ, Schaffner SF, et al. Detecting recent positive selection in the human genome from haplotype structure. 2002;419(October). [DOI] [PubMed] [Google Scholar]
10.Nielsen R, Williamson S, Kim Y, Hubisz MJ, Clark AG, Bustamante C. Genomic scans for selective sweeps using SNP data. 2005; p. 1566–1575. [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Voight BF, Kudaravalli S, Wen X, Pritchard JK. A map of recent positive selection in the human genome. PLoS biology. 2006;4(3):e7210.1371/journal.pbio.0040072 [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Smith JM, Haigh J. The hitch-hiking effect of a favourable gene. Genetics Research. 1974;23(1):23–35. 10.1017/S0016672300014634 [DOI] [PubMed] [Google Scholar]
13.Kaplan NL, Hudson R, Langley C. The “hitchhiking effect” revisited. Genetics. 1989;123(4):887–899. [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Tajima F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics. 1989;123(3):585–595. [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Stephan W, Wiehe TH, Lenz MW. The effect of strongly selected substitutions on neutral polymorphism: analytical results based on diffusion theory. Theoretical Population Biology. 1992;41(2):237–254. 10.1016/0040-5809(92)90045-U [DOI] [Google Scholar]
16.Fu YX, Li WH. Statistical tests of neutrality of mutations. Genetics. 1993;133(3):693–709. [DOI] [PMC free article] [PubMed] [Google Scholar]
17.Fay JC, Wu Ci. Hitchhiking Under Positive Darwinian Selection. 2000;. [DOI] [PMC free article] [PubMed] [Google Scholar]
18.Teshima KM, Coop G, Przeworski M. How reliable are empirical genomic scans for selective sweeps?2006;(773):702–712. [DOI] [PMC free article] [PubMed] [Google Scholar]
19.Ferrer-Admetlla A, Liang M, Korneliussen T, Nielsen R. On detecting incomplete soft or hard selective sweeps using haplotype structure. Molecular biology and evolution. 2014;31(5):1275–1291. 10.1093/molbev/msu077 [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Garud NR, Messer PW, Buzbas EO, Petrov Da. Recent selective sweeps in North American Drosophila melanogaster show signatures of soft sweeps. PLoS genetics. 2015;11(2):e100500410.1371/journal.pgen.1005004 [DOI] [PMC free article] [PubMed] [Google Scholar]
21.Field Y, Boyle EA, Telis N, Gao Z, Gaulton KJ. Detection of human adaptation during the past 2, 000 years. 2016; p. 1–18. [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Schrider DR, Kern AD. S/HIC: Robust Identification of Soft and Hard Sweeps Using Machine Learning. PLoS genetics. 2016;12(3):e100592810.1371/journal.pgen.1005928 [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Schrider DR, Kern AD. Supervised Machine Learning for Population Genetics: A New ParadigmTrends in Genetics. 2018;. [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Lin K, Li H, Schlötterer C, Futschik A. Distinguishing positive selection from neutral evolution: boosting the performance of summary statistics. Genetics. 2011;187(1):229–244. 10.1534/genetics.110.122614 [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Ronen R, Udpa N, Halperin E, Bafna V. Learning natural selection from the site frequency spectrum. Genetics. 2013;195(1):181–193. 10.1534/genetics.113.152587 [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Sheehan S, Song YS. Deep learning for population genetic inference. PLoS computational biology. 2016;12(3):e100484510.1371/journal.pcbi.1004845 [DOI] [PMC free article] [PubMed] [Google Scholar]
27.Krone SM, Neuhauser C. Ancestral processes with selection. Theoretical population biology. 1997;51(3):210–237. 10.1006/tpbi.1997.1299 [DOI] [PubMed] [Google Scholar]
28.Kaplan NL, Darden T, Hudson RR. The Coalescent Process in Models With Selection. 1988;829(2):819–829. [DOI] [PMC free article] [PubMed] [Google Scholar]
29.Coop G, Griffiths RC. Ancestral inference on gene trees under selection. Theoretical population biology. 2004;66(3):219–32. 10.1016/j.tpb.2004.06.006 [DOI] [PubMed] [Google Scholar]
30.Vy HMT, Kim Y. A composite-likelihood method for detecting incomplete selective sweep from population genomic data. Genetics. 2015;200(2):633–649. 10.1534/genetics.115.175380 [DOI] [PMC free article] [PubMed] [Google Scholar]
31.Kim Y, Stephan W. Detecting a Local Signature of Genetic Hitchhiking Along a Recombining Chromosome. 2002;777(February):765–777. [DOI] [PMC free article] [PubMed] [Google Scholar]
32.Peter BM, Huerta-Sanchez E, Nielsen R. Distinguishing between selective sweeps from standing variation and from a de novo mutation. PLoS genetics. 2012;8(10):e100301110.1371/journal.pgen.1003011 [DOI] [PMC free article] [PubMed] [Google Scholar]
33.Ormond L, Foll M, Ewing GB, Pfeifer SP, Jensen JD. Inferring the age of a fixed beneficial allele. Molecular ecology. 2016;25(1):157–169. 10.1111/mec.13478 [DOI] [PubMed] [Google Scholar]
34.Ilardo MA, Moltke I, Korneliussen TS, Cheng J, Stern AJ, Racimo F, et al. Physiological and genetic adaptations to diving in sea nomads. Cell. 2018;173(3):569–580. 10.1016/j.cell.2018.03.054 [DOI] [PubMed] [Google Scholar]
35.Corl A, Bi K, Luke C, Challa AS, Stern AJ, Sinervo B, et al. The genetic basis of adaptation following plastic changes in coloration in a novel environment. Current Biology. 2018;28(18):2970–2977. 10.1016/j.cub.2018.06.075 [DOI] [PubMed] [Google Scholar]
36.Sugden LA, Atkinson EG, Fischer AP, Rong S, Henn BM, Ramachandran S. Localization of adaptive variants in human genomes using averaged one-dependence estimation. Nature communications. 2018;9(1):70310.1038/s41467-018-03100-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
37.Rasmussen MD, Hubisz MJ, Gronau I, Siepel A. Genome-wide inference of ancestral recombination graphs. PLoS genetics. 2014;10(5):e100434210.1371/journal.pgen.1004342 [DOI] [PMC free article] [PubMed] [Google Scholar]
38.Edge MD, Coop G. Reconstructing the history of polygenic scores using coalescent trees. Genetics. 2019;211(1):235–262. 10.1534/genetics.118.301687 [DOI] [PMC free article] [PubMed] [Google Scholar]
39.Berisa T, Pickrell JK. Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics. 2016;32(2):28310.1093/bioinformatics/btv546 [DOI] [PMC free article] [PubMed] [Google Scholar]
40.Tavaré S. Line-of-descent and genealogical processes, and their applications in population genetics models. Theoretical population biology. 1984;26(2):119–164. 10.1016/0040-5809(84)90027-3 [DOI] [PubMed] [Google Scholar]
41.Griffiths R. Asymptotic line-of-descent distributions. Journal of Mathematical Biology. 1984;21(1):67–75. 10.1007/BF00275223 [DOI] [Google Scholar]
42.Jewett EM, Rosenberg NA. Theory and applications of a deterministic approximation to the coalescent model. Theoretical population biology. 2014;93:14–29. 10.1016/j.tpb.2013.12.007 [DOI] [PMC free article] [PubMed] [Google Scholar]
43.Tennessen JA, Bigham AW, O’connor TD, Fu W, Kenny EE, Gravel S, et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. science. 2012;337(6090):64–69. 10.1126/science.1219240 [DOI] [PMC free article] [PubMed] [Google Scholar]
44.Kern AD, Schrider DR. Discoal: flexible coalescent simulations with selection. Bioinformatics. 2016;32(24):383910.1093/bioinformatics/btw556 [DOI] [PMC free article] [PubMed] [Google Scholar]
45.Wright S. The distribution of gene frequencies under irreversible mutation. Proceedings of the National Academy of Sciences. 1938;24(7):253–259. 10.1073/pnas.24.7.253 [DOI] [PMC free article] [PubMed] [Google Scholar]
46.Przeworski M. The signature of positive selection at randomly chosen loci. Genetics. 2002;160(3):1179–1189. [DOI] [PMC free article] [PubMed] [Google Scholar]
47.Slatkin M. Simulating genealogies of selected alleles in a population of variable size. Genetics Research. 2001;78(1):49–57. 10.1017/S0016672301005183 [DOI] [PubMed] [Google Scholar]
48.Garud NR, Messer PW, Buzbas EO, Petrov DA. Recent selective sweeps in North American Drosophila melanogaster show signatures of soft sweeps. PLoS Genet. 2015;11(2):e100500410.1371/journal.pgen.1005004 [DOI] [PMC free article] [PubMed] [Google Scholar]
49.Haller BC, Messer PW. SLiM 3: forward genetic simulations beyond the Wright–Fisher model. Molecular biology and evolution. 2019;36(3):632–637. 10.1093/molbev/msy228 [DOI] [PMC free article] [PubMed] [Google Scholar]
50.Barton NH. Linkage and the limits to natural selection. Genetics. 1995;140(2):821–841. [DOI] [PMC free article] [PubMed] [Google Scholar]
51.Torres R, Szpiech ZA, Hernandez RD. Human demographic history has amplified the effects of background selection across the genome. PLoS genetics. 2018;14(6):e100738710.1371/journal.pgen.1007387 [DOI] [PMC free article] [PubMed] [Google Scholar]
52.Meyer M, Kircher M, Gansauge MT, Li H, Racimo F, Mallick S, et al. A high-coverage genome sequence from an archaic Denisovan individual. Science. 2012;338(6104):222–226. 10.1126/science.1224344 [DOI] [PMC free article] [PubMed] [Google Scholar]
53.Prüfer K, Racimo F, Patterson N, Jay F, Sankararaman S, Sawyer S, et al. The complete genome sequence of a Neanderthal from the Altai Mountains. Nature. 2014;505(7481):4310.1038/nature12886 [DOI] [PMC free article] [PubMed] [Google Scholar]
54.Prüfer K, de Filippo C, Grote S, Mafessoni F, Korlević P, Hajdinjak M, et al. A high-coverage Neandertal genome from Vindija Cave in Croatia. Science. 2017;358(6363):655–658. 10.1126/science.aao1887 [DOI] [PMC free article] [PubMed] [Google Scholar]
55.Mathieson S, Mathieson I. FADS1 and the Timing of Human Adaptation to Agriculture. Molecular Biology and Evolution. 2018;35(12):2957–2970. 10.1093/molbev/msy180 [DOI] [PMC free article] [PubMed] [Google Scholar]
56.Grossman SR, Shylakhter I, Karlsson EK, Byrne EH, Morales S, Frieden G, et al. A composite of multiple signals distinguishes causal variants in regions of positive selection. Science. 2010;327(5967):883–886. 10.1126/science.1183863 [DOI] [PubMed] [Google Scholar]
57.Kimura R, Yamaguchi T, Takeda M, Kondo O, Toma T, Haneji K, et al. A common variation in EDAR is a genetic determinant of shovel-shaped incisors. The American Journal of Human Genetics. 2009;85(4):528–535. 10.1016/j.ajhg.2009.09.006 [DOI] [PMC free article] [PubMed] [Google Scholar]
58.Wu S, Tan J, Yang Y, Peng Q, Zhang M, Li J, et al. Genome-wide scans reveal variants at EDAR predominantly affecting hair straightness in Han Chinese and Uyghur populations. Human genetics. 2016;135(11):1279–1286. 10.1007/s00439-016-1718-y [DOI] [PubMed] [Google Scholar]
59.Gravel S, Henn BM, Gutenkunst RN, Indap AR, Marth GT, Clark AG, et al. Demographic history and rare allele sharing among human populations. Proceedings of the National Academy of Sciences of the United States of America. 2011;108(29):11983–8. 10.1073/pnas.1019276108 [DOI] [PMC free article] [PubMed] [Google Scholar]
60.Marcus JH, Novembre J. Visualizing the geography of genetic variants. Bioinformatics. 2017;33(4):594–595. 10.1093/bioinformatics/btw643 [DOI] [PMC free article] [PubMed] [Google Scholar]
61.Eriksson N, Macpherson JM, Tung JY, Hon LS, Naughton B, Saxonov S, et al. Web-based, participant-driven studies yield novel genetic associations for common traits. PLoS genetics. 2010;6(6):e100099310.1371/journal.pgen.1000993 [DOI] [PMC free article] [PubMed] [Google Scholar]
62.Han J, Kraft P, Nan H, Guo Q, Chen C, Qureshi A, et al. A genome-wide association study identifies novel alleles associated with hair color and skin pigmentation. PLoS genetics. 2008;4(5):e100007410.1371/journal.pgen.1000074 [DOI] [PMC free article] [PubMed] [Google Scholar]
63.Sulem P, Gudbjartsson DF, Stacey SN, Helgason A, Rafnar T, Magnusson KP, et al. Genetic determinants of hair, eye and skin pigmentation in Europeans. Nature genetics. 2007;39(12):144310.1038/ng.2007.13 [DOI] [PubMed] [Google Scholar]
64.Sturm RA, Duffy DL, Zhao ZZ, Leite FP, Stark MS, Hayward NK, et al. A single SNP in an evolutionary conserved region within intron 86 of the HERC2 gene determines human blue-brown eye color. The American Journal of Human Genetics. 2008;82(2):424–431. 10.1016/j.ajhg.2007.11.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
65.Huerta-Sánchez E, Jin X, Bianba Z, Peter BM, Vinckenbosch N, Liang Y, et al. Altitude adaptation in Tibetans caused by introgression of Denisovan-like DNA. Nature. 2014;512(7513):19410.1038/nature13408 [DOI] [PMC free article] [PubMed] [Google Scholar]
66.Wilde S, Timpson A, Kirsanow K, Kaiser E, Kayser M, Unterländer M, et al. Direct evidence for positive selection of skin, hair, and eye pigmentation in Europeans during the last 5,000 y. Proceedings of the National Academy of Sciences. 2014;111(13):4832–4837. 10.1073/pnas.1316513111 [DOI] [PMC free article] [PubMed] [Google Scholar]
67.Gittelman RM, Schraiber JG, Vernot B, Mikacenic C, Wurfel MM, Akey JM. Archaic hominin admixture facilitated adaptation to out-of-Africa environments. Current Biology. 2016;26(24):3375–3382. 10.1016/j.cub.2016.10.041 [DOI] [PMC free article] [PubMed] [Google Scholar]
68.Frudakis T, Thomas M, Gaskin Z, Venkateswarlu K, Chandra KS, Ginjupalli S, et al. Sequences associated with human iris pigmentation. Genetics. 2003;165(4):2071–2083. [DOI] [PMC free article] [PubMed] [Google Scholar]
69.Sulem P, Gudbjartsson DF, Stacey SN, Helgason A, Rafnar T, Jakobsdottir M, et al. Two newly identified genetic determinants of pigmentation in Europeans. Nature genetics. 2008;40(7):83510.1038/ng.160 [DOI] [PubMed] [Google Scholar]
70.Liu F, Wollstein A, Hysi PG, Ankra-Badu GA, Spector TD, Park D, et al. Digital quantification of human eye color highlights genetic association of three new loci. PLoS genetics. 2010;6(5):e100093410.1371/journal.pgen.1000934 [DOI] [PMC free article] [PubMed] [Google Scholar]
71.Kenny EE, Timpson NJ, Sikora M, Yee MC, Moreno-Estrada A, Eng C, et al. Melanesian blond hair is caused by an amino acid change in TYRP1. Science. 2012;336(6081):554–554. 10.1126/science.1217849 [DOI] [PMC free article] [PubMed] [Google Scholar]
72.Mirzaei S, Wu Y. RENT+: an improved method for inferring local genealogical trees from haplotypes with recombination. Bioinformatics. 2016;33(7):1021–1030. [DOI] [PMC free article] [PubMed] [Google Scholar]
73.Kelleher J, Wong Y, Albers P, Wohns AW, McVean G. Inferring the ancestry of everyone. BioRxiv. 2018; p. 458067.
74.Shchur V, Ziganurova L, Durbin R. Fast and scalable genome-wide inference of local tree topologies from large number of haplotypes based on tree consistent PBWT data structure. bioRxiv. 2019; p. 542035.
75.Speidel L, Forest M, Shi S, Myers S. A method for genome-wide genealogy estimation for thousands of samples. BioRxiv. 2019; p. 550558. [DOI] [PMC free article] [PubMed]
76.Palamara PF, Terhorst J, Song YS, Price AL. High-throughput inference of pairwise coalescence times identifies signals of selection and enriched disease heritability. bioRxiv. 2018; p. 276931. [DOI] [PMC free article] [PubMed]
77.Albers PK, McVean G. Dating genomic variants and shared ancestry in population-scale sequencing data. bioRxiv. 2018;. [DOI] [PMC free article] [PubMed]
78.Galtier N, Depaulis F, Barton NH. Detecting bottlenecks and selective sweeps from DNA sequence polymorphism. Genetics. 2000;155(2):981–987. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Fig. Selection coefficients inferred directly from the true local trees.

Left: constant population size (N_e = 10⁴). Right: Tennessen CEU demographic model. Marker color denotes present-day allele frequency (25/50/75% correspond to yellow/red/purple, respectively).

(EPS)

Click here for additional data file.^{(4.1MB, eps)}

S2 Fig. Allele frequency trajectories inferred from ARGweaver local trees whenμ =r.

(EPS)

Click here for additional data file.^{(3.4MB, eps)}

S3 Fig. Inferring allele frequency trajectories under CEU demography.

(EPS)

Click here for additional data file.^{(32.4MB, eps)}

S4 Fig. ARGweaver proposes less accurate trees under non-equilibrium demography.

(EPS)

Click here for additional data file.^{(3.9MB, eps)}

S5 Fig. ARGweaver infers an excess of recent coalescences.

\begin{matrix} e_{i} = \frac{\frac{1}{M} \sum_{s = 1}^{S} \sum_{m = 1}^{M} a_{i, s}^{(m)}}{\sum_{s = 1}^{S} a_{i, s}} \end{matrix}

as an estimate of the “fold excess” of coalescence, where

a_{i, s}^{(m)}

(EPS)

Click here for additional data file.^{(1.4MB, eps)}

S6 Fig. Performance of trajectory inference across replicates.

(EPS)

Click here for additional data file.^{(8.3MB, eps)}

S7 Fig. Effect of uncertainty ins on trajectory inference.

(EPS)

Click here for additional data file.^{(12.4MB, eps)}

S8 Fig. Effect of background selection on inference of selection.

(EPS)

Click here for additional data file.^{(2.7MB, eps)}

S9 Fig. Effect of model misspecification on inference of selection.

(TIF)

Click here for additional data file.^{(11.1MB, tif)}

S10 Fig. Geographical distribution of pigmentation SNPs.

Population-wide allele frequencies of (A) rs4988235 (MCM6) and (B) rs3827760 (EDAR) plotted geographically using GGV.

(EPS)

Click here for additional data file.^{(2.6MB, eps)}

S11 Fig. Allele frequency trajectory of rs3827760 (EDAR) in CHB.

Frequency trajectory inferred using importance sampling under a model of CHB demography.

(EPS)

Click here for additional data file.^{(717.8KB, eps)}

S12 Fig. Geographical distribution of pigmentation SNPs.

Population-wide allele frequencies of pigmentation SNPs fromFig 10 plotted geographically using GGV.

(EPS)

Click here for additional data file.^{(8.6MB, eps)}

S13 Fig. Allele frequency trajectory estimate of rs12913832 (OCA2/HERC2).

The same trajectory estimate as inFig 10A withx-axis limits extended to illustrate earlier history of the allele.

(EPS)

Click here for additional data file.^{(1.4MB, eps)}

S1 Text. Appendix and commands to reproduce simulations and analyses.

(PDF)

Click here for additional data file.^{(230.7KB, pdf)}

Data Availability Statement

Procedures to simulate data are documented inS1 Text. Other data is available freely through 1000 Genomes (Phase 3),http://www.internationalgenome.org/data-portal/sample.

Movatterモバイル変換

PERMALINK

An approximate full-likelihood method for inferring selection and allele frequency trajectories from DNA sequence data

Aaron J Stern

Peter R Wilton

Rasmus Nielsen

Roles

Abstract

Author summary

Introduction

Materials and methods

Overview

Fig 1.

Coalescent model for a site under selection

Fig 2. (Companion toTable 1) Coalescence conditioned on the allele frequency trajectory (dashed line).

Table 1. (Companion toFig 2) The numbers of derived, ancestral, and mixed lineages at each time point.

Marginalizing the hidden allele frequency states

Importance sampling to estimate the likelihood function

Simulations

Fig 3. (A-I) ROC curves illustrating performance of tests between selection and neutrality.

Fig 4. ROC curves illustrating performance of tests between selection and neutrality.

Results

Testing for selection

Estimating selection coefficients

Fig 5. Inference of selection coefficients of varying strength using importance sampling method based on ARGweaver local trees.

Inferring allele frequency trajectories

Fig 6. Allele frequency trajectories inferred from true trees (A,B) and ARGweaver local trees (C,D).

Inferring extremely recent selection

Fig 7.

Fine-mapping the selected site

Fig 8.

Effects of background selection

Effects of demographic model misspecification

Analysis of a lactase persistence SNP

Fig 9. Comparison of inferred allele frequency trajectories for a sweep at rs4988235 (MCM6) in GBR under an ancient DNA (aDNA) based method vs.CLUES, which only uses contemporary modern data.

Analysis ofEDAR

Analysis of pigmentation alleles

Fig 10. Allele frequencies trajectories inferred for 11 pigmentation-associated SNPs in GBR (A-K, gene names and accession numbers inset).

Discussion

Supporting information

Acknowledgments

Data Availability

Funding Statement

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

Fig 9. Comparison of inferred allele frequency trajectories for a sweep at rs4988235 (MCM6) in GBR under an ancient DNA (aDNA) based method vs.`CLUES`, which only uses contemporary modern data.