
The landscape of Neandertal ancestry in present-dayhumans
Sriram Sankararaman
Swapan Mallick
Michael Dannemann
Kay Prüfer
Janet Kelso
Svante Pääbo
Nick Patterson
David Reich
To whom correspondence should be addressed:sankararaman@genetics.med.harvard.edu orreich@genetics.med.harvard.edu
Issue date 2014 Mar 20.
Users may view, print, copy, download and text and data- mine thecontent in such documents, for the purposes of academic research, subjectalways to the full Conditions of use:http://www.nature.com/authors/editorial_policies/license.html#terms
Abstract
Analyses of Neandertal genomes have revealed that Neandertals havecontributed genetic variants to modern humans1–2. Theantiquity of Neandertal gene flow into modern humans means that regions thatderive from Neandertals in any one human today are usually less than a hundredkilobases in size. However, Neandertal haplotypes are also distinctive enoughthat several studies have been able to detect Neandertal ancestry at specificloci1,3–8. Here, we have systematically inferred Neandertal haplotypesin the genomes of 1,004 present-day humans12. Regions that harbor a high frequency of Neandertalalleles in modern humans are enriched for genes affecting keratin filamentssuggesting that Neandertal alleles may have helped modern humans adapt tonon-African environments. Neandertal alleles also continue to shape humanbiology, as we identify multiple Neandertal-derived alleles that confer risk fordisease. We also identify regions of millions of base pairs that are nearlydevoid of Neandertal ancestry and enriched in genes, implying selection toremove genetic material derived from Neandertals. Neandertal ancestry issignificantly reduced in genes specifically expressed in testis, and there is anapproximately 5-fold reduction of Neandertal ancestry on chromosome X, which isknown to harbor a disproportionate fraction of male hybrid sterilitygenes20–22. These results suggest thatpart of the reduction in Neandertal ancestry near genes is due to Neandertalalleles that reduced fertility in males when moved to a modern human geneticbackground.
To search systematically for Neandertal haplotypes, we developed a method basedon a Conditional Random Field9 (CRF)that combines information from three features of genetic variation that are signaturesof Neandertal ancestry (SI 1;Extended Data Fig.1). The first is the allelic pattern at a single nucleotide polymorphism (SNP):if a non-African carries a derived allele seen in Neandertals but absent from the WestAfrican Yoruba (YRI), it likely originates from Neandertals. The second is high sequencedivergence to all YRI haplotypes but low divergence to Neandertal. The third is a lengthconsistent with interbreeding 37–86 thousand years ago10. We trained the CRF using simulations11, and established its robustness todeviations from the assumed demography (SI 2).
Extended Data Figure 1. Three features used in the Conditional Random Field for predicting Neandertalancestry.
Feature 1: Patterns of variation at a single SNP. Sites where a panel ofsub-Saharan Africans carries the ancestral allele and where the sequencedNeandertal and the test haplotype carry the derived allele are likely to bederived from Neandertal gene flow. Feature 2: Haplotype divergence patterns.Genomic segments where the divergence of the test haplotype to the sequencedNeandertal is low while the divergence to a panel of sub-Saharan Africans ishigh are likely to be introgressed. Feature 3: we search for segments that havea length consistent with what is expected from the Neandertal-into-modern humangene flow 2,000 generations ago, corresponding to a size of about 0.05cM= (100cM/Morgan)/(2000 generations).
We screened for Neandertal haplotypes in the 1000 Genomes Project Phase 1 data12 (1KG) using the AltaiNeandertal genome of 52-fold average coverage to determine alleles present inNeandertals2, a 6-primateconsensus to determine ancestral alleles13, and 176 YRI genomes as a reference panel assumed to harbor noNeandertal ancestry (Fig. 1a).Table 1 reports the mean and standard deviation acrossindividuals of the fraction of their ancestry confidently inferred to be Neandertal(probability >90%).Fig. 1b andExtended Data Fig. 2 plot the fraction of European(n=758) and East Asian (n=572) haplotypes that descend from Neandertalsat each genomic location (SI 3). We created a tiling path of inferred Neandertalhaplotypes that spans 1.1 Gigabases (Gb) over 4,437 contigs (SI 4), thus filling in gapsin the Neandertal sequence over a number of repetitive regions that cannot bereconstructed from short ancient DNA fragments (ExtendedData Fig. 3).
Figure 1. Maps of Neandertal ancestry.
(a) Individual maps: We show the marginal probability of Neandertalancestry for 1 European American, 1 East Asian and 1 sub-Saharan African phasedgenome across chromosome 9. (b) Population maps: We show theaverage of the inferred proportion of Neandertal ancestry in Europeans (aboveaxis) and East Asians (below axis) in non-overlapping 100 kb windows onchromosome 9.
Table 1.
Genome-wide estimates of Neandertal ancestry
| Population | Individuals | Neandertal ancestry(%) | ||
|---|---|---|---|---|
| Autosomes | X | |||
| Europeans | CEU | 85 | 1.17±0.08 | 0.21±0.17 |
| FIN | 93 | 1.20±0.07 | 0.19±0.14 | |
| GBR | 89 | 1.15±0.08 | 0.20±0.15 | |
| IBS | 14 | 1.07±0.06 | 0.23±0.18 | |
| TSI | 98 | 1.11±0.07 | 0.25±0.20 | |
| East Asians | CHB | 97 | 1.40±0.08 | 0.30±0.21 |
| CHS | 100 | 1.37±0.08 | 0.27±0.21 | |
| JPT | 89 | 1.38±0.10 | 0.26±0.21 | |
| Americans | CLM | 60 | 1.14±0.12 | 0.22±0.16 |
| MXL | 66 | 1.22±0.09 | 0.21±0.15 | |
| PUR | 55 | 1.05±0.12 | 0.20±0.15 | |
| Africans | LWK | 97 | 0.08±0.02 | 0.04±0.07 |
| ASW | 61 | 0.34±0.22 | 0.07±0.11 | |
Note: For each computationally phased genome in each population, we estimatedthe probability of Neandertal ancestry at each SNP and the fraction ofautosomal and X-chromosome SNPs that are confidently Neandertal in eachindividual (marginal probability > 90%). The table reports theaverage and standard deviation of this measure across individuals withineach population.
Extended Data Figure 2. Map of Neandertal ancestry in 1000 Genomes European and East Asianpopulations.
For each chromosome, we plot the fraction of alleles confidently inferred to beof Neandertal origin (probability > 90%) in non-overlapping 1 Mbwindows. We label 10 Mb scale windows that are deficient in Neandertal ancestry(e1–e9, a1–a17) (SI 8).
Extended Data Figure 3. Tiling path from confidently inferred Neandertal haplotypes.
a, Example tiling path at theBNC2 locus onchromosome 9 in Europeans. Red denotes confident Neandertal haplotypes. Bluedenotes the resulting tiling path. We identified Neandertal haplotypes byscanning for runs of consecutive SNPs along a haplotype with a marginalprobability > 90% and requiring the haplotypes to be at least 0.02 cMlong.b, Distribution of contig lengths obtained by constructing atiling path across confidently inferred Neandertal haplotypes. On mergingNeandertal haplotypes in each of the 1000 Genomes European and East Asianpopulations, we reconstructed 4,437 Neandertal contigs with median length 129kb.
Four features of the Neandertal introgression map suggest that it is producingreasonable results. First, when we infer Neandertal ancestry using low-coverage datafrom Croatian Neandertals1 we obtaincorrelated inferences (Spearman rank correlationρSpearman=0.88 in Europeans; SI 3). Second, in the AfricanLuhya (LWK), the proportion of the genome inferred to be Neandertal is 0.08%, anorder of magnitude smaller than in non-Africans (Table1). Third, the proportion of confidently inferred Neandertal ancestry has amean of 1.38% in East Asians and 1.15% in Europeans (Table 1), consistent with previous reports of moreNeandertal ancestry in East Asians than in Europeans7,14. Fourth, the standarddeviation in Neandertal ancestry within populations is 0.06–0.10%, inline with theoretical expectation (SI 3) and showing that Neandertal ancestrycalculators that estimate differences on the order of a percent15 are largely inferring noise.
The Neandertal introgression map reveals locations where Neandertal ancestry isinferred to be as high as 62% in East Asians and 64% in Europeans (Fig. 1b andExtendedData Fig. 2). Several of these regions provide evidence of positive selectionif we assume a model in which the distribution of Neandertal ancestry has been governedby neutral drift; however, this assumption is problematic in light of the evidence forwidespread negative selection against Neandertal ancestry reported below (SI 5). As analternative test for whether Neandertal alleles have been affected by positiveselection, we examined the 5% of genes with the highest inferred Neandertalancestry. We do not detect tissue-specific expression patterns; however genes involvedinkeratin filament formation and some other biological pathways aresignificantly enriched in Neandertal ancestry in Europeans, East Asians, or both (Extended Data Table 1, SI 6). Thus, Neandertalalleles that affect skin and hair may have been used by modern humans to adapt tonon-African environments. We also directly established the relevance of Neandertalalleles to present-day human biology by identifying alleles of Neandertal origin (SI 7),and overlapping this list with alleles that have been associated withphenotype16. We identifyalleles of Neandertal origin that affect lupus, biliary cirrhosis, Crohn’sdisease, optic disk size, smoking behavior, IL-18 levels and type 2 diabetes17 (Extended Data Table 2).
Extended Data Table 1.
Gene categories enriched or depleted in Neandertal ancestry. Enrichment of GeneOntology categories in genes with depleted or elevated Neandertal ancestry wasassessed using the hypergeometric test implemented in the FUNC package. Wereport Family-wise error rate P-values (FWER) associated with each GO category(P-values corrected for the testing of multiple categories).
| Biological pathway (GO categorization) | Neandertal ancestry | Europe FWER | East Asian FWER |
|---|---|---|---|
| nucleic acid binding (molecular_function,GO:0003676) | Depleted | 0.018 | 0.032 |
| RNA processing (biological_process,GO:0006396) | Depleted | 0.004 | 0.049 |
| ribonucleoprotein complex (cellular_component,GO:0030529) | Depleted | <0.001 | 0.027 |
| organelle part (cellular_component,GO:0044422) | Depleted | <0.001 | 0.037 |
| intracellular organelle part(cellular_component, GO:0044446) | Depleted | <0.001 | 0.025 |
| mRNA metabolic process (biological_process,GO:0016071) | Depleted | <0.001 | 0.014 |
| nuclear lumen (cellular_component,GO:0031981) | Depleted | 0.039 | 0.017 |
| nuclear part (cellular_component,GO:0044428) | Depleted | 0.005 | 0.022 |
| keratin filament (cellular_component,GO:0045095) | Enriched | <0.001 | <0.001 |
Extended Data Table 2.
Neandertal derived alleles that have been reported to be associated withphenotypes in genome-wide association studies (GWAS). We identified alleles thatare likely to have been introduced by Neandertal gene flow (SI 10) andintersected these alleles with SNPs that have been shown to be associated withphenotypes from the NHGRI GWAS catalog as well as from a recent GWAS for type 2 diabetes17).
| rs id | Coordinates | Derived allele | Derived allele frequency(%) | Phenotype | |
|---|---|---|---|---|---|
| Europeans | East Asians | ||||
| rs12531711 | 7:128,617,466 | G | 10.03 | 0.17 | Systemic lupus erythematosus, Primary biliarycirrhosis |
| rs3025343 | 9:136,478,355 | A | 8.44 | 0.00 | Smoking behavior |
| rs7076156 | 10:64,415,184 | A | 26.52 | 8.74 | Crohn’s disease |
| rs12571093 | 10:70,019,371 | A | 16.35 | 14.86 | Optic disc size |
| rs1834481 | 11:112,023,827 | G | 21.50 | 0.35 | Interleukin-18 levels |
| rs11175593 | 12:40,601,940 | T | 1.98 | 3.32 | Crohn’s disease |
| rs75493593 | 17:6,945,087 | T | 1.85 | 12.06 | |
| rs75418188 | 17:6,945,483 | T | 1.85 | 11.54 | Type-2 Diabetes |
| rs117767867 | 17:6,946,330 | T | 1.85 | 11.54 | |
The most striking feature of the introgression map is its large“deserts” of Neandertal ancestry: on a 10 Megabase (Mb) scale on theautosomes, there are 4 windows in Europeans and 14 in East Asians with Neandertalancestry <0.1% (Extended Data Fig. 2, SI8). Two analyses show that these deserts are not artifacts of reduced power to detectancestry. First, when we lower the probability threshold for calling a segment asNeandertal from 90% to 25% our qualitative findings are unchanged (SI8). Second, when we estimate Neandertal ancestry in regions of low recombination ratewhere Neandertal haplotypes are longer so that we have more power to detect them, we seea decreased Neandertal ancestry proportion, opposite to the expectation from increasedpower (ρSpearman=0.221, P=4.4 ×10−4 in Europeans; ρSpearman=0.226,P=1.9 × 10−4 in East Asians) (SI 8). Part of theexplanation for the ancestry deserts is likely to be small population sizes shortlyafter interbreeding, as this could explain why we also observe multi-megabase rises andnot just falls in Neandertal ancestry (SI 8). However, selection too appears to havecontributed to Neandertal ancestry deserts, as we also detect a correlation tofunctionally important regions (below).
To explore whether selection provides part of the explanation for regions ofreduced Neandertal ancestry, we tested for a correlation of Neandertal ancestry to apreviously defined “B-statistic”, in which low B implies a high densityof functionally important elements18.We find that low B is significantly correlated to low Neandertal ancestry:ρSpearman=0.32 in Europeans (P= 4.9 ×10−87) and ρSpearman=0.31 in EastAsians (P=3.88 × 10−68) (Fig. 2, SI 8). The inference of low Neandertal ancestry inthese regions is not an artifact of reduced power, as there is expected to be reducedgenetic variation in regions of low B which should make introgressed Neandertalhaplotypes stand out more clearly (Extended Data Table3). We also estimated Neandertal ancestry in quintiles of B-statistic usingan approach that is not biased by varying mutation rates, recombination rates, orgenealogical tree depth19, andconfirmed that the quintile with the highest B has significantly higher Neandertalancestry than the other quintiles (P=7×10−4) (Extended Data Table 4, SI 9).
Figure 2. Functionally important regions are deficient in Neandertal ancestry.
We plot the median of the proportion of Neandertal ancestry (the average over themarginal probability of Neandertal ancestry assigned to each individual alleleat a SNP) within quintiles of a B-statistic that measures proximity tofunctionally important regions (1-low, 5-high). We show results on the autosomesand chromosome X, and in Europeans and East Asians.
Extended Data Table 3.
Power to infer Neandertal ancestry. a, Simulated power to infer Neandertalancestry as a function of the effective population size.b, Powerto infer Neandertal ancestry on chromosome X versus the autosomes. Recall iscomputed at a precision of 90%. Standard errors are estimated by a blockjackknife with 100 blocks.
| a | |
|---|---|
| Effective population size | Recall |
| 2500 | 0.552 ± 0.009 |
| 5000 | 0.506 ± 0.009 |
| 7500 | 0.430 ± 0.006 |
| 10000 | 0.384 ± 0.006 |
| b | |
|---|---|
| Recall | |
| Autosomes | 0.384 ± 0.006 |
| X | 0.495 ± 0.009 |
Extended Data Table 4.
Unbiased estimate of the proportion of Neandertal ancestry as a function ofB-statistic. We estimate the proportion of Neandertal ancestry in quintiles ofB-statistics. We find a significant excess of Neandertal ancestry in thequintile with the highest B-statistic relative to the remaining four quintiles(significant after correcting for ten hypotheses).
| 11 Non-Africans (transitions+ transversions) | 11 Non-Africans (transversionsonly) | 4 Europeans (transversionsonly) | 7 Eastern (transversionsonly) | |||||
|---|---|---|---|---|---|---|---|---|
| Est. | Err. | Est. | Err. | Est. | Err. | Est. | Err. | |
| Quintile 1: B=0–0.63 | 0.641 | 0.304 | 0.672 | 0.316 | 0.472 | 0.397 | 0.778 | 0.317 |
| Quintile 2: B=0.63–0.80 | 0.825 | 0.209 | 0.779 | 0.234 | 0.849 | 0.290 | 0.750 | 0.236 |
| Quintile 3: B=0.80–0.88 | 0.578 | 0.248 | 0.745 | 0.298 | 0.987 | 0.349 | 0.647 | 0.297 |
| Quintile 4: B=0.88–0.94 | 0.684 | 0.184 | 0.676 | 0.208 | 0.446 | 0.256 | 0.771 | 0.221 |
| Quintile 5: B=0.94–1.00 | 1.537 | 0.152 | 1.445 | 0.164 | 1.502 | 0.185 | 1.419 | 0.177 |
| B≥0.94 vs. B ≤ 0.94 | Z=3.82 | Z=3.02 | Z=3.12 | Z=2.58 | ||||
| Correct for 10 hypotheses | P=0.00066 | P=0.013 | P=0.0090 | P=0.049 | ||||
The largest deserts of Neandertal ancestry are on chromosome X, where the meanNeandertal ancestry is about a fifth of the autosomes (Table 1). The power of our CRF to detect Neandertal ancestry is higher onchromosome X than on the autosomes (Extended Data Table3), implying that this observation cannot be an artifact of reduced power. Atleast some of the reduction in Neandertal ancestry that we observe on chromosome X mustbe due to selection, since just as on the autosomes, we observe that Neandertal ancestryis positively correlated with B-statistic (ρSpearman=0.276, P= 3.1 × 10−4 for Europeans;ρSpearman=0.176, P = 0.02 for East Asians) (Fig. 2, SI 8). Studies in many species have shownthat genes responsible for reduced male fertility disproportionally map to chromosomeX20–22. We hypothesized that this “Large XEffect”23 could explain whychromosome X was more resistant to introgression of Neandertal ancestry than theautosomes.
If male hybrid sterility is contributing to our observations, a prediction isthat the responsible genes will be disproportionally expressed in testis24. To test this hypothesis, we analyzedgene transcripts from 16 human tissues25 and defined “tissue-specific” genes as those with asignificantly higher expression level in that tissue than any other. We found that onlygenes specifically expressed in testis were enriched in regions of low Neandertalancestry, an effect that remained significant after permuting gene annotations whilepreserving the correlation structure between Neandertal ancestry and gene expression (P= 0.0095 in Europeans; P = 0.018 in East Asians) (Table 2, SI 6). However, hybrid sterility is not the onlyfactor responsible for selection against Neandertal material, as Neandertal ancestry isalso depleted in conserved pathways such as RNA processing (P<0.05;Extended Data Table 2; SI 6).
Table 2.
Enrichment of tissue-specific genes in regions deficient in Neandertalancestry
| Tissue | Europeans | East Asians | ||||
|---|---|---|---|---|---|---|
| Genome | Chr. X | Autosomes | Genome | Chr. X | Autosomes | |
| adipose | 0.93 | 1 | 0.81 | 0.99 | 1 | 0.95 |
| adrenal | 0.5 | NA | 0.5 | 0.42 | NA | 0.42 |
| blood | 0.99 | 0.98 | 0.99 | 0.94 | 0.73 | 0.94 |
| brain | 1 | 1 | 1 | 1 | 1 | 1 |
| breast | 0.98 | 0.63 | 0.99 | 1 | 0.94 | 1 |
| colon | 0.64 | 0.77 | 0.63 | 0.94 | 0.97 | 0.89 |
| heart | 0.99 | 0.71 | 0.99 | 0.8 | 0.57 | 0.81 |
| kidney | 1 | 0.15 | 1 | 1 | 0.08 | 1 |
| liver | 0.99 | 0.99 | 0.99 | 1 | 0.86 | 1 |
| lung | 0.96 | 0.64 | 0.96 | 0.99 | 0.87 | 0.99 |
| lymph | 0.88 | 0.62 | 0.9 | 0.99 | 0.51 | 0.99 |
| ovary | 0.84 | 0.95 | 0.81 | 0.62 | 0.91 | 0.58 |
| prostate | 1 | 0.79 | 1 | 1 | 0.73 | 1 |
| muscle | 0.95 | 0.7 | 0.95 | 0.83 | 0.1 | 0.88 |
| testes | 0.0095 | 0.13 | 0.016 | 0.018 | 0.039 | 0.055 |
| thyroid | 0.86 | 0.62 | 0.88 | 0.87 | 0.94 | 0.86 |
Note: We compare tissue-specific genes (defined as those that aresignificantly more highly expressed in the specified tissue than in any ofthe 15 other tissues) to all other expressed genes in that tissue. Of thesixteen tissues tested, only testis-specific genes are significantlyenriched in the regions deficient in Neandertal ancestry, defined aslocations where all sites across all individuals are assigned a marginalprobability of Neandertal ancestry of <10% (47% of genesin Europeans and 52% of genes in East Asians fall in this category).NA denotes that there were no tissue-specific genes for this tissue onchromosome X.
We have shown that interbreeding of Neandertals and modern humans introducedalleles onto the modern human genetic background that were not tolerated and were sweptaway, in part because they contributed to male hybrid sterility. The resulting reductionin Neandertal ancestry was quantitatively large: in the fifth of the genome with highestB, Neandertal ancestry is 1.54 ± 0.15 times the genome-wide average (Extended Data Table 4; SI 9)19. If we assume that this subset of the genome wasunaffected by selection, this implies that the proportion of Neandertal ancestry shortlyafter introgression must have been >3% rather than the present-day~2%. In passing, we note that the large effect of negative selection onpresent-day levels of Neandertal ancestry may explain why the proportion of Neandertalancestry is significantly higher in present-day East Asians than in Europeans (Table 1)7,14; population sizesappear to have been smaller in East Asians than Europeans for some of the time sincetheir separation26, and this wouldresult in less efficient selection to remove Neandertal-derived deleterious alleles. Theevidence for male hybrid sterility is particularly remarkable when compared with mixedpopulations of present-day humans in which no convincing signals of selection againstalleles inherited from one of the mixing populations have been found despite high powerto detect such effects27. Thus, whilethe time of separation between Neandertals and modern humans was about 5 times largerthan that between present-day Europeans and West Africans2, the biological incompatibility was far greater.A potential explanation is the “snowball effect”, whereby hybridsterility genes are expected to accumulate in proportion to the square of thesubstitutions between two taxa because two interacting loci need to change to produce anincompatibility (“Dobzhansky-Muller incompatibilities”)28. An important direction for futurework is to explore whether similar phenomena have affected other interbreeding eventsbetween diverged humans.
Online Methods
Conditional Random Field for inferring Neandertal local ancestry
Consider a haploid genome in a test population that carries Neandertalancestry, for example Europeans. Given the allelic states of a sequence of SNPsalong this haplotype, we would like to infer the ancestral state of the alleleat each SNP, specifically, whether it has entered modern humans throughNeandertal gene flow. In addition to the test haplotype, the data we analyzeconsist of a panel of haplotypes from the sub-Saharan African Yoruba (YRI) whowe assume harbor no Neandertal ancestry1. To determine the allelic state of the Neandertals, weuse a high-coverage Neandertal genome2. We determine the ancestral and derived allele at eachSNP using a 6-primate consensus sequence13. To estimate the genetic distance between adjacentSNPs, we use the Oxford combined linkage disequilibrium (LD) map29. We specify the distributionof the unobserved Neandertal ancestry states at each SNP given the observedgenetic data as a Conditional Random Field (CRF)9. Intuitively, we specify featurefunctions that relate the observed data and the unobserved ancestral state ateach SNP (“emission functions”) as well as feature functionsthat relate the unobserved ancestral states at adjacent SNPs(“transition functions”). Thus, the model is a linear-chain CRF.The feature functions and their associated parameters fully specify thedistribution of the unobserved ancestral states given the observed data. Giventhe parameters and the observed data, we are able to infer the marginalprobability of Neandertal ancestry at each SNP of the haploid genome. We computethe marginal probabilities efficiently using the forward-backward algorithm9,30. SI 1 presents the mathematicaldetails.
Feature functions
The emission functions couple the unobserved ancestral state at a SNP tothe observed features. We use two classes of emission functions.
The first class of emission function captures information from the jointpatterns observed at a single SNP across Europeans, Africans and Neandertals.These features are indicator functions that assume the value “1”when a specific pattern is observed at a SNP and “0” otherwise.We use feature functions that pick out two classes of allelic patterns. One ofthese features is 1 if at a given SNP, the test haplotype carries the derivedallele, all the YRI haplotypes carry the ancestral allele, and either of the twoNeandertal alleles is derived. SNPs with this joint configuration have anincreased likelihood of Neandertal ancestry. In the CRF, an increased likelihoodassociated with this feature is reflected in the fact that the parameter ispositive with a magnitude determined by the informativeness of the feature. Thesecond feature is 1 if at a given SNP, the test haplotype carries a derivedallele that is polymorphic in the panel of Africans but absent in theNeandertal. SNPs with this joint configuration have a decreased likelihood ofNeandertal ancestry.
The second class of emission functions uses multiple SNPs to capture thesignal of Neandertal ancestry. Specifically, we compare the divergence of thetest haplotype to the Neandertal sequence to the minimum divergence of the testhaplotype to all African haplotypes over non-overlapping 100 kilobase (kb)windows (the size scale we expect for Neandertal haplotypes today based on thetime of Neandertal gene flow into modern humans10). In a region of the genome where thetest haplotype carries Neandertal ancestry, we expect the test haplotype to becloser to the Neandertal sequence than to most modern human sequences (albeitwith a large variance), and we expect the pattern to be reversed outside theseregions. While computing distance to the Neandertal sequence, we build aNeandertal haplotype by choosing the derived allele at heterozygous sites sothat this distance is effectively the minimum distance of the potentiallyintrogressed test haplotype to one of the two Neandertal haplotypes.
The transition feature function modulates the correlation of theancestral states at adjacent SNPs. We define this feature function as anapproximation, at small genetic distance, to the log of the transitionprobabilities of a standard Markov process of admixture between two populations.This approximation makes parameter estimation in the CRF efficient.
Parameter Estimation
To estimate the parameters of the CRF, we need haplotypes labeled withNeandertal ancestries; that is, training data. Since we do not in fact know thetrue Neandertal state in any individual, we estimated the CRF parameters on datasimulated under a demographic model. We estimate parameters by maximizing theL2-regularized conditional log likelihood using a limited-memory version ofLBFGS31. We fixed the valueof the parameter associated with the L2-penalty at 10 although a broad range ofvalues appear to work in practice. We assumed a simple demographic modelrelating Africans, Europeans and Neandertals with Neandertal-modern humanadmixture occurring 1,900 generations ago10 and a fraction of Neandertal ancestry of 3%1. The model parameterswere broadly constrained by the observed allele frequency differentiationbetween the West African YRI and European American CEU populations and by theobserved excess sharing of alleles between Europeans and Africans relative toNeandertals. The simulations incorporated hotspots of recombination11 as well as the reduced powerto detect low-frequency alleles from low-coverage sequencing data12.
Validation of the CRF
We assessed the accuracy of the CRF to predict Neandertal ancestry usingsimulated data. Given the marginal probabilities estimated by the CRF, weestimated the precision (fraction of predictions that are truly Neandertal) andthe recall (fraction of true Neandertal alleles that are predicted) as we varythe threshold on the marginal probability for an allele to be declaredNeandertal. We also evaluated the accuracy when the haplotype phase needs to beinferred and when the genetic map has errors. Since the CRF parameter estimationassumes a specific demographic model, we were concerned about the possibilitythat the inferences might be sensitive to the model assumed. We thereforeperturbed each demographic parameter in turn and applied the CRF to datasimulated under these perturbed models, fixing the parameters of the CRF to theestimates obtained under the original model. For each of these perturbed models,we evaluated the false discovery rate (1-precision) when we restrict to sites atwhich the CRF assigns a marginal probability of at least 0.90. SI 2 presents thedetails.
Preparation of 1000 Genomes Data
We applied the CRF to the computationally phased haplotypes in each ofthe 13 populations in the 1000 Genomes project12 (1KG), excluding the west African Yoruba(YRI). The CRF requires reference genomes from Africans and Neandertals. For theAfrican population, we used 176 haplotypes from 88 YRI individuals. For theNeandertal genome, we used the genotypes called from the recently generatedhigh-coverage Neandertal sequence2. We restricted our analysis to sites passing the filtersdescribed in ref.2 and forwhich the genotype quality score GQ ≥ 30. These filters discard sitesthat are identified as repeats by the Tandem Repeat Finder32, that have Phred-scaled mapping qualityscores of MQ < 30, or that map to regions where the alignment is ambiguous orwhich fall within the upper or lower 2.5th percentile of thesample-specific coverage distribution (applied within the regions of uniquemappability binned according to the GC-content of the reference genome). For themappability filter, we used the liberal filter that requires that at least50% of all 35-mers that overlap a position do not map to any otherposition in the genome allowing up to one mismatch. We further restricted ouranalysis to sites that are biallelic across the Neandertal and the 1000 Genomesproject samples. For each haplotype analyzed, we also restricted to the set ofpolymorphic sites in the population containing the haplotype. After filtering,we were able to analyze 26,493,206 SNPs on the autosomes and 817,447 SNPs onchromosome X. We obtained genetic distances from the Oxford combined LD maplifted over to hg19 coordinates29. For the X chromosome, we obtained an appropriatesex-averaged map by scaling the X chromosome LD-based map by 2/3.
Statistics for measuring Neandertal ancestry
We computed several statistics to summarize the Neandertal ancestryinferred by the CRF. We estimated the proportion of an individual diploid genomethat is confidently inferred to be Neandertal as the fraction of sites for whichthe marginal probability is ≥ 90%. To assess variation in theproportion of Neandertal ancestry along the genome, we computed the fraction ofalleles across individuals with marginal probability greater than a specifiedthreshold. This statistic is likely to be affected by variation in power alongthe genome. Hence, we also consider an estimate of the ancestry proportionobtained by averaging the marginal probability across individuals. Depending onthe analyses, these statistics are estimated at a single SNP or innon-overlapping windows of a specified size.
To assess whether the predictions made by the CRF are sensible, weinferred Neandertal ancestry using the low-coverage genome from the VindijaNeandertals1. For thisanalysis, we restricted to sites at which there is at least one read withmapping quality score between 60 and 90 and base quality of at least 40. As asecond validation analysis, we applied the CRF to the sub-Saharan African Luhya(LWK), using the parameters optimized for non-Africans. We empirically assessedthe accuracy of the CRF on the 1000 Genomes project data by assuming that LWKhas no Neandertal ancestry, that the false discovery rate in each non-Africanpopulation is equal to the false discovery rate in LWK, and using thegenome-wide proportion of Neandertal ancestry estimated in ref.2 (SI 3). We computed thetheoretical standard deviation in the proportion of Neandertal ancestry33 assuming a pulsemodel of admixture with 2% Neandertal ancestry followed by 2,000generations of random mating, and 2.03 Gigabases as the number of bases of thehigh-coverage Neandertal genome that pass filters.
Tiling path of Neandertal haplotypes
We identified Neandertal haplotypes as runs of consecutive alleles alonga test haplotype assigned a marginal probability of > 90%. Wefiltered haplotypes smaller than 0.02 cM. At each SNP that is covered by atleast one such haplotype, we estimated the allelic state as the consensus alleleacross the spanning haplotypes. See SI 4.
Functional analysis of introgressed alleles
We defined two subsets of Consensus Coding Sequence (CCDS) genes34. We define a genewith “low Neandertal ancestry” as one in which all allelesacross all individuals have a marginal probability ≤ 10%. Wealso require that the genes included in this analysis include at least ≥100 SNPs within a 100 kb window centered at its midpoint (this excluded geneswith low power). We define a gene with “high Neandertalancestry” as one that is in the top 5% of CCDS genes ranked bythe average marginal probability across individual haplotypes.
Functional enrichment analysis
We tested for enrichment of Gene Ontology (GO)35 categories in genes with low or highNeandertal ancestry, using the hypergeometric test implemented in the FUNCpackage36. We reportmultiple-testing corrected P-values estimated from 1000 permutations for the GOenrichment analysis (Family-Wise Error Rate – FWER). Given the observedcorrelation between Neandertal ancestry and B-statistic18, a concern is that the functionalcategories may not be randomly distributed with respect to B-statistic. Tocontrol for this, we assigned a B-statistic to each gene (estimated as anaverage of the B-statistic over the length of the gene) as well as a uniformrandom number. This resulted in 17,249 autosomal genes. Genes were binned into20-equal sized bins based on the gene-specific B-statistic. Within each bin,genes were sorted by Neandertal ancestry and then by the random number. Genesranked within the top 5% within each bin were used for the analysis. SeeSI 6 for details.
Identifying alleles born in Neandertals and cross-correlation withassociation study data
To infer whether an allele segregating in a present-day human populationwas introduced by Neandertal gene flow, we defined a probable Neandertal alleleas one with marginal probability of ≥ 90% and a non-Neandertalallele as having a marginal probability of ≤ 10%. A SNP at whichall of the confident non-Neandertal alleles as well as all alleles in YRI areancestral and all of the confident Neandertal alleles are derived is inferred tobe of Neandertal origin. This allows for some false negatives in the predictionof the CRF. This procedure yields 97,365 Neandertal-derived SNPs when applied tothe predictions in Europeans and East Asians. We downloaded the variants listedin the NHGRI GWAS catalog16,retaining entries for which the reported association is a SNP with an assignedrs-number, and where the nominal P-value is<5×10−8. This resulted in 5,022 associations,which we then intersected with the Neandertal-derived list. See SI 7 fordetails.
Identification of genomic regions deficient in Neandertal ancestry
We measured the fraction of alleles across individuals and SNPs thathave less than a specified proportion of Neandertal ancestry measured withinnon-overlapping 10 Megabase (Mb) windows. We chose a threshold of marginalprobability of >25% as this threshold was found to lead to highrecall in our empirical assessment (SI 3). We reported windows for which thisstatistic is < 0.1%.
To understand the causes for variation in Neandertal ancestry, we testedfor correlation to a B-statistic. Each SNP was annotated with the B-statisticlifted over to hg19 coordinates. We assessed correlation between B and estimatesof Neandertal ancestry proportion at a nucleotide level as well as at differentsize scales, and separately on the autosomes and chromosome X (SI 8). We alsoused wavelet decomposition37to analyze the correlation of the inferred Neandertal ancestry between Europeansand East Asians at multiple size scales (SI 10).Figure 2 reports the relation between the mean marginal probabilityof Neandertal ancestry across individuals and quintiles of B-statistic at eachSNP. To assess significance, we estimated Spearman’s correlationρSpearman and standard errors using a block jackknife38 with 10 Mb blocks(SI 8).
To understand the contribution of demography to variation in Neandertalancestry along the genome, we measured the coefficient of variation, at a 10 Mbsize scale, of the proportion of ancestry estimated as defined above. We thenapplied the CRF to data simulated under diverse demographic models and comparedthe coefficient of variation to the observed value (SI 8).
Power to infer Neandertal ancestry as a function of demography and genomicfeatures
To assess the power of the CRF to infer Neandertal ancestry, wesimulated data under diverse demographic models39. In one simulation series, we variedeffective population size to approximate the effect of background selection andmeasured recall at a precision of 90%. In a second series, we assessedpower on chromosome X versus the autosomes by matching the effective populationsize, recombination rate and mutation rate to estimated values for chromosome X.See SI 2.
Unbiased estimate of the proportion of Neandertal ancestry as a function ofB-statistic
To estimate the proportion of Neandertal ancestry in an unbiased way, wedivided the genome into quintiles of B, and estimated the proportion ofNeandertal ancestry using a statistic first published in ref.19. This statistic measures howmuch closer a non-African individual is to Denisova than an African individual,divided by the same quantity replacing the non-African individual withNeandertal. We report the estimated proportion of Neandertal ancestry in eachquintile as a fraction of the genome-wide mean and obtain standard errors usinga block jackknife with 100 blocks.
We analyzed data from 27 deeply sequenced genomes: 25 present-day humansand the high-coverage Neandertal and Denisova14 genomes. For each, we required thatsites pass the more stringent set of the two filters described in ref.2, have a genotypequality of GQ ≥ 45, and have an ancestral allele that can be determinedbased on comparison to chimpanzee and at least one of gorilla or orangutan. Wecomputed a Z-score for the difference in the ancestry across the bin of highestB-statistic versus the rest and used a Bonferroni correction for ten hypotheses(5 hypotheses based on which set of bins we merge and a 2-sided test in each).In our main analysis, we analyzed both transitions and transversions and pooledgenomes for all non-African samples. We also analyzed other subsets of the data:transversions only in non-Africans, in Europeans and in eastern non-Africans.See SI 9 for details.
Tissue-specific expression
We defined tissue-specific expression levels using the Illumina BodyMap2.0 RNA-seq data25, whichcontains expression data from 16 human tissues. We identified genes that areexpressed in a tissue-specific manner using the DESeq package40 and used a P-value cutoff of0.05. We tested enrichment of tissue-specific genes in regions of high or lowNeandertal ancestry. A concern when testing for enrichment is that clustering ofsimilarly expressed genes coupled with the large size of regions of lowNeandertal ancestry might lead to spurious signals of enrichment. Hence, wedevised a permutation test that randomly rotated the annotations of genes(treating each chromosome as a circle) while maintaining the correlation withingenes and within Neandertal ancestry as well as between Neandertal ancestry andgenes. We tested enrichment on the whole genome, on the autosomes alone, and onchromosome X alone. We generated 1,000 random rotations for each test except forthe X chromosome for which we generated all possible rotations. We computed thefraction of permutations for which the P-value of Fischer’s exact testis at least as low as the observed P-value (See SI 6).
Supplementary Material
Acknowledgments
We thank Adrian Briggs, Priya Moorjani, Molly Przeworski, Daven Presgraves, AmyWilliams, and two anonymous reviewers for critical comments, and Kathryn Kavanaghfor help with ExtendedFigure 2. We aregrateful for support from the Presidential Innovation Fund of the Max PlanckSociety, NSF HOMINID grant #1032255 and NIH grant GM100233. SS was supportedby a post-doctoral fellowship from the Initiative for the Science of the Human Pastat Harvard. DR is a Howard Hughes Medical Institute Investigator.
Footnotes
SupplementaryInformation is linked to the online version of the paper atwww.nature.com/nature. The tiling path of confidently inferredNeandertal haplotypes, as well as the Neandertal introgression map, can be foundathttp://genetics.med.harvard.edu/reichlab/Reich_Lab/Datasets.html.
Author contributions
SS, NP, SP and DR conceived of the study. SS, SM MD, KP, JK and DR performedanalyses. JK, SP, NP and DR supervised the study. SS and DR wrote the manuscriptwith help from all coauthors.
The authors declare no competing financial interests.
References
- 1.Green RE, et al. A draft sequence of the Neandertal genome. Science. 2010;328:710–722. doi: 10.1126/science.1188021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Prüfer K, et al. The complete genome sequence of a Neandertal from the AltaiMountains. Nature. 2014;505:43–49. doi: 10.1038/nature12886. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Abi-Rached L, et al. The shaping of modern human immune systems by multiregionaladmixture with archaic humans. Science. 2011;334:89–94. doi: 10.1126/science.1209202. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Mendez FL, Watkins JC, Hammer MF. A haplotype at STAT2 Introgressed from neanderthals and serves asa candidate of positive selection in Papua New Guinea. Am J Hum Genet. 2012;91:265–274. doi: 10.1016/j.ajhg.2012.06.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Mendez FL, Watkins JC, Hammer MF. Neandertal origin of genetic variation at the cluster of OASimmunity genes. Mol Biol Evol. 2013;30:798–801. doi: 10.1093/molbev/mst004. [DOI] [PubMed] [Google Scholar]
- 6.Yotova V, et al. An X-linked haplotype of Neandertal origin is present among allnon-African populations. Mol Biol Evol. 2011;28:1957–1962. doi: 10.1093/molbev/msr024. [DOI] [PubMed] [Google Scholar]
- 7.Wall JD, et al. Higher levels of neanderthal ancestry in East aSians than inEuropeans. Genetics. 2013;194:199–209. doi: 10.1534/genetics.112.148213. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Lachance J, et al. Evolutionary history and adaptation from high-coveragewhole-genome sequences of diverse African hunter-gatherers. Cell. 2012;150:457–469. doi: 10.1016/j.cell.2012.07.009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Lafferty J, McCallum A, Pereira FCN. Conditional Random Fields: Probabilistic Models for Segmentingand Labeling Sequence Data. ICML ‘01 Proceedings of the Eighteenth InternationalConference on Machine Learning; 2001. pp. 282–289. [Google Scholar]
- 10.Sankararaman S, Patterson N, Li H, Paabo S, Reich D. The date of interbreeding between Neandertals and modernhumans. PLoS Genet. 2012;8:e1002947. doi: 10.1371/journal.pgen.1002947. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Hellenthal G, Stephens M. msHOT: modifying Hudson’s ms simulator to incorporatecrossover and gene conversion hotspots. Bioinformatics. 2007;23:520–521. doi: 10.1093/bioinformatics/btl622. [DOI] [PubMed] [Google Scholar]
- 12.Abecasis GR, et al. An integrated map of genetic variation from 1,092 humangenomes. Nature. 2012;491:56–65. doi: 10.1038/nature11632. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Paten B, Herrero J, Beal K, Fitzgerald S, Birney E. Enredo and Pecan: genome-wide mammalian consistency-basedmultiple alignment with paralogs. Genome Res. 2008;18:1814–1828. doi: 10.1101/gr.076554.108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Meyer M, et al. A high-coverage genome sequence from an archaic Denisovanindividual. Science. 2012;338:222–226. doi: 10.1126/science.1224344. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.23andMe White Paper 23-05. Neanderthal Ancestry Estimator. <https://23andme.https.internapcdn.net/res/pdf/hXitekfSJe1lcIy7-Q72XA_23-05_Neanderthal_Ancestry.pdf>.
- 16.Hindorff LA, et al. Potential etiologic and functional implications of genome-wideassociation loci for human diseases and traits. Proc Natl Acad Sci U S A. 2009;106:9362–9367. doi: 10.1073/pnas.0903103106. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.The SIGMA Type 2 Diabetes Consortium. Sequence variants in SLC16A11 are a common risk factor for type 2diabetes in Mexico. Nature. 2014;506(7486):97–101. doi: 10.1038/nature12828. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.McVicker G, Gordon D, Davis C, Green P. Widespread genomic signatures of natural selection in hominidevolution. PLoS Genet. 2009;5:e1000471. doi: 10.1371/journal.pgen.1000471. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Reich D, et al. Genetic history of an archaic hominin group from Denisova Cave inSiberia. Nature. 2010;468:1053–1060. doi: 10.1038/nature09710. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Tucker PK, Sage RD, Wilson AC, Eichler EM. Abrupt cline for sex chromosomes in a hybrid zone between twospecies of mice. Evolution Int J Org Evolution. 1992;46:1146–1163. doi: 10.1111/j.1558-5646.1992.tb00625.x. [DOI] [PubMed] [Google Scholar]
- 21.Good JM, Dean MD, Nachman MW. A complex genetic basis to X-linked hybrid male sterility betweentwo species of house mice. Genetics. 2008;179:2213–2228. doi: 10.1534/genetics.107.085340. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Presgraves DC. Sex chromosomes and speciation in Drosophila. Trends Genet. 2008;24:336–343. doi: 10.1016/j.tig.2008.04.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Coyne JAO, HA . In: Speciation and its Consequences. Otte D, Endler JA, editors. 1989. pp. 180–207. [Google Scholar]
- 24.Wu CI, Davis AW. Evolution of postmating reproductive isolation: the compositenature of Haldane’s rule and its genetic basis. American Naturalist. 1993;142:187–212. doi: 10.1086/285534. [DOI] [PubMed] [Google Scholar]
- 25.Derrien T, et al. The GENCODE v7 catalog of human long noncoding RNAs: analysis oftheir gene structure, evolution, and expression. Genome Res. 2012;22:1775–1789. doi: 10.1101/gr.132159.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Keinan A, Mullikin JC, Patterson N, Reich D. Measurement of the human allele frequency spectrum demonstratesgreater genetic drift in East Asians than in Europeans. Nat Genet. 2007;39:1251–1255. doi: 10.1038/ng2116. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Bhatia G, et al. Genome-wide scan of 29,141 African Americans finds no evidence ofselection since admixture. arXiv:1312.2675 [q-bio.PE] [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Orr HA, Turelli M. The evolution of postzygotic isolation: accumulatingDobzhansky-Muller incompatibilities. Evolution. 2001;55:1085–1094. doi: 10.1111/j.0014-3820.2001.tb00628.x. [DOI] [PubMed] [Google Scholar]
- 29.Myers S, Bottolo L, Freeman C, McVean G, Donnelly P. A fine-scale map of recombination rates and hotspots across thehuman genome. Science. 2005;310:321–324. doi: 10.1126/science.1117196. [DOI] [PubMed] [Google Scholar]
- 30.Sutton C, McCallum A. In: Introduction to Statistical Relational Learning. Getoor Lise, Taskar Ben., editors. Ch. 4. MIT Press; 2007. pp. 93–128. [Google Scholar]
- 31.Byrd RH, Nocedal J, Schnabel RB. Representations of quasi-Newton matrices and their use in limitedmemory methods. Mathematical Programming. 1994;63.1-3:129–156. [Google Scholar]
- 32.Tandem Repeat Finder. <url={http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/simpleRepeat.txt.gz}>.
- 33.Gravel S. Population genetics models of local ancestry. Genetics. 2012;191:607–619. doi: 10.1534/genetics.112.139808. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Pruitt KD, et al. The consensus coding sequence (CCDS) project: Identifying acommon protein-coding gene set for the human and mousegenomes. Genome Res. 2009;19:1316–1323. doi: 10.1101/gr.080531.108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Ashburner M, et al. Gene ontology: tool for the unification of biology. The GeneOntology Consortium. Nat Genet. 2000;25:25–29. doi: 10.1038/75556. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Prufer K, et al. FUNC: a package for detecting significant associations betweengene sets and ontological annotations. BMC Bioinformatics. 2007;8:41. doi: 10.1186/1471-2105-8-41. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Percival DB, Walden AT. Wavelet Methods for Time Series Analysis. Cambridge University Press; 2005. [Google Scholar]
- 38.Kunsch HR. The jackknife and the bootstrap for general stationaryobservations. The Annals of Statistics. 1989;17:1217–1241. [Google Scholar]
- 39.Hudson RR. Generating samples under a Wright-Fisher neutral model of geneticvariation. Bioinformatics. 2002;18:337–338. doi: 10.1093/bioinformatics/18.2.337. [DOI] [PubMed] [Google Scholar]
- 40.Anders S, Huber W. Differential expression analysis for sequence countdata. Genome Biol. 2010;11:R106. doi: 10.1186/gb-2010-11-10-r106. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.




