The 1000 Genomes Project (1KGP), which took place from January 2008 to 2015, was an international research effort to establish the most detailed catalogue of human genetic variation at the time. Scientists planned to sequence the genomes of at least one thousand anonymous healthy participants from a number of different ethnic groups within the following three years, using advances in newly developed technologies. In 2010, the project finished its pilot phase, which was described in detail in a publication in the journal Nature.[1] In 2012, the sequencing of 1092 genomes was announced in a Nature publication.[2] In 2015, two papers in Nature reported results, the completion of the project, and opportunities for future research.[3][4]
Many rare variations, restricted to closely related groups, were identified, and eight structural-variation classes were analyzed.[5]
The project united multidisciplinary research teams from institutes around the world, including China, Italy, Japan, Kenya, Nigeria, Peru, the United Kingdom, and the United States, which contributed to the sequence dataset and to a refined human genome map freely accessible through public databases to the scientific community and the general public alike.[2]
The International Genome Sample Resource was created to host and expand on the data set after the project's end.[6]
Since the completion of the Human Genome Project, advances in human population genetics and comparative genomics had enabled further insight into genetic diversity.[7] The understanding of structural variations (insertions/deletions (indels), copy number variations (CNV), retroelements), single-nucleotide polymorphisms (SNPs), and natural selection was being solidified.[8][9][10][11]
The diversity of human genetic variation, such as indels, was being uncovered as human genomic variations were investigated.[citation needed]
The project also aimed to provide evidence that could be used to explore the impact of natural selection on population differences. Patterns of DNA polymorphisms can be used to reliably detect signatures of selection and may help to identify genes that might underlie variation in disease resistance or drug metabolism.[12][13] Such insights could improve understanding of phenotypic variations, genetic disorders and Mendelian inheritance and their effects on survival and/or reproduction of different human populations.
The 1000 Genomes Project was designed to bridge the gap in knowledge between rare genetic variants that have a severe effect, predominantly on simple traits (e.g. cystic fibrosis, Huntington disease), and common genetic variants that have a mild effect and are implicated in complex traits (e.g. cognition, diabetes, heart disease).[14]
The primary goal of this project was to create a complete and detailed catalogue of human genetic variations, which can be used for association studies relating genetic variation to disease. The consortium aimed to discover >95% of the variants (e.g. SNPs, CNVs, indels) with minor allele frequencies as low as 1% across the genome and 0.1–0.5% in gene regions, as well as to estimate the population frequencies, haplotype backgrounds and linkage disequilibrium patterns of variant alleles.[15]
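To give a feel for why a sample on the order of a thousand genomes suits the 1% frequency target, the sketch below computes the chance that an allele of a given frequency is present at least once among the sampled chromosomes. This is a deliberately simplified model, not the consortium's own power calculation: it assumes 1,000 diploid individuals, independent sampling of chromosomes, and perfect detection of any allele that is present (sequencing depth and error are ignored). The function name `prob_allele_sampled` is hypothetical.

```python
def prob_allele_sampled(allele_freq: float, n_individuals: int) -> float:
    """Probability that an allele appears at least once in a diploid sample.

    Simplifying assumptions (not from the source): the 2 * n_individuals
    chromosomes are independent draws, and any allele present is detected.
    """
    n_chromosomes = 2 * n_individuals
    return 1.0 - (1.0 - allele_freq) ** n_chromosomes


if __name__ == "__main__":
    # Frequencies spanning the stated targets: 1% genome-wide, 0.1-0.5% in genes.
    for freq in (0.01, 0.005, 0.001):
        p = prob_allele_sampled(freq, n_individuals=1000)
        print(f"allele frequency {freq:.3f}: P(sampled at least once) = {p:.4f}")
```

Under these assumptions an allele at 1% frequency is almost certain to be present in the sample, while an allele at 0.1% is missed noticeably more often. Presence in the sample is necessary but not sufficient for discovery, particularly at low sequencing coverage.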
Secondary goals included the support of better SNP and probe selection for genotyping platforms in future studies and the improvement of the human reference sequence. The completed database was expected to be a useful tool for studying regions under selection and variation in multiple populations, and for understanding the underlying processes of mutation and recombination.[15]
The human genome consists of approximately 3 billion DNA base pairs and is estimated to carry around 20,000 protein-coding genes. In designing the study, the consortium needed to address several critical issues regarding the project metrics, such as technology challenges, data quality standards and sequence coverage.[15]
Over the course of the next three years,[clarification needed] scientists at the Sanger Institute, BGI Shenzhen and the National Human Genome Research Institute’s Large-Scale Sequencing Network planned to sequence a minimum of 1,000 human genomes. Because of the large amount of sequence data required, recruitment of additional participants continued.[14]
Almost 10 billion bases were to be sequenced per day over the two-year production phase, equating to more than two human genomes every 24 hours. The intended sequence dataset was to comprise 6 trillion DNA bases, 60-fold more sequence data than had been published in DNA databases at the time.[14]
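These throughput figures can be checked with simple arithmetic. The sketch below assumes a genome size of about 3 billion bases (the approximate figure given in this article) and treats "almost 10 billion" as exactly 10 billion; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the planned production throughput.
GENOME_SIZE_BASES = 3e9       # assumption: ~3 billion base pairs per human genome
BASES_PER_DAY = 10e9          # "almost 10 billion bases" per day, rounded up
TARGET_DATASET_BASES = 6e12   # intended dataset of 6 trillion bases
FOLD_OVER_DATABASES = 60      # "60-fold more" than existing DNA databases

genomes_per_day = BASES_PER_DAY / GENOME_SIZE_BASES
days_to_target = TARGET_DATASET_BASES / BASES_PER_DAY
implied_database_bases = TARGET_DATASET_BASES / FOLD_OVER_DATABASES

print(f"{genomes_per_day:.1f} genome-equivalents per day")
print(f"{days_to_target:.0f} days to reach the target dataset")
print(f"{implied_database_bases:.1e} bases implied in databases at the time")
```

At this rate the daily output is a little over three genome-equivalents, consistent with "more than two human genomes every 24 hours", and the 6 trillion base target would be reached in about 600 days, which fits within a two-year production phase.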
To determine the final design of the full project, three pilot studies were to be carried out within the first year. The first pilot was to genotype 180 people from three major geographic groups at low coverage (2×). In the second pilot, the genomes of two nuclear families (both parents and an adult child) were to be sequenced with deep coverage (20× per genome). The third pilot involved sequencing the coding regions (exons) of 1,000 genes in 1,000 people with deep coverage (20×).[14][15]
It was estimated that the project would likely cost more than $500 million if standard DNA sequencing technologies were used. Several newer technologies (e.g. Solexa, 454, SOLiD) were to be applied, lowering the expected costs to between $30 million and $50 million. The major support was provided by the Wellcome Trust Sanger Institute in Hinxton, England; the Beijing Genomics Institute, Shenzhen (BGI Shenzhen), China; and the NHGRI, part of the National Institutes of Health (NIH).[14]
In keeping with the Fort Lauderdale principles,[16] all genome sequence data (including variant calls) were made freely available as the project progressed and can be downloaded via FTP from the 1000 Genomes Project webpage.[17]
Based on the overall goals for the project, the samples were to be chosen to provide power in populations where association studies for common diseases were being carried out. Furthermore, the samples did not need to have medical or phenotype information, since the proposed catalogue was to be a basic resource on human variation.[15]
For the pilot studies, human genome samples from the HapMap collection were to be sequenced. It was considered useful to focus on samples that had additional data available (such as ENCODE sequence, genome-wide genotypes, fosmid-end sequence, structural variation assays, and gene expression) so that the results could be compared with those from other projects.[15]
Complying with extensive ethical procedures, the 1000 Genomes Project was then to use samples from volunteer donors. The following populations were to be included in the study: Yoruba in Ibadan (YRI), Nigeria; Japanese in Tokyo (JPT); Chinese in Beijing (CHB); Utah residents with ancestry from northern and western Europe (CEU); Luhya in Webuye, Kenya (LWK); Maasai in Kinyawa, Kenya (MKK); Toscani in Italy (TSI); Peruvians in Lima, Peru (PEL); Gujarati Indians in Houston (GIH); Chinese in metropolitan Denver (CHD); people of Mexican ancestry in Los Angeles (MXL); and people of African ancestry in the southwestern United States (ASW).[14]
ID | Population | Detail |
---|---|---|
ASW | African Ancestry in Southwestern US | [1] |
ACB | African Caribbean in Barbados | [2] |
BEB | Bengali in Bangladesh | [3] |
GBR | British from England and Scotland | [4] |
CDX | Chinese Dai in Xishuangbanna, China | [5] |
CLM | Colombian in Medellín, Colombia | [6] |
ESN | Esan in Nigeria | [7] |
FIN | Finnish in Finland | [8] |
GWD | Gambian in Western Division – Mandinka | [9] |
GIH | Gujarati Indians in Houston, Texas, United States | [10] |
CHB | Han Chinese in Beijing, China | [11] |
CHS | Han Chinese South China | [12] |
IBS | Iberian populations in Spain | [13] |
ITU | Indian Telugu in the United Kingdom | [14] |
JPT | Japanese in Tokyo, Japan | [15] |
KHV | Kinh in Ho Chi Minh City, Vietnam | [16] |
LWK | Luhya in Webuye, Kenya | [17] |
MSL | Mende in Sierra Leone | [18] |
MXL | Mexican Ancestry in Los Angeles, California, United States | [19] |
PEL | Peruvian in Lima, Peru | [20] |
PUR | Puerto Rican in Puerto Rico | [21] |
PJL | Punjabi in Lahore, Pakistan | [22] |
STU | Sri Lankan Tamil in the United Kingdom | [23] |
TSI | Toscani in Italy | [24] |
YRI | Yoruba in Ibadan, Nigeria | [25] |
CEU | Utah residents with Northern and Western European ancestry from the CEPH collection | [26] |
* Population that was collected in diaspora
Data generated by the 1000 Genomes Project is widely used by the genetics community, making the first 1000 Genomes Project paper one of the most cited papers in biology.[19] To support this user community, the project held a community analysis meeting in July 2012 that included talks highlighting key project discoveries, their impact on population genetics and human disease studies, and summaries of other large-scale sequencing studies.[20]
The pilot phase consisted of three projects:
It was found that, on average, each person carries around 250–300 loss-of-function variants in annotated genes and 50–100 variants previously implicated in inherited disorders. Based on the two trios, the rate of de novo germline mutation was estimated to be approximately 10⁻⁸ per base per generation.[1]
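As a rough illustration of what a rate of about 10⁻⁸ per base per generation implies, the sketch below multiplies it by the approximate genome size. The haploid size of 3 billion bases is the approximate figure used in this article, and doubling it for a diploid genome is an assumption of this sketch. The result is an order-of-magnitude estimate, not a figure reported by the project.

```python
MUTATION_RATE = 1e-8         # de novo mutations per base per generation (from the text)
HAPLOID_GENOME_BASES = 3e9   # assumption: ~3 billion base pairs per haploid genome

# Expected new mutations inherited through one parental (haploid) genome copy.
per_haploid_genome = MUTATION_RATE * HAPLOID_GENOME_BASES
# Assumption: a diploid genome is two haploid copies with the same rate.
per_diploid_genome = 2 * per_haploid_genome

print(f"about {per_haploid_genome:.0f} new mutations per haploid genome per generation")
print(f"about {per_diploid_genome:.0f} new mutations per diploid genome per generation")
```

Under these assumptions the expected number of new germline mutations per generation is in the tens, which is small compared with the thousands of inherited variants each person carries.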