Study of genetic variants in different individuals
In genomics, a genome-wide association study (GWA study, or GWAS) is an observational study of a genome-wide set of genetic variants in different individuals to see if any variant is associated with a trait. GWA studies typically focus on associations between single-nucleotide polymorphisms (SNPs) and traits like major human diseases, but can equally be applied to any other genetic variants and any other organisms.
An illustration of a Manhattan plot depicting several strongly associated risk loci. Each dot represents a SNP, with the X-axis showing genomic location and the Y-axis showing association level. This example is taken from a GWA study investigating kidney stone disease, so the peaks indicate genetic variants that are found more often in individuals with kidney stones.
When applied to human data, GWA studies compare the DNA of participants having varying phenotypes for a particular trait or disease. These participants may be people with a disease (cases) and similar people without the disease (controls), or they may be people with different phenotypes for a particular trait, for example blood pressure.[1] This approach is known as phenotype-first, in which the participants are classified first by their clinical manifestation(s), as opposed to genotype-first. Each person gives a sample of DNA, from which millions of genetic variants are read using SNP arrays. If there is significant statistical evidence that one type of the variant (one allele) is more frequent in people with the disease, the variant is said to be associated with the disease. The associated SNPs are then considered to mark a region of the human genome that may influence the risk of disease.
GWA studies investigate the entire genome, in contrast to methods that specifically test a small number of pre-specified genetic regions. Hence, GWAS is a non-candidate-driven approach, in contrast to gene-specific candidate-driven studies. GWA studies identify SNPs and other variants in DNA associated with a disease, but they cannot on their own specify which genes are causal.[2][3][4]
The first successful GWAS, published in 2002, studied myocardial infarction.[5] This study design was then implemented in the landmark 2005 GWA study investigating patients with age-related macular degeneration, which found two SNPs with significantly altered allele frequency compared to healthy controls.[6] As of 2017, over 3,000 human GWA studies have examined over 1,800 diseases and traits, and thousands of SNP associations have been found.[7] Except in the case of rare genetic diseases, these associations are very weak. Although each individual association may not explain much of the risk, together they provide insight into critical genes and pathways and can be important when considered in aggregate.
GWA studies typically identify common variants with small effect sizes (lower right).[8]
Any two human genomes differ in millions of different ways. There are small variations in the individual nucleotides of the genomes (SNPs) as well as many larger variations, such as deletions, insertions and copy number variations. Any of these may cause alterations in an individual's traits, or phenotype, which can be anything from disease risk to physical properties such as height.[9] Around the year 2000, prior to the introduction of GWA studies, the primary method of investigation was through inheritance studies of genetic linkage in families. This approach had proven highly useful for single-gene disorders.[10][9][11] However, for common and complex diseases the results of genetic linkage studies proved hard to reproduce.[9][11] A suggested alternative to linkage studies was the genetic association study. This study type asks if the allele of a genetic variant is found more often than expected in individuals with the phenotype of interest (e.g. with the disease being studied). Early calculations on statistical power indicated that this approach could be better than linkage studies at detecting weak genetic effects.[12]
In addition to the conceptual framework, several other factors enabled GWA studies. One was the advent of biobanks, which are repositories of human genetic material that greatly reduced the cost and difficulty of collecting sufficient numbers of biological specimens for study.[13] Another was the International HapMap Project, which, from 2003, identified a majority of the common SNPs interrogated in a GWA study.[14] The haploblock structure identified by the HapMap project also made it possible to focus on the subset of SNPs that would describe most of the variation. The development of methods to genotype all these SNPs using genotyping arrays was also an important prerequisite.[15]
Example calculation illustrating the methodology of a case-control GWA study. The allele count of each measured SNP is evaluated, in this case with a chi-squared test, to identify variants associated with the trait in question. The numbers in this example are taken from a 2007 study of coronary artery disease (CAD) that showed that individuals with the G-allele of SNP1 (rs1333049) were overrepresented amongst CAD patients.[16]
Illustration of a simulated genotype by phenotype regression for a single SNP. Each dot represents an individual. A GWAS of a continuous trait essentially consists of repeating this analysis at each SNP.
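The single-SNP regression described in the caption above can be sketched in a few lines. This is a minimal illustration, not the method of any particular study: the function name `snp_regression`, the SNP labels and all data values are hypothetical, genotypes are assumed to be coded as allele counts (0, 1 or 2), and no covariates are included.

```python
import math

def snp_regression(genotypes, phenotypes):
    """Least-squares fit of phenotype on allele count (0, 1 or 2) for one SNP.

    Returns (slope, intercept, t_statistic). Assumes the SNP is polymorphic
    in the sample and that the fit is not perfect (non-zero residuals).
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotypes))
    slope = sxy / sxx                      # estimated effect per allele copy
    intercept = mean_y - slope * mean_g
    rss = sum((y - (intercept + slope * g)) ** 2
              for g, y in zip(genotypes, phenotypes))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope, intercept, slope / std_err

# Hypothetical phenotype values for six individuals.
phenotypes = [1.0, 1.2, 1.5, 1.7, 2.0, 2.2]

# Hypothetical genotypes: one list of allele counts per SNP.
genotype_matrix = {
    "snp_1": [0, 0, 1, 1, 2, 2],
    "snp_2": [0, 1, 2, 0, 1, 2],
}

# A GWAS of a continuous trait repeats the same fit at every SNP.
results = {snp: snp_regression(g, phenotypes) for snp, g in genotype_matrix.items()}
```

In practice the t-statistic would be converted to a P-value and the model would include covariates; both steps are omitted here to keep the sketch short.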
The most common approach of GWA studies is the case-control setup, which compares two large groups of individuals, one healthy control group and one case group affected by a disease. All individuals in each group are typically genotyped at common known SNPs. The exact number of SNPs depends on the genotyping technology, but is typically one million or more.[8] For each of these SNPs it is then investigated if the allele frequency is significantly altered between the case and the control group.[17] In such setups, the fundamental unit for reporting effect sizes is the odds ratio. The odds ratio is the ratio of two odds, which in the context of GWA studies are the odds of case for individuals having a specific allele and the odds of case for individuals who do not have that same allele.[citation needed]
Example: suppose that there are two alleles, T and C. The number of individuals in the case group having allele T is represented by 'A' and the number of individuals in the control group having allele T is represented by 'B'. Similarly, the number of individuals in the case group having allele C is represented by 'X' and the number of individuals in the control group having allele C is represented by 'Y'. In this case the odds ratio for allele T is A:B (meaning 'A to B', in standard odds terminology) divided by X:Y, which in mathematical notation is simply (A/B)/(X/Y).[citation needed]
When the allele frequency in the case group is much higher than in the control group, the odds ratio is higher than 1, and vice versa for lower allele frequency. Additionally, a P-value for the significance of the odds ratio is typically calculated using a simple chi-squared test. Finding odds ratios that are significantly different from 1 is the objective of the GWA study because this shows that a SNP is associated with disease.[17] Because so many variants are tested, it is standard practice to require the P-value to be lower than 5×10⁻⁸ to consider a variant significant.[citation needed]
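The odds ratio and chi-squared test described above can be sketched for a single SNP as follows. The allele counts are hypothetical and the function name `allelic_association` is an assumption made for illustration. The test shown is the Pearson chi-squared test without continuity correction, and the P-value uses the fact that a chi-squared variable with one degree of freedom has survival function erfc(√(χ²/2)).

```python
import math

def allelic_association(case_t, control_t, case_c, control_c):
    """Odds ratio and 1-df Pearson chi-squared test for a 2x2 allele-count table.

    case_t / control_t: counts of allele T in cases / controls (A and B above)
    case_c / control_c: counts of allele C in cases / controls (X and Y above)
    """
    # Odds ratio for allele T, written as in the text: (A/B) / (X/Y).
    odds_ratio = (case_t / control_t) / (case_c / control_c)

    n = case_t + control_t + case_c + control_c
    # Pearson chi-squared statistic for a 2x2 table (no continuity correction).
    chi2 = n * (case_t * control_c - control_t * case_c) ** 2 / (
        (case_t + case_c) * (control_t + control_c)
        * (case_t + control_t) * (case_c + control_c)
    )
    # P-value: survival function of the chi-squared distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, chi2, p_value

GENOME_WIDE_THRESHOLD = 5e-8

# Hypothetical allele counts, not taken from any study.
odds_ratio, chi2, p_value = allelic_association(
    case_t=2400, control_t=2000, case_c=1600, control_c=2000
)
is_significant = p_value < GENOME_WIDE_THRESHOLD
```

With these made-up counts the odds ratio is 1.5 and the P-value falls below the genome-wide threshold, so the SNP would be reported as associated.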
There are several variations on the case-control approach. A common alternative to case-control GWA studies is the analysis of quantitative phenotypic data, e.g. height, biomarker concentrations or even gene expression. Likewise, alternative statistics designed for dominance or recessive penetrance patterns can be used.[17] Calculations are typically done using bioinformatics software such as SNPTEST and PLINK, which also include support for many of these alternative statistics.[16][18] GWAS focuses on the effect of individual SNPs. However, it is also possible that complex interactions among two or more SNPs (epistasis) might contribute to complex diseases. Due to the potentially exponential number of interactions, detecting statistically significant interactions in GWAS data is both computationally and statistically challenging. This task has been tackled in existing publications that use algorithms inspired by data mining.[19] Moreover, researchers try to integrate GWA data with other biological data, such as protein-protein interaction networks, to extract more informative results.[20][21] Despite the previously perceived challenge posed by the vast number of SNP combinations, a recent study has unveiled complete epistatic maps at gene-level resolution in the plant Arabidopsis thaliana.[22]
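The scale of the interaction search can be made concrete with a short calculation. The figures below are assumptions chosen for illustration: one million SNPs (the typical array size mentioned earlier) and a Bonferroni correction at a family-wise error rate of 0.05.

```python
import math

n_snps = 1_000_000   # assumed number of genotyped SNPs
alpha = 0.05         # assumed family-wise error rate

# Number of distinct SNP pairs and triples that an exhaustive scan would test.
n_pairs = math.comb(n_snps, 2)
n_triples = math.comb(n_snps, 3)

# Bonferroni-corrected P-value thresholds for each search space.
single_snp_threshold = alpha / n_snps
pairwise_threshold = alpha / n_pairs

print(f"pairs:   {n_pairs:,}")
print(f"triples: {n_triples:,}")
print(f"pairwise threshold: {pairwise_threshold:.2e}")
```

Under these assumptions an exhaustive pairwise scan involves roughly 5×10¹¹ tests, so the corrected threshold is several orders of magnitude stricter than the single-SNP threshold.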
Full 2D epistatic interaction maps point to epistatic signal.[23]
Zoom into a full epistatic map for an Arabidopsis phenotype.[23]
A key step in the majority of GWA studies is the imputation of genotypes at SNPs not on the genotype chip used in the study.[24] This process greatly increases the number of SNPs that can be tested for association, increases the power of the study, and facilitates meta-analysis of GWAS across distinct cohorts. Genotype imputation is carried out by statistical methods that combine the study's genotype data with a reference panel of haplotypes, which have typically been densely genotyped using whole-genome sequencing. These methods take advantage of the sharing of haplotypes between individuals over short stretches of sequence to impute alleles. Existing software packages for genotype imputation include IMPUTE2,[25] Minimac, Beagle[26] and MaCH.[27]
In addition to the calculation of association, it is common to take into account any variables that could potentially confound the results. Sex, age, and ancestry are common examples of confounding variables. Moreover, it is also known that many genetic variations are associated with the geographical and historical populations in which the mutations first arose.[28] Because of this association, studies must take account of the geographic and ethnic background of participants by controlling for what is called population stratification. Otherwise, the studies could produce false positive results.[29]
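A toy calculation shows how population stratification can create a false positive. All counts below are invented for illustration. The two subpopulations differ in both allele frequency and disease prevalence, but within each one the allele has no effect. The Mantel-Haenszel estimator is used here as one simple way to adjust for a stratum. It is swapped in for illustration only, and real studies typically use other adjustments, such as ancestry covariates in a regression model.

```python
def odds_ratio(case_t, case_c, control_t, control_c):
    """Odds ratio for allele T from a 2x2 table of allele counts."""
    return (case_t * control_c) / (case_c * control_t)

def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio across strata (subpopulations)."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

# Hypothetical allele counts per subpopulation:
# (case_t, case_c, control_t, control_c)
population_a = (800, 200, 200, 50)    # allele T common, disease common
population_b = (50, 200, 200, 800)    # allele T rare, disease rare
strata = [population_a, population_b]

# Within each subpopulation the allele is unrelated to disease (OR = 1).
within_a = odds_ratio(*population_a)
within_b = odds_ratio(*population_b)

# Ignoring ancestry and pooling the counts suggests a strong association.
pooled_counts = tuple(sum(column) for column in zip(*strata))
naive_or = odds_ratio(*pooled_counts)

# Adjusting for subpopulation removes the spurious signal.
adjusted_or = mantel_haenszel_or(strata)
```

Here the naive pooled odds ratio is about 4.5 even though the stratum-specific and adjusted odds ratios are all 1.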
After odds ratios and P-values have been calculated for all SNPs, a common approach is to create a Manhattan plot. In the context of GWA studies, this plot shows the negative logarithm of the P-value as a function of genomic location. Thus the SNPs with the most significant association stand out on the plot, usually as stacks of points because of haploblock structure. Importantly, the P-value threshold for significance is corrected for multiple testing. The exact threshold varies by study,[30] but the conventional genome-wide significance threshold is 5×10⁻⁸, chosen so that a result remains significant in the face of hundreds of thousands to millions of tested SNPs.[8][17][31] GWA studies typically perform the first analysis in a discovery cohort, followed by validation of the most significant SNPs in an independent validation cohort.[32]
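The quantities behind a Manhattan plot can be sketched without any plotting library. The SNP labels, positions and P-values below are hypothetical. The threshold is derived as a Bonferroni correction of 0.05 for an assumed one million independent tests, which reproduces the conventional 5×10⁻⁸ value.

```python
import math

ALPHA = 0.05
N_TESTS = 1_000_000                    # assumed number of independent tests
threshold = ALPHA / N_TESTS            # Bonferroni-corrected threshold, about 5e-8

def neg_log10(p_value):
    """Y-axis value of a Manhattan plot for a given P-value."""
    return -math.log10(p_value)

# Hypothetical results: (label, chromosome, position, P-value).
results = [
    ("snp_1", 1, 1_200_000, 3e-4),
    ("snp_2", 9, 22_100_000, 2e-12),
    ("snp_3", 19, 11_200_000, 4e-8),
    ("snp_4", 19, 11_300_000, 6e-8),
]

# Points to plot: x is the genomic location, y is -log10(P).
points = [(chrom, pos, neg_log10(p)) for _, chrom, pos, p in results]

# SNPs that would rise above the significance line on the plot.
hits = [label for label, _, _, p in results if p < threshold]
threshold_line = neg_log10(threshold)
```

On the plot, the significance line sits at about 7.3 on the Y-axis, and only `snp_2` and `snp_3` lie above it in this made-up data.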
Regional association plot, showing individual SNPs in the LDL receptor region and their association to LDL-cholesterol levels. This type of plot is similar to the Manhattan plot in the lead section, but for a more limited section of the genome. The haploblock structure is visualized with a colour scale and the association level is given by the left Y-axis. The dot representing the rs73015013 SNP (in the top-middle) has a high Y-axis location because this SNP explains some of the variation in LDL-cholesterol.[33]
Relationship between the minor allele frequency and the effect size of genome-wide significant variants in a GWAS of height.
Attempts have been made at creating comprehensive catalogues of SNPs that have been identified from GWA studies.[34] As of 2009, SNPs associated with diseases are numbered in the thousands.[35]
One of the first GWA studies, conducted in 2005, compared 96 patients with age-related macular degeneration (ARMD) with 50 healthy controls.[36] It identified two SNPs with significantly altered allele frequency between the two groups. These SNPs were located in the gene encoding complement factor H, which was an unexpected finding in the research of ARMD. The findings from these first GWA studies have subsequently prompted further functional research towards therapeutic manipulation of the complement system in ARMD.[37]
Since these first landmark GWA studies, there have been two general trends.[39] One has been towards larger and larger sample sizes. By 2018, several genome-wide association studies had reached a total sample size of over 1 million participants, including 1.1 million in a genome-wide study of educational attainment,[40] followed by another in 2022 with 3 million individuals,[41] and a study of insomnia containing 1.3 million individuals.[42] The reason is the drive towards reliably detecting risk-SNPs that have smaller effect sizes and lower allele frequency. Another trend has been towards the use of more narrowly defined phenotypes, such as blood lipids, proinsulin or similar biomarkers.[43][44] These are called intermediate phenotypes, and their analyses may be of value to functional research into biomarkers.[45]
A variation of GWAS uses participants that are first-degree relatives of people with a disease. This type of study has been named genome-wide association study by proxy (GWAX).[46]
A central point of debate on GWA studies has been that most of the SNP variations found by GWA studies are associated with only a small increased risk of the disease, and have only a small predictive value. The median odds ratio is 1.33 per risk-SNP, with only a few showing odds ratios above 3.0.[2][47] These magnitudes are considered small because they do not explain much of the heritable variation. This heritable variation is estimated from heritability studies based on monozygotic twins.[48] For example, it is known that 40% of variance in depression can be explained by hereditary differences, but GWA studies only account for a minority of this variance.[48]
A challenge for future GWA studies is to apply the findings in a way that accelerates drug and diagnostics development, including better integration of genetic studies into the drug-development process and a focus on the role of genetic variation in maintaining health as a blueprint for designing new drugs and diagnostics.[49] Several studies have looked into the use of risk-SNP markers as a means of directly improving the accuracy of prognosis. Some have found that the accuracy of prognosis improves,[50] while others report only minor benefits from this use.[51] Generally, a problem with this direct approach is the small magnitudes of the effects observed. A small effect ultimately translates into a poor separation of cases and controls and thus only a small improvement of prognosis accuracy. An alternative application is therefore the potential for GWA studies to elucidate pathophysiology.[52]
One such success is related to identifying the genetic variant associated with response to anti-hepatitis C virus treatment. For genotype 1 hepatitis C treated with pegylated interferon-alpha-2a or pegylated interferon-alpha-2b combined with ribavirin, a GWA study[53] has shown that SNPs near the human IL28B gene, encoding interferon lambda 3, are associated with significant differences in response to the treatment. A later report demonstrated that the same genetic variants are also associated with the natural clearance of the genotype 1 hepatitis C virus.[54] These major findings facilitated the development of personalized medicine and allowed physicians to customize medical decisions based on the patient's genotype.[55]
The goal of elucidating pathophysiology has also led to increased interest in the association between risk-SNPs and the gene expression of nearby genes, the so-called expression quantitative trait loci (eQTL) studies.[56] The reason is that GWA studies identify risk-SNPs, but not risk-genes, and specification of genes is one step closer towards actionable drug targets. As a result, major GWA studies by 2011 typically included extensive eQTL analysis.[57][58][59] One of the strongest eQTL effects observed for a GWA-identified risk SNP is the SORT1 locus.[43] Functional follow-up studies of this locus using small interfering RNA and gene knock-out mice have shed light on the metabolism of low-density lipoproteins, which has important clinical implications for cardiovascular disease.[43][60][61]
Research using a High-Precision Protein Interaction Prediction (HiPPIP) computational model discovered 504 new protein-protein interactions (PPIs) associated with genes linked to schizophrenia.[63][64][65] One study indicates that GWAS is substantially more effective for identifying genes associated with schizophrenia (and highly polygenic phenotypes in general) than candidate-driven studies.[66]
Another group of researchers conducted a joint analysis of GWAS summary statistics from seventeen pain susceptibility traits in the UK Biobank and revealed 99 genome-wide significant risk loci, among which 34 loci were new. Using leave-one-trait-out meta-analyses, these loci were grouped into four categories: loci associated with nearly all pain-related traits, loci associated with a single trait, loci associated with multiple forms of skeletomuscular pain, and loci associated with headache-related pain.[citation needed]
Moreover, 664 genes were mapped to the 99 loci by genomic proximity, eQTLs and chromatin interaction, where 15% of these genes showed differential expression in individuals with acute or chronic pain compared to healthy individuals.[67][non-primary source needed]
Population-level GWA studies may be used to identify adaptive genes to help evaluate the ability of species to adapt to changing environmental conditions as the global climate becomes warmer.[68] This could help determine extirpation risk for species and could therefore be an important tool for conservation planning. Utilizing GWA studies to determine adaptive genes could help elucidate the relationship between neutral and adaptive genetic diversity.
GWA studies act as an important tool in plant breeding. With large genotyping and phenotyping data, GWAS are powerful in analyzing complex inheritance modes of traits that are important yield components, such as number of grains per spike, weight of each grain and plant structure. In a study of spring wheat, GWAS revealed a strong correlation of grain production with booting data, biomass and number of grains per spike.[69] GWA studies have also been successful in studying the genetic architecture of complex traits in rice.[70]
The emergence of plant pathogens has posed serious threats to plant health and biodiversity. Under this consideration, identification of wild types that have natural resistance to certain pathogens could be of vital importance. Furthermore, it is necessary to predict which alleles are associated with the resistance. GWA studies are a powerful tool to detect the relationships between certain variants and resistance to a plant pathogen, which is beneficial for developing new pathogen-resistant cultivars.[71]
The first GWA study in chickens was done by Abasht and Lamont[72] in 2007. This GWA was used to study the fatness trait in an F2 population found previously. Significantly related SNPs were found on 10 chromosomes (1, 2, 3, 4, 7, 8, 10, 12, 15 and 27).[non-primary source needed]
GWA studies have several issues and limitations that can be taken care of through proper quality control and study setup. Lack of well-defined case and control groups, insufficient sample size, and control for population stratification are common problems.[3] On the statistical issue of multiple testing, it has been noted that "the GWA approach can be problematic because the massive number of statistical tests performed presents an unprecedented potential for false-positive results".[3] This is why all modern GWAS use a very low P-value threshold. In addition to easily correctable problems such as these, some more subtle but important issues have surfaced. A high-profile GWA study that investigated individuals with very long life spans to identify SNPs associated with longevity is an example of this.[73] The publication came under scrutiny because of a discrepancy between the type of genotyping array in the case and control group, which caused several SNPs to be falsely highlighted as associated with longevity.[74] The study was subsequently retracted,[75] but a modified manuscript was later published.[76] Now, many GWAS control for genotyping array. If there are substantial differences between groups in the type of genotyping array, as with any confounder, GWA studies could result in a false positive. A further limitation of array-based studies is that they are unable to detect the contribution of very rare mutations that are neither included in the array nor able to be imputed.[77]
Additionally, GWA studies identify candidate risk variants for the population from which their analysis is performed, and with most GWA studies historically stemming from European databases, there is a lack of translation of the identified risk variants to other non-European populations.[78] For instance, GWA studies for diseases like Alzheimer's disease have been conducted primarily in Caucasian populations, which does not give adequate insight into other ethnic populations, including African Americans or East Asians. Alternative strategies suggested involve linkage analysis.[79][80] More recently, the rapidly decreasing price of complete genome sequencing has also provided a realistic alternative to genotyping array-based GWA studies. High-throughput sequencing does have potential to side-step some of the shortcomings of non-sequencing GWA.[81] Cross-trait assortative mating can inflate estimates of genetic phenotype similarity.[82]
Genotyping arrays designed for GWAS rely on linkage disequilibrium to provide coverage of the entire genome by genotyping a subset of variants. Because of this, the reported associated variants are unlikely to be the actual causal variants. Associated regions can contain hundreds of variants spanning large regions and encompassing many different genes, making the biological interpretation of GWAS loci more difficult. Fine-mapping is a process to refine these lists of associated variants to a credible set most likely to include the causal variant.[83]
Fine-mapping requires all variants in the associated region to have been genotyped or imputed (dense coverage), very stringent quality control resulting in high-quality genotypes, and sample sizes large enough to separate out highly correlated signals. There are several different methods to perform fine-mapping, and all methods produce a posterior probability that a variant in that locus is causal. Because the requirements are often difficult to satisfy, there are still limited examples of these methods being more generally applied.[84]
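The idea of a posterior probability per variant and a credible set can be sketched under strong simplifying assumptions. The example below assumes exactly one causal variant in the locus, equal prior probability for every variant, and a Bayes factor already computed for each variant by some upstream method. The variant names, the log Bayes factor values and the function `credible_set` are all hypothetical, and actual fine-mapping tools use more elaborate models.

```python
import math

def credible_set(log_bayes_factors, coverage=0.95):
    """Posterior probabilities and credible set for one locus.

    Assumes a single causal variant and equal priors, so each variant's
    posterior probability is its Bayes factor divided by the sum of all
    Bayes factors in the locus.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    max_log = max(log_bayes_factors.values())
    weights = {v: math.exp(lbf - max_log) for v, lbf in log_bayes_factors.items()}
    total = sum(weights.values())
    posteriors = {v: w / total for v, w in weights.items()}

    # Add variants in order of decreasing posterior until coverage is reached.
    ranked = sorted(posteriors, key=posteriors.get, reverse=True)
    chosen, cumulative = [], 0.0
    for variant in ranked:
        chosen.append(variant)
        cumulative += posteriors[variant]
        if cumulative >= coverage:
            break
    return posteriors, chosen

# Hypothetical natural-log Bayes factors for four variants in one locus.
log_bfs = {"var_a": 10.0, "var_b": 9.0, "var_c": 5.0, "var_d": 2.0}
pips, credible = credible_set(log_bfs, coverage=0.95)
```

With these invented values the 95% credible set contains two of the four variants, which illustrates how fine-mapping narrows a locus without necessarily pinpointing a single variant.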
Ayati M, Koyutürk M (1 January 2015). "Assessing the Collective Disease Association of Multiple Genomic Loci". Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics. BCB '15. New York, NY, USA: ACM. pp. 376–385. doi:10.1145/2808719.2808758. ISBN 978-1-4503-3853-0. S2CID 5942777.
Carré C, Carluer JB, Chaux C, Estoup-Streiff C, Roche N, Hosy E, Mas A, Krouk G (March 2024). "Next-Gen GWAS: full 2D epistatic interaction maps retrieve part of missing heritability and improve phenotypic prediction". Genome Biology. doi:10.1186/s13059-024-03202-0. PMID 38523316. S2CID 146570
Marchini J, Howie B (July 2010). "Genotype imputation for genome-wide association studies". Nature Reviews Genetics. 11 (7): 499–511. doi:10.1038/nrg2796. PMID 20517342. S2CID 1465707.
Liu JZ, Erlich Y, Pickrell JK (March 2017). "Case-control association mapping by proxy using family history of disease". Nature Genetics. 49 (3): 325–331. doi:10.1038/ng.3766. PMID 28092683. S2CID 5598845.
Dubé JB, Johansen CT, Hegele RA (June 2011). "Sortilin: an unusual suspect in cholesterol metabolism: from GWAS identification to in vivo biochemical analyses, sortilin has been identified as a novel mediator of human lipoprotein metabolism". BioEssays. 33 (6): 430–7. doi:10.1002/bies.201100003. PMID 21462369.
Bauer RC, Stylianou IM, Rader DJ (April 2011). "Functional validation of new pathways in lipoprotein metabolism identified by human genetics". Current Opinion in Lipidology. 22 (2): 123–8. doi:10.1097/MOL.0b013e32834469b3. PMID 21311327. S2CID 24020035.
Abasht B, Lamont SJ (October 2007). "Genome-wide association analysis reveals cryptic alleles as an important factor in heterosis for fatness in chicken F2 population". Animal Genetics. 38 (5): 491–498. doi:10.1111/j.1365-2052.2007.01642.x. PMID 17894563.