| Type: | Package |
| Title: | Genome Wide Association Studies |
| Version: | 1.0.12 |
| Date: | 2025-06-30 |
| Description: | Fast single trait Genome Wide Association Studies (GWAS) following the method described in Kang et al. (2010), <doi:10.1038/ng.548>. One of a series of statistical genetic packages for streamlining the analysis of typical plant breeding experiments developed by Biometris. |
| License: | GPL-3 |
| Encoding: | UTF-8 |
| LazyData: | true |
| RoxygenNote: | 7.3.2 |
| Depends: | R (≥ 3.6) |
| Imports: | data.table, ggplot2 (≥ 3.0.0), LMMsolver, methods, rlang,Rcpp, sommer (≥ 4.4.1) |
| Suggests: | knitr, rmarkdown, officer, tinytest, snpStats, vcfR |
| VignetteBuilder: | knitr |
| LinkingTo: | Rcpp, RcppArmadillo |
| URL: | https://biometris.github.io/statgenGWAS/index.html,https://github.com/Biometris/statgenGWAS/ |
| BugReports: | https://github.com/Biometris/statgenGWAS/issues |
| NeedsCompilation: | yes |
| Packaged: | 2025-07-01 06:12:04 UTC; rossu027 |
| Author: | Bart-Jan van Rossum |
| Maintainer: | Bart-Jan van Rossum <bart-jan.vanrossum@wur.nl> |
| Repository: | CRAN |
| Date/Publication: | 2025-07-01 06:50:03 UTC |
statgenGWAS: Genome Wide Association Studies
Description
Fast single trait Genome Wide Association Studies (GWAS) following the method described in Kang et al. (2010),doi:10.1038/ng.548. One of a series of statistical genetic packages for streamlining the analysis of typical plant breeding experiments developed by Biometris.
Author(s)
Maintainer: Bart-Jan van Rossumbart-jan.vanrossum@wur.nl (ORCID)
Authors:
Willem Kruijer (ORCID)
Other contributors:
Fred van Eeuwijk (ORCID) [contributor]
Martin Boer (ORCID) [contributor]
Marcos Malosetti (ORCID) [contributor]
Daniela Bustos-Korts (ORCID) [contributor]
Emilie Millet (ORCID) [contributor]
Joao Paulo (ORCID) [contributor]
Maikel Verouden (ORCID) [contributor]
Ron Wehrens (ORCID) [contributor]
Choazhi Zheng (ORCID) [contributor]
See Also
Useful links:
Report bugs athttps://github.com/Biometris/statgenGWAS/issues
Code and impute markers
Description
codeMarkers codes markers in agData object and optionallyperforms imputation of missing values as well.
The function performs the following steps:
replace strings in
naStringsbyNA.remove genotypes with a fraction of missing values higher than
nMissGeno.remove SNPs with a fraction of missing values higher than
nMiss.recode SNPs to numerical values.
remove SNPs with a minor allele frequency lower than
MAF.optionally remove duplicate SNPs.
optionally impute missing values.
repeat steps 5. and 6. if missing values are imputed.
Usage
codeMarkers( gData, refAll = "minor", nMissGeno = 1, nMiss = 1, MAF = NULL, MAC = NULL, removeDuplicates = TRUE, keep = NULL, impute = TRUE, imputeType = c("random", "fixed", "beagle"), fixedValue = NULL, naStrings = NA, verbose = FALSE)Arguments
gData | An object of class |
refAll | A character string indicating the reference allele used whenrecoding markers. |
nMissGeno | A numerical value between 0 and 1. Genotypes with afraction of missing values higher than |
nMiss | A numerical value between 0 and 1. SNPs with a fraction ofmissing values higher than |
MAF | A numerical value between 0 and 1. SNPs with a Minor AlleleFrequency (MAF) below this value will be removed. Only one of |
MAC | A numerical value. SNPs with Minor Allele Count (MAC) below thisvalue will be removed. Only one of |
removeDuplicates | Should duplicate SNPs be removed? |
keep | A vector of SNPs that should never be removed in the wholeprocess. |
impute | Should imputation of missing values be done? |
imputeType | A character string indicating what kind of imputation ofvalues should be done.
|
fixedValue | A numerical value used for replacing missing values incase |
naStrings | A character vector of strings to be treated as NA. |
verbose | Should a summary of the performed steps be printed? |
Value
A copy of the inputgData object with markers replaced bycoded and imputed markers.
References
S R Browning and B L Browning (2007) Rapid and accurate haplotypephasing and missing data inference for whole genome association studies byuse of localized haplotype clustering. Am J Hum Genet 81:1084-1097.doi:10.1086/521987
Examples
## Create markersmarkers <- matrix(c("AA", "AB", "AA", "BB", "BA", "AB", "AA", "AA", NA, "AA","AA", "AA", "BB", "BB", "AA", "AA", "BB", "AA", NA, "AA","AA", "BA", "AB", "BB", "AB", "AB", "AA", "BB", NA, "AA","AA", "AA", "BB", "BB", "AA", "AA", "AA", "AA", NA, "AA","AA", "AA", "BB", "BB", "AA", "BB", "BB", "BB", "AB", "AA","AA", "AA", "BB", "BB", "AA", NA, "BB", "AA", NA, "AA","AB", "AB", "BB", "BB", "BB", "AA", "BB", "BB", NA, "AB","AA", "AA", NA, "BB", NA, "AA", "AA", "AA", "AA", "AA","AA", NA, NA, "BB", "BB", "BB", "BB", "BB", "AA", "AA","AA", NA, "AA", "BB", "BB", "BB", "AA", "AA", NA, "AA"),ncol = 10, byrow = TRUE, dimnames = list(paste0("IND", 1:10),paste0("SNP", 1:10)))## create object of class 'gData'.gData <- createGData(geno = markers)## Code markers by minor allele, no imputation.gDataCoded1 <- codeMarkers(gData = gData, impute = FALSE)## Code markers by reference alleles, impute missings by fixed value.gDataCoded2 <- codeMarkers(gData = gData, refAll = rep(x = c("A", "B"), times = 5), impute = TRUE, imputeType = "fixed", fixedValue = 1)## Code markers by minor allele, impute by random value.gDataCoded3 <- codeMarkers(gData = gData, impute = TRUE, imputeType = "random")DROPS data sets
Description
This dataset comes from the European Union project DROPS (DROught-tolerantyielding PlantS). A panel of 256 maize hybrids was grown with two waterregimes (irrigated or rainfed), in seven fields in 2012 and 2013,respectively, spread along a climatic transect from western to easternEurope, plus one site in Chile in 2013. This resulted in 28 experimentsdefined as the combination of one year, one site and one water regime, withtwo and three repetitions for rainfed and irrigated treatments, respectively.A detailed environmental characterisation was carried out, with hourlyrecords of micrometeorological data and soil water status, and associatedwith precise measurement of phenology. Grain yield and its components weremeasured at the end of the experiment.
10 experiments have been selected from the full data set, two for each ofthe five main environmental scenarios that were identified in the data. Thescenarios have been added to the data as well as a classification of thegenotypes in four genetic groups.
The main purpose of this datasetconsists in using the environmental characterization to quantify the geneticvariability of maize grain yield in response to the environmental driversfor genotype-by-environment interaction. For instance, allelic effects atQTLs identified over the field network are consistent within a scenario butlargely differ between scenarios.
The data is split in three separate data.frames.
- dropsMarkers
This data.frame contains the 50K genotypingmatrix coded in allelic dose (012) filtered and imputed. Genotyping of41,722 loci on 246 parental lines were obtained using 50K Illumina InfiniumHD arrays (Ganal et al., 2011). Genotype were coded in allelic dose with 0for the minor allele, 1 for the heterozygote, and 2 for the major allele.Genotype were filtered (MAF > 1%) and missing data imputed using Beagle v3.
A data.frame with 246 rows and 41723 columns.- Ind
name of the genotype
- SYMN83 to PZE-110111485
coded QTLs
- dropsMap
This data.frame contains the description of the41,722 loci genotyped by 50K Illumina Infinium Array on the 246 lines.
A data.frame with 41722 rows and 5 columns.- SNP.names
name of the SNP
- Chromosome
number of the B73 reference genome V2
- Position
position on the B73 reference genome V2 in basepairs
- allele1
first original allele (A, T, G or C)
- allele2
second original allele (A, T, G or C)
- dropsPheno
This data.frame contains the genotypic means(Best Linear Unbiased Estimators, BLUEs), with one value per experiment(Location × year × water regime) per genotype.
A data.frame with 2460 rows and 19 columns.- Experiment
experiments ID described by the three first letters of thecity’s name followed by the year of experiment and the water regime with Wfor watered and R for rain-fed.
- parent1
identifier of donor dent line
- Code_ID, Variety_ID, Accession_ID
identifier of the genotype
- geno.panel
project in which the genetic material was generated
- grain.yield
genotypic mean for yield adjusted at 15% grain moisture,in ton per hectare (t ha^-1)
- grain.number
genotypic mean for number of grain per square meter
- grain.weight
genotypic mean for individual grain weight in milligram(mg)
- anthesis
genotypic mean for male flowering (pollen shed), in thermaltime cumulated since emergence (d_20°C)
- silking
genotypic mean for female flowering (silking emergence), inthermal time cumulated since emergence (d_20°C)
- plant.height
genotypic mean for plant height, from ground level tothe base of the flag leaf (highest) leaf in centimeter (cm)
- tassel.height
genotypic mean for plant height including tassel, fromground level to the highest point of the tassel in centimeter (cm)
- ear.height
genotypic mean for ear insertion height, from ground levelto ligule of the highest ear leaf in centimeter (cm)
- year
year in which the experiment was performed
- loc
location where the experiment was performed, a three letterabbreviation
- scenarioWater
water scenario for the experiment, well watered (WW) orwater deficit (WD)
- scenarioTemp
temperature scenario for the experiment, Cool, Hot orHot(Day)
- scenarioFull
the full scenario for the experiment, a combination ofscenarioWater and scenarioTemp
- geneticGroup
the genetic group to which the genotype belongs
Note
From the source data, the experiments from 2011 have been removed sincethese do not contain all genotypes. Also the experiment Gra13W has beenremoved.
Source
References
Millet, E. J., Pommier, C., et al. (2019). A multi-siteexperiment in a network of European fields for assessing the maize yieldresponse to environmental scenarios - Data set.doi:10.15454/IASSTN
Ganal MW, et al. (2011) A Large Maize (Zea mays L.) SNPGenotyping Array: Development and Germplasm Genotyping, and Genetic Mappingto Compare with the B73 Reference Genome. PLoS ONE 6(12): e28334.doi:10.1371/journal.pone.0028334
S3 Class gData
Description
createGData creates an object of S3 class gData with genotypic andphenotypic data for usage in further analysis. All input to the function isoptional, however at least one input should be provided. It is possible toprovide an existinggData object as additional input in which casedata is added to this object. Existing data will be overwritten with awarning.
Usage
createGData( gData = NULL, geno = NULL, map = NULL, kin = NULL, pheno = NULL, covar = NULL)Arguments
gData | An optional gData object to be modified. If |
geno | A matrix or data.frame with genotypes in the rows and markers inthe columns. A matrix from the |
map | A data.frame with columns |
kin | A kinship matrix or list of kinship matrices with genotype inrows and colums. These matrices can be from the |
pheno | A data.frame or a list of data.frames with phenotypic data,with genotypes in the first column |
covar | A data.frame with extra covariates per genotype. Genotypesshould be in the rows. |
Value
An object of classgData with the following components:
map | a data.frame containing map data. Map is sorted bychromosome and position. |
markers | a matrix containing marker information. |
pheno | a list of data.frames containing phenotypic data. |
kinship | a kinship matrix. |
covar | a data.frame with extra covariates. |
Author(s)
Bart-Jan van Rossum
See Also
Examples
set.seed(1234)## Create genotypic data.geno <- matrix(sample(x = c(0, 1, 2), size = 15, replace = TRUE), nrow = 3)dimnames(geno) <- list(paste0("G", 1:3), paste0("M", 1:5))## Construct map.map <- data.frame(chr = c(1, 1, 2, 2, 2), pos = 1:5, row.names = paste0("M", 1:5))## Compute kinship matrix.kin <- kinship(X = geno, method = "IBS")## Create phenotypic data.pheno <- data.frame(paste0("G", 1:3), matrix(rnorm(n = 12, mean = 50, sd = 5), nrow = 3), stringsAsFactors = FALSE)dimnames(pheno) = list(paste0("G", 1:3), c("genotype", paste0("T", 1:4)))## Combine all data in gData object.gData <- createGData(geno = geno, map = map, kin = kin, pheno = pheno)summary(gData)## Construct covariate.covar <- data.frame(C1 = c("a", "a", "b"), row.names = paste0("G", 1:3))## Compute alternative kinship matrix.kin2 <- kinship(X = geno, method = "astle")## Add covariates to previously created gData object and overwrite## current kinship matrix by newly computed one.gData2 <- createGData(gData = gData, kin = kin2, covar = covar)Functions for calculating kinship matrices
Description
A collection of functions for calculating kinship matrices using differentalgorithms. The following algorithms are included: astle (Astle and Balding,2009), Identity By State (IBS) and VanRaden (VanRaden, 2008) formarker matrices. For method identity an identity kinship matrix is returned.
Usage
kinship( X, method = c("astle", "IBS", "vanRaden", "identity"), MAF = NULL, denominator = NULL)Arguments
X | An n x m marker matrix with genotypes in the rows (n) and markers inthe columns (m). |
method | The method used for computing the kinship matrix. |
MAF | The minor allele frequency (MAF) threshold used in kinshipcomputation. A numerical value between 0 and 1. SNPs with MAF below thisvalue are not taken into account when computing the kinship. If NULL allmarkers are used regardless of their allele frequency. |
denominator | A numerical value. See details. |
Value
An n x n kinship matrix.
Marker matrices
In all algorithms the input matrixX is first cleaned, i.e. markerswith a variance of 0 are excluded from the calculation of the kinship matrix.Then some form of scaling is done which differs per algorithm. This gives ascaled matrixZ. The matrixZZ^t / denominator is returned.By default the denominator is equal to the number of columns inZ forastle andIBS and2 * p * (1-p) wherep = colSums(X) / (2 * nrow(X)) forvanRaden. This denominatorcan be overwritten by the user, e.g. when computing kinship matrices bysplittingX in smaller matrices and then adding the results togetherin the end.
References
Astle, William, and David J. Balding. 2009. “Population Structureand Cryptic Relatedness in Genetic Association Studies.” Statistical Science24 (4): 451–71.doi:10.1214/09-sts307.
VanRaden P.M. (2008) Efficient methods to compute genomicpredictions. Journal of Dairy Science 91 (11): 4414–23.doi:10.3168/jds.2007-0980.
Examples
## Create example matrix.M <- matrix(c(1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1), nrow = 4)## Compute kinship matrices using different methods.kinship(M, method = "astle")kinship(M, method = "IBS")kinship(M, method = "vanRaden")## Only use markers with a Minor Allele Frequency of 0.3 or more.kinship(M, method = "astle", MAF = 0.3)## Compute kinship matrix using astle and balding method with denominator 2.kinship(M, method = "astle", denominator = 2)Plot function for the classGWAS
Description
Creates a plot of an object of S3 classGWAS. The following types ofplot can be made:
a manhattan plot, i.e. a plot of LOD-scores per SNP
a QQ plot of observed LOD-scores versus expected LOD-scores
a qtl plot of effect sizes and directions for multiple traits
Manhattan plots and QQ plots are made for a single trait whichshould be indicated using the parametertrait. If the analysis wasdone for only one trait, it is detected automatically. The qtl plot will plotall traits analyzed.
See details for a detailed description of the plots and the plot optionsspecific to the different plots.
Usage
## S3 method for class 'GWAS'plot( x, ..., plotType = c("manhattan", "qq", "qtl"), trial = NULL, trait = NULL, title = NULL, output = TRUE)Arguments
x | An object of class |
... | Further arguments to be passed on to the actual plottingfunctions. |
plotType | A character string indicating the type of plot to be made.One of "manhattan", "qq" and "qtl". |
trial | A character string or numeric index indicating for whichtrial the plot should be made. If |
trait | A character string indicating for which trait the resultsshould be plotted. For |
title | A character string, the title of the plot. |
output | Should the plot be output to the current device? If |
Manhattan Plot
A LOD-profile of all marker positions and corresponding LOD-scores isplotted. Significant markers are highlighted with red dots. By default theseare taken from the result of the GWAS analysis however the LOD-threshold forsignificant parameters may be modified using the parameteryThr. Thethreshold is plotted as a horizontal line. If there are previously knownmarker effect, false positives and true negatives can also be marked.
Extra parameter options:
xLabA character string, the x-axis label. Default =
"Chromosomes"yLabA character string, the y-axis label. Default =
-log10(p)effectsA character vector, indicating which SNPs correspondto a real (known) effect. Used for determining true/false positives andfalse negatives. True positives are colored green, false positives orange andfalse negatives yellow.
colPaletteA color palette used for plotting. Defaultcoloring is done by chromosome, using black and grey.
yThrA numerical value for the LOD-threshold. The value fromthe GWAS analysis is used as default.
signLwdA numerical value giving the thickness of thepoints that are false/true positives/negatives. Default = 0.6
lodA positive numerical value. For the SNPs with a LOD-valuebelow this value, only 5% is plotted. The chance of a SNP being plotted isproportional to its LOD-score. This option can be useful when plotting alarge number of SNPs. The 5% of SNPs plotted is selected randomly. Forreproducible results use set.seed before calling the function.
chrA vector of chromosomes to be plotted. By default, allchromosomes are plotted. Using this option allows restricting the plot to asubset of chromosomes.
startPosA numerical value indicating the start position forthe plot. Using this option allows restricting the plot to a part of aselected chromosome. Only used if exactly one chromosome is specified in
chr.endPosA numerical value indicating the end position forthe plot. Using this option allows restricting the plot to a part of aselected chromosome. Only used if exactly one chromosome is specified in
chr.
QQ-Plot
From the LOD-scores calculated in the GWAS analysis, a QQ-plot is generatedwith observed LOD-scores versus expected LOD-scores. Code is adapted fromSegura et al. (2012).
QTL Plot
A plot of effect sizes for the significant SNPs found in the GWAS analysisis created. Each horizontal line contains QTLs of one trait,phenotypic trait or trial. Optionally, vertical white lines can indicatechromosome subdivision, genes of interest, known QTL, etc. Circle diametersare proportional to the absolute value of allelic effect. Colors indicate thedirection of the effect: green when the allele increases the trait value,and blue when it decreases the value.
Extra parameter options:
normalizeShould the snpEffect be normalized? Default =
FALSEsortDataShould the data be sorted before plotting? Either
FALSE, if no sorting should be done, or a character string indicatingthe data column to use for sorting. This should be a numerical column.Default =FALSEbinPositionsAn optional data.frame containing at leasts twocolumns, chr(omosome) and pos(ition). Vertical lines are plotted at thosepositions. Default =
NULLprintVertGridShould default vertical grid lines be plotted.Default =
TRUEyLabA character string, the y-axis label. Default =
"Traits"yThrA numerical value for the LOD-threshold. The value fromthe GWAS analysis is used as default.
chrA vector of chromosomes to be plotted. By default allchromosomes are plotted. Using this option this can be restricted to asubset of chromosomes.
exportPptxShould the plot be exported to a .pptx file?Default =
FALSEpptxNameA character string, the name of the .pptx file towhich the plot is exported. Ignored if exportPptx =
FALSE.
References
Millet et al. (2016) Genome-wide analysis of yield in Europe:Allelic effects vary with drought and heat scenarios. Plant Physiology,October 2016, Vol. 172, p. 749–764
Segura et al. (2012) An efficient multi-locus mixed-modelapproach for genome-wide association studies in structured populations.Nature Genetics, June 2012, Vol. 44, p. 825–830.
Examples
## Create a gData object Using the data from the DROPS project.## See the included vignette for a more extensive description on the steps.data(dropsMarkers)data(dropsMap)data(dropsPheno)## Add genotypes as row names of dropsMarkers and drop Ind column.rownames(dropsMarkers) <- dropsMarkers[["Ind"]]dropsMarkers <- dropsMarkers[colnames(dropsMarkers) != "Ind"]## Add genotypes as row names of dropsMap.rownames(dropsMap) <- dropsMap[["SNP.names"]]## Rename Chomosome and Position columns.colnames(dropsMap)[match(c("Chromosome", "Position"), colnames(dropsMap))] <- c("chr", "pos") ## Convert phenotypic data to a list.dropsPhenoList <- split(x = dropsPheno, f = dropsPheno[["Experiment"]])## Rename Variety_ID to genotype and select relevant columns.dropsPhenoList <- lapply(X = dropsPhenoList, FUN = function(trial) { colnames(trial)[colnames(trial) == "Variety_ID"] <- "genotype" trial <- trial[c("genotype", "grain.yield", "grain.number", "seed.size", "anthesis", "silking", "plant.height", "tassel.height", "ear.height")]return(trial)}) ## Create gData object.gDataDrops <- createGData(geno = dropsMarkers, map = dropsMap, pheno = dropsPhenoList) ## Run single trait GWAS for trait 'grain.yield' for trial Mur13W.GWASDrops <- runSingleTraitGwas(gData = gDataDrops, trials = "Mur13W", traits = "grain.yield") ## Create a manhattan plot.plot(GWASDrops)## Manually set a threshold for significant snps and add a title.plot(GWASDrops, yThr = 3.5, title = "Manhattan plot for Mur13W") ## Restrict plot to part of chr 6.plot(GWASDrops, yThr = 3.5, chr = 6, startPos = 0, endPos = 110000000) ## Create a qq plot.plot(GWASDrops, plotType = "qq", title = "QQ plot for Mur13W") ## Create a QTL plot.plot(GWASDrops, plotType = "qtl", title = "QTL plot for Mur13W") ## Manually set a threshold and don't show vertical lines.plot(GWASDrops, plotType = "qtl", yThr = 3.5, printVertGrid = FALSE, title = "QTL plot for Mur13W")Plot function for the classgData
Description
Creates a plot of the genetic map in an object of S3 classgData. Aplot of the genetic map showing the length of the chromosomes and thepositions of the markers in the genetic map is created.
Usage
## S3 method for class 'gData'plot(x, ..., highlight = NULL, title = NULL, output = TRUE)Arguments
x | An object of class |
... | Not used. |
highlight | A data.frame with at least columns chr and pos,containing marker positions that should be highlighted in the plot. If acolumn "name" is present that is used for annotation, otherwise thehighlighted markers are annotated as chr\@pos#' |
title | A character string, the title of the plot. |
output | Should the plot be output to the current device? If |
Value
An object of classggplot is invisibly returned.
Examples
set.seed(1234)## Create genotypic data.geno <- matrix(sample(x = c(0, 1, 2), size = 15, replace = TRUE), nrow = 3)dimnames(geno) <- list(paste0("G", 1:3), paste0("M", 1:5))## Construct map.map <- data.frame(chr = c(1, 1, 2, 2, 2), pos = 1:5, row.names = paste0("M", 1:5))## Compute kinship matrix.kin <- kinship(X = geno, method = "IBS")## Create phenotypic data.pheno <- data.frame(paste0("G", 1:3), matrix(rnorm(n = 12, mean = 50, sd = 5), nrow = 3), stringsAsFactors = FALSE)dimnames(pheno) = list(paste0("G", 1:3), c("genotype", paste0("T", 1:4)))## Combine all data in gData object.gData <- createGData(geno = geno, map = map, kin = kin, pheno = pheno)## Plot genetic map.plot(gData)## Plot genetic map. Highlight first marker in map.plot(gData, highlight = map[1, ])Read PLINK binary data
Description
Read PLINK binary data and save in gData format. This is a wrapper aroundsnpStats::read.plink in the Bioconductor packagesnpStats. This package needs to be installed for the function towork.
Usage
readPLINK(bed, bim, fam, ...)Arguments
bed | The name of the file containing the packed binary SNP genotypedata. It should have the extension .bed; If it doesn't, then this extensionwill be appended. |
bim | The file containing the SNP descriptions. If not specified |
fam | The file containing subject (and, possibly, family) identifiers.This is basically a tab-delimited "pedfile". If not specified |
... | Further arguments passed tosnpStats::read.plink. |
Value
An object of classgData.
Read variant call format data
Description
Read variant call format (VCF) data and save in gData format. This is awrapper aroundvcfR::read.vcfR in the packagevcfR. This packageneeds to be installed for the function to work.
Usage
readVcf(vcfFile, ...)Arguments
vcfFile | The name of the vcf file. This can either be a plain text filewith extension (.vcf), or a gzipped file with extencion (.vcf.gz). |
... | Further arguments passed tovcfR::read.vcfR. |
Value
An object of classgData.
References
Knaus BJ, Grünwald NJ (2017). “VCFR: a package to manipulate andvisualizevariant call format data in R.”Molecular Ecology Resources,17(1), 44-53. ISSN 757,doi:10.1111/1755-0998.12549.
Perform single-trait GWAS
Description
runSingleTraitGwas performs a single-trait Genome Wide AssociationStudy (GWAS) on phenotypic and genotypic data contained in agDataobject. A covariance matrix is computed using the EMMA algorithm (Kang etal., 2008), the Newton-Raphson algorithm (Tunnicliffe, 1989) in thesommer package, or the modified Henderson algorithm in theLMMsolver package. Then a Generalized Least Squares (GLS)method is used for estimating the marker effects and corresponding p-values.This is done using either one kinship matrix for all chromosomes or differentchromosome-specific kinship matrices for each chromosome. Significant SNPsare selected based on a user defined threshold.
Usage
runSingleTraitGwas( gData, traits = NULL, trials = NULL, covar = NULL, snpCov = NULL, kin = NULL, kinshipMethod = c("astle", "IBS", "vanRaden"), remlAlgo = c("EMMA", "NR", "Hend"), GLSMethod = c("single", "multi"), useMAF = TRUE, MAF = 0.01, MAC = 10, genomicControl = FALSE, thrType = c("bonf", "fixed", "small", "fdr"), alpha = 0.05, LODThr = 4, nSnpLOD = 10, pThr = 0.05, rho = 0.5, sizeInclRegion = 0, minR2 = 0.5, nCores = NULL)Arguments
gData | An object of class |
traits | A vector of traits on which to run GWAS. These can be eithernumeric indices or character names of columns in |
trials | A vector of trials on which to run GWAS. These canbe either numeric indices or character names of list items in |
covar | An optional vector of covariates taken into account whenrunning GWAS. These can be either numeric indices or character names ofcolumns in |
snpCov | An optional character vector of snps to be included ascovariates. |
kin | An optional kinship matrix or list of kinship matrices. Thesematrices can be from the |
kinshipMethod | An optional character indicating the method used forcalculating the kinship matrix(ces). Currently "astle" (Astle and Balding,2009), "IBS" and "vanRaden" (VanRaden, 2008) are supported. If akinship matrix is supplied either in |
remlAlgo | A character string indicating the algorithm used to estimatethe variance components. Either |
GLSMethod | A character string indicating the method used to estimatethe marker effects. Either |
useMAF | Should the minor allele frequency be used for selecting SNPsfor the analysis. If |
MAF | The minor allele frequency (MAF) threshold used in GWAS. Anumerical value between 0 and 1. SNPs with MAF below this value are not takeninto account in the analysis, i.e. p-values and effect sizes are put tomissing ( |
MAC | A numerical value. SNPs with minor allele count below this valueare not taken into account for the analysis, i.e. p-values and effect sizesare set to missing ( |
genomicControl | Should genomic control correction as in Devlin andRoeder (1999) be applied? |
thrType | A character string indicating the type of threshold used forthe selection of candidate loci. See Details. |
alpha | A numerical value used for calculating the LOD-threshold for |
LODThr | A numerical value used as a LOD-threshold when |
nSnpLOD | A numerical value indicating the number of SNPs with thesmallest p-values that are selected when |
pThr | A numerical value just as the cut off value for p-Values for |
rho | A numerical value used a the minimum value for SNPs to beconsidered correlated when using |
sizeInclRegion | An integer. Should the results for SNPs close tosignificant SNPs be included? If so, the size of the region in centimorganor base pairs. Otherwise 0. |
minR2 | A numerical value between 0 and 1. Restricts the SNPs includedin the region close to significant SNPs to only those SNPs that are insufficient Linkage Disequilibrium (LD) with the significant snp, where LDis measured in terms of |
nCores | A numerical value indicating the number of cores to be used bythe parallel part of the algorithm. If |
Value
An object of classGWAS.
Threshold types for the selection of candidate loci
For the selection of candidate loci from the GWAS output four differentmethods can be used. The method used can be specified in the functionparameterthrType. Further parameters can be used to fine tune themethod.
- bonf
The Bonferroni threshold, a LOD-threshold of
-log10(alpha / m), where m is the number of SNPs and alpha can bespecified by the function parameteralpha- fixed
A fixed LOD-threshold, specified by the function parameter
LODThr- small
The n SNPs with with the smallest p-Values are selected. n canbe specified in
nSnpLOD- fdr
Following the algorithm proposed by Brzyski D. et al. (2017),SNPs are selected in such a way that the False Discovery Rate (fdr) isminimized. To do this, first the SNPs are restricted to the SNPs with ap-Value below
pThr. Then clusters of SNPs are created using atwo step iterative process in which first the SNP with the lowest p-Value isselected as cluster representative. This SNP and all SNPs that have acorrelation with this SNP of\rhoor higher will form a cluster.\rhocan be specified as an argument in the function and has a defaultvalue of 0.5, which is a recommended starting value in practice.The selected SNPs are removed from the list and the procedure repeated untilno SNPs are left. Finally to determine the number of significant clustersthe first cluster is determined for which the p-Value of the clusterrepresentative is not smaller thancluster_{number} * \alpha / mwherem is the number of SNPs and alpha can be specified by the function parameteralpha. All previous clusters are selected as significant.
References
Astle, William, and David J. Balding. 2009. Population Structureand Cryptic Relatedness in Genetic Association Studies. Statistical Science24 (4): 451–71.doi:10.1214/09-sts307.
Brzyski D. et al. (2017) Controlling the Rate of GWAS FalseDiscoveries. Genetics 205 (1): 61-75.doi:10.1534/genetics.116.193987
Devlin, B., and Kathryn Roeder. 1999. Genomic Control forAssociation Studies. Biometrics 55 (4): 997–1004.doi:10.1111/j.0006-341x.1999.00997.x.
Kang et al. (2008) Efficient Control of Population Structure inModel Organism Association Mapping. Genetics 178 (3): 1709–23.doi:10.1534/genetics.107.080101.
Millet, E. J., Pommier, C., et al. (2019). A multi-siteexperiment in a network of European fields for assessing the maize yieldresponse to environmental scenarios - Data set.doi:10.15454/IASSTN
Rincent et al. (2014) Recovering power in association mappingpanels with variable levels of linkage disequilibrium. Genetics 197 (1):375–87.doi:10.1534/genetics.113.159731.
Segura et al. (2012) An efficient multi-locus mixed-modelapproach for genome-wide association studies in structured populations.Nature Genetics 44 (7): 825–30.doi:10.1038/ng.2314.
Sun et al. (2010) Variation explained in mixed-model associationmapping. Heredity 105 (4): 333–40.doi:10.1038/hdy.2010.11.
Tunnicliffe W. (1989) On the use of marginal likelihood in timeseries model estimation. JRSS 51 (1): 15–27.
VanRaden P.M. (2008) Efficient methods to compute genomicpredictions. Journal of Dairy Science 91 (11): 4414–23.doi:10.3168/jds.2007-0980.
See Also
Examples
## Create a gData object Using the data from the DROPS project.## See the included vignette for a more extensive description on the steps.data(dropsMarkers)data(dropsMap)data(dropsPheno)## Add genotypes as row names of dropsMarkers and drop Ind column.rownames(dropsMarkers) <- dropsMarkers[["Ind"]]dropsMarkers <- dropsMarkers[colnames(dropsMarkers) != "Ind"]## Add genotypes as row names of dropsMap.rownames(dropsMap) <- dropsMap[["SNP.names"]]## Rename Chomosome and Position columns.colnames(dropsMap)[match(c("Chromosome", "Position"), colnames(dropsMap))] <- c("chr", "pos") ## Convert phenotypic data to a list.dropsPhenoList <- split(x = dropsPheno, f = dropsPheno[["Experiment"]])## Rename Variety_ID to genotype and select relevant columns.dropsPhenoList <- lapply(X = dropsPhenoList, FUN = function(trial) { colnames(trial)[colnames(trial) == "Variety_ID"] <- "genotype" trial <- trial[c("genotype", "grain.yield", "grain.number", "seed.size", "anthesis", "silking", "plant.height", "tassel.height", "ear.height")]return(trial)}) ## Create gData object.gDataDrops <- createGData(geno = dropsMarkers, map = dropsMap, pheno = dropsPhenoList) ## Run single trait GWAS for trait 'grain.yield' for trial Mur13W.GWASDrops <- runSingleTraitGwas(gData = gDataDrops, trials = "Mur13W", traits = "grain.yield") ## Run single trait GWAS for trait 'grain.yield' for trial Mur13W.## Use chromosome specific kinship matrices calculated using vanRaden method.GWASDropsMult <- runSingleTraitGwas(gData = gDataDrops, trials = "Mur13W", traits = "grain.yield", kinshipMethod = "vanRaden", GLSMethod = "multi")Summary function for the classGWAS
Description
Gives a summary for an object of S3 classGWAS.
Usage
## S3 method for class 'GWAS'summary(object, ..., trials = NULL, traits = NULL)Arguments
object | An object of class |
... | Not used. |
trials | A vector of strings or numeric indices indicating forwhich trials the summary should be made. If |
traits | A vector of strings indicating which traits to include in thesummary. If |
Summary function for the classgData
Description
Gives a summary for an object of S3 classgData.
Usage
## S3 method for class 'gData'summary(object, ..., trials = NULL)Arguments
object | An object of class |
... | Not used. |
trials | A vector of trials to include in the summary. These canbe either numeric indices or character names of list items in |
Value
A list with a most four components:
- mapSum
A list with number of markers and number of chromosomes inthe map.
- markerSum
A list with number of markers, number of genotypes andthe distribution of the values within the markers.
- phenoSum
A list of data.frames, one per trial with a summary of alltraits within the trial.
- covarSum
A list of data.frames, one per trial with a summary of allcovariates within the trial.
All components are only present in the output if the corresponding content ispresent in the gData object.