Movatterモバイル変換


[0]ホーム

URL:


Genetic Analysis Package

Jing Hua Zhao

Department of Public Health and Primary Care

University of Cambridge

Cambridge, UK

https://jinghuazhao.github.io/

2024-08-26

1 Introduction

This package was initiated to integrate some C/Fortran/SAS programs I have writtenor used over the years. As such, it would rather be a long-term project, but animmediate benefit is something complementary to other packages currently availablefrom CRAN, e.g. genetics,hwde, etc. I hope eventually this will be partof a bigger effort to fulfil most of the requirements foreseen by many1,within the portable environment of R for data management, analysis, graphics andobject-oriented programming. My view has been outlined more formally2,3in relation to other package systems and also on packagekinship4,5.I feel the enormous advantage by shifting to R and would like to my work with othersas it is available, which I will not claim this work as exclusively done by myself,but would like to invite others to join me and enlarge the collections and improve them.

With recent work on genomewide association studies (GWASs) especially protein GWASs, Ihave added many functions such asMETAL_forestplot which handles data from softwareMETAL6 andqtlFinder which extracts sentinels from GWAS summary statisticsin a way that is very appealing to us compared to some other established software.Meanwhile, the size of the package surpasses the limit as imposed by CRAN, thus theold good feature ofS as withR that value both code and data alike has to sufferslightly in thatgap.datasets andgap.examples are spun off as two separatepackages; they do deserve a glimpse however for some general ideas. A separateinitiative has been made in thepQTLtools package.

Notable recent technical updates include:

  1. documentaion from R code viaroxygen2 anddevtools to allow for easiergeneration of the .Rd files and NAMESPACE.

  2. vignettes with noweb/Sweave as well asrmarkdown andbookdown that allowsfor numbered sectioning, multiple figures generated from a code chunk, andCitation Style Language (CSL) (https://citationstyles.org/).Rdpack is nowemployed for BibTeX citations in the .Rd files.

  3. an experimental shiny App (runshinygap()).

These experiences can be useful for others in their package development. I found ituseful to use specific functions without loading the whole package, i.e.,library(gap),e.g.,invnormal <- gap::invnormal,log10p <- gap::log10p.

2 Implementation

The currently available functions are shown below,

NameDescription
ANALYSIS 
AE3AE model using nuclear family trios
btBradley-Terry model for contingency table
ccsizePower and sample size for case-cohort design
csCredibel set
fbsizeSample size for family-based linkage and association design
gc.emGene counting for haplotype analysis
gcontrolgenomic control
gcontrol2genomic control based on p values
gcpPermutation tests using GENECOUNTING
gc.lambdaEstionmation of the genomic control inflation statistic (lambda)
genecountingGene counting for haplotype analysis
gifKinship coefficient and genetic index of familiality
MCMCgrmMixed modeling with genetic relationship matrices
hapHaplotype reconstruction
hap.emGene counting for haplotype analysis
hap.scoreScore statistics for association of traits with haplotypes
htrHaplotype trend regression
hweHardy-Weinberg equilibrium test for a multiallelic marker
hwe.ccA likelihood ratio test of population Hardy-Weinberg equilibrium
hwe.hardyHardy-Weinberg equilibrium test using MCMC
invnormalInverse normal transformation
kin.morgankinship matrix for simple pedigree
LD22LD statistics for two diallelic markers
LDklLD statistics for two multiallelic markers
lambda1000A standardized estimate of the genomic inflation scaling to
 a study of 1,000 cases and 1,000 controls
log10plog10(p) for a standard normal deviate
log10pvaluelog10(p) for a P value including its scientific format
logplog(p) for a normal deviate
masizeSample size calculation for mediation analysis
miamultiple imputation analysis for hap
mrMendelian randomization analysis
mtdtTransmission/disequilibrium test of a multiallelic marker
mtdt2Transmission/disequilibrium test of a multiallelic marker
 by Bradley-Terry model
mvmetaMultivariate meta-analysis based on generalized least squares
pbsizePower for population-based association design
pbsize2Power for case-control association design
pfcProbability of familial clustering of disease
pfc.simProbability of familial clustering of disease
pgcPreparing weight for GENECOUNTING
print.hap.scorePrint a hap.score object
s2kStatistics for 2 by K table
sentinelsSentinel identification from GWAS summary statistics
tsccPower calculation for two-stage case-control design
  
GRAPHICS 
asplotRegional association plot
ESplotEffect-size plot
circos.cnvplotcircos plot of CNVs
circos.cis.vs.trans.plotcircos plot of cis/trans classification
circos.mhtplotcircos Manhattan plot with gene annotation
circos.mhtplot2Another circos Manhattan plot
cnvplotgenomewide plot of CNVs
labelManhattanAnnotate Manhattan or Miami Plot
METAL_forestplotforest plot as R/meta’s forest for METAL outputs
makeRLEplotmake relative log expression plot
mhtplotManhattan plot
mhtplot2Manhattan plot with annotations
mhtplot.trunctruncated Manhattan plot
miamiplotMiami plot
miamiplot2Miami plot
mr_forestplotMendelian Randomization forest plot
pedtodotConverting pedigree(s) to dot file(s)
pedtodot_verbatimPedigree-drawing with graphviz
plot.hap.scorePlot haplotype frequencies versus haplotype score statistics
qqfunQuantile-comparison plots
qqunifQ-Q plot for uniformly distributed random variable
qtl2dplot2D QTL plot
qtl2dplotly2D QTL plotly
qtl3dplotly3D QTL plotly
  
UTILITIES 
SNPFunctions for single nucleotide polymorphisms (SNPs)
BFDPBayesian false-discovery probability
FPRPFalse-positive report probability
abTest/Power calculation for mediating effect
b2rObtain correlation coefficients and their variance-covariances
chow.testChow’s test for heterogeneity in two regressions
chr_pos_a1_a2Form SNPID from chromosome, posistion and alleles
cis.vs.trans.classificationa cis/trans classifier
ci2msEffect size and standard error from confidence interval
comp.scorescore statistics for testing genetic linkage of quantitative trait
GRM functionsReadGRM, ReadGRMBin, ReadGRMPLINK,
 ReadGRMPCA, WriteGRM,
 WriteGRMBin, WriteGRMSAS
 handle genomic relationship matrix involving other software
get_b_seGet b and se from AF, n, and z
get_pve_seGet pve and its standard error from n, z
get_sdyGet sd(y) from AF, n, b, se
h2GA utility function for heritability
h2GEA utility function for heritability involving gene-environment interaction
h2lA utility function for converting observed heritability to its counterpart
 under liability threshold model
h2_mzdzHeritability estimation according to twin correlations
klemHaplotype frequency estimation based on a genotype table
 of two multiallelic markers
makepedA function to prepare pedigrees in post-MAKEPED format
metapMeta-analysis of p values
metaregFixed and random effects model for meta-analysis
muvarMeans and variances under 1- and 2- locus (diallelic) QTL model
pvalueP value for a normal deviate
qtlClassifierA QTL cis/trans classifier
qtlFinderDistance-based signal identification
read.ms.outputA utility function to read ms output
revStrandAllele on the reverse strand
runshinygapStart shinygap
snptest_sampleA utility to generate SNPTEST sample file
whscoreWhittemore-Halpern scores for allele-sharing
weighted.medianWeighted median with interpolation

After installation, you will be able to obtain the list by typinglibrary(help=gap)in alphabetical order, or?gap::gap ordered by category, or view it within a webbrowser viahelp.start(). A full list of functions is provided in the Appendix.This file can be viewed with commandvignette("gap", package="gap"). You can cutand paste examples at end of each function’s documentation.

Bothgenecounting andhap are able to handle SNPs and multiallelicmarkers, with the former be flexible enough to include features such as X-linked dataand the later being able to handle large number of SNPs. But they are unable torecode allele labels automatically, so functionsgc.em andhap.em are inhaplo.em format and used by a modified functionhap.score in association testing.

It is notable that multilocus data are handled differently from that inhwde andelegant definitions of basic genetic data can be found in thegenetics package.Incidentally, I found my C mixed-radixed sorting routine7 is much faster thanR’s internal function.

With exceptions such as functionpfc which is very computer-intensive, most functionsin the package can easily be adapted for analysis of large datasets involving eitherSNPs or multiallelic markers. Some are utility functions, e.g. muvar andwhscore,which will be part of the other analysis routines in the future.

The benefit with R compared to standalone programs is that for users, all functions haveunified format. For developers, it is able to incorporate their C/C++ programs moreeasily and avoid repetitive work such as preparing own routines for matrix algebra andlinear models. Further advantage can be taken from packages inBioconductor, whichare designed and written to deal with large number of genes.

3 Independent programs

To facilitate comparisons and individual preferences, the source codes for EHPLUS8,2LD, GENECOUNTING, HAP9, now hosted at GitHub, have enjoyed great popularity aheadof GWASs therefore are likely to be more familiar than their R counterparts ingap butyou need to follow their instructions to compile for a particular computer system.

I have kept the originalpedtodot.sh by David Duffy which enables contrast withpedtodot_verbatim() andpedtodot() reported as application notes. I have also includedms code10 to couple withread.ms.output.

A final note is concerned abouttwinan90, which is now dropped from the package functionlist due to difficulty to keep up with the requirements by theR environment neverthelessyou will still be able to compile and use otherwise fromgap.examples.

4 Demos

This has been a template for adding self-contained examples:

library(gap)demo(gap)

See examples of haplotype analysis there.

5 Pedigrees and kinship

5.1 Pedigree drawing

I have included the original file for theR News as well as put examples in separatevignettes. They can be accessed viavignette("rnews",package="gap.examples") andvignette("pedtodot", package="gap.examples"), respectively.

We also demonstrate through pedigree 10081 example5 withpedtodot_verbatim.

library(gap)#> Loading required package: gap.datasets#> gap version 1.6knitr::kable(p3,caption="An example pedigree")
Table 5.1:An example pedigree
pididfidmidsexaffGABRB1D4S1645
10081123227/77/10
1008120011-/--/-
10081300227/93/10
10081423227/93/7
10081523217/77/10
10081623117/77/10
10081723217/77/10
1008180011-/--/-
10081984117/93/10
10081100021-/--/-
1008111210217/77/7
1008112210226/77/7
10081130011-/--/-
10081141311117/87/8
10081150011-/--/-
10081161512216/67/7
library(DOT)# one can see the diagram in RStudiopedtodot_verbatim(p3,run=TRUE,toDOT=TRUE,return="verbatim")#> Pedigree  10081 #> running DOT::dot on 10081library(DiagrammeR)pedtodot_verbatim(p3)#> Pedigree  10081grViz(readr::read_file("10081.dot"))

Figure 5.1: An example pedigree

5.2 Kinship calculation

Next, I will provide an example for kinship calculation usingkin.morgan.It is recommended that individuals in a pedigree are ordered so that parentsalways precede their children. In this regard, packagepedigree can beused, and packagekinship2 can be used to produce pedigree diagram as withkinship matrix.

The pedigree diagram is as follows,

# pedigree diagramdata(lukas, package="gap.datasets")library(kinship2)#> Loading required package: Matrix#> Loading required package: quadprogped <- with(lukas,pedigree(id,father,mother,sex))plot(ped,cex=0.4)
A pedigree diagram

Figure 5.2: A pedigree diagram

We then turn to the kinship calculation.

# unordered individualsgk1 <- kin.morgan(lukas)write.table(gk1$kin.matrix,"gap_1.txt",quote=FALSE)library(kinship2)kk1 <- kinship(lukas[,1],lukas[,2],lukas[,3])write.table(kk1,"kinship_1.txt",quote=FALSE)d <- gk1$kin.matrix-kk1sum(abs(d))#> [1] 2.443634# order individuals so that parents precede their childrenlibrary(pedigree)op <- orderPed(lukas)olukas <- lukas[order(op),]gk2 <- kin.morgan(olukas)write.table(olukas,"olukas.csv",quote=FALSE)write.table(gk2$kin.matrix,"gap_2.txt",quote=FALSE)kk2 <- kinship(olukas[,1],olukas[,2],olukas[,3])write.table(kk2,"kinship_2.txt",quote=FALSE)z <- gk2$kin.matrix-kk2sum(abs(z))#> [1] 0

We see that in the second case, the result agrees withkinship2.

6 Study designs

I would like to highlightfbsize,pbsize andccsize functions usedfor power/sample calculations in a genome-wide asssociatoin study asreported11,12,13.

It now has an experimental work via Shiny frominst/shinygap.

6.1 Family-based design

The example is as follows,

options(width=150)library(gap)models <- matrix(c(         4.0, 0.01,         4.0, 0.10,         4.0, 0.50,          4.0, 0.80,         2.0, 0.01,         2.0, 0.10,         2.0, 0.50,         2.0, 0.80,         1.5, 0.01,             1.5, 0.10,         1.5, 0.50,         1.5, 0.80), ncol=2, byrow=TRUE)outfile <- "fbsize.txt"cat("gamma","p","Y","N_asp","P_A","H1","N_tdt","H2","N_asp/tdt",    "L_o","L_s\n",file=outfile,sep="\t")for(i in 1:12) {    g <- models[i,1]    p <- models[i,2]    z <- fbsize(g,p)    cat(z$gamma,z$p,z$y,z$n1,z$pA,z$h1,z$n2,z$h2,z$n3,        z$lambdao,z$lambdas,file=outfile,append=TRUE,sep="\t")    cat("\n",file=outfile,append=TRUE)}table1 <- read.table(outfile,header=TRUE,sep="\t")nc <- c(4,7,9)table1[,nc] <- ceiling(table1[,nc])dc <- c(3,5,6,8,10,11)table1[,dc] <- round(table1[,dc],2)unlink(outfile)knitr::kable(table1,caption="Power/Sample size of family-based designs")
Table 6.1:Power/Sample size of family-based designs
gammapYN_aspP_AH1N_tdtH2N_asp.tdtL_oL_s
4.00.010.5264020.800.0512010.112571.081.09
4.00.100.602770.800.351650.54531.481.54
4.00.500.584460.800.501130.42671.361.39
4.00.800.5330240.800.242440.161771.121.13
2.00.010.504459640.670.0363710.0421551.011.01
2.00.100.5280870.670.257610.322901.071.08
2.00.500.5337530.670.503730.471971.111.11
2.00.800.51179090.670.277010.224311.051.05
1.50.010.5069447790.600.02211380.0385081.001.00
1.50.100.511019260.600.2124270.2510301.021.02
1.50.500.51270480.600.5010390.495301.041.04
1.50.800.511019260.600.2918200.2510301.021.02

As for APOE4 and Alzheimer’s14

g <- 4.5p <- 0.15alz <- data.frame(fbsize(g,p))knitr::kable(alz,caption="Power/Sample size of study on Alzheimer's disease")
Table 6.2:Power/Sample size of study on Alzheimer’s disease
gammapyn1pAh1n2h2n3lambdaolambdas
4.50.150.6256916162.62460.81818180.4598361108.9940.620762539.976881.6715941.784353

6.2 Population-based design

The example is as follows,

library(gap)kp <- c(0.01,0.05,0.10,0.2)models <- matrix(c(          4.0, 0.01,          4.0, 0.10,          4.0, 0.50,           4.0, 0.80,          2.0, 0.01,          2.0, 0.10,          2.0, 0.50,          2.0, 0.80,          1.5, 0.01,              1.5, 0.10,          1.5, 0.50,          1.5, 0.80), ncol=2, byrow=TRUE)outfile <- "pbsize.txt"cat("gamma","p","p1","p5","p10","p20\n",sep="\t",file=outfile)for(i in 1:dim(models)[1]){   g <- models[i,1]   p <- models[i,2]   n <- vector()   for(k in kp) n <- c(n,ceiling(pbsize(k,g,p)))   cat(models[i,1:2],n,sep="\t",file=outfile,append=TRUE)   cat("\n",file=outfile,append=TRUE)} table5 <- read.table(outfile,header=TRUE,sep="\t")knitr::kable(table5,caption="Sample size of population-based design")
Table 6.3:Sample size of population-based design
gammapp1p5p10p20
4.00.0146681895942441887
4.00.1081801570744331
4.00.50108912091991441
4.00.8031473604128621272
2.00.01403970775303672516323
2.00.10527091011647922130
2.00.5035285677232081426
2.00.80793911523772183208
1.50.01159992030705614544864644
1.50.1019210536869174657762
1.50.50980131881189113961
1.50.8019210536869174657762

6.3 Case-cohort design

We obtain results for ARIC and EPIC studies.

library(gap)# ARIC studyoutfile <- "aric.txt"n <- 15792pD <- 0.03p1 <- 0.25alpha <- 0.05theta <- c(1.35,1.40,1.45)beta <- 0.2s_nb <- c(1463,722,468)cat("n","pD","p1","hr","q","power","ssize\n",file=outfile,sep="\t")for(i in 1:3){  q <- s_nb[i]/n  power <- ccsize(n,q,pD,p1,log(theta[i]),alpha,beta,TRUE)  ssize <- ccsize(n,q,pD,p1,log(theta[i]),alpha,beta,FALSE)  cat(n,"\t",pD,"\t",p1,"\t",theta[i],"\t",q,"\t",      signif(power,3),"\t",ssize,"\n",      file=outfile,append=TRUE)}read.table(outfile,header=TRUE,sep="\t")#>       n   pD   p1   hr          q power ssize#> 1 15792 0.03 0.25 1.35 0.09264184   0.8  1463#> 2 15792 0.03 0.25 1.40 0.04571935   0.8   722#> 3 15792 0.03 0.25 1.45 0.02963526   0.8   468unlink(outfile)# EPIC studyoutfile <- "epic.txt"n <- 25000alpha <- 0.00000005beta <- 0.2s_pD <- c(0.3,0.2,0.1,0.05)s_p1 <- seq(0.1,0.5,by=0.1)s_hr <- seq(1.1,1.4,by=0.1)cat("n","pD","p1","hr","alpha","ssize\n",file=outfile,sep="\t")# direct calculationfor(pD in s_pD){   for(p1 in s_p1)   {      for(hr in s_hr)      {         ssize <- ccsize(n,q,pD,p1,log(hr),alpha,beta,FALSE)         if (ssize>0) cat(n,"\t",pD,"\t",p1,"\t",hr,"\t",alpha,"\t",                          ssize,"\n",                          file=outfile,append=TRUE)      }   }}knitr::kable(read.table(outfile,header=TRUE,sep="\t"),caption="Sample size of case-cohort designs")
Table 6.4:Sample size of case-cohort designs
npDp1hralphassize
250000.30.11.3014391
250000.30.11.405732
250000.30.21.2021529
250000.30.21.305099
250000.30.21.402613
250000.30.31.2011095
250000.30.31.303490
250000.30.31.401882
250000.30.41.208596
250000.30.41.302934
250000.30.41.401611
250000.30.51.207995
250000.30.51.302786
250000.30.51.401538
250000.20.11.409277
250000.20.21.307725
250000.20.21.403164
250000.20.31.304548
250000.20.31.402152
250000.20.41.2020131
250000.20.41.303648
250000.20.41.401805
250000.20.51.2017120
250000.20.51.303422
250000.20.51.401713
250000.10.21.408615
250000.10.31.403776
250000.10.41.3013479
250000.10.41.402824
250000.10.51.3010837
250000.10.51.402606
unlink(outfile)

7 Graphics

Some figures from the documentation may be of interest.

7.1 Genome-wide association

The following code is used to obtain a Q-Q plot viaqqunif function,

library(gap)u_obs <- runif(1000)r <- qqunif(u_obs,pch=21,bg="blue",bty="n")u_exp <- r$yhits <- u_exp >= 2.30103points(r$x[hits],u_exp[hits],pch=21,bg="green")legend("topleft",sprintf("GC.lambda=%.4f",gc.lambda(u_obs)))
A Q-Q plot

Figure 7.1: A Q-Q plot

Based on a chicken genome scan data, the code below generates a Manhattan plot, demonstratingthe use of gaps to separate chromosomes.

ord <- with(w4,order(chr,pos))w4 <- w4[ord,]oldpar <- par()par(cex=0.6)colors <- c(rep(c("blue","red"),15),"red")mhtplot(w4,control=mht.control(colors=colors,gap=1000),pch=19,srt=0)axis(2,cex.axis=2)suggestiveline <- -log10(3.60036E-05)genomewideline <- -log10(1.8E-06)abline(h=suggestiveline, col="blue")abline(h=genomewideline, col="green")abline(h=0)
A genome-wide association study on chickens

Figure 7.2: A genome-wide association study on chickens

The code below obtains a Manhattan plot with gene annotation15,

data <- with(mhtdata,cbind(chr,pos,p))glist <- c("IRS1","SPRY2","FTO","GRIK3","SNED1","HTR1A","MARCH3","WISP3",           "PPP1R3B","RP1L1","FDFT1","SLC39A14","GFRA1","MC4R")hdata <- subset(mhtdata,gene%in%glist)[c("chr","pos","p","gene")]color <- rep(c("lightgray","gray"),11)glen <- length(glist)hcolor <- rep("red",glen)  par(las=2, xpd=TRUE, cex.axis=1.8, cex=0.4)ops <- mht.control(colors=color,yline=1.5,xline=3)hops <- hmht.control(data=hdata,colors=hcolor)mhtplot(data,ops,hops,pch=19)axis(2,pos=2,at=1:16,cex.axis=0.5)
A Manhattan plot with gene annotation

Figure 7.3: A Manhattan plot with gene annotation

All these look familiar, so revised form of the function calledmhtplot2 wascreated for additional features such as centering the chromosome ticks, allowing formore sophisticated coloring schemes, using prespecified fonts, etc. Please refer tothe function’s documentation example.

We could also go further with a circos Manhattan plot,

circos.mhtplot(mhtdata, glist)#> Note: 11 points are out of plotting region in sector 'chr16', track '3'.
A circos Manhattan plot

Figure 7.4: A circos Manhattan plot

and a version with y-axis,

require(gap.datasets)library(dplyr)#> #> Attaching package: 'dplyr'#> The following objects are masked from 'package:stats':#> #>     filter, lag#> The following objects are masked from 'package:base':#> #>     intersect, setdiff, setequal, uniontestdat <- mhtdata[c("chr","pos","p","gene","start","end")] %>%           rename(log10p=p) %>%           mutate(chr=paste0("chr",chr),log10p=-log10(log10p))dat <- mutate(testdat,start=pos,end=pos) %>%       select(chr,start,end,log10p)labs <- subset(testdat,gene %in% glist) %>%        group_by(gene,chr,start,end) %>%        summarize() %>%        mutate(cols="blue") %>%        select(chr,start,end,gene,cols)#> `summarise()` has grouped output by 'gene', 'chr', 'start'. You can override using the `.groups` argument.labs[2,"cols"] <- "red"ticks <- 0:2*5circos.mhtplot2(dat,labs,ticks=ticks,ymax=max(ticks))#> Note: 43 points are out of plotting region in sector 'chr16', track '5'.
Another circos Manhattan plot

Figure 7.5: Another circos Manhattan plot

As a side note, the data is used bymanhattanly.

#{r mhttest, fig.cap="Manhattanly plot", fig.height=8, fig.width=8, plotly=TRUE}library(manhattanly)mhttest <- manhattanly(mhtdata, chr = "chr", bp = "pos",                       snp = "rsn", annotation1 = "gene", suggestiveline = TRUE,                       annotation2 = "rsn", p = "p")mhttesthtmlwidgets::saveWidget(mhttest,"mhttest.html")

We now experiment with Miami plot,

mhtdata <- within(mhtdata,{pr=p})miamiplot(mhtdata,chr="chr",bp="pos",p="p",pr="pr",snp="rsn")
Miami plots

Figure 7.6: Miami plots

# An alternative implementationgwas <- select(mhtdata,chr,pos,p) %>%        mutate(z=qnorm(p/2))chrmaxpos <- miamiplot2(gwas,gwas,name1="Batch 2",name2="Batch 1",z1="z",z2="z")#> Warning in max(c(dat1$pos[dat1$chr == i], dat2$pos[dat2$chr == i]), na.rm = TRUE): no non-missing arguments to max; returning -InflabelManhattan(chr=c(2,16),pos=c(226814165,52373776),name=c("AnonymousGene","FTO"),gwas,gwasZLab="z",chrmaxpos=chrmaxpos)
Miami plots

Figure 7.7: Miami plots

and a truncated Manhattan plot, noting that only data points with -log10(P)>=2 are shown,

par(oma=c(0,0,0,0), mar=c(5,6.5,1,1))mhtplot.trunc(filter(IL.12B,log10P>=2), chr="Chromosome", bp="Position", z="Z",   snp="MarkerName",   suggestiveline=FALSE, genomewideline=-log10(5e-10),   cex.mtext=1.2, cex.text=1.2,   annotatelog10P=-log10(5e-10), annotateTop = FALSE,   highlight=with(genes,gene),   mtext.line=3, y.brk1=115, y.brk2=300, trunc.yaxis=TRUE, delta=0.01,   cex.axis=1.5, cex=0.8, font=3, font.axis=1.5,   y.ax.space=20,   col = c("blue4", "skyblue"))
Association of IL-12B

Figure 7.8: Association of IL-12B

The code below obtains a regional association plot with theasplot function,

asplot(CDKNlocus, CDKNmap, CDKNgenes, best.pval=5.4e-8, sf=c(3,6))#> - CDKN2A #> - CDKN2Btitle("CDKN2A/CDKN2B Region")
A regional association plot

Figure 7.9: A regional association plot

The function predates the currently popularlocuszoom software but leavesthe option open for generating such plots on the fly and locally.

Note that all these can serve as templates to customize features of your own.

7.2 Copy number variation

A plot of copy number variation (CNV) is shown here,

cnvplot(gap.datasets::cnv)
A CNV plot

Figure 7.10: A CNV plot

7.3 Effect size plot

The code below obtains an effect size plot via theESplot function.

library(gap)rs12075 <- data.frame(id=c("CCL2","CCL7","CCL8","CCL11","CCL13","CXCL6","Monocytes"),                      b=c(0.1694,-0.0899,-0.0973,0.0749,0.189,0.0816,0.0338387),                      se=c(0.0113,0.013,0.0116,0.0114,0.0114,0.0115,0.00713386))ESplot(rs12075)
rs12075 and traits

Figure 7.11: rs12075 and traits

7.4 Forest plot

It draws many forest plots given a list of variants, e.g.,

data(OPG,package="gap.datasets")meta::settings.meta(method.tau="DL")METAL_forestplot(OPGtbl,OPGall,OPGrsid,width=6.75,height=5,digits.TE=2,digits.se=2,                 col.diamond="black",col.inside="black",col.square="black")#> Joining with `by = join_by(MarkerName)`#> Joining with `by = join_by(MarkerName)`
Forest plots

Figure 7.12: Forest plots

Forest plots

Figure 7.13: Forest plots

METAL_forestplot(OPGtbl,OPGall,OPGrsid,package="metafor",method="FE",xlab="Effect",                 showweights=TRUE)#> Joining with `by = join_by(MarkerName)`#> Joining with `by = join_by(MarkerName)`
Forest plots

Figure 7.14: Forest plots

Forest plots

Figure 7.15: Forest plots

8 Significance

Our focus is on z ~ Normal(0,1), whose schematic diagram is shown below.

Normal(0,1) distribution

Figure 8.1: Normal(0,1) distribution

The associate R function isz <- function(p) qnorm(p/2,lower.tail=FALSE).

When z is very large, its corresponding p value is very small. A genomewide significance is declared at 0.05/1000000=5e-8 with Bonferroni correction assuming 1 million SNPs are tested. This short note describes how to get -log10(p), which can be used in a Q-Q plot and software such as DEPICT16. The solution here is generic since z is also the square root of a chi-squared statistic, for instance.

8.1 log(p) and log10(p)

First thing first, here are the answers for log(p) and log10(p) given z,

# log(p) for a standard normal deviate z based on log()logp <- function(z) log(2)+pnorm(-abs(z), lower.tail=TRUE, log.p=TRUE)# log10(p) for a standard normal deviate z based on log()log10p <- function(z) log(2, base=10)+pnorm(-abs(z), lower.tail=TRUE, log.p=TRUE)/log(10)

Notelogp() will be used for functions such asqnorm() as in functioncs() whereaslog10p() is more appropriate for Manhattan plot and used insentinels().

8.2 Rationale

We start with z=1.96 whose corresponding p value is approximately 0.05.

2*pnorm(-1.96,lower.tail=TRUE)

giving an acceptable value 0.04999579, so we proceed to get log10(p)

log10(2)+log10(pnorm(-abs(z),lower.tail=TRUE))

leading to the expression above from the fact that log10(X)=log(X)/log(10) since log(),being the natural log function, ln() – so log(exp(1)) = 1, in R, works far better onthe numerator of the second term. The use of -abs() just makes sure we are working onthe lower tail of the standard Normal distribution from which our p value is calculated.

8.3 Benchmark

Now we have a stress test,

z <- 20000-log10p(z)

giving -log10(p) = 86858901.

8.4 Multiple precision arithmetic

We would be curious about the p value itself as well, which is furnished with the Rmpfr package

require(Rmpfr)2*pnorm(mpfr(-abs(z),100),lower.tail=TRUE,log.p=FALSE)mpfr(log(2),100) + pnorm(mpfr(-abs(z),100),lower.tail=TRUE,log.p=TRUE)

giving p = 1.660579603192917090365313727164e-86858901 and -log(p) = -200000010.1292789076808554854177,respectively. To carry on we have -log10(p) = -log(p)/log(10)=86858901.

To make -log10(p) usable in R we obtain it directly through

as.numeric(-log10(2*pnorm(mpfr(-abs(z),100),lower.tail=TRUE)))

which actually yields exactly the same 86858901.

If we go very far to have z=50,000. then -log10(p)=542868107 but we have less luck with Rmpfr.

One may wonder the P value in this case, which is 6.6666145952e-542868108 or simply 6.67e-542868108.

The magic function for doing this is defined as follows,

pvalue <- function (z, decimals = 2){    lp <- -log10p(z)    exponent <- ceiling(lp)    base <- 10^-(lp - exponent)    paste0(round(base, decimals), "e", -exponent)}

and it is more appropriate to express p values in scientific format so they can behandled as follows,

log10pvalue <- function(p=NULL,base=NULL,exponent=NULL){  if(!is.null(p))  {    p <- format(p,scientific=TRUE)    p2 <- strsplit(p,"e")    base <- as.numeric(lapply(p2,"[",1))    exponent <- as.numeric(lapply(p2,"[",2))  } else if(is.null(base) | is.null(exponent)) stop("base and exponent should both be specified")  log10(base)+exponent}

used aslog10pvalue(p) when p<=1e-323, or log10pvalue(base=1,exponent=-323) otherwise.

One can also derive logpvalue for natural base (e) similarly.

We end with a quick look-up table

require(gap)zlist <- c(5,10,30,40,50,100,500,1000,2000,3000,5000)zp <- sapply(zlist,function(z) {c(z,pvalue(z),logp(z),log10p(z))})rownames(zp) <- c("z","P","log(P)","log10(P)")knitr::kable(t(zp),caption="z, P, log(P) and log10(P)")
Table 8.1:z, P, log(P) and log10(P)
zPlog(P)log10(P)
55.73e-7-14.3718512134288-6.24161567672667
101.52e-23-52.5381379699525-22.8170234098221
309.81e-198-453.628096775783-197.008179265997
407.31e-350-803.915294833194-349.135976463682
502.16e-545-1254.13821395886-544.665305866333
1002.69e-2174-5004.83106151365-2173.57051287337
5002.47e-54290-125006.440403451-54289.6072695865
10004.58e-217151-500007.133547632-217150.339011999
20004.34e-868593-2000007.82669406-868592.362896546
30001.8e-1954329-4500008.23215903-1954328.74374587
50001.51e-5428685-12500008.7429846-5428684.82082061

8.5 Application

Themhtplot.trunc() function accepts three types of arguments:
  1. P values of association statistics, which could be very small.
log10p. log10(P).
  1. normal statistics that could be very large.

In all three cases, a log10(P) counterpart is obtained internally and to accommodate extreme value,the y-axis allows for truncation leaving out a given range to highlight the largest.

See the IL-12B example above.

9 Linear regression

Several functions related to linear regression are detailed here.

9.1 Some preparations

Let\(\mbox{x} = SNP\ dosage\). Note that\(\mbox{Var}(\mbox{x})=2f(1-f)\),\(f=MAF\) or\(1-MAF\) by symmetry.

Our linear regression model is\(\mbox{y}=a + b\mbox{x} + e\). We have\(\mbox{Var}(\mbox{y}) = b^2\mbox{Var}(\mbox{x}) + \mbox{Var}(e)\). Moreover,\(\mbox{Var}(b)=\mbox{Var}(e)(\mbox{x}'\mbox{x})^{-1}=\mbox{Var}(e)/S_\mbox{xx}\), we have\(\mbox{Var}(e) = \mbox{Var}(b)S_\mbox{xx} = N \mbox{Var}(b) \mbox{Var}(\mbox{x})\). Consequently, let\(z = {b}/{SE(b)}\), we have

\[\begin{eqnarray*}\mbox{Var}(\mbox{y}) &=& \mbox{Var}(\mbox{x})(b^2+N\mbox{Var}(b)) \hspace{100cm} \cr &=& \mbox{Var}(\mbox{x})\mbox{Var}(\mbox{b})(z^2+N) \cr &=& 2f(1-f)(z^2+N)\mbox{Var}(b)\end{eqnarray*}\]

Moreover, the mean and the variance of the multiple correlation coefficient or the coefficient of determination (\(R^2\)) are known17 to be\({1}/{(N-1)}\) and\({2(N-2)}/{\left[(N-1)^2(N+1)\right]}\), respectively.

We also need some established results of a ratio (R/S)1, i.e., the mean

\[\begin{align}E(R/S) \approx \frac{\mu_R}{\mu_S}-\frac{\mbox{Cov}(R,S)}{\mu_S^2}+\frac{\sigma_S^2\mu_R}{\mu_S^3} \hspace{100cm}\tag{9.1}\end{align}\]

and more importantly the variance

\[\begin{align}\mbox{Var}(R/S) \approx \frac{\mu_R^2}{\mu_S^2} \left[ \frac{\sigma_R^2}{\mu_R^2} -2\frac{\mbox{Cov}(R,S)}{\mu_R\;\mu_S} +\frac{\sigma_S^2}{\mu_S^2} \right] \hspace{100cm}\tag{9.2}\end{align}\]

where\(\mu_R\),\(\mu_S\),\(\sigma_R^2\),\(\sigma_S^2\) are the means and the variances for R and S, respectively.

Finally, we need some facts about\(\chi_1^2\),\(\chi^2\) distribution of one degree of freedom. For\(z \sim N(0,1)\),\(z^2\sim \chi_1^2\), whose mean and variance are 1 and 2, respectively.

We now have the following results.

9.2 Proportion of variance explained

We have

\[\begin{align}\mbox{PVE}_{\mbox{linear regression}} & = \frac{\mbox{Var}(b\mbox{x})}{\mbox{Var}(\mbox{y})} \hspace{100cm} \\ & = \frac{\mbox{Var}(\mbox{x})b^2}{\mbox{Var}(\mbox{x})(b^2+N\mbox{Var}(b))} \\ & = \frac{\mbox{z}^2}{\mbox{z}^2+N}\tag{9.3}\end{align}\]

On the other hand, for a simple linear regression\(R^2\equiv r^2\) where\(r\) is the Pearson correlation coefficient, which is readily from the\(t\)-statistic of the regression slope, i.e.,\(r={t}/{\sqrt{t^2+N-2}}\). so assuming\(t \equiv \ z \sim \chi_1^2\)

\[\begin{align}\mbox{PVE}_{t-\mbox{statistic}} & = \frac{\chi^2}{\chi^2+N-2} \hspace{100cm}\tag{9.4}\end{align}\]

To obtain coherent estimates of the asymptotic means and variances of both forms we resort to variance of a ratio (R/S). All the required elements are listed in a table below.

CharacteristicsLinear regression\(t\)-statistic
\(\mu_R\)11
\(\sigma_R^2\)22
\(\mu_S\)\(N+1\)\(N-1\)
\(\sigma_S^2\)22
\(\mbox{Cov}(R,S)\)22

then we have the means and the variances for PVE.

CharacteristicsLinear regression\(t\)-statistic
mean\(\frac{1}{N+1}\left[1-\frac{2}{N+1}+\frac{2}{(N+1)^2}\right]\)\(\frac{1}{N-1}\left[1-\frac{2}{N-1}+\frac{2}{(N-1)^2}\right]\)
variance\(\frac{2}{(N+1)^2}\left[1-\frac{1}{N+1}\right]^2\)\(\frac{2}{(N-1)^2}\left[1-\frac{1}{N-1}\right]^2\)

Finally, our approximation of PVE for a protein with\(T\) independent pQTLs from the meta-analysis

CharacteristicsLinear regression\(t\)-statistic
estimate\(\sum_{i=1}^T{\frac{\chi_i^2}{\chi_i^2+N_i}}\)\(\sum_{i=1}^T{\frac{\chi_i^2}{\chi_i^2+N_i-2}}\)
variance\(\sum_{i=1}^T\frac{2}{(N_i+1)^2}\)\(\sum_{i=1}^T\frac{2}{(N_i-1)^2}\)

Therefore they differ from the asymptotic results17 by ratios of\((N-2)(N+1)/(N-1)^2\) and\((N-2)/(N+1)\) forlinear regression and\(t\)-statistic, respectively.

Comparisons of Var($R^2$) estimates

Figure 9.1: Comparisons of Var(\(R^2\)) estimates

As the sample size increases, the estimates are quite close nevertheless quite small while the ratios approach 1 quickly from opposite sides after\(N\approx 300\).

9.3 Effect size and standard error

When\(\mbox{Var}(\mbox{y})=1\), as in cis eQTLGen18 data, we have\(b\) and its standard error (se) as follows,

\[\begin{align}b & = z/d \hspace{100cm} \\se & = 1/d\tag{9.5}\end{align}\]

where\(d = \sqrt{2f(1-f)(z^2+N)}\).

Now three functions are in place.

9.4 get_b_se

A record of the eQTLGen data is shown below

        SNP    Pvalue SNPChr  SNPPos AssessedAllele OtherAllele Zscore rs1003563 2.308e-06     12 6424577              A           G 4.7245            Gene GeneSymbol GeneChr GenePos NrCohorts NrSamples         FDR ENSG00000111321       LTBR      12 6492472        34     23991 0.006278872 BonferroniP hg19_chr hg19_pos AlleleA AlleleB allA_total allAB_total           1       12  6424577       A       G       2574        8483 allB_total AlleleB_all       7859   0.6396966

from which we obtain the effect size and its standard error as follows,

get_b_se(0.6396966,23991,4.7245)#>               b          se#> [1,] 0.04490488 0.009504684

9.5 get_pve_se

This function obtains proportion of explained variation (PVE) from n, z; itsstandard error is based on variance of the ratio (correction=TRUE) or\(r^2\).

9.6 get_sdy

We continue with the eQTLGen example above,

get_sdy(0.6396966,23991,0.04490488,0.009504684)#> [1] 1

and indeed the eQTLGen data were standardized.

10 Sentinels of association

10.1 Identification

Following an earlier implementation calledsentinels, a distance-based signalidentification based on GWAS summary statistics is avaialable fromqtlFinderfunction; to avoid the overhead of data-loading it works on a preselected listof variants from a GWAS and in this case a METAL output file.

10.2 Fine-mapping

We considered a region of interest (which could be approximately independent variants,e.g.,\(r^2 \le 0.5\)) using expressions that rely on effect sizes and their standarderrors19. More specifically, let Bayes factor (BF) for each variant in themeta-analysis be defined as\(ln(BF) \propto 0.5 \beta^2/SE^2\), where\(\beta\) and\(SE\)are the effect size and standard error from the meta-analysis, respectively. Theposterior probability (PP) for being causal for a particular variant is obtained as\(BF_i/\sum_{i=1}^TBF_i\), where\(i=1,\ldots,T\) indexes all variants considered in theregion. We generated credible sets within a given region by ranking all variants byPPs in descending order and calculating the number of variants required to reach acumulative probability of such as 99%.

The functioncs obtains credible set.

# zcat METAL/4E.BP1-1.tbl.gz | \# awk 'NR==1 || ($1==4 && $2 >= 187158034 - 1e6 && $2 < 187158034 + 1e6)' >  4E.BP1.ztbl <- within(read.delim("4E.BP1.z"),{logp <- logp(Effect/StdErr)})z <- cs(tbl)l <- cs(tbl,log_p="logp")

Note in particular that the implementation intends to avoid the naive summation inscenarios such as proteogenomic analysis containing exceptionally large BFs.

11 Polygenic modeling

In line with the recent surge of interest in the polygenic models, a separate vignetteis available throughvignette("h2",package="gap.examples") demonstrating aspect ofthe models on heritability. Utility Functionsh2G,h2GE andh2l are brieflydocumented. Functionsh2.jags andhwe.jags are also available. The functionh2_mzdz can be used for heritability estimation based on monozygotic (MZ) anddizygotic (DZ) twin correlations under the additive genetics, common and specificenvironment (ACE) model, e.g.,10.1038/s41562-023-01530-y.

12 Mendelian randomization

12.1 Themr function

The functionmr was originally developed to rework on data generated from GSMR20, although it could be any harmonised data. The following example is from analysis of a real data on LIF-R protein and CVD/FEV1.

Table 12.1:LIF-R and CAD/FEV1
SNPb.LIF.RSE.LIF.Rb.FEV1SE.FEV1b.CADSE.CAD
rs1887439060.68040.11040.001770.01660NANA
rs2289779-0.07880.01340.001040.00261-0.0075430.0092258
rs117804300-0.22810.0390-0.003920.008550.1093720.0362219
rs7033492-0.09680.0147-0.005850.002690.0227930.0119903
rs107939620.20980.02120.003780.00536-0.0145670.0138196
rs635634-0.28850.0153-0.020400.003340.0771570.0117123
rs176690-0.09730.01420.002930.00306-0.0000070.0107781
rs147278971-0.23360.0378-0.012400.007920.0798730.0397491
rs115626290.11550.01810.009600.00378-0.0100400.0151460

The MR analysis is as follows,

Mendelian randomization

Figure 12.1: Mendelian randomization

Mendelian randomization

Figure 12.2: Mendelian randomization

Mendelian randomization

Figure 12.3: Mendelian randomization

Mendelian randomization

Figure 12.4: Mendelian randomization

Mendelian randomization

Figure 12.5: Mendelian randomization

Mendelian randomization

Figure 12.6: Mendelian randomization

Table 12.2:LIFR variant rs635634 and CAD/FEV1
LIF.R.CADLIF.R.FEV1
bIVW-0.1870.045
sebIVW0.0500.012
CochQ2.1160.482
CochQp0.9531.000
bEGGER-0.1840.049
sebEGGER0.0600.015
intEGGER0.0010.001
seintEGGER0.0100.002
bWM-0.2470.062
sebWM0.0450.012
bPWM-0.2480.055
sebPWM0.0440.014
IVW1.79e-043.19e-04
EGGER2.15e-031.13e-03
WM4.15e-081.68e-07
PWM1.73e-085.64e-05

This is close to Ligthart et al.21 as used one time at workplace which turns to overlap with TwoSampleMR22.

12.2 Contrast of effect sizes

It would be of interest to contrast their effect sizes in the analysis above as well,

mr_names <- names(mrdat)LIF.R <- cbind(mrdat[grepl("SNP|LIF.R",mr_names)],trait="LIF.R"); names(LIF.R) <- c("SNP","b","se","trait")FEV1 <- cbind(mrdat[grepl("SNP|FEV1",mr_names)],trait="FEV1"); names(FEV1) <- c("SNP","b","se","trait")CAD <- cbind(mrdat[grepl("SNP|CAD",mr_names)],trait="CAD"); names(CAD) <- c("SNP","b","se","trait")mrdat2 <- within(rbind(LIF.R,FEV1,CAD),{y=b})library(ggplot2)p <- ggplot2::ggplot(mrdat2,aes(y = SNP, x = y))+     ggplot2::theme_bw()+     ggplot2::geom_point()+     ggplot2::facet_wrap(~ trait, ncol=3, scales="free_x")+     ggplot2::geom_segment(aes(x = b-1.96*se, xend = b+1.96*se, yend = SNP))+     ggplot2::geom_vline(lty=2, ggplot2::aes(xintercept=0), colour = 'red')+     ggplot2::xlab("Effect size")+     ggplot2::ylab("")p#> Warning: Removed 1 row containing missing values or values outside the scale range (`geom_point()`).#> Warning: Removed 1 row containing missing values or values outside the scale range (`geom_segment()`).
Combined forest plots for LIF.R, FEV1 and CAD

Figure 12.7: Combined forest plots for LIF.R, FEV1 and CAD

12.3 Forest plots

We illustrate withmr_forestplot(),

mr_forestplot(tnfb, colgap.forest.left="0.05cm", fontsize=14, leftlabs=c("Outcome","b","SE"),              common=FALSE, random=FALSE, print.I2=FALSE, print.pval.Q=FALSE, print.tau2=FALSE,              spacing=1.6,digits.TE=2,digits.se=2,xlab="Effect size",type.study="square",col.inside="black",col.square="black")
Forest plots for MR results on TNFB

Figure 12.8: Forest plots for MR results on TNFB

mr_forestplot(tnfb, colgap.forest.left="0.05cm", fontsize=14,              leftcols="studlab", leftlabs="Outcome", plotwidth="3inch", sm="OR", rightlabs="ci",              sortvar=tnfb[["Effect"]],              common=FALSE, random=FALSE, print.I2=FALSE, print.pval.Q=FALSE, print.tau2=FALSE,              backtransf=TRUE, spacing=1.6,type.study="square",col.inside="black",col.square="black")
Forest plots for MR results on TNFB (no summary statistics)

Figure 12.9: Forest plots for MR results on TNFB (no summary statistics)

mr_forestplot(tnfb,colgap.forest.left="0.05cm", fontsize=14,              leftcols=c("studlab"), leftlabs=c("Outcome"),              plotwidth="3inch", sm="OR", sortvar=tnfb[["Effect"]],              rightcols=c("effect","ci","pval"), rightlabs=c("OR","95%CI","P"),              digits=3, digits.pval=2, scientific.pval=TRUE,              common=FALSE, random=FALSE, print.I2=FALSE, print.pval.Q=FALSE, print.tau2=FALSE,              addrow=TRUE, backtransf=TRUE, spacing=1.6,type.study="square",col.inside="black",col.square="black")
Forest plots for MR results on TNFB (with P values)

Figure 12.10: Forest plots for MR results on TNFB (with P values)

13 Miscellaneous functions

13.1 chr_pos_a1_a2 and inv_chr_pos_a1_a2

They are functions to handle SNPid.

require(gap)s <- chr_pos_a1_a2(1,c(123,321),letters[1:2],letters[2:1])s#> [1] "chr1:123_A_B" "chr1:321_A_B"inv_chr_pos_a1_a2(s)#>               chr pos a1 a2#> chr1:123_A_B chr1 123  A  B#> chr1:321_A_B chr1 321  A  Binv_chr_pos_a1_a2("chr1:123-A_B",seps=c(":","-","_"))#>               chr pos a1 a2#> chr1:123-A_B chr1 123  A  B

13.2 ci2ms

Here is the documentation example on rs3784099 and breast cancer23.

example(ci2ms)#> #> ci2ms> # rs3784099 and breast cancer recurrence/mortality#> ci2ms> ms <- ci2ms("1.28-1.72")#> #> ci2ms> print(ms)#> $m#> [1] 0.3945922#> #> $s#> [1] 0.07537491#> #> $direction#> [1] "+"#> #> #> ci2ms> # Vector input#> ci2ms> ci2 <- c("1.28-1.72","1.25-1.64")#> #> ci2ms> ms2 <- ci2ms(ci2)#> #> ci2ms> print(ms2)#> $m#> [1] 0.3945922 0.3589199#> #> $s#> [1] 0.07537491 0.06927492#> #> $direction#> [1] "+" "+"

13.3 gc.lambda

The definition is as follows,

gc.lambda <- function(x, logscale=FALSE, z=FALSE) {  v <- x[!is.na(x)]  n <- length(v)  if (z) {     obs <- v^2     exp <- qchisq(log(1:n/n),1,lower.tail=FALSE,log.p=TRUE)  } else {    if (!logscale)    {      obs <- qchisq(v,1,lower.tail=FALSE)      exp <- qchisq(1:n/n,1,lower.tail=FALSE)    } else {      obs <- qchisq(-log(10)*v,1,lower.tail=FALSE,log.p=TRUE)      exp <- qchisq(log(1:n/n),1,lower.tail=FALSE,log.p=TRUE)    }  }  lambda <- median(obs)/median(exp)  return(lambda)}# A simplified version is as follows,# obs <- median(chisq)# exp <- qchisq(0.5, 1) # 0.4549364# lambda <- obs/exp# see also estlambda from GenABEL and qq.chisq from snpStats# A related functionlambda1000 <- function(lambda, ncases, ncontrols)  1 + (lambda - 1) * (1 / ncases + 1 / ncontrols)/( 1 / 1000 + 1 / 1000)

13.4 invnormal

The function is widely used in various consortium analyses and defined as follows,

invnormal <- function(x) qnorm((rank(x,na.last="keep")-0.5)/sum(!is.na(x)))

An example use on data from Poisson distribution is as follows,

set.seed(12345)Ni <- rpois(50, lambda = 4); table(factor(Ni, 0:max(Ni)))#> #>  0  1  2  3  4  5  6  7  8  9 #>  2  4  6 11  8 10  4  2  2  1y <- invnormal(Ni)sd(y)#> [1] 0.9755074mean(y)#> [1] 0.002650512Ni <- 1:50y <- invnormal(Ni)mean(y)#> [1] -1.810184e-17sd(y)#> [1] 0.9973999

13.5 revStrand

This functions obtains allele(s) on the opposite strand.

alleles <- c("a","c","G","t")revStrand(alleles)#> [1] "t" "g" "C" "a"

13.6 snptest_sample

This is a function to output sample file for SNPTEST.

d <- data.frame(ID_1=1,ID_2=1,missing=0,PC1=1,PC2=2,D1=1,P1=10)snptest_sample(d,C=paste0("PC",1:2),D=paste0("D",1:1),P=paste0("P",1:1))

The commands above generates a file namedsnptest.sample.

14 Known issues

A lot of code optimization as wtih better memory management is desirable.There are apparent issues with the Shiny interfaces for power/sample sizecalculation which produce less outputs for certain configurations than expected.

15 Summary

By now the package should have given you a flavor of the project. It sets tobuild a infrastructure to keep up with the development of R system itself andcollect elements from oning work. Over years it also serves to inspire othersto join force and develop better alternatives.

Bibliography

1.
Guo, S. W. & Lange, K.Genetic mapping of complex traits: Promises, problems, and prospects.Theor Popul Biol57, 1–11 (2000).
2.
Zhao, J. H. & Tan, Q.Integrated analysis of genetic data withR.Hum Genomics2, 258–65 (2006).
3.
Zhao, J. H. & Tan, Q.Genetic dissection of complex traits in silico: Approaches, problems and solutions.Current Bioinformatics1, 359–369 (2006).
5.
Zhao, J. H.Pedigree-drawing withR and graphviz.Bioinformatics22, 1013–4 (2006).
6.
Willer, C. J., Li, Y. & Abecasis, G. R.METAL: Fast and efficient meta-analysis of genomewide association scans.Bioinformatics26, 2190–1 (2010).
7.
Zhao, J. H. & Sham, P. C.Generic number systems and haplotype analysis.Comput Methods Programs Biomed70, 1–9 (2003).
8.
Zhao, J. H., Curtis, D. & Sham, P. C.Model-free analysis and permutation tests for allelic associations.Hum Hered50, 133–9 (2000).
10.
11.
Risch, N. & Merikangas, K.The future of genetic studies of complex human diseases.Science273, 1516–1517 (1996).
12.
Risch, N. & Merikangas, K. Reply toScott el al.Science275, 1329–1330 (1997).
13.
Zhao, J. H.gap: Genetic analysis package.Journal of Statistical Software23, 1–18 (2007).
14.
Scott, W. K., Pericak-Vance, M. A. & Haines, J. L. Genetic analysis of complex diseases.Science275, 1327; author reply 1329–30 (1997).
17.
Kotz, S., Read, C. B., Balakrishnan, N., Vidakovic, B. & Johnson, N. L.Encyclopedia of Statistical Sciences. (John Wiley & Sons, Inc., Hoboken, New Jersey, 2006).
18.
Võsa, U.et al. Large-scale cis- and trans-eQTL analyses identify thousands of genetic loci and polygenic scores that regulate blood gene expression.Nat Genet (2021) doi:10.1038/s41588-021-00913-z.
19.
24.
Williams, C. J., Christian, J. C. & Norton, J. A.TWINAN90: A FORTRAN program for conducting ANOVA-based and likelihood-based analyses of twin data.Computer Methods and Programs in Biomedicine38, 167–176 (1992).

Appendix

After package loading vialibrary(gap), you can uselsf.str("package:gap") anddata(package="gap") to generate a list of functions and a listof datasets, respectvely. If this looks odd to you, you might trysearch() withinR to examine what is available in yourenvironment before issuing thelsf.str command.

#>  [1] ".GlobalEnv"           "package:ggplot2"      "package:lattice"      "package:dplyr"        "package:pedigree"     "package:kinship2"    #>  [7] "package:quadprog"     "package:Matrix"       "package:DiagrammeR"   "package:DOT"          "package:gap"          "package:gap.datasets"#> [13] "package:stats"        "package:graphics"     "package:grDevices"    "package:utils"        "package:datasets"     "package:methods"     #> [19] "Autoloads"            "package:base"#> AE3 : function (model, random, data, seed = 1234, n.sim = 50000, verbose = TRUE)  #> BFDP : function (a, b, pi1, W, logscale = FALSE)  #> ESplot : function (ESdat, alpha = 0.05, fontsize = 12)  #> FPRP : function (a, b, pi0, ORlist, logscale = FALSE)  #> KCC : function (model, GRR, p1, K)  #> LD22 : function (h, n)  #> LDkl : function (n1 = 2, n2 = 2, h, n, optrho = 2, verbose = FALSE)  #> MCMCgrm : function (model, prior, data, GRM, eps = 0, n.thin = 10, n.burnin = 3000, n.iter = 13000, ...)  #> METAL_forestplot : function (tbl, all, rsid, flag = "", package = "meta", method = "REML", split = FALSE, ...)  #> ReadGRM : function (prefix = 51)  #> ReadGRMBin : function (prefix, AllN = FALSE, size = 4)  #> WriteGRM : function (prefix = 51, id, N, GRM)  #> WriteGRMBin : function (prefix, grm, N, id, size = 4)  #> a2g : function (a1, a2)  #> ab : function (type = "power", n = 25000, a = 0.15, sa = 0.01, b = log(1.19), sb = 0.01, alpha = 0.05, fold = 1)  #> allele.recode : function (a1, a2, miss.val = NA)  #> asplot : function (locus, map, genes, flanking = 1000, best.pval = NULL, sf = c(4, 4), logpmax = 10, pch = 21)  #> b2r : function (b, s, rho, n)  #> bt : function (x)  #> ccsize : function (n, q, pD, p1, theta, alpha, beta = 0.2, power = TRUE, verbose = FALSE)  #> chow.test : function (y1, x1, y2, x2, x = NULL)  #> chr_pos_a1_a2 : function (chr, pos, a1, a2, prefix = "chr", seps = c(":", "_", "_"), uppercase = TRUE)  #> ci2ms : function (ci, logscale = TRUE, alpha = 0.05)  #> circos.cis.vs.trans.plot : function (hits, panel, id, radius = 1e+06)  #> circos.cnvplot : function (data)  #> circos.mhtplot : function (data, glist)  #> circos.mhtplot2 : function (dat, labs, species = "hg18", ticks = 0:3 * 10, ymax = 30)  #> cis.vs.trans.classification : function (hits, panel, id, radius = 1e+06)  #> cnvplot : function (data)  #> comp.score : function (ibddata = "ibd_dist.out", phenotype = "pheno.dat", mean = 0, var = 1, h2 = 0.3)  #> cs : function (tbl, b = "Effect", se = "StdErr", log_p = NULL, cutoff = 0.95)  #> fbsize : function (gamma, p, alpha = c(1e-04, 1e-08, 1e-08), beta = 0.2, debug = 0, error = 0)  #> g2a : function (g)  #> gc.em : function (data, locus.label = NA, converge.eps = 1e-06, maxiter = 500, handle.miss = 0, miss.val = 0, control = gc.control())  #> gc.lambda : function (x, logscale = FALSE, z = FALSE)  #> gcontrol : function (data, zeta = 1000, kappa = 4, tau2 = 1, epsilon = 0.01, ngib = 500, burn = 50, idum = 2348)  #> gcontrol2 : function (p, col = palette()[4], lcol = palette()[2], ...)  #> gcp : function (y, cc, g, handle.miss = 1, miss.val = 0, n.sim = 0, locus.label = NULL, quietly = FALSE)  #> genecounting : function (data, weight = NULL, loci = NULL, control = gc.control())  #> geno.recode : function (geno, miss.val = 0)  #> get_b_se : function (f, n, z)  #> get_pve_se : function (n, z, correction = TRUE)  #> get_sdy : function (f, n, b, se, method = "mean", ...)  #> gif : function (data, gifset)  #> grid2d : function (chrlen, plot = TRUE, cex.labels = 0.6, xlab = "QTL position", ylab = "Gene position")  #> h2.jags : function (y, x, G, eps = 1e-04, sigma.p = 0, sigma.r = 1, parms = c("b", "p", "r", "h2"), ...)  #> h2G : function (V, VCOV, verbose = TRUE)  #> h2GE : function (V, VCOV, verbose = TRUE)  #> h2_mzdz : function (mzDat = NULL, dzDat = NULL, rmz = NULL, rdz = NULL, nmz = NULL, ndz = NULL, selV = NULL)  #> h2l : function (K = 0.05, P = 0.5, h2, se, verbose = TRUE)  #> hap : function (id, data, nloci, loci = rep(2, nloci), names = paste("loci", 1:nloci, sep = ""), control = hap.control())  #> hap.control : function (mb = 0, pr = 0, po = 0.001, to = 0.001, th = 1, maxit = 100, n = 0, ss = 0, rs = 0, rp = 0, ro = 0, rv = 0, sd = 0, mm = 0, mi = 0, #>     mc = 50, ds = 0.1, de = 0, q = 0, hapfile = "hap.out", assignfile = "assign.out")  #> hap.em : function (id, data, locus.label = NA, converge.eps = 1e-06, maxiter = 500, miss.val = 0)  #> hap.score : function (y, geno, trait.type = "gaussian", offset = NA, x.adj = NA, skip.haplo = 0.005, locus.label = NA, miss.val = 0, n.sim = 0, method = "gc", #>     id = NA, handle.miss = 0, mloci = NA, sexid = NA)  #> hmht.control : function (data = NULL, colors = NULL, yoffset = 0.25, cex = 1.5, boxed = FALSE)  #> hwe : function (data, data.type = "allele", yates.correct = FALSE, miss.val = 0)  #> hwe.cc : function (model, case, ctrl, k0, initial1, initial2)  #> hwe.hardy : function (a, alleles = 3, seed = 3000, sample = c(1000, 1000, 5000))  #> hwe.jags : function (k, n, delta = rep(1/k, k), lambda = 0, lambdamu = -1, lambdasd = 1, parms = c("p", "f", "q", "theta", "lambda"), ...)  #> inv_chr_pos_a1_a2 : function (chr_pos_a1_a2, prefix = "chr", seps = c(":", "_", "_"))  #> invnormal : function (x)  #> ixy : function (x)  #> kin.morgan : function (ped, verbose = FALSE)  #> klem : function (obs, k = 2, l = 2)  #> labelManhattan : function (chr, pos, name, gwas, gwasChrLab = "chr", gwasPosLab = "pos", gwasPLab = "p", gwasZLab = "NULL", chrmaxpos, textPos = 4, angle = 0, #>     miamiBottom = FALSE)  #> log10p : function (z)  #> log10pvalue : function (p = NULL, base = NULL, exponent = NULL)  #> logp : function (z)  #> makeped : function (pifile = "pedfile.pre", pofile = "pedfile.ped", auto.select = 1, with.loop = 0, loop.file = NA, auto.proband = 1, proband.file = NA)  #> masize : function (model, opts, alpha = 0.025, gamma = 0.2)  #> metap : function (data, N, verbose = "Y", prefixp = "p", prefixn = "n")  #> metareg : function (data, N, verbose = "Y", prefixb = "b", prefixse = "se")  #> mht.control : function (type = "p", usepos = FALSE, logscale = TRUE, base = 10, cutoffs = NULL, colors = NULL, labels = NULL, srt = 45, gap = NULL, cex = 0.4, #>     yline = 3, xline = 3)  #> mhtplot : function (data, control = mht.control(), hcontrol = hmht.control(), ...)  #> mhtplot.trunc : function (x, chr = "CHR", bp = "BP", p = NULL, log10p = NULL, z = NULL, snp = "SNP", col = c("gray10", "gray60"), chrlabs = NULL, suggestiveline = -log10(1e-05), #>     genomewideline = -log10(5e-08), highlight = NULL, annotatelog10P = NULL, annotateTop = FALSE, cex.mtext = 1.5, cex.text = 0.7, mtext.line = 2, #>     y.ax.space = 5, y.brk1 = NULL, y.brk2 = NULL, trunc.yaxis = TRUE, cex.axis = 1.2, delta = 0.05, ...)  #> mhtplot2 : function (data, control = mht.control(), hcontrol = hmht.control(), ...)  #> mia : function (hapfile = "hap.out", assfile = "assign.out", miafile = "mia.out", so = 0, ns = 0, mi = 0, allsnps = 0, sas = 0)  #> miamiplot : function (x, chr = "CHR", bp = "BP", p = "P", pr = "PR", snp = "SNP", col = c("midnightblue", "chartreuse4"), col2 = c("royalblue1", "seagreen1"), #>     ymax = NULL, highlight = NULL, highlight.add = NULL, pch = 19, cex = 0.75, cex.lab = 1, xlab = "Chromosome", ylab = "-log10(P) [y>0]; log10(P) [y<0]", #>     lcols = c("red", "black"), lwds = c(5, 2), ltys = c(1, 2), main = "", ...)  #> miamiplot2 : function (gwas1, gwas2, name1 = "GWAS 1", name2 = "GWAS 2", chr1 = "chr", chr2 = "chr", pos1 = "pos", pos2 = "pos", p1 = "p", p2 = "p", z1 = NULL, #>     z2 = NULL, sug = 1e-05, sig = 5e-08, pcutoff = 0.1, topcols = c("green3", "darkgreen"), botcols = c("royalblue1", "navy"), yAxisInterval = 5)  #> mr : function (data, X, Y, alpha = 0.05, other_plots = FALSE)  #> mr_forestplot : function (dat, sm = "", title = "", ...)  #> mtdt : function (x, n.sim = 0)  #> mtdt2 : function (x, verbose = TRUE, n.sim = NULL, ...)  #> muvar : function (n.loci = 1, y1 = c(0, 1, 1), y12 = c(1, 1, 1, 1, 1, 0, 0, 0, 0), p1 = 0.99, p2 = 0.9)  #> mvmeta : function (b, V)  #> pbsize : function (kp, gamma = 4.5, p = 0.15, alpha = 5e-08, beta = 0.2)  #> pbsize2 : function (N, fc = 0.5, alpha = 0.05, gamma = 4.5, p = 0.15, kp = 0.1, model = "additive")  #> pedtodot : function (pedfile, makeped = FALSE, sink = TRUE, page = "B5", url = "https://jinghuazhao.github.io/", height = 0.5, width = 0.75, rotate = 0, #>     dir = "none")  #> pedtodot_verbatim : function (f, run = FALSE, toDOT = FALSE, ...)  #> pfc : function (famdata, enum = 0)  #> pfc.sim : function (famdata, n.sim = 1e+06, n.loop = 1)  #> pgc : function (data, handle.miss = 1, is.genotype = 0, with.id = 0)  #> pvalue : function (z, decimals = 2)  #> qqfun : function (x, distribution = "norm", ylab = deparse(substitute(x)), xlab = paste(distribution, "quantiles"), main = NULL, las = par("las"), envelope = 0.95, #>     labels = FALSE, col = palette()[4], lcol = palette()[2], xlim = NULL, ylim = NULL, lwd = 1, pch = 1, bg = palette()[4], cex = 0.4, line = c("quartiles", #>         "robust", "none"), ...)  #> qqunif : function (u, type = "unif", logscale = TRUE, base = 10, col = palette()[4], lcol = palette()[2], ci = FALSE, alpha = 0.05, ...)  #> qtl2dplot : function (d, chrlen = gap::hg19, snp_name = "SNP", snp_chr = "Chr", snp_pos = "bp", gene_chr = "p.chr", gene_start = "p.start", gene_end = "p.end", #>     trait = "p.target.short", gene = "p.gene", TSS = FALSE, cis = "cis", value = "log10p", plot = TRUE, cex.labels = 0.6, cex.points = 0.6, xlab = "QTL position", #>     ylab = "Gene position")  #> qtl2dplotly : function (d, chrlen = gap::hg19, qtl.id = "SNPid:", qtl.prefix = "QTL:", qtl.gene = "Gene:", target.type = "Protein", TSS = FALSE, xlab = "QTL position", #>     ylab = "Gene position", ...)  #> qtl3dplotly : function (d, chrlen = gap::hg19, zmax = 300, qtl.id = "SNPid:", qtl.prefix = "QTL:", qtl.gene = "Gene:", target.type = "Protein", TSS = FALSE, #>     xlab = "QTL position", ylab = "Gene position", ...)  #> qtlClassifier : function (geneSNP, SNPPos, genePos, radius)  #> qtlFinder : function (d, Chromosome = "Chromosome", Position = "Position", MarkerName = "MarkerName", Allele1 = "Allele1", Allele2 = "Allele2", EAF = "Freq1", #>     Effect = "Effect", StdErr = "StdErr", log10P = "log10P", N = "N", radius = 1e+06, collapse.hla = TRUE, build = "hg19")  #> read.ms.output : function (msout, is.file = TRUE, xpose = TRUE, verbose = TRUE, outfile = NULL, outfileonly = FALSE)  #> revStrand : function (allele)  #> runshinygap : function (...)  #> s2k : function (y1, y2)  #> sentinels : function (p, pid, st, debug = FALSE, flanking = 1e+06, chr = "Chrom", pos = "End", b = "Effect", se = "StdErr", log_p = NULL, snp = "MarkerName", #>     sep = ",")  #> snptest_sample : function (data, sample_file = "snptest.sample", ID_1 = "ID_1", ID_2 = "ID_2", missing = "missing", C = NULL, D = NULL, P = NULL)  #> tscc : function (model, GRR, p1, n1, n2, M, alpha.genome, pi.samples, pi.markers, K)  #> whscore : function (allele, type)  #> xy : function (x)#>      Package LibPath                                                        Item   Title                             #> [1,] "gap"   "/rds/user/jhz22/hpc-work/work/RtmpcOzqwx/Rinst2fb82136a4edc8" "hg18" "Chromosomal lengths for build 36"#> [2,] "gap"   "/rds/user/jhz22/hpc-work/work/RtmpcOzqwx/Rinst2fb82136a4edc8" "hg19" "Chromosomal lengths for build 37"#> [3,] "gap"   "/rds/user/jhz22/hpc-work/work/RtmpcOzqwx/Rinst2fb82136a4edc8" "hg38" "Chromosomal lengths for build 38"

  1. Notes by Howard Seltman from Carnegie Mellon University: Pittsburgh, PA, USA.↩︎


[8]ページ先頭

©2009-2025 Movatter.jp