# bd.gwas.test: Fast Ball Divergence Test for Multiple Hypothesis Tests

The \(K\)-sample Ball Divergence (KBD) is a nonparametric method to test the differences between \(K\) probability distributions. It is specially designed for metric-valued and imbalanced data, which is consistent with the characteristics of GWAS data. It is computationally intensive for a large GWAS dataset because of the ultra-high dimensionality of the data. Therefore, a fast KBD test for GWAS data, implemented in the function `bd.gwas.test`, is developed and programmed to accelerate the computation.
We use synthetic data to demonstrate the usage of `bd.gwas.test`. In this example, phenotype data are generated from three multivariate normal distributions with the same dimension but heterogeneous means and covariance matrices. The three multivariate normal distributions are: (i) \(N(\mu, \Sigma^{(1)})\), (ii) \(N(\mu + 0.1 \times d, \Sigma^{(2)})\), and (iii) \(N(\mu - 0.1 \times d, \Sigma^{(3)})\). Here, the mean \(\mu\) is set to \(\textbf{0}\) and the covariance matrices follow the auto-regressive structure with some perturbations: \[\Sigma_{ij}^{(1)}=\rho^{|i-j|}, ~~\Sigma_{ij}^{(2)}=(\rho-0.1 \times d)^{|i-j|}, ~~\Sigma_{ij}^{(3)}=(\rho+0.1 \times d)^{|i-j|}.\] The dimension of the phenotype \(k\) is fixed at 100.
```r
library(mvtnorm)
num <- 100       # total sample size
snp_num <- 200   # number of SNPs
k <- 100         # dimension of the phenotype
rho <- 0.5
freq0 <- 0.75    # major allele frequency
d <- 3           # magnitude of the distributional difference

set.seed(2021)

# AR(1) covariance matrix: Sigma[i, j] = rho^|i - j|
ar1 <- function(p, rho = 0.5) {
  Sigma <- matrix(0, p, p)
  for (i in 1:p) {
    for (j in 1:p) {
      Sigma[i, j] <- rho^(abs(i - j))
    }
  }
  return(Sigma)
}

mean0 <- rep(0, k)
mean1 <- rep(0.1 * d, k)
mean2 <- rep(-0.1 * d, k)
cov0 <- ar1(p = k, rho = rho)
cov1 <- ar1(p = k, rho = rho - 0.1 * d)
cov2 <- ar1(p = k, rho = rho + 0.1 * d)

# group sizes follow the genotype proportions p^2 : 2pq : q^2
p1 <- freq0^2
p2 <- 2 * freq0 * (1 - freq0)
n1 <- round(num * p1)
n2 <- round(num * p2)
n3 <- num - n1 - n2

x0 <- rmvnorm(n1, mean = mean0, sigma = cov0)
x1 <- rmvnorm(n2, mean = mean1, sigma = cov1)
x2 <- rmvnorm(n3, mean = mean2, sigma = cov2)
x <- rbind(x0, x1, x2)
head(x[, 1:6])
##             [,1]       [,2]       [,3]       [,4]       [,5]       [,6]
## [1,]  0.07039579  0.6100092  0.5546911  0.5346480  0.5785020 -1.3470882
## [2,] -0.19620313  0.2196482 -0.2968494 -0.7916355 -1.3371275 -0.5766259
## [3,] -0.18575434 -1.4601779 -1.2635876 -0.8933793 -0.9336481 -0.5277624
## [4,]  1.56843201  1.6330457  1.8725853  0.8344465  0.5674250 -0.3928510
## [5,] -0.54939765  0.1602176  0.4325440 -0.1750343  0.5231533 -1.0931093
## [6,] -0.19665796 -1.3788672 -0.6887342 -1.2731113 -0.6638883  0.0764279
```

The number of SNPs is fixed at \(200\) and the sample size is set to \(100\). The sample sizes of the three groups follow the Hardy–Weinberg genotype ratio: \[n_1:n_2:n_3 \approx p^2:2pq:q^2, \quad (p+q=1, ~ n_1+n_2+n_3=100).\] Here, \(p\) is set to \(0.75\), representing a scenario close to real data. \(d\) is a user-specified positive integer indicating the difference between the three probability distributions. Here, we use \(d=3\), aiming to show that the SNP matched with the distributional difference can be identified even when the difference between distributions is small.
```r
# SNP 1 is associated with the group structure; the rest are noise.
effect_snp <- c(rep(0, n1), rep(1, n2), rep(2, n3))
noise_snp <- sapply(2:snp_num, function(j) {
  sample(
    0:2,
    size = num,
    replace = TRUE,
    prob = c(p1, p2, 1 - p1 - p2)
  )
})
snp <- cbind(effect_snp, noise_snp)
head(snp[, 1:6])
##      effect_snp          
## [1,]          0 1 0 0 0 0
## [2,]          0 1 1 1 0 0
## [3,]          0 0 0 1 0 0
## [4,]          0 0 1 0 2 0
## [5,]          0 1 1 0 0 0
## [6,]          0 1 1 1 1 0
```

Given the synthetic datasets `x` and `snp`, multiple KBD tests are conducted by:
```r
library(Ball)
res <- bd.gwas.test(x = x, snp = snp)
## =========== Pre-screening SNPs ===========
## Refining SNP... Progress: 1/1. 
## Refined p-value: 0.0000499975, cost time: 3 (s).
```

We then present the SNPs that are significant:
```r
str(res)
## List of 8
##  $ statistic                 : num [1:200] 4.218 0.876 1.881 1.104 1.143 ...
##  $ permuted.statistic        :'data.frame': 20000 obs. of 1 variable:
##   ..$ g3: num [1:20000] 0.819 1.114 1.112 1.563 0.906 ...
##  $ eigenvalue                : NULL
##  $ p.value                   : num [1:200] 0.00005 0.9112 0.0577 0.60847 0.55267 ...
##  $ refined.snp               : int 1
##  $ refined.p.value           : num 5e-05
##  $ refined.permuted.statistic: num [1:20000, 1] 1.216 0.857 0.988 0.937 1.37 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr "SNP1"
##  $ screening.result          :List of 5
##   ..$ : num [1:200] 4.218 0.876 1.881 1.104 1.143 ...
##   ..$ :'data.frame': 20000 obs. of 1 variable:
##   .. ..$ g3: num [1:20000] 0.819 1.114 1.112 1.563 0.906 ...
##   ..$ : num [1:200] 0.00005 0.9112 0.0577 0.60847 0.55267 ...
##   ..$ : int [1:10000] 0 97 44 35 95 99 39 19 46 65 ...
##   ..$ : int 0
```
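For a quick look at which SNPs pass a significance threshold, one can threshold the per-SNP \(p\)-values stored in `res$p.value`; the Bonferroni-style cutoff below is our illustrative choice, not a package default. The SNPs that survived the pre-screening stage are also recorded directly in `res$refined.snp`:

```r
# Illustrative Bonferroni-style cutoff; the 0.05 level is our choice.
alpha <- 0.05 / snp_num
which(res$p.value < alpha)

# SNPs that entered the refining stage, with their refined p-values:
res$refined.snp
res$refined.p.value
```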
## Why `bd.gwas.test` is faster?

Our faster implementation for multiple testing significantly speeds up the KBD test in two aspects.

First, it uses a two-step algorithm for KBD. The algorithm first computes an empirical \(p\)-value for each SNP using a modest number of permutations, which gives precise enough estimates of the \(p\)-values above a threshold. Then, the SNPs with first-stage \(p\)-values below the threshold are moved to a second stage with a far greater number of permutations.
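To make the two-step idea concrete, here is a minimal sketch of a screen-then-refine permutation test for a single SNP. The helper `two_stage_pvalue`, the statistic `stat_fun`, and the permutation counts `R1` and `R2` are all illustrative assumptions, not the actual internals of `bd.gwas.test`:

```r
# A hypothetical screen-then-refine permutation test (not the package's
# actual internals). stat_fun(x, g) should return a KBD-like statistic
# for phenotypes x grouped by genotype labels g.
two_stage_pvalue <- function(x, g, stat_fun, R1 = 100, R2 = 10000,
                             threshold = 0.05) {
  observed <- stat_fun(x, g)
  null1 <- replicate(R1, stat_fun(x, sample(g)))
  p1 <- (1 + sum(null1 >= observed)) / (1 + R1)
  if (p1 > threshold) {
    return(p1)  # the cheap first-stage estimate is precise enough
  }
  # Promising SNP: refine the p-value with far more permutations.
  null2 <- replicate(R2, stat_fun(x, sample(g)))
  (1 + sum(null2 >= observed)) / (1 + R2)
}
```

Because the overwhelming majority of SNPs are null, almost all of them stop after the cheap first stage, so the expensive refinement is paid only for a handful of candidates.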
Another key technique in `bd.gwas.test` is reusing the empirical distribution of KBD under the null hypothesis. This technique is particularly helpful for decreasing the computational burden when the number of factors \(p\) is very large and \(K\) is a single digit. A typical case is a GWAS study, in which \(p \approx 10^4\) or \(10^5\) but \(K = 3\).
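The intuition is that the permuted statistic depends on the phenotypes and the group sizes, not on which particular SNP induced the grouping, so one set of permuted statistics can be shared across SNPs. Below is a rough sketch under the simplifying assumption that every SNP yields the same three group sizes; `shared_null_pvalues` and `stat_fun` are again hypothetical names:

```r
# Hypothetical sketch: compute the permutation null once and reuse it
# for every SNP. Assumes all SNPs induce K = 3 groups of identical
# sizes, so the shared null is valid for each of them.
shared_null_pvalues <- function(x, snp, stat_fun, R = 1000) {
  g <- snp[, 1]                                      # fixes the group sizes
  null_stat <- replicate(R, stat_fun(x, sample(g)))  # computed only once
  observed <- apply(snp, 2, function(g_j) stat_fun(x, g_j))
  sapply(observed, function(s) (1 + sum(null_stat >= s)) / (1 + R))
}
```

Roughly speaking, this reduces \(p \times R\) statistic evaluations to \(p + R\), which is where most of the savings come from when \(p \approx 10^4\) or \(10^5\).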
According to the simulations:

- the empirical type I errors of KBD are reasonably controlled around \(10^{-5}\);
- the power of KBD increases as either the sample size or the difference between means or covariance matrices increases, and the empirical power is close to \(1\) when the difference between distributions is large enough.
Furthermore, correlated responses may slightly decrease the power of the test compared to the case of independent responses. Moreover, KBD performs better when the data are not extremely imbalanced, and it maintains reasonable power in imbalanced settings.
Compared to other methods, KBD performs better in most scenarios, especially when the simulation setting is close to the real data. Moreover, KBD is more computationally efficient in identifying significant variants.
From Figures 1 and 3, we can see that the power curves are similar once the sample size exceeds 500, provided the minor allele frequency is not small. On the other hand, Figures 2 and 4 show that when the minor allele is rare, a larger sample size leads to higher power. The four figures show how sample size affects the power of the KBD method, indicating an inverse relationship between the minor allele frequency and the sample size required to attain sufficient power.
## Conclusion

We implement `bd.gwas.test` in the Ball package for handling multiple KBD tests. KBD is a powerful method that can detect significant variants with a controllable type I error, regardless of whether the data are balanced or not.
## Reference

Yue Hu, Haizhu Tan, Cai Li, Heping Zhang (2021). Identifying genetic risk variants associated with brain volumetric phenotypes via K-sample Ball Divergence method. Genetic Epidemiology, 1–11. https://doi.org/10.1002/gepi.22423