Ball: Statistical Inference and Sure Independence Screening via Ball Statistics

Jin Zhu, zhuj37@mail2.sysu.edu.cn

December 18, 2017

Quick Start

Two fundamental problems in data mining and statistical analysis are:

  1. Are the distributions of two samples distinct?

  2. Are two random variables dependent?

The two-sample test, which is designed to solve the first problem, is very important in medicine, psychology, biology, and so on. For instance, suppose we want to know whether the lifespans of males and females differ. We collect lifetime data and try to figure out whether the ages in the two samples are identically distributed. As the following images show, if the distributions of life span in the two groups look like the left one, we conclude that the lifetimes are not identically distributed; but the right one indicates that they are most likely identically distributed.

Figure 1

The test of independence, which is designed to solve the second problem, is also essential. As the following images show, there is a strong linear relation between Y and X1, while X2 seems to have nothing to do with Y. So X1 should be taken into account and added into the regression model for Y, or should be studied carefully in order to confirm the correlation mechanism with Y.

Figure 2

The Ball package provides solutions for the test of independence, the two-sample test, and even the K-sample test. Moreover, a generic non-parametric sure independence screening procedure is also implemented to deal with ultra high-dimensional data.

The three core functions are:

  1. bd.test: two-sample and K-sample tests based on ball divergence;

  2. bcov.test: tests of independence (including mutual independence) based on ball covariance;

  3. bcorsis: sure independence screening based on ball correlation.

Installation

CRAN version

To install the Ball R package from CRAN, just run:

install.packages("Ball")

GitHub version

To install the development version from GitHub, run:

library(devtools)
install_github("Mamba413/Ball", build_vignettes = TRUE)

Windows users will need to install Rtools first.

Quick Start: Univariate Two-sample Test

In this example, we generate two normal random samples with different location parameters: \[X \sim N(0, 1), \quad Y \sim N(1, 1)\]

x <- rnorm(50)
y <- rnorm(50, mean = 1)
# plot(density(x), xlim = c(-5, 5))
# lines(density(y), col = 'red')

We use bd.test to perform the two-sample test to determine whether the two samples come from the same distribution.

bd.test(x = x,y = y)
#
#   2-sample Ball Divergence Test (Permutation)
#
# data:  x and y
# number of observations = 100, group sizes: 50 50
# replicates = 99, weight: constant
# bd.constant = 0.092215, p-value = 0.01
# alternative hypothesis: distributions of samples are distinct

The result of bd.test is that p-value < 0.05, so we reject the null hypothesis and conclude that the two samples come from different distributions. Consequently, the hypothesis test result is concordant with the data generation mechanism.
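If the result is needed programmatically rather than printed, the returned object can be stored and inspected. A minimal sketch, assuming bd.test returns a standard htest-style list with statistic and p.value components (as the printed output suggests):

res <- bd.test(x = x, y = y)
res$statistic  # the ball divergence statistic
res$p.value    # the permutation p-value
# a simple decision rule at the 5% level:
if (res$p.value < 0.05) {
  message("Reject H0: the two samples differ in distribution.")
}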

Quick Start: Multivariate Two-sample Test

In this example, we will demonstrate how to test whether two multivariate distributions are identical. We generate two random samples of size 50 from two different multivariate normal distributions: \[X \sim N(\mu_{X}, I_{2 \times 2}), \quad Y \sim N(\mu_{Y}, I_{2 \times 2})\] \[\mu_{X} = (0, 0), \quad \mu_{Y} = (3, 3)\]

x <- matrix(rnorm(100), nrow = 50, ncol = 2)
y <- matrix(rnorm(100, mean = 3), nrow = 50, ncol = 2)

We use bd.test to test whether the two multivariate random samples are identically distributed.

bd.test(x = x,y = y)
#
#   2-sample Ball Divergence Test (Permutation)
#
# data:  x and y
# number of observations = 100, group sizes: 50 50
# replicates = 99, weight: constant
# bd.constant = 0.63695, p-value = 0.01
# alternative hypothesis: distributions of samples are distinct

The result of bd.test is that p-value < 0.05, so we conclude that the two samples are not identically distributed.

Quick Start: Univariate Test of Independence

In this example, we will use the “W-shape” data from Wikipedia to demonstrate how to perform a univariate test of independence with bcov.test.

We generate a dataset containing 50 samples.

# generate random perturbation:
noise <- runif(50, min = -0.3, max = 0.3)
x <- runif(50, 0, 4 * pi)
y <- cos(x) + noise
# plot(x, y)

Obviously, \(X\) is related to \(Y\), but the relationship is non-linear. We use bcov.test to perform the test of independence between \(X\) and \(Y\).

bcov.test(x = x,y = y)
#
#   Ball Covariance test of independence (Permutation)
#
# data:  x and y
# number of observations = 50
# replicates = 99, weight: constant
# bcov.constant = 0.001948, p-value = 0.01
# alternative hypothesis: random variables are dependent

The result of bcov.test is that p-value < 0.05, so we conclude that \(X\) and \(Y\) are not independent, which means there is some kind of association between X and Y.
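For contrast, a classical Pearson correlation test will typically miss this kind of non-monotone dependence, since it only measures linear association. A quick check using base R's cor.test (nothing from the Ball package is involved here):

# Pearson correlation is close to zero for the cosine-shaped relationship,
# so the test is usually insignificant even though X and Y are dependent
cor.test(x, y)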

Quick Start: Multivariate Test of Independence

For the multivariate test of independence, we will demonstrate the usage of bcov.test with the following example: \(X = (x_{1}, x_{2})\) is drawn uniformly from the square \([-\pi, \pi]^{2}\). The relation between \(Y\) and \(X\) is:
\[Y = 2\sin(x_{1} + x_{2}) + \epsilon, \quad \epsilon \sim U(-0.1, 0.1)\]

x <- matrix(runif(50 * 2, -pi, pi), nrow = 50, ncol = 2)
noise <- runif(50, min = -0.1, max = 0.1)
y <- 2 * sin(x[, 1] + x[, 2]) + noise

We use bcov.test to perform the multivariate test of independence:

bcov.test(x = x,y = y,weight ="prob")
#
#   Ball Covariance test of independence (Permutation)
#
# data:  x and y
# number of observations = 50
# replicates = 99, weight: probability
# bcov.probability = 0.038211, p-value = 0.01
# alternative hypothesis: random variables are dependent

The result of bcov.test is that p-value < 0.05, so we conclude that the multivariate random variables \(X\) and \(Y\) are associated.

Advanced Features

The features below have been implemented to help you analyse diverse and complicated real data.

Non-Hilbert Space Data

In scientific research, we often have to deal with non-Hilbert space data. Traditional statistical inference methods usually depend on assumptions that prevent them from being applied to this kind of data directly. Ball divergence does not depend on those assumptions, so it is able to perform the two-sample test for data from non-Hilbert spaces. We will demonstrate how to use the Ball package to perform statistical inference for non-Hilbert space data with three examples; a sketch of the shared calling pattern comes first.
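In all three examples the workflow is the same: compute a pairwise distance matrix with whatever metric suits the data, then hand that matrix to bd.test or bcov.test with distance = TRUE. A minimal sketch of the pattern, using Euclidean distance from base R's dist as a stand-in for a problem-specific metric:

# stack the two groups row-wise and compute any pairwise distance matrix;
# Euclidean distance is only a placeholder for a problem-specific metric
pooled <- rbind(matrix(rnorm(100), 50, 2), matrix(rnorm(100, mean = 1), 50, 2))
dmat <- as.matrix(dist(pooled))
# declare the input a distance matrix and give the group sizes
bd.test(x = dmat, size = c(50, 50), num.permutations = 99, distance = TRUE)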

Example 1: Simulated von Mises-Fisher Distribution Data

# load data:
data("bdvmf")

The distribution of the data is shown in the following image:

In the image, the black dots (\(X\)) and red dots (\(Y\)) represent two groups of simulated data with different distributions, denoted by: \[X \sim M(\mu_{X}, \kappa), \quad Y \sim M(\mu_{Y}, \kappa)\] where \(M\) denotes the von Mises-Fisher distribution, \(\mu_{X} = (1, 0, 0)\) and \(\mu_{Y} = (1, 1, 1)\) are the orientation parameters, and \(\kappa = 3\) is the concentration parameter.

We can tell from the image that the red dots and black dots are not identically distributed. However, distinguishing the distributions is a tough task for traditional statistical methods because the data do not live in a conventional Hilbert space. Fortunately, since computing the sample version of ball divergence (or ball covariance) only involves calculating a distance matrix and counting the number of samples located in a ball, we can obtain the empirical ball divergence as long as we can define a distance metric between observations. Therefore, ball divergence still works for this example.

We apply ball divergence to these data by carrying out the following steps. First, we calculate the geodesic distance matrix of the data, which is implemented in the nhdist function. Then, we pass the distance matrix to the argument x and set size = c(150, 150), num.permutations = 99, and distance = TRUE. The detailed solution is demonstrated below:

# calculate geodesic distance between samples:
dx <- nhdist(bdvmf[["x"]], method = "geodesic")
# sample sizes in each group: 150, 150
# two-sample test based on BD:
bd.test(x = dx, size = c(150, 150), num.permutations = 99, distance = TRUE)
#
#   2-sample Ball Divergence Test (Permutation)
#
# data:  dx
# number of observations = 300, group sizes: 150 150
# replicates = 99, weight: constant
# bd.constant = 0.14483, p-value = 0.01
# alternative hypothesis: distributions of samples are distinct

In this example, we first calculate the geodesic distance matrix using the nhdist function in the Ball package. Then, we pass dx to the argument x and set distance = TRUE to indicate that x is a distance matrix. Meanwhile, we set the group sizes size = c(150, 150) and the number of permutation replications num.permutations = 99. The result is that p-value < 0.05, which means that the red dots and black dots are not identically distributed.

Example 2: Macaques Data

Based on the Macaques data provided by Dryden, scientists want to figure out whether the shape of the skull differs between macaques of different sexes. In a similar way, we can calculate the distance matrix of the data and transform this problem into a two-sample test that can be solved by BD. The Riemannian shape distance is commonly used to describe the distance between shape data. By setting method = "riemann" in the nhdist function, we are able to calculate the Riemannian shape distance between shape data. The detailed procedure is demonstrated below:

# load data:
data("macaques")
# number of female and male Macaca fascicularis:
# table(macaques[["group"]])  # f: 9; m: 9
# calculate Riemannian shape distance matrix:
dx <- nhdist(macaques[["x"]], method = "riemann")
# hypothesis test with BD:
bd.test(x = dx, num.permutations = 99, size = c(9, 9), distance = TRUE)
#
#   2-sample Ball Divergence Test (Permutation)
#
# data:  dx
# number of observations = 18, group sizes: 9 9
# replicates = 99, weight: constant
# bd.constant = 0.1922, p-value = 0.03
# alternative hypothesis: distributions of samples are distinct

The p-value is below 0.05, which means the skull shape differs between male and female macaques.

Example 3: ArcticLake Data

bcov.test relies on the distances between samples of two random variables. Therefore, we can examine the independence assumption by applying bcov.test to non-Hilbert space real data, as long as we can obtain the distance matrices of the samples.

We take a dataset from the book The Statistical Analysis of Compositional Data as an example to demonstrate how to use bcov.test to assess the dependence of non-Hilbert space data. Scientists collected sand, silt, and clay compositions of 39 sediment samples at different water depths in an Arctic lake. They want to figure out whether the compositions of sediment samples at different water depths are identical or not. To achieve this goal, we use bcov.test to perform the test of independence. The detailed procedure is demonstrated below:

data("ArcticLake")# Distance matrix between y:dy<-nhdist(ArcticLake[["x"]],method ="compositional")# Distance matrix between x:dx<-dist(ArcticLake[["depth"]])# hypothesis test with BCov:bcov.test(x = dx,y = dy,num.permutations =99,distance =TRUE)
#
#   Ball Covariance test of independence (Permutation)
#
# data:  dx and dy
# number of observations = 39
# replicates = 99, weight: constant
# bcov.constant = 0.0083848, p-value = 0.01
# alternative hypothesis: random variables are dependent

We first calculate the distance matrices dy and dx. Then, we pass dx to the argument x and dy to the argument y, set the number of permutation replications num.permutations = 99, and set distance = TRUE to indicate that x and y are distance matrices.
The result shows that the p-value is less than 0.05, the usual significance level, so we conclude that the composition of the sediment is associated with the water depth.

In the example above, we used the square-root-transformed data to calculate the geodesic distance as a measure of the difference between sediment compositions (dy). Meanwhile, we used the Euclidean distance to measure the difference between water depths (dx). For different data, we can use different measurements to cope with the different features of the data, as the reference sketch below shows.
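As a quick reference, the three metrics used in this vignette are all selected through the method argument of nhdist (only these three appear in the examples here; the package documentation may list more):

# geodesic distance for directional (sphere-valued) data:
dx_sphere <- nhdist(bdvmf[["x"]], method = "geodesic")
# Riemannian shape distance for landmark (shape) data:
dx_shape <- nhdist(macaques[["x"]], method = "riemann")
# distance between compositional data (square-root transformed):
dx_comp <- nhdist(ArcticLake[["x"]], method = "compositional")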

K-Sample Test

bd.test is also applicable to testing multiple samples. We generate three random samples of size 50, all drawn from the same normal distribution. As an example, we use bd.test to test whether these samples are identically distributed.

n <- 150
bd.test(rnorm(n), size = rep(50, 3))
#
#   3-sample Ball Divergence Test (Permutation)
#
# data:  rnorm(n)
# number of observations = 150, group sizes: 50 50 50
# replicates = 99, weight: constant, kbd.type: sum
# kbd.sum.constant = 0.032714, p-value = 0.59
# alternative hypothesis: distributions of samples are distinct

As the result shows, p-value > 0.05, which means we cannot reject the null hypothesis.
We can also utilize bd.test to deal with the \(K\)-sample problem in non-Hilbert spaces following the aforementioned procedure: remember to assign the group-size vector to the size argument and set distance = TRUE, as sketched below.
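A minimal sketch of that recipe, again with Euclidean distance as a placeholder for a problem-specific metric, for three groups of size 50:

# pooled sample of three groups, stacked row-wise
pooled <- matrix(rnorm(150 * 2), nrow = 150, ncol = 2)
dmat <- as.matrix(dist(pooled))  # placeholder metric
# K-sample test on a distance matrix: pass the group sizes and distance = TRUE
bd.test(x = dmat, size = rep(50, 3), num.permutations = 99, distance = TRUE)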

Weighted Ball Covariance Test

Pan et al. (2017) show that the weighted ball covariance based independence test is statistically consistent against all dependence alternatives without any moment conditions, and is sometimes superior to the standard version of ball covariance.

The weighted ball covariance test has been implemented in the Ball package, and we can employ it in data analysis by setting the weight argument of bcov.test. Take the ArcticLake data as an example:

data("ArcticLake")Dy<-nhdist(ArcticLake[["x"]],method ="compositional")Dx<-dist(ArcticLake[["depth"]])# hypothesis test with weighted BCov:bcov.test(x = Dx,y = Dy,num.permutations =99,distance =TRUE,weight ="constant")
#
#   Ball Covariance test of independence (Permutation)
#
# data:  Dx and Dy
# number of observations = 39
# replicates = 99, weight: constant
# bcov.constant = 0.0083848, p-value = 0.01
# alternative hypothesis: random variables are dependent

Ball Covariance Mutual Independence Test

Apart from the relationship between two random variables, another important dependence concept for a set of variables is mutual (or joint) independence, which says that any two disjoint subsets of the variables are independent of each other. For instance, we may want to investigate whether air temperature, soil temperature, humidity, wind, and evaporation are correlated.

It is natural to extend ball covariance to measure mutual independence between \(K\) random variables. More importantly, the mutual independence test based on ball covariance has been implemented in the Ball package. We give two simple examples in the following to demonstrate its usage.

In the first example, \(X, \epsilon_{1}, \epsilon_{2}\) are independent draws from the standard normal distribution \(N(0, 1)\), and \[Y = \max(X, 0) + \epsilon_{1}, \quad Z = \min(X, 0) + \epsilon_{2}\]

x <- rnorm(50)
y <- (x > 0) * x + rnorm(50)
z <- (x <= 0) * x + rnorm(50)
example1 <- list(x, y, z)

In the second example, \(W, X, Y, Z\) are connected by a latent random variable \(H \sim N(0, 1)\), and \[W = H^{2}, \quad X = |H|, \quad Y = \min(H, 0)\] \[Z = (Z_{1}, Z_{2}), \quad Z_{1} = I(H < 0.5)H, \quad Z_{2} = I(H > -0.5)H\]

h <- rnorm(50)
w <- h^2
x <- abs(h)
y <- h * (h < 0)
z1 <- h * (h < 0.5)
z2 <- h * (h > -0.5)
z <- cbind(z1, z2)
example2 <- list(w, x, y, z)

We bind these data into the lists example1 and example2 and pass them to the argument x of bcov.test to carry out the ball covariance mutual independence test.

bcov.test(x = example1, num.permutations = 199)
#
#   Ball Covariance test of mutual independence (Permutation)
#
# data:  example1
# number of observations = 50
# replicates = 199, weight: constant
# bcov.constant = 0.0016909, p-value = 0.01
# alternative hypothesis: random variables are dependent
bcov.test(x = example2, num.permutations = 199)
#
#   Ball Covariance test of mutual independence (Permutation)
#
# data:  example2
# number of observations = 50
# replicates = 199, weight: constant
# bcov.constant = 0.038425, p-value = 0.005
# alternative hypothesis: random variables are dependent

The hypothesis test results for the two examples show that p-value < 0.05, coinciding with the simulation settings.

Ball Correlation Based Sure Independence Screening

Recent technological advances have made it possible to collect ultra high-dimensional data. A common feature of these data is that the number of variables \(p\) is generally much larger than the sample size \(n\). For instance, the number of gene expression profiles is on the order of tens of thousands, while the number of patient samples is on the order of tens or hundreds. However, traditional variable selection algorithms such as LASSO and SCAD may not perform well on such data, due to statistical inaccuracy and algorithmic instability.

A new framework, sure independence screening (SIS), was proposed to tackle the challenges above. SIS filters out the features that have little marginal correlation with the response, effectively reducing the dimensionality \(p\) to a moderate scale so that performing a refined statistical algorithm becomes feasible.
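Conceptually, marginal screening scores each predictor separately and keeps the top-ranked ones. A minimal sketch of that idea (not the bcorsis implementation), assuming the package's bcor function returns the sample ball correlation between two vectors, and using a customary SIS cutoff d = n / log(n):

# score each column of x by its marginal ball correlation with y,
# then keep the indices of the d top-ranked predictors
marginal_screen <- function(x, y, d = floor(nrow(x) / log(nrow(x)))) {
  scores <- apply(x, 2, function(xj) bcor(xj, y))
  order(scores, decreasing = TRUE)[seq_len(d)]
}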

BCor-SIS, a generic non-parametric sure independence screening procedure based on ball correlation, is able to pick out the explanatory variables related to the response. Linear, non-linear, and interaction effects can be captured by BCor-SIS even when the data are heavy-tailed or contain outliers. More importantly, BCor-SIS retains all of the important features in the model with probability tending to 1 under mild conditions.

BCor-SIS: Quick Start Example

In this example, we will utilize the bcorsis function to carry out the BCor-SIS procedure. We generate 150 high-dimensional instances with 3000 independent standard Gaussian explanatory variables \(X\) and a univariate response variable \(Y\). The relation between \(Y\) and \(X\) is:
\[Y = 3 X_{1} + 5 X_{3}^{2} + \epsilon, \quad \epsilon \sim N(0, 1)\]

set.seed(1)
n <- 150
p <- 3000
x <- matrix(rnorm(n * p), nrow = n)
noise <- rnorm(n)
y <- 3 * x[, 1] + 5 * (x[, 3])^2 + noise

We perform the BCor-SIS procedure and display the indices of the top 5 variables selected by BCor-SIS.

res <- bcorsis(y = y, x = x)
head(res[[1]], n = 5)
# [1]    3    1 1601   20  429

The bcorsis result shows that the first and third variables are the two most important among the 3000 explanatory variables, which is consistent with the simulation settings.
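After screening, one would typically fit a low-dimensional model on the retained variables. A minimal, purely illustrative follow-up using ordinary least squares on the screened columns (not part of the bcorsis output):

# keep the top 5 screened variables and refit a simple linear model
keep <- head(res[[1]], n = 5)
fit <- lm(y ~ x[, keep])
summary(fit)
# X1 should show a strong linear effect; the quadratic effect of X3
# would need a non-linear term (e.g. x[, 3]^2) to be captured properly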

Extension of BCor-SIS: A Censored Survival Data

Survival analysis is a commonly used method for the analysis of censored data, such as biological death and mechanical failure times, which are usually subject to censoring. The main goal of survival analysis is to study the dependence of the survival time \(T\) on the covariates \(X \in R^{p}\).

With the remarkable development of modern technology, a huge amount of covariate information, such as microarray and SNP data, is collected. Consequently, an SIS procedure designed for censored survival data is needed. Pan et al. (2017) proposed an extended BCor-SIS procedure that is able to select the significant variables for censored data.

The BCor-SIS procedure for survival data is implemented in the Ball package; we use a public lung cancer genomic dataset from the Chemores Cohort Study to demonstrate its usage. The outcome is the disease-free survival time: patients were followed until the first relapse or administrative censoring. This genomic dataset contains the expression levels of mRNA and miRNA, as well as clinical variables, from 123 samples. Moreover, the dataset includes 944 biological covariates and 1056 artificial standard Gaussian variables that are independent of the response. We employ the extension of BCor-SIS on these data to hunt for influential covariates, and demonstrate the detailed procedure in the following.

# load the lung cancer genomic data shipped with the Ball package:
data("genlung")
result <- bcorsis(x = genlung[["covariate"]], y = genlung[["survival"]],
                  d = "small", method = "survival")
top_gene <- colnames(genlung[["covariate"]])[result[["ix"]]]
head(top_gene, n = 1)
# [1] "hsa.miR.564"

We first pass the covariates and the censored outcome to the arguments x and y, and set method = "survival" to indicate that y should be treated as survival status containing the event time and censoring status. BCor-SIS asserts that hsa.miR.564, corresponding to the gene MIR564, is strongly relevant to disease-free survival. This conclusion is highly consistent with statements in other published literature.

