

Ball Statistics in R


Introduction

The fundamental problems for data mining, statistical analysis, and machine learning are:
- Are several distributions different?
- Are random variables dependent?
- How can useful variables/features be picked out from high-dimensional data?

These issues can be tackled by using the bd.test, bcov.test, and bcorsis functions in the Ball package, respectively. They enjoy the following advantages:
- available for most kinds of datasets (e.g., traditional tabular data, brain shape, functional connectome, wind direction, and so on);
- insensitive to outliers, distribution-free, and model-free;
- theoretically guaranteed and computationally efficient.

Installation

CRAN version

To install the Ball R package from CRAN, just run:

install.packages("Ball")

GitHub version

To install the development version from GitHub, run:

library(devtools)
install_github("Mamba413/Ball/R-package", build_vignettes = TRUE)

Windows users will need to install Rtools first.

Overview: Ball package

The three most important functions in Ball:

| | bd.test | bcov.test | bcorsis |
|---|---|---|---|
| Feature | Hypothesis test | Hypothesis test | Feature screening |
| Type | Test of equal distributions | Test of (joint) independence | SIS and ISIS |
| Optional weight | Yes | Yes | Yes |
| Parallel programming | Yes | Yes | Yes |
| p-value | Yes | Yes | No |
| Limit distribution | Two-sample test only | Independence test only | No |
| Censored data | No | No | Yes |
| Interaction screening | No | No | Yes |
| GWAS optimization | No | No | Yes |

Quick examples

We take the iris dataset as an example to illustrate how to use bd.test and bcov.test to deal with the fundamental problems mentioned above.

bd.test

virginica <- iris[iris$Species == "virginica", "Sepal.Length"]
versicolor <- iris[iris$Species == "versicolor", "Sepal.Length"]
bd.test(virginica, versicolor)

In this example, bd.test tests the hypothesis that the Sepal.Length distributions of versicolor and virginica are equal.

If this hypothesis does not hold, the p-value of bd.test is expected to fall below 0.05.

In this example, the result is:

    2-sample Ball Divergence Test (Permutation)

    data:  virginica and versicolor
    number of observations = 100, group sizes: 50 50
    replicates = 99, weight: constant
    bd.constant = 0.11171, p-value = 0.01
    alternative hypothesis: distributions of samples are distinct

The R output shows that the p-value is under 0.05. Consequently, we can conclude that the Sepal.Length distributions of versicolor and virginica are distinct.
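With 99 permutation replicates, a permutation p-value typically cannot be smaller than 1/(99 + 1) = 0.01, which matches the value printed above. The following is a minimal sketch of how one might request more replicates and read the result programmatically. It assumes that the installed Ball version names the argument `num.permutations` and returns an `htest`-style object with `statistic` and `p.value` components; check `?bd.test` for your version.

```r
library(Ball)

virginica  <- iris[iris$Species == "virginica",  "Sepal.Length"]
versicolor <- iris[iris$Species == "versicolor", "Sepal.Length"]

# More replicates give a finer-grained permutation p-value
# (assumption: the argument is called num.permutations in your Ball version).
res <- bd.test(virginica, versicolor, num.permutations = 999)

# Pull out the pieces instead of reading them off the printout
# (assumption: the return value is an htest-style list).
res$statistic
res$p.value
```

More replicates cost more computation, so the default of 99 is a reasonable starting point for a quick check.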

bcov.test

sepal <- iris[, c("Sepal.Width", "Sepal.Length")]
petal <- iris[, c("Petal.Width", "Petal.Length")]
bcov.test(sepal, petal)

In this example, bcov.test investigates whether the width or length of the petal is associated with the width and length of the sepal. If the dependency really exists, the p-value of bcov.test is expected to fall below 0.05. In this example, the result is:

    Ball Covariance test of independence (Permutation)

    data:  sepal and petal
    number of observations = 150
    replicates = 99, weight: constant
    bcov.constant = 0.0081472, p-value = 0.01
    alternative hypothesis: random variables are dependent

Therefore, we conclude that the width and length of the sepal are associated with those of the petal.
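Because the Ball functions are described as applicable to non-tabular data, it can help to see the same test driven by distance matrices rather than raw columns. The sketch below assumes that `bcov.test` accepts two square distance matrices when `distance = TRUE`. Euclidean distance is used here only for illustration; any metric suited to the data could take its place.

```r
library(Ball)

sepal <- iris[, c("Sepal.Width", "Sepal.Length")]
petal <- iris[, c("Petal.Width", "Petal.Length")]

# Pairwise Euclidean distance matrices (150 x 150).
dx <- as.matrix(dist(sepal))
dy <- as.matrix(dist(petal))

# distance = TRUE asks bcov.test to treat the inputs as distance matrices
# (assumption: this argument behaves as described in ?bcov.test).
res_dist <- bcov.test(dx, dy, distance = TRUE)
res_dist$p.value
```

For objects such as shapes or directions, only the two `dist()` calls would need to be swapped for a distance appropriate to that space.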

bcorsis

We generate a dataset and demonstrate the usage of the bcorsis function as follows.

## simulate an ultra-high-dimensional dataset:
set.seed(1)
n <- 150
p <- 3000
x <- matrix(rnorm(n * p), nrow = n)
error <- rnorm(n)
y <- 3 * x[, 1] + 5 * (x[, 3])^2 + error
## BCor-SIS procedure:
res <- bcorsis(y = y, x = x)
head(res[["ix"]], n = 5)

In this example, the result is:

# [1]    3    1 1601   20  429

The bcorsis result shows that the first and the third variables are the two most important among the 3000 explanatory variables, which is consistent with the simulation settings.
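In practice one often wants a fixed number of screened variables rather than the whole default ranking. The sketch below reuses the simulation above and assumes that the `d` argument of `bcorsis` accepts an integer giving the number of variables to keep (see `?bcorsis`); the names `res_top` and `selected` are only illustrative.

```r
library(Ball)

# Same simulated data as above.
set.seed(1)
n <- 150
p <- 3000
x <- matrix(rnorm(n * p), nrow = n)
error <- rnorm(n)
y <- 3 * x[, 1] + 5 * (x[, 3])^2 + error

# Keep only the 10 top-ranked variables
# (assumption: d may be given as an integer).
res_top <- bcorsis(y = y, x = x, d = 10)
selected <- res_top[["ix"]]

# Check whether the truly active variables (columns 1 and 3) were retained.
c(1, 3) %in% selected
```

The retained columns can then be passed to a downstream model, for example via `x[, selected]`.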

Citation

If you use Ball or reference our vignettes in a presentation or publication, we would appreciate citations of our package.

Zhu J, Pan W, Zheng W, Wang X (2021). “Ball: An R Package for Detecting Distribution Difference and Association in Metric Spaces.” Journal of Statistical Software, 97(6), 1–31. doi: 10.18637/jss.v097.i06.

Here is the corresponding BibTeX entry:

@Article{,
  title = {{Ball}: An {R} Package for Detecting Distribution Difference and Association in Metric Spaces},
  author = {Jin Zhu and Wenliang Pan and Wei Zheng and Xueqin Wang},
  journal = {Journal of Statistical Software},
  year = {2021},
  volume = {97},
  number = {6},
  pages = {1--31},
  doi = {10.18637/jss.v097.i06},
}


Bug report

If you find any bugs or experience any crashes, please report them to us. If you have any questions, just ask; we won’t bite. Open an issue or send an email to Jin Zhu at zhuj37@mail2.sysu.edu.cn.

