Estimate a suite of normalizing transformations, including a new adaptation of a technique based on ranks which can guarantee normally distributed transformed data if there are no ties: ordered quantile normalization (ORQ).
petersonR/bestNormalize
The bestNormalize R package was designed to help find a normalizing transformation for a vector. Many techniques have been developed for this purpose, but each has its own strengths and weaknesses, and it is unclear which will work best until the data are observed. This package evaluates a range of possible transformations and returns the best one, i.e. the one that makes the data look the most normal.
Note that some authors use the term “normalize” differently than in this package. We define “normalize” as: to transform a vector of data in such a way that the transformed values follow a Gaussian distribution (or equivalently, a bell curve). This is in contrast to techniques designed to rescale values to the 0-1 range, or to the -1 to 1 range.
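As a small illustration of this distinction (a base R sketch, not part of the package; the `skewness()` helper is a hypothetical name defined here), rescaling to the 0-1 range only shifts and scales the values, so the shape of the distribution is untouched:

```r
set.seed(100)
x <- rgamma(1000, 1, 1)

# Hypothetical helper: sample skewness (near 0 for symmetric data)
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

# Min-max rescaling maps the values onto [0, 1] ...
x_minmax <- (x - min(x)) / (max(x) - min(x))

# ... but it is an affine map, so the skewness is unchanged
skew_raw <- skewness(x)
skew_minmax <- skewness(x_minmax)
```

A normalizing transformation in the sense used here must instead change the shape of the distribution, which is what the transformations compared by `bestNormalize()` attempt to do.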
This package also introduces a new adaptation of a normalization technique, which we call Ordered Quantile normalization (`orderNorm()`, or ORQ). ORQ transforms the data based on a rank mapping to the normal distribution. This allows us to guarantee normally distributed transformed data (if ties are not present). The adaptation uses a shifted logit approximation on the ranks transformation to perform the transformation on newly observed data outside of the original domain. On new data within the original domain, the transformation uses linear interpolation of the fitted transformation.
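The rank mapping can be sketched in a few lines of base R. This is only an illustration of the idea, not the package's implementation: the function name `orq_sketch()` and the `(rank - 0.5) / n` offset are assumptions made here, and the shifted logit extrapolation for values outside the original domain is omitted.

```r
set.seed(100)
x <- rgamma(1000, 1, 1)

# Map ranks to standard normal quantiles. The 0.5 offset (an assumption
# in this sketch) keeps the probabilities strictly inside (0, 1).
orq_sketch <- function(x) qnorm((rank(x) - 0.5) / length(x))

z <- orq_sketch(x)

# New values inside the observed range: linear interpolation between
# the fitted (x, z) pairs
x_new <- c(0.5, 1, 2)
z_new <- approx(x = x, y = z, xout = x_new)$y
```

Because the transformed values depend only on the ranks, the result has the same set of normal quantiles for any input without ties, which is where the guarantee comes from.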
To evaluate the efficacy of the normalization technique, the `bestNormalize()` function implements repeated cross-validation to estimate the Pearson’s P statistic divided by its degrees of freedom. This is called the “Normality statistic”, and if it is close to 1 (or less), then the transformation can be thought of as working well. The function is designed to select the transformation that produces the lowest P / df value when estimated on out-of-sample data (estimating this on in-sample data will always choose the orderNorm technique, and is generally not the main goal of these procedures).
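To make the statistic concrete, here is a minimal base R sketch of a Pearson chi-square normality statistic divided by its degrees of freedom. The function name `pearson_p_over_df()` is hypothetical, and the binning choices (equiprobable bins, `ceiling(2 * n^(2/5))` classes, two degrees of freedom removed for the estimated mean and standard deviation) are assumptions of this sketch; the package's exact computation and its cross-validation loop are not reproduced here.

```r
# Hypothetical helper: Pearson P / df for a numeric vector
pearson_p_over_df <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  k <- ceiling(2 * n^(2 / 5))               # assumed number of classes

  # Assign each value to one of k bins that are equiprobable under a
  # normal distribution with the sample mean and standard deviation
  bin <- floor(1 + k * pnorm(x, mean(x), sd(x)))
  counts <- tabulate(bin, nbins = k)

  expected <- n / k
  P <- sum((counts - expected)^2 / expected)
  df <- k - 3                               # k - 1, minus 2 estimated parameters
  P / df
}

set.seed(100)
stat_normal <- pearson_p_over_df(rnorm(1000))        # already Gaussian
stat_gamma  <- pearson_p_over_df(rgamma(1000, 1, 1)) # strongly skewed
```

Under this sketch, values near 1 indicate bin counts consistent with normality, while skewed data such as the gamma draws produce a much larger ratio.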
You can install the most recent (devel) version of bestNormalize from GitHub with:
```r
# install.packages("devtools")
devtools::install_github("petersonR/bestNormalize")
```
Or, you can install the released version from CRAN with:
```r
install.packages("bestNormalize")
```
In this example, we generate 1000 draws from a gamma distribution and normalize them:
```r
library(bestNormalize)
set.seed(100)
x <- rgamma(1000, 1, 1)

# Estimate best transformation with repeated cross-validation
BN_obj <- bestNormalize(x, allow_lambert_s = TRUE)
#> Warning: package 'lamW' was built under R version 4.0.5
BN_obj
#> Best Normalizing transformation with 1000 Observations
#>  Estimated Normality Statistics (Pearson P / df, lower => more normal):
#>  - arcsinh(x): 3.6204
#>  - Box-Cox: 0.96
#>  - Center+scale: 6.7851
#>  - Exp(x): 50.8513
#>  - Lambert's W (type s): 1.0572
#>  - Log_b(x+a): 1.908
#>  - orderNorm (ORQ): 1.0516
#>  - sqrt(x + a): 1.4556
#>  - Yeo-Johnson: 1.7385
#> Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
#>
#> Based off these, bestNormalize chose:
#> Standardized Box Cox Transformation with 1000 nonmissing obs.:
#>  Estimated statistics:
#>  - lambda = 0.2739638
#>  - mean (before standardization) = -0.3870903
#>  - sd (before standardization) = 1.045498

# Perform transformation
gx <- predict(BN_obj)

# Perform reverse transformation
x2 <- predict(BN_obj, newdata = gx, inverse = TRUE)

# Prove the transformation is 1:1
all.equal(x2, x)
#> [1] TRUE
```
As of version 1.3, the package supports leave-one-out cross-validation as well. ORQ normalization works very well when the size of the test dataset is low relative to the training data set, so it will often be selected via leave-one-out cross-validation (which is why we set `allow_orderNorm = FALSE` here).
```r
(BN_loo <- bestNormalize(x, allow_orderNorm = FALSE, allow_lambert_s = TRUE, loo = TRUE))
#> Best Normalizing transformation with 1000 Observations
#>  Estimated Normality Statistics (Pearson P / df, lower => more normal):
#>  - arcsinh(x): 14.0712
#>  - Box-Cox: 0.8077
#>  - Center+scale: 26.5181
#>  - Exp(x): 451.435
#>  - Lambert's W (type s): 1.269
#>  - Log_b(x+a): 4.5374
#>  - sqrt(x + a): 3.3655
#>  - Yeo-Johnson: 5.7997
#> Estimation method: Out-of-sample via leave-one-out CV
#>
#> Based off these, bestNormalize chose:
#> Standardized Box Cox Transformation with 1000 nonmissing obs.:
#>  Estimated statistics:
#>  - lambda = 0.2739638
#>  - mean (before standardization) = -0.3870903
#>  - sd (before standardization) = 1.045498
```
It is also possible to visualize these transformations:
```r
plot(BN_obj, leg_loc = "bottomright")
```
For a more in-depth tutorial, please consult the package vignette or the package website.