Movatterモバイル変換


[0]ホーム

URL:


bigPCAcpp

Principalcomponent analysis for bigmemory matrices

Frédéric Bertrand

R-CMD-checkR-hub

ThebigPCAcpp package provides high performanceprincipal component analysis (PCA) routines specialised forbigmemory::big.matrix objects. It keeps data in bigmemoryallocations from ingestion through eigendecomposition so that very largematrices can be analysed without copying them into base R matrices. Inaddition to the PCA core, the package offers streaming helpers thatwrite scores, loadings, correlations, and contributions back intofile-backedbig.matrix targets for integration withdownstream pipelines.

Beyond classical PCA, the package ships with scalable SVD tools thatcan process file-backed matrices block by block, and it includes robustPCA and robust SVD routines that temper the influence of outliers whileremaining compatible with bigmemory workflows. For exploratory work onlarge batches, a scalable PCA interface lets users extract leadingcomponents without reading the full matrix into memory.

These workflows make it possible to analyse data sets that exceed theavailable RAM while keeping numerical stability through double-precisionaccumulation and LAPACK eigen decompositions. Current featuresinclude

Installation

You can install the development version ofbigPCAcppfrom GitHub with:

# install.packages("devtools")devtools::install_github("fbertran/bigPCAcpp")

If you prefer a local source install, clone the repository andrun:

R CMD build bigPCAcppR CMD INSTALL bigPCAcpp_0.9.0.tar.gz

Options

The package defines several options to control numerical tolerancesand workspace allocation. They are prefixed withbigPCAcpp.and include:

OptionDefault valueDescription
bigPCAcpp.block_size1000LNumber of rows processed in each block when streaming scores throughBLAS.
bigPCAcpp.center_scale_epsilon1e-8Lower bound applied when rescaling columns to avoid divisioninstabilities.
bigPCAcpp.progressFALSEEmit progress updates when computing PCA on long-running jobs.

All options can be changed withoptions() at runtime.For example,options(bigPCAcpp.block_size = 5000L)increases the streaming block size.

Examples

The examples below demonstrate the bigmemory workflow and compare theresults with base R’sprcomp() implementation.

library(bigmemory)library(bigPCAcpp)# Allocate a 1,000 x 25 big.matrix with simulated valuesn<-1000p<-25bm<- bigmemory::big.matrix(n, p,type ="double")bm[,]<-matrix(rnorm(n* p),nrow = n)# Run PCA and extract eigenvalues and rotationres<-pca_bigmatrix(bm,center =TRUE,scale =TRUE)res$eigenvalues#>  [1] 1.2772679 1.2549573 1.2261127 1.2200832 1.2029447 1.1372111 1.1116603 1.0863140 1.0612750#> [10] 1.0430975 1.0251884 1.0036304 0.9922516 0.9661366 0.9511738 0.9342366 0.9118102 0.8894958#> [19] 0.8861798 0.8662711 0.8326502 0.8234052 0.7850452 0.7762024 0.7353990res$importance#> NULLres$rotation[1:5,1:3]#>             [,1]        [,2]      [,3]#> [1,] -0.13665626 -0.19398781 0.3217218#> [2,] -0.07597561  0.09425838 0.1678119#> [3,]  0.08992670  0.00729943 0.2609075#> [4,]  0.10200029 -0.28583284 0.2290518#> [5,]  0.19534252  0.32324433 0.1690638# Generate PCA scores in bigmemory storagescores<- bigmemory::big.matrix(nrow = n,ncol =3,type ="double")(pca_scores_bigmatrix(  bm,  res$rotation,center = res$center,scale = res$scale))[1:6,1:6]#>             [,1]        [,2]        [,3]       [,4]       [,5]       [,6]#> [1,]  2.81870347 -0.06337779  1.99072631 -0.5920623 -1.6703024  1.6483439#> [2,] -0.35747573 -0.80297261 -1.07285346  0.5123663 -1.4595653  0.5980145#> [3,] -0.78310002  0.24236085  0.46701646 -0.2727803  0.4929943  2.2777379#> [4,]  1.45650763 -0.74008842 -2.57891649  0.1402697  1.5748613  1.1219994#> [5,] -1.56142789 -0.68169732 -0.01681349  0.1119421 -0.9571047 -1.0961306#> [6,]  0.05141656 -0.91365588  0.30322391 -1.4171899 -0.2089137 -2.3574471# Compare sum of absolute values with prcomp()pr<-prcomp(bm[],center =TRUE,scale =TRUE)sum(abs(abs(pr$rotation[,1:3])-abs(res$rotation[,1:3])))<10^(-6)#> [1] TRUE

pca_bigmatrix() can also focus on a subset of leadingcomponents while streaming the results into file-backed matrices. Thefollowing snippet stores the first four principal components and keeps arunning summary of their scores.

library(bigmemory)library(bigPCAcpp)set.seed(2025)bm<- bigmemory::big.matrix(nrow =1500,ncol =40,type ="double")bm[,]<-matrix(rnorm(1500*40),nrow =1500)# Request only the first four componentstop_pca<-pca_bigmatrix(bm,center =TRUE,scale =TRUE,ncomp =4)top_pca$sdev#> [1] 1.141546 1.124998 1.119607 1.109924# Stream the corresponding scores into a file-backed allocationpath<-tempfile(fileext =".bin")desc<-paste0(path,".desc")scores_fb<- bigmemory::filebacked.big.matrix(nrow =nrow(bm),ncol =4,type ="double",backingfile =basename(path),backingpath =dirname(path),descriptorfile =basename(desc))pca_scores_stream_bigmatrix(  bm,  scores_fb,  top_pca$rotation[,1:4],center = top_pca$center,scale = top_pca$scale)#> <pointer: 0x10f559be0># Inspect a lightweight summary without loading the entire matrixcolMeans(scores_fb[,1:2])#> [1] 3.944992e-17 3.064216e-17apply(scores_fb[,1:2],2, sd)#> [1] 1.141546 1.124998

To stream the diagnostics intobigmemory-backedmatrices, use the corresponding helper functions:

library(bigmemory)library(bigPCAcpp)n<-1000p<-25bm<- bigmemory::big.matrix(n, p,type ="double")bm[,]<-matrix(rnorm(n* p),nrow = n)rotation<- bigmemory::big.matrix(nrow = p,ncol = p)loadings<- bigmemory::big.matrix(nrow = p,ncol = p)correlations<- bigmemory::big.matrix(nrow = p,ncol = p)contrib<- bigmemory::big.matrix(nrow = p,ncol = p)pca_stream<-pca_stream_bigmatrix(bm,xpRotation = rotation,center =TRUE,scale =FALSE)pca_variable_loadings_stream_bigmatrix(rotation, pca_stream$sdev,                                       loadings)#> <pointer: 0x138167dd0>pca_variable_correlations_stream_bigmatrix(rotation, pca_stream$sdev,                           pca_stream$column_sd, correlations)#> Error in pca_variable_correlations_stream_bigmatrix(rotation, pca_stream$sdev, : argument "xpDest" is missing, with no defaultpca_variable_contributions_stream_bigmatrix(loadings, contrib)#> <pointer: 0x1381705f0>

Robust PCA andsingular value decompositions

Robust workflows dampen the influence of outliers while retaining thefamiliar PCA interface. Thepca_robust() helper centresvariables by the median, optionally scales by the MAD, and relies on aniteratively reweighted SVD to derive principal components. The samerobust solver is exposed directly viasvd_robust() for usein custom pipelines, and the streaming-friendlysvd_bigmatrix() wrapper computes classical SVDs onbig.matrix objects without materialising dense copies inmemory.

library(bigmemory)library(bigPCAcpp)set.seed(42)mat<-matrix(rnorm(200),nrow =40,ncol =5)mat[1,1]<-15# introduce an outliermat_scaled<-scale(mat,center =TRUE,scale=TRUE)# Classical PCA on the same data highlights the impact of the outlierbm_small<- bigmemory::big.matrix(nrow =nrow(mat_scaled),ncol =ncol(mat_scaled),type ="double")bm_small[,]<- mat_scaledclassical<-pca_bigmatrix(bm_small,center =FALSE,scale =FALSE,ncomp =3)classical$explained_variance#> [1] 0.2940708 0.2332728 0.2031007scores_classical<-pca_scores_bigmatrix(xpMat = bm_small,rotation = classical$rotation,center = classical$center, classical$scale)scores_classical[1,]#> [1] -4.752614 -1.534966  1.578737pca_plot_contributions(pca_individual_contributions(scores_classical, classical$sdev))
plot of chunk robustsvdexample

plot of chunk robustsvdexample

# Robust PCA keeps the outlier from dominating the rotationrobust<-pca_robust(mat_scaled,center =FALSE,scale =FALSE,ncomp =3)robust$explained_variance#> [1] 0.3633363 0.3509611 0.2857026robust$scores[1,]#> [1] 1.025663 1.948710 2.095546pca_plot_contributions(pca_individual_contributions(robust$scores, robust$sdev))
plot of chunk robustsvdexample

plot of chunk robustsvdexample

cbind(classical = classical$rotation[1:5,1],robust = robust$rotation[1:5,1])#>       classical     robust#> [1,] -0.5793644 0.01128235#> [2,] -0.3121420 0.59547597#> [3,] -0.5716000 0.77399456#> [4,]  0.4071138 0.18028142#> [5,]  0.2728298 0.11709868
# Classical SVD on a file-backed big.matrixpath<-tempfile(fileext =".bin")desc<-paste0(path,".desc")bm<- bigmemory::filebacked.big.matrix(200,10,type ="double",backingfile =basename(path),backingpath =dirname(path),descriptorfile =basename(desc))bm[,]<-matrix(rnorm(2000),nrow =200)svd_stream<-svd_bigmatrix(bm,nu =3,nv =3)svd_stream$d#>  [1] 16.66256 15.90085 15.80823 14.84659 13.99062 13.52699 13.06717 12.61343 12.15871 11.63997# Direct access to the robust SVD routinesvd_out<-svd_robust(mat,ncomp =3)svd_out$d#> [1] 16.789433  6.178555  5.620833svd_out$weights[1:6]#> [1] 1 1 1 1 1 1

Robust decompositions down-weight the contaminated observations whilethe classical stream demonstrates how to fetch singular vectors withoutmaterialising the dense matrix. The robust solver also exposes per-rowweights that can be reused to flag problematic observations for furtherinspection.

Plotting diagnostics

bigPCAcpp bundles plot helpers that operate on bothdense matrices andbig.matrix backends. The snippets belowillustrate how to call each function using results frompca_bigmatrix(). For instance, thepca_plot_scores() helper samples observations and draws ascatter plot of their scores on a chosen pair of components, which isparticularly useful when you need to visually assess potential clusterswithout loading the full data set into memory.

library(bigmemory)library(bigPCAcpp)set.seed(123)bm<- bigmemory::big.matrix(500,6,type ="double")bm[,]<-matrix(rnorm(500*6),nrow =500)res<-pca_bigmatrix(bm,center =TRUE,scale =TRUE)# Scree plot of explained variancepca_plot_scree(res)
plot of chunk plotexamples

plot of chunk plotexamples

# Scatter plot of sampled scores on PCs 1 and 2pca_plot_scores(  bm,  res$rotation,center = res$center,scale = res$scale,components =c(1L, 2L),max_points = 2000L,seed =2024)
plot of chunk plotexamples

plot of chunk plotexamples

# Contribution bar plot for the leading componentloadings<-pca_variable_loadings(res$rotation, res$sdev)contrib<-pca_variable_contributions(loadings)pca_plot_contributions(contrib,component = 1L,top_n = 10L)
plot of chunk plotexamples

plot of chunk plotexamples

# Correlation circle for the first two componentscorrelations<-pca_variable_correlations(res$rotation, res$sdev,res$column_sd, res$scale)pca_plot_correlation_circle(correlations,components =c(1L, 2L))
plot of chunk plotexamples

plot of chunk plotexamples

# Biplot combining scores and loadingsscores<- res$scoresif (is.null(scores)) {  scores<-pca_scores_bigmatrix(bm, res$rotation,center = res$center,scale = res$scale)}pca_plot_biplot(scores, loadings,components =c(1L, 2L))
plot of chunk plotexamples

plot of chunk plotexamples

Citation

If you usebigPCAcpp in academic work, pleasecite:

Bertrand F. (2025).bigPCAcpp: Principal Component Analysis forbigmemory Matrices.

Maintainer

Maintainer: Frédéric Bertrandfrederic.bertrand@lecnam.net

For questions, bug reports, or contributions, please open an issue onGitHub.


[8]ページ先頭

©2009-2025 Movatter.jp