Type: Package
Title: Rcpp Machine Learning Library
Version: 0.3.7
Date: 2021-09-21
Description: Fast machine learning algorithms including matrix factorization and divisive clustering for large sparse and dense matrices.
License: GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]
Imports: Rcpp, Matrix, methods, stats
LinkingTo: Rcpp, RcppEigen
VignetteBuilder: knitr
RoxygenNote: 7.1.1
Suggests: knitr, rmarkdown, testthat (≥ 3.0.0)
Config/testthat/edition: 3
URL: https://github.com/zdebruine/RcppML
BugReports: https://github.com/zdebruine/RcppML/issues
NeedsCompilation: yes
Packaged: 2021-09-21 18:39:44 UTC; Owner
Author: Zachary DeBruine [aut, cre]
Maintainer: Zachary DeBruine <zacharydebruine@gmail.com>
Repository: CRAN
Date/Publication: 2021-09-21 19:00:02 UTC

RcppML: Rcpp Machine Learning Library

Description

High-performance non-negative matrix factorization and linear model projection for sparse matrices, and fast non-negative least squares implementations

Author(s)

Zach DeBruine


Bipartition a sample set

Description

Spectral bipartitioning by rank-2 matrix factorization

Usage

bipartition(
  A,
  tol = 1e-05,
  maxit = 100,
  nonneg = TRUE,
  samples = 1:ncol(A),
  seed = NULL,
  verbose = FALSE,
  calc_dist = FALSE,
  diag = TRUE
)

Arguments

A

matrix of features-by-samples in dense or sparse format (preferred classes are "matrix" or "Matrix::dgCMatrix", respectively). Prefer sparse storage when more than half of all values are zero.

tol

stopping criterion: the correlation distance between w across consecutive iterations, 1 - cor(w_i, w_{i-1})

maxit

stopping criterion: maximum number of alternating updates of w and h

nonneg

enforce non-negativity

samples

samples to include in bipartition, numbered from 1 to ncol(A). Default is NULL for all samples.

seed

random seed for model initialization

verbose

print model tolerances between iterations

calc_dist

calculate the relative cosine distance of samples within a cluster to either cluster centroid. If TRUE, centers for clusters will also be calculated.

diag

scale factors in w and h to sum to 1 by introducing a diagonal, d. This should generally never be set to FALSE. Diagonalization enables symmetry of models in factorization of symmetric matrices, convex L1 regularization, and consistent factor scalings.

Details

Spectral bipartitioning is a popular subroutine in divisive clustering. The sign of the difference between sample loadings in the factors of a rank-2 matrix factorization gives a bipartition that is nearly identical to the one obtained from an SVD.

Rank-2 matrix factorization by alternating least squares is faster than rank-2-truncated SVD (i.e. irlba).

This function is a specialization of rank-2 nmf with support for factorization of only a subset of samples, and with additional calculations on the factorization model relevant to bipartitioning. See nmf for details regarding rank-2 factorization.
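
The relationship between a rank-2 model and the resulting bipartition can be sketched directly with nmf. This is an illustration only, not the package's internal code; bipartition performs the split (and the additional statistics) internally, and the iris data and seed are arbitrary choices:

library(Matrix)
data(iris)
A <- as(as.matrix(iris[, 1:4]), "dgCMatrix")
model <- nmf(A, k = 2, seed = 1, verbose = FALSE)
# sign of the difference between the two sample loadings gives the split
split <- model$h[1, ] - model$h[2, ] > 0
table(split)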

Value

A list giving the bipartition and useful statistics:

Author(s)

Zach DeBruine

References

Kuang, D, Park, H. (2013). "Fast rank-2 nonnegative matrix factorization for hierarchical document clustering." Proc. 19th ACM SIGKDD intl. conf. on Knowledge discovery and data mining.

See Also

nmf, dclust

Examples

## Not run: 
library(Matrix)
data(iris)
A <- as(as.matrix(iris[,1:4]), "dgCMatrix")
bipartition(A, calc_dist = TRUE)
## End(Not run)

Divisive clustering

Description

Recursive bipartitioning by rank-2 matrix factorization with an efficient modularity-approximate stopping criterion

Usage

dclust(
  A,
  min_samples,
  min_dist = 0,
  verbose = TRUE,
  tol = 1e-05,
  maxit = 100,
  nonneg = TRUE,
  seed = NULL
)

Arguments

A

matrix of features-by-samples in sparse format (preferred class is "Matrix::dgCMatrix")

min_samples

stopping criterion giving the minimum number of samples permitted in a cluster

min_dist

stopping criterion giving the minimum cosine distance of samples within a cluster to the center of their assigned vs. unassigned cluster. If 0, neither this distance nor cluster centroids will be calculated.

verbose

print number of divisions in each generation

tol

in rank-2 NMF, the correlation distance (1 - R^2) between w across consecutive iterations at which to stop factorization

maxit

stopping criterion: maximum number of alternating updates of w and h

nonneg

in rank-2 NMF, enforce non-negativity

seed

random seed for rank-2 NMF model initialization

Details

Divisive clustering is a sensitive and fast method for sample classification. Samples are recursively partitioned into two groups until a stopping criterion is satisfied, preventing further partitioning.

See nmf and bipartition for technical considerations and optimizations relevant to bipartitioning.

Stopping criteria. Two stopping criteria are used to prevent indefinite division of clusters and tune the clustering resolution to a desirable range:

Newman-Girvan modularity (Q) is an interpretable and widely used measure of modularity for a bipartition. However, it requires the calculation of distance between all within-cluster and between-cluster sample pairs. This is computationally intensive, especially for large sample sets.

dclust uses a measure which linearly approximates Newman-Girvan modularity, and simply requires the calculation of distance between all samples in a cluster and both cluster centers (the assigned and unassigned center), which is orders of magnitude faster to compute. Cosine distance is used instead of Euclidean distance since it handles outliers and sparsity well.

A bipartition is rejected if either of the two clusters contains fewer than min_samples samples or if the mean relative cosine distance of the bipartition is less than min_dist.

A bipartition will only be attempted if there are more than 2 * min_samples samples in the cluster, meaning that dist may not be calculated for some clusters.
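
The distance calculation can be sketched in plain R as follows. The helper names and the exact form of the relative distance are assumptions for illustration only; the package computes this quantity internally in C++:

# hypothetical helpers, not the package internals
cosine_dist <- function(x, center) 1 - sum(x * center) / sqrt(sum(x^2) * sum(center^2))
relative_dist <- function(samples, assigned_center, other_center) {
  d_in  <- apply(samples, 2, cosine_dist, center = assigned_center)
  d_out <- apply(samples, 2, cosine_dist, center = other_center)
  mean((d_out - d_in) / (d_out + d_in))  # exact relative form is an assumption
}
# a bipartition would be rejected if this mean relative distance < min_dist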

Reproducibility. Because rank-2 NMF is approximate and requires random initialization, results may vary slightly across restarts. Therefore, specify a seed to guarantee absolute reproducibility.

In addition to setting the seed, reproducibility may be improved by setting tol to a smaller number to increase the exactness of each bipartition.

Value

A list of lists corresponding to individual clusters:

Author(s)

Zach DeBruine

References

Schwartz, G. et al. "TooManyCells identifies and visualizes relationships of single-cell clades". Nature Methods (2020).

Newman, MEJ. "Modularity and community structure in networks". PNAS (2006)

Kuang, D, Park, H. (2013). "Fast rank-2 nonnegative matrix factorization for hierarchical document clustering." Proc. 19th ACM SIGKDD intl. conf. on Knowledge discovery and data mining.

See Also

bipartition, nmf

Examples

## Not run: 
library(Matrix)
data(USArrests)
A <- as(as.matrix(t(USArrests)), "dgCMatrix")
clusters <- dclust(A, min_samples = 2, min_dist = 0.001)
str(clusters)
## End(Not run)

Get the number of threads RcppML should use

Description

Get the number of threads that will be used by RcppML functions supporting parallelization with OpenMP. Use setRcppMLthreads to set the number of threads to be used.

Usage

getRcppMLthreads()

Value

integer giving the number of threads to be used by RcppML functions. 0 corresponds to all available threads, as determined by OpenMP.

Author(s)

Zach DeBruine

See Also

setRcppMLthreads

Examples

## Not run: 
# set serial configuration
setRcppMLthreads(1)
getRcppMLthreads()

# restore default parallel configuration,
# letting OpenMP decide how many threads to use
setRcppMLthreads(0)
getRcppMLthreads()
## End(Not run)

Mean Squared Error loss of a factor model

Description

MSE of factor models w and h given sparse matrix A

Usage

mse(A, w, d = NULL, h, mask_zeros = FALSE)

Arguments

A

matrix of features-by-samples in dense or sparse format (preferred classes are "matrix" or "Matrix::dgCMatrix", respectively). Prefer sparse storage when more than half of all values are zero.

w

dense matrix of class matrix with factors (columns) by features (rows)

d

diagonal scaling vector of rank length

h

dense matrix of class matrix with samples (columns) by factors (rows)

mask_zeros

handle zeros as missing values, available only when A is sparse

Details

Mean squared error of a matrix factorization of the form A = wdh is given by

\frac{\sum_{i,j}{(A - wdh)^2}}{ij}

where i and j are the number of rows and columns in A.

Thus, this function simply computes the product wh (or wdh if d is specified), subtracts it from A, squares the result, and calculates the mean of all values.

If no diagonal scaling vector is present in the model, input d = rep(1, k) where k is the rank of the model.

Parallelization. Calculation of mean squared error is performed in parallel across columns in A using the number of threads set by setRcppMLthreads. By default, all available threads are used; see getRcppMLthreads.

Value

mean squared error of the factorization model

Author(s)

Zach DeBruine

Examples

## Not run: 
library(Matrix)
A <- Matrix::rsparsematrix(1000, 1000, 0.1)
model <- nmf(A, k = 10, tol = 0.01)
c_mse <- mse(A, model$w, model$d, model$h)
R_mse <- mean((A - model$w %*% Diagonal(x = model$d) %*% model$h)^2)
all.equal(c_mse, R_mse)
## End(Not run)

Non-negative matrix factorization

Description

Sparse matrix factorization of the form A = wdh by alternating least squares with optional non-negativity constraints.

Usage

nmf(
  A,
  k,
  tol = 1e-04,
  maxit = 100,
  verbose = TRUE,
  L1 = c(0, 0),
  seed = NULL,
  mask_zeros = FALSE,
  diag = TRUE,
  nonneg = TRUE
)

Arguments

A

matrix of features-by-samples in dense or sparse format (preferred classes are "matrix" or "Matrix::dgCMatrix", respectively). Prefer sparse storage when more than half of all values are zero.

k

rank

tol

stopping criterion: the correlation distance between w across consecutive iterations, 1 - cor(w_i, w_{i-1})

maxit

stopping criterion: maximum number of alternating updates of w and h

verbose

print model tolerances between iterations

L1

L1/LASSO penalties between 0 and 1, array of length two for c(w, h)

seed

random seed for model initialization

mask_zeros

handle zeros as missing values, available only when A is sparse

diag

scale factors in w and h to sum to 1 by introducing a diagonal, d. This should generally never be set to FALSE. Diagonalization enables symmetry of models in factorization of symmetric matrices, convex L1 regularization, and consistent factor scalings.

nonneg

enforce non-negativity

Details

This fast non-negative matrix factorization (NMF) implementation decomposes a matrix A into lower-rank non-negative matrices w and h, with factors scaled to sum to 1 via multiplication by a diagonal, d:

A = wdh

The scaling diagonal enables symmetric factorization, convex L1 regularization, and consistent factor scalings regardless of random initialization.

The factorization model is randomly initialized, and w and h are updated alternately using least squares. Given A and w, h is updated according to the equation:

w^Twh = w^TA_j

This equation is in the form ax = b where a = w^Tw, x = h, and b = w^TA_j for all columns j in A.

The corresponding update for w is

hh^Tw = hA^T_j
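
As an illustration (not the internal C++ update), a single update of h can be sketched with the package's nnls function using the normal equations above; the random A and w here are placeholders:

library(Matrix)
A <- abs(rsparsematrix(200, 50, 0.1))
w <- matrix(runif(200 * 5), 200, 5)
a <- as.matrix(crossprod(w))     # a = w^Tw
b <- as.matrix(crossprod(w, A))  # b = w^TA_j for all columns j
h <- nnls(a, b)                  # non-negative solve of ax = b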

Stopping criteria. Alternating least squares projections (see the project subroutine) are repeated until a stopping criterion is satisfied, which is either a maximum number of iterations or a tolerance based on the correlation distance between models (1 - cor(w_i, w_{i-1})) across consecutive iterations. Use the tol parameter to control the stopping criterion for alternating updates:

Parallelization. Least squares projections in factorizations of rank-3 and greater are parallelized using the number of threads set by setRcppMLthreads. By default, all available threads are used; see getRcppMLthreads. The overhead of parallelization is too great to benefit rank-1 and rank-2 factorization.

Specializations. There are specializations for symmetric matrices, and for rank-1 and rank-2 factorization.

L1 regularization. L1 penalization increases the sparsity of factors, but does not change the information content of the model or the relative contributions of the leading coefficients in each factor to the model. L1 regularization only slightly increases the error of a model. L1 penalization should be used to aid interpretability. Penalty values should range from 0 to 1, where 1 gives complete sparsity. In this implementation of NMF, a scaling diagonal ensures that the L1 penalty is equally applied across all factors regardless of random initialization and the distribution of the model. Many other implementations of matrix factorization claim to apply L1, but the magnitude of the penalty is at the mercy of the random distribution and more significantly affects factors with lower overall contribution to the model. L1 regularization of rank-1 and rank-2 factorizations has no effect.
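
A minimal sketch of the effect of the L1 parameter on factor sparsity; the data, penalty value, and seed are arbitrary illustrative choices:

library(Matrix)
A <- abs(rsparsematrix(500, 100, 0.1))
m0 <- nmf(A, k = 10, seed = 1, verbose = FALSE)
m1 <- nmf(A, k = 10, seed = 1, L1 = c(0.1, 0.1), verbose = FALSE)
mean(m0$w == 0)  # proportion of zeros without L1
mean(m1$w == 0)  # typically higher with L1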

Rank-2 factorization. When k = 2, a very fast optimized algorithm is used. Two-variable least squares solutions to the problem ax = b are found by direct substitution:

x_1 = \frac{a_{22}b_1 - a_{12}b_2}{a_{11}a_{22} - a_{12}^2}

x_2 = \frac{a_{11}b_2 - a_{12}b_1}{a_{11}a_{22} - a_{12}^2}

In the above equations, the denominator is constant and thus needs to be calculated only once. Additionally, if non-negativity constraints are to be imposed, if x_1 < 0 then x_1 = 0 and x_2 = \frac{b_2}{a_{22}}. Similarly, if x_2 < 0 then x_2 = 0 and x_1 = \frac{b_1}{a_{11}}.
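
A plain-R sketch of this two-variable non-negative solve, for illustration only; the function name is hypothetical and the package performs this in optimized C++:

solve_2var_nnls <- function(a, b) {
  denom <- a[1, 1] * a[2, 2] - a[1, 2]^2  # constant across all columns of A
  x1 <- (a[2, 2] * b[1] - a[1, 2] * b[2]) / denom
  x2 <- (a[1, 1] * b[2] - a[1, 2] * b[1]) / denom
  if (x1 < 0) {
    x1 <- 0; x2 <- max(0, b[2] / a[2, 2])
  } else if (x2 < 0) {
    x2 <- 0; x1 <- max(0, b[1] / a[1, 1])
  }
  c(x1, x2)
}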

Rank-2 NMF is useful for bipartitioning, and is a subroutine in bipartition, where the sign of the difference between sample loadings in both factors gives the partitioning.

Rank-1 factorization. Rank-1 factorization by alternating least squares gives vectors equivalent to the first singular vectors in an SVD. It is a normalization of the data to a middle point, and may be useful for ordering samples based on the most significant axis of variation (i.e. pseudotime trajectories). Diagonal scaling guarantees consistent linear scaling of the factor across random restarts.

Random seed and reproducibility. Results of rank-1 and rank-2 factorizations should be reproducible regardless of random seed. For higher-rank models, results across random restarts should, in theory, be comparable at very high tolerances (i.e. machine precision for double, corresponding to about tol = 1e-10). However, to guarantee reproducibility without such low tolerances, set the seed argument. Note that set.seed() will not work. Only random initialization is supported, as other methods incur unnecessary overhead and sometimes trap updates into local minima.

Rank determination. This function does not attempt to provide a method for rank determination. Like any clustering algorithm or dimensional reduction, finding the optimal rank can be subjective. An easy way to estimate rank uses the "elbow method", where the inflection point on a plot of Mean Squared Error loss (MSE) vs. rank gives a good idea of the rank at which most of the signal has been captured in the model. Unfortunately, this inflection point is not often as obvious for NMF as it is for SVD or PCA.
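
A minimal sketch of the elbow approach using mse over a grid of candidate ranks; the data, rank grid, and seed are illustrative placeholders:

library(Matrix)
A <- abs(rsparsematrix(500, 100, 0.1))
ranks <- c(1, 2, 4, 6, 8, 10, 15, 20)
errors <- sapply(ranks, function(k) {
  m <- nmf(A, k = k, seed = 1, verbose = FALSE)
  mse(A, m$w, m$d, m$h)
})
plot(ranks, errors, type = "b", xlab = "rank", ylab = "MSE")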

k-fold cross-validation is a better method. Missing value imputation has previously been proposed, but is arguably no less subjective than test-training splits and requires computationally slower factorization updates using missing values, which are not supported here.

Symmetric factorization. Special optimization for symmetric matrices is automatically applied. Specifically, alternating updates of w and h require transposition of A, but A == t(A) when A is symmetric, thus no up-front transposition is performed.

Zero-masking. When zeros in a data structure can be regarded as "missing", mask_zeros = TRUE may be set. However, this requires a slower algorithm, and tolerances will fluctuate more dramatically.

Publication reference. For theoretical and practical considerations, please see our manuscript: "DeBruine ZJ, Melcher K, Triche TJ (2021) High-performance non-negative matrix factorization for large single cell data" on BioRXiv.

Value

A list giving the factorization model:

Author(s)

Zach DeBruine

References

DeBruine, ZJ, Melcher, K, and Triche, TJ. (2021). "High-performance non-negative matrix factorization for large single-cell data." BioRXiv.

Lin, X, and Boutros, PC (2020). "Optimization and expansion of non-negative matrix factorization." BMC Bioinformatics.

Lee, D, and Seung, HS (1999). "Learning the parts of objects by non-negative matrix factorization." Nature.

Franc, VC, Hlavac, VC, Navara, M. (2005). "Sequential Coordinate-Wise Algorithm for the Non-negative Least Squares Problem". Proc. Int'l Conf. Computer Analysis of Images and Patterns. Lecture Notes in Computer Science.

See Also

nnls, project, mse

Examples

## Not run: 
library(Matrix)
# basic NMF
model <- nmf(rsparsematrix(1000, 100, 0.1), k = 10)

# compare rank-2 NMF to second left vector in an SVD
data(iris)
A <- as(as.matrix(iris[,1:4]), "dgCMatrix")
nmf_model <- nmf(A, 2, tol = 1e-5)
bipartitioning_vector <- apply(nmf_model$w, 1, diff)
second_left_svd_vector <- base::svd(A, 2)$u[,2]
abs(cor(bipartitioning_vector, second_left_svd_vector))

# compare rank-1 NMF with first singular vector in an SVD
abs(cor(nmf(A, 1)$w[,1], base::svd(A, 2)$u[,1]))

# symmetric NMF
A <- crossprod(rsparsematrix(100, 100, 0.02))
model <- nmf(A, 10, tol = 1e-5, maxit = 1000)
plot(model$w, t(model$h))
# see package vignette for more examples
## End(Not run)

Non-negative least squares

Description

Solves the equation a %*% x = b for x subject to x >= 0.

Usage

nnls(a, b, cd_maxit = 100L, cd_tol = 1e-08, fast_nnls = FALSE, L1 = 0)

Arguments

a

symmetric positive definite matrix giving coefficients of the linear system

b

matrix giving the right-hand side(s) of the linear system

cd_maxit

maximum number of coordinate descent iterations

cd_tol

stopping criterion: the difference in x across consecutive solutions over the sum of x

fast_nnls

initialize coordinate descent with a FAST NNLS approximation

L1

L1/LASSO penalty to be subtracted from b

Details

This is a very fast implementation of non-negative least squares (NNLS), suitable for very small or very large systems.

Algorithm. Sequential coordinate descent (CD) is at the core of this implementation, and requires an initialization of x. There are two supported methods for initialization of x:

  1. Zero-filled initialization when fast_nnls = FALSE and cd_maxit > 0. This is generally very efficient for well-conditioned and small systems.

  2. Approximation with FAST when fast_nnls = TRUE. Forward active set tuning (FAST), described below, finds an approximate active set using unconstrained least squares solutions found by Cholesky decomposition and substitution. To use only FAST approximation, set cd_maxit = 0.

a must be symmetric positive definite if FAST NNLS is used, but this is not checked.

See our BioRXiv manuscript (references) for benchmarking against Lawson-Hanson NNLS and for a more technical introduction to these methods.

Coordinate Descent NNLS. Least squares by sequential coordinate descent is used to ensure the solution returned is exact. This algorithm was introduced by Franc et al. (2005), and our implementation is a vectorized and optimized rendition of that found in the NNLM R package by Xihui Lin (2020).

FAST NNLS. Forward active set tuning (FAST) is an exact or near-exact NNLS approximation initialized by an unconstrained least squares solution. Negative values in this unconstrained solution are set to zero (the "active set"), and all other values are added to a "feasible set". An unconstrained least squares solution is then solved for the "feasible set", any negative values in the resulting solution are set to zero, and the process is repeated until the feasible set solution is strictly positive.

The FAST algorithm has a definite convergence guarantee because the feasible set will either converge or become smaller with each iteration. The result is generally exact or nearly exact for small well-conditioned systems (< 50 variables) within 2 iterations and thus sets up coordinate descent for very rapid convergence. The FAST method is similar to the first phase of the so-called "TNT-NN" algorithm (Myre et al., 2017), but the latter half of that method relies heavily on heuristics to refine the approximate active set, which we avoid by using coordinate descent instead.
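
A plain-R sketch of the FAST iteration described above, for illustration only; the function name is hypothetical, and nnls performs this step in C++ when fast_nnls = TRUE:

fast_approx <- function(a, b) {
  x <- numeric(length(b))
  feasible <- seq_along(b)
  repeat {
    soln <- solve(a[feasible, feasible, drop = FALSE], b[feasible])
    if (all(soln > 0)) { x[feasible] <- soln; break }
    feasible <- feasible[soln > 0]  # move negative values to the active set
    if (length(feasible) == 0) break
  }
  x
}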

Value

vector or matrix giving the solution for x

Author(s)

Zach DeBruine

References

DeBruine, ZJ, Melcher, K, and Triche, TJ. (2021). "High-performance non-negative matrix factorization for large single-cell data." BioRXiv.

Franc, VC, Hlavac, VC, and Navara, M. (2005). "Sequential Coordinate-Wise Algorithm for the Non-negative Least Squares Problem. Proc. Int'l Conf. Computer Analysis of Images and Patterns."

Lin, X, and Boutros, PC (2020). "Optimization and expansion of non-negative matrix factorization." BMC Bioinformatics.

Myre, JM, Frahm, E, Lilja DJ, and Saar, MO. (2017) "TNT-NN: A Fast Active Set Method for Solving Large Non-Negative Least Squares Problems". Proc. Computer Science.

See Also

nmf, project

Examples

## Not run: 
# compare solution to base::solve for a random system
X <- matrix(runif(100), 10, 10)
a <- crossprod(X)
b <- crossprod(X, runif(10))
unconstrained_soln <- solve(a, b)
nonneg_soln <- nnls(a, b)
unconstrained_err <- mean((a %*% unconstrained_soln - b)^2)
nonnegative_err <- mean((a %*% nonneg_soln - b)^2)
unconstrained_err
nonnegative_err
all.equal(solve(a, b), nnls(a, b))

# example adapted from multiway::fnnls example 1
X <- matrix(1:100, 50, 2)
y <- matrix(101:150, 50, 1)
beta <- solve(crossprod(X)) %*% crossprod(X, y)
beta
beta <- nnls(crossprod(X), crossprod(X, y))
beta
## End(Not run)

Project a linear factor model

Description

Solves the equation A = wh for either h or w given either w or h, and A

Usage

project(A, w = NULL, h = NULL, nonneg = TRUE, L1 = 0, mask_zeros = FALSE)

Arguments

A

matrix of features-by-samples in dense or sparse format (preferred classes are "matrix" or "Matrix::dgCMatrix", respectively). Prefer sparse storage when more than half of all values are zero.

w

dense matrix of factors x features giving the linear model to be projected (if h = NULL)

h

dense matrix of factors x samples giving the linear model to be projected (if w = NULL)

nonneg

enforce non-negativity

L1

L1/LASSO penalty to be applied. No scaling is performed. See details.

mask_zeros

handle zeros as missing values, available only when A is sparse

Details

For the classical alternating least squares matrix factorization update problem A = wh, the update (or projection) of h is given by the equation:

w^Twh = w^TA_j

which is in the form ax = b where a = w^Tw, x = h, and b = w^TA_j for all columns j in A.
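
As an illustration, the projection of h should agree closely with a non-negative solve of this system using nnls (small numerical differences are expected since default solver tolerances may differ); the random A and w here are placeholders:

library(Matrix)
A <- abs(rsparsematrix(200, 50, 0.1))
w <- matrix(runif(200 * 5), 200, 5)
h1 <- project(A, w = w)
h2 <- nnls(as.matrix(crossprod(w)), as.matrix(crossprod(w, A)))
all.equal(h1, h2, check.attributes = FALSE)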

Given A, project can solve for either w or h given the other:

Parallelization. Least squares projections in factorizations of rank-3 and greater are parallelized using the number of threads set by setRcppMLthreads. By default, all available threads are used; see getRcppMLthreads. The overhead of parallelization is too great for rank-1 and rank-2 factorization.

L1 Regularization. Any L1 penalty is subtracted from b and should generally be scaled to max(b), where b = w^TA_j for all columns j in A. An easy way to properly scale an L1 penalty is to normalize all columns in w to sum to 1. No scaling is applied in this function. Such scaling guarantees that L1 = 1 gives a completely sparse solution.
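
A minimal sketch of this scaling convention; the data are illustrative and L1 = 0.5 is an arbitrary choice:

library(Matrix)
A <- abs(rsparsematrix(200, 50, 0.1))
w <- matrix(runif(200 * 5), 200, 5)
w <- sweep(w, 2, colSums(w), "/")  # columns of w now sum to 1
h <- project(A, w = w, L1 = 0.5)
mean(h == 0)                       # proportion of zeros in the projection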

Specializations. There are specializations for symmetric input matrices, and for rank-1 and rank-2 projections. See documentation for nmf for theoretical details and guidance.

Publication reference. For theoretical and practical considerations, please see our manuscript: "DeBruine ZJ, Melcher K, Triche TJ (2021) High-performance non-negative matrix factorization for large single cell data" on BioRXiv.

Value

matrix h or w

Author(s)

Zach DeBruine

References

DeBruine, ZJ, Melcher, K, and Triche, TJ. (2021). "High-performance non-negative matrix factorization for large single-cell data." BioRXiv.

See Also

nnls, nmf

Examples

## Not run: 
library(Matrix)
w <- matrix(runif(1000 * 10), 1000, 10)
h_true <- matrix(runif(10 * 100), 10, 100)

# A is the crossproduct of "w" and "h" with 10% signal dropout
A <- (w %*% h_true) * (rsparsematrix(1000, 100, 0.9) > 0)
h <- project(A, w)
cor(as.vector(h_true), as.vector(h))

# alternating projections refine solution (like NMF)
mse_bad <- mse(A, w, rep(1, ncol(w)), h) # mse before alternating updates
h <- project(A, w = w)
w <- project(A, h = h)
h <- project(A, w)
w <- project(A, h = h)
h <- project(A, w)
w <- t(project(A, h = h))
mse_better <- mse(A, w, rep(1, ncol(w)), h) # mse after alternating updates
mse_better < mse_bad

# two ways to solve for "w" that give the same solution
w <- project(A, h = h)
w2 <- project(t(A), w = t(h))
all.equal(w, w2)
## End(Not run)

Set the number of threads RcppML should use

Description

The number of threads is 0 by default (corresponding to all available threads), but can be set manually using this function. If you clear your environment variables, or modify the "RcppMLthreads" environment variable specifically, you will need to set your preferred number of threads again.

Usage

setRcppMLthreads(threads)

Arguments

threads

number of threads to be used in RcppML functions that are parallelized with OpenMP.

Details

The number of threads set affects OpenMP parallelization only for functions in the RcppML package. It does not affect other R packages that use OpenMP. Parallelization is used for projection of linear factor models with rank > 2, calculation of mean squared error for linear factor models, and for divisive clustering.
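
Because the setting is stored in the "RcppMLthreads" environment variable (see Description), it can also be inspected directly; a minimal sketch, with 4 threads as an arbitrary example value:

setRcppMLthreads(4)
Sys.getenv("RcppMLthreads")  # the environment variable named in the Description
getRcppMLthreads()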

Author(s)

Zach DeBruine

See Also

getRcppMLthreads

Examples

## Not run: 
# set serial configuration
setRcppMLthreads(1)
getRcppMLthreads()

# restore default parallel configuration,
# letting OpenMP decide how many threads to use
setRcppMLthreads(0)
getRcppMLthreads()
## End(Not run)
