Title:The Uniform Manifold Approximation and Projection (UMAP) Method for Dimensionality Reduction
Version:0.2.4
Description:An implementation of the Uniform Manifold Approximation and Projection dimensionality reduction by McInnes et al. (2018) <doi:10.48550/arXiv.1802.03426>. It also provides means to transform new data and to carry out supervised dimensionality reduction. An implementation of the related LargeVis method of Tang et al. (2016) <doi:10.48550/arXiv.1602.00370> is also provided. This is a complete re-implementation in R (and C++, via the 'Rcpp' package): no Python installation is required. See the uwot website (https://github.com/jlmelville/uwot) for more documentation and examples.
License:GPL (≥ 3)
URL:https://github.com/jlmelville/uwot, https://jlmelville.github.io/uwot/
BugReports:https://github.com/jlmelville/uwot/issues
Depends:Matrix
Imports:FNN, irlba, methods, Rcpp, RcppAnnoy (≥ 0.0.17), RSpectra
Suggests:bigstatsr, covr, knitr, RcppHNSW, rmarkdown, rnndescent, testthat
LinkingTo:dqrng, Rcpp, RcppAnnoy, RcppProgress
VignetteBuilder:knitr
Config/Needs/website:rmarkdown
Encoding:UTF-8
RoxygenNote:7.3.3
NeedsCompilation:yes
Packaged:2025-11-09 17:56:59 UTC; jlmel
Author:James Melville [aut, cre, cph], Aaron Lun [ctb], Mohamed Nadhir Djekidel [ctb], Yuhan Hao [ctb], Dirk Eddelbuettel [ctb], Wouter van der Bijl [ctb], Hugo Gruson [ctb]
Maintainer:James Melville <jlmelville@gmail.com>
Repository:CRAN
Date/Publication:2025-11-10 06:10:02 UTC

Save or Load a Model

Description

Functions to write a UMAP model to a file, and to restore it.

Usage

load_uwot(file, verbose = FALSE)

Arguments

file

name of the file where the model is to be saved or read from.

verbose

if TRUE, log information to the console.

Value

The model saved at file, for use with umap_transform. Additionally, it contains an extra item: mod_dir, which contains the path to the temporary working directory used during loading of the model. This directory cannot be removed until this model has been unloaded by using unload_uwot.

See Also

save_uwot, unload_uwot

Examples

library(RSpectra)
iris_train <- iris[c(1:10, 51:60), ]
iris_test <- iris[100:110, ]

# create model
model <- umap(iris_train, ret_model = TRUE, n_epochs = 20)

# save without unloading: this leaves behind a temporary working directory
model_file <- tempfile("iris_umap")
model <- save_uwot(model, file = model_file)

# The model can continue to be used
test_embedding <- umap_transform(iris_test, model)

# To manually unload the model from memory when finished and to clean up
# the working directory (this doesn't touch your model file)
unload_uwot(model)

# At this point, model cannot be used with umap_transform, this would fail:
# test_embedding2 <- umap_transform(iris_test, model)

# restore the model: this also creates a temporary working directory
model2 <- load_uwot(file = model_file)
test_embedding2 <- umap_transform(iris_test, model2)

# Unload and clean up the loaded model temp directory
unload_uwot(model2)

# clean up the model file
unlink(model_file)

# save with unloading: this deletes the temporary working directory but
# doesn't allow the model to be re-used
model3 <- umap(iris_train, ret_model = TRUE, n_epochs = 20)
model_file3 <- tempfile("iris_umap")
model3 <- save_uwot(model3, file = model_file3, unload = TRUE)

Dimensionality Reduction with a LargeVis-like method

Description

Carry out dimensionality reduction of a dataset using a method similar to LargeVis (Tang et al., 2016).

Usage

lvish(
  X,
  perplexity = 50,
  n_neighbors = perplexity * 3,
  n_components = 2,
  metric = "euclidean",
  n_epochs = -1,
  learning_rate = 1,
  scale = "maxabs",
  init = "lvrandom",
  init_sdev = NULL,
  repulsion_strength = 7,
  negative_sample_rate = 5,
  nn_method = NULL,
  n_trees = 50,
  search_k = 2 * n_neighbors * n_trees,
  n_threads = NULL,
  n_sgd_threads = 0,
  grain_size = 1,
  kernel = "gauss",
  pca = NULL,
  pca_center = TRUE,
  pcg_rand = TRUE,
  fast_sgd = FALSE,
  ret_nn = FALSE,
  ret_extra = c(),
  tmpdir = tempdir(),
  verbose = getOption("verbose", TRUE),
  batch = FALSE,
  opt_args = NULL,
  epoch_callback = NULL,
  pca_method = NULL,
  binary_edge_weights = FALSE,
  nn_args = list(),
  rng_type = NULL
)

Arguments

X

Input data. Can be a data.frame, matrix, dist object or sparseMatrix. Matrix and data frames should contain one observation per row. Data frames will have any non-numeric columns removed, although factor columns will be used if explicitly included via metric (see the help for metric for details). A sparse matrix is interpreted as a distance matrix, and is assumed to be symmetric, so you can also pass in an explicitly upper or lower triangular sparse matrix to save storage. There must be at least n_neighbors non-zero distances for each row. Both implicit and explicit zero entries are ignored. Set zero distances you want to keep to an arbitrarily small non-zero value (e.g. 1e-10). X can also be NULL if pre-computed nearest neighbor data is passed to nn_method, and init is not "spca" or "pca".

perplexity

Controls the size of the local neighborhood used for manifold approximation. This is analogous to n_neighbors in umap. Change this, rather than n_neighbors.

n_neighbors

The number of neighbors to use when calculating the perplexity. Usually set to three times the value of the perplexity. Must be at least as large as perplexity.

n_components

The dimension of the space to embed into. This defaults to 2 to provide easy visualization, but can reasonably be set to any integer value in the range 2 to 100.

metric

Type of distance metric to use to find nearest neighbors. For nn_method = "annoy" this can be one of:

  • "euclidean" (the default)

  • "cosine"

  • "manhattan"

  • "hamming"

  • "correlation" (a distance based on the Pearson correlation)

  • "categorical" (see below)

For nn_method = "hnsw" this can be one of:

  • "euclidean"

  • "cosine"

  • "correlation"

If rnndescent is installed and nn_method = "nndescent" is specified then many more metrics are available, including:

  • "braycurtis"

  • "canberra"

  • "chebyshev"

  • "dice"

  • "hamming"

  • "hellinger"

  • "jaccard"

  • "jensenshannon"

  • "kulsinski"

  • "rogerstanimoto"

  • "russellrao"

  • "sokalmichener"

  • "sokalsneath"

  • "spearmanr"

  • "symmetrickl"

  • "tsss"

  • "yule"

For more details see the package documentation of rnndescent. For nn_method = "fnn", the distance metric is always "euclidean".

If X is a data frame or matrix, then multiple metrics can be specified, by passing a list to this argument, where the name of each item in the list is one of the metric names above. The value of each list item should be a vector giving the names or integer ids of the columns to be included in a calculation, e.g. metric = list(euclidean = 1:4, manhattan = 5:10).

Each metric calculation results in a separate fuzzy simplicial set, which are intersected together to produce the final set. Metric names can be repeated. Because non-numeric columns are removed from the data frame, it is safer to use column names than integer ids.

Factor columns can also be used by specifying the metric name "categorical". Factor columns are treated differently from numeric columns and although multiple factor columns can be specified in a vector, each factor column specified is processed individually. If you specify a non-factor column, it will be coerced to a factor.

For a given data block, you may override the pca and pca_center arguments for that block, by providing a list with one unnamed item containing the column names or ids, and then any of the pca or pca_center overrides as named items, e.g. metric = list(euclidean = 1:4, manhattan = list(5:10, pca_center = FALSE)). This exists to allow mixed binary and real-valued data to be included and to have PCA applied to both, but with centering applied only to the real-valued data (it is typical not to apply centering to binary data before PCA is applied).
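The following is a minimal sketch of this list syntax using iris; the column groupings and parameter values are arbitrary and chosen only for illustration:

# Sketch: two numeric blocks with different metrics, plus the Species factor
# treated as a categorical block. Each block yields a fuzzy simplicial set,
# and the sets are intersected to give the final one.
emb <- lvish(
  iris,
  perplexity = 25,          # n_neighbors defaults to 3 * perplexity = 75
  n_epochs = 20,            # kept small so the sketch runs quickly
  metric = list(
    euclidean = c("Sepal.Length", "Sepal.Width"),
    manhattan = c("Petal.Length", "Petal.Width"),
    categorical = "Species"
  )
)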

n_epochs

Number of epochs to use during the optimization of the embedded coordinates. The default is to calculate the number of epochs dynamically based on dataset size, to give the same number of edge samples as the LargeVis defaults. This is usually substantially larger than the UMAP defaults. If n_epochs = 0, then coordinates determined by "init" will be returned.

learning_rate

Initial learning rate used in optimization of thecoordinates.

scale

Scaling to apply to X if it is a data frame or matrix:

  • "none" or FALSE or NULL No scaling.

  • "Z" or "scale" or TRUE Scale each column to zero mean and variance 1.

  • "maxabs" Center each column to mean 0, then divide each element by the maximum absolute value over the entire matrix.

  • "range" Range scale the entire matrix, so the smallest element is 0 and the largest is 1.

  • "colrange" Scale each column in the range (0,1).

For lvish, the default is "maxabs", for consistency with LargeVis.

init

Type of initialization for the coordinates. Options are:

  • "spectral" Spectral embedding using the normalized Laplacian of the fuzzy 1-skeleton, with Gaussian noise added.

  • "normlaplacian". Spectral embedding using the normalized Laplacian of the fuzzy 1-skeleton, without noise.

  • "random". Coordinates assigned using a uniform random distribution between -10 and 10.

  • "lvrandom". Coordinates assigned using a Gaussian distribution with standard deviation 1e-4, as used in LargeVis (Tang et al., 2016) and t-SNE.

  • "laplacian". Spectral embedding using the Laplacian Eigenmap (Belkin and Niyogi, 2002).

  • "pca". The first two principal components from PCA of X if X is a data frame, and from a 2-dimensional classical MDS if X is of class "dist".

  • "spca". Like "pca", but each dimension is then scaled so the standard deviation is 1e-4, to give a distribution similar to that used in t-SNE and LargeVis. This is an alias for init = "pca", init_sdev = 1e-4.

  • "agspectral" An "approximate global" modification of "spectral" which sets all edges in the graph to a value of 1, and then sets a random number of edges (negative_sample_rate edges per vertex) to 0.1, to approximate the effect of non-local affinities.

  • A matrix of initial coordinates.

For spectral initializations ("spectral", "normlaplacian", "laplacian", "agspectral"), if more than one connected component is identified, no spectral initialization is attempted. Instead a PCA-based initialization is attempted. If verbose = TRUE the number of connected components is logged to the console. The existence of multiple connected components implies that a global view of the data cannot be attained with this initialization. Increasing the value of n_neighbors may help.

init_sdev

If non-NULL, scales each dimension of the initialized coordinates (including any user-supplied matrix) to this standard deviation. By default no scaling is carried out, except when init = "spca", in which case the value is 0.0001. Scaling the input may help if the unscaled versions result in initial coordinates with large inter-point distances or outliers. This usually results in small gradients during optimization and very little progress being made to the layout. Shrinking the initial embedding by rescaling can help under these circumstances. Scaling the result of init = "pca" is usually recommended (init = "spca" is an alias for init = "pca", init_sdev = 1e-4), but for the spectral initializations the scaled versions usually aren't necessary unless you are using a large value of n_neighbors (e.g. n_neighbors = 150 or higher). For compatibility with recent versions of the Python UMAP package, if you are using init = "spectral", then you should also set init_sdev = "range", which will range scale each of the columns containing the initial data between 0-10. This is not set by default to maintain backwards compatibility with previous versions of uwot.

repulsion_strength

Weighting applied to negative samples in lowdimensional embedding optimization. Values higher than one will result ingreater weight being given to negative samples.

negative_sample_rate

The number of negative edge/1-simplex samples touse per positive edge/1-simplex sample in optimizing the low dimensionalembedding.

nn_method

Method for finding nearest neighbors. Options are:

  • "fnn". Use exact nearest neighbors via the FNN package.

  • "annoy" Use approximate nearest neighbors via the RcppAnnoy package.

  • "hnsw" Use approximate nearest neighbors with the Hierarchical Navigable Small World (HNSW) method (Malkov and Yashunin, 2018) via the RcppHNSW package. RcppHNSW is not a dependency of this package: this option is only available if you have installed RcppHNSW yourself. Also, HNSW only supports the following arguments for metric: "euclidean", "cosine" and "correlation".

  • "nndescent" Use approximate nearest neighbors with the Nearest Neighbor Descent method (Dong et al., 2011) via the rnndescent package. rnndescent is not a dependency of this package: this option is only available if you have installed rnndescent yourself.

By default, if X has less than 4,096 vertices, the exact nearest neighbors are found. Otherwise, approximate nearest neighbors are used. You may also pass precalculated nearest neighbor data to this argument. It must be a list consisting of two elements:

  • "idx". A n_vertices x n_neighbors matrix containing the integer indexes of the nearest neighbors in X. Each vertex is considered to be its own nearest neighbor, i.e. idx[, 1] == 1:n_vertices.

  • "dist". A n_vertices x n_neighbors matrix containing the distances of the nearest neighbors.

Multiple nearest neighbor data (e.g. from two different precomputed metrics) can be passed by passing a list containing the nearest neighbor data lists as items. The n_neighbors parameter is ignored when using precomputed nearest neighbor data.
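As a hedged sketch of this precomputed format, using exact neighbors from the FNN package (already imported by uwot); the choice of k and perplexity is arbitrary:

library(FNN)
X <- as.matrix(iris[, 1:4])
k <- 45                                          # 3 * the perplexity used below
knn <- FNN::get.knn(X, k = k - 1)
nn <- list(
  idx  = cbind(seq_len(nrow(X)), knn$nn.index),  # each point is its own first neighbor
  dist = cbind(0, knn$nn.dist)
)
emb <- lvish(X, perplexity = 15, nn_method = nn, n_epochs = 20)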

n_trees

Number of trees to build when constructing the nearest neighbor index. The more trees specified, the larger the index, but the better the results. With search_k, determines the accuracy of the Annoy nearest neighbor search. Only used if the nn_method is "annoy". Sensible values are between 10 and 100.

search_k

Number of nodes to search during the neighbor retrieval. The larger k, the more accurate the results, but the longer the search takes. With n_trees, determines the accuracy of the Annoy nearest neighbor search. Only used if the nn_method is "annoy".

n_threads

Number of threads to use (except during stochastic gradientdescent). Default is half the number of concurrent threads supported by thesystem. For nearest neighbor search, only applies ifnn_method = "annoy". Ifn_threads > 1, then the Annoy indexwill be temporarily written to disk in the location determined bytempfile.

n_sgd_threads

Number of threads to use during stochastic gradientdescent. If set to > 1, then be aware that ifbatch = FALSE, resultswillnot be reproducible, even ifset.seed is called with afixed seed before running. Set to"auto" to use the same value asn_threads.

grain_size

The minimum amount of work to do on each thread. If thisvalue is set high enough, then less thann_threads orn_sgd_threads will be used for processing, which might give aperformance improvement if the overhead of thread management and contextswitching was outweighing the improvement due to concurrent processing.This should be left at default (1) and work will be spread evenlyover all the threads specified.

kernel

Type of kernel function to create input probabilities. Can be one of "gauss" (the default) or "knn". "gauss" uses the usual Gaussian weighted similarities. "knn" assigns equal probabilities to every edge in the nearest neighbor graph, and zero otherwise, using perplexity nearest neighbors. The n_neighbors parameter is ignored in this case.

pca

If set to a positive integer value, reduce data to this number of columns using PCA. Not applied if the distance metric is "hamming", or if the dimensions of the data are not larger than the number specified (i.e. the number of rows and columns must be larger than the value of this parameter). If you have > 100 columns in a data frame or matrix, reducing the number of columns in this way may substantially increase the performance of the nearest neighbor search at the cost of a potential decrease in accuracy. In many t-SNE applications, a value of 50 is recommended, although there's no guarantee that this is appropriate for all settings.

pca_center

If TRUE, center the columns of X before carrying out PCA. For binary data, it's recommended to set this to FALSE.

pcg_rand

If TRUE, use the PCG random number generator (O'Neill, 2014) during optimization. Otherwise, use the faster (but probably less statistically good) Tausworthe "taus88" generator. The default is TRUE. This parameter has been superseded by rng_type – if both are set, rng_type takes precedence.

fast_sgd

If TRUE, then the following combination of parameters is set: pcg_rand = TRUE and n_sgd_threads = "auto". The default is FALSE. Setting this to TRUE will speed up the stochastic optimization phase, but give a potentially less accurate embedding, and which will not be exactly reproducible even with a fixed seed. For visualization, fast_sgd = TRUE will give perfectly good results. For more generic dimensionality reduction, it's safer to leave fast_sgd = FALSE. If fast_sgd = TRUE, then user-supplied values of pcg_rand and n_sgd_threads are ignored.

ret_nn

If TRUE, then in addition to the embedding, also return nearest neighbor data that can be used as input to nn_method to avoid the overhead of repeatedly calculating the nearest neighbors when manipulating unrelated parameters (e.g. min_dist, n_epochs, init). See the "Value" section for the names of the list items. If FALSE, just return the coordinates. Note that the nearest neighbors could be sensitive to data scaling, so be wary of reusing nearest neighbor data if modifying the scale parameter.

ret_extra

A vector indicating what extra data to return. May containany combination of the following strings:

  • "nn" same as setting ret_nn = TRUE.

  • "P" the high dimensional probability matrix. The graph is returned as a sparse symmetric N x N matrix of class dgCMatrix-class, where a non-zero entry (i, j) gives the input probability (or similarity or affinity) of the edge connecting vertex i and vertex j. Note that the graph is further sparsified by removing edges with sufficiently low membership strength that they would not be sampled by the probabilistic edge sampling employed for optimization and therefore the number of non-zero elements in the matrix is dependent on n_epochs. If you are only interested in the fuzzy input graph (e.g. for clustering), setting n_epochs = 0 will avoid any further sparsifying. Be aware that setting binary_edge_weights = TRUE will affect this graph (all non-zero edge weights will be 1).

  • sigma a vector of the bandwidths used to calibrate the input Gaussians to reproduce the target "perplexity".
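A minimal sketch of requesting these extras without optimizing an embedding; the item names $P and $sigma are assumed to follow the strings passed to ret_extra, so inspect the returned list if your version differs:

# n_epochs = 0 keeps the input graph unsparsified, e.g. for clustering
res <- lvish(iris[, 1:4], perplexity = 25, n_epochs = 0,
             ret_extra = c("P", "sigma"))
P <- res$P          # sparse symmetric dgCMatrix of input probabilities
sigma <- res$sigma  # per-observation Gaussian bandwidths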

tmpdir

Temporary directory to store nearest neighbor indexes duringnearest neighbor search. Default istempdir. The index isonly written to disk ifn_threads > 1 andnn_method = "annoy"; otherwise, this parameter is ignored.

verbose

If TRUE, log details to the console.

batch

IfTRUE, then embedding coordinates are updated at theend of each epoch rather than during the epoch. In batch mode, results arereproducible with a fixed random seed even withn_sgd_threads > 1,at the cost of a slightly higher memory use. You may also have to modifylearning_rate and increasen_epochs, so whether this providesa speed increase over the single-threaded optimization is likely to bedataset and hardware-dependent.

opt_args

A list of optimizer parameters, used when batch = TRUE. The default optimization method used is Adam (Kingma and Ba, 2014).

  • method The optimization method to use. Either "adam" or "sgd" (stochastic gradient descent). Default: "adam".

  • beta1 (Adam only). The weighting parameter for the exponential moving average of the first moment estimator. Effectively the momentum parameter. Should be a floating point value between 0 and 1. Higher values can smooth oscillatory updates in poorly-conditioned situations and may allow for a larger learning_rate to be specified, but too high can cause divergence. Default: 0.5.

  • beta2 (Adam only). The weighting parameter for the exponential moving average of the uncentered second moment estimator. Should be a floating point value between 0 and 1. Controls the degree of adaptivity in the step-size. Higher values put more weight on previous time steps. Default: 0.9.

  • eps (Adam only). Intended to be a small value to prevent division by zero, but in practice can also affect convergence due to its interaction with beta2. Higher values reduce the effect of the step-size adaptivity and bring the behavior closer to stochastic gradient descent with momentum. Typical values are between 1e-8 and 1e-3. Default: 1e-7.

  • alpha The initial learning rate. Default: the value of the learning_rate parameter.
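A short sketch of passing these optimizer settings; the values are illustrative, not recommendations:

# batch-mode optimization with customized Adam settings
emb <- lvish(iris[, 1:4], perplexity = 25, n_epochs = 50, batch = TRUE,
             opt_args = list(method = "adam", beta1 = 0.8, beta2 = 0.99,
                             eps = 1e-7))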

epoch_callback

A function which will be invoked at the end of every epoch. Its signature should be: (epoch, n_epochs, coords), where:

  • epoch The current epoch number (between 1 and n_epochs).

  • n_epochs Number of epochs to use during the optimization of the embedded coordinates.

  • coords The embedded coordinates as of the end of the current epoch, as a matrix with dimensions (N, n_components).
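For example, a simple callback matching this signature might log progress every few epochs (a sketch only; the reported summary statistic is arbitrary):

log_epoch <- function(epoch, n_epochs, coords) {
  if (epoch %% 10 == 0) {
    message("epoch ", epoch, "/", n_epochs,
            " sd of coordinates: ", signif(sd(coords), 3))
  }
}
emb <- lvish(iris[, 1:4], perplexity = 25, n_epochs = 50,
             epoch_callback = log_epoch)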

pca_method

Method to carry out any PCA dimensionality reduction when the pca parameter is specified. Allowed values are:

  • "irlba". Uses prcomp_irlba from the irlba package.

  • "rsvd". Uses 5 iterations of svdr from the irlba package. This is likely to give much faster but potentially less accurate results than using "irlba". For the purposes of nearest neighbor calculation and coordinates initialization, any loss of accuracy doesn't seem to matter much.

  • "bigstatsr". Uses big_randomSVD from the bigstatsr package. The SVD methods used in bigstatsr may be faster on systems without access to efficient linear algebra libraries (e.g. Windows). Note: bigstatsr is not a dependency of uwot: if you choose to use this package for PCA, you must install it yourself.

  • "svd". Uses svd for the SVD. This is likely to be slow for all but the smallest datasets.

  • "auto" (the default). Uses "irlba", unless more than 50% of the full set of singular vectors would be calculated, in which case "svd" is used.

binary_edge_weights

If TRUE then edge weights in the input graph are treated as binary (0/1) rather than real valued. This affects the sampling frequency of neighbors and is the strategy used by the PaCMAP method (Wang and co-workers, 2020). Practical (Böhm and co-workers, 2020) and theoretical (Damrich and Hamprecht, 2021) work suggests this has little effect on UMAP's performance.

nn_args

A list containing additional arguments to pass to the nearest neighbor method. For nn_method = "annoy", you can specify "n_trees" and "search_k", and these will override the n_trees and search_k parameters. For nn_method = "hnsw", you may specify the following arguments:

  • M The maximum number of neighbors to keep for each vertex. Reasonable values are 2 to 100. Higher values give better recall at the cost of more memory. Default value is 16.

  • ef_construction A positive integer specifying the size of the dynamic list used during index construction. A higher value will provide better results at the cost of a longer time to build the index. Default is 200.

  • ef A positive integer specifying the size of the dynamic list used during search. This cannot be smaller than n_neighbors and cannot be higher than the number of items in the index. Default is 10.

For nn_method = "nndescent", you may specify the following arguments:

  • n_trees The number of trees to use in a random projection forest to initialize the search. A larger number will give more accurate results at the cost of a longer computation time. The default of NULL means that the number is chosen based on the number of observations in X.

  • max_candidates The number of potential neighbors to explore per iteration. By default, this is set to n_neighbors or 60, whichever is smaller. A larger number will give more accurate results at the cost of a longer computation time.

  • n_iters The number of iterations to run the search. A larger number will give more accurate results at the cost of a longer computation time. By default, this will be chosen based on the number of observations in X. You may also need to modify the convergence criterion delta.

  • delta The minimum relative change in the neighbor graph allowed before early stopping. Should be a value between 0 and 1. The smaller the value, the smaller the amount of progress between iterations is allowed. Default value of 0.001 means that at least 0.1% of the neighbor graph must be updated at each iteration.

  • init How to initialize the nearest neighbor descent. By default this is set to "tree" and uses a random projection forest. If you set this to "rand", then a random selection is used. Usually this is less accurate than using RP trees, but for high-dimensional cases, there may be little difference in the quality of the initialization and random initialization will be a lot faster. If you set this to "rand", then the n_trees parameter is ignored.
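A sketch of passing method-specific settings via nn_args; the tuning values are arbitrary, and the HNSW call assumes RcppHNSW is installed:

# Annoy settings here override the top-level n_trees/search_k parameters
emb_annoy <- lvish(iris[, 1:4], perplexity = 25, n_epochs = 20,
                   nn_method = "annoy",
                   nn_args = list(n_trees = 30, search_k = 3000))
if (requireNamespace("RcppHNSW", quietly = TRUE)) {
  emb_hnsw <- lvish(iris[, 1:4], perplexity = 25, n_epochs = 20,
                    nn_method = "hnsw",
                    nn_args = list(M = 32, ef_construction = 300, ef = 100))
}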

rng_type

The type of random number generator to use during optimization. One of:

  • "pcg". Use the PCG random number generator (O'Neill, 2014).

  • "tausworthe". Use the Tausworthe "taus88" generator.

  • "deterministic". Use a deterministic number generator. Thisisn't actually random, but may provide enough variation in the negativesampling to give a good embedding and can provide a noticeable speed-up.

For backwards compatibility, by default this is unset and the choice of pcg_rand is used (making "pcg" the effective default).

Details

lvish differs from the official LargeVis implementation in several respects; see the uwot documentation website for details.

Value

A matrix of optimized coordinates, or, if ret_nn = TRUE or ret_extra is specified, a list containing the embedding coordinates along with the requested extra items.

The returned list contains the combined data from any combination of specifying ret_nn and ret_extra.

References

Belkin, M., & Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems (pp. 585-591). http://papers.nips.cc/paper/1961-laplacian-eigenmaps-and-spectral-techniques-for-embedding-and-clustering.pdf

Böhm, J. N., Berens, P., & Kobak, D. (2020). A unifying perspective on neighbor embeddings along the attraction-repulsion spectrum. arXiv preprint arXiv:2007.08902. https://arxiv.org/abs/2007.08902

Damrich, S., & Hamprecht, F. A. (2021). On UMAP's true loss function. Advances in Neural Information Processing Systems, 34. https://proceedings.neurips.cc/paper/2021/hash/2de5d16682c3c35007e4e92982f1a2ba-Abstract.html

Dong, W., Moses, C., & Li, K. (2011, March). Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th International Conference on World Wide Web (pp. 577-586). ACM. doi:10.1145/1963405.1963487.

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. https://arxiv.org/abs/1412.6980

Lee, J. A., Peluffo-Ordóñez, D. H., & Verleysen, M. (2015). Multi-scale similarities in stochastic neighbour embedding: Reducing dimensionality while preserving both local and global structure. Neurocomputing, 169, 246-261.

Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4), 824-836.

McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426. https://arxiv.org/abs/1802.03426

O'Neill, M. E. (2014). PCG: A family of simple fast space-efficient statistically good algorithms for random number generation (Report No. HMC-CS-2014-0905). Harvey Mudd College.

Tang, J., Liu, J., Zhang, M., & Mei, Q. (2016, April). Visualizing large-scale and high-dimensional data. In Proceedings of the 25th International Conference on World Wide Web (pp. 287-297). International World Wide Web Conferences Steering Committee. https://arxiv.org/abs/1602.00370

Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9 (2579-2605). https://www.jmlr.org/papers/v9/vandermaaten08a.html

Wang, Y., Huang, H., Rudin, C., & Shaposhnik, Y. (2021). Understanding How Dimension Reduction Tools Work: An Empirical Approach to Deciphering t-SNE, UMAP, TriMap, and PaCMAP for Data Visualization. Journal of Machine Learning Research, 22(201), 1-73. https://www.jmlr.org/papers/v22/20-1061.html

Examples

# Default number of epochs is much larger than for UMAP, assumes random
# initialization. Use perplexity rather than n_neighbors to control the size
# of the local neighborhood. 20 epochs may be too small for a random
# initialization.
iris_lvish <- lvish(iris,
  perplexity = 50, learning_rate = 0.5,
  init = "random", n_epochs = 20
)

Optimize Graph Layout

Description

Carry out dimensionality reduction on an input graph, where the distances in the low dimensional space attempt to reproduce the neighbor relations in the input data. By default, the cost function used to optimize the output coordinates is that of the Uniform Manifold Approximation and Projection (UMAP) method (McInnes et al., 2018), but the approach from LargeVis (Tang et al., 2016) can also be used. This function can be used to produce a low dimensional representation of the graph produced by similarity_graph.

Usage

optimize_graph_layout(
  graph,
  X = NULL,
  n_components = 2,
  n_epochs = NULL,
  learning_rate = 1,
  init = "spectral",
  init_sdev = NULL,
  spread = 1,
  min_dist = 0.01,
  repulsion_strength = 1,
  negative_sample_rate = 5,
  a = NULL,
  b = NULL,
  method = "umap",
  approx_pow = FALSE,
  pcg_rand = TRUE,
  fast_sgd = FALSE,
  n_sgd_threads = 0,
  grain_size = 1,
  verbose = getOption("verbose", TRUE),
  batch = FALSE,
  opt_args = NULL,
  epoch_callback = NULL,
  pca_method = NULL,
  binary_edge_weights = FALSE,
  rng_type = NULL
)

Arguments

graph

A sparse, symmetric N x N weighted adjacency matrix representing a graph. Non-zero entries indicate an edge between two nodes with a given edge weight. There can be a varying number of non-zero entries in each row/column.

X

Optional input data. Used only for PCA-based initialization.

n_components

The dimension of the space to embed into. This defaults to 2 to provide easy visualization, but can reasonably be set to any integer value in the range 2 to 100.

n_epochs

Number of epochs to use during the optimization of the embedded coordinates. By default, this value is set to 500 for datasets containing 10,000 vertices or less, and 200 otherwise. If n_epochs = 0, then coordinates determined by "init" will be returned.

learning_rate

Initial learning rate used in optimization of thecoordinates.

init

Type of initialization for the coordinates. Options are:

  • "spectral" Spectral embedding using the normalized Laplacian of the fuzzy 1-skeleton, with Gaussian noise added.

  • "normlaplacian". Spectral embedding using the normalized Laplacian of the fuzzy 1-skeleton, without noise.

  • "random". Coordinates assigned using a uniform random distribution between -10 and 10.

  • "lvrandom". Coordinates assigned using a Gaussian distribution with standard deviation 1e-4, as used in LargeVis (Tang et al., 2016) and t-SNE.

  • "laplacian". Spectral embedding using the Laplacian Eigenmap.

  • "pca". The first two principal components from PCA of X if X is a data frame, and from a 2-dimensional classical MDS if X is of class "dist".

  • "spca". Like "pca", but each dimension is then scaled so the standard deviation is 1e-4, to give a distribution similar to that used in t-SNE. This is an alias for init = "pca", init_sdev = 1e-4.

  • "agspectral" An "approximate global" modification of "spectral" which sets all edges in the graph to a value of 1, and then sets a random number of edges (negative_sample_rate edges per vertex) to 0.1, to approximate the effect of non-local affinities.

  • A matrix of initial coordinates.

For spectral initializations ("spectral", "normlaplacian", "laplacian", "agspectral"), if more than one connected component is identified, no spectral initialization is attempted. Instead a PCA-based initialization is attempted. If verbose = TRUE the number of connected components is logged to the console. The existence of multiple connected components implies that a global view of the data cannot be attained with this initialization. Increasing the value of n_neighbors may help.

init_sdev

If non-NULL, scales each dimension of the initialized coordinates (including any user-supplied matrix) to this standard deviation. By default no scaling is carried out, except when init = "spca", in which case the value is 0.0001. Scaling the input may help if the unscaled versions result in initial coordinates with large inter-point distances or outliers. This usually results in small gradients during optimization and very little progress being made to the layout. Shrinking the initial embedding by rescaling can help under these circumstances. Scaling the result of init = "pca" is usually recommended (init = "spca" is an alias for init = "pca", init_sdev = 1e-4), but for the spectral initializations the scaled versions usually aren't necessary unless you are using a large value of n_neighbors (e.g. n_neighbors = 150 or higher). For compatibility with recent versions of the Python UMAP package, if you are using init = "spectral", then you should also set init_sdev = "range", which will range scale each of the columns containing the initial data between 0-10. This is not set by default to maintain backwards compatibility with previous versions of uwot.

spread

The effective scale of embedded points. In combination withmin_dist, this determines how clustered/clumped the embedded pointsare.

min_dist

The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out.

repulsion_strength

Weighting applied to negative samples in lowdimensional embedding optimization. Values higher than one will result ingreater weight being given to negative samples.

negative_sample_rate

The number of negative edge/1-simplex samples touse per positive edge/1-simplex sample in optimizing the low dimensionalembedding.

a

More specific parameters controlling the embedding. IfNULLthese values are set automatically as determined bymin_dist andspread.

b

More specific parameters controlling the embedding. IfNULLthese values are set automatically as determined bymin_dist andspread.

method

Cost function to optimize. One of:

  • "umap". The UMAP method of McInnes and co-workers (2018).

  • "tumap". UMAP with the a and b parameters fixed to 1.

  • "largevis". The LargeVis method of Tang and co-workers (2016).

approx_pow

If TRUE, use an approximation to the power function in the UMAP gradient, from https://martin.ankerl.com/2012/01/25/optimized-approximative-pow-in-c-and-cpp/.

pcg_rand

IfTRUE, use the PCG random number generator (O'Neill,2014) during optimization. Otherwise, use the faster (but probably lessstatistically good) Tausworthe "taus88" generator. The default isTRUE. This parameter has been superseded byrng_type – ifboth are set,rng_type takes precedence.

fast_sgd

If TRUE, then the following combination of parameters is set: pcg_rand = TRUE, n_sgd_threads = "auto" and approx_pow = TRUE. The default is FALSE. Setting this to TRUE will speed up the stochastic optimization phase, but give a potentially less accurate embedding, and which will not be exactly reproducible even with a fixed seed. For visualization, fast_sgd = TRUE will give perfectly good results. For more generic dimensionality reduction, it's safer to leave fast_sgd = FALSE. If fast_sgd = TRUE, then user-supplied values of pcg_rand, n_sgd_threads, and approx_pow are ignored.

n_sgd_threads

Number of threads to use during stochastic gradientdescent. If set to > 1, then be aware that ifbatch = FALSE, resultswillnot be reproducible, even ifset.seed is called with afixed seed before running. If set to"auto" then half the number ofconcurrent threads supported by the system will be used.

grain_size

The minimum amount of work to do on each thread. If thisvalue is set high enough, then less thann_threads orn_sgd_threads will be used for processing, which might give aperformance improvement if the overhead of thread management and contextswitching was outweighing the improvement due to concurrent processing.This should be left at default (1) and work will be spread evenlyover all the threads specified.

verbose

IfTRUE, log details to the console.

batch

IfTRUE, then embedding coordinates are updated at theend of each epoch rather than during the epoch. In batch mode, results arereproducible with a fixed random seed even withn_sgd_threads > 1,at the cost of a slightly higher memory use. You may also have to modifylearning_rate and increasen_epochs, so whether this providesa speed increase over the single-threaded optimization is likely to bedataset and hardware-dependent.

opt_args

A list of optimizer parameters, used whenbatch = TRUE. The default optimization method used is Adam (Kingmaand Ba, 2014).

  • method The optimization method to use. Either"adam"or"sgd" (stochastic gradient descent). Default:"adam".

  • beta1 (Adam only). The weighting parameter for theexponential moving average of the first moment estimator. Effectively themomentum parameter. Should be a floating point value between 0 and 1.Higher values can smooth oscillatory updates in poorly-conditionedsituations and may allow for a largerlearning_rate to bespecified, but too high can cause divergence. Default:0.5.

  • beta2 (Adam only). The weighting parameter for theexponential moving average of the uncentered second moment estimator.Should be a floating point value between 0 and 1. Controls the degree ofadaptivity in the step-size. Higher values put more weight on previoustime steps. Default:0.9.

  • eps (Adam only). Intended to be a small value to preventdivision by zero, but in practice can also affect convergence due to itsinteraction withbeta2. Higher values reduce the effect of thestep-size adaptivity and bring the behavior closer to stochastic gradientdescent with momentum. Typical values are between 1e-8 and 1e-3. Default:1e-7.

  • alpha The initial learning rate. Default: the value of thelearning_rate parameter.

epoch_callback

A function which will be invoked at the end of everyepoch. Its signature should be:(epoch, n_epochs, coords), where:

  • epoch The current epoch number (between1 andn_epochs).

  • n_epochs Number of epochs to use during the optimization ofthe embedded coordinates.

  • coords The embedded coordinates as of the end of the currentepoch, as a matrix with dimensions (N,n_components).

pca_method

Method to carry out any PCA dimensionality reduction when the pca parameter is specified. Allowed values are:

  • "irlba". Uses prcomp_irlba from the irlba package.

  • "rsvd". Uses 5 iterations of svdr from the irlba package. This is likely to give much faster but potentially less accurate results than using "irlba". For the purposes of nearest neighbor calculation and coordinates initialization, any loss of accuracy doesn't seem to matter much.

  • "bigstatsr". Uses big_randomSVD from the bigstatsr package. The SVD methods used in bigstatsr may be faster on systems without access to efficient linear algebra libraries (e.g. Windows). Note: bigstatsr is not a dependency of uwot: if you choose to use this package for PCA, you must install it yourself.

  • "svd". Uses svd for the SVD. This is likely to be slow for all but the smallest datasets.

  • "auto" (the default). Uses "irlba", unless more than 50% of the full set of singular vectors would be calculated, in which case "svd" is used.

binary_edge_weights

If TRUE then edge weights in the input graph are treated as binary (0/1) rather than real valued.

rng_type

The type of random number generator to use duringoptimization. One of:

  • "pcg". Use the PCG random number generator (O'Neill, 2014).

  • "tausworthe". Use the Tausworthe "taus88" generator.

  • "deterministic". Use a deterministic number generator. Thisisn't actually random, but may provide enough variation in the negativesampling to give a good embedding and can provide a noticeable speed-up.

For backwards compatibility, by default this is unset and the choice ofpcg_rand is used (making "pcg" the effective default).

Value

A matrix of optimized coordinates.

References

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. https://arxiv.org/abs/1412.6980

McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426. https://arxiv.org/abs/1802.03426

O'Neill, M. E. (2014). PCG: A family of simple fast space-efficient statistically good algorithms for random number generation (Report No. HMC-CS-2014-0905). Harvey Mudd College.

Tang, J., Liu, J., Zhang, M., & Mei, Q. (2016, April). Visualizing large-scale and high-dimensional data. In Proceedings of the 25th International Conference on World Wide Web (pp. 287-297). International World Wide Web Conferences Steering Committee. https://arxiv.org/abs/1602.00370

Examples

iris30 <- iris[c(1:10, 51:60, 101:110), ]

# return a 30 x 30 sparse matrix with similarity data based on 10 nearest
# neighbors per item
iris30_sim_graph <- similarity_graph(iris30, n_neighbors = 10)

# produce 2D coordinates replicating the neighbor relations in the similarity
# graph
set.seed(42)
iris30_opt <- optimize_graph_layout(iris30_sim_graph, X = iris30)

# the above two steps are the same as:
# set.seed(42); iris_umap <- umap(iris30, n_neighbors = 10)

Save or Load a Model

Description

Functions to write a UMAP model to a file, and to restore it.

Usage

save_uwot(model, file, unload = FALSE, verbose = FALSE)

Arguments

model

a UMAP model created by umap.

file

name of the file where the model is to be saved or read from.

unload

if TRUE, unload all nearest neighbor indexes for the model. The model will no longer be valid for use in umap_transform and the temporary working directory used during model saving will be deleted. You will need to reload the model with load_uwot to use the model. If FALSE, then the model can be re-used without reloading, but you must manually unload the NN index when you are finished using it if you want to delete the temporary working directory. To unload manually, use unload_uwot. The absolute path of the working directory is found in the mod_dir item of the return value.

verbose

if TRUE, log information to the console.

Value

model with one extra item: mod_dir, which contains the path to the working directory. If unload = FALSE then this directory still exists after this function returns, and can be cleaned up with unload_uwot. If you don't care about cleaning up this directory, or unload = TRUE, then you can ignore the return value.

See Also

load_uwot, unload_uwot

Examples

iris_train <- iris[c(1:10, 51:60), ]
iris_test <- iris[100:110, ]

# create model
model <- umap(iris_train, ret_model = TRUE, n_epochs = 20)

# save without unloading: this leaves behind a temporary working directory
model_file <- tempfile("iris_umap")
model <- save_uwot(model, file = model_file)

# The model can continue to be used
test_embedding <- umap_transform(iris_test, model)

# To manually unload the model from memory when finished and to clean up
# the working directory (this doesn't touch your model file)
unload_uwot(model)

# At this point, model cannot be used with umap_transform, this would fail:
# test_embedding2 <- umap_transform(iris_test, model)

# restore the model: this also creates a temporary working directory
model2 <- load_uwot(file = model_file)
test_embedding2 <- umap_transform(iris_test, model2)

# Unload and clean up the loaded model temp directory
unload_uwot(model2)

# clean up the model file
unlink(model_file)

# save with unloading: this deletes the temporary working directory but
# doesn't allow the model to be re-used
model3 <- umap(iris_train, ret_model = TRUE, n_epochs = 20)
model_file3 <- tempfile("iris_umap")
model3 <- save_uwot(model3, file = model_file3, unload = TRUE)

Similarity Graph

Description

Create a graph (as a sparse symmetric weighted adjacency matrix) representing the similarities between items in a data set. No dimensionality reduction is carried out. By default, the similarities are calculated using the merged fuzzy simplicial set approach in the Uniform Manifold Approximation and Projection (UMAP) method (McInnes et al., 2018), but the approach from LargeVis (Tang et al., 2016) can also be used.

Usage

similarity_graph(
  X = NULL,
  n_neighbors = NULL,
  metric = "euclidean",
  scale = NULL,
  set_op_mix_ratio = 1,
  local_connectivity = 1,
  nn_method = NULL,
  n_trees = 50,
  search_k = 2 * n_neighbors * n_trees,
  perplexity = 50,
  method = "umap",
  y = NULL,
  target_n_neighbors = n_neighbors,
  target_metric = "euclidean",
  target_weight = 0.5,
  pca = NULL,
  pca_center = TRUE,
  ret_extra = c(),
  n_threads = NULL,
  grain_size = 1,
  kernel = "gauss",
  tmpdir = tempdir(),
  verbose = getOption("verbose", TRUE),
  pca_method = NULL,
  binary_edge_weights = FALSE,
  nn_args = list()
)

Arguments

X

Input data. Can be a data.frame, matrix, dist object or sparseMatrix. Matrix and data frames should contain one observation per row. Data frames will have any non-numeric columns removed, although factor columns will be used if explicitly included via metric (see the help for metric for details). A sparse matrix is interpreted as a distance matrix, and is assumed to be symmetric, so you can also pass in an explicitly upper or lower triangular sparse matrix to save storage. There must be at least n_neighbors non-zero distances for each row. Both implicit and explicit zero entries are ignored. Set zero distances you want to keep to an arbitrarily small non-zero value (e.g. 1e-10). X can also be NULL if pre-computed nearest neighbor data is passed to nn_method.

n_neighbors

The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100.

metric

Type of distance metric to use to find nearest neighbors. For nn_method = "annoy" this can be one of:

  • "euclidean" (the default)

  • "cosine"

  • "manhattan"

  • "hamming"

  • "correlation" (a distance based on the Pearson correlation)

  • "categorical" (see below)

For nn_method = "hnsw" this can be one of:

  • "euclidean"

  • "cosine"

  • "correlation"

If rnndescent is installed and nn_method = "nndescent" is specified then many more metrics are available, including:

  • "braycurtis"

  • "canberra"

  • "chebyshev"

  • "dice"

  • "hamming"

  • "hellinger"

  • "jaccard"

  • "jensenshannon"

  • "kulsinski"

  • "rogerstanimoto"

  • "russellrao"

  • "sokalmichener"

  • "sokalsneath"

  • "spearmanr"

  • "symmetrickl"

  • "tsss"

  • "yule"

For more details see the package documentation of rnndescent. For nn_method = "fnn", the distance metric is always "euclidean".

If X is a data frame or matrix, then multiple metrics can be specified, by passing a list to this argument, where the name of each item in the list is one of the metric names above. The value of each list item should be a vector giving the names or integer ids of the columns to be included in a calculation, e.g. metric = list(euclidean = 1:4, manhattan = 5:10).

Each metric calculation results in a separate fuzzy simplicial set, which are intersected together to produce the final set. Metric names can be repeated. Because non-numeric columns are removed from the data frame, it is safer to use column names than integer ids.

Factor columns can also be used by specifying the metric name "categorical". Factor columns are treated differently from numeric columns and although multiple factor columns can be specified in a vector, each factor column specified is processed individually. If you specify a non-factor column, it will be coerced to a factor.

For a given data block, you may override the pca and pca_center arguments for that block, by providing a list with one unnamed item containing the column names or ids, and then any of the pca or pca_center overrides as named items, e.g. metric = list(euclidean = 1:4, manhattan = list(5:10, pca_center = FALSE)). This exists to allow mixed binary and real-valued data to be included and to have PCA applied to both, but with centering applied only to the real-valued data (it is typical not to apply centering to binary data before PCA is applied).

scale

Scaling to apply to X if it is a data frame or matrix:

  • "none" or FALSE or NULL No scaling.

  • "Z" or "scale" or TRUE Scale each column to zero mean and variance 1.

  • "maxabs" Center each column to mean 0, then divide each element by the maximum absolute value over the entire matrix.

  • "range" Range scale the entire matrix, so the smallest element is 0 and the largest is 1.

  • "colrange" Scale each column in the range (0,1).

For method "umap", the default is "none". For "largevis", the default is "maxabs".

set_op_mix_ratio

Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial set. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection. Ignored if method = "largevis".

local_connectivity

The local connectivity required – i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should be not more than the local intrinsic dimension of the manifold. Ignored if method = "largevis".

nn_method

Method for finding nearest neighbors. Options are:

  • "fnn". Use exact nearest neighbors via the FNN package.

  • "annoy" Use approximate nearest neighbors via the RcppAnnoy package.

  • "hnsw" Use approximate nearest neighbors with the Hierarchical Navigable Small World (HNSW) method (Malkov and Yashunin, 2018) via the RcppHNSW package. RcppHNSW is not a dependency of this package: this option is only available if you have installed RcppHNSW yourself. Also, HNSW only supports the following arguments for metric and target_metric: "euclidean", "cosine" and "correlation".

  • "nndescent" Use approximate nearest neighbors with the Nearest Neighbor Descent method (Dong et al., 2011) via the rnndescent package. rnndescent is not a dependency of this package: this option is only available if you have installed rnndescent yourself.

By default, if X has less than 4,096 vertices, the exact nearest neighbors are found. Otherwise, approximate nearest neighbors are used. You may also pass pre-calculated nearest neighbor data to this argument. It must be one of two formats, either a list consisting of two elements:

  • "idx". A n_vertices x n_neighbors matrix containing the integer indexes of the nearest neighbors in X. Each vertex is considered to be its own nearest neighbor, i.e. idx[, 1] == 1:n_vertices.

  • "dist". A n_vertices x n_neighbors matrix containing the distances of the nearest neighbors.

or a sparse distance matrix of type dgCMatrix, with dimensions n_vertices x n_vertices. Distances should be arranged by column, i.e. a non-zero entry in row j of the ith column indicates that the jth observation in X is a nearest neighbor of the ith observation with the distance given by the value of that element. The n_neighbors parameter is ignored when using precomputed nearest neighbor data. If using the sparse distance matrix input, each column can contain a different number of neighbors.
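A minimal sketch of the sparse distance matrix format, built by brute force for a small subset of iris (fine for illustration, not for large data):

library(Matrix)
X <- as.matrix(iris[1:50, 1:4])
d <- as.matrix(dist(X))
k <- 10
# for each observation (column), the indices of its k nearest neighbors,
# skipping the self-distance at position 1
nbr <- apply(d, 2, function(col) order(col)[2:(k + 1)])
i_idx <- as.vector(nbr)                   # row index: the neighbor
j_idx <- rep(seq_len(ncol(d)), each = k)  # column index: the observation
sp <- sparseMatrix(i = i_idx, j = j_idx, x = d[cbind(i_idx, j_idx)],
                   dims = dim(d))
sim <- similarity_graph(X = NULL, nn_method = sp)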

n_trees

Number of trees to build when constructing the nearestneighbor index. The more trees specified, the larger the index, but thebetter the results. Withsearch_k, determines the accuracy of theAnnoy nearest neighbor search. Only used if thenn_method is"annoy". Sensible values are between10 to100.

search_k

Number of nodes to search during the neighbor retrieval. The larger k, the more accurate the results, but the longer the search takes. With n_trees, determines the accuracy of the Annoy nearest neighbor search. Only used if the nn_method is "annoy".

perplexity

Used only if method = "largevis". Controls the size of the local neighborhood used for manifold approximation. Should be a value between 1 and one less than the number of items in X. If specified, you should not specify a value for n_neighbors unless you know what you are doing.

method

How to generate the similarities between items. One of:

  • "umap" The UMAP method of McInnes et al. (2018).

  • "largevis" The LargeVis method of Tang et al. (2016).

y

Optional target data to add supervised or semi-supervised weighting to the similarity graph. Can be a vector, matrix or data frame. Use the target_metric parameter to specify the metrics to use, using the same syntax as metric. Usually either a single numeric or factor column is used, but more complex formats are possible. The following types are allowed:

  • Factor columns with the same length as X. NA is allowed for any observation with an unknown level, in which case UMAP operates as a form of semi-supervised learning. Each column is treated separately.

  • Numeric data. NA is not allowed in this case. Use the parameter target_n_neighbors to set the number of neighbors used with y. If unset, n_neighbors is used. Unlike factors, numeric columns are grouped into one block unless target_metric specifies otherwise. For example, if you wish columns a and b to be treated separately, specify target_metric = list(euclidean = "a", euclidean = "b"). Otherwise, the data will be effectively treated as a matrix with two columns.

  • Nearest neighbor data, consisting of a list of two matrices, idx and dist. These represent the precalculated nearest neighbor indices and distances, respectively. This is the same format as that expected for precalculated data in nn_method. This format assumes that the underlying data was a numeric vector. Any user-supplied value of the target_n_neighbors parameter is ignored in this case, because the number of columns in the matrices is used for the value. Multiple nearest neighbor data using different metrics can be supplied by passing a list of these lists.

Unlike X, all factor columns included in y are automatically used. This parameter is ignored if method = "largevis".
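A brief sketch of semi-supervised use, with some labels deliberately marked as unknown (the fraction masked is arbitrary):

y <- iris$Species
y[sample(length(y), 30)] <- NA   # observations with unknown labels
sg <- similarity_graph(iris[, 1:4], n_neighbors = 15,
                       y = y, target_weight = 0.5)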

target_n_neighbors

Number of nearest neighbors to use to construct thetarget simplicial set. Default value isn_neighbors. Applies only ify is non-NULL andnumeric. This parameter is ignoredifmethod = "largevis".

target_metric

The metric used to measure distance fory ifusing supervised dimension reduction. Used only ify is numeric.This parameter is ignored ifmethod = "largevis".

target_weight

Weighting factor between data topology and targettopology. A value of 0.0 weights entirely on data, a value of 1.0 weightsentirely on target. The default of 0.5 balances the weighting equallybetween data and target. Only applies ify is non-NULL. Thisparameter is ignored ifmethod = "largevis".

pca

If set to a positive integer value, reduce data to this number of columns using PCA. Not applied if the distance metric is "hamming", or if the dimensions of the data are not larger than the number specified (i.e. the number of rows and columns must be larger than the value of this parameter). If you have > 100 columns in a data frame or matrix, reducing the number of columns in this way may substantially increase the performance of the nearest neighbor search at the cost of a potential decrease in accuracy. In many t-SNE applications, a value of 50 is recommended, although there's no guarantee that this is appropriate for all settings.

pca_center

If TRUE, center the columns of X before carrying out PCA. For binary data, it's recommended to set this to FALSE.

ret_extra

A vector indicating what extra data to return. May containany combination of the following strings:

  • "nn" nearest neighbor data that can be used as input to nn_method to avoid the overhead of repeatedly calculating the nearest neighbors when manipulating unrelated parameters. See the "Value" section for the names of the list items. Note that the nearest neighbors could be sensitive to data scaling, so be wary of reusing nearest neighbor data if modifying the scale parameter.

  • "sigma" the normalization value for each observation in the dataset when constructing the smoothed distances to each of its neighbors. This gives some sense of the local density of each observation in the high dimensional space: higher values of sigma indicate a higher dispersion or lower density.
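A minimal sketch of requesting these extras; the item names below are assumed to follow the strings passed to ret_extra, with the graph itself in an item named similarity_graph, so inspect names(res) if your version differs:

res <- similarity_graph(iris[, 1:4], n_neighbors = 15,
                        ret_extra = c("nn", "sigma"))
graph <- res$similarity_graph
nn <- res$nn
sigma <- res$sigma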

n_threads

Number of threads to use. Default is half the number of concurrent threads supported by the system. For nearest neighbor search, only applies if nn_method = "annoy". If n_threads > 1, then the Annoy index will be temporarily written to disk in the location determined by tempfile.

grain_size

The minimum amount of work to do on each thread. If this value is set high enough, then fewer than n_threads threads will be used for processing, which might give a performance improvement if the overhead of thread management and context switching was outweighing the improvement due to concurrent processing. This should be left at the default (1) and work will be spread evenly over all the threads specified.

kernel

Used only if method = "largevis". Type of kernel function to create input similarities. Can be one of "gauss" (the default) or "knn". "gauss" uses the usual Gaussian weighted similarities. "knn" assigns equal similarities to every edge in the nearest neighbor graph, and zero otherwise, using perplexity nearest neighbors. The n_neighbors parameter is ignored in this case.
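For example, using the "knn" kernel (values are illustrative):

iris30 <- iris[c(1:10, 51:60, 101:110), ]
# Equal-weight edges to the perplexity nearest neighbors of each point
sg_knn <- similarity_graph(iris30, method = "largevis",
                           kernel = "knn", perplexity = 10)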

tmpdir

Temporary directory to store nearest neighbor indexes during nearest neighbor search. Default is tempdir. The index is only written to disk if n_threads > 1 and nn_method = "annoy"; otherwise, this parameter is ignored.

verbose

IfTRUE, log details to the console.

pca_method

Method to carry out any PCA dimensionality reduction when the pca parameter is specified. Allowed values are:

  • "irlba". Uses prcomp_irlba from the irlba package.

  • "rsvd". Uses 5 iterations of svdr from the irlba package. This is likely to give much faster but potentially less accurate results than using "irlba". For the purposes of nearest neighbor calculation and coordinates initialization, any loss of accuracy doesn't seem to matter much.

  • "bigstatsr". Uses big_randomSVD from the bigstatsr package. The SVD methods used in bigstatsr may be faster on systems without access to efficient linear algebra libraries (e.g. Windows). Note: bigstatsr is not a dependency of uwot: if you choose to use this package for PCA, you must install it yourself.

  • "svd". Uses svd for the SVD. This is likely to be slow for all but the smallest datasets.

  • "auto" (the default). Uses "irlba", unless more than 50% of the full set of singular vectors would be calculated, in which case "svd" is used.

binary_edge_weights

If TRUE then edge weights of the returned graph are binary (0/1) rather than reflecting the degree of similarity.

nn_args

A list containing additional arguments to pass to the nearest neighbor method. For nn_method = "annoy", you can specify "n_trees" and "search_k", and these will override the n_trees and search_k parameters. For nn_method = "hnsw", you may specify the following arguments:

  • M The maximum number of neighbors to keep for each vertex. Reasonable values are 2 to 100. Higher values give better recall at the cost of more memory. Default value is 16.

  • ef_construction A positive integer specifying the size of the dynamic list used during index construction. A higher value will provide better results at the cost of a longer time to build the index. Default is 200.

  • ef A positive integer specifying the size of the dynamic list used during search. This cannot be smaller than n_neighbors and cannot be higher than the number of items in the index. Default is 10.

For nn_method = "nndescent", you may specify the following arguments (a brief usage sketch follows this list):

  • n_trees The number of trees to use in a random projection forest to initialize the search. A larger number will give more accurate results at the cost of a longer computation time. The default of NULL means that the number is chosen based on the number of observations in X.

  • max_candidates The number of potential neighbors to explore per iteration. By default, this is set to n_neighbors or 60, whichever is smaller. A larger number will give more accurate results at the cost of a longer computation time.

  • n_iters The number of iterations to run the search. A larger number will give more accurate results at the cost of a longer computation time. By default, this will be chosen based on the number of observations in X. You may also need to modify the convergence criterion delta.

  • delta The minimum relative change in the neighbor graph allowed before early stopping. Should be a value between 0 and 1. The smaller the value, the smaller the amount of progress between iterations is allowed. The default value of 0.001 means that at least 0.1% of the neighbor graph must be updated at each iteration.

  • init How to initialize the nearest neighbor descent. By default this is set to "tree" and uses a random projection forest. If you set this to "rand", then a random selection is used. Usually this is less accurate than using RP trees, but for high-dimensional cases, there may be little difference in the quality of the initialization and random initialization will be a lot faster. If you set this to "rand", then the n_trees parameter is ignored.

  • pruning_degree_multiplier The maximum number of edges per node to retain in the search graph, relative to n_neighbors. A larger value will give more accurate results at the cost of a longer computation time. Default is 1.5. This parameter only affects neighbor search when transforming new data with umap_transform.

  • epsilon Controls the degree of back-tracking when traversing the search graph. Setting this to 0.0 will do a greedy search with no back-tracking. A larger value will give more accurate results at the cost of a longer computation time. Default is 0.1. This parameter only affects neighbor search when transforming new data with umap_transform.

  • max_search_fraction Specifies the maximum fraction of the search graph to traverse. By default, this is set to 1.0, so the entire graph (i.e. all items in X) may be visited. You may want to set this to a smaller value if you have a very large dataset (in conjunction with epsilon) to avoid an inefficient exhaustive search of the data in X. This parameter only affects neighbor search when transforming new data with umap_transform.
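A minimal sketch of passing backend-specific settings via nn_args; the values are illustrative only, and the commented-out "hnsw" variant assumes RcppHNSW is installed:

# Annoy: equivalent to setting n_trees and search_k directly
sg_annoy <- similarity_graph(iris, n_neighbors = 15, nn_method = "annoy",
                             nn_args = list(n_trees = 25, search_k = 1500))
# HNSW (requires RcppHNSW): trade memory and build time for better recall
# sg_hnsw <- similarity_graph(iris, n_neighbors = 15, nn_method = "hnsw",
#                             nn_args = list(M = 32, ef_construction = 400, ef = 50))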

Details

This is equivalent to running umap with the ret_extra = c("fgraph") parameter, but without the overhead of calculating (or returning) the optimized low-dimensional coordinates.

Value

A sparse symmetrized matrix of the similarities between the items in X or, if nn_method contains pre-computed nearest neighbor data, the items in nn_method. Because of the symmetrization, there may be more non-zero items in each column than the specified value of n_neighbors (or pre-computed neighbors in nn_method). If ret_extra is specified then the return value will be a list containing the similarity graph (as similarity_graph) along with any requested extra items (e.g. nn, sigma).

References

Dong, W., Moses, C., & Li, K. (2011, March). Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World Wide Web (pp. 577-586). ACM. doi:10.1145/1963405.1963487.

Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE transactions on pattern analysis and machine intelligence, 42(4), 824-836.

McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426. https://arxiv.org/abs/1802.03426

Tang, J., Liu, J., Zhang, M., & Mei, Q. (2016, April). Visualizing large-scale and high-dimensional data. In Proceedings of the 25th International Conference on World Wide Web (pp. 287-297). International World Wide Web Conferences Steering Committee. https://arxiv.org/abs/1602.00370

Examples

iris30 <- iris[c(1:10, 51:60, 101:110), ]

# return a 30 x 30 sparse matrix with similarity data based on 10 nearest
# neighbors per item
iris30_sim_graph <- similarity_graph(iris30, n_neighbors = 10)

# Default is to use the UMAP method of calculating similarities, but LargeVis
# is also available: for that method, use perplexity instead of n_neighbors
# to control neighborhood size. Use ret_extra = "nn" to return nearest
# neighbor data as well as the similarity graph. Return value is a list
# containing 'similarity_graph' and 'nn' items.
iris30_lv_graph <- similarity_graph(iris30,
  perplexity = 10,
  method = "largevis", ret_extra = "nn"
)

# If you have the neighbor information you don't need the original data
iris30_lv_graph_nn <- similarity_graph(
  nn_method = iris30_lv_graph$nn,
  perplexity = 10, method = "largevis"
)
all(iris30_lv_graph_nn == iris30_lv_graph$similarity_graph)

Merge Similarity Graph by Simplicial Set Intersection

Description

Combine two similarity graphs by treating them as fuzzy topological sets and forming the intersection.

Usage

simplicial_set_intersect(x, y, weight = 0.5, n_threads = NULL, verbose = FALSE)

Arguments

x

A sparse matrix representing the first similarity graph in theintersection operation.

y

A sparse matrix representing the second similarity graph in theintersection operation.

weight

A value between 0 and 1, controlling the relative influence of x and y in the intersection. The default (0.5) gives equal influence. Values smaller than 0.5 put more weight on x. Values greater than 0.5 put more weight on y.

n_threads

Number of threads to use when resetting the local metric. Default is half the number of concurrent threads supported by the system.

verbose

IfTRUE, log progress to the console.

Value

A sparse matrix containing the intersection of x and y.

Examples

# Form two different "views" of the same data
iris30 <- iris[c(1:10, 51:60, 101:110), ]
iris_sg12 <- similarity_graph(iris30[, 1:2], n_neighbors = 5)
iris_sg34 <- similarity_graph(iris30[, 3:4], n_neighbors = 5)

# Combine the two representations into one
iris_combined <- simplicial_set_intersect(iris_sg12, iris_sg34)

# Optimize the layout based on the combined view
iris_combined_umap <- optimize_graph_layout(iris_combined, n_epochs = 100)

Merge Similarity Graph by Simplicial Set Union

Description

Combine two similarity graphs by treating them as fuzzy topological sets and forming the union.

Usage

simplicial_set_union(x, y, n_threads = NULL, verbose = FALSE)

Arguments

x

A sparse matrix representing the first similarity graph in the unionoperation.

y

A sparse matrix representing the second similarity graph in theunion operation.

n_threads

Number of threads to use when resetting the local metric. Default is half the number of concurrent threads supported by the system.

verbose

IfTRUE, log progress to the console.

Value

A sparse matrix containing the union of x and y.

Examples

# Form two different "views" of the same data
iris30 <- iris[c(1:10, 51:60, 101:110), ]
iris_sg12 <- similarity_graph(iris30[, 1:2], n_neighbors = 5)
iris_sg34 <- similarity_graph(iris30[, 3:4], n_neighbors = 5)

# Combine the two representations into one
iris_combined <- simplicial_set_union(iris_sg12, iris_sg34)

# Optimize the layout based on the combined view
iris_combined_umap <- optimize_graph_layout(iris_combined, n_epochs = 100)

Dimensionality Reduction Using t-Distributed UMAP (t-UMAP)

Description

A faster (but less flexible) version of the UMAP (McInnes et al., 2018) gradient. For more detail on UMAP, see the umap function.

Usage

tumap(  X,  n_neighbors = 15,  n_components = 2,  metric = "euclidean",  n_epochs = NULL,  learning_rate = 1,  scale = FALSE,  init = "spectral",  init_sdev = NULL,  set_op_mix_ratio = 1,  local_connectivity = 1,  bandwidth = 1,  repulsion_strength = 1,  negative_sample_rate = 5,  nn_method = NULL,  n_trees = 50,  search_k = 2 * n_neighbors * n_trees,  n_threads = NULL,  n_sgd_threads = 0,  grain_size = 1,  y = NULL,  target_n_neighbors = n_neighbors,  target_metric = "euclidean",  target_weight = 0.5,  pca = NULL,  pca_center = TRUE,  pcg_rand = TRUE,  fast_sgd = FALSE,  ret_model = FALSE,  ret_nn = FALSE,  ret_extra = c(),  tmpdir = tempdir(),  verbose = getOption("verbose", TRUE),  batch = FALSE,  opt_args = NULL,  epoch_callback = NULL,  pca_method = NULL,  binary_edge_weights = FALSE,  seed = NULL,  nn_args = list(),  rng_type = NULL)

Arguments

X

Input data. Can be a data.frame, matrix, dist object or sparseMatrix. Matrices and data frames should contain one observation per row. Data frames will have any non-numeric columns removed, although factor columns will be used if explicitly included via metric (see the help for metric for details). A sparse matrix is interpreted as a distance matrix, and is assumed to be symmetric, so you can also pass in an explicitly upper or lower triangular sparse matrix to save storage. There must be at least n_neighbors non-zero distances for each row. Both implicit and explicit zero entries are ignored. Set zero distances you want to keep to an arbitrarily small non-zero value (e.g. 1e-10). X can also be NULL if pre-computed nearest neighbor data is passed to nn_method, and init is not "spca" or "pca".

n_neighbors

The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100.

n_components

The dimension of the space to embed into. This defaults to 2 to provide easy visualization, but can reasonably be set to any integer value in the range 2 to 100.

metric

Type of distance metric to use to find nearest neighbors. For nn_method = "annoy" this can be one of:

  • "euclidean" (the default)

  • "cosine"

  • "manhattan"

  • "hamming"

  • "correlation" (a distance based on the Pearson correlation)

  • "categorical" (see below)

For nn_method = "hnsw" this can be one of:

  • "euclidean"

  • "cosine"

  • "correlation"

If rnndescent is installed and nn_method = "nndescent" is specified then many more metrics are available, including:

  • "braycurtis"

  • "canberra"

  • "chebyshev"

  • "dice"

  • "hamming"

  • "hellinger"

  • "jaccard"

  • "jensenshannon"

  • "kulsinski"

  • "rogerstanimoto"

  • "russellrao"

  • "sokalmichener"

  • "sokalsneath"

  • "spearmanr"

  • "symmetrickl"

  • "tsss"

  • "yule"

For more details see the package documentation of rnndescent. For nn_method = "fnn", the distance metric is always "euclidean".

If X is a data frame or matrix, then multiple metrics can be specified by passing a list to this argument, where the name of each item in the list is one of the metric names above. The value of each list item should be a vector giving the names or integer ids of the columns to be included in a calculation, e.g. metric = list(euclidean = 1:4, manhattan = 5:10).

Each metric calculation results in a separate fuzzy simplicial set; these are intersected together to produce the final set. Metric names can be repeated. Because non-numeric columns are removed from the data frame, it is safer to use column names than integer ids.

Factor columns can also be used by specifying the metric name "categorical". Factor columns are treated differently from numeric columns and although multiple factor columns can be specified in a vector, each factor column specified is processed individually. If you specify a non-factor column, it will be coerced to a factor.

For a given data block, you may override the pca and pca_center arguments for that block, by providing a list with one unnamed item containing the column names or ids, and then any of the pca or pca_center overrides as named items, e.g. metric = list(euclidean = 1:4, manhattan = list(5:10, pca_center = FALSE)). This exists to allow mixed binary and real-valued data to be included and to have PCA applied to both, but with centering applied only to the real-valued data (it is typical not to apply centering to binary data before PCA is applied).
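A minimal sketch of mixing metrics across named column blocks (the data frame and column names are purely illustrative; nn_method = "annoy" is set explicitly because "fnn" only supports the Euclidean metric):

set.seed(1)
df <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100), d = rnorm(100),
                 cls = factor(sample(letters[1:3], 100, replace = TRUE)))
# Euclidean on columns a and b, Manhattan on c and d, and the factor column
# cls treated as categorical data
emb <- tumap(df, n_neighbors = 10, nn_method = "annoy",
             metric = list(euclidean = c("a", "b"),
                           manhattan = c("c", "d"),
                           categorical = "cls"))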

n_epochs

Number of epochs to use during the optimization of the embedded coordinates. By default, this value is set to 500 for datasets containing 10,000 vertices or less, and 200 otherwise. If n_epochs = 0, then coordinates determined by "init" will be returned.

learning_rate

Initial learning rate used in optimization of thecoordinates.

scale

Scaling to apply toX if it is a data frame or matrix:

  • "none" or FALSE or NULL No scaling.

  • "Z" or "scale" or TRUE Scale each column to zero mean and variance 1.

  • "maxabs" Center each column to mean 0, then divide each element by the maximum absolute value over the entire matrix.

  • "range" Range scale the entire matrix, so the smallest element is 0 and the largest is 1.

  • "colrange" Scale each column in the range (0,1).

For t-UMAP, the default is "none".

init

Type of initialization for the coordinates. Options are:

  • "spectral" Spectral embedding using the normalized Laplacian of the fuzzy 1-skeleton, with Gaussian noise added.

  • "normlaplacian". Spectral embedding using the normalized Laplacian of the fuzzy 1-skeleton, without noise.

  • "random". Coordinates assigned using a uniform random distribution between -10 and 10.

  • "lvrandom". Coordinates assigned using a Gaussian distribution with standard deviation 1e-4, as used in LargeVis (Tang et al., 2016) and t-SNE.

  • "laplacian". Spectral embedding using the Laplacian Eigenmap (Belkin and Niyogi, 2002).

  • "pca". The first two principal components from PCA of X if X is a data frame, and from a 2-dimensional classical MDS if X is of class "dist".

  • "spca". Like "pca", but each dimension is then scaled so the standard deviation is 1e-4, to give a distribution similar to that used in t-SNE. This is an alias for init = "pca", init_sdev = 1e-4.

  • "agspectral" An "approximate global" modification of "spectral" which sets all edges in the graph to a value of 1, and then sets a random number of edges (negative_sample_rate edges per vertex) to 0.1, to approximate the effect of non-local affinities.

  • A matrix of initial coordinates.

For spectral initializations ("spectral", "normlaplacian", "laplacian", "agspectral"), if more than one connected component is identified, no spectral initialization is attempted. Instead a PCA-based initialization is attempted. If verbose = TRUE the number of connected components is logged to the console. The existence of multiple connected components implies that a global view of the data cannot be attained with this initialization. Increasing the value of n_neighbors may help.

init_sdev

If non-NULL, scales each dimension of the initialized coordinates (including any user-supplied matrix) to this standard deviation. By default no scaling is carried out, except when init = "spca", in which case the value is 0.0001. Scaling the input may help if the unscaled versions result in initial coordinates with large inter-point distances or outliers. This usually results in small gradients during optimization and very little progress being made to the layout. Shrinking the initial embedding by rescaling can help under these circumstances. Scaling the result of init = "pca" is usually recommended, and init = "spca" exists as an alias for init = "pca", init_sdev = 1e-4; for the spectral initializations the scaled versions usually aren't necessary unless you are using a large value of n_neighbors (e.g. n_neighbors = 150 or higher). For compatibility with recent versions of the Python UMAP package, if you are using init = "spectral", then you should also set init_sdev = "range", which will range scale each of the columns containing the initial data between 0-10. This is not set by default to maintain backwards compatibility with previous versions of uwot.
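For example, the Python-compatible setting mentioned above:

# Range-scale the spectral initialization to 0-10, matching recent versions
# of the Python UMAP package
emb <- tumap(iris, init = "spectral", init_sdev = "range")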

set_op_mix_ratio

Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial set. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.

local_connectivity

The local connectivity required – i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should be not more than the local intrinsic dimension of the manifold.

bandwidth

The effective bandwidth of the kernel if we view the algorithm as similar to Laplacian Eigenmaps. Larger values induce more connectivity and a more global view of the data, smaller values concentrate more locally.

repulsion_strength

Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples.

negative_sample_rate

The number of negative edge/1-simplex samples to use per positive edge/1-simplex sample in optimizing the low dimensional embedding.

nn_method

Method for finding nearest neighbors. Options are:

  • "fnn". Use exact nearest neighbors via the FNN package.

  • "annoy" Use approximate nearest neighbors via the RcppAnnoy package.

  • "hnsw" Use approximate nearest neighbors with the Hierarchical Navigable Small World (HNSW) method (Malkov and Yashunin, 2018) via the RcppHNSW package. RcppHNSW is not a dependency of this package: this option is only available if you have installed RcppHNSW yourself. Also, HNSW only supports the following arguments for metric and target_metric: "euclidean", "cosine" and "correlation".

  • "nndescent" Use approximate nearest neighbors with the Nearest Neighbor Descent method (Dong et al., 2011) via the rnndescent package. rnndescent is not a dependency of this package: this option is only available if you have installed rnndescent yourself.

By default, if X has less than 4,096 vertices, the exact nearest neighbors are found. Otherwise, approximate nearest neighbors are used. You may also pass pre-calculated nearest neighbor data to this argument. It must be in one of two formats, either a list consisting of two elements (a brief sketch of this format follows below):

  • "idx". A n_vertices x n_neighbors matrix containing the integer indexes of the nearest neighbors in X. Each vertex is considered to be its own nearest neighbor, i.e. idx[, 1] == 1:n_vertices.

  • "dist". A n_vertices x n_neighbors matrix containing the distances of the nearest neighbors.

or a sparse distance matrix of type dgCMatrix, with dimensions n_vertices x n_vertices. Distances should be arranged by column, i.e. a non-zero entry in row j of the ith column indicates that the jth observation in X is a nearest neighbor of the ith observation, with the distance given by the value of that element. The n_neighbors parameter is ignored when using precomputed nearest neighbor data. If using the sparse distance matrix input, each column can contain a different number of neighbors.
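A rough sketch of building the list format by hand with FNN (an existing dependency of uwot); note that get.knn excludes each point from its own neighbor list, so the self-neighbors are added explicitly:

library(FNN)
X <- as.matrix(iris[, 1:4])
k <- 10
knn <- FNN::get.knn(X, k = k - 1)
nn <- list(
  idx  = cbind(seq_len(nrow(X)), knn$nn.index),  # each point is its own first neighbor
  dist = cbind(0, knn$nn.dist)                   # at distance zero
)
emb <- tumap(X, nn_method = nn)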

n_trees

Number of trees to build when constructing the nearest neighbor index. The more trees specified, the larger the index, but the better the results. Together with search_k, this determines the accuracy of the Annoy nearest neighbor search. Only used if nn_method is "annoy". Sensible values are between 10 and 100.

search_k

Number of nodes to search during the neighbor retrieval. The larger k, the more accurate the results, but the longer the search takes. Together with n_trees, this determines the accuracy of the Annoy nearest neighbor search. Only used if nn_method is "annoy".

n_threads

Number of threads to use (except during stochastic gradient descent). Default is half the number of concurrent threads supported by the system. For nearest neighbor search, only applies if nn_method = "annoy". If n_threads > 1, then the Annoy index will be temporarily written to disk in the location determined by tempfile.

n_sgd_threads

Number of threads to use during stochastic gradient descent. If set to > 1, then be aware that if batch = FALSE, results will not be reproducible, even if set.seed is called with a fixed seed before running. Set to "auto" to use the same value as n_threads.

grain_size

The minimum amount of work to do on each thread. If this value is set high enough, then fewer than n_threads or n_sgd_threads threads will be used for processing, which might give a performance improvement if the overhead of thread management and context switching was outweighing the improvement due to concurrent processing. This should be left at the default (1) and work will be spread evenly over all the threads specified.

y

Optional target data for supervised dimension reduction. Can be a vector, matrix or data frame. Use the target_metric parameter to specify the metrics to use, using the same syntax as metric. Usually either a single numeric or factor column is used, but more complex formats are possible. The following types are allowed:

  • Factor columns with the same length as X. NA is allowed for any observation with an unknown level, in which case UMAP operates as a form of semi-supervised learning. Each column is treated separately.

  • Numeric data. NA is not allowed in this case. Use the parameter target_n_neighbors to set the number of neighbors used with y. If unset, n_neighbors is used. Unlike factors, numeric columns are grouped into one block unless target_metric specifies otherwise. For example, if you wish columns a and b to be treated separately, specify target_metric = list(euclidean = "a", euclidean = "b"). Otherwise, the data will be effectively treated as a matrix with two columns.

  • Nearest neighbor data, consisting of a list of two matrices, idx and dist. These represent the precalculated nearest neighbor indices and distances, respectively. This is the same format as that expected for precalculated data in nn_method. This format assumes that the underlying data was a numeric vector. Any user-supplied value of the target_n_neighbors parameter is ignored in this case, because the number of columns in the matrices is used for the value. Multiple nearest neighbor data using different metrics can be supplied by passing a list of these lists.

Unlike X, all factor columns included in y are automatically used.

target_n_neighbors

Number of nearest neighbors to use to construct the target simplicial set. Default value is n_neighbors. Applies only if y is non-NULL and numeric.

target_metric

The metric used to measure distance for y if using supervised dimension reduction. Used only if y is numeric.

target_weight

Weighting factor between data topology and target topology. A value of 0.0 weights entirely on data, a value of 1.0 weights entirely on target. The default of 0.5 balances the weighting equally between data and target. Only applies if y is non-NULL.

pca

If set to a positive integer value, reduce data to this number of columns using PCA. Not applied if the distance metric is "hamming", or if the dimensions of the data do not exceed the number specified (i.e. the number of rows and columns must both be larger than the value of this parameter). If you have > 100 columns in a data frame or matrix, reducing the number of columns in this way may substantially increase the performance of the nearest neighbor search at the cost of a potential decrease in accuracy. In many t-SNE applications, a value of 50 is recommended, although there's no guarantee that this is appropriate for all settings.

pca_center

If TRUE, center the columns of X before carrying out PCA. For binary data, it's recommended to set this to FALSE.

pcg_rand

If TRUE, use the PCG random number generator (O'Neill, 2014) during optimization. Otherwise, use the faster (but probably less statistically good) Tausworthe "taus88" generator. The default is TRUE. This parameter has been superseded by rng_type – if both are set, rng_type takes precedence.

fast_sgd

If TRUE, then the following combination of parameters is set: pcg_rand = TRUE and n_sgd_threads = "auto". The default is FALSE. Setting this to TRUE will speed up the stochastic optimization phase, but give a potentially less accurate embedding, one which will not be exactly reproducible even with a fixed seed. For visualization, fast_sgd = TRUE will give perfectly good results. For more generic dimensionality reduction, it's safer to leave fast_sgd = FALSE. If fast_sgd = TRUE, then user-supplied values of pcg_rand and n_sgd_threads are ignored.

ret_model

If TRUE, then return extra data that can be used to add new data to an existing embedding via umap_transform. The embedded coordinates are returned as the list item embedding. If FALSE, just return the coordinates. This parameter can be used in conjunction with ret_nn and ret_extra. Note that some settings are incompatible with the production of a UMAP model: external neighbor data (passed via a list to nn_method), and factor columns that were included via the metric parameter. In the latter case, the model produced is based only on the numeric data. A transformation using new data is possible, but the factor columns in the new data are ignored. Note that setting ret_model = TRUE forces the use of the approximate nearest neighbors method. Because small datasets would otherwise use exact nearest neighbor calculations, setting ret_model = TRUE means that different results may be returned for small datasets in terms of both the returned nearest neighbors (if requested) and the final embedded coordinates, compared to ret_model = FALSE, even if the random number seed is fixed. To avoid this, explicitly set nn_method = "annoy" in the ret_model = FALSE case.

ret_nn

If TRUE, then in addition to the embedding, also return nearest neighbor data that can be used as input to nn_method to avoid the overhead of repeatedly calculating the nearest neighbors when manipulating unrelated parameters (e.g. min_dist, n_epochs, init). See the "Value" section for the names of the list items. If FALSE, just return the coordinates. Note that the nearest neighbors could be sensitive to data scaling, so be wary of reusing nearest neighbor data if modifying the scale parameter. This parameter can be used in conjunction with ret_model and ret_extra.

ret_extra

A vector indicating what extra data to return. May contain any combination of the following strings:

  • "model" Same as setting ret_model = TRUE.

  • "nn" Same as setting ret_nn = TRUE.

  • "fgraph" the high dimensional fuzzy graph (i.e. the fuzzy simplicial set of the merged local views of the input data). The graph is returned as a sparse symmetric N x N matrix of class dgCMatrix-class, where a non-zero entry (i, j) gives the membership strength of the edge connecting vertex i and vertex j. This can be considered analogous to the input probability (or similarity or affinity) used in t-SNE and LargeVis. Note that the graph is further sparsified by removing edges with sufficiently low membership strength that they would not be sampled by the probabilistic edge sampling employed for optimization, and therefore the number of non-zero elements in the matrix is dependent on n_epochs. If you are only interested in the fuzzy input graph (e.g. for clustering), setting n_epochs = 0 will avoid any further sparsifying (see the sketch after this list). Be aware that setting binary_edge_weights = TRUE will affect this graph (all non-zero edge weights will be 1).

  • "sigma" the normalization value for each observation in the dataset when constructing the smoothed distances to each of its neighbors. This gives some sense of the local density of each observation in the high dimensional space: higher values of sigma indicate a higher dispersion or lower density.
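A minimal sketch of extracting the un-sparsified fuzzy graph (variable names are illustrative):

# n_epochs = 0 skips optimization, so the fuzzy graph is not sparsified
# by the edge sampling used during layout
res <- tumap(iris, ret_extra = c("fgraph"), n_epochs = 0)
fuzzy_graph <- res$fgraph  # sparse dgCMatrix, one row/column per observation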

tmpdir

Temporary directory to store nearest neighbor indexes during nearest neighbor search. Default is tempdir. The index is only written to disk if n_threads > 1 and nn_method = "annoy"; otherwise, this parameter is ignored.

verbose

If TRUE, log details to the console.

batch

If TRUE, then embedding coordinates are updated at the end of each epoch rather than during the epoch. In batch mode, results are reproducible with a fixed random seed even with n_sgd_threads > 1, at the cost of a slightly higher memory use. You may also have to modify learning_rate and increase n_epochs, so whether this provides a speed increase over the single-threaded optimization is likely to be dataset and hardware-dependent.
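A brief sketch of reproducible multi-threaded optimization under these settings (on a given installation the two runs below should give the same coordinates):

emb1 <- tumap(iris, batch = TRUE, n_sgd_threads = 2, seed = 42)
emb2 <- tumap(iris, batch = TRUE, n_sgd_threads = 2, seed = 42)
all.equal(emb1, emb2)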

opt_args

A list of optimizer parameters, used when batch = TRUE. The default optimization method used is Adam (Kingma and Ba, 2014).

  • method The optimization method to use. Either "adam" or "sgd" (stochastic gradient descent). Default: "adam".

  • beta1 (Adam only). The weighting parameter for the exponential moving average of the first moment estimator. Effectively the momentum parameter. Should be a floating point value between 0 and 1. Higher values can smooth oscillatory updates in poorly-conditioned situations and may allow for a larger learning_rate to be specified, but too high can cause divergence. Default: 0.5.

  • beta2 (Adam only). The weighting parameter for the exponential moving average of the uncentered second moment estimator. Should be a floating point value between 0 and 1. Controls the degree of adaptivity in the step-size. Higher values put more weight on previous time steps. Default: 0.9.

  • eps (Adam only). Intended to be a small value to prevent division by zero, but in practice can also affect convergence due to its interaction with beta2. Higher values reduce the effect of the step-size adaptivity and bring the behavior closer to stochastic gradient descent with momentum. Typical values are between 1e-8 and 1e-3. Default: 1e-7.

  • alpha The initial learning rate. Default: the value of the learning_rate parameter.
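As a short sketch (the Adam settings below are illustrative, not tuned recommendations):

# Batch mode with Adam settings closer to the common deep-learning defaults
emb <- tumap(iris, batch = TRUE,
             opt_args = list(method = "adam", beta1 = 0.9, beta2 = 0.999, eps = 1e-8))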

epoch_callback

A function which will be invoked at the end of every epoch. Its signature should be: (epoch, n_epochs, coords), where:

  • epoch The current epoch number (between 1 and n_epochs).

  • n_epochs Number of epochs to use during the optimization of the embedded coordinates.

  • coords The embedded coordinates as of the end of the current epoch, as a matrix with dimensions (N, n_components).
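A minimal sketch of such a callback (the logging logic is purely illustrative):

log_every_50 <- function(epoch, n_epochs, coords) {
  if (epoch %% 50 == 0) {
    # coords holds the current embedding; here we only report progress
    message("epoch ", epoch, "/", n_epochs)
  }
}
emb <- tumap(iris, n_epochs = 200, epoch_callback = log_every_50)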

pca_method

Method to carry out any PCA dimensionality reduction when the pca parameter is specified. Allowed values are:

  • "irlba". Uses prcomp_irlba from the irlba package.

  • "rsvd". Uses 5 iterations of svdr from the irlba package. This is likely to give much faster but potentially less accurate results than using "irlba". For the purposes of nearest neighbor calculation and coordinates initialization, any loss of accuracy doesn't seem to matter much.

  • "bigstatsr". Uses big_randomSVD from the bigstatsr package. The SVD methods used in bigstatsr may be faster on systems without access to efficient linear algebra libraries (e.g. Windows). Note: bigstatsr is not a dependency of uwot: if you choose to use this package for PCA, you must install it yourself.

  • "svd". Uses svd for the SVD. This is likely to be slow for all but the smallest datasets.

  • "auto" (the default). Uses "irlba", unless more than 50% of the full set of singular vectors would be calculated, in which case "svd" is used.

binary_edge_weights

If TRUE then edge weights in the input graph are treated as binary (0/1) rather than real valued. This affects the sampling frequency of neighbors and is the strategy used by the PaCMAP method (Wang and co-workers, 2020). Practical (Böhm and co-workers, 2020) and theoretical (Damrich and Hamprecht, 2021) work suggests this has little effect on UMAP's performance.

seed

Integer seed to use to initialize the random number generator state. Combined with n_sgd_threads = 1 or batch = TRUE, this should give consistent output across multiple runs on a given installation. Setting this value is equivalent to calling set.seed, but it may be more convenient in some situations than having to call a separate function. The default is to not set a seed. If ret_model = TRUE, the seed will be stored in the output model and then used to set the seed inside umap_transform.

nn_args

A list containing additional arguments to pass to the nearest neighbor method. For nn_method = "annoy", you can specify "n_trees" and "search_k", and these will override the n_trees and search_k parameters. For nn_method = "hnsw", you may specify the following arguments:

  • M The maximum number of neighbors to keep for each vertex. Reasonable values are 2 to 100. Higher values give better recall at the cost of more memory. Default value is 16.

  • ef_construction A positive integer specifying the size of the dynamic list used during index construction. A higher value will provide better results at the cost of a longer time to build the index. Default is 200.

  • ef A positive integer specifying the size of the dynamic list used during search. This cannot be smaller than n_neighbors and cannot be higher than the number of items in the index. Default is 10.

For nn_method = "nndescent", you may specify the following arguments:

  • n_trees The number of trees to use in a random projection forest to initialize the search. A larger number will give more accurate results at the cost of a longer computation time. The default of NULL means that the number is chosen based on the number of observations in X.

  • max_candidates The number of potential neighbors to explore per iteration. By default, this is set to n_neighbors or 60, whichever is smaller. A larger number will give more accurate results at the cost of a longer computation time.

  • n_iters The number of iterations to run the search. A larger number will give more accurate results at the cost of a longer computation time. By default, this will be chosen based on the number of observations in X. You may also need to modify the convergence criterion delta.

  • delta The minimum relative change in the neighbor graph allowed before early stopping. Should be a value between 0 and 1. The smaller the value, the smaller the amount of progress between iterations is allowed. The default value of 0.001 means that at least 0.1% of the neighbor graph must be updated at each iteration.

  • init How to initialize the nearest neighbor descent. By default this is set to "tree" and uses a random projection forest. If you set this to "rand", then a random selection is used. Usually this is less accurate than using RP trees, but for high-dimensional cases, there may be little difference in the quality of the initialization and random initialization will be a lot faster. If you set this to "rand", then the n_trees parameter is ignored.

  • pruning_degree_multiplier The maximum number of edges per node to retain in the search graph, relative to n_neighbors. A larger value will give more accurate results at the cost of a longer computation time. Default is 1.5. This parameter only affects neighbor search when transforming new data with umap_transform.

  • epsilon Controls the degree of back-tracking when traversing the search graph. Setting this to 0.0 will do a greedy search with no back-tracking. A larger value will give more accurate results at the cost of a longer computation time. Default is 0.1. This parameter only affects neighbor search when transforming new data with umap_transform.

  • max_search_fraction Specifies the maximum fraction of the search graph to traverse. By default, this is set to 1.0, so the entire graph (i.e. all items in X) may be visited. You may want to set this to a smaller value if you have a very large dataset (in conjunction with epsilon) to avoid an inefficient exhaustive search of the data in X. This parameter only affects neighbor search when transforming new data with umap_transform.

rng_type

The type of random number generator to use during optimization. One of:

  • "pcg". Use the PCG random number generator (O'Neill, 2014).

  • "tausworthe". Use the Tausworthe "taus88" generator.

  • "deterministic". Use a deterministic number generator. This isn't actually random, but may provide enough variation in the negative sampling to give a good embedding and can provide a noticeable speed-up.

For backwards compatibility, by default this is unset and the choice of pcg_rand is used (making "pcg" the effective default).

Details

By setting the UMAP curve parameters a and b to 1, you get back the Cauchy distribution as used in t-SNE (van der Maaten and Hinton, 2008) and LargeVis (Tang et al., 2016). It also results in a substantially simplified gradient expression. This can give a speed improvement of around 50%.
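A brief sketch of the relationship (the two calls below optimize the same objective, although the stochastic optimization means the coordinates will not match exactly):

emb_tumap <- tumap(iris, n_neighbors = 30)
# umap with the curve parameters fixed at a = 1, b = 1 targets the same
# Cauchy-kernel objective, but with the more general (slower) gradient
emb_umap1 <- umap(iris, n_neighbors = 30, a = 1, b = 1)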

Value

A matrix of optimized coordinates, or, if any of ret_model, ret_nn or ret_extra are specified, a list.

The returned list contains the combined data from any combination of specifying ret_model, ret_nn and ret_extra.

References

Belkin, M., & Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in neural information processing systems (pp. 585-591). http://papers.nips.cc/paper/1961-laplacian-eigenmaps-and-spectral-techniques-for-embedding-and-clustering.pdf

Böhm, J. N., Berens, P., & Kobak, D. (2020). A unifying perspective on neighbor embeddings along the attraction-repulsion spectrum. arXiv preprint arXiv:2007.08902. https://arxiv.org/abs/2007.08902

Damrich, S., & Hamprecht, F. A. (2021). On UMAP's true loss function. Advances in Neural Information Processing Systems, 34. https://proceedings.neurips.cc/paper/2021/hash/2de5d16682c3c35007e4e92982f1a2ba-Abstract.html

Dong, W., Moses, C., & Li, K. (2011, March). Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World Wide Web (pp. 577-586). ACM. doi:10.1145/1963405.1963487.

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. https://arxiv.org/abs/1412.6980

Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE transactions on pattern analysis and machine intelligence, 42(4), 824-836.

McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426. https://arxiv.org/abs/1802.03426

O'Neill, M. E. (2014). PCG: A family of simple fast space-efficient statistically good algorithms for random number generation (Report No. HMC-CS-2014-0905). Harvey Mudd College.

Tang, J., Liu, J., Zhang, M., & Mei, Q. (2016, April). Visualizing large-scale and high-dimensional data. In Proceedings of the 25th International Conference on World Wide Web (pp. 287-297). International World Wide Web Conferences Steering Committee. https://arxiv.org/abs/1602.00370

Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9 (2579-2605). https://www.jmlr.org/papers/v9/vandermaaten08a.html

Wang, Y., Huang, H., Rudin, C., & Shaposhnik, Y. (2021). Understanding How Dimension Reduction Tools Work: An Empirical Approach to Deciphering t-SNE, UMAP, TriMap, and PaCMAP for Data Visualization. Journal of Machine Learning Research, 22(201), 1-73. https://www.jmlr.org/papers/v22/20-1061.html

Examples

iris_tumap <- tumap(iris, n_neighbors = 50, learning_rate = 0.5)

Dimensionality Reduction with UMAP

Description

Carry out dimensionality reduction of a dataset using the Uniform Manifold Approximation and Projection (UMAP) method (McInnes et al., 2018). Some of the following help text is lifted verbatim from the Python reference implementation at https://github.com/lmcinnes/umap.

Usage

umap(  X,  n_neighbors = 15,  n_components = 2,  metric = "euclidean",  n_epochs = NULL,  learning_rate = 1,  scale = FALSE,  init = "spectral",  init_sdev = NULL,  spread = 1,  min_dist = 0.01,  set_op_mix_ratio = 1,  local_connectivity = 1,  bandwidth = 1,  repulsion_strength = 1,  negative_sample_rate = 5,  a = NULL,  b = NULL,  nn_method = NULL,  n_trees = 50,  search_k = 2 * n_neighbors * n_trees,  approx_pow = FALSE,  y = NULL,  target_n_neighbors = n_neighbors,  target_metric = "euclidean",  target_weight = 0.5,  pca = NULL,  pca_center = TRUE,  pcg_rand = TRUE,  fast_sgd = FALSE,  ret_model = FALSE,  ret_nn = FALSE,  ret_extra = c(),  n_threads = NULL,  n_sgd_threads = 0,  grain_size = 1,  tmpdir = tempdir(),  verbose = getOption("verbose", TRUE),  batch = FALSE,  opt_args = NULL,  epoch_callback = NULL,  pca_method = NULL,  binary_edge_weights = FALSE,  dens_scale = NULL,  seed = NULL,  nn_args = list(),  rng_type = NULL)

Arguments

X

Input data. Can be a data.frame, matrix, dist object or sparseMatrix. Matrices and data frames should contain one observation per row. Data frames will have any non-numeric columns removed, although factor columns will be used if explicitly included via metric (see the help for metric for details). A sparse matrix is interpreted as a distance matrix, and is assumed to be symmetric, so you can also pass in an explicitly upper or lower triangular sparse matrix to save storage. There must be at least n_neighbors non-zero distances for each row. Both implicit and explicit zero entries are ignored. Set zero distances you want to keep to an arbitrarily small non-zero value (e.g. 1e-10). X can also be NULL if pre-computed nearest neighbor data is passed to nn_method, and init is not "spca" or "pca".

n_neighbors

The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100.

n_components

The dimension of the space to embed into. This defaults to 2 to provide easy visualization, but can reasonably be set to any integer value in the range 2 to 100.

metric

Type of distance metric to use to find nearest neighbors. For nn_method = "annoy" this can be one of:

  • "euclidean" (the default)

  • "cosine"

  • "manhattan"

  • "hamming"

  • "correlation" (a distance based on the Pearson correlation)

  • "categorical" (see below)

For nn_method = "hnsw" this can be one of:

  • "euclidean"

  • "cosine"

  • "correlation"

If rnndescent is installed and nn_method = "nndescent" is specified then many more metrics are available, including:

  • "braycurtis"

  • "canberra"

  • "chebyshev"

  • "dice"

  • "hamming"

  • "hellinger"

  • "jaccard"

  • "jensenshannon"

  • "kulsinski"

  • "rogerstanimoto"

  • "russellrao"

  • "sokalmichener"

  • "sokalsneath"

  • "spearmanr"

  • "symmetrickl"

  • "tsss"

  • "yule"

For more details see the package documentation of rnndescent. For nn_method = "fnn", the distance metric is always "euclidean".

If X is a data frame or matrix, then multiple metrics can be specified by passing a list to this argument, where the name of each item in the list is one of the metric names above. The value of each list item should be a vector giving the names or integer ids of the columns to be included in a calculation, e.g. metric = list(euclidean = 1:4, manhattan = 5:10).

Each metric calculation results in a separate fuzzy simplicial set; these are intersected together to produce the final set. Metric names can be repeated. Because non-numeric columns are removed from the data frame, it is safer to use column names than integer ids.

Factor columns can also be used by specifying the metric name "categorical". Factor columns are treated differently from numeric columns and although multiple factor columns can be specified in a vector, each factor column specified is processed individually. If you specify a non-factor column, it will be coerced to a factor.

For a given data block, you may override the pca and pca_center arguments for that block, by providing a list with one unnamed item containing the column names or ids, and then any of the pca or pca_center overrides as named items, e.g. metric = list(euclidean = 1:4, manhattan = list(5:10, pca_center = FALSE)). This exists to allow mixed binary and real-valued data to be included and to have PCA applied to both, but with centering applied only to the real-valued data (it is typical not to apply centering to binary data before PCA is applied).

n_epochs

Number of epochs to use during the optimization of the embedded coordinates. By default, this value is set to 500 for datasets containing 10,000 vertices or less, and 200 otherwise. If n_epochs = 0, then coordinates determined by "init" will be returned.

learning_rate

Initial learning rate used in optimization of thecoordinates.

scale

Scaling to apply toX if it is a data frame or matrix:

  • "none" or FALSE or NULL No scaling.

  • "Z" or "scale" or TRUE Scale each column to zero mean and variance 1.

  • "maxabs" Center each column to mean 0, then divide each element by the maximum absolute value over the entire matrix.

  • "range" Range scale the entire matrix, so the smallest element is 0 and the largest is 1.

  • "colrange" Scale each column in the range (0,1).

For UMAP, the default is "none".

init

Type of initialization for the coordinates. Options are:

  • "spectral" Spectral embedding using the normalized Laplacian of the fuzzy 1-skeleton, with Gaussian noise added.

  • "normlaplacian". Spectral embedding using the normalized Laplacian of the fuzzy 1-skeleton, without noise.

  • "random". Coordinates assigned using a uniform random distribution between -10 and 10.

  • "lvrandom". Coordinates assigned using a Gaussian distribution with standard deviation 1e-4, as used in LargeVis (Tang et al., 2016) and t-SNE.

  • "laplacian". Spectral embedding using the Laplacian Eigenmap (Belkin and Niyogi, 2002).

  • "pca". The first two principal components from PCA of X if X is a data frame, and from a 2-dimensional classical MDS if X is of class "dist".

  • "spca". Like "pca", but each dimension is then scaled so the standard deviation is 1e-4, to give a distribution similar to that used in t-SNE. This is an alias for init = "pca", init_sdev = 1e-4.

  • "agspectral" An "approximate global" modification of "spectral" which sets all edges in the graph to a value of 1, and then sets a random number of edges (negative_sample_rate edges per vertex) to 0.1, to approximate the effect of non-local affinities.

  • A matrix of initial coordinates.

For spectral initializations ("spectral", "normlaplacian", "laplacian", "agspectral"), if more than one connected component is identified, no spectral initialization is attempted. Instead a PCA-based initialization is attempted. If verbose = TRUE the number of connected components is logged to the console. The existence of multiple connected components implies that a global view of the data cannot be attained with this initialization. Increasing the value of n_neighbors may help.

init_sdev

If non-NULL, scales each dimension of the initialized coordinates (including any user-supplied matrix) to this standard deviation. By default no scaling is carried out, except when init = "spca", in which case the value is 0.0001. Scaling the input may help if the unscaled versions result in initial coordinates with large inter-point distances or outliers. This usually results in small gradients during optimization and very little progress being made to the layout. Shrinking the initial embedding by rescaling can help under these circumstances. Scaling the result of init = "pca" is usually recommended, and init = "spca" exists as an alias for init = "pca", init_sdev = 1e-4; for the spectral initializations the scaled versions usually aren't necessary unless you are using a large value of n_neighbors (e.g. n_neighbors = 150 or higher). For compatibility with recent versions of the Python UMAP package, if you are using init = "spectral", then you should also set init_sdev = "range", which will range scale each of the columns containing the initial data between 0-10. This is not set by default to maintain backwards compatibility with previous versions of uwot.

spread

The effective scale of embedded points. In combination with min_dist, this determines how clustered/clumped the embedded points are.

min_dist

The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out.
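A short sketch of this trade-off (values are illustrative):

# Tightly clustered embedding
emb_tight <- umap(iris, min_dist = 0.001, spread = 1)
# More evenly dispersed embedding
emb_loose <- umap(iris, min_dist = 0.5, spread = 5)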

set_op_mix_ratio

Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial set. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.

local_connectivity

The local connectivity required – i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should be not more than the local intrinsic dimension of the manifold.

bandwidth

The effective bandwidth of the kernel if we view the algorithm as similar to Laplacian Eigenmaps. Larger values induce more connectivity and a more global view of the data, smaller values concentrate more locally.

repulsion_strength

Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples.

negative_sample_rate

The number of negative edge/1-simplex samples to use per positive edge/1-simplex sample in optimizing the low dimensional embedding.

a

More specific parameters controlling the embedding. If NULL these values are set automatically as determined by min_dist and spread.

b

More specific parameters controlling the embedding. If NULL these values are set automatically as determined by min_dist and spread.

nn_method

Method for finding nearest neighbors. Options are:

  • "fnn". Use exact nearest neighbors via the FNN package.

  • "annoy" Use approximate nearest neighbors via the RcppAnnoy package.

  • "hnsw" Use approximate nearest neighbors with the Hierarchical Navigable Small World (HNSW) method (Malkov and Yashunin, 2018) via the RcppHNSW package. RcppHNSW is not a dependency of this package: this option is only available if you have installed RcppHNSW yourself. Also, HNSW only supports the following arguments for metric and target_metric: "euclidean", "cosine" and "correlation".

  • "nndescent" Use approximate nearest neighbors with the Nearest Neighbor Descent method (Dong et al., 2011) via the rnndescent package. rnndescent is not a dependency of this package: this option is only available if you have installed rnndescent yourself.

By default, if X has less than 4,096 vertices, the exact nearest neighbors are found. Otherwise, approximate nearest neighbors are used. You may also pass pre-calculated nearest neighbor data to this argument. It must be in one of two formats, either a list consisting of two elements:

  • "idx". A n_vertices x n_neighbors matrix containing the integer indexes of the nearest neighbors in X. Each vertex is considered to be its own nearest neighbor, i.e. idx[, 1] == 1:n_vertices.

  • "dist". A n_vertices x n_neighbors matrix containing the distances of the nearest neighbors.

or a sparse distance matrix of type dgCMatrix, with dimensions n_vertices x n_vertices. Distances should be arranged by column, i.e. a non-zero entry in row j of the ith column indicates that the jth observation in X is a nearest neighbor of the ith observation, with the distance given by the value of that element. The n_neighbors parameter is ignored when using precomputed nearest neighbor data. If using the sparse distance matrix input, each column can contain a different number of neighbors.

n_trees

Number of trees to build when constructing the nearestneighbor index. The more trees specified, the larger the index, but thebetter the results. Withsearch_k, determines the accuracy of theAnnoy nearest neighbor search. Only used if thenn_method is"annoy". Sensible values are between10 to100.

search_k

Number of nodes to search during the neighbor retrieval. The larger k, the more accurate the results, but the longer the search takes. With n_trees, determines the accuracy of the Annoy nearest neighbor search. Only used if the nn_method is "annoy".

approx_pow

IfTRUE, use an approximation to the power functionin the UMAP gradient, fromhttps://martin.ankerl.com/2012/01/25/optimized-approximative-pow-in-c-and-cpp/.Ignored ifdens_scale is non-NULL.

y

Optional target data for supervised dimension reduction. Can be avector, matrix or data frame. Use thetarget_metric parameter tospecify the metrics to use, using the same syntax asmetric. Usuallyeither a single numeric or factor column is used, but more complex formatsare possible. The following types are allowed:

  • Factor columns with the same length asX.NA isallowed for any observation with an unknown level, in which caseUMAP operates as a form of semi-supervised learning. Each column istreated separately.

  • Numeric data.NA isnot allowed in this case. Use theparametertarget_n_neighbors to set the number of neighbors usedwithy. If unset,n_neighbors is used. Unlike factors,numeric columns are grouped into one block unlesstarget_metricspecifies otherwise. For example, if you wish columnsa andb to be treated separately, specifytarget_metric = list(euclidean = "a", euclidean = "b"). Otherwise,the data will be effectively treated as a matrix with two columns.

  • Nearest neighbor data, consisting of a list of two matrices, idx and dist. These represent the precalculated nearest neighbor indices and distances, respectively. This is the same format as that expected for precalculated data in nn_method. This format assumes that the underlying data was a numeric vector. Any user-supplied value of the target_n_neighbors parameter is ignored in this case, because the number of columns in the matrices is used for the value. Multiple nearest neighbor data using different metrics can be supplied by passing a list of these lists.

UnlikeX, all factor columns included iny are automaticallyused.
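For example, a hedged sketch of supervised use with a factor target, and of numeric targets split into separate blocks via target_metric (the column names are illustrative and follow the iris dataset):

# factor target: supervised UMAP
iris_sup <- umap(iris[, 1:4], y = iris$Species, target_weight = 0.5, n_epochs = 20)

# numeric targets treated as two separate blocks
targets <- data.frame(a = iris$Petal.Length, b = iris$Petal.Width)
iris_sup2 <- umap(iris[, 1:2], y = targets,
                  target_metric = list(euclidean = "a", euclidean = "b"),
                  n_epochs = 20)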

target_n_neighbors

Number of nearest neighbors to use to construct thetarget simplicial set. Default value isn_neighbors. Applies only ify is non-NULL andnumeric.

target_metric

The metric used to measure distance fory ifusing supervised dimension reduction. Used only ify is numeric.

target_weight

Weighting factor between data topology and targettopology. A value of 0.0 weights entirely on data, a value of 1.0 weightsentirely on target. The default of 0.5 balances the weighting equallybetween data and target. Only applies ify is non-NULL.

pca

If set to a positive integer value, reduce data to this number of columns using PCA. Doesn't apply if the distance metric is "hamming", or if the dimensions of the data are not larger than the number specified (i.e. the number of rows and columns must both be larger than the value of this parameter). If you have > 100 columns in a data frame or matrix, reducing the number of columns in this way may substantially increase the performance of the nearest neighbor search at the cost of a potential decrease in accuracy. In many t-SNE applications, a value of 50 is recommended, although there's no guarantee that this is appropriate for all settings.

pca_center

IfTRUE, center the columns ofX beforecarrying out PCA. For binary data, it's recommended to set this toFALSE.

pcg_rand

IfTRUE, use the PCG random number generator (O'Neill,2014) during optimization. Otherwise, use the faster (but probably lessstatistically good) Tausworthe "taus88" generator. The default isTRUE. This parameter has been superseded byrng_type – ifboth are set,rng_type takes precedence.

fast_sgd

If TRUE, then the following combination of parameters is set: pcg_rand = TRUE, n_sgd_threads = "auto" and approx_pow = TRUE. The default is FALSE. Setting this to TRUE will speed up the stochastic optimization phase, but gives a potentially less accurate embedding, one which will not be exactly reproducible even with a fixed seed. For visualization, fast_sgd = TRUE will give perfectly good results. For more generic dimensionality reduction, it's safer to leave fast_sgd = FALSE. If fast_sgd = TRUE, then user-supplied values of pcg_rand, n_sgd_threads, and approx_pow are ignored.

ret_model

IfTRUE, then return extra data that can be used toadd new data to an existing embedding viaumap_transform. Theembedded coordinates are returned as the list itemembedding. IfFALSE, just return the coordinates. This parameter can be used inconjunction withret_nn andret_extra. Note that somesettings are incompatible with the production of a UMAP model: externalneighbor data (passed via a list tonn_method), and factor columnsthat were included via themetric parameter. In the latter case, themodel produced is based only on the numeric data. A transformation usingnew data is possible, but the factor columns in the new data are ignored.Note that settingret_model = TRUE forces the use of the approximatenearest neighbors method. Because small datasets would otherwise use exactnearest neighbor calculations, settingret_model = TRUE means thatdifferent results may be returned for small datasets in terms of both thereturned nearest neighbors (if requested) and the final embeddedcoordinates, compared toret_model = FALSE, even if the randomnumber seed is fixed. To avoid this, explicitly setnn_method = "annoy" in theret_model = FALSE case.

ret_nn

IfTRUE, then in addition to the embedding, also returnnearest neighbor data that can be used as input tonn_method toavoid the overhead of repeatedly calculating the nearest neighbors whenmanipulating unrelated parameters (e.g.min_dist,n_epochs,init). See the "Value" section for the names of the list items. IfFALSE, just return the coordinates. Note that the nearest neighborscould be sensitive to data scaling, so be wary of reusing nearest neighbordata if modifying thescale parameter. This parameter can be used inconjunction withret_model andret_extra.

ret_extra

A vector indicating what extra data to return. May containany combination of the following strings:

  • "model" Same as settingret_model = TRUE.

  • "nn" Same as settingret_nn = TRUE.

  • "fgraph" the high dimensional fuzzy graph (i.e. the fuzzysimplicial set of the merged local views of the input data). The graphis returned as a sparse symmetric N x N matrix of classdgCMatrix-class, where a non-zero entry (i, j) gives themembership strength of the edge connecting vertex i and vertex j. Thiscan be considered analogous to the input probability (or similarity oraffinity) used in t-SNE and LargeVis. Note that the graph is furthersparsified by removing edges with sufficiently low membership strengththat they would not be sampled by the probabilistic edge samplingemployed for optimization and therefore the number of non-zero elementsin the matrix is dependent onn_epochs. If you are onlyinterested in the fuzzy input graph (e.g. for clustering), settingn_epochs = 0 will avoid any further sparsifying.Be aware that setting 'binary_edge_weights = TRUE' will affect thisgraph (all non-zero edge weights will be 1).

  • "sigma" the normalization value for each observation in thedataset when constructing the smoothed distances to each of itsneighbors. This gives some sense of the local density of eachobservation in the high dimensional space: higher values ofsigma indicate a higher dispersion or lower density.

n_threads

Number of threads to use (except during stochastic gradientdescent). Default is half the number of concurrent threads supported by thesystem. For nearest neighbor search, only applies ifnn_method = "annoy". Ifn_threads > 1, then the Annoy indexwill be temporarily written to disk in the location determined bytempfile.

n_sgd_threads

Number of threads to use during stochastic gradientdescent. If set to > 1, then be aware that ifbatch = FALSE, resultswillnot be reproducible, even ifset.seed is called with afixed seed before running. Set to"auto" to use the same value asn_threads.

grain_size

The minimum amount of work to do on each thread. If thisvalue is set high enough, then less thann_threads orn_sgd_threads will be used for processing, which might give aperformance improvement if the overhead of thread management and contextswitching was outweighing the improvement due to concurrent processing.This should be left at default (1) and work will be spread evenlyover all the threads specified.

tmpdir

Temporary directory to store nearest neighbor indexes duringnearest neighbor search. Default istempdir. The index isonly written to disk ifn_threads > 1 andnn_method = "annoy"; otherwise, this parameter is ignored.

verbose

IfTRUE, log details to the console.

batch

IfTRUE, then embedding coordinates are updated at theend of each epoch rather than during the epoch. In batch mode, results arereproducible with a fixed random seed even withn_sgd_threads > 1,at the cost of a slightly higher memory use. You may also have to modifylearning_rate and increasen_epochs, so whether this providesa speed increase over the single-threaded optimization is likely to bedataset and hardware-dependent.
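A minimal sketch of the reproducibility behavior described above: with batch = TRUE and a fixed seed, multi-threaded optimization should give the same coordinates across runs on the same machine.

set.seed(42)
emb1 <- umap(iris[, 1:4], batch = TRUE, n_sgd_threads = 2, n_epochs = 50)
set.seed(42)
emb2 <- umap(iris[, 1:4], batch = TRUE, n_sgd_threads = 2, n_epochs = 50)
all.equal(emb1, emb2)  # expected to be TRUE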

opt_args

A list of optimizer parameters, used whenbatch = TRUE. The default optimization method used is Adam (Kingmaand Ba, 2014).

  • method The optimization method to use. Either"adam"or"sgd" (stochastic gradient descent). Default:"adam".

  • beta1 (Adam only). The weighting parameter for theexponential moving average of the first moment estimator. Effectively themomentum parameter. Should be a floating point value between 0 and 1.Higher values can smooth oscillatory updates in poorly-conditionedsituations and may allow for a largerlearning_rate to bespecified, but too high can cause divergence. Default:0.5.

  • beta2 (Adam only). The weighting parameter for theexponential moving average of the uncentered second moment estimator.Should be a floating point value between 0 and 1. Controls the degree ofadaptivity in the step-size. Higher values put more weight on previoustime steps. Default:0.9.

  • eps (Adam only). Intended to be a small value to preventdivision by zero, but in practice can also affect convergence due to itsinteraction withbeta2. Higher values reduce the effect of thestep-size adaptivity and bring the behavior closer to stochastic gradientdescent with momentum. Typical values are between 1e-8 and 1e-3. Default:1e-7.

  • alpha The initial learning rate. Default: the value of thelearning_rate parameter.
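For example, a hedged sketch of overriding the Adam parameters in batch mode (the values shown are illustrative, not recommendations):

iris_umap <- umap(iris[, 1:4], batch = TRUE, n_epochs = 50,
                  opt_args = list(method = "adam", beta1 = 0.9, beta2 = 0.999, eps = 1e-7))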

epoch_callback

A function which will be invoked at the end of everyepoch. Its signature should be:(epoch, n_epochs, coords), where:

  • epoch The current epoch number (between1 andn_epochs).

  • n_epochs Number of epochs to use during the optimization ofthe embedded coordinates.

  • coords The embedded coordinates as of the end of the currentepoch, as a matrix with dimensions (N,n_components).
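A minimal sketch of a callback with this signature, logging progress at the end of each epoch:

progress_cb <- function(epoch, n_epochs, coords) {
  message("epoch ", epoch, "/", n_epochs,
          " embedding range: ", paste(signif(range(coords), 3), collapse = " to "))
}
iris_umap <- umap(iris[, 1:4], n_epochs = 20, epoch_callback = progress_cb)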

pca_method

Method to carry out any PCA dimensionality reduction whenthepca parameter is specified. Allowed values are:

  • "irlba". Usesprcomp_irlba from theirlba package.

  • "rsvd". Uses 5 iterations ofsvdr fromtheirlba package.This is likely to give much faster but potentially less accurate resultsthan using"irlba". For the purposes of nearest neighborcalculation and coordinates initialization, any loss of accuracy doesn'tseem to matter much.

  • "bigstatsr". Usesbig_randomSVDfrom thebigstatsrpackage. The SVD methods used inbigstatsr may be faster onsystems without access to efficient linear algebra libraries (e.g.Windows).Note:bigstatsr isnot a dependency ofuwot: if you choose to use this package for PCA, youmust installit yourself.

  • "svd". Usessvd for the SVD. This islikely to be slow for all but the smallest datasets.

  • "auto" (the default). Uses"irlba", unless more than50case"svd" is used.

binary_edge_weights

IfTRUE then edge weights in the inputgraph are treated as binary (0/1) rather than real valued. This affects thesampling frequency of neighbors and is the strategy used by the PaCMAPmethod (Wang and co-workers, 2020). Practical (Böhm and co-workers, 2020)and theoretical (Damrich and Hamprecht, 2021) work suggests this has littleeffect on UMAP's performance.

dens_scale

A value between 0 and 1. If > 0 then the output attemptsto preserve relative local density around each observation. This uses anapproximation to the densMAP method (Narayan and co-workers, 2021). Thelarger the value ofdens_scale, the greater the range of outputdensities that will be used to map the input densities. This option isignored if using multiplemetric blocks.

seed

Integer seed to use to initialize the random number generatorstate. Combined withn_sgd_threads = 1 orbatch = TRUE, thisshould give consistent output across multiple runs on a given installation.Setting this value is equivalent to callingset.seed,but it may be more convenient in some situations than having to call aseparate function. The default is to not set a seed. Ifret_model = TRUE, the seed will be stored in the output model andthen used to set the seed insideumap_transform.
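A sketch of the behavior described above: the seed is stored in the returned model so that umap_transform can reuse it (new_data is a hypothetical placeholder for data to be transformed later).

model <- umap(iris[, 1:4], ret_model = TRUE, seed = 42, n_epochs = 20)
# later: umap_transform(new_data, model) will set the same seed internally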

nn_args

A list containing additional arguments to pass to the nearestneighbor method. Fornn_method = "annoy", you can specify"n_trees" and"search_k", and these will override then_trees andsearch_k parameters.Fornn_method = "hnsw", you may specify the following arguments:

  • M The maximum number of neighbors to keep for each vertex.Reasonable values are2 to100. Higher values give betterrecall at the cost of more memory. Default value is16.

  • ef_construction A positive integer specifying the size ofthe dynamic list used during index construction. A higher value willprovide better results at the cost of a longer time to build the index.Default is200.

  • ef A positive integer specifying the size of the dynamiclist used during search. This cannot be smaller thann_neighborsand cannot be higher than the number of items in the index. Default is10.

Fornn_method = "nndescent", you may specify the followingarguments:

  • n_trees The number of trees to use in a random projectionforest to initialize the search. A larger number will give more accurateresults at the cost of a longer computation time. The default ofNULL means that the number is chosen based on the number ofobservations inX.

  • max_candidates The number of potential neighbors to exploreper iteration. By default, this is set ton_neighbors or60,whichever is smaller. A larger number will give more accurate results atthe cost of a longer computation time.

  • n_iters The number of iterations to run the search. A largernumber will give more accurate results at the cost of a longer computationtime. By default, this will be chosen based on the number of observationsinX. You may also need to modify the convergence criteriondelta.

  • delta The minimum relative change in the neighbor graph allowed before early stopping. Should be a value between 0 and 1. The smaller the value, the smaller the amount of progress between iterations is allowed. Default value of 0.001 means that at least 0.1% of the neighbor graph must be updated at each iteration.

  • init How to initialize the nearest neighbor descent. By default this is set to "tree" and uses a random projection forest. If you set this to "rand", then a random selection is used. Usually this is less accurate than using RP trees, but for high-dimensional cases, there may be little difference in the quality of the initialization and random initialization will be a lot faster. If you set this to "rand", then the n_trees parameter is ignored.

  • pruning_degree_multiplier The maximum number of edges per nodeto retain in the search graph, relative ton_neighbors. A largervalue will give more accurate results at the cost of a longer computationtime. Default is1.5. This parameter only affects neighbor searchwhen transforming new data withumap_transform.

  • epsilon Controls the degree of the back-tracking whentraversing the search graph. Setting this to0.0 will do a greedysearch with no back-tracking. A larger value will give more accurateresults at the cost of a longer computation time. Default is0.1.This parameter only affects neighbor search when transforming new data withumap_transform.

  • max_search_fraction Specifies the maximum fraction of thesearch graph to traverse. By default, this is set to1.0, so theentire graph (i.e. all items inX) may be visited. You may want toset this to a smaller value if you have a very large dataset (inconjunction withepsilon) to avoid an inefficient exhaustive searchof the data inX. This parameter only affects neighbor search whentransforming new data withumap_transform.
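For example, a hedged sketch of passing method-specific arguments through nn_args (the HNSW case requires the RcppHNSW package to be installed; parameter values are illustrative):

# Annoy: override n_trees and search_k via nn_args
iris_umap <- umap(iris[, 1:4], nn_method = "annoy",
                  nn_args = list(n_trees = 100, search_k = 10000))

# HNSW: requires RcppHNSW
iris_umap <- umap(iris[, 1:4], nn_method = "hnsw",
                  nn_args = list(M = 32, ef_construction = 300, ef = 50))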

rng_type

The type of random number generator to use duringoptimization. One of:

  • "pcg". Use the PCG random number generator (O'Neill, 2014).

  • "tausworthe". Use the Tausworthe "taus88" generator.

  • "deterministic". Use a deterministic number generator. Thisisn't actually random, but may provide enough variation in the negativesampling to give a good embedding and can provide a noticeable speed-up.

For backwards compatibility, by default this is unset and the choice ofpcg_rand is used (making "pcg" the effective default).

Value

A matrix of optimized coordinates, or, if ret_model = TRUE, ret_nn = TRUE or ret_extra is non-empty, a list containing the embedding and the requested extra data.

The returned list contains the combined data from any combination ofspecifyingret_model,ret_nn andret_extra.

References

Belkin, M., & Niyogi, P. (2002).Laplacian eigenmaps and spectral techniques for embedding and clustering.InAdvances in neural information processing systems(pp. 585-591).http://papers.nips.cc/paper/1961-laplacian-eigenmaps-and-spectral-techniques-for-embedding-and-clustering.pdf

Böhm, J. N., Berens, P., & Kobak, D. (2020).A unifying perspective on neighbor embeddings along the attraction-repulsion spectrum.arXiv preprintarXiv:2007.08902.https://arxiv.org/abs/2007.08902

Damrich, S., & Hamprecht, F. A. (2021).On UMAP's true loss function.Advances in Neural Information Processing Systems,34.https://proceedings.neurips.cc/paper/2021/hash/2de5d16682c3c35007e4e92982f1a2ba-Abstract.html

Dong, W., Moses, C., & Li, K. (2011, March).Efficient k-nearest neighbor graph construction for generic similarity measures.InProceedings of the 20th international conference on World Wide Web(pp. 577-586).ACM.doi:10.1145/1963405.1963487.

Kingma, D. P., & Ba, J. (2014).Adam: A method for stochastic optimization.arXiv preprintarXiv:1412.6980.https://arxiv.org/abs/1412.6980

Malkov, Y. A., & Yashunin, D. A. (2018).Efficient and robust approximate nearest neighbor search using hierarchicalnavigable small world graphs.IEEE transactions on pattern analysis and machine intelligence,42(4), 824-836.

McInnes, L., Healy, J., & Melville, J. (2018).UMAP: Uniform Manifold Approximation and Projection for Dimension ReductionarXiv preprintarXiv:1802.03426.https://arxiv.org/abs/1802.03426

Narayan, A., Berger, B., & Cho, H. (2021).Assessing single-cell transcriptomic variability through density-preserving data visualization.Nature biotechnology,39(6), 765-774.doi:10.1038/s41587-020-00801-7

O'Neill, M. E. (2014).PCG: A family of simple fast space-efficient statistically goodalgorithms for random number generation(Report No. HMC-CS-2014-0905). Harvey Mudd College.

Tang, J., Liu, J., Zhang, M., & Mei, Q. (2016, April).Visualizing large-scale and high-dimensional data.InProceedings of the 25th International Conference on World Wide Web(pp. 287-297).International World Wide Web Conferences Steering Committee.https://arxiv.org/abs/1602.00370

Van der Maaten, L., & Hinton, G. (2008).Visualizing data using t-SNE.Journal of Machine Learning Research,9 (2579-2605).https://www.jmlr.org/papers/v9/vandermaaten08a.html

Wang, Y., Huang, H., Rudin, C., & Shaposhnik, Y. (2021).Understanding How Dimension Reduction Tools Work: An Empirical Approach to Deciphering t-SNE, UMAP, TriMap, and PaCMAP for Data Visualization.Journal of Machine Learning Research,22(201), 1-73.https://www.jmlr.org/papers/v22/20-1061.html

Examples

iris30 <- iris[c(1:10, 51:60, 101:110), ]

# Non-numeric columns are automatically removed so you can pass data frames
# directly in a lot of cases without pre-processing
iris_umap <- umap(iris30, n_neighbors = 5, learning_rate = 0.5, init = "random", n_epochs = 20)

# Faster approximation to the gradient and return nearest neighbors
iris_umap <- umap(iris30, n_neighbors = 5, approx_pow = TRUE, ret_nn = TRUE, n_epochs = 20)

# Can specify min_dist and spread parameters to control separation and size
# of clusters and reuse nearest neighbors for efficiency
nn <- iris_umap$nn
iris_umap <- umap(iris30, n_neighbors = 5, min_dist = 1, spread = 5, nn_method = nn, n_epochs = 20)

# Supervised dimension reduction using the 'Species' factor column
iris_sumap <- umap(iris30,
  n_neighbors = 5, min_dist = 0.001, y = iris30$Species,
  target_weight = 0.5, n_epochs = 20
)

# Calculate Petal and Sepal neighbors separately (uses intersection of the resulting sets):
iris_umap <- umap(iris30, metric = list(
  "euclidean" = c("Sepal.Length", "Sepal.Width"),
  "euclidean" = c("Petal.Length", "Petal.Width")
))

Dimensionality Reduction with UMAP

Description

Carry out dimensionality reduction of a dataset using the Uniform ManifoldApproximation and Projection (UMAP) method (McInnes et al., 2018).

Usage

umap2(  X,  n_neighbors = 15,  n_components = 2,  metric = "euclidean",  n_epochs = NULL,  learning_rate = 1,  scale = FALSE,  init = "spectral",  init_sdev = "range",  spread = 1,  min_dist = 0.1,  set_op_mix_ratio = 1,  local_connectivity = 1,  bandwidth = 1,  repulsion_strength = 1,  negative_sample_rate = 5,  a = NULL,  b = NULL,  nn_method = NULL,  n_trees = 50,  search_k = 2 * n_neighbors * n_trees,  approx_pow = FALSE,  y = NULL,  target_n_neighbors = n_neighbors,  target_metric = "euclidean",  target_weight = 0.5,  pca = NULL,  pca_center = TRUE,  pcg_rand = TRUE,  fast_sgd = FALSE,  ret_model = FALSE,  ret_nn = FALSE,  ret_extra = c(),  n_threads = NULL,  n_sgd_threads = 0,  grain_size = 1,  tmpdir = tempdir(),  verbose = getOption("verbose", TRUE),  batch = TRUE,  opt_args = NULL,  epoch_callback = NULL,  pca_method = NULL,  binary_edge_weights = FALSE,  dens_scale = NULL,  seed = NULL,  nn_args = list(),  rng_type = NULL)

Arguments

X

Input data. Can be a data.frame, matrix, dist object or sparseMatrix. Matrices and data frames should contain one observation per row. Data frames will have any non-numeric columns removed, although factor columns will be used if explicitly included via metric (see the help for metric for details). Sparse matrices must be in the dgCMatrix format, and you must also install rnndescent and set nn_method = "nndescent". X can also be NULL if pre-computed nearest neighbor data is passed to nn_method, and init is not "spca" or "pca".
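A hedged sketch of the sparse input case described above (this assumes the rnndescent package is installed):

library(Matrix)
X_sparse <- Matrix(as.matrix(iris[, 1:4]), sparse = TRUE)  # dgCMatrix
iris_umap <- umap2(X_sparse, nn_method = "nndescent", n_neighbors = 5)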

n_neighbors

The size of local neighborhood (in terms of number ofneighboring sample points) used for manifold approximation. Larger valuesresult in more global views of the manifold, while smaller values result inmore local data being preserved. In general values should be in the range2 to100.

n_components

The dimension of the space to embed into. This defaultsto2 to provide easy visualization, but can reasonably be set to anyinteger value in the range2 to100.

metric

Type of distance metric to use to find nearest neighbors. Fornn_method = "annoy" this can be one of:

  • "euclidean" (the default)

  • "cosine"

  • "manhattan"

  • "hamming"

  • "correlation" (a distance based on the Pearson correlation)

  • "categorical" (see below)

Fornn_method = "hnsw" this can be one of:

  • "euclidean"

  • "cosine"

  • "correlation"

If rnndescent is installed and nn_method = "nndescent" is specified then many more metrics are available, including:

  • "braycurtis"

  • "canberra"

  • "chebyshev"

  • "dice"

  • "hamming"

  • "hellinger"

  • "jaccard"

  • "jensenshannon"

  • "kulsinski"

  • "rogerstanimoto"

  • "russellrao"

  • "sokalmichener"

  • "sokalsneath"

  • "spearmanr"

  • "symmetrickl"

  • "tsss"

  • "yule"

For more details see the package documentation ofrnndescent.Fornn_method = "fnn", the distance metric is always "euclidean".

IfX is a data frame or matrix, then multiple metrics can bespecified, by passing a list to this argument, where the name of each item inthe list is one of the metric names above. The value of each list item shouldbe a vector giving the names or integer ids of the columns to be included ina calculation, e.g.metric = list(euclidean = 1:4, manhattan = 5:10).

Each metric calculation results in a separate fuzzy simplicial set, which areintersected together to produce the final set. Metric names can be repeated.Because non-numeric columns are removed from the data frame, it is safer touse column names than integer ids.

Factor columns can also be used by specifying the metric name "categorical". Factor columns are treated differently from numeric columns and although multiple factor columns can be specified in a vector, each factor column specified is processed individually. If you specify a non-factor column, it will be coerced to a factor.

For a given data block, you may override thepca andpca_centerarguments for that block, by providing a list with one unnamed itemcontaining the column names or ids, and then any of thepca orpca_center overrides as named items, e.g.metric =list(euclidean = 1:4, manhattan = list(5:10, pca_center = FALSE)). Thisexists to allow mixed binary and real-valued data to be included and to havePCA applied to both, but with centering applied only to the real-valued data(it is typical not to apply centering to binary data before PCA is applied).
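As a sketch of the per-block override described above (the column indices follow the example in the text; mixed_data is a hypothetical matrix whose first four columns are real-valued and whose remaining columns are binary):

# mixed_data: hypothetical input, not defined here
mixed_umap <- umap2(mixed_data, pca = 2,
                    metric = list(euclidean = 1:4,
                                  manhattan = list(5:10, pca_center = FALSE)))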

n_epochs

Number of epochs to use during the optimization of theembedded coordinates. By default, this value is set to500 fordatasets containing 10,000 vertices or less, and200 otherwise.Ifn_epochs = 0, then coordinates determined by"init" willbe returned.

learning_rate

Initial learning rate used in optimization of thecoordinates.

scale

Scaling to apply toX if it is a data frame or matrix:

  • "none" orFALSE orNULL No scaling.

  • "Z" or"scale" orTRUE Scale each column tozero mean and variance 1.

  • "maxabs" Center each column to mean 0, then divide eachelement by the maximum absolute value over the entire matrix.

  • "range" Range scale the entire matrix, so the smallestelement is 0 and the largest is 1.

  • "colrange" Scale each column in the range (0,1).

For UMAP, the default is"none".

init

Type of initialization for the coordinates. Options are:

  • "spectral" Spectral embedding using the normalized Laplacianof the fuzzy 1-skeleton, with Gaussian noise added.

  • "normlaplacian". Spectral embedding using the normalizedLaplacian of the fuzzy 1-skeleton, without noise.

  • "random". Coordinates assigned using a uniform randomdistribution between -10 and 10.

  • "lvrandom". Coordinates assigned using a Gaussiandistribution with standard deviation 1e-4, as used in LargeVis(Tang et al., 2016) and t-SNE.

  • "laplacian". Spectral embedding using the Laplacian Eigenmap(Belkin and Niyogi, 2002).

  • "pca". The first two principal components from PCA ofX ifX is a data frame, and from a 2-dimensional classicalMDS ifX is of class"dist".

  • "spca". Like"pca", but each dimension is then scaledso the standard deviation is 1e-4, to give a distribution similar to thatused in t-SNE. This is an alias forinit = "pca", init_sdev = 1e-4.

  • "agspectral" An "approximate global" modification of"spectral" which all edges in the graph to a value of 1, and thensets a random number of edges (negative_sample_rate edges pervertex) to 0.1, to approximate the effect of non-local affinities.

  • A matrix of initial coordinates.

For spectral initializations ("spectral", "normlaplacian", "laplacian", "agspectral"), if more than one connected component is identified, no spectral initialization is attempted. Instead a PCA-based initialization is attempted. If verbose = TRUE the number of connected components is logged to the console. The existence of multiple connected components implies that a global view of the data cannot be attained with this initialization. Increasing the value of n_neighbors may help.

init_sdev

If non-NULL, scales each dimension of the initialized coordinates (including any user-supplied matrix) to this standard deviation. By default (init_sdev = "range"), each column of the initial coordinates is range scaled between 0-10. Scaling the input may help if the unscaled versions result in initial coordinates with large inter-point distances or outliers: this usually results in small gradients during optimization and very little progress being made to the layout. Shrinking the initial embedding by rescaling can help under these circumstances. Scaling the result of init = "pca" is usually recommended (init = "spca" is an alias for init = "pca", init_sdev = 1e-4), but for the spectral initializations the scaled versions usually aren't necessary unless you are using a large value of n_neighbors (e.g. n_neighbors = 150 or higher).
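For example, the scaled PCA initialization (the alias mentioned above) can be requested explicitly:

# equivalent to init = "spca"
iris_umap <- umap2(iris[, 1:4], init = "pca", init_sdev = 1e-4, n_neighbors = 5)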

spread

The effective scale of embedded points. In combination withmin_dist, this determines how clustered/clumped the embedded pointsare.

min_dist

The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out.

set_op_mix_ratio

Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial set. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.

local_connectivity

The local connectivity required – i.e. the numberof nearest neighbors that should be assumed to be connected at a locallevel. The higher this value the more connected the manifold becomeslocally. In practice this should be not more than the local intrinsicdimension of the manifold.

bandwidth

The effective bandwidth of the kernel if we view thealgorithm as similar to Laplacian Eigenmaps. Larger values induce moreconnectivity and a more global view of the data, smaller values concentratemore locally.

repulsion_strength

Weighting applied to negative samples in lowdimensional embedding optimization. Values higher than one will result ingreater weight being given to negative samples.

negative_sample_rate

The number of negative edge/1-simplex samples touse per positive edge/1-simplex sample in optimizing the low dimensionalembedding.

a

More specific parameters controlling the embedding. IfNULLthese values are set automatically as determined bymin_dist andspread.

b

More specific parameters controlling the embedding. IfNULLthese values are set automatically as determined bymin_dist andspread.

nn_method

Method for finding nearest neighbors. Options are:

  • "fnn". Use exact nearest neighbors via theFNN package.

  • "annoy" Use approximate nearest neighbors via theRcppAnnoy package.

  • "hnsw" Use approximate nearest neighbors with theHierarchical Navigable Small World (HNSW) method (Malkov and Yashunin,2018) via theRcppHNSW package.RcppHNSW is not a dependency of this package: this option isonly available if you have installedRcppHNSW yourself. Also,HNSW only supports the following arguments formetric andtarget_metric:"euclidean","cosine" and"correlation".

  • "nndescent" Use approximate nearest neighbors with theNearest Neighbor Descent method (Dong et al., 2011) via thernndescentpackage.rnndescent is not a dependency of this package: thisoption is only available if you have installedrnndescentyourself.

By default, ifX has less than 4,096 vertices, the exact nearestneighbors are found. Otherwise, approximate nearest neighbors are used.You may also pass pre-calculated nearest neighbor data to this argument. Itmust be one of two formats, either a list consisting of two elements:

  • "idx". An_vertices x n_neighbors matrixcontaining the integer indexes of the nearest neighbors inX. Eachvertex is considered to be its own nearest neighbor, i.e.idx[, 1] == 1:n_vertices.

  • "dist". An_vertices x n_neighbors matrixcontaining the distances of the nearest neighbors.

or a sparse distance matrix of typedgCMatrix, with dimensionsn_vertices x n_vertices. Distances should be arranged by column,i.e. a non-zero entry in rowj of theith column indicatesthat thejth observation inX is a nearest neighbor of theith observation with the distance given by the value of thatelement.Then_neighbors parameter is ignored when using precomputednearest neighbor data. If using the sparse distance matrix input, eachcolumn can contain a different number of neighbors.

n_trees

Number of trees to build when constructing the nearestneighbor index. The more trees specified, the larger the index, but thebetter the results. Withsearch_k, determines the accuracy of theAnnoy nearest neighbor search. Only used if thenn_method is"annoy". Sensible values are between10 to100.

search_k

Number of nodes to search during the neighbor retrieval. The larger k, the more accurate the results, but the longer the search takes. With n_trees, determines the accuracy of the Annoy nearest neighbor search. Only used if the nn_method is "annoy".

approx_pow

IfTRUE, use an approximation to the power functionin the UMAP gradient, fromhttps://martin.ankerl.com/2012/01/25/optimized-approximative-pow-in-c-and-cpp/.Ignored ifdens_scale is non-NULL.

y

Optional target data for supervised dimension reduction. Can be avector, matrix or data frame. Use thetarget_metric parameter tospecify the metrics to use, using the same syntax asmetric. Usuallyeither a single numeric or factor column is used, but more complex formatsare possible. The following types are allowed:

  • Factor columns with the same length asX.NA isallowed for any observation with an unknown level, in which caseUMAP operates as a form of semi-supervised learning. Each column istreated separately.

  • Numeric data.NA isnot allowed in this case. Use theparametertarget_n_neighbors to set the number of neighbors usedwithy. If unset,n_neighbors is used. Unlike factors,numeric columns are grouped into one block unlesstarget_metricspecifies otherwise. For example, if you wish columnsa andb to be treated separately, specifytarget_metric = list(euclidean = "a", euclidean = "b"). Otherwise,the data will be effectively treated as a matrix with two columns.

  • Nearest neighbor data, consisting of a list of two matrices, idx and dist. These represent the precalculated nearest neighbor indices and distances, respectively. This is the same format as that expected for precalculated data in nn_method. This format assumes that the underlying data was a numeric vector. Any user-supplied value of the target_n_neighbors parameter is ignored in this case, because the number of columns in the matrices is used for the value. Multiple nearest neighbor data using different metrics can be supplied by passing a list of these lists.

UnlikeX, all factor columns included iny are automaticallyused.

target_n_neighbors

Number of nearest neighbors to use to construct thetarget simplicial set. Default value isn_neighbors. Applies only ify is non-NULL andnumeric.

target_metric

The metric used to measure distance fory ifusing supervised dimension reduction. Used only ify is numeric.

target_weight

Weighting factor between data topology and targettopology. A value of 0.0 weights entirely on data, a value of 1.0 weightsentirely on target. The default of 0.5 balances the weighting equallybetween data and target. Only applies ify is non-NULL.

pca

If set to a positive integer value, reduce data to this number of columns using PCA. Doesn't apply if the distance metric is "hamming", or if the dimensions of the data are not larger than the number specified (i.e. the number of rows and columns must both be larger than the value of this parameter). If you have > 100 columns in a data frame or matrix, reducing the number of columns in this way may substantially increase the performance of the nearest neighbor search at the cost of a potential decrease in accuracy. In many t-SNE applications, a value of 50 is recommended, although there's no guarantee that this is appropriate for all settings.

pca_center

IfTRUE, center the columns ofX beforecarrying out PCA. For binary data, it's recommended to set this toFALSE.

pcg_rand

IfTRUE, use the PCG random number generator (O'Neill,2014) during optimization. Otherwise, use the faster (but probably lessstatistically good) Tausworthe "taus88" generator. The default isTRUE. This parameter has been superseded byrng_type – ifboth are set,rng_type takes precedence.

fast_sgd

If TRUE, then the following combination of parameters is set: pcg_rand = TRUE, n_sgd_threads = "auto" and approx_pow = TRUE. The default is FALSE. Setting this to TRUE will speed up the stochastic optimization phase, but gives a potentially less accurate embedding, one which will not be exactly reproducible even with a fixed seed. For visualization, fast_sgd = TRUE will give perfectly good results. For more generic dimensionality reduction, it's safer to leave fast_sgd = FALSE. If fast_sgd = TRUE, then user-supplied values of pcg_rand, n_sgd_threads, and approx_pow are ignored.

ret_model

IfTRUE, then return extra data that can be used toadd new data to an existing embedding viaumap_transform. Theembedded coordinates are returned as the list itemembedding. IfFALSE, just return the coordinates. This parameter can be used inconjunction withret_nn andret_extra. Note that somesettings are incompatible with the production of a UMAP model: externalneighbor data (passed via a list tonn_method), and factor columnsthat were included via themetric parameter. In the latter case, themodel produced is based only on the numeric data. A transformation usingnew data is possible, but the factor columns in the new data are ignored.Note that settingret_model = TRUE forces the use of the approximatenearest neighbors method. Because small datasets would otherwise use exactnearest neighbor calculations, settingret_model = TRUE means thatdifferent results may be returned for small datasets in terms of both thereturned nearest neighbors (if requested) and the final embeddedcoordinates, compared toret_model = FALSE, even if the randomnumber seed is fixed. To avoid this, explicitly setnn_method = "annoy" in theret_model = FALSE case.

ret_nn

IfTRUE, then in addition to the embedding, also returnnearest neighbor data that can be used as input tonn_method toavoid the overhead of repeatedly calculating the nearest neighbors whenmanipulating unrelated parameters (e.g.min_dist,n_epochs,init). See the "Value" section for the names of the list items. IfFALSE, just return the coordinates. Note that the nearest neighborscould be sensitive to data scaling, so be wary of reusing nearest neighbordata if modifying thescale parameter. This parameter can be used inconjunction withret_model andret_extra.

ret_extra

A vector indicating what extra data to return. May containany combination of the following strings:

  • "model" Same as settingret_model = TRUE.

  • "nn" Same as settingret_nn = TRUE.

  • "fgraph" the high dimensional fuzzy graph (i.e. the fuzzysimplicial set of the merged local views of the input data). The graphis returned as a sparse symmetric N x N matrix of classdgCMatrix-class, where a non-zero entry (i, j) gives themembership strength of the edge connecting vertex i and vertex j. Thiscan be considered analogous to the input probability (or similarity oraffinity) used in t-SNE and LargeVis. Note that the graph is furthersparsified by removing edges with sufficiently low membership strengththat they would not be sampled by the probabilistic edge samplingemployed for optimization and therefore the number of non-zero elementsin the matrix is dependent onn_epochs. If you are onlyinterested in the fuzzy input graph (e.g. for clustering), settingn_epochs = 0 will avoid any further sparsifying.Be aware that setting 'binary_edge_weights = TRUE' will affect thisgraph (all non-zero edge weights will be 1).

  • "sigma" the normalization value for each observation in thedataset when constructing the smoothed distances to each of itsneighbors. This gives some sense of the local density of eachobservation in the high dimensional space: higher values ofsigma indicate a higher dispersion or lower density.

n_threads

Number of threads to use (except during stochastic gradientdescent). Default is half the number of concurrent threads supported by thesystem. For nearest neighbor search, only applies ifnn_method = "annoy". Ifn_threads > 1, then the Annoy indexwill be temporarily written to disk in the location determined bytempfile.

n_sgd_threads

Number of threads to use during stochastic gradient descent. If set to > 1, then be aware that if batch = FALSE, results will not be reproducible, even if set.seed is called with a fixed seed before running. Set to "auto" to use the same value as n_threads. Default is to use only one thread, unless batch = TRUE, in which case "auto" is used.

grain_size

The minimum amount of work to do on each thread. If thisvalue is set high enough, then less thann_threads orn_sgd_threads will be used for processing, which might give aperformance improvement if the overhead of thread management and contextswitching was outweighing the improvement due to concurrent processing.This should be left at default (1) and work will be spread evenlyover all the threads specified.

tmpdir

Temporary directory to store nearest neighbor indexes duringnearest neighbor search. Default istempdir. The index isonly written to disk ifn_threads > 1 andnn_method = "annoy"; otherwise, this parameter is ignored.

verbose

IfTRUE, log details to the console.

batch

IfTRUE, then embedding coordinates are updated at theend of each epoch rather than during the epoch. In batch mode, results arereproducible with a fixed random seed even withn_sgd_threads > 1,at the cost of a slightly higher memory use. You may also have to modifylearning_rate and increasen_epochs, so whether this providesa speed increase over the single-threaded optimization is likely to bedataset and hardware-dependent.

opt_args

A list of optimizer parameters, used whenbatch = TRUE. The default optimization method used is Adam (Kingmaand Ba, 2014).

  • method The optimization method to use. Either"adam"or"sgd" (stochastic gradient descent). Default:"adam".

  • beta1 (Adam only). The weighting parameter for theexponential moving average of the first moment estimator. Effectively themomentum parameter. Should be a floating point value between 0 and 1.Higher values can smooth oscillatory updates in poorly-conditionedsituations and may allow for a largerlearning_rate to bespecified, but too high can cause divergence. Default:0.5.

  • beta2 (Adam only). The weighting parameter for theexponential moving average of the uncentered second moment estimator.Should be a floating point value between 0 and 1. Controls the degree ofadaptivity in the step-size. Higher values put more weight on previoustime steps. Default:0.9.

  • eps (Adam only). Intended to be a small value to preventdivision by zero, but in practice can also affect convergence due to itsinteraction withbeta2. Higher values reduce the effect of thestep-size adaptivity and bring the behavior closer to stochastic gradientdescent with momentum. Typical values are between 1e-8 and 1e-3. Default:1e-7.

  • alpha The initial learning rate. Default: the value of thelearning_rate parameter.

epoch_callback

A function which will be invoked at the end of everyepoch. Its signature should be:(epoch, n_epochs, coords), where:

  • epoch The current epoch number (between1 andn_epochs).

  • n_epochs Number of epochs to use during the optimization ofthe embedded coordinates.

  • coords The embedded coordinates as of the end of the currentepoch, as a matrix with dimensions (N,n_components).

pca_method

Method to carry out any PCA dimensionality reduction whenthepca parameter is specified. Allowed values are:

  • "irlba". Usesprcomp_irlba from theirlba package.

  • "rsvd". Uses 5 iterations ofsvdr fromtheirlba package.This is likely to give much faster but potentially less accurate resultsthan using"irlba". For the purposes of nearest neighborcalculation and coordinates initialization, any loss of accuracy doesn'tseem to matter much.

  • "bigstatsr". Usesbig_randomSVDfrom thebigstatsrpackage. The SVD methods used inbigstatsr may be faster onsystems without access to efficient linear algebra libraries (e.g.Windows).Note:bigstatsr isnot a dependency ofuwot: if you choose to use this package for PCA, youmust installit yourself.

  • "svd". Usessvd for the SVD. This islikely to be slow for all but the smallest datasets.

  • "auto" (the default). Uses"irlba", unless more than50case"svd" is used.

binary_edge_weights

IfTRUE then edge weights in the inputgraph are treated as binary (0/1) rather than real valued. This affects thesampling frequency of neighbors and is the strategy used by the PaCMAPmethod (Wang and co-workers, 2020). Practical (Böhm and co-workers, 2020)and theoretical (Damrich and Hamprecht, 2021) work suggests this has littleeffect on UMAP's performance.

dens_scale

A value between 0 and 1. If > 0 then the output attemptsto preserve relative local density around each observation. This uses anapproximation to the densMAP method (Narayan and co-workers, 2021). Thelarger the value ofdens_scale, the greater the range of outputdensities that will be used to map the input densities. This option isignored if using multiplemetric blocks.

seed

Integer seed to use to initialize the random number generatorstate. Combined withn_sgd_threads = 1 orbatch = TRUE, thisshould give consistent output across multiple runs on a given installation.Setting this value is equivalent to callingset.seed,but it may be more convenient in some situations than having to call aseparate function. The default is to not set a seed. Ifret_model = TRUE, the seed will be stored in the output model andthen used to set the seed insideumap_transform.

nn_args

A list containing additional arguments to pass to the nearestneighbor method. Fornn_method = "annoy", you can specify"n_trees" and"search_k", and these will override then_trees andsearch_k parameters.Fornn_method = "hnsw", you may specify the following arguments:

  • M The maximum number of neighbors to keep for each vertex.Reasonable values are2 to100. Higher values give betterrecall at the cost of more memory. Default value is16.

  • ef_construction A positive integer specifying the size ofthe dynamic list used during index construction. A higher value willprovide better results at the cost of a longer time to build the index.Default is200.

  • ef A positive integer specifying the size of the dynamiclist used during search. This cannot be smaller thann_neighborsand cannot be higher than the number of items in the index. Default is10.

Fornn_method = "nndescent", you may specify the followingarguments:

  • n_trees The number of trees to use in a random projectionforest to initialize the search. A larger number will give more accurateresults at the cost of a longer computation time. The default ofNULL means that the number is chosen based on the number ofobservations inX.

  • max_candidates The number of potential neighbors to exploreper iteration. By default, this is set ton_neighbors or60,whichever is smaller. A larger number will give more accurate results atthe cost of a longer computation time.

  • n_iters The number of iterations to run the search. A largernumber will give more accurate results at the cost of a longer computationtime. By default, this will be chosen based on the number of observationsinX. You may also need to modify the convergence criteriondelta.

  • delta The minimum relative change in the neighbor graph allowed before early stopping. Should be a value between 0 and 1. The smaller the value, the smaller the amount of progress between iterations is allowed. Default value of 0.001 means that at least 0.1% of the neighbor graph must be updated at each iteration.

  • init How to initialize the nearest neighbor descent. By default this is set to "tree" and uses a random projection forest. If you set this to "rand", then a random selection is used. Usually this is less accurate than using RP trees, but for high-dimensional cases, there may be little difference in the quality of the initialization and random initialization will be a lot faster. If you set this to "rand", then the n_trees parameter is ignored.

  • pruning_degree_multiplier The maximum number of edges per nodeto retain in the search graph, relative ton_neighbors. A largervalue will give more accurate results at the cost of a longer computationtime. Default is1.5. This parameter only affects neighbor searchwhen transforming new data withumap_transform.

  • epsilon Controls the degree of the back-tracking whentraversing the search graph. Setting this to0.0 will do a greedysearch with no back-tracking. A larger value will give more accurateresults at the cost of a longer computation time. Default is0.1.This parameter only affects neighbor search when transforming new data withumap_transform.

  • max_search_fraction Specifies the maximum fraction of thesearch graph to traverse. By default, this is set to1.0, so theentire graph (i.e. all items inX) may be visited. You may want toset this to a smaller value if you have a very large dataset (inconjunction withepsilon) to avoid an inefficient exhaustive searchof the data inX. This parameter only affects neighbor search whentransforming new data withumap_transform.

rng_type

The type of random number generator to use duringoptimization. One of:

  • "pcg". Use the PCG random number generator (O'Neill, 2014).

  • "tausworthe". Use the Tausworthe "taus88" generator.

  • "deterministic". Use a deterministic number generator. Thisisn't actually random, but may provide enough variation in the negativesampling to give a good embedding and can provide a noticeable speed-up.

For backwards compatibility, by default this is unset and the choice ofpcg_rand is used (making "pcg" the effective default).
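A minimal sketch of selecting the generator explicitly (which then takes precedence over pcg_rand):

iris_umap <- umap2(iris[, 1:4], rng_type = "deterministic", n_neighbors = 5)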

Details

This function behaves like umap except with some updated defaults that make it behave more like the Python implementation and which cannot be added to umap without breaking backwards compatibility. In addition, some further defaults and behaviors differ from umap; see the package documentation for details.

Value

A matrix of optimized coordinates, or, if ret_model = TRUE, ret_nn = TRUE or ret_extra is non-empty, a list containing the embedding and the requested extra data.

The returned list contains the combined data from any combination ofspecifyingret_model,ret_nn andret_extra.

References

Belkin, M., & Niyogi, P. (2002).Laplacian eigenmaps and spectral techniques for embedding and clustering.InAdvances in neural information processing systems(pp. 585-591).http://papers.nips.cc/paper/1961-laplacian-eigenmaps-and-spectral-techniques-for-embedding-and-clustering.pdf

Böhm, J. N., Berens, P., & Kobak, D. (2020).A unifying perspective on neighbor embeddings along the attraction-repulsion spectrum.arXiv preprintarXiv:2007.08902.https://arxiv.org/abs/2007.08902

Damrich, S., & Hamprecht, F. A. (2021).On UMAP's true loss function.Advances in Neural Information Processing Systems,34.https://proceedings.neurips.cc/paper/2021/hash/2de5d16682c3c35007e4e92982f1a2ba-Abstract.html

Dong, W., Moses, C., & Li, K. (2011, March).Efficient k-nearest neighbor graph construction for generic similarity measures.InProceedings of the 20th international conference on World Wide Web(pp. 577-586).ACM.doi:10.1145/1963405.1963487.

Kingma, D. P., & Ba, J. (2014).Adam: A method for stochastic optimization.arXiv preprintarXiv:1412.6980.https://arxiv.org/abs/1412.6980

Malkov, Y. A., & Yashunin, D. A. (2018).Efficient and robust approximate nearest neighbor search using hierarchicalnavigable small world graphs.IEEE transactions on pattern analysis and machine intelligence,42(4), 824-836.

McInnes, L., Healy, J., & Melville, J. (2018).UMAP: Uniform Manifold Approximation and Projection for Dimension ReductionarXiv preprintarXiv:1802.03426.https://arxiv.org/abs/1802.03426

Narayan, A., Berger, B., & Cho, H. (2021).Assessing single-cell transcriptomic variability through density-preserving data visualization.Nature biotechnology,39(6), 765-774.doi:10.1038/s41587-020-00801-7

O'Neill, M. E. (2014).PCG: A family of simple fast space-efficient statistically goodalgorithms for random number generation(Report No. HMC-CS-2014-0905). Harvey Mudd College.

Tang, J., Liu, J., Zhang, M., & Mei, Q. (2016, April).Visualizing large-scale and high-dimensional data.InProceedings of the 25th International Conference on World Wide Web(pp. 287-297).International World Wide Web Conferences Steering Committee.https://arxiv.org/abs/1602.00370

Van der Maaten, L., & Hinton, G. (2008).Visualizing data using t-SNE.Journal of Machine Learning Research,9 (2579-2605).https://www.jmlr.org/papers/v9/vandermaaten08a.html

Wang, Y., Huang, H., Rudin, C., & Shaposhnik, Y. (2021).Understanding How Dimension Reduction Tools Work: An Empirical Approach to Deciphering t-SNE, UMAP, TriMap, and PaCMAP for Data Visualization.Journal of Machine Learning Research,22(201), 1-73.https://www.jmlr.org/papers/v22/20-1061.html

Examples

iris30 <- iris[c(1:10, 51:60, 101:110), ]
iris_umap <- umap2(iris30, n_neighbors = 5)

Add New Points to an Existing Embedding

Description

Carry out an embedding of new data using an existing embedding. Requiresusing the result of callingumap ortumap withret_model = TRUE.

Usage

umap_transform(  X = NULL,  model = NULL,  nn_method = NULL,  init_weighted = TRUE,  search_k = NULL,  tmpdir = tempdir(),  n_epochs = NULL,  n_threads = NULL,  n_sgd_threads = 0,  grain_size = 1,  verbose = FALSE,  init = "weighted",  batch = NULL,  learning_rate = NULL,  opt_args = NULL,  epoch_callback = NULL,  ret_extra = NULL,  seed = NULL)

Arguments

X

The new data to be transformed, either a matrix or data frame. Must have the same columns in the same order as the input data used to generate the model.

model

Data associated with an existing embedding.

nn_method

Optional pre-calculated nearest neighbor data. There aretwo supported formats. The first is a list consisting of two elements:

  • "idx". An_vertices x n_neighbors matrix wheren_vertices is the number of observations inX. The contentsof the matrix should be the integer indexes of the data used to generatethemodel, which are then_neighbors-nearest neighbors ofthe data to be transformed.

  • "dist". An_vertices x n_neighbors matrixcontaining the distances of the nearest neighbors.

The second supported format is a sparse distance matrix of type dgCMatrix, with dimensions n_model_vertices x n_vertices, where n_model_vertices is the number of observations in the original data that generated the model. Distances should be arranged by column, i.e. a non-zero entry in row j of the ith column indicates that the jth observation in the original data used to generate the model is a nearest neighbor of the ith observation in the new data, with the distance given by the value of that element. In this format, a different number of neighbors is allowed for each observation, i.e. each column can contain a different number of non-zero values. Multiple nearest neighbor data (e.g. from two different pre-calculated metrics) can be passed by passing a list containing the nearest neighbor data lists as items.
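A hedged sketch of the list format above, using FNN to find neighbors of the new data among the original training data. This assumes X may be left as NULL (its default in the Usage above) when pre-calculated neighbor data is supplied, and that k matches the n_neighbors used to build the model; the object names are illustrative.

iris_train <- as.matrix(iris[1:100, 1:4])
iris_test  <- as.matrix(iris[101:150, 1:4])
model <- umap(iris_train, ret_model = TRUE, n_neighbors = 15, n_epochs = 20)
knn <- FNN::get.knnx(data = iris_train, query = iris_test, k = 15)
test_emb <- umap_transform(X = NULL, model = model,
                           nn_method = list(idx = knn$nn.index, dist = knn$nn.dist))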

init_weighted

IfTRUE, then initialize the embedded coordinatesofX using a weighted average of the coordinates of the nearestneighbors from the original embedding inmodel, where the weightsused are the edge weights from the UMAP smoothed knn distances. Otherwise,use an un-weighted average.This parameter will be deprecated and removed at version 1.0 of thispackage. Use theinit parameter as a replacement, replacinginit_weighted = TRUE withinit = "weighted" andinit_weighted = FALSE withinit = "average".

search_k

Number of nodes to search during the neighbor retrieval. The larger k, the more accurate the results, but the longer the search takes. Default is the value used in building the model.

tmpdir

Temporary directory to store nearest neighbor indexes duringnearest neighbor search. Default istempdir. The index isonly written to disk ifn_threads > 1; otherwise, this parameter isignored.

n_epochs

Number of epochs to use during the optimization of the embedded coordinates. A value between 30 and 100 is a reasonable trade-off between speed and thoroughness. By default, this value is set to one third the number of epochs used to build the model.

n_threads

Number of threads to use (except during stochastic gradient descent). Default is half the number of concurrent threads supported by the system.

n_sgd_threads

Number of threads to use during stochastic gradient descent. If set to > 1, then be aware that if batch = FALSE, results will not be reproducible, even if set.seed is called with a fixed seed before running. Set to "auto" to use the same value as n_threads.

grain_size

Minimum batch size for multithreading. If the number of items to process in a thread falls below this number, then no threads will be used. Used in conjunction with n_threads and n_sgd_threads.

verbose

If TRUE, log details to the console.

init

how to initialize the transformed coordinates. One of:

  • "weighted" (The default). Use a weighted average of thecoordinates of the nearest neighbors from the original embedding inmodel, where the weights used are the edge weights from the UMAPsmoothed knn distances. Equivalent toinit_weighted = TRUE.

  • "average". Use the mean average of the coordinates ofthe nearest neighbors from the original embedding inmodel.Equivalent toinit_weighted = FALSE.

  • A matrix of user-specified input coordinates, which must have dimensions (nrow(X), ncol(model$embedding)).

This parameter should be used in preference to init_weighted. A sketch of the matrix option is given below.
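The following is a minimal sketch (not taken from the package documentation) of passing user-specified starting coordinates: it takes the "average" initialization and perturbs it slightly before handing it back via init.

# Hypothetical sketch: user-supplied initial coordinates
iris_train <- iris[1:100, ]
iris_test <- iris[101:150, ]
model <- umap(iris_train, ret_model = TRUE)

# built-in initialization
emb_avg <- umap_transform(iris_test, model, init = "average")

# user-supplied matrix: must be nrow(X) rows by ncol(model$embedding) columns
init_coords <- emb_avg + matrix(rnorm(length(emb_avg), sd = 1e-4),
                                nrow = nrow(emb_avg))
emb_custom <- umap_transform(iris_test, model, init = init_coords)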

batch

If TRUE, then embedding coordinates are updated at the end of each epoch rather than during the epoch. In batch mode, results are reproducible with a fixed random seed even with n_sgd_threads > 1, at the cost of a slightly higher memory use. You may also have to modify learning_rate and increase n_epochs, so whether this provides a speed increase over the single-threaded optimization is likely to be dataset and hardware-dependent. If NULL, the transform will use the value provided in the model, if available. Default: FALSE.
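As an illustration of the reproducibility claim, this sketch (not taken from the package documentation) fixes the seed, uses batch mode with two SGD threads, and compares two runs; under these assumptions the two embeddings are expected to match.

# Hypothetical sketch: reproducible multi-threaded transform in batch mode
iris_train <- iris[1:100, ]
iris_test <- iris[101:150, ]
model <- umap(iris_train, ret_model = TRUE, batch = TRUE)

emb1 <- umap_transform(iris_test, model, n_sgd_threads = 2, seed = 42)
emb2 <- umap_transform(iris_test, model, n_sgd_threads = 2, seed = 42)
all.equal(emb1, emb2)  # expected TRUE under these assumptions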

learning_rate

Initial learning rate used in optimization of the coordinates. This overrides the value associated with the model. This should be left unspecified under most circumstances.

opt_args

A list of optimizer parameters, used when batch = TRUE. The default optimization method used is Adam (Kingma and Ba, 2014).

  • method The optimization method to use. Either "adam" or "sgd" (stochastic gradient descent). Default: "adam".

  • beta1 (Adam only). The weighting parameter for the exponential moving average of the first moment estimator. Effectively the momentum parameter. Should be a floating point value between 0 and 1. Higher values can smooth oscillatory updates in poorly-conditioned situations and may allow for a larger learning_rate to be specified, but too high can cause divergence. Default: 0.5.

  • beta2 (Adam only). The weighting parameter for the exponential moving average of the uncentered second moment estimator. Should be a floating point value between 0 and 1. Controls the degree of adaptivity in the step-size. Higher values put more weight on previous time steps. Default: 0.9.

  • eps (Adam only). Intended to be a small value to prevent division by zero, but in practice can also affect convergence due to its interaction with beta2. Higher values reduce the effect of the step-size adaptivity and bring the behavior closer to stochastic gradient descent with momentum. Typical values are between 1e-8 and 1e-3. Default: 1e-7.

  • alpha The initial learning rate. Default: the value of the learning_rate parameter.

If NULL, the transform will use the value provided in the model, if available. A sketch of passing these settings is given below.
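As a minimal sketch (not taken from the package documentation), the following passes explicit Adam settings through opt_args, assuming batch mode is in effect; the values shown simply restate the documented defaults.

# Hypothetical sketch: explicit Adam settings via opt_args
iris_train <- iris[1:100, ]
iris_test <- iris[101:150, ]
model <- umap(iris_train, ret_model = TRUE, batch = TRUE)

test_embedding <- umap_transform(
  iris_test, model,
  batch = TRUE,
  opt_args = list(method = "adam", beta1 = 0.5, beta2 = 0.9,
                  eps = 1e-7, alpha = 1)
)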

epoch_callback

A function which will be invoked at the end of every epoch. Its signature should be: (epoch, n_epochs, coords, fixed_coords), where (see the sketch after this list):

  • epoch The current epoch number (between 1 and n_epochs).

  • n_epochs Number of epochs to use during the optimization of the embedded coordinates.

  • coords The embedded coordinates as of the end of the current epoch, as a matrix with dimensions (N, n_components).

  • fixed_coords The originally embedded coordinates from the model. These are fixed and do not change. A matrix with dimensions (Nmodel, n_components) where Nmodel is the number of observations in the original data.
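The following sketch (not taken from the package documentation) defines a callback with the documented signature that logs progress every 10 epochs, then passes it to umap_transform.

# Hypothetical sketch: a progress-logging epoch callback
progress_cb <- function(epoch, n_epochs, coords, fixed_coords) {
  if (epoch %% 10 == 0) {
    message("epoch ", epoch, "/", n_epochs,
            ": ", nrow(coords), " new points, ",
            nrow(fixed_coords), " fixed model points")
  }
}

iris_train <- iris[1:100, ]
iris_test <- iris[101:150, ]
model <- umap(iris_train, ret_model = TRUE)
test_embedding <- umap_transform(iris_test, model, n_epochs = 50,
                                 epoch_callback = progress_cb)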

ret_extra

A vector indicating what extra data to return. May contain any combination of the following strings (a usage sketch follows this list):

  • "fgraph" the high dimensional fuzzy graph (i.e. the fuzzysimplicial set of the merged local views of the input data). The graphis returned as a sparse matrix of classdgCMatrix-classwith dimensionsNX xNmodel, whereNX is the numberof items in the data to transform inX, andNModel isthe number of items in the data used to build the UMAPmodel.A non-zero entry (i, j) gives the membership strength of the edgeconnecting the vertex representing the ith item inX to thejth item in the data used to build themodel. Note that thegraph is further sparsified by removing edges with sufficiently lowmembership strength that they would not be sampled by the probabilisticedge sampling employed for optimization and therefore the number ofnon-zero elements in the matrix is dependent onn_epochs. If youare only interested in the fuzzy input graph (e.g. for clustering),settingn_epochs = 0 will avoid any further sparsifying.

  • "nn" the nearest neighbor graph forX with respect tothe observations in themodel. The graph will be returned as alist of two items:idx a matrix of indices, with as many rowsas there are items inX and as many columns as there are nearestneighbors to be computed (this value is determined by themodel).The indices are those of the rows of the data used to build themodel, so they're not necessarily of much use unless you haveaccess to that data. The second item,dist is a matrix of theequivalent distances, with the same dimensions asidx.

seed

Integer seed to use to initialize the random number generator state. Combined with n_sgd_threads = 1 or batch = TRUE, this should give consistent output across multiple runs on a given installation. Setting this value is equivalent to calling set.seed, but it may be more convenient in some situations than having to call a separate function. The default is to not set a seed, in which case this function uses the behavior specified by the supplied model: if the model specifies a seed, then the model seed will be used to seed the random number generator, and results will still be consistent (if n_sgd_threads = 1). If you want to force the seed to not be set, even if it is set in model, set seed = FALSE.

Details

Note that some settings are incompatible with the production of a UMAP model via umap: external neighbor data (passed as a list via the nn_method parameter), and factor columns that were included in the UMAP calculation via the metric parameter. In the latter case, the model produced is based only on the numeric data. A transformation is possible, but factor columns in the new data are ignored.

Value

A matrix of coordinates for X transformed into the space of the model, or, if ret_extra is specified, a list containing the embedded coordinates along with each of the extra items requested.

Examples

iris_train <- iris[1:100, ]
iris_test <- iris[101:150, ]

# You must set ret_model = TRUE to return extra data needed
iris_train_umap <- umap(iris_train, ret_model = TRUE)
iris_test_umap <- umap_transform(iris_test, iris_train_umap)

Unload a Model

Description

Unloads the UMAP model. This prevents the model from being used with umap_transform, but allows the temporary working directory associated with saving or loading the model to be removed.

Usage

unload_uwot(model, cleanup = TRUE, verbose = FALSE)

Arguments

model

a UMAP model created by umap.

cleanup

if TRUE, attempt to delete the temporary working directory that was used in either the save or load of the model.

verbose

if TRUE, log information to the console.

See Also

save_uwot, load_uwot

Examples

iris_train <- iris[c(1:10, 51:60), ]
iris_test <- iris[100:110, ]

# create model
model <- umap(iris_train, ret_model = TRUE, n_epochs = 20)

# save without unloading: this leaves behind a temporary working directory
model_file <- tempfile("iris_umap")
model <- save_uwot(model, file = model_file)

# The model can continue to be used
test_embedding <- umap_transform(iris_test, model)

# To manually unload the model from memory when finished and to clean up
# the working directory (this doesn't touch your model file)
unload_uwot(model)

# At this point, model cannot be used with umap_transform, this would fail:
# test_embedding2 <- umap_transform(iris_test, model)

# restore the model: this also creates a temporary working directory
model2 <- load_uwot(file = model_file)
test_embedding2 <- umap_transform(iris_test, model2)

# Unload and clean up the loaded model temp directory
unload_uwot(model2)

# clean up the model file
unlink(model_file)

# save with unloading: this deletes the temporary working directory but
# doesn't allow the model to be re-used
model3 <- umap(iris_train, ret_model = TRUE, n_epochs = 20)
model_file3 <- tempfile("iris_umap")
model3 <- save_uwot(model3, file = model_file3, unload = TRUE)
