R package for large-scale similarity/distance computation
koheiw/proxyC
proxyC computes proximity between rows or columns of large matrices efficiently in C++. It is optimized for large sparse matrices using the Armadillo and Intel TBB libraries. Among several built-in similarity/distance measures, computation of correlation, cosine similarity and Euclidean distance is particularly fast.
This code was originally written for quanteda to compute similarity/distance between documents or features in large corpora, but was separated into a stand-alone package to make it available for broader data science purposes.
Since v0.4.0, proxyC requires the Intel oneAPI Threading Building Blocks library for parallel computing. Windows and Mac users can download a binary package from CRAN, but Linux users must first install the library by executing the commands below:

    # Fedora, CentOS, RHEL
    sudo yum install tbb-devel

    # Debian and Ubuntu
    sudo apt install libtbb-dev

Then install the package from CRAN:

    install.packages("proxyC")


    require(Matrix)
    ## Loading required package: Matrix
    require(microbenchmark)
    ## Loading required package: microbenchmark
    require(ggplot2)
    ## Loading required package: ggplot2
    require(magrittr)
    ## Loading required package: magrittr

    # Set number of threads
    options("proxyC.threads" = 8)

    # Make a matrix with 99% zeros
    sm1k <- rsparsematrix(1000, 1000, 0.01) # 1,000 columns
    sm10k <- rsparsematrix(1000, 10000, 0.01) # 10,000 columns

    # Convert to dense format
    dm1k <- as.matrix(sm1k)
    dm10k <- as.matrix(sm10k)

With sparse matrices, proxyC is roughly 10 to 100 times faster than proxy.

    bm1 <- microbenchmark(
        "proxy 1k" = proxy::simil(dm1k, method = "cosine"),
        "proxyC 1k" = proxyC::simil(sm1k, margin = 2, method = "cosine"),
        "proxy 10k" = proxy::simil(dm10k, method = "cosine"),
        "proxyC 10k" = proxyC::simil(sm10k, margin = 2, method = "cosine"),
        times = 10
    )
    autoplot(bm1)

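To make the benchmarked call more concrete, the sketch below runs `proxyC::simil()` on a tiny matrix and compares the result with a cosine similarity computed by hand in base R. The toy matrix and the names `m`, `sm`, `norms` and `manual` are made up for illustration; the code assumes the result can be converted to a dense matrix with `as.matrix()`.

```r
library(Matrix)
library(proxyC)

# Toy data (hypothetical): 4 rows, 3 columns; the columns are compared
m <- matrix(c(1, 0, 2, 0,
              0, 3, 0, 1,
              2, 0, 4, 0), nrow = 4)
sm <- as(m, "CsparseMatrix")

# Cosine similarity between columns, as in the benchmark (margin = 2)
s <- proxyC::simil(sm, margin = 2, method = "cosine")

# Reference computation in base R: dot products divided by column norms
norms <- sqrt(colSums(m^2))
manual <- crossprod(m) / outer(norms, norms)

print(as.matrix(s))
print(isTRUE(all.equal(as.matrix(s), manual, check.attributes = FALSE)))
```

Columns 1 and 3 point in the same direction, so their cosine similarity should be 1, while columns 1 and 2 share no non-zero rows and should score 0.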
If `min_simil` is used, proxyC becomes even faster because small similarity scores are floored to zero.

    bm2 <- microbenchmark(
        "proxyC all" = proxyC::simil(sm1k, margin = 2, method = "cosine"),
        "proxyC min_simil" = proxyC::simil(sm1k, margin = 2, method = "cosine", min_simil = 0.9),
        times = 10
    )
    autoplot(bm2)

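The effect of flooring can be checked directly on a small random matrix. This is a minimal sketch, not part of the original benchmark: the matrix size, the seed and the threshold of 0.3 are arbitrary choices, and it assumes that every score below `min_simil` is simply left out of the sparse result.

```r
library(Matrix)
library(proxyC)

set.seed(1)
# Hypothetical test data: 100 rows, 50 columns, 20% non-zero
sm <- rsparsematrix(100, 50, 0.2)

s_all <- proxyC::simil(sm, margin = 2, method = "cosine")
s_min <- proxyC::simil(sm, margin = 2, method = "cosine", min_simil = 0.3)

# Inspect the surviving scores in dense form
d <- as.matrix(s_min)
print(range(d[d != 0]))

# Compare the number of stored non-zero values
print(c(all = nnzero(s_all), min_simil = nnzero(s_min)))
```

Every non-zero value left in `s_min` should be at least 0.3, and the count of non-zero values should be no larger than in `s_all`.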
Flooring by `min_simil` makes the resulting object much smaller.

    proxyC::simil(sm10k, margin = 2, method = "cosine") %>%
        object.size() %>%
        print(units = "MB")
    ## 763 Mb

    proxyC::simil(sm10k, margin = 2, method = "cosine", min_simil = 0.9) %>%
        object.size() %>%
        print(units = "MB")
    ## 0.2 Mb

If `rank` is used, proxyC only returns top-n values.

    bm3 <- microbenchmark(
        "proxyC rank" = proxyC::simil(sm1k, margin = 2, method = "correlation", rank = 10),
        "proxyC all" = proxyC::simil(sm1k, margin = 2, method = "correlation"),
        times = 10
    )
    autoplot(bm3)

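A small sketch can show how `rank` thins out the result. The data, seed and `rank = 3` are arbitrary illustrative choices. The exact selection rule (for example, whether the top-n values are chosen per column of the result, and how ties are handled) is not stated here, so the code only compares how many non-zero values are stored and leaves the details to the package documentation.

```r
library(Matrix)
library(proxyC)

set.seed(2)
# Hypothetical test data: 100 rows, 50 columns, 20% non-zero
sm <- rsparsematrix(100, 50, 0.2)

s_all <- proxyC::simil(sm, margin = 2, method = "correlation")
s_top <- proxyC::simil(sm, margin = 2, method = "correlation", rank = 3)

# Number of stored non-zero values with and without rank
print(c(all = nnzero(s_all), rank = nnzero(s_top)))
```

With random data nearly all pairwise correlations are non-zero, so the ranked result should store far fewer values than the full one.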