Topic modelling is an established and useful unsupervised learning technique for detecting topics in large collections of text. Among the many variants of topic modelling and innovations that have been tried, the classic 'Latent Dirichlet Allocation' (LDA) algorithm remains a powerful tool that yields good results. Yet computing LDA topic models requires significant system resources, particularly when corpora grow large. The biglda package addresses specific issues that may be outright obstacles to using LDA topic models productively and in a state-of-the-art fashion.
The package addresses issues of computing time and RAM limitations atthe different relevant stages when working with topic models:
- Data preparation and import: Issues of performance and memory efficiency start to arise when preparing input data for a training algorithm.
- Fitting topic models: Computing time for a single topic model may be extensive, and it is good practice to evaluate a set of topic models for hyperparameter optimization. So you need a fast implementation of LDA topic modelling to be able to fit a set of topic models within reasonable time.
- The cost of interfacing: R is very productive for developing your analysis, but the best implementations of LDA topic modelling are written in Python, Java and C. Transferring data between programming languages is a potential bottleneck and can be quite slow.
- Evaluating topic models: Issues of performance and memory efficiency are also relevant when computing indicators to assess topic models. Computations require mathematical operations with really large matrices, and memory may be exhausted easily.
Put more plainly: as you work with increasingly large data, fitting and evaluating topic models can mean hours or days of waiting for a result to emerge, only to see the process crash because memory has been exhausted.
The biglda package addresses performance and memory efficiency issues for R users at all of these stages. The ParallelTopicModel class from the Machine Learning for Language Toolkit 'mallet' offers a fast implementation of an LDA topic model that yields good results. The purpose of the 'biglda' package is to offer a seamless and fast interface to the Java classes of 'mallet' so that the multicore implementation of the LDA algorithm can be used. The as_LDA() function can be used to map the mallet model on the LDA_Gibbs class from the widely used topicmodels package.
The biglda package is a GitHub-only package at this time. Install the stable version as follows.
remotes::install_github("PolMine/biglda")
Install the development version of the package as follows.
remotes::install_github("PolMine/biglda",ref="dev")
The focus of the biglda package is to offer a seamless interface to Mallet. The biglda::mallet_install() function installs Mallet (v202108) in a directory within the package. The big disadvantage of this installation mechanism is that the Mallet installation is overwritten whenever you install a new version of biglda, and Mallet needs to be installed anew. This is why installing Mallet in the storage location used by your system for add-on software (such as "/opt" on Linux/macOS) is recommended.
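If you nevertheless want to use the package-internal mechanism, the call is simply the following (a minimal sketch; the function downloads and installs Mallet v202108 as described above):

```r
# Install Mallet v202108 into a directory within the biglda package.
# Remember that this installation is overwritten when biglda is updated.
biglda::mallet_install()
```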
We strongly recommend installing the latest version of Mallet (v202108 or higher). Among other things, it includes a new $printDenseDocumentTopics() method of the ParallelTopicModel class used by biglda::save_document_topics(), which significantly improves the efficiency of moving data from Java/Mallet to R.
Note that v202108 is a “serialization-breaking release”. Instance filesand binary models prepared using previous versions cannot be processedwith this version and may have to be rebuilt.
On Linux and macOS machines, you may use the following lines of code for installing Mallet. Note that admin privileges ("sudo") may be required.
```sh
mkdir /opt/mallet
cd /opt/mallet
wget https://github.com/mimno/Mallet/releases/download/v202108/Mallet-202108-bin.tar.gz
tar xzfv Mallet-202108-bin.tar.gz
rm Mallet-202108-bin.tar.gz
```

Using Mallet with the interface offered by biglda requires a working installation of the rJava package. Unfortunately, it does happen that installing rJava causes headaches. A solution that works very often is to reconfigure the R-Java interface by running the following command on the command line.
```sh
R CMD javareconf
```
It goes beyond this introduction to list all potential solutions for rJava problems. The prospect that Mallet is a very good and efficient tool may serve as a motivation to work through them.
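A quick way to check whether rJava is working is to start the JVM and query the Java version it has picked up (a minimal sketch using standard rJava calls):

```r
# Start the JVM and report the Java version rJava is using.
library(rJava)
.jinit()
J("java.lang.System")$getProperty("java.version")
```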
The counterpart to rJava as an interface to Java is the reticulate package as an interface for running Python commands from R. The reticulate package needs to be installed and loaded.
install.packages("reticulate")library(reticulate)
The reticulate package typically runs Python within a Conda environment. The easiest way is to use install_miniconda() and to install Gensim as follows.
```r
reticulate::install_miniconda()
reticulate::conda_install(packages = "gensim")
```
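One way to verify that the environment is usable is to check whether the gensim module can be imported (a minimal sketch):

```r
# Returns TRUE if gensim can be imported into the Python session used by reticulate.
reticulate::py_module_available("gensim")
```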
Note that it is not possible to use the R packages "biglda" and "mallet" in parallel. If "mallet" is loaded, it will put its Java Archive on the classpath of the Java Virtual Machine (JVM), making the latest version of the ParallelTopicModel class inaccessible and causing errors that may be difficult to understand. Therefore, a warning is issued when "biglda" detects that the JAR included in the "mallet" package is on the classpath.
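If you are unsure whether a Mallet JAR is already attached, you can inspect the JVM classpath with rJava before loading "biglda" (a minimal sketch, not part of the package API):

```r
# List classpath entries that look like a Mallet JAR.
library(rJava)
.jinit()
grep("mallet", .jclassPath(), value = TRUE, ignore.case = TRUE)
```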
```r
options(java.parameters = "-Xmx8g")
Sys.setenv(MALLET_DIR = "/opt/mallet/Mallet-202108")
library(biglda)
```

```
## Mallet version: v202108
## JVM memory allocated: 7.1 Gb
```

When using Mallet for topic modelling, the first step is to prepare a so-called "instance list" with information on the input documents. In this example, we use the AssociatedPress dataset included in the topicmodels package.
data("AssociatedPress",package="topicmodels")instance_list<- as.instance_list(AssociatedPress,verbose=FALSE)
We then instantiate a BigTopicModel class object with hyperparameters. This class extends the ParallelTopicModel class, the efficient worker of Mallet topic modelling, and adds methods that greatly speed up transferring data at the R-Java interface.
```r
BTM <- BigTopicModel(n_topics = 100L, alpha_sum = 5.1, beta = 0.1)
```
This result (object BTM) is a Java object of class "jobjRef". Methods of this (Java) class are accessible from R, and are used to configure the topic modelling engine. The first step is to add the instance list.
BTM$addInstances(instance_list)
We then set the number of iterations to 100 and use just one thread. Note that using multiple cores speeds up topic modelling - see the performance assessment below.
```r
BTM$setNumIterations(100L)
BTM$setNumThreads(1L)
```
Finally, we control the verbosity of the engine. By default, Mallet issues a status message every 10 iterations and reports the top words for topics every 100 iterations. For the purpose of rendering this document, we turn these progress reports off.
```r
BTM$setTopicDisplay(0L, 0L) # no intermediate report on topics
BTM$logger$setLevel(rJava::J("java.util.logging.Level")$OFF) # remain silent
```
We now fit the topic model and report the time that has elapsed for fitting the model.
```r
started <- Sys.time()
BTM$estimate()
Sys.time() - started
```

```
## Time difference of 4.485914 secs
```

The package includes optimized functionality for evaluating the topic model. Metrics are computed as follows.
```r
lda <- as_LDA(BTM, verbose = FALSE)
N <- BTM$getDocLengthCounts()
data.frame(
  arun2010 = BigArun2010(beta = B(lda), gamma = G(lda), doclengths = N),
  cao2009 = BigCao2009(X = B(lda)),
  deveaud2014 = BigDeveaud2014(beta = B(lda))
)
```

```
##   arun2010    cao2009 deveaud2014
## 1 3630.858 0.06176298    1.480718
```

To use Gensim for topic modelling, first load the reticulate package and import gensim into the Python session.
```r
library(reticulate)
gensim <- reticulate::import("gensim")
```
The bottleneck for using Gensim for topic modelling from R is the interface between the R and the Python session. We assume that a DocumentTermMatrix has been prepared. The biglda package offers the functions dtm_as_bow() and dtm_as_dictionary() to prepare the input data structures required by Gensim topic modelling.
```r
bow <- dtm_as_bow(AssociatedPress)
dict <- dtm_as_dictionary(AssociatedPress)
```
We use the multicore implementation of LDA topic modelling with all cores but two.
```r
threads <- parallel::detectCores() - 2L
```
This is the genuine Python part - running Gensim.
```r
gensim_model <- gensim$models$ldamulticore$LdaModel(
  # gensim_model <- gensim$models$ldamulticore$LdaMulticore(
  corpus = py$corpus,
  id2word = py$dictionary,
  num_topics = 100L,
  iterations = 100L,
  per_word_topics = FALSE
  # workers = as.integer(threads) # required to be integer
)
```
Use the as_LDA() method to get data back from Python/Gensim to R.
```r
lda <- as_LDA(gensim_model, dtm = AssociatedPress)
```
```
## ℹ Insantiate basic LDA_Gensim S4 class
## ✔ Insantiate basic LDA_Gensim S4 class [125ms]
## ℹ assign dictionary (slot 'terms')
## ✔ assign dictionary (slot 'terms') [17ms]
## ℹ assign word-topic distribution matrix (slot 'beta')
## ✔ assign word-topic distribution matrix (slot 'beta') [16ms]
## ℹ assign topic distribution for each document (slot 'gamma')
## ✔ assign topic distribution for each document (slot 'gamma') [12s]
```

To convey that biglda is fast, we fit topic models with Mallet with different numbers of cores and compare computing time with topic modelling with topicmodels::LDA(). The corpus used is AssociatedPress as represented by a DocumentTermMatrix included in the topicmodels package. Here, we just run 100 iterations for a limited set of k topics. In "real life", you would have more iterations (typically 1000-2000), but this setup is sufficiently informative on performance.
So we start with the general settings.
```r
k <- 100L
iterations <- 100L
n_cores_max <- parallel::detectCores() - 2L # use all cores but two
data("AssociatedPress", package = "topicmodels")
report <- list()
```
We then fit topic models with Mallet using an increasing number of cores.
```r
library(biglda)

instance_list <- as.instance_list(AssociatedPress, verbose = FALSE)

for (cores in 1L:n_cores_max) {
  mallet_started <- Sys.time()
  BTM <- BigTopicModel(n_topics = k, alpha_sum = 5.1, beta = 0.1)
  BTM$addInstances(instance_list)
  BTM$setNumThreads(cores)
  BTM$setNumIterations(iterations)
  BTM$setTopicDisplay(0L, 0L) # no intermediate report on topics
  BTM$logger$setLevel(rJava::J("java.util.logging.Level")$OFF) # remain silent
  BTM$estimate()
  run <- sprintf("mallet_%d", cores)
  report[[as.character(cores)]] <- data.frame(
    tool = run,
    time = as.numeric(difftime(Sys.time(), mallet_started, units = "mins"))
  )
}
```
And we run the classic implementation of LDA topic modelling in the topicmodels package.
```r
library(topicmodels)

topicmodels_started <- Sys.time()
lda_model <- LDA(
  AssociatedPress,
  k = k,
  method = "Gibbs",
  control = list(iter = iterations)
)
report[["topicmodels"]] <- data.frame(
  tool = "topicmodels",
  time = difftime(Sys.time(), topicmodels_started, units = "mins")
)
```
The following chart reports the elapsed time for LDA topic modelling.
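The chart itself is not reproduced here. A minimal sketch (assuming the ggplot2 package is available) of how the timings collected in the `report` list could be plotted:

```r
# Combine the per-run timings gathered above and plot elapsed minutes per run.
library(ggplot2)

timings <- do.call(rbind, report)
ggplot(timings, aes(x = tool, y = as.numeric(time))) +
  geom_col() +
  labs(x = "implementation (number of cores)", y = "minutes elapsed")
```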
These are the takeaways from the benchmarks:
- Topic modelling with Mallet is fast, and even more so with several cores. So for large data, biglda can make the difference whether it takes a week or a day to fit the topic model.
- Note that we did not evaluate stm::stm() (structural topic model) here, which has become a widely used state-of-the-art algorithm. The stm package offers very rich analytical possibilities, and the ability to include document metadata goes significantly beyond classic LDA topic modelling. But stm::stm() is significantly slower than topicmodels::LDA(). The stm package does not address big data scenarios very well; this is the specialization of the biglda package.
- The great reputation of Gensim notwithstanding, the benchmarks for Gensim are significantly weaker than what we see for mallet and topicmodels. The reticulate interface may be the bottleneck: we do not yet know.
The package implements the metrics for topic models of the ldatuning package (Griffiths2004 is still missing) using RcppArmadillo. Based on the following code used for benchmarking, the following chart conveys that the biglda functions BigArun2010(), BigCao2009() and BigDeveaud2014() are significantly faster than their counterparts in the ldatuning package.
```r
library(ldatuning)
library(ggplot2)

metrics_timing <- data.frame(
  pkg = c(rep("ldatuning", times = 3), rep("biglda", times = 3)),
  metric = rep(c("Arun2010", "CaoJuan2009", "Deveaud2014"), times = 2),
  time = c(
    system.time(Arun2010(models = list(lda_model), dtm = AssociatedPress))[3],
    system.time(CaoJuan2009(models = list(lda_model)))[3],
    system.time(Deveaud2014(models = list(lda_model)))[3],
    system.time(BigArun2010(beta = B(lda_model), gamma = G(lda_model), doclengths = BTM$getDocLengthCounts()))[3],
    system.time(BigCao2009(B(lda_model)))[3],
    system.time(BigDeveaud2014(B(lda_model)))[3]
  )
)

ggplot(metrics_timing, aes(fill = pkg, y = time, x = metric)) +
  geom_bar(position = "dodge", stat = "identity")
```
Note that apart from speed, the RcppArmadillo/C++ implementation is much more memory efficient. Computations of metrics that may fail with ldatuning functionality because memory is exhausted will often work fast and successfully with the biglda package.
The examples here and in the package documentation including vignettes need to be minimal working examples to keep computing time low when building the package. In real-life scenarios, we recommend using scripts for topic modelling on large datasets and executing the script from the command line. R Markdown documents are an ideal alternative to plain R scripts and can be run with a shell command such as ``.
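One possibility (purely illustrative; the file name is hypothetical and the rmarkdown package is assumed) is to render such a report non-interactively:

```r
# Render a hypothetical R Markdown topic modelling report non-interactively.
# From a shell, this call could be wrapped as:
# Rscript -e 'rmarkdown::render("topicmodelling.Rmd")'
rmarkdown::render("topicmodelling.Rmd")
```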
The R package "mallet" has been around for a while and is the traditional interface to the "mallet" Java package for R users. However, the RTopicModel class is used as an interface to the ParallelTopicModel class, hiding the method to define the number of cores to be used from the R user, thus limiting the potential to speed up the computation of a topic model using multiple threads. What is more, functionality for full access to the computed model is hidden, inhibiting the extraction of the information from the mallet LDA topic model that is required to map the topic model on the LDA_Gibbs topic model.
The biglda package is designed for heavy-duty work with large data sets. If you work with smaller data, other approaches and implementations that are established and that may be easier to use and to install (the rJava blues!) do a very good job indeed. But biglda is not just a nice-to-have: jobs with big data may just take too long or simply fail because memory is exhausted. We think that biglda pushes the frontier of productively using topic modelling in the R realm.
(Alternative: Command line use of Mallet. Quarto.)
David Mimno, "The Details: Training and Validating Big Models on Big Data": http://journalofdigitalhumanities.org/2-1/the-details-by-david-mimno/