Topic modelling is an established and useful unsupervised learning technique for detecting topics in large collections of text. Among the many variants of topic modelling and innovations that are been tried, the classic ‘Latent Dirichlet Allocation’ (LDA) algorithm remains to be a powerful tool that yields good results. Yet computing LDA topic models requires significant system resources, particularly when corpora grow large. The biglda package addresses three specific issues that may be outright obstacles to using LDA topic models productively and in a state-of-the-art fashion.
The package addresses issues of computing time and RAM limitations at the different relevant stages when working with topic models:
-
Data preparation and import: Issues of performance and memory efficiency start to arise when preparing input data for a training algorithm.
-
Fitting topic models: Computing time for a single topic model may be extensive. It is good practice to evaluate a set of topic models for hyperparameter optimization. So you need a fast implementation of LDA topic modelling to be able to fit a set of topic models within reasonable time.
-
To cost of interfacing: R is very productive for developing your analysis, but the best implementations of LDA topic modelling are written in Python, Java and C. Transferring the data between programming languages is a potential bottleneck and can be quite slow.
-
Evaluating topic models: Issues performance and memory efficiency are relevant when computing indicators to assess topic models may be considerable. Computations require mathematical operations with really large matrices. Memory may be exhausted easily.
More plastically spoken: As you work with increasingly large data, fitting and evaluating topic models can bring hours and days you wait for a result to emerge, just to see that the process crashes because memory has been exhausted.
The biglda package addresses performance and memory efficiency issues
for R users at all stages. The ParallelTopicModel class from the
Machine Learning for Language Toolkit ‘mallet’ offers a fast
implementation of an LDA topic model that yields good results. The
purpose of the ‘biglda’ package is to offer a seamless and fast
interface to the Java classes of ‘mallet’ so that the multicore
implementation of the LDA algorithm can be used. The as_LDA()
function
can be used to map the mallet model on the LDA_Gibbs
class from the
widely used topicanalysis
package.
The biglda package is a GitHub-only package at this time. Install the stable version as follows.
remotes::install_github("PolMine/biglda")
Install the development version of the package as follows.
remotes::install_github("PolMine/biglda", ref = "dev")
The focus of the biglda package is to offer a seamless interface to
Mallet. The biglda::mallet_install()
function installs Mallet
(v202108) in a directory within the package. The big disadvantage of
this installation mechanism is that the Mallet installation is
overwritten whenever you install a new version of biglda, and Mallet
needs to be installed anew. This is why installing Mallet in the storage
location used by your system for add-on software (such as “/opt” on
Linux/macOS) is recommended.
We strongly recommend to install the latest version of Mallet (v202108
or higher). Among others, it includes a new
$printDenseDocumentTopics()
method of the ParallelTopicModel
used by
biglda::save_document_topics()
, improving the efficiency of moving
data from Java/Mallet to R significantly.
Note that v202108 is a “serialization-breaking release”. Instance files and binary models prepared using previous versions cannot be processed with this version and may have to be rebuilt.
On Linux and macOS machines, you may use the following lines of code for installing Mallet. Note that admin privileges (“sudo”) may be required.
mkdir /opt/mallet
cd /opt/mallet
wget https://github.com/mimno/Mallet/releases/download/v202108/Mallet-202108-bin.tar.gz
tar xzfv Mallet-202108-bin.tar.gz
rm Mallet-202108-bin.tar.gz
Using Mallet with the interface offered by biglda requires a working installation of the rJava package. Unfortunately, it happens indeed that installing rJava causes headaches. A solution that works very often is to reconfigure the R-Java intervace running the following command on the command line.
R CMD javareconf
It goes beyond this introduction to list all potential solutions for rJava problems The prospect that Mallet is a very good and efficient tool may serve as a motivation.
The equivalent to rJava as an interface to Java is the reticulate package as an interface for running Python commands from R. The reticulate package needs to be installed and loaded.
install.packages("reticulate")
library(reticulate)
The reticulate package typically runs Python within a Conda environment.
The easiest way is to use install_miniconda()
and to install Gensim as
follows.
reticulate::install_miniconda()
reticulate::conda_install(packages = "gensim")
Note that it is not possible to use the R packages “biglda” and “mallet”
in parallel. If “mallet” is loaded, it will put its Java Archive on the
classpath of the Java Virtual Machine (JVM), making the latest version
of the ParallelTopicModel
class inaccessible and causing errors that
may be difficult to understand. Therefore, a warning will be issued when
“biglda” may detect that the JAR included in the “mallet” package is in
the classpath.
options(java.parameters = "-Xmx8g")
Sys.setenv(MALLET_DIR = "/opt/mallet/Mallet-202108")
library(biglda)
## Mallet version: v202108
## JVM memory allocated: 7.1 Gb
When using Mallet for topic modelling, the first step is to prepare a
so-called “instance list” with information on the input documents. In
this example, we use the AssociatedPress
dataset included in the
topicmodels package.
data("AssociatedPress", package = "topicmodels")
instance_list <- as.instance_list(AssociatedPress, verbose = FALSE)
We then instantiate a BigTopicModel
class object with hyperparameters.
This class is a super class of the ParallelTopicModel
class, the
efficient worker of Mallet topic modelling. The superclass adds methods
to this class that greatly speed up transferring data at the R-Java
interface.
BTM <- BigTopicModel(n_topics = 100L, alpha_sum = 5.1, beta = 0.1)
This result (object BTM
) is a Java object of class “jobjRef”. Methods
of this (Java) class are accessible from R, and are used to configure
the topic modelling engine. The first step is to add the instance list.
BTM$addInstances(instance_list)
We then set the number of iterations to 1000 and just use one thread core. Note that using multiple cores speeds up topic modelling - see performance assessment below.
BTM$setNumIterations(100L)
BTM$setNumThreads(1L)
Finally, we control the verbosity of the engine. By default, Mallet issues a status message every 10 iterations and reports the top words for topics every 100 iterations. For the purpose of rendering this document, we turn these progress reports off.
BTM$setTopicDisplay(0L, 0L) # no intermediate report on topics
BTM$logger$setLevel(rJava::J("java.util.logging.Level")$OFF) # remain silent
We now fit the topic model and report the time that has elapsed for fitting the model.
started <- Sys.time()
BTM$estimate()
Sys.time() - started
## Time difference of 4.485914 secs
The package includes optimized functionality for evaluating the topic model. Metrics are computed as follows.
lda <- as_LDA(BTM, verbose = FALSE)
N <- BTM$getDocLengthCounts()
data.frame(
arun2010 = BigArun2010(beta = B(lda), gamma = G(lda), doclengths = N),
cao2009 = BigCao2009(X = B(lda)),
deveaud2014 = BigDeveaud2014(beta = B(lda))
)
## arun2010 cao2009 deveaud2014
## 1 3630.858 0.06176298 1.480718
To use Gensim for topic modelling, first load the reticulate package and import gensim into the Python session.
library(reticulate)
gensim <- reticulate::import("gensim")
The bottleneck for using Gensim for topic modelling from R is the
interface between the R and the Python session. We assume that a
DocumentTermMatrix
has been prepared. The biglda package offers the
functions dtm_as_bow()
and dtm_as_dictionary()
to prepare the input
data structures required by Gensim topic modelling.
bow <- dtm_as_bow(AssociatedPress)
dict <- dtm_as_dictionary(AssociatedPress)
We use the multicore implementation of LDA topic modelling and use all cores but two.
threads <- parallel::detectCores() - 2L
This is the genuine Python part - running Gensim.
gensim_model <- gensim$models$ldamulticore$LdaModel(
# gensim_model <- gensim$models$ldamulticore$LdaMulticore(
corpus = py$corpus,
id2word = py$dictionary,
num_topics = 100L,
iterations = 100L,
per_word_topics = FALSE
# workers = as.integer(threads) # required to be integer
)
Use the as_LDA()
method to get back data from Pyhton/Gensim to R.
lda <- as_LDA(gensim_model, dtm = AssociatedPress)
## ℹ Insantiate basic LDA_Gensim S4 class✔ Insantiate basic LDA_Gensim S4 class [125ms]
## ℹ assign dictionary (slot 'terms')✔ assign dictionary (slot 'terms') [17ms]
## ℹ assign word-topic distribution matrix (slot 'beta')✔ assign word-topic distribution matrix (slot 'beta') [16ms]
## ℹ assign topic distribution for each document (slot 'gamma')✔ assign topic distribution for each document (slot 'gamma') [12s]
To convey that biglda is fast, we fit topic models with Mallet with
different numbers of cores and compare computing time with topic
modelling with topicmodels::LDA()
. The corpus used is
AssociatedPress
as represented by a DocumentTermMatrix
included in
the topicmodel
package. Here, we just run 100 iterations for a limited
set of k
topics. In “real life”, you would have more iterations
(typically 1000-2000), but this setup is sufficiently informative on
performance.
So we start with the general settings.
k <- 100L
iterations <- 100L
n_cores_max <- parallel::detectCores() - 2L # use all cores but two
data("AssociatedPress", package = "topicmodels")
report <- list()
We then fit topic models with Mallet using a increasing number of cores.
library(biglda)
instance_list <- as.instance_list(AssociatedPress, verbose = FALSE)
for (cores in 1L:n_cores_max){
mallet_started <- Sys.time()
BTM <- BigTopicModel(n_topics = k, alpha_sum = 5.1, beta = 0.1)
BTM$addInstances(instance_list)
BTM$setNumThreads(cores)
BTM$setNumIterations(iterations)
BTM$setTopicDisplay(0L, 0L) # no intermediate report on topics
BTM$logger$setLevel(rJava::J("java.util.logging.Level")$OFF) # remain silent
BTM$estimate()
run <- sprintf("mallet_%d", cores)
report[[as.character(cores)]] <- data.frame(
tool = run,
time = as.numeric(difftime(Sys.time(), mallet_started, units = "mins"))
)
}
And we run the classic implementation of LDA topic topic modelling in the topicmodels package.
library(topicmodels)
topicmodels_started <- Sys.time()
lda_model <- LDA(
AssociatedPress,
k = k,
method = "Gibbs",
control = list(iter = iterations)
)
report[["topicmodels"]] <- data.frame(
tool = "topicmodels",
time = difftime(Sys.time(), topicmodels_started, units = "mins")
)
The following chart reports the elapsed time for LDA topic modelling.
These are the takeaways from the benchmarks:
-
Topic modelling with Mallet is fast and even more so with several cores. So for large data, biglda can make the difference, whether it takes a week or a day for fit the topic model.
-
Note that we did not an evaluation of
stm::stm()
(structural topic model) here, which has become a widely used state-of-the-art algorithm. The stm package offers very rich analytical possibilities and the ability to include document metadata goes significantly beyond classic LDA topic modelling. Butstm::stm()
is significantly slower thantopicmodels::LDA()
. The stm package does not address big data scenarios very well, this is the specialization of the biglda package. -
The great reputation of Gensim notwithstanding, benchmarks for Gensim are significantly weaker than what we see for mallet and topicmodels. The reticulate interface may be the bottleneck: We do not yet know.
The package implements the metrics for topic models of the ldatuning
package (Griffiths2004 still missing) using RcppArmadillo. Based on the
following code used for benchmarking, the following chart conveys that
the biglda funtions BigArun2010()
, BigCao2009()
and
BigDeveaud2014()
are significantly faster than their counterparts in
the ldatuning package.
library(ldatuning)
metrics_timing <- data.frame(
pkg = c(rep("ldatuning", times = 3), rep("biglda", times = 3)),
metric = c(rep(c("Arun2010", "CaoJuan2009", "Deveaud2009"), times = 2)),
time = c(
system.time(Arun2010(models = list(lda_model), dtm = AssociatedPress))[3],
system.time(CaoJuan2009(models = list(lda_model)))[3],
system.time(Deveaud2014(models = list(lda_model)))[3],
system.time(BigArun2010(
beta = B(lda_model),
gamma = G(lda_model),
doclengths = BTM$getDocLengthCounts()))[3],
system.time(BigCao2009(B(lda_model)))[3],
system.time(BigDeveaud2014(B(lda_model)))[3]
)
)
ggplot(metrics_timing, aes(fill = pkg, y = time, x = metric)) +
geom_bar(position = "dodge", stat = "identity")
Note that apart from speed, the RcppArmadillo/C++ implementation is much more memory efficient. Computations of metrics that may fail with ldatuning functionality because memory is exhausted will often work fast and successfully with the biglda package.
The examples here and in the package documentation including vignettes need to be minimal working examples to keep computing time low when building the package. In real-life scenarios, we recommend to use scripts for topic modelling on large datasets and to execute the script from the command line. R Markdown are an ideal alternative to plain R scripts that can be run with a shell command such as ``.
- Mallet
- Gensim
- Gensim2
The R package “mallet” has been around for a while and is the
traditional interface to the “mallet” Java package for R users. However,
the a RTopicModel
class
is used as an interface to the ParallelTopicModel
class, hiding the
method to define the number of cores to be used from the R used, thus
limiting the potential to speed up the computation of a topic model
using multiple threads. What is more, functionality for full access to
the computed model is hidden, inhibiting the extraction of the
information the mallet LDA topic model that is required to map the topic
model on the LDA_Gibbs topic model.
The biglda package is designed for heavy-duty work with large data sets.
If you work with smaller data, other approaches and implementations that
are established and that may be easier to use and to install (the rJava
Blues!)
do a very good job indeed. But we are not just happy to also have
biglda. Jobs with big data may just take too long or simply fail because
memory is exhausted. We think that biglda pushes the frontier of being
able of productively using topic modelling in the R realm.
(Alternative: Command line use of Mallet. Quarto.)
David Minmo, “The Details: Training and Validating Big Models on Big Data”: http://journalofdigitalhumanities.org/2-1/the-details-by-david-mimno/