Add native LDA function #31
Conversation
Merge branch 'dev-lda2' of https://github.com/quanteda/quanteda.textmodels into dev-lda2
# Conflicts:
#   src/dataset.cpp
#   src/dataset.h
#   src/model.cpp
…a.textmodels into dev-lda2-seeded
- Add documentation
- Separate functions
- Add print methods
I totally appreciate the addition of these functions, but they raise a host of issues that relate to why we never wrapped or included a topic models package in the first place. These are:
- Topic models need a number of post-estimation and diagnostic functions that are unique to topic models, in other words, not fitting well into the fit - predict - assess model of more standard supervised predictors or scaling models. Although admittedly, scaling models are different already. But topic models need diagnostic methods, methods for summarizing terms, different plotting methods, etc. There are also lots more parameters to tune, as well as different fitting methods, for instance VEM rather than Gibbs sampling. That's why I thought this would be better off in a separate package, and once the stm package matured, I supported that (and use it) rather than trying to include one.
- What is nice about including it is that it can be nicely integrated. We could easily include a `predict()` method for estimating topic prevalence on a new dfm (like `stm::fitNewDocuments()`). But we'd also need functions to evaluate diagnostics for assessing model fit, such as perplexity, and for that we need log-likelihoods. There's no convergence info in the fitted object right now either. If it were me fitting a topic model, I'd still use stm rather than this. (A short diagnostics sketch follows my comparison code below.)
- Speed could be a reason, and this is faster than `lda::lda.collapsed.gibbs.sampler()`, which is 105s on my system, versus `quanteda.textmodels::textmodel_lda()`'s 62s. But `topicmodels::LDA(x, method = "Gibbs")` is 32s. (See code below.)
- `textmodel_seededlda()` is already available through `topicmodels::LDA(x, method = "Gibbs", seedwords = ...)`. See this SO answer (rough sketch just below).
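Roughly, the seeding interface there looks like the sketch below. This is from memory of that SO answer, so treat it as illustrative: the seed terms, the weight of 500, and passing a dense matrix as `seedwords` are assumptions I haven't re-checked against the topicmodels documentation.

```r
library("topicmodels")
data("AssociatedPress", package = "topicmodels")

# one row per topic, one column per feature; nonzero entries up-weight seed terms
seedwords <- matrix(0, nrow = 20, ncol = ncol(AssociatedPress),
                    dimnames = list(NULL, colnames(AssociatedPress)))
seedwords[1, c("government", "court", "police")] <- 500  # illustrative seeds for topic 1
seedwords[2, c("percent", "market", "prices")] <- 500    # illustrative seeds for topic 2

lda_seeded <- LDA(AssociatedPress[1:1000, ], k = 20, method = "Gibbs",
                  seedwords = seedwords, control = list(alpha = 0.1))
```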
So: Do we really want to get into this crowded field? Would a wrapper not be better?
I have other comments but let's decide on this larger issue first.
My comparison code:
library("quanteda")
# topicmodels
library("topicmodels")
data("AssociatedPress", package = "topicmodels")
system.time({
lda_LDA <- LDA(AssociatedPress[1:1000, ], method = "Gibbs",
control = list(alpha = 0.1), k = 20)
})
lda_inf <- posterior(lda_LDA, AssociatedPress[21:30, ])
terms(lda_LDA, 10)
# LDA
library("lda")
lda_input <- as.dfm(AssociatedPress)[1:1000, ] %>%
convert(to = "lda")
system.time({
lda_topicmodels <- lda.collapsed.gibbs.sampler(documents = lda_input$documents,
K = 20,
vocab = lda_input$vocab,
num.iterations = 2000,
alpha = 0.1,
eta = 0.1,
compute.log.likelihood = TRUE)
})
top.topic.words(lda_topicmodels$topics, 10, by.score = TRUE)
# STM
library("stm")
system.time({
lda_stm <- stm(as.dfm(AssociatedPress)[1:1000, ], K = 20)
})
labelTopics(lda_stm, n = 10)
# quanteda.textmodels
library("quanteda.textmodels")
system.time({
lda_qtm <- textmodel_lda(as.dfm(AssociatedPress)[1:1000, ], k = 20)
})
quanteda.textmodels::terms(lda_qtm)
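And the post-estimation pieces I mentioned in my second point are already one-liners in topicmodels. A quick sketch using the `lda_LDA` Gibbs fit from the code above; I haven't double-checked the options `perplexity()` uses for Gibbs fits, so treat it as illustrative:

```r
logLik(lda_LDA)                                          # log-likelihood of the fitted model
perplexity(lda_LDA, AssociatedPress[1001:1100, ])        # held-out perplexity as a fit diagnostic
posterior(lda_LDA, AssociatedPress[1001:1100, ])$topics  # topic prevalence for new documents
```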
If you look at the code of topicmodels, you will understand why. I also thought that its implementation of seeded LDA is wrong. However, I am happy to make this a separate package.
I have seen it and it's a mess, as is the package. We could offer a simpler approach, so I didn't want to give the impression that I am totally against it, just aware of some challenges. Can we meet those challenges by offering the post-estimation functions in a simpler, quanteda-like way, and/or by allowing them to work with those already found in another package such as stm, or the excellent LDAvis?
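For instance, wiring a fitted model into LDAvis only needs the two posterior matrices plus counts. A rough sketch, assuming the `lda_LDA` Gibbs fit from my comparison code above; dropping and renormalising the zero-count terms is my guess at what `createJSON()` expects, so it's untested:

```r
library("LDAvis")
library("slam")

post <- posterior(lda_LDA)       # topic-word ($terms) and document-topic ($topics) matrices
ap <- AssociatedPress[1:1000, ]  # the same subset the model was fitted on

keep <- col_sums(ap) > 0         # drop terms that never occur in this subset
phi <- post$terms[, keep]
phi <- phi / rowSums(phi)        # renormalise rows after dropping columns

json <- createJSON(phi = phi,
                   theta = post$topics,
                   doc.length = row_sums(ap),
                   vocab = colnames(ap)[keep],
                   term.frequency = col_sums(ap)[keep])
serVis(json)                     # open the interactive LDAvis view
```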
No worries. I was aware that topicmodels has the seeded LDA. This is exactly why I chose LDAGibbs++ here. I am only interested in semisupervised LDA, so unsupervised LDA is just a by-product. The library has many more functions for online training and prediction on new data, but you reminded me that I would have to do a lot of work to add functions that I have no plan to use if I add the code to this package. This is why I am happy to put that in my seededlda package (which currently depends on topicmodels).
Ah, users, we love them, but they always ask for more. https://github.com/bstewart/stm/issues
We don't have enough resources to maintain so many functions, so we need to let users develop the packages via PRs. This is why I think modularization is urgently needed.
Maybe it would be an option to participate in the Hacktoberfest event. I tried to participate last year but there were almost no
@JBGruber can you tell us how we can make you (or people like you) want to contribute actively to quanteda.textmodels or quanteda?
I think the Hacktoberfest guidelines for maintainers are a pretty good starting point to get people to contribute actively:
So I think to get people to help, you would have to communicate that you want help, and for what specifically. That's why I brought up the Hacktoberfest. But you could also, for example, organise a quanteda hackathon to solve a long list of issues at once to get things going. I would show up 😉
Address #30