
Add native LDA function #31

Closed
koheiw wants to merge 36 commits

Conversation

@koheiw (Collaborator) commented Aug 11, 2020

Address #30

@koheiw closed this Aug 11, 2020
@koheiw reopened this Aug 11, 2020
@koheiw requested a review from kbenoit, August 11, 2020 18:24
@kbenoit (Contributor) left a comment

I totally appreciate the addition of these functions, but they raise a host of issues that relate to why we never wrapped or included a topic models package in the first place. These are:

  1. Topic models need a number of post-estimation and diagnostic functions that are unique to topic models; in other words, they do not fit well into the fit-predict-assess model of more standard supervised predictors or scaling models (although, admittedly, scaling models already depart from that pattern). Topic models need diagnostic methods, methods for summarizing terms, different plotting methods, etc. There are also lots more parameters to tune, as well as different fitting methods, for instance VEM rather than Gibbs sampling. That's why I thought this would be better off in a separate package, and once the stm package matured, I supported that (and use it) rather than trying to include a topic model ourselves.

    What is nice about including it is that it can be nicely integrated. We could easily include a predict() method for estimating topic prevalence on a new dfm (like stm::fitNewDocuments()); a rough sketch of that idea appears after the comparison code below. But we'd also need functions to evaluate diagnostics for assessing model fit, such as perplexity, and for that we need log-likelihoods. There is no convergence info in the fitted object right now either. If it were me fitting a topic model, I'd still use stm rather than this.

  2. Speed could be a reason, and this is faster than lda::lda.collapsed.gibbs.sampler(): 105s on my system versus 62s for quanteda.textmodels::textmodel_lda(). But topicmodels::LDA(x, method = "Gibbs") takes 32s. (See the code below.)

  3. textmodel_seededlda() is already available through topicmodels::LDA(x, method = "Gibbs", seedwords = ...); see this SO answer. A sketch follows this list.
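
    For illustration, seeding via topicmodels might look roughly like this (a sketch only: the seed terms are invented, and the exact shape expected for seedwords, assumed here to be a k-by-V weight matrix, should be checked against the topicmodels documentation):

        library("topicmodels")
        data("AssociatedPress", package = "topicmodels")
        dtm <- AssociatedPress[1:1000, ]

        # assumption: seedwords is a k x V matrix of per-topic seed weights
        k <- 20
        seeds <- matrix(0, nrow = k, ncol = ncol(dtm),
                        dimnames = list(NULL, colnames(dtm)))
        seeds[1, colnames(seeds) %in% c("court", "judge")] <- 1   # hypothetical seed terms
        seeds[2, colnames(seeds) %in% c("stock", "market")] <- 1  # hypothetical seed terms

        lda_seeded <- LDA(dtm, k = k, method = "Gibbs", seedwords = seeds,
                          control = list(alpha = 0.1))
        terms(lda_seeded, 10)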

So: Do we really want to get into this crowded field? Would a wrapper not be better?

I have other comments but let's decide on this larger issue first.

My comparison code:

library("quanteda")

# topicmodels: Gibbs sampling on the first 1,000 AP documents
library("topicmodels")
data("AssociatedPress", package = "topicmodels")
system.time({
    lda_topicmodels <- LDA(AssociatedPress[1:1000, ], method = "Gibbs",
                           control = list(alpha = 0.1), k = 20)
})
# infer topic prevalence for other documents, and list the top 10 terms
lda_inf <- posterior(lda_topicmodels, AssociatedPress[21:30, ])
terms(lda_topicmodels, 10)

# lda: collapsed Gibbs sampler on the same documents
library("lda")
lda_input <- as.dfm(AssociatedPress)[1:1000, ] %>%
    convert(to = "lda")
system.time({
    lda_lda <- lda.collapsed.gibbs.sampler(documents = lda_input$documents,
                                           K = 20,
                                           vocab = lda_input$vocab,
                                           num.iterations = 2000,
                                           alpha = 0.1,
                                           eta = 0.1,
                                           compute.log.likelihood = TRUE)
})
top.topic.words(lda_lda$topics, 10, by.score = TRUE)

# STM
library("stm")
system.time({
    lda_stm <- stm(as.dfm(AssociatedPress)[1:1000, ], K = 20)
})
labelTopics(lda_stm, n = 10)

# quanteda.textmodels
library("quanteda.textmodels")
system.time({
    lda_qtm <- textmodel_lda(as.dfm(AssociatedPress)[1:1000, ], k = 20)
})
quanteda.textmodels::terms(lda_qtm)
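
On the predict() idea in point 1, here is a rough sketch of the fold-in approach (purely hypothetical, not an existing quanteda.textmodels API; the function name and arguments are invented): fix the fitted topic-word distributions phi and re-estimate only the topic proportions of the new documents with a few EM steps.

# hypothetical sketch: fold-in prediction with a fixed topic-word matrix.
# phi is assumed to be K x V with rows summing to 1; x is a dfm whose
# features match the columns of phi
predict_lda_sketch <- function(phi, x, alpha = 0.1, iter = 50) {
    x <- as.matrix(x)
    theta <- matrix(1 / nrow(phi), nrow = nrow(x), ncol = nrow(phi))
    for (d in seq_len(nrow(x))) {
        keep <- which(x[d, ] > 0)
        for (i in seq_len(iter)) {
            # E-step: allocate each observed term's count across topics
            r <- theta[d, ] * phi[, keep, drop = FALSE]
            r <- sweep(r, 2, colSums(r), "/")
            # M-step: re-estimate this document's topic proportions
            theta[d, ] <- alpha + as.numeric(r %*% x[d, keep])
            theta[d, ] <- theta[d, ] / sum(theta[d, ])
        }
    }
    rownames(theta) <- rownames(x)
    theta  # D x K matrix of estimated topic proportions
}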

@koheiw (Collaborator, Author) commented Aug 18, 2020

If you look at the code of topicmodels, you will understand why. I also thought that its implementation of seeded LDA was wrong. However, I am happy to make this a separate package.

@koheiw closed this Aug 18, 2020
@kbenoit (Contributor) commented Aug 18, 2020

I have seen it, and it's a mess; so is the package. We could offer a simpler approach, so I didn't want to give the impression that I am totally against it, just aware of some challenges. Can we meet those challenges by providing the post-estimation functions in a simpler, quanteda-like way, and/or allowing them to work with those already found in another package such as stm, or the excellent LDAvis?
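
For instance, hooking a fitted model into LDAvis might look roughly like this (a sketch only: the phi and theta slot names on the fitted object are illustrative, and dfmat stands for the dfm the model was fitted on):

library("LDAvis")
# illustrative: assumes the fitted object stores phi (K x V) and theta (D x K)
json <- createJSON(phi = lda_qtm$phi,
                   theta = lda_qtm$theta,
                   doc.length = rowSums(dfmat),
                   vocab = colnames(dfmat),
                   term.frequency = colSums(dfmat))
serVis(json)  # opens the interactive topic browser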

@koheiw (Collaborator, Author) commented Aug 18, 2020

No worries. I was aware that topicmodels has seeded LDA; that is exactly why I chose LDAGibbs++ here. I am only interested in semisupervised LDA, so unsupervised LDA is just a by-product. The library has many more functions for online training and prediction on new data, but you reminded me that I would have to do a lot of work to add functions that I have no plan to use if I added the code to this package. This is why I am happy to put it in my seededlda package (which currently depends on topicmodels).

@kbenoit (Contributor) commented Aug 18, 2020

Ah, users: we love them, but they always ask for more. https://github.com/bstewart/stm/issues

@koheiw (Collaborator, Author) commented Aug 19, 2020

We don't have enough resources to maintain so many functions, so we need to let users develop the packages via PRs. This is why I think modularization is urgently needed.

@JBGruber (Collaborator) commented

Maybe it would be an option to participate in the Hacktoberfest event. I tried to participate last year, but there were almost no R issues around to solve.

@koheiw (Collaborator, Author) commented Aug 19, 2020

@JBGruber, can you tell us how we can get you (or people like you) to contribute actively to quanteda.textmodels or quanteda?

@JBGruber (Collaborator) commented

I think the Hacktoberfest guidelines for maintainers are a pretty good starting point to get people to contribute actively:

  • Apply the "Hacktoberfest" label to issues in your GitHub project that are ready for contributors to work on. (Since they advertise the event and collect all issues with the label centrally, people who otherwise wouldn't go to the issues page will see it. If you don't want to participate in that, or during the 11 months of the year when the event is not on, you could simply share a link to all quanteda issues with the "help wanted" label, of which there are currently none.)
  • Add a CONTRIBUTING.md file with contribution guidelines to your repository. (The quanteda wiki is already pretty good but might need some adaptation for people who have just started developing functions.)
  • Choose issues that have a well-defined scope and are self-contained. (This would probably be the most important thing for me personally. Looking through a few issues, it is sometimes not quite clear where to even start, especially with functions being moved, etc.)
  • Adopt a code of conduct to foster a greater sense of inclusion and community. (You already nailed that one, although it's only in the quanteda package itself so far.)

So I think that to get people to help, you would have to communicate that you want help, and for what specifically. That's why I brought up Hacktoberfest. But you could also, for example, organise a quanteda hackathon to solve a long list of issues at once and get things going. I would show up 😉
