
Add native LDA function #31

Closed
koheiw wants to merge 36 commits

Conversation

@koheiw (Collaborator) commented Aug 11, 2020

Address #30

@koheiw closed this Aug 11, 2020
@koheiw reopened this Aug 11, 2020
@koheiw requested a review from kbenoit, August 11, 2020 18:24
@kbenoit (Contributor) left a comment

I totally appreciate the addition of these functions, but they raise a host of issues that relate to why we never wrapped or included a topic models package in the first place. These are:

  1. Topic models need a number of post-estimation and diagnostic functions that are unique to topic models; in other words, they do not fit well into the fit-predict-assess model of more standard supervised predictors or scaling models (although, admittedly, scaling models already depart from that pattern). Topic models need diagnostic methods, methods for summarizing terms, different plotting methods, etc. There are also lots more parameters to tune, as well as different fitting methods, for instance VEM rather than Gibbs sampling. That's why I thought this would be better off in a separate package, and once the stm package matured, I supported that (and use it) rather than trying to include a topic model ourselves.

    What is nice about including it is that it can be nicely integrated. We could easily include a predict() method for estimating topic prevalence on a new dfm (like stm::fitNewDocuments()); a rough sketch of that idea appears after the comparison code below. But we'd also need functions to evaluate diagnostics for assessing model fit, such as perplexity, and for that we need log-likelihoods. There is no convergence info in the fitted object right now either. If it were me fitting a topic model, I'd still use stm rather than this.

  2. Speed could be a reason, and this is faster than lda::lda.collapsed.gibbs.sampler(): 105s on my system versus 62s for quanteda.textmodels::textmodel_lda(). But topicmodels::LDA(x, method = "Gibbs") takes 32s. (See the code below.)

  3. textmodel_seededlda() is already available through topicmodels::LDA(x, method = "Gibbs", seedwords = ...); see this SO answer. A sketch follows this list.
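
    For illustration, seeding via topicmodels might look roughly like this (a sketch only: the seed terms are invented, and the exact shape expected for seedwords, assumed here to be a k-by-V weight matrix, should be checked against the topicmodels documentation):

        library("topicmodels")
        data("AssociatedPress", package = "topicmodels")
        dtm <- AssociatedPress[1:1000, ]

        # assumption: seedwords is a k x V matrix of per-topic seed weights
        k <- 20
        seeds <- matrix(0, nrow = k, ncol = ncol(dtm),
                        dimnames = list(NULL, colnames(dtm)))
        seeds[1, colnames(seeds) %in% c("court", "judge")] <- 1   # hypothetical seed terms
        seeds[2, colnames(seeds) %in% c("stock", "market")] <- 1  # hypothetical seed terms

        lda_seeded <- LDA(dtm, k = k, method = "Gibbs", seedwords = seeds,
                          control = list(alpha = 0.1))
        terms(lda_seeded, 10)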

So: Do we really want to get into this crowded field? Would a wrapper not be better?

I have other comments but let's decide on this larger issue first.

My comparison code:

library("quanteda")

# topicmodels: Gibbs sampling on the first 1,000 AP documents
library("topicmodels")
data("AssociatedPress", package = "topicmodels")
system.time({
    lda_topicmodels <- LDA(AssociatedPress[1:1000, ], method = "Gibbs",
                           control = list(alpha = 0.1), k = 20)
})
# infer topic prevalence for other documents, and list the top 10 terms
lda_inf <- posterior(lda_topicmodels, AssociatedPress[21:30, ])
terms(lda_topicmodels, 10)

# lda: collapsed Gibbs sampler on the same documents
library("lda")
lda_input <- as.dfm(AssociatedPress)[1:1000, ] %>%
    convert(to = "lda")
system.time({
    lda_lda <- lda.collapsed.gibbs.sampler(documents = lda_input$documents,
                                           K = 20,
                                           vocab = lda_input$vocab,
                                           num.iterations = 2000,
                                           alpha = 0.1,
                                           eta = 0.1,
                                           compute.log.likelihood = TRUE)
})
top.topic.words(lda_lda$topics, 10, by.score = TRUE)

# STM
library("stm")
system.time({
    lda_stm <- stm(as.dfm(AssociatedPress)[1:1000, ], K = 20)
})
labelTopics(lda_stm, n = 10)

# quanteda.textmodels
library("quanteda.textmodels")
system.time({
    lda_qtm <- textmodel_lda(as.dfm(AssociatedPress)[1:1000, ], k = 20)
})
quanteda.textmodels::terms(lda_qtm)
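
On the predict() idea in point 1, here is a rough sketch of the fold-in approach (purely hypothetical, not an existing quanteda.textmodels API; the function name and arguments are invented): fix the fitted topic-word distributions phi and re-estimate only the topic proportions of the new documents with a few EM steps.

# hypothetical sketch: fold-in prediction with a fixed topic-word matrix.
# phi is assumed to be K x V with rows summing to 1; x is a dfm whose
# features match the columns of phi
predict_lda_sketch <- function(phi, x, alpha = 0.1, iter = 50) {
    x <- as.matrix(x)
    theta <- matrix(1 / nrow(phi), nrow = nrow(x), ncol = nrow(phi))
    for (d in seq_len(nrow(x))) {
        keep <- which(x[d, ] > 0)
        for (i in seq_len(iter)) {
            # E-step: allocate each observed term's count across topics
            r <- theta[d, ] * phi[, keep, drop = FALSE]
            r <- sweep(r, 2, colSums(r), "/")
            # M-step: re-estimate this document's topic proportions
            theta[d, ] <- alpha + as.numeric(r %*% x[d, keep])
            theta[d, ] <- theta[d, ] / sum(theta[d, ])
        }
    }
    rownames(theta) <- rownames(x)
    theta  # D x K matrix of estimated topic proportions
}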

@koheiw (Collaborator, Author) commented Aug 18, 2020

If you look at the code of topicmodels, you will understand why. I also thought that its implementation of seeded LDA was wrong. However, I am happy to make this a separate package.

@koheiw closed this Aug 18, 2020
@kbenoit (Contributor) commented Aug 18, 2020

I have seen it, and it's a mess; so is the package. We could offer a simpler approach, so I didn't want to give the impression that I am totally against it, just aware of some challenges. Can we meet those challenges by providing the post-estimation functions in a simpler, quanteda-like way, and/or allowing them to work with those already found in another package such as stm, or the excellent LDAvis?
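
For instance, hooking a fitted model into LDAvis might look roughly like this (a sketch only: the phi and theta slot names on the fitted object are illustrative, and dfmat stands for the dfm the model was fitted on):

library("LDAvis")
# illustrative: assumes the fitted object stores phi (K x V) and theta (D x K)
json <- createJSON(phi = lda_qtm$phi,
                   theta = lda_qtm$theta,
                   doc.length = rowSums(dfmat),
                   vocab = colnames(dfmat),
                   term.frequency = colSums(dfmat))
serVis(json)  # opens the interactive topic browser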

@koheiw (Collaborator, Author) commented Aug 18, 2020

No worries. I was aware that topicmodels has seeded LDA; that is exactly why I chose LDAGibbs++ here. I am only interested in semisupervised LDA, so unsupervised LDA is just a by-product. The library has many more functions for online training and prediction on new data, but you reminded me that I would have to do a lot of work to add functions that I have no plan to use if I added the code to this package. This is why I am happy to put it in my seededlda package (which currently depends on topicmodels).

@kbenoit (Contributor) commented Aug 18, 2020

Ah, users: we love them, but they always ask for more. https://github.com/bstewart/stm/issues

@koheiw (Collaborator, Author) commented Aug 19, 2020

We don't have enough resources to maintain so many functions, so we need to let users develop the packages via PRs. This is why I think modularization is urgently needed.

@JBGruber (Collaborator) commented

Maybe it would be an option to participate in the Hacktoberfest event. I tried to participate last year, but there were almost no R issues around to solve.

@koheiw (Collaborator, Author) commented Aug 19, 2020

@JBGruber, can you tell us how we can get you (or people like you) to contribute actively to quanteda.textmodels or quanteda?

@JBGruber (Collaborator) commented

I think the Hacktoberfest guidelines for maintainers are a pretty good starting point to get people to contribute actively:

  • Apply the "Hacktoberfest" label to issues in your GitHub project that are ready for contributors to work on. (Since they advertise the event and collect all issues with the label centrally, people who otherwise wouldn't go to the issues page will see it. If you don't want to participate in that, or during the 11 months of the year when the event is not on, you could simply share a link to all quanteda issues with the "help wanted" label, of which there are currently none.)
  • Add a CONTRIBUTING.md file with contribution guidelines to your repository. (The quanteda wiki is already pretty good but might need some adaptation for people who have just started developing functions.)
  • Choose issues that have a well-defined scope and are self-contained. (This would probably be the most important thing for me personally. Looking through a few issues, it is sometimes not quite clear where to even start, especially with functions being moved, etc.)
  • Adopt a code of conduct to foster a greater sense of inclusion and community. (You already nailed that one, although it's only in the quanteda package itself so far.)

So I think that to get people to help, you would have to communicate that you want help, and for what specifically. That's why I brought up Hacktoberfest. But you could also, for example, organise a quanteda hackathon to solve a long list of issues at once and get things going. I would show up 😉
