Add native textmodel_lda #30

Open
koheiw opened this issue Aug 4, 2020 · 5 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

Collaborator

koheiw commented Aug 4, 2020

topicmodels::LDA is implemented using this library, which I can call directly via Rcpp:

https://sourceforge.net/projects/gibbslda/files/

We can call the library in this way:

https://github.com/cran/topicmodels/blob/ade6dc5698f385ad222fd28aa8e90c1a4bd33cf5/R/lda.R#L134-L155

There is a lot going on there, but it shouldn't be too complex to support the minimal functions that users usually need.

If we implement our own quanteda-native LDA, I will move quanteda.seededlda to this package.

https://github.com/koheiw/quanteda.seededlda

koheiw added the enhancement label Aug 4, 2020
Collaborator Author

koheiw commented Aug 4, 2020

GibbsLDA++-0.2.tar.gz

koheiw added the help wanted label Aug 4, 2020
koheiw added a commit that referenced this issue Aug 9, 2020
koheiw added a commit that referenced this issue Aug 10, 2020
Collaborator Author

koheiw commented Aug 10, 2020

I managed to get GibbsLDA++ working, and we now have both seeded and regular LDA.

# seeded LDA (replicates https://github.com/koheiw/quanteda.seededlda)

> result10 <- textmodel_lda(dfmt_spnik, verbose = FALSE, seeds = tfmt_spnik)
> terms(result10)
      economy    politics        society         diplomacy    military   nature      other     
 [1,] "company"  "parliament"    "police"        "diplomatic" "army"     "human"     "going"   
 [2,] "money"    "congress"      "school"        "embassy"    "navy"     "sand"      "really"  
 [3,] "market"   "politicians"   "hospital"      "ambassador" "soldiers" "water"     "come"    
 [4,] "bank"     "parliamentary" "prison"        "treaty"     "marine"   "syria"     "see"     
 [5,] "industry" "lawmakers"     "women"         "diplomat"   "korea"    "syrian"    "american"
 [6,] "banks"    "voters"        "man"           "diplomats"  "korean"   "terrorist" "know"    
 [7,] "markets"  "lawmaker"      "investigation" "sanctions"  "missile"  "daesh"     "facebook"
 [8,] "banking"  "politician"    "found"         "iran"       "air"      "turkish"   "much"    
 [9,] "china"    "uk"            "court"         "deal"       "nuclear"  "turkey"    "good"    
[10,] "chinese"  "eu"            "children"      "meeting"    "force"    "weapons"   "team"  

# regular (unseeded) LDA
> result11 <- textmodel_lda(dfmt_spnik, k = 7, verbose = FALSE)
> terms(result11)
      topic1     topic2      topic3      topic4       topic5      topic6         topic7    
 [1,] "korea"    "china"     "syria"     "eu"         "going"     "uk"           "police"  
 [2,] "korean"   "chinese"   "syrian"    "sanctions"  "really"    "house"        "video"   
 [3,] "nuclear"  "economic"  "israel"    "iran"       "much"      "british"      "women"   
 [4,] "missile"  "india"     "terrorist" "deal"       "know"      "department"   "court"   
 [5,] "air"      "oil"       "daesh"     "union"      "see"       "white"        "man"     
 [6,] "nato"     "billion"   "turkish"   "agreement"  "come"      "campaign"     "found"   
 [7,] "force"    "trade"     "turkey"    "germany"    "good"      "ukrainian"    "children"
 [8,] "japan"    "project"   "weapons"   "elections"  "something" "secretary"    "service" 
 [9,] "kim"      "indian"    "saudi"     "parliament" "facebook"  "ukraine"      "swedish" 
[10,] "aircraft" "companies" "iraq"      "german"     "problem"   "intelligence" "rights" 

My question is: should I split the function into textmodel_lda(x, k) and textmodel_seededlda(x, dictionary), as in my older package?

@JBGruber
Collaborator

Just my very subjective two cents: I think a dedicated textmodel_seededlda() function would be a good advertisement for the concept, as it is not widely known yet.

That doesn't mean textmodel_lda() shouldn't be able to do it as well, much like stringi::stri_detect(), which runs stringi::stri_detect_fixed() if one wants it to.

Collaborator Author

koheiw commented Aug 11, 2020

@JBGruber thanks for the input. I added textmodel_seededlda() to make it more visible to users.

Contributor

kbenoit commented Aug 18, 2020

Sorry to be a downer here - and I was offline for 2 weeks - but seeded LDA is already available through topicmodels::LDA(). See #31 (review).

3 participants