
3. Topic Modeling #7

Open
tmozgach opened this issue Feb 6, 2018 · 14 comments

Comments


tmozgach commented Feb 6, 2018

You could start with topic modeling first. Dr. Yang was using LDA with five or six methods like SVM etc. I think you can easily google some guides to do it with R or python. This is quite mature now.
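
For anyone starting from scratch, a minimal sketch of LDA topic modeling with gensim in Python (the toy corpus and the num_topics value are placeholders, not our data):

from gensim import corpora, models

# Toy corpus: in practice these would be tokenized Reddit posts.
texts = [
    ["startup", "idea", "funding", "investor"],
    ["car", "engine", "repair", "mechanic"],
    ["startup", "product", "market", "customer"],
]

# Map each token to an integer id and build bag-of-words vectors.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train a small LDA model and print the discovered topics.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)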


tmozgach commented Feb 6, 2018

Not important:
An explanation and example of topic modeling with R

http://brazenly.blogspot.ca/2016/05/r-text-classification-and-topic_1.html


tmozgach commented Feb 13, 2018

Main guide that I followed:
https://github.com/abhijeet3922/Topic-Modelling-on-Wiki-corpus


tmozgach commented Feb 22, 2018

DONE:
This one includes visualization:
https://github.com/shichaoji/easyLDA


tmozgach commented Mar 1, 2018

Based on the attached convergence_liklihood.pdf, we need 900-1000 iterations.

I produced the graph above by following this Stack Overflow question:
How to monitor convergence of a Gensim LDA model?
https://stackoverflow.com/questions/37570696/how-to-monitor-convergence-of-gensim-lda-model
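
A sketch of that approach: train with gensim's logging directed to a file, then pull the per-word bound values out of the log and plot them (the log file name and the regex are assumptions about how logging was set up):

import re
import matplotlib.pyplot as plt

# gensim periodically logs lines like
# "-7.123 per-word bound, 139.5 perplexity estimate based on a held-out corpus ..."
pattern = re.compile(r"(-\d+\.\d+) per-word bound")
bounds = []
with open("gensim.log") as log:
    for line in log:
        match = pattern.search(line)
        if match:
            bounds.append(float(match.group(1)))

# An increasing (less negative) per-word bound that flattens out suggests convergence.
plt.plot(bounds)
plt.xlabel("evaluation step")
plt.ylabel("per-word bound")
plt.show()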

Selecting the iterations and passes parameters of the LDA model:

I suggest the following way to choose iterations and passes. First, enable logging (as described in many Gensim tutorials), and set eval_every = 1 in LdaModel. When training the model, look for a line in the log that looks something like this:

2016-06-21 15:40:06,753 - gensim.models.ldamodel - DEBUG - 68/1566 documents converged within 400 iterations

If you set passes = 20, you will see this line 20 times. Make sure that, by the final passes, most of the documents have converged, so you want to choose both passes and iterations to be high enough for this to happen.
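
Putting that advice into code, a sketch (it assumes corpus and dictionary objects from earlier preprocessing; the file name and parameter values are illustrative):

import logging
from gensim.models import LdaModel

# Send gensim's DEBUG output to a file so the "documents converged" lines are captured.
logging.basicConfig(filename="gensim.log",
                    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
                    level=logging.DEBUG)

# eval_every=1 makes gensim report convergence on every evaluation;
# raise passes/iterations until most documents converge by the final passes.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10,
               passes=20, iterations=400, eval_every=1)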


tmozgach commented Mar 1, 2018

@neowangkkk
First result from Cedar, for Titles and Posts only (no COMMENTS), 10 topics; it took 8 hours. Download all the files and open the HTML file:
Uploading cedar_data_10topics.zip…
How to interpret visualization (last section):
https://nlpforhackers.io/topic-modeling/
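
The HTML file in the archive is an interactive topic visualization; a sketch of how such a page can be produced with pyLDAvis (assuming that is the tool behind it, and a trained lda model with its corpus and dictionary):

import pyLDAvis
import pyLDAvis.gensim  # named pyLDAvis.gensim_models in newer releases

# Build the interactive visualization from the trained model and save it as HTML.
vis = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_10topics.html")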


tmozgach commented Mar 1, 2018

@neowangkkk
20 Topics:
20Topics.zip

Working on 40 topics...


tmozgach commented Mar 2, 2018

@neowangkkk
40 Topics:
40Topics.zip

Working on 30 topics...


tmozgach commented Mar 5, 2018

@neowangkkk
30Topics.zip


tmozgach commented Mar 5, 2018

@neowangkkk
10, 20, 30, 40 Topics for Titles, Posts and their COMMENTS (10to40TopicsForALL.zip):
https://drive.google.com/drive/folders/1S8iAnnyjH4NxZpXmkX0ppN9c_Hm6XJLQ?usp=sharing

neowangkkk (Collaborator) commented:

R Code for importing data

# set your working directory
setwd("/Users/Tao/Dropbox/Data/Reddit_data/")

# install and load the tidyverse package
install.packages("tidyverse")
library(tidyverse)

# read the file using read_csv
data <- read_csv(file = "data_full.csv")

# check summary stats
summary(data)

tmozgach commented:

@neowangkkk
New result on your friend's data:
https://drive.google.com/drive/folders/1S8iAnnyjH4NxZpXmkX0ppN9c_Hm6XJLQ?usp=sharing
NewAllData.zip

neowangkkk (Collaborator) commented:

Can you please replace these:
ideas = idea
product = products
cars = car
entrepreneurs = entrepreneur

Can you please remove these:
doesnt
didnt
theyre
isnt
business
work
start
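
A hedged sketch of how these replacements and removals could be applied during preprocessing, before the gensim dictionary is built (the helper name and sample documents are illustrative; the mappings mirror the lists above):

# Merge word variants and drop the extra stopwords before building the dictionary.
replacements = {"ideas": "idea", "product": "products",
                "cars": "car", "entrepreneurs": "entrepreneur"}
extra_stopwords = {"doesnt", "didnt", "theyre", "isnt",
                   "business", "work", "start"}

def clean(tokens):
    # Apply the replacements first, then filter out the unwanted tokens.
    tokens = [replacements.get(t, t) for t in tokens]
    return [t for t in tokens if t not in extra_stopwords]

# texts: list of token lists from the earlier preprocessing steps.
texts = [clean(doc) for doc in texts]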

tmozgach commented:

@neowangkkk
https://drive.google.com/drive/folders/1S8iAnnyjH4NxZpXmkX0ppN9c_Hm6XJLQ
10-30.zip

I am not sure it was a good idea to exclude so many words; it seems to influence and change the topics. I got rid of the following words (some of them are repeated, but that doesn't affect anything):

awesome cant though theyre yeah around try enough keep way start work busines isnt theyre didnt doesnt i\'ve you\'re that\'s what\'s let\'s i\'d you\'ll aren\'t \"the i\'ll we\'re wont 009 don\'t it\'s nbsp i\'m get make like would want dont\' use one need know good take thank say also see really could much something ive well give first even great things come thats sure help youre lot someone ask best many question etc better still put might actually let love may tell every maybe always never probably anything cant\' doesnt\' ill already able anyone since another theres everything without didn\'t isn\'t youll\' per else ive get would like want hey might may without also make want put etc actually else far definitely youll\' didnt\' isnt\' theres since able maybe without may suggestedsort never isredditmediadomain userreports far appreciate next think know need look please one null take dont dont\' want\' could able ask well best someone sure lot thank also anyone really something give years use make all ago people know many call include part find become
