
3. Topic Modeling #7

Open
tmozgach opened this issue Feb 6, 2018 · 14 comments

Comments


tmozgach commented Feb 6, 2018

You could start with topic modeling first. Dr. Yang was using LDA with five or six methods like SVM etc. I think you can easily google some guides to do it with R or python. This is quite mature now.
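
For anyone starting from scratch, a minimal sketch of LDA topic modeling with gensim in Python (the toy corpus and the num_topics value are placeholders, not our data):

from gensim import corpora, models

# Toy corpus: in practice these would be tokenized Reddit posts.
texts = [
    ["startup", "idea", "funding", "investor"],
    ["car", "engine", "repair", "mechanic"],
    ["startup", "product", "market", "customer"],
]

# Map each token to an integer id and build bag-of-words vectors.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train a small LDA model and print the discovered topics.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)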


tmozgach commented Feb 6, 2018

Not important:
An explanation and example of topic modeling with R

http://brazenly.blogspot.ca/2016/05/r-text-classification-and-topic_1.html


tmozgach commented Feb 13, 2018

Main guide that I followed:
https://github.com/abhijeet3922/Topic-Modelling-on-Wiki-corpus


tmozgach commented Feb 22, 2018

DONE:
This one includes visualization:
https://github.com/shichaoji/easyLDA


tmozgach commented Mar 1, 2018

Based on the attached convergence_liklihood.pdf, we need 900-1000 iterations.

I produced the graph above by following this Stack Overflow question:
How to monitor convergence of a Gensim LDA model?
https://stackoverflow.com/questions/37570696/how-to-monitor-convergence-of-gensim-lda-model
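
A sketch of that approach: train with gensim's logging directed to a file, then pull the per-word bound values out of the log and plot them (the log file name and the regex are assumptions about how logging was set up):

import re
import matplotlib.pyplot as plt

# gensim periodically logs lines like
# "-7.123 per-word bound, 139.5 perplexity estimate based on a held-out corpus ..."
pattern = re.compile(r"(-\d+\.\d+) per-word bound")
bounds = []
with open("gensim.log") as log:
    for line in log:
        match = pattern.search(line)
        if match:
            bounds.append(float(match.group(1)))

# An increasing (less negative) per-word bound that flattens out suggests convergence.
plt.plot(bounds)
plt.xlabel("evaluation step")
plt.ylabel("per-word bound")
plt.show()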

Selecting the iterations and passes parameters of the LDA model:

I suggest the following way to choose iterations and passes. First, enable logging (as described in many Gensim tutorials), and set eval_every = 1 in LdaModel. When training the model, look for a line in the log that looks something like this:

2016-06-21 15:40:06,753 - gensim.models.ldamodel - DEBUG - 68/1566 documents converged within 400 iterations

If you set passes = 20, you will see this line 20 times. Make sure that, by the final passes, most of the documents have converged, so you want to choose both passes and iterations to be high enough for this to happen.
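
Putting that advice into code, a sketch (it assumes corpus and dictionary objects from earlier preprocessing; the file name and parameter values are illustrative):

import logging
from gensim.models import LdaModel

# Send gensim's DEBUG output to a file so the "documents converged" lines are captured.
logging.basicConfig(filename="gensim.log",
                    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
                    level=logging.DEBUG)

# eval_every=1 makes gensim report convergence on every evaluation;
# raise passes/iterations until most documents converge by the final passes.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10,
               passes=20, iterations=400, eval_every=1)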


tmozgach commented Mar 1, 2018

@neowangkkk
First result from Cedar, for Titles and Posts only (no COMMENTS), 10 topics; it took 8 hours. Download all the files and open the HTML file:
Uploading cedar_data_10topics.zip…
How to interpret visualization (last section):
https://nlpforhackers.io/topic-modeling/
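
The HTML file in the archive is an interactive topic visualization; a sketch of how such a page can be produced with pyLDAvis (assuming that is the tool behind it, and a trained lda model with its corpus and dictionary):

import pyLDAvis
import pyLDAvis.gensim  # named pyLDAvis.gensim_models in newer releases

# Build the interactive visualization from the trained model and save it as HTML.
vis = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_10topics.html")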


tmozgach commented Mar 1, 2018

@neowangkkk
20 Topics:
20Topics.zip

Working on 40 topics...


tmozgach commented Mar 2, 2018

@neowangkkk
40 Topics:
40Topics.zip

Working on 30 topics...


tmozgach commented Mar 5, 2018

@neowangkkk
30Topics.zip


tmozgach commented Mar 5, 2018

@neowangkkk
10, 20, 30, 40 Topics for Titles, Posts and their COMMENTS (10to40TopicsForALL.zip):
https://drive.google.com/drive/folders/1S8iAnnyjH4NxZpXmkX0ppN9c_Hm6XJLQ?usp=sharing

neowangkkk (Collaborator) commented:

R Code for importing data

# set your working directory
setwd("/Users/Tao/Dropbox/Data/Reddit_data/")

# install and load the tidyverse package
install.packages("tidyverse")
library(tidyverse)

# read the file using read_csv
data <- read_csv(file = "data_full.csv")

# check summary stats
summary(data)

tmozgach commented:

@neowangkkk
New result on your friend's data:
https://drive.google.com/drive/folders/1S8iAnnyjH4NxZpXmkX0ppN9c_Hm6XJLQ?usp=sharing
NewAllData.zip

neowangkkk (Collaborator) commented:

Can you please replace these:
ideas = idea
product = products
cars = car
entrepreneurs = entrepreneur

Can you please remove these:
doesnt
didnt
theyre
isnt
business
work
start
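
A hedged sketch of how these replacements and removals could be applied during preprocessing, before the gensim dictionary is built (the helper name and sample documents are illustrative; the mappings mirror the lists above):

# Merge word variants and drop the extra stopwords before building the dictionary.
replacements = {"ideas": "idea", "product": "products",
                "cars": "car", "entrepreneurs": "entrepreneur"}
extra_stopwords = {"doesnt", "didnt", "theyre", "isnt",
                   "business", "work", "start"}

def clean(tokens):
    # Apply the replacements first, then filter out the unwanted tokens.
    tokens = [replacements.get(t, t) for t in tokens]
    return [t for t in tokens if t not in extra_stopwords]

# texts: list of token lists from the earlier preprocessing steps.
texts = [clean(doc) for doc in texts]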

tmozgach commented:

@neowangkkk
https://drive.google.com/drive/folders/1S8iAnnyjH4NxZpXmkX0ppN9c_Hm6XJLQ
10-30.zip

I am not sure it was a good idea to exclude so many words; it seems to influence and change the topics. I got rid of the following words (some of them are repeated, but that doesn't affect anything):

awesome cant though theyre yeah around try enough keep way start work busines isnt theyre didnt doesnt i\'ve you\'re that\'s what\'s let\'s i\'d you\'ll aren\'t \"the i\'ll we\'re wont 009 don\'t it\'s nbsp i\'m get make like would want dont\' use one need know good take thank say also see really could much something ive well give first even great things come thats sure help youre lot someone ask best many question etc better still put might actually let love may tell every maybe always never probably anything cant\' doesnt\' ill already able anyone since another theres everything without didn\'t isn\'t youll\' per else ive get would like want hey might may without also make want put etc actually else far definitely youll\' didnt\' isnt\' theres since able maybe without may suggestedsort never isredditmediadomain userreports far appreciate next think know need look please one null take dont dont\' want\' could able ask well best someone sure lot thank also anyone really something give years use make all ago people know many call include part find become
