
1. Data gathering: crawl data from Reddit. #1

Open
tmozgach opened this issue Jan 8, 2018 · 19 comments
tmozgach (Owner) commented Jan 8, 2018

neowangkkk (Collaborator) commented Jan 9, 2018

Glad to form a work team here. Our first job is to crawl more data from the target community:
https://www.reddit.com/r/Entrepreneur/

The time period we want to crawl is 2012-01-01 to 2017-12-31.

The required variables are shown in the attachment. If you have any questions, please let me know.
(screenshot attached: required variables, 2018-01-09)

Sample.xlsx

tmozgach changed the title from "Background" to "Data gathering: crawl data from Reddit" on Jan 11, 2018
tmozgach (Owner) commented:

@neowangkkk I haven't done this before, so to save time and stay consistent, could you please give me the script, or the method/web link, you used before? Is it something like this? https://www.labnol.org/internet/web-scraping-reddit/28369/

tmozgach changed the title from "Data gathering: crawl data from Reddit" to "1.Data gathering: crawl data from Reddit." on Jan 11, 2018
neowangkkk (Collaborator) commented Jan 11, 2018

@tmozgach Last time I paid $150 to hire a part-time programmer to crawl the data. He told me he wrote the program in C++. I don't think he will give me the code :-(

I am not sure about the difficulty of crawling reddit.com. Can you please search and check whether a Python package or something else can do it? If it is still a problem after 20 work hours, we can go back to the part-time programmer. I understand some websites use a lot of tricks to prevent people from crawling their content; it may be a huge task that only people with years of crawling experience can handle. But it is worth learning and trying while our time still allows.

The Google Sheets method in your link may have a flaw: it says a subreddit can only show 1,000 posts, but our last crawl got over 25,000 threads for 18 months.

If you have any questions, please feel free to let me know.

tmozgach (Owner) commented Jan 16, 2018

PRAW API

Install pip and praw without root on Linux:
https://gist.github.com/saurabhshri/46e4069164b87a708b39d947e4527298

curl -L https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py --user
python -m pip install --user praw

For macOS:

https://gist.github.com/haircut/14705555d58432a5f01f9188006a04ed

Reddit video tutorials:
https://www.youtube.com/watch?v=NRgfgtzIhBQ

Documentation:
http://praw.readthedocs.io/en/latest/getting_started/quick_start.html

The user_agent parameter?

According to Reddit, the user agent is required to identify basic information about the script accessing Reddit, such as its name, version, and author; that is what is supposed to go in there, e.g. "prawtutorial v1.0 by /u/sentdex" or similar. Here I use "Archival Bot".

r = praw.Reddit(client_id='***',
                client_secret='***',
                user_agent='Archival Bot')
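
A minimal sketch of pulling submissions from r/Entrepreneur with PRAW (the credentials are placeholders; attribute names are from the PRAW docs, and the roughly 1,000-item listing cap is why this alone cannot cover 2012-2017):

import praw

# placeholder credentials; register a "script" app at https://www.reddit.com/prefs/apps
reddit = praw.Reddit(client_id='***',
                     client_secret='***',
                     user_agent='Archival Bot v0.1 by /u/your_username')

# iterate over the newest submissions; Reddit listings are capped at about 1000 items
for submission in reddit.subreddit('Entrepreneur').new(limit=None):
    print(submission.created_utc, submission.title, submission.score, submission.num_comments)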

Things to try:

tmozgach reopened this on Jan 16, 2018
tmozgach (Owner) commented Jan 18, 2018

Apparently, all Reddit comments are stored in BigQuery, and we can pull them with SQL.
Example:
Using BigQuery with Reddit Data
https://pushshift.io/using-bigquery-with-reddit-data/
https://bigquery.cloud.google.com/table/fh-bigquery:reddit_posts.2017_12

Retrieving Data From Google BigQuery (Reddit Relevant XKCD)
http://blog.reddolution.com/Life/index.php/2017/05/retrieving-data-from-google-bigquery-reddit-relevant-xkcd/
https://stackoverflow.com/questions/18493533/how-to-download-all-data-in-a-google-bigquery-dataset
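
A minimal sketch (assuming the google-cloud-bigquery Python client and a configured GCP project; the column names are my reading of the fh-bigquery table schema and should be double-checked) of pulling one month of r/Entrepreneur posts into pandas:

from google.cloud import bigquery

client = bigquery.Client()  # uses your GCP project credentials

query = """
SELECT created_utc, author, title, selftext, score, num_comments
FROM `fh-bigquery.reddit_posts.2017_12`
WHERE subreddit = 'Entrepreneur'
"""

# run the query, export the result to a DataFrame, then save as CSV for R/Python work
df = client.query(query).to_dataframe()
df.to_csv("entrepreneur_2017_12.csv", index=False)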

Also, it seems we can do NLP analysis using the Google NLP API.
Example:
Machine learning, NLP Google APIs
https://hackernoon.com/how-people-talk-about-marijuana-on-reddit-a-natural-language-analysis-a8d595882a7a
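
For illustration only (assuming the google-cloud-language client; we have not tried it here), sentiment analysis of one post with the Google NLP API looks roughly like this:

from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
text = "Starting a business is hard but rewarding."
document = language_v1.Document(content=text, type_=language_v1.Document.Type.PLAIN_TEXT)

# document_sentiment.score is in [-1.0, 1.0]; magnitude reflects overall emotional strength
sentiment = client.analyze_sentiment(request={"document": document}).document_sentiment
print(sentiment.score, sentiment.magnitude)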

neowangkkk (Collaborator) commented:

Wow, that's great! Very promising. Can we output the query result into an R or Python data format?

tmozgach (Owner) commented Jan 20, 2018

@neowangkkk There is probably a way, but I ran into another issue: BigQuery stores both comments and POSTS for 2016 and 2017, but there are NO posts for 2012-2015, only comments! I mean, there are no main posts or titles (from the sender), only the comments (replies).
For now I have crawled data for 2012-2015, without karma and other user information, using the following script:
https://github.com/peoplma/subredditarchive/blob/master/subredditarchive.py

tmozgach (Owner) commented Jan 20, 2018

@neowangkkk Could you provide the attributes that you need JUST for Topic Modeling?

tmozgach (Owner) commented Jan 26, 2018

tmozgach changed the title from "1.Data gathering: crawl data from Reddit." to "1. Data gathering: crawl data from Reddit." on Feb 6, 2018
tmozgach (Owner) commented Feb 7, 2018

Raw data for the years 2012-2017:
https://drive.google.com/open?id=1S8iAnnyjH4NxZpXmkX0ppN9c_Hm6XJLQ

tmozgach (Owner) commented Feb 13, 2018

@neowangkkk
Files:
TPostComRaw.csv: uncleaned titles, main posts, and comments for 2012-2017.
TPostRaw.csv: uncleaned titles and main posts WITHOUT comments for 2012-2017.
https://drive.google.com/drive/folders/1S8iAnnyjH4NxZpXmkX0ppN9c_Hm6XJLQ?usp=sharing

tmozgach (Owner) commented:

Probably I need to join the comments, post, and title into ONE paragraph?

neowangkkk (Collaborator) commented Feb 22, 2018 via email

tmozgach (Owner) commented:

Every row of that csv is one thread (title, post, comments).

tmozgach (Owner) commented Mar 9, 2018

Raw, not formatted:
2009_2011data.csv.zip

tmozgach (Owner) commented:

Data for Topic Modeling 2009-2017:
https://drive.google.com/drive/folders/1S8iAnnyjH4NxZpXmkX0ppN9c_Hm6XJLQ?usp=sharing
newRawAllData.csv

tmozgach (Owner) commented Mar 17, 2018

Replace NA values in Title with the previous Title.
R:

library(tidyverse)
library(zoo)
library(dplyr)

# read the full data set and drop rows whose Date does not parse
myDataf = read_delim("/home/tatyana/Downloads/data_full.csv", delim = ',')
myDataff = myDataf[!is.na(strptime(myDataf$Date, format = "%Y-%m-%d %H:%M:%S")),]

# some titles duplicate other threads' titles; titles are not unique, so make them unique
myDataff$Title <- make.unique(as.character(myDataff$Title), sep = "___-___")

# make.unique also turns NA into unique strings by appending a number; turn those back into NA
myDataff$Title <- gsub("NA__+", NA, myDataff$Title)

# fill each NA with the previous (non-NA) Title
myDataff['Title2'] = data.frame(col1 = myDataff$Title, col2 = myDataff$Conversation) %>%
  do(na.locf(.))
write_csv(myDataff, "data_full_title2.csv")

newDff = data.frame(col1 = myDataff$Title, col2 = myDataff$Conversation) %>%
  do(na.locf(.))
write_csv(newDff, "dataForPyth.csv")

Merge all comments and the title into one document/row.

Python:

import csv
import pandas as pd

tit = ""          # title of the thread currently being collected
com = ""          # accumulated "title + post + comments" text for that thread
rows_list = []    # one merged document per thread
title_list = []   # thread titles, in the same order

with open("/home/tatyana/dataForPyth.csv", "rt") as f:
    reader = csv.reader(f)
    for i, line in enumerate(reader):
        if i == 0:          # skip the header row
            continue
        if i == 1:          # first data row starts the first thread
            title_list.append(line[0])
            tit = line[0]
            com = line[0] + " " + line[1]
            continue

        if line[0] == tit:  # same thread: append the comment text
            com = com + " " + line[1]
        else:               # new thread: flush the previous one and start over
            rows_list.append(com)
            tit = line[0]
            title_list.append(line[0])
            com = line[0] + " " + line[1]

rows_list.append(com)       # flush the last thread

df = pd.DataFrame(rows_list)
df['Topic'] = pd.Series(title_list).values

df.to_csv("newRawAllData.csv", index=False, header=False)

Topic modeling and labeling (a sketch follows at the end of this comment);

Merge the labels with data_full_title2.csv (R):


# read the topic labels produced by the topic modeling step
myLabDataf = read_delim("/home/tatyana/nlp/LabeledTopic.csv", delim = ',')

# 8 threads had some issue and weren't merged
newm = merge(myDataff, myLabDataf, by.x = 'Title2', by.y = 'title')

fin = select(newm, Date, Sender, Title2, Replier, Conversation, `Points from this question`, `Post Karma`, `Comment Karma`, `Date joining the forum;Category Label`, `Topic/Probability`, `Main Topic`, `Main Probability`)

write_csv(fin, "final.csv")
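
For the topic modeling step itself, a minimal sketch over the merged documents in newRawAllData.csv (gensim is an assumption here; the thread does not say which library is used, and num_topics is arbitrary):

import pandas as pd
from gensim import corpora, models
from gensim.utils import simple_preprocess

# each row of newRawAllData.csv is one thread: merged "title + post + comments" text, plus the title
docs = pd.read_csv("newRawAllData.csv", header=None, names=["text", "Topic"])

# tokenize, build the dictionary, and convert each document to bag-of-words
texts = [simple_preprocess(t) for t in docs["text"].astype(str)]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# fit LDA and print the top words of each topic
lda = models.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=5)
for topic_id, words in lda.print_topics():
    print(topic_id, words)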

