
1. Data gathering: crawl data from Reddit. #1

Open
tmozgach opened this issue Jan 8, 2018 · 19 comments
tmozgach (Owner) commented Jan 8, 2018

neowangkkk (Collaborator) commented Jan 9, 2018

Glad to form a work team here. Our first job is to crawl more data from the target community:
https://www.reddit.com/r/Entrepreneur/

The time period we want to crawl is 2012-01-01 to 2017-12-31.

The required variables are shown in the attachment. If you have any questions, please let me know.
(screenshot attached: required variables, 2018-01-09)

Sample.xlsx

tmozgach changed the title from "Background" to "Data gathering: crawl data from Reddit" on Jan 11, 2018
tmozgach (Owner) commented:

@neowangkkk I haven't done this before, so to save time and stay consistent, could you please give me the script, or the method/web link, you used before? Is it something like this? https://www.labnol.org/internet/web-scraping-reddit/28369/

tmozgach changed the title from "Data gathering: crawl data from Reddit" to "1.Data gathering: crawl data from Reddit." on Jan 11, 2018
neowangkkk (Collaborator) commented Jan 11, 2018

@tmozgach Last time I paid $150 to hire a part-time programmer to crawl the data. He told me he wrote the program in C++. I don't think he will give me the code :-(

I am not sure about the difficulty of crawling reddit.com. Can you please search and check whether a Python package or something else can do it? If it is still a problem after 20 work hours, we can go back to the part-time programmer. I understand some websites use a lot of tricks to prevent people from crawling their content; it may be a huge task that only people with years of crawling experience can handle. But it is worth learning and trying while our time still allows.

The Google Sheets method in your link may have a flaw: it says a subreddit can only show 1,000 posts, but our last crawl got over 25,000 threads for 18 months.

If you have any questions, please feel free to let me know.

tmozgach (Owner) commented Jan 16, 2018

PRAW API

Install pip and praw without root on Linux:
https://gist.github.com/saurabhshri/46e4069164b87a708b39d947e4527298

curl -L https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py --user
python -m pip install --user praw

For macOS:

https://gist.github.com/haircut/14705555d58432a5f01f9188006a04ed

Reddit video tutorials:
https://www.youtube.com/watch?v=NRgfgtzIhBQ

Documentation:
http://praw.readthedocs.io/en/latest/getting_started/quick_start.html

The user_agent parameter?

According to Reddit, the user agent is required to identify basic information about the script accessing Reddit, such as its name, version, and author; that is what is supposed to go in there, e.g. "prawtutorial v1.0 by /u/sentdex" or similar. Here I use "Archival Bot".

r = praw.Reddit(client_id='***',
                client_secret='***',
                user_agent='Archival Bot')
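
A minimal sketch of pulling submissions from r/Entrepreneur with PRAW (the credentials are placeholders; attribute names are from the PRAW docs, and the roughly 1,000-item listing cap is why this alone cannot cover 2012-2017):

import praw

# placeholder credentials; register a "script" app at https://www.reddit.com/prefs/apps
reddit = praw.Reddit(client_id='***',
                     client_secret='***',
                     user_agent='Archival Bot v0.1 by /u/your_username')

# iterate over the newest submissions; Reddit listings are capped at about 1000 items
for submission in reddit.subreddit('Entrepreneur').new(limit=None):
    print(submission.created_utc, submission.title, submission.score, submission.num_comments)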

Things to try:

tmozgach reopened this on Jan 16, 2018
tmozgach (Owner) commented Jan 18, 2018

Apparently, all Reddit comments are stored in BigQuery, and we can pull them with SQL.
Example:
Using BigQuery with Reddit Data
https://pushshift.io/using-bigquery-with-reddit-data/
https://bigquery.cloud.google.com/table/fh-bigquery:reddit_posts.2017_12

Retrieving Data From Google BigQuery (Reddit Relevant XKCD)
http://blog.reddolution.com/Life/index.php/2017/05/retrieving-data-from-google-bigquery-reddit-relevant-xkcd/
https://stackoverflow.com/questions/18493533/how-to-download-all-data-in-a-google-bigquery-dataset
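
A minimal sketch (assuming the google-cloud-bigquery Python client and a configured GCP project; the column names are my reading of the fh-bigquery table schema and should be double-checked) of pulling one month of r/Entrepreneur posts into pandas:

from google.cloud import bigquery

client = bigquery.Client()  # uses your GCP project credentials

query = """
SELECT created_utc, author, title, selftext, score, num_comments
FROM `fh-bigquery.reddit_posts.2017_12`
WHERE subreddit = 'Entrepreneur'
"""

# run the query, export the result to a DataFrame, then save as CSV for R/Python work
df = client.query(query).to_dataframe()
df.to_csv("entrepreneur_2017_12.csv", index=False)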

Also, it seems we can do NLP analysis using the Google NLP API.
Example:
Machine learning, NLP Google APIs
https://hackernoon.com/how-people-talk-about-marijuana-on-reddit-a-natural-language-analysis-a8d595882a7a
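
For illustration only (assuming the google-cloud-language client; we have not tried it here), sentiment analysis of one post with the Google NLP API looks roughly like this:

from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
text = "Starting a business is hard but rewarding."
document = language_v1.Document(content=text, type_=language_v1.Document.Type.PLAIN_TEXT)

# document_sentiment.score is in [-1.0, 1.0]; magnitude reflects overall emotional strength
sentiment = client.analyze_sentiment(request={"document": document}).document_sentiment
print(sentiment.score, sentiment.magnitude)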

neowangkkk (Collaborator) commented:

Wow, that's great! Very promising. Can we output the query result into an R or Python data format?

tmozgach (Owner) commented Jan 20, 2018

@neowangkkk There is probably a way, but I ran into another issue: BigQuery stores both comments and POSTS for 2016 and 2017, but there are NO posts for 2012-2015, only comments! I mean, there are no main posts or titles (from the sender), only the comments (replies).
For now I have crawled data for 2012-2015, without karma and other user information, using the following script:
https://github.com/peoplma/subredditarchive/blob/master/subredditarchive.py

tmozgach (Owner) commented Jan 20, 2018

@neowangkkk Could you provide the attributes that you need JUST for Topic Modeling?

tmozgach (Owner) commented Jan 26, 2018

tmozgach changed the title from "1.Data gathering: crawl data from Reddit." to "1. Data gathering: crawl data from Reddit." on Feb 6, 2018
tmozgach (Owner) commented Feb 7, 2018

Raw data for the years 2012-2017:
https://drive.google.com/open?id=1S8iAnnyjH4NxZpXmkX0ppN9c_Hm6XJLQ

tmozgach (Owner) commented Feb 13, 2018

@neowangkkk
Files:
TPostComRaw.csv: uncleaned titles, main posts, and comments for 2012-2017.
TPostRaw.csv: uncleaned titles and main posts WITHOUT comments for 2012-2017.
https://drive.google.com/drive/folders/1S8iAnnyjH4NxZpXmkX0ppN9c_Hm6XJLQ?usp=sharing

tmozgach (Owner) commented:

Probably I need to join the comments, post, and title into ONE paragraph?

neowangkkk (Collaborator) commented Feb 22, 2018 via email

tmozgach (Owner) commented:

Every row of that csv is one thread (title, post, comments).

tmozgach (Owner) commented Mar 9, 2018

Raw, not formatted:
2009_2011data.csv.zip

tmozgach (Owner) commented:

Data for Topic Modeling 2009-2017:
https://drive.google.com/drive/folders/1S8iAnnyjH4NxZpXmkX0ppN9c_Hm6XJLQ?usp=sharing
newRawAllData.csv

tmozgach (Owner) commented Mar 17, 2018

Replace NA values in Title with the previous Title.
R:

library(tidyverse)
library(zoo)
library(dplyr)

# read the full data set and drop rows whose Date does not parse
myDataf = read_delim("/home/tatyana/Downloads/data_full.csv", delim = ',')
myDataff = myDataf[!is.na(strptime(myDataf$Date, format = "%Y-%m-%d %H:%M:%S")),]

# some titles duplicate other threads' titles; titles are not unique, so make them unique
myDataff$Title <- make.unique(as.character(myDataff$Title), sep = "___-___")

# make.unique also turns NA into unique strings by appending a number; turn those back into NA
myDataff$Title <- gsub("NA__+", NA, myDataff$Title)

# fill each NA with the previous (non-NA) Title
myDataff['Title2'] = data.frame(col1 = myDataff$Title, col2 = myDataff$Conversation) %>%
  do(na.locf(.))
write_csv(myDataff, "data_full_title2.csv")

newDff = data.frame(col1 = myDataff$Title, col2 = myDataff$Conversation) %>%
  do(na.locf(.))
write_csv(newDff, "dataForPyth.csv")

Merge all comments and the title into one document/row.

Python:

import csv
import pandas as pd

tit = ""          # title of the thread currently being collected
com = ""          # accumulated "title + post + comments" text for that thread
rows_list = []    # one merged document per thread
title_list = []   # thread titles, in the same order

with open("/home/tatyana/dataForPyth.csv", "rt") as f:
    reader = csv.reader(f)
    for i, line in enumerate(reader):
        if i == 0:          # skip the header row
            continue
        if i == 1:          # first data row starts the first thread
            title_list.append(line[0])
            tit = line[0]
            com = line[0] + " " + line[1]
            continue

        if line[0] == tit:  # same thread: append the comment text
            com = com + " " + line[1]
        else:               # new thread: flush the previous one and start over
            rows_list.append(com)
            tit = line[0]
            title_list.append(line[0])
            com = line[0] + " " + line[1]

rows_list.append(com)       # flush the last thread

df = pd.DataFrame(rows_list)
df['Topic'] = pd.Series(title_list).values

df.to_csv("newRawAllData.csv", index=False, header=False)

Topic modeling and labeling (a sketch follows at the end of this comment);

Merge the labels with data_full_title2.csv (R):


# read the topic labels produced by the topic modeling step
myLabDataf = read_delim("/home/tatyana/nlp/LabeledTopic.csv", delim = ',')

# 8 threads had some issue and weren't merged
newm = merge(myDataff, myLabDataf, by.x = 'Title2', by.y = 'title')

fin = select(newm, Date, Sender, Title2, Replier, Conversation, `Points from this question`, `Post Karma`, `Comment Karma`, `Date joining the forum;Category Label`, `Topic/Probability`, `Main Topic`, `Main Probability`)

write_csv(fin, "final.csv")
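
For the topic modeling step itself, a minimal sketch over the merged documents in newRawAllData.csv (gensim is an assumption here; the thread does not say which library is used, and num_topics is arbitrary):

import pandas as pd
from gensim import corpora, models
from gensim.utils import simple_preprocess

# each row of newRawAllData.csv is one thread: merged "title + post + comments" text, plus the title
docs = pd.read_csv("newRawAllData.csv", header=None, names=["text", "Topic"])

# tokenize, build the dictionary, and convert each document to bag-of-words
texts = [simple_preprocess(t) for t in docs["text"].astype(str)]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# fit LDA and print the top words of each topic
lda = models.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=5)
for topic_id, words in lda.print_topics():
    print(topic_id, words)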

