1. Data gathering: crawl data from Reddit. #1
Glad to form a work team here. The first job is that we want to crawl more data from the target community. The time period we want to crawl is from 2012.1.1 to 2017.12.31. The required variables are shown in the attachment. If you have any questions, please let me know.
@neowangkkk I haven't done this before, so to save time and stay consistent, could you please share the script or the method/web link describing how you did it before? Is it something like this? https://www.labnol.org/internet/web-scraping-reddit/28369/
@tmozgach Last time I paid $150 to hire a part-time programmer to crawl the data. He told me he used C++ to write the program. I don't believe he will give me the code :-( I am not sure about the difficulty level of web crawling at reddit.com. Can you please search and check whether any Python package or something else can do it? If it is still a problem after 20 work hours, we may go back to the part-time programmer. I understand some websites use a lot of tricks to prevent people from crawling their content. It may be a huge task that only people with years of crawling experience can handle. But it is worth learning and trying while our time still allows. The Google Sheets method in your link may have some flaws: it said a subreddit can only show 1,000 posts, but our last crawl got over 25,000 threads for 18 months. If you have any questions, please feel free to let me know.
PRAW API install
For Mac:
Reddit video tutorials:
Documentation:
Things to try:
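For reference, a minimal PRAW sketch of pulling submissions and their comments. This assumes you have registered a script app at https://www.reddit.com/prefs/apps; the credentials and subreddit name below are placeholders, not values from this project.

```python
# Minimal PRAW sketch: fetch recent submissions and their comments.
# Credentials and the subreddit name are placeholders.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="research-crawler/0.1",
)

for submission in reddit.subreddit("SUBREDDIT_NAME").new(limit=100):
    print(submission.id, submission.created_utc, submission.title)
    submission.comments.replace_more(limit=0)  # expand "load more comments" stubs
    for comment in submission.comments.list():
        print("  ", comment.author, str(comment.body)[:80])
```

Note that Reddit's listing endpoints cap out at roughly 1,000 items per listing, which matches the limitation mentioned above; that is one reason bulk archives (BigQuery, discussed below) are attractive for multi-year crawls.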
Apparently, all Reddit comments are stored in BigQuery. See Retrieving Data From Google BigQuery (Reddit Relevant XKCD). Also, it seems we can do NLP analysis using the Google NLP API.
Wow, that's great! Very promising. Can we output the query result into an R or Python data format?
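Yes, query results can land directly in a pandas DataFrame. A sketch, assuming the public `fh-bigquery` Reddit tables, a configured Google Cloud project, and the `google-cloud-bigquery` package; the subreddit name is a placeholder:

```python
# Run a BigQuery query over the public Reddit comment archive and
# load the result into pandas. Requires default GCP credentials.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT id, author, created_utc, body
    FROM `fh-bigquery.reddit_comments.2017_01`
    WHERE subreddit = 'SUBREDDIT_NAME'
    LIMIT 1000
"""
df = client.query(sql).to_dataframe()  # pandas DataFrame
df.to_csv("comments_2017_01.csv", index=False)  # or keep in memory for analysis
```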
@neowangkkk Probably there is a way, but I ran into another issue: BigQuery stores comments and POSTS for 2016 and 2017, but there are NO posts for 2012-2015, only comments! That is, there are no main posts or titles (from the submitter), only the comments (replies).
@neowangkkk Could you provide the attributes that you need JUST for topic modeling?
Parse JSON to CSV
Useful links (not used):
The parser is ready (using regular expressions); see the sketch below:
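The actual parser reportedly used regular expressions; as an illustration, here is a sketch of the same flattening step using the `json` module instead, assuming newline-delimited Reddit JSON. File names and the field list are placeholders following Reddit's comment schema.

```python
# Flatten newline-delimited Reddit JSON records into a CSV.
import csv
import json

fields = ["id", "link_id", "author", "created_utc", "body"]

with open("comments.ndjson", encoding="utf-8") as src, \
     open("comments.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.DictWriter(dst, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for line in src:
        record = json.loads(line)
        writer.writerow({k: record.get(k, "") for k in fields})
```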
Raw information for the years 2012-2017:
@neowangkkk
Probably I need to join the comments, post, and title into ONE paragraph?
Yes. As discussed last time, combining all the text in one thread may generate a better outcome in clustering/topic modeling. Please go ahead and try it.
In addition, I checked the old karma data. It is a fixed value for each individual at the point of crawling; we can't get karma changes over time. So if you can later get the karma data for the participants in your investigated period, that would be fine.
Every row of that CSV is one thread (title, post, comments).
Raw, not formatted:
All data for 2009-2017:
Data for topic modeling, 2009-2017:
Replace NA in Title with the previous Title.
Merge all comments and the title into one document/row (Python); see the sketch below.
Topic modeling and labeling; merge the labels with data_full_title2.csv.
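A sketch of the two preprocessing steps above in pandas, assuming a CSV with `Title`, `Post`, and `Comment` columns; the actual column names in data_full_title2.csv may differ.

```python
# Step 1: forward-fill NA titles so every comment row carries its thread's title.
# Step 2: merge the title, post, and all comments of a thread into one document/row.
import pandas as pd

df = pd.read_csv("data_full_title2.csv")
df["Title"] = df["Title"].ffill()

docs = (
    df.groupby("Title", sort=False)
      .apply(lambda g: " ".join(
          [str(g["Title"].iloc[0])]
          + g["Post"].dropna().astype(str).unique().tolist()
          + g["Comment"].dropna().astype(str).tolist()
      ))
      .reset_index(name="document")
)
docs.to_csv("threads_one_doc_per_row.csv", index=False)
```

The thread does not say which topic-modeling library was used; continuing from the sketch above, here is one hedged illustration of the modeling-and-labeling step with gensim's LDA (topic count and filter thresholds are arbitrary placeholders):

```python
# Fit an LDA model on the one-document-per-thread corpus and
# label each document with its dominant topic.
from gensim import corpora, models
from gensim.utils import simple_preprocess

texts = [simple_preprocess(doc) for doc in docs["document"]]
dictionary = corpora.Dictionary(texts)
dictionary.filter_extremes(no_below=5, no_above=0.5)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, num_topics=10, id2word=dictionary,
                      passes=5, random_state=42)
for topic_id, words in lda.print_topics(num_topics=10, num_words=8):
    print(topic_id, words)

labels = []
for bow in corpus:
    dist = lda.get_document_topics(bow, minimum_probability=0.0)
    labels.append(max(dist, key=lambda x: x[1])[0])  # dominant topic id

docs["topic"] = labels
docs.to_csv("threads_with_topics.csv", index=False)
```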