
today-AI-learned

Hello reddit! I'm the semi-autonomous bot u/possible_urban_king

TL;DR: I was created to machine-learn reddit's r/today-I-learned (TIL) subreddit for new and interesting things. If karma/upvotes measure success, I passed the Turing test.


Press & Presentations


BuzzFeed News : Meet The Man Who Gamed Reddit With A Bot

H&&T : Round 20: Severe Municipal Jazz, May 11, 2015, presentation link

Data Science DC : Lightning Talks! (IV), September 29th, 2015, presentation link


Description

From the author, Travis Hoppe

It is an exciting time right now if you're interested in Machine Learning. With modest effort, anyone with an idea can transform it into a working algorithm. I've been a fan of the subreddit r/today-I-learned and I always found it interesting that top posts would build upon my current knowledge and append a new factoid. In contrast to traditional machine learning tasks such as image recognition or time-series prediction, the concept of an interesting post is vague and undefined. This makes it an exciting topic to study!

The metric for a successful post on reddit is the upvote. These votes are an aggregated poll over the reddit vox populi, and in a limited sense they constitute tests for intelligence. In the TIL subreddit especially, this requires higher-order cognitive skills from Bloom's taxonomy such as Knowledge, Synthesis, and Evaluation. If a machine were to act like a (human) redditor, it would have to emulate these submissions with new and novel posts.

In this context, u/possible_urban_king passes the Turing test. Over the last three months I've been running an experiment, posting about 50 submissions to TIL. The bot's posts have made it to the front page multiple times and the majority of posts are well-received (see results).

The bot was trained over a selection of previously successful TIL posts (see methods) that used Wikipedia as a source. Classification worked well, sometimes too well. I found that media characters (books, movies, etc...) were disproportionately tagged as interesting. These characters would be interesting too, if only they were real people! Additionally, sections in Wikipedia that were salacious or required a [Citation Needed] were often removed by the time they were to be posted.

  • Semi-autonomous?

It turns out that writing the title of a post is really hard, and ultimately I decided that this was outside the scope of the experiment. For all of the posts, I wrote the title and submitted it by hand. I was, however, limited to using the information taken from the paragraph marked by the bot.

  • Which algorithm/classifier?

Extremely Randomized Trees.

  • Why the name possible_urban_king?

It's a colorless green idea.


Results

Upvotes | Post
4726 TIL The Founder Of Japans Mcdonalds Stated
4123 TIL Mike Kurtz An American Burglar Found Out That
2899 TIL A Woman That Reported 100 Incidents Of
1551 TIL During The Sentencing Of His War Crimes Trial
1144 TIL That Art Spiegelman The Creator Of Maus A
640 TIL That Once Officially Labeled As Retarded
498 TIL Before World War Ii It Was Very Rare For
142 TIL That A Study Showed Those With A Distressed
135 TIL Frankie Fraser A Notorious English Gangster
68 TIL Rafael Quintero A Mexican Drug Trafficker
55 TIL The Summer Of Shark Refers To The Medias
49 TIL The Indian Head Eagle Coin Minted In America
42 TIL There Is A 1 Million Dollar Prize For
42 TIL A Murder Victim Was Dismembered So Precisely
40 TIL Daigo Fukuryu Maru A Japanese Fishing Boat
38 TIL It Was 1883 When Kerckhoff Laid Out The
38 TIL An Overcrowded Trailer Carrying 70 People To
37 TIL Machon Ayalon Was A Secret Underground Bullet
36 TIL That Joe Pullen An Africanamerican Tenant
29 TIL Peter Fat Pete Chiodo A Capo In The Lucchese
29 TIL Pinochets Government In Chile 19731990 Had A
24 TIL During Wwi The British Forbade Incendiary
24 TIL That Even Professional Herbalists Avoid The
21 TIL Hm Prison Liverpool Charges Prisoners To
17 TIL Women In Norway That Fraternized With German
16 TIL The Saab 96 Engine Was Tested Under Extreme
15 TIL That The Male Clouded Leopard Is Extremely
15 TIL The Tactic Of Marching Fire Where Rounds Are
12 TIL Captain James Cook Was Killed While
10 TIL Chrysomya Rufifacies Are Usually The First
10 TIL Oskar Daubmann Was A Con Man Who Convinced
9 TIL Captain Strong Is A Dc Clone Of Popeye Except
8 TIL Peter Sawyer Is Credited As The First
7 TIL The Stock Expression Thats A Joke Son Came
6 TIL In The Summer Of 2011 Three Enforcers Ice
6 TIL Morality Follows In The Wake Of Malt Liquors
5 TIL Frances Parker A British Suffragette Was
4 TIL Of The Rogue Elephant Of Aberdare Forest An
4 TIL Nasenbluten A Band Credited For Pioneering
3 TIL Sulfa The First Effective Antibiotic
2 TIL Fiddlin John Carson An American Oldtime
2 TIL While Investigating The Phenomena Of Entombed
2 TIL There Is A Hazemaking Compound That Designers
2 TIL During Kobe Bryants Sexual Assault Case It
1 TIL Of The Worst Deal Made In The Dotcom Era The
1 TIL Primate Experiments At Cambridge Incorrectly
1 TIL The Mushroom Poisonous Mushroom Hapalopilus
1 TIL Sahar Gul Was An Afghan Teenager Who Was
1 TIL That Up Until 1996 Japan Had A Law To Stop
0 TIL Prior To The Commencement Of An Elimination
0 TIL Former Congressman Cleo Fields Achieved
0 TIL There Is A Canadian Bill Called The Blood
0 TIL There Was A NC Sheriff That Dressed In

Methods

In the interests of scientific reproducibility, all of the code used in the experiment is hosted in this project. If you'd like to repeat the experiment yourself, however, it will require a bit of tinkering to get it working with your system. A zipped sqlite3 database of the raw paragraphs marked as interesting can be found in db/report.db.bz2. Feel free to fork and do whatever you like with this repo as long as you follow the CC Attribution 3.0 license.

Data collection

Supervised machine learning requires a massive tagged collection of high-quality data to be effective. Fortunately, past submissions to r/TIL provide just that. Redditors have carefully curated a selection of posts that they collectively find interesting through their voting system. We can filter these posts to just those that point to Wikipedia as a source. This way, the source of each post uses a somewhat standardized language and grammar.

Initially I started with the top 1000 posts of all time (due to an API restriction in reddit's search) using praw. Ultimately, however, I extended that to all posts with a score > 1000 in the years 2013 and 2014 (resulting in about 5000 quality TIL posts) using an alternate database.
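
To make the collection step concrete, here is a minimal sketch using the current praw API; the client credentials are placeholders and the original run used an older praw interface, so treat this as an approximation rather than the repo's exact code.

    import praw

    # Placeholder credentials; praw needs a registered reddit application.
    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        user_agent="TIL-collector by u/your_username",
    )

    wiki_posts = []
    # reddit's listing API caps the "top" listing at roughly 1000 posts.
    for post in reddit.subreddit("todayilearned").top(time_filter="all", limit=1000):
        # Keep only submissions that cite Wikipedia as their source.
        if "wikipedia.org" in post.url:
            wiki_posts.append({
                "score": post.score,
                "title": post.title,
                "url": post.url,
                "created_utc": post.created_utc,
            })

    print(len(wiki_posts), "Wikipedia-sourced TIL posts collected")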

From here it is relatively easy to download a pared-down version of the wiki page linked to by the reddit post.
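
The repo has its own download code; as a rough sketch of the idea, the MediaWiki API can return a plain-text extract of an article given its title (the URL-to-title parsing below is a simplification):

    import requests
    from urllib.parse import unquote

    def fetch_plaintext(wiki_url):
        # e.g. https://en.wikipedia.org/wiki/Daigo_Fukuryu_Maru -> "Daigo_Fukuryu_Maru"
        title = unquote(wiki_url.rsplit("/wiki/", 1)[-1])
        params = {
            "action": "query",
            "prop": "extracts",
            "explaintext": 1,   # plain text rather than HTML
            "format": "json",
            "titles": title,
        }
        r = requests.get("https://en.wikipedia.org/w/api.php", params=params)
        pages = r.json()["query"]["pages"]
        # The API keys results by an internal page id; take the single entry.
        return next(iter(pages.values())).get("extract", "")

    text = fetch_plaintext("https://en.wikipedia.org/wiki/Daigo_Fukuryu_Maru")
    paragraphs = [p for p in text.split("\n") if p.strip()]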

Data wrangling

Somehow, we have to link the pithy one-line TIL title to the correct paragraph in the Wikipedia article. This is a non-trivial task, as simple word frequencies are not enough. Ultimately I settled on a sort of "word-entropy": each paragraph was stripped to its unique words, and these sets formed a frequency vector for each paragraph. These vectors were normalized so that the words unique to a paragraph carried more weight. Then we took the title of the TIL post and compared it to the vector of each paragraph, settling on the paragraph with the closest match. This turns out to work surprisingly well.
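
The exact weighting lives in the repo's wrangling code; a minimal sketch of the idea, with a simplified normalization of my own, might look like this:

    from collections import Counter

    def best_paragraph(title, paragraphs):
        # Document frequency: how many paragraphs each word appears in.
        para_words = [set(p.lower().split()) for p in paragraphs]
        df = Counter()
        for words in para_words:
            df.update(words)

        title_words = set(title.lower().split())
        best_idx, best_score = None, 0.0
        for idx, words in enumerate(para_words):
            # Words shared with the title count more when they are rare overall.
            score = sum(1.0 / df[w] for w in words & title_words)
            if score > best_score:
                best_idx, best_score = idx, score
        return best_idx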

Additionally, I saved the non-matching paragraphs as useful negative training examples.

The next step was to prep the Wikipedia corpus. Using a full XML dump of Wikipedia (not provided; parsed with bs4), I tokenized and stemmed each paragraph of text for each article. This uses nltk for the word tokenization and stop words, and the porter2 stemmer from the aptly named package stemming.
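
As a sketch of that preprocessing on a single paragraph (the real pipeline streams the whole XML dump, and this snippet assumes the nltk 'punkt' and 'stopwords' data are installed):

    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from stemming.porter2 import stem

    STOP = set(stopwords.words("english"))

    def process_paragraph(text):
        # Lowercase, tokenize, drop stop words and punctuation, then stem.
        tokens = word_tokenize(text.lower())
        return [stem(t) for t in tokens if t.isalpha() and t not in STOP]

    print(process_paragraph("The Daigo Fukuryu Maru was a Japanese fishing boat."))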

This creates a rather massive SQLite database with each paragraph and the associated meta-data (like title, paragraph number, word-entropy, ...). Since there are many millions of assorted paragraphs (and I assume very few of them are interesting), I am going to use a random sampling of some of these as True Negatives in my machine learning.
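
As an illustration only (this schema is hypothetical and not the layout of db/report.db), the corpus table might resemble:

    import sqlite3

    conn = sqlite3.connect("wikipedia_corpus.db")  # hypothetical file name
    conn.execute("""
        CREATE TABLE IF NOT EXISTS paragraphs (
            article_title  TEXT,
            paragraph_idx  INTEGER,
            word_entropy   REAL,
            tokens         TEXT    -- space-joined stemmed tokens
        )""")
    conn.commit()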

Machine Learning

Initially, I experimented with a simple word frequency as my feature vector. While this works for toy problems, the corpus of Wikipedia needed a smarter way to condense down the data. Fortunately, a neat textual feature generator, Word2Vec (developed by Google) is available in gensim.

Using Word2Vec requires two complete passes over the data, though it allows you to use an iterator, keeping the memory requirements rather small.
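
A sketch with the current gensim API (parameter names such as vector_size are newer than what the original run used, and the corpus file name is hypothetical):

    from gensim.models import Word2Vec

    class ParagraphCorpus:
        """Stream stemmed paragraphs from disk so little needs to sit in memory."""
        def __init__(self, filename):
            self.filename = filename
        def __iter__(self):
            with open(self.filename) as f:
                for line in f:
                    yield line.split()

    corpus = ParagraphCorpus("stemmed_paragraphs.txt")
    model = Word2Vec(vector_size=100, min_count=5, workers=4)
    model.build_vocab(corpus)   # first pass: build the vocabulary
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)  # second pass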

Here, perhaps, lies the most contentious part of the project: the construction of the classifier. In the end, I settled on the Extremely Randomized Trees implementation in scikit-learn. This classifier, while fairly poor at detecting new true positives (about 10%), was extremely proficient at marking the true negatives. Since the assumption is that most of Wikipedia is, in fact, quite boring, this helps narrow down the results immensely.

Training classifier
Test Accuracy: 0.878
Test Accuracy on TP: 0.116
Test Accuracy on TN: 0.998
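
A minimal sketch of this training step with scikit-learn's ExtraTreesClassifier; the random placeholder features stand in for the Word2Vec paragraph vectors, and the hyperparameters are not necessarily those of the original run:

    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.model_selection import train_test_split

    # Placeholder data: in the real pipeline X holds the paragraph feature vectors
    # and y is 1 for matched TIL paragraphs, 0 for sampled negative paragraphs.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 100))
    y = rng.integers(0, 2, size=5000)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    clf = ExtraTreesClassifier(n_estimators=200, n_jobs=-1)
    clf.fit(X_train, y_train)

    pred = clf.predict(X_test)
    print("Test Accuracy:", (pred == y_test).mean())
    print("Test Accuracy on TP:", (pred[y_test == 1] == 1).mean())
    print("Test Accuracy on TN:", (pred[y_test == 0] == 0).mean())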

With the classifier trained, the next step is to score each and every paragraph in Wikipedia. The classifier marks about 6 per 10,000 as potential candidates.

With the positives marked, we need to present the potentially interesting things in a human-readable format! The report step builds a new database that contains only the positive entries and the associated Wikipedia text from the original source.

Nobody likes a repost (unless it's better, or more aptly timed...), so we need to find out what has already been posted to reddit. To do so, we need a proper search name for the Wikipedia article. The module mediawiki-utils can do this, but stupidly requires python3. Thus the cross-reference program makes a system call to properly encode the name as a search query for reddit. We then take the top search result (if one exists) and store it; this info will serve as the criterion for a post/repost.
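
A sketch of the repost check using praw's subreddit search (reusing the reddit client from the earlier collection sketch, rather than the repo's system-call approach):

    def already_posted(reddit, article_name):
        # Search r/todayilearned for the article name and keep the top result, if any.
        results = reddit.subreddit("todayilearned").search(article_name, sort="top", limit=1)
        for post in results:
            return {"title": post.title, "score": post.score, "url": post.url}
        return None   # no prior submission found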

With the potential TIL candidates identified, let's find the best time to post! Note that we are going to posit that post time has a causal relationship with the ultimate score. Since reddit is dynamic and viewership depends on a steady stream of upvotes, this should be a reasonable assumption. Going back over our training set, we can map the distribution of posting times for a r/TIL post:

It seems like the sweet spot for a submission is between 9 AM and 11 AM!
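
A sketch of how that distribution can be tallied from the collected submissions (using the created_utc field stored in the earlier collection sketch; timezone handling is simplified):

    from collections import Counter
    from datetime import datetime, timezone

    def posting_hours(posts):
        # Each reddit submission carries a UTC creation timestamp (created_utc).
        hours = Counter()
        for p in posts:
            hour = datetime.fromtimestamp(p["created_utc"], tz=timezone.utc).hour
            hours[hour] += 1
        return hours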

What about the bottom r/TIL posts, those that had a score of < 1000? Considering only the ones we found with our algorithm, the posting time is dramatically different:

Since we are going to have a few false positives, I set up a simple script to help determine quality TILs. A random unlabeled TIL that hasn't been posted already is pulled from the database and opened both on screen and in the browser, to quickly determine if it is "something worth learning". The script shows both the tagged interesting paragraph and the corresponding Wikipedia page. A simple prompt allows you to mark an item to post later.
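
A sketch of such a review loop (the table and column names here are hypothetical, not the schema of db/report.db):

    import sqlite3
    import webbrowser

    conn = sqlite3.connect("report.db")
    row = conn.execute(
        "SELECT rowid, title, paragraph, wiki_url FROM candidates "
        "WHERE posted = 0 ORDER BY RANDOM() LIMIT 1"
    ).fetchone()

    rowid, title, paragraph, wiki_url = row
    print(title)
    print(paragraph)
    webbrowser.open(wiki_url)   # show the full article next to the tagged paragraph

    if input("Mark this item to post later? [y/N] ").strip().lower() == "y":
        conn.execute("UPDATE candidates SET marked = 1 WHERE rowid = ?", (rowid,))
        conn.commit()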


License

CC Attribution 3.0.
