Functional system status
rocket-ron committed Aug 22, 2016
1 parent f1ff14c commit 772d8bd
Showing 7 changed files with 307 additions and 141 deletions.
24 changes: 23 additions & 1 deletion README.md
@@ -19,14 +19,30 @@ Build and deploy a system in Python for sentiment analysis of public, geolocated

Set up a simple consumer of the Twitter Streaming API to get a stream of all public, geolocated tweets (filter out any post without geo). Tweepy (http://tweepy.readthedocs.org/en/v3.5.0/streaming_how_to.html) is a good library for consuming the streaming API. These tweets should be dispatched to your sentiment analysis workers through a message broker like RabbitMQ, Redis, or Kafka.

Kafka was chosen because it is easy to set up on AWS EC2. A Vagrant startup file and an Ansible playbook that configures ZooKeeper and Kafka create and start the Kafka server.

The Twitter stream processor listens to the Twitter Streaming API and places tweets that have geolocation data and are in English onto the Kafka topic.
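
The filtering rule described above can be sketched as a small predicate. This is a minimal illustration, not the code in this repository: the function name `should_forward` is hypothetical, and it assumes each tweet arrives as a decoded JSON dictionary with the standard `coordinates` (GeoJSON point or null) and `lang` fields.

```python
from typing import Any, Mapping, Optional


def should_forward(tweet: Mapping[str, Any]) -> bool:
    """Return True for tweets that carry a point location and are tagged English.

    Hypothetical helper: the field names follow the Twitter tweet object,
    but the function itself is an assumption made for illustration.
    """
    coordinates: Optional[Mapping[str, Any]] = tweet.get("coordinates")
    # Tweets without geolocation have "coordinates" set to null (None).
    has_geo = bool(coordinates) and coordinates.get("type") == "Point"
    return has_geo and tweet.get("lang") == "en"


geolocated = {
    "text": "hello",
    "lang": "en",
    "coordinates": {"type": "Point", "coordinates": [-75.14311344, 40.05701716]},
}
no_geo = {"text": "hello", "lang": "en", "coordinates": None}
```

A stream listener would call this predicate on each incoming tweet and hand only the matching ones to the Kafka producer.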

### Sentiment Analysis Pipeline

Set up a worker that will consume tweets from your message broker and perform sentiment analysis using TextBlob (https://textblob.readthedocs.org/en/dev/quickstart.html#sentiment-analysis) or a similar library. We recommend using a task queueing system such as Celery (http://www.celeryproject.org/) or RQ (http://python-rq.org/) though you can also use the message broker directly. This component should be designed to scale to multiple machines easily and tolerate instance failures. The results of sentiment analysis should be stored in a MongoDB collection for retrieval from the API. See Sentiment API doc below to figure out which fields will need to be persisted.

An AWS ElastiCache cluster running Redis was chosen because it is easy to create and manage, and the Python RQ library was chosen to work with it because of its simplicity.

A Python command line program was created to consume messages from the Kafka topic (tweets) and queue them onto an RQ queue for sentiment processing with TextBlob.

Another Python program implements the sentiment processor worker that dequeues the processing job, executes the sentiment analysis and places the results into MongoDB. The results consist of the sentiment scores, tweet text and location data.
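
As a rough sketch of the document the worker stores, the helper below combines the tweet text, the two sentiment scores, and the location. The function name and the field names `text`, `polarity`, and `subjectivity` are assumptions for illustration; only the `location` field name comes from this README. With TextBlob, the two scores would come from `TextBlob(text).sentiment`.

```python
from typing import Any, Dict, Mapping


def build_sentiment_doc(
    tweet: Mapping[str, Any], polarity: float, subjectivity: float
) -> Dict[str, Any]:
    """Assemble the document to insert into MongoDB (hypothetical schema)."""
    return {
        "text": tweet["text"],
        "polarity": polarity,          # assumed field name
        "subjectivity": subjectivity,  # assumed field name
        # The GeoJSON point is kept as-is so it can be indexed for geo-queries.
        "location": tweet["coordinates"],
    }


sample_tweet = {
    "text": "What a great day",
    "coordinates": {"type": "Point", "coordinates": [-75.14311344, 40.05701716]},
}
doc = build_sentiment_doc(sample_tweet, polarity=0.8, subjectivity=0.75)
```

Keeping the scoring call outside this helper means the document shape can be checked without installing the sentiment library.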

Both of these Python programs are simple, and several copies of each can run on any machine with available cores, which allows quick horizontal scaling. However, more management tooling is needed to run larger sets of parallel processes that queue and dequeue these tasks. Currently each program runs on its own t2.micro instance.

### Database

Set up a MongoDB collection with indexing to support the structure of the API response as presented below.

MongoDB runs on a single AWS EC2 m3.large instance with SSD storage. The instance is deliberately small to keep costs down. A single sentiment collection holds one document per analyzed tweet, containing the tweet text, sentiment scores, and location.
A 2dsphere index is created on the `location` field of the documents to allow use of MongoDB's geo-query functionality. In this way an aggregation query can compute the average, min, and max sentiment scores of the tweets that fall within a given radius of a point.
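
One way such an aggregation could look is sketched below. This is an assumption-laden illustration rather than the query used by this project: the function name `sentiment_pipeline` and the `polarity` field are hypothetical, and the sketch uses `$geoWithin` with `$centerSphere`, which expects the radius in radians (distance divided by the Earth's radius).

```python
from typing import Any, Dict, List

EARTH_RADIUS_KM = 6378.1  # equatorial radius, used to convert km to radians


def sentiment_pipeline(lat: float, lon: float, radius_km: float) -> List[Dict[str, Any]]:
    """Build an aggregation pipeline for sentiment stats around a point."""
    return [
        # Keep only documents whose location lies inside the spherical cap.
        # GeoJSON order is [longitude, latitude].
        {
            "$match": {
                "location": {
                    "$geoWithin": {
                        "$centerSphere": [[lon, lat], radius_km / EARTH_RADIUS_KM]
                    }
                }
            }
        },
        # Collapse all matches into one summary document.
        {
            "$group": {
                "_id": None,
                "average": {"$avg": "$polarity"},  # "polarity" is an assumed field
                "min": {"$min": "$polarity"},
                "max": {"$max": "$polarity"},
                "count": {"$sum": 1},
            }
        },
    ]


pipeline = sentiment_pipeline(lat=40.9, lon=-75.0, radius_km=100)
```

With pymongo, the index would be created with `collection.create_index([("location", "2dsphere")])` and the pipeline passed to `collection.aggregate(pipeline)`.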


### Sentiment API

Design a RESTful API with Flask (http://flask.pocoo.org/), Bottle (http://bottlepy.org/docs/dev/index.html), or a similar library that allows a user to query the average sentiment at a location by providing a latitude, longitude, and radius. The API should provide a JSON response in the following format:
@@ -43,8 +59,14 @@ Design a RESTful API with Flask (http://flask.pocoo.org/), Bottle (http://bottle
"coordinates": [-75.14311344, 40.05701716]
}
}

### Deployment

The project should be hosted in a public repository on GitHub or Bitbucket. The system should be deployed on AWS EC2, Elastic Beanstalk, Heroku, Google App Engine, or a similar service.

The REST API is located at `http://develop.8hum4jfqxp.us-west-1.elasticbeanstalk.com`

A sample query is `http://develop.8hum4jfqxp.us-west-1.elasticbeanstalk.com/sentiment?lat=40.9&lon=-75.0&dist=100`, which looks for tweets within a 100 km radius of the geographical point given by the lat/lon coordinates.
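
For client code, the query URL can be assembled from its three parameters. The helper name `sentiment_url` is hypothetical; the base URL, the `/sentiment` path, and the `lat`, `lon`, and `dist` parameter names are taken from the sample query above.

```python
from urllib.parse import urlencode

BASE_URL = "http://develop.8hum4jfqxp.us-west-1.elasticbeanstalk.com"


def sentiment_url(lat: float, lon: float, dist_km: int) -> str:
    """Return the query URL for sentiment around (lat, lon) within dist_km."""
    query = urlencode({"lat": lat, "lon": lon, "dist": dist_km})
    return f"{BASE_URL}/sentiment?{query}"


url = sentiment_url(40.9, -75.0, 100)
```

The resulting string can be passed to any HTTP client to fetch the JSON response.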



7 changes: 7 additions & 0 deletions rest/.ebignore
@@ -0,0 +1,7 @@
# pycharm project files
.idea/*

# Elastic Beanstalk Files
.elasticbeanstalk/*
!.elasticbeanstalk/*.cfg.yml
!.elasticbeanstalk/*.global.yml
5 changes: 5 additions & 0 deletions rest/.gitignore
@@ -0,0 +1,5 @@

# Elastic Beanstalk Files
.elasticbeanstalk/*
!.elasticbeanstalk/*.cfg.yml
!.elasticbeanstalk/*.global.yml