To deploy the project, follow these steps in order:
- Download/clone the repository to any location on your machine.
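For example (the URL below is a placeholder; use the actual location of the GithubAnalysis repository):
$ git clone <repository-url> GithubAnalysis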
- Open a terminal, go to the project folder "GithubAnalysis", and execute the Python script "prerequisites.py". This script creates the required directory structure.
$ python prerequisites.py
- Execute "prerequisites_data.py" from the project folder "GithubAnalysis". This script downloads all the old data from a file server.
$ python prerequisites_data.py
- Execute "prerequisites_hadoop.py" from the project folder "GithubAnalysis". This script creates the all the required directories at HDFS.
$ python prerequisites_hadoop.py hdfs://some_directory
- Install the Python dependencies using the pip utility:
dateutil, numpy, scipy, pandas, httplib2, MySQL-python, sqlalchemy, simplejson, config, requests, google-api-python-client, scikit-learn, feedparser, beautifulsoup4, lxml, Flask, Flask-Login, PyOpenSSL, pycrypto, oauth2client==1.5.2, nltk (After installing this package, run the command "sudo python -m nltk.downloader -d /usr/share/nltk_data all" to install nltk-data)
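For example, all of the above can be installed in a single pip call (note that "dateutil" is published on PyPI as "python-dateutil"):
$ sudo pip install python-dateutil numpy scipy pandas httplib2 MySQL-python sqlalchemy simplejson config requests google-api-python-client scikit-learn feedparser beautifulsoup4 lxml Flask Flask-Login PyOpenSSL pycrypto oauth2client==1.5.2 nltk
$ sudo python -m nltk.downloader -d /usr/share/nltk_data all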
- Edit system settings as below:
- Add the following entries in /etc/security/limits.conf:
* hard nofile 128000
* soft nofile 128000
root hard nofile 128000
root soft nofile 128000
- Log out of the machine and log back in. Check "ulimit -n"; it should show 128000.
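For example, after logging back in:
$ ulimit -n
128000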
- Crawler prerequisites: our data extractors gather data from GitHub and Google BigQuery. Both of these sources restrict and authenticate access in order to balance the load on their servers, so OAuth tokens must be created before our crawlers can run.
- Creating OAuth tokens for GitHub
- Log in to your GitHub account
- Go to: Settings -> OAuth applications
- Click on Developer applications -> "Register a new application"
- After registering the application, the OAuth tokens "Client ID" and "Client Secret" will be generated.
- Copy these token values to the Post_Processing configuration file: "post_processing_config.py" at "GitHubAnalysis/Techtrends/Codebase/Post_Processing"
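A minimal sketch of what the relevant entries in "post_processing_config.py" might look like (the variable names here are illustrative; keep whatever names the file already defines):
# GitHub OAuth credentials copied from the registered application (illustrative names and placeholder values)
CLIENT_ID = "<your Client ID>"
CLIENT_SECRET = "<your Client Secret>"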
- Creating OAuth tokens for Google BigQuery
- Sign in to Google BigQuery
- Click on "Try it free" and fill your details.
- Click on "Create a project..." by going to "Go to project" on the upper-right bar
- Create the project by filling in the required fields. Note down the "project ID" and "project name" for future reference.
- Go to "Manage all projects" -> "Service accounts" -> "Create service account". Note "Service account ID" and download P12(private key) and create a "Service account"
- Convert the .p12 file to a .pem file and rename it to "google-api-client-key.pem"
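For example, assuming the downloaded key is named "privatekey.p12" (Google-issued .p12 keys use the default export password "notasecret"):
$ openssl pkcs12 -in privatekey.p12 -passin pass:notasecret -nodes -nocerts > google-api-client-key.pem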
- Move the pem file to "GitHubAnalysis/Techtrends/Codebase/Crawlers"
- Also modify the following parameters in the "BigQuery_config.py" file: PROJECT_NUMBER = "project ID" and SERVICE_ACCOUNT_EMAIL = "Service account ID"
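For example, with placeholder values (substitute the project ID and service account ID noted earlier):
PROJECT_NUMBER = "<your project ID>"
SERVICE_ACCOUNT_EMAIL = "<your service account ID>"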
- Run the provided crontab files after modifying the directory locations in them to match your deployment paths
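As an illustration only, a crontab entry of the hypothetical form below would run one of the crawler scripts nightly at 2 AM; the script name and path are placeholders and should be replaced with the values used in the provided crontab files:
$ crontab -e
0 2 * * * python /path/to/GithubAnalysis/Techtrends/Codebase/Crawlers/<crawler_script>.py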