To deploy the project, complete the following steps in order:
- Download or clone the repository to any location on your machine.
- Open a terminal, go to the project folder "GithubAnalysis", and execute the Python script "". This script creates the required directory structure.
$ python
- Execute "" from the project folder "GithubAnalysis". This script downloads all the old data from a file server.
$ python prerequisites_data
- Execute "" from the project folder "GithubAnalysis". This script creates all the required directories on HDFS.
$ python prerequisites_data hdfs://some_directory
- Install the Python dependencies using pip:
dateutil, numpy, scipy, pandas, httplib2, MySQL-python, sqlalchemy, simplejson, config, requests, google-api-python-client, scikit-learn, feedparser, beautifulsoup4, lxml, Flask, Flask-Login, PyOpenSSL, pycrypto, oauth2client==1.5.2, nltk (after installing this package, run the command "sudo python -m nltk.downloader -d /usr/share/nltk_data all" to install the NLTK data)
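After installing, it can help to confirm that every dependency is importable. The following is a minimal sketch, not part of the project: the mapping from package name to import name is an assumption based on how these packages are commonly imported, and `missing_packages` is a hypothetical helper.

```python
import importlib

# Assumed mapping: name from the dependency list -> top-level module name.
IMPORT_NAMES = {
    "dateutil": "dateutil", "numpy": "numpy", "scipy": "scipy",
    "pandas": "pandas", "httplib2": "httplib2", "MySQL-python": "MySQLdb",
    "sqlalchemy": "sqlalchemy", "simplejson": "simplejson",
    "config": "config", "requests": "requests",
    "google-api-python-client": "googleapiclient",
    "scikit-learn": "sklearn", "feedparser": "feedparser",
    "beautifulsoup4": "bs4", "lxml": "lxml", "Flask": "flask",
    "Flask-Login": "flask_login", "PyOpenSSL": "OpenSSL",
    "pycrypto": "Crypto", "oauth2client": "oauth2client", "nltk": "nltk",
}


def missing_packages(import_names=None):
    """Return the package names whose module cannot be imported."""
    if import_names is None:
        import_names = IMPORT_NAMES
    missing = []
    for package, module in sorted(import_names.items()):
        try:
            importlib.import_module(module)
        except Exception:
            # Any failure to import is treated as "not usable".
            missing.append(package)
    return missing


if __name__ == "__main__":
    for package in missing_packages():
        print("missing: " + package)
```

Running the script prints one line per package that still needs attention; no output means every import succeeded.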
- Edit the system settings as follows:
Add the following entries to /etc/security/limits.conf:
* hard nofile 128000
* soft nofile 128000
root hard nofile 128000
root soft nofile 128000
Log out of the machine and log back in. Run "ulimit -n"; it should print 128000.
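The same check can be done from Python, which is convenient when the crawlers are launched from a script. This is a minimal sketch using the standard `resource` module (Unix only); `limits_ok` and `REQUIRED_NOFILE` are hypothetical names introduced for illustration.

```python
import resource

REQUIRED_NOFILE = 128000  # value used in the limits.conf entries


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def limits_ok(soft, hard, required=REQUIRED_NOFILE):
    """Check that both limits are at least `required`."""
    def enough(value):
        # RLIM_INFINITY means "no limit", which always satisfies the check.
        return value == resource.RLIM_INFINITY or value >= required
    return enough(soft) and enough(hard)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print("soft=%s hard=%s ok=%s" % (soft, hard, limits_ok(soft, hard)))
```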
- Crawler prerequisites: Our data extractors gather data from two sources, GitHub and Google BigQuery. Both sources restrict access to authenticated clients in order to balance the load on their servers, so OAuth tokens must be created before the crawlers can run.
Creating OAuth tokens for GitHub
- Log in to your GitHub account
- Go to: Settings -> OAuth applications
- Click on Developer applications -> "Register a new application"
- After you register an application, the OAuth credentials "Client ID" and "Client Secret" are generated.
- Copy these token values to the Post_Processing configuration file: "" at "GitHubAnalysis/Techtrends/Codebase/Post_Processing"
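Before starting the crawlers, the credentials can be sanity-checked against the GitHub API. The sketch below is an illustration only and is not how the project's crawlers authenticate: it assumes GitHub accepts an OAuth application's Client ID and Client Secret as HTTP Basic credentials on the rate-limit endpoint, and `build_rate_limit_request` and the User-Agent value are hypothetical. It is written for Python 3.

```python
import base64
import urllib.request

RATE_LIMIT_URL = "https://api.github.com/rate_limit"


def build_rate_limit_request(client_id, client_secret):
    """Build (but do not send) a request carrying the app credentials."""
    pair = "{}:{}".format(client_id, client_secret).encode("utf-8")
    token = base64.b64encode(pair).decode("ascii")
    request = urllib.request.Request(RATE_LIMIT_URL)
    request.add_header("Authorization", "Basic " + token)
    # Hypothetical User-Agent; GitHub rejects requests without one.
    request.add_header("User-Agent", "GithubAnalysis-setup-check")
    return request
```

Passing the returned request to `urllib.request.urlopen` sends it; an HTTP 401 response would suggest the values were copied incorrectly.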
Creating OAuth tokens for Google BigQuery
- Sign in to Google BigQuery
- Click on "Try it free" and fill in your details.
- Open "Go to project" on the upper-right bar and click on "Create a project..."
- Create the project by entering the required fields. Note down the "project ID" and "project Name" for future reference.
- Go to "Manage all projects" -> "Service accounts" -> "Create service account". Create the service account, note the "Service account ID", and download the P12 (private key) file.
- Convert the P12 file to a PEM file and name the PEM file "google-api-client-key.pem"
- Move the pem file to "GitHubAnalysis/Techtrends/Codebase/Crawlers"
- Also set the following parameters in the "" file: PROJECT_NUMBER = "project id" and SERVICE_ACCOUNT_EMAIL = "Service account ID"
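The P12-to-PEM conversion in the steps above can be scripted with the `openssl` command-line tool. This is a minimal sketch under stated assumptions: `openssl` is on the PATH, the P12 file uses the password "notasecret" (the default Google has used for downloaded service-account keys), and the input name "key.p12" and both helper functions are hypothetical.

```python
import subprocess


def pkcs12_to_pem_command(p12_path,
                          pem_path="google-api-client-key.pem",
                          password="notasecret"):
    """Build the openssl command that extracts the private key as PEM."""
    return [
        "openssl", "pkcs12",
        "-in", p12_path,
        "-nodes",      # write the key unencrypted
        "-nocerts",    # private key only, no certificates
        "-passin", "pass:" + password,  # assumed P12 password
        "-out", pem_path,
    ]


def convert(p12_path, pem_path="google-api-client-key.pem"):
    """Run the conversion; raises CalledProcessError if openssl fails."""
    subprocess.check_call(pkcs12_to_pem_command(p12_path, pem_path))


if __name__ == "__main__":
    # Print the command for a hypothetical input file without running it.
    print(" ".join(pkcs12_to_pem_command("key.p12")))
```

Calling `convert("key.p12")` writes "google-api-client-key.pem" to the current directory, ready to be moved to the Crawlers folder.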
- Modify the directory locations in the crontab files, then run them