This is a Data Engineering project using GitHub Archive data.
This project is about events that happen on GitHub. How many users are currently in the GitHub space? Which repo is contributed to the most? Who has the most commits? At what time of the day or month do users push commits the most?
The main objectives are to:
- Develop a pipeline to collect the GitHub Archive data and process it in batches
- Build a dashboard to visualize the trends
Technologies used:
- Cloud: GCP
- Infrastructure as code (IaC): Terraform
- Workflow orchestration: Prefect
- Data warehouse: BigQuery
- Data lake: Google Cloud Storage
- Batch processing/transformations: dbt Cloud and Spark
- Dashboard: Google Looker Studio
The data pipeline involves the following:
- fetching the data in batches and storing it in GCS (a minimal sketch of this step follows the list)
- preprocessing the data with PySpark and moving it to the data warehouse
- transforming and preparing the data in the warehouse for visualization
- creating dashboards
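The fetch step amounts to downloading hourly GH Archive files and copying them into the data lake. Below is a minimal sketch of that step, assuming the public GH Archive URL layout and the bucket name used later in this README; the real logic lives in code/main.py and adds orchestration around it.

```python
# Hedged sketch of the "fetch to GCS" step, assuming the public GH Archive
# URL layout (https://data.gharchive.org/YYYY-MM-DD-H.json.gz) and a bucket
# named "gharchive_dataset_gcs" (use the bucket created by Terraform).
import requests
from google.cloud import storage


def fetch_hour_to_gcs(year: int, month: int, day: int, hour: int,
                      bucket_name: str = "gharchive_dataset_gcs") -> None:
    # GH Archive publishes one gzipped JSON file per hour
    file_name = f"{year}-{month:02d}-{day:02d}-{hour}.json.gz"
    url = f"https://data.gharchive.org/{file_name}"

    resp = requests.get(url, timeout=60)
    resp.raise_for_status()

    # Copy the raw file into the data lake bucket
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(f"raw/{file_name}")
    blob.upload_from_string(resp.content, content_type="application/gzip")


if __name__ == "__main__":
    fetch_hour_to_gcs(2020, 1, 1, 0)
```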
Click here to see the dashboard.
To set up this project, a GCP account is required. Activate the free trial with its $300 credit; select "Other" when asked what best describes your needs.
For instructions to set up Terraform and the GCP infrastructure, click here.
Log in to the Google Compute Engine instance using SSH. To set up GCP with VS Code, click here.
Note: The following instructions do not use Docker to run the orchestration. To use Docker, click here.
Connect to your VM instance via VS Code and continue. Create a directory for the installation and enter it:
mkdir spark && cd spark
Download OpenJDK 11:
wget https://download.java.net/java/GA/jdk11/13/GPL/openjdk-11.0.1_linux-x64_bin.tar.gz
Extract the file:
tar xzvf openjdk-11.0.1_linux-x64_bin.tar.gz
Download Spark:
wget https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
Extract the file:
tar xzfv spark-3.3.2-bin-hadoop3.tgz
Add Java and Spark to your PATH by opening ~/.bashrc:
nano ~/.bashrc
Scroll to the bottom and add the following:
export JAVA_HOME="${HOME}/spark/jdk-11.0.1"
export PATH="${JAVA_HOME}/bin:${PATH}"
export SPARK_HOME="${HOME}/spark/spark-3.3.2-bin-hadoop3"
export PATH="${SPARK_HOME}/bin:${PATH}"
export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.5-src.zip:$PYTHONPATH"
After exiting the editor, log out and log back in to apply the changes, or run source ~/.bashrc.
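Optionally, you can confirm that the Java and Spark paths are picked up with a short PySpark sanity check (not part of the pipeline itself):

```python
# Sanity check that JAVA_HOME, SPARK_HOME and PYTHONPATH are set correctly.
# Save as check_spark.py and run with: python3 check_spark.py
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("install-check")
    .getOrCreate()
)
print("Spark version:", spark.version)  # expect 3.3.2
spark.stop()
```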
Go to this repo, fork it, and clone your fork.
We will be running Prefect locally here.
- Go back to the terminal on the VM and install the pip package manager:
sudo apt update && sudo apt install python3-pip
- Change directory into the cloned repo and install the requirements needed to run Prefect:
pip install -r requirements.txt
- Reboot the VM instance to apply the installation:
sudo reboot now
- Start the Prefect Orion server:
prefect orion start
- Open another terminal session and point the Prefect CLI at the local API:
prefect config set PREFECT_API_URL=http://127.0.0.1:4200/api
- Register the block types used in this project:
prefect block register -m prefect_gcp
prefect block register -m prefect_dbt
Configuring Prefect blocks: blocks can be configured with scripts or through the Prefect UI. Here, the blocks will be configured via the UI; if you prefer to configure them via scripts, use the scripts in the blocks folder (a hedged sketch follows this list).
- Click the + to configure a block.
- Go to GCS Bucket.
- Name the block gharchive.
- Use the GCP bucket name created during the Terraform setup, gharchive_dataset_gcs, as the name of the bucket.
- Scroll down to GCP Credentials and click Add + to add GCP credentials.
- Name the credentials block gcp-creds.
- Copy the contents of the API key file downloaded when setting up GCP into Service Account Info (Optional) and save it.
- The credentials will be added to the GCS Bucket block automatically. Click Save.
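For reference, here is a hedged sketch of creating the same blocks from a script with prefect_gcp. The block names (gharchive, gcp-creds) and bucket name mirror the steps above; the credentials path is an assumption (the credentials.json file created in the next steps), and the actual scripts in the blocks folder may differ.

```python
# Hedged sketch of creating the GCS Bucket and GCP Credentials blocks from
# a script instead of the Prefect UI.
import json

from prefect_gcp import GcpCredentials
from prefect_gcp.cloud_storage import GcsBucket

# Service-account key downloaded during the GCP setup (assumed path)
with open("credentials/credentials.json") as f:
    service_account_info = json.load(f)

creds = GcpCredentials(service_account_info=service_account_info)
creds.save("gcp-creds", overwrite=True)

bucket = GcsBucket(
    bucket="gharchive_dataset_gcs",  # the bucket created by Terraform
    gcp_credentials=creds,
)
bucket.save("gharchive", overwrite=True)
```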
- Set up dbt Cloud here.
- Set up the dbt Cloud credentials block on Prefect (see the sketch after this list).
- Go to code/dbt_run.py and input your job ID by replacing the value of the job_id variable.
- Go to the credentials folder and create a credentials.json file.
- Copy the Google credentials details into it and save it.
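For reference, a hedged sketch of storing dbt Cloud credentials as a Prefect block with prefect_dbt and triggering a job by ID. The block name dbt-creds, the account ID, and the API key are placeholders, and code/dbt_run.py may handle this differently.

```python
# Hedged sketch: save dbt Cloud credentials as a Prefect block and trigger
# a dbt Cloud job by its ID. All ids/keys below are placeholders.
from prefect_dbt.cloud import DbtCloudCredentials
from prefect_dbt.cloud.jobs import trigger_dbt_cloud_job_run

# Save the credentials block once
DbtCloudCredentials(
    api_key="<dbt-cloud-api-key>",   # placeholder
    account_id=123456,               # placeholder: your dbt Cloud account ID
).save("dbt-creds", overwrite=True)

# Later, inside a flow, a job can be triggered like this:
# trigger_dbt_cloud_job_run(
#     dbt_cloud_credentials=DbtCloudCredentials.load("dbt-creds"),
#     job_id=123456,  # placeholder: the job_id you set in code/dbt_run.py
# )
```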
- Go back to the terminal to configure the deployment.
- Change directory into the cloned repo folder and run the following:
prefect deployment build code/main.py:pipeline -n "pipeline flow" -o "pipeline flow" --apply
- The -n parameter sets the name of the deployment in Prefect.
- The -o parameter sets the name of the output deployment file.
- The --apply flag applies the deployment to Prefect.
- Start a Prefect agent to pick up runs from the default work queue:
prefect agent start -q 'default'
- Visit the Prefect UI to run the deployment.
- Go to the Deployments tab; the newly created deployment should appear there.
- Click Run to create a custom run (a sketch of triggering the same run from Python follows this list). For test purposes,
- set the year parameter to a single year, e.g. 2020;
- set the month parameter to a single month in a list, e.g. [1], which means January;
- set the day parameter to any day of the month, e.g. 1, which means the first day. Note: if the day parameter is not set, the flow will run for every day of the chosen month.
- The Prefect flow run can be monitored from the terminal session running the Prefect agent.
- Optional: you can forward the Spark UI port (4040) from VS Code to view your Spark jobs.
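If you prefer to trigger the same test run from Python instead of the UI, recent Prefect 2 releases ship a run_deployment helper. The deployment name below follows the build command above; the parameter names are assumptions based on the UI description, so check the flow signature in code/main.py.

```python
# Hedged sketch: trigger the deployment from Python instead of the Prefect UI.
# Assumes the deployment is named "pipeline/pipeline flow" and that the flow
# takes year/month/day parameters as described above; requires the Prefect
# agent started earlier to be running.
from prefect.deployments import run_deployment

flow_run = run_deployment(
    name="pipeline/pipeline flow",
    parameters={"year": 2020, "month": [1], "day": 1},
)
print(flow_run.id)
```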
- Visit Google Looker Studio.
- Create a data source.
- Select BigQuery as the source.
- Select your project ID.
- Select the production dataset.
- Select the gh table.
- Click Connect in the top right corner.
- Have fun creating any dashboard of your choice.
When you are done, don't forget to tear down the infrastructure with terraform destroy.
If you have any questions, feel free to reach me via Mail or via Twitter.