
SANKHA1/ETL-Pipeline-creation-for-populating-database


Description


This project looks at data modelling for a fictitious music startup, Sparkify, applying a STAR schema to the ingested data to simplify the queries that answer the business questions the product owner may have. This repo provides the ETL pipeline that populates the sparkifydb database.

  • The purpose of this database is to enable Sparkify to answer business questions about its users, the songs they listen to, and the artists of those songs, using the data it collects in log and song files. The database provides a consistent and reliable place to store this data.

  • This data will help Sparkify reach its analytical goals, for example identifying the most popular songs or the times of day with the highest traffic; a query sketch follows this list.
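To make that concrete, here is a hedged sketch of the kind of query the schema is meant to support. The songplays fact table and songs dimension are described in the next section; the column names used here are assumptions, not taken from the repository.

```python
# Illustrative only: a "most played songs" query against the star schema
# described below. Table and column names are assumptions.
MOST_PLAYED_SONGS = """
SELECT s.title, COUNT(*) AS plays
FROM songplays sp
JOIN songs s ON s.song_id = sp.song_id
GROUP BY s.title
ORDER BY plays DESC
LIMIT 10;
"""
```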

Database Design and ETL Pipeline


  • For the schema design, a STAR schema is used because it simplifies queries and enables fast aggregations of the data.

Schema
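Below is a minimal sketch of what that schema can look like as SQL DDL kept in Python string constants. The column names and types are assumptions based on the table descriptions in this README; the repository's own table-creation code is authoritative.

```python
# Minimal STAR-schema sketch: one fact table plus one dimension table.
# Column names and types are assumptions; the remaining dimensions
# (users, artists, time) follow the same pattern.

CREATE_SONGPLAYS = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

CREATE_SONGS = """
CREATE TABLE IF NOT EXISTS songs (
    song_id   VARCHAR PRIMARY KEY,
    title     VARCHAR,
    artist_id VARCHAR,
    year      INT,
    duration  FLOAT
);
"""
```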

  • For the ETL pipeline, Python is used because it provides libraries such as pandas that simplify data manipulation, and it can connect to a Postgres database.

  • There are two types of data involved: song data and log data. Song data contains information about songs and artists, which we extract and load into the songs and artists dimension tables.

  • Log data captures each user session. From log data, we extract and load into the time and users dimension tables and the songplays fact table. A sketch of both processing steps follows this list.
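The sketch below illustrates those two extraction steps with pandas and a psycopg2 cursor. The function names, file layout (one JSON record per line), and column names are assumptions for illustration; the repository's etl.py is authoritative.

```python
import pandas as pd

def process_song_file(cur, filepath):
    """One song file -> songs and artists dimension tables (illustrative)."""
    df = pd.read_json(filepath, lines=True)
    song = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()
    cur.execute(
        "INSERT INTO songs VALUES (%s, %s, %s, %s, %s) "
        "ON CONFLICT (song_id) DO NOTHING;",
        song,
    )
    artist = df[["artist_id", "artist_name", "artist_location",
                 "artist_latitude", "artist_longitude"]].values[0].tolist()
    cur.execute(
        "INSERT INTO artists VALUES (%s, %s, %s, %s, %s) "
        "ON CONFLICT (artist_id) DO NOTHING;",
        artist,
    )

def process_log_file(cur, filepath):
    """One log file -> time and users dimensions plus the songplays fact table."""
    df = pd.read_json(filepath, lines=True)
    df = df[df["page"] == "NextSong"]          # keep only song-play events
    t = pd.to_datetime(df["ts"], unit="ms")    # millisecond epoch -> timestamps
    for ts in t:
        cur.execute(
            "INSERT INTO time (start_time, hour, day, week, month, year, weekday) "
            "VALUES (%s, %s, %s, %s, %s, %s, %s);",
            (ts, ts.hour, ts.day, ts.isocalendar()[1], ts.month, ts.year, ts.dayofweek),
        )
    # The users and songplays inserts follow the same execute-per-row pattern.
```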

Running the ETL Pipeline


  • First, run create_tables.py to create the data tables using the schema design described above. If the tables were created previously, they will be dropped and recreated; a sketch of this drop-and-recreate flow follows this list.

  • Next, run etl.py to populate the tables that were just created.
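For orientation, here is a minimal sketch of a drop-and-recreate flow with psycopg2, assuming a local Postgres instance and the sparkifydb database named above. The connection string and statement list are placeholders, not the repository's actual configuration.

```python
import psycopg2

# Placeholder DROP statements; create_tables.py defines the authoritative list.
DROP_STATEMENTS = [
    "DROP TABLE IF EXISTS songplays;",
    "DROP TABLE IF EXISTS users;",
    "DROP TABLE IF EXISTS songs;",
    "DROP TABLE IF EXISTS artists;",
    "DROP TABLE IF EXISTS time;",
]

def recreate_tables(create_statements):
    """Drop any existing tables, then create them from the given DDL strings."""
    conn = psycopg2.connect(
        "host=127.0.0.1 dbname=sparkifydb user=student password=student"  # placeholder credentials
    )
    conn.autocommit = True
    cur = conn.cursor()
    for stmt in DROP_STATEMENTS + list(create_statements):
        cur.execute(stmt)
    conn.close()

if __name__ == "__main__":
    recreate_tables([])  # pass the CREATE TABLE statements for the schema sketched above
```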
