
Data Engineering Project / YouTube Data Analysis

List of Contents:

  1. Description of the Problem
  2. Objective
  3. Technologies
  4. Data Architecture
  5. Data Description
    1. CSV Files
    2. JSON Files
  6. AWS Setup
    1. Sign up for an account
    2. Configure the AWS CLI
  7. AWS S3 (Raw File Bucket)
  8. AWS Glue
  9. AWS Athena
  10. AWS Lambda

1. Description of the Problem:

Today, analysts use historical data from platforms like YouTube to build predictive models for better user engagement and smarter decision-making. But these models face challenges due to the changing nature of online video consumption, which can be influenced by trends, world events, and audience preferences.

Moreover, when dealing with global data, complexities arise due to cultural and regional variations in content consumption. Hence, it's important to combine data analysis with current context and trends to drive better business decisions.

2. Objective:

Historical YouTube data is a rich resource for content creators, marketers, and analysts, helping them understand audience habits, predict content demand, and formulate content strategies.

These datasets can also optimize content recommendations and planning, and guide strategic decisions for platforms like YouTube. Plus, they provide academic and research opportunities, allowing exploration of viewer behavior and the influence of video platforms on global media consumption.

A typical YouTube dataset, available as .csv files or via APIs, includes various engagement metrics but has limitations. It captures historical trends but may not predict sudden changes or account for local contexts. Therefore, it's essential to consider these limitations when using this data. Moreover, we must handle this data responsibly, ensuring user anonymity and data privacy. This project aims to analyze YouTube data using AWS and visualize the results, helping uncover patterns and insights into user behavior and content performance on YouTube.

3. Technologies:

In this project, we will leverage several AWS services:

  • AWS S3
  • AWS IAM
  • AWS Glue
  • AWS Lambda
  • AWS Athena

4. Data Architecture:

5. Data Description:

5.1 CSV Files:

These files contain historical trending-video data from YouTube for different countries. They include video IDs, titles, channel names, trending dates, category IDs, view counts, likes, dislikes, comment counts, tags, and more. The data helps in understanding viewer behavior and video performance trends in the respective regions.

  • CAvideos.csv: Data for Canada
  • DEvideos.csv: Data for Germany
  • FRvideos.csv: Data for France
  • GBvideos.csv: Data for the United Kingdom
  • INvideos.csv: Data for India
  • JPvideos.csv: Data for Japan
  • KRvideos.csv: Data for South Korea
  • MXvideos.csv: Data for Mexico
  • RUvideos.csv: Data for Russia
  • USvideos.csv: Data for the United States

5.2 JSON Files:

These files map the category IDs to their corresponding categories for videos from different countries, assisting in understanding popular content categories in respective regions.

  • DE_category_id.json: Category data for Germany
  • FR_category_id.json: Category data for France
  • GB_category_id.json: Category data for the United Kingdom
  • IN_category_id.json: Category data for India
  • JP_category_id.json: Category data for Japan
  • KR_category_id.json: Category data for South Korea
  • MX_category_id.json: Category data for Mexico
  • RU_category_id.json: Category data for Russia
  • US_category_id.json: Category data for the United States

The data was collected from Kaggle: https://www.kaggle.com/datasets/datasnaek/youtube-new
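To see how the two file types relate, here is a minimal sketch that joins a CSV to its country's category JSON with pandas, assuming the standard structure of the Kaggle files and hypothetical local paths:

```python
import json
import pandas as pd

# Hypothetical local paths to the downloaded Kaggle files.
videos = pd.read_csv("data/USvideos.csv")

with open("data/US_category_id.json") as f:
    categories = json.load(f)

# Each entry in "items" maps a category id to a title (e.g. "10" -> "Music").
id_to_name = {int(item["id"]): item["snippet"]["title"] for item in categories["items"]}

videos["category_name"] = videos["category_id"].map(id_to_name)
print(videos[["title", "category_id", "category_name"]].head())
```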

6. AWS Setup:

In this first section, we will create the account and connect the AWS CLI to our local machine.

6.1 Sign up for an account:

  1. Sign up for AWS using the root account (use your email)
  2. Go to AWS IAM and create a new user there; this user will be used for day-to-day work, since the root account is too risky to use directly (it holds all of the account's sensitive information)
  3. After you create the user, download the CSV file with its credentials and log out of the account.
  4. Log in again using your new IAM user (the sign-in details are included in the CSV file)

6.2 Configure the AWS CLI:

  1. Installation differs between operating systems; check the AWS CLI documentation for more information.
  2. Open your terminal / Git Bash
  3. Run ( aws configure )
  4. Enter the credentials from your account:
  • AWS Access Key ID
  • AWS Secret Access Key
  • Default region name: The region code of the default region (e.g., us-east-1).
  • Default output format: The output format for the AWS CLI (e.g., json).
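Once the CLI is configured, you can check from Python that the credentials are picked up. A minimal sketch, assuming boto3 is installed and uses the default profile written by aws configure:

```python
import boto3

# Reads the credentials written by `aws configure` (~/.aws/credentials).
sts = boto3.client("sts")
identity = sts.get_caller_identity()

print("Account:", identity["Account"])  # the AWS account ID
print("User ARN:", identity["Arn"])     # the IAM user you are signed in as
```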

7. AWS S3 (Raw File Bucket)

This service provides the bucket that will be used to store our raw files.

  1. Create the main bucket for our raw files
  2. You can name it as a combination of your project name and the region you use

Throughout the project we will come back to create more buckets, so keep in mind to always give each bucket a descriptive name, for example (your main bucket name)-(the specific task this bucket serves). A short sketch of creating such a bucket from Python follows.
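A minimal sketch with boto3, using a hypothetical bucket name and region; adjust both to your own setup:

```python
import boto3

# Hypothetical bucket name following the (project)-(region) naming convention.
BUCKET_NAME = "de-youtube-raw-useast1"

s3 = boto3.client("s3", region_name="us-east-1")

# In us-east-1 no CreateBucketConfiguration is needed; in any other region pass
# CreateBucketConfiguration={"LocationConstraint": "<region>"}.
s3.create_bucket(Bucket=BUCKET_NAME)
```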

  1. Download the data from the data source. For Kaggle, create a new directory, copy the API command from Kaggle, and paste it into Git Bash.
  2. Unzip the downloaded zip file
  3. Upload the folder to the S3 bucket using Git Bash (an equivalent Python sketch follows this list)
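For reference, the same upload can also be done from Python with boto3. A minimal sketch, assuming the files sit in a local data/ directory and reusing the hypothetical bucket name and a hypothetical key prefix:

```python
import boto3
from pathlib import Path

s3 = boto3.client("s3")
BUCKET_NAME = "de-youtube-raw-useast1"             # hypothetical bucket name
PREFIX = "youtube/raw_statistics_reference_data/"  # hypothetical key prefix

# Upload every CSV/JSON file from the local data/ directory into the raw bucket.
for path in Path("data").glob("*"):
    if path.suffix in {".csv", ".json"}:
        s3.upload_file(str(path), BUCKET_NAME, PREFIX + path.name)
        print(f"uploaded {path.name}")
```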

8. AWS Glue:

This service reads the schema of our files and creates entries in the data catalog (the ETL process). The crawler eventually produces a table that we can query with AWS Athena afterwards.

  1. Create the Glue crawler (a boto3 sketch of this step follows the list)

  2. Then create a role for the crawler (include "glue" in the role name so it is easy to identify)

  3. Attach the policies to the role:

    • AmazonS3FullAccess
    • AWSGlueServiceRole
  4. Create a new database for AWS Glue

  5. Open the table and you will find the schema information extracted from the JSON (column names, types, etc.)

  6. Click Actions and then View data

  7. This will open AWS Athena
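For step 1, the crawler can also be created and started programmatically. A minimal boto3 sketch, with hypothetical crawler, role, database, and S3 path names (the role must already exist with the policies above):

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names; adjust the role, database, and S3 path to your own setup.
glue.create_crawler(
    Name="de-youtube-raw-json-crawler",
    Role="de-youtube-glue-role",
    DatabaseName="de_youtube_raw",
    Targets={
        "S3Targets": [
            {"Path": "s3://de-youtube-raw-useast1/youtube/raw_statistics_reference_data/"}
        ]
    },
)

# Run the crawler so it infers the schema and registers the table in the catalog.
glue.start_crawler(Name="de-youtube-raw-json-crawler")
```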

9. AWS Athena:

With Athena we can transform the data without moving it into a data warehouse like Redshift; in other words, we transform on top of the data where it lives, and the results can be stored back in an S3 bucket.

  1. Click Settings and then Manage
  2. Create a new bucket, dedicated to the AWS Athena query results
  3. Try to run a query; you will find that it returns an error
  4. The issue is that the JSON files are in a nested format that Athena cannot read, so we have to reshape them to fit what Athena expects; in other words, we clean the data from semi-structured to structured
  5. To do this, we will use AWS Lambda to clean the data and store the result in a bucket (a query sketch for the cleaned table follows this list)
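Once the cleaned table exists (next section), it can also be queried from Python. A minimal sketch using awswrangler, reusing the hypothetical database and table names from the Lambda sketch below and a hypothetical results bucket:

```python
import awswrangler as wr

# Hypothetical names: the Glue database/table produced by the cleaning step,
# and the dedicated bucket for Athena query results created above.
df = wr.athena.read_sql_query(
    sql="SELECT * FROM cleaned_statistics_reference_data LIMIT 10",
    database="de_youtube_cleaned",
    s3_output="s3://de-youtube-athena-results-useast1/",
)
print(df.head())
```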

10. AWS Lambda:

  1. First, create the function and attach AmazonS3FullAccess to its execution role
  2. Open AWS Lambda and add the Python code that will clean the data (a sketch of such a function follows this list)
  3. Go to Configuration, open Environment variables, and set the variables that the Python code reads
  4. Set up a test event and choose the S3 Put template. Then enter your bucket and the key of one of your files
  5. If there is an error about missing libraries, scroll down to Add a layer and choose the AWS Data Wrangler / AWSSDKPandas layer (latest version)
  6. If there is a timeout error, go to Configuration, click the General configuration tab, and increase the timeout to, say, 3 minutes (adjust it as you like)
  7. Run it again and it should work (make sure your connection is good)
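As a reference for step 2, below is a minimal sketch of such a cleaning function, assuming the AWSSDKPandas / Data Wrangler layer is attached and that the environment variables (hypothetical names) point at the cleansed output path, Glue database/table, and write mode:

```python
import os
import urllib.parse

import awswrangler as wr
import pandas as pd

# Hypothetical environment variable names; set them under Configuration -> Environment variables.
OUTPUT_PATH = os.environ["s3_cleansed_layer"]          # e.g. s3://de-youtube-cleansed-useast1/youtube/
GLUE_DATABASE = os.environ["glue_catalog_db_name"]     # e.g. de_youtube_cleaned
GLUE_TABLE = os.environ["glue_catalog_table_name"]     # e.g. cleaned_statistics_reference_data
WRITE_MODE = os.environ.get("write_data_operation", "append")


def lambda_handler(event, context):
    # The S3 Put test event (or trigger) tells us which raw JSON file to process.
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(event["Records"][0]["s3"]["object"]["key"])

    # Read the raw category JSON and flatten the nested "items" array into columns.
    df_raw = wr.s3.read_json(f"s3://{bucket}/{key}")
    df_clean = pd.json_normalize(df_raw["items"])

    # Write the structured data back to S3 as Parquet and register it in the Glue
    # catalog so that Athena can query it directly.
    return wr.s3.to_parquet(
        df=df_clean,
        path=OUTPUT_PATH,
        dataset=True,
        database=GLUE_DATABASE,
        table=GLUE_TABLE,
        mode=WRITE_MODE,
    )
```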
