- Description of the Problem
- Objective
- Technologies
- Data Architecture
- Data Description
- AWS Set up
- AWS S3 (Raw File Bucket)
- AWS Glue
- AWS Athena
- AWS Lambda
Today, analysts use historical data from platforms like YouTube to build predictive models for better user engagement and smarter decision-making. But these models face challenges due to the changing nature of online video consumption, which can be influenced by trends, world events, and audience preferences.
Moreover, when dealing with global data, complexities arise due to cultural and regional variations in content consumption. Hence, it's important to combine data analysis with current context and trends to drive better business decisions.
Historical YouTube data is a rich resource for content creators, marketers, and analysts, helping them understand audience habits, predict content demand, and formulate content strategies.
These datasets can also optimize content recommendations and planning, and guide strategic decisions for platforms like YouTube. Plus, they provide academic and research opportunities, allowing exploration of viewer behavior and the influence of video platforms on global media consumption.
A typical YouTube dataset, available as .csv files or via APIs, includes various engagement metrics but has limitations. It captures historical trends but may not predict sudden changes or account for local contexts. Therefore, it's essential to consider these limitations when using this data. Moreover, we must handle this data responsibly, ensuring user anonymity and data privacy. This project aims to analyze YouTube data using AWS and visualize the results, helping uncover patterns and insights into user behavior and content performance on YouTube.
In this project, we are going to leverage several AWS services:
- AWS S3
- AWS IAM
- AWS Glue
- AWS Lambda
- AWS Athena
These files contain historical trending-video data from YouTube for different countries. They include video IDs, titles, channels, category IDs, tags, view counts, likes, dislikes, comment counts, and more. The data helps in understanding viewer behavior and video performance trends in the respective regions.
- CAvideos.csv: Data for Canada
- DEvideos.csv: Data for Germany
- FRvideos.csv: Data for France
- GBvideos.csv: Data for the United Kingdom
- INvideos.csv: Data for India
- JPvideos.csv: Data for Japan
- KRvideos.csv: Data for South Korea
- MXvideos.csv: Data for Mexico
- RUvideos.csv: Data for Russia
- USvideos.csv: Data for the United States
These files map the category IDs to their corresponding categories for videos from different countries, assisting in understanding popular content categories in respective regions.
- DE_category_id.json: Category data for Germany
- FR_category_id.json: Category data for France
- GB_category_id.json: Category data for the United Kingdom
- IN_category_id.json: Category data for India
- JP_category_id.json: Category data for Japan
- KR_category_id.json: Category data for South Korea
- MX_category_id.json: Category data for Mexico
- RU_category_id.json: Category data for Russia
- US_category_id.json: Category data for the United States
The data was collected from Kaggle: https://www.kaggle.com/datasets/datasnaek/youtube-new
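To get a feel for how the two file types fit together, here is a minimal local inspection sketch. It assumes pandas is installed, the Kaggle archive has been extracted into a local `data/` directory, and the column names match the Kaggle dataset; adjust paths and names to your own setup.

```python
import json
import pandas as pd

# Per-region trending statistics (CSV): one row per video per trending day
videos = pd.read_csv("data/USvideos.csv")
print(videos[["video_id", "title", "category_id", "views", "likes"]].head())

# Category reference file (JSON): the useful fields are nested inside the "items" array
with open("data/US_category_id.json") as f:
    categories = json.load(f)

# Flatten "items" so that category_id in the CSV can be joined to a readable title
category_map = pd.json_normalize(categories["items"])[["id", "snippet.title"]]
print(category_map.head())
```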
In this first section, we are going to create the account and connect the AWS CLI to our local machine.
- Sign up for AWS with the root account (use your email)
- Go to AWS IAM and create a new user; this user will be used for day-to-day work, since working directly with the root account is risky (it holds full access to the account)
- After you create the user, download the credentials CSV file and log out of the root account
- Log in again with the IAM user account (the sign-in information is included in the CSV file)
- Install the AWS CLI; the installation steps differ per operating system, so check the AWS CLI documentation for details
- Open your terminal / Git Bash
- Run `aws configure` (an example session is shown after this list)
- Enter the credentials from your account:
- AWS Access Key ID
- AWS Secret Access Key
- Default region name: The region code of the default region (e.g., us-east-1).
- Default output format: The output format for the AWS CLI (e.g., json).
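The configure step is interactive; a sketch of what the session looks like (the key values here are only placeholders):

```bash
$ aws configure
AWS Access Key ID [None]: AKIAXXXXXXXXXXXXXXXX
AWS Secret Access Key [None]: ****************************************
Default region name [None]: us-east-1
Default output format [None]: json
```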
This service provides the bucket that will be used to store our raw files.
- Create the main bucket for our raw files
- You can name it as a combination of your project name and the region you use
Throughout the process we will come back to create more buckets, so keep a consistent, specific naming scheme, for example (your main bucket name)-(the specific task this bucket serves). A CLI example is shown below.
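For reference, the bucket can also be created from Git Bash with the AWS CLI; the bucket name below is only an example of the naming scheme, since S3 bucket names are globally unique:

```bash
# Replace with your own project/region combination
aws s3 mb s3://de-on-youtube-raw-useast1 --region us-east-1
```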
- Download the data from the data source. For Kaggle, create a new directory, copy the API command from Kaggle, and paste it into Git Bash
- Unzip the downloaded zip file
- Upload the files to the S3 bucket from Git Bash, as shown below
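A rough sketch of those three steps from Git Bash, assuming the Kaggle CLI is installed and configured with an API token; the bucket name and key prefixes are examples, not fixed names:

```bash
# Download and extract the Kaggle dataset
kaggle datasets download -d datasnaek/youtube-new
unzip youtube-new.zip -d youtube-data
cd youtube-data

# Copy the JSON reference files and the per-region CSV statistics into the raw bucket
aws s3 cp . s3://de-on-youtube-raw-useast1/youtube/raw_statistics_reference_data/ \
  --recursive --exclude "*" --include "*.json"
aws s3 cp . s3://de-on-youtube-raw-useast1/youtube/raw_statistics/ \
  --recursive --exclude "*" --include "*.csv"
```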
This service reads the schema of our files and builds the data catalog (the ETL step); from the catalog a table is created, which we can then query with AWS Athena.
- Create the Glue crawler
- Create an IAM role for the crawler (include "glue" in the role name)
- Attach the following policies to the role:
  - AmazonS3FullAccess
  - AWSGlueServiceRole
- Create a new database for AWS Glue
- Run the crawler, then open the resulting table; you will find the schema Glue inferred from the JSON (column names, data types, etc.)
- Click Action and then View data
- This opens AWS Athena (the same crawler setup can also be scripted with the AWS CLI, as sketched below)
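The console clicks above can also be expressed with the AWS CLI if you prefer scripting them; the crawler, role, database, and path names below are placeholders for whatever you created:

```bash
# Create the Glue database and a crawler pointing at the raw JSON prefix, then run it
aws glue create-database --database-input '{"Name": "de_youtube_raw"}'
aws glue create-crawler \
  --name youtube-raw-reference-crawler \
  --role AWSGlueServiceRole-youtube \
  --database-name de_youtube_raw \
  --targets '{"S3Targets": [{"Path": "s3://de-on-youtube-raw-useast1/youtube/raw_statistics_reference_data/"}]}'
aws glue start-crawler --name youtube-raw-reference-crawler
```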
With Athena we can transform the data without moving it into a data warehouse like Redshift; in other words, we transform on top of the data where it sits, and the results can be stored back into an S3 bucket.
- Click Settings and then Manage to set the query result location
- Create a new bucket dedicated to AWS Athena query results
- Try to run a query (an example invocation is sketched after these steps); you will find that it fails with an error
- The issue is that the JSON files are in a format Athena cannot read as-is: the records are nested inside an "items" array in a single pretty-printed document, rather than one JSON object per line. So we have to reshape the files to fit AWS, or in other words, clean the data from semi-structured to structured form.
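A query can also be submitted from the CLI; this is only a sketch, and the database, table, and results-bucket names are assumptions based on the setup above:

```bash
# Run a test query; the output location must point at the Athena results bucket created above
aws athena start-query-execution \
  --query-string "SELECT * FROM raw_statistics_reference_data LIMIT 10;" \
  --query-execution-context Database=de_youtube_raw \
  --result-configuration OutputLocation=s3://de-on-youtube-athena-results-useast1/
```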
- In this case, we will use AWS Lambda to clean the data and store it in a new bucket
- First, create the Lambda function and give its execution role AmazonS3FullAccess
- Open AWS Lambda and add the Python file that cleans the data (a sketch of such a function is shown after this list)
- Click Configuration, then Environment variables, and set the variables to match the ones referenced in the Python file
- Create a test event and choose the S3 Put template, then fill in your bucket name and the key of one of your uploaded files
- If you get an import error, scroll down to 'Add a layer', choose the managed AWS Data Wrangler / AWS SDK for pandas layer, and select the latest version
- If you get a timeout error, go to Configuration, then General configuration, and raise the timeout to around 3 minutes (adjust as needed)
- Run the test again and it should work (make sure your connection is stable)
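The cleaning function itself is not reproduced in this document, so here is a minimal sketch of what it could look like, assuming the AWS SDK for pandas (awswrangler) layer mentioned above and hypothetical environment-variable names; adjust both to your own setup:

```python
import os
import urllib.parse

import awswrangler as wr
import pandas as pd

# Hypothetical environment variables -- set these under Configuration > Environment variables
CLEANSED_PATH = os.environ["s3_cleansed_layer"]        # e.g. s3://de-on-youtube-cleansed-useast1/youtube/
GLUE_DATABASE = os.environ["glue_catalog_db_name"]     # e.g. de_youtube_cleaned
GLUE_TABLE = os.environ["glue_catalog_table_name"]     # e.g. raw_statistics_reference_data
WRITE_MODE = os.environ["write_data_operation"]        # e.g. append


def lambda_handler(event, context):
    # The S3 Put test event (or trigger) carries the bucket and key of the uploaded JSON file
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"], encoding="utf-8")

    # Read the raw category JSON and flatten the nested "items" array into flat columns
    df_raw = wr.s3.read_json(f"s3://{bucket}/{key}")
    df_flat = pd.json_normalize(df_raw["items"])

    # Write the cleansed data back to S3 as Parquet and register it in the Glue catalog
    return wr.s3.to_parquet(
        df=df_flat,
        path=CLEANSED_PATH,
        dataset=True,
        database=GLUE_DATABASE,
        table=GLUE_TABLE,
        mode=WRITE_MODE,
    )
```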