DataTalksClub - data-engineering-zoomcamp
-
Week 1: Basics and setup (Done)
- Note for Week 1 (Highly Recommended)
- Learn basic Docker and Docker Compose
- Learn basic Terraform
- Set up a VM on the Google Cloud Platform
- gcloud CLI, IAM & Admin, service accounts
- Git Bash
Note: Credentials and other sensitive information have already been hidden.
-
Week 2: Workflow Orchestration (Done)
- Note for Week 2
- Prefect (2023)
- Airflow (2022)
- Setting up a data pipeline and ETL jobs using Google Cloud Storage and a local, containerized PostgreSQL
-
Week 3: Data Warehouse (Done)
- Note for Week 3
- Query optimization concepts in BigQuery
-
Week 4: Analytics Engineering (Done)
- Note for Week 4
- Basic Data Pipeline Concepts
- Learn dbt
-
Week 5: Batch Processing (Done)
- Note for Week 5
- Examples of using the PySpark API for data transformation: initializing a session, defining a schema, UDFs, read/write, SQL, and joins
- Anatomy of Spark and how it works underneath: reshuffling, repartitioning, broadcasting
- Submitting a Spark job to a cluster with a script parameterized via argparse, so that parameters can be passed in and used within the script
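
As a rough illustration of the parameterized-script idea in the last bullet, here is a minimal sketch. The flag names (`--input`, `--output`, `--partitions`), the Parquet format, and the function names are assumptions made for this example, not taken from the course notes.

```python
import argparse


def parse_args(argv=None):
    """Parse job parameters. The flag names are hypothetical."""
    parser = argparse.ArgumentParser(description="Parameterized Spark job (sketch)")
    parser.add_argument("--input", required=True, help="path to the input Parquet files")
    parser.add_argument("--output", required=True, help="path for the transformed output")
    parser.add_argument("--partitions", type=int, default=4, help="number of output partitions")
    return parser.parse_args(argv)


def run_job(args):
    """Read, repartition, and write the data using the parsed parameters."""
    # Imported here so that argument parsing can be tried without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parameterized-job").getOrCreate()
    df = spark.read.parquet(args.input)
    # Repartitioning triggers a shuffle; the count comes from the command line.
    df.repartition(args.partitions).write.mode("overwrite").parquet(args.output)
    spark.stop()
```

In a real script, `run_job(parse_args())` would be called at the bottom of the file, and anything placed after the script name on the `spark-submit` command line would be handed to `parse_args` as the script's own arguments.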
-
Week 6: Stream Processing (In progress)
- Note for Week 6
- Anatomy of Kafka
- Python Kafka configuration and Demo
- Additional Reference: