This project implements the Airflow DAG presented in chapter 4 of the book Data Pipelines with Apache Airflow by B. Harenslak and J. de Ruiter.
This pipeline fetches page views from https://dumps.wikimedia.org/.
Pages of interest are:
- Meta
- Microsoft
- Apple
- Amazon
- Netflix
- Nvidia
The overall pipeline runs in less than 20 seconds. This includes fetching the results as a zip, unzipping, processing, inserting into Postgres, and running the analytics.
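As a rough illustration of the processing step, the sketch below filters page view records down to the pages of interest. It assumes the Wikimedia line layout `domain_code page_title view_count response_size` and the English domain `en`. The function name `count_pageviews` and the sample lines are hypothetical and are not taken from the DAG in this repository.

```python
from typing import Dict, Iterable

# Pages of interest, taken from the list in this README.
PAGES = {"Meta", "Microsoft", "Apple", "Amazon", "Netflix", "Nvidia"}


def count_pageviews(
    lines: Iterable[str], pages=PAGES, domain: str = "en"
) -> Dict[str, int]:
    """Sum view counts for the pages of interest.

    Assumes each line looks like:
    'domain_code page_title view_count response_size'
    """
    counts = dict.fromkeys(sorted(pages), 0)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip lines that do not match the assumed layout
        domain_code, page_title, view_count, _ = parts
        if domain_code == domain and page_title in counts:
            counts[page_title] += int(view_count)
    return counts


# Hypothetical sample lines, for illustration only.
sample = [
    "en Apple 120 0",
    "en Netflix 45 0",
    "de Apple 7 0",
    "en Some_Other_Page 3 0",
]
result = count_pageviews(sample)
print(result)
```

Pages that do not appear in the input keep a count of 0, so every page of interest still produces a row to insert.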
- Have Docker installed.
  To install it, see: Docker Desktop Install
- Have Astro CLI installed.
  If you use brew, you can run:
  `brew install astro`
  For other systems, please refer to: Install Astro CLI
- Run `astro dev init` to create the necessary files for your environment.
- Run `astro dev start` to start the Airflow service with Docker.
- Configure the Postgres connection by following these steps:
  - Run `astro dev bash` to access the Airflow terminal.
  - Run the following command to add the connection:

        airflow connections add \
            --conn-type postgres \
            --conn-host host.docker.internal \
            --conn-login postgres \
            --conn-password postgres \
            postgres_default

    Using localhost as the host here will cause an error, because inside the container localhost refers to the container itself rather than to your machine. For an in-depth explanation, see: Connect to local Postgres from docker airflow
- To execute the DAG, visit the Airflow UI.
  In the DAGs section, you should see a DAG called `wikipedia_pageviews`.
  NOTE: Your run section will be empty instead of the colored options you see in the image.
  Click the DAG to open it, then click the trigger (play) button in the top right to run it.
  To take a look at the process flow of the pipeline, select the Graph view.
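The Postgres connection from the setup steps above can also be written as a connection URI, which Airflow reads from an environment variable named `AIRFLOW_CONN_<CONN_ID>`. The sketch below builds that variable from the same values used in the `airflow connections add` command. The helper names `build_conn_uri` and `conn_env_var` are hypothetical, and no port or database name is included because this README does not specify them.

```python
from urllib.parse import quote


def build_conn_uri(conn_type: str, login: str, password: str, host: str) -> str:
    """Build a connection URI; credentials are percent-encoded."""
    # quote(..., safe="") escapes characters such as '@' or ':' in credentials.
    return f"{conn_type}://{quote(login, safe='')}:{quote(password, safe='')}@{host}"


def conn_env_var(conn_id: str) -> str:
    """Return the environment variable name Airflow checks for a connection id."""
    return f"AIRFLOW_CONN_{conn_id.upper()}"


# Same values as in the CLI command from the setup steps.
uri = build_conn_uri("postgres", "postgres", "postgres", "host.docker.internal")
env_name = conn_env_var("postgres_default")
print(f"{env_name}={uri}")
```

Whether you define the connection through the CLI or through an environment variable is a matter of preference. The CLI route is the one this project documents.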
.
├── Dockerfile
├── LICENSE
├── README.md
├── dags
│   ├── sql
│   │   └── most_popular_hour_per_page.sql
│   └── wikipedia_pageviews.py
├── packages.txt
├── pyproject.toml
├── requirements.txt
└── tests
    └── dags
        └── test_dag_example.py
Generated with: tree --gitignore --prune
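The file name `most_popular_hour_per_page.sql` suggests that the analytics step finds, for each page, the hour of day with the most views. The query itself is not shown in this README, so the Python sketch below is only an assumption about that logic: it averages the view counts per page and hour, then keeps the hour with the highest average. The row layout `(page_name, hour, view_count)` and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def most_popular_hour_per_page(
    rows: Iterable[Tuple[str, int, int]]
) -> Dict[str, Tuple[int, float]]:
    """Map each page to (hour, average views) for its busiest hour.

    rows: (page_name, hour, view_count) tuples; this layout is an assumption.
    """
    totals = defaultdict(lambda: [0, 0])  # (page, hour) -> [sum of views, row count]
    for page, hour, views in rows:
        entry = totals[(page, hour)]
        entry[0] += views
        entry[1] += 1

    best: Dict[str, Tuple[int, float]] = {}
    for (page, hour), (total, n) in totals.items():
        average = total / n
        # Keep the hour with the highest average for each page.
        if page not in best or average > best[page][1]:
            best[page] = (hour, average)
    return best


# Hypothetical sample rows, for illustration only.
rows = [
    ("Apple", 9, 100),
    ("Apple", 9, 200),
    ("Apple", 17, 120),
    ("Nvidia", 3, 50),
]
result = most_popular_hour_per_page(rows)
print(result)
```

In the pipeline this aggregation runs in Postgres rather than in Python. The sketch only mirrors the idea so it can be read without a database.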
- Data Pipelines with Apache Airflow by B. Harenslak and J. de Ruiter
- Develop your Astro project
- Airflow Docs
- TemplateNotFound error when running simple Airflow BashOperator
- How to Change the Timezone of a Postgres Database
- Airflow PostgresHook Example
- Start a process when the container starts
- Read JSON file using Python
- Reading and Writing JSON to a File in Python
- Passing a command line argument to airflow BashOperator
- Templates reference
- Time Zones