


kevinknights29/Airflow_Wikipedia_Pageviews


Airflow_Wikipedia_Pageviews

This project implements the Airflow DAG presented in chapter 4 of the book Data Pipelines with Apache Airflow by B. Harenslak and J. de Ruiter.

Results

This pipeline fetches page views from https://dumps.wikimedia.org/.
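As a rough sketch of the fetch step, the URL of an hourly dump can be derived from a timestamp. The path layout below is an assumption based on the pattern used in the book's chapter 4 example, and the function name `pageviews_url` is hypothetical; the DAG in this repository may build the URL differently.

```python
from datetime import datetime


def pageviews_url(ts: datetime) -> str:
    """Build the URL of the hourly pageviews dump for a given timestamp.

    Assumed layout: /other/pageviews/<year>/<year>-<month>/pageviews-<date>-<hour>0000.gz
    """
    return (
        "https://dumps.wikimedia.org/other/pageviews/"
        f"{ts:%Y}/{ts:%Y-%m}/pageviews-{ts:%Y%m%d}-{ts:%H}0000.gz"
    )


# Example: the dump covering 09:00 on 15 January 2024.
print(pageviews_url(datetime(2024, 1, 15, 9)))
```

In an Airflow task, the timestamp would typically come from the run's logical date rather than a hard-coded value.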

Pages of interest are:

  • Meta
  • Microsoft
  • Apple
  • Amazon
  • Netflix
  • Nvidia
  • Google

The overall pipeline runs in less than 20 seconds. This includes fetching the results as a zipped file, unzipping, processing, inserting into Postgres, and running the analytics query.
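The processing step can be sketched as follows. This is a minimal illustration, not the repository's actual code: it assumes each line of the unzipped dump has four whitespace-separated fields (domain code, page title, view count, response size) and that only the English Wikipedia domain `en` is counted. The function name `count_pageviews` and the sample lines are made up for the example.

```python
from collections import Counter
from typing import Iterable

# Pages of interest, as listed in this README.
PAGES_OF_INTEREST = {
    "Meta", "Microsoft", "Apple", "Amazon", "Netflix", "Nvidia", "Google",
}


def count_pageviews(lines: Iterable[str], domain: str = "en") -> Counter:
    """Sum view counts per page of interest for one domain (assumed: 'en')."""
    counts: Counter = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed lines
        domain_code, page_title, view_count, _response_size = parts
        if domain_code == domain and page_title in PAGES_OF_INTEREST:
            counts[page_title] += int(view_count)
    return counts


# Made-up sample lines in the assumed dump format.
sample = [
    "en Google 451 0",
    "en Apple 120 0",
    "de Google 33 0",
    "en Main_Page 9000 0",
]
print(count_pageviews(sample))
```

The resulting counts are what would then be inserted into Postgres, one row per page and hour.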

image

Prerequisites

Getting Started

  1. Run astro dev init to create the necessary files for your environment.

  2. Run astro dev start to start the Airflow services with Docker.

  3. Configure the Postgres connection by following these steps:

    1. Run astro dev bash to access the Airflow terminal.

    2. Run the following command to add the connection:

      airflow connections add \
      --conn-type postgres \
      --conn-host host.docker.internal \
      --conn-login postgres \
      --conn-password postgres \
      postgres_default

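    Using localhost as the host here will cause a connection error, because inside the Airflow container localhost refers to the container itself rather than to your machine. For an in-depth explanation, see: Connect to local Postgres from docker airflow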
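The same connection can also be written as a single URI, which Airflow accepts through an environment variable named `AIRFLOW_CONN_<CONN_ID>` (here `AIRFLOW_CONN_POSTGRES_DEFAULT`). The snippet below is a small sketch of how such a URI could be assembled from the values used above. The helper name `postgres_conn_uri` is hypothetical, and no port or database name is included because the command above does not set one.

```python
from urllib.parse import quote


def postgres_conn_uri(host: str, login: str, password: str) -> str:
    """Assemble a Postgres connection URI, percent-encoding the credentials."""
    return f"postgres://{quote(login, safe='')}:{quote(password, safe='')}@{host}"


# Values taken from the `airflow connections add` command above.
print(postgres_conn_uri("host.docker.internal", "postgres", "postgres"))
```

Percent-encoding the credentials matters once the password contains characters such as `@` or `/`, which would otherwise break URI parsing.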

Execution

To execute the DAG, please visit: Airflow UI

In the DAGs section, you should see a DAG called wikipedia_pageviews.

image

NOTE: Your Runs section will be empty at first, rather than showing the colored run indicators you see in the image.

Click the DAG to open it. To run it, click the trigger (play) button in the top-right corner.

image

To take a look at the process flow of the pipeline, select the Graph view.

image

Project Structure

.
├── Dockerfile
├── LICENSE
├── README.md
├── dags
│   ├── sql
│   │   └── most_popular_hour_per_page.sql
│   └── wikipedia_pageviews.py
├── packages.txt
├── pyproject.toml
├── requirements.txt
└── tests
    └── dags
        └── test_dag_example.py

Generated with: tree --gitignore --prune

Have fun! 😄

Reference
