This project implements a complete data warehouse solution for e-commerce analytics using Apache Airflow, PostgreSQL, and Apache Superset. It includes automated data generation, ETL processes, and analytics dashboards.
flowchart TB
subgraph Sources["Data Sources"]
direction TB
S1[("CSV File\nCustomers")]
S2[("JSON File\nProducts")]
S3[("XML File\nTransactions")]
end
subgraph DataGen["Data Generation"]
direction TB
DG[/"Apache Airflow DAG\nGenerate Data"/]
PG["Python Script\nGenerator"]
end
subgraph Storage["Data Storage"]
direction TB
VS[("Volume\n/opt/airflow/data")]
LOG[("Volume\n/opt/airflow/logs")]
end
subgraph ETLLayer["ETL Layer"]
direction TB
subgraph Airflow["Apache Airflow"]
direction TB
DAG1["DAG\nSchema Creation"]
DAG2["DAG\nETL Process"]
DAG3["DAG\nReporting"]
end
subgraph Process["ETL Process"]
direction TB
E["Extraction\nPython Operators"]
T["Transformation\nPython/SQL"]
L["Loading\nPostgres Operators"]
end
end
subgraph Warehouse["Data Warehouse"]
direction TB
subgraph Dimensions["Dimension Tables"]
D1["DIM_TIME"]
D2["DIM_CUSTOMER"]
D3["DIM_PRODUCT"]
end
subgraph Facts["Fact Tables"]
F1["FACT_SALES"]
end
subgraph Views["Views Layer"]
V1["Superset\nViews"]
V2["Materialized\nViews"]
end
end
subgraph Analytics["Analytics Layer"]
direction TB
SUP["Apache Superset\nVisualization"]
BI["BI Tools\nDashboards"]
end
%% Connections
DataGen --> Storage
S1 & S2 & S3 --> ETLLayer
ETLLayer --> Warehouse
Warehouse --> Analytics
%% Detailed Flow
DG --> PG --> VS
VS --> E --> T --> L
DAG1 & DAG2 & DAG3 --> Process
D1 & D2 & D3 --> F1
F1 --> V1 & V2
V1 & V2 --> SUP --> BI
classDef sourceSystem fill:#e6f3ff,stroke:#333,stroke-width:2px
classDef storageSystem fill:#f9f3ff,stroke:#333,stroke-width:2px
classDef processSystem fill:#fff3e6,stroke:#333,stroke-width:2px
classDef warehouseSystem fill:#e6ffe6,stroke:#333,stroke-width:2px
classDef analyticsSystem fill:#ffe6e6,stroke:#333,stroke-width:2px
class S1,S2,S3 sourceSystem
class VS,LOG storageSystem
class ETLLayer,Process processSystem
class Warehouse,Dimensions,Facts,Views warehouseSystem
class Analytics,SUP,BI analyticsSystem
project/
├── dags/ # Airflow DAG definitions
│ ├── generate_sample_data.py
│ ├── dw_create_schema.py
│ ├── dw_etl_dag.py
│ └── dw_reporting_dag.py
├── scripts/ # Python scripts
│ └── generate_data.py
├── data/ # Generated data files
├── logs/ # Airflow logs
├── docker-compose.yml # Docker services configuration
├── requirements.txt # Python dependencies
└── .env # Environment variables
- Docker and Docker Compose
- 4GB RAM minimum
- 10GB free disk space
- Clone the repository:
git clone https://github.com/Stefen-Taime/datawarehouse/
cd datawarehouse
- Create required directories:
mkdir -p ./dags ./logs ./scripts ./data
- Create .env:
AIRFLOW_UID=50000
AIRFLOW_GID=50000
AIRFLOW_FERNET_KEY=46BKJoQYlPPOexq0OhDZnIlNepKFf87WFwLbfzqDDho=
- Start the services:
docker-compose up -d
- Access the services:
- Airflow UI: http://localhost:8080 (admin/admin)
- Adminer: http://localhost:8081
- Data Warehouse: localhost:5433
The project consists of four main DAGs:
-
generate_sample_data
: Generates sample data- Customers (CSV)
- Products (JSON)
- Transactions (XML)
-
dw_create_schema
: Creates warehouse schema- Dimension tables
- Fact tables
- OLAP views
-
dw_etl
: Performs ETL operations- Extracts from source files
- Transforms data
- Loads into warehouse
-
dw_reporting
: Creates analytics views- Materialized views for reporting
- OLAP cubes
- Performance optimizations
-
dim_time
: Time dimension- Date hierarchies
- Weekend indicators
- Fiscal periods
-
dim_client
: Customer dimension (SCD Type 2)- Demographics
- Segments
- Location
-
dim_produit
: Product dimension (SCD Type 2)- Categories
- Pricing
- Attributes
fact_ventes
: Sales transactions- Foreign keys to dimensions
- Measures (quantity, amount)
- Transaction metadata
mv_ventes_par_segment
: Sales by customer segmentmv_ventes_par_categorie
: Sales by product categorymv_ventes_temporel
: Time-based sales analysis
docker-compose -f docker-compose-superset.yml up -d
Access Superset UI
Navigate to http://localhost:8088
Login with credentials:
- Username: admin
- Password: admin
These dashboards provide comprehensive analytics including:
Sales Overview Dashboard:
- Customer segmentation analysis
- Temporal sales trends
- Regional performance metrics
- Sales channel distribution
Product Analysis Dashboard:
- Product category performance
- Inventory turnover metrics
- Price point analysis
- Cross-selling opportunities
-
Connection errors:
# Check network docker network ls # Verify services docker-compose ps
-
Data loading failures:
# Check logs docker-compose logs airflow-scheduler # Verify files ls -l ./data/
-
Permission issues:
# Fix permissions sudo chown -R 50000:50000 ./logs ./data
- Fork the repository
- Create a feature branch
- Submit a pull request
MIT License
For support or questions, please open an issue in the repository.