RoshniRanaDS27/Data_Engineering_Cleaning_Normalization_With_ERD

Women’s maternal and reproductive health

Married or in-union women of reproductive age who have their need for family planning satisfied with modern methods (%)

image

Background

This repo documents a project that builds data engineering pipelines to support careful data analysis.

Specific focus areas:

How does access to modern family planning methods vary across different regions and socioeconomic groups?

Step 01: Data collection

  • The data for this project comes from The World Health Organization Relational Data Hub.
  • The dataset collected covers women’s maternal and reproductive health, specifically the percentage of married or in-union women of reproductive age whose need for family planning is satisfied with modern methods.
  • Ensuring the reliability and relevance of the data was paramount, as it forms the foundation for the depth and accuracy of the analysis.

Step 02: Data cleaning

  • This stage focused on data cleaning.
  • I prioritized data quality by addressing missing values, nulls, duplicates, and outliers, and by correcting column data types.
  • Standardization was done with Python. This preparation ensures that the data aligns with the analysis goals.
  • Modern Family Planning Data Cleaning and Transformation Notebook
  • Cleaned Data CSV
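
The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the project notebook: the column names (`Country`, `Period`, `Value`), the sample rows, and the 0-100 outlier rule are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical raw extract; column names and values are illustrative only.
raw = pd.DataFrame(
    {
        "Country": ["A", "A", "B", "C", "D"],
        "Period": ["2019", "2019", "2020", "2021", "2021"],
        "Value": ["55.0", "55.0", None, "61.5", "250"],
    }
)


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Remove duplicates, fix data types, drop missing values and outliers."""
    out = df.drop_duplicates().copy()
    # Convert text columns to numeric types; unparseable entries become NaN.
    out["Value"] = pd.to_numeric(out["Value"], errors="coerce")
    out["Period"] = pd.to_numeric(out["Period"]).astype("int64")
    # Drop rows where the measured value is missing.
    out = out.dropna(subset=["Value"])
    # Assumed rule: a percentage outside 0-100 is treated as an outlier.
    out = out[out["Value"].between(0, 100)]
    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

In this sketch the duplicate row, the row with a missing value, and the out-of-range row are all removed, leaving two rows.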

Step 03: Data Transformation

  • Step three is data transformation, where I shaped the data to fit the needs of the analysis.
  • This includes normalization to ensure consistency and clarity in data representation, setting the stage for effective modeling.
  • Data Normalization Notebook
  • Normalized Tables
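
As a rough illustration of the normalization idea, the sketch below splits one flat table into two dimension tables and a fact table. The table layout, the column names, and the surrogate keys (`region_id`, `country_id`) are hypothetical and may differ from the tables in the Data Normalization Notebook.

```python
import pandas as pd

# Hypothetical flat table; names and values are illustrative only.
flat = pd.DataFrame(
    {
        "Country": ["X", "Y", "X"],
        "Region": ["R1", "R2", "R1"],
        "Year": [2020, 2020, 2023],
        "Value": [50.0, 60.0, 55.0],
    }
)

# Dimension table: one row per region, with a surrogate key.
regions = flat[["Region"]].drop_duplicates().reset_index(drop=True)
regions["region_id"] = regions.index + 1

# Dimension table: one row per country, pointing at its region by key.
countries = (
    flat[["Country", "Region"]]
    .drop_duplicates()
    .merge(regions, on="Region")
    .drop(columns="Region")
    .reset_index(drop=True)
)
countries["country_id"] = countries.index + 1

# Fact table: measurements keyed by country_id, so the region name
# is stored once in `regions` instead of being repeated on every row.
fact = flat.merge(countries, on="Country")[["country_id", "Year", "Value"]]
```

Storing the region once per country removes the transitive dependency of region on country from the fact rows, which is the kind of redundancy 3NF is meant to eliminate.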

1NF

image

2NF

image

image

3NF

image

  • Normalizing Period Ranges

image

Fact Table

image

Step 04: Data modeling

Crafted entity-relationship diagrams (ERDs) and established connections between datasets in PostgreSQL, assigning primary and foreign keys within each table.
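
The sketch below shows how primary and foreign keys tie such tables together. The project itself uses PostgreSQL; this example swaps in Python's built-in `sqlite3` module so that it runs without a database server, and the table and column names are hypothetical rather than taken from the project's ERD.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE region (
        region_id   INTEGER PRIMARY KEY,
        region_name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE country (
        country_id   INTEGER PRIMARY KEY,
        country_name TEXT NOT NULL UNIQUE,
        region_id    INTEGER NOT NULL REFERENCES region(region_id)
    );
    CREATE TABLE fact_family_planning (
        country_id INTEGER NOT NULL REFERENCES country(country_id),
        year       INTEGER NOT NULL,
        value_pct  REAL,
        PRIMARY KEY (country_id, year)
    );
    """
)

# Placeholder rows, inserted parent-first so every foreign key resolves.
conn.execute("INSERT INTO region VALUES (1, 'Region A')")
conn.execute("INSERT INTO country VALUES (1, 'Country X', 1)")
conn.execute("INSERT INTO fact_family_planning VALUES (1, 2023, 55.0)")


def insert_orphan_fact() -> bool:
    """Return True if the database rejects a fact row for an unknown country."""
    try:
        conn.execute("INSERT INTO fact_family_planning VALUES (99, 2023, 10.0)")
    except sqlite3.IntegrityError:
        return True
    return False


print("Orphan row rejected:", insert_orphan_fact())
```

The composite primary key on `(country_id, year)` is one possible design choice: it assumes at most one value per country per year.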

image

Step 05: Exploratory data analysis

Explored patterns in the cleaned datasets using Python libraries.
This phase unveils insights and prepares the data for meaningful visualizations.

Continent Level Analysis

image

This continent-level analysis shows the percentage of married or in-union women who can access and use modern family planning methods to control if and when they have children. The project also includes a geographical heat map, saved as an HTML file; a screenshot of the map appears in the data visualization step.
Access to modern family planning methods varies significantly across regions and socioeconomic groups globally. In the Americas and South-East Asia, access rates are notably higher, at around 72% and 71% respectively. These higher rates may be due to better healthcare infrastructure, services, education, awareness, and availability of technology in those regions.
The Eastern Mediterranean, on the other hand, has the lowest percentage at 50.3%. Lower rates are often due to cultural barriers and a lack of such facilities. These disparities show the varying levels of support and the challenges that different regions face in providing family planning services.
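
A regional summary of this kind can be sketched with a pandas `groupby`. The example assumes the regional figure is a simple mean of country-level values, which the analysis above does not state, and it uses placeholder region names and numbers rather than WHO data.

```python
import pandas as pd

# Placeholder rows; region names and values are not WHO figures.
df = pd.DataFrame(
    {
        "Region": ["R1", "R1", "R2", "R2"],
        "Country": ["A", "B", "C", "D"],
        "Value": [70.0, 74.0, 48.0, 52.0],
    }
)

# Mean percentage per region, highest first.
region_mean = (
    df.groupby("Region")["Value"].mean().round(1).sort_values(ascending=False)
)
print(region_mean)
```

A population-weighted mean would be a reasonable alternative if country populations were available, since a simple mean gives small and large countries equal weight.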

Country Level Analysis

image

Top Three Countries within each Continent - Family Planning Data set

image

Here are the highlights for the top three countries within each continent. In Africa, Zimbabwe leads with 84.8% of women
having their needs met. Egypt leads in the Eastern Mediterranean with 80%. Europe sees France at the forefront with 95.5%.
The Democratic People's Republic of Korea leads South-East Asia at 89.6%, and in the Western Pacific, China stands out with an impressive 96.6%.
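
One way to pick the top three countries per region is to sort by value and keep the first three rows of each group. The sketch below uses placeholder data and hypothetical column names; it is not the code from the project notebook.

```python
import pandas as pd

# Placeholder rows; names and values are illustrative only.
df = pd.DataFrame(
    {
        "Region": ["R1", "R1", "R1", "R1", "R2", "R2"],
        "Country": ["A", "B", "C", "D", "E", "F"],
        "Value": [10.0, 40.0, 30.0, 20.0, 5.0, 15.0],
    }
)

# Sort within each region by value (descending), then keep up to 3 rows per region.
top3 = (
    df.sort_values(["Region", "Value"], ascending=[True, False])
    .groupby("Region")
    .head(3)
    .reset_index(drop=True)
)
print(top3)
```

Regions with fewer than three countries simply return all of their rows.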

Step 06: Data visualization

This step uses an interactive geographical heat map for further analysis, transforming complex findings into clear, insightful visual representations.
It ensures that the results are not only understood but also actionable for stakeholders.

Screenshot of the interactive geographical heat map with tooltips (the map is also saved as an HTML file)

Geographical Analysis - Family Planning Data set (Continent Level)

image

Top Three Countries within each Continent - Family Planning Data set

image

Time-Period Analysis

image

Important

Ultimately, the data journey concludes with interpreting the results and weaving them into meaningful conclusions. Through this approach, I ensure that my analysis not only addresses the initial problems but also adds unexpected value to business requirements through my technical expertise.

Dependencies

  • csv
  • os
  • matplotlib (pyplot)
  • pandas
  • numpy
  • seaborn
  • geopandas
  • folium
  • time
  • selenium (webdriver)
  • IPython.display (Image)

Note

Data Flow:

  • Data sourced from WHO -> Processed in Jupyter Notebook -> Stored and retrieved from a SQL database.
  • Schema Diagram: Detailed in the Engineering_ERD folder.
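
The store-and-retrieve part of this flow can be sketched with pandas and a SQL connection. The example uses an in-memory SQLite database as a stand-in for the project's SQL database, and the table name `family_planning` and its columns are hypothetical.

```python
import sqlite3

import pandas as pd

# Placeholder for a cleaned dataset produced in the notebook.
cleaned = pd.DataFrame(
    {
        "country": ["A", "B"],
        "year": [2020, 2023],
        "value_pct": [55.0, 61.5],
    }
)

conn = sqlite3.connect(":memory:")

# Store the DataFrame as a SQL table, replacing any previous version.
cleaned.to_sql("family_planning", conn, index=False, if_exists="replace")

# Retrieve a filtered subset back into pandas for analysis.
loaded = pd.read_sql_query(
    "SELECT country, value_pct FROM family_planning WHERE year >= 2021",
    conn,
)
print(loaded)
```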

Tools Used:

  • Storage: SQL database for organized data storage and retrieval.
  • Processing: Jupyter Notebook (Modern Family Planning Data Cleaning and Transformation.ipynb) for data manipulation and analysis.

Analytical Use Cases

  • Access Disparities: Analyzing regional and socioeconomic variations in access to family planning.

Demonstration

  • Jupyter Notebook: Demonstrates data retrieval and visualization.
  • Visuals: Include geo heat maps and line graphs.

Assumptions:

  • When a study period spans two years (e.g. 2022-2023), the results are assumed to correspond to 12 months and to reflect the latest year (2023).
  • The datasets were broken down into 3-year intervals from 2003 to 2023 to allow consistent analysis of data over time.
  • The study covers married and in-union women of reproductive age, which is assumed to be 15-49 years.
  • The same data collection method is assumed across countries.
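
The first two assumptions can be expressed as two small helper functions. This is a sketch under stated assumptions: the function names are hypothetical, periods are assumed to use a plain hyphen as separator, and the interval boundaries (2003-2005, 2006-2008, and so on up to 2021-2023) are one reading of "3-year intervals starting in 2003".

```python
def period_to_year(period: str) -> int:
    """Map a period label such as '2022-2023' or '2019' to its latest year."""
    years = [int(part) for part in period.strip().split("-")]
    return max(years)


def three_year_bin(year: int, start: int = 2003, end: int = 2023) -> str:
    """Return the label of the 3-year interval that contains `year`."""
    if not start <= year <= end:
        raise ValueError(f"year {year} is outside {start}-{end}")
    lower = start + 3 * ((year - start) // 3)
    return f"{lower}-{lower + 2}"


# A two-year study period is attributed to its latest year,
# then placed in the matching 3-year interval.
year = period_to_year("2022-2023")
print(year, three_year_bin(year))
```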

Limitations:

  • More indicators could have been analyzed to contribute to the overall hypothesis. We focused on 4 key indicators due to time constraints.
  • Period data was not standardized across datasets. Some assumptions had to be made to standardize it and make the datasets fully comparable.

Ethical Considerations:

  • Ensuring the confidentiality and ethical use of data.
  • Addressing biases inherent in data collection methods.

Future Work Scope:

  • Extended Analysis: Incorporate more indicators for a comprehensive view.
  • Data Integration: Enhance the database with additional sources and real-time data.
  • Interactive Dashboards: Develop more interactive visualization tools for dynamic data exploration.
  • Please refer to the Word file for a summary of the findings.

Folder Structure:

  • Extracted Folders: Contains all exported datasets and analysis results.
  • Engineering_ERD: ERD for schema and SQL database export.
  • Project_Analysis: Findings and summary documents.

How to Run:

  • Environment Setup: Ensure you have Python and Jupyter Notebook installed.
  • Dependencies: Install the required libraries: numpy, pandas, matplotlib, seaborn.
  • Run Notebook: Open the .ipynb file in Jupyter Notebook and run the cells sequentially.
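
Before running the notebook, a quick check like the one below can confirm that the required libraries are importable. It is a small convenience sketch, not part of the project; the helper name `missing_packages` is hypothetical.

```python
from importlib.util import find_spec

# Libraries named in the setup steps above.
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn"]


def missing_packages(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing:", ", ".join(missing) if missing else "none")
```

Any package reported as missing can then be installed with the package manager of your choice before opening the notebook.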

image
