Read, store and analyze NFHS-5 data from district-level summaries

  1. Download State and District-level PDFs [Notebook]
    Download PDF reports of key indicators for each state/UT and each of their districts from http://rchiips.org/nfhs/.

  2. Pickle the Indicators [Notebook, Notebook]
    Save the indicators, the names of states/UTs, and their respective districts in dictionary format for easy "pickling" (serializing).
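As a minimal sketch of this step, the snippet below round-trips a dictionary through `pickle`. The dictionary contents, the helper names `save_pickle` and `load_pickle`, and the file name are illustrative assumptions, not the names used in the notebooks.

```python
import pickle

# Hypothetical sample: the real dictionaries cover every state/UT and district.
districts_by_state = {
    "Goa": ["North Goa", "South Goa"],
}


def save_pickle(obj, path):
    """Serialize obj to path with the highest available pickle protocol."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_pickle(path):
    """Load a previously pickled object. Only unpickle files you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


if __name__ == "__main__":
    save_pickle(districts_by_state, "districts_by_state.pkl")  # assumed file name
    print(load_pickle("districts_by_state.pkl"))
```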

  3. Save district-level statistics to DataFrame [Notebook, PY]
    Read the PDF reports sequentially and store 104 indicator values for each of 700+ districts in a CSV file.
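One way this step could look is sketched below: extract text with PyMuPDF, pull out numbered indicator rows, and write one CSV row per district. The row layout matched by `ROW` (`<number>. <indicator name> <value>`), the `Q<number>` column names, the helper names, and the sample text are all assumptions for illustration. The real fact sheets may need a different pattern.

```python
import csv
import re


def pdf_text(path):
    """Return the text of every page of a PDF, using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the helpers below work without it

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


# Assumed row layout: "<number>. <indicator name> <value>"
ROW = re.compile(r"^\s*(\d+)\.\s+(.+?)\s+(\S+)\s*$")


def parse_indicators(text):
    """Map indicator number -> float value, or None when the value is not numeric."""
    values = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        try:
            values[int(match.group(1))] = float(match.group(3))
        except ValueError:
            values[int(match.group(1))] = None  # e.g. a suppressed or missing value
    return values


def write_rows(path, rows, n_indicators=104):
    """Write {district: {indicator number: value}} to CSV, leaving gaps empty."""
    fields = ["district"] + [f"Q{i}" for i in range(1, n_indicators + 1)]
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields, restval="")
        writer.writeheader()
        for district, values in rows.items():
            row = {"district": district}
            for number, value in values.items():
                if value is not None and 1 <= number <= n_indicators:
                    row[f"Q{number}"] = value
            writer.writerow(row)


if __name__ == "__main__":
    # Made-up sample text standing in for pdf_text("some_district.pdf").
    sample = "14. Women who are literate (%) 71.5\n83. Pregnant women who are anaemic (%) *"
    write_rows("nfhs5_districts.csv", {"North Goa": parse_indicators(sample)})
```

Leaving non-numeric values as empty cells keeps the missing entries visible for the imputation step later on.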

  4. Perform PCA, K-Means Clustering on the reported NFHS-5 data [Notebook, PY]
    Perform PCA to (1) plot 2D/3D representations of all 700+ data points and (2) find the k-nearest neighbors of each district, which are then used to (3) impute missing (unavailable) values in the dataset.

    For example, the plot below on the left is a 2D representation of the original 95-dimensional data. Each dot represents a district in the dataset, and the two highlighted in red are from the state of Goa. Reducing the data's original dimensionality to 2 dimensions retains only about 34% of the variance in the data. A 3D representation (on the right below) retains roughly 40% of the variance.

    [Figures: 2D representation by PCA (left), 3D representation by PCA (right)]
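The explained-variance figures quoted above come from the principal components themselves. The sketch below shows how a projection and its variance ratio can be computed with a plain SVD in NumPy. The function name `pca`, the choice to center without rescaling, and the random stand-in data are assumptions. The repository may use a library implementation and different preprocessing.

```python
import numpy as np


def pca(X, n_components):
    """Project X onto its first n_components principal components.

    Returns (scores, explained_variance_ratio), where the ratio has one
    entry per kept component.
    """
    Xc = X - X.mean(axis=0)  # center each column (no rescaling: an assumption)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = S**2 / np.sum(S**2)  # share of total variance per component
    scores = Xc @ Vt[:n_components].T
    return scores, ratio[:n_components]


# Synthetic stand-in for the district-by-indicator matrix (not NFHS-5 data).
rng = np.random.default_rng(0)
X = rng.normal(size=(700, 95))

scores_2d, ratio_2d = pca(X, 2)
scores_3d, ratio_3d = pca(X, 3)
print("2D variance explained:", ratio_2d.sum())
print("3D variance explained:", ratio_3d.sum())
```

The columns of `scores_2d` are the coordinates one would pass to a scatter plot, one point per district.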
  5. Display NFHS-5 data on interactive maps using GeoPandas [Notebook, PY]
    Generate maps to view reported statistics for each district. Missing or unavailable entries are estimated using Principal Component Analysis (PCA). The images below are screenshots of maps showing three such indicators (or statistics) for different districts in the country. The number of principal components used for imputing missing entries is chosen so that they explain 99% of the variance in the dataset.

    (a) Percentage of literate women (aged 15-49)

    Q14

    (b) Percentage of married women (aged 15-49) who follow some family planning method

    Q20

    (c) Percentage of pregnant women (aged 15-49) who are anaemic

    Q83
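The imputation described above can be sketched as follows: keep the smallest number of principal components that reaches 99% of the variance, find each district's nearest neighbors in that reduced space, and fill a missing value with the neighbors' mean. This is one plausible reading, not the repository's confirmed method. The initial column-mean fill, the function names, `k=5`, and the toy data are assumptions.

```python
import numpy as np


def components_for_variance(singular_values, target=0.99):
    """Smallest number of components whose cumulative variance ratio reaches target."""
    cumulative = np.cumsum(singular_values**2) / np.sum(singular_values**2)
    return min(int(np.searchsorted(cumulative, target)) + 1, len(singular_values))


def impute_knn_pca(X, k=5, target=0.99):
    """Fill NaNs using the k nearest rows in PCA space. Returns (filled, n_components)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Assumed starting point: column means, so that PCA can be computed at all.
    # (A column that is entirely NaN would stay NaN.)
    filled = np.where(missing, np.nanmean(X, axis=0), X)

    Xc = filled - filled.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    n = components_for_variance(S, target)
    scores = Xc @ Vt[:n].T

    result = filled.copy()
    for i in np.flatnonzero(missing.any(axis=1)):
        dist = np.linalg.norm(scores - scores[i], axis=1)
        dist[i] = np.inf  # a row is never its own neighbor
        order = np.argsort(dist)
        for j in np.flatnonzero(missing[i]):
            # Nearest rows that actually report indicator j.
            donors = [r for r in order if not missing[r, j]][:k]
            if donors:
                result[i, j] = X[donors, j].mean()
    return result, n


# Toy data (not NFHS-5 values): the last row is missing its third indicator.
toy = np.array([
    [0, 0, 0], [0, 0, 0], [0, 0, 0],
    [100, 100, 10], [100, 100, 10], [100, 100, np.nan],
])
completed, n_used = impute_knn_pca(toy, k=2)
print(n_used, completed[-1])
```

The completed matrix can then be joined to the district shapefile on a shared district key before plotting with GeoPandas.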

Code Credit

@kalyaninagaraj

Resources

  1. National Family Health Survey of India (official website)
  2. fitz, or PyMuPDF (documentation)
  3. pickle (documentation)
  4. GeoPandas (documentation)
  5. District boundary data of India in the form of shapefiles sourced from Kaggle