
Dengue Fever Project

Intro & Problem Statement

Dengue fever is a public health concern in tropical and subtropical regions. Early detection of outbreaks can help health officials take proactive measures to contain their impact.

The goal of this project, as part of the DengAI competition, is to develop a predictive model that accurately forecasts the total number of dengue fever cases in San Juan and Iquitos from historical climatic and environmental data.

Dataset files:

To predict the total_cases label for each (city, year, weekofyear), we will be working with the following datasets, available from the DengAI competition:

  • Training features: dengue_features_train.csv
  • Training labels: dengue_labels_train.csv
  • Test features for prediction: dengue_features_test.csv
  • The format for submission: submission_format.csv

The training and test data cover two cities: San Juan, Puerto Rico and Iquitos, Peru. The labels file provides the total_cases value for each (city, year, weekofyear) combination corresponding to the climate features in dengue_features_train.csv. The test data contains climate features spanning 5 years for San Juan and 3 years for Iquitos.
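
These files can be loaded and joined with pandas; a minimal sketch:

import pandas as pd

features = pd.read_csv('dengue_features_train.csv')
labels = pd.read_csv('dengue_labels_train.csv')
test_features = pd.read_csv('dengue_features_test.csv')

# One row per (city, year, weekofyear) combination
df = features.merge(labels, on=['city', 'year', 'weekofyear'])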

Feature files

Each feature datafile contains the following columns:

city, year, weekofyear, week_start_date, ndvi_ne, ndvi_nw, ndvi_se, ndvi_sw, precipitation_amt_mm, reanalysis_air_temp_k, reanalysis_avg_temp_k, reanalysis_dew_point_temp_k, reanalysis_max_air_temp_k, reanalysis_min_air_temp_k, reanalysis_precip_amt_kg_per_m2, reanalysis_relative_humidity_percent, reanalysis_sat_precip_amt_mm, reanalysis_specific_humidity_g_per_kg, reanalysis_tdtr_k, station_avg_temp_c, station_diur_temp_rng_c, station_max_temp_c, station_min_temp_c, station_precip_mm

Description:

  • city – City abbreviations: sj for San Juan and iq for Iquitos
  • week_start_date – Date given in yyyy-mm-dd format
  • ndvi_ne, ndvi_nw, ndvi_se, ndvi_sw – Normalized Difference Vegetation Index (NDVI) for the northeast, northwest, southeast, and southwest of the city centroid
  • station_max_temp_c – Maximum temperature (°C)
  • station_min_temp_c – Minimum temperature (°C)
  • station_avg_temp_c – Average temperature (°C)
  • station_precip_mm – Total precipitation (mm)
  • station_diur_temp_rng_c – Diurnal temperature range (°C)
  • precipitation_amt_mm – Total precipitation (mm)
  • reanalysis_sat_precip_amt_mm – Total precipitation (mm)
  • reanalysis_dew_point_temp_k – Mean dew point temperature (K)
  • reanalysis_air_temp_k – Mean air temperature (K)
  • reanalysis_relative_humidity_percent – Mean relative humidity (%)
  • reanalysis_specific_humidity_g_per_kg – Mean specific humidity (g/kg)
  • reanalysis_precip_amt_kg_per_m2 – Total precipitation (kg/m²)
  • reanalysis_max_air_temp_k – Maximum air temperature (K)
  • reanalysis_min_air_temp_k – Minimum air temperature (K)
  • reanalysis_avg_temp_k – Average air temperature (K)
  • reanalysis_tdtr_k – Diurnal temperature range (K)

Labels file

The training labels file (dengue_labels_train.csv) contains the following columns:

city year weekofyear total_cases

Prerequisites

Before you begin, ensure you have met the following requirements:

  • You have installed Conda.
  • You have installed Kedro.

If you haven't installed Kedro or Conda, please follow the links above for installation instructions.

Setting Up the Project

To set up the project, follow these steps:

Clone the Repository

First, clone the repository to your local machine:

git clone https://github.com/johnheusinger/dengue_fever.git
cd dengue_fever

Create and Activate a Conda Environment

Create a new Conda environment:

conda create --name your_env_name python=3.12
conda activate your_env_name

Install Dependencies

Install the required packages using the requirements.txt file:

pip install -r requirements.txt

Running the Project

With your Conda environment activated and dependencies installed, you can run the project using Kedro:

kedro run

Approach

1. Checking the Data

Getting familiar with the data (see img.png).

Checking for seasonality (see img_1.png).

Checking for correlation (see img_2.png).

Checking for missing values
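
A quick way to count missing values per column with pandas, assuming df is the merged training DataFrame from the loading sketch above:

print(df.isna().sum())

This produced the following counts: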

year                                       0
weekofyear                                 0
week_start_date                            0
ndvi_ne                                  194
ndvi_nw                                   52
ndvi_se                                   22
ndvi_sw                                   22
precipitation_amt_mm                      13
reanalysis_air_temp_k                     10
reanalysis_avg_temp_k                     10
reanalysis_dew_point_temp_k               10
reanalysis_max_air_temp_k                 10
reanalysis_min_air_temp_k                 10
reanalysis_precip_amt_kg_per_m2           10
reanalysis_relative_humidity_percent      10
reanalysis_sat_precip_amt_mm              13
reanalysis_specific_humidity_g_per_kg     10
reanalysis_tdtr_k                         10
station_avg_temp_c                        43
station_diur_temp_rng_c                   43
station_max_temp_c                        20
station_min_temp_c                        14
station_precip_mm                         22
total_cases                                0

2. Handling Missing Values with KNN Imputer

We used the KNNImputer to fill in the missing values in the dataset.

from sklearn.impute import KNNImputer

# Impute only the numeric columns; KNNImputer cannot handle
# non-numeric columns such as city or week_start_date
numeric_cols = df.select_dtypes(include='number').columns
imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

The idea is that, for each missing entry, the imputer finds the 5 most similar rows based on the observed features and fills the gap with the average of their values for that column.

Going forward, it would be interesting to try different values of n_neighbors, as well as other imputation methods, to see whether they improve the model.

3. Detecting & Handling Outliers - Keep or Remove?

We decided to first impute the missing values and then detect and handle outliers because we assumed that outliers in the dataset still contain valuable information that could be useful for the model.

We used the z-score method to detect outliers in the dataset.
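
A minimal sketch of this kind of check, assuming df has already been imputed and numeric_cols is defined as above (the threshold of 3 standard deviations is an assumption):

import numpy as np

# Column-wise z-scores; values more than 3 standard deviations
# from the column mean are flagged as outliers
z_scores = (df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std()
outliers = np.abs(z_scores) > 3
print(outliers.sum())  # number of flagged values per column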

Overall, we got mixed results: one team member found that removing outliers improved the model, while the other found that keeping them did.

More testing is needed to establish whether or not to keep outliers.

4. Scaling the Data with MaxAbsScaler

We used the MaxAbsScaler to scale the data; it divides each feature by its maximum absolute value, mapping all values into the [-1, 1] range.
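
A minimal sketch, assuming X is the imputed numeric feature matrix:

from sklearn.preprocessing import MaxAbsScaler

# Scale each feature to [-1, 1] by its maximum absolute value
scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)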

5. Identifying the Best Model to Use

Although no separate validation set was provided, we split the training dataset into training and validation sets so that we could compare models and tune parameters.
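
A minimal sketch of the split, assuming X is the prepared feature matrix and y the total_cases labels (the 80/20 ratio and random_state are assumptions):

from sklearn.model_selection import train_test_split

# Hold out 20% of the training data as a validation set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)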

We then tried three models to see which one performs best:

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Fit each candidate on the training split, score it on the validation split
models = [RandomForestRegressor(), LinearRegression(), SVR()]
for model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f'{model} MSE: {mean_squared_error(y_test, y_pred)}')

The MSE for each model was as follows:

RandomForestRegressor() MSE: 406.71694216027873
LinearRegression() MSE: 466.33090156794424
SVR() MSE: 515.257572877534

6. Hyperparameter Tuning

We used GridSearchCV to find the best hyperparameters for the RandomForestRegressor model.

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
# 5-fold cross-validation, scoring by negated MSE (higher is better)
grid_search = GridSearchCV(RandomForestRegressor(), param_grid,
                           cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)

Results of the GridSearchCV:

{'max_depth': 7, 'min_samples_leaf': 4, 'min_samples_split': 5, 'n_estimators': 200}
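
Since GridSearchCV refits the best configuration on the full training split by default (refit=True), the tuned model can be used directly; a minimal sketch of evaluating it on the validation split:

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print(f'Tuned MSE: {mean_squared_error(y_test, y_pred)}')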

7. Experimentation & Model Improvement

Our backlog of assumptions and experiment ideas:

  1. Cyclical encoding for the week of the year (see the sketch after this list) - IMPROVED
  2. Rolling averages (rainfall, temperature, etc.) as new features - DID NOT IMPROVE
  3. Keeping rather than removing outliers - MIXED RESULTS
  4. Splitting by city and building a separate model for each city - DID NOT IMPROVE
  5. Trying pyCaret - PENDING
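
For the cyclical encoding in item 1, a minimal sketch (the new column names are illustrative; 52 is used as the period, although some years have a week 53):

import numpy as np

# Map weekofyear onto the unit circle so week 52 is adjacent to week 1
df['week_sin'] = np.sin(2 * np.pi * df['weekofyear'] / 52)
df['week_cos'] = np.cos(2 * np.pi * df['weekofyear'] / 52)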

Other factors that may have (accidentally) improved the model (still to be tested):

  1. Merging the train and test datasets before imputation, which may have improved imputation quality and, in turn, model performance
  2. Rounding the model output rather than directly casting it to an integer, which truncates (see the sketch below)
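
For the second point, a minimal sketch of the difference (variable names are illustrative):

import numpy as np

raw = best_model.predict(X_test)
rounded = np.rint(raw).astype(int)    # rounds to the nearest integer
truncated = raw.astype(int)           # casting truncates toward zero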

8. Score & Ranking

  • Best Score: 25.3413
  • Rank: 1,335
