ipersids/WiDS-Datathon-2024-Challenge-2

WiDS Datathon 2024 Challenge #2

Datathon 2024 Challenge Theme: Equity in Healthcare.

The WiDS Datathon is hosted on Kaggle and is built around a predictive analytics challenge focused on social impact. This Datathon used a real-world evidence dataset from HealthVerity, one of the largest healthcare data ecosystems in the US.

Data description:
The datasets contain health-related information on patients diagnosed with metastatic triple-negative breast cancer in the USA. In addition, social, economic, demographic and climatic information was included using zip codes.

Task:
Predict the time of metastatic diagnosis for patients in the test dataset using the patient characteristics and information provided.

Libraries:

# Data manipulation
import pandas as pd
import numpy as np
import re
import datetime as dt

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Mathematical and statistical packages
from math import ceil
from scipy.stats import pearsonr, spearmanr, chi2_contingency, pointbiserialr

# Machine learning libraries
# Preprocessing
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
# Clustering
from sklearn.cluster import KMeans
# Linear models
from sklearn.linear_model import Ridge
# Tree-based models
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.ensemble import RandomForestClassifier
# Neural Networks
from sklearn.neural_network import MLPRegressor
# Model performance metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import silhouette_score, mean_squared_error, r2_score, f1_score

# Modules and libraries for NLP processing of text data
import string
import nltk

Work Process

  • Exploratory Data Analysis (EDA)
    • Key takeaways from the discovery phase
    • Key takeaways from the structuring phase
    • Cleaning data strategy overview
    • Resulting dataset validation
  • Attribute Selection and Final Data Transformation
  • Machine Learning Phase
    • Construction phase
    • Execution phase
    • Submission


Submission:
The final prediction was made with the Gradient Boosting Regressor model, which achieved the lowest Root Mean Squared Error (RMSE) among the models compared.
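
A minimal sketch of how such an RMSE comparison could look, using regressors from the import list above. The synthetic data from `make_regression`, the validation split, and all hyperparameters are assumptions for illustration only; they are not the project's actual features, settings, or results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): the real challenge dataset is not used here.
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Candidate models with illustrative settings, not the project's tuned values.
models = {
    "Ridge": Ridge(alpha=1.0),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
}

rmse_by_model = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_val)
    # RMSE is the square root of the mean squared error on the validation set.
    rmse_by_model[name] = float(np.sqrt(mean_squared_error(y_val, predictions)))

# The model with the lowest validation RMSE is the one selected.
best_model = min(rmse_by_model, key=rmse_by_model.get)
for name, rmse in sorted(rmse_by_model.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {rmse:.2f}")
print("Lowest RMSE:", best_model)
```

The ranking on this synthetic data need not match the ranking reported for the challenge data; the sketch only shows the selection rule (lowest validation RMSE wins).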



Team of DataTribe Collective:

About

Women in Data Science (WiDS) Worldwide Datathon: