Diabetes-Prediction

Ankita Priya

Pre-final year CS undergrad @B.I.T Mesra

Purpose of Repository

Analysing the Pima Indians Diabetes Dataset for research purposes.

Coding Language, IDE and Libraries used

All coding work was done in Python 3.8, using Jupyter Notebook from the Anaconda distribution as the IDE. Anaconda can be downloaded from its official website.

The Machine Learning models are coded using the following libraries:

Scikit-learn
NumPy
Matplotlib
Pandas
Seaborn

About the Dataset

This study uses the Pima Indians Diabetes Dataset, provided by the UCI Machine Learning Repository. The dataset was originally collected by the National Institute of Diabetes and Digestive and Kidney Diseases. It consists of eight medical predictor variables (number of pregnancies, glucose concentration, diastolic blood pressure, triceps skin fold thickness, insulin level, BMI, diabetes pedigree function and age) plus one outcome column, i.e. nine columns in total. The dataset holds records for 768 patients, all of whom are female, at least 21 years old and of Pima Indian heritage.

Dataset used: https://www.kaggle.com/uciml/pima-indians-diabetes-database
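As a rough sketch of how the dataset could be loaded and inspected with Pandas: the column names below are assumed to follow the Kaggle CSV layout, `load_diabetes_csv` is a hypothetical helper, and the two demo rows are made up so that the example runs without the real file.

```python
import io

import pandas as pd

# Column names assumed from the Kaggle CSV; adjust if your copy differs.
EXPECTED_COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]


def load_diabetes_csv(source) -> pd.DataFrame:
    """Read the CSV from a path or file-like object and check its columns."""
    df = pd.read_csv(source)
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df


# Two made-up rows in the same layout, for illustration only.
demo_csv = io.StringIO(
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "2,110,70,0,0,28.5,0.35,30,0\n"
    "5,0,80,32,120,33.1,0.60,45,1\n"
)
demo = load_diabetes_csv(demo_csv)

# Count zeros per predictor column; zeros may stand in for missing values.
zero_counts = (demo.drop(columns="Outcome") == 0).sum()
print(zero_counts)
```

With the real file, the call would look like `load_diabetes_csv("diabetes.csv")`, where the path is a placeholder for wherever the downloaded CSV is stored.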

Data Cleaning

I had to perform data cleaning because the dataset has many missing values. These values are not recorded as NaN; instead, they appear as zeros. I replaced them using two methods:

Replacing all zeros with the mean of the given values.
Replacing all zeros with the median of the given values.

Data was replaced in the following features:

Glucose level: Values are expected to lie between 70 mg/dl and 150 mg/dl. A glucose level of 200 mg/dl or higher is used to diagnose diabetes.
Blood Pressure: A diastolic blood pressure below 55 mm Hg or above 90 mm Hg is highly dangerous.
Skin Thickness: This value reflects body fat. The average value in women is about 23 mm.
BMI: Body Mass Index is usually 18.5 to 25. A BMI between 25 and 30 falls in the overweight range. A BMI of 30 or over falls in the obese range.
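The two replacement methods can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names are assumed from the Kaggle CSV, `replace_zeros` is a hypothetical helper, and the mean or median is assumed to be computed over the non-zero values only.

```python
import numpy as np
import pandas as pd

# Columns where a zero is treated as a missing value (assumed names).
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "BMI"]


def replace_zeros(df: pd.DataFrame, strategy: str = "mean") -> pd.DataFrame:
    """Return a copy of df with zeros in ZERO_AS_MISSING replaced by the
    column mean or median, computed over the non-zero values only."""
    if strategy not in ("mean", "median"):
        raise ValueError("strategy must be 'mean' or 'median'")
    cleaned = df.copy()
    for col in ZERO_AS_MISSING:
        # Treat zeros as NaN so they do not drag the statistic down.
        values = cleaned[col].replace(0, np.nan).astype(float)
        fill = values.mean() if strategy == "mean" else values.median()
        cleaned[col] = values.fillna(fill)
    return cleaned


# Tiny made-up sample, for illustration only (not real patient data).
sample = pd.DataFrame({
    "Glucose": [100, 0, 140],
    "BloodPressure": [70, 80, 0],
    "SkinThickness": [0, 20, 30],
    "BMI": [25.0, 0.0, 35.0],
})
mean_filled = replace_zeros(sample, "mean")
median_filled = replace_zeros(sample, "median")
print(mean_filled)
```

Computing the statistic on non-zero values is a design choice made here; including the zeros would pull the replacement value toward zero.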

Algorithms Used

The Machine Learning algorithms used are:

Logistic Regression
K-Nearest Neighbors (KNN)
Support Vector Classifier (SVC), a type of Support Vector Machine (SVM)
Random Forest Classification
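A minimal sketch of how these four algorithms could be trained and compared with Scikit-learn is given below. The hyperparameters, the 80/20 split and the synthetic data from `make_classification` are all assumptions for illustration; they are not the settings or data used in this repository.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the dataset (768 rows, 8 features).
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Illustrative default-like settings, not tuned values.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVC": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns the mean accuracy on the held-out split.
    scores[name] = model.score(X_test, y_test)

for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Keeping the models in a dictionary makes it easy to rerun the same loop on different feature subsets.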

Use of this model

This model is used for research on feature selection and how it behaves with different algorithms.

The best result was obtained with the Logistic Regression algorithm on five features, namely Pregnancies, Glucose, Blood Pressure, Skin Thickness and Age, without any manipulation of the data. The accuracy obtained was 81.0%. The precision and recall obtained were 80.0% and 65.0% respectively.
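The following sketch shows how a five-feature Logistic Regression run and its accuracy, precision and recall could be computed. It uses randomly generated stand-in data with a hypothetical label rule, so its numbers will not match the figures reported above; the column names and the 80/20 split are also assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# The five selected features (column names assumed from the Kaggle CSV).
FEATURES = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Age"]

# Randomly generated stand-in data (NOT the real dataset).
rng = np.random.default_rng(0)
n = 768
df = pd.DataFrame({
    "Pregnancies": rng.integers(0, 15, n),
    "Glucose": rng.normal(120, 30, n),
    "BloodPressure": rng.normal(70, 12, n),
    "SkinThickness": rng.normal(25, 8, n),
    "Age": rng.integers(21, 80, n),
})
# Hypothetical label: more likely positive when glucose is high.
df["Outcome"] = (df["Glucose"] + rng.normal(0, 20, n) > 135).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df[FEATURES], df["Outcome"], test_size=0.2, random_state=0,
    stratify=df["Outcome"],
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision is the share of predicted positives that are truly positive, while recall is the share of actual positives that the model finds, which is why the two can differ noticeably on the same run.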

Link to the research paper

Coming shortly.

References

Dataset: https://www.kaggle.com/uciml/pima-indians-diabetes-database

Reference site: https://towardsdatascience.com/end-to-end-data-science-example-predicting-diabetes-with-logistic-regression-db9bc88b4d16
