I learned to build a logistic regression model with PySpark MLlib to classify patients as diabetic or non-diabetic, using the popular Pima Indians Diabetes dataset. The goal was to apply a simple logistic regression classifier from PySpark's machine learning library to diabetes classification. The entire project was carried out in the Google Colab environment, with PySpark installed there. This was a project course on Coursera, and I learned the following from it:
- To build and train a logistic regression classifier using PySpark MLlib.
- To set up PySpark in the Colab environment.
- To work with Spark DataFrames.
- To prepare data for analysis.
By the end of this project, I was able to build a logistic regression classifier with PySpark MLlib that distinguishes diabetic from non-diabetic patients. I was also able to set up and work with PySpark in Google Colab, and to clean and prepare data for analysis.