- This project was built with PySpark in a Jupyter notebook for the subject "Data Analytics" in my 6th semester of college
- The project consists of the following steps (a code sketch of these steps follows the list):
  - Initializing the connection with PySpark
  - Data loading and preprocessing
  - Data cleaning
  - Data normalization
  - Splitting the data
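
The sketch below shows roughly how these steps might look in PySpark. The file name, column names, split ratio, and seed are placeholder assumptions, not the notebook's actual values.

```python
# Minimal sketch of the pipeline steps; names and parameters are placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, MinMaxScaler

# Initialize the connection with PySpark
spark = SparkSession.builder.appName("DataAnalytics").getOrCreate()

# Load the data (CSV assumed; the notebook's actual files may differ)
df = spark.read.csv("dataset.csv", header=True, inferSchema=True)

# Clean the data: drop duplicate rows and rows with missing values
df = df.dropDuplicates().dropna()

# Normalize: assemble numeric columns into a vector, then scale to [0, 1]
assembler = VectorAssembler(inputCols=["col_a", "col_b"], outputCol="features")
scaler = MinMaxScaler(inputCol="features", outputCol="scaled_features")
df = assembler.transform(df)
df = scaler.fit(df).transform(df)

# Split the data into training and test sets
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)
```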
- The notebook uses two datasets (sketches of both applications follow this list):
  - One was used for the clustering application (k-means)
  - The other was used for random forest classification
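
A minimal sketch of the k-means application, continuing from the preprocessing sketch above (`train_df` and the `scaled_features` column are assumed from there; `k=3` is an arbitrary placeholder):

```python
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Fit k-means on the scaled feature vectors; k is a placeholder value
kmeans = KMeans(featuresCol="scaled_features", k=3, seed=42)
model = kmeans.fit(train_df)
clustered = model.transform(train_df)

# ClusteringEvaluator computes the silhouette score by default
evaluator = ClusteringEvaluator(featuresCol="scaled_features")
print("Silhouette:", evaluator.evaluate(clustered))
```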
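
And a similar sketch for the random forest classification on the second dataset, with the same preprocessing assumed; the `label` column name and `numTrees` value are placeholders:

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Train a random forest; "label" stands in for the dataset's target column
rf = RandomForestClassifier(featuresCol="scaled_features", labelCol="label",
                            numTrees=100, seed=42)
model = rf.fit(train_df)
predictions = model.transform(test_df)

# Report accuracy on the held-out test split
evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction",
                                              metricName="accuracy")
print("Accuracy:", evaluator.evaluate(predictions))
```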
- The entire project was done in Apache Spark through its Python API, PySpark
- Please use `git lfs clone` to clone the repo, as the datasets may not be accessible from a plain zip download
- Built with Apache Spark 3.1.1 and Python 3.8.3