- This project was built with PySpark in a Jupyter notebook for the subject "Data Analytics" in my 6th semester of college
- The project consists of the following steps (a code sketch of these steps follows the list):
  - Initializing the connection with PySpark
  - Data loading and preprocessing
  - Data cleaning
  - Data normalization
  - Splitting the data
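
The sketch below shows roughly how these steps might look in PySpark. The file name, column names, split ratio, and seed are placeholder assumptions, not the notebook's actual values.

```python
# Minimal sketch of the pipeline steps; names and parameters are placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, MinMaxScaler

# Initialize the connection with PySpark
spark = SparkSession.builder.appName("DataAnalytics").getOrCreate()

# Load the data (CSV assumed; the notebook's actual files may differ)
df = spark.read.csv("dataset.csv", header=True, inferSchema=True)

# Clean the data: drop duplicate rows and rows with missing values
df = df.dropDuplicates().dropna()

# Normalize: assemble numeric columns into a vector, then scale to [0, 1]
assembler = VectorAssembler(inputCols=["col_a", "col_b"], outputCol="features")
scaler = MinMaxScaler(inputCol="features", outputCol="scaled_features")
df = assembler.transform(df)
df = scaler.fit(df).transform(df)

# Split the data into training and test sets
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)
```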
- The notebook uses two datasets (sketches of both applications follow this list):
  - One was used for the clustering application (k-means)
  - The other was used for random forest classification
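
A minimal sketch of the k-means application, continuing from the preprocessing sketch above (`train_df` and the `scaled_features` column are assumed from there; `k=3` is an arbitrary placeholder):

```python
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Fit k-means on the scaled feature vectors; k is a placeholder value
kmeans = KMeans(featuresCol="scaled_features", k=3, seed=42)
model = kmeans.fit(train_df)
clustered = model.transform(train_df)

# ClusteringEvaluator computes the silhouette score by default
evaluator = ClusteringEvaluator(featuresCol="scaled_features")
print("Silhouette:", evaluator.evaluate(clustered))
```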
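
And a similar sketch for the random forest classification on the second dataset, with the same preprocessing assumed; the `label` column name and `numTrees` value are placeholders:

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Train a random forest; "label" stands in for the dataset's target column
rf = RandomForestClassifier(featuresCol="scaled_features", labelCol="label",
                            numTrees=100, seed=42)
model = rf.fit(train_df)
predictions = model.transform(test_df)

# Report accuracy on the held-out test split
evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction",
                                              metricName="accuracy")
print("Accuracy:", evaluator.evaluate(predictions))
```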
- The entire project was done in Apache Spark through its Python API, PySpark
- Please use `git lfs clone` to clone the repo, as the datasets may not be accessible from a plain zip download
- Built with Apache Spark 3.1.1 and Python 3.8.3