This repository includes the results of my first data science project during my data science bootcamp at neuefische in Hamburg.
This project covers the complete data science lifecycle on a single dataset. The given dataset is the King County House dataset, which includes house sales in the King County area from the beginning of May 2014 to the end of May 2015. My task in this project is to derive at least 3 recommendations for buyers based on the given dataset and the features developed from it. Furthermore, a multivariate linear regression model is developed to predict house prices. To give an impression of the King County area, the map of King County is shown here.
This GitHub repository includes the following files:
- Jupyter notebook with all code and descriptive text for all steps of the data science lifecycle: Jupyter Notebook
- Slides of the presentation as a PDF file: Slides
- under figures, the output figures are stored
- under rawdata, the given original dataset and the given column description are stored
The online presentation on GSlides is here
The focus of this project is mainly on EDA (Exploratory Data Analysis), but all steps of the data science lifecycle were carried out during the project. These steps are summarized below:
- What is the objective of this project?
- What problems need to be tackled?
- Get data or scrape data.
- Fix missing data.
- Fix inconsistencies based on assumptions.
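The cleaning steps above can be illustrated with a minimal sketch. It assumes pandas is used and that the raw data has columns named `sqft_basement`, `yr_renovated` and `waterfront`; the toy rows, the `"?"` placeholder and the rule "missing means zero" are assumptions for illustration, not necessarily the choices made in the notebook.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the raw data (values and column names are assumptions).
raw = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0, 604000.0],
    "sqft_basement": ["0.0", "400.0", "?", "910.0"],
    "yr_renovated": [0.0, 1991.0, np.nan, 0.0],
    "waterfront": [np.nan, 0.0, 0.0, 1.0],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; the input frame is left untouched."""
    out = df.copy()
    # Placeholder strings such as "?" become NaN and the column becomes numeric.
    out["sqft_basement"] = pd.to_numeric(out["sqft_basement"], errors="coerce")
    # Assumption: a missing value means no basement, never renovated, no waterfront.
    cols = ["sqft_basement", "yr_renovated", "waterfront"]
    out[cols] = out[cols].fillna(0)
    return out


cleaned = clean(raw)
print(cleaned.isna().sum())
```

Filling with zero is only one possible assumption; dropping the affected rows or imputing a median would be equally valid starting points, depending on how many values are missing.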
Visually analyze your data by using:
- Correlation analysis
- Heatmap
- Histograms
- Scatter Plots
- Box plots
- Surface Plots
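As a rough illustration of the correlation analysis behind a heatmap, the sketch below computes a correlation matrix with NumPy. The column names and the five toy rows are assumptions chosen for the example, not values taken from the notebook.

```python
import numpy as np

# Toy data: one row per house, one column per variable (assumed names).
names = ["price", "sqft_living", "bedrooms"]
data = np.array([
    [221900, 1180, 3],
    [538000, 2570, 3],
    [180000, 770, 2],
    [604000, 1960, 4],
    [510000, 1680, 3],
], dtype=float)

# rowvar=False treats each column as a variable and each row as an observation.
corr = np.corrcoef(data, rowvar=False)

# Correlation of every other variable with the price (price is column 0).
price_corr = {name: corr[0, i] for i, name in enumerate(names) if name != "price"}

for name, value in sorted(price_corr.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {value:.2f}")
```

The resulting matrix is what a heatmap visualizes; it could be passed to a plotting function such as `seaborn.heatmap` to obtain the figure.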
Select important features and develop new, more meaningful features from the data. In this project, new features describing the age of a house, its distance to Seattle and its renovation status were developed.
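The three kinds of features mentioned above could be derived roughly as follows. This is a sketch under assumptions: the field names (`sale_year`, `yr_built`, `yr_renovated`, `lat`, `long`), the reference coordinates for Seattle and the use of the haversine formula are illustrative choices, and the notebook may define the features differently.

```python
import math

# Approximate coordinates of downtown Seattle (assumed reference point).
SEATTLE_LAT, SEATTLE_LON = 47.6062, -122.3321


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def add_features(house: dict) -> dict:
    """Return a copy of the record with age, renovation flag and distance added."""
    return {
        **house,
        "age": house["sale_year"] - house["yr_built"],
        # Assumption: a renovation year of 0 means the house was never renovated.
        "renovated": house["yr_renovated"] > 0,
        "dist_seattle_km": haversine_km(
            house["lat"], house["long"], SEATTLE_LAT, SEATTLE_LON
        ),
    }


# Hypothetical example record.
example = add_features(
    {"sale_year": 2014, "yr_built": 1955, "yr_renovated": 0, "lat": 47.5112, "long": -122.257}
)
print(example["age"], example["renovated"], round(example["dist_seattle_km"], 1))
```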
Use machine learning algorithms to make predictions. In this case, a multivariate linear regression model that includes the new features was developed.
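To show what fitting a multivariate linear regression involves, here is a minimal sketch using ordinary least squares in NumPy. The data is synthetic and noise-free with known coefficients, and the helper names `fit_linear` and `predict` are hypothetical; the library and model setup used in the notebook may differ.

```python
import numpy as np

# Synthetic data with known coefficients, so the fit can be checked.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 3))  # 50 samples, 3 features
true_intercept = 1.0
true_coef = np.array([3.0, -2.0, 0.5])
y = true_intercept + X @ true_coef  # noise-free target


def fit_linear(X, y):
    """Ordinary least squares; returns (intercept, coefficients)."""
    # A leading column of ones lets the solver estimate the intercept.
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[0], beta[1:]


def predict(X, intercept, coef):
    """Predicted target for each row of X."""
    return intercept + X @ coef


intercept, coef = fit_linear(X, y)
print(intercept, coef)
```

On real house data the target would be the price and the columns of `X` would be the selected and engineered features; the fit would then be evaluated on held-out data rather than compared against known coefficients.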
Communicate the key findings using plots and visualizations. In this project, the findings are communicated in a presentation to non-technical stakeholders.