This repository contains two Google Colab notebooks on linear regression. The first shows how the simplest linear regression model is implemented under the hood and walks through the math behind it. The second explores and analyzes California housing price data, with the goal of building a model that predicts median house value from various features.
- This notebook delves into the mathematical implementation of the simplest linear regression model, with a single input feature.
- It explains how linear regression learns a linear relationship between a feature and a target variable.
- Through visualizations and code examples, the notebook clarifies the concepts of cost function, gradient descent, and model fitting.
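
As a rough illustration of the cost function and gradient descent ideas listed above, here is a minimal sketch for a single-feature model `y ≈ w * x + b`. It is not taken from the notebook: the function names, the halved mean squared error cost, the learning rate, and the synthetic data are all assumptions chosen for the example.

```python
import numpy as np


def cost(x, y, w, b):
    """Halved mean squared error between predictions w*x + b and targets y."""
    residuals = w * x + b - y
    return np.mean(residuals ** 2) / 2


def gradient_descent(x, y, lr=0.5, n_iters=1000):
    """Fit w and b by batch gradient descent; lr and n_iters are illustrative."""
    w, b = 0.0, 0.0
    history = []
    for _ in range(n_iters):
        residuals = w * x + b - y
        # Partial derivatives of the halved MSE with respect to w and b.
        grad_w = np.mean(residuals * x)
        grad_b = np.mean(residuals)
        w -= lr * grad_w
        b -= lr * grad_b
        history.append(cost(x, y, w, b))
    return w, b, history


# Synthetic data (an assumption): true slope 3.0, true intercept 2.0, small noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 0.1, size=200)

w, b, history = gradient_descent(x, y)
print(f"w = {w:.3f}, b = {b:.3f}, final cost = {history[-1]:.5f}")
```

Plotting `history` against the iteration number is one simple way to check that the cost is decreasing and the fit has converged.
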
- This notebook applies linear regression to predict median house value in California using a real-world dataset.
- It performs exploratory data analysis (EDA) to understand the data distribution and identify potential issues.
- The notebook covers pre-processing steps such as handling skewed features, along with feature engineering to create potentially useful new features.
- It highlights the importance of consistent pre-processing between training and testing data.
- The notebook evaluates the model's performance using metrics like R-squared, mean absolute error (MAE), and mean squared error (MSE).
- The notebook concludes with visualizations to compare predicted and actual median house values.
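
The workflow described above can be sketched roughly as follows. This is an illustrative outline rather than the notebook's actual code: the data is synthetic, the column names (`median_income`, `households`, `total_rooms`, `median_house_value`) are assumed for the example, and the log transform and the `rooms_per_household` ratio are just one plausible choice of skew handling and feature engineering. The point it tries to show is that a single `preprocess` function is applied to both the training and the test split.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (column names are assumptions).
rng = np.random.default_rng(42)
n = 500
households = rng.integers(50, 1000, size=n)
df = pd.DataFrame({
    "median_income": rng.lognormal(mean=1.2, sigma=0.4, size=n),
    "households": households,
    "total_rooms": households * rng.uniform(3, 8, size=n),
})
df["median_house_value"] = (
    40_000 * df["median_income"]
    + 10_000 * df["total_rooms"] / df["households"]
    + rng.normal(0, 20_000, size=n)
)


def preprocess(frame):
    """Apply the same transformations to any split (train or test)."""
    out = pd.DataFrame(index=frame.index)
    # Log-transform right-skewed count features.
    out["log_total_rooms"] = np.log1p(frame["total_rooms"])
    out["log_households"] = np.log1p(frame["households"])
    # Engineered ratio feature.
    out["rooms_per_household"] = frame["total_rooms"] / frame["households"]
    out["median_income"] = frame["median_income"]
    return out


train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)
X_train, y_train = preprocess(train_df), train_df["median_house_value"]
X_test, y_test = preprocess(test_df), test_df["median_house_value"]

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"R^2 = {r2:.3f}, MAE = {mae:.0f}, MSE = {mse:.0f}")
```

Keeping the transformations in one function makes it harder for the training and test features to drift apart, which is the consistency concern the list above raises.
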
Note: These notebooks require several Python libraries, including pandas, seaborn, numpy, and scikit-learn. These libraries are typically pre-installed in Colab environments.
- The California housing dataset is a public dataset available on Kaggle.
- This is a foundational example of linear regression for housing price prediction. There are many techniques to improve the model's performance, such as using more advanced algorithms or performing hyperparameter tuning.