Exploratory Data Analysis (EDA), Statistical Analysis in PySpark and building Predictive Models using Python

Overview

This project focuses on analyzing weather data, performing exploratory data analysis (EDA), and building predictive models. The primary goals include:

- Exploring the weather dataset and gaining insights into the relationships between variables.
- Developing a linear regression model to predict average temperature.
- Creating a logistic regression model to predict precipitation.
- Implementing k-fold cross-validation for model evaluation.

Files

The project consists of the following files:

weather_EDA.ipynb:
- This Jupyter Notebook contains code and visualizations for Exploratory Data Analysis (EDA) of the weather dataset. It explores the relationships between different weather variables.
weather_regression.ipynb:
- This notebook focuses on building a linear regression model to predict the average temperature. It also includes Box-Cox transformation and stepwise feature selection techniques.
weather_pca.ipynb:
- This notebook applies Principal Component Analysis (PCA) to examine how the variables in the weather dataset vary together. The results support dimensionality reduction and help inform feature selection.
weather_cross_validation.ipynb:
- This notebook covers the implementation of k-fold cross-validation for model evaluation. It includes train-test splitting and k-fold validation for both regression and logistic models.
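
As a rough illustration of the Box-Cox step mentioned for weather_regression.ipynb, the sketch below transforms a skewed predictor and then fits a linear regression. It runs on synthetic data, and the variable names (`humidity`, `radiation`, `avg_temp`) are assumptions made for this example, not columns taken from the notebook. The notebook may well apply Box-Cox to a different variable (for instance the target), and stepwise selection is not shown here.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for weather features (hypothetical, not the INMET data).
humidity = rng.uniform(30, 95, size=n)                  # relative humidity, %
radiation = rng.gamma(shape=2.0, scale=500.0, size=n)   # right-skewed, strictly positive

# Synthetic target: average temperature depends on both features plus noise.
avg_temp = 30 - 0.12 * humidity + 2.0 * np.log(radiation) + rng.normal(0, 1, size=n)

# Box-Cox needs strictly positive input; lambda is estimated by maximum likelihood.
radiation_bc, lam = stats.boxcox(radiation)

# Fit ordinary least squares on the original and transformed predictors.
X = np.column_stack([humidity, radiation_bc])
model = LinearRegression().fit(X, avg_temp)
r2 = model.score(X, avg_temp)

print(f"Box-Cox lambda: {lam:.3f}")
print(f"Skewness before: {stats.skew(radiation):.2f}, after: {stats.skew(radiation_bc):.2f}")
print(f"In-sample R^2: {r2:.3f}")
```

The skewness printout is one quick way to check whether the transformation brought the variable closer to symmetric before it enters the regression.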

Libraries Used

The project utilizes several Python libraries for data analysis, visualization, and modeling:

pyspark.sql and pyspark.sql.functions: Used for Spark DataFrame operations.
pandas: Utilized for data manipulation and analysis.
matplotlib.pyplot: Used for creating data visualizations.
factor_analyzer: Employed for Factor Analysis to understand variable relationships.
pingouin: Used for statistical analysis and hypothesis testing.
seaborn and numpy: seaborn is used for statistical data visualization and numpy for numerical operations.
statsmodels: Used for regression analysis and model summaries.
scipy.stats: Utilized for Pearson correlation calculation and statistical tests.
sklearn (scikit-learn): Provides functions for model building, preprocessing, and evaluation.
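
To show how a couple of these libraries fit together, here is a minimal PCA sketch with numpy and scikit-learn. The data is synthetic, and the three columns are hypothetical stand-ins for weather variables. It is not the code from weather_pca.ipynb, only an example of the general technique.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 300

# Two strongly correlated columns plus one independent column (all hypothetical).
base = rng.normal(size=n)
X = np.column_stack([
    base + 0.1 * rng.normal(size=n),   # stand-in for maximum temperature
    base + 0.1 * rng.normal(size=n),   # stand-in for minimum temperature
    rng.normal(size=n),                # stand-in for wind speed
])

# Standardize first so that each variable contributes on the same scale.
X_std = StandardScaler().fit_transform(X)

# Keep the first two principal components.
pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)
explained = pca.explained_variance_ratio_

print("Explained variance ratio:", np.round(explained, 3))
print("Loadings (rows = components):")
print(np.round(pca.components_, 3))
```

Because the first two columns move together, most of the variance should load on a single component, which is the kind of redundancy PCA is meant to expose.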

Target Variables

Average Temperature: This is the target variable for linear regression modeling. The goal is to predict the average temperature based on other weather-related features.
Precipitation Dummy: This is the target variable for logistic regression modeling and k-fold cross-validation. It is a binary variable indicating whether precipitation occurred on a given day.
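
The binary precipitation target can be derived from a rainfall amount column. The sketch below assumes a hypothetical `precip_mm` column and a "greater than zero" rule. The actual column name and threshold used in the notebooks may differ. The data is synthetic, and the scaled logistic regression pipeline is only one reasonable way to fit such a target.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 400

# Synthetic daily observations (hypothetical column names, not the INMET schema).
df = pd.DataFrame({
    "humidity": rng.uniform(30, 100, size=n),   # relative humidity, %
    "pressure": rng.normal(930, 5, size=n),     # atmospheric pressure, hPa
})

# Rain is made more likely on humid days; dry days get exactly 0 mm.
wet = rng.uniform(size=n) < (df["humidity"] - 30) / 70
df["precip_mm"] = np.where(wet, rng.exponential(5.0, size=n), 0.0)

# Binary target: 1 if any precipitation was recorded that day, else 0.
df["precip_dummy"] = (df["precip_mm"] > 0).astype(int)

# Scale the features, then fit a logistic regression on the dummy.
features = ["humidity", "pressure"]
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(df[features], df["precip_dummy"])

# Predicted probability of precipitation for each day.
proba = clf.predict_proba(df[features])[:, 1]
print(df["precip_dummy"].value_counts())
print("Coefficients (humidity, pressure):", np.round(clf[-1].coef_[0], 3))
```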

Usage

Work through the notebooks in the following order to understand the project flow:

- Start with weather_EDA.ipynb to explore the dataset and gain insights into variable relationships.
- Move on to weather_regression.ipynb to build a linear regression model for predicting average temperature.
- Refer to weather_pca.ipynb to see how Principal Component Analysis can support dimensionality reduction and feature selection.
- Lastly, review weather_cross_validation.ipynb to see how train-test splitting and k-fold cross-validation are used for model evaluation.

Each notebook includes code, explanations, and visualizations to facilitate understanding.
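
For readers who want a quick picture of the evaluation step before opening weather_cross_validation.ipynb, the sketch below runs a train-test split and 5-fold cross-validation with scikit-learn on synthetic data. The number of folds, the scoring metrics, and the use of `StratifiedKFold` for the binary target are choices made for this example. The notebook may use plain `KFold` or different settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import (
    KFold,
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)

rng = np.random.default_rng(7)
n = 300

# Synthetic features and targets (not the INMET data).
X = rng.normal(size=(n, 3))
y_reg = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 0.5, size=n)   # continuous target
y_clf = (X[:, 0] + rng.normal(0, 0.5, size=n) > 0).astype(int)       # binary target

# Single hold-out split: fit on 80% of the rows, score on the remaining 20%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_reg, test_size=0.2, random_state=0
)
holdout_r2 = LinearRegression().fit(X_train, y_train).score(X_test, y_test)

# 5-fold cross-validation for the regression model, scored by R^2.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
r2_scores = cross_val_score(LinearRegression(), X, y_reg, cv=kf, scoring="r2")

# Stratified 5-fold keeps the class balance similar in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
acc_scores = cross_val_score(
    LogisticRegression(), X, y_clf, cv=skf, scoring="accuracy"
)

print(f"Hold-out R^2: {holdout_r2:.3f}")
print(f"5-fold R^2: mean {r2_scores.mean():.3f}, std {r2_scores.std():.3f}")
print(f"5-fold accuracy: mean {acc_scores.mean():.3f}, std {acc_scores.std():.3f}")
```

Reporting the mean and spread across folds gives a steadier estimate than a single split, because every observation is used for testing exactly once.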

Data Source

The dataset used in this project was obtained from the Instituto Nacional de Meteorologia (INMET), Brazil's National Institute of Meteorology. You can access historical weather data from the INMET website at the following link:

INMET Historical Weather Data

Please note that you may need to register or follow specific procedures on the INMET website to access the data.
