This project performs analysis on earthquake data using PySpark SQL and DataFrames.
The data is stored in a MySQL database table named neic_earthquakes with the following schema:
CREATE TABLE IF NOT EXISTS neic_earthquakes (
`Date` DATE,
`Time` TIME,
`Latitude` DECIMAL(10, 6),
`Longitude` DECIMAL(10, 6),
`Type` VARCHAR(255),
`Depth` DECIMAL(10, 2),
`Depth Error` DECIMAL(10, 2),
`Depth Seismic Stations` INT,
`Magnitude` DECIMAL(3, 1),
`Magnitude Type` VARCHAR(255),
`Magnitude Error` DECIMAL(3, 1),
`Magnitude Seismic Stations` INT,
`Azimuthal Gap` DECIMAL(5, 2),
`Horizontal Distance` DECIMAL(10, 2),
`Horizontal Error` DECIMAL(10, 2),
`Root Mean Square` DECIMAL(5, 2),
`ID` VARCHAR(255),
`Source` VARCHAR(255),
`Location Source` VARCHAR(255),
`Magnitude Source` VARCHAR(255),
`Status` VARCHAR(255)
);
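Several column names in this schema contain spaces, so they must be backtick-quoted wherever they appear in MySQL statements. The sketch below shows one way to build a parameterized INSERT statement for this table. It is not necessarily what csv_to_mysql_upload.py does: the helper names `quote_identifier` and `build_insert` are hypothetical, and the sketch assumes the CSV header matches the column names above.

```python
# Column names copied from the schema above; several contain spaces.
COLUMNS = [
    "Date", "Time", "Latitude", "Longitude", "Type", "Depth", "Depth Error",
    "Depth Seismic Stations", "Magnitude", "Magnitude Type", "Magnitude Error",
    "Magnitude Seismic Stations", "Azimuthal Gap", "Horizontal Distance",
    "Horizontal Error", "Root Mean Square", "ID", "Source", "Location Source",
    "Magnitude Source", "Status",
]


def quote_identifier(name):
    """Backtick-quote a MySQL identifier, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def build_insert(table_name, columns):
    """Return an INSERT statement with one %s placeholder per column.

    %s is the placeholder style used by MySQL Connector/Python, so the
    result can be passed to cursor.execute() or cursor.executemany().
    """
    cols = ", ".join(quote_identifier(c) for c in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {quote_identifier(table_name)} ({cols}) VALUES ({placeholders})"


if __name__ == "__main__":
    print(build_insert("neic_earthquakes", COLUMNS))
```

Passing row values separately as parameters, rather than formatting them into the SQL string, leaves escaping to the connector.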
To run this code, you need:
- Python 3
- PySpark 3.0+
- pandas
- MySQL Connector/Python (the mysql-connector-python package)
- MySQL Connector/J (the JDBC driver JAR) - https://dev.mysql.com/downloads/connector/j/
To install the Python modules, run the following command: pip install -r requirements.txt
Important: Update the following constants in the code to match your database configuration:
- host = "localhost"
- user = ""
- password = ""
- database = "aidetic"  # database name
- csv_file = "../database.csv"  # path to the CSV file
- table_name = "neic_earthquakes"
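As an illustration of how these constants could feed Spark's JDBC reader, here is a minimal sketch. The function names `jdbc_options` and `load_table` are hypothetical, the port 3306 is assumed (the MySQL default), and the actual script may build its connection differently.

```python
# Values mirror the constants listed above; fill in user and password.
host = "localhost"
user = ""
password = ""
database = "aidetic"
table_name = "neic_earthquakes"


def jdbc_options(host, database, table_name, user, password, port=3306):
    """Build the option dict for Spark's JDBC data source.

    The port default (3306) is an assumption, not taken from the project.
    """
    return {
        "url": f"jdbc:mysql://{host}:{port}/{database}",
        "driver": "com.mysql.cj.jdbc.Driver",  # driver class in Connector/J 8.x
        "dbtable": table_name,
        "user": user,
        "password": password,
    }


def load_table(spark, options):
    """Read the table into a DataFrame using an existing SparkSession."""
    return spark.read.format("jdbc").options(**options).load()


if __name__ == "__main__":
    print(jdbc_options(host, database, table_name, user, password)["url"])
```

`load_table` needs a SparkSession started with the Connector/J JAR on its classpath, for example through the --jars option of spark-submit.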
To execute the PySpark script for this analysis:
- Ensure you meet all the requirements and configuration steps above
- Navigate to the project directory: /src/
- To upload data from CSV to MySQL table, run the following command:
python3 csv_to_mysql_upload.py
- Follow these instructions to read data and execute queries:
- To execute all queries, use the following command:
spark-submit --jars ../mysql-connector-j-8.2.0/mysql-connector-j-8.2.0.jar spark_df_queries.py --all all
- To execute a specific query, use the following command:
spark-submit --jars ../mysql-connector-j-8.2.0/mysql-connector-j-8.2.0.jar spark_df_queries.py --questionNum 1
- For Query-2, the default year is set to 2015. To execute for a different year, add the following argument to the above commands:
--yearOfInterest 1995
- The output results will be displayed
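The command-line flags used above (--all, --questionNum, --yearOfInterest) could be parsed along the following lines. This is only a sketch of one possible argparse setup, not the parser in spark_df_queries.py; the helper names `build_parser` and `questions_to_run` are hypothetical, and the total of nine assumes one query per question listed in this README.

```python
import argparse


def build_parser():
    """Create a parser for the three flags shown in the commands above."""
    parser = argparse.ArgumentParser(description="Run earthquake analysis queries")
    parser.add_argument("--all", help="pass any value (e.g. 'all') to run every query")
    parser.add_argument("--questionNum", type=int, help="number of a single query to run")
    parser.add_argument("--yearOfInterest", type=int, default=2015,
                        help="year used by Query-2 (default: 2015)")
    return parser


def questions_to_run(args, total=9):
    """Return the list of query numbers selected by the parsed arguments."""
    if args.all is not None:
        return list(range(1, total + 1))
    if args.questionNum is not None:
        return [args.questionNum]
    return []


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(questions_to_run(args), args.yearOfInterest)
```

With this setup, `--questionNum 2 --yearOfInterest 1995` selects only the second query and overrides the default year.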
The queries address the following questions:
- How does the Day of the Week affect the number of earthquakes?
- What is the relation between the Day of the month and the number of earthquakes that happened in a year?
- What does the average frequency of earthquakes in a month from the year 1965 to 2016 tell us?
- What is the relation between the Year and Number of earthquakes that happened in that year?
- How has the average earthquake magnitude varied over the years?
- How does the year impact the standard deviation of the earthquakes?
- Does geographic location have anything to do with earthquakes?
- Where do earthquakes occur very frequently?
- What is the relation between Magnitude, Magnitude Type, Status, and Root Mean Square of the earthquakes?
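To give a sense of how one of these questions can be expressed with the DataFrame API, here is a minimal sketch for the day-of-week question. It assumes the table has been loaded into a DataFrame whose `Date` column has Spark's date type; the function name `earthquakes_per_weekday` is hypothetical and the project's own query may differ.

```python
def earthquakes_per_weekday(df):
    """Count earthquakes per day of the week, most frequent day first.

    `df` is assumed to be a Spark DataFrame with a date-typed `Date` column.
    """
    # Imported here so this module can be imported without Spark installed.
    from pyspark.sql import functions as F

    return (
        df.withColumn("day_of_week", F.date_format(F.col("Date"), "EEEE"))
        .groupBy("day_of_week")
        .count()
        .orderBy(F.desc("count"))
    )
```

Calling `earthquakes_per_weekday(df).show()` prints one row per weekday name with its earthquake count.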