RUNNING.md

Run and test the project

How to get the data is explained here

Transform GHCNd's DLY-files (exemplified with one station)

spark-submit ETL/final/dly-transformation/ETL-dly-files.py ghcnd-stations.txt CA1AB000001.dly
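To make the transformation easier to follow, here is a minimal plain-Python sketch of how one record of a `.dly` file can be unpacked. It assumes the fixed-width layout documented for GHCN-Daily (11-character station ID, year, month, element, then 31 slots of a 5-character value plus three 1-character flags, with `-9999` marking a missing value). The names `DailyValue` and `parse_dly_line` are hypothetical and are not taken from `ETL-dly-files.py`; the sample values in the usage note are made up.

```python
from typing import Iterator, NamedTuple, Optional

RECORD_WIDTH = 21 + 31 * 8  # header fields plus 31 day slots of 8 characters


class DailyValue(NamedTuple):
    station_id: str
    year: int
    month: int
    day: int
    element: str
    value: Optional[int]  # None when the slot holds the missing marker -9999
    qflag: str            # quality flag, empty string when blank


def parse_dly_line(line: str) -> Iterator[DailyValue]:
    """Yield one DailyValue per day slot of a single .dly record."""
    # Pad so that lines with stripped trailing blanks still have 31 slots.
    record = line.rstrip("\r\n").ljust(RECORD_WIDTH)
    station_id = record[0:11]
    year = int(record[11:15])
    month = int(record[15:17])
    element = record[17:21]
    for day in range(1, 32):
        start = 21 + (day - 1) * 8
        chunk = record[start:start + 8]
        raw = chunk[0:5].strip()
        # Treat both the -9999 marker and an empty (padded) slot as missing.
        value = None if raw in ("", "-9999") else int(raw)
        yield DailyValue(station_id, year, month, day, element, value, chunk[6].strip())
```

For example, `list(parse_dly_line("CA1AB000001202001PRCP   25  N" + "-9999   " * 30))` gives 31 entries, where the first has `value == 25` and the remaining ones have `value is None`.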

Run the ETL on an AWS EMR cluster

Our ETL scripts can be found here.

Cluster configuration

  • Region: us-west-2 (Oregon)
  • Software configuration
    • Release: emr-6.9.0
    • Application: Spark
  • Hardware configuration
    • Instance type: c7g.xlarge
    • Number of instances:
      • 1 main node
      • 2 worker nodes

Add Step

  • Spark config
--conf spark.dynamicAllocation.enabled=false --conf spark.yarn.maxAppAttempts=1 --num-executors=12 --executor-cores=1 --executor-memory=600M
  • Location of script
s3://kpd3-datastorm-cmpt732/ETL_script-emr-s3.py
  • Arguments
s3://kpd3-datastorm-cmpt732/ 
s3://kpd3-datastorm-cmpt732/data_after_ETL-nopartitioning/
  • Example
spark-submit --conf spark.dynamicAllocation.enabled=false --conf spark.yarn.maxAppAttempts=1 --num-executors=12 --executor-cores=1 --executor-memory=600M s3://kpd3-datastorm-cmpt732/ETL_script-emr-s3.py s3://kpd3-datastorm-cmpt732/ s3://kpd3-datastorm-cmpt732/data_after_ETL-nopartitioning/
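If the step is added from a script rather than the console, the same settings can be assembled programmatically. The following is a sketch only: it builds the argument list from the values above and wraps it in the step structure that EMR expects for `command-runner.jar`. The helper names, the step name, and the `ActionOnFailure` value are assumptions, and nothing here is submitted to AWS.

```python
from typing import Any, Dict, List

BUCKET = "s3://kpd3-datastorm-cmpt732/"


def build_spark_submit_args(script: str, input_path: str, output_path: str) -> List[str]:
    """Return the spark-submit command from this section as an argument list."""
    return [
        "spark-submit",
        "--conf", "spark.dynamicAllocation.enabled=false",
        "--conf", "spark.yarn.maxAppAttempts=1",
        "--num-executors=12",
        "--executor-cores=1",
        "--executor-memory=600M",
        script,
        input_path,
        output_path,
    ]


def build_emr_step(name: str, args: List[str]) -> Dict[str, Any]:
    """Wrap the arguments in an EMR step definition (name and failure policy are assumptions)."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


args = build_spark_submit_args(
    BUCKET + "ETL_script-emr-s3.py",
    BUCKET,
    BUCKET + "data_after_ETL-nopartitioning/",
)
step = build_emr_step("datastorm ETL", args)
# With boto3 installed and credentials configured, the step could be passed to
# boto3.client("emr").add_job_flow_steps(JobFlowId=<your cluster id>, Steps=[step]).
```

Joining `args` with spaces reproduces the example command shown above.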

Local version

# input/ contains ghcnd-stations.txt and the weather data (for example, only the file 2020.csv.gz for a single year)
spark-submit ETL/final/partitioning/ETL-local-nopartitioning.py input/ output/
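For a quick look at what a yearly file contains before running the Spark job, the rows can be read with the standard library. This sketch assumes the headerless eight-column layout of the GHCN-Daily by-year files (station ID, date as YYYYMMDD, element, value, M-flag, Q-flag, S-flag, observation time). The column names and both helper functions are hypothetical, and the sample row in the usage note is made up.

```python
import csv
import gzip
from typing import Dict, Iterable, Iterator, TextIO, Union

COLUMNS = ["station_id", "date", "element", "value",
           "mflag", "qflag", "sflag", "obs_time"]


def open_year_file(path: str) -> TextIO:
    """Open a gzip-compressed yearly file such as input/2020.csv.gz as text."""
    return gzip.open(path, "rt", newline="")


def read_observations(lines: Iterable[str]) -> Iterator[Dict[str, Union[str, int]]]:
    """Yield one dict per CSV row, with the value column converted to int."""
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        # Pad short rows so every column name has a value.
        padded = row + [""] * (len(COLUMNS) - len(row))
        record: Dict[str, Union[str, int]] = dict(zip(COLUMNS, padded))
        record["value"] = int(padded[3])
        yield record
```

Usage: `with open_year_file("input/2020.csv.gz") as handle: first = next(read_observations(handle))`. Any iterable of lines works as well, for example `read_observations(["CA1AB000001,20200101,PRCP,25,,,N,0800"])`.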

Athena Queries

  • Go to Amazon Athena
  • Select the Query Editor
  • Data source: AwsDataCatalog
  • Database: datastorm

Example

select distinct(observation) from observations;

Our queries can be found here. We used Athena through QuickSight's Query Editor.
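The same query can also be prepared for submission from a script. The sketch below only builds the request parameters for Athena's `start_query_execution` call, using the data source and database named above. The function name and the result location `s3://example-bucket/athena-results/` are placeholders, not part of this project.

```python
from typing import Any, Dict


def build_athena_request(sql: str, output_location: str) -> Dict[str, Any]:
    """Return keyword arguments for Athena's start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {
            "Catalog": "AwsDataCatalog",
            "Database": "datastorm",
        },
        # Athena writes query results to this S3 prefix (placeholder value below).
        "ResultConfiguration": {"OutputLocation": output_location},
    }


request = build_athena_request(
    "select distinct(observation) from observations;",
    "s3://example-bucket/athena-results/",  # hypothetical bucket
)
# With boto3 installed and credentials configured, this could be submitted via
# boto3.client("athena").start_query_execution(**request).
```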

QuickSight

  • Go to QuickSight
  • Go to Shared Folders (menu on the left side)
  • Go to datastorm
  • There are three icons:
    • Analyses (blue) and Dashboards (green)
      • our final results, stated here
    • Data sources (Athena's icon, orange)
      • Queries can be accessed as follows:
      • Click the data source > Edit Dataset > click the data (and its dropdown arrow) > Edit SQL Query
      • QuickSight's Query Editor then opens

[Image: shared resources icons]

Additional Work

Testing weather prediction

See Algorithmic Work

  • Train model
# output/ = path to an older dataset, for example the result of the previous transformation with an additional filter on the time range
# prediction-model = path where the trained model is written
spark-submit observation_train.py output/ prediction-model
  • Test model
# output/ = path to a newer dataset, for example the result of the previous transformation with an additional filter on the past year
# PRCP = example of a specific observation type
# prediction-model = path of the trained model written by the training step
# results/ = path for storing the results
spark-submit observation_prediction.py output/ PRCP prediction-model results/
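The comments above describe training on an older time range and testing on a newer one. As an illustration of that idea only, here is a plain-Python sketch that splits records at a cutoff date. It is not taken from `observation_train.py` or `observation_prediction.py`; the function name, the `date` field in YYYYMMDD form, and the sample records are assumptions.

```python
from datetime import date, datetime
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, str]


def split_by_date(records: Iterable[Record], cutoff: date) -> Tuple[List[Record], List[Record]]:
    """Return (older, newer): records dated before the cutoff, and the rest."""
    older: List[Record] = []
    newer: List[Record] = []
    for record in records:
        observed = datetime.strptime(record["date"], "%Y%m%d").date()
        if observed < cutoff:
            older.append(record)   # candidate training data
        else:
            newer.append(record)   # candidate test data
    return older, newer


# Made-up sample records, for illustration only.
sample = [
    {"date": "20181231", "element": "PRCP", "value": "12"},
    {"date": "20190615", "element": "PRCP", "value": "0"},
    {"date": "20200101", "element": "PRCP", "value": "25"},
]
train_rows, test_rows = split_by_date(sample, date(2020, 1, 1))
```

Splitting by time rather than at random keeps the test data strictly later than the training data, which matches the older/newer description above.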