-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
1 changed file
with
146 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,146 @@ | ||
--- | ||
title: "Lab 3 Assignment" | ||
author: Dana Gonzalez | ||
format: html | ||
editor: visual | ||
--- | ||
## 1. Read in the data | ||
```{r} | ||
download.file( | ||
"https://raw.githubusercontent.com/USCbiostats/data-science-data/master/02_met/met_all.gz", | ||
destfile = file.path("~", "Downloads", "met_all.gz"), | ||
method = "libcurl", | ||
timeout = 60 | ||
) | ||
met <- data.table::fread(file.path("~", "Downloads", "met_all.gz")) | ||
met <- as.data.frame(met) | ||
``` | ||
|
||
## 2. Check the dimensions, headers, footers | ||
|
||
There are 30 columns and 6 rows in this dataset (the dataset was cleaned up during lecture). | ||
|
||
```{r} | ||
dim(met) | ||
``` | ||
```{r} | ||
head (met) | ||
``` | ||
```{r} | ||
tail(met) | ||
``` | ||
|
||
## 3. Take a look at the variables | ||
|
||
Based on our objective stated at the beginning of lab (find the weather station with the highest elevation and look at patterns in the time series of its wind speed and temperature), our variables of interest are most likely: 'USAFID', 'elev', 'wind.sp', and 'temp'. | ||
|
||
```{r} | ||
str(met) | ||
``` | ||
|
||
## 4. Take a closer look at the key variables | ||
|
||
There were 91853 NA values for the 'wind.sp' variable. | ||
|
||
The highest elevation of any weather station in this dataset is 4,113 meters above sea level. | ||
|
||
```{r} | ||
table(met$year) | ||
``` | ||
```{r} | ||
table(met$day) | ||
``` | ||
```{r} | ||
table(met$hour) | ||
``` | ||
```{r} | ||
summary(met$temp) | ||
``` | ||
```{r} | ||
summary(met$elev) | ||
``` | ||
```{r} | ||
met$wind.sp[met$winds.sp == 9999.0] <- NA | ||
summary(met$wind.sp) | ||
``` | ||
|
||
```{r} | ||
met[met$elev==9999.0, ] <- NA | ||
summary(met$elev) | ||
``` | ||
```{r} | ||
met <- met[met$temp > -40, ] | ||
head(met[order(met$temp), ]) | ||
``` | ||
|
||
```{r} | ||
met$elev[met$elev == 9999.0] <- NA | ||
summary(met$elev) | ||
``` | ||
|
||
|
||
## 5. Check the data against an external source | ||
|
||
The range of elevations are valid (according to my Google search). The lowest elevation in the continental US in Death Valley, CA (-86m), whereas the highest elevation in the continental US is found in Mount Whitney, CA (4,418m). This clearly encompasses this dataset's range of -13m to 4,113m. | ||
|
||
The coordinates for the -17.2 degrees celsius data leads to a location outside of Colorado Springs, Colorado. Knowing that this data was pulled in August, this temperature does not make sense in this context. | ||
|
||
## 6. Calculate summary statistics | ||
```{r} | ||
elev <- met[which(met$elev == max(met$elev, na.rm = TRUE)), ] | ||
summary(elev) | ||
``` | ||
```{r} | ||
cor(elev$temp, elev$wind.sp, use="complete") | ||
``` | ||
```{r} | ||
cor(elev$temp, elev$hour, use="complete") | ||
``` | ||
```{r} | ||
cor(elev$wind.sp, elev$day, use="complete") | ||
``` | ||
```{r} | ||
cor(elev$wind.sp, elev$hour, use="complete") | ||
``` | ||
```{r} | ||
cor(elev$temp, elev$day, use="complete") | ||
``` | ||
|
||
## 7. Exploratory graphs | ||
```{r} | ||
hist(met$elev,) | ||
``` | ||
```{r} | ||
hist(met$temp,) | ||
``` | ||
```{r} | ||
hist(met$wind.sp,) | ||
``` | ||
|
||
```{r} | ||
library(leaflet) | ||
leaflet(elev) |> | ||
addProviderTiles('OpenStreetMap') %>% | ||
addCircles(lat=~lat,lng=~lon, opacity=1, fillOpacity=1, radius=100) | ||
``` | ||
|
||
```{r} | ||
library(lubridate) | ||
elev$date <- with(elev, ymd_h(paste(year, month, day, hour, sep= ' '))) | ||
summary(elev$date) | ||
``` | ||
```{r} | ||
elev <- elev[order(elev$date), ] | ||
head(elev) | ||
``` | ||
```{r} | ||
plot(elev$date, elev$temp, type = "l", cex = 0.5) | ||
``` | ||
```{r} | ||
plot(elev$date, elev$wind.sp, type = "l", cex = 0.5) | ||
``` | ||
Our temperature and date line graph seems to have two peaks, one around August 6th and another around August 20th. Two peaks were also seen in our wind speed line plot, however both peaks were seen later in the month, the first being around August 18th and the second around August 26th. Too, wind speeds seemed to be lower in the first third of the August. | ||
|
||
## Ask questions | ||
Given that we spent some time in class cleaning up this data, I'm wondering what led to the missing data in this dataset. Was it an issue with collection at the sites, gathering and distributing the data, or another issue? |