-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathpart1.Rmd
166 lines (125 loc) · 9.94 KB
/
part1.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
---
title: 'Portfolio: Part 1'
author: "John Higdon"
date: "10/10/2019"
output: html_document
---
### I) Introduction
The topic we will be exploring is Gross Domestic Product and Personal income in the United States. The main motive for choosing this topic was to determine if overall GDP has a significant affect on personal income. Intuitively we would expect that this is the case. Let's explore if this notion is true.
### II) Identify initial data set
```{r}
library("tidyverse")
# read in the initial datasets
gdpByState <- read_csv("https://raw.githubusercontent.com/VioletInferno/DataSciencePortfolio/master/GDPbystate.csv")
personalIncome <- read_csv("https://raw.githubusercontent.com/VioletInferno/DataSciencePortfolio/master/PersonalIncome.csv")
```
The original dataset can be obtained by navigating to the following url: https://apps.bea.gov/itable/iTable.cfm?ReqID=70&step=1#reqid=70&step=1&isuri=1 then selecting "QUARTERLY GROSS DOMESTIC PRODUCT (GDP) BY STATE" -> "GDP in current dollars (SQGDP2)" -> selecting "All Areas", "All statistics in table" and "Levels" for unit of measure -> Download -> CSV. I uploaded a copy of this dataset to the github for ease of access. I also retrieved the dataset "QUARTERLY STATE PERSONAL INCOME" -> "Personal Income, Population, Per Capita Personal Income (SQINC1)". The plan is to join these two data-sets to explore relationships between GDP and personal income. The urls to the datasets are https://raw.githubusercontent.com/VioletInferno/DataSciencePortfolio/master/GDPbystate.csv and https://raw.githubusercontent.com/VioletInferno/DataSciencePortfolio/master/PersonalIncome.csv, respectively.
### III) Describe your source, using critical analysis to assess the quality (or limitations) of the data.
The datasets were retrieved from the Bureau of Economic Analysis (bea.gov). The amount of categorical variables is technically 4, however I only believe two will be of use for our analysis. These two categorical variables are "GeoName" and "Description". The other two are "GeoFips" and "LineCode" - which appear to be a way to encode the aforementioned variables as integers which could be more efficient in parsing. The remaining variables are continous and contain the quartery GDP for a particular year.
In an effort to tidy up our data, we are going to be analyzing the quarterly GDP and personal income from the years 2017 to quarter 1 in 2019. This could cause a limitation by only evaluating a limited range of years. As such, we will maintain the original datasets for analysis at a later time.
### IV) Document your variables and describe in plain language what they mean and how they are represented
The variables we will be analyzing in this report are the following:
GeoName: either a state or region within the United States
"Description" in the gdp table: description of the industry generating revenue
"Description" in the personal income table: Population, per capita or personal income.
qx_YYYY: the specific quarter, indicated by x (between 1 and 4), and year. We will be analyzing data from 2017 until the first quarter of 2019.
### V) Document any manipulations you make to clean your data and organize it as Tidy Data as best you can prepare it for data analysis and modeling
```{r}
# cleaning up GDP By State table
gdpByState <- gdpByState[-c(1,2,3),] # deletes the first three rows
gdpByState <- gdpByState[-1,] # delete next row (used row to properly name variables)
gdpByState2017 <- gdpByState[, -c(5:52)] # removes columns from col #5 to 52, leaving us with data from 2017-2019
gdpByState2017 <- gdpByState2017[-c(1444:1448),] # remove superfluous information
# Rename variables to be more descriptive
colnames(gdpByState2017)[1] <- "GeoFips"
colnames(gdpByState2017)[2] <- "GeoName"
colnames(gdpByState2017)[3] <- "LineCode"
colnames(gdpByState2017)[4] <- "Description"
colnames(gdpByState2017)[5] <- "q1_2017"
colnames(gdpByState2017)[6] <- "q2_2017"
colnames(gdpByState2017)[7] <- "q3_2017"
colnames(gdpByState2017)[8] <- "q4_2017"
colnames(gdpByState2017)[9] <- "q1_2018"
colnames(gdpByState2017)[10] <- "q2_2018"
colnames(gdpByState2017)[11] <- "q3_2018"
colnames(gdpByState2017)[12] <- "q4_2018"
colnames(gdpByState2017)[13] <- "q1_2019"
gdpByState2017 <- gdpByState2017[-c(1:27),] # we want to look specifically at states, remove US data.
# cleaning up Personal Income table
personalIncome <- personalIncome[-c(1,2,3),] # remove first three rows
personalIncome2017 <- personalIncome[,-c(5:280)] # grab data from 2017-2019
personalIncome2017 <- personalIncome2017[,-c(14)] # remove variable that won't be matched in gdpByState table.
personalIncome2017 <- personalIncome2017[-1,] # remove top row which indicated names of columns.
personalIncome2017 <- personalIncome2017[-c(1:3),] # remove US data.
colnames(personalIncome2017)[1] <- "GeoFips"
colnames(personalIncome2017)[2] <- "GeoName"
colnames(personalIncome2017)[3] <- "LineCode"
colnames(personalIncome2017)[4] <- "Description"
colnames(personalIncome2017)[5] <- "q1_2017"
colnames(personalIncome2017)[6] <- "q2_2017"
colnames(personalIncome2017)[7] <- "q3_2017"
colnames(personalIncome2017)[8] <- "q4_2017"
colnames(personalIncome2017)[9] <- "q1_2018"
colnames(personalIncome2017)[10] <- "q2_2018"
colnames(personalIncome2017)[11] <- "q3_2018"
colnames(personalIncome2017)[12] <- "q4_2018"
colnames(personalIncome2017)[13] <- "q1_2019"
# ensure our quarterly tables are numeric
gdpByState2017$q1_2017 <- as.numeric(gdpByState2017$q1_2017)
gdpByState2017$q2_2017 <- as.numeric(gdpByState2017$q2_2017)
gdpByState2017$q3_2017 <- as.numeric(gdpByState2017$q3_2017)
gdpByState2017$q4_2017 <- as.numeric(gdpByState2017$q4_2017)
gdpByState2017$q1_2018 <- as.numeric(gdpByState2017$q1_2018)
gdpByState2017$q2_2018 <- as.numeric(gdpByState2017$q2_2018)
gdpByState2017$q3_2018 <- as.numeric(gdpByState2017$q3_2018)
gdpByState2017$q4_2018 <- as.numeric(gdpByState2017$q4_2018)
gdpByState2017$q1_2019 <- as.numeric(gdpByState2017$q1_2019)
personalIncome2017$q1_2017 <- as.numeric(personalIncome2017$q1_2017)
personalIncome2017$q2_2017 <- as.numeric(personalIncome2017$q2_2017)
personalIncome2017$q3_2017 <- as.numeric(personalIncome2017$q3_2017)
personalIncome2017$q4_2017 <- as.numeric(personalIncome2017$q4_2017)
personalIncome2017$q1_2018 <- as.numeric(personalIncome2017$q1_2018)
personalIncome2017$q2_2018 <- as.numeric(personalIncome2017$q2_2018)
personalIncome2017$q3_2018 <- as.numeric(personalIncome2017$q3_2018)
personalIncome2017$q4_2018 <- as.numeric(personalIncome2017$q4_2018)
personalIncome2017$q1_2019 <- as.numeric(personalIncome2017$q1_2019)
# ensure our categorial variables are defined as factors
gdpByState2017$GeoName <- as.factor(gdpByState2017$GeoName)
gdpByState2017$Description <- as.factor(gdpByState2017$Description)
gdpByState2017$GeoFips <- as.factor(gdpByState2017$GeoFips)
gdpByState2017$LineCode <- as.factor(gdpByState2017$LineCode)
personalIncome2017$GeoName <- as.factor(personalIncome2017$GeoName)
personalIncome2017$Description <- as.factor(personalIncome2017$Description)
personalIncome2017$GeoFips <- as.factor(personalIncome2017$GeoFips)
personalIncome2017$LineCode <- as.factor(personalIncome2017$LineCode)
```
- Removed the first 3 rows of the gdpByState dataset as they were empty anyway and will not be used for any anlalysis.
- Removed the first row (after viewing it to change column names to their correct names)
- Grabbed data from Quarter 1 2017 until Quarter 1 2019 in order to assist in tidying up our data.
- Renamed variables.
- Removed data for the entire US. Our study will be focussing on states and regions (as we can sum up values from the aforementioned to see the total US GDP)
- Defined categorical variables as factors, and continuous variables as numeric.
### VI) Summarize your data using descriptive statistics, along with visualizations
```{r}
mean(gdpByState2017$q1_2017, na.rm = TRUE) # average GDP of all states/regions Quarter 1, 2017 in millions of dollars
sd(gdpByState2017$q1_2017, na.rm = TRUE) # standard deviation of GDP of all states/regions Quarter 1, 2017 in millions of dollars
median(gdpByState2017$q1_2017, na.rm = TRUE) # median GDP of all states/regions Quarter 1, 2017 in millions of dollars
```
Based on the descriptive statistics, our standard deviation indicates that there are outliers in our dataset. I believe this is because regional data is mixed within specific state data. Because of this, median is a more reasonable statistic to focus on because our average is skewed. To see the skew, we can look at a histogram.
```{r}
ggplot(gdpByState2017, aes(x = q1_2017)) + geom_histogram()
```
This shows that our data is skewed to the right, meaning most of our data is on the left side of the graph but a few larger outliers remain. As such, we have discovered a limitation in our dataset. In the next phase of the project, the plan is to tidy up the data even further.
```{r}
# boxplot of GDP by state and region
ggplot(gdpByState2017, aes(x = GeoName, y = q1_2017, fill = GeoName)) + geom_boxplot() + theme_light() + coord_flip()
# Quarter 1 2018 GDP vs Quarter 1 2017 GDP. May show a correlation
ggplot(gdpByState2017, aes(x = q1_2017, y = q1_2018)) + geom_point()
```
We can see that there is a positive correlation between the first quarters of 2017 and 2018 in regards to GDP.
### VII) Finally, describe what research questions you hope to explore in later phases, along with potential social & ethical implications of your work.
## Research questions:
1) Does GDP have an affect on personal income / QOL of a population?
2) Do certain industries have more of an affect on personal income than others?
## Issues to be addressed:
What we've discovered in this initial analysis of our datasets is that there are significant outliers. This skews our current dataset which can pose problems for data analysis. A goal between now and the next project is to clean up the data further to provide better analysis for the future.