generated from intro-to-data-science-22-workshop/00-template-munzert-kermit
-
Notifications
You must be signed in to change notification settings - Fork 0
/
janitor-exercises.Rmd
178 lines (100 loc) · 5.52 KB
/
janitor-exercises.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
---
title: "Janitor exercises"
author: "Abigail Peña Alejos & Nikolina Klatt"
date: "2022-11-13"
output:
html_document:
toc: TRUE
df_print: paged
number_sections: FALSE
highlight: tango
theme: lumen
toc_depth: 3
toc_float: true
css: custom.css
self_contained: false
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
Today we will check some of the basic functions offered by the `janitor package` that will help us save time when cleaning our data. Along with `janitor`, we will make use of `tidyverse`packages such as `dplyr` and `readr`. `kableExtra` is suggested to render nice tables. But it is up to you!
```{r, message = F}
library(readr)
library(tidyverse)
library(janitor)
library(kableExtra)
```
For this exercise we will use *crime_subset.csv* dataset taken from a real world data hosted at [Data.gov catalogue](https://catalog.data.gov/dataset), the U.S. Government’s open data portal. The dataset contains reported crimes in the Montgomery County in Maryland. (You can download the original [here](https://catalog.data.gov/dataset/crime))
We import the .csv file
```{r, warning=FALSE, message=FALSE}
crime_df <- read_delim("data/crime_subset.csv",
delim = ";", escape_double = FALSE, trim_ws = TRUE)
```
Now, let's go!
***
## Getting our variables names in clean format with `clean_names()`
We have talked in our presentation about naming convention in computing, for example for variable and subroutine names, and for file names. We have been recommended using `snake_case`, the style of writing in which each whitespace is replaced by an underscore `_` character, and each word is written in lowercase. Something like this: `my_awesome_df`.
However, this isn't the case for our dataset.
```{r}
crime_df %>%
glimpse()
```
**EXERCISE 1:** Change variable names of the dataframe to follow `snake_case` convention.
```{r}
# insert your code here
```
Voilá! Now we have `snake_case` formatted variables! Btw, did you notice the function also took care of the `/` in `Dispatch/Date` and replaced it with an underscore. Isn't it great?
Let's pretend you are not really fond of `snake_case` format because you prefer the `camelCase`convention instead. You might also want to leave NIBRS and ID untouched. No problem! `clean_names()` got your back. Just change your preference in the argument `case =` to `upper_camel` and keep the abbreviation with the argument `abbreviations`. Check out `?clean_names()` for further information.
```{r, eval=FALSE}
# insert your code here
```
## Detect duplicates with `get_dupes()`
**EXERCISE 2:** Find out whether there are any duplicate values in the *crime_df* dataframe
```{r}
# insert your code here
```
Uhlala, looks like our data needs some cleaning!
## Remove content with `remove_empty()` and `remove_constant()` functions
Our `crime_df` data contain several missing values, and a couple of rows/columns filled with NAs.
**EXERCISE 3.1:** Create a subset called *tidy_crime* using the `remove_empty()`
```{r}
# insert your code here
```
**EXERCISE 3.2:** From *crime_df*, subset the observation for crimes committed in *Silver Spring district*, clean the data and get rid of the *police_distric_name* variable.
```{r}
# insert your code here
```
## Check content of columns with `compare_df_colums`
`compare_df_colums` is a pretty cool function to check how a group of dataset compare to each other and whether could be possible to merge them based on the number of variables and vector class.
**Exercise 4** From *crime_df*, subset the observation for crimes committed in *Wheaton district* , clean the data and get rid of the *police_distric_name* variable. Then compare *Wheaton* and *Silver Spring* subsets. Check out what columns are missing or present in the different inputs.
```{r, echo = FALSE}
# insert your code here
```
## Using `tabyl()` and `adorn_*()` options
We know that `tabyl()`is useful to create contingency tables with 1, 2 or 3 variables, and we have learned that `adorn_*()` functions will help us add information such as total numbers, not to forget a nice formatting for percentages.
**EXERCISE 5.1:** Find out proportion of crime per district, round percentages to 1 digit.
```{r}
# insert your code here
```
**EXERCISE 5.2:** For crimes committed in Silver Springs involving only one victim, find what percentage corresponds to property-related offences. DON'T USE `drop_na()`
```{r}
# insert your code here
```
## Finally, let's play with numbers & dates
**Rounding up numbers with `round_half_up`**
Following the example of *v1* vector, create 2 vectors in which the numbers from vector 1 are rounded up to 1 and 0 digits respectively. Bind the vectors into a tibble under the name `funny_numbers` to see how the numbers compare across columns
```{r, message=FALSE}
v1 <- runif(10, min=15, max=48)
# insert your code here
```
**Playing with dates**
Use the `excel_numeric_to_date()` function to convert Excel serial date numbers included in *edates* to class `Date`
```{r, message=FALSE}
edates <- c(0, 59, 61, 1000, 45100)
# insert your code here
```
## Sources
[janitor package](https://cran.r-project.org/web/packages/janitor/janitor.pdf)
[overview of janitor functions](https://cran.r-project.org/web/packages/janitor/vignettes/janitor.html#tabyl---a-better-version-of-table)
[cleaning penguins with the janitor packages](http://jenrichmond.rbind.io/post/digging-into-the-janitor-package/)
[How to Clean Data: {janitor} Package](https://www.exploringdata.org/post/how-to-clean-data-janitor-package/)