forked from CSDE-UW/csde502-winter-2022
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path07-week07.Rmd
838 lines (633 loc) · 29.2 KB
/
07-week07.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
# Week 7 {#week7}
```{r, echo=FALSE, warning=FALSE, message=FALSE}
pacman::p_load(
tidyverse,
magrittr,
knitr,
kableExtra,
readstata13,
haven,
pdftools,
curl,
ggplot2,
captioner
)
table_nums <- captioner(prefix = "Table")
figure_nums <- captioner(prefix = "Figure")
# path to this file name
if (!interactive()) {
fnamepath <- current_input(dir = TRUE)
fnamestr <- paste0(Sys.getenv("COMPUTERNAME"), ": ", fnamepath)
} else {
fnamepath <- ""
}
```
<h2>Topic: Add Health data: variable creation and tabulation</h2>
This week's lesson will provide more background on variable creation and tabulation of variables.
Download [template.Rmd.txt](files/template.Rmd.txt) as a template for this lesson. Save in in your working folder for the course, renamed to `week_07.Rmd` (with no `.txt` extension). Make any necessary changes to the YAML header.
Also make sure these packages are loaded:
```
pacman::p_load(
tidyverse,
magrittr,
knitr,
kableExtra,
readstata13,
haven,
pdftools,
curl,
ggplot2,
captioner
)
```
## Creating value labels
"Labeled" data are important for documentation of data sets. The labels can apply to different objects, e.g., columns (as we saw in the column labels used to "decode" the data set in `AHwave1_v1.dta`, Section \@ref(tidyverse)), or to individual values of variables, e.g., the attribute `labels` of the variable `h1gi1m`:
```{r}
if (!exists("AHwave1_v1_haven")) {
AHwave1_v1_haven <- haven::read_dta("http://staff.washington.edu/phurvitz/csde502_winter_2021/data/AHwave1_v1.dta")
}
attributes(AHwave1_v1_haven$h1gi1m)$labels %>%
t() %>%
t()
```
Consider the difference between different file formats. Here we will save the data set, once as a CSV file and once an RDS file:
```{r}
# temp dir
mytempdir <- tempdir()
write.csv(x = AHwave1_v1_haven, file = file.path(mytempdir, "AHwave1_v1_haven.csv"), row.names = FALSE)
saveRDS(object = AHwave1_v1_haven, file = file.path(mytempdir, "AHwave1_v1_haven.RDS"))
```
Then we will read them back in and investigate their structure. Here we are using a temporary folder, specified using the `tmpdir()` function. Each time an R session is started, a unique temporary folder is used. For the session used to create this document, the temporary dir is `r mytempdir`.
First, read the CSV format:
```{r}
AHwave1_v1_haven_csv <- read.csv(file = file.path(mytempdir, "AHwave1_v1_haven.csv"))
```
What kind of object is this?
```{r}
is(AHwave1_v1_haven_csv)
```
It is a simple data frame. What attributes does this data set have? Here we list the attributes and enumerate the first 6 elements of each attribute:
```{r}
AHwave1_v1_haven_csv %>%
attributes() %>%
map(~ head(.))
```
There are three attributes, `names`, which are the individual column names; `class`, indicating this is a data frame, and `row.names`, in this case a serial number 1 .. n.
Do the columns have any attributes?
```{r}
AHwave1_v1_haven_csv$h1gi1m %>%
attributes()
```
No, the columns have no attributes.
Now we will read in the RDS file:
```{r}
AHwave1_v1_haven_rds <- readRDS(file = file.path(mytempdir, "AHwave1_v1_haven.RDS"))
```
What kind of object is this?
```{r}
is(AHwave1_v1_haven_rds)
```
It is a data frame--but an example of the `tidyr` subclass `tibble`; see the [documentation](https://www.rdocumentation.org/packages/tibble).
What attributes does this data set have?
```{r}
AHwave1_v1_haven_rds %>%
attributes() %>%
map(~ head(.))
```
The overall attributes are similar to the basic data frame, but there is an overall data set label, ``r AHwave1_v1_haven_rds %>% attributes() %>% extract("label")`` (*sic*).
We can also look at the attributes of specific columns, here for `h1gi1m`:
```{r}
AHwave1_v1_haven_rds$h1gi1m %>%
attributes()
```
Here we see that all of the original column attributes were preserved when the data set was saved in RDS format.
**Importantly**, saving a data set in CSV format loses any built-in metadata, whereas saving in RDS format maintains the built-in metadata. For other plain text formats, metadata will not be maintained; for other formats, it is worth determining whether such metadata are maintained. If metadata are not maintained in the file structure of the data set, it will be important to maintain the metadata in an external format (e.g., text, PDF). Note that the CSV format is the *lingua franca* of data transfer, and should be readable by any software, whereas the RDS format may not be readable by software other than R.
### Creating factor variables
Factor variables are used in R to store categorical variables. These categorical variables can be nominal, in which each value is distinct and *not* ordered, such as race, classified as
* White
* Black/African American
* American Indian
* Asian/Pacific Islander
* other
Factor variables can be ordered, where there is a difference in amount or intensity, such as self-reported health status:
1. Excellent
1. Very good
1. Good
1. Fair
1. Poor
Ordinal variables may or may not have equal intervals; for example, the "distance" between excellent and good health may not represent the same "distance" as the difference between good and fair health.
Structurally, factor variables are stored as integers that can be linked to text labels. Operationally, __the use of factor variables is important in statistical modeling, so that the variables are handled correctly as being categorical__.
Factor variables are created using the `factor()` or `as_factor()` functions. Here we will convert the self-reported general health variable (`h1gh1`) to a factor. First, let's look at the variable:
```{r}
AHwave1_v1_haven$h1gh1 %>%
attributes()
```
This shows that there are values 1 through 6 and 8, with coding indicated in the `labels` attribute. Using `head()` also reveals the structure of the variable, including the label for the variable itself as well as the coding of the variable values:
```{r}
head(AHwave1_v1_haven$h1gh1)
```
Here we convert the variable to a factor:
```{r}
AHwave1_v1_haven$health <- factor(AHwave1_v1_haven$h1gh1)
```
What is the result? We can see from the first few values; `head()` presents the values as well as the levels.
```{r}
# look at the first few values of the factor variable
head(AHwave1_v1_haven$health)
```
Although the levels are in numerical order, there are no meaningful labels. We use the `labels = ` argument to assign labels to each level. Simultaneously, because the factor is ordered in the same order as the alphanumeric ordering of the attributes, we can set the ordering based on those attributes. Finally, we reverse the order of the levels so that better health has a higher position in the order.
```{r}
# extract the labels from the column attribute
health_levels <- AHwave1_v1_haven$h1gh1 %>%
attributes() %>%
extract2("labels") %>%
names()
# create the factor variable and re-level it in reverse order
AHwave1_v1_haven$health <- factor(AHwave1_v1_haven$h1gh1,
labels = health_levels,
ordered = TRUE
) %>%
fct_relevel(rev)
```
_[Note that if the ordering is not alphanumeric, one should enter the list of values as `... labels = c("label1", "label2, ..., "labeln"), ordered = TRUE ...` to enforce correct ordering.]_
Let's compare the two variables through simple tabulation. Here is the "raw" variable:
```{r}
# "raw" variable
(tab_h1gh1 <- AHwave1_v1_haven %>%
group_by(h1gh1) %>%
summarise(n = n()))
```
The "raw" variable shows that it is a labeled, double precision variable (`<dbl+lbl>`).
Now the factor variable:
```{r}
# factor variable
(tab_health <- AHwave1_v1_haven %>%
group_by(health) %>%
summarise(n = n()))
```
The counts are the same, but for the factor variable, the `<ord>` additionally indicates the ordering of health levels. Note that the order is different because `(1) Excellent` should have the highest numerical value--that is, in a statistical model, "Excellent" should have a greater value than "Poor".
Bar plots also show the difference between the raw and factor variables. `r figure_nums(name = "barplotraw", display = "cite")` presents a bar plot from the raw variable.
```{r, warning=FALSE, message=FALSE}
ggplot(data = tab_h1gh1, aes(x = h1gh1, y = n)) +
geom_bar(stat = "identity") +
coord_flip()
```
*`r figure_nums(name = "barplotraw", caption = "Bar plot from raw h1gh1 values")`*
Because the numerical values have no special meaning, the bar plot uses its default method of displaying the axes. `r figure_nums(name = "barplotfac", display = "cite")` shows the bar plot created with the factor variable, with informative labels at the correct position on the axes.
```{r}
ggplot(data = tab_health, mapping = aes(x = health, y = n)) +
geom_bar(stat = "identity") +
coord_flip()
```
*`r figure_nums(name = "barplotfac", caption = "Bar plot from h1gh1 values as factors")`*
One of the potential down sides to using factor variables is that using the text values requires additional code. For example, to select (`filter()`) a subset of records, there are two methods. The first method uses the label. In this case because the factor is ordered, it is possible to use an ordinal comparison. Any records with self-reported health less than or equal to "Very good" are selected. The tabulation shows that no "Excellent" records were selected.
```{r}
# filter for Excellent or Very good
y <- AHwave1_v1_haven %>%
filter(health <= "(2) Very good")
# tabulate
y %>%
group_by(health) %>%
summarise(n = n())
```
The other method is to use `as_numeric()` to specify the underlying numerical values. In this case, because the value of 6 indicated `Very good`, we can filter for `health %>% as.numeric() <= 6`. However, it is critically important that the factor levels and values are known; in the original variable `h1gh1`, "Very good" had a value of 2, but in the re-ordered factor variable, "Very good" has a value of 6.
```{r}
x <- AHwave1_v1_haven %>%
filter(health %>% as.numeric() <= 6)
# tabulate
x %>%
group_by(health) %>%
summarise(n = n())
```
Using unordered factor variables shows that more coding is generally involved in specifying values in `filter()` statements. For example, let's create a factor variable for the interviewer's observed single race variable:
```{r}
# number of labels
nlabs <- length(unique(AHwave1_v1_haven$h1gi9))
# get the values, "as.numeric()"
obsrace_values <- AHwave1_v1_haven$h1gi9 %>%
attributes() %>%
extract2("labels") %>%
as.numeric()
# get the labels, names()
obsrace_labels <- AHwave1_v1_haven$h1gi9 %>%
attributes() %>%
extract2("labels") %>%
names()
# create the factor
AHwave1_v1_haven$obsrace <- factor(AHwave1_v1_haven$h1gi9,
levels = obsrace_values,
labels = obsrace_labels
)
```
Suppose we wanted to make a subset of only White and Black records, there are a few syntax methods:
Here, each value is explicitly named, using the `|` "or" operator
```{r}
dat_wb1 <- AHwave1_v1_haven %>%
filter(obsrace == "(1) White" |
obsrace == "(2) Black/African American")
dat_wb1 %>%
group_by(obsrace) %>%
summarise(n = n())
```
A different approach uses `str_detect()` with a regular expression to match any values of `obsrace` that contain the strings `white` or `black` in any upper/lower case combination.
```{r}
dat_wb2 <- AHwave1_v1_haven %>%
filter(str_detect(obsrace, regex("white|black", ignore_case = TRUE)))
dat_wb2 %>%
group_by(obsrace) %>%
summarise(n = n())
```
Finally, if we know the numeric values representing the factors, we can use those directly, using requisite care in identifying the numeric values proxying for the factor levels:
```{r}
dat_wb3 <- AHwave1_v1_haven %>%
filter(obsrace %>% as.numeric() %in% c(1, 2))
dat_wb3 %>%
group_by(obsrace) %>%
summarise(n = n())
```
The first method is the most verbose, requires the most code, and is probably the easiest to read. The other two methods may be easier to code (i.e., fewer keystrokes), but seem to be more difficult to read. There is often a trade-off between writing code that is quickly written and less easy to read versus code that is more methodically written but easier to read.
As above, care needs to be taken in saving the data set to a file. If the factor has text labels, those will be written as text by default to an output CSV file. When they are read back in, they will be treated as character values rather than ordered factors. For example, the `health` variable we created is an ordered factor:
```{r}
str(AHwave1_v1_haven$health)
```
But when cycled through a write/read CSV cycle, the variable is not maintained as a factor.
```{r}
write.csv(x = AHwave1_v1_haven, file = file.path(mytempdir, "foo.csv"), row.names = FALSE)
y <- read.csv(file.path(mytempdir, "foo.csv"))
str(y$health)
```
Additional work would be needed to re-establish it as a factor. Fortunately, the alphanumeric sorting order of the character strings is also the logical order (i.e., `(1)` comes before `(2)` in alphanumeric order).
```{r}
y$health <- factor(y$health,
labels = y$health %>% unique() %>% sort(),
ordered = TRUE
) %>% fct_rev()
head(y$health)
```
Without the numeric values preceding the text (e.g., `(1)`, `(2)`, ...), the ordering would need to be explicitly set. For example, the vector in desired sorting order
`"Excellent", "Very good", "Good", "Fair", "Poor", "Refused", "Don't know"`
sorts as
`"Excellent", "Very good", "Good", "Fair", "Poor", "Refused", "Don't know"`
which is not the proper order of self-reported health.
To use our own specified labels, we need to include those in the desired order. Here we create a vector with specified labels in descending order matching the numeric values (i.e., 1 = "Excellent", 2 = "Very good", etc.), and then specify that vector in the `labels` argument:
```{r}
health2labels <- c("Excellent", "Very good", "Good", "Fair", "Poor", "Refused", "Don't know")
y$health2 <- factor(y$h1gh1,
labels = health2labels,
ordered = TRUE
) %>% fct_rev()
head(y$health2)
```
If you save the data set as RDS, the factor and other structures are maintained; here we see the factor levels are exactly as we set them earlier.
```{r}
write_rds(x = AHwave1_v1_haven, file = file.path(mytempdir, "foo.Rds"))
z <- readRDS(file = file.path(mytempdir, "foo.Rds"))
head(z$health)
```
### Creating attributes
In addition to creating factors, which can serve in some capacity as self-documenting (i.e., the value labels should be at least somewhat self-explanatory), we can create attributes as we have seen with the Stata `.dta` files.
Let's start with the CSV file, which we know was stripped of its descriptive attributes:
```{r}
y <- read.csv(file.path(mytempdir, "foo.csv"))
y %>%
attributes() %>%
map(~ head(.))
```
First, an attribute of the data frame itself (the overall data frame `label` attribute):
```{r}
attributes(y)$label <- "National Longitudinal Study of Adolescent to Adult Health (Add Health), 1994-2000 with some variable additions"
```
... which we can verify:
```{r}
y %>%
attributes() %>%
extract("label")
```
Next, attributes for the `health` and `obsrace` variables, documenting the variables and values. Here, for `health` we set this up as a factor using the sorted unique values of `health` from the CSV file. The factor is ordered and then reversed so that "Excellent" health has the highest value.
```{r}
# label for health
attributes(y$health)$label <- "General health from public Add Health Wave 1 Section 3 question 1"
# labels for health
healthlevels <- unique(y$health) %>%
sort()
# values for health
attributes(y$health)$levels <- healthlevels
# create the factor "health" with appropriate levels, order it and reverse the levels
# so that Excellent is highest
y %<>% mutate(
health =
factor(health, levels = healthlevels, ordered = TRUE) %>%
fct_rev()
)
```
We create the `obsrace` factor but because there is no inherent value hierarchy, we do not make this an ordered factor. We convert it to a factor for use in statistical models.
```{r}
# label for obsrace
attributes(y$obsrace)$label <- "Interviewer observation of race from public Add Health Wave 1 Section 1 question 9"
# obsrace levels
obsracelevels <- unique(y$obsrace) %>%
sort()
# values for obsrace
attributes(y$obsrace)$levels <- obsracelevels
# create a factor
y %<>% mutate(
obsrace = factor(obsrace, levels = obsracelevels)
)
```
Verify these were created:
```{r}
y$health %>% attributes()
head(y$health)
y$obsrace %>% attributes()
head(y$obsrace)
```
The same caveats apply to saving the data frame to files, with respect to the various attributes and which attributes are saved and restored after a write/read cycle.
## Tabulation
We have seen a few examples of tabulation in previous exercises (Sections \@ref(summarizingaggregating-data), \@ref(rmdtables), \@ref(default-values-for-arguments)). This section will give a more complete treatment using the Add Health data as an example.
### Raw counts
The most basic tabulations simply give the count of observations in different strata. Those strata can be based on numeric ratio or interval data, using functions such as `cut()` or `BAMMtools::getJenksBreaks()`, nominal data (such as the Add Health interviewer's observation of respondents' single race), or ordinal data (such as the self-reported health status). We will use examples of each type of data.
First, this code will load the full Add Health data set, which includes additional variables not present in the data set we have used previously.
```{r}
# download and unzip the larger data set
myUrl <- "http://staff.washington.edu/phurvitz/csde502_winter_2021/data/21600-0001-Data.dta.zip"
# zip file in $temp -- basename gets just the file name from the URL and not the URL path;
# file.path stitches the tempdir() path to the file name
zipfile <- file.path(mytempdir, basename(myUrl))
# dta file in $temp
dtafile <- tools::file_path_sans_ext(zipfile)
# check if the dta file exists
if (!file.exists(dtafile)) {
# if the dta file doesn't exist, check for the zip file
# check if the zip file exists, download if necessary
if (!file.exists(zipfile)) {
curl::curl_download(url = myUrl, destfile = zipfile)
}
# unzip the downloaded zip file
if (file.exists(zipfile)) {
unzip(zipfile = zipfile, exdir = mytempdir)
}
}
# if the data set has not been read, read it in
if (!exists("ahcomplete")) {
ahcomplete <- haven::read_dta(dtafile)
}
# lowercase column names
colnames(ahcomplete) %<>% str_to_lower()
```
The metadata (variable names and labels) are presented in `r table_nums(name = "ahcompletemeta", display = "cite")`.
*`r table_nums(name = "ahcompletemeta", caption = "Add Health large table variable names and labels")`*
```{r}
# create a data frame of the variable names and labels
ahcomplete_metadata <- bind_cols(
varname = colnames(ahcomplete),
varlabel = ahcomplete %>% map(~ attributes(.)$label) %>% unlist()
)
# print the table with DT::datatable for interactive display
DT::datatable(ahcomplete_metadata)
```
\
Let's look at body mass index (BMI) data, which uses weight and height values. We find the variables representing weight and height from the metadata above. We need to determine if there are any invalid values:
```{r}
attributes(ahcomplete$h1gh59a)$labels
attributes(ahcomplete$h1gh59b)$labels
attributes(ahcomplete$h1gh60)$labels
```
First, we need to select the variables representing feet and inches, then filter out invalid heights (> 90) and weights (> 900) identified above, and finally calculate height in m, weight in kg, and BMI ($\frac{weight}{height^2}$). For future use we will also select self-reported health and interviewer observed race as factors.
```{r}
# make the data frame
htwt <- ahcomplete %>%
# select columns
select(
feet = h1gh59a,
inches = h1gh59b,
weight_lb = h1gh60,
health = h1gh1,
obsrace = h1gi9
) %>%
# filter for valid values
filter(feet < 90 & inches < 90 & weight_lb < 900) %>%
# calculate metric units and BMI
mutate(
height_m = (feet * 12 + inches) / 39.37008,
weight_kg = weight_lb / 2.205,
BMI = weight_kg / height_m^2
)
# factor: get values and labels
healthvals <- htwt$health %>%
attributes() %>%
extract2("labels") %>%
as.numeric()
healthlabs <- htwt$health %>%
attributes() %>%
extract2("labels") %>%
names()
racevals <- htwt$obsrace %>%
attributes() %>%
extract2("labels") %>%
as.numeric()
racelabs <- htwt$obsrace %>%
attributes() %>%
extract2("labels") %>%
names()
# package the data frame up
htwt %<>%
mutate(
health = factor(health,
levels = healthvals,
labels = healthlabs
),
obsrace = factor(obsrace,
levels = racevals,
labels = racelabs
)
)
```
A histogram (`r figure_nums(name = "bmihist", display = "cite")` shows the distribution of BMI for the respondents with vertical lines at the 5th and 85th percentile. This range is generally considered "normal" or "healthy" according to the CDC, although for a sample with varying age, sex, height, and weight ranges, it is difficult to interpret. Nevertheless the cut points can serve the purpose of demonstration.
```{r}
# get the 5th & 85th percentile
bmibreaks <- quantile(x = htwt$BMI, probs = c(0.05, 0.85))
ggplot(htwt, aes(x = BMI)) +
geom_histogram(bins = 30) +
geom_vline(xintercept = bmibreaks)
```
\
*`r figure_nums(name = "bmihist", caption = "Histogram of BMI for Add Health cohort")`*
We assign the BMI class using these cut points, with `cut()` breaks at the minimum, 5%, 85%, and maximum BMI, and also assign labels and set ordering:
```{r}
htwt %<>%
mutate(bmiclass = cut(
x = BMI,
breaks = c(min(BMI), bmibreaks, max(BMI)),
labels = c("underweight", "normal", "overweight"),
include.lowest = TRUE
) %>%
factor(ordered = TRUE))
```
The tabulation of count of respondents by weight class can be generated with the base R function `table()` or `dplyr` functions `group_by()` and `summarise()` (`r table_nums(name = "bmiclass", display = "cite")`.
```{r}
# base R
table(htwt$bmiclass, useNA = "ifany")
```
*`r table_nums(name = "bmiclass", caption = "Count by BMI class, Add Health cohort")`*
```{r}
# tidyR
htwt %>%
group_by(bmiclass) %>%
summarise(n = n()) %>%
kable() %>%
kable_styling(full_width = FALSE, position = "left",
bootstrap_options = c("striped", "hover", "condensed", "responsive"))
```
For variables that are already nominal or ordinal factors, tabulations can be made directly. Because these were converted to factors, the correct labels will show, rather than simple numeric values. The tabulations will also be presented in descending order by label or level. Tabulation of self-reported health is in `r table_nums(name = "healthclass", display = "cite")` and by race observed by interviewer in `r table_nums(name = "obsraceclass", display = "cite")`.
*`r table_nums(name = "healthclass", caption = "Count by self-reported health class, Add Health cohort")`*
```{r}
htwt %>%
group_by(health) %>%
summarise(n = n()) %>%
kable() %>%
kable_styling(
full_width = FALSE,
position = "left",
bootstrap_options = c("striped", "hover", "condensed", "responsive")
)
```
*`r table_nums(name = "obsraceclass", caption = "Count by interviewer-observed race, Add Health cohort")`*
```{r}
htwt %>%
group_by(obsrace) %>%
summarise(n = n()) %>%
kable() %>%
kable_styling(
full_width = FALSE, position = "left",
bootstrap_options = c("striped", "hover", "condensed", "responsive")
)
```
### Proportions/percentages
Proportions and percentages can be added to tabulations for greater interpretability. Using base R, the `prop_table()` function can be used as a wrapper around `table()` to generate proportions, optionally multiplying by 100 to generate a percentage. Remember to round as needed. For example:
```{r}
round(prop.table(table(htwt$bmiclass)), 2)
round(prop.table(table(htwt$bmiclass)) * 100, 0)
```
Not surprisingly, the BMI classes show 5% underweight and 15% overweight, because that is how the stratification was defined.
The `tidyR` version requires a bit more coding but provides a more readable output. Because the percent sign is a special character, we enclose it in back ticks, ``%``. Here we generate a tabulation of observed race (`r table_nums(name = "obsraceclass", display = "cite")`).
*`r table_nums(name = "obsraceclasspct", caption = "Count by interviewer-observed race, Add Health cohort")`*
```{r}
htwt %>%
group_by(obsrace) %>%
summarise(n = n()) %>%
mutate(`%` = n / sum(n) * 100) %>%
mutate(`%` = `%` %>% round(1)) %>%
kable() %>%
kable_styling(
full_width = FALSE, position = "left",
bootstrap_options = c("striped", "hover", "condensed", "responsive")
)
```
### Stratified tabulation
Tabulations can be generated using multiple variables. Here we will examine BMI and race as well as BMI and health. The percentages sum to 100 based on the order of the grouping.
Here, we see the relative percent of underweight, normal, and overweight within each race class (`r table_nums(name = "obsracebmiclass", display = "cite")`).
*`r table_nums(name = "obsracebmiclass", caption = "Count by interviewer-observed race and BMI class, Add Health cohort")`*
```{r}
htwt %>%
group_by(
obsrace,
bmiclass
) %>%
summarise(n = n(), .groups = "drop_last") %>%
mutate(`%` = n / sum(n) * 100) %>%
mutate(`%` = `%` %>% round(1)) %>%
kable() %>%
kable_styling(
full_width = FALSE, position = "left",
bootstrap_options = c("striped", "hover", "condensed", "responsive")
)
```
The table can also be presented with grouped rows. This code generates a named
vector of the count of observed race by BMI class
```{r}
# create the row groupings
(obsrace <- htwt %>%
group_by(
obsrace,
bmiclass
) %>%
summarise(n = n(), .groups = "drop_last") %>%
group_by(obsrace) %>%
summarise(n = n()) %>%
deframe())
```
The groups are then used in the `pack_rows()` function for grouped rows (`r table_nums(name = "obsracebmiclasspct", display = "cite")`).
*`r table_nums(name = "obsracebmiclasspct", caption = "Count by interviewer-observed race and BMI class, Add Health cohort, percent by race")`*
```{r}
htwt %>%
group_by(
obsrace,
bmiclass
) %>%
summarise(n = n(), .groups = "drop_last") %>%
mutate(`%` = n / sum(n) * 100) %>%
mutate(`%` = `%` %>% round(1)) %>%
kable %>%
kable_styling(
full_width = FALSE, position = "left",
bootstrap_options = c("striped", "hover", "condensed", "responsive")
) %>%
pack_rows(index = obsrace)
```
And here we see the relative percent of different race groups within each BMI class (`r table_nums(name = "obsbmiraceclasspct", display = "cite")`.
*`r table_nums(name = "obsbmiraceclasspct", caption = "Count by interviewer-observed race and BMI class, Add Health cohort, percent by BMI class")`*
```{r}
htwt %>%
group_by(
bmiclass,
obsrace
) %>%
summarise(n = n(), .groups = "drop_last") %>%
mutate(`%` = n / sum(n) * 100) %>%
mutate(`%` = `%` %>% round(1)) %>%
kable() %>%
kable_styling(full_width = FALSE, position = "left",
bootstrap_options = c("striped", "hover", "condensed", "responsive")
)
```
Even though the `n` values are the same in the two tables (e.g., underweight $\times$ White n = `r htwt %>% filter(str_detect(obsrace, "White") & str_detect(bmiclass, "under")) %>% nrow()`), the percentages are different due to grouping. That is, the percent of underweight persons among the White race stratum is different from the percent of White persons within the underweight stratum. Proper grouping is critical to answer specific questions about the data.
Another way to understand the data, preferred by some, would be to make a graph. For example, BMI by race as a facet grid bar graph (`r `figure_nums(name = "bmibarfacet", display = "cite")`).
```{r}
bmi_race <- htwt %>%
group_by(
obsrace,
bmiclass
) %>%
summarise(n = n(), .groups = "drop_last") %>%
mutate(`%` = n / sum(n) * 100) %>%
filter(!str_detect(obsrace, regex("refused|know", ignore_case = TRUE)))
ggplot(data = bmi_race, mapping = aes(x = bmiclass, y = `%`)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
facet_grid(~obsrace) +
xlab("BMI class")
```
\
*`r figure_nums(name = "bmibarfacet", caption = "BMI class by race, Add Health cohort")`*
Or as a stacked bar graph (`r figure_nums(name = "bmibarstack", display = "cite")`)
```{r}
ggplot(data = bmi_race, mapping = aes(x = obsrace, y = `%`, fill = bmiclass)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_discrete(name = "BMI class") +
xlab("observed race")
#theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
```
\
*`r figure_nums(name = "bmibarstack", caption = "BMI class by race, Add Health cohort")`*
<hr>
Rendered at <tt>`r Sys.time()`</tt>
## Source code
File is at `r fnamestr`.
### R code used in this document
```{r ref.label=knitr::all_labels(), echo=TRUE, eval=FALSE}
```
### Complete Rmd code
```{r comment=''}
cat(readLines(fnamepath), sep = '\n')
```