-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathEpiconceptIPD.Rmd
1101 lines (733 loc) · 44.5 KB
/
EpiconceptIPD.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Analysing the effect of vaccine intervention on IPD incidence in Scotland, 2000-2017"
output: worded::rdocx_document
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
echo = TRUE,
oncordance = TRUE,
message = FALSE,
warning = FALSE,
fig.width = 9,
fig.height = 6
)
```
```{r packages4knit, echo = FALSE}
#load packages
#Those required for creating this markdown:
#"Worded": for styling and pagebreaks (devtools::install_github("davidgohel/worded"))
#"knitr": for styling tables
required_packages <- c("worded", "knitr")
for (i in seq(along = required_packages)) {
library(required_packages[i], character.only = TRUE)
}
```
<!---CHUNK_PAGEBREAK--->
# Copyright and License
**Source:**
This case study was first designed by Lukas Richter and Alexander Spina in 2018 using *R*. It is based on surveillance data from an Scotland.
**Revisions:**
*If you modify this case study, please indicate below your name and changes you made*
**You are free:**
- **to Share** — to copy, distribute and transmit the work
- **to Remix** — to adapt the work
**Under the following conditions:**
- **Attribution** — You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). The best way to do this is to keep as it is the list of contributors: sources, authors and reviewers.
- **Share Alike** — If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one. Your changes must be documented. Under that condition, you are allowed to add your name to the list of contributors.
- You cannot sell this work alone but you can use it as part of a teaching.
**With the understanding that:**
- **Waiver** — Any of the above conditions can be waived if you get permission from the copyright holder.
- **Public Domain** — Where the work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
- **Other Rights** — In no way are any of the following rights affected by the license:
- Your fair dealing or fair use rights, or other applicable copyright exceptions and limitations;
- The author's moral rights;
- Rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.
- **Notice** — For any reuse or distribution, you must make clear to others the license terms of this work by keeping together this work and the current license.
This licence is based on http://creativecommons.org/licenses/by-sa/3.0/
<!---CHUNK_PAGEBREAK--->
## Objectives of the tutorial
At the end of the tutorial participants should be able to
- create a plan of analysis for simple segmented regression
- use R for exploratory analysis including installing the software, importing a dataset, setting time units for aggregation, data checking, investigating trend and seasonality
- understand how to run a simple negative binomial segmented regression
## Background to the tutorial
The objective of the tutorial is to carry out an analysis on the effect of a vaccine intervention on the incidence of invasive pneumococcal disease (IPD).
There are two available vaccines for IPD, one which covers seven serotypes and one which covers 13 serotypes, called PCV7 and PCV13 respectively.
The dataset is Scottish surveillance data from 2000 to 2017. This includes several years of pre-vaccination data (2000-2006), 3 years of PCV7 years (2007-2009) and 8 years of PCV13 years (2010-2017).
The dataset is case-based (every row is a case) and provides the year and month of notification, the age group, and the serotype.
In addition, there are population counts for Scotland between 2000 and 2017 overall and broken down by age group.
<!---CHUNK_PAGEBREAK--->
# Question 1
Open the datasets (*ipd_cases.csv* and *ipd_population.csv*) in R and briefly explore them (make sure your working directory is set in R). Import the necessary packages and install the corresponding packages.
With the assumption that the reporting in the dataset here is stable over time, create an outline of a mini-protocol for the analysis, including hypotheses to be tested, a plan of analysis, with specifications for the dataset, inclusion/exclusion criteria, etc.
<!---CHUNK_PAGEBREAK--->
# Help Question 1
## R-Studio environment
*R-Studio* is an integrated development environment (IDE) for *R*. It includes a console, syntax highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management. You can download it freely from the [*R-Studio* website](https://www.rstudio.com/).
The *RStudio* window comprises the following sections:
![](RstudioIDE.png)
1. A script which is equivalent to a do file in STATA or pgn in EpiInfo.
2. A console where you can write commands and execute them, and where results other than graphs are shown.
3. A “Environment” and “History” windows where you can find the list of R objects being stored and the log of your session, respectively.
4. The “Plots”, “Packages” and “Help” windows, for graphic visualisation, selection of installed packages and help documentation, respectively.
## R-scripts
Throughout this case study we strongly encourage you to use *R-scripts* to keep track of the commands you are using to follow your plan of analysis. It will contain data management steps as well as descriptive and statistical analysis.
Open R studio and open a new *R-script*.
To open a new *R-script*, click on the plus button on the top left (below File), then select *R-script* from the drop-down menu.
![](RstudioScript.png)
## Working directory
Make sure you have created a working directory on your computer. This is where you store your data and your R scripts and outputs. An example of working directory could be “C:/data/crypto” if you are using a PC.
To set your working directory you can use the menu function by clicking "Session" > "Set Working Directory" > "Choose Directory..." and then you can select the appropriate folder.
![](RstudioWorkingDir.png)
You could alternatively do this by adding code to your *R-script* (or the command line).
You can check the path for your current working directory using the *getwd* function.
```{r, eval = FALSE}
#Check your current working directory
getwd()
```
To set your working directory you can use the *setwd* function.
```{r, eval = FALSE}
#set working directory
#note the forward rather than backwards slashes
setwd("C:/data/IPD")
```
## Packages
R comes with a set of default functions. However one of the strengths of R is that many user-written functions exist to help you do your work. To use them, you first need to install the package they belong to.
To do this you can click on "Install" on the bottom right in the "Packages" window.
![](RstudioPackages.png)
<!---CHUNK_PAGEBREAK--->
We will mainly use packages which come ready installed with R (base code), but where it makes things easier we will use add-on packages. All the R packages you need for the exercises can be installed over the Internet.
You will need to install these before starting. We will be using the following packages.
- *zoo*: for producing rolling statistics and year-months
- *MASS*: for performing negative binomial regression
- *broom*: for tidying and formatting model outputs
You can also install packages using the *install.packages* function, as below. Do not try to understand this code.
```{r}
#choose packages required
required_packages <- c("zoo", "MASS", "broom")
```
```{r, eval = FALSE}
# Installing required packages for this case study
for (pkg in required_packages) {
if (!pkg %in% rownames(installed.packages())) {
install.packages(pkg, dependencies = TRUE)
}
}
```
You only need to install packages once. However you will need to load them each time you start a new session using the *library* function.
```{r, eval = FALSE}
library(MASS)
```
Here too, you can load multiple packages at the same time within a loop; again do not try to understand this just yet, just appreciate it is possible.
```{r, results = 'hide', message = FALSE, warning = FALSE}
# Loading required packages for this case study
for (i in seq(along = required_packages)) {
library(required_packages[i], character.only = TRUE)
}
```
<!---CHUNK_PAGEBREAK--->
## Reading in data
Import the dataset from a comma seperated file (.csv) using the `read.csv` function, storing it as a dataframe within *R* called *ipd_cases*. Also import population data using the same approach (*population*). For a CSV file the separator is normally a comma (","), however depending on the language of your operating system this can also be other values, for example some non-english speaking European countries use a semi-colon (";") as standard. Here we also specify that we do not want to format strings as factors. When reading population data we additionally define the option `check.names = FALSE` which makes sure that variable names are kept as defined in the CSV. By default (`check.names = TRUE`) *R* would automatically change column names to follow certain rules (e.g. no numeric as first character).
```{r read_data}
# read in your cases dataset
ipd_cases <- read.csv("data/ipd_cases.csv", sep = ",", stringsAsFactors = FALSE)
# read in your population dataset
population <- read.csv("data/ipd_population.csv", sep = ",",
stringsAsFactors = FALSE, check.names = FALSE)
```
## Browse data
Take a look at the first 10 rows of the data or browse the dataframe in the RStudio preview.
```{r, eval = FALSE}
# view the first ten rows of data
head(ipd_cases, n = 10)
# view your whole dataset
View(ipd_cases)
```
Remember that `View` needs to be written with a capital *V*.
```{r, echo = FALSE}
# participants dont see this code
# creates nice table output for word doc
kable(head(ipd_cases, n = 10))
```
Alternatively you can view the data by clicking on the data by clicking on the data icon to the right of the view.
![](RstudioView.png)
Another option is to use *R*'s `str` function to display the structure of an *R* object compactly.
```{r}
str(ipd_cases)
```
You get a short overview of the dimensions of the dataset (number of observations, number of variables) and additionally the class of each variable including the first couple of observations.
It is often useful to inspect character and factor variables separately to get to know the used values and labels and find out about potential spelling mistakes and different styles. Use the `unique` function on each of the character variables to get a list of all unique values in each variable.
<!---CHUNK_PAGEBREAK--->
```{r, eval = FALSE}
# see unique values in your variables
unique(ipd_cases$Month)
unique(ipd_cases$AgeGroup)
unique(ipd_cases$Serotype)
```
**Examples of hypotheses to be tested could be:**
- Is there a trend or seasonality in the IPD reporting?
- Are there sufficient case counts for monthly regression?
- Has the introduction of thirteen valent vaccine impacted incidence?
**Inclusion/exclusion criteria:** Only records with year of notification will be included (drop records with missing notification year)
**Time unit for analysis:** Months or Years
**Draft plan of analysis:**
**Data cleaning**
- Plot original data
- Data description (frequency distributions, summary tables) of all variables
- Checks for missing data
- Checks for numeric attraction in reporting due to reporting bias, including statistical tests
**Data description**
- To better understand the nature of the data
- Plot the data
- Calculate statistical descriptors
- Moving average (exploring trend and seasonality)
- Decomposition of the dataset
- Checks for trend (check R^2^)
- Checks for seasonality (taking trend into account)
- Spectral analysis, Periodogram
- Check R^2^ of models
- Test for white noise
**Segmented regression**
- Decide whether there are sufficient case counts for segmented regression based on monthly data
- Define pre- and post-intervention periods
- Carry out negative binomial regression
<!---CHUNK_PAGEBREAK--->
# Question 2
Make any changes necessary to the dataset. Carry out data checking, checks for missing data. Create new variables that may be necessary.
- Make sure that `ipd_cases$Month` only contains values between 1-12 (Jan-Dec)
- Check if there are cases missing month
- If there are, make sure these are labelled correctly
- Make sure that `ipd_cases$Serotype` categories are uniform
- Define two variables in our data to indicate if cases correspond to PCV7 and/or PCV13 serotypes
- Define a variable which combines year and month
<!---CHUNK_PAGEBREAK--->
# Help Question 2
Remember in the last section that we looked at unique values in each of our variables. This is a quick and easy way to check the contents of variables. Particularly in our case as we only have a few variables.
```{r, eval = FALSE}
# see unique values in your variables
unique(ipd_cases$Year)
unique(ipd_cases$Month)
unique(ipd_cases$AgeGroup)
unique(ipd_cases$Serotype)
```
From this we can see that the years all appear to be in the time frame that we expected and age groups appear to be uniformly reported. However the month variable appears to have missings which have been entered as "#NULL!" and the serotype variable does not appear to be uniformly reported with some cases having "serotype" written out in front of the number.
## Data cleaning
### correcting month variable
Here we use the square brackets to subset rows where month is equal to "#NULL!" and replace the content of the month variable with NA.
```{r}
# correct the labelling of missing values
ipd_cases$Month[ipd_cases$Month == "#NULL!"] <- NA
```
The next thing that we might realise is that this is not a numeric variable.
```{r}
# check the class of month
class(ipd_cases$Month)
```
We may want to change this from a character to a numeric variable.
```{r}
# overwrite month as a numeric variable
ipd_cases$Month <- as.numeric(ipd_cases$Month)
```
### Correcting serotype variable
In order to get rid of the "serotype" text before the value we can use the *gsub* function.
This will search for a specific string of characters in a given variable and then replace that with another string. So here we can search for "serotype " (note the space at the end!) in `ipd_cases$Serotype` and replace that with nothing (""), thereby deleting it.
```{r}
# replace "serotype " with nothing in the serotype variable
ipd_cases$Serotype <- gsub(pattern = "serotype ", replacement = "",
x = ipd_cases$Serotype, fixed = TRUE)
```
## Variable creation
### Assigning cases based on serotype
The PCV7 covers IPD serotypes 4, 6B, 9V, 14, 18C, 19F and 23F. The PCV13 additionally covers serotypes 1, 5, 7F, 3, 6A and 19A.
First define which serotypes are in each of the two vaccines. We will use these character strings later to identify cases with either PCV7 or PCV13 serotypes.
```{r vaccine_definition}
# define strings of each vaccine type
pcv7 <- c("4", "6B", "9V", "14", "18C", "19F", "23F")
pcv13 <- c("4", "6B", "9V", "14", "18C", "19F", "23F", "1", "5", "7F", "3", "6A", "19A")
```
To do this we will create two variables, one for each vaccine type. So if the identified serotype is one of the PCV7 serotypes, then that case will be tagged as a PCV7 case.
Define two columns in our data to indicate if the case corresponds to PCV7 and/or PCV13 serotypes, respectively. Values in these columns are either `TRUE` or `FALSE`; for most purposes *R* will interpret these variables as 1/0.
In plain english, the code below reads create a new "PCV7" variable and where the serotype is in the "pcv7" string (defined above), return TRUE for that row otherwise return FALSE.
```{r}
# create a new variable defining cases by serotypes
ipd_cases$PCV7 <- ipd_cases$Serotype %in% pcv7
ipd_cases$PCV13 <- ipd_cases$Serotype %in% pcv13
```
Also create a third variable which defines serotypes covered by neither vaccine.
To do this we use the exclamation mark to designate "not" and the ampersand to combined arguments with an "and".
So this reads, make the new variable NonVacc true where serotype is not in pcv7 and serotype is not in pcv13.
```{r}
# create new var with other serotypes not covered
ipd_cases$NonVacc <- !ipd_cases$Serotype %in% pcv7 & !ipd_cases$Serotype %in% pcv13
```
### Creating a year/month combined variable
It may be useful to combine the year and month variable for later investigation. To do this use the *paste0* function, which simply combines multiple inputs in to one character string.
In this case we will be pasting the year, followed by an dash and then the month (e.g. 2011-12).
We could then use the *as.yearmon* function from the *zoo* package to order this character variable correctly.
```{r}
# create a new variable with combined year and month
ipd_cases$Year_Month <- paste0(ipd_cases$Year, "-", ipd_cases$Month)
# change this variable in to a year month variable
ipd_cases$Year_Month <- as.yearmon(ipd_cases$Year_Month)
```
# Question 3
Describe the data. Check how many missing values there are. Plot cases by different time units of interest (e.g. by month, quarter and year). Use the "moving average" to explore possible trends and seasonality within your data.
<!---CHUNK_PAGEBREAK--->
# Help Question 2
## Looking at missings
You will want to start by seeing what missing values there are.
You can see this by combining the *is.na* function with the *table* function.
```{r}
#check missing values for serotype
table(is.na(ipd_cases$Serotype))
# check missing values for month
table(is.na(ipd_cases$Month))
```
We can see that there are several missing values for both serogroup and month. This is important to keep in mind for later when we are deciding on the time unit of interest.
## Basic descriptives
You may want to start by just looking at simple counts and proportions of your variables using the *table* and *prop.table* functions.
```{r}
# case counts by age group
agegrp <- table(ipd_cases$AgeGroup)
# case counts by year
counts <- table(ipd_cases$Year)
```
Primarily the older age groups appear to be affected. Annual cases seem to fluctuate somewhat by year though this all appears relatively constant. You could then also break down counts by age group and year.
```{r}
# counts by age group and year
table(ipd_cases$Year, ipd_cases$AgeGroup)
```
When looking at the number of cases per month, a table might be too large to get an oversight. So you may want to plot your cases by month, combining *table* and *barplot*.
```{r}
# get counts by year and month
monthcounts <- table(ipd_cases$Year_Month)
# make a barplot
barplot(monthcounts)
```
This plot highlights two things. First is that there appears to be some seasonality, which is worth investigating. Second is that there are consistently less than 100 cases per month, which will become important later when considering time units for regression.
# Seasonality and moving averages
Start by creating a timeseries object from your monthly variable. To do this we first need to use our monthly counts from above and the *ts* function. We can then define starts and ends to our series as well as the frequency we are interested in (in our case 12 is monthly).
```{r}
ipd_ts <- ts(monthcounts, start = c(2000,1), end = c(2017, 12), frequency = 12)
```
We now want to use moving averages to investigate possible trends and seasonality.
We can calculate these using the *rollapply* function.
We will calculate the centred moving average for several different time windows. Try for example to do one each for 12 months, 6 months and 3 months.
```{r}
# calculate annual moving average
ma12 <- rollapply(ipd_ts, width = 12, FUN = mean)
```
We can then plot this using the *ts.plot* function.
```{r}
# plot your timeseries with annual moving average
ts.plot(ipd_ts, ma12,
main = "Monthly cases and moving average of 12 months, IPD cases, 2000-2017",
ylab = "number of cases",
xlab = "month of notification", col = c("black", "red"))
```
Now try with a 6 month moving window.
```{r}
# calculate annual moving average
ma6 <- rollapply(ipd_ts, width = 6, FUN = mean)
```
We can again plot this using the *ts.plot* function.
```{r}
# plot your timeseries with annual moving average
ts.plot(ipd_ts, ma6,
main = "Monthly cases and moving average of 6 months, IPD cases, 2000-2017",
ylab = "number of cases",
xlab = "month of notification", col = c("black", "red"))
```
Now try with a 3 month moving window.
```{r}
# calculate annual moving average
ma3 <- rollapply(ipd_ts, width = 3, FUN = mean)
```
We can again plot this using the *ts.plot* function.
```{r}
# plot your timeseries with annual moving average
ts.plot(ipd_ts, ma3,
main = "Monthly cases and moving average of 3 months, IPD cases, 2000-2017",
ylab = "number of cases",
xlab = "month of notification", col = c("black", "red"))
```
It is also possible to decompose a time series automatically into seasonal, trend and irregular components using moving averages with the *decompose* function.
```{r}
plot(decompose(ipd_ts))
```
A clear trend cannot be observed. It appears to fluctuate a bit, however there does appear to be a decrease in cases after the vaccine introduction in 2007 but this seems to pick up again after 2013.
A strong seasonal trend can be observed in the winter months, which is consistent with published findings of IPD seasonality.
It may make things clearer to first look at regressions with annual counts. This is for a number of reasons for this; no clear trend, seasonality stable and low monthly case numbers (as well as missing notification months) means that looking at annual counts without correcting for seasonality may give more robust results. If absolutely necessary it may be worth considering moving on to quarterly counts.
<!---CHUNK_PAGEBREAK--->
# Question 3
Before moving on to segmented regression, consider looking at counts and plots based on the vaccine serogrouping variables we created previously.
<!---CHUNK_PAGEBREAK--->
# Help question 3
## Producing counts by age group and year
First get counts for the overall group by year and age group. Remember that in a previous question we produced this using the *table* function.
The *table* function, however, does not provide column or row sums. In order to do this, we can use the *addmargins* function in combination with *table*. We add our table of counts, then specify that we are interested in row sums (margin = 2),
```{r}
# tabulate cases by year and age-group.
annual_cases <- addmargins(table(ipd_cases$Year, ipd_cases$AgeGroup),
margin = 2, FUN = sum)
# rename the sum column
# colnames where colname is "sum" change to "Total"
colnames(annual_cases)[colnames(annual_cases) == "sum"] <- "Total"
```
An alternative to having to rename the "sum" column above, would be to instead name the column immediately when calling the function by typing: FUN = list(Total = sum)
```{r, eval = FALSE}
# view your counts table
annual_cases
```
```{r, echo = FALSE}
# participants dont see this code
# creates nice table output for word doc
kable(annual_cases)
```
## Producing counts by serotype groupings, age group and year
We can do that same calculation as above, however subsetting our counts for the different serotype groupings of interest. Remember that we created variables for PCV7, PCV13 and NonVacc serotypes, which were all TRUE/FALSE variables.
We subset using the square brackets, and only need to specify the variable of interest - as only those rows with TRUE will be counted.
```{r}
# create annual counts of PCV7 serotype cases
annual_pcv7 <- addmargins(table(ipd_cases$Year[ipd_cases$PCV7],
ipd_cases$AgeGroup[ipd_cases$PCV7]),
margin = 2, FUN = list(Total = sum))
# create annual counts of PCV13 serotype cases
annual_pcv13 <- addmargins(table(ipd_cases$Year[ipd_cases$PCV13],
ipd_cases$AgeGroup[ipd_cases$PCV13]),
margin = 2, FUN = list(Total = sum))
# create annual counts of non vaccination serotype cases
annual_nonvacc <- addmargins(table(ipd_cases$Year[ipd_cases$NonVacc],
ipd_cases$AgeGroup[ipd_cases$NonVacc]),
margin = 2, FUN = list(Total = sum))
```
## Plot counts by vaccine serotype group
We can then use these annual counts to plot the various serotype groupings. We can also highlight years with PCV7/13 introduction (as red dashed vertical lines). Remember, PCV7 was introduced in 2007, and PCV13 in 2010.
In our annual counts tables, the years are in the row names (i.e. not a column within the table). So within plot we specify that our x-variable is years (i.e. the rownames of our counts table). We then specify that our y-variable should be the "Total" column from our counts table. We specify that we want a line graph and put our various axis, main and sub labels as well as set the y-axis minimum and maximum (ylim).
We could then add *points* to our graph specifying the x and y variables in the same way as above. Finally, we can add vertical lines by specifying where we would like them on the x axis (2007 and 2010), the colour and the line type we would like.
<!---CHUNK_PAGEBREAK--->
**PCV7:**
```{r}
# plot your annual counts as a line graph
plot(rownames(annual_pcv7), annual_pcv7[, "Total"],
type = "l",
xlab = "Year", ylab = "Number of cases",
main = "IPD cases of PCV7 serotypes, total population",
sub = "red dashed: PCV7 and PCV13 introduction, respectively",
ylim = c(0, 450))
# add points to your line graph
points(rownames(annual_pcv7), annual_pcv7[, "Total"])
# add vertical lines to designate vaccine introduction
abline(v = c(2007, 2010), col = "red", lty = "dashed")
```
**PCV13:**
We can then do the same thing for PCV13 serotypes, adjusting the y axis accordingly.
```{r}
# plot your annual counts as a line graph
plot(rownames(annual_pcv13), annual_pcv13[, "Total"],
type = "l",
xlab = "Year", ylab = "Number of cases",
main = "IPD cases of PCV13 serotypes, total population",
sub = "red dashed: PCV7 and PCV13 introduction, respectively",
ylim = c(0, 700))
# add points to your line graph
points(rownames(annual_pcv13), annual_pcv13[, "Total"])
# add vertical lines to designate vaccine introduction
abline(v = c(2007, 2010), col = "red", lty = "dashed")
```
The two graphs above show that there was a strong reduction after in the number of annual cases with the serotypes covered by the vaccines.
**NonVacc:**
We can then repeat for non vaccination serotypes.
```{r}
# plot your annual counts as a line graph
plot(rownames(annual_nonvacc), annual_nonvacc[, "Total"],
type = "l",
xlab = "Year", ylab = "Number of cases",
main = "IPD cases of PCV13 serotypes, total population",
sub = "red dashed: PCV7 and PCV13 introduction, respectively",
ylim = c(0, 700))
# add points to your line graph
points(rownames(annual_nonvacc), annual_nonvacc[, "Total"])
# add vertical lines to designate vaccine introduction
abline(v = c(2007, 2010), col = "red", lty = "dashed")
```
In contrast to the previous plots, this shows us that the annual cases with serotypes not covered by the vaccine acctually increased after their introduction.
## Combining plots for comparison
We can combine the plots for PCV13 and non vaccine types in order to get a clearer understanding of the relation. To do this plot your PCV13 plot as above, then add in lines and points for non-vaccination serotypes using the *lines* and *points* functions. You can then also add a legend using *legend*.
```{r}
# plot PCV13 annual counts as a line graph
plot(rownames(annual_pcv13), annual_pcv13[, "Total"],
type = "l",
xlab = "Year", ylab = "Number of cases",
main = "IPD cases of PCV13 and Non-vaccine serotypes, total population",
sub = "red dashed: PCV7 and PCV13 introduction, respectively",
ylim = c(0, 700), col = "dark green")
# add points for pcv13
points(rownames(annual_pcv13), annual_pcv13[, "Total"], col = "dark green")
# add vertical lines for intervention
abline(v = c(2007, 2010), col = "red", lty = "dashed")
# add line to graph for non vaccine serotypes
lines(rownames(annual_nonvacc), annual_nonvacc[, "Total"], col = "blue")
# add points to graph for non vaccine serotypes
points(rownames(annual_nonvacc), annual_nonvacc[, "Total"], col = "blue")
# add a horizontal legend
# pch specifies that you want coloured dots
legend("top", horiz = TRUE,
legend = c("PCV13", "Non-Vacc"),
col = c("dark green", "blue"), pch = 1)
```
The above figure quite nicely shows that non vaccination serotypes increased while the PCV13 serotypes decreased concurrently after the introduction of vaccine in 2007. This is quite a neat demonstration of serotype replacement, which has been widely documented in literature. This may be something that should be investigated in regression analysis.
<!---CHUNK_PAGEBREAK--->
# Question 5
Run a segmented negative binomial regression model investigating the effect of vaccination on the incidence rates of cases with PCV13 serotypes.
<!---CHUNK_PAGEBREAK--->
# Help question 5
## Segmented regression for PCV13 serotypes
We can use segmented regression to assess the impact of a certain interventions, such as the introduction of a vaccine programme. You can think of it as the data being devided into two segments (or periods), before and after the intervention, where regression models are estimated for both time periods. To do this, you could split the time series in two segments and run regression models for each. However, the advantage of segmented regression is that estimates for both segments come from the same model and therefore it is easy to calculate the effect of the intervention (which can loosely be defined as the difference of the two segments) and confidence intervals directly from the model.
We will need the `glm.nb` function from the *MASS* package to perform negative binomial regression. Additionally we will use the `tidy` function from the *broom* package to create nice model outputs.
In general, the negative binomial regression model can be written as follows.
$log(Y_t) = \beta_0 + \beta_1 * t + \beta_2 * \delta(t-t_0) + \beta_3 * (t - t_0)^+ + log(pop_t) + e_t$
Here, $Y_t$ is the number of IPD cases observed in year $t$; $pop_t$ is the population size in 100,000s in year $t$.
$t_0$ is the last year (7^th^ year) of the pre-period (including the transition time if there is any), $\delta(x)$ is the indicator function (it is 0 if $x\le 0$ and 1 if $x > 0$), $(x)^+$ is the cut off operator (it is $x$ if $x>0$ and 0 otherwise), and $e_t$ denotes the residual.
This means, that $\beta_2 * \delta(t-t_0) + \beta_3 * (t - t_0)^+$ is the additional generalised linear part of the post-period, which is zero in the pre-period.
The estimates of $\beta_2$ and $\beta_3$ will denote the effects of the intervention. Of course more periods and terms could be added, if needed, in order to control for seasonality.
In the following we will analyse annual incidence of PCV13 serotypes in the overall population (*total*). We will define 2007 as the time of intervention, this is when PCV7 vaccination began. We choose this because PCV7 serotypes are also covered by PCV13 and we already see an impact on PCV13 serotypes from 2007 onwards.
Analoguously, sub-populations and other intervention definitions can be analysed. However, our observed number of cases is not high enough for this type of stratification. To obtain more robust results a certain number of time points per period and a certain number of observations per time point is required. It is desirable to have 12 time points per period (e.g. 12 years pre- and post-intervation; not fulfilled in our example) and _at least_ 100 observations (cases) per time point. It must be said however that these numbers are rough suggestions and not derived power estimates.
First, transform the table (*annual_pcv13*) into a data.frame. Use the *as.data.frame.matrix* function to do so.
Check the class of *annual_pcv13*:
```{r}
# check class of object
class(annual_pcv13)
```
In our case, the year is stored as row name and not as a column itself. Therefore transform the data into a data.frame as explained above and add the year as seperate column. Make sure that the year column is of class numeric.
```{r}
# make counts table in to a data.frame
pcv13_aggr <- as.data.frame.matrix(annual_pcv13)
# make years in to a variable
# take the rownames and make them numeric
pcv13_aggr$year <- as.numeric(rownames(pcv13_aggr))
```
We then need to merge our population data to be able to calculate incidences. To do this we use the *merge* function. We use "year" as our unique identifier in both datasets, and specify that pcv13_aggr is our main dataset so only the rows of that dataset should be considered (no new additions from population rows; equivalent to a left-join).
Once merged we can subset and only keep the columns of interest to us. Using square brackets.
```{r}
# merge counts with population
pcv13_df <- merge(pcv13_aggr, population, by = "year", all.x = TRUE)
# only keep relevant columns
pcv13_df <- pcv13_df[, c("year", "Total", "Total_pop")]
```
We then want to define the parameters and columns needed for segmented regression.
```{r}
# year when vaccination with PCV7 started
intervention_year <- 2007
# corresponds to t in the formula
# count of years
pcv13_df$linear <- 1:nrow(pcv13_df)
# corresponds to delta(t-t0) in the formula
# pre or post vaccination period
pcv13_df$const_vacc <- as.numeric(pcv13_df$year >= intervention_year)
# corresponds to (t-t0)^+ in the formula
# count of years post intervention
pcv13_df$lin_vacc <- pmax(0, pcv13_df$year - intervention_year + 1)
# population in 100,000s
pcv13_df$pop100 <- pcv13_df$Total_pop / 100000
```
We can then pass these parameters through the *glm.nb* function of the *MASS* package. We specify our dataset, as well as the "Total" variable (with annual counts) and our parameters from above. This gives us a segmented negative binomial regression with population per 100,000 as offset.
```{r}
# negative binomial segmented regression
nb <- glm.nb(Total ~ linear + const_vacc + lin_vacc + offset(log(pop100)), data = pcv13_df)
```
## Interpretation of intervention effects
Create table with results. Use the `tidy` function from *broom* package to simplify the regression output.
```{r}
# create a table with results
nb_tidy <- tidy(nb, exponentiate = FALSE, conf.int = TRUE)
```
Take a look at your results.
```{r, eval = FALSE}
# view output
nb_tidy
```
```{r, echo = FALSE}
# participants dont see this code
# creates nice table output for word doc
kable(nb_tidy)
```
We then need to exponentiate the estimates to make them interpretable as Incidence Rate Ratios. This makes the intercept equal to the estimate of baseline incidence per 100,000.
```{r}
# exponentiate results for IRRs
nb_tidy_exp <- tidy(nb, exponentiate = TRUE, conf.int = TRUE)
```
```{r, eval = FALSE}
# view results
nb_tidy_exp
```
```{r, echo = FALSE}
# participants dont see this code
# creates nice table output for word doc
kable(nb_tidy_exp)
```
The percentage change of the anual incidence by year can be computed using the formula
$IC\% = 100 * (\exp(\hat{\beta}) - 1)$.
Therefore, after the intervention, the anual incidence changed by `r paste0("$100 * (\\exp(", round(nb_tidy$estimate[4],3) , ") - 1) = ", round(100*(exp(nb_tidy$estimate[4])-1), 1), "\\%$")`, after controlling for the baseline trend.
## Plot results
We may then want to visualse our results. To do this we first need to calculate model predictions of incidence and confidence intervals using the *predict* function.
```{r}
# store new dataset for fit estimation
nb_predict <- pcv13_df
nb_predict$pop100 <- 1
# pass fitted model and baseline through predict to calculate standard error
pred.ci <- predict(nb, nb_predict, type = "link", se.fit = TRUE)
# use standard errors of model fit to calculate CIs
pred.ci <- within(pred.ci, {
inci <- exp(fit)
LL <- exp(fit - 1.96 * se.fit)
UL <- exp(fit + 1.96 * se.fit)
})
# create a variable for observed incidence rate
nb_predict$inci_observed <- nb_predict$Total / nb_predict$Total_pop * 100000
# create variables for modelled incidence with CIs
nb_predict$inci_modelled <- pred.ci$inci
nb_predict$lowerCI <- pred.ci$LL
nb_predict$upperCI <- pred.ci$UL
```
Once we have calculated these values and incorporated them in to our data frame, we want to split this in two. One data.frame for the pre vaccine period and one for the post vaccine period.
```{r}
# split into 2 data.frames for plotting (pre/post)
pre <- nb_predict[nb_predict$year < intervention_year, ]
post <- nb_predict[nb_predict$year >= intervention_year, ]
```
We can then pass all these values through the *plot* function. We start by plotting our observed incidence by year as points. Then we add lines for the modelled incidence in the pre and post vaccine periods with their respective confidence intervals.
```{r}
# plot observed incidence by year as points
plot(nb_predict$year, nb_predict$inci_observed, type = "p",
xlab = "Year", ylab = "Number of cases per 100.000 pop",
main = "IPD incidence of PCV13 serotypes, total population",
sub = "red dashed: PCV7 introduction",
xlim = range(nb_predict$year),
ylim = c(0, 15))
# plot lines for modelled incidence (with CIs) in pre period
lines(pre$year, pre$inci_modelled, type = "l")
lines(pre$year, pre$lowerCI, type = "l", lty = "dashed")
lines(pre$year, pre$upperCI, type = "l", lty = "dashed")
# plot lines for modelled incidence (with CIs) in post period
lines(post$year, post$inci_modelled, type = "l")
lines(post$year, post$lowerCI, type = "l", lty = "dashed")
lines(post$year, post$upperCI, type = "l", lty = "dashed")
# add a vertical line to denote vaccine intervention year
abline(v = c(2007), col = "red", lty = "dashed")
```
<!---CHUNK_PAGEBREAK--->
# Question 6
Based on the findings from descriptive analysis, it may be interesting to look at serotype replacement. Run a segmented negative binomial regression model investigating the effect of vaccination on the incidence rates of all IPD cases.
<!---CHUNK_PAGEBREAK--->
# Help question 6
## Segmented regression for all serotypes
In the following we will analyse annual incidence of all serotypes in the overall population (*total*). We will define 2007 as the time of intervention.
First, transform the table (*annual_cases*) into a data.frame. Use the *as.data.frame.matrix* function to do so.
In our case, the year is stored as row name and not as a column itself. Therefore transform the data into a data.frame as explained above and add the year as seperate column. Make sure that the year column is of class numeric.
```{r}
# make counts table in to a data.frame
cases_aggr <- as.data.frame.matrix(annual_cases)
# make years in to a variable
# take the rownames and make them numeric
cases_aggr$year <- as.numeric(rownames(cases_aggr))
```
We then need to merge our population data to be able to calculate incidences. To do this we use the *merge* function. We use "year" as our unique identifier in both datasets, and specify that cases_aggr is our main dataset so only the rows of that dataset should be considered (no new additions from population rows; equivalent to a left-join).
Once merged we can subset and only keep the columns of interest to us. Using square brackets.
```{r}
# merge counts with population
cases_df <- merge(cases_aggr, population, by = "year", all.x = TRUE)
# only keep relevant columns
cases_df <- cases_df[, c("year", "Total", "Total_pop")]
```
We then want to define the parameters and columns needed for segmented regression.
```{r}
# year when vaccination with PCV7 started
intervention_year <- 2007
# corresponds to t in the formula
# count of years
cases_df$linear <- 1:nrow(cases_df)
# corresponds to delta(t-t0) in the formula
# pre or post vaccination period
cases_df$const_vacc <- as.numeric(cases_df$year >= intervention_year)
# corresponds to (t-t0)^+ in the formula
# count of years post intervention
cases_df$lin_vacc <- pmax(0, cases_df$year - intervention_year + 1)
# population in 100,000s
cases_df$pop100 <- cases_df$Total_pop / 100000
```
We can then pass these parameters through the *glm.nb* function of the *MASS* package. We specify our dataset, as well as the "Total" variable (with annual counts) and our parameters from above. This gives us a segmented negative binomial regression with population per 100,000 as offset.
```{r}
# negative binomial segmented regression
nb_total <- glm.nb(Total ~ linear + const_vacc + lin_vacc + offset(log(pop100)), data = cases_df)
```