# Time series data exploration
[Chapter 3](#histoline) contains the basic exploratory data analysis you should do before using SPC tools. But there are many other time series-oriented analytic tools available that can help you understand the data more completely.
## Time series EDA
There is usually far more information in a time series than is typically explored with basic SPC methods. You can create a variety of exploratory and diagnostic plots that help you understand the data more thoroughly.
Because the fake data (`df_ts`) used in previous chapters has no time series patterns, we'll use it alongside a data set with clear patterns (`beer`) so we can explore what these EDA tools show with and without time-related patterns. Note the `beer` data is measured quarterly whereas the `df_ts` data is measured monthly.
```{r beer_data}
# Use Australian beer data, trimmed to a 16-year subset (1990 Q1 through 2005 Q4)
data(ausbeer, package = "fpp2")
beer = window(ausbeer, start = 1990.00, end = 2005.75)
```
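A quick way to confirm each series' sampling frequency is with the `frequency` function (a sketch, assuming `df_ts` was created in an earlier chapter):
```{r frequencies, eval=FALSE}
# Confirm sampling frequencies: 4 = quarterly, 12 = monthly
frequency(beer)
frequency(df_ts)
```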
For comparison's sake, here are summaries of each time series that show the time series itself, the [autocorrelation function](#acf), and a (simplified) [periodogram](#tsa):
```{r tsdisplays, fig.height=3}
# No temporal patterns in data
ggtsdisplay(df_ts, plot.type = "spectrum")
# Temporal patterns in data
ggtsdisplay(beer, plot.type = "spectrum")
```
### Overall trend (if any)
The first thing to look for is whether there is a trend. The simplest way to let the data speak for this is by using a [loess smoother](https://en.wikipedia.org/wiki/Local_regression).
The `autoplot` function in the `forecast` package provides several out-of-the-box plots for time series data, and since it's built on top of `ggplot2`, you can layer standard `ggplot2` functions on it as well.
The `df_ts` time series has no trend at all.
```{r loess_trend1, fig.height=3}
autoplot(df_ts) +
geom_smooth()
```
The `beer` data shows an initial declining trend that appears to flatten out in later years.
```{r loess_trend2, fig.height=3}
autoplot(beer) +
geom_smooth()
```
### Seasonplot
The seasonplot overlays each year as its own line, with the x-axis showing the within-year period (quarter, month, etc.), which defaults to the frequency of the time series. When there's no seasonal pattern across or within that frequency, natural variation makes the plot look like spaghetti.
```{r seasonplot1, fig.height=3}
ggseasonplot(df_ts)
```
When there is a seasonal pattern in the time series, the lines track each other. In this case, the fourth-quarter increase above the other quarters is quite evident.
```{r seasonplot2, fig.height=3}
ggseasonplot(beer)
```
### Monthplot
A monthplot puts all years into seasonal groups, where each line is a group (e.g., month) and each point in that line is an individual year. When there is a lengthy trend in the series, you can see it in a consistent up or down pattern in each seasonal group. You can also compare central tendencies across those groups with a mean or median line.
Data with no inherent pattern shows up as noise:
```{r monthplot1, fig.height=3}
ggmonthplot(df_ts)
```
In a time series with temporal patterns, you can see both the higher levels in Q4 relative to the other quarters and that Q4's values are declining over the years, a pattern echoed to a lesser extent in the early years' values for the other quarters.
```{r monthplot2, fig.height=3}
ggmonthplot(beer)
```
### Autocorrelation {#acf}
We've touched on autocorrelation in other portions of this book, and will discuss it further [later in this chapter](#moreautocor).
The `acf` function provides a graphical summary of the autocorrelation function, with each data point correlated with a value at increasing lagged distances from itself. Each correlation is plotted as a spike; spikes that go above or below the dashed line suggest that significant positive or negative autocorrelation, respectively, occurs at that lag (at the 95% confidence level). If all spikes occur inside those limits, it's safe to assume that there is no autocorrelation. If only one or perhaps two spikes exceed the limits slightly, it could be due simply to chance. Clear patterns seen in the acf plot can indicate autocorrelation even when the values do not exceed the limits.
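For a rough sense of where those dashed limits come from: for a white-noise series of length N, the sample autocorrelations have a standard error of about 1/√N, so the 95% limits sit near ±1.96/√N. A minimal sketch of that calculation:
```{r acf_limits, eval=FALSE}
# Approximate 95% white-noise limits for an ACF plot: +/- 1.96 / sqrt(N)
n_obs = length(df_ts)
c(lower = -1.96 / sqrt(n_obs), upper = 1.96 / sqrt(n_obs))
```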
With the `df_ts`, there is no autocorrelation and no obvious pattern, and the correlation values themselves (y-axis) are tiny:
```{r acf1, fig.height=3}
# acf plot using the autoplot function instead of base for the ggplot look
autoplot(acf(df_ts, plot = FALSE))
```
But with the `beer` data, the patterning is obvious, especially at lags 2 (6 months apart) and 4 (1 year apart), and the correlation values are quite large.
```{r acf2, fig.height=3}
# acf plot using the autoplot function instead of base for the ggplot look
autoplot(acf(beer, plot = FALSE))
```
The autocorrelation function is most concisely plotted with the approach above, but you can also plot the increasing lags against an initial value in individual scatterplots. If the points look like a shotgun target, there's no autocorrelation. Patterns in the points indicate autocorrelation in the data. Patterns strung along or perpendicular to the 1:1 dashed line suggest strong positive and negative correlation, respectively, though any sort of pattern is cause for concern.
The lagplot for the `df_ts` data shows the shotgun target "pattern" that suggests that only random variation is present.
```{r lagplot1}
# Scatterplot of df_ts autocorrelation through first 12 lags
lag.plot(df_ts, lags = 12, do.lines = FALSE)
```
But clear patterns emerge---especially at lag 4 (1 year apart)---in the lagplot for the `beer` data.
```{r lagplot2}
# Scatterplot of beer data autocorrelation through first 8 lags
lag.plot(beer, lags = 8, do.lines = FALSE)
```
The `pacf` function gives you a partial autocorrelation plot, which shows the correlation at each lag after removing the effects of all shorter lags. It presents information similar to the lag plot, only more compactly, as it displays only the correlation value itself. This can be quite useful in identifying cycles in data.
A `pacf` plot for `df_ts` data shows the random noise we'd expect, as well as tiny correlation values.
```{r pacf1, fig.height=3}
# pacf plot using the autoplot function instead of base for the ggplot look
autoplot(pacf(df_ts, plot = FALSE))
```
Using the `beer` data shows the partial autocorrelation pattern. The spike at lag 2 indicates a moderate negative relationship in values 6 months (2 quarters) apart, and the spike at lag 4 shows a strong positive relationship in values 1 year (4 quarters) apart.
```{r fig.height=3}
autoplot(pacf(beer, plot = FALSE))
```
### Cycles {#tsa}
Periodograms allow you to explore a time series for cycles that may or may not be regular in timing (which makes it slightly distinct from seasonality). Sunspot cycles are a classic example at ~11 years, a time span that obviously doesn't correspond to calendar seasons and frequencies.
Spikes in the periodogram designate possible cycle timing lengths, where the x-axis is based on frequency. The reciprocal of the frequency is the time period, so a spike in a periodogram for an annual series at a frequency of 0.09 suggests a cycle time of about 11 years.
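You can do that reciprocal conversion directly in R; for instance:
```{r period_reciprocal, eval=FALSE}
# The period is the reciprocal of the frequency
1 / 0.09  # ~11 years: the sunspot cycle in an annual series
1 / 0.25  # 4 quarters: an annual cycle in quarterly data
```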
A bunch of spikes scattered across the plot, or a more or less flat line with no real spikes, both suggest that there is no cyclic pattern in the data.
```{r periodicity1, fig.height=3}
TSA::periodogram(df_ts)
```
A clear spike occurs in the `beer` data at a frequency of about 0.26, that is, a period of about 4 quarters (1/0.26 ≈ 3.8). Since this is quarterly data, it confirms the annual pattern seen in several plots above.
```{r periodicity2, fig.height=3}
TSA::periodogram(beer)
```
### Decomposition
The `decompose` function extracts the major components of a time series (trend, seasonal, and random), while the `autoplot` function presents the results using `ggplot2` for a cleaner look.
```{r decomp1}
autoplot(decompose(df_ts))
```
```{r decomp2}
autoplot(decompose(beer))
```
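If you want to work with the individual components directly, as we do with the random component in the residuals section below, they're available as named elements of the decomposed object:
```{r decomp_components, eval=FALSE}
# decompose() returns the seasonal, trend, and random components
beer_decomp = decompose(beer)
head(beer_decomp$seasonal)
head(beer_decomp$trend)
head(beer_decomp$random)
```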
### Seasonal adjustment
The `seasonal` package uses the U.S. Census Bureau's X-13ARIMA-SEATS method to calculate seasonal adjustment, which removes the seasonal component from a series so you can compare the underlying level and trend directly from one period to the next. The `seas` function can be used to view the results or save them into another object.
```{r seas}
# Convert ts to data frame
beer_df = tsdf(beer)
# Get seasonally adjusted values and add to the data frame
beer_season = seasonal::seas(beer)
beer_df$y_seasonal = as.numeric(seasonal::final(beer_season))
# Show top 6 lines of data frame
knitr::kable(head(beer_df))
```
If you just want to plot it on the fly, `ggseas` provides the `stat_seas` function for use with `ggplot2`. As with all ggplots, you need a data frame first, which the `tsdf` function provides.
```{r seasonal1, fig.height=3, eval=FALSE}
# Convert ts to data frame
df_ts_df = tsdf(df_ts)
# Plot original and seasonally adjusted data
ggplot(df_ts_df, aes(x, y)) +
geom_line(color="gray70") +
stat_seas(color="blue")
```
```{r seasonal2, fig.height=3, eval=FALSE}
# Plot original and seasonally adjusted data
ggplot(beer_df, aes(x, y)) +
geom_line(color="gray70") +
stat_seas(color="blue")
```
### Residuals
Residuals---the random component of the time series---can also be explored for potential patterns. Ideally, you don't want to see patterns in the residuals, but they're worth exploring in the name of thoroughness.
```{r residuals_df_ts}
# Convert ts residuals to data frame
df_ts_df_rand = tsdf(decompose(df_ts)$random)
# Add month as a factor
df_ts_df_rand$mnth = factor(rep(month.abb, 10), levels = month.abb)
# Plot residuals
# No apparent patterns
ggplot(df_ts_df_rand, aes(x, y)) +
geom_hline(yintercept = 0, linetype = "dotted") +
geom_smooth(color = "gray70", alpha = 0.2) +
geom_point(aes(color = mnth))
```
```{r residuals_beer}
# Convert ts residuals to data frame
beer_df_rand = tsdf(decompose(beer)$random)
# Add quarter as a factor
beer_df_rand$qtr = factor(quarter(date_decimal(beer_df_rand$x)))
# Plot residuals, with custom colors
# There may be a slight pattern here; worth a closer look
ggplot(beer_df_rand, aes(x, y)) +
geom_hline(yintercept=0, linetype="dotted") +
geom_smooth(color = "gray70", alpha = 0.2) +
geom_point(aes(color = qtr)) +
scale_color_manual(values=c("#E69F00", "#56B4E9", "#009E73", "#000000"))
```
```{r residuals_faceted_df_ts}
# Residuals faceted by month
# December may be worth a closer look; the other months seem fine
ggplot(df_ts_df_rand, aes(x, y)) +
geom_hline(yintercept = 0, linetype = "dotted") +
geom_smooth(color = "gray70", alpha = 0.2) +
facet_wrap(~ mnth) +
geom_point(aes(color = mnth))
```
```{r residuals_faceted_beer}
# Residuals faceted by quarter
ggplot(beer_df_rand, aes(x, y)) +
geom_hline(yintercept=0, linetype="dotted") +
geom_smooth(color = "gray70", alpha = 0.2) +
facet_wrap(~ qtr) +
geom_point(aes(color = qtr)) +
scale_color_manual(values=c("#E69F00", "#56B4E9", "#009E73", "#000000"))
```
### Accumulation plots
You can use the EDA tools above on rates, numerators, and denominators alike to explore patterns. When you do have a numerator and a denominator that create your metric, you can also plot them against each other, looking at the accumulation of each over the course of a relevant time frame (e.g., a year).
To illustrate, we'll create a new time series for monthly central line-associated infections, set up so that the last two years of a 10-year series are based on a different process.
```{r accumplot_data}
# Generate sample data
set.seed(54)
bsi_8yr = data.frame(Linedays = sample(1000:2000, 96), Infections = rpois(96, 4))
bsi_2yr = data.frame(Linedays = sample(1200:2200, 24), Infections = rpois(24, 3))
bsi_10yr = rbind(bsi_8yr, bsi_2yr)
bsi_10yr$Month = seq(as.Date("2007/1/1"), by = "month", length.out = 120)
bsi_10yr$Year = year(bsi_10yr$Month)
bsi_10yr$Rate = round((bsi_10yr$Infections / bsi_10yr$Linedays * 1000), 2)
```
First, calculate the cumulative sums for the numerator and denominator for the time period of interest. Here, we use years.
```{r accumplot_calcs}
# Calculate cumulative sums by year
accum_bsi_df = bsi_10yr %>%
group_by(Year) %>%
arrange(Month) %>%
mutate(cuml_linedays = cumsum(Linedays), cuml_infections = cumsum(Infections))
```
Then, plot them against each other. Much like a seasonplot, a spaghetti "pattern" indicates that only random, common cause variation is acting on the variables. Strands (individual years) that separate from that mess of lines suggest that a different process is in place for those strands.
```{r accumplot}
# Accumulation plot
ggplot(accum_bsi_df, aes(x = cuml_linedays, y = cuml_infections, group = as.factor(Year))) +
geom_path(aes(color = as.factor(Year)), size = 1) +
geom_point(aes(color = as.factor(Year)))+
scale_y_continuous(name = "Cummulative Infections", breaks = seq(0,120,10)) +
scale_x_continuous(name = "Cumulative Central Line Days", breaks = seq(0,40000,5000)) +
scale_colour_brewer(type = "div", palette = "Spectral") +
guides(color = guide_legend(title = "Year")) +
ggtitle("Infections vesus Central Line Days by Year")
```
## More on autocorrelation {#moreautocor}
```{r acfplotsfortable, include=FALSE}
#png("images/ac.png", width = 6, height = 4, units = "in", res = 600)
#autoplot(acf(mb_ts))
#dev.off()
#png("images/no_ac.png", width = 6, height = 4, units = "in", res = 600)
#autoplot(acf(df_ts, plot = FALSE))
#dev.off()
```
For convenient comparison, here are the autocorrelated and non-autocorrelated examples shown above, side by side.
| Example autocorrelated data | Example non-autocorrelated data |
| ------ | ------ |
| ![](images/ac.png) | ![](images/no_ac.png) |
When data are autocorrelated, control limits will be *too small*---and thus an increase in *false* signals of special causes should be expected. In addition, none of the tests for special cause variation remain valid.
Sometimes, autocorrelation can be removed by changing the sampling or metric's time step: for example, you generally wouldn't expect hospital acquired infection rates in one quarter to influence those in the subsequent quarter.
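As a sketch of changing the time step, base R's `aggregate` can roll a monthly `ts` up to quarterly totals (applied here to `df_ts` purely for illustration):
```{r aggregate_sketch, eval=FALSE}
# Aggregate a monthly series to quarterly sums (nfrequency = 4)
df_ts_qtr = aggregate(df_ts, nfrequency = 4, FUN = sum)
```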
Autocorrelation can also sometimes be removed or reduced by differencing, although doing so hurts the interpretability of the resulting run or control chart.
```{r diffing, fig.height=3}
# Take the fourth lag to difference the beer data
beer_diff = diff(beer, lag = 4)
# Plot the resulting autocorrelation function
autoplot(acf(beer_diff, plot = FALSE))
```
If you have autocorrelated data and you aren't willing to difference it or can't change the sampling rate or time step, you shouldn't use run or control charts; use a standard line chart instead. If you must have limits to help guide decision-making, you'll need a more advanced technique, such as a Generalized Additive Mixed Model (GAMM) or a time series model such as ARIMA. It's probably best to work with a statistician if you need to do this.
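If you do go the modeling route, a minimal sketch using `auto.arima` from the `forecast` package (the same package supplying the plotting functions above) looks like this; treat it as a starting point rather than a finished analysis:
```{r arima_sketch, eval=FALSE}
# Fit an ARIMA model to the quarterly beer data and plot an
# 8-quarter forecast with prediction intervals
beer_fit = auto.arima(beer)
autoplot(forecast(beer_fit, h = 8))
```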