-
Notifications
You must be signed in to change notification settings - Fork 6
/
qcew.Rmd
360 lines (284 loc) · 30.8 KB
/
qcew.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
---
title: "Download data from the Quarterly Census of Employment and Wages (QCEW)"
author: "Erik Loualiche"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
#output:
# md_document:
# variant: markdown_github
# toc: true
# toc_depth: 2
vignette: >
%\VignetteIndexEntry{Download data from the Quarterly Census of Employment and Wages (QCEW)}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r global_options, echo = FALSE, include = FALSE}
options(width = 999)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE,
cache = FALSE, tidy = FALSE, size = "small")
```
In this vignette we provide examples of how to use entrydatar to download precise files from the QCEW.
In the second part we give rudimentary documentations of the feature that are of principal interest to work with the data.
For an overview of the QCEW, head over to the [BLS website](https://www.bls.gov/cew/cewover.htm)
The vignette is organized as:
1. Downloading data from the QCEW
2. Documentation
+ [Table of data cuts](#tablecut)
+ [What is available](#availability)
+ [General documentation](#generaldoc)
# Downloading data from the QCEW
To get read to download data, we load some libraries that the package might have forgotten to call (the package works with all tables in a `data.table` format)
```r
library(dplyr); library(data.table);
library(entrydatar)
```
## General download
For example if we are interested in downloading aggregate level data we use the cut `10`. Please see documentation below for a definition of what is availabe and the mapping of the different cuts of QCEW.
```r
dt_agg <- get_qcew_cut(
data_cut = 10,
year_start = 1990, year_end =1993,
path_data = "~/Downloads/", write = F)
```
```r
as_tibble(dt_agg)
# A tibble: 16 x 42
area_fips own_code indus… agglv… size_… year qtr discl… qtrly… month… month… month… total_… taxa… qtrl… avg_… lq_d… lq_q… lq_m… lq_m… lq_m… lq_t… lq_t… lq_q… lq_a… oty_… oty_q… oty_… oty_mo… oty_m… oty_mo… oty_m… oty_mo… oty_m… oty_to… oty_t… oty_tax… oty_… oty_qt… oty_… oty_a… oty_…
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 US000 0 10.0 10.0 0 1990 1.00 "" 6.02e⁶ 1.06e⁸ 1.07e⁸ 1.08e⁸ 6.23e¹¹ 0 0 448 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 127332 2.20 2.09e⁶ 2.00 2.12e⁶ 2.00 2.03e⁶ 1.90 3.91e¹⁰ 6.70 -3.57e¹¹ -100 -7.56e⁹ -100 20.0 4.70
2 US000 0 10.0 10.0 0 1990 2.00 "" 6.04e⁶ 1.08e⁸ 1.09e⁸ 1.10e⁸ 6.29e¹¹ 0 0 443 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 154120 2.60 1.87e⁶ 1.80 2.12e⁶ 2.00 1.82e⁶ 1.70 3.67e¹⁰ 6.20 -1.82e¹¹ -100 -4.01e⁹ -100 19.0 4.50
3 US000 0 10.0 10.0 0 1990 3.00 "" 6.09e⁶ 1.08e⁸ 1.09e⁸ 1.10e⁸ 6.24e¹¹ 0 0 441 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 164809 2.80 1.68e⁶ 1.60 1.79e⁶ 1.70 1.34e⁶ 1.20 3.37e¹⁰ 5.70 -1.10e¹¹ -100 -2.47e⁹ -100 18.0 4.30
4 US000 0 10.0 10.0 0 1990 4.00 "" 6.12e⁶ 1.09e⁸ 1.09e⁸ 1.09e⁸ 6.88e¹¹ 0 0 483 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 149102 2.50 1.01e⁶ 0.900 6.69e⁵ 0.600 2.75e⁵ 0.300 3.90e¹⁰ 6.00 -8.46e¹⁰ -100 -1.85e⁹ -100 25.0 5.50
5 US000 0 10.0 10.0 0 1991 1.00 "" 6.30e⁶ 1.05e⁸ 1.05e⁸ 1.06e⁸ 6.37e¹¹ 0 0 465 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 284012 4.70 -1.03e⁶ -1.00 -1.51e⁶ -1.40 -1.77e⁶ -1.60 1.40e¹⁰ 2.20 0 0 0 0 17.0 3.80
6 US000 0 10.0 10.0 0 1991 2.00 "" 6.36e⁶ 1.06e⁸ 1.07e⁸ 1.08e⁸ 6.45e¹¹ 0 0 463 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 317123 5.30 -1.84e⁶ -1.70 -2.05e⁶ -1.90 -2.24e⁶ -2.00 1.64e¹⁰ 2.60 0 0 0 0 20.0 4.50
7 US000 0 10.0 10.0 0 1991 3.00 "" 6.39e⁶ 1.06e⁸ 1.07e⁸ 1.08e⁸ 6.41e¹¹ 0 0 462 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 303568 5.00 -2.13e⁶ -2.00 -1.93e⁶ -1.80 -2.03e⁶ -1.90 1.76e¹⁰ 2.80 0 0 0 0 21.0 4.80
8 US000 0 10.0 10.0 0 1991 4.00 "" 6.43e⁶ 1.08e⁸ 1.08e⁸ 1.08e⁸ 7.05e¹¹ 0 0 503 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 317664 5.20 -1.70e⁶ -1.60 -1.61e⁶ -1.50 -1.52e⁶ -1.40 1.72e¹⁰ 2.50 0 0 0 0 20.0 4.10
9 US000 0 10.0 10.0 0 1992 1.00 "" 6.48e⁶ 1.05e⁸ 1.05e⁸ 1.06e⁸ 6.64e¹¹ 0 0 486 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 176781 2.80 -5.04e⁵ -0.500 -1.48e⁵ -0.100 -1.06e⁵ -0.100 2.72e¹⁰ 4.30 0 0 0 0 21.0 4.50
10 US000 0 10.0 10.0 0 1992 2.00 "" 6.50e⁶ 1.07e⁸ 1.08e⁸ 1.09e⁸ 6.72e¹¹ 0 0 479 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 140624 2.20 5.39e⁵ 0.500 6.40e⁵ 0.600 6.08e⁵ 0.600 2.69e¹⁰ 4.20 0 0 0 0 16.0 3.50
11 US000 0 10.0 10.0 0 1992 3.00 "" 6.52e⁶ 1.07e⁸ 1.07e⁸ 1.08e⁸ 6.72e¹¹ 0 0 482 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 131248 2.10 7.06e⁵ 0.700 2.48e⁵ 0.200 6.40e⁵ 0.600 3.06e¹⁰ 4.80 0 0 0 0 20.0 4.30
12 US000 0 10.0 10.0 0 1992 4.00 "" 6.57e⁶ 1.09e⁸ 1.09e⁸ 1.09e⁸ 7.79e¹¹ 0 0 550 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 134740 2.10 1.09e⁶ 1.00 1.05e⁶ 1.00 1.22e⁶ 1.10 7.36e¹⁰ 10.4 0 0 0 0 47.0 9.30
13 US000 0 10.0 10.0 0 1993 1.00 "" 6.62e⁶ 1.06e⁸ 1.07e⁸ 1.07e⁸ 6.67e¹¹ 0 0 480 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 138525 2.10 1.43e⁶ 1.40 1.70e⁶ 1.60 1.69e⁶ 1.60 2.54e ⁹ 0.400 0 0 0 0 - 6.00 -1.20
14 US000 0 10.0 10.0 0 1993 2.00 "" 6.65e⁶ 1.09e⁸ 1.10e⁸ 1.11e⁸ 7.04e¹¹ 0 0 494 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 155234 2.40 1.85e⁶ 1.70 1.75e⁶ 1.60 1.98e⁶ 1.80 3.19e¹⁰ 4.70 0 0 0 0 15.0 3.10
15 US000 0 10.0 10.0 0 1993 3.00 "" 6.70e⁶ 1.09e⁸ 1.09e⁸ 1.11e⁸ 7.11e¹¹ 0 0 499 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 176159 2.70 2.15e⁶ 2.00 2.31e⁶ 2.20 2.35e⁶ 2.20 3.86e¹⁰ 5.70 0 0 0 0 17.0 3.50
16 US000 0 10.0 10.0 0 1993 4.00 "" 6.75e⁶ 1.11e⁸ 1.11e⁸ 1.12e⁸ 8.06e¹¹ 0 0 556 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 176528 2.70 2.42e⁶ 2.20 2.54e⁶ 2.30 2.80e⁶ 2.60 2.76e¹⁰ 3.50 0 0 0 0 6.00 1.10
```
Note that the dataset can be large. Downloading most of the industry 3 and 4 digits cuts across counties and MSAs ends up with 57mn rows. Saving it in `.rds` formats takes around 2gb of memory.
On the other hand for something more precise, say nationally by size cuts at the 6 digits industry level we would call the cut `28`:
```r
dt_naics <- get_qcew_cut(
data_cut = 28,
industry = "naics",
year_start = 1990, year_end = 2015,
path_data = "~/Downloads/", write = F)
```
```r
as_tibble(dt_naics)
# A tibble: 208,667 x 47
area… own_… indus… aggl… size… year qtr disc… area… own_… indu… aggl… size… qtrly… month… month… month… total… taxa… qtrl… avg_… lq_d… lq_q… lq_m… lq_m… lq_m… lq_t… lq_t… lq_q… lq_a… oty_… oty_… oty_… oty_… oty_… oty_… oty_… oty_… oty_… oty_… oty_… oty_… oty_… oty_… oty_… oty_… oty_…
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 US000 5.00 111110 28.0 1.00 1990 1.00 NA NA NA NA NA NA 2.16e² 354 353 371 1.57e⁶ 0 0 337 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 US000 5.00 111110 28.0 2.00 1990 1.00 NA NA NA NA NA NA 9.80e¹ 547 577 640 1.95e⁶ 0 0 255 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 US000 5.00 111110 28.0 3.00 1990 1.00 NA NA NA NA NA NA 4.00e¹ 439 450 519 1.32e⁶ 0 0 216 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 US000 5.00 111110 28.0 4.00 1990 1.00 NA NA NA NA NA NA 1.40e¹ 336 336 382 1.01e⁶ 0 0 220 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 US000 5.00 111120 28.0 1.00 1990 1.00 NA NA NA NA NA NA 2.60e¹ 39.0 33.0 39.0 1.30e⁵ 0 0 270 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 US000 5.00 111120 28.0 2.00 1990 1.00 NA NA NA NA NA NA 5.00e⁰ 21.0 23.0 28.0 6.87e⁴ 0 0 220 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 US000 5.00 111130 28.0 1.00 1990 1.00 NA NA NA NA NA NA 8.20e¹ 76.0 73.0 85.0 3.28e⁵ 0 0 324 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 US000 5.00 111140 28.0 1.00 1990 1.00 NA NA NA NA NA NA 1.92e³ 1369 1314 1575 5.09e⁶ 0 0 276 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 US000 5.00 111140 28.0 2.00 1990 1.00 NA NA NA NA NA NA 8.50e¹ 374 404 526 1.34e⁶ 0 0 237 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 US000 5.00 111140 28.0 4.00 1990 1.00 NA NA NA NA NA NA 8.00e⁰ 193 211 224 7.45e⁵ 0 0 274 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# ... with 208,657 more rows
```
## Faster download with annual frequencies
The BLS provides quarterly as well as annual averages, which are lighter. There is now an option to download directly the annual average using a frequency option. The default is quarterly, so everything should be backward compatible.
Let's see first with the quarterly data (*warning* it is slow):
```r
dt_naics <- get_qcew_cut(
data_cut = 16,
industry = "naics", frequency = "quarter",
year_start = 2000, year_end = 2005,
path_data = "~/Downloads/", write = F)
```
```r
as_tibble(dt_naics)
# A tibble: 18,028 x 42
area_fips own_code indus… agglv… size_… year qtr discl… qtrly_… month… month… month… total… taxa… qtrl… avg_… lq_d… lq_q… lq_m… lq_m… lq_m… lq_t… lq_t… lq_q… lq_a… oty_… oty_… oty_q… oty_mo… oty_mo… oty_mo… oty_m… oty_mo… oty_m… oty_to… oty_to… oty_… oty_… oty_… oty_… oty_av… oty_av…
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 US000 1.00 1125 16.0 0 2000 1.00 "" 2.00 7.00e⁰ 7.00e⁰ 7.00e⁰ 5.24e⁴ 0 0 576 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 0 0 - 1.00 -12.5 - 1.00 -12.5 - 1.00 -12.5 -1.90e³ - 3.50 0 0 0 0 54.0 10.3
2 US000 1.00 1125 16.0 0 2000 2.00 "" 2.00 7.00e⁰ 7.00e⁰ 8.00e⁰ 5.59e⁴ 0 0 587 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 0 0 - 1.00 -12.5 0 0 1.00 14.3 1.26e⁴ 29.0 0 0 0 0 132 29.0
3 US000 1.00 1125 16.0 0 2000 3.00 "" 2.00 8.00e⁰ 8.00e⁰ 8.00e⁰ 5.99e⁴ 0 0 576 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 0 0 1.00 14.3 1.00 14.3 1.00 14.3 9.74e³ 19.4 0 0 0 0 25.0 4.50
4 US000 1.00 1125 16.0 0 2000 4.00 "" 3.00 8.00e⁰ 8.00e⁰ 8.00e⁰ 6.42e⁴ 0 0 617 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 1.00 50.0 1.00 14.3 1.00 14.3 1.00 14.3 1.34e⁴ 26.4 0 0 0 0 59.0 10.6
5 US000 1.00 1131 16.0 0 2000 1.00 "" 78.0 4.09e³ 3.92e³ 3.93e³ 3.66e⁷ 0 0 706 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA -1.00 - 1.30 -237 - 5.50 -217 - 5.20 -282 - 6.70 -1.22e⁵ - 0.300 0 0 0 0 38.0 5.70
6 US000 1.00 1131 16.0 0 2000 2.00 "" 78.0 4.19e³ 4.57e³ 5.26e³ 4.77e⁷ 0 0 786 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 1.00 1.30 -315 - 7.00 -255 - 5.30 250 5.00 7.70e⁶ 19.2 0 0 0 0 142 22.0
7 US000 1.00 1131 16.0 0 2000 3.00 "" 78.0 5.85e³ 5.67e³ 5.47e³ 5.86e⁷ 0 0 796 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 0 0 - 97.0 - 1.60 -167 - 2.90 -110 - 2.00 -1.43e⁶ - 2.40 0 0 0 0 - 1.00 - 0.100
8 US000 1.00 1131 16.0 0 2000 4.00 "" 78.0 4.84e³ 4.49e³ 4.43e³ 5.09e⁷ 0 0 853 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA 0 0 -502 - 9.40 -146 - 3.10 205 4.80 -1.07e⁵ - 0.200 0 0 0 0 25.0 3.00
9 US000 1.00 2211 16.0 0 2000 1.00 "" 140 1.23e⁴ 1.23e⁴ 1.23e⁴ 1.91e⁸ 0 0 1193 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA -5.00 - 3.40 -201 - 1.60 -157 - 1.30 -125 - 1.00 1.54e⁷ 8.80 0 0 0 0 110 10.2
10 US000 1.00 2211 16.0 0 2000 2.00 "" 141 1.24e⁴ 1.24e⁴ 1.24e⁴ 1.83e⁸ 0 0 1138 NA 1.00 1.00 1.00 1.00 1.00 0 0 1.00 NA -2.00 - 1.40 49.0 0.400 5.00 0 6.00 0 2.54e⁷ 16.1 0 0 0 0 157 16.0
# ... with 18,018 more rows
```
Now let's download the data only at annual frequency (similar to the size cuts from earlier):
```r
dt_naics_year <- get_qcew_cut(
data_cut = 16,
industry = "naics", frequency = "year",
year_start = 2000, year_end = 2005,
path_data = "~/Downloads/", write = F)
```
```r
as_tibble(dt_naics_year)
# A tibble: 4,507 x 38
area_fips own_code industry_code agglvl_code size_code year qtr disclosure_code annual… annual… total_… taxab… annua… annua… avg_a… lq_di… lq_a… lq_a… lq_t… lq_t… lq_a… lq_a… lq_a… oty_… oty_… oty_a… oty_an… oty_ann… oty_to… oty_tot… oty_… oty_… oty_… oty_… oty_a… oty_a… oty_… oty_a…
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 US000 1.00 1125 16.0 0 2000 NA "" 2.00 8.00e⁰ 2.32e⁵ 0 0 589 30648 NA 1.00 1.00 1.00 0 0 1.00 1.00 NA 0 0 1.00e⁰ 14.3 3.38e⁴ 17.0 0 0 0 0 68.0 13.1 3567 13.2
2 US000 1.00 1131 16.0 0 2000 NA "" 78.0 4.73e³ 1.94e⁸ 0 0 788 40994 NA 1.00 1.00 1.00 0 0 1.00 1.00 NA 0 0 -1.56e² - 3.20 6.04e⁶ 3.20 0 0 0 0 49.0 6.60 2547 6.60
3 US000 1.00 2211 16.0 0 2000 NA "" 142 1.24e⁴ 7.90e⁸ 0 0 1222 63557 NA 1.00 1.00 1.00 0 0 1.00 1.00 NA -2.00 - 1.40 -5.10e¹ - 0.400 8.53e⁷ 12.1 0 0 0 0 136 12.5 7096 12.6
4 US000 1.00 2213 16.0 0 2000 NA "" 10.0 9.37e² 2.67e⁷ 0 0 547 28463 NA 1.00 1.00 1.00 0 0 1.00 1.00 NA 1.00 11.1 1.14e² 13.9 -4.72e⁶ - 15.0 0 0 0 0 -186 -25.4 -9643 -25.3
5 US000 1.00 2362 16.0 0 2000 NA "" 1.00 1.00e⁰ 6.73e⁴ 0 0 1294 67270 NA 1.00 1.00 1.00 0 0 1.00 1.00 NA 0 0 1.00e⁰ 100 4.32e⁴ 180 0 0 0 0 184 16.6 9555 16.6
6 US000 1.00 3231 16.0 0 2000 NA "" 8.00 4.86e³ 2.69e⁸ 0 0 1063 55287 NA 1.00 1.00 1.00 0 0 1.00 1.00 NA 0 0 -6.10e¹ - 1.20 8.19e⁵ 0.300 0 0 0 0 16.0 1.50 852 1.60
7 US000 1.00 3259 16.0 0 2000 NA "" 3.00 3.10e² 1.67e⁷ 0 0 1037 53945 NA 1.00 1.00 1.00 0 0 1.00 1.00 NA 0 0 -2.00e⁰ - 0.600 1.30e⁶ 8.40 0 0 0 0 84.0 8.80 4396 8.90
8 US000 1.00 3321 16.0 0 2000 NA "" 3.00 4.75e² 3.06e⁷ 0 0 1241 64540 NA 1.00 1.00 1.00 0 0 1.00 1.00 NA 1.00 50.0 1.26e² 36.1 9.43e⁶ 44.5 0 0 0 0 72.0 6.20 3748 6.20
9 US000 1.00 3329 16.0 0 2000 NA "" 10.0 1.23e⁴ 6.24e⁸ 0 0 976 50746 NA 1.00 1.00 1.00 0 0 1.00 1.00 NA 0 0 -1.32e³ - 9.70 5.99e⁶ 1.00 0 0 0 0 104 11.9 5379 11.9
10 US000 1.00 3364 16.0 0 2000 NA "" 1.00 2.60e³ 6.86e⁷ 0 0 508 26390 NA 1.00 1.00 1.00 0 0 1.00 1.00 NA 0 0 -4.50e¹ - 1.70 -1.19e⁷ - 14.8 0 0 0 0 - 77.0 -13.2 -4048 -13.3
# ... with 4,497 more rows
```
## Tidy the data
The data is still raw from the BLS download. We can clean the data in two ways: a monthly dataset with employment by categories and a quarterly dataset with only number of firms. Finally we allow an option for aggregating both types. Note this function works only for the quarterly version of the data.
```r
dt_naics_tidy <- tidy_qcew(dt_naics, frequency = "month")
```
```r
as_tibble(dt_naics_tidy)
# A tibble: 54,084 x 9
year quarter industry_code area_fips own_code size_code agglvl_code month emplvl
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2000 1.00 1111 US000 5.00 0 16.0 1.00 21243
2 2000 1.00 1111 US000 5.00 0 16.0 2.00 21759
3 2000 1.00 1111 US000 5.00 0 16.0 3.00 23600
4 2000 2.00 1111 US000 5.00 0 16.0 4.00 26142
5 2000 2.00 1111 US000 5.00 0 16.0 5.00 27287
6 2000 2.00 1111 US000 5.00 0 16.0 6.00 29355
7 2000 3.00 1111 US000 5.00 0 16.0 7.00 39098
8 2000 3.00 1111 US000 5.00 0 16.0 8.00 33418
9 2000 3.00 1111 US000 5.00 0 16.0 9.00 30534
10 2000 4.00 1111 US000 5.00 0 16.0 10.0 30475
# ... with 54,074 more rows
```
## Backup: what if the BLS has changed my data!
I included an option to specify a snapshot of the data take through the url wayback machine: thanks To [Gabriel Chodorow-Reich](http://scholar.harvard.edu/chodorow-reich) for suggesting the idea.<sup>[1](#fn1)</sup>
Specify a suffix found from the internet wayback machine on the BLS website and add it as an argument in the download function:
```r
# Enter the address directly from the html address
wayback_suffix <- "https://web.archive.org/web/20141101135821" # suffix for November 1st, 2014
# Or enter the exact date at which date it has been crawled:
wayback_suffix <- 20141101135821
# execute adding the wayback option
dt_naics <- get_qcew_cut(
data_cut = 28,
industry = "naics",
year_start = 1990, year_end = 2015, # try not to ask for data from the future
url_wayback = wayback_suffix,
path_data = "~/Downloads/", write = F)
dt_naics
```
---------------------------------
# Documentation
## <a name="tablecut">Cuts in QCEW</a>
#### NAICS Aggregation levels
The table of cuts looks as the following for **NAICS** based industries is as follows. See the reference [table](../data_raw/naics_agglevel.csv) and official BLS [documentation](https://data.bls.gov/cew/doc/titles/agglevel/agglevel_titles.htm) for more details.
```r
#------------------------------------------------------------------------------------------------------------------------------
#| | Geographic/Size | National | National | CSA | MSA | Statewide | Statewide | County | MicroSA |
#| | LevelCode Format | | by size | | | | by size | | |
#|---------------------------|-------------------|----------|--- ------|-----|-----|-----------|-----------|--------|---------|
#| Industry Ownership Level | | 1x | 2x | 3x | 4x | 5x | 6x | 7x | 8x |
#| | | | | | | | | | |
#| Total, All Ownerships | x0 | 10 | - | 30 | 40 | 50 | - | 70 | 80 |
#| Total, by Ownership | x1 | 11 | 21 | - | 41 | 51 | 61 | 71 | - |
#| Domain, by Ownership | x2 | 12 | 22 | - | 42 | 52 | 62 | 72 | - |
#| SuperSector, by Ownership | x3 | 13 | 23 | - | 43 | 53 | 63 | 73 | - |
#| Sector, by Ownership | x4 | 14 | 24 | - | 44 | 54 | 64 | 74 | - |
#| 3-digit, by Ownership | x5 | 15 | 25 | - | 45 | 55 | - | 75 | - |
#| 4-digit, by Ownership | x6 | 16 | 26 | - | 46 | 56 | - | 76 | - |
#| 5-digit, by Ownership | x7 | 17 | 27 | - | 47 | 57 | - | 77 | - |
#| 6-digit, by Ownership | x8 | 18 | 28 | - | 48 | 58 | - | 78 | - |
#------------------------------------------------------------------------------------------------------------------------------
```
#### SIC Aggregation levels
Be careful as the SIC aggregation levels are different than for NAICS. See the reference [table](../data_raw/sic_agglevel.csv) and official BLS [documentation](https://data.bls.gov/cew/doc/titles/agglevel/sic_agglevel_titles.htm) for more details.
The division follows
+ `01` to `11`: National level; all sectors down to 4 digits; aggregated and by size classes
+ `12` to `17`: MSA level; all sectors down to 4 digits; aggregated (no split by size classes available)
+ `18` to `25`: State level; all sectors down to 4 digits; aggregated and by size classes (only some size split are available)
+ `12` to `17`: County level; all sectors down to 4 digits; aggregated (no split by size classes available)
---------------------------------
## <a name = "availability">What is available</a>
The data file layout for a general view of what is available:
- For [downloading](https://www.bls.gov/cew/datatoc.htm) directly the individual files
- For [table of contents](https://www.bls.gov/cew/datatoc.htm)
- For [quarterly naics](http://www.bls.gov/cew/doc/layouts/csv_quarterly_layout.htm) and [annual naics](https://data.bls.gov/cew/doc/layouts/csv_annual_layout.htm)
- For [quarterly sic](http://www.bls.gov/cew/doc/layouts/sic_csv_quarterly_layout.htm) and [annual sic](https://data.bls.gov/cew/doc/layouts/sic_csv_annual_layout.htm)
### NAICS files
Some cuts are limited to an annual frequency (first quarter only) while some have a quarterly frequency (with monthly reports for employment)
- Annual frequency cuts are:
+ National by size: `21, 22, 23, 24, 25, 26, 27, 28`
+ State by size: `61, 62, 63, 64`
- Quarterly frequency: *everything else*
### SIC files
Same as for NAICS. Collections stops in 2000. Size starts only in 1997.
- Annual frequency cuts are availabe only for size from 1997 to 2000:
+ National by size: `7, 8, 9, 10, 11`
+ State by size: `24, 25`
- Quarterly frequency: *everything else*
### Availability Warning
The BLS Table can be misleading at times. And all of the files are not created equals. I am working on harmonizing all of the data pulls but there seems to be a lot of exceptions. Here are warnings I found important:
1. There is no disaggregated data for size X industry before 1990 in the case of **naics**. The layout of files on the [download page](http://www.bls.gov/cew/datatoc.htm) is somewhat misleading.
2. For the **sic by industry** files the data from standard sources only goes back to 1984. I have yet to implement the function that also downloads and get the data from 1975 to 1983. They are arranged in individual industry files.
+ For the year 1983 the available cuts are: `01`, `02`, `03`, `04`, `05`, `06`, `13`, `14`, `15`, `18`, `19`, `20`, `21`, `22`, `23`, `26`, `27`, `28`, `29`
+ For the year 1984 the available cuts are: `01`, `02`, `03`, `04`, `05`, `06`, `13`, `14`, `15`, `16`, `17`, `18`, `19`, `20`, `21`, `22`, `23`, `26`, `27`, `28`, `29`, `30`, `31`
+ The missing cuts pre-1984 are: `16`, `17` (MSA 3 and 4 digits SIC) and `30`, `31` (County 3 and 4 digits SIC)
---------------------------------
## <a name = "generaldoc">General documentation</a>
The table of contents to download directly datasets
The data file layout for a general view of what is available:
- For [downloading](https://www.bls.gov/cew/datatoc.htm) directly the individual files
- For [table of contents](https://www.bls.gov/cew/datatoc.htm)
- For [quarterly naics](http://www.bls.gov/cew/doc/layouts/csv_quarterly_layout.htm) and [annual naics](https://data.bls.gov/cew/doc/layouts/csv_annual_layout.htm)
- For [quarterly sic](http://www.bls.gov/cew/doc/layouts/sic_csv_quarterly_layout.htm) and [annual sic](https://data.bls.gov/cew/doc/layouts/sic_csv_annual_layout.htm)
### Data codes
#### Aggregation levels
- Aggregation levels and files that contain them are defined by the BLS; we reproduce the table in the package for merging or easier access:
+ **naics** [table](../data_raw/naics_agglevel.csv); original [BLS webpage](http://www.bls.gov/cew/doc/titles/agglevel/agglevel_titles.htm)
+ **sic** [table](../data_raw/sic_agglevel.csv); original [BLS webpage](http://www.bls.gov/cew/doc/titles/agglevel/sic_agglevel_titles.htm)
+ **Supersectors** and **Sector**: head over to the [BLS webpage](https://www.bls.gov/cew/supersector.htm)
#### Size Classes
- For [naics](http://www.bls.gov/cew/doc/titles/size/size_titles.htm)
- For [sic](http://www.bls.gov/cew/doc/titles/size/sic_size_titles.htm)
#### Industries
Industry titles are standard in that case
- **naics** [table](../data_raw/naics_industry_titles.csv); original [BLS webage](http://www.bls.gov/cew/doc/titles/industry/industry_titles.htm)
- **sic** [table](../data_raw/sic_industry_titles.csv); original [BLS webpage](http://www.bls.gov/cew/doc/titles/industry/sic_industry_titles.htm)
#### Ownership
Ownership codes go from 0 to 6:
- 0 represents the aggregate or `Total Covered`
- 5 represents the `Private` sector
- 1 to 4 represent different level of government: 4 for `International Government`; 3 for `Local Government`; 2 for `State Government` and 1 for `Federal Government`
Online docs for [naics](http://www.bls.gov/cew/doc/titles/ownership/ownership_titles.htm) and [sic](http://www.bls.gov/cew/doc/titles/ownership/sic_ownership_titles.htm)
#### Area codes and titles (FIPS)
Structures somewhat like industry codes. 5 characters. `US000` is aggregated over the total US.
Then for example `XXYYY` can be split in two parts:
- `XX` represents the state as in Census codes (alphabetical orders): 01 is Alabama and 02 is Alaska.
- `YYY` represents the county within the given state
- There are exceptions to `YYY` matching to counties:
+ If `YYY` is `000` then this represents data aggregated at the state level: 01000,
+ `YYY` is `996` then it represents "Overseas Locations"
+ `YYY` is `997` then it represents "Multicounty, Not Statewide"
+ `YYY` is `998` then it represents "Out-of-State"
+ `YYY` is `999` then it represents "Unknown Or Undefined"
- If the first character is `C` then it represents subdivisions at the MSA level
BLS documentation is available at the following links:
- For [naics](http://www.bls.gov/cew/doc/titles/area/area_titles.htm)
- For [sic](http://www.bls.gov/cew/doc/titles/area/sic_area_titles.htm)
---------------------------
<a name="fn1">1</a>: The `stata` version of this code is on Gabe's website [here](http://scholar.harvard.edu/chodorow-reich/data-programs)
---------------------------
(c) Erik Loualiche