---
title: "Vectors and Factors"
subtitle: "Stat 133"
author: "Gaston Sanchez"
output: github_document
fontsize: 11pt
urlcolor: blue
---
> ### Learning Objectives
>
> - Work with vectors of different data types
> - Understand the concept of _atomic_ structures
> - Learn how to subset and slice R vectors
> - Understand the concept of _vectorization_
> - Understand _recycling_ rules in R
---
## NBA Data
In this tutorial, we are going to consider the 2016-2017 starting lineup for the Golden
State Warriors:
| Player | Position | Salary | Points | PPG | Rookie |
|----------|----------|------------|--------|------|--------|
| Thompson | SG | 16,663,575 | 1742 | 22.3 | FALSE |
| Curry | PG | 12,112,359 | 1999 | 25.3 | FALSE |
| Green | PF | 15,330,435 | 776 | 10.2 | FALSE |
| Durant | SF | 26,540,100 | 1555 | 25.1 | FALSE |
| Pachulia | C | 2,898,000 | 426 | 6.1 | FALSE |
From a statistical point of view, we can say that there are six variables
measured on five individuals. How would you characterize each variable:
quantitative or qualitative?
From the programming point of view, what type of data would you use to encode
each variable: character, boolean, integer, real?
## Vectors in R
Vectors are the most basic type of data structure in R. Learning how to
manipulate data structures in R starts with learning
how to manipulate vectors.
### Creating vectors with `c()`
Among the main functions to work with vectors we have the __combine__ function
`c()`. This is the workhorse function to create vectors in R. Here's how to
create a vector for the players with `c()`:
```{r}
player <- c('Thompson', 'Curry', 'Green', 'Durant', 'Pachulia')
```
You can use the same function to create the vectors `position`, `salary`, and `ppg`:
```{r}
position <- c('SG', 'PG', 'PF', 'SF', 'C')
salary <- c(16663575, 12112359, 15330435, 26540100, 2898000)
ppg <- c(22.3, 25.3, 10.2, 25.1, 6.1)
```
For `rookie` you can use `c()` again, or the repetition function `rep()`:
```{r}
rookie <- c(FALSE, FALSE, FALSE, FALSE, FALSE)
# alternatively
rookie <- rep(FALSE, 5)
```
### Vectors are Atomic structures
The first thing you should learn about R vectors is that they are
__atomic structures__, which is just a fancy way of saying that all the
elements of a vector must be of the same type: all numbers,
all characters, or all logical values. To test if an object
is _atomic_, i.e. has all its elements of the same type, use `is.atomic()`.
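As a small illustration, the sketch below checks that `player` is atomic and shows what `c()` does when it is given values of different types. The vector `mixed` is a made-up example, not part of the NBA data.

```r
player <- c('Thompson', 'Curry', 'Green', 'Durant', 'Pachulia')

# a character vector is atomic
is.atomic(player)

# hypothetical vector mixing a number, a string, and a logical value:
# c() coerces all of them to a single common type (here, character)
mixed <- c(1, 'a', TRUE)
typeof(mixed)
mixed
```

Because a vector can hold only one type, R silently converts the number and the logical value into strings rather than raising an error.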
How do you know that a given vector is of a certain data type?
There are several functions that allow you to answer this question:
- `typeof()`
- `class()`
- `mode()`
One function to answer the previous question is `typeof()`:
```{r eval = FALSE}
typeof(player)
typeof(salary)
typeof(ppg)
typeof(rookie)
```
You should know that among the R community, most useRs don't really talk about
_types_. Instead, because of historical reasons related to the S language,
you will often hear useRs talk about _modes_:
```{r eval = FALSE}
mode(player)
mode(salary)
mode(ppg)
mode(rookie)
```
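To see how the two vocabularies differ, here is a short sketch that applies `typeof()`, `mode()`, and `class()` to the same vectors. It only uses the `salary` and `rookie` values defined earlier.

```r
salary <- c(16663575, 12112359, 15330435, 26540100, 2898000)
rookie <- rep(FALSE, 5)

# numbers typed without an 'L' suffix are stored as doubles
typeof(salary)   # "double"
mode(salary)     # "numeric"
class(salary)    # "numeric"

# for logical vectors the type and the mode have the same name
typeof(rookie)   # "logical"
mode(rookie)     # "logical"
```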
## Manipulating Vectors: Subsetting
Subsetting refers to extracting elements of a vector (or another R object).
To do so, you use what is known as __bracket notation__. This implies using
(square) brackets `[ ]` to get access to the elements of a vector:
```{r}
# first element
player[1]
# first three elements
player[1:3]
```
What type of things can you specify inside the brackets? Basically:
- numeric vectors
- logical vectors (ideally of the same length as the vector being subset;
shorter logical vectors are recycled)
- character vectors (if the elements have names)
### Subsetting with Numeric Indices
Here are some subsetting examples using a numeric vector inside the
brackets:
```{r eval = FALSE}
# fourth element of 'player'
player[4]
# numeric range
player[2:4]
# numeric vector
player[c(1, 3)]
# different order
player[c(3, 1, 2)]
# third element (four times)
player[rep(3, 4)]
```
### Subsetting with Logical Indices
Logical subsetting involves using a logical vector inside the brackets.
Learning about _logical subsetting_ is a fundamental survival skill.
This kind of subsetting is __very powerful__ because it allows you to
extract elements based on some logical condition.
Here's a toy example of logical subsetting:
```{r}
# dummy vector
a <- c(5, 6, 7, 8)
# logical subsetting
a[c(TRUE, FALSE, TRUE, FALSE)]
```
Logical subsetting occurs when the vector of indices that you pass inside the
brackets is a logical vector.
To do logical subsetting, the vector that you put inside the brackets
should match the length of the manipulated vector. If you pass a shorter
vector inside brackets, R will apply its recycling rules.
Notice that the elements of the vector that are subset are those which match
the logical value `TRUE`.
```{r eval = FALSE}
# your turn
player[c(TRUE, TRUE, TRUE, TRUE, TRUE)]
player[c(TRUE, TRUE, TRUE, FALSE, FALSE)]
player[c(FALSE, FALSE, FALSE, TRUE, TRUE)]
player[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
player[c(FALSE, FALSE, FALSE, FALSE, FALSE)]
# recycling
player[TRUE]
player[c(TRUE, FALSE)]
```
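The recycling in the last command can be made explicit. The following sketch, which uses the illustrative names `short_idx` and `full_idx`, builds the full-length logical vector by hand and confirms that it selects the same players.

```r
player <- c('Thompson', 'Curry', 'Green', 'Durant', 'Pachulia')

# short logical index: R recycles it to the length of 'player'
short_idx <- c(TRUE, FALSE)
player[short_idx]

# the same index written out in full: TRUE FALSE TRUE FALSE TRUE
full_idx <- rep(short_idx, length.out = length(player))
full_idx
player[full_idx]
```

Both commands return the players in positions 1, 3, and 5.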
When subsetting a vector logically, most of the time you won't really be
providing an explicit vector of `TRUE`s and `FALSE`s. Just imagine having a
vector of 100 or 1000 or 1000000 elements, and trying to do logical subsetting
by manually creating a logical vector of the same length.
That would be very tedious. Instead, you will be providing a logical condition
or a comparison operation that returns a logical vector.
A __comparison operation__ occurs when you use comparison operators such as:
- `>` greater than
- `>=` greater than or equal
- `<` less than
- `<=` less than or equal
- `==` equal
- `!=` not equal
Notice that a comparison operation always returns a logical vector:
```{r eval = FALSE}
# example with '=='
player == 'Durant'
# example with '>'
ppg > 24
```
Here are some examples of logical subsetting:
```{r eval = FALSE}
# salary of Durant
salary[player == 'Durant']
# name of players with more than 24 points per game
player[ppg > 24]
```
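It can help to split logical subsetting into two steps: first compute the logical vector, then use it as the index. In this sketch `high_scorer` is just an illustrative name.

```r
player <- c('Thompson', 'Curry', 'Green', 'Durant', 'Pachulia')
ppg <- c(22.3, 25.3, 10.2, 25.1, 6.1)

# step 1: the comparison returns one logical value per player
high_scorer <- ppg > 24
high_scorer            # FALSE TRUE FALSE TRUE FALSE

# step 2: keep only the elements where the condition is TRUE
player[high_scorer]    # "Curry" "Durant"

# TRUE counts as 1 in arithmetic, so sum() counts the matches
sum(high_scorer)       # 2
```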
In addition to using comparison operators, you can also use __logical operators__ to
produce a logical vector. The most common logical operators are:
- `&` AND
- `|` OR
- `!` negation
Run the following commands to see what R does:
```{r eval = FALSE}
# AND
TRUE & TRUE
TRUE & FALSE
FALSE & FALSE
# OR
TRUE | TRUE
TRUE | FALSE
FALSE | FALSE
# NOT
!TRUE
!FALSE
```
More examples with comparisons and logical operators:
```{r eval = FALSE}
# name of players with salary between 10 and 20 million (exclusive)
player[salary > 10000000 & salary < 20000000]
# name of players with salary between 10 and 20 million (inclusive)
player[salary >= 10000000 & salary <= 20000000]
```
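The examples above only use `&`. As a complementary sketch, here is how `|` and `!` could be combined with comparisons on the original five players.

```r
player <- c('Thompson', 'Curry', 'Green', 'Durant', 'Pachulia')
position <- c('SG', 'PG', 'PF', 'SF', 'C')
ppg <- c(22.3, 25.3, 10.2, 25.1, 6.1)

# OR: centers, or anyone with more than 25 points per game
player[position == 'C' | ppg > 25]   # "Curry" "Durant" "Pachulia"

# NOT: players who do not have more than 20 points per game
player[!(ppg > 20)]                  # "Green" "Pachulia"
```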
### Subsetting with Character Vectors
A third type of subsetting involves passing a character vector inside brackets.
When you do this, the characters are supposed to be names of the manipulated
vector.
None of the vectors `player`, `salary`, and `ppg` has names.
You can confirm that with the `names()` function applied on any of the vectors:
```{r}
names(salary)
```
Create a new vector `millions` by converting `salary` into millions, and then assign
`player` as the names of `millions`:
```{r}
# create 'millions', rounded to 2 decimals
millions <- round(salary / 1000000, 2)
# assign 'player' as names of 'millions'
names(millions) <- player
```
You should have a vector `millions` with named elements. Now you can use
character subsetting:
```{r}
millions["Durant"]
millions[c("Green", "Curry", "Pachulia")]
```
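Two details of character subsetting are worth seeing in action: the result follows the order of the names you ask for, and a name that is not present returns `NA`. The sketch below rebuilds `millions` for the original five players so that it runs on its own.

```r
player <- c('Thompson', 'Curry', 'Green', 'Durant', 'Pachulia')
salary <- c(16663575, 12112359, 15330435, 26540100, 2898000)
millions <- round(salary / 1000000, 2)
names(millions) <- player

# the result follows the order of the requested names
millions[c("Pachulia", "Durant")]

# a name that does not exist in the vector yields NA
# ('Iguodala' is not among these five names)
millions["Iguodala"]
```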
### Adding more elements
Related to subsetting, you can also add more elements to a given vector. For example, say you want to include data for three more players: Iguodala, McCaw, and Jones:
| Player | Position | Salary | Points | PPG | Rookie |
|----------|----------|------------|--------|------|--------|
| Iguodala | SF | 11,131,368 | 574 | 7.6 | FALSE |
| McCaw | SG | 543,471 | 282 | 4.0 | TRUE |
| Jones | C | 1,171,560 | 19 | 1.9 | TRUE |
You can use bracket notation to add more elements:
```{r}
player[6] <- 'Iguodala'
player[7] <- 'McCaw'
player[8] <- 'Jones'
```
Another option is to use `c()` to combine a vector with more values like this:
```{r}
position <- c(position, 'SF', 'SG', 'C')
rookie <- c(rookie, FALSE, TRUE, TRUE)
```
Of course, you can combine both options:
```{r}
salary[6] <- 11131368
salary <- c(salary, 543471, 1171560)
```
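One thing to keep in mind with bracket assignment: if you assign to a position beyond the end of the vector, R fills the gap with `NA`. Here is a sketch with a toy vector `x` (a made-up example):

```r
# hypothetical toy vector
x <- c(10, 20, 30)

# add one element with brackets and another one with c()
x[4] <- 40
x <- c(x, 50)
length(x)   # 5

# assigning to position 8 leaves positions 6 and 7 as NA
x[8] <- 80
x           # 10 20 30 40 50 NA NA 80
```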
## Vectorization
Say you want to create a vector `log_salary` by taking the logarithm of salaries:
```{r}
log_salary <- log(salary)
```
When you create the vector `log_salary`, what you're doing is applying a function to a
vector, which in turn acts on all elements of the vector.
This is called __Vectorization__ in R parlance. Most functions that operate
with vectors in R are __vectorized__ functions. This means that an action
is applied to all elements of the vector without the need to explicitly type
commands to traverse all the elements.
In many other programming languages, you would have to use a set of commands
to loop over each element of a vector (or list of numbers) to transform them.
But not in R.
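For comparison, this is roughly what the explicit, element-by-element version would look like in R itself. It is only a sketch to show what vectorization saves you from writing; `log_salary_loop` is an illustrative name.

```r
salary <- c(16663575, 12112359, 15330435, 26540100, 2898000)

# vectorized version: one call acts on every element
log_salary <- log(salary)

# explicit loop: visit each position and transform it
log_salary_loop <- numeric(length(salary))
for (i in seq_along(salary)) {
  log_salary_loop[i] <- log(salary[i])
}

# both approaches give the same values
all.equal(log_salary, log_salary_loop)   # TRUE
```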
Another example of vectorization would be the calculation of the square root
of all the points per game `ppg`:
```{r}
sqrt(ppg)
```
Or the conversion of `salary` into millions:
```r
salary / 1000000
```
### Why should you care about vectorization?
If you are new to programming, learning about R's vectorization will be very
natural (you won't stop to think about it too much). If you have some previous
programming experience in other languages (e.g. C, Python, Perl), you know
that vectorization does not tend to be a native feature.
Vectorization is essential in R. It saves you from typing many lines of code,
and you will exploit vectorization with other useful functions known as the
_apply_ family functions (we'll talk about them later in the course).
## Recycling
Closely related to the concept of _vectorization_ is the notion of
__Recycling__. To explain _recycling_ let's see an example.
`salary` is given in dollars, but what if you need to
obtain the salaries in euros? Let's create a new vector
`euros` with the converted salaries in euros. To convert from dollars to euros use the
following conversion:
1 dollar = 0.9 euro
```{r}
# your code here
```
What you just did (assuming that you did things correctly) is called
__Recycling__. To understand this concept, you need to remember that R does
not have a data structure for scalars (single numbers). Scalars are in reality
vectors of length 1.
Converting dollars to euros requires this operation: `salary * 0.9`.
Although it may not be obvious, we are multiplying two vectors: `salary` and
`0.9`. Moreover (and more importantly) __we are multiplying two vectors of
different lengths!__ So how does R know what to do in these cases?
Well, R uses the __recycling rule__, which takes the shorter vector (in this
case `0.9`) and recycles its elements to form a temporary vector that matches
the length of the longer vector (i.e. `salary`).
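You can mimic what the recycling rule does by building that temporary vector yourself. In this sketch `rate` and `euros_explicit` are illustrative names, and the five original salaries are used.

```r
salary <- c(16663575, 12112359, 15330435, 26540100, 2898000)

# recycling: the length-1 vector 0.9 is reused for every salary
euros <- salary * 0.9

# the same computation with the recycled vector written out
rate <- rep(0.9, length(salary))
euros_explicit <- salary * rate

identical(euros, euros_explicit)   # TRUE
```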
### Another recycling example
Here's another example of recycling. Salaries in odd-numbered
positions will be divided by two; salaries in even-numbered
positions will be divided by 10:
```r
units <- c(1/2, 1/10)
new_salary <- salary * units
```
The elements of `units` are recycled and repeated as many times as elements
in `salary`. The previous command is equivalent to this:
```r
new_units <- rep(c(1/2, 1/10), length.out = length(salary))
salary * new_units
```
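Recycling works most cleanly when the length of the longer vector is a multiple of the length of the shorter one. When it is not, R still recycles but emits a warning. Here is a sketch with two toy vectors, `x` and `y`, that are not part of the NBA data:

```r
# hypothetical toy vectors: length 3 is not a multiple of length 2
x <- c(1, 2, 3)
y <- c(10, 100)

# 'y' is recycled as 10, 100, 10 and R warns that the
# longer length is not a multiple of the shorter length
x * y   # 10 200 30 (with a warning)
```

Treat that warning as a hint that the two vectors may not line up the way you intended.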
-----
## Factors
As mentioned before, vectors are the most essential type of data structure
in R. They are _atomic_ structures (can contain only one type of data):
integers, real numbers, logical values, characters, complex numbers.
Related to vectors, there is another important data structure in R called
__factor__. Factors are data structures exclusively designed to handle
categorical data.
The term _factor_ as used in R for handling categorical variables, comes from
the terminology used in _Analysis of Variance_, commonly referred to as ANOVA.
In this statistical method, a categorical variable is commonly referred to as
_factor_ and its categories are known as _levels_.
### Creating Factors
To create a factor you use the homonym function `factor()`, which takes a
vector as input. The vector can be either numeric, character or logical.
Looking at the available variables, we can treat _Position_ and _Rookie_ as categorical variables. This means that we can convert the corresponding vectors `position` and `rookie` into factors.
```{r}
# convert to factor
position <- factor(position)
position
```
```{r}
rookie <- factor(rookie)
```
Notice how `position` and `rookie` are displayed. Even though the
elements are the same in both the vector and the factor, they are printed in
different formats. The values in the factor are printed without quotes, followed by a line listing the levels.
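To inspect the categories of a factor directly, you can use `levels()` and `nlevels()`. This sketch uses the five original positions; by default `factor()` sorts the levels alphabetically.

```r
position <- factor(c('SG', 'PG', 'PF', 'SF', 'C'))

# the categories (levels), sorted alphabetically by default
levels(position)    # "C" "PF" "PG" "SF" "SG"

# number of categories
nlevels(position)   # 5
```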
### How does R store factors?
Under the hood, a factor is internally stored using two arrays (R vectors): one is an
integer array with one code per element, indicating its category; the other array is the
"levels", which holds the names of the categories that the integer codes map to.
One way to confirm that the values of the categories are mapped as integers
is by using the function `storage.mode()`
```{r}
# storage of factor
storage.mode(position)
```
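You can look at both pieces separately. The sketch below extracts the integer codes with `as.integer()` and then uses them to index the levels, which recovers the original labels.

```r
position <- factor(c('SG', 'PG', 'PF', 'SF', 'C'))

# integer codes: each code is the position of the value in levels()
as.integer(position)   # 5 3 2 4 1

# the levels that the codes refer to
levels(position)       # "C" "PF" "PG" "SF" "SG"

# indexing the levels by the codes recovers the labels
levels(position)[as.integer(position)]   # "SG" "PG" "PF" "SF" "C"
```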
### Manipulating Factors
Because factors are internally stored as integer vectors, you can subset factors
like any other vector:
```{r}
position[1:5]
position[c(1, 3, 5)]
position[rep(1, 5)]
rookie[player == 'Iguodala']
rookie[player == 'McCaw']
```
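One difference from ordinary vectors is that a subset of a factor keeps all the original levels, even those that no longer appear. In this sketch `guards` is an illustrative name, and `droplevels()` removes the unused levels.

```r
position <- factor(c('SG', 'PG', 'PF', 'SF', 'C'))

# keep only the first two elements (the two guards)
guards <- position[c(1, 2)]
guards

# all five levels are still attached to the subset
levels(guards)               # "C" "PF" "PG" "SF" "SG"

# droplevels() discards the levels that are not used
levels(droplevels(guards))   # "PG" "SG"
```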
### Why use R factors?
When and why should you use factors? The simplest answer is: use R factors when you want to handle categorical data as such. Statisticians often think of variables as categorical data measured on several scales: binary, nominal, and ordinal. R lets you handle this type of data through factors. Many functions in R are specifically designed for factors, and you can (and should) take advantage of this behavior.
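As one example of a function that works naturally with factors, `table()` counts how many elements fall in each level. The sketch below uses the positions of all eight players listed in this tutorial.

```r
position <- factor(c('SG', 'PG', 'PF', 'SF', 'C', 'SF', 'SG', 'C'))

# frequency of each category
table(position)

# summary() of a factor also reports the counts per level
summary(position)
```

Here both calls report two centers, one power forward, one point guard, two small forwards, and two shooting guards.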
-----