---
title: "Lesson 4: Describing Quantitative Data (Spread)"
output:
  html_document:
    theme: cerulean
    toc: true
    toc_float: false
---
<script type="text/javascript">
function showhide(id) {
var e = document.getElementById(id);
e.style.display = (e.style.display == 'block') ? 'none' : 'block';
}
</script>
<div style="width:50%;float:right;">
#### Optional Videos for this Lesson {.tabset .tabset-pills}
##### Part 1
<iframe id="kaltura_player_1639178944" src="https://cdnapisec.kaltura.com/p/1157612/sp/115761200/embedIframeJs/uiconf_id/47306393/partner_id/1157612?iframeembed=true&playerId=kaltura_player_1639178944&entry_id=1_o1ym0xdl" width="480" height="270" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" frameborder="0"></iframe>
##### Part 2
<iframe id="kaltura_player_1639179016" src="https://cdnapisec.kaltura.com/p/1157612/sp/115761200/embedIframeJs/uiconf_id/47306393/partner_id/1157612?iframeembed=true&playerId=kaltura_player_1639179016&entry_id=1_8iuzfzoa" width="480" height="270" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" frameborder="0"></iframe>
##### Part 3
<iframe id="kaltura_player_1639179054" src="https://cdnapisec.kaltura.com/p/1157612/sp/115761200/embedIframeJs/uiconf_id/47306393/partner_id/1157612?iframeembed=true&playerId=kaltura_player_1639179054&entry_id=1_v40qfq8q" width="480" height="270" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" frameborder="0"></iframe>
</div><div style="clear:both;"></div>
<br>
## Lesson Outcomes
By the end of this lesson, you should be able to:
1. Calculate a percentile from data
2. Interpret a percentile
3. Calculate the standard deviation from data
4. Interpret the standard deviation
5. Calculate the five-number summary using software
6. Interpret the five-number summary
7. Create a box plot using software
8. Determine the five-number summary visually from a box plot
<br>
## Spread of a Distribution
In the previous lesson, we introduced two important characteristics of a distribution: the **shape** and the **center**. In this section, you will discover ways to summarize the **spread** of a distribution of data. The spread of a distribution of data describes how far the observations tend to be from each other. There are many ways to describe the spread of a distribution, but one of the most popular measurements of spread is called the "standard deviation."
<br>
### Standard Deviation and Variance
This activity introduces two measures of spread: the standard deviation and the variance.
<img src="./Images/StepsAll.png">
**Bird Flu Fever**
<img src="./Images/Peuquito.jpg">
Avian Influenza A H5N1, commonly called the bird flu, is a deadly illness that is currently only passed to humans from infected birds. This illness is particularly dangerous because at some point it is likely to mutate to allow human-to-human transmission. Health officials worldwide are preparing for the possibility of a bird flu pandemic.
<img src="./Images/Step1.png">
Dr. K. Y. Yuen led a team of researchers who reported the body temperatures of people admitted to Chinese hospitals with confirmed cases of Avian Influenza.
Their research team collected data on the body temperature at the time that people with the bird flu were admitted to the hospital. In the article, they reported on two groups of people, those with relatively uncomplicated cases of the bird flu and those with severe cases.
<img src="./Images/Step2.png">
The table below presents the data representative of the body temperatures for the two groups of bird flu patients:
| Relatively Uncomplicated Cases | Severe Cases |
|--------------------------------|--------------|
| 38.1 | 39.1 |
| 38.3 | 39.5 |
| 38.4 | 38.9 |
| 39.5 | 39.2 |
| 39.7 | 39.9 |
| | 39.7 |
| | 39.0 |
<img src="./Images/Step3.png">
Let us focus on the relatively uncomplicated cases. Creating a histogram of such a small dataset does not provide much benefit. With only a handful of values, there is not much shape to the distribution.
We can, however, use numerical summaries to give an indication of the center of the distribution.
<div class="QuestionsHeading">Answer the following questions:</div>
<div class="Questions">
7. What is the median of the body temperatures for the relatively uncomplicated cases? <!--Draw a vertical line on your number line to represent this point. -->
<a href="javascript:showhide('Q7')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q7" style="display:none;">
* The median body temperature for the relatively uncomplicated cases is 38.4 degrees Centigrade.
</div>
<br>
8. What is the mean of the body temperatures for the relatively uncomplicated cases? <!--Draw a vertical line on your number line to represent this point. -->
<a href="javascript:showhide('Q8')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q8" style="display:none;">
* The mean body temperature for the relatively uncomplicated cases is 38.8 degrees Centigrade.
</div>
</div>
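
If you would like to check the two measures of center with code rather than by hand, the following is a minimal sketch in Python. It assumes only the standard-library `statistics` module; the variable names are illustrative and are not part of the lesson.

```python
from statistics import mean, median

# Body temperatures (degrees Centigrade) for the relatively uncomplicated cases
temps = [38.1, 38.3, 38.4, 39.5, 39.7]

center_mean = mean(temps)      # arithmetic average of the five values
center_median = median(temps)  # middle value once the data are sorted

print(round(center_mean, 3), center_median)
```
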
<br>
We will use these data to investigate some measures of the spread in a data set.
There is relatively little difference in the temperatures of the uncomplicated patients. The lowest is $38.1 ^\circ \text{C}$, while the highest temperature is $39.7 ^\circ \text{C}$.
The **standard deviation** is a measure of the *spread* in the distribution. If the data tend to be close together, then the standard deviation is relatively small. If the data tend to be more spread out, then the standard deviation is relatively large.
The standard deviation of the body temperatures is $0.742 ^\circ \text{C}$. This number contains information from all the patients. If the patients' temperatures had been more diverse, the standard deviation would have been larger. If the patients' temperatures had been more uniform (i.e. closer together), then the standard deviation would have been smaller. If all the patients somehow had the same temperature, then the standard deviation would be zero.
We are working with a sample. To be explicit, we call $0.742 ^\circ \text{C}$ the *sample* standard deviation. The symbol for the sample standard deviation is $s$. This is a statistic. The parameter representing the *population* standard deviation is $\sigma$ (pronounced /SIG-ma/). In practice, we rarely know the value of the population standard deviation, so we use the sample standard deviation $s$ as an approximation for the unknown population standard deviation $\sigma$.
At this point, you probably do not have much intuition regarding the standard deviation. We will use this statistic frequently. By the end of the semester you can expect to become very comfortable with this idea. For now, all you need to know is that if two variables are measured on the same scale, the variable with values that are further apart will have the larger standard deviation.
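
To see the last point in action, here is a small sketch that compares the two groups from the table above, which are measured on the same scale. It assumes Python's standard-library `statistics.stdev`, which computes the sample standard deviation; the variable names are made up for this illustration.

```python
from statistics import stdev

# Body temperatures in degrees Centigrade, copied from the table above
uncomplicated = [38.1, 38.3, 38.4, 39.5, 39.7]
severe = [39.1, 39.5, 38.9, 39.2, 39.9, 39.7, 39.0]

s_uncomplicated = stdev(uncomplicated)  # sample standard deviation
s_severe = stdev(severe)

print(round(s_uncomplicated, 3))
print(round(s_severe, 3))
```

The temperatures of the uncomplicated cases range more widely than those of the severe cases, so the first standard deviation comes out larger than the second.
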
<div class="SoftwareHeading">Excel Instructions</div>
<div class="Software">
**To calculate the sample standard deviation in Excel, follow these steps:**
* In a blank cell, type "**=stdev.s(**"
* Highlight the data (the cell range reference will be added to your formula)
* Close the parenthesis with "**)**" and hit enter
<div style="padding-left:35px;"><img src="./Images/Lesson_03_stdev.PNG"></div>
<br>
</div>
<br>
<div class="message Tip">**Rounding:** As a general rule, when reporting your answers in this class, round to three decimal places unless otherwise specified.</div>
<br>
<br>
#### Calculating the Standard Deviation by Hand
How is the standard deviation computed? Where does this "magic" number come from? How does one number include the information about the spread of all the points?
It is a little tedious to compute the standard deviation by hand. You will usually compute standard deviation with a computer. However, the process is very instructive and will help you understand conceptually what the statistic represents. As you work through the following steps, please remember the goal is to find a measure of the spread in a data set. We want one number that describes how spread out the data are.
First, observe the number line below, where each x represents the temperature of a patient with a relatively uncomplicated case of bird flu. As mentioned earlier, there is not a huge spread in the temperatures.
<img src="./Images/BirdFlu-DotPlotAnswer.png">
On your sketch of the number line, draw a vertical line at 38.8 degrees, the sample mean. Now, draw horizontal lines from the mean to each of your $\times$'s. These horizontal line segments represent the spread of the data about the mean. Your plot should look something like this:
<img src="./Images/BirdFlu-DotPlotAnswer-Lines.png">
The length of each of the line segments represents how far each observation is from the mean. If the data are close together, these lines will be fairly short. If the distribution has a large spread, the line segments will be longer. The standard deviation is a measure of how long these lines are, as a whole.
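The signed difference between an observation and the mean is referred to as a deviation. In other words, the size of each deviation is the length of the corresponding line segment drawn in the image above, and its sign tells you on which side of the mean the observation lies.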
$$
\begin{array}{lcl}
\text{Deviation} & = & \text{Value} - \text{Mean} \\
\text{Deviation} & = & x - \bar x
\end{array}
$$
If the observed value is greater than the mean, the deviation is positive. If the value is less than the mean, the deviation is negative.
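
The sign rule can be illustrated with a tiny sketch. The three values below are hypothetical and chosen only so that the arithmetic is easy to follow; they are not the bird flu data, which you will work through yourself in the steps that follow.

```python
# Hypothetical toy data, used only to illustrate the sign of a deviation
values = [2, 4, 9]

x_bar = sum(values) / len(values)           # sample mean
deviations = [x - x_bar for x in values]    # deviation = value - mean

for x, d in zip(values, deviations):
    side = "above" if d > 0 else "below"
    print(f"value {x}: deviation {d:+.1f} ({side} the mean)")
```
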
The standard deviation is a complicated sort of average of the deviations. Making a table like the one below will help you keep track of your calculations. Please participate fully in this exercise. Writing your answers at each step and developing a table as instructed will greatly enhance the learning experience. By following these steps, you will be able to compute the standard deviation by hand, and more importantly, understand what it is telling you.
**Step 01**: The first step in computing the standard deviation by hand is to create a table, like the following. Enter the observed data in the first column.
<center>
<table>
<thead>
<tr class="header">
<th><p>Observation ($x$)</p></th>
<th><p>Deviation from the Mean ($x-\bar x$)
</p></th>
</tr>
</thead>
<tbody>
<tr class="odd">
</tr>
<tr class="even">
<td><p>$38.1$</p></td>
<td><p>$38.1-38.8=-0.7$</p></td>
</tr>
<tr class="odd">
<td><p>$38.3$</p></td>
<td></td>
</tr>
<tr class="even">
<td><p>$38.4$</p></td>
<td></td>
</tr>
<tr class="odd">
<td><p>$39.5$</p></td>
<td></td>
</tr>
<tr class="even">
<td><p>$39.7$</p></td>
<td></td>
</tr>
<tr class="odd">
<td><p>$\bar x = 38.8$</p></td>
<td></td>
</tr>
</tbody>
</table>
</center>
**Step 02**: The second column of the table contains the deviations from the mean. Complete column 2 of the table above.
<a href="javascript:showhide('Step2')"><span style="font-size:8pt;">Check Results for Step 2</span></a>
<div id="Step2" style="display:none;">
<table>
<thead>
<tr class="header">
<th><p>Observation ($x$)</p></th>
<th><p>Deviation from the Mean ($x-\bar x$)
</p></th>
</tr>
</thead>
<tbody>
<tr class="even">
<td><p>$38.1$</p></td>
<td><p>$38.1-38.8=-0.7$</p></td>
</tr>
<tr class="odd">
<td><p>$38.3$</p></td>
<td><p>$38.3-38.8=-0.5$</p></td>
</tr>
<tr class="even">
<td><p>$38.4$</p></td>
<td><p>$38.4-38.8=-0.4$</p></td>
</tr>
<tr class="odd">
<td><p>$39.5$</p></td>
<td><p>$39.5-38.8=0.7$</p></td>
</tr>
<tr class="even">
<td><p>$39.7$</p></td>
<td><p>$39.7-38.8=0.9$</p></td>
</tr>
<tr class="odd">
<td><p>$\bar x = 38.8$</p></td>
<td></td>
</tr>
</tbody>
</table>
</div>
<br>
<div class="QuestionsHeading">Answer the following questions:</div>
<div class="Questions">
9. How could we use this table to find the "typical" distance from each point to the mean? Think carefully about this, and then write down your answer before continuing.
<a href="javascript:showhide('Q9')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q9" style="display:none;">
* You may have suggested that we compute the mean of these values. This seems like a good idea. If we compute the mean, it will tell us the average deviation from the mean.
<br>
<br>
9b. Compute the mean of Column 2. What do you get?
<a href="javascript:showhide('Q9b')"><span style="font-size:8pt;">Click Here to Continue</span></a>
<div id="Q9b" style="display:none;">
* You should have found that the mean of the deviations is zero. This is true for *every* data set. If you add up the deviations from the mean, the positive values will cancel with the negative values. The sum of the deviations from the mean will be zero, so the mean also must equal zero.
* The good news is that you can use this fact to check if you are on the right track. If the deviations from the mean do not add up to zero, then you have made a mistake in the calculations. The bad news is that the deviations always add up to 0, making it look like the distance from the data to the mean is 0. Nonsense!
* The mean of the deviations from the mean cannot be used to find a measure of the spread in a data set, but it does provide a guidepost that shows we are on the right track. We must find another way to estimate the spread of a data set.
</div>
</div>
<br>
10. We need a way to work with the negative deviations from the mean, so they do not cancel with the positive ones. What could we do? (Choose one of the four options below.)
<a href="javascript:showhide('Q10a')"><span style="font-size:8pt;">Option #1: Take the absolute value of the deviations</span></a>
<div id="Q10a" style="display:none;">
* Option #1:
* This is an excellent suggestion. This is probably one of the first things statisticians used to estimate the spread in the data.
* If we take the absolute value of the deviations, then all the values are positive. By taking the mean of these numbers, we do get a measure of spread. This quantity is called the mean absolute deviation (MAD).
* There is good news and bad news. The good news is, you discovered a way to estimate the spread in a data set. (In fact, the MAD is used as one estimate of the volatility of stocks.) The bad news is that the MAD does not have good theoretical properties. A proof of this claim requires calculus, and so will not be discussed here. For most applications, there is a better choice. Please select another option.
</div>
<br>
<a href="javascript:showhide('Q10b')"><span style="font-size:8pt;">Option #2: Square the deviations from the mean</span></a>
<div id="Q10b" style="display:none;">
* Option #2:
* If we square the deviations from the mean, the values that were negative will become positive. This leads to an estimator of the spread that has excellent theoretical properties. This is the best of the four options. You will apply this idea in Step 03.
</div>
<br>
<a href="javascript:showhide('Q10c')"><span style="font-size:8pt;">Option #3: Delete all the negative deviations</span></a>
<div id="Q10c" style="display:none;">
* Option #3:
* Sorry, you can't make your troubles go away by deleting things you don't like. Please try again.
</div>
<br>
<a href="javascript:showhide('Q10d')"><span style="font-size:8pt;">Option #4: Do something entirely different</span></a>
<div id="Q10d" style="display:none;">
* Option #4:
* You probably have an ingenious idea. Surprisingly enough, there *is* a right answer to the question. Please choose a different option.
</div>
<br>
<br>
* Please do not go on to Step 03 until you have finished this exploration.
</div>
<br>
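
The two ideas explored in the questions above can be checked numerically. The sketch below, written in Python with illustrative variable names, shows that the deviations sum to zero (up to floating-point rounding) and computes the mean absolute deviation described in Option #1.

```python
temps = [38.1, 38.3, 38.4, 39.5, 39.7]

x_bar = sum(temps) / len(temps)
deviations = [x - x_bar for x in temps]

# Positive and negative deviations cancel, so this is zero apart from rounding error
sum_of_deviations = sum(deviations)

# Option #1: average the absolute values so nothing cancels
mean_abs_deviation = sum(abs(d) for d in deviations) / len(temps)

print(round(sum_of_deviations, 10))
print(round(mean_abs_deviation, 3))
```
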
<center>
"Piled Higher and Deeper" by Jorge Cham
<br>
<img src="./Images/Phd062307s.png">
</center>
**Step 03**: Add a third column to your table. To get the values in this column, square the deviations from the mean that you found in Column 2.
<a href="javascript:showhide('Step3blank')"><span style="font-size:8pt;">Click Here for a Blank Table</span></a>
<div id="Step3blank" style="display:none;">
<table>
<thead>
<tr class="header">
<th><p>Observation $x$</p></th>
<th><p>Deviation<br />
from the Mean<br />
$x-\bar x$</p></th>
<th><p>Squared Deviation<br />
from the Mean<br />
$\left(x-\bar x\right)^2$</p></th>
</tr>
</thead>
<tbody>
<tr class="even">
<td><p>$38.1$</p></td>
<td><p>$38.1-38.8=-0.7$</p></td>
<td></td>
</tr>
<tr class="odd">
<td><p>$38.3$</p></td>
<td><p>$38.3-38.8=-0.5$</p></td>
<td></td>
</tr>
<tr class="even">
<td><p>$38.4$</p></td>
<td><p>$38.4-38.8=-0.4$</p></td>
<td></td>
</tr>
<tr class="odd">
<td><p>$39.5$</p></td>
<td><p>$39.5-38.8=0.7$</p></td>
<td></td>
</tr>
<tr class="even">
<td><p>$39.7$</p></td>
<td><p>$39.7-38.8=0.9$</p></td>
<td></td>
</tr>
<tr class="odd">
<td><p>$\bar x = 38.8$</p></td>
<td><p>Sum $=0$</p></td>
<td></td>
</tr>
</tbody>
</table>
</div>
<br>
<a href="javascript:showhide('Step3filled')"><span style="font-size:8pt;">Check Results for Step 03</span></a>
<div id="Step3filled" style="display:none;">
<table>
<thead>
<tr class="header">
<th><p>Observation $x$</p></th>
<th><p>Deviation<br />
from the Mean<br />
$x-\bar x$</p></th>
<th><p>Squared Deviation<br />
from the Mean<br />
$\left(x-\bar x\right)^2$</p></th>
</tr>
</thead>
<tbody>
<tr class="even">
<td><p>$38.1$</p></td>
<td><p>$38.1-38.8=-0.7$</p></td>
<td><p>$(-0.7)^2=0.49$</p></td>
</tr>
<tr class="odd">
<td><p>$38.3$</p></td>
<td><p>$38.3-38.8=-0.5$</p></td>
<td><p>$(-0.5)^2=0.25$</p></td>
</tr>
<tr class="even">
<td><p>$38.4$</p></td>
<td><p>$38.4-38.8=-0.4$</p></td>
<td><p>$(-0.4)^2=0.16$</p></td>
</tr>
<tr class="odd">
<td><p>$39.5$</p></td>
<td><p>$39.5-38.8=0.7$</p></td>
<td><p>$(0.7)^2=0.49$</p></td>
</tr>
<tr class="even">
<td><p>$39.7$</p></td>
<td><p>$39.7-38.8=0.9$</p></td>
<td><p>$(0.9)^2=0.81$</p></td>
</tr>
<tr class="odd">
<td><p>$\bar x = 38.8$</p></td>
<td><p>Sum $=0$</p></td>
<td></td>
</tr>
</tbody>
</table>
</div>
<br>
**Step 04**: Now, add up the squared deviations from the mean.
<a href="javascript:showhide('Step4')"><span style="font-size:8pt;">Check Results for Step 04</span></a>
<div id="Step4" style="display:none;">
<table>
<thead>
<tr class="header">
<th><p>Observation $x$</p></th>
<th><p>Deviation<br />
from the Mean<br />
$x-\bar x$</p></th>
<th><p>Squared Deviation<br />
from the Mean<br />
$\left(x-\bar x\right)^2$</p></th>
</tr>
</thead>
<tbody>
<tr class="even">
<td><p>$38.1$</p></td>
<td><p>$38.1-38.8=-0.7$</p></td>
<td><p>$(-0.7)^2=0.49$</p></td>
</tr>
<tr class="odd">
<td><p>$38.3$</p></td>
<td><p>$38.3-38.8=-0.5$</p></td>
<td><p>$(-0.5)^2=0.25$</p></td>
</tr>
<tr class="even">
<td><p>$38.4$</p></td>
<td><p>$38.4-38.8=-0.4$</p></td>
<td><p>$(-0.4)^2=0.16$</p></td>
</tr>
<tr class="odd">
<td><p>$39.5$</p></td>
<td><p>$39.5-38.8=0.7$</p></td>
<td><p>$(0.7)^2=0.49$</p></td>
</tr>
<tr class="even">
<td><p>$39.7$</p></td>
<td><p>$39.7-38.8=0.9$</p></td>
<td><p>$(0.9)^2=0.81$</p></td>
</tr>
<tr class="odd">
<td><p>$\bar x = 38.8$</p></td>
<td><p>Sum $=0$</p></td>
<td><p>Sum $=2.20$</p></td>
</tr>
</tbody>
</table>
The sum of the squared deviations is 2.20.
</div>
<br>
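
Steps 02 through 04 can also be sketched in a few lines of Python, which is a handy way to check a hand-built table. The variable names are illustrative only.

```python
temps = [38.1, 38.3, 38.4, 39.5, 39.7]

x_bar = sum(temps) / len(temps)                       # sample mean
squared_deviations = [(x - x_bar) ** 2 for x in temps]  # Column 3 of the table
sum_of_squares = sum(squared_deviations)              # Step 04

print(round(sum_of_squares, 2))
```
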
<div class="QuestionsHeading">Answer the following questions:</div>
<div class="Questions">
11. Suppose that the researchers had collected body temperature data on 500 bird flu patients instead of 5. What would happen to the sum of the squared deviations, if the distribution of the data is the same for the 500 patients as the 5 patients?
<a href="javascript:showhide('Q11')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q11" style="display:none;">
* We would expect the sum of the squared deviations to be a lot larger than it is now. We would be adding squared deviations for 500 observations instead of 5. So, the sum of the squared deviations would be about 100 times larger.
* Remember, we are trying to find a measure of the spread of a data set. Our final measure should not be dependent on the sample size. We need to do something else.
</div>
<br>
12. What could we do to make sure the sample size does not inflate our estimate of the spread of the data?
<a href="javascript:showhide('Q12a')"><span style="font-size:8pt;">Option #1: Divide by n</span></a>
<div id="Q12a" style="display:none;">
* Option #1:
* This is an excellent suggestion. There are good reasons to choose this option. Unfortunately, dividing by $n$ to estimate the spread of data gives estimates that are too low, on average. There is a surprising, yet simple fix: divide by $n-1$ instead of $n$. Please examine Option #2.
</div>
<br>
<a href="javascript:showhide('Q12b')"><span style="font-size:8pt;">Option #2: Divide by n-1</span></a>
<div id="Q12b" style="display:none;">
* Option #2:
* This seems very odd, but it avoids the problem of underestimating the spread in the data, a problem that dividing by $n$ has. Ultimately, dividing by $n-1$ will lead us to the standard deviation you computed using Excel earlier. In Step 05, you will divide the sum of the squared deviations by $n-1$. Here is how your table should look so far.
<table>
<thead>
<tr class="header">
<th><p>Observation $x$</p></th>
<th><p>Deviation<br />
from the Mean<br />
$x-\bar x$</p></th>
<th><p>Squared Deviation<br />
from the Mean<br />
$\left(x-\bar x\right)^2$</p></th>
</tr>
</thead>
<tbody>
<tr class="even">
<td><p>$38.1$</p></td>
<td><p>$38.1-38.8=-0.7$</p></td>
<td><p>$(-0.7)^2=0.49$</p></td>
</tr>
<tr class="odd">
<td><p>$38.3$</p></td>
<td><p>$38.3-38.8=-0.5$</p></td>
<td><p>$(-0.5)^2=0.25$</p></td>
</tr>
<tr class="even">
<td><p>$38.4$</p></td>
<td><p>$38.4-38.8=-0.4$</p></td>
<td><p>$(-0.4)^2=0.16$</p></td>
</tr>
<tr class="odd">
<td><p>$39.5$</p></td>
<td><p>$39.5-38.8=0.7$</p></td>
<td><p>$(0.7)^2=0.49$</p></td>
</tr>
<tr class="even">
<td><p>$39.7$</p></td>
<td><p>$39.7-38.8=0.9$</p></td>
<td><p>$(0.9)^2=0.81$</p></td>
</tr>
<tr class="odd">
<td><p>$\bar x = 38.8$</p></td>
<td><p>Sum $=0$</p></td>
<td><p>Sum $=2.20$</p></td>
</tr>
<tr class="even">
<td></td>
<td><p>$\displaystyle{s^2=\frac{sum}{n-1}=\frac{2.20}{5-1}=0.55}$</p></td>
<td></td>
</tr>
</tbody>
</table>
</div>
<br>
<a href="javascript:showhide('Q12c')"><span style="font-size:8pt;">Option #3: Neither of these</span></a>
<div id="Q12c" style="display:none;">
* Option #3:
* You probably have an ingenious idea. Nevertheless, please choose a different option.
</div>
<br>
* Please do not go on until you have finished this exercise.
</div>
<br>
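
The effect of sample size discussed in the questions above can be sketched as follows. As a simplifying assumption, the 500 patients are imitated by repeating the five observed temperatures 100 times; this is an artificial construction used only to keep the distribution the same, and the helper name `sum_of_squares` is hypothetical.

```python
def sum_of_squares(data):
    """Sum of squared deviations from the mean (Step 04)."""
    x_bar = sum(data) / len(data)
    return sum((x - x_bar) ** 2 for x in data)

temps = [38.1, 38.3, 38.4, 39.5, 39.7]
many_temps = temps * 100   # 500 values with the same distribution

ss_small = sum_of_squares(temps)
ss_large = sum_of_squares(many_temps)

# The raw sum grows with the number of observations...
print(round(ss_large / ss_small))

# ...but after dividing by n - 1 the two results are of similar size
print(round(ss_small / (len(temps) - 1), 3))
print(round(ss_large / (len(many_temps) - 1), 3))
```
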
**Step 05**: Divide the sum of the squared deviations by $n - 1$. Write this value at the bottom of Column 3 of your table.
The number you computed in Step 05 is called the <span id="Sample Variance">**sample variance**</span>. It is a measure of the spread in a data set. It has very nice theoretical properties. The variance plays an important role in Statistics. We denote the sample variance by the symbol $s^2$.
It can be shown that the sample variance is an unbiased estimator of the true population variance (which is denoted $\sigma^2$). This means that, averaged over many random samples, the sample variance equals the population variance, so it can be considered a reasonable estimator of the population variance. If the sample size is large, this estimator tends to be very good.
<a href="javascript:showhide('Step5')"><span style="font-size:8pt;">Check the Results for Step 05</span></a>
<div id="Step5" style="display:none;">
The sum of the squared deviations is the sum of the values in Column 3. This sum equals 2.20. We divide the sum of Column 3 ($2.20$) by $n-1=5-1=4$ to get the sample variance, $s^2$:
$$s^2=\frac{sum}{n-1}=\frac{2.20}{5-1}=0.55$$
This is the sample variance.
<table>
<thead>
<tr class="header">
<th><p>Observation $x$</p></th>
<th><p>Deviation<br />
from the Mean<br />
$x-\bar x$</p></th>
<th><p>Squared Deviation<br />
from the Mean<br />
$\left(x-\bar x\right)^2$</p></th>
</tr>
</thead>
<tbody>
<tr class="even">
<td><p>$38.1$</p></td>
<td><p>$38.1-38.8=-0.7$</p></td>
<td><p>$(-0.7)^2=0.49$</p></td>
</tr>
<tr class="odd">
<td><p>$38.3$</p></td>
<td><p>$38.3-38.8=-0.5$</p></td>
<td><p>$(-0.5)^2=0.25$</p></td>
</tr>
<tr class="even">
<td><p>$38.4$</p></td>
<td><p>$38.4-38.8=-0.4$</p></td>
<td><p>$(-0.4)^2=0.16$</p></td>
</tr>
<tr class="odd">
<td><p>$39.5$</p></td>
<td><p>$39.5-38.8=0.7$</p></td>
<td><p>$(0.7)^2=0.49$</p></td>
</tr>
<tr class="even">
<td><p>$39.7$</p></td>
<td><p>$39.7-38.8=0.9$</p></td>
<td><p>$(0.9)^2=0.81$</p></td>
</tr>
<tr class="odd">
<td><p>$\bar x = 38.8$</p></td>
<td><p>Sum $=0$</p></td>
<td><p>Sum $=2.20$</p></td>
</tr>
<tr class="even">
<td><p>Variance:</p></td>
<td><p>$\displaystyle{s^2=\frac{sum}{n-1}=\frac{2.20}{5-1}=0.55}$</p></td>
<td></td>
</tr>
</tbody>
</table>
</div>
<br>
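
The claim that dividing by $n-1$ avoids underestimating the spread can be checked exactly on a small example. The sketch below is not part of the bird flu study: it assumes a hypothetical population consisting of the six faces of a die, lists every possible sample of size 2 drawn with replacement, and averages the two candidate estimates. It uses `variance` (divides by $n-1$) and `pvariance` (divides by $n$) from Python's standard-library `statistics` module.

```python
from itertools import product
from statistics import mean, pvariance, variance

# Hypothetical population: the faces of a fair die
population = [1, 2, 3, 4, 5, 6]
sigma_squared = pvariance(population)   # true population variance

# Every ordered sample of size 2, drawn with replacement (36 in total)
samples = list(product(population, repeat=2))

avg_dividing_by_n_minus_1 = mean(variance(s) for s in samples)
avg_dividing_by_n = mean(pvariance(s) for s in samples)

print(round(sigma_squared, 4))
print(round(avg_dividing_by_n_minus_1, 4))
print(round(avg_dividing_by_n, 4))
```

In this example, the estimates that divide by $n-1$ average out to the population variance, while the estimates that divide by $n$ average out to only half of it.
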
<div class="QuestionsHeading">Answer the following questions:</div>
<div class="Questions">
13. The temperature data for the bird flu patients are in degrees Centigrade. What are the units of the variance?
<a href="javascript:showhide('Q13')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q13" style="display:none;">
* The data in Column 1 of the table are in degrees Centigrade. The mean also is in degrees Centigrade.
* To get the numbers in Column 2, we subtracted the mean from each of the values in Column 1.
* We squared the values in Column 2 to get Column 3. The units for this column are degrees Centigrade squared.
* The sum of the numbers in Column 3 will also be in units of degrees Centigrade squared.
* When we divided that sum by $n-1$, we obtained the sample variance. The sample variance has units of degrees Centigrade squared. This is not easily interpretable. It would be much easier to think about it if our measure of spread was in the same units as the data.
</div>
<br>
14. What operation can we do to the variance to get a quantity with units degrees Centigrade?
<a href="javascript:showhide('Q14')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q14" style="display:none;">
* If we take the square root of the variance, we get a quantity that has units of degrees Centigrade. This quantity is the standard deviation.
</div>
</div>
<br>
**Step 06**: Take the square root of the sample variance to get the sample standard deviation.
The sample standard deviation is defined as the square root of the sample variance.
$$\text{Sample Standard Deviation} = s = \sqrt{ s^2 } = \sqrt{\strut\text{Sample Variance}}$$
The standard deviation has the same units as the original observations. We use the standard deviation heavily in statistics.
The sample standard deviation ($s$) is an estimate of the true population standard deviation ($\sigma$).
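
Steps 01 through 06 can be collected into a single formula. As a sketch in the notation used above, write $x_i$ for the $i$-th observation (the index $i$ is introduced here only for this summary) and $n$ for the sample size:

$$
s = \sqrt{\frac{\displaystyle\sum_{i=1}^{n} \left(x_i-\bar x\right)^2}{n-1}}
$$

Reading from the inside out: subtract the mean from each observation, square each deviation, add the squares, divide by $n-1$, and take the square root.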
<div class="QuestionsHeading">Answer the following questions:</div>
<div class="Questions">
15. What is the sample standard deviation, $s$, of the temperatures of the five patients with relatively uncomplicated cases of the bird flu?
<a href="javascript:showhide('Q15')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q15" style="display:none;">
* The sum of the squared deviations is the sum of the values in Column 3. This sum equals 2.20. We divide the sum of Column 3 ($2.20$) by $n-1=5-1=4$ to get the sample variance, $s^2$:
<center>
$s^2=\frac{sum}{n-1}=\frac{2.20}{5-1}=0.55$
</center>
* This is the sample variance.
<table>
<thead>
<tr class="header">
<th><p>Observation $x$</p></th>
<th><p>Deviation<br />
from the Mean<br />
$x-\bar x$</p></th>
<th><p>Squared Deviation<br />
from the Mean<br />
$\left(x-\bar x\right)^2$</p></th>
</tr>
</thead>
<tbody>
<tr class="even">
<td><p>$38.1$</p></td>
<td><p>$38.1-38.8=-0.7$</p></td>
<td><p>$(-0.7)^2=0.49$</p></td>
</tr>
<tr class="odd">
<td><p>$38.3$</p></td>
<td><p>$38.3-38.8=-0.5$</p></td>
<td><p>$(-0.5)^2=0.25$</p></td>
</tr>
<tr class="even">
<td><p>$38.4$</p></td>
<td><p>$38.4-38.8=-0.4$</p></td>
<td><p>$(-0.4)^2=0.16$</p></td>
</tr>
<tr class="odd">
<td><p>$39.5$</p></td>
<td><p>$39.5-38.8=0.7$</p></td>
<td><p>$(0.7)^2=0.49$</p></td>
</tr>
<tr class="even">
<td><p>$39.7$</p></td>
<td><p>$39.7-38.8=0.9$</p></td>
<td><p>$(0.9)^2=0.81$</p></td>
</tr>
<tr class="odd">
<td><p>$\bar x = 38.8$</p></td>
<td><p>Sum $=0$</p></td>
<td><p>Sum $=2.20$</p></td>
</tr>
<tr class="even">
<td><p>Variance:</p></td>
<td><p>$\displaystyle{s^2=\frac{sum}{n-1}=\frac{2.20}{5-1}=0.55}$</p></td>
<td></td>
</tr>
<tr class="odd">
<td><p>Standard Deviation:</p></td>
<td><p>$\displaystyle{s = \sqrt{s^2}=\sqrt{0.55} \approx 0.742}$</p></td>
<td></td>
</tr>
</tbody>
</table>
* Taking the square root of the sample variance gives the sample standard deviation: $s = \sqrt{0.55} \approx 0.742$ degrees Centigrade.
* Take a few minutes to verify that you can recreate this table on your own.
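
One way to check your work is to rebuild the table with a short script. The following is a minimal sketch in Python (this course uses Excel, so the language and the variable names are only illustrative). It follows the same steps as the table: find the mean, the deviations, the squared deviations, then divide by $n-1$ and take the square root.

```python
import math

# Temperatures (degrees Centigrade) of the five uncomplicated cases, from the table above.
temps = [38.1, 38.3, 38.4, 39.5, 39.7]

n = len(temps)
mean = sum(temps) / n

# Column 2: deviation of each observation from the mean.
deviations = [x - mean for x in temps]
# Column 3: squared deviation from the mean.
squared_deviations = [d ** 2 for d in deviations]

# Sum of Column 3, then divide by n - 1 to get the sample variance.
sum_sq = sum(squared_deviations)
variance = sum_sq / (n - 1)
# The sample standard deviation is the square root of the sample variance.
std_dev = math.sqrt(variance)

for x, d, d2 in zip(temps, deviations, squared_deviations):
    print(f"{x:5.1f}  {d:5.1f}  {d2:5.2f}")
print(f"mean = {mean:.1f}, sum of squared deviations = {sum_sq:.2f}")
print(f"s^2 = {variance:.2f}, s = {std_dev:.3f}")
```

Because computers store decimals such as 38.1 only approximately, the unformatted values can differ from the table in the last few decimal places. The printed values are rounded for display only.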
</div>
</div>
<br>
#### Summary
<div class="Emphasis">
**Standard Deviation**
The **standard deviation** is a single number that describes the spread in a set of data. If the data points are close together, the standard deviation will be smaller than if they are spread out.
At this point, it may be difficult to understand the meaning and usefulness of the standard deviation. For now, it is enough for you to recognize the following points:
+ The standard deviation is a measure of how spread out the data are.
+ If the standard deviation is large, then the data are very spread out.
+ If the standard deviation is zero, then all the values are identical--there is no spread in the data.
+ The standard deviation cannot be negative.
<br>
</div>
**Variance**
The variance is the square of the standard deviation. The sample variance is denoted by the symbol $s^2$. You found above that the sample standard deviation for patient temperatures of uncomplicated cases of bird flu is $s = 0.74162$. So, the sample variance for this data set is $s^2 = 0.74162^2 = 0.550$. Be aware that if you had squared the rounded value of $s = 0.742$ in the calculation, you would have gotten an answer of 0.551 instead. This would be considered incorrect!
<!-- [[NEED TO DISCUSS STATISTIC v. PARAMETER.]] -->
<div class="message Tip">**Rounding:** Use **un**rounded values in interim calculations. Rounding too early in the process can lead to wrong answers.</div>
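
The effect of rounding too early can be seen in a small sketch. It is written in Python only for illustration (the course itself uses Excel), and it assumes the sample variance of 0.55 from the worked example above.

```python
import math

variance = 0.55               # sample variance s^2 from the worked example
s = math.sqrt(variance)       # unrounded standard deviation, about 0.74162
s_rounded = round(s, 3)       # rounded too early: 0.742

# Square each version and report the result to three decimal places.
variance_from_unrounded = round(s ** 2, 3)
variance_from_rounded = round(s_rounded ** 2, 3)

print(f"{variance_from_unrounded:.3f}")  # prints 0.550
print(f"{variance_from_rounded:.3f}")    # prints 0.551
```

The two results differ in the third decimal place even though the only change was rounding $s$ before squaring it.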
<br>
<br>
<div class="SoftwareHeading">Excel Instructions</div>
<div class="Software">
**To calculate the sample variance in Excel:**
* In a blank cell, type "**=var.s(**"
* Highlight the data (the cell range reference will be added to your formula)
* Close the parenthesis with "**)**" and hit enter
<br>
</div>
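
If you want to cross-check a spreadsheet result outside of Excel, the sketch below shows one possibility, assuming Python and its standard `statistics` module are available (neither is required for this course). Like `VAR.S`, the functions `statistics.variance` and `statistics.stdev` divide by $n-1$. The data are the five uncomplicated-case temperatures used earlier.

```python
import statistics

# Temperatures (degrees Centigrade) of the five uncomplicated cases.
temps = [38.1, 38.3, 38.4, 39.5, 39.7]

sample_variance = statistics.variance(temps)  # sample variance, divides by n - 1
sample_std_dev = statistics.stdev(temps)      # square root of the sample variance

print(f"s^2 = {sample_variance:.3f}")
print(f"s   = {sample_std_dev:.3f}")
```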
<br>
The standard deviation and variance are two commonly used measures of the spread in a data set. Why is there more than one measure of the spread? The standard deviation and the variance each have their own pros and cons.
The variance has excellent theoretical properties. The sample variance is an unbiased estimator of the true population variance. That means that if many, many samples of $n$ observations were drawn, the variances computed for all the samples would be centered nicely around the true population variance, $\sigma^2$. Because of these benefits, the variance is regularly used in higher-level statistics applications. One drawback of the variance is that its units are the square of the units of the original data. In the bird flu example, the body temperatures were measured in degrees Centigrade. So, the variance will have units of degrees Centigrade squared $(^\circ \text{C})^2$. What does degrees Centigrade squared mean? How do you interpret this? Squared degrees have no intuitive interpretation. This is one of the major drawbacks of the sample variance.
Because we take the square root of the variance to get the standard deviation, the standard deviation is in the same units as the original data. This is a great advantage, and is one of the reasons that the standard deviation is commonly used to describe the spread of data.
<div class="QuestionsHeading">Answer the following questions:</div>
<div class="Questions">
Enter the patient temperature data for the severe cases of bird flu into Excel. Then use Excel to calculate the numerical summaries you have learned so far. As a reminder, the temperatures of patients with a severe case of bird flu are:
<center> 39.1, 39.5, 38.9, 39.2, 39.9, 39.7, 39 </center>
16. What are the mean, median, standard deviation, and variance of the sample?
<a href="javascript:showhide('Q16')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q16" style="display:none;">
<div style="padding-left:35px;"><img src="./Images/Lesson_03_summary_stats_severe_bird_flu.PNG"></div>
</div>
<br>
For the next two questions, consider the histograms below comparing weight (in kilograms) of men (top histogram) to elephant seals (bottom histogram).
<br>
<div style="padding-left:35px;">
Weight of Men Compared to Weight of Seals<br>
<img src="./Images/DiveSealsVsMenWeights-Hist.png" width=80%>
</div>
17. Based on the histograms, who has a greater sample mean weight, men or elephant seals?
<a href="javascript:showhide('Q17')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q17" style="display:none;">
* The mean is a measure of the center of a distribution. The mean weight of the men is less than the mean weight of the seals. We can see this because the bulk of the data in the histogram for the men’s weight is to the left of the seals'. The center of the distribution of elephant seals is about 195 kg. The center of the distribution of men's weight is located below 100 kg on the number line.
</div>
<br>
18. Based on the histograms, do the weights of men or elephant seals have a larger sample standard deviation?
<a href="javascript:showhide('Q18')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q18" style="display:none;">
* Standard deviation is a measure of spread. You will note that the weights of the seals are more spread out than the weights of the men. Therefore, we conclude that the sample standard deviation of elephant seal weights is larger than the sample standard deviation of men's weights.
</div>
</div>
<br>
**Review of Parameters and Statistics**
We have now learned some statistics that can be used to estimate population parameters. For example, we use $\bar x$ to estimate the population mean $\mu$. The sample statistic $s$ estimates the true population standard deviation $\sigma$. The following table summarizes what we have done so far:
<center>
<table>
<thead>
<tr class="header">
<th></th>
<th><p>Sample Statistic</p></th>
<th><p>Population Parameter</p></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><p>Mean</p></td>
<td><p>$\bar x$</p></td>
<td><p>$\mu$</p></td>
</tr>
<tr class="even">
<td><p>Standard Deviation</p></td>
<td><p>$s$</p></td>
<td><p>$\sigma$</p></td>
</tr>
<tr class="odd">
<td><p>Variance</p></td>
<td><p>$s^2$</p></td>
<td><p>$\sigma^2$</p></td>
</tr>
<tr class="even">
<td><p>$\vdots$</p></td>
<td><p>$\vdots$</p></td>
<td><p>$\vdots$</p></td>
</tr>
</tbody>
</table>
</center>
In each case, the sample statistic estimates the population parameter. The ellipses $\vdots$ in this table hint that we will add rows in the future.
Unless otherwise specified, we will always use Excel to find the sample variance and sample mean.
<br>
**Optional Reading: Formulas for $s$ and $s^2$ (Hidden)**
<a href="javascript:showhide('s2')"><span style="font-size:8pt;">Click Here if you love Math</span></a>
<div id="s2" style="display:none;">
**Formulas**
For those who like formulas, the equations for the sample variance and sample standard deviation are given here.
**Sample variance**:
$$\displaystyle{ s^2=\frac{\sum\limits_{i=1}^n (x_i-\bar x)^2}{n-1} }$$
**Sample standard deviation**:
$$\displaystyle{ s=\sqrt{s^2}=\sqrt{\frac{\sum\limits_{i=1}^n (x_i-\bar x)^2}{n-1}} }$$
where $x_i$ is the $i^{th}$ observed data value, and $i=1, 2, \ldots, n$.
Unless otherwise specified, we will always use Excel<!--{{Course_Filter|course=B C|content=SPSS}}--> to find the sample variance and sample mean.
**Why do we divide by $n-1$?**
When computing the standard deviation or the variance, we are finding a value that describes the spread of data values. It is a measure of how far the data are from the mean. Since we do not know the true mean ($\mu$), we use the sample mean ($\bar x$) to estimate it. Typically, the data will be closer to $\bar x$ than to $\mu$, since $\bar x$ was computed using the data. To compensate for this, we divide by $n-1$ rather than $n$ when we find the "average" of the squared deviations from the mean. It turns out that subtracting 1 from $n$ inflates this average by the precise amount needed to compensate for the use of $\bar x$ as an estimate for $\mu$. As a result, the sample variance ($s^2$) is a good estimator of the population variance ($\sigma^2$).
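
The claim that dividing by $n-1$ compensates exactly can be checked on a toy case. The sketch below is in Python, for illustration only. It uses a hypothetical population of four values, chosen so that every possible sample of size 2 can be listed, and it assumes the values are drawn independently (with replacement). It averages the two candidate formulas over all samples and compares each average with the true population variance.

```python
from fractions import Fraction
from itertools import product

# Hypothetical tiny population, chosen only so every possible sample can be listed.
population = [Fraction(v) for v in (1, 2, 3, 4)]
N = len(population)
mu = sum(population) / N
sigma_sq = sum((x - mu) ** 2 for x in population) / N  # true population variance

n = 2  # sample size
# Every equally likely sample of n values drawn independently (with replacement).
samples = list(product(population, repeat=n))

def sum_sq_dev(sample):
    """Sum of squared deviations of a sample from its own mean."""
    x_bar = sum(sample) / len(sample)
    return sum((x - x_bar) ** 2 for x in sample)

# Average of each candidate estimator over all possible samples.
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)

print("population variance:      ", sigma_sq)
print("average when using n - 1: ", avg_div_n_minus_1)
print("average when using n:     ", avg_div_n)
```

In this toy case, the average of the $n-1$ version equals the population variance exactly, while the version that divides by $n$ comes out too small. Exact fractions are used so that the comparison is not blurred by rounding.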
</div>
<br>
Neither the standard deviation nor the variance is **resistant** to outliers. This means that when there are outliers in the data set, the standard deviation and the variance become artificially large. It is worth noting that the mean is also not resistant. When there are outliers, the mean will be "pulled" in the direction of the outliers.
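
To see this lack of resistance in numbers, here is a small sketch in Python (for illustration only; the course uses Excel). It adds one hypothetical extreme reading of 45.0 to the five uncomplicated-case temperatures. That value is invented for this example and is not part of the study data.

```python
import statistics

# Temperatures (degrees Centigrade) of the five uncomplicated cases.
temps = [38.1, 38.3, 38.4, 39.5, 39.7]
# Same data plus one hypothetical outlier (an invented value, not from the study).
with_outlier = temps + [45.0]

for label, data in (("original", temps), ("with outlier", with_outlier)):
    mean = statistics.mean(data)
    median = statistics.median(data)
    std_dev = statistics.stdev(data)
    print(f"{label:13s} mean = {mean:.2f}  median = {median:.2f}  s = {std_dev:.3f}")
```

With the single extra value, the standard deviation grows to several times its original size and the mean is pulled upward, while the median moves less than the mean does.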
The mean and standard deviation are used to describe the center and spread when the distribution of the data is symmetric and bell-shaped. If data are not symmetric and bell-shaped, we typically use the five-number summary (discussed below) to describe the spread, because this summary is resistant.
## Additional Tools to Describe the Data
Recall the five steps of the Statistical Process (and the mnemonic "Daniel Can Discern More Truth").
<center>
<table>
<tbody>
<tr class="odd">
<td><p>Step 1:</p></td>
<td><p>**D**aniel</p></td>
<td><p>**D**esign the study</p></td>
</tr>
<tr class="even">
<td><p>Step 2:</p></td>
<td><p>**C**an</p></td>
<td><p>**C**ollect data</p></td>
</tr>
<tr class="odd">
<td><p>Step 3:</p></td>
<td><p>**D**iscern</p></td>
<td><p>**D**escribe the data</p></td>
</tr>
<tr class="even">
<td><p>Step 4:</p></td>
<td><p>**M**ore</p></td>
<td><p>**M**ake inferences</p></td>
</tr>
<tr class="odd">
<td><p>Step 5:</p></td>
<td><p>**T**ruth</p></td>
<td><p>**T**ake action</p></td>
</tr>
</tbody>
</table>
<img src="./Images/StepsAll.png">
</center>
**Step 3** of this process is "**Describe the data**." You have already learned about the mean, median, mode, standard deviation, variance and histograms. These can be good ways to describe the data. The following information on percentiles, quartiles, 5-number summaries, and boxplots will help you learn other common ways to describe data, especially if the data are skewed or contain outliers.
<img src="./Images/Step3.png">
For symmetric, bell-shaped data, the mean and standard deviation provide a good description of the center and spread of the distribution. The mean and standard deviation are not sufficient to describe a distribution that is skewed or has outliers. An **outlier** is any observation that is very far from the others. The mean is pulled in the direction of the outlier. Also, the standard deviation is inflated by points that are very far from the mean.
You have probably had some experience with percentiles in the past, especially if you received a score on a standardized test such as the ACT. Even though percentiles are commonly used, they are often misunderstood. Before examining the wrong site/wrong patient data, let's review percentiles. Even if you think you understand percentiles, please study this section carefully.
<br>
### Percentiles and Quartiles
Imagine a very long street with houses on one side. The houses increase in value from left to right. At the left end of the street is a small cardboard box with a leaky roof. Next door is a slightly larger cardboard box that does not leak. The houses eventually get larger and more valuable. The rightmost house on the street is a huge mansion.
<div class="QuestionsHeading">Answer the following question:</div>
<div class="Questions">
19. There are 100 homes with increasing property values. How many fences are needed to separate the 100 properties?
<a href="javascript:showhide('Q19')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q19" style="display:none;">
* In order to separate the 100 homes, 99 fences are required.
</div>
</div>
<br>