Statistics

The stats exercises have been chosen to introduce and solidify statistical concepts relevant to data science. The solutions for these exercises are available in the ThinkStats repository on GitHub. Focus on understanding the statistical concepts, practicing Python programming, and interpreting the results. If you get stuck, review the solutions and recode the Python in a way that is more understandable to you.

For example, in the first exercise, the author has already written a function to compute Cohen's d. You could import it, or you could write your own code to practice Python and develop a deeper understanding of the concept.

Think Stats uses more complex Python than the Python tutorials and introductions to Python concepts. That is intentional: it prepares you for the bootcamp.

One of the skills to learn here is understanding other people’s code. The author is quite experienced, so the book's code is a good place to learn how functions and imports work.

3. Instructions for Cloning the Repo

To use the code referenced in the book, follow the step-by-step instructions below.

Step 1. Create a directory on your computer where you will do the prework. Below is an example:

(Mac):      /Users/yourname/ds/metis/metisgh/prework  
(Windows):  C:/ds/metis/metisgh/prework

Step 2. cd into the prework directory. Use git to clone the ThinkStats2 repo to your computer.

$ git clone https://github.com/AllenDowney/ThinkStats2.git

Step 3. Put your IPython notebook or Python code files in this directory, so that they can import the modules they depend on:

(Mac):     /Users/yourname/ds/metis/metisgh/prework/ThinkStats2/code  
(Windows):  C:/ds/metis/metisgh/prework/ThinkStats2/code
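
If you would rather keep a notebook somewhere else, one possible workaround is to add the code directory to Python's module search path. This is only a sketch and not part of the instructions above: the helper name `add_code_dir` is made up for illustration, and the path is the Mac example from Step 3, which you would adjust to your own setup.

```python
import os
import sys


def add_code_dir(code_dir):
    """Put code_dir at the front of sys.path if it exists.

    Returns True when the directory exists (and is now on the path),
    False when it does not exist.
    """
    code_dir = os.path.abspath(os.path.expanduser(code_dir))
    if not os.path.isdir(code_dir):
        return False
    if code_dir not in sys.path:
        sys.path.insert(0, code_dir)
    return True


# Hypothetical location; adjust to wherever you cloned the repo.
found = add_code_dir("~/ds/metis/metisgh/prework/ThinkStats2/code")
print("code directory found:", found)
```

If the function reports `False`, check the path before trying to import anything from the repo.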

4. Required Exercises

Include your Python code, results and explanation (where applicable).

Q1. Think Stats Chapter 2 Exercise 4 (effect size of Cohen's d)

Cohen's d is an example of effect size. Other examples of effect size are the correlation between two variables, mean difference, regression coefficients, and standardized test statistics such as t, Z, and F. In this example, you will compute Cohen's d to quantify (or measure) the difference between two groups of data.

You will see effect size again and again in the results of algorithms used in data science. For instance, in the bootcamp, when you run a regression analysis, you will recognize the t-statistic as an example of effect size.
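
If you choose to write your own version rather than import the author's, a minimal sketch might look like the following. It assumes one common convention: the difference in means divided by a pooled standard deviation, with each group's population variance weighted by its size. Other conventions exist (for example, using n - 1 in the denominators), so compare your result with the book's function. The function name and the sample numbers are made up for illustration.

```python
import math
from statistics import mean, pvariance


def cohen_d(group1, group2):
    """Difference in means divided by a pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    diff = mean(group1) - mean(group2)
    # Pool the two variances, weighting each by its group size.
    pooled_var = (n1 * pvariance(group1) + n2 * pvariance(group2)) / (n1 + n2)
    return diff / math.sqrt(pooled_var)


# Made-up numbers, purely for illustration.
a = [2.0, 4.0, 6.0]
b = [1.0, 3.0, 5.0]
print(cohen_d(a, b))
```

Because the result is expressed in units of standard deviations, it can be compared across datasets measured on different scales.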

Q2. Think Stats Chapter 3 Exercise 1 (actual vs. biased)

This problem presents a robust example of actual vs. biased data. As a data scientist, you will need to examine not only the data that is available, but also the data that may be missing but highly relevant. You will see how the absence of this relevant data biases a dataset, its distribution, and ultimately its statistical interpretation.
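
One way this kind of bias can arise is size-proportional sampling: if you ask children how many children are in their household, households with no children never answer, and a household with x children is counted x times. The sketch below assumes that form of bias and represents a PMF as a plain dictionary. The function names and the distribution are hypothetical, not taken from the book's code or data.

```python
def bias_pmf(pmf):
    """Return the distribution seen when each value x is oversampled by a factor of x."""
    weighted = {x: p * x for x, p in pmf.items()}
    total = sum(weighted.values())
    return {x: w / total for x, w in weighted.items()}


def pmf_mean(pmf):
    """Mean of a PMF given as a {value: probability} dictionary."""
    return sum(x * p for x, p in pmf.items())


# Hypothetical distribution of children per household (not real survey data).
actual = {0: 0.5, 1: 0.2, 2: 0.2, 3: 0.1}
biased = bias_pmf(actual)
print(pmf_mean(actual), pmf_mean(biased))
```

The biased mean comes out larger than the actual mean, because the zero-child households drop out and larger households are overrepresented.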

Q3. Think Stats Chapter 4 Exercise 2 (random distribution)

This question asks you to examine the function that produces random numbers. Is it really random? A good way to test that is to examine the PMF and CDF of the list of random numbers and visualize the distribution. If you're not sure what a PMF is, read more about it in Chapter 3.
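
As a rough, non-graphical version of that check, the sketch below builds an empirical CDF and measures how far it strays from the straight line expected for a uniform distribution on [0, 1]. It assumes the generator under test is Python's standard `random` module; the function names, seed, and sample size are arbitrary choices for illustration.

```python
import random


def empirical_cdf(values):
    """Return sorted values and the fraction of the sample at or below each one."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]


def max_gap_from_uniform(values):
    """Largest distance between the empirical CDF and the uniform CDF on [0, 1]."""
    xs, ps = empirical_cdf(values)
    n = len(xs)
    # Check just after and just before each step of the empirical CDF.
    return max(max(p - x, x - (p - 1 / n)) for x, p in zip(xs, ps))


rng = random.Random(17)  # fixed seed so the run is repeatable
sample = [rng.random() for _ in range(1000)]
print(max_gap_from_uniform(sample))
```

A small gap is consistent with a uniform distribution. Plotting `xs` against `ps` from `empirical_cdf` gives the visual version of the same comparison.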

Q4. Think Stats Chapter 5 Exercise 1 (normal distribution of blue men)

This is a classic example of hypothesis testing using the normal distribution. The effect size used here is the Z-statistic.
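
The calculation behind this kind of question can be sketched with the standard library alone. The code below converts a value to a Z-statistic and evaluates the normal CDF through the error function. The mean, standard deviation, and limits are placeholders, not the numbers from the exercise; substitute the values the book gives.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative probability of a normal distribution, via the error function."""
    z = (x - mu) / sigma  # the Z-statistic: distance from the mean in standard deviations
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fraction_between(low, high, mu, sigma):
    """Share of the population with values in [low, high]."""
    return normal_cdf(high, mu, sigma) - normal_cdf(low, mu, sigma)


# Placeholder parameters; substitute the mean, standard deviation,
# and height limits given in the exercise.
mu, sigma = 170.0, 10.0
print(fraction_between(mu - sigma, mu + sigma, mu, sigma))
```

With these placeholder limits set one standard deviation either side of the mean, the result is the familiar figure of roughly 68%.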

Q5. Bayesian (Elvis Presley twin)

Bayes' Theorem is an important tool for quantifying what we really know, given the evidence we have. It helps incorporate conditional probabilities into our conclusions.

Elvis Presley had a twin brother who died at birth. What is the probability that Elvis was an identical twin? Assume we observe the following probabilities in the population: fraternal twin is 1/125 and identical twin is 1/300.

Answer: p = 5/11

p(fraternal and two boys) = 1/125 * 1/2 * 1/2 = 1/500
p(identical and two boys) = 1/300 * 1/2 = 1/600
p(twins and two boys) = 1/500 + 1/600
p(identical and two boys | twins and two boys) = p(identical and two boys) * p(twins and two boys | identical and two boys) / p(twins and two boys)
p(identical and two boys | twins and two boys) = 1/600 * 1 / (1/500 + 1/600)
p = 5/11
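
The arithmetic above can be checked with exact fractions. This sketch uses only the probabilities stated in the problem; the variable names are chosen here for readability.

```python
from fractions import Fraction

# Priors, as given in the problem statement.
p_fraternal = Fraction(1, 125)
p_identical = Fraction(1, 300)

# Likelihood of two boys under each hypothesis.
like_fraternal = Fraction(1, 2) * Fraction(1, 2)  # each twin's sex is independent
like_identical = Fraction(1, 2)                   # both twins share one sex

joint_fraternal = p_fraternal * like_fraternal    # 1/500
joint_identical = p_identical * like_identical    # 1/600

posterior_identical = joint_identical / (joint_fraternal + joint_identical)
print(posterior_identical)  # 5/11
```

Using `Fraction` avoids floating-point rounding, so the result matches the hand calculation exactly.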

Q6. Bayesian & Frequentist Comparison

How do frequentist and Bayesian statistics compare?

In Bayesian statistics, a probability expresses our degree of knowledge about an event. An analysis begins with an assumed distribution (the "prior"), which is updated after data is collected to arrive at a new distribution (the "posterior"). Parameters are treated as unknown quantities and described in terms of probabilities. Frequentist statistics instead treats a probability as the long-run frequency of an outcome over repeated experiments. It does not assign probabilities to parameters or hypotheses, nor does it consider prior information. Parameters are regarded as fixed, the data as a repeatable random sample, and results are expressed as a confidence interval.
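
To make the contrast concrete, here is a small sketch that estimates the probability of heads for a coin in both styles. It rests on assumptions that are not part of the answer above: a normal-approximation 95% confidence interval on the frequentist side, and a Beta prior (uniform by default) on the Bayesian side. The function names and the data are made up.

```python
import math


def frequentist_estimate(heads, flips, z=1.96):
    """Point estimate and normal-approximation 95% confidence interval."""
    p_hat = heads / flips
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / flips)
    return p_hat, (p_hat - half_width, p_hat + half_width)


def bayesian_update(heads, flips, prior_a=1, prior_b=1):
    """Update a Beta(prior_a, prior_b) prior; return the posterior parameters and mean."""
    post_a = prior_a + heads
    post_b = prior_b + (flips - heads)
    return (post_a, post_b), post_a / (post_a + post_b)


# Made-up data: 7 heads in 10 flips.
print(frequentist_estimate(7, 10))
print(bayesian_update(7, 10))
```

The frequentist result is a fixed-parameter estimate with an interval derived from the sampling procedure, while the Bayesian result is a full distribution over the parameter that shifts if you change the prior.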

5. Optional Exercises

The following exercises are optional, but we highly encourage you to complete them if you have the time.

Q7. Think Stats Chapter 7 Exercise 1 (correlation of weight vs. age)

In this exercise, you will compute the effect size of correlation. Correlation measures the relationship of two variables, and data science is about exploring relationships in data.
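
As a starting point, Pearson's correlation can be written in a few lines without any libraries. This is a sketch assuming Pearson's coefficient is the measure you want. The exercise may also call for other measures, and the ages and weights below are invented for illustration.

```python
import math


def pearson_corr(xs, ys):
    """Pearson's correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy)) / n
    std_x = math.sqrt(sum(a * a for a in dx) / n)
    std_y = math.sqrt(sum(b * b for b in dy) / n)
    return cov / (std_x * std_y)


# Made-up ages and weights, purely for illustration.
ages = [20, 25, 30, 35, 40]
weights = [6.9, 7.1, 7.4, 7.2, 7.6]
print(pearson_corr(ages, weights))
```

The result always lies between -1 and 1. It captures only linear relationships, so a scatter plot is worth checking alongside the number.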

Q8. Think Stats Chapter 8 Exercise 2 (sampling distribution)

In the theoretical world, all data related to an experiment or a scientific problem would be available. In the real world, some subset of that data is available. This exercise asks you to take samples from an exponential distribution and examine how the standard error and confidence intervals vary with the sample size.
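
A simulation along these lines could be sketched as follows. It assumes the parameter is estimated as one over the sample mean, measures standard error as the root mean squared error around the true value, and reads a 90% interval off the sorted estimates. The rate `lam=2.0`, the 1000 iterations, and the sample sizes are illustrative defaults rather than values prescribed here.

```python
import math
import random


def simulate_estimates(n, lam=2.0, iters=1000, seed=0):
    """Draw `iters` samples of size n from an exponential distribution and
    return the estimate of lam (1 / sample mean) from each one."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(iters):
        sample = [rng.expovariate(lam) for _ in range(n)]
        estimates.append(n / sum(sample))
    return estimates


def standard_error(estimates, true_value):
    """Root mean squared error of the estimates around the true value."""
    return math.sqrt(sum((e - true_value) ** 2 for e in estimates) / len(estimates))


def confidence_interval(estimates, percent=90):
    """Central interval of the sampling distribution, read off the sorted estimates."""
    xs = sorted(estimates)
    tail = (100 - percent) / 200
    low = xs[int(tail * len(xs))]
    high = xs[int((1 - tail) * len(xs)) - 1]
    return low, high


for n in (10, 100, 1000):
    est = simulate_estimates(n)
    print(n, standard_error(est, 2.0), confidence_interval(est))
```

Running it for increasing `n` should show the standard error shrinking and the interval narrowing around the true value.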

Q9. Think Stats Chapter 6 Exercise 1 (skewness of household income)

Q10. Think Stats Chapter 8 Exercise 3 (scoring)

Q11. Think Stats Chapter 9 Exercise 2 (resampling)

6. Recommended Reading

Read Allen Downey's Think Bayes book. It is available online for free, or you can buy a paper copy if you would like.

7. More Resources

Some people enjoy video content such as Khan Academy's Probability and Statistics or the much longer and more in-depth Harvard Statistics 110. You might also be interested in the book Statistics Done Wrong or a very short overview from School of Data.
