Creating this issue thread so @gpleiss and I can take notes about the class as we go (todo items, things that work well, things that don't, things we messed up, etc).
Module 0
Setup issues are kind of rough (especially for windows). I wonder if it's worth setting up a docker image and having students install Docker Desktop if they're struggling to get their own machine set up?
The R intro is a bit scattered in terms of level: some slides are super basic (what's a vector?) and some are much more advanced (function dispatch) within the same lecture. I wonder if we can replace this first lecture with an extended lab/demo that walks students through an example of each concept, either something students start in lecture and finish at home, or a pre-lab they complete before class alongside a pre-recorded tutorial video series.
Same sort of comment re: the Git intro. Many students on Slack are asking questions that suggest they have no idea what's going on and are just going through the motions; "I clicked revert and the orange boxes didn't go away" is the level of question I'm seeing. I don't have a good idea what to do about this though... it's really hard to introduce Git even with pre-material.
LM lecture
Should we talk about training/testing data splits? (i.e. more introduction to the basic mechanics of predictive modelling; see the sketch after this list)
Create more parallelism with first lecture of Module 1
Algorithm/risk vs model w/ OLS vs MLE
Training fit vs test fit for evaluating model
Emphasize distinction between inference/prediction
Emphasize distinction between training fit versus predictive loss (differentiating with LM content in 306)
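As a starting point for the predictive-modelling mechanics, something like the following toy sketch (made-up data, not the course's actual demo) would let us show the train/test split and the training-fit vs test-fit distinction in one slide:

```r
# Hypothetical demo: train/test split, then training MSE vs. test MSE for lm()
set.seed(306)
n <- 200
dat <- data.frame(x = runif(n, -2, 2))
dat$y <- 1 + 2 * dat$x + rnorm(n)

train_idx <- sample(n, size = floor(0.8 * n))   # 80/20 split
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

fit <- lm(y ~ x, data = train)
mse <- function(y, yhat) mean((y - yhat)^2)
mse(train$y, predict(fit, newdata = train))   # training fit
mse(test$y,  predict(fit, newdata = test))    # test fit (predictive loss)
```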
Module 1
Regression function lecture
Going from statistical models to risk/losses feels a bit like a jump
We should introduce risk as a way of assessing predictive models
Proofs: could we have people "turn to neighbours" to work through proofs together
The proof that the regression function minimizes MSE is a lot - could we get students to work through it in pairs? (A sketch of the key step is below.)
Derivation of estimation risk + prediction risk
"Anything strange" slide is confusing
Slides should mention that the regression function isn't available to us in practice (it depends on the unknown population distribution) - we have to estimate it
Terminology: "estimation" in "estimation risk" should be introduced
Bias/variance tradeoff slides
These slides seem to go back and forth between regression and simpler models without covariates
I suggest we start without covariates, cover bias/variance/irreducible error and then risk (i.e. restructure this lecture and the next one), and only then move on to regression settings with covariates
Reason: students are getting confused about what we're averaging over (in regression we sometimes think of risk as a function of the covariates, and sometimes we average over the population of covariates)
The first of the bias/variance slides (the one on bias) should reintroduce a mathematical definition of variance as well (see the decomposition after this list)
Opportunity for students to work through bias/variance calculations in pairs
Add a plot of bias/variance as a function of model complexity?
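For the no-covariates version, the decomposition I have in mind is the standard one (writing it out so we agree on notation; $\hat\theta$ is an estimator of $\theta$, and the new observation $Y$ has mean $\theta$, variance $\sigma^2$, and is independent of the training data):

$$
E\big[(Y - \hat\theta)^2\big] = \underbrace{\sigma^2}_{\text{irreducible error}} + \underbrace{\big(\theta - E[\hat\theta]\big)^2}_{\text{bias}^2} + \underbrace{\mathrm{Var}(\hat\theta)}_{\text{variance}}.
$$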
Risk estimation
These slides might be inconsistent with the previous slides about how risk is defined. In the previous slides, risk is the expected loss $R = E[(Y - \hat f(X))^2]$ including all sources of randomness (the new point $(X, Y)$ and the training data via $\hat f$), so $R$ is just a fixed number. These new slides seem to suggest risk is a random quantity depending on $\hat f$ (i.e. on the training data). In other words, we're now conditioning on the training data -- we seem to be defining $R = E[(Y - \hat f(X))^2 \mid Y_1, \dots, Y_N]$, where $\hat f$ is fixed inside the conditional expectation and $R$ is a random quantity.
After confirming with D -- risk is meant to be an expectation over everything (randomness in the new test data $(X, Y)$ and in the training data $Y_1, \dots, Y_N$), so it's just a function of the population distribution and the chosen procedure, which maps (training data) $\times$ (new test point) $\to$ (prediction).
Information criteria
Significantly shorten (maybe only derive GCV, talk about the other 3 briefly)
Compare/contrast their terms (see the side-by-side forms after this list)
Add an activity where students predict how good their risk estimates are on some real-world models
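For the compare/contrast, a side-by-side of the usual textbook forms might be enough, assuming the four criteria are Mallows' $C_p$, AIC, BIC, and GCV (adjust if the slides use a different set or scaling):

$$
C_p = \tfrac{1}{n}\big(\mathrm{RSS} + 2 d \hat\sigma^2\big), \qquad
\mathrm{AIC} = -2\log L + 2d, \qquad
\mathrm{BIC} = -2\log L + d\log n,
$$
$$
\mathrm{GCV} = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - \hat y_i}{1 - \mathrm{tr}(S)/n}\right)^2.
$$

All of them have the same shape (training fit plus a complexity penalty); GCV is the one worth deriving, since it replaces the explicit parameter count with the effective degrees of freedom $\mathrm{tr}(S)$.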
Model/variable selection
Lecture could be far more interactive, opportunity for more student participation
Module 2
The longer I spend teaching this, the more I question why we need estimates of risk. I think we should only need estimates of test error.
Risk accounts for randomness in training data, and is useful when designing a new procedure. But when designing a new procedure, you don't need to estimate anything; you can draw from whatever population(s) you're designing based on.
Test error is useful when you've collected a training dataset, done some fitting, and now want to know how well it generalizes. You do need an estimate of test error using only your training set, since that's the only information you have about your population. (Notation for the distinction is sketched below.)
Waiting for discussion with D, but my suspicion is we should go back to Module 1 and replace all risk estimation with test error estimation (and of course assessing the risk of various test error estimates)
Discussed with D. Need to read chapter 7 of ESL in depth. It seems that ESL claims (without proof) that we can't use these techniques to estimate test error, even though we can use them to estimate risk. Entirely possible, though I'm still a bit skeptical... It shouldn't be too hard to do some quick napkin math at some point. If it turns out to be true, that's worth a slide or two in the risk/risk estimation slides ("why aren't we trying to estimate test error?").
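To pin down notation for this discussion (my reading of the ESL ch. 7 distinction): with training set $\mathcal{T} = \{(X_i, Y_i)\}_{i=1}^N$ and an independent test point $(X, Y)$,

$$
\mathrm{Err}_{\mathcal{T}} = E\big[\ell(Y, \hat f(X)) \mid \mathcal{T}\big] \quad \text{(test error, conditional on the training set; a random quantity)},
$$
$$
\mathrm{Err} = E\big[\mathrm{Err}_{\mathcal{T}}\big] = E\big[\ell(Y, \hat f(X))\big] \quad \text{(risk / expected test error; a fixed number)}.
$$

ESL's claim, as I understand it, is that cross-validation and the information criteria track $\mathrm{Err}$ rather than $\mathrm{Err}_{\mathcal{T}}$.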
We maybe also want to wait to discuss bias/variance until the end of Module 1
The material is a bit abstract, and it may be better to get "hands dirty" with risk estimates and actual models before the more theoretical bias/variance discussion
It's nice to discuss this idea right before leading into ridge regression
Ridge regression / lasso
Replace "geometry of constrained regression" 2D example with a 1D example
Bias/variance of ridge regression is a good option for students to do some math in lecture
Making interactive 3D plots of the constrained optimization plots could be good for helping visualization
Coordinate descent information may be confusing
Warnings slide is too fast (but very important)
Honestly, the whole thing probably went too fast. This stuff could probably be two lectures (if we add activities). On the schedule it's just one.
Big +1 to more intuition on constrained optimization via interactive visualizations. In my office hours, it seems some people didn't quite "get" the idea of constrained opt
Also people didn't quite get why curvy-diamonds were bad for optimization (you can get to a point where the best option is to actually "go uphill" to get to the optimum)
We might want to do some interactive viz for coordinate descent / gradient descent too.
On the slide where we say "intuition for why lasso selects variables", just replace that with "here's how coordinate descent works", and then one point on the slide will be "hey, check it out: it squishes small things towards 0" (see the sketch below)
It may even be worth deriving the coordinate descent update, if we split this into 2 lectures
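A minimal toy implementation of coordinate descent for the lasso (mine, not from the course materials) that makes the soft-thresholding / "squish small things to 0" behaviour explicit; the objective assumed is $\tfrac{1}{2n}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$:

```r
# Soft-thresholding operator: shrinks towards 0, snaps small values to exactly 0
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Cyclic coordinate descent for the lasso (toy version, no convergence check)
lasso_cd <- function(X, y, lambda, n_iter = 100) {
  p <- ncol(X); n <- nrow(X)
  beta <- rep(0, p)
  for (iter in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]   # partial residual
      rho <- sum(X[, j] * r_j) / n                    # univariate OLS direction
      beta[j] <- soft_threshold(rho, lambda) / (sum(X[, j]^2) / n)
    }
  }
  beta
}
```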
Idea: maybe we should move code-heavy slides about "how to do things in R" into separate slide decks that students can peruse on their own, but not things we go through in class.
Smoothing slides
We never put up the equation for a smoothing predictor!
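For the record, the equation I'd put up is the generic linear smoother form:

$$
\hat f(x) = \sum_{i=1}^n w_i(x)\, y_i, \qquad \text{or in matrix form } \hat{\mathbf{y}} = S\,\mathbf{y},
$$

where the weights $w_i(x)$ (equivalently the smoother matrix $S$) depend on the $x$'s and the tuning parameter, but not on $\mathbf{y}$.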
Module 3
I thought I had resolved this risk vs test error thing, but now in the classification slides it looks like we are minimizing test error ($\min_g E[\ell(Y, g(X))]$)
Gradient descent
The parabolas demonstrating gradient descent confused some students. They did not understand that the parabolas were not connected to one another.
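Assuming the parabolas on those slides are the usual local quadratic surrogates (my assumption; worth stating explicitly in lecture), one fix is to write out what each parabola is: at iterate $x_k$ with step size $\eta$, gradient descent minimizes

$$
q_k(x) = f(x_k) + f'(x_k)(x - x_k) + \frac{1}{2\eta}(x - x_k)^2,
$$

whose minimizer is exactly $x_{k+1} = x_k - \eta f'(x_k)$. A fresh parabola is drawn at every iterate, which is why they are not connected to one another.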
LDA/QDA
Probably not worth including as material.
We should introduce Naive Bayes as our one generative method
Other metrics
Trevor spending a lot of time on the blackboard talking about the TPR/FPR and the ROC curve was very helpful.
Tying the different thresholds into a notion of risk would be very helpful (a sketch is below)
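A rough sketch of how that could be demoed (toy code of mine, not course material): sweep thresholds to trace out the ROC curve, then show that choosing a threshold amounts to choosing the point that minimizes a cost-weighted empirical risk.

```r
# TPR/FPR at a grid of thresholds for a score-based classifier (labels in {0, 1})
roc_points <- function(scores, labels, thresholds = sort(unique(scores))) {
  t(sapply(thresholds, function(thr) {
    pred <- as.integer(scores >= thr)
    c(threshold = thr,
      TPR = mean(pred[labels == 1] == 1),   # true positive rate
      FPR = mean(pred[labels == 0] == 1))   # false positive rate
  }))
}

# Cost-weighted empirical risk at a single threshold
threshold_risk <- function(scores, labels, thr, c_fp = 1, c_fn = 5) {
  pred <- as.integer(scores >= thr)
  mean(c_fp * (pred == 1 & labels == 0) + c_fn * (pred == 0 & labels == 1))
}
```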
Module 4
Bootstrap
The slide introducing the bootstrap algorithm has a lot of text; a short code sketch (like the one below) could replace some of it
We should consider switching bagging and bootstrap.
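One way to cut the text down is to show the algorithm as a few lines of code. A toy sketch of mine (bootstrapping the standard error of a sample median; not the course's example):

```r
# Nonparametric bootstrap: resample with replacement, recompute the statistic
boot_stat <- function(x, stat, B = 1000) {
  replicate(B, stat(sample(x, size = length(x), replace = TRUE)))
}

set.seed(306)
x <- rexp(100, rate = 1)
boot_medians <- boot_stat(x, median, B = 2000)
sd(boot_medians)   # bootstrap estimate of the standard error of the median
```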
Bagging
We should have a better demo of bagging (one possible sketch is below)
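One possible demo (a sketch of mine, assuming we're happy pulling in rpart; not the current course code): bag regression trees by hand on a 1-D toy function, so students can see the individual wiggly trees vs. the smoother bagged average.

```r
library(rpart)

set.seed(306)
n <- 200
dat <- data.frame(x = runif(n, 0, 1))
dat$y <- sin(2 * pi * dat$x) + rnorm(n, sd = 0.3)
grid <- data.frame(x = seq(0, 1, length.out = 200))

B <- 100
preds <- sapply(seq_len(B), function(b) {
  idx <- sample(n, replace = TRUE)       # bootstrap sample of the rows
  tree <- rpart(y ~ x, data = dat[idx, ])
  predict(tree, newdata = grid)          # one tree's predictions on the grid
})
bagged <- rowMeans(preds)                # bagged (averaged) predictions
```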
Boosting
Trevor's bump function example was quite easy to follow - it should be added to the material
Trevor's explanation of functional derivatives was quite intuitive
When talking about boosting, we talk about minimizing various losses, which we've only briefly touched on in the context of MLE. We may want to consider covering empirical risk minimization earlier in the course (a least-squares boosting sketch is below).
We should emphasize how "efficient" boosting is - the one slide that compares boosting to a single decision tree rushes through this point
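A possible companion sketch for the loss/ERM discussion (mine, assuming rpart again; not Trevor's demo): least-squares boosting with stumps, where the negative functional gradient of squared loss is literally the residual that each new stump is fit to.

```r
library(rpart)

# Gradient boosting for squared-error loss with depth-1 trees (stumps)
boost_ls <- function(dat, M = 100, nu = 0.1) {
  f_hat <- rep(mean(dat$y), nrow(dat))        # start from the constant fit
  trees <- vector("list", M)
  for (m in seq_len(M)) {
    resid_m <- dat$y - f_hat                  # negative gradient of squared loss
    step_dat <- data.frame(x = dat$x, r = resid_m)
    trees[[m]] <- rpart(r ~ x, data = step_dat,
                        control = rpart.control(maxdepth = 1))
    f_hat <- f_hat + nu * predict(trees[[m]], newdata = step_dat)
  }
  list(init = mean(dat$y), trees = trees, nu = nu, fitted = f_hat)
}
```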
Module 5