Introduction to Theory of Machine Learning


The holy grail of machine learning is finding an in-sample estimate of the out-of-sample error. A low out-of-sample error is achieved by answering two questions:

1 - Can we make sure that Ein (in-sample error) is small enough?

2 - Can we make sure that Eout (out-of-sample error) is close enough to Ein?

Armed with a more complex model, we are all but assured of a lower Ein. But a complex model might also lead to a larger Eout if the number of records in the training set is not sufficient. This note develops the logic to pin down this relationship concretely, drawing heavily on the concepts and notes from the Learning from Data course.

##(a) Feasibility of Learning:

The Hoeffding inequality relates the in-sample and out-of-sample error in probabilistic terms, where ε is the error tolerance, N is the number of samples, and h is the hypothesis.

P[ |Ein(h) − Eout(h)| > ε ] ≤ 2 e^(−2ε²N)

However, when you pick the best-fitting in-sample hypothesis, Hoeffding no longer applies, because you picked g from the whole hypothesis set H. Think of p-values: do the experiment 20 times and you will eventually 'get it right'!

In the most conservative case, the bad events on the RHS do not overlap at all, so we sum them up (the union bound) to cover whichever g we pick. We will establish later on that relaxing this worst-case assumption is what makes learning feasible.

P[ |Ein(g) − Eout(g)| > ε ] ≤ 2M e^(−2ε²N)

Complexity of the hypothesis set means a larger M. With a more complex model we will fit better in-sample, but we need more examples (a higher N) to negate the large M and satisfy the inequality. As it stands, the inequality does not offer much hope!
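
To see why picking the best of M hypotheses matters, here is a minimal simulation (numpy only; all constants are illustrative) in the spirit of the course's coin-flip experiment. Each coin plays the role of a hypothesis with Eout = 0.5; selecting the 'best' of M coins makes the single-hypothesis Hoeffding bound fail badly:

```python
# A minimal simulation of why selecting the best of M hypotheses breaks
# the single-hypothesis Hoeffding bound.  Each "coin" plays the role of
# a hypothesis; mu = 0.5 is its Eout.
import numpy as np

rng = np.random.default_rng(0)
M, N, trials = 1000, 10, 2000          # M hypotheses, N samples each
eps = 0.3

deviates_first, deviates_min = 0, 0
for _ in range(trials):
    flips = rng.integers(0, 2, size=(M, N))      # 0/1 coin flips
    nu = flips.mean(axis=1)                      # in-sample "Ein" per coin
    deviates_first += abs(nu[0] - 0.5) > eps     # a fixed hypothesis
    deviates_min += abs(nu.min() - 0.5) > eps    # the selected "best" one

print("P[|Ein - Eout| > eps], fixed h :", deviates_first / trials)
print("P[|Ein - Eout| > eps], chosen g:", deviates_min / trials)
print("Hoeffding bound for a single h :", 2 * np.exp(-2 * eps**2 * N))
```

The fixed coin respects the single-hypothesis bound; the selected one deviates almost every time, which is exactly the gap that the factor M papers over.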

##(b) How to measure complexity of a model:

The growth function m_H(N) counts the maximum number of dichotomies a given model can produce on any N points, and it satisfies:

m_H(N) ≤ 2^N

Why? Because N points admit at most 2^N dichotomies. Hence we can measure the complexity of a model (hypothesis set) by its growth function: the more complex the model, the more points the hypothesis set can 'shatter'.

Let's play a game. I give you a budget of N points and you get to choose where to place them. We can show that a 2D perceptron can shatter up to 3 points. The break point is defined as the smallest N at which the model fails to produce all 2^N dichotomies, no matter where the points are placed.

We can show that the break point for the 2D perceptron is hence 4.

Possible Homework: Try break points with different models 
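
As a rough sketch of that homework (numpy only; random point placement and random hyperplane sampling are simplifications, so the count is an approximation from below), we can count the dichotomies a 2D perceptron realizes on N points and watch it fall short of 2^N at N = 4:

```python
# Estimate how many dichotomies a 2D perceptron realizes on N points by
# sampling many random hyperplanes, then compare against 2^N.
import numpy as np

rng = np.random.default_rng(1)

def dichotomies(points, num_planes=100_000):
    X = np.hstack([np.ones((len(points), 1)), points])   # add bias term
    W = rng.standard_normal((num_planes, 3))             # random (b, w1, w2)
    signs = np.sign(X @ W.T) > 0                         # labelings produced
    return len({tuple(col) for col in signs.T})          # distinct dichotomies

for N in (2, 3, 4, 5):
    pts = rng.standard_normal((N, 2))     # you get to choose the placement
    print(N, dichotomies(pts), "of", 2**N)
```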

Once you show that the growth function takes a polynomial form, we are set: learning with the hypothesis set becomes feasible given enough data points N.

Why? Because in the Hoeffding inequality above, M stood for the complexity of the model. Replace M with the growth function of a hypothesis set that has a finite break point. With a polynomial in place of M, the negative exponential eventually brings the RHS down for large enough N.

m_H(N) ≤ Σ_{i=0}^{k−1} C(N, i), a polynomial in N of degree k − 1 (k being the break point)

The VC dimension d_vc of a model, in simple terms, is the largest number of points it can shatter (one less than the break point), and it hence measures the complexity of the model.
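
A quick standard-library check of how the break-point bound above compares to 2^N. With k = 4, as for the 2D perceptron, the polynomial falls behind the exponential almost immediately:

```python
# Compare the polynomial growth-function bound (break point k) to 2^N.
from math import comb

k = 4
for N in (5, 10, 20, 50):
    bound = sum(comb(N, i) for i in range(k))   # sum_{i<k} C(N, i)
    print(f"N={N:3d}  bound={bound:10d}  2^N={2**N}")
```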

##(c) Theory of Generalization and VC Dimension

To summarize, to be able to generalize and show that learning from data is feasible, we need to have two things in place:

  1. Proof that m_H(N) is polynomial in N.

Once you have this, the negative exponential in the Hoeffding inequality will eventually triumph.

  2. Proof that m_H(N) can replace M:

Replacing M with m_H(N) results in the VC inequality below, arguably the most important inequality in machine learning. It tightens the loose union bound into something usable.

P[ |Ein(g) − Eout(g)| > ε ] ≤ 4 m_H(2N) e^(−ε²N/8)

If there is a break point for the hypothesis set, you can find a g such that Ein tracks Eout well, given a large enough sample size.

Note that what we talked about is independent of the 'target function'. This is important, because we do not know, and will not know, the 'target function'.

d_vc(H) = the largest N for which m_H(N) = 2^N; when d_vc is finite, m_H(N) ≤ N^(d_vc) + 1

If d_vc is finite, g ∈ H will generalize. Note that this is a probabilistic statement, in the form of the VC inequality.

In general, think of d_vc as the effective number of parameters you can dial to reach all the dichotomies.

We can transform the VC inequality into a generalization bound:

Eout(g) ≤ Ein(g) + Ω(N, H, δ)

where Ω captures how well the model generalizes:

Ω(N, H, δ) = √( (8/N) ln( 4 m_H(2N) / δ ) )

where N is the number of samples, H is the hypothesis set, and δ is the probability that the VC bound fails.

We can conclude the following:

  • A more complex model is better for Ein (this is always true) but bad for Ω and hence for Eout.
  • If you want the probability of error to be lower, you will need a higher N.
  • A larger N narrows the gap between Ein and Eout.
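
A small sketch of the bound in action (assuming the common polynomial bound m_H(N) ≤ N^d_vc + 1 in place of the exact growth function, and an illustrative d_vc = 3): watching Ω shrink as N grows makes the three conclusions above concrete.

```python
# Evaluate the generalization penalty Omega(N, H, delta) using the
# polynomial bound m_H(2N) <= (2N)^d_vc + 1 as a stand-in for the exact
# growth function.
import numpy as np

def omega(N, dvc, delta=0.05):
    mH_2N = (2 * N) ** dvc + 1          # polynomial growth-function bound
    return np.sqrt(8.0 / N * np.log(4.0 * mH_2N / delta))

for N in (100, 1_000, 10_000, 100_000):
    print(f"N={N:6d}  Omega={omega(N, dvc=3):.3f}")
```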

##(d) Bias-Variance Trade-off

Bias-variance is a stand-alone theory, but it will help us drive a similar conclusion about learning as the one above.

A small Eout --> A good approximation of f out-of-sample. This is primarily what we are interested in.

More complex H -> better chance of approximating f, the target function: many more hypotheses to choose from, some of them closer to the target. But at the same time we need more examples to generalize, as the VC bound dictates!

Less complex H -> better chance of generalizing out of sample. The flip side is approximation: the target function may not be in (or near) the hypothesis set.

The VC analysis approach in the last section gives the generalization bound:

Eout(g) ≤ Ein(g) + Ω(N, H, δ)

On the other hand, bias-variance trade-off decomposes Eout into ... well ... bias and variance!

Bias defines how well H can approximate f in the best-case scenario. Variance measures how well we can zoom in on a good h ∈ H.

The two components can be thought of as the 'approximation' and 'generalization' components. Let's formally define the decomposition:

E_D[ Eout(g^(D)) ] = bias + var, where
bias = E_x[ (ḡ(x) − f(x))² ], var = E_x[ E_D[ (g^(D)(x) − ḡ(x))² ] ], and ḡ(x) = E_D[ g^(D)(x) ]

In plain terms, how far the hypothesis you produce from a particular dataset falls from the actual target function breaks into two parts:

  • Variance: How far the hypothesis produced from one dataset differs from the average hypothesis ḡ(x). Your algorithm navigates the dataset to come up with a hypothesis h; another set of data points will result in a different hypothesis h.

  • Bias: How far is the average hypothesis ḡ(x), roughly the best H can do, from f? This is a conceptual tool and does not need to be in the hypothesis set. Note that the bias component does not depend on the dataset.

Trade-off looks like this:

Bias Variance Trade-off

When you go from a small H to a large H, the bias always decreases, because your hypotheses gain the degrees of freedom to mirror the target function f more closely. However, the variance starts increasing at some point: H becomes so complex that the learning algorithm does not have enough data points to navigate it successfully, and it ends up with h's that are far apart from each other. In other terms, you might run into an issue where:

H ↑ bias ↓ but variance ↑

Bias Variance Graph

The bottom line is that you should match the model complexity to the data resources, not to the target complexity. If we had more data, the variance of the more complex model would be lower, but we cannot afford that here.

Hence, the game in learning is that we should pick a hypothesis set we can afford to navigate in the training set!

Possible Homework: Demonstrate bias-variance tradeoff with sin curve (a starting sketch below).
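
A sketch of that homework (numpy only), following the course's classic setup: f(x) = sin(πx) on [−1, 1], datasets of just N = 2 points, and two hypothesis sets, h(x) = b versus h(x) = ax + b. The richer set wins on bias but loses badly on variance:

```python
# Bias-variance decomposition by simulation: fit each hypothesis set on
# many 2-point datasets, then average over datasets and over x.
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(np.pi * x)
xs = np.linspace(-1, 1, 1001)           # grid for the expectation over x
runs = 10_000

for name, fit in [("H0: h=b",    lambda x, y: (0.0, y.mean())),
                  ("H1: h=ax+b", lambda x, y: np.polyfit(x, y, 1))]:
    preds = np.empty((runs, xs.size))
    for r in range(runs):
        x = rng.uniform(-1, 1, size=2)
        a, b = fit(x, f(x))
        preds[r] = a * xs + b
    g_bar = preds.mean(axis=0)                      # average hypothesis
    bias = ((g_bar - f(xs)) ** 2).mean()
    var = ((preds - g_bar) ** 2).mean()
    print(f"{name:12s} bias={bias:.2f}  var={var:.2f}  Eout~{bias + var:.2f}")
```

Typical output puts bias + var near 0.75 for the constant model and near 1.9 for the line: the simpler model wins on this data budget.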

##(e) Learning Curves

In this section, we will illustrate graphically that both approaches above lead to the same conclusions. In particular, both approaches talk about approximation and generalization.

###Simple vs Complex model

  • You need more examples to learn better (read: a lower Eout).

  • The flat line that the curves approach is the bias. Notice that the bias is lower for the more complex model.

  • On the flip side, if you have only a few examples, you should clearly see that you cannot afford a complex model. Your Ein will be lower, but any hope of learning from data will get bashed by a large Eout.

    • In VC bound terms, the Ω is large because of the complexity of the model.
    • In bias-variance terms, your bias is lower, but your variance is super high because you don't have enough data points.
  • Also notice that Ein can be zero if you have a complex enough model and not a lot of data points. The model in this case will memorize everything. But as seen in the figure, this does not translate well into Eout.

  • You need more examples for the more complex model. If you don't, use a simpler model. Remember: the game in learning is that we should pick a hypothesis set we can afford to navigate in the training set!

Simple-vs-Complex
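
The figure can be reproduced numerically. A sketch (numpy only; the target, noise level, and degrees are illustrative) that averages Ein and Eout over many datasets for a simple and a complex polynomial model:

```python
# Learning curves by simulation: average Ein and Eout over many datasets
# for a degree-1 vs a degree-10 polynomial fit to a noisy target.
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(2 * np.pi * x)
x_test = np.linspace(0, 1, 500)

def curve(degree, Ns, runs=300, sigma=0.2):
    for N in Ns:
        ein, eout = 0.0, 0.0
        for _ in range(runs):
            x = rng.uniform(0, 1, N)
            y = f(x) + sigma * rng.standard_normal(N)
            w = np.polyfit(x, y, degree)
            ein += ((np.polyval(w, x) - y) ** 2).mean()
            eout += ((np.polyval(w, x_test) - f(x_test)) ** 2).mean()
        print(f"deg={degree:2d} N={N:4d} Ein={ein/runs:.3f} Eout={eout/runs:.3f}")

for degree in (1, 10):
    curve(degree, Ns=(15, 50, 200, 1000))
```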

VC-Dimension vs Bias-Variance Approach:

  • Bias depends on the complexity of the model (not on the dataset at hand) and is hence constant across N. It is the best you can do in theory given the hypothesis set.

  • If you have a more complex model, in the absence of sufficient data points, Ω will drive the Eout to the moon even though Ein goes down.

    • more complex the model --> Ein ↓ but Ω ↑

VC Dimension vs Bias-Variance Decomposition

##(f) Overfitting

Overfitting, in simplest terms, is fitting the data more than warranted. The culprit is fitting the noise.

Think of two logistic regression models; one with 2 variables and the other one with 10 variables.

The latter model will always fit better in-sample. However, if the out-of-sample error of the more complex model is higher than that of the simpler one in a given exercise, we say the complex model is overfitting. The more complex model only loses (and overfits) because the number of samples is inadequate. In other terms, overfitting happens when Eout ↑ even though Ein ↓.

Overfitting
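
A toy version of the two-model story (numpy only; the data-generating process and all constants are made up for illustration): logistic regression with 2 versus 10 features on a small training set, where only the first two features carry signal:

```python
# Logistic regression with 2 vs 10 features on N = 30 points.  Only the
# first two features matter; the richer model fits the training set
# better but does worse out of sample.  Plain gradient descent.
import numpy as np

rng = np.random.default_rng(4)

def fit_logreg(X, y, steps=3000, lr=0.1):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))        # sigmoid probabilities
        w -= lr * X.T @ (p - y) / len(y)        # gradient step
    return w

def error(w, X, y):
    return (((X @ w) > 0) != y).mean()          # classification error

w_true = np.array([2.0, -2.0] + [0.0] * 8)
X_tr, X_te = rng.standard_normal((30, 10)), rng.standard_normal((5000, 10))
y_tr = (X_tr @ w_true + 0.5 * rng.standard_normal(30)) > 0
y_te = (X_te @ w_true + 0.5 * rng.standard_normal(5000)) > 0

for d in (2, 10):
    w = fit_logreg(X_tr[:, :d], y_tr)
    print(f"{d:2d} vars: Ein={error(w, X_tr[:, :d], y_tr):.3f} "
          f"Eout={error(w, X_te[:, :d], y_te):.3f}")
```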

Noise

Noise can be thought of in two parts: deterministic noise and stochastic noise. So far we have not touched upon the complexity of the target function. Deterministic noise is the part of the target complexity that our H cannot capture; given this limitation, we are more susceptible to detecting false patterns. As far as we are concerned, this unexplainable part is no different from the stochastic Gaussian noise.

Bias-Variance Framework

We can also approach the problem from a bias-variance framework. As you recall, bias is a function of your hypothesis set and does not depend on the training data. In that light, deterministic noise is how far your best hypothesis falls short of the target; bias is the direct impact of this noise. The problem of overfitting shows up in the form of a high variance: noise inflates the variance by making the model blindly follow it, unable to distinguish noise from signal. The algorithm tries to fit both types of noise, which is out of reach, and this leads to a high variance.

Going back to our two models:

-> Logistic regression with 2 variables will have a higher bias but lower variance.
-> Logistic regression with 10 variables will have a lower bias but higher variance.

Overfitting

An analogy might be a person finding herself in an unfamiliar forest with a partial map. Every time she starts an expedition to find her way out, she ends up at a different place (high variance), because she does not have the toolkit to deal with the complexity of the forest. The solution can be (i) studying forestry to understand the ecosystem better, or (ii) acquiring a more complete map. In our example, the first solution is analogous to a more complex hypothesis set, whereas the second can be thought of as working with more data points.

In short, the interaction of H with N determines the variance, and hence the overfitting problem. This is another way of stating that we should pick a hypothesis set we can afford to navigate in the training set!

Summarizing the somewhat wordy explanation:

N ↑                     -> Overfitting ↓
Deterministic noise ↑   -> Overfitting ↑
Stochastic noise ↑      -> Overfitting ↑

How we deal with overfitting is explained in the next section.

##(g) Regularization and Cross Validation

###Regularization

Putting in a regularizer is inevitable in almost all machine learning situations. Regularization reduces variance at the expense of bias and yields a simpler hypothesis by limiting the weights of the parameters. As a practical rule, the noise is higher-frequency than the signal, so constraining learning towards 'smaller' hypotheses punishes noise more than signal. Smaller weights often correspond to smoother functions.

The graph below illustrates the impact of introducing regularization. In the absence of regularization, Wlin minimizes the in-sample error. The regularization constraint (red circle) prohibits us from going all the way to Wlin. Formally, λ is the hyperparameter of the model; a more restricted model, i.e. a smaller C (the radius of the circle), corresponds to a larger λ.

Regularization-2

minimize Ein(w) subject to wᵀw ≤ C ⇔ minimize Eaug(w) = Ein(w) + (λ/N) wᵀw

The new error term is called the augmented error, and it is a better proxy for Eout than Ein is. Introducing the hyperparameter takes us one step closer to the holy grail of machine learning: finding an in-sample estimate of the out-of-sample error.

The Eaug error term is analogous to the generalization bound we derived from the VC inequality: both equations explicitly penalize model complexity through an overfit penalty.

Eaug(h) = Ein(h) + (λ/N) Ω(h) ↔ Eout(h) ≤ Ein(h) + Ω(N, H, δ)

Choosing the term Ω is a heuristic choice guided by theory and by the goals at hand. Two very common regularization techniques are the L1 and L2 norms:

L2 (weight decay): Ω(w) = Σ_i w_i² = wᵀw    L1 (sparsity-inducing): Ω(w) = Σ_i |w_i|
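
A minimal sketch of the L2 case (weight decay) for linear regression with polynomial features, using the closed form w = (ZᵀZ + λI)⁻¹Zᵀy; the data and the λ grid are illustrative:

```python
# Weight decay for a degree-10 polynomial fit: larger lambda shrinks the
# weights and tames the overfit.
import numpy as np

rng = np.random.default_rng(5)
f = lambda x: np.sin(2 * np.pi * x)

x = rng.uniform(0, 1, 15)
y = f(x) + 0.2 * rng.standard_normal(15)
Z = np.vander(x, 11, increasing=True)          # features x^0 .. x^10
x_t = np.linspace(0, 1, 500)
Z_t = np.vander(x_t, 11, increasing=True)

for lam in (0.0, 1e-6, 1e-3, 1e-1):
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(11), Z.T @ y)
    eout = ((Z_t @ w - f(x_t)) ** 2).mean()
    print(f"lambda={lam:8.0e}  |w|={np.linalg.norm(w):10.2f}  Eout={eout:.3f}")
```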

On the other hand, validation helps us to determine the right dosage of hyperparameter in our model.

###Validation

While regularization is mainly about estimating the overfit penalty, validation looks at the bottom line by estimating Eout. It is called 'validation' because we use it to make choices.

Eval(g⁻) = (1/K) Σ_{k=1}^{K} e( g⁻(x_k), y_k ), where g⁻ is trained on the remaining points

Let's call the number of data points in the full dataset D and the number of points set aside for validation K.

K points for validation
D − K points for training

We can show that Eval is an unbiased estimator of Eout, because the model never saw the validation points during training. The variance of the estimate depends on K: the larger K is, the smaller the variance. Ideally, we want a small variance, hence a large K.

E[ Eval(g⁻) ] = Eout(g⁻), with variance shrinking in K ⇒ Eval(g⁻) = Eout(g⁻) ± O(1/√K)

However, K is not free: you take K points away from the training set. Why does this matter? To answer, go back to the learning curves. A large K all but ensures a reliable estimate, but at the expense of model performance. In other words, as K grows, I get a reliable estimate of a worse quantity.

Validation-3

Model Selection using Validation

Choosing a parameter with validation introduces bias, because you pick the model with the lowest Eval, which is the minimum error on one particular realization of the data points. We can concede that this bias is not going to hurt us much, but we need to be careful to keep the validation set fairly clean, i.e. not overly contaminated by repeated use.

In formal terms, the generalization bound applies to this particular problem. M below denotes the number of models we choose among in the validation exercise.

Eout(g⁻_{m*}) ≤ Eval(g⁻_{m*}) + O( √( ln M / K ) )
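
A sketch of model selection with a validation set (numpy only; M = 8 candidate polynomial degrees, all constants illustrative). Note how Eval of the winner is typically a bit optimistic compared to Eout, which is the selection bias discussed above:

```python
# Split N points into D-K for training and K for validation, pick the
# degree with the lowest Eval, then check Eout on fresh data.
import numpy as np

rng = np.random.default_rng(6)
f = lambda x: np.sin(2 * np.pi * x)

def make(n, sigma=0.2):
    x = rng.uniform(0, 1, n)
    return x, f(x) + sigma * rng.standard_normal(n)

x, y = make(40)
x_tr, y_tr, x_val, y_val = x[:30], y[:30], x[30:], y[30:]   # K = 10
x_test, y_test = make(5000)

best = min(range(1, 9),                        # M = 8 candidate degrees
           key=lambda d: ((np.polyval(np.polyfit(x_tr, y_tr, d), x_val)
                           - y_val) ** 2).mean())
w = np.polyfit(x_tr, y_tr, best)
print("chosen degree:", best)
print("Eval :", ((np.polyval(w, x_val) - y_val) ** 2).mean())
print("Eout :", ((np.polyval(w, x_test) - y_test) ** 2).mean())
```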

Cross-Validation

Now that we have taken care of the reliability of the validation estimate, let's go back to the selection of K. We face a dilemma: we want a large K for a reliable out-of-sample estimate, but a small K for a well-trained model.

Can K be both small and large at the same time?

Validation-5

Leave-one-out validation helps us to get there.

Train on N − 1 points and validate on the 1 remaining point; repeat the process N times. The common point between all the little errors is that each is out-of-sample with respect to the hypothesis it evaluates, unbiased, and trying to estimate the same thing.

The common thread among all the hypotheses is that each is obtained by training on N − 1 data points. Because of the learning curve, they are all realizations drawn around the same expected value, so the little errors are indeed estimating the same quantity.

You will notice that, all of a sudden, the validation set looks quite respectable with N examples, on top of the fact that we were able to train on N − 1 points. The catch is that the e's are not independent; surprisingly, the method is still remarkably efficient.

Ecv = (1/N) Σ_{n=1}^{N} e_n, where e_n = e( g⁻_n(x_n), y_n ) and g⁻_n is trained on all points except (x_n, y_n)
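
A minimal leave-one-out sketch (numpy only; the target and degrees are illustrative): train on N − 1 points, score the held-out point, and average the N errors:

```python
# Leave-one-out cross-validation for polynomial fits of several degrees.
import numpy as np

rng = np.random.default_rng(7)
f = lambda x: np.sin(2 * np.pi * x)
x = rng.uniform(0, 1, 20)
y = f(x) + 0.2 * rng.standard_normal(20)

def e_cv(degree):
    errs = []
    for n in range(len(x)):
        mask = np.arange(len(x)) != n            # leave point n out
        w = np.polyfit(x[mask], y[mask], degree)
        errs.append((np.polyval(w, x[n]) - y[n]) ** 2)
    return np.mean(errs)

for d in (1, 3, 7, 12):
    print(f"degree={d:2d}  Ecv={e_cv(d):.3f}")
```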