- Models probabilities for classification
- Values strictly between 0 (negative class) and 1 (positive class)
- Linear regression not ideal because:
- Values can be > 1 or < 0
- Extreme values disproportionately influence the linear regression model
- Used when dependent variable is categorical
- Binary logistic regression - categorical response has only two possible outcomes
- Multinomial logistic regression - three or more categories without ordering
- Ordinal logistic regression - three or more categories with ordering
- Identify features and labels (data)
- Set initial weights for model
- Loop until the model converges (see the sketch after this list):
- Calculate cost
- Calculate gradient
- If length of gradient vector is close to 0, stop; otherwise, adjust weights based on the gradient
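A minimal Python/NumPy sketch of this loop; `cost_fn`, `grad_fn`, and the tolerance `tol` are hypothetical placeholders (the logistic cost and gradient are defined further below):

```python
import numpy as np

def train(X, y, cost_fn, grad_fn, alpha=0.1, tol=1e-6, max_iters=10000):
    """Generic gradient-descent training loop for the steps above."""
    theta = np.zeros(X.shape[1])           # set initial weights
    for _ in range(max_iters):
        cost = cost_fn(theta, X, y)        # calculate cost (useful for monitoring convergence)
        grad = grad_fn(theta, X, y)        # calculate gradient
        if np.linalg.norm(grad) < tol:     # gradient vector close to 0 -> stop
            break
        theta = theta - alpha * grad       # otherwise adjust weights along the gradient
    return theta, cost
```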
- Classifier should output values between 0 and 1 only
- Sigmoid function:
$$ g(z) = \frac{1}{1 + \exp(-z)} $$
- Hypothesis representation: $h_{\theta}(x) = g(\theta^{T}x)$
- $h_{\theta}(x)$ = estimated probability that $y = 1$ on input $x$: $h_{\theta}(x) = P(y=1|x; \theta)$
- Since $y$ must equal 0 or 1, $P(y=0|x; \theta) = 1 - P(y=1|x; \theta)$
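A tiny NumPy illustration of evaluating the hypothesis as a probability; the values of `theta` and `x` are made up for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative parameters and input only (not from the notes);
# x includes the bias term x0 = 1 as its first entry.
theta = np.array([-1.0, 0.5, 0.25])
x = np.array([1.0, 2.0, 4.0])

p_y1 = sigmoid(theta @ x)   # h_theta(x) = P(y = 1 | x; theta)
p_y0 = 1.0 - p_y1           # P(y = 0 | x; theta) = 1 - P(y = 1 | x; theta)
print(p_y1, p_y0)           # ~0.73 and ~0.27 for these made-up values
```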
- Separates region where the hypothesis predicts y = 1 from the region where the hypothesis predicts y = 0
- Property of the hypothesis, not the training set
- Higher order polynomials result in more complex decision boundaries
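For example (with illustrative parameters, not from the notes): if $h_{\theta}(x) = g(\theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2})$ and $\theta = [-3, 1, 1]^{T}$, the hypothesis predicts $y = 1$ exactly when $\theta^{T}x \geq 0$, i.e. $x_{1} + x_{2} \geq 3$, so the straight line $x_{1} + x_{2} = 3$ is the decision boundary; adding higher-order polynomial terms would bend this boundary into a curve.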
- Gradient descent will converge to the global minimum only if the cost function is convex
$$ J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \text{Cost}(h_{\theta}(x^{(i)}), y^{(i)}) $$
$$ \text{Cost}(h_{\theta}(x), y) = \begin{cases}
-\log(h_{\theta}(x)), & \text{if } y = 1 \\
-\log(1 - h_{\theta}(x)), & \text{if } y = 0
\end{cases} $$
For the y = 1 case:
- Cost = 0 if $y = 1$ and $h_{\theta}(x) = 1$
- As $h_{\theta}(x) \to 0$, $\text{Cost} \to \infty$

For the y = 0 case:
- Cost = 0 if $y = 0$ and $h_{\theta}(x) = 0$
- As $h_{\theta}(x) \to 1$, $\text{Cost} \to \infty$
If there is a large difference between the hypothesis $h_{\theta}(x)$ and the actual label $y$, the model is penalized with a very large cost.
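As a quick numerical check (natural log assumed): if $y = 1$ but the model predicts $h_{\theta}(x) = 0.01$, then $\text{Cost} = -\log(0.01) \approx 4.6$, whereas a confident correct prediction of $h_{\theta}(x) = 0.99$ costs only $-\log(0.99) \approx 0.01$.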
We can rewrite the cost function as a single expression:
$$ \text{Cost}(h_{\theta}(x), y) = -y\log(h_{\theta}(x)) - (1-y)\log(1 - h_{\theta}(x)) $$
Checking both cases recovers our first definition:
- If $y = 1$: $\text{Cost}(h_{\theta}(x), y) = -1(\log(h_{\theta}(x))) - 0(\log(1-h_{\theta}(x))) = -\log(h_{\theta}(x))$
- If $y = 0$: $\text{Cost}(h_{\theta}(x), y) = -0(\log(h_{\theta}(x))) - 1(\log(1-h_{\theta}(x))) = -\log(1 - h_{\theta}(x))$
So our final logistic regression cost function is:
$$ J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_{\theta}(x^{(i)})) + (1-y^{(i)})\log(1 - h_{\theta}(x^{(i)}))\right] $$
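A vectorized NumPy sketch of this cost; the small clip value `eps` is an assumption added to avoid $\log(0)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, eps=1e-12):
    """Vectorized J(theta) over all m training examples."""
    h = np.clip(sigmoid(X @ theta), eps, 1 - eps)   # h_theta(x^(i)) for every row of X
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
```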
- We use gradient descent to minimize our cost function
- To do this, we repeatedly update each parameter:
$$ \theta_{j} := \theta_{j} - \alpha\frac{\partial}{\partial \theta_{j}}J(\theta) $$
$$ \theta_{j} := \theta_{j} - \frac{\alpha}{m}\sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})x_{j}^{(i)} $$
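The corresponding update, sketched in NumPy; the learning rate `alpha` is an arbitrary example value:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(theta, X, y):
    """(1/m) * X^T (h_theta(X) - y): the partial derivatives of J(theta)."""
    m = X.shape[0]
    return X.T @ (sigmoid(X @ theta) - y) / m

def gradient_descent_step(theta, X, y, alpha=0.1):
    """One simultaneous update of every theta_j."""
    return theta - alpha * gradient(theta, X, y)
```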
- Optimization algorithms:
- Conjugate gradient
- BFGS
- L-BFGS
- Advantages:
- No need to manually pick $\alpha$
- Often faster than gradient descent
- Disadvantages:
- More complex
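A hedged sketch of using one of the optimizers listed above (L-BFGS, via SciPy's `minimize`) on the logistic cost and gradient; the data here is made up for illustration:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    """Return J(theta) and its gradient together so the optimizer can use both."""
    m = X.shape[0]
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    cost = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    grad = X.T @ (h - y) / m
    return cost, grad

# Made-up data: 4 examples, a bias column of ones plus 2 features
X = np.array([[1.0, 0.5, 1.2],
              [1.0, 2.0, 0.3],
              [1.0, 1.5, 2.5],
              [1.0, 0.2, 0.1]])
y = np.array([0.0, 1.0, 1.0, 0.0])

# L-BFGS chooses its own step sizes, so no learning rate alpha is picked by hand
result = minimize(cost_and_grad, x0=np.zeros(X.shape[1]), args=(X, y),
                  jac=True, method="L-BFGS-B")
theta = result.x
```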
- Turn the problem into several binary classification problems
- Train a logistic regression classifier $h_{\theta}^{(i)}(x)$ for each class/category $i$ to predict the probability that $y = i$
- To make a prediction on a new input $x$, calculate every $h_{\theta}^{(i)}(x)$ and pick the class $i$ that maximizes it
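A minimal one-vs-all sketch in NumPy; `fit_binary` is a hypothetical helper standing in for any of the training approaches above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X, y, alpha=0.1, iters=5000):
    """Hypothetical binary trainer: plain gradient descent on the logistic cost."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta = theta - alpha * X.T @ (sigmoid(X @ theta) - y) / X.shape[0]
    return theta

def one_vs_all_fit(X, y, classes):
    # One classifier h_theta^(i) per class i, trained on the binary labels (y == i)
    return {i: fit_binary(X, (y == i).astype(float)) for i in classes}

def one_vs_all_predict(models, x):
    # Pick the class i whose h_theta^(i)(x) is largest
    return max(models, key=lambda i: sigmoid(models[i] @ x))
```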