cs188-sankararaman-ML


Computer Science 188 - Introduction to Machine Learning

http://web.cs.ucla.edu/~sriram/courses/cs188.winter-2017/html/index.html

==============
Jan. 10, 2017
==============

mini-quiz (30 mins) on Tue. Jan. 17.

Course in machine learning by Hal Daume III (CIML)
Machine Learning: The art and science of algorithms that make sense of data by
Peter Flach (FL)

Exams
------
Midterm (Feb. 14)
Final (Mar. 22)

closed book, closed notes

Python 2.7
-----------
numpy
scipy
scikit-learn


When do we use ML?
-------------------
- human expertise does not exist
  + navigating on Mars
- humans cannot explain their expertise
  + speech recognition
- algorithms must be customized
  + personalized medicine
- data exists to acquire expertise
  + genomics
- recognizing patterns
  + facial identities or facial expressions
  + handwritten or spoken words
  + medical images
- generating patterns
  + generating images or motion sequences
- recognizing anomalies
  + unusual credit card transactions
  + unusual patterns of sensor readings in a nuclear power plant
- prediction
  + future stock prices or currency exchange rates


Machine Learning
-----------------
Study of algorithms that
- improve their performance P
- at some task T
- with experience E
A well-defined learning task is given by <P, T, E>


Defining the Learning Task
---------------------------
Improve on task T with respect to performance metric P, based on experience E

T: Playing checkers
P: Percentage of games won agianst arbitrary opponents
E: Playing practice games agianst itself

T: Recognizing hand-written words
P: Percentage of words correctly classified
E: Database of human-labeled images of handwritten words

T: Driving on four-lane highways using vision sensors
P: Average distance traveled before a human-judged error
E: Sequence of images and steering commands recorded while observing a human
   driver

T: Categorize email messages as spam or legitimate
P: Percentage of email messages correctly classified
E: Database of emails, some with human-given labels


TYPES OF LEARNING


Supervised (Inductive) Learning
--------------------------------
Learning with a teacher
Given: labeled traning instances (examples)
Goal: learn mapping that predicts label for test instance

  Regression
  -----------
  Given (x1, y1), (x2, y2), ... , (xn, yn)
  Learn a function f(x) to predict y given x
  * y is real-valued == regression

  Classification
  ---------------
  Given (x1, y1), (x2, y2), ... , (xn, yn)
  Learn a function f(x) to predict y given x
  * y is categorical == classification

x can be multi-dimensional
each dimension corresponds to an attribute


Unsupervised Learning
----------------------
Learning without a teacher
Given: unlabeled inputs
Goal: learn some intrinsic structure in inputs

Given x1, x2, ... , xn (without labels)
Output hidden structure behind the x's

- clustering
- gemonics application
  group individuals by genetic similarity
- independent component analysis
  separate a combined signal into its original sources
  two people are talking, separate the voices


Reinforcement Learning
-----------------------
Learning by interacting
Given: agent interacting in environment (having set of states)
Goal: learn policy (state to action mapping) that maximizes agent's reward

Given sequence of states and actions with (delayed) rewards
Learn policy that maximizes agent's reward

- game playing
  given sequences of moves and whether or not the player won at the end,
  learn to make good moves
- robot in maze


FRAMING A LEARNING PROBLEM


Representing Instances/Examples
--------------------------------
What is an instance?
How is it represented?

- Define a list of features.
  This is how the algorithm views the data.
  Features are the questions we can ask about the instances.
  
  Red Apple: red, round, leaf, 3oz, ...
  Green Apple: green, round, no leaf, 4oz, ...
  Yellow Banana: yellow, curved, no leaf, 4oz, ...
  Green Banana: green, curved, no leaf, 5oz, ...

During learning/traning/induction, learn a model of what distinguishes apples
and bananas based on the features.
Generate a classifier.

With a new instance, apply the classifier.
Classifier classifies a new instance based on the features.


Learning Algorithm
-------------------
Learning is about *generalizing* from training data
What does this *assume* about training and test set?

- Don't see a test instance exactly the same as a training instance


Generalization Ability: performance of the learning algorithm
How do we measure performance depends on the problem we are trying to solve
The training and test data should be strongly related


Loss function L(y, y^)
  tells us how bad the system's prediction of y is compared to the true
  value of y

  - A loss function for regression (squared loss)
    L(y, y^) = (y - y^)^2

  - A loss function for classification
    L(y, y^) = 0   if y = y^
               1   otherwise

Use the probabilistic model of learning
There is some unknown probability distribution p over instance/label pairs
called the data generating distribution


Learning Problem defined by
  loss function: measures of performance
  data generating distribution: what data do we expect to see
    characterizes experience


Problem Setting
  set of possible instances X
  set of possible labels Y
  unknown target function f: X -> Y
  set of function hypothesis: H = { h | h: X -> Y }
Input
  training instances drawn from the dta generating distribution p
    { (xi, yi) }i = 1 -> n  =  { (x1, y1), ... , (xn, yn) }
Output
  hypothesis h in H that best approximates f
  h should do well (as measured by the loss) on future instances
  formally, h should have low expected (test) loss/risk

    E((x,y)~p)[L(y,h(x))] = sum(x,y) p(x,y)L(y,h(x))

Problem: we don't know what p is
         but we are given samples drawn from p

we instead approximate the risk by the training error/empirical risk
  
  1/n sum(i = i -> n) L(yi, h(xi))

why is this reasonable?
  both the training data and the test data set are generated based on
  this distribution

Problem: can make the training error zero by memorizing


Regression
-----------
For regression, the common choice is squared loss

  L(yi,h(xi)) = (yi - h(xi))^2

Empirical loss of function h applied to training data is then

  1/n sum(i = 1 -> n) L(yi,h(xi)) = 1/n sum(i = 1 -> n) (yi - h(xi))^2


Fundamental Difficulties of Machine Learning
---------------------------------------------
we have access to the training error but really care about the expected loss
our learned funciton needs to generalize beyond the training data

Key Issues in Machine Learning
-------------------------------
Representation:  how to choose a hypothesis space?
  - often we use prior knowledge to guide this choice
  - the ability to answer the next two questions also affects choice
  - which spaces have been useful in practical applications and why?

Optimization:  how to find the best hypothesis within this space?
  - (the ALGORITHMIC question), at the intersection of computer science
    and optimization research
  - are some learning problems computationally intractable?
    (the COMPUTATIONAL quesion)

Evaluation:  how to gauge the accuracy of a hypothesis on unseen testing data?
  - the previous exampe showed that choosing the hypothesis which simply
    minimizes training set error (i.e. empirical loss) is not optimal
  - this question is the main topic of learning theory
  - how can we have confidence in the results?
    (the STATISTICAL question)

Formulation:  how can we formulate application problems as machine learning
              problems? (the ENGINEERING question)

Machine Learning in Practice
-----------------------------
  +
L | understand domain, prior knowledge, and goals
O | data integration, selection, cleaning, pre-processing
O | learn models
P | interpret results
  +


==============
Jan. 12, 2017
==============

Talking about probability and statistics, for uncertainty

Random Variables
-----------------

Example: Toss a coin twice
---------------------------

Sample Space (omega): {HH, HT, TH, TT}

P(HH) = 1/4
P(HT) = 1/4
P(TH) = 1/4
P(TT) = 1/4

random variable: X = number of heads observed, can be in {0, 1, 2}

P(X=0) = 1/4
P(X=1) = 1/2
P(X=2) = 1/4

P(X=0 or X=1) = P(X=0 v X=1) = P(X=0) + P(X=1) = 3/4


+--------------------------+
|                    omega |   area(omega) = 1
|  +------+      +------+  |
|  |  A1  |      |  A2  |  |
|  +------+      +------+  |
|                          |
+--------------------------+


Axioms of Probability
----------------------
omega : sample space
A     : event

(1) P(A) >= 0     for all events A
(2) P(omega) = 1
(3) P(A1 v A2) = P(A1) + P(A2)    for A1 ^ A2 = NIL


Corollaries
------------
(1) P(A^C) = 1 - P(A)
(2) P(A1 v A2) = P(A1) + P(A2) - P(A1 ^ A2)


Discrete Random Variables
--------------------------

X in {x_1, ..., x_M}
Probability Mass Function: P_X(x_m) = P(X = x_m)


Example
--------
        (1)    (2)
Box     Red    Blue
Balls   Red    Blue

Ball       Red   Blue
Box
    Red     6     2
    Blue    1     3

X(box)   {r, b}   {x_1, ..., x_m}
Y(Ball)  {r, b}   {y_1, ..., y_l}

       Y
  +--------------------
  |    |
X |----+--------------- c_i
  |    |
  |    |
  |    |

      r_j


P(X=x_i, Y=y_j)

repeat N times

n_ij = #{X=x_i and Y=y_j}
c_i = #{X=x_i}
r_j = #{Y=y_j}
N = total number of times of experiment


Joint Probability
------------------
P(X=x_i and Y=y_j)
P(X=x_i, Y=y_j) = n_ij / N


Marginal Probability
---------------------
Sum Rule of Probability

  P(X=x_i) = c_i / N     with   c_i = sum{j} n_ij
           = sum{j} n_ij / N
           = sum{j} (n_ij / N)
           = sum{j} P(X=x_i, Y=y_j)
  P(Y=y_j) = sum{i} P(X=x_i, Y=y_j)


Conditional Probability
------------------------
P(Y=y_j | X=x_i) = n_ij / c_i
                 = (n_ij / N) / (c_i / N)
                 = P(X=x_i, Y=y_j) / P(X=x_i)

Summary
--------
    Sum Rule:  P(X) = sum{y} P(X,y)
Product Rule:  P(X,Y) = P(Y|X)P(X)
                      = P(X|Y)P(Y)
Bayes Theorem: P(X|Y) = P(Y|X)P(X) / P(Y)


Independence
-------------
X, Y are independent iff  P(X,Y) = P(X)P(Y)
P(X,Y) = P(X|Y)P(Y) = P(X)P(Y)  ==>  P(X|Y) = P(X)


Continuous Random Variables
----------------------------
X is in R
P(X in [x, x+dx]) = p(x)dx

Probability Density Function (PDF)
-----------------------------------
p(x) is the PDF for continuous random variables

P(X in [a,b]) = integrate{a -> b} p(x)dx

(1) p(x) >= 0
(2) integrate{-infty -> +infty} p(x)dx = 1

Cumulative Density Function (CDF)
----------------------------------
probability that X is less than or equal to some chosen value x

P(X <= x)


given X,Y, want to find
P(X in [x1,x2], Y in [y1,y2])
= integrate{x1 -> x2} integrate{y1 -> y2} p(x,y)dxdy

p(x,y) : joint probability density function


Expectation & Variance
-----------------------
random variable X with distribution p(x): X ~ p(x)

Let X in {x_1, ... , x_M}

Expectation: E[X] = sum{m=1 -> M} x_m * p(x_m)

             E[X] = integrate{-infty -> +infty} x * p(x) * dx


Variance: Var[X] = sum{m=1 -> M} (x_m - E[X])^2 * p(x_m)
                 = E[ (X - E[X])^2 ]

          Var[X] = integrate{-infty -> +infty} (x - E[X])^2 * p(x) * dx


E[g(X)] = sum{m=1 -> M} g(x_m) * p(x_m)
g(x) can be any function
  g(x) = (x-E[X])^2  ->  Variance


---------------------------------
Discrete: Binary Random Variable
Bernoulli Distribution
---------------------------------

X ~ Ber(p)

P(X=0) & P(X=1)   with   P(X=0) + P(X=1) = 1

usually we have P(X=1) = p   (0 <= p <= 1)

probability mass function = {   p   x = 1     }
                            { 1-p   x = 0     }
                            {   0   x = others}

Parameters: p


-----------------------------------------
Continuous: Normal/Guassian Distribution
-----------------------------------------

X  is in  R

PDF: p(x) = 1/sqrt(2 * pi * sigma^2) * exp( (-1)/(2 * sigma^2) (x - mu)^2 )

Parameters: (mu, sigma)

E[X] = mu
Var[X] = sigma^2


==============
Jan. 17, 2017
==============

X ~ Ber(p)
X is in {0,1}

P(X = 1) = p
P(X = 0) = 1-p


statistics: inverse probability
--------------------------------
probability :  parameters  ---->  samples/data
  
  bernoulli distribution  :      p         ---->  X
  normal distribution     : (mu, sigma^2)  ---->  X

  general distribution    :         theta  ---->  X


Example 1
----------
Toss a coin 5 times, observe 3 heads
Goal: find p

likelihood
-----------
X1, X2, X3, X4, X5

likelihood function
L(theta) = P(X1, X2, X3, X4, X5; theta)

each toss is a Bernoulli random variable

Xi = 1  if toss landed heads
   = 0  otherwise

Xi ~ Ber(p)

we have 5 independent tosses

X1  X2  X3  X4  X5
1   0   1   1   0

L(theta) = P(X1, X2, X3, X4, X5 | theta)
         = P(X1;theta)P(X2;theta)P(X3;theta)P(X4;theta)P(X5;theta)
         = theta * (1-theta) * theta * theta * (1-theta)
         = theta^3 (1-theta)^2

find theta that maximizes the likelihood: L(theta)

* ^theta = argmax L(theta)
    ^theta: Maximum Likelihood Estimator (MLE)

^theta = 3/5


Probabilistic/Statistical Models
---------------------------------
P(X; theta)

for coin tossing we have a Bernoulli model, other models for others

data/observation: samples drawn from the model

D = {x1, x2, ... , xn}
xi ~ P(xi, theta)
   ^
   |
   |
    drawn independently from the model

likelihood
-----------
L(theta) = P(x1, ... , xN; theta)
         = prod{i=1 -> n} P(xi; theta)


Optimization
-------------
max f(x)
 x

          ^
     f(x) |
          |
          |               global max
          |                  ____
          |                 /    \
          |   local max    /      \
          |      ____     /        \____
          |     /    \   /
          | ___/      \_/
          |
          +-------------------------------> x


Fermat's Theorem: all optima of f(x) occrs where f'(x) = 0


Example 2
----------
X ~ N(mu, sigma^2)
      ^   ^
      |   |
      |   |
   mean   variance


p(x; (mu, sigma^2)) = 1/sqrt(2*pi*sigma^2) ((-1/2*sigma^2) * (x-mu)^2)

x1, x2, ... , xN
xn ~ N(mu, sigma^2)
   ^
   |
   |
    independent

theta = (    mu   )
        ( sigma^2 )

likelihood: L(mu, sigma^2) = P(x1, ... , xN; (mu, sigma^2))
                           = prod{n=1 -> N} P(xn; (mu, sigma^2))

(^mu, ^sigma^2) = argmax L(mu, sigma^2)
                (mu, sigma^2)

  need to optimize
  optimization is an important field in ML

l(mu, sigma^2) = log(L(mu, sigma^2))
               = log( prod{i=1 -> N} P(xn; (mu, sigma^2)) )
               = sum{n=1 -> N} log( p(xn; (mu, sigma^2)) )

(^mu, ^sigma^2) = argmax l(mu, sigma^2)
                (mu, sigma^2)

l(mu, sigma^2) = sum{n=1 -> N} log{ (1/sqrt(2*pi*sigma^2)) e^(-(xn-mu)/(2*sigma^2)) }
               = sum{n=1 -> N} { ((-1/2*sigma^2)(xn-mu)^2) - (1/2)log(2*pi*sigma^2) }

     dl/d(mu) (mu, sigma^2) = 0
dl/d(sigma^2) (mu, sigma^2) = 0


dl/d(mu) (mu, sigma^2) = sum{n=1 -> N} d/d(mu) (-1/2*sigma^2)(xn-mu)^2
                       = sum{n=1 -> N} (xn-mu)/sigma^2 = 0

     ^mu = (sum{n=1 -> N} xn) / N = bar(x)     [sample mean]
^sigma^2 = (sum{n=1 -> N} (xn-bar(x))^2) / N   [sample variance]


==============
Jan. 19, 2017
==============

Sample Dataset
---------------
columns denote features/attributes and labels/targets
rows denote instance-label pairs (xn,yn)
class label denotes whether tennis game was played

binary classification problem

For each decision tree
  each internal node: test one feature
  each edge: select one value for the feature
  each leaf: predict class

  allows to make a series of questions and choose next question to
  get to the results we want

Tree
             * <--------- root node
            /  \ <------- edge
           *    *  <----- internal node
               /  \
              *    * <--- leaf

a decision tree partitions the feature space


Decision Tree Learning
-----------------------

setup

  set of possible instances X
    each instance x in X is a feature vector
    (humidity=low, wind=weak, outlook=rain, temp=hot)
  set of possible labels Y
    Y is discrete valued
  unknown target function f: X -> Y
  model/hypothesis: H = {h|h: X -> Y}
  each hypothesis h is a decision tree

goal : train/induce/learn a function h that maps instance to label

three things to learn
----------------------
(1) structure of the tree
(2) threshold values (theta_i)
(3) values for the leaves (A, B, ...)


-------------------
Learning Algorithms
-------------------

Ockham's Razor
---------------
The simplest consistent explanation is the best.
  a form of inductive bias
  find smallest decision tree that correctly classifies all training examples
  NP-hard
  instead, greedily construct a decision tree that is pretty small


Algorithm 1
------------
DecisionTreeTrain(data, features)
  guess <- the most frequest label in data
  if all labels in data are the same then
    return LEAF(guess)
  else
    f <- the "best" feature in features
    NO  <- the subset of data on which f = NO
    YES <- the subset of data on which f = YES
    left  <- DecisionTreeTrain(NO, features \ {f})
    right <- DecisionTreeTrain(YES, features \ {f})
    return NODE(f, left, right)
  endif


how to choose the best feature
-------------------------------
possibilities
  random: select a feature at random
  highest accuracy: select feature with largest accuracy
  max-gain: select feature with largest information gain

ID3 algorithm: one algorithm for decision tree learning
  select the feature with largest information gain


which attribute to split at the root
  idea: use information gain to choose the attribute


Entropy
--------
idea: gaining information reduces uncertainty

if a random variable X has K different values a_1, a_2, ..., a_K
it's entropy is given by

  H[X] = -sum{k=1 -> K} P(X=a_k) * log(P(X=a_k))
                                   ^
                                   |
                                  if the base is 2, the unit of the entropy
                                  is called "bit"

this measures the amount of uncertainty of a random variable with a
specific probability distribution
the higher it is, the less confident we are in its outcome

Conditional Entropy
--------------------
given two random variables X and Y

  H[Y|X] = sum{k} P(X=a_k) * H[Y|X=a_k]

  X : the attribute to be split
  Y : outcome (binary)

information gain

  GAIN = H[Y] - H[Y|X]

when H[Y] is fixed, we only need to compare conditional entropy

mutual information between Y and X
  the expected reduction in entropy of the target variable Y due to
  sorting on the feature X

    example
    -------
    GAIN[Y,Patrons] = H[Y] - H[Y|Patrons]
                    = 1 - 0.45
                    = 0.55
    GAIN[Y,Type] = H[Y] - H[Y|Type]
                 = 1 - 1
                 = 0


Algorithm 2
------------
DecisionTreeTrain(data, features)
  guess <- the most frequest label in data
  if all labels in data are the same then
    return LEAF(guess)
  else if features is empty then
    return LEAF(guess)
  else
    f <- the "best" feature in features
    NO  <- the subset of data on which f = NO
    YES <- the subset of data on which f = YES
    left  <- DecisionTreeTrain(NO, features \ {f})
    right <- DecisionTreeTrain(YES, features \ {f})
    return NODE(f, left, right)
  endif


we need to careful to pick an appropriate tree depth
if the tree is too deep, we can overfit (memorize training data)
if the tree is too shallow, we underfit (not learn enough)
max depth is a hyperparameter that should be tuned by the data

a hyperparameter is
  a parameter that controls the other parameters of the tree
  max depth of 0     : underfitting
  max depth of infty : overfitting


Reasons for Overfitting
------------------------
noisy data
  two instances have the same feature values but different class labels
  some of the feature or label values are incorrect
some features are irrelevant to classification
target variable is non-deterministic in the features
  in general we cannot measure all the variables needed to predict
  so target variable is not uniquely determined by the input feature values
training error is not guaranteed to be zero

* our decision tree learning procedure always increases traning set accuracy
* though training accuracy is increasing, test accuracy can decrease


how to get a realistic estimate of accuracy
--------------------------------------------
train model on training data
compute accuracy on test data
DOWNSIDE: throwing out some data


Cross-Validation (CV)
----------------------
don't just choose one particular split of the data

in principle, we should do this multiple times since performance may
be different for each split

K-Fold Cross Validation (e.g. K = 10)
  randomly partition full data set of N instances into K disjoint subsets
    each roughly of size N/K
  choose each fold in turn as the test set
  train model on the other folds and evaluate
  compute average statistics over K test performances
  can also do leave-one-out CV where K = N

recipe
-------
split training data into K equal parts
denote each part as D_k^{TRAIN}
for every k in [1,K]
  train a model using D = D^{TRAIN} - D_k^{TRAIN}
  evaluate the performance of the model on D_k^{TRAIN}
average the K performance metrics


Avoid Overfitting
------------------
get more training data
remove irrelevant features
decision tree pruning
  prune while building tree (early stopping)
  prune after building tree (post pruning)


Controlling the Size of the Tree
---------------------------------
if we cut the tree, not all traning sample would be classified correctly
we label the leaves of this smaller tree with the majority of traning
samples' labels

  example
  -------
         ******  <-- Wait: Yes
         ******  <-- Wait: No
        Patrons?
    /      |      \
  None   Some    Full
         ****    **
   **            ****  <-- choose Wait: No


==============
Jan. 31, 2017
==============

special case: binary classification
  instance (feature vectors): x in R^D
  label: y in {-1, +1}
  model/hypothesis:
    H = {h|h: X -> Y, h(x) = sign(sum{})}


Perceptron Prediction
----------------------
input: x in R^D, w in R^D, b in R   (D = dimensions)

a = sum{d=1 -> D} w_d * x_d + b = w^T * x + b
^y = sign(a)

output: ^y
sum{d=1 -> D} w_d * x_d + b: hyperplane in D dimensions with parameters (w,b)
w: weights
b: bias
a: activation


Hyperplane in D dimensions
---------------------------
x in R^D

  sum{d=1 -> D} w_d * x_d + b = 0

x that satisfies the equation above is in the hyperplane
the set of points that satisfies the equation makes the hyperplane


D+1 dimensions
---------------
~x^T = (1, x_1, ..., x_D)
~w^T = (b, w_1, ..., w_D)

  ~w^T * ~x = 0


Perceptron Learning
--------------------
iteratively solving one case at a time

  REPEAT

    pick a data point x_n
    compute a = w^T * x_n, using current w
    if ay_n > 0, do nothing
    else

      w <- w + y_n * x_n

  UNTIL converged


Properties of Perceptron Learning
----------------------------------
this is an online algorithm - looks at one instance at a time
convergence
  if training data is not linearly separable, the algorithm does not converge
  if the training data is linearly separable, the algorithm stops in a finite
  number of steps (converges)
how long to convergence?
  depends on the difficulty of the problem
    how far apart are the neg and pos samples


instead of predicting the class, predict the probability of instance being
in a class

perceptron does not produce probability estimates


Logistic Classification
------------------------
setup for binary classification

  input: x in R^D
  output: y in {0,1}
  training data: data = {(x_n, y_n), n = 1,2,...,N}
  hypothesis/model:

    h_w,b(x) = p(y=1 | x; b,w) = sigma(a(x))

  where

    a(x) = b + sum{d} w_d * x_d = b + w^T * x

  and sigma() stands for the sigmoid function

    sigma(a) = 1 / (1 + e^{-a})

  where a = b + w^T * x

    the sigmoid function will take the output and
    map it to a number between 0 and 1

  given training data N samples/instances:
    data^train = {(x1, y1), ..., (x_N, y_N)}
    train/learn/induce h_w,b
    find values for (w,b)


  properties of the sigmoid function
  -----------------------------------
  bounded between 0 and 1  <- thus, interpretable as probability
  monotonically increasing, thus, usable to derive classification rules
    sigma(a) > 0.5, positive (classify as '1')
    sigma(a) < 0.5, negative (classify as '0')
    sigma(a) = 0.5, undecidable
  nice computational properties

Linear or Nonlinear?
---------------------
sigma(a) is nonlinear, however, the decision boundary is determined by

  sigma(a) = 0.5


Likelihood Function
--------------------
Let X1, ..., X_N be IID (independent and identically distributed) with PDF
p(x|theta) (also written as p(x;theta)).

The likelihood function is defined by L(theta)

  L(theta) = p(X_1, ..., X_N; theta) = prod{i=1 -> N} p(X_i; theta)

Notes: the likelihood function is just the joint density of the data, except
that we treat it as a function of the parameter theta.

order of data doesn't matter -> independence between variables


Maximum Likelihood Estimator
-----------------------------
Definition: the maximum likelihood estimator (MLE) ^theta
is the value of theta that maximizes L(theta)

The log-likelihood function is defined by l(theta) = log(L(theta)).
Its maximum occurs at the same place as that of the likelihood function

  using logs simplifies mathematical expressions
    converts exponents to products and products to sums
  using logs helps with numerical stability

the same is true of the likelihood function times any constant
thus we shall drop constants in the likelihood function


Likelihood Function for Logistic Regression
--------------------------------------------
probability of a single training sample (x_n, y_n)

  p(y_n|x_n; w) = { h_w,b = sigma(b + w^T * x_n)             if y_n = 1  }
                  { 1 - h_w,b(x) = 1 - sigma(b + w^T * x_n)  otherwise   }

compact expression, exploring that y_n is either 1 or 0

  p(y_n|x_n; w) = h_w,b(x_n)^{y_n} [1 - h_w,b(x_n)]^{1-y_n}

Also

  y_n ~ Ber(h_w,b(x_n))

-----------------------------------
 example
-----------------------------------
  X_1, ..., X_N
  X_n ~ Ber(p)   n = 1, ..., N

  l(p) = log( P(X1, ..., X_n; p) )
-----------------------------------

l(w,b) = log(L(w,b))

L(w,b) = prod{n=1 -> N} P(y_n | x_n; w,b)
       = prod{n=1 -> N} h_w,b(x_n)^{y_n} (1 - h_w,b(x_n))^{1-y_n}

l(w,b) = log( prod{n=1 -> N} P(y_n | x_n; w,b) )
       = sum{n=1 -> N} log(P(y_n | x_n; w,b))
       = sum{n=1 -> N} log(h_w,b(x_n)^{y_n} (1 - h_w,b(x_n))^{1-y_n})
       = sum{n=1 -> N} log(h_w,b(x_n)^{y_n}) + log(1-h_w,b(x_n))^{1-y_n}
       = sum{n=1 -> N} y_n * log(h_w,b(x_n)) + (1-y_n) * log(1-h_w,b(x_n))


we can ignore the distinction between bias and weights for convenience

  append 1 to x

    x <- [1 x_1 x_2 ... x_D]

  append b to w

    theta <- [b w_1 w_2 ... w_D]

  J(theta) = -sum{n} (y_n * log(h_theta(x_n)) + (1-y_n) * log(1-h_theta(x_n)))

we would like to minimize this negative log likelihood


one way to minimize a function f
---------------------------------
gradient descent
  gives information on whether function is inc or dec
  start at a random point
  determine a descent direction
  pick a point and follow the gradient