Many familiar distributions, like the ones we covered in Lecture 1, are exponential family distributions. As Brad Efron likes to say, exponential family distributions bridge the gap between the Gaussian family and general distributions. For Gaussian distributions, we have exact small-sample distributional results, whereas for general distributions we typically have to settle for asymptotic approximations; exponential families retain much of the Gaussian family's analytical tractability while covering a far broader class of models.
Exponential family distributions have densities of the form,
\begin{align*}
p(y \mid \eta) &= h(y) \exp \left\{ \langle t(y), \eta \rangle - A(\eta) \right\},
\end{align*}
where
- $h(y): \cY \to \reals_+$ is the base measure,
- $t(y) \in \reals^T$ are the sufficient statistics,
- $\eta \in \reals^T$ are the natural parameters, and
- $A(\eta): \reals^T \to \reals$ is the log normalizer (aka the log partition function).
The log normalizer ensures that the density is properly normalized,
\begin{align*}
A(\eta) &= \log \int h(y) \exp \left\{ \langle t(y), \eta \rangle \right\} \dif y.
\end{align*}
The domain of the exponential family is the set of valid natural parameters, $\Omega = \{\eta \in \reals^T : A(\eta) < \infty\}$.
Consider the scalar Gaussian distribution,
\begin{align*}
\mathrm{N}(y; \mu, \sigma^2) &= \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2\sigma^2} (y - \mu)^2} \\
&= \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2\sigma^2} (y^2 - 2 y \mu + \mu^2)}.
\end{align*}
With the variance $\sigma^2$ treated as known, we can write this as an exponential family distribution where
- the base measure is $h(y) = \frac{e^{-\frac{y^2}{2 \sigma^2}}}{\sqrt{2 \pi \sigma^2}}$
- the sufficient statistics are $t(y) = \frac{y}{\sigma}$
- the natural parameter is $\eta = \frac{\mu}{\sigma}$
- the log normalizer is $A(\eta) = \frac{\eta^2}{2}$
- the domain is $\Omega = \reals$
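As a quick numerical check, here is a small sketch (assuming NumPy and SciPy are available) that evaluates this exponential family form and compares it to `scipy.stats.norm.pdf`.

```python
import numpy as np
from scipy.stats import norm

# Scalar Gaussian with known variance, in exponential family form:
#   p(y | eta) = h(y) * exp{ t(y) * eta - A(eta) }
mu, sigma = 1.5, 2.0
eta = mu / sigma                                     # natural parameter
A = lambda e: 0.5 * e ** 2                           # log normalizer
h = lambda y: np.exp(-y ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
t = lambda y: y / sigma                              # sufficient statistic

ys = np.linspace(-5.0, 8.0, 101)
p_expfam = h(ys) * np.exp(t(ys) * eta - A(eta))
assert np.allclose(p_expfam, norm.pdf(ys, loc=mu, scale=sigma))
```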
The Bernoulli distribution can be written in exponential family form,
\begin{align*}
\mathrm{Bern}(y; p) &= p^{y} \, (1-p)^{1 - y} \\
&= \exp \left\{ y \log p + (1-y) \log (1 - p) \right\} \\
&= \exp \left\{ y \log \frac{p}{1 - p} + \log (1 - p) \right\} \\
&= h(y) \exp \left\{ y \eta - A(\eta) \right\}
\end{align*}
where
- the base measure is $h(y) = 1$
- the sufficient statistics are $t(y) = y$
- the natural parameter is $\eta = \log \frac{p}{1 - p}$
- the log normalizer is
  \begin{align*}
  A(\eta) &= -\log (1 - p) \\
  &= -\log \left(1 - \frac{e^{\eta}}{1 + e^{\eta}} \right) \\
  &= -\log \frac{1}{1 + e^{\eta}} \\
  &= \log \left(1 + e^{\eta} \right).
  \end{align*}
- the domain is $\Omega = \reals$
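A similar sketch (again assuming NumPy) confirms that this form recovers the Bernoulli pmf and that the derivative of the log normalizer is the sigmoid, i.e., the mean of $t(Y) = Y$.

```python
import numpy as np

p = 0.3
eta = np.log(p / (1 - p))           # natural parameter (log odds)
A = np.log1p(np.exp(eta))           # log normalizer, log(1 + e^eta)

# The exponential family form h(y) exp{y * eta - A(eta)} recovers the pmf.
for y in (0, 1):
    assert np.isclose(np.exp(y * eta - A), p if y == 1 else 1 - p)

# The derivative of A is the sigmoid, i.e., the mean of t(Y) = Y.
assert np.isclose(np.exp(eta) / (1 + np.exp(eta)), p)
```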
Likewise, take the Poisson pmf,
\begin{align*}
\mathrm{Po}(y; \lambda) &= \frac{1}{y!} \lambda^{y} e^{-\lambda} \\
&= \frac{1}{y!} \exp \left\{ y \log \lambda - \lambda \right\} \\
&= h(y) \exp \left\{ y \eta - A(\eta) \right\}
\end{align*}
where
- the base measure is $h(y) = \frac{1}{y!}$
- the sufficient statistics are $t(y) = y$
- the natural parameter is $\eta = \log \lambda$
- the log normalizer is $A(\eta) = \lambda = e^\eta$
- the domain is $\Omega = \reals$
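And likewise for the Poisson, a sketch (assuming NumPy and SciPy) comparing the exponential family form against `scipy.stats.poisson.pmf`.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import poisson

lam = 3.5
eta = np.log(lam)                    # natural parameter
A = np.exp(eta)                      # log normalizer, e^eta = lambda

ys = np.arange(20)
log_h = -gammaln(ys + 1)             # log h(y) = log(1 / y!)
log_p = log_h + ys * eta - A         # exponential family form
assert np.allclose(np.exp(log_p), poisson.pmf(ys, lam))
```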
Finally, take the categorical pmf for $y \in \{1, \ldots, K\}$ with parameter $\mbpi = (\pi_1, \ldots, \pi_K)^\top$,
\begin{align*}
\mathrm{Cat}(y; \mbpi) &= \prod_{k=1}^K \pi_k^{\bbI[y = k]}
= h(y) \exp \left\{ \langle \mbe_y, \log \mbpi \rangle \right\}
\end{align*}
where
- the base measure is $h(y) = \bbI[y \in \{1,\ldots,K\}]$
- the sufficient statistics are $t(y) = \mbe_y$, the one-hot vector representation of $y$
- the natural parameter is $\mbeta = \log \mbpi = (\log \pi_1, \ldots, \log \pi_K)^\top \in \reals^K$
- the log normalizer is $A(\mbeta) = 0$
- the domain is $\Omega = \reals^K$
The cumulant generating function — i.e., the log of the moment generating function — is a difference of log normalizers,
\begin{align*}
\log \E_\eta[e^{\langle t(Y), \theta \rangle}]
&= \log \int h(y) \exp \left\{ \langle t(y), \eta + \theta \rangle - A(\eta) \right\} \dif y \\
&= \log e^{A(\eta + \theta) - A(\eta)} \\
&= A(\eta + \theta) - A(\eta) \\
&\triangleq K_\eta(\theta)
\end{align*}
Its derivatives with respect to $\theta$, evaluated at $\theta = 0$, yield the cumulants of the sufficient statistics:
- $\nabla_\theta K_\eta(0) = \nabla A(\eta)$ yields the first cumulant of $t(Y)$, its mean, and
- $\nabla^2_\theta K_\eta(0) = \nabla^2 A(\eta)$ yields the second cumulant, its covariance.
Higher order cumulants can be used to compute skewness, kurtosis, etc.
We can also obtain this result more directly.
\begin{align*}
\nabla A(\eta) &= \nabla \log \int h(y) \exp \left\{ \langle t(y), \eta \rangle \right\} \dif y \\
&= \frac{\int h(y) \exp \left\{ \langle t(y), \eta \rangle \right\} t(y) \dif y}{\int h(y) \exp \left\{ \langle t(y), \eta \rangle \right\} \dif y} \\
&= \int p(y \mid \eta) \, t(y) \dif y \\
&= \E_\eta[t(Y)]
\end{align*}
Again, the gradient of the log normalizer yields the expected sufficient statistics.
The Hessian of the log normalizer yields the covariance of the sufficient statistics,
\begin{align*}
\nabla^2 A(\eta) &= \nabla \int p(y \mid \eta) \, t(y) \dif y \\
&= \int p(y \mid \eta) \, t(y) \, (t(y) - \nabla A(\eta))^\top \dif y \\
&= \E[t(Y) t(Y)^\top] - \E[t(Y)] \, \E[t(Y)]^\top \\
&= \mathrm{Cov}[t(Y)]
\end{align*}
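These identities are easy to check numerically. The sketch below (assuming NumPy) uses finite differences of the Poisson log normalizer $A(\eta) = e^\eta$, for which the mean and variance of $t(Y) = Y$ are both $\lambda$.

```python
import numpy as np

A = lambda eta: np.exp(eta)          # Poisson log normalizer
eta = np.log(4.0)                    # natural parameter for lambda = 4
lam = np.exp(eta)
eps = 1e-5

# First derivative of A gives the mean of t(Y) = Y, which is lambda.
dA = (A(eta + eps) - A(eta - eps)) / (2 * eps)
# Second derivative of A gives the variance of t(Y), also lambda for a Poisson.
d2A = (A(eta + eps) - 2 * A(eta) + A(eta - eps)) / eps ** 2

assert np.isclose(dA, lam, rtol=1e-6)
assert np.isclose(d2A, lam, rtol=1e-3)
```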
Suppose we have i.i.d. observations $y_1, \ldots, y_n \sim p(y \mid \eta)$ and we wish to estimate the natural parameters by maximum likelihood. The log likelihood is
\begin{align*}
\sum_{i=1}^n \log p(y_i \mid \eta)
&= \sum_{i=1}^n \log h(y_i) + \left\langle \sum_{i=1}^n t(y_i), \eta \right\rangle - n A(\eta),
\end{align*}
which is a concave function of the natural parameters.
Since the log normalizer is convex, all local optima are global. If the log normalizer is strictly convex, the MLE will be unique.
Setting the gradient to zero and solving yields the stationary conditions for the MLE,
\begin{align*}
\nabla A(\hat{\eta}_{\mathsf{MLE}}) &= \bbE[t(Y); \hat{\eta}_{\mathsf{MLE}}]
= \frac{1}{n} \sum_{i=1}^n t(y_i).
\end{align*}
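For example, for the Poisson family the stationary condition $e^{\hat{\eta}} = \frac{1}{n} \sum_i y_i$ gives $\hat{\eta}_{\mathsf{MLE}} = \log \bar{y}$ in closed form. The sketch below (assuming NumPy and SciPy) confirms this against a generic convex optimizer.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.poisson(lam=3.0, size=1000)

# Closed form: grad A(eta) = e^eta must equal the average sufficient statistic.
eta_closed = np.log(y.mean())

# Numerical check: minimize the (convex) negative log likelihood in eta,
# dropping the constant sum of log h(y_i) terms.
nll = lambda eta: float(len(y) * np.exp(eta[0]) - y.sum() * eta[0])
grad = lambda eta: np.array([len(y) * np.exp(eta[0]) - y.sum()])
eta_numeric = minimize(nll, x0=[0.0], jac=grad).x[0]

assert np.isclose(eta_closed, eta_numeric, atol=1e-4)
```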
When the gradient mapping $\nabla A$ is invertible (as it is for minimal families, discussed below), we can solve for the MLE in closed form, $\hat{\eta}_{\mathsf{MLE}} = [\nabla A]^{-1} \big( \tfrac{1}{n} \sum_{i=1}^n t(y_i) \big)$.
Recall that the MLE is asymptotically normal with variance given by the inverse Fisher information,
\begin{align*}
\cI(\eta) &= -\E[\nabla^2 \log p(y_i; \eta)] = \nabla^2 A(\eta) = \Cov_\eta[t(Y)].
\end{align*}
Thus, the asymptotic covariance of $\hat{\eta}_{\mathsf{MLE}}$ is $\tfrac{1}{n} \cI(\eta)^{-1} = \tfrac{1}{n} \Cov_\eta[t(Y)]^{-1}$.
The Hessian of the log normalizer gives the covariance of the sufficient statistics. Since covariance matrices are positive semi-definite, the log normalizer is a convex function on its domain $\Omega$.
If the covariance is strictly positive definite — i.e., if the minimum eigenvalue of $\Cov_\eta[t(Y)]$ is strictly positive for all $\eta \in \Omega$ — then the log normalizer is strictly convex. This holds when no nontrivial linear combination of the sufficient statistics is constant almost surely, in which case we say the exponential family representation is minimal.
::::{admonition} Question
Is the exponential family representation of the categorical distribution above a minimal representation? If not, how could you encode it in minimal form?
:::{admonition} Answer
:class: dropdown
The categorical representation above is not minimal because the log normalizer is identically zero, and hence it is not strictly convex. The problem stems from the fact that the natural parameters are over-complete: the one-hot sufficient statistics always sum to one, so one coordinate of $t(y)$ (and the corresponding natural parameter) is redundant.
Instead, we could parameterize the categorical distribution in terms of the log probabilities for only the first $K - 1$ categories, relative to the last. In this representation,
- the base measure is $h(y) = \bbI[y \in \{1,\ldots,K\}]$
- the sufficient statistics are $\mbt(y) = (\bbI[y=1], \ldots, \bbI[y=K-1])^\top$
- the natural parameter is $\mbeta = (\eta_1, \ldots, \eta_{K-1})^\top \in \reals^{K-1}$ where $\eta_k = \log \frac{\pi_k}{1 - \sum_{j=1}^{K-1} \pi_j}$ are the logits
- the log normalizer is $A(\mbeta) = \log \left(1 + \sum_{k=1}^{K-1} e^{\eta_k} \right)$
- the domain is $\Omega = \reals^{K-1}$
:::
::::
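To make the minimal representation concrete, here is a small sketch (assuming NumPy) that maps a probability vector to the $K-1$ logits and back, using the log normalizer above.

```python
import numpy as np

pi = np.array([0.2, 0.5, 0.3])                   # K = 3 category probabilities

# Minimal natural parameters: log odds relative to the last category.
eta = np.log(pi[:-1] / pi[-1])                   # shape (K - 1,)
A = np.log1p(np.exp(eta).sum())                  # log(1 + sum_k e^{eta_k})

# Recover all K probabilities from the minimal representation.
pi_recovered = np.append(np.exp(eta - A), np.exp(-A))
assert np.allclose(pi_recovered, pi)
```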
When constructing models with exponential family distributions, like the generalized linear models below, it is often more convenient to work with the mean parameters instead. The mean parameters are the expected sufficient statistics, $\mu \triangleq \E[t(Y)]$, and we denote the set of realizable mean parameters by $\cM = \{\mu \in \reals^T : \mu = \E_p[t(Y)] \text{ for some distribution } p \text{ on } \cY\}$.
Two facts:
- The gradient mapping $\nabla A: \Omega \mapsto \cM$ is injective (one-to-one) if and only if the exponential family is minimal.
- The gradient maps the natural parameters onto the interior of $\cM$: every mean parameter in the interior of $\cM$ (excluding the boundary) can be realized by an exponential family distribution. (Mean parameters on the boundary of $\cM$ can be realized by a limiting sequence of exponential family distributions.)
Together, these facts imply that for minimal exponential families, the gradient of the log normalizer defines a bijective map from the domain of natural parameters $\Omega$ to the interior of the set of mean parameters $\cM$.
For minimal families, we can work with the mean parameterization instead,
\begin{align*}
p(y; \mu)
&= h(y) \exp \left\{ \langle t(y), [\nabla A]^{-1}(\mu) \rangle - A([\nabla A]^{-1}(\mu)) \right\},
\end{align*}
for mean parameters $\mu$ in the interior of $\cM$.
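For instance, for the Bernoulli family the forward map $\nabla A(\eta) = e^\eta / (1 + e^\eta)$ is the sigmoid, and its inverse is the logit. The sketch below (assuming NumPy and SciPy) moves between the two parameterizations and evaluates the density in mean-parameter form.

```python
import numpy as np
from scipy.special import expit, logit

# Bernoulli family: grad A(eta) = sigmoid(eta) maps natural -> mean parameters,
# and its inverse, the logit, maps mean -> natural parameters.
eta = 0.7
mu = expit(eta)                        # mean parameter, E[t(Y)] = p
assert np.isclose(logit(mu), eta)      # [grad A]^{-1} recovers eta

# Density in mean-parameter form: p(y; mu) = exp{y * logit(mu) - A(logit(mu))}.
A = lambda e: np.log1p(np.exp(e))      # log normalizer
for y in (0, 1):
    p_y = np.exp(y * logit(mu) - A(logit(mu)))
    assert np.isclose(p_y, mu if y == 1 else 1 - mu)
```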
Alternatively, consider the maximum likelihood estimate of the mean parameter, $\hat{\mu}_{\mathsf{MLE}} = \frac{1}{n} \sum_{i=1}^n t(y_i)$. To characterize its asymptotic covariance, we need the Fisher information with respect to $\mu$, which in turn involves the Jacobian of the map $\mu \mapsto \eta = [\nabla A]^{-1}(\mu)$.
Now back to the Jacobian... applying the inverse function theorem shows that it equals the inverse covariance matrix,
\begin{align*}
\frac{\partial \eta}{\partial \mu} (\mu) = \frac{\partial [\nabla A]^{-1}}{\partial \mu} (\mu) = [\nabla^2 A ([\nabla A]^{-1}(\mu))]^{-1} = \Cov_{\eta(\mu)}[t(Y)]^{-1},
\end{align*}
which is indeed positive definite for minimal exponential families.
We obtain the Fisher information of the mean parameter by the usual change-of-variables formula,
\begin{align*}
\cI(\mu) &= \left(\frac{\partial \eta}{\partial \mu}\right)^\top \cI(\eta(\mu)) \left(\frac{\partial \eta}{\partial \mu}\right)
= \Cov_{\eta(\mu)}[t(Y)]^{-1} \, \Cov_{\eta(\mu)}[t(Y)] \, \Cov_{\eta(\mu)}[t(Y)]^{-1}
= \Cov_{\eta(\mu)}[t(Y)]^{-1}.
\end{align*}
Thus, the MLE of the mean parameter is asymptotically normal with covariance determined by the inverse Fisher information, $\tfrac{1}{n} \cI(\mu)^{-1} = \tfrac{1}{n} \Cov_{\eta(\mu)}[t(Y)]$.
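As a sanity check for the Poisson family (a sketch assuming NumPy): the Jacobian $\partial \eta / \partial \mu = 1/\mu$ is indeed the reciprocal of $\Var[t(Y)] = \lambda$, and a quick Monte Carlo experiment recovers the asymptotic variance $\lambda / n$ of $\hat{\mu}_{\mathsf{MLE}}$.

```python
import numpy as np

rng = np.random.default_rng(1)
lam, n, reps = 4.0, 200, 5_000

# For the Poisson, mu = grad A(eta) = e^eta, so eta(mu) = log(mu) and the
# Jacobian d eta / d mu = 1 / mu equals Cov[t(Y)]^{-1} = 1 / lambda.
mu, eps = lam, 1e-6
jac_fd = (np.log(mu + eps) - np.log(mu - eps)) / (2 * eps)
assert np.isclose(jac_fd, 1.0 / lam)

# Monte Carlo check of the asymptotic variance of the mean-parameter MLE:
# Var[mu_hat] ~= I(mu)^{-1} / n = Cov[t(Y)] / n = lambda / n.
mu_hats = rng.poisson(lam, size=(reps, n)).mean(axis=1)
assert np.isclose(mu_hats.var(), lam / n, rtol=0.1)
```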
:::{note}
Compare this result to the asymptotic covariances we computed in Lecture 1 for the Bernoulli distribution. Recall that for a Bernoulli with mean $p$, the MLE $\hat{p} = \frac{1}{n} \sum_{i=1}^n y_i$ has asymptotic variance $\tfrac{1}{n} p(1-p)$, which is exactly $\tfrac{1}{n} \Cov_\eta[t(Y)]$ since $t(Y) = Y$ has variance $p(1-p)$.
:::
The Kullback-Leibler (KL) divergence, or relative entropy, between two distributions is,
\begin{align*}
\KL{p}{q} &= \E_{p}\left[\log \frac{p(Y)}{q(Y)} \right].
\end{align*}
It is non-negative and equal to zero if and only if $p = q$ almost everywhere.
When $p$ and $q$ belong to the same exponential family with natural parameters $\eta_p$ and $\eta_q$, respectively, the KL divergence has a simple closed form,
\begin{align*}
\KL{p}{q}
&= \E_p\left[\langle t(Y), \eta_p \rangle - A(\eta_p) - \langle t(Y), \eta_q \rangle + A(\eta_q) \right] \\
&= \langle \nabla A(\eta_p), \eta_p - \eta_q \rangle - A(\eta_p) + A(\eta_q),
\end{align*}
where we used $\E_p[t(Y)] = \nabla A(\eta_p)$ and the fact that the base measures cancel.
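This formula is easy to implement generically. Below is a sketch (assuming NumPy and SciPy) with an illustrative helper, `expfam_kl`, that takes the log normalizer and its gradient and is checked against a direct summation for the Poisson family.

```python
import numpy as np
from scipy.stats import poisson

def expfam_kl(eta_p, eta_q, A, grad_A):
    """KL(p || q) between two members of the same (scalar) exponential family."""
    return grad_A(eta_p) * (eta_p - eta_q) - A(eta_p) + A(eta_q)

# Poisson family: A(eta) = e^eta, so grad A(eta) = e^eta as well.
lam_p, lam_q = 3.0, 5.0
kl_closed = expfam_kl(np.log(lam_p), np.log(lam_q), np.exp, np.exp)

# Direct check: sum p(y) log[p(y) / q(y)] over the (effectively finite) support.
ys = np.arange(100)
kl_direct = np.sum(poisson.pmf(ys, lam_p)
                   * (poisson.logpmf(ys, lam_p) - poisson.logpmf(ys, lam_q)))
assert np.isclose(kl_closed, kl_direct)
```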
::::{admonition} Example: Poisson Distribution
Consider the Poisson distribution with mean $\lambda$. Recall that in exponential family form it has
- sufficient statistics $t(y) = y$
- natural parameter $\eta = \log \lambda$
- log normalizer $A(\eta) = e^\eta$
Derive the KL divergence between two Poisson distributions with means $\lambda_p$ and $\lambda_q$ and natural parameters $\eta_p = \log \lambda_p$ and $\eta_q = \log \lambda_q$, respectively.
:::{admonition} Answer
:class: tip, dropdown
The KL divergence is,
\begin{align*}
\KL{p}{q} &= \langle e^{\eta_p}, \eta_p - \eta_q \rangle - e^{\eta_p} + e^{\eta_q} \\
&= \lambda_p \log \frac{\lambda_p}{\lambda_q} - \lambda_p + \lambda_q.
\end{align*}
:::
::::
::::{admonition} Example: Gaussian Distribution
Consider the scalar Gaussian distribution with known variance $\sigma^2$. Recall that in exponential family form it has
- sufficient statistics $t(y) = \frac{y}{\sigma}$
- natural parameter $\eta = \frac{\mu}{\sigma}$
- log normalizer $A(\eta) = \frac{\eta^2}{2}$
Derive the KL divergence between two Gaussians with equal variance. Denote their natural parameters by $\eta_p = \mu_p / \sigma$ and $\eta_q = \mu_q / \sigma$, respectively.
:::{admonition} Answer
:class: tip, dropdown
The KL divergence is,
\begin{align*}
\KL{p}{q} &= \langle \eta_p, \eta_p - \eta_q \rangle - \frac{\eta_p^2}{2} + \frac{\eta_q^2}{2} \\
&= \frac{\eta_p^2}{2} - \eta_p \eta_q + \frac{\eta_q^2}{2} \\
&= \frac{1}{2} (\eta_p - \eta_q)^2 \\
&= \frac{1}{2 \sigma^2} (\mu_p - \mu_q)^2.
\end{align*}
Note that here, the KL is a symmetric function.
:::
::::
Rearranging terms, we can view the KL divergence as the remainder in a first-order Taylor approximation of the log normalizer,
\begin{align*}
A(\eta_q)
&= A(\eta_p) + (\eta_q - \eta_p)^\top \nabla A(\eta_p) + \KL{p}{q}.
\end{align*}
From this perspective, we see that the KL divergence is related to the Fisher information,
\begin{align*}
\KL{p}{q}
&\approx \frac{1}{2} (\eta_q - \eta_p)^\top \nabla^2 A(\eta_p) (\eta_q - \eta_p) \\
&= \frac{1}{2} (\eta_q - \eta_p)^\top \cI(\eta_p) (\eta_q - \eta_p),
\end{align*}
up to terms of order $\|\eta_q - \eta_p\|^3$.
Thus, while the KL divergence is not a distance metric due to its asymmetry, it is approximately a squared distance under the Fisher information metric,
\begin{align*}
2 \KL{p}{q} \approx \|\eta_q - \eta_p\|_{\cI(\eta_p)}^2.
\end{align*}
We call this quantity the deviance. It is simply twice the KL divergence.
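The quadratic approximation is easy to verify numerically; here is a small sketch (assuming NumPy) for two nearby Poisson distributions.

```python
import numpy as np

# Two nearby Poisson distributions in natural parameter space.
eta_p, eta_q = np.log(3.0), np.log(3.0) + 0.05

# Exact KL from the closed form, and the Fisher quadratic approximation.
kl = np.exp(eta_p) * (eta_p - eta_q) - np.exp(eta_p) + np.exp(eta_q)
fisher = np.exp(eta_p)                             # I(eta_p) = Var[t(Y)] = lambda_p
approx = 0.5 * fisher * (eta_q - eta_p) ** 2       # half the squared Fisher distance

assert np.isclose(2 * kl, 2 * approx, rtol=0.05)   # deviance ~= squared distance
```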
(expfam:deviance_residuals)=
In a normal model, the standardized residual is $r(\hat{\mu}, \mu) = \frac{\hat{\mu} - \mu}{\sigma}$. From the Gaussian example above, this is exactly the signed square root of the deviance, $\mathrm{sign}(\hat{\mu} - \mu) \sqrt{2 \KL{\hat{\mu}}{\mu}}$.
The same form generalizes to other exponential families as well, with the deviance residual between the true and estimated mean parameters defined as,
\begin{align*}
r_{\mathsf{D}}(\hat{\mu}, \mu) &= \mathrm{sign}(\hat{\mu} - \mu) \sqrt{2 \KL{\hat{\mu}}{\mu}}.
\end{align*}
One can show that deviance residuals tend to be closer to normal than the more obvious Pearson residuals,
\begin{align*}
r_{\mathsf{P}}(\hat{\mu}, \mu) &= \frac{\hat{\mu} - \mu}{\sqrt{\Var[t(Y); \hat{\mu}]}}.
\end{align*}
For more on deviance residuals, see {cite:t}`efron2022exponential`, ch. 1.
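As an illustration of that claim, the sketch below (assuming NumPy and SciPy) compares the two kinds of residuals for Poisson data, treating each observation as a plug-in estimate $\hat{\mu} = y$ of the true mean. To keep things simple (and avoid dividing by zero at $y = 0$), the Pearson denominator here uses the variance at the true mean rather than at $\hat{\mu}$. The deviance residuals come out markedly less skewed.

```python
import numpy as np
from scipy.special import xlogy
from scipy.stats import skew

rng = np.random.default_rng(2)
lam = 2.0                                     # true Poisson mean
y = rng.poisson(lam, size=100_000).astype(float)

# Deviance residuals: sign(mu_hat - mu) * sqrt(2 KL(mu_hat || mu)) with mu_hat = y.
# For the Poisson, 2 KL = 2 (mu_hat log(mu_hat / mu) - mu_hat + mu), with 0 log 0 = 0.
two_kl = 2 * (xlogy(y, y / lam) - y + lam)
deviance_resid = np.sign(y - lam) * np.sqrt(two_kl)

# Pearson residuals, with the variance evaluated at the true mean for simplicity.
pearson_resid = (y - lam) / np.sqrt(lam)

# Deviance residuals are much closer to normal: their skewness is near zero,
# while the Pearson residuals inherit the Poisson skewness 1/sqrt(lambda) ~ 0.7.
assert abs(skew(deviance_resid)) < abs(skew(pearson_resid))
print(skew(deviance_resid), skew(pearson_resid))
```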
Exponential family distributions have many beautiful properties, and we've only scratched the surface in this chapter.
- We'll see other nice properties when we talk about building probabilistic models for joint distributions of random variables using exponential family distributions, and conjugate relationships between exponential families will simplify many aspects of Bayesian inference.
- We'll also see that inference in exponential families is closely connected to convex optimization — we saw that today for the MLE! — but for more complex models, the optimization problems can still be computationally intractable, even though they are convex. That will motivate our discussion of variational inference later in the course.
Armed with exponential family distributions, we can start to build more expressive models for categorical data. First up, generalized linear models!