
gradient estimation #38

Open
slwu89 opened this issue Mar 2, 2024 · 4 comments

slwu89 commented Mar 2, 2024

For me, I need to read https://arxiv.org/abs/1906.10652 as background and then familiarize myself with https://github.com/gaurav-arya/StochasticAD.jl/tree/main and https://arxiv.org/abs/2210.08572


slwu89 commented Mar 16, 2024

The basic problem we face is computing the gradient $\eta$ of a general stochastic objective function $\mathcal{F}(\theta)$:

$$ \mathcal{F}(\theta) = \int p(x;\theta)f(x)dx = \mathbb{E}_{p(x;\theta)}[f(x)] $$

and its gradient:

$$ \eta = \nabla_{\theta}\mathcal{F}(\theta) = \nabla_{\theta} \mathbb{E}_{p(x;\theta)}[f(x)] $$

The general Monte Carlo estimator of the stochastic objective function is:

$$ \tilde{\mathcal{F}}(\theta) = \frac{1}{N}\sum_{n=1}^{N} f(x^{(n)}), \text{ where } x^{(n)} \sim p(x;\theta) $$
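As a minimal sketch of this estimator (choosing, purely for illustration, $p(x;\theta) = \mathcal{N}(\theta, 1)$ and $f(x) = x^2$, so that $\mathcal{F}(\theta) = \theta^2 + 1$ in closed form):

```python
import random

def mc_objective(theta, f, n_samples=100_000, seed=0):
    """Monte Carlo estimate of F(theta) = E_{p(x;theta)}[f(x)],
    for the illustrative choice p(x;theta) = Normal(theta, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(theta, 1.0)  # x^(n) ~ p(x; theta)
        total += f(x)
    return total / n_samples

# With f(x) = x^2 and unit variance, F(theta) = theta^2 + 1 exactly.
estimate = mc_objective(1.5, lambda x: x * x)
```

All the estimators below reuse this basic sample-average structure; only what is averaged changes.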

There are two general families of methods for developing gradient estimators of the stochastic gradient $\nabla_{\theta} \mathbb{E}_{p(x;\theta)}[f(x)]$:

  • Derivatives of measure: differentiation of the measure $p(x;\theta)$.
  • Derivatives of paths: differentiation of the objective $f(x)$, which carries the information from parameters $\theta$ through the random variate $x$ to the objective.

These families have different strengths and weaknesses owing to their construction. The review paper considers three major classes of estimators that fall into these two families.

Score function estimator

This estimator belongs to the class of derivatives of measure. It's named after the score function, the gradient of the log probability with respect to parameters: $\nabla_{\theta} \log p(x;\theta) = \frac{\nabla_{\theta} p(x;\theta)}{p(x;\theta)}$.

$$ \eta = \nabla_{\theta} \mathbb{E}_{p(x;\theta)}[f(x)] $$

$$ \nabla_{\theta} \int p(x;\theta)f(x)dx = \int f(x) \nabla_{\theta} p(x;\theta)dx $$

$$
\int f(x) p(x;\theta) \nabla_{\theta} \log p(x;\theta) dx = \mathbb{E}_{p(x;\theta)}[f(x) \nabla_{\theta} \log p(x;\theta)]
$$

This final formulation as an expectation motivates the straightforward Monte Carlo estimator:

$$ \tilde{\eta} = \frac{1}{N}\sum_{n=1}^{N} f(x^{(n)}) \nabla_{\theta} \log p(x^{(n)};\theta), \text{ where } x^{(n)} \sim p(x;\theta) $$
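A minimal sketch of the score-function estimator, under the same illustrative assumptions as before ($p(x;\theta) = \mathcal{N}(\theta, 1)$, whose score is $\nabla_{\theta} \log p(x;\theta) = x - \theta$, and $f(x) = x^2$; none of these choices come from the review itself):

```python
import random

def score_function_grad(theta, f, n_samples=200_000, seed=0):
    """Score-function (REINFORCE-style) estimate of
    d/dtheta E_{p(x;theta)}[f(x)], with p(x;theta) = Normal(theta, 1),
    whose score is grad_theta log p(x;theta) = (x - theta)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(theta, 1.0)   # x^(n) ~ p(x; theta)
        score = x - theta           # score of a unit-variance Gaussian
        total += f(x) * score
    return total / n_samples

# For f(x) = x^2, E[x^2] = theta^2 + 1, so the exact gradient is 2*theta.
grad = score_function_grad(1.5, lambda x: x * x)
```

Note that only samples of $f(x)$ and the score are needed; $f$ itself is never differentiated, which is why this estimator tolerates discontinuous or discrete $f$ at the cost of higher variance.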

Pathwise estimator

As the name suggests, this one belongs to the family of derivatives of paths. The pathwise estimator is so called because we can usually separate the sampling of the measure $\hat{x}\sim p(x;\theta)$ into a draw from a "base distribution" $\hat{\epsilon} \sim p(\epsilon)$ that does not depend on $\theta$ and a deterministic transformation (path) $\hat{x} = g(\hat{\epsilon};\theta)$ that produces the correct distribution.

$$ \eta = \nabla_{\theta} \mathbb{E}_{p(x;\theta)}[f(x)] $$

$$ \nabla_{\theta} \int p(x;\theta)f(x)dx = \nabla_{\theta} \int p(\epsilon) f(g(\epsilon; \theta)) d\epsilon $$

Which leads to

$$
\mathbb{E}_{p(\epsilon)}[ \nabla_{\theta} f(g(\epsilon;\theta)) ]
$$

The Monte Carlo estimator of the gradient is again straightforward once we get the expectation:

$$ \tilde{\eta} = \frac{1}{N}\sum_{n=1}^{N} \nabla_{\theta} f(g(\epsilon^{(n)};\theta)), \text{ where } \epsilon^{(n)} \sim p(\epsilon) $$

This one is nice when we can apply it: after the draw from the base distribution, $\theta$ enters the objective only through the deterministic path, so standard chain-rule differentiation applies.
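A minimal sketch of the pathwise estimator for the same illustrative target as above (assumed, not from the source): the location-scale path $x = g(\epsilon;\theta) = \theta + \epsilon$ with $\epsilon \sim \mathcal{N}(0,1)$ reparameterizes $\mathcal{N}(\theta,1)$, and the chain rule gives $\nabla_{\theta} f(g(\epsilon;\theta)) = f'(x)$ since $\partial g/\partial\theta = 1$:

```python
import random

def pathwise_grad(theta, df, n_samples=100_000, seed=0):
    """Pathwise (reparameterization) estimate of d/dtheta E[f(x)],
    using the path x = g(eps; theta) = theta + eps, eps ~ Normal(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)  # eps^(n) ~ p(eps), independent of theta
        x = theta + eps            # deterministic path g(eps; theta)
        total += df(x)             # chain rule: f'(x) * dg/dtheta, dg/dtheta = 1
    return total / n_samples

# f(x) = x^2 has f'(x) = 2x; the exact gradient of E[x^2] is again 2*theta.
grad = pathwise_grad(1.5, lambda x: 2.0 * x)
```

For the same target and sample budget, this typically has much lower variance than the score-function version, because the derivative of $f$ is used directly rather than inferred from correlation with the score.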

Measure-valued estimator

This also belongs to the family of derivatives of measure. It depends on some relatively new results on weak derivatives: the derivative of a density with respect to a single parameter, expressed through a specific decomposition.

$$ \nabla_{\theta_{i}}p(x;\theta) = c_{\theta_{i}}( p_{i}^{+}(x;\theta) - p_{i}^{-}(x;\theta) ) $$

where the triple $(c_{\theta_{i}}, p_{i}^{+}(x;\theta), p_{i}^{-}(x;\theta))$ is referred to as the $i$-th weak derivative of the density. It consists of a constant, and positive and negative components of the density (themselves both densities).

The estimator is defined for each index of $\theta$.

$$ \eta_{i} = \nabla_{\theta_{i}} \mathbb{E}_{p(x;\theta)} [f(x)] $$

$$ \nabla_{\theta_{i}} \int p(x;\theta) f(x) dx = \int \nabla_{\theta_{i}} p(x;\theta) f(x) dx $$

$$
c_{\theta_{i}} \left( \int f(x)p_{i}^{+}(x;\theta) dx - \int f(x)p_{i}^{-}(x;\theta) dx \right) = c_{\theta_{i}} \left( \mathbb{E}_{p_{i}^{+}(x;\theta)}[f(x)] - \mathbb{E}_{p_{i}^{-}(x;\theta)}[f(x)]\right)
$$

In the final line we see that we get the estimator as a difference in expectations, so we can formulate the Monte Carlo estimator straightforwardly.

$$ \tilde{\eta}_{i} = \frac{c_{\theta_{i}}}{N}\left( \sum_{n=1}^{N}f(\dot{x}^{(n)}) - \sum_{n=1}^{N}f(\ddot{x}^{(n)}) \right ); \text{ where } \dot{x}^{(n)}\sim p_{i}^{+}(x;\theta),\ \ddot{x}^{(n)}\sim p_{i}^{-}(x;\theta) $$
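A minimal sketch for a distribution with a known weak derivative (this example is standard but not taken from the thread): for $X \sim \text{Poisson}(\theta)$, differentiating the pmf gives $\nabla_{\theta} p(k;\theta) = p(k-1;\theta) - p(k;\theta)$, i.e. $c_{\theta} = 1$, $p^{+}$ is the law of $X + 1$, and $p^{-}$ is $\text{Poisson}(\theta)$ itself:

```python
import math
import random

def poisson_sample(rng, lam):
    """Draw from Poisson(lam) via Knuth's multiplication method
    (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= threshold:
            return k
        k += 1

def measure_valued_grad(theta, f, n_samples=200_000, seed=0):
    """Measure-valued estimate of d/dtheta E[f(X)] for X ~ Poisson(theta):
    c_theta = 1, p+ is the law of X + 1, p- is Poisson(theta)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x_plus = poisson_sample(rng, theta) + 1  # draw from p+
        x_minus = poisson_sample(rng, theta)     # independent draw from p-
        total += f(x_plus) - f(x_minus)
    return total / n_samples

# For f(x) = x^2, E[X^2] = theta + theta^2, so the exact gradient is 1 + 2*theta.
grad = measure_valued_grad(2.0, lambda x: x * x)
```

Note that nothing here requires $f$ to be differentiable or $x$ to be continuous, which is why this family handles discrete distributions naturally; the price is two sets of draws per parameter index.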


slwu89 commented Mar 16, 2024

What is StochasticAD.jl doing?

  • "Our stochastic derivatives also extend the pathwise gradient estimator to discrete programs but do so unbiasedly. The conceptual starting point of our approach is finite differences with common random numbers [21, 22, 23], whose ideas have also been extended by direct optimization [24, 25], but crucially we show how to take the exact limit of step size ε → 0 even in the discrete case."

Can we line up notation?

  • $\frac{d \mathbb{E}[X(p)]}{dp}$ is akin to $\nabla_{\theta}\mathbb{E}_{p(x;\theta)}[f(x)]$. In their notation $X$ is both the measure and the objective, parameterized by $p$.
  • By the way the "program" $X(p)\in E$, $E$ is a Euclidean space with index set $I$
  • Their "base measure" is described as $\omega \sim \mathbb{P}$ (with sample space $\Omega$, of course). This is then plugged into the stochastic program $X(p)(\omega)$ at that value of parameter space
  • The somewhat annoying notation $dX(\epsilon)$ means the differential at the point in parameter space $p$ perturbed by $\epsilon$. It's given as $dX(\epsilon) = X(p+\epsilon) - X(p)$; I believe they call the first term the derivative computation and the latter term the primal computation
  • In the core of the MC estimator for pathwise gradients is the term $\nabla_{\theta}f(g(\epsilon^{(n)};\theta))$. The issue StochasticAD.jl faces is: how do we take that gradient when $x$ is a discrete random variate? In continuous cases, one simply draws $\omega$ and then computes $\frac{ X(p+\epsilon)(\omega) - X(p)(\omega) }{\epsilon}$, which gives us the pathwise estimator
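The continuous case described in the last bullet can be sketched directly. A minimal illustration of finite differences with common random numbers, where the same $\omega$ drives both the perturbed and primal runs (the program $X(p)(\omega) = (p + \omega)^2$ is a made-up stand-in, not from the paper):

```python
import random

def crn_finite_difference(p, program, eps=1e-4, n_samples=100_000, seed=0):
    """Finite-difference estimate of d/dp E[X(p)] with common random
    numbers: one shared omega drives both program evaluations.
    `program` is a hypothetical stand-in for X: (p, omega) -> value."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        omega = rng.gauss(0.0, 1.0)  # shared randomness (the "common" part)
        total += (program(p + eps, omega) - program(p, omega)) / eps
    return total / n_samples

# Illustrative continuous program X(p)(omega) = (p + omega)^2,
# for which d/dp E[X(p)] = 2p exactly.
grad = crn_finite_difference(1.5, lambda p, w: (p + w) ** 2)
```

Sharing $\omega$ makes the difference quotient well-behaved as $\epsilon \to 0$ for continuous programs; the whole point of StochasticAD.jl is that this limit breaks down for discrete programs, motivating the construction below.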

The construction is given below, without much care for rigor. Let's say we have already made our "primal" draw $X(p)$. The derivative for discrete random variates then needs to account for the probability of a finite jump (e.g. from 0 to 1): $dX(\epsilon)/\epsilon$ is unbounded in that case, but the jump is important for discrete r.v.s. The stochastic derivative is defined as 3 random variables $(\delta,w,Y)$, where $w\in\mathbb{R}$ and $Y\in E$. $\delta$ is the almost-sure derivative (in the case of no jump); in the case of a jump the contribution is $w(Y-X(p))$, making $w$ something like the derivative of the probability of a finite jump (e.g. the flow of probability as $p$ changes), and $Y$ the distribution of finite jumps, given that a jump happens. Then the stochastic derivative of $X(p)$ at $p$ satisfies:

$$ \frac{d \mathbb{E}[X(p)]}{dp} = \mathbb{E}[\delta + w(Y-X(p))] $$
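As a concrete sanity check of this formula (my own worked example, not StochasticAD.jl's actual implementation): for $X \sim \text{Bernoulli}(p)$ realized as $X = 1\{U < p\}$, the primal is piecewise constant so $\delta = 0$; when $X = 0$, a right-perturbation of $p$ can flip $X$ to $Y = 1$ with weight $w = 1/(1-p)$, and when $X = 1$ no jump occurs:

```python
import random

def stochastic_derivative_bernoulli(p, f, n_samples=200_000, seed=0):
    """Estimate d/dp E[f(X)] for X ~ Bernoulli(p) via the stochastic
    triple (delta, w, Y): delta = 0 (piecewise-constant primal); when
    X = 0 the jump is to Y = 1 with weight w = 1/(1 - p); when X = 1
    no jump occurs under a right-perturbation of p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = 1 if rng.random() < p else 0    # primal draw X(p)
        if x == 0:
            w, y = 1.0 / (1.0 - p), 1       # jump 0 -> 1 with weight w
            total += w * (f(y) - f(x))      # delta + w*(f(Y) - f(X)), delta = 0
        # x == 1: both delta and w are zero, so no contribution
    return total / n_samples

# E[f(X)] = p*f(1) + (1-p)*f(0), so the exact derivative is f(1) - f(0) = 1.
grad = stochastic_derivative_bernoulli(0.3, lambda x: x * x + 2.0)
```

Averaging $\delta + w(f(Y) - f(X))$ over primal draws recovers $f(1) - f(0)$ unbiasedly, with no finite step size $\epsilon$ anywhere, which matches the claim that the $\epsilon \to 0$ limit is taken exactly.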

This seems to be what's implemented in the code, see first part and second part


slwu89 commented Apr 13, 2024

@adolgert we should pay attention to this: https://github.com/arnauqb/diff_abms_tutorial/tree/main


slwu89 commented Jul 9, 2024
