
gradient estimation #38

Open
slwu89 opened this issue Mar 2, 2024 · 4 comments

slwu89 commented Mar 2, 2024

For me, I need to read https://arxiv.org/abs/1906.10652 as background and then familiarize myself with https://github.com/gaurav-arya/StochasticAD.jl/tree/main and https://arxiv.org/abs/2210.08572


slwu89 commented Mar 16, 2024

The basic problem we face is computing the gradient $\eta$ of a general stochastic objective function $\mathcal{F}(\theta)$:

$$ \mathcal{F}(\theta) = \int p(x;\theta)f(x)dx = \mathbb{E}_{p(x;\theta)}[f(x)] $$

and its gradient:

$$ \eta = \nabla_{\theta}\mathcal{F}(\theta) = \nabla_{\theta} \mathbb{E}_{p(x;\theta)}[f(x)] $$

The general Monte Carlo estimator of the stochastic objective function is:

$$ \tilde{\mathcal{F}}(\theta) = \frac{1}{N}\sum_{n=1}^{N} f(x^{(n)}), \text{ where } x^{(n)} \sim p(x;\theta) $$
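As a minimal sketch of this estimator (choosing, purely for illustration, $p(x;\theta) = \mathcal{N}(\theta, 1)$ and $f(x) = x^2$, so that $\mathcal{F}(\theta) = \theta^2 + 1$ in closed form):

```python
import random

def mc_objective(theta, f, n_samples=100_000, seed=0):
    """Monte Carlo estimate of F(theta) = E_{p(x;theta)}[f(x)],
    for the illustrative choice p(x;theta) = Normal(theta, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(theta, 1.0)  # x^(n) ~ p(x; theta)
        total += f(x)
    return total / n_samples

# With f(x) = x^2 and unit variance, F(theta) = theta^2 + 1 exactly.
estimate = mc_objective(1.5, lambda x: x * x)
```

All the estimators below reuse this basic sample-average structure; only what is averaged changes.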

There are two general families of methods for developing gradient estimators of the stochastic gradient $\nabla_{\theta} \mathbb{E}_{p(x;\theta)}[f(x)]$:

  • Derivatives of measure: differentiation of the measure $p(x;\theta)$.
  • Derivatives of paths: differentiation of the objective $f(x)$, which carries the information from parameters $\theta$ through the random variate $x$ to the objective.

These families have different strengths and weaknesses owing to their construction. The review paper considers three major classes of estimators that fall into these two families.

Score function estimator

This estimator belongs to the class of derivatives of measure. It's named after the score function, the gradient of the log probability with respect to parameters: $\nabla_{\theta} \log p(x;\theta) = \frac{\nabla_{\theta} p(x;\theta)}{p(x;\theta)}$.

$$ \eta = \nabla_{\theta} \mathbb{E}_{p(x;\theta)}[f(x)] $$

$$ \nabla_{\theta} \int p(x;\theta)f(x)dx = \int f(x) \nabla_{\theta} p(x;\theta)dx $$

$$
\int f(x) p(x;\theta) \nabla_{\theta} \log p(x;\theta) dx = \mathbb{E}_{p(x;\theta)}[f(x) \nabla_{\theta} \log p(x;\theta)]
$$

This final formulation as an expectation motivates the straightforward Monte Carlo estimator:

$$ \tilde{\eta} = \frac{1}{N}\sum_{n=1}^{N} f(x^{(n)}) \nabla_{\theta} \log p(x^{(n)};\theta), \text{ where } x^{(n)} \sim p(x;\theta) $$
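A minimal sketch of the score-function estimator, under the same illustrative assumptions as before ($p(x;\theta) = \mathcal{N}(\theta, 1)$, whose score is $\nabla_{\theta} \log p(x;\theta) = x - \theta$, and $f(x) = x^2$; none of these choices come from the review itself):

```python
import random

def score_function_grad(theta, f, n_samples=200_000, seed=0):
    """Score-function (REINFORCE-style) estimate of
    d/dtheta E_{p(x;theta)}[f(x)], with p(x;theta) = Normal(theta, 1),
    whose score is grad_theta log p(x;theta) = (x - theta)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(theta, 1.0)   # x^(n) ~ p(x; theta)
        score = x - theta           # score of a unit-variance Gaussian
        total += f(x) * score
    return total / n_samples

# For f(x) = x^2, E[x^2] = theta^2 + 1, so the exact gradient is 2*theta.
grad = score_function_grad(1.5, lambda x: x * x)
```

Note that only samples of $f(x)$ and the score are needed; $f$ itself is never differentiated, which is why this estimator tolerates discontinuous or discrete $f$ at the cost of higher variance.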

Pathwise estimator

As the name suggests, this one belongs to the family of derivatives of paths. The pathwise estimator is so called because we can usually separate the sampling of the measure $\hat{x}\sim p(x;\theta)$ into a draw from a "base distribution" $\hat{\epsilon} \sim p(\epsilon)$ that does not depend on $\theta$ and a deterministic transformation (path) $\hat{x} = g(\hat{\epsilon};\theta)$ that produces the correct distribution.

$$ \eta = \nabla_{\theta} \mathbb{E}_{p(x;\theta)}[f(x)] $$

$$ \nabla_{\theta} \int p(x;\theta)f(x)dx = \nabla_{\theta} \int p(\epsilon) f(g(\epsilon; \theta)) d\epsilon $$

Which leads to

$$
\mathbb{E}_{p(\epsilon)}[ \nabla_{\theta} f(g(\epsilon;\theta)) ]
$$

The Monte Carlo estimator of the gradient is again straightforward once we get the expectation:

$$ \tilde{\eta} = \frac{1}{N}\sum_{n=1}^{N} \nabla_{\theta} f(g(\epsilon^{(n)};\theta)), \text{ where } \epsilon^{(n)} \sim p(\epsilon) $$

This one is nice when we can apply it: after the draw from the base distribution, $\theta$ enters the objective only through the deterministic path, so standard chain-rule differentiation applies.
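A minimal sketch of the pathwise estimator for the same illustrative target as above (assumed, not from the source): the location-scale path $x = g(\epsilon;\theta) = \theta + \epsilon$ with $\epsilon \sim \mathcal{N}(0,1)$ reparameterizes $\mathcal{N}(\theta,1)$, and the chain rule gives $\nabla_{\theta} f(g(\epsilon;\theta)) = f'(x)$ since $\partial g/\partial\theta = 1$:

```python
import random

def pathwise_grad(theta, df, n_samples=100_000, seed=0):
    """Pathwise (reparameterization) estimate of d/dtheta E[f(x)],
    using the path x = g(eps; theta) = theta + eps, eps ~ Normal(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)  # eps^(n) ~ p(eps), independent of theta
        x = theta + eps            # deterministic path g(eps; theta)
        total += df(x)             # chain rule: f'(x) * dg/dtheta, dg/dtheta = 1
    return total / n_samples

# f(x) = x^2 has f'(x) = 2x; the exact gradient of E[x^2] is again 2*theta.
grad = pathwise_grad(1.5, lambda x: 2.0 * x)
```

For the same target and sample budget, this typically has much lower variance than the score-function version, because the derivative of $f$ is used directly rather than inferred from correlation with the score.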

Measure-valued estimator

This also belongs to the family of derivatives of measure. It depends on some relatively new results on weak derivatives: the derivative of a density with respect to a single parameter, expressed through a specific decomposition.

$$ \nabla_{\theta_{i}}p(x;\theta) = c_{\theta_{i}}( p_{i}^{+}(x;\theta) - p_{i}^{-}(x;\theta) ) $$

where the triple $(c_{\theta_{i}}, p_{i}^{+}(x;\theta), p_{i}^{-}(x;\theta))$ is referred to as the $i$-th weak derivative of the density. It consists of a constant, and positive and negative components of the density (themselves both densities).

The estimator is defined for each index of $\theta$.

$$ \eta_{i} = \nabla_{\theta_{i}} \mathbb{E}_{p(x;\theta)} [f(x)] $$

$$ \nabla_{\theta_{i}} \int p(x;\theta) f(x) dx = \int \nabla_{\theta_{i}} p(x;\theta) f(x) dx $$

$$
c_{\theta_{i}} \left( \int f(x)p_{i}^{+}(x;\theta) dx - \int f(x)p_{i}^{-}(x;\theta) dx \right) = c_{\theta_{i}} \left( \mathbb{E}_{p_{i}^{+}(x;\theta)}[f(x)] - \mathbb{E}_{p_{i}^{-}(x;\theta)}[f(x)]\right)
$$

In the final line we see that we get the estimator as a difference in expectations, so we can formulate the Monte Carlo estimator straightforwardly.

$$ \tilde{\eta}_{i} = \frac{c_{\theta_{i}}}{N}\left( \sum_{n=1}^{N}f(\dot{x}^{(n)}) - \sum_{n=1}^{N}f(\ddot{x}^{(n)}) \right ); \text{ where } \dot{x}^{(n)}\sim p_{i}^{+}(x;\theta),\ \ddot{x}^{(n)}\sim p_{i}^{-}(x;\theta) $$
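A minimal sketch for a distribution with a known weak derivative (this example is standard but not taken from the thread): for $X \sim \text{Poisson}(\theta)$, differentiating the pmf gives $\nabla_{\theta} p(k;\theta) = p(k-1;\theta) - p(k;\theta)$, i.e. $c_{\theta} = 1$, $p^{+}$ is the law of $X + 1$, and $p^{-}$ is $\text{Poisson}(\theta)$ itself:

```python
import math
import random

def poisson_sample(rng, lam):
    """Draw from Poisson(lam) via Knuth's multiplication method
    (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= threshold:
            return k
        k += 1

def measure_valued_grad(theta, f, n_samples=200_000, seed=0):
    """Measure-valued estimate of d/dtheta E[f(X)] for X ~ Poisson(theta):
    c_theta = 1, p+ is the law of X + 1, p- is Poisson(theta)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x_plus = poisson_sample(rng, theta) + 1  # draw from p+
        x_minus = poisson_sample(rng, theta)     # independent draw from p-
        total += f(x_plus) - f(x_minus)
    return total / n_samples

# For f(x) = x^2, E[X^2] = theta + theta^2, so the exact gradient is 1 + 2*theta.
grad = measure_valued_grad(2.0, lambda x: x * x)
```

Note that nothing here requires $f$ to be differentiable or $x$ to be continuous, which is why this family handles discrete distributions naturally; the price is two sets of draws per parameter index.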


slwu89 commented Mar 16, 2024

What is StochasticAD.jl doing?

  • "Our stochastic derivatives also extend the pathwise gradient estimator to discrete programs but do so unbiasedly. The conceptual starting point of our approach is finite differences with common random numbers [21, 22, 23], whose ideas have also been extended by direct optimization [24, 25], but crucially we show how to take the exact limit of step size ε → 0 even in the discrete case."

Can we line up notation?

  • $\frac{d \mathbb{E}[X(p)]}{dp}$ is akin to $\nabla_{\theta}\mathbb{E}_{p(x;\theta)}[f(x)]$. In their notation $X$ is both the measure and the objective, parameterized by $p$.
  • By the way the "program" $X(p)\in E$, $E$ is a Euclidean space with index set $I$
  • Their "base measure" is described as $\omega \sim \mathbb{P}$ (with sample space $\Omega$, of course). This is then plugged into the stochastic program $X(p)(\omega)$ at that value of parameter space
  • The somewhat annoying notation $dX(\epsilon)$ means the differential at the point in parameter space $p$ perturbed by $\epsilon$. It's given as $dX(\epsilon) = X(p+\epsilon) - X(p)$; I believe they call the first term the derivative computation and the latter term the primal computation
  • In the core of the MC estimator for pathwise gradients is the term $\nabla_{\theta}f(g(\epsilon^{(n)};\theta))$. The issue StochasticAD.jl faces is: how do we take that gradient when $x$ is a discrete random variate? In continuous cases, one simply draws $\omega$ and then computes $\frac{ X(p+\epsilon)(\omega) - X(p)(\omega) }{\epsilon}$, which gives us the pathwise estimator
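The continuous case described in the last bullet can be sketched directly. A minimal illustration of finite differences with common random numbers, where the same $\omega$ drives both the perturbed and primal runs (the program $X(p)(\omega) = (p + \omega)^2$ is a made-up stand-in, not from the paper):

```python
import random

def crn_finite_difference(p, program, eps=1e-4, n_samples=100_000, seed=0):
    """Finite-difference estimate of d/dp E[X(p)] with common random
    numbers: one shared omega drives both program evaluations.
    `program` is a hypothetical stand-in for X: (p, omega) -> value."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        omega = rng.gauss(0.0, 1.0)  # shared randomness (the "common" part)
        total += (program(p + eps, omega) - program(p, omega)) / eps
    return total / n_samples

# Illustrative continuous program X(p)(omega) = (p + omega)^2,
# for which d/dp E[X(p)] = 2p exactly.
grad = crn_finite_difference(1.5, lambda p, w: (p + w) ** 2)
```

Sharing $\omega$ makes the difference quotient well-behaved as $\epsilon \to 0$ for continuous programs; the whole point of StochasticAD.jl is that this limit breaks down for discrete programs, motivating the construction below.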

The construction is given below, without much care for rigor. Let's say we have already made our "primal" draw $X(p)$. The derivative for discrete random variates then needs to account for the probability of a finite jump (e.g. from 0 to 1): $dX(\epsilon)/\epsilon$ is unbounded in that case, but the jump is important for discrete r.v.s. The stochastic derivative is defined as 3 random variables $(\delta,w,Y)$, where $w\in\mathbb{R}$ and $Y\in E$. $\delta$ is the almost-sure derivative (in the case of no jump); in the case of a jump the contribution is $w(Y-X(p))$, making $w$ something like the derivative of the probability of a finite jump (e.g. the flow of probability as $p$ changes), and $Y$ the distribution of finite jumps, given that a jump happens. Then the stochastic derivative of $X(p)$ at $p$ satisfies:

$$ \frac{d \mathbb{E}[X(p)]}{dp} = \mathbb{E}[\delta + w(Y-X(p))] $$
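As a concrete sanity check of this formula (my own worked example, not StochasticAD.jl's actual implementation): for $X \sim \text{Bernoulli}(p)$ realized as $X = 1\{U < p\}$, the primal is piecewise constant so $\delta = 0$; when $X = 0$, a right-perturbation of $p$ can flip $X$ to $Y = 1$ with weight $w = 1/(1-p)$, and when $X = 1$ no jump occurs:

```python
import random

def stochastic_derivative_bernoulli(p, f, n_samples=200_000, seed=0):
    """Estimate d/dp E[f(X)] for X ~ Bernoulli(p) via the stochastic
    triple (delta, w, Y): delta = 0 (piecewise-constant primal); when
    X = 0 the jump is to Y = 1 with weight w = 1/(1 - p); when X = 1
    no jump occurs under a right-perturbation of p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = 1 if rng.random() < p else 0    # primal draw X(p)
        if x == 0:
            w, y = 1.0 / (1.0 - p), 1       # jump 0 -> 1 with weight w
            total += w * (f(y) - f(x))      # delta + w*(f(Y) - f(X)), delta = 0
        # x == 1: both delta and w are zero, so no contribution
    return total / n_samples

# E[f(X)] = p*f(1) + (1-p)*f(0), so the exact derivative is f(1) - f(0) = 1.
grad = stochastic_derivative_bernoulli(0.3, lambda x: x * x + 2.0)
```

Averaging $\delta + w(f(Y) - f(X))$ over primal draws recovers $f(1) - f(0)$ unbiasedly, with no finite step size $\epsilon$ anywhere, which matches the claim that the $\epsilon \to 0$ limit is taken exactly.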

This seems to be what's implemented in the code, see first part and second part


slwu89 commented Apr 13, 2024

@adolgert we should pay attention to this: https://github.com/arnauqb/diff_abms_tutorial/tree/main


slwu89 commented Jul 9, 2024
