Understanding AutoregressiveTransform #16
-
Hi everyone, I'm trying to understand how the `AutoregressiveTransform` works. As far as I understand, the auto-regression's purpose is to create something I would call a "triangular dependency": each output element depends only on the current and the preceding input elements.
Could someone help me understand how the code causes the autoregressive transformation? I'm not that familiar with Python, but I cannot find the auto-regression there. Or am I looking at the wrong place?
Replies: 3 comments 1 reply
-
Hello @tipf, this is a very important question! I'll start by explaining what an autoregressive transformation is formally and then how it is implemented in Zuko. Tell me if this is clear.

Formalization

Let $x$ be a vector in $\mathbb{R}^n$. An autoregressive transformation is a mapping $y = f(x) \in \mathbb{R}^n$ such that the $i$-th element of $y$ is a bijective univariate transformation of the $i$-th element of $x$, conditioned on the preceding elements. That is, $y_i = f_i(x_i \mid x_{1:i-1})$ and $x_i = f_i^{-1}(y_i \mid x_{1:i-1})$, where $x_{1:i} = (x_1, x_2, \dots, x_i)$. It is important to note that $f_i$ is only bijective with respect to $x_i$, hence the vertical bar between $x_i$ and the conditioning variables.

We can decompose the forward pass into $n$ univariate transformations $y_i = f_i(x_i \mid x_{1:i-1})$ that can all be computed in parallel, since each one only depends on elements of $x$. For the inverse pass, however, $x_i = f_i^{-1}(y_i \mid x_{1:i-1})$ depends on the previously recovered elements $x_{1:i-1}$, so the elements must be recovered sequentially: first $x_1 = f_1^{-1}(y_1)$, then $x_2 = f_2^{-1}(y_2 \mid x_1)$, and so on.

Generalization

Unfortunately, in some cases, it is not possible (or not desirable) to condition and invert the transformations $f_i$ one element at a time. Instead, all the univariate transformations are parametrized jointly from $x$ (e.g. by a masked network), under the constraint that the parameters of $f_i$ only depend on $x_{1:i-1}$. This is not much of a problem for the forward pass, as we have access to all elements of $x$ at once. For the inverse pass, starting from an arbitrary $x$ and iterating $x \gets f^{-1}(y \mid x)$ makes at least one more element of $x$ exact at each iteration, for all initializations, which leads to the exact inverse after $n$ iterations.

Zuko's implementation

In Zuko, an `AutoregressiveTransform` is built from a meta function `meta` that takes $x$ as input and returns the elementwise transformation $f(\cdot \mid x)$, together with the number of `passes` required to invert it. The forward pass is a single call `meta(x)(x)`, while the inverse pass iterates `x = meta(x).inv(y)` for `passes` iterations. This design allows for a wide range of autoregressive architectures (e.g. fully autoregressive and coupling) and does not rely on the variables' ordering, which makes it very modular. All the conditioning shenanigans (e.g. masked networks and parametrizations) are hidden in the meta function.

Example

Let's take a simple example with $n = 3$ to illustrate the principles. We have $y = f(x)$ with $y_1 = f_1(x_1)$, $y_2 = f_2(x_2 \mid x_1)$ and $y_3 = f_3(x_3 \mid x_{1:2})$, and its inverse $x_1 = f_1^{-1}(y_1)$, $x_2 = f_2^{-1}(y_2 \mid x_1)$ and $x_3 = f_3^{-1}(y_3 \mid x_{1:2})$. Let $x^{(0)}$ be an arbitrary initialization and $x^{(k+1)} = f^{-1}(y \mid x^{(k)})$. After one pass, $x^{(1)}_1 = f_1^{-1}(y_1)$ is correct, since $f_1$ has no conditioning variables. After a second pass, $x^{(2)}_2 = f_2^{-1}(y_2 \mid x^{(1)}_1)$ is also correct, since it is conditioned on a correct $x^{(1)}_1$. After a third pass, all elements are correct.
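To make the fixed-point inverse concrete, here is a minimal self-contained sketch (plain NumPy, not Zuko's API) of a toy additive autoregressive transform. The names `t`, `forward` and `inverse` are illustrative assumptions, not Zuko functions; the point is only that iterating the inverse pass fixes one more element per pass.

```python
import numpy as np

def t(x):
    # shift whose i-th entry depends only on x_1..x_{i-1}
    # (cumulative sum, shifted right by one)
    return np.concatenate(([0.0], np.cumsum(x)[:-1]))

def forward(x):
    # y_i = x_i + t_i(x_{1:i-1}), computed in one parallel pass
    return x + t(x)

def inverse(y, passes):
    # fixed-point iteration: pass k makes x_k exact
    x = np.zeros_like(y)
    for _ in range(passes):
        x = y - t(x)
    return x

x = np.array([1.0, 2.0, 3.0])
y = forward(x)                        # n = 3 example from above
x_rec = inverse(y, passes=len(x))     # recovers x after 3 passes
```

With `x = [1, 2, 3]`, one pass makes the first element exact, two passes the first two, and three passes recover `x` entirely, mirroring the example above.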
-
Hi @francois-rozet, I will try to digest everything by next week...
-
I think I have got the basic principle, at least to a certain degree... Again, thanks a lot @francois-rozet for your great explanation, you made it pretty clear. 👍

I was able to implement an inverse function that transforms a sample while keeping some of its elements constant. I added this method to `AutoregressiveTransform`:

```python
def inverse_given_partial(self, y: Tensor, x_part: Tensor) -> Tensor:
    # find length of the partial vector
    x_part_len = x_part.shape[-1]
    # copy the partial into x
    x = torch.zeros_like(y)
    x[..., :x_part_len] = x_part
    # do inverse passes for the non-partial dimensions
    for _ in range(self.passes - x_part_len):
        x = self.meta(x).inv(y)
        # overwrite the partial dims, because the partial is known
        # and we only want to update the rest
        x[..., :x_part_len] = x_part
    return x
```

It allows me to implement a conditional sampler. In this example, `x_part` holds the first elements of $x$. Since this is a rather specific use case and the API does not generalize to other transforms, I doubt that it makes sense to add it to Zuko. But maybe the code will help someone...
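To illustrate the same clamp-and-iterate idea outside of Zuko, here is a self-contained sketch on a toy additive transform (plain NumPy; `t`, `forward` and `inverse_given_partial` are illustrative names, not Zuko's API). The known elements are re-clamped after every pass so that later passes condition on the correct values.

```python
import numpy as np

def t(x):
    # shift whose i-th entry depends only on the preceding elements
    return np.concatenate(([0.0], np.cumsum(x)[:-1]))

def forward(x):
    return x + t(x)

def inverse_given_partial(y, x_part, passes):
    # keep the first len(x_part) elements of x fixed,
    # recover the remaining elements from y
    k = x_part.shape[-1]
    x = np.zeros_like(y)
    x[:k] = x_part
    for _ in range(passes - k):
        x = y - t(x)      # one inverse pass
        x[:k] = x_part    # re-clamp the known elements
    return x

y = forward(np.array([1.0, 2.0, 3.0]))          # y = f([1, 2, 3])
x_rec = inverse_given_partial(y, np.array([1.0]), passes=3)
```

Since the first element is clamped to its true value, element 2 is exact after the first pass and element 3 after the second, so `passes - k` iterations suffice.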