Update wording in the tutorials #48

Merged
24 changes: 19 additions & 5 deletions docs/tutorials/basics.ipynb
@@ -6,7 +6,14 @@
"source": [
"# Learn the basics\n",
"\n",
"This notebook walks you through the basics of PyTorch/Zuko distributions and transformations, how to parametrize probabilistic models, how to instantiate pre-built normalizing flows and finally how to create custom flow architectures. Training is covered in other tutorials."
"This notebook walks you through \n",
"\n",
"- the basics of PyTorch/Zuko distributions and transformations, \n",
"- how to parametrize probabilistic models, \n",
"- how to instantiate pre-built normalizing flows and finally \n",
"- how to create custom flow architectures. \n",
"\n",
"Training is covered in subsequent tutorials. This tutorial requires two central imports:"
]
},
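For reference, the two central imports referred to above are presumably the following (a minimal sketch; the rendered notebook may list them slightly differently):

```python
import torch
import zuko
```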
{
@@ -94,7 +101,7 @@
"\n",
"$$ p(X = x) = p(Z = f(x)) \\left| \\det \\frac{\\partial f(x)}{\\partial x} \\right| $$\n",
"\n",
"and sampling from $p(X)$ can be performed by first drawing realizations $z \\sim p(Z)$ and then applying the inverse transformation $x = f^{-1}(z)$. Such combination of a base distribution and a bijective transformation is sometimes called a *normalizing flow* as the base distribution is often standard normal."
"and sampling from $p(X)$ can be performed by first drawing realizations $z \\sim p(Z)$ and then applying the inverse transformation $x = f^{-1}(z)$. Such combination of a base distribution and a bijective transformation is sometimes called a *normalizing flow*. The name indicates that the base distribution is a standard *normal* distribution."
]
},
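As a concrete illustration of the change-of-variables formula above (not part of the notebook diff), a base distribution and an affine bijection can be combined with plain PyTorch; the numbers and the transformation are illustrative assumptions:

```python
import torch
from torch.distributions import AffineTransform, Normal, TransformedDistribution

# Base distribution p(Z) and the sampling map x = 2 * z + 1,
# i.e. f(x) = (x - 1) / 2 with |det df/dx| = 1/2.
base = Normal(0.0, 1.0)
transform = AffineTransform(loc=1.0, scale=2.0)
p_x = TransformedDistribution(base, [transform])

x = torch.tensor(0.5)
z = transform.inv(x)  # z = f(x)

# log p(X = x) = log p(Z = f(x)) + log |det df/dx|
manual = base.log_prob(z) + torch.log(torch.tensor(0.5))
print(p_x.log_prob(x), manual)  # both are approximately -1.643
```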
{
@@ -130,7 +137,7 @@
"\n",
"When designing the distributions module, the PyTorch team decided that distributions and transformations should be lightweight objects that are used as part of computations but destroyed afterwards. Consequently, the [`Distribution`](torch.distributions.distribution.Distribution) and [`Transform`](torch.distributions.transforms.Transform) classes are not sub-classes of [`torch.nn.Module`](torch.nn.Module), which means that we cannot retrieve their parameters with `.parameters()`, send their internal tensor to GPU with `.to('cuda')` or train them as regular neural networks. In addition, the concepts of conditional distribution and transformation, which are essential for probabilistic inference, are impossible to express with the current interface.\n",
"\n",
"To solve these problems, [`zuko`](zuko) defines two concepts: the [`LazyDistribution`](zuko.flows.core.LazyDistribution) and the [`LazyTransform`](zuko.flows.core.LazyTransform), which are modules whose forward pass returns a distribution or transformation, respectively. These components hold the parameters of the distributions/transformations as well as the recipe to build them, such that the actual distribution/transformation objects are lazily built and destroyed when necessary. Importantly, because the creation of the distribution/transformation object is delayed, an eventual condition can be easily taken into account. This design enables lazy distributions to act like distributions while retaining features inherent to modules, such as trainable parameters."
"To solve these problems, [`zuko`](zuko) defines two concepts: the [`LazyDistribution`](zuko.flows.core.LazyDistribution) and the [`LazyTransform`](zuko.flows.core.LazyTransform), which are modules whose forward pass returns a distribution or transformation, respectively. These components hold the parameters of the distributions/transformations as well as the recipe to build them. This way, the actual distribution/transformation objects are lazily constructed and destroyed when necessary. Importantly, because the creation of the distribution/transformation object is delayed, an eventual condition can be easily taken into account. This design enables lazy distributions to act like distributions while retaining features inherent to modules, such as trainable parameters."
]
},
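To make the lazy-construction idea concrete, here is a minimal stand-in (not Zuko's actual implementation) for a module whose forward pass builds and returns a fresh distribution from trainable parameters:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class LazyNormal(nn.Module):
    """Toy lazy distribution: the module stores the parameters and the
    recipe; its forward pass builds a regular Normal on demand."""

    def __init__(self):
        super().__init__()
        self.loc = nn.Parameter(torch.zeros(()))
        self.log_scale = nn.Parameter(torch.zeros(()))

    def forward(self):
        return Normal(self.loc, self.log_scale.exp())

lazy = LazyNormal()
dist = lazy()                                # a plain torch.distributions.Normal
loss = -dist.log_prob(torch.tensor(1.0))
loss.backward()                              # gradients reach lazy.parameters()
```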
{
"source": [
"### Variational inference\n",
"\n",
"Let's say we have a dataset of pairs $(x, c) \\sim p(X, C)$ and want to model the distribution of $X$ given $c$, that is $p(X | c)$. The goal of variational inference is to find the model $q_{\\phi^\\star}(X | c)$ that is most similar to $p(X | c)$ among a family of (conditional) distributions $q_\\phi(X | c)$ distinguished by their parameters $\\phi$. Expressing the dissimilarity between two distributions as their [Kullback-Leibler](https://wikipedia.org/wiki/Kullback–Leibler_divergence) (KL) divergence, the variational inference objective becomes\n",
"Let's say we have a dataset of pairs $(x, c) \\sim p(X, C)$ and want to model the distribution of $X$ given $c$, that is $p(X | c)$. The goal of variational inference is to find the model $q_{\\phi^\\star}(X | c)$ that is most similar to $p(X | c)$ among a family of (conditional) distributions $q_\\phi(X | c)$ distinguished by their parameters $\\phi$. Expressing the dissimilarity between two distributions as their [Kullback-Leibler](https://wikipedia.org/wiki/Kullback\u2013Leibler_divergence) (KL) divergence, the variational inference objective becomes\n",
"\n",
"$$\n",
" \\begin{align}\n",
@@ -324,6 +331,13 @@
" optimizer.zero_grad()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note, `model(c)` calls the `forward` method as described above."
]
},
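For readers skimming the diff, a self-contained sketch of the training pattern being described, using a stand-in conditional model (the notebook's actual `model` is defined earlier in the tutorial and differs):

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class ConditionalNormal(nn.Module):
    """Stand-in lazy conditional distribution: forward(c) returns q(X | c)."""

    def __init__(self, context=1, features=2):
        super().__init__()
        self.hyper = nn.Linear(context, 2 * features)

    def forward(self, c):
        loc, log_scale = self.hyper(c).chunk(2, dim=-1)
        return Normal(loc, log_scale.exp())

model = ConditionalNormal()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x, c = torch.randn(256, 2), torch.randn(256, 1)   # placeholder pairs (x, c)

loss = -model(c).log_prob(x).mean()   # model(c) invokes forward() and returns q(X | c)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```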
{
"cell_type": "markdown",
"metadata": {},
@@ -512,7 +526,7 @@
"source": [
"### Custom architecture\n",
"\n",
"Alternatively, a flow can be built as a custom [`Flow`](zuko.flows.core.Flow) object given a sequence of lazy transformations and a base lazy distribution. Follows a condensed example of many things that are possible in Zuko. But remember, with great power comes great responsibility (and great bugs)."
"Alternatively, a flow can be built as a custom [`Flow`](zuko.flows.core.Flow) object given a sequence of lazy transformations and a base lazy distribution. The following demonstrates a condensed example of many things that are possible in Zuko. But remember, with great power comes great responsibility (and great bugs)."
]
},
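A sketch of what such a custom composition can look like; the class names (`Flow`, `MaskedAutoregressiveTransform`, `Unconditional`, `DiagNormal`) are taken from Zuko's documentation and should be checked against the installed version:

```python
import torch
import zuko

# One masked autoregressive transform over 2 features with a 1-dimensional
# context, on top of a standard diagonal Gaussian base distribution.
flow = zuko.flows.Flow(
    transform=zuko.flows.MaskedAutoregressiveTransform(features=2, context=1),
    base=zuko.flows.Unconditional(
        zuko.distributions.DiagNormal,
        torch.zeros(2),
        torch.ones(2),
        buffer=True,
    ),
)

c = torch.randn(1)
x = flow(c).sample((4,))   # the forward pass returns a conditional distribution
```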
{
44 changes: 23 additions & 21 deletions docs/tutorials/forward_kl.ipynb
@@ -27,7 +27,7 @@
"source": [
"## Dataset\n",
"\n",
"We consider the Two Moons dataset."
"We consider the *Two Moons* dataset for demonstrative purposes."
]
},
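The data-generation cell is not shown in this diff; a typical way to produce the dataset (an assumption, the notebook may differ) is scikit-learn's `make_moons`:

```python
import torch
from sklearn.datasets import make_moons

samples, labels = make_moons(16384, noise=0.05)
x = torch.from_numpy(samples).float()   # positions, shape (16384, 2)
c = torch.from_numpy(labels).float()    # moon label in {0, 1}, shape (16384,)
```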
{
@@ -88,7 +88,7 @@
"source": [
"## Unconditional flow\n",
"\n",
"We use a neural spline flow (NSF) as density estimator $q_\\phi(x)$."
"We use a neural spline flow (NSF) as density estimator $q_\\phi(x)$. The goal of the unconditional flow is to sample the Two Moons \"distribution\" entirely."
]
},
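A condensed sketch of the unconditional estimator and one maximum-likelihood (forward KL) step; the hyperparameters are illustrative, not the notebook's exact values:

```python
import torch
import zuko

flow = zuko.flows.NSF(features=2, transforms=3, hidden_features=(64, 64))
optimizer = torch.optim.Adam(flow.parameters(), lr=1e-3)

x = torch.randn(512, 2)               # placeholder for a batch of Two Moons samples

loss = -flow().log_prob(x).mean()     # forward KL amounts to negative log-likelihood
loss.backward()
optimizer.step()
optimizer.zero_grad()
```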
{
@@ -173,14 +173,14 @@
"name": "stdout",
"output_type": "stream",
"text": [
"(0) 1.3520090579986572 ± 0.25871574878692627\n",
"(1) 1.147993564605713 ± 0.1022777259349823\n",
"(2) 1.1174802780151367 ± 0.09858577698469162\n",
"(3) 1.0956673622131348 ± 0.1021992415189743\n",
"(4) 1.0934643745422363 ± 0.09762168675661087\n",
"(5) 1.0758651494979858 ± 0.09098420292139053\n",
"(6) 1.0708422660827637 ± 0.09713941812515259\n",
"(7) 1.0695130825042725 ± 0.09372557699680328\n"
"(0) 1.3520090579986572 \u00b1 0.25871574878692627\n",
"(1) 1.147993564605713 \u00b1 0.1022777259349823\n",
"(2) 1.1174802780151367 \u00b1 0.09858577698469162\n",
"(3) 1.0956673622131348 \u00b1 0.1021992415189743\n",
"(4) 1.0934643745422363 \u00b1 0.09762168675661087\n",
"(5) 1.0758651494979858 \u00b1 0.09098420292139053\n",
"(6) 1.0708422660827637 \u00b1 0.09713941812515259\n",
"(7) 1.0695130825042725 \u00b1 0.09372557699680328\n"
]
}
],
"\n",
" losses = torch.stack(losses)\n",
"\n",
" print(f'({epoch})', losses.mean().item(), '±', losses.std().item())"
" print(f'({epoch})', losses.mean().item(), '\u00b1', losses.std().item())"
]
},
{
@@ -234,7 +234,7 @@
"source": [
"## Conditional flow\n",
"\n",
"We use a neural spline flow (NSF) as density estimator $q_\\phi(x | c)$, where $c$ is the label."
"We use a neural spline flow (NSF) as density estimator $q_\\phi(x | c)$, where $c$ is the label referencing a specific part of the Two Moons 'distribution'."
]
},
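For concreteness, a sketch of how the label can enter the flow as a one-dimensional context; the exact encoding used by the notebook is an assumption:

```python
import torch
import zuko

# Conditional NSF: 2 features, 1-dimensional context (the moon label as a float).
flow = zuko.flows.NSF(features=2, context=1, transforms=3, hidden_features=(64, 64))

x = torch.randn(8, 2)                                   # placeholder positions
c = torch.tensor([0.0, 1.0]).repeat(4).unsqueeze(-1)    # labels 0.0 / 1.0, shape (8, 1)

log_q = flow(c).log_prob(x)                             # one log-density per (x, c) pair
```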
{
"name": "stdout",
"output_type": "stream",
"text": [
"(0) 0.7147052884101868 ± 0.4756987988948822\n",
"(1) 0.40776583552360535 ± 0.10716820508241653\n",
"(2) 0.3866541087627411 ± 0.10318031907081604\n",
"(3) 0.37453195452690125 ± 0.10870178788900375\n",
"(4) 0.3634893000125885 ± 0.10033125430345535\n",
"(5) 0.36492055654525757 ± 0.10960724204778671\n",
"(6) 0.3537733554840088 ± 0.09780355542898178\n",
"(7) 0.3559333086013794 ± 0.1038535088300705\n"
"(0) 0.7147052884101868 \u00b1 0.4756987988948822\n",
"(1) 0.40776583552360535 \u00b1 0.10716820508241653\n",
"(2) 0.3866541087627411 \u00b1 0.10318031907081604\n",
"(3) 0.37453195452690125 \u00b1 0.10870178788900375\n",
"(4) 0.3634893000125885 \u00b1 0.10033125430345535\n",
"(5) 0.36492055654525757 \u00b1 0.10960724204778671\n",
"(6) 0.3537733554840088 \u00b1 0.09780355542898178\n",
"(7) 0.3559333086013794 \u00b1 0.1038535088300705\n"
]
}
],
"\n",
" losses = torch.stack(losses)\n",
"\n",
" print(f'({epoch})', losses.mean().item(), '±', losses.std().item())"
" print(f'({epoch})', losses.mean().item(), '\u00b1', losses.std().item())"
]
},
{
Expand All @@ -305,6 +305,7 @@
}
],
"source": [
"# sample the flow while conditioning on the 'top' part of two moons\n",
"samples = flow(torch.tensor([0.0])).sample((16384,))\n",
"\n",
"plt.figure(figsize=(4.8, 4.8))\n",
}
],
"source": [
"# sample the flow while conditioning on the 'bottom' part of two moons\n",
"samples = flow(torch.tensor([1.0])).sample((16384,))\n",
"\n",
"plt.figure(figsize=(4.8, 4.8))\n",
20 changes: 10 additions & 10 deletions docs/tutorials/reverse_kl.ipynb
@@ -84,7 +84,7 @@
"source": [
"## Flow\n",
"\n",
"We use a neural spline flow (NSF) as density estimator $q_\\phi(x)$. However, we inverse the transformation(s), which makes sampling more efficient as the inverse call of an autoregressive transformation is $D$ (where $D$ is the number of features) times slower than its forward call."
"We use a neural spline flow (NSF) as density estimator $q_\\phi(x)$. However, we invert the transformation(s), which makes sampling more efficient as the inverse call of an autoregressive transformation is $D$ (where $D$ is the number of features) times slower than its forward call."
]
},
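To connect the inverted parametrization to the reverse-KL objective trained later in the notebook, a minimal sketch; the target `log_p` is a stand-in, and `rsample_and_log_prob` is assumed to be available on Zuko's flow distribution (otherwise `rsample` followed by `log_prob` yields the same objective):

```python
import torch
import zuko

flow = zuko.flows.NSF(features=2, transforms=3, hidden_features=(64, 64))
optimizer = torch.optim.Adam(flow.parameters(), lr=1e-3)

def log_p(x):  # stand-in for the (unnormalized) target log-density
    return -(x ** 2).sum(-1) / 2

q = flow()
x, log_q = q.rsample_and_log_prob((512,))   # sampling is the cheap (forward) direction here
loss = (log_q - log_p(x)).mean()            # reverse KL(q || p) up to a constant
loss.backward()
optimizer.step()
optimizer.zero_grad()
```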
{
@@ -174,14 +174,14 @@
"name": "stdout",
"output_type": "stream",
"text": [
"(0) -1.012157678604126 ± 1.0205215215682983\n",
"(1) -1.5622574090957642 ± 0.03264421969652176\n",
"(2) -1.5753192901611328 ± 0.033491406589746475\n",
"(3) -1.5814640522003174 ± 0.025743382051587105\n",
"(4) -1.5768922567367554 ± 0.04906836897134781\n",
"(5) -1.5749255418777466 ± 0.13962876796722412\n",
"(6) -1.5877153873443604 ± 0.015589614398777485\n",
"(7) -1.5886530876159668 ± 0.029878195375204086\n"
"(0) -1.012157678604126 \u00b1 1.0205215215682983\n",
"(1) -1.5622574090957642 \u00b1 0.03264421969652176\n",
"(2) -1.5753192901611328 \u00b1 0.033491406589746475\n",
"(3) -1.5814640522003174 \u00b1 0.025743382051587105\n",
"(4) -1.5768922567367554 \u00b1 0.04906836897134781\n",
"(5) -1.5749255418777466 \u00b1 0.13962876796722412\n",
"(6) -1.5877153873443604 \u00b1 0.015589614398777485\n",
"(7) -1.5886530876159668 \u00b1 0.029878195375204086\n"
]
}
],
"\n",
" losses = torch.stack(losses)\n",
"\n",
" print(f'({epoch})', losses.mean().item(), '±', losses.std().item())"
" print(f'({epoch})', losses.mean().item(), '\u00b1', losses.std().item())"
]
},
{
Expand Down