diff --git a/docs/tutorials/basics.ipynb b/docs/tutorials/basics.ipynb index 7dde168..c96d4b1 100644 --- a/docs/tutorials/basics.ipynb +++ b/docs/tutorials/basics.ipynb @@ -6,7 +6,14 @@ "source": [ "# Learn the basics\n", "\n", - "This notebook walks you through the basics of PyTorch/Zuko distributions and transformations, how to parametrize probabilistic models, how to instantiate pre-built normalizing flows and finally how to create custom flow architectures. Training is covered in other tutorials." + "This notebook walks you through \n", + "\n", + "- the basics of PyTorch/Zuko distributions and transformations, \n", + "- how to parametrize probabilistic models, \n", + "- how to instantiate pre-built normalizing flows and finally \n", + "- how to create custom flow architectures. \n", + "\n", + "Training is covered in subsequent tutorials." ] }, { @@ -94,7 +101,7 @@ "\n", "$$ p(X = x) = p(Z = f(x)) \\left| \\det \\frac{\\partial f(x)}{\\partial x} \\right| $$\n", "\n", - "and sampling from $p(X)$ can be performed by first drawing realizations $z \\sim p(Z)$ and then applying the inverse transformation $x = f^{-1}(z)$. Such combination of a base distribution and a bijective transformation is sometimes called a *normalizing flow* as the base distribution is often standard normal." + "and sampling from $p(X)$ can be performed by first drawing realizations $z \\sim p(Z)$ and then applying the inverse transformation $x = f^{-1}(z)$. Such combination of a base distribution and a bijective transformation is sometimes called a *normalizing flow*. The term *normalizing* refers to the fact that the base distribution is often a (standard) *normal* distribution." ] }, { @@ -130,7 +137,7 @@ "\n", "When designing the distributions module, the PyTorch team decided that distributions and transformations should be lightweight objects that are used as part of computations but destroyed afterwards. Consequently, the [`Distribution`](torch.distributions.distribution.Distribution) and [`Transform`](torch.distributions.transforms.Transform) classes are not sub-classes of [`torch.nn.Module`](torch.nn.Module), which means that we cannot retrieve their parameters with `.parameters()`, send their internal tensor to GPU with `.to('cuda')` or train them as regular neural networks. In addition, the concepts of conditional distribution and transformation, which are essential for probabilistic inference, are impossible to express with the current interface.\n", "\n", - "To solve these problems, [`zuko`](zuko) defines two concepts: the [`LazyDistribution`](zuko.flows.core.LazyDistribution) and the [`LazyTransform`](zuko.flows.core.LazyTransform), which are modules whose forward pass returns a distribution or transformation, respectively. These components hold the parameters of the distributions/transformations as well as the recipe to build them, such that the actual distribution/transformation objects are lazily built and destroyed when necessary. Importantly, because the creation of the distribution/transformation object is delayed, an eventual condition can be easily taken into account. This design enables lazy distributions to act like distributions while retaining features inherent to modules, such as trainable parameters." + "To solve these problems, [`zuko`](zuko) defines two concepts: the [`LazyDistribution`](zuko.flows.core.LazyDistribution) and the [`LazyTransform`](zuko.flows.core.LazyTransform), which are modules whose forward pass returns a distribution or transformation, respectively. 
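To make the lazy-distribution idea concrete, here is a minimal sketch written with plain PyTorch. The class name, layer sizes, and the Gaussian parametrization are illustrative assumptions, not Zuko's actual implementation:

```python
import torch
import torch.nn as nn


class LazyGaussian(nn.Module):
    """Toy lazy distribution: a module whose forward pass builds and
    returns a fresh torch.distributions object on demand."""

    def __init__(self, features: int, context: int):
        super().__init__()
        # The trainable parameters live in the module, not in the
        # short-lived distribution object.
        self.hyper = nn.Linear(context, 2 * features)

    def forward(self, c: torch.Tensor) -> torch.distributions.Distribution:
        loc, log_scale = self.hyper(c).chunk(2, dim=-1)
        # The distribution is rebuilt at every call, so the condition c
        # is naturally taken into account.
        return torch.distributions.Normal(loc, log_scale.exp())


model = LazyGaussian(features=2, context=3)
dist = model(torch.randn(3))   # forward pass returns a Distribution
x = dist.sample((5,))          # draw realizations
print(dist.log_prob(x).shape)  # evaluate their likelihood
```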
These components hold the parameters of the distributions/transformations as well as the recipe to build them. This way, the actual distribution/transformation objects are lazily constructed and destroyed when necessary. Importantly, because the creation of the distribution/transformation object is delayed, an eventual condition can be easily taken into account. This design enables lazy distributions to act like distributions while retaining features inherent to modules, such as trainable parameters." ] }, { @@ -139,7 +146,7 @@ "source": [ "### Variational inference\n", "\n", - "Let's say we have a dataset of pairs $(x, c) \\sim p(X, C)$ and want to model the distribution of $X$ given $c$, that is $p(X | c)$. The goal of variational inference is to find the model $q_{\\phi^\\star}(X | c)$ that is most similar to $p(X | c)$ among a family of (conditional) distributions $q_\\phi(X | c)$ distinguished by their parameters $\\phi$. Expressing the dissimilarity between two distributions as their [Kullback-Leibler](https://wikipedia.org/wiki/Kullback–Leibler_divergence) (KL) divergence, the variational inference objective becomes\n", + "Let's say we have a dataset of pairs $(x, c) \\sim p(X, C)$ and want to model the distribution of $X$ given $c$, that is $p(X | c)$. The goal of variational inference is to find the model $q_{\\phi^\\star}(X | c)$ that is most similar to $p(X | c)$ among a family of (conditional) distributions $q_\\phi(X | c)$ distinguished by their parameters $\\phi$. Expressing the dissimilarity between two distributions as their [Kullback-Leibler](https://wikipedia.org/wiki/Kullback-Leibler_divergence) (KL) divergence, the variational inference objective becomes\n", "\n", "$$\n", " \\begin{align}\n", @@ -237,7 +244,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Calling the forward method of the model with a context $c$ returns a distribution object, which we can use to draw realizations or evaluate the likelihood of realizations." + "Calling the forward method of the model with a context $c$ returns a distribution object, which we can use to draw realizations or evaluate the likelihood of realizations. In the code below, `model(c=c[0])` calls the `forward` method as implemented above." ] }, { @@ -512,7 +519,7 @@ "source": [ "### Custom architecture\n", "\n", - "Alternatively, a flow can be built as a custom [`Flow`](zuko.flows.core.Flow) object given a sequence of lazy transformations and a base lazy distribution. Follows a condensed example of many things that are possible in Zuko. But remember, with great power comes great responsibility (and great bugs)." + "Alternatively, a flow can be built as a custom [`Flow`](zuko.flows.core.Flow) object given a sequence of lazy transformations and a base lazy distribution. The following demonstrates a condensed example of many things that are possible in Zuko. But remember, with great power comes great responsibility (and great bugs)." ] }, { diff --git a/docs/tutorials/forward_kl.ipynb b/docs/tutorials/forward_kl.ipynb index cdef2b9..6760331 100644 --- a/docs/tutorials/forward_kl.ipynb +++ b/docs/tutorials/forward_kl.ipynb @@ -27,7 +27,7 @@ "source": [ "## Dataset\n", "\n", - "We consider the Two Moons dataset." + "We consider the *Two Moons* dataset for demonstrative purposes." ] }, { @@ -88,7 +88,7 @@ "source": [ "## Unconditional flow\n", "\n", - "We use a neural spline flow (NSF) as density estimator $q_\\phi(x)$." + "We use a neural spline flow (NSF) as density estimator $q_\\phi(x)$. 
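Condensed, the construction and maximum-likelihood (forward KL) update used here look roughly like the sketch below; the transform count and hidden sizes are assumptions, not necessarily the tutorial's settings:

```python
import torch
import zuko

# Unconditional NSF over the 2 features of the Two Moons samples.
flow = zuko.flows.NSF(features=2, transforms=3, hidden_features=(64, 64))
optimizer = torch.optim.Adam(flow.parameters(), lr=1e-3)

x = torch.randn(256, 2)  # stand-in for a batch of Two Moons samples

# Forward KL amounts to maximizing the likelihood of the data under q_phi(x).
loss = -flow().log_prob(x).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```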
The goal of the unconditional flow is to approximate the entire Two Moons distribution." ] }, { @@ -173,14 +173,14 @@ "name": "stdout", "output_type": "stream", "text": [ - "(0) 1.3520090579986572 ± 0.25871574878692627\n", - "(1) 1.147993564605713 ± 0.1022777259349823\n", - "(2) 1.1174802780151367 ± 0.09858577698469162\n", - "(3) 1.0956673622131348 ± 0.1021992415189743\n", - "(4) 1.0934643745422363 ± 0.09762168675661087\n", - "(5) 1.0758651494979858 ± 0.09098420292139053\n", - "(6) 1.0708422660827637 ± 0.09713941812515259\n", - "(7) 1.0695130825042725 ± 0.09372557699680328\n" + "(0) 1.3520090579986572 \u00b1 0.25871574878692627\n", + "(1) 1.147993564605713 \u00b1 0.1022777259349823\n", + "(2) 1.1174802780151367 \u00b1 0.09858577698469162\n", + "(3) 1.0956673622131348 \u00b1 0.1021992415189743\n", + "(4) 1.0934643745422363 \u00b1 0.09762168675661087\n", + "(5) 1.0758651494979858 \u00b1 0.09098420292139053\n", + "(6) 1.0708422660827637 \u00b1 0.09713941812515259\n", + "(7) 1.0695130825042725 \u00b1 0.09372557699680328\n" ] } ], @@ -201,7 +201,7 @@ "\n", " losses = torch.stack(losses)\n", "\n", - " print(f'({epoch})', losses.mean().item(), '±', losses.std().item())" + " print(f'({epoch})', losses.mean().item(), '\u00b1', losses.std().item())" ] }, { @@ -234,7 +234,7 @@ "source": [ "## Conditional flow\n", "\n", - "We use a neural spline flow (NSF) as density estimator $q_\\phi(x | c)$, where $c$ is the label." + "We use a conditional NSF as density estimator $q_\\phi(x | c)$, where $c$ is the label indicating either the top or bottom moon of the Two Moons distribution." ] }, { @@ -255,14 +255,14 @@ "name": "stdout", "output_type": "stream", "text": [ - "(0) 0.7147052884101868 ± 0.4756987988948822\n", - "(1) 0.40776583552360535 ± 0.10716820508241653\n", - "(2) 0.3866541087627411 ± 0.10318031907081604\n", - "(3) 0.37453195452690125 ± 0.10870178788900375\n", - "(4) 0.3634893000125885 ± 0.10033125430345535\n", - "(5) 0.36492055654525757 ± 0.10960724204778671\n", - "(6) 0.3537733554840088 ± 0.09780355542898178\n", - "(7) 0.3559333086013794 ± 0.1038535088300705\n" + "(0) 0.7147052884101868 \u00b1 0.4756987988948822\n", + "(1) 0.40776583552360535 \u00b1 0.10716820508241653\n", + "(2) 0.3866541087627411 \u00b1 0.10318031907081604\n", + "(3) 0.37453195452690125 \u00b1 0.10870178788900375\n", + "(4) 0.3634893000125885 \u00b1 0.10033125430345535\n", + "(5) 0.36492055654525757 \u00b1 0.10960724204778671\n", + "(6) 0.3537733554840088 \u00b1 0.09780355542898178\n", + "(7) 0.3559333086013794 \u00b1 0.1038535088300705\n" ] } ], @@ -285,7 +285,7 @@ "\n", " losses = torch.stack(losses)\n", "\n", - " print(f'({epoch})', losses.mean().item(), '±', losses.std().item())" + " print(f'({epoch})', losses.mean().item(), '\u00b1', losses.std().item())" ] }, { @@ -305,6 +305,7 @@ } ], "source": [ + "# sample from the flow conditioned on the top moon label\n", "samples = flow(torch.tensor([0.0])).sample((16384,))\n", "\n", "plt.figure(figsize=(4.8, 4.8))\n", @@ -329,6 +330,7 @@ } ], "source": [ + "# sample from the flow conditioned on the bottom moon label\n", "samples = flow(torch.tensor([1.0])).sample((16384,))\n", "\n", "plt.figure(figsize=(4.8, 4.8))\n", diff --git a/docs/tutorials/reverse_kl.ipynb b/docs/tutorials/reverse_kl.ipynb index 7b53702..0850b7f 100644 --- a/docs/tutorials/reverse_kl.ipynb +++ b/docs/tutorials/reverse_kl.ipynb @@ -84,7 +84,7 @@ "source": [ "## Flow\n", "\n", - "We use a neural spline flow (NSF) as density estimator $q_\\phi(x)$. 
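For reference, the conditional flow sampled in the two cells of the previous tutorial boils down to a construction and loss like the following sketch; the label is passed as a one-dimensional context, and the hyperparameters are assumptions rather than the tutorial's exact values:

```python
import torch
import zuko

# Conditional NSF: 2 features, 1-dimensional context (the moon label).
flow = zuko.flows.NSF(features=2, context=1, transforms=3, hidden_features=(64, 64))

x = torch.randn(256, 2)                  # stand-in for Two Moons samples
c = torch.randint(2, (256, 1)).float()   # stand-in for their labels

# Forward KL on the conditional density q_phi(x | c).
loss = -flow(c).log_prob(x).mean()

# After training, condition on a label to sample from a single moon.
samples = flow(torch.tensor([0.0])).sample((16384,))
```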
However, we inverse the transformation(s), which makes sampling more efficient as the inverse call of an autoregressive transformation is $D$ (where $D$ is the number of features) times slower than its forward call." + "We use a neural spline flow (NSF) as density estimator $q_\\phi(x)$. However, we invert the transformation(s), which makes sampling more efficient as the inverse call of an autoregressive transformation is $D$ (where $D$ is the number of features) times slower than its forward call." ] }, { @@ -174,14 +174,14 @@ "name": "stdout", "output_type": "stream", "text": [ - "(0) -1.012157678604126 ± 1.0205215215682983\n", - "(1) -1.5622574090957642 ± 0.03264421969652176\n", - "(2) -1.5753192901611328 ± 0.033491406589746475\n", - "(3) -1.5814640522003174 ± 0.025743382051587105\n", - "(4) -1.5768922567367554 ± 0.04906836897134781\n", - "(5) -1.5749255418777466 ± 0.13962876796722412\n", - "(6) -1.5877153873443604 ± 0.015589614398777485\n", - "(7) -1.5886530876159668 ± 0.029878195375204086\n" + "(0) -1.012157678604126 \u00b1 1.0205215215682983\n", + "(1) -1.5622574090957642 \u00b1 0.03264421969652176\n", + "(2) -1.5753192901611328 \u00b1 0.033491406589746475\n", + "(3) -1.5814640522003174 \u00b1 0.025743382051587105\n", + "(4) -1.5768922567367554 \u00b1 0.04906836897134781\n", + "(5) -1.5749255418777466 \u00b1 0.13962876796722412\n", + "(6) -1.5877153873443604 \u00b1 0.015589614398777485\n", + "(7) -1.5886530876159668 \u00b1 0.029878195375204086\n" ] } ], @@ -204,7 +204,7 @@ "\n", " losses = torch.stack(losses)\n", "\n", - " print(f'({epoch})', losses.mean().item(), '±', losses.std().item())" + " print(f'({epoch})', losses.mean().item(), '\u00b1', losses.std().item())" ] }, {
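Putting the pieces together, the reverse-KL setup sketched in this tutorial looks roughly as follows. The inversion call (`flow.transform.inv`) and the `rsample_and_log_prob` shortcut are assumptions based on Zuko's documented interface and may differ between versions; `log_energy` is a hypothetical stand-in for the unnormalized target log p(x):

```python
import torch
import zuko


def log_energy(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical unnormalized target log-density log p(x); replace with
    # the energy actually defined in the tutorial.
    return -(x.square().sum(dim=-1) - 2.0).square()


flow = zuko.flows.NSF(features=2, transforms=3, hidden_features=(64, 64))
# Invert the lazy transform so that sampling uses the fast (forward)
# direction of the autoregressive transformation.
flow = zuko.flows.Flow(flow.transform.inv, flow.base)

optimizer = torch.optim.Adam(flow.parameters(), lr=1e-3)

# Reverse KL: draw reparametrized samples from q_phi and push their
# log-density towards the target's.
x, log_q = flow().rsample_and_log_prob((256,))
loss = (log_q - log_energy(x)).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```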