Fix documentation links and images, add more info on customizing attention.

PiperOrigin-RevId: 649267636
danieldjohnson authored and Penzai Developers committed Jul 4, 2024
1 parent 2c7bee8 commit 308d7b4
Showing 4 changed files with 37 additions and 3 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -75,7 +75,7 @@ Documentation on Penzai can be found at
> boilerplate. It also includes a more flexible transformer implementation with
> support for more pretrained model variants. You can read about the
> differences between the two APIs in the
> ["Changes in the V2 API"](v2_differences) overview.
> ["Changes in the V2 API"][v2_differences] overview.
>
> We plan to stabilize the V2 API and move it out of experimental in release
> ``0.2.0``, replacing the V1 API. If you wish to keep the V1 behavior, we
Binary file modified docs/_static/readme_teaser.png
34 changes: 34 additions & 0 deletions docs/guides/howto_reference.md
@@ -457,6 +457,40 @@ patched_model = (
where `target` is the layer to linearize, `linearize_around` computes the input that the layer should be linearized at (e.g. by modifying its input activation or returning a constant), and `evaluate_at` computes the input that the linear approximation should be evaluated at (usually the same as the original input, but can also be different).


### Customizing attention masks in `TransformerLM`

By default, most `TransformerLM` architecture variants are specialized to causal attention masks, using the `pz.nn.ApplyCausalAttentionMask` layer (or sometimes `pz.nn.ApplyCausalSlidingWindowAttentionMask`). These layers use the token positions input to build a causal attention mask and apply it to the attention logits.
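
If you are not sure which of these layers a given model variant uses, you can list them with a selector. Here is a minimal sketch, assuming `model` is an already-loaded `TransformerLM` and that your Penzai version exposes both layer classes:

```python
from penzai import pz

# List the attention-masking layers in the model to see whether it uses plain
# causal masks, sliding-window masks, or a mix of both.
mask_layers = (
    pz.select(model)
    .at_instances_of(
        pz.nn.ApplyCausalAttentionMask
        | pz.nn.ApplyCausalSlidingWindowAttentionMask
    )
    .get_sequence()
)
for layer in mask_layers:
  print(type(layer).__name__)
```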

If you would like to customize the attention mask computation, you can swap out these layers for `pz.nn.ApplyExplicitAttentionMask` layers, using something like

```python
explicit_attn_model = (
    pz.select(model)
    .at_instances_of(
        pz.nn.ApplyCausalAttentionMask
        | pz.nn.ApplyCausalSlidingWindowAttentionMask
    )
    .apply(lambda old: pz.nn.ApplyExplicitAttentionMask(
        mask_input_name="attn_mask",
        masked_out_value=old.masked_out_value,
    ))
)
```

This will create a copy of the model that expects a side input called `attn_mask` and uses it to mask the attention logits. You can call it using something like

```python
# tokens should have named shape {..., "seq": n_seq}
# token_positions should have named shape {..., "seq": n_seq}
# attn_mask should be a boolean array with named shape
# {..., "seq": n_seq, "kv_seq": n_seq}
token_logits = explicit_attn_model(
    tokens, token_positions=token_positions, attn_mask=attn_mask
)
```
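
As a concrete example of building that side input, the sketch below reconstructs an ordinary causal mask from token positions using Penzai's named arrays; you could adjust the condition to implement prefix-LM attention, sliding windows, or anything else. The variable names here are illustrative:

```python
import jax.numpy as jnp
from penzai import pz

n_seq = tokens.named_shape["seq"]  # sequence length of the tokens being scored

# Query positions live on the "seq" axis, key/value positions on "kv_seq".
query_positions = pz.nx.wrap(jnp.arange(n_seq)).tag("seq")
kv_positions = pz.nx.wrap(jnp.arange(n_seq)).tag("kv_seq")

# A plain causal mask: each query may attend to keys at or before its position.
attn_mask = pz.nx.nmap(jnp.greater_equal)(query_positions, kv_positions)

# Example tweak (with a hypothetical `prefix_length`): also allow bidirectional
# attention inside a prompt prefix, as in a prefix-LM:
# attn_mask = pz.nx.nmap(jnp.logical_or)(
#     attn_mask, pz.nx.nmap(jnp.less)(kv_positions, prefix_length))
```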

For more control, you can also define your own layer and insert it in place of the attention masking logic.
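
As a rough sketch of what that could look like, here is a hypothetical segment-masking layer that only lets tokens attend within their own document, reading segment IDs from a new side input. The class name, the `segment_ids` side input, and the exact `__call__` signature are illustrative assumptions rather than part of Penzai's API; compare against the implementation of `pz.nn.ApplyExplicitAttentionMask` when writing your own:

```python
import jax.numpy as jnp
from penzai import pz

@pz.pytree_dataclass
class ApplySegmentAttentionMask(pz.nn.Layer):
  """Sketch of a custom mask layer: tokens attend only within their own segment."""
  masked_out_value: float

  def __call__(self, attn_logits, **side_inputs):
    # Assumed side input: integer segment IDs with named shape {..., "seq": n_seq}.
    segment_ids = side_inputs["segment_ids"]
    kv_segment_ids = segment_ids.untag("seq").tag("kv_seq")
    # Allow attention only between tokens that share a segment ID. (Combine
    # with a causal condition here if you also need left-to-right masking.)
    mask = pz.nx.nmap(jnp.equal)(segment_ids, kv_segment_ids)
    # Masked-out logits are replaced so they vanish after the softmax.
    return pz.nx.nmap(jnp.where)(mask, attn_logits, self.masked_out_value)
```

You could then substitute this layer for the existing masking layers with the same `pz.select(...).apply(...)` pattern shown above, and pass `segment_ids=...` as an extra keyword argument when calling the model.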

----------------------------
## Training and Fine-Tuning Models (V2 API)

4 changes: 2 additions & 2 deletions docs/index.rst
@@ -141,8 +141,8 @@ Here's how you could initialize and visualize a simple neural network::
To learn more about how to build and manipulate neural networks with Penzai,
we recommend starting with the
"How to Think in Penzai" tutorial
- (`V1 API version <notebooks/how_to_think_in_penzai>`,
- `V2 API version <notebooks/v2_how_to_think_in_penzai>`),
+ (:doc:`V1 API version <notebooks/how_to_think_in_penzai>`,
+ :doc:`V2 API version <notebooks/v2_how_to_think_in_penzai>`),
which gives a high-level overview of how to think about and use Penzai
models. Afterward, you could:

