softmax.html

<!DOCTYPE html>
<html lang="en">

<head>

<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">

<title>Softmax</title>
<meta name="description" content="Converting numbers to probability distributions">

<style>
body {
  background-color: beige;
  color: rgb(41, 45, 58);
}

pre {
  color: rgb(38, 61, 106);
  background-color: rgb(218, 218, 183);
}

b {
  color: rgb(165, 122, 13);
}

@media (prefers-color-scheme: dark) {
  body {
    background-color: rgb(19, 22, 32);
    color: beige;
  }

  pre {
    background-color: rgb(29, 32, 38);
    color: rgb(186, 186, 169);
  }

  b {
    color: rgb(238, 204, 11);
  }
}

body {
  line-height: 1.4;
  margin-inline: 0;
  margin-block: 2rem 3rem;
}

p, pre {
  margin-block: 1.5em;
  margin-inline: 2em;
  border-radius: 1px;
}

p {
  padding: 1ch 2ch;
  max-width: 32rem;
}

p + p {
  padding-block-start: 0;
}

pre {
  padding: 2.25ch 2.75ch;
}

@media (width > 35em) {
  pre {
    max-width: 32rem;
  }
}

@media (width < 35em) {
  p, pre {
    margin-inline: 0;
  }
  pre {
    overflow-y: auto;
  }
}
</style>

</head>

<body>

<p>
<b>Softmax</b> converts an arbitrary set of numbers into a probability distribution.
That is, the numbers will all be between 0 and 1, and will sum together to 1.
</p>

<pre><code>const sm = (xs) => {
  const s = xs.map(Math.exp).reduce((a, x) => a + x, 0);
  return xs.map((x) => Math.exp(x) / s);
};
</code></pre>

<p>
It has some nice characteristics. E.g. compared to other methods of
normalization, this handles negatives, small, large, zero, anything you can
throw at it usually.
</p>

<pre><code>// helper method to print as percentages
const per = (xs) => xs.map((x) => Math.round(x * 100));

> per(sm([1, 0, 3, -10, 0.07]))
[ 11, 4, 81, 0, 4 ]
</code></pre>

<p>It has some not so nice ones too. E.g. it is not scale invariant.</p>

<pre><code>> per(sm([1, 2]))
[ 27, 73 ]
> per(sm([2, 4]))
[ 12, 88 ]
> per(sm([4, 8]))
[ 2, 98 ]
</code></pre>

<p>
Intuitively, I would think the presence of the exponent will amplify
differences. This happens indeed
</p>

<pre><code>> per(sm([1, 2, 4, 8]))
[ 0, 0, 2, 98 ]
</code></pre>

<p>
But not to an extent that I expected on first contact. On the contrary, it seems
to further "compress" the numbers together if they're close together
</p>

<pre><code>> per(sm([0.1, 0.2, 0.3, 0.4]))
[ 21, 24, 26, 29 ]
</code></pre>

<p>
Again, depending on the task at hand, this might or might not be the behaviour I
might want.
</p>

<p>
It seems to work great in ML settings, in particular for converting the output
of the last layer into probability distributions, for reasons that seem to be
tied to how the backpropogation works.
</p>

</body>
</html>