Any Python program fits in 24 characters

purplesyringa · Nov 17, 2024 · 1fb24a8
1 parent 15e6d18
commit 1fb24a8
Showing 6 changed files with 117 additions and 55 deletions.
diff --git a/blog/any-python-program-fits-in-24-characters/index.html b/blog/any-python-program-fits-in-24-characters/index.html
@@ -0,0 +1,58 @@
+<!doctypehtml><html prefix="og: http://ogp.me/ns#"lang=en_US><meta charset=utf-8><meta content=width=device-width,initial-scale=1 name=viewport><title>Any Python program fits in 24 characters* | purplesyringa's blog</title><link href=../../favicon.ico?v=2 rel=icon><link href=../../all.css rel=stylesheet><link href=../../blog.css rel=stylesheet><link href=../../vendor/Temml-Local.css rel=stylesheet><link crossorigin href=https://fonts.googleapis.com/css2?family=Noto+Sans:ital,wght@0,100..900;1,100..900&family=Roboto+Mono:ital,wght@0,100..700;1,100..700&family=Roboto:ital,wght@0,400;0,700;1,400;1,700&family=Slabo+27px&display=swap rel=stylesheet><link href=../../fonts/webfont.css rel=stylesheet><link media="screen and (prefers-color-scheme: dark"href=../../vendor/atom-one-dark.min.css rel=stylesheet><link media="screen and (prefers-color-scheme: light"href=../../vendor/a11y-light.min.css rel=stylesheet><link title="Blog posts"href=../../blog/feed.rss rel=alternate type=application/rss+xml><meta content="Any Python program fits in 24 characters*"property=og:title><meta content=article property=og:type><meta content=https://purplesyringa.moe/blog/any-python-program-fits-in-24-characters/og.png property=og:image><meta content=https://purplesyringa.moe/blog/any-python-program-fits-in-24-characters/ property=og:url><meta content="* If you don’t take whitespace into account.
+My friend challenged me to find the shortest solution to a certain Leetcode-style problem in Python. They were generous enough to let me use whitespace for free, so that the code stays readable. So that’s exactly what we’ll abuse to encode any Python program in 24 bytes, ignoring whitespace."property=og:description><meta content=en_US property=og:locale><meta content="purplesyringa's blog"property=og:site_name><meta content=summary_large_image name=twitter:card><meta content=https://purplesyringa.moe/blog/any-python-program-fits-in-24-characters/og.png name=twitter:image><script data-website-id=0da1961d-43f2-45cc-a8e2-75679eefbb69 defer src=https://zond.tei.su/script.js></script><body><header><div class=viewport-container><div class=media><a href=https://github.com/purplesyringa><img alt=GitHub src=../../images/github-mark-white.svg></a></div><h1><a href=/>purplesyringa</a></h1><nav><a href=../..>about</a><a class=current href=../../blog/>blog</a><a href=../../sink/>kitchen sink</a></nav></div></header><section><div class=viewport-container><h2>Any Python program fits in 24 characters*</h2><time>November 17, 2024</time><p><em>* If you don’t take whitespace into account.</em><p>My friend challenged me to find the shortest solution to a certain Leetcode-style problem in Python. They were generous enough to let me use whitespace for free, so that the code stays readable. So that’s exactly what we’ll abuse to encode <em>any</em> Python program in <eq><math><mn>24</mn></math></eq> bytes, ignoring whitespace.<blockquote><p>This post originally stated that <eq><math><mn>30</mn></math></eq> characters are always enough. Since then, someone on the codegolf Discord server has devised a better solution, reaching <eq><math><mn>24</mn></math></eq> bytes. 
+After a few minor modifications, it satisfies the requirements of this problem, so I’m publishing it here too.</blockquote><p class=next-group><span aria-level=3 class=side-header role=heading><span>Bits</span></span>We can encode arbitrary data in a string by only using whitespace. For example, we could encode <code>0</code> bits as spaces and <code>1</code> bits as tabs. Now you just have to decode this.<p>As you start implementing the decoder, it immediately becomes clear that this approach requires about 50 characters at minimum. You can use <code>c % 2 for c in b"..."</code> to extract individual bits, then you need to merge the bits by converting them with <code>str</code> and concatenating them with <code>"".join(...)</code>, then you need to turn the bits into bytes with <code>int.to_bytes(...)</code>, and finally call <code>exec</code>. We need to find another solution.<p class=next-group><span aria-level=3 class=side-header role=heading><span>Characters</span></span>What if we didn’t go from characters to bits and then back? What if instead, we mapped each whitespace character to its own non-whitespace character and then evaluated that?<pre><code class=language-python><span class=hljs-built_in>exec</span>(
+    <span class=hljs-string>"[whitespace...]"</span>
+        .replace(<span class=hljs-string>" "</span>, <span class=hljs-string>"A"</span>)
+        .replace(<span class=hljs-string>"\t"</span>, <span class=hljs-string>"B"</span>)
+        .replace(<span class=hljs-string>"\v"</span>, <span class=hljs-string>"C"</span>)
+        .replace(<span class=hljs-string>"\f"</span>, <span class=hljs-string>"D"</span>)
+        ...
+)
+</code></pre><p>Unicode has quite a lot of whitespace characters, so this should be possible, in theory. Unfortunately, this takes even more bytes in practice. Under 50 characters, we can fit just two <code>replace</code> calls:<pre><code class=language-python><span class=hljs-built_in>exec</span>(<span class=hljs-string>"[whitespace...]"</span>.replace(<span class=hljs-string>" "</span>,<span class=hljs-string>"A"</span>).replace(<span class=hljs-string>"\t"</span>,<span class=hljs-string>"B"</span>))
+</code></pre><p>But we don’t have to use <code>replace</code>! The less-known <code>str.translate</code> method can perform multiple single-character replaces at once:<pre><code class=language-python><span class=hljs-meta>>>> </span><span class=hljs-string>"Hello, world!"</span>.translate({<span class=hljs-built_in>ord</span>(<span class=hljs-string>"H"</span>): <span class=hljs-string>"h"</span>, <span class=hljs-built_in>ord</span>(<span class=hljs-string>"!"</span>): <span class=hljs-string>"."</span>})
+<span class=hljs-string>'hello, world.'</span>
+</code></pre><p>The following fits in 50 characters:<pre><code class=language-python><span class=hljs-built_in>exec</span>(<span class=hljs-string>"[whitespace...]"</span>.translate({<span class=hljs-number>9</span>: <span class=hljs-string>"A"</span>, <span class=hljs-number>11</span>: <span class=hljs-string>"B"</span>, <span class=hljs-number>12</span>: <span class=hljs-string>"C"</span>, <span class=hljs-number>28</span>: <span class=hljs-string>"D"</span>}))
+</code></pre><p>4 characters isn’t much to work with, but here’s some good news: <code>translate</code> takes anything indexable with integers (code points). We can thus replace the dict with a string:<pre><code class=language-python><span class=hljs-built_in>exec</span>(
+    <span class=hljs-string>"[whitespace...]"</span>.translate(
+        <span class=hljs-string>"         A BC               DEFGH                                                                                                    I                          J"</span>
+    )
+)
+</code></pre><p>The characters <code>ABCDEFGHIJ</code> are located at indices <eq><math><mrow><mn>9</mn><mo separator=true>,</mo></mrow><mrow><mn>11</mn><mo separator=true>,</mo></mrow><mrow><mn>12</mn><mo separator=true>,</mo></mrow><mrow><mn>28</mn><mo separator=true>,</mo></mrow><mrow><mn>29</mn><mo separator=true>,</mo></mrow><mrow><mn>30</mn><mo separator=true>,</mo></mrow><mrow><mn>31</mn><mo separator=true>,</mo></mrow><mrow><mn>32</mn><mo separator=true>,</mo></mrow><mrow><mn>133</mn><mo separator=true>,</mo></mrow><mrow><mn>160</mn></mrow></math></eq> – all whitespace code points below <eq><math><mn>256</mn></math></eq> except CR and LF, which are invalid in a string. While this code is long, most of it is just whitespace, which we ignore. After removing whitespace, it’s only <eq><math><mn>32</mn></math></eq> characters:<pre><code class=language-python><span class=hljs-built_in>exec</span>(<span class=hljs-string>""</span>.translate(<span class=hljs-string>"ABCDEFGHIJ"</span>))
+</code></pre><p>We can now encode any Python program that uses at most <eq><math><mn>10</mn></math></eq> different characters. This is where <a href=https://github.com/kuangkzh/PyFuck>PyFuck</a> comes in: it transforms any Python script into an equivalent script that uses only <eq><math><mn>8</mn></math></eq> characters: <code>exc('%0)</code>. Since the table then needs only <eq><math><mn>8</mn></math></eq> entries instead of <eq><math><mn>10</mn></math></eq>, the code size drops to <eq><math><mn>30</mn></math></eq> characters (plus whitespace). A bit of postprocessing is necessary to get it working well, as PyFuck often has exponential output, but that’s a minor issue.<p class=next-group><span aria-level=3 class=side-header role=heading><span>A better way</span></span>But it turns out there’s another way to translate whitespace to non-whitespace.<blockquote><p>This solution was found by a reader of my blog – thanks!</blockquote><p>When <code>repr</code> is applied to a string, it replaces non-printable characters, which includes every Unicode whitespace character except the ASCII space, with escape sequences such as <code>\xXX</code> and <code>\uXXXX</code>. For example, <code>U+2001 Em Quad</code> is encoded as <code>'\u2001'</code>. All in all, Unicode whitespace gives us an unlimited supply of <code>\</code>, <code>x</code>, and the whole hexadecimal alphabet (plus two instances of <code>'</code>).<p>Say we wanted to extract the least significant digits of characters from <code>U+2000</code> to <code>U+2007</code>. Here’s how to do this:<pre><code class=language-python><span class=hljs-comment># Imagine these \uXXXX escapes are literal whitespace characters</span>
+<span class=hljs-meta>>>> </span><span class=hljs-built_in>repr</span>(<span class=hljs-string>"\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007"</span>)[<span class=hljs-number>6</span>::<span class=hljs-number>6</span>]
+<span class=hljs-string>'01234567'</span>
+</code></pre><p>To get <code>\</code>, <code>x</code>, and the rest of the hexadecimal alphabet, we need characters like <code>U+000B</code> and <code>U+001F</code>. We also need to align the strings exactly, so that a single column contains the whole alphabet:<pre><code class=language-python>         v
+\: <span class=hljs-string>"     \t "</span>
+x: <span class=hljs-string>"    \x0b"</span>
+<span class=hljs-number>0</span>: <span class=hljs-string>"\u2000  "</span>
+<span class=hljs-number>1</span>: <span class=hljs-string>"\u2001  "</span>
+<span class=hljs-number>2</span>: <span class=hljs-string>"\u2002  "</span>
+<span class=hljs-number>3</span>: <span class=hljs-string>"\u2003  "</span>
+<span class=hljs-number>4</span>: <span class=hljs-string>"\u2004  "</span>
+<span class=hljs-number>5</span>: <span class=hljs-string>"\u2005  "</span>
+<span class=hljs-number>6</span>: <span class=hljs-string>"\u2006  "</span>
+<span class=hljs-number>7</span>: <span class=hljs-string>"\u2007  "</span>
+<span class=hljs-number>8</span>: <span class=hljs-string>"\u2008  "</span>
+<span class=hljs-number>9</span>: <span class=hljs-string>"\u2009  "</span>
+a: <span class=hljs-string>"\u200a  "</span>
+b: <span class=hljs-string>"  \x0b  "</span>
+c: <span class=hljs-string>"  \x0c  "</span>
+d: <span class=hljs-string>"  \x1d  "</span>
+e: <span class=hljs-string>"  \x1e  "</span>
+f: <span class=hljs-string>"  \x1f  "</span>
+         ^
+</code></pre><p>This requires us to increase the step to <eq><math><mn>8</mn></math></eq>, but it works!<p>Now, if we have free access to <code>\</code>, <code>x</code>, and the hexadecimal alphabet, we can reduce any program to just <eq><math><mn>4</mn></math></eq> characters outside this alphabet, namely <code>('')</code> (we’re lucky that <code>exec</code> is free, since <code>e</code>, <code>x</code>, and <code>c</code> all belong to the alphabet):<pre><code class=language-python><span class=hljs-comment># print("Hello, world!")</span>
+<span class=hljs-built_in>exec</span>(<span class=hljs-string>'\x70\x72\x69\x6e\x74\x28\x22\x48\x65\x6c\x6c\x6f\x2c\x20\x77\x6f\x72\x6c\x64\x21\x22\x29'</span>)
+</code></pre><p>Now we can encode this using the previous trick, leaving <code>('')</code> as-is, and run it:<pre><code class=language-python><span class=hljs-built_in>exec</span>(<span class=hljs-built_in>repr</span>(<span class=hljs-string>"[encoding of exec]([padding]'[user code]'[padding])"</span>)[<span class=hljs-number>6</span>::<span class=hljs-number>8</span>])
+</code></pre><p class=next-group><span aria-level=3 class=side-header role=heading><span>The end</span></span>So that’s how you print <em>Lorem Ipsum</em> in only <eq><math><mn>24</mn></math></eq> characters and just <eq><math><mn>10</mn></math></eq> KiB of whitespace. <a href=https://github.com/purplesyringa/24-characters-of-python>Check out the repo on GitHub.</a><p>Hope you found this entertaining! If anyone knows how to bring this to <eq><math><mn>23</mn></math></eq> characters or less, I’m all ears. :)</div></section><footer><div class=viewport-container><h2>Made with my own bare hands (why.)</h2></div></footer><script>window.addEventListener("keydown", e => {
+				if (e.key === "Enter") {
+					if (e.ctrlKey) {
+						window.open("https://github.com/purplesyringa/site/edit/master/blog/any-python-program-fits-in-24-characters/index.md", "_blank");
+					} else if (
+						e.target.type === "checkbox"
+						&& e.target.parentNode
+						&& e.target.parentNode.className === "expansible-code"
+					) {
+						e.target.click();
+					}
+				}
+			});</script>
diff --git a/...on-program-fits-in-30-characters/index.md → ...on-program-fits-in-24-characters/index.md b/...on-program-fits-in-30-characters/index.md → ...on-program-fits-in-24-characters/index.md
@@ -1,15 +1,17 @@
 ---
-title: Any Python program fits in 30 characters*
+title: Any Python program fits in 24 characters*
 time: November 17, 2024
 intro: |
     *\* If you don't take whitespace into account.*
 
-    My friend challenged me to find the shortest solution to a certain Leetcode-style problem in Python. They were generous enough to let me use whitespace for free, so that the code stays readable. So that's exactly what we'll abuse to encode *any* Python program in $30$ bytes, ignoring whitespace.
+    My friend challenged me to find the shortest solution to a certain Leetcode-style problem in Python. They were generous enough to let me use whitespace for free, so that the code stays readable. So that's exactly what we'll abuse to encode *any* Python program in $24$ bytes, ignoring whitespace.
 ---
 
 *\* If you don't take whitespace into account.*
 
-My friend challenged me to find the shortest solution to a certain Leetcode-style problem in Python. They were generous enough to let me use whitespace for free, so that the code stays readable. So that's exactly what we'll abuse to encode *any* Python program in $30$ bytes, ignoring whitespace.
+My friend challenged me to find the shortest solution to a certain Leetcode-style problem in Python. They were generous enough to let me use whitespace for free, so that the code stays readable. So that's exactly what we'll abuse to encode *any* Python program in $24$ bytes, ignoring whitespace.
+
+> This post originally stated that $30$ characters are always enough. Since then, someone on the codegolf Discord server has devised a better solution, reaching $24$ bytes. After a few minor modifications, it satisfies the requirements of this problem, so I'm publishing it here too.
 
 
 ### Bits
@@ -69,35 +71,69 @@ The characters `ABCDEFGHIJ` are located at indices $9, 11, 12, 28, 29, 30, 31, 3
 exec("".translate("ABCDEFGHIJ"))
 ```
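
As a sanity check on the index list, here is a minimal sketch that recomputes the usable whitespace code points and round-trips a string through a translation table. The names `usable`, `table`, and `encode`, as well as the sample text, are illustrative assumptions and not part of the original post.

```python
# Whitespace code points below 256, minus CR and LF (not allowed in a literal).
usable = [i for i in range(256) if chr(i).isspace() and chr(i) not in "\r\n"]
print(usable)  # [9, 11, 12, 28, 29, 30, 31, 32, 133, 160]

alphabet = "ABCDEFGHIJ"

# Build the table string: position i holds the replacement for code point i.
cells = [" "] * (max(usable) + 1)
for code_point, letter in zip(usable, alphabet):
    cells[code_point] = letter
table = "".join(cells)


def encode(text):
    """Map each letter of the 10-character alphabet to its whitespace stand-in."""
    return "".join(chr(usable[alphabet.index(ch)]) for ch in text)


hidden = encode("BADFACE")
print(hidden.isspace())         # True
print(hidden.translate(table))  # BADFACE
```

`str.translate` looks up each character's code point in the table, so a plain string works as well as a dict here.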
 
+We can now encode any Python program that uses at most $10$ different characters. This is where [PyFuck](https://github.com/kuangkzh/PyFuck) comes in: it transforms any Python script into an equivalent script that uses only $8$ characters: `exc('%0)`. Since the table then needs only $8$ entries instead of $10$, the code size drops to $30$ characters (plus whitespace). A bit of postprocessing is necessary to get it working well, as PyFuck often has exponential output, but that's a minor issue.
+
+
+### A better way
 
-### Alphabet
 
-We can now encode any Python program that uses at most $10$ different characters.
+But it turns out there's another way to translate whitespace to non-whitespace.
 
-This would be more than enough for JavaScript: [JSFuck](https://jsfuck.com/) can transform any JS program to an equivalent JS program that only uses characters from the 6-character set `[]()!+`. Does anything like this exist for Python?
+> This solution was found by a reader of my blog -- thanks!
 
-Actually, it does! [PyFuck](https://github.com/kuangkzh/PyFuck) transforms any Python script to an equivalent script that uses only 8 characters: `exc('%0)`. This means that we can reduce our $10$-byte alphabet to just $8$ bytes, further reducing the code size:
+When `repr` is applied to a string, it replaces non-printable characters, which includes every Unicode whitespace character except the ASCII space, with escape sequences such as `\xXX` and `\uXXXX`. For example, `U+2001 Em Quad` is encoded as `'\u2001'`. All in all, Unicode whitespace gives us an unlimited supply of `\`, `x`, and the whole hexadecimal alphabet (plus two instances of `'`).
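
To make the escaping behaviour concrete, the short sketch below prints the `repr` of a few whitespace characters. The sample list is my own choice; the characters are written as escapes in the source only so that they stay visible, and at runtime each one is a single whitespace character.

```python
# Each entry is one whitespace character; repr() spells it out as an escape.
samples = ["\t", "\x0b", "\x0c", "\x1f", "\x85", "\xa0", "\u2001", "\u200a"]

for ch in samples:
    assert ch.isspace() and len(ch) == 1
    print(repr(ch))
# '\t'
# '\x0b'
# '\x0c'
# '\x1f'
# '\x85'
# '\xa0'
# '\u2001'
# '\u200a'

# The ASCII space counts as printable, so repr() leaves it alone.
print(repr(" "))  # ' '
```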
+
+Say we wanted to extract the least significant digits of characters from `U+2000` to `U+2007`. Here's how to do this:
 
 ```python
-exec("[whitespace...]".translate("         e xc               ('%0)"))
+# Imagine these \uXXXX escapes are literal whitespace characters
+>>> repr("\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007")[6::6]
+'01234567'
 ```
 
-That's $30$ characters (plus whitespace).
+To get `\`, `x`, and the rest of the hexadecimal alphabet, we need characters like `U+000B` and `U+001F`. We also need to align the strings exactly, so that a single column contains the whole alphabet:
 
+```python
+         v
+\: "     \t "
+x: "    \x0b"
+0: "\u2000  "
+1: "\u2001  "
+2: "\u2002  "
+3: "\u2003  "
+4: "\u2004  "
+5: "\u2005  "
+6: "\u2006  "
+7: "\u2007  "
+8: "\u2008  "
+9: "\u2009  "
+a: "\u200a  "
+b: "  \x0b  "
+c: "  \x0c  "
+d: "  \x1d  "
+e: "  \x1e  "
+f: "  \x1f  "
+         ^
+```
 
-### Optimization
+This requires us to increase the step to $8$, but it works!
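
The alignment in the table above can be checked mechanically. The following is a sketch under my own naming (`chunks`, `packed`): it builds one whitespace chunk per symbol, confirms that each chunk is exactly $8$ characters wide once escaped, and reads the marked column back with a step of $8$.

```python
# One whitespace chunk per symbol, mirroring the rows of the table.
chunks = {
    "\\": " " * 5 + "\t" + " ",  # escaped form: 5 spaces, \t, 1 space
    "x": " " * 4 + "\x0b",       # escaped form: 4 spaces, \x0b
}
for digit in "0123456789a":
    chunks[digit] = chr(0x2000 + int(digit, 16)) + "  "
for digit, code_point in zip("bcdef", (0x0B, 0x0C, 0x1D, 0x1E, 0x1F)):
    chunks[digit] = "  " + chr(code_point) + "  "

for symbol, chunk in chunks.items():
    escaped = repr(chunk)[1:-1]  # strip the surrounding quotes
    # Pure whitespace in, 8 visible characters out, symbol at offset 5.
    assert chunk.isspace() and len(escaped) == 8 and escaped[5] == symbol

alphabet = "\\x0123456789abcdef"
packed = "".join(chunks[symbol] for symbol in alphabet)
# Offset 5 inside each chunk is index 6 of the repr (one leading quote).
print(repr(packed)[6::8] == alphabet)  # True
```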
 
-There's just one problem: the output of PyFuck is *exponential* in the count of non-`exc(0)` characters in the input code. So to encode realistic programs with just `exc('%0)`, we need to pass code through *a nested encoder* before passing it to PyFuck. The optimized nested code looks like this:
+Now, if we have free access to `\`, `x`, and the hexadecimal alphabet, we can reduce any program to just $4$ characters outside this alphabet, namely `('')` (we're lucky that `exec` is free, since `e`, `x`, and `c` all belong to the alphabet):
 
 ```python
-exec(int("[bits of code]".replace("(","0").replace(")","1"),2).to_bytes([length of code]))
+# print("Hello, world!")
+exec('\x70\x72\x69\x6e\x74\x28\x22\x48\x65\x6c\x6c\x6f\x2c\x20\x77\x6f\x72\x6c\x64\x21\x22\x29')
 ```
 
-We store bits as `(` and `)`, so there's only a fixed cost due to PyFuck (about $400$ KiB). The bits of code take $8 \times$ more space than the original bytes, but that's nothing compared to the PyFuck overhead.
+Now we can encode this using the previous trick, leaving `('')` as-is, and run it:
+
+```python
+exec(repr("[encoding of exec]([padding]'[user code]'[padding])")[6::8])
+```
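
Putting the pieces together, the sketch below generates the whole program for ASCII-only source. It is one possible layout, not necessarily the one used in the linked repo: the names `HEX_WS`, `literal`, and `build` are hypothetical, and I assume each of the four literal characters sits at offset $5$ of its own $8$-character block.

```python
# Whitespace chunk for every symbol of the free alphabet (8 escaped chars each).
HEX_WS = {"\\": " " * 5 + "\t ", "x": " " * 4 + "\x0b"}
for d in "0123456789a":
    HEX_WS[d] = chr(0x2000 + int(d, 16)) + "  "
for d, cp in zip("bcdef", (0x0B, 0x0C, 0x1D, 0x1E, 0x1F)):
    HEX_WS[d] = "  " + chr(cp) + "  "


def literal(ch):
    """Place a character that must stay visible at offset 5 of its block."""
    return " " * 5 + ch + " " * 2


def build(source):
    """Return a program equivalent to `source` with 24 non-whitespace chars."""
    hex_code = "".join(f"\\x{b:02x}" for b in source.encode("ascii"))
    payload = (
        "".join(HEX_WS[s] for s in "exec")
        + literal("(") + literal("'")
        + "".join(HEX_WS[s] for s in hex_code)
        + literal("'") + literal(")")
    )
    # The payload holds ' but no ", so repr() wraps it in double quotes and
    # leaves each ' as a single unescaped character.
    return 'exec(repr("' + payload + '")[6::8])'


program = build('print("Hello, world!")')
print(sum(not ch.isspace() for ch in program))  # 24
exec(program)  # prints: Hello, world!
```

The count works out as $11$ characters for `exec(repr("`, $4$ for `('')`, and $9$ for `")[6::8])`.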
 
 
 ### The end
 
-So that's how you print *Lorem Ipsum* in only $30$ characters and just $420$ KiB of whitespace (still smaller than Electron). [Check out the repo on GitHub.](https://github.com/purplesyringa/30-characters-of-python)
+So that's how you print *Lorem Ipsum* in only $24$ characters and just $10$ KiB of whitespace. [Check out the repo on GitHub.](https://github.com/purplesyringa/24-characters-of-python)
 
-Hope you found this entertaining! If anyone knows how to bring this to $29$ characters or less, I'm all ears. :)
+Hope you found this entertaining! If anyone knows how to bring this to $23$ characters or less, I'm all ears. :)
diff --git a/...thon-program-fits-in-30-characters/og.png → ...thon-program-fits-in-24-characters/og.png b/...thon-program-fits-in-30-characters/og.png → ...thon-program-fits-in-24-characters/og.png