Learn Emacs Lisp By making a Brainfuck compiler

{{{subtitle(Lisp tutorial)}}} {{{date(July 2018)}}} {{{theme-color(19cba9)}}}

Introduction

A good developer friend of mine, once asked me if I could teach him some Lisp. I’ve been using Emacs for half a decade and have learned some Lisp along the way by extending it in various ways. Since he’s an experienced developer, I figured I could try to teach him the little Lisp I knew by making a small stack based virtual machine. We then decided to keep going and tried write a Brainfuck compiler targeting our newly implemented VM. We got a good kick out of it and I wanted to share this experience with other developers wanting to learn a combination of Lisp, bytecode, and Brainfuck. This Lisp tutorial is therefore somewhat different from the more typical ones in that we’re going to learn the language by coding in it, during which I’ll explain the various concepts surrounding the language as we encounter them. By the end of this tutorial, we’ll have learned some Lisp, implemented a small compiler targeting a custom made VM, and of course, Brainfuck. Sadly, you will not be a Lisp-guru by the end of this article. My hope however, is that you will be independent enough in Lisp to be mostly using Emacs Lisp’s documentation from there. Please do not hesitate to look some of the concepts introduced in this article that you have trouble grasping, maybe because I could have done a better job explaining them, on Stack overflow as you go through this tutorial. Indeed, some of these concepts need to be played with to get a feel for them and you’ll therefore need more practice than is provided in this article. Luckily, this tutorial does not implement any compiler optimization, you could therefore decide to implement some obvious ones after finishing this article.

That being said, there are two ways we could proceed in this project:

Design and implement a virtual machine first before getting started with the compiler,
Restrict ourselves to a subset of Brainfuck, design a VM allowing for a compiler targeting it to compile this subset, and iterate until we have implemented all of Brainfuck’s syntax.

The later option allows us to design the VM’s bytecode instructions as we need them, and implementing baby steps by baby steps the compiler, so it’s the option I have opted for in this tutorial. This also allows us, as we are also learning lisp, to grow two programs in parallel and therefore, defer having to deal with large lisp programs until later.

This article is structured into Parts, with a Brainfuck tutorial opening as Part I so we understand what we have to deal with. As you will attest, it’s a fun little language! We’ll also get a little Elisp environment up and running, we’ll be installing Emacs, but no need to have ever used it! Part II will get us started Lisping by implementing the VM’s boilerplate. We’ll introduce quite a few Lisp concepts as we walk through our code. Part III, IV, and V will be each divided into two sub-parts. The first one dealing with the necessary VM modifications such as adding bytecode instructions. The second will address the compiler modifications necessary for translating the currently studied Bainfuck syntax subset to the newly enhanced bytecode.

Finally, we will close this tutorial with Part VI by making a small Emacs major mode for Brainfuck providing syntax highlight along with a REPL. This will allow us to play a bit with Emacs’ API.

Let’s get started :)

Part I: Getting started

In this Part, we’ll first get our Elisp environment up and running, and then start experimenting with Brainfuck. The reason we do it in this order is simple: we’ll be running some Elisp to get familiar with Brainfuck, and I’ll use the occasion to introduce some of the Lisp concepts we’ll be using. So we might as well have an Elisp environment going.

Getting an Emacs Lisp environment running

We’ll get started by getting an Elisp environment up and running. We’ll be installing a fresh emacs, cloning the repo of the compiler we’ll be coding so we can experiment with Brainfuck in the next sub-part, and we’ll be creating and evaluating a new Elisp file. Finally, we’ll introduce some basic Emacs functionalities so we can code sane, and recommend some packages to install to make your life easier. If you’ve already used Emacs, you scan safely skip to the next sub-part.

Installing Emacs is easy, it should be along the line of:

# arch
pacman -Syu emacs

# ubuntu
apt install emacs

# alpine
apk add emacs

# MacOS
brew install emacs

# Windows
choco install emacs

We can now clone the repo containing the Brainfuck compiler and VM we’ll be coding, we’ll see how to use them in the next sub-part:

git clone git@github.com:tbinetruy/brainfuck-compiler.git
cd brainfuck-compiler

We can now start up emacs in the freshly downloaded repository, appending -nw to start it in console mode if preferred:

emacs      # windowed
emacs -nw  # terminal mode

Now that Emacs is up and running, know that you can call any (public, or interactive more precisely) functions by pressing M-x where M is the Meta key (usually alt). We can create a new file with the command M-x find-file, M-x [1] will trigger a input at the bottom left of the Emacs window, where you can type in find-file and press Enter. There, you’ll be able to enter a new Elisp file name, for example new.el, and press the Enter key. Finally, press M-x save-buffer to write the file to disk (you’ll get a notification at the bottom left of the Emacs window). From there, you can call M-x eval-buffer to evaluate the buffer.

[1] https://www.gnu.org/software/emacs/manual/html_node/emacs/M_002dx.html

Evaluating the buffer loads the current Elisp buffer into the Emacs environment. In new.el, enter the following. We’ll see what this does in Part II, for now let’s just make sure that you environment is up and working:

(defun some-func ()
  (print "hello"))

This defines a function that takes no arguments and prints out “hello”. Evaluate the buffer with M-x eval-buffer, split the window with M-x split-window-right, move to the new window with M-x windmove-right, and start the Elisp REPL with M-x ielm. Now, typing (some-func) in the REPL and hit Enter to evaluate it. This should output “hello” in the REPL, everything works :)

Note that you can hit TAB when calling emacs functions with M-x as well as in the Elisp REPL (ielm). Some of the M-x calls we’ve made are binded to shortcuts, as described in M-x describe-bidings:

C-x C-s for save-buffer which saves the current file,
C-x C-c for save-buffers-kill-terminal which saves all opened files and quits Emacs,
C-x C-3 for split-window-right which splits the current window vertically,
C-x o for other-window which cycles the cursor through windows,
C-x C-f for find-file which opens a new file.

By default, windmove is not binded to any shortcuts. If you want to between windows with SHIFT plus an arrow key (S-<left>, S-<right>, S-<up>, S-<down>), then you need to evaluate (windmove-default-keybindings) as instructed in [1]. To do so, here are a few possibilities:

[1] https://emacs.stackexchange.com/questions/3458/how-to-switch-between-windows-quickly

Create a file .emacs in your home, write (windmove-default-keybindings) in it, save the file, and restart Emacs. Emacs reads ~~/.emacs~ when it starts meaning that you can write whatever Elisp you want in it to configure your Emacs.
You could also write (windmove-default-keybindings) in new.el and evaluate the buffer. You don’t need to save the file but you’ll need to do that every time you start Emacs.
Finally, you can press M-: [1], which prompts you for some Elisp at the bottom left of the Emacs window. There, enter (windmove-default-keybindings) and hit ENTER, this will have evaluated the expression in the Emacs environment. As with 2., you’ll need to do this every time you start Emacs.

[1] https://www.gnu.org/software/emacs/manual/html_node/emacs/Lisp-Eval.html

Last but not least, as we will see, Lisp code necessitates a lot of parenthesis and can be rough to write and align by hand. The Lisp syntax being a tree, actually allow for Emacs to help you out writing and navigating Lisp code by operating on the tree. Just like Vim allows you to navigate text much simpler than by using arrows, the Emacs package smartparens [1] or paredit [2], allows one to navigate Lisp code semantically. Once you get used to it, you actually never have to write parenthesis ever again when writing and editing Lisp code. Check out on Emacsrocks [3] how the narrator uses paredit [3] to navigate Lisp code.

In order to install these packages, you’ll need to setup Emacs’ community repository, MELPA [4] https://melpa.org/, which you can learn how to on Emacs’ wiki [5].

(require 'package)

(add-to-list 'package-archives
             '("MELPA Stable" . "https://stable.melpa.org/packages/") t)
(package-initialize)

Installing paredit is then as simple as typing M-x package-install RET paredit RET (where RET is the Return key). Similarly, might want to install which-key [6] by entering M-x package-install RET which-key RET, as this will show keyboard shortcuts available as you start typing them. You’ll need to activate Which Key by either entering M-x which-key-mode, M-: (which-key-mode), or adding (which-key-mode) in your .emacs file. Then, when you’ll hit the prefix key [8] C-x, Emacs will show at the bottom of the screen what shortcuts to hit to call an Emacs function. In the case of C-x, which key will show you that if you hit 2 you will evaluate split-window-bellow [7].

[1] https://github.com/Fuco1/smartparens
[2] https://www.emacswiki.org/emacs/ParEdit
[3] http://emacsrocks.com/e14.html
[4] https://melpa.org/
[5] https://www.emacswiki.org/emacs/MELPA
[6] https://github.com/justbur/emacs-which-key
[7] https://github.com/justbur/emacs-which-key/blob/master/img/which-key-bottom.png
[8] https://www.gnu.org/software/emacs/manual/html_node/elisp/Prefix-Keys.html

Now that we have a working Emacs environment to lisp in, let’s first go over Brainfuck’s syntax.

An introduction to Brainfuck

Let us first understand what Brainfuck is about. The basic idea is that we have a heap at out disposal, represented as a list of integers, and a pointer to an entry in that list. We can navigate this heap in Brainfuck by moving the pointer around it by incrementing or decrementing it. Finally, we can output to stdout the heap entry that is currently being pointed to by converting it to its associated ASCII character. Before introducing the looping mechanism offered by Brainfuck, let us first go over the syntax for the concepts we have just discussed:

<: decrement the heap pointer
>: increment the heap pointer
+: increment the heap entry being pointed to by the heap pointer
-: decrement the heap entry being pointed to by the heap pointer
.: output as ASCII to stdout the heap entry being pointed to by the heap pointer.

As an example, if we begin with a heap of length ten with each entries set to 0 and want to set it to [1, 2, 3, 0, 0, ...], we could write the following Brainfuck program:

+>++>+++

Let’s compile this Brainfuck program and run it through our VM we will program in this article. Let’s first load the two files in this repo [1]:

[1] https://github.com/tbinetruy/brainfuck-compiler

(load-file "./vm.el")
(load-file "./bf-compiler.el")

Nice, our first Lisping ! We can see that the imports worked since the result printed by the Elisp (Emacs Lisp) REPL(Elisp interpreter) is t, which stands for the true boolean. Let’s look at the code now. On the first line, we’ve loaded the vm.el file from our current directory’s into the REPL by calling the function load-file. Notice how we’re calling the function, with the following syntax: (function-name arg1 arg2 ...). We infer from the above snippet that load-file takes as argument a string representing the path of the file to load. Notice also that the .el extension is the standard extension for elisp files.

Let’s proceed by compiling our Brainfuck code with the function compiler//compile and run the output through the VM with vm//main:

(setq src "+>++>+++")
(vm//main (compiler//compile src t))

(1 2 3 0 0 0 0 0 0 0)

As we can see in the code’s output just above, the heap’s end state is indeed a list of zeros with the three first elements set to the first three natural numbers. Notice that we’ve defined a variable src with the Elisp macro setq taking the variable name as a first argument and its value as a second argument. We’ve also just mentioned the word “macro”, let’s dive into what this means briefly because it’s one of the biggest selling points used when discussing lisps. For now, we’ll simply define it as a function that first rewrites its arguments and then runs the transformation’s output. Here, we don’t really care how setq rewrites its arguments, we just know (thanks to Elisp’s documentation) what it does. In this case, it rewrites them such that we have a variable src defined and now accessible from the current scope. Don’t worry too much about macros for now, we’ll discuss them in greater detail as we go through this tutorial and in particular, when we implement our own. However, I wanted to introduce them since we just encountered one in the pretty name of setq.

The second line in the above snippet runs nested functions. And interestingly, we will see that we can do this because, unlike macros, functions evaluates their arguments before evaluating their bodies [*]. Here, we can see that vm//main takes a single argument, which is the output of the compiler//compile function applied to the src variable.

[*] Indeed, macros do not evaluate their arguments before running their body. That’s exactly why they can operate on their arguments, and transform them allowing for some very powerful meta-programming. In vm//main however, we want to evaluate the call to compiler//compile and pass this result as argument before running the VM.

Let us now introduce Brainfuck’s looping mechanism. We surround a number of instructions to iterate over with [ and ], the loop will execute the instruction block until the value pointed to by the heap pointer equals zero when beginning the next iteration. The following snippet implements an addition, in this case 3 + 3, by making use of a simple loop:

(vm//main (compiler//compile "+++>+++<[->+<]" t))

(0 6 0 0 0 0 0 0 0 0)

We get the correct result by initiating the heap to (3 3 0 ...) before the loop start, then subtracting unity from the first heap entry and adding it to the second until the first entry vanishes. As an exercise, you could modify our addition function to have the result in the first heap entry rather than the second ;)

Hello World

Let us now conclude this Brainfuck tutorial by making a Hello World program. In this first snippet, we increment the first heap entry until 32 in order to “cache” the space character which we’ll need later. We then increment the heap pointer and start printing characters to stdout by incrementing or decrementing the current heap entry with + or - and outputting it to stdout with .. We therefore start with 72 plus signs corresponding to the letter ‘H’. We then decrement by 3 and output ‘E’ etc. Once at the end of “HELLO”, we decrement the heap pointer and output the space character we needed. We increment back the heap pointer and continue outputting “WORLD”. We could have used a single heap entry to output our words, but reserving the first entry for special chars and the second for letters saves us from having to decrement from 79 (‘O’) to 32 (’ ‘) back to 87 (‘W’). It’s also funner :)

(vm//main (compiler//compile "++++++++++++++++++++++++++++++++>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++.---.+++++++..+++.<.>++++++++.--------.+++.------.--------." t))

HELLO WORLD

(32 68 0 0 0 0 0 0 0 0)

In the result above, the standard output produced by running the Hello World program is shown in the left box, and the heap end state in the right one.

Bellow is a different hello world that I found on the English Wikipedia entry for Brainfuck. Notice that it is both shorter and more complex than the previous example. In particular, it uses uses loops and a little more memory to save up on + and - signs. So mysterious:

(setq src "++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]>>.>---.+++++++..+++.>>.<-.<.+++.------.--------.>>+.>++.")
(vm//main (compiler//compile src t) nil) ; "prints Hello World!"

Hello World!

(0 0 72 100 87 33 10 0 0 0)

Part II: virtual machine boilerplate

Our brainfuck compiler will target a virtual machine that we’ll design in this section. It’ll teach us a bit about bytecode [1], and is a good exercise to learn a bit about Lisp and computers in general as our bytecode will resemble Assembly language in some ways.

[1] https://en.wikipedia.org/wiki/Bytecode

From an abstract point of view, our virtual machine will be a program that takes a list of instructions (that we’ve been calling bytecode) that can operate on a stack and a heap. We’ll also implement registers that can be used to store variables, such as the program counter tracking which instruction is currently being executed by the VM. Ultimately, the goal will be to have instructions allowing us to translate Brainfuck code into a sequence of instructions outputting the desired result when ran through the VM. As we will see at the end of this document, what we’ll be implementing is be very similar to Web Assembly and Python’s bytecode among other VM instruction sets. So it’s actually quite interesting :)

Let’s get Lisping! There are a few concepts to go over, but we will mostly learn about them as we code. The snippet bellow declares a function vm//main, which takes a list of instructions as first argument and an optional log boolean which we’ll use to see what is going on in our VM as it executes code. I’ve added a significant of comments to help you decipher what is going, and as you’ll see, we’ve already gone over most of the syntax.

The structure of the VM will be simple. We first declare a stack and registers variables which will be a list and a key-value store respectively. We’ll initialize the stack as an empty list. The registers variable will be an associative list, which we’ll discover in more details bellow, with a single entry pc, storing the program counter, initialized to zero. We then need to iterate through the instructions passed as arguments to vm//main, and case match them to handle them accordingly. Finally, we return the stack and registers end states.

Because we’ll need to be adding numbers to execute Brainfuck, the first instructions the VM will be implementing will be PUSH and ADD. The former pushes its argument on the stack, and the former adds the two top elements from the stack, and replaces them by the result. For instance, if the top of the stack is 1, that we PUSH 2 and call ADD, the VM will pop 1 and 2 off the stack, add them, and push the result, 3, back on the stack:

                             +------------+             
                             |      2     |             
+------------+               +------------+            +------------+
|     1      |               |      1     |            |      3     |
+------------+    PUSH 2     +------------+    ADD     +------------+
|     ..     |  -------->    |     ..     |  --------> |     ..     |
+------------+               +------------+            +------------+
|Stack bottom|               |Stack bottom|            |Stack bottom|
+------------+               +------------+            +------------+

Finally, the code. For now, scan through the code, read the comments, and see you after where I’ll try to explain you what is going on in detail.

; function declaration
(defun vm//main (instructions &optional log)
  ; Variable declarations
  (let ((stack '())  ; The initial stack is an empty list
        (registers (list (cons 'pc  0))))  ; Registers are list of cons. Here, there
                                           ; is only 1 register, the program counter
    ; Iteration over instruction set until program counter is
    ; greater then the instruction set length
    (while (< (alist-get 'pc registers) (length instructions))
      ; Variable declarations: the current instruction obtained
      ; using the program counter is stored. We then parse the
      ; instruction and its argument and store them.
      (let* ((instruction (nth (alist-get 'pc registers) instructions))
             (key (nth 0 instruction))
             (val (nth 1 instruction)))
        ; Case matching and handling of current instruction
        (if (equal key "PUSH")
            ; Update stack variable with mutate stack returned
            ; by the function`vm//push`
            (setq stack (vm//push stack val)))
        (if (equal key "ADD")
            (setq stack (vm//op stack '+)))
        (if (equal key "SUB")
            (setq stack (vm//op stack '-)))

        ; ...
        ; Add additional cases here
        ; ...

        ; Log stack current instruction, stack and registers
        ; if necessary
        (if log
            (message "%s %s %s" instruction stack registers)))
      ; Increment the program counter and end the current iteration
      (vm//increment-pc registers 1))
    ; Return the registers and final stack states
    `(,stack ,registers)))

As you can see, we use the macro defun in Elisp to declare a function. Its first argument is the function’s name, the second a list of arguments - two in our case, with an optional one -, and n more expressions that will evaluate sequentially and return the last expression evaluated by the function.

In this function, we first declare a stack as an empty list using '() in the let macro. This macro takes a list of assignments as first argument and an expression as a second argument. The variables declared will be available from the code passed as second argument to let when it executes. We also define our registers as a list of cons (which you can view as a pair) where the first element is the key and the second the value for that key. In Lisp, this is called an associative list, or alist, and loosely be viewed as a dictionary. Hence our registers are a simple key->value store with the key being the register name and the value, the value stored at that register. We only define a single register for now, called pc for program counter.

Speaking of alists, we can now formalize the let macro signature. It takes an alist as first argument, defining the variables and their initial values, and binds them to the scope of the other arguments which will be evaluated iteratively. Note that the variables defined in the alist will not be available outside the let form.

We then iterate over the list of instructions using while macro until the program counter, incremented at each iteration, grows larger than the length of the instruction list. As you can witness, while takes a condition as first argument, and loop over all other arguments making up the while body. We use a special variant of let to parse the currently executed bytecode instruction: let*. The difference with the former is subtle: without the *, we cannot have the variables refer to each other sequentially in the first argument for let (the definitions). Here however, we get the element from the instruction list pointed to by the program counter (with (nth (alist-get 'pc registers) instructions)) and then use this value to declare the variables key and value which correspond to the instruction name and its argument respectively. Had we not used let*, we’d have to use two nested let expressions.

At this point, you’ve probably wondered why are there quotes sprinkled all over our lisp code, and it’s time to understand this syntax. As you can see, our Lisp code consists of nested expressions of the form (exp-name exp-arg-1 exp-arg-2 ... exp-arg-n) that will evaluate and return other expressions. There are rules to the evaluations of expressions and in particular, when evaluating a function, all its arguments are evaluated first, and then passed to the function body for further use. The quote symbol we’ve been using is actually a lisp shorthand [1] for the quote macro, therefore 'some-exp expands to (quote (some-exp)). Note the word “expands”, which is the term used to describe the transformation done by a macro. Unlike functions, macros do not evaluate their arguments when executing allowing them to be used to manipulate code, we’ll actually implement our own macro shortly. Sometimes however, you want to pass un-evaluated arguments to a function, and in this case quote the argument. This is what is been done when declaring a cons cell in the above snippet, and when calling alist-get for getting an alist’s entry for a specific key. In both cases pc needs to be quoted as it is not available in the scope (either as a function, variable, macro or whatever) and hence would throw. By quoting it we actually pass the symbol, pc, to cons and alist-get by quoting it with 'pc. Think of it this way: if we are not evaluating a variable that is not defined, then there is no problem, we just associate it a symbol and pass that symbol around :) Last but not least, this means that the quote can be used to get a list. Indeed, evaluating (1 2 3) will look for a function or macro called 1 and evaluate it with arguments 2 and 3; whereas '(1 2 3) (or equivalently, (quote (1 2 3))) will return a list with elements 1, 2, and 3. By the way, you could also create a list by calling the more traditional (list 1 2 3). All of this may seem quite complex at first, and frankly, I haven’t completely grasped it myself (please help me out if I’m wrong btw, all help is welcomed ;)), but you’ll get it as we start coding some more!

[1] I believe the quote symbol is hard coded in the Elisp reader but is usually implemented as a “reader macro”, which Elisp does not support if my memory does not fail me.

Finally, we do case matching on the instruction name, execute it, and increment the program counter before returning the function with the final stack and registers state. Which is where we meet the quasi-quote symbol (or backquote), `. When applied to an expression, it evaluates none of its arguments but the ones “escaped” by the comma, ,. This is a very convenient way to create new lists as it allows us to evaluate some of its elements. As an example, if val evaluates to 10, then `(foo ,val) will evaluate to the list (foo 10).

Let’s now go over the PUSH instruction. We call it from vm//main with (vm//push stack val) and assign its result back to stack with setq which allows to modify a variable available in the current scope. As you can see, pushing onto a list is very easy in elisp, we simply make use of the default push macro.

(defun vm//push (stack val)
  (push val stack))

The vm//push function simply takes a stack and a value to push as argument, and returns the resulting stack. Notice that push returns the mutated original list when evaluated.

Now, the ADD instruction. Since SUB, MUL and DIV work similarly, we have factored the logic into a function vm//op which we call by passing the lisp functions +, - , * and / to. Remember what the ADD instruction does, vm//op generalizes it: it pops two elements off the stack and applies the function passed as the first argument to them and pushes the result back on the stack. We apply arguments to a function dynamically using funcall:

(defun vm//op (stack op)
  (push (funcall op (pop stack) (pop stack)) stack))

Notice that funcall is itself a function and that its arguments will be evaluated first. This allows us to pass (pop stack) as funcall arguments and be sure to the resulting stack is what will be operated on. Notice that pop return the popped element and mutates the original list.

The ADD instruction is then implemented as (vm//op stack '+), where we pass the + function symbol as a second argument by quoting it thus preventing its evaluation. One could loosely view the quote symbol as a function pointer in this particular case.

The vm//increment-pc function increments the pc register by a value passed as argument. Setting the pc is a bit tricky: we want to set the second element of the key-value pair corresponding to pc in the registers alist. We do this using setcdr on (assq 'pc registers) with val. Concretely, a con’s car and cdr are it’s first and second element respectively. We can modify the con’s second element by calling setcdr on it. Here what we’re trying to do is find the cons corresponding to the program counter in the registers alist, and set its cdr to a new value. We can get a pointer to this cons using assq which takes a key, 'pc, as first argument to look into the alist provided by the second argument, registers. The function set-pc takes the registers and a new program counter value as argument, and returns the new register value. However, since the argument registers is a pointer to an alist, calling setcdr on an entry will modify it in the calling function as well. This is the reason why we do not need return the new registers alist.

(defun vm//set-pc (registers val)
  (setcdr (assq 'pc registers) val))

We can now make use of vm//set-pc to implement vm//increment-pc which will increment our program counter, this function speaks for itself but parsing it if you’ve never lisped much before can take some time:

(defun vm//increment-pc (registers val)
  (vm//set-pc registers (+ (alist-get 'pc registers) val)))

Let’s now try to add two numbers! All we need to do is PUSH the two numbers on the stack and call ADD:

(vm//main '(("PUSH" 10)
            ("PUSH" 20)
            ("ADD")))

30

(pc . 3)

As you can see, we have the correct output of 30 and our final program counter value is 3, as expected since we only have three instructions to run.

Part III: Compiling a Brainfuck subset to our VM’s bytecode

Now that we’ve learned how Brainfuck works and have our small VM working, we can start compiling a first set of Brainfuck instructions to our bytecode. We’ll need to make some modifications to our VM: add a heap and a heap pointer since that’s how brainfuck works. Then, we’ll be able to easily compile the +, - , < and > instructions. The important takeaway here is that we now have two registers, one to track which instruction we are currently executing and another to identify where in the heap we are currently located at. We’ll also need to add instructions to mutate the heap or else it would be of no use.

VM modifications

We’ll add on top of our VM’s stack and registers, some RAM used as our heap and initialized as a list of zeros using the function make-list. It’s arguments are the list length to return, and the initial value for its entries. We’ll start with a heap of length 10 initializes as zeros with (make-list 10 0). We’ll also add the READ_RAM and WRITE_RAM instructions to interact with the VM’s RAM. Finally, we need three new registers: eax, ebx and ecx which essentially gives our VM three global variables. eax and ebx will be used by some VM instructions such as WRITE_RAM by reading the value to write in ebx and the location to write to in eax. We could implement WRITE_RAM only on the stack by popping the two top stack entries and acting upon them as with additions, but I have chosen to implement them as registers. Not only to mimic the way system calls work on Linux but also and mostly because it’s funner. Moreover, our compiler will use the ecx register to store the heap pointer since we need to store it somewhere.

Note: An other way to implement the heap pointer would be by implementing a stack pointer that would allow us to move around the stack. We could then store the heap pointer (and any other variables) by pushing it on the stack and recovering it at any time by moving the stack pointer back to the variable location. This is how variables can be implemented on a compiler that allows moving the stack pointer around. For example, the Tezos blockchain virtual machine which allows moving the instruction pointer (i.e. the stack pointer) around with the DIP and IIP instructions (“decrement instruction pointer” and “increment instruction pointer”).

We first define two helpers that will allow us to easily store and read a value from a register. Starting with the vm//store function which takes a stack and a registers alist as argument and returns the new stack. We are simply generalizing the work we did in vm//set-pc. We’ll use the stack to store the register name and its value by pushing them onto it, and popping them in vm//store to set the appropriate registers alist entry as in vm//set-pc:

(defun vm//store (stack registers)
  "Stores the second to top stack entry into the register referred
to by the top stack entry and pops these two value from the stack.
Ex: [S, new_val, register_name] -> [S]
    {R, register_name: old_val} -> {R, register_name: new_val}"
  (setcdr (assq (pop stack) registers) (pop stack))
  stack)

In the above docstring, we introduce a convenient notation allowing us to formalize the stack and register mutations by the function. [S, new_val, register_name] means “some stack with n elements and the top two elements being new_val and register_name”. {R, register_name: old_val} means “a set of registers R with an additional register register_name and value old_val”. The arrow, -> indicates the transition of stack or register states before and after the function evaluates.

We should also implement a function allowing us to load a register onto the stack. In the vm//load function bellow, we get the register name to retrieve by popping the stack, and we use cdr and assq to retrieve the associated register cons and getting its second entry (the register value). Finally, we return the new stack:

(defun vm//load (stack registers)
  "Pushes the content of the register referred to by
the top stack entry on the stack and pops the register
name.
Ex: [S, register_name] -> [S, registers[register_name]]"
  (push (cdr (assq (pop stack) registers)) stack))

We then need to be able to write to a ram address by reading from the register eax which address to write at, and from register ebx the new value. We have not yet seen how to edit a list nth’s entry however. We need to dive into how lisps work a tiny bit in order to understand how to do that. In lisp, lists are essentially linked lists constructed from cons. Each cons’s car is its value, and cdr the address to the next con cell. Here’s a little diagram that will help you out, they’re taken from [1] and slightly modified for our needs, but I encourage you to go read [1] as it explains a lot about the inner workings of lisp ;)

[1] https://www.emacswiki.org/emacs/ListModification

             +-------+-------+
cons cell--->|  car  |  cdr  |
             +-------+-------+

        +-------+---+    +------+---+  +-------+-----+
list--->| alpha | *----->| beta | *--->| gamma | nil |
        +-------+---+    +------+---+  +-------+-----+
             ^              ^     ^               ^
             |              |     |               |
         con cell 1        car   cdr           list end

We can now understand how to access a list’s element, what we want is similar to finding the nth cdr of the list. Which is exactly why Elisp has a function nthcdr available for us. But that will actually only give us the sub-list going from n, so we want its first element, its car which we can set with setcar. Hence, in lisp, setting a list’s nth element is done by setting the car of the nthcdr of a list :)

Back to reading and writing from and to our ram using the registers. We’ll have vm//write-ram read the ram address to write at register eax and the new value at register ebx, we use this information to return an updated ram.

(defun vm//write-ram (registers ram)
  "Stores a value at some ram address:
eax: ram address
ebx: value"
  (let ((addr (alist-get 'eax registers))
        (val (alist-get 'ebx registers)))
    (setcar (nthcdr addr ram) val))
  ram)

To spice it up a bit, we’ll implement vm//read-ram by reading the address from register eax and updating eax with that value, so the stack will not change when this function executes, only the content of the eax register will! To do so, we set the desired register with setcdr and assq to the nth entry in our ram, or in Lisp words, the car of the nthcdr of the list. n being the content of the eax register in our case:

(defun vm//read-ram (registers ram)
  "Read value from ram. Return value in eax
eax: ram addr
return eax: value at ram addr"
  (let ((addr (alist-get 'eax registers)))
    (setcdr (assq 'eax registers) (car (nthcdr addr ram)))))

We can now add these new instructions to our virtual machine:

(defun vm//main (instructions &optional log)
  (let ((stack '())
        (registers (list
                    (cons 'eax  0)
                    (cons 'ebx  0)
                    (cons 'ecx  0)   ; heap pointer
                    (cons 'pc  0)))  ; program counter
        (ram (make-list 10 0)))
    (while (< (alist-get 'pc registers) (length instructions))
      (let* ((instruction (nth (alist-get 'pc registers) instructions))
             (key (nth 0 instruction))
             (val (nth 1 instruction)))
        (if (equal key "READ_RAM")
            (vm//read-ram registers ram))
        (if (equal key "WRITE_RAM")
            (setq ram (vm//write-ram registers ram)))
        (if (equal key "LOAD")
            (setq stack (vm//load stack registers)))
        (if (equal key "STORE")
            (setq stack (vm//store stack registers)))
        (if (equal key "PUSH")
            (setq stack (vm//push stack val)))
        (if (equal key "ADD")
            (setq stack (vm//op stack '+)))
        (if (equal key "SUB")
            (setq stack (vm//op stack '-)))
        (if log
            (message "%s %s %s" instruction stack registers)))
      (vm//increment-pc registers 1))
    `(,stack ,registers, ram)))

Let’s play with it for a bit :) In the snippet bellow, we push 5 to eax and 10 to ebx and call WRITE_RAM to store 10 at heap address 5. We also call READ_RAM which will write 10 to the eax register since this register contained 5 before calling the instruction. In other words, the content of eax, 5, will be replaced with the heap value at 5, which is 10. Note that we end up with an empty stack:

(vm//main '(("PUSH" 5)
            ("PUSH" eax)
            ("STORE")
            ("PUSH" 10)
            ("PUSH" ebx)
            ("STORE")
            ("WRITE_RAM")
            ("READ_RAM")))

nil

((eax . 10) (ebx . 10) (pc . 8))

(0 0 0 0 0 10 0 0 0 0)

Compiling

Finally, we can start compiling our Brainfuck instruction subset. The first thing we need is a lexer, which will tokenize Brainfuck code. It is a very simple lexer that just takes a string with no spaces and ouputs a list of characters making up the string. We use the function split-string on a Brainfuck string passed as the argument code and split it by the empty character. This will split the string into a list of characters making it up:

(defun lexer//lex (code)
  (split-string code ""))

(lexer//lex "++++>+++++<[->+<]")

+

>

+

<

[

-

>

+

<

]

Keep It Simple Stupid !

Let’s now write some boilerplate for our compiler. We need a function that take some code as input and lexes it, we store the result in a variable tokens which we will iterate over. We also need an instructions variable which is initialized as an empty list and which will contain the bytecode representation of the Brainfuck code that will be ran through our VM. We also need a position counter called current-pos which will contain a pointer to the current token to compile. Note that this counter is independent from the VM, it’s simply used by the compiler to track which Brainfuck token it is currently compiling. We then iterate over the token list and process them (+, -, < and >).

To get us warmed up, let’s define a function that returns some initialization code. We want to initialize the heap pointer to zero. Remember that in Brainfuck, the goal is to move around the heap (our VM’s RAM) and increment or decrement entries to compute something. We therefore need to track this heap pointer somewhere in our VM, why not in the ecx register since it isn’t used by our VM? So let’s initialize the ecx register to zero:

(defun compiler//init-code (instructions)
  "Initialize the program counter, stored in ecx, to 0."
  (push-instruction '("PUSH" 0))
  (push-instruction '("PUSH" ecx))
  (push-instruction '("STORE"))
  instructions)

Notice that we’ve used a not yet implemented macro called push-instruction. We’ll implement it shortly, for now, just know that it pushes its argument to the instructions list passed as argument to compiler//init-code.

We now define our compiler’s main body, compiler//compile, that takes some code as its only argument, and returns a list of bytecode instructions for our VM to execute. This function is in some ways very similar to vm//main:

(defun compiler//compile (code)
  "Compiles a Brainfuck code string."
  (let ((tokens (lexer//lex code))  ; Lex the code
        (instructions  ; bytecode start with our init code
         (compiler//init-code instructions))
        (current-pos 0))  ; we start at the first token
    ; We iterate on the tokens
    (while (< current-pos (length tokens))
      ; We parse the token
      (let ((el (nth current-pos tokens)))
        (if (equal el ">")
            ; Call the appropriate function and update the
            ; instructions with the result
            (setq instructions (compiler//increment-pc instructions 1)))
        (if (equal el "<")
            (setq instructions (compiler//increment-pc instructions -1)))
        (if (equal el "-")
            (setq instructions (compiler//increment-value instructions -1)))
        (if (equal el "+")
            (setq instructions (compiler//increment-value instructions 1))))
      ; We increment the current position
      (setq current-pos (1+ current-pos)))
    ; And return the final bytecode list
    instructions))

We’ll now implement the push-instruction macro that we’ve been using. We implement it using the macro defmacro, as opposed to defun to define functions. As said previously, a macro differs from a function in that it does not evaluate its arguments, and it returns an expression to be evaluated later. This macro takes an instruction as input and returns an expression that assigns to instructions the concatenation of instruction to the list instructions the compiler shall return. We implement this helper as a macro rather than a function to save us from passing list pointers around.

(defmacro push-instruction (instr)
  `(setq instructions (nconc instructions (list ,instr))))

Notice the new function, nconc, which returns the concatenation of n lists: (nconc '(1 2 3) '(4 5)) => (1 2 3 4 5).

But what does this macro return? It transforms its argument instr, which is a list, with the expression returned by the macro. It is this returned expression that will be executed by the macro caller. Hence, macros manipulate code before executing it allow for some very powerful meta-programming when needed. This is why macros do not evaluate their arguments, so you can manipulate them before hand. In particular, you can see that in Lisp’s case, we can manipulate a program as data using macros. We call this property homoiconicity [1], which we quote some of the Wikipedia entry bellow:

A language is homoiconic if a program written in it can be manipulated as data using the language, and thus the program’s internal representation can be inferred just by reading the program itself. For example, a Lisp program is written as a regular Lisp list, and can be manipulated by other Lisp code. In homoiconic languages, all code can be accessed and transformed as data, using the same representation. This property is often summarized by saying that the language treats “code as data”. [1]

[1] https://en.wikipedia.org/wiki/Homoiconicity

Let us dig into push-instruction a little more so we really understand what is going on. We can expand the macro as with the function macroexpand to see its result as follows:

(macroexpand '(push-instruction '(PUSH 1)))

As you can see in the macroexpand output, we are generating an expression that will assign the new instructions list to instructions. Notice also that since macroexpand is itself a function, we need to quote its argument or else it would be evaluated. In this case, the evaluation of the macro would first expand it and then evaluate it before throwing since there is no instructions variable defined in the current scope.

Let us now investigate how to increment the program counter, stored in ecx. We first push the register on the stack, load its value, add unity to it, and store it back to ecx:

(defun compiler//increment-pc (instructions val)
  "Increments the value of the program counter by val.

It works as follows:
- Push the value of ecx (where the pc is stored) onto the stack
- Push val on the stack
- Call the add instruction
- Store the resulting value back into ecx"
  (push-instruction '("PUSH" ecx))
  (push-instruction '("LOAD"))

  (push-instruction `("PUSH" ,val))

  (push-instruction '("ADD"))

  (push-instruction '("PUSH" ecx))
  (push-instruction '("STORE")))

Incrementing a value pointed to in the heap by the heap pointer (stored in ecx) is a little more complex. We first copy the heap pointer from ecx to eax and call READ_RAM to read the heap value now at eax (and ecx). This stores the heap value at the heap pointer on eax which we then push on the stack allowing us to increment it. We then store this new value by saving it to ebx, copying the heap pointer (in ecx) to eax again (since its first copy was overwritten by the READ_RAM instruction), and calling WRITE_RAM (which writes ebx at RAM address eax).

(defun compiler//increment-value (instructions val)
  "Increments by val the content of the address pointed at by the pc."
  ; Get the program counter from ecx and store it to eax.
  ; This is the address at which we want to increment the value.
  (push-instruction '("PUSH" ecx))
  (push-instruction '("LOAD"))
  (push-instruction '("PUSH" eax))
  (push-instruction '("STORE"))

  ; We then read from the ram at the address in eax (the pc)
  (push-instruction '("READ_RAM"))

  ; The value we read is now in eax, and we want to load it on the stack
  (push-instruction '("PUSH" eax))
  (push-instruction '("LOAD"))

  ; We can now add val to the value we pushed on the stack
  (push-instruction `("PUSH" ,val))
  (push-instruction '("ADD"))

  ; and store it back in ebx.
  (push-instruction '("PUSH" ebx))
  (push-instruction '("STORE"))

  ; Finally, we store the pc in eax again
  (push-instruction '("PUSH" ecx))
  (push-instruction '("LOAD"))
  (push-instruction '("PUSH" eax))
  (push-instruction '("STORE"))

  ; and call the store to ram syscall.
  (push-instruction '("WRITE_RAM"))
  instructions)

Let’s run a simple example making use of the Brainfuck subset we have just implemented in the compiler:

(vm//main (compiler//compile "+>++>+++<-"))

(eax . 1)	(ebx . 1)	(ecx . 1)	(pc . 133)
1	1	3	0	0	0	0	0	0	0

As expected, the heap state we get is (1 1 3 0 ... 0) meaning that we have properly compiled these instructions. Notice that our VM’s program counter is at 133, meaning that our compiler compiled 133 instructions from this brainfuck code! Surely, there are a few optimizations that we could easily make, but we’ll let the motivated readers give it a try. One obvious optimization that comes to mind is to “compress” repeating characters at compile time. For instance, this code ++++ increments by 1 the heap value currently pointed to instead of simply adding five to it at once. A similar optimization can be made with the mutation of the heap pointer with < and >.

Part IV: Adding Brainfuck loops to our compiler

We’ll now implement the Brainfuck’s looping mechanism. This feature is crucial as without it, Brainfuck would not be a Turing complete language. In order to introduce this feature, we need to add some bytecode instructions to our VM. In particular, we’ll need to be able to jump around which instruction is currently being executed by mutating the VM’s program counter. To do so, we’ll add the relative jump and conditional instructions. We’ll then be free to compile loops easily. Let’s get started!

VM modifications

We’ll need to implement jumps and conditional statements in our VM. Indeed, Brainfuck’s looping mechanism, introduced in Part I, requires to jump back to the beginning of the loop block only if the heap value at the location the heap pointer points to is not zero when the current iteration begins. We’ll implement the jump as a relative jump simply by incrementing or decrementing the program counter.

The if statement is a bit trickier: if the top stack entry is 0 (False), then increment the program counter by one additional unit. When added to the VM’s incrementation of the program counter after executing each instruction, this means that a False statement will actually increment the program counter by two. If the top stack entry differs from 0, then the VM does nothing and the program counter will only be incremented by unity.

(defun vm//if (stack registers)
  "Pops the top stack entry and increments the program counter by
unity if it is eqaul to 0, otherwise do nothing."
  (if (equal (pop stack) 0)
      (vm//increment-pc registers 1))
  stack)

We can now add the IF and RJUMP instructions on our VM. RJUMP’s implementation is very simply, it modifies the program counter by a given value: (vm//increment-pc registers val).

(defun vm//main (instructions &optional log)
  (let ((stack '())
        (registers (list
                    (cons 'eax  0)
                    (cons 'ebx  0)
                    (cons 'ecx  0)   ; heap pointer
                    (cons 'pc  0)))  ; program counter
        (ram (make-list 10 0)))
    (while (< (alist-get 'pc registers) (length instructions))
      (let* ((instruction (nth (alist-get 'pc registers) instructions))
             (key (nth 0 instruction))
             (val (nth 1 instruction)))
        (if (equal key "RJUMP") ; relative jump
            (vm//increment-pc registers val))
        (if (equal key "IF")    ; conditional statement
            (setq stack (vm//if stack registers)))
        (if (equal key "READ_RAM")
            (vm//read-ram registers ram))
        (if (equal key "WRITE_RAM")
            (setq ram (vm//write-ram registers ram)))
        (if (equal key "LOAD")
            (setq stack (vm//load stack registers)))
        (if (equal key "STORE")
            (setq stack (vm//store stack registers)))
        (if (equal key "PUSH")
            (setq stack (vm//push stack val)))
        (if (equal key "ADD")
            (setq stack (vm//op stack '+)))
        (if (equal key "SUB")
            (setq stack (vm//op stack '-)))
        (if log
            (message "%s %s %s" instruction stack registers)))
      (vm//increment-pc registers 1))
    `(,stack ,registers, ram)))

Let’s test these new instructions. We’ll try both if cases to demonstrate how it works. The snippet bellow pushes 0 and calls IF. We therefore expect that condition to fail and hence skip the ("PUSH" 10) instruction leading to a final stack equal to (20).

(vm//main '(("PUSH" 0)
            ("IF")
            ("RJUMP" 1) ; if true
            ("RJUMP" 1) ; else
            ("PUSH" 10)
            ("PUSH" 20)))

20
(eax . 0)	(ebx . 0)	(ecx . 0)	(pc . 6)
0	0	0	0	0	0	0	0	0	0

Testing with a True condition, we now do not skip over the ("PUSH" 10) instruction leading to a stack equal to (20 10) as expected.

(vm//main '(("PUSH" 1)
            ("IF")
            ("RJUMP" 1) ; if true
            ("RJUMP" 1) ; else
            ("PUSH" 10)
            ("PUSH" 20)))

20	10
(eax . 0)	(ebx . 0)	(ecx . 0)	(pc . 6)
0	0	0	0	0	0	0	0	0	0

Compiling

We have now reached what I found being the roughest part of this project, implementing Brainfuck’s looping mechanism. Not that it’s particularly hard, it isn’t really, but it was, I found, somewhat harder than what we’ve been working on so far. We’ll go through it in order by first implementing the [ token which begins the loop, we’ll then end the loop with ] and tight it all up by enhancing our compiler//compile function. We’ll of course need to implement a little find-matching-char helper which will allow us to jump back to the beginning of the loop :D

For compiler//loop-start, the rules are easy, if the heap value at the program counter (stored in ecx) is 0, then jump to jump-length, otherwise jump to the next instruction. The function takes as argument instructions which is the currently compiled bytecode we will be adding to, and jump-length which corresponds to the number of tokens between the matching [ and ]. Some comments are added to explain the logic of various bytecode sequences. Our strategy is as follows: we first load the heap value we need to test on the stack, and then jump to the end of the loop if it equals zero, otherwise step to the next instruction:

(defun compiler//loop-start (instructions jump-length)
  "Check that value at pc (stored in ecx) is not 0.
If it is, then jump by jump-length to matching ]
otherwise, jump over the else instruction to the loop body."
  ; Store the program counter to eax
  (push-instruction '("PUSH" ecx))
  (push-instruction '("LOAD"))
  (push-instruction '("PUSH" eax))
  (push-instruction '("STORE"))

  ; Read the ram
  (push-instruction '("READ_RAM"))
  ; Load this value on the stack
  (push-instruction '("PUSH" eax))
  (push-instruction '("LOAD"))

  ; Test if it is different from 0
  (push-instruction '("IF"))
  ; If it is not, increment by 1
  (push-instruction '("RJUMP" 1))
  ; Otherwise jump to the end of the loop
  (push-instruction `("RJUMP" ,(+ 1 jump-length)))
  ; Return the new set of instructions
  instructions)

Handling the loop end, ], is easy. We define a function compiler//loop-end that has the same signature as compiler//loop-start, and jump back the amount dictated by the argument jump-length instructions. On top of which we need to add the number of instructions written by compiler//loop-start and compiler//loop-end to check whether or not to continue iterating, which in our case is the magic number 11. Here’s a possible implementation for compiler//loop-end:

(defun compiler//loop-end (instructions jump-length)
  "Loop back to the start of the loop. It uses a relative jump backwards
with length equal to that of the loop body (jump-length) added to then
loop head code (added by compiler//loop-start)"
  (let ((loop-extra-length -11))  ; number of instructions in loop head and tail
    (push-instruction `("RJUMP" ,(+ loop-extra-length (- jump-length)))))
  instructions)

What we need to do now, is implement a function that can calculate the distance between matching brackets, so we can call compiler//loop-start and compile//loop-end since we need a parameter jump-length. This function, find-matching-char, will take as arguments a list of tokens, the opening character ([ in our case), and the closing character (] in our case) which will allow us to handle nested patterns. It will also take the current compiler position in the list of tokens.

We define a counter and a return value (to store the position of the matching loop character). We now introduce a new looping mechanism over lists, the dolist macro, which takes as first argument an expression of length 2 with first element being a variable name to be used in place of list elements, and a second element being the list to iterate over. The second argument to the dolist macro is a body to iterate over. We’ll use it to iterate over the tokens from the current compiler position and conditionally increment or decrement as we encounter various open-char or close-char. Finally, we return the location of the matching token:

(defun find-matching-char (tokens open-char close-char current-pos)
  "Finds the position in an array of chars of the matching characer."
  (let ((counter 0)
        (return-value 0))
    (dolist (el (nthcdr current-pos tokens))
      (if (equal 0 return-value)
          (progn (if (equal el open-char)
                     (setq counter (1+ counter)))
                 (if (equal el close-char)
                     (setq counter (1- counter)))
                 (if (equal counter 0)
                     (setq return-value current-pos)
                   (setq current-pos (1+ current-pos))))))
    return-value))

Notice the progn macro used in the conditions. We’ve used if expressions quite a lot so far, but I haven’t explained how to do an else case. As it turns out, the if macro takes three arguments: a condition, a true case, and a false case. We still don’t have the need for an else case, but clearly, we needed to be able to evaluate more than one expression in the top if expression true case. progn takes any number of arguments but wraps them into a single expression so you can pass all its body “at once”. This is what allows us to evaluate the nested conditions in the top condition’s true case.

We can now tie these three functions together into a compiler//loop function. There are a few new functions that we’re using, namely:

cl-subseq, which returns a sublist and takes as arguments the sequence to slice, the starting index, and optionally the ending index,
1-, which simply subtracts unity from its argument.

We use these to extract the tokens making up the loop body, called middle-instructions in the code bellow, and concatenate them to the loop start and end mechanisms:

(defun compiler//loop (code tokens instructions current-pos)
  (let ((start-pos current-pos)
        (matching-pos (find-matching-char tokens "[" "]" current-pos))
        (current-token (nth current-pos tokens))
        (middle-instructions '()))
      ; We calculate the sublist of tokens making up the loop
      (setq middle-instructions (nconc middle-instructions
                                       (compiler//compile
                                        (cl-subseq code
                                                   current-pos
                                                   (1- matching-pos))
                                        nil)))
    ; We get the loop starting code with, in particular, the condition
    ; mechanism
    (setq instructions (compiler//loop-start instructions
                                             (length middle-instructions)))
    ; We concatenate the starting mechanism to the loop core
    (setq instructions (nconc instructions middle-instructions))

    ; We add the loop end character related instructions
    (setq instructions (compiler//loop-end instructions
                                           (length middle-instructions))))
  instructions)

The modification to compiler//compile is trivial, we add a test case for [, add the loop related instructions, and set the compiler’s current token position to the loop’s end, ]:

(defun compiler//compile (code include-init-code)
  "Compiles a brainfuck code string. If this function is called
recursively, set include-init-code to t in the first call and to
nil and all subsequent calls."
  (let ((tokens (lexer//lex code))
        (instructions '())
        (current-pos 0))
    (if include-init-code
        (setq instructions (compiler//init-code instructions)))
    (while (< current-pos (length tokens))
      (let ((el (nth current-pos tokens)))
        (if (equal el "[")
            (progn
              (setq instructions (compiler//loop code
                                                 tokens
                                                 instructions
                                                 current-pos))
              (setq current-pos (find-matching-char tokens "[" "]" current-pos))))
        (if (equal el ">")
            (setq instructions (compiler//increment-pc instructions 1)))
        (if (equal el "<")
            (setq instructions (compiler//increment-pc instructions -1)))
        (if (equal el "-")
            (setq instructions (compiler//increment-value instructions -1)))
        (if (equal el "+")
            (setq instructions (compiler//increment-value instructions 1))))
      (setq current-pos (1+ current-pos)))
    instructions))

We can now safely try some Brainfuck code that uses loops:

(vm//main (compiler//compile "++>++<[->+<]" t))

(eax . 0)	(ebx . 4)	(ecx . 0)	(pc . 134)
0	4	0	0	0	0	0	0	0	0

Finally, it works! Onto .!!

Part V: Adding the Brainfuck stdout instruction, `.`

Let’s now add the Brainfuck feature that allows us to print to the standard output. We’ll need to enhance our VM with a variable to keep track of calls to the standard output along with the related bytecode instruction. Compiling should then be very easy!

VM modifications

We’ll need to add a variable to our vm//main function that can store stdout, we’ll initialize it as the empty string. The instruction APPEND_STDOUT will allow us to push, character by character:

(defun vm//append-stdout (stdout registers)
  "Appends the ascii-code char stored in eax to stdout"
  (concat stdout (char-to-string (alist-get 'eax registers))))

vm//append-stdout makes use of the concat function return the concatenation of its string arguments. char-to-list is a function that allows us to convert the integer in the register eax to its associated ASCII character through a string of length one. Printing will therefore require to store a number in the eax register before calling APPEND_STDOUT.

We can now add our std our variable, initialize it to 0, and add the APPEND_STDOUT test case in the loop body:

(defun vm//main (instructions &optional log)
  (let ((stack '())
        (registers (list
                    (cons 'eax nil)
                    (cons 'ebx nil)
                    (cons 'ecx nil)
                    (cons 'pc  0)))
        (ram (make-list 10 0))
        (stdout ""))  ; This is the variable that will contain the standart output
    (while (< (alist-get 'pc registers) (length instructions))
      (let* ((elt (nth (alist-get 'pc registers) instructions))
             (key (nth 0 elt))
             (val (nth 1 elt)))
        ; We're adding our new instruction
        (if (equal key "APPEND_STDOUT")
            (setq stdout (vm//append-stdout stdout registers)))
        (if (equal key "READ_RAM")
            (vm//read-ram registers ram))
        (if (equal key "WRITE_RAM")
            (setq ram (vm//write-ram registers ram)))
        (if (equal key "LOAD")
            (setq stack (vm//load stack registers)))
        (if (equal key "STORE")
            (setq stack (vm//store stack registers)))
        (if (equal key "RJUMP")
            (vm//increment-pc registers val))
        (if (equal key "IF")
            (setq stack (vm//if stack registers)))
        (if (equal key "PUSH")
            (setq stack (vm//push stack val)))
        (if (equal key "SUB")
            (setq stack (vm//op stack '-)))
        (if (equal key "ADD")
            (setq stack (vm//op stack '+)))
        (if log
            (message "%s %s %s %s" elt stack registers ram)))
      (vm//increment-pc registers 1))
    `(,stdout ,stack ,registers ,ram)))

Let’s check that it works by storing the number 70 in eax and calling APPEND_STDOUT, and stdout should equal “F”:

(vm//main '(("PUSH" 70)
            ("PUSH" eax)
            ("STORE")
            ("APPEND_STDOUT")))

F

nil

((eax . 70) (ebx) (ecx) (pc . 4))

(0 0 0 0 0 0 0 0 0 0)

As you can see, this codes output is indeed “F” suggesting that our VM modifications works appropriately.

Compiling

We can now start compiling the . Brainfuck instruction and conclude Part V :) We first read the heap value at the heap pointer by copying ecx to eax, calling READ_RAM, and outputting to stdout with APPEND_STDOUT:

(defun compiler//append-stdout (instructions)
  "Appends the char in pc (stored in ecx) to stdout."
  ;  store the pc from ecx to eax
  (push-instruction '("PUSH" ecx))
  (push-instruction '("LOAD"))
  (push-instruction '("PUSH" eax))
  (push-instruction '("STORE"))
  ; call the READ_RAM instruction which reads ram at the
  ; address stored in eax and writes the result in eax.
  (push-instruction '("READ_RAM"))
  ;  call the APPEND_STDOUT instruction, which writes to
  ;  stdout the ascii code stored in eax."
  (push-instruction '("APPEND_STDOUT"))
  instructions)

We can now add the test case for the . token and call compiler//append-stdout:

(defun compiler//compile (code include-init-code)
  "Compiles a brainfuck code string. If this function is called
recursively, set include-init-code to t in the first call and to
nil and all subsequent calls."
  (let ((tokens (lexer//lex code))
        (instructions '())
        (current-pos 0))
    (if include-init-code
        (setq instructions (compiler//init-code instructions)))
    (while (< current-pos (length tokens))
      (let ((el (nth current-pos tokens)))
        (if (equal el "[")
            (progn
              (setq instructions (compiler//loop code
                                                 tokens
                                                 instructions
                                                 current-pos))
              (setq current-pos (find-matching-char tokens "[" "]" current-pos))))
        (if (equal el ".")
            (setq instructions (compiler//append-stdout instructions)))
        (if (equal el ">")
            (setq instructions (compiler//increment-pc instructions 1)))
        (if (equal el "<")
            (setq instructions (compiler//increment-pc instructions -1)))
        (if (equal el "-")
            (setq instructions (compiler//increment-value instructions -1)))
        (if (equal el "+")
            (setq instructions (compiler//increment-value instructions 1))))
      (setq current-pos (1+ current-pos)))
    instructions))

Finally, we can compile the Hello World example taken from Wikipedia’s Brainfuck English entry and verify that it runs properly:

(setq src "++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]>>.>---.+++++++..+++.>>.<-.<.+++.------.--------.>>+.>++.")
(vm//main (compiler//compile src t) nil) ; "prints Hello World!"

Hello World!
nil	((eax . 10) (ebx . 10) (ecx . 6) (pc . 1246))	(0 0 72 100 87 33 10 0 0 0)

Part VI: Making an Emacs major mode for Brainfuck

In this final Part, we’ll be implementing a Bainfuck major mode for Emacs. As we shall see, each file opened in Emacs is attached to major mode governing its editing behavior and we’ll go into more detail regarding what this means shortly. The important take-away is that this will allow us to get familiar with Emacs’ API

We’ll first implement the syntax highlighting, which should be very easy since half a dozen characters make up Brainfuck’s syntax. We’ll then implement a function that will allow us to expand Brainfuck that was written with a shorthand, which we’ll define. This will demonstrate how to dynamically modify a buffer in Elisp, a buffer being an abstraction for a file opened in Emacs. We’ll then implement a linter that’ll let us know when Brainfuck loops, brackets, are unbalanced. Finally, we’ll make a small Brainfuck REPL that will be binded to our major mode so we can instantiate it easily from any opened Brainfuck file.

Let’s get started :)

Syntax coloring

Let’s first look into the mechanism Emacs uses to color text based on what it represents. If you open a Javascript file you’d expect it to be highlighted differently than for a Python file. When you open a file in Emacs, its content is loaded in an object called a buffer to which is associated a major mode defining its syntax, shortcuts that should be made available for that particular buffer, etc.

Our strategy is simple: we’ll define a minimal major mode for Brainfuck that colors each token differently. A major mode is composed among other things of a keymap, a syntax table, a keywords that can be highlighted. We’ll go over them individually starting with the keyword highlighting.

We’ll first define a global variable with the macro defvar that takes the variable name as first argument, its initial value as second, and a documentation string as third. We then set this variable to its desired value which is a list of cons, each taking a regex to match as first element, and a font lock as second. The reason we first set the global variable bf-font-lock-keywords and then assign it a value is because defvar does not override if it has an initial value preventing package developers to easily reload the file currently worked on. See [1] for more details on this behavior.

[1] http://ergoemacs.org/emacs/elisp_defvar_problem.html

First, let’s see how to assign a face color to the Brainfuck token +. We need to find a regex that will match the plus sign. The answer is "\\+".

Notice the double backslash in our regex? Since the plus sign has meaning, we need to escape it with a backslash. But since the backslash is a special token in strings, used to itself escape characters, we need to escape it as well with itself. This we have a double backslash preceding the plus sign in our regex. We’ll assign this token the font color that would normally be used for variables using font-lock-variable-name-face. We go on similarly with other tokens leaving the loop related tokens, [ and ], so that they’ll be the default color.

(defvar bf-font-lock-keywords nil
  "Default highlighting expressions for bf-mode")
(setq bf-font-lock-keywords
      (list
       '("\\+" . font-lock-variable-name-face)
       '("\\-" . font-lock-function-name-face)
       '("\\." . font-lock-keyword-face)
       '("<" . font-lock-doc-face)
       '(">" . font-lock-constant-face)))

We also need a mode map, which assigns keys to particular commands for the given major mode. For now, we don’t have any commands to bind, so we’ll set it to a default keymap with the function make-keymap:

(defvar bf-mode-map nil "Keymap for BF major mode")
(setq bf-mode-map (make-keymap))

Finally, let’s introduce the syntax table for the major mode. First, the syntax table is a data structure that assigns each character in the buffer with a class: punctuation, comment start, string delimiter, etc. We’ll simply define comments by making a default table with make-syntax-table and modifying it with modify-syntax-entry. We’ll assign the # and \n (line break) characters to the comment start and end classes which are < and > respectively:

(defvar bf-mode-syntax-table nil "Syntax table for bf-mode")
(setq bf-mode-syntax-table
      (let ((st (make-syntax-table)))
        (modify-syntax-entry ?# "<" st)
        (modify-syntax-entry ?\n ">" st)
        st))

Notice the question marks before the characters to modify in the syntax table. In Elisp, it is the syntax for characers rather than strings (between double quotes). As an example, "#" is a string of length 1, but ?# is the character sharp. We can convert between the two type with the functions string-to-char and char-to-string. Here however, the function modify-syntax-entry expects a character as first argument.

Before defining our major mode, let’s define a data structure holding code to run in bf-mode-hook that will be ran when the major mode is activated so we can modify the major mode externally. We’ll define the global variable bf-mode-hook as nil for now:

(defvar bf-mode-hook nil "Hooks for bf-mode")

We can tie everything up us follows by defining a function bf-mode. We’ll make it interactive by calling the interactive macro inside the function. This will allow us to call this function from the Emacs UI by typing M-x bf-mode RET. We’ll first reset all the local variables for this buffer as explicitly mentioned in the documentation [1]. We then set the buffer’s syntax table to bf-mode-syntax-table defined above using set-syntax-table. Similarly, using use-local-map to set the buffer’s keymap to the one we defined previously. We then create a local variable font-lock-defaults and assign it the value defined above. We then set the major-mode and mode-name global variables to bf-mode and Brainfuck respectively, and run the hooks with run-hooks.

[1] “The major mode command should start by calling kill-all-local-variables. This runs the normal hook change-major-mode-hook, then gets rid of the buffer-local variables of the major mode previously in effect”, https://www.gnu.org/software/emacs/manual/html_node/elisp/Major-Mode-Conventions.html

(defun bf-mode ()
  "Major mode for editing Brainfuck files"
  (interactive)
  (kill-all-local-variables)
  (set-syntax-table bf-mode-syntax-table)
  (use-local-map bf-mode-map)
  (set (make-local-variable 'font-lock-defaults) '(bf-font-lock-keywords))
  (setq major-mode 'bf-mode)
  (setq mode-name "Brainfuck")
  (run-hooks 'bf-mode-hook))

Notice the use of set to update the font locks. This does the same as setq, but is not a macro, meaning that it evaluates its arguments before doing the assignment. In particular, the first argument is evaluated to the newly created local variable.

Finally, we add files with the .bf extension to the associative list auto-mode-alist which assigns a file extension to a major mode and activates it automatically when opening files of this type. The extension match is done using a regex, we use the double backslash in front of the period to escape it since it is a regex special token.

(add-to-list 'auto-mode-alist '("\\.bf" . bf-mode))

Finally, opening a Brainfuck file looks pretty:

Brainfuck shorthands

Brainfuck is very hard to read and write. If one needs to write n identical tokens, one needs to be very careful when typing. We’ll use Emacs to help us out write Brainfuck “sanely”. What we want to do is as follows: we want to be able to write 5+ instead of +++++, and bind a function to some shortcut that will expand everywhere in the buffer where this shorthand is used. Of course, we want it to work with any n, not just 5, and with any tokens, not just +.

The strategy is as follows: we bring back the cursor to the beginning of the buffer and search forward with a regex matching n[+,-,<,>,.] for n some natural number. We can do this with the following regex, "\\([0-9]+\\)\\([-\\|+\\|<\\|>\\|.]\\)". We use the regex parenthesis operators to separate the number and the token. We can then easily replace the match in the buffer with its expansion:

(defun bf-mode-expand-buffer ()
  (interactive)
  (goto-char 0)
  (while (re-search-forward "\\([0-9]+\\)\\([-\\|+\\|<\\|>\\|.]\\)" nil t)
    (let* ((n-string (buffer-substring (match-beginning 0) (match-end 0)))
           (n (string-to-number n-string))
           (token-string (buffer-substring (match-end 1) (+ 1 (match-end 1))))
           (token (string-to-char token-string)))
      (replace-match (make-string n token)))))

First, we bring back the cursor to the beginning of the buffer with (goto-char 0), and the search the regex forward with a while loop. The while macro evaluates its argument before each iteration until it evaluates to false. It evaluates its second argument at each iteration. Here, we want to iterate until searching forward for the regex with re-search-forward evaluates falsely. In the loop body, we extract the matches from the buffer by using match-beginning and match-end with their group number (0 for the number, and 1 for the token). We then use replace-match to replace the match with the expansion using make-string.

We can now update our mode map by setting its global variable as follows to bind expanding the buffer to C-c e:

(setq bf-mode-map
  (let ((map (make-keymap)))
    (define-key map "\C-ce" 'bf-mode-expand-buffer)
    map))

Let’s try it out! That’s our code before:

And that’s what we get after:

You could try to implement the inverse function to bf-mode-expand-buffer. This would allow the user to toggle between both representations. Additionally, it would make for a very fun exercise!

Linting using Flycheck

Let’s now add a basic linter for our major mode. We’ll be using the very powerful Flycheck [1] which is an extensible on-the-fly syntax checker. What we want is to let the user know when the brackets are not matched. Indeed, a loop beginning requires a loop end and if the brackets are unbalanced, the compiler will throw.

[1] https://www.flycheck.org/en/latest/

We can install Flycheck from Melpa (See Part I) as follows:

M-x package-install RET flycheck RET

We first need to require the module before going on:

(require 'flycheck)

We’ll now implement a little helper that checks whether or not the brackets are balanced in the current buffer. Since we need to look through the entire buffer, we need to set the cursor to line 1 and column 1. We’d like to restore the cursor location to where it was before the search. We can do that using the save-excursion macro which will restore the cursor to where it was before calling the macro after evaluating it. The rest of the function is very easy, we scan a first time the buffer and count the occurrences of the character [ and then do the same for ] before comparing if both values match.

(defun bf-mode-check-matching-brackets ()
  "Count open and close brackets in the buffer
saving the cursor location."
  (let ((open-count 0)
        (close-count 0))
    (save-excursion
      (goto-char 0)
      (while (re-search-forward "\\[" nil t)
        (setq open-count (1+ open-count)))
      (goto-char 0)
      (while (re-search-forward "\\]" nil t)
        (setq close-count (1+ close-count))))

    (not (equal open-count close-count))))

However, this implementation is terrible. First, it scans the buffer twice, and then accepts code where a closing bracket is written before an opening one. Try re-implementing the function using find-matching-char that we implemented above.

Flycheck will take an Elisp function as a callback to evaluate automatically to lint the buffer, we’ll call this function bf-mode-lint. This function takes two arguments, the first one being the current Flycheck checker user, and the second, a callback passed by Flycheck. This callback needs to be called inside bf-mode-lint dynamically with funcall as done previously with 'finished as first argument to end the linting and the errors as second argument. In essence, bf-mode-lint evaluates bf-mode-check-matching-brackets and calls the Flycheck callback with a list of errors if it evaluates falsely. It calls the Flycheck callback with no errors otherwise.

In the case where the brackets are not matched, saving the cursor position, we got to the beginning of the buffer, search for the first opening bracket, and return a list of a single Flycheck error with flycheck-error-new-at and the current line and column number as well as an error message:

(defun bf-mode-lint (checker callback)
  "Check if the counters match or not:
If not: call the Flycheck callback with an error
        at the first opening bracket in the buffer.
Otherwise: call the Flycheck callback with no errors."
  (interactive)
  (if (bf-mode-check-matching-brackets)
      (funcall callback
               'finished
               (save-excursion
                 (goto-char 0)
                 (re-search-forward "\\[" nil t)
                 (list (flycheck-error-new-at
                        (line-number-at-pos) (current-column) 'info
                        "Unbalanced loop" :checker checker))))
    (funcall callback 'finished nil)))

Finally, we define the Flycheck Brainfuck checker associating bf-mode-lint to our Brainfuck mode. The checker will be called brainfuck-checker:

(flycheck-define-generic-checker 'brainfuck-checker
  "A Brainfuck syntax checker."
  :start 'bf-mode-lint
  :modes 'bf-mode)

We can now add our newly defined checker to Flycheck’s list of checkers, flycheck-checkers:

(add-to-list 'flycheck-checkers 'brainfuck-checker)

Finally, we can turn on the Flycheck minor mode with (flycheck-mode) inside bf-mode to have the linter turned on when turning the major mode on. Remember when we said earlier that each buffer has an associated major mode to it? In order to add extra functionality to major modes, each buffer can have multiple minor modes associated with them. This is exactly what we are doing to get all of the Flycheck features associated with our buffer when turning on its minor mode on.

(defun bf-mode ()
  "Major mode for editing Brainfuck files"
  (interactive)
  (kill-all-local-variables)
  (set-syntax-table bf-mode-syntax-table)
  (use-local-map bf-mode-map)
  (set (make-local-variable 'font-lock-defaults) '(bf-font-lock-keywords))
  (setq major-mode 'bf-mode)
  (setq mode-name "Brainfuck")
  (run-hooks 'bf-mode-hook)
  (flycheck-mode))  ; Starting Flycheck minor mode automatically

We can now try to unbalance some brackets to witness the popup and Flycheck error list that you can access with M-x flycheck-error-list:

Documentation through Eldoc

Our quest to the perfect Brainfuck major mode continues with a nice feature of emacs, Eldoc. This package, integrated in Emacs by default, allows to navigate around the buffer, and emit to the mini-buffer some string. Note that the mini-buffer is the “echo-area” at the bottom of the Emacs window. For instance, we want that when the cursor is on the [ token, that the mini-buffer reads “Starts a loop etc”. We’ll do so for each Brainfuck token.

Eldoc simply takes a callback and prints to the mini-buffer its result when the cursor moves around the buffer. Our callback, bf-mode-eldoc-function, will look at the current character under the cursor with (char-after) and case match it with each Brainfuck token. The appropriate documentation string is then returned by the function. Notice that we make use of a helper macro catch-token in the code to save up on if case.

(defun  bf-mode-eldoc-function ()
  "Prints to documentation for Brainfuck tokens in the
mini-buffer using Eldoc."
  (let ((current-char (char-to-string (char-after)))
        (return-value "")
        (loop-start-doc "Starts a loop. If the heap value pointed too is zero, then end the loop.")
        (loop-end-doc "Return to the beginning of the loop.")
        (increment-heap-doc "Increment the heap value pointed to.")
        (decrement-heap-doc "Decrement the heap value pointed to.")
        (increment-heap-pointer-doc "Increment the heap pointer.")
        (decrement-heap-pointer-doc "Decrement the heap pointer.")
        (append-stdout-doc "Append the heap value pointed to to stdout."))
    (defmacro catch-token (token doc)
      `(if (equal current-char ,token)
          (setq return-value ,doc)))
    (catch-token "[" loop-start-doc)
    (catch-token "]" loop-end-doc)
    (catch-token "+" increment-heap-doc)
    (catch-token "-" decrement-heap-doc)
    (catch-token ">" increment-heap-pointer-doc)
    (catch-token "<" decrement-heap-pointer-doc)
    (catch-token "." append-stdout-doc)
    return-value))

Finally, we modify the bf-mode function to create a local variable eldoc-documentation-function that Eldoc will then call:

(defun bf-mode ()
  "Major mode for editing Brainfuck files"
  (interactive)
  (kill-all-local-variables)
  (set-syntax-table bf-mode-syntax-table)
  (use-local-map bf-mode-map)
  (set (make-local-variable 'font-lock-defaults) '(bf-font-lock-keywords))
  (setq major-mode 'bf-mode)
  (setq mode-name "Brainfuck")
  (run-hooks 'bf-mode-hook)
  ; Setting the Eldoc callback for this buffer (locally)
  (set (make-local-variable 'eldoc-documentation-function) 'bf-mode-eldoc-function)
  (flycheck-mode))

We can now notice the token documentation in the mini-buffer:

Evaluate the buffer

We now want to map to a shortcut to evaluate the buffer through our compiler and virtual machine pipeline, and output the result to the mini-buffer.

We’ll need to pre-process the buffer before sending it for compilation however. Remember that out compiler only lexes by separating tokens, and we’ve added comments. So we want to remove all line breaks, \n, all spaces, and anything of the form # some text \n. Of course, we’ll also need to expand the Brainfuck shorthands with bf-mode-expand-buffer before compiling. Since our expansion function operates on a buffer, but that we do not want to change the user’s buffer, we’ll copy it in a temporary one. We can do so by copying the reference to the current buffer with (current-buffer into a variable orig-buf. We then insert it in a temporary buffer with with-temp-buffer and insert-buffer-substring. Only once in the temporary buffer do we expand and remove unnecessary characters and store the result in a variable called src. Finally, we can compile src and run the result through our VM and store the result which consists of a list of length four containing the standard output, the stack, register, and heap states. We deconstruct the list and print the result to the mini-buffer with the function message:

(defun bf-mode-evaluate-buffer ()
  "Compiles the current buffer and evaluates it.
The result is echoed to the mini-buffer."
  (interactive)
  (let ((orig-buf (current-buffer)))
    (with-temp-buffer
      (insert-buffer-substring orig-buf)
      (bf-mode-expand-buffer)
      (let ((buffer (buffer-string))
            (src "")
            (result nil))
        (setq src (replace-regexp-in-string
                   " \\|\n\\|#.*\n"
                   ""
                   buffer))
        (setq result (vm//main (compiler//compile src t)))
        (message "stdout: %s \nstack: %s \nregisters: %s \nheap: %s"
                 (car result)
                 (cadr result)
                 (caddr result)
                 (cadddr result))))))

Note theses funny functions in the message call: cadr, caddr and cadddr. These are shortands for (car (cdr _)), (car (cdr (cdr _))), (car (cdr (cdr (cdr _)))), where _ is some list. Hence, its yet another way to get the nth element of a list which we did earlier with (nth some-index some-list).

Finally, we update the key map so that we can type C-c E to compile and evaluate the buffer through the VM and print the result to the mini-buffer:

(setq bf-mode-map
  (let ((map (make-keymap)))
    (define-key map "\C-ce" 'bf-mode-expand-buffer)
    (define-key map "\C-cE" 'bf-mode-evaluate-buffer)
    map))

As you can see, we can now evaluate the current buffer and get the result in the mini-buffer:

The REPL

The last feature for our major mode will be a REPL. What we’d like is be able to hit a shortcut and have a the current Emacs window be split in two with a Brainfuck REPL waiting to evaluate code. The goal is to get the following in this window:

Welcome to the Brainfuck REPL 

Press C-c C-c with the cursor on the last line to evaluate it.


=: ++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]>>.>---.+++++++..+++.>>.<-.<.+++.------.--------.>>+.>++.
Hello World!

=: ++++++++++++++++++++++++++++++++>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++.---.+++++++..+++.<.>++++++++.--------.+++.------.--------.
HELLO WORLD

=:

Simply, the user has to enter some Brainfuck code after the =: sign and hit C-c C-c to have the result being printed on screen and a new prompt waiting for some new user input.

How are we going to do this? We’ll create a new major mode, bf-repl-mode, that will make a buffer interactive. So we’ll first split the current window, create a new buffer, and activate the bf-repl-mode. This will insert some predetermined introductory text and insert =: inviting for the user to type something on the last line. Finally, we’ll bind a function that runs the buffer last line through the compiler and VM and prints the output on a new line. We’ll bind this function, bf-repl//insert-init-text, to the shortcut C-c C-c.

Let’s start with the function that inserts some introductory text in the current buffer. Very simple, we already know how to:

(defun bf-repl//insert-init-text ()
  (interactive)
  (insert "Welcome to the Brainfuck REPL\n\nPress C-c C-c with the cursor on the last line to evaluate it.\n\n\n=: "))

Let’s now write the function that evaluates the last line and inserts the result in the buffer. We get the current line content with (thing-at-point 'line t) and clean it up from the =: sign. We then get the result by compiling the source and running it through the VM. Finally, we insert the result and a new prompt =::

(defun bf-repl//eval-line ()
  (interactive)
  (let* ((line (thing-at-point 'line t))
         (src (replace-regexp-in-string "=: " "" line))
         (result (vm//main (compiler//compile src t))))
    (insert (concat "\n" (car result) "\n\n=: "))))

Again, we define an empty hook list for our new mode:

(defvar bf-repl-mode-hook nil)

In our key map, we bind bf-repl//eval-line to C-c C-c:

(defvar bf-repl-mode-map nil "Keymap for BF REPL major mode")
(setq bf-repl-mode-map
  (let ((map (make-keymap)))
    (define-key map "\C-c\C-c" 'bf-repl//eval-line)
    map))

And hook it all up in bf-repl-mode as done in bf-mode previously:

(defun bf-repl-mode ()
  "Major mode for a live Brainfuck REPL"
  (interactive)
  (kill-all-local-variables)
  (set-syntax-table bf-mode-syntax-table)
  (use-local-map bf-repl-mode-map)
  (set (make-local-variable 'font-lock-defaults) '(bf-font-lock-keywords))
  (setq major-mode 'bf-repl-mode)
  (setq mode-name "Brainfuck REPL")
  (run-hooks 'bf-repl-mode-hook)
  (bf-repl//insert-init-text))  ; We insert the init text

We now need to add a function to bf-mode’s key map so that we can easily spawn a REPL. We use split-window-right and windmove-right to create a new window and move the cursor to it. We the create a new buffer named *Brainfuck REPL, switch to it, and activate bf-repl-mode:

(defun bf-mode-start-repl ()
  "Opens a window and instantiates a Brainfuck REPL in it."
  (interactive)
  (split-window-right)
  (windmove-right)
  (get-buffer-create "*Brainfuck REPL*")
  (switch-to-buffer "*Brainfuck REPL*")
  (bf-repl-mode))

Finally, we update bf-mode’s key map to bind spawning a REPL to C-C r:

(setq bf-mode-map
  (let ((map (make-keymap)))
    (define-key map "\C-ce" 'bf-mode-expand-buffer)
    (define-key map "\C-cE" 'bf-mode-evaluate-buffer)
    (define-key map "\C-cr" 'bf-mode-start-repl)
    map))

Let’s see if it works:

Our environment looks good :)

Additionally, you could try and add a function in bf-mode that allows you to evaluate a selected region of code through the REPL. One way of doing that would be to copy the selected text and pasting it on the last line of the REPL buffer, that you can select with its name, and calling bf-repl//eval-line from Elisp.

Conclusion

Thus we conclude this article. It’s been much longer than I initially thought, and hopefully, you have now learned enough Elisp that you could learn any other Lisps (Common Lisp, Clojure, Racket etc) fairly easily and that you can hack Emacs when needed.

Let’s recapitulate what we have and haven’t gone over in this article. Ideally, you started knowing little to nothing about Brainfuck, bytecode, Lisp or the Emacs API. We learned about each by progressively implementing a Brainfuck compiler targeting simple stack based VM before integrating it into a custom made Brainfuck major mode for Emacs. Which was not an easy task to say the least! What are the next steps from here? Well, you decide, but you could dive deeper into some of the elements we’ve discussed. For example, if you’re interested in compilers, you could now try to compile a small subset of C by first lexing it and then writing a compiler that targets our VM! Note that you’ll need to modify the VM by adding instructions to move the stack pointer around so you can push variables on the stack and retrieve them when needed. Try to compile the following for instance:

int a = 1;
int b = 2;
int c = a + b;

You are now definitely capable of writing an Elisp lexer and compiler targeting your modified VM with the help of the documentation.

If you want to dig into some more into bytecodes, as said in Part II when introducing stack-based VMs, there are many out there. Let’s look at some of the most common one starting with Web Assembly (WASM). Its text format looks Lispy, and to some extend works like our stack machine. Here’s some WASM format taken from [1] that you should read if you liked this article:

[1] https://developer.mozilla.org/en-US/docs/WebAssembly/Understanding_the_text_format

(func $double (param $p i32)
  local.get $p
  local.get $p
  i32.add)

This defines a function of one parameter p called double. When called it pushes the parameter twice on the stack and returns their addition. Simple! Python also compiles to some bytecode! Let’s look into it:

import dis
dis.dis('print("Hello, World!")')

As you can see, the call print("Hello, World!") uses the instructions LOAD_NAME, LOAD_CONST, CALL_FUNCTION, and RETURN_VALUE. Now you know what goes into these .pyc files! Elisp files also bytecompile to the Elisp VM that Emacs runs, its those .elc files you’ll sometimes see. Finally, let’s also introduce Michelson, the Tezos VM bytecode [1] that is typed so you can statically type check it! Michelson implements the SWAP and DUP instructions that allow to swap elements on the stack and duplicating them.

[1] https://tezos.gitlab.io/whitedoc/michelson.html

So you could look into these various bytecodes and try to implement some of their features. To my knowledge, none of them implement a register mechanism, we just implemented them because real processors implement them and so I wanted to learn how they might work. In our case, it really is a redundant mechanism to pass arguments to our instructions and could (should ?) be removed. But it was fun to implement and compile with nonetheless!

Let’s also note the fact that we have completely over looked unit testing of our code. I suggest looking into ERT, the ELisp Regression Testing library [1], that is included in Emacs for those more serious about developing Emacs packages. I use this opportunity to mention the fact that we have not created an Emacs’ package, and that those interested can look into here [2].

[1] https://www.gnu.org/software/emacs/manual/html_node/ert/index.html
[2] https://www.gnu.org/software/emacs/manual/html_node/elisp/Simple-Packages.html

Finally, I would like to suggest some resources I used a lot when developing this tutorial, namely:

The Elisp documentation: https://www.gnu.org/software/emacs/manual/elisp.html
The Emacs Wiki: https://www.emacswiki.org/
Ergo Emacs: http://ergoemacs.org/emacs/emacs.html
The many Stackoverflow and Github issues I’ve skimmed through to find some answers I was looking for

I hope you’ve enjoyed this article, thank you for reading it, and see you next time ;)

Cheers!

Thomas Binétruy

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tuto.org

tuto.org

Learn Emacs Lisp By making a Brainfuck compiler

Introduction

Part I: Getting started

Getting an Emacs Lisp environment running

An introduction to Brainfuck

Hello World

Part II: virtual machine boilerplate

Part III: Compiling a Brainfuck subset to our VM’s bytecode

VM modifications

Compiling

Part IV: Adding Brainfuck loops to our compiler

VM modifications

Compiling

Part V: Adding the Brainfuck stdout instruction, `.`

VM modifications

Compiling

Part VI: Making an Emacs major mode for Brainfuck

Syntax coloring

Brainfuck shorthands

Linting using Flycheck

Documentation through Eldoc

Evaluate the buffer

The REPL

Conclusion

Files

tuto.org

Latest commit

History

tuto.org

File metadata and controls

Learn Emacs Lisp By making a Brainfuck compiler

Introduction

Part I: Getting started

Getting an Emacs Lisp environment running

An introduction to Brainfuck

Hello World

Part II: virtual machine boilerplate

Part III: Compiling a Brainfuck subset to our VM’s bytecode

VM modifications

Compiling

Part IV: Adding Brainfuck loops to our compiler

VM modifications

Compiling

Part V: Adding the Brainfuck stdout instruction, .

VM modifications

Compiling

Part VI: Making an Emacs major mode for Brainfuck

Syntax coloring

Brainfuck shorthands

Linting using Flycheck

Documentation through Eldoc

Evaluate the buffer

The REPL

Conclusion

Part V: Adding the Brainfuck stdout instruction, `.`