counter machine JIT

The machine implemented here is actually variant of so called counter machine. You can read a lot abot them in following wikipedia articles:

http://en.wikipedia.org/wiki/Counter_machine
http://en.wikipedia.org/wiki/Counter_machine_reference_model
http://en.wikipedia.org/wiki/Random-access_machine

Implementation has following features:

program with numbered, consecutive instructions
memory cells from 0 to n,
- memory is 0-initialized
- 0th cell is programs result
- cells 1-k are inputs (k depends on function that the program computes)
4 instructions:
- Zero(n) - put zero to n-th cell
- Succ(n) - increment value in n-th cell
- Transfer(m,n) - copy value from m-th cell to n-th cell
- If(m,n, x) - if contents of m-th and n-th cells is equal jump to instruction number x
jumping outside of available instructions, should stop the execution ~~(not handled by JIT yet)~~

As you can see T instruction isn't actually needed, and it could be implemented using Z, S and I.

You can read in mentioned articles, what needs to be done, to make this machine Turing complete.

Before we begin some cool numbers, to grab your attention:

compiled for x64
parsing input...............................................................  DONE
value in cell number 0: 165580141
program exited...
 * no-jit  time: 1.70075

parsing input...............................................................  DONE
value in cell number 0: 165580141
program exited...
 *    jit  time: 0.399607

So you can see that, although JIT is quite dummy, speed up is over 4 times.

UPDATE 1 jit handles "jumping outside of available instructions"

UPDATE 2 jit should handle ugly case when jump that triggered generation, goes forward changes in 807345b2c13cbad94c79f31378333dc4317cb801 does that.

Interpreter

First the parser converts file in text form to more convenient byte-code. The parser also recognizes following variant of If instruction I(a,a,x) which basically is unconditional jump. So bytecode opcodes are as follow:

// gram_bc.h
enum {
        Op_Zero
        , Op_Increment
        , Op_Transfer
        , Op_Conditional_Jump
        , Op_Jump
};

The parser also finds what is the index of 'biggest' cell addressed (maxMemAccess).

The interpreter is very simple, you can see some early version of it in this commit:

dynasm-jit-playground/src/gRamMachine.cpp

Line 206 in 5ed7984

void executeInstructions(uint32_t* instructions, size_t n, uint64_t* machineMem, size_t maxMemAccess) {

As you can see, printq trace instructions, takes more space than instruction handling itself.

I guess you could fit both parser and interpreter in about page of python code.

I've changed interpreter later, to use following struct, instead of array of uint32_ts:

struct Instr {
	uint32_t opcode;
	uint32_t op[3]; // operands
	uint16_t statsCounter, statsIdx;
};

You can see it here:

dynasm-jit-playground/src/gRamMachine.cpp

Line 201 in e57d6db

void executeInstructions(Instr* instructions, size_t n, uint64_t* machineMem, size_t maxMemAccess) {

Additional fields statsCounter and statsIdx are used to collect some data during execution of instructions. Particularly statsCounter is used to count how many times given instruction has been executed.

statsIdx was supposed to be used for nicer tracing (and later during generation), but right now it's not used.

Now the only flow affecting instructions are Op_Conditional_Jump and Op_Jump instructions. So, when executing Jmp instruction I check if it looks like hot loop, in current implementation it means jump destination has been executed 4 times. (4 was choosen ad-hoc).

There is also additional check that jump itself was executed less or same number of times than destination.

If conditions are ok, I'm trying to generate a code between (destination and jump)

Which also means there is WRONG assumption, that interesting jump goes back. (Maybe I'll fix it later, maybe I won't)

Code generation

prelude

Let's look at pre-update version, the generation happens in following call:

dynasmGenerator(&da, instructions, gpc, jmpEip, machineMem, maxMemAccess)

da is small DynAsm wrapper, gpc is PC of destination and jmpEip is PC of jump. (so gpc is start, jmpEip is end)

Before we look at generator itself, let's see what happens if generation succeeded, instruction at gpc is modified in following way:

original opcode is replaced with new opcode Op_Generated
original operands 0,1 are overwritten with a pointer to generated function
original operand 2 is overwritten with length in instructions, ot generated block

Now in interpeter part if we stumble upon Op_Generated opcode, I do exactly what's needed:

get the pointer and execute generated code
change current instruction pointer accordingly

generator

Now we can head to generator itself which is in gram_x64.dasc.

Generator works in two passes:

first pass:
- checks if all jumps within currently generated block points to the block itself if not, I won't be generating code for such block
- since all jumps are direct, I collect all the jumps destinations
second pass - code generation itself

Between first and second pass, I set identifiers, for all possible destinations, they will be used in dasm's so called PC labels. You can read more about different type of labels here: Jumping with DynAsm - by Corsix and here: lua-l mailing list, Josh Haberman

Then goes generation, which is pretty straightforward. rdi register is used to keep pointer to machine's memory

Since all instructions use direct addressing, I calculate proper cell index during generation, and will use that value in generated code, here's example:

case Op_Transfer:
	op1 = GETOP(1, locGpc);
	op2 = GETOP(2, locGpc);
	| mov eax, op1
	| mov rcx, aword [rdi + rax*8]
	| mov eax, op2
	| mov aword [rdi + rax*8], rcx

So op1, op2 will be calculated during generation, and will be put into proper place with dasm_put directives later, so for example if during generation op1==3 && op2==5 following NATIVE code will be generated later:

000000300A040038 B8 03 00 00 00       mov         eax,3  
000000300A04003D 48 8B 0C C7          mov         rcx,qword ptr [rdi+rax*8]  
000000300A040041 B8 05 00 00 00       mov         eax,5  
000000300A040046 48 89 0C C7          mov         qword ptr [rdi+rax*8],rcx

Last thing is treatment of jumps, but first consider following piece of code, for which generator will be called (with start==9, end==12):

 9: +-> I(2, 6, 13) -+
10: |   S(5)         |
11: |   S(6)         |
12: +-- I(1, 1, 9)   |
13: ...            <-+

As you can see using my leet ascii skillz I've drawn how the jumps goes, inside the generator there are checks like:

if (dest == end+1) {
	| jmp >3

} else {
	| je => labels[dest]
}

Reasoning is as follows:

~~I know that jump at end goes UP (I've written about it at the end of previous section)~~ a. this is actually irrelevant
I'm NOT generating code for pieces, that have jumps outside of (start, end+1)
if there's jump to end+1 somewhere from within the code it means it's ~~"end-of-loop"~~ a. either "end-of-loop" as in ascii above b. or it is jump outside of currently generated block, so we're finished anyway

So in case of jump to end+1, I simply generate jmp to end of generated function, and otherwise, I generate PC jump, to one of the labels, collected during pass 1.

There's difference in handling of Op_Generated, which loows like this:

| mov rax, aword [rsi + locGpc*sizeof(Instr) + 4]
| call rax

if (dest == end+1) {
	| jmp >3

} else {
	// ...
}
locGpc = dest;

So basically if after call there is some more code, we do not need to jump to it, as it will be placed directly after native call instruction, e.g:

; this is code generated for Op_Generated
000000656E1B004A 48 8B 86 B8 00 00 00 mov         rax,qword ptr [rsi+0B8h]  
000000656E1B0051 FF D0                call        rax
; and right after *native* call, there's a code generated for:
; Op_transfer, op[0]==(3), op[1]==(2)
000000656E1B0053 B8 03 00 00 00       mov         eax,3  
000000656E1B0058 48 8B 0C C7          mov         rcx,qword ptr [rdi+rax*8]  
000000656E1B005C B8 02 00 00 00       mov         eax,2  
000000656E1B0061 48 89 0C C7          mov         qword ptr [rdi+rax*8],rcx

epilogue

What could be quite trivially improved, is concatenating subsequent S instructions into new instruction like Op_Add. This could be either done after parsing or while JITting. Other thing that could be interesting to add, is recognizing some usual 'patterns' during JITting and generating even better code for them e.g.:

 ..: assuming cell c has zero
n  : I(b, c, n+4)
n+1: S(c)
n+2: S(x)
n+4: I(y,y, n)

(add value of b-th cell to x-th cell)

If you'd like to see how interpreter and generator works, inside interpreter there is global verbose variable, and inside generator, you'd have to change, following define:

printgen(xxx) do { if(0) std::cerr << xxx; } while(0)

(and use if(1) instead of if(0))

Also if you'd like to debug generated code, "| int3" is your friend.

I think that would be all for now.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GRAM.md

GRAM.md

counter machine JIT

Interpreter

Code generation

prelude

generator

epilogue

Files

GRAM.md

Latest commit

History

GRAM.md

File metadata and controls

counter machine JIT

Interpreter

Code generation

prelude

generator

epilogue