Sunday, July 22, 2012

Implementing Fast Interpreters

Many modern virtual machines include either a fast interpreter or a fast baseline compiler. A baseline compiler needs extra memory and is likely slower for code that is only executed once. An interpreter avoids this memory overhead and is very flexible. For example, it can quickly switch between execution modes (profiling, tracing, single-stepping, etc.). So what, then, is the state of the art of building fast interpreters?

Modern C/C++ compilers are very good at optimising "normal" code, but they are not very good at optimising large switch tables or direct-threaded interpreters. The latter also needs the "Labels as Values" GNU extension. In a direct-threaded interpreter, each instruction takes the following form:

op_ADD_INT:
    int op1 = READ_OP1;
    int op2 = READ_OP2;
    int result = op1 + op2;  // The actual implementation.
    WRITE_RESULT(result);
    unsigned int opcode = pc->opcode;
    ++pc;
    goto *dispatch[opcode];  // Jump to code for next instruction.

The variable dispatch (the dispatch table) is an array of labels, one for each opcode. The last three lines transfer control to the implementation of the next bytecode instruction. If we want to change the implementation of an instruction dynamically, we can just update the pointer in the dispatch table, or change the dispatch table pointer to point to a different table altogether.
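
To make this concrete, here is a minimal, self-contained sketch of a direct-threaded interpreter in GNU C. The three-opcode instruction set and its encoding are made up for illustration; the point is the computed-goto dispatch at the end of each handler:

#include <stdio.h>

enum { OP_PUSH, OP_ADD, OP_HALT };

typedef struct { unsigned char opcode; int operand; } Instr;

static int interpret(const Instr *pc) {
    /* One label per opcode; "&&label" is the GNU "Labels as Values"
       extension. */
    static const void *dispatch[] = { &&op_PUSH, &&op_ADD, &&op_HALT };
    int stack[32], *sp = stack;   /* a tiny operand stack */

    goto *dispatch[pc->opcode];   /* dispatch the first instruction */

op_PUSH:
    *sp++ = pc->operand;
    ++pc;
    goto *dispatch[pc->opcode];
op_ADD:
    sp[-2] += sp[-1];
    --sp;
    ++pc;
    goto *dispatch[pc->opcode];
op_HALT:
    return sp[-1];
}

int main(void) {
    Instr prog[] = { {OP_PUSH, 2}, {OP_PUSH, 3}, {OP_ADD, 0}, {OP_HALT, 0} };
    printf("%d\n", interpret(prog));   /* prints 5 */
    return 0;
}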

Note that everything except for the line "result = op1 + op2" is interpreter overhead whose cost should be minimised. The program counter (pc) and the dispatch table are needed by every instruction, so they should be kept in registers throughout. Similarly, we want a variable that points to the stack and possibly a variable that points to a list of literals, which takes two more registers. We also need further registers for holding the operands. These registers had better be the same for each instruction, since otherwise some instructions need to move things around, leading to memory accesses. On x86 we have only 7 general-purpose registers, which makes it very hard for the compiler to use them optimally across all instructions.
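
As an aside, GCC's global register variables (another GNU extension) can pin such state to fixed registers for a whole compilation unit. A hypothetical sketch; the register choices are illustrative and architecture-specific, and the declarations must appear at file scope:

/* Pin hot interpreter state to fixed registers for the whole
   translation unit (GNU extension; register names are illustrative). */
register const unsigned int *pc asm ("esi");   /* program counter */
register void * const *dispatch asm ("edi");   /* dispatch table  */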

Interpreters Written in Assembly

For this reason, the fastest interpreters are hand-optimised assembly routines with a specially designed calling convention that keeps as much state as possible in registers. A particularly nice and fast example is the LuaJIT 2 interpreter.

Both the standard Lua interpreter and the LuaJIT 2 interpreter use a register-based bytecode format. Compared to the more well-known stack-based bytecodes, a register-based bytecode has larger instructions, but requires fewer instructions overall. For example, the expression "s = x + (y * z)" in a stack-based bytecode would translate to something like:

PUSHVAR 0    -- push variable "x" onto operand stack
PUSHVAR 1    -- push variable "y"
PUSHVAR 2    -- push variable "z"
MUL          -- top of operand stack is (y * z)
ADD          -- top of operand stack is x + (y * z)
STOREVAR 3   -- write result into variable "s"

With a few optimisations, this can be encoded as only 6 bytes. In a register-based bytecode this would translate to something like this:

MUL 4, 1, 2   -- tmp = y * z
ADD 3, 0, 4   -- s = x + tmp

Each instruction takes variable indices, the (virtual) registers, and reads from and writes directly to the variables (stored on the stack). In LuaJIT 2, each instruction requires 4 bytes, so the overall bytecode size is a bit larger. However, executing these instructions incurs the interpreter overhead only twice, instead of 6 times in the stack-based bytecode. It may also avoid memory traffic by avoiding the separate operand stack.

The LuaJIT 2 interpreter also uses a very simple bytecode format that avoids excessive bit shuffling (which would increase the cost of decoding). Instructions can take only one of two forms:

OP A, B, C
OP A, D

Here OP, A, B, and C are 8-bit fields. OP is the opcode and A, B, C are usually register IDs or sometimes literals. D is a 16-bit field and overlaps with B and C. It usually holds a literal value (e.g., a jump offset).
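
Decoding this format in C is just a couple of shifts and masks. Below is a hypothetical sketch; the field layout (OP in the lowest byte, A second, C and B above it, with the 16-bit D overlapping B and C) matches the assembly example that follows, but the opcode number and frame contents are made up:

#include <stdint.h>
#include <stdio.h>

static uint8_t  bc_op(uint32_t ins) { return ins & 0xff; }
static uint8_t  bc_a (uint32_t ins) { return (ins >> 8) & 0xff; }
static uint8_t  bc_c (uint32_t ins) { return (ins >> 16) & 0xff; }
static uint8_t  bc_b (uint32_t ins) { return (ins >> 24) & 0xff; }
static uint16_t bc_d (uint32_t ins) { return (uint16_t)(ins >> 16); }

enum { BC_ADD = 0x20 };   /* made-up opcode number */

int main(void) {
    /* Encode "ADD 3, 0, 4", i.e. BASE[3] = BASE[0] + BASE[4]. */
    uint32_t ins = BC_ADD | (3u << 8) | (4u << 16) | (0u << 24);
    int base[8] = {0};            /* the stack frame (virtual registers) */
    base[0] = 10; base[4] = 32;   /* x and tmp from the example above */
    if (bc_op(ins) == BC_ADD)
        base[bc_a(ins)] = base[bc_b(ins)] + base[bc_c(ins)];
    printf("%d\n", base[3]);      /* prints 42 */
    return 0;
}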

The LuaJIT 2 interpreter now combines this with a calling convention where part of the next instruction is decoded before actually transferring control to it. This tries to take advantage of superscalar execution on modern CPUs. For example, an integer addition instruction implemented using this technique would look as follows:

bc_ADD_INT:
    -- edx = BASE = start of stack frame. E.g., virtual register 5
    --    is at memory address BASE + 5 * BYTES_PER_WORD
    -- esi = PC, always points to the next instruction
    -- ebx = DISPATCH = the dispatch table
    -- ecx = A = pre-decoded value of field A
    -- eax = D = pre-decoded value of field D
    -- ebp = OP, opcode of current instruction (usually ignored)

    -- Semantics: BASE[A] = BASE[B] + BASE[C]

    -- 1. Decode D into B (ebp) and C (eax, same as D)
    movzx ebp, ah  -- zero-extend 8-bit register ah into ebp = B
    movzx eax, al  -- zero-extend 8-bit register al into eax = C
    -- 2. Perform the actual addition
    mov ebp, [edx + 4*ebp]   -- read BASE[B]
    add ebp, [edx + 4*eax]   -- ebp = BASE[B] + BASE[C]
    mov [edx + 4*ecx], ebp   -- BASE[A] = ebp
    -- 3. Dispatch next instruction
    mov eax, [esi]    -- Load next instruction into eax
    movzx ecx, ah     -- Predecode A into ecx
    movzx ebp, al     -- zero-extend OP into ebp
    add esi, 4        -- increment program counter
    shr eax, 16       -- predecode D
    jmp [ebx + ebp * 4]  -- jump to next instruction via dispatch table

The reason for predecoding some of the arguments is that the final "jmp" instruction may be quite expensive, because indirect branches cause difficulties for branch predictors. The preceding instructions help keep the pipeline somewhat busy if the branch does get mispredicted.

These 11 instructions are typically executed in about 5 clock cycles on an Intel Core 2 processor. Based on a very simple benchmark of only simple (addition, branch, compare) instructions, interpreted code is roughly 7x slower than machine code. For more complicated instructions the interpreter overhead becomes less severe and the slowdown should be smaller.

Note, though, that in this example the ADD operation was type-specialised to integers. In many dynamic languages the addition operator is overloaded to work over several types. A generic ADD bytecode instruction then must include a type check (e.g., int vs. float) and then dispatch to the relevant implementation. This can introduce severe execution overheads due to the high cost of branch mispredictions on modern (desktop) CPUs. However, this is partly a language design issue (or at least language/implementation co-design issue) and independent of how we choose to implement our interpreter.
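
To illustrate, a generic ADD over tagged values might look like the following C sketch. The value representation and coercion rules here are made up for illustration; real VMs often use more compact tagging schemes (LuaJIT, for example, uses NaN-tagging):

#include <stdio.h>

typedef enum { TAG_INT, TAG_FLOAT } Tag;
typedef struct { Tag tag; union { int i; double d; } u; } Value;

static Value generic_add(Value x, Value y) {
    Value r;
    if (x.tag == TAG_INT && y.tag == TAG_INT) {        /* fast path */
        r.tag = TAG_INT;
        r.u.i = x.u.i + y.u.i;
    } else {                                           /* slow path */
        double a = (x.tag == TAG_INT) ? (double)x.u.i : x.u.d;
        double b = (y.tag == TAG_INT) ? (double)y.u.i : y.u.d;
        r.tag = TAG_FLOAT;
        r.u.d = a + b;
    }
    return r;   /* every tag check is a potential branch misprediction */
}

int main(void) {
    Value a = { TAG_INT, { .i = 2 } };
    Value b = { TAG_INT, { .i = 3 } };
    printf("%d\n", generic_add(a, b).u.i);   /* prints 5 */
    return 0;
}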

In addition to the performance advantages over C-based interpreters, there is a size advantage. It's a good idea to ensure that the frequently-executed parts of the interpreter fit within the L1 instruction cache (typically around 32-64 KiB), and writing the interpreter in assembly helps with that. For example, the above code requires exactly 32 bytes. If we had chosen to use register ebp for BASE, the code would have been a few bytes larger due to encoding restrictions on x86.

Portable Fast Interpreters?

The downside of writing an interpreter in assembly is that it is completely unportable. Furthermore, if we need to change the semantics of a bytecode instruction, we have to update it for each architecture separately. For example, LuaJIT 2 has interpreters for 6 architectures at around 4000 lines per architecture (ARMv6, MIPS, PPC, PPCSPE/Cell, x86, x86-64).

The WebKit project therefore built its own custom "portable" assembly language as a Ruby DSL. For an example of what it looks like, see LowLevelInterpreter.asm. Occasionally, you will probably still want to special-case some code for a specific architecture, but it seems to go in the right direction. It also seems to make the code more readable than Dalvik's (the Android VM's) template-substitution method. I'd rather have all the code in the same file (possibly with a few #ifdef-like places) than spread over 200+ files. It should also be quite simple to translate this assembly into C to get a default interpreter for architectures that are not yet supported.

So far, every project seems to have built its own tool chain. I guess it's too much of a niche problem with too many project-specific requirements to give rise to a reusable standard tool set.

8 comments:

Alessandro said...

Loved your post ; ).

Assembly, Interpreters, speed... Very cool!

Does the "portable assembly" solution through DSL incurs a significant speed penalty?

Anonymous said...

If you have ASM code for the different operations and you want to do a little more work, you can use your table as a template and MemCpy the code together. Then, patch the operations that have constants. Presto! From interpreter to compiler! You can spend a year doing little optimizations.

Anonymous said...

What about LLVM bitcode? Seems like a reasonable compromise between portability and assembly-level flexibility.

Unknown said...

This wasn't new when Charles Moore invented Forth. Good idea then, good idea now. I wonder: if Knuth's MIX had multiple back ends, would that help with the portability problem?

Thomas Schilling said...

Alessandro: The design goal is that it shouldn't. The main idea is to remove unnecessary repetition if the code is the same or very similar on different architectures. If it isn't, then you still use architecture-specific code instead of compromising performance.

TerryD: Yes, there are many ways to remove interpreter overhead, but they all start going into JIT territory. E.g., for the MemCpy technique, you need to manage extra memory, possibly flush instruction caches, make sure to dispatch execution to the generated code, possibly invalidate them if the execution mode changes, etc. Context-threading uses a similar idea to improve branch prediction.

If you're fine with an interpreter and just want a little bit more performance, then going this route is fine and the additional complexity is probably worth it. If you want a better-optimising JIT on top, then it's probably fine to use a simple interpreter and not have an additional proto-JIT.

aweasd: AFAIK, the LLVM JIT is relatively slow. As above, you'll also have to manage the memory for the compiled code. If you care more about ease of implementation than performance, I'd go with a direct-threaded interpreter in C. I'm not sure if LLVM is ready as a full JIT these days. I know the Unladen Swallow (Python JIT) project had its issues and concluded that LLVM wasn't ready yet.

hsm: I assume you mean the RISC-like MMIX, since the original MIX is an abomination by today's standards. That said, using MMIX would probably hide too many important details. For example, MMIX has 256 registers, but for our purposes we want to optimise register use for each architecture and not rely on having 256 of them. Since performance is ultimately the main goal, we shouldn't try to hide all the implementation details, but just automate the boring parts.

Jim Perry said...

Nicely explained. These are some of the very techniques, albeit in Java, that are used in the highly customizable GPVM that I designed for my Computer Organization and Assembly Language students. Well done.

Alessandro said...

Thomas Schilling: Thanks for the answer!

Jon Harrop said...

I'm not sure it makes sense to expend so much effort optimizing an interpreter. I'd expect a splat compiler to be simpler and faster...