Tuesday, July 31, 2012

Implementing Fast Interpreters: Discussion & Further Reading

My last article on "Implementing Fast Interpreters" was posted to both Hacker News and the Programming subreddit. Commenters pointed out some valid questions and some related work. This post aims to address some of the issue and collect the related work mentioned in the HN and Reddit comments and some more.

Why not write a simple JIT?

The main benefit of an interpreter usually is simplicity and portability. Writing the interpreter in assembly makes it a bit harder to write and you lose portability. So, if we have to go down to the level of assembly, why not write a simple template-based code generator? Wouldn't that be faster? The answer is: it depends.

As Mike Pall (author of LuaJIT) pointed out, LuaJIT v1 was based on a simple JIT and does not consistently beat the LuaJIT v2 interpreter (which is written in assembly).

Performance comparison of LuaJIT v1.1 (simple JIT) vs. LuaJIT v2 interpreter.
A JIT compiler (even a simple one) needs to:
  • Manage additional memory. If the code is short-lived (e.g., there is another optimisation level) then unused memory must be reclaimed and fragmentation may become an issue.
  • Mark that memory as executable (possibly requiring flushing the instruction cache or operating system calls).
  • Manage the transition between compiled code segments. E.g., we need a mapping from bytecode targets to corresponding machine code. If the target hasn't been compiled, yet, this can be done on-demand and the branch instruction can be updated to directly jump to the target. If the target may ever change in the future, we also need a way to locate all the branches to it in order to invalidate them if necessary.
  • If the interpreter has multiple execution modes (e.g., a trace-based JIT has a special recoding mode which is only used for a short time), then code needs to be generated for each execution mode.
All this is quite a bit more complicated than a pure interpreter, even one written in assembler. Whether it's a good reason to take on this complexity depends on how high the interpreter overhead actually is. If the bytecode format is under our control then we can optimise it for our interpreter as done in LuaJIT 2. This isn't always the case, though. The DynamoRIO system is a runtime instrumentation framework for x86. The x86 instruction format is very complex and thus expensive to decode. DynamoRIO therefore does not use an interpreter and instead decodes the host program into basic blocks. (No code generation is really needed, because DynamoRIO emulates x86 on x86, but you could imagine using the same technique to emulate x86 on, say, ARM.) The challenges (and solutions) for this approach are discussed in Derek Bruening's excellent PhD thesis on DynamoRIO. Bebenita et al. describe an interpreter-less JIT-compiler design for a Java VM in their paper "Trace-Based Compilation in Execution Environments without Interpreters" (PDF).

DSLs for generating interpreters

The PyPy project generates a C interpreter and a trace-based JIT based on a description in RPython. "Restricted Python", a language with Python-like syntax, but C-with-type-inference like semantics. Laurence Tratt described his experiences using PyPy in his article "Fast Enough VMs in ast Enough Time". PyPy's toolchain makes many design decisions for you; its usefulness therefore depends on whether these decisions align with your goals. The PyPy Status Blog contains many examples of language VMs written in PyPy (e.g. PHP, BF, regular expressions).

A DSL aimed at developing a fast assembly interpreter can still be useful for other purposes. One problem in implementing a JIT compiler is ensuring that the semantics of the compiled code match the semantics of the interpreter (or baseline compiler). By having a slightly higher-level abstraction it could become possible to use the same specification to both generate the interpreter and parts of the compiler. For example, in a trace-based JIT, the code for recording and interpreting looks very similar:
Interpreter:                Trace Recorder:

  mov tmp, [BASE + 8 * RB]    Ref r1 = loadSlot(ins->b);
                              Ref r2 = loadSlot(ins->c);
  add tmp, [BASE + 8 * RC]    Ref r3 = emit(IR_ADD_INT, r1, r2);
  mov [BASE + 8 * RA], tmp    writeSlot(ins->a, r3);

If we treat the trace recorder as just another architecture, it may save us a lot of maintenance effort later on.

Further Reading

There are many other resources on this topic, but here's a short selection of interesting articles and papers on the topic.
If you have more links to good articles on the topic, please leave a comment.

Thursday, July 26, 2012

ARM's New 64 Bit Instruction Set

You may have heard that ARM, whose CPUs are extremely popular for embedded devices, is trying to move into the low-power server market. One of the current main difficulties for using ARM processors in servers is that it is only a 32 bit architecture (A32). That means, that a single process can address at most 4GB of memory (and some of it is reserved for the OS kernel). That isn't a problem on current embedded devices, but it can be on large multi-threaded server applications. To address this issue ARM has been working on a 64 bit instruction set (A64). To my knowledge there is no commercially available hardware that implements this instruction set, yet, but ARM has already released patches to the Linux kernel to support it.

To my surprise, this new 64 bit instruction set is quite different from the existing 32 bit instruction sets. (Perhaps, I shouldn't be surprised since the two Thumb instruction sets were indeed quite different from the existing instruction sets.) It looks like a very clean RISC-style design. Here are my highlights:
  • All instructions are 32 bits wide (unlike the Thumb variants, but like the original A32).
  • 31 general purpose 64 bit wide registers (instead of 14 general purpose 32-bit registers in A32). The 32nd register is either hardwired to zero or the stack pointer. These registers can be accessed as 32 bit (called w0, w1, ..., w31) or 64 bit registers (called x0, x1, ..., x31).
  • Neither the stack pointer (SP) nor the program counter (PC) are general purpose registers. They are only read and modified by certain instructions.
  • In A32, most instructions could be executed conditionally. This is no longer the case.
  • Conditional instructions are not executed conditionally, but instead pick one of two inputs based on a condition. For example, the "conditional select" instruction CSEL x2, x4, x5, cond implements x2 = if cond then x4 else x5. This subsumes a conditional move: CMOV x1, x2, cond can be defined as a synonym for CSEL x1, x2, x1, cond. There are many more of these conditional instructions, but they all will modify the target register.
  • A conditional compare instruction can be used to implement C's short-circuiting semantics. In a conditional compare the condition flags are only updated if the previous condition was true.
  • There is now an integer division instruction. However, it does not generate an exception/trap upon division by zero. Instead (x/0) = 0. That may seem odd, but I think it's a good idea. A conditional test before a division instruction is likely to be cheaper than a kernel trap.
  • The virtual address space is 49 bits or 512TB. Unlike x86-64/AMD64, where the top 16 bits must all be zero or all one, the highest 8 bits may optionally be usable as a tag. This is configured using a system register. I'm not sure if that will require kernel support. It would certainly come in handy for implementing many higher-level programming languages.
  • A number of instructions for PC-relative addressing. This is useful for position independent code.
  • SIMD instruction support is now guaranteed. ARMv8 also support for crypto instructions. These are also available in A32.
All the existing ARM instruction sets (except perhaps Jazelle) will still be supported. I don't think you can dynamically switch between different instruction sets as was the case for A32/Thumb, though.

Further reading:

Sunday, July 22, 2012

Implementing Fast Interpreters

Many modern virtual machines include either a fast interpreter or a fast baseline compiler. A baseline compiler needs extra memory and is likely slower for code that is only executed once. An interpreter avoids this memory overhead  and is very flexible. For example, it can quickly switch between execution modes (profiling, tracing, single-stepping, etc.). So what, then, is the state of the art of building fast interpreters?

Modern C/C++ compilers are very good at optimising "normal" code, but they are not very good at optimising large switch tables or direct-threaded interpreters. The latter also needs the "Labels as Values" GNU extension. In a direct-threaded interpreter, each instructions takes the following form:

    int op1 = READ_OP1;
    int op2 = READ_OP2;
    int result = op1 + op2;  // The actual implementation.
    unsigned int opcode = pc->opcode;
    goto *dispatch[opcode];  // Jump to code for next instruction.

The variable dispatch (the dispatch table) is an array of labels, one for each opcode. The last three lines transfer control to the implementation of the next bytecode instruction. If we want to change the implementation of an instruction dynamically, we can just update the pointer in the dispatch table, or change the dispatch table pointer to point to a different table altogether.

Note that everything except for the line "result = op1 + op2" is interpreter overhead and its cost should be minimised. The program counter (pc) and the dispatch table are needed by every instruction, so they should be kept in registers throughout. Similarly, we want a variable that points to the stack and possibly a variable that points to a list of literals — two more registers. We also need further registers for holding the operands. These registers better be the same for each instruction, since otherwise some instructions need to move things around, leading to memory accesses. On x86 we only have 7 general-purpose registers which makes it very hard for the compiler to optimally use them for all instructions.

Interpreters Written in Assembly

For this reason, most interpreters are hand-optimised assembly routines with a specially designed calling convention that keeps as much state as possible in registers. A particularly nice and fast example is the LuaJIT 2 interpreter.

Both the standard Lua interpreter and the LuaJIT 2 interpreter use a register-based bytecode format. Compared to the more well-known stack-based bytecodes, a register-based bytecode has larger instructions, but requires fewer instructions overall. For example, the expression "s = x + (y * z)" in a stack-based bytecode would translate to something like:

PUSHVAR 0    -- push variable "x" onto operand stack
PUSHVAR 1    -- push variable "y"
PUSHVAR 2    -- push variable "z"
MUL          -- top of operand stack is (y * z)
ADD          -- top of operand stack is x + (y * z)
STOREVAR 3   -- write result into variable "s"

With a few optimisations, this can be encoded as only 6 bytes. In a register-based bytecode this would translate to something like this:

MUL 4, 1, 2   -- tmp = y * z
ADD 3, 0, 4   -- s = x + tmp

Each instruction takes a variable indices, the (virtual) registers, and reads from and writes directly to the variable (stored on the stack). In LuaJIT 2, each instruction requires 4 bytes, thus the overall bytecode size is a bit larger. However, executing these instructions incurs the interpreter overhead only twice instead of 6 times in the stack-based bytecode. It may also avoid memory traffic by avoiding the separate operand stack.

The LuaJIT 2 interpreter also uses a very simple bytecode format that avoids further bit shuffling (increasing the cost of decoding). Instructions can only have one of two forms:

OP A, B, C

Here OP, A, B, and C are 8-bit fields. OP is the opcode and A, B, C are usually register IDs or sometimes literals. D is a 16bit field and overlaps with B and C. It usually holds a literal value (e.g., a jump offset).

The LuaJIT 2 interpreter now combines this with a calling convention where part of the next instruction is decoded before actually transferring control to it. This tries to take advantage of superscalar execution on modern CPUs. For example, the an integer addition instruction implemented using this technique would look as follows:

    -- edx = BASE = start of stack frame. E.g., virtual register 5
    --    is at memory address BASE + 5 * BYTES_PER_WORD
    -- esi = PC, always points to the next instruction
    -- ebx = DISPATCH = the dispatch table
    -- ecx = A = pre-decoded value of field A
    -- eax = D = pre-decoded value of field D
    -- ebp = OP, opcode of current instruction (usually ignored)

    -- Semantics: BASE[A] = BASE[B] + BASE[C]

    -- 1. Decode D into B (ebp) and C (eax, same as D)
    movzx ebp, ah  -- zero-extend 8-bit register ah into ebp = B
    movzx eax, al  -- zero-extend 8-bit register al into eax = C
    -- 2. Perform the actual addition
    mov ebp, [edx + 4*ebp]   -- read BASE[B]
    add ebp, [edx + 4*eax]   -- ebp = BASE[B] + BASE[C]
    mov [edx + 4*ecx], ebp   -- BASE[A] = ebp
    -- 3. Dispatch next instruction
    mov eax, [esi]    -- Load next instruction into eax
    movzx ecx, ah     -- Predecode A into ecx
    movzx ebp, al     -- zero-extend OP into ebp
    add esi, 4        -- increment program counter
    shr eax, 16       -- predecode D
    jmp [ebx + ebp * 4]  -- jump to next instruction via dispatch table

The reason for predecoding some of the arguments is that the final "jmp" instruction may be quite expensive because indirect branches cause difficulties for branch predictors. The previous instructions help keep the pipeline somewhat busy if the branch did indeed get mispredicted.

These 11 instructions are typically executed in about 5 clock cycles on an Intel Core 2 processor. Based on a very simple benchmark of only simple (addition, branch, compare) instructions, interpreted code is roughly 7x slower than machine code. For more complicated instructions the interpreter overhead becomes less severe and the slowdown should be smaller.

Note, though, that in this example the ADD operation was type-specialised to integers. In many dynamic languages the addition operator is overloaded to work over several types. A generic ADD bytecode instruction then must include a type check (e.g., int vs. float) and then dispatch to the relevant implementation. This can introduce severe execution overheads due to the high cost of branch mispredictions on modern (desktop) CPUs. However, this is partly a language design issue (or at least language/implementation co-design issue) and independent of how we choose to implement our interpreter.

In addition to the performance advantages over C-based interpreters, there is size advantage. It's a good idea to ensure that the frequently-executed parts of the interpreter fit within the L1 instruction cache (typically around 32-64 KiB). Writing the interpreter in assembly helps with that. For example, the above code requires exactly 32 bytes. If we chose to use register ebp for BASE, it would have been a few bytes larger due to encoding restrictions on x86.

Portable Fast Interpreters?

The downside of writing an interpreter in assembly is that it is completely unportable. Furthermore, if we need to change the semantics of a bytecode instruction we have to update it for each architecture separately. For example, LuaJIT 2 has interpreters for 6 architectures of around 4000K 4000 lines per architecture (ARMv6, MIPS, PPC, PPCSPE/Cell, x86, x86-64).

The WebKit project therefore built its own custom "portable" assembly language as a Ruby DSL. For an example of what it looks like, see LowLevelInterpreter.asm. Occasionally, you probably still want to special-case some code for a specific architecture, but it seems to go into the right direction. It also seems to make code more readable than Dalvik's (Android's VM) template substitution method. I'd rather have all the code in the same file (possibly with a few #ifdef-like places) rather than spread over 200+ files. It also should be quite simple to translate this assembly into C to get a default interpreter for architectures that are not yet supported.

So far, every project seems to have built its own tool chain. I guess it's too much of a niche problem with too many project-specific requirements to give rise to a reusable standard tool set.