Friday, August 17, 2012

Beyond Package Version Policies

When I read the announcement of the latest GHC release candidate, I did not feel excitement but rather annoyance. The reason is that now I have to go and check all my packages' dependency specifications and see if they require a version bump. I was not the only one. In the following I sketch an approach that I think could take out most of the pain we currently experience in the Hackage ecosystem.

The Problem


The reason for this annoyance is the rather strict Hackage package versioning policy (PVP). The PVP specifies the format of the version number of a Hackage package as a sequence of four integers separated by dots:

majorA.majorB.minor.patchlevel

The first two components together form the major version number, and at least one of them must be incremented if a potentially breaking change is introduced in a new package release. The minor number must be incremented if new functionality was added, but the package is otherwise still compatible with the previous version. Finally, a patchlevel increment is necessary if the API is unchanged and only bugfixes or non-functional changes (e.g., documentation) were made. This sounds fairly reasonable and is basically the same scheme that is used for shared libraries in the C world.

When specifying dependencies on other packages, authors are strongly encouraged to specify upper bounds on the major version. This is intended to avoid breaking the package if a new major version of the dependency is released (cabal-install always tries to use the latest possible version of a dependency). If the package also works with a newer version of the dependency, then the author is expected to release a new version of his/her library with an increased upper bound for the dependency version.
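
In a .cabal file such constraints look something like this (the package names and version ranges here are purely illustrative):

build-depends: base >= 4.5 && < 4.6,
               filepath >= 1.3 && < 1.4

The upper bounds exclude the next major version, so every major release of base or filepath makes this package unbuildable until someone relaxes the bounds.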

In Haskell, unfortunately, this system doesn't work too well for a number of reasons:

  1. Say, my package P depends on package A-1.0 and I now want to test if it works with the newly released version A-1.1. My package also depends on package B-0.5 which in turn also depends on A-1.0. GHC currently cannot link two versions of the same package into the same executable, so we must pick one version that works with both -- in this case that's A-1.0. D'oh!
    I now have two options: (a) wait for the author of package B to test it against A-1.1, or (b) do it myself. If I choose option (b) I also have to send my patch to the author of B, wait for him/her to upload the new version to Hackage and only then can I upload my new version to Hackage. The problem is multiplied by the number of (transitive) dependencies of my package and the number of different authors of these packages. This process takes time (usually months) and the fast release rate of GHC (or many other Haskell packages, for that matter) doesn't make it any easier.
  2. Packages get major-version upgrades rather frequently. One reason is that many Haskell libraries are still in flux. Another is that if a package adds a new instance, a major version upgrade is required. We can protect against new functions/types being added to a package because we can use explicit import lists. New instances are imported automatically, and there's no way to hide them when importing a module.
  3. A package version is a very crude and conservative approximation of whether a dependent package might break.

Generally, I think it's a good thing that Haskell packages are updated frequently and improved upon. The problem is that the current package infrastructure and tools don't work well with it. The PVP is too conservative.

A Better Approach


The key notion is to track dependencies at the level of individual functions, types, etc. rather than at the level of whole packages.

When a package P depends on another package A it usually doesn't depend on the whole package. Most of the time P just depends on a few functions and types. If some other part of A is changed, that shouldn't affect P. We have so much static information available to us, it's a shame we're not taking advantage of it. Consider the following system:
  1. When I compile my code, the compiler knows exactly which functions, types, etc. my program uses and which packages they come from. The compiler (or some other tool) writes this information to a file (preferably in a human-readable format). Let's call this file: dependencies.manifest
  2. Additionally, the compiler/tool also generates a list of all the functions, types, etc. defined by code in my package. Let's call that file: exports.manifest. I believe GHC's ABI versioning already does something very similar to this, although it just reduces this all to a hash.
The first use of this information is to decide whether a package is compatible with an expected dependency. So, if my package's "dependencies.manifest" contained (for example)

type System.FilePath.Posix.FilePath = Data.String.String
System.FilePath.Posix.takeBaseName :: System.FilePath.Posix.FilePath -> System.FilePath.Posix.FilePath

then it is compatible with any future (or past) version of the filepath package that preserves this API and that defines FilePath as a type synonym for String.

Of course, this only checks for API name and type compatibility, not actual semantic compatibility. Checking semantics requires some hints from the package author, as described below. Together with those annotations about semantic changes, the only information we need to decide whether a newer package is a compatible dependency is the manifest recorded against the original versions of the dependencies and the manifest of the new package.
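
As a rough sketch of the API part of this check (everything here is invented for illustration; real manifests would also need to resolve type synonyms and represent types more faithfully than as strings), it boils down to verifying that every entry of my dependencies.manifest is still exported, at the same type, by the candidate version:

import qualified Data.Map as Map

type Name = String  -- fully qualified, e.g. "System.FilePath.Posix.takeBaseName"
type Sig  = String  -- the rendered type signature

type Needs   = [(Name, Sig)]      -- my package's dependencies.manifest
type Exports = Map.Map Name Sig   -- the candidate's exports.manifest

apiCompatible :: Needs -> Exports -> Bool
apiCompatible needs exports = all stillThere needs
  where stillThere (name, sig) = Map.lookup name exports == Just sig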

For example, let's say version 0.1 of my package looks as follows:

module Gravity where

bigG :: Double -- N * (m / kg)^2
bigG = 6.674e-11

force :: Double -> Double -> Double -> Double -- N
force m1 m2 r = (bigG * m1 * m2) / (r * r)

Its manifest will look something like this:

Gravity.bigG :: Double, 0.1
Gravity.force :: Double -> Double -> Double -> Double, 0.1

The version of each item is the version of the package at which the item was introduced or at which its semantics last changed.

Now I add a new function in version 0.1.1:

standardGravity :: Double -- m/s^2
standardGravity = 9.80665

The manifest for version 0.1.1 now will be

Gravity.bigG :: Double, 0.1
Gravity.force :: Double -> Double -> Double -> Double, 0.1
Gravity.standardGravity :: Double, 0.1.1

Now, let's say I want to improve the accuracy of bigG in version 0.2:

bigG = 6.67384e-11

Since bigG was changed and force depended upon it, by default the new manifest would be:

Gravity.bigG :: Double, 0.2
Gravity.force :: Double -> Double -> Double -> Double, 0.2
Gravity.standardGravity :: Double, 0.1.1

However, one could argue that this is a backwards compatible change, hence the manifest would be adjusted by the author (with the help of tools) to:

Gravity.bigG :: Double, 0.1
Gravity.force :: Double -> Double -> Double -> Double, 0.1
Gravity.standardGravity :: Double, 0.1.1

That is the same manifest as version 0.1.1, thus 0.2 is 100% compatible with all users of 0.1.1 (according to the package author), and even all users of 0.1 because no functionality has been removed.
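
Reusing the types from the earlier sketch, the version annotation makes the check only slightly more involved (again, purely illustrative): an item is still compatible if it is exported at the same type and the version at which its semantics last changed is the one my package was built against.

type Version = [Int]  -- e.g., [0,1,1] for version 0.1.1

type Needs'   = [(Name, (Sig, Version))]
type Exports' = Map.Map Name (Sig, Version)

compatible :: Needs' -> Exports' -> Bool
compatible needs exports = all stillThere needs
  where stillThere (name, entry) = Map.lookup name exports == Just entry

Under this check, the adjusted 0.2 manifest is accepted on behalf of any package built against 0.1 or 0.1.1, exactly as argued above.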

Even if manifests didn't include the version number (for now), I believe just the API information is precise enough for most cases. It will still be necessary to constrain the allowed range of package dependencies occasionally, but that should be the rare exception (e.g., a performance regression) rather than the current state where dependencies need to be adjusted every few months.

Upgrade Automation


This mechanism alone only helps with being less conservative when checking whether a package can work with an updated dependency. The other issue is that Haskell package APIs often move quickly, and thus breaking code is unavoidable. If a package only has a few dependents this may not be such a big deal, but it becomes a problem for widely used packages. For example, during the discussions about including the vector package in the Haskell Platform, some reviewers asked for functions to be moved from one module into another. Roman, vector's maintainer, argued against this, noting that it would break many dependent packages -- a valid concern. Even if this were only a small issue, fear of breaking dependent packages can slow down improvements in package APIs.

The Go programming language project has a tool called "gofix", which can automatically rewrite code for simple API changes and generates warnings for places that require human attention. Haskell has so much static information that such a tool is quite feasible (e.g., HaRe can already do most of the important bits).

So, I imagine that a newly-released package specifies up to two additional pieces of information:

  • An annotated manifest indicating where semantic changes were made while retaining the same API. This can be seen as bumping the version of a single function/type, rather than of the whole API. To avoid the impact of human error this, too, should be tool-supported. For example, if we compute an ABI hash for each function, we can detect which functions were modified. The package author can then decide if that was just a refactoring or an actual semantic change.
    (This has to be done with the help of tools. Imagine we refactor a frequently used internal utility function. Then all functions that use it would potentially have changed semantics. However, as soon as that function is marked as backwards compatible, so are all its users. So it's important that a tool asks the package author about compatibility starting with the leaf nodes.)
  • Optionally, the author may specify an upgrade recipe to be used by an automated tool or even just by a user of the library. This could include simple instructions like renamings of functions (which includes items moved between modules or even packages), or more complicated things like a definition of a removed function in terms of newly-added functions. For more complicated changes, a textual description can give higher-level instructions on how to upgrade manually. Since this should be human-readable anyway, we may as well specify the upgrade recipe in a (formally defined) format that looks like a Changelog file; a sketch follows below.
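
To make this concrete, here is what such a recipe might look like for a hypothetical 0.3 release of the Gravity package from the example above. The syntax is entirely invented; the point is only that a single file can drive a tool and still read like a Changelog:

0.3 -- upgrade recipe, relative to 0.2

RENAMED Gravity.force TO Gravity.gravitationalForce
REMOVED Gravity.standardGravity
  DEFINE Gravity.standardGravity = 9.80665
MANUAL  "force now rejects r <= 0; callers relying on a zero distance must change."

A tool could apply the RENAMED and DEFINE entries mechanically and would merely point the user at the MANUAL entry.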

Summary


The PVP doesn't work well because it is too conservative and too coarse-grained. Haskell contains enough static information to accurately track dependencies at the level of functions and types. We should take advantage of this information.

The ideas presented above certainly require refinement, but even if we have to be conservative in a few places (e.g., potentially conflicting instance imports), I think it will still be much less painful than the current system.

Comments and constructive critiques welcome!

Tuesday, July 31, 2012

Implementing Fast Interpreters: Discussion & Further Reading

My last article on "Implementing Fast Interpreters" was posted to both Hacker News and the Programming subreddit. Commenters raised some valid questions and pointed out some related work. This post aims to address some of the issues and to collect the related work mentioned in the HN and Reddit comments, plus some more.

Why not write a simple JIT?


The main benefits of an interpreter are usually simplicity and portability. Writing the interpreter in assembly makes it a good deal harder to write and you lose portability. So, if we have to go down to the level of assembly anyway, why not write a simple template-based code generator? Wouldn't that be faster? The answer is: it depends.

As Mike Pall (author of LuaJIT) pointed out, LuaJIT v1 was based on a simple JIT and does not consistently beat the LuaJIT v2 interpreter (which is written in assembly).

Performance comparison of LuaJIT v1.1 (simple JIT) vs. LuaJIT v2 interpreter.
A JIT compiler (even a simple one) needs to:
  • Manage additional memory. If the code is short-lived (e.g., there is another optimisation level) then unused memory must be reclaimed and fragmentation may become an issue.
  • Mark that memory as executable (possibly requiring flushing the instruction cache or operating system calls).
  • Manage the transition between compiled code segments. E.g., we need a mapping from bytecode targets to corresponding machine code. If the target hasn't been compiled yet, this can be done on demand and the branch instruction can be updated to jump directly to the target. If the target may ever change in the future, we also need a way to locate all the branches to it in order to invalidate them if necessary.
  • If the interpreter has multiple execution modes (e.g., a trace-based JIT has a special recording mode which is only used for a short time), then code needs to be generated for each execution mode.
All this is quite a bit more complicated than a pure interpreter, even one written in assembler. Whether it's worth taking on this complexity depends on how high the interpreter overhead actually is. If the bytecode format is under our control, then we can optimise it for our interpreter, as done in LuaJIT 2. This isn't always the case, though.

The DynamoRIO system is a runtime instrumentation framework for x86. The x86 instruction format is very complex and thus expensive to decode. DynamoRIO therefore does not use an interpreter and instead decodes the host program into basic blocks. (No code generation is really needed, because DynamoRIO emulates x86 on x86, but you could imagine using the same technique to emulate x86 on, say, ARM.) The challenges (and solutions) for this approach are discussed in Derek Bruening's excellent PhD thesis on DynamoRIO. Bebenita et al. describe an interpreter-less JIT-compiler design for a Java VM in their paper "Trace-Based Compilation in Execution Environments without Interpreters" (PDF).

DSLs for generating interpreters


The PyPy project generates a C interpreter and a trace-based JIT from a description in RPython ("Restricted Python"), a language with Python-like syntax but C-with-type-inference-like semantics. Laurence Tratt described his experiences using PyPy in his article "Fast Enough VMs in Fast Enough Time". PyPy's toolchain makes many design decisions for you; its usefulness therefore depends on whether these decisions align with your goals. The PyPy Status Blog contains many examples of language VMs written in PyPy (e.g. PHP, BF, regular expressions).

A DSL aimed at developing a fast assembly interpreter can still be useful for other purposes. One problem in implementing a JIT compiler is ensuring that the semantics of the compiled code match the semantics of the interpreter (or baseline compiler). By having a slightly higher-level abstraction it could become possible to use the same specification to both generate the interpreter and parts of the compiler. For example, in a trace-based JIT, the code for recording and interpreting looks very similar:

Interpreter:                  Trace Recorder:

  mov tmp, [BASE + 8 * RB]    Ref r1 = loadSlot(ins->b);
                              Ref r2 = loadSlot(ins->c);
  add tmp, [BASE + 8 * RC]    Ref r3 = emit(IR_ADD_INT, r1, r2);
  mov [BASE + 8 * RA], tmp    writeSlot(ins->a, r3);

If we treat the trace recorder as just another architecture, it may save us a lot of maintenance effort later on.
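
To illustrate (a toy sketch, not an existing tool; every name in it is invented): we could describe an instruction's semantics once, as data, and then interpret that description with two backends -- one printing the interpreter's assembly, the other the trace recorder's IR-emitting calls.

import Data.Char (toLower, toUpper)

-- Operand fields of a three-operand bytecode instruction.
data Field = A | B | C deriving Show

-- A tiny semantics description: operator name, destination, two sources.
data Sem = BinOp String Field Field Field

addInt :: Sem
addInt = BinOp "add" A B C

-- Backend 1: the interpreter, as (pseudo-)assembly.
asmBackend :: Sem -> [String]
asmBackend (BinOp op a b c) =
  [ "mov tmp, [BASE + 8 * R" ++ show b ++ "]"
  , op ++ " tmp, [BASE + 8 * R" ++ show c ++ "]"
  , "mov [BASE + 8 * R" ++ show a ++ "], tmp" ]

-- Backend 2: the trace recorder, as calls into the IR emitter.
recorderBackend :: Sem -> [String]
recorderBackend (BinOp op a b c) =
  [ "Ref r1 = loadSlot(ins->" ++ lower b ++ ");"
  , "Ref r2 = loadSlot(ins->" ++ lower c ++ ");"
  , "Ref r3 = emit(IR_" ++ map toUpper op ++ "_INT, r1, r2);"
  , "writeSlot(ins->" ++ lower a ++ ", r3);" ]
  where lower = map toLower . show

Generating both from the one description keeps interpreter and recorder in sync by construction.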

Further Reading


There are many other resources; here is a short selection of interesting articles and papers on the topic. If you have more links to good articles, please leave a comment.

Thursday, July 26, 2012

ARM's New 64 Bit Instruction Set

You may have heard that ARM, whose CPUs are extremely popular for embedded devices, is trying to move into the low-power server market. One of the current main difficulties for using ARM processors in servers is that it is only a 32 bit architecture (A32). That means that a single process can address at most 4GB of memory (and some of that is reserved for the OS kernel). This isn't a problem on current embedded devices, but it can be for large multi-threaded server applications. To address this issue ARM has been working on a 64 bit instruction set (A64). To my knowledge there is no commercially available hardware that implements this instruction set yet, but ARM has already released patches to the Linux kernel to support it.

To my surprise, this new 64 bit instruction set is quite different from the existing 32 bit instruction sets. (Perhaps I shouldn't be surprised, since the two Thumb instruction sets were indeed quite different from the existing instruction sets.) It looks like a very clean RISC-style design. Here are my highlights:
  • All instructions are 32 bits wide (unlike the Thumb variants, but like the original A32).
  • 31 general-purpose 64-bit-wide registers (instead of 14 general-purpose 32-bit registers in A32). The 32nd register is either hardwired to zero or the stack pointer. These registers can be accessed as 32 bit registers (called w0, w1, ..., w31) or 64 bit registers (called x0, x1, ..., x31).
  • Neither the stack pointer (SP) nor the program counter (PC) is a general-purpose register. They are only read and modified by certain instructions.
  • In A32, most instructions could be executed conditionally. This is no longer the case.
  • Conditional instructions are not executed conditionally, but instead pick one of two inputs based on a condition. For example, the "conditional select" instruction CSEL x2, x4, x5, cond implements x2 = if cond then x4 else x5. This subsumes a conditional move: CMOV x1, x2, cond can be defined as a synonym for CSEL x1, x2, x1, cond. There are many more of these conditional instructions, but they will all modify the target register.
  • A conditional compare instruction can be used to implement C's short-circuiting semantics. In a conditional compare the condition flags are only updated if the previous condition was true.
  • There is now an integer division instruction. However, it does not generate an exception/trap upon division by zero. Instead (x/0) = 0. That may seem odd, but I think it's a good idea. A conditional test before a division instruction is likely to be cheaper than a kernel trap.
  • The virtual address space is 49 bits, or 512TB. Unlike x86-64/AMD64, where the top 16 bits must be all zeros or all ones, the highest 8 bits may optionally be usable as a tag. This is configured using a system register. I'm not sure if that will require kernel support. It would certainly come in handy for implementing many higher-level programming languages.
  • There are a number of instructions for PC-relative addressing. This is useful for position-independent code.
  • SIMD instruction support is now guaranteed. ARMv8 also adds support for crypto instructions; these are also available in A32.
All the existing ARM instruction sets (except perhaps Jazelle) will still be supported. I don't think you can dynamically switch between different instruction sets as was the case for A32/Thumb, though.

Further reading:

Sunday, July 22, 2012

Implementing Fast Interpreters

Many modern virtual machines include either a fast interpreter or a fast baseline compiler. A baseline compiler needs extra memory and is likely slower for code that is only executed once. An interpreter avoids this memory overhead and is very flexible. For example, it can quickly switch between execution modes (profiling, tracing, single-stepping, etc.). So what, then, is the state of the art in building fast interpreters?

Modern C/C++ compilers are very good at optimising "normal" code, but they are not very good at optimising large switch tables or direct-threaded interpreters. The latter also requires the "Labels as Values" GNU extension. In a direct-threaded interpreter, each instruction takes the following form:

op_ADD_INT:
    int op1 = READ_OP1;
    int op2 = READ_OP2;
    int result = op1 + op2;  // The actual implementation.
    WRITE_RESULT(result);
    unsigned int opcode = pc->opcode;
    ++pc;
    goto *dispatch[opcode];  // Jump to code for next instruction.

The variable dispatch (the dispatch table) is an array of labels, one for each opcode. The last three lines transfer control to the implementation of the next bytecode instruction. If we want to change the implementation of an instruction dynamically, we can just update the pointer in the dispatch table, or change the dispatch table pointer to point to a different table altogether.

Note that everything except for the line "result = op1 + op2" is interpreter overhead, and its cost should be minimised. The program counter (pc) and the dispatch table are needed by every instruction, so they should be kept in registers throughout. Similarly, we want a variable that points to the stack and possibly a variable that points to a list of literals -- two more registers. We also need further registers for holding the operands. These registers had better be the same for each instruction, since otherwise some instructions would need to move things around, leading to memory accesses. On x86 we only have 7 general-purpose registers, which makes it very hard for the compiler to use them optimally across all instructions.

Interpreters Written in Assembly


For this reason, most interpreters are hand-optimised assembly routines with a specially designed calling convention that keeps as much state as possible in registers. A particularly nice and fast example is the LuaJIT 2 interpreter.

Both the standard Lua interpreter and the LuaJIT 2 interpreter use a register-based bytecode format. Compared to the better-known stack-based bytecodes, a register-based bytecode has larger instructions, but requires fewer instructions overall. For example, the expression "s = x + (y * z)" in a stack-based bytecode would translate to something like:

PUSHVAR 0    -- push variable "x" onto operand stack
PUSHVAR 1    -- push variable "y"
PUSHVAR 2    -- push variable "z"
MUL          -- top of operand stack is (y * z)
ADD          -- top of operand stack is x + (y * z)
STOREVAR 3   -- write result into variable "s"

With a few optimisations, this can be encoded as only 6 bytes. In a register-based bytecode this would translate to something like this:

MUL 4, 1, 2   -- tmp = y * z
ADD 3, 0, 4   -- s = x + tmp

Each instruction takes variable indices -- the (virtual) registers -- and reads from and writes directly to those variables (stored on the stack). In LuaJIT 2, each instruction requires 4 bytes, so the overall bytecode is a bit larger. However, executing these instructions incurs the interpreter overhead only twice instead of six times as in the stack-based bytecode. It may also avoid memory traffic by avoiding the separate operand stack.

The LuaJIT 2 interpreter also uses a very simple bytecode format that avoids further bit shuffling (which would increase the cost of decoding). Instructions can only have one of two forms:

OP A, B, C
OP A, D

Here OP, A, B, and C are 8-bit fields. OP is the opcode and A, B, C are usually register IDs or sometimes literals. D is a 16-bit field that overlaps with B and C. It usually holds a literal value (e.g., a jump offset).
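
As a sketch, here is the decoding in Haskell, assuming one concrete field layout -- OP in the lowest byte, then A, with D in the upper 16 bits, C in its lower byte and B in its upper byte. This layout matches the assembly shown below, but it is an assumption made for illustration, not necessarily LuaJIT's exact encoding.

import Data.Bits (shiftR, (.&.))
import Data.Word (Word16, Word32, Word8)

bcOp, bcA, bcC, bcB :: Word32 -> Word8
bcOp ins = fromIntegral (ins .&. 0xff)                -- bits 0..7
bcA  ins = fromIntegral ((ins `shiftR`  8) .&. 0xff)  -- bits 8..15
bcC  ins = fromIntegral ((ins `shiftR` 16) .&. 0xff)  -- bits 16..23
bcB  ins = fromIntegral ((ins `shiftR` 24) .&. 0xff)  -- bits 24..31

bcD :: Word32 -> Word16  -- overlaps B and C
bcD ins = fromIntegral (ins `shiftR` 16)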

The LuaJIT 2 interpreter now combines this with a calling convention where part of the next instruction is decoded before actually transferring control to it. This tries to take advantage of superscalar execution on modern CPUs. For example, an integer addition instruction implemented using this technique would look as follows:

bc_ADD_INT:
    -- edx = BASE = start of stack frame. E.g., virtual register 5
    --    is at memory address BASE + 5 * BYTES_PER_WORD
    -- esi = PC, always points to the next instruction
    -- ebx = DISPATCH = the dispatch table
    -- ecx = A = pre-decoded value of field A
    -- eax = D = pre-decoded value of field D
    -- ebp = OP, opcode of current instruction (usually ignored)

    -- Semantics: BASE[A] = BASE[B] + BASE[C]

    -- 1. Decode D into B (ebp) and C (eax, same as D)
    movzx ebp, ah  -- zero-extend 8-bit register ah into ebp = B
    movzx eax, al  -- zero-extend 8-bit register al into eax = C
    -- 2. Perform the actual addition
    mov ebp, [edx + 4*ebp]   -- read BASE[B]
    add ebp, [edx + 4*eax]   -- ebp = BASE[B] + BASE[C]
    mov [edx + 4*ecx], ebp   -- BASE[A] = ebp
    -- 3. Dispatch next instruction
    mov eax, [esi]    -- Load next instruction into eax
    movzx ecx, ah     -- Predecode A into ecx
    movzx ebp, al     -- zero-extend OP into ebp
    add esi, 4        -- increment program counter
    shr eax, 16       -- predecode D
    jmp [ebx + ebp * 4]  -- jump to next instruction via dispatch table

The reason for predecoding some of the arguments is that the final "jmp" instruction may be quite expensive because indirect branches cause difficulties for branch predictors. The previous instructions help keep the pipeline somewhat busy if the branch did indeed get mispredicted.

These 11 instructions are typically executed in about 5 clock cycles on an Intel Core 2 processor. Based on a very simple benchmark of only simple (addition, branch, compare) instructions, interpreted code is roughly 7x slower than machine code. For more complicated instructions the interpreter overhead becomes less severe and the slowdown should be smaller.

Note, though, that in this example the ADD operation was type-specialised to integers. In many dynamic languages the addition operator is overloaded to work over several types. A generic ADD bytecode instruction then must include a type check (e.g., int vs. float) and then dispatch to the relevant implementation. This can introduce severe execution overheads due to the high cost of branch mispredictions on modern (desktop) CPUs. However, this is partly a language design issue (or at least a language/implementation co-design issue) and independent of how we choose to implement our interpreter.

In addition to the performance advantages over C-based interpreters, there is a size advantage. It's a good idea to ensure that the frequently-executed parts of the interpreter fit within the L1 instruction cache (typically around 32-64 KiB). Writing the interpreter in assembly helps with that. For example, the above code requires exactly 32 bytes. If we had chosen to use register ebp for BASE, it would have been a few bytes larger due to encoding restrictions on x86.

Portable Fast Interpreters?


The downside of writing an interpreter in assembly is that it is completely unportable. Furthermore, if we need to change the semantics of a bytecode instruction we have to update it for each architecture separately. For example, LuaJIT 2 has interpreters for 6 architectures, at around 4000 lines per architecture (ARMv6, MIPS, PPC, PPCSPE/Cell, x86, x86-64).

The WebKit project therefore built its own custom "portable" assembly language as a Ruby DSL. For an example of what it looks like, see LowLevelInterpreter.asm. Occasionally you probably still want to special-case some code for a specific architecture, but it seems to go in the right direction. It also seems to make code more readable than Dalvik's (Android's VM) template substitution method. I'd rather have all the code in the same file (possibly with a few #ifdef-like sections) than spread over 200+ files. It should also be quite simple to translate this assembly into C to get a default interpreter for architectures that are not yet supported.

So far, every project seems to have built its own toolchain. I guess it's too much of a niche problem, with too many project-specific requirements, to give rise to a reusable standard tool set.