Monday, May 12, 2008

The Thing That Should Not Be (Or: How to import 18500+ patches from Darcs into Git in less than three days)

I like Darcs. Really. It is easy to learn and use and for smallish projects I never had any real problems. Unfortunately, it still has some performance problems and it is likely that some operations will never be fast.

An extreme example of where you run into those problems is the GHC repository. It consists of over 18500 patches and spans over 12 years of history. When I tried to build the latest version I ran into a linker error which I know I didn't get with the snapshot from one month ago. As GHC builds take quite a while I wanted to use an efficient way to find which exact change introduced the problem. More precisely I wanted git bisect.

I know that Don had converted Darcs repositories to Git in order to get ohloh statistics, but he reported that this process was rather painful. It took four weeks(!) to convert the GHC repository.

So, I looked what tools were out there, and how to improve them. I know that there is Tailor, but I looked at darcs-to-git by Steve Purcell first and found it very hackable. I didn't like that it saved the Darcs patch ID in the Git commit message, so I changed that and I extended it to properly translate Darcs escape sequences. I also added a parameter to only pull a number of patches at the same time, so that I can import a big repository in stages and I allowed custom mapping from committer names to other committer names. I used this to map various pseudonyms to (a unique) full name and email address. (I hope no one minds being credited with his or her full name. ;) )

It worked rather well for smallish repositories (a bit less than 2000 patches) but I had serious problems to get it to work with GHC.

  • Darcs has a bug on case-insensitive volumes (which OS X uses by default), so Steve suggested using a case-sensitive sparse image. This works, but it is probably a bit slower. I tried running it on my FreeBSD home server, but it has only 256 MB of RAM (usually fine for a home file server) so Darcs ran out of space and eventually got killed by the OS. (Getting Darcs to compile on my server was an adventure in itself--first a few hours to update the ports tree, then one more to compile GHC 6.8 which then just failed to install...) Fortunately, my Laptop has 2 GB, so it works fine there.
  • At startup darcs-to-git reads the full Darcs patch inventory. For such a big repo as GHC this takes over a minute (and lots of RAM). Caching it in a file didn't seem to help much. I could have lived with that, but there was a more serious problem: the approach used by darcs-to-git (and, it seems, also by Tailor) doesn't work!
  • darcs-to-git pulls one patch at a time by giving it's ID to darcs pull --match 'hash', then git adds the changes on the Git side and git commits it with the appropriate commit message. The patches are pulled in the order in which they were applied in the source repo, so any dependencies should be fulfilled. Nevertheless, Darcs refused to apply some patches -- silently. Darcs just determined that I didn't want to pull any patches and didn't do anything. This is most likely a Darcs bug, but I heard it was only a known bug for some development version of Darcs 2 (I used Darcs 1.0.9 at that time). Anyways, that didn't work; it failed at about patch 30 of the GHC repository.
  • OK. So instead of pulling patches by ID we could fake user interaction. Something like this:
    $ echo "yd" | darcs pull source-repo
    The input corresponds to "Yes, I want to pull this patch" and "Ok, I'm done and want to pull all the selected patches". This works reliably and also has the advantage that we don't have to read the whole history up front but instead can just retrieve the info for the last applied patch via
    darcs changes --last 1 --xml
  • By now you might have guessed, though, that this still didn't work very well. It took about 60 seconds per patch (with about 1 second of this used by Git), resulting in estimated 13 days(!) CPU time for the full repository.
  • Interestingly, most of those 60 seconds are spent before any patch choice is displayed, so apparently Darcs is doing something to calculate which patches to show. After that, displaying more choices is relatively quick. Apparently, the startup time depends on the number of patches not yet pulled.

This leads to the following trick.

We use two intermediate repositories. We use one to pull several patches at a time from the source repository. I use 15 patches, ie.:

$ cd tmp/ghc.pull
$ echo "yyyyyyyyyyyyyyyd" | darcs pull /path/to/ghc
We now could import from this intermediate repository into Git, since the startup time to pull from this repo is now much lower. However, we'd like to already start and pull the next 15 patches into the temporary repository. Pulling from and into the same repo at the same time doesn't work (Dars locks the repo), so we also need to mirror this temporary repository. A cp -r would work, but as the repository grows larger, this would do unnecessary work. So I just pull the changes at once.
$ cd /tmp/ghc.pull_mirror
$ darcs pull --all /tmp/ghc.pull  # this is pretty quick now
Now we can import into our git mirror from there, and already start pulling the next 15 patches (proper term for this is "macro pipelining", I believe).
$ cd /path/to/ghc.git
$ ./darcs-to-git /tmp/ghc.pull_mirror &  # run in background
$ cd /tmp/ghc.pull
$ echo "yyyy..." # etc
Of course, before pulling from the first mirror into the second mirror we have to make sure that darcs-to-git has finished pulling from the second mirror. I have implemented this as a shell script on top of darcs-to-git, but I may move it into darcs-to-git at some point.

My fork of darcs-to-git as well as Steve's main repo are both available at Github. I haven't pushed all of my local changes yet, but I plan to implement pulling dars patches "interactively" as a possible option for darcs-to-git, so maybe check the repo in a week or two.

With this approach I am down to about 200 seconds per 15 patches or about 68 hours fo the 18500 patches of the GHC repo which is just below the promised three days. (Of course, YMMV)

So the moral of this story? Darcs is very slow for biggish repositories, especially for rarely used border cases (such as pulling patches one by one). It may be possible to fix them, but I doubt that this will be easy. I tried using the new hashed format and the darcs-2 format, but converting the GHC repo didn't work for me. I certainly hope that things get better, and I plan to help at least a little by submitting several bug reports in the next couple of days about the problems I ran into in the past days. Let's see what happens.

Oh, and Darcs needs a killer-app like Github!

Sunday, March 16, 2008

A short reminder

Folds and maps are Haskell's default iteration combinators. Mapping is easy enough, but folds can often be rather messy, especially if nested. For example, given a map of sets of some values, we want to write a function to swap keys and values. The function's type will be:

type SetMap k a = Map k (Set a)

invertSetMap :: (Ord a, Ord b) => SetMap a b -> SetMap b a
The resulting map should contain a key for each value of type b, occurring in any set in the original map. The new values (of type Set a) are all those original keys for which the new key occurred in the value set. Intuitively, if the original set was representing arrows from values of type a to values of type b, this function should reverse all arrows. We can easily implement this function using two nested folds.

invertSetMap sm = 
      (\k as r -> 
          S.fold (\a r' -> M.insertWith S.union a (S.singleton k) r')
That's not pretty at all! I had written quite a bit of this kind of code (and hated it each time), until I finally remembered a fundamental Haskell lesson. Haskell uses lists to simulate iteration and specify other kinds of control flow. In particular list comprehensions are often extremely cheap, since the compiler can automatically remove many or all intermediate lists and generate very efficient code. So let's try again.

invertSetMap sm = M.fromListWith S.union
    [ (a, S.singleton k) | (k, as) <- M.assocs sm
                         , a       <- S.toList as ]
So much more readable! A quick benchmark also shows that it's slightly faster (a few percent for a very big map). Lesson to take home: If your folds get incomprehensible consider list comprehensions.