Monday, May 12, 2008

The Thing That Should Not Be (Or: How to import 18500+ patches from Darcs into Git in less than three days)

I like Darcs. Really. It is easy to learn and use and for smallish projects I never had any real problems. Unfortunately, it still has some performance problems and it is likely that some operations will never be fast.

An extreme example of where you run into those problems is the GHC repository. It consists of over 18500 patches and spans more than 12 years of history. When I tried to build the latest version, I ran into a linker error that I know I didn't get with the snapshot from one month ago. As GHC builds take quite a while, I wanted an efficient way to find exactly which change introduced the problem. More precisely, I wanted git bisect.

I know that Don had converted Darcs repositories to Git in order to get ohloh statistics, but he reported that this process was rather painful. It took four weeks(!) to convert the GHC repository.

So I looked at what tools were out there and how to improve them. I know that there is Tailor, but I looked at darcs-to-git by Steve Purcell first and found it very hackable. I didn't like that it saved the Darcs patch ID in the Git commit message, so I changed that, and I extended it to properly translate Darcs escape sequences. I also added a parameter to pull only a limited number of patches at a time, so that I can import a big repository in stages, and I added a custom mapping from committer names to other committer names. I used this to map various pseudonyms to a unique full name and email address. (I hope no one minds being credited with his or her full name. ;) )

It worked rather well for smallish repositories (a bit less than 2000 patches), but I had serious problems getting it to work with GHC.

  • Darcs has a bug on case-insensitive volumes (which OS X uses by default), so Steve suggested using a case-sensitive sparse image. This works, but it is probably a bit slower. I tried running it on my FreeBSD home server, but it has only 256 MB of RAM (usually fine for a home file server), so Darcs ran out of memory and eventually got killed by the OS. (Getting Darcs to compile on my server was an adventure in itself--first a few hours to update the ports tree, then one more to compile GHC 6.8, which then just failed to install...) Fortunately, my laptop has 2 GB, so it works fine there.
  • At startup darcs-to-git reads the full Darcs patch inventory. For such a big repo as GHC this takes over a minute (and lots of RAM). Caching it in a file didn't seem to help much. I could have lived with that, but there was a more serious problem: the approach used by darcs-to-git (and, it seems, also by Tailor) doesn't work!
  • darcs-to-git pulls one patch at a time by giving its ID to darcs pull --match 'hash ...id...', then git adds the changes on the Git side and git commits them with the appropriate commit message. The patches are pulled in the order in which they were applied in the source repo, so any dependencies should be fulfilled. Nevertheless, Darcs refused to apply some patches -- silently. Darcs just determined that I didn't want to pull any patches and didn't do anything. This is most likely a Darcs bug, but I heard it was only a known bug for some development version of Darcs 2 (I used Darcs 1.0.9 at that time). Anyway, that didn't work; it failed at about patch 30 of the GHC repository.
  • OK. So instead of pulling patches by ID we could fake user interaction. Something like this:
    $ echo "yd" | darcs pull source-repo
    The input corresponds to "Yes, I want to pull this patch" and "Ok, I'm done and want to pull all the selected patches". This works reliably and also has the advantage that we don't have to read the whole history up front but instead can just retrieve the info for the last applied patch via
    darcs changes --last 1 --xml
  • By now you might have guessed, though, that this still didn't work very well. It took about 60 seconds per patch (with about 1 second of this used by Git), resulting in an estimated 13 days(!) of CPU time for the full repository.
  • Interestingly, most of those 60 seconds are spent before any patch choice is displayed, so apparently Darcs is doing something to calculate which patches to show. After that, displaying more choices is relatively quick. Apparently, the startup time depends on the number of patches not yet pulled.

This leads to the following trick.

We use two intermediate repositories. The first one pulls several patches at a time from the source repository. I use 15 patches, i.e.:

$ cd /tmp/ghc.pull
$ echo "yyyyyyyyyyyyyyyd" | darcs pull /path/to/ghc
We could now import from this intermediate repository into Git, since the startup time to pull from this repo is much lower. However, we'd like to start pulling the next 15 patches into the temporary repository right away. Pulling from and into the same repo at the same time doesn't work (Darcs locks the repo), so we also need to mirror this temporary repository. A cp -r would work, but as the repository grows larger, this would do unnecessary work. So I just pull all the changes at once.
$ cd /tmp/ghc.pull_mirror
$ darcs pull --all /tmp/ghc.pull  # this is pretty quick now
Now we can import into our git mirror from there, and already start pulling the next 15 patches (proper term for this is "macro pipelining", I believe).
$ cd /path/to/ghc.git
$ ./darcs-to-git /tmp/ghc.pull_mirror &  # run in background
$ cd /tmp/ghc.pull
$ echo "yyyy..." # etc
Of course, before pulling from the first mirror into the second mirror we have to make sure that darcs-to-git has finished pulling from the second mirror. I have implemented this as a shell script on top of darcs-to-git, but I may move it into darcs-to-git at some point.
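The steps above can be put together roughly as follows. This is only a minimal sketch, not the actual script: the paths are the placeholders used in this post, and the names batch_answers and import_round are made up for illustration. It also assumes that both intermediate repositories already exist (for example, created with darcs init).

```shell
#!/bin/sh
# Sketch only: paths, BATCH and both function names are placeholders.
SRC=/path/to/ghc              # source Darcs repository
STAGE=/tmp/ghc.pull           # first intermediate repository
MIRROR=/tmp/ghc.pull_mirror   # second intermediate repository
GIT=/path/to/ghc.git          # Git repository that holds ./darcs-to-git
BATCH=15                      # patches pulled per round

# Print N "y" answers followed by "d", e.g. "yyyd" for N=3.
batch_answers() {
    n=$1
    while [ "$n" -gt 0 ]; do
        printf y
        n=$((n - 1))
    done
    echo d
}

# One round of the pipeline.
import_round() {
    # Import whatever the mirror already holds into Git, in the background.
    (cd "$GIT" && ./darcs-to-git "$MIRROR") &
    import_pid=$!

    # Meanwhile, pull the next batch from the source into the staging repo.
    (cd "$STAGE" && batch_answers "$BATCH" | darcs pull "$SRC")

    # The mirror must not change while darcs-to-git is still reading it.
    wait "$import_pid"

    # Now bring the mirror up to date with the staging repo.
    (cd "$MIRROR" && darcs pull --all "$STAGE")
}
```

Calling import_round repeatedly until the source has no more patches to offer gives the staged import. Detecting that end condition is left out of the sketch.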

My fork of darcs-to-git as well as Steve's main repo are both available at Github. I haven't pushed all of my local changes yet, but I plan to implement pulling Darcs patches "interactively" as an option for darcs-to-git, so maybe check the repo in a week or two.

With this approach I am down to about 200 seconds per 15 patches, or about 68 hours for the 18500 patches of the GHC repo, which is just below the promised three days. (Of course, YMMV.)
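For anyone who wants to check the numbers, here is the back-of-the-envelope arithmetic as a small shell sketch. It uses only the rough figures quoted in this post (18500 patches, 60 seconds per single patch, 200 seconds per batch of 15); total_seconds is just an illustrative helper name, and the integer division rounds down.

```shell
# Rough estimates from the figures quoted in this post.
PATCHES=18500

# Whole seconds needed at SECS seconds per BATCH patches (rounded down).
total_seconds() {
    secs=$1
    batch=$2
    echo $(( PATCHES * secs / batch ))
}

one_by_one=$(total_seconds 60 1)    # 1110000 seconds
batched=$(total_seconds 200 15)     # 246666 seconds

# 12 full days, i.e. just under 13
echo "one by one: $(( one_by_one / 86400 )) days"
# 68 full hours, i.e. just under three days
echo "batched:    $(( batched / 3600 )) hours"
```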

So the moral of this story? Darcs is very slow for biggish repositories, especially for rarely used border cases (such as pulling patches one by one). It may be possible to fix them, but I doubt that this will be easy. I tried using the new hashed format and the darcs-2 format, but converting the GHC repo didn't work for me. I certainly hope that things get better, and I plan to help at least a little by submitting several bug reports in the next couple of days about the problems I ran into in the past days. Let's see what happens.

Oh, and Darcs needs a killer-app like Github!

17 comments:

Anonymous said...

Zomg, this is just, well, wow is all I can say. Everyone where I work thinks git is great. I guess I should try it.

Anonymous said...

I kind of feel we should just convert the GHC repository to Git outright and be done with it. This would resolve the performance problems at a stroke and mean that we're relying on a source control system that is much more actively maintained.

The only problem is that it's a bit of a shame to abandon the Haskell-based approach :-)

kowey said...

Did you mean bug reports for darcs? If so, we're looking forward to them! I've always had a suspicion that mixed in with the genuine performance problems (algorithm stuff), there are some silly issues lurking around which hurt performance, and which can be fixed at little cost. Hopefully your reports will help us to uncover them.

The honey monster said...

It seems that some form of bracketing operation in DARCS may have been easier to implement.

Everyone seems to be very pro GIT these days, but DARCS remains the technology pioneer in this field and it is imperative that rather than look through rose tinted spectacles at the competition we might all benefit from a commitment to make the tool better.

Haskell is receiving much criticism for its poor performance, DARCS needs to continue to improve as a means to deflect these criticisms.

BTW, even in jest, never suggest that GHC should move to GIT, you are just arming the doubters with all the ammunition they need to convince the ignorant that haskell is not a viable choice.

Dan Nugent said...

honey master: Doubtful. Darcs speed issues and other corner cases seem to come more from its dedication to the theory of patches than the implementation language. Not that I'm criticizing David's theory, I think Moore's law will eventually prove it to be the superior DVCS strategem, but I think if someone decided to clone git in Haskell, the performance difference wouldn't be noticeable.

james.d.s said...

Darcs' Theory of Patches is a badly-implemented solution waiting for a problem. Auto-commuting patches solves no problems whatsoever. The only valuable context of a patch, is the state of the entire repository at the time of the patch /not/ the minimum set of dependencies required to satisfy a merge, as in Darc's world view. So it has this neat merge algorithm - big deal - it has no other redeeming features. Even its supposedly easy to use interface is a complete lemon when you have to detangle yourself out of a web of conflicts because it is so /opaque/.

I have absolutely no issue with Haskell as a language, but if the Haskell community is looking to use Darcs for versioning GHC to save face/and or political reasons, then they are backing a lemon and will only look foolish in the end.

I am sure you could write a decently performing Git clone in Haskell. The problem is not Haskell - it's Darcs's algorithms and complexity. Beautiful on the outside, butt-ugly on the inside.

james.d.s said...

Moore's Law may not solve Darcs's algorithm problems. You see, the running time of some of the algorithms is exponential based on the number of patches involved. Moore's Law states that there is a mere doubling of CPU power every 18 months. In other words, Moore's Law is polynomial, and your CPU power won't increase fast enough to deal with an exponential algorithm. Unless the edge cases that cause the exponential behavior can be avoided, Darcs is boned.

Thomas Schilling said...

@james.d.s:

Moore's law *is* exponential. The supercomputer Top 500 list also has a logarithmic scale.

But, yes, I don't believe that Darcs' issues can be solved by Moore's law alone. But the Darcs developers aren't sitting around and waiting. They are rather few, though, so they have to set priorities. The exponential runtime problem is already solved. Some other performance problems may be solvable with more caching or more incremental algorithms. There may be some things that may be architecturally harder, but I don't think it's fundamentally flawed.

I've been learning Git in the last few weeks. It is indeed very fast, but many things are also more complicated. Tools have many many parameters and it takes a while to figure out which ones you need for your regular work. When you encounter a new problem, you will have to dig around a while to find the proper steps. OTOH, there's a lot of good documentation out there and my experience with the #git IRC channel was a very good one. Both tools have their strengths and weaknesses.

@honey monster:

I don't think GHC will switch to Git soon, but having a mirror of the Darcs repo in Git would allow fast annotate and bisect, and would enable the GHC trac to have a source and history browser. I don't know of any plans to move away from Darcs anytime soon, but it would be stupid to ignore that there are problems with Darcs and not look at whether we can do something to work around them until they are fixed. Just sticking with Darcs and refusing to consider anything else just because Darcs is written in Haskell is foolish. Haskell is faster than Python and people don't complain (much) about Mercurial performance, so if anyone thinks "it's because of Haskell" they are just plain ignorant.

james.d.s said...

Thomas, you are correct. I'll put that comment about Moore's Law down to a momentary brain fart. Instead of math, I should stick to what I am good at...

I do however stick to my guns about Darc's Theory of Patches to be useless. You don't need it. If I need to get a Darcs repository into the state it was in when a patch was recorded, I cannot do it in the general case, because the context of a patch only includes the minimum amount of changes required to satisfy the application of the patch. In other words, the direct and transitive dependencies of the patch.

The only time I can do it successfully is if I have a tag, or the patch was created in my repo, in which case the entries in the inventory log *might* be in the same order I applied them, and I can unpull everything upto the state I want.

AFAIK the exponential runtime problem is not solved. Two issues I reported in January are still outstanding. I can get darcs to die (painfully slowly) in at least three different ways, using get, pull or convert (latest source) on a large repo at work.

Unfortunately I cannot furnish the darcs developers with examples. We have since moved to Git. All of our old history in Darcs may as well have been sent to /dev/null. Anything darcs attempts to do (annotate, diff etc) take so long it's useless, or it runs out of RAM on a 64 bit machine with 8GB. The repo is only about half a GB in size on disk.

Perhaps now you can see why I think the Theory of Patches is so flawed. I don't think it will ever be fixed.

I intend to write up a blog post addressing these concerns in more detail, and propose an alternative 'patch commutation' algorithm to what Darcs uses that doesn't suffer from these exponential issues. It simply boils down to tracking a hunk's dependencies on other hunks *explicitly*, rather than depending on a fancy algorithm to calculate it after the fact.

Any two hunks that reference the same ancestor hunk, and the same region of lines within a hunk conflict, and thus the patches that contain those hunks also conflict.

After that, you are left with dependency resolution and 'make' has been doing that for decades.

Anyway, I have gone off topic. But yeah, Darcs is the two sandwiches short of a picnic DVCS, and Git is a great shining beacon for all wannabe DVCS's to aspire to ;0)

kowey said...

For the interested, the thread James is referring to is this one. Are there others, James? Thanks!

Gour said...

Hi!

the approach used by darcs-to-git (and, it seems, also by Tailor) doesn't work!

Yesterday, while evaluating bzr, I tried to convert some linear repos from darcs --> bzr via git.

Tried both darcs-to-git and darcs2git and they both failed miserably in converting something like darcsum repo (see http://chneukirchen.org/repos/darcsum/).

Otoh, using latest tailor, the task was finished very quickly.

I know that the post is about converting GHC repo, but some readers might get a feeling that tailor is not capable for their use, although it might be.

Sincerely,
Gour

lele said...

Hi, thanks for the trick for speeding up the migration; I added the recipe to Tailor's README. Tailor was able to transpose the whole darcs history to bazaar, approx 5800 patches, without a glitch, taking approx 3-4 hours.

Steve Purcell said...

Gour - "failed miserably" seems an inaccurate description.

I just successfully ran darcs-to-git against the darcsum repository, and everything went very smoothly, converting all the darcs patches in a couple of minutes.

Let me know if I'm missing something, and I'll look into fixing it! :-)

Gour said...

Hi!

I'm sorry, but 'darcs-to-git' "failed miserably" 'cause, out of my own stupidity, I used it with bzr's fast-import plugin, not understanding that it is not a drop-in replacement for darcs2git.

However, today I tried to convert darcs-2 repo and it fails with:

gour@nitai ~/r/g/darcs-bzr> ruby ../darcs-to-git/darcs-to-git ../../darcs/darcs.net/
Running: ["darcs", "-v"]
Initialising the working area.
Running: ["darcs", "init"]
Running: ["git-init"]
Initialized empty Git repository in .git/
Running: ["darcs", "changes", "--reverse", "--repodir=../../darcs/darcs.net/", "--xml", "--summary"]
../darcs-to-git/darcs-to-git:323:in `darcs_date_to_git_date': Wrong darcs date format (RuntimeError)
from ../darcs-to-git/darcs-to-git:220:in `initialize'
from ../darcs-to-git/darcs-to-git:248:in `new'
from ../darcs-to-git/darcs-to-git:248:in `read_from_repo'
from ../darcs-to-git/darcs-to-git:244:in `map'
from ../darcs-to-git/darcs-to-git:244:in `read_from_repo'
from ../darcs-to-git/darcs-to-git:408


So, atm, it looks like tailor is the only player in the field to perform such tasks.

Any idea?

Sincerely,
Gour

Thomas Schilling said...

Gour, that is a weird bug. The reason is probably that darcs used to use a different format for time stamps in earlier versions. I'll take a look at it. I also found a couple of other bugs which I have to fix properly and push them to the online repo. I hope to get to this next weekend.

Gour said...

The reason is probably that darcs used to use a different format for time stamps in earlier versions. I'll take a look at it. I also found a couple of other bugs which I have to fix properly and push them to the online repo. I hope to get to this next weekend.

Right.

Lele (tailor's author) fixed it in tailor.

otoh, darcs-to-git fails with darcs-2 repo format as well.


Sincerely,
Gour

Gour said...

Hi Steve,

Let me know if I'm missing something, and I'll look into fixing it! :-)

Here https://bugs.launchpad.net/bzr-fastimport/+bug/232177/ you can see my problems with darcs2git.


Sincerely,
Gour