Monday, May 21, 2007

Network.HTTP + ByteStrings

Update: I mixed some numbers. I wrote about 375 MB, but it were 175 MB. (Noone seemed to have noticed though. Anyways, the argument still holds.)

Haskell's Network.HTTP package isn't quite as good as it could be. Well, to be precise, it is not at all as good as it should be. In addition to API problems (for which I proposed a solution in my previous blog entry) there's also a major performance problem, due to strictness and use of regular list-based Strings. A simple wget-style program written in Haskell used like ./get http://localhost/file.big on a local 175 MB file almost locked my 1GB laptop due to constant swapping. I had to kill it, as it was using up more than 500 MB of RAM (still swapping). At this point it had run for 50 seconds at had written not a single byte to the output file. At the same time a normal wget completed after abound 10 seconds. Since the file was retrieved from a local server I assume overall performance was inherently limited by disk speed (or the operating system's caching strategy). The current implementation performed so badly for two reasons:
  • Since it uses list-based strings each retrieved byte will take up (at least) 8 byte in program memory (one cons cell, or tag + data + pointer to tail).
  • It implements custom, list-based buffering. The buffer size is 1000 characters/bytes, which is rather OK for line-based reading, but if the HTTP protocol requests to read a large block of data, this block will be read in 1000 byte chunks and then be appended to the part that has alrady been read. So if we read a block of 8000 bytes, the first block will be read and consequently be copied 8-times(!). Let's not think about reading a block of 175000000 bytes. Also because we already know the answer.
But let's not flame the original author(s). It's better than nothing and it gave me and my project partner Jonas an interesting project topic. So we decided to overcome the evil at its root and replace Strings using ByteStrings--this way we would get buffering for free. To give you a taste for what this accomplishes:
ProgramRuntimeMemory Use
wget~10s~0.5MB
./get using strict ByteStrings~18s~175MB
./get using lazy ByteStrings~11s~3MB
Adding strict ByteStrings was relatively straightforward. Network.HTTP already implements a simple Stream abstraction with a simple interface:
class Stream x where 
    readLine   :: x -> IO (Result String)
    readBlock  :: x -> Int -> IO (Result String)
    writeBlock :: x -> String -> IO (Result ())
    close      :: x -> IO ()
Implementing this for strict ByteStrings is just a matter of calling the corresponding functions from the ByteStrings module. With one small annoyance: The HTTP parsing functions expect readLine to return the trailing newline, which hGetLine does not include, so we have to append it manually, which in turn is an O(n) operation. For simplicity, we also didn't convert the header parsing and writing functions to use ByteStrings, but instead inserted the appropriate calls to pack and unpack. This could become a performance bottleneck if we have many small HTTP requests. OTOH, we might soon have a Parsec version that works on ByteStrings. As could be seen from the above benchmarks, using strict ByteStrings still forces us to completely load a packet into memory before we can start using it, which may result in unnecessary high memory usage. The obvious solution to this problem is to use lazy ByteStrings. For lazy ByteStrings things work a bit differently. Instead of calling hGet and hGetLine inside the stream API, we call hGetContents when we open the connection. This gives us a lazy ByteString which we store in the connection object and then use regular list functions on that string to implement the required API.
openTCPPort uri port = 
    do { s <- socket AF_INET Stream 6
       -- [...]
       ; h <- socketToHandle s ReadWriteMode
       ; bs <- BS.hGetContents h  -- get the lazy ByteString
       ; bsr <- newIORef bs       -- and store it as an IORef
       ; v <- newIORef (MkConn s a h bsr uri) 
       ; return (ConnRef v)
       }

readBlock c n =
        readIORef (getRef c) >>= \conn -> case conn of
          ConnClosed -> return (Left ErrorClosed)
          MkConn sock addr h bsr host ->
              do { bs <- readIORef bsr 
                 ; let (bl,bs') = BS.splitAt (fromIntegral n) bs
                 ; writeIORef bsr bs'  
                 ; return $ Right bl
                 }
   
    readLine c =
        readIORef (getRef c) >>= \conn -> case conn of
          ConnClosed -> return (Left ErrorClosed)
          MkConn sock addr h bsr host ->
              do { bs <- readIORef bsr
                 ; let (l,bs') = BS.span (/='\n') bs
                 ; let (nl,bs'') = BS.splitAt 1 bs'
                 ; writeIORef bsr bs''
                 ; return (Right (BS.append l nl)) -- add '\n'
                 }
        `Prelude.catch` \e -> [...]
There are two main problems with this implementation, though:
  • ByteStrings currently only work on handles not on sockets. Thus we have to turn sockets into handles using socketToHandle which, according to the source code linked from the Haddock documentation will fail if we're in a multithreaded environment. (search for "PARALLEL_HASKELL" in Network.Socket's source.
  • Furthermore, after converting a socket to a handle we should no longer use this socket. So we can't change any settings of the socket, but close it by calling hClose on the handle. HTTP allows the user to specify whether a socket should be closed after the response has been received. This is a bit more tricky when we use lazy ByteStrings since our request function will return immediately with a lazy ByteString as a result but no data has been read (except, maybe, on block). We thus must not close the socket right away, but only after all its contents have been read. So we must rely on hGetContents to close our handle (and thus socket) -- which is does not! From recent #haskell comments this seems to be bug. In any case though we'd want to be able to specify the behavior, as we might as well keep the socket open.
There are further issues to consider. E.g., can we rely on the operating system to buffer everything for us if we don't read it right away? I don't know the details, but I assume this is handled by some lower layer, possibly dropping packages and re-requesting them if necessary. That's just guessing though. Unfortunately, I will not have the time to work out these issues anytime soon, as I will be busy with my Google Summer of Code project (cabal configurations). There also is a SoC project to replace Network.HTTP with libcurl bindings, but it would probably be a good idea to still have a reasonable Haskell-only solution around. So if anyone wants to pick it up, you're welcome! You can get the sources for the lazy version with darcs get http://www.dtek.chalmers.se/~tox/darcs/http and for the strict version darcs get http://www.dtek.chalmers.se/~tox/darcs/http-strict If you're interested you can take a look at our project page.

10 comments:

Bryan O'Sullivan said...

Thomas, you can use ByteString with Sockets. See createAndTrim'.

simonmar said...

There's no problem with socketToHandle and threads. The __PARALLEL_HASKELL__ stuff is to do with GpH, not GHC's current multiprocessor support, which doesn't require recompiling the libraries.

Thomas Schilling said...

Simon, that's good news. We still need control whether to close the socket/handle after hGetContents completed. I'm also not sure about read errors. This is a more general problem of lazy bytestrings though, as they're essentially treated as a pure data structure. The simplest treatment would just end the lazy string there (like hGetContents does for handles).

Jonas Ã…dahl said...

Actually whenever hGetContents reads an EOF the handle should always be closed. This works for HTTP sessions with Keep-Alive connections too, because no EOF is received until the server closes the connection. As the response body is a bunch of merged chunks it wont continue reading after the chunk flow is done, and it should (not tested) be possible to continue using the session by sending another request.

One problem I can think of though is if even though "Connection: close" is specified in the header of the request the server doesn't close the connection. This will result in that hGetContents never reads an EOF and will never close the handle.

Anonymous said...

Regarding your open question on whether you can rely on the operating system to buffer everything in the event you do not start reading, the answer is that with a TCP socket, the socket's receive buffer in the host's kernel will eventually fill up, at which point TCP's flow control (google tcp window size) will kick in and notify the sender to stop transmitting until space is available again in the receive buffer, which only happens once the application programmer calls read(). On the other hand, if this were UDP, which does not have flow control, once the receive buffer fills, any additional data is discarded.

Now, with regards to lazy IO in general, this reminds me of a recent thread on the cafe mailing list regarding using lazy IO to lazily read the contents of a mail directory containing thousands of files. The net of that thread was simply that it was a bad idea because you have no control over when the open file descriptors are garbage collected. So, wouldn't the same be an issue here? If I wanted to download more than 1024 (my ulimit for open file descriptors on my system) web pages in parallel, I suppose using this approach of lazy IO would fail.

This then lead to Oleg's article on using a left-fold enumeration to strictly consume data but with the option to terminate consumption prematurely. Would using the left-enumerator by a better way to go in this case?

Thomas Schilling said...

I agree that laziness does have its issues and a more clever approach is definitely necessary, since after all this is a library which should work in as many ways of use as possible. I can't quite tell if your suggestion will work here, but from reading the first paragraphs of http://okmij.org/ftp/papers/LL3-collections-enumerators.txt (which I assume you were referring to), it does look promising.

Thanks, for your comment! I'll look into it.

BTW, would you happen to have a link to the haskell-cafe thread available (it has some problems with being indexed by search engines, IIUC).

Thomas Schilling said...

OK, I spent some time thinking about this approach. Now, treating the retrieved data as a collection leads to the question, What should be the elements of the collection be?

Word8 would be too low-level, since you might want to write them to a file for example.

Strict ByteStrings of some specified chunk size could work. I.e.

doRequest :: Request -> a -> (ByteString -> a -> m (Either a a)) -> a

This way the folded function could determine when to stop and may or may not store the chunks in the results.

However, this might complicates treating the data as a stream (e.g. for parsing it to a tree). We'd have to re-implement the abstractions provided by Data.ByteString.Lazy. So I tend to lend towards the following simple interface:

withHTTP :: (MonadIO m) => Request -> a -> (Response -> LazyByteString -> a -> m a) -> m a

with the restriction that the lazy bytestring has to be forced as much as necessary, since we'll close the socket after the handler terminates.

Anonymous said...

Here is a link to part of the thread where Oleg's left-fold enumerator is talked about:

http://article.gmane.org/gmane.comp.lang.haskell.cafe/20635

This email was also interesting (it's by Oleg) as it talks about some of the reasons why the left fold enumerator is useful (towards the bottom of the email):

http://article.gmane.org/gmane.comp.lang.haskell.cafe/20553/match=left+fold+file+resources

In any case, after I suggested this, I was also a bit stumped as to what the appropriate collection should be, but I'm a complete newbie so that is understandable. I was going to post a question to the mailing list asking how one would use the left-fold enumerator and the bytestring library together.

Anonymous said...

So when can we see this in Network.HTTP ? I'd love to have this for a current project.

Henning Thielemann said...

hGetContents should never close the file itself, because the caller of hGetContents may decide to not consume all the characters from hGetContents, and thus must close the file himself. Exceptions while reading from the file can be indicated by an asynchronous exception as provided by the explicit-exception package.