PS: Greetings from the second Hackathon 2007.
Friday, October 05, 2007
New Haskell Tutorial
Monday, May 21, 2007
Network.HTTP + ByteStrings
Update: I mixed some numbers. I wrote about 375 MB, but it were 175 MB. (Noone seemed to have noticed though. Anyways, the argument still holds.)
Haskell's Network.HTTP package isn't quite as good as it could be. Well, to be precise, it is not at all as good as it should be. In addition to API problems (for which I proposed a solution in my previous blog entry) there's also a major performance problem, due to strictness and use of regular list-basedString
s. A simple wget
-style program written in Haskell used like ./get http://localhost/file.big
on a local 175 MB file almost locked my 1GB laptop due to constant swapping. I had to kill it, as it was using up more than 500 MB of RAM (still swapping). At this point it had run for 50 seconds at had written not a single byte to the output file. At the same time a normal wget
completed after abound 10 seconds. Since the file was retrieved from a local server I assume overall performance was inherently limited by disk speed (or the operating system's caching strategy).
The current implementation performed so badly for two reasons:
- Since it uses list-based strings each retrieved byte will take up (at least) 8 byte in program memory (one cons cell, or tag + data + pointer to tail).
- It implements custom, list-based buffering. The buffer size is 1000 characters/bytes, which is rather OK for line-based reading, but if the HTTP protocol requests to read a large block of data, this block will be read in 1000 byte chunks and then be appended to the part that has alrady been read. So if we read a block of 8000 bytes, the first block will be read and consequently be copied 8-times(!). Let's not think about reading a block of 175000000 bytes. Also because we already know the answer.
Program | Runtime | Memory Use |
---|---|---|
wget | ~10s | ~0.5MB |
./get using strict ByteStrings | ~18s | ~175MB |
./get using lazy ByteStrings | ~11s | ~3MB |
Network.HTTP
already implements a simple Stream abstraction with a simple interface:
class Stream x where readLine :: x -> IO (Result String) readBlock :: x -> Int -> IO (Result String) writeBlock :: x -> String -> IO (Result ()) close :: x -> IO ()Implementing this for strict ByteStrings is just a matter of calling the corresponding functions from the ByteStrings module. With one small annoyance: The HTTP parsing functions expect
readLine
to return the trailing newline, which hGetLine
does not include, so we have to append it manually, which in turn is an O(n) operation.
For simplicity, we also didn't convert the header parsing and writing functions to use ByteStrings, but instead inserted the appropriate calls to pack
and unpack
. This could become a performance bottleneck if we have many small HTTP requests. OTOH, we might soon have a Parsec version that works on ByteStrings.
As could be seen from the above benchmarks, using strict ByteStrings still forces us to completely load a packet into memory before we can start using it, which may result in unnecessary high memory usage. The obvious solution to this problem is to use lazy ByteStrings.
For lazy ByteStrings things work a bit differently. Instead of calling hGet
and hGetLine
inside the stream API, we call hGetContents
when we open the connection. This gives us a lazy ByteString which we store in the connection object and then use regular list functions on that string to implement the required API.
openTCPPort uri port = do { s <- socket AF_INET Stream 6 -- [...] ; h <- socketToHandle s ReadWriteMode ; bs <- BS.hGetContents h -- get the lazy ByteString ; bsr <- newIORef bs -- and store it as an IORef ; v <- newIORef (MkConn s a h bsr uri) ; return (ConnRef v) } readBlock c n = readIORef (getRef c) >>= \conn -> case conn of ConnClosed -> return (Left ErrorClosed) MkConn sock addr h bsr host -> do { bs <- readIORef bsr ; let (bl,bs') = BS.splitAt (fromIntegral n) bs ; writeIORef bsr bs' ; return $ Right bl } readLine c = readIORef (getRef c) >>= \conn -> case conn of ConnClosed -> return (Left ErrorClosed) MkConn sock addr h bsr host -> do { bs <- readIORef bsr ; let (l,bs') = BS.span (/='\n') bs ; let (nl,bs'') = BS.splitAt 1 bs' ; writeIORef bsr bs'' ; return (Right (BS.append l nl)) -- add '\n' } `Prelude.catch` \e -> [...]There are two main problems with this implementation, though:
- ByteStrings currently only work on handles not on sockets. Thus we have to turn sockets into handles using
socketToHandle
which, according to the source code linked from the Haddock documentation will fail if we're in a multithreaded environment. (search for "PARALLEL_HASKELL" in Network.Socket's source. - Furthermore, after converting a socket to a handle we should no longer use this socket. So we can't change any settings of the socket, but close it by calling
hClose
on the handle. HTTP allows the user to specify whether a socket should be closed after the response has been received. This is a bit more tricky when we use lazy ByteStrings since our request function will return immediately with a lazy ByteString as a result but no data has been read (except, maybe, on block). We thus must not close the socket right away, but only after all its contents have been read. So we must rely onhGetContents
to close our handle (and thus socket) -- which is does not! From recent #haskell comments this seems to be bug. In any case though we'd want to be able to specify the behavior, as we might as well keep the socket open.
darcs get http://www.dtek.chalmers.se/~tox/darcs/http
and for the strict version
darcs get http://www.dtek.chalmers.se/~tox/darcs/http-strict
If you're interested you can take a look at our project page.
Monday, May 07, 2007
Towards Better Error Handling
A while ago Eric Kidd wrote a rant about inconsistent error reporting mechanisms in Haskell. He found eight different idioms, none of which were completely satisfying. In this post I want to propose a very simple but IMO pretty useful and easy-to-use scheme, that works with standard Haskell.
The Haskell HTTP Package is a good test case for such scheme. The most immediate requirements are:
- It should work from within any monad (not just
IO
). - It should be possible to catch and identify any kind of error that happened inside a call to a library routine.
- It should be possible to ignore the error-handling (e.g., for simple scripts that just die in case of error)
So far, the public API functions mostly have a signature like
type Result a = Either ConnError a simpleHTTP :: Request -> IO (Result Response)
This requires C-style coding where we have to check for an error after each call. Additionally, we might still get an IOException
, and have to catch it somewhere else (if we want to). A simple workaround is to write a wrapper function for calls to the HTTP API. For example:
data MyErrorType = ... | HTTPErr ConnError | IOErr IOException instance Error MyErrorType where noMsg = undefined -- who needs these anyways? strMsg _ = undefined instance MonadError MyErrorType MyMonad where ... -- | Perform the API action and transform any error into our custom -- error type and re-throw it in our custom error type. ht :: IO (Result a) -> MyMonad a ht m = do { r <- io m ; case r of Left cerr -> throwError (HTTPErr cerr) Right x -> return x } -- | Perform an action in the IO monad and re-throw possible -- IOExceptions as our custom error type. io :: IO a -> MyMonad a io m = do { r <- liftIO $ (m >>= return . Right) `catchError` (\e -> return (Left e)) ; case r of Left e -> throwError (IOErr e) Right a -> return a }
We defined a custom error type, because we can have only one error type per monad. Exceptions in the IO monad and API error messages are then caught immediately and wrapped in our custom error type.
But why should every user of the library do that? Can't we just fix the library? Of course we can! Now, that we have a specific solution we can go and generalize. Let's start by commenting out the type signatures of ht
and io
and ask ghci
what it thinks about the functions' types:
*Main> :t io io :: (MonadIO m, MonadError MyErrorType m) => IO a -> m a *Main> :t ht ht :: (MonadIO t, MonadError MyErrorType t) => IO (Either ConnError t1) -> t t1
Alright, this already looks pretty general. There's still our custom MyErrorType
in the signature, though. To fix this we apply the standard trick and use a type class.
data HTTPErrorType = ConnErr ConnError | IOErr IOException -- | An instance of this class can embed 'HTTPError's. class HTTPError e where fromHTTPError :: HTTPErrorType -> e
Our wrapper functions now have a nice general type, that allows us to move them into the library.
throwHTTPError = throwError . fromHTTPError ht :: (MonadError e m, MonadIO m, HTTPError e) => IO (Result a) -> m a ht m = do { r <- io m ; case r of Left cerr -> throwHTTPError (ConnErr cerr) Right a -> return a } -- | Perform an action in the IO monad and re-throw possible -- IOExceptions as our custom error type. io :: (MonadError e m, MonadIO m, HTTPError e) => IO a -> m a io m = do r <- liftIO $ (m >>= return . Right) `catchError` (\e -> return (Left e)) case r of Left e -> throwHTTPError (IOErr e) Right a -> return a
After wrapping, all exported functions will have a signature of the form:
f :: (MonadError e m, MonadIO m, HTTPError e) => ... arguments ... -> m SomeResultType
Now the user is free to choose whichever monad she wants (that allows throwing errors and I/O). The only added burden is for the user to specify how to embed a HTTPError
in the respective error type of the monad. We can already specify the instance for IO
, though.
instance HTTPError IOException where fromHTTPError (IOErr e) = e fromHTTPError (ConnErr e) = userError $ show e
This way, our modified API works nicely out of the box whenever we just use the IO
monad and we can use it in our custom monad by writing only one simple instance declaration.
data MyErrorType = ... | HTTPErr HTTPErrorType instance HTTPError MyErrorType where fromHTTPError = HTTPErr test1 req = do { r <- simpleHTTP req ; putStrLn (rspCode r) } `catchError` handler where handler (HTTPErr (ConnErr e)) = putStrLn $ "Connection error." handler (HTTPErr (IOErr e)) = putStrLn $ "I/O Error." handler _ = putStrLn $ "Whatever."
If we don't care about the error and thus don't want to implement the instance, we can still force our API to be in the IO
monad and thus reuse IOException
to embed possible HTTP errors.
test2 req = do { r <- liftIO $ simpleHTTP req ; putStrLn (rspCode r) }
I think this is a very simple but useful scheme. I already implemented this with a friend in the HTTP package—and it works (without -fglasgow-exts
).
In addition to the added type class, there is the further potential drawback that an IOException
will always be wrapped in an API-specific error type. So when a program uses more than one API that uses this scheme, an IOException
may be wrapped in either, which may or may not be what is desired. A more sophisticated system, that deals with this problem and provides additional features, is explain in Simon Marlow's paper "An Extensible Dynamically-Typed Hierarchy of Exceptions" (PDF).
Comments welcome.