Metscrape's Revenge

Posted on July 19, 2012 by Jack Kelly

I finished something I’d been meaning to write for quite a long time: an expanded version of metscrape. Once again, I chose Haskell because I wanted a well-behaved Windows build (can you believe the MinGW runtime doesn’t have strndup() and doesn’t support positional arguments in printf()?). It also gave me another excuse to stretch my skills: this time I had to slice up the HTML documents using HXT.

As with many Haskell libraries, HXT’s Haddock documentation is a twisty maze of types and functions with no obvious clues as to how they fit together. I remember thinking “Why can’t I just use XPath?”. (Answer: it’s in hxt-xpath, and you need to understand how HXT works anyway.)

HXT is built on the idea of composing tree transformations, expressed as arrows. I first got it working on the MetVUW scraper, where I needed to scrape one page to build the image URLs, and then moved on to the BOM’s weather forecasts. The Haskell wiki has a remarkably good introduction: it sets out the tree structure, then explores filters and filter combinators, which in turn motivate the introduction of arrows. Most importantly, it has several good, well-explained examples. I had spent some time working out how to slice out and reassemble the different parts of the forecast when something clicked.
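To give a flavour of the filter idea, here is a toy model in plain Haskell. It is not HXT's API and not the metscrape code: the `Node` type, the `>->` operator, `everywhere`, `absolutise`, the sample `page` and the `example.invalid` URL are all made up for illustration. The names `hasName` and `deep` are only loosely modelled on the HXT combinators of the same name. A filter takes one node and returns a list of results, so "no match" is the empty list and filters compose by feeding every result of one into the next.

```haskell
-- Toy stand-in for a parsed HTML tree (HXT's real XmlTree is richer).
data Node
  = Elem String [(String, String)] [Node]  -- tag name, attributes, children
  | Text String
  deriving (Show, Eq)

-- A filter maps one input to zero or more outputs.
type Filter a b = a -> [b]

-- Sequential composition: run f, then run g on each of f's results.
(>->) :: Filter a b -> Filter b c -> Filter a c
f >-> g = concatMap g . f

-- Select the direct children of an element.
children :: Filter Node Node
children (Elem _ _ cs) = cs
children (Text _)      = []

-- Keep the node only if it is an element with the given tag name.
hasName :: String -> Filter Node Node
hasName n e@(Elem m _ _) | n == m = [e]
hasName _ _ = []

-- Extract the value(s) of a named attribute.
attr :: String -> Filter Node String
attr k (Elem _ attrs _) = [v | (k', v) <- attrs, k' == k]
attr _ _ = []

-- Try f at this node; if it yields nothing, search the children instead.
deep :: Filter Node b -> Filter Node b
deep f n = case f n of
  [] -> concatMap (deep f) (children n)
  rs -> rs

-- Transformation rather than selection: rewrite every node, bottom-up.
everywhere :: (Node -> Node) -> Node -> Node
everywhere g (Elem n attrs cs) = g (Elem n attrs (map (everywhere g) cs))
everywhere g t = g t

-- Hypothetical helper: prefix every img src with a base URL.
absolutise :: String -> Node -> Node
absolutise base = everywhere bump
  where
    bump (Elem "img" attrs cs) =
      Elem "img" [(k, if k == "src" then base ++ v else v) | (k, v) <- attrs] cs
    bump n = n

-- Made-up sample document.
page :: Node
page =
  Elem "html" []
    [ Elem "body" []
        [ Elem "p" [] [Text "Latest charts"]
        , Elem "img" [("src", "chart1.gif")] []
        , Elem "div" [] [Elem "img" [("src", "chart2.gif")] []]
        ]
    ]

-- Selection only: every img src, wherever it sits in the tree.
imageSources :: [String]
imageSources = (deep (hasName "img") >-> attr "src") page

-- Transform the tree first, then run the same selection over the result.
absoluteSources :: [String]
absoluteSources =
  (deep (hasName "img") >-> attr "src") (absolutise "http://example.invalid/" page)
```

In this model `imageSources` is `["chart1.gif", "chart2.gif"]`, and `absoluteSources` is the same list with the base URL prepended to each entry. In HXT the filters are arrows rather than bare functions, which lets the same pipeline style also carry state and IO.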

I still don’t completely understand arrows, but I made it work. Being able to transform subtrees is a much more powerful model than simple selection, and it made the code a fair bit simpler. I went into the zone in a way that’s only happened a few times in recent memory, usually on the largest and most interesting programming assignments at uni. Most of a precious day off disappeared, having been turned into this code, and I couldn’t be happier about it.