Accidentally-Quadratic HashMaps

Posted on October 16, 2020 by Jack Kelly
Tags: coding, haskell, performance

TL;DR: Unless you’re careful, using foldMap to build HashMaps (from unordered-containers) can take O(n2) time. This can be avoided by:

  1. Building a Map (from containers) instead;

  2. Using foldr or foldl with insert; or

  3. Using foldMap to build an intermediate list [(k, v)] and then building the HashMap with fromList.

Shout-out to Luke Worth who found this out while we were chasing down performance problems.

foldMap

The foldMap function is a very convenient way to build up maps, especially if you use the newtype wrappers from monoidal-containers:

foldMap :: (Foldable t, Monoid m) => (a -> m) -> t a -> m

-- Specialise m ~ MonoidalMap k v:
foldMap
  :: (Foldable t, Eq k, Ord k, Monoid v)
  => (a -> MonoidalMap k v) -> t a -> MonoidalMap k v

You map each a to a map (usually with the singleton function), and the Monoid instance combines each map into a single final result. This is a convenient idiom, but it takes quadratic time if you try to use it to build HashMaps. Let’s look at why.

The Monoid instance for HashMap is implemented in terms of union (MonoidalHashMap uses unionWith (<>), but is similar). union has O(m + n) runtime, and the computation we get from our foldMap has the form:

m1 `union` (m2 `union` (m3 `union` (m4 `union` (...))))

Assume each m_i is a singleton map (size 1). Then the total cost of the unions will add up like this:

If you expand out the KN term, you will find a 1 + 2 + 3 + … + N sum, which shows that KN ∈ O(N2).

Why is Map Fine?

union for Map has running time O(m × log (n/m + 1)); m ≤ n. Since we’re dealing with singleton maps, we can say that m = 1 is the smaller side of the union, and the runtime simplifies to O(log n). From this, we can see that the runtime of all the unions will be O(nlog n).

What if I Need HashMap?

There are at least a couple of ways to turn a Foldable into a HashMap without hitting this performance trap:

  1. HashMap’s insert has O(log n) performance. Using it with the Foldable version of foldl or foldr will give you a HashMap that takes O(nlog n) time to build.

  2. If you’re not building singleton maps, you could try building an intermediate list of (k, v) pairs, and calling fromList on it. The Foldable instance for [] right-associates the applications of (<>), so concatenating singleton lists should be performant. HashMap’s fromList has O(nlog n) complexity, which is also fine.

Conclusion

Map can be surprisingly fast, and HashMap can be surprisingly slow. None of this says, “never use HashMap”, but I was surprised to see an idiom as natural as foldMap become a performance trap. This also isn’t the first time I’ve been able to get better performance out of Map — functions like fromDistinctAscList can exploit input structure for better performance — and I often find the O(log n) lookups quite acceptable. I’m left wondering whether good ol’ Map should be my default mapping type, unless I have a demonstrated need for HashMap’s features.

Previous Post
All Posts | RSS | Atom
Next Post
Copyright © 2020 Jack Kelly
Site generated by Hakyll (source)