TL;DR: Unless you’re careful, using
foldMap
to build HashMap
s (from unordered-containers
)
can take O(n2) time. This
can be avoided by:
Building a Map
(from containers
)
instead;
Using foldr
or foldl
with
insert
; or
Using foldMap
to build an intermediate list
[(k, v)]
and then building the HashMap
with
fromList
.
Shout-out to Luke Worth who found this out while we were chasing down performance problems.
foldMap
The foldMap
function is a very convenient way to build
up maps, especially if you use the newtype wrappers from monoidal-containers
:
foldMap :: (Foldable t, Monoid m) => (a -> m) -> t a -> m
-- Specialise m ~ MonoidalMap k v:
foldMap
:: (Foldable t, Eq k, Ord k, Monoid v)
=> (a -> MonoidalMap k v) -> t a -> MonoidalMap k v
You map each a
to a map (usually with the
singleton
function), and the Monoid
instance
combines each map into a single final result. This is a convenient
idiom, but it takes quadratic time if you try to use it to build
HashMap
s. Let’s look at why.
The Monoid
instance for HashMap
is
implemented in terms of union
(MonoidalHashMap
uses unionWith (<>)
,
but is similar). union
has O(m+n) runtime,
and the computation we get from our foldMap
has the
form:
`union` (m2 `union` (m3 `union` (m4 `union` (...)))) m1
Assume each m_i
is a singleton map (size 1). Then the total cost of the
union
s will add up like this:
K1, the cost
of the innermost m_(n-1) `union` m_n
, costs 1 + 1 = 2;
K2, the cost
of the two innermost union
s (m_(n-2) `union` (m_(n-1)
`union` m_n)
) costs 1 + 2 + 2 = 5, because the map costs 1 + 2 = 3 to build, plus another 2 for the innermost
union
;
K3, the cost
of the three innermost union
s (m_(n-3) `union`
(m_(n-2) `union` (m_(n-1) `union` m_n))
) costs 1 + 3 + 5 = 9, because we pay 1 + 3 = 4 to build the map, plus another
5 to build the union of the three final
entries;
Generally, the cost of the ith union, Ki, will cost 1 + i + Ki − 1.
If you expand out the KN term, you will find a 1 + 2 + 3 + … + N sum, which shows that KN ∈ O(N2).
Map
Fine?union
for Map
has running time O(m×log(n/m+1)); m ≤ n.
Since we’re dealing with singleton maps, we can say that m = 1 is the smaller side of the
union
, and the runtime simplifies to O(logn). From this, we can
see that the runtime of all the union
s will be O(nlogn).
HashMap
?There are at least a couple of ways to turn a Foldable
into a HashMap
without hitting this performance trap:
HashMap
’s insert
has O(logn)
performance. Using it with the Foldable
version of
foldl
or foldr
will give you a
HashMap
that takes O(nlogn) time to
build.
If you’re not building singleton maps, you could try building an
intermediate list of (k, v)
pairs, and calling
fromList
on it. The Foldable
instance for
[]
right-associates the applications of
(<>)
, so concatenating singleton lists should be
performant. HashMap
’s fromList
has O(nlogn)
complexity, which is also fine.
Map
can be surprisingly fast, and HashMap
can be surprisingly slow. None of this says, “never use
HashMap
”, but I was surprised to see an idiom as natural as
foldMap
become a performance trap. This also isn’t the
first time I’ve been able to get better performance out of
Map
— functions like fromDistinctAscList
can exploit input structure for better performance — and I often find
the O(logn) lookups
quite acceptable. I’m left wondering whether good ol’ Map
should be my default mapping type, unless I have a demonstrated need for
HashMap
’s features.