Frozen - zero cost initialization for immutable containers and various algorithms

MalBot · February 20, 2018, 2:30pm

Reminder

Basically, Frozen provides four containers:

frozen::set<T, N>, an immutable container that provides a subset of the std::set<T> API and contains exactly N unique elements.
frozen::unordered_set<T, N>, an immutable container that provides a subset of the std::unordered_set<T> API and contains exactly N unique elements.
frozen::map<K, V, N>, an immutable container that provides a subset of the std::map<K, V> API and contains exactly N unique elements.
frozen::unordered_map<K, V, N>, an immutable container that provides a subset of the std::unordered_map<K, V> API and contains exactly N unique elements.

And two searchers for string search algorithms:

frozen::knuth_morris_pratt_searcher<N>, a constexpr version of the Knuth Morris & Pratt algorithm initialization.
frozen::boyer_moore_searcher<N>, a constexpr version of the Boyer Moore algorithm initialization.

The library is available under a BSD-3 license on Github and tested on Linux, OSX and Windows.

String Searcher

The core idea behind frozen is that under some assumptions, runtime computations can be done at compile-time using the fantastic guarantees provided by the constexpr keywords.

Note

It's relatively rare to find performance guarentees in the C++ language. A large part of the coolness of the language relies on compiler efforts, and not on the standard. I find it astonishing.

There is a class of algorithms that also fit in the category I pay an initialization cost to get better performances, and most engineer hear about it during their studies: string search, and the famous Knuth Morris & Pratt algorithm. The global idea is build an helper table from the needle, that will make the lookup in the haystack faster. But one needs to pay for building the helper table, unless the needle is known at compile-time. In that case the construction can be done by the compiler and there's no reason not to use that property.

C++17 introduces the parametrized std::search(ForwardIterator first, ForwardIterator last, const Searcher& searcher ) algorithm, where the searcher implements whatever logic we want. So we implemented two common searcher and ported this std::search overload to C++14 inside the frozen namespace.

Here is the result of a small benchmark, where the haystack is not very long and the needles are words that either don't live in the haystack or near the end.

------------------------------------------------------------------
Benchmark                           Time           CPU Iterations
------------------------------------------------------------------
BM_StrFzSearchInBM                254 ns        254 ns    2765885
BM_StrStdSearchInBM               579 ns        579 ns    1182645
BM_StrFzSearchInKMP               405 ns        405 ns    1694983

Unsurprisingly the Knuth Morris & Pratt (KMP) version is not as fast as the Boyer Moore (BM) version. And the Frozen (Fz) versions are faster than the standard (Std) one, which takes into account the intiialization step which is free for the Frozen versions!

Note

The haystack and needle stuff actually comes from the man page of strstr(3), a very inspirational source :-)

A few Benchmarks

Using Google Benchmark, a simple benchmark has been setup. It can be found in the source code if you wish to reproduce. The benchmark is running on an Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz.

For container of strings (using the keywords of the C language as testbed), which is a disputable benchmark, we get the following:

------------------------------------------------------------------
Benchmark                           Time           CPU Iterations
------------------------------------------------------------------
BM_StrInFzSet                     378 ns        377 ns    1887396
BM_StrInStdSet                    641 ns        641 ns    1036938
BM_StrNotInFzSet                  297 ns        297 ns    2361108
BM_StrNotInStdSet                 545 ns        545 ns    1328919

That's good news! Frozen sets (FzSet) are roughly x1.7 faster than Standard sets (StdSet) when the key belongs to the set (StrIn), and when it does not (StrNotIn). All the keys used in the benchmark are small, this probably introduces a bias, but that's a relevant use case. And the reason why we get that speedup is explained in one of the section below.

------------------------------------------------------------------
Benchmark                           Time           CPU Iterations
------------------------------------------------------------------
BM_StrInFzUnorderedSet            381 ns        381 ns    1816196
BM_StrInStdUnorderedSet           913 ns        913 ns     758206
BM_StrNotInFzUnorderedSet         211 ns        211 ns    3239510
BM_StrNotInStdUnorderedSet        710 ns        710 ns     976333

In case of unordered sets, Frozen sets (FzSet) are roughly x2.4 faster than Standard sets (StdSet) when the key belongs to the set (StrIn). It's even x3.3 faster when the key is not in the set. That's great and related to the fact that the hashing function used for strings is much faster in the case of Frozen, thanks to the constexpr generative algorithm used to build the table. More about this in another section below :-)

Note

The same values are used for the two kinds of sets, and unsurprisingly, std::set is fatser than std::unordered_set for a small range of values. Less comparison leads to faster lookups :-)

Out of curiosity, the timings for sets of integers:

------------------------------------------------------------------
Benchmark                           Time           CPU Iterations
------------------------------------------------------------------
BM_IntInFzSet                      58 ns         58 ns    9341306
BM_IntInStdSet                    153 ns        153 ns    4561453
BM_IntNotInFzSet                   67 ns         67 ns   10391602
BM_IntNotInStdSet                 161 ns        161 ns    4324596
BM_IntInFzUnorderedSet            173 ns        173 ns    4065700
BM_IntInStdUnorderedSet           544 ns        544 ns    1268655
BM_IntNotInFzUnorderedSet         131 ns        131 ns    5281960
BM_IntNotInStdUnorderedSet        500 ns        500 ns    1272852

About Performance of frozen::set<>

Let's have a look to the code generated by Clang for the following C++ code:

#include <set>
#include <frozen/set.h>
bool test(std::set<int> const& some) {

return some.count(1);

}

bool test(frozen::set<int, 10> const& some) {

return some.count(1);

}

Compiled by clang++-5.0 some.cpp -std=c++14 -c -O3 -I$FROZEN_PREFIX/include, this results in the following assembly for the standard implementation:

(...)
lea    0x18(%rcx),%rdx
lea    0x10(%rcx),%rsi
cmpl   $0x0,0x20(%rcx)
cmovle %rdx,%rsi
cmovg  %rcx,%rax
mov    (%rsi),%rcx
test   %rcx,%rcx
jne    20 <test(std::set<int, std::less<int>, std::allocator<int> > const&)+0x20>
cmp    %rdi,%rax
je     50 <test(std::set<int, std::less<int>, std::allocator<int> > const&)+0x50>
cmpl   $0x1,0x20(%rax)
cmovg  %rdi,%rax
cmp    %rdi,%rax
setne  %al
retq
(...)

We can see two jumps, one that perform the while loop of the binary search, and one that chooses the left or right branch during the binary search. Nothing unusual.

The assembly generated for the Frozen implementation:

lea    0x4(%rdi),%rax
cmpl   $0x0,0x18(%rdi)
lea    0x1c(%rdi),%rcx
cmovle %rcx,%rax
cmpl   $0x0,0x8(%rax)
lea    0xc(%rax),%rcx
cmovg  %rax,%rcx
cmpl   $0x0,(%rcx)
lea    0x4(%rcx),%rax
cmovg  %rcx,%rax
add    $0x2c,%rdi
cmp    %rdi,%rax
je     97 <test(frozen::set<int, 10ul, std::less<int> > const&)+0x37>
cmpl   $0x2,(%rax)
setl   %al
retq

The while loop has disappeared, and that's normal because the size of the container is known at compile time, so the scanning can be unrolled. And the comparison is no longer done using a branch, but using conditional moves. And that's great for the instruction pipeline!

Out of curiosity, here is the C++ code that led to the above assembly:

template <class ForwardIt>
inline constexpr ForwardIt doit(ForwardIt first, std::integral_constant<std::size_t, 0>)
{
  return first;
}
template <class ForwardIt, std::size_t N>

inline constexpr ForwardIt doit(ForwardIt first, std::integral_constant<std::size_t, N>)

{

auto constexpr step = N / 2;

auto it = first + step;

auto constexpr next_count = (N&1) ? step : (N - (step + 1));

if (compare_(it, value_)) {

return doit(it + 1, std::integral_constant<std::size_t, next_count>{});

} else {

return doit(first, std::integral_constant<std::size_t, next_count>{});

}

}

Isn't that great? The Clang compiler has been able to turn a recursive function with an explicit branch to a branchless, call-less version :-).

About Performance of frozen::unordered_set<>

First, I have to thank Chris Beck for his contribution on that side. Basically I overlooked the impact of random choice in the algorithm described in the blogpost that inspired *Frozen* :-/

As a result, this version of Frozen fixes two issues:

Faster convergence: the random search of good parameter makes it faster to find a suitable candidate

Faster implementation: the newly used parametric hash is the FNV hash:

template <> struct elsa<string> {
  (...)
  constexpr std::size_t operator()(string value, std::size_t seed) const {
    std::size_t d = seed;
    for (std::size_t i = 0; i < value.size(); ++i)
      d = (d * 0x01000193) ^ value[i];
    return d;
  }
};

As a comparison, the default hash function for std::string uses the Murmur hash which is most probably a better function, but slower.

Conclusion

The Frozen library is (to my mind at least) a good example of how constexpr can be used to pay for the initialization cost of an algorithm or a container at compile time, and this changes the way we should think on algorithmic complexity on certain case. Who cares to pay more if it's not paid at runtime?

Thanks to QuarksLab for the supportive work, and to Chris Beck and Jérôme Dumesnil for their high quality reviews.

Article Link: http://blog.quarkslab.com/frozen-zero-cost-initialization-for-immutable-containers-and-various-algorithms.html