Reminder
Basically, Frozen provides four containers:
- frozen::set<T, N>, an immutable container that provides a subset of the std::set<T> API and contains exactly N unique elements.
- frozen::unordered_set<T, N>, an immutable container that provides a subset of the std::unordered_set<T> API and contains exactly N unique elements.
- frozen::map<K, V, N>, an immutable container that provides a subset of the std::map<K, V> API and contains exactly N unique elements.
- frozen::unordered_map<K, V, N>, an immutable container that provides a subset of the std::unordered_map<K, V> API and contains exactly N unique elements.
And two searchers for string search algorithms:
- frozen::knuth_morris_pratt_searcher<N>, a constexpr version of the Knuth Morris & Pratt algorithm initialization.
- frozen::boyer_moore_searcher<N>, a constexpr version of the Boyer Moore algorithm initialization.
The library is available under a BSD-3 license on Github and tested on Linux, OSX and Windows.
String Searcher
The core idea behind frozen is that under some assumptions, runtime computations can be done at compile-time using the fantastic guarantees provided by the constexpr keywords.
Note
It's relatively rare to find performance guarentees in the C++ language. A large part of the coolness of the language relies on compiler efforts, and not on the standard. I find it astonishing.
There is a class of algorithms that also fit in the category I pay an initialization cost to get better performances, and most engineer hear about it during their studies: string search, and the famous Knuth Morris & Pratt algorithm. The global idea is build an helper table from the needle, that will make the lookup in the haystack faster. But one needs to pay for building the helper table, unless the needle is known at compile-time. In that case the construction can be done by the compiler and there's no reason not to use that property.
C++17 introduces the parametrized std::search(ForwardIterator first, ForwardIterator last, const Searcher& searcher ) algorithm, where the searcher implements whatever logic we want. So we implemented two common searcher and ported this std::search overload to C++14 inside the frozen namespace.
Here is the result of a small benchmark, where the haystack is not very long and the needles are words that either don't live in the haystack or near the end.
------------------------------------------------------------------ Benchmark Time CPU Iterations ------------------------------------------------------------------ BM_StrFzSearchInBM 254 ns 254 ns 2765885 BM_StrStdSearchInBM 579 ns 579 ns 1182645 BM_StrFzSearchInKMP 405 ns 405 ns 1694983
Unsurprisingly the Knuth Morris & Pratt (KMP) version is not as fast as the Boyer Moore (BM) version. And the Frozen (Fz) versions are faster than the standard (Std) one, which takes into account the intiialization step which is free for the Frozen versions!
Note
The haystack and needle stuff actually comes from the man page of strstr(3), a very inspirational source :-)
A few Benchmarks
Using Google Benchmark, a simple benchmark has been setup. It can be found in the source code if you wish to reproduce. The benchmark is running on an Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz.
For container of strings (using the keywords of the C language as testbed), which is a disputable benchmark, we get the following:
------------------------------------------------------------------ Benchmark Time CPU Iterations ------------------------------------------------------------------ BM_StrInFzSet 378 ns 377 ns 1887396 BM_StrInStdSet 641 ns 641 ns 1036938 BM_StrNotInFzSet 297 ns 297 ns 2361108 BM_StrNotInStdSet 545 ns 545 ns 1328919
That's good news! Frozen sets (FzSet) are roughly x1.7 faster than Standard sets (StdSet) when the key belongs to the set (StrIn), and when it does not (StrNotIn). All the keys used in the benchmark are small, this probably introduces a bias, but that's a relevant use case. And the reason why we get that speedup is explained in one of the section below.
------------------------------------------------------------------ Benchmark Time CPU Iterations ------------------------------------------------------------------ BM_StrInFzUnorderedSet 381 ns 381 ns 1816196 BM_StrInStdUnorderedSet 913 ns 913 ns 758206 BM_StrNotInFzUnorderedSet 211 ns 211 ns 3239510 BM_StrNotInStdUnorderedSet 710 ns 710 ns 976333
In case of unordered sets, Frozen sets (FzSet) are roughly x2.4 faster than Standard sets (StdSet) when the key belongs to the set (StrIn). It's even x3.3 faster when the key is not in the set. That's great and related to the fact that the hashing function used for strings is much faster in the case of Frozen, thanks to the constexpr generative algorithm used to build the table. More about this in another section below :-)
Note
The same values are used for the two kinds of sets, and unsurprisingly, std::set is fatser than std::unordered_set for a small range of values. Less comparison leads to faster lookups :-)
Out of curiosity, the timings for sets of integers:
------------------------------------------------------------------ Benchmark Time CPU Iterations ------------------------------------------------------------------ BM_IntInFzSet 58 ns 58 ns 9341306 BM_IntInStdSet 153 ns 153 ns 4561453 BM_IntNotInFzSet 67 ns 67 ns 10391602 BM_IntNotInStdSet 161 ns 161 ns 4324596 BM_IntInFzUnorderedSet 173 ns 173 ns 4065700 BM_IntInStdUnorderedSet 544 ns 544 ns 1268655 BM_IntNotInFzUnorderedSet 131 ns 131 ns 5281960 BM_IntNotInStdUnorderedSet 500 ns 500 ns 1272852
About Performance of frozen::set<>
Let's have a look to the code generated by Clang for the following C++ code:
#include <set> #include <frozen/set.h>bool test(std::set<int> const& some) {
return some.count(1);
}
bool test(frozen::set<int, 10> const& some) {
return some.count(1);
}
Compiled by clang++-5.0 some.cpp -std=c++14 -c -O3 -I$FROZEN_PREFIX/include, this results in the following assembly for the standard implementation:
(...) lea 0x18(%rcx),%rdx lea 0x10(%rcx),%rsi cmpl $0x0,0x20(%rcx) cmovle %rdx,%rsi cmovg %rcx,%rax mov (%rsi),%rcx test %rcx,%rcx jne 20 <test(std::set<int, std::less<int>, std::allocator<int> > const&)+0x20> cmp %rdi,%rax je 50 <test(std::set<int, std::less<int>, std::allocator<int> > const&)+0x50> cmpl $0x1,0x20(%rax) cmovg %rdi,%rax cmp %rdi,%rax setne %al retq (...)
We can see two jumps, one that perform the while loop of the binary search, and one that chooses the left or right branch during the binary search. Nothing unusual.
The assembly generated for the Frozen implementation:
lea 0x4(%rdi),%rax cmpl $0x0,0x18(%rdi) lea 0x1c(%rdi),%rcx cmovle %rcx,%rax cmpl $0x0,0x8(%rax) lea 0xc(%rax),%rcx cmovg %rax,%rcx cmpl $0x0,(%rcx) lea 0x4(%rcx),%rax cmovg %rcx,%rax add $0x2c,%rdi cmp %rdi,%rax je 97 <test(frozen::set<int, 10ul, std::less<int> > const&)+0x37> cmpl $0x2,(%rax) setl %al retq
The while loop has disappeared, and that's normal because the size of the container is known at compile time, so the scanning can be unrolled. And the comparison is no longer done using a branch, but using conditional moves. And that's great for the instruction pipeline!
Out of curiosity, here is the C++ code that led to the above assembly:
template <class ForwardIt> inline constexpr ForwardIt doit(ForwardIt first, std::integral_constant<std::size_t, 0>) { return first; }template <class ForwardIt, std::size_t N>
inline constexpr ForwardIt doit(ForwardIt first, std::integral_constant<std::size_t, N>)
{
auto constexpr step = N / 2;
auto it = first + step;
auto constexpr next_count = (N&1) ? step : (N - (step + 1));
if (compare_(it, value_)) {
return doit(it + 1, std::integral_constant<std::size_t, next_count>{});
} else {
return doit(first, std::integral_constant<std::size_t, next_count>{});
}
}
Isn't that great? The Clang compiler has been able to turn a recursive function with an explicit branch to a branchless, call-less version :-).
About Performance of frozen::unordered_set<>
First, I have to thank Chris Beck for his contribution on that side. Basically I overlooked the impact of random choice in the algorithm described in the blogpost that inspired *Frozen* :-/
As a result, this version of Frozen fixes two issues:
-
Faster convergence: the random search of good parameter makes it faster to find a suitable candidate
-
Faster implementation: the newly used parametric hash is the FNV hash:
template <> struct elsa<string> { (...) constexpr std::size_t operator()(string value, std::size_t seed) const { std::size_t d = seed; for (std::size_t i = 0; i < value.size(); ++i) d = (d * 0x01000193) ^ value[i]; return d; } };
As a comparison, the default hash function for std::string uses the Murmur hash which is most probably a better function, but slower.
Conclusion
The Frozen library is (to my mind at least) a good example of how constexpr can be used to pay for the initialization cost of an algorithm or a container at compile time, and this changes the way we should think on algorithmic complexity on certain case. Who cares to pay more if it's not paid at runtime?
Thanks to QuarksLab for the supportive work, and to Chris Beck and Jérôme Dumesnil for their high quality reviews.
Article Link: http://blog.quarkslab.com/frozen-zero-cost-initialization-for-immutable-containers-and-various-algorithms.html