Dear Boost,
I have finally found some time to write up my experiments with
using transactional GCC and Intel TSX memory transactions to build an
improved concurrent_unordered_map. The quick tl;dr summary is this:
1. Transactional GCC is great if you are happy to accept up to a 50%
performance reduction in normal code execution. For a greenfield
multithreaded project this is absolutely worth it due to the enormous
savings in debugging effort.
2. First generation Intel TSX isn't worth using in almost any common
use case. The only common case where it makes sense is data which is
almost always read only yet *always* highly contended, where speedups
can approach linearity with execution cores; and even then, the
refactoring you need to do to avoid writes to any cache line touched
by any other thread will make your code slower than throwing
everything inside a standard spin lock. Spin locks are well
optimised, allow simple implementations, and are very fast for
uncontended use, whereas any use of Intel TSX (HLE or RTM) penalises
the uncontended case and seems to force a pipeline stall which can
reduce single threaded performance by as much as 40%.
3. Trying to implement atomics-based logic whereby the CPU selects
non-TSX code paths when uncontended and TSX paths when contended
comes out at best slightly slower than always using TSX. This, for
me, simply kills TSX v1.0; I no longer consider it worth thinking
about until Intel ships a v2.0.
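For the curious, the hybrid logic I benchmarked was broadly the
classic RTM-with-fallback-lock idiom, something like this sketch
(illustrative only, not my exact code; assumes RTM hardware and
compilation with -mrtm):

#include <immintrin.h>  // _xbegin/_xend/_xabort, needs -mrtm
#include <atomic>

std::atomic<bool> fallback_lock(false);  // illustrative spin lock

template<class F> void critical_region(F &&f)
{
  unsigned status = _xbegin();
  if (status == _XBEGIN_STARTED)
  {
    // Read the fallback lock inside the transaction so it subscribes
    // to that cache line; abort if another thread holds the lock.
    if (fallback_lock.load(std::memory_order_relaxed))
      _xabort(0xff);
    f();
    _xend();
    return;
  }
  // Transaction aborted or unsupported: spin on the fallback lock.
  while (fallback_lock.exchange(true, std::memory_order_acquire))
    ;
  f();
  fallback_lock.store(false, std::memory_order_release);
}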
For reference, for those wishing to repeat my experiments, the
history of my four attempts at a concurrent_unordered_map is at
https://github.com/ned14/boost.spinlock/commits/concurrent_unordered_map2
with benchmark results at
https://ci.nedprod.com/view/Boost%20Thread-Expected-Permit/job/Memory%20Transactions%20Test%20Linux%20GCC%204.8/.
So, onto my proposed new concurrent_unordered_map design, which is
NOT a split ordered list design as used by Intel TBB and Microsoft
PPL. I had four criteria for my concurrent_unordered_map design,
based on my needs for proposed Boost.AFIO:
1. erase() is safe. This is for me the big showstopper with split
ordered list designs, and I think it probably is for most people.
2. insert(), find(), and erase() all had to have single threaded
performance similar to a std::unordered_map<> with a spin lock around
it, i.e. fast.
3. Everything apart from rehashing should be able to be concurrent.
This includes iteration, inserting, finding, erasing, sizing.
4. Memory transaction friendly. This means transactional GCC right
now, but maybe some future Intel TSX v2.0 support later. This
criterion, incidentally, *requires* C++11 noexcept support, or at
least I can't see a way to avoid it otherwise (just to be clear, I am
claiming that for the implementation to _efficiently_ use memory
transactions it needs noexcept move construction for both the key and
the mapped type; see the sketch just after this list).
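To illustrate criterion 4 concretely (a sketch with illustrative
names only): an exception escaping a memory transaction either forces
serial irrevocable execution or is outright disallowed, so any item
relocated inside a transaction, e.g. during a capacity doubling, must
be noexcept move constructible, which one can enforce at compile
time:

#include <type_traits>

// Illustrative enforcement of criterion 4: items moved inside a
// memory transaction must not be able to throw.
template<class Key, class Mapped> struct transaction_requirements
{
  static_assert(std::is_nothrow_move_constructible<Key>::value,
                "key type must have noexcept move construction");
  static_assert(std::is_nothrow_move_constructible<Mapped>::value,
                "mapped type must have noexcept move construction");
};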
Designs tried, benchmarked, and rejected:
1. Items in a bucket were kept via an atomic forward pointer to the
next item, and one used cmpxchg to insert and remove items. This had
really great, even superb, performance for load factors near one, but
became pathological for badly distributed hash functions (I'm looking
at you, libstdc++). Once you rehashed to make the hash function
better distributed, performance nose dived. I was tempted to override
std::hash with a decent implementation ...
2. Items in a bucket were kept as a linear list of hashes and
atomic pointers to value_type's. Each atomic pointer's bottom bit was
its spin lock, to keep each item's footprint small, so you could lock
each individual item during use. The real problem here was how to
safely resize the table without a per-bucket lock, and that halved
performance relative to a spinlocked unordered_map, as transactions
really didn't deliver as expected here.
3. I created a specialisation used only when the mapped type is some
std::shared_ptr<>, which let me implement a completely lock free
design. Unfortunately it was also rather slow: about half the
performance of a spinlocked unordered_map. Atomic increments and
decrements of the same cache line across execution cores are slow :)
Final proposed design (comments welcome):
* item_type is allocated via a rebind of the supplied allocator, and
looks like:
struct item_type
{
  value_type p;
  size_t hash;
};
If value_type has a big sizeof(), a very good idea is to have the
above hold a unique_ptr to the value_type instead, as otherwise you
can't pack many item_type's into a cache line (AFIO's value_type is a
std::pair<...> which is 24 bytes, so item_type is 32 bytes, which
works well for my purposes).
* bucket_type looks like this:
struct bucket_type {
  spinlock<bool> lock;
  atomic<unsigned> count; // count is used items in there
  std::vector<item_type> items;
  char pad[64 - sizeof(spinlock<bool>) - sizeof(atomic<unsigned>)
           - sizeof(std::vector<item_type>)];
};
static_assert(sizeof(bucket_type) == 64,
              "bucket_type is not 64 bytes long!");
Note that items has its capacity directly managed, and is always
expanded in powers of two. The number of items actually stored can
float between zero and the capacity.
* And finally, concurrent_unordered_map itself merely contains:
template<class Key, class T, class Hash = std::hash<Key>,
         class Pred = std::equal_to<Key>,
         class Alloc = std::allocator<std::pair<const Key, T>>>
class concurrent_unordered_map {
  spinlock<bool> _rehash_lock;
  hasher _hasher;
  key_equal _key_equal;
  std::vector<bucket_type> _buckets;
};
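Usage would look just like std::unordered_map (hypothetical, as the
final API isn't written yet):

#include <string>
#include <utility>

int main()
{
  concurrent_unordered_map<std::string, int> map;
  map.insert(std::make_pair(std::string("apple"), 78)); // concurrent safe
  auto it = map.find("apple");                          // concurrent safe
  if (it != map.end())
    map.erase(it);                                      // concurrent safe
  return 0;
}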
* To insert or find an item involves determining the bucket from the
modulus of the hash by the number of buckets, locking the bucket, and
scanning the linear list for an empty slot or the right item as
appropriate. Empty slots are recognised by their hash being set to
some magic constant (advice on the magic hash value least likely to
turn up from std::hash is welcome). find() scans from the start of
the list forwards, while insert() scans from the end of the list
backwards, to help with cache line contention. As insert() needs to
check for key matches as well as for empty slots, the scan has to
happen anyway, but if we knew key collisions could never happen we
could detect a full table from the atomic count and skip the scan. If
insert() finds neither a match nor an empty slot, it appends the item
if there is spare capacity; if there is no capacity, a capacity
doubling is done and then the append. If a capacity doubling occurs,
all references to items in that bucket become invalid. I intend an
insert_stable() API which will refuse to modify capacity, plus a
bucket_reserve() API. A sketch of the find() path follows below.
insert() and find() are O(average bucket size (not count)).
On transactional GCC, find() and the initial search for key
equivalence in insert() are implemented as transactions without
locking the bucket; all modifications are done with the bucket
locked. This was found through trial and error to have the best
performance.
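In sketch form, the find() path looks something like the following
self-contained simplification (key and mapped types fixed to size_t
and int for brevity, the spinlock replaced by std::mutex, and the
magic empty hash value assumed to be (size_t)-1):

#include <atomic>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

static const size_t empty_hash = (size_t)-1;  // assumed magic value

struct item_type { std::pair<size_t, int> p; size_t hash; };

struct bucket_type
{
  std::mutex lock;                 // stands in for spinlock<bool>
  std::atomic<unsigned> count{0};  // used items in this bucket
  std::vector<item_type> items;
};

int *find(std::vector<bucket_type> &buckets, size_t key)
{
  size_t h = std::hash<size_t>()(key);
  // (a real implementation must ensure h never equals empty_hash)
  bucket_type &b = buckets[h % buckets.size()];
  // On transactional GCC this read-only scan would be a transaction
  // rather than taking the bucket lock.
  std::lock_guard<std::mutex> g(b.lock);
  for (auto &i : b.items)          // forwards; insert() scans backwards
    if (i.hash == h && i.p.first == key)
      return &i.p.second;
  return nullptr;
}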
* To erase an item is easy: iterators store an iterator to the bucket
and an offset into the linear table. Erasure involves constructing a
temporary value_type, locking the bucket, performing a *move*
construction out of the table into the temporary value_type, and
resetting the hash to empty. Destruction can then occur without
holding the bucket lock. If the item just erased is the last in the
bucket, the bucket's size (not capacity) is reduced by one, and that
is repeated until the bucket's size is the minimum possible.
I currently violate the STL allocator requirements here because I
don't use the allocator for the temporary value_type, to avoid a
usually pointless malloc. Any suggestions on how to fix this would be
welcome.
You'll note that erasure here is safe, as the item's storage never
moves; though if you concurrently use and erase the same item you'll
get a race onto a destroyed item, which is as expected.
erase() is O(1).
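Continuing the simplified sketch from find() above, erasure at a
known bucket and offset would look something like:

// Reuses item_type/bucket_type and empty_hash from the find() sketch.
void erase_at(bucket_type &b, size_t offset)
{
  std::pair<size_t, int> tmp;
  {
    std::lock_guard<std::mutex> g(b.lock);
    item_type &i = b.items[offset];
    tmp = std::move(i.p);                    // move out under the lock
    i.hash = empty_hash;                     // mark the slot empty
    b.count.fetch_sub(1, std::memory_order_relaxed);
    // If we just emptied the tail, shrink the size (not the capacity).
    while (!b.items.empty() && b.items.back().hash == empty_hash)
      b.items.pop_back();
  }
  // tmp is destructed here, outside the bucket lock.
}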
* size() under this design becomes O(bucket count), as we sum the
count members of each bucket. empty() is worst case O(bucket count)
when the map is empty, best case O(1) when the first bucket is not
empty, and average case O(bucket count / item count / 2).
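In the same simplified sketch, size() is just:

size_t size(const std::vector<bucket_type> &buckets)
{
  size_t n = 0;
  for (auto &b : buckets)                    // O(bucket count)
    n += b.count.load(std::memory_order_relaxed);
  return n;
}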
* Iteration under this design is queerly inversely dependent on load
factor: the higher the load factor, the better the performance. This
is because we don't bother with the usual forward pointer per item,
as those are expensive for concurrent use, and doubly linked pointers
are rather a lot of overhead per item. Iteration is therefore a case
of walking each bucket's linear table and, once completed, searching
for the next non-empty bucket, which in a sparsely occupied bucket
list is very lumpy as cache lines are pulled one by one into the
cache.
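Again in the simplified sketch, advancing an iterator looks something
like this, which shows where the bucket-skipping cost comes from:

// Advance (bucket, offset) to the next occupied slot, or return
// false. Sparse bucket lists mean many cache-cold buckets get touched
// merely to be skipped, hence iteration speeding up with load factor.
// Per-bucket locking is omitted for brevity.
bool next(std::vector<bucket_type> &buckets,
          size_t &bucket, size_t &offset)
{
  for (; bucket < buckets.size(); ++bucket, offset = 0)
    for (; offset < buckets[bucket].items.size(); ++offset)
      if (buckets[bucket].items[offset].hash != empty_hash)
        return true;
  return false;
}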
* rehash() under this design firstly locks the special rehash lock,
then locks every bucket in the list. It then constructs a new empty
bucket list and locks every lock in that. It then swaps the bucket
lists and moves all the items from the old buckets to the new
buckets. Finally it unlocks all the buckets and the special rehash
lock.
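In the simplified sketch (eliding the not-race-free caveats discussed
next, and eliding count updates):

void rehash(std::vector<bucket_type> &buckets, std::mutex &rehash_lock,
            size_t new_bucket_count)
{
  std::lock_guard<std::mutex> rg(rehash_lock);
  std::vector<bucket_type> old(new_bucket_count); // new empty list
  for (auto &b : buckets) b.lock.lock();          // lock all old buckets
  for (auto &b : old) b.lock.lock();              // lock all new buckets
  buckets.swap(old);                              // old holds previous list
  for (auto &b : old)                             // move items across
    for (auto &i : b.items)
      if (i.hash != empty_hash)
        buckets[i.hash % buckets.size()].items.push_back(std::move(i));
  for (auto &b : old) b.lock.unlock();
  for (auto &b : buckets) b.lock.unlock();
}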
This function is hard to write completely race free under all CPU
load circumstances without burdening the main API implementations. I
may work around this by keeping around the old, just emptied bucket
list, permanently marked for users as "go reload the bucket list",
and then document that you must not rehash too frequently or else you
will race your code.
Note that rehashing under this design is completely manual. If it
were not, insertion, erasure and item referencing could not be safe.
This implies that if you rehash, it is up to you not to be using
items by reference at the time. Rehashing is also by definition not
concurrent; indeed, it is the only function in the design which is
not capable of concurrency.
Thoughts on this design before I write a final implementation are
welcome. I don't plan to take my implementation past the minimum
needed for proposed Boost.AFIO, so if someone else wishes to finish
it and make it part of official Boost, that would be great.
Niall
--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/