On 28 Jul 2014 at 7:21, Rob Stewart wrote:
> The worst case, as I see it, is that you simply increment (or
> decrement) a computed hash value should the magic value ever arise.
> That is, you increase the odds of a collision so that no value ever
> uses the magic value.

That works for me. Great idea, thank you.
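For anyone wondering what that looks like, a minimal sketch might be as follows (the sentinel choice and the names here are illustrative placeholders, not the actual code):

    #include <cstddef>
    #include <functional>

    // One hash value is reserved as a "magic" sentinel (e.g. marking an
    // empty slot). Any real hash which happens to equal it gets bumped,
    // so no stored item ever carries the sentinel. This marginally
    // raises the collision rate at the cost of one compare per hash.
    static constexpr std::size_t magic_hash = 0;  // placeholder choice

    template <class T, class Hash = std::hash<T>>
    std::size_t safe_hash(const T &v)
    {
      std::size_t h = Hash()(v);
      if (h == magic_hash)
        ++h;  // increment (or decrement) past the reserved value
      return h;
    }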
Here are some benchmarks for a first attempt at the design I proposed earlier. The tests reserve 10,000 capacity and insert 5,000 items, to try to remove the overhead of rehashing for unordered_map. The write test is a random burst of up to 128 insertions or deletions, which may be done concurrently by as many threads as the CPU has (eight hyperthreaded ones). The read test is as many finds as it can do concurrently. I tested single threaded and eight threaded to ensure the concurrent design was not slower than a spinlocked std::unordered_map.

First, single threaded:

=== Large unordered_map spinlock write performance ===
1. Achieved 23548585.078085 transactions per second

=== Large unordered_map spinlock read performance ===
1. Achieved 64616938.960940 transactions per second

=== Large concurrent_unordered_map write performance ===
There are 1 threads in this CPU
1. Achieved 36929611.532615 transactions per second

=== Large concurrent_unordered_map read performance ===
There are 1 threads in this CPU
1. Achieved 66784946.368549 transactions per second

As concurrent_unordered_map doesn't have to check the load factor during inserts and erases, it is quicker even single threaded.

Eight threaded:

=== Large unordered_map spinlock write performance ===
1. Achieved 22012179.855850 transactions per second

=== Large unordered_map spinlock read performance ===
1. Achieved 42627903.829194 transactions per second

=== Large concurrent_unordered_map write performance ===
There are 8 threads in this CPU
1. Achieved 166278672.013391 transactions per second

=== Large concurrent_unordered_map read performance ===
There are 8 threads in this CPU
1. Achieved 165071376.646606 transactions per second

Contention on the read spinlock significantly reduces unordered_map's performance relative to the single threaded case. Meanwhile my concurrent_unordered_map sees a 4.5x insert/erase improvement and a 2.5x find improvement; in fact, writing and reading the concurrent_unordered_map now deliver the same performance under maximum contention.

Someone is bound to ask how this compares to a split ordered list concurrent_unordered_map. Well, for Microsoft PPL's one using VS2013:

=== Large concurrent_unordered_map read performance ===
There are 4 threads in this CPU
Mine: Achieved 37400530.145442 transactions per second
PPL:  Achieved 54401408.200488 transactions per second

So PPL is about 1.5x mine for reads. PPL can't do concurrent erase though, so I can't test the writes.

My design is also intended for transactional GCC.
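By transactional GCC I mean gcc built with its -fgnu-tm transactional memory support. As a rough illustration of the kind of construct involved (placeholder names and types, not my actual implementation), a find wrapped in a transaction might look like:

    #include <unordered_map>

    std::unordered_map<int, int> table;  // hypothetical shared map

    // Compile with -fgnu-tm. A relaxed rather than atomic transaction
    // is used because unordered_map::find is not marked
    // transaction_safe; gcc serialises the transaction when it cannot
    // run it speculatively.
    bool tm_find(int key, int &out)
    {
      bool found = false;
      __transaction_relaxed
      {
        auto it = table.find(key);
        if (it != table.end())
        {
          out = it->second;
          found = true;
        }
      }
      return found;
    }

The idea is that the transaction takes the place of the spinlock around the table.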
There I see the following benchmarks. Single threaded:

=== Large unordered_map spinlock write performance ===
1. Achieved 15925637.896608 transactions per second

=== Large unordered_map spinlock read performance ===
1. Achieved 52702345.715530 transactions per second

=== Large concurrent_unordered_map write performance ===
There are 1 threads in this CPU
1. Achieved 19225268.992226 transactions per second

=== Large concurrent_unordered_map read performance ===
There are 1 threads in this CPU
1. Achieved 18177607.124302 transactions per second

Something is wrong with my single threaded read performance here; it is far lower than it should be. I'm going to assume it's a bug in transactional GCC; see below for what I mean.

Eight threaded:

=== Large unordered_map spinlock write performance ===
1. Achieved 13913300.037094 transactions per second

=== Large unordered_map spinlock read performance ===
1. Achieved 32325773.115938 transactions per second

=== Large concurrent_unordered_map write performance ===
There are 8 threads in this CPU
1. Achieved 95011747.620646 transactions per second

=== Large concurrent_unordered_map read performance ===
There are 8 threads in this CPU
1. Achieved 109675855.394830 transactions per second

My concurrent_unordered_map sees a 5x insert/erase improvement and a 6x find improvement, which suggests my code is sound and parallelising nicely.

Obviously these are early results and I still have much to do, but it is looking like it should provide a nice wee boost for AFIO's performance for sure.

Niall
-- 
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/