[concurrent] Fast, lock-based queue implementation
I have written a new concurrent queue with a focus on high throughput. You can check it out here: https://github.com/davidstone/concurrent-queue

Some highlights of this queue:

* The bulk interface is faster than other queues for many common load patterns
* Almost any element type is supported, with very few requirements
* Supports any number of producers
* Supports any number of consumers (although it works best with only one)
* Can dynamically grow and shrink in memory as needed
* The interface allows for very little undefined behavior: for the most part, as long as your special member functions (such as move constructors) do not reference the queue you are inserting into, you are fine
* The interface moves values in and returns values by value, rather than using reference output parameters as other queues do (see the sketch after this message)
* Header-only
* The design should be fairly straightforward to verify for correctness
* Compiles warning-free at a very high warning level on gcc and clang
* Runs without errors under ubsan, asan, or tsan

I would like to submit this for inclusion in Boost if people find it interesting and useful, but first I would like to get some feedback on the design, implementation, documentation, tests, etc.

I have tested this exact version on Gentoo Linux under clang 4.0 and gcc 6.3.0, and I will be testing Visual Studio 2015 and 2017 in the near future to ensure there have not been any regressions. I have run fundamentally the same queue on Visual Studio 2012 (no longer supported due to C++14 features), Visual Studio 2015 (Windows Server 2008 R2, Windows Server 2012), gcc 4.9.1 (Red Hat 6 under developer toolset 3), gcc 4.9.2 (Red Hat 7), and clang 3.9.0 (Red Hat 7) for several years now with no issues, so it is a fairly mature design.
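To give a feel for the interface style, here is a single-threaded sketch. The names below are illustrative stand-ins rather than the library's actual API, but they show the move-in / return-by-value pattern and why a bulk pop can be O(1):

    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>

    // Illustrative stand-in, not the library's actual API. Values are
    // moved in, and pop_all returns the entire underlying container by
    // value via a single swap, so no reference output parameters appear
    // anywhere in the interface.
    template<typename T>
    struct queue_sketch {
        std::vector<T> m_data;   // the real queue guards its storage

        void push(T value) {                  // value is moved in
            m_data.push_back(std::move(value));
        }
        std::vector<T> pop_all() {            // O(1) regardless of contents
            std::vector<T> result;
            result.swap(m_data);
            return result;
        }
    };

    int main() {
        queue_sketch<std::string> queue;
        queue.push("hello");
        queue.push("world");
        auto batch = queue.pop_all();         // take everything in one swap
        std::cout << batch.size() << " messages\n";
    }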
David Stone wrote:
I have written a new concurrent queue that has a focus on high throughput. You can check it out here:
- If C++14 is required, why not use the standard thread/chrono facilities?
- Where can I see the benchmark results? You link to a fork of the moodycamel repo, but I couldn't find the actual results.
On Sat, Jun 17, 2017 at 10:26 PM, Peter Dimov via Boost <boost@lists.boost.org> wrote:
- If C++14 is required, why not use the standard thread/chrono facilities?
Excellent question that I should document. std::thread does not support interruption; boost::thread does. I find this feature compelling enough to justify using the boost version over the standard. The next best alternative, as far as I know, is to do the equivalent of adding a "stop processing messages" message to the queue. In particular, I found interruption to work well together with boost::scoped_thread<boost::interrupt_and_join_if_joinable>. The use of boost::chrono instead of std::chrono is just to interoperate with boost::condition_variable.
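A minimal sketch of that shutdown pattern (the loop body stands in for real consumer work on the queue):

    #include <boost/chrono.hpp>
    #include <boost/thread/scoped_thread.hpp>
    #include <boost/thread/thread.hpp>
    #include <iostream>

    int main() {
        // The destructor of this scoped_thread calls interrupt() and then
        // join(), so the worker shuts down when it goes out of scope.
        boost::scoped_thread<boost::interrupt_and_join_if_joinable> worker([] {
            try {
                for (;;) {
                    // Stand-in for a blocking wait on the queue's
                    // boost::condition_variable; sleep_for is an
                    // interruption point.
                    boost::this_thread::sleep_for(boost::chrono::milliseconds(50));
                }
            } catch (boost::thread_interrupted const &) {
                std::cout << "worker interrupted, exiting cleanly\n";
            }
        });
    }   // interrupt_and_join_if_joinable runs here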
- Where can I see the benchmark results? You link to a fork of the moodycamel repo, but I couldn't find the actual results.
Oh, I must have forgotten to save those. I will re-run the benchmarks and put the results in a document in the repo.
David Stone wrote:
On Sat, Jun 17, 2017 at 10:26 PM, Peter Dimov via Boost <boost@lists.boost.org> wrote:
- If C++14 is required, why not use the standard thread/chrono facilities?
Excellent question that I should document. std::thread does not support interruption; boost::thread does. I find this feature compelling enough to justify using the boost version over the standard. The next best alternative, as far as I know, is to do the equivalent of adding a "stop processing messages" message to the queue.
Natively supporting this mode of operation is not a bad idea in its own right. It is sometimes preferable to the interruption approach because it guarantees that the consumers will drain the queue before shutting down (and that guarantee comes for free, with no effort required on the part of the user, as long as the producers are shut down beforehand); see the sketch below. You will also need tests.
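Roughly something like this (all names here are hypothetical, not part of the library): close() marks the queue closed, and an empty result from pop_all tells a consumer that the queue is both closed and fully drained, so nothing can be lost:

    #include <condition_variable>
    #include <mutex>
    #include <utility>
    #include <vector>

    template<typename T>
    struct closable_queue {
        std::mutex m_mutex;
        std::condition_variable m_ready;
        std::vector<T> m_data;
        bool m_closed = false;

        void push(T value) {
            {
                std::lock_guard<std::mutex> lock(m_mutex);
                m_data.push_back(std::move(value));
            }
            m_ready.notify_one();
        }
        // Called once, after the last producer has finished pushing.
        void close() {
            {
                std::lock_guard<std::mutex> lock(m_mutex);
                m_closed = true;
            }
            m_ready.notify_all();
        }
        // Blocks until data arrives or the queue is closed. An empty
        // result means "closed and drained": safe to shut down.
        std::vector<T> pop_all() {
            std::unique_lock<std::mutex> lock(m_mutex);
            m_ready.wait(lock, [this] { return !m_data.empty() || m_closed; });
            std::vector<T> result;
            result.swap(m_data);
            return result;
        }
    };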
Sorry for the late response; I was in the middle of a cross-country move when I posted this.

It is probably easy to overlook the tests because they currently live in https://github.com/davidstone/concurrent-queue/blob/master/source/queue.cpp , which does not have the word "test" anywhere in its path. I believe it is a fairly strong suite of tests, but I would be happy to be proven wrong so that I can add more.

I will be updating the documentation soon to include the results of all benchmarks, but here are some overall performance numbers for context.

Hardware: Intel i7 8-core processor (sixth generation, Skylake)
Platform: gcc or clang on Linux or mingw, with -O3 -march=native -flto

I am able to process 1 billion int per second with 1 reader thread, 1 writer thread, and 1000 int added per batch. Throughput scales approximately linearly with batch size up to that point, so a batch size of 500 gives approximately 500 million int per second. Beyond that it is trickier to improve throughput, but as far as I can tell the best configuration on this type of processor reaches 3 billion int per second, with 3 reader threads and 3-10 writer threads (all gave approximately the same result) writing in batches of 2400 messages.

Using the queue in the worst possible way degrades performance significantly from this point. The worst case for the queue is many (10+) readers doing almost no work per item while a single writer adds 1 element per batch; that case gives only 550,000 int per second. To give a sense of the middle of the range, 1 reader and 4 writers writing 40 elements per batch gives about 100 million int per second through the queue.

On Visual Studio 2015 (Release mode), these numbers are about half on the high end and unchanged on the low end. I have not yet investigated why this is, or how to optimize performance further for that compiler.

How performance changes with other element types depends in part on how the queue is used. If you can reserve enough memory that the queue never grows beyond that size, performance is more predictable: the cost of a write is then determined by the cost of constructing the object via the constructor you chose, which in the limit is at least the 'size' of the object. On the read side, however, no such cost applies, because the pop_all function is O(1): it is just a swap of the underlying container. This means that large elements do not carry the cost they would in most queue designs.

I will think a bit about supporting other methods of queue shutdown, but I will only consider them if they add no per-element time or space overhead.
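For concreteness, the 1 writer / 1 reader batch pattern measured above has roughly this shape (reusing the hypothetical closable_queue sketched in the previous message; the real benchmark uses the actual queue and its bulk interface):

    #include <iostream>
    #include <thread>

    int main() {
        closable_queue<int> queue;   // sketch from the previous message
        std::thread producer([&queue] {
            for (int batch = 0; batch != 1000; ++batch) {
                for (int n = 0; n != 1000; ++n) {
                    queue.push(n);   // the real queue's bulk interface
                }                    // adds a whole batch at once
            }
            queue.close();           // no more data is coming
        });
        long long total = 0;
        for (;;) {
            auto batch = queue.pop_all();   // O(1): swaps the container out
            if (batch.empty()) { break; }   // closed and drained
            for (int value : batch) {
                total += value;             // stand-in per-element work
            }
        }
        producer.join();
        std::cout << "consumed sum: " << total << '\n';
    }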