Sorry for the late response; I was in the middle of a cross-country move when I posted this.

It is probably easy to overlook the tests because they currently live in https://github.com/davidstone/concurrent-queue/blob/master/source/queue.cpp, which does not have the word "test" anywhere in its path. I believe it is a fairly strong suite of tests, but I would be happy to be proven wrong so that I can add more.

I will be updating the documentation soon to include the results of all benchmarks, but here are some overall performance numbers for context.

Hardware: Intel i7, 8 cores (sixth generation, Skylake)
Platform: gcc or clang on Linux or mingw, with -O3 -march=native -flto

I am able to process 1 billion int per second with 1 reader thread, 1 writer thread, and 1000 int added per batch. Throughput scales approximately linearly with batch size up to this point, so a batch size of 500 gives you approximately 500 million int per second. In other words, the queue sustains roughly 1 million batches per second either way, which suggests that throughput in this range is dominated by a fixed per-batch synchronization cost rather than per-element work.

Beyond that point it is a little trickier to improve throughput, but as far as I can tell the best configuration on this type of processor gets you up to 3 billion int per second: 3 reader threads and 3-10 writer threads (every writer count in that range gave approximately the same result) writing in batches of 2400 messages.

Using the queue in the worst possible way can degrade performance significantly from there. The worst case for this queue is many (10+) readers doing almost no work per item, with a single writer adding 1 element per batch. That case gives you only 550,000 int per second. To give a sense of the middle ground, 1 reader and 4 writers writing 40 elements per batch gives about 100 million int per second through the queue.

On Visual Studio 2015 (Release mode), these numbers are roughly half at the high end and unchanged at the low end. I have not yet investigated why this is or how to further optimize performance for that compiler.

How performance changes with different element types depends in part on how the queue is used. If you can reserve enough memory up front that the queue never grows beyond that capacity, performance becomes more predictable: the cost of a write is then dominated by constructing the object via whichever constructor you chose, which in the limit is at least proportional to the size of the object. On the read side, however, no such per-element cost applies, because pop_all is O(1): it is just a swap of the underlying container (there is a rough sketch of the idea at the end of this message). This means that large elements do not carry the cost they would in most queue designs.

I will think a bit about supporting other methods of queue shutdown, but I will only consider them if they add no per-element time or space overhead.
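
Since the O(1) pop_all behavior is the unusual part of the design, here is a rough sketch of the swap idea. To be clear, this is not the actual implementation (that is in the repo linked above) and the names here are illustrative; it only shows why batched writes amortize the synchronization cost and why popping is constant time in both the number and the size of elements:

```cpp
// Minimal sketch of a swap-based queue; illustrative only, not the real code.
#include <condition_variable>
#include <mutex>
#include <utility>
#include <vector>

template <typename T>
struct batch_queue {
    // Writers append an entire batch under a single lock acquisition, so the
    // fixed synchronization cost is amortized over every element in the
    // batch. This is where the linear scaling with batch size comes from.
    template <typename Iterator>
    void append(Iterator first, Iterator last) {
        {
            std::scoped_lock lock(m_mutex);
            m_data.insert(m_data.end(), first, last);
        }
        m_ready.notify_one();
    }

    // Blocks until at least one element is present, then takes everything.
    // O(1) regardless of how many elements are queued or how large they are:
    // the reader walks away with the whole container via swap.
    // (Shutdown handling is omitted from this sketch.)
    std::vector<T> pop_all(std::vector<T> buffer = {}) {
        buffer.clear();  // keeps the buffer's capacity for reuse
        std::unique_lock lock(m_mutex);
        m_ready.wait(lock, [&] { return !m_data.empty(); });
        std::swap(m_data, buffer);
        return buffer;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_ready;
    std::vector<T> m_data;
};
```

In a design like this, passing the previous result back into pop_all is what makes the "reserve enough memory" strategy work: once both containers have grown to their high-water mark, the queue reaches a steady state with no allocations on either side.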