[Performance of circular_buffer vs Queue][Multithreading may be broken]
Dear Boost developer, I found the circular buffer to be slower than std::queue for multithreading except in the case of a single thread (where it is notably faster). I am using resize() to have the expected size filled as in vector. Is this behavior expected or documented anywhere? What kind of speedup is to be expected against the std::queue implementation? If there is none, then what is the purpose of circular buffer? I do also have broken code on a multithreaded application that worked fine before. I do pass the circular buffer as reference via functions from file scope, which in combination with the threads (initially starting elsewhere) somehow broke the code. Do you do regular benchmarks against other implementations? I dont see linked graphs on your homepage. Regards, Jan
On Wed, 1 Apr 2020 at 05:27, Jan Hafer via Boost
Dear Boost developer, I found the circular buffer to be slower than std::queue for
These are different containers. multithreading except in the case of a single thread (where it is
notably faster).
std::queue is not synchronized. I am using resize() to have the expected size filled as in vector.
Is this behavior expected or documented anywhere? What kind of speedup is to be expected against the std::queue implementation? If there is none, then what is the purpose of circular buffer?
You could read the relevant docs. I strongly doubt that this boost container is slower than anything in the STL when compared like for like. I do also have broken code on a multithreaded application that worked
fine before. I do pass the circular buffer as reference via functions from file scope, which in combination with the threads (initially starting elsewhere) somehow broke the code.
Do you do regular benchmarks against other implementations? I dont see linked graphs on your homepage.
Regards, Jan
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
degski -- @systemdeg "We value your privacy, click here!" Sod off! - degski "Anyone who believes that exponential growth can go on forever in a finite world is either a madman or an economist" - Kenneth E. Boulding "Growth for the sake of growth is the ideology of the cancer cell" - Edward P. Abbey
Thanks for the fast reply. On 01.04.20 16:08, degski wrote:
On Wed, 1 Apr 2020 at 05:27, Jan Hafer via Boost
wrote: Dear Boost developer, I found the circular buffer to be slower than std::queue for
These are different containers.
Yes, however I needed a queue and was interested in the performance characteristics. Apparently std::queue is slower on single-thread applications, but faster for multiple threads. I could reproduce the results on my own circular buffer implementation.
multithreading except in the case of a single thread (where it is
notably faster).
std::queue is not synchronized.
Yes, I do use 1 buffer/queue per thread.
I am using resize() to have the expected size filled as in vector.
Is this behavior expected or documented anywhere? What kind of speedup is to be expected against the std::queue implementation? If there is none, then what is the purpose of circular buffer?
You could read the relevant docs. I strongly doubt that this boost container is slower than anything in the STL when compared like for like.
True, could reproduce.
I do also have broken code on a multithreaded application that worked
fine before. I do pass the circular buffer as reference via functions from file scope, which in combination with the threads (initially starting elsewhere) somehow broke the code.
Do you do regular benchmarks against other implementations? I dont see linked graphs on your homepage.
I would like to see the benchmark results against std-implementations linked on the main page. If possible, also as simple comparison per thread count.
Gesendet: Donnerstag, 02. April 2020 um 18:43 Uhr Von: "Jan Hafer via Boost"
Thanks for the fast reply. Is this behavior expected or documented anywhere? What kind of speedup is to be expected against the std::queue implementation? If there is none, then what is the purpose of circular buffer?
You could read the relevant docs. I strongly doubt that this boost container is slower than anything in the STL when compared like for like.
True, could reproduce.
I do also have broken code on a multithreaded application that worked
fine before. I do pass the circular buffer as reference via functions from file scope, which in combination with the threads (initially starting elsewhere) somehow broke the code.
Do you do regular benchmarks against other implementations? I dont see linked graphs on your homepage.
I would like to see the benchmark results against std-implementations linked on the main page. If possible, also as simple comparison per thread count.
Could you give a full example /repro of what exactly you are measuring? Saying "container X is faster than Y" is not very helpful without showing what operations you are measuring and what the workload is. You e.g. could create a small github repository, or a gist, or put the your benchmark here: http://quick-bench.com/ Best Mike
On Thu, 2 Apr 2020 at 17:43, Jan Hafer via Boost
Yes, I do use 1 buffer/queue per thread.
So you're saying that circular_buffer is slower on a given thread when other threads are accessing their own circular_buffer in parallel? That sounds unlikely to be circular buffer's fault. The first thing to check is that the memory placement of your circular buffer does not result in false sharing with other threads. More generally you could be stalling due to any inter-connection, such as crossing NUMA domains or talking to hardware. Otherwise you could be saturating the execution ports of your core and thus not get the expected speed-up from hyper-threading, but in any case you'd be the best throughput overall.
On 02.04.20 20:59, Mathias Gaunard wrote:
On Thu, 2 Apr 2020 at 17:43, Jan Hafer via Boost
wrote: Yes, I do use 1 buffer/queue per thread.
So you're saying that circular_buffer is slower on a given thread when other threads are accessing their own circular_buffer in parallel? That sounds unlikely to be circular buffer's fault.
Yes and I dont know quite the reason for it. My Threads know their id to access a file-global data structure containing their queue/circular buffer. They start another after in a thread-safe way and exit on emptying the queue/circular buffer.
The first thing to check is that the memory placement of your circular buffer does not result in false sharing with other threads. More generally you could be stalling due to any inter-connection, such as crossing NUMA domains or talking to hardware. Otherwise you could be saturating the execution ports of your core and thus not get the expected speed-up from hyper-threading, but in any case you'd be the best throughput overall.
I know how to obtain the source code of boost, but the compiler vendors sadly provide no direct internet-searchable links to the std source code. Thanks for your advice. I may dig further with perf to obtain cache-miss statistics. My goal was just to inform you of this characteristics, since this is not documented.
Jan Hafer wrote:
I know how to obtain the source code of boost, but the compiler vendors sadly provide no direct internet-searchable links to the std source code.
https://github.com/gcc-mirror/gcc/tree/master/libstdc%2B%2B-v3 https://github.com/llvm-mirror/libcxx/ https://github.com/microsoft/stl
On 3/04/2020 08:41, Jan Hafer wrote:
On 02.04.20 20:59, Mathias Gaunard wrote:
So you're saying that circular_buffer is slower on a given thread when other threads are accessing their own circular_buffer in parallel? That sounds unlikely to be circular buffer's fault.
Yes and I dont know quite the reason for it. My Threads know their id to access a file-global data structure containing their queue/circular buffer. They start another after in a thread-safe way and exit on emptying the queue/circular buffer.
That sounds like you're allocating a single array of circular_buffers and then accessing them from different threads. That's basically the worst possible thing to do; as Mathias was saying, that will end up sharing cache lines between different cores and your performance will tank. At minimum, you should embed the circular_buffer into another struct that has sizeof() >= std::hardware_destructive_interference_size, and make an array of that. But better still, embed the circular_buffer into your processing classes and don't have arrays of them at all. (If you're pre-C++17 and don't have std::hardware_destructive_interference_size, then using 64 works for most modern platforms.) Ideally, the circular_buffer implementation itself should also separate all internal producer-thread members and consumer-thread members by std::hardware_destructive_interference_size and try very hard to not cross over. (Here the main thing that matters is write accesses.) If you want to try using a more modern circular buffer that gets this correct, have a look at Boost.Lockfree's spsc_queue. https://www.boost.org/doc/libs/1_72_0/doc/html/boost/lockfree/spsc_queue.htm... (While you're there, there's also an MPMC queue as well. Note that a lockfree queue will tend to be slower than std::queue in uncontended benchmarks, but the avoidance of locks can be superior in highly contended or other specialised scenarios.)
-----Original Message----- From: Boost
On Behalf Of Gavin Lambert via Boost Sent: 3 April 2020 00:16 To: boost@lists.boost.org Cc: Gavin Lambert Subject: Re: [boost] [Performance of circular_buffer vs Queue][Multithreading may be broken] On 3/04/2020 08:41, Jan Hafer wrote:
On 02.04.20 20:59, Mathias Gaunard wrote:
So you're saying that circular_buffer is slower on a given thread when other threads are accessing their own circular_buffer in parallel? That sounds unlikely to be circular buffer's fault.
Yes and I dont know quite the reason for it. My Threads know their id to access a file-global data structure containing their queue/circular buffer. They start another after in a thread-safe way and exit on emptying the queue/circular buffer.
That sounds like you're allocating a single array of circular_buffers and then accessing them from different threads.
That's basically the worst possible thing to do; as Mathias was saying, that will end up sharing cache lines between different cores and your performance will tank.
At minimum, you should embed the circular_buffer into another struct that has sizeof() >= std::hardware_destructive_interference_size, and make an array of
I am following this discussion thread with no particular interest, but I can imagine that there are others who will find the discussion, theorizing and advice rather valuable. Can anyone add a link to it to the circular buffer documentation? Paul A. Bristow that.
But better still, embed the circular_buffer into your processing classes and
have arrays of them at all.
(If you're pre-C++17 and don't have std::hardware_destructive_interference_size, then using 64 works for most modern platforms.)
Ideally, the circular_buffer implementation itself should also separate all internal producer-thread members and consumer-thread members by std::hardware_destructive_interference_size and try very hard to not cross over. (Here the main thing that matters is write accesses.)
If you want to try using a more modern circular buffer that gets this correct, have a look at Boost.Lockfree's spsc_queue.
https://www.boost.org/doc/libs/1_72_0/doc/html/boost/lockfree/spsc_queue.ht ml
(While you're there, there's also an MPMC queue as well. Note that a lockfree queue will tend to be slower than std::queue in uncontended benchmarks, but
don't the
avoidance of locks can be superior in highly contended or other specialised scenarios.)
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
On Fri, 3 Apr 2020 at 02:35, Paul A Bristow via Boost
I am following this discussion thread with no particular interest, but I can imagine that there are others who will find the discussion, theorizing and advice rather valuable. Can anyone add a link to it to the circular buffer documentation?
I assume it's this cb: https://www.boost.org/doc/libs/1_72_0/doc/html/circular_buffer.html degski -- @systemdeg "We value your privacy, click here!" Sod off! - degski "Anyone who believes that exponential growth can go on forever in a finite world is either a madman or an economist" - Kenneth E. Boulding "Growth for the sake of growth is the ideology of the cancer cell" - Edward P. Abbey
Jan Hafer wrote:
I do also have broken code on a multithreaded application that worked fine before. I do pass the circular buffer as reference via functions from file scope, which in combination with the threads (initially starting elsewhere) somehow broke the code.
Does this mean that you're using the same circular_buffer from more than one thread without synchronization?
participants (7)
-
degski
-
Gavin Lambert
-
Jan Hafer
-
Mathias Gaunard
-
Mike
-
pbristow@hetp.u-net.com
-
Peter Dimov