boost lockfree queue.hpp - alternate implementation w/o compare_exchange
Is there any interest in an alternate implementation of a lockfree queue that doesn't use compare_exchange (assuming it is correct, or can be made to be correct)?

I have a multi-producer / multi-consumer implementation that generally doesn't suffer nearly as much from contention. On my hardware (an older high-end 64-bit x86 workstation) I get roughly comparable throughput to the existing boost lockfree queue with 1 producer and 1 consumer. With 4 producers and 4 consumers I get about a 3x throughput gain over the existing boost lockfree queue. However, this implementation has a weakness: it suffers in over-subscription scenarios (threads in the middle of push/pop that lose their time-slice). With twice as many threads as cores its performance is comparable to the boost lockfree queue (using try_push/try_pop, otherwise throughput is down the toilet), and with a 4x thread/core ratio the throughput is somewhat less than the boost lockfree queue.

This was tested on Windows 7 x64 with Intel Xeon X5355 x 2 (2 sockets w/ 4 cores apiece) @ 2.66 GHz and 32 GB RAM @ 667 MHz.

The implementation can be found here: https://github.com/benmccart/GuarunteedMpmcQueue/blob/master/queue/queue.hpp

Is the implementation correct? Can anyone spot any race conditions, or conditions under which the implementation would fail to work as expected (other than the queue being full/empty, or over-subscription)?

~Ben
interesting ... it might take me a bit to have a closer look at the code, though from a quick look it seems like a variation of the wait-free ringbuffer, which is available in the wild?

btw, one thing i could spot right away: you are using std::this_thread::yield() ... which blocks the calling thread and is not allowed e.g. in real-time applications ...

On 2015-10-21 05:02, boost@mccart.us wrote:
Is there any interest in an alternate implementation of a lockfree queue that doesn't use compare_exchange (assuming it is correct, or can be made to be correct)?
<...>
On 2015-10-21 03:15, Tim Blechmann wrote:
<...>
btw, one thing i could spot right away: you are using std::this_thread::yield() ... which blocks the calling thread and is not allowed e.g. in real-time applications ...
The yield can be removed without any ill effect except in cases of over-subscription. Without the yield, performance is *very* poor because the thread is forced to wait for other threads that aren't even running (lines 355 & 378)... Basically the yield is only a work-around for something that is undesirable anyway (over-subscription), but it limits the 'damage' in those scenarios. If there were a cross-platform way to prevent a thread from getting rescheduled from within push/pop it would make things so much simpler...
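To make the discussion concrete, the pattern in question looks roughly like this (a simplified sketch with made-up names and the capacity check omitted, not the actual code from queue.hpp):

    #include <atomic>
    #include <cstddef>
    #include <thread>

    // Sketch of the ordered "commit" at the end of push(); pop() is
    // symmetric on the front indices. Names are illustrative only.
    template <typename T>
    void push_sketch(std::atomic<std::size_t>& back_lead,
                     std::atomic<std::size_t>& back_trail,
                     T* buffer, std::size_t mask, T const& item)
    {
        // Reserve a unique slot with a plain atomic increment (no CAS).
        std::size_t const idx = back_lead.fetch_add(1, std::memory_order_relaxed);
        buffer[idx & mask] = item;  // write the element into the reserved slot

        // Commit in order: wait until every earlier writer has published.
        // If one of those writers lost its time-slice we spin here, and
        // the yield merely limits the damage under over-subscription.
        while (back_trail.load(std::memory_order_acquire) != idx)
            std::this_thread::yield();
        back_trail.store(idx + 1, std::memory_order_release);
    }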
On 22 Oct 2015 3:26 a.m.,
The yield can be removed without any ill effect except in cases of over-subscription. <...> Basically the yield is only a work-around for something that is undesirable anyway (over-subscription), but it limits the 'damage' in those scenarios.

It seems to me that it is not really lock free then, right?
On 22/10/2015 20:57, Giovanni Piero Deretta wrote:
It seems to me that it is not really lock free then, right?
There's a difference between "lock free" and "wait free" (though I haven't examined the code to see which most correctly applies here). Wait-free is better than lock-free, of course, but it's also incredibly hard to achieve in an MPMC problem.
On 22 Oct 2015 9:04 a.m., "Gavin Lambert" wrote:
There's a difference between "lock free" and "wait free" (though I haven't examined the code to see which most correctly applies here). Wait-free is better than lock-free, of course, but it's also incredibly hard to achieve in an MPMC problem.
Lock free still requires at least one thread to make progress in any situation. If that were the case here, yield wouldn't be necessary. As it is, I do not think it qualifies even as obstruction-free.
On 2015-10-22 03:18, Giovanni Piero Deretta wrote:
Lock free still requires at least one thread to make progress in any situation. If that were the case here, yield wouldn't be necessary. As it is, I do not think it qualifies even as obstruction-free.
Assuming all threads are currently scheduled, multiple reader and writer threads make progress on average, and at least one reader thread and one writer thread will make progress. Assuming all threads are scheduled, it is absolutely obstruction free, and lock free. It is 'mostly' wait free... the waiting is the time it takes to increment the trailing edge of either back (for writers) or front (for readers).
On Fri, Oct 23, 2015 at 7:41 PM,
Assuming all threads are currently scheduled, multiple reader and writer threads make progress on average, and at least one reader thread and one writer thread will make progress. Assuming all threads are scheduled, it is absolutely obstruction free, and lock free. It is 'mostly' wait free... the waiting is the time it takes to increment the trailing edge of either back (for writers) or front (for readers).
According to your previous description, it seems to me that a writer could be blocked forever waiting for a preempted, dead, or blocked thread. Similarly for a reader.
On 2015-10-23 15:53, Giovanni Piero Deretta wrote:
According to your previous description, it seems to me that a writer could be blocked forever waiting for a preempted, dead, or blocked thread. Similarly for a reader.
If the yield bit is removed from the push/pop implementation, I think you will find the statement above about assumptions to be accurate upon investigating the implementation, assuming the implementation is correct. It would be fabulous if anyone could spot errors in the implementation, or let me know if they think it is correct. It may very well be that there are no generalized use cases where higher throughput is needed in a queue with larger numbers of readers/writers. However, I would expect there is at least limited utility for some workloads... I was hoping there would be some interest on this list if there was.
On 2015-10-22 03:03, Gavin Lambert wrote:
There's a difference between "lock free" and "wait free" (though I haven't examined the code to see which most correctly applies here).
Wait-free is better than lock-free, of course, but it's also incredibly hard to achieve in an MPMC problem.
Strictly speaking it is not wait-free, because all writers in push() have to wait for the writer before them to complete (increment the trailing-edge index by 1) before they themselves can complete (increment the trailing-edge index by 1 again). However, this is the only waiting part; they don't have to wait to write their item into the queue (assuming there are empty slots in the queue). The readers behave symmetrically. The one advantage of this queue (assuming the implementation is correct) is that there is lower contention than with methods that use compare_exchange (which spend more time colliding and retrying).
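To illustrate the contention difference (schematic snippets, not code from either queue): a compare_exchange-based reservation has to retry on every collision, while a fetch_add-based reservation hands every thread a unique slot in a single atomic operation:

    #include <atomic>
    #include <cstddef>

    // CAS-style slot reservation: under contention all but one thread
    // fail the exchange and must re-read and retry, wasting work on
    // collisions.
    std::size_t reserve_with_cas(std::atomic<std::size_t>& lead)
    {
        std::size_t idx = lead.load(std::memory_order_relaxed);
        while (!lead.compare_exchange_weak(idx, idx + 1,
                                           std::memory_order_relaxed))
            ;  // idx now holds the current value; try again
        return idx;
    }

    // fetch_add-style reservation: always succeeds in one atomic RMW;
    // the ordering cost is moved to the in-order commit step instead.
    std::size_t reserve_with_fetch_add(std::atomic<std::size_t>& lead)
    {
        return lead.fetch_add(1, std::memory_order_relaxed);
    }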
On 2015-10-22 02:57, Giovanni Piero Deretta wrote:
It seems to me that it is not really lock free then, right?
That is an interesting viewpoint, and from the perspective of over-subscription I suppose you have a point in a certain sense. All writers allowed to enter push, or readers allowed to enter pop, can do all their work in parallel except for increasing the trailing edge (essentially the write/read position). Each writer (and symmetrically, each reader) has to wait for the writer ahead of it before increasing the queue size (and allowing a reader).
Have you run ThreadSanitizer on it?
On Tue, Oct 20, 2015 at 11:02 PM,
Is there any interest in an alternate implementation of a lockfree queue that doesn't use compare_exchange (assuming it is correct, or can be made to be correct)?
<...>
On Tue, Oct 20, 2015 at 11:02 PM,
Is there any interest in an alternate implementation of a lockfree queue that doesn't use compare_exchange (assuming it is correct, or can be made to be correct)?
<...>
This was tested on Windows 7 x64 with Intel Xeon X5355 x 2 (2 sockets w/ 4 cores apiece) @ 2.66 GHz and 32 GB RAM @ 667 MHz.
P.S. Intel isn't the best place to test lock-free code. You need crazy platforms with more relaxed memory-ordering guarantees.
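For example (an illustrative snippet, not from the queue under discussion): code that relies on relaxed atomics can appear correct on x86, which never reorders a store with an earlier store, yet fail on ARM or POWER, where the two stores below may become visible in either order:

    #include <atomic>

    std::atomic<int>  data{0};
    std::atomic<bool> ready{false};

    void produce()
    {
        data.store(42, std::memory_order_relaxed);
        ready.store(true, std::memory_order_relaxed);  // should be release
    }

    int consume()
    {
        // May observe ready == true and still read data == 0 on a weakly
        // ordered machine; acquire/release ordering is needed here.
        while (!ready.load(std::memory_order_relaxed)) { }
        return data.load(std::memory_order_relaxed);
    }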
The implementation can be found here: https://github.com/benmccart/GuarunteedMpmcQueue/blob/master/queue/queue.hpp
Is the implementation correct? Can anyone spot any race conditions, or conditions under which the implementation would fail to work as expected (other than queue is full/empty or over-subscription)?
I took a quick look; I may easily have misunderstood the code.

So you have a leading edge and trailing edge on both front and back. OK. So on push, we incr the leading edge, reserving a space. When we are done, we incr the trailing edge. But we can't incr the trailing edge until every earlier pusher has incremented the trailing edge, correct? i.e. between trail and lead we have some partially constructed elements, and we can't move trail until we know the one on that edge is completely written. i.e. the important part is that everything before trail MUST be completely written. That is our invariant.

It appears to me, then, that if T's copy/move constructor decides to grab a lock, or deadlock, or kill its own thread, all other pushers stop, waiting for the trailing pusher to finish, even though it may never do so. Is that correct? If so, it is not lock free. Answer the question: "Is there ANY point in the program where an idiot could kill a thread and cause the algorithm to break?" If the answer is yes, then it is not lock free. It may be obstruction free; I haven't thought about it enough. The other question is: "Do I have comments in the code starting with the word 'wait'?" If so, it is probably not lock free. Also, eventually, pullers will wait forever too, as they appear to spin-lock on empty.

NOTE: None of the above means it is a bad queue; that remains to be seen. It is just not lock-free. It may typically perform better than lock-free, and lock-free is typically about performance. But note that it is also about being robust and deadlock-free, etc. So it may depend on use cases.

See also https://www.youtube.com/watch?v=Xf35TLFKiO8 which starts to describe my array-based MPMC queue. The C++Now version is better, but those videos aren't available yet. I'd blame Marshall here, but he's volunteering his own time, so let's be patient.

My queue uses CAS, and I don't claim that it is fast. I'm doing it for theoretical correctness first. (It also, at least in the video, does not yet handle anything beyond ints.)

Tony
On 27/10/2015 11:05, Gottlob Frege wrote:
NOTE: None of the above means it is a bad queue; that remains to be seen. It is just not lock-free. It may typically perform better than lock-free, and lock-free is typically about performance. But note that it is also about being robust and deadlock-free, etc. So it may depend on use cases.
Performance is a tricky subject with lock-free structures. Often uncontended performance is worse, but the main reason they exist is to provide better worst-case guarantees, for latency in particular, in the face of contention. (Typically the key improvement is that at least one running thread can always make forward progress, unlike lock-based structures, where one thread can make progress but it is not necessarily running.)
See also, https://www.youtube.com/watch?v=Xf35TLFKiO8 which starts to describe my array-based MPMC queue. The C++Now version is better, but those videos aren't available yet. I'd blame Marshall here, but he's volunteering his own time, so let's be patient.
My queue uses CAS and I don't claim that it is fast. I'm doing it for theoretical correctness first. (It also, at least in the video, does not yet handle anything beyond ints).
My current favourite mostly-lock-free MPMC queue is Dmitry Vyukov's (http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue).
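The core idea, condensed from that page (an untested sketch, not a drop-in implementation; the buffer size must be a power of two): each cell carries a sequence number that tells a producer or consumer whether the cell is ready for it, so each side needs only one CAS on its own position counter:

    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    template <typename T>
    class mpmc_bounded_queue
    {
        struct cell_t
        {
            std::atomic<std::size_t> sequence;
            T data;
        };
        std::vector<cell_t> buffer_;
        std::size_t const mask_;
        std::atomic<std::size_t> enqueue_pos_{0};
        std::atomic<std::size_t> dequeue_pos_{0};

    public:
        explicit mpmc_bounded_queue(std::size_t buffer_size)
            : buffer_(buffer_size), mask_(buffer_size - 1)
        {
            for (std::size_t i = 0; i != buffer_size; ++i)
                buffer_[i].sequence.store(i, std::memory_order_relaxed);
        }

        bool try_push(T const& value)
        {
            std::size_t pos = enqueue_pos_.load(std::memory_order_relaxed);
            for (;;)
            {
                cell_t& cell = buffer_[pos & mask_];
                std::size_t const seq =
                    cell.sequence.load(std::memory_order_acquire);
                std::intptr_t const dif =
                    (std::intptr_t)seq - (std::intptr_t)pos;
                if (dif == 0)
                {
                    // Cell is free for this position; claim it with one CAS.
                    if (enqueue_pos_.compare_exchange_weak(
                            pos, pos + 1, std::memory_order_relaxed))
                    {
                        cell.data = value;
                        cell.sequence.store(pos + 1, std::memory_order_release);
                        return true;
                    }
                }
                else if (dif < 0)
                {
                    return false;  // queue full
                }
                else
                {
                    pos = enqueue_pos_.load(std::memory_order_relaxed);
                }
            }
        }

        // try_pop is symmetric: it expects seq == pos + 1, and on success
        // stores seq = pos + mask_ + 1 to free the cell for the producer
        // one lap around the buffer later.
    };

It is "mostly" lock-free in the same sense discussed above: a producer stalled between the CAS and the sequence store delays consumers of that one cell, but other cells remain usable.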
On Mon, Oct 26, 2015 at 7:29 PM, Gavin Lambert wrote:
My current favourite mostly-lock-free MPMC queue is Dmitry Vyukov's (http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue).
Darn, wish I hadn't seen that. Now I might need to change my queue. :-)
participants (5): boost@mccart.us, Gavin Lambert, Giovanni Piero Deretta, Gottlob Frege, Tim Blechmann