Re: [Boost-users] thread_group::interrupt_all is not reliable
I think I found the cause of this problem. It seems that the caller of interrupt_all should be holding the mutex associated with the condition on which the threads are waiting. This gave me the clue to try that: http://www.opengroup.org/onlinepubs/009695399/functions/pthread_cond_broadca...
The pthread_cond_broadcast() or pthread_cond_signal() functions may be called by a thread whether or not it currently owns the mutex that threads calling pthread_cond_wait() or pthread_cond_timedwait() have associated with the condition variable during their waits; however, if predictable scheduling behavior is required, then that mutex shall be locked by the thread calling pthread_cond_broadcast() or pthread_cond_signal().
thread::interrupt() calls pthread_cond_broadcast in pthread/thread.cpp. Although "predictable scheduling" doesn't seem like it should include a failure to wake up, taking the mutex around the call to thread_pool::interrupt_all() appears to be 100% reliable. I can patch my app to do that, but I don't think there's a general solution. The documentation should include a note that thread::interrupt() isn't reliable unless the caller is holding the mutex associated with the condition variable on which the interrupted thread is waiting. Of course, this could be a bug in the OS X pthreads implementation as well. - Stoney
I've discovered that under circumstances apparently related to timing and load, sending interrupt_all to a thread_group when all the threads are waiting on a boost::condition_variable leaves one thread waiting about 1/3 of the time. This is with boost 1_40_0 running on Mac OS X 10.6.2, with 32-bit boost libraries. Boost uses the posix thread system here. I boiled my app down to some test code that runs as a command-line app. It's a bit longer than I'd like, but this configuration seems to be necessary to invoke the problem. The test uses a queue to pass "tasks" from the main thread to worker threads, and another queue to pass "results" back to the main thread. The problem is most apparent when all the tasks are finished and the queue empties, so that all the worker threads are waiting on the input queue when the main thread sends interrupt_all.
I've looked at the waiting thread in a debugger when this happens, and found that it has been interrupted, but is still waiting on the condition. It looks like it just got missed by the interrupt_all. This is more likely to happen when there are a lot of worker threads (16, or one per core in my testing).
The test code is parked at http://sb.org/ThreadTest.zip, 20KB. It's an XCode 3.2 project, but the five source files could be readily compiled and run in any Unix environment.
I don't see any errors in the code that could cause these failures. There is a work-around, which is to interrupt the waiting thread again. This required a modified version of thread_group so I could do a timed_join_all on it.
I welcome any suggestions about what could be wrong here, or ways to simplify the test to make it more suitable for a bug report.
- Stoney -- Stonewall Ballard stoney@sb.org http://stoney.sb.org/
Stonewall Ballard wrote:
I think I found the cause of this problem. It seems that the caller of interrupt_all should be holding the mutex associated with the condition on which the threads are waiting.
This gave me the clue to try that: http://www.opengroup.org/onlinepubs/009695399/functions/pthread_cond_broadca...
The pthread_cond_broadcast() or pthread_cond_signal() functions may be called by a thread whether or not it currently owns the mutex that threads calling pthread_cond_wait() or pthread_cond_timedwait() have associated with the condition variable during their waits; however, if predictable scheduling behavior is required, then that mutex shall be locked by the thread calling pthread_cond_broadcast() or pthread_cond_signal().
thread::interrupt() calls pthread_cond_broadcast in pthread/thread.cpp.
Although "predictable scheduling" doesn't seem like it should include a failure to wake up, taking the mutex around the call to thread_pool::interrupt_all() appears to be 100% reliable.
I can patch my app to do that, but I don't think there's a general solution. The documentation should include a note that thread::interrupt() isn't reliable unless the caller is holding the mutex associated with the condition variable on which the interrupted thread is waiting.
Of course, this could be a bug in the OS X pthreads implementation as well.
Hi, FWIW, I ran that test of yours several times with varying parameters on my machine (quad core, 64bit, linux) and it did not show a single failure. Of course, since it is not a deterministic effect even on your machine, failure to reproduce does not really mean much, but well, I thought you might like to hear anyway :-) And I totally agree: Predictable scheduling should not be required to wake up all threads, especially since the document also says <snip> The pthread_cond_broadcast() or pthread_cond_signal() functions may be called by a thread whether or not it currently owns the mutex [...] </cite> As for boiling down the application for others to inspect: Your debugger showed that the thread is still in wait() after the interrupt call. Can you assure that ALL worker threads are in wait() prior to the interrupt? * If yes: There seems to be no connection with the interlocked queue, the sleep and so on. It should be possible to get rid of all that for a much simpler test program * If no: OK, there seems to be a connection between the wait(), the interrupt and the sleep and/or mutex. In any case, I would assume that by analysing the situation right before the interrupt, you should be able to reproduce the problem with much less code. Hope that helps in any way? Regards, Roland
- Stoney
I've discovered that under circumstances apparently related to timing and load, sending interrupt_all to a thread_group when all the threads are waiting on a boost::condition_variable leaves one thread waiting about 1/3 of the time. This is with boost 1_40_0 running on Mac OS X 10.6.2, with 32-bit boost libraries. Boost uses the posix thread system here. I boiled my app down to some test code that runs as a command-line app. It's a bit longer than I'd like, but this configuration seems to be necessary to invoke the problem. The test uses a queue to pass "tasks" from the main thread to worker threads, and another queue to pass "results" back to the main thread. The problem is most apparent when all the tasks are finished and the queue empties, so that all the worker threads are waiting on the input queue when the main thread sends interrupt_all.
I've looked at the waiting thread in a debugger when this happens, and found that it has been interrupted, but is still waiting on the condition. It looks like it just got missed by the interrupt_all. This is more likely to happen when there are a lot of worker threads (16, or one per core in my testing).
The test code is parked at http://sb.org/ThreadTest.zip, 20KB. It's an XCode 3.2 project, but the five source files could be readily compiled and run in any Unix environment.
I don't see any errors in the code that could cause these failures. There is a work-around, which is to interrupt the waiting thread again. This required a modified version of thread_group so I could do a timed_join_all on it.
I welcome any suggestions about what could be wrong here, or ways to simplify the test to make it more suitable for a bug report.
- Stoney
On Dec 1, 2009, at 3:29 AM, Roland Bock wrote:
Stonewall Ballard wrote:
The pthread_cond_broadcast() or pthread_cond_signal() functions may be called by a thread whether or not it currently owns the mutex that threads calling pthread_cond_wait() or pthread_cond_timedwait() have associated with the condition variable during their waits; however, if predictable scheduling behavior is required, then that mutex shall be locked by the thread calling pthread_cond_broadcast() or pthread_cond_signal().
I think I found the cause of this problem. It seems that the caller of interrupt_all should be holding the mutex associated with the condition on which the threads are waiting. This gave me the clue to try that: http://www.opengroup.org/onlinepubs/009695399/functions/pthread_cond_broadca... thread::interrupt() calls pthread_cond_broadcast in pthread/thread.cpp. Although "predictable scheduling" doesn't seem like it should include a failure to wake up, taking the mutex around the call to thread_pool::interrupt_all() appears to be 100% reliable. I can patch my app to do that, but I don't think there's a general solution. The documentation should include a note that thread::interrupt() isn't reliable unless the caller is holding the mutex associated with the condition variable on which the interrupted thread is waiting. Of course, this could be a bug in the OS X pthreads implementation as well.
Hi,
FWIW, I ran that test of yours several times with varying parameters on my machine (quad core, 64bit, linux) and it did not show a single failure. Of course, since it is not a deterministic effect even on your machine, failure to reproduce does not really mean much, but well, I thought you might like to hear anyway :-)
Thanks, but this doesn't surprise me. Since the reliability drops rapidly as I add threads, I suspect it has something to do with running this on an 8-core (16 hyperthread) machine. I also suspect that it's a Mac OS bug.
And I totally agree: Predictable scheduling should not be required to wake up all threads, especially since the document also says
<snip> The pthread_cond_broadcast() or pthread_cond_signal() functions may be called by a thread whether or not it currently owns the mutex [...] </cite>
Of course, that could just be a promise that it won't crash. I hope you're right, though.
As for boiling down the application for others to inspect: Your debugger showed that the thread is still in wait() after the interrupt call.
Can you assure that ALL worker threads are in wait() prior to the interrupt? * If yes: There seems to be no connection with the interlocked queue, the sleep and so on. It should be possible to get rid of all that for a much simpler test program * If no: OK, there seems to be a connection between the wait(), the interrupt and the sleep and/or mutex.
Yes, I added code to check that, and all 16 threads were waiting on that condition when I interrupted them.
In any case, I would assume that by analysing the situation right before the interrupt, you should be able to reproduce the problem with much less code.
If I understand this correctly, I should be able to reproduce it with pthreads alone (no boost). I'm going to try that when I get some time and file a bug report.
Hope that helps in any way?
Yes, thanks. Reasoning about threads requires code review. - Stoney -- Stonewall Ballard stoney@sb.org http://stoney.sb.org/
Stonewall Ballard wrote:
I think I found the cause of this problem. It seems that the caller of interrupt_all should be holding the mutex associated with the condition on which the threads are waiting.
This looks like a classic CV pitfall when using "atomic" predicates that are not protected by the mutex used for the wait. The basic outline is that thread A does a CV wait on an atomic boolean variable, and thread B sets this variable and does a notify. There exists a race in which A sees false, is preempted, B stores true, does notify, waking up nobody, and then A continues with the wait. The cure is to insert an empty lock+unlock of A's mutex in B between the store and the notify; it doesn't need to encompass the store, and it doesn't need to encompass the notify call, either. I can't read the boost.thread code well enough to be able to diagnose the problem, but from a cursory look, it looks possible to me that condition_variable::wait may perform its interruption check, be preempted, the interrupt can proceed with setting interrupt_requested and doing a broadcast, and then condition_variable::wait to block on its pthread_cond_wait. Your workaround does prevent it, but I think that boost.thread should take care of that internally, by storing a mutex pointer in the thread data, along with the cv pointer.
On Dec 1, 2009, at 9:50 AM, Peter Dimov wrote:
Stonewall Ballard wrote:
I think I found the cause of this problem. It seems that the caller of interrupt_all should be holding the mutex associated with the condition on which the threads are waiting.
This looks like a classic CV pitfall when using "atomic" predicates that are not protected by the mutex used for the wait. The basic outline is that thread A does a CV wait on an atomic boolean variable, and thread B sets this variable and does a notify. There exists a race in which A sees false, is preempted, B stores true, does notify, waking up nobody, and then A continues with the wait. The cure is to insert an empty lock+unlock of A's mutex in B between the store and the notify; it doesn't need to encompass the store, and it doesn't need to encompass the notify call, either.
I can't read the boost.thread code well enough to be able to diagnose the problem, but from a cursory look, it looks possible to me that condition_variable::wait may perform its interruption check, be preempted, the interrupt can proceed with setting interrupt_requested and doing a broadcast, and then condition_variable::wait to block on its pthread_cond_wait.
I've found that all the threads are waiting on the condition variable at the time that I interrupt them, so that scenario doesn't seem to apply. The actual interrupt code in pthread/thread.cpp is: void thread::interrupt() { detail::thread_data_ptr const local_thread_info=get_thread_info(); if(local_thread_info) { lock_guard<mutex> lk(local_thread_info->data_mutex); local_thread_info->interrupt_requested=true; if(local_thread_info->current_cond) { BOOST_VERIFY(!pthread_cond_broadcast(local_thread_info->current_cond)); } } } As you can see, each thread has a mutex protecting its internal state, and the interrupter locks that mutex before setting interrupt_requested to true. I think that this precludes a race condition there. I think that what's actually happening here is that 16 threads are waiting on the same condition, then they're individually interrupted (using the above code). Each interrupt wakes all the threads waiting on the condition, where all but one thread locks the mutex, see that it's not interrupted, then unlocks the mutex and waits. Given the hardware parallelism on my 16-hyperthread Mac Pro, I suspect that there's ample opportunity for a bug.
Your workaround does prevent it, but I think that boost.thread should take care of that internally, by storing a mutex pointer in the thread data, along with the cv pointer.
As I'm getting more convinced that this is bug in pthreads, at least on OS X, it seems unwise to add that to boost::thread. A pthread condition variable knows its mutex, of course, but there doesn't appear to be any way to retrieve it. I'm going to try to reproduce this using just pthreads calls. Thanks for your help. - Stoney -- Stonewall Ballard stoney@sb.org http://stoney.sb.org/
Stonewall Ballard wrote:
On Dec 1, 2009, at 9:50 AM, Peter Dimov wrote:
I can't read the boost.thread code well enough to be able to diagnose the problem, but from a cursory look, it looks possible to me that condition_variable::wait may perform its interruption check, be preempted, the interrupt can proceed with setting interrupt_requested and doing a broadcast, and then condition_variable::wait to block on its pthread_cond_wait.
I've found that all the threads are waiting on the condition variable at the time that I interrupt them, so that scenario doesn't seem to apply.
They are before interrupt_all is issued, but then, as you yourself write:
I think that what's actually happening here is that 16 threads are waiting on the same condition, then they're individually interrupted (using the above code). Each interrupt wakes all the threads waiting on the condition, where all but one thread locks the mutex, see that it's not interrupted, then unlocks the mutex and waits.
they are getting interrupted and awakened one by one, the end result being that condition_variable::wait and thread::interrupt are being called in parallel.
The actual interrupt code in pthread/thread.cpp is: ... As you can see, each thread has a mutex protecting its internal state, and the interrupter locks that mutex before setting interrupt_requested to true. I think that this precludes a race condition there.
It doesn't. inline void condition_variable::wait(unique_lock<mutex>& m) { detail::interruption_checker check_for_interruption(&cond); // thread::interrupt is executed here BOOST_VERIFY(!pthread_cond_wait(&cond,m.mutex()->native_handle())); }
On Dec 1, 2009, at 11:43 AM, Peter Dimov wrote:
Stonewall Ballard wrote:
On Dec 1, 2009, at 9:50 AM, Peter Dimov wrote:
I can't read the boost.thread code well enough to be able to diagnose the problem, but from a cursory look, it looks possible to me that condition_variable::wait may perform its interruption check, be preempted, the interrupt can proceed with setting interrupt_requested and doing a broadcast, and then condition_variable::wait to block on its pthread_cond_wait.
I've found that all the threads are waiting on the condition variable at the time that I interrupt them, so that scenario doesn't seem to apply.
They are before interrupt_all is issued, but then, as you yourself write:
I think that what's actually happening here is that 16 threads are waiting on the same condition, then they're individually interrupted (using the above code). Each interrupt wakes all the threads waiting on the condition, where all but one thread locks the mutex, see that it's not interrupted, then unlocks the mutex and waits.
they are getting interrupted and awakened one by one, the end result being that condition_variable::wait and thread::interrupt are being called in parallel.
The actual interrupt code in pthread/thread.cpp is: ... As you can see, each thread has a mutex protecting its internal state, and the interrupter locks that mutex before setting interrupt_requested to true. I think that this precludes a race condition there.
It doesn't.
inline void condition_variable::wait(unique_lock<mutex>& m) { detail::interruption_checker check_for_interruption(&cond);
// thread::interrupt is executed here
BOOST_VERIFY(!pthread_cond_wait(&cond,m.mutex()->native_handle())); }
On Dec 1, 2009, at 12:12 PM, Anthony Williams wrote:
Peter is right.
The mutex that is locked across the broadcast protects the interrupt flag, but is distinct from the mutex that is associated with the condition variable. It is thus possible for the waiting thread to miss an interruption between when it checks for interruption and when it enters pthread_cond_wait.
Anthony
I see that now. It's necessary for the interrupter to lock the condition variable mutex before interrupting the thread. The likelihood of failure would depend on the number of threads waiting on the condition variable, and the use of an interrupt_all() to interrupt them repeatedly. So this really is a bug in boost::thread, at least with the pthreads implementation. I suppose that makes it easier to fix than a bug in pthreads. Peter's suggestion: On Dec 1, 2009, at 9:50 AM, Peter Dimov wrote:
Your workaround does prevent it, but I think that boost.thread should take care of that internally, by storing a mutex pointer in the thread data, along with the cv pointer.
seems like the right thing to do for a general fix. Thanks to all for your help. I'm confident now that my workaround will keep this from happening in my app, but a proper fix to boost::threads is necessary. I'm surprised that I can't find any other reports of this problem. - Stoney -- Stonewall Ballard stoney@sb.org http://stoney.sb.org/
Stonewall Ballard wrote:
I see that now. It's necessary for the interrupter to lock the condition variable mutex before interrupting the thread. The likelihood of failure would depend on the number of threads waiting on the condition variable, and the use of an interrupt_all() to interrupt them repeatedly.
So this really is a bug in boost::thread, at least with the pthreads implementation. I suppose that makes it easier to fix than a bug in pthreads.
...
I'm surprised that I can't find any other reports of this problem.
You might want to file a ticket for this issue on svn.boost.org so that it doesn't get lost.
On Dec 3, 2009, at 9:38 AM, Peter Dimov wrote:
Stonewall Ballard wrote:
I see that now. It's necessary for the interrupter to lock the condition variable mutex before interrupting the thread. The likelihood of failure would depend on the number of threads waiting on the condition variable, and the use of an interrupt_all() to interrupt them repeatedly.
So this really is a bug in boost::thread, at least with the pthreads implementation. I suppose that makes it easier to fix than a bug in pthreads.
...
I'm surprised that I can't find any other reports of this problem.
You might want to file a ticket for this issue on svn.boost.org so that it doesn't get lost.
Done. Ticket #3735. - Stoney -- Stonewall Ballard stoney@sb.org http://stoney.sb.org/
participants (3)
-
Peter Dimov
-
Roland Bock
-
Stonewall Ballard