Interprocess mutex & condition variable at process termination
Dear Experts,
I've just been surprised by the behaviour of the interprocess mutex and condition variable on abnormal process termination, i.e. they are not automatically released.
Google tells me that I'm not the first to be surprised by this; there have been previous posts here, stack overflow questions etc.
One often-valid observation is that if a process crashes - or otherwise terminates without executing its destructors - while it holds a lock on a shared data structure then the data is probably now corrupt, so unlocking the mutex that protects it is not very useful. I think there is an important case where that does not apply - when the process that crashes is only reading the shared data. In my case, I had written a "monitor" utility that loops forever, waiting on a shared condition, taking the corresponding mutex, and then dumping the shared data to stdout. I had been running this and stopping it by pressing ctrl-C and it had not occurred to me that this might not work as I expected. My attempt at debugging using this utility was making my problems worse, not better! Modifying this code to run destructors on ctrl-C is non-trivial.
I am aware that the SysV shared semaphore is able to undo on process termination (see SEM_UNDO in man semop), and I had assumed that Boost.Interprocess was using this or something like it. I now see that it is using pthreads, which I didn't even realise could work between processes, and I don't think this API has any way to specify process termination behaviour.
Anyway, I'd like to suggest that the interprocess docs should make some mention of the behaviour of the synchronisation primitives on process termination, e.g. somewhere near the beginning of http://www.boost.org/doc/libs/1_63_0/doc/html/interprocess/synchronization_m...
I may now try to implement some primitives that use semop() and unlock automatically. I haven't yet looked at what's involved to implement a condition variable on top of a semaphore, so I may not get very far! Has anyone else ever tried this?
Also, I note that Interprocess is using "old style" times, not std::chrono like the std::mutex/condition do. Are there any plans to update this?
Thanks, Phil.
On 02/15/17 20:42, Phil Endecott via Boost wrote:
Dear Experts,
I've just been surprised by the behaviour of the interprocess mutex and condition variable on abnormal process termination, i.e. they are not automatically released.
Google tells me that I'm not the first to be surprised by this; there have been previous posts here, stack overflow questions etc.
One often-valid observation is that if a process crashes - or otherwise terminates without executing its destructors - while it holds a lock on a shared data structure then the data is probably now corrupt, so unlocking the mutex that protects it is not very useful. I think there is an important case where that does not apply - when the process that crashes is only reading the shared data. In my case, I had written a "monitor" utility that loops forever, waiting on a shared condition, taking the corresponding mutex, and then dumping the shared data to stdout. I had been running this and stopping it by pressing ctrl-C and it had not occurred to me that this might not work as I expected. My attempt at debugging using this utility was making my problems worse, not better! Modifying this code to run destructors on ctrl-C is non-trivial.
I am aware that the SysV shared semaphore is able to undo on process termination (see SEM_UNDO in man semop), and I had assumed that Boost.Interprocess was using this or something like it. I now see that it is using pthreads, which I didn't even realise could work between processes, and I don't think this API has any way to specify process termination behaviour.
There is a way to handle this case, but this API is not universally supported:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_...
If that API is not supported on your platform, you may want to avoid locking the mutex without a timeout (i.e. failing to acquire a mutex within a given time should be considered an indication that the mutex has been abandoned in the locked state).
In general, synchronization primitives that reside in shared memory (such as pthread mutexes or Boost.Interprocess mutexes) should be considered vulnerable to (a) corruption and (b) becoming unusable (e.g. indefinitely locked) because of a misbehaving user process. That follows from the fact that such primitives typically do not include any other resources, such as handles to kernel objects or file descriptors, and as such "don't exist" for the kernel (consequently, the kernel cannot release them on process termination). The robust mutexes that I referenced above are an exception to that general rule.
Named primitives, such as SysV semaphores, are typically better protected because there is at least a file descriptor or something similar that corresponds to the name, and there is usually a limited API to interact with the primitive (i.e. you usually don't have direct access to the primitive's data). There are a number of named synchronization primitives in Boost.Interprocess, although I don't think they provide an "auto unlock on process termination" feature.
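A minimal sketch of the robust-mutex approach referenced above, assuming a platform (such as Linux with glibc) where pthread_mutexattr_setrobust() is available. The helper names init_robust_mutex and lock_robust and the lock_result enum are hypothetical and exist only for illustration; the pthread_* calls are the POSIX API. In real use the pthread_mutex_t would be placed in shared memory.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>

// Hypothetical helper: initialise a process-shared, robust mutex.
// Returns 0 on success or a pthread error code.
inline int init_robust_mutex(pthread_mutex_t* m) {
    assert(m != nullptr);
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) return rc;
    rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    if (rc == 0) rc = pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    if (rc == 0) rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

enum class lock_result { acquired, recovered, failed };

// Hypothetical helper: lock the mutex and report whether the previous
// owner died while holding it.
inline lock_result lock_robust(pthread_mutex_t* m) {
    assert(m != nullptr);
    int rc = pthread_mutex_lock(m);
    if (rc == 0) return lock_result::acquired;
    if (rc == EOWNERDEAD) {
        // We hold the lock, but the previous owner terminated without
        // unlocking. A real wrapper would validate or repair the protected
        // data here before marking the mutex consistent.
        if (pthread_mutex_consistent(m) != 0) return lock_result::failed;
        return lock_result::recovered;
    }
    return lock_result::failed;  // e.g. ENOTRECOVERABLE
}
```

A caller that sees lock_result::recovered still owns the lock and unlocks it as usual; whether it is safe to continue depends on what the dead owner was doing with the shared data.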
Anyway, I'd like to suggest that the interprocess docs should make some mention of the behaviour of the synchronisation primitives on process termination, e.g. somewhere near the beginning of http://www.boost.org/doc/libs/1_63_0/doc/html/interprocess/synchronization_m...
I may now try to implement some primitives that use semop() and unlock automatically. I haven't yet looked at what's involved to implement a condition variable on top of a semaphore, so I may not get very far! Has anyone else ever tried this?
If you want (more or less) reliable interprocess synchronization, you will currently have to implement it yourself. There are a number of compromises to make along the way. For instance, the pthread robust mutex API does not quite fit into the traditional C++ mutex API, so one has to improvise. In the absence of robust mutexes, the timeout workaround is not universally applicable, and the timeout itself is, obviously, case-specific. Also, most of these APIs are not fully portable (not between Windows and POSIX-compatible systems, anyway), so you end up with OS-specific branches.
I did implement this in a few of my projects. One example is Boost.Log, where I opportunistically use robust mutexes:
https://github.com/boostorg/log/blob/develop/src/posix/ipc_sync_wrappers.hpp
https://github.com/boostorg/log/blob/develop/src/posix/ipc_reliable_message_...
You can see that the Windows implementation is quite different:
https://github.com/boostorg/log/blob/develop/src/windows/ipc_sync_wrappers.h...
https://github.com/boostorg/log/blob/develop/src/windows/ipc_sync_wrappers.c...
https://github.com/boostorg/log/blob/develop/src/windows/ipc_reliable_messag...
The best solution to these problems, however, is to avoid locks altogether and use lock-free algorithms in such a way that any data in the shared memory is valid and can be handled.
Andrey Semashev wrote:
On 02/15/17 20:42, Phil Endecott via Boost wrote:
I've just been surprised by the behaviour of the interprocess mutex and condition variable on abnormal process termination, i.e. they are not automatically released.
There is a way to handle this case, but this API is not universally supported:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_...
Thanks for pointing that out. For some reason I thought that "robust" mutexes solved some other problem. I think that in my case where I have some processes that only read the shared data, it would be possible to handle EOWNERDEAD by either continuing if the previous lock were a read-lock, or by throwing if it were a write lock. I don't think Interprocess does any of this, does it?
The best solution to these problems, however, is to avoid locks altogether and use lock-free algorithms in such a way that any data in the shared memory is valid and can be handled.
Maybe, though my next concern would be how to implement the functionality of a condition variable. What happens if a process crashes while it is waiting on a condition variable? I did once know how Linux implements condition variables using atomics and futexes, and I think it's probably safe to crash in this situation, but I guess there are no guarantees. Thanks, Phil.
On 16/02/2017 12:31, Phil Endecott via Boost wrote:
Andrey Semashev wrote:
On 02/15/17 20:42, Phil Endecott via Boost wrote:
I've just been surprised by the behaviour of the interprocess mutex and condition variable on abnormal process termination, i.e. they are not automatically released.
There is a way to handle this case, but this API is not universally supported:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_...
I think that in my case where I have some processes that only read the shared data, it would be possible to handle EOWNERDEAD by either continuing if the previous lock were a read-lock, or by throwing if it were a write lock. I don't think Interprocess does any of this, does it?
The best solution to these problems, however, is to avoid locks altogether and use lock-free algorithms in such a way that any data in the shared memory is valid and can be handled.
Maybe, though my next concern would be how to implement the functionality of a condition variable. What happens if a process crashes while it is waiting on a condition variable? I did once know how Linux implements condition variables using atomics and futexes, and I think it's probably safe to crash in this situation, but I guess there are no guarantees.
The only portable way that I know of to build an interprocess mutex which knows when one of the processes has died is to use a pipe instance. You write a byte to "unlock" the mutex and read all bytes until it's empty to "lock" the mutex. select() can be used to block until the mutex is unlocked.
I've built a fair few of these over the years and performance is actually pretty good considering what it is. I'm kinda surprised that Boost.Interprocess doesn't have one yet.
Niall
--
ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
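A rough sketch of the pipe-as-mutex idea described above, not Niall's actual implementation. It simplifies in two labelled ways: a single token byte is read on lock (rather than draining the pipe), and poll() is used in place of select(). The class name pipe_mutex and its members are hypothetical. For real interprocess use the pipe would have to be shared, e.g. inherited across fork() or opened as a named FIFO, which this sketch does not do.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <poll.h>
#include <stdexcept>
#include <unistd.h>

// Hypothetical class: one byte sitting in the pipe means "unlocked".
class pipe_mutex {
public:
    pipe_mutex() {
        if (::pipe(fds_) != 0) throw std::runtime_error("pipe() failed");
        // Make the read end non-blocking so that losing a race for the
        // token cannot block inside read(); then deposit the initial token.
        int flags = ::fcntl(fds_[0], F_GETFL);
        if (flags < 0 || ::fcntl(fds_[0], F_SETFL, flags | O_NONBLOCK) != 0 ||
            !unlock()) {
            ::close(fds_[0]);
            ::close(fds_[1]);
            throw std::runtime_error("pipe_mutex setup failed");
        }
    }
    ~pipe_mutex() {
        ::close(fds_[0]);
        ::close(fds_[1]);
    }
    pipe_mutex(const pipe_mutex&) = delete;
    pipe_mutex& operator=(const pipe_mutex&) = delete;

    // Wait up to timeout_ms milliseconds (-1 waits forever) for the token.
    // Returns false on timeout, end-of-file or error. The timeout restarts
    // after an interrupted or lost race, which is good enough for a sketch.
    bool try_lock_for(int timeout_ms) {
        for (;;) {
            pollfd p{fds_[0], POLLIN, 0};
            int n = ::poll(&p, 1, timeout_ms);
            if (n < 0 && errno == EINTR) continue;
            if (n <= 0) return false;           // timeout or poll error
            char token;
            ssize_t r = ::read(fds_[0], &token, 1);
            if (r == 1) return true;            // we took the token
            if (r == 0) return false;           // EOF: no writers remain
            if (errno != EAGAIN && errno != EINTR) return false;
            // Someone else took the token first; wait again.
        }
    }

    bool lock() { return try_lock_for(-1); }

    // Put the token back. Returns false if the write failed.
    bool unlock() {
        const char token = 1;
        ssize_t w;
        do {
            w = ::write(fds_[1], &token, 1);
        } while (w < 0 && errno == EINTR);
        return w == 1;
    }

private:
    int fds_[2];
};
```

Because the lock state is a kernel object rather than bytes in shared memory, a waiter can bound its wait and can observe end-of-file once no process holds the write end any more; how exactly a dead owner is detected in a given design is an assumption here, since the message above does not spell it out.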
On 02/16/17 15:31, Phil Endecott via Boost wrote:
Andrey Semashev wrote:
On 02/15/17 20:42, Phil Endecott via Boost wrote:
I've just been surprised by the behaviour of the interprocess mutex and condition variable on abnormal process termination, i.e. they are not automatically released.
There is a way to handle this case, but this API is not universally supported:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_...
Thanks for pointing that out. For some reason I thought that "robust" mutexes solved some other problem.
I think that in my case where I have some processes that only read the shared data, it would be possible to handle EOWNERDEAD by either continuing if the previous lock were a read-lock, or by throwing if it were a write lock. I don't think Interprocess does any of this, does it?
No, to my knowledge, Boost.Interprocess doesn't use robust mutexes. You have to know though that condition variables also include a mutex, and not a robust one. I don't know of any way, portable or not, to make a CV use a robust mutex internally.
The best solution to these problems, however, is to avoid locks altogether and use lock-free algorithms in such a way that any data in the shared memory is valid and can be handled.
Maybe, though my next concern would be how to implement the functionality of a condition variable. What happens if a process crashes while it is waiting on a condition variable?
It depends on whether the process was actually blocked on the futex used by the CV to wait for notifications. If so, *I think* the failure might be recoverable - some other thread will notify more threads than are actually waiting, and that's harmless. If not, then the internal mutex in the CV was abandoned in the locked state and the CV is unusable.
Basically, when you want robust mutexes, CVs are not an option. You might want to consider process-shared semaphores as a replacement.
http://pubs.opengroup.org/onlinepubs/7908799/xsh/sem_init.html
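A small sketch of using a process-shared POSIX semaphore as a notification primitive, assuming a system that supports sem_init() with a non-zero pshared argument and anonymous shared mappings (e.g. Linux). The shared_event type and the function names are hypothetical. The mapping created here is visible to children created with fork(); unrelated processes would need a named shared memory object instead.

```cpp
#include <cassert>
#include <cerrno>
#include <semaphore.h>
#include <sys/mman.h>
#include <time.h>

// Hypothetical wrapper: a semaphore living in memory that is shared with
// any child process created by fork() after this call.
struct shared_event {
    sem_t sem;
};

inline shared_event* create_shared_event() {
    void* p = mmap(nullptr, sizeof(shared_event), PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    shared_event* ev = static_cast<shared_event*>(p);
    // Second argument 1 = shared between processes; initial count 0.
    if (sem_init(&ev->sem, 1, 0) != 0) {
        munmap(p, sizeof(shared_event));
        return nullptr;
    }
    return ev;
}

// Wake one waiter (or let the next wait return immediately).
inline bool notify(shared_event* ev) {
    assert(ev != nullptr);
    return sem_post(&ev->sem) == 0;
}

// Wait for a notification for at most timeout_ms milliseconds.
inline bool wait_for(shared_event* ev, long timeout_ms) {
    assert(ev != nullptr);
    timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }
    int rc;
    do {
        rc = sem_timedwait(&ev->sem, &deadline);
    } while (rc != 0 && errno == EINTR);
    return rc == 0;
}

inline void destroy_shared_event(shared_event* ev) {
    if (ev == nullptr) return;
    sem_destroy(&ev->sem);
    munmap(ev, sizeof(shared_event));
}
```

Unlike a condition variable, a semaphore remembers a notification that arrives before anyone waits, and waiting does not involve a separate internal mutex that could be left locked; the timed wait is there so that a waiter is never blocked indefinitely.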
I did once know how Linux implements condition variables using atomics and futexes, and I think it's probably safe to crash in this situation, but I guess there are no guarantees.
Yes, futexes are the way to go if you are targeting Linux specifically. The important advantage is that you're in control of the mutex/CV implementation and can define the behavior if the synchronization primitive was abandoned. The tricky part is to detect when it is abandoned.
On Feb 16, 2017, at 7:31 AM, Phil Endecott via Boost wrote:
Maybe, though my next concern would be how to implement the functionality of a condition variable. What happens if a process crashes while it is waiting on a condition variable? I did once know how Linux implements condition variables using atomics and futexes, and I think it's probably safe to crash in this situation, but I guess there are no guarantees.
I recently developed an application which uses a process-shared condition variable to coordinate graphics updates between an "author" and one or more "viewer" processes. I found that on Linux, not only was killing a viewer while it was waiting on the condition variable harmless, but killing and relaunching the author (which reinitialized the mutex and CV) had no adverse effect -- the application continued to work. It's undefined behavior, of course, so I was pleasantly surprised.
By contrast, the OS X I tested on doesn't even appear to be POSIX-conforming. Not only does relaunching the viewer after killing it mid-wait cause failures in the viewer *and* the tester (as one could at least anticipate, if not hope for), but a second viewer launched while the first was still waiting failed in pthread_cond_wait() (returning EINVAL), thus effectively limiting Apple's implementation of process-shared condition variables to two processes.
So yeah, no guarantees.
Josh
On 15/02/2017 18:42, Phil Endecott via Boost wrote:
Dear Experts,
I've just been surprised by the behaviour of the interprocess mutex and condition variable on abnormal process termination, i.e. they are not automatically released.
Google tells me that I'm not the first to be surprised by this; there have been previous posts here, stack overflow questions etc.
One often-valid observation is that if a process crashes - or otherwise terminates without executing its destructors - while it holds a lock on a shared data structure then the data is probably now corrupt, so unlocking the mutex that protects it is not very useful. I think there is an important case where that does not apply - when the process that crashes is only reading the shared data. In my case, I had written a "monitor" utility that loops forever, waiting on a shared condition, taking the corresponding mutex, and then dumping the shared data to stdout. I had been running this and stopping it by pressing ctrl-C and it had not occurred to me that this might not work as I expected. My attempt at debugging using this utility was making my problems worse, not better! Modifying this code to run destructors on ctrl-C is non-trivial.
There is a very poor but effective workaround if your application can support long delays. Search for BOOST_INTERPROCESS_ENABLE_TIMEOUT_WHEN_LOCKING and BOOST_INTERPROCESS_TIMEOUT_WHEN_LOCKING_DURATION_MS. These macros are not documented, but they should be.
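The macros above are specific to Boost.Interprocess. The sketch below shows the same "give up after a timeout" idea written directly against pthreads; it is not how Boost.Interprocess implements it. The function name lock_or_give_up and the timed_lock_result enum are hypothetical, and the timeout value is something each application has to choose for itself.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>
#include <time.h>

enum class timed_lock_result { acquired, timed_out, error };

// Hypothetical helper: try to lock for at most timeout_ms milliseconds.
// A timeout is treated as a hint that the owner may have died while
// holding the lock; the mutex itself is NOT released by this function.
inline timed_lock_result lock_or_give_up(pthread_mutex_t* m, long timeout_ms) {
    assert(m != nullptr);
    // pthread_mutex_timedlock() takes an absolute CLOCK_REALTIME deadline.
    timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }
    int rc = pthread_mutex_timedlock(m, &deadline);
    if (rc == 0) return timed_lock_result::acquired;
    if (rc == ETIMEDOUT) return timed_lock_result::timed_out;
    return timed_lock_result::error;
}
```

A caller that gets timed_lock_result::timed_out cannot tell a dead owner from a slow one, which is why this only suits applications that can tolerate long delays.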
I am aware that the SysV shared semaphore is able to undo on process termination (see SEM_UNDO in man semop), and I had assumed that Boost.Interprocess was using this or something like it. I now see that it is using pthreads, which I didn't even realise could work between processes, and I don't think this API has any way to specify process termination behaviour.
Yes, but SysV shared semaphores can't be placed in shared memory.
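For reference, a minimal sketch of the SEM_UNDO behaviour discussed above, assuming a Linux-like system with SysV IPC enabled. The helper names (create_binary_sem, sysv_lock, sysv_unlock, remove_sem, lock_survives_owner_death) and the semun_arg union are hypothetical; semget(), semctl() and semop() are the real system calls. Note that the semaphore is identified by a kernel id rather than by an address in shared memory, which is the limitation mentioned above.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Linux leaves the definition of this union to the caller (see semctl(2)).
union semun_arg {
    int val;
    struct semid_ds* buf;
    unsigned short* array;
};

// Create a private semaphore set with one semaphore initialised to 1.
inline int create_binary_sem() {
    int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (id < 0) return -1;
    semun_arg arg;
    arg.val = 1;
    if (semctl(id, 0, SETVAL, arg) < 0) {
        semctl(id, 0, IPC_RMID);
        return -1;
    }
    return id;
}

// Apply delta (-1 = lock, +1 = unlock) with SEM_UNDO so that the kernel
// reverts this process's net adjustment when the process terminates.
inline bool sysv_op(int id, short delta, bool wait) {
    sembuf op{};
    op.sem_num = 0;
    op.sem_op = delta;
    op.sem_flg = static_cast<short>(SEM_UNDO | (wait ? 0 : IPC_NOWAIT));
    while (semop(id, &op, 1) != 0) {
        if (errno != EINTR) return false;
    }
    return true;
}

inline bool sysv_lock(int id, bool wait = true) { return sysv_op(id, -1, wait); }
inline bool sysv_unlock(int id) { return sysv_op(id, +1, true); }
inline void remove_sem(int id) { semctl(id, 0, IPC_RMID); }

// Demonstration: a child takes the lock and exits without releasing it.
// Returns true if the parent can then take the lock without blocking.
inline bool lock_survives_owner_death(int id) {
    pid_t pid = fork();
    if (pid < 0) return false;
    if (pid == 0) {
        sysv_lock(id);
        _exit(0);  // terminate while still holding the lock
    }
    int status = 0;
    waitpid(pid, &status, 0);
    return sysv_lock(id, false);
}
```

The same caveat raised at the start of the thread applies: the kernel releases the lock, but it knows nothing about whether the data the lock protected was left in a consistent state.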
Anyway, I'd like to suggest that the interprocess docs should make some mention of the behaviour of the synchronisation primitives on process termination, e.g. somewhere near the beginning of http://www.boost.org/doc/libs/1_63_0/doc/html/interprocess/synchronization_m...
Good suggestion.
I may now try to implement some primitives that use semop() and unlock automatically. I haven't yet looked at what's involved to implement a condition variable on top of a semaphore, so I may not get very far! Has anyone else ever tried this?
There are several algorithms, but the problem is placing them in shared memory. See an adapter in: boost/interprocess/sync/detail/condition_algorithm_8a.hpp
Also, I note that Interprocess is using "old style" times, not std::chrono like the std::mutex/condition do. Are there any plans to update this?
Yes, but I really can't find the time to implement it. The idea is that it would support both std::chrono and boost::chrono. Patches welcome ;-)
Best, Ion
participants (5)
- Andrey Semashev
- Ion Gaztañaga
- Josh Juran
- Niall Douglas
- Phil Endecott