[thread] Address sanitizer failures on marshall-mac
Hi, I'm observing Boost.Log and Boost.Atomic test failures on marshall-mac with AddressSanitizer enabled. http://www.boost.org/development/tests/develop/developer/output/marshall-mac... http://www.boost.org/development/tests/develop/developer/output/marshall-mac... The error messages don't contain function names but the backtraces have Boost.Thread on top. I can also see similar failures in Boost.Thread: http://www.boost.org/development/tests/develop/developer/output/marshall-mac... Is this a known problem in Boost.Thread? Can the backtraces be deciphered so that function names are shown?
On 7 Mar 2014 at 21:19, Andrey Semashev wrote:
I'm observing Boost.Log and Boost.Atomic test failures on marshall-mac with AddressSanitizer enabled.
The error messages don't contain function names but the backtraces have Boost.Thread on top. I can also see similar failures in Boost.Thread:
http://www.boost.org/development/tests/develop/developer/output/marshall-mac...
Is this a known problem in Boost.Thread? Can the backtraces be deciphered so that function names are shown?
I've seen these too when sanitising AFIO. I investigated a few in Boost.Thread and found them false positives. This isn't to say they are all false positives, just that there can appear to be a lot of them. We really ought to mark them up inline with the source using the valgrind magic macros (which I've done for AFIO). Niall -- Currently unemployed and looking for work in Ireland. Work Portfolio: http://careers.stackoverflow.com/nialldouglas/
Niall Douglas wrote:
I've seen these too when sanitising AFIO. I investigated a few in Boost.Thread and found them false positives.
How could such an error: ==1193==ERROR: AddressSanitizer: attempting free on address which was not malloc()-ed: 0x61600000f7e0 in thread T0 be a false positive?
On 7 Mar 2014 at 20:12, Peter Dimov wrote:
I've seen these too when sanitising AFIO. I investigated a few in Boost.Thread and found them false positives.
How could such an error:
==1193==ERROR: AddressSanitizer: attempting free on address which was not malloc()-ed: 0x61600000f7e0 in thread T0
be a false positive?
I saw a few of these when sanitising AFIO. They turned out to be caused by AFIO using Boost.Thread incorrectly. Again, I'm not saying that these bugs aren't in Thread, just that for the parts of Thread that AFIO uses I didn't find any not caused by my own poor programming. I think you mentioned backtracing being a problem? Turn on asynchronous unwind tables and frame pointers, disable inlining, and build with debug info. I have Travis CI do a valgrind pass on all of AFIO's unit tests every time I make a commit; I could probably pull the build flags for that if you'd like. I may also have a valgrind memcheck filter file around, which might help you too. Niall
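A minimal sketch of the build settings suggested above, assuming GCC or Clang. The file name and the three function names are hypothetical and exist only to give a sanitizer report a recognisable call chain. If frames still show up as raw addresses, pointing the ASAN_SYMBOLIZER_PATH environment variable at an llvm-symbolizer binary may help, though whether that applies to the marshall-mac setup is an assumption.

```cpp
// asan_demo.cpp (hypothetical file name)
// One possible build line following the advice above:
//   c++ -g -O0 -fno-inline -fno-omit-frame-pointer
//       -fasynchronous-unwind-tables -fsanitize=address asan_demo.cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Innermost frame: an out-of-range index here is the kind of access
// AddressSanitizer would report, with this function on top of the trace.
int inner_at(const std::vector<int>& v, std::size_t i) {
    assert(i < v.size());
    return v[i];
}

// Middle frame: stays visible in the backtrace when inlining is disabled.
int middle_sum(const std::vector<int>& v) {
    int total = 0;
    for (std::size_t i = 0; i < v.size(); ++i) total += inner_at(v, i);
    return total;
}

// Outermost frame: builds the data and walks it.
int outer_sum() {
    std::vector<int> v;
    v.push_back(1);
    v.push_back(2);
    v.push_back(3);
    return middle_sum(v);
}
```

With inlining off and frame pointers kept, a report raised inside inner_at should list all three functions by name once symbolization works.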
On 07/03/14 18:53, Niall Douglas wrote:
On 7 Mar 2014 at 21:19, Andrey Semashev wrote:
I'm observing Boost.Log and Boost.Atomic test failures on marshall-mac with AddressSanitizer enabled.
The error messages don't contain function names but the backtraces have Boost.Thread on top. I can also see similar failures in Boost.Thread:
http://www.boost.org/development/tests/develop/developer/output/marshall-mac...
Is this a known problem in Boost.Thread? Can the backtraces be deciphered so that function names are shown? I've seen these too when sanitising AFIO. I investigated a few in Boost.Thread and found them false positives.
This isn't to say they are all false positives, just that there can appear to be a lot of them. We really ought to mark them up inline with the source using the valgrind magic macros (which I've done for AFIO).
Niall
Hi, I was aware of this issue, but I did not manage to diagnose what was wrong. Could you explain how you found that some of them are false positives? Could you provide a patch with the needed annotations? Thanks, Vicente
On 7 Mar 2014 at 19:13, Vicente J. Botet Escriba wrote:
I was aware of this issue, but I did not manage to diagnose what was wrong.
Could you explain how you found that some of them are false positives?
Through inspection, the usual debugging experience. Figuring out the causes of memcheck failures is neither easy nor quick. The main cause of false positives is when Boost uses atomics to implement low level primitives such as locks. You need to annotate all CAS lock operations with the fact they are CAS locks - that way a thread sanitiser knows you're serialising code. Otherwise it appears you're riddling your code with race conditions. Markup is very easy, but tedious. You effectively must audit every line of code.
Could you provide a patch with the needed annotations?
Marking up all of Boost.Thread with all the necessary annotations and fixing up any problems revealed is probably a full (and extremely worthwhile) GSoC. Niall
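As an illustration of the kind of CAS lock discussed in this message, here is a minimal sketch using std::atomic. It is not Boost.Thread's or AFIO's actual code. The two DEMO_ANNOTATE_* macros are hypothetical no-op placeholders that only mark where race-detector annotations would go; in a valgrind-annotated build they might map onto macros from valgrind's helgrind.h or drd.h, such as ANNOTATE_HAPPENS_BEFORE and ANNOTATE_HAPPENS_AFTER.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical no-op hooks marking where annotations would be placed.
#define DEMO_ANNOTATE_LOCK_ACQUIRED(addr) ((void)(addr))
#define DEMO_ANNOTATE_LOCK_RELEASING(addr) ((void)(addr))

class cas_spinlock {
    std::atomic<bool> locked_;
public:
    cas_spinlock() : locked_(false) {}

    void lock() {
        bool expected = false;
        // Spin until the CAS flips false -> true. The acquire ordering on
        // success pairs with the release store in unlock().
        while (!locked_.compare_exchange_weak(expected, true,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed)) {
            expected = false;  // a failed CAS overwrites 'expected'
        }
        DEMO_ANNOTATE_LOCK_ACQUIRED(this);
    }

    void unlock() {
        DEMO_ANNOTATE_LOCK_RELEASING(this);
        locked_.store(false, std::memory_order_release);
    }
};

// Increments a plain int from several threads, serialised only by the lock.
int count_with_lock(int threads, int iterations) {
    cas_spinlock lock;
    int counter = 0;  // deliberately non-atomic
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

A tool that only sees the plain increments of counter, without understanding that the CAS loop serialises them, would report them as racing, which is the false-positive pattern described above.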
On Fri, Mar 7, 2014 at 10:35 PM, Niall Douglas wrote:
On 7 Mar 2014 at 19:13, Vicente J. Botet Escriba wrote:
I was aware of this issue, but I did not manage to diagnose what was wrong.
Could you explain how you found that some of them are false positives?
Through inspection, the usual debugging experience. Figuring out the causes of memcheck failures is neither easy nor quick.
The main cause of false positives is when Boost uses atomics to implement low level primitives such as locks. You need to annotate all CAS lock operations with the fact they are CAS locks - that way a thread sanitiser knows you're serialising code. Otherwise it appears you're riddling your code with race conditions.
I think you're confusing ThreadSanitizer and AddressSanitizer. Double free is never a false positive.
Markup is very easy, but tedious. You effectively must audit every line of code.
Could you provide a patch with the needed annotations?
Marking up all of Boost.Thread with all the necessary annotations and fixing up any problems revealed is probably a full (and extremely worthwhile) GSoC.
I'd be careful with such markup. I don't know how exactly ThreadSanitizer works, but if markup means calling some function in runtime then that's probably not an acceptable solution in the context of atomics.
On 8 Mar 2014 at 0:55, Andrey Semashev wrote:
The main cause of false positives is when Boost uses atomics to implement low level primitives such as locks. You need to annotate all CAS lock operations with the fact they are CAS locks - that way a thread sanitiser knows you're serialising code. Otherwise it appears you're riddling your code with race conditions.
I think you're confusing ThreadSanitizer and AddressSanitizer. Double free is never a false positive.
No, but I think you didn't understand my post. Double frees which apparently occur in Boost.Thread may in fact be double deletes in upstream code: if type Foo's destructor at some point destroys, say, a boost::future<>, double deleting Foo will appear as if Boost.Thread is double freeing. In fact the fault is in upstream code, not Boost.Thread.
Marking up all of Boost.Thread with all the necessary annotations and fixing up any problems revealed is probably a full (and extremely worthwhile) GSoC.
I'd be careful with such markup. I don't know how exactly ThreadSanitizer works, but if markup means calling some function in runtime then that's probably not an acceptable solution in the context of atomics.
No functions are called in valgrind inserted markup. Just some harmless bytes which act as fingerprints. Normally the CPU skips right over them. Many codebases ship valgrind fingerprints in final release build images (AFIO is one of them). I found no statistically significant difference in performance nor bloat on out of order CPUs. In short: valgrind is very well designed, unsurprising as it came from the mind of Julian Seward. Niall
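To make the double-delete point earlier in this message concrete, here is a small simulation. It deliberately avoids a real double free, which would be undefined behaviour. Every name is hypothetical: toy_heap stands in for a checking allocator, library_handle for a library type such as a future's shared state, and foo for the user's type that owns one.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for a checking allocator: it only tracks which ids are live.
struct toy_heap {
    std::set<int> live;
    std::vector<std::string> errors;
    int next_id;

    toy_heap() : next_id(1) {}

    int allocate() {
        live.insert(next_id);
        return next_id++;
    }

    // 'where' names the code issuing the free, like the top stack frame.
    void release(int id, const std::string& where) {
        if (live.erase(id) == 0)
            errors.push_back("invalid free reported in " + where);
    }
};

// Plays the role of a library object that owns one heap block.
struct library_handle {
    toy_heap& heap;
    int block;
    explicit library_handle(toy_heap& h) : heap(h), block(h.allocate()) {}
    void destroy() { heap.release(block, "library_handle::destroy"); }
};

// Plays the role of user code that owns a library object.
struct foo {
    library_handle member;
    explicit foo(toy_heap& h) : member(h) {}
    void destroy() { member.destroy(); }
};

// The bug is here, in user code: the same foo is torn down twice.
std::vector<std::string> double_destroy_foo() {
    toy_heap heap;
    foo f(heap);
    f.destroy();
    f.destroy();
    return heap.errors;
}
```

The single recorded error names library_handle::destroy, even though the mistake is the second f.destroy() call in user code. The report is still a genuine error; it is only attributed to the frame that happened to issue the free.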
On Saturday 08 March 2014 14:25:45 Niall Douglas wrote:
On 8 Mar 2014 at 0:55, Andrey Semashev wrote:
The main cause of false positives is when Boost uses atomics to implement low level primitives such as locks. You need to annotate all CAS lock operations with the fact they are CAS locks - that way a thread sanitiser knows you're serialising code. Otherwise it appears you're riddling your code with race conditions.
I think you're confusing ThreadSanitizer and AddressSanitizer. Double free is never a false positive.
No, but I think you didn't understand my post. Double frees which apparently occur in Boost.Thread may in fact be double deletes in upstream code: if type Foo's destructor at some point destroys, say, a boost::future<>, double deleting Foo will appear as if Boost.Thread is double freeing. In fact the fault is in upstream code, not Boost.Thread.
That doesn't mean that the error is false positive. It just means that the error is not in Boost.Thread. I didn't say I'm 100% positive that the problem is in Boost.Thread. Although I'd say this looks like the most probable case given that the problem is indicated by multiple libraries, including Boost.Thread itself.
Marking up all of Boost.Thread with all the necessary annotations and fixing up any problems revealed is probably a full (and extremely worthwhile) GSoC.
I'd be careful with such markup. I don't know how exactly ThreadSanitizer works, but if markup means calling some function in runtime then that's probably not an acceptable solution in the context of atomics.
No functions are called in valgrind inserted markup. Just some harmless bytes which act as fingerprints. Normally the CPU skips right over them.
Even nop requires decoding effort, so it does have a cost. And I was referring to ThreadSanitizer markup. Does it use the same markup as valgrind does?
On 8 Mar 2014 at 19:18, Andrey Semashev wrote:
Even nop requires decoding effort, so it does have a cost.
As I said before, on a recent out of order CPU it usually has no statistically significant cost.
And I was referring to ThreadSanitizer markup. Does it use the same markup as valgrind does?
Firstly, the only valid reason to use AddressSanitiser over valgrind memcheck is when valgrind is too slow. The problem with AddressSanitiser, like some other clang/GCC sanitisers, is that *everything* in the process space needs to be compiled with the sanitiser turned on. Otherwise memory corruption caused by an uninstrumented bit of code can magically appear in other places, and forget about debugging the cause easily. I personally use those sanitisers which require everything to be compiled with them as a quick smoke check, but use valgrind for finding causes. The ThreadSanitiser doesn't need everything to be instrumented, and it has the huge advantage of speed and memory consumption over valgrind's helgrind or DRD. ThreadSanitiser has two versions, one based on valgrind where a subset of DRD/helgrind instrumentation is recognised, the other based on a clang backend which currently does not recognise DRD/helgrind instrumentation. The DRD/helgrind markup is really about *documentation* of use of non-standard threading constructs, plus you can subvert the macros to do other useful things e.g. auto-generate a ThreadSanitiser suppressions file for you. The clang/GCC ThreadSanitizer is still in beta, and reusing DRD/helgrind markup is a very likely feature add in the future. Hope this explains everything. Niall
On 8 March 2014 16:07, Niall Douglas wrote:
The ThreadSanitiser doesn't need everything to be instrumented, and it has the huge advantage of speed and memory consumption over valgrind's helgrind or DRD. ThreadSanitiser has two versions, one based on valgrind where a subset of DRD/helgrind instrumentation is recognised, the other based on a clang backend which currently does not recognise DRD/helgrind instrumentation.
The valgrind-based ThreadSanitizer v1 is not really being developed or supported now IIUC. The new version is present in both Clang and GCC, and if you use those compilers' built-in atomic operations (either the older __sync_xxxx or newer __atomic_xxx intrinsics) then tsan should understand them and need no instrumentation.
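For readers unfamiliar with the built-ins mentioned here, a minimal sketch of both families follows, assuming GCC or Clang. The function names are hypothetical. Whether a given tsan version understands every built-in is as stated in the message above and is not verified here.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Newer __atomic family: atomic increments applied directly to a plain long.
long count_with_builtins(int threads, int iterations) {
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, iterations] {
            for (int i = 0; i < iterations; ++i)
                __atomic_fetch_add(&counter, 1L, __ATOMIC_RELAXED);
        });
    }
    for (auto& th : pool) th.join();
    return __atomic_load_n(&counter, __ATOMIC_ACQUIRE);
}

// Older __sync family: same operation, always with full-barrier semantics.
long count_with_sync(int iterations) {
    long counter = 0;
    for (int i = 0; i < iterations; ++i)
        __sync_fetch_and_add(&counter, 1L);
    return counter;
}
```

Because the atomicity is expressed in the source through the built-in, the compiler can pass that information to its instrumentation, rather than leaving a tool to guess from the generated instructions.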
On 8 Mar 2014 at 18:29, Jonathan Wakely wrote:
The new version is present in both Clang and GCC, and if you use those compilers' built-in atomic operations (either the older __sync_xxxx or newer __atomic_xxx intrinsics) then tsan should understand them and need no instrumentation.
Can tsan understand when an atomic is being used for serialisation? I can see a CAS lock being heuristically determined, but some of the fancier semaphore based techniques surely need some explicit markup. Niall
On 8 March 2014 18:34, Niall Douglas wrote:
On 8 Mar 2014 at 18:29, Jonathan Wakely wrote:
The new version is present in both Clang and GCC, and if you use those compilers' built-in atomic operations (either the older __sync_xxxx or newer __atomic_xxx intrinsics) then tsan should understand them and need no instrumentation.
Can tsan understand when an atomic is being used for serialisation? I can see a CAS lock being heuristically determined, but some of the fancier semaphore based techniques surely need some explicit markup.
I believe it can, because it knows that an atomic store with memory_order_release in one thread and a load with memory_order_acquire in another thread implies an ordering, and so can tell there is no race. I don't think tsan v2 even supports any explicit markup, so you couldn't use it if you wanted to.
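The release/acquire pairing described above can be sketched with std::atomic as follows. The names pass_message, payload and ready are hypothetical; this shows only the ordering guarantee of the C++11 memory model, not how tsan tracks it internally.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One thread publishes plain data through an atomic flag; the other
// waits for the flag and then reads the data.
int pass_message() {
    int payload = 0;                  // plain, non-atomic data
    std::atomic<bool> ready(false);
    int seen = -1;

    std::thread producer([&] {
        payload = 42;                                  // ordinary write
        ready.store(true, std::memory_order_release);  // publish it
    });
    std::thread consumer([&] {
        // Spin until the acquire load observes the release store.
        while (!ready.load(std::memory_order_acquire)) {
        }
        seen = payload;  // ordered after the producer's write: no data race
    });

    producer.join();
    consumer.join();
    return seen;
}
```

Once the acquire load has observed the value written by the release store, the write to payload happens-before the read, so a detector that models these orderings has no race to report.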
On 9 Mar 2014 at 10:31, Jonathan Wakely wrote:
Can tsan understand when an atomic is being used for serialisation? I can see a CAS lock being heuristically determined, but some of the fancier semaphore based techniques surely need some explicit markup.
I believe it can, because it knows that an atomic store with memory_order_release in one thread and a load with memory_order_acquire in another thread implies an ordering, and so can tell there is no race.
The reason I'm suspicious this is the case is because valgrind doesn't do this, yet it certainly can tell atomic ops from non-atomic ones (besides, on x86/64 loads always acquire and stores always release anyway). I can see maybe that the compiler knows things that valgrind cannot, but I guess we're probably speculating now.
I don't think tsan v2 even supports any explicit markup, so you couldn't use it if you wanted to.
Sure, but tsan v2 is very explicitly said to be unfinished, and really its huge utility right now is that it's usably quick compared to DRD or helgrind rather than having a superior feature set. There's some design document for tsan v2 around which had a list of stuff they planned, can't seem to find it now. Niall
On 9 March 2014 19:45, Niall Douglas wrote:
On 9 Mar 2014 at 10:31, Jonathan Wakely wrote:
Can tsan understand when an atomic is being used for serialisation? I can see a CAS lock being heuristically determined, but some of the fancier semaphore based techniques surely need some explicit markup.
I believe it can, because it knows that an atomic store with memory_order_release in one thread and a load with memory_order_acquire in another thread implies an ordering, and so can tell there is no race.
The reason I'm suspicious this is the case is because valgrind doesn't do this, yet it certainly can tell atomic ops from non-atomic ones (besides, on x86/64 loads always acquire and stores always release anyway).
That's exactly why valgrind can't tell atomic ops from non-atomic ones (on x86_64). Valgrind just sees the instructions, it doesn't know the context that generated the instruction.
I can see maybe that the compiler knows things that valgrind cannot, but I guess we're probably speculating now.
Of course the compiler knows things valgrind cannot. If the source code has a compiler intrinsic for an atomic op then tsan knows the compiler is prevented from moving things across the operation and knows the generated instruction is being used specifically as an atomic op, not just a plain load or store, and can tag/instrument/whatever that the operation will not result in a data race (even though it might use exactly the same insn as another load/store that might result in a race).
On Fri, Mar 7, 2014 at 10:13 PM, Vicente J. Botet Escriba wrote:
On 07/03/14 18:53, Niall Douglas wrote:
On 7 Mar 2014 at 21:19, Andrey Semashev wrote:
I'm observing Boost.Log and Boost.Atomic test failures on marshall-mac with AddressSanitizer enabled.
Hi,
I was aware of this issue, but I did not manage to diagnose what was wrong.
Do you have or need a ticket for this problem?
On 07/03/14 21:59, Andrey Semashev wrote:
On Fri, Mar 7, 2014 at 10:13 PM, Vicente J. Botet Escriba wrote:
On 07/03/14 18:53, Niall Douglas wrote:
On 7 Mar 2014 at 21:19, Andrey Semashev wrote:
I'm observing Boost.Log and Boost.Atomic test failures on marshall-mac with AddressSanitizer enabled. Hi,
I was aware of this issue, but I did not manage to diagnose what was wrong. Do you have or need a ticket for this problem?
There are other issues when using asan that don't contain "AddressSanitizer: attempting free". Please file it. Any help on these bugs will be really appreciated. Best, Vicente
On Saturday 08 March 2014 03:02:27 Vicente J. Botet Escriba wrote:
On 07/03/14 21:59, Andrey Semashev wrote:
On Fri, Mar 7, 2014 at 10:13 PM, Vicente J. Botet Escriba wrote:
On 07/03/14 18:53, Niall Douglas wrote:
On 7 Mar 2014 at 21:19, Andrey Semashev wrote:
I'm observing Boost.Log and Boost.Atomic test failures on marshall-mac with AddressSanitizer enabled.
Hi,
I was aware of this issue, but I did not manage to diagnose what was wrong.
Do you have or need a ticket for this problem?
There are other issues when using asan that don't contain
AddressSanitizer: attempting free
Please file it.
Any help on these bugs will be really appreciated.
I tried to reproduce the error on my local Linux machine with gcc 4.8 and clang 3.2, but the error doesn't show. Also, today's tests on marshall-mac are all green, so I guess someone fixed it.
On 08/03/14 16:41, Andrey Semashev wrote:
On Saturday 08 March 2014 03:02:27 Vicente J. Botet Escriba wrote:
On 07/03/14 21:59, Andrey Semashev wrote:
Do you have or need a ticket for this problem? There are other issues when using asan that don't contain
AddressSanitizer: attempting free
Please file it.
Any help on these bugs will be really appreciated. I tried to reproduce the error on my local Linux machine with gcc 4.8 and clang 3.2, but the error doesn't show. Also, today's tests on marshall-mac are all green, so I guess someone fixed it.
I don't believe in miracles. I suspect that the errors are spurious. I will check the regression tests regularly. Best, Vicente
participants (5)
- Andrey Semashev
- Jonathan Wakely
- Niall Douglas
- Peter Dimov
- Vicente J. Botet Escriba