On 8 Jul 2014 at 23:12, Gottlob Frege wrote:
2013. The talk is called "Non-Allocating std::future/promise". I think most of it is about a... non-allocating future-promise. Hopefully. That was at least the idea.
Dear dear dear ... given that I was working at BlackBerry with you at the time, and went with you to C++ Now, I really don't know how I missed this.
I'm going to assume that I didn't miss this, and instead choose to forget and then pretend that your idea was my idea. So, my apologies, and I'll credit you in the docs when the time comes.
I _thought_ you were in the audience, but that could have been one of my other talks.
I was at *one* of your talks. I also almost certainly reviewed your slides, as I do for most C++ Now talks I don't make it to. I have no excuse really, just failing memory (though in fairness, it's been an awfully full two years for me: two transatlantic relocations, first baby, etc. I can see some memories have got deleted to make space).
At work I hardly talked about it, so if you weren't at the talk you could easily have missed it. Chandler said that Google also ended up with similar code, so we are all thinking along the same lines. Chandler had some good ideas for handling exceptions as well (i.e. when one is thrown while setting the value). It is hard to be 100% standards compliant, since the standard basically assumes every implementation uses an allocated storage location, and those assumptions leak into the interface.
Boost.Thread's promise-future doesn't implement allocator support, so a de-malloced implementation shouldn't lose us too much (I agree we'll have to deviate slightly from the standard in some APIs, but TBH it's the standard that needs fixing here: promise-future shouldn't allocate).

I need a malloc-free promise-future for AFIO. I see an exact latency resonance peak at one thread sleep duration, and upon investigation it's because the futures are sleeping the thread due to malloc being latency lumpy. AFIO also currently does eight malloc/frees per op executed, with four inside a global lock, and I'd very much like to see that down to four malloc/frees per op with none inside a global lock. Also, the batch hash engine's tasks are too finely grained to use a mallocing promise-future: the promise-future adds about 15-20% to each hash round, and that needs to become < 5%.
Yes, replacing the spin and pointer updating with TM would be nice.
And here is where things become very interesting.
<...interesting TM stuff...>
Yes, keep us informed. I've been assuming TM won't work well for "big" transactions, but I have no idea yet what is big and what is small.
The abort limit is probably around 100 cache lines touched. My current best guess is that the small limit is somewhere around 10 cache lines touched, so you need to exceed 10 lines while staying well clear of the abort limit, say under 50. A narrow window.
Of course, we could also just ask the TM guys, like Michael Wong et al. But nothing beats experiencing it for yourself.
He'll say go use transactional GCC, and he's right. I put in a code path using __transaction_relaxed, as that's the malloc-capable one (malloc doesn't abort transactions in transactional GCC, unlike under TSX). Performance was dismal, and on non-TSX hardware it was another order of magnitude slower again. The lesson I learned is that when writing code targeting both TSX and transactional GCC, don't bother with __transaction_relaxed: just use __transaction_atomic and follow the same granularity rules as with TSX.

Regarding transactional GCC, it is neat the way you can write metaprogramming which generates code in which the compiler's optimiser spots that all the locking can be elided, whereupon your output runs completely in parallel. That is very hard to do normally in metaprogramming.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
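P.S. For anyone who hasn't seen the syntax, the granularity rule above looks roughly like this in transactional GCC. This is an illustrative fragment of ours (the data structure and names are invented), and it only compiles with g++ -fgnu-tm:

```cpp
// Illustrative only: requires transactional GCC (g++ -fgnu-tm).
struct node { node* next; int value; };

void push_front(node*& head, node* n)
{
    // Keep the body tiny and allocation-free so the same code also
    // maps onto a TSX hardware transaction: this touches only a
    // couple of cache lines.
    __transaction_atomic
    {
        n->next = head;
        head = n;
    }
}
```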