Re: [boost] [sync] Optimizing Boost's sync using futex2

15 May 2021

      On 14/05/21 at 14:03, Niall Douglas wrote:
...
On 14/05/2021 17:33, André Almeida wrote:
...
...
Another example is Boost.ASIO where one might think that if it used
io_uring instead of epoll() there might be significant gains. However,
benchmarks say no, epoll() is not a bottleneck almost all of the time.
I'd hazard a guess it will be similar with futex2: in almost all of the
ways Boost waits on the kernel, futex2 being many times faster will
confer only a small gain, if any measurable gain at all, to most Boost
code once measured from the top.
Thanks Niall for your reply, it makes sense to me. Now it's clearer to
me how new kernel features are adopted in Boost, and I assume that this
process is probably similar in other userspace libraries as well.
Now I've thought about it a little more, it would seem to me that the
best way futex2 could help is if it could make the pthread primitives
better behaved. Then all code, including Boost, benefits.
I'm probably already telling you stuff you know, but glibc's pthread
mutex implementation works by logging each waiter upon a locked mutex
into the thread environment of the thread which currently owns that
mutex, such that when that thread exits that mutex it knows exactly
which other thread can now be awoken to claim the lock (this also allows
pthread mutexes to have no init and deinit nor allocation requirements,
as they are just a naked atomic which has dual use as a spin lock and
identifying which thread's environment to register waiters with).
Whilst this is great, and more efficient than other platforms, I've
often wondered if the kernel couldn't be more participatory in pthread
mutexes than the current mostly userspace implementation is. I mean, on
FreeBSD the kernel is aware of exactly which threads await on each and
every userspace mutex, and knowing that it can make different scheduling
decisions than it might otherwise (specifically, it avoids pathological
corner case performance drops).
In very heavily threaded code doing heavy i/o, tail latencies are orders
of magnitude better on FreeBSD than Linux in my experience (though it
depends on scheduler chosen and lots of other factors). I've never
shaken the feeling that it's the kernel awareness of mutex wait queues
which gives FreeBSD the edge.
All that said, FreeBSD mutexes are much slower in the average case than
on Linux. That excellent corner case performance comes at a cost. So I'm
not entirely sure what I'm advocating to you here, except a random
splattering of ideas to consider, if you haven't already.
Well, trade-offs :)

I know that other platforms provide sync mechanisms where the kernel is
more aware of what's going on and can make smarter scheduling
decisions. But the semantics of futex are built upon the expectation
that the kernel doesn't need to be involved if the lock is free, which
is probably the reason why the average case is better.
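
To make the "kernel not involved if the lock is free" point concrete,
here is a rough sketch of a futex-backed mutex. It is Linux-only and is
not how glibc or Boost actually implement their mutexes; the names
FutexMutex, futex_call and contended_count are made up for
illustration, and it assumes std::atomic<std::uint32_t> has the same
layout as a plain 32-bit integer. Only the contended paths make a
syscall.

```cpp
// Sketch only: Linux-specific, not glibc's or Boost's real implementation.
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical helper: thin wrapper over the raw futex(2) syscall.
// Assumes std::atomic<std::uint32_t> is layout-compatible with uint32_t.
static long futex_call(std::atomic<std::uint32_t>* addr, int op,
                       std::uint32_t val) {
    return syscall(SYS_futex, reinterpret_cast<std::uint32_t*>(addr), op,
                   val, nullptr, nullptr, 0);
}

// State: 0 = unlocked, 1 = locked with no waiters, 2 = locked, maybe waiters.
class FutexMutex {
    std::atomic<std::uint32_t> state_{0};

public:
    void lock() {
        std::uint32_t expected = 0;
        // Fast path: the lock is free, so one atomic CAS and no syscall.
        if (state_.compare_exchange_strong(expected, 1,
                                           std::memory_order_acquire))
            return;
        // Slow path: mark the lock contended, then sleep in the kernel
        // for as long as the state is still 2.
        while (state_.exchange(2, std::memory_order_acquire) != 0)
            futex_call(&state_, FUTEX_WAIT_PRIVATE, 2);
    }

    void unlock() {
        // Fast path: state was 1 (no waiters), so no syscall is needed.
        if (state_.exchange(0, std::memory_order_release) == 2)
            futex_call(&state_, FUTEX_WAKE_PRIVATE, 1);  // wake one waiter
    }
};

// Hypothetical demo: several threads bump a plain counter under the lock.
long contended_count(int threads, int iters) {
    FutexMutex m;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                m.lock();
                ++counter;
                m.unlock();
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

With a single thread, contended_count(1, n) never enters the kernel for
locking at all, which is the cheap average case. The flip side is that
the kernel only learns about waiters once they call FUTEX_WAIT, so it
has little information to schedule on.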

So this is a different line of work, but maybe in the future someone
will try to achieve that in a new sync API if those corner cases are
causing too much inconvenience, who knows.
...
Niall