At 14:03 on 14/05/21, Niall Douglas wrote:
On 14/05/2021 17:33, André Almeida wrote:
Another example is Boost.ASIO, where one might think that using io_uring instead of epoll() would bring significant gains. However, benchmarks say no: epoll() is not a bottleneck almost all of the time. I'd hazard a guess it will be similar with futex2: in the ways Boost waits on the kernel, futex2 being many times faster will almost always confer only a small gain, if any measurable one, to most Boost code once measured from the top.
Thanks Niall for your reply, it makes sense to me. Now it's clearer to me how new kernel features are adopted in Boost, and I assume this process is probably similar in other userspace libraries as well.
Now I've thought about it a little more, it would seem to me that the best way futex2 could help is if it could make the pthread primitives better behaved. Then all code, including Boost, benefits.
I'm probably already telling you stuff you know, but glibc's pthread mutex implementation works by logging each waiter upon a locked mutex into the thread environment of the thread which currently owns that mutex, such that when that thread exits that mutex it knows exactly which other thread can now be awoken to claim the lock (this also allows pthread mutexes to have no init and deinit nor allocation requirements, as they are just a naked atomic which has dual use as a spin lock and identifying which thread's environment to register waiters with).
Whilst this is great, and more efficient than other platforms, I've often wondered if the kernel couldn't be more participatory in pthread mutexes than the current mostly userspace implementation is. I mean, on FreeBSD the kernel is aware of exactly which threads are waiting on each and every userspace mutex, and knowing that, it can make different scheduling decisions than it might otherwise (specifically, it avoids pathological corner case performance drops).
In very heavily threaded code doing heavy i/o, tail latencies are orders of magnitude better on FreeBSD than Linux in my experience (though it depends on scheduler chosen and lots of other factors). I've never shaken the feeling that it's the kernel awareness of mutex wait queues which gives FreeBSD the edge.
All that said, FreeBSD mutexes are much slower in the average case than on Linux. That excellent corner case performance comes at a cost. So I'm not entirely sure what I'm advocating to you here, except a random splattering of ideas to consider, if you haven't already.
Well, trade-offs :) I know that other platforms provide sync mechanisms where the kernel is more aware of what's going on and can make smarter scheduling decisions. But all the semantics of futex are built upon the expectation that the kernel doesn't need to be involved if the lock is free; that is probably the reason why the average case is better. So this is a different line of work, but maybe in the future someone will try to achieve that in a new sync API if those corner cases are causing too much inconvenience, who knows.
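To make the "kernel doesn't need to be involved if the lock is free" point concrete, here is a rough sketch of a futex-backed mutex. It follows the well-known three-state design from Ulrich Drepper's "Futexes Are Tricky", not glibc's actual pthread code, and it is Linux-only. The names FutexMutex, kernel_calls, uncontended_kernel_calls and contended_count are made up for illustration, and the cast from std::atomic<std::uint32_t>* to a plain word pointer is an assumption that holds on common Linux toolchains rather than something the standard guarantees.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical illustration class; not part of glibc, Boost or the kernel API.
class FutexMutex {
    // 0 = unlocked, 1 = locked with no waiters, 2 = locked and waiters may exist.
    std::atomic<std::uint32_t> state_{0};
    static_assert(sizeof(std::atomic<std::uint32_t>) == sizeof(std::uint32_t),
                  "assumes the atomic has the layout of a plain 32-bit word");

    // Thin wrapper over the raw futex syscall that also counts kernel entries.
    long futex(int op, std::uint32_t val) {
        kernel_calls.fetch_add(1, std::memory_order_relaxed);
        return syscall(SYS_futex, reinterpret_cast<std::uint32_t*>(&state_),
                       op, val, nullptr, nullptr, 0);
    }

public:
    std::atomic<long> kernel_calls{0};  // instrumentation only

    void lock() {
        std::uint32_t c = 0;
        // Fast path: the lock is free, so one userspace CAS takes it.
        if (state_.compare_exchange_strong(c, 1, std::memory_order_acquire))
            return;
        // Slow path: mark the lock as contended, then sleep in the kernel.
        if (c != 2)
            c = state_.exchange(2, std::memory_order_acquire);
        while (c != 0) {
            futex(FUTEX_WAIT_PRIVATE, 2);  // returns at once if state_ != 2
            c = state_.exchange(2, std::memory_order_acquire);
        }
    }

    void unlock() {
        std::uint32_t prev = state_.fetch_sub(1, std::memory_order_release);
        assert(prev != 0);  // unlocking an unlocked mutex is a caller bug
        if (prev != 1) {
            // There may be sleepers: release fully and wake one of them.
            state_.store(0, std::memory_order_release);
            futex(FUTEX_WAKE_PRIVATE, 1);
        }
    }
};

// Locks and unlocks from one thread; returns how many futex syscalls were made.
long uncontended_kernel_calls(int iterations) {
    FutexMutex m;
    for (int i = 0; i < iterations; ++i) {
        m.lock();
        m.unlock();
    }
    return m.kernel_calls.load();
}

// Increments a shared counter from several threads under the mutex.
long contended_count(int threads, int per_thread) {
    FutexMutex m;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) {
                m.lock();
                ++counter;
                m.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

With a single thread, uncontended_kernel_calls() should report zero syscalls, which is the cheap average case. The flip side is that the kernel only learns about a waiter once that waiter has already lost the race in userspace, so it has no full picture of who wants the lock.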
Niall