Re: [boost] [sync] Optimizing Boost's sync using futex2

14 May 2021

      On 5/14/21 7:35 PM, André Almeida wrote:
...
Às 18:07 de 13/05/21, Andrey Semashev via Boost escreveu:
...
Beware of a long post.
Of the libraries I mentioned, the prime user of futex2 would be
Boost.Atomic. With the current implementation based on existing futex
API, the important missing part is support for futex sizes other than 32
bits. This means that for atomics other than 32-bit Boost.Atomic must
use an internal lock pool to implement waiting and notifying operations,
which increases thread contention. For inter-process atomics, this means
that waiting must be done using a spin loop, which is terribly
inefficient. So, the support for 8, 16 and 64-bit futexes would be very
much needed here.
Cool, I'll be adding "better support for userspace atomics" as an use
case for variable size futexes.
So far, I've been advertising that variable sized futexes would be good
for saving some memory. Do you think that using 8bit sized futexes for
e.g. Boost's mutexes would save something or would be unlikely noticed?
Honestly, I don't think small-sized futexes would be noticeable in terms 
of memory consumption in real world applications, aside from extremely 
low end embedded domain (think microcontrollers with kilobytes of 
memory). And in those kind of devices you may not have a Linux kernel in 
the first place. Even mobile phones these days come with several gigs of 
RAM.

Most of the data is 4 or 8 byte aligned, so making a futex smaller than 
that would likely just contribute to padding. And, to avoid false 
sharing, you generally want to make sure the mutex and the protected 
data is within a cache line (typically, 64 bytes) and no other, 
unrelated data is in that cache line. So unless you're creating 
thousands of futexes and want to pack them tightly (which is detrimental 
to performance), I don't think the small futex size is a meaningful 
advantage.

Small sized futexes are mostly useful when you already have a small data 
item, and you also want to block on it, which is the use case of atomics.
...
...
As for use cases outside Boost, that application that I described in the
LKML post would benefit not only from 64-bit futexes but also from the
ability to wait on multiple futexes.
[...]
I can provide more details on this use case, if you're
interested.
Please, I would love to hear more about that.
Well, I've already described most of it in the LKML post and my previous 
reply. To recap:

- The futexes are created in shared memory and accessed by multiple 
processes.
- The futexes are created in threes, one is used as a mutex and the 
other two as condition variables. We use FUTEX_REQUEUE to move blocked 
threads from cond vars to the mutex to wake them up upon releasing the 
mutex.
- The mutex futex value is divided into sets of bits. Some bits are used 
for ABA counter and some - for PID of the owner to implement robust 
semantics. This is where a larger futex would be useful, as the 32-bit 
putex is not large enough to accommodate a full PID.
- For cond var futexes, we're using the futex bitset feature to 
selectively wake up threads subscribed to different events. In 
particular, this allows a thread to wake up all other threads within the 
same process but not the threads from other processes, which is useful 
for indicating process-local events (e.g. a request for termination).

Looking at our current implementation, we are missing a 
FUTEX_REQUEUE_BITSET operation, which would act as FUTEX_REQUEUE, but 
only for blocked threads with a matching bitset. Because of that we have 
to use FUTEX_WAKE_BITSET, which may cause a thundering herd effect.

It is possible to replace the futex bitset functionality with more 
futexes (with the ability to wait on multiple futexes). However, I 
suspect, in some cases such replacement would not be so easy. The 
problem is that with bitsets the notifying thread only has to perform 
one syscall to notify of multiple events (FUTEX_WAKE_BITSET), while with 
multiple futexes one would have to perform one FUTEX_WAKE per futex 
representing an event. I *think*, in our case we can avoid this overhead 
and implement a scheme where notifying (or rather, requeueing from) only 
one futex is enough. But I'm not sure this will be doable in other 
applications.

Also, I'm not sure if the increased number of futexes would cause more 
overhead on the kernel side. Did you perform benchmarks comparing futex 
bitsets and an equivalent logic with multiple futex2 futexes? I mean, 
have two configs:

- N threads multiplexed on a single futex using different bits in a bitset.
- N threads, each waiting on N futex2 futexes.

And compare the cost of blocking, waking and requeueing a thread in each 
config.

In any case, I think, futex bitsets are a useful feature, even if our 
application can do without it.