Re: [boost] [sync] Optimizing Boost's sync using futex2

15 May 2021


      Às 15:05 de 14/05/21, Andrey Semashev via Boost escreveu:
...
On 5/14/21 7:35 PM, André Almeida wrote:
...
Às 18:07 de 13/05/21, Andrey Semashev via Boost escreveu:
...
Beware of a long post.
Of the libraries I mentioned, the prime user of futex2 would be
Boost.Atomic. With the current implementation based on existing futex
API, the important missing part is support for futex sizes other than 32
bits. This means that for atomics other than 32-bit Boost.Atomic must
use an internal lock pool to implement waiting and notifying operations,
which increases thread contention. For inter-process atomics, this means
that waiting must be done using a spin loop, which is terribly
inefficient. So, the support for 8, 16 and 64-bit futexes would be very
much needed here.
Cool, I'll be adding "better support for userspace atomics" as an use
case for variable size futexes.
So far, I've been advertising that variable sized futexes would be good
for saving some memory. Do you think that using 8bit sized futexes for
e.g. Boost's mutexes would save something or would be unlikely noticed?
Honestly, I don't think small-sized futexes would be noticeable in terms
of memory consumption in real world applications, aside from extremely
low end embedded domain (think microcontrollers with kilobytes of
memory). And in those kind of devices you may not have a Linux kernel in
the first place. Even mobile phones these days come with several gigs of
RAM.
Most of the data is 4 or 8 byte aligned, so making a futex smaller than
that would likely just contribute to padding. And, to avoid false
sharing, you generally want to make sure the mutex and the protected
data is within a cache line (typically, 64 bytes) and no other,
unrelated data is in that cache line. So unless you're creating
thousands of futexes and want to pack them tightly (which is detrimental
to performance), I don't think the small futex size is a meaningful
advantage.
Small sized futexes are mostly useful when you already have a small data
item, and you also want to block on it, which is the use case of atomics.
Noted, thanks.
...
...
...
As for use cases outside Boost, that application that I described in the
LKML post would benefit not only from 64-bit futexes but also from the
ability to wait on multiple futexes.
[...]
I can provide more details on this use case, if you're
interested.
Please, I would love to hear more about that.
Well, I've already described most of it in the LKML post and my previous
reply. To recap:
- The futexes are created in shared memory and accessed by multiple
processes.
- The futexes are created in threes, one is used as a mutex and the
other two as condition variables. We use FUTEX_REQUEUE to move blocked
threads from cond vars to the mutex to wake them up upon releasing the
mutex.
- The mutex futex value is divided into sets of bits. Some bits are used
for ABA counter and some - for PID of the owner to implement robust
semantics. This is where a larger futex would be useful, as the 32-bit
putex is not large enough to accommodate a full PID.
- For cond var futexes, we're using the futex bitset feature to
selectively wake up threads subscribed to different events. In
particular, this allows a thread to wake up all other threads within the
same process but not the threads from other processes, which is useful
for indicating process-local events (e.g. a request for termination).
Right, so you have a nice custom sync mechanism implementation. If isn't
confidential, can you explain which kind of application/work load are
you dealing with? e.g. it's a database
...
Looking at our current implementation, we are missing a
FUTEX_REQUEUE_BITSET operation, which would act as FUTEX_REQUEUE, but
only for blocked threads with a matching bitset. Because of that we have
to use FUTEX_WAKE_BITSET, which may cause a thundering herd effect.
It is possible to replace the futex bitset functionality with more
futexes (with the ability to wait on multiple futexes). However, I
suspect, in some cases such replacement would not be so easy. The
problem is that with bitsets the notifying thread only has to perform
one syscall to notify of multiple events (FUTEX_WAKE_BITSET), while with
multiple futexes one would have to perform one FUTEX_WAKE per futex
representing an event. I *think*, in our case we can avoid this overhead
and implement a scheme where notifying (or rather, requeueing from) only
one futex is enough. But I'm not sure this will be doable in other
applications.
Also, I'm not sure if the increased number of futexes would cause more
overhead on the kernel side. Did you perform benchmarks comparing futex
bitsets and an equivalent logic with multiple futex2 futexes? I mean,
have two configs:
- N threads multiplexed on a single futex using different bits in a bitset.
- N threads, each waiting on N futex2 futexes.
And compare the cost of blocking, waking and requeueing a thread in each
config.
In any case, I think, futex bitsets are a useful feature, even if our
application can do without it.
I didn't benchmark that, because I never thought of wait on multiple as
a replacement for futex bitset operations, but I'll have a look on that.
Also, the current API doesn't accommodate bitset for now, so it's good
to know that there are users out there, I thought this feature wasn't
used nowadays.