On 5/14/21 7:35 PM, André Almeida wrote:
Às 18:07 de 13/05/21, Andrey Semashev via Boost escreveu:
Beware of a long post.
Of the libraries I mentioned, the prime user of futex2 would be Boost.Atomic. With the current implementation based on existing futex API, the important missing part is support for futex sizes other than 32 bits. This means that for atomics other than 32-bit Boost.Atomic must use an internal lock pool to implement waiting and notifying operations, which increases thread contention. For inter-process atomics, this means that waiting must be done using a spin loop, which is terribly inefficient. So, the support for 8, 16 and 64-bit futexes would be very much needed here.
Cool, I'll be adding "better support for userspace atomics" as an use case for variable size futexes.
So far, I've been advertising that variable sized futexes would be good for saving some memory. Do you think that using 8bit sized futexes for e.g. Boost's mutexes would save something or would be unlikely noticed?
Honestly, I don't think small-sized futexes would be noticeable in terms of memory consumption in real world applications, aside from extremely low end embedded domain (think microcontrollers with kilobytes of memory). And in those kind of devices you may not have a Linux kernel in the first place. Even mobile phones these days come with several gigs of RAM. Most of the data is 4 or 8 byte aligned, so making a futex smaller than that would likely just contribute to padding. And, to avoid false sharing, you generally want to make sure the mutex and the protected data is within a cache line (typically, 64 bytes) and no other, unrelated data is in that cache line. So unless you're creating thousands of futexes and want to pack them tightly (which is detrimental to performance), I don't think the small futex size is a meaningful advantage. Small sized futexes are mostly useful when you already have a small data item, and you also want to block on it, which is the use case of atomics.
As for use cases outside Boost, that application that I described in the LKML post would benefit not only from 64-bit futexes but also from the ability to wait on multiple futexes. [...] I can provide more details on this use case, if you're interested.
Please, I would love to hear more about that.
Well, I've already described most of it in the LKML post and my previous reply. To recap: - The futexes are created in shared memory and accessed by multiple processes. - The futexes are created in threes, one is used as a mutex and the other two as condition variables. We use FUTEX_REQUEUE to move blocked threads from cond vars to the mutex to wake them up upon releasing the mutex. - The mutex futex value is divided into sets of bits. Some bits are used for ABA counter and some - for PID of the owner to implement robust semantics. This is where a larger futex would be useful, as the 32-bit putex is not large enough to accommodate a full PID. - For cond var futexes, we're using the futex bitset feature to selectively wake up threads subscribed to different events. In particular, this allows a thread to wake up all other threads within the same process but not the threads from other processes, which is useful for indicating process-local events (e.g. a request for termination). Looking at our current implementation, we are missing a FUTEX_REQUEUE_BITSET operation, which would act as FUTEX_REQUEUE, but only for blocked threads with a matching bitset. Because of that we have to use FUTEX_WAKE_BITSET, which may cause a thundering herd effect. It is possible to replace the futex bitset functionality with more futexes (with the ability to wait on multiple futexes). However, I suspect, in some cases such replacement would not be so easy. The problem is that with bitsets the notifying thread only has to perform one syscall to notify of multiple events (FUTEX_WAKE_BITSET), while with multiple futexes one would have to perform one FUTEX_WAKE per futex representing an event. I *think*, in our case we can avoid this overhead and implement a scheme where notifying (or rather, requeueing from) only one futex is enough. But I'm not sure this will be doable in other applications. Also, I'm not sure if the increased number of futexes would cause more overhead on the kernel side. Did you perform benchmarks comparing futex bitsets and an equivalent logic with multiple futex2 futexes? I mean, have two configs: - N threads multiplexed on a single futex using different bits in a bitset. - N threads, each waiting on N futex2 futexes. And compare the cost of blocking, waking and requeueing a thread in each config. In any case, I think, futex bitsets are a useful feature, even if our application can do without it.