[sync] Optimizing Boost's sync using futex2
Hi there,

I'm the author of futex2[0], a WIP set of new Linux syscalls that allows userspace to write efficient sync mechanisms. I would like to hear from Boost's developers whether the project would benefit from this new interface.
From Boost/sync's codebase, I can see that you are already familiar with futexes, but just in case:
* What's futex? futex stands for "fast userspace mutex". It's a syscall that basically exposes a wait queue: userspace can ask the kernel to put threads to sleep and to wake them. The queue is indexed by the address of an unsigned int (which is itself usually called a futex). Its value means nothing to the kernel; userspace is responsible for defining the rules around it and implementing the logic.

* Why futex2? futex has long-standing issues that seem impossible to solve within the current interface, so we are designing a new set of syscalls that takes those problems into account (more on that at [0]). It's not merged yet, and I'm researching use cases to validate the API and get feedback on it.

The new API has the following new features:

* Variable-sized futexes (8, 16, 32 and 64 bits): we are no longer restricted to 32 bits, which can be used to decrease memory usage or to wait on data of other sizes.

* Wait on multiple futexes: this operation enables userspace to wait on a list of futexes and wake on the first one that triggers a wake. Our current use cases for this are game engines and emulating WaitForMultipleObjects from WinAPI in Wine.

* NUMA awareness: the current futex has a single global hash table that is allocated on some node, and every futex() call that happens on the other nodes pays an access penalty. In futex2, there will be one table per node, so you can specify which one you want to operate on to increase locality.

The detailed description of the API can be seen in the documentation patch[1]. Do you think that Boost would benefit from it?

Thanks,
	André

- futex2 talk at Open Source Summit: https://www.youtube.com/watch?v=eG6GMBTcPQ8
[0] https://lore.kernel.org/lkml/20210427231248.220501-1-andrealmeid@collabora.c...
[1] https://lore.kernel.org/lkml/20210427231248.220501-7-andrealmeid@collabora.c...
On 13/05/2021 15:27, André Almeida via Boost wrote:
I'm the author of futex2[0], a WIP set of new Linux syscalls that allows userspace to write efficient sync mechanisms. I would like to hear from Boost's developers whether the project would benefit from this new interface.
Firstly, thanks for your work. I had already seen it on LKML. I'm personally still fonder of how BSD implements wait queues, but futex2 is a nice improvement on futex, so thanks!
The detailed description of the API can be seen in the documentation patch[1]. Do you think that Boost would benefit from it?
To be honest, it's very much a case of "it depends", modulated by whether adding more code to support newer Linux kernels is worth the added long term maintenance costs, given the benefits to the average Boost user.

I'm very sure that in very specific use cases, futex2 is a large benefit. However, because Boost must be portable, there is a certain amount of needing to assume the lowest common denominator. Boost.Thread is an excellent example of this - it works great on Windows XP, but because it was written around XP, it would work a lot better if rewritten around Windows 10. But there isn't much incentive to rewrite it, given what we've currently got is well understood, warts and all, and those who need to will code around its limitations.

Another example is Boost.ASIO, where one might think that if it used io_uring instead of epoll() there might be significant gains. However, benchmarks say no: epoll() is not a bottleneck almost all of the time. I'd hazard a guess it will be similar with futex2; in almost all of the ways Boost waits on the kernel, futex2 being many times faster will confer only a small, if any, measurable gain to most Boost code once measured from the top.

Others here will probably disagree with my assessment above. In any case, they will agree with "patches adding support for futex2 are welcome".

Niall
On 13/05/21 13:11, Niall Douglas via Boost wrote:
On 13/05/2021 15:27, André Almeida via Boost wrote:
I'm the author of futex2[0], a WIP set of new Linux syscalls that allows userspace to write efficient sync mechanisms. I would like to hear from Boost's developers whether the project would benefit from this new interface.
Firstly, thanks for your work. I had already seen it on LKML. I'm personally still fonder of how BSD implements wait queues, but futex2 is a nice improvement on futex, so thanks!
:)
The detailed description of the API can be seen in the documentation patch[1]. Do you think that Boost would benefit from it?
To be honest, it's very much a case of "it depends" modulated by whether adding more code to support newer Linux kernels is worth the added long term maintenance costs, given the benefits to the average Boost user.
I'm very sure that in very specific use cases, futex2 is a large benefit. However, because Boost must be portable, there is a certain amount of needing to assume the lowest common denominator. Boost.Thread is an excellent example of this - it works great on Windows XP, but because it was written around XP, it would work a lot better if rewritten around Windows 10. But there isn't much incentive to rewrite it, given what we've currently got is well understood, warts and all, and those who need to will code around its limitations.
Another example is Boost.ASIO where one might think that if it used io_uring instead of epoll() there might be significant gains. However, benchmarks say no, epoll() is not a bottleneck almost all of the time. I'd hazard a guess it will be similar with futex2, almost all of the time in the ways Boost waits on the kernel, futex2 being many times faster will only confer a small if any measurable gain to most Boost code once measured from the top.
Thanks Niall for your reply, it makes sense to me. It's now clearer to me how new kernel features get adopted in Boost, and I assume this process is probably similar in other userspace libraries as well.

André
On 14/05/2021 17:33, André Almeida wrote:
Another example is Boost.ASIO where one might think that if it used io_uring instead of epoll() there might be significant gains. However, benchmarks say no, epoll() is not a bottleneck almost all of the time. I'd hazard a guess it will be similar with futex2, almost all of the time in the ways Boost waits on the kernel, futex2 being many times faster will only confer a small if any measurable gain to most Boost code once measured from the top.
Thanks Niall for your reply, it makes sense to me. It's now clearer to me how new kernel features get adopted in Boost, and I assume this process is probably similar in other userspace libraries as well.
Now I've thought about it a little more, it would seem to me that the best way futex2 could help is if it could make the pthread primitives better behaved. Then all code, including Boost, benefits.

I'm probably already telling you stuff you know, but glibc's pthread mutex implementation works by logging each waiter upon a locked mutex into the thread environment of the thread which currently owns that mutex, such that when that thread exits that mutex it knows exactly which other thread can now be awoken to claim the lock (this also allows pthread mutexes to have no init and deinit nor allocation requirements, as they are just a naked atomic which has dual use as a spin lock and identifying which thread's environment to register waiters with).

Whilst this is great, and more efficient than other platforms, I've often wondered if the kernel couldn't be more participatory in pthread mutexes than the current mostly userspace implementation is. I mean, on FreeBSD the kernel is aware of exactly which threads wait on each and every userspace mutex, and knowing that it can make different scheduling decisions than it might otherwise (specifically, it avoids pathological corner case performance drops).

In very heavily threaded code doing heavy i/o, tail latencies are orders of magnitude better on FreeBSD than Linux in my experience (though it depends on scheduler chosen and lots of other factors). I've never shaken the feeling that it's the kernel awareness of mutex wait queues which gives FreeBSD the edge.

All that said, FreeBSD mutexes are much slower in the average case than on Linux. That excellent corner case performance comes at a cost. So I'm not entirely sure what I'm advocating to you here, except a random splattering of ideas to consider, if you haven't already.

Niall
On 14/05/21 14:03, Niall Douglas wrote:
On 14/05/2021 17:33, André Almeida wrote:
Another example is Boost.ASIO where one might think that if it used io_uring instead of epoll() there might be significant gains. However, benchmarks say no, epoll() is not a bottleneck almost all of the time. I'd hazard a guess it will be similar with futex2, almost all of the time in the ways Boost waits on the kernel, futex2 being many times faster will only confer a small if any measurable gain to most Boost code once measured from the top.
Thanks Niall for your reply, it makes sense to me. It's now clearer to me how new kernel features get adopted in Boost, and I assume this process is probably similar in other userspace libraries as well.
Now I've thought about it a little more, it would seem to me that the best way futex2 could help is if it could make the pthread primitives better behaved. Then all code, including Boost, benefits.
I'm probably already telling you stuff you know, but glibc's pthread mutex implementation works by logging each waiter upon a locked mutex into the thread environment of the thread which currently owns that mutex, such that when that thread exits that mutex it knows exactly which other thread can now be awoken to claim the lock (this also allows pthread mutexes to have no init and deinit nor allocation requirements, as they are just a naked atomic which has dual use as a spin lock and identifying which thread's environment to register waiters with).
Whilst this is great, and more efficient than other platforms, I've often wondered if the kernel couldn't be more participatory in pthread mutexes than the current mostly userspace implementation is. I mean, on FreeBSD the kernel is aware of exactly which threads await on each and every userspace mutex, and knowing that it can make different scheduling decisions than it might otherwise (specifically, it avoids pathological corner case performance drops).
In very heavily threaded code doing heavy i/o, tail latencies are orders of magnitude better on FreeBSD than Linux in my experience (though it depends on scheduler chosen and lots of other factors). I've never shaken the feeling that it's the kernel awareness of mutex wait queues which gives FreeBSD the edge.
All that said, FreeBSD mutexes are much slower in the average case than on Linux. That excellent corner case performance comes at a cost. So I'm not entirely sure what I'm advocating to you here, except a random splattering of ideas to consider, if you haven't already.
Well, trade-offs :) I know that other platforms provide sync mechanisms where the kernel is more aware of what's going on and can make smarter scheduling decisions. But the whole semantics of futex is built upon the expectation that the kernel doesn't need to be involved if the lock is free, and that is probably why the average case is better. So this is a different line of work, but maybe in the future someone will try to achieve that in a new sync API, if those corner cases are causing too much inconvenience. Who knows.
Beware of a long post. On 5/13/21 5:27 PM, André Almeida wrote:
Hi there,
I'm the author of futex2[0], a WIP set of new Linux syscalls that allows userspace to write efficient sync mechanisms. I would like to hear from Boost's developers whether the project would benefit from this new interface.
From Boost/sync's codebase, I can see that you are already familiar with futexes, but just in case:
[snip]
The detailed description of the API can be seen in the documentation patch[1]. Do you think that Boost would benefit from it?
Hi, and thank you for working on this, and especially for including 64-bit futex support in the latest patches. I have already described some of the use cases in my earlier post on LKML[1], but I'll try to recap and expand on it here.

Boost contains many libraries, but only a few of them deal with thread synchronization directly:

- Boost.Atomic implements atomic operations and also basic wait/notify operations. Supports both inter-thread and inter-process synchronization.
- Boost.Interprocess implements inter-process communication primitives, including synchronization.
- Boost.Sync and Boost.Thread implement inter-thread communication primitives, including synchronization. (Note that Boost.Sync is not an officially accepted library yet; you can consider it a work in progress that is not yet an official part of Boost.)

A few other libraries are worth mentioning. Boost.Fiber and Boost.Log implement custom thread synchronization primitives that use the futex API directly. Some libraries may also be using low-level thread synchronization APIs, such as pthread and WinAPI, but not futex directly.

Of the libraries I mentioned, the prime user of futex2 would be Boost.Atomic. With the current implementation based on the existing futex API, the important missing part is support for futex sizes other than 32 bits. This means that for atomics other than 32-bit, Boost.Atomic must use an internal lock pool to implement waiting and notifying operations, which increases thread contention. For inter-process atomics, it means that waiting must be done using a spin loop, which is terribly inefficient. So support for 8, 16 and 64-bit futexes would be very much needed here.

Another potential use case for futex2 is the mass locking algorithms[2] in Boost.Thread. Basically, such an algorithm accepts a list of lockable objects (e.g. locks or mutexes) and attempts to lock them all before returning. Here, I imagine, support for waiting on multiple futexes could come in handy.
It should be noted that the algorithms are generic, so they must work on any type of lockable objects, including those that do not use or expose a futex, so the optimization is not trivial or universally applicable. However, if the algorithm is applied to Boost.Thread primitives, and those expose a futex, this could work quite well.

Although Boost.Interprocess doesn't currently use futexes directly, I imagine it would benefit from them, not least because pthread does not provide robust condition variables, and robust mutexes alone are often not enough for organizing inter-process communication. Robust IPC is a recurring theme in Boost.Interprocess issues and PRs, so I think some solution is needed here, and futex could be a building block. In my LKML post I described one solution to this problem (implemented in a project outside Boost), and there 64-bit futexes would be very much useful.

Alternatively, futex2 could offer a new API for implementing robust primitives in userspace. I know the current futex2 patch set does not implement robust futexes, and I'm not asking to implement them, but if there are plans to eventually add robust futexes, here is a thought. The new API should preferably support multiple users of this feature. That is, the kernel API should allow any piece of userspace code (not just libc) to mark individual futexes as robust, without having to maintain a common list of robust futexes in userspace. Currently, this list is maintained by libc internally, which prevents any futex user (other than libc itself) from using robust futexes. But this feature should probably be discussed with libc developers.

Other than the above, I can't readily remember potential use cases for futex2 in Boost. We do use futexes (the currently existing futex API) in Boost.Sync and other libraries and could use them elsewhere, but for primitives like mutexes, condition variables, semaphores and events the existing API is sufficient.
We currently don't implement NUMA-specific primitives, which might be a good future addition to Boost, but I can't tell whether the new futex2 API would be sufficient for it. Better NUMA support could be interesting for the thread pool implementation in Boost.Thread, but I'm not familiar with that code and don't know how useful futex2 would be there.

As for use cases outside Boost, the application that I described in the LKML post would benefit not only from 64-bit futexes but also from the ability to wait on multiple futexes. We are also using the futex bitset API in order to reduce the number of woken threads blocked on a futex. The bitset is used as a mask of events that each blocked thread subscribes to. When the notifying thread wakes the waiters, it sets the bitset to the mask of events that happened, so that only the threads that are waiting for those events are woken up. I think this could be emulated with multiple futexes in the futex2 design, although I'm not sure it would be as efficient, as that would increase the number of futexes at least twofold in our case (since every thread most of the time subscribes to at least two events). I can provide more details on this use case, if you're interested.

[1]: https://lore.kernel.org/lkml/9557a62c-ab64-495b-36bd-6d8db426ddce@gmail.com/
[2]: https://www.boost.org/doc/libs/1_76_0/doc/html/thread/synchronization.html#t...
On 13/05/21 18:07, Andrey Semashev via Boost wrote:
Beware of a long post.
On 5/13/21 5:27 PM, André Almeida wrote:
Hi there,
I'm the author of futex2[0], a WIP set of new Linux syscalls that allows userspace to write efficient sync mechanisms. I would like to hear from Boost's developers whether the project would benefit from this new interface.
From Boost/sync's codebase, I can see that you are already familiar with futexes, but just in case:
[snip]
The detailed description of the API can be seen in the documentation patch[1]. Do you think that Boost would benefit from it?
Hi, and thank you for working on this and especially for including 64-bit futex support in the latest patches. I have already described some of the use cases in my earlier post on LKML[1], but I'll try to recap and expand on it here.
Sorry for not replying to your post, I think it got lost in my inbox. <snip>
Of the libraries I mentioned, the prime user of futex2 would be Boost.Atomic. With the current implementation based on existing futex API, the important missing part is support for futex sizes other than 32 bits. This means that for atomics other than 32-bit Boost.Atomic must use an internal lock pool to implement waiting and notifying operations, which increases thread contention. For inter-process atomics, this means that waiting must be done using a spin loop, which is terribly inefficient. So, the support for 8, 16 and 64-bit futexes would be very much needed here.
Cool, I'll be adding "better support for userspace atomics" as a use case for variable-sized futexes. So far, I've been advertising that variable-sized futexes would be good for saving some memory. Do you think that using 8-bit futexes for e.g. Boost's mutexes would save something, or would it likely go unnoticed?
Another potential use case for futex2 is the mass locking algorithms[2] in Boost.Thread. Basically, the algorithm accepts a list of lockable objects (e.g. locks or mutexes) and attempts to lock them all before returning. Here, I imagine, the support for waiting on multiple futexes could come in handy. It should be noted that the algorithms are generic, so they must work on any type of lockable objects, including those that do not use or expose a futex, so the optimization is not trivial or universally applicable. However, if the algorithm is applied to Boost.Thread primitives, and those expose a futex, this could work quite well.
In Wine, waiting on multiple futexes is used as the backend for an operation that can likewise wait on different kinds of things (WaitForMultipleObjects), so I think the solution there was to add an `unsigned int futex` to those objects. It's good to see that more use cases would benefit from this feature.
Alternatively, futex2 could offer a new API for implementing robust primitives in userspace.
Yes, robust futexes are not part of the futex2 discussion for now, but thanks anyway for your proactive feedback. I'll revisit this suggestion and ping you when it's time for it. <snip>
As for use cases outside Boost, that application that I described in the LKML post would benefit not only from 64-bit futexes but also from the ability to wait on multiple futexes. [...] I can provide more details on this use case, if you're interested.
Please, I would love to hear more about that. Thank you very much, Andrey, your detailed feedback will help me a lot with futex2 development!

André
On 5/14/21 7:35 PM, André Almeida wrote:
On 13/05/21 18:07, Andrey Semashev via Boost wrote:
Beware of a long post.
Of the libraries I mentioned, the prime user of futex2 would be Boost.Atomic. With the current implementation based on existing futex API, the important missing part is support for futex sizes other than 32 bits. This means that for atomics other than 32-bit Boost.Atomic must use an internal lock pool to implement waiting and notifying operations, which increases thread contention. For inter-process atomics, this means that waiting must be done using a spin loop, which is terribly inefficient. So, the support for 8, 16 and 64-bit futexes would be very much needed here.
Cool, I'll be adding "better support for userspace atomics" as a use case for variable-sized futexes.
So far, I've been advertising that variable-sized futexes would be good for saving some memory. Do you think that using 8-bit futexes for e.g. Boost's mutexes would save something, or would it likely go unnoticed?
Honestly, I don't think small-sized futexes would be noticeable in terms of memory consumption in real world applications, aside from the extremely low-end embedded domain (think microcontrollers with kilobytes of memory), and on that kind of device you may not have a Linux kernel in the first place. Even mobile phones these days come with several gigs of RAM.

Most data is 4 or 8 byte aligned, so making a futex smaller than that would likely just contribute to padding. And, to avoid false sharing, you generally want to make sure the mutex and the protected data are within one cache line (typically, 64 bytes) and no other, unrelated data is in that cache line. So unless you're creating thousands of futexes and want to pack them tightly (which is detrimental to performance), I don't think the small futex size is a meaningful advantage.

Small-sized futexes are mostly useful when you already have a small data item and you also want to block on it, which is the use case of atomics.
As for use cases outside Boost, that application that I described in the LKML post would benefit not only from 64-bit futexes but also from the ability to wait on multiple futexes. [...] I can provide more details on this use case, if you're interested.
Please, I would love to hear more about that.
Well, I've already described most of it in the LKML post and my previous reply. To recap:

- The futexes are created in shared memory and accessed by multiple processes.
- The futexes are created in threes: one is used as a mutex and the other two as condition variables. We use FUTEX_REQUEUE to move blocked threads from the cond vars to the mutex, to wake them up upon releasing the mutex.
- The mutex futex value is divided into sets of bits. Some bits are used for an ABA counter and some for the PID of the owner, to implement robust semantics. This is where a larger futex would be useful, as the 32-bit futex is not large enough to accommodate a full PID.
- For the cond var futexes, we're using the futex bitset feature to selectively wake up threads subscribed to different events. In particular, this allows a thread to wake up all other threads within the same process but not the threads from other processes, which is useful for indicating process-local events (e.g. a request for termination).

Looking at our current implementation, we are missing a FUTEX_REQUEUE_BITSET operation, which would act as FUTEX_REQUEUE, but only for blocked threads with a matching bitset. Because of that we have to use FUTEX_WAKE_BITSET, which may cause a thundering herd effect.

It is possible to replace the futex bitset functionality with more futexes (given the ability to wait on multiple futexes). However, I suspect in some cases such a replacement would not be so easy. The problem is that with bitsets the notifying thread only has to perform one syscall to notify of multiple events (FUTEX_WAKE_BITSET), while with multiple futexes one would have to perform one FUTEX_WAKE per futex representing an event. I *think* in our case we can avoid this overhead and implement a scheme where notifying (or rather, requeueing from) only one futex is enough. But I'm not sure this will be doable in other applications. Also, I'm not sure if the increased number of futexes would cause more overhead on the kernel side.
Did you perform benchmarks comparing futex bitsets and an equivalent logic with multiple futex2 futexes? I mean, have two configs:

- N threads multiplexed on a single futex using different bits in a bitset.
- N threads, each waiting on N futex2 futexes.

And compare the cost of blocking, waking and requeueing a thread in each config.

In any case, I think futex bitsets are a useful feature, even if our application can do without it.
On 14/05/21 15:05, Andrey Semashev via Boost wrote:
On 5/14/21 7:35 PM, André Almeida wrote:
On 13/05/21 18:07, Andrey Semashev via Boost wrote:
Beware of a long post.
Of the libraries I mentioned, the prime user of futex2 would be Boost.Atomic. With the current implementation based on existing futex API, the important missing part is support for futex sizes other than 32 bits. This means that for atomics other than 32-bit Boost.Atomic must use an internal lock pool to implement waiting and notifying operations, which increases thread contention. For inter-process atomics, this means that waiting must be done using a spin loop, which is terribly inefficient. So, the support for 8, 16 and 64-bit futexes would be very much needed here.
Cool, I'll be adding "better support for userspace atomics" as a use case for variable-sized futexes.
So far, I've been advertising that variable-sized futexes would be good for saving some memory. Do you think that using 8-bit futexes for e.g. Boost's mutexes would save something, or would it likely go unnoticed?
Honestly, I don't think small-sized futexes would be noticeable in terms of memory consumption in real world applications, aside from the extremely low-end embedded domain (think microcontrollers with kilobytes of memory). And on that kind of device you may not have a Linux kernel in the first place. Even mobile phones these days come with several gigs of RAM.
Most of the data is 4 or 8 byte aligned, so making a futex smaller than that would likely just contribute to padding. And, to avoid false sharing, you generally want to make sure the mutex and the protected data is within a cache line (typically, 64 bytes) and no other, unrelated data is in that cache line. So unless you're creating thousands of futexes and want to pack them tightly (which is detrimental to performance), I don't think the small futex size is a meaningful advantage.
Small sized futexes are mostly useful when you already have a small data item, and you also want to block on it, which is the use case of atomics.
Noted, thanks.
As for use cases outside Boost, that application that I described in the LKML post would benefit not only from 64-bit futexes but also from the ability to wait on multiple futexes. [...] I can provide more details on this use case, if you're interested.
Please, I would love to hear more about that.
Well, I've already described most of it in the LKML post and my previous reply. To recap:
- The futexes are created in shared memory and accessed by multiple processes.
- The futexes are created in threes: one is used as a mutex and the other two as condition variables. We use FUTEX_REQUEUE to move blocked threads from the cond vars to the mutex, to wake them up upon releasing the mutex.
- The mutex futex value is divided into sets of bits. Some bits are used for an ABA counter and some for the PID of the owner, to implement robust semantics. This is where a larger futex would be useful, as the 32-bit futex is not large enough to accommodate a full PID.
- For the cond var futexes, we're using the futex bitset feature to selectively wake up threads subscribed to different events. In particular, this allows a thread to wake up all other threads within the same process but not the threads from other processes, which is useful for indicating process-local events (e.g. a request for termination).
Right, so you have a nice custom sync mechanism implementation. If it isn't confidential, can you explain what kind of application/workload you are dealing with? e.g. is it a database?
Looking at our current implementation, we are missing a FUTEX_REQUEUE_BITSET operation, which would act as FUTEX_REQUEUE, but only for blocked threads with a matching bitset. Because of that we have to use FUTEX_WAKE_BITSET, which may cause a thundering herd effect.
It is possible to replace the futex bitset functionality with more futexes (with the ability to wait on multiple futexes). However, I suspect, in some cases such replacement would not be so easy. The problem is that with bitsets the notifying thread only has to perform one syscall to notify of multiple events (FUTEX_WAKE_BITSET), while with multiple futexes one would have to perform one FUTEX_WAKE per futex representing an event. I *think*, in our case we can avoid this overhead and implement a scheme where notifying (or rather, requeueing from) only one futex is enough. But I'm not sure this will be doable in other applications.
Also, I'm not sure if the increased number of futexes would cause more overhead on the kernel side. Did you perform benchmarks comparing futex bitsets and an equivalent logic with multiple futex2 futexes? I mean, have two configs:
- N threads multiplexed on a single futex using different bits in a bitset.
- N threads, each waiting on N futex2 futexes.
And compare the cost of blocking, waking and requeueing a thread in each config.
In any case, I think, futex bitsets are a useful feature, even if our application can do without it.
I didn't benchmark that, because I never thought of waiting on multiple futexes as a replacement for the futex bitset operations, but I'll have a look at that. Also, the current API doesn't accommodate bitsets for now, so it's good to know that there are users out there; I thought this feature wasn't used nowadays.
On 5/15/21 8:25 PM, André Almeida wrote:
On 14/05/21 15:05, Andrey Semashev via Boost wrote:
On 5/14/21 7:35 PM, André Almeida wrote:
On 13/05/21 18:07, Andrey Semashev via Boost wrote:
As for use cases outside Boost, that application that I described in the LKML post would benefit not only from 64-bit futexes but also from the ability to wait on multiple futexes. [...] I can provide more details on this use case, if you're interested.
Please, I would love to hear more about that.
Well, I've already described most of it in the LKML post and my previous reply. To recap:
- The futexes are created in shared memory and accessed by multiple processes.
- The futexes are created in threes: one is used as a mutex and the other two as condition variables. We use FUTEX_REQUEUE to move blocked threads from the cond vars to the mutex, to wake them up upon releasing the mutex.
- The mutex futex value is divided into sets of bits. Some bits are used for an ABA counter and some for the PID of the owner, to implement robust semantics. This is where a larger futex would be useful, as the 32-bit futex is not large enough to accommodate a full PID.
- For the cond var futexes, we're using the futex bitset feature to selectively wake up threads subscribed to different events. In particular, this allows a thread to wake up all other threads within the same process but not the threads from other processes, which is useful for indicating process-local events (e.g. a request for termination).
Right, so you have a nice custom sync mechanism implementation. If it isn't confidential, can you explain what kind of application/workload you are dealing with? e.g. is it a database?
The application is a media processing engine. The synchronization mechanism I described is a part of the communication mechanism used to exchange media content between media processing nodes. The application is scalable and flexible, and the effectiveness of this synchronization mechanism plays a key role in the application's performance.
participants (3)

- Andrey Semashev
- André Almeida
- Niall Douglas