[dynamic_bitset] Using GCC built-in functions may increase performance of some methods
Hi all, when I was playing around with the Boost.DynamicBitset library, I noticed the last lines of the count() function (which returns the number of bits in the bitset that are set):

    return do_count(m_bits.begin(), num_blocks(), Block(0),
                    static_cast<value_to_type<(bool) mode> *>(0));

Well, this line was last modified 10 years ago.

Let's go deeper. GNU C provides several language features not found in ISO standard C. These include a number of bit operations - you can look at them here: https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html. For example, __builtin_popcountll counts the number of set bits in an unsigned long long, so we can use this kind of code to count the bits (don't judge, I wrote this quickly and didn't think about coding standards):

    size_type res = 0;
    for (size_type i = 0; i < m_bits.size(); i++) {
        res += __builtin_popcountll(m_bits[i]);
    }

I checked that m_bits[] and popcount() both work with 64-bit values on my machine. It halved the running time of a program that has a bitset and calls count() over and over again - here's the code I used: https://gist.github.com/Izaron/5005fed4184ee6483699191dc5ef3677. These __builtin_XXX functions could be used wherever possible, guarded by preprocessor directives. Or is it actually used already, and the longer running time means something else? I used the latest version of Boost.
On Wed, Aug 22, 2018 at 10:03 PM, Evgeny Shulgin via Boost wrote:
Hi all, when I was playing around with the Boost.DynamicBitset library, I noticed the last lines of the count() function (which returns the number of bits in the bitset that are set):

    return do_count(m_bits.begin(), num_blocks(), Block(0),
                    static_cast<value_to_type<(bool) mode> *>(0));

Well, this line was last modified 10 years ago.

Let's go deeper. GNU C provides several language features not found in ISO standard C. These include a number of bit operations - you can look at them here: https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html. For example, __builtin_popcountll counts the number of set bits in an unsigned long long, so we can use this kind of code to count the bits (don't judge, I wrote this quickly and didn't think about coding standards):

    size_type res = 0;
    for (size_type i = 0; i < m_bits.size(); i++) {
        res += __builtin_popcountll(m_bits[i]);
    }

I checked that m_bits[] and popcount() both work with 64-bit values on my machine. It halved the running time of a program that has a bitset and calls count() over and over again - here's the code I used: https://gist.github.com/Izaron/5005fed4184ee6483699191dc5ef3677. These __builtin_XXX functions could be used wherever possible, guarded by preprocessor directives. Or is it actually used already, and the longer running time means something else? I used the latest version of Boost.
I think such an optimization would be useful. Note that MSVC also has intrinsics for popcount[1], although I don't think those are supported when the target CPU doesn't implement the corresponding instructions. You would have to check at compile time whether the target CPU supports it (e.g. by checking if __AVX__ is defined).

I think Boost.DynamicBitset is community-maintained. If you create a PR, someone from the CMT will take a look.

[1]: https://msdn.microsoft.com/en-us/library/bb385231.aspx
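For illustration only, a rough sketch of what that compile-time selection might look like (the helper name and the fallback loop are mine, not existing dynamic_bitset code):

    #include <cstdint>
    #if defined(_MSC_VER) && defined(_M_X64) && defined(__AVX__)
    #  include <intrin.h>   // __popcnt64
    #endif

    inline unsigned popcount_block(std::uint64_t block)
    {
    #if defined(__GNUC__) || defined(__clang__)
        return __builtin_popcountll(block);                // GCC/Clang built-in
    #elif defined(_MSC_VER) && defined(_M_X64) && defined(__AVX__)
        return static_cast<unsigned>(__popcnt64(block));   // MSVC intrinsic; assumes a POPCNT-capable CPU
    #else
        unsigned n = 0;                                    // portable fallback
        while (block) { block &= block - 1; ++n; }
        return n;
    #endif
    }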
On 23/08/2018 09:16, Andrey Semashev wrote:
I think such an optimization would be useful. Note that MSVC also has intrinsics for popcount[1], although I don't think those are supported when the target CPU doesn't implement the corresponding instructions. You would have to check at compile time whether the target CPU supports it (e.g. by checking if __AVX__ is defined).
While compile-time detection is better if you can do it (because it lets it be completely inlined), if the compile-time detection fails you can still do runtime detection, e.g. by defining something like:

    // header file
    extern int (*popcnt64)(uint64_t);

    // source file
    static bool is_popcnt_supported()
    {
        int info[4] = { 0 };
        __cpuid(info, 1);
        return (info[2] & 0x00800000) != 0;
    }

    static int popcnt64_intrinsic(uint64_t v)
    {
        return /* _mm_popcnt_u64(v) or __builtin_popcountll(v) */;
    }

    static int popcnt64_emulation(uint64_t v)
    {
        // code that calculates it with bit twiddling
    }

    static int popcnt64_auto(uint64_t v)
    {
        popcnt64 = is_popcnt_supported() ? &popcnt64_intrinsic : &popcnt64_emulation;
        return popcnt64(v);
    }

    int (*popcnt64)(uint64_t) = &popcnt64_auto;

Repeat for other argument sizes as needed. You could probably do something fancier with C++11 guaranteed static initialisation, but this will work on all compilers.
On 08/23/18 05:08, Gavin Lambert via Boost wrote:
On 23/08/2018 09:16, Andrey Semashev wrote:
I think such an optimization would be useful. Note that MSVC also has intrinsics for popcount[1], although I don't think those are supported when the target CPU doesn't implement the corresponding instructions. You would have to check at compile time whether the target CPU supports it (e.g. by checking if __AVX__ is defined).
While compile-time detection is better if you can do it (because it lets it be completely inlined), if the compile-time detection fails you can still do runtime detection, e.g. by defining something like:

    // header file
    extern int (*popcnt64)(uint64_t);

    // source file
    static bool is_popcnt_supported()
    {
        int info[4] = { 0 };
        __cpuid(info, 1);
        return (info[2] & 0x00800000) != 0;
    }

    static int popcnt64_intrinsic(uint64_t v)
    {
        return /* _mm_popcnt_u64(v) or __builtin_popcountll(v) */;
    }

    static int popcnt64_emulation(uint64_t v)
    {
        // code that calculates it with bit twiddling
    }

    static int popcnt64_auto(uint64_t v)
    {
        popcnt64 = is_popcnt_supported() ? &popcnt64_intrinsic : &popcnt64_emulation;
        return popcnt64(v);
    }

    int (*popcnt64)(uint64_t) = &popcnt64_auto;

Repeat for other argument sizes as needed. You could probably do something fancier with C++11 guaranteed static initialisation, but this will work on all compilers.
This code requires a separately compiled unit. It can be done in header-only style, though.
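For illustration, a header-only variant might look something like this on MSVC x64 (the names are mine; the function-local static is the C++11 guaranteed static initialisation mentioned above, so the CPUID check runs only once):

    // Assumes MSVC targeting x64; <intrin.h> provides __cpuid and __popcnt64.
    #include <cstdint>
    #include <intrin.h>

    inline bool cpu_has_popcnt()
    {
        int info[4] = { 0 };
        __cpuid(info, 1);
        return (info[2] & 0x00800000) != 0;   // CPUID.1:ECX bit 23 = POPCNT
    }

    inline int popcnt64_fallback(std::uint64_t v)
    {
        int n = 0;
        while (v) { v &= v - 1; ++n; }        // clear the lowest set bit each iteration
        return n;
    }

    inline int popcnt64(std::uint64_t v)
    {
        // Initialised exactly once, on first use (thread-safe since C++11).
        static const bool has_popcnt = cpu_has_popcnt();
        return has_popcnt ? static_cast<int>(__popcnt64(v))
                          : popcnt64_fallback(v);
    }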
But not on ARM or PowerPC, says the embedded guy. Though both GCC and Clang support __builtin_popcount() across all processors, which is what we are using these days. I did notice one library does a call out to OpenCL for popcount(). I was honestly thinking about the same sort of approach, though all compile-time based. But I wasn't sure what you would do with a signed value, as the builtin treats everything as unsigned. Do you mask out the sign bit, just cast it, or consider it an error to take a bit count of a signed value?
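(As a minimal sketch of the "just cast it" option, assuming two's-complement representation; the helper name is made up:)

    #include <cstdint>

    inline int popcount_signed(std::int64_t v)
    {
        // Conversion to the unsigned counterpart is defined modulo 2^64,
        // so the two's-complement bit pattern is preserved.
        return __builtin_popcountll(static_cast<std::uint64_t>(v));
    }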
On Aug 22, 2018, at 10:08 PM, Gavin Lambert via Boost wrote:
[snip]
On Thu, 23 Aug 2018 at 00:17, Andrey Semashev via Boost <boost@lists.boost.org> wrote:
I think such an optimization would be useful. Note that MSVC also has intrinsics for popcount[1], although I don't think those are supported when the target CPU doesn't implement the corresponding instructions.
This instruction is supported on Intel Penryn (x86, 2007) and AMD 3rd Gen. and onward from there.

You would have to check at compile time whether the target CPU supports it (e.g. by checking if __AVX__ is defined).
On a Penryn, __AVX__ will give a false negative. Checking __cpuid as Gavin writes should also work on that processor. On ARM, which appears to have become an important CPU type, there is the corresponding VCNT instruction: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHCFDB... , but this instruction does not seem to be provided as an intrinsic by Microsoft.

degski
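(For what it's worth, a minimal sketch of the VCNT route using the GCC/Clang NEON intrinsics on AArch64; the function name is made up:)

    #include <cstdint>
    #include <arm_neon.h>   // assumes an AArch64 target

    inline int popcnt64_neon(std::uint64_t v)
    {
        // VCNT counts the set bits in each of the eight bytes;
        // vaddv_u8 then sums those eight per-byte counts.
        return vaddv_u8(vcnt_u8(vcreate_u8(v)));
    }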
On Wed, Aug 22, 2018 at 2:03 PM, Evgeny Shulgin via Boost <boost@lists.boost.org> wrote:
Hi all, when I was playing around with the Boost.DynamicBitset library, I noticed the last lines of the count() function (which returns the number of bits in the bitset that are set):

    return do_count(m_bits.begin(), num_blocks(), Block(0),
                    static_cast<value_to_type<(bool) mode> *>(0));

Well, this line was last modified 10 years ago.

Let's go deeper. GNU C provides several language features not found in ISO standard C. These include a number of bit operations - you can look at them here: https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html. For example, __builtin_popcountll counts the number of set bits in an unsigned long long, so we can use this kind of code to count the bits (don't judge, I wrote this quickly and didn't think about coding standards):

    size_type res = 0;
    for (size_type i = 0; i < m_bits.size(); i++) {
        res += __builtin_popcountll(m_bits[i]);
    }

I checked that m_bits[] and popcount() both work with 64-bit values on my machine. It halved the running time of a program that has a bitset and calls count() over and over again - here's the code I used: https://gist.github.com/Izaron/5005fed4184ee6483699191dc5ef3677. These __builtin_XXX functions could be used wherever possible, guarded by preprocessor directives. Or is it actually used already, and the longer running time means something else?
By a curious coincidence, I have been working in that part of libc++ ;-) You can take a look here: https://github.com/llvm-mirror/libcxx/blob/master/include/bit

GCC and Clang have provided the bit intrinsics for a long, long time - since before any version libc++ cares about. Boost's mileage may vary. The MSVC intrinsics are missing (for my purposes) important things - like "noexcept" and "constexpr".

-- Marshall
On Sun, 26 Aug 2018 at 04:42, Marshall Clow via Boost wrote:
The MSVC intrinsics are missing (for my purposes) important things - like "noexcept" and "constexpr".
They are C-"functions", they "are" noexcept: you can mark any C++ function containing calls to those intrinsics noexcept, and you won't be lying. The constexpr bit depends on what the C++ compiler does with those intrinsic calls. With some exceptions, most intrinsics map to exactly one assembler instruction, i.e. there is no reason why they could not be constexpr (but that is a C++ concept).

degski
On Sat, Aug 25, 2018 at 11:55 PM, degski wrote:
On Sun, 26 Aug 2018 at 04:42, Marshall Clow via Boost <boost@lists.boost.org> wrote:
The MSVC intrinsics are missing (for my purposes) important things - like "noexcept" and "constexpr".
They are C-"functions", they "are" noexcept: you can mark any C++ function containing calls to those intrinsics noexcept, and you won't be lying. The constexpr bit depends on what the C++ compiler does with those intrinsic calls. With some exceptions, most intrinsics map to exactly one assembler instruction, i.e. there is no reason why they could not be constexpr (but that is a C++ concept).
They certainly *could be*, but they are not. BTW, the mapping to one assembler instruction is irrelevant for constexpr; rather the question is "does the compiler know how to evaluate that at compile time". -- Marshall
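(To illustrate the difference, a small sketch; the wrapper name is made up. GCC and Clang accept their popcount built-in in constant expressions, so this compiles there, while an equivalent wrapper around MSVC's __popcnt64 could not be declared constexpr:)

    constexpr int popcount64(unsigned long long v) noexcept
    {
        return __builtin_popcountll(v);   // evaluated at compile time by GCC/Clang
    }

    static_assert(popcount64(0xFFULL) == 8, "eight bits set");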
participants (6)
- Andrey Semashev
- Brian Kuhl
- degski
- Evgeny Shulgin
- Gavin Lambert
- Marshall Clow