[multiprecission] Large divides layered on 128 bit divides
Hi,
The architecture of cpp_int being build on 64 bit arithmetic using 128 bit double_limb_type is interesting.
I have a question on the large divide (divide_unsigned_helper). It uses the upper portions of the large integers to get an estimation of the quotient. Subtracts out a multiple of that quotient and repeats.
It does this in 128 bit values if available from the compiler:
double_limb_type a = (static_cast
On 19/08/2021 23:02, Neill Clift via Boost-users wrote:
Hi,
The architecture of cpp_int being build on 64 bit arithmetic using 128 bit double_limb_type is interesting.
I have a question on the large divide (divide_unsigned_helper). It uses the upper portions of the large integers to get an estimation of the quotient. Subtracts out a multiple of that quotient and repeats.
It does this in 128 bit values if available from the compiler:
double_limb_type a = (static_cast
(prem[r_order]) << CppInt1::limb_bits) | prem[r_order - 1]; double_limb_type b = py[y_order];
double_limb_type v = a / b;
The compiler emulates this operation in the routine __udivmodti4 which itself uses an iterative approach.
It seems to use a pretty basic shift and subtract algorithm mind you.
As a general rule for multiprecision is it OK to layer the Knuth like algorithm D on top of each other this way.
I have no idea myself but wonder if this is a known issue. I would have guessed that it made sense to do a 64 by 64 bit divide to guess the quotient and repeat.
Good question. As I recall I tried both single-limb and double-limb partial-quotients and the double-limb (128 bit) version was slightly faster. There is a balance here between removing as large a chunk as you can with each loop, compared to more expensive operations within the loop. You could for example, perform "Karatsuba-like" division by splitting a B-bit numerator into two B/2 chunks and perform schoolboy division on the two "digit" numbers. But the fact that no-one seems to have done this suggests how well it must work ;) On the other hand, __int128, while a synthetic type, is sufficiently well optimised for this to be a useful chunk size. HTH, John. -- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus
Thanks John,
I now know a bit more about this. It seems the 128bit support in windows and linux is done differently. I am told optimized routines inside the c library are used on linux but on windows support for divide is provided by compiler-rt and the code is very old and unoptimized.
That given what you tell me I think my way forward is to try and get a better version of the divide routines working.
Neill.
-----Original Message-----
From: Boost-users
Hi,
The architecture of cpp_int being build on 64 bit arithmetic using 128 bit double_limb_type is interesting.
I have a question on the large divide (divide_unsigned_helper). It uses the upper portions of the large integers to get an estimation of the quotient. Subtracts out a multiple of that quotient and repeats.
It does this in 128 bit values if available from the compiler:
double_limb_type a = (static_cast
(prem[r_order]) << CppInt1::limb_bits) | prem[r_order - 1]; double_limb_type b = py[y_order];
double_limb_type v = a / b;
The compiler emulates this operation in the routine __udivmodti4 which itself uses an iterative approach.
It seems to use a pretty basic shift and subtract algorithm mind you.
As a general rule for multiprecision is it OK to layer the Knuth like algorithm D on top of each other this way.
I have no idea myself but wonder if this is a known issue. I would have guessed that it made sense to do a 64 by 64 bit divide to guess the quotient and repeat.
Good question. As I recall I tried both single-limb and double-limb partial-quotients and the double-limb (128 bit) version was slightly faster. There is a balance here between removing as large a chunk as you can with each loop, compared to more expensive operations within the loop. You could for example, perform "Karatsuba-like" division by splitting a B-bit numerator into two B/2 chunks and perform schoolboy division on the two "digit" numbers. But the fact that no-one seems to have done this suggests how well it must work ;) On the other hand, __int128, while a synthetic type, is sufficiently well optimised for this to be a useful chunk size. HTH, John. -- This email has been checked for viruses by Avast antivirus software. https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.avast.c... _______________________________________________ Boost-users mailing list Boost-users@lists.boost.org https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.boost...
participants (2)
-
John Maddock
-
Neill Clift