Regarding comparison to Intel's implementation (BID-based), which is used both in GCC's decimal64 and at Bloomberg: that's especially important because it uses some impressively large tables to get performance (which also bloats binaries).
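For context, here is a minimal sketch of what BID decoding involves (an illustration of the encoding only, not Intel's or GCC's actual code, and the struct/function names are made up); every operation has to unpack the sign, biased exponent, and binary-integer significand like this before it can do any real work:

    #include <cstdint>

    // Illustrative decode of an IEEE 754 decimal32 in BID form into its
    // sign, exponent, and significand (Inf/NaN handling omitted).
    struct bid32_fields
    {
        bool          sign;
        std::int32_t  exponent;     // unbiased; decimal32 exponent bias is 101
        std::uint32_t significand;  // up to 7 decimal digits (<= 9'999'999)
    };

    inline bid32_fields decode_bid32(std::uint32_t bits)
    {
        bid32_fields f {};
        f.sign = (bits >> 31) != 0U;

        if (((bits >> 29) & 0x3U) != 0x3U)
        {
            // Common case: 8-bit biased exponent, 23-bit significand field
            f.exponent    = static_cast<std::int32_t>((bits >> 23) & 0xFFU) - 101;
            f.significand = bits & 0x7FFFFFU;
        }
        else
        {
            // Large-significand case: implicit leading 0b100 on a 21-bit field
            f.exponent    = static_cast<std::int32_t>((bits >> 21) & 0xFFU) - 101;
            f.significand = 0x800000U | (bits & 0x1FFFFFU);
        }

        return f;
    }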
So far we don't rely on giant tables. Chris added a QEMU-emulated STM board to our CI that checks our ROM usage, among other things. I'll work on some benchmarks and see if it's worth building out the non-IEEE 754 decimal32 described above.
I have added benchmarks for the GCC builtin _Decimal32, _Decimal64, and _Decimal128. For the add, sub, mul, and div operations, the geometric means of the runtime ratios (boost.decimal runtime / GCC runtime) are:
decimal32: 0.932
decimal64: 1.750
decimal128: 4.837
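To be explicit about the metric (the figures above are the measured results; the ratios in this snippet are placeholders), each reported number is the geometric mean of the per-operation runtime ratios, computed along these lines:

    #include <cmath>
    #include <cstdio>
    #include <vector>

    int main()
    {
        // Placeholder per-operation ratios (boost.decimal runtime / GCC runtime)
        // for add, sub, mul, and div.
        const std::vector<double> ratios {0.91, 0.95, 0.93, 0.94};

        double log_sum = 0.0;
        for (const double r : ratios)
        {
            log_sum += std::log(r);
        }

        // Geometric mean = exp(mean of the logs)
        const double geo_mean = std::exp(log_sum / static_cast<double>(ratios.size()));
        std::printf("geometric mean ratio: %.3f\n", geo_mean);
    }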
It's interesting that for every operation GCC's _Decimal64 is faster than _Decimal32, whereas our run time increases with size. In any event, I should be able to reuse all of the existing boost::decimal::decimal32 implementations of the basic operations (since they already benchmark faster than the reference) with a class that directly stores the sign, exp, and significand, and see if it's noticeably faster.
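Something along these lines (a sketch only; the names and field widths are illustrative, not the actual layout):

    #include <cstdint>

    // Sketch of a type that stores the decoded components directly so no
    // per-operation unpacking is required. Names and widths are illustrative.
    struct decimal32_components
    {
        std::uint32_t significand;  // up to 7 decimal digits
        std::int32_t  exponent;     // unbiased decimal exponent
        bool          sign;         // true for negative values
    };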
Glen,

In the past few weeks we've been hammering on adding non-IEEE 754 compliant fast types to the library that directly store the sign, exp, and significand rather than decoding them at each step. Additionally, the fast types normalize the significand and exponent rather than allowing decimal cohorts, since accounting for cohorts was the first step in each operation (and cohorts are likely of only academic interest if the goal is speed). With these changes we found that decimal32_fast runtime in the benchmarks is approximately 0.251 of regular decimal32 while yielding identical computational results. It's the classic trade-off of space for time, since decimal32_fast contains a minimum of 48 bits of state. We will continue to squeeze more performance out of the library and add decimal64_fast and decimal128_fast, but we wanted to provide some intermediate results.

Matt
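P.S. For anyone curious, the normalization amounts to roughly this (the target digit range and helper name are illustrative, not the library's actual internals):

    #include <cstdint>

    // Illustrative normalization: scale the significand into a single
    // canonical 7-digit range so every value has exactly one representation
    // and no cohort handling is needed at the start of each operation.
    inline void normalize(std::uint32_t& significand, std::int32_t& exponent)
    {
        if (significand == 0U)
        {
            exponent = 0;
            return;
        }

        while (significand < 1'000'000U)    // pad up to 7 digits
        {
            significand *= 10U;
            --exponent;
        }

        while (significand > 9'999'999U)    // trim back to 7 digits (truncating)
        {
            significand /= 10U;
            ++exponent;
        }
    }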