Re: [Boost-Users] NDEBUG, affects floating point math?
On Fri, Feb 28, 2003 at 09:53:56AM +0100, Joerg Walter wrote:
I wrote a small piece of C code to calculate a symmetric-matrix by row-matrix product by directly accessing the memory (which btw is about 1000% faster than ublas's generated code with gcc 3.2.3 and -O3, why is ublas so slow?)
If you have *not* defined NDEBUG, ublas uses many internal consistency checks and usually no expression templates.
I was referring to the case when NDEBUG is defined and comparing symmetric_matrix by general_matrix products. uBLAS is quite slow with this operation. However, I compared my symmetric_matrix by general_matrix_transpose product and ublas was only 3-4 times slower than the optimized C, not too bad.
And when I compile with -DNDEBUG the results differ by around +/- 1e-16, but when its not defined the results match almost perfectly.
You've probably hit a bug, may be an old one (please consider to check against CVS, too). Could you please send or post a small sample, if the problem persists?
I think I know the source of the problem now, and I'm not sure if it's one that boost should handle or not. I attached my test code at the end of this email, only 2.7k, I hope it isn't inappropriate for the list. 387 floating point registers are stored as 80 bits internally and rounded to 64 bits when assigned to memory. SSE2 floating-point registers are stored in the native 64-bit format. The following table lists the compiler flags used and the accumulated absolute difference between the two products over 100 randomly generated matrices. gcc version 3.2.3 20030221 (Debian prerelease) no compiler flags diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0)) -DNDEBUG diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0)) -O1 diff sum [5,3]((2.91694e-15,2.65348e-15,2.7027e-15),(4.4049e-15,3.29251e-15,3.44863e-15),(4.10436e-15,4.14816e-15,3.24393e-15),(3.57873e-15,3.75741e-15,2.64068e-15),(5.03091e-15,2.79637e-15,4.06099e-15)) -O1 -DNDEBUG diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0)) -O2 diff sum [5,3]((4.37584e-15,4.8659e-15,3.61516e-15),(4.61783e-15,3.77736e-15,4.04169e-15),(4.46691e-15,5.10876e-15,5.00576e-15),(3.64118e-15,4.91968e-15,5.17641e-15),(3.97273e-15,4.25528e-15,3.75394e-15)) -O2 -DNDEBUG diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0)) -O3 diff sum [5,3]((4.96131e-15,3.88968e-15,4.34245e-15),(5.07754e-15,5.91974e-15,4.19109e-15),(4.89539e-15,4.29279e-15,5.06344e-15),(4.79781e-15,4.66033e-15,3.77996e-15),(4.75227e-15,4.72539e-15,3.33414e-15)) -O3 -DNDEBUG diff sum [5,3]((4.60569e-15,2.67928e-15,2.7079e-15),(3.21054e-15,2.98112e-15,2.61249e-15),(3.91701e-15,3.60822e-15,2.93515e-15),(3.23006e-15,3.42434e-15,3.53363e-15),(3.18105e-15,2.77035e-15,3.67198e-15)) -O1 -mfpmath=sse,387 -march=pentium4 -msse2 diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0)) -O1 -mfpmath=sse,387 -march=pentium4 -msse2 -DNDEBUG diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0)) -O2 -mfpmath=sse,387 -march=pentium4 -msse2 diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0)) -O2 -mfpmath=sse,387 -march=pentium4 -msse2 -DNDEBUG diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0)) -O3 -mfpmath=sse,387 -march=pentium4 -msse2 diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0)) -O3 -mfpmath=sse,387 -march=pentium4 -msse2 -DNDEBUG diff sum [5,3]((3.62991e-15,3.74006e-15,4.62824e-15),(4.00548e-15,4.5311e-15,3.60129e-15),(3.48332e-15,5.11743e-15,3.08174e-15),(3.49373e-15,4.05752e-15,4.2865e-15),(4.81169e-15,4.51549e-15,4.18589e-15)) Is this perhaps a compiler bug? Part of the problem actually comes from whether the products are accumulated into a register by the compiler, or whether they are accumulated into memory. If accumulated into a 387 register it has higher-precision before it is rounded back down to 64-bits to be stored into memory. If the compiler generates code to accumulate the products directly into memory then the results of each individual product are rounded to 64-bits at every iteration and the result will be slightly less accurate. Now, when the -msse2 flag is used, the registers are 64-bit only and thus no accuracy is gained by using them, only speed. -- Stephen [Non-text portions of this message have been removed]
Hi Stephen, you wrote:
I wrote a small piece of C code to calculate a symmetric-matrix by row-matrix product by directly accessing the memory (which btw is about 1000% faster than ublas's generated code with gcc 3.2.3 and -O3, why is ublas so slow?)
If you have *not* defined NDEBUG, ublas uses many internal consistency checks and usually no expression templates.
I was referring to the case when NDEBUG is defined and comparing symmetric_matrix by general_matrix products. uBLAS is quite slow with this operation. However, I compared my symmetric_matrix by general_matrix_transpose product and ublas was only 3-4 times slower than the optimized C, not too bad.
OK, I've noticed that you use something like C = prod (A, B); As far as I understand, your matrices are alias free, so you could try C.assign (prod (A, B)); instead.
And when I compile with -DNDEBUG the results differ by around +/- 1e-16, but when its not defined the results match almost perfectly.
You've probably hit a bug, may be an old one (please consider to check against CVS, too). Could you please send or post a small sample, if the problem persists?
I think I know the source of the problem now, and I'm not sure if it's one that boost should handle or not. I attached my test code at the end of this email, only 2.7k, I hope it isn't inappropriate for the list.
387 floating point registers are stored as 80 bits internally and rounded to 64 bits when assigned to memory. SSE2 floating-point registers are stored in the native 64-bit format. The following table lists the compiler flags used and the accumulated absolute difference between the two products over 100 randomly generated matrices.
gcc version 3.2.3 20030221 (Debian prerelease)
no compiler flags diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
-DNDEBUG diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
-O1 diff sum
[5,3]((2.91694e-15,2.65348e-15,2.7027e-15),(4.4049e-15,3.29251e-15,3.44863e- 15),(4.10436e-15,4.14816e-15,3.24393e-15),(3.57873e-15,3.75741e-15,2.64068e- 15),(5.03091e-15,2.79637e-15,4.06099e-15))
-O1 -DNDEBUG diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
-O2 diff sum
[5,3]((4.37584e-15,4.8659e-15,3.61516e-15),(4.61783e-15,3.77736e-15,4.04169e -15),(4.46691e-15,5.10876e-15,5.00576e-15),(3.64118e-15,4.91968e-15,5.17641e -15),(3.97273e-15,4.25528e-15,3.75394e-15))
-O2 -DNDEBUG diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
-O3 diff sum
[5,3]((4.96131e-15,3.88968e-15,4.34245e-15),(5.07754e-15,5.91974e-15,4.19109 e-15),(4.89539e-15,4.29279e-15,5.06344e-15),(4.79781e-15,4.66033e-15,3.77996 e-15),(4.75227e-15,4.72539e-15,3.33414e-15))
-O3 -DNDEBUG diff sum
[5,3]((4.60569e-15,2.67928e-15,2.7079e-15),(3.21054e-15,2.98112e-15,2.61249e -15),(3.91701e-15,3.60822e-15,2.93515e-15),(3.23006e-15,3.42434e-15,3.53363e -15),(3.18105e-15,2.77035e-15,3.67198e-15))
-O1 -mfpmath=sse,387 -march=pentium4 -msse2 diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
-O1 -mfpmath=sse,387 -march=pentium4 -msse2 -DNDEBUG diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
-O2 -mfpmath=sse,387 -march=pentium4 -msse2 diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
-O2 -mfpmath=sse,387 -march=pentium4 -msse2 -DNDEBUG diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
-O3 -mfpmath=sse,387 -march=pentium4 -msse2 diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
-O3 -mfpmath=sse,387 -march=pentium4 -msse2 -DNDEBUG diff sum
[5,3]((3.62991e-15,3.74006e-15,4.62824e-15),(4.00548e-15,4.5311e-15,3.60129e -15),(3.48332e-15,5.11743e-15,3.08174e-15),(3.49373e-15,4.05752e-15,4.2865e- 15),(4.81169e-15,4.51549e-15,4.18589e-15))
Is this perhaps a compiler bug?
At least an issue ;-) Building your program with Intel 7.0 (boost default settings) results in diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0)) for both debug and release builds. Setting the magical compiler switch -ffloat-store seems to heal the problem for GCC, too.
Part of the problem actually comes from whether the products are accumulated into a register by the compiler, or whether they are accumulated into memory. If accumulated into a 387 register it has higher-precision before it is rounded back down to 64-bits to be stored into memory. If the compiler generates code to accumulate the products directly into memory then the results of each individual product are rounded to 64-bits at every iteration and the result will be slightly less accurate. Now, when the -msse2 flag is used, the registers are 64-bit only and thus no accuracy is gained by using them, only speed.
Thanks, Joerg
One correction:
Building your program with Intel 7.0 (boost default settings) results in
diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
for both debug and release builds.
But I needed to use double *rdata = new double (n); ... delete [] rdata; instead of the VLA(?) double rdata[n]; I didn't even know, that GCC accepts that already (in a C++ program ;-). Regards, Joerg
jhr.walter@t-online.de writes:
But I needed to use
double *rdata = new double (n);
This dynamically creates one double object and initializes it with the value n and uses its address to intialize rdata.
... delete [] rdata;
And this tells the compiler to deallocate the dynamically allocated array rdata points to. Your program has undefined behavior. I haven't followed this thread, but the code you posted here definitely looks very fishy.
Hi Thomas, you wrote:
But I needed to use
double *rdata = new double (n);
This dynamically creates one double object and initializes it with the value n and uses its address to intialize rdata.
Here's the original code: double *rdata = new double[n]; // stores non-contigous part of the current row of A
... delete [] rdata;
And this tells the compiler to deallocate the dynamically allocated array rdata points to. Your program has undefined behavior.
I haven't followed this thread, but the code you posted here definitely looks very fishy.
Thanks for the correction and sorry for the confusion, Joerg
participants (3)
-
jhr.walter@t-online.de
-
Stephen Crowley
-
Thomas Maeder