Hi Stephen, you wrote:
I wrote a small piece of C code to calculate a symmetric-matrix by row-matrix product by directly accessing the memory (which btw is about 1000% faster than ublas's generated code with gcc 3.2.3 and -O3, why is ublas so slow?)
If you have *not* defined NDEBUG, ublas uses many internal consistency checks and usually no expression templates.
I was referring to the case when NDEBUG is defined and comparing symmetric_matrix by general_matrix products. uBLAS is quite slow with this operation. However, I compared my symmetric_matrix by general_matrix_transpose product and ublas was only 3-4 times slower than the optimized C, not too bad.
OK, I've noticed that you use something like C = prod (A, B); As far as I understand, your matrices are alias free, so you could try C.assign (prod (A, B)); instead.
And when I compile with -DNDEBUG the results differ by around +/- 1e-16, but when its not defined the results match almost perfectly.
You've probably hit a bug, may be an old one (please consider to check against CVS, too). Could you please send or post a small sample, if the problem persists?
I think I know the source of the problem now, and I'm not sure if it's one that boost should handle or not. I attached my test code at the end of this email, only 2.7k, I hope it isn't inappropriate for the list.
387 floating point registers are stored as 80 bits internally and rounded to 64 bits when assigned to memory. SSE2 floating-point registers are stored in the native 64-bit format. The following table lists the compiler flags used and the accumulated absolute difference between the two products over 100 randomly generated matrices.
gcc version 3.2.3 20030221 (Debian prerelease)
no compiler flags diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
-DNDEBUG diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
-O1 diff sum
[5,3]((2.91694e-15,2.65348e-15,2.7027e-15),(4.4049e-15,3.29251e-15,3.44863e- 15),(4.10436e-15,4.14816e-15,3.24393e-15),(3.57873e-15,3.75741e-15,2.64068e- 15),(5.03091e-15,2.79637e-15,4.06099e-15))
-O1 -DNDEBUG diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
-O2 diff sum
[5,3]((4.37584e-15,4.8659e-15,3.61516e-15),(4.61783e-15,3.77736e-15,4.04169e -15),(4.46691e-15,5.10876e-15,5.00576e-15),(3.64118e-15,4.91968e-15,5.17641e -15),(3.97273e-15,4.25528e-15,3.75394e-15))
-O2 -DNDEBUG diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
-O3 diff sum
[5,3]((4.96131e-15,3.88968e-15,4.34245e-15),(5.07754e-15,5.91974e-15,4.19109 e-15),(4.89539e-15,4.29279e-15,5.06344e-15),(4.79781e-15,4.66033e-15,3.77996 e-15),(4.75227e-15,4.72539e-15,3.33414e-15))
-O3 -DNDEBUG diff sum
[5,3]((4.60569e-15,2.67928e-15,2.7079e-15),(3.21054e-15,2.98112e-15,2.61249e -15),(3.91701e-15,3.60822e-15,2.93515e-15),(3.23006e-15,3.42434e-15,3.53363e -15),(3.18105e-15,2.77035e-15,3.67198e-15))
-O1 -mfpmath=sse,387 -march=pentium4 -msse2 diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
-O1 -mfpmath=sse,387 -march=pentium4 -msse2 -DNDEBUG diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
-O2 -mfpmath=sse,387 -march=pentium4 -msse2 diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
-O2 -mfpmath=sse,387 -march=pentium4 -msse2 -DNDEBUG diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
-O3 -mfpmath=sse,387 -march=pentium4 -msse2 diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0))
-O3 -mfpmath=sse,387 -march=pentium4 -msse2 -DNDEBUG diff sum
[5,3]((3.62991e-15,3.74006e-15,4.62824e-15),(4.00548e-15,4.5311e-15,3.60129e -15),(3.48332e-15,5.11743e-15,3.08174e-15),(3.49373e-15,4.05752e-15,4.2865e- 15),(4.81169e-15,4.51549e-15,4.18589e-15))
Is this perhaps a compiler bug?
At least an issue ;-) Building your program with Intel 7.0 (boost default settings) results in diff sum [5,3]((0,0,0),(0,0,0),(0,0,0),(0,0,0),(0,0,0)) for both debug and release builds. Setting the magical compiler switch -ffloat-store seems to heal the problem for GCC, too.
Part of the problem actually comes from whether the products are accumulated into a register by the compiler, or whether they are accumulated into memory. If accumulated into a 387 register it has higher-precision before it is rounded back down to 64-bits to be stored into memory. If the compiler generates code to accumulate the products directly into memory then the results of each individual product are rounded to 64-bits at every iteration and the result will be slightly less accurate. Now, when the -msse2 flag is used, the registers are 64-bit only and thus no accuracy is gained by using them, only speed.
Thanks, Joerg