SIMD implementation of uBLAS
Hi, I have developed a vector addition algorithm which exploits hardware parallelism (an SSE implementation).

-- ----------------
Atluri Aditya Avinash, India.
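[The code attachment itself is not preserved in this archive. For context, a minimal sketch of what SSE vector addition with intrinsics typically looks like, assuming 16-byte-aligned float arrays whose length is a multiple of 4; the function name is illustrative, not the posted SSE1 class.]

    #include <xmmintrin.h>   // SSE intrinsics
    #include <cstddef>

    // c = a + b, four packed floats per iteration.
    void add_sse(const float* a, const float* b, float* c, std::size_t n)
    {
        for (std::size_t i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(a + i);   // aligned load of 4 floats
            __m128 vb = _mm_load_ps(b + i);
            _mm_store_ps(c + i, _mm_add_ps(va, vb));
        }
    }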
On 29/05/2013 06.13, Aditya Avinash wrote:
Hi, I have developed a vector addition algorithm which exploits hardware parallelism (an SSE implementation).
A few comments:

- That is not C++ but just C in the disguise of C++ code:
  . the SSE1 constructor doesn't use an initialization list;
  . SSE1 doesn't have a destructor, and the user has to explicitly call the Free method.
- Const-correctness is not in place.
- The SSE namespace should have been put in a "detail" namespace.
- Use memcpy instead of an explicit for loop.
- Why is SSE1 a template when it works only when T is a single-precision floating-point value?

Also, I believe a nicer interface would have been:

    SSE1::vector A(1024);
    SSE1::vector B(1024);
    SSE1::vector C(1024);
    C = A + B;

Regards,
Gaetano Mendola
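[A minimal sketch of the kind of RAII interface suggested above, assuming float-only storage and the copy-and-swap idiom; all names are illustrative, this is not the posted code.]

    #include <xmmintrin.h>   // SSE intrinsics and _mm_malloc/_mm_free
    #include <cstddef>
    #include <cstring>

    namespace sse { namespace detail {

    class vector {
        float*      data_;
        std::size_t size_;
    public:
        explicit vector(std::size_t n)
            : data_(static_cast<float*>(_mm_malloc(n * sizeof(float), 16)))
            , size_(n) {}

        vector(const vector& o)
            : data_(static_cast<float*>(_mm_malloc(o.size_ * sizeof(float), 16)))
            , size_(o.size_)
        { std::memcpy(data_, o.data_, size_ * sizeof(float)); }

        vector& operator=(vector o) { swap(o); return *this; }  // copy-and-swap

        ~vector() { _mm_free(data_); }   // RAII: no explicit Free()

        void swap(vector& o) {
            float* d = data_;      data_ = o.data_; o.data_ = d;
            std::size_t s = size_; size_ = o.size_; o.size_ = s;
        }

        std::size_t  size() const                    { return size_; }
        float&       operator[](std::size_t i)       { return data_[i]; }
        const float& operator[](std::size_t i) const { return data_[i]; }

        friend vector operator+(const vector& a, const vector& b) {
            vector c(a.size_);
            std::size_t i = 0;
            for (; i + 4 <= a.size_; i += 4)          // packed 4-wide adds
                _mm_store_ps(c.data_ + i,
                             _mm_add_ps(_mm_load_ps(a.data_ + i),
                                        _mm_load_ps(b.data_ + i)));
            for (; i < a.size_; ++i)                  // scalar tail
                c.data_[i] = a.data_[i] + b.data_[i];
            return c;
        }
    };

    }} // namespace sse::detail

With that in place, the suggested usage reads naturally: sse::detail::vector A(1024), B(1024), C(1024); C = A + B;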
On 29/05/2013 06:45, Gaetano Mendola wrote:
[snip quoted text]
See our work on Boost.SIMD ...
@Gaetano: Thank you for the comments. I'll change accordingly and post it back. I am using T because the code needs to run with double-precision floats as well.
@Joel: Boost.SIMD is generalized; designing algorithms specific to uBLAS increases performance. Odeint has its own SIMD backend.
On Wed, May 29, 2013 at 10:36 AM, Joel Falcou wrote:
[snip quoted text]
-- ---------------- Atluri Aditya Avinash, India.
On 05/29/2013 07:33 AM, Aditya Avinash wrote:
@Gaetano: Thank you for the comments. [...] @Joel: Boost.SIMD is generalized; designing algorithms specific to uBLAS increases performance. Odeint has its own SIMD backend.
odeint has no SIMD backend, at least I am not aware of one. Having one would be really great.
[snip quoted text]
I am sorry, my bad: it's Boost.SIMD. Why isn't it included in Boost? I have only heard about it recently. Is there a chance that it will be added to Boost in the near future?

On Wed, May 29, 2013 at 11:57 AM, Karsten Ahnert <karsten.ahnert@googlemail.com> wrote:
[snip quoted text]
-- ---------------- Atluri Aditya Avinash, India.
On May 29, 2013, at 2:35 AM, Aditya Avinash wrote:
I am sorry, my bad: it's Boost.SIMD. Why isn't it included in Boost? I have only heard about it recently. Is there a chance that it will be added to Boost in the near future?
On Wed, May 29, 2013 at 11:57 AM, Karsten Ahnert < karsten.ahnert@googlemail.com> wrote:
On 05/29/2013 07:33 AM, Aditya Avinash wrote:
[snip lots of quoted text]
On Wed, May 29, 2013 at 10:36 AM, Joel Falcou
wrote: On 29/05/2013 06:45, Gaetano Mendola wrote:
On 29/05/2013 06.13, Aditya Avinash wrote:
[snip even more quoted text]
Regards Gaetano Mendola
See our work on Boost.SIMD ...
[snip multiple sigs and ML footers]

Please read http://www.boost.org/community/policy.html#quoting before posting.

___
Rob
(Sent from my portable computation engine)
I am sorry, my bad: it's Boost.SIMD. Why isn't it included in Boost? I have only heard about it recently. Is there a chance that it will be added to Boost in the near future?
If I ever had to choose, I would go with Boost.SIMD, because the folks at Metascale have put some really hard work into shaping their library, and a custom solution would only be ugly plagiarism. Also consider that in that case you outsource one of your major problems: supporting new vector instructions. You only have to work on them yourself if nobody supports Boost.SIMD anymore.

-Nasos

On 05/29/2013 09:40 AM, Aditya Avinash wrote:
On Wed, May 29, 2013 at 10:36 AM, Joel Falcou wrote:
See our work on Boost.SIMD ...
I have a question specific to you. Implementing uBLAS with its own SIMD code versus using uBLAS with Boost.SIMD: which of these would be faster (performance-wise)?
On Wed, May 29, 2013 at 7:22 PM, Nasos Iliopoulos wrote:
[snip quoted text]
Thank you! This question is for the list: what about ARM NEON?

-- Aditya Avinash Atluri
On 29/05/13 15:40, Aditya Avinash wrote:
On Wed, May 29, 2013 at 10:36 AM, Joel Falcou wrote:
See our work on Boost.SIMD ...
I have a question specific to you. Implementing uBLAS with its own SIMD code versus using uBLAS with Boost.SIMD: which of these would be faster (performance-wise)?
Assuming the code does the same thing, it would be the same.
On 29/05/13 06:13, Aditya Avinash wrote:
Hi, I have developed a vector addition algorithm which exploits hardware parallelism (an SSE implementation).
That's something trivial to do, and unfortunately even that trivial code is broken (it's written for a generic T but clearly does not work for any T besides float). It still has nothing to do with uBLAS.

Bringing SIMD to uBLAS could be fairly difficult. Is this part of the GSoC projects? Who's in charge of this? I'd like to know what the plan is: optimize very specific operations with SIMD, or try to provide a framework to use SIMD in expression templates? The former is better addressed by simply binding BLAS; the latter is certainly not as easy as it sounds.
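[The usual fix for a "generic T that only works for float" is to dispatch on the element type, so each precision gets its matching intrinsics and anything else fails to compile. A minimal sketch; simd_add and its layout are illustrative, not an existing API.]

    #include <xmmintrin.h>   // SSE:  _mm_add_ps
    #include <emmintrin.h>   // SSE2: _mm_add_pd
    #include <cstddef>

    // Primary template intentionally left undefined: only the
    // specializations below are valid, so simd_add<int> is a
    // compile-time error instead of silently broken code.
    template <typename T> struct simd_add;

    template <> struct simd_add<float> {
        static void run(const float* a, const float* b, float* c, std::size_t n) {
            std::size_t i = 0;
            for (; i + 4 <= n; i += 4)   // 4 floats per SSE register
                _mm_storeu_ps(c + i, _mm_add_ps(_mm_loadu_ps(a + i),
                                                _mm_loadu_ps(b + i)));
            for (; i < n; ++i) c[i] = a[i] + b[i];
        }
    };

    template <> struct simd_add<double> {
        static void run(const double* a, const double* b, double* c, std::size_t n) {
            std::size_t i = 0;
            for (; i + 2 <= n; i += 2)   // 2 doubles per SSE2 register
                _mm_storeu_pd(c + i, _mm_add_pd(_mm_loadu_pd(a + i),
                                                _mm_loadu_pd(b + i)));
            for (; i < n; ++i) c[i] = a[i] + b[i];
        }
    };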
Thanks for the comments. I'll work on the code and make the appropriate changes. I'll implement the BLAS soon. No, this is not a GSoC project. It's the second option: provide a framework to use SIMD in expression templates.

On Wed, May 29, 2013 at 3:04 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
[snip quoted text]
-- ---------------- Atluri Aditya Avinash, India.
On 29/05/13 11:46, Aditya Avinash wrote:
It's the second option: provide a framework to use SIMD in expression templates.
Ok, in that case you need to first study how uBlas works. For example, if you write something along the lines of

    a = trans(b + c) * d;

AFAIK what uBlas does is something like

    for (size_t i = 0; i != sz.height; ++i)
        for (size_t j = 0; j != sz.width; ++j)
            a[i][j] = (b[j][i] + c[j][i]) * d[i][j];

What you need to do is change the loop structure and modify the evaluation of all nodes involved to support SIMD. Of course trans is going to be a problem. Thankfully uBlas doesn't have that many functions, so trans and herm are the only ones that exhibit that issue.
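[A rough sketch of the restructured evaluation for a plain elementwise node over row-major storage, processing the contiguous inner dimension in packs of four; all names are illustrative, this is not uBLAS internals.]

    #include <xmmintrin.h>
    #include <cstddef>

    // a[i][j] = b[i][j] + c[i][j], walking the contiguous inner
    // dimension four floats at a time. A real expression-template
    // engine would drive this per node of the expression tree.
    void eval_plus(const float* b, const float* c, float* a,
                   std::size_t rows, std::size_t cols)
    {
        for (std::size_t i = 0; i < rows; ++i) {
            const float* brow = b + i * cols;
            const float* crow = c + i * cols;
            float*       arow = a + i * cols;
            std::size_t j = 0;
            for (; j + 4 <= cols; j += 4)
                _mm_storeu_ps(arow + j,
                              _mm_add_ps(_mm_loadu_ps(brow + j),
                                         _mm_loadu_ps(crow + j)));
            for (; j < cols; ++j)          // scalar tail
                arow[j] = brow[j] + crow[j];
        }
    }

A trans node breaks this scheme: the transposed operand is read with stride cols, so its four values can no longer come from one packed load, which is exactly why trans and herm are singled out above.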
On Wed, May 29, 2013 at 4:10 PM, Mathias Gaunard < mathias.gaunard@ens-lyon.org> wrote:
[snip quoted text]
Should I write SIMD code for the algorithm? Or, as there is no such function in uBLAS, do you want me to develop the CPU code (function)?

-- Aditya Avinash Atluri
On 29/05/13 12:46, Aditya Avinash wrote:
[snip quoted text]
Should I write SIMD code for the algorithm? Or, as there is no such function in uBLAS, do you want me to develop the CPU code (function)?
There is no algorithm here. It's just the evaluation of a uBlas matrix expression template.
On Wed May 29 2013 04:35:05 PM IST, Mathias Gaunard wrote:
[snip quoted text]
There is no algorithm here. It's just the evaluation of a uBlas matrix expression template.
Ok. Shall I start writing SIMD code for it?
On Wed, May 29, 2013 at 5:06 PM, Aditya Atluri wrote:
[snip quoted text]
I apologize for my previous mails; my reply was only half clear. What I meant is: should I convert all the code and algorithms in uBLAS to use a background SIMD implementation?
-- Aditya Avinash Atluri
On 29/05/13 13:36, Aditya Atluri wrote:
Should I write SIMD code for the algorithm? Or, as there is no such function in uBLAS, do you want me to develop the CPU code (function)?
There is no algorithm here. It's just the evaluation of a uBlas matrix expression template.
Ok. Shall I start writing SIMD code for it?
You can work on a patch to uBlas if you want and submit it to its maintainer for inclusion. I don't understand your question.
Ok.

On Wed, May 29, 2013 at 5:48 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
You can work on a patch to uBlas if you want and submit it to its maintainer for inclusion.
OK!
I don't understand your question.
My question is: the current code is CPU-based. Shall I port it to a SIMD architecture?

-- Aditya Avinash Atluri
Aditya, it would be better if uBLAS-specific discussions were kept on the uBLAS mailing list rather than the Boost one, unless of course your post is of a more general nature that needs exposure to the wider Boost community. Thank you!

Nasos

On 05/29/2013 07:36 AM, Aditya Atluri wrote:
[snip quoted text]
Hello,

Bringing explicit SIMD into uBLAS is not in the near-future plans because, as you correctly mention, this is far from trivial. It has been set as a general goal, but I personally disagree that we should be considering it at this point.

There is a GSoC project, though, that seeks to implement auto-vectorization-friendly BLAS 1, 2, and 3 functions, so that uBLAS can turn into a speedy BLAS drop-in replacement library. To that end I also like the idea of uBLAS being callable from C and even FORTRAN programs.

We are also seeking ways of making the uBLAS expression templates more transparent to the compiler so that auto-vectorization can kick in - which it does in certain cases, providing a very nice performance boost on par with explicitly vectorized libraries.

As a matter of fact, I am surprised by the progress of compilers' auto-vectorization facilities over the last few years, which makes me doubt the need for explicit vectorization any more. The GSoC project will make this clear for us. An added benefit of relying on the compiler is that future vector instructions come for free. A disadvantage is of course that there is no guarantee auto-vectorization will work, but I find this rarely the case.

Best,
- Nasos

On 05/29/2013 05:34 AM, Mathias Gaunard wrote:
[snip quoted text]
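[A minimal sketch of the auto-vectorization-friendly style described above: a plain, alias-free counted loop the compiler can vectorize on its own. The __restrict annotation is a common GCC/Clang/MSVC extension, not ISO C++, and the function is illustrative rather than taken from the GSoC work.]

    #include <cstddef>

    // BLAS-1 style axpy written so the auto-vectorizer can prove the
    // iterations independent: raw pointers promised not to alias, a
    // simple counted loop, no branches, no indirection. With -O2/-O3
    // and vectorization enabled, compilers turn this into packed
    // multiply-adds without any intrinsics in the source.
    void saxpy(std::size_t n, float alpha,
               const float* __restrict x, float* __restrict y)
    {
        for (std::size_t i = 0; i < n; ++i)
            y[i] = alpha * x[i] + y[i];
    }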
On 29/05/2013 15:00, Nasos Iliopoulos wrote:
As a matter of fact, I am surprised by the progress of compilers' auto-vectorization facilities over the last few years, which makes me doubt the need for explicit vectorization any more. The GSoC project will make this clear for us. An added benefit of relying on the compiler is that future vector instructions come for free. A disadvantage is of course that there is no guarantee auto-vectorization will work, but I find this rarely the case.
I beg to differ; you're in for some nasty surprises. It basically works for simple operations on simple one-level loops with easily inferred loop boundaries. Also, these things are very fragile and depend on vendor willingness to do whatever. In multiple actual cases we had to deal with, in both academic and industrial contexts, the auto-vectorizer was rapidly confused even by rather simple C++ code.
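[Two loops that illustrate the fragility; behavior varies by compiler and flags, and the examples are illustrative. The first is trivially auto-vectorized, while the second usually is not, because vectorizing a floating-point reduction reorders additions and changes rounding, which compilers refuse without -ffast-math or similar.]

    #include <cstddef>

    // Trivially auto-vectorized: independent iterations, unit stride.
    void scale(float* a, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            a[i] *= 2.0f;
    }

    // Usually left scalar: the accumulation into s is a loop-carried
    // dependence, and vectorizing it reassociates float additions.
    // Compilers only do this when told FP reassociation is allowed
    // (e.g. -ffast-math / -fassociative-math on GCC and Clang).
    float sum(const float* a, std::size_t n)
    {
        float s = 0.0f;
        for (std::size_t i = 0; i < n; ++i)
            s += a[i];
        return s;
    }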
On 05/29/2013 09:05 AM, Joel Falcou wrote:
[snip quoted text]
I beg to differ; you're in for some nasty surprises. It basically works for simple operations on simple one-level loops with easily inferred loop boundaries. Also, these things are very fragile and depend on vendor willingness to do whatever.
That's true. So what we are looking at is breaking down certain algorithms to a state where the vectorizer can penetrate the patterns. A triple loop won't just be vectorized, but providing clear functional paths does work. I don't expect it to work generally for the current expression-template back-end; that's why we encouraged the student to keep his proposal within the bounds of certain functions. Additionally, just injecting explicit vectorization instructions is not going to work; you need to alter your computational patterns, which in the end come very close to what the compiler would optimize anyway.
In multiple actual cases we had to deal with, in both academic and industrial contexts, the auto-vectorizer was rapidly confused even by rather simple C++ code.
We are also worried about polluting the code with vectorization instructions that will make things quite unmanageable in the future. I also think Boost libraries should stay closer to what standard C++ specifies, and this fixation of mine may be hindering my willingness to support non-standard items. This is a good discussion,

Nasos
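[An illustration of the kind of restructuring described above; this is a standard textbook transformation, not code from the GSoC project. The naive ijk matrix product reads one operand with a large stride and reduces into a scalar, which blocks vectorization, while swapping the two inner loops makes the innermost loop a unit-stride multiply-add that auto-vectorizes readily.]

    #include <cstddef>

    // Textbook ijk order: the inner loop reads b[k*n + j] with
    // stride n and accumulates into a scalar -- hard to vectorize.
    void gemm_ijk(const float* a, const float* b, float* c, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j) {
                float s = 0.0f;
                for (std::size_t k = 0; k < n; ++k)
                    s += a[i*n + k] * b[k*n + j];
                c[i*n + j] = s;
            }
    }

    // ikj order: the inner loop is c[i][*] += a[i][k] * b[k][*],
    // a unit-stride multiply-add the vectorizer handles.
    // c must be zeroed beforehand.
    void gemm_ikj(const float* a, const float* b, float* c, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t k = 0; k < n; ++k) {
                const float aik = a[i*n + k];
                for (std::size_t j = 0; j < n; ++j)
                    c[i*n + j] += aik * b[k*n + j];
            }
    }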
On 29/05/2013 15:32, Nasos Iliopoulos wrote:
That's true. So what we are looking at is breaking down certain algorithms to a state where the vectorizer can penetrate the patterns. A triple loop won't just be vectorized, but providing clear functional paths does work. I don't expect it to work generally for the current expression-template back-end; that's why we encouraged the student to keep his proposal within the bounds of certain functions.
Good luck with that.
Additionally, just injecting explicit vectorization instructions is not going to work; you need to alter your computational patterns, which in the end come very close to what the compiler would optimize anyway.
That's exactly what the Boost.SIMD algorithms wrap. There's no shame in being close to the machine; the problem is being close while using vendor-specific intrinsics.
We are also worried about polluting the code with vectorization instructions that will make things quite unmanageable in the future. I also think Boost libraries should stay closer to what standard C++ specifies, and this fixation of mine may be hindering my willingness to support non-standard items.
Then abandon vectorization now, it ain't gonna fly.
On 29/05/13 15:00, Nasos Iliopoulos wrote:
[snip quoted text]
Yet according to a variety of benchmarks, the performance of uBLAS is very bad compared to other similar libraries (Eigen, Armadillo, Blitz++, Blaze, or even our own library NT2), even for simple cases and with aggressive optimization settings.
That is one of the core purposes of the GSoC project: to provide fast algorithms, especially for items like matrix-matrix multiplication, and not to optimize the whole infrastructure.

Regarding the simple cases: do you mean that on your compiler uBLAS is slower than, for example, Eigen on this piece of code?
    #include <iostream>
    #include <chrono>
    #include
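[The rest of the listing was lost in the archive. A minimal reconstruction of the kind of timing comparison being described; this is a guess at the shape, not Nasos's actual code, and the vector size, the use of scalar_vector for filling, and steady_clock are all assumptions.]

    #include <iostream>
    #include <chrono>
    #include <boost/numeric/ublas/vector.hpp>

    int main()
    {
        namespace ublas = boost::numeric::ublas;
        const std::size_t n = 10000000;

        // scalar_vector is a constant-valued vector, used here only to fill.
        ublas::vector<double> a = ublas::scalar_vector<double>(n, 1.0);
        ublas::vector<double> b = ublas::scalar_vector<double>(n, 2.0);
        ublas::vector<double> c(n);

        const auto t0 = std::chrono::steady_clock::now();
        ublas::noalias(c) = a + b;   // expression template, no temporary
        const auto t1 = std::chrono::steady_clock::now();

        std::cout << std::chrono::duration<double>(t1 - t0).count()
                  << " s, c[0] = " << c[0] << '\n';   // keep the result live
    }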
Yes, there is a GSoC project for that, which I'm mentoring. It's not an easy task, to be honest, as we have to touch the architecture of uBLAS a little bit. But it's exciting and fascinating!

Ideally we want to bring SIMD into the expression templates. Practically... let's see :-)
On Wed, May 29, 2013 at 10:34 AM, Mathias Gaunard wrote:
[snip quoted text]
participants (9)

- Aditya Atluri
- Aditya Avinash
- David Bellot
- Gaetano Mendola
- Joel Falcou
- Karsten Ahnert
- Mathias Gaunard
- Nasos Iliopoulos
- Rob Stewart