SIMD implementation of uBLAS
Hi, I have developed a vector addition algorithm which exploits hardware parallelism (an SSE implementation).

-- ----------------
Atluri Aditya Avinash, India.
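[The code attachment itself is not preserved in this archive. For context, a minimal sketch of what SSE vector addition with intrinsics typically looks like, assuming 16-byte-aligned float arrays whose length is a multiple of 4; the function name is illustrative, not the posted SSE1 class.]

    #include <xmmintrin.h>   // SSE intrinsics
    #include <cstddef>

    // c = a + b, four packed floats per iteration.
    void add_sse(const float* a, const float* b, float* c, std::size_t n)
    {
        for (std::size_t i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(a + i);   // aligned load of 4 floats
            __m128 vb = _mm_load_ps(b + i);
            _mm_store_ps(c + i, _mm_add_ps(va, vb));
        }
    }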
On 29/05/2013 06.13, Aditya Avinash wrote:
Hi, I have developed a vector addition algorithm which exploits hardware parallelism (an SSE implementation).
A few comments:

- That is not C++ but just C in the disguise of C++ code:
  . the SSE1 constructor doesn't use an initialization list;
  . SSE1 doesn't have a destructor, and the user has to explicitly call the Free method.
- Const-correctness is not in place.
- The SSE namespace should have been put in a "detail" namespace.
- Use memcpy instead of an explicit for loop.
- Why is SSE1 a template when it works only when T is a single-precision floating-point value?

Also, I believe a nicer interface would have been:

    SSE1::vector A(1024);
    SSE1::vector B(1024);
    SSE1::vector C(1024);
    C = A + B;

Regards,
Gaetano Mendola
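[A minimal sketch of the kind of RAII interface suggested above, assuming float-only storage and the copy-and-swap idiom; all names are illustrative, this is not the posted code.]

    #include <xmmintrin.h>   // SSE intrinsics and _mm_malloc/_mm_free
    #include <cstddef>
    #include <cstring>

    namespace sse { namespace detail {

    class vector {
        float*      data_;
        std::size_t size_;
    public:
        explicit vector(std::size_t n)
            : data_(static_cast<float*>(_mm_malloc(n * sizeof(float), 16)))
            , size_(n) {}

        vector(const vector& o)
            : data_(static_cast<float*>(_mm_malloc(o.size_ * sizeof(float), 16)))
            , size_(o.size_)
        { std::memcpy(data_, o.data_, size_ * sizeof(float)); }

        vector& operator=(vector o) { swap(o); return *this; }  // copy-and-swap

        ~vector() { _mm_free(data_); }   // RAII: no explicit Free()

        void swap(vector& o) {
            float* d = data_;      data_ = o.data_; o.data_ = d;
            std::size_t s = size_; size_ = o.size_; o.size_ = s;
        }

        std::size_t  size() const                    { return size_; }
        float&       operator[](std::size_t i)       { return data_[i]; }
        const float& operator[](std::size_t i) const { return data_[i]; }

        friend vector operator+(const vector& a, const vector& b) {
            vector c(a.size_);
            std::size_t i = 0;
            for (; i + 4 <= a.size_; i += 4)          // packed 4-wide adds
                _mm_store_ps(c.data_ + i,
                             _mm_add_ps(_mm_load_ps(a.data_ + i),
                                        _mm_load_ps(b.data_ + i)));
            for (; i < a.size_; ++i)                  // scalar tail
                c.data_[i] = a.data_[i] + b.data_[i];
            return c;
        }
    };

    }} // namespace sse::detail

With that in place, the suggested usage reads naturally: sse::detail::vector A(1024), B(1024), C(1024); C = A + B;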
On 29/05/2013 06:45, Gaetano Mendola wrote:
[snip quoted text]
See our work on Boost.SIMD ...
@Gaetano: Thank you for the comments. I'll change accordingly and post it back. I am using T because the code needs to run with double-precision floats as well.
@Joel: Boost.SIMD is generalized; designing algorithms specific to uBLAS increases performance. Odeint has its own SIMD backend.
On Wed, May 29, 2013 at 10:36 AM, Joel Falcou wrote:
[snip quoted text]
-- ---------------- Atluri Aditya Avinash, India.
On 05/29/2013 07:33 AM, Aditya Avinash wrote:
@Gaetano: Thank you for the comments. [...] @Joel: Boost.SIMD is generalized; designing algorithms specific to uBLAS increases performance. Odeint has its own SIMD backend.
odeint has no SIMD backend, at least I am not aware of one. Having one would be really great.
[snip quoted text]
I am sorry, my bad: it's Boost.SIMD. Why isn't it included in Boost? I have only heard about it recently. Is there a chance that it will be added to Boost in the near future?

On Wed, May 29, 2013 at 11:57 AM, Karsten Ahnert <karsten.ahnert@googlemail.com> wrote:
[snip quoted text]
-- ---------------- Atluri Aditya Avinash, India.
On May 29, 2013, at 2:35 AM, Aditya Avinash wrote:
I am sorry, my bad: it's Boost.SIMD. Why isn't it included in Boost? I have only heard about it recently. Is there a chance that it will be added to Boost in the near future?
On Wed, May 29, 2013 at 11:57 AM, Karsten Ahnert < karsten.ahnert@googlemail.com> wrote:
On 05/29/2013 07:33 AM, Aditya Avinash wrote:
[snip lots of quoted text]
On Wed, May 29, 2013 at 10:36 AM, Joel Falcou
wrote: On 29/05/2013 06:45, Gaetano Mendola wrote:
On 29/05/2013 06.13, Aditya Avinash wrote:
[snip even more quoted text]
Regards Gaetano Mendola
See our work on Boost.SIMD ...
[snip multiple sigs and ML footers]

Please read http://www.boost.org/community/policy.html#quoting before posting.

___
Rob
(Sent from my portable computation engine)
I am sorry, my bad: it's Boost.SIMD. Why isn't it included in Boost? I have only heard about it recently. Is there a chance that it will be added to Boost in the near future?
If I ever had to choose, I would go with Boost.SIMD, because the folks at Metascale have put some really hard work into shaping their library, and a custom solution would only be ugly plagiarism. Also consider that in that case you outsource one of your major problems: supporting new vector instructions. You only have to work on them yourself if nobody supports Boost.SIMD anymore.

-Nasos

On 05/29/2013 09:40 AM, Aditya Avinash wrote:
On Wed, May 29, 2013 at 10:36 AM, Joel Falcou wrote:
See our work on Boost.SIMD ...
I have a question specific to you. Implementing uBLAS with its own SIMD code versus using uBLAS with Boost.SIMD: which of these would be faster (performance-wise)?
On Wed, May 29, 2013 at 7:22 PM, Nasos Iliopoulos wrote:
[snip quoted text]
Thank you! This question is for the list: what about ARM NEON?

-- Aditya Avinash Atluri
On 29/05/13 15:40, Aditya Avinash wrote:
On Wed, May 29, 2013 at 10:36 AM, Joel Falcou wrote:
See our work on Boost.SIMD ...
I have a question specific to you. Implementing uBLAS with its own SIMD code versus using uBLAS with Boost.SIMD: which of these would be faster (performance-wise)?
Assuming the code does the same thing, it would be the same.
On 29/05/13 06:13, Aditya Avinash wrote:
Hi, I have developed a vector addition algorithm which exploits hardware parallelism (an SSE implementation).
That's something trivial to do, and unfortunately even that trivial code is broken (it's written for a generic T but clearly does not work for any T besides float). It still has nothing to do with uBLAS.

Bringing SIMD to uBLAS could be fairly difficult. Is this part of the GSoC projects? Who's in charge of this? I'd like to know what the plan is: optimize very specific operations with SIMD, or try to provide a framework to use SIMD in expression templates? The former is better addressed by simply binding BLAS; the latter is certainly not as easy as it sounds.
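[The usual fix for a "generic T that only works for float" is to dispatch on the element type, so each precision gets its matching intrinsics and anything else fails to compile. A minimal sketch; simd_add and its layout are illustrative, not an existing API.]

    #include <xmmintrin.h>   // SSE:  _mm_add_ps
    #include <emmintrin.h>   // SSE2: _mm_add_pd
    #include <cstddef>

    // Primary template intentionally left undefined: only the
    // specializations below are valid, so simd_add<int> is a
    // compile-time error instead of silently broken code.
    template <typename T> struct simd_add;

    template <> struct simd_add<float> {
        static void run(const float* a, const float* b, float* c, std::size_t n) {
            std::size_t i = 0;
            for (; i + 4 <= n; i += 4)   // 4 floats per SSE register
                _mm_storeu_ps(c + i, _mm_add_ps(_mm_loadu_ps(a + i),
                                                _mm_loadu_ps(b + i)));
            for (; i < n; ++i) c[i] = a[i] + b[i];
        }
    };

    template <> struct simd_add<double> {
        static void run(const double* a, const double* b, double* c, std::size_t n) {
            std::size_t i = 0;
            for (; i + 2 <= n; i += 2)   // 2 doubles per SSE2 register
                _mm_storeu_pd(c + i, _mm_add_pd(_mm_loadu_pd(a + i),
                                                _mm_loadu_pd(b + i)));
            for (; i < n; ++i) c[i] = a[i] + b[i];
        }
    };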
Thanks for the comments. I'll work on the code and make the appropriate changes. I'll implement the BLAS soon. No, this is not a GSoC project. It's the second option: provide a framework to use SIMD in expression templates.

On Wed, May 29, 2013 at 3:04 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
[snip quoted text]
-- ---------------- Atluri Aditya Avinash, India.
On 29/05/13 11:46, Aditya Avinash wrote:
It's the second option: provide a framework to use SIMD in expression templates.
Ok, in that case you need to first study how uBlas works. For example, if you write something along the lines of

    a = trans(b + c) * d;

AFAIK what uBlas does is something like

    for (size_t i = 0; i != sz.height; ++i)
        for (size_t j = 0; j != sz.width; ++j)
            a[i][j] = (b[j][i] + c[j][i]) * d[i][j];

What you need to do is change the loop structure and modify the evaluation of all nodes involved to support SIMD. Of course trans is going to be a problem. Thankfully uBlas doesn't have that many functions, so trans and herm are the only ones that exhibit that issue.
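[A rough sketch of the restructured evaluation for a plain elementwise node over row-major storage, processing the contiguous inner dimension in packs of four; all names are illustrative, this is not uBLAS internals.]

    #include <xmmintrin.h>
    #include <cstddef>

    // a[i][j] = b[i][j] + c[i][j], walking the contiguous inner
    // dimension four floats at a time. A real expression-template
    // engine would drive this per node of the expression tree.
    void eval_plus(const float* b, const float* c, float* a,
                   std::size_t rows, std::size_t cols)
    {
        for (std::size_t i = 0; i < rows; ++i) {
            const float* brow = b + i * cols;
            const float* crow = c + i * cols;
            float*       arow = a + i * cols;
            std::size_t j = 0;
            for (; j + 4 <= cols; j += 4)
                _mm_storeu_ps(arow + j,
                              _mm_add_ps(_mm_loadu_ps(brow + j),
                                         _mm_loadu_ps(crow + j)));
            for (; j < cols; ++j)          // scalar tail
                arow[j] = brow[j] + crow[j];
        }
    }

A trans node breaks this scheme: the transposed operand is read with stride cols, so its four values can no longer come from one packed load, which is exactly why trans and herm are singled out above.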
On Wed, May 29, 2013 at 4:10 PM, Mathias Gaunard < mathias.gaunard@ens-lyon.org> wrote:
[snip quoted text]
Should I write SIMD code for the algorithm? Or, as there is no such function in uBLAS, do you want me to develop the CPU code (function)?

-- Aditya Avinash Atluri
On 29/05/13 12:46, Aditya Avinash wrote:
[snip quoted text]
Should I write SIMD code for the algorithm? Or, as there is no such function in uBLAS, do you want me to develop the CPU code (function)?
There is no algorithm here. It's just the evaluation of a uBlas matrix expression template.
On Wed May 29 2013 04:35:05 PM IST, Mathias Gaunard wrote:
[snip quoted text]
There is no algorithm here. It's just the evaluation of a uBlas matrix expression template.
Ok. Shall I start writing SIMD code for it?
On Wed, May 29, 2013 at 5:06 PM, Aditya Atluri wrote:
[snip quoted text]
I apologize for my previous mails; my reply was only half clear. What I meant is: should I convert all the code and algorithms in uBLAS to use a background SIMD implementation?
-- Aditya Avinash Atluri
On 29/05/13 13:36, Aditya Atluri wrote:
Should I write SIMD code for the algorithm? Or, as there is no such function in uBLAS, do you want me to develop the CPU code (function)?
There is no algorithm here. It's just the evaluation of a uBlas matrix expression template.
Ok. Shall I start writing SIMD code for it?
You can work on a patch to uBlas if you want and submit it to its maintainer for inclusion. I don't understand your question.
Ok.

On Wed, May 29, 2013 at 5:48 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
You can work on a patch to uBlas if you want and submit it to its maintainer for inclusion.
OK!
I don't understand your question.
My question is: the current code is CPU-based. Shall I port it to a SIMD architecture?

-- Aditya Avinash Atluri
Aditya, it would be better if uBLAS-specific discussions were kept on the uBLAS mailing list rather than the Boost one, unless of course your post is of a more general nature that needs exposure to the wider Boost community. Thank you!

Nasos

On 05/29/2013 07:36 AM, Aditya Atluri wrote:
[snip quoted text]
Hello,

Bringing explicit SIMD into uBLAS is not in the near-future plans because, as you correctly mention, this is far from trivial. It has been set as a general goal, but I personally disagree that we should be considering it at this point.

There is a GSoC project, though, that seeks to implement auto-vectorization-friendly BLAS 1, 2, and 3 functions, so that uBLAS can turn into a speedy BLAS drop-in replacement library. To that end I also like the idea of uBLAS being callable from C and even FORTRAN programs.

We are also seeking ways of making the uBLAS expression templates more transparent to the compiler so that auto-vectorization can kick in - which it does in certain cases, providing a very nice performance boost on par with explicitly vectorized libraries.

As a matter of fact, I am surprised by the progress of compilers' auto-vectorization facilities over the last few years, which makes me doubt the need for explicit vectorization any more. The GSoC project will make this clear for us. An added benefit of relying on the compiler is that future vector instructions come for free. A disadvantage is of course that there is no guarantee auto-vectorization will work, but I find this rarely the case.

Best,
- Nasos

On 05/29/2013 05:34 AM, Mathias Gaunard wrote:
[snip quoted text]
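[A minimal sketch of the auto-vectorization-friendly style described above: a plain, alias-free counted loop the compiler can vectorize on its own. The __restrict annotation is a common GCC/Clang/MSVC extension, not ISO C++, and the function is illustrative rather than taken from the GSoC work.]

    #include <cstddef>

    // BLAS-1 style axpy written so the auto-vectorizer can prove the
    // iterations independent: raw pointers promised not to alias, a
    // simple counted loop, no branches, no indirection. With -O2/-O3
    // and vectorization enabled, compilers turn this into packed
    // multiply-adds without any intrinsics in the source.
    void saxpy(std::size_t n, float alpha,
               const float* __restrict x, float* __restrict y)
    {
        for (std::size_t i = 0; i < n; ++i)
            y[i] = alpha * x[i] + y[i];
    }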
On 29/05/2013 15:00, Nasos Iliopoulos wrote:
As a matter of fact, I am surprised by the progress of compilers' auto-vectorization facilities over the last few years, which makes me doubt the need for explicit vectorization any more. The GSoC project will make this clear for us. An added benefit of relying on the compiler is that future vector instructions come for free. A disadvantage is of course that there is no guarantee auto-vectorization will work, but I find this rarely the case.
I beg to differ; you're in for some nasty surprises. It basically works for simple operations on simple one-level loops with easily inferred loop boundaries. Also, these things are very fragile and depend on vendor willingness to do whatever. In multiple actual cases we had to deal with, in both academic and industrial contexts, the auto-vectorizer was rapidly confused even by rather simple C++ code.
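[Two loops that illustrate the fragility; behavior varies by compiler and flags, and the examples are illustrative. The first is trivially auto-vectorized, while the second usually is not, because vectorizing a floating-point reduction reorders additions and changes rounding, which compilers refuse without -ffast-math or similar.]

    #include <cstddef>

    // Trivially auto-vectorized: independent iterations, unit stride.
    void scale(float* a, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            a[i] *= 2.0f;
    }

    // Usually left scalar: the accumulation into s is a loop-carried
    // dependence, and vectorizing it reassociates float additions.
    // Compilers only do this when told FP reassociation is allowed
    // (e.g. -ffast-math / -fassociative-math on GCC and Clang).
    float sum(const float* a, std::size_t n)
    {
        float s = 0.0f;
        for (std::size_t i = 0; i < n; ++i)
            s += a[i];
        return s;
    }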
On 05/29/2013 09:05 AM, Joel Falcou wrote:
[snip quoted text]
I beg to differ; you're in for some nasty surprises. It basically works for simple operations on simple one-level loops with easily inferred loop boundaries. Also, these things are very fragile and depend on vendor willingness to do whatever.
That's true. So what we are looking at is breaking down certain algorithms to a state where the vectorizer can penetrate the patterns. A triple loop won't just be vectorized, but providing clear functional paths does work. I don't expect it to work generally for the current expression-template back-end; that's why we encouraged the student to keep his proposal within the bounds of certain functions. Additionally, just injecting explicit vectorization instructions is not going to work; you need to alter your computational patterns, which in the end come very close to what the compiler would optimize anyway.
In multiple actual cases we had to deal with, in both academic and industrial contexts, the auto-vectorizer was rapidly confused even by rather simple C++ code.
We are also worried about polluting the code with vectorization instructions that will make things quite unmanageable in the future. I also think Boost libraries should stay closer to what standard C++ specifies, and this fixation of mine may be hindering my willingness to support non-standard items. This is a good discussion,

Nasos
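[An illustration of the kind of restructuring described above; this is a standard textbook transformation, not code from the GSoC project. The naive ijk matrix product reads one operand with a large stride and reduces into a scalar, which blocks vectorization, while swapping the two inner loops makes the innermost loop a unit-stride multiply-add that auto-vectorizes readily.]

    #include <cstddef>

    // Textbook ijk order: the inner loop reads b[k*n + j] with
    // stride n and accumulates into a scalar -- hard to vectorize.
    void gemm_ijk(const float* a, const float* b, float* c, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j) {
                float s = 0.0f;
                for (std::size_t k = 0; k < n; ++k)
                    s += a[i*n + k] * b[k*n + j];
                c[i*n + j] = s;
            }
    }

    // ikj order: the inner loop is c[i][*] += a[i][k] * b[k][*],
    // a unit-stride multiply-add the vectorizer handles.
    // c must be zeroed beforehand.
    void gemm_ikj(const float* a, const float* b, float* c, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t k = 0; k < n; ++k) {
                const float aik = a[i*n + k];
                for (std::size_t j = 0; j < n; ++j)
                    c[i*n + j] += aik * b[k*n + j];
            }
    }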
On 29/05/2013 15:32, Nasos Iliopoulos wrote:
That's true. So what we are looking at is breaking down certain algorithms to a state where the vectorizer can penetrate the patterns. A triple loop won't just be vectorized, but providing clear functional paths does work. I don't expect it to work generally for the current expression-template back-end; that's why we encouraged the student to keep his proposal within the bounds of certain functions.
Good luck with that.
Additionally, just injecting explicit vectorization instructions is not going to work; you need to alter your computational patterns, which in the end come very close to what the compiler would optimize anyway.
That's exactly what the Boost.SIMD algorithms wrap. There's no shame in being close to the machine; the problem is being close while using vendor-specific intrinsics.
We are also worried about polluting the code with vectorization instructions that will make things quite unmanageable in the future. I also think Boost libraries should stay closer to what standard C++ specifies, and this fixation of mine may be hindering my willingness to support non-standard items.
Then abandon vectorization now, it ain't gonna fly.
On 29/05/13 15:00, Nasos Iliopoulos wrote:
[snip quoted text]
Yet according to a variety of benchmarks, the performance of uBLAS is very bad compared to other similar libraries (Eigen, Armadillo, Blitz++, Blaze, or even our own library NT2), even for simple cases and with aggressive optimization settings.
That is one of the core purposes of the GSoC project: to provide fast algorithms, especially for items like matrix-matrix multiplication, and not to optimize the whole infrastructure.

Regarding the simple cases: do you mean that on your compiler uBLAS is slower than, for example, Eigen on this piece of code?
    #include <iostream>
    #include <chrono>
    #include
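[The rest of the listing was lost in the archive. A minimal reconstruction of the kind of timing comparison being described; this is a guess at the shape, not Nasos's actual code, and the vector size, the use of scalar_vector for filling, and steady_clock are all assumptions.]

    #include <iostream>
    #include <chrono>
    #include <boost/numeric/ublas/vector.hpp>

    int main()
    {
        namespace ublas = boost::numeric::ublas;
        const std::size_t n = 10000000;

        // scalar_vector is a constant-valued vector, used here only to fill.
        ublas::vector<double> a = ublas::scalar_vector<double>(n, 1.0);
        ublas::vector<double> b = ublas::scalar_vector<double>(n, 2.0);
        ublas::vector<double> c(n);

        const auto t0 = std::chrono::steady_clock::now();
        ublas::noalias(c) = a + b;   // expression template, no temporary
        const auto t1 = std::chrono::steady_clock::now();

        std::cout << std::chrono::duration<double>(t1 - t0).count()
                  << " s, c[0] = " << c[0] << '\n';   // keep the result live
    }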
Yes, there is a GSoC project for that, which I'm mentoring. It's not an easy task, to be honest, as we have to touch the architecture of uBLAS a little bit. But it's exciting and fascinating!

Ideally we want to bring SIMD into the expression templates. Practically... let's see :-)
On Wed, May 29, 2013 at 10:34 AM, Mathias Gaunard wrote:
[snip quoted text]
participants (9)

- Aditya Atluri
- Aditya Avinash
- David Bellot
- Gaetano Mendola
- Joel Falcou
- Karsten Ahnert
- Mathias Gaunard
- Nasos Iliopoulos
- Rob Stewart