interest in an asynchronous linear algebra library?
Hi list,

Disclaimer: What I am proposing here does not exist (yet). But it will in the near future, and if there is interest I would like to give something back to the boost community.

I would like to propose a new library to boost: asynchronous blas.

What is it and what can I do with it?
It is a linear algebra library where all computations are carried out in an asynchronous manner on GPU and CPU. Computing a matrix-matrix product does not block the current thread but instead launches a task which at some point returns with a result, allowing the main thread to continue until the result is needed. GPU and CPU tasks can be started at the same time depending on each other (e.g. a CPU computation waits until the GPU has returned some needed result, and vice versa).

Why is this important? Which problems does it solve?
Many linear algebra libraries are targeted at either
a) solving a few large linear algebra problems, or
b) solving many small, similar problems.
This library is targeted at the middle ground: a medium number of decently sized problems. Here the problem is that it is often hard to use all the computing resources available, as a single problem will not exhaust the device resources and there is no easy way to launch several similar tasks in parallel. This problem occurs often in GPU computing, where out-of-order execution of kernels is used to fill up the device. Similarly, it is often hard to make CPU and GPU work together, as GPU tasks are asynchronous while CPU tasks are often synchronous, and it is not straightforward to couple both without blocking either CPU or GPU.

How does it work?
Expression templates similar to boost::ublas are used to create a dependency graph: which object is used in which computations, and which tasks have to wait to ensure a correct outcome? A simple case would be two independent computations like

matrix A=prod(B,trans(B));
matrix C=prod(trans(B),B);

A normal library like ublas would wait until A is computed before C is evaluated. Instead, it can be more efficient to just add the two tasks to a list and let worker threads execute them as soon as resources are available. First the computation of A is put into the graph, and A is marked as "being written to" while B is marked as "being read". When C is added to the list, it does not depend on the first computation, as several tasks can read from B at the same time. So another worker thread can grab the task and execute it. If we now add the task

B += prod(D,prod(C,trans(D)));

the library computes it as

matrix temp = prod(C,trans(D));
B += prod(D,temp);

temp can be computed as soon as C is computed. But B can only be computed after A and temp are computed, as we can only write to B after all previous tasks using B are finished.

Are you qualified?
I am the main contributor to https://github.com/Shark-ML/Shark/ and as part of it I forked ublas: https://github.com/Shark-ML/Shark/tree/master/include/shark/LinAlg/BLAS which now uses several linear algebra libraries as possible backends (ATLAS, openBLAS, ...).

What is the timeline?
I would start developing a minimal example ASAP based on the rewrite of ublas above. I would use a minimum set of features (only dense square matrices) and basic GPU support.

Opinions?

Best, Oswin
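To make the intended behavior concrete, a minimal sketch (hypothetical API, nothing of this is implemented yet; do_other_cpu_work is just a placeholder):

    matrix A = prod(B, trans(B)); // enqueued as a task, returns immediately
    do_other_cpu_work();          // overlaps with the running product
    double x = A(0, 0);           // implicit synchronization: waits only here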
On 19/11/15 22:55, Oswin Krause wrote:
Hi list,
Disclaimer: What I am proposing here does not exist (yet). But it will in the near future, and if there is interest I would like to give something back to the boost community.
I would like to propose a new library to boost: asynchronous blas.
What is it and what can I do with it? It is a linear algebra library where all computations are carried out in an asynchronous manner on GPU and CPU. Computing a matrix-matrix product does not block the current thread but instead launches a task which at some point returns with a result, allowing the main thread to continue until the result is needed. GPU and CPU tasks can be started at the same time depending on each other (e.g. a CPU computation waits until the GPU has returned some needed result, and vice versa).
This sounds like a specific solution to a general need for schedulers. Have you looked at the current proposals from the various SG groups, such as SG1? Do they not help to solve your problem? Now is the time to speak up if they don't. Go find Michael Wong or Hans Boehm.

Ben
Hi,
This sounds like a specific solution to a general need for schedulers.
This proposal can work on top of any scheduler that allows for dependency graphs. You could see this library as a specific application of scheduling linear algebra tasks: because we know exactly what the dependencies are and how to order things, we can completely hide the scheduling.
Have you looked at the current proposals from the various SG groups, such as SG1? Do they not help to solve your problem? Now is the time to speak up if they don't. Go find Michael Wong or Hans Boehm.
I am not sure about the current state of the proposals; I only remember a recent discussion about executors where I raised concerns about possible deadlocks and missing guarantees. Anyway, I would not know where to start looking for the current state of the proposals. I found this: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4414.pdf and I doubt one could implement dependency graphs with that proposal, as the executor does not return references to scheduled tasks or allow for explicit dependencies (I cannot use futures, as I want operations to reuse existing objects as much as possible rather than generate new ones, since that can exhaust memory pretty easily). There is also no wording about what happens when a work package adds more work packages to the executor and waits for them, which makes any thread_pool proposal worthless, as blocking worker threads might result in deadlocks. I further see no hope that we get unified dependency handling which allows for tasks handled by different executors/schedulers (e.g. a CPU and a GPU scheduler).

Best, Oswin
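To illustrate the deadlock concern with a minimal, self-contained example (a hand-rolled one-worker pool for brevity; the same hazard applies to any fixed-size thread pool whose workers may block):

    #include <condition_variable>
    #include <deque>
    #include <functional>
    #include <future>
    #include <mutex>
    #include <thread>

    int main() {
        std::deque<std::function<void()>> queue;
        std::mutex m;
        std::condition_variable cv;

        auto post = [&](std::function<void()> f) {
            { std::lock_guard<std::mutex> lock(m); queue.push_back(std::move(f)); }
            cv.notify_one();
        };

        std::thread worker([&] { // the pool's only worker thread
            for (;;) {
                std::function<void()> task;
                {
                    std::unique_lock<std::mutex> lock(m);
                    cv.wait(lock, [&] { return !queue.empty(); });
                    task = std::move(queue.front());
                    queue.pop_front();
                }
                task();
            }
        });

        post([&] {
            std::promise<void> p;
            std::future<void> done = p.get_future();
            post([&] { p.set_value(); }); // queued behind the task running right now
            done.wait(); // DEADLOCK: the only worker blocks on work it enqueued itself
        });

        worker.join(); // never returns
    }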
On Thu, Nov 19, 2015 at 1:13 PM, Oswin Krause <Oswin.Krause@ruhr-uni-bochum.de> wrote:
Hi,
This sounds like a specific solution to a general need for schedulers.
This proposal can work on top of any scheduler that allows for dependency graphs. You could see this library as a specific application of scheduling linear algebra tasks: because we know exactly what the dependencies are and how to order things, we can completely hide the scheduling.
Have you looked at the current proposals from the various SG groups, such as SG1? Do they not help to solve your problem? Now is the time to speak up if they don't. Go find Michael Wong or Hans Boehm.
I am not sure about the current state of the proposals; I only remember a recent discussion about executors where I raised concerns about possible deadlocks and missing guarantees. Anyway, I would not know where to start looking for the current state of the proposals.
The C++ committee's papers are a good place to start. See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/ and work from newest (at the end) to oldest (at the top). You might be particularly interested in:

P0159R0, Draft of Technical Specification for C++ Extensions for Concurrency: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/p0159r0.html

N4514, Technical Specification for C++ Extensions for Transactional Memory: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4514.pdf

N4507, Technical Specification for C++ Extensions for Parallelism: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4507.pdf

--Beman
Hi Oswin,

On Thursday, November 19, 2015 at 9:56:13 AM UTC-5, Oswin Krause wrote:
I would like to propose a new library to boost: asynchronous blas.
Asynchrony and optimization of the AST are very interesting techniques that the DL community should leverage. There is a lot of existing knowledge and code in this area, so in addition to looking at standard proposals I would recommend you look into advanced expression template libraries like NT2 [0] or Eigen [1] when researching prior work. I would also recommend you look into HPX [2].

Best, Sebastian

[0] https://github.com/jfalcou/nt2
[1] http://eigen.tuxfamily.org/index.php?title=Main_Page
[2] https://github.com/STEllAR-GROUP/hpx
On 21/11/2015 06:12, Sebastian Schaetz wrote:
Asynchrony and optimization of the AST are very interesting techniques that the DL community should leverage. There is a lot of existing knowledge and code in this area, so in addition to looking at standard proposals I would recommend you look into advanced expression template libraries like NT2 [0] or Eigen [1] when researching prior work. I would also recommend you look into HPX [2].
FYI, we have experimental support for HPX inside NT2 already. My PhD student defended on that recently. Need to merge that into NT2 master still.
On 2015-11-21 06:12, Sebastian Schaetz wrote:
Hi Oswin,
On Thursday, November 19, 2015 at 9:56:13 AM UTC-5, Oswin Krause wrote:
I would like to propose a new library to boost: asynchronous blas.
Asynchrony and optimization of the AST are very interesting techniques that the DL community should leverage. There is a lot of existing knowledge and code in this area, so in addition to looking at standard proposals I would recommend you look into advanced expression template libraries like NT2 [0] or Eigen [1] when researching prior work. I would also recommend you look into HPX [2].
Hi Sebastian (and others),

I am a bit confused by the replies. Is this a "yes, there is interest, but do some proper research beforehand"?

Of course I do know Eigen, Armadillo and NT2, as I had to do some research in the past for the linked library. For the CPU part I was indeed planning a system with an interface similar to HPX, but I cannot rely on HPX because it would then be a mandatory external library - and distributed computations are currently not in the scope of the project (honestly, I do not have enough experience with distributed computing to start such an endeavor).

The question for me is: is there enough interest in such a thing that it is worthwhile for me to create a minimal prototype on which a discussion could be based?

Best, Oswin
Of course I do know Eigen, Armadillo and NT2, as I had to do some research in the past for the linked library. For the CPU part I was indeed planning a system with an interface similar to HPX, but I cannot rely on HPX because it would then be a mandatory external library - and distributed computations are currently not in the scope of the project (honestly, I do not have enough experience with distributed computing to start such an endeavor).
<rant> Why do people fall into the trap of the NIH syndrome over and over again? *sigh* </rant>

Anyways, while HPX supports distributed computing, you don't have to use it for this. HPX can be used just fine for SMP-style (local only) applications as well. In fact we have shown that by simply switching from std::async/std::future to hpx::async/hpx::future you can achieve a speedup of up to a factor of 3-4 (depending on the application). That does not entail utilizing some of the more sophisticated features HPX provides you with (extended futures as proposed by the Concurrency TS, dataflow, parallel algorithms, task regions, GPU integration, etc.). Those usually give you an additional benefit in terms of speedup, readability, and maintainability.

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu
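For reference, the switch really is close to a drop-in replacement, since HPX mirrors the standard interface (a sketch; heavy_computation is a placeholder, and the exact header layout may differ between HPX versions):

    #include <hpx/include/async.hpp>

    double heavy_computation(); // placeholder, defined elsewhere

    void example() // assumes the HPX runtime is up, e.g. entered via hpx_main
    {
        // same shape as std::async/std::future, different namespace:
        hpx::future<double> f = hpx::async(&heavy_computation);
        // ... other work overlaps here, on HPX's user-level threads ...
        double r = f.get(); // suspends only the calling HPX thread
        (void)r;
    }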
On 11/22/15 8:44 AM, Hartmut Kaiser wrote:
<rant> Why do people fall into the trap of the NIH syndrome over and over again? *sigh* </rant>
Hmmmm - Is HPX a boost library? Maybe you want to affiliate HPX with boost in the manner that spirit is. Might make for an interesting review discussion.

Robert Ramey
Robert,
<rant> Why do people fall into the trap of the NIH syndrome over and over again? *sigh* </rant>
Hmmmm - Is HPX a boost library? Maybe you want to affiliate HPX with boost in the manner that spirit is. Might make for an interesting review discussion.
HPX is actually at least 5 (if not more) different libraries, so this would fall through even before any discussion has started. It is also ~250k LOC; nobody would want to review that anyways. While I understand the significance of being part of Boost (in terms of legal and company BS), I don't think it would be worth the effort at this point for us to invest time into such an endeavor.

That said - if somebody would be interested in turning HPX (or part of it) into his 'own' Boost library - please be our guest. We would be in full support of such an effort.

We (HPX) have always used Boost as our role model for code quality, portability requirements, coding guidelines, license, review discipline (I believe we're even further along the road here if compared to Boost), etc. Thus - if somebody is interested in code quality and excellent support - I don't see any reason not to use HPX in the first place, even if it's external. It wouldn't be more 'external' than using Boost.

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu
Hi,
<rant> Why do people fall into the trap of the NIH syndrome over and over again? *sigh* </rant>
This is not NIH. I am happily using all kinds of external libraries. However, for a general purpose library, especially when targeting something like boost, I hesitate to do it without providing a reasonable fallback implementation, especially if it is a library I cannot assume to be readily installed and available on a regular unix system.

My experience with users of my libraries is: most users just want to play around with it and do not care about it being fast and efficient. Requiring them to install a lot of external dependencies can make them lose interest - and I do not want that. Later, when they want to try it out for real tasks, they will remember that they should install the external libraries, and then they can actually make use of them and see the improvements. (I do not work together with skilled programmers but with students; they struggle getting boost installed on their windows machines and then getting cmake to find boost, so I really don't want to add more hurdles!)
Anyways, while HPX supports distributed computing, you don't have to use it for this. HPX can be used just fine for SMP style (local only) applications as well. In fact we have shown that by simply switching from std::async/std::future to hpx::async/hpx::future you can achieve a speedup of up to a factor of 3-4 (depending on the application).
This, for example, would be a reason for me to provide a default implementation using std::future that I can exchange for the hpx:: versions using a preprocessor directive.

Btw: is there a specific reason for that speedup? Is there just less overhead in the implementation, or is it improved scheduling of tasks?
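Concretely, such a preprocessor switch could look like this (a sketch; the macro name ABLAS_USE_HPX is made up for illustration):

    #ifdef ABLAS_USE_HPX
      #include <hpx/include/async.hpp>
      namespace ablas { using hpx::async; using hpx::future; }
    #else
      #include <future>
      namespace ablas { using std::async; using std::future; }
    #endif
    // the library then spells ablas::async / ablas::future everywhere,
    // and the backend is chosen at configuration time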
That does not entail utilizing some of the more sophisticated features HPX provides you with (extending futures like proposed by the Concurrency TS, dataflow, parallel algorithms, task regions, GPU integration, etc.). Those usually give you an additional benefit in terms of speedup, readability, and maintainability.
And this would be an example where I would rather wait with using those features, unless I can find ways to make everything work with some default implementation.

Best, Oswin
Btw: is there a specific reason for that speedup? Is there just less overhead in the implementation, or is it improved scheduling of tasks?
It's mainly because of a) very efficient user-level threading, and b) low-overhead synchronization primitives.

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu
On 22/11/2015 20:18, Oswin Krause wrote:
Hi,
<rant> Why do people fall into the trap of the NIH syndrome over and over again? *sigh* </rant>
This is not NIH. I am happily using all kinds of external libraries. However, for a general purpose library, especially when targeting something like boost, I hesitate to do it without providing a reasonable fallback implementation, especially if it is a library I cannot assume to be readily installed and available on a regular unix system.
Having a cmake script install whatever dependencies you require is not terrible to write. We tried to do our own async in NT2, then switched to HPX because it was too much work.
Hi,

giving a small heads-up.

There is now a minimal PoC at https://github.com/Ulfgard/aBLAS . Minimal in the sense that I took my existing LinAlg, ripped it apart and rewrote it partially to fit the new needs of the library. Only the CPU part is implemented yet. I am open to suggestions, helpful advice and of course people who are interested in working on it. I am not too happy with the scheduling interface right now, and its implementation looks a bit slower than necessary, but I think this will evolve over time.
Two basic examples showing what the library can already do are given in examples/ . For ublas users this should not look too foreign.
For all who are interested, here is the basic design and the locations in include/aBLAS/:
1. Computational kernels are implemented in kernels/ and represent typical bindings to the BLAS1-3 functionality, as well as a default implementation (currently only dot, gemv, gemm and assignment are tested and working; no explicit bindings are included yet). Kernels are enqueued in the scheduler via the expression template mechanisms, and kernels are not allowed to enqueue kernels recursively.
2. A simple PoC scheduler is implemented in scheduling/scheduling.hpp. It implements a dependency graph between work packages, and work is enqueued into a boost::thread::basic_thread_pool when all its dependencies are resolved. A kernel is enqueued together with a set of dependency_node objects which encapsulate the dependencies of the variables the kernel uses (i.e. every variable keeps track of what its latest dependencies are and whether these dependencies read from it or write to it). The current interface should be abstracted enough to allow implementations using different technologies (e.g. it should be possible to implement the scheduler in terms of HPX).
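The bookkeeping can be pictured roughly like this (all names are made up for illustration; the real dependency_node in scheduling/scheduling.hpp differs):

    #include <cstddef>
    #include <vector>

    using task_id = std::size_t;
    constexpr task_id no_task = 0;

    // per-variable record of the latest tasks touching it
    struct dependency_node {
        std::vector<task_id> last_readers; // tasks currently reading the variable
        task_id last_writer = no_task;     // task currently writing it
    };
    // rule of thumb: a new reader must wait on last_writer; a new writer must
    // wait on last_writer and on all last_readers before the pool may run it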
One of the tasks of the scheduler is to allow the creation of closures where variables are guaranteed to exist until all kernels using them are finished, as well as moving a variable into a closure. This is used to prevent an issue similar to the blocking destructor of std::future<T>: instead of blocking, the variable is moved into the scheduler, which then ensures its lifetime until all kernels are finished. This of course requires the kernels to be called in a way that they can cope with the moving of types.
What is currently missing is "user created dependencies" to be used in conjunction with the GPU (as GPUs are fully asynchronous, we have to register a callback that notifies the scheduler when the GPU is done with its computations, just as the worker threads do).
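Until that exists, the mechanism can be simulated with standard futures (a std-only sketch; the real scheduler would expose its own dependency token instead of a promise):

    #include <cstdio>
    #include <future>
    #include <thread>

    int main() {
        std::promise<void> gpu_done; // the "user created dependency"
        std::shared_future<void> dep = gpu_done.get_future().share();

        std::thread gpu([&] {        // stand-in for an asynchronous device stream
            /* ... device computation ... */
            gpu_done.set_value();    // the completion callback notifying the scheduler
        });

        std::thread cpu([dep] {      // a CPU kernel depending on the GPU result
            dep.wait();              // runs only once the dependency is resolved
            std::puts("cpu kernel runs after the gpu result is ready");
        });

        gpu.join();
        cpu.join();
    }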
3. Basic matrix/vector classes are implemented in matrix.hpp and vector.hpp. The implementation is a bit convoluted for the "move into closure" to work; basically they introduce another indirection. When a kernel is created, it references a special closure type of the variable (vector<T>::closure_type), which references that indirection.
4. The remaining files in include/aBLAS/*.hpp implement the expression templates, which are similar to uBLAS. There are two types, distinguished using the CRTP classes matrix_expression<Type,Device> and vector_expression<Type,Device>. Type is the exact type of the expression; Device marks the device this is working on (cpu_tag or gpu_tag). The second template parameter ensures that one cannot mix gpu and cpu expressions unless the operation explicitly allows this. While this looks clumsy, it removes code duplication between the different device types, as most of the code can be used for both implementations. assignment.hpp implements the basic assignment operators (except op=). A += B either calls kernels::assign<>(A,B) if B can be evaluated efficiently elementwise (e.g. A += 2*A1+A2-A3), or calls B.plus_assign_to(A), which then assigns the terms one-by-one using their specialized kernels (e.g. A += prod(B,C)+D is evaluated as kernels::gemm(B,C,A); kernels::assign<...>(A,D);).

matrix/vector_proxy.hpp implement subranges, rows-of-matrix operations, etc., and matrix/vector_expression.hpp the algebraic operations.
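A stripped-down skeleton of that CRTP/device-tag scheme (simplified for illustration; the real classes carry much more):

    struct cpu_tag {};
    struct gpu_tag {};

    // every expression derives from this, naming its concrete type and device
    template<class Type, class Device>
    struct matrix_expression {
        Type const& self() const { return static_cast<Type const&>(*this); }
    };

    template<class E1, class E2> struct matrix_addition; // result expression type

    // operations require matching device tags, so mixing a cpu expression with
    // a gpu expression fails to compile unless an overload explicitly allows it
    template<class E1, class E2, class Device>
    matrix_addition<E1, E2> operator+(matrix_expression<E1, Device> const& a,
                                      matrix_expression<E2, Device> const& b);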
There is now a minimal PoC at https://github.com/Ulfgard/aBLAS . Minimal in the sense that I took my existing LinAlg, ripped it apart and rewrote it partially to fit the new needs of the library. Only the CPU part is implemented yet. I am open to suggestions, helpful advice and of course people who are interested in working on it. I am not too happy with the scheduling interface right now, and its implementation looks a bit slower than necessary, but I think this will evolve over time.
The scheduling should not be part of your library, it's a pure implementation detail.
Two basic examples showing what the library can already do are given in examples/ . For ublas users this should not look too foreign.
I don't think that you should make it explicit to the user whether a certain operation has to be async or not. Just make every operation async by default and let the library decide when to synchronize (i.e. whenever the actual matrix data is accessed). Any expression (like r += 2*x+1 + prod(x,y); btw: why prod(x,y) and not x*y?) should not execute any code directly, just construct an expression dependency tree which starts scheduling the asynchronous sub-tasks. You need to wait for the result only when somebody accesses an element of r. The easiest would be to let your matrix type wrap a future which references the actual matrix. Additional transformations of the expression tree could be used to merge the operations, if needed.
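A minimal sketch of that idea (names made up; error handling and expression merging omitted, and element type double assumed for brevity):

    #include <cstddef>
    #include <future>
    #include <utility>

    // user-facing handle: wraps a future on the actual storage; reading an
    // element is the synchronization point
    template<class Storage>
    class async_matrix {
        std::shared_future<Storage> data_;
    public:
        explicit async_matrix(std::shared_future<Storage> f) : data_(std::move(f)) {}

        // assumes Storage provides operator()(i,j); waits only when called
        double operator()(std::size_t i, std::size_t j) const {
            return data_.get()(i, j);
        }
    };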
For all who are interested, here is the basic design and the locations in include/aBLAS/:
1. Computational kernels are implemented in kernels/ and represent typical bindings to the BLAS1-3 functionality, as well as a default implementation (currently only dot, gemv, gemm and assignment are tested and working; no explicit bindings are included yet). Kernels are enqueued in the scheduler via the expression template mechanisms, and kernels are not allowed to enqueue kernels recursively.
Why do you implement your own BLAS functions?
2. A simple PoC scheduler is implemented in scheduling/scheduling.hpp. It implements a dependency graph between work packages, and work is enqueued into a boost::thread::basic_thread_pool when all its dependencies are resolved. A kernel is enqueued together with a set of dependency_node objects which encapsulate the dependencies of the variables the kernel uses (i.e. every variable keeps track of what its latest dependencies are and whether these dependencies read from it or write to it). The current interface should be abstracted enough to allow implementations using different technologies (e.g. it should be possible to implement the scheduler in terms of HPX).
One of the tasks of the scheduler is to allow the creation of closures where variables are guaranteed to exist until all kernels using them are finished, as well as moving a variable into a closure. This is used to prevent an issue similar to the blocking destructor of std::future<T>: instead of blocking, the variable is moved into the scheduler, which then ensures its lifetime until all kernels are finished. This of course requires the kernels to be called in a way that they can cope with the moving of types.
As said, the scheduler should be a pure implementation detail and not exposed to the user. I'd even suggest relying on executors for the scheduling needs, leaving the scheduler out of the picture completely. Also, if you need dependencies, those shouldn't be handled by the executor, but rather be imposed by the code using it (via futures or similar).
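For example, with HPX the dependency ordering can live entirely in user code (a sketch; matrix, prod, compute_A and compute_B are placeholders, and the exact headers depend on the HPX version):

    #include <hpx/include/async.hpp>
    #include <hpx/include/dataflow.hpp>
    #include <utility>

    struct matrix { /* placeholder storage */ };
    matrix prod(matrix const&, matrix const&); // placeholder, defined elsewhere
    matrix compute_A();                        // placeholder
    matrix compute_B();                        // placeholder

    void example() // assumes the HPX runtime is up
    {
        hpx::future<matrix> fA = hpx::async(&compute_A);
        hpx::future<matrix> fB = hpx::async(&compute_B);

        // the product runs once both inputs are ready; no library-private
        // scheduler is involved, the futures carry the dependency
        hpx::future<matrix> fC = hpx::dataflow(
            [](hpx::future<matrix> a, hpx::future<matrix> b) {
                return prod(a.get(), b.get());
            },
            std::move(fA), std::move(fB));
        (void)fC;
    }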
What is currently missing is "user created dependencies" to be used in conjunction with the GPU (as GPUs are fully asynchronous, we have to register a callback that notifies the scheduler when the GPU is done with its computations, just as the worker threads do).
3. Basic matrix/vector classes are implemented in matrix.hpp and vector.hpp. The implementation is a bit convoluted for the "move into closure" to work; basically they introduce another indirection. When a kernel is created, it references a special closure type of the variable (vector<T>::closure_type), which references that indirection.
4. The remaining files in include/aBLAS/*.hpp implement the expression templates, which are similar to uBLAS. There are two types, distinguished using the CRTP classes matrix_expression<Type,Device> and vector_expression<Type,Device>. Type is the exact type of the expression; Device marks the device this is working on (cpu_tag or gpu_tag). The second template parameter ensures that one cannot mix gpu and cpu expressions unless the operation explicitly allows this. While this looks clumsy, it removes code duplication between the different device types, as most of the code can be used for both implementations. assignment.hpp implements the basic assignment operators (except op=). A += B either calls kernels::assign<>(A,B) if B can be evaluated efficiently elementwise (e.g. A += 2*A1+A2-A3), or calls B.plus_assign_to(A), which then assigns the terms one-by-one using their specialized kernels (e.g. A += prod(B,C)+D is evaluated as kernels::gemm(B,C,A); kernels::assign<...>(A,D);).

matrix/vector_proxy.hpp implement subranges, rows-of-matrix operations, etc., and matrix/vector_expression.hpp the algebraic operations.
Can I use other BLAS implementations, like for instance Intel MKL?

General note: I have the impression that - just like any decently sized system eventually embeds a half-baked and bug-ridden implementation of LISP - any asynchronous library nowadays contains its own half-assed implementation of a parallel runtime system. So let me ask again: why not use a full blown, mature, and thoroughly tested runtime system for your needs instead of worrying about how to schedule the 5 dependent tasks your library creates?

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu
On 11/22/15 6:06 AM, Oswin Krause wrote:
On 2015-11-21 06:12, Sebastian Schaetz wrote:
Of course I do know Eigen, Armadillo and NT2 ... but I cannot rely on HPX because it would then be a mandatory external library -
The question for me is: is there enough interest in such a thing that it is worthwhile for me to create a minimal prototype on which a discussion could be based?
Hmmm - if you made a boost library which implemented what you desire, wouldn't that become an "external library" which you could not then use? Is the whole motivation of this proposal to work around some internal prohibition of "external code" imposed for some non-technical reasons like legal department BS or grant requirements or ??? I'm not getting the motivation here.

Robert Ramey
Hi Oswin,

just a simple question: why not contribute it to ublas directly? Ublas is getting old anyway and needs a serious rewrite of some parts. Your contribution would be exceptional. And your work would be recognized as being part of ublas directly and be used right away by thousands of users. Let me know if you're interested in contributing back to ublas instead.

Cheers, David

On Thu, Nov 19, 2015 at 2:55 PM, Oswin Krause <Oswin.Krause@ruhr-uni-bochum.de> wrote:
Hi list,

I would like to propose a new library to boost: asynchronous blas. [full proposal quoted above, snipped here]
Hi David,
And your work would be recognized as being part of ublas directly and be used right away by thousands of users. Let me know if you're interested in contributing back to ublas instead.
Yes, I am interested, but currently I do not see a roadmap, as ublas is SO big that I fear I cannot fix as much code as my proposed changes would break (even without the asynchronous part, GPUs, ...). One of the starting points of my rewrite was that I deleted around 40k lines of ublas code before I started, and then designed the interface around a defined subset of features (namely one dense format and one sparse format, later adding the triangular packed format; I still haven't gotten sparse right, though). From then on I decided to break the functionality apart in a few places (assignment, kernels, and iterators) and use the expression templates as a simple connection between the parts.

Another point: I would not propose a major change to an existing boost library without triggering a formal boost review process - this change would break any code that relies on any ublas internals.

Best, Oswin
participants (8)
- Beman Dawes
- Ben Pope
- David Bellot
- Hartmut Kaiser
- Joel FALCOU
- Oswin Krause
- Robert Ramey
- Sebastian Schaetz