Hi Oswin,

Just a simple question: why not contribute it to ublas directly? ublas is getting old anyway and needs a serious rewrite of some parts. Your contribution would be exceptional, your work would be recognized as being part of ublas directly, and it would be used right away by thousands of users. Let me know if you're interested in contributing back to ublas instead.

Cheers,
David

On Thu, Nov 19, 2015 at 2:55 PM, Oswin Krause <Oswin.Krause@ruhr-uni-bochum.de> wrote:
Hi list,
Disclaimer: What I am proposing here does not exist yet. But it will in the near future, and if there is interest I would like to give something back to the Boost community.
I would like to propose a new library to boost: asynchronous BLAS.
What is it and what can I do with it? It is a linear algebra library in which all computations are carried out asynchronously on GPU and CPU. Computing a matrix-matrix product does not block the current thread; instead it launches a task which at some point returns with a result, allowing the main thread to continue until the result is needed. GPU and CPU tasks can be started at the same time and can depend on each other (e.g. a CPU computation waits until the GPU has returned some needed result, and vice versa).
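The execution model can be illustrated with nothing but the standard library. This is only an analogy: the proposed library would hide the future behind expression templates and schedule work on the GPU as well.

    // Analogy using std::async: the call returns immediately and the
    // calling thread only blocks when it actually needs the result.
    #include <functional>
    #include <future>
    #include <numeric>
    #include <vector>

    double dot(std::vector<double> const& a, std::vector<double> const& b) {
        return std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
    }

    int main() {
        std::vector<double> x(1000, 1.0), y(1000, 2.0);
        // launch the computation; this does not block
        std::future<double> r =
            std::async(std::launch::async, dot, std::cref(x), std::cref(y));
        // ... the main thread is free to do other work here ...
        double value = r.get(); // block only when the result is needed
        return value == 2000.0 ? 0 : 1;
    }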
Why is this important? Which problems does it solve? Many linear algebra libraries are targeted towards either a) solving a few large linear algebra problems or b) solving many small, similar problems.
This library is targeted towards the middle ground: a medium number of decently sized problems. Here the problem is that it is often hard to use all the available computing resources, as a single problem will not exhaust the device and there is no easy way to launch several similar tasks in parallel. This occurs often in GPU computing, where out-of-order execution of kernels is used to fill up the device. Similarly, it is often hard to make CPU and GPU work together, as GPU tasks are asynchronous while CPU tasks are typically synchronous, and it is not straightforward to couple both without blocking either the CPU or the GPU.
How does it work? Expression templates similar to those in boost::ublas are used to create a dependency graph: which object is used in which computations, and which tasks have to wait in order to guarantee a correct outcome.
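As a rough sketch of that expression-template layer (simplified, and none of these names are final):

    // prod() computes nothing; it returns a lightweight node recording its
    // operands, so that the assignment to a matrix can enqueue a task and
    // register the proper read/write dependencies.
    template<class Lhs, class Rhs>
    struct matrix_product {
        Lhs const& lhs;
        Rhs const& rhs;
    };

    template<class Lhs, class Rhs>
    matrix_product<Lhs, Rhs> prod(Lhs const& lhs, Rhs const& rhs) {
        return matrix_product<Lhs, Rhs>{lhs, rhs};
    }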
A simple case would be two independent computations like

matrix A = prod(B,trans(B));
matrix C = prod(trans(B),B);
A normal library like ublas would wait until A is computed before C is evaluated. Instead, it can be more efficient to just add the two tasks to a list and let worker threads execute them as soon as resources are available. First the computation of A is put into the graph; A is marked as "being written to" while B is marked as "being read". When C is added to the list, it does not depend on the first computation, as several tasks can read from B at the same time, so another worker thread can grab the task and execute it right away.
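A minimal sketch of this read/write bookkeeping, assuming a task type that carries its dependencies (the actual scheduler design is still open):

    #include <memory>
    #include <vector>

    struct task {
        std::vector<std::shared_ptr<task>> depends_on; // must finish first
    };

    struct matrix_state {
        std::shared_ptr<task> last_writer;          // pending write, if any
        std::vector<std::shared_ptr<task>> readers; // pending reads
    };

    // A read only waits for a pending write; concurrent reads are fine.
    void mark_read(std::shared_ptr<task> const& t, matrix_state& m) {
        if (m.last_writer) t->depends_on.push_back(m.last_writer);
        m.readers.push_back(t);
    }

    // A write waits for the pending write AND all pending reads.
    void mark_write(std::shared_ptr<task> const& t, matrix_state& m) {
        if (m.last_writer) t->depends_on.push_back(m.last_writer);
        t->depends_on.insert(t->depends_on.end(),
                             m.readers.begin(), m.readers.end());
        m.readers.clear();
        m.last_writer = t;
    }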
If we now add the task
B += prod(D,prod(C,trans(D)));
the library computes it as
matrix temp = prod(C,trans(D)); B += prod(D,temp);
temp can be computed as soon as C is computed. But B can only be updated after A and temp are computed, as we may only write to B after all previous tasks using B have finished.
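Put together, the rules above would order the four tasks like this (assumed behavior):

    task1: A = prod(B,trans(B));     reads B, writes A
    task2: C = prod(trans(B),B);     reads B, writes C    (independent of task1)
    task3: temp = prod(C,trans(D));  reads C and D, writes temp; waits for task2
    task4: B += prod(D,temp);        reads D and temp, writes B; waits for task3
                                     and for task1 and task2 (the readers of B)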
Are you qualified? I am the main contributor to https://github.com/Shark-ML/Shark/ and as part of it I forked ublas ( https://github.com/Shark-ML/Shark/tree/master/include/shark/LinAlg/BLAS ), which can now use several linear algebra libraries as backends (ATLAS, OpenBLAS, ...).
What is the timeline? I would start developing a minimal example ASAP, based on the ublas rewrite above, with a minimal set of features (only dense square matrices) and basic GPU support.
Opinions?
Best, Oswin