Hi,
a quick heads-up.
There is now a minimal PoC at https://github.com/Ulfgard/aBLAS . Minimal
in the sense that I took my existing LinAlg, ripped it apart and
partially rewrote it to fit the new needs of the library. Only a CPU
implementation exists so far. I am open to suggestions and helpful
advice, and of course to people who are interested in working on it. I
am not too happy with the scheduling interface right now and its
implementation looks a bit slower than necessary, but I think this will
evolve over time.
Two basic examples showing what the library can already do are given in
examples/ . For uBLAS users this should not look too foreign.
For all who are interested, here is the basic design and the locations
in include/aBLAS/:
1. Computational kernels are implemented in kernels/ and represent
typical bindings to the BLAS1-3 functionality as well as a default
implementation (currently only dot, gemv, gemm and assignment are tested
and working; no explicit bindings are included yet). Kernels are
enqueued in the scheduler via the expression template mechanisms, and
kernels are not allowed to enqueue kernels recursively.
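To illustrate item 1, here is a rough sketch of how an expression could
turn into an enqueued kernel. It is not the aBLAS code: kernel_queue,
dense_matrix, matrix_vector_prod, prod, assign and gemv_default are all
hypothetical names, and a plain FIFO stands in for the scheduler.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for the scheduler: a plain FIFO of deferred kernels.
struct kernel_queue {
    std::queue<std::function<void()>> pending;
    void enqueue(std::function<void()> k) { pending.push(std::move(k)); }
    void run_all() {
        while (!pending.empty()) {
            auto k = std::move(pending.front());
            pending.pop();
            k();
        }
    }
};

// Hypothetical dense row-major matrix.
struct dense_matrix {
    std::size_t rows, cols;
    std::vector<double> data;
    double operator()(std::size_t i, std::size_t j) const {
        return data[i * cols + j];
    }
};

// Default (non-BLAS) gemv kernel: y += alpha * A * x.
void gemv_default(const dense_matrix& A, const std::vector<double>& x,
                  std::vector<double>& y, double alpha) {
    assert(A.cols == x.size() && A.rows == y.size());
    for (std::size_t i = 0; i < A.rows; ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j < A.cols; ++j) sum += A(i, j) * x[j];
        y[i] += alpha * sum;
    }
}

// Expression node: describes A * x without computing anything.
struct matrix_vector_prod {
    const dense_matrix& A;
    const std::vector<double>& x;
};
matrix_vector_prod prod(const dense_matrix& A, const std::vector<double>& x) {
    return {A, x};
}

// "Assignment" of an expression: enqueues the kernel instead of running it.
// The operands must outlive the queue here; no lifetime handling is shown.
void assign(kernel_queue& q, std::vector<double>& y, matrix_vector_prod e) {
    q.enqueue([&y, e] {
        y.assign(e.A.rows, 0.0);
        gemv_default(e.A, e.x, y, 1.0);
    });
}

// Usage: y = A * x, evaluated only when the queue is drained.
std::vector<double> demo_gemv() {
    kernel_queue q;
    dense_matrix A{2, 2, {1.0, 2.0, 3.0, 4.0}};
    std::vector<double> x{1.0, 1.0};
    std::vector<double> y;
    assign(q, y, prod(A, x));  // y is still empty at this point
    q.run_all();               // now the gemv kernel actually runs
    return y;
}
```

The point of the split is that prod() only records the operands, while
the decision of when and where the kernel runs is left to whatever
consumes the queue.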
2. A simple PoC scheduler is implemented in scheduling/scheduling.hpp.
It implements a dependency graph between work packages, and work is
enqueued into a boost::basic_thread_pool (from Boost.Thread) once all
its dependencies are resolved. A kernel is enqueued together with a set
of dependency_node objects which encapsulate the dependencies of the
variables the kernel uses (i.e. every variable keeps track of its latest
dependencies and of whether these dependencies read from it or write to
it). The current interface should be abstract enough to allow
implementations using different technologies (e.g. it should be possible
to implement the scheduler in terms of HPX).
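The read/write tracking described above can be sketched roughly as
follows. This is a single-threaded toy under my own assumptions, not the
code in scheduling.hpp: toy_scheduler and work_item are made-up names,
only dependency_node is borrowed from the text, and ready work is run
inline instead of being handed to a thread pool.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// One work package plus bookkeeping about who is waiting for it.
struct work_item {
    std::function<void()> kernel;
    std::size_t unresolved = 0;                          // open dependencies
    std::vector<std::shared_ptr<work_item>> dependents;  // waiting on us
    bool finished = false;
};
using work_ptr = std::shared_ptr<work_item>;

// Per-variable record: the last writer and all readers since that write.
struct dependency_node {
    work_ptr last_writer;
    std::vector<work_ptr> readers;
};

class toy_scheduler {
public:
    void enqueue(std::function<void()> kernel,
                 const std::vector<dependency_node*>& reads,
                 const std::vector<dependency_node*>& writes) {
        auto item = std::make_shared<work_item>();
        item->kernel = std::move(kernel);
        // A reader waits for the last writer of each variable it reads.
        for (auto* n : reads) depend_on(item, n->last_writer);
        // A writer waits for the last writer and for all readers since.
        for (auto* n : writes) {
            depend_on(item, n->last_writer);
            for (auto& r : n->readers) depend_on(item, r);
        }
        // Record the new item as the latest dependency of its variables.
        for (auto* n : reads) n->readers.push_back(item);
        for (auto* n : writes) {
            n->last_writer = item;
            n->readers.clear();
        }
        if (item->unresolved == 0) ready_.push_back(item);
    }

    // Runs ready items in LIFO order on purpose: the order among
    // independent items is arbitrary, only the dependencies are honoured.
    void run() {
        while (!ready_.empty()) {
            work_ptr item = ready_.back();
            ready_.pop_back();
            item->kernel();
            item->finished = true;
            for (auto& d : item->dependents)
                if (--d->unresolved == 0) ready_.push_back(d);
            item->dependents.clear();
        }
    }

private:
    static void depend_on(const work_ptr& item, const work_ptr& dep) {
        if (!dep || dep->finished || dep == item) return;
        dep->dependents.push_back(item);
        ++item->unresolved;
    }
    std::vector<work_ptr> ready_;
};

// Usage: kernels 1 and 2 are independent, 3 reads both of their results,
// and 4 overwrites a variable that 3 still reads.
std::vector<int> demo_order() {
    toy_scheduler s;
    dependency_node a, b, c;
    std::vector<int> order;
    s.enqueue([&] { order.push_back(1); }, {}, {&a});
    s.enqueue([&] { order.push_back(2); }, {}, {&b});
    s.enqueue([&] { order.push_back(3); }, {&a, &b}, {&c});
    s.enqueue([&] { order.push_back(4); }, {}, {&a});
    s.run();
    return order;
}
```

In this toy, kernels 1 and 2 may run in either order, kernel 3 only runs
after both, and kernel 4 only after 3 has finished reading a.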
One of the tasks of the scheduler is to allow the creation of closures
in which variables are guaranteed to exist until all kernels using them
have finished, as well as moving a variable into a closure. This is used
to prevent an issue similar to the blocking destructor of a
std::future<T> returned by std::async. Instead of blocking, the variable
is moved into the scheduler, which then ensures its lifetime until all
kernels have finished. This of course requires the kernels to be called
in a way that lets them cope with the moving of types.
What is currently missing are "user-created dependencies" for use in
conjunction with the GPU (as GPUs are fully asynchronous, we have to
register a callback that notifies the scheduler when the GPU is done
with its computations, just as the worker threads do).
3. Basic matrix/vector classes are implemented in matrix.hpp and
vector.hpp. The implementation is a bit convoluted in order to make the
"move into closure" work. Basically, the classes introduce another
indirection. When a kernel is created, it references a special closure
type of the variable (vector<T>::closure_type), which in turn references
that indirection.
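A rough sketch of the indirection idea, to make the lifetime argument
concrete. Note that it swaps in shared ownership via std::shared_ptr
instead of actually moving the variable into the scheduler, so it only
approximates the mechanism; toy_vector and its members are hypothetical,
and only the name closure_type is taken from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

template <class T>
class toy_vector {
    // The extra indirection: the elements live behind a shared handle,
    // not directly inside the vector object.
    struct storage {
        std::vector<T> values;
    };
    std::shared_ptr<storage> handle_;

public:
    explicit toy_vector(std::size_t n, T init = T())
        : handle_(std::make_shared<storage>()) {
        handle_->values.assign(n, init);
    }

    // What a kernel holds instead of a reference to the vector itself.
    class closure_type {
        std::shared_ptr<storage> handle_;

    public:
        explicit closure_type(std::shared_ptr<storage> h)
            : handle_(std::move(h)) {}
        T& operator[](std::size_t i) const { return handle_->values[i]; }
        std::size_t size() const { return handle_->values.size(); }
    };

    closure_type closure() const { return closure_type(handle_); }
    T& operator[](std::size_t i) { return handle_->values[i]; }
    std::size_t size() const { return handle_->values.size(); }
};

// Usage: the vector goes out of scope before its kernel has run, yet the
// kernel still sees valid storage and nothing had to block.
double demo_lifetime() {
    std::vector<std::function<double()>> pending;  // stand-in for a scheduler
    {
        toy_vector<double> x(3, 2.0);
        auto cx = x.closure();
        pending.push_back([cx] {
            double sum = 0.0;
            for (std::size_t i = 0; i < cx.size(); ++i) sum += cx[i];
            return sum;
        });
    }  // x is destroyed here; its storage survives inside the closure
    return pending.front()();
}
```

The destructor of the vector returns immediately, and the storage is
released once the last closure referring to it is gone.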
4. The remaining files in include/aBLAS*.hpp implement the expression
templates, which are similar to those of uBLAS. There are two types, distinguished
using the CRTP classes matrix_expression