I will need to experiment more with both of these libraries to get a better sense of which one is the better fit. The preliminary idea is to split responsibilities between the NN library and uBlas/Boost.Compute, such that the NN library defines an interface and familiar abstractions in the NN domain, while uBlas/Boost.Compute serve as the core computation engine. If this idea works out as I hope, we can set aside the discussion of hardware support, because it will come with the underlying compute engine, and we can focus more on the convenience of the interface and abstractions that an NN library can provide for easier use of ML elements.
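To make that split more concrete, here is a minimal sketch of what I have in mind, shown with a uBlas backend; the names (dense_layer, ublas_backend, multiply_add) are just placeholders, and a Boost.Compute backend would simply provide the same small interface:

#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/vector.hpp>
#include <utility>

namespace ublas = boost::numeric::ublas;

// NN-domain vocabulary (layer, forward pass) lives here; all numeric work
// is delegated to whatever backend is plugged in.
template <typename Backend>
class dense_layer
{
public:
    using matrix = typename Backend::matrix;
    using vector = typename Backend::vector;

    dense_layer(matrix weights, vector bias)
        : weights_(std::move(weights)), bias_(std::move(bias)) {}

    // Forward pass: y = W * x + b, computed entirely by the backend.
    vector forward(const vector& x) const
    {
        return Backend::multiply_add(weights_, x, bias_);
    }

private:
    matrix weights_;
    vector bias_;
};

// uBlas-based backend; a Boost.Compute-based one would expose the same members.
struct ublas_backend
{
    using matrix = ublas::matrix<float>;
    using vector = ublas::vector<float>;

    static vector multiply_add(const matrix& W, const vector& x, const vector& b)
    {
        return ublas::prod(W, x) + b;
    }
};

int main()
{
    ublas::matrix<float> W(2, 3, 0.5f);      // 2x3 weights
    ublas::vector<float> x(3, 1.0f), b(2, 0.1f);

    dense_layer<ublas_backend> layer(W, b);
    ublas::vector<float> y = layer.forward(x); // y = W*x + b
    return 0;
}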
Boost.Compute + OpenCL extensions to leverage the hardware definitely looks like the right path forward and would be a useful addition to this library. It would require a careful selection of OpenCL kernels for optimal speed, which was obvious from a simple test with different implementations of matrix * vector that I ran on the few OpenCL devices available on my computer. To my surprise, the plain C++ version outperformed my GPU, and I got a nice speedup from the OpenCL implementation on the CPU with a simplistic kernel (a sketch of such a kernel follows the results below). I must have a very old and slow GPU. These are the raw test results for a 4096 x 4096 matrix, in case anybody is interested.

Best regards,
Sergei Marchenko

OpenCL Platform: 'ATI Stream' (vendor: Advanced Micro Devices, Inc.)
  Devices:
    Device: 'Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz' (version: 2.0) (type: CPU)
    Device: 'Toucan' (version: CAL 1.4.1848) (type: GPU)
  Extensions: cl_khr_icd cl_amd_event_callback cl_khr_d3d10_sharing

OpenCL Platform: 'AMD Accelerated Parallel Processing' (vendor: Advanced Micro Devices, Inc.)
  Devices:
    Device: 'Turks' (version: 1800.11 (VM)) (type: GPU)
    Device: 'Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz' (version: 1800.11 (sse2,avx)) (type: CPU)
  Extensions: cl_khr_icd cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_amd_event_callback cl_amd_offline_devices

Test Device: 'Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz' (version: 2.0) (type: CPU)
  Testing matrix * vector (map+reduce kernels):
    Map Elapsed: 45999200 ns,    Map BandWidth: 1.50486 GB/s
    Reduce Elapsed: 170500 ns,   Reduce BandWidth: 12.3961 GB/s
    Elapsed: 54297900 ns,        BandWidth: 2.47188 GB/s
  Testing matrix * vector (naive kernel):
    Elapsed: 5341900 ns,         BandWidth: 12.5689 GB/s
  Testing matrix * vector (Boost.Compute algorithms):
    Elapsed: 724216500 ns,       BandWidth: 0.185351 GB/s
  Testing matrix * vector (plain C++):
    Elapsed: 17725800 ns,        BandWidth: 3.78779 GB/s

Test Device: 'Toucan' (version: CAL 1.4.1848) (type: GPU)
  Testing matrix * vector (map+reduce kernels):
    Map Elapsed: 490535376 ns,   Map BandWidth: 0.141116 GB/s
    Reduce Elapsed: 2236373 ns,  Reduce BandWidth: 0.945073 GB/s
    Elapsed: 602027100 ns,       BandWidth: 0.222943 GB/s
  Testing matrix * vector (naive kernel):
    Elapsed: 170503700 ns,       BandWidth: 0.393784 GB/s
  Testing matrix * vector (Boost.Compute algorithms):
    Elapsed: 6837179400 ns,      BandWidth: 0.019633 GB/s
  Testing matrix * vector (plain C++):
    Elapsed: 17901600 ns,        BandWidth: 3.75059 GB/s

Test Device: 'Turks' (version: 1800.11 (VM)) (type: GPU)
  Testing matrix * vector (map+reduce kernels):
    Map Elapsed: 222894000 ns,   Map BandWidth: 0.310562 GB/s
    Reduce Elapsed: 5166778 ns,  Reduce BandWidth: 0.409063 GB/s
    Elapsed: 248867100 ns,       BandWidth: 0.539315 GB/s
  Testing matrix * vector (naive kernel):
    Elapsed: 156637700 ns,       BandWidth: 0.428643 GB/s
  Testing matrix * vector (Boost.Compute algorithms):
    Elapsed: 2145102000 ns,      BandWidth: 0.062577 GB/s
  Testing matrix * vector (plain C++):
    Elapsed: 17918300 ns,        BandWidth: 3.7471 GB/s

Test Device: 'Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz' (version: 1800.11 (sse2,avx)) (type: CPU)
  Testing matrix * vector (map+reduce kernels):
    Map Elapsed: 37620700 ns,    Map BandWidth: 1.84001 GB/s
    Reduce Elapsed: 245500 ns,   Reduce BandWidth: 8.60911 GB/s
    Elapsed: 43919500 ns,        BandWidth: 3.05599 GB/s
  Testing matrix * vector (naive kernel):
    Elapsed: 5410200 ns,         BandWidth: 12.4102 GB/s
  Testing matrix * vector (Boost.Compute algorithms):
    Elapsed: 641987200 ns,       BandWidth: 0.209092 GB/s
  Testing matrix * vector (plain C++):
    Elapsed: 17944000 ns,        BandWidth: 3.74173 GB/s
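For anyone curious what the "naive kernel" case amounts to: one work-item per output row, each accumulating a full dot product. Below is a minimal sketch of that approach using Boost.Compute to drive the kernel; it is an illustrative reconstruction rather than the exact test code, and the kernel name matvec, the fill values, and the work sizes are just placeholders:

#include <boost/compute/core.hpp>
#include <boost/compute/container/vector.hpp>
#include <boost/compute/algorithm/copy.hpp>
#include <boost/compute/utility/source.hpp>
#include <vector>

namespace compute = boost::compute;

int main()
{
    const std::size_t n = 4096;

    compute::device device = compute::system::default_device();
    compute::context context(device);
    compute::command_queue queue(context, device);

    // Host-side data: n x n matrix (row-major) and input vector.
    std::vector<float> h_A(n * n, 1.0f), h_x(n, 1.0f), h_y(n);

    // Device-side buffers.
    compute::vector<float> d_A(h_A.begin(), h_A.end(), queue);
    compute::vector<float> d_x(h_x.begin(), h_x.end(), queue);
    compute::vector<float> d_y(n, context);

    // Naive kernel: one work-item per row, serial loop over columns.
    const char source[] = BOOST_COMPUTE_STRINGIZE_SOURCE(
        __kernel void matvec(__global const float* A,
                             __global const float* x,
                             __global float* y,
                             const uint n)
        {
            const uint row = get_global_id(0);
            float sum = 0.0f;
            for (uint col = 0; col < n; ++col)
                sum += A[row * n + col] * x[col];
            y[row] = sum;
        }
    );

    compute::program program =
        compute::program::build_with_source(source, context);
    compute::kernel kernel = program.create_kernel("matvec");
    kernel.set_arg(0, d_A.get_buffer());
    kernel.set_arg(1, d_x.get_buffer());
    kernel.set_arg(2, d_y.get_buffer());
    kernel.set_arg(3, static_cast<compute::uint_>(n));

    // Launch n work-items; let the runtime pick the work-group size.
    queue.enqueue_1d_range_kernel(kernel, 0, n, 0);

    // Copy the result vector back to the host.
    compute::copy(d_y.begin(), d_y.end(), h_y.begin(), queue);
    return 0;
}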