I will need to experiment more with both of these libraries to get a better sense of which one is the better fit. The preliminary idea is to split responsibilities between the NN library and uBlas/Boost.Compute, such that the NN library defines an interface and familiar abstractions in the NN domain, while uBlas/Boost.Compute serve as the core computation engine. If this idea works out as I hope, we can set aside the discussion of hardware support, because it will come with the underlying compute engine, and we can focus more on the convenience of the interface and abstractions that an NN library can provide for easier use of ML elements.
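To make that split more concrete, here is a minimal sketch of what I have in mind, shown with a uBlas backend; the names (dense_layer, ublas_backend, multiply_add) are just placeholders, and a Boost.Compute backend would simply provide the same small interface:

#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/vector.hpp>
#include <utility>

namespace ublas = boost::numeric::ublas;

// NN-domain vocabulary (layer, forward pass) lives here; all numeric work
// is delegated to whatever backend is plugged in.
template <typename Backend>
class dense_layer
{
public:
    using matrix = typename Backend::matrix;
    using vector = typename Backend::vector;

    dense_layer(matrix weights, vector bias)
        : weights_(std::move(weights)), bias_(std::move(bias)) {}

    // Forward pass: y = W * x + b, computed entirely by the backend.
    vector forward(const vector& x) const
    {
        return Backend::multiply_add(weights_, x, bias_);
    }

private:
    matrix weights_;
    vector bias_;
};

// uBlas-based backend; a Boost.Compute-based one would expose the same members.
struct ublas_backend
{
    using matrix = ublas::matrix<float>;
    using vector = ublas::vector<float>;

    static vector multiply_add(const matrix& W, const vector& x, const vector& b)
    {
        return ublas::prod(W, x) + b;
    }
};

int main()
{
    ublas::matrix<float> W(2, 3, 0.5f);      // 2x3 weights
    ublas::vector<float> x(3, 1.0f), b(2, 0.1f);

    dense_layer<ublas_backend> layer(W, b);
    ublas::vector<float> y = layer.forward(x); // y = W*x + b
    return 0;
}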
Boost.Compute + OpenCL extensions to leverage the hardware definitely looks like the right path forward and would be a useful addition to this library. It would require a careful selection of OpenCL kernels for optimal speed, which was obvious from a simple test with different implementations of matrix * vector that I ran on the few OpenCL devices available on my computer. To my surprise, the plain C++ version outperformed my GPU, and I got a nice speedup from the OpenCL implementation on the CPU with a simplistic kernel (a sketch of such a kernel follows the results below). I must have a very old and slow GPU. These are the raw test results for a 4096 x 4096 matrix, in case anybody is interested.

Best regards,
Sergei Marchenko

OpenCL Platform: 'ATI Stream' (vendor: Advanced Micro Devices, Inc.)
  Devices:
    Device: 'Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz' (version: 2.0) (type: CPU)
    Device: 'Toucan' (version: CAL 1.4.1848) (type: GPU)
  Extensions: cl_khr_icd cl_amd_event_callback cl_khr_d3d10_sharing

OpenCL Platform: 'AMD Accelerated Parallel Processing' (vendor: Advanced Micro Devices, Inc.)
  Devices:
    Device: 'Turks' (version: 1800.11 (VM)) (type: GPU)
    Device: 'Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz' (version: 1800.11 (sse2,avx)) (type: CPU)
  Extensions: cl_khr_icd cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_amd_event_callback cl_amd_offline_devices

Test Device: 'Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz' (version: 2.0) (type: CPU)
  Testing matrix * vector (map+reduce kernels):
    Map Elapsed: 45999200 ns,    Map BandWidth: 1.50486 GB/s
    Reduce Elapsed: 170500 ns,   Reduce BandWidth: 12.3961 GB/s
    Elapsed: 54297900 ns,        BandWidth: 2.47188 GB/s
  Testing matrix * vector (naive kernel):
    Elapsed: 5341900 ns,         BandWidth: 12.5689 GB/s
  Testing matrix * vector (Boost.Compute algorithms):
    Elapsed: 724216500 ns,       BandWidth: 0.185351 GB/s
  Testing matrix * vector (plain C++):
    Elapsed: 17725800 ns,        BandWidth: 3.78779 GB/s

Test Device: 'Toucan' (version: CAL 1.4.1848) (type: GPU)
  Testing matrix * vector (map+reduce kernels):
    Map Elapsed: 490535376 ns,   Map BandWidth: 0.141116 GB/s
    Reduce Elapsed: 2236373 ns,  Reduce BandWidth: 0.945073 GB/s
    Elapsed: 602027100 ns,       BandWidth: 0.222943 GB/s
  Testing matrix * vector (naive kernel):
    Elapsed: 170503700 ns,       BandWidth: 0.393784 GB/s
  Testing matrix * vector (Boost.Compute algorithms):
    Elapsed: 6837179400 ns,      BandWidth: 0.019633 GB/s
  Testing matrix * vector (plain C++):
    Elapsed: 17901600 ns,        BandWidth: 3.75059 GB/s

Test Device: 'Turks' (version: 1800.11 (VM)) (type: GPU)
  Testing matrix * vector (map+reduce kernels):
    Map Elapsed: 222894000 ns,   Map BandWidth: 0.310562 GB/s
    Reduce Elapsed: 5166778 ns,  Reduce BandWidth: 0.409063 GB/s
    Elapsed: 248867100 ns,       BandWidth: 0.539315 GB/s
  Testing matrix * vector (naive kernel):
    Elapsed: 156637700 ns,       BandWidth: 0.428643 GB/s
  Testing matrix * vector (Boost.Compute algorithms):
    Elapsed: 2145102000 ns,      BandWidth: 0.062577 GB/s
  Testing matrix * vector (plain C++):
    Elapsed: 17918300 ns,        BandWidth: 3.7471 GB/s

Test Device: 'Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz' (version: 1800.11 (sse2,avx)) (type: CPU)
  Testing matrix * vector (map+reduce kernels):
    Map Elapsed: 37620700 ns,    Map BandWidth: 1.84001 GB/s
    Reduce Elapsed: 245500 ns,   Reduce BandWidth: 8.60911 GB/s
    Elapsed: 43919500 ns,        BandWidth: 3.05599 GB/s
  Testing matrix * vector (naive kernel):
    Elapsed: 5410200 ns,         BandWidth: 12.4102 GB/s
  Testing matrix * vector (Boost.Compute algorithms):
    Elapsed: 641987200 ns,       BandWidth: 0.209092 GB/s
  Testing matrix * vector (plain C++):
    Elapsed: 17944000 ns,        BandWidth: 3.74173 GB/s
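For anyone curious what the "naive kernel" case amounts to: one work-item per output row, each accumulating a full dot product. Below is a minimal sketch of that approach using Boost.Compute to drive the kernel; it is an illustrative reconstruction rather than the exact test code, and the kernel name matvec, the fill values, and the work sizes are just placeholders:

#include <boost/compute/core.hpp>
#include <boost/compute/container/vector.hpp>
#include <boost/compute/algorithm/copy.hpp>
#include <boost/compute/utility/source.hpp>
#include <vector>

namespace compute = boost::compute;

int main()
{
    const std::size_t n = 4096;

    compute::device device = compute::system::default_device();
    compute::context context(device);
    compute::command_queue queue(context, device);

    // Host-side data: n x n matrix (row-major) and input vector.
    std::vector<float> h_A(n * n, 1.0f), h_x(n, 1.0f), h_y(n);

    // Device-side buffers.
    compute::vector<float> d_A(h_A.begin(), h_A.end(), queue);
    compute::vector<float> d_x(h_x.begin(), h_x.end(), queue);
    compute::vector<float> d_y(n, context);

    // Naive kernel: one work-item per row, serial loop over columns.
    const char source[] = BOOST_COMPUTE_STRINGIZE_SOURCE(
        __kernel void matvec(__global const float* A,
                             __global const float* x,
                             __global float* y,
                             const uint n)
        {
            const uint row = get_global_id(0);
            float sum = 0.0f;
            for (uint col = 0; col < n; ++col)
                sum += A[row * n + col] * x[col];
            y[row] = sum;
        }
    );

    compute::program program =
        compute::program::build_with_source(source, context);
    compute::kernel kernel = program.create_kernel("matvec");
    kernel.set_arg(0, d_A.get_buffer());
    kernel.set_arg(1, d_x.get_buffer());
    kernel.set_arg(2, d_y.get_buffer());
    kernel.set_arg(3, static_cast<compute::uint_>(n));

    // Launch n work-items; let the runtime pick the work-group size.
    queue.enqueue_1d_range_kernel(kernel, 0, n, 0);

    // Copy the result vector back to the host.
    compute::copy(d_y.begin(), d_y.end(), h_y.begin(), queue);
    return 0;
}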