I think Python also supports a wide variety of hardware. You are right, of course, that it would be rather awkward for an existing C++ application to call into Python for its ML tasks; a native C++ library to do the job is preferable. But it is not strictly required: both leading ML frameworks (TensorFlow and PyTorch) offer a C++ API for most, if not all, operations, at least the simple ones (working with tensors and layers). I am not sure about your argument regarding small data and/or model sizes; in most cases you want to train neural nets on large amounts of data. Could you add generic GPU support with Boost.Compute? https://www.boost.org/doc/libs/1_75_0/libs/compute/doc/html/index.html
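For reference, a minimal sketch of what generic GPU offload via Boost.Compute could look like for an element-wise activation. This assumes an OpenCL device is available and uses only documented Boost.Compute calls; I believe the tanh functional lives in boost/compute/functional/math.hpp, but treat the whole thing as a sketch rather than a tested integration:

    #include <vector>
    #include <boost/compute/core.hpp>
    #include <boost/compute/algorithm/copy.hpp>
    #include <boost/compute/algorithm/transform.hpp>
    #include <boost/compute/container/vector.hpp>
    #include <boost/compute/functional/math.hpp>

    namespace compute = boost::compute;

    int main()
    {
        // Pick the default OpenCL device (a GPU, if one is available).
        compute::device device = compute::system::default_device();
        compute::context context(device);
        compute::command_queue queue(context, device);

        // Host-side activations.
        std::vector<float> host(1024, 0.5f);

        // Copy to the device.
        compute::vector<float> dev(host.size(), context);
        compute::copy(host.begin(), host.end(), dev.begin(), queue);

        // Apply an element-wise tanh activation on the device.
        compute::transform(dev.begin(), dev.end(), dev.begin(),
                           compute::tanh<float>(), queue);

        // Copy the result back to the host.
        compute::copy(dev.begin(), dev.end(), host.begin(), queue);
        return 0;
    }

The appeal is that the same code runs on any OpenCL-capable device, so a small library would not need vendor-specific back ends.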
To be more specific, the example application in my GitHub repo for the MNIST digits dataset produces a model that can be trained to a 95% success rate in about 10-15 minutes on a single CPU core. While the example is somewhat synthetic, it is still representative of a wide variety of scenarios where input from a sensor or a small image can be inspected by an NN component. Another application (not shown on GitHub) was a tiny model to estimate the cost (response time) of a web service API call, given a small set of parameters such as the user identity, API method, and payload size; it was re-trained on every start of the web service and used to predict resource consumption by different callers for load balancing and throttling purposes. Those are good niche applications, I think.
Some more questions:
Are you building the network at compile time or at run time? From your examples it looks like compile time. I think your library should offer both. Building the network at compile time may give some speed benefits, as it can gain from compiler optimisations, but it requires re-compilation to change the network itself. Building the network at run time means you can change the network without re-compiling, which is useful, for example, when you want to read the network configuration (not only its weights) from a configuration file at run time.
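To illustrate the trade-off, here is a hypothetical sketch (the names Sequence, Dense, Network, etc. are invented for illustration, not the proposed library's API). The compile-time variant bakes the topology into the type; the run-time variant treats it as data:

    // Compile-time: the topology is part of the type, so the compiler can
    // inline and optimize across layers, but changing it means recompiling.
    using Net = Sequence<Dense<784, 128>, ReLU, Dense<128, 10>, SoftMax>;

    // Run-time: the topology is data, e.g. read from a configuration file,
    // at the cost of virtual dispatch between layers.
    Network net;
    net.add(std::make_unique<DenseLayer>(784, 128));
    net.add(std::make_unique<ReLULayer>());
    net.add(std::make_unique<DenseLayer>(128, 10));
    net.add(std::make_unique<SoftMaxLayer>());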
Smallish networks are certainly a niche; if you want to do anything serious you won't be able to beat TF/PyTorch in performance. So keeping this focused on small, static (i.e. compile-time) models with only the basic layers, and maybe even with training optional (dropping training avoids the need for auto-differentiation), could be the way. Compile-time models keep the focus on the usage of ML rather than model development, and allow the compiler optimizations, which are very important for small models, to kick in. However, I fear this is not a fit for Boost. ML evolves so fast, adding more and more layer types etc., that I fear this library would already be outdated during review. The only chance I see is if it is purposely limited to very basic networks, i.e. FullyConnected, Convolution, SoftMax and similar basic layers, maybe with an extension providing ElementWise and BinaryOp layers templated on the operator (though that may be problematic for auto-differentiation). Reusing what we have (uBlas, Boost.Compute) might be a good idea too.
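As an illustration of the ElementWise idea (hypothetical names, not a proposed interface), a layer templated on its operator could look like the sketch below; the df member hints at why auto-differentiation is the tricky part, since for an arbitrary user-supplied operator the derivative cannot be derived automatically and would have to be supplied by hand:

    #include <cmath>
    #include <vector>

    // Hypothetical sketch: an element-wise layer templated on its operator.
    template <class Op>
    struct ElementWise
    {
        std::vector<float> forward(std::vector<float> x) const
        {
            for (float& v : x)
                v = Op::f(v);
            return x;
        }
        // Training would also need Op::df(v); there is no way to derive it
        // automatically for an arbitrary Op, hence the concern above.
    };

    struct Tanh
    {
        static float f(float v)  { return std::tanh(v); }
        static float df(float v) { float t = std::tanh(v); return 1.0f - t * t; }
    };

    using TanhLayer = ElementWise<Tanh>;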