Hi Thijs,

On 5/5/16 4:36 PM, Thijs van den Berg wrote:
> Hi Hans,
> Interesting ideas. I have some algorithmic questions: I'd like to learn about the details behind the "just works" friendly objective so that I can decide whether it will work for me or not, and under what circumstances. One reason I sometimes pick C++ instead of Python is performance, especially when I need to handle large datasets. In those cases the details often matter. So, if I were going to consider using it, it would be helpful to see performance metrics, e.g. compared to some naive alternative.

I have a simple benchmark comparison against three classes in the ROOT framework, comparing the performance on 1-dimensional, 3-dimensional, and 6-dimensional data. I will add the performance results to the documentation. I tested the benchmark on two different computers and got qualitatively different results, so I cannot draw a general conclusion; the speed is roughly similar. If you would like to try it yourself, you can check out the code, activate the CMake option BUILD_CHECKS, and run the executable "nhistogram_speed" generated in the build directory. I could, and probably should, also do a comparison against numpy.histogram. If you have a particular kind of benchmark in mind, let me know, perhaps I can implement it.

It is probably impossible to beat a specialized histogram type made exclusively for 1d-data and a particular binning strategy, because a dynamic solution like mine has additional overhead related to the ability to define the binning algorithm and the number of dimensions at run-time. Maybe an expert in this community sees a way to make it faster. That being said, the performance gap is not big, which was explicitly one of the design goals. I think that in most cases more CPU cycles are spent generating or reading the data that is filled into the histogram than are used by the histogram itself during binning and counting.
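To make the idea of a "naive alternative" concrete, here is a minimal sketch of the kind of baseline I would compare against (not library code; sizes and names are only illustrative). The library's fill call would be timed with the same loop over the same pre-generated data, so the two numbers are directly comparable:

  // Naive fixed-binning 1d baseline: only the fill loop is timed,
  // the input data is generated outside the timed region.
  #include <chrono>
  #include <cstdint>
  #include <cstdio>
  #include <random>
  #include <vector>

  int main() {
      const std::size_t n_fill = 10000000; // illustrative sample size
      const int n_bins = 100;
      const double lo = 0.0, hi = 1.0;
      const double inv_width = n_bins / (hi - lo);

      std::mt19937_64 rng(42);
      std::uniform_real_distribution<double> dist(lo, hi);
      std::vector<double> data(n_fill);
      for (auto& x : data) x = dist(rng);

      std::vector<std::uint32_t> counts(n_bins, 0);
      const auto t0 = std::chrono::steady_clock::now();
      for (double x : data) {
          // binning strategy fixed at compile time, no run-time dispatch
          const int bin = static_cast<int>((x - lo) * inv_width);
          if (bin >= 0 && bin < n_bins) ++counts[bin];
      }
      const auto t1 = std::chrono::steady_clock::now();
      std::printf("naive 1d fill: %.3f s\n",
                  std::chrono::duration<double>(t1 - t0).count());
  }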
> I've read that you compute variance: can that computation be switched on/off (e.g. I might not need it)? Also, there are various online (single pass, weighted) variance algorithms: some are stable, others not. Which one have you implemented? Does it use std::accumulate? It would be nice to reassure numerically focused users about the level of quality of the internals.

I don't use std::accumulate, since it does not fit into my scheme. If you do not fill the histogram with weighted events, there is no overhead involved for the variance estimate. If you fill the histogram with normal data, without using weights, the variance is computed on the fly when the user requests it via histogram::variance(...). When no weights are involved, the variance estimate per bin is taken to be equal to the count in that bin, a common estimate based on Poisson theory. When weights were used during the filling, the variance estimate is the sum of squared weights. Storing the sum of squared weights for each bin requires twice the memory. I did not implement an option to switch that off, since in a statistical analysis the variance estimate is as important as the actual count. I think it is safe to assume that if you have the special case of weighted data, you also want a variance estimate. I will put more details on these things into the Notes section of the documentation.
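The bookkeeping amounts to the following sketch (the actual storage layout and names in the library differ; this only illustrates the rule):

  // Per-bin bookkeeping for weighted fills (illustrative names only).
  struct weighted_bin {
      double sum_w  = 0.0; // the "count": sum of weights
      double sum_w2 = 0.0; // sum of squared weights
      void fill(double w = 1.0) { sum_w += w; sum_w2 += w * w; }
  };

  // Variance estimate per bin:
  //  - unweighted fills: variance == count (Poisson), nothing extra stored;
  //  - weighted fills:   variance == sum of squared weights.
  inline double variance(const weighted_bin& b) { return b.sum_w2; }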
> I would also like to see information about the computational and memory complexity of two other internal algorithms I think I saw mentioned:
>
> 1) automatic re-binning: when you modify bins, do you split a single bin, or do you readjust *all* bin boundaries? Do you keep a sorted list inside each bin?

I think there is a misunderstanding. There is no automatic re-binning in the sense that the number of bins along each axis of the histogram is changed. The number of bins is always the same; bins are not split. What grows is the size of the integer used to store a bin count. I start with 1 byte for each bin, which covers counts from 0 to 255. Once you try to fill a bin that already has 255 counts, the internal memory for *all* bins is re-allocated and replaced by a memory block that uses 2 bytes for each bin. In addition to the memory reallocation, this involves an O(N) copy that is done in place (N is the number of bins). The procedure is repeated if any of the bins exceeds its new storage maximum of 65535, and so on. Since the reallocation is done for all bins at once, this overhead does not occur very often and does not introduce a significant performance hit.

I considered more elaborate storage strategies where the number of bytes could differ for each bin, but all those that I could think of would significantly decrease performance and might actually increase the memory footprint for really large histograms. I can store an integer in 1 byte, but I already need 8 bytes for a pointer, even if the pointer points nowhere.
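In code, the grow-on-overflow counter storage looks roughly like this (a simplified sketch with made-up names; the real implementation is more general and performs the copy in place rather than via a second buffer):

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  class counter_storage {
      std::vector<std::uint8_t> bytes_; // depth_ bytes per bin
      std::size_t nbins_;
      unsigned depth_ = 1;              // bytes per bin: 1, 2, 4, 8

  public:
      explicit counter_storage(std::size_t nbins)
          : bytes_(nbins, 0), nbins_(nbins) {}

      void increment(std::size_t bin) {
          // read the current count of this bin
          std::uint64_t c = 0;
          for (unsigned k = 0; k < depth_; ++k)
              c |= std::uint64_t(bytes_[bin * depth_ + k]) << (8 * k);
          // if it would overflow, grow the storage of *all* bins: O(N), rare
          if (depth_ < 8 && c + 1 >= (std::uint64_t(1) << (8 * depth_)))
              grow();
          ++c;
          for (unsigned k = 0; k < depth_; ++k)
              bytes_[bin * depth_ + k] = std::uint8_t(c >> (8 * k));
      }

  private:
      void grow() {
          const unsigned new_depth = 2 * depth_;
          std::vector<std::uint8_t> wide(nbins_ * new_depth, 0);
          for (std::size_t i = 0; i < nbins_; ++i)
              for (unsigned k = 0; k < depth_; ++k)
                  wide[i * new_depth + k] = bytes_[i * depth_ + k];
          bytes_.swap(wide);
          depth_ = new_depth;
      }
  };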
> 2) sparse storage: ... I know this is a complex field where lots of trade-offs can be made. E.g. suppose I fill a 10-dimensional histogram with samples that (only) have elements on a diagonal, a potential worst-case scenario for some methods:
>
> for (int i : {1, 2, 3, 4, 5}) h.fill([i,i,i,i,i,i,i,i,i,i])
>
> Would this result in 5 sparse bins (the bins on the diagonal), or 5^10 bins (the outer product of ten axes, each with 5 bins)?

This histogram implements a dense storage strategy, for the sake of performance. Writing a histogram with sparse storage is not my design goal. However, there is a base class which handles the binning and the axis types, which a sparse histogram could re-use.
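To put rough numbers on your example (a back-of-the-envelope sketch, not library code): with dense storage the full outer product of bins is allocated regardless of which bins are actually touched, while a sparse container would hold only the five diagonal bins.

  #include <array>
  #include <cmath>
  #include <cstdio>
  #include <map>

  int main() {
      const int dims = 10, bins_per_axis = 5;
      const double dense_bins = std::pow(bins_per_axis, dims); // 5^10 = 9,765,625
      std::printf("dense:  %.0f bins, about %.1f MB at 1 byte per bin\n",
                  dense_bins, dense_bins / 1e6);

      // A sparse alternative, e.g. keyed by the tuple of bin indices,
      // would store only the bins that were actually filled.
      std::map<std::array<int, 10>, unsigned> sparse;
      for (int i : {1, 2, 3, 4, 5}) {
          std::array<int, 10> idx;
          idx.fill(i);
          ++sparse[idx];
      }
      std::printf("sparse: %zu bins\n", sparse.size());
  }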
Best regards,
Hans