Hi Thijs,

On 5/5/16 4:36 PM, Thijs van den Berg wrote:
> Hi Hans,
> Interesting ideas. I have some algorithmic questions: I'd like to learn about the details behind the "just works" friendly objective so that I can decide whether it will work for me or not, and under what circumstances. One reason I sometimes pick C++ instead of Python is performance, especially when I need to handle large datasets. In those cases the details often matter. So, if I were going to consider using it, it would be helpful to see performance metrics, e.g. compared to some naive alternative.

I have a simple benchmark comparison against three classes in the ROOT framework, comparing the performance on 1-dimensional, 3-dimensional, and 6-dimensional data. I will add the performance results to the documentation. I tested the benchmark on two different computers and got qualitatively different results, so I cannot draw a general conclusion; the speed is roughly similar. If you would like to try it yourself, you can check out the code, activate the CMake option BUILD_CHECKS, and run the executable "nhistogram_speed" generated in the build directory. I could, and probably should, also do a comparison against numpy.histogram. If you have a particular kind of benchmark in mind, let me know, perhaps I can implement it.

It is probably impossible to beat a specialized histogram type made exclusively for 1d-data and a particular binning strategy, because a dynamic solution like mine has additional overhead related to the ability to define the binning algorithm and the number of dimensions at run-time. Maybe an expert in this community sees a way to make it faster. That being said, the performance gap is not big, which was explicitly one of the design goals. I think that in most cases more CPU cycles are spent generating or reading the data that is filled into the histogram than are used by the histogram itself during binning and counting.
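To make the idea of a "naive alternative" concrete, here is a minimal sketch of the kind of baseline I would compare against (not library code; sizes and names are only illustrative). The library's fill call would be timed with the same loop over the same pre-generated data, so the two numbers are directly comparable:

  // Naive fixed-binning 1d baseline: only the fill loop is timed,
  // the input data is generated outside the timed region.
  #include <chrono>
  #include <cstdint>
  #include <cstdio>
  #include <random>
  #include <vector>

  int main() {
      const std::size_t n_fill = 10000000; // illustrative sample size
      const int n_bins = 100;
      const double lo = 0.0, hi = 1.0;
      const double inv_width = n_bins / (hi - lo);

      std::mt19937_64 rng(42);
      std::uniform_real_distribution<double> dist(lo, hi);
      std::vector<double> data(n_fill);
      for (auto& x : data) x = dist(rng);

      std::vector<std::uint32_t> counts(n_bins, 0);
      const auto t0 = std::chrono::steady_clock::now();
      for (double x : data) {
          // binning strategy fixed at compile time, no run-time dispatch
          const int bin = static_cast<int>((x - lo) * inv_width);
          if (bin >= 0 && bin < n_bins) ++counts[bin];
      }
      const auto t1 = std::chrono::steady_clock::now();
      std::printf("naive 1d fill: %.3f s\n",
                  std::chrono::duration<double>(t1 - t0).count());
  }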
> I've read that you compute variance: can that computation be switched on/off (e.g. I might not need it)? Also, there are various online (single pass, weighted) variance algorithms: some are stable, others not. Which one have you implemented? Does it use std::accumulate? It would be nice to reassure numerically focused users about the level of quality of the internals.

I don't use std::accumulate, since it does not fit into my scheme. If you do not fill the histogram with weighted events, there is no overhead involved for the variance estimate. If you fill the histogram with normal data, without using weights, the variance is computed on the fly when the user requests it via histogram::variance(...). When no weights are involved, the variance estimate per bin is taken to be equal to the count in that bin, a common estimate based on Poisson theory. When weights were used during the filling, the variance estimate is the sum of squared weights. Storing the sum of squared weights for each bin requires twice the memory. I did not implement an option to switch that off, since in a statistical analysis the variance estimate is as important as the actual count. I think it is safe to assume that if you have the special case of weighted data, you also want a variance estimate. I will put more details on these things into the Notes section of the documentation.
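The bookkeeping amounts to the following sketch (the actual storage layout and names in the library differ; this only illustrates the rule):

  // Per-bin bookkeeping for weighted fills (illustrative names only).
  struct weighted_bin {
      double sum_w  = 0.0; // the "count": sum of weights
      double sum_w2 = 0.0; // sum of squared weights
      void fill(double w = 1.0) { sum_w += w; sum_w2 += w * w; }
  };

  // Variance estimate per bin:
  //  - unweighted fills: variance == count (Poisson), nothing extra stored;
  //  - weighted fills:   variance == sum of squared weights.
  inline double variance(const weighted_bin& b) { return b.sum_w2; }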
> I would also like to see information about the computational and memory complexity of two other internal algorithms I think I saw mentioned:
>
> 1) automatic re-binning: when you modify bins, do you split a single bin, or do you readjust *all* bin boundaries? Do you keep a sorted list inside each bin?

I think there is a misunderstanding. There is no automatic re-binning in the sense that the number of bins along each axis of the histogram is changed. The number of bins is always the same; bins are not split. What grows is the size of the integer used to store a bin count. I start with 1 byte for each bin, which covers counts from 0 to 255. Once you try to fill a bin that already has 255 counts, the internal memory for *all* bins is re-allocated and replaced by a memory block that uses 2 bytes for each bin. In addition to the memory reallocation, this involves an O(N) copy that is done in place (N is the number of bins). The procedure is repeated if any of the bins exceeds its new storage maximum of 65535, and so on. Since the reallocation is done for all bins at once, this overhead does not occur very often and does not introduce a significant performance hit.

I considered more elaborate storage strategies where the number of bytes could differ for each bin, but all those that I could think of would significantly decrease performance and might actually increase the memory footprint for really large histograms. I can store an integer in 1 byte, but I already need 8 bytes for a pointer, even if the pointer points nowhere.
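In code, the grow-on-overflow counter storage looks roughly like this (a simplified sketch with made-up names; the real implementation is more general and performs the copy in place rather than via a second buffer):

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  class counter_storage {
      std::vector<std::uint8_t> bytes_; // depth_ bytes per bin
      std::size_t nbins_;
      unsigned depth_ = 1;              // bytes per bin: 1, 2, 4, 8

  public:
      explicit counter_storage(std::size_t nbins)
          : bytes_(nbins, 0), nbins_(nbins) {}

      void increment(std::size_t bin) {
          // read the current count of this bin
          std::uint64_t c = 0;
          for (unsigned k = 0; k < depth_; ++k)
              c |= std::uint64_t(bytes_[bin * depth_ + k]) << (8 * k);
          // if it would overflow, grow the storage of *all* bins: O(N), rare
          if (depth_ < 8 && c + 1 >= (std::uint64_t(1) << (8 * depth_)))
              grow();
          ++c;
          for (unsigned k = 0; k < depth_; ++k)
              bytes_[bin * depth_ + k] = std::uint8_t(c >> (8 * k));
      }

  private:
      void grow() {
          const unsigned new_depth = 2 * depth_;
          std::vector<std::uint8_t> wide(nbins_ * new_depth, 0);
          for (std::size_t i = 0; i < nbins_; ++i)
              for (unsigned k = 0; k < depth_; ++k)
                  wide[i * new_depth + k] = bytes_[i * depth_ + k];
          bytes_.swap(wide);
          depth_ = new_depth;
      }
  };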
> 2) sparse storage: ... I know this is a complex field where lots of trade-offs can be made. E.g. suppose I fill a 10-dimensional histogram with samples that (only) have elements on a diagonal, a potential worst-case scenario for some methods:
>
> for (int i : {1, 2, 3, 4, 5}) h.fill([i,i,i,i,i,i,i,i,i,i])
>
> Would this result in 5 sparse bins (the bins on the diagonal), or 5^10 bins (the outer product of ten axes, each with 5 bins)?

This histogram implements a dense storage strategy, for the sake of performance. Writing a histogram with sparse storage is not my design goal. However, there is a base class which handles the binning and the axis types, which a sparse histogram could re-use.
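To put rough numbers on your example (a back-of-the-envelope sketch, not library code): with dense storage the full outer product of bins is allocated regardless of which bins are actually touched, while a sparse container would hold only the five diagonal bins.

  #include <array>
  #include <cmath>
  #include <cstdio>
  #include <map>

  int main() {
      const int dims = 10, bins_per_axis = 5;
      const double dense_bins = std::pow(bins_per_axis, dims); // 5^10 = 9,765,625
      std::printf("dense:  %.0f bins, about %.1f MB at 1 byte per bin\n",
                  dense_bins, dense_bins / 1e6);

      // A sparse alternative, e.g. keyed by the tuple of bin indices,
      // would store only the bins that were actually filled.
      std::map<std::array<int, 10>, unsigned> sparse;
      for (int i : {1, 2, 3, 4, 5}) {
          std::array<int, 10> idx;
          idx.fill(i);
          ++sparse[idx];
      }
      std::printf("sparse: %zu bins\n", sparse.size());
  }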
Best regards,
Hans