On 5 May 2016 at 00:21, Hans Dembinski wrote:
Hi everybody,
I recently added a new library called "histogram" to the Boost Incubator. I would like to advertise it a little here in the hope of finding someone interested in reviewing it. I hope that shameless self-advertisement does not go against some rule of this list, but I am sure you will let me know.
My background is in the analysis of big data in the fields of particle physics and astroparticle physics. Boost is very popular among my peers, since it is a free, high-quality, rich, and very well maintained collection of libraries. There is a growing number of tools for statistical analysis in Boost, and I think this project would fit in nicely and fill a gap. We work with histograms a lot, so that's where my interest comes from.
I am a senior programmer in C++ and Python with 10 years of experience. Guiding development through code reviews and tickets, as well as taking on responsibility for continuous maintenance, comes naturally to me. I am of course willing to commit free time to maintain the project should it be accepted, and to do my share of the work in this community.
I put a lot of thought and effort into this project; the rationale and my design choices are explained in the documentation, which I wrote following the advice given on the Boost Incubator website. The project is feature complete from my side. What it needs now is input from the Boost community to smooth off any rough edges and to make the interface rich enough for everybody. I am good at considering the user perspective, but I cannot anticipate everyone's needs.
In case you are interested, here are the links:
Incubator link:
http://rrsd.com/blincubator.com/bi_library/histogram-2/?gform_post_id=1582
github link:
https://github.com/HDembinski/histogram
Best regards,
Hans
Hi Hans,

Interesting ideas. I have some algorithmic questions: I'd like to learn about the details behind the "just works" friendly objective, so that I can decide whether it will work for me or not, and under what circumstances.

One reason I sometimes pick C++ instead of Python is performance, especially when I need to handle large datasets. In those cases the details often matter. So, if I were going to consider using it, it would be helpful to see performance metrics, e.g. compared to some naive alternative.

I've read that you compute the variance: can that computation be switched on/off (e.g. I might not need it)? Also, there are various online (single-pass, weighted) variance algorithms: some are stable, others are not. Which one have you implemented? Does it use std::accumulate? It would be nice to reassure numerically focused users about the quality of the internals.

I would also like to see information about the computational and memory complexity of two other internal algorithms I think I saw mentioned:

1) automatic re-binning: when you modify bins, do you split a single bin, or do you readjust *all* bin boundaries? Do you keep a sorted list inside each bin?

2) sparse storage: I know this is a complex field where lots of trade-offs can be made. E.g. suppose I fill a 10-dimensional histogram with samples that (only) have elements on a diagonal, a potential worst-case scenario for some methods:

    for (int i : {1, 2, 3, 4, 5})
        h.fill({i, i, i, i, i, i, i, i, i, i});

Would this result in 5 sparse bins (the bins on the diagonal), or 5^10 bins (the outer product of ten axes, each with 5 bins)?

Thanks,
Thijs
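To make the stability question concrete: a naive single-pass variance (accumulating sums of x and x^2) can lose precision badly, while a Welford/West-style update stays numerically stable. The sketch below shows a weighted variant of the stable update; it is only an illustration of the kind of algorithm being asked about, not the implementation used in the histogram library.

    #include <iostream>

    // Weighted single-pass mean/variance using a Welford/West-style update.
    // Illustrative sketch only; not taken from the library under discussion.
    struct RunningStats {
        double sum_w = 0.0;  // sum of weights
        double mean = 0.0;   // running weighted mean
        double m2 = 0.0;     // sum of weighted squared deviations

        void fill(double x, double w = 1.0) {
            sum_w += w;
            const double delta = x - mean;
            mean += (w / sum_w) * delta;
            m2 += w * delta * (x - mean);  // mixes old and new mean: stable
        }

        double variance() const { return sum_w > 0 ? m2 / sum_w : 0.0; }
    };

    int main() {
        RunningStats s;
        for (double x : {1.0, 2.0, 3.0, 4.0, 5.0}) s.fill(x);
        std::cout << s.mean << " " << s.variance() << "\n";  // prints: 3 2
    }

The sparse-storage question can be illustrated the same way: a store keyed by the multi-dimensional bin index only allocates bins that are actually hit, so the diagonal fill above would create 5 entries rather than 5^10. Again, this is a sketch of the trade-off being asked about, not a statement about how the library stores its bins.

    #include <array>
    #include <cstdio>
    #include <map>

    // Sparse counter for a 10-dimensional histogram: only occupied bins
    // get an entry. Purely illustrative.
    using Index = std::array<int, 10>;

    int main() {
        std::map<Index, double> counts;  // ordered map, no custom hash needed
        for (int i : {1, 2, 3, 4, 5}) {
            Index idx;
            idx.fill(i);       // the diagonal bin (i, i, ..., i)
            counts[idx] += 1.0;
        }
        // 5 occupied bins, not 5^10 = 9765625
        std::printf("occupied bins: %zu\n", counts.size());
    }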