Boost.Histogram contains useful and well-designed features, like excess
(over/underflow) bins, and axis transforms to name a few.
As this is a review, I am going to spent most time below on critical
scrutiny.
I. DESIGN
---------
Iterators are const. This prevents us from bootstrapping the histogram
with a prior distribution using STL algorithms like std::fill(),
std::generate(), or std::sample().
The adaptive_storage does not work well with STL algorithms, because
weight_counter is not an arithmetic type. See the Implementation section
below for more detail. I therefore propose that the default storage
policy is changed to something without weight_counter.
The adaptive_storage has two responsibilities: data compaction and
weights. Would it be possible to split this into two separate storage
policies? I have no real use for arrival variance.
I am not too fond of using operator() for insertion. The code looks like
a function call with side-effect. I prefer an explicitly named member
function.
The axis::ouflow_type is oddly named. I suggest that this is renamed to
something like axis::excess_type.
II. IMPLEMENTATION
------------------
The implementation is generally of high quality. However, I did
encounter the three problem areas listed below.
(A) Integration with STL algorithms can be improved. Here are some
examples:
First, std::distance() does not work on a histogram with array_storage.
This means that other STL algorithms, like std::any_of(), fails to
compile. I solved this problem by copying the distance_to() function
from the axis iterator to the histogram iterator.
Second, std::max_element (and brethren) cannot be used on a histogram.
Consider the following example:
auto element = std::max_element(h.begin(), h.end());
Compilation fails for adaptive_storage because weight_counter has no
less-than operator. Furthermore, compilation fails for array_storage
because iterator_over<H>::operator= fails. It attempts to re-assign
a reference, but accidentally triggers a copy-constructor instead.
Letting the iterator store a pointer instead of a reference solves the
problem.
Apropos, iterator_over<H>::operator= does not return *this. Consider
adding the -Werror=return-type compiler flag and a unit-test to that
calls this operator.
Third, std::inner_product fails to compile for adaptive_storage because
weight_counter does not have a binary operator*. The inner product of
two histograms is useful for calculating the cosine similarity, which
can be used to compare two distributions. std::inner_product works for
array_storage though.
(B) Indexing and size use different types (int versus std::size_t.) I
assume that this is because of the underflow bin, which is indexed by
-1. I am not certain whether or not this is a real problem, but it does
cause some oddities when stressed to the limit. For example, if we need
a histogram that counts each unsigned int, then we get an "lower <
upper" exception during construction:
using range_type = unsigned int;
auto h = make_static_histogram_with(
array_storage<int>(),
axis::integer<>(0, std::numeric_limits
Dear Bjørn,
On 23. Sep 2018, at 16:26, Bjorn Reese via Boost
wrote: Boost.Histogram contains useful and well-designed features, like excess (over/underflow) bins, and axis transforms to name a few.
As this is a review, I am going to spent most time below on critical scrutiny. […]
thank you very much for the great review and the time spend with the library. Some discussion below.
I. DESIGN ---------
Iterators are const. This prevents us from bootstrapping the histogram with a prior distribution using STL algorithms like std::fill(), std::generate(), or std::sample().
Writable iterators could be added, but I am not convinced that is a good idea. The histogram iterators are const, because I cannot imagine a use-case where the histogram has to be bootstrapped with a prior distribution. Only providing const iterators avoids the misconception that a histogram is a container. Histogram can make stronger guarantees about its internal consistency, if users cannot set bin counters arbitrarily.
The adaptive_storage does not work well with STL algorithms, because weight_counter is not an arithmetic type. See the Implementation section below for more detail. I therefore propose that the default storage policy is changed to something without weight_counter.
There are two solutions, right? Turing weight_counter into a full arithmetic type or removing support for weight_counter in the default storage.
The adaptive_storage has two responsibilities: data compaction and weights. Would it be possible to split this into two separate storage policies? I have no real use for arrival variance.
Agreed, it should be a compile-time option for adaptive_storage. https://github.com/HDembinski/histogram/issues/75
I am not too fond of using operator() for insertion. The code looks like a function call with side-effect. I prefer an explicitly named member function.
Originally, I had a special method called fill(), then I switched to operator(). There are two arguments in favour of operator(): - a 1D histogram can be filled with std::for_each or std::accumulate - consistency with Boost.Accumulators, which also use operator()
The axis::ouflow_type is oddly named. I suggest that this is renamed to something like axis::excess_type.
I am ok with renaming, but I think we haven't found the best name yet. What do you think about axis::tails_type?
II. IMPLEMENTATION ————————— […]
Agreed, the missing bits will be implemented and unit-tested.
III. DOCUMENTATION ————————— […]
Agreed, documentation will be expanded in this way.
Boost.Histogram should be ACCEPTED into Boost, provided:
1. Reference documentation is finalized.
I furthermore strongly recommend that the default storage policy is changed to something without weight_counter because of the various problems with STL algorithms. This recommendation is not a prerequisite for acceptance.
Thanks!
On 09/24/18 23:52, Hans Dembinski wrote:
The histogram iterators are const, because I cannot imagine a use-case where the histogram has to be bootstrapped with a prior distribution. Only providing const iterators avoids the misconception that a histogram is a container. Histogram can make stronger guarantees about its internal consistency, if users cannot set bin counters arbitrarily.
If you are using histograms to collect data for later analysis, then there may indeed not be much use for a prior distribution. However, if you use histograms to make decisions on-the-fly, then you cannot always wait for the histogram to be filled up with data in order to make reliable decisions. In such cases a prior distribution can help. Admittedly, I have not done this with histograms, but I do use CDFs (well, quantile approximations) quite a lot for on-the-fly decision- making. The other use case is to restore the histogram with data from an earlier run, but serialization already covers that.
There are two solutions, right? Turing weight_counter into a full arithmetic type or removing support for weight_counter in the default storage.
True, but then I am going to play the zero-overhead card :)
Originally, I had a special method called fill(), then I switched to operator(). There are two arguments in favour of operator(): - a 1D histogram can be filled with std::for_each or std::accumulate - consistency with Boost.Accumulators, which also use operator()
Inserting data into the histogram with std::accumulate is odd. An alternative is to use std::copy (or std::transform) with std::inserter. That does requires a histogram::insert(iterator hint, value_type value) where the hint is completely useless... but that is what std::set is using.
I am ok with renaming, but I think we haven't found the best name yet. What do you think about axis::tails_type?
Seems like you settled on options in another thread. I am fine with that. Related to this, I also like the extent() over shape(). This name is also used by Boost.MultiArray and the std::mdspan proposal (P0009R7). In the name of consistency, you may also consider renaming dim() to rank().
On 30. Sep 2018, at 17:29, Bjorn Reese via Boost
wrote: On 09/24/18 23:52, Hans Dembinski wrote:
The histogram iterators are const, because I cannot imagine a use-case where the histogram has to be bootstrapped with a prior distribution. Only providing const iterators avoids the misconception that a histogram is a container. Histogram can make stronger guarantees about its internal consistency, if users cannot set bin counters arbitrarily.
If you are using histograms to collect data for later analysis, then there may indeed not be much use for a prior distribution. However, if you use histograms to make decisions on-the-fly, then you cannot always wait for the histogram to be filled up with data in order to make reliable decisions. In such cases a prior distribution can help.
Admittedly, I have not done this with histograms, but I do use CDFs (well, quantile approximations) quite a lot for on-the-fly decision- making.
The other use case is to restore the histogram with data from an earlier run, but serialization already covers that.
I plan to add a struct `unsafe_access` which allows unsafe raw access to the bin counters, like so auto& storage = unsafe_access::storage(some_histogram); This will be part of the public interface, but is targeted towards advanced users. It will be used by algorithms that transform histograms and by serialization code. I also want to support other serialization libraries than Boost.Serialization, and unsafe_access makes this easier. With `unsafe_access`, advanced users get full control, but the standard interface still protects from trivial usage errors. The docs of `unsafe_access` then basically say: if you use this, you are on your own. It could be used to set the histogram with a prior, which seems like an advanced use case.
There are two solutions, right? Turing weight_counter into a full arithmetic type or removing support for weight_counter in the default storage.
True, but then I am going to play the zero-overhead card :)
Fair enough. https://github.com/HDembinski/histogram/issues/75 The default value would then be without weight support.
Originally, I had a special method called fill(), then I switched to operator(). There are two arguments in favour of operator(): - a 1D histogram can be filled with std::for_each or std::accumulate - consistency with Boost.Accumulators, which also use operator()
Inserting data into the histogram with std::accumulate is odd.
I was wrong, std::accumulate uses += to fill, which is not supported.
An alternative is to use std::copy (or std::transform) with std::inserter. That does requires a histogram::insert(iterator hint, value_type value) where the hint is completely useless... but that is what std::set is using.
histogram is not a container, so it should not support std::inserter. histogram is a accumulator and accumulators have a functor design, as established by Boost.Accumulator. They eat samples and update their internal state, there is an implied transformative aspect. Inserting a value implies that you can exactly retrieve it later, which is not the case here.
I am ok with renaming, but I think we haven't found the best name yet. What do you think about axis::tails_type?
Seems like you settled on options in another thread. I am fine with that.
Related to this, I also like the extent() over shape(). This name is also used by Boost.MultiArray and the std::mdspan proposal (P0009R7). In the name of consistency, you may also consider renaming dim() to rank().
Good suggestion, thank you! https://github.com/HDembinski/histogram/issues/112 Best regards, Hans
participants (2)
-
Bjorn Reese
-
Hans Dembinski