[histogram] Working with standard algorithms
I would like to see a better integration between boost.histogram and standard algorithms, especially those from <numeric>. Today they do not work together because the dereference operator of the axis iterator returns a pair (index and content) rather than the content. An example of where standard algorithms could be useful is calculating the cumulative distribution of a one-dimensional histogram, which ought to be a simple as: auto& axis = h.axis(); std:partial_sum(axis.begin(), axis.end(), axis.begin()); Another example is ranking a set of histograms based on the cosine similarity [1] using std::inner_product(). [1] https://en.wikipedia.org/wiki/Cosine_similarity
Dear Bjørn,
On 18. Nov 2017, at 17:08, Bjorn Reese via Boost
wrote: I would like to see a better integration between boost.histogram and standard algorithms, especially those from <numeric>. Today they do not work together because the dereference operator of the axis iterator returns a pair (index and content) rather than the content.
An example of where standard algorithms could be useful is calculating the cumulative distribution of a one-dimensional histogram, which ought to be a simple as:
auto& axis = h.axis(); std:partial_sum(axis.begin(), axis.end(), axis.begin());
Another example is ranking a set of histograms based on the cosine similarity [1] using std::inner_product().
I totally see the point for partial_sum. Could you also provide a code example for how you would like to compute the cosine similarity? I think we need several iterators then. The current axis iterators iterate over the bins definition, not really over the bin content of the histogram itself. I chose the current iterators in this way, because I believe this scheme generalises well to the multi-dimensional case. I wouldn't replace these iterators, because they seem useful to me, but additional iterators that iterate over the bin values and play well with algorithms are also clearly needed. How should we define such iterators so that they generalise to multiple dimensions? For example, it would be easy to implement new iterators which just iterates over the bin content: auto h = make_static_histogram(axis::regular(3, 0, 1)); // fill […], then std::vector<double> cumulative; std::partial_sum(h.begin(), h.end(), cumulative.begin(), std::back_inserter(cumulative)); Questions: - What should this do if h is 2d? The same? If yes, in which order should the 2d array be iterated over? The current design keeps the internal order an implementation detail (I use column-major internally, because it is easier to compute the strides on the fly), the two choices are fortran-like (column-major) or C-like (row-major). Perhaps provide both with clear names so it is obvious? Or just row-major, because that's the standard in C++? These questions must have been already discussed during the design of Boost.MultiArray. I think I should provide a way to provide a MultiArray copy of the bin content, thus leveraging all the work that already went into defining MultiArray. Similarly one could provide a read-only view of the variance estimates. A read-only MultiArray view of the internal bin counts of the histogram is also possible for array_storage policy, but not for the default adaptive_storage policy. - Writable iterators? In your example, you use writable iterators, to change the content (and the meaning of the content) of the histogram itself. In my reply here, I use read-only iterators, because changing the content of the histogram seems to go against the semantics. A histogram is more specific than a general container of values, the content has a strong meaning that cannot be arbitrarily changed. To be precise, what should the variance estimate, returned by `histogram::variance(…)` be for a cumulative count? The variance estimate for cumulative count is a whole matrix of covariances, since the successive entries are not independent. There are other operations where it is not clearly defined how the variance estimate should be computed. I think the best way to avoid these problems is to provide read-only iterators. Let me know what you think. I also added your message as an issue on github. Best regards, Hans
On 11/20/2017 12:51 PM, Hans Dembinski wrote:
I totally see the point for partial_sum. Could you also provide a code example for how you would like to compute the cosine similarity?
A working example using cosine similarity is too large to include here, but the crux of it is to calculate the cosine similarity between a histogram h1 and all other histograms in order to determine which of these other histograms is closest to h1. Anyways, it was just an example of where it makes sense to use std::inner_product() on two histograms.
I think we need several iterators then. The current axis iterators iterate over the bins definition, not really over the bin content of the histogram itself. I chose the current iterators in this way, because I believe this scheme generalises well to the multi-dimensional case. I wouldn't replace these iterators, because they seem useful to me, but additional iterators that iterate over the bin values and play well with algorithms are also clearly needed. I would rename the current indexing iterator so something like bin_iterator because it does something special:
using bin_iterator = unspecified; bin_iterator bin_begin() const; bin_iterator bin_end() const; The iterator (and reverse_iterator) types can then be used to traverse the content.
- What should this do if h is 2d? The same? If yes, in which order should the 2d array be iterated over?
I would let axis iterators traverse only along the axis. If we need something to traverse all elements within a multi-dimensional array, then we probably need something else, such as mdspan [1]. [1] https://github.com/kokkos/array_ref/blob/master/proposals/P0009.rst
- Writable iterators?
Good argument for const iterators (variance calculation.)
On 26. Nov 2017, at 12:51, Bjorn Reese via Boost
wrote: On 11/20/2017 12:51 PM, Hans Dembinski wrote:
I totally see the point for partial_sum. Could you also provide a code example for how you would like to compute the cosine similarity?
A working example using cosine similarity is too large to include here, but the crux of it is to calculate the cosine similarity between a histogram h1 and all other histograms in order to determine which of these other histograms is closest to h1. Anyways, it was just an example of where it makes sense to use std::inner_product() on two histograms.
Ok, I just wanted to know whether this example would require the same kind of interface.
I committed a change yesterday which implements value iterators for the histograms. This allows code like this:
auto h = /* make a histogram */
std::vector<double> csum;
std::partial_sum(h.begin(), h.end(), std::back_inserter(csum));
The forward iterator returned here iterates over the values of the histogram in implementation-defined order, skipping under-/overflow bins. For 1d-histograms, this is guaranteed to be the normal order you would expect from any 1d container.
This code works well for 1d-histograms. So what if I have 2d? The forward iterator can tell you the current 2d index of the value it points to. To fill a boost::multi_array, you would do
auto h = /* make 2d histogram, 2x3 */
boost::multi_array
I think we need several iterators then. The current axis iterators iterate over the bins definition, not really over the bin content of the histogram itself. I chose the current iterators in this way, because I believe this scheme generalises well to the multi-dimensional case. I wouldn't replace these iterators, because they seem useful to me, but additional iterators that iterate over the bin values and play well with algorithms are also clearly needed. I would rename the current indexing iterator so something like bin_iterator because it does something special:
using bin_iterator = unspecified; bin_iterator bin_begin() const; bin_iterator bin_end() const;
I suppose you suggest this to provide more clarity on what the axis iterators iterate over, but this solution has some drawbacks. The philosophy in this library is to regard an axis as a logical collection of bins (the individual bins do not really exist physically as objects in memory). If you see an axis like that, then you would expect `begin()` and `end()` to provide iterators over the bins. The axis does not know anything about the histogram it belongs to, it is a sub-structure without a callback into the super-structure. Therefore, it technically cannot even provide an iterator over the histogram values in the current design. This hierarchical separation is a nice feature. I would argue that iterators should not be named for documentation purposes, there are algorithms which expect the standard names const_iterator and iterator, and most of the time you would write `auto` nowadays, so that the typename does not appear in the source code and thus you loose it as a source of self-documentation. The case against `bin_begin()` and `bin_end()` is based on the lost compatibility with range-based for loops, which expect the standard names `begin()` and `end()`. I think it is better to emphasise the philosophy behind the design in the documentation to clear up these misunderstandings, no? I am also thinking of changing the bin iterator to make it less surprising, by reusing the design of the value_iterator above. Example code: auto a = /* make axis */ for (auto iter = a.begin(); iter != a.end(); ++iter) { auto index = iter.idx(); // special method to access current bin index auto bin = *iter; // the actual bin object, implemented as an interval with methods lower() and upper() for most axis types } The range-based for loop would then only iterate over the bins, not providing indices for (auto bin : a) { // do something with bin } Since range-based for loops are so nice, I am thinking of providing an adaptor called "enumerate", inspired by Python for (auto index_and_bin : enumerate(a)) { auto index = index_and_bin.first; auto bin = index_and_bin.second; // do something with index and bin } I think this design is less surprising then the current one.
The iterator (and reverse_iterator) types can then be used to traverse the content.
I put rbegin() and rend() for axis types on the to-do list.
- What should this do if h is 2d? The same? If yes, in which order should the 2d array be iterated over?
I would let axis iterators traverse only along the axis. If we need something to traverse all elements within a multi-dimensional array, then we probably need something else, such as mdspan [1].
[1] https://github.com/kokkos/array_ref/blob/master/proposals/P0009.rst
Yeah, mdspan would fit the needs of this library better than multi_array, because the rank can be a dynamic property, while the rank in multi_array is a compile-time constant. Since mdspan is not ready yet, we should still try to provide at least something to iterate over the values, hence my proposal of a value_iterator above. Best regards, Hans
participants (2)
-
Bjorn Reese
-
Hans Dembinski