Re: [boost] [histogram] Working with standard algorithms

29 Nov 2017

...
On 26. Nov 2017, at 12:51, Bjorn Reese via Boost <boost@lists.boost.org> wrote:
On 11/20/2017 12:51 PM, Hans Dembinski wrote:
...
I totally see the point for partial_sum. Could you also provide a code example for how you would like to compute the cosine similarity?
A working example using cosine similarity is too large to include here,
but the crux of it is to calculate the cosine similarity between a
histogram h1 and all other histograms in order to determine which of
these other histograms is closest to h1. Anyways, it was just an
example of where it makes sense to use std::inner_product() on two
histograms.
Ok, I just wanted to know whether this example would require the same kind of interface.

I committed a change yesterday which implements value iterators for the histograms. This allows code like this:

auto h = /* make a histogram */
std::vector<double> csum;
std::partial_sum(h.begin(), h.end(), std::back_inserter(csum));

The forward iterator returned here iterates over the values of the histogram in implementation-defined order, skipping under-/overflow bins. For 1d-histograms, this is guaranteed to be the normal order you would expect from any 1d container.

This code works well for 1d-histograms. So what if I have 2d? The forward iterator can tell you the current 2d index of the value it points to. To fill a boost::multi_array, you would do

auto h = /* make 2d histogram, 2x3 */
boost::multi_array<double, 2> a(boost::extends[2][3]);
for (auto iter = h.begin(); iter != h.end(); ++iter)
	a[iter.idx(0)][iter.idx(1)] = *iter;

This iterator is modelled after numpy.nditer in Python.

This approach has pros and cons:
+ simplest solution for users of 1d-histograms who do not care about variances and under-/overflow bins
- still hand-written loop required for multi-dimensional histograms and to access both values and variances

I originally wanted to provide a conversion to const_multi_array_ref, which would solve the iteration problem by connecting histogram to an established multi-dimensional array interface. I looked into the reference and I don't see how it can work, because const_multi_array_ref seems to require raw memory access. I cannot easily hand out a raw pointer to the internal counter memory, because the default storage handles this memory dynamically and switches the internal type used to store the counts as needed. If the type is not known at compile time, I cannot provide a raw pointer with the appropriate type to const_multi_array_ref. It may be possible to use a fancy pointer here (the template allows to specify the pointer type), which does the conversion into a fixed value type upon dereferencing, but I still see conflicts with the multi_array interface, which provides raw access to its internal memory with methods data() and origin().

However, I could convert the internal counters to double when the conversion to const_multi_array_ref is called. This would allow me to hand out a raw pointer to memory. For storage policies which use a fixed counter type, no conversion would be necessary, so that would work as well.
...
...
I think we need several iterators then. The current axis iterators iterate over the bins definition, not really over the bin content of the histogram itself. I chose the current iterators in this way, because I believe this scheme generalises well to the multi-dimensional case. I wouldn't replace these iterators, because they seem useful to me, but additional iterators that iterate over the bin values and play well with algorithms are also clearly needed.
I would rename the current indexing iterator so something like
bin_iterator because it does something special:
using bin_iterator = unspecified;
 bin_iterator bin_begin() const;
 bin_iterator bin_end() const;
I suppose you suggest this to provide more clarity on what the axis iterators iterate over, but this solution has some drawbacks.

The philosophy in this library is to regard an axis as a logical collection of bins (the individual bins do not really exist physically as objects in memory). If you see an axis like that, then you would expect `begin()` and `end()` to provide iterators over the bins. The axis does not know anything about the histogram it belongs to, it is a sub-structure without a callback into the super-structure. Therefore, it technically cannot even provide an iterator over the histogram values in the current design. This hierarchical separation is a nice feature.

I would argue that iterators should not be named for documentation purposes, there are algorithms which expect the standard names const_iterator and iterator, and most of the time you would write `auto` nowadays, so that the typename does not appear in the source code and thus you loose it as a source of self-documentation. The case against `bin_begin()` and `bin_end()` is based on the lost compatibility with range-based for loops, which expect the standard names `begin()` and `end()`.

I think it is better to emphasise the philosophy behind the design in the documentation to clear up these misunderstandings, no?

I am also thinking of changing the bin iterator to make it less surprising, by reusing the design of the value_iterator above. Example code:

auto a = /* make axis */
for (auto iter = a.begin(); iter != a.end(); ++iter) {
   auto index = iter.idx(); // special method to access current bin index
   auto bin = *iter; // the actual bin object, implemented as an interval with methods lower() and upper() for most axis types
}

The range-based for loop would then only iterate over the bins, not providing indices

for (auto bin : a) {
	// do something with bin
}

Since range-based for loops are so nice, I am thinking of providing an adaptor called "enumerate", inspired by Python

for (auto index_and_bin : enumerate(a)) {
   auto index = index_and_bin.first;
   auto bin = index_and_bin.second;
   // do something with index and bin
}

I think this design is less surprising then the current one.
...
The iterator (and reverse_iterator) types can then be used to traverse
the content.
I put rbegin() and rend() for axis types on the to-do list.
...
...
- What should this do if h is 2d? The same? If yes, in which order should the 2d array be iterated over?
I would let axis iterators traverse only along the axis. If we need
something to traverse all elements within a multi-dimensional array,
then we probably need something else, such as mdspan [1].
[1] https://github.com/kokkos/array_ref/blob/master/proposals/P0009.rst
Yeah, mdspan would fit the needs of this library better than multi_array, because the rank can be a dynamic property, while the rank in multi_array is a compile-time constant. Since mdspan is not ready yet, we should still try to provide at least something to iterate over the values, hence my proposal of a value_iterator above.

Best regards,
Hans