Re: [boost] [histogram] should some_axis::size() return unsigned or int?
Dear Jim,
On 29. Nov 2018, at 13:31, Jim Pivarski
wrote: It seems to me that if other C++ containers use unsigned integers for size(), you should too, for minimal surprise. The issue you raise about the "which bin?" function returning -1 would be solved in a very-strongly typed language like Haskell as an optional<int>. I know that C++ has optional types now, but I don't know how widely they're used or if there's a significant performance penalty. If this wouldn't look too weird in a C++ program and wouldn't slow it down (or needlessly complicate the code), an optional type would describe your intent more fully than -1.
there is boost::optional, which has the semantics of a pointer and can be used to represent a type that stores a value or not. ``` boost::optional o = some_fickle_function(); if (o) { // optional has a value? auto value = *o; // "dereference" to get the value } else { // handle case where value is missing } ``` This is not a good match here, because -1 here does not have the meaning of "value is missing", but it really is the logical index for the virtual bin that spans from -infinity to the lower edge of the first bin in the axis. Value arrow: -inf ——————— | ——————— | —————— | —————— |—————————> +inf bin -1 bin 0 bin 1 bin 2 bin 3 I think representing the underflow bin with -1 and the overflow bin with the value returned by size() is very intuitive and elegant. Best regards, Hans
On 30/11/2018 06:12, Hans Dembinski wrote:
This is not a good match here, because -1 here does not have the meaning of "value is missing", but it really is the logical index for the virtual bin that spans from -infinity to the lower edge of the first bin in the axis.
Value arrow: -inf ——————— | ——————— | —————— | —————— |—————————> +inf bin -1 bin 0 bin 1 bin 2 bin 3
I think representing the underflow bin with -1 and the overflow bin with the value returned by size() is very intuitive and elegant.
Conventionally your size_type should be the same type returned by size() and used for indexing. So I would expect that type to be int, given the above. Having said that, you're already departing from standard container conventions by having size() return a number that is *sometimes* 2 smaller than the "real" number of bins, which might frustrate generic algorithms. Completely without tongue in cheek, I wonder if it might be better to not provide a size() method at all (to avoid container implications which you do not satisfy) and spell it as bin_count() or something distinct instead. Another possibility that might sidestep all of this could be to remove the underflow/overflow bins from the standard indexing and make them only accessible through separate underflow_bin() methods (or similar) instead. But this might complicate other cases such as a find_bin_for_value() that needs to return an underflow/overflow (though that could be solved by returning a bin iterator, not an index). I'm not sure if that's actually *better* though. :)
On Thu, Nov 29, 2018 at 3:38 PM Gavin Lambert via Boost < boost@lists.boost.org> wrote:
On 30/11/2018 06:12, Hans Dembinski wrote:
This is not a good match here, because -1 here does not have the
meaning of "value is missing", but it really is the logical index for the virtual bin that spans from -infinity to the lower edge of the first bin in the axis.
Value arrow: -inf ——————— | ——————— | —————— | —————— |—————————> +inf bin -1 bin 0 bin 1 bin 2 bin 3
I think representing the underflow bin with -1 and the overflow bin
with the value returned by size() is very intuitive and elegant.
Conventionally your size_type should be the same type returned by size() and used for indexing. So I would expect that type to be int, given the above.
Having said that, you're already departing from standard container conventions by having size() return a number that is *sometimes* 2 smaller than the "real" number of bins, which might frustrate generic algorithms.
Completely without tongue in cheek, I wonder if it might be better to not provide a size() method at all (to avoid container implications which you do not satisfy) and spell it as bin_count() or something distinct instead.
+1 If the operation differs semantically from .size() in standard containers, then it shouldn't be called size().
Conventionally your size_type should be the same type returned by size() and used for indexing. So I would expect that type to be int, given the above.
Having said that, you're already departing from standard container conventions by having size() return a number that is *sometimes* 2 smaller than the "real" number of bins, which might frustrate generic algorithms.
Completely without tongue in cheek, I wonder if it might be better to not provide a size() method at all (to avoid container implications which you do not satisfy) and spell it as bin_count() or something distinct instead.
Another possibility that might sidestep all of this could be to remove the underflow/overflow bins from the standard indexing and make them only accessible through separate underflow_bin() methods (or similar) instead. But this might complicate other cases such as a find_bin_for_value() that needs to return an underflow/overflow (though that could be solved by returning a bin iterator, not an index).
A high +1:
Having size() makes on expect that "for(i=0;i I think representing the underflow bin with -1 and the overflow bin
with the value returned by size() is very intuitive and elegant. IMO this is against expectation of every C++ programmer: Being able to
index a container with -1 and size() so in every review of code using
this it will come up at least as a raised eyebrow.
Other idea: If those bins are so special that they don't fit into the
[0, size()) range, why not use a different function for getting them,
which is not the index operator? high_bin()/low_bin() come to mind.
But WHY was this chosen? Wouldn't it be ok if 0 is the first bin which
starts at -inf and size()-1 to be the last one spanning to inf? This
would allow a histogram of size 1 which has a single bin holding all
values. This makes me wonder to if you *always* want the +-inf bins or
if there are reasonable use cases NOT wanting them and expect
out-of-range values to be dropped.
Disclaimer: I'm not familiar with Boost.histogram and see this from a
users perspective only.
On 30. Nov 2018, at 08:51, Alexander Grund via Boost
wrote: Conventionally your size_type should be the same type returned by size() and used for indexing. So I would expect that type to be int, given the above.
Having said that, you're already departing from standard container conventions by having size() return a number that is *sometimes* 2 smaller than the "real" number of bins, which might frustrate generic algorithms.
Completely without tongue in cheek, I wonder if it might be better to not provide a size() method at all (to avoid container implications which you do not satisfy) and spell it as bin_count() or something distinct instead.
Another possibility that might sidestep all of this could be to remove the underflow/overflow bins from the standard indexing and make them only accessible through separate underflow_bin() methods (or similar) instead. But this might complicate other cases such as a find_bin_for_value() that needs to return an underflow/overflow (though that could be solved by returning a bin iterator, not an index).
A high +1:
Having size() makes on expect that "for(i=0;i
You are overestimating the importance of the *-flow bins, I think. Users usually ignore them when they analyse their histograms. They must be there for other reasons which are explained in the rationale and they are very useful for expert-level statistical analyses and for debugging. The beginner, however, should not notice their presence. In fact, the `indexed` range adaptor should probably skip them by default, and only iterate over them when that is explicitly requested.
I think representing the underflow bin with -1 and the overflow bin with the value returned by size() is very intuitive and elegant.
IMO this is against expectation of every C++ programmer: Being able to index a container with -1 and size() so in every review of code using this it will come up at least as a raised eyebrow.
An axis is not a container. It does not hold values and it has no operator[], precisely to emphasise this difference. It has size() though. See my email to Gavin with a long explanation why I think that makes sense.
Other idea: If those bins are so special that they don't fit into the [0, size()) range, why not use a different function for getting them, which is not the index operator? high_bin()/low_bin() come to mind.
See explanation to Gavin why this is worse.
But WHY was this chosen? Wouldn't it be ok if 0 is the first bin which starts at -inf and size()-1 to be the last one spanning to inf? This would allow a histogram of size 1 which has a single bin holding all values.
And why would you want such an axis? It would be pointless and make the histogram operate slower. Nevertheless, you can easily generate such an axis as user-defined axis type // this axis is valid and can be plugged into boost.histogram struct my_axis { int operator()(double x) const { return 0; } // always return 0 unsigned size() const { return 1; } // axis has size of 1 };
This makes me wonder to if you *always* want the +-inf bins or if there are reasonable use cases NOT wanting them and expect out-of-range values to be dropped.
Sure, there are reasonable use cases where you don't want them. It is a compile-time option for the builtin axis types (on by default). Also, user-defined axis types are not required to have them (see example above).
Disclaimer: I'm not familiar with Boost.histogram and see this from a users perspective only.
No problem, your user perspective is appreciated. It would be good to read the rationale, in the docs of boost.histogram though, since it lists the arguments that have been refined in previous discussions already. Best regards, Hans
Am 30.11.18 um 11:36 schrieb Hans Dembinski:
You are overestimating the importance of the *-flow bins, I think. Users usually ignore them when they analyse their histograms. They must be there for other reasons which are explained in the rationale and they are very useful for expert-level statistical analyses and for debugging. The beginner, however, should not notice their presence.
In fact, the `indexed` range adaptor should probably skip them by default, and only iterate over them when that is explicitly requested. Sounds reasonable: A range excluding the over/underflow bins and one including it. An axis is not a container. It does not hold values and it has no operator[], precisely to emphasise this difference. It has size() though. See my email to Gavin with a long explanation why I think that makes sense. Your code example was the following:
Other idea: If those bins are so special that they don't fit into the [0, size()) range, why not use a different function for getting them, which is not the index operator? high_bin()/low_bin() come to mind. See explanation to Gavin why this is worse. Combining this with "Users usually ignore them[...] the `indexed` range adaptor should probably skip them by default" I do see the need for extra functions here too. Your argument against "high_bin()/low_bin()" was: Iteration must be split. But your above comment already suggests,
for (unsigned i = 0; i < axis.size(); ++i) { auto x = h[i]; // do something with bin } So it looks like a container, although size and []-operator are in different instances (which feels weird, but ok) that there are iterators which can cover the whole range. Could they solve this split-iteration-problem?
But WHY was this chosen? Wouldn't it be ok if 0 is the first bin which starts at -inf and size()-1 to be the last one spanning to inf? This would allow a histogram of size 1 which has a single bin holding all values. And why would you want such an axis? It would be pointless and make the histogram operate slower.
I was not saying this should be done. It would just be consistent. There are 2 dimensions: - open ranged bins yes/no - number of bins In my mind enabling open ranged bins does not ADD bins but makes the first and last go to +-inf: axis(4,0,10,"",uoflow_type::on) -> [-inf,0), [0,5), [5,10), [10, inf] axis(4,0,10,"",uoflow_type::off) -> [0,2.5), [2.5,5), [5,7.5), [7.5,10) Of course this might be confusing so default should be "off" as "users usually ignore them" so they are advanced things one does not generally need, right? (Side note: The parameter description at https://hdembinski.github.io/histogram/doc/html/boost/histogram/axis/regular... is confusing due to the list order not matching the parameter order.) So my TLDR of this is: Consistency and meeting expectations. If it breaks either, think again about the choices made. For this it is either: - *-flow bins are kinda regular bins -> included in size(), iteration, same behavior like regulars - *-flow are special bins -> not included in size(), special accessors and iterators with default ones not including them. Given that: Why not have special constants for Underflow AND Overflow bin (e.g. -1 and -2) (instead of -1 and size(), where the latter is a runtime constant), then you could have a `int find_including_ouflow` and a `unsigned find` as well as `get(unsigned)`, `get_with_uoflow(int)` -> Idea is to make the special handling obvious Alex PS: I don't want to push anything. Just my thoughts on your issue in the hope it helps you finding a solution which you are happy with.
On 30. Nov 2018, at 12:58, Alexander Grund via Boost
wrote: Am 30.11.18 um 11:36 schrieb Hans Dembinski:
You are overestimating the importance of the *-flow bins, I think. Users usually ignore them when they analyse their histograms. They must be there for other reasons which are explained in the rationale and they are very useful for expert-level statistical analyses and for debugging. The beginner, however, should not notice their presence.
In fact, the `indexed` range adaptor should probably skip them by default, and only iterate over them when that is explicitly requested. Sounds reasonable: A range excluding the over/underflow bins and one including it. An axis is not a container. It does not hold values and it has no operator[], precisely to emphasise this difference. It has size() though. See my email to Gavin with a long explanation why I think that makes sense. Your code example was the following:
for (unsigned i = 0; i < axis.size(); ++i) { auto x = h[i]; // do something with bin }
`i` is the index for the histogram here (!) not for the axis itself. The histogram is also not a container, but it acts more like one. It also has a size() method and the value really includes all bins. A histogram is technically a multi-dimensional array, but semantically it is more than that. In a histogram, each dimension of the multi-dimensional array also has an associated arrow of values. The axis types logically represent these arrows, which have indices and values. The axis itself is not very much like a container and I don't expect users to run STL algorithms on axis objects.
So it looks like a container, although size and []-operator are in different instances (which feels weird, but ok)
I hope it is more clear now.
Other idea: If those bins are so special that they don't fit into the [0, size()) range, why not use a different function for getting them, which is not the index operator? high_bin()/low_bin() come to mind. See explanation to Gavin why this is worse. Combining this with "Users usually ignore them[...] the `indexed` range adaptor should probably skip them by default" I do see the need for extra functions here too. Your argument against "high_bin()/low_bin()" was: Iteration must be split. But your above comment already suggests, that there are iterators which can cover the whole range. Could they solve this split-iteration-problem?
Yes, but we are diverging from my original question now. Users are recommended to use the range adaptors and iterators provided by the library. For the adaptors and iterators, all cases can be gracefully implemented. But I know my potential users very well, they will also use integers as indices to loop over the histogram, because people in target community are often beginner programmers and using an integer index feels natural to them. The question is how to optimise the design for this use case and so that it does not clash with similar (mis)usage of STL containers.
But WHY was this chosen? Wouldn't it be ok if 0 is the first bin which starts at -inf and size()-1 to be the last one spanning to inf? This would allow a histogram of size 1 which has a single bin holding all values. And why would you want such an axis? It would be pointless and make the histogram operate slower.
I was not saying this should be done. It would just be consistent. There are 2 dimensions: - open ranged bins yes/no - number of bins In my mind enabling open ranged bins does not ADD bins but makes the first and last go to +-inf:
axis(4,0,10,"",uoflow_type::on) -> [-inf,0), [0,5), [5,10), [10, inf] axis(4,0,10,"",uoflow_type::off) -> [0,2.5), [2.5,5), [5,7.5), [7.5,10)
No, not a good idea. If we do that, then toggling *-flow bins on/off changes your whole program as it is written up to now, it will do something completely different. A user should be able to code the analysis and then decide: "Ah crap, these extra bins cost too much memory and in my special case they are always empty anyway, because my values never go out of the range that I specified. So let's just turn them off". Doing that optimization should not change logic of the program you wrote so far.
Of course this might be confusing so default should be "off" as "users usually ignore them" so they are advanced things one does not generally need, right?
I can see that you did not read the rationale carefully yet. Best regards, Hans
Am 30.11.18 um 13:27 schrieb Hans Dembinski:
I can see that you did not read the rationale carefully yet. I just did not understand everything ;) `i` is the index for the histogram here (!) not for the axis itself. The histogram is also not a container, but it acts more like one. It also has a size() method and the value really includes all bins. Ah ok. So the essence of your question was: What should the types be for histogram::size() and histogram::operator[]? BTW: Where is an interface description of that histogram class? In the reference I only find a forward declaration and "unspecified type", so I don't know what members it has.
In this case I go with the "same-as-stdlib" approach:
- Types should be the same
- for(auto i = SizeType(0); i Yes, but we are diverging from my original question now. Users are recommended to use the range adaptors and iterators provided by the library. For the adaptors and iterators, all cases can be gracefully implemented. But I know my potential users very well, they will also use integers as indices to loop over the histogram, because people in target community are often beginner programmers and using an integer index feels natural to them. The question is how to optimise the design for this use case and so that it does not clash with similar (mis)usage of STL containers.
See above. Especially for beginners it has to be consistent with other
usage. for(int i=-1; i<=h.size(); i++) WILL confuse them.
If you are referring to conversion warnings: Same as with stdlib.
No, not a good idea. If we do that, then toggling *-flow bins on/off changes your whole program as it is written up to now, it will do something completely different. A user should be able to code the analysis and then decide: "Ah crap, these extra bins cost too much memory and in my special case they are always empty anyway, because my values never go out of the range that I specified. So let's just turn them off". Doing that optimization should not change logic of the program you wrote so far. True, ok thanks for the clarification. Then axis.size is what the user
says and h.size()==axis.size||axis.size+2. Putting the *flow bins at the
end allows iterating over axis.size OR h.size with the same code. You
can even go from for(int i=0; i
On 30. Nov 2018, at 14:44, Alexander Grund via Boost
wrote: Am 30.11.18 um 13:27 schrieb Hans Dembinski:
I can see that you did not read the rationale carefully yet. I just did not understand everything ;)
Fair enough, but then let me know what you don't understand and I will try to explain it better.
`i` is the index for the histogram here (!) not for the axis itself. The histogram is also not a container, but it acts more like one. It also has a size() method and the value really includes all bins. Ah ok. So the essence of your question was: What should the types be for histogram::size() and histogram::operator[]?
No, the question is about axis::size(), whether it should return `int` or `unsigned`. There is no problem with histogram::size() or histogram::operator[].
BTW: Where is an interface description of that histogram class? In the reference I only find a forward declaration and "unspecified type", so I don't know what members it has.
There is a technical problem in the reference generation that I haven't been able to fix yet. Help from experts would be very much appreciated. In the code, the histogram interface is documented and the documentation is supposed to be part of the reference.
In this case I go with the "same-as-stdlib" approach:
- Types should be the same - for(auto i = SizeType(0); i
Agreed.
No, not a good idea. If we do that, then toggling *-flow bins on/off changes your whole program as it is written up to now, it will do something completely different.
A user should be able to code the analysis and then decide: "Ah crap, these extra bins cost too much memory and in my special case they are always empty anyway, because my values never go out of the range that I specified. So let's just turn them off". Doing that optimization should not change logic of the program you wrote so far.
True, ok thanks for the clarification. Then axis.size is what the user says and h.size()==axis.size||axis.size+2. Putting the *flow bins at the end allows iterating over axis.size OR h.size with the same code. You can even go from for(int i=0; i
This implies strange behaviour for the axis. Note that you can and sometimes want to use the axis manually to maps values to indices. Then you get this, if I follow your suggestion: auto a = axis::regular<>(2, 1, 2); // axis with two bins [1, 1.5), [1.5, 2.0) assert(a(0.99) == 3); // ?? assert(a(1) == 0); // ok assert(a(1.5) == 1); // ok assert(a(2.0) == 2); // ok assert(a(10.0) == 2); // ok The ordering of the indices is not the same as the ordering of the values. I find that unacceptable. Best regards, Hans
On 30. Nov 2018, at 00:38, Gavin Lambert via Boost
wrote: On 30/11/2018 06:12, Hans Dembinski wrote:
This is not a good match here, because -1 here does not have the meaning of "value is missing", but it really is the logical index for the virtual bin that spans from -infinity to the lower edge of the first bin in the axis. Value arrow: -inf ——————— | ——————— | —————— | —————— |—————————> +inf bin -1 bin 0 bin 1 bin 2 bin 3 I think representing the underflow bin with -1 and the overflow bin with the value returned by size() is very intuitive and elegant.
Conventionally your size_type should be the same type returned by size() and used for indexing. So I would expect that type to be int, given the above.
Makes sense.
Having said that, you're already departing from standard container conventions by having size() return a number that is *sometimes* 2 smaller than the "real" number of bins, which might frustrate generic algorithms.
Generic algorithm operate on iterator intervals, so that is a separate matter, I think. I think there is no perfect solution which is always unsurprising. In some calculations, you want to loop over the extra bins (underflow/overflow), but in most calculations you don't care about them and you want to ignore them. The current design is build on this insight, which is drawn from experience collected with CERN's ROOT library, where the underflow bin is placed at index 0. In that library, the first bin you are usually interested in starts at index 1. This turned out to be much more confusing and surprising to people, especially beginners. I originally had two methods, `size()` and `shape()`. The former would return the logical size of the axis, the latter the physical size. People didn't like `shape()` so we discussed other options and settled for `extend()`. `extend()` is now a free function in the axis::traits namespace and not a method anymore. The axis only has a `size()` method. In practice, the end user of the library has little use for `extend()`, it is mainly used internally.
Completely without tongue in cheek, I wonder if it might be better to not provide a size() method at all (to avoid container implications which you do not satisfy) and spell it as bin_count() or something distinct instead.
I am tempted by your argument, but not convinced whether this will really make the library less surprising. An axis may not be a container, but `for (int i = 0; i < axis.size(); ++i)` is what you would expect to work since it is similar enough to a container. We use these kinds of analogies everywhere, e.g. boost::optional uses pointer semantics, although it is not a pointer. Or iterators in general, they are not pointers and most of them do not provide the full semantics of a real pointer.
Another possibility that might sidestep all of this could be to remove the underflow/overflow bins from the standard indexing and make them only accessible through separate underflow_bin() methods (or similar) instead. But this might complicate other cases such as a find_bin_for_value() that needs to return an underflow/overflow (though that could be solved by returning a bin iterator, not an index).
Any other solution is worse, I hope the following argument shows this. Remember the value arrow from my other email. There is no other convention for the underflow and overflow indices that is as intuitive as this convention. It also makes looping over all bins, including the extra bins every easy: for (int i = 0; i < axis.size(); ++i) // loop over internal bins, what you would naively do for (int i = -1; i <= axis.size(); ++i) // loop over internal and extra bins In any other convention, I would have to loop over the internal bins and then handle the extra bins outside of the loop. for (int i = 0; i < axis.size(); ++i) do_something_with_index(i); do_something_with_index(axis.underflow_bin()); do_something_with_index(axis.overflow_bin()); This is clearly less elegant in code. Best regards, Hans
On 11/30/18 11:16 AM, Hans Dembinski via Boost wrote:
In any other convention, I would have to loop over the internal bins and then handle the extra bins outside of the loop.
I find this to be the cleanest solution, as the under-/overflow bins are special anyways. If we wish to iterator over all bins, then you could provide a special iterator for that. Hopefully it would be possible reuse the existing iterator and simply provide new begin/end functions. If we wish to index into the under-/overflow bins, then you could provide an operator[] that overloads on enum types for the underflow and overflow bins (as suggested in my formal review.)
participants (5)
-
Alexander Grund
-
Bjorn Reese
-
Emil Dotchevski
-
Gavin Lambert
-
Hans Dembinski