[histogram] should some_axis::size() return unsigned or int?

newer
Reminder: Boost Master branch will...

Hans Dembinski

29 Nov 2018 29 Nov '18

11:09 a.m.

Dear community (especially the reviewers of boost.histogram), I am still working on finishing the library based on the excellent reviews I got in September. I cannot make up my mind about a specific detail of the library interface and would like to ask for your input. Every axis type is required to have a size() method, which returns the number of bins along that axis (without counting extra bins for under- and overflow). Should this method return `unsigned` or `int`? = The case for `unsigned` = The size of an axis is non-negative, so the return type should reflect this. All the STL containers return their size as an unsigned integer. Naturally, boost.histogram should be as consistent as possible with the STL, so you don't have think about exceptions from the general rule. Whether `size()` returns a signed or unsigned type matters even in times of `auto`, since compilers like to warn about comparisons between signed and unsigned integers. These warnings and the STL have slowly thought people to write indexed loops with unsigned integers, like so: ``` auto h = make_histogram(…); // 1D histogram for (unsigned i = 0; i < axis.size(); ++i) { auto x = h[i]; // do something with bin } ``` We don't want to break this painfully learned habit by letting `size()` return a signed integer, because it would generate a warning in this naive loop. = The case for `int` = The type of the bin index returned by an axis is `int`, because it can be negative. An axis is required to return -1 when the axis represents an ordered range and the value fall below the low edge of the first bin. So the index must be signed. The size of an axis is highest index + 1. It would be natural to use the same type for `size()` and for the returned index. Internally in the library, there are many less-than comparisons between the index and the size of the axis. If the type of the latter is unsigned, the size must be cast to `int` in all these comparisons to avoid compiler warnings, which is adding code clutter in the implementation. The library provides a range adaptor, called `indexed`, which allows one to iterate over the bins of a histogram with a special proxy, that can return the current bin index and the accumulated value for that bin. The following naive code to ignore under/overflow bins while iterating will generate a warning when `size()` returns `unsigned` ``` auto h = make_histogram(…); // 1D histogram // iterate over indexed bin range, ignore indices < 0 and indices > size (= underflow/overflow bins) for (auto x : indexed(h)) { if (x[0] < 0 || x[0] >= h.axis().size()) // without warning only when x[0] and h.axis().size() both return `int` continue; // do something with bin } ``` People are supposed to use the indexed range to iterate over the bins and not a combination of simple loops like in the first listing, because it scales nicely to multi-dimensional histograms. In multiple dimensions, it is also potentially more efficient. The `indexed` range adaptor automatically moves sequentially in memory, which this is not guaranteed if the index is created by looping manually over several indices, like so ``` for (unsigned i = 0; i < h.axis(0).size(); ++i) for (unsigned j = 0; j < h.axis(1).size(); ++j) { auto x = h.at(i, j); // this may jump around in memory, causing cache misses } ``` How the linear index is generated from the multi-dimensional index is an implementation detail (although the implementation should probably anticipate these naive loops…) So, if `indexed` is anyway the default recommended way of iterating over a histogram, then size() should return `int` to be compatible with the index type, which is also `int`, because it makes naive code work better. Sorry for the long email, I would really appreciate your input. Best regards, Hans

Show replies by date

Olaf van der Spek

29 Nov 29 Nov

12:15 p.m.

On Thu, Nov 29, 2018 at 12:41 PM Hans Dembinski via Boost <boost@lists.boost.org> wrote:

...

Every axis type is required to have a size() method, which returns the number of bins along that axis (without counting extra bins for under- and overflow). Should this method return `unsigned` or `int`?

size() returns unsigned ssize() returns signed ;) -- Olaf

Paul A. Bristow

1:13 p.m.

...

-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Hans Dembinski via Boost Sent: 29 November 2018 11:10 To: Boost Devs Cc: Hans Dembinski; Jim Pivarski; Henry Schreiner Subject: [boost] [histogram] should some_axis::size() return unsigned or int?

Dear community (especially the reviewers of boost.histogram),

I am still working on finishing the library based on the excellent reviews I got in September. I cannot make up my mind about a specific detail of the library interface and would like to ask for your input.

Every axis type is required to have a size() method, which returns the number of bins along that axis (without counting extra bins for under- and overflow). Should this method return `unsigned` or `int`?

<big snip> you need a new name for signed axis size? int axis_size() perhaps ;-) Paul --- Paul A. Bristow Prizet Farmhouse Kendal UK LA8 8AB +44 (0) 1539 561830

Hans Dembinski

4:59 p.m.

Dear Paul, dear Olaf,

...

On 29. Nov 2018, at 14:13, Paul A. Bristow via Boost <boost@lists.boost.org> wrote:

you need a new name for signed axis size?

int axis_size()

perhaps ;-)

both of you basically propose to have two similarly named "size" methods which return `unsigned` and `int` respectively. I really hope that you are kidding as indicated by the smileys in your messages. What I really need is a custom integer type for the index which is signed but can be compared safely with both `int` and `unsigned`. I could create a new integer type with a struct or an enum class and then implement all the operators explicitly. It is a lot of work, though.

Olaf van der Spek

30 Nov 30 Nov

7:43 a.m.

On Thu, Nov 29, 2018 at 5:58 PM Hans Dembinski via Boost <boost@lists.boost.org> wrote:

...

Dear Paul, dear Olaf,

...
On 29. Nov 2018, at 14:13, Paul A. Bristow via Boost <boost@lists.boost.org> wrote:

you need a new name for signed axis size?

int axis_size()

perhaps ;-)

both of you basically propose to have two similarly named "size" methods which return `unsigned` and `int` respectively. I really hope that you are kidding as indicated by the smileys in your messages.

Not really: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1227r0.html -- Olaf

Hans Dembinski

9:20 a.m.

...

On 30. Nov 2018, at 08:43, Olaf van der Spek <ml@vdspek.org> wrote:

On Thu, Nov 29, 2018 at 5:58 PM Hans Dembinski via Boost <boost@lists.boost.org> wrote:

...
Dear Paul, dear Olaf,

...
On 29. Nov 2018, at 14:13, Paul A. Bristow via Boost <boost@lists.boost.org> wrote:

you need a new name for signed axis size?

int axis_size()

perhaps ;-)

both of you basically propose to have two similarly named "size" methods which return `unsigned` and `int` respectively. I really hope that you are kidding as indicated by the smileys in your messages.

Not really: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1227r0.html

Ok, I see, thank you for the link. I am not sure if the proposal is the best solution for the show-case problem outlined in the proposal. My personal bias against having to similar "size" methods is high, since I feel that the high level interface now exposes a low-level problem, namely, how basic operators should work on signed and unsigned types. Best regards, Hans

Gavin Lambert

3 Dec 3 Dec

12:15 a.m.

On 30/11/2018 20:43, Olaf van der Spek wrote:

...

Not really: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1227r0.html

Isn’t that proposed definition of ssize() essentially identical to // member function constexpr difference_type ssize() const { return std::distance(begin(), end()); } // or moral equivalent, for containers that can implement it // more efficiently another way. // non-member function in std:: template <class C> constexpr auto ssize(const C& c) { return distance(begin(c), end(c)); } // or calling member ssize() if it exists The above seems "better" in that it allows a container intended exclusively for small sizes to use a difference_type smaller than ptrdiff_t. (In the text it explains why it's a bad idea for a container using uint16_t sizes with more than 32767 elements to use int16_t as a difference_type, but as I understand it that's already undefined behaviour -- size_type is required to be larger than difference_type and difference_type should be able to express std::distance(begin(), end()) or some algorithms will break.)

Jorg Brown

2:15 a.m.

[ Gaving cc'd me as the author of http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1227r0.html ] "distance(begin, end)" isn't O(1) for many types of containers, which is why P1227 defines ssize() as the static_casted result of size(). Of course, for many containers, size() is implemented as the subtraction of two pointers, static_casted to size_t. (See https://github.com/llvm-mirror/libcxx/blob/master/include/vectorfor example) In those cases, it would make more sense to define ssize() as simply the subtraction of those two pointers. "difference_type should be able to express std::distance(begin(), end()) or some algorithms will break" A similar point came up at the San Diego standards meeting: If you have a container large enough that it might take up more than half the address space, ptrdiff_t will not be able to express the difference between begin() and end(). The conclusion drawn (not necessarily shared by everyone in the room) was that containers that take up more than half of the address space are not really supported by C++. Or at least, not if their iterators have byte-level granularity. ""better" in that it allows a container intended exclusively for small sizes to use a difference_type smaller than ptrdiff_t." I applaud your use of quotation marks around better, since such a difference_type is often inefficient in that the caller is often forced to immediately promote it to larger size, before the processor can work with it. = - = - = - = - = Before going down the road of size() and ssize(), the real question is what should this do: for (const auto &el: some_axis) { // Iterates over the underflow/overflow bins? } for (size_t i = 0; i != some_axis.size(); ++i) { auto &el = some_axis[i]; // Iterates over the underflow/overflow bins? } It looks like you wanted to use an index of -1 and size() to refer to underflow and overflow. That's... interesting... and I mean "interesting" in a good sense of the word. What this means is that someone who wants to iterate the *-flow bins might write: for (int i = -1; i != some_axis.size() + 1; ++i) { auto &el = some_axis[i]; // Iterates over all the bins, including *-flow. } And the code would work. However someone else might naively write that first line differently: for (int i = -1; i < some_axis.size() + 1; ++i) { and that loop would never iterate any entries at all, if size() returned an unsigned value. This code, however, would work: for (int i = -1; (i == -1) || (i < some_axis.size() + 1); ++i) { ... it just gets really ugly, really fast... with an ssize() routine, this does work: for (int i = -1; i < some_axis.ssize() + 1; ++i) { But this does not: for (size_t i = -1; i < some_axis.ssize() + 1; ++i) { = - = - = If it were me writing the code purely for myself, I would avoid unsigned integers except for bit manipulation and intended overflow situations. And this would all work fine. Of course that still leaves the question of whether you want iteration from begin() to end() to include the overflow/underflow bins. Since you're writing this code for others, I'd be wary of breaking user's ingrained size_t / less-than habits. I'm curious about whether you've considered using 0 and size() - 1 as the indices for the special bins? In that case, users would write this to iterate over all bins: for (const auto &el: some_axis) { for (size_t i = 0; i < some_axis.size(); ++i) { And something like this to iterate over just the non-special bins: for (size_t i = 1; i <= some_axis.num_bins(); ++i) { This would support both my "signed everywhere" contingent, as well as the "size_t everywhere, be sure it never goes negative" contingent. -- Jorg On Sun, Dec 2, 2018 at 4:15 PM Gavin Lambert <boost@mirality.co.nz> wrote:

...

On 30/11/2018 20:43, Olaf van der Spek wrote:

...
Not really: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1227r0.html

Isn’t that proposed definition of ssize() essentially identical to

// member function constexpr difference_type ssize() const { return std::distance(begin(), end()); } // or moral equivalent, for containers that can implement it // more efficiently another way.

// non-member function in std:: template <class C> constexpr auto ssize(const C& c) { return distance(begin(c), end(c)); } // or calling member ssize() if it exists

The above seems "better" in that it allows a container intended exclusively for small sizes to use a difference_type smaller than ptrdiff_t.

(In the text it explains why it's a bad idea for a container using uint16_t sizes with more than 32767 elements to use int16_t as a difference_type, but as I understand it that's already undefined behaviour -- size_type is required to be larger than difference_type and difference_type should be able to express std::distance(begin(), end()) or some algorithms will break.)

Gavin Lambert

5:18 a.m.

On 3/12/2018 15:15, Jorg Brown wrote:

...

"distance(begin, end)" isn't O(1) for many types of containers, which is why P1227 defines ssize() as the static_casted result of size().

Of course, for many containers, size() is implemented as the subtraction of two pointers, static_casted to size_t. (See https://github.com/llvm-mirror/libcxx/blob/master/include/vector for example) In those cases, it would make more sense to define ssize() as simply the subtraction of those two pointers.

Which is why I added the "moral equivalent" clause. :)

...

""better" in that it allows a container intended exclusively for small sizes to use a difference_type smaller than ptrdiff_t."

I applaud your use of quotation marks around better, since such a difference_type is often inefficient in that the caller is often forced to immediately promote it to larger size, before the processor can work with it.

True. But it also in theory allows for more esoteric indexing, such as indexing by an enum or a complex value or some other UDT. (Technically that doesn't meet the requirements of Container, but provided the type behaves sufficiently like an integer then most algorithms would still work.) Whether that's "better" or not might be another argument. :) And AFAIK all the standard container types use ptrdiff_t as their difference_type anyway, so it seems reasonable to use that instead of mandating ptrdiff_t.

...

I'm curious about whether you've considered using 0 and size() - 1 as the indices for the special bins?

IIRC, Hans stated that another library does this and it makes it more awkward to use, since (he says) mostly you ignore the extra bins and turning them on or off changes the indexing of the regular bins.

Hans Dembinski

2:41 p.m.

...

On 3. Dec 2018, at 03:15, Jorg Brown via Boost <boost@lists.boost.org> wrote:

= - = - = - = - =

Before going down the road of size() and ssize(), the real question is what should this do:

for (const auto &el: some_axis) { // Iterates over the underflow/overflow bins? }

It should skip underflow/overflow.

...

for (size_t i = 0; i != some_axis.size(); ++i) { auto &el = some_axis[i]; // Iterates over the underflow/overflow bins? }

It should skip underflow/overflow and not produce a warning.

...

It looks like you wanted to use an index of -1 and size() to refer to underflow and overflow. That's... interesting... and I mean "interesting" in a good sense of the word.

It is really the most intuitive for a naive person with a math background, but without a strong C++ background. From the math perspective, an axis is an arrow over the real line. Here is my picture again. Value arrow for axis::regular<>(3, 1, 4): // 3 bins from 1 to 4 1 2 3 4 -inf ——————— | ————— | ————— | ————— | —————————> +inf bin -1 bin 0 bin 1 bin 2 bin 3 Both the values and the indices are ordered and have a natural correspondence. If the bin 0 is the interval [1, 2), then clearly the bin -1 should be the interval [-inf, 1), and similar for the overflow bin. The bins from index 0 to 2 are the "inner bins". Indexing over inner bins should not behave differently for an axis with and without *-flow bins. Therefore, the index of the first inner bin must be 0 and the last must be size() - 1, always. I could place the *-flow bins after the normal bins, at size() and size() + 1, but this breaks the nice correspondence between the ordering of indices and value intervals. I do not want to give that up easily.

...

What this means is that someone who wants to iterate the *-flow bins might write: […] for (int i = -1; i < some_axis.size() + 1; ++i) {

The goal is to make such code work correctly.

...

with an ssize() routine, this does work:

[…] for (size_t i = -1; i < some_axis.ssize() + 1; ++i) {

In this case, you would correctly get a warning from the compiler about initialising a value of type size_t with -1. I don't mind that this produces a warning, it should. The goal is to make these two lines work correctly and give no warning: for (unsigned i = 0; i < some_axis.size(); ++i) // … for (int i = -1; i < some_axis.size() + 1; ++i) // … I still think that most of the suggestions here try to work around a problem that we have with builtin integers, namely that the code `-1 < 1u` is not doing what you would naively expect. So an elaborate solution is to make a new integer type `axis::size_type` for axis objects, which wraps an `int` and is implicitly convertible to `int`, but implements comparison operators with `int` and `unsigned` types to yield the naive results without warnings: Method signature: axis::size_type my_axis::size() const Behaviour of axis::size_type: axis::size_type s = 0; assert((s-1) == -1); // ok, uses custom operator-(axis::size_type, int) assert(-1 < s); // ok, uses custom bool operator<(int, axis::size_type) assert(s < 1u); // no warning, uses custom operator<(axis::size_type, unsigned) The only "problem" that I can see with this solution is that I have to implement a lot of operators and write a lot of additional tests, etc. But it is most direct solution to the problem that I can see. Best regards, Hans

Brook Milligan

3:11 p.m.

...

On Dec 3, 2018, at 8:26 PM, Hans Dembinski via Boost <boost@lists.boost.org> wrote:

So an elaborate solution is to make a new integer type `axis::size_type` for axis objects, which wraps an `int` and is implicitly convertible to `int`, but implements comparison operators with `int` and `unsigned` types to yield the naive results without warnings:

Method signature: axis::size_type my_axis::size() const

Behaviour of axis::size_type: axis::size_type s = 0; assert((s-1) == -1); // ok, uses custom operator-(axis::size_type, int) assert(-1 < s); // ok, uses custom bool operator<(int, axis::size_type) assert(s < 1u); // no warning, uses custom operator<(axis::size_type, unsigned)

The only "problem" that I can see with this solution is that I have to implement a lot of operators and write a lot of additional tests, etc. But it is most direct solution to the problem that I can see.

You are aware of the Boost operators library, right? That dramatically reduces the task of writing operators and would seem appropriate here. From reading the thread, this seems like the best solution as it allows you to support the full range of operations that you deem appropriate for balancing the requirements. X::size_type really should be an opaque type anyway, so as long as it interoperates with native types as expected you should be fine. Cheers, Brook

Hans Dembinski

4:23 p.m.

...

On 3. Dec 2018, at 16:11, Brook Milligan <brook@nmsu.edu> wrote:

...
On Dec 3, 2018, at 8:26 PM, Hans Dembinski via Boost <boost@lists.boost.org> wrote:

So an elaborate solution is to make a new integer type `axis::size_type` for axis objects, which wraps an `int` and is implicitly convertible to `int`, but implements comparison operators with `int` and `unsigned` types to yield the naive results without warnings:

Method signature: axis::size_type my_axis::size() const

Behaviour of axis::size_type: axis::size_type s = 0; assert((s-1) == -1); // ok, uses custom operator-(axis::size_type, int) assert(-1 < s); // ok, uses custom bool operator<(int, axis::size_type) assert(s < 1u); // no warning, uses custom operator<(axis::size_type, unsigned)

The only "problem" that I can see with this solution is that I have to implement a lot of operators and write a lot of additional tests, etc. But it is most direct solution to the problem that I can see.

You are aware of the Boost operators library, right? That dramatically reduces the task of writing operators and would seem appropriate here. From reading the thread, this seems like the best solution as it allows you to support the full range of operations that you deem appropriate for balancing the requirements. X::size_type really should be an opaque type anyway, so as long as it interoperates with native types as expected you should be fine.

I am aware of the Boost operators library, but it is good that you remind me. If I allow my size_type to be implicitly convertible to int, I do not even have so many operators to implement. Implicit conversions are often bad, but it may be ok here.

Brook Milligan

4 Dec 4 Dec

3:27 a.m.

...

On Dec 3, 2018, at 10:08 PM, Hans Dembinski <hans.dembinski@gmail.com> wrote:

I am aware of the Boost operators library, but it is good that you remind me. If I allow my size_type to be implicitly convertible to int, I do not even have so many operators to implement. Implicit conversions are often bad, but it may be ok here.

Is this discussion moving in the direction of a safe_unsigned<T> type, i.e., unsigned without modulo arithmetic? I can't recall if Robert Ramey's safe_numerics library does that, but that might be useful. Cheers, Brook

Hans Dembinski

1:33 p.m.

...

On 4. Dec 2018, at 04:27, Brook Milligan <brook@nmsu.edu> wrote:

...
On Dec 3, 2018, at 10:08 PM, Hans Dembinski <hans.dembinski@gmail.com> wrote:

I am aware of the Boost operators library, but it is good that you remind me. If I allow my size_type to be implicitly convertible to int, I do not even have so many operators to implement. Implicit conversions are often bad, but it may be ok here.

Is this discussion moving in the direction of a safe_unsigned<T> type, i.e., unsigned without modulo arithmetic? I can't recall if Robert Ramey's safe_numerics library does that, but that might be useful.

I was wondering about the same thing. I only had a quick look into it, but it could potentially be very useful. I tried to implement such a safe_unsigned type yesterday evening, but eventually gave up. This is really a lot of added complexity just to avoid a warning that users may not even see, if they compile without warnings enabled. Perhaps I can use the safe_numerics library in a future version of boost.histogram to better solve this issue, but for now I will let axis::size() return `int`.

2400

Age (days ago)

2405

Last active (days ago)

List overview

Download

13 comments

6 participants

participants (6)

Brook Milligan
Gavin Lambert
Hans Dembinski
Jorg Brown
Olaf van der Spek
Paul A. Bristow