Interest in algorithm for arithmetic mean of integers

Matt Calabrese

3 Sep 2015 3 Sep '15

11:36 p.m.

AFAICT the only function for calculating arithmetic mean in boost is in the accumulators library and it's not really appropriate for integral types. In practice, properly calculating the mean for a range (especially a range that's only an input range but not a forward range) is surprisingly tricky to do in a way that avoids overflow and provides an exact result. I have a couple of solutions, including one that works for input ranges. Is this something that people would find useful? Is there a similar algorithm already implemented in boost that I'm not aware of? The approach I've taken yields an exact result as a mixed number and should work without overflow (assuming only that the range's size is able to be represented by its difference type). I believe the implementation that works for input ranges could theoretically be adapted to work with Boost.Accumulators, though I'm admittedly not too familiar with the library. -- -Matt Calabrese

Show replies by date

Rajaditya Mukherjee

4 Sep 4 Sep

2:43 p.m.

Hi Matt, I am imagining such a library would be greatly useful for numerical scientists and statisticians for precise computing. Can you give us a toy example of how it works ? (Sorry this is my first time posting on the boost mailing list) - Please forgive me if I have violated any guidelines. *Rajaditya Mukherjee * *3rd Year Graduate Student* Dept. of Computer Science and Engineering The Ohio State University Columbus, Ohio Tel :- +1-(614)-271-4439 email :- rajaditya.mukherjee@gmail.com , raj@rajaditya.com On Thu, Sep 3, 2015 at 7:36 PM, Matt Calabrese wrote:

...

AFAICT the only function for calculating arithmetic mean in boost is in the accumulators library and it's not really appropriate for integral types. In practice, properly calculating the mean for a range (especially a range that's only an input range but not a forward range) is surprisingly tricky to do in a way that avoids overflow and provides an exact result. I have a couple of solutions, including one that works for input ranges.

Is this something that people would find useful? Is there a similar algorithm already implemented in boost that I'm not aware of? The approach I've taken yields an exact result as a mixed number and should work without overflow (assuming only that the range's size is able to be represented by its difference type). I believe the implementation that works for input ranges could theoretically be adapted to work with Boost.Accumulators, though I'm admittedly not too familiar with the library.

-- -Matt Calabrese

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Matt Calabrese

6:20 p.m.

On Fri, Sep 4, 2015 at 7:43 AM, Rajaditya Mukherjee < rajaditya.mukherjee@gmail.com> wrote:

...

Hi Matt,

I am imagining such a library would be greatly useful for numerical scientists and statisticians for precise computing. Can you give us a toy example of how it works ?

I can't post the code without releasing it but I can briefly describe how it works. The general input-range version works by accumulating a mixed number that represents the mean at any given iteration. The mixed number has a whole number part that is of the value_type of the range, and the fractional part is represented as a numerator that is of the iterator's difference_type and a denominator that is of the iterator's difference type. The initial "value" of this state is {0, 0, 0}, which is indeterminate. At any given iteration, you are guaranteed that the fractional part's denominator is exactly equal to the number of elements that have been iterated over, the fractional part's numerator is in the range [0,denominator), and the whole number is the whole number part of the mean. In other words, the fractional part is never reduced. By having a state represented in this manner you are able to communicate back to the user both the exact average as well as the number of elements in the range (since the number of elements is equal to the denominator's value). This guarantee is also useful as we advance the state. The reason the result is 3 values representing a mixed number and not 2 values simply representing a rational number is because a rational number consisting of 2 values of either the difference type or the value type cannot be guaranteed to be able to represent the exact mean without possibility of overflow. With 3 values of appropriate type, you can guarantee that the mean can be represented since the whole number part will always be bounded by the smallest and largest value in the input range (and remember the whole part's type is the value_type of the range). The components of the fractional part use the difference_type of the range since I always represent the fractional part as N/D where D is the number of elements seen so far and N is in the range [0, D). In this manner, the exact mean is guaranteed to able to be represented at any given iteration, and the accumulation function can advance the state in a manner that will never overflow. You use it like any STL algorithm: ////////// auto result = mean(std::begin(range), std::end(range)); ////////// At the moment I do not have an implementation that works piecemeal as is akin to Boost.Accumulators, however since the implementation of the algorithm really is simply a call to std::accumulate, it should be very simple to expose such a form, and I imagine that it would be especially useful. Some further details on how the state is manipulated with each iteration: At any given point, we have the whole number part, the numerator of the fractional part, and the denominator of the fractional part (remember that the denominator is guaranteed to always be the size of the range up to this point). When advancing to the next state, I do an operation that has the same effect as taking a single, rational version of the mixed number N/D, changing this value to N/(D+1), and then adding in V/(D+1), where V is the next value in the range (remember that here D is the current number of elements that we've iterated over). This isn't directly what I do since I'm dealing with a mixed number instead of a rational number, but it's the same overall effect and is done in a way that cannot overflow. The only thing that is really hairy at this point is that there is mixed-mode arithmetic involved (the value_type and the difference_type need to be operands to arithmetic functions). At the moment I handle the mixed-mode arithmetic by explicitly casting, but ultimately it should probably be a customization point of the algorithm, especially if this is to be useful with user-defined, integer-like value types. -- -Matt Calabrese

Matt Calabrese

9:59 p.m.

On Fri, Sep 4, 2015 at 11:20 AM, Matt Calabrese wrote:

...

////////// auto result = mean(std::begin(range), std::end(range)); //////////

At the moment I do not have an implementation that works piecemeal as is akin to Boost.Accumulators, however since the implementation of the algorithm really is simply a call to std::accumulate, it should be very simple to expose such a form, and I imagine that it would be especially useful. Some further details on how the state is manipulated with each iteration:

I also do not support weighted averages yet, but it should be possible. In that case, I imagine that integral weights could be specified with the precondition of the algorithm being that the sum of the weights needs to be <= std::numeric_limits::max(). With such a precondition, I believe it's possible to produce an exact result using the same mixed number type that was described earlier. -- -Matt Calabrese

Bjorn Reese

5 Sep 5 Sep

11:07 a.m.

On 09/04/2015 08:20 PM, Matt Calabrese wrote:

...

At any given point, we have the whole number part, the numerator of the fractional part, and the denominator of the fractional part (remember that the denominator is guaranteed to always be the size of the range up to this point). When advancing to the next state, I do an operation that has the same effect as taking a single, rational version of the mixed number N/D, changing this value to N/(D+1), and then adding in V/(D+1), where V is the next value in the range (remember that here D is the current number of elements that we've iterated over). This isn't directly what I do since I'm dealing with a mixed number instead of a rational number, but it's the same overall effect and is done in a way that cannot overflow.

Can you provide an example of how it works? For example, if I have an input array with {10, 20, 30}, how are V, N, and D updated in each iteration?

Matt Calabrese

7:20 p.m.

On Sat, Sep 5, 2015 at 4:07 AM, Bjorn Reese wrote:

...

Can you provide an example of how it works? For example, if I have an input array with {10, 20, 30}, how are V, N, and D updated in each iteration?

Without posting the code, here is the generated output during accumulation for your example {10, 20, 30}. "State" in the following code is the state at the start of a given iteration. "Incremented state" is the state after the transformation of N/D to N/(D+1) I described in an earlier reply but before adding in the value from the current. Again, that transformation is done on a mixed number and without the possibility of overflow. Only additive operators, division, and modulus are used. There is no overflow with the additive operations because I do a difference calculation from the current denominator before performing the operation, carrying over to the whole number part when necessary. State : 0 0/0 Incremented State: 0 0/1 State : 10 0/1 Incremented State: 5 0/2 State : 15 0/2 Incremented State: 10 0/3 State : 20 0/3 Your sample array isn't too interesting in terms of the state changes, so here's a more complex one for the array {11, 17, 3, 31, 27} State : 0 0/0 Incremented State: 0 0/1 State : 11 0/1 Incremented State: 5 1/2 State : 14 0/2 Incremented State: 9 1/3 State : 10 1/3 Incremented State: 7 3/4 State : 15 2/4 Incremented State: 12 2/5 State : 17 4/5 Anyway, it seems like there's interest so I'll try to release it. What exactly is the review process for adding algorithms, or is it still somewhat informal? The algorithm should be able to be adapted to introduce weights. It would work very similarly, only when advancing the state, it would be based on the weight rather than always effectively using (D+1). Similarly, the denominator would no longer always exactly match the number of elements, but instead it would always match the sum of the weights that were encountered. I haven't implemented this weighted variation, but it shouldn't be too much of a change. -- -Matt Calabrese

Bjorn Reese

6 Sep 6 Sep

11:46 a.m.

On 09/05/2015 09:20 PM, Matt Calabrese wrote:

...

Your sample array isn't too interesting in terms of the state changes, so here's a more complex one for the array {11, 17, 3, 31, 27}

State : 0 0/0 Incremented State: 0 0/1

State : 11 0/1 Incremented State: 5 1/2

State : 14 0/2 Incremented State: 9 1/3

State : 10 1/3 Incremented State: 7 3/4

State : 15 2/4 Incremented State: 12 2/5

State : 17 4/5

I still do not understand how you calculate these state values, but looking at them it becomes clear that the average M[i] at iteration i is given by (assuming W[i] is the first entry in the state tuple at iteration i) M[i] = W[i] + (N[i] / D[i]) I am wondering if you can also calculate the running sum from the state variables by? S[i] = W[i] * D[i] + N[i]

...

Anyway, it seems like there's interest so I'll try to release it. What exactly is the review process for adding algorithms, or is it still somewhat informal?

If they are to be added to Boost.Algorithm, then you need to get the interest of the maintainer. He will decide what level of review is needed, if any.

Matt Calabrese

6:21 p.m.

On Sun, Sep 6, 2015 at 4:46 AM, Bjorn Reese wrote:

...

I still do not understand how you calculate these state values

If it's not already clear enough based on the descriptions, you'll have to wait until I release the code. On Sun, Sep 6, 2015 at 4:46 AM, Bjorn Reese wrote:

...

looking at them it becomes clear that the average M[i] at iteration i is given by (assuming W[i] is the first entry in the state tuple at iteration i)

M[i] = W[i] + (N[i] / D[i])

...

Logically that is the value that is represented, yes. The mixed number state is { W, N, D } where W is the whole number part, and N/D is the fractional part, where N an integer is in the range [0, D) On Sun, Sep 6, 2015 at 4:46 AM, Bjorn Reese wrote:

...

I am wondering if you can also calculate the running sum from the state variables by?

S[i] = W[i] * D[i] + N[i]

If this were just pure mathematics or if we were talking about infinite precision types, then yes, you could do that, but in practice we are dealing with built-in integer types or other user-define integer types that can only represent one of a certain closed set of integer values. Even with a range that only has two values in it you can potentially overflow when calculating the sum, which is why the mean algorithm that is that is presented never directly represents the sum. In this way, the algorithm is able to produce an exact mean for a range of any size, where each element of the range can hold any integer value that the value type of the range can hold. There are no additional preconditions for the function and it should always produce a valid result. On Sun, Sep 6, 2015 at 4:46 AM, Bjorn Reese wrote:

...

If they are to be added to Boost.Algorithm, then you need to get the interest of the maintainer. He will decide what level of review is needed, if any.

Yeah, I CC'd Marshall and Eric in earlier replies, so I imagine they will eventually respond. -- -Matt Calabrese

Marshall Clow

17 Sep 17 Sep

2:13 p.m.

On Sat, Sep 5, 2015 at 12:20 PM, Matt Calabrese wrote:

...

Anyway, it seems like there's interest so I'll try to release it. What exactly is the review process for adding algorithms, or is it still somewhat informal?

The algorithm should be able to be adapted to introduce weights. It would work very similarly, only when advancing the state, it would be based on the weight rather than always effectively using (D+1). Similarly, the denominator would no longer always exactly match the number of elements, but instead it would always match the sum of the weights that were encountered. I haven't implemented this weighted variation, but it shouldn't be too much of a change.

Yes, it's pretty informal. You need to convince the library maintainer (me :-)) that it's interesting and that people would use it. I don't think that's going to be too much of a problem. What I would suggest is that you make a pull request, and then post the URL here, and encourage people to comment/review the code that way. -- Marshall P.S. Sorry for the slow response.

Seth

4:40 p.m.

On 09/17/2015 04:13 PM, Marshall Clow wrote:

...

On Sat, Sep 5, 2015 at 12:20 PM, Matt Calabrese wrote:

The algorithm should be able to be adapted to introduce weights. It would work very similarly, only when advancing the state, it would be based on the weight rather than always effectively using (D+1). Similarly, the denominator would no longer always exactly match the number of elements, but instead it would always match the sum of the weights that were encountered. I haven't implemented this weighted variation, but it shouldn't be too much of a change. I think Boost Accumulator already does these.

E.g. http://www.boost.org/doc/libs/1_59_0/doc/html/accumulators/user_s_guide.html... Seth

Matt Calabrese

7:10 p.m.

On Thu, Sep 17, 2015 at 9:40 AM, Seth wrote:

...

On 09/17/2015 04:13 PM, Marshall Clow wrote:

...
On Sat, Sep 5, 2015 at 12:20 PM, Matt Calabrese wrote:

The algorithm should be able to be adapted to introduce weights. It would work very similarly, only when advancing the state, it would be based on the weight rather than always effectively using (D+1). Similarly, the denominator would no longer always exactly match the number of elements, but instead it would always match the sum of the weights that were encountered. I haven't implemented this weighted variation, but it shouldn't be too much of a change. I think Boost Accumulator already does these.

E.g.

http://www.boost.org/doc/libs/1_59_0/doc/html/accumulators/user_s_guide.html...

Yeah, it supports weights, if that's specifically what you're referring to, though I thought that when I skimmed through the mean implementation it was not implemented in a way that avoids I.E. overflow. I could have been mistaken, though. Also, the code just got released, so I'll put it up somewhere this weekend if it's not redundant with what Boost.Accumulators already does somewhere. -Matt Calabrese

3469

Age (days ago)

3483

Last active (days ago)

List overview

Download

10 comments

5 participants

participants (5)

Bjorn Reese
Marshall Clow
Matt Calabrese
Rajaditya Mukherjee
Seth