Interest in some Algorithms

older
C++Now 2016 Registration is Open

Dean Michael Berris

18 Jan 2016 18 Jan '16

12:13 p.m.

Hi Boost! It’s been a while, I hope you all have been doing well. I wanted to ask whether anybody’s interested in a couple of algorithms, to be made part of the Boost.Algorithm library. One is tentatively named ‘set_inplace_difference' which has the following synopsis: template I1 set_inplace_difference(I1 f1, I1 f2, I2 f2, I2 l2); template I1 set_inplace_difference(I1 f1, I1 f2, I2 f2, I2 l2, C comp); It’s essentially a set_difference which re-uses the storage of the range [f1, l1). It takes elements from [f1, l1) that are in [f2, l2) and moves them to [p, l1) where p is the partition point (or the new “end”). This is best used in conjunction with “erase” on vectors and/or std::deque. These algorithms return p. Another is tentatively named ‘set_inplace_intersection’ which instead of the above which does the difference, does an intersection instead. Requirements on I1 and I2 are that they are RandomAccess iterators, and that they are sorted. I haven’t figured out whether partially-sorted is a sufficient condition. I also haven’t figured out whether it would work for weaker iterator classes, but so far my implementation relies on std::lower_bound and std::upper_bound. If there’s enough interest, I can prepare a pull request after going through my employer’s open-source patching process. Looking forward to hearing your thoughts! Cheers

Show replies by date

Steven Watanabe

18 Jan 18 Jan

8:32 p.m.

AMDG On 01/18/2016 05:13 AM, Dean Michael Berris wrote:

...

It’s been a while, I hope you all have been doing well.

I wanted to ask whether anybody’s interested in a couple of algorithms, to be made part of the Boost.Algorithm library.

One is tentatively named ‘set_inplace_difference' which has the following synopsis:

template I1 set_inplace_difference(I1 f1, I1 f2, I2 f2, I2 l2);

template I1 set_inplace_difference(I1 f1, I1 f2, I2 f2, I2 l2, C comp);

It’s essentially a set_difference which re-uses the storage of the range [f1, l1). It takes elements from [f1, l1) that are in [f2, l2) and moves them to [p, l1) where p is the partition point (or the new “end”). This is best used in conjunction with “erase” on vectors and/or std::deque. These algorithms return p.

Another is tentatively named ‘set_inplace_intersection’ which instead of the above which does the difference, does an intersection instead.

Requirements on I1 and I2 are that they are RandomAccess iterators, and that they are sorted. I haven’t figured out whether partially-sorted is a sufficient condition. I also haven’t figured out whether it would work for weaker iterator classes, but so far my implementation relies on std::lower_bound and std::upper_bound.

Why do you need lower/upper_bound? set_difference can be implemented easily with a straight linear scan through the two sequences. In addition, it's quite easy to implement set_difference in such a way that result == first1 works, which effectively makes it operate in place, although the standard forbids such usage. In Christ, Steven Watanabe

Phil Endecott

19 Jan 19 Jan

5:31 p.m.

Steven Watanabe wrote:

...

On 01/18/2016 05:13 AM, Dean Michael Berris wrote:

...
I wanted to ask whether anybody's interested in a couple of algorithms, to be made part of the Boost.Algorithm library.

One is tentatively named ?set_inplace_difference' which has the following synopsis:

template I1 set_inplace_difference(I1 f1, I1 l1, I2 f2, I2 l2);

template I1 set_inplace_difference(I1 f1, I1 l1, I2 f2, I2 l2, C comp);

(I've fixed an obvious typo above...)

...

...
It's essentially a set_difference which re-uses the storage of the range [f1, l1). It takes elements from [f1, l1) that are in [f2, l2) and moves them to [p, l1) where p is the partition point (or the new 'end'). This is best used in conjunction with 'erase' on vectors and/or std::deque. These algorithms return p.

Another is tentatively named 'set_inplace_intersection' which instead of the above which does the difference, does an intersection instead.

Requirements on I1 and I2 are that they are RandomAccess iterators, and that they are sorted. I haven't figured out whether partially-sorted is a sufficient condition. I also haven't figured out whether it would work for weaker iterator classes, but so far my implementation relies on std::lower_bound and std::upper_bound.

Why do you need lower/upper_bound? set_difference can be implemented easily with a straight linear scan through the two sequences.

If you do that, the complexity is at least O(N) even for the best case where the ranges don't overlap. Here is my attempt at this algorithm, which might be O(N log N) worst case if I'm thinking straight. It is O(1) when the ranges don't overlap. Dean, is this anything like what you've implemented? template I1 set_inplace_difference(I1 f1, I1 l1, I2 f2, I2 l2) { // If either range is empty, there is nothing to do: if (f1 == l1 || f2 == l2) return l1; // If the ranges are both singletons, compare the values: if (l1-f1 == 1 && l2-f2 == 1) { if (*f1 == *f2) return f1; else return l1; } // If ranges don't overlap, there is nothing to do: if (*(l2-1) < *f1 || *f2 > *(l1-1)) return l1; // Recurse, splitting the larger range: if (l2-f2 > l1-f1) { // Second range is larger. Split it at its midpoint and // subtract each half in turn. I2 m2 = f2 + (l2-f2)/2; I1 p = set_inplace_difference(f1,l1, m2,l2); return set_inplace_difference(f1,p, f2,m2); } else { // First range is larger. Split it at its midpoint and // subtract from each half in turn. I1 m1 = f1 + (l1-f1)/2; I1 p = set_inplace_difference(f1,m1, f2,l2); I1 q = set_inplace_difference(m1,l1, f2,l2); // Now merge the two results. return std::rotate(p,m1,q); } } Regards, Phil.

Dean Michael Berris

25 Jan 25 Jan

4:48 a.m.

So sorry for the delay on the questions here. Some answers inline below. On Wed, Jan 20, 2016 at 4:31 AM Phil Endecott < spam_from_boost_dev@chezphil.org> wrote:

...

Steven Watanabe wrote:

...
On 01/18/2016 05:13 AM, Dean Michael Berris wrote:

...
I wanted to ask whether anybody's interested in a couple of algorithms, to be made part of the Boost.Algorithm library.

One is tentatively named ?set_inplace_difference' which has the following synopsis:

template I1 set_inplace_difference(I1 f1, I1 l1, I2 f2, I2 l2);

template I1 set_inplace_difference(I1 f1, I1 l1, I2 f2, I2 l2, C comp);

(I've fixed an obvious typo above...)

...
...
It's essentially a set_difference which re-uses the storage of the range [f1, l1). It takes elements from [f1, l1) that are in [f2, l2) and moves them to [p, l1) where p is the partition point (or the new 'end'). This is best used in conjunction with 'erase' on vectors and/or std::deque. These algorithms return p.

Another is tentatively named 'set_inplace_intersection' which instead of the above which does the difference, does an intersection instead.

Requirements on I1 and I2 are that they are RandomAccess iterators, and that they are sorted. I haven't figured out whether partially-sorted is a sufficient condition. I also haven't figured out whether it would work for weaker iterator classes, but so far my implementation relies on std::lower_bound and std::upper_bound.

Why do you need lower/upper_bound? set_difference can be implemented easily with a straight linear scan through the two sequences.

That is true, and I suppose that's acceptable for some values of N. In practice, when you have two sorted ranges, you want to be able to eliminate as many of the elements as you can without having to walk the whole range.

...

If you do that, the complexity is at least O(N) even for the best case where the ranges don't overlap.

Here is my attempt at this algorithm, which might be O(N log N) worst case if I'm thinking straight. It is O(1) when the ranges don't overlap. Dean, is this anything like what you've implemented?

It's close, I've implemented this without the recursion.

...

template I1 set_inplace_difference(I1 f1, I1 l1, I2 f2, I2 l2) { // If either range is empty, there is nothing to do: if (f1 == l1 || f2 == l2) return l1;

// If the ranges are both singletons, compare the values: if (l1-f1 == 1 && l2-f2 == 1) { if (*f1 == *f2) return f1; else return l1; }

// If ranges don't overlap, there is nothing to do: if (*(l2-1) < *f1 || *f2 > *(l1-1)) return l1;

If you can use std::lower_bound and std::upper_bound here, there's a simpler test. :)

...

// Recurse, splitting the larger range: if (l2-f2 > l1-f1) { // Second range is larger. Split it at its midpoint and // subtract each half in turn. I2 m2 = f2 + (l2-f2)/2; I1 p = set_inplace_difference(f1,l1, m2,l2); return set_inplace_difference(f1,p, f2,m2);

} else { // First range is larger. Split it at its midpoint and // subtract from each half in turn. I1 m1 = f1 + (l1-f1)/2; I1 p = set_inplace_difference(f1,m1, f2,l2); I1 q = set_inplace_difference(m1,l1, f2,l2); // Now merge the two results. return std::rotate(p,m1,q); } }

This looks very interesting Phil. I suppose if we measure this on really big overlapping ranges, there might be an issue with the recursion. You can leverage the fact that the ranges are already sorted, using std::lower_bound and std::upper_bound to find contiguous ranges that can be rotated out of the way incrementally. Seems there is some interest for this, Marshall do I just send you a pull request for the implementation? What is the process here? Cheers

Phil Endecott

6:15 p.m.

Dean Michael Berris wrote:

...

You can leverage the fact that the ranges are already sorted, using std::lower_bound and std::upper_bound to find contiguous ranges that can be rotated out of the way incrementally.

Let's see the code. And the benchmarks. And your complexity claims! The challenge is surely to get linear complexity worst-case, which you can do with the sort of scan Steven suggested and also has good cache locality characteristics, while keeping the sublinear complexity in best cases. Here's the obvious linear algorithm: template I1 set_inplace_difference(I1 f1, I1 l1, I2 f2, I2 l2) { I1 o = f1; I1 i1 = f1; I2 i2 = f2; while (i1 < l1 && i2 < l2) { if (*i1 < *i2) { std::swap(*o,*i1); ++i1; ++o; } else if (*i2 < *i1) { ++i2; } else { ++i1; ++i2; } } while (i1 < l1) { std::swap(*o,*i1); ++i1; ++o; } return o; } Now I think that you can make that logarithmic in the best case, but without making it O(N log N) in the worst case like my recursive algorithm did, like this: template I1 set_inplace_difference(I1 f1, I1 l1, I2 f2, I2 l2) { I1 o = f1; I1 i1 = f1; I2 i2 = f2; int log_size = log2(l2-f2); int n = 0; while (i1 < l1 && i2 < l2) { if (*i1 < *i2) { std::swap(*o,*i1); ++i1; ++o; n = 0; } else if (*i2 < *i1) { ++i2; ++n; if (n == log_size) { i2 = std::lower_bound(i2,l2,*i1); n = 0; } } else { ++i1; ++i2; n = 0; } } if (i1 == o) return l1; while (i1 < l1) { std::swap(*o,*i1); ++i1; ++o; } return o; } That optimises skipping through the second sequence. You could do something similar for the first sequence in the first if-block but only for the case where i1==o, and also in the third if-block for sequences of equal values. (I hadn't previously considered duplicate values; if these are sets, is that allowed?) Something else to note is that my recursive algorithm keeps the to-discard items sorted, so you can actually think of it as a kind of partition algorithm i.e. both halves of the output are useful. The linear algorithms don't do that. Regards, Phil.

Dean Michael Berris

27 Jan 27 Jan

3:50 a.m.

...

On 26 Jan 2016, at 05:15, Phil Endecott wrote:

Dean Michael Berris wrote:

...
You can leverage the fact that the ranges are already sorted, using std::lower_bound and std::upper_bound to find contiguous ranges that can be rotated out of the way incrementally.

Let's see the code. And the benchmarks. And your complexity claims!

Indeed. I’ll go through my employer’s patching process for this. :)

...

The challenge is surely to get linear complexity worst-case, which you can do with the sort of scan Steven suggested and also has good cache locality characteristics, while keeping the sublinear complexity in best cases.

I think if it was possible to do a specialisation on taking forward iterators, the linear algorithm would be fine. I've forgotten about whether this is possible or even a good idea to do without concept support in the language yet.

...

Here's the obvious linear algorithm:

template I1 set_inplace_difference(I1 f1, I1 l1, I2 f2, I2 l2) { I1 o = f1; I1 i1 = f1; I2 i2 = f2;

while (i1 < l1 && i2 < l2) { if (*i1 < *i2) { std::swap(*o,*i1); ++i1; ++o; } else if (*i2 < *i1) { ++i2; } else { ++i1; ++i2; } }

while (i1 < l1) { std::swap(*o,*i1); ++i1; ++o; }

return o; }

Now I think that you can make that logarithmic in the best case, but without making it O(N log N) in the worst case like my recursive algorithm did, like this:

template I1 set_inplace_difference(I1 f1, I1 l1, I2 f2, I2 l2) { I1 o = f1; I1 i1 = f1; I2 i2 = f2; int log_size = log2(l2-f2);

int n = 0; while (i1 < l1 && i2 < l2) { if (*i1 < *i2) { std::swap(*o,*i1); ++i1; ++o; n = 0; } else if (*i2 < *i1) { ++i2; ++n; if (n == log_size) { i2 = std::lower_bound(i2,l2,*i1); n = 0; } } else { ++i1; ++i2; n = 0; } }

if (i1 == o) return l1;

while (i1 < l1) { std::swap(*o,*i1); ++i1; ++o; }

return o; }

That optimises skipping through the second sequence. You could do something similar for the first sequence in the first if-block but only for the case where i1==o, and also in the third if-block for sequences of equal values. (I hadn't previously considered duplicate values; if these are sets, is that allowed?)

I had been working on the assumption that sets were allowed to have multiple contiguous runs of similar elements. If this wasn't the case then indeed it would be simpler (it would obviate the need to use std::upper_bound at least from my implementation).

...

Something else to note is that my recursive algorithm keeps the to-discard items sorted, so you can actually think of it as a kind of partition algorithm i.e. both halves of the output are useful. The linear algorithms don't do that.

Right, it's a form of stable partition. Maybe it should be called a "stable_set_inplace_difference", and that would certainly be useful. Thanks Phil!

Michael Marcin

29 Jan 29 Jan

3:28 a.m.

I have this algorithm in my toolkit and occasionally it is very useful so I wouldn't mind seeing it in boost. FWIW here's my implementation /// Removes elements from the first sorted range which /// are present in the second sorted range. /// Returns an iterator the the new end of the first range. /// /// If there are multiple occurrences of a value the difference /// includes as many occurrences of a given value as exist in /// the first range, minus the amount of matching elements in the //// second, preserving order. template< typename I1, typename I2 > I1 set_inplace_difference( I1 f1, I1 l1, I2 f2, I2 l2 ) { // skip in-place elements for ( ;; ) { if ( f1 == l1 || f2 == l2 ) { return l1; // no elements to remove } else if ( *f1 < *f2 ) { ++f1; } else if ( *f2 < *f1 ) { ++f2; } else { break; // found an out of order element } } // move elements that are not in the second range // i.e. remove elements that ARE in the second range I1 o = f1; for ( ++f1, ++f2; f1 != l1 && f2 != l2; ) { if ( *f1 < *f2 ) { *o = std::move( *f1 ); ++o; ++f1; } else if ( *f2 < *f1 ) { ++f2; } else { ++f1; ++f2; } } // move the remaining elements o = std::move( f1, l1, o ); return o; }

Michael Marcin

3:32 a.m.

On 1/28/2016 9:28 PM, Michael Marcin wrote:

...

break; // found an out of order element

sorry this comment should read // found match

3225

Age (days ago)

3236

Last active (days ago)

List overview

Download

7 comments

4 participants

participants (4)

Dean Michael Berris
Michael Marcin
Phil Endecott
Steven Watanabe