Adding 'indirect_sort' to boost::algorithm
At Boostcon last month, there was a discussion of indirect sorting algorithms to boost::algorithm. An “Indirect sort” (called 'argsort' in numpy - https://numpy.org/doc/stable/reference/generated/numpy.argsort.html) returns a sequence of indices into the collection describing how the collection can be rearranged in order to become sorted. So I created https://github.com/boostorg/algorithm/pull/117 to start down this road. (There’s clearly more that could be done here - stable_sort, partial_sort, nth_element, etc) Comments welcomed. — Marshall
On 15/06/2023 13:42, Marshall Clow wrote:
At Boostcon last month, there was a discussion of indirect sorting algorithms to boost::algorithm.
An “Indirect sort” (called 'argsort' in numpy - https://numpy.org/doc/stable/reference/generated/numpy.argsort.html) returns a sequence of indices into the collection describing how the collection can be rearranged in order to become sorted.
So I created https://github.com/boostorg/algorithm/pull/117 to start down this road. (There’s clearly more that could be done here - stable_sort, partial_sort, nth_element, etc)
Comments welcomed.
It invites an interesting question. A sequence of indexes is only particularly useful for a random-access collection. If it instead generated and then outputed a sorted sequence of iterators, it could potentially support other input collection types (such as std::list, or something else with forward-only iterators) as well. But the utility of this depends on the subsequent usage. A collection of iterators is better than a collection of indexes (and frequently the same size, when not using checked iterators) if your purpose is to then iterate the collection in sorted order, or otherwise perform read-only operations. A collection of iterators is less safe if you intend to perform mutations on the underlying list (such as trying to actually realize the sorted order), as this will sometimes invalidate them depending on the underlying collection type and the specific operations used. (Indexes can also be rendered similarly useless, but are somewhat easier to perform equivalent operations to keep in sync with less performance issues than trying to do the same with non-random-access iterators.)
On Jun 14, 2023, at 7:25 PM, Gavin Lambert via Boost
On 15/06/2023 13:42, Marshall Clow wrote:
At Boostcon last month, there was a discussion of indirect sorting algorithms to boost::algorithm. An “Indirect sort” (called 'argsort' in numpy - https://numpy.org/doc/stable/reference/generated/numpy.argsort.html) returns a sequence of indices into the collection describing how the collection can be rearranged in order to become sorted. So I created https://github.com/boostorg/algorithm/pull/117 to start down this road. (There’s clearly more that could be done here - stable_sort, partial_sort, nth_element, etc) Comments welcomed.
It invites an interesting question. A sequence of indexes is only particularly useful for a random-access collection.
If it instead generated and then outputed a sorted sequence of iterators, it could potentially support other input collection types (such as std::list, or something else with forward-only iterators) as well.
Yes, but sorting is (in general) something we only do on random-access sequences. But I get your point - if we were to return a sequence of iterators, we could sort non-RA sequences (at the cost of a single extra traversal at the beginning of the sort)
But the utility of this depends on the subsequent usage. A collection of iterators is better than a collection of indexes (and frequently the same size, when not using checked iterators) if your purpose is to then iterate the collection in sorted order, or otherwise perform read-only operations.
For random-access collections, they’re essentially the same.
A collection of iterators is less safe if you intend to perform mutations on the underlying list (such as trying to actually realize the sorted order), as this will sometimes invalidate them depending on the underlying collection type and the specific operations used. (Indexes can also be rendered similarly useless, but are somewhat easier to perform equivalent operations to keep in sync with less performance issues than trying to do the same with non-random-access iterators.)
— Marshall
On Thu, Jun 15, 2023 at 3:42 AM Marshall Clow via Boost < boost@lists.boost.org> wrote:
At Boostcon last month, there was a discussion of indirect sorting algorithms to boost::algorithm. An “Indirect sort” [...] returns a sequence of indices into the collection describing how the collection can be rearranged in order to become sorted. [...] Comments welcomed.
Hi Marshall. I've long wondered why something like that wasn't in Boost. In our code base, we've had a "cosort" for several decades, which takes several arrays/vectors, sorts the first one, and applies the same "reordering" to the others. This comes up in "data-oriented" designs where you have N arrays-of-values instead of 1 array-of-struct-of-N-members. So you are adding a building block to have such a "cosort". You're just missing a way to apply that reordering to several (random-access) collections. My $0.02. PS: Note that I'm not sure what the correct term / name is for that algo. I always remember it as "cosort" myself, but there has to be a better name. PPS: https://hackmd.io/@Q66MPiW4T7yNTKOCaEb-Lw/ryfenBCO5 is a nice article on Data-Oriented Designs, which mentions the talk from Andrey Kelly (of Zig fame) I was searching for.
So you are adding a building block to have such a "cosort". You're just missing a way to apply that reordering to several (random-access) collections.
There already is: Basically you first do an indirect sort and then use `boost::algorithm::apply_permutation` on each collection with the result permutation. See the test case and documentation of indirect_sort. Alex
A couple of things. 1. It seems you are initializing the indices of the Permutation <ret> return twice – first to 0 and then to the position in iota(). An optimization would be to first declare an empty Permutation and then reserve() the count of elements you want and then call iota(). 2. In my former codebase I had a similar indirect_sort() that accepted an input array of indices instead of generating them inside the method. This allows a “select and then indirect_sort this set of indices” which is what I generally used. 3. (this less important but just an aside) I also implemented two other types of indirect sort in my old code base due to necessity. A “pointer to element” sort – this allows sorting on post-iterator-access elements – allowing sorting of elements stored in non-indexable iterators. Also an “offset sort” which is more or less like the pointer-to-element sort but allowed sorting of sub-fields of elements (though thinking about it now I don’t remember how I did this and I no longer have access to the code). I used the “pointer to element” sort the most of the time – the size of an 64bit index and a pointer are usually the same so no loss in terms of storage – the return is a set of sorted pointer-to-elements. bien Sent from Mailhttps://go.microsoft.com/fwlink/?LinkId=550986 for Windows From: Alexander Grund via Boostmailto:boost@lists.boost.org Sent: Thursday, June 15, 2023 3:04 AM To: boost@lists.boost.orgmailto:boost@lists.boost.org Cc: Alexander Grundmailto:alexander.grund@tu-dresden.de Subject: Re: [boost] Adding 'indirect_sort' to boost::algorithm
So you are adding a building block to have such a "cosort". You're just missing a way to apply that reordering to several (random-access) collections.
There already is: Basically you first do an indirect sort and then use `boost::algorithm::apply_permutation` on each collection with the result permutation. See the test case and documentation of indirect_sort. Alex
Wrt (1) I guess you’d have to use something other than iota() since it takes a cue from the initialized size of the array – necessitating initial throwaway initialization of the elements. bien Sent from Mailhttps://go.microsoft.com/fwlink/?LinkId=550986 for Windows From: David Bien via Boostmailto:boost@lists.boost.org Sent: Thursday, June 15, 2023 12:00 PM To: boost@lists.boost.orgmailto:boost@lists.boost.org Cc: David Bienmailto:davidbien@hotmail.com Subject: Re: [boost] Adding 'indirect_sort' to boost::algorithm A couple of things. 1. It seems you are initializing the indices of the Permutation <ret> return twice – first to 0 and then to the position in iota(). An optimization would be to first declare an empty Permutation and then reserve() the count of elements you want and then call iota(). 2. In my former codebase I had a similar indirect_sort() that accepted an input array of indices instead of generating them inside the method. This allows a “select and then indirect_sort this set of indices” which is what I generally used. 3. (this less important but just an aside) I also implemented two other types of indirect sort in my old code base due to necessity. A “pointer to element” sort – this allows sorting on post-iterator-access elements – allowing sorting of elements stored in non-indexable iterators. Also an “offset sort” which is more or less like the pointer-to-element sort but allowed sorting of sub-fields of elements (though thinking about it now I don’t remember how I did this and I no longer have access to the code). I used the “pointer to element” sort the most of the time – the size of an 64bit index and a pointer are usually the same so no loss in terms of storage – the return is a set of sorted pointer-to-elements. bien Sent from Mail<https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgo.microsoft.com%2Ffwlink%2F%3FLinkId%3D550986&data=05%7C01%7C%7Cb05ca86255704e0564e508db6dca57fe%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638224488055949615%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=O4stdx5U6GfpW0jfVok2GFmQI7FLQXM2NS6%2B%2B%2FLJ9tc%3D&reserved=0https://go.microsoft.com/fwlink/?LinkId=550986> for Windows From: Alexander Grund via Boostmailto:boost@lists.boost.org Sent: Thursday, June 15, 2023 3:04 AM To: boost@lists.boost.orgmailto:boost@lists.boost.org Cc: Alexander Grundmailto:alexander.grund@tu-dresden.de Subject: Re: [boost] Adding 'indirect_sort' to boost::algorithm
So you are adding a building block to have such a "cosort". You're just missing a way to apply that reordering to several (random-access) collections.
There already is: Basically you first do an indirect sort and then use `boost::algorithm::apply_permutation` on each collection with the result permutation. See the test case and documentation of indirect_sort. Alex _______________________________________________ Unsubscribe & other changes: https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Flists.boost.org%2Fmailman%2Flistinfo.cgi%2Fboost&data=05%7C01%7C%7Cb05ca86255704e0564e508db6dca57fe%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638224488055949615%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=8xi35vQdmkaAeZZswWGxKX0Uyxa19s8IPf3%2Bg7nnNhc%3D&reserved=0http://lists.boost.org/mailman/listinfo.cgi/boost
Hi, I'm Francisco Tapia, from the Sort Library. I would like to point out a couple of things about indirect sort 1- The name is not very correct, since indirect sort is usually used for a process in which an ordered index is created and later, we physically order the elements that we have ordered logically in an index. Indirect sorting is useful with large objects, where it is much more expensive to move the object than to compare it. In the library, it was studied but its use was discarded, because when the data is almost ordered (data added at the beginning or end, or some elements whose value has been modified) the final step of passing from logical ordering with an index to a physical ordering, greatly increases the times with respect to direct ordination. If you are interested in it, you can find the code of the discarded implementation at *https://github.com/boostorg/sort/blob/develop/include/boost/sort/common/indi... https://github.com/boostorg/sort/blob/develop/include/boost/sort/common/indi...* 2.- It would be more correct and expressive for users to create index. And the most suitable and simple place for users to find is the Sort Library. I have created a repository with a very simple implementation of such a function. The index is a simple vector of iterators. You can find it at *https://github.com/fjtapia/create_index https://github.com/fjtapia/create_index* It is a basic version (ideas and suggestions are welcome), but can you give us an idea about how to do it? Francisco El jue, 15 jun 2023 a las 20:15, David Bien via Boost (< boost@lists.boost.org>) escribió:
Wrt (1) I guess you’d have to use something other than iota() since it takes a cue from the initialized size of the array – necessitating initial throwaway initialization of the elements.
bien
Sent from Mailhttps://go.microsoft.com/fwlink/?LinkId=550986 for Windows
From: David Bien via Boostmailto:boost@lists.boost.org Sent: Thursday, June 15, 2023 12:00 PM To: boost@lists.boost.orgmailto:boost@lists.boost.org Cc: David Bienmailto:davidbien@hotmail.com Subject: Re: [boost] Adding 'indirect_sort' to boost::algorithm
A couple of things.
1. It seems you are initializing the indices of the Permutation <ret> return twice – first to 0 and then to the position in iota(). An optimization would be to first declare an empty Permutation and then reserve() the count of elements you want and then call iota(). 2. In my former codebase I had a similar indirect_sort() that accepted an input array of indices instead of generating them inside the method. This allows a “select and then indirect_sort this set of indices” which is what I generally used. 3. (this less important but just an aside) I also implemented two other types of indirect sort in my old code base due to necessity. A “pointer to element” sort – this allows sorting on post-iterator-access elements – allowing sorting of elements stored in non-indexable iterators. Also an “offset sort” which is more or less like the pointer-to-element sort but allowed sorting of sub-fields of elements (though thinking about it now I don’t remember how I did this and I no longer have access to the code). I used the “pointer to element” sort the most of the time – the size of an 64bit index and a pointer are usually the same so no loss in terms of storage – the return is a set of sorted pointer-to-elements. bien
Sent from Mail< https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgo.microsoft.com%2Ffwlink%2F%3FLinkId%3D550986&data=05%7C01%7C%7Cb05ca86255704e0564e508db6dca57fe%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638224488055949615%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=O4stdx5U6GfpW0jfVok2GFmQI7FLQXM2NS6%2B%2B%2FLJ9tc%3D&reserved=0 https://go.microsoft.com/fwlink/?LinkId=550986> for Windows
From: Alexander Grund via Boostmailto:boost@lists.boost.org Sent: Thursday, June 15, 2023 3:04 AM To: boost@lists.boost.orgmailto:boost@lists.boost.org Cc: Alexander Grundmailto:alexander.grund@tu-dresden.de Subject: Re: [boost] Adding 'indirect_sort' to boost::algorithm
So you are adding a building block to have such a "cosort". You're just missing a way to apply that reordering to several (random-access) collections.
There already is: Basically you first do an indirect sort and then use `boost::algorithm::apply_permutation` on each collection with the result permutation. See the test case and documentation of indirect_sort.
Alex
_______________________________________________ Unsubscribe & other changes: https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Flists.boost.org%2Fmailman%2Flistinfo.cgi%2Fboost&data=05%7C01%7C%7Cb05ca86255704e0564e508db6dca57fe%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638224488055949615%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=8xi35vQdmkaAeZZswWGxKX0Uyxa19s8IPf3%2Bg7nnNhc%3D&reserved=0 http://lists.boost.org/mailman/listinfo.cgi/boost
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
On Jun 20, 2023, at 9:49 AM, Francisco José Tapia via Boost
Hi, I'm Francisco Tapia, from the Sort Library.
I would like to point out a couple of things about indirect sort
1- The name is not very correct, since indirect sort is usually used for a process in which an ordered index is created and later, we physically order the elements that we have ordered logically in an index. Indirect sorting is useful with large objects, where it is much more expensive to move the object than to compare it.
In the library, it was studied but its use was discarded, because when the data is almost ordered (data added at the beginning or end, or some elements whose value has been modified) the final step of passing from logical ordering with an index to a physical ordering, greatly increases the times with respect to direct ordination. If you are interested in it, you can find the code of the discarded implementation at *https://github.com/boostorg/sort/blob/develop/include/boost/sort/common/indi... https://github.com/boostorg/sort/blob/develop/include/boost/sort/common/indi...*
2.- It would be more correct and expressive for users to create index. And the most suitable and simple place for users to find is the Sort Library. I have created a repository with a very simple implementation of such a function. The index is a simple vector of iterators. You can find it at *https://github.com/fjtapia/create_index https://github.com/fjtapia/create_index*
It is a basic version (ideas and suggestions are welcome), but can you give us an idea about how to do it?
I thought about putting this into Boost.sort. It may be that that is the best place for it - I don’t know. One use case that has come up in the discussion is the ability to apply the index to multiple sequences (a ’struct of arrays’ - see the docs in the PR), and creating an index that contains iterators (rather than positions) doesn’t help with that. Also, I worry about discoverability - someone who is looking to sort something probably won’t look at a function named `create_index`. :-( — Marshall P.S. The deadline for “new features/major code changes” for the next Boost release is coming up this week….
Francisco
El jue, 15 jun 2023 a las 20:15, David Bien via Boost (< boost@lists.boost.org>) escribió:
Wrt (1) I guess you’d have to use something other than iota() since it takes a cue from the initialized size of the array – necessitating initial throwaway initialization of the elements.
bien
Sent from Mailhttps://go.microsoft.com/fwlink/?LinkId=550986 for Windows
From: David Bien via Boostmailto:boost@lists.boost.org Sent: Thursday, June 15, 2023 12:00 PM To: boost@lists.boost.orgmailto:boost@lists.boost.org Cc: David Bienmailto:davidbien@hotmail.com Subject: Re: [boost] Adding 'indirect_sort' to boost::algorithm
A couple of things.
1. It seems you are initializing the indices of the Permutation <ret> return twice – first to 0 and then to the position in iota(). An optimization would be to first declare an empty Permutation and then reserve() the count of elements you want and then call iota().
Ok.
2. In my former codebase I had a similar indirect_sort() that accepted an input array of indices instead of generating them inside the method. This allows a “select and then indirect_sort this set of indices” which is what I generally used. 3. (this less important but just an aside) I also implemented two other types of indirect sort in my old code base due to necessity. A “pointer to element” sort – this allows sorting on post-iterator-access elements – allowing sorting of elements stored in non-indexable iterators. Also an “offset sort” which is more or less like the pointer-to-element sort but allowed sorting of sub-fields of elements (though thinking about it now I don’t remember how I did this and I no longer have access to the code). I used the “pointer to element” sort the most of the time – the size of an 64bit index and a pointer are usually the same so no loss in terms of storage – the return is a set of sorted pointer-to-elements. bien
Sent from Mail< https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgo.microsoft.com%2Ffwlink%2F%3FLinkId%3D550986&data=05%7C01%7C%7Cb05ca86255704e0564e508db6dca57fe%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638224488055949615%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=O4stdx5U6GfpW0jfVok2GFmQI7FLQXM2NS6%2B%2B%2FLJ9tc%3D&reserved=0 https://go.microsoft.com/fwlink/?LinkId=550986> for Windows
From: Alexander Grund via Boostmailto:boost@lists.boost.org Sent: Thursday, June 15, 2023 3:04 AM To: boost@lists.boost.orgmailto:boost@lists.boost.org Cc: Alexander Grundmailto:alexander.grund@tu-dresden.de Subject: Re: [boost] Adding 'indirect_sort' to boost::algorithm
So you are adding a building block to have such a "cosort". You're just missing a way to apply that reordering to several (random-access) collections.
There already is: Basically you first do an indirect sort and then use `boost::algorithm::apply_permutation` on each collection with the result permutation. See the test case and documentation of indirect_sort.
Alex
_______________________________________________ Unsubscribe & other changes: https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Flists.boost.org%2Fmailman%2Flistinfo.cgi%2Fboost&data=05%7C01%7C%7Cb05ca86255704e0564e508db6dca57fe%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638224488055949615%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=8xi35vQdmkaAeZZswWGxKX0Uyxa19s8IPf3%2Bg7nnNhc%3D&reserved=0 http://lists.boost.org/mailman/listinfo.cgi/boost
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
participants (6)
-
Alexander Grund
-
David Bien
-
Dominique Devienne
-
Francisco José Tapia
-
Gavin Lambert
-
Marshall Clow