[review] Formal review period for Sort library continuing through November 19
The formal review of the Sort library by Steven Ross, which started November 10 and is scheduled to continue through November 19th, begins its last 5 days today. We have had some great discussions and reviews so far and I would like to thank everybody involved. I would like to encourage all interested programmers, whether Boost developers or Boost users, to comment on and/or review the library.

About the Sort library
======================

The Sort library implements hybrid sorting algorithms based on Spreadsort ( http://en.wikipedia.org/wiki/Spreadsort ), of which the author of the library, Steven Ross, is the inventor. The algorithm provides a sort that is faster than O(n*log(n)). The library provides a generic implementation of high-speed sorting algorithms that outperform those in the C++ standard in both average and worst case performance. These algorithms only work on random access iterators. They are hybrids using both radix and comparison-based sorting, specialized to sorting common data types, such as integers, floats, and strings. These algorithms are encoded in a generic fashion and accept functors, enabling them to sort any object that can be processed like these basic data types.

Where to get it
===============

The library is available on github at https://github.com/spreadsort/sort. The library is in modular Boost format and can be cloned to libs/sort under your local modular boost directory. As the review manager, I have provided online documentation at: http://eldiener.github.io/sort

Review guidelines
=================

Reviews should be submitted to the developer list (boost@lists.boost.org), preferably with '[sort]' in the subject. If you do not wish to do so for some reason, or are not subscribed to the developer list, you can send your review privately to me at 'eldiener at tropicsoft dot com'. If so, please let me know whether or not you'd like your review to be forwarded to the list.

For your review you may wish to consider the following questions:

- What is your evaluation of the design?
- What is your evaluation of the implementation?
- What is your evaluation of the documentation?
- What is your evaluation of the potential usefulness of the library?
- Did you try to use the library? With what compiler? Did you have any problems?
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
- Are you knowledgeable about the problem domain?

And finally, every review should attempt to answer this question:

- Do you think the library should be accepted as a Boost library?

Be sure to say this explicitly so that your other comments don't obscure your overall opinion. Even if you do not wish to give a full review, any technical comment regarding the library is welcome as part of the review period and will help me as the review manager decide whether the library should be accepted as a Boost library. Any questions about the use of the library are also welcome.
Dear list, dear Edward, dear Steven,

First of all I would like to congratulate Steven on finally having his library reviewed. I have been aware of the library for years.
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
When I first learned about the existence of the library (as I said, several years ago), I read the code and the documentation closely. Today, I have quickly scanned through the previous discussion, re-read the introduction and the rationale from the documentation, and taken a quick glance over the code to check for any important changes.
- Are you knowledgeable about the problem domain?
Yes, quite. I have strong formal education in algorithm design and analysis. I also have a passion for sorting algorithms and I have been in the process of writing my own C++ sorting library, though I have not had the chance to work on the latter since early 2012. My own library mostly contains well-known comparison sorts but it also contains a few interesting innovations of my own. I have done a lot of benchmarking and analysis of my own algorithm implementations, so I’m well aware of the issues that are relevant in comparing sorting algorithms.

Please note that I do not consider my own library a competitor for Steven Ross’, though. The aims and contents of the libraries seem to be sufficiently different that there can be a place for both. Apart from that, Steven Ross’ library is in a much more “production-ready” state than mine.
- What is your evaluation of the design?
I’m not convinced that string_sort, float_sort and integer_sort need to be separate interfaces. I believe that the function to extract “digits” from the values for the MSD part of the algorithm can be captured in a fully generic way (that should be applicable to all radix-like sorting algorithms), coupled with two additional, optional parameters (pre-transform and post-transform) specifying functions for doing the temporary “treat floats as integers” trick.

Apart from that, I think the design is sensible.
- What is your evaluation of the implementation?
Clever.
- What is your evaluation of the documentation?
Poor.

The rationale, which in my opinion is the most important part of the documentation and should be in the front, is too lengthy and misses its target. It does discuss the theoretical advantages of radix sort as well as the (technical) relative merits of MSD vs LSD and several hybrid sorting algorithms, at great length, but it does not address the question why somebody might want to use a hybrid radix/comparison sort in the first place. It seems to be a defense of design decisions rather than a justification of the very existence of the library.

The introduction severely and needlessly overstates the value of the algorithm, starting from the very first sentence:

“The Boost Sort library provides a generic implementation of high-speed sorting algorithms that outperform those in the C++ standard in both average and worst case performance.”

strongly implying that spreadsort is fit for replacing introsort (std::sort) in all possible situations, which it just isn’t.

Then the introduction mentions three conditions under which spreadsort is “fastest relative to std::sort”, but it should instead mention two conditions under which a user may want to consider using it at all:

1. Value type can be treated as a sequence of digits/characters.
2. Many keys to sort and/or very long keys.

This would be more honest about the scope of application, hence more helpful to the potential user. I believe this does not invalidate the library. Under the two conditions mentioned above, spreadsort can in fact offer significant performance improvements. I think the latter has been sufficiently proven.

The title of the library is also too pretentious, as other reviewers have already pointed out. I think it should be called “Spreadsort” rather than “Sort”.

So to summarize this part of my review, I think that the documentation is insufficient as long as it doesn’t properly address the following points in the following order:

1. Why and under which conditions the library may be useful.
2. How to use the library.
3. What is going on inside the library (and why in that way).

The good news is that parts 2 and 3 have already been provided for and that part 1 should be very short. So I think relatively little work is needed to fix the documentation.
- What is your evaluation of the potential usefulness of the library?
As I explained above, I think the library has somewhat narrow applications, but it can be highly valuable when applicable. This is true of most other sorting algorithms as well. Spreadsort is definitely useful enough for inclusion in Boost, especially as long as no generic implementations of other radix/comparison hybrids are available.
- Did you try to use the library? With what compiler? Did you have any problems?
Not this year. I may have tested the library when I first learned about its existence, but I don’t have any clear memory of the results.
And finally, every review should attempt to answer this question:
- Do you think the library should be accepted as a Boost library?
Yes, sort of. I think spreadsort should be added to Boost.Algorithm (please!), but first the documentation should be improved as I discussed. While I think that the interface should be more general, that is not a condition for acceptance as far as I’m concerned.

I do not agree with Niall Douglas that the library should add fallbacks for ranges that don’t offer random access. In general, I think that if you want to sort a range then you should ensure from the outset that it is a random access range, with the exception of very short ranges, for which quadratic complexity does not hurt, or linked lists, which can be efficiently sorted but only with intrusive sorting methods. Neither situation is relevant for this implementation of spreadsort.

Kind regards,
Julian Gonggrijp
Thank you for your feedback Julian. I'll improve the documentation.
On Sun Nov 16 2014 at 9:22:36 PM Julian Gonggrijp wrote:
- What is your evaluation of the design?
I’m not convinced that string_sort, float_sort and integer_sort need to be separate interfaces. I believe that the function to extract “digits” from the values for the MSD part of the algorithm can be captured in a fully generic way (that should be applicable to all radix-like sorting algorithms), coupled with two additional, optional parameters (pre-transform and post-transform) specifying functions for doing the temporary “treat floats as integers” trick.
string_sort is this fully generic approach. integer_sort and float_sort
can be seen as specializations that allow bit-level granularity of the radix operations and work well for keys that fit into a machine word. If you can think of some way to enable the sort-floats-as-ints trick as an option without harming performance for non-floats, I'd be very interested.
The rationale, which in my opinion is the most important part of the documentation and should be in the front, is too lengthy and misses its target. It does discuss the theoretical advantages of radix sort as well as the (technical) relative merits of MSD vs LSD and several hybrid sorting algorithms, to great length, but it does not address the question why somebody might want to use a hybrid radix/comparison sort in the first place. It seems to be a defense of design decisions rather than a justification of the very existence of the library.
The introduction severely and needlessly overstates the value of the algorithm, starting from the very first sentence:
“The Boost Sort library provides a generic implementation of high-speed sorting algorithms that outperform those in the C++ standard in both average and worst case performance.”
strongly implying that spreadsort is fit for replacing introsort (std::sort) in all possible situations, which it just isn’t.
Then the introduction mentions three conditions under which spreadsort is “fastest relative to std::sort”, but it should instead mention two conditions under which a user may want to consider using it at all:
1. Value type can be treated as a sequence of digits/characters.

Find me a type that you don't think this applies to, and I'll show you how it can. If std::sort can sort it, so can string_sort if you use the available functors properly. Maybe I should explain this in the documentation? I will agree to the extent that if performance isn't very critical relative to the code complexity of creating an efficient radix key expansion equivalent to the comparison operation, then people probably are better off with std::sort.

2. Many keys to sort and/or very long keys.

I agree, if most of your sorting runtime is going to be spent on lists of < 1000 elements, this library is of little use, though it should also do little harm due to the fallback to std::sort. I'll edit the docs in develop to make that more clear.
So to summarize this part of my review, I think that the documentation is insufficient as long as it doesn’t properly address the following points in the following order:
1. Why and under which conditions the library may be useful.
2. How to use the library.
3. What is going on inside the library (and why in that way).
The good news is that parts 2 and 3 have already been provided for and that part 1 should be very short. So I think relatively little work is needed to fix the documentation.
Will do.
Steven Ross wrote:
If std::sort can sort it, so can string_sort if you use the available functors properly.
How about:

    bool compare_distance_from_origin(point& a, point& b)
    { return sqr(a.x)+sqr(a.y) < sqr(b.x)+sqr(b.y); }

    std::sort(points.begin(), points.end(), &compare_distance_from_origin);

To me, it's clear that there are many kinds of sort where a radix sort can't be used. An interesting borderline case would be case-insensitive sorting; have you tried doing that?

Regards, Phil.
Phil,

On Mon, Nov 17, 2014 at 7:21 AM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Steven Ross wrote:
If std::sort can sort it, so can string_sort if you use the available functors properly.
How about:
bool compare_distance_from_origin(point& a, point& b) { return sqr(a.x)+sqr(a.y) < sqr(b.x)+sqr(b.y); }
std::sort(points.begin(),points.end(),&compare_distance_from_origin);
Your radix key is: sqr(x) + sqr(y). If you make all your functors work relative to that generated key, it'll work fine.
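For illustration, here is a minimal sketch of that approach, assuming the integer_sort overload that accepts a right-shift functor plus a comparison functor; the point type, the sqr helper, the header name and the namespace shown are assumptions, not code from this thread:

    #include <vector>
    #include "spreadsort.hpp"   // assumption: review-time header from the github repo

    struct point { int x, y; };
    inline long long sqr(long long v) { return v * v; }

    // Both functors operate on the synthetic radix key sqr(x) + sqr(y).
    struct shift_by_key {
      long long operator()(const point& p, unsigned offset) const
      { return (sqr(p.x) + sqr(p.y)) >> offset; }
    };
    struct compare_by_key {
      bool operator()(const point& a, const point& b) const
      { return sqr(a.x) + sqr(a.y) < sqr(b.x) + sqr(b.y); }
    };

    // std::vector<point> points = ...;
    // boost::sort::integer_sort(points.begin(), points.end(),
    //                           shift_by_key(), compare_by_key());  // namespace may differ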
To me, it's clear that there are many kinds of sort where a radix sort can't be used. An interesting borderline case would be case-insensitive sorting; have you tried doing that?
Your Get_char functor just needs to convert one case to the other before returning:

    struct bracket {
      inline unsigned char operator()(const DATA_TYPE &x, size_t offset) const {
        return tolower(x.a[offset]);
      }
    };

See https://github.com/spreadsort/sort/blob/master/example/stringfunctorsample.c... for an example of using string_sort with functors. I'll convert that example to do case-insensitive comparison, as that is an interesting case.

If you can represent your sort as a sequence of comparisons of sequences of bytes, you can convert each of those comparisons into a sequence of bytes that make up a synthetic radix. I am willing to concede that this might be complicated, and in some cases less efficient when sorting reasonable quantities of data, but it is doable. I think many people are confused about the limitations of radix sorts, because LSD radix sorting over more complex variable-length keys is inefficient, but MSD radix sort does well with variable-length keys.
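For context, a hedged sketch of how the full functor set might look for case-insensitive sorting of std::string, assuming the string_sort overload that takes a get-character functor, a get-length functor and a comparison functor; the functor names and the header name are illustrative:

    #include <cctype>
    #include <cstddef>
    #include <string>
    #include <vector>
    #include "string_sort.hpp"   // assumption: review-time header name

    struct lower_bracket {       // digit extraction, case-folded
      unsigned char operator()(const std::string& s, std::size_t offset) const {
        return static_cast<unsigned char>(
            std::tolower(static_cast<unsigned char>(s[offset])));
      }
    };
    struct get_length {          // tells string_sort how many characters a key has
      std::size_t operator()(const std::string& s) const { return s.size(); }
    };
    struct iless {               // matching case-insensitive comparison for the fallback sort
      static unsigned char low(char c) {
        return static_cast<unsigned char>(std::tolower(static_cast<unsigned char>(c)));
      }
      bool operator()(const std::string& a, const std::string& b) const {
        for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
          if (low(a[i]) != low(b[i])) return low(a[i]) < low(b[i]);
        return a.size() < b.size();
      }
    };

    // std::vector<std::string> v = ...;
    // boost::sort::string_sort(v.begin(), v.end(),
    //                          lower_bracket(), get_length(), iless());  // namespace may differ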
Steven Ross wrote:
On Mon, Nov 17, 2014 at 7:21 AM, Phil Endecott wrote:

Steven Ross wrote:
If std::sort can sort it, so can string_sort if you use the available functors properly.
How about:
bool compare_distance_from_origin(point& a, point& b) { return sqr(a.x)+sqr(a.y) < sqr(b.x)+sqr(b.y); }
std::sort(points.begin(),points.end(),&compare_distance_from_origin);
Your radix key is: sqr(x) + sqr(y). If you make all your functors work relative to that generated key, it'll work fine.
So my thought was, you wouldn't want to do that because all that extra maths is being done so frequently and the performance would be considerably worse than std::sort. But:
I think many people are confused about the limitations of radix sorts, because LSD radix sorting over more complex variable-length keys is inefficient, but MSD radix sort does well with variable-length keys.
Ah, that may be true; I was overlooking the fact that not every bit of the value is looked at if the values are sufficiently well separated. Regards, Phil.
Steven Ross wrote:
I’m not convinced that string_sort, float_sort and integer_sort need to be separate interfaces. I believe that the function to extract “digits” from the values for the MSD part of the algorithm can be captured in a fully generic way (that should be applicable to all radix-like sorting algorithms), coupled with two additional, optional parameters (pre-transform and post-transform) specifying functions for doing the temporary “treat floats as integers” trick.
string_sort is this fully generic approach. integer_sort and float_sort can be seen as specializations that allow bit-level granularity of the radix operations
Isn’t that a matter of defining indexing adapters for those types, taking a template parameter for the bitfield size? I mean, I would expect any radix-like sorting algorithm to take a function of type key -> index -> digit as an optional parameter (defaulting to operator[] if not provided). For floats and integers that function could be templated by either the number of bits per digit or the number of desired buckets (which is equivalent under exponentiation). Actually, that could be done for strings and other bitstream-like types as well.
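To make this suggestion concrete, here is a purely illustrative sketch of what such a key -> index -> digit functor might look like for unsigned integers, templated on the number of bits per digit; this is not an interface the library offers, it only spells out the idea:

    #include <cstddef>

    // Illustrative only: digit 0 is the most significant digit. Assumes
    // BitsPerDigit is smaller than the word size and divides it evenly.
    template <typename UInt, unsigned BitsPerDigit>
    struct uint_digits {
      unsigned operator()(UInt key, std::size_t index) const {
        const unsigned total_bits = sizeof(UInt) * 8;
        const unsigned shift =
            total_bits - BitsPerDigit * static_cast<unsigned>(index + 1);
        return static_cast<unsigned>(key >> shift) & ((1u << BitsPerDigit) - 1u);
      }
    };

    // e.g. uint_digits<unsigned, 8>()(0x12345678u, 0) == 0x12
    //      uint_digits<unsigned, 8>()(0x12345678u, 3) == 0x78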
and work well for keys that fit into a machine word.
Could you clarify the above remark? Why does the fully generic string_sort not work well for keys that fit into a machine word?
If you can think of some way to enable the sort-floats-as-ints trick as an option without harming performance for non-floats, I'd be very interested.
Default to the trivial (pre|post) transformation, and specialize the algorithm that maps the transformations over the (input|output) range to be a no-op for the trivial transformation. The compiler will eliminate the function call.
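A minimal sketch of that idea, with hypothetical names (nothing here is the library's actual API): the default transform is an inlined no-op, so callers who need no float-to-int trick pay nothing for the extra parameter.

    // Illustrative only: a trivial pre/post transform the optimizer removes.
    struct identity_transform {
      template <typename T>
      T operator()(const T& x) const { return x; }   // no-op, trivially inlined
    };

    // Hypothetical skeleton of the transform pass inside the algorithm.
    template <typename Iter, typename Transform>
    void apply_transform(Iter first, Iter last, Transform t) {
      for (; first != last; ++first)
        *first = t(*first);   // with identity_transform this reduces to a self-assignment
    }
    // A sorting front end could default Transform to identity_transform and only
    // do real work when the caller passes, e.g., a float-to-int bit caster.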
1. Value type can be treated as a sequence of digits/characters.
Find me a type that you don't think this applies to,
For starters, any product type (tuple or struct) containing variable-length subkeys. As an example, consider book records from a library database that need to be ordered first by author and then by title. If you insist on doing that with a (MSD) radix sort, you have two options:

1. Use a stable radix sort and make multiple passes over the data. In most cases, this would be less efficient than just doing a single pass with a comparison sort, not to mention that spreadsort isn’t stable.
2. Hack into the implementation details of the algorithm and recurse into the next key to be sorted by, before the buckets of the previous key are destroyed.

Case-agnostic string sorting is another example in which radix sorts don’t excel (even though you can make it work if you extend the interface considerably). This has been mentioned before.

In general, radix sorts only “shine” for types that can be injectively mapped to the scalar bitfields while preserving order (even if you can make it “work” for some other cases). Many common types and orders fit in that category, but there are infinitely many types and orders with different semantics.
and I'll show you how it can. If std::sort can sort it, so can string_sort if you use the available functors properly. Maybe I should explain this in the documentation?
No, you should explicitly mention the opposite. Your algorithm has value, not because it can be used for all imaginable scenarios but because it has something to offer in particular use cases. No library needs to be for everyone (not even std::sort, in fact). I really do think your algorithm has value but you need to be very precise about what it can do and what it cannot. Users will only appreciate your library more if you do so.
I will agree to the extent that if performance isn't very critical relative to the code complexity of creating an efficient radix key expansion equivalent to the comparison operation, then people probably are better off with std::sort.
2. Many keys to sort and/or very long keys.
I agree, if most of your sorting runtime is going to be spent on lists of < 1000 elements, this library is of little use, though it should also do little harm due to the fallback to std::sort. I'll edit the docs in develop to make that more clear.
To be honest I am sceptical that the algorithm even adds much value under 100K keys. So far, Frank Gennari has reported a 1.8x speedup for 100M keys. Now, the effect probably would have been more dramatic if the algorithm had been used for long strings rather than 32-bit integers, but that’s still at 100M keys. The difference between an O(n log n) algorithm and an O(nk) algorithm doesn’t grow particularly fast either (except for very small n). Introsort/quicksort cache locality is also about as good as MSD radix sort cache locality. I think the real added value of your algorithm lies in the millions of keys and beyond (which is nothing to be ashamed of!).

Please don’t take this as an attack; I really honestly believe that your spreadsort library has value. I just think that you should present it with a bit more modesty with regard to scope of application.

-Julian
Julian,
On Mon Nov 17 2014 at 4:27:34 PM Julian Gonggrijp wrote:
string_sort is this fully generic approach. integer_sort and float_sort can be seen as specializations that allow bit-level granularity of the radix operations
Isn’t that a matter of defining indexing adapters for those types, taking a template parameter for the bitfield size? I mean, I would expect any radix-like sorting algorithm to take a function of type key -> index -> digit as an optional parameter (defaulting to operator[] if not provided). For floats and integers that function could be templated by either the number of bits per digit or the number of desired buckets (which is equivalent under exponentiation). Actually, that could be done for strings and other bitstream-like types as well.
integer_sort and float_sort assume they are operating on something that can efficiently be treated as a single word. They use bitshifting and subtraction to find the difference between the most significant bits of two elements. The bitshifting paradigm is what makes those algorithms difficult to generalize to very large keys.
and work well for keys that fit into a machine word.
Could you clarify the above remark? Why does the fully generic string_sort not work well for keys that fit into a machine word?
It's not quite as efficient; it works 8 bits at a time instead of their 11, and doesn't use subtraction to eliminate bits that are identical between the min and max.
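For intuition, here is a rough, hedged illustration of the subtraction idea (my reading of the general technique, not the library's actual code): scan once for min and max, then size the first radix pass to the bits that actually vary and index buckets relative to min.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    inline unsigned rough_log2(unsigned long long x) {
      unsigned r = 0;
      while (x >>= 1) ++r;
      return r;
    }

    // Illustrative first MSD pass over unsigned keys.
    void first_pass_buckets(const std::vector<unsigned>& keys, unsigned bits_per_pass) {
      if (keys.empty()) return;
      unsigned mn = *std::min_element(keys.begin(), keys.end());
      unsigned mx = *std::max_element(keys.begin(), keys.end());
      if (mn == mx) return;   // one bucket, nothing to split
      unsigned log_range = rough_log2(static_cast<unsigned long long>(mx) - mn) + 1;
      unsigned shift = log_range > bits_per_pass ? log_range - bits_per_pass : 0;
      for (std::size_t i = 0; i < keys.size(); ++i) {
        std::size_t bucket = (keys[i] >> shift) - (mn >> shift);  // shared high bits drop out
        (void)bucket;   // a real pass would count and scatter into this bucket
      }
    }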
If you can think of some way to enable the sort-floats-as-ints trick as an option without harming performance for non-floats, I'd be very interested.
Default to the trivial (pre|post) transformation, and specialize the algorithm that maps the transformations over the (input|output) range to be a no-op for the trivial transformation. The compiler will eliminate the function call.
I'll look to see if I can structure it that way without loss of efficiency or major loss in readability.
1. Value type can be treated as a sequence of digits/characters.
Find me a type that you don't think this applies to,
For starters, any product type (tuple or struct) containing variable- length subkeys. As an example, consider book records from a library database that need to be ordered first by author and then by title. If you insist on doing that with a (MSD) radix sort, you have two options:
1. Use a stable radix sort and make multiple passes over the data. In most cases, this would be less efficient than just doing a single pass with a comparison sort, not to mention that spreadsort isn’t stable.
2. Hack into the implementation details of the algorithm and recurse into the next key to be sorted by, before the buckets of the previous key are destroyed.
Note: I am open to writing stable versions of these algorithms, potentially with TimSort or some other efficient stable fallback sort. All that is required for stability of the MSD radix portion is having a full copy of the data (just like LSD radix sort uses). The library should be faster without the complex cache-missing swaps the in-place version uses.

That said, here are another 2 options, the first of which is quite efficient:

3) If there is a character that is not allowed in the substrings (ideally the null character), then you can just return 0 when you hit the end of a substring, and if the illegal character is not null, then shift any values below that character up by 1 in the Get_char functor to fit this change in. Embedded nulls usually are not a design requirement, in which case you can just use null to signal the end of a substring.

4) If all your possible character values are accounted for, then you can add an extra character every step to signal whether the substring is over. Example returns from Get_char with this approach on strings made of 2 substrings:

    "", "a"   ->  0, 1, 'a', 0
    "bc", "d" ->  1, 'b', 1, 'c', 0, 1, 'd', 0

If this is a serious issue, I can add substring end signaling into the API. If you can define a strict weak ordering that is usable in std::sort, you can define a synthetic radix as I've just given an example of.

Case-agnostic string sorting is another example in which radix sorts don’t excel (even though you can make it work if you extend the interface considerably). This has been mentioned before.
The functor example works efficiently in this case, as mentioned before. It's significantly faster than std::sort using boost::algorithm::ilexicographical_compare for case-insensitive sorting: 1.21s for 40MB of strings vs 2.56s for std::sort.
Steven Ross wrote:
integer_sort and float_sort assume they are operating on something that can efficiently be treated as a single word. They use bitshifting and subtraction to find the difference between the most significant bits of two elements. The bitshifting paradigm is what makes those algorithms difficult to generalize to very large keys.
Why is the bitshifting hard-coded into a separate version of the algorithm, rather than wrapping it in the Get_char parameter function? (By the way, please forgive me for continuing to fire such questions at you. It’s just that your remarks raise questions which I think are interesting. You are not obliged to defend your design, as far as I’m concerned.)
Could you clarify the above remark? Why does the fully generic string_sort not work well for keys that fit into a machine word?
It's not quite as efficient; it works 8 bits at a time instead of their 11, and doesn't use subtraction to eliminate bits that are identical between the min and max.
If I understand correctly, the subtraction is a trick to reduce the number of buckets. Wouldn’t that be useful for strings as well? And why wouldn’t the number of bits per pass be parametric (with sensible defaults of course)?
1. Value type can be treated as a sequence of digits/characters.
Find me a type that you don't think this applies to,
For starters, any product type (tuple or struct) containing variable- length subkeys. As an example, consider book records from a library database that need to be ordered first by author and then by title. If you insist on doing that with a (MSD) radix sort, you have two options:
1. Use a stable radix sort and make multiple passes over the data. In most cases, this would be less efficient than just doing a single pass with a comparison sort, not to mention that spreadsort isn’t stable.
2. Hack into the implementation details of the algorithm and recurse into the next key to be sorted by, before the buckets of the previous key are destroyed.
Note: I am open to writing stable versions of these algorithms, potentially with TimSort or some other efficient stable fallback sort. All that is required for stability of the MSD radix portion is having a full copy of the data (just like LSD radix sort uses). The library should be faster without the complex cache-missing swaps the in-place version uses.
It is up to you whether you add a stable version; I would vote to accept your library without it, too. Note that while a full copy of the data may speed up the algorithm, that does not necessarily mean that it is suddenly a good idea to make multiple passes with it.
That said, here's another 2 options, the first of which is quite efficient:
3) If there is a character that is not allowed in the substrings (ideally the null character), then you can just return 0 when you hit the end of a substring, and if the illegal character is not null, then shift any values below that character up by 1 in the Get_char functor to fit this change in. Embedded nulls usually are not a design requirement, in which case you can just use null to signal the end of a substring.

4) If all your possible character values are accounted for, then you can add an extra character every step to signal whether the substring is over. Example returns from Get_char with this approach on strings made of 2 substrings:

    "", "a"   ->  0, 1, 'a', 0
    "bc", "d" ->  1, 'b', 1, 'c', 0, 1, 'd', 0
If this is a serious issue, I can add substring end signaling into the API.
To the contrary. Please don’t do any such invasive thing to your algorithm just to add one class of use cases. Your algorithm works for a large class of use cases out of the box and you can make many people happy with it as it is.
If you can define a strict weak ordering that is usable in std::sort, you can define a synthetic radix as I've just given an example of.
You can invent all kinds of workarounds and invasive additions to account for additional use cases. That doesn’t mean it would be a good idea. There is no single sophistication to spreadsort that will make it work for *all* use cases that every comparison sort works out of the box for. I think it would be better to appreciate the elegance of the algorithm for the use cases that it easily supports, than to try to approach the generality of a comparison sort. Apart from that, at this point it seems clear to me that you agree that spreadsort (and radix sorting in general) does not work for product types with variable length subkeys *out of the box*, contrary to (pure) comparison sorts. I must emphasise that this is still just one example that I could easily come up with; you should not assume that it is the only exception. Since spreadsort does not work out of the box for all use cases, you should not pretend that it does in your documentation, either.
Case-agnostic string sorting is another example in which radix sorts don’t excel (even though you can make it work if you extend the interface considerably). This has been mentioned before.
The functor example works efficiently in this case, as mentioned before. It's significantly faster than std::sort using boost::algorithm::ilexicographical_compare for case-insensitive sorting: 1.21s for 40MB of strings vs 2.56s for std::sort.
In that case I retract that example. As I said I only quickly scanned the previous discussion, so I hereby apologise for this oversight and any other oversights I might have made. -Julian
On Tue Nov 18 2014 at 3:34:28 PM Julian Gonggrijp wrote:
integer_sort and float_sort assume they are operating on something that can efficiently be treated as a single word. They use bitshifting and subtraction to find the difference between the most significant bits of two elements. The bitshifting paradigm is what makes those algorithms difficult to generalize to very large keys.
Why is the bitshifting hard-coded into a separate version of the algorithm, rather than wrapping it in the Get_char parameter function?
Could you clarify the above remark? Why does the fully generic string_sort not work well for keys that fit into a machine word?
It's not quite as efficient; it works 8 bits at a time instead of their 11, and doesn't use subtraction to eliminate bits that are identical between the min and max.
If I understand correctly, the subtraction is a trick to reduce the number of buckets. Wouldn’t that be useful for strings as well? And why wouldn’t the number of bits per pass be parametric (with sensible defaults of course)?
You can bit-shift in Get_char; the main difference is the subtraction (which requires a scan through all data for the minimum and maximum). string_sort uses a fixed-size window based on the maximum possible range of the unsigned character Get_char returns, plus a bin for empties. It is explicitly designed and optimized for datatypes like the string, where there is a sequence of small data types and subtraction will provide little benefit, and where all characters at the same index are sometimes equal. People can bit-shift inside Get_char to extract chunks of data, but the subtraction optimization doesn't make sense when you're just sorting bytes, and only have 259 buckets (which easily fit in L1 cache).
3) If there is a character that is not allowed in the substrings (ideally the null character), then you can just return 0 when you hit the end of a substring, and if the illegal character is not null, then shift any values below that character up by 1 in the Get_char functor to fit this change in. Embedded nulls usually are not a design requirement, in which case you can just use null to signal the end of a substring.

4) If all your possible character values are accounted for, then you can add an extra character every step to signal whether the substring is over. Example returns from Get_char with this approach on strings made of 2 substrings:

    "", "a"   ->  0, 1, 'a', 0
    "bc", "d" ->  1, 'b', 1, 'c', 0, 1, 'd', 0

If you can define a strict weak ordering that is usable in std::sort, you can define a synthetic radix as I've just given an example of.
You can invent all kinds of workarounds and invasive additions to account for additional use cases. That doesn’t mean it would be a good idea. There is no single sophistication to spreadsort that will make it work for *all* use cases that every comparison sort works out of the box for. I think it would be better to appreciate the elegance of the algorithm for the use cases that it easily supports, than to try to approach the generality of a comparison sort.
I think there is some form of miscommunication. string_sort, out of the box, can sort ANY data with a strict weak ordering that std::sort can, depending on how the user of the library defines their Get_char functor, just like they have to define a compare functor in these more complex use cases for both comparison-based algorithms and string_sort. You posed a problem, and I provided 2 solutions by changing the user's Get_char functor, one efficient one for the standard case where embedded nulls aren't an issue, and another that supports embedded nulls with some more effort. This technique can be extended to any std::sortable data type with a strict weak ordering, merely by reproducing the actual bytes the comparison function compares as returns from the Get_char function at sequential indices. There is no loss of generality with hybrid-radix sorting. What I will agree is that it becomes more complicated for the user to define a Get_char functor in addition to the compare functor for some use cases, and in some cases it won't be worth the bother.
259->257 buckets. Sorry for the typo.
Steven Ross wrote:
You can bit-shift in Get_char; the main difference is the subtraction (which requires a scan through all data for the minimum and maximum). string_sort uses a fixed-size window based on the maximum possible range of the unsigned character Get_char returns, plus a bin for empties. It is explicitly designed and optimized for datatypes like the string, where there is a sequence of small data types and subtraction will provide little benefit, and where all characters at the same index are sometimes equal. People can bit-shift inside Get_char to extract chunks of data, but the subtraction optimization doesn't make sense when you're just sorting bytes, and only have 259 buckets (which easily fit in L1 cache).
Thank you for your explanation.
3) If there is a character that is not allowed in the substrings (ideally null character), then you can just return 0 when you hit the end of a substring, and if the illegal character is not null, then shift any values below that character up by 1 in the Get_char functor to fit this change in. Embedded nulls usually are not a design requirement, in which case you can just use null to signal the end of a substring. 4) [...]
You can invent all kinds of workarounds and invasive additions to account for additional use cases. [...] I think it would be better to appreciate the elegance of the algorithm for the use cases that it easily supports, than to try to approach the generality of a comparison sort.
I think there is some form of miscommunication. string_sort, out of the box, can sort ANY data with a strict weak ordering that std::sort can, depending on how the user of the library defines their Get_char functor, just like they have to define a compare functor in these more complex use cases for both comparison-based algorithms and string_sort. You posed a problem, and I provided 2 solutions by changing the user's Get_char functor, one efficient one for the standard case where embedded nulls aren't an issue, and another that supports embedded nulls with some more effort. This technique can be extended to any std::sortable data type with a strict weak ordering, merely by reproducing the actual bytes the comparison function compares as returns from the Get_char function at sequential indices. There is no loss of generality with hybrid-radix sorting. What I will agree is that it becomes more complicated for the user to define a Get_char functor in addition to the compare functor for some use cases, and in some cases it won't be worth the bother.
Yes, perhaps there is a miscommunication. Regarding the author/title example, I understand the part where a library user can refine the Get_char function to return a special value for the end of a subkey. However, it is unclear to me how that could be enough to ensure that all books are first sorted by author and then by title. Author names may be as short as 10 characters or as long as 50 characters. How is Get_char supposed to ensure all by itself that all first characters of all book titles are considered in the same radix pass, i.e. the one after the last character of the longest author name? I naturally assumed that the sorting algorithm itself would have to help here, which would be invasive for your implementation of spreadsort. I realise that you never confirmed that, though.

By contrast, it is very clear how refining the comparison function would be sufficient for a comparison sort. Just compare by author; in case of a tie, compare by title.

To the best of my knowledge, it is common wisdom that radix sorts are not as general as comparison sorts. I can be wrong. I may have missed an important (new?) insight. If that is the case, please explain to me how refining the Get_char function is sufficient to deal with any strict weak ordering (just repeating the claim does not help me understand).

-Julian
Julian,

On Wed Nov 19 2014 at 5:46:01 PM Julian Gonggrijp wrote:

How is Get_char supposed to ensure all by itself that all first characters of all book titles are considered in the same radix pass, i.e. the one after the last character of the longest author name?

By alternating between returning payload radix characters and a "does this field continue" byte. A 0 will specify an end of field, and sort lower in the order than a 1, thus making all shorter names sort lower than names that are equivalent to the same length but longer.

Thus, Author: Jon, Title: bat will have GetChar output these values:

    1, 'J', 1, 'o', 1, 'n', 0, 1, 'b', 1, 'a', 1, 't', 0

Author: Jonathan, Title: (empty) will output these values:

    1, 'J', 1, 'o', 1, 'n', 1, 'a', 1, 't', 1, 'h', 1, 'a', 1, 'n', 0, 0

The 0 vs 1 will cause Jon to appear before Jonathan in order, because it is shorter. Actually, the 0s and 1s aren't necessary for the last subkey (the title), but I included them to show this approach. There is also a simple GetLength functor that lets string_sort know when all characters are exhausted.

By contrast, it is very clear how refining the comparison function would be sufficient for a comparison sort. Just compare by author; in case of a tie, compare by title.

That's exactly what the Get_char method I explained is doing, but it's less intuitive. The radix equivalent requires encoding the same information the comparison function uses, but sometimes in a different byte format to include factors such as sizes and signs. As I said, doable, but may not be worth the effort.

To the best of my knowledge, it is common wisdom that radix sorts are not as general as comparison sorts. I can be wrong. I may have missed an important (new?) insight. If that is the case, please explain to me how refining the Get_char function is sufficient to deal with any strict weak ordering (just repeating the claim does not help me understand).

I agree with your statement that this is conventional wisdom. 2 major factors are included in this conventional wisdom:

1) Most radix sorts perform poorly relative to comparison sorting for small N, large K, or variable-length data. Hybrid-radix sorting resolves this problem.
2) The mapping from a sorting requirement to a comparison function is more intuitive and generally simpler than the mapping to a Get_char-type radix implementation.

With #1 limiting performance, and comparison-based sorts being fast enough for most problems, I think most people just didn't put in the effort to examine the possibility of generalized radix sorting. string_sort provides this capability. It won't be optimal for many types of problems, but it can be used for them and provides worst-case performance guarantees.
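To make the encoding above concrete, here is a hedged sketch of a GetChar/GetLength/compare functor set for a hypothetical book record sorted by author and then title; the record type and functor names are made up, and the string_sort(first, last, get_char, get_length, compare) overload is assumed from this thread rather than quoted from the documentation:

    #include <cstddef>
    #include <string>
    #include <vector>

    struct book { std::string author, title; };

    // Virtual radix key: 1,a[0],1,a[1],...,0, 1,t[0],1,t[1],...,0
    struct book_get_char {
      static unsigned char field_char(const std::string& s, std::size_t i) {
        if (i == 2 * s.size()) return 0;               // end-of-field marker
        if (i % 2 == 0) return 1;                      // "field continues" marker
        return static_cast<unsigned char>(s[i / 2]);   // payload character
      }
      unsigned char operator()(const book& b, std::size_t i) const {
        std::size_t author_part = 2 * b.author.size() + 1;  // markers + payload + final 0
        if (i < author_part) return field_char(b.author, i);
        return field_char(b.title, i - author_part);
      }
    };

    struct book_get_length {
      std::size_t operator()(const book& b) const {
        return 2 * b.author.size() + 1 + 2 * b.title.size() + 1;
      }
    };

    struct book_compare {   // equivalent ordering for the comparison-sort fallback
      bool operator()(const book& a, const book& b) const {
        if (a.author != b.author) return a.author < b.author;
        return a.title < b.title;
      }
    };

    // std::vector<book> books = ...;
    // boost::sort::string_sort(books.begin(), books.end(),
    //     book_get_char(), book_get_length(), book_compare());  // namespace may differ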
Steven Ross wrote:

[...]

Thank you, Steven. It clicked. I guess I just needed an “explain like I’m 10” type of explanation. Issue resolved, as far as I’m concerned.

Cheers, Julian
Julian,
I updated the first paragraph in the introduction to make clearer what
this library is NOT good for, and added the change to the develop
branch. Here is the new first paragraph.
"The Boost Sort library provides a generic implementation of
high-speed sorting algorithms that outperform those in the C++
standard in both average and worst case performance when there are
over 1000 elements in the list to sort. They fall back to std::sort on
small data sets. These algorithms only work on random access
iterators. They are hybrids using both radix and comparison-based
sorting, specialized to sorting common data types, such as integers,
floats, and strings. These algorithms are encoded in a generic fashion
and accept functors, enabling them to sort any object that can be
processed like these basic data types. In the case of string_sort,
this includes anything with a defined strict weak ordering that
std::sort can sort, but writing efficient functors for some complex
key types may not be worth the additional effort relative to just
using std::sort, depending on how important speed is to your
application. Sample usages are available in the samples directory."
Steven Ross wrote:
I updated the first paragraph in the introduction to make clearer what this library is NOT good for, and added the change to the develop branch. Here is the new first paragraph.
"The Boost Sort library provides a generic implementation of high-speed sorting algorithms that outperform those in the C++ standard in both average and worst case performance when there are over 1000 elements in the list to sort. They fall back to std::sort on small data sets. These algorithms only work on random access iterators. They are hybrids using both radix and comparison-based sorting, specialized to sorting common data types, such as integers, floats, and strings. These algorithms are encoded in a generic fashion and accept functors, enabling them to sort any object that can be processed like these basic data types. In the case of string_sort, this includes anything with a defined strict weak ordering that std::sort can sort, but writing efficient functors for some complex key types may not be worth the additional effort relative to just using std::sort, depending on how important speed is to your application. Sample usages are available in the samples directory.”
After my reply to your previous email, it may be clear that this is not good enough to my taste yet. Let me offer you a suggestion for a different approach:

“The Boost Spreadsort library provides a generic implementation of spreadsort, a hybrid radix/comparison sorting algorithm. The relation between spreadsort and std::sort is comparable to the relation between std::sort and insertion sort. In general, std::sort can be expected to be much faster than insertion sort, but std::sort falls back on insertion sort for small subranges because insertion sort beats std::sort at that scale. Insertion sort also beats std::sort for nearly sorted ranges of any size. Likewise, spreadsort can be expected to outperform std::sort in general except for small ranges, in which it falls back on std::sort, with the reservation that “small” here means something rather large! There also exist situations in which spreadsort is never better than std::sort, regardless of the size of the range. Spreadsort is likely to make you a happier person if all of the following criteria apply:

- you are sorting very large sequences of data, possibly millions of keys or more;
- your keys can be treated as strings of digits or characters that can be lexicographically compared (fortunately this is true of many common types, including fixed-width floats and integers);
- you do not need to preserve the order of equivalent keys (spreadsort is not stable);
- your sequence is stored in a random access container.”

HTH, Julian
On Nov 17, 2014, at 1:55 PM, Julian Gonggrijp wrote:
- your keys can be treated as strings of digits or characters that can be lexicographically compared (fortunately this is true of many common types, including fixed-width floats and integers);
I would prefer avoiding the phrase "fixed-width floats". Josh
Josh Juran wrote:
- your keys can be treated as strings of digits or characters that can be lexicographically compared (fortunately this is true of many common types, including fixed-width floats and integers);
I would prefer avoiding the phrase "fixed-width floats”.
I see your point. Perhaps “fixed-size floats”?
On 18 November 2014 at 02:53, Josh Juran wrote:
On Nov 17, 2014, at 1:55 PM, Julian Gonggrijp wrote:

- your keys can be treated as strings of digits or characters that can be lexicographically compared (fortunately this is true of many common types, including fixed-width floats and integers);
I would prefer avoiding the phrase "fixed-width floats".
In http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3626.pdf we eventually decided to call them 'specified width' floating-point types. This covers the built-in float, double and long double, as well as longer (or shorter) types, for example those provided by Boost.Multiprecision.

Paul A. Bristow
participants (6)
- Edward Diener
- Josh Juran
- Julian Gonggrijp
- Paul A. Bristow
- Phil Endecott
- Steven Ross