[review] Last day for formal review for Sort library

Edward Diener

19 Nov 2014 19 Nov '14

12:30 p.m.

Today is the last day for the formal review of the Sort library by Steven Ross, which started November 10. We have had some great discussions and reviews so far and I would like to thank everybody involved, but as always we need more formal reviews. If you need any delay for your review after today, please contact me at 'eldiener at tropicsoft dot com' and I will take your review into account if given within a reasonable period of time after today. About the Sort library ================== The Sort library is a library which implements hybrid sorting algorithms based on a Spreadsort ( http://en.wikipedia.org/wiki/Spreadsort ), of which the author of the library, Steven Ross, is the inventor. The algorithm provides a sort that is faster than O(n*log(n)). The library provides a generic implementation of high-speed sorting algorithms that outperform those in the C++ standard in both average and worst case performance. These algorithms only work on random access iterators. They are hybrids using both radix and comparison-based sorting, specialized to sorting common data types, such as integers, floats, and strings. These algorithms are encoded in a generic fashion and accept functors, enabling them to sort any object that can be processed like these basic data types. Where to get it =============== The library is available on github at https://github.com/spreadsort/sort. The library is in modular Boost format and can be cloned to libs/sort under your local modular boost directory. I have provided as the review manager online documentation at: http://eldiener.github.io/sort Review guidelines ================= Reviews should be submitted to the developer list (boost@lists.boost.org), preferably with '[sort]' in the subject. Or if you don't wish to for some reason or are not subscribed to the developer list you can send them privately to me at 'eldiener at tropicsoft dot com'. If so, please let me know whether or not you'd like your review to be forwarded to the list. For your review you may wish to consider the following questions: - What is your evaluation of the design? - What is your evaluation of the implementation? - What is your evaluation of the documentation? - What is your evaluation of the potential usefulness of the library? - Did you try to use the library? With what compiler? Did you have any problems? - How much effort did you put into your evaluation? A glance? A quick reading? In-depth study? - Are you knowledgeable about the problem domain? And finally, every review should attempt to answer this question: - Do you think the library should be accepted as a Boost library? Be sure to say this explicitly so that your other comments don't obscure your overall opinion. Even if you do not wish to give a full review any technical comment regarding the library is welcome as part of the review period and will help me as the review manager decide whether the library should be accepted as a Boost library. Any questions about the use of the library are also welcome.

Show replies by date

Thijs (M.A.) van den Berg

19 Nov 19 Nov

1:36 p.m.

...

The algorithm provides a sort that is faster than O(n*log(n)).

Is it possible to specify the O() of the algorithm instead of only claiming that it's faster? I've read 1.8x speed up compared to std::sort and that made me think that it's also O(n*log(n)), but with a different constant in front. Specifying the exact O() is very informative, it tells you what to expect, how (the gap with std::sort) scales as a function of n.

Steven Ross

11:17 p.m.

It's the better of O(N*log(N)) and O(N*((K/S)+S)), where K is the key length in bits and S is the division factor (11 by default). Practically, it ends up being a slightly better than constant speedup for integer_sort and float_sort (that fluctuates based on the number of radix iterations required) from N=10000 up until N gets close to 2^K when it starts switching to bucketsort. For string_sort the speedup can be much more dramatic for long keys (because it does O(N) comparisons, and string comparisons take time proportional to the length of the key, making comparison-based sort worst-case be O(N*log(N)*K/C), where C may be as high as 64), but for shorter strings it has a speedup similar to integer_sort and float_sort. Bottom line is: it's faster, but complicated to summarize. Do you have a recommendation on how to summarize that better? On Wed Nov 19 2014 at 8:36:13 AM Thijs (M.A.) van den Berg <thijs@sitmo.com> wrote:

...

...
The algorithm provides a sort that is faster than O(n*log(n)).

Is it possible to specify the O() of the algorithm instead of only claiming that it's faster? I've read 1.8x speed up compared to std::sort and that made me think that it's also O(n*log(n)), but with a different constant in front.

Specifying the exact O() is very informative, it tells you what to expect, how (the gap with std::sort) scales as a function of n.

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/ mailman/listinfo.cgi/boost

Paul A. Bristow

20 Nov 20 Nov

10:26 a.m.

...

-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Steven Ross Sent: 19 November 2014 23:18 To: boost@lists.boost.org Subject: Re: [boost] [review] Last day for formal review for Sort library

It's the better of O(N*log(N)) and O(N*((K/S)+S)), where K is the key length in bits and S is the division factor (11 by default). Practically, it ends up being a slightly better than constant speedup for integer_sort and float_sort (that fluctuates based on the number of radix iterations required) from N=10000 up until N gets close to 2^K when it starts switching to bucketsort. For string_sort the speedup can be much more dramatic for long keys (because it does O(N) comparisons, and string comparisons take time proportional to the length of the key, making comparison-based sort worst-case be O(N*log(N)*K/C), where C may be as high as 64), but for shorter strings it has a speedup similar to integer_sort and float_sort. Bottom line is: it's faster, but complicated to summarize. Do you have a recommendation on how to summarize that better?

Put the bottom line first ;-) But do include all of this (and more) in the docs - we are not short of space :-) It is important that people realize just how complicated sort-speed is - if nothing else. Paul --- Paul A. Bristow Prizet Farmhouse Kendal UK LA8 8AB +44 (0) 1539 561830

Thijs van den Berg

10:53 a.m.

...

...
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Steven Ross Sent: 19 November 2014 23:18 To: boost@lists.boost.org Subject: Re: [boost] [review] Last day for formal review for Sort library

It's the better of O(N*log(N)) and O(N*((K/S)+S)), where K is the key length in bits and S is the division factor (11 by default). Practically, it ends up being a slightly better than constant speedup for integer_sort and float_sort (that fluctuates based on the number of radix iterations required) from N=10000 up until N gets close to 2^K when it starts switching to bucketsort. For string_sort the speedup can be much more dramatic for long keys (because it does O(N) comparisons, and string comparisons take time proportional to the length of the key, making comparison-based sort worst-case be O(N*log(N)*K/C), where C may be as high as 64), but for shorter strings it has a speedup similar to integer_sort and float_sort. Bottom line is: it's faster, but complicated to summarize. Do you have a recommendation on how to summarize that better?

Put the bottom line first ;-)

But do include all of this (and more) in the docs - we are not short of space :-)

It is important that people realize just how complicated sort-speed is - if nothing else.

Paul

I thinks that’s a very good idea. The reason of existence of this code is “performance” so be as transparent as possible about that aspect. If people think this is to technical then they’ll skip the paragraph, but I I doubt it: typical users are likely looking for alternatives to std::sort because of performance. Knowing this will make it easier for developers to stress test performance benchmarks and know where to look for corner cases.

DUPUIS Etienne

3:32 p.m.

Greetings, I do not intend to send a formal review but I nevertheless compiled and executed the provided test code, with clang 3.5. I added timings and confirmed that in the provided test case, this sort is significantly faster than std::sort() provided by the latest Linux Mint distribution. To the author, I would suggest two very minor issues : 1. There are compiler warnings that can easily be avoided (seen with '-Wunused-parameter' and '-Wold-style-cast'). 2. In string_sort_test.cpp, tests are performed using an array of 32 elements. 10 000 or more should make a more convincing test. I have also seen in the documentation that samples are available. Indeed, they are in git. However, I would appreciate samples in the documentation, without any useless I/O code (useless to understand how to use the sort functions). For example, taken from 'keyplusdatasample.cpp' : // --------------- Start of code snippet ------------------------ struct DATA_TYPE { int key; std::string data; }; struct lessthan { inline bool operator()(const DATA_TYPE &x, const DATA_TYPE &y) const { return x.key < y.key; } }; struct rightshift { inline int operator()(const DATA_TYPE &x, const unsigned offset) { return x.key >> offset; } }; std::vector<DATA_TYPE> array; integer_sort(array.begin(), array.end(), rightshift(), lessthan()); // --------------- End of code snippet ------------------------ Having concise code snippets like the above in the documentation is in my opinion the best way to have users quickly understand how the library must be used. I would also like to see one or two more example using "complex" data types. Consider for example : // --------------- Start of code snippet ------------------------ struct Md5Sum { uint8_t bytes[16]; bool operator<(const Md5Sum& md5) const { for (int k = 0; k < 16; k++) if (bytes[k] < md5.bytes[k]) return true; return false; } }; // --------------- End of code snippet ------------------------ After reading your documentation and the provided samples, I understand that I should use string_sort() after defining some get_char() and get_length() functions. Am I right ? Another example, sorting by birth data and then by name : // --------------- Start of code snippet ------------------------ struct Id { std::string name; time_t birth; bool operator<(const Id& id) const { return (birth < id.birth) || ((birth == id.birth) && (name < id.name)); } }; // --------------- End of code snippet ------------------------ Again, I understand that I must use string_sort and split the birth date into 8-bit components. Am I right ? These samples are important as there is probably much more people sorting structures than simple integers. Regards, Étienne ------ Message d'origine ------ De : "Edward Diener" <eldiener@tropicsoft.com> À : "boost@lists.boost.org" <boost@lists.boost.org> Cc: "boost-users@lists.boost.org" <boost-users@lists.boost.org> Envoyé 19/11/2014 13:30:53 Objet : [boost] [review] Last day for formal review for Sort library

...

Today is the last day for the formal review of the Sort library by Steven Ross, which started November 10.

We have had some great discussions and reviews so far and I would like to thank everybody involved, but as always we need more formal reviews. If you need any delay for your review after today, please contact me at 'eldiener at tropicsoft dot com' and I will take your review into account if given within a reasonable period of time after today.

This message and any attachments are confidential and intended solely for the addressees. Any unauthorized modification, edition, use or dissemination is prohibited. If you have received this message by mistake, please notify us immediately. ATEME decline all responsibility for this message if it has been altered, deformed, falsified or even edited or disseminated without authorization.

Paul A. Bristow

4:21 p.m.

...

-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of DUPUIS Etienne Sent: 20 November 2014 15:33 To: boost@lists.boost.org Subject: Re: [boost] [review] Last day for formal review for Sort library

Greetings,

I do not intend to send a formal review but I nevertheless compiled and executed the provided test code, with clang 3.5. I added timings and confirmed that in the provided test case, this sort is significantly faster than std::sort() provided by the latest Linux Mint distribution.

To the author, I would suggest two very minor issues :

1. There are compiler warnings that can easily be avoided (seen with '-Wunused- parameter' and '-Wold-style-cast'). 2. In string_sort_test.cpp, tests are performed using an array of 32 elements. 10 000 or more should make a more convincing test.

I have also seen in the documentation that samples are available. Indeed, they are in git. However, I would appreciate samples in the documentation, without any useless I/O code (useless to understand how to use the sort functions).

For example, taken from 'keyplusdatasample.cpp' :

// --------------- Start of code snippet ------------------------

<snip to end> Eienne is quite correct, and using Quickbook snippets system allows the reader to see both selected snippets of the program in the documentation, but also allows a clickable link to the full .cpp file. And since the code example is known to compile and run (you can provide a snippet of the output too) one can be sure that What You See is exactly What you Will get to Work. Several people, including me, can help authors to get started on Quickbook documentation. Ask. HTH Paul --- Paul A. Bristow Prizet Farmhouse Kendal UK LA8 8AB +44 (0) 1539 561830

Steven Ross

21 Nov 21 Nov

5:01 a.m.

Thanks for your input Dupuis. On Thu Nov 20 2014 at 10:32:46 AM DUPUIS Etienne <e.dupuis@ateme.com> wrote:

...

Greetings,

I do not intend to send a formal review but I nevertheless compiled and executed the provided test code, with clang 3.5. I added timings and confirmed that in the provided test case, this sort is significantly faster than std::sort() provided by the latest Linux Mint distribution.

To the author, I would suggest two very minor issues :

1. There are compiler warnings that can easily be avoided (seen with '-Wunused-parameter' and '-Wold-style-cast').

...

2. In string_sort_test.cpp, tests are performed using an array of 32 elements. 10 000 or more should make a more convincing test.

Thanks for pointing that out! I've increased it to 100000, adding a

I've fixed the ones in my examples and library in the develop branch (thanks for explaining how to detect them). Some examples use other boost libraries which include plenty of code with old-style casts. trivial amount to the total test+build time.

...

I have also seen in the documentation that samples are available. Indeed, they are in git. However, I would appreciate samples in the documentation, without any useless I/O code (useless to understand how to use the sort functions).

I do have some basic usage snippets in the documentation, but the idea behind the links to the full example code is that I'll maintain the code, and nothing necessary to run the code is missing. Would it help if I created a library to pull most of the I/O code out of the examples, so that most of what is visible is sorting?

...

For example, taken from 'keyplusdatasample.cpp' :

// --------------- Start of code snippet ------------------------

struct DATA_TYPE { int key; std::string data; }; struct lessthan { inline bool operator()(const DATA_TYPE &x, const DATA_TYPE &y) const { return x.key < y.key; } };

struct rightshift { inline int operator()(const DATA_TYPE &x, const unsigned offset) { return x.key >> offset; } };

std::vector<DATA_TYPE> array; integer_sort(array.begin(), array.end(), rightshift(), lessthan());

I'm confused. This code snippet is almost identical to the one at the bottom of the integer_sort.html page. What am I missing?

...

// --------------- End of code snippet ------------------------ Having concise code snippets like the above in the documentation is in my opinion the best way to have users quickly understand how the library must be used.

I would also like to see one or two more example using "complex" data types. Consider for example :

// --------------- Start of code snippet ------------------------ struct Md5Sum { uint8_t bytes[16];

bool operator<(const Md5Sum& md5) const { for (int k = 0; k < 16; k++) if (bytes[k] < md5.bytes[k]) return true; return false; } };

This seems very similar to the structure example at the bottom of string_sort.html. How is this example superior to that one?

...

// --------------- End of code snippet ------------------------ After reading your documentation and the provided samples, I understand that I should use string_sort() after defining some get_char() and get_length() functions. Am I right ?

Yes

...

Another example, sorting by birth data and then by name :

// --------------- Start of code snippet ------------------------ struct Id { std::string name; time_t birth;

bool operator<(const Id& id) const { return (birth < id.birth) || ((birth == id.birth) && (name < id.name)); } };

// --------------- End of code snippet ------------------------

...

Again, I understand that I must use string_sort and split the birth date into 8-bit components. Am I right ?

That's a good idea! I'll add such an example if the library is deemed worthy of including in boost.

...

These samples are important as there is probably much more people sorting structures than simple integers.

You're probably right. Key-plus-data is how most of my sorts at work

operate.

DUPUIS Etienne

11:25 a.m.

------ Message d'origine ------ De : "Steven Ross" <spreadsort@gmail.com> À : "boost@lists.boost.org" <boost@lists.boost.org> Envoyé 21/11/2014 06:01:24 Objet : Re: [boost] [review] Last day for formal review for Sort library

...

Thanks for your input Dupuis.

I do have some basic usage snippets in the documentation, but the idea behind the links to the full example code is that I'll maintain the code, and nothing necessary to run the code is missing. Would it help if I created a library to pull most of the I/O code out of the examples, so that most of what is visible is sorting? Indeed there are already some code snippets at the bottom of the reference pages, I just did not saw them, I am sorry for that, I looked for sample on the first page and saw right away that they were located in separate files hence went to open and read these files.

Perhaps in the content page the item 'Reference' could be changed to 'Reference and Samples' or the samples could be put on their own page ? There are many developpers who looks first for samples and then for API reference once they have grabbed the basic usage of an API. The file samples as they are are fine in my opinion, there is no need to pull out the I/O code.

...

I'm confused. This code snippet is almost identical to the one at the bottom of the integer_sort.html page. What am I missing?

You are of course right.

...

...
Consider for example :

// --------------- Start of code snippet ------------------------ struct Md5Sum { uint8_t bytes[16];

bool operator<(const Md5Sum& md5) const { for (int k = 0; k < 16; k++) if (bytes[k] < md5.bytes[k]) return true; return false; } };

This seems very similar to the structure example at the bottom of string_sort.html. How is this example superior to that one?

Very similar. In that case, it was not immediately obvious to me that I could use string_sort() to sort binary data. Perhaps adding a sentence to that effect in the introduction or overloading section would be nice. Regards, and thanks for your work, Étienne This message and any attachments are confidential and intended solely for the addressees. Any unauthorized modification, edition, use or dissemination is prohibited. If you have received this message by mistake, please notify us immediately. ATEME decline all responsibility for this message if it has been altered, deformed, falsified or even edited or disseminated without authorization.

Steven Ross

23 Nov 23 Nov

noon

Dupuis,

...

...
Consider for example :

...
// --------------- Start of code snippet ------------------------ struct Md5Sum { uint8_t bytes[16];

bool operator<(const Md5Sum& md5) const { for (int k = 0; k < 16; k++) if (bytes[k] < md5.bytes[k]) return true; return false; } };

This seems very similar to the structure example at the bottom of string_sort.html. How is this example superior to that one? Very similar. In that case, it was not immediately obvious to me that I could use string_sort() to sort binary data. Perhaps adding a sentence to that effect in the introduction or overloading section would be nice.

That's a good point. I've added a mention that it can sort binary data to the introduction for string_sort, and a mention that string_sort can sort anything with a strict weak ordering that std::sort can sort to the library introduction.

Steven Ross

9 Dec 9 Dec

11:17 a.m.

Dupuis, On Fri Nov 21 2014 at 12:01:24 AM Steven Ross <spreadsort@gmail.com> wrote:

...

Thanks for your input Dupuis.

On Thu Nov 20 2014 at 10:32:46 AM DUPUIS Etienne <e.dupuis@ateme.com> wrote:

...
Another example, sorting by birth data and then by name :

// --------------- Start of code snippet ------------------------ struct Id { std::string name; time_t birth;

bool operator<(const Id& id) const { return (birth < id.birth) || ((birth == id.birth) && (name < id.name)); } };

// --------------- End of code snippet ------------------------

...
Again, I understand that I must use string_sort and split the birth date into 8-bit components. Am I right ?

That's a good idea! I'll add such an example if the library is deemed worthy of including in boost.

I've added such an example to the develop branch that handles signed ints, signed floats, and multiple string keys all together: https://github.com/spreadsort/sort/blob/develop/example/generalizedstruct.cp... And will mention it in the docs. Surprisingly, considering all the complexity of its get_char function, it's about 50% faster than std::sort on random data formatted like this. Below are the comparison and get_char functions for you to see. Comparison sorting is definitely easier, but hybrid-radix sorting is feasible and surprisingly efficient even for harder-to-represent cases: struct DATA_TYPE { time_t birth; float net_worth; string first_name; string last_name; }; static const int birth_size = sizeof(time_t); static const int first_name_offset = birth_size + sizeof(float); static const boost::uint64_t base_mask = 0xff; struct lessthan { inline bool operator()(const DATA_TYPE &x, const DATA_TYPE &y) const { if (x.birth != y.birth) { return x.birth < y.birth; } if (x.net_worth != y.net_worth) { return x.net_worth < y.net_worth; } if (x.first_name != y.first_name) { return x.first_name < y.first_name; } return x.last_name < y.last_name; } }; struct bracket { inline unsigned char operator()(const DATA_TYPE &x, size_t offset) const { // Sort date as a signed int, returning the appropriate byte. if (offset < birth_size) { const int bit_shift = 8 * (birth_size - offset - 1); unsigned char result = (x.birth & (base_mask << bit_shift)) >> bit_shift; // Handling the sign bit. Unnecessary if the data is always positive. if (offset == 0) { return result ^ 128; } return result; } // Sort a signed float. This requires reversing the order of negatives // because of the way floats are represented in bits. if (offset < first_name_offset) { const int bit_shift = 8 * (first_name_offset - offset - 1); unsigned key = float_mem_cast<float, unsigned>(x.net_worth); unsigned char result = (key & (base_mask << bit_shift)) >> bit_shift; // Handling the sign. if (x.net_worth < 0) { return 255 - result; } // Increasing positives so they are higher than negatives. if (offset == birth_size) { return 128 + result; } return result; } // Sort a string that is before the end. This approach supports embedded // nulls. If embedded nulls are not required, then just delete the "* 2" // and the inside of the following if just becomes: // return x.first_name[offset - first_name_offset]; const unsigned first_name_end_offset = first_name_offset + x.first_name.size() * 2; if (offset < first_name_end_offset) { int char_offset = offset - first_name_offset; // This signals that the string continues. if (!(char_offset & 1)) { return 1; } return x.first_name[char_offset >> 1]; } // This signals that the string has ended, so that shorter strings come // before longer ones. if (offset == first_name_end_offset) { return 0; } // The final string needs no special consideration. return x.last_name[offset - first_name_end_offset - 1]; } }; struct getsize { inline size_t operator()(const DATA_TYPE &x) const { return first_name_offset + x.first_name.size() * 2 + 1 + x.last_name.size(); } };

DUPUIS Etienne

12 Dec 12 Dec

10:39 a.m.

------ Message d'origine ------ De : "Steven Ross" <spreadsort@gmail.com> À : "boost@lists.boost.org" <boost@lists.boost.org> Envoyé 09/12/2014 12:17:07 Objet : Re: [boost] [review] Last day for formal review for Sort library

...

I've added such an example to the develop branch that handles signed ints, signed floats, and multiple string keys all together: https://github.com/spreadsort/sort/blob/develop/example/generalizedstruct.cp... And will mention it in the docs. Surprisingly, considering all the complexity of its get_char function, it's about 50% faster than std::sort on random data formatted like this. Below are the comparison and get_char functions for you to see. Comparison sorting is definitely easier, but hybrid-radix sorting is feasible and surprisingly efficient even for harder-to-represent cases:

struct DATA_TYPE { time_t birth; float net_worth; string first_name; string last_name; };

Great, this is exactly the kind of more complex example one needs. Regards, Étienne

...

3862

Age (days ago)

3885

Last active (days ago)

List overview

Download

11 comments

6 participants

participants (6)

DUPUIS Etienne
Edward Diener
Paul A. Bristow
Steven Ross
Thijs (M.A.) van den Berg
Thijs van den Berg