Thanks for your feedback, Phil.
Is a "faster" serial sort algorithm useful?
-------------------------------------------
In 2008, Luke Simonson said:

"Anyone interested in runtime of an application that spends a lot of time sorting large data volumes is going to do it on an up-to-date machine, which means 4, 8, 16 or 32 cores. I think std::sort is a straw man to compare yourself to, not because it isn't good at what it does, but because the target use case in question of very large data volumes is better suited to a parallel algorithm."
At that time I didn't entirely agree as I had some embedded applications that could benefit from faster sorting of perhaps tens of thousands of items. But in the intervening five years, even phones have gained multiple CPUs. Today, if I needed faster sorting my first thought would be to parallelise it.
Sure, you can gain speedup by parallelizing, but if you can use half as many cores to get the same speedup by using a faster subsort, isn't that helpful? On phones, the biggest issue is usually battery usage, which is why many of them rarely use more than one CPU core at a time. On servers, frameworks like MapReduce or other non-shared-memory techniques are commonly used to distribute the problem in many smaller pieces across separate processes. This library provides a faster subsort that will work with every standard way to parallelize a CPU-based sorting problem except LSD radix sorting.

Parallel sorting can be valuable, but because of all the potential applications, the variation in memory, runtime, and CPU-usage constraints, and the level of shared-memory parallelization, which solution is best will vary widely across applications, complicating any library that attempts to meet these various needs. I'm not fundamentally opposed to adding library calls for the specific case of shared-memory parallel sorting, but it would add substantial complexity and still not cover all parallel-sorting needs.
Code Quality
------------
Back in 2008, the code had issues like calls to malloc() and free() and a lack of exception safety. No doubt it has improved, but perhaps any keen reviewers could look for any remaining issues.
There is now just a vector::resize of the bin_cache in the develop branch. I looked at eliminating that by using (more) stack memory, but it slowed down the library.
Skipping common prefixes
------------------------
One undeniable complexity advantage of this algorithm compared to std::sort<string> is that it doesn't need to re-compare common string prefixes once it knows that they are equal. Intuitively, one would expect a substantial speedup from this change if the input contains many strings with common prefixes. But intuition is not always correct. I did some experiments using the pathnames of all the files on my computer as the test data; these certainly have many common prefixes e.g. /usr/local/include/boost/. I wrote a modified quicksort that would skip common prefixes (see http://svn.chezphil.org/libpbe/trunk/include/string_qsort.hh ) and concluded that actually the intuition is wrong; comparing bytes can be done 8-at-a-time on a 64-bit system, and the cost of constantly re-comparing strings like /usr/local/include/boost/ must be small. So the intuition that Steven's algorithm would be useful in applications where strings have long common prefixes should not be followed without first measuring it.
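For illustration, the core idea behind a prefix-skipping comparison can be sketched as follows. The less_from helper and its offset parameter are hypothetical, not taken from either library; offset is the number of leading bytes already known to be shared by both strings:

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <string>

// Sketch of prefix-skipping comparison: once every string in a partition is
// known to share its first `offset` bytes, compare from that point onward.
// Assumes offset <= min(a.size(), b.size()), which holds for a shared prefix.
bool less_from(const std::string& a, const std::string& b, std::size_t offset) {
    std::size_t n = std::min(a.size(), b.size()) - offset; // bytes left to compare
    int c = std::memcmp(a.data() + offset, b.data() + offset, n);
    return c < 0 || (c == 0 && a.size() < b.size());       // shorter string sorts first
}
```

Note that std::memcmp is exactly the kind of word-at-a-time comparison Phil describes, which is why re-comparing a shared prefix with it is so cheap in the first place.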
Highly compressible data is one such application, and the algorithm works well there in bzip, but the common prefixes there were as long as hundreds of bytes. With more conventional data, you are correct: it isn't much of an advantage. I added that feature more to minimize the disadvantage created by otherwise doing a full radix sort on each byte when nothing differs, which is a common issue with radix sorts on strings (if there are better solutions, I'm open to them). The end result was consistently faster than comparison-based sorting when fed real data, including the hierarchical strings I tested with, but it had no relative advantage on short hierarchies like those you describe versus normal strings.
Does anyone sort a vector<int>?
--------------------------------
I don't think I've ever needed to sort a vector<int>; there is normally some associated data, e.g. I'm sorting a vector<struct> (or similar) where one field of the struct is an int key. Or the key has multiple fields, like an (x,y) pair, or there is some extra feature of the comparison such as being case-insensitive. The proposed API does provide a reasonable way to sort these more complex containers, i.e. by providing additional functors that (for string_sort) return the nth character of the key string. But what we don't know is whether its performance benefits relative to introsort also apply in these cases: as far as I've seen, the performance results are all for simple vector<int>-like cases. One could easily imagine that the time taken for copying would dominate, diluting the algorithmic benefits.
If you run the tune.pl script that is in the base directory of the library, you'll see the speedup on key+value sorts. They actually do fewer swaps than a comparison-based algorithm, because there are fewer iterations, so large chunks of data sort faster relative to std::sort with this approach than small chunks of data. Some excerpts from my latest tune.pl run (with no args) on Ubuntu:

Verifying integer_sort
runtime: 87.596183
std::sort time: 165.384112
speedup: 88.80%

Verifying float_sort
runtime: 85.212854
std::sort time: 182.711132
speedup: 114.42%

Verifying string_sort
runtime: 115.193192
std::sort time: 250.349607
speedup: 117.33%

Verifying integer_sort with separate key and data
runtime: 405.510518
std::sort time: 991.96294
speedup: 144.62%

Verifying string_sort with all functors
runtime: 139.453782
std::sort time: 311.68992
speedup: 123.51%

Verifying boost::sort on its custom-built worst-case distribution
runtime: 17.710653
std::sort time: 17.606037
speedup: -0.59%

I encourage all reviewers to run this script themselves to see how fast this library is on their system (the README describes how to do this, or you can just call ./tune.pl -small [-windows] from the libs/sort directory and wait a minute). A standard trick with a large data payload is to use a pointer to the data to minimize the amount of memory that must be swapped, so sorting with a really large data payload should be fairly unusual.