Re: [boost] Adding my sorting algorithm, ska_sort, to boost, attempt 2

28 Jan 2020

      Hello,

The first thing is to apologize for the delay. I have medical problems in
my family, and all my free time is gone.

I didn't remember that in February 2019 we evaluated this same algorithm.
In the message that Steven wrote, he mentioned a list of major deficiencies
in the algorithm.

*It looks like ska_sort is a radix sort designed mostly for the best-case,
and that if you feed it the alrbreaker or binaryalrbreaker distributions it
will perform poorly (though not as poorly as radix implementations with no
fallback or that fallback to insertionsort).*

*There might be some counting and swapping optimizations here to copy into
spreadsort (I see block operations), though those might not work if the
items being sorted are references or pointers.*

*Mostly-sorted data is a common case to take seriously, and it took only
modest tweaking of spreadsort to do well there.*

*Based on Francisco's numbers, I should retune spreadsort to take advantage
of the higher speed of pdqsort vs std::sort, at least if I want to optimize
for 64-bit integer sorting on modern processors.*

 *https://www.boost.org/doc/libssort/spreadsort/*
<https://www.boost.org/doc/libs/1_67_0/boost/sort/spreadsort/detail/constants.hpp>

*A higher log_min_split_count (especially) and log_finishing_count is
probably appropriate.*

Every so often we get requests for evaluation of sorting algorithms, many
of which are "world's fastest algorithm". We try to evaluate such requests.
I think everyone deserves respect for their work. But so do we who spend
our time doing the evaluation.

I did the evaluation of the algorithm, and I was surprised, because the
recommendations we made in February, have not been followed. So the tests
I've done, have been to corroborate what we already said in February. In
other words, repeat work.

The comments I have to make to the algorithm are as follows:

1- The algorithm needs C++17. This is a big problem because the library is
made to be compatible with C++11. In many companies and workplaces, the
compilers that are being used are old, at most compatible with C++11. In
the same way that for production you don't usually use the latest version
of the Operating System and SW, you don't usually use the latest version of
the compiler either.

It is not possible to pass a part of the library to C++17 and another part
to remain with C++11. Passing it all to C++17 means leaving out many people
and many companies, and that goes against the spirit of the library which
is to reach as many users as possible.

2.- The algorithm has not implemented worst-case control, nor when all
elements are the same or are ordered, as it was indicated in February.

3.- The algorithm is incomplete. It is to be expected from the algorithm
that it orders, not only integers but also signed and unsigned integers, 32
64 and 128-bit floating-point numbers and strings.

4.- The only argument of the algorithm is its speed. About this, I have
several comments

a) When the implementation incorporates worst-case control of repeated and
equal elements, the speed may change

b) The speed of an algorithm is a factor to be taken into account, but it
is not the only one. For example, the Timsort algorithm is slower than any
other algorithm with random data. But it is extremely fast when the data
are almost ordered, a situation that occurs with much more frequency than
we imagine. This algorithm is highly valued by programmers and companies,
and even the Java language uses it for its sort function.

Ska_sort is very fast with the random data, but it is quite mediocre when
the data are almost ordered.

5) Having a hybrid algorithm like Spreadsort, having another hybrid
algorithm is not very interesting, because with small data sets, the time
differences, if any, will be small, and with large data sets, parallel
algorithms are used, which provide much more speed than any single-stranded
algorithm

In addition to all this, I would like to express my discomfort at having to
waste my time, which, as I said before, is very scarce, in repeating a
test, which does not contribute anything, because it has not followed any
of our recommendations.

Francisco Tapia

El jue., 23 ene. 2020 a las 19:36, Malte Skarupke via Boost (<
boost@lists.boost.org>) escribió:
...
...
If I understood correctly, your new algorithm increases the requirement
for Boost.Sort to C++17 from C++11. I don't see a good reason for doing
that. You can replace if constexpr and the generic lambdas in your
implementation with other code constructs that work on C++11.
It only requires C++17 if you want to use ska_sort. The other algorithms
remain at C++11. I agree that this is a little unfortunate, but when I
briefly tried to get rid of "if constexpr" in the code, it turned out to be
a giant pain. It would also probably make the code compile much more
slowly. If you are trying to sort nested types, like
vector<vector<pair<string, int>>>, compile time can explode without some of
the short circuiting that I'm doing using if constexpr. (I can explain why
it explodes if you're interested, but for now I'll just say that it works
different from comparison based sorting and you have to instantiate new
code at every level of depth) I know I can do the same short circuiting
without if constexpr if I put in enough work, but I think the code quality
would suffer a lot, making the code much harder to read and maintain. (not
that it's simple as is, but it would go from "hard to read" to "impossible
to read")
...
Regarding the interface, shouldn't it follow the example of spreadsort?
The main benefit of this over spreadsort is that it's more generic. The
spreadsort interface is hard to generalize to other types. For example it
allows you to provide your own shift operator. But how do you provide a
shift operator when trying to sort a pair<string, int>? When you get to the
end of the string, you'd have to start providing bits from the int, but
that's going to be annoyingly complicated to write. (especially since I'm
not sure if spreadsort works on byte boundaries)
It gets even more complicated when trying to sort a variant<string, int>.
Because now the bits depend on which value of the variant is active, but
you don't want to have to check that over and over again.
The interface was very important for making this more general. When it
comes to this kind of decisions, I have to make as many of them invisible
for the user as possible, because only the internals of the sorting
algorithm can know which bits to sort on next.
About providing a single header: That is definitely what I would do as a
standalone library, but I think in boost it's OK to split this into
multiple headers. Users can include boost/sort/ska_sort/ska_sort.hpp which
includes all necessary headers. So usually they don't care that it's
multiple headers internally. And if you want slightly faster compile time,
you can just include ska_sort_base.hpp without the headers that provide
tuple or variant implementations. (this is currently not part of the
documentation because I'm not sure if I want to provide that forever)
So why make this part of boost instead of a standalone header? Basically I
like having the assurance that it'll be kept working for a long time. I'm
willing to do that and to put future work into this, but having it in boost
ensures that that continues to happen even after I can no longer provide
support.
--
  Malte Skarupke
  malteskarupke@fastmail.fm
_______________________________________________
Unsubscribe & other changes:
http://lists.boost.org/mailman/listinfo.cgi/boost

Re: [boost] Adding my sorting algorithm, ska_sort, to boost, attempt 2

Francisco José Tapia