[review] Formal review period for Sort library continuing through November 19
The formal review of the Sort library by Steven Ross, which started November 10 and is scheduled to continue through November 19th, begins its last 5 days today. We have had some great discussions and reviews so far and I would like to thank everybody involved. I would like to encourage all interested programmers, whether Boost developers or Boost users, to comment on and/or review the library.

About the Sort library
======================

The Sort library implements hybrid sorting algorithms based on Spreadsort ( http://en.wikipedia.org/wiki/Spreadsort ), of which the author of the library, Steven Ross, is the inventor. The algorithm provides a sort that is faster than O(n*log(n)). The library provides a generic implementation of high-speed sorting algorithms that outperform those in the C++ standard in both average and worst case performance. These algorithms only work on random access iterators. They are hybrids using both radix and comparison-based sorting, specialized to sorting common data types, such as integers, floats, and strings. These algorithms are encoded in a generic fashion and accept functors, enabling them to sort any object that can be processed like these basic data types.

Where to get it
===============

The library is available on github at https://github.com/spreadsort/sort. The library is in modular Boost format and can be cloned to libs/sort under your local modular boost directory. As the review manager, I have provided online documentation at: http://eldiener.github.io/sort

Review guidelines
=================

Reviews should be submitted to the developer list (boost@lists.boost.org), preferably with '[sort]' in the subject. If you do not wish to do so for some reason, or are not subscribed to the developer list, you can send your review privately to me at 'eldiener at tropicsoft dot com'. If so, please let me know whether or not you'd like your review to be forwarded to the list.

For your review you may wish to consider the following questions:

- What is your evaluation of the design?
- What is your evaluation of the implementation?
- What is your evaluation of the documentation?
- What is your evaluation of the potential usefulness of the library?
- Did you try to use the library? With what compiler? Did you have any problems?
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
- Are you knowledgeable about the problem domain?

And finally, every review should attempt to answer this question:

- Do you think the library should be accepted as a Boost library?

Be sure to say this explicitly so that your other comments don't obscure your overall opinion. Even if you do not wish to give a full review, any technical comment regarding the library is welcome as part of the review period and will help me as the review manager decide whether the library should be accepted as a Boost library. Any questions about the use of the library are also welcome.
Dear list, dear Edward, dear Steven,

First of all I would like to congratulate Steven on finally having his library reviewed. I have been aware of the library for years.
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
When I first learned about the existence of the library (as I said, several years ago), I read the code and the documentation closely. Today, I have quickly scanned through the previous discussion, re-read the introduction and the rationale from the documentation, and taken a quick glance over the code to check for any important changes.
- Are you knowledgeable about the problem domain?
Yes, quite. I have strong formal education in algorithm design and analysis. I also have a passion for sorting algorithms and I have been in the process of writing my own C++ sorting library, though I have not had the chance to work on the latter since early 2012. My own library mostly contains well-known comparison sorts but it also contains a few interesting innovations of my own. I have done a lot of benchmarking and analysis of my own algorithm implementations, so I’m well aware of the issues that are relevant in comparing sorting algorithms.

Please note that I do not consider my own library a competitor for Steven Ross’, though. The aims and contents of the libraries seem to be sufficiently different that there can be a place for both. Apart from that, Steven Ross’ library is in a much more “production-ready” state than mine.
- What is your evaluation of the design?
I’m not convinced that string_sort, float_sort and integer_sort need to be separate interfaces. I believe that the function to extract “digits” from the values for the MSD part of the algorithm can be captured in a fully generic way (that should be applicable to all radix-like sorting algorithms), coupled with two additional, optional parameters (pre-transform and post-transform) specifying functions for doing the temporary “treat floats as integers” trick.

Apart from that, I think the design is sensible.
- What is your evaluation of the implementation?
Clever.
- What is your evaluation of the documentation?
Poor.

The rationale, which in my opinion is the most important part of the documentation and should be in the front, is too lengthy and misses its target. It does discuss the theoretical advantages of radix sort as well as the (technical) relative merits of MSD vs LSD and several hybrid sorting algorithms, at great length, but it does not address the question why somebody might want to use a hybrid radix/comparison sort in the first place. It seems to be a defense of design decisions rather than a justification of the very existence of the library.

The introduction severely and needlessly overstates the value of the algorithm, starting from the very first sentence:

“The Boost Sort library provides a generic implementation of high-speed sorting algorithms that outperform those in the C++ standard in both average and worst case performance.”

strongly implying that spreadsort is fit for replacing introsort (std::sort) in all possible situations, which it just isn’t.

Then the introduction mentions three conditions under which spreadsort is “fastest relative to std::sort”, but it should instead mention two conditions under which a user may want to consider using it at all:

1. Value type can be treated as a sequence of digits/characters.
2. Many keys to sort and/or very long keys.

This would be more honest about the scope of application, hence more helpful to the potential user. I believe this does not invalidate the library. Under the two conditions mentioned above, spreadsort can in fact offer significant performance improvements. I think the latter has been sufficiently proven.

The title of the library is also too pretentious, as other reviewers have already pointed out. I think it should be called “Spreadsort” rather than “Sort”.

So to summarize this part of my review, I think that the documentation is insufficient as long as it doesn’t properly address the following points in the following order:

1. Why and under which conditions the library may be useful.
2. How to use the library.
3. What is going on inside the library (and why in that way).

The good news is that parts 2 and 3 have already been provided for and that part 1 should be very short. So I think relatively little work is needed to fix the documentation.
- What is your evaluation of the potential usefulness of the library?
As I explained above, I think the library has somewhat narrow applications, but it can be highly valuable when applicable. This is true of most other sorting algorithms as well. Spreadsort is definitely useful enough for inclusion in Boost, especially as long as no generic implementations of other radix/comparison hybrids are available.
- Did you try to use the library? With what compiler? Did you have any problems?
Not this year. I may have tested the library when I first learned about its existence, but I don’t have any clear memory of the results.
And finally, every review should attempt to answer this question:
- Do you think the library should be accepted as a Boost library?
Yes, sort of. I think spreadsort should be added to Boost.Algorithm (please!), but first the documentation should be improved as I discussed. While I think that the interface should be more general, that is not a condition for acceptance as far as I’m concerned.

I do not agree with Niall Douglas that the library should add fallbacks for ranges that don’t offer random access. In general, I think that if you want to sort a range then you should ensure from the outset that it is a random access range, with the exception of very short ranges, for which quadratic complexity does not hurt, or linked lists, which can be efficiently sorted but only with intrusive sorting methods. Neither situation is relevant for this implementation of spreadsort.

Kind regards,
Julian Gonggrijp
Thank you for your feedback Julian. I'll improve the documentation.
On Sun Nov 16 2014 at 9:22:36 PM Julian Gonggrijp wrote:
- What is your evaluation of the design?
I’m not convinced that string_sort, float_sort and integer_sort need to be separate interfaces. I believe that the function to extract “digits” from the values for the MSD part of the algorithm can be captured in a fully generic way (that should be applicable to all radix-like sorting algorithms), coupled with two additional, optional parameters (pre-transform and post-transform) specifying functions for doing the temporary “treat floats as integers” trick.
string_sort is this fully generic approach. integer_sort and float_sort
can be seen as specializations that allow bit-level granularity of the radix operations and work well for keys that fit into a machine word. If you can think of some way to enable the sort-floats-as-ints trick as an option without harming performance for non-floats, I'd be very interested.
The rationale, which in my opinion is the most important part of the documentation and should be in the front, is too lengthy and misses its target. It does discuss the theoretical advantages of radix sort as well as the (technical) relative merits of MSD vs LSD and several hybrid sorting algorithms, to great length, but it does not address the question why somebody might want to use a hybrid radix/comparison sort in the first place. It seems to be a defense of design decisions rather than a justification of the very existence of the library.
The introduction severely and needlessly overstates the value of the algorithm, starting from the very first sentence:
“The Boost Sort library provides a generic implementation of high-speed sorting algorithms that outperform those in the C++ standard in both average and worst case performance.”
strongly implying that spreadsort is fit for replacing introsort (std::sort) in all possible situations, which it just isn’t.
Then the introduction mentions three conditions under which spreadsort is “fastest relative to std::sort”, but it should instead mention two conditions under which a user may want to consider using it at all:
1. Value type can be treated as a sequence of digits/characters.

Find me a type that you don't think this applies to, and I'll show you how it can. If std::sort can sort it, so can string_sort if you use the available functors properly. Maybe I should explain this in the documentation? I will agree to the extent that if performance isn't very critical relative to the code complexity of creating an efficient radix key expansion equivalent to the comparison operation, then people probably are better off with std::sort.

2. Many keys to sort and/or very long keys.

I agree, if most of your sorting runtime is going to be spent on lists of < 1000 elements, this library is of little use, though it should also do little harm due to the fallback to std::sort. I'll edit the docs in develop to make that more clear.
So to summarize this part of my review, I think that the documentation is insufficient as long as it doesn’t properly address the following points in the following order:
1. Why and under which conditions the library may be useful.
2. How to use the library.
3. What is going on inside the library (and why in that way).
The good news is that parts 2 and 3 have already been provided for and that part 1 should be very short. So I think relatively little work is needed to fix the documentation.
Will do.
Steven Ross wrote:
If std::sort can sort it, so can string_sort if you use the available functors properly.
How about:

    bool compare_distance_from_origin(point& a, point& b)
    { return sqr(a.x)+sqr(a.y) < sqr(b.x)+sqr(b.y); }

    std::sort(points.begin(), points.end(), &compare_distance_from_origin);

To me, it's clear that there are many kinds of sort where a radix sort can't be used. An interesting borderline case would be case-insensitive sorting; have you tried doing that?

Regards, Phil.
Phil,

On Mon, Nov 17, 2014 at 7:21 AM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Steven Ross wrote:
If std::sort can sort it, so can string_sort if you use the available functors properly.
How about:
bool compare_distance_from_origin(point& a, point& b) { return sqr(a.x)+sqr(a.y) < sqr(b.x)+sqr(b.y); }
std::sort(points.begin(),points.end(),&compare_distance_from_origin);
Your radix key is: sqr(x) + sqr(y). If you make all your functors work relative to that generated key, it'll work fine.
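For illustration, here is a minimal sketch of that approach, assuming the integer_sort overload that accepts a right-shift functor plus a comparison functor; the point type, the sqr helper, the header name and the namespace shown are assumptions, not code from this thread:

    #include <vector>
    #include "spreadsort.hpp"   // assumption: review-time header from the github repo

    struct point { int x, y; };
    inline long long sqr(long long v) { return v * v; }

    // Both functors operate on the synthetic radix key sqr(x) + sqr(y).
    struct shift_by_key {
      long long operator()(const point& p, unsigned offset) const
      { return (sqr(p.x) + sqr(p.y)) >> offset; }
    };
    struct compare_by_key {
      bool operator()(const point& a, const point& b) const
      { return sqr(a.x) + sqr(a.y) < sqr(b.x) + sqr(b.y); }
    };

    // std::vector<point> points = ...;
    // boost::sort::integer_sort(points.begin(), points.end(),
    //                           shift_by_key(), compare_by_key());  // namespace may differ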
To me, it's clear that there are many kinds of sort where a radix sort can't be used. An interesting borderline case would be case-insensitive sorting; have you tried doing that?
Your Get_char functor just needs to convert one case to the other before returning:

    struct bracket {
      inline unsigned char operator()(const DATA_TYPE &x, size_t offset) const {
        return tolower(x.a[offset]);
      }
    };

See https://github.com/spreadsort/sort/blob/master/example/stringfunctorsample.c... for an example of using string_sort with functors. I'll convert that example to do case-insensitive comparison, as that is an interesting case.

If you can represent your sort as a sequence of comparisons of sequences of bytes, you can convert each of those comparisons into a sequence of bytes that make up a synthetic radix. I am willing to concede that this might be complicated, and in some cases less efficient when sorting reasonable quantities of data, but it is doable. I think many people are confused about the limitations of radix sorts, because LSD radix sorting over more complex variable-length keys is inefficient, but MSD radix sort does well with variable-length keys.
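For context, a hedged sketch of how the full functor set might look for case-insensitive sorting of std::string, assuming the string_sort overload that takes a get-character functor, a get-length functor and a comparison functor; the functor names and the header name are illustrative:

    #include <cctype>
    #include <cstddef>
    #include <string>
    #include <vector>
    #include "string_sort.hpp"   // assumption: review-time header name

    struct lower_bracket {       // digit extraction, case-folded
      unsigned char operator()(const std::string& s, std::size_t offset) const {
        return static_cast<unsigned char>(
            std::tolower(static_cast<unsigned char>(s[offset])));
      }
    };
    struct get_length {          // tells string_sort how many characters a key has
      std::size_t operator()(const std::string& s) const { return s.size(); }
    };
    struct iless {               // matching case-insensitive comparison for the fallback sort
      static unsigned char low(char c) {
        return static_cast<unsigned char>(std::tolower(static_cast<unsigned char>(c)));
      }
      bool operator()(const std::string& a, const std::string& b) const {
        for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
          if (low(a[i]) != low(b[i])) return low(a[i]) < low(b[i]);
        return a.size() < b.size();
      }
    };

    // std::vector<std::string> v = ...;
    // boost::sort::string_sort(v.begin(), v.end(),
    //                          lower_bracket(), get_length(), iless());  // namespace may differ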
Steven Ross wrote:
On Mon, Nov 17, 2014 at 7:21 AM, Phil Endecott wrote:

Steven Ross wrote:
If std::sort can sort it, so can string_sort if you use the available functors properly.
How about:
bool compare_distance_from_origin(point& a, point& b) { return sqr(a.x)+sqr(a.y) < sqr(b.x)+sqr(b.y); }
std::sort(points.begin(),points.end(),&compare_distance_from_origin);
Your radix key is: sqr(x) + sqr(y). If you make all your functors work relative to that generated key, it'll work fine.
So my thought was, you wouldn't want to do that because all that extra maths is being done so frequently and the performance would be considerably worse than std::sort. But:
I think many people are confused about the limitations of radix sorts, because LSD radix sorting over more complex variable-length keys is inefficient, but MSD radix sort does well with variable-length keys.
Ah, that may be true; I was overlooking the fact that not every bit of the value is looked at if the values are sufficiently well separated. Regards, Phil.
Steven Ross wrote:
I’m not convinced that string_sort, float_sort and integer_sort need to be separate interfaces. I believe that the function to extract “digits” from the values for the MSD part of the algorithm can be captured in a fully generic way (that should be applicable to all radix-like sorting algorithms), coupled with two additional, optional parameters (pre-transform and post-transform) specifying functions for doing the temporary “treat floats as integers” trick.
string_sort is this fully generic approach. integer_sort and float_sort can be seen as specializations that allow bit-level granularity of the radix operations
Isn’t that a matter of defining indexing adapters for those types, taking a template parameter for the bitfield size? I mean, I would expect any radix-like sorting algorithm to take a function of type key -> index -> digit as an optional parameter (defaulting to operator[] if not provided). For floats and integers that function could be templated by either the number of bits per digit or the number of desired buckets (which is equivalent under exponentiation). Actually, that could be done for strings and other bitstream-like types as well.
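To make this suggestion concrete, here is a purely illustrative sketch of what such a key -> index -> digit functor might look like for unsigned integers, templated on the number of bits per digit; this is not an interface the library offers, it only spells out the idea:

    #include <cstddef>

    // Illustrative only: digit 0 is the most significant digit. Assumes
    // BitsPerDigit is smaller than the word size and divides it evenly.
    template <typename UInt, unsigned BitsPerDigit>
    struct uint_digits {
      unsigned operator()(UInt key, std::size_t index) const {
        const unsigned total_bits = sizeof(UInt) * 8;
        const unsigned shift =
            total_bits - BitsPerDigit * static_cast<unsigned>(index + 1);
        return static_cast<unsigned>(key >> shift) & ((1u << BitsPerDigit) - 1u);
      }
    };

    // e.g. uint_digits<unsigned, 8>()(0x12345678u, 0) == 0x12
    //      uint_digits<unsigned, 8>()(0x12345678u, 3) == 0x78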
and work well for keys that fit into a machine word.
Could you clarify the above remark? Why does the fully generic string_sort not work well for keys that fit into a machine word?
If you can think of some way to enable the sort-floats-as-ints trick as an option without harming performance for non-floats, I'd be very interested.
Default to the trivial (pre|post) transformation, and specialize the algorithm that maps the transformations over the (input|output) range to be a no-op for the trivial transformation. The compiler will eliminate the function call.
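A minimal sketch of that idea, with hypothetical names (nothing here is the library's actual API): the default transform is an inlined no-op, so callers who need no float-to-int trick pay nothing for the extra parameter.

    // Illustrative only: a trivial pre/post transform the optimizer removes.
    struct identity_transform {
      template <typename T>
      T operator()(const T& x) const { return x; }   // no-op, trivially inlined
    };

    // Hypothetical skeleton of the transform pass inside the algorithm.
    template <typename Iter, typename Transform>
    void apply_transform(Iter first, Iter last, Transform t) {
      for (; first != last; ++first)
        *first = t(*first);   // with identity_transform this reduces to a self-assignment
    }
    // A sorting front end could default Transform to identity_transform and only
    // do real work when the caller passes, e.g., a float-to-int bit caster.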
1. Value type can be treated as a sequence of digits/characters.
Find me a type that you don't think this applies to,
For starters, any product type (tuple or struct) containing variable-length subkeys. As an example, consider book records from a library database that need to be ordered first by author and then by title. If you insist on doing that with a (MSD) radix sort, you have two options:

1. Use a stable radix sort and make multiple passes over the data. In most cases, this would be less efficient than just doing a single pass with a comparison sort, not to mention that spreadsort isn’t stable.
2. Hack into the implementation details of the algorithm and recurse into the next key to be sorted by, before the buckets of the previous key are destroyed.

Case-agnostic string sorting is another example in which radix sorts don’t excel (even though you can make it work if you extend the interface considerably). This has been mentioned before.

In general, radix sorts only “shine” for types that can be injectively mapped to the scalar bitfields while preserving order (even if you can make it “work” for some other cases). Many common types and orders fit in that category, but there are infinitely many types and orders with different semantics.
and I'll show you how it can. If std::sort can sort it, so can string_sort if you use the available functors properly. Maybe I should explain this in the documentation?
No, you should explicitly mention the opposite. Your algorithm has value, not because it can be used for all imaginable scenarios but because it has something to offer in particular use cases. No library needs to be for everyone (not even std::sort, in fact). I really do think your algorithm has value but you need to be very precise about what it can do and what it cannot. Users will only appreciate your library more if you do so.
I will agree to the extent that if performance isn't very critical relative to the code complexity of creating an efficient radix key expansion equivalent to the comparison operation, then people probably are better off with std::sort.
2. Many keys to sort and/or very long keys.
I agree, if most of your sorting runtime is going to be spent on lists of < 1000 elements, this library is of little use, though it should also do little harm due to the fallback to std::sort. I'll edit the docs in develop to make that more clear.
To be honest I am sceptical that the algorithm even adds much value under 100K keys. So far, Frank Gennari has reported a 1.8x speedup for 100M keys. Now, the effect probably would have been more dramatic if the algorithm had been used for long strings rather than 32-bit integers, but that’s still at 100M keys. The difference between an O(n log n) algorithm and an O(nk) algorithm doesn’t grow particularly fast either (except for very small n). Introsort/quicksort cache locality is also about as good as MSD radix sort cache locality. I think the real added value of your algorithm lies in the millions of keys and beyond (which is nothing to be ashamed of!).

Please don’t take this as an attack; I really honestly believe that your spreadsort library has value. I just think that you should present it with a bit more modesty with regard to scope of application.

-Julian
Julian,
On Mon Nov 17 2014 at 4:27:34 PM Julian Gonggrijp wrote:
string_sort is this fully generic approach. integer_sort and float_sort can be seen as specializations that allow bit-level granularity of the radix operations
Isn’t that a matter of defining indexing adapters for those types, taking a template parameter for the bitfield size? I mean, I would expect any radix-like sorting algorithm to take a function of type key -> index -> digit as an optional parameter (defaulting to operator[] if not provided). For floats and integers that function could be templated by either the number of bits per digit or the number of desired buckets (which is equivalent under exponentiation). Actually, that could be done for strings and other bitstream-like types as well.
integer_sort and float_sort assume they are operating on something that can efficiently be treated as a single word. They use bitshifting and subtraction to find the difference between the most significant bits of two elements. The bitshifting paradigm is what makes those algorithms difficult to generalize to very large keys.
and work well for keys that fit into a machine word.
Could you clarify the above remark? Why does the fully generic string_sort not work well for keys that fit into a machine word?
It's not quite as efficient; it works 8 bits at a time instead of their 11, and doesn't use subtraction to eliminate bits that are identical between the min and max.
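For intuition, here is a rough, hedged illustration of the subtraction idea (my reading of the general technique, not the library's actual code): scan once for min and max, then size the first radix pass to the bits that actually vary and index buckets relative to min.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    inline unsigned rough_log2(unsigned long long x) {
      unsigned r = 0;
      while (x >>= 1) ++r;
      return r;
    }

    // Illustrative first MSD pass over unsigned keys.
    void first_pass_buckets(const std::vector<unsigned>& keys, unsigned bits_per_pass) {
      if (keys.empty()) return;
      unsigned mn = *std::min_element(keys.begin(), keys.end());
      unsigned mx = *std::max_element(keys.begin(), keys.end());
      if (mn == mx) return;   // one bucket, nothing to split
      unsigned log_range = rough_log2(static_cast<unsigned long long>(mx) - mn) + 1;
      unsigned shift = log_range > bits_per_pass ? log_range - bits_per_pass : 0;
      for (std::size_t i = 0; i < keys.size(); ++i) {
        std::size_t bucket = (keys[i] >> shift) - (mn >> shift);  // shared high bits drop out
        (void)bucket;   // a real pass would count and scatter into this bucket
      }
    }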
If you can think of some way to enable the sort-floats-as-ints trick as an option without harming performance for non-floats, I'd be very interested.
Default to the trivial (pre|post) transformation, and specialize the algorithm that maps the transformations over the (input|output) range to be a no-op for the trivial transformation. The compiler will eliminate the function call.
I'll look to see if I can structure it that way without loss of efficiency or major loss in readability.
1. Value type can be treated as a sequence of digits/characters.
Find me a type that you don't think this applies to,
For starters, any product type (tuple or struct) containing variable- length subkeys. As an example, consider book records from a library database that need to be ordered first by author and then by title. If you insist on doing that with a (MSD) radix sort, you have two options:
1. Use a stable radix sort and make multiple passes over the data. In most cases, this would be less efficient than just doing a single pass with a comparison sort, not to mention that spreadsort isn’t stable.
2. Hack into the implementation details of the algorithm and recurse into the next key to be sorted by, before the buckets of the previous key are destroyed.
Note: I am open to writing stable versions of these algorithms, potentially with TimSort or some other efficient stable fallback sort. All that is required for stability of the MSD radix portion is having a full copy of the data (just like LSD radix sort uses). The library should be faster without the complex cache-missing swaps the in-place version uses.

That said, here are another 2 options, the first of which is quite efficient:

3) If there is a character that is not allowed in the substrings (ideally the null character), then you can just return 0 when you hit the end of a substring, and if the illegal character is not null, then shift any values below that character up by 1 in the Get_char functor to fit this change in. Embedded nulls usually are not a design requirement, in which case you can just use null to signal the end of a substring.

4) If all your possible character values are accounted for, then you can add an extra character every step to signal whether the substring is over. Example returns from Get_char with this approach on strings made of 2 substrings:

    "", "a"   ->  0, 1, 'a', 0
    "bc", "d" ->  1, 'b', 1, 'c', 0, 1, 'd', 0

If this is a serious issue, I can add substring end signaling into the API. If you can define a strict weak ordering that is usable in std::sort, you can define a synthetic radix as I've just given an example of.

Case-agnostic string sorting is another example in which radix sorts don’t excel (even though you can make it work if you extend the interface considerably). This has been mentioned before.
The functor example works efficiently in this case, as mentioned before. It's significantly faster than std::sort using boost::algorithm::ilexicographical_compare for case-insensitive sorting: 1.21s for 40MB of strings vs 2.56s for std::sort.
Steven Ross wrote:
integer_sort and float_sort assume they are operating on something that can efficiently be treated as a single word. They use bitshifting and subtraction to find the difference between the most significant bits of two elements. The bitshifting paradigm is what makes those algorithms difficult to generalize to very large keys.
Why is the bitshifting hard-coded into a separate version of the algorithm, rather than wrapping it in the Get_char parameter function? (By the way, please forgive me for continuing to fire such questions at you. It’s just that your remarks raise questions which I think are interesting. You are not obliged to defend your design, as far as I’m concerned.)
Could you clarify the above remark? Why does the fully generic string_sort not work well for keys that fit into a machine word?
It's not quite as efficient; it works 8 bits at a time instead of their 11, and doesn't use subtraction to eliminate bits that are identical between the min and max.
If I understand correctly, the subtraction is a trick to reduce the number of buckets. Wouldn’t that be useful for strings as well? And why wouldn’t the number of bits per pass be parametric (with sensible defaults of course)?
1. Value type can be treated as a sequence of digits/characters.
Find me a type that you don't think this applies to,
For starters, any product type (tuple or struct) containing variable- length subkeys. As an example, consider book records from a library database that need to be ordered first by author and then by title. If you insist on doing that with a (MSD) radix sort, you have two options:
1. Use a stable radix sort and make multiple passes over the data. In most cases, this would be less efficient than just doing a single pass with a comparison sort, not to mention that spreadsort isn’t stable.
2. Hack into the implementation details of the algorithm and recurse into the next key to be sorted by, before the buckets of the previous key are destroyed.
Note: I am open to writing stable versions of these algorithms, potentially with TimSort or some other efficient stable fallback sort. All that is required for stability of the MSD radix portion is having a full copy of the data (just like LSD radix sort uses). The library should be faster without the complex cache-missing swaps the in-place version uses.
It is up to you whether you add a stable version; I would vote to accept your library without it, too. Note that while a full copy of the data may speed up the algorithm, that does not necessarily mean that it is suddenly a good idea to make multiple passes with it.
That said, here's another 2 options, the first of which is quite efficient:
3) If there is a character that is not allowed in the substrings (ideally the null character), then you can just return 0 when you hit the end of a substring, and if the illegal character is not null, then shift any values below that character up by 1 in the Get_char functor to fit this change in. Embedded nulls usually are not a design requirement, in which case you can just use null to signal the end of a substring.

4) If all your possible character values are accounted for, then you can add an extra character every step to signal whether the substring is over. Example returns from Get_char with this approach on strings made of 2 substrings:

    "", "a"   ->  0, 1, 'a', 0
    "bc", "d" ->  1, 'b', 1, 'c', 0, 1, 'd', 0
If this is a serious issue, I can add substring end signaling into the API.
To the contrary. Please don’t do any such invasive thing to your algorithm just to add one class of use cases. Your algorithm works for a large class of use cases out of the box and you can make many people happy with it as it is.
If you can define a strict weak ordering that is usable in std::sort, you can define a synthetic radix as I've just given an example of.
You can invent all kinds of workarounds and invasive additions to account for additional use cases. That doesn’t mean it would be a good idea. There is no single sophistication to spreadsort that will make it work for *all* use cases that every comparison sort works out of the box for. I think it would be better to appreciate the elegance of the algorithm for the use cases that it easily supports, than to try to approach the generality of a comparison sort. Apart from that, at this point it seems clear to me that you agree that spreadsort (and radix sorting in general) does not work for product types with variable length subkeys *out of the box*, contrary to (pure) comparison sorts. I must emphasise that this is still just one example that I could easily come up with; you should not assume that it is the only exception. Since spreadsort does not work out of the box for all use cases, you should not pretend that it does in your documentation, either.
Case-agnostic string sorting is another example in which radix sorts don’t excel (even though you can make it work if you extend the interface considerably). This has been mentioned before.
The functor example works efficiently in this case, as mentioned before. It's significantly faster than std::sort using boost::algorithm::ilexicographical_compare for case-insensitive sorting: 1.21s for 40MB of strings vs 2.56s for std::sort.
In that case I retract that example. As I said I only quickly scanned the previous discussion, so I hereby apologise for this oversight and any other oversights I might have made. -Julian
On Tue Nov 18 2014 at 3:34:28 PM Julian Gonggrijp wrote:
integer_sort and float_sort assume they are operating on something that can efficiently be treated as a single word. They use bitshifting and subtraction to find the difference between the most significant bits of two elements. The bitshifting paradigm is what makes those algorithms difficult to generalize to very large keys.
Why is the bitshifting hard-coded into a separate version of the algorithm, rather than wrapping it in the Get_char parameter function?
Could you clarify the above remark? Why does the fully generic string_sort not work well for keys that fit into a machine word?
It's not quite as efficient; it works 8 bits at a time instead of their 11, and doesn't use subtraction to eliminate bits that are identical between the min and max.
If I understand correctly, the subtraction is a trick to reduce the number of buckets. Wouldn’t that be useful for strings as well? And why wouldn’t the number of bits per pass be parametric (with sensible defaults of course)?
You can bit-shift in Get_char; the main difference is the subtraction (which requires a scan through all data for the minimum and maximum). string_sort uses a fixed-size window based on the maximum possible range of the unsigned character Get_char returns, plus a bin for empties. It is explicitly designed and optimized for datatypes like the string, where there is a sequence of small data types and subtraction will provide little benefit, and where all characters at the same index are sometimes equal. People can bit-shift inside Get_char to extract chunks of data, but the subtraction optimization doesn't make sense when you're just sorting bytes, and only have 259 buckets (which easily fit in L1 cache).
3) If there is a character that is not allowed in the substrings (ideally the null character), then you can just return 0 when you hit the end of a substring, and if the illegal character is not null, then shift any values below that character up by 1 in the Get_char functor to fit this change in. Embedded nulls usually are not a design requirement, in which case you can just use null to signal the end of a substring.

4) If all your possible character values are accounted for, then you can add an extra character every step to signal whether the substring is over. Example returns from Get_char with this approach on strings made of 2 substrings:

    "", "a"   ->  0, 1, 'a', 0
    "bc", "d" ->  1, 'b', 1, 'c', 0, 1, 'd', 0

If you can define a strict weak ordering that is usable in std::sort, you can define a synthetic radix as I've just given an example of.
You can invent all kinds of workarounds and invasive additions to account for additional use cases. That doesn’t mean it would be a good idea. There is no single sophistication to spreadsort that will make it work for *all* use cases that every comparison sort works out of the box for. I think it would be better to appreciate the elegance of the algorithm for the use cases that it easily supports, than to try to approach the generality of a comparison sort.
I think there is some form of miscommunication. string_sort, out of the box, can sort ANY data with a strict weak ordering that std::sort can, depending on how the user of the library defines their Get_char functor, just like they have to define a compare functor in these more complex use cases for both comparison-based algorithms and string_sort. You posed a problem, and I provided 2 solutions by changing the user's Get_char functor, one efficient one for the standard case where embedded nulls aren't an issue, and another that supports embedded nulls with some more effort. This technique can be extended to any std::sortable data type with a strict weak ordering, merely by reproducing the actual bytes the comparison function compares as returns from the Get_char function at sequential indices. There is no loss of generality with hybrid-radix sorting. What I will agree is that it becomes more complicated for the user to define a Get_char functor in addition to the compare functor for some use cases, and in some cases it won't be worth the bother.
259->257 buckets. Sorry for the typo.
Steven Ross wrote:
You can bit-shift in Get_char; the main difference is the subtraction (which requires a scan through all data for the minimum and maximum). string_sort uses a fixed-size window based on the maximum possible range of the unsigned character Get_char returns, plus a bin for empties. It is explicitly designed and optimized for datatypes like the string, where there is a sequence of small data types and subtraction will provide little benefit, and where all characters at the same index are sometimes equal. People can bit-shift inside Get_char to extract chunks of data, but the subtraction optimization doesn't make sense when you're just sorting bytes, and only have 259 buckets (which easily fit in L1 cache).
Thank you for your explanation.
3) If there is a character that is not allowed in the substrings (ideally null character), then you can just return 0 when you hit the end of a substring, and if the illegal character is not null, then shift any values below that character up by 1 in the Get_char functor to fit this change in. Embedded nulls usually are not a design requirement, in which case you can just use null to signal the end of a substring. 4) [...]
You can invent all kinds of workarounds and invasive additions to account for additional use cases. [...] I think it would be better to appreciate the elegance of the algorithm for the use cases that it easily supports, than to try to approach the generality of a comparison sort.
I think there is some form of miscommunication. string_sort, out of the box, can sort ANY data with a strict weak ordering that std::sort can, depending on how the user of the library defines their Get_char functor, just like they have to define a compare functor in these more complex use cases for both comparison-based algorithms and string_sort. You posed a problem, and I provided 2 solutions by changing the user's Get_char functor, one efficient one for the standard case where embedded nulls aren't an issue, and another that supports embedded nulls with some more effort. This technique can be extended to any std::sortable data type with a strict weak ordering, merely by reproducing the actual bytes the comparison function compares as returns from the Get_char function at sequential indices. There is no loss of generality with hybrid-radix sorting. What I will agree is that it becomes more complicated for the user to define a Get_char functor in addition to the compare functor for some use cases, and in some cases it won't be worth the bother.
Yes, perhaps there is a miscommunication. Regarding the author/title example, I understand the part where a library user can refine the Get_char function to return a special value for the end of a subkey. However, it is unclear to me how that could be enough to ensure that all books are first sorted by author and then by title. Author names may be as short as 10 characters or as long as 50 characters. How is Get_char supposed to ensure all by itself that all first characters of all book titles are considered in the same radix pass, i.e. the one after the last character of the longest author name? I naturally assumed that the sorting algorithm itself would have to help here, which would be invasive for your implementation of spreadsort. I realise that you never confirmed that, though.

By contrast, it is very clear how refining the comparison function would be sufficient for a comparison sort. Just compare by author; in case of a tie, compare by title.

To the best of my knowledge, it is common wisdom that radix sorts are not as general as comparison sorts. I can be wrong. I may have missed an important (new?) insight. If that is the case, please explain to me how refining the Get_char function is sufficient to deal with any strict weak ordering (just repeating the claim does not help me understand).

-Julian
Julian,

On Wed Nov 19 2014 at 5:46:01 PM Julian Gonggrijp wrote:

How is Get_char supposed to ensure all by itself that all first characters of all book titles are considered in the same radix pass, i.e. the one after the last character of the longest author name?

By alternating between returning payload radix characters and a "does this field continue" byte. A 0 will specify an end of field, and sort lower in the order than a 1, thus making all shorter names sort lower than names that are equivalent to the same length but longer.

Thus, Author: Jon, Title: bat will have GetChar output these values:

    1, 'J', 1, 'o', 1, 'n', 0, 1, 'b', 1, 'a', 1, 't', 0

Author: Jonathan, Title: (empty) will output these values:

    1, 'J', 1, 'o', 1, 'n', 1, 'a', 1, 't', 1, 'h', 1, 'a', 1, 'n', 0, 0

The 0 vs 1 will cause Jon to appear before Jonathan in order, because it is shorter. Actually, the 0s and 1s aren't necessary for the last subkey (the title), but I included them to show this approach. There is also a simple GetLength functor that lets string_sort know when all characters are exhausted.

By contrast, it is very clear how refining the comparison function would be sufficient for a comparison sort. Just compare by author; in case of a tie, compare by title.

That's exactly what the Get_char method I explained is doing, but it's less intuitive. The radix equivalent requires encoding the same information the comparison function uses, but sometimes in a different byte format to include factors such as sizes and signs. As I said, doable, but may not be worth the effort.

To the best of my knowledge, it is common wisdom that radix sorts are not as general as comparison sorts. I can be wrong. I may have missed an important (new?) insight. If that is the case, please explain to me how refining the Get_char function is sufficient to deal with any strict weak ordering (just repeating the claim does not help me understand).

I agree with your statement that this is conventional wisdom. 2 major factors are included in this conventional wisdom:

1) Most radix sorts perform poorly relative to comparison sorting for small N, large K, or variable-length data. Hybrid-radix sorting resolves this problem.
2) The mapping from a sorting requirement to a comparison function is more intuitive and generally simpler than the mapping to a Get_char-type radix implementation.

With #1 limiting performance, and comparison-based sorts being fast enough for most problems, I think most people just didn't put in the effort to examine the possibility of generalized radix sorting. string_sort provides this capability. It won't be optimal for many types of problems, but it can be used for them and provides worst-case performance guarantees.
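To make the encoding above concrete, here is a hedged sketch of a GetChar/GetLength/compare functor set for a hypothetical book record sorted by author and then title; the record type and functor names are made up, and the string_sort(first, last, get_char, get_length, compare) overload is assumed from this thread rather than quoted from the documentation:

    #include <cstddef>
    #include <string>
    #include <vector>

    struct book { std::string author, title; };

    // Virtual radix key: 1,a[0],1,a[1],...,0, 1,t[0],1,t[1],...,0
    struct book_get_char {
      static unsigned char field_char(const std::string& s, std::size_t i) {
        if (i == 2 * s.size()) return 0;               // end-of-field marker
        if (i % 2 == 0) return 1;                      // "field continues" marker
        return static_cast<unsigned char>(s[i / 2]);   // payload character
      }
      unsigned char operator()(const book& b, std::size_t i) const {
        std::size_t author_part = 2 * b.author.size() + 1;  // markers + payload + final 0
        if (i < author_part) return field_char(b.author, i);
        return field_char(b.title, i - author_part);
      }
    };

    struct book_get_length {
      std::size_t operator()(const book& b) const {
        return 2 * b.author.size() + 1 + 2 * b.title.size() + 1;
      }
    };

    struct book_compare {   // equivalent ordering for the comparison-sort fallback
      bool operator()(const book& a, const book& b) const {
        if (a.author != b.author) return a.author < b.author;
        return a.title < b.title;
      }
    };

    // std::vector<book> books = ...;
    // boost::sort::string_sort(books.begin(), books.end(),
    //     book_get_char(), book_get_length(), book_compare());  // namespace may differ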
Steven Ross wrote:

[...]

Thank you, Steven. It clicked. I guess I just needed an “explain like I’m 10” type of explanation. Issue resolved, as far as I’m concerned.

Cheers, Julian
Julian,
I updated the first paragraph in the introduction to make clearer what
this library is NOT good for, and added the change to the develop
branch. Here is the new first paragraph.
"The Boost Sort library provides a generic implementation of
high-speed sorting algorithms that outperform those in the C++
standard in both average and worst case performance when there are
over 1000 elements in the list to sort. They fall back to std::sort on
small data sets. These algorithms only work on random access
iterators. They are hybrids using both radix and comparison-based
sorting, specialized to sorting common data types, such as integers,
floats, and strings. These algorithms are encoded in a generic fashion
and accept functors, enabling them to sort any object that can be
processed like these basic data types. In the case of string_sort,
this includes anything with a defined strict weak ordering that
std::sort can sort, but writing efficient functors for some complex
key types may not be worth the additional effort relative to just
using std::sort, depending on how important speed is to your
application. Sample usages are available in the samples directory."
Steven Ross wrote:
I updated the first paragraph in the introduction to make clearer what this library is NOT good for, and added the change to the develop branch. Here is the new first paragraph.
"The Boost Sort library provides a generic implementation of high-speed sorting algorithms that outperform those in the C++ standard in both average and worst case performance when there are over 1000 elements in the list to sort. They fall back to std::sort on small data sets. These algorithms only work on random access iterators. They are hybrids using both radix and comparison-based sorting, specialized to sorting common data types, such as integers, floats, and strings. These algorithms are encoded in a generic fashion and accept functors, enabling them to sort any object that can be processed like these basic data types. In the case of string_sort, this includes anything with a defined strict weak ordering that std::sort can sort, but writing efficient functors for some complex key types may not be worth the additional effort relative to just using std::sort, depending on how important speed is to your application. Sample usages are available in the samples directory.”
After my reply to your previous email, it may be clear that this is not good enough to my taste yet. Let me offer you a suggestion for a different approach:

“The Boost Spreadsort library provides a generic implementation of spreadsort, a hybrid radix/comparison sorting algorithm. The relation between spreadsort and std::sort is comparable to the relation between std::sort and insertion sort. In general, std::sort can be expected to be much faster than insertion sort, but std::sort falls back on insertion sort for small subranges because insertion sort beats std::sort at that scale. Insertion sort also beats std::sort for nearly sorted ranges of any size. Likewise, spreadsort can be expected to outperform std::sort in general except for small ranges, in which it falls back on std::sort, with the reservation that “small” here means something rather large! There also exist situations in which spreadsort is never better than std::sort, regardless of the size of the range. Spreadsort is likely to make you a happier person if all of the following criteria apply:

- you are sorting very large sequences of data, possibly millions of keys or more;
- your keys can be treated as strings of digits or characters that can be lexicographically compared (fortunately this is true of many common types, including fixed-width floats and integers);
- you do not need to preserve the order of equivalent keys (spreadsort is not stable);
- your sequence is stored in a random access container.”

HTH, Julian
On Nov 17, 2014, at 1:55 PM, Julian Gonggrijp wrote:
- your keys can be treated as strings of digits or characters that can be lexicographically compared (fortunately this is true of many common types, including fixed-width floats and integers);
I would prefer avoiding the phrase "fixed-width floats". Josh
Josh Juran wrote:
- your keys can be treated as strings of digits or characters that can be lexicographically compared (fortunately this is true of many common types, including fixed-width floats and integers);
I would prefer avoiding the phrase "fixed-width floats”.
I see your point. Perhaps “fixed-size floats”?
On 18 November 2014 at 02:53, Josh Juran wrote:
On Nov 17, 2014, at 1:55 PM, Julian Gonggrijp wrote:

- your keys can be treated as strings of digits or characters that can be lexicographically compared (fortunately this is true of many common types, including fixed-width floats and integers);
I would prefer avoiding the phrase "fixed-width floats".
In http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3626.pdf we eventually decided to call them 'specified width' floating-point types. This covers the built-in float, double and long double, as well as longer (or shorter) types, for example those provided by Boost.Multiprecision.

Paul A. Bristow
participants (6)
- Edward Diener
- Josh Juran
- Julian Gonggrijp
- Paul A. Bristow
- Phil Endecott
- Steven Ross