19 Nov
2014
19 Nov
'14
11:46 p.m.
Julian, On Wed Nov 19 2014 at 5:46:01 PM Julian Gonggrijpwrote: > >>> 3) If there is a character that is not allowed in the substrings > (ideally > >>> null character), then you can just return 0 when you hit the end of a > >>> substring, and if the illegal character is not null, then shift any > >> values > >>> below that character up by 1 in the Get_char functor to fit this change > >>> in. Embedded nulls usually are not a design requirement, in which case > >> you > >>> can just use null to signal the end of a substring. > >>> 4) [...] > >> > >> You can invent all kinds of workarounds and invasive additions to > account > >> for additional use cases. [...] > >> I think it would be better to appreciate the elegance of the algorithm > >> for the use cases that it easily supports, than to try to approach the > >> generality of a comparison sort. > >> > > > > I think there is some form of miscommunication. string_sort, out of the > > box, can sort ANY data with a strict weak ordering that std::sort can, > > depending on how the user of the library defines their Get_char functor, > > just like they have to define a compare functor in these more complex use > > cases for both comparison-based algorithms and string_sort. You posed a > > problem, and I provided 2 solutions by changing the user's Get_char > > functor, one efficient one for the standard case where embedded nulls > > aren't an issue, and another that supports embedded nulls with some more > > effort. This technique can be extended to any std::sortable data type > with > > a strict weak ordering, merely by reproducing the actual bytes the > > comparison function compares as returns from the Get_char function at > > sequential indices. There is no loss of generality with hybrid-radix > > sorting. > > What I will agree is that it becomes more complicated for the user to > > define a Get_char functor in addition to the compare functor for some use > > cases, and in some cases it won't be worth the bother. > > Yes, perhaps there is a miscommunication. Regarding the author/title > example, I understand the part where a library user can refine the > Get_char function to return a special value for the end of a subkey. > However, it is unclear to me how that could be enough to ensure that all > books are first sorted by author and then by title. Author names may be > as short as 10 characters or as long as 50 characters. How is Get_char > supposed to ensure all by itself that all first characters of all book > titles are considered in the same radix pass, i.e. the one after the last > character of the longest author name? > By alternating between returning payload radix characters and a "does this field continue" byte. A 0 will specify an end of field, and sort lower in the order than a 1, thus making all shorter names sort lower than names that are equivalent to the same length but longer. Thus, Author: Jon Title: bat will have GetChar output these values: 1,'J',1,'o',1,'n',0 ,1,'b',1,'a',1,'t',0 Author: Jonathan Title:(empty), will output these values: 1,'J',1,'o',1,'n',1,'a',1,'t',1,'h',1,'a',1,'n',0, 0 The 0 vs 1 will cause Jon to appear before Jonathan in order, because it is shorter. Actually, the 0s and 1s aren't necessary for the last subkey (the title), but I included them to show this approach. There is also a simple GetLength functor that lets string_sort know when all characters are exhausted. > > By contrast, it is very clear how refining the comparison function would > be sufficient for a comparison sort. Just compare by author; in case of a > tie, compare by title. > That's exactly what the Get_char method I explained is doing, but it's less intuitive. The radix equivalent requires encoding the same information the comparison function uses, but sometimes in a different byte format to include factors such as sizes and signs. As I said, doable, but may not be worth the effort. > To the best of my knowledge, it is common wisdom that radix sorts are not > as general as comparison sorts. I can be wrong. I may have missed an > important (new?) insight. If that is the case, please explain to me how > refining the Get_char function is sufficient to deal with any strict weak > ordering (just repeating the claim does not help me understand). > > I agree with your statement that this is conventional wisdom. 2 major factors are included in this conventional wisdom: 1) Most Radix sorts perform poorly relative to comparison sorting for small N, large K, or variable-length data. Hybrid-radix sorting resolves this problem. 2) The mapping from a sorting requirement to a comparison function is more intuitive and generally simpler than the mapping to a Get_char-type radix implementation. With #1 limiting performance, and comparison-based sorts being fast enough for most problems, I think most people just didn't put in the effort to examine the possibility of generalized radix sorting. string_sort provides this capability. It won't be optimal for many types of problems, but it can be used for them and provides worst-case performance guarantees.