Fwd: [SORT] Parallel Algorithms
Hi Steven I had been revising and testing spreadsort, and the results are impressive. I have prepared the benchmarks and include spreadsort. Now the numbers are extracted from a file generated with a random number generator. In a few days I will put all in a git repository for a easy access As I had seen, you use boost::sort for to invoke to spreadsort. If you use this name, which name use for to invoke introsort? I think a good idea can be call the algorithms by their name. If we do this we have the next algorithm list : - spreadsort - spreadsort_integer - spreadsort_float - spreadsort_string - introsort - parallel_introsort - smart_merge_sort - parallel_stable_sort - sample_sort I think it's a simple and easy to understand, because with the name sort, many people think about an algorithm as introsort ( general , not specific for several data types) and not stable. What's your opinion ?
On Thu, Apr 16, 2015 at 4:08 AM, Francisco José Tapia
Hi Steven
I had been revising and testing spreadsort, and the results are impressive.
I have prepared the benchmarks and include spreadsort. Now the numbers are extracted from a file generated with a random number generator. In a few days I will put all in a git repository for a easy access
As I had seen, you use boost::sort for to invoke to spreadsort. If you use this name, which name use for to invoke introsort?
What version of the code are you looking at? I restructured the namespacing after the review completed to match the directory structure. The latest version is at https://github.com/boostorg/sort. You can also see it integrated into the 1_58_0 release candidate here: http://boost.cowic.de/rc/ The naming goes like this: boost::sort::spreadsort::spreadsort boost::sort::spreadsort::integer_sort etc. So if you were to add a library, it might be something like this: boost::sort::parallel::parallel_introsort (Is there a name you would prefer to parallel?)
I think it's a simple and easy to understand, because with the name sort, many people think about an algorithm as introsort ( general , not specific for several data types) and not stable.
spreadsort is a general algorithm, and can be used generically through this implementation, but yes, it will generate compile errors if people pass random data types that don't have the required operators, until they specify functors to use instead. The namespacing is intended to match the directory structure, but it also enforces the specificity you're requesting.
Sorry by the mistake, I was using an old version , I download the code in December, and the files inside are from October. I just revised the new version, and the documentation is much better than the older. ( The index.html don't run well because don't exist the folder doc/html/) . As see in the sort.pdf, the spreadsort function is in the boost::sort namespace. This can be coherent with my previous message. My initial idea is to put the functions in the boost::sort namespace with the names - introsort - parallel_introsort - smart_merge_sort - parallel_stable_sort - sample_sort I include the sample sort because in the parallel sorting of strings the sample sort is a 27% faster than parallel_stable_sort , and use only a 4% more of memory. The code can be in a folder with any name, by example, boost/sort/generalsort/algorithms.hpp. What's your opinion? According to your recommendations, I create the benchmarks, and the data are extracted from a file generate with a random number generator. I am trying to check in machines with many cores, I am talking with friends of universities of Madrid. I include the spreadsort in the benchmarks, and the results are impressive. I changed the name of the objects ( now is int_array) , and use the default copy constructor and the default operator =, the comparison is with the sum of all the values in the array, and all the positions of the array are filled with the random numbers of the file I am writing the documentation and changing the test programs to adapt to the boost format. In a few days, I will create a git repository, with all the information Sorry again by my mistake. Yours Francisco
On Thu, Apr 16, 2015 at 6:20 AM Francisco José Tapia
I just revised the new version, and the documentation is much better than the older. ( The index.html don't run well because don't exist the folder doc/html/) .
The html is generated from the code and a template. In 1_58_0 boost release (which just came out), the generated output is included. Do you want to know how to generate the html?
As see in the sort.pdf, the spreadsort function is in the boost::sort namespace. This can be coherent with my previous message.
That's an obsolete file, and I've deleted it. Thanks for pointing it out. You should use the html docs.
My initial idea is to put the functions in the boost::sort namespace with the names
- introsort - parallel_introsort - smart_merge_sort - parallel_stable_sort - sample_sort
I include the sample sort because in the parallel sorting of strings the sample sort is a 27% faster than parallel_stable_sort , and use only a 4% more of memory.
The code can be in a folder with any name, by example, boost/sort/generalsort/algorithms.hpp. What's your opinion?
The namespace needs to match the folder, that's the boost convention. The only reason to include introsort (i'd put it in the detail folder) is for comparison; std::sort usually performs comparably or better, and is easier to find, in which case "parallel" works (or more specifically, parallel_comparison) as a name. I don't like "generalsort" because spreadsort could be described the same way; it isn't specific enough.
According to your recommendations, I create the benchmarks, and the data are extracted from a file generate with a random number generator.
I am trying to check in machines with many cores, I am talking with friends of universities of Madrid.
I include the spreadsort in the benchmarks, and the results are impressive.
I changed the name of the objects ( now is int_array) , and use the default copy constructor and the default operator =, the comparison is with the sum of all the values in the array, and all the positions of the array are filled with the random numbers of the file
I am writing the documentation and changing the test programs to adapt to the boost format. In a few days, I will create a git repository, with all the information
Sounds good.
Hi, Finally, I have all in a git repository. You can see in https:://github.com/fjtapia/sort_parallel. https://mail.google.com/mail/u/0/https:://github.com/fjtapia/sort_parallel. https:://github.com/fjtapia/sort_parallel. The parts are : the source code, documentation, examples and benchmarks. The first versions, usually have many errors and need correct many things. This is not an exception. If you see something wrong, or something to change, please say me. The English is not my mother tongue, and I suppose the list of things to change is leaded by incorrect expressions and even by true bumps to the English dictionary. Thanks by your help and your patience Francisco
On 5/27/15 12:38 AM, Francisco José Tapia wrote:
Hi,
Finally, I have all in a git repository. You can see in https:://github.com/fjtapia/sort_parallel. https://mail.google.com/mail/u/0/https:://github.com/fjtapia/sort_parallel. https:://github.com/fjtapia/sort_parallel.
The parts are : the source code, documentation, examples and benchmarks.
The first versions, usually have many errors and need correct many things. This is not an exception. If you see something wrong, or something to change, please say me.
The English is not my mother tongue, and I suppose the list of things to change is leaded by incorrect expressions and even by true bumps to the English dictionary. Thanks by your help and your patience
Francisco
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
I've taken a look at your github repo and think this is a great candidate for the Boost Library Incubator ( www.blincubator ). Please considering submitting your library to the Incubator. Robert Ramey
On Wed, May 27, 2015, 9:44 AM Robert Ramey
Hi,
Finally, I have all in a git repository. You can see in https:://github.com/fjtapia/sort_parallel. < https://mail.google.com/mail/u/0/https:://github.com/fjtapia/sort_parallel .> https:://github.com/fjtapia/sort_parallel.
The parts are : the source code, documentation, examples and benchmarks.
The first versions, usually have many errors and need correct many
On 5/27/15 12:38 AM, Francisco José Tapia wrote: things.
This is not an exception. If you see something wrong, or something to change, please say me.
The English is not my mother tongue, and I suppose the list of things to change is leaded by incorrect expressions and even by true bumps to the English dictionary. Thanks by your help and your patience
Francisco
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
I've taken a look at your github repo and think this is a great candidate for the Boost Library Incubator ( www.blincubator ). Please considering submitting your library to the Incubator.
Robert Ramey
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
I have no objection to Francisco adding this to the incubator, but if he wants to include it in boost.sort I'll need to be involved.
On 5/27/15 10:40 AM, Steven Ross wrote:
I have no objection to Francisco adding this to the incubator, but if he wants to include it in boost.sort I'll need to be involved.
Of course, the review process guarantees that anyone who want's to get involved has the opportunity. In the meantime, you'd be very welcome to post comments on this library once it's in the incubator. I'm sure the author would find them very interesting. Robert Ramey
Robert, thanks by your message and your interest. About to put sort::parallel in the Incubator, sincerely I don't know what must do now. As say Steven, I think the logic is to join the two parts spread sort and parallel in a repository, controlled by Steven and myself. But the spread sort part is approved and the parallel part not. I don't know where must be located the approved libraries. Personally, don't have any problem with any of the options mentioned My other library ( don't approved yet) Countertree, I will move to the Incubator when finish the exams of my students, the last week of June. I will consult the details with Robert (thanks by tour interest and your continuous work with the Incubator) About the approbation, I don't know if must open other approbation process or if depend only of Steven. As you can see, I am a sea of doubts. What's you opinion ? Thanks Francisco
On Thu, May 28, 2015, 3:51 AM Francisco José Tapia
Hi , Steven,
sorry by the delay in the response. Now I am buried in a mountain of exams.
In my opinion, sort::parallel is better leave in the actual place. When
finish the process with sort::parallel, I have intention to move my other
library CounterTree to the Incubator.
I think must try consider too, the view point of the final user, Try to
do all the things clear and simpleas possible. Create a new library for the
parallel algorithms, is to separate similar and related functionalities.
Boost is big and sometimes need a considerable amount of time to find the
desired functionality. If we divide the library, we add time to the user's
search, and don't improve nothing.
I had seen several grammatical mistakes in the documentation, due to the
fatigue, the excess of cut and paste and my lack of English grammar.
I generate the doxygen documentation for all the code, but I think is very
big and unnecessary for the final user of the library. I will change and
put the documentation of the final functions.
I will try to correct this week, but I can't guarantee. When correct all,
It will be the moment to think about the mini review. What need, what must
add and what must cut
Francisco
2015-05-28 14:07 GMT+02:00 Steven Ross
On Thu, May 28, 2015, 3:51 AM Francisco José Tapia
wrote: Robert,
thanks by your message and your interest. About to put sort::parallel in the Incubator, sincerely I don't know what must do now.
About the approbation, I don't know if must open other approbation process or if depend only of Steven.
As you can see, I am a sea of doubts. What's you opinion ?
Thanks
Francisco
_______________________________________________ Unsubscribe & other changes: http:// http://lists.boost.org/mailman/listinfo.cgi/boostlists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost/mailman/ http://lists.boost.org/mailman/listinfo.cgi/boostlistinfo.cgi http://lists.boost.org/mailman/listinfo.cgi/boost/boost http://lists.boost.org/mailman/listinfo.cgi/boost
Francisco,
Feel free to add it to the incubator. If you wish to add it to boost.sort (regardless of whether it's in the incubator), I can run a mini-review and skip the full formal review; you always have the option of adding a new library with a full formal review, but that can take longer (6 years in my case).
With regards to where to check in code, just don't worry about that now. When you want to start merging, you can create a git repository, and copy the existing boost.sort into it, and combine both code bases. I can take care of final integration once it's approved in the mini review.
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Hi Francisco,
In my opinion, sort::parallel is better leave in the actual place. When finish the process with sort::parallel, I have intention to move my other library CounterTree to the Incubator.
Sounds good to me. Thanks for changing your examples to run off of files! That's easier to test and debug. Please provide your examples with a name of what they do, not just 1, 2, 3. I suggest setting up all your benchmarks to run within 5 minutes by default, and have a flag to do the full-size test. Feel free to rip out timsort if it causes you any trouble; I'd only mentioned it because it's a popular sort. Please make sure the license allows you to include it in Boost.
I had seen several grammatical mistakes in the documentation, due to the fatigue, the excess of cut and paste and my lack of English grammar.
There is a mistake in the sort::parallel documentation: "These algorithms are generic, can sort any kind of object using a comparison object, in contrast with the specialized algorithms, as spreadsort, which only sort integers, strings and float values in a single thread version, but with a speed greater than the generic sort algorithms." spreadsort can sort any data for which you can define a strict weak ordering (just like std::sort). See https://github.com/boostorg/sort/blob/master/example/generalizedstruct.cpp for a moderately complicated example. What's different about purely comparison-based algorithms vs spreadsort is that only a comparison object is required for purely-comparison-based algorithms, where spreadsort requires indexing functions into the key (such as bitshift or the equivalent of [] for strings). If you can write a comparison function, you can write an equivalent indexing function, though that doesn't mean it's easy. Many people have made the same assumption as in your parallel_sort documentation, and I'm not sure what the best way to clarify it is. I think it'd be interesting to try a parallel spreadsort using your parallelization approach, if your library gets substantial usage. Do you need to have separate windows and linux cpp files? Much of the benchmark code appears duplicated, and that becomes a maintenance hassle over time. Please share as much of the code as you can. I have some questions about your int_array object: 1) Why does it have these two functions: int_array ( uint64_t K =0 ); int_array & operator = ( uint64_t K); 2) Why doesn't this print out the entire contents of the array: std::ostream & operator << ( std::ostream & out , const int_array<NN> & N) 3) The object has a fairly expensive comparison function (there is no early exit, it's like a hash comparison), it doesn't seem like a problem, but I'm curious why you picked that comparison approach? I look forward to reviewing it once you've completed cleaning it up (including your docs). Did you include the tbb optimization to be very fast with already-sorted data (I haven't run the library to check)? I assume all your algorithms are either competitive in speed or better in memory than their tbb and std library equivalents.
I generate the doxygen documentation for all the code, but I think is very big and unnecessary for the final user of the library. I will change and put the documentation of the final functions.
They only need to see what is outside your detail namespace.
Hi Steven, I changed the name of the examples. Now are : - example_single_thread - example_default_threads - example_100_threads I will prepare a sort benchmark for to run in 5 minutes. I will do a new program, because if begin to include options in the programs, can be long and can be confuse for the users. Timsort in Windows, sort the data, but the time and memory used are much greater than the used in Linux. I think the best option is to remove of the Windows benchmarks. I will examine the licenses, and I think only need to specify the license of each part of the code because I only use, don't modify. I will modify the text about spreadsort, trying to explain in a easy way The benchmarks programs are different in Windows and Linux, because the algorithms compared are different. In Windows don't have the GCC algorithms, and in Linux don't have the PPL algorithms. About the int_array objects. The constructor and the operator = function with an 64 bits integer as parameter, are functions used before, but don't used now. The operator << is the same. The reason for to implement a complex comparison function, is because in a sort algorithm the two main functions are move and compare. I want to implement the worst case, big objects hard to move and hard to compare. About the optimization of the code I use the -O3 optimization parameter for all the code. The TBB parallel stable sort have 4 different versions ( Open-MP, CilkPlus, TBB high level and TBB low level), the fastest is TBB low level and due this is the used in the benchmarks. In the attached file (parallel_objects.txt) you can see the results in my machine with all kind of data (sorted, reverse sorted, random, random %50000000, random % 10000, and equal elements). About to use the code for to parallelize spreadsort, feel free for to use. And if you need help , need any question or have any problem, please, say me. When you decide to do , say me and I will prepare a description of the two approach used in order to decide the best for your code, and how to implement. Some things are complex to understand reading only the code. I am modifying the documents, deleting the errors, mistakes, and changing the doxygen documentation of the code, for to reduce only tho the final functions and the object Nthread. I hope have all ready in one or two weeks, because now I am in the middle of the exams, and the corrections need many time and energy ( they are terribly boring). Francisco
Hi Francisco,
On Mon, Jun 8, 2015 at 4:44 PM Francisco José Tapia
I changed the name of the examples. Now are :
- example_single_thread - example_default_threads - example_100_threads
Thanks!
I will prepare a sort benchmark for to run in 5 minutes. I will do a new program, because if begin to include options in the programs, can be long and can be confuse for the users.
<= 5 minutes. If you can create a consistent benchmark that runs in a minute, even better, but I've usually found really short benchmarks (<1 minute) tend to be noisy.
Timsort in Windows, sort the data, but the time and memory used are much greater than the used in Linux. I think the best option is to remove of the Windows benchmarks.
Sounds good to me.
I will examine the licenses, and I think only need to specify the license of each part of the code because I only use, don't modify.
If you're willing to give me a link for all the licenses for software you haven't written but are including along with the code, I'd appreciate it, so I can double-check.
The benchmarks programs are different in Windows and Linux, because the algorithms compared are different. In Windows don't have the GCC algorithms, and in Linux don't have the PPL algorithms.
Ok; please try to share everything that's practical.
About the int_array objects. The constructor and the operator = function with an 64 bits integer as parameter, are functions used before, but don't used now. The operator << is the same.
Please delete them then. They confused me, and I don't want to maintain or answer questions about any unused code.
The reason for to implement a complex comparison function, is because in a sort algorithm the two main functions are move and compare. I want to implement the worst case, big objects hard to move and hard to compare.
I'm ok with that, but please mention it in your documentation if you don't already.
About the optimization of the code I use the -O3 optimization parameter for all the code. The TBB parallel stable sort have 4 different versions ( Open-MP, CilkPlus, TBB high level and TBB low level), the fastest is TBB low level and due this is the used in the benchmarks.
In the attached file (parallel_objects.txt) you can see the results in my machine with all kind of data (sorted, reverse sorted, random, random %50000000, random % 10000, and equal elements).
Those numbers look good! If the numbers on my test system (dual-boot Windows and Linux) look anything like that, I say it's good enough to include. In particular, I appreciate that you added the already-sorted optimization; that's a surprisingly common case. I've seen duplicate sorts (they sorted the data 2 times, every time) deployed to hundreds of thousands of computers; nobody noticed until I looked because the sort routine had an optimization for already-sorted data, and the duplicate sort only took 1% of the total sort time.
About to use the code for to parallelize spreadsort, feel free for to use. And if you need help , need any question or have any problem, please, say me.
When you decide to do , say me and I will prepare a description of the two approach used in order to decide the best for your code, and how to implement. Some things are complex to understand reading only the code.
Not today. Let's get your algorithms debugged, documented, and into Boost before we try anything that complicated. Please let me know when you're done with your updates and are ready for me to review the documentation and code.
Hi Steven I decided to change the benchmarks. I have designed a new compact benchmark, with an execution time of aprox 3 min. The results are clear and compact and permit an easy evaluation. In the attached file you can see the result on my machine About the licenses. I Included in the benchmark program in the initial documentation the next paragraphs /// This program use for comparison only purposes the TimSort /// implementation, https://github.com/gfx/cpp-TimSort which license is /// https://github.com/gfx/cpp-TimSort/blob/master/LICENSE /// This program use for comparison purposes, the Threading Building /// Blocks which license is the GNU General Public License, version 2 /// as published by the Free Software Foundation. Now I am very busy with the exams of my students , but I am working a shot time every day in the correction of the documentation and in the final version before the validation. I hope to have ready soon and begin with the next process
Hi Francisco,
On Thu, Jun 11, 2015 at 12:01 PM Francisco José Tapia
Hi Steven
I decided to change the benchmarks. I have designed a new compact benchmark, with an execution time of aprox 3 min. The results are clear and compact and permit an easy evaluation.
In the attached file you can see the result on my machine
Looks good! I'll verify it on my machine once you're ready for review. I was amused to see single-threaded spreadsort provide comparable speed to a parallel algorithm when sorting strings.
Hi the code and the documentation are finished All is in https://github.com/fjtapia/sort_parallel Please, if you see any error, mistake or something to correct, say me in order to change as soon as possible Francisco
On Fri, Jun 26, 2015 at 11:01 AM Francisco José Tapia
Hi
the code and the documentation are finished
All is in https://github.com/fjtapia/sort_parallel
Please, if you see any error, mistake or something to correct, say me in order to change as soon as possible
Thanks Francisco: My preliminary testing on Linux suggests that these libraries are beneficial in at least one case on an 8-core system:
introsort smart_merge_sort parallel stable sort sample sort parallel introsort didn't prove itself (based on raw speed). Do you have a larger test I can run to verify memory usage (which other users might like), or should I just hack up the tests you provided for this purpose? If parallel introsort is better than competitors in RAM, then that would be its reason for use. It looks like we can tell users that these library versions (except parallel introsort) are faster than TBB and GCC for very large data types (256+ bytes), and for other data types they are roughly competitive in speed, so at minimum, it'll save people the effort of setting up their own indirect sort.
On Fri, Jun 26, 2015 at 11:00 AM, Francisco José Tapia
Hi
the code and the documentation are finished
All is in https://github.com/fjtapia/sort_parallel
Please, if you see any error, mistake or something to correct, say me in order to change as soon as possible
Francisco,
Please fix your int_array type; I still see unnecessary calls in it.
Here's what I suggest:
template
Hi Steven,
About parallel_introsort, there are two strategies :
1.- Divide the the elements in two parts with the quicksort procedure,
several times and when the parts are smaller than predefined value, sort
with introsort. This method don't need additional memory ( only the
variables of the recursive calls).
This have a problem, the first division is done with 1 HW thread, the
second with 2, the third with 4, and the fourth with 8. With processors
with a small number of HW threads, run well and usually is the fastest. But
with a great number of HW threads this is not true.
This strategy is used by TBB parallel_sort, Boost parallel_introsort and
Microsoft PPL parallel_sort.
2.- Sort small parts of the data , and do a fusion, usually using the same
technique than sample_sort. With a great number of cores this version
outperform the previous strategy, but need an additional memory equal than
the used with the data.
This strategy is used by GCC parallel_sort, and Microsoft
parallel_buffered_sort
The most critical element for the performance is the data bus. A personal
computer have a dual channel, and a server have a quad channel and a big
cache. Due this, sometimes, good results in small machines are bad in big
machines , and the opposite too.
The attached file is the result of running on a machine with 16 cores of
the Polytechnic University of Madrid.
The information about the memory used is in the documentation. But if you
want to check yourself, I send you too the old benchmark files, which sort
the same than in the benchmark code of the documentation.
On Linux you must execute /usr/bin/time -f “%M” program [parameters], and
print the memory used by the program. By example parallel_numbers 1, sort
with the GCC sort,
with the 2 as parameter, with introsort, and without parameters print a
menu, and must input a option by the keyboard.
In Windows, open the administration tool for to see the process, and there
is a column showing the maximum memory used.
I changed the syntax errors in the README.md , and now I go to change the
int_array
Francisco
2015-06-30 3:04 GMT+02:00 Steven Ross
On Fri, Jun 26, 2015 at 11:00 AM, Francisco José Tapia
wrote: Hi
the code and the documentation are finished
All is in https://github.com/fjtapia/sort_parallel
Please, if you see any error, mistake or something to correct, say me in order to change as soon as possible
Francisco,
Please fix your int_array type; I still see unnecessary calls in it. Here's what I suggest:
template
struct int_array { uint64_t M[NN]; template <class generator > static int_array<NN> generate (generator &gen) { int_array<NN> result; for ( uint32_t i =0 ; i < NN ; ++i) result.M[i] = gen(); return result; };
uint64_t counter ( void) const { uint64_t Acc =0 ; for ( uint32_t i =0 ; i < NN ; Acc += M[i++]) ; return Acc ; }
bool operator < ( const int_array &R) const { return ( counter() < R.counter()); }; };
There is no need to pass an int_array to the generate method. If you want, I can send you a pull request.
With the int_array corrected, I see a case in which the introsort is useful too; I think we can keep it all (depending on how Windows testing goes), but please fix the int_array.
If you'd prefer, I can send a pull request with the changes, though you'll need to make minor test fixes to accommodate it.
Steve
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
participants (3)
-
Francisco José Tapia
-
Robert Ramey
-
Steven Ross