Dear Boost,

In a previous mail, I introduced a container library, `double_ended`. This library has benchmarks [0]. However, the benchmarks show unexpected results.

This is what I do [1]:

- compile the program using G++ 5.3, -O2
- remove CPU0 from the kernel scheduler pool (isolcpu=0 boot flag)
- disable CPU scaling
- pin the benchmark thread to CPU0
- in the test, create a vector
- reserve memory for 1M unsigneds
- write a byte to each page reserved, to warm up the page table
- get the TSC value (the machine has the constant_tsc flag)
- insert 1M unsigned values
- get the TSC value again, subtract the previous value
- repeat the test 25 times, average the results

I do this with 3 containers: std::vector, boost::container::vector, and double_ended::devector. While the latter two produce almost equivalent results, std::vector falls seriously behind.

- No reallocation happens between the TSC samplings.
- Test order doesn't matter: every order produces the same results.
- Container size or sample count doesn't matter: same results.
- The deviation of the individual measurements of each container is small; the performance is consistent.
- I compared the generated assembly of std::vector and double_ended::devector; they are very similar, almost the same instructions in a different order [2].

What do I miss here?

Thanks,
Benedek

[0]: http://erenon.hu/double_ended/double_ended/benchmarks.html
[1]: https://github.com/erenon/double_ended/blob/master/benchmark/push_back.cpp
[2]: http://pastebin.com/tUL54i9f
On 2/16/2016 11:56 AM, Benedek Thaler wrote:
Dear Boost,
In a previous mail, I introduced a container library, `double_ended`. This library has benchmarks [0]. However, the benchmarks show unexpected results.
This is what I do [1]:

- compile the program using G++ 5.3, -O2
- remove CPU0 from the kernel scheduler pool (isolcpu=0 boot flag)
- disable CPU scaling
- pin the benchmark thread to CPU0
- in the test, create a vector
- reserve memory for 1M unsigneds
- write a byte to each page reserved, to warm up the page table
- get the TSC value (the machine has the constant_tsc flag)
- insert 1M unsigned values
- get the TSC value again, subtract the previous value
- repeat the test 25 times, average the results

I do this with 3 containers: std::vector, boost::container::vector, and double_ended::devector. While the latter two produce almost equivalent results, std::vector falls seriously behind.

- No reallocation happens between the TSC samplings.
- Test order doesn't matter: every order produces the same results.
- Container size or sample count doesn't matter: same results.
- The deviation of the individual measurements of each container is small; the performance is consistent.
- I compared the generated assembly of std::vector and double_ended::devector; they are very similar, almost the same instructions in a different order [2].
What do I miss here?
Thanks, Benedek
[0]: http://erenon.hu/double_ended/double_ended/benchmarks.html
[1]: https://github.com/erenon/double_ended/blob/master/benchmark/push_back.cpp
[2]: http://pastebin.com/tUL54i9f
FWIW I can't reproduce your results on Visual Studio 2015 Update 1. Here are my results if you're interested.

X std::vector
0 0
100 592.380000
200 1178.640000
400 2310.270000
800 4517.580000
1600 8893.160000
3200 17600.610000
6400 35895.900000
12800 71004.060000
25600 142081.980000
51200 283691.720000
102400 567921.980000
204800 1138470.790000
409600 2274810.920000
819200 4556220.740000

X devector
0 0
100 711.840000
200 1458.540000
400 2924.550000
800 5862.690000
1600 11653.440000
3200 23825.170000
6400 46841.700000
12800 93916.710000
25600 187148.900000
51200 372430.440000
102400 745044.570000
204800 1488865.490000
409600 2971389.220000
819200 5942622.180000

X boost::container::vector
0 0
100 837.930000
200 1666.050000
400 3259.280000
800 6423.750000
1600 12706.180000
3200 25197.520000
6400 50089.290000
12800 100486.150000
25600 200127.660000
51200 398981.780000
102400 800517.160000
204800 1610967.220000
409600 3250085.410000
819200 6454347.270000

X std::deque
0 0
100 18529.880000
200 25942.200000
400 38937.040000
800 59459.400000
1600 100410.640000
3200 175987.360000
6400 321362.520000
12800 569617.680000
25600 1019693.840000
51200 1930645.720000
102400 3806069.880000
204800 7704495.360000
409600 15629599.200000
819200 31983278.800000

X batch_deque
0 0
100 991.760000
200 1735.560000
400 3127.280000
800 5864.880000
1600 11357.320000
3200 22291.800000
6400 44334.520000
12800 88523.040000
25600 179476.320000
51200 358522.960000
102400 716579.960000
204800 1433212.760000
409600 2868735.640000
819200 5741968.240000

X boost::container::deque
0 0
100 1187.320000
200 4253.640000
400 6874.320000
800 11220.320000
1600 20418.400000
3200 38970.160000
6400 74156.560000
12800 142497.720000
25600 281098.400000
51200 537988.920000
102400 1084152.000000
204800 2232533.560000
409600 4669299.360000
819200 9431670.920000
On Feb 17, 2016 5:21 AM, "Michael Marcin"
FWIW I can't reproduce your results on Visual Studio 2015 Update 1. Here are my results if you're interested.
Thanks for checking! (You are the first to compile it on Windows; I'm surprised it worked.) Looking at your results: although std::vector now wins, std::deque falls awfully behind. I don't know the standard library shipped with MSVC, but I suspect this isn't the expected result. I matched the segment size to libstdc++; maybe Dinkumware uses smaller segments and allocates more often.

Thanks,
Benedek
On 16-02-2016 18:56, Benedek Thaler wrote:
Dear Boost,
What do I miss here?
Hard to say. I guess instruction order may have an impact on branch prediction. But please include a test with unsafe_push_back() for comparison also.

Some more questions:

A. Is the lambda optimized away?
B. Do the tests run the same number of times? Is it asserted at runtime?
C. Do you compile via bjam? I mean, are all the right flags present, e.g. NDEBUG=1?
D. How does the actual code between std::vector and devector compare? Maybe there could be hints to the difference.

regards
Thorsten
I think there is definitely something not quite right.

One thing I notice is that the line

    sample.push_back(get_clock() - base_clock);

is itself being timed, which could potentially access a new memory page, I think.

In pin_thread you have _MSVC_VER instead of _MSC_VER.

Hacking up std::vector and adding a push_back_reserved that does no bounds checks, I get consistently very strange timings. (Code included at the bottom to make it obvious that the latter should be no slower.)
X std::vector::push_back
0 0
100 765.360000
200 1537.260000
400 3037.950000
800 6008.280000
1600 11905.150000
3200 23950.320000
6400 48064.600000
12800 96178.470000
25600 190637.270000
51200 379025.420000
102400 757784.880000
204800 1514584.120000
409600 3026480.680000
819200 6050654.130000
X std::vector::push_back_reserved
0 0
100 938.090000
200 1936.380000
400 3704.400000
800 7274.170000
1600 14307.310000
3200 28550.140000
6400 56835.310000
12800 113579.430000
25600 226906.070000
51200 453802.040000
102400 911835.760000
204800 1819723.520000
409600 3639619.650000
819200 7281897.560000
My simpler, and I'm sure in many ways inferior, benchmark shows much more believable results.
http://codepad.org/SBZ4rbqo
Running 1 test case...
std::vector
On 18-02-2016 08:51, Michael Marcin wrote:
I think there is definitely something not quite right.
One thing I notice is that the line sample.push_back(get_clock() - base_clock); is itself being timed, which could potentially access a new memory page, I think.
In pin_thread you have _MSVC_VER instead of _MSC_VER
I don't understand why all this complicated stuff is necessary. AFAICT, a single-threaded program runs on one CPU.

-Thorsten
On 2/18/2016 7:22 AM, Thorsten Ottosen wrote:
On 18-02-2016 08:51, Michael Marcin wrote:
I think there is definitely something not quite right.
One thing I notice is that the line sample.push_back(get_clock() - base_clock); is itself being timed, which could potentially access a new memory page, I think.
In pin_thread you have _MSVC_VER instead of _MSC_VER
I don't understand why all this complicated stuff is necessary. AFAICT, a single-threaded program runs on one CPU.
FWIW http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-...
On Thu, Feb 18, 2016 at 8:51 AM, Michael Marcin
I think there is definitely something not quite right.
One thing I notice is that the line sample.push_back(get_clock() - base_clock); is itself being timed, which could potentially access a new memory page, I think.
In pin_thread you have _MSVC_VER instead of _MSC_VER
Good catch, fixed this.
Hacking up std::vector and adding a push_back_reserved that does no bounds checks I get consistently very strange timings. (Code included at the bottom to make it obvious that the latter should be no slower).
My simpler, and I'm sure in many ways inferior benchmark, shows much more believable results.
Still, there are huge differences between essentially the same programs.
http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-...
I gave rdtscp a spin; no success, but I'll continue looking at this doc.

Thanks, Benedek
On 2/19/2016 12:48 AM, Benedek Thaler wrote:
On Thu, Feb 18, 2016 at 8:51 AM, Michael Marcin wrote:

I think there is definitely something not quite right.
One thing I notice is that the line sample.push_back(get_clock() - base_clock); is itself being timed, which could potentially access a new memory page, I think.
In pin_thread you have _MSVC_VER instead of _MSC_VER
Good catch, fixed this.
Hacking up std::vector and adding a push_back_reserved that does no bounds checks I get consistently very strange timings. (Code included at the bottom to make it obvious that the latter should be no slower).
My simpler, and I'm sure in many ways inferior benchmark, shows much more believable results.
Still, there are huge differences between essentially the same programs.
http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-...
I gave rdtscp a spin; no success, but I'll continue looking at this doc.
Thanks, Benedek
FWIW (probably not much this time) I took a shot at porting my test to the google/benchmark library. Mostly just to learn the library, but also to see if there were any glaring discrepancies. Not sure I'm using it right, but the relative results seem in line with my expectations.

results: http://codepad.org/U66O82X8
source: http://codepad.org/rii3BHly
On Fri, Feb 19, 2016 at 8:50 AM, Michael Marcin
On 2/19/2016 12:48 AM, Benedek Thaler wrote:
FWIW (probably not much this time)
I took a shot at porting my test to the google/benchmark library. Mostly just to learn the library but also to see if there were any glaring discrepancies. Not sure I'm using it right but the relative results seem in line with my expectations.
results: http://codepad.org/U66O82X8
source: http://codepad.org/rii3BHly
Marcin, Thanks for this take, I like this framework. Looking at the results, I'm wondering whether it would be more informative to just show the generated code instead of the raw numbers and say: the code is similar, the user can expect similar performance.
On Wed, Feb 17, 2016 at 5:34 PM, Thorsten Ottosen
Some more questions: A. Is the lambda optimized away?
It is. I got the same results without any helper functions.
B. Do the tests run the same number of times? Is it asserted at runtime?
It does, just checked it manually.
C. Do you compile via bjam? I mean, are all the right flags present, e.g. NDEBUG=1?
I use bjam. I have -DNDEBUG on the compiler invocation line, and also -O3. I wonder why the release build has -O3 instead of -O2 by default.
D. How does the actual code between std::vector and devector compare? Maybe there could be hints to the difference.
I found nothing indicative.
I don't understand why its necessary with all this complicated stuff. AFAICT, a single threaded program runs on one CPU.
A single-threaded program can be preempted and migrated to another CPU later. Since I'm using the TSC and processor isolation, it's important to always run the program on the same CPU core.
If you do a benchmark without reserve, you should configure devector to use the same growth factor/initial size as the std/boost equivalent.
Currently, the vector tests use reserve, so the growth policy is not exercised.
This is platform dependent, so one test per library is needed.
Regarding the deques, I matched the segment size to my platform (512 elems). After the issue above is resolved, we can have tests for the different libs, yes.

Thanks for checking,
Benedek
On 16-02-2016 18:56, Benedek Thaler wrote:
Dear Boost,
In a previous mail, I introduced a container library, `double_ended`. This library has benchmarks [0]. However, the benchmarks show unexpected results.
Some other comments:

If you do a benchmark without reserve, you should configure devector to use the same growth factor/initial size as the std/boost equivalent. This is platform dependent, so one test per library is needed.

For the reserve test, we should expect very similar results across all platforms. If not, something appears to be wrong.

-Thorsten
participants (3)

- Benedek Thaler
- Michael Marcin
- Thorsten Ottosen