Dear Boost,

In a previous mail, I introduced a container library, `double_ended`. This library has benchmarks [0]. However, the benchmarks show unexpected results.

This is what I do [1]:

- compile the program using G++ 5.3, -O2
- remove CPU0 from the kernel scheduler pool (isolcpu=0 boot flag)
- disable CPU scaling
- pin the benchmark thread to CPU0
- in the test, create a vector
- reserve memory for 1M unsigneds
- write a byte to each page reserved, to warm up the page table
- get the TSC value (the machine has the constant_tsc flag)
- insert 1M unsigned values
- get the TSC value again, subtract the previous value
- repeat the test 25 times, average the results

I do this with 3 containers: std::vector, boost::container::vector, and double_ended::devector. While the latter two produce almost equivalent results, std::vector falls seriously behind.

- No reallocation happens between the TSC samplings.
- Test order doesn't matter: every order produces the same results.
- Container size or sample count doesn't matter: same results.
- The deviation of the individual measurements of each container is small; the performance is consistent.
- I compared the generated assembly of std::vector and double_ended::devector; they are very similar, almost the same instructions in a different order [2].

What do I miss here?

Thanks,
Benedek

[0]: http://erenon.hu/double_ended/double_ended/benchmarks.html
[1]: https://github.com/erenon/double_ended/blob/master/benchmark/push_back.cpp
[2]: http://pastebin.com/tUL54i9f
On 2/16/2016 11:56 AM, Benedek Thaler wrote:
Dear Boost,
In a previous mail, I introduced a container library, `double_ended`. This library has benchmarks [0]. However, the benchmarks show unexpected results.
This is what I do [1]:

- compile the program using G++ 5.3, -O2
- remove CPU0 from the kernel scheduler pool (isolcpu=0 boot flag)
- disable CPU scaling
- pin the benchmark thread to CPU0
- in the test, create a vector
- reserve memory for 1M unsigneds
- write a byte to each page reserved, to warm up the page table
- get the TSC value (the machine has the constant_tsc flag)
- insert 1M unsigned values
- get the TSC value again, subtract the previous value
- repeat the test 25 times, average the results

I do this with 3 containers: std::vector, boost::container::vector, and double_ended::devector. While the latter two produce almost equivalent results, std::vector falls seriously behind.

- No reallocation happens between the TSC samplings.
- Test order doesn't matter: every order produces the same results.
- Container size or sample count doesn't matter: same results.
- The deviation of the individual measurements of each container is small; the performance is consistent.
- I compared the generated assembly of std::vector and double_ended::devector; they are very similar, almost the same instructions in a different order [2].
What do I miss here?
Thanks, Benedek
[0]: http://erenon.hu/double_ended/double_ended/benchmarks.html
[1]: https://github.com/erenon/double_ended/blob/master/benchmark/push_back.cpp
[2]: http://pastebin.com/tUL54i9f
FWIW I can't reproduce your results on Visual Studio 2015 Update 1. Here are my results if you're interested.

X std::vector
0 0
100 592.380000
200 1178.640000
400 2310.270000
800 4517.580000
1600 8893.160000
3200 17600.610000
6400 35895.900000
12800 71004.060000
25600 142081.980000
51200 283691.720000
102400 567921.980000
204800 1138470.790000
409600 2274810.920000
819200 4556220.740000

X devector
0 0
100 711.840000
200 1458.540000
400 2924.550000
800 5862.690000
1600 11653.440000
3200 23825.170000
6400 46841.700000
12800 93916.710000
25600 187148.900000
51200 372430.440000
102400 745044.570000
204800 1488865.490000
409600 2971389.220000
819200 5942622.180000

X boost::container::vector
0 0
100 837.930000
200 1666.050000
400 3259.280000
800 6423.750000
1600 12706.180000
3200 25197.520000
6400 50089.290000
12800 100486.150000
25600 200127.660000
51200 398981.780000
102400 800517.160000
204800 1610967.220000
409600 3250085.410000
819200 6454347.270000

X std::deque
0 0
100 18529.880000
200 25942.200000
400 38937.040000
800 59459.400000
1600 100410.640000
3200 175987.360000
6400 321362.520000
12800 569617.680000
25600 1019693.840000
51200 1930645.720000
102400 3806069.880000
204800 7704495.360000
409600 15629599.200000
819200 31983278.800000

X batch_deque
0 0
100 991.760000
200 1735.560000
400 3127.280000
800 5864.880000
1600 11357.320000
3200 22291.800000
6400 44334.520000
12800 88523.040000
25600 179476.320000
51200 358522.960000
102400 716579.960000
204800 1433212.760000
409600 2868735.640000
819200 5741968.240000

X boost::container::deque
0 0
100 1187.320000
200 4253.640000
400 6874.320000
800 11220.320000
1600 20418.400000
3200 38970.160000
6400 74156.560000
12800 142497.720000
25600 281098.400000
51200 537988.920000
102400 1084152.000000
204800 2232533.560000
409600 4669299.360000
819200 9431670.920000
On Feb 17, 2016 5:21 AM, "Michael Marcin"
FWIW I can't reproduce your results on Visual Studio 2015 Update 1. Here are my results if you're interested.
Thanks for checking! (You are the first to compile it on Windows; I'm surprised it worked.) Looking at your results: although std::vector now wins, std::deque falls awfully behind. I don't know the standard library shipped with MSVC, but I suspect this isn't the expected result. I matched the segment size to libstdc++; maybe Dinkumware uses smaller segments and allocates more often.

Thanks,
Benedek
On 16-02-2016 18:56, Benedek Thaler wrote:
Dear Boost,
What do I miss here?
Hard to say. I guess instruction order may have an impact on branch prediction. But please include a test with unsafe_push_back() for comparison also.

Some more questions:

A. Is the lambda optimized away?
B. Do the tests run the same number of times? Is it asserted at runtime?
C. Do you compile via bjam? I mean, are all the right flags present, e.g. NDEBUG=1?
D. How does the actual code between std::vector and devector compare? Maybe there could be hints to the difference.

regards
Thorsten
I think there is definitely something not quite right.

One thing I notice is that the line

    sample.push_back(get_clock() - base_clock);

is itself being timed, which could potentially access a new memory page, I think.

In pin_thread you have _MSVC_VER instead of _MSC_VER.

Hacking up std::vector and adding a push_back_reserved that does no bounds checks, I get consistently very strange timings. (Code included at the bottom to make it obvious that the latter should be no slower.)
X std::vector::push_back
0 0
100 765.360000
200 1537.260000
400 3037.950000
800 6008.280000
1600 11905.150000
3200 23950.320000
6400 48064.600000
12800 96178.470000
25600 190637.270000
51200 379025.420000
102400 757784.880000
204800 1514584.120000
409600 3026480.680000
819200 6050654.130000
X std::vector::push_back_reserved
0 0
100 938.090000
200 1936.380000
400 3704.400000
800 7274.170000
1600 14307.310000
3200 28550.140000
6400 56835.310000
12800 113579.430000
25600 226906.070000
51200 453802.040000
102400 911835.760000
204800 1819723.520000
409600 3639619.650000
819200 7281897.560000
My simpler, and I'm sure in many ways inferior, benchmark shows much more believable results.
http://codepad.org/SBZ4rbqo
Running 1 test case...
std::vector
On 18-02-2016 08:51, Michael Marcin wrote:
I think there is definitely something not quite right.
One thing I notice is that the line sample.push_back(get_clock() - base_clock); is itself being timed, which could potentially access a new memory page, I think.
In pin_thread you have _MSVC_VER instead of _MSC_VER
I don't understand why all this complicated stuff is necessary. AFAICT, a single-threaded program runs on one CPU.

-Thorsten
On 2/18/2016 7:22 AM, Thorsten Ottosen wrote:
On 18-02-2016 08:51, Michael Marcin wrote:
I think there is definitely something not quite right.
One thing I notice is that the line sample.push_back(get_clock() - base_clock); is itself being timed, which could potentially access a new memory page, I think.
In pin_thread you have _MSVC_VER instead of _MSC_VER
I don't understand why all this complicated stuff is necessary. AFAICT, a single-threaded program runs on one CPU.
FWIW http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-...
On Thu, Feb 18, 2016 at 8:51 AM, Michael Marcin
I think there is definitely something not quite right.
One thing I notice is that the line sample.push_back(get_clock() - base_clock); is itself being timed, which could potentially access a new memory page, I think.
In pin_thread you have _MSVC_VER instead of _MSC_VER
Good catch, fixed this.
Hacking up std::vector and adding a push_back_reserved that does no bounds checks I get consistently very strange timings. (Code included at the bottom to make it obvious that the latter should be no slower).
My simpler, and I'm sure in many ways inferior benchmark, shows much more believable results.
Still, there are huge differences between essentially the same programs.
http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-...
I gave rdtscp a spin; no success, but I'll continue looking at this doc.

Thanks, Benedek
On 2/19/2016 12:48 AM, Benedek Thaler wrote:
On Thu, Feb 18, 2016 at 8:51 AM, Michael Marcin wrote:

I think there is definitely something not quite right.
One thing I notice is that the line sample.push_back(get_clock() - base_clock); is itself being timed, which could potentially access a new memory page, I think.
In pin_thread you have _MSVC_VER instead of _MSC_VER
Good catch, fixed this.
Hacking up std::vector and adding a push_back_reserved that does no bounds checks I get consistently very strange timings. (Code included at the bottom to make it obvious that the latter should be no slower).
My simpler, and I'm sure in many ways inferior benchmark, shows much more believable results.
Still, there are huge differences between essentially the same programs.
http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-...
I gave rdtscp a spin; no success, but I'll continue looking at this doc.
Thanks, Benedek
FWIW (probably not much this time) I took a shot at porting my test to the google/benchmark library. Mostly just to learn the library, but also to see if there were any glaring discrepancies. Not sure I'm using it right, but the relative results seem in line with my expectations.

results: http://codepad.org/U66O82X8
source: http://codepad.org/rii3BHly
On Fri, Feb 19, 2016 at 8:50 AM, Michael Marcin
On 2/19/2016 12:48 AM, Benedek Thaler wrote:
FWIW (probably not much this time)
I took a shot at porting my test to the google/benchmark library. Mostly just to learn the library but also to see if there were any glaring discrepancies. Not sure I'm using it right but the relative results seem in line with my expectations.
results: http://codepad.org/U66O82X8
source: http://codepad.org/rii3BHly
Marcin, Thanks for this take, I like this framework. Looking at the results, I'm wondering whether it would be more informative to just show the generated code instead of the raw numbers and say: the code is similar, the user can expect similar performance.
On Wed, Feb 17, 2016 at 5:34 PM, Thorsten Ottosen
Some more questions: A. Is the lambda optimized away?
It is. I got the same results without any helper functions.
B. Do the tests run the same number of times? Is it asserted at runtime?
It does, just checked it manually.
C. Do you compile via bjam? I mean, are all the right flags present, e.g. NDEBUG=1?
I use bjam. I have -DNDEBUG on the compiler invocation line, and also -O3. I wonder why the release build has -O3 instead of -O2 by default.
D. How does the actual code between std::vector and devector compare? Maybe there could be hints to the difference.
I found nothing indicative.
I don't understand why its necessary with all this complicated stuff. AFAICT, a single threaded program runs on one CPU.
A single-threaded program can be preempted and migrated to another CPU later. Since I'm using the TSC and processor isolation, it's important to always run the program on the same CPU core.
If you do a benchmark without reserve, you should configure devector to use the same growth factor/initial size as the std/boost equivalent.
Currently, the vector tests use reserve, so the growth policy is not exercised.
This is platform dependent, so one test per library is needed.
Regarding the deques, I matched the segment size to my platform (512 elems). After the issue above is resolved, we can have tests for the different libs, yes.

Thanks for checking,
Benedek
On 16-02-2016 18:56, Benedek Thaler wrote:
Dear Boost,
In a previous mail, I introduced a container library, `double_ended`. This library has benchmarks [0]. However, the benchmarks show unexpected results.
Some other comments:

If you do a benchmark without reserve, you should configure devector to use the same growth factor/initial size as the std/boost equivalent. This is platform dependent, so one test per library is needed.

For the reserve test, we should expect very similar results across all platforms. If not, something appears to be wrong.

-Thorsten
participants (3)

- Benedek Thaler
- Michael Marcin
- Thorsten Ottosen