[Multi_index] Performance like sequenced.cpp example

Manuel Jung

30 Apr 2007

5:47 p.m.

Hi,

I have to count a lot of words. Up to now I did it with MySQL, because it was easy; the result is saved there anyway. I thought I could speed this up a little by using a Multi_index list internally to store the words, so that I only have to insert the distinct words. The words are stored as UnicodeString objects from the ICU library. My code is very close to the "sequenced.cpp" example. I'm using the following definition:

  typedef multi_index_container<
    UnicodeString,
    indexed_by<
      sequenced<>,
      ordered_non_unique<identity<UnicodeString> >
  ...
  text_container;

  typedef nth_index<text_container,1>::type ordered_text;
  text_container tc;

I insert new words with "tc.push_back(UnicodeString(NewWord));" and count them exactly as in the example. I thought this would be fast, but it isn't. It eats up all my CPU and is a lot slower than my old solution. I still hope I can speed this up before I have to switch back to MySQL. The profile of a run shows that "boost::multi_index::safe_mode::check_same_owner<..." takes most of the CPU time.

Any suggestions on how to speed this up with Multi_index? Or any ideas for another approach that would be faster than MySQL inserts?

Thanks,
Manu

Ovanes Markarian

30 Apr 2007

6:05 p.m.

On Mon, April 30, 2007 19:47, Manuel Jung wrote:

...

Try using hashed_non_unique instead of the ordered_non_unique index. It looks keys up by hash value rather than with a comparison function.

My personal opinion is that if your words are already in the database, you should not retrieve them from there only to store them again. An SQL solution will always be faster, since databases know how to optimize statements and result sets.

With kind regards,
Ovanes Markarian

Manuel Jung

6:26 p.m.

> Try using hashed_non_unique instead of the ordered_non_unique index. It looks keys up by hash value rather than with a comparison function.

Would this really work? If I use the hashed_non_unique index, I can't use std::distance and "upper_bound" to get the count of equal words, because the index isn't sorted anymore. Or am I wrong?

> My personal opinion is that if your words are already in the database, you should not retrieve them from there only to store them again. An SQL solution will always be faster, since databases know how to optimize statements and result sets.

The original data does not come from the database; if it did, I would do the counting directly with a user-defined function or stored procedure. In my application the data is downloaded from the internet and written to the DB before or after the words are counted. (I'm counting in the database with "INSERT ON DUPLICATE KEY UPDATE" statements.)

Cheers,
Manu

Ovanes Markarian

6:48 p.m.

On Mon, April 30, 2007 20:26, Manuel Jung wrote:

>> Try using hashed_non_unique instead of the ordered_non_unique index. It looks keys up by hash value rather than with a comparison function.
>
> Would this really work? If I use the hashed_non_unique index, I can't use std::distance and "upper_bound" to get the count of equal words, because the index isn't sorted anymore. Or am I wrong?

Please take a look at: http://www.boost.org/libs/multi_index/doc/reference/hash_indices.html#hash_i...

There is a member function count (2 overloads), which counts all elements with a given key, and another member function equal_range (2 overloads), which returns a pair<iterator, iterator> marking the begin and end of the range.

>> My personal opinion is that if your words are already in the database, you should not retrieve them from there only to store them again. An SQL solution will always be faster, since databases know how to optimize statements and result sets.
>
> The original data does not come from the database; if it did, I would do the counting directly with a user-defined function or stored procedure. In my application the data is downloaded from the internet and written to the DB before or after the words are counted. (I'm counting in the database with "INSERT ON DUPLICATE KEY UPDATE" statements.)

OK, I wanted to be sure. ;)

> Cheers,
> Manu

With kind regards,
Ovanes Markarian

Manuel Jung

8:42 p.m.

>>> Try using hashed_non_unique instead of the ordered_non_unique index. It looks keys up by hash value rather than with a comparison function.
>>
>> Would this really work? If I use the hashed_non_unique index, I can't use std::distance and "upper_bound" to get the count of equal words, because the index isn't sorted anymore. Or am I wrong?
>
> Please take a look at:
>
> http://www.boost.org/libs/multi_index/doc/reference/hash_indices.html#hash_i...
>
> There is a member function count (2 overloads), which counts all elements with a given key, and another member function equal_range (2 overloads), which returns a pair<iterator, iterator> marking the begin and end of the range.

I took a look at it. Thank you very much. I have never used a hashed index, but I should sometime. For now, thanks to the quick solution a few posts earlier, I will optimize elsewhere. But I'll come back if needed!

Thank you for your help.

Bye,
Manu

Joaquín M López Muñoz

7:23 p.m.

Hello Manuel,

----- Original message -----
From: Manuel Jung <gzahl@arcor.de>
Date: Monday, April 30, 2007 7:49 pm
Subject: [Boost-users] [Multi_index] Performance like sequenced.cpp example
To: boost-users@lists.boost.org

> I insert new words with "tc.push_back(UnicodeString(NewWord));" and count them exactly as in the example. I thought this would be fast, but it isn't. It eats up all my CPU and is a lot slower than my old solution. I still hope I can speed this up before I have to switch back to MySQL. The profile of a run shows that "boost::multi_index::safe_mode::check_same_owner<..." takes most of the CPU time.

This trace indicates that you've set Boost.MultiIndex safe mode on; this and its companion invariant-checking mode are huge CPU eaters, only intended for catching programming errors in debug builds. Please turn them off and time again: is the performance adequate now?

Joaquín M López Muñoz
Telefónica, Investigación y Desarrollo

Manuel Jung

8:40 p.m.

Good evening,

> This trace indicates that you've set Boost.MultiIndex safe mode on; this and its companion invariant-checking mode are huge CPU eaters, only intended for catching programming errors in debug builds. Please turn them off and time again: is the performance adequate now?

Yes. I used the release build and looked at the profile again. It looks different: a Multi_index function is still at the top, but a different one, belonging to a list that is used *very* often, so this should be okay. MySQL is now the bottleneck, and my application uses much less CPU time. Thank you!

Greetings,
Manu
