[multi_index] REVISITED: hash_unique with differently produced hash codes
Hello Joaquín,

I have now found the cause of the problem described in message http://lists.boost.org/boost-users/2007/04/27211.php, where a std::string key was not being found. In my case I use Xalan to parse XML and have to convert strings to single-byte characters. Since std::string has some built-in caching to improve performance, and since I decode the double-byte string (XalanDOMString) as follows:

    std::string get_string(XalanDOMString const& str)
    {
        XalanDOMString::CharVectorType v;
        str.transcode(v);
        return std::string(v.begin(), v.end());
    }

this code produces a string which still points to the double-byte buffer, and this results in a different hash value for the returned string than for a basic_string<char> or a C string. An example:

    XalanDOMString str("Sint8");
    std::string char_str(get_string(str));

Calling find in the multi-index for char_str results in a hash value of 40.
Calling find in the multi-index for const char* str="Sint8"; results in a hash value of 13.
The same result (13) is obtained when calling find with char_str.c_str(); c_str() causes the string's internal buffer to be rebuilt.

Since boost::hash is on its way to becoming part of the STL, it should make sure it works with the correctly intended copy of the string. What is your opinion on this issue?

With Kind Regards,
Ovanes
On Thu, April 26, 2007 12:29, Joaquín Mª López Muñoz wrote:
OK, what's happening is the following: boost::hash<std::string>()(str) is not equivalent to boost::hash<const char*>()(str.c_str()); the latter hashes the *pointer*, not the contents pointed to. So, when you write your_container.find(str.c_str())
Well, looking through the source of boost::hash, I came up with a question: which of the hash_value overloads gets called, especially when there are functions such as:
template <class T>
std::size_t hash_value(T* const&);
template
B.MI assumes that what you pass (a const char *) is equivalent, both hash-wise and equivalence-wise, to the stored keys (std::strings), which is not true. The mismatch is in the hash only; equivalence is indeed interoperable, as std::string provides the required overloads for operator== etc.
Returning to your original problem, you said that this works:
    inline some_type_ptr create_type(std::string const& name) const
    {
        types_map::index<hash>::type::const_iterator i =
            types_.get<hash>().find(name.c_str());
        //...
    }
but this does not:
    inline some_type_ptr create_type(std::string const& name) const
    {
        types_map::index<hash>::type::const_iterator i =
            types_.get<hash>().find(name);
        //...
    }
This puzzles me a lot: Given that your types_map container is indexed on a std::string, things should be the other way around: it is the "find(name)" version that should work AFAICS. Could you please double-check?
I double-checked it, and both give the correct hash value. I think this is a string-dependent issue, where the string uses the COW idiom to improve performance; Herb Sutter wrote about it in his Guru of the Week (http://www.gotw.ca/gotw/043.htm). So if I use a const char* to initialize the string, that buffer is probably used inside the string for as long as the string is not modified. But if this is so, there is more or less no reliable way to hash strings, since they can be implicitly converted from const char* to std::string, then used as a hash key, and return a wrong hash result. Unfortunately I cannot step into the string constructors in MSVC 8.0 to see how they are really implemented.

Hashing strings is not a trivial issue, I think. Thanks for your time.

With Kind Regards,
Ovanes Markarian
Ovanes Markarian writes:
[...]
Sorry for the long explanation and then the short idea, hope this is of interest to others...
FWIW, I think your explanation of how boost::hash works with strings, char *s and char []s is correct. [...]
I think this has nothing to do with COW strings: even if this optimization is in effect, hashing a pointer will never yield the same value as hashing the associated contents, so if the index is based on std::strings (COW-based or not) you cannot expect to locate a given string str by using the hash value of a pointer to str's contents. Your own experiments with boost::hash described above must have shown you precisely this.

When I said before that the std::string-based create_type must work and that the one based on const char * must fail, I made a mistake: I've examined the issue more carefully and realized that *both* versions work, albeit not for COW-related reasons. When you define a std::string-keyed index, that index stores internally an object of type boost::hash<std::string>; let's call it h. Now, when you issue a call like

    types_.get<hash>().find(name.c_str());

the internal code of B.MI calculates the hash value of the argument you've passed by invoking

    h(arg); // arg is the argument passed, i.e. name.c_str()

As this h is of type boost::hash<std::string>, its operator() accepts arguments of type std::string, and we're passing a const char*. Given that const char* is implicitly convertible to std::string, a temporary string is created automatically on the fly with the same contents as those pointed to by arg, so the correct hash value is computed and no pointer is actually hashed. I'm sorry for having wrongly stated that passing a const char* must not work. It works, although the way it does is admittedly a little convoluted.

So, if we agree on this, there's a little mystery left: since after double-checking we both agree that the two versions of create_type (based on std::string and on const char *) should work, what was your original problem about then?
This is not a trivial issue to hash strings I think. Thanks for your time.
If there's something still unclear about the above explanation, please tell me so. Best, Joaquín M López Muñoz Telefónica, Investigación y Desarrollo