Re: [boost] [interprocess] boost::hash - different values for the same input?
I altered the subject to include interprocess, since this is relevant there.
On 9 December 2015 at 20:13, trafdev
As per hash_range docs (http://www.boost.org/doc/libs/1_59_0/doc/html/hash/reference.html#boost.hash...):
"This hash function is not intended for general use, and isn't guaranteed to be equal during separate runs of a program - so please don't use it for any persistent storage or communication."
As per code (http://www.boost.org/doc/libs/1_59_0/boost/functional/hash/hash.hpp) it uses hash_combine_impl which doesn't use any random, process_id etc values which might change during new process launches.
At the same time, in an example describing container usage in the shared memory (http://www.boost.org/doc/libs/1_59_0/doc/html/interprocess/allocators_contai...), boost::hash is being used:
typedef boost::unordered_map < KeyType , MappedType , boost::hash<KeyType> ,std::equal_to<KeyType> , ShmemAllocator> MyHashMap;
My question is: is it safe to use boost::hash in a shared (persistent) storage or it's safe only for a particular (non-pointer) types?
In this case, boost::hash docs should be changed to mention this exceptions, because otherwise no one could use boost::hash in a shared memory containers.
Sorry about the very slow reply. I'm afraid the answer is that it depends. The reason why I originally wrote that note was because it can generate different hash values when compiled for different platforms or architectures, e.g. a 32-bit executable might generate a different hash value to a 64-bit executable. Also, I occasionally change the algorithm in new versions of boost (and probably will in the next version). For interprocess this currently won't be an issue if all the processes are using the same executable, or using executables built with same version of boost and the same architecture, but could be a problem if that isn't the case. So maybe the Interprocess documentation needs to explain this. But there's another problem coming up. The last proposal I saw for a standard generic hash function will support randomized hash functions (such as siphash) which will only work with interprocess if they are somehow seeded identically. I'll want to support whatever is accepted to the standard (although maybe as a separate library). I'll need to catch up with how that is progressing, as I haven't kept up to date.
On 2016-01-14 01:07, Daniel James wrote:
I altered the subject to include interprocess, since this is relevant there.
On 9 December 2015 at 20:13, trafdev
wrote: My question is: is it safe to use boost::hash in a shared (persistent) storage or it's safe only for a particular (non-pointer) types?
In this case, boost::hash docs should be changed to mention this exceptions, because otherwise no one could use boost::hash in a shared memory containers.
Sorry about the very slow reply. I'm afraid the answer is that it depends. The reason why I originally wrote that note was because it can generate different hash values when compiled for different platforms or architectures, e.g. a 32-bit executable might generate a different hash value to a 64-bit executable. Also, I occasionally change the algorithm in new versions of boost (and probably will in the next version). For interprocess this currently won't be an issue if all the processes are using the same executable, or using executables built with same version of boost and the same architecture, but could be a problem if that isn't the case. So maybe the Interprocess documentation needs to explain this.
But there's another problem coming up. The last proposal I saw for a standard generic hash function will support randomized hash functions (such as siphash) which will only work with interprocess if they are somehow seeded identically. I'll want to support whatever is accepted to the standard (although maybe as a separate library). I'll need to catch up with how that is progressing, as I haven't kept up to date.
Do you think we could add a few well-known hash functions to Boost.Hash so that they could be used for consistent results?
On 13 January 2016 at 22:18, Andrey Semashev
Do you think we could add a few well-known hash functions to Boost.Hash so that they could be used for consistent results?
I think that would be best handled in a new library. I don't really want to enhance the existing one beyond its current scope, especially as the current standard proposal is a bit different to it, and supports alternative hash functions. Adding extra functions would probably also mean that it wouldn't be a header only library, which would be a problem. If you haven't seen the new proposal, it's at: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/p0029r0.html There's an implementation at: https://github.com/google/hashing-demo/tree/N0029R0 I assume that something based on that proposal is likely to be accepted. You could just use it (maybe I should add a macro to change the default hash function in unordered....), or perhaps a boost implementation could be based on it. One advantage of a new library is that it can have much stricter compiler requirements which should simplify the implementation.
On 01/13/2016 11:18 PM, Andrey Semashev wrote:
Do you think we could add a few well-known hash functions to Boost.Hash so that they could be used for consistent results?
I was just about to suggest the same... I frequently use hash functions to calculate fingerprints that are exchanged in network protocols, and here consistent results, both across platforms and boost releases, are essential.
On 13 January 2016 at 22:30, Bjorn Reese
On 01/13/2016 11:18 PM, Andrey Semashev wrote:
Do you think we could add a few well-known hash functions to Boost.Hash so that they could be used for consistent results?
I was just about to suggest the same...
I frequently use hash functions to calculate fingerprints that are exchanged in network protocols, and here consistent results, both across platforms and boost releases, are essential.
I think I've always been clear that the library is not appropriate for such uses. In terms of hash quality, it does nothing more than meet the existing std::hash requirements (which are very low). It's also pretty much impossible to fit a stronger hash function into the existing interface. I actually regret calling it 'hash', I probably should have called it something less general. Technically its name is 'Functional/Hash', because std::hash is in the functional header, but that didn't make much sense to anyone.
Bjorn Reese wrote:
I frequently use hash functions to calculate fingerprints that are exchanged in network protocols, and here consistent results, both across platforms and boost releases, are essential.
I use Boost.CRC for that sort of thing: http://www.boost.org/doc/libs/1_60_0/libs/crc/index.html Regards, Phil.
On 01/14/2016 06:18 PM, Phil Endecott wrote:
I use Boost.CRC for that sort of thing:
The protocols I am using do not use fingerprinting for error correction but rather for identifying data chunks. See for instance: http://people.eng.unimelb.edu.au/jzobel/fulltext/jasist03thz.pdf You will notice that equation (1) on page 207 looks an awful lot like the default boost::hash implementation. Just to be sure, I do use my own implementation instead of boost::hash (and hash_combine) because Daniel does not guarantee its consistency. That is fine with me, but I really would like to see some named hash functions in boost that do not change, as Andrey suggested. This sounds like a suitable a GSoC project...
Bjorn Reese wrote:
The protocols I am using do not use fingerprinting for error correction but rather for identifying data chunks. See for instance:
http://people.eng.unimelb.edu.au/jzobel/fulltext/jasist03thz.pdf
You will notice that equation (1) on page 207 looks an awful lot like the default boost::hash implementation.
If you look at the Boost.Hash documentation in http://www.boost.org/doc/libs/1_60_0/doc/html/hash/rationale.html you'll see that it points to issue 6.18 of http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1837.pdf which (at page 67) says that the proposed hash_combine is taken from that very paper. :-) We just added a constant offset during the review because otherwise 0 was a trap. My original idea was for hash_combine to be completely specified and therefore the behavior of boost::hash to be completely predictable from the values of the input.
Bjorn Reese
On 01/14/2016 06:18 PM, Phil Endecott wrote:
I use Boost.CRC for that sort of thing:
The protocols I am using do not use fingerprinting for error correction
CRCs are not for error correction. You might be thinking of ECC?
but rather for identifying data chunks.
A CRC is a good general-purpose hash function. Regards, Phil.
On 01/15/2016 12:23 AM, Phil Endecott wrote:
Bjorn Reese
wrote:
The protocols I am using do not use fingerprinting for error correction
CRCs are not for error correction. You might be thinking of ECC?
Sorry, I meant to write error detection.
but rather for identifying data chunks.
A CRC is a good general-purpose hash function.
Indeed, and a good example of a named hash function already in Boost. Thanks for pointing that out. That said, no single hash function excels at every possibly use case. While you can probably use any hash function (unless we require specific features such as locality-preservation) for any use case, the quality of the results differ. Some hash functions like CRC are well-suited for error detection, and other hash functions like the one in the previously cited paper are better at identifying documents.
The reason why I originally wrote that note was because it can generate different hash values when compiled for different
Daniel James wrote: platforms or architectures, e.g. a 32-bit executable might generate a different hash value to a 64-bit executable. Is this just because of pointers and floats, or are there other reasons? size_t hashing to different values perhaps?
On 14 January 2016 at 04:54, Peter Dimov
Daniel James wrote:
The reason why I originally wrote that note was because it can generate different hash values when compiled for different
platforms or architectures, e.g. a 32-bit executable might generate a different hash value to a 64-bit executable.
Is this just because of pointers and floats, or are there other reasons? size_t hashing to different values perhaps?
Anything with a bigger range than size_t will have to, e.g. int64_t will hash differently on 32 and 64 bit machines. Also, user hash functions could be anything.
Daniel James wrote:
On 14 January 2016 at 04:54, Peter Dimov
wrote: Daniel James wrote:
The reason why I originally wrote that note was because it can generate different hash values when compiled for different platforms or architectures, e.g. a 32-bit executable might generate a different hash value to a 64-bit executable.
Is this just because of pointers and floats, or are there other reasons? size_t hashing to different values perhaps?
Anything with a bigger range than size_t will have to, e.g. int64_t will hash differently on 32 and 64 bit machines.
Yes, my question was a bit stupid because the return value of hash<> is size_t. A better way to put it would be whether the lower 32 bits of the return value need to differ on 32 and 64 bit platforms. I realize that there's a way to hash uint64_t to different values, but there are also ways to make it hash to the same value (mod 2^32). Apart from that, people have argued recently that hash values should vary for each process run. That's because of security considerations with hash tables - an attacker can supply input that is known to create collisions and can so degrade the hash table's performance and effect a denial of service. When this came up on the std reflectors, I argued that the proper way to tackle this issue was not to make the standard hash function nondeterministic, but to have the containers use a random seed when hashing an element.
participants (5)
-
Andrey Semashev
-
Bjorn Reese
-
Daniel James
-
Peter Dimov
-
Phil Endecott