Re: [boost] [interprocess] boost::hash - different values for the same input? - Boost

Re: [boost] [interprocess] boost::hash - different values for the same input?

older
Any-Range containing transformed...

Daniel James

13 Jan 2016 13 Jan '16

10:07 p.m.

I altered the subject to include interprocess, since this is relevant there. On 9 December 2015 at 20:13, trafdev <trafdev@mail.ru> wrote:

...

As per hash_range docs (http://www.boost.org/doc/libs/1_59_0/doc/html/hash/reference.html#boost.hash...):

"This hash function is not intended for general use, and isn't guaranteed to be equal during separate runs of a program - so please don't use it for any persistent storage or communication."

As per code (http://www.boost.org/doc/libs/1_59_0/boost/functional/hash/hash.hpp) it uses hash_combine_impl which doesn't use any random, process_id etc values which might change during new process launches.

At the same time, in an example describing container usage in the shared memory (http://www.boost.org/doc/libs/1_59_0/doc/html/interprocess/allocators_contai...), boost::hash is being used:

typedef boost::unordered_map < KeyType , MappedType , boost::hash<KeyType> ,std::equal_to<KeyType> , ShmemAllocator> MyHashMap;

My question is: is it safe to use boost::hash in a shared (persistent) storage or it's safe only for a particular (non-pointer) types?

In this case, boost::hash docs should be changed to mention this exceptions, because otherwise no one could use boost::hash in a shared memory containers.

Sorry about the very slow reply. I'm afraid the answer is that it depends. The reason why I originally wrote that note was because it can generate different hash values when compiled for different platforms or architectures, e.g. a 32-bit executable might generate a different hash value to a 64-bit executable. Also, I occasionally change the algorithm in new versions of boost (and probably will in the next version). For interprocess this currently won't be an issue if all the processes are using the same executable, or using executables built with same version of boost and the same architecture, but could be a problem if that isn't the case. So maybe the Interprocess documentation needs to explain this. But there's another problem coming up. The last proposal I saw for a standard generic hash function will support randomized hash functions (such as siphash) which will only work with interprocess if they are somehow seeded identically. I'll want to support whatever is accepted to the standard (although maybe as a separate library). I'll need to catch up with how that is progressing, as I haven't kept up to date.

Show replies by date

Andrey Semashev

13 Jan 13 Jan

10:18 p.m.

New subject: [interprocess] boost::hash - different values for the same input?

On 2016-01-14 01:07, Daniel James wrote:

...

I altered the subject to include interprocess, since this is relevant there.

On 9 December 2015 at 20:13, trafdev <trafdev@mail.ru> wrote:

...
My question is: is it safe to use boost::hash in a shared (persistent) storage or it's safe only for a particular (non-pointer) types?

In this case, boost::hash docs should be changed to mention this exceptions, because otherwise no one could use boost::hash in a shared memory containers.

Sorry about the very slow reply. I'm afraid the answer is that it depends. The reason why I originally wrote that note was because it can generate different hash values when compiled for different platforms or architectures, e.g. a 32-bit executable might generate a different hash value to a 64-bit executable. Also, I occasionally change the algorithm in new versions of boost (and probably will in the next version). For interprocess this currently won't be an issue if all the processes are using the same executable, or using executables built with same version of boost and the same architecture, but could be a problem if that isn't the case. So maybe the Interprocess documentation needs to explain this.

But there's another problem coming up. The last proposal I saw for a standard generic hash function will support randomized hash functions (such as siphash) which will only work with interprocess if they are somehow seeded identically. I'll want to support whatever is accepted to the standard (although maybe as a separate library). I'll need to catch up with how that is progressing, as I haven't kept up to date.

Do you think we could add a few well-known hash functions to Boost.Hash so that they could be used for consistent results?

Daniel James

11:09 p.m.

New subject: [interprocess] boost::hash - different values for the same input?

On 13 January 2016 at 22:18, Andrey Semashev <andrey.semashev@gmail.com> wrote:

...

Do you think we could add a few well-known hash functions to Boost.Hash so that they could be used for consistent results?

I think that would be best handled in a new library. I don't really want to enhance the existing one beyond its current scope, especially as the current standard proposal is a bit different to it, and supports alternative hash functions. Adding extra functions would probably also mean that it wouldn't be a header only library, which would be a problem. If you haven't seen the new proposal, it's at: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/p0029r0.html There's an implementation at: https://github.com/google/hashing-demo/tree/N0029R0 I assume that something based on that proposal is likely to be accepted. You could just use it (maybe I should add a macro to change the default hash function in unordered....), or perhaps a boost implementation could be based on it. One advantage of a new library is that it can have much stricter compiler requirements which should simplify the implementation.

Bjorn Reese

10:30 p.m.

New subject: [interprocess] boost::hash - different values for the same input?

On 01/13/2016 11:18 PM, Andrey Semashev wrote:

...

Do you think we could add a few well-known hash functions to Boost.Hash so that they could be used for consistent results?

I was just about to suggest the same... I frequently use hash functions to calculate fingerprints that are exchanged in network protocols, and here consistent results, both across platforms and boost releases, are essential.

Daniel James

11:16 p.m.

New subject: [interprocess] boost::hash - different values for the same input?

On 13 January 2016 at 22:30, Bjorn Reese <breese@mail1.stofanet.dk> wrote:

...

On 01/13/2016 11:18 PM, Andrey Semashev wrote:

...
Do you think we could add a few well-known hash functions to Boost.Hash so that they could be used for consistent results?

I was just about to suggest the same...

I frequently use hash functions to calculate fingerprints that are exchanged in network protocols, and here consistent results, both across platforms and boost releases, are essential.

I think I've always been clear that the library is not appropriate for such uses. In terms of hash quality, it does nothing more than meet the existing std::hash requirements (which are very low). It's also pretty much impossible to fit a stronger hash function into the existing interface. I actually regret calling it 'hash', I probably should have called it something less general. Technically its name is 'Functional/Hash', because std::hash is in the functional header, but that didn't make much sense to anyone.

Phil Endecott

14 Jan 14 Jan

5:18 p.m.

New subject: [interprocess] boost::hash - different values for the same input?

Bjorn Reese wrote:

...

I frequently use hash functions to calculate fingerprints that are exchanged in network protocols, and here consistent results, both across platforms and boost releases, are essential.

I use Boost.CRC for that sort of thing: http://www.boost.org/doc/libs/1_60_0/libs/crc/index.html Regards, Phil.

Bjorn Reese

5:59 p.m.

New subject: [interprocess] boost::hash - different values for the same input?

On 01/14/2016 06:18 PM, Phil Endecott wrote:

...

I use Boost.CRC for that sort of thing:

http://www.boost.org/doc/libs/1_60_0/libs/crc/index.html

The protocols I am using do not use fingerprinting for error correction but rather for identifying data chunks. See for instance: http://people.eng.unimelb.edu.au/jzobel/fulltext/jasist03thz.pdf You will notice that equation (1) on page 207 looks an awful lot like the default boost::hash implementation. Just to be sure, I do use my own implementation instead of boost::hash (and hash_combine) because Daniel does not guarantee its consistency. That is fine with me, but I really would like to see some named hash functions in boost that do not change, as Andrey suggested. This sounds like a suitable a GSoC project...

Peter Dimov

7:22 p.m.

New subject: [interprocess] boost::hash - different values for the same input?

Bjorn Reese wrote:

...

The protocols I am using do not use fingerprinting for error correction but rather for identifying data chunks. See for instance:

http://people.eng.unimelb.edu.au/jzobel/fulltext/jasist03thz.pdf

You will notice that equation (1) on page 207 looks an awful lot like the default boost::hash implementation.

If you look at the Boost.Hash documentation in http://www.boost.org/doc/libs/1_60_0/doc/html/hash/rationale.html you'll see that it points to issue 6.18 of http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1837.pdf which (at page 67) says that the proposed hash_combine is taken from that very paper. :-) We just added a constant offset during the review because otherwise 0 was a trap. My original idea was for hash_combine to be completely specified and therefore the behavior of boost::hash to be completely predictable from the values of the input.

Bjorn Reese

9:23 p.m.

New subject: [interprocess] boost::hash - different values for the same input?

On 01/14/2016 08:22 PM, Peter Dimov wrote:

...

which (at page 67) says that the proposed hash_combine is taken from that very paper. :-)

Oh, that explains a lot :)

Phil Endecott

11:23 p.m.

New subject: [interprocess] boost::hash - different values for the same input?

Bjorn Reese <breese@mail1.stofanet.dk> wrote:

...

On 01/14/2016 06:18 PM, Phil Endecott wrote:

...
I use Boost.CRC for that sort of thing:

http://www.boost.org/doc/libs/1_60_0/libs/crc/index.html

The protocols I am using do not use fingerprinting for error correction

CRCs are not for error correction. You might be thinking of ECC?

...

but rather for identifying data chunks.

A CRC is a good general-purpose hash function. Regards, Phil.

Bjorn Reese

17 Jan 17 Jan

12:58 p.m.

New subject: [interprocess] boost::hash - different values for the same input?

On 01/15/2016 12:23 AM, Phil Endecott wrote:

...

Bjorn Reese <breese@mail1.stofanet.dk> wrote:

...

...
The protocols I am using do not use fingerprinting for error correction

CRCs are not for error correction. You might be thinking of ECC?

Sorry, I meant to write error detection.

...

...
but rather for identifying data chunks.

A CRC is a good general-purpose hash function.

Indeed, and a good example of a named hash function already in Boost. Thanks for pointing that out. That said, no single hash function excels at every possibly use case. While you can probably use any hash function (unless we require specific features such as locality-preservation) for any use case, the quality of the results differ. Some hash functions like CRC are well-suited for error detection, and other hash functions like the one in the previously cited paper are better at identifying documents.

Peter Dimov

14 Jan 14 Jan

4:54 a.m.

New subject: [interprocess] boost::hash - different values for thesame input?

...

The reason why I originally wrote that note was because it can generate different hash values when compiled for different

Daniel James wrote: platforms or architectures, e.g. a 32-bit executable might generate a different hash value to a 64-bit executable. Is this just because of pointers and floats, or are there other reasons? size_t hashing to different values perhaps?

Daniel James

8:59 p.m.

New subject: [interprocess] boost::hash - different values for thesame input?

On 14 January 2016 at 04:54, Peter Dimov <lists@pdimov.com> wrote:

...

Daniel James wrote:

...
The reason why I originally wrote that note was because it can generate different hash values when compiled for different

platforms or architectures, e.g. a 32-bit executable might generate a different hash value to a 64-bit executable.

Is this just because of pointers and floats, or are there other reasons? size_t hashing to different values perhaps?

Anything with a bigger range than size_t will have to, e.g. int64_t will hash differently on 32 and 64 bit machines. Also, user hash functions could be anything.

Peter Dimov

10:52 p.m.

New subject: [interprocess] boost::hash - different values forthesame input?

Daniel James wrote:

...

On 14 January 2016 at 04:54, Peter Dimov <lists@pdimov.com> wrote:

...
Daniel James wrote:

...
The reason why I originally wrote that note was because it can generate different hash values when compiled for different platforms or architectures, e.g. a 32-bit executable might generate a different hash value to a 64-bit executable.

Is this just because of pointers and floats, or are there other reasons? size_t hashing to different values perhaps?

Anything with a bigger range than size_t will have to, e.g. int64_t will hash differently on 32 and 64 bit machines.

Yes, my question was a bit stupid because the return value of hash<> is size_t. A better way to put it would be whether the lower 32 bits of the return value need to differ on 32 and 64 bit platforms. I realize that there's a way to hash uint64_t to different values, but there are also ways to make it hash to the same value (mod 2^32). Apart from that, people have argued recently that hash values should vary for each process run. That's because of security considerations with hash tables - an attacker can supply input that is known to create collisions and can so degrade the hash table's performance and effect a denial of service. When this came up on the std reflectors, I argued that the proper way to tackle this issue was not to make the standard hash function nondeterministic, but to have the containers use a random seed when hashing an element.

3508

Age (days ago)

3512

Last active (days ago)

List overview

Download

13 comments

5 participants

participants (5)

Andrey Semashev
Bjorn Reese
Daniel James
Peter Dimov
Phil Endecott