[random] new threefry random engine

older
Errors Compiling in AIX 7.1 with...

Thijs van den Berg

17 Apr 2014 17 Apr '14

10:19 p.m.

Hi all, I've completed a nice new random engine that I would like to propose for addition to boost/random. The engine is called "threefry" and it's based on the paper "Parallel random numbers: as easy as 1, 2, 3" by Salmon, John K. and Moraes, Mark A. and Dror, Ron O. and Shaw, David E. I've been using it myself a lot for the last couple of years, and now I've rewritten it so that it will fit in the boost random framework and adhere to the concepts. It has some really nice properties, in particular for large scale parallel computing: * High quality randomness (according to the BigCrush tests) with a cycle length of 2^258 * Fast (comparable to Mersenne Twister) & little memory usage (13x 64 bit integers) * O(1) discard and guarantees for non-overlapping sequences when the (up to 256 bit) seeds are unequal I've uploaded the code, description and tests to https://github.com/sitmo/threefry The actual engine is at https://github.com/sitmo/threefry/blob/master/random/include/boost/random/th... I would really appreciate any feedback about the idea for inclusion and of course comments about the quality of the code,tests,docs kind regards, Thijs Sitmo Consultancy B.V. Financial Modelling & Data Science + 31 6 24110061 thijs@sitmo.com P.O. Box 1059, 2600BB, Delft, The Netherlands

Show replies by date

Steven Watanabe

18 Apr 18 Apr

4:54 a.m.

AMDG On 04/17/2014 04:19 PM, Thijs van den Berg wrote:

...

I've completed a nice new random engine that I would like to propose for addition to boost/random.

The engine is called "threefry" and it's based on the paper "Parallel random numbers: as easy as 1, 2, 3" by Salmon, John K. and Moraes, Mark A. and Dror, Ron O. and Shaw, David E.

I've been using it myself a lot for the last couple of years, and now I've rewritten it so that it will fit in the boost random framework and adhere to the concepts. It has some really nice properties, in particular for large scale parallel computing: * High quality randomness (according to the BigCrush tests) with a cycle length of 2^258 * Fast (comparable to Mersenne Twister) & little memory usage (13x 64 bit integers) * O(1) discard and guarantees for non-overlapping sequences when the (up to 256 bit) seeds are unequal

I've uploaded the code, description and tests to https://github.com/sitmo/threefry The actual engine is at https://github.com/sitmo/threefry/blob/master/random/include/boost/random/th...

I would really appreciate any feedback about the idea for inclusion and of course comments about the quality of the code,tests,docs

I /really/ dislike the constraint that rounds shall be 13 or 20. It looks like encrypt_counter follows a regular pattern. I'd strongly prefer it to be implemented as a loop and allow rounds to be any non-negative integer. Not all possible values make sense, but that's normal. You have to know what you're doing when you define a specialization of any random number engine. Do not use reinterpret_cast in operator(). The behavior is undefined, and will certainly differ depending on endianness. I'm not sure that I like the fact that the textual representation interleaves the three arrays. Also output is a function of key and counter and does not need to be stored. (The same observation applies to the comparison operators) Is it really a good idea to expose set_key/set_counter? It would be better to have an explicit width parameter, like mersenne_twister instead of relying on the size of UIntType. I'd also prefer to see uint_least64_t in place of uint64_t. (Actually I'd like it even more if the arrays stored UIntType, but this doesn't look particularly feasible, given the way the algorithm works.) Don't put usings in namespace boost. The other engines have using declarations for backwards compatibility, which is not a concern for a new engine. Run inspect (http://www.boost.org/tools/inspect). There's at least one use of max that needs to be protected. In Christ, Steven Watanabe

Thijs van den Berg

8 a.m.

On 18 Apr 2014, at 06:54, Steven Watanabe <watanabesj@gmail.com> wrote:

...

AMDG

On 04/17/2014 04:19 PM, Thijs van den Berg wrote:

...
I've completed a nice new random engine that I would like to propose for addition to boost/random.

The engine is called "threefry" and it's based on the paper "Parallel random numbers: as easy as 1, 2, 3" by Salmon, John K. and Moraes, Mark A. and Dror, Ron O. and Shaw, David E.

I've been using it myself a lot for the last couple of years, and now I've rewritten it so that it will fit in the boost random framework and adhere to the concepts. It has some really nice properties, in particular for large scale parallel computing: * High quality randomness (according to the BigCrush tests) with a cycle length of 2^258 * Fast (comparable to Mersenne Twister) & little memory usage (13x 64 bit integers) * O(1) discard and guarantees for non-overlapping sequences when the (up to 256 bit) seeds are unequal

I've uploaded the code, description and tests to https://github.com/sitmo/threefry The actual engine is at https://github.com/sitmo/threefry/blob/master/random/include/boost/random/th...

I would really appreciate any feedback about the idea for inclusion and of course comments about the quality of the code,tests,docs

I /really/ dislike the constraint that rounds shall be 13 or 20. It looks like encrypt_counter follows a regular pattern. I'd strongly prefer it to be implemented as a loop and allow rounds to be any non-negative integer. Not all possible values make sense, but that's normal. You have to know what you're doing when you define a specialization of any random number engine. Very good. I’ll have to change things a a bit. The algorithm has indeed reppetitive patterns of multiples of length 4 and 8. I’ll make it an unconstrained integer and a loop.

...

Do not use reinterpret_cast in operator(). The behavior is undefined, and will certainly differ depending on endianness.

I did this for speed and I didn’t worry about endianness because it never crossed my mind. Changing the order of bits/bytes should not affect the quality of the randomness, but it’s indeed not nice to have different behaviour on different machines. I’ll fix this.

...

I'm not sure that I like the fact that the textual representation interleaves the three arrays. Also output is a function of key and counter and does not need to be stored. (The same observation applies to the comparison operators)

Yes, I’ve already changed that.

...

Is it really a good idea to expose set_key/set_counter?

Indeed, it’s not needed, I’ve moved them to private.

...

It would be better to have an explicit width parameter, like mersenne_twister instead of relying on the size of UIntType. Ok, this takes a bit of time but it’s not dramatic. It’s something that combines nicely with the operator() point.

...

I'd also prefer to see uint_least64_t in place of uint64_t. OK.

...

(Actually I'd like it even more if the arrays stored UIntType, but this doesn't look particularly feasible, given the way the algorithm works.) This algorithm is what they call the 4x64 variant. The paper also talks about a 2x64 and 4x32 variants (but e.g. not 8x32, 2x32). However I’m not sure if we need to implement all the variant the’ve studied (maybe yes, maybe no). They recommend 4x64 with 20 rounds as the best choice for CPU’s and that’s the main one I want to provide. AFAIK the state and the return_type need not be the same. Any 32 bit engine can be turned into a 64 engine by patching two returns together.

The 32 bit state type versions have different constants in the repetitive operators. The authors also studies variants with different cryptographic functions. E.g. one uses EAS (but that has a different name than threefry) which can be faster on CPU's supports EAS instructions. It however calls for assembler. There is also a 3rd variant that targets CUDA machines. In a wider setting all these variant use a counter and a cryptographic function (which probably all use a key?). I haven’t studies the other variants. Perhaps we can later refactor -without changing the interface- the code when we decide to add more variants? E.g. can we stick to the 4x64 threefry algorithm function, have two versions that return either 32 or 64 bit integers specified by a width argument. The two other variant 2x64 and 4x32 could perhaps be added later, those are however not recommended by the paper. I would call the 4x64 part of the algorithm, not NxM with template arguments that can be tweaked by smart users.

...

Don't put usings in namespace boost. The other engines have using declarations for backwards compatibility, which is not a concern for a new engine.

Ok.

...

Run inspect (http://www.boost.org/tools/inspect). There's at least one use of max that needs to be protected.

Excellent, I didn’t know it’s existence, sound very handy. Thanks for the useful feedback, it’s much appreciated.

...

In Christ, Steven Watanabe

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Steven Watanabe

19 Apr 19 Apr

5:46 p.m.

AMDG On 04/18/2014 02:00 AM, Thijs van den Berg wrote:

...

This algorithm is what they call the 4x64 variant. The paper also talks about a 2x64 and 4x32 variants (but e.g. not 8x32, 2x32). However I’m not sure if we need to implement all the variant the’ve studied (maybe yes, maybe no). They recommend 4x64 with 20 rounds as the best choice for CPU’s and that’s the main one I want to provide.

If the algorithm is basically the same and only differs by some constants in the mixing function, then these constants can be turned into template parameters. If the differences are more significant, then it should be handled by a separate engine template, and there's no need to worry about it at this point.

...

AFAIK the state and the return_type need not be the same. Any 32 bit engine can be turned into a 64 engine by patching two returns together.

This can be done outside of threefry by independent_bits_engine. In Christ, Steven Watanabe

Thijs van den Berg

10:35 p.m.

On 19 Apr 2014, at 19:46, Steven Watanabe <watanabesj@gmail.com> wrote:

...

AMDG

On 04/18/2014 02:00 AM, Thijs van den Berg wrote:

...
This algorithm is what they call the 4x64 variant. The paper also talks about a 2x64 and 4x32 variants (but e.g. not 8x32, 2x32). However I’m not sure if we need to implement all the variant the’ve studied (maybe yes, maybe no). They recommend 4x64 with 20 rounds as the best choice for CPU’s and that’s the main one I want to provide.

If the algorithm is basically the same and only differs by some constants in the mixing function, then these constants can be turned into template parameters. If the differences are more significant, then it should be handled by a separate engine template, and there's no need to worry about it at this point. Right now I’m having performance issues with just this 4x64 version & an arbitrary rounds extension.

* I have a un-rolled version “encrypt_counter()” that handles up to 20 rounds (or less), and which has all the constants hard coded in the code. This version times 13.8 CPU cycles for 20 rounds. * I have an arbitrary round version “encrypt_counter_0()” which looks much nicer from a code point of view, it has a loop and static const arrays containing the rotation constants (that repeat every 8 round). However, it takes 42 CPU cycles for the same 20 rounds. I’ve tried many variants and I just can’t make the loop version at fast as the un-rolled one… Perhaps someone can explain why that is? https://github.com/sitmo/threefry/blob/master/random/include/boost/random/th... * initially I had a 13 and 20 rounds version. The 13 round version was supposed to be the lowest rounds version to pass all the BigCrush test, but in my tests it failed one out of the 160 tests. That’s perhaps still better than MT which failed 2 or 3 tests (depending on the version of MT) but "failing only 1 test" is not a perfect selling point. Perhaps 14 round would solve that? The 20 rounds version advised in the paper has 7 extra round as "a big safety margin" and that one passes indeed all BigCrunch tests. My view now is: * I think we should just provide the 20 rounds version by default and allow developers to pick less rounds -and gain a little speed- if they know what they are doing. * I like the "big safety margin” for a default option, I gives you trust that you can pick this engine if you want the “the highest quality random numbers”. I think we shouldn’t predefine the 13 rounds version with a typedef. * The factor 2.5 in speed drop for writing an arbitrary round loop version is not worth it IMO. So I propose: * to only typedef the 20 rounds version, allow for less round via a template argument: * the predefined typedef engines could be named threefry4x64_20 (20 rounds, 32 bit integers result_type) and threefry4x64_20_64 (20 rounds, 64 bit integers result_type). This naming conventions allows all the other variant to be added later if we want to. * have an upper bound of 20 rounds instead of arbitrary rounds because of performance issues I can’t solve. ..this give an engine with the follow properties.. * It is the author’s pick * it is very fast, 13.8 CPU cycles, I don’t expect I can improve the speed of the un-rolled code any further * it has the highest random quality * it has great features for large scale parallel computing What’s your view on limiting the round to <=20 for the template and providing only the 20 round as a typedef? I have addressed most other points you’ve mentioned, but the performance issue of a generic rounds version has failed me.

...

...
AFAIK the state and the return_type need not be the same. Any 32 bit engine can be turned into a 64 engine by patching two returns together.

This can be done outside of threefry by independent_bits_engine.

In Christ, Steven Watanabe

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Steven Watanabe

11:39 p.m.

AMDG On 04/19/2014 04:35 PM, Thijs van den Berg wrote:

...

What’s your view on limiting the round to <=20 for the template

I still don't like it. If all else fails, you can use the optimized version for rounds <= 20, and the slow version for rounds > 20.

...

and providing only the 20 round as a typedef?

I'd favor providing both. There are plenty of inferior algorithms in Boost.Random. I would anticipate that anyone who wants to use an algorithm other than mt19937 would have some idea of the tradeoffs.

...

I have addressed most other points you’ve mentioned, but the performance issue of a generic rounds version has failed me.

In theory it could be optimized. What compiler and optimization settings are you using? In particular, are you using -funroll-loops (GCC)? The version you show unrolls the loop 4x. What if you unroll 8x and kill the constant arrays? What about 20x and eliminate the % 5 in the key addition? Or 40x and eliminate both? In Christ, Steven Watanabe

Thijs van den Berg

20 Apr 20 Apr

8:37 a.m.

On 20 Apr 2014, at 01:39, Steven Watanabe <watanabesj@gmail.com> wrote:

...

AMDG

On 04/19/2014 04:35 PM, Thijs van den Berg wrote:

...
What’s your view on limiting the round to <=20 for the template

I still don't like it. If all else fails, you can use the optimized version for rounds <= 20, and the slow version for rounds > 20. Ok, that’s a good last resort.

...

...
and providing only the 20 round as a typedef?

I'd favor providing both. There are plenty of inferior algorithms in Boost.Random. I would anticipate that anyone who wants to use an algorithm other than mt19937 would have some idea of the tradeoffs.

Ok.

...

...
I have addressed most other points you’ve mentioned, but the performance issue of a generic rounds version has failed me.

In theory it could be optimized. What compiler and optimization settings are you using? In particular, are you using -funroll-loops (GCC)? The version you show unrolls the loop 4x. What if you unroll 8x and kill the constant arrays? What about 20x and eliminate the % 5 in the key addition? Or 40x and eliminate both?

Yes, good ideas. I think I’m going to make a template version that eliminated array indices computations. Right now I was using the default build of the performance tool in random with Clang-503.0.38 on apple dadrwin. I’m now compile and time in manually. One thing I found earlier is that the compiler recognises rotl only with -O3. However, that’s a different issues than the loop/ un-rolled version.

...

In Christ, Steven Watanabe

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Thijs van den Berg

11:17 a.m.

On 20 Apr 2014, at 01:39, Steven Watanabe <watanabesj@gmail.com> wrote:

> AMDG
> 
> On 04/19/2014 04:35 PM, Thijs van den Berg wrote:
>> What’s your view on limiting the round to <=20 for the template
> 
> I still don't like it.  If all else fails, you can use the
> optimized version for rounds <= 20, and the slow version
> for rounds > 20.
> 
>> and providing only the 20 round as a typedef?
>> 
> 
> I'd favor providing both.  There are plenty of inferior algorithms
> in Boost.Random.  I would anticipate that anyone who wants to
> use an algorithm other than mt19937 would have some idea of
> the tradeoffs.
Ok, I now have these  4 typedefs. 
* 13 and 20 rounds
*  a 32 and 64 bit engine

> 
>> I have addressed most other points you’ve mentioned, but the performance issue of a generic rounds version has failed me.
>> 
> 
> In theory it could be optimized.  What compiler and
> optimization settings are you using?  In particular,
> are you using -funroll-loops (GCC)?  The version
> you show unrolls the loop 4x.  What if you
> unroll 8x and kill the constant arrays?  What
> about 20x and eliminate the % 5 in the key addition?
> Or 40x and eliminate both?
> 
I’ve done the 40x version now, and wrapped a loop around it so that you can get arbitrary number of rounds by repeating (partially) this block of 40 rounds. The code is a bit repettitive but the performance is good now.  Here are timings for the 40x unrolled version for 8,13,20 and 99 rounds

// loop version
threefry4x64_08_64:  10.2318 nsec/loop =  19.13350 CPU cycles
threefry4x64_13_64:  14.3048 nsec/loop =  26.75000 CPU cycles
threefry4x64_20_64:  22.6186 nsec/loop =  42.29680 CPU cycles
threefry4x64_99_64: 100.0110 nsec/loop = 187.02200 CPU cycles

// 40x manual unrolled version
threefry4x64_08_64:   3.7386 nsec/loop =   6.99118 CPU cycles
threefry4x64_13_64:   5.1223 nsec/loop =   9.57870 CPU cycles
threefry4x64_20_64:   7.3078 nsec/loop =  13.66560 CPU cycles
threefry4x64_99_64:  29.3599 nsec/loop =  54.90300 CPU cycles

Do you think is is good a solution? 
The code is at https://github.com/sitmo/threefry/blob/master/random/include/boost/random/threefry.hpp 
 
> In Christ,
> Steven Watanabe
> 
> 
> _______________________________________________
> Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Beman Dawes

4:30 p.m.

On Sun, Apr 20, 2014 at 7:17 AM, Thijs van den Berg <thijs@sitmo.com> wrote:

...

... // loop version threefry4x64_08_64: 10.2318 nsec/loop = 19.13350 CPU cycles threefry4x64_13_64: 14.3048 nsec/loop = 26.75000 CPU cycles threefry4x64_20_64: 22.6186 nsec/loop = 42.29680 CPU cycles threefry4x64_99_64: 100.0110 nsec/loop = 187.02200 CPU cycles

// 40x manual unrolled version threefry4x64_08_64: 3.7386 nsec/loop = 6.99118 CPU cycles threefry4x64_13_64: 5.1223 nsec/loop = 9.57870 CPU cycles threefry4x64_20_64: 7.3078 nsec/loop = 13.66560 CPU cycles threefry4x64_99_64: 29.3599 nsec/loop = 54.90300 CPU cycles

You may want to ask others to run the tests with various CPU's and compilers. Making optimization decisions based on a single platform can be quite misleading.

Remember Knuth's famous quote: "*premature optimization is the root of all evil*":-) --Beman

Thijs (M.A.) van den Berg

4:57 p.m.

On Apr 20, 2014, at 6:30 PM, Beman Dawes <bdawes@acm.org> wrote:

...

On Sun, Apr 20, 2014 at 7:17 AM, Thijs van den Berg <thijs@sitmo.com> wrote:

...
... // loop version threefry4x64_08_64: 10.2318 nsec/loop = 19.13350 CPU cycles threefry4x64_13_64: 14.3048 nsec/loop = 26.75000 CPU cycles threefry4x64_20_64: 22.6186 nsec/loop = 42.29680 CPU cycles threefry4x64_99_64: 100.0110 nsec/loop = 187.02200 CPU cycles

// 40x manual unrolled version threefry4x64_08_64: 3.7386 nsec/loop = 6.99118 CPU cycles threefry4x64_13_64: 5.1223 nsec/loop = 9.57870 CPU cycles threefry4x64_20_64: 7.3078 nsec/loop = 13.66560 CPU cycles threefry4x64_99_64: 29.3599 nsec/loop = 54.90300 CPU cycles

You may want to ask others to run the tests with various CPU's and compilers. Making optimization decisions based on a single platform can be quite misleading.

Remember Knuth's famous quote: "*premature optimization is the root of all evil*":-)

Very true. Most of the times the biggest gain is in math or the algorithm. It would be nice if people could help with testing performance. I thought a factor 2.5 drop in performance by making it more generic was too big to ignore, it would make the engine less interesting as a new choice and that would be to bad. My goals is to add value and I use this engine myself for MC simulations. I think it's fine now -there is nothing fancy going on-, I think performance is more predicable now, it depends less on the smartness of compilers. I'm new to this, but it there a pool of compilers testing new versions of boost that could also test performance?

...

--Beman

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Nat Goodspeed

3:23 a.m.

On Apr 19, 2014, at 6:35 PM, Thijs van den Berg <thijs@sitmo.com> wrote:

...

I have addressed most other points you’ve mentioned, but the performance issue of a generic rounds version has failed me.

Have you considered compile-time recursion to emit an unrolled loop with some max rounds?

...

Thijs (M.A.) van den Berg

8:07 a.m.

On Apr 20, 2014, at 5:23 AM, Nat Goodspeed <nat@lindenlab.com> wrote:

...

On Apr 19, 2014, at 6:35 PM, Thijs van den Berg <thijs@sitmo.com> wrote:

...
I have addressed most other points you’ve mentioned, but the performance issue of a generic rounds version has failed me.

Have you considered compile-time recursion to emit an unrolled loop with some max rounds?

Thanks, I was indeed thinking about that. I've tried too little yet, I was too puzzled by the factor 2.5 an I thought I made a mistake somewhere in my code. So far I've used the performance tool in random with default settings, so I'll first do my own benchmark and try various compiler options, perhaps look at assembler, and various compilers. I think recursion is a very good option. It will prevent run time loop counters and the computations that depend on it and make the code less predictable. I'll try that first!

...

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Mathias Gaunard

3:18 p.m.

On 18/04/14 10:00, Thijs van den Berg wrote:

...

...
Do not use reinterpret_cast in operator(). The behavior is undefined, and will certainly differ depending on endianness.

I did this for speed and I didn’t worry about endianness because it never crossed my mind. Changing the order of bits/bytes should not affect the quality of the randomness, but it’s indeed not nice to have different behaviour on different machines. I’ll fix this.

Personally I would want this engine to produce exactly the same results as the reference random123 implementation for a given seed. This should be the case regardless of the platform used.

Beman Dawes

4:23 p.m.

On Sun, Apr 20, 2014 at 11:18 AM, Mathias Gaunard < mathias.gaunard@ens-lyon.org> wrote:

...

On 18/04/14 10:00, Thijs van den Berg wrote:

...
...
Do not use reinterpret_cast in operator(). The behavior is undefined, and will certainly differ depending on endianness.

I did this for speed and I didn’t worry about endianness because it never crossed my mind. Changing the order of bits/bytes should not affect the quality of the randomness, but it’s indeed not nice to have different behaviour on different machines. I’ll fix this.

Personally I would want this engine to produce exactly the same results as the reference random123 implementation for a given seed.

...

This should be the case regardless of the platform used.

+1 --Beman

Thijs (M.A.) van den Berg

4:25 p.m.

On Apr 20, 2014, at 5:18 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:

...

On 18/04/14 10:00, Thijs van den Berg wrote:

...
...
Do not use reinterpret_cast in operator(). The behavior is undefined, and will certainly differ depending on endianness.

I did this for speed and I didn’t worry about endianness because it never crossed my mind. Changing the order of bits/bytes should not affect the quality of the randomness, but it’s indeed not nice to have different behaviour on different machines. I’ll fix this.

Personally I would want this engine to produce exactly the same results as the reference random123 implementation for a given seed.

This should be the case regardless of the platform used.

I totally agree, I want that too. Ive removed the reinterpret_cast, but it would be nice if we could very its behavior, eg on a platform that has different endianness.

...

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

John Salmon

21 Apr 21 Apr

8:28 p.m.

On Fri, Apr 18, 2014 at 4:00 AM, Thijs van den Berg <thijs@sitmo.com> wrote:

...

On 18 Apr 2014, at 06:54, Steven Watanabe <watanabesj@gmail.com> wrote:

...
AMDG

On 04/17/2014 04:19 PM, Thijs van den Berg wrote:

...
I've completed a nice new random engine that I would like to propose for addition to boost/random.

The engine is called "threefry" and it's based on the paper "Parallel random numbers: as easy as 1, 2, 3" by Salmon, John K. and Moraes, Mark A. and Dror, Ron O. and Shaw, David E.

I've been using it myself a lot for the last couple of years, and now I've rewritten it so that it will fit in the boost random framework and adhere to the concepts. It has some really nice properties, in particular for large scale parallel computing: * High quality randomness (according to the BigCrush tests) with a cycle length of 2^258 * Fast (comparable to Mersenne Twister) & little memory usage (13x 64 bit integers) * O(1) discard and guarantees for non-overlapping sequences when the (up to 256 bit) seeds are unequal

I've uploaded the code, description and tests to https://github.com/sitmo/threefry The actual engine is at https://github.com/sitmo/threefry/blob/master/random/include/boost/random/th...

I would really appreciate any feedback about the idea for inclusion and of course comments about the quality of the code,tests,docs

I /really/ dislike the constraint that rounds shall be 13 or 20. It looks like encrypt_counter follows a regular pattern. I'd strongly prefer it to be implemented as a loop and allow rounds to be any non-negative integer. Not all possible values make sense, but that's normal. You have to know what you're doing when you define a specialization of any random number engine. Very good. I’ll have to change things a a bit. The algorithm has indeed reppetitive patterns of multiples of length 4 and 8. I’ll make it an unconstrained integer and a loop.

...
Do not use reinterpret_cast in operator(). The behavior is undefined, and will certainly differ depending on endianness.

I did this for speed and I didn’t worry about endianness because it never crossed my mind. Changing the order of bits/bytes should not affect the quality of the randomness, but it’s indeed not nice to have different behaviour on different machines. I’ll fix this.

...
I'm not sure that I like the fact that the textual representation interleaves the three arrays. Also output is a function of key and counter and does not need to be stored. (The same observation applies to the comparison operators)

Yes, I’ve already changed that.

...
Is it really a good idea to expose set_key/set_counter?

Indeed, it’s not needed, I’ve moved them to private.

...
It would be better to have an explicit width parameter, like mersenne_twister instead of relying on the size of UIntType. Ok, this takes a bit of time but it’s not dramatic. It’s something that combines nicely with the operator() point.

...
I'd also prefer to see uint_least64_t in place of uint64_t. OK.

...
(Actually I'd like it even more if the arrays stored UIntType, but this doesn't look particularly feasible, given the way the algorithm works.) This algorithm is what they call the 4x64 variant. The paper also talks about a 2x64 and 4x32 variants (but e.g. not 8x32, 2x32). However I’m not sure if we need to implement all the variant the’ve studied (maybe yes, maybe no). They recommend 4x64 with 20 rounds as the best choice for CPU’s and that’s the main one I want to provide. AFAIK the state and the return_type need not be the same. Any 32 bit engine can be turned into a 64 engine by patching two returns together.

The 32 bit state type versions have different constants in the repetitive operators. The authors also studies variants with different cryptographic functions. E.g. one uses EAS (but that has a different name than threefry) which can be faster on CPU's supports EAS instructions. It however calls for assembler. There is also a 3rd variant that targets CUDA machines. In a wider setting all these variant use a counter and a cryptographic function (which probably all use a key?). I haven’t studies the other variants.

Perhaps we can later refactor -without changing the interface- the code when we decide to add more variants? E.g. can we stick to the 4x64 threefry algorithm function, have two versions that return either 32 or 64 bit integers specified by a width argument. The two other variant 2x64 and 4x32 could perhaps be added later, those are however not recommended by the paper. I would call the 4x64 part of the algorithm, not NxM with template arguments that can be tweaked by smart users.

Greetings, I'm an author of the original Threefry implementation and the paper that describes it First, let me say that I'm extremely happy to see our work being considered for boost, and I thank Thijs for initiating the discussion. But it seems to me that the code under discussion is too specifically related to Threefry itself. Threefry is one of a potentially large class of random number generators that derive their power from a stateless random function rather than a stateful random generator. We called these "Counter Based Random Number Generators" in our SC11 paper, but perhaps "stateless" would have been a better name. We described three new classes of generators: Threefry, Philox and ARS, and noted that others have been discussed in the past, including generators based on DES, MD5, SHA-1 and TEA. We hope that more will be discovered/described in the future. It is the *statelessness* of the underlying function, which is common to all of them, that makes them so well-suited to parallelization. Statelessness allows them all to trivially 'discard' values and to create practically unlimited numbers of non-overlapping streams. It allows applications to easily associate statistically independent random sequences with individual logical computational objects (atoms, grid cells, ticker symbols, data points, timesteps, etc.), independent of physical location (thread id, node number, etc). I would love to see a boost implementation that captured and exposed this capability for general use. Thus, I think a better approach is as follows: - define a new "RandomFunction" concept that dscribes/enforces the essential, common features of the randomizing function in Threefry, Philox, ARS and other counter-based random generators. - implement a Threefry (and perhaps other) function templates that model RandomFunction. - implement an adapter that produces a bona fide "Random Number Engine" (c.f. C++11 standard section 26.5.1.4 [rand.req.eng]) from any RandomFunction. This is where we bridge the gap between the new RandomFunction and the conventional "generators" whose interface is already well-defined. This adapter lets us use a new RandomFunction, e.g., Threefry, with code that expects to talk to an "Engine". It would be similar to the r123::Engine class documented at http://www.thesalmons.org/john/random123/releases/1.08/docs/structr123_1_1En... but completely rewritten to avoid the C99/C++03/CUDA/OpenCL baggage that clutters/obscures that code. - implement an adapter that produces a bona fide "Uniform Random Number Generator" (c.f. 26.5.1.3 [rand.req.urng]). Similarly inspired by the MicroURNG adapter documented at http://www.thesalmons.org/john/random123/releases/1.08/docs/classr123_1_1Mic... but again, completely rewritten. This feels like the right level of generality to me. I'd love to discuss this further and reach consensus - especially the details of the RandomFunction concept. If there seems to be interest, I'd be happy to do some coding as well. In any case, let me reiterate my appreciation for Thijs' work. Cheers, John Salmon

Thijs van den Berg

9:38 p.m.

On 21 Apr 2014, at 22:28, John Salmon <john@thesalmons.org> wrote:

...

On Fri, Apr 18, 2014 at 4:00 AM, Thijs van den Berg <thijs@sitmo.com> wrote:

...
On 18 Apr 2014, at 06:54, Steven Watanabe <watanabesj@gmail.com> wrote:

...
AMDG

On 04/17/2014 04:19 PM, Thijs van den Berg wrote:

...
I've completed a nice new random engine that I would like to propose for addition to boost/random.

The engine is called "threefry" and it's based on the paper "Parallel random numbers: as easy as 1, 2, 3" by Salmon, John K. and Moraes, Mark A. and Dror, Ron O. and Shaw, David E.

I've been using it myself a lot for the last couple of years, and now I've rewritten it so that it will fit in the boost random framework and adhere to the concepts. It has some really nice properties, in particular for large scale parallel computing: * High quality randomness (according to the BigCrush tests) with a cycle length of 2^258 * Fast (comparable to Mersenne Twister) & little memory usage (13x 64 bit integers) * O(1) discard and guarantees for non-overlapping sequences when the (up to 256 bit) seeds are unequal

I've uploaded the code, description and tests to https://github.com/sitmo/threefry The actual engine is at https://github.com/sitmo/threefry/blob/master/random/include/boost/random/th...

I would really appreciate any feedback about the idea for inclusion and of course comments about the quality of the code,tests,docs

I /really/ dislike the constraint that rounds shall be 13 or 20. It looks like encrypt_counter follows a regular pattern. I'd strongly prefer it to be implemented as a loop and allow rounds to be any non-negative integer. Not all possible values make sense, but that's normal. You have to know what you're doing when you define a specialization of any random number engine. Very good. I’ll have to change things a a bit. The algorithm has indeed reppetitive patterns of multiples of length 4 and 8. I’ll make it an unconstrained integer and a loop.

...
Do not use reinterpret_cast in operator(). The behavior is undefined, and will certainly differ depending on endianness.

I did this for speed and I didn’t worry about endianness because it never crossed my mind. Changing the order of bits/bytes should not affect the quality of the randomness, but it’s indeed not nice to have different behaviour on different machines. I’ll fix this.

...
I'm not sure that I like the fact that the textual representation interleaves the three arrays. Also output is a function of key and counter and does not need to be stored. (The same observation applies to the comparison operators)

Yes, I’ve already changed that.

...
Is it really a good idea to expose set_key/set_counter?

Indeed, it’s not needed, I’ve moved them to private.

...
It would be better to have an explicit width parameter, like mersenne_twister instead of relying on the size of UIntType. Ok, this takes a bit of time but it’s not dramatic. It’s something that combines nicely with the operator() point.

...
I'd also prefer to see uint_least64_t in place of uint64_t. OK.

...
(Actually I'd like it even more if the arrays stored UIntType, but this doesn't look particularly feasible, given the way the algorithm works.) This algorithm is what they call the 4x64 variant. The paper also talks about a 2x64 and 4x32 variants (but e.g. not 8x32, 2x32). However I’m not sure if we need to implement all the variant the’ve studied (maybe yes, maybe no). They recommend 4x64 with 20 rounds as the best choice for CPU’s and that’s the main one I want to provide. AFAIK the state and the return_type need not be the same. Any 32 bit engine can be turned into a 64 engine by patching two returns together.

The 32 bit state type versions have different constants in the repetitive operators. The authors also studies variants with different cryptographic functions. E.g. one uses EAS (but that has a different name than threefry) which can be faster on CPU's supports EAS instructions. It however calls for assembler. There is also a 3rd variant that targets CUDA machines. In a wider setting all these variant use a counter and a cryptographic function (which probably all use a key?). I haven’t studies the other variants.

Perhaps we can later refactor -without changing the interface- the code when we decide to add more variants? E.g. can we stick to the 4x64 threefry algorithm function, have two versions that return either 32 or 64 bit integers specified by a width argument. The two other variant 2x64 and 4x32 could perhaps be added later, those are however not recommended by the paper. I would call the 4x64 part of the algorithm, not NxM with template arguments that can be tweaked by smart users.

Greetings,

I'm an author of the original Threefry implementation and the paper that describes it

First, let me say that I'm extremely happy to see our work being considered for boost, and I thank Thijs for initiating the discussion.

But it seems to me that the code under discussion is too specifically related to Threefry itself. Threefry is one of a potentially large class of random number generators that derive their power from a stateless random function rather than a stateful random generator. We called these "Counter Based Random Number Generators" in our SC11 paper, but perhaps "stateless" would have been a better name. We described three new classes of generators: Threefry, Philox and ARS, and noted that others have been discussed in the past, including generators based on DES, MD5, SHA-1 and TEA. We hope that more will be discovered/described in the future.

It is the *statelessness* of the underlying function, which is common to all of them, that makes them so well-suited to parallelization. Statelessness allows them all to trivially 'discard' values and to create practically unlimited numbers of non-overlapping streams. It allows applications to easily associate statistically independent random sequences with individual logical computational objects (atoms, grid cells, ticker symbols, data points, timesteps, etc.), independent of physical location (thread id, node number, etc).

I would love to see a boost implementation that captured and exposed this capability for general use. Thus, I think a better approach is as follows:

- define a new "RandomFunction" concept that dscribes/enforces the essential, common features of the randomizing function in Threefry, Philox, ARS and other counter-based random generators.

- implement a Threefry (and perhaps other) function templates that model RandomFunction.

- implement an adapter that produces a bona fide "Random Number Engine" (c.f. C++11 standard section 26.5.1.4 [rand.req.eng]) from any RandomFunction. This is where we bridge the gap between the new RandomFunction and the conventional "generators" whose interface is already well-defined. This adapter lets us use a new RandomFunction, e.g., Threefry, with code that expects to talk to an "Engine".

It would be similar to the r123::Engine class documented at http://www.thesalmons.org/john/random123/releases/1.08/docs/structr123_1_1En... but completely rewritten to avoid the C99/C++03/CUDA/OpenCL baggage that clutters/obscures that code.

- implement an adapter that produces a bona fide "Uniform Random Number Generator" (c.f. 26.5.1.3 [rand.req.urng]). Similarly inspired by the MicroURNG adapter documented at http://www.thesalmons.org/john/random123/releases/1.08/docs/classr123_1_1Mic... but again, completely rewritten.

This feels like the right level of generality to me. I'd love to discuss this further and reach consensus - especially the details of the RandomFunction concept. If there seems to be interest, I'd be happy to do some coding as well. In any case, let me reiterate my appreciation for Thijs' work.

Hi John, Good to hear form you! In my view we could indeed decompose the design into something like * a “counter”, tied to the random engine discard() and have O(1) * function specific constants and variables (like the key in threefry, or perhaps the rotation constants?) and tie some of those to the seed(). Other options are to tie the seed to the counter, but that would result in something you can already achieve with the discard. and merge the two together into a boost engine (which is a superset of the c++11 engine, it has an additional generate() member function) On the other hand, -from a user perspective-, it shouldn’t matter much *how* the engines are constructed as long as they provide a random engine interface. My plan was to have a narrow scope and get a good engine in production within a short timeframe. I know that little design details can cost *a lot* of time and those can sometimes cause projects to retire before they even get completed, ..that’s my biggest worry (I’ve seen quite a few ambitious things in the boost sandbox), I remember you talking about this a long time ago but it didn’t materialise. I’ve used the threefry a lot myself -wrote my own version-, and I like it a lot -I think it’s a brilliant idea in many ways-, and so my goal was to have a narrow scope, deliver, and make sure that people could start using it. The perfect way forward for me would be to come up with a generic “crypto counter” template interface like you suggest of which the current threefry would be a specialisation (the function, state etc).That way we could finalise the current threefry, and absorb it in a bigger project later. *but* in case the bigger projects ends in the sandbox, we would still has this initial threefry. ..so I *really* like the idea -I’ll join the project if you like-, but I’m worried about the risks of widening the scope. Cheers, Thijs

...

Cheers, John Salmon

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Steven Watanabe

11:51 p.m.

AMDG On 04/21/2014 02:28 PM, John Salmon wrote:

...

<snip> I would love to see a boost implementation that captured and exposed this capability for general use. Thus, I think a better approach is as follows:

- define a new "RandomFunction" concept that dscribes/enforces the essential, common features of the randomizing function in Threefry, Philox, ARS and other counter-based random generators.

I don't think this is a particularly useful abstraction. Most users should not care at all about this, since they should be using the predefined engine typedefs. If more similar generators are added later, then the implementation can be adjusted to factor out the common boilerplate, but I really don't see the point of creating a whole new concept for this. In Christ, Steven Watanabe

Thijs (M.A.) van den Berg

22 Apr 22 Apr

7:33 a.m.

On Apr 22, 2014, at 1:51 AM, Steven Watanabe <watanabesj@gmail.com> wrote:

...

AMDG

On 04/21/2014 02:28 PM, John Salmon wrote:

...
<snip> I would love to see a boost implementation that captured and exposed this capability for general use. Thus, I think a better approach is as follows:

- define a new "RandomFunction" concept that dscribes/enforces the essential, common features of the randomizing function in Threefry, Philox, ARS and other counter-based random generators.

I don't think this is a particularly useful abstraction. Most users should not care at all about this, since they should be using the predefined engine typedefs. If more similar generators are added later, then the implementation can be adjusted to factor out the common boilerplate, but I really don't see the point of creating a whole new concept for this.

I'm still working a bit on the boilerplate and at some point it should be correct, then John and/or I could use it as a template for other similar engines and factor out the common elements. Additional engines should be easier because a lot of time is spend on the boilerplate and that will indeed be very similar across these engines. Since John knows all about the algorithmic parts and has implementations it should be easier to add those too. How about the following steps: * first finalize the current threefry submission and make sure that all aspects are boost compliant. That would finalize my initial mission. * check the scalability of the naming convention of the predefined typedefs. * then work on additional engines, possibly factoring our common boilerplate. For this I think John should take the lead?

...

In Christ, Steven Watanabe

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

John Salmon

2:51 p.m.

On Mon, Apr 21, 2014 at 7:51 PM, Steven Watanabe <watanabesj@gmail.com>wrote:

...

AMDG

On 04/21/2014 02:28 PM, John Salmon wrote:

...
<snip> I would love to see a boost implementation that captured and exposed this capability for general use. Thus, I think a better approach is as follows:

- define a new "RandomFunction" concept that dscribes/enforces the essential, common features of the randomizing function in Threefry, Philox, ARS and other counter-based random generators.

I don't think this is a particularly useful abstraction. Most users should not care at all about this, since they should be using the predefined engine typedefs. If more similar generators are added later, then the implementation can be adjusted to factor out the common boilerplate, but I really don't see the point of creating a whole new concept for this.

...

I don't think this is a particularly useful abstraction. Most users should not care at all about this, since they should be using the predefined engine typedefs. If more similar generators are added later, then the implementation can be adjusted to factor out the common boilerplate, but I really don't see the point of creating a whole new concept for this.

It may well be true that most users should use a predefined generator because the Engine concept is well matched to the style of random number generator that has been in use for 65 years. But the question is not whether a RandomFunction abstraction should or would be used by *most* users. The question is whether it's useful to *enough* users. Allow me to try to motivate the RandomFunction abstraction. The traditional Engine is inherently stateful and sequential. Users have struggled with using random number generators in parallel applications for decades (long before Boost.Random). With sequential generators, a sufficiently motivated and careful user can use the discard() function to initialize multiple generators in parallel. For most generators, discard(N) is O(N), defeating its use for this purpose, but for a few (including Threefry), discard(N) is O(1) making this a practical, but hardly elegant or natural approach to parallelism. Over the years, the idea of using a stateless function has come up from time to time. The papers and articles that discuss them usually conclude with a sentiment along the lines of: "This would be great for parallel random number generation. It's a shame that it's so slow". They have always been considered curiousities rather than serious candidates for practical use. Our contribution was to describe some RandomFunctions that are statistically sound and extremely fast. I hope our paper is not the last word on the subject and that more high-quality, high-speed RandomFunctions follow. The usage model for RandomFunction is fundamentally different from a sequential generator. It really helps to put the idea of an Engine or a generator out of your mind. Imagine instead that all you have a pure function that takes 128-bit input and returns 128-bit output, and that the output is "random" as long as you follow the simple rule of never re-using an input value. How would you create random numbers for a parallel Monte Carlo simulation? Parallelism argues against "global state", so you're disinclined to encapsulate a single global counter into an Engine-like object that calls the function. Instead, you think about how and where you actually use random values. It's likely that you need to associate independent random sequences with the time-varying logical elements of your program, stock prices or atoms or finite elements or whatever. You could associate a 128-bit counter with each logical element that needs a stream of randoms, and increment the counter every time you need a new random. But that entails the overhead of storing and synchronizing counters. Counter state could easily dominate the program's memory footprint and frequent updates could thrash the cache. A better approach is to devise a trivial, application-specific mapping of time (iteration number) plus some stable aspect your logical elements (e.g., atom id) into unique 128-bit inputs. You can now obtain the random values you need, wherever and whenever you need them simply by generating a unique 128-bit input and calling the RandomFunction. There is no space overhead. If your parallel algorithm dictates that you need the same random value, e.g., to perform an "update" in two different threads or nodes, you just call the RandomFunction with the appropriate arguments wherever you need it and you get the results you need, where you need them. In parallel applications this is much more natural than trying to divvy up a single stream into multiple substreams by 'discarding' or 'jumping' an inherently sequential state. It is cheaper and easier than associating a stateful generator with each of your logical objects and worrying about the storage and bandwidth to keep them in-sync across the threads/nodes of a parallel computation. I believe that if Lehmer, who in 1949 invented the linear congruential generator, had instead glued together a few shifts and adds and muls and invented something like Threefry or Philox, then this is how we'd all think about random number generation today. The 'Engine' concept is an artifact of the sequential nature of the linear congruential generator, and its successors. I'm not suggesting that the Engine concept should go away - 65 years of history carries a lot of inertia. Plenty of people will and should continue to use it. But I don't think Engines are the only way or necessarily the best way to think about generating random values. I think there are many users who would use and benefit from an exposed RandomFunction concept. At this point, if I haven't swayed you, then I fear that my powers of persuasion aren't up to the task. Arguing in the abstract is difficult. Specifics might help. Maybe someone else who has read this far and thinks that they would benefit from RandomFunctions can offer some perspective. Or maybe I'm just wrong. If a RandomFunction concept isn't going to make it into boost, then I'd still be pleased to see Thijs' Threefry implementation move ahead. Cheers, John Salmon

...

In Christ, Steven Watanabe

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Steven Watanabe

4:25 p.m.

AMDG On 04/22/2014 08:51 AM, John Salmon wrote:

...

<snip> With sequential generators, a sufficiently motivated and careful user can use the discard() function to initialize multiple generators in parallel. For most generators, discard(N) is O(N), defeating its use for this purpose, but for a few (including Threefry), discard(N) is O(1) making this a practical, but hardly elegant or natural approach to parallelism.

FWIW, discard isn't inherently O(N) for most generators. Faster algorithms exist, but are not widely implemented.

...

<snip> A better approach is to devise a trivial, application-specific mapping of time (iteration number) plus some stable aspect your logical elements (e.g., atom id) into unique 128-bit inputs. You can now obtain the random values you need, wherever and whenever you need them simply by generating a unique 128-bit input and calling the RandomFunction. There is no space overhead. If your parallel algorithm dictates that you need the same random value, e.g., to perform an "update" in two different threads or nodes, you just call the RandomFunction with the appropriate arguments wherever you need it and you get the results you need, where you need them. In parallel applications this is much more natural than trying to divvy up a single stream into multiple substreams by 'discarding' or 'jumping' an inherently sequential state. It is cheaper and easier than associating a stateful generator with each of your logical objects and worrying about the storage and bandwidth to keep them in-sync across the threads/nodes of a parallel computation.

The problem I have with this, basically comes down to encapsulation. - distributions consume an unspecified number of random values. I have no idea how to make the distributions work with a RandomFunction, besides requiring them to use a 1-to-1 mapping. Even with 64-bit values, this limits the precision of the random variates. - Taking this one step farther, how can I write a generic function which uses an arbitrary RandomFunction to generate a sequence of k random variates? There's no way to know a priori how many unique inputs I would need. Even if k is constant, it still depends on the block size of the RandomFunction. - A RandomFunction is strictly a PRNG interface. It doesn't work very well for expressing other sources of random values, such as /dev/urandom, specialized devices, and www.random.org. I think that we can get most of the benefits of a RandomFunction without going away from the Engine concept, if we add a couple of extra guarantees for engines intended for parallel use: - seeding a new engine is fast - distinct seeds (within some limits) produce uncorrelated sequences. Both of these properties are trivial for Engines based on RandomFunctions. In Christ, Steven Watanabe

Thijs van den Berg

23 Apr 23 Apr

12:01 p.m.

FYI, -as a design experiment- I’m refactoring the threefry engine to see if John’s idea is better. Even if the RandomFunction concept never makes it, then it might still result in better reusable components that make it easier to develop additional engines based on these random functions. As a desing-test case I’m also testing it with SHA1 function I found in boost/uuid/detail. I’ll upload example code to github as soon as I have something that I want to suggest.

Peter Dimov

3:06 p.m.

Thijs van den Berg wrote:

...

FYI, -as a design experiment- I’m refactoring the threefry engine to see if John’s idea is better. Even if the RandomFunction concept never makes it, then it might still result in better reusable components that make it easier to develop additional engines based on these random functions.

A RandomFunction is basically a high-grade hash function, right? If so, it might be worth investigating whether to switch Boost.Hash to use the threefry random function. There are people who find it inconvenient that hash_value( 5 ) is 5 and would rather prefer a "random" output.

John Salmon

29 Apr 29 Apr

10:12 p.m.

On Wed, Apr 23, 2014 at 11:06 AM, Peter Dimov <lists@pdimov.com> wrote: A RandomFunction is basically a high-grade hash function, right? If so, it

...

might be worth investigating whether to switch Boost.Hash to use the threefry random function. There are people who find it inconvenient that hash_value( 5 ) is 5 and would rather prefer a "random" output.

Yes, the pseudo-random functions are high-grade hash functions if you use the "full" number of rounds, and they're "low-grade" hash funcions if you use fewer rounds. They actually start to "look" random after only a couple of rounds, which might be enough for the purposes of indexing into a hash table. I put a tiny example at https://github.com/DEShawResearch/Random123-Boost/blob/master/libs/random/ex... John Salmon

John Salmon

23 Apr 23 Apr

2:59 p.m.

On Tue, Apr 22, 2014 at 12:25 PM, Steven Watanabe <watanabesj@gmail.com>wrote:

...

AMDG

On 04/22/2014 08:51 AM, John Salmon wrote:

...
<snip> A better approach is to devise a trivial, application-specific mapping of time (iteration number) plus some stable aspect your logical elements (e.g., atom id) into unique 128-bit inputs. You can now obtain the random values you need, wherever and whenever you need them simply by generating a unique 128-bit input and calling the RandomFunction. There is no space overhead. If your parallel algorithm dictates that you need the same random value, e.g., to perform an "update" in two different threads or nodes, you just call the RandomFunction with the appropriate arguments wherever you need it and you get the results you need, where you need them. In parallel applications this is much more natural than trying to divvy up a single stream into multiple substreams by 'discarding' or 'jumping' an inherently sequential state. It is cheaper and easier than associating a stateful generator with each of your logical objects and worrying about the storage and bandwidth to keep them in-sync across the threads/nodes of a parallel computation.

The problem I have with this, basically comes down to encapsulation.

Yes. I'm arguing against encapsulation, and you're right to be skeptical. Encapsulation is usually (but not always) a good thing. I think Thijs put his finger on it. He said:

...

...
Most of the benefits you see revolve around not owning memory.

Exactly. RandomFunction gives you randomness without the overhead of carrying around, storing and synchronizing encapsulated state. If de-encapsulation just meant that the state had to live in memory somewhere else, then it would be counter-productive. But there are many applications, e.g., Monte Carlo simulations, where the de-encapsulated state doesn't have to "live" anywhere, It can be reconstructed, trivially, on-the-fly, from pre-existing application data, whenever it's needed.

...

- distributions consume an unspecified number of random values. I have no idea how to make the distributions work with a RandomFunction, besides requiring them to use a 1-to-1 mapping. Even with 64-bit values, this limits the precision of the random variates. - Taking this one step farther, how can I write a generic function which uses an arbitrary RandomFunction to generate a sequence of k random variates? There's no way to know a priori how many unique inputs I would need. Even if k is constant, it still depends on the block size of the RandomFunction.

Yes. Almost nobody wants or needs raw Engine output. Almost everybody wants and needs the values returned by Distributions. RandomFunctions absolutely must "work" with Distributions, and retooling the implementation of Distributions is completely out of the question. So how do I propose making the RandomFunction work with distributions? With an adapter class, let's call it counter_based_generator, that models a bona fide Boost.UniformRandomNumberGenerator (so it can be used as an argument to a Boost.Distribution), and that can be constructed from an instance of a RandomFunction and an instance of the RandomFunction's counter_type. It might be used something like this: template <typename RandomFunction> void thermalize(AtomCollection& atoms, RandomFunction& randfunc){ normal_distribution nd; for( atom& a : atoms ){ RandomFunction::counter_type c = {timestep, a.atomid}; counter_based_generator u(randfunc, c); nd.reset(); a.vx = sqrt(kB * T/a.mass) * nd(urng); a.vy = sqrt(kB * T/a.mass) * nd(urng); a.vz = sqrt(kB * T/a.mass) * nd(urng); } } The counter_based_generator does have internal state (copies of c and randfunc), but it is transient, so in practice the state lives in registers and never touches memory. In contrast to the advice given at http://www.boost.org/doc/libs/1_55_0/doc/html/boost_random/reference.html#bo... counter_based_generators are meant to be created and destroyed frequently.

...

- A RandomFunction is strictly a PRNG interface. It doesn't work very well for expressing other sources of random values, such as /dev/urandom, specialized devices, and www.random.org.

Agreed. But so what? RandomFunction doesn't have to model /dev/urandom to be useful. The question is whether it's useful in other contexts.

...

I think that we can get most of the benefits of a RandomFunction without going away from the Engine concept, if we add a couple of extra guarantees for engines intended for parallel use: - seeding a new engine is fast - distinct seeds (within some limits) produce uncorrelated sequences. Both of these properties are trivial for Engines based on RandomFunctions.

I think we're getting closer here. If we simply document a few details about what we expect from a RandomFunction, then we can provide the guarantees you mention (fast seeding and initialization, uncorrelated sequences) with a single, generic adapter class along the lines of the counter_based_generator sketched above. That's basically what I had in mind when I proposed the RandomFunction "Concept". It's purpose is to document the requirements of the counter_based_generator adapter so that it's relatively easy to create new RandomFunctions in the future, with the expectation that they'll interoperate nicely with the rest of Boost.Random. John

...

In Christ, Steven Watanabe

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Thijs (M.A.) van den Berg

22 Apr 22 Apr

4:40 p.m.

...

...
On 04/21/2014 02:28 PM, John Salmon wrote:

...

...

A better approach is to devise a trivial, application-specific mapping of time (iteration number) plus some stable aspect your logical elements (e.g., atom id) into unique 128-bit inputs. You can now obtain the random values you need, wherever and whenever you need them simply by generating a unique 128-bit input and calling the RandomFunction. There is no space overhead.

I think this above is one of the core argument you make. This would imply the need 1) a function/functor that initializes the random output buffer with (e.g.) 128bit of input -possibly assembled from bits of memory and signals here and there-. For C++ random engines this could be a counter functor, but it could also be a unique id of a machine/thread/core with a timestamp. 2) the "randomFunction" with a possible state (like the algorithmic specific encryption key in threefry) that scrambles the output-buffer in-place. Most of the benefits you see revolve around not owning memory. I can easily factor out the scrambling function, but I think they would need to be functors holding function specific state for them to be a concept that can be combined with other concepts. For me having a generic interface has more value than having detailed control. So the big challenge would be to define good concepts. I'm still hoping that designing the new concepts (which are imo difficult to get accepted and I don't have time and passion for that) gets decoupled from delivering a new random engine(s) (clear and manageable deliverable).

John Bytheway

2 May 2 May

2:58 a.m.

On 2014-04-22 10:51, John Salmon wrote:

...

On Mon, Apr 21, 2014 at 7:51 PM, Steven Watanabe <watanabesj@gmail.com>wrote:

...
AMDG

On 04/21/2014 02:28 PM, John Salmon wrote:

...
<snip> I would love to see a boost implementation that captured and exposed this capability for general use. Thus, I think a better approach is as follows:

- define a new "RandomFunction" concept that dscribes/enforces the essential, common features of the randomizing function in Threefry, Philox, ARS and other counter-based random generators.

I don't think this is a particularly useful abstraction. Most users should not care at all about this, since they should be using the predefined engine typedefs. If more similar generators are added later, then the implementation can be adjusted to factor out the common boilerplate, but I really don't see the point of creating a whole new concept for this.

...
I don't think this is a particularly useful abstraction. Most users should not care at all about this, since they should be using the predefined engine typedefs. If more similar generators are added later, then the implementation can be adjusted to factor out the common boilerplate, but I really don't see the point of creating a whole new concept for this.

It may well be true that most users should use a predefined generator because the Engine concept is well matched to the style of random number generator that has been in use for 65 years.

But the question is not whether a RandomFunction abstraction should or would be used by *most* users. The question is whether it's useful to *enough* users.

In case it helps motivate this abstraction, I will mention that I have wanted it in the past. I wanted to generate unpredictable Perlin noise in a game (predictability would in this case allow clients to cheat). For that I wanted exactly a RandomFunction like this which would simply map each integer to an independent random value. This is essentially the same interface Mathias was asking for elsewhere in the thread. What I actually ended up using was a linear congruential generator with O(log(n)) stepping, which didn't satisfactorily achieve any of my design goals. A good random function in Boost.Random would have been perfect. John

Mathias Gaunard

27 Apr 27 Apr

11:39 a.m.

On 22/04/14 01:51, Steven Watanabe wrote:

...

AMDG

On 04/21/2014 02:28 PM, John Salmon wrote:

...
<snip> I would love to see a boost implementation that captured and exposed this capability for general use. Thus, I think a better approach is as follows:

- define a new "RandomFunction" concept that dscribes/enforces the essential, common features of the randomizing function in Threefry, Philox, ARS and other counter-based random generators.

I don't think this is a particularly useful abstraction. Most users should not care at all about this, since they should be using the predefined engine typedefs. If more similar generators are added later, then the implementation can be adjusted to factor out the common boilerplate, but I really don't see the point of creating a whole new concept for this.

People usually want to do something like this #pragma omp for for(int i=0; i<n; ++i) { out[i] = in[i] * random_value(); } If the random_value generator is shared, not only do you get poor performance due to synchronization, but you also get non-deterministic results. If you use a per-thread generator, this is a bit more annoying to set up, manage and reproduce as the results will still depend on which threads get to do what and how each of them are seeded. With a single, stateless generator, everything is much simpler. One seed, one generator, completely reproducible results regardless of scheduling. The interface i want to have is this: #pragma omp for for(int i=0; i<n; ++i) { out[i] = in[i] * random_value(i); }

Thijs (M.A.) van den Berg

12:58 p.m.

On Apr 27, 2014, at 1:39 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:

...

On 22/04/14 01:51, Steven Watanabe wrote:

...
AMDG

On 04/21/2014 02:28 PM, John Salmon wrote:

...
<snip> I would love to see a boost implementation that captured and exposed this capability for general use. Thus, I think a better approach is as follows:

- define a new "RandomFunction" concept that dscribes/enforces the essential, common features of the randomizing function in Threefry, Philox, ARS and other counter-based random generators.

I don't think this is a particularly useful abstraction. Most users should not care at all about this, since they should be using the predefined engine typedefs. If more similar generators are added later, then the implementation can be adjusted to factor out the common boilerplate, but I really don't see the point of creating a whole new concept for this.

People usually want to do something like this

#pragma omp for for(int i=0; i<n; ++i) { out[i] = in[i] * random_value(); }

If the random_value generator is shared, not only do you get poor performance due to synchronization, but you also get non-deterministic results. Yes. That was the problem with the old random generators, but nowadays engines have local state.

If you use a per-thread generator, this is a bit more annoying to set up, manage and reproduce as the results will still depend on which threads get to do what and how each of them are seeded.

With a single, stateless generator, everything is much simpler. One seed, one generator, completely reproducible results regardless of scheduling. The interface i want to have is this:

#pragma omp for for(int i=0; i<n; ++i) { out[i] = in[i] * random_value(i); } I like this interface. I'm experimenting now with something similar.

You would still need to split the range [0,n) into non-overlapping subsets per thread. You will need some unique id per thread -either a thread id, or some static counter that threads use when they are initialized- to do that. That unique id can also be used to seed regular random engines, so that is to me not a big addition. The main gains I see are: 1) that you can be more efficient with memory 2) that you have better guarantees wrt independent sequences. If you seed a regular engine with 1 and another one with 2 you won't now if the two generated random sequences will have overlap or not. With a counter you can guarantee that, you have control over the things you put into the random function. I'm currently experimenting with "random functions" and a "generic counter engine". It patches together a seed+counter (the seed goes to the high bits, counter to the low bits) and it hands out random numbers from the output the random function fills (which can be 64>bits, eg 256 for threefry, 160 for SHA1). There is also overlap between (the interface or concept of) random functions and std::hash, a generic counter engine should perhaps work with the union of both? We could also use an adapter that wraps a hash interface on a random function which would allow us to construct new hash functions.

Thijs (M.A.) van den Berg

1:16 p.m.

...

it hands out random numbers from the output the random function fills (which can be 64>bits, eg 256 for threefry, 160 for SHA1).

This is the inverse of the independent bits engine, it down-samples the resolution of the underlying engine, eg it calls the underlying 256 bit engine once every 4 rounds to generate 64 bit random numbers. Maybe we can isolate this functionality into a separate engine adaptor? Right now I have this implemented as an array view on an array with elements of different byte size.

Mathias Gaunard

2:24 p.m.

On 27/04/14 14:58, Thijs (M.A.) van den Berg wrote:

...

You would still need to split the range [0,n) into non-overlapping subsets per thread.

That's what the parallelization of the algorithm does. (In the case in the previous mail, it's done automatically by OpenMP)

...

You will need some unique id per thread -either a thread id, or some static counter that threads use when they are initialized- to do that. That unique id can also be used to seed regular random engines, so that is to me not a big addition.

That would defeat the point IMO. I want to be able to write parallel code involving random number generation where the result does not depend at all of how the code is parallelized. Counter-based random number generators can compute the i-th random number in O(1) instead of the classical O(n). That means that for a given generator with a given seed, I can compute the i-th random number and the j-th random number in parallel. That's what I want to do. I don't want per-thread seeds.

John Salmon

29 Apr 29 Apr

4 p.m.

On Sun, Apr 27, 2014 at 10:24 AM, Mathias Gaunard < mathias.gaunard@ens-lyon.org> wrote:

...

On 27/04/14 14:58, Thijs (M.A.) van den Berg wrote:

You would still need to split the range [0,n) into non-overlapping

...
subsets per thread.

That's what the parallelization of the algorithm does. (In the case in the previous mail, it's done automatically by OpenMP)

You will need some unique id per thread -either a thread id, or some

...
static counter that threads use when they are initialized- to do that. That unique id can also be used to seed regular random engines, so that is to me not a big addition.

That would defeat the point IMO.

I want to be able to write parallel code involving random number generation where the result does not depend at all of how the code is parallelized.

Counter-based random number generators can compute the i-th random number in O(1) instead of the classical O(n). That means that for a given generator with a given seed, I can compute the i-th random number and the j-th random number in parallel. That's what I want to do. I don't want per-thread seeds.

I've dusted off the code I wrote a couple of years ago and republished it, now on github:

https://github.com/DEShawResearch/Random123-Boost I've tried to respond to the discussion in this thread. In particular, the README, and the example now emphasize CBRNG's small footprint and fast initialization and use with Distributions. The primary example demonstrates how to use Distributions in a thread-agnostic way. I also removed the functions related to AES because I felt they added a lot of clutter and introduced some technical issues (endian questions, hardware-specific instructions and intrinsics) that would be distractions at this point. We can certainly revisit them in the future, if and when we agree on a framework. Here's the new README: <README.md> Random123-Boost =============== Proposed Random123 functions for Boost.Random The purpose of this source tree is to develop a new family of "Counter Based Uniform Random Number Generators" (CBURNGs) for the Boost.Random library. CBURNGs were introduced in the paper, "Parallel Random Numbers -- As Easy as 1, 2, 3", by Salmon, Moraes, Dror & Shaw, which won the Best Paper award at the SC'11 conference: http://dl.acm.org/citation.cfm?doid=2063405 also available at http://www.thesalmons.org/john/random123/papers/random123sc11.pdf Conventional generators (such as those in Boost.Random or the C++11 <random> library) are large and difficult to initialize. Boost's documentation explicitly advises against frequent initialization, so common practice is to serialize access through a single global generator, or, in a parallel program, to instantiate one generator per-thread. Unlike conventional generators CBURNGs require very little storage and are designed to be created and destroyed frequently. If created and destroyed in an "inner loop", a good compiler can often keep their state entirely in registers and can generate random values with a few dozen inlined instructions. In addition, they have extremely fast, constant-time implementations of 'discard()', which allows callers to easily "leapfrog" through a logical stream of random numbers. These features make them ideal for parallel computation. Consider the following program fragment, using a conventional RandomNumberEngine, mt19937: using namespace boost::random; mt_19937 rng(seed); // seed may come from the command line normal_distribution nd; for(size_t i=0; i<atoms.size(); ++i){ float boltzmannfactor = sqrt(kT/atoms[i].mass); atoms[i].vx = boltzmannfactor*nd(rng); atoms[i].vy = boltzmannfactor*nd(rng); atoms[i].vz = boltzmannfactor*nd(rng); } Now imagine parallelizing this loop over a number of threads or cores. The conventional approach is to create an independent generator in each thread, perhaps folding a thread-id into the seed, or using a 'discard' function so that threads can "leapfrog" over one another. But both these solutions result in output that depends on the number of threads and how atoms are assigned to threads. Furthermore, improperly seeding millions of generators (not an unreasonable number on a modern supercomputer) can lead to unintended correlations among streams, so conventional wisdom is that one must very carefully choose the engine and the initialization method. The problem is even harder if the overall program structure (i.e., not just this loop) dictates a parallelization strategy that might assign the same atom to multiple threads. CBRNGs overcome all these problems. With CBRNGs, the code looks like this (see libs/random/examples/counter_based_example.cpp for a fully worked example): using namespace boost::random; typedef threefry<4, uint32_t> Prf; normal_distribution nd; Prf::key_type key = {seed}; Prf prf(key); // seed may come from command line for(size_t i=0; i<atoms.size(); ++i){ float boltzmannfactor = sqrt(kT/atoms[i].mass); Prf::domain_type d = {atoms[i].id, timestep, THERMALIZE_CTXT}; counter_based_urng<Prf> cbrng(prf, d); nd.reset(); atoms[i].vx = boltzmannfactor*nd(cbrng); atoms[i].vy = boltzmannfactor*nd(cbrng); atoms[i].vz = boltzmannfactor*nd(cbrng); } Let's consider the code changes between the two fragments: - counter_based_urng is a templated adapter class that models a bona fide Boost UniformRandomNumberGenerator. Its template parameter is a Pseudo-Random Function (PRF). An instance of the Pseudo-Random Function and a value from the Pseudo-Random Function's domain are required to construct a counter_based_urng. E.g., these lines in the example: Prf::domain_type d = {atoms[i].id, timestep, THERMALIZE_CTXT}; counter_based_urng<Prf> cbrng(prf, d); In C++11, with initializer-lists, this might be shortened to: counter_based_urng<Prf> cbrng(prf, {atoms[i].id, timestep, THERMALIZE_CTXT}); Creation and destruction of the counter_based_urng is much faster than actually generating random values or processing them through a distribution, so it's reasonable to create and destroy the cbrng every time through the loop. Counter_based_urngs constructed from the same domain value and the same PRF are identical, i.e., they will generate exactly the same sequence. On the other hand, counter_based_urngs constructed from domain values that differ in even a single bit generate independent, non-overlapping sequences of random values. Thus, by choosing a value in the domain that encodes some program-specific state (e.g., atoms[i].id and timestep), we produce a unique stream for each atom at each timestep that is statistically independent of all other streams. Notice that the random values generated for a particular atom at a particular timestep are independent of the number of threads or the assignment of atoms to threads. The additional constant THERMALIZE_CTXT is used to distinguish this loop from any other loop or context in the program, eliminating the possibility that the same sequence will be generated elsewhere in the program. - Since it models a URNG, cbrng can be passed as an argument to the normal_distribution, nd. In order to make each atom independent, nd.reset() is called each time through the loop. - PRFs are keyed. I.e., the Prf constructor takes a key_type as an argument. Two Prfs of the same type, initialized with the same key are indistinguishable. On the other hand, two Prfs constructed from keys that differ in any way, even by a single bit, will give rise to statistically independent counter_based_urngs and output streams. In the example, we initialize the PRF outside the loop with: Prf::key_type key = {seed}; Prf prf(key); // seed may come from command line Pseudo-random functions: Philox and Threefry --------------------------------------------- Two Pseudo-Random functions are implemented in this source tree: threefry and philox. Both are templated over an unsigned width, an unsigned integer value type, and a round-count (which takes a reasonable and safe default value). For example: threefry<4, uint32_t> philox<2, uint64_t> All PRFs have a key_type, a domain_type, and a range_type, all of which are boost::arrays of the underlying value type. I.e., threefry<N, U>::key_type = boost::array<U, N> domain_type = boost::array<U, N> range_type = boost::array<U, N> philox<N, U>::key_type = boost::array<U, N/2> domain_type = boost::array<U, N> range_type = boost::array<U, N> For threefry and philox, initialization is extremely fast. For other PRFs, e.g., the cryptographic AES function (not implemented), there may be non-trivial computation associated with initialization, so initializing PRFs is discouraged in *inner* loops, but is perfectly reasonable at thread-scope or anywhere else that a few dozen invocations of the generator will amortize the initialization cost. Threefry and Philox are "pseudo-random", meaning that the outputs from any set of inputs are practically indistiguishable from random. In particular, one can obtain apparently random output simply by "counting" through inputs in an entirely regular way. Strong empirical evidence is presented in the SC11 paper for the statistical quality of the threefry and philox functions. Furthermore, the pseudo-random functions obtained with different keys are shown to be statistically independent, even if the keys differ by only a single bit or follow regular patterns. Among other things, threefry and philox pass the entire BigCrush suite of tests. counter_based_urng ------------------ The counter_based_urng class uses the pseudo-random property of PRFs to perform random number generation in accordance with the requirements of a UniformRandomNumberGenerator. It reserves some number (by default, all) of most-significant bits of the highest-index member of the domain_type array for its own internal use as a "counter". It is an error for the domain_type constructor argument to have non-zero bits in this range. Whenever new random values are required, the "counter bits" are incremented and the PRF is called, generating a new set of random values that will be returned by counter_based_urng. It is an error to request random values after the counter is exhausted. Thus, counter_based_urngs will typically have fairly "short" sequence lengths (anything from 4 up to 2^64). This is usually more than sufficient to provide input to one or more Distributions, which generally call their engine a non-deterministic, but usually small number of times. On the other hand, it is cheap and efficient to create huge numbers (2^64 or more) of independent counter_based_urngs on demand. counter_based_engine -------------------- The counter_based_engine class is a templated adapter class that models a bona fide RandomNumberEngine. In particular, it is Seedable in the same way as other Engines, so it can be used in any program that expects a RandomNumberEngine. The counter_based_engine offers two very useful properties: 1 - it has a very small size, and a very fast constructor, so it is practical to instantiate millions of them in a large parallel application. 2 - the discard() function is very fast and runs in constant time. Parallel programs can use this to "leapfrog" multiple sequences over one another in different threads, or it can be used to initialize generators in different threads with starting points that are separated by enough to avoid overlap. </README.md>

Thijs van den Berg

9:28 p.m.

...

now on github:

https://github.com/DEShawResearch/Random123-Boost

Hi John, I’ve also been working on a design and it will take another week before it’s good enough. It has similarities and differences (which is to be expected since we haven’t aligned). In short: * a random_function concept which is a functor with operator()(input&,output&) or operator()(input&,output&,key&). I’ve considered returning a container but I prefer this. template <std::size_t InputSize> class sha1_random_function { // type info void operator()(input_type& _i, output_type& _o) { _sha1.reset(); _sha1.process_bytes(_i, input_traits::size ); _sha1.get_digest(_o); }; private: boost::uuids::detail::sha1 _sha1; // I found this inside boost! }; * the input output are random access containers with acompanying traits that allow other components to store and interface with these input output. The traits are element type, number of elements and number of bits used per element. E.g. a uint_least32_t v[5]. Non of these container are stored inside the random function. template <typename RandomFunction> struct random_function_input_traits { typedef ... value_type; // the type of the elements in the input container typedef... type; // the input container type static const std::size_t size = ...; // the number of elements in the input container static const std::size_t bits = ...; // the number of bits used from each element in the input container }; * a generic counter_engine that turns a random_function into a random_engine using a counter. It has a template argument random_function and it stores and manipular a counter of type random_function::input and/or random_function::key and extract random Uint;s from the output buffer. typedef counter_engine< sha1_random_function<64>, // do a sha1 hash on a 64 bytes input buffer (256 bit) boost::uint_least32_t, // return type of the random engine 32, // 32 random bits in the return type elements 32 // 32 bit seed > sha1_counter; * it doesn’t need to end with a counter_engine, other types of random engines adapters that do input manipulators besides a counter can be added. My focus is mainly the counter_engine adaptor (which is finished), the random_function concept and its containers+traits. I have unfortunately limited time right now but I will analyse our designs and come back with ideas and feedback.

Steven Watanabe

2 May 2 May

3:43 p.m.

AMDG On 04/29/2014 10:00 AM, John Salmon wrote:

...

I've dusted off the code I wrote a couple of years ago and republished it, now on github:

https://github.com/DEShawResearch/Random123-Boost

For this to be included in Boost.Random, it must be released under the Boost Software License.

...

<snip>

CBRNGs overcome all these problems. With CBRNGs, the code looks like this (see libs/random/examples/counter_based_example.cpp for a fully worked example):

using namespace boost::random; typedef threefry<4, uint32_t> Prf; normal_distribution nd; Prf::key_type key = {seed}; Prf prf(key); // seed may come from command line for(size_t i=0; i<atoms.size(); ++i){ float boltzmannfactor = sqrt(kT/atoms[i].mass); Prf::domain_type d = {atoms[i].id, timestep, THERMALIZE_CTXT}; counter_based_urng<Prf> cbrng(prf, d); nd.reset(); atoms[i].vx = boltzmannfactor*nd(cbrng); atoms[i].vy = boltzmannfactor*nd(cbrng); atoms[i].vz = boltzmannfactor*nd(cbrng); }

Is there a good reason why this shouldn't be written as: for(...) { ... std::seed_seq s{ seed, atoms[i].id, timestep, THERMALIZE_CTXT }; counter_based_engine<Prf> cbrng(s); ... } I don't really like making the details of the algorithm (i.e. key and domain) part of the primary interface.

...

Let's consider the code changes between the two fragments:

<snip>

- Since it models a URNG, cbrng can be passed as an argument to the normal_distribution, nd. In order to make each atom independent,

nd.reset()

is called each time through the loop.

<snip>

FWIW, I don't really think that reset was a good idea in the first place, and it's currently a no-op in all distributions. To run the loop in parallel, you'd need to make a copy of nd in every thread, anyway.

...

counter_based_urng ------------------

The counter_based_urng class uses the pseudo-random property of PRFs to perform random number generation in accordance with the requirements of a UniformRandomNumberGenerator. <snip>

counter_based_engine --------------------

The counter_based_engine class is a templated adapter class that models a bona fide RandomNumberEngine. <snip>

Is there really a good reason to make two separate templates? A RandomNumberEngine is a URNG, by definition. In Christ, Steven Watanabe

John Salmon

5 May 5 May

1:10 p.m.

On Fri, May 2, 2014 at 11:43 AM, Steven Watanabe <watanabesj@gmail.com>wrote:

...

AMDG

On 04/29/2014 10:00 AM, John Salmon wrote:

...
I've dusted off the code I wrote a couple of years ago and republished

it,

...
now on github:

https://github.com/DEShawResearch/Random123-Boost

For this to be included in Boost.Random, it must be released under the Boost Software License.

I'm confident that this will be possible. I've started the necessary in-housel conversation..

...

...
counter_based_urng ------------------

The counter_based_urng class uses the pseudo-random property of PRFs to perform random number generation in accordance with the requirements of a UniformRandomNumberGenerator. <snip>

counter_based_engine --------------------

The counter_based_engine class is a templated adapter class that models a bona fide RandomNumberEngine. <snip>

Is there really a good reason to make two separate templates? A RandomNumberEngine is a URNG, by definition.

I had reasons, but they're less convincing than I initially thought. I'm working on a new version that only has counter_based_engine. It should be ready soon. John Salmon _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

John Salmon

27 May 27 May

6:48 p.m.

On Mon, May 5, 2014 at 9:10 AM, John Salmon <john@thesalmons.org> wrote:

...

On Fri, May 2, 2014 at 11:43 AM, Steven Watanabe <watanabesj@gmail.com>wrote:

...
AMDG

On 04/29/2014 10:00 AM, John Salmon wrote:

...
I've dusted off the code I wrote a couple of years ago and republished

it,

...
now on github:

https://github.com/DEShawResearch/Random123-Boost

For this to be included in Boost.Random, it must be released under the Boost Software License.

Done! The version at the above URL is now under the Boost Software License. While waiting for permission to change the license, I had some useful conversations with Thijs off-line, resulting in some changes to the implementation. I won't quote the whole README again, but it now starts this way, which I hope is much more to Steven's liking: <snip README.md> CBENG's are fully generic "Random Number Engines", satisfying all the requirements of C++11 and Boost Engines. They can be used anywhere that other engines can be used. When used with the provided "pseudo random functions", threefry and philox, CBENGs are "crush resistant", i.e., they pass the very demanding suite of 'SmallCrush' 'Crush' and 'BigCrush' tests in the TestU01 package. CBENGs based on threefry and philox are fast (comparable to the mersenne twister) and small (requiring only a few words of state). CBENGs are excellent random number engines in any context, but two features make them especially well suited to parallel computations: - a fast (constant time) restart() member function that allows the caller to manage a huge number of distinct streams. - a fast (constant time) discard() member function. The restart() member is unique to CBENGs, while discard() follows standard semantics (but very few engines permit constant-time implementations). </snip> There is only one counter_based_engine, as Steven requested elsewhere in the thread. Can we agree that counter_based_engine, implemented roughly along these lines is a worthwhile addition to Boost.Random? If so, what's the way forward? Should I start a 'feature branch', i.e., https://svn.boost.org/trac/boost/wiki/StartModMaint#Startworkonanewfeature to make it easier to ultimately merge into boost? Cheers, John Salmon

John Salmon

3 Jun 3 Jun

4:17 p.m.

On Tue, May 27, 2014 at 2:48 PM, John Salmon <john@thesalmons.org> wrote:

...

On Mon, May 5, 2014 at 9:10 AM, John Salmon <john@thesalmons.org> wrote:

...
On Fri, May 2, 2014 at 11:43 AM, Steven Watanabe <watanabesj@gmail.com> wrote:

...
AMDG

On 04/29/2014 10:00 AM, John Salmon wrote:

...
I've dusted off the code I wrote a couple of years ago and republished

it,

...
now on github:

https://github.com/DEShawResearch/Random123-Boost

For this to be included in Boost.Random, it must be released under the Boost Software License.

Done! The version at the above URL is now under the Boost Software License.

While waiting for permission to change the license, I had some useful conversations with Thijs off-line, resulting in some changes to the implementation. I won't quote the whole README again, but it now starts this way, which I hope is much more to Steven's liking:

<snip README.md> CBENG's are fully generic "Random Number Engines", satisfying all the requirements of C++11 and Boost Engines. They can be used anywhere that other engines can be used. When used with the provided "pseudo random functions", threefry and philox, CBENGs are "crush resistant", i.e., they pass the very demanding suite of 'SmallCrush' 'Crush' and 'BigCrush' tests in the TestU01 package. CBENGs based on threefry and philox are fast (comparable to the mersenne twister) and small (requiring only a few words of state).

CBENGs are excellent random number engines in any context, but two features make them especially well suited to parallel computations:

- a fast (constant time) restart() member function that allows the caller to manage a huge number of distinct streams. - a fast (constant time) discard() member function.

The restart() member is unique to CBENGs, while discard() follows standard semantics (but very few engines permit constant-time implementations).

</snip>

There is only one counter_based_engine, as Steven requested elsewhere in the thread.

Can we agree that counter_based_engine, implemented roughly along these lines is a worthwhile addition to Boost.Random? If so, what's the way forward?

Should I start a 'feature branch', i.e., https://svn.boost.org/trac/boost/wiki/StartModMaint#Startworkonanewfeature to make it easier to ultimately merge into boost?

Cheers,

John Salmon

Any thoughts on this?

John

John Salmon

13 Jun 13 Jun

3:17 p.m.

On Tue, May 27, 2014 at 2:48 PM, John Salmon <john@thesalmons.org> wrote:

...

On Mon, May 5, 2014 at 9:10 AM, John Salmon <john@thesalmons.org> wrote:

...
On Fri, May 2, 2014 at 11:43 AM, Steven Watanabe <watanabesj@gmail.com> wrote:

...
AMDG

On 04/29/2014 10:00 AM, John Salmon wrote:

...
I've dusted off the code I wrote a couple of years ago and republished

it,

...
now on github:

https://github.com/DEShawResearch/Random123-Boost

For this to be included in Boost.Random, it must be released under the Boost Software License.

Done! The version at the above URL is now under the Boost Software License.

While waiting for permission to change the license, I had some useful conversations with Thijs off-line, resulting in some changes to the implementation. I won't quote the whole README again, but it now starts this way, which I hope is much more to Steven's liking:

<snip README.md> CBENG's are fully generic "Random Number Engines", satisfying all the requirements of C++11 and Boost Engines. They can be used anywhere that other engines can be used. When used with the provided "pseudo random functions", threefry and philox, CBENGs are "crush resistant", i.e., they pass the very demanding suite of 'SmallCrush' 'Crush' and 'BigCrush' tests in the TestU01 package. CBENGs based on threefry and philox are fast (comparable to the mersenne twister) and small (requiring only a few words of state).

CBENGs are excellent random number engines in any context, but two features make them especially well suited to parallel computations:

- a fast (constant time) restart() member function that allows the caller to manage a huge number of distinct streams. - a fast (constant time) discard() member function.

The restart() member is unique to CBENGs, while discard() follows standard semantics (but very few engines permit constant-time implementations).

</snip>

There is only one counter_based_engine, as Steven requested elsewhere in the thread.

Can we agree that counter_based_engine, implemented roughly along these lines is a worthwhile addition to Boost.Random? If so, what's the way forward?

Should I start a 'feature branch', i.e., https://svn.boost.org/trac/boost/wiki/StartModMaint#Startworkonanewfeature to make it easier to ultimately merge into boost?

Cheers,

John Salmon

Second ping, Can anybody hear me?

It's been more than two weeks, with no response to the above, nor to a resend on June 3. I understand that people are busy. But note that I'm asking what *I* should do, not for someone else to do something. In any case, a reply along the lines of "I'm busy and won't be able to get into this until approximately <DATE>, but until then, you could help by doing XXX" would be appreciated. Cheers, John Salmon

Steven Watanabe

5:21 p.m.

AMDG On 06/13/2014 09:17 AM, John Salmon wrote:

...

On Tue, May 27, 2014 at 2:48 PM, John Salmon <john@thesalmons.org> wrote:

...
Can we agree that counter_based_engine, implemented roughly along these lines is a worthwhile addition to Boost.Random? If so, what's the way forward?

The main thing I need is documentation as doxygen+quickbook.

...

...
Should I start a 'feature branch', i.e., https://svn.boost.org/trac/boost/wiki/StartModMaint#Startworkonanewfeature to make it easier to ultimately merge into boost?

Sure. The headers aren't a big deal, since it's trivial for me to copy files, but the documentation does need to be integrated into the Boost.Random docs. - threefry primary template: BOOST_STATIC_ASSERT( N==2 || N==4 ) Just don't provide the primary template. - counter_traits.hpp: comments about SeedSeq and birthday paradox. I'm not sure that this is actually true. It's obviously the case that mapping from 4 integers to 3 is going to have this problem, but there is no way around this. If the destination range is larger, then I really hope the outputs would always be distinct for distinct inputs of the same size. I haven't analyzed the algorithm and I have no idea where it comes from, but if it doesn't have this property, then some outputs would be totally unreachable. (Of course, the performance issues of seed_seq still apply.) - Making detail::counter_traits a template parameter of counter_based_engine adds boost::random::detail to the set of associated namespaces. - make_counter and 64-bit: Boost.Random generally requires 32-bit sequences. (This is inherited from std::seed_seq) Handling different widths is a lot of complication for - The iterator seed and constructors of counter_based_engine are overloaded based on the constness of the first argument. Don't do this. It's far to error-prone. - I'm unconvinced of the value of restart. I'm thinking that we can get the same result by allowing multiprecision integers with discard.

...

...
Second ping, Can anybody hear me?

It's been more than two weeks, with no response to the above, nor to a resend on June 3. I understand that people are busy. But note that I'm asking what *I* should do, not for someone else to do something. In any case, a reply along the lines of "I'm busy and won't be able to get into this until approximately <DATE>, but until then, you could help by doing XXX" would be appreciated.

In Christ, Steven Watanabe

Paul A. Bristow

5:45 p.m.

...

-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Steven Watanabe Sent: 13 June 2014 18:22 To: boost@lists.boost.org Subject: Re: [boost] [random] new threefry random engine

AMDG

On 06/13/2014 09:17 AM, John Salmon wrote:

...
On Tue, May 27, 2014 at 2:48 PM, John Salmon <john@thesalmons.org> wrote:

...
Can we agree that counter_based_engine, implemented roughly along these lines is a worthwhile addition to Boost.Random? If so, what's the way forward?

...

The main thing I need is documentation as doxygen+quickbook.

Sure. The headers aren't a big deal, since it's trivial for me to copy files, but the documentation does need to be integrated into the Boost.Random docs.

I might be able to help with doxygen+Quickbook - getting the toolchain set up is tiresome. https://svn.boost.org/trac/boost/wiki/BoostDocs/GettingStarted If you've got it set up right, then you should be able to rebuild the existing docs in the modular-boost tree. Or I could start with your plain text notes, perhaps pasted onto to a plain text dump of the existing html, and edit this in Quickbook syntax and you (and others) can view the result. A GIT branch sounds essential. But I can't do this for a few weeks :-( Paul --- Paul A. Bristow Prizet Farmhouse Kendal UK LA8 8AB +44 01539 561830

John Salmon

6:26 p.m.

On Fri, Jun 13, 2014 at 1:45 PM, Paul A. Bristow <pbristow@hetp.u-net.com> wrote:

...

...
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Steven Watanabe Sent: 13 June 2014 18:22 To: boost@lists.boost.org Subject: Re: [boost] [random] new threefry random engine

AMDG

On 06/13/2014 09:17 AM, John Salmon wrote:

...
On Tue, May 27, 2014 at 2:48 PM, John Salmon <john@thesalmons.org> wrote:

...
Can we agree that counter_based_engine, implemented roughly along these lines is a worthwhile addition to Boost.Random? If so, what's the way forward?

...
The main thing I need is documentation as doxygen+quickbook.

Sure. The headers aren't a big deal, since it's trivial for me to copy files, but the documentation does need to be integrated into the Boost.Random docs.

I might be able to help with doxygen+Quickbook - getting the toolchain set up is tiresome.

https://svn.boost.org/trac/boost/wiki/BoostDocs/GettingStarted

If you've got it set up right, then you should be able to rebuild the existing docs in the modular-boost tree.

Or I could start with your plain text notes, perhaps pasted onto to a plain text dump of the existing html, and edit this in Quickbook syntax and you (and others) can view the result.

A GIT branch sounds essential.

But I can't do this for a few weeks :-(

Paul

--- Paul A. Bristow Prizet Farmhouse Kendal UK LA8 8AB +44 01539 561830

Ok. I'll start a git branch and reach out for help with QuickBook when I get that far. Thanks, John

John Salmon

6 May 6 May

1:43 p.m.

There's a new version at: https://github.com/DEShawResearch/Random123-Boost that I think addresses your concerns. Click on the link to see the new README. On Fri, May 2, 2014 at 11:43 AM, Steven Watanabe <watanabesj@gmail.com>wrote:

...

AMDG

On 04/29/2014 10:00 AM, John Salmon wrote:

...
I've dusted off the code I wrote a couple of years ago and republished

it,

...
now on github:

https://github.com/DEShawResearch/Random123-Boost

...

...
<snip>

CBRNGs overcome all these problems. With CBRNGs, the code looks like this (see libs/random/examples/counter_based_example.cpp for a fully worked example):

using namespace boost::random; typedef threefry<4, uint32_t> Prf; normal_distribution nd; Prf::key_type key = {seed}; Prf prf(key); // seed may come from command line for(size_t i=0; i<atoms.size(); ++i){ float boltzmannfactor = sqrt(kT/atoms[i].mass); Prf::domain_type d = {atoms[i].id, timestep, THERMALIZE_CTXT}; counter_based_urng<Prf> cbrng(prf, d); nd.reset(); atoms[i].vx = boltzmannfactor*nd(cbrng); atoms[i].vy = boltzmannfactor*nd(cbrng); atoms[i].vz = boltzmannfactor*nd(cbrng); }

Is there a good reason why this shouldn't be written as:

for(...) { ... std::seed_seq s{ seed, atoms[i].id, timestep, THERMALIZE_CTXT }; counter_based_engine<Prf> cbrng(s); ... }

Seed_seq is the wrong tool for the job. It's a middleman who charges a hefty commission (time and space) and then delivers your product slightly damaged. You give it values that you have carefully chosen so that they're *guaranteed* to be unique. Seed_seq then transforms them into values that are *probably* unique, but there are no guarantees. If you use it a lot, the Birthday Paradox will eat away at that "probably" and after a few billion uses you'll be in for a nasty surprise: two of your generators will produce exactly the same output. Seed_seq is an attempt to work around the problem that many generators (and some of the most popular ones) are difficult to initialize in a way that results in independent streams. The CBRNGs are designed to give you independent streams without the overhead or risk of a seed_seq. I think you'll like the new example code better. From the new README: typedef threefry<4, uint32_t> Prf; counter_based_engine<Prf, 32> cbeng(seed); for(size_t i=0; i<Natoms; ++i){ float boltzmannfactor = sqrt(kT/atoms[i].mass); normal_distribution nd(0., rmsvelocity); Prf::domain_type base = {atoms[i].id, timestep, THERMALIZE_CTXT}; cbeng.restart(base); atoms[i].vx = boltzmannfactor*nd(cbeng); atoms[i].vy = boltzmannfactor*nd(cbeng); atoms[i].vz = boltzmannfactor*nd(cbeng); } You can seed the counter_based_engine directly with an arithmetic seed, or, if you prefer, you can use the the much larger key_type, or you can pre-initialize a Prf and use that. In addition to seed()-ing (which meets all the standard expectations), counter_based_engine can be 'restart()-ed' with a new 'base counter'. This is how you associate different, independent sequences with whatever program elements you like: threads, atoms, whatever.

...

I don't really like making the details of the algorithm (i.e. key and domain) part of the primary interface.

I think it's important that they be available, but in the new version, you don't have to use them if you don't want to. A user *may* limit himself to using standard seeding methods and avoid restart(). Their one remaining advantage is that they offer constant-time discard(). The additional features that rely on the details of the underlying Prf are entirely optional.

...

Let's consider the code changes between the two fragments:

...
<snip>

- Since it models a URNG, cbrng can be passed as an argument to the normal_distribution, nd. In order to make each atom independent,

nd.reset()

is called each time through the loop.

<snip>

FWIW, I don't really think that reset was a good idea in the first place, and it's currently a no-op in all distributions. To run the loop in parallel, you'd need to make a copy of nd in every thread, anyway.

Fair enough. In the example, I've moved the normal distribution into the loop and removed reset().

...

...
counter_based_urng ------------------

The counter_based_urng class uses the pseudo-random property of PRFs to perform random number generation in accordance with the requirements of a UniformRandomNumberGenerator. <snip>

counter_based_engine --------------------

The counter_based_engine class is a templated adapter class that models a bona fide RandomNumberEngine. <snip>

Is there really a good reason to make two separate templates? A RandomNumberEngine is a URNG, by definition.

There was, but not any more. There is now only a counter_based_engine. It takes a template parameter that determines the length of individual sequences and the number of permissible 'base counters'. It has a 'restart()' method that resets the 'base counter' to begin a new sequence. And it has a few constructor() and seed() overloads that extend the capabilities of a generic Engine.

Thijs van den Berg

7:51 p.m.

...

I think you'll like the new example code better. From the new README:

typedef threefry<4, uint32_t> Prf; counter_based_engine<Prf, 32> cbeng(seed); for(size_t i=0; i<Natoms; ++i){ float boltzmannfactor = sqrt(kT/atoms[i].mass); normal_distribution nd(0., rmsvelocity); Prf::domain_type base = {atoms[i].id, timestep, THERMALIZE_CTXT}; cbeng.restart(base); atoms[i].vx = boltzmannfactor*nd(cbeng); atoms[i].vy = boltzmannfactor*nd(cbeng); atoms[i].vz = boltzmannfactor*nd(cbeng); }

You can seed the counter_based_engine directly with an arithmetic seed, or, if you prefer, you can use the the much larger key_type, or you can pre-initialize a Prf and use that.

I tried to examine what happend when you draw more samples inside the loop, like this for (size_t j=0; j<100; ++j) { atoms[i].vx += mbd(cbeng); atoms[i].vy += mbd(cbeng); atoms[i].vz += mbd(cbeng); } and that caused an abort trap. I expect that the cbeng ran out of samples and that is not the behaviour you want. To be compatible you’ll need to allow the distribution mdb() can draw any amount of samples. Mbd() can be a probability distribution that consumes many random samples from the engine. E.g. I’ve just finished a distribution sampler that consumes 4 per draw. The same issues arises if you have many more variables to set than vx,vy,vz, which make the usage error-prone. One solution for this particular example is that you move the domain_type variable inside the 2nd loop and make it a function of both i,j from the loops. However that still isn’t a nice design because it can be very inefficient. Every time time you call restart() it will generate 4x64 bits of random data on it’s first call, ..but you only consume 3x32 bits. Putting the burden of optimising this on the user is not a good idea. The version I’ve finished here https://github.com/sitmo/threefry is a standard random engine concept and doesn’t need a domain_type, key_type, restart() and doesn’t have a low limit on the number of samples that users need to be aware of. Its intuitive to use like any other random engine -as can be seen in the next example-. I *think* it also uses less memory and is also faster because I’ve optimised the engine for it’s purpose. for(size_t i=0; i<Natoms; ++i){ float rmsvelocity = sqrt(kT/atoms[i].mass); normal_distribution<float> mbd(0., rmsvelocity); threefry4x64_13 eng(i); atoms[i].vx += mbd(eng); atoms[i].vy += mbd(eng); atoms[i].vz += mbd(eng); }

John Salmon

9:28 p.m.

On Tue, May 6, 2014 at 3:51 PM, Thijs van den Berg <thijs@sitmo.com> wrote:

...

...
I think you'll like the new example code better. From the new README:

typedef threefry<4, uint32_t> Prf; counter_based_engine<Prf, 32> cbeng(seed); for(size_t i=0; i<Natoms; ++i){ float boltzmannfactor = sqrt(kT/atoms[i].mass); normal_distribution nd(0., rmsvelocity); Prf::domain_type base = {atoms[i].id, timestep, THERMALIZE_CTXT}; cbeng.restart(base); atoms[i].vx = boltzmannfactor*nd(cbeng); atoms[i].vy = boltzmannfactor*nd(cbeng); atoms[i].vz = boltzmannfactor*nd(cbeng); }

You can seed the counter_based_engine directly with an arithmetic seed,

or,

...
if you prefer, you can use the the much larger key_type, or you can pre-initialize a Prf and use that.

I tried to examine what happend when you draw more samples inside the loop, like this

for (size_t j=0; j<100; ++j) { atoms[i].vx += mbd(cbeng); atoms[i].vy += mbd(cbeng); atoms[i].vz += mbd(cbeng); }

and that caused an abort trap. I expect that the cbeng ran out of samples and that is not the behaviour you want.

Yes. It ran out because in the counter_based_example.cpp file it's declared with a 5-bit counter: counter_based_engine<Prf, 5> cbeng; This was a very bad choice on my part. The README declares cbeng with a 32-bit counter, and explains that the engine can produce 4*2^32 values before running out. Had I used that in the example.cpp, you would not have had a problem. I'll push an updated counter_based_example.cpp with CtrBits=32 momentarily. If you change your copy to match the README, i.e., change 5 to 32, then you'll have 32-bit counter and you can generate a few billion normals in your loop before running out. I just ran it with: counter_based_engine<Prf, 32> cbeng; ... for(size_t j=0; j<1000000000; ++j){ ... } and it took a while, but worked fine. It also used all four cores on my machine and produced the same result as it would on one core. To be compatible you’ll need to allow the distribution mdb() can draw any

...

amount of samples. Mbd() can be a probability distribution that consumes many random samples from the engine. E.g. I’ve just finished a distribution sampler that consumes 4 per draw.

Yes. Some distributions require even more. IIRC, the Poisson distribution can sometimes require O(10). If you're using the generator the way I advocate (restarted every time through a loop), then 32 counter bits will likely be more than enough. If you're using it in a more conventional way, (one generator per thread), then a 64-bit counter is probably a better idea, as it will take centuries to count down. It might make sense to have default value for CtrBits. I'd go with the minimum of 64 and half the width of the Prf's range. I.e., 32 for threefry2x32 and philox2x32 and 64 for everything else. The same issues arises if you have many more variables to set than

...

vx,vy,vz, which make the usage error-prone. One solution for this particular example is that you move the domain_type variable inside the 2nd loop and make it a function of both i,j from the loops. However that still isn’t a nice design because it can be very inefficient. Every time time you call restart() it will generate 4x64 bits of random data on it’s first call, ..but you only consume 3x32 bits. Putting the burden of optimising this on the user is not a good idea.

Actually, it's only 4x32 bits on each call. But your point stands. To get the best possible performance one needs to be aware of some details. How many random values, and of what size you get with each "turn of the crank" are imporant details. If you ignore them, you'll get good, reliable, correctly distributed values, but you might waste some time. The version I’ve finished here https://github.com/sitmo/threefry is a

...

standard random engine concept and doesn’t need a domain_type, key_type, restart() and doesn’t have a low limit on the number of samples that users need to be aware of. Its intuitive to use like any other random engine -as can be seen in the next example-. I *think* it also uses less memory and is also faster because I’ve optimised the engine for it’s purpose.

I'm happy to look at optimization. But it should probably wait till we've agreed on an API. My version is also aggressively optimized, e.g., I used mpl::foreach to unroll some loops because gcc wouldn't unroll them without help. If you've found ways to make it even smaller or faster, I'm eager to learn. John Salmon

Thijs van den Berg

10:40 p.m.

...

On Tue, May 6, 2014 at 3:51 PM, Thijs van den Berg <thijs@sitmo.com> wrote:

...
...
I think you'll like the new example code better. From the new README:

typedef threefry<4, uint32_t> Prf; counter_based_engine<Prf, 32> cbeng(seed); for(size_t i=0; i<Natoms; ++i){ float boltzmannfactor = sqrt(kT/atoms[i].mass); normal_distribution nd(0., rmsvelocity); Prf::domain_type base = {atoms[i].id, timestep, THERMALIZE_CTXT}; cbeng.restart(base); atoms[i].vx = boltzmannfactor*nd(cbeng); atoms[i].vy = boltzmannfactor*nd(cbeng); atoms[i].vz = boltzmannfactor*nd(cbeng); }

You can seed the counter_based_engine directly with an arithmetic seed,

or,

...
if you prefer, you can use the the much larger key_type, or you can pre-initialize a Prf and use that.

I tried to examine what happend when you draw more samples inside the loop, like this

for (size_t j=0; j<100; ++j) { atoms[i].vx += mbd(cbeng); atoms[i].vy += mbd(cbeng); atoms[i].vz += mbd(cbeng); }

and that caused an abort trap. I expect that the cbeng ran out of samples and that is not the behaviour you want.

Yes. It ran out because in the counter_based_example.cpp file it's declared with a 5-bit counter:

counter_based_engine<Prf, 5> cbeng;

This was a very bad choice on my part. The README declares cbeng with a 32-bit counter, and explains that the engine can produce 4*2^32 values before running out. Had I used that in the example.cpp, you would not have had a problem.

I'll push an updated counter_based_example.cpp with CtrBits=32 momentarily. If you change your copy to match the README, i.e., change 5 to 32, then you'll have 32-bit counter and you can generate a few billion normals in your loop before running out. I just ran it with:

Thanks for explaining that. I didn’t realise that, I thought that the random function output (4x64) was being read past the buffer end. 32 bit is indeed more than enough, great to see this!

...

counter_based_engine<Prf, 32> cbeng; ... for(size_t j=0; j<1000000000; ++j){ ... }

That’s a good (speed) test. I’ll use that too.

...

I'm happy to look at optimization. But it should probably wait till we've agreed on an API. My version is also aggressively optimized, e.g., I used mpl::foreach to unroll some loops because gcc wouldn't unroll them without help. If you've found ways to make it even smaller or faster, I'm eager to learn.

Yes I agree we should first look at the interface/concepts like we do now. However, the speed of threefry is one of it’s selling point and it’s natural to look at that too. I ran into the same issue as you with mpl (when not using the foreach) and ended up doing a manual unrolling with chunks of 40 rounds. I wasn’t too happy with that because mpl code is much more elegant and compact. I didn’t know that foreach would have solved it. One of the extra optimisations I realised I could do was to (allow to) limit the key size (my engine stores the key in a private var). The default is 4x64 bits and an additional 64 bit for the xor, but one typically doesn’t need 2^256 independent streams (if you use the key for seeding), e.g. 2^64 is probably more than enough and sometimes no key is ok too (you can then use leapfroggin or discard if you still want independent streams). Reducing the key size reduces the number of instructions in the key round and it reduces memory usage. The key and counter size are very superfluous and that gives room for reduction. There is also another reason I prefer a stripped down version of the key, it will allow you to make the interface a standard random engine one. We can suffice with a 64 bit seed, and that remove the need for having to expose the user to the internal types (key, domain). However, if you need a bigger key then there is always the seedseq interface (the only option) I think the main difference between out versions right now is * “engine construction outside the loop + constructing a domain_type inside the loop + restart()"

...

typedef threefry<4, uint32_t> Prf; counter_based_engine<Prf, 32> cbeng(seed); ... Prf::domain_type base = {atoms[i].id, timestep, THERMALIZE_CTXT}; cbeng.restart(base);

* v.s. constructor() inside the loop

...

threefry4x64_13 eng(i);

To give some details for out discussion.. my engine uses an private counter of size of {1,2,3,4} x 64 bits and I use the seed(i) and constructor interface to set the key. When I set the key via seed or via the constructor I alway reset the counter to zero (because when two engines are seeded equally the sequence should be equal). The constructor and regular seed limits the seed to 64bits, but the (slower) seedseq interface can be used to set the full key if it has more than 64 bits. In your engine things seems to be connected differently? The counter is split into a high and low part, giving you 3 different types of input (including the key). I think that changing your version above to this version below is better because it only uses standard random engines concepts (less terminology and less tweakable options) and is more compact. My view is that new concepts like domain_type, restart() have an associated learning cost and causes vendor lock-in for developers (their code becomes dependent on this specific engine). That’s why I want a standard engine unless there is a *huge* benefit for deviating. Do you think that there is a significant performance penalty here below, or do you simply prefer the interface above? Even when there is a performance hit, shouldn’t we at least have a standard engine without any interface extensions and put the high performance version into a different interface/concept?

...

for(size_t i=0; i<Natoms; ++i){ counter_based_engine< threefry<4, uint32_t>, 32> cbeng(atoms[i].id);

Thijs van den Berg

7 May 7 May

9:36 a.m.

I did some performance test, with 4 threads -O3 and ITERATIONS = 10.000.000, and there is no performance difference between the two versions (see below). My opinion is that we should stick with a standard random engine concept I used, merge our code (we both have non-overlapping bits that are needed), and think a bit more about random function concepts at a later stage. For me a random engine is al I need. timings: 7534.86 7603.93 ms timings: 7640.01 7533.52 ms timings: 7648.3 7604.28 ms timings: 7595.59 7626.43 ms timings: 6767.75 6432.64 ms timings: 7635.71 7535.7 ms timings: 7604.28 7618.24 ms timings: 7734.14 7091.38 ms timings: 7366.14 7581.43 ms timings: 7559.15 7168.32 ms timings: 7538.35 7355.96 ms timings: 7476.68 7702.23 ms timings: 6757.96 7545.04 ms timings: 7607.6 7608.7 ms timings: 7608.16 7657.03 ms timings: 6886.66 7698.5 ms timings: 6833.46 6466.57 ms timings: 7624.62 6443.1 ms timings: 6741.91 6464.72 ms timings: 7464.41 6428.27 ms left: void thermalize(atom* atoms, size_t Natoms, uint32_t timestep, const Prf::key_type& key){ counter_based_engine<Prf, 32> cbeng(key); for(size_t i=0; i<Natoms; ++i){ float rmsvelocity = sqrt(kT/atoms[i].mass); normal_distribution<float> mbd(0., rmsvelocity); Prf::domain_type start = {atoms[i].id, timestep, THERMALIZE_CTXT}; cbeng.restart(start); for (size_t j=0; j<ITERATIONS; ++j) { atoms[i].vx += mbd(cbeng); atoms[i].vy += mbd(cbeng); atoms[i].vz += mbd(cbeng); } } } right: void thermalize_std(atom* atoms, size_t Natoms, uint32_t timestep){ for(size_t i=0; i<Natoms; ++i){ float rmsvelocity = sqrt(kT/atoms[i].mass); normal_distribution<float> mbd(0., rmsvelocity); threefry4x64_engine<uint32_t, 32, 20, 1, 1> eng(i); for (size_t j=0; j<ITERATIONS; ++j) { atoms[i].vx += mbd(eng); atoms[i].vy += mbd(eng); atoms[i].vz += mbd(eng); } } }

John Salmon

8 May 8 May

3:25 p.m.

On Wed, May 7, 2014 at 5:36 AM, Thijs van den Berg <thijs@sitmo.com> wrote:

...

I did some performance test, with 4 threads -O3 and ITERATIONS = 10.000.000, and there is no performance difference between the two versions (see below). My opinion is that we should stick with a standard random engine concept I used, merge our code (we both have non-overlapping bits that are needed), and think a bit more about random function concepts at a later stage. For me a random engine is al I need.

I think my version now provides you with the 'plain old engine' concept that you're looking for. It also *allows* you to do interesting things with restart(), but you can ignore those methods if you wish.. Your version had a nice feature that mine lacked so I adoptied it: the ability to control the output type, independently of the choice of pseudo-random function. With this addition, I can produce 32-bit output from a prf that internally uses 64-bit arithmetic (or vice versa). The template is now: template<typename Prf, unsigned CtrBits = static_unsigned_min<64u, Prf::range_bits/2>::value, typename UintType = typename Prf::range_type::value_type, unsigned w = std::numeric_limits<UintType>::digits> A few examples are benchmarked at the end of the random_speed.cpp that I just pushed up to github: // An assortment of counter_based_engines with a variety of // speed/space/output-type tradeoffs: // Threefry4x64, crush-resistant with a safety margin, 64-bit // output. 248-bit seed space. 2^192 restartable sequences, each // of length 2^66. distrib(iter, "threefry4x64", boost::random::counter_based_engine<boost::random::threefry<4, uint64_t>

...

());

// Threefry4x64/32, crush-resistant with a safety margin, 32-bit // output. 248-bit seed space. 2^192 restartable sequences, each // of length 2^67. distrib(iter, "threefry4x64/32", boost::random::counter_based_engine<boost::random::threefry<4, uint64_t>, 64, uint32_t>()); // Philox 2x32-7 - very small (only 6x32bits of state), // crush-resistant with no safety margin. 26-bit seed space. // Reasonably fast (50-75% of mersenne), with 2^32 restart()-able // sequences each with a length of 2^33. distrib(iter, "philox2x32-7", boost::random::counter_based_engine<boost::random::philox<2, uint32_t, 7>

...

());

// Threefry4x64, crush-resistant but with no safety margin. 248-bit // seed space. This should be the fastest. As fast as // mersenne19937, but much smaller (13x64bits), with better or equal // quality. 2^192 restartable sequences each of length 2^67 distrib(iter, "threefry4x64-12/32", boost::random::counter_based_engine<boost::random::threefry<4, uint64_t, 12>, 64, uint32_t>());

Thijs van den Berg

4:27 p.m.

...

On Wed, May 7, 2014 at 5:36 AM, Thijs van den Berg <thijs@sitmo.com> wrote:

...
I did some performance test, with 4 threads -O3 and ITERATIONS = 10.000.000, and there is no performance difference between the two versions (see below). My opinion is that we should stick with a standard random engine concept I used, merge our code (we both have non-overlapping bits that are needed), and think a bit more about random function concepts at a later stage. For me a random engine is al I need.

I think my version now provides you with the 'plain old engine' concept that you're looking for. It also *allows* you to do interesting things with restart(), but you can ignore those methods if you wish..

For me that’s a good compromise, but it’s up to Steven and (perhaps some others?) to decide what he wants with the interface. In an earlier version I gave public access to counter manipulators and those were questioned because it was non-standard interface, and so I made them private.

...

Your version had a nice feature that mine lacked so I adoptied it: the ability to control the output type, independently of the choice of pseudo-random function. With this addition, I can produce 32-bit output from a prf that internally uses 64-bit arithmetic (or vice versa). The template is now: Thanks. I wanted to provide both 32 and 64 bit random numbers because 32 bit is still used a lot. My first implementation used a fast reinterpret_cast<> but that was non endian invariant, and so I had to fix that. I think the interface and consistent behaviour is more important than speed for boost random, and I agree with that.

I think the most enjoyable way forward would be join effort into a single submission instead of competing ones? For that you will need to fix the copyright and license. What’s your view on this? Are you doing this in corporate time or personal time?

John Salmon

5:43 p.m.

On Thu, May 8, 2014 at 12:27 PM, Thijs van den Berg <thijs@sitmo.com> wrote:

...

...
I think my version now provides you with the 'plain old engine' concept that you're looking for. It also *allows* you to do interesting things with restart(), but you can ignore those methods if you wish..

For me that’s a good compromise, but it’s up to Steven and (perhaps some others?) to decide what he wants with the interface. In an earlier version I gave public access to counter manipulators and those were questioned because it was non-standard interface, and so I made them private.

Yes, I'd like to hear what others think of the new 'bona-fide-engine-with-restart-extension' design.

...

...
ability to control the output type, independently of the choice of pseudo-random function. With this addition, I can produce 32-bit output from a prf that internally uses 64-bit arithmetic (or vice versa). The template is now: Thanks. I wanted to provide both 32 and 64 bit random numbers because 32 bit is still used a lot. My first implementation used a fast reinterpret_cast<> but that was non endian invariant, and so I had to fix

Your version had a nice feature that mine lacked so I adoptied it: the that. I think the interface and consistent behaviour is more important than speed for boost random, and I agree with that.

My implementation is intended to be endian-independent. But as they say - "if it hasn't been tested, it's doesn't work", and I haven't had a chance to test on a big-endian machine, so proceed with caution..

...

I think the most enjoyable way forward would be join effort into a single submission instead of competing ones? For that you will need to fix the copyright and license. What’s your view on this? Are you doing this in corporate time or personal time?

Yes, I think it make sense to merge. Perhaps we should start using the recommended workflow, https://svn.boost.org/trac/boost/wiki/StartModMaint. Especially if that would make things easier for Steven. As you can see, my code is released under a fairly permissive license that is very close to the BSD 3-clause license. But it's not the Boost Software License, and I understand why that's a problem. I'm working on getting permission to release it under the Boost Software License. John

...

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Thijs van den Berg

9 May 9 May

7:18 a.m.

...

Yes, I think it make sense to merge. Perhaps we should start using the recommended workflow, https://svn.boost.org/trac/boost/wiki/StartModMaint. Especially if that would make things easier for Steven.

As you can see, my code is released under a fairly permissive license that is very close to the BSD 3-clause license. But it's not the Boost Software License, and I understand why that's a problem. I'm working on getting permission to release it under the Boost Software License.

Excellent! I’ll contact you via email so that we can start aligning.

Steven Watanabe

2 May 2 May

5:26 a.m.

AMDG On 04/27/2014 05:39 AM, Mathias Gaunard wrote:

...

With a single, stateless generator, everything is much simpler. One seed, one generator, completely reproducible results regardless of scheduling. The interface i want to have is this:

#pragma omp for for(int i=0; i<n; ++i) { out[i] = in[i] * random_value(i); }

The proposed RandomFunction doesn't quite work this way. It produces random numbers in blocks. (For instance, the threefry engine that started this thread, uses a 256-bit block size.) In Christ, Steven Watanabe

John Salmon

11:39 a.m.

On Fri, May 2, 2014 at 1:26 AM, Steven Watanabe <watanabesj@gmail.com>wrote:

...

AMDG

On 04/27/2014 05:39 AM, Mathias Gaunard wrote:

...
With a single, stateless generator, everything is much simpler. One seed, one generator, completely reproducible results regardless of scheduling. The interface i want to have is this:

#pragma omp for for(int i=0; i<n; ++i) { out[i] = in[i] * random_value(i); }

The proposed RandomFunction doesn't quite work this way. It produces random numbers in blocks. (For instance, the threefry engine that started this thread, uses a 256-bit block size.)

Yes. But with C++11 initializer lists, random_value is a one-liner: unsigned random_value(unsigned i){ return Threefry<2, unsigned>()({i})[0]; } Without initializer lists, you would need another line for a declaration of a domain_type. Alternatively, you may prefer to let a distribution do the rescaling for you: Threefry<2,unsigned> prf; uniform_real_distribution<float> zero_one(0., 1.); #omp for(int i=0; i<n; ++i) { out[i] = in[i]*zero_one( counter_based_urng(prf, {i}) ); } John Salmon

...

Thijs van den Berg

11:52 a.m.

On 02 May 2014, at 13:39, John Salmon <john@thesalmons.org> wrote:

...

On Fri, May 2, 2014 at 1:26 AM, Steven Watanabe <watanabesj@gmail.com>wrote:

...
AMDG

On 04/27/2014 05:39 AM, Mathias Gaunard wrote:

...
With a single, stateless generator, everything is much simpler. One seed, one generator, completely reproducible results regardless of scheduling. The interface i want to have is this:

#pragma omp for for(int i=0; i<n; ++i) { out[i] = in[i] * random_value(i); }

The proposed RandomFunction doesn't quite work this way. It produces random numbers in blocks. (For instance, the threefry engine that started this thread, uses a 256-bit block size.)

Yes. But with C++11 initializer lists, random_value is a one-liner:

unsigned random_value(unsigned i){ return Threefry<2, unsigned>()({i})[0]; }

Without initializer lists, you would need another line for a declaration of a domain_type.

Alternatively, you may prefer to let a distribution do the rescaling for you:

Threefry<2,unsigned> prf; uniform_real_distribution<float> zero_one(0., 1.); #omp for(int i=0; i<n; ++i) { out[i] = in[i]*zero_one( counter_based_urng(prf, {i}) ); }

John Salmon

The zero_one distribution can do any amount of call to the urng. E.g. the current normal_distribribution uses ziggurat with rejection sampling. The previous version uses Box Muller and consumed an even amount of random numbers. The only way I can see "counter_based_urng(prf, {i})” to be able to generate an arbitrary number of random values is when it has it’s own internal counter. In that case it’s best to use a random engine adaptor. counter_based_engine< Threefry<2,unsigned> > eng; uniform_real_distribution<float> zero_one(0., 1.); eng.seed(thread_id); for(int i=0; i<n; ++i) { out[i] = in[i]*zero_one( eng ); }

Peter Dimov

12:32 p.m.

Thijs van den Berg wrote:

...

In that case it’s best to use a random engine adaptor.

counter_based_engine< Threefry<2,unsigned> > eng; uniform_real_distribution<float> zero_one(0., 1.);

eng.seed(thread_id);

for(int i=0; i<n; ++i) { out[i] = in[i]*zero_one( eng ); }

The adaptor needs to be in the loop; you don't have a thread_id outside it. uniform_real_distribution<float> zero_one(0., 1.); #pragma omp parallel for for( int i=0; i<n; ++i ) { counter_based_engine< Threefry<2,unsigned> > eng; eng.seed( i ); out[i] = in[i]*zero_one( eng ); } In principle, this works with any engine, not just a counter-based one, as long as creating and seeding is quick and consecutive seeds result in random first output.

Thijs van den Berg

12:49 p.m.

On 02 May 2014, at 14:32, Peter Dimov <lists@pdimov.com> wrote:

...

Thijs van den Berg wrote:

...
In that case it’s best to use a random engine adaptor.

counter_based_engine< Threefry<2,unsigned> > eng; uniform_real_distribution<float> zero_one(0., 1.);

eng.seed(thread_id);

for(int i=0; i<n; ++i) { out[i] = in[i]*zero_one( eng ); }

The adaptor needs to be in the loop; you don't have a thread_id outside it. ok, .. so the block inside the loop gets automatically distributed across threads.

uniform_real_distribution<float> zero_one(0., 1.);

#pragma omp parallel for for( int i=0; i<n; ++i ) { counter_based_engine< Threefry<2,unsigned> > eng; eng.seed( i );

out[i] = in[i]*zero_one( eng ); }

In principle, this works with any engine, not just a counter-based one, as long as creating and seeding is quick and consecutive seeds result in random first output.

The way I read this ”first random output” is that you are saying that it will only use 1 random number from the block generated by the random function (for threefry e.g. 4x 64bits)? This is probably an artefact of the example you chose? Normally each thread would do millions of samples? I thought the idea was to *not* have the engine store a counter but use the “i” from the loop. I now understand that the loop is the parallel part. The counter_based_engine design I’m working on (I’ll share an example soon) works with 3 type of random function: * random function that convert an input buffer to an output buffer. For these I use the input as an counter, and let the seed() set the highest bits of the counter. The engine uses the following random function interfaces void random_function(input&,output&) * random function that also use an additional key. For this version I have the seed() random engine set a key buffer, and that key gets passed to the random function random_function(input&,output&,key&) * random function that have a state dependent of the key, and computing the state takes some time. These random functions have an additional member function that the counter_engine uses when it gets seeded random_function.set_key(key&)// called by random engine seed() ... random_function(input&,output&) // sometimes called by random engine operator(), whenever it needs a new batch of random data

Steven Watanabe

3:52 p.m.

AMDG On 05/02/2014 06:49 AM, Thijs van den Berg wrote:

...

The counter_based_engine design I’m working on (I’ll share an example soon) works with 3 type of random function: * random function that convert an input buffer to an output buffer. For these I use the input as an counter, and let the seed() set the highest bits of the counter. The engine uses the following random function interfaces void random_function(input&,output&)

* random function that also use an additional key. For this version I have the seed() random engine set a key buffer, and that key gets passed to the random function random_function(input&,output&,key&)

* random function that have a state dependent of the key, and computing the state takes some time. These random functions have an additional member function that the counter_engine uses when it gets seeded random_function.set_key(key&)// called by random engine seed() ... random_function(input&,output&) // sometimes called by random engine operator(), whenever it needs a new batch of random data

This is much too complicated. There should be exactly one kind of RandomFunction. Either pass the key always, and make it empty when it isn't needed, or store the key in the random function and never pass it. In Christ, Steven Watanabe

Thijs (M.A.) van den Berg

6:32 p.m.

...

AMDG

On 05/02/2014 06:49 AM, Thijs van den Berg wrote:

...
The counter_based_engine design I’m working on (I’ll share an example soon) works with 3 type of random function:

...

This is much too complicated. There should be exactly one kind of RandomFunction. Either pass the key always, and make it empty when it isn't needed, or store the key in the random function and never pass it.

In Christ, Steven Watanabe

From a performance point of view the only option will probably be to never pass instead of always. If we give a random function the same seed interface like a random engine for setting the key then a random function will look a lot like a random engine. If will fill an external output buffer based on scrambling an external input buffer. The counter engine adaptor / wrapper will * own the input and output container * manipulate the input container as a counter * forwards the seeding directly to the random function * use the output container as a source for generating samples Would this work for the function in random 123?

Thijs van den Berg

5 May 5 May

2:16 p.m.

After experimenting with random function concepts I’m not pleased with the design options and concept. Instead I’ve decided to complete the threefry random engine I had nearly finished. I’ve added a couple of extra features base on our discussions. There are a couple of reasons I prefer an random engine over a random function: * I prefer -as a user of boost random- the random engine concept instead of a new random function interface. Sticking to a random engine interface makes my user code more generic, it will work with any random engine and I don’t need to learn a new concepts. * For me there is no benefit of using a random function over the random engine. There were a couple of arguments why a random function could under some circumstances be preferred. 1) The cost of construction and seeding a random engine. 2) memory usage, a random function can work on eternal containers and counters. To address these points I’ve refined my threefry based random engine. I’ve allowed for a [1,2,3, or 4] x 64 bit counter and a [0,1,2,3,4] x 64 bit key. Reducing the size of of the key and counter speeds up the algorithm and lowers the memory usage. I’ve set the default to 1x64 bit counter and a 1x64 bit key. This gives a cycle length of 2^66 (7.3e19) and allows for 2^64 (1.8e19) different seeds and independent streams. For normal usage this is more than enough. Of-course one can always set the template parameters to 4. The default setting have low memory usage, higher speed, and give the exact same result as the original engine if you stay within the counter cycle length (the first 7.4e19 draws) and key bounds. It’s faster than the previous threefry because some operations involving the key that are known to be zero can be dropped. In this case the cost of a seed() is neglect-able: it reduces to setting an int. I’ve also renames the random engine template to threefry4x64_engine to ensure that the name is compatible in case we refactor it into a the counter based engine and a random function, or if other developers add additional variants. Personally having this engine available in boost would already solve all my parallel needs. Usage will be like any other engine, i.e. using namespace boost::random; normal_distribution nd; for(size_t i=0; i<atoms.size(); ++i) { float boltzmannfactor = sqrt(kT/atoms[i].mass); threefry4x64_13 eng(i); atoms[i].vx = boltzmannfactor*nd(eng); atoms[i].vy = boltzmannfactor*nd(eng); atoms[i].vz = boltzmannfactor*nd(eng); } Main compile-time options:: * generates 8,16,32 or 64 bit random numbers * arbitrary number of encryption rounds >=1 * flexible counter size: counter can be 64,128, 192 or 256 bit. * flexible key size: 0, 64, 128, 192 or 256 bits. The engine has been fully tested and benchmarked and can be found here: https://github.com/sitmo/threefry

John Salmon

2 May 2 May

2:30 p.m.

On Fri, May 2, 2014 at 8:32 AM, Peter Dimov <lists@pdimov.com> wrote:

...

Thijs van den Berg wrote:

...
In that case it’s best to use a random engine adaptor.

counter_based_engine< Threefry<2,unsigned> > eng; uniform_real_distribution<float> zero_one(0., 1.);

eng.seed(thread_id);

for(int i=0; i<n; ++i) { out[i] = in[i]*zero_one( eng ); }

The adaptor needs to be in the loop; you don't have a thread_id outside it.

uniform_real_distribution<float> zero_one(0., 1.);

#pragma omp parallel for for( int i=0; i<n; ++i ) { counter_based_engine< Threefry<2,unsigned> > eng; eng.seed( i );

out[i] = in[i]*zero_one( eng ); }

In principle, this works with any engine, not just a counter-based one, as long as creating and seeding is quick and consecutive seeds result in random first output.

Yes. But Boost's documentation advises that in general, these requirements are *not* met: <snip http://www.boost.org/doc/libs/1_55_0/doc/html/boost_random/reference.html#bo...

...

Pseudo-random number generators should not be constructed (initialized) frequently during program execution, for two reasons. First, initialization requires full initialization of the internal state of the generator. Thus, generators with a lot of internal state (see below) are costly to initialize. Second, initialization always requires some value used as a "seed" for the generated sequence. It is usually difficult to obtain several good seed values... </snip> This is very good advice because as far as I know, none of the existing engines satisfy all the requirements: small state, quick-to-initialize, quick-to-run and high-quality, statistically independent results when seeded consecutively. Counter-based RNGs are specifically designed to meet these requirements. State ranges from 8 to 32 bytes. They generate 8 to 32 random bytes at a time at rates of 2-4 cycles-per-byte. Initialization time is negligible compared to generation. Output passes TestU01. John Salmon

Thijs van den Berg

2:44 p.m.

On 02 May 2014, at 16:30, John Salmon <john@thesalmons.org> wrote:

...

On Fri, May 2, 2014 at 8:32 AM, Peter Dimov <lists@pdimov.com> wrote:

...
Thijs van den Berg wrote:

...
In that case it’s best to use a random engine adaptor.

counter_based_engine< Threefry<2,unsigned> > eng; uniform_real_distribution<float> zero_one(0., 1.);

eng.seed(thread_id);

for(int i=0; i<n; ++i) { out[i] = in[i]*zero_one( eng ); }

The adaptor needs to be in the loop; you don't have a thread_id outside it.

uniform_real_distribution<float> zero_one(0., 1.);

#pragma omp parallel for for( int i=0; i<n; ++i ) { counter_based_engine< Threefry<2,unsigned> > eng; eng.seed( i );

out[i] = in[i]*zero_one( eng ); }

In principle, this works with any engine, not just a counter-based one, as long as creating and seeding is quick and consecutive seeds result in random first output.

Yes. But Boost's documentation advises that in general, these requirements are *not* met:

<snip http://www.boost.org/doc/libs/1_55_0/doc/html/boost_random/reference.html#bo...

...
Pseudo-random number generators should not be constructed (initialized) frequently during program execution, for two reasons. First, initialization requires full initialization of the internal state of the generator. Thus, generators with a lot of internal state (see below) are costly to initialize. Second, initialization always requires some value used as a "seed" for the generated sequence. It is usually difficult to obtain several good seed values... </snip>

This is very good advice because as far as I know, none of the existing engines satisfy all the requirements: small state, quick-to-initialize, quick-to-run and high-quality, statistically independent results when seeded consecutively.

Counter-based RNGs are specifically designed to meet these requirements. State ranges from 8 to 32 bytes. They generate 8 to 32 random bytes at a time at rates of 2-4 cycles-per-byte. Initialization time is negligible compared to generation. Output passes TestU01.

John Salmon

I think there isn’t any discussion about counter based random engines being a good idea. What needs to be made clear is the benefits of having a random function concept next to a random engine concept. When would a user have clear benefit of using a random function directly instead of a random engine that implements the same algorithm? The concept and benefit need to be made clear IMO. So far I’ve hear about 4 different types of random functions. The challenge is to capture them all in a single concepts.

Steven Watanabe

3:57 p.m.

AMDG On 05/02/2014 05:39 AM, John Salmon wrote:

...

Yes. But with C++11 initializer lists, random_value is a one-liner:

unsigned random_value(unsigned i){ return Threefry<2, unsigned>()({i})[0]; }

Sure, that works, as long as you don't mind throwing away the rest of the outputs. In Christ, Steven Watanabe

4049

Age (days ago)

4106

Last active (days ago)

List overview

Download

60 comments

10 participants

participants (10)

Beman Dawes
John Bytheway
John Salmon
Mathias Gaunard
Nat Goodspeed
Paul A. Bristow
Peter Dimov
Steven Watanabe
Thijs (M.A.) van den Berg
Thijs van den Berg