On Tue, May 6, 2014 at 3:51 PM, Thijs van den Berg wrote:
>> I think you'll like the new example code better. From the new README:
>>
>>     typedef threefry<4, uint32_t> Prf;
>>     counter_based_engine<Prf, ...> cbeng(seed);
>>     for(size_t i=0; i<...; ++i){
>>         ...
>>     }
>>
>> You can seed the counter_based_engine directly with an arithmetic seed,
>> or, if you prefer, you can use the much larger key_type, or you can
>> pre-initialize a Prf and use that.
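For concreteness, here is a minimal, self-contained sketch of that pattern:
one engine, restarted once per atom so that each atom gets its own
independent stream, then three normal draws per restart. The Atom struct,
the header paths, the seed, the 32-bit counter width and the exact values
packed into the domain_type are illustrative guesses, not the README's
actual text.

    // Sketch only: header paths for the proposed engine are assumed;
    // adjust to wherever counter_based_engine and threefry actually live.
    #include <boost/random/counter_based_engine.hpp>
    #include <boost/random/threefry.hpp>
    #include <boost/random/normal_distribution.hpp>
    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Atom { uint32_t id; float mass, vx, vy, vz; };

    void thermalize(std::vector<Atom>& atoms, uint32_t timestep,
                    uint32_t seed, float kT)
    {
        typedef threefry<4, uint32_t> Prf;
        counter_based_engine<Prf, 32> cbeng(seed); // arithmetic seed; a
                                                   // Prf::key_type or a
                                                   // pre-initialized Prf
                                                   // would also work
        for (std::size_t i = 0; i < atoms.size(); ++i) {
            boost::random::normal_distribution<float>
                mbd(0.f, std::sqrt(kT / atoms[i].mass));
            // A starting point that is unique per (atom, timestep); the
            // exact domain_type layout is as documented in the proposal.
            Prf::domain_type start = {{ atoms[i].id, timestep }};
            cbeng.restart(start);
            atoms[i].vx = mbd(cbeng);
            atoms[i].vy = mbd(cbeng);
            atoms[i].vz = mbd(cbeng);
        }
    }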
> I tried to examine what happened when you draw more samples inside the
> loop, like this
>
>     for (size_t j=0; j<100; ++j) {
>         atoms[i].vx += mbd(cbeng);
>         atoms[i].vy += mbd(cbeng);
>         atoms[i].vz += mbd(cbeng);
>     }
>
> and that caused an abort trap. I expect that the cbeng ran out of samples
> and that is not the behaviour you want.
Yes. It ran out because in the counter_based_example.cpp file it's
declared with a 5-bit counter:

    counter_based_engine<Prf, 5> cbeng(seed);
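To see why it aborted: with threefry<4, uint32_t>, each counter value
yields four 32-bit results, so a 5-bit counter gives roughly 2^5 * 4 = 128
values per restart(), while the modified loop above asks for at least
100 * 3 = 300 (assuming, optimistically, one 32-bit value per normal
draw). A quick check of that arithmetic:

    #include <cstdio>

    int main() {
        // Capacity of one restart() period for a 5-bit counter:
        // 2^5 counter values, four 32-bit outputs per counter value.
        const unsigned ctr_bits = 5;
        const unsigned long long capacity = (1ull << ctr_bits) * 4; // 128

        // The modified loop: 100 iterations x 3 draws, assuming
        // (optimistically) one 32-bit value consumed per normal draw.
        const unsigned long long needed = 100ull * 3;               // 300

        std::printf("capacity=%llu, needed>=%llu -> %s\n",
                    capacity, needed,
                    needed > capacity ? "engine exhausted" : "ok");
        return 0;
    }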
> It's hard for the user to know in advance the required amount of
> samples. mbd() can be a probability distribution that consumes many
> random samples from the engine. E.g., I've just finished a distribution
> sampler that consumes 4 per draw.
Yes. Some distributions require even more. IIRC, the Poisson distribution
can sometimes require O(10). If you're using the generator the way I
advocate (restarted every time through a loop), then 32 counter bits will
likely be more than enough. If you're using it in a more conventional way
(one generator per thread), then a 64-bit counter is probably a better
idea, as it will take centuries to count down.

It might make sense to have a default value for CtrBits. I'd go with the
minimum of 64 and half the width of the Prf's range, i.e., 32 for
threefry2x32 and philox2x32 and 64 for everything else.
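If such a default were adopted, it could be computed at compile time from
the Prf's block parameters. A sketch of the rule (the trait below is
illustrative, not part of the proposed API):

    #include <cstdint>
    #include <limits>

    // Suggested default: CtrBits = min(64, range width of the Prf / 2),
    // where the range width of threefry<N, Uint> is N * bits(Uint).
    template <unsigned N, typename Uint>
    struct default_ctr_bits
    {
        static const unsigned range_width =
            N * std::numeric_limits<Uint>::digits;
        static const unsigned value =
            (range_width / 2 < 64) ? range_width / 2 : 64;
    };

    // threefry2x32 / philox2x32 get 32 counter bits; wider PRFs get 64.
    static_assert(default_ctr_bits<2, std::uint32_t>::value == 32, "");
    static_assert(default_ctr_bits<4, std::uint32_t>::value == 64, "");
    static_assert(default_ctr_bits<2, std::uint64_t>::value == 64, "");
    static_assert(default_ctr_bits<4, std::uint64_t>::value == 64, "");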
> The same issue arises if you have many more variables to set than
> vx,vy,vz, which makes the usage error-prone. One solution for this
> particular example is to move the domain_type variable inside the 2nd
> loop and make it a function of both i and j from the loops. However,
> that still isn't a nice design because it can be very inefficient: every
> time you call restart() it will generate 4x64 bits of random data on its
> first call, but you only consume 3x32 bits. Putting the burden of
> optimising this on the user is not a good idea.
Actually, it's only 4x32 bits on each call. But your point stands. To get
the best possible performance one needs to be aware of some details. How
many random values, and of what size, you get with each "turn of the
crank" are important details. If you ignore them, you'll get good,
reliable, correctly distributed values, but you might waste some time.
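To put a number on it for this example (again assuming one 32-bit value
consumed per normal draw): each restart() produces a 4x32-bit block, the
three velocity draws use three of the four words, and the fourth is thrown
away at the next restart, so roughly a quarter of the generated bits are
wasted.

    #include <cstdio>

    int main() {
        // Per restart() in the thermalization loop, assuming one
        // 32-bit value per normal draw (a simplification).
        const unsigned generated_bits = 4 * 32; // one threefry4x32 block
        const unsigned consumed_bits  = 3 * 32; // vx, vy, vz
        std::printf("wasted per atom: %.0f%%\n",
                    100.0 * (generated_bits - consumed_bits) / generated_bits);
        return 0;
    }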
> The version I've finished here, https://github.com/sitmo/threefry, is a
> standard random engine concept and doesn't need a domain_type, key_type
> or restart(), and doesn't have a low limit on the number of samples that
> users need to be aware of. It's intuitive to use like any other random
> engine, as can be seen in the next example. I *think* it also uses less
> memory and is also faster, because I've optimised the engine for its
> purpose.
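In other words, an engine that models the standard concept drops straight
into the usual seed/draw/discard idiom with any standard or Boost
distribution, roughly like this (std::mt19937 stands in below because the
actual class name from the sitmo repository isn't quoted here):

    #include <iostream>
    #include <random>

    // Any standard-conforming engine can be swapped in for Engine.
    using Engine = std::mt19937;  // placeholder for the sitmo threefry engine

    int main() {
        Engine eng(20140506u);                 // seed like any other engine
        std::normal_distribution<float> mbd(0.f, 1.f);

        // No domain_type, key_type or restart(): just keep drawing.
        for (int i = 0; i < 5; ++i)
            std::cout << mbd(eng) << '\n';

        eng.discard(1000);                     // standard discard() also works
        return 0;
    }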
I'm happy to look at optimization. But it should probably wait till we've
agreed on an API. My version is also aggressively optimized, e.g., I used
mpl::for_each to unroll some loops because gcc wouldn't unroll them
without help. If you've found ways to make it even smaller or faster, I'm
eager to learn.

John Salmon