On Tue, May 6, 2014 at 3:51 PM, Thijs van den Berg
wrote: I think you'll like the new example code better. From the new README:
    typedef threefry<4, uint32_t> Prf;
    counter_based_engine<Prf, 32> cbeng(seed);
    for (size_t i=0; i<...; ++i) { ... }

You can seed the counter_based_engine directly with an arithmetic seed, or, if you prefer, you can use the much larger key_type, or you can pre-initialize a Prf and use that.
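Two of those seeding styles already have direct analogues in the standard library; a minimal, runnable sketch using std::mt19937_64 as a stand-in engine (the engine choice and the function names here are illustrative, not part of the proposal):

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Two standard seeding styles that the counter_based_engine discussion
// mirrors: a single arithmetic seed, and a larger block of seed material
// delivered through a SeedSeq (playing the role of the bigger key_type).
std::uint64_t draw_with_arithmetic_seed(std::uint64_t seed) {
    std::mt19937_64 eng(seed);            // arithmetic seed
    return eng();
}

std::uint64_t draw_with_seed_seq() {
    std::seed_seq sseq{1u, 2u, 3u, 4u};   // larger seed material
    std::mt19937_64 eng(sseq);
    return eng();
}
```

Either way, equal seed material must give equal sequences, which is the property the engines below rely on.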
I tried to examine what happened when you draw more samples inside the loop, like this:

    for (size_t j=0; j<100; ++j) {
        atoms[i].vx += mbd(cbeng);
        atoms[i].vy += mbd(cbeng);
        atoms[i].vz += mbd(cbeng);
    }

and that caused an abort trap. I expect that the cbeng ran out of samples, and that is not the behaviour you want.
Yes. It ran out because in the counter_based_example.cpp file it's declared with a 5-bit counter:
    counter_based_engine<Prf, 5> cbeng;

This was a very bad choice on my part. The README declares cbeng with a 32-bit counter, and explains that the engine can produce 4*2^32 values before running out. Had I used that in the example.cpp, you would not have had a problem.
I'll push an updated counter_based_example.cpp with CtrBits=32 momentarily. If you change your copy to match the README, i.e., change 5 to 32, then you'll have a 32-bit counter and can generate a few billion normals in your loop before running out. I just ran it with:
Thanks for explaining that. I didn't realise that; I thought the random function output (4x64) was being read past the end of the buffer. 32 bits is indeed more than enough, great to see this!
    counter_based_engine<Prf, 32> cbeng;
    ...
    for (size_t j=0; j<1000000000; ++j) { ... }
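The two counter widths translate into concrete capacities, which is enough to predict the abort by arithmetic. A small runnable check (the 4-outputs-per-block figure comes from the 4-wide threefry configuration discussed above):

```cpp
#include <cassert>
#include <cstdint>

// Values available from a counter_based_engine-style generator:
// 4 outputs per Prf block, times 2^counter_bits counter values.
// (Valid for counter_bits < 62; larger widths would overflow uint64_t.)
std::uint64_t capacity(unsigned counter_bits) {
    return 4ull << counter_bits;
}
```

With a 5-bit counter, capacity(5) is 128, fewer than the 3*100 = 300 draws the modified loop makes per atom, hence the abort; with 32 bits it is 4*2^32, comfortably more than a billion.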
That’s a good (speed) test. I’ll use that too.
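A speed test along those lines needs only a timed loop; a rough, self-contained harness, using std::mt19937_64 purely as a stand-in engine (swap in whichever engine is under test):

```cpp
#include <chrono>
#include <cstdint>
#include <random>

// Time n raw draws and return draws per second. The sink accumulator and
// the volatile copy keep the compiler from optimizing the loop away.
double draws_per_second(std::uint64_t n) {
    std::mt19937_64 eng(12345);
    std::uint64_t sink = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::uint64_t j = 0; j < n; ++j) sink += eng();
    auto t1 = std::chrono::steady_clock::now();
    volatile std::uint64_t keep = sink;
    (void)keep;
    double secs = std::chrono::duration<double>(t1 - t0).count();
    return n / secs;
}
```

Absolute numbers depend on the machine and compiler flags, so only relative comparisons between engines built the same way are meaningful.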
I'm happy to look at optimization. But it should probably wait till we've agreed on an API. My version is also aggressively optimized, e.g., I used mpl::foreach to unroll some loops because gcc wouldn't unroll them without help. If you've found ways to make it even smaller or faster, I'm eager to learn.
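For reference, compiler-guaranteed unrolling does not have to go through mpl; in C++17 a pack expansion over an index_sequence does the same job as mpl::foreach, in a few lines. A sketch (the per-round callable here is a placeholder, not threefry's round function):

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Compile-time unrolling without Boost.MPL: expand fn once per round
// via a comma-fold over an index_sequence, so the round count is a
// compile-time constant and the calls are fully expanded.
template <typename Fn, std::size_t... Is>
void unroll_impl(Fn&& fn, std::index_sequence<Is...>) {
    (fn(Is), ...);  // one call per round index
}

template <std::size_t N, typename Fn>
void unroll(Fn&& fn) {
    unroll_impl(std::forward<Fn>(fn), std::make_index_sequence<N>{});
}
```

For example, unroll<13>(round_fn) expands to thirteen consecutive calls with indices 0..12, which is the shape a 13-round threefry variant wants.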
Yes, I agree we should first look at the interface/concepts, like we are doing now. However, the speed of threefry is one of its selling points, and it's natural to look at that too. I ran into the same issue as you with mpl (when not using the foreach) and ended up doing manual unrolling in chunks of 40 rounds. I wasn't too happy with that, because the mpl code is much more elegant and compact; I didn't know that foreach would have solved it.
One of the extra optimisations I realised I could do was to (allow the user to) limit the key size (my engine stores the key in a private variable). The default is 4x64 bits, plus an additional 64 bits for the xor, but one typically doesn't need 2^256 independent streams (if you use the key for seeding); 2^64 is probably more than enough, and sometimes no key at all is fine (you can then use leapfrogging or discard if you still want independent streams). Reducing the key size reduces the number of instructions in the key round, and it reduces memory usage. The key and counter sizes are far larger than typically needed, and that gives room for reduction.
There is also another reason I prefer a stripped-down version of the key: it allows you to give the engine a standard random-engine interface. We can suffice with a 64-bit seed, and that removes the need to expose the user to the internal types (key, domain). If you need a bigger key, there is always the seedseq interface.
I think the main difference between our versions right now is:
* "engine construction outside the loop + constructing a domain_type inside the loop + restart()"
    typedef threefry<4, uint32_t> Prf;
    counter_based_engine<Prf, 32> cbeng(seed);
    ...
    Prf::domain_type base = {atoms[i].id, timestep, THERMALIZE_CTXT};
    cbeng.restart(base);
* vs. constructor() inside the loop

    threefry4x64_13 eng(i);
To give some details for our discussion: my engine uses a private counter of size {1,2,3,4} x 64 bits, and I use the seed(i) and constructor interfaces to set the key. When I set the key via seed or via the constructor, I always reset the counter to zero (because when two engines are seeded equally, their sequences should be equal). The constructor and the regular seed limit the seed to 64 bits, but the (slower) seedseq interface can be used to set the full key if it has more than 64 bits.
In your engine things seem to be connected differently? The counter is split into a high and a low part, giving you three different types of input (including the key).
I think that changing your version above to the version below is better, because it only uses standard random-engine concepts (less terminology and fewer tweakable options) and is more compact. My view is that new concepts like domain_type and restart() have an associated learning cost and cause vendor lock-in for developers (their code becomes dependent on this specific engine). That's why I want a standard engine unless there is a *huge* benefit for deviating. Do you think there is a significant performance penalty here below, or do you simply prefer the interface above? Even if there is a performance hit, shouldn't we at least have a standard engine without any interface extensions, and put the high-performance version behind a different interface/concept?
    for (size_t i=0; i<...; ++i) {
        counter_based_engine<Prf, 32> cbeng(atoms[i].id);
        ...
    }
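To make the "standard engine, constructor inside the loop" shape concrete, here is a deliberately toy counter-based generator. The mixing function is SplitMix64's finalizer standing in for threefry, and the class name is made up; the point is only that when the whole state is (key, counter), constructing one engine per stream inside a loop is trivially cheap:

```cpp
#include <cassert>
#include <cstdint>

// Toy counter-based generator with the plain UniformRandomBitGenerator
// shape: state is just (key, counter). Seeding with the same key always
// restarts the counter at zero, so equal keys give equal sequences.
// The output mix is SplitMix64's finalizer, used here ONLY as an
// illustrative stand-in for threefry.
class toy_counter_engine {
public:
    using result_type = std::uint64_t;
    explicit toy_counter_engine(result_type key = 0) : key_(key), ctr_(0) {}
    static constexpr result_type min() { return 0; }
    static constexpr result_type max() { return ~result_type(0); }
    result_type operator()() {
        // Mix (key, counter) into one output; the finalizer is a bijection,
        // so distinct (key, counter) pairs give distinct outputs.
        std::uint64_t z = key_ + 0x9e3779b97f4a7c15ull * ++ctr_;
        z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ull;
        z = (z ^ (z >> 27)) * 0x94d049bb133111ebull;
        return z ^ (z >> 31);
    }
private:
    std::uint64_t key_, ctr_;
};
```

Usage matches the loop above: construct toy_counter_engine(atoms[i].id) per atom, and each atom gets its own reproducible stream with no domain_type or restart() in sight.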