On 10/26/2016 4:32 AM, Larry Evans wrote:
On 10/26/2016 02:27 AM, Michael Marcin wrote:
i.e. 4 floats have to be contiguous in memory, and the *first* float has to be aligned to 16 bytes.
So why not:
alignas(16) std::array data;
IOW, does decltype(data) have to have the required alignment, or does &data have to have that alignment?
All that matters is that the address of the first float is 16-byte
aligned and that the number of floats in your array is divisible by 4.
Since SSE processes 4 floats at a time, the 2nd group of 4 floats is
also 16-byte aligned (sizeof(float)*4 == 16), and so is every group
after it. Note: different instruction sets and hardware support
different data types and alignments.
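To make that concrete, here is a minimal sketch of checking the address rather than the type. The helper names is_aligned_16 and all_groups_aligned are made up for illustration, and it assumes a 4-byte float and that std::array stores its elements at the start of the object (true on the common implementations, as far as I know).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

static_assert(sizeof(float) * 4 == 16, "sketch assumes a 4-byte float");

// Hypothetical helper: true when p sits on a 16-byte boundary.
inline bool is_aligned_16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: checks that every complete group of 4 floats
// starts on a 16-byte boundary. If the first float is aligned, all
// following groups are too, because each group is exactly 16 bytes.
inline bool all_groups_aligned(const float* first, std::size_t count) {
    for (std::size_t i = 0; i + 4 <= count; i += 4) {
        if (!is_aligned_16(first + i)) return false;
    }
    return true;
}
```

With a declaration such as `alignas(16) std::array<float, 8> data{};`, both `is_aligned_16(data.data())` and `all_groups_aligned(data.data(), data.size())` should hold, while `data.data() + 1` is only 4-byte aligned.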
This is why all the particle_count values I used in the emitter example
were multiples of 4 (and multiples of 64 in the tests that use the
bit_vector, which packs 64 bools into a uint64_t).
SSE2 has instructions to operate on:
- 2 double
- 2 int64_t
- 4 float
- 4 int32_t
- 8 short
- 16 char
The aligned load and store instructions for all of these require the
pointer to the data to be 16-byte aligned, and each group is exactly
16 bytes, so you can operate on successive runs of data in an
appropriately aligned array.
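The element counts in the list above are just 16 bytes divided by the element size. A small sketch, where lanes is a made-up helper name and the static_asserts assume the usual type sizes (8-byte double, 4-byte float, 2-byte short):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of elements of type T that fit in one
// 16-byte SSE2 register.
template <typename T>
constexpr std::size_t lanes() {
    return 16 / sizeof(T);
}

// These hold on platforms with the usual type sizes.
static_assert(lanes<double>() == 2, "assumes 8-byte double");
static_assert(lanes<std::int64_t>() == 2, "");
static_assert(lanes<float>() == 4, "assumes 4-byte float");
static_assert(lanes<std::int32_t>() == 4, "");
static_assert(lanes<short>() == 8, "assumes 2-byte short");
static_assert(lanes<char>() == 16, "");
```

Since lanes<T>() * sizeof(T) is always 16 for these types, stepping through an aligned array one register at a time keeps every step on a 16-byte boundary.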
If you don't know that your particle_count is a multiple of 4, you need
to write more code.
For example, with an array of 39 floats you can either pad it out to 40
floats and use SSE on the whole thing, or you can use SSE on the first
36 floats (36/4 = 9 iterations) and have a non-vectorized
implementation of the same algorithm handle the last 3 floats.
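A rough sketch of the second option, using an element-wise add as a stand-in algorithm. The function name add_arrays is made up, and it assumes all three pointers are already 16-byte aligned. On compilers that don't define __SSE__ it simply runs the scalar loop over everything.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#ifdef __SSE__
#include <xmmintrin.h>
#endif

// Hypothetical example operation: out[i] = a[i] + b[i].
// Assumes a, b and out are 16-byte aligned; count can be any value.
inline void add_arrays(const float* a, const float* b, float* out,
                       std::size_t count) {
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(out) % 16 == 0);

    std::size_t i = 0;
#ifdef __SSE__
    // Vectorized part: whole groups of 4 floats (39 -> 36, 9 iterations).
    const std::size_t vec_count = count - count % 4;
    for (; i < vec_count; i += 4) {
        const __m128 va = _mm_load_ps(a + i);  // aligned load
        const __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Non-vectorized tail: the remaining 0-3 floats.
    for (; i < count; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Padding to 40 floats instead would let you drop the tail loop, at the cost of computing one element nobody reads.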
If you don't know the alignment of your data, this technique also
applies to the beginning of the array. You can use the non-vectorized
algorithm to process the first 0-3 floats until you reach a 16-byte
boundary, then process all the 16-byte aligned groups of 4 floats, and
then return to the non-vectorized implementation for the 0-3 floats
left at the end of the array.
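Sketched out, the head/body/tail split might look like the following. This is only an illustration: scale is a made-up function name, the in-place multiply stands in for whatever your algorithm is, and compilers that don't define __SSE__ just run the scalar loops.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#ifdef __SSE__
#include <xmmintrin.h>
#endif

// Hypothetical example operation: data[i] *= factor, for a pointer of
// unknown alignment and any count.
inline void scale(float* data, std::size_t count, float factor) {
    std::size_t i = 0;

    // Head: scalar until data + i reaches a 16-byte boundary
    // (0-3 floats when data is at least float-aligned).
    while (i < count &&
           reinterpret_cast<std::uintptr_t>(data + i) % 16 != 0) {
        data[i] *= factor;
        ++i;
    }

#ifdef __SSE__
    // Body: aligned groups of 4 floats.
    const __m128 vf = _mm_set1_ps(factor);
    for (; i + 4 <= count; i += 4) {
        _mm_store_ps(data + i, _mm_mul_ps(_mm_load_ps(data + i), vf));
    }
#endif

    // Tail: scalar for the 0-3 floats that are left.
    for (; i < count; ++i) {
        data[i] *= factor;
    }
}
```

Calling it on a pointer one float past a 16-byte boundary with count 39 does 3 scalar head iterations, 9 vector iterations, and no tail.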
This is pretty much what compilers do when they vectorize a loop.
alignas(16) std::array