On 18.08.2015 22:16, Joel FALCOU wrote:
On 18/08/2015 11:12, Andrey Semashev wrote:
For some data it is enough to return a meaningful pessimistic default in case the actual value cannot be obtained. E.g. for ISA extensions we could return 'not supported', for cache size return 0, for the OS version string return an empty string (or a fixed string based on the data available at compile time), and so on.
For other data this doesn't quite work though. We can't return 0 as the system RAM size, for instance - or rather we can, but then the user's application would have to check for this special value. I'm not sure what is best in this case.
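Just to contrast the two options being weighed here (the names and signatures below are hypothetical, not a proposed API), the choice is roughly between a sentinel value and making the "unknown" state explicit in the return type:

    #include <cstdint>
    #include <boost/optional.hpp>

    namespace sys_info {

    // Alternative 1: pessimistic sentinel; 0 means "could not be determined".
    // Cheap to use, but the caller must remember to check for the magic value.
    std::uint64_t ram_size_or_zero();

    // Alternative 2: the "unknown" state is explicit in the return type,
    // so a failed query cannot be silently mistaken for a real size.
    boost::optional< std::uint64_t > ram_size();

    } // namespace sys_info

    // Possible caller code for alternative 2:
    void configure(std::uint64_t& buffer_budget)
    {
        if (boost::optional< std::uint64_t > ram = sys_info::ram_size())
            buffer_budget = *ram / 4;  // size buffers from the real value
        // else: keep the compile-time default
    }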
One other thing to consider, as we had reports of this issue from our users, is that such a facility should have cached and non-cached retrieval functions.
Using CPUID to grab the SIMD features, for example, is slow enough to have a noticeable impact on computation in some cases, hence the need for caching. I think some of those values are static anyway (you won't remove a CPU feature mid-flight) and must be cached in a static value at start-up. Others, like the amount of available free RAM, must not be.
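A minimal sketch of what that split might look like (the function names are hypothetical, and the function-local static is just one possible caching policy):

    #include <cstdint>

    namespace sys_info {

    // Non-cached: goes to the OS on every call. Appropriate for
    // volatile quantities such as the amount of free RAM.
    std::uint64_t query_free_ram();  // platform-specific implementation omitted

    // Expensive probe of the CPU. Shown here with the GCC/Clang builtin;
    // a real implementation would use CPUID directly.
    inline bool query_has_sse2()
    {
        return __builtin_cpu_supports("sse2");
    }

    // Cached: static data such as ISA extensions is probed once, on first
    // use, and then served from a function-local static (initialization is
    // thread-safe since C++11).
    inline bool has_sse2()
    {
        static const bool value = query_has_sse2();
        return value;
    }

    } // namespace sys_info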
I believe most of the API should be non-caching, and where caching is reasonable we should probably think of a stateful approach. Let the user cache the state, if needed, and also deal with the inherent thread safety issues.

The CPU features are especially difficult because there are two usage patterns for this information that I have faced:

1. Collect all necessary CPU info at once and then use it to configure the user's application (e.g. set up function tables and constants). This is typically done relatively rarely, like at application startup or some internal context initialization.

2. Query for one or a few features, perhaps for use in a local condition to jump to a specialized code branch. The code that makes this query can be called often, so it must be fast.

Satisfying both these patterns in an effective way is not easy, but I think it should be possible if we represent the CPU feature collection as an object that the user can create and cache, if needed. The features can be obtained lazily and cached within this object. Something along these lines (pseudo-code):

namespace boost::sys_info::cpu {

enum class feature_tag
{
    // Arch-specific values
    sse,
    sse2,
    ...
    _count
};

template< typename... Features >
struct feature_list;

// This struct can be specialized for different feature tags
template< feature_tag Tag >
struct feature
{
    // In specializations, we can describe pre-requisites
    // for each feature, e.g. there must be OSXSAVE and
    // the OS must be saving/restoring YMM registers
    // in order to be able to use AVX.
    typedef feature_list< ... > prerequisites;
};

constexpr feature< feature_tag::sse > sse = {};
constexpr feature< feature_tag::sse2 > sse2 = {};
...

class features
{
    // The flags indicate which features have been queried
    std::bitset< feature_tag::_count > m_cached;
    // The flags indicate which features are supported
    std::bitset< feature_tag::_count > m_values;

public:
    // By default creates an empty object. The only thing it
    // may need to do is to obtain the max cpuid function number.
    // If do_init == true, calls init() automatically.
    explicit features(bool do_init = false);

    // Obtains all features at once
    void init();

    // If not cached already, tests for the feature and its
    // pre-requisites and returns the flag
    template< feature_tag Tag >
    bool operator[] (feature< Tag >);
};

} // namespace boost::sys_info::cpu

// Usage example
void foo()
{
    namespace cpu = boost::sys_info::cpu;

    cpu::features f;
    if (f[cpu::sse])
        // SSE-optimized code
        foo_sse();
    else
        // generic code
        foo_generic();
}

I know there are complications and possible ways of optimization. In particular, we actually discover multiple features with one cpuid call, so we might want to fill multiple flags per feature query. And for a /proc/cpuinfo backed solution we might want to always parse the whole file at once. But I like this interface and the design looks extensible.
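To make the lazy caching idea concrete, here is a rough sketch of how a single query could fill several flags from one cpuid call. This is a simplification of the interface above, not a proposed implementation; it assumes GCC/Clang's <cpuid.h> (MSVC would use <intrin.h> and __cpuid), and covers only two x86 feature bits:

    #include <bitset>
    #include <cstddef>
    #include <cpuid.h>

    namespace cpu {

    enum class feature_tag : std::size_t { sse, sse2, _count };

    class features
    {
        std::bitset< static_cast< std::size_t >(feature_tag::_count) > m_cached;
        std::bitset< static_cast< std::size_t >(feature_tag::_count) > m_values;

    public:
        bool test(feature_tag tag)
        {
            const std::size_t idx = static_cast< std::size_t >(tag);
            if (!m_cached[idx])
                query_basic_features();  // one cpuid call fills several flags
            return m_values[idx];
        }

    private:
        void query_basic_features()
        {
            unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
            if (__get_cpuid(1u, &eax, &ebx, &ecx, &edx))
            {
                // CPUID leaf 1: EDX bit 25 = SSE, EDX bit 26 = SSE2
                m_values[static_cast< std::size_t >(feature_tag::sse)] = (edx >> 25) & 1u;
                m_values[static_cast< std::size_t >(feature_tag::sse2)] = (edx >> 26) & 1u;
            }
            // Mark both as cached, whether or not the query succeeded
            m_cached[static_cast< std::size_t >(feature_tag::sse)] = true;
            m_cached[static_cast< std::size_t >(feature_tag::sse2)] = true;
        }
    };

    } // namespace cpu

Whether the flags are grouped per cpuid leaf like this, or the whole block is decoded by init(), is an implementation detail the interface above leaves open.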