Re: [boost] Boost Library Testing - a modest proposal - was boost.test regression or behavior change (was Re: Boost.lockfree)

9 Oct 2015

      On 09/10/15 18:37, Robert Ramey wrote:
...
I believe this whole thread started from the changes in Boost.Test such
that it can no longer support testing of C++03 compatible libraries.
This is totally unrelated to the testing of Boost libraries.
The thread started because boost.test broke something used by other 
libraries, in a development branch, which raised some misunderstanding 
on the purpose of this branch and the overall workflow.

As a side note, I reverted the changes so that C++11 is not required for 
the set of features that do not explicitly state this requirement in 
the documentation of 1.59 (datasets mainly, but also some forms of test 
declaration and test assertions).
...
Here is what I would like to see:
a) local testing by library developers.
Of course library developers need this in order to develop and maintain
libraries.
Currently we have this and it has worked quite well for many years. Making
Boost.Test require C++11+ throws a monkey wrench into things for the
libraries which use it. But that's only temporary. Libraries whose
developers feel they need to maintain compatibility with C++98 can move
to lightweight test with relatively little effort.
I do not think that local testing has ever been an issue. The value of 
the dashboard is on the scalability of the testing wrt. 
platforms/compiler combinations, especially for configurations that are 
hard to find today (eg. MSVC7) and/or hard to set up (eg. Android).

I would also like to emphasize the difference between the unit testing 
tool (boost.test or lightweight test) and the test driver (bjam):

- The "API" for running the test bed is bjam. This is used by developers 
and by the regression testing workflow.
- The API for writing tests can be whatever developers like; boost.test 
is just one choice, and it is not directly seen by the regression dashboard.
...
Developers who are concerned that the develop branch is a "soup" can
easily isolate themselves from this by testing against the master branch
of all the other libraries. The Boost modularization system with git has
made this very simple and practical (thank you Beman!).
So - not a problem.
Right: this is trivial locally, yet this is not the current workflow of 
the regression dashboard. The complaints started because of failures in 
develop, and because of workflow considerations + safe increments. As a 
developer, I would like to test my library on many runners (and as fast 
as possible).
...
b) Testing on other platforms.
We have a system which has worked pretty well for many years. Still it
has some features that I'm not crazy about.
i) it doesn't scale well - as boost gets bigger the testing load gets
bigger.
I suggested a test procedure on "stages of quality" in my previous post:
- fast feedback by continuous runners, giving a quick status on some 
mainstream compilers. Runners may have overlapping configuration/setup, 
so that the load is balanced somehow.
- scheduling of less available runners on candidates selected from the 
previous stage. The interface could be advancing a git branch, with 
those runners picking up that branch only.
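
To make the two stages concrete, here is a minimal sketch in Python of 
how candidates could be selected from the fast stage. The data layout 
and the name select_candidates are assumptions made for illustration; 
nothing like this exists in the current regression setup.

```python
def select_candidates(fast_results):
    """Return the revisions for which every fast runner reported success.

    fast_results maps a revision to {runner_description: passed}.
    This layout is hypothetical, chosen only for the example.
    """
    return [
        revision
        for revision, statuses in fast_results.items()
        if statuses and all(statuses.values())
    ]


# Made-up revisions and runner descriptions.
fast_results = {
    "rev1": {"gcc": True, "clang": True},
    "rev2": {"gcc": True, "clang": False},
}

# Only revisions that are green on all fast runners would be handed to
# the less available runners, e.g. by advancing a dedicated git branch.
candidates = select_candidates(fast_results)
print(candidates)
```

The slower runners would then only ever see revisions that already 
passed on the mainstream compilers.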
...
ii) it tests the develop branch of each library against the develop
branch of all the other libraries - hence we have a testing "soup" where
a test might show failure but this failure might not be related to the
library under test but some other library. It diminishes the utility of
the test results in tracking down problems.
Exactly, but not being able to track the history of the versions on the 
current dashboard does not help either. As a developer, I would like to 
see a summary of e.g. the number of failing tests vs. the total number 
of tests, *per revision*.
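
As a rough illustration of such a per-revision summary, here is a 
minimal sketch in Python. The record layout (revision, test name, 
passed) and the name summarize_by_revision are hypothetical; the 
current dashboard does not expose its data this way.

```python
from collections import OrderedDict


def summarize_by_revision(results):
    """Return {revision: (failing, total)} from (revision, test, passed) records."""
    summary = OrderedDict()
    for revision, _test_name, passed in results:
        failing, total = summary.get(revision, (0, 0))
        summary[revision] = (failing + (0 if passed else 1), total + 1)
    return summary


# Made-up records: revisions and test names are for illustration only.
results = [
    ("a1b2c3", "test_queue", True),
    ("a1b2c3", "test_stack", False),
    ("d4e5f6", "test_queue", True),
    ("d4e5f6", "test_stack", True),
]

for revision, (failing, total) in summarize_by_revision(results).items():
    print(f"{revision}: {failing} failing / {total} tests")
```

A digest like this, one line per revision, is much easier to read than 
the full matrix of test results.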
...
iii) it relies on volunteer testers to select compilers/platforms to
test under. So it's not exhaustive and the selection might not reflect
that which people are actually using.
I would say that it would be good if each runner published its setup 
(not the runtime, but how it has been deployed), and maybe a script to 
reproduce the runner. I am thinking of docker (and how easy it is to 
fully describe a system); there are tools for the other platforms, 
though they are more complicated.

The idea behind that is to be able to reproduce the runners, so that 
they are not shown by name (eg. teeks99-08) but by property (eg. 
win2012R2-64on64, msvc-12). I am not saying that the current setup 
should not be followed, I am suggesting a way to address the scalability 
issue. For that we can have equivalent runners and balance the load.
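
A minimal sketch in Python of what "runners by property" could look 
like: runners with the same description are grouped together, and the 
work is spread over the members of a group. The runner names other than 
teeks99-08, the pairing of names with properties, and both function 
names are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import cycle


def group_by_properties(runners):
    """Group runner names by their (platform, toolset) description."""
    groups = defaultdict(list)
    for name, platform, toolset in runners:
        groups[(platform, toolset)].append(name)
    return dict(groups)


def assign_round_robin(libraries, equivalent_runners):
    """Spread libraries over equivalent runners in round-robin order."""
    assignment = defaultdict(list)
    for library, runner in zip(libraries, cycle(equivalent_runners)):
        assignment[runner].append(library)
    return dict(assignment)


# Illustrative data only: the name/property pairing is not taken from
# the real dashboard, and runner-b / runner-c are hypothetical.
runners = [
    ("teeks99-08", "win2012R2-64on64", "msvc-12"),
    ("runner-b", "win2012R2-64on64", "msvc-12"),
    ("runner-c", "linux-64", "gcc-5"),
]

groups = group_by_properties(runners)
plan = assign_round_robin(
    ["test", "lockfree", "serialization"],
    groups[("win2012R2-64on64", "msvc-12")],
)
print(plan)
```

The dashboard would then show one column per property set, whichever 
equivalent runner produced the result.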
...
I would like to see us encourage our users to test the libraries that
they use. This system would work in the following way.
If by users you mean the post-release /end users/, are you expecting a 
post-release feedback? I am not sure I understand.

BTW, do we have numbers on how many people download a release 
candidate?
...
a) A user downloads/builds boost.
b) he decides he's going to use library X, and Y
c) he runs a tool which tells him which libraries he has to test. This
would be the result of a dependency analysis. We have tools which do
similar dependency analysis but they would have to be slightly enhanced
to distinguish between testing, deployment, etc. I don't think this
would be a huge undertaking given the work that has already been done.
d) he runs the local testing setup on those libraries and their dependents.
e) he uploads the test results to a dashboard similar if not identical
to the current one.
So we would expect to have HTML pages with 10000 columns. Again, I think 
the information needs to be digested.
...
f) we would discourage users from just using the boost libraries without
running their own tests. We would do this by exhortation and by
refusing to support users who have been unwilling to run and post local
tests.
Mmmm... sounds bad to me.
...
This would give us the following:
a) a scalable testing setup which could handle a Boost containing any
number of libraries.
And what about just a randomized test? Say we have an ever growing 
number of tests N (big), but the acceptance of running all N decreases 
as N grows. Say we limit each run to M << N (say 100) tests and we 
shuffle uniformly: the feedback would be much faster, the acceptance 
much higher. On our side, we need some machinery to digest this 
information based on the environment setup.
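
A minimal sketch in Python of the uniform selection, assuming the tests 
are available as a flat list of names. The test names, the seed and the 
name pick_random_subset are hypothetical.

```python
import random


def pick_random_subset(tests, m, seed=None):
    """Pick m tests uniformly at random, without repetition."""
    rng = random.Random(seed)
    if m >= len(tests):
        return list(tests)
    return rng.sample(tests, m)


# N = 10000 made-up test names, M = 100 as in the text above.
all_tests = [f"test_{i}" for i in range(10000)]
subset = pick_random_subset(all_tests, 100, seed=2015)
print(len(subset))
```

With uniform sampling, a given test is picked in one run with 
probability M/N, so after k independent runs it has been executed at 
least once with probability 1 - (1 - M/N)^k. For M = 100 and N = 10000, 
that is roughly 63% after 100 runs, which is why the results of many 
users have to be aggregated per environment.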
...
b) All combinations of libraries/platforms/compilers actually being used
would be those being tested and vice versa. We would have complete and
efficient test coverage.
c) We would have statistics on libraries being used. Something we are
sorely lacking now.
I am wondering why this would be relevant.
...
d) We would be encouraging better software development practices.
Sometime ago someone posted that he had a problem but couldn't run the
tests because "management" wouldn't allocate the time - and this was a
critical human life safety app. He escaped before I could wheedle out of
him which company he worked for.
And best of all - We're almost there !!!! we'd only need to:
a) enhance slightly the dependency tools we've crafted but aren't
actually using.
The dependencies are indirectly tested I would say, so testing the 
dependencies is a /nice to have/, but if I am using X that depends on Y, 
testing X should in most cases be enough. If some breakage goes 
unnoticed through the tests of X, having tested Y might have helped, 
but this is not trivial: it means the coverage of X should be improved.
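
For reference, the set of libraries such a dependency tool would report 
is the transitive closure of the libraries in use. A minimal sketch in 
Python, on a hypothetical dependency graph (this is not the output of 
the existing Boost dependency tools, and libraries_to_test is a made-up 
name):

```python
def libraries_to_test(direct_deps, used):
    """Return the used libraries plus everything they depend on, transitively."""
    to_visit = list(used)
    seen = set()
    while to_visit:
        library = to_visit.pop()
        if library in seen:
            continue
        seen.add(library)
        to_visit.extend(direct_deps.get(library, ()))
    return sorted(seen)


# Hypothetical graph: X depends on Y, Y and W depend on Z.
direct_deps = {
    "X": ["Y"],
    "Y": ["Z"],
    "W": ["Z"],
    "Z": [],
}

print(libraries_to_test(direct_deps, ["X"]))
```

Testing only X corresponds to skipping the closure altogether, which is 
the cheaper option discussed above.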
...
b) develop a tool to post the local results to a common dashboard
c) enhance the current dashboard to accept these results.
Several tools exist already, e.g. CDash together with cmake. Why spend 
that much effort developing our own tools? Our expectations are not that 
different from those of many other open or closed source projects: we 
want quick and/or wide feedback on the development state of boost.

Raffi

Raffi Enficiaud