On 04/12/2017 11:37 AM, Hans Dembinski via Boost wrote:
> The library implements a histogram class (a highly configurable policy-based template) for C++ and Python in C++11 code. Histograms are a standard tool to explore Big Data. They allow one to visualise and analyse distributions of random variables. A histogram provides a lossy compression of input data. GBytes of input can be put in a compact form which requires only a small fraction of the original memory. This makes histograms convenient for interactive data analysis and further processing.
Given that the compression is lossy, I am wondering how it compares with a distribution estimator like:

  https://arxiv.org/abs/1507.05073v2

A common use case when collecting numerical data is to determine the quantiles. Boost.Accumulators contains an estimator (extended_p_square) for that. The advantage of such estimators is that they execute in constant time and with constant memory usage, where the constant depends only on the required precision.

PS: I am aware that this is a non-trivial question, so I do not expect an answer.
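
To make the comparison a bit more concrete, here is a rough Python sketch of how a quantile could be read back from a fixed-bin histogram. It uses neither the proposed library's API nor Boost.Accumulators. The helper names `fill` and `histogram_quantile`, the range [0, 1), and the bin count of 100 are assumptions made purely for illustration. The memory used by `counts` depends only on the number of bins, which plays the role of the "required precision" above: the estimate should be good to roughly one bin width.

```python
import random


def fill(values, lo, hi, nbins):
    """Count values into nbins equal-width bins over [lo, hi).

    Values outside the range are dropped (hypothetical policy for this sketch).
    """
    counts = [0] * nbins
    width = (hi - lo) / nbins
    for x in values:
        if lo <= x < hi:
            # min() guards against floating-point rounding at the upper edge.
            index = min(int((x - lo) / width), nbins - 1)
            counts[index] += 1
    return counts


def histogram_quantile(counts, lo, hi, p):
    """Estimate the p-quantile (0 < p <= 1) from the bin counts alone."""
    total = sum(counts)
    if total == 0:
        raise ValueError("empty histogram")
    width = (hi - lo) / len(counts)
    target = p * total  # rank we are looking for
    cumulative = 0
    for i, c in enumerate(counts):
        if c > 0 and cumulative + c >= target:
            # Assume the entries are spread uniformly inside the bin
            # and interpolate linearly.
            fraction = (target - cumulative) / c
            return lo + (i + fraction) * width
        cumulative += c
    return hi


# Usage: compare the histogram-based median with the exact sample median.
rng = random.Random(0)
data = [rng.uniform(0.0, 1.0) for _ in range(100000)]
counts = fill(data, 0.0, 1.0, 100)
median_estimate = histogram_quantile(counts, 0.0, 1.0, 0.5)
median_exact = sorted(data)[len(data) // 2]
print(median_estimate, median_exact)
```

The difference from a streaming estimator is that the histogram needs its range and binning chosen in advance, whereas an estimator such as extended_p_square adapts its markers to the data as it arrives.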