On 04/12/2017 11:37 AM, Hans Dembinski via Boost wrote:
The library implements a histogram class (a highly configurable policy-based template) for C++ and Python in C++11 code. Histograms are a standard tool to explore Big Data. They allow one to visualise and analyse distributions of random variables. A histogram provides a lossy compression of input data. GBytes of input can be put in a compact form which requires only a small fraction of the original memory. This makes histograms convenient for interactive data analysis and further processing.
Given that the compression is lossy, I am wondering how it compares with a distribution estimator like: https://arxiv.org/abs/1507.05073v2

A common use case when collecting numerical data is to determine the quantiles. Boost.Accumulators contains an estimator (extended_p_square) for that. The advantage of such estimators is that they execute in constant time and with constant memory usage, where the constant depends only on the required precision.

PS: I am aware that this is a non-trivial question, so I do not expect an answer.
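PPS: For concreteness, this is roughly the kind of constant-memory usage I have in mind; only a minimal sketch, with the keyword and result access written from memory and to be checked against the Accumulators documentation:

#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics/stats.hpp>
#include <boost/accumulators/statistics/extended_p_square.hpp>
#include <iostream>
#include <vector>

int main() {
    using namespace boost::accumulators;

    // quantiles to track; the memory footprint depends only on this list
    std::vector<double> probs = {0.25, 0.5, 0.75};

    // keyword name written from memory; check the Boost.Accumulators docs
    accumulator_set<double, stats<tag::extended_p_square>>
        acc(extended_p_square_probabilities = probs);

    // feed a (potentially huge) stream of values one by one
    for (int i = 0; i < 1000000; ++i)
        acc(static_cast<double>(i % 1000));

    // read off the running quantile estimates
    for (std::size_t i = 0; i < probs.size(); ++i)
        std::cout << probs[i] << " -> " << extended_p_square(acc)[i] << "\n";
}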
On 2017-04-12 12:34, Bjorn Reese via Boost wrote:
On 04/12/2017 11:37 AM, Hans Dembinski via Boost wrote:
The library implements a histogram class (a highly configurable policy-based template) for C++ and Python in C++11 code. Histograms are a standard tool to explore Big Data. They allow one to visualise and analyse distributions of random variables. A histogram provides a lossy compression of input data. GBytes of input can be put in a compact form which requires only a small fraction of the original memory. This makes histograms convenient for interactive data analysis and further processing.
Given that the compression is lossy, I am wondering how it compares with a distribution estimator like:
https://arxiv.org/abs/1507.05073v2
A common use case when collecting numerical data is to determine the quantiles. Boost.Accumulators contains an estimator (extended_p_square) for that.
The advantage of such estimators is that they execute in constant time and with constant memory usage, where the constant depends only on the required precision.
PS: I am aware that this is a non-trivial question, so I do not expect an answer.
Hi,

Simple answer: histograms are not designed for estimating the quantile function, but the pdf. While it is true that a sufficiently good estimate of the pdf will give you an estimate of the quantiles via the inverse of the cdf, the obtainable precision depends on the size of the bins chosen for the histogram.

On the other hand, if your data is multi-variate or your pdf multi-modal, you will have a hard time using quantiles, while you could still do, for example, outlier detection using histograms.

Best,
Oswin
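PS: To make the bin-size caveat concrete, here is a generic sketch (nothing to do with the proposed library) of reading a quantile off a histogram by inverting the empirical cdf; the result can never be more precise than the bin width allows:

#include <cassert>
#include <numeric>
#include <vector>

// Generic sketch: estimate the p-quantile from equidistant bin counts over
// [xmin, xmax) by inverting the empirical cdf, with linear interpolation
// inside the selected bin. The precision is limited by the bin width.
double histogram_quantile(const std::vector<double>& counts,
                          double xmin, double xmax, double p) {
    assert(!counts.empty() && p >= 0.0 && p <= 1.0);
    const double total = std::accumulate(counts.begin(), counts.end(), 0.0);
    const double target = p * total;
    const double width = (xmax - xmin) / counts.size();
    double cum = 0.0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        if (cum + counts[i] >= target) {
            const double frac = counts[i] > 0.0 ? (target - cum) / counts[i] : 0.0;
            return xmin + (i + frac) * width;
        }
        cum += counts[i];
    }
    return xmax;
}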
Hi Oswin,
On the other hand, if your data is multi-variate or your pdf multi-modal, you will have a hard time using quantiles, while you could still do for example outlier detection using histograms.
Yeah, I think so, too. The histogram library comes with extra bins along each dimension for outliers. Those can be turned off individually for each dimension, if needed.

Best regards,
Hans
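PS: Just to illustrate the idea with a toy sketch (this is not the library's actual interface): each axis keeps an optional underflow and overflow cell, and out-of-range values are routed there instead of being dropped.

#include <initializer_list>
#include <vector>

// Toy sketch, not the proposed library's API: a 1d regular axis with
// optional underflow/overflow cells for outliers.
struct axis1d {
    double min, max;
    int nbins;
    bool uoflow;  // keep the extra outlier cells?

    int total_bins() const { return nbins + (uoflow ? 2 : 0); }

    // map a value to a cell index; -1 means "discard"
    int index(double x) const {
        if (x < min)  return uoflow ? nbins : -1;      // underflow cell
        if (x >= max) return uoflow ? nbins + 1 : -1;  // overflow cell
        return static_cast<int>((x - min) / (max - min) * nbins);
    }
};

int main() {
    axis1d a{0.0, 1.0, 10, true};
    std::vector<long> counts(a.total_bins(), 0);
    for (double x : {-0.5, 0.3, 0.7, 2.0}) {
        const int i = a.index(x);
        if (i >= 0) ++counts[i];  // -0.5 and 2.0 end up in the extra cells
    }
}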
On 04/12/2017 01:26 PM, Oswin Krause via Boost wrote:
On 2017-04-12 12:34, Bjorn Reese via Boost wrote:
Given that the compression is lossy, I am wondering how it compares with a distribution estimator like:
Simple answer: Histograms are not designed for estimating the quantile function, but the pdf.
The first reference I gave is a distribution (pdf and cdf) estimator.
While it is true that a sufficiently good estimate of the pdf will give you an estimate of the quantiles via the inverse of the cdf, the obtainable precision depends on the size of the bins chosen for the histogram.
On the other hand, if your data is multi-variate or your pdf multi-modal, you will have a hard time using quantiles, while you could still do for example outlier detection using histograms.
Good answer for the quantile estimators.
Hi Bjorn,
Given that the compression is lossy, I am wondering how it compares with a distribution estimator like:
I have to read the reference carefully, which is quite interesting, but I think the scope of such a density estimator is different. Histograms are conceptually simple, and simplicity is sometimes a plus. If you really want an estimator of the data pdf, then other algorithms may be better. Histograms can be transformed into an estimator of the pdf, but that's not their primary use case in my experience.

In my field, particle physics, we are usually not interested in the data pdf itself. We come up with a theoretical model pdf on our own, which depends on some parameter(s) of interest (e.g. the mass of a new particle). We then adjust this parameter until the theoretical model fits the data. This can be done by maximising the likelihood of the model in view of the data. If the data set is big, then it is more practical to use a histogram instead of the original data. We then maximise the likelihood of obtaining such a histogram.

For this purpose, histograms are great, because they have clear properties and the analysis is straightforward. The counts in the cells follow Poisson distributions, and the stochastic fluctuations are independent in each cell. Neither is true for smooth density estimators, which makes them unsuitable for model fitting.
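To spell out the Poisson point, the binned fit maximises a product of Poisson probabilities, one per cell. As a rough sketch (the model prediction per cell is left as an input, parameter handling omitted), the negative log-likelihood reduces to:

#include <cmath>
#include <vector>

// Rough sketch of a binned Poisson negative log-likelihood. counts[i] is the
// observed content of cell i, expected[i] the model prediction for the current
// parameter values (assumed > 0); the constant log(n!) terms are dropped
// because they do not depend on the parameters.
double poisson_nll(const std::vector<double>& counts,
                   const std::vector<double>& expected) {
    double nll = 0.0;
    for (std::size_t i = 0; i < counts.size(); ++i)
        nll += expected[i] - counts[i] * std::log(expected[i]);
    return nll;
}

The fit then just minimises this over the model parameter(s) with any generic optimiser.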
A common use-case when collecting numerical data is to determine the quantiles. Boost.Accumulators contains an estimator (extended_p_square) for that.
I had a look into Boost.Accumulators, and my impression was that the algorithms are for one-dimensional data only. The histogram library allows you to handle multi-dimensional input. This is in addition to what I wrote above about the necessity to statistically model the histogram counts.

In summary, the histogram library is not a particularly clever density estimator, but it tries to be the most efficient and convenient implementation of a classical histogram.

Best regards,
Hans
participants (3)
- Bjorn Reese
- Hans Dembinski
- Oswin Krause