This queue is for tickets about the Statistics-Descriptive CPAN distribution.

Report information
The Basics
 Id: 9160 Status: resolved Priority: 0/ Queue: Statistics-Descriptive

People
 Owner: Nobody in particular Requestors: tgape [...] comcast.net Cc: AdminCc:

Bug Information
 Severity: Important Broken in: (no value) Fixed in: (no value)

Attachments
Statistics-Descriptive-2.6-eg.patch

History
 Subject: Variance and Standard Deviation use costly pseudo-variance, instead of computing real variance
text/plain 1.5k
The current Statistics::Descriptive package uses a pseudo-variance algorithm that was popularized many years ago, which was touted as being able to produce its number without all the datapoints, unlike real variance. However, expanding out the variance sum, it's just as easy to calculate the real thing. This patch reduces the cost of adding data points, it calculates true variance/standard deviation, and it allows for both variance computation methods (although it only allows the more common for the basis of standard deviation). As it caches the variance result, reading variance and standard deviation repeatedly per data point still results in a net performance improvement over the original. Finally, it adds a clear method, which reduced the runtime of a particular program which uses this module far too heavily by around 2%. (This program spends about 60% of its time in Statistics::Descriptive.) The basis for the change was simply expanding out the 'elegant form' reported for computing variance, and finding out that the effective equation is actually significantly simpler than the 'elegant form'. Specifically, we get, on the numerator (denominator being trivial): sum of (x(n)^2 - mean^2) x(1)^2-mean^2+x(2)^2-mean^2+...+x(n)^2-mean^2 x(1)^2+x(2)^2+...+x(n)^2 - n*mean^2 (sum of x(n^2) - n*mean^2 After that, it was simply a matter of applying all of the simplifications this allowed. I have not added tests for this, but since there aren't already any standard deviation tests, it still passes the tests. Ed
--- Statistics-Descriptive-2.6/Descriptive.pm 2002/10/10 17:03:13 1.1 +++ Statistics-Descriptive-2.6/Descriptive.pm 2004/01/24 18:20:33 1.2 @@ -21,13 +21,12 @@ count => 0, mean => 0, sum => 0, + sumsq => 0, variance => undef, - pseudo_variance => 0, min => undef, max => undef, mindex => undef, maxdex => undef, - standard_deviation => undef, sample_range => undef, ); @@ -45,7 +44,7 @@ sub add_data { my \$self = shift; ##Myself my \$oldmean; - my (\$min,\$mindex,\$max,\$maxdex); + my (\$min,\$mindex,\$max,\$maxdex,\$sum,\$sumsq,\$count); my \$aref; if (ref \$_[0] eq 'ARRAY') { @@ -55,41 +54,84 @@ \$aref = \@_; } + ##If we were given no data, we do nothing. + return 1 if (!@{ \$aref }); + ##Take care of appending to an existing data set \$min = (defined (\$self->{min}) ? \$self->{min} : \$aref->[0]); \$max = (defined (\$self->{max}) ? \$self->{max} : \$aref->[0]); \$maxdex = \$self->{maxdex} || 0; \$mindex = \$self->{mindex} || 0; + \$sum = \$self->{sum}; + \$sumsq = \$self->{sumsq}; + \$count = \$self->{count}; - ##Calculate new mean, pseudo-variance, min and max; + ##Calculate new mean, sumsq, min and max; foreach ( @{ \$aref } ) { - \$oldmean = \$self->{mean}; - \$self->{sum} += \$_; - \$self->{count}++; + \$sum += \$_; + \$sumsq += \$_**2; + \$count++; if (\$_ >= \$max) { \$max = \$_; - \$maxdex = \$self->{count}-1; + \$maxdex = \$count-1; } if (\$_ <= \$min) { \$min = \$_; - \$mindex = \$self->{count}-1; + \$mindex = \$count-1; } - \$self->{mean} += (\$_ - \$oldmean) / \$self->{count}; - \$self->{pseudo_variance} += (\$_ - \$oldmean) * (\$_ - \$self->{mean}); } \$self->{min} = \$min; \$self->{mindex} = \$mindex; \$self->{max} = \$max; \$self->{maxdex} = \$maxdex; - \$self->{sample_range} = \$self->{max} - \$self->{min}; - if (\$self->{count} > 1) { - \$self->{variance} = \$self->{pseudo_variance} / (\$self->{count} -1); - \$self->{standard_deviation} = sqrt( \$self->{variance}); - } + \$self->{sample_range} = \$max - \$min; + \$self->{sum} = \$sum; + \$self->{sumsq} = \$sumsq; + \$self->{mean} = \$sum / \$count; + \$self->{count} = \$count; + ##indicator the value is not cached. Variance isn't commonly enough + ##used to recompute every single data add. + \$self->{variance} = undef; return 1; } +sub standard_deviation { + my \$self = shift; ##Myself + return undef if (!\$self->{count}); + return sqrt(variance(\$self)); +} + +##Return variance; if needed, compute and cache it. +sub variance { + my \$self = shift; ##Myself + my \$div = @_ ? 0 : 1; + my \$count = \$self->{count}; + if (\$count < 1 + \$div) { + return 0; + } + + my \$variance = \$self->{variance}; + if (!defined(\$variance)) { + \$variance = (\$self->{sumsq} - \$count * \$self->{mean}**2); + \$variance /= \$count - \$div; + \$self->{variance} = \$variance; + } + return \$variance; +} + +##Clear a stat. More efficient than destroying an object and calling +##new. +sub clear { + my \$self = shift; ##Myself + my \$key; + + return if (!\$self->{count}); + while (my(\$field, \$value) = each %fields) { + \$self->{\$field} = \$value; + } +} + sub AUTOLOAD { my \$self = shift; my \$type = ref(\$self) @@ -135,6 +177,27 @@ return \$self; } +##Clear a stat. More efficient than destroying an object and calling +##new. +sub clear { + my \$self = shift; ##Myself + my \$key; + + return if (!\$self->{count}); + foreach \$key (keys %{ \$self }) { # Check each key in the object + # If it's a reserved key for this class, keep it + next if exists \$self->{'_reserved'}->{\$key}; + # If it comes from the base class, keep it + next if exists \$self->{'_permitted'}->{\$key}; + delete \$self->{\$key}; # Delete the out of date cached key + } + \$self->SUPER::clear(); + if (exists(\$self->{data})) { + \$self->{data} = []; + \$self->{presorted} = 0; + } +} + sub add_data { my \$self = shift; my \$key; @@ -466,6 +529,16 @@ =item \$stat = Statistics::Descriptive::Sparse->new(); Create a new sparse statistics object. + +=item \$stat->clear(); + +Effectively the same as + + my \$class = ref(\$stat); + undef \$stat; + \$stat = new \$class; + +except more efficient. =item \$stat->add_data(1,2,3);
 From: tgape [...] comcast.net
text/plain 1.2k
[guest - Sun Dec 26 22:06:22 2004]: Show quoted text
> The current Statistics::Descriptive package uses a pseudo-variance > algorithm that was popularized many years ago, which was touted as > being able to produce its number without all the datapoints, unlike > real variance. However, expanding out the variance sum, it's just > as easy to calculate the real thing.
Back before I first created this patch, I had noticed a skew in standard deviation reported by a program which makes massive use of this package. I did not think of creating a test case at that time, because I could simply run my program with a particular dataset which demonstrated the issue fairly well. For some time, I ran both the 2.6 version and my version in parallel, and saw that the deviations were different. Since submitting this patch here, I've been trying to reproduce the issue, and I cannot. I'm therefore thinking this is not quite so important - especially since my analytic review isn't showing a difference either. I'm now wondering if it was a case of an expanding rounding error. My version does still work more efficiently, but as a mere optimization, this would have a fairly low priority. If I can reproduce the error, I'll submit a test case for it.
 From: tgape [...] comcast.net
text/plain 491b
The variance calculation in Statistics::Descriptive is actually correct, even if it is a bit more expensive. However, it is subject to a propogated rounding error, as it loses the original numbers. The simplified algorithm does not suffer from this issue. My difficulty in reproducing the problem was that I had since upgraded systems, and the newer system did not have the same rounding error of the prior system. It would still be nice to get the performance improvements of the patch.
The patch was applied in the trunk, thanks.
text/plain 1.2k
Hello Shlomi, Hello Ed, I am not sure if reopening the ticket is the right way to go about this, but I would certainly like to bring my concern about this issue to your attention. I do not agree with this patch. The problem with the current formula is that it is - albeit a bit cheaper - numerically very unstable. The problem is that the sum of x**2 and n*mean**2 can both get very big. Therefore, the variance is a possibly small difference of two possibly very big numbers and might therefore get very imprecise, even 0 due to numerical instability. Consider a not at all unrealistic example where the values are around 1,000,000, but fluctuate to a much smaller degree. Then the values of x**2 are around 10**12, if you add 1000 values, the sum is around 10**15. However, if the standard deviation is 1% - which is certainly something you would like to detect - then the variance is 10**8 which is 7 orders of magnitude smaller then the values you subtract from each other. I have had a quick look at version 2.6. I have not verified that it does exactly the right thing, but it is clear that the algorithm in there remedies this numerical instability by using differences to the current estimate of the mean. Best wishes, Lutz
text/plain 1.5k
Hi LGEHLEN, On Fri Nov 18 19:08:00 2011, LGEHLEN wrote: Show quoted text
> Hello Shlomi, Hello Ed, > I am not sure if reopening the ticket is the right way to go about this,
It is not. Please file a new bug and provide a test case (and preferably also a patch) and some direction. I'm resolving this bug - please file a new bug instead of commenting. Also see: https://rt.cpan.org/Public/Bug/Display.html?id=66712 Regards, -- Shlomi Fish Show quoted text