[GRASS5] statistics library?

Hi all,

I was looking to modify r.series to add an option for x% trimmed mean,
and it occurred to me that I'd probably want the same for r.statistics.

Looking further, s.univar, s.windavg, s.cellstats, r.series,
r.statistics, and probably others all implement their own mix of
statistical queries, some with more options, some with less.

Wouldn't it be better to have a standard (simple) stats library in
src/libes/gmath/ which worked on an unsorted array of floats?
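
As a rough illustration of the kind of routine such a library might hold,
here is a minimal sketch of an x% trimmed mean over an unsorted array. The
name trimmed_mean(), its signature, and the use of double rather than float
are assumptions made for this example, not existing gmath code.

```c
#include <assert.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for doubles (assumes the data contains no NaNs) */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;

    return (x > y) - (x < y);
}

/* Hypothetical helper: mean of the samples left after discarding the
 * lowest and highest `trim` fraction (0 <= trim < 0.5) of n values.
 * The caller's array stays unsorted; a sorted copy is made internally.
 * Returns NaN if n == 0 or if the copy cannot be allocated. */
double trimmed_mean(const double *values, size_t n, double trim)
{
    double *tmp, sum = 0.0;
    size_t i, k;

    assert(trim >= 0.0 && trim < 0.5);
    if (n == 0)
        return NAN;

    tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL)
        return NAN;
    memcpy(tmp, values, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_double);

    k = (size_t)(trim * (double)n);   /* samples dropped from each end */
    for (i = k; i < n - k; i++)
        sum += tmp[i];

    free(tmp);
    return sum / (double)(n - 2 * k);
}
```

With trim = 0.25 and eight samples, the two smallest and two largest
values are dropped and the remaining four are averaged.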

Is there something in R or somewhere else that could be reused? Would
that just lead to dependency vs. syncing headaches, and be overkill
anyway? Start off with univar.c?

What started me on this was trying to filter out bad data points from
an r.sun output map; that is probably better dealt with by setting a
smaller dist= value, or by fixing the bug if it is a bug.

?

Hamish

On Tue, Sep 09, 2003 at 02:53:52PM +1200, Hamish wrote:

Hi all,

I was looking to modify r.series to add an option for x% trimmed mean,
and it occurred to me that I'd probably want the same for r.statistics.

Looking further, s.univar, s.windavg, s.cellstats, r.series,
r.statistics, and probably others all implement their own mix of
statistical queries, some with more options, some with less.

Wouldn't it be better to have a standard (simple) stats library in
src/libes/gmath/ which worked on an unsorted array of floats?

Definitely! The statistics routines should be in one library (gmath).

Is there something in R or somewhere else that could be reused? Would
that just lead to dependency vs. syncing headaches, and be overkill
anyway? Start off with univar.c?

There should be no external dependencies needed to get the current
modules running. Or did you mean taking over code from R (which, being
GPLed, should be OK)?

The idea of "gmath" was to maintain all stats functions in this
place; unfortunately, it's far from being done yet.
There is also code in libgis that could potentially be migrated there
(lu.c, svd.c, ...).

Markus

Hamish wrote:

I was looking to modify r.series to add an option for x% trimmed mean,
and it occurred to me that I'd probably want the same for r.statistics.

Looking further, s.univar, s.windavg, s.cellstats, r.series,
r.statistics, and probably others all implement their own mix of
statistical queries, some with more options, some with less.

Wouldn't it be better to have a standard (simple) stats library in
src/libes/gmath/ which worked on an unsorted array of floats?

Probably. Although there might be situations where having both
integer and FP versions would be useful (obviously that doesn't make
sense for e.g. mean, variance etc, but it might for others, e.g. sum,
median).

Is there something in R or somewhere else that could be reused? Would
that just lead to dependency vs. syncing headaches, and be overkill
anyway?

Using R would definitely be overkill.

Start off with univar.c?

I think that src/raster/r.series/cmd/c_*.c might be a better
interface, in the sense of each function computing a single measure
rather than computing everything. If you just want e.g. sum/mean, all
of those calls to pow() for the variance/skew/kurtosis computations
would be excessive.

OTOH, if you wanted both variance and standard deviation, you wouldn't
want to compute the variance twice. So, we might want some sort of
hybrid, which doesn't compute values which aren't required, and which
only computes the required values once.

For cases where the number of samples is likely to be large, it would
be better to have an interface which allows the data to be passed in
chunks, rather than having to have all of the data in memory at once.
However, the median (and quartiles, percentiles) can't be computed
this way; you have to have all of the data in memory at once.
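
A chunked interface along these lines could be sketched as a small
accumulator that is fed one buffer (e.g. one raster row) at a time. The
struct accum type and the accum_*() names are hypothetical. Only measures
that can be updated incrementally (count, sum, min, max, mean) are
covered; median and percentiles are left out for the reason given above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical running accumulator: holds only a few scalars, so
 * memory use does not grow with the number of samples. */
struct accum {
    size_t count;
    double sum, min, max;
};

void accum_init(struct accum *a)
{
    a->count = 0;
    a->sum = 0.0;
    a->min = a->max = 0.0;   /* only meaningful once count > 0 */
}

/* Fold one chunk of n samples into the running totals. */
void accum_add(struct accum *a, const double *chunk, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        double x = chunk[i];

        if (a->count == 0 || x < a->min)
            a->min = x;
        if (a->count == 0 || x > a->max)
            a->max = x;
        a->sum += x;
        a->count++;
    }
}

double accum_mean(const struct accum *a)
{
    assert(a->count > 0);
    return a->sum / (double)a->count;
}
```

The caller would invoke accum_add() once per chunk and read the results
after the last one.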

[Also, while you can compute the variance from just the count, sum and
sum-of-squares, it is more accurate to compute the mean first then
accumulate the deviation-squared values in a second pass. This came up
a while back in the context of r.univar computing a negative variance
(due to rounding error) when all of the values are identical,
resulting in the standard deviation computation failing.]
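
To make the difference concrete, here is a sketch of both approaches side
by side. The function names are made up for this example and both return
the population variance. The one-pass form subtracts two nearly equal
numbers, so rounding can push the result slightly below zero; the two-pass
form sums squares, so it cannot go negative.

```c
#include <assert.h>
#include <stddef.h>

/* One pass: variance from count, sum and sum-of-squares.
 * Subject to cancellation when the variance is small relative to the
 * mean, so the result may come out slightly negative. */
double variance_one_pass(const double *v, size_t n)
{
    double sum = 0.0, sumsq = 0.0, mean;
    size_t i;

    assert(n > 0);
    for (i = 0; i < n; i++) {
        sum += v[i];
        sumsq += v[i] * v[i];
    }
    mean = sum / (double)n;
    return sumsq / (double)n - mean * mean;
}

/* Two passes: compute the mean first, then accumulate the squared
 * deviations. Every term added is >= 0, so the result is >= 0. */
double variance_two_pass(const double *v, size_t n)
{
    double sum = 0.0, ss = 0.0, mean;
    size_t i;

    assert(n > 0);
    for (i = 0; i < n; i++)
        sum += v[i];
    mean = sum / (double)n;

    for (i = 0; i < n; i++) {
        double d = v[i] - mean;

        ss += d * d;
    }
    return ss / (double)n;
}
```

The cost of the two-pass version is that the data has to be read twice,
which conflicts with a purely streaming interface.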

--
Glynn Clements <glynn.clements@virgin.net>

On Tue, Sep 09, 2003 at 02:53:52PM +1200, Hamish wrote:

Hi all,

I was looking to modify r.series to add an option for x% trimmed mean,
and it occurred to me that I'd probably want the same for r.statistics.

Looking further, s.univar, s.windavg, s.cellstats, r.series,
r.statistics, and probably others all implement their own mix of
statistical queries, some with more options, some with less.

Wouldn't it be better to have a standard (simple) stats library in
src/libes/gmath/ which worked on an unsorted array of floats?

[...]

It would be nice to have r.univar as a C program then, not as a
(slow) script...
Just another motivation,

Markus