Markus Neteler wrote:

> > > 2. r.statistics uses r.stats to calculate the statistics, and r.stats
> > > reads its inputs as CELL.
> >
> > In this case, r.statistics could also accept an FCELL map
> > without complaining? Currently I need extra steps to round
> > elevation to a CELL map before running r.statistics.
>
> You should be able to simply remove the check at
> raster/r.statistics/main.c:83:
>
>     if (G_raster_map_is_fp(covermap->answer, mapset) != 0)
>         G_fatal_error(_("This module currently only works for integer (CELL) maps"));
>
> Or you might want to replace the error with a warning that the result
> is inaccurate.

> If so, the latter. Currently I have to round() the elevation map
> before running r.statistics - so results are inaccurate anyway,
> and the procedure requires extra work.

But if you round() manually, you know that you're introducing
inaccuracy, hence the suggestion of a warning if you allow r.stats to
do it automatically.

> The master question is: is it possible to conditionalize calls in
> r3.stats to make it work for 2D maps? The new implementation from
> Soeren sounds promising, and along the lines of your suggestion to
> rewrite r.stats from scratch.

I don't know how r3.stats works.

However, if we were to re-write something, it would probably be better
to re-write r.statistics to avoid using r.stats than to re-write
r.stats itself. r.stats has other uses beyond being a back-end for
r.statistics.

r.statistics is a simpler case, as I would expect the number of
distinct categories in the base map to be low enough that you could
comfortably store an accumulator per category.
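To illustrate what I mean, a per-category accumulator could be as
simple as the following sketch (plain C, not GRASS code; the names
stats, alloc_stats and update are invented for illustration):

```c
#include <float.h>
#include <stdlib.h>

/* One accumulator per base category, in a plain array indexed by
 * category number.  Enough state for count, sum, mean, min and max. */
struct stats {
    long count;
    double sum, min, max;
};

static struct stats *alloc_stats(int ncats)
{
    struct stats *s = calloc(ncats, sizeof *s);
    for (int i = 0; i < ncats; i++) {
        s[i].min = DBL_MAX;
        s[i].max = -DBL_MAX;
    }
    return s;
}

/* Called once per cell: cat from the base map, val from the cover map. */
static void update(struct stats *s, int cat, double val)
{
    s[cat].count++;
    s[cat].sum += val;
    if (val < s[cat].min) s[cat].min = val;
    if (val > s[cat].max) s[cat].max = val;
}
```

With categories renumbered to small non-negative integers, indexing is
O(1) per cell and the whole table stays trivially small.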

To simplify matters further, I would use a linear array, indexed by
base category, rather than an associative array (btree, hash table,
etc.). Base categories would normally be small non-negative integers
(i.e. suitable for use as indices), and if they aren't (i.e. the
allocation is sparse or includes negative values), you can always
"renumber" the categories using a reclass, e.g.:

    r.stats -n $inmap | awk '{print $1,"=",NR-1}' | r.reclass input=$inmap output=$outmap

Finally, it may be better to move the "awkward" aggregates (median,
mode) to a separate module, as the algorithms involved are completely
different.

It would probably be worth offering a choice of a single- or
multiple-pass algorithm. Count, sum, mean, minimum and maximum only
require a single pass. AFAICT, variance, standard deviation, skew and
kurtosis can all be done in a single pass, but a multiple-pass
algorithm would be more accurate.
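For what it's worth, the usual single-pass formulation for variance is
Welford's method; this sketch (invented names, not GRASS code) shows
why one pass suffices, although, as said, a multiple-pass computation
is numerically safer:

```c
/* Welford's single-pass running mean and variance.  m2 accumulates
 * the sum of squared deviations from the current mean. */
struct running {
    long n;
    double mean;
    double m2;
};

static void running_add(struct running *r, double x)
{
    double delta = x - r->mean;

    r->n++;
    r->mean += delta / r->n;
    r->m2 += delta * (x - r->mean);  /* note: uses the updated mean */
}

/* Sample variance (n-1 denominator); 0 for fewer than two values. */
static double running_variance(const struct running *r)
{
    return r->n > 1 ? r->m2 / (r->n - 1) : 0.0;
}
```

The same trick extends to third and fourth central moments for skew
and kurtosis, at the cost of a few more terms per update.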

Actually, computing the mode of a large amount of FP data (or integer
data with a large range) is even worse than computing quantiles.
AFAICT, any general solution must have worst-case memory consumption
proportional to either the number of cells in the input map or the
number of possible values (i.e. 2^32 for CELL/FCELL, 2^64 for DCELL).

For a large map with a small range, you need one counter for each
possible value. For a map with a large range, the optimal solution is
to simply sort the input map then scan the sorted data while tracking
the most common value found to date.
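A sketch of the sort-then-scan approach (qsort stands in for the
external sort a larger-than-memory map would actually need; the names
are illustrative):

```c
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;

    return (x > y) - (x < y);
}

/* Sort the values, then one linear scan tracking the longest run of
 * equal values; the longest run is the mode. */
static double mode_by_sort(double *vals, size_t n)
{
    double best, cur;
    size_t best_run = 1, run = 1;

    qsort(vals, n, sizeof *vals, cmp_double);
    best = cur = vals[0];

    for (size_t i = 1; i < n; i++) {
        if (vals[i] == cur)
            run++;
        else {
            cur = vals[i];
            run = 1;
        }
        if (run > best_run) {
            best_run = run;
            best = cur;
        }
    }
    return best;
}
```

After sorting, equal values are adjacent, so the scan needs only O(1)
extra memory; the cost is entirely in the sort.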

For a DCELL map which is too large to fit into memory, the only
practical solution is to reduce the precision.

--

Glynn Clements <glynn@gclements.plus.com>