Markus Neteler wrote:
> > > 2. r.statistics uses r.stats to calculate the statistics, and r.stats
> > > reads its inputs as CELL.
> >
> > In this case, r.statistics could also accept an FCELL map
> > without complaining? Currently I need extra steps to round
> > elevation to a CELL map before running r.statistics.
>
> You should be able to simply remove the check at
> raster/r.statistics/main.c:83:
>
> if( G_raster_map_is_fp(covermap->answer, mapset) != 0 )
> G_fatal_error (_("This module currently only works for integer (CELL) maps"));
>
> Or you might want to replace the error with a warning that the result
> is inaccurate.
If so, the latter. Currently I have to round() the elevation map
before running r.statistics - so results are inaccurate anyway
but the procedure requires more extra work.
But if you round() manually, you know that you're introducing
inaccuracy, hence the suggestion of a warning if you allow r.stats to
do it automatically.
The master question is: is it possible to conditionalize calls in
r3.stats to make it working for 2D maps. The new implementation from
Soeren sounds promising and along the lines of your suggestion to
rewrite r.stats from scratch.
I don't know how r3.stats works.
However, if we were to re-write something, it would probably be better
to re-write r.statistics to avoid using r.stats than to re-write
r.stats itself. r.stats has other uses beyond being a back-end for
r.statistics.
r.statistics is a simpler case, as I would expect the number of
distinct categories in the base map to be low enough that you could
comfortably store an accumulator per category.
To simplify matters futher, I would use a linear array, indexed by
base category, rather than an associative array (btree, hash table
etc). Base categories would normally be small non-negative integers
(i.e. suitable for use as indices), and if they aren't (i.e. the
allocation is sparse or includes negative values), you can always
"renumber" the categories using a reclass, e.g.:
r.stats -n $inmap | awk '{print $1,"=",NR-1}' | r.reclass input=$inmap output=$outmap
Finally, it may be better to move the "awkward" aggregates (median,
mode) to a separate module, as the algorithms involved are completely
different.
It would probably be worth offering a choice of a single- or
multiple-pass algorithm. Count, sum, mean, minimum and maximum only
require a single pass. AFAICT, variance, standard deviation, skew and
kurtosis can all be done in a single pass, but a multiple-pass
algorithm would be more accurate.
Actually, computing the mode of a large amount of FP data (or integer
data with a large range) is even worse than computing quantiles.
AFAICT, any general solution must have worst-case memory consumption
proportional to either the number of cells in the input map or the
number of possible values (i.e. 2^32 for CELL/FCELL, 2^64 for DCELL).
For a large map with a small range, you need one counter for each
possible value. For a map with a large range, the optimal solution is
to simply sort the input map then scan the sorted data while tracking
the most common value found to data.
For a DCELL map which is too large to fit into memory, the only
practical solution is to reduce the precision.
--
Glynn Clements <glynn@gclements.plus.com>