[GRASS-dev] r.regression.line fix proposal

Hi,

I wonder if this change is reasonable to make the script
working with FP maps:

diff -u -r1.17 scripts/r.regression.line
--- r.regression.line 3 Dec 2007 22:52:43 -0000 1.17
+++ r.regression.line 3 Dec 2007 22:54:09 -0000
@@ -90,7 +90,7 @@

#calculate regression equation
-r.stats -c input=$GIS_OPT_MAP1,$GIS_OPT_MAP2 > "$TMP"
+r.stats -cAN input=$GIS_OPT_MAP1,$GIS_OPT_MAP2 > "$TMP"
awk '{tot += $3;sumX +=$1 * $3; sumsqX +=$1*$1*$3;sumY +=$2 * $3; sumsqY +=$2*$2*$3;\
  sumXY +=$1*$2*$3;\
}\

changes:
-n to suppress no data
-A to avoid range output for FP maps
?

(Un)related question:
Would it be a big mess to implement r.regression.line a C module?

Markus

Markus Neteler wrote:

Hi,

I wonder if this change is reasonable to make the script
working with FP maps:

diff -u -r1.17 scripts/r.regression.line
--- r.regression.line 3 Dec 2007 22:52:43 -0000 1.17
+++ r.regression.line 3 Dec 2007 22:54:09 -0000
@@ -90,7 +90,7 @@

#calculate regression equation
-r.stats -c input=$GIS_OPT_MAP1,$GIS_OPT_MAP2 > "$TMP"
+r.stats -cAN input=$GIS_OPT_MAP1,$GIS_OPT_MAP2 > "$TMP"
awk '{tot += $3;sumX +=$1 * $3; sumsqX +=$1*$1*$3;sumY +=$2 * $3; sumsqY
+=$2*$2*$3;\
  sumXY +=$1*$2*$3;\
}\

changes:
-n to suppress no data
-A to avoid range output for FP maps
?

Committed to CVS.

(Un)related question:
Would it be a big mess to implement r.regression.line a C module?

Markus

--
View this message in context: http://www.nabble.com/r.regression.line-fix-proposal-tf4939973.html#a14158969
Sent from the Grass - Dev mailing list archive at Nabble.com.

On Tuesday 04 December 2007, Markus Neteler wrote:

Markus Neteler wrote:
> Hi,
>
> I wonder if this change is reasonable to make the script
> working with FP maps:
>
> diff -u -r1.17 scripts/r.regression.line
> --- r.regression.line 3 Dec 2007 22:52:43 -0000 1.17
> +++ r.regression.line 3 Dec 2007 22:54:09 -0000
> @@ -90,7 +90,7 @@
>
>
> #calculate regression equation
> -r.stats -c input=$GIS_OPT_MAP1,$GIS_OPT_MAP2 > "$TMP"
> +r.stats -cAN input=$GIS_OPT_MAP1,$GIS_OPT_MAP2 > "$TMP"
> awk '{tot += $3;sumX +=$1 * $3; sumsqX +=$1*$1*$3;sumY +=$2 * $3; sumsqY
> +=$2*$2*$3;\
> sumXY +=$1*$2*$3;\
> }\
>
>
> changes:
> -n to suppress no data
> -A to avoid range output for FP maps
> ?

Committed to CVS.

(Un)related question:
Would it be a big mess to implement r.regression.line a C module?

Markus

If written as a C module, could it take advantage of any stats library
functions lying around ?

Dylan

--
Dylan Beaudette
Soil Resource Laboratory
http://casoilresource.lawr.ucdavis.edu/
University of California at Davis
530.754.7341

Dylan Beaudette wrote:

> Would it be a big mess to implement r.regression.line a C module?
>
> Markus

If written as a C module, could it take advantage of any stats library
functions lying around ?

The aggregate functions in lib/stats require that the entire sample is
held in memory. This makes them unsuitable for computing an aggregate
over a substantial proportion of a map's values.

For r.regression.line, r.univar, r.statistics, etc, you need to use an
incremental approach. For some aggregates (count, sum, mean), this is
relatively straightforward.

For variance and deviation, there's the issue of a one-pass or
two-pass algorithm. A two-pass approach (calculating the mean on the
first pass) is more accurate, but requires two passes (which rules out
reading data from a pipe).

For quantiles, you don't want to sort vast amounts of data at
O(n.log(n)) complexity just to obtain specific quantiles. It's more
efficient to compute successive histograms, refining the interval(s)
containing the desired quantile(s) on each pass, and only sorting once
you've reduced the data to a manageable size. This could require
several passes, depending upon the amount of data, the amount of
memory available, and the distribution of the data.

--
Glynn Clements <glynn@gclements.plus.com>

I have added a new flag to r.regression.line since the current
approach wasn't very precise (due to r.stats -c which isn't perfect
for FP maps).

#calculate regression equation
-r.stats -cnA input=$GIS_OPT_MAP1,$GIS_OPT_MAP2 > "$TMP"
+if [ $GIS_FLAG_S -eq 1 ] ; then
+ # slower but accurate
+ r.stats -n1 input=$GIS_OPT_MAP1,$GIS_OPT_MAP2 | sed 's+$+ 1+g' > "$TMP"
+else
+ # count "identical" pixels
+ r.stats -cnA input=$GIS_OPT_MAP1,$GIS_OPT_MAP2 > "$TMP"
+fi
+

The flag -s select the slower method which writes out all
pixel values individually to the temporary file. The result is then
identical to that obtained from R-stats's lm() function.

Markus
--
View this message in context: http://www.nabble.com/r.regression.line-fix-proposal-tp14141009p15124457.html
Sent from the Grass - Dev mailing list archive at Nabble.com.