[GRASS-dev] r.regression.line crash in awk (awk debugging)

Hi,

I wanted to run a linear regression on a DEM and a temperature
map but got an error:
awk: cmd. line:2: (FILENAME=- FNR=3) fatal: division by zero attempted

I have reduced everything to this part (added --lint):

#!/bin/sh

unset LC_ALL
LC_NUMERIC=C
export LC_NUMERIC

echo "345.551054 1 1
364.229489 1 1
382.907925 1 3" | awk --lint '{tot += $3;sumX +=$1 * $3; sumsqX
+=$1*$1*$3;sumY +=$2 * $3; sumsqY +=$2*$2*$3; sumXY +=$1*$2*$3;}
END {B=(sumXY - sumX*sumY/tot)/(sumsqX - sumX*sumX/tot);
R= (sumXY - sumX*sumY/tot)/((sumsqX - sumX^2/tot)*(sumsqY - sumY^2/tot))^0.5;
mediaX=sumX/tot;sumsqX=sumsqX/tot;varX=sumsqX-(mediaX^2);sdX=varX^0.5;
mediaY=sumY/tot;sumsqY=sumsqY/tot;varY=sumsqY-(mediaY^2);sdY=varY^0.5;
A=mediaY - B*mediaX; F= R^2/(1-R^2/tot-2);
print A, B, R, tot, F, mediaX, sdX, mediaY, sdY}'

Running this, I get:
awk: (FILENAME=- FNR=1) warning: reference to uninitialized variable `tot'
awk: (FILENAME=- FNR=1) warning: reference to uninitialized variable `sumX'
awk: (FILENAME=- FNR=1) warning: reference to uninitialized variable `sumsqX'
awk: (FILENAME=- FNR=1) warning: reference to uninitialized variable `sumY'
awk: (FILENAME=- FNR=1) warning: reference to uninitialized variable `sumsqY'
awk: (FILENAME=- FNR=1) warning: reference to uninitialized variable `sumXY'
awk: cmd. line:2: (FILENAME=- FNR=3) fatal: division by zero attempted

The script used to work - what could be wrong? Using 6.4 here but it is
identical to 6.5. My awk foo is a bit limited...

Markus

Markus Neteler wrote:

I wanted to run a linear regression on a DEM and a temperature
map but got an error:
awk: cmd. line:2: (FILENAME=- FNR=3) fatal: division by zero attempted

I have reduced everything to this part (added --lint):

#!/bin/sh

unset LC_ALL
LC_NUMERIC=C
export LC_NUMERIC

echo "345.551054 1 1
364.229489 1 1
382.907925 1 3" | awk --lint '{tot += $3;sumX +=$1 * $3;
sumsqX
+=$1*$1*$3;sumY +=$2 * $3; sumsqY +=$2*$2*$3; sumXY
+=$1*$2*$3;}
END {B=(sumXY - sumX*sumY/tot)/(sumsqX - sumX*sumX/tot);
R= (sumXY - sumX*sumY/tot)/((sumsqX - sumX^2/tot)*(sumsqY -
sumY^2/tot))^0.5;
mediaX=sumX/tot;sumsqX=sumsqX/tot;varX=sumsqX-(mediaX^2);sdX=varX^0.5;
mediaY=sumY/tot;sumsqY=sumsqY/tot;varY=sumsqY-(mediaY^2);sdY=varY^0.5;
A=mediaY - B*mediaX; F= R^2/(1-R^2/tot-2);
print A, B, R, tot, F, mediaX, sdX, mediaY, sdY}'

Running this, I get:
awk: (FILENAME=- FNR=1) warning: reference to uninitialized
variable `tot'

....

it starts with "tot += $3" [aka "tot = tot + $3"], but tot is unused at
that point so undefined. (maybe awk can be trusted to always init new
vars to 0? no idea)

adding a 'BEGIN { tot = 0; sumX = 0; ... }' section at the start of the
awk script might help.

?
Hamish

On Fri, Aug 7, 2009 at 3:05 AM, Hamish<hamish_b@yahoo.com> wrote:

Markus Neteler wrote:

I wanted to run a linear regression on a DEM and a temperature
map but got an error:
awk: cmd. line:2: (FILENAME=- FNR=3) fatal: division by zero attempted

I have reduced everything to this part (added --lint):

#!/bin/sh

unset LC_ALL
LC_NUMERIC=C
export LC_NUMERIC

echo "345.551054 1 1
364.229489 1 1
382.907925 1 3" | awk --lint '{tot += $3;sumX +=$1 * $3;
sumsqX
+=$1*$1*$3;sumY +=$2 * $3; sumsqY +=$2*$2*$3; sumXY
+=$1*$2*$3;}
END {B=(sumXY - sumX*sumY/tot)/(sumsqX - sumX*sumX/tot);
R= (sumXY - sumX*sumY/tot)/((sumsqX - sumX^2/tot)*(sumsqY -
sumY^2/tot))^0.5;
mediaX=sumX/tot;sumsqX=sumsqX/tot;varX=sumsqX-(mediaX^2);sdX=varX^0.5;
mediaY=sumY/tot;sumsqY=sumsqY/tot;varY=sumsqY-(mediaY^2);sdY=varY^0.5;
A=mediaY - B*mediaX; F= R^2/(1-R^2/tot-2);
print A, B, R, tot, F, mediaX, sdX, mediaY, sdY}'

Running this, I get:
awk: (FILENAME=- FNR=1) warning: reference to uninitialized
variable `tot'

....

it starts with "tot += $3" [aka "tot = tot + $3"], but tot is unused at
that point so undefined. (maybe awk can be trusted to always init new
vars to 0? no idea)

adding a 'BEGIN { tot = 0; sumX = 0; ... }' section at the start of the
awk script might help.

Tried this:

echo "345.551054 1 1
364.229489 1 1
382.907925 1 3" | awk --lint -F " " 'BEGIN { tot = 0; sumX = 0; sumsqX
= 0; sumY = 0; sumsqY = 0; sumXY = 0;}
    {tot += $3; sumX +=$1*$3; sumsqX +=$1*$1*$3; sumY +=$2*$3; sumsqY
+=$2*$2*$3; sumXY +=$1*$2*$3;}
END { B = (sumXY - sumX*sumY/tot)/(sumsqX - sumX*sumX/tot);
      R = (sumXY - sumX*sumY/tot)/((sumsqX - sumX^2/tot)*(sumsqY -
sumY^2/tot))^0.5;
      mediaX= sumX/tot; sumsqX=sumsqX/tot; varX=sumsqX-(mediaX^2); sdX=varX^0.5;
      mediaY= sumY/tot; sumsqY=sumsqY/tot; varY=sumsqY-(mediaY^2); sdY=varY^0.5;
      A = mediaY - B*mediaX; F= R^2/(1-R^2/tot-2);
      print A, B, R, tot, F, mediaX, sdX, mediaY, sdY}'

but
awk: cmd. line:3: (FILENAME=- FNR=3) fatal: division by zero attempted

Confused,
Markus

On Fri, Aug 7, 2009 at 9:13 AM, Markus Neteler<neteler@osgeo.org> wrote:

On Fri, Aug 7, 2009 at 3:05 AM, Hamish<hamish_b@yahoo.com> wrote:

Markus Neteler wrote:

I wanted to run a linear regression on a DEM and a temperature
map but got an error:
awk: cmd. line:2: (FILENAME=- FNR=3) fatal: division by zero attempted

I have reduced everything to this part (added --lint):

#!/bin/sh

unset LC_ALL
LC_NUMERIC=C
export LC_NUMERIC

echo "345.551054 1 1
364.229489 1 1
382.907925 1 3" | awk --lint '{tot += $3;sumX +=$1 * $3;
sumsqX
+=$1*$1*$3;sumY +=$2 * $3; sumsqY +=$2*$2*$3; sumXY
+=$1*$2*$3;}
END {B=(sumXY - sumX*sumY/tot)/(sumsqX - sumX*sumX/tot);
R= (sumXY - sumX*sumY/tot)/((sumsqX - sumX^2/tot)*(sumsqY -
sumY^2/tot))^0.5;
mediaX=sumX/tot;sumsqX=sumsqX/tot;varX=sumsqX-(mediaX^2);sdX=varX^0.5;
mediaY=sumY/tot;sumsqY=sumsqY/tot;varY=sumsqY-(mediaY^2);sdY=varY^0.5;
A=mediaY - B*mediaX; F= R^2/(1-R^2/tot-2);
print A, B, R, tot, F, mediaX, sdX, mediaY, sdY}'

Running this, I get:
awk: (FILENAME=- FNR=1) warning: reference to uninitialized
variable `tot'

....

it starts with "tot += $3" [aka "tot = tot + $3"], but tot is unused at
that point so undefined. (maybe awk can be trusted to always init new
vars to 0? no idea)

adding a 'BEGIN { tot = 0; sumX = 0; ... }' section at the start of the
awk script might help.

Tried this:

echo "345.551054 1 1
364.229489 1 1
382.907925 1 3" | awk --lint -F " " 'BEGIN { tot = 0; sumX = 0; sumsqX
= 0; sumY = 0; sumsqY = 0; sumXY = 0;}
{tot += $3; sumX +=$1*$3; sumsqX +=$1*$1*$3; sumY +=$2*$3; sumsqY
+=$2*$2*$3; sumXY +=$1*$2*$3;}
END { B = (sumXY - sumX*sumY/tot)/(sumsqX - sumX*sumX/tot);
R = (sumXY - sumX*sumY/tot)/((sumsqX - sumX^2/tot)*(sumsqY -
sumY^2/tot))^0.5;
mediaX= sumX/tot; sumsqX=sumsqX/tot; varX=sumsqX-(mediaX^2); sdX=varX^0.5;
mediaY= sumY/tot; sumsqY=sumsqY/tot; varY=sumsqY-(mediaY^2); sdY=varY^0.5;
A = mediaY - B*mediaX; F= R^2/(1-R^2/tot-2);
print A, B, R, tot, F, mediaX, sdX, mediaY, sdY}'

but
awk: cmd. line:3: (FILENAME=- FNR=3) fatal: division by zero attempted

But this works.

echo "345.551054 1 1
364.229489 1 1
382.907925 1 3" | awk --lint -F " " 'BEGIN { tot = 0; sumX = 0; sumsqX
= 0; sumY = 0; sumsqY = 0; sumXY = 0;}
    {tot += $3; sumX +=$1*$3; sumsqX +=$1*$1*$3; sumY +=$2*$3; sumsqY
+=$2*$2*$3; sumXY +=$1*$2*$3;}
END {print tot, sumX, sumsqX, sumY, sumsqY, sumXY}'

5 1858.5 691924 5 5 1858.5

If I throw it in R:

tot=5
sumX= 1858.5
sumsqX=691924
sumY= 5
sumsqY= 5
sumXY= 1858.5

B = (sumXY - sumX*sumY/tot)/(sumsqX - sumX*sumX/tot);
R = (sumXY - sumX*sumY/tot)/((sumsqX - sumX^2/tot)*(sumsqY - sumY^2/tot))^0.5;
mediaX= sumX/tot; sumsqX=sumsqX/tot; varX=sumsqX-(mediaX^2); sdX=((varX^2)^0.5)^0.5;
mediaY= sumY/tot; sumsqY=sumsqY/tot; varY=sumsqY-(mediaY^2); sdY=((varY^2)^0.5)^0.5;
A = mediaY - B*mediaX; F= R^2/(1-R^2/tot-2);

B; R; tot; F; mediaX; sdX ;mediaY; sdY

[1] 0
[1] NaN
[1] 5
[1] NaN
[1] 371.7
[1] 14.96362
[1] 1
[1] 0

You see: R = NaN.

But then I rerun the R step:

R = (sumXY - sumX*sumY/tot)/((sumsqX - sumX^2/tot)*(sumsqY - sumY^2/tot))^0.5;
R

[1] 0

MAGIC?! I guess I am overlooking something here...

Markus

Markus Neteler wrote:

echo "345.551054 1 1
364.229489 1 1
382.907925 1 3" | awk ...

....

      R = (sumXY - sumX*sumY/tot)/((sumsqX - sumX^2/tot)*(sumsqY -
sumY^2/tot))^0.5;

sumsqY = 5
sumY = 5
tot = 5

thus (sumsqY - sumY^2/tot))^0.5 = 0

Hamish

On Fri, Aug 7, 2009 at 9:41 AM, Hamish<hamish_b@yahoo.com> wrote:

Markus Neteler wrote:

echo "345.551054 1 1
364.229489 1 1
382.907925 1 3" | awk ...

....

  R     = \(sumXY \- sumX\*sumY/tot\)/\(\(sumsqX \- sumX^2/tot\)\*\(sumsqY \-

sumY^2/tot))^0.5;

sumsqY = 5
sumY = 5
tot = 5

thus (sumsqY - sumY^2/tot))^0.5 = 0

ok... but how to add "protection"?
The original input was 81 rows (pixels with certain temperature
and their corresponding altitude from DEM).

It prints out
print tot, sumX, sumsqX, sumY, sumsqY, sumXY
7432 9.17582e+06 1.14575e+10 7432 7432 9.17582e+06

Mhh, same situation again... perhaps I am just trying nonsense :slight_smile:

cheers
Markus

PS: Should I commit the BEGIN initialization anyway?

[divide by zero in awk]

Markus Neteler wrote:

ok... but how to add "protection"?
The original input was 81 rows (pixels with certain
temperature and their corresponding altitude from DEM).

It prints out
print tot, sumX, sumsqX, sumY, sumsqY, sumXY
7432 9.17582e+06 1.14575e+10 7432 7432 9.17582e+06

Mhh, same situation again... perhaps I am just trying
nonsense :slight_smile:

"awk was developed before IEEE 754 arithmetic became widely available,
so the language does not fully support Infinity and NaN. In particular,
current awk implementations trap attempts to divide by zero, eve though
that is perfectly well-defined in IEEE 754 arithmetic."
--
'Classic shell scripting' by Arnold Robbins, Nelson Beebe
p.229 (O'Reilly Press)

tests:
$ awk 'BEGIN {print 0.0/0.0}'
awk: fatal: division by zero attempted

$ gawk 'BEGIN {print 0.0/0.0}'
gawk: fatal: division by zero attempted

$ mawk 'BEGIN {print 0.0/0.0}'
nan

but I guess we really can't expect mawk to be installed.

how about adding something like this to the awk script:

if( sumsqX == sumX^2/tot ) { B = NaN; }
else { B = (sumXY - sumX*sumY/tot)/(sumsqX - sumX*sumX/tot); }
\
if( sumsqY == sumY^2/tot ) { R = NaN; }
else { R = (sumXY - sumX*sumY/tot)/((sumsqX - sumX^2/tot)*(sumsqY - sumY^2/tot))^0.5; }

or perhaps if(sumsqX - sumX^2/tot < 1e-15 ) { B = NaN; } ... ?

in this case due to squaring I don't think fabs() is needed.. will $tot
always be positive? (AFAICT fabs() doesn't exist in awk anyway)

PS: Should I commit the BEGIN initialization anyway?

I don't think it is strictly needed, but it is good habit to do so.

Hamish

Hamish wrote:

how about adding something like this to the awk script:

(sorry about that being so rough+buggy, but you get the idea)

H