[GRASS5] [bug #2380] (grass) unexpected '(' in r.univar

this bug's URL: http://intevation.de/rt/webrt?serial_num=2380
-------------------------------------------------------------------------

Subject: unexpected '(' in r.univar

Platform: WindowsNT/2000/XP
grass obtained from: Mirror of Trento site
grass binary for platform: Downloaded precompiled Binaries
GRASS Version: 5.0.3 (Oktober 2003)

while executing r.univar (v.1.7.2.2 2002/10/23) the errormessage "unexpected '(' in line 18" appears.
after changing the line "function cleanup()" to "cleanup()" [the same in line 24] r.univar runs without problems. Is it a bug or a cygwin-problem?

System: WinXP - Cygwin (1.5.9) - GNU bash 2.05b.0(1) - grass 5.0.3 (Oktober 2003)

Greetings to all developers

Marco

-------------------------------------------- Managed by Request Tracker

On Tue, Apr 13, 2004 at 10:50:29AM +0200, Request Tracker wrote:

this bug's URL: http://intevation.de/rt/webrt?serial_num=2380
-------------------------------------------------------------------------

Subject: unexpected '(' in r.univar

Platform: WindowsNT/2000/XP
grass obtained from: Mirror of Trento site
grass binary for platform: Downloaded precompiled Binaries
GRASS Version: 5.0.3 (Oktober 2003)

while executing r.univar (v.1.7.2.2 2002/10/23) the errormessage "unexpected '(' in line 18" appears.
after changing the line "function cleanup()" to "cleanup()" [the same in line 24] r.univar runs without problems. Is it a bug or a cygwin-problem?

The bug is related to the usage of the word "function".

Please try to remove it, keep only "cleanup()" in that line.
Then it should work. This is already fixed in >= 5.3.

Markus

> while executing r.univar (v.1.7.2.2 2002/10/23) the errormessage
> "unexpected '(' in line 18" appears. after changing the line
> "function cleanup()" to "cleanup()" [the same in line 24] r.univar
> runs without problems. Is it a bug or a cygwin-problem?

Yesterday I put together a small module written in C for calculating the
stats on the non-null cells of a raster map, filling pretty much the
same roll as r.univar (which I hadn't met before). I had been doing
something like 'r.to.sites | s.univar', which doesn't work for 5.7 of
course (which is why I wrote it).

Having r.univar makes this redundant of course, but I'll test them to
see how much of a speed difference there is (I take it the r.univar
script will spend a bit of time doing disk I/O). [result: C is at least
40% faster than shell script, and doesn't write 10s of MBs files to /tmp]

Two comments arise: (without good answers)

a) Population vs. sample variance (& standard deviation)

r.series and r.univar use sum((xi-mean(x))^2)/n
   (i.e. population variance aka "sigma^2")

while

s.univar and s.cellstats use sum((xi-mean(x))^2)/(n-1)
   (i.e. sample or bias-corrected variance aka "s^2")

For consistency we should pick one way & document it. The difference
between n and n-1 for big maps with huge numbers of cells isn't very
much, so this isn't too critical, but someone might need to do analysis
on very small/sparse maps one day.... I've used n-1, for no great reason
besides the current region is 'sample' of a larger location.
Can any stats people comment?

b) gmath library: I looked at using the c_var.c & co. functions from
r.series, but these require passing all input values (ie the whole map
in memory) at once, which while good for a general library function or
for n<1000 cells-of-the-same-coordinate like r.series or r.mapcalc might
use, it doesn't cut it for a 10000x10000 DCELL map. I guess I could use
c_sum.c to do one line at a time, but it doesn't seem worth it, and
doesn't get rid of any implementation inconsistencies (eg the n vs. n-1
problem above) which is the great benefit of using a gmath library.
So I just reimplemented in an inconsistent manner as described above.

If people are interested in a replacement to r.univar, I can clean it up
and add the missing extended functionality (quartiles, etc.) which
r.univar provides. I'm not looking forward to the sort.. so maybe I'll
leave that to a real programmer to do.

And the ageless question of what to call it?
  ideas: r.mapstats, r.univar2

shrug.

Hamish

On Thu, Apr 15, 2004 at 12:58:07AM +1200, Hamish wrote:

> > while executing r.univar (v.1.7.2.2 2002/10/23) the errormessage
> > "unexpected '(' in line 18" appears. after changing the line
> > "function cleanup()" to "cleanup()" [the same in line 24] r.univar
> > runs without problems. Is it a bug or a cygwin-problem?

Yesterday I put together a small module written in C for calculating the
stats on the non-null cells of a raster map, filling pretty much the
same roll as r.univar (which I hadn't met before). I had been doing
something like 'r.to.sites | s.univar', which doesn't work for 5.7 of
course (which is why I wrote it).

Having r.univar makes this redundant of course, but I'll test them to
see how much of a speed difference there is (I take it the r.univar
script will spend a bit of time doing disk I/O). [result: C is at least
40% faster than shell script, and doesn't write 10s of MBs files to /tmp]

I'm hoping for years to see r.univar implemented in C...

Two comments arise: (without good answers)

a) Population vs. sample variance (& standard deviation)

r.series and r.univar use sum((xi-mean(x))^2)/n
   (i.e. population variance aka "sigma^2")

while

s.univar and s.cellstats use sum((xi-mean(x))^2)/(n-1)
   (i.e. sample or bias-corrected variance aka "s^2")

For consistency we should pick one way & document it.

Yes: and move into gmath. Please use doxygen-style comments.

The difference
between n and n-1 for big maps with huge numbers of cells isn't very
much, so this isn't too critical, but someone might need to do analysis
on very small/sparse maps one day.... I've used n-1, for no great reason
besides the current region is 'sample' of a larger location.
Can any stats people comment?

b) gmath library: I looked at using the c_var.c & co. functions from
r.series, but these require passing all input values (ie the whole map
in memory) at once, which while good for a general library function or
for n<1000 cells-of-the-same-coordinate like r.series or r.mapcalc might
use, it doesn't cut it for a 10000x10000 DCELL map. I guess I could use
c_sum.c to do one line at a time, but it doesn't seem worth it, and
doesn't get rid of any implementation inconsistencies (eg the n vs. n-1
problem above) which is the great benefit of using a gmath library.
So I just reimplemented in an inconsistent manner as described above.

If people are interested in a replacement to r.univar, I can clean it up
and add the missing extended functionality (quartiles, etc.) which
r.univar provides. I'm not looking forward to the sort.. so maybe I'll
leave that to a real programmer to do.

And the ageless question of what to call it?
  ideas: r.mapstats, r.univar2

IMHO we should aim at *replacing* code, not adding similar
code with different names. The same applies to various
other modules such as r.grow[2] etc. Confusions grows...

See also:
http://intevation.de/rt/webrt?serial_num=1848&display=History
"merge of r.average, r.median and r.mode"

Markus

> Yesterday I put together a small module written in C for calculating
> the stats on the non-null cells of a raster map, filling pretty much
> the same roll as r.univar (which I hadn't met before). I had been
> doing something like 'r.to.sites | s.univar', which doesn't work for
> 5.7 of course (which is why I wrote it).
>
> Having r.univar makes this redundant of course, but I'll test them
> to see how much of a speed difference there is (I take it the
> r.univar script will spend a bit of time doing disk I/O). [result: C
> is at least 40% faster than shell script, and doesn't write 10s of
> MBs files to /tmp]

I'm hoping for years to see r.univar implemented in C...

Ok, it's cleaned up now & seems to be running fine.

> Two comments arise: (without good answers)
>
>
> a) Population vs. sample variance (& standard deviation)
>
> r.series and r.univar use sum((xi-mean(x))^2)/n
> (i.e. population variance aka "sigma^2")
>
> while
>
> s.univar and s.cellstats use sum((xi-mean(x))^2)/(n-1)
> (i.e. sample or bias-corrected variance aka "s^2")
>
>
> For consistency we should pick one way & document it.

Yes: and move into gmath. Please use doxygen-style comments.

As stated in (b) below, this doesn't really work for r.univar(2), but
I'd agree that r.series's c_*.c should be moved to gmath for other
uses.. (sorry, not by me, not today)

> b) gmath library: I looked at using the c_var.c & co. functions from
> r.series, but these require passing all input values (ie the whole
> map in memory) at once, which while good for a general library
> function or for n<1000 cells-of-the-same-coordinate like r.series or
> r.mapcalc might use, it doesn't cut it for a 10000x10000 DCELL map.
> I guess I could use c_sum.c to do one line at a time, but it doesn't
> seem worth it, and doesn't get rid of any implementation
> inconsistencies (eg the n vs. n-1 problem above) which is the great
> benefit of using a gmath library. So I just reimplemented in an
> inconsistent manner as described above.

...

> If people are interested in a replacement to r.univar, I can clean
> it up and add the missing extended functionality (quartiles, etc.)
> which r.univar provides. I'm not looking forward to the sort.. so
> maybe I'll leave that to a real programmer to do.
>
> And the ageless question of what to call it?
> ideas: r.mapstats, r.univar2

IMHO we should aim at *replacing* code, not adding similar
code with different names. The same applies to various
other modules such as r.grow[2] etc. Confusions grows...

Ok, how about following r.mapcalc's lead and making its directory in the
source tree r.univar2 but have the program build a r.univar executable.
When people are happy the new version is better than the old it can be
added to the build list and the script removed at the same time. This
minimizes confusion both for future developers and the user.

Everything in the r.univar script is implemented except for the extended
stats (but some framework is in place for someone to continue that) and
the base= map option (just use a MASK). Only minor modification of
dependant scripts is needed (but definitely some).

If that's alright, I'll add it to CVS but not 5.3's build list (for
now). I'd add it to 5.7 though, replacing r.univar[.sh] immediately.
Yes/no?

See also:
http://intevation.de/rt/webrt?serial_num=1848&display=History
"merge of r.average, r.median and r.mode"

I can't do this now. Someone else?

Hamish

Hamish wrote:

a) Population vs. sample variance (& standard deviation)

r.series and r.univar use sum((xi-mean(x))^2)/n
   (i.e. population variance aka "sigma^2")

while

s.univar and s.cellstats use sum((xi-mean(x))^2)/(n-1)
   (i.e. sample or bias-corrected variance aka "s^2")

For consistency we should pick one way & document it.

Or consistently offer both options.

The difference
between n and n-1 for big maps with huge numbers of cells isn't very
much, so this isn't too critical, but someone might need to do analysis
on very small/sparse maps one day.... I've used n-1, for no great reason
besides the current region is 'sample' of a larger location.
Can any stats people comment?

Note that, for r.series, n is the number of maps, not the number of
cells, so n vs n-1 would be more significant there.

b) gmath library: I looked at using the c_var.c & co. functions from
r.series, but these require passing all input values (ie the whole map
in memory) at once, which while good for a general library function or
for n<1000 cells-of-the-same-coordinate like r.series or r.mapcalc might
use, it doesn't cut it for a 10000x10000 DCELL map.

Yep. General purpose versions of the common aggregate functions would
need to provide a begin/update/end API, so that the data can be
supplied in chunks.

We would also need to consider implementing median (quartiles,
percentiles) efficiently for large amounts of data. Sorting the entire
set is usually overkill, but binning and sorting requires two passes.

Similar issues apply to computing the mode.

--
Glynn Clements <glynn.clements@virgin.net>

On Thu, 15 Apr 2004, Hamish wrote:

(Actually I think it was Markus who wrote:)

> See also:
> http://intevation.de/rt/webrt?serial_num=1848&display=History
> "merge of r.average, r.median and r.mode"

To which Hamish said:

I can't do this now. Someone else?

Hamish

If nobody beats me to it (which would be fine!), I will be working on this
over the summer. I said "way back" that I wanted to tackle it, but
haven't found the time.

I'm glad to see the exchange and your work on r.univar, looking at that
code will probably really help.

Cheers,
Scott

> a) Population vs. sample variance (& standard deviation)
>
> r.series and r.univar use sum((xi-mean(x))^2)/n
> (i.e. population variance aka "sigma^2")
>
> while
>
> s.univar and s.cellstats use sum((xi-mean(x))^2)/(n-1)
> (i.e. sample or bias-corrected variance aka "s^2")
>
>
> For consistency we should pick one way & document it.

Or consistently offer both options.

I think that's overkill (see below). As long as we specify what we are
presenting, if the user knows n, they can always figure out the other
one in the rare case they need it.

> The difference
> between n and n-1 for big maps with huge numbers of cells isn't very
> much, so this isn't too critical, but someone might need to do
> analysis on very small/sparse maps one day.... I've used n-1, for no
> great reason besides the current region is 'sample' of a larger
> location. Can any stats people comment?

Note that, for r.series, n is the number of maps, not the number of
cells, so n vs n-1 would be more significant there.

pulling out the stats textbook...
"Regardless of sample size, however, it is good practice to divide a
sum by n-1 when computing a variance or standard deviation. It should be
assumed that the symbol s^2 refers to a variance obtained by the
division of the sum of squares by the degrees of freedom, as the
quantity n-1 is generally called. The only time when division of the sum
of squares by n is appropriate is when the interest of the investigator
is limited to the sample at hand and to its variance and standard
deviation as descriptive statistics of the sample, in contrast to using
these as estimates of the population parameters. In the rare cases in
which the investigator possesses data on the entire population division
by n is justified because then the investigator is not estimating a
parameter, but is evaluating it."
-- "Biometry" by Sokal & Rohlf, 3rd Ed. p.53
(proudly from SUNY Stony Brook!)

So you could argue that we could use n for r.series, but not for
r.univar. Still I think we should stick with n-1 for everything, as you
want to be dealing with degrees of freedom when your sample size is
small, AFAIK.

We would also need to consider implementing median (quartiles,
percentiles) efficiently for large amounts of data. Sorting the entire
set is usually overkill, but binning and sorting requires two passes.

Similar issues apply to computing the mode.

This remains on the TODO list of the newly uploaded r.univar(2) now in
CVS. Feel free to give it a shot, I'm probably not going to have time to
add these extended stats anytime soon.

best,
Hamish