[GRASS5] Re: [GRASS_DE] r.mapcalc round()

Roger_Bivand · June 1, 2001, 9:56am

Probably the reason why this went quiet is that it is difficult to do well
and portably. R uses six different user-controlled functions:

     `ceiling' takes a single numeric argument `x' and returns a
     numeric vector containing the smallest integers not less than the
     corresponding elements of `x'.

     `floor' takes a single numeric argument `x' and returns a numeric
     vector containing the largest integers not greater than the
     corresponding elements of `x'.

     `round' rounds the values in its first argument to the specified
     number of decimal places (default 0). Note that for rounding off a
     5, the IEEE standard is used, ``go to the even digit''. Therefore
     `round(0.5)' is `0' and `round(-1.5)' is `-2'.

     `signif' rounds the values in its first argument to the specified
     number of significant digits.

     `trunc' takes a single numeric argument `x' and returns a numeric
     vector containing the integers by truncating the values in `x'
     toward `0'.

     `zapsmall' determines a `digits' argument `dr' for calling
     `round(x, digits = dr)' such that values ``close to zero'' are
     ``zapped'', i.e., treated as `0'.

where the underlying C code has to take care of machine dependencies as
well as retaining default values for the number of digits to round to, and
the number of significant digits to retain. The resulting returned values
are then handed off to formatting code. One reason for the difficulties is
that different users may need very different precisions for different
purposes, and this would in our case add to the arguments taken by
programs outputting FCELL and DCELL as text.

Roger

On Fri, 1 Jun 2001, Markus Neteler wrote:

Hi all,

On Fri, May 18, 2001 at 12:26:50PM +0100, Glynn Clements wrote:
>
> [CC'd to grass5, as I really think that this merits some discussion.]
>
> Markus Neteler wrote:
>
> > as you really know much more than me about the precision issue and the
> > related %f (and whatever), may I leave any changes to you? I am
> > afraid to introduce more problems than are currently there. My intention is
> > to have reliable numbers, not optically polished numbers, no question!
> >
> > The candidates seem to be
> > r.stats
> > r.to.sites
> > r.describe
> > r.report
> >
> > which print numbers to stdout.
>
> I started looking into r.stats, and noticed a couple of issues:
>
> 1. r.stats uses DCELL for all floating-point data. Theoretically this
> needs 17 significant digits in order to preserve accuracy, but would
> the data ever be this accurate?
>
> For geographical data, I suspect not; but how accurate might it be? If
> we pick some "reasonable" value, that then becomes the limit of its
> accuracy. If we don't, 17 digits is likely to be too many in 99% of
> cases (9 digits too many for all FCELL data).
>
> I noticed some code which appeared to be intended to support
> specifying the precision ("dp" variable in main.c, "fmt" argument to
> cell_stats, print_cell_stats). However, this isn't actually used (in
> that "dp" can't actually be changed).
>
> 2. Shouldn't the format of "r.stats -1" be consistent for all data?
>
> This initially started as a question of whether to strip trailing
> zeroes, but then there's the issue that, with "%g", some values could
> be printed in exponential form and others not.
>
> One solution would be to select either "%e" or "%f" (and an
> appropriate precision) once, based upon the overall range of the data,
> rather than letting "%g" choose for each individual value.
>
> 3. Might the output from "r.stats -1" be fed to programs which don't
> recognise exponential form? The ANSI functions (atof, strtod, *scanf)
> all support it, but not everything uses those.
>
> I haven't looked at the other programs; I suspect that similar issues
> may apply to those.
>
> If the output from the above programs (apart from r.to.sites) is
> intended for a user (instead of or as well as for programs), then
> appearance is a valid consideration. Further, programs may impose
> limitations on their input (e.g. a program which reads floating-point
> values may first store the string in a buffer which isn't wide enough
> for the full precision of a double).
>
> It might be worth the effort of designing and implementing (the latter
> is the easy part) a system-wide function for converting floating-point
> numbers to decimal.
>
> At its simplest, this could be little more than:
>
> sprintf(buf, getenv("GRASS_FLOAT_FMT"), val);
>
> (don't take this too literally).
>
> A better approach would allow individual programs to specify
> particular requirements, with the default format filling in the
> blanks.
>
> An additional possibility is a new option type (in the sense of
> G_define_option) to allow consistent input of format specifications as
> command-line parameters.

Unfortunately there is still no discussion of these problems...

I just want to bring this problem back to mind.

--
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Breiviksveien 40, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 93 93
e-mail: Roger.Bivand@nhh.no
and: Department of Geography and Regional Development, University of
Gdansk, al. Mar. J. Pilsudskiego 46, PL-81 378 Gdynia, Poland.

From neteler Fri Jun 1 11:38:48 2001
Date: Fri, 1 Jun 2001 11:38:48 +0100
From: Markus Neteler <neteler@geog.uni-hannover.de>
To: grass5@geog.uni-hannover.de
Subject: Re: [GRASS5] vector maps: Hardcoded Organization

Hi all,

On Thu, May 10, 2001 at 11:28:40AM -0500, B. Byars wrote:

Markus Neteler wrote:

> Hi again,
>
> here some low priority issue: Some vector modules have the hardcoded
> organization entry: "US Army Const. Eng. Rsch. Lab"
>
> ./mapdev/v.in.dxf/make_header.c
> ./mapdev/v.mkgrid/init_head.c
> ./mapdev/v.mkquads/init_head.c
> ./sites/s.voronoi/init_head.c

./mapdev/v.digit/head_info.c

I have added support for a new environment variable "GRASS_ORGANIZATION" to
the above modules. If set, it is used for "Organization"; if unset,
"GRASS Development Team" is used.

In case of objection, please let me know.
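
The lookup described above can be sketched as follows. This is not the code committed to the modules; organization_name() is a hypothetical helper, and treating an empty value like an unset one is an assumption.

```c
#include <stdlib.h>

/* Return the organization string for a vector header: the value of
 * GRASS_ORGANIZATION if it is set, otherwise the fallback.
 * Assumption: an empty value is treated like an unset variable. */
const char *organization_name(void)
{
    const char *org = getenv("GRASS_ORGANIZATION");
    if (org == NULL || org[0] == '\0')
        return "GRASS Development Team";
    return org;
}
```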

Cheers

Markus

Glynn_Clements · June 1, 2001, 10:06pm

Roger Bivand wrote:

where the underlying C code has to take care of machine dependencies as
well as retaining default values for the number of digits to round to, and
the number of significant digits to retain. The resulting returned values
are then handed off to formatting code. One reason for the difficulties is
that different users may need very different precisions for different
purposes, and this would in our case add to the arguments taken by
programs outputting FCELL and DCELL as text.

Yep; so what should the syntax of these arguments be?

The simplest option (to implement) would be to allow a literal format
string to be specified, e.g. "fmt=%.6g". But that isn't necessarily
the best option for users.
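
If a literal format string were accepted, it would need checking before being handed to printf(), since an arbitrary string can contain conversions that do not match a single double argument. A minimal sketch, assuming only one plain floating-point conversion is allowed; is_plain_double_format() is a hypothetical name:

```c
#include <ctype.h>
#include <string.h>

/* Return 1 if fmt is exactly one conversion of the form
 *   %[flags][width][.precision](e|E|f|g|G)
 * and nothing else, otherwise 0. Rejects '*', length modifiers,
 * other conversions (such as %s or %n) and any surrounding text. */
int is_plain_double_format(const char *fmt)
{
    if (fmt == NULL || *fmt++ != '%')
        return 0;
    while (*fmt != '\0' && strchr("-+ #0", *fmt) != NULL)  /* flags */
        fmt++;
    while (isdigit((unsigned char)*fmt))                   /* width */
        fmt++;
    if (*fmt == '.') {                                     /* precision */
        fmt++;
        while (isdigit((unsigned char)*fmt))
            fmt++;
    }
    if (*fmt == '\0' || strchr("eEfgG", *fmt) == NULL)     /* conversion */
        return 0;
    return fmt[1] == '\0';
}
```

This accepts "%.6g" and "%-12.3e" but rejects "%s", "%.*g" and "%.6g%n". It does not limit the width or precision, so the caller would still want snprintf() with a bounded buffer.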

One issue is that it might be desirable to allow a default format to
be set, e.g. via an environment variable. Ideally this would be able
to specify different formats depending upon whether the data was FCELL
or DCELL (assuming that the program knows; it can't know if DCELL data
is "real" DCELL data or FCELL data which has been "promoted" to
DCELL).

If the format specification is in the form of a printf() format
string, the only form of override would be complete replacement. A
more structured mechanism would allow e.g. the number of significant
figures to be changed while retaining the default style (e/f/g).

Overriding becomes important in cases where the output must conform to
certain constraints; e.g. v.out.foo where the "foo" format doesn't
support exponential form. The program would have to use some variation
on "%f", but the user could still choose the number of digits.
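
One way to picture the "more structured mechanism" is to keep the style and the digit count as separate fields, so that a program can force the style while the user still controls the digits. The struct and both functions below are hypothetical and only sketch the idea:

```c
#include <stdio.h>

/* Hypothetical structured format: style is 'e', 'f' or 'g';
 * digits is the precision handed to printf. */
struct float_format {
    char style;
    int digits;
};

/* Combine a default with overrides. forced_style = 0 means the program
 * imposes no constraint; user_digits < 0 means the user gave none. */
struct float_format merge_format(struct float_format def,
                                 char forced_style, int user_digits)
{
    struct float_format out = def;
    if (forced_style != 0)
        out.style = forced_style;
    if (user_digits >= 0)
        out.digits = user_digits;
    return out;
}

/* Write val into buf using fmt; returns what snprintf() returns. */
int format_value(char *buf, size_t size, struct float_format fmt, double val)
{
    switch (fmt.style) {
    case 'e': return snprintf(buf, size, "%.*e", fmt.digits, val);
    case 'f': return snprintf(buf, size, "%.*f", fmt.digits, val);
    default:  return snprintf(buf, size, "%.*g", fmt.digits, val);
    }
}
```

A hypothetical v.out.foo would call merge_format(def, 'f', user_digits), so the exponent never appears but the number of digits remains the user's choice.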

Thinking about this a bit, there are quite a few different ways to
specify the precision, including (but probably not limited to):

  Number of decimal places
  Number of significant figures
  Relative error
  Absolute error

Suppression or inclusion of leading and/or trailing zeroes may also be
an issue.

One parameter that cannot be controlled if you use printf() is the
number of digits used for the exponent. ANSI C dictates 2 digits
unless 3 are required, with a "+" for positive exponents
(incidentally, Microsoft's runtime always uses 3 digits).
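
Although printf() offers no control over the exponent width, the exponent can be rewritten after the fact. A minimal sketch, assuming the runtime prints a finite value as mantissa, 'e', sign and digits; format_exp() is a hypothetical helper:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Format val in exponential form with `prec` digits after the point and
 * an exponent of at least `exp_digits` digits (sign always shown).
 * The exponent printed by the runtime is parsed and re-emitted, so the
 * result does not depend on the runtime's 2-versus-3 digit choice.
 * Returns the snprintf() result, or -1 on bad arguments. */
int format_exp(char *buf, size_t size, double val, int prec, int exp_digits)
{
    char tmp[64];
    if (prec < 0 || prec > 40 || exp_digits < 1)
        return -1;
    int n = snprintf(tmp, sizeof tmp, "%.*e", prec, val);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;
    char *e = strchr(tmp, 'e');
    if (e == NULL)                 /* inf or nan: nothing to rewrite */
        return snprintf(buf, size, "%s", tmp);
    long exponent = strtol(e + 1, NULL, 10);
    *e = '\0';                     /* tmp now holds only the mantissa */
    return snprintf(buf, size, "%se%c%0*ld", tmp,
                    exponent < 0 ? '-' : '+', exp_digits, labs(exponent));
}
```

For example, format_exp(buf, sizeof buf, 9800.0, 2, 1) yields "9.80e+3", and asking for three exponent digits with 1.5e-7 and prec = 1 yields "1.5e-007".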

--
Glynn Clements <glynn.clements@virgin.net>

Roger_Bivand · June 4, 2001, 1:43pm

On Fri, 1 Jun 2001, Glynn Clements wrote:

> where the underlying C code has to take care of machine dependencies as
> well as retaining default values for the number of digits to round to, and
> the number of significant digits to retain. The resulting returned values
> are then handed off to formatting code. One reason for the difficulties is
> that different users may need very different precisions for different
> purposes, and this would in our case add to the arguments taken by
> programs outputting FCELL and DCELL as text.

Yep; so what should the syntax of these arguments be?

The simplest option (to implement) would be to allow a literal format
string to be specified, e.g. "fmt=%.6g". But that isn't necessarily
the best option for users.

...

Thinking about this a bit, there are quite a few different ways to
specify the precision, including (but probably not limited to):

  Number of decimal places
  Number of significant figures
  Relative error
  Absolute error

It looks as though number of significant figures is the most important,
because that takes care of relative error too. Absolute error is maybe too
scale-dependent to help, I don't have a feel for that unless that's what a
client asks about? Number of decimal places is also dependent on where the
point is to start with, so significant digits seem to be the way to go,
and the line of least resistance is then %<sig.digs>e, (taking into
account platform dependencies with regard to the exponent). Default values
would follow from machine representation, for DCELL say 16 and for FCELL
say 8, but stored as environment variables.

I'm only too aware that this still skips over Markus' original
question, which has at least as much to do with the apparent oddness of
numbers differing between input and output when they appear with
"too many" significant digits.

Roger

--
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Breiviksveien 40, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 93 93
e-mail: Roger.Bivand@nhh.no
and: Department of Geography and Regional Development, University of
Gdansk, al. Mar. J. Pilsudskiego 46, PL-81 378 Gdynia, Poland.

Glynn_Clements · June 4, 2001, 3:38pm

Roger Bivand wrote:

> > where the underlying C code has to take care of machine dependencies as
> > well as retaining default values for the number of digits to round to, and
> > the number of significant digits to retain. The resulting returned values
> > are then handed off to formatting code. One reason for the difficulties is
> > that different users may need very different precisions for different
> > purposes, and this would in our case add to the arguments taken by
> > programs outputting FCELL and DCELL as text.
>
> Yep; so what should the syntax of these arguments be?
>
> The simplest option (to implement) would be to allow a literal format
> string to be specified, e.g. "fmt=%.6g". But that isn't necessarily
> the best option for users.
>
> ...
>
> Thinking about this a bit, there are quite a few different ways to
> specify the precision, including (but probably not limited to):
>
> Number of decimal places
> Number of significant figures
> Relative error
> Absolute error

It looks as though number of significant figures is the most important,
because that takes care of relative error too. Absolute error is maybe too
scale-dependent to help, I don't have a feel for that unless that's what a
client asks about? Number of decimal places is also dependent on where the
point is to start with, so significant digits seem to be the way to go,
and the line of least resistance is then %<sig.digs>e, (taking into
account platform dependencies with regard to the exponent). Default values
would follow from machine representation, for DCELL say 16 and for FCELL
say 8, but stored as environment variables.

%e is certainly the simplest to implement; it isn't necessarily all
that user friendly, though. It may not be acceptable to some programs
either (although this can be worked around by post-processing, e.g.
with awk).

I suspect that using exponential form generally might prove unpopular.

Having thought about it some more, some of the potentially useful
options can't be implemented simply by replacing printf("%f", val) with
print_in_preferred_format(val).

When displaying multiple values, it might sometimes be desirable to
have all the values printed in a similar format, e.g.

   9.80e+3
   9.90e+3
  10.00e+3
  10.10e+3

might be preferred over:

  9.80e+3
  9.90e+3
  1.00e+4
  1.01e+4

This would require the program to perform two passes on the data; one
to determine the range of magnitudes and another to print the values.

I guess that what I'm getting at is that the "default format" isn't so
much a constant as a function of various parameters, such as the
minimum and maximum values, the precision of the data[1], and maybe
program-specific constraints.

However, it isn't practical to allow the user to specify an arbitrary
function at run-time. The best practical option is for the (fixed)
formatting function to have the right parameters so that any remotely
reasonable formatting function can be chosen at run-time.

Does this make sense?

[1] The precision is really more than just whether the data is FCELL
or DCELL, although at present that's all that we have to go on. It
would be better if coordinate data (i.e. vector and sites files) could
include an indication of the data's actual precision.

--
Glynn Clements <glynn.clements@virgin.net>

Roger_Bivand · June 6, 2001, 12:23pm

On Mon, 4 Jun 2001, Glynn Clements wrote:

Roger Bivand wrote:

> It looks as though number of significant figures is the most important,
> because that takes care of relative error too. Absolute error is maybe too
> scale-dependent to help, I don't have a feel for that unless that's what a
> client asks about? Number of decimal places is also dependent on where the
> point is to start with, so significant digits seem to be the way to go,
> and the line of least resistance is then %<sig.digs>e, (taking into
> account platform dependencies with regard to the exponent). Default values
> would follow from machine representation, for DCELL say 16 and for FCELL
> say 8, but stored as environment variables.

%e is certainly the simplest to implement; it isn't necessarily all
that user friendly, though. It may not be acceptable to some programs
either (although this can be worked around by post-processing, e.g.
with awk).

I suspect that using exponential form generally might prove unpopular.

I'm sure you're right if a user needs to read the output. I tend to be too
abstract and would generally use awk, sed, or something similar to
post-process output in any case - %e with specified significance is more
for other programs to read. But for other programs, this is what preserves
the precision in the data.

Having thought about it some more, some of the potentially useful
options can't be implemented simply by replacing printf("%f", val) with
print_in_preferred_format(val).

When displaying multiple values, it might sometimes be desirable to
have all the values printed in a similar format, e.g.

   9.80e+3
   9.90e+3
  10.00e+3
  10.10e+3

might be preferred over:

  9.80e+3
  9.90e+3
  1.00e+4
  1.01e+4

This would require the program to perform two passes on the data; one
to determine the range of magnitudes and another to print the values.

I guess that what I'm getting at is that the "default format" isn't so
much a constant as a function of various parameters, such as the
minimum and maximum values, the precision of the data[1], and maybe
program-specific constraints.

However, it isn't practical to allow the user to specify an arbitrary
function at run-time. The best practical option is for the (fixed)
formatting function to have the right parameters so that any remotely
reasonable formatting function can be chosen at run-time.

Does this make sense?

Yes. The next step is where to keep the knowledge about the value range,
number of significant digits, etc., which is a kind of metadata. Should it
be given explicitly to each modular command, or treated like a region or
histogram, or palette, or operate per session for a location? In a
different thread, the common-sense in GRASS implicit in having separate
programs do different things is stressed. But here - as with GUI-type
things, the granularity is different, with the significance metadata
needing (potentially) to be accessible to successive programs, so that the
formatting decisions don't need to be passed to each in turn.

[1] The precision is really more than just whether the data is FCELL
or DCELL, although at present that's all that we have to go on. It
would be better if coordinate data (i.e. vector and sites files) could
include an indication of the data's actual precision.

Roger

--
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Breiviksveien 40, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 93 93
e-mail: Roger.Bivand@nhh.no
and: Department of Geography and Regional Development, University of
Gdansk, al. Mar. J. Pilsudskiego 46, PL-81 378 Gdynia, Poland.

Glynn_Clements · June 6, 2001, 2:09pm

Roger Bivand wrote:

> However, it isn't practical to allow the user to specify an arbitrary
> function at run-time. The best practical option is for the (fixed)
> formatting function to have the right parameters so that any remotely
> reasonable formatting function can be chosen at run-time.
>
> Does this make sense?

Yes. The next step is where to keep the knowledge about the value range,
number of significant digits, etc., which is a kind of metadata. Should it
be given explicitly to each modular command, or treated like a region or
histogram, or palette, or operate per session for a location? In a
different thread, the common-sense in GRASS implicit in having separate
programs do different things is stressed. But here - as with GUI-type
things, the granularity is different, with the significance metadata
needing (potentially) to be accessible to successive programs, so that the
formatting decisions don't need to be passed to each in turn.

And, of course, the step after that is for any program which generates
data to automatically set the precision based upon the precision of
the input(s) and the nature of the computation. But I'm not expecting
to see that happen soon.

For now, I'm just looking for a way to "pass the buck" to the user.
At least they *might* know the correct format to use.

Anyway, much of this is starting to look like 5.1 territory.

For 5.0.0 I'd suggest something along the lines of:

printf("%#.*g", atoi(getenv("GRASS_PRECISION")), val);

except in any cases where exponential form is known not to work, in
which case use "%.*f" instead.

For numbers larger than 1e-4, you can prevent exponential form from
being used by setting GRASS_PRECISION to a large enough value. And
numbers smaller than 1e-4 weren't handled well with "%f" anyway
(maximum of 2 sig.dig).
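
Taken literally, the one-liner above would crash when GRASS_PRECISION is unset, because getenv() then returns NULL. A slightly more defensive sketch follows; the default of 6, the upper bound of 40 and the function names are assumptions, not anything that exists in GRASS.

```c
#include <stdio.h>
#include <stdlib.h>

#define DEFAULT_PRECISION 6   /* assumption: same as printf's default */

/* Read GRASS_PRECISION, falling back to the default when the variable
 * is unset or is not an integer in the range 1..40. */
int grass_precision(void)
{
    const char *s = getenv("GRASS_PRECISION");
    if (s == NULL)
        return DEFAULT_PRECISION;
    char *end;
    long p = strtol(s, &end, 10);
    if (end == s || *end != '\0' || p < 1 || p > 40)
        return DEFAULT_PRECISION;
    return (int)p;
}

/* Format val with "%#.*g", or with "%.*f" where exponential form is
 * known not to work for the consumer. */
int format_fp(char *buf, size_t size, double val, int allow_exp)
{
    int p = grass_precision();
    return allow_exp ? snprintf(buf, size, "%#.*g", p, val)
                     : snprintf(buf, size, "%.*f", p, val);
}
```

With GRASS_PRECISION=4, the value 0.0000123 is printed as "1.230e-05", while 1234.5678 is printed as "1235." (the "#" flag keeps the decimal point and any trailing zeros).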

Also, I'm not sure if there's much point having separate defaults for
FCELL and DCELL data. Does anyone know of real-world cases where the
actual precision exceeds that of "float" (rel. error ~= 1.2e-7)?

--
Glynn Clements <glynn.clements@virgin.net>

Helena_Mitasova · June 6, 2001, 6:04pm

Glynn Clements wrote:

Also, I'm not sure if there's much point having separate defaults for
FCELL and DCELL data. Does anyone know of real-world cases where the
actual precision exceeds that of "float" (rel. error ~= 1.2e-7)?

It is safe to assume that such data exist. Even with the accuracy of the
current elevation measurements in cm you want the data stored with precision
in mm, and in higher elevation areas you would need 7 digits. It is even more
important for data such as curvatures or pollutant concentrations, which can
change over several orders of magnitude.

When we discussed the implementation of FP in GRASS it was obvious that for
most of the applications at that time 7 digits were sufficient; however, with
GIS being increasingly used to support numerical modeling, the precision needed
can easily exceed FP. On the other hand, handling everything as DCELL would be
overkill, as most of the current data are indeed sufficiently supported by FP.

Some of the printouts, e.g. for r.colors with the rules option, have a really
ridiculously large number of digits, which may be confusing for many users, so
it would be useful to change the output at least for those cases where the
solution is obvious.

Helena


Glynn_Clements · June 7, 2001, 4:56am

Helena wrote:

> Also, I'm not sure if there's much point having separate defaults for
> FCELL and DCELL data. Does anyone know of real-world cases where the
> actual precision exceeds that of "float" (rel. error ~= 1.2e-7)?

It is safe to assume that such data exist. Even with the accuracy of
the current elevation measurements in cm you want the data stored
with precision in mm and in higher elevation areas you would need 7
digits. It is even more important for data such as curvatures or
pollutant concentrations which can change over several magnitudes.

I'm quite sure that such data do exist, but if they only account for a
small fraction of GRASS' usage, it doesn't seem unreasonable to
require the user to explicitly set the precision in those cases.

In terms of specifics, it boils down to whether to have one
environment variable (e.g. GRASS_PRECISION) to be used for all decimal
conversions of FP values, or two variables (e.g. GRASS_FLOAT_PRECISION
and GRASS_DOUBLE_PRECISION) and select which one to use depending upon
whether FCELL or DCELL data are used.
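
The two alternatives need not exclude each other. The sketch below assumes a layered lookup that nobody in this thread has proposed as such: the type-specific variable first, then GRASS_PRECISION, then a built-in default (FLT_DIG and DBL_DIG on IEEE 754 machines). All names other than the three environment variables are hypothetical.

```c
#include <stdlib.h>

enum fp_type { FP_FLOAT, FP_DOUBLE };

/* Parse a digit count; returns 0 when s is missing or unusable. */
static int parse_digits(const char *s)
{
    if (s == NULL)
        return 0;
    char *end;
    long v = strtol(s, &end, 10);
    return (end != s && *end == '\0' && v >= 1 && v <= 40) ? (int)v : 0;
}

/* Type-specific variable, then GRASS_PRECISION, then a default. */
int precision_for(enum fp_type type)
{
    const char *specific = (type == FP_DOUBLE) ? "GRASS_DOUBLE_PRECISION"
                                               : "GRASS_FLOAT_PRECISION";
    int p = parse_digits(getenv(specific));
    if (p == 0)
        p = parse_digits(getenv("GRASS_PRECISION"));
    if (p == 0)
        p = (type == FP_DOUBLE) ? 15 : 6;   /* DBL_DIG and FLT_DIG */
    return p;
}
```

A program that has promoted float to double internally would still have to remember the original type in order to pass the right fp_type, which is the non-trivial part noted below.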

Note that the work involved in choosing the right setting may not
always be trivial; some programs may automatically promote float to
double for internal computations. The section of the program which
outputs the results may not have any idea as to the original
precision.

The bottom line is whether the cases where DCELL data really warrant
greater precision are sufficiently common to justify doing the extra
work involved at the present time (i.e. before 5.0.0 is released).

Something needs to be done now, as there is existing code (unless it's
all been changed since the start of this discussion) which will
sometimes do the wrong thing (specifically, use of "%f" or "%.<fixed
precision>f", which will introduce significant error for small
values), with no way for the user to change it.
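
The size of that error is easy to measure by writing a value as text and reading it back. A minimal sketch with hypothetical helper names:

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Relative difference between val and the number parsed from text. */
static double relative_error(const char *text, double val)
{
    double back = strtod(text, NULL);
    return val == 0.0 ? fabs(back) : fabs(back - val) / fabs(val);
}

/* Error after a round trip through plain "%f" (six decimal places). */
double error_with_f(double val)
{
    char buf[512];   /* large enough for "%f" of any finite double */
    snprintf(buf, sizeof buf, "%f", val);
    return relative_error(buf, val);
}

/* Error after a round trip through "%.*g" with sig significant digits. */
double error_with_g(double val, int sig)
{
    char buf[64];
    snprintf(buf, sizeof buf, "%.*g", sig, val);
    return relative_error(buf, val);
}
```

For 1.5e-7, "%f" prints "0.000000", so error_with_f() returns 1.0: the value is lost completely. For 0.0000123 it prints "0.000012", an error of a little over 2%. With "%.6g" both values come back with a relative error below 1e-6.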

--
Glynn Clements <glynn.clements@virgin.net>

Roger_Bivand · June 7, 2001, 10:21am

On Thu, 7 Jun 2001, Glynn Clements wrote:

Helena wrote:

> > Also, I'm not sure if there's much point having separate defaults for
> > FCELL and DCELL data. Does anyone know of real-world cases where the
> > actual precision exceeds that of "float" (rel. error ~= 1.2e-7)?
>
> It is safe to assume that such data exist. Even with the accuracy of
> the current elevation measurements in cm you want the data stored
> with precision in mm and in higher elevation areas you would need 7
> digits. It is even more important for data such as curvatures or
> pollutant concentrations which can change over several magnitudes.

The bottom line is whether the cases where DCELL data really warrant
greater precision are sufficiently common to justify doing the extra
work involved at the present time (i.e. before 5.0.0 is released).

Something needs to be done now, as there is existing code (unless it's
all been changed since the start of this discussion) which will
sometimes do the wrong thing (specifically, use of "%f" or "%.<fixed
precision>f", which will introduce significant error for small
values), with no way for the user to change it.

I agree that this 5.0.0 solution is the one to go for now, because it
permits us to set the environment variable to cater for other cases if
need be. I'm unsure about whether there is a help page giving the
environment variables and their default settings - if there is, that would
be where to put the info, and maybe also a standard phrase on affected
program pages?

Roger

--
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Breiviksveien 40, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 93 93
e-mail: Roger.Bivand@nhh.no
and: Department of Geography and Regional Development, University of
Gdansk, al. Mar. J. Pilsudskiego 46, PL-81 378 Gdynia, Poland.

Justin_Hickey · June 8, 2001, 5:52am

Hi Roger

Roger Bivand wrote:

I'm unsure about whether there is a help page giving the
environment variables and their default settings - if there is, that
would be where to put the info, and maybe also a standard phrase on
affected program pages?

It's not really a help page, but there is the file documents/envVars.txt
that lists all known environment variables for GRASS. Many of them do
not have descriptions. Note that if you add variables to this file, they
should be listed in alphabetical order (the GRASS prefix is ignored,
since it was unofficially decided that all environment variables should
start with GRASS_; we just haven't changed them all yet).

--
Sincerely,

Jazzman (a.k.a. Justin Hickey) e-mail: jhickey@hpcc.nectec.or.th
High Performance Computing Center
National Electronics and Computer Technology Center (NECTEC)
Bangkok, Thailand

People who think they know everything are very irritating to those
of us who do. ---Anonymous
