[GRASS-dev] Raster format and dual function module

Hi all,

I've been thinking a bit about GRASS and two things which I'd like
to see change. However, I *don't* think that these things should be
changed just for me, and I recognise the need to remain compatible
with existing modules etc.

1) Raster file format - Unless I'm mistaken, the raster is a flat
compressed file, with NULL values stored uncompressed in a separate
bitmask. This means the NULL mask is quite large, and also means the
raster is split across files. Obviously this makes it fastish, but
large and sparse maps are unnecessarily huge.

I've also been frustrated that one has to read an entire set of rows
to find out whether any cells are non-null. This makes modelling
processes slower than they need to be. I think that a quadtree format
would solve this (lower-resolution versions of a region overlaid on
one another, e.g. 128x128, 64x64, 32x32) by allowing you to descend
only those tree branches that have raster cells present. Quadtrees
would also speed up the process of displaying large raster maps on
limited-resolution monitors.

Quadtree-based rasters with integrated null bitmasks could easily be
accessed through the normal raster function calls, ensuring existing
modules remain interoperable. Only new modules wanting to use
quadtree functions (such as checking the existence of values at
positions higher up the tree) would have to check the version of the
GRASS raster.

2) Dual function modules - there is a lot of talk about how we want
to move forward with the GUI but still maintain the separate programs
for GRASS commands. I suggest having a compilation system that will
create both standalone executables and integrated libraries. A GUI
could then check whether a library version of a GRASS module exists
and load that, or otherwise use the equivalent executable.

This is due to another frustration I've had with speed of execution.
While running long chains of commands in simulations I can't help
realising that every command has to reload a map from disk and then
write it back. As we all know, disk access is REALLY slow in
comparison to memory etc. so if GRASS modules were compiled as
libraries the last N (N being configurable) number of loaded maps
could remain in memory for quick processing and display.

Let me know what people think. Obviously I don't expect other people
to go ahead with this just because I would like to see it, but if
people see some value in these approaches I could attempt to map out a
course and contribute some time to it.

Cheers,
Joel

--
"Wish not to seem, but to be, the best."
                -- Aeschylus

On Thu, 25 May 2006 12:30:28 +1200
"Joel Pitt" <joel.pitt@gmail.com> wrote:

> Hi all,
>
> I've been thinking a bit about GRASS and two things which I'd like
> to see change. However, I *don't* think that these things should be
> changed just for me, and I recognise the need to remain compatible
> with existing modules etc.
>
> 1) Raster file format - Unless I'm mistaken, the raster is a flat
> compressed file, with NULL values stored uncompressed in a separate
> bitmask. This means the NULL mask is quite large, and also means the
> raster is split across files. Obviously this makes it fastish, but
> large and sparse maps are unnecessarily huge.
>
> I've also been frustrated that one has to read an entire set of rows
> to find out whether any cells are non-null. This makes modelling
> processes slower than they need to be. I think that a quadtree format
> would solve this (lower-resolution versions of a region overlaid on
> one another, e.g. 128x128, 64x64, 32x32) by allowing you to descend
> only those tree branches that have raster cells present. Quadtrees
> would also speed up the process of displaying large raster maps on
> limited-resolution monitors.
>
> Quadtree-based rasters with integrated null bitmasks could easily be
> accessed through the normal raster function calls, ensuring existing
> modules remain interoperable. Only new modules wanting to use
> quadtree functions (such as checking the existence of values at
> positions higher up the tree) would have to check the version of the
> GRASS raster.

Having had the pleasure of working with quadtrees under the old SPANS
system, I can certainly see the speed benefit. I've thought about this
in the past, but didn't know how it could be implemented such that
existing modules wouldn't be affected. Perhaps I'm a bit slow, but I
don't understand how your suggested method would allow existing modules
to function in a normal way. Could you point me to some literature
or provide a bit more explanation, please?

> 2) Dual function modules - there is a lot of talk about how we want
> to move forward with the GUI but still maintain the separate programs
> for GRASS commands. I suggest having a compilation system that will
> create both standalone executables and integrated libraries. A GUI
> could then check whether a library version of a GRASS module exists
> and load that, or otherwise use the equivalent executable.
>
> This is due to another frustration I've had with speed of execution.
> While running long chains of commands in simulations I can't help
> realising that every command has to reload a map from disk and then
> write it back. As we all know, disk access is REALLY slow in
> comparison to memory etc. so if GRASS modules were compiled as
> libraries the last N (N being configurable) number of loaded maps
> could remain in memory for quick processing and display.

I recall that some time back, when there were discussions about
providing much of the GRASS functionality as libraries, one of the
concerns raised was memory management. I believe it was Glynn who
indicated that most of the modules have been written under the
assumption that they are free-standing and not part of an integrated
whole, so the level of work that has gone into memory management is
not sufficient for use in a library situation without major rewrites.

As I understand it, by providing a SWIG interface to some of the
backend libraries in GRASS it should be possible to write modules with
much more control than is currently available by stringing together
commands, and thus they could potentially be faster. Of this I'm not
so sure, so others will probably provide better answers on this one.

I personally feel pretty nervous about modules loading entire files
into memory unnecessarily. For example, right now v.in.ascii is still
pretty slow at loading large files even when topology generation is
disabled (my assertion here is based on quoted numbers for files of
specified sizes, compared to how long it took to load similarly large
point files on SPANS under OS/2 on a 486 years ago). Further, in my
work developing an updated GUI for Stereo, I've noticed that without
being careful it was possible to load entire images when it was not
necessary and slow the entire system to a crawl. The issue in both
cases is one of well-planned buffered reading and writing, so that the
system doesn't go into thrash mode trying to load gigabytes' worth of
data, but at the same time isn't spending inordinate amounts of time
waiting for the disk.

T
--
Trevor Wiens
twiens@interbaun.com

The significant problems that we face cannot be solved at the same
level of thinking we were at when we created them.
(Albert Einstein)

> This is due to another frustration I've had with speed of execution.
> While running long chains of commands in simulations I can't help
> realising that every command has to reload a map from disk and then
> write it back. As we all know, disk access is REALLY slow in
> comparison to memory etc. so if GRASS modules were compiled as
> libraries the last N (N being configurable) number of loaded maps
> could remain in memory for quick processing and display.

Your OS should cache the data in non-reserved memory space and reuse it
if possible, or dump it if something else wants the space.

e.g. grep for something in the grass source code, then when it is done
try it again. The second time will finish in 5% of the time as it wasn't
reading from the disk. The more memory you have, the more potential for
cache space.

Hamish

On 5/25/06, Hamish <hamish_nospam@yahoo.com> wrote:

> This is due to another frustration I've had with speed of execution.
> While running long chains of commands in simulations I can't help
> realising that every command has to reload a map from disk and then
> write it back. As we all know, disk access is REALLY slow in
> comparison to memory etc. so if GRASS modules were compiled as
> libraries the last N (N being configurable) number of loaded maps
> could remain in memory for quick processing and display.

> Your OS should cache the data in non-reserved memory space and reuse it
> if possible, or dump it if something else wants the space.
>
> e.g. grep for something in the grass source code, then when it is done
> try it again. The second time will finish in 5% of the time as it wasn't
> reading from the disk. The more memory you have, the more potential for
> cache space.

Okay, good point about reading maps, however the writing still takes
time and writing to disk is the only simple way for passing a map from
one command to the other.

Unless the OS doesn't actually write the file when it is closed. Which
I'm guessing would be a Bad Thing (TM)

BTW, do you know whether a newly written file also changes the OS
cached version? Or in other words, is the cache invalidated after a
write changes the disk? Just curious...

--
"Wish not to seem, but to be, the best."
                -- Aeschylus

On Thu, 25 May 2006, Joel Pitt wrote:

> BTW, do you know whether a newly written file also changes the OS
> cached version? Or in other words, is the cache invalidated after a
> write changes the disk? Just curious...

YES! Otherwise things would break, and you would lose data!

--W

--

<:3 )---- Wolf Bergenheim ----( 8:>

> > This is due to another frustration I've had with speed of
> > execution. While running long chains of commands in simulations I
> > can't help realising that every command has to reload a map from
> > disk and then write it back. As we all know, disk access is REALLY
> > slow in comparison to memory etc. so if GRASS modules were
> > compiled as libraries the last N (N being configurable) number of
> > loaded maps could remain in memory for quick processing and
> > display.
>
> Your OS should cache the data in non-reserved memory space and reuse
> it if possible, or dump it if something else wants the space.

..

> Okay, good point about reading maps, however the writing still takes
> time and writing to disk is the only simple way for passing a map from
> one command to the other.

tip: r.mapcalc lets you set intermediate variables, see eval():
  http://grass.ibiblio.org/grass61/manuals/html61_user/r.mapcalc.html

(maybe someone who knows this better can provide an example?)
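
Something roughly along these lines, going by that manual page (an
untested sketch; map_a, map_b and result are made-up map names):

  r.mapcalc << EOF
  eval(tmp = map_a + map_b)
  result = tmp * 2.0
  EOF

The idea being that tmp can be reused in later expressions of the same
r.mapcalc run without a map called tmp ever being written to disk.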

> BTW, do you know whether a newly written file also changes the OS
> cached version? Or in other words, is the cache invalidated after a
> write changes the disk? Just curious...

The act of writing will mean the data will pass through system memory
and be cached. I am not qualified to say much further about it, other
than it happens and correct tuning of this is what the kernel
developers spend lots of time trying to get right.

Hamish

On 5/25/06, Hamish <hamish_nospam@yahoo.com> wrote:

> > Your OS should cache the data in non-reserved memory space and reuse
> > it if possible, or dump it if something else wants the space.
..
> Okay, good point about reading maps, however the writing still takes
> time and writing to disk is the only simple way for passing a map from
> one command to the other.

> tip: r.mapcalc lets you set intermediate variables, see eval():
>   http://grass.ibiblio.org/grass61/manuals/html61_user/r.mapcalc.html
>
> (maybe someone who knows this better can provide an example?)

thanks, however I use several modules that do things mapcalc can't.
I also need to be able to use analysis tools on the intermediate maps.

> BTW, do you know whether a newly written file also changes the OS
> cached version? Or in other words, is the cache invalidated after a
> write changes the disk? Just curious...

> The act of writing will mean the data will pass through system memory
> and be cached. I am not qualified to say much further about it, other
> than it happens and correct tuning of this is what the kernel
> developers spend lots of time trying to get right.

According to http://www.tldp.org/LDP/tlk/fs/filesystem.html
It seems that writes to the disk do occur to the cache and that the
kernel schedules dirty pages to be written at an opportune time.

So in summary I probably wouldn't notice much speed up from keeping
the map in memory.

-Joel

--
"Wish not to seem, but to be, the best."
                -- Aeschylus

Joel Pitt wrote:

> I've been thinking a bit about GRASS and two things which I'd like
> to see change. However, I *don't* think that these things should be
> changed just for me, and I recognise the need to remain compatible
> with existing modules etc.
>
> 1) Raster file format - Unless I'm mistaken, the raster is a flat
> compressed file, with NULL values stored uncompressed in a separate
> bitmask. This means the NULL mask is quite large, and also means the
> raster is split across files. Obviously this makes it fastish, but
> large and sparse maps are unnecessarily huge.

Actually, it makes it quite slow; the null file is opened and closed
for every line. OTOH, keeping the null file open would double the
number of descriptors used, halving the maximum number of maps which
can be opened at a time; this can be an issue for r.series if you are
working with a year's worth of daily samples (i.e. 365 maps).

It would be much better to just embed the nulls into the raster file.

> I've also been frustrated that one has to read an entire set of rows
> to find out whether any cells are non-null. This makes modelling
> processes slower than they need to be. I think that a quadtree format
> would solve this (lower-resolution versions of a region overlaid on
> one another, e.g. 128x128, 64x64, 32x32) by allowing you to descend
> only those tree branches that have raster cells present. Quadtrees
> would also speed up the process of displaying large raster maps on
> limited-resolution monitors.

There are simpler ways to solve some of these problems. E.g.

1. It wouldn't be particularly hard to add an alternative to
G_get_raster_row() which doesn't bother to read all-null rows, but
just sets a flag to indicate that the row is all null (a rough sketch
follows after point 2).

2. Tiled storage would handle sparse maps better than row storage. How
it compared to quadtrees would depend upon the typical cluster size.
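
As a rough sketch of point 1 (the wrapper name and the helper that
consults only the null bitmap are made up, not existing libgis calls):

  #include <grass/gis.h>

  /* Hypothetical variant of G_get_raster_row(): if the null bitmap says
   * the whole row is null, skip reading and decoding the data row and
   * just tell the caller.  G__row_is_all_null() is an imagined helper
   * that inspects only the null bitmap. */
  int G_get_raster_row_skip_nulls(int fd, void *buf, int row,
                                  RASTER_MAP_TYPE data_type, int *all_null)
  {
      if (G__row_is_all_null(fd, row)) {
          *all_null = 1;
          return 1;             /* row skipped; buf left untouched */
      }
      *all_null = 0;
      return G_get_raster_row(fd, buf, row, data_type);
  }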

> Quadtree-based rasters with integrated null bitmasks could easily be
> accessed through the normal raster function calls, ensuring existing
> modules remain interoperable. Only new modules wanting to use
> quadtree functions (such as checking the existence of values at
> positions higher up the tree) would have to check the version of the
> GRASS raster.
>
> 2) Dual function modules - there is a lot of talk about how we want
> to move forward with the GUI but still maintain the separate programs
> for GRASS commands. I suggest having a compilation system that will
> create both standalone executables and integrated libraries. A GUI
> could then check whether a library version of a GRASS module exists
> and load that, or otherwise use the equivalent executable.
>
> This is due to another frustration I've had with speed of execution.
> While running long chains of commands in simulations I can't help
> realising that every command has to reload a map from disk and then
> write it back. As we all know, disk access is REALLY slow in
> comparison to memory etc. so if GRASS modules were compiled as
> libraries the last N (N being configurable) number of loaded maps
> could remain in memory for quick processing and display.

Unix caches disk accesses. If you have enough RAM, you'll never need
to actually read the same data from disk twice.

It's more likely that performance issues stem from the various
processes which are performed on the data between it being read from
the file and passed to the application (see lib/gis/get_row.c; I made
some diagrams of this, if you're interested). Opening and closing the
null bitmap for each line of input is known to be a significant
performance sink.

> Let me know what people think. Obviously I don't expect other people
> to go ahead with this just because I would like to see it, but if
> people see some value in these approaches I could attempt to map out a
> course and contribute some time to it.

I've been thinking about a new raster architecture for a while, but I
still don't have anything concrete.

For the time being, it would be better to see if we can improve the
situation with some minor changes to a few key areas. It would help if
someone has the time to build GRASS with profiling support and
actually profile some common usage patterns.

--
Glynn Clements <glynn@gclements.plus.com>

Joel Pitt wrote:

> > This is due to another frustration I've had with speed of execution.
> > While running long chains of commands in simulations I can't help
> > realising that every command has to reload a map from disk and then
> > write it back. As we all know, disk access is REALLY slow in
> > comparison to memory etc. so if GRASS modules were compiled as
> > libraries the last N (N being configurable) number of loaded maps
> > could remain in memory for quick processing and display.
>
> Your OS should cache the data in non-reserved memory space and reuse it
> if possible, or dump it if something else wants the space.
>
> e.g. grep for something in the grass source code, then when it is done
> try it again. The second time will finish in 5% of the time as it wasn't
> reading from the disk. The more memory you have, the more potential for
> cache space.

> Okay, good point about reading maps, however the writing still takes
> time and writing to disk is the only simple way for passing a map from
> one command to the other.

Actually writing to disk doesn't take much CPU time, and it won't
cause delays in applications unless disk bandwidth is saturated.

> Unless the OS doesn't actually write the file when it is closed. Which
> I'm guessing would be a Bad Thing (TM)
>
> BTW, do you know whether a newly written file also changes the OS
> cached version? Or in other words, is the cache invalidated after a
> write changes the disk? Just curious...

No. When you write to a file, the OS writes to the cache. Any
subsequent reads from that file will read from the cache.

Each cached block has a flag to indicate whether it is "clean" (is
identical to the copy on disk) or "dirty" (differs from the disk
copy). If something reads a block which isn't cached, it is read into
the cache from disk, and the cache block is marked as clean. Whenever
a block is modified it will be marked as dirty.

The OS will write dirty blocks to the disk lazily (or when explicitly
requested via sync() or fsync() or when a device is unmounted).
Afterwards, the cached block will be marked as clean and retained.

Essentially, almost anything which accesses a disk reads or writes the
cached copy; the physical disk is merely backing storage.

--
Glynn Clements <glynn@gclements.plus.com>

> For the time being, it would be better to see if we can improve the
> situation with some minor changes to a few key areas. It would help if
> someone has the time to build GRASS with profiling support and
> actually profile some common usage patterns.

You don't need to build with any special profiling support to try, (!)

Run from valgrind, save interim results to a file using the callgrind
plugin, then visualize that file with KCachegrind.

see http://kcachegrind.sourceforge.net/cgi-bin/show.cgi
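
For example, something like this (callgrind writes its results to a
callgrind.out.<pid> file in the current directory; substitute whatever
module and maps you actually want to profile):

  valgrind --tool=callgrind r.resample input=elevation.dem output=elev.resamp
  kcachegrind callgrind.out.<pid>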

Hamish

Hamish wrote:

> For the time being, it would be better to see if we can improve the
> situation with some minor changes to a few key areas. It would help if
> someone has the time to build GRASS with profiling support and
> actually profile some common usage patterns.

> You don't need to build with any special profiling support to try, (!)
>
> Run from valgrind, save interim results to a file using the callgrind
> plugin, then visualize that file with KCachegrind.
>
> see http://kcachegrind.sourceforge.net/cgi-bin/show.cgi

In which case, it would be useful if someone could use that to analyse
some common usage cases.

Cases which involve more than one process (e.g. d.*, or v.* when used
with a separate DBMS server) are harder to analyse.

If it turns out that the separate null file is a significant
performance issue, we need to consider a migration plan for embedding
nulls (e.g. if 6.3 can write out rasters with embedded nulls, do we
need 6.2 to be able to read them?).

--
Glynn Clements <glynn@gclements.plus.com>

Glynn,

> 2. Tiled storage would handle sparse maps better than row storage. How
> it compared to quadtrees would depend upon the typical cluster size.

Here's an absolutely wacky idea for a storage mechanism. (I love it when I
shove together words or ideas that I don't completely understand).

Here's what it is:

Take the raster data. Split it up into tiles (Glynn is happy).
Make a new set of raster data in which each pixel is the mean of 4 pixels in
the original map (Joel is happy). Split it up into tiles (Cedric is happy).
Lather, rinse, repeat until you have a raster map of a single tile. The big
kicker is to make sure that one tile in a more general map exactly covers
four tiles in the more specific map. Include null data (so entire tiles can
be null and not exist).
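
The pyramid-building step itself is just repeated 2x2 averaging; a
minimal sketch (ignoring nulls and odd-sized edges), applied until a
single tile remains:

  /* Build the next-coarser level by averaging each 2x2 block of cells. */
  static void downsample_level(const double *src, int rows, int cols,
                               double *dst)
  {
      int r, c;

      for (r = 0; r < rows / 2; r++)
          for (c = 0; c < cols / 2; c++)
              dst[r * (cols / 2) + c] =
                  (src[(2 * r) * cols + (2 * c)] +
                   src[(2 * r) * cols + (2 * c + 1)] +
                   src[(2 * r + 1) * cols + (2 * c)] +
                   src[(2 * r + 1) * cols + (2 * c + 1)]) / 4.0;
  }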

A generic sparse matrix might be better for some datasets depending on how
it's clustered, just like Glynn said. I propose that those datasets are
better served by vector data formats.

It has a few benefits that I've already thought of:

Quick rendering at any resolution. A dataset of approximately the same
resolution already exists unless the desired dataset is many fewer pixels
than the size of a tile.

Possibility of unbelievably quick rendering in low viewing angle 3d views.
Amount of data needed to cover the screen is virtually invariant with respect
to view. You just need enough tiles to fill in the parts of the screen at the
resolution they are at. A tile cache doesn't need to know anything about the
geometry of the tiles (just remember whichever ones are "used the most").
Moving around locally on the screen shouldn't greatly change the tiles needed
on the horizons.

Actually, there's no reason except convenience that the resolution of a map
must be the same everywhere.

Here's a really ill-conceived data type (hard to read and write to files
because of the pointer):

struct quadtile {
  tilepixels *pixels;            /* cell values for this tile */
  tilenulls *nulls;              /* null mask for this tile */
  struct quadtile *quads[4];     /* child quadrants, NULL if absent */
};

Actually, the problem of pointers in files should be solvable with an in-file
heap. It should be possible (even routine?) to map a file to memory without
actually loading it, right?
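
That is essentially what mmap() gives you on POSIX systems: the file's
pages are only faulted in from disk as they are first touched. A
minimal sketch, not tied to any existing GRASS code:

  #include <fcntl.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  /* Map a file read-only; the kernel loads pages lazily on first
   * access, not when mmap() returns. */
  void *map_whole_file(const char *path, size_t *len)
  {
      struct stat st;
      void *base;
      int fd;

      fd = open(path, O_RDONLY);
      if (fd < 0)
          return NULL;
      if (fstat(fd, &st) < 0) {
          close(fd);
          return NULL;
      }
      *len = (size_t) st.st_size;
      base = mmap(NULL, *len, PROT_READ, MAP_SHARED, fd, 0);
      close(fd);                /* the mapping stays valid after close() */
      return base == MAP_FAILED ? NULL : base;
  }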

Now the question of an interface to these files is a bit trickier. All the
current access methods would need to be implemented on top of a tile cache
for extra fun specialness. Many algorithms would prefer the direct tile or
quadtree or quadtree tile access.

Drawbacks:

Dataset size is as big as a quadtree.

--Cedric

Cedric Shock wrote:

> 2. Tiled storage would handle sparse maps better than row storage. How
> it compared to quadtrees would depend upon the typical cluster size.

> Here's an absolutely wacky idea for a storage mechanism. (I love it when I
> shove together words or ideas that I don't completely understand).
>
> Here's what it is:
>
> Take the raster data. Split it up into tiles (Glynn is happy).
> Make a new set of raster data in which each pixel is the mean of 4 pixels in
> the original map (Joel is happy).

That's a MIPmap. It only works if it's meaningful to take the mean of
several values, which isn't the case for discrete categories.

[And they're "cells", not "pixels"; GRASS maps aren't pictures :wink: ]

A quadtree is similar except that you store whether or not the
sub-cells all have the same value (i.e. you're essentially recording
whether the variance is zero, rather than recording the mean).

With regard to the null bitmap, you would store a 1 if the entire
square represented by that cell was all-null, and a 0 if any part of
it was non-null. That would allow you to skip large chunks of all-null
data quickly.
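
A sketch of the kind of traversal that allows (a hypothetical node
layout; all_null set means every cell under that node is null):

  #include <stddef.h>

  /* Hypothetical quadtree node: all_null != 0 means every cell covered
   * by this node is null, so the entire subtree can be skipped. */
  struct qnode {
      int all_null;
      struct qnode *child[4];   /* all NULL at the leaf level */
  };

  /* Visit only the leaves that may contain data, skipping all-null
   * squares without descending into them. */
  static void visit_non_null(const struct qnode *n,
                             void (*leaf_fn)(const struct qnode *))
  {
      int i;

      if (n == NULL || n->all_null)
          return;               /* nothing but nulls below this point */
      if (n->child[0] == NULL) {
          leaf_fn(n);           /* a leaf containing some real data */
          return;
      }
      for (i = 0; i < 4; i++)
          visit_non_null(n->child[i], leaf_fn);
  }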

> Actually, there's no reason except convenience that the resolution of a map
> must be the same everywhere.
>
> Here's a really ill-conceived data type (hard to read and write to files
> because of the pointer):
>
> struct quadtile {
>   tilepixels *pixels;            /* cell values for this tile */
>   tilenulls *nulls;              /* null mask for this tile */
>   struct quadtile *quads[4];     /* child quadrants, NULL if absent */
> };
>
> Actually, the problem of pointers in files should be solvable with an in-file
> heap. It should be possible (even routine?) to map a file to memory without
> actually loading it, right?

For storage, you basically have three choices:

1. Store "pointers" (i.e. offsets) to the data for each quadrant.
2. Store the data for each quadrant sequentially, following the parent.
3. Store the levels sequentially (i.e. a sequence of 2D arrays of
increasing resolution).

For option 1, you normally need to stop before you get to the level of
individual cells (e.g. the leaves are 4x4 or 8x8 blocks of cells),
otherwise all of the pointers to the leaves use too much space (even
assuming that you use relative offsets, which allows you to use fewer
bits at lower levels of the tree).

Option 2 provides the best compression, but it's only practical if you
are consuming the data by recursive descent; it's no use for any other
access strategy.

Option 3 takes up slightly more space than a raw array (the lowest
level /is/ the raw array), but random access is fast. There isn't any
compression, but you can easily detect entire blocks comprised of a
single value.

In most cases, single-level tiled storage will give you close to the
same performance with a lot less complexity.
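
For option 3, random access is just offset arithmetic. A sketch,
assuming square power-of-two levels stored coarsest-first (1x1 at the
top) with a fixed number of bytes per cell; the coarser levels add
roughly a third to the size of the full-resolution array, which is the
"slightly more space" above:

  #include <stddef.h>

  /* File offset of cell (row, col) within a given level: levels are
   * stored one after another, starting from the 1x1 level, each level
   * doubling the side length. */
  static size_t cell_offset(int level, int row, int col, size_t cell_size)
  {
      size_t skip = 0;          /* cells occupied by all coarser levels */
      int l, side;

      for (l = 0, side = 1; l < level; l++, side *= 2)
          skip += (size_t) side * side;

      /* here, side is the side length of the requested level */
      return (skip + (size_t) row * side + col) * cell_size;
  }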

--
Glynn Clements <glynn@gclements.plus.com>

> > For the time being, it would be better to see if we can improve
> > the situation with some minor changes to a few key areas. It would
> > help if someone has the time to build GRASS with profiling support
> > and actually profile some common usage patterns.
>
> You don't need to build with any special profiling support to try,
> (!)
>
> Run from valgrind, save interim results to a file using the
> callgrind plugin, then visualize that file with KCachegrind.
>
> see http://kcachegrind.sourceforge.net/cgi-bin/show.cgi

> In which case, it would be useful if someone could use that to analyse
> some common usage cases.

could you write some precise/focused command line examples of "common"
or likely intensive operations? (using spearfish)

then we can all test & share results. (e.g. where additional 64bit
overhead slows things down vs 32bit, bottlenecks in Mac vs Cygwin, etc)

> Cases which involve more than one process (e.g. d.*, or v.* when used
> with a separate DBMS server) are harder to analyse.
>
> If it turns out that the separate null file is a significant
> performance issue, we need to consider a migration plan for embedding
> nulls (e.g. if 6.3 can write out rasters with embedded nulls, do we
> need 6.2 to be able to read them?).

presumably FCELL and DCELL can use nan inline to identify NULL, CELL
needs something else?

Hamish

Hamish wrote:

> > > For the time being, it would be better to see if we can improve
> > > the situation with some minor changes to a few key areas. It would
> > > help if someone has the time to build GRASS with profiling support
> > > and actually profile some common usage patterns.
> >
> > You don't need to build with any special profiling support to try,
> > (!)
> >
> > Run from valgrind, save interim results to a file using the
> > callgrind plugin, then visualize that file with KCachegrind.
> >
> > see http://kcachegrind.sourceforge.net/cgi-bin/show.cgi
>
> In which case, it would be useful if someone could use that to analyse
> some common usage cases.

> could you write some precise/focused command line examples of "common"
> or likely intensive operations? (using spearfish)

No. I'm a programmer, not a geoscientist; that's why I'm asking
someone else to do it. Anything I came up with would be a contrived
example rather than "common usage". It would probably be more useful
if people could try it on "real" examples.

Having said that, the most interesting cases are ones where the
processing is relatively simple (so that the performance of libgis'
I/O routines is the dominant factor) and which tend to be slow (making
a 20-second operation take 10 seconds is more useful than making a
2-second operation take 1 second).

> Cases which involve more than one process (e.g. d.*, or v.* when used
> with a separate DBMS server) are harder to analyse.
>
> If it turns out that the separate null file is a significant
> performance issue, we need to consider a migration plan for embedding
> nulls (e.g. if 6.3 can write out rasters with embedded nulls, do we
> need 6.2 to be able to read them?).

> presumably FCELL and DCELL can use nan inline to identify NULL, CELL
> needs something else?

All of the data types have a defined null value which is stored in
internal buffers (by G_get_raster_row() etc). FWIW, the actual bit
patterns used internally are:

  CELL 0x80000000
  FCELL 0xFFFFFFFF
  DCELL 0xFFFFFFFFFFFFFFFF

[The FCELL/DCELL patterns are NaNs, but so is anything with an
all-ones exponent and a non-zero mantissa; G_is_[fd]_null_value checks
for the specific bit pattern. The CELL pattern is -2^31, i.e. the
value least likely to occur by accident.]

G_put_raster_row() checks each value with G_is_[cfd]_null_value().
For a null value, the value written out to the cell/fcell file is
zero, with a 1 written to the null file.

Conversely, G_get_raster_row() replaces the value read from the
cell/fcell file with the null value if the corresponding bit in the
null file is set.

There's no reason why the null value can't be written to or read from
the cell/fcell file. AFAIK, the current mechanism was used so that 5.x
maps would work with 4.x (which ignores the null file but treats zero
as null).

--
Glynn Clements <glynn@gclements.plus.com>

> > > > For the time being, it would be better to see if we can
> > > > improve the situation with some minor changes to a few key
> > > > areas. It would help if someone has the time to build GRASS
> > > > with profiling support and actually profile some common usage
> > > > patterns.
> > >
> > > You don't need to build with any special profiling support to
> > > try, (!)
> > >
> > > Run from valgrind, save interim results to a file using the
> > > callgrind plugin, then visualize that file with KCachegrind.
> > >
> > > see http://kcachegrind.sourceforge.net/cgi-bin/show.cgi
> >
> > In which case, it would be useful if someone could use that to
> > analyse some common usage cases.
>
> could you write some precise/focused command line examples of
> "common" or likely intensive operations? (using spearfish)

> No. I'm a programmer, not a geoscientist; that's why I'm asking
> someone else to do it. Anything I came up with would be a contrived
> example rather than "common usage". It would probably be more useful
> if people could try it on "real" examples.
>
> Having said that, the most interesting cases are ones where the
> processing is relatively simple (so that the performance of libgis'
> I/O routines is the dominant factor) and which tend to be slow (making
> a 20-second operation take 10 seconds is more useful than making a
> 2-second operation take 1 second).

so a good candidate for a high libgis IO module + slow task might be,

(spearfish60)
g.region elevation.10m
r.patch out=merge in=roads,railroads,streams,tractids,transport.misc,rstrct.areas,trn.sites,landuse,fields,erode.index

or the r.mapcalc equivalent of the same task?

As "common" tasks are very different for everyone, it is hard to define.

> > Cases which involve more than one process (e.g. d.*, or v.* when
> > used with a separate DBMS server) are harder to analyse.
> >
> > If it turns out that the separate null file is a significant
> > performance issue, we need to consider a migration plan for
> > embedding nulls (e.g. if 6.3 can write out rasters with embedded
> > nulls, do we need 6.2 to be able to read them?).
>
> presumably FCELL and DCELL can use nan inline to identify NULL, CELL
> needs something else?

> All of the data types have a defined null value which is stored in
> internal buffers (by G_get_raster_row() etc). FWIW, the actual bit
> patterns used internally are:
>
>   CELL 0x80000000
>   FCELL 0xFFFFFFFF
>   DCELL 0xFFFFFFFFFFFFFFFF
>
> [The FCELL/DCELL patterns are NaNs, but so is anything with an
> all-ones exponent and a non-zero mantissa; G_is_[fd]_null_value checks
> for the specific bit pattern. The CELL pattern is -2^31, i.e. the
> value least likely to occur by accident.]
>
> G_put_raster_row() checks each value with G_is_[cfd]_null_value().
> For a null value, the value written out to the cell/fcell file is
> zero, with a 1 written to the null file.
>
> Conversely, G_get_raster_row() replaces the value read from the
> cell/fcell file with the null value if the corresponding bit in the
> null file is set.
>
> There's no reason why the null value can't be written to or read from
> the cell/fcell file. AFAIK, the current mechanism was used so that 5.x
> maps would work with 4.x (which ignores the null file but treats zero
> as null).

This part of the next raster format seems clear then.

Hamish

Hamish wrote:

> > could you write some precise/focused command line examples of
> > "common" or likely intensive operations? (using spearfish)
>
> No. I'm a programmer, not a geoscientist; that's why I'm asking
> someone else to do it. Anything I came up with would be a contrived
> example rather than "common usage". It would probably be more useful
> if people could try it on "real" examples.
>
> Having said that, the most interesting cases are ones where the
> processing is relatively simple (so that the performance of libgis'
> I/O routines is the dominant factor) and which tend to be slow (making
> a 20-second operation take 10 seconds is more useful than making a
> 2-second operation take 1 second).

> so a good candidate for a high libgis IO module + slow task might be,
>
> (spearfish60)
> g.region elevation.10m
> r.patch out=merge in=roads,railroads,streams,tractids,transport.misc,rstrct.areas,trn.sites,landuse,fields,erode.index
>
> or the r.mapcalc equivalent of the same task?

Probably. Or simple r.mapcalc commands. Actually, for testing I/O,
r.resample is probably as good a choice as any.

I suppose that the issue has as much to do with the type of maps
people are actually using (map resolution, region resolution,
CELL/FCELL/DCELL input/output formats, distribution of values,
proportion of nulls etc) as with the processes performed upon them.

--
Glynn Clements <glynn@gclements.plus.com>

Hi,

I have added a "replacement raster format" page in the wiki for
both informational and educational purposes.

http://grass.gdf-hannover.de/wiki/Replacement_raster_format

It would be helpful if basic wants, needs, ideas were listed there to
keep interested parties informed of existing plans instead of
(re)inventing and suggesting the same good ideas over and over again.

[links, justifications, explanation of why to do it one way & not the
other, etc..]

thanks,
Hamish

Glynn Clements wrote:

> If it turns out that the separate null file is a significant
> performance issue, we need to consider a migration plan for embedding
> nulls (e.g. if 6.3 can write out rasters with embedded nulls, do we
> need 6.2 to be able to read them?).

I've compiled GRASS with profiling support, and a quick glance at the
results indicates that the null handling is indeed significant. E.g.
for "r.resample in=elevation.dem ...", G_get_raster_row() accounts for
30.2% of the time taken, with embed_nulls() taking 22.5%, which means
that embed_nulls() accounts for 75% of G_get_raster_row().

[FWIW, that 22.5% is split roughly evenly between G_is_null_value()
(11.1%) and get_null_value_row() (10.2%, of which 6.9% is in
G__check_null_bit()).]

Another interesting point; from the flat profile (i.e. time attributed
to calls does not include time spent in children):

13.10 G_is_c_null_value
11.78 G_is_d_null_value
  5.89 G_is_null_value

IOW, 30.77% of the total time is spent testing whether cells are null.

Regarding the first two: these should be available as macros or inline
functions, and they should be optimised. These functions amount to
comparing two 32- or 64-bit values, and should be trivial.
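
For instance, something along these lines (a sketch only; the my_ names
are illustrations, not proposals for the real libgis API):

  #include <limits.h>
  #include <string.h>
  #include <grass/gis.h>

  /* The CELL null is the single bit pattern 0x80000000, i.e. INT_MIN,
   * so the test is one integer comparison. */
  #define MY_IS_C_NULL(p) (*(p) == INT_MIN)

  /* The DCELL null pattern is all-ones; compare raw bytes rather than
   * values, since the pattern is a NaN and NaN != NaN. */
  static int my_is_d_null(const DCELL *p)
  {
      static const unsigned char bits[8] =
          { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF };

      return memcmp(p, bits, sizeof bits) == 0;
  }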

Regarding the third:

  int G_is_null_value (const void *rast, RASTER_MAP_TYPE data_type)
  {
      switch (data_type)
      {
      case CELL_TYPE:
          return (G_is_c_null_value((CELL *) rast));

      case FCELL_TYPE:
          return (G_is_f_null_value((FCELL *) rast));

      case DCELL_TYPE:
          return (G_is_d_null_value((DCELL *) rast));

      default:
          G_warning("G_is_null_value: wrong data type!");
          return FALSE;
      }
  }

That's nearly 6% of the program spent in a CELL/FCELL/DCELL switch
statement (the cost of the individual G_is_[cfd]_null_value() calls
isn't included in that figure). There are quite a few places where
this idiom is used (e.g. lib/gis/raster.c).

This suggests that simple functions taking a RASTER_MAP_TYPE argument
and operating upon individual cells should be avoided where possible.
Instead, there should be a separate row-processing loop for each data
type, so that the switch statement(s) are only executed once per row,
not once per cell.
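
i.e. something of this shape (a sketch with a made-up task, counting
the nulls in a row):

  #include <grass/gis.h>

  /* Per-type row loops: the CELL/FCELL/DCELL dispatch happens once per
   * row, and the per-cell test is the cheap type-specific call. */
  static int count_nulls_in_row(void *buf, int ncols,
                                RASTER_MAP_TYPE data_type)
  {
      int col, n = 0;

      switch (data_type) {
      case CELL_TYPE: {
          CELL *c = buf;
          for (col = 0; col < ncols; col++)
              if (G_is_c_null_value(&c[col]))
                  n++;
          break;
      }
      case FCELL_TYPE: {
          FCELL *f = buf;
          for (col = 0; col < ncols; col++)
              if (G_is_f_null_value(&f[col]))
                  n++;
          break;
      }
      case DCELL_TYPE: {
          DCELL *d = buf;
          for (col = 0; col < ncols; col++)
              if (G_is_d_null_value(&d[col]))
                  n++;
          break;
      }
      default:
          break;
      }
      return n;
  }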

--
Glynn Clements <glynn@gclements.plus.com>