[GRASS-dev] Re: [GRASS-user] Benchmarking Grass 4.3, 5.4, 6.0, 6.2 raster commands

[CC to grass-dev]

Roy Sanderson wrote:

I've been trying to persuade our users to stop working with Grass 4.3 and
Grass 5.4 for some time now, and as I have to upgrade the OS on our
applications server have told them that they now have no choice.

However, a couple of users stated that they preferred to use Grass 4.3 as
it was faster, and for large tasks, more stable than the newer versions. I
checked this on a map of 52,000 rows by 28,000 columns and commands like
r.mapcalc, r.clump, r.volume operated about 10x faster in Grass 4.3 than
the more recent versions.

This might simply arise from the age of the applications server OS (still
running RH7.3), or because I've mis-configured the newer versions of Grass.
For example, I did not compile Grass 5 or 6 with large-file support
enabled, although the file sizes are only around 180Mb, but the speedy
performance of 4.3 vs 5.4, 6.0 and 6.2.1 surprised me. Perhaps there's an
additional overhead associated with the introduction of nulls and
floating-points, which were major changes from 4.3 to 5.4. However, the
performance difference is still present when working with integer maps. As
I haven't benchmarked versions, and also because personally I only work
with Grass version 6, I hadn't spotted the differences until now.

I strongly suspect that the support for nulls is to blame. The
implementation is really quite inefficient in several ways.

It doesn't help that the null file is repeatedly opened and closed
(the null bitmap is read in chunks of 8 rows at a time, with the file
being opened anew for each read). Depending upon the speed of
filesystem calls (open(), access() etc) relative to actual I/O, that
could be a significant factor.
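
The access pattern described above can be imitated outside GRASS. The following is a minimal sketch, not the library's code: the function names, the temporary file, and the one-bit-per-cell row size are all assumptions made for illustration. It counts how many open() calls each strategy needs rather than timing them, since timings depend on the filesystem.

```python
import os
import tempfile

ROWS_PER_CHUNK = 8  # the thread says the bitmap is read 8 rows at a time


def read_reopening(path, rows, row_bytes):
    """Read the file chunk by chunk, reopening it for every chunk."""
    data = bytearray()
    opens = 0
    for first_row in range(0, rows, ROWS_PER_CHUNK):
        with open(path, "rb") as f:
            opens += 1
            f.seek(first_row * row_bytes)
            n = min(ROWS_PER_CHUNK, rows - first_row)
            data += f.read(n * row_bytes)
    return bytes(data), opens


def read_keeping_open(path, rows, row_bytes):
    """Read the same chunks through a single file handle."""
    data = bytearray()
    with open(path, "rb") as f:
        opens = 1
        for first_row in range(0, rows, ROWS_PER_CHUNK):
            n = min(ROWS_PER_CHUNK, rows - first_row)
            data += f.read(n * row_bytes)
    return bytes(data), opens


def demo(rows=100, cols=64):
    """Write a stand-in bitmap, read it both ways, report the open() counts."""
    row_bytes = (cols + 7) // 8  # assumption: one bit per cell
    payload = os.urandom(rows * row_bytes)
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        a, opens_a = read_reopening(path, rows, row_bytes)
        b, opens_b = read_keeping_open(path, rows, row_bytes)
    finally:
        os.remove(path)
    return a == payload and b == payload, opens_a, opens_b
```

With the default 100 rows, demo() reports 13 opens for the first strategy and 1 for the second. For a 52,000-row map the same pattern would mean 6,500 opens per pass over the bitmap; how much that costs depends on the filesystem, which is the open question in the paragraph above.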

Keeping the null file open would eliminate that part of the overhead,
but would double the number of descriptors used. On older versions,
that would halve the maximum number of open maps, although that limit
has been eliminated in recent 6.3 CVS versions.

Also, that would only eliminate part of the overhead. Actually
decoding and embedding the null data is also non-trivial.

Embedding the nulls in the data file, eliminating the null bitmap
altogether, would eliminate all of the null overhead, but would also
either enlarge the files significantly or break compatibility.

The existing format is optimised for small, non-negative integers.
Each row is stored using only as many bytes as are required for the
largest value, where all values are treated as unsigned (i.e. negative
values always require 4 bytes). The integer value used for nulls is
0x80000000 (i.e. INT_MIN, -2^31); embedding this value directly would
cause many files to always use 4 bytes per cell when 1 byte would
otherwise be enough.
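
The effect on row width can be illustrated with a simplified model. This is a sketch of the rule as stated above, not the actual GRASS encoder: the function name is hypothetical, and treating each value as a 32-bit unsigned pattern is an assumption that reproduces the "negative values always require 4 bytes" behaviour.

```python
NULL_CELL = 0x80000000  # null marker from the thread, as an unsigned 32-bit pattern


def bytes_per_cell(row):
    """Smallest width (1 to 4 bytes) that holds every value in the row,
    with each value reinterpreted as an unsigned 32-bit integer."""
    widest = 1
    for value in row:
        unsigned = value & 0xFFFFFFFF  # negatives become large unsigned values
        width = max(1, (unsigned.bit_length() + 7) // 8)
        widest = max(widest, width)
    return widest
```

Under this model bytes_per_cell([0, 17, 255]) is 1 and bytes_per_cell([0, 256]) is 2, but a single negative value or a single embedded NULL_CELL, as in bytes_per_cell([3, NULL_CELL]), pushes the whole row to 4.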

We could change the encoding to be more friendly to embedded nulls,
but that would break compatibility with earlier versions. AFAICT, a
6.3 integer raster can still be read by 4.3 (assuming that it uses RLE
rather than zlib compression), with any nulls being read as zeroes.

--
Glynn Clements <glynn@gclements.plus.com>

Glynn,

Does the null implementation also affect runs on rasters that have no nulls
and no MASK present? As I wrote recently, I have
noticed what seem to be totally unrelated changes in performance in
v.surf.rst compared to 4.3 - mostly connected with the change to G_ludcmp
from an internal lineq solver, but the cause may be somewhere else.

10x faster is a huge difference - it may be worthwhile to find out
whether it is true for integer maps without nulls and whether it
is really nulls slowing it down so badly.

There were many discussions about the null implementation and, as Glynn correctly
points out, the main driver for the current design was to preserve backwards
compatibility at the cost of performance. Wishes of old users (many of whom
contributed funds to GRASS development) were given very high priority.

Helena

Helena Mitasova
Dept. of Marine, Earth and Atm. Sciences
1125 Jordan Hall, NCSU Box 8208,
Raleigh NC 27695
http://skagit.meas.ncsu.edu/~helena/

On Apr 30, 2007, at 2:17 PM, Glynn Clements wrote:

[...]

Helena Mitasova wrote:

Does the null implementation also affect runs on rasters that
have no nulls and no MASK present?

It affects any raster which has a null bitmap, even if no cells are
actually null.

10x faster is a huge difference - it may be worthwhile to find out
whether it is true for integer maps without nulls and whether it
is really nulls slowing it down so badly.

It's relatively easy to test the effect of the null bitmap:
delete/rename the cell_misc/<name>/null file. There will still be some
residual overhead due to the conversion of zeroes to nulls, but this
will determine the cost of the filesystem calls.
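
The rename can be scripted so that it is easy to undo after a timing run. This is a minimal sketch under assumptions: the function names and the ".bak" suffix are hypothetical, the mapset directory has to be supplied by the caller, and nothing else should be using the map while the file is hidden. Only the cell_misc/<name>/null location comes from the message above.

```python
from pathlib import Path


def hide_null_file(mapset_dir, map_name, suffix=".bak"):
    """Rename cell_misc/<map_name>/null out of the way; return the new path,
    or None if the map has no null file."""
    null_path = Path(mapset_dir) / "cell_misc" / map_name / "null"
    if not null_path.exists():
        return None
    hidden = null_path.with_name("null" + suffix)
    null_path.rename(hidden)
    return hidden


def restore_null_file(mapset_dir, map_name, suffix=".bak"):
    """Undo hide_null_file(); return the restored path, or None if nothing was hidden."""
    null_path = Path(mapset_dir) / "cell_misc" / map_name / "null"
    hidden = null_path.with_name("null" + suffix)
    if not hidden.exists():
        return None
    hidden.rename(null_path)
    return null_path
```

A benchmark would call hide_null_file() before the timed command and restore_null_file() afterwards, so the map is left as it was found.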

There were many discussions about the null implementation and, as
Glynn correctly points out, the main driver for the current design
was to preserve backwards compatibility at the cost of performance.
Wishes of old users (many of whom contributed funds to GRASS
development) were given very high priority.

It's possible to embed nulls while retaining compatibility, but the
result is that most CELL maps will end up using 4 bytes per cell
(prior to RLE or zlib compression).
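
For a sense of scale, the uncompressed sizes can be worked out for the 52,000 x 28,000 map mentioned at the start of the thread. This is back-of-the-envelope arithmetic, not a measurement: the helper names are hypothetical, and the bitmap figure assumes one bit per cell with each row padded to a whole byte.

```python
ROWS, COLS = 52_000, 28_000  # map dimensions quoted earlier in the thread


def raw_size_bytes(rows, cols, bytes_per_cell):
    """Uncompressed cell data size for a fixed width per cell."""
    return rows * cols * bytes_per_cell


def bitmap_size_bytes(rows, cols):
    """Size of a separate null bitmap, assuming one bit per cell and
    rows padded to a whole number of bytes."""
    return rows * ((cols + 7) // 8)


one_byte = raw_size_bytes(ROWS, COLS, 1)    # 1,456,000,000 bytes
four_byte = raw_size_bytes(ROWS, COLS, 4)   # 5,824,000,000 bytes
bitmap = bitmap_size_bytes(ROWS, COLS)      # 182,000,000 bytes
```

These are pre-compression figures, so the files on disk may be far smaller, but they show the fourfold growth in raw data that embedding the null value would imply for a map that otherwise fits in 1 byte per cell.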

--
Glynn Clements <glynn@gclements.plus.com>

Hello Again

Thanks for the various pieces of feedback - I've tried removing the null
file (and there isn't a MASK), and unfortunately it hasn't made much
difference.

Having done some very rough benchmarking, based on a simple r.stats command
on a much smaller map than our user was struggling with, figures are:

Grass 4.3 - 10 seconds
Grass 5.4 - 67 seconds
Grass 6.2 - 74 seconds

Given that this is being run across an NFS network, the figures are only
approximate, but it looks as though the performance hit came from the
switch from Grass 4.3 to 5.4. Oddly enough, even the d.rast commands seem
faster in Grass 4 (though that may simply be because the no. of colours
used in the monitor is lower).
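
The figures above can be turned into slowdown factors relative to 4.3. This is plain arithmetic on the three reported timings; the function name is hypothetical, and the results are only as precise as the rough measurements they come from.

```python
# r.stats timings in seconds, as reported above
timings = {"4.3": 10, "5.4": 67, "6.2": 74}


def slowdown(timings, baseline="4.3"):
    """Return each version's run time as a multiple of the baseline's."""
    base = timings[baseline]
    return {version: seconds / base for version, seconds in timings.items()}


factors = slowdown(timings)
for version, factor in factors.items():
    print(f"Grass {version}: {factor:.1f}x the 4.3 run time")
```

That gives about 6.7x for 5.4 and 7.4x for 6.2, while 6.2 is only about 10% slower than 5.4, which is consistent with the observation that the performance hit arrived between 4.3 and 5.4.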

Best wishes
Roy

At 06:10 01/05/07 +0100, Glynn Clements wrote:

[...]