[CC to grass-dev]
Roy Sanderson wrote:
I've been trying to persuade our users to stop working with Grass 4.3 and
Grass 5.4 for some time now, and as I have to upgrade the OS on our
applications server have told them that they now have no choice.
However, a couple of users stated that they preferred to use Grass 4.3 as
it was faster, and for large tasks, more stable than the newer versions. I
checked this on a map of 52,000 rows by 28,000 columns and commands like
r.mapcalc, r.clump, r.volume operated about 10x faster in Grass 4.3 than
the more recent versions.
This might simply arise from the age of the applications server OS (still
running RH7.3), or because I've mis-configured the newer versions of Grass.
For example, I did not compile Grass 5 or 6 with large-file support
enabled, although the file sizes are only around 180Mb, but the speedy
performance of 4.3 vs 5.4, 6.0 and 6.2.1 surprised me. Perhaps there's an
additional overhead associated with the introduction of nulls and
floating-points, which were major changes from 4.3 to 5.4. However, the
performance difference is still present when working with integer maps. As
I haven't benchmarked versions, and also because personally I only work
with Grass version 6, I hadn't spotted the differences until now.
I strongly suspect that the support for nulls is to blame. The
implementation is really quite inefficient in several ways.
It doesn't help that the null file is repeatedly opened and closed
(the null bitmap is read in chunks of 8 rows at a time, with the file
being opened anew for each read). Depending upon the speed of
filesystem calls (open(), access() etc) relative to actual I/O, that
could be a significant factor.
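To put a rough number on the reopen pattern, here is a minimal sketch in Python. It is not the GRASS code: the function names, the one-bit-per-cell layout padded to whole bytes per row, and the scratch file are all assumptions made for illustration. Only the 8-row chunk size comes from the description above.

```python
import os
import tempfile

ROWS_PER_CHUNK = 8  # chunk size taken from the description above


def read_reopening(path, row_bytes, rows):
    """Reopen the file for every 8-row chunk; returns (data, number_of_opens)."""
    data, opens = bytearray(), 0
    for first_row in range(0, rows, ROWS_PER_CHUNK):
        n = min(ROWS_PER_CHUNK, rows - first_row)
        with open(path, "rb") as f:
            opens += 1
            f.seek(first_row * row_bytes)
            data += f.read(n * row_bytes)
    return bytes(data), opens


def read_keeping_open(path, row_bytes, rows):
    """Open once and read the same chunks through the one descriptor."""
    data = bytearray()
    with open(path, "rb") as f:
        for first_row in range(0, rows, ROWS_PER_CHUNK):
            n = min(ROWS_PER_CHUNK, rows - first_row)
            data += f.read(n * row_bytes)
    return bytes(data), 1


def demo(rows=52000, cols=280):
    # Assumed layout: one bit per cell, each row padded to whole bytes.
    row_bytes = (cols + 7) // 8
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(row_bytes * rows))
        path = tmp.name
    try:
        a, opens_a = read_reopening(path, row_bytes, rows)
        b, opens_b = read_keeping_open(path, row_bytes, rows)
    finally:
        os.remove(path)
    # Same bytes either way; only the number of open() calls differs.
    return a == b, opens_a, opens_b
```

For a 52,000-row map, demo() reports 6,500 opens per sequential pass in the first variant against a single open in the second. Whether that matters in practice depends on how expensive open() is on the system in question.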
Keeping the null file open would eliminate that part of the overhead,
but would double the number of descriptors used. On older versions,
that would halve the maximum number of open maps, although that limit
has been eliminated in recent 6.3 CVS versions.
Also, that would only eliminate part of the overhead. Actually
decoding and embedding the null data is also non-trivial.
Embedding the nulls in the data file, eliminating the null bitmap
altogether, would eliminate all of the null overhead, but would also
either enlarge the files significantly or break compatibility.
The existing format is optimised for small, non-negative integers.
Each row is stored using only as many bytes as are required for the
largest value, where all values are treated as unsigned (i.e. negative
values always require 4 bytes). The integer value used for nulls is
0x80000000 (i.e. INT_MIN, -2^31); embedding this value directly would
cause many files to always use 4 bytes per cell when 1 byte would
otherwise be enough.
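The width rule can be illustrated with a short Python sketch. This is only a model of the rule as described above, not the library routine: bytes_per_cell is a hypothetical name, and RLE/zlib compression is ignored entirely.

```python
NULL_CELL = 0x80000000  # integer null value quoted above (INT_MIN as a 32-bit pattern)


def bytes_per_cell(row):
    """Smallest byte width (1-4) that holds every value in the row,
    with each value reinterpreted as an unsigned 32-bit integer."""
    # Masking maps negative values to 0x80000000 or above.
    largest = max((v & 0xFFFFFFFF for v in row), default=0)
    width = 1
    while largest >= 1 << (8 * width):
        width += 1
    return width


print(bytes_per_cell([0, 17, 255]))        # 1
print(bytes_per_cell([0, 256]))            # 2
print(bytes_per_cell([-1, 3]))             # 4
print(bytes_per_cell([0, 17, NULL_CELL]))  # 4
```

A single embedded null in a row of small values is enough to push that row from 1 byte per cell to 4.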
We could change the encoding to be more friendly to embedded nulls,
but that would break compatibility with earlier versions. AFAICT, a
6.3 integer raster can still be read by 4.3 (assuming that it uses RLE
rather than zlib compression), with any nulls being read as zeroes.
--
Glynn Clements <glynn@gclements.plus.com>