[GRASS5] thoughts about runlength encoding

Hi folks,
just a few thoughts about run-length encoding.
Perhaps someone can tell me whether what I'm saying is right.

When talking about RLE in GRASS, we are talking about integer (CELL) file
compression.
As far as I know, GRASS's RLE is not a standard one.
While standard RLE starts compressing equal values only after two (or three) of
them have passed, GRASS's starts compressing from the first one.
When a row is compressed, every value gets its own counter, so we can say that a
series of (count, value) couples is built.
Knowing the compressed size, we simply divide by two to get the number of
couples. After that, it's rather easy to get the uncompressed row.
Is this right? This would mean that there can't be a "hybrid" row (i.e. couples
and singles mixed) as in standard RLE.

Example:
uncompressed:
1 1 1 1 1 1 1
1 3 3 3 1 3 1

in GRASS it gets stored as:
7 1 = one couple (count, value)
1 3 3 3 1 3 1 = not compressed

as opposed to external RLE, where it becomes:
1 1 1 5 3 3 3 0 1 3 1
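
The GRASS side of the example above can be reproduced with a minimal sketch. It assumes one-byte cells, the count-then-value order shown in the "7 1" couple, a one-byte count (so a run is capped at 255 cells), and that a row is kept raw whenever its couples would not be shorter than the raw row. The function names are hypothetical and are not taken from the GRASS sources.

```python
def rle_encode_row(cells):
    """Encode a row of one-byte cell values as (count, value) couples."""
    out = bytearray()
    i = 0
    while i < len(cells):
        value = cells[i]
        count = 1
        # Assumption: the count is a single byte, so a run holds at most 255 cells.
        while i + count < len(cells) and cells[i + count] == value and count < 255:
            count += 1
        out += bytes([count, value])
        i += count
    return bytes(out)


def store_row(cells):
    """Return ("rle", data) if the couples are shorter than the raw row, else ("raw", data)."""
    raw = bytes(cells)
    packed = rle_encode_row(cells)
    if len(packed) < len(raw):
        return "rle", packed
    return "raw", raw


first = store_row([1, 1, 1, 1, 1, 1, 1])
second = store_row([1, 3, 3, 3, 1, 3, 1])
```

Under these assumptions the second row would need five couples (ten bytes) against seven raw bytes, which is why it stays uncompressed.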

Cheers and thanks for any reply,
Andrea

--
____________________________________________________________________________
"Let it be as much a great honour to take as to give learning,
if you want to be called wise."
Skuggsja' - The King's mirror - 1240 Reykjavik

University of Trento
Department of Civil and Environmental Engineering
Via Mesiano, 77 - Trento (ITALY)

Andrea Antonello
tel: +393288497722
fax: +390461882672
____________________________________________________________________________

Andrea Antonello wrote:

> just a few thoughts about run-length encoding.
> Perhaps someone can tell me whether what I'm saying is right.
>
> When talking about RLE in GRASS, we are talking about integer (CELL) file
> compression.
> As far as I know, GRASS's RLE is not a standard one.

There isn't any "standard" form of RLE.

> While standard RLE starts compressing equal values only after two (or three)
> of them have passed, GRASS's starts compressing from the first one.

Some RLE schemes have a "pass-through" option, where you can have
blocks of raw data interspersed with runs. E.g. the count byte might
be signed, with a positive value indicating that the next byte is
repeated n times, and a negative value indicating that abs(n) bytes of
raw data follow.
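
As an illustration of the pass-through idea just described, here is a minimal decoder sketch for a signed-count scheme. It is a generic example of that family of formats, not GRASS code; the function name is hypothetical, and treating a zero count as a no-op is an assumption of this sketch.

```python
def decode_passthrough_rle(data):
    """Decode bytes where a signed count byte precedes either a run or raw data."""
    out = bytearray()
    i = 0
    while i < len(data):
        # Interpret the count byte as a signed 8-bit integer.
        n = int.from_bytes(data[i:i + 1], "big", signed=True)
        i += 1
        if n > 0:
            # Positive count: the next byte is repeated n times.
            out += data[i:i + 1] * n
            i += 1
        elif n < 0:
            # Negative count: abs(n) bytes of raw data follow.
            out += data[i:i - n]
            i += -n
        # n == 0 carries no data in this sketch and is skipped.
    return bytes(out)


# Five 1s as a run, then the raw bytes 3 1 3 (0xFD is -3 as a signed byte).
decoded = decode_passthrough_rle(bytes([5, 1, 0xFD, 3, 1, 3]))
```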

GRASS' RLE doesn't have this feature. Each row is either compressed or
raw; if the length of the row (as determined by examining the
difference between the row offsets) is less than the length of an
uncompressed row, then it's compressed, otherwise it's raw.

Except that, if cellhd.compressed is negative (pre-3.0 compression),
rows are always compressed, even if they would be longer than an
uncompressed row.
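
The decision described above can be sketched as follows. This is only an illustration of the rule as stated in this message: the function names are hypothetical, `offsets` is assumed to hold one more entry than there are rows, and `payload_len` is assumed to exclude any per-row header byte.

```python
def row_length(offsets, row):
    """Stored length of a row, taken from the difference between row offsets."""
    return offsets[row + 1] - offsets[row]


def row_is_rle(payload_len, ncols, nbytes, compressed_flag):
    """Decide whether a stored row holds RLE runs or raw cell data."""
    if compressed_flag < 0:
        # Pre-3.0 compression: rows are always compressed.
        return True
    # Otherwise a row is compressed only if it is shorter than a raw row.
    return payload_len < ncols * nbytes


# Two rows of 7 one-byte cells: the first stored in 2 bytes, the second in 7.
offsets = [0, 2, 9]
first_is_rle = row_is_rle(row_length(offsets, 0), 7, 1, 1)
second_is_rle = row_is_rle(row_length(offsets, 1), 7, 1, 1)
```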

> When a row is compressed, every value gets its own counter, so we can say
> that a series of (count, value) couples is built.
> Knowing the compressed size, we simply divide by two to get the number of
> couples.

Actually, you need to divide by n+1, where n is the number of bytes
per cell. Each run consists of a one-byte count followed by an n-byte
value. For pre-3.0 compression, n is always the nbytes field from the
fileinfo structure (which is cellhd.format + 1); for the newer form
(cellhd.compressed == 1), n is stored as an extra byte at the
beginning of each line.
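
A minimal sketch of expanding such a row, assuming the layout described above (a one-byte count followed by an n-byte value per run). The function names are hypothetical, values are kept as raw bytes so that nothing is assumed about byte order or sign, and `expand_new_format_row` assumes the row is known to be compressed.

```python
def rle_expand(data, n):
    """Expand runs of (one-byte count, n-byte value) into raw cell bytes."""
    run_size = n + 1
    if len(data) % run_size != 0:
        raise ValueError("compressed length is not a multiple of n + 1")
    out = bytearray()
    for pos in range(0, len(data), run_size):
        count = data[pos]
        value = data[pos + 1:pos + run_size]
        out += value * count
    return bytes(out)


def expand_new_format_row(row):
    """Newer form: the first byte of the row gives n, the runs follow."""
    n = row[0]
    return rle_expand(row[1:], n)


one_byte_cells = rle_expand(bytes([7, 1]), 1)
two_byte_cells = rle_expand(bytes([3, 0, 5, 2, 0, 9]), 2)
```

With one-byte cells, n + 1 is 2, which is why dividing by two gave the right number of couples in the original test.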

> After that, it's rather easy to get the uncompressed row.
> Is this right? This would mean that there can't be a "hybrid" row (i.e.
> couples and singles mixed) as in standard RLE.

Correct.

BTW, if you're interested in the format of raster files, you should
probably look at the {get,put}_row2.c files which I posted recently
(in the "Raster lib and CELL files > 2GB" thread). Hopefully, these
should be somewhat easier to read than the original versions.

In the long run, I'm hoping to completely re-write the raster I/O
code. I'm not planning to support RLE compression (other than to allow
old files to be converted to the new format), but to use zlib for both
integer and FP formats.

However, a complete re-write is a long way off. In the meantime, I'm
considering implementing some of the less radical changes as an
intermediate measure. Primarily, I intend to add support for 64-bit
offsets on 32-bit platforms (so that raster files aren't limited to
2GB). I'm also thinking about supporting the use of zlib for integer
maps, as well as the possibility of eliminating the null file.

--
Glynn Clements <glynn.clements@virgin.net>

Andrea Antonello wrote:
> just a few thoughts about run-length encoding.
> Perhaps someone can tell me whether what I'm saying is right.
>
> When talking about RLE in GRASS, we are talking about integer (CELL) file
> compression.
> As far as I know, GRASS's RLE is not a standard one.

> There isn't any "standard" form of RLE.

Alright...

> Actually, you need to divide by n+1, where n is the number of bytes
> per cell. Each run consists of a one-byte count followed by an n-byte
> value. For pre-3.0 compression, n is always the nbytes field from the
> fileinfo structure (which is cellhd.format + 1); for the newer form
> (cellhd.compressed == 1), n is stored as an extra byte at the
> beginning of each line.

Alright, I was doing tests on one-byte ints.

> After that, it's rather easy to get the uncompressed row.
> Is this right? This would mean that there can't be a "hybrid" row (i.e.
> couples and singles mixed) as in standard RLE.

> Correct.
>
> BTW, if you're interested in the format of raster files, you should
> probably look at the {get,put}_row2.c files which I posted recently
> (in the "Raster lib and CELL files > 2GB" thread). Hopefully, these
> should be somewhat easier to read than the original versions.

Absolutely :)

> In the long run, I'm hoping to completely re-write the raster I/O
> code. I'm not planning to support RLE compression (other than to allow
> old files to be converted to the new format), but to use zlib for both
> integer and FP formats.

I think that would be the best deal.

> However, a complete re-write is a long way off. In the meantime, I'm
> considering implementing some of the less radical changes as an
> intermediate measure. Primarily, I intend to add support for 64-bit
> offsets on 32-bit platforms (so that raster files aren't limited to
> 2GB). I'm also thinking about supporting the use of zlib for integer
> maps, as well as the possibility of eliminating the null file.

Thanks for your advice; it is always a big help.

Ciao
Andrea Antonello


Glynn Clements wrote:

> BTW, if you're interested in the format of raster files, you should
> probably look at the {get,put}_row2.c files which I posted recently
> (in the "Raster lib and CELL files > 2GB" thread). Hopefully, these
> should be somewhat easier to read than the original versions.

I've now committed these to CVS.

> In the long run, I'm hoping to completely re-write the raster I/O
> code. I'm not planning to support RLE compression (other than to allow
> old files to be converted to the new format), but to use zlib for both
> integer and FP formats.
>
> However, a complete re-write is a long way off. In the meantime, I'm
> considering implementing some of the less radical changes as an
> intermediate measure. Primarily, I intend to add support for 64-bit
> offsets on 32-bit platforms (so that raster files aren't limited to
> 2GB). I'm also thinking about supporting the use of zlib for integer
> maps, as well as the possibility of eliminating the null file.

I've also committed fixes to use off_t instead of long throughout the
raster I/O code. If you compile with -D_FILE_OFFSET_BITS=64, you
should be able to have raster maps which are larger than 2Gb (tested
briefly).

The code which reads the row pointers can handle both 32-bit and
64-bit offsets regardless of whether -D_FILE_OFFSET_BITS=64 was used.
Obviously, if you have a file which actually exceeds 2Gb, you can't
read it with a version of GRASS which wasn't built with
-D_FILE_OFFSET_BITS=64.

Also, the code which writes the row pointers will only write 64-bit
offsets when necessary (i.e. if the file is larger than 4Gb), so using
-D_FILE_OFFSET_BITS=64 shouldn't introduce any incompatibilities with
previous versions.
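
The read/write behaviour described above can be sketched as follows. The layout used here (a one-byte offset width followed by big-endian offsets) is an assumption made for the sake of the example and is not spelled out in this message; the function names are hypothetical.

```python
def write_row_pointers(offsets):
    """Serialise row offsets, using 8-byte offsets only when 4 bytes cannot hold them."""
    # Assumed layout: one byte giving the offset width, then big-endian offsets.
    width = 8 if max(offsets) > 0xFFFFFFFF else 4
    out = bytearray([width])
    for off in offsets:
        out += off.to_bytes(width, "big")
    return bytes(out)


def read_row_pointers(blob):
    """Read offsets of whatever width the leading byte announces."""
    width = blob[0]
    body = blob[1:]
    return [int.from_bytes(body[i:i + width], "big")
            for i in range(0, len(body), width)]


small = write_row_pointers([0, 2, 9])
large = write_row_pointers([0, 5 * 2**30])
```

Because the reader takes the width from the data rather than from the build, the same reading code handles files written with either offset size.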

The only tricky issue is pre-3.0 compressed files (indicated by
cellhd.compressed being negative). These have the row pointers in the
native format (in terms of sizeof(long) and endianness) of the system
which wrote the file, with no indication as to exactly which format is
used.

E.g. if a pre-3.0 compressed file was written on a little-endian
system where sizeof(long) == 4, it can only be read on a little-endian
system where sizeof(off_t) == 4 (i.e. using -D_FILE_OFFSET_BITS=64
will prevent such files from being read).
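
To see why the native format matters, here is a small sketch that writes two offsets the way a little-endian system with a 4-byte long would, then reads the same bytes back under other assumptions. The offset values are made up for the example.

```python
import struct

# Two row pointers as a little-endian system with sizeof(long) == 4 would write them.
written = struct.pack("<2l", 100, 300)

# Read back with the same layout: the original values are recovered.
same_layout = struct.unpack("<2l", written)

# Read back as big-endian 4-byte values: both offsets are wrong.
wrong_endian = struct.unpack(">2l", written)

# Read back as a single little-endian 8-byte value: the two offsets merge into one.
wrong_width = struct.unpack("<q", written)
```

Since the file carries no indication of which layout was used, a reader has no reliable way to pick the right interpretation.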

OTOH, I would guess that pre-3.0 compressed files are almost
non-existent these days, so this is unlikely to be an issue.

Note that the changes only affect the raster I/O code. Other files
(e.g. temporary files created directly by a program) will typically
still be limited to 32 bits (most of the code which I've seen not only
uses "long" to hold offsets, but actually calculates offsets using
"int" arithmetic, and so is limited to 32 bits even on systems with a
64-bit "long" type).

In that regard, adding -D_FILE_OFFSET_BITS=64 globally (e.g. by adding
it to CFLAGS when running the configure script) may be risky, as such
programs will open their temporary files using the 64-bit API, but
will silently wrap offsets larger than 2Gb.
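
The wrapping problem can be simulated with a short sketch. It models 32-bit two's-complement wrap-around, which is what such "int" arithmetic typically produces in practice (signed overflow is formally undefined in C); the function names and the raster dimensions are hypothetical.

```python
def as_int32(x):
    """Reduce an integer to the range of a signed 32-bit int, wrapping on overflow."""
    return (x + 2**31) % 2**32 - 2**31


def row_offset_int32(row, ncols, nbytes):
    """Offset of a row computed the way 32-bit int arithmetic would do it."""
    return as_int32(as_int32(row * ncols) * nbytes)


# Row 50000 of a 50000-column, one-byte-per-cell temporary file.
exact = 50000 * 50000 * 1
wrapped = row_offset_int32(50000, 50000, 1)
```

Here the exact offset is about 2.5 billion bytes, beyond what a signed 32-bit int can hold, so the wrapped result is negative even if the file itself was opened with the 64-bit API.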

Without that switch, open() will simply refuse to open files which are
larger than 2Gb, and write() will fail if it would result in the
file's size exceeding 2Gb.

--
Glynn Clements <glynn.clements@virgin.net>