[GRASS5] Raster lib and CELL files > 2GB

Hi,

for a remote sensing project we are running into file sizes
of more than 2GB. I see two options:

a) change the CELL file compression from RLE to DEFLATE
  -> how do I do that? I only want to change it locally for myself
b) enable large file support (http://www.suse.de/~aj/linux_lfs.html)
  -> do any functions need to be changed?

Variant a) sounds somewhat better to me and may already solve
the problem.

Recommendations are welcome.

Markus

Markus Neteler wrote:

> for a remote sensing project we face the problem to arrive
> at file sizes > 2GB. I see two options:
>
> a) change the CELL file compression from RLE to DEFLATE
> -> how to do that? I just want to change locally for me

That's far from simple. The existing low-level raster I/O
implementation is a total mess; learning how it works may well take
longer than re-writing it from scratch.

Compare read_data_fp_compressed() with read_data_compressed() in
get_row.c, and put_data() with put_fp_data() in put_row.c.

> b) enable large file support (http://www.suse.de/~aj/linux_lfs.html)
> -> any functions which need a change?

Search for all occurrences of "long" in libgis, then figure out which
ones need to be changed to off_t. AFAICT, most of them are likely to
be in the following files:

  G.h (row_ptr field)
  closecell.c
  format.c
  get_row.c
  opencell.c
  put_row.c

> Variant a) sounds somewhat better to me and maybe already solves
> the problem.
>
> Recommendations are welcome.

c) obtain a system where "long" is 64 bits.

--
Glynn Clements <glynn.clements@virgin.net>

Glynn,

thanks for your comments below! I forgot to submit additional info:

original ERDAS files:
-rwxr-xr-x 1 dassau users 11184610066 2004-07-09 16:25 040706_rgb_juist.ige*
-rwxr-xr-x 1 dassau users 18261 2004-07-09 16:25 040706_rgb_juist.img*
-rwxr-xr-x 1 dassau users 942589967 2004-07-09 16:26 040706_rgb_juist.rrd*

GRASS (certainly useless) and GDAL recompiled with LFS support:
CFLAGS="-D_FILE_OFFSET_BITS=64 -D_LARGE_FILES" ./configure ...

r.in.gdal then imports the ERDAS files without problem:
grassdata/nnw/juist/cell> l juist_dmcnc.1
-rw-r--r-- 1 dassau users 2317620271 2004-07-09 17:06 juist_dmcnc.1
-> this file is larger than 2GB and is probably correct.

But then d.rast and other reading commands crash:
row 18793
ERROR: error reading compressed map [juist_dmcnc.3] in mapset [juist],

-> does this need the 'long' changes as you suggest?

[ Another, simpler option might be to split the file
  with GDAL tools and then import the pieces ... ]

Thanks

Markus

--
Markus Neteler <neteler itc it> http://mpa.itc.it
ITC-irst - Centro per la Ricerca Scientifica e Tecnologica
MPBA - Predictive Models for Biol. & Environ. Data Analysis
Via Sommarive, 18 - 38050 Povo (Trento), Italy

Markus Neteler wrote:

> thanks for your comments below! I forgot to submit additional info:
>
> original ERDAS files:
> -rwxr-xr-x 1 dassau users 11184610066 2004-07-09 16:25 040706_rgb_juist.ige*
> -rwxr-xr-x 1 dassau users 18261 2004-07-09 16:25 040706_rgb_juist.img*
> -rwxr-xr-x 1 dassau users 942589967 2004-07-09 16:26 040706_rgb_juist.rrd*
>
> GRASS (certainly useless) and GDAL recompiled with LFS support:
> CFLAGS="-D_FILE_OFFSET_BITS=64 -D_LARGE_FILES" ./configure ...
>
> r.in.gdal then imports the ERDAS files without problem:
> grassdata/nnw/juist/cell> l juist_dmcnc.1
> -rw-r--r-- 1 dassau users 2317620271 2004-07-09 17:06 juist_dmcnc.1
> -> this file is larger than 2GB, probably a correct file.
>
> But then d.rast and other reading commands crash:
> row 18793
> ERROR: error reading compressed map [juist_dmcnc.3] in mapset [juist],
>
> -> does this need the 'long' changes as you suggest?

Yes. For files larger than 2GB, file offsets won't fit into a signed
32-bit integer (i.e. a "long" on x86). You have to use off_t instead.
Using -D_FILE_OFFSET_BITS=64 makes off_t a 64-bit type ("long long"),
but that only helps if you're using off_t, and the raster I/O code in
libgis uses "long".
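
To illustrate the point about off_t, here is a minimal sketch, assuming
a glibc-style system where _FILE_OFFSET_BITS is honoured. The helper
names are hypothetical and are not libgis functions; the sketch only
shows how the widths of "long" and off_t can be compared in a given
build.

```c
/* Must be defined before any system header is included. */
#define _FILE_OFFSET_BITS 64

#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: print both widths so they can be compared. */
static void print_offset_widths(void)
{
    printf("sizeof(long)  = %zu\n", sizeof(long));
    printf("sizeof(off_t) = %zu\n", sizeof(off_t));
}

/* Hypothetical helper: fails at run time if this build did not get
 * an off_t wide enough for offsets beyond 2GB. */
static void require_large_offsets(void)
{
    assert(sizeof(off_t) >= 8);
}
```

On 32-bit x86 built this way, one would expect "long" to stay at 4
bytes while off_t becomes 8 bytes, which is why code that stores
offsets in a "long" gains nothing from the flag.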

Also, note that the raster files begin with an array of offsets to the
start of each row. These would have to be changed to use off_t
(format.c, G.h, maybe other places).

One consequence of this is that maps which were created by a version
of GRASS which used 64-bit offsets wouldn't be readable by a version
of GRASS which used 32-bit offsets, and vice-versa. This issue already
exists for systems with a 64-bit "long" type, but supporting 64-bit
offsets on x86 would make the situation a lot more common.
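
The compatibility problem can be sketched as follows. This is only an
illustration under assumptions: it supposes that each row offset is
stored as a fixed number of bytes, most significant byte first, and
that a reader has a signed offset type of a given width. Neither the
layout nor the function name decode_offset() is taken from format.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder: reads one offset of 'width' bytes (big-endian,
 * an assumed layout) and checks whether a reader whose signed offset
 * type has 'reader_bits' bits could hold it.
 * Returns 0 on success, -1 if the offset does not fit. */
static int decode_offset(const unsigned char *buf, size_t width,
                         int reader_bits, int64_t *out)
{
    uint64_t value = 0;
    uint64_t max;
    size_t i;

    assert(width >= 1 && width <= 8);
    assert(reader_bits == 32 || reader_bits == 64);

    for (i = 0; i < width; i++)
        value = (value << 8) | buf[i];

    /* Largest value a signed type with reader_bits bits can hold. */
    max = (UINT64_C(1) << (reader_bits - 1)) - 1;
    if (value > max)
        return -1;

    *out = (int64_t) value;
    return 0;
}
```

With an 8-byte offset holding 2317620271 (the size of juist_dmcnc.1),
a reader with 64-bit offsets succeeds while a reader with 32-bit
offsets has to reject the value.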

--
Glynn Clements <glynn.clements@virgin.net>

Glynn Clements wrote:

> for a remote sensing project we face the problem to arrive
> at file sizes > 2GB. I see two options:
>
> a) change the CELL file compression from RLE to DEFLATE
> -> how to do that? I just want to change locally for me

> That's far from simple. The existing low-level raster I/O
> implementation is a total mess; learning how it works may well take
> longer than re-writing it from scratch.

I have created some flowcharts (in .dia or EPS formats) for get_row.c
and put_row.c if anyone is interested. OTOH, I'm planning on cleaning
this code up, so the flowcharts will become outdated quite quickly.

> Also, note that the raster files begin with an array of offsets to the
> start of each row. These would have to be changed to use off_t
> (format.c, G.h, maybe other places).

Actually, the row pointers are only used for compressed maps. For
uncompressed maps, it just seeks to row * bytes_per_row.

It probably wouldn't be that hard to support uncompressed maps larger
than 2GB. Mostly[1] it should just be a matter of changing long to
off_t in:

  read_data_uncompressed()
  G__read_null_bits()
  G__write_null_bits()
  put_data()
  seek_random()

then compiling with -D_FILE_OFFSET_BITS=64. [The last two are only
necessary to support G_open_cell_new_random().]
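
As a rough sketch of what seeking in an uncompressed map looks like
once off_t is used, assuming POSIX lseek() and a build with
-D_FILE_OFFSET_BITS=64: row_offset() and seek_to_row() are
hypothetical names, not the functions listed above.

```c
/* Must be defined before any system header is included. */
#define _FILE_OFFSET_BITS 64

#include <assert.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: byte offset of a row in an uncompressed map.
 * The cast comes first so the multiplication is done in off_t. */
static off_t row_offset(int row, int bytes_per_row)
{
    assert(row >= 0 && bytes_per_row > 0);
    return (off_t) row * bytes_per_row;
}

/* Hypothetical helper: position fd at the start of the given row.
 * Returns the new offset, or (off_t) -1 on error, as lseek() does. */
static off_t seek_to_row(int fd, int row, int bytes_per_row)
{
    return lseek(fd, row_offset(row, bytes_per_row), SEEK_SET);
}
```

For example, row 30000 of a map with 150000 bytes per row starts at
byte 4500000000, which is beyond what a 32-bit "long" can represent.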

[1] The position of some of the type casts needs to change; e.g. code
such as:

   offset = (long) (size * R * sizeof(unsigned char)) ;

should actually be:

   offset = (long) size * R * sizeof(unsigned char) ;

In the first case, the value is computed at the width of the operands
(size and R are ints, and sizeof yields a size_t, which is 32 bits on
x86), then converted to a long at the end (by which point it will
already have been truncated).

In the second case, size gets converted to a long first, so the
multiplications are all performed using at least the width of a long.

So, the existing code won't actually cope with files >2GB even on
platforms where long is 64 bits, because intermediate values such as
size * R are calculated as ints.
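
The effect of the cast position can be reproduced in isolation. The
sketch below uses fixed-width unsigned types so that the wrap-around
is well-defined; the code discussed above uses plain ints, where
overflow is undefined behaviour rather than a clean wrap. The function
names are made up for the example.

```c
#include <stdint.h>

/* Cast applied to the finished product: the multiplication is done
 * in 32 bits and wraps before the result is widened. */
static int64_t offset_cast_late(uint32_t row, uint32_t bytes_per_row)
{
    return (int64_t) (row * bytes_per_row);
}

/* Cast applied to the first operand: the multiplication itself is
 * done in 64 bits, so nothing is lost. */
static int64_t offset_cast_early(uint32_t row, uint32_t bytes_per_row)
{
    return (int64_t) row * bytes_per_row;
}
```

For row 30000 and 150000 bytes per row, the early cast yields
4500000000, while the late cast yields 205032704 (the true value
reduced modulo 2^32).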

I've already changed this in the code which I've been working on,
which is currently all of get_row.c, although I've mostly left the null
handling alone. I haven't started on put_row.c yet.

The null handling is an even bigger mess than the rest of it. It's
also very inefficient.

Essentially, it reads blocks of 8 rows (NULL_ROWS_INMEM, from G.h) at a
time, into the NULL_ROWS array of the fileinfo structure. For each
block, it locates the null file using G_find_file(), opens it with
G_open_old() (which also locates it using G_find_file()), reads the
data, then closes the file.

This saves having to keep the descriptor open, but it's likely to have
a significant performance impact. OTOH, keeping the null descriptor
open could halve the maximum number of raster maps which could be open
at a time (assuming that the limiting factor is the OS' open files
limit).

--
Glynn Clements <glynn.clements@virgin.net>

Glynn Clements wrote:

> That's far from simple. The existing low-level raster I/O
> implementation is a total mess; learning how it works may well take
> longer than re-writing it from scratch.

> I have created some flowcharts (in .dia or EPS formats) for get_row.c
> and put_row.c if anyone is interested. OTOH, I'm planning on cleaning
> this code up, so the flowcharts will become outdated quite quickly.

The attached .tar.gz file contains cleaned-up versions of get_row.c
and put_row.c, which could use some testing (the files are called
get_row2.c and put_row2.c, so you need to modify the Gmakefile to use
them instead of the originals).

The updated versions should behave identically to the originals, i.e.
a "cell" file created by put_row2.c should be identical to that
created by the original version. The changes are purely code
simplification (e.g. splitting up some of the larger functions,
isolating duplicated code, using array indexing instead of pointers).

One possibility is to build two versions of (shared) libgis, with the
old and new files, and use LD_LIBRARY_PATH to select the appropriate
version at run-time (this requires the recent update to aclocal.m4 to
not use -rpath).

--
Glynn Clements <glynn.clements@virgin.net>

(attachments)

raster-io.tar.gz (9.63 KB)

On Tue, 13 Jul 2004 11:50:08 +0100
Glynn Clements <glynn.clements@virgin.net> wrote:

> Yes. For files larger than 2GB, file offsets won't fit into 32 bits
> (i.e. a "long" on x86). You have to use off_t instead. Using
> -D_FILE_OFFSET_BITS=64 makes off_t a 64-bit type ("long long"), but
> that only helps if you're using off_t, and the raster I/O code in
> libgis uses "long".
>
> Also, note that the raster files begin with an array of offsets to the
> start of each row. These would have to be changed to use off_t
> (format.c, G.h, maybe other places).
>
> One consequence of this is that maps which were created by a version
> of GRASS which used 64-bit offsets wouldn't be readable by a version
> of GRASS which used 32-bit offsets, and vice-versa. This issue already
> exists for systems with a 64-bit "long" type, but supporting 64-bit
> offsets on x86 would make the situation a lot more common.

Hi Glynn,

Thank you very much for your work! I just tested some maps and AFAICS it works.

dassau@berlin:~> l grassdata/nnw/juist/cell/
-rw-r--r-- 1 dassau users 2317620271 2004-07-09 17:06 juist_dmcnc.1
-rw-r--r-- 1 dassau users 2277930187 2004-07-09 17:28 juist_dmcnc.2

    Otto