[GRASS-dev] r.in.gdal: how to speed-up import with huge amount of bands?

Hi,

I have received a temperature map time series of 50 years of daily data
as single GeoTIFF files (Tmin, Tmean, Tmax). Each GeoTIFF has around
21550 bands and a file size of about 4GB.

Problem: the import takes "forever" despite using a superfast disk:
120 seconds per band (the raster is only 464 x 201), i.e. 29 DAYS per file.

The problem is presumably that it has to run through the entire 4GB file
over and over again for each channel. I guess I want to do heavy caching - but how?

Looking at ImportBand() in main.c of r.in.gdal, I see that GDALRasterIO()
is used:
http://www.gdal.org/gdal_tutorial.html#gdal_tutorial_read
but I don't see any hint on how to tell the I/O function to keep more in cache.

The original files are in netCDF format; I used gdalwarp to generate the GeoTIFFs.
At first I thought of writing them out as 21550 single-band files but didn't
find an option to do so.

Clueless,
Markus

On Mon, Mar 29, 2010 at 8:35 AM, Markus Neteler <neteler@osgeo.org> wrote:

> Hi,
>
> I have received a temperature map time series of 50 years of daily data
> as single GeoTIFF files (Tmin, Tmean, Tmax). Each GeoTIFF has around
> 21550 bands and a file size of about 4GB.
>
> Problem: the import takes "forever" despite using a superfast disk:
> 120 seconds per band (the raster is only 464 x 201), i.e. 29 DAYS per file.

> The problem is presumably that it has to run through the entire 4GB file
> over and over again for each channel. I guess I want to do heavy caching - but how?

> Looking at ImportBand() in main.c of r.in.gdal, I see that GDALRasterIO()
> is used:
> http://www.gdal.org/gdal_tutorial.html#gdal_tutorial_read
> but I don't see any hint on how to tell the I/O function to keep more in cache.

I found something in nearblack.c of GDAL:

Index: raster/r.in.gdal/main.c

--- raster/r.in.gdal/main.c (revision 41604)
+++ raster/r.in.gdal/main.c (working copy)
@@ -666,6 +666,7 @@
     /* Select a cell type for the new cell. */
     /* -------------------------------------------------------------------- */
     eRawGDT = GDALGetRasterDataType(hBand);
+ GDALSetCacheMax (2000000000); /* heavy caching */

     switch (eRawGDT) {
     case GDT_Float32:

It allocates way more RAM (2GB) but the speed remains exactly
the same: 120 seconds per band.

> The original files are in netCDF format; I used gdalwarp to generate the GeoTIFFs.
> At first I thought of writing them out as 21550 single-band files but didn't
> find an option to do so.
>
> Clueless,
> Markus

On Mon, Mar 29, 2010 at 9:00 AM, Markus Neteler <neteler@osgeo.org> wrote:

> On Mon, Mar 29, 2010 at 8:35 AM, Markus Neteler <neteler@osgeo.org> wrote:

> > Hi,
> >
> > I have received a temperature map time series of 50 years of daily data
> > as single GeoTIFF files (Tmin, Tmean, Tmax). Each GeoTIFF has around
> > 21550 bands and a file size of about 4GB.
> >
> > Problem: the import takes "forever" despite using a superfast disk:
> > 120 seconds per band (the raster is only 464 x 201), i.e. 29 DAYS per file.

...

> Index: raster/r.in.gdal/main.c
>
> --- raster/r.in.gdal/main.c (revision 41604)
> +++ raster/r.in.gdal/main.c (working copy)
> @@ -666,6 +666,7 @@
>      /* Select a cell type for the new cell. */
>      /* -------------------------------------------------------------------- */
>      eRawGDT = GDALGetRasterDataType(hBand);
> + GDALSetCacheMax (2000000000); /* heavy caching */
>
>      switch (eRawGDT) {
>      case GDT_Float32:

> It allocates way more RAM (2GB) but the speed remains exactly
> the same: 120 seconds per band.

Ha! Setting the cache to the file size + minor overhead helps. Now it
takes 5 seconds instead of 120...

At this point I would implement this as cache= parameter. The question
is how to preset it. Or make it a flag "make cache as large as input file"?

Markus

> It allocates way more RAM (2GB) but the speed remains exactly
> the same: 120 seconds per band.

> The original files are in netCDF format, I used gdalwarp to
> generate GeoTIFF.
> First I thought to write it out as 21550 files but didn't
> find an option to do so.

gdal_translate can do that; e.g. see the examples on the MODIS GRASS wiki
page of extracting the quality band into its own raster map.

ncdump might be able to do the same.

Are the netCDF/GeoTIFF files pixel- or band-interleaved? If pixel-interleaved,
there's not much you can do to avoid reading the whole file in. If
band-interleaved, it should (theoretically) be pretty quick to fseek() to the
right part of the file.

Other than that, split the job in four and borrow three other machines
for a week...

good luck,
Hamish

Markus Neteler wrote:

> Index: raster/r.in.gdal/main.c
> ===================================================================
> --- raster/r.in.gdal/main.c (revision 41604)
> +++ raster/r.in.gdal/main.c (working copy)
> @@ -666,6 +666,7 @@
> /* Select a cell type for the new cell. */
> /* -------------------------------------------------------------------- */
> eRawGDT = GDALGetRasterDataType(hBand);
> + GDALSetCacheMax (2000000000); /* heavy caching */
>
> switch (eRawGDT) {
> case GDT_Float32:
>
> It allocates way more RAM (2GB) but the speed remains exactly
> the same: 120 seconds per band.

> Ha! Setting the cache to the file size + minor overhead helps. Now it
> takes 5 seconds instead of 120...
>
> At this point I would implement this as cache= parameter. The question
> is how to preset it.

I suggest that the default should be "don't call GDALSetCacheMax()",
i.e.:

  if (parm.cache->answer && *parm.cache->answer)
      GDALSetCacheMax(atol(parm.cache->answer));

If the file is larger than will fit into physical memory, and is
interleaved by pixel, you lose; there is no way to make that case
fast with the existing code.

You could make it fast by importing multiple bands concurrently rather
than sequentially, i.e. "foreach row {foreach band ...}" rather than
"foreach band {foreach row ...}". But that's likely to be problematic
with 21550 bands, due to limits on open files and per-open-file
resource consumption. It's also undesirable if the data is
band-sequential.

Ideally you would want to be able to have "parm.band->multiple = YES"
in conjunction with a choice between band-then-row and row-then-band
access patterns, but that requires more complex code. OTOH, when
you're dealing with very large amounts of data, there isn't really any
sane alternative to choosing the access pattern to match the data.

> Or make it a flag "make cache as large as input file"?

I suspect that such an option may get overused. For data which is
band-interleaved-by-line or band-sequential, it's likely to be
unnecessary and may be counter-productive (e.g. it may cause GDAL to
allocate the cache from swap, resulting in an unnecessary disc-to-disc
copy).

--
Glynn Clements <glynn@gclements.plus.com>

On Mon, Mar 29, 2010 at 8:08 PM, Glynn Clements
<glynn@gclements.plus.com> wrote:

> Markus Neteler wrote:
>
> > At this point I would implement this as cache= parameter. The question
> > is how to preset it.
>
> I suggest that the default should be "don't call GDALSetCacheMax()",
> i.e.:
>
>   if (parm.cache->answer && *parm.cache->answer)
>       GDALSetCacheMax(atol(parm.cache->answer));

OK - I have implemented it this way (in MiB, to recycle the translated
string from r.proj).

> If the file is larger than will fit into physical memory, and is
> interleaved by pixel, you lose; there is no way to make that case
> fast with the existing code.
>
> You could make it fast by importing multiple bands concurrently rather
> than sequentially, i.e. "foreach row {foreach band ...}" rather than
> "foreach band {foreach row ...}". But that's likely to be problematic
> with 21550 bands, due to limits on open files and per-open-file
> resource consumption.

Yes, I remember my ("only") 1460 MODIS files battle with r.series...

> It's also undesirable if the data is band-sequential.
>
> Ideally you would want to be able to have "parm.band->multiple = YES"
> in conjunction with a choice between band-then-row and row-then-band
> access patterns, but that requires more complex code. OTOH, when
> you're dealing with very large amounts of data, there isn't really any
> sane alternative to choosing the access pattern to match the data.

ok. Since I managed to reduce the import to 5% of the previous
time consumption, I am quite happy now.
[btw: there seems to be a subtle memory leak somewhere which
accumulates with so many bands]

> > Or make it a flag "make cache as large as input file"?
>
> I suspect that such an option may get overused. For data which is
> band-interleaved-by-line or band-sequential, it's likely to be
> unnecessary and may be counter-productive (e.g. it may cause GDAL to
> allocate the cache from swap, resulting in an unnecessary disc-to-disc
> copy).

Thanks for your comments.
The new parameter is now:

r.in.gdal
...
    memory Cache size (MiB)

So far it is added to GRASS 7 and 6.5; I'll backport it to 6.4.1 later.

Markus