[GRASS-dev] Avoiding maximum open files limit in r.series

Dear devs,
just for your information, I have added support for input files
with newline-separated map names to r.series.
r.series now supports two input methods, file and input. Using the
<file> option is slower but avoids the open file descriptor limit.

Now we can process tens of thousands of maps with r.series without hitting
the open file descriptor limit or the maximum command line size limit.
The limiting factor is now RAM.

I have tested r.series with ~6000 maps (ECA&D daily temperature data
from 1995-2010), each with ~100000 cells. Computation with method
average takes about 3 minutes on my (fast) machine.
The memory footprint is about 330MB of RAM. But this looks like a memory
leak to me, because the memory consumption rises linearly with the
processed rows of the output map. All the memory allocation in
r.series is done before the row processing ... ???

Change:
http://trac.osgeo.org/grass/changeset/48638

The changes also include some tests and the temporal module tr.series.

Best regards
Soeren

Sören Gebbert wrote:

Dear devs,
just for your information, I have added support for input files
with newline-separated map names to r.series.
r.series now supports two input methods, file and input. Using the
<file> option is slower but avoids the open file descriptor limit.

I've made some changes (mostly just clean-up) to this; can you test
r48654?

Main changes:

Make opening/reading/closing maps for each row a separate feature
(-x flag). This has a significant performance impact, may be
unnecessary ("ulimit -n" is 1024 by default, but this can be changed
if you have sufficient privilege; 100k open files is quite possible),
and may be necessary even if map names are specified on the command
line (via input=).

Only read the file once, reallocating the array dynamically (a sketch
of this pattern follows this list).

Can't use G_check_input_output_name, as parm.output->multiple=YES.

Don't use C99-specific features (specifically, variable declarations
intermingled with statements).

Move variables from function scope to block scope where possible.
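
To make the "read the file once, reallocating the array dynamically"
item concrete, here is a minimal sketch of the pattern, assuming a
plain stdlib implementation (the real r.series code uses the GRASS
G_* allocation and error-handling wrappers; the function name
read_map_names is made up for this example):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read newline-separated map names, growing the array geometrically so
 * the file is only scanned once.  Sketch only, not the r.series code. */
static char **read_map_names(const char *path, int *count)
{
    FILE *fp = fopen(path, "r");
    char **names = NULL;
    int num = 0, alloc = 0;
    char buf[4096];

    if (!fp)
        return NULL;

    while (fgets(buf, sizeof(buf), fp)) {
        buf[strcspn(buf, "\r\n")] = '\0';      /* strip the newline   */
        if (buf[0] == '\0')
            continue;                          /* ignore blank lines  */
        if (num >= alloc) {
            alloc = alloc ? 2 * alloc : 64;    /* double the capacity */
            names = realloc(names, alloc * sizeof(char *));
            if (!names) {
                fclose(fp);
                return NULL;
            }
        }
        names[num++] = strdup(buf);
    }
    fclose(fp);

    *count = num;
    return names;
}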

I have tested r.series with ~6000 maps (ECA&D daily temperature data
from 1995-2010), each with ~100000 cells. Computation with method
average takes about 3 minutes on my (fast) machine.
The memory footprint is about 330MB of RAM. But this looks like a memory
leak to me, because the memory consumption rises linearly with the
processed rows of the output map. All the memory allocation in
r.series is done before the row processing ... ???

I suspect that this will be due to re-opening the maps for each row.
Normally, an overhead on each call to Rast_open_old() would be
considered a per-map overhead, and we wouldn't worry about a few kB
per map.

Opening a map is quite an expensive operation, as it has to find which
mapset contains the map, determine its type (CELL/FCELL/DCELL), read
its cellhd (and possibly other files, e.g. reclass table), set up the
column mapping, etc.

For this particular case (and anything else like it), the process
could be accelerated significantly by keeping the fileinfo structure
around and just closing and re-opening (and re-positioning) the
descriptors (one for the raster data, one for the null bitmap).
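
As an illustration of that idea, a rough sketch at the descriptor
level; the fileinfo fields shown here (the stored paths and saved
offsets) are assumptions made for the example, not the actual
lib/raster layout:

#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/* Assumed per-map bookkeeping for this sketch. */
struct fileinfo {
    char data_path[4096], null_path[4096];
    off_t data_off, null_off;
    int data_fd, null_fd;
};

/* Give the two descriptors back to the OS, but keep all the parsed
 * state (cellhd, column mapping, ...) in the fileinfo structure. */
static void suspend_map(struct fileinfo *fcb)
{
    fcb->data_off = lseek(fcb->data_fd, 0, SEEK_CUR);
    fcb->null_off = lseek(fcb->null_fd, 0, SEEK_CUR);
    close(fcb->data_fd);
    close(fcb->null_fd);
    fcb->data_fd = fcb->null_fd = -1;
}

/* Re-open and re-position the descriptors; nothing (cellhd, reclass
 * table, ...) has to be parsed again, which is the expensive part of
 * a full Rast_open_old(). */
static int resume_map(struct fileinfo *fcb)
{
    fcb->data_fd = open(fcb->data_path, O_RDONLY);
    fcb->null_fd = open(fcb->null_path, O_RDONLY);
    if (fcb->data_fd < 0 || fcb->null_fd < 0)
        return -1;
    lseek(fcb->data_fd, fcb->data_off, SEEK_SET);
    lseek(fcb->null_fd, fcb->null_off, SEEK_SET);
    return 0;
}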

One significant problem with doing this, however, is that raster maps
are identified by the file descriptor for their data: the "fd"
parameter to Rast_get_row() etc, and the index into the R__.fileinfo
array, is the actual file descriptor.

It wouldn't be a great deal of work to change this, so that the "fd"
parameter was just the index into the R__.fileinfo array, and the
fileinfo structure contained the actual fd. However, we would need to
make sure that we catch every case where "fd" needs to be changed to
e.g. R__.fileinfo[fd].data_fd.
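
For illustration, a minimal before/after sketch of what this change
means; the structure members and the R__ layout here are simplified
assumptions, not the real lib/raster definitions:

#include <sys/types.h>
#include <unistd.h>

/* Assumed, simplified per-map bookkeeping. */
struct fileinfo {
    int data_fd;      /* OS descriptor for the cell/fcell/dcell data */
    int null_fd;      /* OS descriptor for the null bitmap           */
    /* ... cellhd, column mapping, reclass info, ...                 */
};

static struct {
    struct fileinfo *fileinfo;
    int fileinfo_count;
} R__;

/* Current scheme: the "fd" handed to Rast_get_row() is the OS
 * descriptor itself and doubles as the index into R__.fileinfo, so
 * the two cannot be decoupled.
 *
 * Proposed scheme: "fd" is only an index; the OS descriptor lives
 * inside the structure, and every raw read has to go through it. */
static ssize_t read_raster_bytes(int fd, void *buf, size_t len)
{
    struct fileinfo *fcb = &R__.fileinfo[fd];  /* fd is just an index */

    return read(fcb->data_fd, buf, len);       /* the real descriptor */
}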

--
Glynn Clements <glynn@gclements.plus.com>

Hello Glynn,

2011/10/6 Glynn Clements <glynn@gclements.plus.com>:

Sören Gebbert wrote:

Dear devs,
just for your information, I have added support for input files
with newline-separated map names to r.series.
r.series now supports two input methods, file and input. Using the
<file> option is slower but avoids the open file descriptor limit.

I've made some changes (mostly just clean-up) to this; can you test
r48654?

Main changes:

Make opening/reading/closing maps for each row a separate feature
(-x flag). This has a significant performance impact, may be
unnecessary ("ulimit -n" is 1024 by default, but this can be changed
if you have sufficient privilege; 100k open files is quite possible),
and may be necessary even if map names are specified on the command
line (via input=).

Neither my colleagues nor our system administrators know how to
increase the open file limit on Unix machines. And I don't know if it
is possible to set this limit on Windows or Mac OS, so I thought it
would be a meaningful addition. I have hit the Python subprocess limit
on command line arguments when running r.series via
grass.script.run_command() and did not find a solution ... except to
patch r.series ... .

Only read the file once, reallocating the array dynamically.

Can't use G_check_input_output_name, as parm.output->multiple=YES.

Oh, yes indeed. I will add some more tests to cover this r.series behavior.

Don't use C99-specific features (specifically, variable declarations
intermingled with statements).

Move variables from function scope to block scope where possible.

I have tested r.series with ~6000 maps (ECA&D daily temperature data
from 1995-2010), each with ~100000 cells. Computation with method
average takes about 3 minutes on my (fast) machine.
The memory footprint is about 330MB of RAM. But this looks like a memory
leak to me, because the memory consumption rises linearly with the
processed rows of the output map. All the memory allocation in
r.series is done before the row processing ... ???

I suspect that this will be due to re-opening the maps for each row.
Normally, an overhead on each call to Rast_open_old() would be
considered a per-map overhead, and we wouldn't worry about a few kB
per map.

Opening a map is quite an expensive operation, as it has to find which
mapset contains the map, determine its type (CELL/FCELL/DCELL), read
its cellhd (and possibly other files, e.g. reclass table), set up the
column mapping, etc.

For this particular case (and anything else like it), the process
could be accelerated significantly by keeping the fileinfo structure
around and just closing and re-opening (and re-positioning) the
descriptors (one for the raster data, one for the null bitmap).

One significant problem with doing this, however, is that raster maps
are identified by the file descriptor for their data: the "fd"
parameter to Rast_get_row() etc, and the index into the R__.fileinfo
array, is the actual file descriptor.

It wouldn't be a great deal of work to change this, so that the "fd"
parameter was just the index into the R__.fileinfo array, and the
fileinfo structure contained the actual fd. However, we would need to
make sure that we catch every case where "fd" needs to be changed to
e.g. R__.fileinfo[fd].data_fd.

So we have two options to solve the memory leak?
1.) Correct the memory management when closing maps
2.) Modify the raster map identification

Is it worth the effort to correct the memory management when closing
maps, or should we try to change the raster map identification?
Maybe we can provide additional functions which only initialize the
fileinfo structure but do not keep file descriptors open? And the call
to Rast_open_old() would open file descriptors only in case the
fileinfo is already set up? An additional Raster_close_fd_only() could
be added to close only the file descriptors. In this case only
r.series must be patched, and the API stays consistent for other modules.

However, it may be worthwhile to change the map identification and to
correct the memory management.

Best regards
Soeren


2011/10/7 Sören Gebbert <soerengebbert@googlemail.com>:

2011/10/6 Glynn Clements <glynn@gclements.plus.com>:

...

Make opening/reading/closing maps for each row a separate feature
(-x flag). This has a significant performance impact, may be
unnecessary ("ulimit -n" is 1024 by default, but this can be changed
if you have sufficient privilege; 100k open files is quite possible),
and may be necessary even if map names are specified on the command
line (via input=).

Neither my colleagues nor our system administrators know how to
increase the open file limit on Unix machines.

Just for the record, it is explained in the manual :-)
http://grass.osgeo.org/grass64/manuals/html64_user/r.series.html

(no idea though about non-Unix-like OS)

Markus

Sören Gebbert wrote:

> Make opening/reading/closing maps for each row a separate feature
> (-x flag). This has a significant performance impact, may be
> unnecessary ("ulimit -n" is 1024 by default, but this can be changed
> if you have sufficient privilege; 100k open files is quite possible),
> and may be necessary even if map names are specified on the command
> line (via input=).

Neither my colleagues nor our system administrators know how to
increase the open file limit on Unix machines. And I don't know if it
is possible to set this limit on Windows or Mac OS, so I thought it
would be a meaningful addition. I have hit the Python subprocess limit
on command line arguments when running r.series via
grass.script.run_command() and did not find a solution ... except to
patch r.series ... .

The point is that the limit on the length of a command line and the
limit on the number of open files are quite separate.

The open file limit might be exceeded when map names are given on the
command line, and it might not be exceeded when map names are read
from a file (reading map names from a file needn't be restricted to
the case where there are too many names to fit on the command line; it
may just be more convenient to read the names from a file).

So rather than having the open/close behaviour tied to input= versus
file=, I added a separate flag for it. If you run out of file
descriptors ("Too many open files", i.e. errno==EMFILE), use -x;
otherwise, not using -x is likely to be faster.
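
For completeness: on POSIX systems an unprivileged process can also
raise its own soft limit up to the hard limit at run time, which is
why the default of 1024 is not a hard wall. A minimal sketch using the
standard getrlimit/setrlimit calls (not something r.series currently
does, just an illustration):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    /* Query the current open-file limits (what "ulimit -n" reports). */
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("soft: %lu  hard: %lu\n",
           (unsigned long) rl.rlim_cur, (unsigned long) rl.rlim_max);

    /* Raising the soft limit up to the hard limit needs no privilege;
       raising the hard limit itself does. */
    rl.rlim_cur = rl.rlim_max;
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        perror("setrlimit");

    return 0;
}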

>> The memory footprint is about 330MB of RAM. But this looks like a memory
>> leak to me, because the memory consumption rises linearly with the
>> processed rows of the output map. All the memory allocation in
>> r.series is done before the row processing ... ???
>
> I suspect that this will be due to re-opening the maps for each row.
> Normally, an overhead on each call to Rast_open_old() would be
> considered a per-map overhead, and we wouldn't worry about a few kB
> per map.

So we have two options to solve the memory leak?
1.) Correct the memory management when closing maps
2.) Modify the raster map identification

Is it worth the effort to correct the memory management when closing
maps, or should we try to change the raster map identification?

I'm not sure what you mean by "identification".

Maybe we can provide additional functions which only initialize the
fileinfo structure but do not keep file descriptors open?

That's what I was suggesting; e.g. Rast_suspend() and Rast_resume().
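
Purely as an illustration of how such a pair might be used from the
r.series row loop; Rast_suspend() and Rast_resume() do not exist yet,
so the calls below are guessed signatures, not existing API:

/* Hypothetical usage: keep every fileinfo structure alive, but hold
 * real OS descriptors open for only one map at a time. */
for (row = 0; row < nrows; row++) {
    for (i = 0; i < num_inputs; i++) {
        Rast_resume(inputs[i].fd);                 /* re-open + re-position */
        Rast_get_d_row(inputs[i].fd, inputs[i].buf, row);
        Rast_suspend(inputs[i].fd);                /* release the OS fds    */
    }
    /* ... aggregate the row with the chosen method and write it out ... */
}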

And the call
to Rast_open_old() would open file descriptors only in case the
fileinfo is already set up?

I think that Rast_open_old() should always create a new fileinfo
structure. There may be modules which will get confused if two
Rast_open_old() calls return the same file descriptor.

--
Glynn Clements <glynn@gclements.plus.com>

Hi Glynn,

On 08.10.2011 14:34, "Glynn Clements" <glynn@gclements.plus.com> wrote:

Sören Gebbert wrote:

....

That's what I was suggesting; e.g. Rast_suspend() and Rast_resume().

Can you please implement this? I think I do not have the knowledge to
implement it, and if I try you will correct it anyway. ;-)

I think the suspend and resume functionality will be an important
feature, not only to increase the performance of r.series with the
"-x" flag, but also for future spatio-temporal raster analysis modules
(spatio-temporal neighborhood analysis, spatio-temporal interpolation,
spatio-temporal pattern recognition, ...).

Best regards
Soeren

Sören Gebbert wrote:

> That's what I was suggesting; e.g. Rast_suspend() and Rast_resume().

Can you please implement this? I think I do not have the knowledge to
implement it, and if I try you will correct it anyway. ;-)

Can you (and everyone else, for that matter) test r48841?

It doesn't actually implement the requested functionality. However, it
implements a necessary precursor, specifically:

One significant problem with doing this, however, is that raster maps
are identified by the file descriptor for their data: the "fd"
parameter to Rast_get_row() etc, and the index into the R__.fileinfo
array, is the actual file descriptor.

It wouldn't be a great deal of work to change this, so that the "fd"
parameter was just the index into the R__.fileinfo array, and the
fileinfo structure contained the actual fd. However, we would need to
make sure that we catch every case where "fd" needs to be changed to
e.g. R__.fileinfo[fd].data_fd.

This is a fairly fundamental change (albeit one which shouldn't affect
anything outside of the core lib/raster code), so it needs to be
thoroughly tested before anything else is done in that area.

In addition to "native" raster maps, testing r.external/r.out.external
GDAL-linked maps is also necessary.

--
Glynn Clements <glynn@gclements.plus.com>

Hi,

This is a fairly fundamental change (albeit one which shouldn't affect
anything outside of the core lib/raster code), so it needs to be
thoroughly tested before anything else is done in that area.

I have tested the raster functionality with r.mapcalc and r.series
extensively. No problems found so far.

In addition to "native" raster maps, testing r.external/r.out.external
GDAL-linked maps is also necessary.

While implementing spatio-temporal data import and export using
r.in.gdal, r.out.gdal and r.external, I found some strange behavior:
the export of raster data works fine, but importing/linking takes a
lot more time than expected. Export with r.out.gdal: 0.033s;
link/import with r.external or r.in.gdal: ~17s:

GRASS 7.0.svn (XYLocation@test):~/src/grass7.0/grass_trunk/temporal/tr.export > r.info prec_1 -gs

north=80
south=0
east=120
west=0
nsres=10
ewres=10

GRASS 7.0.svn (XYLocation@test):~/src/grass7.0/grass_trunk/temporal/tr.export > time r.out.gdal in=prec_1 out=prec_1.tif

Checking GDAL data type and nodata value...
100%
WARNING: Unable to set projection
Exporting raster data to GTiff format...
100%
r.out.gdal complete. File <prec_1.tif> created.

real 0m0.033s
user 0m0.012s
sys 0m0.020s

GRASS 7.0.svn (XYLocation@test):~/src/grass7.0/grass_trunk/temporal/tr.export > time r.external in=prec_1.tif output=new_test

Projection of input dataset and current location appear to match
Reading band 1 of 1...
r.external complete. Link to raster map <new_test> created.

real 0m17.638s
user 0m17.237s
sys 0m0.016s

GRASS 7.0.svn (XYLocation@test):~/src/grass7.0/grass_trunk/temporal/tr.export > time r.in.gdal in=prec_1.tif output=new_test_2

Projection of input dataset and current location appear to match
100%
r.in.gdal complete. Raster map <new_test_2> created.

real 0m16.394s
user 0m16.221s
sys 0m0.044s

Best regards
Soeren

Sören Gebbert wrote:

While implementing spatio-temporal data import and export using
r.in.gdal, r.out.gdal and r.external, I found some strange behavior:
the export of raster data works fine, but importing/linking takes a
lot more time than expected. Export with r.out.gdal: 0.033s;
link/import with r.external or r.in.gdal: ~17s:

I don't think that this is related to my changes.

Does passing the -c flag to r.out.gdal significantly affect the time
taken to import the resulting file?

--
Glynn Clements <glynn@gclements.plus.com>