[GRASS-dev] [GRASS GIS] #2676: r.neighbors on large rasters

#2676: r.neighbors on large rasters
---------------------------------------+-------------------------
Reporter: dnewcomb | Owner: grass-dev@…
     Type: enhancement | Status: new
Priority: normal | Milestone: 7.0.1
Component: Default | Version: svn-trunk
Keywords: r.neighbors large rasters | CPU: x86-64
Platform: Linux |
---------------------------------------+-------------------------
Running:

  r.neighbors --o --verbose input=naip_change output=naip_change_average method=average size=29

The input raster is (from r.info output):

Total Cells: 365659744068

The command starts and shows 0% complete, then crashes with a segmentation
fault.

The Linux top program shows the process using 1.5 GB of RAM before the
crash.

Using grass-7.0.svn_src_snapshot_2015_05_02.

In main.c , line 139 I see :
int i , n;

I assume that this means that r.neighbors is limited to rasters with the
number of cells in the 32-bit integer range?

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/2676>
GRASS GIS <http://grass.osgeo.org>

Comment (by glynn):

Replying to [ticket:2676 dnewcomb]:

> In main.c , line 139 I see :
> int i , n;
>
> I assume that this means that r.neighbors is limited to rasters with
> the number of cells in the 32 bit integer range?

If it is, it has nothing to do with the above declaration. "i" will never
exceed the number of outputs, while "n" will never exceed the number of
cells in the neighbourhood.

r.neighbors operates row-by-row, only holding as many rows as are in the
neighbourhood, and for method=average it isn't maintaining category data,
so it shouldn't have any problems with rasters with more than 2^31 cells.
At least, there's no more reason for r.neighbors to have such issues than
any other raster module.
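
To make the distinction concrete, here is a minimal C sketch (not code from
r.neighbors; the helper names total_cells and fits_in_int32 are made up for
illustration). It shows that the total cell count reported in the ticket
overflows a 32-bit int, while the row count, the column count, and a 29x29
neighbourhood count each fit comfortably.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: total cell count, computed in 64 bits so the
 * product cannot wrap even when it exceeds the 32-bit range. */
static int64_t total_cells(int rows, int cols)
{
    assert(rows >= 0 && cols >= 0);
    return (int64_t)rows * (int64_t)cols;
}

/* Hypothetical helper: nonzero if n is representable as a 32-bit
 * signed integer. */
static int fits_in_int32(int64_t n)
{
    return n >= INT32_MIN && n <= INT32_MAX;
}
```

With the dimensions from the report, total_cells(440046, 830958) does not fit
in 32 bits, whereas 29 * 29 = 841 neighbourhood cells obviously does.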

What are the actual dimensions (rows x columns) of the current region?
(The dimensions of the input map aren't directly relevant).

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/2676#comment:1>

Comment (by dnewcomb):

Replying to [comment:1 glynn]:
> What are the actual dimensions (rows x columns) of the current region?
> (The dimensions of the input map aren't directly relevant).

Sorry, rookie mistake.

The region stats are these.

rows: 1969532
cols: 1875705
cells: 3694261020060

I had used r.external -e to expand the extent of the region to include the
linked raster.

Setting the region to match the raster gives me this size region.

rows: 440046
cols: 830958
cells: 365659744068

This seems to be working so far.

Sorry for the noise.

I am still curious what the upper limit for the command is, though.

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/2676#comment:2>

Changes (by dnewcomb):

* status: new => closed
* resolution: => invalid

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/2676#comment:3>

Changes (by glynn):

* status: closed => reopened
* resolution: invalid =>

Comment:

Replying to [comment:2 dnewcomb]:

> > What are the actual dimensions (rows x columns) of the current region?
> > (The dimensions of the input map aren't directly relevant).

Sorry, that's incorrect. If you don't use the -a flag, r.neighbors sets
the region to match the input map.

In any case, the cause is the use of G_alloca() for allocating temporary
row buffers within the raster library. This is a macro which expands to
alloca(), which allocates memory on the stack.

While this is vastly more efficient than malloc() etc (alloca() may
compile to just 2 CPU instructions), it doesn't report allocation failure;
the next instruction to push something onto the stack will result in a
segfault.
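
For contrast, a heap allocation can be checked before the buffer is used. The
following is only a sketch of that idea: alloc_row_checked is a hypothetical
helper, not a function from the GRASS raster library, and it says nothing
about how the library actually allocates its row buffers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not a GRASS function): allocate one row buffer on
 * the heap. Returns NULL if cols is zero, if cols * cell_size would
 * overflow size_t, or if malloc() fails, so the caller can report an
 * error instead of crashing. The caller releases the buffer with free(). */
static void *alloc_row_checked(size_t cols, size_t cell_size)
{
    assert(cell_size > 0);
    if (cols == 0 || cols > SIZE_MAX / cell_size)
        return NULL;
    return malloc(cols * cell_size);
}
```

A caller would test the result for NULL and print a diagnostic, which is
exactly the step that a stack allocation via alloca() cannot offer.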

1875705 columns at 8 bytes per cell corresponds to 15 MB for a row. On my
system, the default maximum stack size (ulimit -s) is 8192 KiB (8 MiB).
Changing this to 50 MiB avoids the segfault (although I have neither the
patience nor the free disk space to see whether it runs to completion).
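
The arithmetic above can be checked against the soft stack limit of the
running process. This is a rough sketch for POSIX systems using getrlimit();
row_bytes and row_fits_in_stack are hypothetical names, and the comparison is
optimistic because the stack also holds every other frame and buffer in use.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

/* Hypothetical helper: bytes needed for one row of cols cells. */
static size_t row_bytes(size_t cols, size_t cell_size)
{
    assert(cell_size > 0);
    return cols * cell_size;
}

/* Hypothetical helper: compare a buffer size with the soft stack limit
 * (the value reported by "ulimit -s", converted to bytes).
 * Returns 1 if the buffer is smaller than the limit or the limit is
 * unlimited, 0 if it is not, and -1 if the limit cannot be queried. */
static int row_fits_in_stack(size_t bytes)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return 1;
    return (rlim_t)bytes < rl.rlim_cur ? 1 : 0;
}
```

For the region in this ticket, row_bytes(1875705, 8) is 15005640 bytes, which
is larger than an 8 MiB (8388608 byte) stack.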

While I doubt that this will be a genuine issue for many people, it can
result in the upper limit on map dimensions being smaller than it could
be.

Consequently, we may want to think about whether the use of alloca()
should be optional. Currently, it's used on any system which is believed
to provide it (those which lack it use malloc/free). See the top of
include/defs/gis.h for the details.

If most users can live with the existing behaviour (an 8 MiB default stack
allows for just short of a million columns with DCELL data) and most of
the rest can live with explicitly increasing the stack size, it may
suffice to just add e.g. "#ifndef DONT_USE_ALLOCA" to allow the remainder
to override the behaviour at compile time.
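
A compile-time switch of that kind could look roughly like the sketch below.
Only the DONT_USE_ALLOCA name comes from the suggestion above; ROW_ALLOC,
ROW_FREE and sum_ramp are hypothetical names for illustration, and this is
not the actual definition of G_alloca() in include/defs/gis.h. It assumes a
platform that provides <alloca.h>, such as glibc.

```c
#include <assert.h>
#include <stdlib.h>

#ifndef DONT_USE_ALLOCA
#include <alloca.h>              /* assumption: available, as on glibc */
#define ROW_ALLOC(n) alloca(n)   /* stack: fast, failure is not reported */
#define ROW_FREE(p)  ((void)(p)) /* released automatically on return */
#else
#define ROW_ALLOC(n) malloc(n)   /* heap: returns NULL on failure */
#define ROW_FREE(p)  free(p)
#endif

/* Hypothetical example user: fill a temporary row with 0..cols-1 and
 * return the sum, or -1.0 if the buffer could not be allocated. The NULL
 * check is only meaningful when built with -DDONT_USE_ALLOCA. */
static double sum_ramp(size_t cols)
{
    double *row;
    double total = 0.0;
    size_t i;

    assert(cols > 0);
    row = ROW_ALLOC(cols * sizeof(double));
    if (row == NULL)
        return -1.0;
    for (i = 0; i < cols; i++)
        row[i] = (double)i;
    for (i = 0; i < cols; i++)
        total += row[i];
    ROW_FREE(row);
    return total;
}
```

Building with -DDONT_USE_ALLOCA selects the malloc()/free() pair without any
change to the calling code.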

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/2676#comment:4>

Comment (by dnewcomb):

Replying to [comment:4 glynn]:

Just to give an idea of the use case, I have done a land cover change
NDVI analysis based on 4-band, 1 m resolution digital aerial photography
for the state of NC, which I'm averaging to a coarser resolution. As image
data gets denser over the next couple of years, I can see more folks
running into the limits you describe.

Of course, one could always chop the data set into overlapping blocks,
process, and patch together later.
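
As a rough sketch of how the block boundaries might be computed along one
axis (tile_t, tile_count and tile_bounds are hypothetical names, not part of
GRASS): each block is read with a halo of size/2 cells on either side, which
is 14 cells for size=29, so that windows centred inside the block see the
same neighbours as in the full raster, and only the inner part is kept when
patching. Ranges are half-open, [start, end).

```c
#include <assert.h>

/* Column (or row) ranges for one block, half-open [start, end). */
typedef struct {
    long read_start, read_end;   /* extent to process, including halo */
    long write_start, write_end; /* extent to keep when patching */
} tile_t;

/* Number of blocks needed to cover `total` cells with blocks of `tile`. */
static long tile_count(long total, long tile)
{
    assert(total > 0 && tile > 0);
    return (total + tile - 1) / tile;
}

/* Bounds of block `index`, with the halo clipped at the raster edges. */
static tile_t tile_bounds(long total, long tile, long halo, long index)
{
    tile_t t;

    assert(halo >= 0 && index >= 0 && index < tile_count(total, tile));
    t.write_start = index * tile;
    t.write_end = t.write_start + tile;
    if (t.write_end > total)
        t.write_end = total;
    t.read_start = t.write_start - halo;
    if (t.read_start < 0)
        t.read_start = 0;
    t.read_end = t.write_end + halo;
    if (t.read_end > total)
        t.read_end = total;
    return t;
}
```

For the 830958 columns in this ticket and an arbitrarily chosen block width
of 100000, this gives nine blocks, each read 14 columns wider on every side
that is not a raster edge.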

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/2676#comment:5>