[GRASS-dev] Addons: r.sun2 segfault

Working on a large DEM, r.sun2 (addons) segfaults for me:

GRASS 6.4.svn (patUTM32):~ > g.region -p
projection: 1 (UTM)
zone: 32
datum: wgs84
ellipsoid: wgs84
north: 5157086
south: 5059541
west: 612488
east: 730098
nsres: 5
ewres: 5
rows: 19509
cols: 23522
cells: 458890698

GRASS 6.4.svn (patUTM32):~ > r.sun2 -s pat_dtm_5m horizon=horangle
horizonstep=15 \
     aspin=pat_dtm_5m.as slopein=pat_dtm_5m.sl day=180
insol_time=photoperiodo_d180
Mode 2: integrated daily irradiation
Segmentation fault

GRASS 6.4.svn (patUTM32):~ > gdb r.sun2
GNU gdb Red Hat Linux (6.5-37.el5_2.2rh)
...
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host
libthread_db library "/lib64/libthread_db.so.1".

(gdb) r -s pat_dtm_5m horizon=horangle horizonstep=15
aspin=pat_dtm_5m.as slopein=pat_dtm_5m.sl day=180
insol_time=photoperiodo_d180
Starting program: /home/neteler/binaries/grass-6.4.svn/bin/r.sun2 -s
pat_dtm_5m horizon=horangle horizonstep=15 aspin=pat_dtm_5m.as
slopein=pat_dtm_5m.sl day=180 insol_time=photoperiodo_d180
[Thread debugging using libthread_db enabled]
Mode 2: integrated daily irradiation
[New Thread 47231340984336 (LWP 13269)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 47231340984336 (LWP 13269)]
INPUT_part (offset=0, zmax=0x7fffc45840a8) at main.c:965
965 horizonpointer[i] = (char)(rint(SCALING_FACTOR *
(gdb) where
#0 INPUT_part (offset=0, zmax=0x7fffc45840a8) at main.c:965
#1 0x0000000000405d6e in calculate (singleSlope=<value optimized
out>, singleAspect=<value optimized out>,
    singleAlbedo=<value optimized out>, singleLinke=<value optimized
out>, gridGeom=
      {xp = 2.3335382984350421e-310, yp = 2.3335374857606904e-310, xx0
= 6.9533063593499383e-310, yy0 = 2.3335374857180031e-310, xg0 =
2.0760920055667309e-317, yg0 = 4.4465908125712189e-323, stepx = 5,
stepy = 5, deltx = 117610, delty = 97545, stepxy = 5, sinlat =
2.3335382752661301e-310, coslat = 1.1670564030398969e-312}) at
main.c:1866
#2 0x00000000004071b1 in main (argc=9, argv=0x7fffc4584428) at main.c:745
(gdb)

As so often, I get lost in the gdb output :slight_smile:
Any clue?

http://trac.osgeo.org/grass/browser/grass-addons/raster/r.sun_horizon/r.sun2

thanks,
Markus

Markus Neteler wrote:

Working on a large DEM, r.sun2 (addons) segfaults for me:

#0 INPUT_part (offset=0, zmax=0x7fffc45840a8) at main.c:965
#1 0x0000000000405d6e in calculate (singleSlope=<value optimized out>, singleAspect=<value optimized out>,
    singleAlbedo=<value optimized out>, singleLinke=<value optimized out>,

As so often, I get lost in the gdb output :slight_smile:

Start by compiling without optimisation. If optimisation is enabled,
you still get a backtrace, but real debugging becomes practically
impossible, as the compiled code typically bears little or no
resemblance to the source code.

However, looking at the numbers:

horizonarray = malloc(arrayNumInt * numRows * n);

m = cellhd.rows = 19509
n = cellhd.cols = 23522
horizonStep = 15
arrayNumInt = 360 / horizonStep = 360/15 = 24
numPartitions = 1
numRows = m / numPartitions = 19509 / 1 = 19509
arrayNumInt * numRows * n = 24 * 19509 * 23522 = 11,013,376,752

So the horizon array works out at ~11GB. Personally, I would start by
checking that horizonarray is non-null (note that it uses malloc(),
not G_malloc(), so there's no error checking).

But more generally, there are so many things wrong with that code that
dealing with specific bugs is just papering over the cracks.

--
Glynn Clements <glynn@gclements.plus.com>

On Tue, Oct 21, 2008 at 11:54 PM, Glynn Clements
<glynn@gclements.plus.com> wrote:

Markus Neteler wrote:

Working on a large DEM, r.sun2 (addons) segfaults for me:

#0 INPUT_part (offset=0, zmax=0x7fffc45840a8) at main.c:965
#1 0x0000000000405d6e in calculate (singleSlope=<value optimized out>, singleAspect=<value optimized out>,
    singleAlbedo=<value optimized out>, singleLinke=<value optimized out>,

As so often, I get lost in the gdb output :slight_smile:

Start by compiling without optimisation. If optimisation is enabled,

ouch - sorry about that. It's the version which I use on a cluster,
I forgot to disable optimisation.

you still get a backtrace, but real debugging becomes practically
impossible, as the compiled code typically bears little or no
resemblance to the source code.

Sure - with no optimisation and G_malloc() I get a
ERROR: G_malloc: out of memory

(easier to understand...)

However, looking at the numbers:

horizonarray = malloc(arrayNumInt * numRows * n);

m = cellhd.rows = 19509
n = cellhd.cols = 23522
horizonStep = 15
arrayNumInt = 360 / horizonStep = 360/15 = 24
numPartitions = 1
numRows = m / numPartitions = 19509 / 1 = 19509
arrayNumInt * numRows * n = 24 * 19509 * 23522 = 11,013,376,752

So the horizon array works out at ~11GB.

The machine has 8GB RAM + 2GB swap.

Personally, I would start by
checking that horizonarray is non-null (note that it uses malloc(),
not G_malloc(), so there's no error checking).

Running it on a different machine with 16GB RAM it starts and
then crashed at 0.x%:

(gdb) r -s pat_dtm_5m horizon=horangle horizonstep=15
aspin=pat_dtm_5m.as slopein=pat_dtm_5m.sl day=180
insol_time=photoperiodo_d180
Starting program: /home/neteler/binaries/grass-6.4.svn/bin/r.sun2 -s
pat_dtm_5m horizon=horangle horizonstep=15 aspin=pat_dtm_5m.as
slopein=pat_dtm_5m.sl day=180 insol_time=photoperiodo_d180
[Thread debugging using libthread_db enabled]
Mode 2: integrated daily irradiation
[New Thread 47358322518032 (LWP 1627)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 47358322518032 (LWP 1627)]
0x0000000000404fb8 in INPUT_part (offset=0, zmax=0x7fff33a95a38) at main.c:979
979 horizonpointer[i] = (char)(rint(SCALING_FACTOR *
(gdb) bt
#0 0x0000000000404fb8 in INPUT_part (offset=0, zmax=0x7fff33a95a38)
at main.c:979
#1 0x00000000004076ac in calculate
(singleSlope=3.6565798448710657e-319,
singleAspect=1.90389130408995e-314,
    singleAlbedo=0.20000000000000001, singleLinke=3, gridGeom=
      {xp = 6.9531864307740289e-310, yp = 1.0909214257054157e-312, xx0
= 0, yy0 = 2.3398120197805848e-310, xg0 = 2.339812019231579e-310, yg0
= 2.0760929936980226e-317, stepx = 5, stepy = 5, deltx = 117610, delty
= 97545, stepxy = 5, sinlat = 2.3398112026541087e-310, coslat =
6.9531864307767956e-310}) at main.c:1880
#2 0x0000000000404457 in main (argc=9, argv=0x7fff33a95df8) at main.c:759
(gdb)

Running "top" in parallel, I see that no more than 23% of RAM are used
when it crashes.

Adding
  -m Use the low-memory version of the program
does not help.

Markus

Markus Neteler wrote:

Running it on a different machine with 16GB RAM it starts and
then crashed at 0.x%:

(gdb) r -s pat_dtm_5m horizon=horangle horizonstep=15
aspin=pat_dtm_5m.as slopein=pat_dtm_5m.sl day=180
insol_time=photoperiodo_d180
Starting program: /home/neteler/binaries/grass-6.4.svn/bin/r.sun2 -s
pat_dtm_5m horizon=horangle horizonstep=15 aspin=pat_dtm_5m.as
slopein=pat_dtm_5m.sl day=180 insol_time=photoperiodo_d180
[Thread debugging using libthread_db enabled]
Mode 2: integrated daily irradiation
[New Thread 47358322518032 (LWP 1627)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 47358322518032 (LWP 1627)]
0x0000000000404fb8 in INPUT_part (offset=0, zmax=0x7fff33a95a38) at main.c:979
979 horizonpointer[i] = (char)(rint(SCALING_FACTOR *
(gdb) bt
#0 0x0000000000404fb8 in INPUT_part (offset=0, zmax=0x7fff33a95a38)
at main.c:979
#1 0x00000000004076ac in calculate
(singleSlope=3.6565798448710657e-319,
singleAspect=1.90389130408995e-314,
    singleAlbedo=0.20000000000000001, singleLinke=3, gridGeom=
      {xp = 6.9531864307740289e-310, yp = 1.0909214257054157e-312, xx0
= 0, yy0 = 2.3398120197805848e-310, xg0 = 2.339812019231579e-310, yg0
= 2.0760929936980226e-317, stepx = 5, stepy = 5, deltx = 117610, delty
= 97545, stepxy = 5, sinlat = 2.3398112026541087e-310, coslat =
6.9531864307767956e-310}) at main.c:1880
#2 0x0000000000404457 in main (argc=9, argv=0x7fff33a95df8) at main.c:759
(gdb)

      for (row = m - offset - 1; row >= finalRow; row--) {

    row_rev = m - row - 1;
    rowrevoffset = row_rev - offset;

    horizonpointer = horizonarray + arrayNumInt * n * rowrevoffset;

The worst-case situation is when row == 0:

  m = cellhd.rows = 19509
  n = cellhd.cols = 23522
  arrayNumInt = 360 / horizonStep = 360/15 = 24
  row_rev = m - row - 1 = 19509 - 0 - 1 = 19508 (for row==0)
  rowrevoffset = row_rev - offset = 19508 - 0 = 19508

  arrayNumInt * n * rowrevoffset = 24 * 23522 * 19508 = 11,012,812,224

arrayNumInt, n and rowrevoffset are all "int"s, which is presumably
only 32 bits. After 2GiB, the calculation will wrap and produce large
negative values.

If this is the problem, the immediate segfault can be fixed by casting
to a wider type, e.g.:

    horizonpointer = horizonarray + (ssize_t) arrayNumInt * n * rowrevoffset;

But the code could be littered with such issues (calculating offsets
which won't fit into an "int").

Most of GRASS is immune to this issue by virtue of only holding a
small number of rows in memory at any given time. But any module which
stores an entire map (or, more generally, stores data proportional to
rows*cols) is at risk of having similar issues with large maps.

--
Glynn Clements <glynn@gclements.plus.com>

On Wed, Oct 22, 2008 at 4:31 PM, Glynn Clements
<glynn@gclements.plus.com> wrote:

           for (row = m - offset - 1; row >= finalRow; row--) {

               row_rev = m - row - 1;
               rowrevoffset = row_rev - offset;

               horizonpointer = horizonarray + arrayNumInt * n * rowrevoffset;

The worst-case situation is when row == 0:

       m = cellhd.rows = 19509
       n = cellhd.cols = 23522
       arrayNumInt = 360 / horizonStep = 360/15 = 24
       row_rev = m - row - 1 = 19509 - 0 - 1 = 19508 (for row==0)
       rowrevoffset = row_rev - offset = 19508 - 0 = 19508

       arrayNumInt * n * rowrevoffset = 24 * 23522 * 19508 = 11,012,812,224

arrayNumInt, n and rowrevoffset are all "int"s, which is presumably
only 32 bits. After 2GiB, the calculation will wrap and produce large
negative values.

If this is the problem, the immediate segfault can be fixed by casting
to a wider type, e.g.:

               horizonpointer = horizonarray + (ssize_t) arrayNumInt * n * rowrevoffset;

But the code could be littered with such issues (calculating offsets
which won't fit into an "int").

The suggested change seems to help (running on a 16GB RAM box now):

top - 17:21:14 up 2 days, 3:00, 1 user, load average: 1.00, 0.94, 0.60
Tasks: 166 total, 1 running, 165 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.7%us, 0.0%sy, 0.0%ni, 87.2%id, 9.7%wa, 0.0%hi, 0.4%si, 0.0%st
Mem: 16439276k total, 16161516k used, 277760k free, 131824k buffers
Swap: 0k total, 0k used, 0k free, 2974596k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16788 neteler 18 0 17.1g 12g 2772 D 22 77.9 2:55.57 r.sun2
...

Most of GRASS is immune to this issue by virtue of only holding a
small number of rows in memory at any given time. But any module which
stores an entire map (or, more generally, stores data proportional to
rows*cols) is at risk of having similar issues with large maps.

Once finished (still at 0%), I'll report back to decide if we want to submit
this change to SVN.

Thanks so far,
Markus