[GRASS-dev] r.surf.contour inefficiency?

Hi,

while running r.surf.contour I notice that 33% of the CPU is dealing with
"sys", leaving ~65% to run the program. With r.surf.contour using 97% of
the CPU, top says:

Cpu(s): 0.3%us, 33.1%sy, 64.9%ni, 0.0%id, ...

(nice'd because I run GRASS at low priority to keep processing from
slowing down desktop tasks while number crunching)

there is no big disk I/O going on -- is using that much "sys" CPU
indicative of some massive inefficiency in the program? It's a module
which has historically taken a ridiculously long time to run and I don't
understand why the kernel needs to be involved so much.
(Linux 2.6.22, 1.5GB+ free memory)

ideas?

Hamish

Hamish wrote:

> while running r.surf.contour I notice that 33% of the CPU is dealing with
> "sys", leaving ~65% to run the program. With r.surf.contour using 97% of
> the CPU, top says:
>
> Cpu(s): 0.3%us, 33.1%sy, 64.9%ni, 0.0%id, ...
>
> (nice'd because I run GRASS at low priority to keep processing from
> slowing down desktop tasks while number crunching)
>
> there is no big disk I/O going on -- is using that much "sys" CPU
> indicative of some massive inefficiency in the program? It's a module
> which has historically taken a ridiculously long time to run and I don't
> understand why the kernel needs to be involved so much.
> (Linux 2.6.22, 1.5GB+ free memory)
>
> ideas?

I presume that it's the use of the segment library causing a lot of
I/O.

Can r.surf.contour reasonably be used on gigabyte-sized maps? If not,
I'd suggest eliminating the use of the segment library.

In general, there isn't much point using segmentation with algorithms
whose time complexity is worse than linear in the size of the input.
On modern systems, the run time will typically become prohibitive
before memory usage does.

If it can reasonably be used on maps which won't necessarily fit into
memory, try increasing the number of segments in memory (the last
parameter to cseg_open(), currently hardcoded to 8).

Even if that reduces the sys time, segment_get() is still vastly
slower than array access. If segmentation is useful, the code in
r.proj[.seg] should be significantly more efficient, due to the use of
macros, power-of-two sizes, and compile-time constants.

--
Glynn Clements <glynn@gclements.plus.com>

Hamish wrote:
> while running r.surf.contour I notice that 33% of the cpu is dealing
> with "sys" leaving ~ 65% to run the program.

....

> there is no big disk I/O going on -- is using that much "sys" CPU
> indicative of some massive inefficiency in the program? It's a module
> which has historically taken a ridiculously long time to run and I
> don't understand why the kernel needs to be involved so much.

Glynn:

> I presume that it's the use of the segment library causing a lot of I/O.
>
> Can r.surf.contour reasonably be used on gigabyte-sized maps?

currently it is so slow that a run on such a big map would not complete
in a reasonable time.

> If not, I'd suggest eliminating the use of the segment library.
>
> In general, there isn't much point using segmentation with algorithms
> whose time complexity is worse than linear in the size of the input.
> On modern systems, the run time will typically become prohibitive
> before memory usage does.
>
> If it can reasonably be used on maps which won't necessarily fit into
> memory, try increasing the number of segments in memory (the last
> parameter to cseg_open(), currently hardcoded to 8).
>
> Even if that reduces the sys time, segment_get() is still vastly
> slower than array access. If segmentation is useful, the code in
> r.proj[.seg] should be significantly more efficient, due to the use of
> macros, power-of-two sizes, and compile-time constants.

Following r.cost @ 100%, I have changed cseg_open(16, 16, 8) to
cseg_open(256, 256, 64) locally. Now "sys" CPU time is greatly reduced:

Cpu(s): 1.3%us, 0.3%sy, 96.7%ni, 0.0%id,

and memory use by the module is 25 MB for a 2700x1090 region.

it's running now, we'll see how much faster it goes.

of course r.surf.contour has other big problems, like the fact that it
only outputs CELL maps, so you have to multiply your FCELL/DCELL input
DEM by 10000 before running and divide by the same afterwards to preserve
sub-meter elevation, and (IIRC) if you have a "0" contour it is treated
as NULL. So: 'r.mapcalc surf_input=(dem*10000)+1'
Nonetheless it really does a very nice job WRT the output, so it can be
worth the pain.

thanks,
Hamish

Hamish wrote:

> Following r.cost @ 100%, I have changed cseg_open(16, 16, 8) to
> cseg_open(256, 256, 64) locally.

FWIW, I found that r.watershed.seg is fastest with the equivalent of
cseg_open(200, 200, <variable number of open segments>). It seems logical
to use powers of 2, but in r.watershed.seg 200, 200 is faster than both
128, 128 and 256, 256, and 512, 512 is much slower again; there is a
performance peak somewhere around 200, 200 for r.watershed.seg on my
system. The segment library has strange dynamics, and I have no idea
what determines them, especially this performance peak.

Markus M

Markus Metz wrote:

> > Following r.cost @ 100%, I have changed cseg_open(16, 16, 8) to
> > cseg_open(256, 256, 64) locally.

> FWIW, I found that r.watershed.seg is fastest with the equivalent of
> cseg_open(200, 200, <variable number of open segments>). It seems
> logical to use powers of 2, but in r.watershed.seg 200, 200 is faster
> than both 128, 128 and 256, 256.

Using powers of two only provides a benefit if the code actually takes
account of this restriction; the segment library doesn't:

  int segment_address(const SEGMENT * SEG, int row, int col, int *n, int *index)
  {
      *n = row / SEG->srows * SEG->spr + col / SEG->scols;
      *index = (row % SEG->srows * SEG->scols + col % SEG->scols) * SEG->len;

      return 0;
  }

OTOH, r.proj uses a fixed 64x64 block size, and the "get" operation is
a macro. Other sizes would work, but there is an efficiency gain for
fixing the size at compile time.

But, if the nature of the algorithm is such that processing time
becomes a problem long before memory does, then there's no point to
using any form of segmentation.

--
Glynn Clements <glynn@gclements.plus.com>

Hamish wrote:

> > I presume that it's the use of the segment library causing a lot of I/O.
> >
> > Can r.surf.contour reasonably be used on gigabyte-sized maps?
>
> currently it is so slow that a run on such a big map would not complete
> in a reasonable time.

I'll remove the segmentation code in 7.0.

--
Glynn Clements <glynn@gclements.plus.com>

On Sun, Feb 8, 2009 at 8:26 AM, Glynn Clements <glynn@gclements.plus.com> wrote:

> I'll remove the segmentation code in 7.0.

Here the timing comparison:

# NC data set
g.region vect=elev_lid792_cont1m res=1 -p
v.to.rast elev_lid792_cont1m out=elev_lid792_cont1m use=z
time r.surf.contour elev_lid792_cont1m out=dem

Results:
* GRASS 6.5.svn (pre seg size update):
147.66user
24.14system
2:52.22elapsed
184inputs+4544outputs (1major+508minor) pagefaults 0swaps

* GRASS 6.5.svn (post seg size update):
118.28user
0.26system
1:58.93elapsed
0inputs+4984outputs (0major+1202minor)pagefaults 0swaps

* GRASS 7.svn:
72.19user
0.04system
1:12.38elapsed
0inputs+320outputs (0major+1023minor) pagefaults 0swaps

r.univar doesn't show differences between 6.5.svn (post seg size update)
and 7.svn.
Is there risk in backporting this?

Markus

Markus Neteler wrote:

> > I'll remove the segmentation code in 7.0.
>
> Here the timing comparison:
>
> # NC data set
> g.region vect=elev_lid792_cont1m res=1 -p
> v.to.rast elev_lid792_cont1m out=elev_lid792_cont1m use=z
> time r.surf.contour elev_lid792_cont1m out=dem
>
> Results:
> * GRASS 6.5.svn (pre seg size update):
> 147.66user
> 24.14system
> 2:52.22elapsed
> 184inputs+4544outputs (1major+508minor) pagefaults 0swaps
>
> * GRASS 6.5.svn (post seg size update):
> 118.28user
> 0.26system
> 1:58.93elapsed
> 0inputs+4984outputs (0major+1202minor) pagefaults 0swaps
>
> * GRASS 7.svn:
> 72.19user
> 0.04system
> 1:12.38elapsed
> 0inputs+320outputs (0major+1023minor) pagefaults 0swaps
>
> r.univar doesn't show differences between 6.5.svn (post seg size update)
> and 7.svn.

How does r.univar come into this?

> Is there risk in backporting this?

The only risk is that I might have made a mistake somewhere. Mostly
the change just removes all references to the segment library and uses
an array instead.

[Note to self: remove segment library from Makefile.]

The nature of the change is such that it would be hard to make a
mistake that would only show up in specific circumstances. If a
straightforward test case works, it should be fine.

--
Glynn Clements <glynn@gclements.plus.com>

On Sun, Feb 8, 2009 at 5:25 PM, Glynn Clements <glynn@gclements.plus.com> wrote:

> Markus Neteler wrote:
>
> > > I'll remove the segmentation code in 7.0.
> >
> > Here the timing comparison:
> >
> > # NC data set
> > g.region vect=elev_lid792_cont1m res=1 -p
> > v.to.rast elev_lid792_cont1m out=elev_lid792_cont1m use=z
> > time r.surf.contour elev_lid792_cont1m out=dem
> >
> > Results:
> > * GRASS 6.5.svn (pre seg size update):
> > 147.66user
> > 24.14system
> > 2:52.22elapsed
> > 184inputs+4544outputs (1major+508minor) pagefaults 0swaps
> >
> > * GRASS 6.5.svn (post seg size update):
> > 118.28user
> > 0.26system
> > 1:58.93elapsed
> > 0inputs+4984outputs (0major+1202minor) pagefaults 0swaps
> >
> > * GRASS 7.svn:
> > 72.19user
> > 0.04system
> > 1:12.38elapsed
> > 0inputs+320outputs (0major+1023minor) pagefaults 0swaps
> >
> > r.univar doesn't show differences between 6.5.svn (post seg size update)
> > and 7.svn.
>
> How does r.univar come into this?

Forgot to mention that I used r.mapcalc to calculate the differences
between the 6.5 and 7 results.

> > Is there risk in backporting this?
>
> The only risk is that I might have made a mistake somewhere. Mostly
> the change just removes all references to the segment library and uses
> an array instead.
>
> [Note to self: remove segment library from Makefile.]
>
> The nature of the change is such that it would be hard to make a
> mistake that would only show up in specific circumstances. If a
> straightforward test case works, it should be fine.

Here it does. Some Mac tests would be good.

Markus

Glynn:

> But, if the nature of the algorithm is such that processing time
> becomes a problem long before memory does, then there's no point to
> using any form of segmentation.

as the module gets faster, the size of map you attempt to run it on
gets bigger and memory does start to become an issue...

FWIW the biggest actual-data region I've tried with r.surf.contour was
21000x13000 (this was a long while back, so god knows how long it took to
finish, could have been a week..*). If it stores it in memory in the input
map's CELL_TYPE, that'll take ~2.1 GB RAM by my numbers (DCELL). If stored
in memory as CELL, half that. I can accept that much, but it is pushing on
needing a 'percent of map to keep in memory' option.

[*] the great inefficiency in the algorithm is that it is Very slow in
areas where it has to search a long way to find non-NULL cells. If you
can minimize NULL cell areas (one way or another) it can be quite quick,
even for big regions. For my big job ISTR that I set ocean to -1 elevation
instead of NULL and had quite rugged terrain so contour density was thick.

Markus Metz:

> FWIW, I found that r.watershed.seg is fastest with the equivalent of
> cseg_open(200, 200, <variable number of open segments>). It seems
> logical to use powers of 2, but in r.watershed.seg 200, 200 is faster
> than both 128, 128 and 256, 256, and 512, 512 is much slower again;
> there is a performance peak somewhere around 200, 200 for
> r.watershed.seg on my system. The segment library has strange dynamics,
> and I have no idea what determines them, especially this performance
> peak.

yesterday I added a command line option for segment size & number, and
ran some trials:
SEG_SIZES="16 32 60 64 100 120 128 150 200 250 256 300 432 500 512 1000 1028 2000 2024"
NUM_SEGS="4 8 16 25 32 50 64 100 128 256"

#spearfish
g.region rast=elevation.10m
#rows: 1398
#cols: 1899
r.contour input="elevation.10m" output="elev_contours_10m" step=10
v.to.rast in=elev_contours_10m out=elev_contours.10m use=z

for ... ; do
  echo "$segsize x $nsegs :"
  time r.surf.contour ...
done

512x512 segments x16 came out best, but basically they were all pretty
similar as long as they were not so small as to be choking the system
(as the original 16x16x8 was). 512x512x16 uses about 19 MB of RAM for
the process.

I would guess that the best tuning is somewhat dependent on the size
and shape of the input data.

Glynn:

> I'll remove the segmentation code in 7.0.

MarkusN tests:
[6.5prev: 171.8 sec total usr+sys;
  6.5curr: 118.5 sec;
  7.0curr: 72.2 sec]

nice!

MarkusN on Linux (old):
147.66 user
  24.14 system

Michael on Mac (old):
  user 2m00.088s
  sys 1m42.810s

Interesting that OS X spends so much more time in the kernel than Linux.
So the improvement will have roughly twice the impact for Macs.

Glynn:

> How does r.univar come into this?

Markus:

> Forgot to mention that I used r.mapcalc to calculate the differences
> between the 6.5 and 7 results.

I did the same in my tests knowing that it /shouldn't/ have an effect.
If nothing else it's a good fail-safe sanity check. e.g. NULL support
in 6.5 is currently broken; if r.surf.contour outputted maps containing
NULLs it probably would have picked that up.

test-suites are good.

Hamish

Hamish wrote:

> > But, if the nature of the algorithm is such that processing time
> > becomes a problem long before memory does, then there's no point to
> > using any form of segmentation.

> as the module gets faster, the size of map you attempt to run it on
> gets bigger and memory does start to become an issue...
>
> FWIW the biggest actual-data region I've tried with r.surf.contour was
> 21000x13000 (this was a long while back, so god knows how long it took to
> finish, could have been a week..*). If it stores it in memory in the input
> map's CELL_TYPE, that'll take ~2.1 GB RAM by my numbers (DCELL). If stored
> in memory as CELL, half that. I can accept that much, but it is pushing on
> needing a 'percent of map to keep in memory' option.

If you're running jobs which take days, I don't think that it's
unrealistic to require a manual tile/process/patch. It's quite likely
that you will want to do that anyhow in order to split the task over
multiple systems.

> I did the same in my tests knowing that it /shouldn't/ have an effect.
> If nothing else it's a good fail-safe sanity check. e.g. NULL support
> in 6.5 is currently broken; if r.surf.contour outputted maps containing
> NULLs it probably would have picked that up.

r.surf.contour is one of the few modules still using zero-is-null,
i.e. G_get_map_row() rather than G_get_c_raster_row().

--
Glynn Clements <glynn@gclements.plus.com>

Glynn wrote:

> If you're running jobs which take days, I don't think that it's
> unrealistic to require a manual tile/process/patch. It's quite likely
> that you will want to do that anyhow in order to split the task over
> multiple systems.

mmph. For me the pain threshold is usually if the job is not done by the
time I return on Monday morning. But computers get faster & faster...

Hamish:

> > If nothing else it's a good fail-safe sanity check. e.g. NULL support
> > in 6.5 is currently broken; if r.surf.contour outputted maps containing
> > NULLs it probably would have picked that up.
>
> r.surf.contour is one of the few modules still using zero-is-null,
> i.e. G_get_map_row() rather than G_get_c_raster_row().

that is true, but not what I meant: (trac #392)

G65> r.mapcalc 'nullmap = null()'
100%

G65> r.info -r nullmap
min=-2147483648
max=-2147483648

G65> r.univar nullmap
100%
total null and non-null cells: 2654802
total null cells: 0

Of the non-null cells:
----------------------
n: 2654802
minimum: 0
maximum: 0
range: 0
mean: 0
mean of absolute values: 0
standard deviation: 0
variance: 0
variation coefficient: nan %
sum: 0

:-(

Hamish