[GRASS-dev] Re: parallelizing GRASS modules

Michael wrote:

I lost the previous thread but wanted to respond
to your question about which modules might
benefit from speedup.
In our recursive landscape evolution module
(r.landscape.evol.py), the three GRASS modules that
take the most time are r.watershed, r.stats, and
r.walk, especially r.watershed and r.stats, since
we need to run these every model cycle.
The speedup of r.watershed of a few years back
made an enormous difference in our model run
times. But it is still time consuming on
landscapes with large numbers of cells. If
parallelization could speed this up, it would be
great. I'm not sure that r.stats can be
parallelized or not, but speedup would be helpful.

wrt r.walk: I would think to start with parallelizing r.cost, then porting
the method over to r.walk. Both modules use the segment library, so coding
it to process each segment in its own thread seems like the way to do that.
(and more generally formulate + document a method to parallelize things that use the segment library.
perhaps a simple proof-of-method module to do that should come first)
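To make that proof-of-method idea concrete, here is a minimal hedged sketch in plain C. It does not use the real GRASS segment library API; `process_segments()`, the flat raster array, and the row-band "segments" are all hypothetical stand-ins that just illustrate handing each independent segment to its own OpenMP thread. Compiled without -fopenmp the pragma is a no-op and the loop runs serially.

```c
/* Hypothetical proof-of-method: split a raster into row-band "segments"
 * and hand each one to its own thread.  NOT the real segment library
 * API; illustration only. */
void process_segments(double *raster, int rows, int cols, int seg_rows)
{
    int nsegs = (rows + seg_rows - 1) / seg_rows;
    int s;

    /* each segment touches a disjoint slice of the raster,
     * so the threads never race on the same cells */
#pragma omp parallel for schedule(dynamic)
    for (s = 0; s < nsegs; s++) {
        int r0 = s * seg_rows;
        int r1 = (r0 + seg_rows < rows) ? r0 + seg_rows : rows;
        int r, c;

        for (r = r0; r < r1; r++)
            for (c = 0; c < cols; c++)
                raster[r * cols + c] *= 2.0;  /* stand-in for real work */
    }
}
```

The key property a real module would have to guarantee is the same one the sketch relies on: each thread's segment is disjoint, so no locking is needed.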

wrt r.watershed: I guess we'd want a segment mode like the -m flag, but
keeping things in RAM instead of using disk swap? Segment mode does make
use of the segment library.. MarkusM?

wrt r.stats: I suspect it is mostly I/O bound so I don't know how much faster
it can be made, but I ran it through the valgrind profiler anyway just to see.
(see the wiki Bugs page for recipe)

22% of the time is spent in G_quant_get_cell_value()

13% of the time is spent in lib/gis/get_row.c's cell_values_double()

10% of the time is spent in r.stats/stats.c's update_cell_stats() updating a
hash table. may be possible to parallelize that.
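As a hedged sketch of one way that hash update might be parallelized (hypothetical names; a flat array histogram stands in for the real hash table in r.stats/stats.c): each thread counts into a private buffer, and the partial counts are merged at the end, so the only synchronization is one critical section per thread. Without -fopenmp this runs serially and is still correct.

```c
#include <stdlib.h>

/* Sketch only: per-thread private histograms, merged at the end.
 * "nbins" and the flat array are stand-ins for r.stats' hash table. */
void count_cells(const int *cells, long n, long *hist, int nbins)
{
#pragma omp parallel
    {
        long *local = calloc(nbins, sizeof(long));
        long i;
        int b;

#pragma omp for nowait
        for (i = 0; i < n; i++)
            local[cells[i]]++;

        /* merge this thread's partial counts into the shared histogram */
#pragma omp critical
        for (b = 0; b < nbins; b++)
            hist[b] += local[b];

        free(local);
    }
}
```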

if multiple input maps are used perhaps each could be processed in its own
thread, but I don't think you are doing that with LandDyn.

perhaps the way that r.stats is called/used by LandDyn could be refined? i.e.
are there a lot of unnecessary calculations going on just to get a simple
stat out which could be answered more efficiently in another way? (no idea,
but worth exploring)

Hamish

On Dec 3, 2011, at 3:38 PM, Hamish wrote:

Michael wrote:

I lost the previous thread but wanted to respond
to your question about which modules might
benefit from speedup.
In our recursive landscape evolution module
(r.landscape.evol.py), the three GRASS modules that
take the most time are r.watershed, r.stats, and
r.walk, especially r.watershed and r.stats, since
we need to run these every model cycle.
The speedup of r.watershed of a few years back
made an enormous difference in our model run
times. But it is still time consuming on
landscapes with large numbers of cells. If
parallelization could speed this up, it would be
great. I'm not sure that r.stats can be
parallelized or not, but speedup would be helpful.

wrt r.walk: I would think to start with parallelizing r.cost, then porting
the method over to r.walk. Both modules use the segment library, so coding
it to process each segment in its own thread seems like the way to do that.
(and more generally formulate + document a method to parallelize things that use the segment library.
perhaps a simple proof-of-method module to do that should come first)

I was hoping that this would be the answer. So improvements to the segment library should help both (and also v.surf.bspline and v.surf.rst??)

wrt r.watershed: I guess we'd want a segment mode like the -m flag, but
keeping things in RAM instead of using disk swap? Segment mode does make
use of the segment library.. MarkusM?

Sounds promising.

wrt r.stats: I suspect it is mostly I/O bound so I don't know how much faster
it can be made, but I ran it through the valgrind profiler anyway just to see.
(see the wiki Bugs page for recipe)

22% of the time is spent in G_quant_get_cell_value()

13% of the time is spent in lib/gis/get_row.c's cell_values_double()

10% of the time is spent in r.stats/stats.c's update_cell_stats() updating a
hash table. may be possible to parallelize that.

But not much gain it seems.

if multiple input maps are used perhaps each could be processed in its own
thread, but I don't think you are doing that with LandDyn.

You're right

perhaps the way that r.stats is called/used by LandDyn could be refined? i.e.
are there a lot of unnecessary calculations going on just to get a simple
stat out which could be answered more efficiently in another way? (no idea,
but worth exploring)

No help in that regard. There are other things we're doing like working through long lists that might be speeded up, but these are in Python and Java, not GRASS. So they can't be helped by parallelization.

This overall sounds encouraging, however.

Michael

Hamish

Michael:

So improvements to the segment library should
help both

[r.cost and r.walk]

Perhaps better stated as learning how to run the
segment library in parallel. The changes would be
in the modules AFAIU, not in the library per se.
(although perhaps they could be, or both the lib
and module could work together on it?)

(and also v.surf.bspline and v.surf.rst??)

v.surf.bspline does use GRASS's segment library,
but v.surf.rst quadtree segmentation is something
different.

note Soeren has recently parallelized the bit of
the gmath library which v.surf.bspline uses, which
for me made it 3.5x faster on 6 cores, but incurred
a 40% overhead penalty. Hopefully by parallelizing
the segments instead of the inner loops of the
gmath library we can get that overhead down near 0%
of the overall task.

also to consider is that the segment library was
created mainly as a way to keep RAM usage down on
big jobs. running all segments at once negates that
advantage. (but of course you can limit the number
of threads launched at run time using environment
variables if that is a problem)
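For what it's worth, a tiny sketch of querying that run-time cap. The reported count honours the standard OMP_NUM_THREADS environment variable, which is how a user could keep RAM usage bounded when every segment would otherwise run at once; guarded with _OPENMP so it also compiles serially (the fallback of 1 is the assumption there).

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Sketch: how many threads will OpenMP use?  Respects OMP_NUM_THREADS;
 * compiled without -fopenmp we fall back to 1 (serial). */
int max_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

e.g. running a module as `OMP_NUM_THREADS=2 r.cost ...` would cap it at two worker threads.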

Hamish

Michael wrote:

>> In our recursive landscape evolution module
>> (r.landscape.evol.py),

...

There are other things we're doing like working
through long lists that might be speeded up,
but these are in Python and Java, not GRASS.
So they can't be helped by parallelization.

For help in finding the hogs I've just added a
section to the wiki about profiling python scripts:
  http://grass.osgeo.org/wiki/GRASS_Debugging#Profiling_python_scripts

which just points you to this page:
  http://docs.python.org/library/profile.html

Hamish

Hamish wrote:

Michael wrote:

I lost the previous thread but wanted to respond
to your question about which modules might
benefit from speedup.
In our recursive landscape evolution module
(r.landscape.evol.py), the three GRASS modules that
take the most time are r.watershed, r.stats, and
r.walk, especially r.watershed and r.stats, since
we need to run these every model cycle.
The speedup of r.watershed of a few years back
made an enormous difference in our model run
times. But it is still time consuming on
landscapes with large numbers of cells. If
parallelization could speed this up, it would be
great. I'm not sure that r.stats can be
parallelized or not, but speedup would be helpful.

wrt r.walk: I would think to start with parallelizing r.cost, then porting
the method over to r.walk. Both modules use the segment library, so coding
it to process each segment in its own thread seems like the way to do that.

Note that r.terracost follows this approach. Unfortunately, 1) you
will have to cross segment boundaries at some stage to accumulate
costs, 2) the r.terracost implementation using disk swap mode is
unsuccessful in the sense that it is orders of magnitude slower than
r.cost in trunk, because r.terracost needs to cross segment boundaries.
With regard to the knight's move in r.cost, that method relies on the
fact that all directly adjacent cells are processed first.

(and more generally formulate + document a method to parallelize things that use the segment library.
perhaps a simple proof-of-method module to do that should come first)

The segment library in trunk is mostly IO bound, CPU overhead is very
low. Parallelizing the segment library could actually slow down the
system because several parts of a file are accessed in parallel that
can not be cached by the system. That's why the iostream lib does
strictly sequential reading. The segment library, although designed to
allow random access, should be used such that access is not too random
(sweep-line concept for searches as done by r.cost, r.walk,
r.watershed).

wrt r.watershed: I guess we'd want a segment mode like the -m flag, but
keeping things in RAM instead of using disk swap? Segment mode does make
use of the segment library.. MarkusM?

The costliest (most time consuming) parts of r.watershed are the A*
Search and flow accumulation. The A* Search can not really be
parallelized because the order in which cells are processed is of
vital importance for everything that follows in r.watershed. This
order could easily get messed up with parallelization. Processing the
8 neighbours of the current focus cell could be parallelized, but not
without rewriting parts of the A* Search component. Flow accumulation
could be parallelized using an approach similar to r.terraflow, but
this would increase disk space and memory requirements by a factor of
nearly 8. Alternatively, there may be potential to parallelize some of
the inner loops for (MFD) flow accumulation, but I guess that the
overhead introduced by parallelization is rather high and may not be
leveled out with more cores because there are only 8 neighbours to
each cell. The current approach for the segmented mode of r.watershed
is a compromise between keeping the size of intermediate data low and
reducing disk IO. The easiest way to speed up things is to get a
really fast hard drive and use that only for GRASS databases.

Sorry for the lack of enthusiasm. IMHO, code should be optimized first
before it is parallelized, otherwise you run ineffective code on
several cores. I'm sure there is still some potential for optimization
in the code base...

Markus M

wrt r.stats: I suspect it is mostly I/O bound so I don't know how much faster
it can be made, but I ran it through the valgrind profiler anyway just to see.
(see the wiki Bugs page for recipe)

22% of the time is spent in G_quant_get_cell_value()

13% of the time is spent in lib/gis/get_row.c's cell_values_double()

10% of the time is spent in r.stats/stats.c's update_cell_stats() updating a
hash table. may be possible to parallelize that.

if multiple input maps are used perhaps each could be processed in its own
thread, but I don't think you are doing that with LandDyn.

perhaps the way that r.stats is called/used by LandDyn could be refined? i.e.
are there a lot of unnecessary calculations going on just to get a simple
stat out which could be answered more efficiently in another way? (no idea,
but worth exploring)

Hamish
_______________________________________________
grass-dev mailing list
grass-dev@lists.osgeo.org
http://lists.osgeo.org/mailman/listinfo/grass-dev

Hamish:

> wrt r.walk: I would think to start with parallelizing r.cost, then
> porting the method over to r.walk. Both modules use the segment
> library, so coding it to process each segment in its own thread
> seems like the way to do that.

[I have since studied r.cost and the segment library more closely, and
now realize that the segment library is a custom page swap virtual array
to keep RAM low; as opposed to a more straight forward quad-tree-like,
rows=4096, or 'r.in.xyz percent=' style divide and conquer method]

Markus M:

The segment library in trunk is mostly IO bound, CPU overhead is very
low. Parallelizing the segment library could actually slow down the
system because several parts of a file are accessed in parallel that
can not be cached by the system. That's why the iostream lib does
strictly sequential reading. The segment library, although designed to
allow random access, should be used such that access is not too random

fwiw, I ran trunk's r.cost through valgrind+kcachegrind profiling:
(see grass mediawiki Bugs page for method)
50% of the module time was taken in segment_get()
25% in segment_address()
18% in segment_address_fast()
16% in get_lowest()
8% in segment_pagein()
7% in memcpy()

[r.watershed]

Alternatively, there may be potential to parallelize some of
the inner loops for (MFD) flow accumulation, but I guess that the
overhead introduced by parallelization is rather high and may not
be leveled out with more cores because there are only 8
neighbours to each cell.

threads are way more efficient than processes, and the compilers are
getting better all the time, but yes, there is a finite overhead hit
you take with each thread created & destroyed, so parallelizing outer
loops is much preferred over simple inner loops, and MarkusN's
mapset-per-job is still the most efficient way to get a huge batch job done.
.. as long as it's not an iterative task like Michael has; although I
guess he could run different starting condition cases serially on
different CPUs.

The easiest way to speed up things is to get a
really fast hard drive and use that only for GRASS
databases.

maybe I should add my poor-man's RAID setup & tuning notes to the grass
wiki :-)

Sorry for the lack of enthusiasm.

I do not expect everything to happen like magic, for me this is as much
of a learning exercise as anything else. From our 500 or so modules there
will be a few which will be easy wins, and many more which will not be.
But some wins are better than none.

IMHO, code should be optimized first before it is parallelized,
otherwise you run ineffective code on several cores.

shrug, that is still better than running ineffective code on a single
core, when the optimization may not happen for years...

The thing I like about OpenMP is that it is very non-invasive, so that
it is not a huge investment and refactoring needed to play with it, and
if it is not switched on it is as if no changes were made at all. So it
is easily ignored/dumped when it comes time to optimize. and the more we
look at the inner algorithms the more we might see ways to optimize better.
:-)

Right now I'd be happy with just being able to efficiently parallelize
inefficient things, if it can be done with minimal effort and minimal
changes.

I'm sure there is still some potential for optimization
in the code base...

I'd say there is huge potential. [*cough* v.db.select]

wrt v.surf.rst, I've been playing with its lib/gmath/lu.c LU decomposition
function. Can anyone figure a way to get the following to work? As far as
I can guess, a[i] in one thread is having a race with a[k] in another.

/* corrupts data but speeds it up a lot: */
#pragma omp parallel for private(i, k, sum) shared(j, a)
    for (i = 0; i < j; i++) {
        sum = a[i][j];

        /* on its own, this pragma works but hugely slows things down: */
        //#pragma omp parallel for schedule (static) private(k) shared(i, j) reduction(-:sum)
        for (k = 0; k < i; k++)
            sum -= a[i][k] * a[k][j];
        a[i][j] = sum;
    }

still need to investigate if we can wrap an already optimized and
parallelized BLAS/LAPACK/ATLAS function to replace that, if present.
(even a 20% slower algorithm well parallelized will be <=60% quicker on a
dual-core and <=540% quicker on an 8-core CPU)
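On the race itself: the outer i loop carries a true dependency. Iteration i reads a[k][j] for every k < i, and those entries are written (a[k][j] = sum) by earlier iterations of the same loop, so splitting the i loop across threads corrupts the data. The inner k loop, however, is a plain dot product and can be a reduction. A hedged sketch of just that inner update (the function name `column_update` is hypothetical; a[] is only read here), which without -fopenmp runs serially:

```c
/* Sketch: the inner loop of the Crout column update as an OpenMP
 * reduction.  Safe because a[] is only read and "sum" is a reduction
 * variable; whether it pays off depends on i being large enough to
 * amortize the thread start-up cost. */
double column_update(double **a, int i, int j)
{
    double sum = a[i][j];
    int k;

#pragma omp parallel for reduction(-:sum)
    for (k = 0; k < i; k++)
        sum -= a[i][k] * a[k][j];

    return sum;
}
```

This is exactly the commented-out second pragma above, which works but is slow for small i; the first pragma is the one that races.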

thanks,
Hamish

Hi,

IMHO, code should be optimized first before it is parallelized,
otherwise you run ineffective code on several cores.

shrug, that is still better than running ineffective code on a single
core, when the optimization may not happen for years...

The thing I like about OpenMP is that it is very non-invasive, so that
it is not a huge investment and refactoring needed to play with it, and
if it is not switched on it is as if no changes were made at all. So it
is easily ignored/dumped when it comes time to optimize. and the more we
look at the inner algorithms the more we might see ways to optimize better.
:-)

Right now I'd be happy with just being able to efficiently parallelize
inefficient things, if it can be done with minimal effort and minimal
changes.

OpenMP is "simple" to use in a program that has been designed from the
beginning to run in parallel. Parallelizing serial code is much
harder. In my experience it is harder than optimizing code, unless you
have simple matrix and vector operations.

I'm sure there is still some potential for optimization
in the code base...

I'd say there is huge potential. [*cough* v.db.select]

wrt v.surf.rst, I've been playing with its lib/gmath/lu.c LU decomposition
function. Can anyone figure a way to get the following to work? As far as
I can guess, a[i] in one thread is having a race with a[k] in another.

/* corrupts data but speeds it up a lot: */
#pragma omp parallel for private(i, k, sum) shared(j, a)
    for (i = 0; i < j; i++) {
        sum = a[i][j];

        /* on its own, this pragma works but hugely slows things down: */
        //#pragma omp parallel for schedule (static) private(k) shared(i, j) reduction(-:sum)
        for (k = 0; k < i; k++)
            sum -= a[i][k] * a[k][j];
        a[i][j] = sum;
    }

Hamish, please do not optimize the LU code as it should be removed (NR
licence issue). Have a look at the other direct solvers in the gmath
library and the ccmath library.
in solvers_direct.c:
* G_math_solver_cholesky
* G_math_solver_lu
* G_math_solver_gauss

These solvers are parallelized, but not optimized. The disadvantage of
parallelizing these solvers is the triangular nature of the matrix,
and therefore the thread creation overhead for at least one third of
the matrix computation. Hence the speedup for the LU, Gauss and
Cholesky solvers will always be worse, unless you design the code
to run more efficiently in parallel.

The NR (Numerical Recipes) solver is designed to run very fast in
serial, but I think it is very hard to parallelize.

I have not had a look at the ccmath LU solver yet.

still need to investigate if we can wrap an already optimized and paral-
ellized BLAS/LAPACK/ATLAS function to replace that, if present.
(even a 20% slower algorithm well parallelized will be <=60% quicker on a
dual-core and <=540% quicker on a 8-core cpu)

You need a solver using the same matrix storage concept as the
existing one, otherwise the memory consumption will double.

Have a look at SuperLU to process dense matrices efficiently in parallel.

Best regards
Soeren

thanks,
Hamish

Hamish:

> wrt v.surf.rst, I've been playing with its lib/gmath/lu.c LU
> decomposition function. can anyone figure a way to get the
> following to work?

[...]

Sören wrote:

Hamish, please do not optimize the LU code as it should be removed (NR
licence issue). Have a look at the other direct solvers in the gmath
library and the ccmath library.

I'm looking at this as a learning exercise as much as anything else, so
I don't really mind a bit of wasted effort. And if I can speed up
v.surf.rst today by adding a conditional one-liner somewhere, I think it
is worth my time. (In ~30 minutes of experimenting I already have it
completing ~25% faster; the just posted problem runs 50% faster but gives
garbage results)

in solvers_direct.c:
* G_math_solver_cholesky
* G_math_solver_lu
* G_math_solver_gauss

I peeked at G_math_lu_decomposition() but not G_math_solver_lu() yet.
It looks reasonably promising:
lu.c:int G_ludcmp(double **a, int n, int *indx, double *d)
solvers_direct.c:int G_math_solver_lu(double **A, double *x, double *b, int rows)

The NR (Numerical Recipes) solver is designed to run very
fast in serial, but I think it is very hard to parallelize.

fwiw, long ago the Numerical Recipes authors gave permission for their
code to be used in GRASS, but yes, we should work to remove it anyway.

I have not had a look at the ccmath LU solver yet.

...

Have a look at SuperLU to process dense matrices efficiently in parallel.

ok, more reading to do..

having looked at v.surf.rst a bit more, it seems to me the loop that
really wants to be parallelized is in lib/rst/interp_float/segmen2d.c
IL_interp_segments_2d(). the "cv" for loop starting on this line:

    for (skip_index = 0; skip_index < m_skip; skip_index++) {

calls matrix_create() ~256 times during the course of the module run, and
within that the LU decomposition function is already abstracted so easy
to swap out with something else. If each one of those matrix_create()s
could be run in their own thread there would be no need to parallelize
the deep linear algebra innards, and so no huge overhead to worry about
and numeric code designed to be serial could remain that way.
it is a bit more complicated, so may be a job for pthreads..?
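A hedged sketch of how that could look with OpenMP tasks rather than pthreads (OpenMP >= 3.0; `solve_all_segments()` and the done[] bookkeeping are hypothetical stand-ins for the per-segment matrix_create() work, which must be fully independent for this to be safe). Without -fopenmp the pragmas are ignored and the loop runs serially.

```c
/* Sketch: each independent per-segment solve becomes an OpenMP task.
 * One thread walks the loop and spawns tasks; the implicit barrier at
 * the end of the parallel region waits for all of them to finish. */
void solve_all_segments(int m_skip, int *done)
{
    int skip_index;

#pragma omp parallel
#pragma omp single
    for (skip_index = 0; skip_index < m_skip; skip_index++) {
#pragma omp task firstprivate(skip_index)
        done[skip_index] = 1;    /* stand-in for the per-segment solve */
    }
}
```

Tasks fit this case better than fixed `sections` because the number of segments is only known at run time.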

Hamish

On Mon, Dec 5, 2011 at 11:30 AM, Hamish <hamish_b@yahoo.com> wrote:

Hamish:

> wrt r.walk: I would think to start with parallelizing r.cost, then
> porting the method over to r.walk. Both modules use the segment
> library, so coding it to process each segment in its own thread
> seems like the way to do that.

[I have since studied r.cost and the segment library more closely, and
now realize that the segment library is a custom page swap virtual array
to keep RAM low; as opposed to a more straight forward quad-tree-like,
rows=4096, or 'r.in.xyz percent=' style divide and conquer method]

Markus M:

The segment library in trunk is mostly IO bound, CPU overhead is very
low. Parallelizing the segment library could actually slow down the
system because several parts of a file are accessed in parallel that
can not be cached by the system. That's why the iostream lib does
strictly sequential reading. The segment library, although designed to
allow random access, should be used such that access is not too random

fwiw, I ran trunk's r.cost through valgrind+kcachegrind profiling:
(see grass mediawiki Bugs page for method)
50% of the module time was taken in segment_get()
25% in segment_address()
18% in segment_address_fast()
16% in get_lowest()
8% in segment_pagein()
7% in memcpy()

What exactly did you want to profile? If you want to profile the
parts responsible for loading the input map(s) and writing the
output map(s) rather than the actual search part, you should use a cost
surface with lots of NULLs, otherwise a cost surface without NULLs.
The current region should have at least a few million cells, otherwise
I doubt that the results are meaningful. Separating disk IO (time
spent by read and write) from CPU load is also important here.

r.cost, like other modules, uses the segment library because the input
raster can not be processed row by row; any part of the input and
output raster must be accessible at any time while processing. One
possibility is to load everything to memory, but that's not so
user-friendly and goes against the fundamentals of GRASS coding,
which has its origin in the 70's and 80's. That is, the more crucial
performance tests should use region settings that would exceed the
available memory (see info printed at the beginning of r.cost --v)
if everything is loaded to memory. IOW, there may be other modules
that are easier to optimize using parallelization than the ones
using the segment library.

Attachment FYI

Markus M

[attachment: r.cost.png]

I'm looking at this as a learning exercise as much as anything else, so
I don't really mind a bit of wasted effort. And if I can speed up
v.surf.rst today by adding a conditional one-liner somewhere, I think it
is worth my time. (In ~30 minutes of experimenting I already have it
completing ~25% faster; the just posted problem runs 50% faster but gives
garbage results)

Unfortunately I must say, parallelizing a program which produces
garbage is no indicator of how much faster the program becomes once
it is correctly parallelized. The cause of the garbage is in most
cases race conditions. If you have race conditions in the code, you
need to restructure it. The Cholesky band matrix solver needed to be
restructured to run in parallel; luckily that was not so hard. IMHO
in case of the NR LU decomposition code, you really need to
understand what is going on to restructure it for parallelism.

[...]

The NR (Numerical Recipes) solver is designed to run very fast in
serial, but I think it is very hard to parallelize.

fwiw, long ago the Numerical Recipes authors gave permission for their
code to be used in GRASS, but yes, we should work to remove it anyway.

Well, I guess they didn't give their permission to distribute it under
the GPL2? That's actually what we are doing. Besides that, the code is
not commented or documented as NR code with a special license ... .

[...]

having looked at v.surf.rst a bit more, it seems to me the loop that
really wants to be parallelized is in lib/rst/interp_float/segmen2d.c
IL_interp_segments_2d(). the "cv" for loop starting on this line:

for (skip_index = 0; skip_index < m_skip; skip_index++) {

calls matrix_create() ~256 times during the course of the module run, and
within that the LU decomposition function is already abstracted so easy
to swap out with something else. If each one of those matrix_create()s
could be run in their own thread there would be no need to parallelize
the deep linear algebra innards, and so no huge overhead to worry about
and numeric code designed to be serial could remain that way.
it is a bit more complicated, so may be a job for pthreads..?

OpenMP supports parallel sections:

#pragma omp sections
{
#pragma omp section
    /* Code for section 1 */
#pragma omp section
    /* Code for section 2 */
}

Best regards
Soeren

Markus Metz wrote:

IOW, there may be other modules that are easier to optimize using
parallelization than the ones using the segment library.

what do you think about the recently optimized v.surf.bspline module?
Thanks to Sören the G_math_solver_cholesky_sband() is now parallelized,
which helps, but I think it would be more efficient to run each of the
slightly-overlapping subregions in their own threads, then wait for all
subregions to finish before continuing on. Technically the segment
library is involved, but the subregion bit of the code seems somewhat
separated from the segmenting parts.

fwiw I don't really consider parallelization to be "optimizing" anything,
rather just a way to throw more brute force at the problem.

Attachment FYI

Very impressive. The v.surf.icw addon script thanks you.

Hamish