[GRASS-dev] Interested in parallelization of GRASS

Hello,
I would like to work on parallelization of the GRASS libraries through implementation of OpenMP in thread-safe algorithms and modification of non-thread-safe algorithms to make them thread-safe if possible.
I have good experience in C++ (templates/STL, generic programming, tag-dispatching, and a basic idea of meta-programming). Also, I have worked with OpenMP to parallelize a time-averaged simulation.
I would like to know more about this project and apply under the GSoC program.

With Regards,
Vivek.

Hello,
as nobody has answered your mail, I will drop in my $0.02.

Most of current GRASS is designed with single-process, single-thread,
short-lived CLI applications in mind. Redesigning the core of GRASS to
take advantage of modern multi-core CPUs and/or nodes/GPUs would be a
nice thing. You can read about the current state of GRASS
parallelization on the wiki [1].

Keep in mind that GRASS is mostly written in C and there are no plans
to change the core language to anything else.

There are many areas that could benefit from thread-safe/parallel
processing: vector and raster reading/writing, OGSF etc.

As I'm not a (good) programmer, I can't help you more. Feel free to
explore the existing codebase [2] and documentation [3], [4].

Maris.

1. http://grass.osgeo.org/wiki/Parallel_GRASS_jobs
2. http://trac.osgeo.org/grass/browser/grass/trunk
3. http://grass.osgeo.org/programming7/
4. http://trac.osgeo.org/grass/

2011/3/21, Vivek Yadav <vicky.it.bhu@gmail.com>:

Hello,
I would like to work on parallelization of the GRASS libraries through
implementation of OpenMP in thread-safe algorithms and modification of
non-thread-safe algorithms to make them thread-safe if possible.
I have good experience in C++ (templates/STL, generic programming,
tag-dispatching, and a basic idea of meta-programming). Also, I have
worked with OpenMP to parallelize a time-averaged simulation.
I would like to know more about this project and apply under the GSoC
program.

With Regards,
Vivek.

Maris Nartiss wrote:

There are many areas that could benefit from thread-safe/parallel
processing: vector and raster reading/writing, OGSF etc.

Supporting concurrent reads on a single raster map would make the code
significantly more complex. Concurrent writes would be even worse
unless compression was disabled.

In 7.0, concurrent reads from or writes to different raster maps
should work (in 6.x, it's unsafe). This might allow for significant
parallelism in e.g. r.series, which often has a large number of input
maps.

--
Glynn Clements <glynn@gclements.plus.com>

Maris wrote:

> There are many areas that could benefit from thread-safe/
> parallel processing: vector and raster reading/writing, OGSF
> etc.

Glynn:

Supporting concurrent reads on a single raster map would
make the code significantly more complex. Concurrent writes
would be even worse unless compression was disabled.

In 7.0, concurrent reads from or writes to different raster
maps should work (in 6.x, it's unsafe). This might allow for
significant parallelism in e.g. r.series, which often has a
large number of input maps.

It seems to me that the biggest gains for the least effort come from
concentrating on the modules/library functions which are CPU bound,
not I/O bound.

While e.g. pthreads in r.mapcalc in GRASS 7 may give me a nice
10% speedup by separating reading and writing into two threads
[or is that splitting into I/O and maths?] (I haven't actually
benchmarked it; 10% is just a guess), and that is not a gift to
refuse, the reality seems to be that time-to-completion is still
dominated by I/O bottlenecks and saturating the bus, at which
point it doesn't matter how many threads or CPUs you have: you're
still limited by the speed of your bus/drive/RAID array.

Optimizing/parallelizing short-running tasks seems most
appropriate when they will be run repetitively in tight loops.
There's nothing wrong with speeding them up, but I'd rather chop
an hour-long process down to 10 minutes on my hex-core CPU than
get a 5-second one-off process to complete in 4.5 seconds.

The real targets are the ones that take hours, or at least give you
time to glance at the process monitor and think "I've got 2/4/6/8
cores here, but only one is being used; this is taking forever, argh!".

So I try to think of modules which are CPU bound. The first
task is to replace inefficient algorithms with better ones (e.g.
Glynn's r.cost work, Markus M's r.watershed work), and then, if
still needed, to split those tasks up with pthreads, OpenMP, or
OpenCL (whichever is the right horse for the course).

One thing I learned from Seth's work on r.sun (which I still
need to merge, bad me), was that you could instruct OpenCL to
send the job to $x CPU cores instead of an OpenCL-aware GPU.

My current understanding goes like this:
- pthreads: maybe best for things like splitting I/O and math into
  two threads.
- OpenMP: maybe best for things like sending a CPU-bound problem
  to 2/4/6/8 CPU cores.
- OpenCL: maybe best for ray-tracing type problems (or maybe just
  fine for more general use?)

*.surf.rst takes a while to run; v.in.ogr in GRASS 6 takes a
while to run (I haven't benchmarked it in GRASS 7; I know there
has been work done on that); r.surf.contour used to take days (but
worth the wait :); not sure if Markus M's updates there do
anything for that or not).

There may be a number of slow vector modules; I mostly work with
rasters, so I don't know them as well. It is worth noting that a
lot of the DB stuff is already run as a separate process, so there
is some advantage on a multi-core system there.

I'd be interested to hear which non-I/O-bound modules people spend
a lot of time waiting for. Also, are there any memory-bound
bottlenecks around?

2c,
Hamish

2011/3/26, Glynn Clements <glynn@gclements.plus.com>:

Supporting concurrent reads on a single raster map would make the code
significantly more complex. Concurrent writes would be even worse
unless compression was disabled.
--
Glynn Clements <glynn@gclements.plus.com>

And this brings us back to the question: is the current GRASS raster
storage the best one? Could GRASS benefit from moving to some
quad-tree-like storage? The current gislib doesn't expose internal
data structures to most of the analysis modules, and thus it
shouldn't require any module changes.
I know that for worst-case scenarios such an approach could provide
no speedup, or worse, be slower than the current implementation;
still, I would love to see solid numbers on different storage
approaches before saying "YES/NO" to alternatives.
Does anybody have a CS student interested in algorithms and data
storage problems?

Just 0.02 from non-CS person,
Maris.

Hamish wrote:

While e.g. pthreads in r.mapcalc in grass7 may give me a nice 10%
speedup by separating reading and writing into two threads [or is
that splitting into IO and maths?] (* I haven't actually benchmarked
it, 10% is just a guess),

The parallelism in 7.0's r.mapcalc consists of evaluating function
arguments concurrently.

Previously, evaluation of a function (or operator) meant:

  for each argument:
      evaluate argument
  evaluate function

In 7.0, this changed to:

  for each argument:
      commence evaluation of argument
  for each argument:
      wait for evaluation to complete
  evaluate function

For expressions with multiple input maps, the maps will be read
concurrently. Evaluation of the top-most operator and writing the
output map is performed by the main thread. If there are multiple
output maps, these are evaluated sequentially. Rows are evaluated
sequentially. Also, arguments to eval() are evaluated sequentially.
So, the amount of parallelism available is limited by the complexity
of the expression.

and that is not a gift to refuse, the
reality seems that the time-to-completion is still dominated by the
I/O bottlenecks and saturating the bus-- at which point it doesn't
matter how many threads or CPUs you have, you're still limited by
the speed of your bus/drive/RAID array.

I don't know how significant physical disc access really is. There's a
lot of overhead in Rast_{get,put}_row(). Even on a desktop system with
a fast CPU and an average hard drive, I wouldn't be surprised if a
simple get-row/put-row copy operation was CPU bound.

So I try to think of modules which are CPU bound.. the first
task is to replace inefficient algorithms with better ones (e.g.
Glynn's r.cost work,

Huh? I haven't done anything related to r.cost.

If you're thinking of r.grow.distance, that can't be used as a
substitute for r.cost except in the case of constant cost, as it
relies upon "distance" being monotonic with respect to x and y.

--
Glynn Clements <glynn@gclements.plus.com>

Maris Nartiss wrote:

> Supporting concurrent reads on a single raster map would make the code
> significantly more complex. Concurrent writes would be even worse
> unless compression was disabled.

And this brings us back to the question: is the current GRASS raster
storage the best one? Could GRASS benefit from moving to some
quad-tree-like storage? The current gislib doesn't expose internal
data structures to most of the analysis modules, and thus it
shouldn't require any module changes.
I know that for worst-case scenarios such an approach could provide
no speedup, or worse, be slower than the current implementation;
still, I would love to see solid numbers on different storage
approaches before saying "YES/NO" to alternatives.
Does anybody have a CS student interested in algorithms and data
storage problems?

The biggest advantage of the current format is that skipping entire
rows (when the region's vertical resolution is coarser than the map's)
is trivial.

If I was going to change anything about the raster format, it would be
to allow the data to be partitioned horizontally, so that we can avoid
reading and decompressing entire rows when the region's east-west
bounds are only a small portion of the map's.

For modules which need non-sequential access, I'd suggest replacing
the segment library with something more efficient, e.g. something like
the code from r.proj. Or even just a flat file which can be mmap()ed,
and require the use of a 64-bit platform if you want to run such
modules on more than ~3 GiB of data.

--
Glynn Clements <glynn@gclements.plus.com>

Glynn:

The biggest advantage of the current format is that skipping entire
rows (when the region's vertical resolution is coarser than the map's)
is trivial.

Note that it is also a great feature when the map only takes up a small
part of the working region, e.g. a stamp-sized map in the corner of a
big r.patch or display operation. It flies past the rows which are all
NULLs. (But it doesn't help much if the map is tall and thin or
diagonal and your region is square.)

If I was going to change anything about the raster format, it would be
to allow the data to be partitioned horizontally, so that we can avoid
reading and decompressing entire rows when the region's east-west
bounds are only a small portion of the map's.

An unrelated aside: since the old-old bug tracker days there has been
a wish to replace the i.rectify code with the same-origin GDALwarp API
library, which is much faster. Last Summer (of Code) Seth's main
project was to add OpenCL capability to gdalwarp (which also allows
you to use multiple CPU cores if you don't have a capable GPU). The
OpenCL r.sun work was tacked on to the end of that project.

For modules which need non-sequential access, I'd suggest replacing
the segment library with something more efficient, e.g. something like
the code from r.proj. Or even just a flat file which can be mmap()ed,
and require the use of a 64-bit platform if you want to run such
modules on more than ~3 GiB of data.

Although it's an obvious target, I didn't mention the segment library
in the parallelization suggestions, as I seemed to recall the
dissatisfaction with the current implementation.

It would be cool if a replacement for the segment library could be
(re)built from the ground up with parallelization in mind.

Hamish:

> So I try to think of modules which are CPU bound.. the first
> task is to replace inefficient algorithms with better ones (e.g.
> Glynn's r.cost work,

Huh? I haven't done anything related to r.cost.

Fixing the off-by-one bug in the segment library sped up r.cost by
over 50x in my tests (YMMV), sometime before 6.4.0RC1. Not technically
a fix to r.cost, but it sure is nice to have it go heaps faster.
I misspoke a bit; no inefficient algorithm was replaced there.

If you're thinking of r.grow.distance, that can't be used as a
substitute for r.cost except in the case of constant cost, as it
relies upon "distance" being monotonic with respect to x and y.

I still have to play with that, as it could be added as an optional
flag in v.surf.icw (from addons), where r.cost was always the slowest
step. That is a heavy user of r.mapcalc, so will be a good test for
pthreads too.

thanks,
Hamish

Hamish wrote:

Glynn:

The biggest advantage of the current format is that skipping entire
rows (when the region's vertical resolution is coarser than the map's)
is trivial.

Note that it is also a great feature when the map only takes up a small
part of the working region, e.g. a stamp-sized map in the corner of a
big r.patch or display operation. It flies past the rows which are all
NULLs. (But it doesn't help much if the map is tall and thin or
diagonal and your region is square.)

If I was going to change anything about the raster format, it would be
to allow the data to be partitioned horizontally, so that we can avoid
reading and decompressing entire rows when the region's east-west
bounds are only a small portion of the map's.

An unrelated aside: since the old-old bug tracker days there has been
a wish to replace the i.rectify code with the same-origin GDALwarp API
library, which is much faster. Last Summer (of Code) Seth's main
project was to add OpenCL capability to gdalwarp (which also allows
you to use multiple CPU cores if you don't have a capable GPU). The
OpenCL r.sun work was tacked on to the end of that project.

Note that r.proj, i.rectify, and i.ortho.rectify now use very similar
code, all use the caching code of r.proj. i.rectify is now quite a bit
faster. IOW, improvements to any of these three modules can be applied
to all three modules.

For modules which need non-sequential access, I'd suggest replacing
the segment library with something more efficient, e.g. something like
the code from r.proj. Or even just a flat file which can be mmap()ed,
and require the use of a 64-bit platform if you want to run such
modules on more than ~3 GiB of data.

Although it's an obvious target, I didn't mention the segment library
in the parallelization suggestions, as I seemed to recall the
dissatisfaction with the current implementation.

The segment lib implementation in trunk is considerably faster than in
6.x, but not as efficient as the caching method of r.proj. Some
reasons why the segment library is slower than the caching code in
r.proj are 1) write support in the segment lib, 2) flexible data
storage size in the segment lib, 3) flexible tile size in the segment
lib, 4) bad abuse of the segment library, e.g. in r.los.

Markus M

It would be cool if a replacement for the segment library could be
(re)built from the ground up with parallelization in mind.

Hamish:

> So I try to think of modules which are CPU bound.. the first
> task is to replace inefficient algorithms with better ones (e.g.
> Glynn's r.cost work,

Huh? I haven't done anything related to r.cost.

Fixing the off-by-one bug in the segment library sped up r.cost by
over 50x in my tests (YMMV), sometime before 6.4.0RC1. Not technically
a fix to r.cost, but it sure is nice to have it go heaps faster.
I misspoke a bit; no inefficient algorithm was replaced there.

Was this speed-up constant for different regions? In theory, this
speed-up should be larger for small regions and barely noticeable for
larger regions.

Markus M

If you're thinking of r.grow.distance, that can't be used as a
substitute for r.cost except in the case of constant cost, as it
relies upon "distance" being monotonic with respect to x and y.

I still have to play with that, as it could be added as an optional
flag in v.surf.icw (from addons), where r.cost was always the slowest
step. That is a heavy user of r.mapcalc, so will be a good test for
pthreads too.

thanks,
Hamish


Markus Metz wrote:

The segment lib implementation in trunk is considerably faster than in
6.x, but not as efficient as the caching method of r.proj. Some
reasons why the segment library is slower than the caching code in
r.proj are 1) write support in the segment lib, 2) flexible data
storage size in the segment lib, 3) flexible tile size in the segment
lib, 4) bad abuse of the segment library, e.g. in r.los.

The main factors for performance are whether the tile size is fixed at
compile time and whether it's known (at compile time) to be a power of
two.

It wouldn't be particularly hard to structure the r.proj tile cache
code such that the tile size can be set by the module, so long as it's
set at compile time. Allowing the tile size to be set at run time
incurs an inevitable performance penalty.

The power-of-two constraint needn't be enforced by the code. The
shifts and masks could be replaced by multiplication, division and
modulo; any modern compiler will optimise these back to shifts and
masks if the tile size is a compile-time constant which is a power of
two.

OTOH, I'm not sure there's any benefit to relaxing the requirement.
Some modules may be less efficient if the tile size is significantly
too small or too large, but being off by a factor of sqrt(2) shouldn't
be an issue.

--
Glynn Clements <glynn@gclements.plus.com>

Glynn Clements wrote:

Markus Metz wrote:

The segment lib implementation in trunk is considerably faster than in
6.x, but not as efficient as the caching method of r.proj. Some
reasons why the segment library is slower than the caching code in
r.proj are 1) write support in the segment lib, 2) flexible data
storage size in the segment lib, 3) flexible tile size in the segment
lib, 4) bad abuse of the segment library, e.g. in r.los.

The main factors for performance are whether the tile size is fixed at
compile time and whether it's known (at compile time) to be a power of
two.

It wouldn't be particularly hard to structure the r.proj tile cache
code such that the tile size can be set by the module, so long as it's
set at compile time. Allowing the tile size to be set at run time
incurs an inevitable performance penalty.

That means that the r.proj tile cache cannot go into a library
function, because libraries are compiled before modules, i.e. tile
size and data structure size are not known to the library at compile
time, correct? Thus the r.proj tile cache code would need to be
duplicated for each module, making maintenance more tedious. See e.g.
the different versions of crs.[c|h] floating around in the imagery
modules.

The power-of-two constraint needn't be enforced by the code. The
shifts and masks could be replaced by multiplication, division and
modulo; any modern compiler will optimise these back to shifts and
masks if the tile size is a compile-time constant which is a power of
two.

The segment library in G7 figures out during setup of the segment
structure if bit operations can be used. That gave me a good speed-up.

OTOH, I'm not sure there's any benefit to relaxing the requirement.
Some modules may be less efficient if the tile size is significantly
too small or too large, but being off by a factor of sqrt(2) shouldn't
be an issue.

The tile size used by r.cost, r.walk, and r.watershed in G7 is 64x64
throughout, as for the r.proj tile cache. That size seems to be good
for various different region sizes, with the exception of r.watershed,
where a sorted array is stored in a segmented structure, i.e. only one
row, many columns.

I have optimized the segment library a bit more (using an idea from
the r.proj tile cache). For a region with 66 million cells, r.watershed
in segmented mode with memory=2000 needs 66 minutes in 6.4 and 7m30s
in 7.0. The all-in-memory mode finishes in about 4m30s. That includes
optimizations of r.watershed.seg in 7, though. Anyway, the time
penalty of using the GRASS 7 segment library seems to be no longer
that serious.

I am not against a replacement of the segment library, but I would
like to keep any tile cache code as a library for easy code
maintenance. And I would like to have some flexibility, at least
support for tiles and arrays as well as custom data sizes beyond CELL,
FCELL, DCELL.

Markus M

Markus Metz wrote:

> It wouldn't be particularly hard to structure the r.proj tile cache
> code such that the tile size can be set by the module, so long as it's
> set at compile time. Allowing the tile size to be set at run time
> incurs an inevitable performance penalty.

That means that the r.proj tile cache cannot go into a library
function, because libraries are compiled before modules, i.e. tile
size and data structure size are not known to the library at compile
time, correct? Thus the r.proj tile cache code would need to be
duplicated for each module, making maintenance more tedious. See e.g.
the different versions of crs.[c|h] floating around in the imagery
modules.

Much of the r.proj tile cache is implemented by macros in the header
file, which are compiled as part of the module. That's the part where
the optimisations matter. The I/O code in the library can have the
tile size selected at run time.

IOW, modules would do e.g.:

  #define TILE_WIDTH 64
  #define TILE_HEIGHT 64
  #include <grass/tile_cache.h>
  ...
  Tile_Cache_Setup(..., TILE_WIDTH, TILE_HEIGHT);

> OTOH, I'm not sure there's any benefit to relaxing the requirement.
> Some modules may be less efficient if the tile size is significantly
> too small or too large, but being off by a factor of sqrt(2) shouldn't
> be an issue.

The tile size used by r.cost, r.walk, and r.watershed in G7 is 64x64
throughout, as for the r.proj tile cache. That size seems to be good
for various different region sizes, with the exception of r.watershed,
where a sorted array is stored in a segmented structure, i.e. only one
row, many columns.

I suspect that the optimal sizes for r.proj would have the width
greater than the height, on the assumption that real-world cases would
typically have relatively little rotation or curvature (i.e.
horizontal lines in the target projection aren't far from horizontal
in the source projection).

--
Glynn Clements <glynn@gclements.plus.com>