[GRASS-dev] Parallel computing for r.sun

Yann wrote:

does anyone have a timeline for merging the OpenCL code into trunk?

Hi,

that's been on my todo list for way too long.

the first step is to get support for an OpenCL build into grass7's
./configure, next to pthreads and OpenMP which are already there.
I welcome help with that, my copious free time hasn't been
very good lately.
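For reference, a rough sketch of what such a check might look like in configure.ac, using standard autoconf idioms. The option name, the OPENCLLIB variable, and the macro wiring here are illustrative guesses, not the actual GRASS build code:

```sh
# Hypothetical configure.ac fragment (an illustration only, not the
# actual GRASS build code): enable OpenCL with --with-opencl and
# check for the header and library.
AC_ARG_WITH(opencl,
    [AS_HELP_STRING([--with-opencl], [support OpenCL acceleration])],
    [], [with_opencl=no])

if test "$with_opencl" != "no" ; then
    # note: on OS X the header lives in OpenCL/cl.h and you link
    # with "-framework OpenCL" instead of -lOpenCL
    AC_CHECK_HEADER([CL/cl.h], [],
        [AC_MSG_ERROR([*** unable to locate OpenCL includes])])
    AC_CHECK_LIB([OpenCL], [clGetPlatformIDs],
        [OPENCLLIB="-lOpenCL"
         AC_DEFINE([OPENCL_ENABLED], [1], [define if OpenCL is available])],
        [AC_MSG_ERROR([*** unable to locate OpenCL library])])
fi
AC_SUBST(OPENCLLIB)
```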

The removal of tertiary calls from the main r.sun loop has
already been done in trunk.

I'll try to write more after work, but a lot is explained
on these pages:
  http://grasswiki.osgeo.org/wiki/Category:Parallelization

besides the r.sun work already done, AFAIAC the top candidates
for parallelization in GRASS are v.surf.rst and v.surf.bspline.
Currently there is some support directly in the LU decomposition,
but that spawns thousands of threads; the cost of creating and
destroying those is coming down a lot, but I think it will
probably be far more efficient to create only dozens of
threads in the case of v.surf.bspline (see the code comment at
the start of the loop where the OpenMP support could go),
and for v.surf.rst perhaps to multithread the various boxes of
the quadtree? The idea is to more closely match the
number of threads/processes to the number of CPUs or GPUs:
for CPUs that means dozens, for GPUs it means hundreds.

Each module will be different, so each one requires its own
approach. For that reason I'm happy for pthreads, OpenMP,
and OpenCL to all be supported.

various Python and Bourne shell scripts (quite new, so not yet
backported to 6.4.svn) have been parallelized; the easy
win is to run the three R, G, B bands in parallel. That only
scales to 3 CPUs, but it is nearly perfectly efficient, and a 3x
speedup is as good as any. See the v.surf.icw script in the
addons (both the g6.sh and g7.py versions) for a complete example.

the good news is that the OpenCL APIs are slowly making their
way into the mainstream driver releases; even Intel is on board.
Before, you pretty much needed to tailor your Linux distro to
match their SDK release target if you wanted to use it, and
even then the SDK didn't match the available driver version,
and other such pain.

I'm not sure which modules in GRASS besides r.sun are well
suited to GPU acceleration. r.sun, as a ray-tracing exercise,
was an obvious one, since that's what GPUs are generally designed
to do these days. Many, if not most, of GRASS's modules are
I/O limited, and I/O to the video card in particular has
traditionally been really slow. (That's getting better too, but
it's still a little on the horizon.)

another thing to consider is that GPU math on consumer-grade
video cards has traditionally been limited to single-precision
floats; you had to buy the expensive "science grade" cards if
you wanted to calculate with double-precision floating point.

great leaps can be made, but there are some caveats to consider.

best,
Hamish

oh yeah, and I should probably say a little about MPI, as that's
most of what we're running on our cluster here.

well, it's dominated by the I/O problem: let's get multithreading
working first. :-) MPI is good for the massively CPU-bound problems,
which again I think are mostly the v.surf.* interpolation ones.
It can be more or less efficient than OpenMP depending on the
class of problem, but MPI requires much deeper changes to the
modules and a lot more work setting up the ring of systems.

Hamish

Hi Hamish,

"""
the first step is to get support for an OpenCL build into grass7's
./configure, next to pthreads and OpenMP which are already there.
I welcome help with that, my copious free time hasn't been
very good lately.
"""

I am heading to the code sprint in Genoa on Sunday,

Could you spend a little time explaining to me how to go about that?

(a URL, a PDF, anything is welcome)

Cheers,
Yann

On 30 January 2013 08:19, Hamish <hamish_b@yahoo.com> wrote:
[...]

Yann Chemin

Researcher@IWMI

Skype/FB: yann.chemin

On the Mac, you’re compiling with “-framework OpenCL”

~Seth

via iPhone


Dear Seth, Hamish, Yann and others,

thank you for the information and suggestions. At the beginning we chose the module r.los, but after discussion with Jaro Hofierka we switched to r.sun.

So, what do you think about r.los? Is there already a parallel implementation of it?

If r.los is already covered for parallel computing, we will try v.surf.bspline or v.surf.rst, which are in our area of interest as well.

Thank you very much

Jan

Hi Jan,

So, what do you think about r.los? Is there already a
parallel implementation of it?

If r.los is already covered for parallel computing, we
will try v.surf.bspline or v.surf.rst, which are in our
area of interest as well.

the current r.los implementation does not scale at all well to
large array sizes, so r.viewshed is probably the better of
those two to look at; the only thing to consider is that it is
written in C++, if that matters to you.

see
  http://grasswiki.osgeo.org/wiki/GPU#Modules_of_interest_to_be_parallelized
and
  http://grasswiki.osgeo.org/wiki/OpenMP#Candidates

"r.los (to be replaced by r.viewshed after last few bugs are fixed)"

(and the other pages in the Parallelization wiki category)

have fun,
Hamish

Jan wrote:

If there is r.los also available for parallel computing we
will try v.surf.bspline

re. processing each of the bspline subregions in
parallel, see this point in the code:

https://trac.osgeo.org/grass/browser/grass/trunk/vector/v.surf.bspline/main.c#L589

note that deep OpenMP pragmas are already in place in the gmath
library calls made from that loop, but there's an OpenMP call to
tell sub-functions not to re-parallelize something that is
already broken up into chunks. (Or you can just comment out
the pragmas in the gmath library if nothing else is using them;
so far they're quite inefficient inside the loop.)

Hamish

On Wed, Jan 30, 2013 at 10:42 AM, Hamish <hamish_b@yahoo.com> wrote:
...

This statement:

"r.los (to be replaced by r.viewshed after last few bugs are fixed)"

(and the other pages in the Parallelization wiki category)

and other outdated material (it all referred to G6, not G7) I have fixed
in the wiki page:

http://grasswiki.osgeo.org/wiki/GPU#Modules_of_interest_to_be_parallelized

Markus