[GRASS-dev] multithreading v.surf.bspline and others with OpenMP

Hi,

On June 2, 2011, Soeren wrote:

Have a look at:
http://grass.osgeo.org/wiki/OpenMP

There are only a few modules using OpenMP (r.gwflow, r3.gwflow and
r.solute.transport). Parts of the gpde and gmath libraries in grass
6.5 and 7 are parallelized using OpenMP. These libraries provide tests
and benchmarks in the libraries test directories.

I am looking to add OpenMP support to v.surf.bspline and am very much
a beginner with it. Profiling with valgrind and kcachegrind (see the
wiki Bugs page) shows that 99% of the module time was spent in gmath lib's
G_math_cholesky_sband_decomposition(). This looks like low hanging fruit.

So I built GRASS 7 with OpenMP support and added the following patch:
(see the OpenMP ./configure patch in the trac'er)

Index: lib/gmath/solvers_direct_cholesky_band.c

--- lib/gmath/solvers_direct_cholesky_band.c (revision 49355)
+++ lib/gmath/solvers_direct_cholesky_band.c (working copy)
@@ -24,6 +24,7 @@

     for (i = 0; i < rows; i++) {
        G_percent(i, rows, 2);
+#pragma omp parallel for schedule (static) private(j, k, sum, end) shared(A, T, i, bandwidth)
        for (j = 0; j < bandwidth; j++) {
            sum = A[i][j];
            /* start = 1 */

and on a 6-core CPU the module runs 3.5x faster. but the resulting map
is just full of NaNs. Running with OMP_NUM_THREADS=1 gives me data again.

a) what have I done wrong? why all NaNs?

b) 3.5x speedup is very nice, but any way to improve on that 40% efficiency
loss?

c) for the many raster modules that do
        for (row = 0; row < nrows; row++) {
            for (col = 0; col < ncols; col++) {

I'm guessing it is better to multithread the column loop (filling
a row's array in parallel, as in the sketch below), but to keep writing
the raster rows serially. But does creating and destroying 2-16 threads
per row end up costing too much in create/destroy overhead? IIUC "schedule
(static)" helps to minimize the cost of creating each thread a little? (??)

d) I had a look at 'r.univar -e'; it uses heapsort() which AFAIU is a
sorting method which can not be parallelized.

e) I had a look at r.neighbors, which spends 50% of its time in gather().
I got it to run multithreaded on all cores, but the result was many times
slower than if I just had it run serially. why? too much waiting for "n"?

int gather(DCELL * values, int offset)
{
    int row, col;
    int n = 0;
    DCELL *c;

    *values = 0;

#pragma omp parallel for private(c, col) reduction(+:n)
    for (row = 0; row < ncb.nsize; row++)
        for (col = 0; col < ncb.nsize; col++) {
            c = &ncb.buf[row][offset + col];

            if (ncb.mask && !ncb.mask[row][col])
                continue;

            if (Rast_is_d_null_value(c))
                Rast_set_d_null_value(&values[n], 1);
            else
                values[n] = *c;

            n++;
        }

    return n ? n : -1;
}

(I moved "DCELL *c" out of the inner loop as I suspected allocating a
new variable might be expensive. But it still ran at about the same speed
so maybe it is cheap to do that after all, or maybe the compiler did
something smart)

f) we talked a little about this before, but it would be good to settle
on a uniform name for test suite scripts within a module or libraries
directory. I humbly suggest to use "test_suite/". Using "profiler/"
doesn't cover regression testing, "test" is too generic, and putting
it in the main module dir introduces too much file clutter IMO.

g) v.surf.rst spends 58% of its time in gmath lib's G_ludcmp() and 20% of
its time in __iee754_log(). G_ludcmp() also looks like very low hanging
fruit. (great!)

h) is it possible &/or desirable to use (aka outsource) pre-optimized &
pre-threaded BLAS or LAPACK libraries for our linear algebra needs?

thanks,
Hamish

Hi Hamish,

2011/11/28 Hamish <hamish_b@yahoo.com>:

Hi,

On June 2, 2011, Soeren wrote:

Have a look at:
http://grass.osgeo.org/wiki/OpenMP

There are only a few modules using OpenMP (r.gwflow, r3.gwflow and
r.solute.transport). Parts of the gpde and gmath libraries in grass
6.5 and 7 are parallelized using OpenMP. These libraries provide tests
and benchmarks in the libraries test directories.

I am looking to add OpenMP support to v.surf.bspline and am very much
a beginner with it. Profiling with valgrind and kcachegrind (see the
wiki Bugs page) shows that 99% of the module time was spent in gmath lib's
G_math_cholesky_sband_decomposition(). This looks like low hanging fruit.

So I built GRASS 7 with OpenMP support and added the following patch:
(see the OpenMP ./configure patch in the trac'er)

Index: lib/gmath/solvers_direct_cholesky_band.c

--- lib/gmath/solvers_direct_cholesky_band.c (revision 49355)
+++ lib/gmath/solvers_direct_cholesky_band.c (working copy)
@@ -24,6 +24,7 @@

     for (i = 0; i < rows; i++) {
        G_percent(i, rows, 2);
+#pragma omp parallel for schedule (static) private(j, k, sum, end) shared(A, T, i, bandwidth)
        for (j = 0; j < bandwidth; j++) {
            sum = A[i][j];
            /* start = 1 */

and on a 6-core CPU the module runs 3.5x faster. but the resulting map
is just full of NaNs. Running with OMP_NUM_THREADS=1 gives me data again.

a) what have I done wrong? why all NaNs?

Try r49406.

You need to separate the computation for j == 0 and j > 0.
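
The idea looks roughly like this (only a sketch of the approach, not
the committed code; the failure check on the pivot is left out):

#include <math.h>

void cholesky_sband_sketch(double **A, double **T, int rows, int bandwidth)
{
    int i, j, k, end;
    double sum;

    for (i = 0; i < rows; i++) {
        /* j == 0: compute the pivot T[i][0] serially; every j > 0
         * iteration below divides by it */
        sum = A[i][0];
        end = (bandwidth < i + 1) ? bandwidth : (i + 1);
        for (k = 1; k < end; k++)
            sum -= T[i - k][k] * T[i - k][k];
        T[i][0] = sqrt(sum);

#pragma omp parallel for schedule (static) private(j, k, sum, end)
        for (j = 1; j < bandwidth; j++) {
            /* j > 0: reads only T[i][0] and earlier rows of T,
             * so the iterations are independent of each other */
            sum = A[i][j];
            end = ((bandwidth - j) < (i + 1)) ? (bandwidth - j) : (i + 1);
            for (k = 1; k < end; k++)
                sum -= T[i - k][k] * T[i - k][j + k];
            T[i][j] = sum / T[i][0];
        }
    }
}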

b) 3.5x speedup is very nice, but any way to improve on that 40% efficiency
loss?

The speedup gets better the larger the band matrix is. But the limiting
factor for the parallel speedup is the first computation, for
j == 0. This operation must be done before the rest can be processed
in parallel. The next limiting factor is the time the OS needs
to create a thread, unless the OpenMP implementation uses a thread
pool ... .

c) for the many raster modules that do
        for (row = 0; row < nrows; row++) {
            for (col = 0; col < ncols; col++) {

I'm guessing it is better to multithread the columns for loop (filling
a row's array in parallel), but to keep writing the raster rows serially.

Yes, much better indeed.

But does the cost of creating and destroying 2-16 threads per row end up
costing too much in terms of create/destroy overhead? IIUC "schedule
(static)" helps to minimize the cost of creating each thread a little? (??)

This is only useful if processing the columns needs much
more time than the thread creation by the OS, unless a thread pool
is used ... .

d) I had a look at 'r.univar -e'; it uses heapsort() which AFAIU is a
sorting method which can not be parallelized.

e) I had a look at r.neighbors, which spends 50% of its time in gather().
I got it to run multithreaded on all cores, but the result was many times
slower than if I just had it run serially. why? too much waiting for "n"?

Multithreading, especially in the case of an OpenMP reduction, is only
meaningful if the data is large enough; otherwise the serial
gathering of n and the thread creation take much longer than the
computation, unless a thread pool is used ... .

int gather(DCELL * values, int offset)
{
    int row, col;
    int n = 0;
    DCELL *c;

    *values = 0;

#pragma omp parallel for private(c, col) reduction(+:n)
    for (row = 0; row < ncb.nsize; row++)
        for (col = 0; col < ncb.nsize; col++) {
            c = &ncb.buf[row][offset + col];

            if (ncb.mask && !ncb.mask[row][col])
                continue;

            if (Rast_is_d_null_value(c))
                Rast_set_d_null_value(&values[n], 1);
            else
                values[n] = *c;

            n++;
        }

    return n ? n : -1;
}

(I moved "DCELL *c" out of the inner loop as I suspected allocating a
new variable might be expensive. But it still ran at about the same speed
so maybe it is cheap to do that after all, or maybe the compiler did
something smart)
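
Just to illustrate the index problem: when ncb.mask is not set, the
output slot can be computed directly from row and col, so the threads
do not need a shared counter at all. A sketch only, and for a typical
3x3 window the thread overhead will still dominate:

#pragma omp parallel for private(col)
    for (row = 0; row < ncb.nsize; row++) {
        for (col = 0; col < ncb.nsize; col++) {
            DCELL *c = &ncb.buf[row][offset + col];
            int idx = row * ncb.nsize + col;   /* no shared n needed */

            if (Rast_is_d_null_value(c))
                Rast_set_d_null_value(&values[idx], 1);
            else
                values[idx] = *c;
        }
    }
    n = ncb.nsize * ncb.nsize;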

f) we talked a little about this before, but it would be good to settle
on a uniform name for test suite scripts within a module or libraries
directory. I humbly suggest to use "test_suite/". Using "profiler/"
doesn't cover regression testing, "test" is too generic, and putting
it in the main module dir introduces too much file clutter IMO.

No problem with test_suite for library tests and benchmarks.

g) v.surf.rst spends 58% of its time in gmath lib's G_ludcmp() and 20% of
its time in __iee754_log(). G_ludcmp() also looks like very low hanging
fruit. (great!)

h) is it possible &/or desirable to use (aka outsource) pre-optimized &
pre-threaded BLAS or LAPACK libraries for our linear algebra needs?

The GRASS ATLAS wrapper is an example of such an approach. ATLAS can be used,
but if it is not installed, the default GRASS implementation is
used. Feel free to add new function wrappers. :-)

Best regards
Soeren

thanks,
Hamish

Hamish:

> I am looking to add OpenMP support to v.surf.bspline

...

Sören wrote:

Try r49406.
You need to separate the computation for j == 0 and j > 0.

nice, thanks.

> b) 3.5x speedup is very nice, but any way to improve
> on that 40% efficiency loss?

The speedup gets better the larger the band matrix is. But the limiting
factor for the parallel speedup is the first computation, for
j == 0. This operation must be done before the rest can be processed
in parallel. The next limiting factor is the time the OS needs
to create a thread, unless the OpenMP implementation uses a thread
pool ... .

I guess that getting a thread pool typically involves waiting until the
next OS upgrade, or longer?

> c) for the many raster modules that do
>         for (row = 0; row < nrows; row++) {
>             for (col = 0; col < ncols; col++) {
>
> I'm guessing it is better to multithread the columns for loop
> (filling a row's array in parallel), but to keep writing the raster
> rows serially.

Yes, much better indeed.

... but parallelizing the row loop would be much better if the
whole array was in memory, or mmap()'d?

> But does the cost of creating and destroying 2-16 threads per row
> end up costing too much in terms of create/destroy overhead?
> IIUC "schedule (static)" helps to minimize the cost of creating
> each thread a little? (??)

This is only useful if processing the columns
needs much more time than the thread creation by the OS, unless a
thread pool is used ... .

under the current method of parallelizing the inside loop though,
say for a quad-core CPU with a 1200 cols x 800 rows array, we
get 4 threads per row, each handling 300 columns, and for the task
have created and destroyed 4*800= 3200 threads on a system which
will only handle 4 at a time.

much better (but perhaps harder) would be to parallelize as close to
the main process level as we can, and then only deal with the overhead
of creating/destroying e.g. 4 threads not 3200.

On the other hand, for OpenCL (I'll work on support for that after the
OpenMP stuff has been committed) a modern GPU may well have 500 cores.

in the case of v.surf.bspline I note it runs using 4-16 subregions for
the test runs I did. if those could each be sent to their own thread
I think we'd be done (for a few years), without the 40% efficiency loss.

If so, is it then possible to call omp_set_num_threads(1); to tell gmath
lib not to try and parallelize it any more? The fn descr says "number of
threads used by default in subsequent parallel sections", so maybe so.

Multithreading, especially in the case of an OpenMP reduction, is only
meaningful if the data is large enough; otherwise the serial
gathering of n and the thread creation take much longer than the
computation, unless a thread pool is used ... .

And even more so for OpenCL, as copying the data into and the result back
out of video card memory is very, very slow.

> f) we talked a little about this before, but it would
> be good to settle on a uniform name for test suite scripts

...

also it would be good to confirm a standard dataset to use. Generating
fake data on the fly is self-bootstrapping, but requires passing
fixed seeds to v.random etc. Otherwise N.C. 2008 probably gives a
wider range of possibilities than the much smaller Spearfish (mainly
because Spearfish doesn't include lidar data).
any thoughts?

> g) v.surf.rst spends 58% of its time in gmath lib's G_ludcmp() and 20%
> of its time in __iee754_log(). G_ludcmp() also looks like very low
> hanging fruit. (great!)

it also looks like a very similar clone of other code in gmath lib, and
I'd expect BLAS/LAPACK/ATLAS too.

I was able to get v.surf.rst to run faster by putting some pragmas into
G_ludcmp(), but again I wonder if it would be more efficient to concentrate
on parallelizing the module's quadtree segments instead of the inner loops
of the linear algebra. And again, maybe a two step approach: do the libs
now (relatively easy), then later do the segments and have that module code
also switch off threading for its library calls with omp_set_num_threads().

> h) is it possible &/or desirable to use (aka outsource) pre-optimized
> & pre-threaded BLAS or LAPACK libraries for our linear algebra needs?

The GRASS ATLAS wrapper is an example of such an approach. ATLAS can
be used, but if it is not installed, the default GRASS
implementation is used.

Oh, I did not know that was there. We can work on adding it to trunk's
./configure next.

Hamish

ps- I didn't add --with-openmp-includes= to ./configure in my patch, just
--with-openmp-libs=. To use the omp_*() fns I guess omp.h is wanted, and
so I should do that after all?

Hi,

2011/11/29 Hamish <hamish_b@yahoo.com>:

Hamish:

> I am looking to add OpenMP support to v.surf.bspline

...

Sören wrote:

Try r49406.
You need to separate the computation for j == 0 and j > 0.

nice, thanks.

> b) 3.5x speedup is very nice, but any way to improve
> on that 40% efficiency loss?

The speedup gets better the larger the band matrix is. But the limiting
factor for the parallel speedup is the first computation, for
j == 0. This operation must be done before the rest can be processed
in parallel. The next limiting factor is the time the OS needs
to create a thread, unless the OpenMP implementation uses a thread
pool ... .

I guess that getting a thread pool typically involves waiting until the
next OS upgrade, or longer?

OpenMP is compiler specific, and therefore so is the thread pool
implementation. Have you tried the free Intel C compiler for open
source projects? It is really fast. The thread creation speed is
OS dependent.

> c) for the many raster modules that do
>         for (row = 0; row < nrows; row++) {
>             for (col = 0; col < ncols; col++) {
>
> I'm guessing it is better to multithread the columns for loop
> (filling a row's array in parallel), but to keep writing the raster
> rows serially.

Yes, much better indeed.

... but parallelizing the row loop would be much better if the
whole array was in memory, or mmap()'d?

Yes. Most of the GRASS BLAS level 2 and 3 functions can run in
parallel using OpenMP, because everything is in memory.
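
For example, a level 2 matrix-vector product follows this pattern (a
sketch of the idea, not the gmath source; A, x, y, rows and cols are
assumed to be allocated already):

    int i, j;
    double s;

    /* y = A * x: every output row is independent, so the outer
     * loop parallelizes cleanly over the in-memory matrix */
#pragma omp parallel for private(j, s) schedule(static)
    for (i = 0; i < rows; i++) {
        s = 0.0;
        for (j = 0; j < cols; j++)
            s += A[i][j] * x[j];
        y[i] = s;
    }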

> But does the cost of creating and destroying 2-16 threads per row
> end up costing too much in terms of create/destroy overhead?
> IIUC "schedule (static)" helps to minimize the cost of creating
> each thread a little? (??)

This is only useful if processing the columns
needs much more time than the thread creation by the OS, unless a
thread pool is used ... .

under the current method of parallelizing the inside loop though,
say for a quad-core CPU with a 1200 cols x 800 rows array, we
get 4 threads per row, each handling 300 columns, and for the task
have created and destroyed 4*800= 3200 threads on a system which
will only handle 4 at a time.

much better (but perhaps harder) would be to parallelize as close to
the main process level as we can, and then only deal with the overhead
of creating/destroying e.g. 4 threads not 3200.

This is indeed harder. First, it depends on the algorithm of a module:
can it be parallelized at the main process level at all?
It is easier in the loops to reuse threads doing the same job again and
again ... a thread pool. The pool gets initialized at the start of the
program and waits for work; the threads are only destroyed at the end of
the program.
Some years ago we discussed a C thread pool implementation in a
parallel algorithms course at university ... unfortunately the
supervisor was too #!*&%? lazy to present us the solution. :-(

On the other hand, for OpenCL (I'll work on support for that after the
OpenMP stuff has been committed) a modern GPU may well have 500 cores.

Handling that number of threads is pure magic in my humble opinion,
but I guess this is implemented very efficiently in hardware on the
graphics card ... .

in the case of v.surf.bspline I note it runs using 4-16 subregions for
the test runs I did. if those could each be sent to their own thread
I think we'd be done (for a few years), without the 40% efficiency loss.

Indeed. Parallelization at the subregion level is the best thing to do.
If the subregions are large enough, an MPI solution may be meaningful
too?

If so, is it then possible to call omp_set_num_threads(1); to tell gmath
lib not to try and parallelize it any more? The fn descr says "number of
threads used by default in subsequent parallel sections", so maybe so.

We should in principle avoid nested OpenMP regions; it is better to use
a non-parallelized solver and such in a subthread. A better idea is to
use MPI to distribute the jobs and OpenMP to parallelize the work on the
MPI nodes.
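
For the subregion idea it could look roughly like this (a sketch only;
nsubregions, subregions and solve_subregion() are placeholders for
whatever the module provides):

#include <omp.h>

    int sub;

    /* keep the pragmas inside the library calls from spawning their
     * own threads within these worker threads (nested parallelism is
     * usually off by default anyway) */
    omp_set_nested(0);

#pragma omp parallel for schedule(dynamic)
    for (sub = 0; sub < nsubregions; sub++)
        solve_subregion(&subregions[sub]);  /* serial solve of one subregion */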

Multithreading, especially in the case of an OpenMP reduction, is only
meaningful if the data is large enough; otherwise the serial
gathering of n and the thread creation take much longer than the
computation, unless a thread pool is used ... .

And even more so for OpenCL, as copying the data into and the result back
out of video card memory is very, very slow.

Well, PCI Express v3.0 with 16 lanes gives 16 GB/s, which is not that
bad? A hard disk is much slower.

> f) we talked a little about this before, but it would
> be good to settle on a uniform name for test suite scripts

...

also it would be good to confirm a standard dataset to use. Generating
fake data on the fly is self-bootstrapping, but requires passing
fixed seeds to v.random etc. Otherwise N.C. 2008 probably gives a
wider range of possibilities than the much smaller Spearfish (mainly
because Spearfish doesn't include lidar data).
any thoughts?

A standard dataset is a good idea, but very often you need to generate
the data to test specific parameter settings. Keeping all this data in
a test dataset will IMHO bloat it. Besides that,
the tests should be designed to work with small data for speed reasons,
except for tests of specific bugs which only occur with large data.

> g) v.surf.rst spends 58% of its time in gmath lib's G_ludcmp() and 20%
> of its time in __iee754_log(). G_ludcmp() also looks like very low
> hanging fruit. (great!)

it also looks like a very similar clone of other code in gmath lib, and
I'd expect BLAS/LAPACK/ATLAS too.

Unfortunately G_ludcmp() is NR code, and it is faster than the solvers
in gmath and ccmath.
But it should be replaced by the LU solver implemented in the ccmath library.

Some years ago I implemented a v.surf.rst version
using much faster solvers which had been parallelized with OpenMP, but
without much success. The matrices created by spline interpolation are
dense and mostly ill-conditioned. Hence, the LU decomposition solver and
the Gauss solver are the most robust and meaningful solvers available.

I was able to get v.surf.rst to run faster by putting some pragmas into
G_ludcmp(), but again I wonder if it would be more efficient to concentrate
on parallelizing the module's quadtree segments instead of the inner loops
of the linear algebra. And again, maybe a two step approach: do the libs
now (relatively easy), then later do the segments and have that module code
also switch off threading for its library calls with omp_set_num_threads().

Parallelizing the module's quadtree segments instead of the inner
loops is the best idea.
I was thinking about this too, but the rst code scared me too much ...
functions with 40 parameters as arguments and so on ... brrrrr.

> h) is it possible &/or desirable to use (aka outsource) pre-optimized
> & pre-threaded BLAS or LAPACK libraries for our linear algebra needs?

The GRASS ATLAS wrapper is an example of such an approach. ATLAS can
be used, but if it is not installed, the default GRASS
implementation is used.

Oh, I did not know that was there. We can work on adding it to trunk's
./configure next.

We can do that, but IMHO the ATLAS wrapper is not in use by any module,
except the library test module.

Best regards
Soeren

Hamish

ps- I didn't add --with-openmp-includes= to ./configure in my patch, just
--with-openmp-libs=. To use the omp_*() fns I guess omp.h is wanted, and
so I should do that after all?

As OpenMP is compiler dependent, the compiler should know where to
search for the includes, because it is part of the standard library?
Well, I am not 100% sure ... . IMHO, to put OpenMP support into the
configure script you need a specific solution for every compiler on the
market. The gmath and gpde libraries make no use of omp_*() functions,
so the OpenMP stuff can be specified as compiler and linker flags.

Sören wrote:

The GRASS ATLAS wrapper is an example of such
an approach. ATLAS can be used, but if it
is not installed, the default GRASS implementation
is used.

Hamish:

> Oh, I did not know that was there. We can work on
> adding it to trunk's ./configure next.

Sören:

We can do that, but IMHO the ATLAS wrapper is not in
use by any module, except the library test module.

the question is: if we had support for it would it
be used? since we already have BLAS for low-level
stuff and LAPACK for mid-level stuff, and the gmath
and gpde libraries, and the ccmath library built in,
it starts to feel a little crowded. But if it is
the right tool for the right job I would not want
to stop someone from easily using it.

what advantage does ATLAS bring to the table?

Hamish

ps- does anyone know if the equivalent of the
"jobs" and "wait" shell commands exists for
launching modules from python?

Hi,

2011/12/18 Hamish <hamish_b@yahoo.com>:

Sören wrote:

The GRASS ATLAS wrapper is an example of such
an approach. ATLAS can be used, but if it
is not installed, the default GRASS implementation
is used.

Hamish:

> Oh, I did not know that was there. We can work on
> adding it to trunk's ./configure next.

Sören:

We can do that, but IMHO the ATLAS wrapper is not in
use by any module, except the library test module.

the question is: if we had support for it would it
be used? since we already have BLAS for low-level
stuff and LAPACK for mid-level stuff, and the gmath
and gpde libraries, and the ccmath library built in,
it starts to feel a little crowded. But if it is
the right tool for the right job I would not want
to stop someone from easily using it.

what advantage does ATLAS bring to the table?

ATLAS is a C implementation of BLAS and parts of LAPACK. [1]

The advantage of ATLAS over the BLAS/LAPACK Fortran wrapper in GRASS
is that it supports the C-style matrix layout used by the gmath
library, so no Fortran-to-C conversion is needed.
As far as I know the BLAS/LAPACK wrapper is not used in GRASS and,
with ATLAS and ccmath present, IMHO obsolete.
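
For illustration, a plain CBLAS call as ATLAS provides it (this is not
the gmath wrapper API): the C row-major matrix is passed as-is, with no
transposing into Fortran column order:

#include <cblas.h>

    double A[2][3] = { {1.0, 2.0, 3.0},
                       {4.0, 5.0, 6.0} };  /* row-major, as in C */
    double x[3] = { 1.0, 1.0, 1.0 };
    double y[2];

    /* y = 1.0 * A * x + 0.0 * y */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, 2, 3, 1.0,
                &A[0][0], 3, x, 1, 0.0, y, 1);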

Best regards
Soeren

Hamish

ps- does anyone know if the equivalent of the
"jobs" and "wait" shell commands exists for
launching modules from python?

[1] http://math-atlas.sourceforge.net/

2012/1/2 Sören Gebbert <soerengebbert@googlemail.com>:

...

ATLAS is a C implementation of BLAS and parts of LAPACK. [1]

The advantage of ATLAS over the BLAS/LAPACK Fortran wrapper in GRASS
is that it supports the C-style matrix layout used by the gmath
library, so no Fortran-to-C conversion is needed.

This sounds very good.

As far as I know the BLAS/LAPACK wrapper is not used in GRASS and,
with ATLAS and ccmath present, IMHO obsolete.

It was originally introduced to support the i.spec.unmix command, which
has been sitting in the GRASS Addons since 1999 :-) Of course it would be
better to also update i.spec.unmix at this point to completely use
gmath, which then calls ATLAS.

Markus