Hi,
2011/11/29 Hamish <hamish_b@yahoo.com>:
> Hamish:
> > I am looking to add OpenMP support to v.surf.bspline
> ...
> Sören wrote:
> > Try r49406.
> > You need to separate the computation for j == 0 and j > 0.
> nice, thanks.
> b) 3.5x speedup is very nice, but any way to improve
> on that 40% efficiency loss?
The speedup gets better the larger the band matrix is. But a limiting
factor for the parallel processing speedup is the first computation for
j == 0: this operation must be done before the rest can be processed
in parallel. The next limiting factor is the time the OS needs to
create a thread, unless the OpenMP implementation uses a thread
pool ...
> I guess that getting a thread pool typically involves waiting until the
> next OS upgrade, or longer?
OpenMP is compiler specific, and therefore so is the thread pool
implementation. Have you tried the free Intel C compiler for open
source projects? It is really fast. The thread creation speed is
OS dependent.
> c) for the many raster modules that do
>     for (row = 0; row < nrows; row++) {
>         for (col = 0; col < ncols; col++) {
>
> I'm guessing it is better to multithread the columns for loop
> (filling a row's array in parallel), but to keep writing the raster
> rows serially.
Yes, much better indeed.
> ... but parallelizing the row loop would be much better if the
> whole array was in memory, or mmap()'d?
Yes. Most of the GRASS BLAS level 2 and 3 functions can run in
parallel using OpenMP, because everything is in memory.
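As an illustration of that case, a rough sketch only (compute_cell(),
out, nrows and ncols are placeholders, not real GRASS code): with the
whole result held in memory, a single parallel region can cover the
entire job.

    /* sketch: whole result array in memory, one fork/join in total */
    int row, col;

    #pragma omp parallel for schedule(static) private(col)
    for (row = 0; row < nrows; row++)
        for (col = 0; col < ncols; col++)
            out[row][col] = compute_cell(row, col);  /* hypothetical per-cell work */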
> But does the cost of creating and destroying 2-16 threads per row
> end up costing too much in terms of create/destroy overhead?
> IIUC "schedule (static)" helps to minimize the cost of creating
> each thread a little? (??)
This is only useful in case the processing of the columns needs much
more time than the thread creation by the OS, unless a thread pool is
used ...
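For comparison, the inner-loop variant under discussion, again only a
sketch with placeholder names (out_buf, compute_cell() and put_row()
are not the real raster API): the thread team is forked and joined once
per row, which is exactly the overhead in question.

    int row, col;

    for (row = 0; row < nrows; row++) {
        /* fork a team for this row's columns, join at the end of the loop */
        #pragma omp parallel for schedule(static)
        for (col = 0; col < ncols; col++)
            out_buf[col] = compute_cell(row, col);

        put_row(out_buf);  /* serial write of the finished row */
    }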
> under the current method of parallelizing the inside loop though,
> say for a quad-core CPU with a 1200 cols x 800 rows array, we
> get 4 threads per row, each handling 300 columns, and for the task
> have created and destroyed 4*800 = 3200 threads on a system which
> will only handle 4 at a time.
> much better (but perhaps harder) would be to parallelize as close to
> the main process level as we can, and then only deal with the overhead
> of creating/destroying e.g. 4 threads, not 3200.
This is indeed harder. First, it depends on the algorithm of the module:
can it be parallelized at the main process level at all?
It's easier in the loops to reuse threads doing the same job again and
again ... a thread pool. The pool gets initialized at the start of the
program and waits for work; the threads are only destroyed at the end
of the program.
Some years ago we discussed a C thread pool implementation in a
parallel algorithm course at university ... unfortunately the
supervisor was too #!*&%? lazy to present the solution to us.
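Something close to the thread-pool effect can be sketched in plain
OpenMP by hoisting the parallel region out of the row loop, so one team
of threads handles every row (same placeholder names as above, not the
real module code):

    #pragma omp parallel private(row)
    for (row = 0; row < nrows; row++) {
        #pragma omp for schedule(static)
        for (col = 0; col < ncols; col++)
            out_buf[col] = compute_cell(row, col);
        /* implicit barrier after the for, then one thread writes the row */
        #pragma omp single
        put_row(out_buf);
        /* implicit barrier after single keeps the rows in order */
    }

With this pattern the team is created once, so the 3200 fork/join
operations from the 1200x800 example above shrink to a single one.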
> On the other hand, for OpenCL (I'll work on support for that after the
> OpenMP stuff has been committed) a modern GPU may well have 500 cores.
Handling this amount of threads is pure magic in my humble opinion,
but I guess this is implemented very efficiently in hardware on the
graphics card ...
> In the case of v.surf.bspline I note it runs using 4-16 subregions for
> the test runs I did. If those could each be sent to their own thread
> I think we'd be done (for a few years), without the 40% efficiency loss.
Indeed. Parallelization on the subregion level is the best thing to do.
In case the subregions are large enough, an MPI solution may be
meaningful too?
> If so, is it then possible to call omp_set_num_threads(1); to tell gmath
> lib not to try and parallelize it any more? The fn descr says "number of
> threads used by default in subsequent parallel sections", so maybe so.
We should in principle avoid nested OpenMP regions; it is better to use
the non-parallelized solvers and related code in a subthread. A better
idea is to use MPI to distribute the jobs and OpenMP to parallelize the
work on the MPI nodes.
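As a sketch of that direction at the subregion level (solve_subregion()
is hypothetical and stands for the per-subregion work, including the
gmath solver calls):

    #include <omp.h>

    int i;

    /* nested parallelism is off by default; with it off, any OpenMP
     * regions inside the library calls run with a team of one thread */
    omp_set_nested(0);

    #pragma omp parallel for schedule(dynamic)
    for (i = 0; i < n_subregions; i++)
        solve_subregion(i);  /* library calls in here stay effectively serial */

schedule(dynamic) is only a guess for the case where the subregions
differ in size.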
Multithreading, especially in case of an OpenMP reduction, is only
meaningful in case the data is large enough; otherwise the serial
gathering of the n partial results and the thread creation take much
longer than the computation, unless a thread pool is used ...
> And even more so for OpenCL, as copying the data into and the result back
> out of video card memory is very, very slow.
Well, PCI Express v3.0 with 16 lanes gives 16 GB/s, which is not that
bad? A hard disk is much slower.
> f) we talked a little about this before, but it would
> be good to settle on a uniform name for test suite scripts
> ...
> Also it would be good to confirm a standard dataset to use. Generating
> fake data on the fly is self-bootstrapping, but requires passing
> fixed seeds to v.random etc. Otherwise N.C.2008 probably gives a
> wider range of possibilities than the much smaller Spearfish (mainly
> because Spearfish doesn't include lidar data).
> Any thoughts?
A standard dataset is a good idea, but very often you need to generate
the data to test specific parameter settings. Keeping all this data in
a test dataset will IMHO bloat the test dataset. Besides that, the
tests should be designed to work with small data for speed reasons,
except tests for specific bugs which only occur with large data.
> g) v.surf.rst spends 58% of its time in gmath lib's G_ludcmp() and 20%
> of its time in __iee754_log(). G_ludcmp() also looks like very low
> hanging fruit. (great!)
> It also looks like a very similar clone of other code in gmath lib, and
> I'd expect BLAS/LAPACK/ATLAS too.
Unfortunately G_ludcmp() is NR code, and it's faster than the solvers
in gmath and ccmath. But it should be replaced by the LU solver
implemented in the ccmath library.
Some years ago I implemented a v.surf.rst version using much faster
solvers which had been parallelized with OpenMP, but without much
success. The matrices created by spline interpolation are dense and
mostly badly conditioned. Hence, the LU decomposition solver and the
Gauss solver are the most robust and meaningful solvers available.
> I was able to get v.surf.rst to run faster by putting some pragmas into
> G_ludcmp(), but again I wonder if it would be more efficient to concentrate
> on parallelizing the module's quadtree segments instead of the inner loops
> of the linear algebra. And again, maybe a two-step approach: do the libs
> now (relatively easy), then later do the segments and have that module code
> also switch off threading for its library calls with omp_set_num_threads().
Parallelizing the module's quadtree segments instead of the inner
loops is the best idea.
I was thinking about this too, but the rst code scared me too much ...
functions with 40 parameters as arguments and so on ... brrrrr.
> h) is it possible &/or desirable to use (aka outsource) pre-optimized
> & pre-threaded BLAS or LAPACK libraries for our linear algebra needs?
The GRASS ATLAS wrapper is an example of such an approach: ATLAS can
be used, but in case it is not installed, the default GRASS
implementation is used.
> Oh, I did not know that was there. We can work on adding it to trunk's
> ./configure next.
We can do that, but IMHO the ATLAS wrapper is not in use by any module,
except the library test module.
Best regards
Soeren
> Hamish
> ps- I didn't add --with-openmp-includes= to ./configure in my patch, just
> --with-openmp-libs=. To use the omp_*() fns I guess omp.h is wanted, and
> so I should do that after all?
As OpenMP is compiler dependent, the compiler should know where to
search for the includes, because it's part of its standard library?
Well, I am not 100% sure ... IMHO, to put OpenMP support in the
configure script you need a specific solution for every compiler on
the market. The gmath and gpde libraries make no use of omp_*
functions, so the OpenMP stuff can be specified as compiler and linker
flags.
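For reference, the usual guard so that a module still builds when
OpenMP is switched off: the _OPENMP macro is defined by every
OpenMP-enabled compiler, so both the header and the omp_*() calls can
be hidden behind it.

    #ifdef _OPENMP
    #include <omp.h>
    #endif

    int nthreads = 1;              /* serial fallback */
    #ifdef _OPENMP
    nthreads = omp_get_max_threads();
    #endif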