[GRASS-user] r.neighbors velocity

Hi all,
A friend of mine (who also has experience with some proprietary remote
sensing software) is testing GRASS for some tasks.
He is quite happy about the results, but he was really surprised by how
long r.neighbors takes to do the analysis. The time seems really
large to him. Do you see the same in your experience? Is
r.mfilter faster? Do you have other ideas for performing this kind of
kernel-based analysis using GRASS or other open-source software?

many thanks

Ivan

On Thu, Jun 27, 2013 at 9:01 AM, Ivan Marchesini
<ivan.marchesini@gmail.com> wrote:

Hi all,
A friend of mine (who also has experience with some proprietary remote
sensing software) is testing GRASS for some tasks.
He is quite happy about the results, but he was really surprised by how
long r.neighbors takes to do the analysis. The time seems really
large to him.

Please post some indications: computational region size and
moving window size, also which hardware/operating system.
Otherwise it is hard to say anything...

Markus

Hi Markus
you are perfectly right

the region is 4312*5576
the moving window 501
GRASS is the stable version, on a machine with 8 cores and 32 GB of RAM.
Ubuntu 12.04

it seems that the proprietary software is able to perform the analysis
in 2-3 seconds

:-|

ciao


On Fri, Jun 28, 2013 at 5:05 PM, Ivan Marchesini
<ivan.marchesini@gmail.com> wrote:

Hi Markus
you are perfectly right

the region is 4312*5576
the moving window 501

So, you are running a 501x501 moving window over that map?
For what purpose?

GRASS is the stable version, on a machine with 8 cores and 32 GB of RAM.
Ubuntu 12.04

it seems that the proprietary software is able to perform the analysis
in 2-3 seconds

Unlikely with a 501x501 moving window...

Markus

Hi Ivan,
this sounds very interesting.

Your map has a size of 4312*5576 pixels? That's about 100 MB for an integer or float map, or about 200 MB for a double map. You must have a very fast HD or SSD to read and write such a map in under 2-3 seconds?

In case your moving window has a size of 501 pixels (not 501x501 pixels!), the number of operations that must be performed is at least 4312*5576*501. That's about 12 billion ops. Amazing to do this in 2-3 seconds.
I have written a little program to see how my Intel Core i5 performs processing this amount of operations. Well, it needs about 100 seconds.

Here is the code, compiled with optimization:

#include <stdio.h>

int main()
{
    unsigned int i, j, k;
    register double v = 0.0;

    for (i = 0; i < 4321; i++) {
        for (j = 0; j < 5576; j++) {
            for (k = 0; k < 501; k++) {
                v = v + (double)(i + j + k) / 3.0;
            }
        }
    }
    printf("v is %g\n", v);
}

soeren@vostro:~/src$ gcc -O3 numtest.c -o numtest
soeren@vostro:~/src$ time ./numtest
v is 2.09131e+13

real 1m49.292s
user 1m49.223s
sys 0m0.000s

Your proprietary software must be running highly parallel, using a fast GPU or an ASIC, to keep the processing time under 2-3 seconds?

Unfortunately r.neighbors is not able to compete with such powerful software, since it does not read the entire map into RAM and does not run on GPUs or ASICs. But r.neighbors is able to process maps that are too large to fit into RAM. :-)
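
For illustration, here is a minimal sketch of the row-buffer idea behind such out-of-core processing (synthetic data stands in for the disk reads; this is a sketch of the general technique, not the actual r.neighbors source):

/* Sketch of row-buffered moving-window averaging: only SIZE rows are
 * held in RAM at any time, so the map may be far larger than memory.
 * Illustration of the idea only, not the r.neighbors source. */
#include <stdio.h>
#include <stdlib.h>

#define ROWS 100
#define COLS 200
#define SIZE 5                          /* window side length, odd */

/* stand-in for reading one map row from disk */
static void read_row(int r, double *row)
{
    int c;
    for (c = 0; c < COLS; c++)
        row[c] = (double)(r + c);       /* synthetic data */
}

int main(void)
{
    double *buf[SIZE];                  /* ring buffer: the only rows in RAM */
    int half = SIZE / 2, r, c, i;

    for (i = 0; i < SIZE; i++)
        buf[i] = malloc(COLS * sizeof(double));

    for (r = 0; r < half; r++)          /* pre-load the first rows */
        read_row(r, buf[r % SIZE]);

    for (r = 0; r < ROWS; r++) {
        if (r + half < ROWS)            /* fetch the row entering the window */
            read_row(r + half, buf[(r + half) % SIZE]);

        for (c = 0; c < COLS; c++) {
            double sum = 0.0;
            int n = 0, wr, wc;

            for (wr = r - half; wr <= r + half; wr++) {
                if (wr < 0 || wr >= ROWS)
                    continue;           /* clip the window at map edges */
                for (wc = c - half; wc <= c + half; wc++) {
                    if (wc < 0 || wc >= COLS)
                        continue;
                    sum += buf[wr % SIZE][wc];
                    n++;
                }
            }
            /* a real module would write sum / n to the output row here */
            if (r == ROWS / 2 && c == COLS / 2)
                printf("center average: %g\n", sum / n);
        }
    }
    for (i = 0; i < SIZE; i++)
        free(buf[i]);
    return 0;
}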

Can you please tell us what software is so incredibly fast?

Best regards
Soeren


Ivan wrote:

the region is 4312*5576
the moving window 501
GRASS is the stable version, on a machine with 8 cores and 32 GB of RAM.
Ubuntu 12.04

it seems that the proprietary software is able to perform the analysis
in 2-3 seconds

I expect he's probably correct in that statement, but it's the
*compiler* used, not the code behind it, and GRASS compiled in the same
way would be/is just as fast.

Sören wrote:

this sounds very interesting.

Your map has a size of 4312*5576 pixels? That's about 100 MB for an
integer or float map, or about 200 MB for a double map. You must have
a very fast HD or SSD to read and write such a map in under
2-3 seconds?

500 MB/s I/O for an SSD is not unusual; 300 MB/s for a spinning-platter
RAID is pretty common. It's good to run a few replicates of the
benchmark so that from the 2nd run on the data is already cached in RAM
(as long as the region is not too huge to hold it there).

In case your moving window has a size of 501 pixels (not 501x501 pixels!),
the number of operations that must be performed is at least 4312*5576*501.
That's about 12 billion ops. Amazing to do this in 2-3 seconds.
I have written a little program to see how my Intel Core i5 performs
processing this amount of operations. Well, it needs about 100 seconds.

I was able to get the same down to just over 1 second of wall time on a
plain consumer desktop chip. (!)

Here is the code, compiled with optimization:

#include <stdio.h>

int main()
{
    unsigned int i, j, k;
    register double v = 0.0;

    for (i = 0; i < 4321; i++) {
        for (j = 0; j < 5576; j++) {
            for (k = 0; k < 501; k++) {
                v = v + (double)(i + j + k) / 3.0;
            }
        }
    }
    printf("v is %g\n", v);
}

soeren@vostro:~/src$ gcc -O3 numtest.c -o numtest
soeren@vostro:~/src$ time ./numtest
v is 2.09131e+13

real 1m49.292s
user 1m49.223s
sys 0m0.000s

Your proprietary software must be running highly parallel, using a fast
GPU or an ASIC, to keep the processing time under 2-3 seconds?

Unfortunately r.neighbors is not able to compete with such
powerful software,

sure it is! :-)

since it does not read the entire map into RAM and does not run
on GPUs or ASICs. But r.neighbors is able to process maps that
are too large to fit into RAM. :-)

Can you please tell us what software is so incredibly fast?

I ran some quick trials with your sample program with both gcc 4.6
(ubuntu 12.04) and Intel's icc 12.1 on the same computer.

Diplomatically speaking, I see gcc 4.8 has just arrived in Debian/sid,
and I look forward to exploring how its new auto-vectorization features
are coming along.

The results, however, are not so diplomatic and speak for themselves...
and for this simple test case* it isn't pretty.
(* so atypically easy for the compiler to optimize)

test system: i7 3770, lots of RAM
replicates are presented in horizontal columns.

standard gcc, with & without -O3 and -march=native: (all ~same)
real 1m14.507s | 1m14.559s | 1m14.513s | 1m14.514s
user 1m14.289s | 1m14.305s | 1m14.297s | 1m14.297s
sys 0m0.000s | 0m0.028s | 0m0.000s | 0m0.000s
--

standard Intel icc with & without -O3:
v is 2.09131e+13

real 0m21.979s | 0m21.967s | 0m21.958s | 0m21.994s
user 0m21.909s | 0m21.901s | 0m21.897s | 0m21.929s
sys 0m0.000s | 0m0.000s | 0m0.000s | 0m0.000s
--

icc with the "-fast" compiler switch:
$ icc -fast soeren_speed_test.c -o soeren_speed_test_icc_fast
# note: 900kb executable vs gcc's 8kb.
$ time ./soeren_speed_test_icc_fast
v is 2.09131e+13

real 0m3.273s | 0m3.274s | 0m3.275s
user 0m3.260s | 0m3.260s | 0m3.264s
sys 0m0.000s | 0m0.000s | 0m0.000s

(there's your 3 seconds)
--

icc -funroll-loops:
real 0m22.008s | 0m21.998s
user 0m21.941s | 0m21.929s
sys 0m0.000s | 0m0.000s

(no extra gain in this case)
--

icc -parallel: (running on 8 hyperthread (ie 4 real) cores)
# binary size: 30kb
real 0m6.034s | 0m6.005s | 0m6.005s
user 0m46.531s | 0m46.603s | 0m46.519s
sys 0m0.024s | 0m0.028s | 0m0.044s
--

icc -parallel -fast:
# binary size 2.2 megabytes
$ time ./soeren_speed_test_icc_parallel+fast
v is 2.09131e+13

real 0m1.002s | 0m1.002s | 0m1.002s
user 0m6.768s | 0m6.796s | 0m6.780s
sys 0m0.004s | 0m0.004s | 0m0.008s

I tried a number of times but couldn't break the 1 second barrier. :-)

-----
I also ran it on an AMD Phenom II X6 1090T (icc -xHost --> -xSSSE3 ?)
All times "real"; all output was "v is 2.09131e+13".

gcc 4.4.5 with standard opts: 7kb binary
== near parity in single-threaded performance between the new i7 chip
and the 2-year-old AMD Phenom with an older copy of gcc! (stock debian/squeeze)
1m16.175s | 1m15.634s | 1m16.029s

icc 12.1 with standard-opts:
0m32.975s | 0m33.079s | 0m33.249s

icc with "-fast" opt: (700kb binary)
0m9.577s | 0m9.572s | 0m9.583s

icc with -parallel auto-MP: (31kb binary)
== again near parity with the new i7 chip! even with the Intel-biased
compiler. "user" cpu-time was actually less: the advantage of 6 real
cores vs 4 real + 4 virtual ones.*
real 0m6.406s | 0m6.404s | 0m6.404s
user 0m37.106s | 0m37.170s | 0m37.106s
sys 0m0.044s | 0m0.040s | 0m0.028s

icc with -fast and -parallel: (2mb binary)
real 0m2.002s | 0m2.002s | 0m2.002s
user 0m10.765s | 0m10.769s | 0m10.769s
sys 0m0.016s | 0m0.012s | 0m0.008s

(* I know from some earlier tests that hyperthreading carries real
computational overhead: on a 12 real + 12 virtual core Xeon, using
about 11 real cores took the same wall-clock time as 12 real + 5
virtual cores; it only beat the 12 real cores once I got up to about
19 total cores, and the full 24 cores gained only a few percentage
points more, with very much diminishing returns)

regards,
Hamish

ps- we have to pay for icc/ifort academic (research) licenses now, but
the student (homework/classroom) license for Linux is still gratis
if you dig around their dev website. Also AMD has their Open64
compiler to play with: http://developer.amd.com/tools/open64/Pages/

Hi,

here are the same results for Soeren's test program, with the Open64
compiler from AMD:

- Same AMD Phenom II X6 CPU as in my previous message.
- Open64 compiler 4.5.2.1 from AMD (GPLv2, LGPL)

I just downloaded the pre-built RHEL5 binary tarball and it worked
on Debian/squeeze; I just made an alias to the executable in the
untarred bin/ dir to get it to work.
see also http://wiki.open64.net/index.php/Installation_on_Ubuntu
Source is available of course, but according to the Debian ITP ticket
it's a bit of a pain to build there.

straight opencc:

real 0m59.015s | 0m58.972s | 0m58.963s
user 0m58.760s | 0m58.812s | 0m58.624s
sys 0m0.248s | 0m0.136s | 0m0.300s
--

opencc -O3:

real 0m35.203s | 0m35.173s | 0m35.204s
user 0m35.206s | 0m35.174s | 0m35.206s
sys 0m0.000s | 0m0.000s | 0m0.000s
--

opencc -Ofast (with or without -march=auto for native instructions)

real 0m13.389s | 0m13.402s | 0m13.435s
user 0m13.389s | 0m13.405s | 0m13.437s
sys 0m0.000s | 0m0.000s | 0m0.000s
--

opencc -Ofast -march=auto -apo on a 6-(real)-core CPU
v is 2.09131e+13

real 0m2.552s | 0m2.595s | 0m2.591s
user 0m14.857s | 0m14.725s | 0m14.725s
sys 0m0.008s | 0m0.024s | 0m0.016s

'-apo' is autoparallelization: poorly documented, but it works!
It adds OpenMP pragmas where it thinks it can && where it will
cause a gain; I'm glad to see it's not just for the Fortran
compiler anymore.

So the Open64 compiler is not quite as fast as Intel's for this
test case, but it's pretty close, with the more versatile gcc in the
far distance. Executable file size for all of the above was less than
12kb, since it can link to local OS shared libs.

I haven't tried it with llvm/clang.

Now I wonder which flags to use to recreate -Ofast in gcc to make it
a fairer comparison..

Hamish


Some more results with Sören's test program on an Intel(R) Core(TM) i5
CPU M450 @ 2.40GHz (2 real cores, 4 fake cores) with gcc 4.7.2 and
clang 3.3:

gcc -O3
v is 2.09131e+13

real 2m0.393s
user 1m57.610s
sys 0m0.003s

gcc -Ofast
v is 2.09131e+13

real 0m7.218s
user 0m7.018s
sys 0m0.017s

gcc -Ofast -floop-parallelize-all is as fast as gcc -Ofast

clang -Ofast
v is 2.09131e+13

real 0m18.701s
user 0m18.285s
sys 0m0.000s

Markus M


Markus Metz wrote:

Some more results with Sören's test program on an Intel(R) Core(TM) i5
CPU M450 @ 2.40GHz (2 real cores, 4 fake cores) with gcc 4.7.2 and
clang 3.3

gcc -O3
v is 2.09131e+13

real 2m0.393s
user 1m57.610s
sys 0m0.003s

gcc -Ofast
v is 2.09131e+13

real 0m7.218s
user 0m7.018s
sys 0m0.017s

nice. One thing we need to remember, though, is that it's not entirely
free; one of the things -Ofast turns on is -ffast-math:
"""
This option is not turned on by any -O option besides -Ofast since it can
result in incorrect output for programs that depend on an exact
implementation of IEEE or ISO rules/specifications for math functions. It
may, however, yield faster code for programs that do not require the
guarantees of these specifications.
"""

which may not be fit for our purposes.

With the ifort compiler there is '-fp-model precise', which allows only
optimizations that don't harm the results. Maybe gcc has something
similar.
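
A tiny illustration of what is at stake (a sketch: -ffast-math merely permits such reorderings, so whether a given compiler and version actually changes the result here is not guaranteed):

/* Floating-point addition is not associative; -ffast-math licenses
 * the compiler to reassociate anyway.  Under strict IEEE semantics
 * the two expressions below give different answers. */
#include <stdio.h>

int main(void)
{
    double a = 1e16, b = -1e16, c = 1.0;

    printf("(a + b) + c = %g\n", (a + b) + c);  /* 0 + 1 -> 1 */
    printf("a + (b + c) = %g\n", a + (b + c));  /* -1e16 + 1 rounds back to -1e16 -> 0 */
    return 0;
}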

Glad to see -floop-parallelize-all in gcc 4.7; it will help us identify
places to focus OpenMP work on.

Hamish

Hey Folks,
many thanks for pointing out the important influence of different compilers and compiler options. But please be aware that my tiny little program is not representative of a neighborhood analysis implementation; it was simply a demonstration of 12 billion ops:

  1. It uses fixed loop sizes; that is really easy for a compiler to optimize.

  2. It is pretty simple to parallelize, since only a simple reduction is done in the inner loop.

  3. Most important: Ivan stated a window size of 501 ... as MarkusN IMHO correctly interpreted, this means a 501x501 pixel moving window, since the size option of r.neighbors is the window's side length, which must be an odd number; it is not the total number of cells of the window. Shapes other than rectangular are more complex to implement.

To be diplomatic I decided to use 501 pixels, which might represent roughly a 23x21 pixel moving window, to show that even this "small" number of operations needs a considerable amount of time on modern CPUs.

If you use a 501x501 pixel moving window, the computational effort is roughly 501 times 12 billion ops, i.e. about 6 trillion. IMHO in this case a GPU or a neighborhood-algorithm-specific FPGA/ASIC may be able to perform this operation in 2-3 seconds.
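
As an aside (a general algorithmic note, not a claim about what ENVI actually does): for a plain average, the per-cell cost does not have to grow with the window size at all. With a summed-area table, the sum over any window takes four array lookups, so even a 501x501 mean costs O(rows*cols) in total. That is one way a package could get to seconds without special hardware. A minimal sketch, assuming a map small enough to keep in RAM:

/* Box average via a summed-area table (integral image): the sum over
 * any window is four lookups, so total cost is O(rows*cols) no matter
 * how large the window is.  Sketch only; note that for very large maps
 * the running sums can lose precision. */
#include <stdio.h>

#define R 1000
#define C 1000

static double map[R][C];
static double sat[R + 1][C + 1];   /* sat[r][c] = sum over rows < r, cols < c */

int main(void)
{
    int r, c;

    for (r = 0; r < R; r++)
        for (c = 0; c < C; c++)
            map[r][c] = (double)(r + c);            /* synthetic data */

    for (r = 0; r < R; r++)
        for (c = 0; c < C; c++)
            sat[r + 1][c + 1] = map[r][c]
                + sat[r][c + 1] + sat[r + 1][c] - sat[r][c];

    /* mean over the half-open window [r0, r1) x [c0, c1) */
    {
        int r0 = 100, r1 = 601, c0 = 100, c1 = 601; /* a 501x501 window */
        double sum = sat[r1][c1] - sat[r0][c1] - sat[r1][c0] + sat[r0][c0];

        printf("mean = %g\n", sum / ((double)(r1 - r0) * (c1 - c0)));
    }
    return 0;
}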

Best regards
Soeren

On Sat, Jun 29, 2013 at 1:26 PM, Hamish <hamish_b@yahoo.com> wrote:

Markus Metz wrote:

Some more results with Sören's test program on an Intel(R) Core(TM) i5
CPU M450 @ 2.40GHz (2 real cores, 4 fake cores) with gcc 4.7.2 and
clang 3.3

gcc -O3
v is 2.09131e+13

real 2m0.393s
user 1m57.610s
sys 0m0.003s

gcc -Ofast
v is 2.09131e+13

real 0m7.218s
user 0m7.018s
sys 0m0.017s

nice. One thing we need to remember, though, is that it's not entirely
free; one of the things -Ofast turns on is -ffast-math:
"""
This option is not turned on by any -O option besides -Ofast since it can
result in incorrect output for programs that depend on an exact
implementation of IEEE or ISO rules/specifications for math functions. It
may, however, yield faster code for programs that do not require the
guarantees of these specifications.
"""

which may not be fit for our purposes.

With the ifort compiler there is '-fp-model precise', which allows only
optimizations that don't harm the results. Maybe gcc has something
similar.

In gcc, you can turn off -ffoo with -fno-foo, so maybe you can
use -Ofast -fno-fast-math to preserve IEEE specifications.

Glad to see -floop-parallelize-all in gcc 4.7; it will help us identify
places to focus OpenMP work on.

Hamish

Hi,
I have implemented a "real" average neighborhood algorithm that runs in parallel using OpenMP. The source code and the benchmark shell script are attached.

The neighbor program computes a moving-window average of arbitrary size. The size of the map (rows x cols) and the size of the moving window (an odd number, cols == rows) can be specified:

./neighbor rows cols mw_size

IMHO the new program is better suited for comparing compilers and for measuring neighborhood operation performance.
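
The attached main.c is not reproduced inline here; as a rough sketch of the idea (a reconstruction, not the attached file), an OpenMP moving-window average can look like this:

/* Rough sketch of an OpenMP moving-window average, along the lines
 * described above -- a reconstruction of the idea, NOT the attached
 * main.c.  Build: gcc -Wall -fopenmp -Ofast sketch.c -o neighbor */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rows, cols, half, r;
    double *in, *out;

    if (argc != 4) {
        fprintf(stderr, "usage: %s rows cols mw_size\n", argv[0]);
        return 1;
    }
    rows = atoi(argv[1]);
    cols = atoi(argv[2]);
    half = atoi(argv[3]) / 2;           /* mw_size is an odd number */

    in = malloc((size_t)rows * cols * sizeof(double));
    out = malloc((size_t)rows * cols * sizeof(double));

    for (r = 0; r < rows * cols; r++)
        in[r] = (double)r;              /* synthetic input map */

    /* each thread computes a share of the output rows */
    #pragma omp parallel for schedule(static)
    for (r = 0; r < rows; r++) {
        int c, wr, wc;
        for (c = 0; c < cols; c++) {
            double sum = 0.0;
            int n = 0;
            for (wr = r - half; wr <= r + half; wr++) {
                if (wr < 0 || wr >= rows)
                    continue;           /* clip at map edges */
                for (wc = c - half; wc <= c + half; wc++) {
                    if (wc < 0 || wc >= cols)
                        continue;
                    sum += in[(size_t)wr * cols + wc];
                    n++;
                }
            }
            out[(size_t)r * cols + c] = sum / n;
        }
    }
    printf("out[last] = %g\n", out[(size_t)rows * cols - 1]);
    free(in);
    free(out);
    return 0;
}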

This is the benchmark on my five-year-old AMD Phenom 4-core computer using 1, 2 and 4 threads:

gcc -Wall -fopenmp -lgomp -Ofast main.c -o neighbor
export OMP_NUM_THREADS=1
time ./neighbor 5000 5000 23

real 0m37.211s
user 0m36.998s
sys 0m0.196s

export OMP_NUM_THREADS=2
time ./neighbor 5000 5000 23

real 0m19.907s
user 0m38.890s
sys 0m0.248s

export OMP_NUM_THREADS=4
time ./neighbor 5000 5000 23

real 0m10.170s
user 0m38.466s
sys 0m0.192s

Happy hacking, compiling and testing. :-)

Best regards
Soeren

(attachments)

benchmark.sh (224 Bytes)
main.c (3.58 KB)


More benchmark results, on an Intel Core i5 2410M (2 cores, 4 threads) with 8 GB RAM:

gcc -Wall -fopenmp -lgomp -O3 main.c -o neighbor
time ./neighbor 5000 5000 23

export OMP_NUM_THREADS=1

real 0m27.052s
user 0m26.882s
sys 0m0.128s

export OMP_NUM_THREADS=2

real 0m15.579s
user 0m30.466s
sys 0m0.124s

export OMP_NUM_THREADS=4

real 0m10.454s
user 0m40.711s
sys 0m0.120s

gcc -Wall -fopenmp -lgomp -Ofast -march=core-avx-i main.c -o neighbor
time ./neighbor 5000 5000 23

export OMP_NUM_THREADS=1

real 0m17.090s
user 0m16.953s
sys 0m0.108s

export OMP_NUM_THREADS=2

real 0m9.957s
user 0m19.437s
sys 0m0.136s

export OMP_NUM_THREADS=4

real 0m7.476s
user 0m28.698s
sys 0m0.124s

opencc -Wall -mp -Ofast -march=auto main.c -o neighbor
time ./neighbor 5000 5000 23

export OMP_NUM_THREADS=1

real 0m19.095s
user 0m18.909s
sys 0m0.152s

export OMP_NUM_THREADS=2

real 0m11.203s
user 0m22.097s
sys 0m0.136s

export OMP_NUM_THREADS=4

real 0m8.648s
user 0m33.670s
sys 0m0.160s

Best regards
Soeren

(attachments)

benchmark.sh (672 Bytes)


Dear Soeren, Hamish, Markus M., Markus N.,
sorry for the delay in answering, but I didn't expect that my question
would generate so many answers, and I was on holiday for 10 days
with no way to test your code.
First of all, a quick answer to Markus N.
My colleague is working on high-resolution images and DEMs. In this
particular case he is working on a 1 meter resolution raster map
concerning landslides. He told me that, due to the dimensions of the
landslides, he needs to use a kernel of 501*501 in order to catch the
"signatures" of the phenomena (I don't know the details exactly.. I'm
sorry).

I'm not a C developer, but reading your e-mails it seems that the
performance of C code (and r.neighbors is written in C) strongly
depends on the compiler.
Does that mean that by compiling the r.neighbors module in a different
way (I don't know how) we can obtain better results?

We have tested Soeren's latest code on the same machine where the
proprietary software (it is ENVI 5.x) showed those good performances.

These are the results:

gcc -Wall -fopenmp -lgomp -Ofast main.c -o neighbor
export OMP_NUM_THREADS=1
time ./neighbor 5000 5000 23
real 0m16.598s
user 0m16.477s
sys 0m0.080s

export OMP_NUM_THREADS=2
time ./neighbor 5000 5000 23
real 0m8.977s
user 0m17.573s
sys 0m0.080s

export OMP_NUM_THREADS=4
time ./neighbor 5000 5000 23
real 0m5.993s
user 0m20.277s
sys 0m0.088s

export OMP_NUM_THREADS=6
time ./neighbor 5000 5000 23
real 0m4.784s
user 0m25.770s
sys 0m0.096s

Many thanks for your answers


Hi Ivan,

2013/7/17 Ivan Marchesini <ivan.marchesini@gmail.com>:

Dear Soeren, Hamish, Markus M., Markus N.,
sorry for the delay in answering, but I didn't expect that my question
would generate so many answers, and I was on holiday for 10 days
with no way to test your code.
First of all, a quick answer to Markus N.
My colleague is working on high-resolution images and DEMs. In this
particular case he is working on a 1 meter resolution raster map
concerning landslides. He told me that, due to the dimensions of the
landslides, he needs to use a kernel of 501*501 in order to catch the
"signatures" of the phenomena (I don't know the details exactly.. I'm
sorry).

I think it would be good if you could ask your colleague for more details,
since we are all very curious about his computational approach.

I'm not a C developer, but reading your e-mails it seems that the
performance of C code (and r.neighbors is written in C) strongly
depends on the compiler.
Does that mean that by compiling the r.neighbors module in a different
way (I don't know how) we can obtain better results?

We have tested Soeren's latest code on the same machine where the
proprietary software (it is ENVI 5.x) showed those good performances.

The performance is good indeed. Unfortunately, you need to increase the
moving window size from 23 to 501 to get a result comparable to
the computation of your colleague:

./neighbor 5000 5000 501

This will run the neighborhood computation on 25,000,000 cells with a
501x501 pixel moving window.

Best regards
Soeren


just wondering, are there any basic tools for interactively editing rasters (not just the categories)?

the outputs of lidar and other terrain surfaces often have surface "pops", etc., and it would be really nice to have just a few basic, interactive tools to go in and locally/interactively "smooth" things, in the same way one can use the vector tools to individually fix polygons, etc.

not asking for a full Photoshop or GIMP, but something well integrated, such that one could make a change, then look at it in NVIZ, etc. ...

I guess one could get psyched with the raster library … :-/ ?

suggestions, etc. welcome... thanks!!

Chris

Chris wrote:

just wondering, are there any basic tools for interactively editing
rasters (not just the categories)?

Hi,

the module d.rast.edit should help. If you are using GRASS 6.4 on
Windows, you'll want the 6.4.3 RC4 version, since it was recently
fixed there.

Hamish

Hi,
There is wxRasterDigitizer [1], which I worked on earlier in the addons repository. But that is for creating rasters, and it could very well be adapted for raster editing. It has been outdated for a while, and I have been super busy these days with OSSIM GSoC (midterm evaluation approaching). But if you are interested I can update it to work with current svn. wxGUI seems to change a lot these days since refactoring is going on, so an update made now may not work later.

I think the refactoring should be given some more priority and an API should be developed (Anna and Vaclav should have more info on this) which would help to create GUI addons easily, rather than copy-pasting existing code and then modifying it. QGIS provides an API to load user/developer plugins easily without much hassle.

[1] http://trac.osgeo.org/grass/browser/grass-addons/grass7/gui/wxpython/wx.rdigit


Regards,
Rashad