Ivan wrote:
the region is 4312*5576
the moving window is 501
GRASS is the stable version, on a machine with 8 cores and 32 GB RAM.
Ubuntu 12.04
it seems that the proprietary software is able to perform the analysis
in 2-3 seconds
I expect he's probably correct in that statement, but the difference is the
*compiler* used, not the code behind it, and GRASS compiled in the same way
would be just as fast.
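(For reference, the kind of run Ivan describes would be something along
these lines in GRASS; the map names here are just placeholders:)

g.region rows=4312 cols=5576
r.neighbors input=yourmap output=yourmap_avg method=average size=501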
Sören wrote:
this sounds very interesting.
Your map has a size of 4312*5576 pixels? That's about 100 MB in the case of
a type integer or type float map, or about 200 MB in the case of a type
double map. You must have a very fast HD or SSD to read and write such
a map in under 2-3 seconds?
500 MB/s I/O for an SSD is not unusual, and 300 MB/s for spinning-platter
RAID is pretty common. It's good to run a few replicates of the benchmark
so that from the 2nd run onward the data is already cached in RAM (as long
as the region is not too huge to hold it there).
In case your moving window has a size of 501 pixels (not 501x501 pixels!),
the number of operations that must be performed is at least 4312*5576*501.
That's about 12 billion ops. Amazing to do this in 2-3 seconds.
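Quick sanity check on those numbers (assuming 4-byte CELL/FCELL and 8-byte
DCELL cells, which is how GRASS stores its raster types):

#include <stdio.h>

int main(void)
{
    long long cells = 4312LL * 5576LL;    /* 24,043,712 cells */

    printf("float map : ~%.0f MB\n", cells * 4 / 1e6);   /* ~96 MB  */
    printf("double map: ~%.0f MB\n", cells * 8 / 1e6);   /* ~192 MB */
    printf("operations: %.2e\n", (double)cells * 501);   /* ~1.2e10 */
    return 0;
}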
I have written a little program to see how my Intel Core i5 performs
processing this number of operations. Well, it needs about 100 seconds.
I was able to get the same thing down to just over 1 second of wall-time on
a plain consumer desktop chip. (!)
Here is the code, compiled with optimization:
#include <stdio.h>

int main()
{
    unsigned int i, j, k;
    register double v = 0.0;

    /* one accumulation per (cell, window-position) pair:
       4321 * 5576 * 501 iterations */
    for (i = 0; i < 4321; i++) {
        for (j = 0; j < 5576; j++) {
            for (k = 0; k < 501; k++) {
                v = v + (double)(i + j + k) / 3.0;
            }
        }
    }
    printf("v is %g\n", v);
    return 0;
}
soeren@vostro:~/src$ gcc -O3 numtest.c -o numtest
soeren@vostro:~/src$ time ./numtest
v is 2.09131e+13
real    1m49.292s
user    1m49.223s
sys     0m0.000s
Your proprietary software must run highly parallel, using a fast
GPU or an ASIC, to keep the processing time under 2-3 seconds?
Unfortunately r.neighbors is not able to compete with such
powerful software,
sure it is! 
since it does not read the entire map into RAM and does not run
on GPUs or ASICs. But r.neighbors is able to process maps that
are too large to fit into RAM.
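That streaming design is worth spelling out, since it's exactly why the map
size doesn't matter: only a window's worth of rows is ever resident. A rough
sketch of the idea (a simplified vertical-window mean with made-up
read_row/write_row stand-ins for the raster I/O layer, not the actual
r.neighbors code; with a 501-row window it holds about 22 MB instead of the
~192 MB map):

#include <stdio.h>
#include <stdlib.h>

#define ROWS 4312
#define COLS 5576
#define SIZE 501                 /* window size (odd) */

/* hypothetical stand-ins for the real raster I/O layer */
static void read_row(int row, double *buf)
{
    for (int c = 0; c < COLS; c++)
        buf[c] = row + c;        /* fake data */
}
static void write_row(int row, const double *buf)
{
    (void)row; (void)buf;        /* would write one output row here */
}

int main(void)
{
    double *ring[SIZE];          /* ring buffer: the last SIZE rows read */
    double *out = malloc(COLS * sizeof *out);
    int half = SIZE / 2;

    for (int i = 0; i < SIZE; i++)
        ring[i] = malloc(COLS * sizeof **ring);

    for (int row = 0; row < ROWS; row++) {
        read_row(row, ring[row % SIZE]);   /* newest row replaces oldest */

        int center = row - half;           /* row whose window is now complete */
        if (center < half)
            continue;                      /* skip the top edge for brevity */

        for (int c = 0; c < COLS; c++) {
            double sum = 0.0;
            for (int k = -half; k <= half; k++)
                sum += ring[(center + k) % SIZE][c];
            out[c] = sum / SIZE;
        }
        write_row(center, out);
    }
    return 0;
}

(compiles with gcc -std=c99 -O2; edge rows and cleanup omitted for brevity)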
Can you please tell us what software is so incredibly fast?
I ran some quick trials with your sample program, using both gcc 4.6
(Ubuntu 12.04) and Intel's icc 12.1 on the same computer.
Diplomatically speaking, I see gcc 4.8 has just arrived in Debian/sid,
and I look forward to exploring how its new auto-vectorization features
are coming along.
The results, however, are not so diplomatic and speak for themselves...
and for this simple test case* it isn't pretty.
(* which is atypically easy for the compiler to optimize)
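As far as I know, the catch for gcc here is that the inner loop is a
floating-point reduction into v, and gcc won't reorder FP additions unless
you explicitly allow it, so it can't vectorize the sum even at -O3.
Something like this should show what it does or doesn't vectorize (the
verbose flag is the gcc 4.x spelling, and -ffast-math changes rounding, so
the last digits of v may differ):

$ gcc -O3 -march=native -ffast-math -ftree-vectorizer-verbose=1 numtest.c -o numtest
$ time ./numtest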
test system: i7 3770, lots of RAM
replicates are presented side by side as columns, separated by '|'.
standard gcc, with & without -O3 and -march=native: (all ~same)
real 1m14.507s | 1m14.559s | 1m14.513s | 1m14.514s
user 1m14.289s | 1m14.305s | 1m14.297s | 1m14.297s
sys 0m0.000s | 0m0.028s | 0m0.000s | 0m0.000s
--
standard Intel icc with & without -O3:
v is 2.09131e+13
real 0m21.979s | 0m21.967s | 0m21.958s | 0m21.994s
user 0m21.909s | 0m21.901s | 0m21.897s | 0m21.929s
sys 0m0.000s | 0m0.000s | 0m0.000s | 0m0.000s
--
icc with the "-fast" compiler switch:
$ icc -fast soeren_speed_test.c -o soeren_speed_test_icc_fast
# note 900kb executable vs. gcc's 8kb.
$ time ./soeren_speed_test_icc_fast
v is 2.09131e+13
real 0m3.273s | 0m3.274s | 0m3.275s
user 0m3.260s | 0m3.260s | 0m3.264s
sys 0m0.000s | 0m0.000s | 0m0.000s
(there's your 3 seconds)
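(If I remember right, on Linux -fast is just shorthand for a handful of
flags, roughly the line below, so the win is mostly -xHost vectorization
plus IPO and static linking, which also explains the binary size:)

$ icc -ipo -O3 -no-prec-div -static -xHost soeren_speed_test.c -o soeren_speed_test_icc_fast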
--
icc -funroll-loops:
real 0m22.008s | 0m21.998s
user 0m21.941s | 0m21.929s
sys 0m0.000s | 0m0.000s
(no extra gain in this case)
--
icc -parallel: (running on 8 hyperthreaded (i.e. 4 real) cores)
# binary size: 30kb
real 0m6.034s | 0m6.005s | 0m6.005s
user 0m46.531s | 0m46.603s | 0m46.519s
sys 0m0.024s | 0m0.028s | 0m0.044s
--
icc -parallel -fast:
# binary size 2.2 megabytes
$ time ./soeren_speed_test_icc_parallel+fast
v is 2.09131e+13
real 0m1.002s | 0m1.002s | 0m1.002s
user 0m6.768s | 0m6.796s | 0m6.780s
sys 0m0.004s | 0m0.004s | 0m0.008s
I tried a number of times but couldn't break the 1 second barrier. 
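For anyone without icc, roughly the same trick can be done by hand with
OpenMP; a minimal sketch of what -parallel is doing automatically (the
reduction clause gives each thread a private partial sum, so the summation
order, and hence the last digits of v, may differ):

#include <stdio.h>

int main(void)
{
    double v = 0.0;

    /* split the outer loop across threads; combine partial sums at the end */
    #pragma omp parallel for reduction(+:v)
    for (unsigned int i = 0; i < 4321; i++)
        for (unsigned int j = 0; j < 5576; j++)
            for (unsigned int k = 0; k < 501; k++)
                v += (double)(i + j + k) / 3.0;

    printf("v is %g\n", v);
    return 0;
}

$ gcc -std=c99 -O3 -fopenmp numtest_omp.c -o numtest_omp
$ time ./numtest_omp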
-----
I also ran it on an AMD Phenom II X6 1090T (icc -xHost --> -xSSSE3 ?)
All times are "real" wall-clock unless shown as real/user/sys triplets;
all output was "v is 2.09131e+13".
gcc 4.4.5 with standard opts: 7kb binary
== near-parity single-threaded performance from the 2-year-old AMD Phenom
and an older copy of gcc (stock Debian/squeeze) versus the new i7 chip!
1m16.175s | 1m15.634s | 1m16.029s
icc 12.1 with standard-opts:
0m32.975s | 0m33.079s | 0m33.249s
icc with "-fast" opt: (700kb binary)
0m9.577s | 0m9.572s | 0m9.583s
icc with -parallel auto-MP: (31kb binary)
== again near parity with the new i7 chip, even with the Intel-biased
compiler. "user" CPU-time was actually less: the advantage of 6 real
cores vs. 4 real + 4 virtual ones.*
real 0m6.406s | 0m6.404s | 0m6.404s
user 0m37.106s | 0m37.170s | 0m37.106s
sys  0m0.044s | 0m0.040s | 0m0.028s
icc with -fast and -parallel: (2mb binary)
real 0m2.002s | 0m2.002s | 0m2.002s
user 0m10.765s | 0m10.769s | 0m10.769s
sys  0m0.016s | 0m0.012s | 0m0.008s
(* I know from some earlier tests that hyperthreading carries computational
overhead: on a 12-real + 12-virtual core Xeon, using about 11 real cores
took the same wall-clock time as 12 real + 5 virtual cores; it only beat
the 12 real cores once I got up to about 19 total cores, and the full 24
cores gained only a few percentage points more, with very much diminishing
returns.)
regards,
Hamish
ps- we have to pay for icc/ifort academic (research) licenses now, but
the student (homework/classroom) license for Linux is still gratis
if you dig around their dev website. Also, AMD has their Open64
compiler to play with: http://developer.amd.com/tools/open64/Pages/