[GRASS-dev] Benchmark the overhead of calling GRASS modules

Panagiotis_Mavrogior · April 29, 2019, 6:49am

Hello all

You might find it easier to read the following text in a gist

Introduction

I was trying to write a decorator/context manager that would temporarily change the
computational region, but while using it I noticed some overhead, the root of which
seemed to be the use of GRASS modules. To quantify this, I wrote a small benchmark
that tries to measure the overhead of calling a GRASS module.
This is what I found.

tl;dr

Calling a GRASS module incurs a constant but measurable overhead which, in certain
cases, e.g. when writing a module that calls many other modules, can quickly
add up to a significant amount of time.

Disclaimer

If you try to run the benchmark on your own PC, the actual timings you will get will
probably be different. The differences might be caused by having:

a stronger/weaker CPU
a faster/slower hard disk
a different Python version
different compilation flags

Still, I think that some of the findings are reproducible.

For reference, I used:

OS: Linux 5.0.9
CPU: Intel i7 6700HQ
Disk type: SSD
Python: 3.7
CFLAGS='-O2 -fPIC -march=native -std=gnu99'

Demonstration

The easiest way to demonstrate the performance difference between using a GRASS module
vs using the GRASS API is to run the following snippets.

Both of them do exactly the same thing, i.e. retrieve the current region settings 10
times in a row. The performance difference is huge though. On my laptop, the first one
needs 0.36 seconds while the second one needs just 0.00038 seconds. That’s almost
a 1000x difference…

import time
import grass.script as gscript

start = time.time()
for i in range(10):
    # Each iteration spawns a g.region process and parses its output.
    region = gscript.parse_command("g.region", flags="g")
end = time.time()
total = end - start

print("Total time: %s" % total)

vs

import time
from grass.pygrass.gis.region import Region

start = time.time()
for i in range(10):
    # Each iteration reads the region through the pygrass API, in-process.
    region = Region()
end = time.time()
total = end - start

print("Total time: %s" % total)
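
A single run of ten iterations timed with time.time() can be noisy. The following is a minimal sketch of a slightly more robust harness that uses only the standard library. The helper name `mean_msec_per_call` and the stand-in workload are my own assumptions for illustration; they are not part of GRASS or of the benchmark from the post.

```python
import time


def mean_msec_per_call(func, calls=10, repeats=5):
    """Return the best-of-`repeats` mean time per call of `func`, in msec."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(calls):
            func()
        elapsed = time.perf_counter() - start
        # Keep the fastest repeat; it is the one least disturbed by other load.
        best = min(best, elapsed / calls)
    return best * 1000.0


# Stand-in workload so that the sketch runs outside a GRASS session.
print("%.4f msec per call" % mean_msec_per_call(lambda: sum(range(1000))))
```

Inside a GRASS session, the two snippets above could be compared by passing `lambda: gscript.parse_command("g.region", flags="g")` and `Region` as `func`.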

How much is the overhead exactly?

In order to measure the actual overhead of calling a GRASS module, I created two new
GRASS modules that do nothing but parse the command line arguments, and measured how
much time their execution needs. The first module is r.simple and is
implemented in Python, while the other one is r.simple.c and is implemented in C.
The timings are in msec and the benchmark was executed using Python 3.7.

call method                      r.simple (msec)   r.simple.c (msec)
pygrass.module.Module            85.9              66.5
pygrass.module.Module.shortcut   85.5              66.9
grass.script.run_command         41.3              30.5
subprocess.check_call            41.8              30.3

As we can see, gscript.run_command and subprocess give more or less identical
results, which is to be expected since run_command + friends are just a thin wrapper
around subprocess. Similarly, shortcuts have the same overhead as “vanilla”
pygrass.Module. Nevertheless, it is obvious that pygrass is roughly 2x slower
than grass.script (but more about that later).

As far as C vs Python goes, on my computer modules implemented in C seem to need
roughly 25% less time than their Python counterparts. Nevertheless, a 40 msec startup
time doesn’t seem extraordinary for a Python script, while 30 msec feels rather large
for a CLI application implemented in C.
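
The two ratios quoted above can be checked directly from the table. This is only a sketch of the arithmetic; the numbers are copied from the table, while the function names are mine.

```python
# Timings in msec, copied from the table above.
timings = {
    "pygrass.module.Module": {"r.simple": 85.9, "r.simple.c": 66.5},
    "pygrass.module.Module.shortcut": {"r.simple": 85.5, "r.simple.c": 66.9},
    "grass.script.run_command": {"r.simple": 41.3, "r.simple.c": 30.5},
    "subprocess.check_call": {"r.simple": 41.8, "r.simple.c": 30.3},
}


def pygrass_slowdown(module):
    """How many times slower pygrass.module.Module is than run_command."""
    return (timings["pygrass.module.Module"][module]
            / timings["grass.script.run_command"][module])


def c_time_saving(method):
    """Fraction of time saved by the C module relative to the Python one."""
    row = timings[method]
    return 1.0 - row["r.simple.c"] / row["r.simple"]


for module in ("r.simple", "r.simple.c"):
    print("pygrass vs run_command, %s: %.2fx" % (module, pygrass_slowdown(module)))
for method in timings:
    print("C module saves %.0f%% with %s" % (100 * c_time_saving(method), method))
```

The slowdown comes out at about 2.1x to 2.2x, and the C module saves about 22% to 28% depending on the call method, which is consistent with the “roughly 2x” and “25%” figures.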

Where is all that time being spent?

C Modules

Unfortunately, I am not familiar enough with the GRASS internals to easily check what is
going on and I didn’t have the time to try to profile the code. I suspect that
G_gisinit or something similar is causing the overhead, but someone more familiar with
the C API should be able to enlighten us.

Python Modules

In order to gain a better understanding of the overhead we have when calling Python
modules, we also need to measure the following quantities:

The time that Python needs to spawn a new process
The startup time of the Python interpreter
The time that Python needs to import grass
These are the results:

measurement               time (msec)
subprocess spawn          1.2
python 2 startup          9.0
python 2 + import grass   24.5
python 3 startup          18.2
python 3 + import grass   39.3

As we can see:

The overhead of spawning a new process from within Python is more or less negligible
(at least compared to the other quantities).
The overhead of starting a Python 3 interpreter is 2x that of Python 2; i.e. the
transition to Python 3 will have some performance impact, no matter what.
Starting the Python 3 interpreter accounts for roughly 50% of the total
overhead (18 msec out of 41 msec).
The other 50% of the overhead is caused almost entirely by importing the grass
library.
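
Measurements of this kind can be approximated outside GRASS with the standard library alone. The sketch below is not the benchmark.py from the branch; the helper name `mean_msec` is hypothetical, and `json` is used purely as a stand-in for the grass package so that the code runs anywhere. Each figure includes the process spawn itself.

```python
import subprocess
import sys
import time


def mean_msec(cmd, runs=5):
    """Mean wall-clock time in msec for running `cmd` to completion."""
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.check_call(cmd)
    return (time.perf_counter() - start) / runs * 1000.0


# Process spawn plus interpreter startup, nothing else.
startup = mean_msec([sys.executable, "-c", "pass"])
# The same plus one import; "json" is a stand-in for the grass package.
with_import = mean_msec([sys.executable, "-c", "import json"])

print("python startup:       %.1f msec" % startup)
print("python + import json: %.1f msec" % with_import)
```

Within a GRASS session, replacing `import json` with `import grass.script` should give a number comparable to the "python 3 + import grass" row, although the exact value will depend on the machine.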

Why is pygrass 2x slower?

I haven’t looked into this carefully (i.e. I haven’t profiled the code), but it seems
that the culprit is this line.
In other words, pygrass calls a module twice: once to get the interface’s description
and once to actually run the module. That’s why the overhead doubles for both
C modules and Python ones.

Is this truly a problem?

Well, it depends.

The most important factor is probably the size of the computational region. If you are
dealing with really large regions, then the overhead is probably minuscule. If you are
dealing with much smaller ones though, then it can be significant.

That being said, there are at least two cases where I think that this overhead is
important:

interactive sessions
running tests

Just to give an example, i.pansharpen calls ~50 other GRASS modules. On my laptop, the
overhead for these calls is almost 2 seconds. Is this a lot? Well, if you are
pansharpening Landsat tiles, it probably is not that much, but you will have this
overhead even if you are pansharpening a 4-pixel map (e.g. when running tests).
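
As a back-of-the-envelope check, the "almost 2 seconds" figure is what ~50 calls at the per-call costs measured earlier would add up to. The post does not say which mix of C and Python modules i.pansharpen calls, so the sketch below simply brackets the total with the two grass.script.run_command timings; the function name is mine.

```python
def total_overhead_sec(calls, msec_per_call):
    """Cumulative startup overhead, in seconds, of `calls` module invocations."""
    return calls * msec_per_call / 1000.0


# ~50 calls (from the text), per-call cost from the grass.script.run_command row.
for label, msec in (("all C modules", 30.5), ("all Python modules", 41.3)):
    print("%s: %.3f s" % (label, total_overhead_sec(50, msec)))
```

This gives roughly 1.5 s if every call were to a C module and roughly 2.1 s if every call were to a Python module.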

I am pretty sure that there are numerous other modules that are just like that, too.

What is to be done?

I think that there are two different areas where work can be done:

Reduce the actual overhead; i.e. make calling GRASS modules faster.
Replace calls to GRASS modules with API calls which, as shown earlier, are orders of
magnitude faster.

Reduce the overhead

The overhead of Python and C modules has different causes.

As I said, I haven’t looked into what causes C modules to take so long, so I can’t make
any suggestions. This should be looked into further though, because modules
like g.region are used practically everywhere, so speeding them up should
give a measurable performance improvement.

As far as Python modules go, unfortunately, the overhead seems to be relatively
inelastic. Speeding up the startup time of the Python interpreter is not something that
GRASS can do or rely upon. So, the only thing that can actually be done is to try to
reduce the time needed for importing the grass library. That being said, this will
probably not shave off more than a few msec at best, but at least the gains would apply
to all Python modules.

Luckily, there is at least one relatively low-hanging fruit. Speeding up pygrass should
not be that difficult. I haven’t looked into this, but pre-generating the modules’ XML
descriptions seems feasible and it should remove the need to call each module twice,
thus making pygrass performance comparable to grass.script.run_command.
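
To illustrate the idea of not fetching a module's description on every call, here is a minimal in-memory sketch. It is not how pygrass is implemented; `DescriptionCache` and `fake_fetch` are hypothetical names, and the fetch function is a stand-in for the expensive step (in GRASS, the extra process that asks the module for its interface description, e.g. via the `--interface-description` flag).

```python
class DescriptionCache:
    """Cache module interface descriptions so each is fetched at most once."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable: module name -> description text
        self._cache = {}
        self.fetch_count = 0

    def get(self, name):
        if name not in self._cache:
            # Only the first request for a module pays the fetch cost.
            self._cache[name] = self._fetch(name)
            self.fetch_count += 1
        return self._cache[name]


def fake_fetch(name):
    # Stand-in for the expensive step; a real fetch would spawn a process.
    return "<task name='%s'/>" % name


cache = DescriptionCache(fake_fetch)
for _ in range(3):
    cache.get("g.region")
print("fetches for 3 lookups:", cache.fetch_count)
```

An in-memory cache only helps within one Python process. Pre-generating the descriptions, as suggested above, would presumably mean writing them to files once (for example at build time) and reading them from there, so that new processes benefit too.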

Convert module calls to API calls

Converting the module calls to API calls is what can potentially give the biggest
benefits, but at the same time it is what needs the most work.

There is some low-hanging fruit here too. E.g. functions like use_temp_region() and
raster_info() can probably be refactored to use the pygrass API, while calls to e.g.
g.region can probably be replaced with pygrass.gis.Region objects etc.

Nevertheless, this does not really touch the root of the problem which IMHV is the tight
coupling between the GRASS modules’ functionality and the CLI. In layman’s terms, at the
moment, if you want to use module A from module B you are forced to spawn a new process
and suffer the overhead that this entails.

TBH, I am not sure if this is a problem that can even be tackled at this stage, but if each
module had one or more functions that could be imported/called by other modules,
everything would be much easier and performance would be significantly better.
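
The kind of decoupling described here can be sketched with a toy script: the algorithm lives in an importable function, and the command line is only a thin wrapper around it. Everything in this example is hypothetical (the `rescale` function, the options), and it uses argparse rather than the GRASS parser purely so that it runs without GRASS.

```python
import argparse


def rescale(values, factor):
    """The actual algorithm: importable and callable without a subprocess."""
    return [v * factor for v in values]


def main(argv=None):
    """Thin CLI wrapper: parse arguments, call the function, print the result."""
    parser = argparse.ArgumentParser(description="Rescale numbers (toy example)")
    parser.add_argument("--factor", type=float, default=1.0)
    parser.add_argument("values", type=float, nargs="+")
    args = parser.parse_args(argv)
    print(" ".join(str(v) for v in rescale(args.values, args.factor)))
    return 0


# Another "module" can call rescale() directly and skip the process spawn:
print(rescale([1.0, 2.0], 2.0))
# The CLI path stays available for users and scripts:
main(["--factor", "2", "1", "2"])
```

With this layout, module B would import the function from module A and pay only the cost of a function call, while the CLI behaviour stays the same for everyone else.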

How to run the benchmark?

If you want to run this benchmark yourself, you need to:

git remote add panos https://github.com/pmav99/grass-ci.git
git fetch --all
git checkout overhead
cd scripts/r.simple && make && cd ../../
cd raster/r.simple.c && make && cd ../../
Start a GRASS session
python benchmark.py

I haven’t run the benchmark under Python 2, but even if there are incompatibilities,
they should be trivial to fix.

I will be happy to hear any remarks.

with kind regards,
Panos

Markus_Metz · May 1, 2019, 10:03pm

Hi Panos,

IMHO the overhead of calling GRASS modules is insignificant because it is in the range of milliseconds. I am much more concerned with whether executing a GRASS module takes days or hours or minutes.

Also note that the base of GRASS is its C modules, which use the GRASS C library. GRASS Python modules usually call GRASS C modules (or other GRASS Python modules calling GRASS C modules). The first thing a GRASS Python module does is call the GRASS C module g.parser; after that it calls (in the end) some other GRASS C modules. That means it is not straightforward to test the overhead of calling GRASS Python modules vs calling GRASS C modules, because it is really GRASS Python + C modules vs GRASS C modules only. And the overhead is insignificant (not measurable) compared to the actual execution time for larger datasets/regions.

Markus M

grass-dev mailing list
grass-dev@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/grass-dev

wenzeslaus · May 2, 2019, 1:49am

Hi Panos and Markus,

I actually touched on this in my master’s thesis [1, p. 54-58], specifically on the subprocess call overhead (i.e. not import or initialization overheads). I compared the speed of calling a subprocess in Python to that of a Python function call. The reason was that I was calling GRASS modules many times for small portions of my computational region, i.e. I was always changing the region to the area/object of interest within the actual (user-set) computational region. So, the overall process actually involved many subprocess calls, depending on the size of the data. Unfortunately, the thesis does not include a comparison of what the two cases (functions versus subprocesses) would look like in terms of time spent on the whole process.

And speaking more generally, it seems to me that the functionality-CLI coupling issue is what might be partially fueling Facundo’s GSoC proposal (Python package for topology tools). There, access to functionality does not seem direct enough to the user-programmer, with the overhead of the subprocess call as well as the related I/O cost, whether real or perceived, playing a role.

Best,

Vaclav

[1] Petras V. 2013. Building detection from aerial images in GRASS GIS environment. Master’s thesis. Czech Technical University in Prague. http://geo.fsv.cvut.cz/proj/dp/2013/vaclav-petras-dp-2013.pdf


Panagiotis_Mavrogior · May 23, 2019, 1:28pm

hello Markus and Vaclav,

thank you for your feedback. My answer is inline.

On Wed, May 1, 2019 at 6:03 PM Markus Metz <markus.metz.giswork@gmail.com> wrote:

IMHO the overhead of calling GRASS modules is insignificant because it is in the range of milliseconds. I am much more concerned whether executing a GRASS module takes days or hours or minutes.

And the overhead is insignificant (not measurable) compared to actual execution time for larger datasets/regions.

I would argue that this depends on what you are doing. For a single GRASS session using a really big computational region the overhead is obviously negligible; I wrote that in the initial post too. But if you do need to massively parallelize GRASS, then the overhead of setting up the GRASS session and/or calling GRASS modules might be measurable too.

Regardless, the overhead:

can be noticeable while doing exploratory analysis
can be significant while developing GRASS (e.g. when running tests).

BTW, let us also keep in mind that the majority of the tests should be using really small maps/computational regions (note: they currently don’t, but that’s a different issue), which means that the impact of this overhead should be larger.

On Thu, May 2, 2019 at 4:49 AM Vaclav Petras <wenzeslaus@gmail.com> wrote:

Hi Panos and Markus,

I actually touched on this in my master’s thesis [1, p. 54-58], specifically on the subprocess call overhead (i.e. not import or initialization overheads). I compared speed of calling subprocess in Python to a Python function call. The reason was that I was calling GRASS modules many times for small portions of my computational region, i.e. I was changing region always to the area/object of interest within the actual (user set) computational region. So, the overall process involved actually many subprocess calls depending on the size of data. Unfortunately, I don’t have there a comparison of how the two cases (functions versus subprocesses) would look like in terms of time spend for the whole process.

Again, I would argue that the answer depends on what you are doing. Pansharpening a 100-pixel map has a (comparatively) huge overhead. Pansharpening a Landsat tile, not so much. Regardless of that, I think we can all agree that directly calling a function implementing algorithm Foo is always going to be faster than calling a script that calls the same function. Unfortunately, and as you pointed out, perhaps most of the GRASS functionality is only accessible from the CLI and not through an API.

And speaking more generally, it seems to me that the functionality-CLI coupling issue is what might me be partially fueling Facundo’s GSoC proposal (Python package for topology tools). There access to functionality does not seem direct enough to the user-programmer with overhead of subprocess call as well as related I/O cost, whether real or perceived, playing a role.

I can’t speak for Facundo. Nevertheless, whenever I try to work with the API, I do find it limited and it feels that it still has rough edges (e.g. #3833 and #3845). It very soon becomes clear that in order to get work done you need to use the Command Line Interface. As a programmer, I do find this annoying.

Best,

Vaclav

[1] Petras V. 2013. Building detection from aerial images in GRASS GIS environment. Master’s thesis. Czech Technical University in Prague. http://geo.fsv.cvut.cz/proj/dp/2013/vaclav-petras-dp-2013.pdf

Thank you for the link :)

all the best,
Panos