[GRASS-dev] HPC support and implementation options

(I take liberty to fork this out from
Re: [GRASS-dev] Adding an expert mode to the parser
For the archive reference:
https://lists.osgeo.org/pipermail/grass-dev/2016-September/082520.html
)

On Sun, Sep 25, 2016 at 9:49 PM, Markus Neteler <neteler@osgeo.org> wrote:

On Wed, Sep 28, 2016 at 10:51 PM, Markus Metz <markus.metz.giswork@gmail.com> wrote:

On Thu, Sep 29, 2016 at 12:03 AM, Sören Gebbert <soerengebbert@googlemail.com> wrote:
[snip]

As an example, when aiming at processing all Sentinel-2 tiles
globally, we currently speak about 73000 scenes * up to 16 tiles
each, along with global data. Analysis on top of other global data
is more complex when each job runs in its own mapset and has to be
reintegrated into a single target mapset than it would be if the
jobs could be processed in parallel in one mapset, simply by
specifying the respective region to the command of interest.
Yes, this is different from the current paradigm and not for G7.

From our common experience, I would say that creating separate
mapsets is a safety feature. If anything goes wrong with that
particular processing chain, cleaning up is easy: simply delete this
particular mapset and run the job again, if possible on a different
host/node (assuming that failed jobs are logged). Anyway, I would be
surprised if the overhead of opening a separate mapset were
measurable when processing all Sentinel-2 tiles globally.
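
A minimal sketch of this per-job mapset pattern, using the
GRASS_BATCH_JOB mechanism (paths, script names and the job ID are
placeholders):

    # run one job non-interactively in its own mapset; -c creates the mapset
    export GRASS_BATCH_JOB=/scripts/job_0042.sh
    grass70 -c /grassdata/world_s2/job_0042
    if [ $? -ne 0 ]; then
        # cleaning up means simply deleting the mapset directory
        rm -rf /grassdata/world_s2/job_0042
        # log the failure so the job can be rerun on another host/node
        echo "job_0042" >> failed_jobs.log
    fi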

Generally I agree, and in our MODIS experience it worked fine on a
"standalone" cluster system with local disks in each blade.

Reintegration into a single target mapset could cause problems with
regard to IO saturation, but in an HPC environment temporary data
always needs to be copied to a final target location at some stage.
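
For the reintegration step, a minimal sketch run from inside the
target mapset could look like this (mapset and map names are
placeholders):

    # make the finished job mapset visible in the search path
    g.mapsets mapset=job_0042 operation=add
    # copy the result into the current (target) mapset
    g.copy raster=ndvi@job_0042,ndvi_0042
    # afterwards the temporary mapset can simply be deleted on disk
    rm -rf /grassdata/world_s2/job_0042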

Yes, with an internal connection of at least 10 Gb/s it worked decently.

The HPC system you are using now is most probably quite different
from the one we used previously, so this is a lot of guessing,
particularly about the storage location of temporary data (regardless
of whether it is in the same mapset or a separate mapset).

Indeed, one of the systems we are currently using is completely
virtualized, i.e. all disks are attached via a network which is,
AFAIK, itself virtualized. Hence there are no dedicated resources,
but competition with other, unknown users of the system.
I am still trying to understand how to optimize things there...

Imagine you have a tool that is able to distribute the processing of
a large time series of satellite images across a cluster. Each node
in the cluster should process a stack of r.mapcalc, r.series or
r.neighbors commands in a local temporary mapset, which is later
merged into a single target mapset. A single stack of commands may
contain hundreds of jobs that all run in a single temporary mapset.
In this scenario you need separate region settings for each command
in the stack, because of the different spatial extents of the
satellite images. The size of the stack depends on the size of the
time series (number of maps) and the number of available nodes.
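
With the current parser, per-command regions can be approximated via
the documented WIND_OVERRIDE variable, which points a command to a
saved named region instead of the mapset's current region; a minimal
sketch for two jobs in one mapset (map and region names are
placeholders):

    # save one region per satellite image
    g.region raster=scene_0001 save=region_0001
    g.region raster=scene_0002 save=region_0002
    # run the commands in parallel, each with its own region
    WIND_OVERRIDE=region_0001 r.neighbors input=scene_0001 \
        output=scene_0001_smooth size=5 &
    WIND_OVERRIDE=region_0002 r.neighbors input=scene_0002 \
        output=scene_0002_smooth size=5 &
    wait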

Having region-setting options in the parser would make the
implementation of such an approach much easier. Commands like
t.rast.algebra and t.rast.neighbors would directly benefit from a
region parser option, allowing the parallel processing of satellite
image time series on a multi-core machine.
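
To illustrate the proposal, a purely hypothetical invocation with
such a parser option (the region= option does not exist; the syntax
below is only a sketch):

    # hypothetical: the parser would apply the given region to this
    # command only, without touching the mapset's current region
    r.neighbors input=scene_0001 output=scene_0001_smooth size=5 \
        region="n=4321000,s=4210000,e=310000,w=200000,res=10" &
    r.neighbors input=scene_0002 output=scene_0002_smooth size=5 \
        region="n=4433000,s=4321000,e=310000,w=200000,res=10" &
    wait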

Yes - the key issue is that such virtualized cluster systems behave
quite differently from the bare-metal system we used to have in
Italy.

Best regards
Soeren

Best,
markusN