[GRASS-user] Multicore Processing and Temporary File Cleanup

Dear All,

I have looked around on other postings and it appears that the majority (if
not all) of the GRASS libraries are NOT thread safe. Unfortunately I have a
very large processing job that would benefit from cluster processing. I
have written a script that can be run on multiple processors whilst being
very careful not to allow different processes to try to modify the same data
at any point. The same raster file is not accessed by different processes
at all in fact.

However, I also realise that alone might not solve all my problems. In any
one process some temporary files are created (by GRASS libraries) and then
these are deleted on startup ("cleaning temporary files..."). Now I was
wondering what these temporary files were and if there might be a problem
with one process creating temporary files that it needs whilst another
process starts up GRASS and deletes them. Is there any way to call GRASS in
a way that doesn't delete the temporary files?

I appreciate that I'm trying to do something that GRASS doesn't really
support but I was hoping that it might be possible to fiddle around and find
a way. Any help would be gratefully received.

I have included the script that I'm trying to run below (the script will be
run many times across multiple processors). Any advice welcome:

http://www.nabble.com/file/p15441565/example

Joseph,

I am using a PBS-based cluster right now to elaborate MODIS
satellite data. Some answers below:

On Feb 13, 2008 2:43 PM, joechip90 <joechip90@googlemail.com> wrote:

Dear All,

I have looked around on other postings and it appears that the majority (if
not all) of the GRASS libraries are NOT thread safe.

Yes, unfortunately true.

Unfortunately I have a
very large processing job that would benefit from cluster processing. I
have written a script that can be run on multiple processors whilst being
very careful not to allow different processes to try to modify the same data
at any point. The same raster file is not accessed by different processes
at all in fact.

Yes, fine. Essentially there are at least two approaches to "poor man's"
parallelization without modifying the GRASS source code:

- split the map into spatial chunks (possibly with overlap to get smooth results) - see the sketch below
- time series: run each map's elaboration on a different node.
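
For the first approach, here is a rough, untested sketch of the idea (all
map, variable, and chunk names are made up; each job runs inside its own
GRASS session/mapset, as discussed below):

------- snip -----------
#!/bin/bash
# Sketch: each job elaborates one horizontal band of input_map.
# CHUNK (0..NCHUNKS-1) would come from the scheduler.
CHUNK=$1
NCHUNKS=4
eval `g.region -g rast=input_map`    # sets $n, $s, $e, $w, ...
N_EDGE=`echo "$n $s" | awk -v k=$NCHUNKS -v i=$CHUNK '{print $1-($1-$2)/k*i}'`
S_EDGE=`echo "$n $s" | awk -v k=$NCHUNKS -v i=$CHUNK '{print $1-($1-$2)/k*(i+1)}'`
g.region n=$N_EDGE s=$S_EDGE         # restrict the region to this band;
                                     # widen the edges a bit for overlap
r.mapcalc "band${CHUNK}_result = input_map * 2"   # stand-in elaboration
------- snap ----------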

However, I also realise that alone might not solve all my problems. In any
one process some temporary files are created (by GRASS libraries) and then
these are deleted on startup (cleaning temporary files...). Now I was
wondering what these temporary files were and if there might be a problem
with one process creating temporary files that it needs whilst another
process starts up GRASS and deletes them. Is there any way to call GRASS in
a way that doesn't delete the temporary files?

You could just modify the start script and remove the call to "clean_temp"
(it is made from the Init.sh start script under $GISBASE/etc, if I recall
correctly). BUT:
I am currently elaborating some thousand maps for the same region (time
series). I elaborate each map in the same location but in a different mapset
(simply using the map name as the mapset name). At the end of the elaboration
I call a second batch job which only contains g.copy, to copy the result into
a common mapset. There is a low risk of a race condition here in case two
nodes finish at the same time, but even this could be trapped in a loop which
checks whether the target mapset is locked and, if needed, launches g.copy
again until it succeeds.

I appreciate that I'm trying to do something that GRASS doesn't really
support but I was hoping that it might be possible to fiddle around and find
a way. Any help would be gratefully received.

To some extent GRASS supports what you need.
I have drafted a related wiki page at:
http://grass.gdf-hannover.de/wiki/Parallel_GRASS_jobs

Feel free to hack that page!

Good luck,
Markus

Thank you Markus, your wiki entry is most helpful.

It seems I need to make a few changes to my files and set up a large number of mapsets in every location. Is it appropriate, then, to have multiple mapsets (one for each node) in a given location? If so, is there a way to automatically generate multiple mapsets in a given location, such that I can jump straight into GRASS using a script along the following lines in each of the processes (I will have thousands of processes)?

#!/bin/bash

declare -r PROCESS_NUM=__ # Some allocated process number, e.g. $SGE_TASK_ID for Sun Grid Engine

# Other non-GRASS commands here - in my script there is a call to an external database
# to download parameter values

grass62 -text database/location/${PROCESS_NUM}_mapset <<!
    # Some grass commands here
!

Each mapset would then contain the spatial data that each process will use. You suggest then copying the output into a single shared mapset such as PERMANENT. For my purposes I'll probably just save the results as text files (the data then gets transferred to another program for the next stages of processing).

Again many thanks,


On Feb 13, 2008 7:46 PM, Joseph Chipperfield <joechip90@googlemail.com> wrote:

Thank you Markus your Wiki entry is most helpful.

It seems I need to make a few changes to my files and set up a large
number of mapsets in every location. Is it appropriate, then, to have
multiple mapsets (one for each node) in a given location?

Sure!

If so, is
there a way to automatically generate multiple mapsets in a given
location, such that I can jump straight into GRASS using a script
along the following lines in each of the processes (I will have thousands of
processes)?

Yes. When you start GRASS with the path to grassdata/location/mapset/ and
the mapset does not exist, it will be automatically created.
(hint for Hamish: this will then create a valid mapset, i.e. incl. DBF driver
predefined - see grass-dev discussions)
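
For instance (path and mapset name made up), this call creates the mapset
node_7 if it does not yet exist and starts GRASS directly in it:

  grass63 -text /shareddisk/grassdata/myloc/node_7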

As a first step in your script, be sure to run
  g.mapsets add=mapset1_with_data[,mapset2_with_data]
to make the data to be elaborated accessible.

I am processing thousands of MODIS maps like that right now. GRASS is
launched once per map; in practice I loop over many map names, like
"aqua_lst1km20020706.LST_Night_1km.filt":

------- snip -----------
MYMAPSET=$CURRMAP
TARGETMAPSET=results

export GRASS_BATCH_JOB=/shareddisk/modis_job.sh   # exported so grass63 picks it up
grass63 -text /shareddisk/grassdata/myloc/$MYMAPSET

# copy over result to target mapset
export INMAP=${CURRMAP}_rst
export INMAPSET=$MYMAPSET
export OUTMAP=$INMAP
export GRASS_BATCH_JOB=/shareddisk/gcopyjob.sh
grass63 -text /shareddisk/grassdata/myloc/$TARGETMAPSET
exit 0
------- snap ----------

You see that I run GRASS twice. Note that you need GRASS 6.3 to
make use of GRASS_BATCH_JOB (if present, GRASS automatically
executes that job instead of launching the user interface).
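
The script modis_job.sh contains the actual elaboration; a hypothetical
minimal version (the source mapset name and the r.mapcalc line are just
placeholders, and $CURRMAP must be exported by the launcher) could be:

------- snip -----------
#!/bin/bash
# hypothetical modis_job.sh, executed inside the per-map mapset
g.mapsets add=modis_raw             # make the source mapset visible (placeholder name)
g.region rast=$CURRMAP              # match the region to the input map
r.mapcalc "${CURRMAP}_rst = $CURRMAP"   # placeholder for the real filtering
------- snap ----------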

The script gcopyjob.sh simply contains
------- snip -----------
g.copy rast=$INMAP@$INMAPSET,$OUTMAP --o
------- snap ----------
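
To guard against the race mentioned above (two nodes trying to enter the
results mapset at once), the copy step can be wrapped in a retry loop; a
sketch, assuming the start script returns non-zero while the target mapset
is locked:

------- snip -----------
export GRASS_BATCH_JOB=/shareddisk/gcopyjob.sh
until grass63 -text /shareddisk/grassdata/myloc/$TARGETMAPSET; do
    sleep 5    # target mapset locked by another node; try again
done
------- snap ----------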

That's it!

Your script suggestion is essentially right. Only, you would do better to
get a recent GRASS 6.3 to avoid a nightmare :-)

Each mapset would then contain the spatial data that each process
will use. You suggest then copying the output into a single shared
mapset such as PERMANENT. For my purposes I'll probably just save the
results as text files (the data then gets transferred to another program
for the next stages of processing).

Sure - as you prefer. I put the elaborated MODIS maps into a single mapset
for easy takeaway at the end.
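
If you go the text file route, a single line at the end of your batch job
would do it, e.g. (output map name and path are hypothetical):

------- snip -----------
r.out.ascii input=${CURRMAP}_rst output=/shareddisk/results/${CURRMAP}.txt
------- snap ----------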

I have extended
http://grass.gdf-hannover.de/wiki/Parallel_GRASS_jobs

Cheers
Markus

--
Markus Neteler
Fondazione Mach - Centre for Alpine Ecology
38100 Viote del Monte Bondone (Trento), Italy
neteler AT cealp.it http://www.cealp.it/


Hi Markus,

Many thanks for all your help. Thanks to your link to the wiki I've managed to get this 'poor man's parallelisation' up and running on our university's cluster and we've not had any problems thus far.
