[GRASS-dev] GRASS usage on a cluster: thread safety

Hi,

I am using GRASS 6.4 on a cluster to process maps in parallel.
Each job runs as a batch job in its own mapset. The jobs
are launched via 'qsub' of Grid Engine, which
sends them from the frontend to the various blades of
the cluster. The grassdata directory is shared via
NFS on all blades; the underlying filesystem is XFS.

Unfortunately, with fast tasks within the batch job,
various mysterious errors randomly occur:

...
./launch_SGE_grassjob_MODIS_filt2.sh.e255664:ERROR:
/usr/local/grass-6.4.1svn/etc/lock:
./launch_SGE_grassjob_MODIS_filt2.sh.e254776:ERROR:
/usr/local/grass-6.4.1svn/etc/lock:
./launch_SGE_grassjob_MODIS_filt2.sh.e256639:ERROR: G_getenv():
Variable LOCATION_NAME not set
./launch_SGE_grassjob_MODIS_filt2.sh.e257016:ERROR: Unable to make
mapset element .tmp/blade08
./launch_SGE_grassjob_MODIS_filt2.sh.e255264:ERROR: MAPSET
terra_lst1km20010420.LST_Night_1km.filt.255184 not found
./launch_SGE_grassjob_MODIS_filt2.sh.e256528:ERROR: MAPSET
terra_lst1km20021203.LST_Night_1km.filt.256430 not found
./launch_SGE_grassjob_MODIS_filt2.sh.e256254:ERROR: MAPSET
terra_lst1km20021207.LST_Night_1km.filt.256434 not found
./launch_SGE_grassjob_MODIS_filt2.sh.e256415:ERROR:
/usr/local/grass-6.4.1svn/etc/lock:
./launch_SGE_grassjob_MODIS_filt2.sh.e254717:ERROR: Unable to make
mapset element .tmp/blade02
./launch_SGE_grassjob_MODIS_filt2.sh.e256033:ERROR: Unable to make
mapset element .tmp/blade07
./launch_SGE_grassjob_MODIS_filt2.sh.e256722:ERROR: Unable to make
mapset element .tmp/blade07
./launch_SGE_grassjob_MODIS_filt2.sh.e257642:ERROR: Unable to make
mapset element .tmp/blade11
./launch_SGE_grassjob_MODIS_filt2.sh.e255185:ERROR: Unable to make
mapset element .tmp/blade03
./launch_SGE_grassjob_MODIS_filt2.sh.e254745:ERROR: Unable to make
mapset element .tmp/blade02
./launch_SGE_grassjob_MODIS_filt2.sh.e255088:ERROR: G_getenv():
Variable LOCATION_NAME not set
./launch_SGE_grassjob_MODIS_filt2.sh.e256473:ERROR: Unable to make
mapset element .tmp/blade08
./launch_SGE_grassjob_MODIS_filt2.sh.e257003:ERROR: Unable to make
mapset element .tmp/blade12
./launch_SGE_grassjob_MODIS_filt2.sh.e257257:ERROR: MAPSET
aqua_lst1km20031222.LST_Night_1km.filt.257168 not found
./launch_SGE_grassjob_MODIS_filt2.sh.e257696:ERROR: MAPSET
terra_lst1km20030310.LST_Night_1km.filt.257607 not found
...

I wonder how this can happen if the jobs are launched
independently.
About 10-20% of the jobs are affected (n=13600). My
current approach is to relaunch the erroneous
jobs until all are done, but that's rather annoying...

How can I track that down?

Markus

Hi,

Back to this issue, now with a solution:

On Sun, Jan 9, 2011 at 12:25 PM, Markus Neteler <neteler@osgeo.org> wrote:

> Hi,
>
> I am using 6.4 on a cluster to process maps in parallel.
> Each job runs as batch job in its own mapset. The jobs
> are launched via 'qsub' of Grid Engine which
> sends it from the frontend to the various blades of
> the cluster. The grassdata directory is shared via
> NFS on all blades, the filesystem is XFS.
>
> Unfortunately, with fast tasks within the batch job,
> various mysterious errors randomly occur:
>
> ...
> ./launch_SGE_grassjob_MODIS_filt2.sh.e255664:ERROR:
> /usr/local/grass-6.4.1svn/etc/lock:
> ./launch_SGE_grassjob_MODIS_filt2.sh.e254776:ERROR:
> /usr/local/grass-6.4.1svn/etc/lock:
> ./launch_SGE_grassjob_MODIS_filt2.sh.e256639:ERROR: G_getenv():
> Variable LOCATION_NAME not set
> ./launch_SGE_grassjob_MODIS_filt2.sh.e257016:ERROR: Unable to make
> mapset element .tmp/blade08
> ./launch_SGE_grassjob_MODIS_filt2.sh.e255264:ERROR: MAPSET
> terra_lst1km20010420.LST_Night_1km.filt.255184 not found
>
> ...
>
> ./launch_SGE_grassjob_MODIS_filt2.sh.e255088:ERROR: G_getenv():
> Variable LOCATION_NAME not set
> ./launch_SGE_grassjob_MODIS_filt2.sh.e256473:ERROR: Unable to make
> mapset element .tmp/blade08
>
> ...

Lessons learned (see also
http://grass.osgeo.org/wiki/Parallel_GRASS_jobs#Hints_for_NFS_users):

* Locking problem:
lib/init/init.sh
#"$ETC/lock" "$lockfile" $$

is not NFS safe. A solution is to use "lockfile" from the procmail package instead:
lockfile "$lockfile"

* There are other spurious NFS race conditions in lib/init/init.sh that I could
not identify.
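
The locking replacement from the first point could be wrapped roughly like this. It is only a minimal sketch, assuming procmail's "lockfile" is installed on every blade; the retry values, the function names and the default lock path are illustrative and not taken from init.sh.

```shell
#!/bin/sh
# Sketch: mapset locking with procmail's lockfile(1) instead of $ETC/lock.
# Assumptions: lockfile is installed on all blades; the retry values, the
# function names and the default lock path below are illustrative only.

# init.sh passes "$lockfile" to the locker; fall back to a temp path here.
lockfile_path="${lockfile:-${TMPDIR:-/tmp}/.gislock.$$}"

acquire_lock() {
    # -1: sleep 1 second between attempts; -r 5: give up after 5 retries.
    # lockfile exits non-zero if the lock could not be obtained.
    lockfile -1 -r 5 "$1"
}

release_lock() {
    # lockfile creates the lock file read-only, so force the removal.
    rm -f "$1"
}

if command -v lockfile >/dev/null 2>&1; then
    if acquire_lock "$lockfile_path"; then
        echo "lock acquired: $lockfile_path"
        # ... run the GRASS commands for this job here ...
        release_lock "$lockfile_path"
    else
        echo "mapset is locked by another job: $lockfile_path" >&2
    fi
else
    echo "lockfile (procmail) not found; install it on every blade" >&2
fi
```

Bounding the retries keeps a job from hanging forever on a stale lock; whether 5 retries is appropriate depends on the cluster.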

Solution:

Don't use this start script at all! Instead, define the GRASS
environment yourself by setting the needed variables; for this, see
http://grass.osgeo.org/wiki/GRASS_and_Shell#Automated_batch_jobs:_Setting_the_GRASS_environmental_variables
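
As a rough illustration of what setting the variables by hand looks like, here is a minimal sketch for a GRASS 6 batch job. GISBASE is taken from the install path in the error messages above; the database path, location name and mapset name are hypothetical placeholders, and the wiki page remains the reference for the full list of variables.

```shell
#!/bin/sh
# Sketch: set up a GRASS 6 session without the init.sh start script.
# Assumptions: GISBASE matches the install path seen in the error log;
# the database path, location and mapset names are hypothetical.

GISBASE="${GISBASE:-/usr/local/grass-6.4.1svn}"
GISDBASE="${GISDBASE:-$HOME/grassdata}"        # hypothetical
LOCATION_NAME="${LOCATION_NAME:-mylocation}"   # hypothetical
MAPSET="${MAPSET:-job_$$}"                     # hypothetical, one per job

# Write a private GISRC file per job so that concurrent jobs never
# read or overwrite each other's session settings.
GISRC="${TMPDIR:-/tmp}/grass6-$(whoami)-$$.gisrc"
cat > "$GISRC" <<EOF
GISDBASE: $GISDBASE
LOCATION_NAME: $LOCATION_NAME
MAPSET: $MAPSET
GRASS_GUI: text
EOF

export GISBASE GISRC
export GIS_LOCK=$$
export PATH="$PATH:$GISBASE/bin:$GISBASE/scripts"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$GISBASE/lib"

# Remove the per-job file when the job ends.
trap 'rm -f "$GISRC"' EXIT

# GRASS modules can now be called directly, provided the mapset named
# in GISRC exists in the location.
```

Keeping GISRC in a per-job file named after the process ID is the design choice that matters here: nothing in the session setup is shared between jobs.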

I have now documented the procedure for setting up jobs on a
Grid Engine cluster at:
http://grass.osgeo.org/wiki/Parallel_GRASS_jobs#Grid_Engine
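
For orientation, submitting one job per map could look roughly like the sketch below. The job script name, the MYMAP variable and the map names are hypothetical; the wiki page above describes the actual procedure. The example runs as a dry run that only prints the qsub commands.

```shell
#!/bin/sh
# Sketch: submit one Grid Engine job per input map.
# Assumptions: the job script name, the MYMAP variable and the map names
# are hypothetical. QSUB is set to "echo qsub" for a dry run.

QSUB="${QSUB:-qsub}"
JOBSCRIPT="${JOBSCRIPT:-./launch_grassjob.sh}"   # hypothetical job script

submit_jobs() {
    for map in "$@"; do
        # -cwd: run the job in the current directory
        # -S:   shell used to interpret the job script
        # -v:   pass the map name into the job's environment
        # $QSUB is intentionally unquoted so "echo qsub" splits in two.
        $QSUB -cwd -S /bin/sh -v MYMAP="$map" "$JOBSCRIPT"
    done
}

# Dry run with two hypothetical map names; remove the next line to
# submit for real.
QSUB="echo qsub"
submit_jobs map_a map_b
```

Inside the job script, MYMAP would then select the map to process and the name of the job's own mapset.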

cheers
Markus