[GRASS-dev] Parallelize a job using multiprocess python library without destroying environmental variable

Hi all,
I'm attempting to parallelize a job in a Python script using the multiprocessing library in grass70.
I had a look at the following links: http://grasswiki.osgeo.org/wiki/Parallel_GRASS_jobs
and http://grasswiki.osgeo.org/wiki/Parallelizing_Scripts.

I would like to work in the same location but in different mapsets, because my jobs change the region settings, but I don't know how to set a separate mapset for each job.

So far I have discovered that these processes, if run in the same mapset, clear all the environment variables (GISDBASE, LOCATION, MAPSET), so GRASS no longer starts and I have to restore the .grass70/rc file.

Can anyone give me a hint on how to set different mapsets for different jobs?

All the best,
Annalisa

On Mon, Jun 30, 2014 at 5:21 AM, Annalisa Minelli <annagrass6@gmail.com>
wrote:

can anyone hint me on how to set different mapsets for different jobs?

First, check whether the PyGRASS GridModule [1] can help you.

For the general case, there is unfortunately no API. From what I understand,
you have to create a "gisrc" file somewhere, then do something like env
= copy(os.environ) and set GISRC there to your custom "gisrc". Then you
change the mapset and region by standard GRASS means, but you must pass the
`env` parameter to all command/module calls (env is used by Python
subprocess to set the environment just for one process).
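
A minimal sketch of the steps described above, assuming the session file consists of plain `KEY: value` lines (GISDBASE, LOCATION_NAME, MAPSET). The helper name `make_job_env` and the example paths are hypothetical and not part of any GRASS API; the returned dictionary is what would be passed as `env` to each command/module call.

```python
import os
import tempfile


def make_job_env(gisdbase, location, mapset, base_env=None):
    """Write a private gisrc file and return an environment pointing to it.

    Hypothetical helper: the caller's environment is copied, never modified.
    """
    # copy the environment so the parent session stays untouched
    env = dict(os.environ if base_env is None else base_env)
    fd, gisrc_path = tempfile.mkstemp(prefix="gisrc_", text=True)
    with os.fdopen(fd, "w") as gisrc:
        # assumed file layout: one "KEY: value" pair per line
        gisrc.write("GISDBASE: %s\n" % gisdbase)
        gisrc.write("LOCATION_NAME: %s\n" % location)
        gisrc.write("MAPSET: %s\n" % mapset)
    env["GISRC"] = gisrc_path
    return env


if __name__ == "__main__":
    # illustrative values only
    job_env = make_job_env("/tmp/grassdata", "my_location", "job_1")
    print(job_env["GISRC"])
```

The mapset named in the file still has to exist in the location; creating it and setting its region would be done "by standard GRASS means" with the same `env`.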

Note that GISRC, GISBASE and LOCATION are (system) environment variables,
while GISDBASE, LOCATION_NAME and MAPSET are GRASS GIS session/environment
variables and are stored in the "gisrc" file. I have no idea what the
LOCATION variable is for (it contains the full path to the mapset).
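
To illustrate the distinction, here is a small sketch that reads the session variables out of the text of a "gisrc" file, as opposed to looking them up in `os.environ`. It assumes the file holds `KEY: value` lines; the function name `parse_gisrc` and the sample values are made up for the example.

```python
def parse_gisrc(text):
    """Parse the text of a gisrc file into a dict of session variables."""
    session = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        # split on the first colon only, so values may contain colons
        key, value = line.split(":", 1)
        session[key.strip()] = value.strip()
    return session


# sample content with made-up values
example = (
    "GISDBASE: /home/user/grassdata\n"
    "LOCATION_NAME: my_location\n"
    "MAPSET: user1\n"
)
session = parse_gisrc(example)
print(session["MAPSET"])
```

GISRC itself would not appear in this dictionary: it lives in the process environment and only tells GRASS which file to read.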

I would be glad to hear what others think about this.

You can of course read the source code of GridModule, the rendering in wxGUI,
g.gui.animation, or the following snippet, but I don't claim that it will be
easy to understand, and there might be a lot of imperfections.

Vaclav

    # we rely on the tmp dir having enough space for our map
    tgt_gisdbase = tempfile.mkdtemp()
    # this is not needed if we use mkdtemp but why not
    tgt_location = 'r.out.png.proj_location_%s' % epsg_code
    # because we are using PERMANENT we don't have to create mapset
    # explicitly
    tgt_mapset_name = 'PERMANENT'

    src_mapset = Mapset(src_mapset_name)

    # get source (old) and set target (new) GISRC environment variable
    # TODO: set environ only for child processes could be enough and it
    # would enable (?) parallel runs
    src_gisrc = os.environ['GISRC']
    tgt_gisrc = gsetup.write_gisrc(tgt_gisdbase,
                                   tgt_location, tgt_mapset_name)
    # we should use a copy and pass it but then it would not be possible to
    # use create_location
    os.environ['GISRC'] = tgt_gisrc
    if os.environ.get('WIND_OVERRIDE'):
        old_temp_region = os.environ['WIND_OVERRIDE']
        del os.environ['WIND_OVERRIDE']
    else:
        old_temp_region = None
    # these lines look good but anyway when developing the module
    # switching location seemed fragile and on some errors (while running
    # unfinished module) location was switched in the command line

    try:
        # the function itself is not safe for other (background) processes
        # (e.g. GUI), however we already switched GISRC for us
        # and child processes, so we don't influence others
        gcore.create_location(dbase=tgt_gisdbase,
                              location=tgt_location,
                              epsg=epsg_code,
                              datum=None,
                              datum_trans=None)

        # Mapset object cannot be created if the real mapset does not exist
        tgt_mapset = Mapset(gisdbase=tgt_gisdbase, location=tgt_location,
                            mapset=tgt_mapset_name)
        # set the current mapset in the library
        # we actually don't need to switch when only calling modules
        # (right GISRC is enough for them)
        tgt_mapset.current()
...

[1] http://grass.osgeo.org/grass71/manuals/pygrass/modules_grid.html
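
The snippet above stores the old GISRC and WIND_OVERRIDE values before switching, presumably so they can be put back later; the restore step itself is not shown. One way to make that bookkeeping explicit is a context manager. This is only a sketch: the name `switched_gisrc` is hypothetical, and it still changes the environment of the whole process, so it does not by itself make the switch safe for parallel runs.

```python
import os
from contextlib import contextmanager


@contextmanager
def switched_gisrc(tgt_gisrc, environ=os.environ):
    """Temporarily point GISRC at another file and hide WIND_OVERRIDE."""
    src_gisrc = environ.get("GISRC")
    # remove a temporary region, remembering it for later
    old_temp_region = environ.pop("WIND_OVERRIDE", None)
    environ["GISRC"] = tgt_gisrc
    try:
        yield environ
    finally:
        # restore the original session even if the body raised
        if src_gisrc is None:
            environ.pop("GISRC", None)
        else:
            environ["GISRC"] = src_gisrc
        if old_temp_region is not None:
            environ["WIND_OVERRIDE"] = old_temp_region
```

Passing a plain dictionary as `environ` instead of `os.environ` lets the same helper work on a copied environment that is then handed to child processes.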

_______________________________________________
grass-dev mailing list
grass-dev@lists.osgeo.org
http://lists.osgeo.org/mailman/listinfo/grass-dev

Hi Annalisa,

I still need to learn a lot about this and have not tested Vaclav's
advice yet, which is probably the best way to go, but you can take a
look at some scripts I wrote for doing this:

https://github.com/javimarlop/eHabpy/blob/master/pas/tmp/parallel_segmentation_pca.py

https://github.com/javimarlop/eHabpy/blob/master/pas/parallel_grass_example.py

They are working for me, but as Markus Metz once mentioned to me, if
you are not using a cluster and there is a lot of writing/reading from
the same hard disk, you will probably not speed up the processing
considerably. In any case, I am also very interested in further
developing this script, so any ideas are welcome!
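
For readers who want to experiment, here is a rough sketch of running several jobs side by side, each in a child process with its own environment. It is not taken from the scripts linked above. The gisrc paths are placeholders, the command is a stand-in that only echoes the GISRC it received, and the names `run_job` and `run_all` are hypothetical. Threads are used only to wait on the child processes; a multiprocessing pool could be used in a similar way. The `workers` limit is there because, as noted, more jobs on the same disk may not be faster.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_job(mapset, command):
    """Run one command in a child process with a job-specific environment."""
    env = dict(os.environ)  # copy, so the parent session is untouched
    # placeholder path: a real job would point GISRC at a gisrc file
    # written for this mapset
    env["GISRC"] = os.path.join("/tmp", "gisrc_%s" % mapset)
    result = subprocess.run(
        command, env=env, capture_output=True, text=True, check=True
    )
    return mapset, result.stdout.strip()


def run_all(mapsets, command, workers=2):
    """Run the command once per mapset, at most `workers` at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda m: run_job(m, command), mapsets))


if __name__ == "__main__":
    # stand-in for a GRASS module call
    echo = [sys.executable, "-c", "import os; print(os.environ['GISRC'])"]
    print(run_all(["job_1", "job_2", "job_3"], echo))
```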

Cheers,

Javier

Thanks to both of you,
I will have a look at your advice/ideas and let you know if I can solve it!

All the best,
Annalisa
