[GRASS-user] region issues using python multiprocessing.Pool()

Hi all,

I'm attempting to convert WorldView-2 DNs to top of atmosphere
reflectance using the formulas provided in
http://www.digitalglobe.com/sites/default/files/Radiometric_Use_of_WorldView-2_Imagery%20(1).pdf.

If I run the script without multiprocessing enabled, it executes
properly; if I run it with multiprocessing enabled, the script
completes without errors but the output pixel range is 0-0. I think
this is caused by the region settings. Can the use_temp_region()
function be used as it is below to execute multiple mapcalc jobs in
parallel that have different extents? Once I have this working I'd
like to make a more generic tool a la the i.landsat.toar module.

Thanks,
Eric

r.info output from one of the rasters processed in parallel:
| Type of Map: raster Number of Categories: 0 |
| Data Type: DCELL |
| Rows: 4096 |
| Columns: 4096 |
| Total Cells: 16777216 |
| Projection: Lambert Conformal Conic |
| N: 335872.03166667 S: 308995.445 Res: 6.56166667 |
| E: 798935.41 W: 772058.82333333 Res: 6.56166667 |
| Range of data: min = 0 max = 0 |
| |
| Data Description: |
| generated by r.mapcalc |
| |
| Comments: |
| 0.009295654 * x2NOV04172233_M3DS_R4C1_052823926030_01_P001.1 * |
| 0.99154845 ^ 2 * 3.1415927 / (1758.2229 * cos(50.9))

The functions:

def CalculateTOAR(rastItems):
   raster = rastItems[0]
   absCalFactor = rastItems[1]
   theta = rastItems[2]
   earthSunDist = rastItems[3]
   eSun = rastItems[4]
   output = "%s_TOAR%s" % (raster[:-2], raster[-2:])

   grass.use_temp_region()
   grass.run_command('g.region', rast=raster)

   grass.mapcalc("$output = ($absCal * $input_rast * $esd^2 * $pi) /"
                 " ($eSun * cos($theta))",
       output=output, absCal=absCalFactor, input_rast=raster,
       esd=earthSunDist, pi=math.pi, eSun=eSun, theta=theta)

def main():
   imageryPath = r"/path/to/imagery/folder"

   rasterList = grass.list_pairs('rast')

   metadataDict = AssociateMetaDataWithRaster(rasterList)
   rasterVariables = []

   for key in metadataDict:
       grass.message("Extracting variables for %s" % key)
       #Extract variables from xml metadata file given the image name
       #and band number
       variables = ExtractVariablesFromMetadata(os.path.join(imageryPath,
           metadataDict[key]), int(key[-1]))
       #a list of lists containing the raster and variables for TOAR
       #calculation that can be passed to pool.map
       rasterVariables.append([key] + variables)

   pool = multiprocessing.Pool()
   output = pool.map(CalculateTOAR, rasterVariables)
   pool.close()
   pool.join()

if __name__ == '__main__':
   #Start multiprocessing of radiometric corrections
   t0 = time.time()/60
   print "Start"
   main()
   print "Finish"
   print (time.time()/60) - t0, "minutes processing time"

Hi Eric,

On Wednesday 10 Jul 2013 15:48:04 Eric Goddard wrote:

If I run the script without multiprocessing enabled, it executes
properly; if I run it with multiprocessing enabled, the script
completes without errors but the output pixel range is 0-0. I think
this is caused by the region settings. Can the use_temp_region()
function be used as it is below to execute multiple mapcalc jobs in
parallel that have different extents? Once I have this working I'd
like to make a more generic tool a la the i.landsat.toar module.

I did something similar for GRASS 7: running raster modules after
splitting the region into several tiles.

I don't think the use_temp_region function can solve your problem. If
you look at the code [0], the function sets an environment variable,
so in a multiprocessing environment you would have several processes
setting the same variable at the same time...

Therefore I don't think that would work...

But you can isolate the GRASS command by giving it a dictionary of
environment variables... something like:

{{{
import os

myenv = os.environ.copy()
# now change the variables in the dictionary
myenv['GISRC'] = 'mygisrc' # etc.
run_command('r.something', bla, bla, env=myenv)
}}}

In this way the process will use the dictionary myenv as its
environment, so each process will be isolated from the others...
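[Editorial aside: the isolation Pietro describes can be sketched with the
standard library alone; the GISRC path below is hypothetical. A child
launched with its own copy of the environment sees the changed variable,
while the parent environment stays untouched:]

```python
import os
import subprocess
import sys

def run_isolated(gisrc_path, cmd):
    """Run a command with a private copy of the environment."""
    env = os.environ.copy()
    env['GISRC'] = gisrc_path  # only this child sees the change
    return subprocess.check_output(cmd, env=env)

# hypothetical per-worker rc file; each parallel worker would get its own
out = run_isolated('/tmp/gisrc_worker1',
                   [sys.executable, '-c',
                    "import os; print(os.environ['GISRC'])"])
print(out.decode().strip())  # the child saw /tmp/gisrc_worker1
```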

I've implemented this for GRASS 7; you may see the code here:

http://trac.osgeo.org/grass/browser/grass/trunk/lib/python/pygrass/modules/grid/grid.py#L34

All the best.

Pietro

[0]
http://trac.osgeo.org/grass/browser/grass/trunk/lib/python/script/core.py#L909

Eric wrote:

If I run the script without multiprocessing enabled, it executes
properly; if I run it with multiprocessing enabled,

Just some 2c: I don't remember the exact details, but when we were
working on the multi-processing functions in core.py we went with
the subprocess Python library instead of the multiprocessing one, as
it seemed a better fit for the collection of modular programs GRASS
uses in the usual way. Since pygrass is a different way of thinking
about the structure, maybe the lower-level multiprocessing module is
more appropriate there (or not), I'm not sure.

Can the use_temp_region() function be used as it is below to execute
multiple mapcalc jobs in parallel that have different extents?
Once I have this working I'd like to make a more generic tool a la
the i.landsat.toar module.

See also the mapcalc_start() Python function in lib/python/script/raster.py,
and start_command() in core.py, and how they are used by e.g. i.landsat.rgb.py,
i.pansharpen.py, i.oif.py, and r3.in.xyz.py.

(that uses grass.script, the best way to do it with pygrass may be different)
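[Editorial aside: the pattern Hamish describes (start several non-blocking
jobs, then wait on them all) can be sketched with stdlib subprocess alone;
grass.script's start_command() and mapcalc_start() return a Popen-style
object that is used the same way:]

```python
import subprocess
import sys

# launch several jobs without blocking; each call returns a Popen handle,
# analogous to what start_command()/mapcalc_start() return
jobs = [subprocess.Popen([sys.executable, '-c', 'print(%d * %d)' % (i, i)],
                         stdout=subprocess.PIPE)
        for i in range(3)]

# then wait for all of them and collect the output
results = [p.communicate()[0].decode().strip() for p in jobs]
print(results)
```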

Pietro:

I don't think the use_temp_region function can solve your problem. If
you look at the code [0], the function sets an environment variable,
so in a multiprocessing environment you would have several processes
setting the same variable at the same time...

Environment variables are isolated per process: as soon as a new OS
process is launched, it is impossible for a child process to share or
export environment variables back to its parent or among siblings. A
process inherits the environment as it was at launch, and any changes
made to that process's environment at run time evaporate when the
process is reaped. As long as WIND_OVERRIDE (i.e. use_temp_region())
is always used by modules that want to change the region, and only
g.region at the top level is allowed to change the real WIND file,
then all should be OK.

Since Python doesn't really allow true multithreading (it is stuck on
one core thanks to the GIL), it spawns multiple processes to emulate
it, and as long as there are multiple processes there will be a
separate environment for each. (Again, I'm not a pygrass expert, so
please correct me if my assumptions are bad.)

myenv = os.environ.copy()
# now change the variables in the dictionary
myenv['GISRC'] = 'mygisrc' # etc.
run_command('r.something', bla, bla, env=myenv)

I would be very surprised if changing GISRC was the correct approach
for anything other than working on a large cluster with a job manager
and e.g. NFS issues to deal with.

regards,
Hamish

Thanks for the responses, I’m looking forward to trying out your suggestion, Pietro. Pygrass looks interesting, but I’m a little confused about its relationship with grass.script.

One of the reasons I'm trying to use multiprocessing is that it sidesteps the GIL by not using threads, according to the documentation (http://docs.python.org/2/library/multiprocessing.html). I didn't think grass.start_command() would use all the available CPUs. I've used multiprocessing with the GDAL Python API and it made use of all my CPU cores.

Eric


Eric wrote:

Thanks for the responses, I'm looking forward to trying out your
suggestion, Pietro. Pygrass looks interesting, but I'm a little
confused about its relationship with grass.script.

Pietro can certainly explain it better than me, but in general,
grass.script works in the traditional way of calling GRASS modules to
do what those modules do (think pythonized Bash scripts), while
pygrass abstracts much of that away and pythonizes the GRASS experience.
See:
http://grasswiki.osgeo.org/wiki/GRASS_SoC_Ideas_2012/High_level_map_interaction

One of the reasons I'm trying to use multiprocessing is because it
side steps the GIL issue by not using threads, according to the
documentation (http://docs.python.org/2/library/multiprocessing.html).
I didn't think the grass.start_command() would use all the available
CPUs. I've used multiprocessing with the gdal python api and it made
use of all my cpu cores.

grass.start_command() will only start one command in one process at a
time. For example, i.landsat.rgb starts three parallel processes: one for
each of the Red, Green, and Blue bands. So maximum speedup for that part
of the module is just under 3x.

But it isn't hard to launch more of those so that all cores are used;
see the r3.in.xyz.py script and the v.surf.icw.py addon script for
examples. That method only works well for a certain kind of problem,
but it has very low overhead when it fits.

My worry with splitting a single loop up by rows is that the problem
often becomes I/O bound, and the overhead costs are much larger. But
each module has its own characteristics, so by all means use what
works.

Hamish

Thanks again Pietro and Hamish. After looking at the code for
use_temp_region(), I believe I was using it incorrectly--I should have
called it after setting the region instead of before. With that
change, out of 7 tiles (* 8 bands, for 56 images to process in the
test area), 50 completed successfully and the remaining 6 had pixel
values of NaN.

Since the NaNs sound like another region issue, I followed Hamish's
advice, looked at how the mapcalc_start() function is used in the
i.oif module, and modified the CalculateTOAR function to process all
8 bands of an image that share the same region at once. This method is
nice for the multispectral imagery (56 images completed in under 30
seconds), but unfortunately it won't help when I need to do the pan
imagery with its much higher resolution. I think Pietro's method will
be helpful there, but it will take more time to wrap my head around
that implementation. :) Now I need to find out why some bands have
reflectances that exceed 1.0 even after fixing a bug in my mapcalc
equation (I forgot to divide by the effective bandwidth in the
top-of-atmosphere radiance calculation)...
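[Editorial aside: a sketch of the corrected conversion Eric describes,
following the DigitalGlobe technical note linked at the top of the thread;
the function and variable names are illustrative, not from his script.
Band-averaged radiance is computed first, including the division by
effective bandwidth, and then reflectance:]

```python
import math

def toa_reflectance(dn, abs_cal_factor, effective_bandwidth,
                    earth_sun_dist, esun, solar_zenith_deg):
    """WorldView-2 DN -> top-of-atmosphere reflectance.

    The radiance step includes the division by effective bandwidth
    that the original mapcalc expression omitted.
    """
    radiance = abs_cal_factor * dn / effective_bandwidth
    return (radiance * earth_sun_dist ** 2 * math.pi
            / (esun * math.cos(math.radians(solar_zenith_deg))))

# contrived numbers just to exercise the function
rho = toa_reflectance(100, 0.01, 0.05, 1.0, math.pi, 0.0)
```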

Thanks,
Eric
