[GRASS-user] syncing two locations

M_S · May 21, 2008, 3:58pm

I have two GRASS workstations performing various lidar and basin operations. My idea was to distribute the processing on various machines.

As new data gets created in each location on each computer, when all processes are done, I would like to re-sync the locations to be the same. So far, unison seems to work quite well for this, but I have only done it once.

Are there any things to consider when doing this, based upon GRASS location structures? Or is there a better approach to distribute the processing?

Thanks,
Mark

neteler · May 21, 2008, 10:37pm

On Wed, May 21, 2008 at 5:58 PM, M S <mseibel@gmail.com> wrote:

> I have two GRASS workstations performing various lidar and basin
> operations. My idea was to distribute the processing on various machines.
>
> As new data gets created in each location on each computer, when all
> processes are done, I would like to re-sync the locations to be the same.

I did massive MODIS map processing on a cluster and used
only one location, in a shared network directory. In it I used multiple
mapsets, with a final g.copy job into the mapset that keeps the results.

I have documented it here:
http://grass.osgeo.org/wiki/Parallel_GRASS_jobs#PBS
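
A minimal sketch of how such a job plan could be laid out, not the script from the wiki page. It only builds the command lists and does not call GRASS. The mapset naming scheme (`job_000`, ...), the `_out` suffix and the function name are hypothetical. The option name `raster=` is the GRASS 7+ spelling; GRASS 6 used `rast=`.

```python
from typing import Dict, List


def plan_jobs(maps: List[str], target_mapset: str) -> List[Dict[str, object]]:
    """Give every input map its own working mapset plus a final g.copy step."""
    jobs = []
    for i, name in enumerate(maps):
        work_mapset = f"job_{i:03d}"  # hypothetical naming scheme
        jobs.append({
            "input": name,
            "mapset": work_mapset,      # the job writes only into this mapset
            "run_in": target_mapset,    # g.copy copies into the current mapset
            "collect": ["g.copy", f"raster={name}_out@{work_mapset},{name}_out"],
        })
    return jobs


if __name__ == "__main__":
    for job in plan_jobs(["tile_a", "tile_b"], "results"):
        print(job["mapset"], " ".join(job["collect"]))
```

Because each job writes only into its own mapset, the jobs do not touch each other's files; only the final copy step writes to the results mapset.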

Markus

neteler · May 22, 2008, 8:01pm

(I take the liberty of cc'ing the list again for discussion)

On Thu, May 22, 2008 at 3:59 PM, M S <mseibel@gmail.com> wrote:

> That is cool. I will definitely check out the wiki page.
>
> Perhaps I am not using locations or more specifically mapsets as
> efficiently as I could be. If I were to use a single location, but
> with different mapsets, would there be issues with changing the region
> boundary and/or the cell resolution?

Not at all, all mapsets are independent.

> For example, in my main area of interest (basin) there were areas I
> needed to do finer detailed analysis, going from 25 foot cells to 3
> foot cells (r.in.xyz + r.report helped me find this optimal cell size,
> great stuff!), would this be an issue within the same location, but
> using a different mapset? I had thought that region settings applied
> to a location.

No, they apply to a mapset!

I have added
http://grass.osgeo.org/wiki/Parallel_GRASS_jobs#Background
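
To illustrate the per-mapset region: GRASS keeps the current region in a `WIND` file inside each mapset directory, so two mapsets of one location can hold different resolutions. The sketch below reads such files as plain `key: value` lines. The helper `write_demo_wind`, the location name `basin` and the mapset names are made up for the demo, and the demo files contain only two of the keys a real `WIND` file has.

```python
import tempfile
from pathlib import Path
from typing import Dict


def read_region(gisdbase: str, location: str, mapset: str) -> Dict[str, str]:
    """Parse the 'key: value' lines of one mapset's WIND file."""
    wind = Path(gisdbase) / location / mapset / "WIND"
    region: Dict[str, str] = {}
    for line in wind.read_text().splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            region[key.strip()] = value.strip()
    return region


def write_demo_wind(gisdbase: str, location: str, mapset: str, resolution: int) -> None:
    """Hypothetical helper: write a tiny, made-up WIND file for the demo."""
    mapset_dir = Path(gisdbase) / location / mapset
    mapset_dir.mkdir(parents=True, exist_ok=True)
    (mapset_dir / "WIND").write_text(
        f"e-w resol: {resolution}\nn-s resol: {resolution}\n"
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as gisdbase:
        write_demo_wind(gisdbase, "basin", "coarse", 25)
        write_demo_wind(gisdbase, "basin", "detail", 3)
        for mapset in ("coarse", "detail"):
            print(mapset, read_region(gisdbase, "basin", mapset)["e-w resol"])
```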

> I had (perhaps not the best choice) copied the location to another
> workstation and changed the region resolution and boundary, and then
> ran more detailed analysis.

If you do it serially, no problem. Otherwise it will conflict.

> Under this scenario, does it still make sense to use a single shared
> location?

Most likely.

> One problem I foresee is with vectors using a single
> sqlite.db file,

That's true. You cannot write simultaneously to the same sqlite.db file.

http://en.wikipedia.org/wiki/SQLite
"A write access can only be satisfied if no other accesses are
currently being serviced, otherwise the write access fails with an
error code (or can automatically be retried until a configurable
timeout expires). This concurrent access situation would change when
dealing with temporary tables."

But why not just use different names, i.e. multiple sqlite.db files?
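
The locking behaviour can be reproduced with Python's standard `sqlite3` module. This is a sketch under stated assumptions, not GRASS code: the table `attrs`, the file names and the 0.2 second timeout are made up. One connection holds a write transaction on one file while a second connection tries to write, either to the same file or to a different one.

```python
import os
import sqlite3
import tempfile


def write_is_blocked(writer_db: str, other_db: str) -> bool:
    """Hold a write transaction on writer_db, then try to write to other_db.

    Returns True if the second write fails because the database is locked.
    """
    holder = sqlite3.connect(writer_db, isolation_level=None)
    holder.execute("CREATE TABLE IF NOT EXISTS attrs (cat INTEGER, label TEXT)")
    holder.execute("BEGIN IMMEDIATE")  # take the write lock and keep it
    holder.execute("INSERT INTO attrs VALUES (1, 'first writer')")

    second = sqlite3.connect(other_db, timeout=0.2)  # retry for 0.2 s, then fail
    try:
        second.execute("CREATE TABLE IF NOT EXISTS attrs (cat INTEGER, label TEXT)")
        second.execute("INSERT INTO attrs VALUES (2, 'second writer')")
        second.commit()
        return False
    except sqlite3.OperationalError as exc:
        return "locked" in str(exc)
    finally:
        second.close()            # release the second connection first
        holder.execute("COMMIT")  # then let the first writer finish
        holder.close()


if __name__ == "__main__":
    tmp = tempfile.mkdtemp()
    shared = os.path.join(tmp, "shared.db")
    separate = os.path.join(tmp, "separate.db")
    print("same file blocked:", write_is_blocked(shared, shared))
    print("different files blocked:", write_is_blocked(shared, separate))
```

With one database file per job, the writers never contend for the same lock, which is the point of using multiple sqlite.db files.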

> in two different locations and then getting the vector
> database out of sync, and unable to rectify this with syncing the
> locations. It seems that reprojecting the data into the location is
> one way to deal with this single DB issue.

I feel that it's making things worse.

> Most all of what I'm doing
> is time-consuming raster analysis though, and on the first sync,
> things seemed to work as expected.

OK fine.
Please let's collect ideas in the wiki on how to deal with vector data
and related DBs when processing in parallel.

Markus

_______________________________________________
grass-user mailing list
grass-user@lists.osgeo.org
http://lists.osgeo.org/mailman/listinfo/grass-user

--
Open Source Geospatial Foundation
http://www.osgeo.org/
http://www.grassbook.org/

M_S · May 23, 2008, 10:17am

The best solution does seem to be the multiple mapsets, from a single
shared location. That is really a great way to distribute the
processing, especially given the flexibility of mapsets in a location.

Mark


M_S · May 28, 2008, 12:26pm

I completely agree with using your approach of a shared location with
multiple mapsets. Something like Unison syncs the locations
well, but it has problems reconciling single files such as the
sqlite.db file that holds all the vector attributes. Now I have some
vector files that don't have attributes.

It seems to work fine for rasters, but I can't say for certain. I would
stick with your approach, or use the Linux clustering approach
you mentioned.

Thanks for the feedback, this will be a great help in distributing the
processing load.

Mark
