[GRASS-dev] locking on a raster

  BTW, is there a way to ``lock'' on a raster (or its data at
  least), so that it could be opened (or rewound) multiple times
  and, e. g., `G_get_raster_row' would return the same data, even
  if the raster was, e. g., renamed meanwhile?

  It seems to me that a combination of dup ()- and lseek ()-like
  (or rather rewind ()-like?) functions for GRASS rasters would do
  the trick. Are such library functions currently there? Does it
  seem reasonable to implement them?

  I'd like to mention that, given the standard Unix-like semantics,
  such locking would imply that one could read a raster that was
  `g.remove'd meanwhile. I think that's a nice feature for
  long-running GRASS modules that have to read certain rasters
  multiple times per run.

Ivan Shmakov wrote:

  BTW, is there a way to ``lock'' on a raster (or its data at
  least), so that it could be opened (or rewound) multiple times
  and, e. g., `G_get_raster_row' would return the same data, even
  if the raster was, e. g., renamed meanwhile?

That's already the case.

When a raster is opened for write, the data is written to a temporary
file in the <mapset>/.tmp/<hostname> directory. When the raster is
closed, the temporary file is moved into place with rename().

Any process which has the raster open for reading will be unaffected
by all of this, so long as it keeps the raster open (if it closes and
re-opens it, that's a different issue).

Note that this typically won't work on Windows. If a module tries to
replace a raster while it's open, the renaming step will fail.
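
In code, the pattern is roughly the following (an illustrative
sketch only, not the actual libgis source; paths and error handling
are simplified):

#include <stdio.h>

/* Write-to-temp-then-rename, as described above. */
int put_raster_atomically(const char *tmp_path, const char *final_path)
{
    FILE *fp = fopen(tmp_path, "wb");

    if (!fp)
        return -1;
    /* ... write the raster rows to fp here ... */
    if (fclose(fp) != 0)
        return -1;
    /* On POSIX systems rename() replaces final_path atomically; readers
     * which already hold the old file open keep seeing the old data. */
    return rename(tmp_path, final_path);
}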

A more significant issue for concurrent access is the potential for
race conditions between opening the various files which make up a
raster (cellhd, cell/fcell, categories, colour table, history, etc).

To minimise the risk of race conditions, modules should ideally read
all support files at the start, rather than waiting until they have
finished reading the raster data.

To have concurrent access working properly, we would need to add
explicit locking to all modules. Actually, we might be able to do it
in g.parser, at least for modules whose options contain complete
information (i.e. not modules which derive the names of output maps
from the names of the input maps).

--
Glynn Clements <glynn@gclements.plus.com>

Glynn Clements <glynn@gclements.plus.com> writes:

>> BTW, is there a way to ``lock'' on a raster (or its data at least),
>> so that it could be opened (or rewound) multiple times and, e. g.,
>> `G_get_raster_row' would return the same data, even if the raster
>> was, e. g., renamed meanwhile?

> That's already the case.

> When a raster is opened for write, the data is written to a temporary
> file in the <mapset>/.tmp/<hostname> directory. When the raster is
> closed, the temporary file is moved into place with rename().

> Any process which has the raster open for reading will be unaffected
> by all of this, so long as it keeps the raster open (if it closes and
> re-opens it, that's a different issue).

  It seems that you've missed my point. I wish to process the
  whole raster with G_get_raster_row () once, and then another
  time from the start. The only way to do it that I know is to
  close and re-open it, which is ``a different issue'', indeed.

> Note that this typically won't work on Windows. If a module tries to
> replace a raster while it's open, the renaming step will fail.

> A more significant issue for concurrent access is the potential for
> race conditions between opening the various files which make up a
> raster (cellhd, cell/fcell, categories, colour table, history, etc).

> To minimise the risk of race conditions, modules should ideally read
> all support files at the start, rather than waiting until they have
> finished reading the raster data.

  Quite so. Although I have a few ideas on how to avoid it.

> To have concurrent access working properly, we would need to add
> explicit locking to all modules. Actually, we might be able to do it
> in g.parser, at least for modules whose options contain complete
> information (i.e. not modules which derive the names of output maps
> from the names of the input maps).

  So that, e. g., renaming would be impossible until all the
  processing of the raster has finished? It doesn't seem to be a
  particularly bright solution. At least, it may bring deadlocks
  to GRASS, and it seems to complicate parallel execution in
  general as well.

Ivan Shmakov wrote:

> Any process which has the raster open for reading will be unaffected
> by all of this, so long as it keeps the raster open (if it closes and
> re-opens it, that's a different issue).

  It seems that you've missed my point. I wish to process the
  whole raster with G_get_raster_row () once, and then another
  time from the start. The only way to do it that I know is to
  close and re-open it, which is ``a different issue'', indeed.

No, there's no need to re-open it (or otherwise "rewind" it).

G_get_raster_row() takes the row number as an argument; nothing
requires that you read the rows sequentially, or that you read each
row only once.
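
For example, two complete passes over a map need nothing more than
this (a minimal sketch against the libgis API; it assumes G_gisinit()
has been called and omits error checking):

#include <grass/gis.h>

void two_passes(const char *name, const char *mapset)
{
    int fd = G_open_cell_old(name, mapset);
    int rows = G_window_rows();
    DCELL *buf = G_allocate_d_raster_buf();
    int pass, row;

    for (pass = 0; pass < 2; pass++)
        for (row = 0; row < rows; row++)
            G_get_d_raster_row(fd, buf, row);  /* same descriptor both times */

    G_free(buf);
    G_close_cell(fd);
}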

> To have concurrent access working properly, we would need to add
> explicit locking to all modules. Actually, we might be able to do it
> in g.parser, at least for modules whose options contain complete
> information (i.e. not modules which derive the names of output maps
> from the names of the input maps).

  So that, e. g., renaming would be impossible until all the
  processing of the raster has finished? It doesn't seem to be a
  particularly bright solution. At least, it may bring deadlocks
  to GRASS, and it seems to complicate parallel execution in
  general as well.

To make concurrent access work, updating a map would need to be an
atomic operation, so that any module which reads the map sees either
the "before" version or the "after" version, and never sees an
"in-progress" version.

This means that while any module is in the process of opening the
various elements that make up a map, the map cannot be replaced.

The module wouldn't have to keep the map locked while processing the
data; it could unlock it once it no longer needs to open any more
files. Provided that modules were written to read all of the support
files at the beginning, the lock could be released before it starts
reading the actual map data.
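
Schematically, a reader module would then look something like this (a
sketch only; lock_maps() and unlock_maps() are placeholders for
whatever locking mechanism is chosen, they do not exist in libgis):

#include <grass/gis.h>

/* Placeholder lock functions -- not part of libgis. */
static void lock_maps(void) { }
static void unlock_maps(void) { }

void read_module(const char *name, const char *mapset)
{
    struct Cell_head cellhd;
    struct Colors colors;
    DCELL *buf;
    int fd, row, rows;

    lock_maps();                           /* map cannot be replaced now */
    fd = G_open_cell_old(name, mapset);
    G_get_cellhd(name, mapset, &cellhd);   /* read all support files ... */
    G_read_colors(name, mapset, &colors);  /* ... before releasing the lock */
    unlock_maps();

    rows = G_window_rows();
    buf = G_allocate_d_raster_buf();
    for (row = 0; row < rows; row++)       /* no lock needed while reading */
        G_get_d_raster_row(fd, buf, row);

    G_free(buf);
    G_close_cell(fd);
}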

For output maps, the writer would write the data to a temporary file,
and would only need to lock the map at the point that it is closed,
for the time taken to rename the cell/fcell file and to write the
support files.

Deadlock is only possible if a module attempts to obtain a lock while
it already holds another lock. I can't conceive of any situation where
that would be necessary.

--
Glynn Clements <glynn@gclements.plus.com>

Glynn:

To make concurrent access work, updating a map would need to be an
atomic operation, so that any module which reads the map sees either
the "before" version or the "after" version, and never sees an
"in-progress" version.

This means that while any module is in the process of opening the
various elements that make up a map, the map cannot be replaced.

[somewhat OT, somewhat not]
In my mind, due to the upcoming prevalence of multi-core processors,
working to convert the raster modules to take advantage of multi-processor
systems is much more important than allowing concurrent use of a single
mapset. (not that they are mutually exclusive goals, but it may be good
to think of one in light of the other)

Whether it is best to start with modules that use the segment memory (I
assume this is typically for RAM limitations not CPU) or splitting serial
G_{get|put}_raster_row() operations into n chunks of rows, I am not sure.
And how (or is it possible) to abstract that to the library level to
avoid a massive rewrite. (massive raster module rewrite is not off the
table for GRASS 7 but seems like a less bang-for-buck approach) It would
be interesting to rewrite r.example to split the operation into n
threads. (GRASS_NPROC=n or GRASS_THREADS=n enviro var?)

For output maps, the writer would write the data to a temporary file,
and would only need to lock the map at the point that it is closed,
for the time taken to rename the cell/fcell file and to write the
support files.

(meta-data support files are in general written after the map data array
is already closed)

Hamish


Hamish wrote:

> To make concurrent access work, updating a map would need to be an
> atomic operation, so that any module which reads the map sees either
> the "before" version or the "after" version, and never sees an
> "in-progress" version.
>
> This means that while any module is in the process of opening the
> various elements that make up a map, the map cannot be replaced.

[somewhat OT, somewhat not]
In my mind, due to the upcoming prevalence of multi-core processors,
working to convert the raster modules to take advantage of multi-processor
systems is much more important than allowing concurrent use of a single
mapset. (not that they are mutually exclusive goals, but it may be good
to think of one in light of the other)

While fine-grained concurrency is likely to be more useful, it's also
significantly harder to implement. Particularly for GRASS, as many of
the core library functions are not remotely thread-safe.

Whether it is best to start with modules that use the segment memory (I
assume this is typically for RAM limitations not CPU) or splitting serial
G_{get|put}_raster_row() operations into n chunks of rows, I am not sure.
And how (or is it possible) to abstract that to the library level to
avoid a massive rewrite. (massive raster module rewrite is not off the
table for GRASS 7 but seems like a less bang-for-buck approach) It would
be interesting to rewrite r.example to split the operation into n
threads. (GRASS_NPROC=n or GRASS_THREADS=n enviro var?)

As you can't call G_get_raster_row() etc from multiple threads, you
can't realistically write a N-threaded version of r.example.

There are some relatively simple things which can be done to take
advantage of multiple threads, e.g. having one thread for calling
GRASS functions and one or more threads for computation.

However, as r.example doesn't do any computation, you can't really
demonstrate that technique there.

With some cleaning up, you might be able to have one thread for
reading, one or more threads for computation, and one thread for
writing. However, the number of threads which would actually be useful
for computation would be limited by the I/O speed.
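
As a rough sketch (not existing GRASS code) of the reader/computation
split, with a single-slot hand-off between the libgis thread and one
worker thread; the computation itself is left as a stub:

#include <pthread.h>
#include <stdlib.h>
#include <grass/gis.h>

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cnd = PTHREAD_COND_INITIALIZER;
static DCELL *slot;            /* row buffer handed over to the worker */
static int cols, done;

static void *worker(void *arg)
{
    (void) arg;
    for (;;) {
        DCELL *row;

        pthread_mutex_lock(&mtx);
        while (!slot && !done)
            pthread_cond_wait(&cnd, &mtx);
        if (!slot && done) {
            pthread_mutex_unlock(&mtx);
            return NULL;
        }
        row = slot;
        slot = NULL;
        pthread_cond_signal(&cnd);
        pthread_mutex_unlock(&mtx);

        /* ... per-cell computation on row[0 .. cols-1] goes here ... */
        free(row);
    }
}

void run(int fd)
{
    int rows = G_window_rows(), row;
    pthread_t tid;

    cols = G_window_cols();
    pthread_create(&tid, NULL, worker, NULL);
    for (row = 0; row < rows; row++) {
        DCELL *buf = G_allocate_d_raster_buf();

        G_get_d_raster_row(fd, buf, row);  /* libgis: this thread only */
        pthread_mutex_lock(&mtx);
        while (slot)
            pthread_cond_wait(&cnd, &mtx);
        slot = buf;
        pthread_cond_signal(&cnd);
        pthread_mutex_unlock(&mtx);
    }
    pthread_mutex_lock(&mtx);
    done = 1;
    pthread_cond_signal(&cnd);
    pthread_mutex_unlock(&mtx);
    pthread_join(tid, NULL);
}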

Having multiple input threads would be possible, but it would require
a fair amount of work in terms of cleaning up the I/O code. It would
probably also result in a net efficiency loss[1], so it would only
make sense for single-user desktop systems (i.e. where the other
core(s) would otherwise be idle).

[1] In particular, if the region resolution is finer than the map
resolution, some of the rows will be duplicated. The I/O code keeps a
copy of the last row read, so consecutive requests for the same map
row don't decode the data repeatedly. Obviously, this requires that
rows are read sequentially; if two threads read the same map row
concurrently, you'll end up decoding the same data twice.

Multiple output threads would effectively require that output maps
were uncompressed. Compressed maps require that you finish writing row
N before you can start writing row N+1.

Probably the most practical use of multiple threads would be in
modules which:

a) spend most of their time performing computations, rather than in
GRASS I/O, and

b) are widely used (to justify the effort).

The most obvious candidate is r.mapcalc.

--
Glynn Clements <glynn@gclements.plus.com>

Glynn Clements <glynn@gclements.plus.com> writes:

[...]

>> It seems that you've missed my point. I wish to process the whole
>> raster with G_get_raster_row () once, and then another time from the
>> start. The only way to do it that I know is to close and re-open
>> it, which is ``a different issue'', indeed.

> No, there's no need to re-open it (or otherwise "rewind" it).

> G_get_raster_row() takes the row number as an argument; nothing
> requires that you read the rows sequentially, or that you read each
> row only once.

  Indeed.

  What about a dup ()? The intent is to have two parts of the code
  run in quite an intertwined manner. It's not known which part
  will finish earlier, and I had in mind dup ()-ing the
  descriptor, so that each part could close it on its own when
  finished.

[...]

Glynn Clements <glynn@gclements.plus.com> writes:

[...]

>>> To have concurrent access working properly, we would need to add
>>> explicit locking to all modules. Actually, we might be able to do
>>> it in g.parser, at least for modules whose options contain complete
>>> information (i.e. not modules which derive the names of output maps
>>> from the names of the input maps).

>> So that, e. g., renaming would be impossible until all the
>> processing of the raster has finished? It doesn't seem to be a
>> particularly bright solution. At least, it may bring deadlocks to
>> GRASS, and it seems to complicate parallel execution in general as
>> well.

> To make concurrent access work, updating a map would need to be an
> atomic operation, so that any module which reads the map sees either
> the "before" version or the "after" version, and never sees an
> "in-progress" version.

  And that's simple. It's only the current layout of the GRASS
  database directory that makes it hard. The new layout may be
  like the following:

  objects/<id> -- the data file;

  rasters/<raster> -- the <raster>'s ``inventory'' file.

  The inventory file contains the <id>s of the objects comprising
  a raster, and any other information deemed useful, e. g. (in RFC
  822-like form):

Raster-Title: The Raster's Title Here
Raster-Data: BQCbkqHHIVoMq6OCG
Raster-Region: FpRZwnY7HgiaNPru0
Raster-Colormap: FVHD1acX1QQ4E6mgU
Raster-Category-Labels: 2RHL1nOsync1KQLbh
X-Raster-r-my-module-private-object: TzN7tvDGYsHeRIw9k
...

  When the raster is created, the objects can be created in-place,
  while the inventory is created in a temporary location and then
  atomically moved to rasters/. When the raster is renamed, the
  only thing that needs to be renamed is the inventory file. When
  the raster is being overwritten, the new data is stored under
  different object <id>s, and the inventory file is replaced
  atomically upon completion.
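
  For instance, committing a newly-written raster might look roughly
  like this (a sketch; the paths and inventory fields are purely
  illustrative):

#include <stdio.h>

/* The data objects are already in place under objects/; the only atomic
 * step is rename ()ing the new inventory into rasters/. */
int commit_raster(const char *mapset, const char *name,
                  const char *data_id, const char *region_id)
{
    char tmp[4096], final[4096];
    FILE *fp;

    snprintf(tmp, sizeof tmp, "%s/rasters/.tmp.%s", mapset, name);
    snprintf(final, sizeof final, "%s/rasters/%s", mapset, name);

    fp = fopen(tmp, "w");
    if (!fp)
        return -1;
    fprintf(fp, "Raster-Data: %s\n", data_id);
    fprintf(fp, "Raster-Region: %s\n", region_id);
    if (fclose(fp) != 0)
        return -1;
    return rename(tmp, final);  /* readers see the old or the new inventory */
}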

  The benefits of this scheme are:

  * the scheme allows better transaction-like semantics to be
    implemented; in particular, the map could be replaced
    atomically at any time;

  * the scheme allows the set of objects comprising a raster to
    be extended easily; each module could have its own
    ``namespace'' within the inventory file;

  * the `r.reclass' concept of /not/ generating any raster data
    could be generalized with this scheme; e. g., it would become
    easy to use different color maps, or different regions, over
    the same data; one may compare `r.reclass' to creating
    symbolic links, while the new scheme makes ``hard links''
    possible;

  * the concept of ``trash can'' could be implemented with less
    effort with this scheme;

  * the objects/ and rasters/ directories could be easily
    `rsync'ed; currently, an invocation of `rsync' implies a
    transfer size penalty in the presence of renames;

  * the scheme feels consistent with the ``rasterset'' ideas
    [1].

[1] http://lists.osgeo.org/pipermail/grass-dev/2008-January/034772.html

  The apparent drawbacks are:

  * some reference count has to be kept in order to delete the
    objects that are no longer needed; as on some platforms it may
    be impossible to unlink () an open file, there should be a
    pure GRASS-library implementation of this; (otherwise, it
    would be possible to simply keep all the files comprising the
    raster open, so that it couldn't vanish meanwhile.)

  * as some dangling objects may remain, a dedicated `g.fsck'
    module would be necessary;

  * the change will be quite disruptive to the existing code base.

[...]

Ivan Shmakov wrote:

>> It seems that you've missed my point. I wish to process the whole
>> raster with G_get_raster_row () once, and then another time from the
>> start. The only way to do it that I know is to close and re-open
>> it, which is ``a different issue'', indeed.

> No, there's no need to re-open it (or otherwise "rewind" it).

> G_get_raster_row() takes the row number as an argument; nothing
> requires that you read the rows sequentially, or that you read each
> row only once.

  Indeed.

  What about a dup ()? The intent is to have two parts of the code
  run in quite an intertwined manner. It's not known which part
  will finish earlier, and I had in mind dup ()-ing the
  descriptor, so that each part could close it on its own when
  finished.

Realistically, you would be a lot better off just opening the map
twice.

Currently, there's no way to initialise the FCB other than by opening
the map, and I can't see that it makes sense to add such a feature for
one specific use case.

--
Glynn Clements <glynn@gclements.plus.com>

Ivan Shmakov wrote:

> To make concurrent access work, updating a map would need to be an
> atomic operation, so that any module which reads the map sees either
> the "before" version or the "after" version, and never sees an
> "in-progress" version.

  And that's simple. It's only the current layout of the GRASS
  database directory that makes it hard. The new layout may be
  like the following:

[snip]

A simpler (and less radical) solution is to simply put all elements of a
map in a common directory. Then, you replace the directory as a whole
rather than individual files. ISTR that this is already planned for
the future.

The downside is that you can't atomically replace a directory in the
same way that you can a file. You have to rename the destination
first, which leaves a small window when the map doesn't exist. But a
transient state with no map is still better than the current situation
of a transient state with an inconsistent map.
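
In code, the swap would be something like this (illustrative paths,
minimal error handling):

#include <stdio.h>

/* Replace the map directory <final> with the freshly-written <new_tmp>.
 * The window between the two rename() calls is when the map briefly
 * doesn't exist. */
int replace_map_dir(const char *new_tmp, const char *final,
                    const char *old_bak)
{
    if (rename(final, old_bak) != 0)   /* move the old map aside */
        return -1;
    if (rename(new_tmp, final) != 0)   /* window: the map name is absent */
        return -1;
    /* old_bak can now be removed at leisure. */
    return 0;
}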

Full transaction support would be nice, but I don't know if it's worth
the substantial effort involved in implementing it.

In any case, all of these mechanisms would require substantial changes
to a large number of modules. And there's still the issue of vector
maps, which are often updated in-place.

--
Glynn Clements <glynn@gclements.plus.com>

On Mon, Apr 7, 2008 at 8:29 PM, Glynn Clements <glynn@gclements.plus.com> wrote:

A simpler (and less radical) solution is to simply put all elements of a
map in a common directory. Then, you replace the directory as a whole
rather than individual files. ISTR that this is already planned for
the future.

Yes: like G3D data or Vector data:
http://grass.gdf-hannover.de/wiki/Replacement_raster_format#Directory_structure

Markus

Glynn Clements <glynn@gclements.plus.com> writes:

>>> To make concurrent access work, updating a map would need to be an
>>> atomic operation, so that any module which reads the map sees
>>> either the "before" version or the "after" version, and never sees
>>> an "in-progress" version.

>> And that's simple. It's only the current layout of the GRASS
>> database directory that makes it hard. The new layout may be like
>> the following:

> [snip]

> A simpler (and less radical) solution is to simply put all elements of a
> map in a common directory. Then, you replace the directory as a whole
> rather than individual files. ISTR that this is already planned for
> the future.

> The downside is that you can't atomically replace a directory in the
> same way that you can a file. You have to rename the destination
> first, which leaves a small window when the map doesn't exist. But a
> transient state with no map is still better than the current
> situation of a transient state with an inconsistent map.

  Unfortunately, it doesn't seem to solve the ``rsync after
  g.rename'' problem properly. The original point behind the new
  layout was exactly its rsync-friendliness.

> Full transaction support would be nice, but I don't know if it's
> worth the substantial effort involved in implementing it.

  BTW, what do you mean by full transaction support here?

> In any case, all of these mechanisms would require substantial
> changes to a large number of modules.

  I'll try to check whether I could make the new layout available
  under the old API as well.

[...]

Ivan Shmakov wrote:

> Full transaction support would be nice, but I don't know if it's
> worth the substantial effort involved in implementing it.

  BTW, what do you mean by full transaction support here?

Actually, I was referring specifically to atomicity, i.e. being able
to replace a single map atomically, with no interval where the map
doesn't exist or is in an inconsistent state.

I suppose that there are cases where it might be useful to be able to
update multiple maps atomically, but that's even more work (you would
need an inventory for the entire mapset, not just for individual maps).

> In any case, all of these mechanisms would require substantial
> changes to a large number of modules.

  I'll try to check whether I could make the new layout available
  under the old API as well.

The main problem is that a module reads a map via several calls. All
of those calls must see the same map.

I initially thought that you would need some form of locking, to
prevent the map from being replaced in the middle of the sequence.
However, you could achieve the same result by caching the inventory
within the module, but the code which garbage-collects unreferenced
elements would need to allow for this.

Even there, you could run into problems where a module invokes another
module; the child would need to use the same version of the map as the
parent.

--
Glynn Clements <glynn@gclements.plus.com>

Glynn Clements <glynn@gclements.plus.com> writes:

>>> Full transaction support would be nice, but I don't know if it's
>>> worth the substantial effort involved in implementing it.

>> BTW, what do you mean by full transaction support here?

> Actually, I was referring specifically to atomicity, i.e. being able
> to replace a single map atomically, with no interval where the map
> doesn't exist or is in an inconsistent state.

  ACK.

> I suppose that there are cases where it might be useful to be able to
> update multiple maps atomically, but that's even more work (you would
> need an inventory for the entire mapset, not just for individual
> maps).

  Since the scheme I've suggested reduces a raster to a structured
  inventory and a set of uniquely-named binary objects, it looks
  RDBMS-friendly as well. I wish I had the time to explore
  putting some GRASS rasters into a PostgreSQL database!

>>> In any case, all of these mechanisms would require substantial
>>> changes to a large number of modules.

>> I'll try to check whether I could make the new layout available
>> under the old API as well.

> The main problem is that a module reads a map via several calls. All
> of those calls must see the same map.

> I initially thought that you would need some form of locking, to
> prevent the map from being replaced in the middle of the sequence.
> However, you could achieve the same result by caching the inventory
> within the module, but the code which garbage-collects unreferenced
> elements would need to allow for this.

  Yes. The need for GC seems to be the weakest point of this
  scheme.

  On Unix, as a first approximation, I'd just open () every binary
  object that's referenced by the inventory being processed. This
  way, even if the file loses its name, it would be available to
  the program.
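
  Roughly (the objects/ path is of course specific to the proposed
  layout):

#include <fcntl.h>
#include <stdio.h>

/* Holding an open descriptor keeps the object readable even if it is
 * later unlink ()ed or renamed; keep the fd until the module is done. */
int pin_object(const char *mapset, const char *object_id)
{
    char path[4096];

    snprintf(path, sizeof path, "%s/objects/%s", mapset, object_id);
    return open(path, O_RDONLY);
}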

  In general, every binary object would need a list of references.
  Maintaining a list of names of referencing rasters shouldn't be
  too hard to implement. On the contrary, a list of PIDs (to
  allow for a raster to be referenced by a process) looks a bit
  fragile.

> Even there, you could run into problems where a module invokes
> another module; the child would need to use the same version of the
> map as the parent.

  Agreed.

  Actually, the only proper solution to this problem that I know
  is moving the whole computation chain into a ``parallel
  existence'' -- forking a separate copy-on-write location at the
  beginning of a ``transaction block'', and merging it back when
  it's done. And while I hope that something like this will
  eventually be available in GRASS, I probably wouldn't say that
  the current code base is anywhere near that.

Ivan Shmakov wrote:

> I initially thought that you would need some form of locking, to
> prevent the map from being replaced in the middle of the sequence.
> However, you could achieve the same result by caching the inventory
> within the module, but the code which garbage-collects unreferenced
> elements would need to allow for this.

  Yes. The need for GC seems to be the weakest point of this
  scheme.

  On Unix, as a first approximation, I'd just open () every binary
  object that's referenced by the inventory being processed. This
  way, even if the file loses its name, it would be available to
  the program.

However: while updates to the cell/fcell file are atomic (new data is
written to a temporary file which is rename()d upon closing), support
files are typically updated by simply opening the output file for
write. If the output file already exists, it is overwritten in place.

Obviously, that would need to change in order for "transactional" I/O
to work. The fact that individual modules can create their own private
files in cell_misc complicates matters.

BTW, the definition of "support file" includes the cellhd file, which
is rather fundamental, as the layout of the cell/fcell file (rows,
columns, format, compression) is stored in the cellhd file. If the
contents of the cellhd file don't match the cell/fcell file, libgis
will probably just crash.

  In general, every binary object would need a list of references.
  Maintaining a list of names of referencing rasters shouldn't be
  too hard to implement. On the contrary, a list of PIDs (to
  allow for a raster to be referenced by a process) looks a bit
  fragile.

In-process references could be maintained by making a copy (or hard
link) to the inventory, so that the GC treats it as "live". You would
need some kind of clean-up mechanism to handle any copies which are
left behind if a module crashes.
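
As a sketch (the paths are illustrative; the PID in the link name is
what a clean-up pass would use to spot links left behind by dead
processes):

#include <stdio.h>
#include <unistd.h>

/* Hard-link a map's inventory into tmp/refs/ so that the GC treats the
 * referenced objects as live; remove the link when the module finishes. */
int claim_inventory(const char *mapset, const char *name)
{
    char src[4096], ref[4096];

    snprintf(src, sizeof src, "%s/rasters/%s", mapset, name);
    snprintf(ref, sizeof ref, "%s/tmp/refs/%s.%ld",
             mapset, name, (long) getpid());
    return link(src, ref);
}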

> Even there, you could run into problems where a module invokes
> another module; the child would need to use the same version of the
> map as the parent.

  Agreed.

  Actually, the only proper solution to this problem that I know
  is moving the whole computation chain into a ``parallel
  existence'' -- forking a separate copy-on-write location at the
  beginning of a ``transaction block'', and merging it back when
  it's done. And while I hope that something like this will
  eventually be available in GRASS, I probably wouldn't say that
  the current code base is anywhere near that.

Yep. For the time being, I'd settle for simply re-arranging the
database layout to have one directory per map.

[BTW, it has been pointed out that this can reduce the maximum number
of maps per mapset, as the limit on an inode's hard link count limits
the maximum number of subdirectories, while there is usually no fixed
limit on the number of files. E.g. on Linux' ext2fs, the maximum hard
link count is 65535, so you can't have more than 65533 subdirectories.]

--
Glynn Clements <glynn@gclements.plus.com>

Glynn Clements <glynn@gclements.plus.com> writes:

>>> I initially thought that you would need some form of locking, to
>>> prevent the map from being replaced in the middle of the sequence.
>>> However, you could achieve the same result by caching the inventory
>>> within the module, but the code which garbage-collects unreferenced
>>> elements would need to allow for this.

>> Yes. The need for GC seems to be the weakest point of this scheme.

>> On Unix, as a first approximation, I'd just open () every binary
>> object that's referenced by the inventory being processed. This
>> way, even if the file loses its name, it would be available to the
>> program.

> However: while updates to the cell/fcell file are atomic (new data is
> written to a temporary file which is rename()d upon closing), support
> files are typically updated by simply opening the output file for
> write. If the output file already exists, it is overwritten in place.

> Obviously, that would need to change in order for "transactional" I/O
> to work.

  Yes. With the new scheme, the objects/ entries may be created in
  place, but mustn't ever be overwritten.

> The fact that individual modules can create their own private files
> in cell_misc complicates matters.

> BTW, the definition of "support file" includes the cellhd file, which
> is rather fundamental, as the layout of the cell/fcell file (rows,
> columns, format, compression) is stored in the cellhd file. If the
> contents of the cellhd file don't match the cell/fcell file, libgis
> will probably just crash.

  Agreed.

>> In general, every binary object would need a list of references.
>> Maintaining a list of names of referencing rasters shouldn't be
>> too hard to implement. On the contrary, a list of PIDs (to
>> allow for a raster to be referenced by a process) looks a bit
>> fragile.

> In-process references could be maintained by making a copy (or hard
> link) to the inventory, so that the GC treats it as "live". You would
> need some kind of clean-up mechanism to handle any copies which are
> left behind if a module crashes.

  However, having the GC process all the inventories won't be
  efficient (unless these are stored in a database table with
  appropriate indices). So, I had in mind keeping a references
  file along with each object file.

  But well, creating a temporary inventory to hold all the objects
  may help, e. g.:

$ cat tmp/refs/r.mapcalc-26528.1
References: sO7dZ3p0hlA6iQGMN, EwqVK4sVoFq1bFK7y, KkUET1RdWlwXQxosV,
ajK3kbLfQu3a4Osuq, 8isdA0FB3GCmP15JV, qJaJuz2k7hJKMvRIK
$ cat objects/sO7dZ3p0hlA6iQGMN.refs
2005-04-25T19+0000-mod09-reflectance-250m-band-1
tmp/refs/r.mapcalc-26528.1
tmp/refs/r.univar-8974.1
$

  There's still a locking issue regarding the `.refs' -- it needs
  to be handled if multiple processes try to update the file
  concurrently. Any ideas on how to implement it portably?
  (Perhaps it will be worth looking into, e. g., the SQLite
  source.)
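
  One POSIX-only possibility (so not an answer to portability by
  itself) would be fcntl () record locking on the `.refs' file,
  roughly:

#include <fcntl.h>
#include <unistd.h>

/* Take an exclusive advisory lock on an already-open `.refs' descriptor
 * before appending to it; F_SETLKW blocks until the lock is granted.
 * Windows would need a different mechanism. */
int lock_refs(int fd)
{
    struct flock fl;

    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* the whole file */
    return fcntl(fd, F_SETLKW, &fl);
}

int unlock_refs(int fd)
{
    struct flock fl;

    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    return fcntl(fd, F_SETLK, &fl);
}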

>>> Even there, you could run into problems where a module invokes
>>> another module; the child would need to use the same version of the
>>> map as the parent.

>> Agreed.

>> Actually, the only proper solution to this problem that I know
>> is moving the whole computation chain into a ``parallel
>> existence'' -- forking a separate copy-on-write location at the
>> beginning of a ``transaction block'', and merging it back when
>> it's done. And while I hope that something like this will
>> eventually be available in GRASS, I probably wouldn't say that
>> the current code base is anywhere near that.

> Yep. For the time being, I'd settle for simply re-arranging the
> database layout to have one directory per map.

> [BTW, it has been pointed out that this can reduce the maximum number
> of maps per mapset, as the limit on an inode's hard link count limits
> the maximum number of subdirectories, while there is usually no fixed
> limit on the number of files. E.g. on Linux' ext2fs, the maximum hard
> link count is 65535, so you can't have more than 65533 subdirectories.]

  The inventory scheme, on the other hand, is free from this limit.

Ivan Shmakov wrote:

> In-process references could be maintained by making a copy (or hard
> link) to the inventory, so that the GC treats it as "live". You would
> need some kind of clean-up mechanism to handle any copies which are
> left behind if a module crashes.

  However, having the GC process all the inventories won't be
  efficient (unless these are stored in a database table with
  appropriate indices). So, I had in mind keeping a references
  file along with each object file.

Ah; if you're talking about back-references, one thing to bear in mind
is permissions: you can use maps from mapsets for which you only have
read permission, and not write permission.

[This issue has already arisen with respect to reclass maps and the
reclassed_to file. That was the first GRASS bug I ever fixed.]

That also means that garbage collection would need to scan the entire
location, not just individual mapsets. Actually, re-projection can
span locations, so you would potentially need to scan the entire
database.

> [BTW, it has been pointed out that this can reduce the maximum number
> of maps per mapset, as the limit on an inode's hard link count limits
> the maximum number of subdirectories, while there is usually no fixed
> limit on the number of files. E.g. on Linux' ext2fs, the maximum hard
> link count is 65535, so you can't have more than 65533 subdirectories.]

  The inventory scheme, on the other hand, is free from this limit.

OTOH, if you don't use subdirectories, you will have many more files
in a single directory. This can be a major performance issue on some
filesystems.

--
Glynn Clements <glynn@gclements.plus.com>

Glynn Clements <glynn@gclements.plus.com> writes:

>>> In-process references could be maintained by making a copy (or hard
>>> link) to the inventory, so that the GC treats it as "live". You
>>> would need some kind of clean-up mechanism to handle any copies
>>> which are left behind if a module crashes.

>> However, having the GC process all the inventories won't be efficient
>> (unless these are stored in a database table with appropriate
>> indices). So, I had in mind keeping a references file along with
>> each object file.

> Ah; if you're talking about back-references, one thing to bear in
> mind is permissions: you can use maps from mapsets for which you only
> have read permission, and not write permission.

  Agreed.

> [This issue has already arisen with respect to reclass maps and the
> reclassed_to file. That was the first GRASS bug I ever fixed.]

> That also means that garbage collection would need to scan the entire
> location, not just individual mapsets. Actually, re-projection can
> span locations, so you would potentially need to scan the entire
> database.

  OTOH, I can hardly recall a piece of software that handles
  access to a repository which is read-only for some of its
  instances, but allows deletions by others. All the software that
  handles this well either requires a dedicated server to manage
  the whole ``database'', or relies on replication.

  It seems that the reasonable behaviour would be to make a
  back-reference if possible, and issue a warning if not.

  ... Or, since I've already mentioned replication, a couple more
  solutions are possible for the mapsets intended to be accessed
  read-only by many:

  * make a ``hard link'' for each of the objects in a separate
    mapset, writable by the reading party;

  * never remove an object.

  The first solution actually mimics the ``clone'' feature of
  modern DVCS (say, $ git clone produces a copy of the specified
  Git repository, where most of the files are shared by means of
  ``hard links''.) Obviously, mirroring the mapset effectively
  solves all the problems with permissions, etc., while the design
  of the objects/ directory and the use of hard links ensure
  efficient storage. Two points to pay special attention to are:

  * all the inventories may be copied or hardlinked at the time of
    mirroring, effectively turning a read-only mapset into its
    space-efficient copy; but then there should be a way to keep
    this copy in sync with the source mapset;

  * no such mirroring is currently possible, precisely because
    some files may be updated in place; thus, I believe the ``in
    place'' issue has to be resolved irrespective of whether the
    proposed scheme is accepted as a whole or not.

  The second solution doesn't rely on hard links and thus may be
  appropriate for the systems lacking support for them. It may be
  noted that the disk space occupied by the unreferenced objects
  could be reclaimed if it could be ensured that no party is
  active at the time of GC. E. g., GC may be scheduled to be run
  as part of the OS start-up sequence.

  Furthermore, this solution may be appropriate for various other
  means of sharing files in a read-only manner. E. g., via HTTP.

>>> [BTW, it has been pointed out that this can reduce the maximum
>>> number of maps per mapset, as the limit on an inode's hard link
>>> count limits the maximum number of subdirectories, while there is
>>> usually no fixed limit on the number of files. E.g. on Linux'
>>> ext2fs, the maximum hard link count is 65535, so you can't have
>>> more than 65533 subdirectories.]

>> The inventory scheme, on the other hand, is free from this limit.

> OTOH, if you don't use subdirectories, you will have many more files
> in a single directory. This can be a major performance issue on some
> filesystems.

  This isn't really a problem, at least for the objects/ -- it
  just has to be ensured that the distribution of the names is
  sufficiently even, and then an option may be added so that the
  names are split, like:

(no split)              split-at: 4              split-at: 2, 4
objects/SD6Isoi2orPOu   objects/SD6I/soi2orPOu   objects/SD/6I/soi2orPOu
objects/IyfgXZdP3JYuu   objects/Iyfg/XZdP3JYuu   objects/Iy/fg/XZdP3JYuu
objects/xlRKohTgQKmJj   objects/xlRK/ohTgQKmJj   objects/xl/RK/ohTgQKmJj
objects/oBgUH2otF7Urb   objects/oBgU/H2otF7Urb   objects/oB/gU/H2otF7Urb
objects/CeX9zEZkdR9g5   objects/CeX9/zEZkdR9g5   objects/Ce/X9/zEZkdR9g5
objects/gnUNviMqfnTOx   objects/gnUN/viMqfnTOx   objects/gn/UN/viMqfnTOx

  The source for the evenly-distributed names may be a good RNG,
  or a kind of checksum (e. g., SHA-1) over the file contents.
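
  E. g. (a trivial sketch; split_name ("SD6Isoi2orPOu", 2, 4, ...)
  gives "SD/6I/soi2orPOu"):

#include <stdio.h>
#include <string.h>

/* Insert slashes into an object id at the given offsets; pass 0 for
 * `second' (or for both) to use fewer directory levels. */
void split_name(const char *id, int first, int second,
                char *out, size_t outlen)
{
    if (first > 0 && second > first && (int) strlen(id) > second)
        snprintf(out, outlen, "%.*s/%.*s/%s",
                 first, id, second - first, id + first, id + second);
    else if (first > 0 && (int) strlen(id) > first)
        snprintf(out, outlen, "%.*s/%s", first, id, id + first);
    else
        snprintf(out, outlen, "%s", id);
}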

Ivan Shmakov wrote:

  It seems that the reasonable behaviour would be to make a
  back-reference if possible, and issue a warning if not.

A warning isn't sufficient. If the GC relies upon back-references, but
you can't create one, the data could disappear while a process is
trying to use it.

Essentially, if you can't ensure that the back-references can be
created, you can't design a GC which relies upon them.

  * no such mirroring is currently possible, precisely because
    some files may be updated in place; thus, I believe the ``in
    place'' issue has to be resolved irrespective of whether the
    proposed scheme is accepted as a whole or not.

I agree with this point at least.

Regardless of the layout, the API will need to change. In particular,
there would need to be a distinction between creating a support file
for a new map, and replacing the support file for an existing map.

In all cases, any functions which open elements for writing will need
a matching close function to atomically replace any existing file with
the new version.

Fortunately, there isn't a lot of direct file output outside of the
libraries:

grass=> select a.object, a.symbol
        from obj_imp a, obj_exp b
        where a.symbol = b.symbol
        and b.object in ('lib/gis/OBJ.i686-pc-linux-gnu/open.o', 'lib/gis/OBJ.i686-pc-linux-gnu/open_misc.o')
        and b.symbol not like '%\\_old%'
        and a.object not like 'lib/%'
        ;
                              object | symbol
------------------------------------------------------------------+------------------
general/manage/lib/OBJ.i686-pc-linux-gnu/copyfile.o | G_open_new
general/g.mapsets/OBJ.i686-pc-linux-gnu/main_cmd.o | G_fopen_new
imagery/i.fft/OBJ.i686-pc-linux-gnu/save_fft.o | G_fopen_new_misc
imagery/i.ortho.photo/libes/OBJ.i686-pc-linux-gnu/fopen_camera.o | G_fopen_append
imagery/i.ortho.photo/libes/OBJ.i686-pc-linux-gnu/fopen_camera.o | G_fopen_new
imagery/i.ortho.photo/libes/OBJ.i686-pc-linux-gnu/open_camera.o | G_open_new
raster/r.reclass/OBJ.i686-pc-linux-gnu/reclass.o | G_fopen_new
raster/r.null/OBJ.i686-pc-linux-gnu/null.o | G_open_new_misc
raster/r.flow/OBJ.i686-pc-linux-gnu/io.o | G_open_new
raster/r.flow/OBJ.i686-pc-linux-gnu/io.o | G_open_update
raster/r.support/front/OBJ.i686-pc-linux-gnu/front.o | G_open_new_misc
raster3d/r3.mkdspf/OBJ.i686-pc-linux-gnu/main.o | G_fopen_new
vector/v.external/OBJ.i686-pc-linux-gnu/main.o | G_fopen_new
vector/v.out.ascii/OBJ.i686-pc-linux-gnu/out.o | G_fopen_new
vector/v.support/OBJ.i686-pc-linux-gnu/main.o | G_fopen_modify
vector/v.label/OBJ.i686-pc-linux-gnu/main.o | G_fopen_new
vector/v.lrs/v.lrs.label/OBJ.i686-pc-linux-gnu/main.o | G_fopen_new

--
Glynn Clements <glynn@gclements.plus.com>

Glynn Clements <glynn@gclements.plus.com> writes:

>> It seems that the reasonable behaviour would be to make a
>> back-reference if possible, and issue a warning if not.

> A warning isn't sufficient. If the GC relies upon back-references, but
> you can't create one, the data could disappear while a process is
> trying to use it.

  Still, it's an improvement, since currently the data could just
  as well be overwritten. The transition to tmp/ + rename ()
  alone doesn't help much, as one support file may be renamed
  while another is being read.

  OTOH, I expect the mapsets that are accessible read-only to some
  parties to be used primarily for SELECT, less for INSERT, and
  for a negligible amount of UPDATE or DELETE. At least, this may
  explain why the issue hasn't arisen before.

> Essentially, if you can't ensure that the back-references can be
> created, you can't design a GC which relies upon them.

  Well, yes. That's why I've suggested some solutions in my
  previous post. To simplify, they are:

  * hard-link entire maps into a place where you can create
    back-references, e. g. the current mapset;

    Pros: could be done with a (changed) `g.copy'; becomes possible
      as soon as both the ``in place'' and ``atomic replace''
      issues have been dealt with; no need for an inventory
      scheme;

    Cons: you have to keep your copy in sync; the filesystem must
      support hard links;

  * hard-link just the objects/; the place may be the current
    mapset's objects/ directory, as there's nothing wrong with
    mixing objects belonging to different mapsets or even different
    locations within a single directory;

    Pros: no need to keep the copy in sync, as there's no copy;
      the whole thing is transparent to the user;

    Cons: requires the transition to the inventory scheme; the
      filesystem must support hard links;

  * disable GC for the mapset in question altogether;

    Pros: hard links aren't necessary; it could even be made
      possible to fetch the objects over, e. g., HTTP;

    Cons: a disk space reclamation policy has to be figured out,
      unless the mapset is intended to be completely static.

  The first two solutions effectively split the GC domain in two,
  thus solving the problem.

>> * no such mirroring is currently possible, precisely because some
>> files may be updated in place; thus, I believe the ``in place''
>> issue has to be resolved irrespective of whether the proposed scheme
>> is accepted as a whole or not.

> I agree with this point at least.

> Regardless of the layout, the API will need to change. In particular,
> there would need to be a distinction between creating a support file
> for a new map, and replacing the support file for an existing map.

> In all cases, any functions which open elements for writing will need
> a matching close function to atomically replace any existing file
> with the new version.

  But note that there's still no way to atomically replace any
  existing /raster/.

> Fortunately, there isn't a lot of direct file output outside of the
> libraries:

[...]

  Good. I'll try to take a glance at them.