Moritz Lennert wrote:
>> I don't know how to debug this...
>
> Can you identify a repeatable test case?
>
> If I could make it happen, I could debug it.
You can get a location named TEST here:
http://tomahawk.ulb.ac.be/moritz/mask_bug_testlocation.tgz
This contains only a PERMANENT mapset.
In that mapset, launch the following command:
    r.mask vect=hull
    for map in $(g.list rast pat="firm_rate*"); do
        echo $map
        r.mapcalc "temp_prob = float($map) / sum_rates" --o --q
    done
    r.mask -r
I get the error arbitrarily for different firm_rate_* maps, sometimes
only for one, sometimes for many, but at each run it's for different
maps.
So it's non-deterministic (I'm getting one error for every 10-20
passes over the data, i.e. every 1200-2500 commands), and only applies
to r.mapcalc.
My first guess was a race condition related to pthreads. I tried
    export WORKERS=0
before running the test, and it hasn't happened since.
And actually I'm now fairly certain as to the specific cause.
When compiled with pthread support, r.mapcalc has a mutex for each map
to prevent concurrent access to a single map from multiple threads.
Concurrent access to different maps (and to core lib/gis and
lib/raster functionality) from different threads is supposed to be
safe (see r34485 and the interval surrounding it), but the MASK was
overlooked.
If a MASK is in use, reading a row from any raster map will read the
corresponding row from the MASK, and there's nothing to prevent
different threads from concurrently accessing two different maps and
thus accessing the MASK.
So, in read_data_{compressed,uncompressed,fp_compressed} in
lib/raster/get_row.c we have code like:
    if (lseek(fcb->data_fd, (off_t) row * bufsize, SEEK_SET) == -1)
        G_fatal_error(_("Error reading raster data for row %d of <%s>"),
                      row, fcb->name);

    if (read(fcb->data_fd, data_buf, bufsize) != bufsize)
        G_fatal_error(_("Error reading raster data for row %d of <%s>"),
                      row, fcb->name);
If multiple threads execute this code concurrently, you can end up
with the calls being interleaved like so:
    Thread 1        Thread 2

    lseek
                    lseek
    read
                    read
meaning that the file offset has changed between the lseek() and the
read() (this is why X/Open and POSIX added pread(), but that's still
relatively new).
This only results in an error at the end of the file (the first read()
will leave the file offset at EOF, so the second read() fails), but in
other situations it's likely causing the wrong row of the MASK to be
read.
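
For illustration only, here is a minimal sketch of what a pread()-based
row read could look like. read_row_at() is a hypothetical helper, not an
existing lib/raster function, and it assumes fixed-size uncompressed
rows. Because pread() takes the offset as an argument and leaves the
descriptor's file offset alone, there is no window between seek and read
for another thread to get into:

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of lib/raster): read one fixed-size row
 * with positioned reads instead of lseek() followed by read().
 * Returns 0 on success, -1 on error or unexpected end of file. */
static int read_row_at(int fd, void *buf, size_t bufsize, int row)
{
    off_t offset = (off_t) row * (off_t) bufsize;
    size_t done = 0;

    if (row < 0)
        return -1;

    while (done < bufsize) {
        /* pread() does not use or modify the shared file offset */
        ssize_t n = pread(fd, (char *) buf + done, bufsize - done,
                          offset + (off_t) done);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal, retry */
            return -1;          /* genuine I/O error */
        }
        if (n == 0)
            return -1;          /* hit EOF before the row was complete */
        done += (size_t) n;
    }

    return 0;
}
```

A caller would replace the lseek()/read() pair with something like
"if (read_row_at(fd, data_buf, bufsize, row) < 0) G_fatal_error(...)".
This only addresses the shared file offset; any other per-map state
touched while reading the MASK would still need its own protection.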
A possible quick fix:
    if (R__.auto_mask > 0)
        putenv("WORKERS=0");
A slightly better fix would be to check for masking and if it's
enabled, have a single mutex which guards *all* raster reads so that
even concurrent access to different maps is blocked. Unlike the above
hack, this still allows computations to be executed in parallel.
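
A rough sketch of the single-mutex idea, as a toy rather than actual
r.mapcalc code. All names here (raster_read_mutex, mask_in_use,
guarded_read_row, toy_reader) are made up for the example; mask_in_use
stands in for the "R__.auto_mask > 0" test, and the toy reader mimics
the lseek()/read() pair by writing a shared "offset" and then checking
that it hasn't moved:

```c
#define _XOPEN_SOURCE 700
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* One mutex for *all* raster reads, only taken when a MASK is active. */
static pthread_mutex_t raster_read_mutex = PTHREAD_MUTEX_INITIALIZER;
static int mask_in_use = 1;     /* stand-in for R__.auto_mask > 0 */

typedef void (*row_reader)(void *map, int row);

static void guarded_read_row(row_reader reader, void *map, int row)
{
    const int lock = mask_in_use;   /* decide once per call */

    if (lock)
        pthread_mutex_lock(&raster_read_mutex);
    reader(map, row);
    if (lock)
        pthread_mutex_unlock(&raster_read_mutex);
}

/* Toy model of the shared MASK file offset. */
static long mask_offset;
static long mismatches;

static void toy_reader(void *map, int row)
{
    (void) map;
    mask_offset = row;          /* plays the role of lseek() */
    sched_yield();              /* widen the window between the two steps */
    if (mask_offset != row)     /* plays the role of read() */
        mismatches++;
}

static void *worker(void *arg)
{
    int base = *(int *) arg;

    for (int i = 0; i < 10000; i++)
        guarded_read_row(toy_reader, NULL, base + i);
    return NULL;
}

/* Run two workers on "different maps"; returns the number of times a
 * reader saw the shared offset changed underneath it. */
static long run_workers(void)
{
    pthread_t threads[2];
    int base[2] = { 0, 100000 };

    mismatches = 0;
    for (int i = 0; i < 2; i++)
        pthread_create(&threads[i], NULL, worker, &base[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(threads[i], NULL);
    return mismatches;
}
```

With mask_in_use set, run_workers() returns 0 because every access to
the shared offset happens under the mutex. With it cleared, the two
workers race on mask_offset, which is exactly the situation described
above. Only the read itself is serialised; whatever each thread
does with the row afterwards still runs in parallel.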
Better still would be to guard access to the MASK so that the other
aspects of raster input can be parallelised (raster I/O is still a
major bottleneck, and mostly because of processing rather than actual
disc access).
But that would involve either adding pthread code directly into the
base raster input code in lib/raster/get_row.c (undesirable) or at
least adding a mechanism to allow r.mapcalc to hook into it to provide
the mutex.
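
To make the hook idea concrete, here is one way it might look. None of
these names exist in lib/raster; set_mask_hooks(), read_mask_row() and
struct mask_guard are invented for the sketch, and the actual MASK read
is reduced to a placeholder. The point is just that the library side
only sees two function pointers and a closure, so it never needs to
include pthread.h:

```c
#define _XOPEN_SOURCE 700
#include <pthread.h>
#include <stddef.h>

/* ---- Library side: no pthread dependency (hypothetical names) ---- */

typedef void (*mask_hook_fn)(void *closure);

static mask_hook_fn mask_lock_hook;
static mask_hook_fn mask_unlock_hook;
static void *mask_hook_closure;

/* Intended to be called once, before any worker threads are started. */
static void set_mask_hooks(mask_hook_fn lock, mask_hook_fn unlock,
                           void *closure)
{
    mask_lock_hook = lock;
    mask_unlock_hook = unlock;
    mask_hook_closure = closure;
}

/* Stand-in for the part of a raster row read that touches the MASK. */
static int read_mask_row(int row)
{
    int result;

    if (mask_lock_hook)
        mask_lock_hook(mask_hook_closure);

    result = row;   /* placeholder for the lseek()/read() on the MASK */

    if (mask_unlock_hook)
        mask_unlock_hook(mask_hook_closure);

    return result;
}

/* ---- Module side: this is where the pthread code lives ---- */

struct mask_guard {
    pthread_mutex_t mutex;
    int lock_calls;         /* only here so the sketch can be checked */
};

static struct mask_guard mask_guard = { PTHREAD_MUTEX_INITIALIZER, 0 };

static void guard_lock(void *closure)
{
    struct mask_guard *g = closure;

    pthread_mutex_lock(&g->mutex);
    g->lock_calls++;        /* updated while holding the mutex */
}

static void guard_unlock(void *closure)
{
    struct mask_guard *g = closure;

    pthread_mutex_unlock(&g->mutex);
}
```

The module would call
set_mask_hooks(guard_lock, guard_unlock, &mask_guard) during start-up
when it has worker threads; modules which never register hooks pay
only for two NULL checks per row.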
--
Glynn Clements <glynn@gclements.plus.com>