OSGeo Discourse

[GRASS-dev] Error reading raster data for row xxx (only when using r.series and t.rast.series)

GRASS GRASS developer

Dylan_Beaudette1 October 15, 2015, 7:44pm 21

On Wed, Oct 14, 2015 at 9:45 PM, Dylan Beaudette
<dylan.beaudette@gmail.com> wrote:

On Wed, Oct 14, 2015 at 12:55 PM, Dylan Beaudette
<dylan.beaudette@gmail.com> wrote:

On Wed, Oct 14, 2015 at 10:50 AM, Dylan Beaudette
<dylan.beaudette@gmail.com> wrote:

Some additional clues:

The original stack was 365 maps with 3105 x 7025 cells.

1. zooming into a smaller region (30 x 40 cells) and running
t.rast.series 100x resulted in 100 "correct" maps, no errors.

2. returning to the full extent and running t.rast.series 30x on the
first 31 maps resulted in 30 "correct" maps, no errors.

3. returning to the full extent and running t.rast.series 30x on the
last 31 maps resulted in 30 "correct" maps, no errors

So, it seems that t.rast.series (r.series) is throwing an error, or
generating wront output, when when:

a large set of maps are supplied as input, and, a region that has a
moderate number of total cells.

Yeah, I know, that isn't very specific. I will try re-compiling with
debugging and no optimization next.

Dylan

More data,

1. re-compiled with CFLAGS="-g -Wall":
* Multiple runs of t.rast.series with the full stack (365 maps with
3105 x 7025 cells), no errors.
* each run required about 8.5 minutes to complete

2. re-compiled with CFLAGS="-O2 -mtune=native -march=native" LDFLAGS="-s":
* 10x tests with full stack, no errors
* each run required about 3.5 minutes

3. re-run original script (see listing below)
* random errors from t.rast.series

This doesn't make much sense to me. The only difference between my
latest "tests" and the original code is that the input to
t.rast.series was static over the course of my "tests", vs. dynamic
within the original code (see below). I purposely selected a stack
that caused t.rast.series to throw an error for my tests.

OK, this does make sense--t.rast.series (r.series) was not the source
of the problems. I was able to verify this by running t.univar on the
output from the previous step:

  # NOTE: 4 CPUs so that external disk isn't thrashed
  gdd_max_C=30
  gdd_min_C=10
  gdd_base_C=10
  t.rast.mapcalc --q --o nprocs=4 input=tmin_subset,tmax_subset
output=gdd basename=gdd expr="max(((min(tmax_subset, $gdd_max_C) +
max(tmin_subset, $gdd_min_C)) / 2.0) - $gdd_base_C, 0)"

... which means that t.rast.mapcalc was generating one (or more)
outputs with some kind of problem, which was then causing t.univar and
t.rast.series to fail.

I can now verify that t.rast.mapcalc is creating some raster maps with
corrupt (?) data. Corrupt in the sense that subsequent reading of the
maps results in the "Error reading raster data for row ..." error.
Just in case anyone is interested, I have opened a ticket for more
informative errors raised by lib/raster/get_row.c
(https://trac.osgeo.org/grass/ticket/2762).

As previously reported, errors seem to occur about 50-60% of the time
and _do not_ appear to be related to the number of concurrent
t.rast.mapcalc instances.

After some more testing, I have found that t.rast.mapcalc does not
(randomly) generate corrupt maps when the output from the mapcalc
expression results in a CELL type map. Expressions that result in both
FCELL and DCELL seem to trigger the corruption.

Fortunately my current project isn't too discriminating and is fine
with CELL output from t.rast.mapcalc.

I now suspect that this is an overflow issue in t.rast.mapcalc (well
the library functions that it calls) that may or may not be influenced
by the use of files linked via r.external.

The inputs to t.rast.mapcalc are files that have been registered with
r.external. I suspect that the multiple concurrent r.mapcalc instances
may be to blame. I don't have an explanation other than some evidence
from the last time I encountered this type of issue. The workflow then
was :

1. spawn 8 concurrent processes via backgrounding: r.sun -> r.mapcalc

2. when finished with daily solar models, sum maps with r.series

I would occasionally encounter the "Error reading raster data for row
xxx" error from r.series in this case and assume that r.series had
somehow broken the map in question.

It would seem that concurrent use of r.mapcalc may be worth
investigating... however, it is strange that it only occurs sometimes.

I stand corrected. My previous encounters with the "Error reading
raster data for row ..." error were likely associated with this
related problem, which is now fixed:

http://lists.osgeo.org/pipermail/grass-dev/2015-July/075627.html

Oddly enough, I didn't have problems with maps generated with the
following (similar) code:

# spring frost
# if tmin never drops below 0 before the start of summer, then the
last "spring frost" is on day 0
# NOTE: 2 CPUs so that disk isn't thrashed
t.rast.mapcalc --o -n nprocs=2 input=tmin output=spring_frost
basename=spring_frost \
expr="if(start_doy() < 182, if(tmin < 0, start_doy(), 0), null())"

# fall frost
# NOTE: 2 CPUs so that disk isn't thrashed
t.rast.mapcalc --o -n nprocs=2 input=tmin output=fall_frost
basename=fall_frost \
expr="if(start_doy() > 213, if(tmin < 0, start_doy(), 365), null())"

... Not so odd anymore, as these t.rast.mapcalc expressions always
resulted in CELL maps.

Dylan

Glynn_Clements1 October 19, 2015, 7:45pm 22

Dylan Beaudette wrote:

I can now verify that t.rast.mapcalc is creating some raster maps with
corrupt (?) data. Corrupt in the sense that subsequent reading of the
maps results in the "Error reading raster data for row ..." error.
Just in case anyone is interested, I have opened a ticket for more
informative errors raised by lib/raster/get_row.c
(https://trac.osgeo.org/grass/ticket/2762).

As previously reported, errors seem to occur about 50-60% of the time
and _do not_ appear to be related to the number of concurrent
t.rast.mapcalc instances.

t.rast.mapcalc relies upon r.mapcalc.

If it's a pthread issue, using "export WORKERS=0" from the shell
before running t.rast.mapcalc should make it go away.

r.mapcalc uses a mutex for each map to prevent multiple threads
accessing a single map concurrently. Accessing different maps from
multiple threads should be safe ... for normal GRASS rasters, and
provided that no MASK raster is present (otherwise it ends up
accessing the MASK raster from multiple threads concurrently, which
results in the "Error reading raster data ..." error, due to a race
condition between the lseek() and the read()).

But for GDAL-linked (r.external) rasters, there may be re-entrancy
issues with GDAL itself, meaning that even accessing different maps
from multiple threads is unsafe.

Fixing that is potentially problematic. r.mapcalc doesn't know whether
its inputs are GDAL-linked (this feature is supposed to be transparent
to modules; there isn't a Rast_is_gdal(int fd) or similar function),
and lib/raster itself doesn't have any knowledge of pthreads (lib/gis
provides functions for managing a thread pool; r.mapcalc/r3.mapcalc
are currently the only modules using these functions, and r3.mapcalc
specifically sets WORKERS=0 because lib/raster3d isn't thread-safe).

--
Glynn Clements <glynn@gclements.plus.com>

Dylan_Beaudette1 October 19, 2015, 8:24pm 23

On Mon, Oct 19, 2015 at 12:45 PM, Glynn Clements
<glynn@gclements.plus.com> wrote:

Dylan Beaudette wrote:

I can now verify that t.rast.mapcalc is creating some raster maps with
corrupt (?) data. Corrupt in the sense that subsequent reading of the
maps results in the "Error reading raster data for row ..." error.
Just in case anyone is interested, I have opened a ticket for more
informative errors raised by lib/raster/get_row.c
(https://trac.osgeo.org/grass/ticket/2762).

As previously reported, errors seem to occur about 50-60% of the time
and _do not_ appear to be related to the number of concurrent
t.rast.mapcalc instances.

t.rast.mapcalc relies upon r.mapcalc.

OK, thanks for the clarification. I'll be more specific in future
tests / posts / tickets.

If it's a pthread issue, using "export WORKERS=0" from the shell
before running t.rast.mapcalc should make it go away.

I don't think that it has anything to do with pthreads... Does the
"--without-pthread" configure argument actually work?

r.mapcalc uses a mutex for each map to prevent multiple threads
accessing a single map concurrently. Accessing different maps from
multiple threads should be safe ... for normal GRASS rasters, and
provided that no MASK raster is present (otherwise it ends up
accessing the MASK raster from multiple threads concurrently, which
results in the "Error reading raster data ..." error, due to a race
condition between the lseek() and the read()).

Got it. This is valuable information. Is it documented anywhere other
than GRASS-user and GRASS-dev mailing lists? How hard would it be to
issue a warning when someone tries parallel access to the MASK? It
would seem that some of the new t.* modules could be affected by this
reality as many of them support parallel processes. Is it the duty of
the module author to warn of such pitfalls?

Again, not sure how to implement but happy to test / document.

But for GDAL-linked (r.external) rasters, there may be re-entrancy
issues with GDAL itself, meaning that even accessing different maps
from multiple threads is unsafe.

This is troubling. Is there anyway to empirically determine what is
safe? Last week worked on a project where I was processing daily
rasters for a 30 year interval, involving 3 climatic variables. Due to
the large number of files, I had to work with "external" data sources.
For the most part, everything went as expected when the results of
(parallel) r.mapcalc expressions were CELL maps. Not so when the
results were FCELL or DCELL maps.

Fixing that is potentially problematic. r.mapcalc doesn't know whether
its inputs are GDAL-linked (this feature is supposed to be transparent
to modules; there isn't a Rast_is_gdal(int fd) or similar function),
and lib/raster itself doesn't have any knowledge of pthreads (lib/gis
provides functions for managing a thread pool; r.mapcalc/r3.mapcalc
are currently the only modules using these functions, and r3.mapcalc
specifically sets WORKERS=0 because lib/raster3d isn't thread-safe).

OK. Does setting "WORKERS=0" do anything when GRASS is compiled
without pthreads?

Thank you,
Dylan

--
Glynn Clements <glynn@gclements.plus.com>

Glynn_Clements1 October 22, 2015, 1:14am 24

Dylan Beaudette wrote:

> If it's a pthread issue, using "export WORKERS=0" from the shell
> before running t.rast.mapcalc should make it go away.

I don't think that it has anything to do with pthreads... Does the
"--without-pthread" configure argument actually work?

--without-pthread is the default. You can check whether GRASS was
built with pthread support by searching for HAVE_PTHREAD_H in
include/grass/config.h. The configure script only checks for
<pthread.h> (and defines that macro) if --with-pthread is given.

> r.mapcalc uses a mutex for each map to prevent multiple threads
> accessing a single map concurrently. Accessing different maps from
> multiple threads should be safe ... for normal GRASS rasters, and
> provided that no MASK raster is present (otherwise it ends up
> accessing the MASK raster from multiple threads concurrently, which
> results in the "Error reading raster data ..." error, due to a race
> condition between the lseek() and the read()).

Got it. This is valuable information. Is it documented anywhere other
than GRASS-user and GRASS-dev mailing lists? How hard would it be to
issue a warning when someone tries parallel access to the MASK? It
would seem that some of the new t.* modules could be affected by this
reality as many of them support parallel processes. Is it the duty of
the module author to warn of such pitfalls?

The MASK issue should have been fixed with r65591 (in trunk, not sure
if it has been backported to releasebranch_7_0 yet).

Ultimately, if a module uses pthreads itself, it has to ensure that
any functions which aren't thread-safe aren't called concurrently from
multiple threads.

Accessing different maps from multiple threads is "intended" to be
safe. But the issue with the MASK was overlooked, as was the potential
for issues with GDAL-linked rasters.

I'm starting to think that r.mapcalc's pthread support to may be more
trouble than it's worth. Unless the expression is particularly
complex, r.mapcalc tends to be I/O-bound, and pthreads doesn't really
help with that.

At a minimum, I may change the thread pool code (lib/gis/worker.c) so
that the default number of workers (if WORKERS isn't set) is zero,
i.e. multiple threads will only be used if specifically requested by
the user.

> But for GDAL-linked (r.external) rasters, there may be re-entrancy
> issues with GDAL itself, meaning that even accessing different maps
> from multiple threads is unsafe.

This is troubling. Is there anyway to empirically determine what is
safe? Last week worked on a project where I was processing daily
rasters for a 30 year interval, involving 3 climatic variables. Due to
the large number of files, I had to work with "external" data sources.
For the most part, everything went as expected when the results of
(parallel) r.mapcalc expressions were CELL maps. Not so when the
results were FCELL or DCELL maps.

The main problem with GDAL is that it's essentially a collection of
distinct modules in a common framework. Unless they have (and enforce)
a policy that GDALRasterIO() must be re-entrant (at least for
different maps), it may be the case that concurrent access is safe for
some formats but not others.

> Fixing that is potentially problematic. r.mapcalc doesn't know whether
> its inputs are GDAL-linked (this feature is supposed to be transparent
> to modules; there isn't a Rast_is_gdal(int fd) or similar function),
> and lib/raster itself doesn't have any knowledge of pthreads (lib/gis
> provides functions for managing a thread pool; r.mapcalc/r3.mapcalc
> are currently the only modules using these functions, and r3.mapcalc
> specifically sets WORKERS=0 because lib/raster3d isn't thread-safe).

OK. Does setting "WORKERS=0" do anything when GRASS is compiled
without pthreads?

No.

If you're experiencing these problems with WORKERS=0 or with GRASS
compiled without pthread support, then the problem is something else
entirely.

--
Glynn Clements <glynn@gclements.plus.com>

neteler November 3, 2015, 7:33pm 25

On Thu, Oct 22, 2015 at 3:14 AM, Glynn Clements
<glynn@gclements.plus.com> wrote:

Dylan Beaudette wrote:

...

The MASK issue should have been fixed with r65591 (in trunk, not sure
if it has been backported to releasebranch_7_0 yet).

Yes, in r65764 about 4 months ago.

Ultimately, if a module uses pthreads itself, it has to ensure that
any functions which aren't thread-safe aren't called concurrently from
multiple threads.

Accessing different maps from multiple threads is "intended" to be
safe. But the issue with the MASK was overlooked, as was the potential
for issues with GDAL-linked rasters.

I'm starting to think that r.mapcalc's pthread support to may be more
trouble than it's worth. Unless the expression is particularly
complex, r.mapcalc tends to be I/O-bound, and pthreads doesn't really
help with that.

Given all the related updated in GCC etc., perhaps it is possible to
implement reasonable openMP support now?

At a minimum, I may change the thread pool code (lib/gis/worker.c) so
that the default number of workers (if WORKERS isn't set) is zero,
i.e. multiple threads will only be used if specifically requested by
the user.

Maybe better since this question came up a few times.

> But for GDAL-linked (r.external) rasters, there may be re-entrancy
> issues with GDAL itself, meaning that even accessing different maps
> from multiple threads is unsafe.

This is troubling. Is there anyway to empirically determine what is
safe? Last week worked on a project where I was processing daily
rasters for a 30 year interval, involving 3 climatic variables. Due to
the large number of files, I had to work with "external" data sources.
For the most part, everything went as expected when the results of
(parallel) r.mapcalc expressions were CELL maps. Not so when the
results were FCELL or DCELL maps.

The main problem with GDAL is that it's essentially a collection of
distinct modules in a common framework. Unless they have (and enforce)
a policy that GDALRasterIO() must be re-entrant (at least for
different maps), it may be the case that concurrent access is safe for
some formats but not others.

> Fixing that is potentially problematic. r.mapcalc doesn't know whether
> its inputs are GDAL-linked (this feature is supposed to be transparent
> to modules; there isn't a Rast_is_gdal(int fd) or similar function),
> and lib/raster itself doesn't have any knowledge of pthreads (lib/gis
> provides functions for managing a thread pool; r.mapcalc/r3.mapcalc
> are currently the only modules using these functions, and r3.mapcalc
> specifically sets WORKERS=0 because lib/raster3d isn't thread-safe).

OK. Does setting "WORKERS=0" do anything when GRASS is compiled
without pthreads?

No.

If you're experiencing these problems with WORKERS=0 or with GRASS
compiled without pthread support, then the problem is something else
entirely.

(apparently no such issue)

Markus

--
Glynn Clements <glynn@gclements.plus.com>
_______________________________________________
grass-dev mailing list
grass-dev@lists.osgeo.org
http://lists.osgeo.org/mailman/listinfo/grass-dev

Glynn_Clements1 November 4, 2015, 2:24am 26

Markus Neteler wrote:

> At a minimum, I may change the thread pool code (lib/gis/worker.c) so
> that the default number of workers (if WORKERS isn't set) is zero,
> i.e. multiple threads will only be used if specifically requested by
> the user.

Maybe better since this question came up a few times.

Done in r66731.

--
Glynn Clements <glynn@gclements.plus.com>

neteler December 27, 2015, 8:12pm 27

On Wed, Nov 4, 2015 at 3:24 AM, Glynn Clements <glynn@gclements.plus.com> wrote:

Markus Neteler wrote:

> At a minimum, I may change the thread pool code (lib/gis/worker.c) so
> that the default number of workers (if WORKERS isn't set) is zero,
> i.e. multiple threads will only be used if specifically requested by
> the user.

Maybe better since this question came up a few times.

Done in r66731.

Backported in r67398 for 7.0.3.

Markus