[GRASS-dev] [GRASS GIS] #3069: Evaluate branches in r.mapcalc if-function only when needed

#3069: Evaluate branches in r.mapcalc if-function only when needed
-------------------------------------------------+-------------------------
Reporter: wenzeslaus | Owner: grass-dev@…
     Type: enhancement | Status: new
Priority: normal | Milestone: 7.3.0
Component: Raster | Version: svn-trunk
Keywords: r.mapcalc, r3.mapcalc, | CPU: Unspecified
  optimization, expression, evaluation, if- |
  statement, if-function |
Platform: Unspecified |
-------------------------------------------------+-------------------------
The ''r.mapcalc'' expression `x = if(0, rand(0, 2), 3)` takes about the
same time to evaluate as `x = rand(0, 2) + if(0, 2, 3)`, although the
true-branch of the if-statement is never used. As far as I understand,
this is given by fact that `if` is actually a function in ''r.mapcalc'',
so all the arguments are evaluated before the function is called.

It would be a nice enhancement if the values of if-function arguments
would be computed only when needed, i.e. `x = if(0, rand(0, 2), 3)` would
take same time as `x = if(0, 1, 3)` because `rand(0, 2)` is never used.
This would allow for some optimizations when part of the output raster is
already determined and there is no need to perform the computation, for
example a moving window calculation which needs to be performed only when
the central cell has a certain value. Another case when this would be
advantageous is when both branches (arguments) are costly to compute,
currently both are computed and one thrown away.

Here is the minimal example:

{{{
#!bash
g.region rows=10000 cols=10000
time r.mapcalc "x1 = rand(0, 2) + rand(0, 2) + rand(0, 2) + if(0, 2, 3)"
seed=1
time r.mapcalc "x2 = 2 + if(0, rand(0, 2) + rand(0, 2) + rand(0, 2), 3)"
seed=1
}}}

In both case, I get something like 14.5 s and for the expression with
trivial unused branch (`x = 2 + if(0, 2, 3)`) I get less than 6 s.

--
Ticket URL: <https://trac.osgeo.org/grass/ticket/3069&gt;
GRASS GIS <https://grass.osgeo.org>

#3069: Evaluate branches in r.mapcalc if-function only when needed
-------------------------+-------------------------------------------------
  Reporter: wenzeslaus | Owner: grass-dev@…
      Type: | Status: new
  enhancement |
  Priority: normal | Milestone: 7.3.0
Component: Raster | Version: svn-trunk
Resolution: | Keywords: r.mapcalc, r3.mapcalc,
                         | optimization, expression, evaluation, if-
       CPU: | statement, if-function
  Unspecified | Platform: Unspecified
-------------------------+-------------------------------------------------

Comment (by glynn):

Replying to [ticket:3069 wenzeslaus]:

> The ''r.mapcalc'' expression `x = if(0, rand(0, 2), 3)` takes about the
same time to evaluate as `x = rand(0, 2) + if(0, 2, 3)`, although the
true-branch of the if-statement is never used. As far as I understand,
this is given by fact that `if` is actually a function in ''r.mapcalc'',
so all the arguments are evaluated before the function is called.

It's largely a consequence of r.mapcalc using vectorised (SIMD)
evaluation, similar to e.g. std::valarray in C++, !NumPy, or GPU-oriented
languages such as GLSL.

The "value" of any expression is a row buffer, an array containing a value
for each column. Each function takes row buffers as arguments and
populates the row buffer for the result.

The reason for this approach is to reduce the overhead of the interpreter,
as the interpreter only need to be run per-row rather than per-cell.

As it stands, conditional evaluation would require first evaluating the
condition expression, then analysing the row buffer holding the condition
to determine whether a specific case (zero or non-zero for the 2-argument
and 3-argument versions, negative, zero or positive for the 4-argument
version) doesn't occur in any column.

This could be optimised by analysing the expression to identify sub-
expressions which are constant for any given row. This includes literals,
and function calls where all arguments are constant and the function
itself doesn't "inject" non-constant values (specifically, rand(), col()
and x()). Ideally, variables would also need to be flagged as holding
constant-per-row values.

In the case of variables, there's also the issue that a given variable may
be used within multiple conditionals where the conditions are different.
The variable would need to be evaluated for each column where its value is
needed for at least one usage (i.e. the union of all of the conditions).
That would require either a) associating a validity mask with each
variable (indicating the columns for which the value has been calculated),
b) performing the evaluation independently for each conditional (typically
resulting in redundant evaluation), or c) simply evaluating the variable
for all columns in all but the most straightfoward cases.

Needless to say, this isn't a simple change. It isn't one which I have
time to work on.

--
Ticket URL: <https://trac.osgeo.org/grass/ticket/3069#comment:1&gt;
GRASS GIS <https://grass.osgeo.org>