[GRASS-dev] [GRASS-user] r.to.vect stats

[Switching from grass-user to grass-dev.]

On Mon, May 1, 2017 at 5:11 PM, Markus Metz <markus.metz.giswork@gmail.com>
wrote:

> > Is it different from the binning code in r.in.xyz?
>
>
> Not really. r.in.lidar was based on r.in.xyz; now they have diverged,
but of course the idea is to make them as similar as possible again.

I would rather keep r.in.xyz as simple as possible and instead add
wrapper scripts for specific tasks. Maybe r.in.lidar could be sync'ed back
to r.in.xyz and a new wrapper script could be created, something like
r.in.lidar.filter :wink:

I was perhaps not clear enough. The code of r.in.lidar and r.in.xyz is the
same even now except for the separation of the binning code and LAS versus
ASCII. There is some filtering related to lidar, but I don't understand
whether that is what you mean. Nevertheless, some filtering in r.in.xyz and
getting some functionality from r.in.lidar would make sense (as they are used
interchangeably depending on libLAS availability); unfortunately that would
not really make it simpler.

> >
> >>
> >> Anyway, a new module needs to be implemented for this. The name can be
> >> something like r.binning, r.vect.stats, r.point.stats, or
> >> r.points.stats. It might be a good project for beginners.
> >
> >
> > I like the idea. But do you really think this would be much faster
than v.out.ascii | r.in.xyz?
>
>
> I don't know if much faster, but it will be at least a little faster.
Another issue is that "v.out.ascii | r.in.xyz" is a little hard to do on MS
Windows and from the GUI in general. Yes, there should have been a wrapper
module for this already, but a module that does the task directly seems to
me a better solution.

Why would a new C module be a better solution than a simple wrapper
combining v.out.ascii + r.in.xyz? Considering code maintenance, a wrapper
script would be easier to maintain. Otherwise, a dedicated C module would
be another clone of r.in.xyz, taking as input not a text file but a GRASS
vector with points.

I agree, a script would be easier to maintain, but considering the
duplication between r.in.xyz and r.in.lidar (and possibly also r.in.pdal,
which I haven't mentioned yet), I thought moving the binning code to the
library and perhaps gaining some performance improvement along the way would
be a better option.

But I'm not against a wrapper. I added a prototype with limited
functionality to addons [1]. However, I'm not sure how to account for large
data - that's what discouraged me from creating a wrapper. The Python
subprocess documentation says to use communicate(), but its doc says "The data
read is buffered in memory, so do not use this method if the data size is
large or unlimited." [2] Do I have to do the buffering myself then, or is it
just fine? Is there code like this in GRASS?

Thanks for the feedback,
Vaclav

[1] https://trac.osgeo.org/grass/changeset/70996
[2]
https://docs.python.org/2/library/subprocess.html#subprocess.Popen.communicate
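For what it's worth, the streaming pattern under discussion (the first process's stdout fed straight to the second's stdin, with communicate() called only on the last process) can be sketched as follows. The GRASS modules are replaced by python -c stand-ins so the pattern itself is runnable anywhere; this is a sketch of the plumbing, not of the actual wrapper:

```python
import subprocess
import sys

# Stand-in for "v.out.ascii | r.in.xyz": the first process streams point
# records, the second consumes them from stdin, line by line.
producer = subprocess.Popen(
    [sys.executable, "-c",
     "import sys\n"
     "for i in range(100000):\n"
     "    sys.stdout.write('%d|%d|%d\\n' % (i, i, i))"],
    stdout=subprocess.PIPE,
)
consumer = subprocess.Popen(
    [sys.executable, "-c",
     "import sys\n"
     "sys.stdout.write(str(sum(1 for line in sys.stdin)))"],
    stdin=producer.stdout,   # streamed directly, nothing buffered in the parent
    stdout=subprocess.PIPE,
)
producer.stdout.close()  # drop the parent's copy so the producer can get SIGPIPE
out, _ = consumer.communicate()  # communicate() only on the last process
producer.wait()
print(out.decode())  # prints 100000
```

Since the parent never reads the producer's stdout, nothing is buffered in the wrapper's memory regardless of the data size.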

On 02/05/17 03:03, Vaclav Petras wrote:

[Switching from grass-user to grass-dev.]

On Mon, May 1, 2017 at 5:11 PM, Markus Metz
<markus.metz.giswork@gmail.com <mailto:markus.metz.giswork@gmail.com>>
wrote:

    > > Is it different from the binning code in r.in.xyz?
    >
    > Not really. r.in.lidar was based on r.in.xyz; now they have diverged,
    but of course the idea is to make them as similar as possible again.

    I would rather keep r.in.xyz as simple as possible
    and instead add wrapper scripts for specific tasks. Maybe r.in.lidar
    could be sync'ed back to r.in.xyz and a new
    wrapper script could be created, something like r.in.lidar.filter :wink:

I was perhaps not clear enough. The code of r.in.lidar and r.in.xyz
is the same even now except for the separation of the
binning code and LAS versus ASCII. There is some filtering related to
lidar, but I don't understand whether that is what you mean. Nevertheless,
some filtering in r.in.xyz and getting some functionality
from r.in.lidar would make sense (as they are used interchangeably
depending on libLAS availability); unfortunately that would not really
make it simpler.

    > >
    > >>
    > >> Anyway, a new module needs to be implemented for this. The name can be
    > >> something like r.binning, r.vect.stats, r.point.stats, or
    > >> r.points.stats. It might be a good project for beginners.
    > >
    > > I like the idea. But do you really think this would be much faster than v.out.ascii | r.in.xyz?
    >
    > I don't know if much faster, but it will be at least a little faster. Another issue is that "v.out.ascii | r.in.xyz" is a little hard to do on MS Windows and
    from the GUI in general. Yes, there should have been a wrapper module
    for this already, but a module that does the task directly seems
    to me a better solution.

    Why would a new C module be a better solution than a simple wrapper
    combining v.out.ascii + r.in.xyz? Considering code
    maintenance, a wrapper script would be easier to maintain.
    Otherwise, a dedicated C module would be another clone of r.in.xyz,
    taking as input not a text file but a GRASS
    vector with points.

I agree, a script would be easier to maintain, but considering the
duplication between r.in.xyz and r.in.lidar (and
possibly also r.in.pdal, which I haven't mentioned yet), I thought moving
the binning code to the library and perhaps gaining some performance
improvement along the way would be a better option.

But I'm not against a wrapper. I added a prototype with limited
functionality to addons [1]. However, I'm not sure how to account for
large data - that's what discouraged me from creating a wrapper. The
Python subprocess documentation says to use communicate(), but its doc says
"The data read is buffered in memory, so do not use this method if the
data size is large or unlimited." [2] Do I have to do the buffering
myself then, or is it just fine? Is there code like this in GRASS?

Wouldn't using pipe_command() and feed_command() in grass.script be a solution?

Moritz

Hi Vaclav,

Just had a look at r.binning, very useful.

There is a small error in the code: you are importing grass.script as gs, so on line 33 it should be

options, flags = gs.parser()

instead of

options, flags = grass.parser()

I could have made the correction, but I am not sure what is customary in the case of this kind of 'typo'.

Cheers,

Paulo

On Tue, May 2, 2017 at 5:31 AM, Paulo van Breugel <p.vanbreugel@gmail.com>
wrote:

Just had a look at r.binning, very useful.

Thanks. But it is not complete yet.

There is a small error in the code: you are importing grass.script as gs,
so on line 33 it should be

options, flags = gs.parser()

instead of

options, flags = grass.parser()

Thanks. Fixed.

https://trac.osgeo.org/grass/changeset/70998

I could have made the correction, but I am not sure what is customary in
the case of this kind of 'typo'.

Good question. If it looks like the person is working on it, it is a little
safer to just tell than to fix, I think. Actually, I'll send you a similar
email later today.

On Tue, May 2, 2017 at 3:24 AM, Moritz Lennert <mlennert@club.worldonline.be

wrote:

But I'm not against a wrapper. I added a prototype with limited

functionality to addons [1]. However, I'm not sure how to account for
large data - that's what discouraged me from creating a wrapper. The
Python subprocess documentation says use communicate() but its doc says
"The data read is buffered in memory, so do not use this method if the
data size is large or unlimited." [2] Do I have to do the buffering
myself then or is it just fine? Is there some code like this in GRASS?

Wouldn't using pipe_command() and feed_command() in the grass.script be a
solution ?

I'm using pipe_command(), which is just a convenience function setting
stdout=PIPE. Similarly, feed_command() just sets stdin=PIPE, which I'm
not using because I'm feeding the stdout of the other process directly
(stdin=first_process.stdout). What I don't understand, regardless of using
stdin=PIPE or stdin=first_process.stdout for the second process, is what
should come next.

https://grass.osgeo.org/grass70/manuals/libpython/script.html#script.core.pipe_command
https://grass.osgeo.org/grass70/manuals/libpython/script.html#script.core.feed_command

On Tue, May 2, 2017 at 3:45 PM, Vaclav Petras <wenzeslaus@gmail.com> wrote:

On Tue, May 2, 2017 at 5:31 AM, Paulo van Breugel <p.vanbreugel@gmail.com> wrote:

Just had a look at r.binning, very useful.

Thanks. But it is not complete yet.

The module name r.binning is rather non-descriptive. I would suggest r.vect.stats because it can be regarded as the inverse of v.rast.stats, and because it is similar to r.resamp.stats. All do statistical aggregation.

Markus M

There is a small error in the code: you are importing grass.script as gs, so on line 33 it should be

options, flags = gs.parser()

instead of

options, flags = grass.parser()

Thanks. Fixed.

https://trac.osgeo.org/grass/changeset/70998

I could have made the correction, but I am not sure what is customary in the case of this kind of 'typo'.

Good question. If it looks like the person is working on it, it is a little safer to just tell than to fix, I think. Actually, I'll send you a similar email later today.


grass-dev mailing list
grass-dev@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/grass-dev

On 02/05/17 15:53, Vaclav Petras wrote:

On Tue, May 2, 2017 at 3:24 AM, Moritz Lennert
<mlennert@club.worldonline.be <mailto:mlennert@club.worldonline.be>> wrote:

        But I'm not against a wrapper. I added a prototype with limited
        functionality to addons [1]. However, I'm not sure how to
        account for large data - that's what discouraged me from creating
        a wrapper. The Python subprocess documentation says to use
        communicate(), but its doc says "The data read is buffered in
        memory, so do not use this method if the data size is large or
        unlimited." [2] Do I have to do the buffering myself then, or is
        it just fine? Is there code like this in GRASS?

    Wouldn't using pipe_command() and feed_command() in grass.script
    be a solution?

I'm using pipe_command(), which is just a convenience function setting
stdout=PIPE. Similarly, feed_command() just sets stdin=PIPE, which
I'm not using because I'm feeding the stdout of the other process
directly (stdin=first_process.stdout). What I don't understand,
regardless of using stdin=PIPE or stdin=first_process.stdout for the
second process, is what should come next.

Do you really need the in_process.communicate()? Here's what I used in a local script and it works, without communicate(). Then again, I don't think the data flowing through this pipe ever exceeded available memory.

         pin = gscript.pipe_command('v.db.select',
                                    map = firms_map,
                                    column="x,y,%s" % turnover_column,
                                    where="substr(%s, 1, 2) = '%s' AND %s >= 0" % (nace_column, nace2, turnover_column),
                                    flags='c',
                                    quiet=True)
         total_turnover_map = 'turnover_%s' % nace2
         p = gscript.start_command('r.in.xyz',
                                   input_='-',
                                   stdin=pin.stdout,
                                   method='sum',
                                   type_='DCELL',
                                   output=total_turnover_map,
                                   quiet=True,
                                   overwrite=True)
         if p.wait() != 0:
             gscript.fatal("Error in r.in.xyz with nace %s" % nace2)

Moritz

On Wed, May 3, 2017 at 8:25 AM, Markus Metz
<markus.metz.giswork@gmail.com> wrote:

On Tue, May 2, 2017 at 3:45 PM, Vaclav Petras <wenzeslaus@gmail.com> wrote:

On Tue, May 2, 2017 at 5:31 AM, Paulo van Breugel <p.vanbreugel@gmail.com>
wrote:

Just had a look at r.binning, very useful.

Thanks. But it is not complete yet.

The module name r.binning is rather non-descriptive. I would suggest
r.vect.stats because it can be regarded as the inverse of v.rast.stats, and
because it is similar to r.resamp.stats. All do statistical aggregation.

+1 r.vect.stats sounds more descriptive.

markusN

On Wed, May 3, 2017 at 6:57 AM, Markus Neteler <neteler@osgeo.org> wrote:

> The module name r.binning is rather non-descriptive. I would suggest
> r.vect.stats because it can be regarded as the inverse of v.rast.stats,
and
> because it is similar to r.resamp.stats. All do statistical aggregation.

+1 r.vect.stats sounds more descriptive.

Done in r71021. Thanks for the feedback.

https://trac.osgeo.org/grass/changeset/71021

On Wed, May 3, 2017 at 5:34 AM, Moritz Lennert <mlennert@club.worldonline.be> wrote:

On 02/05/17 15:53, Vaclav Petras wrote:

I'm using pipe_command(), which is just a convenience function setting
stdout=PIPE. Similarly, feed_command() just sets stdin=PIPE, which
I'm not using because I'm feeding the stdout of the other process
directly (stdin=first_process.stdout). What I don't understand,
regardless of using stdin=PIPE or stdin=first_process.stdout for the
second process, is what should come next.

Do you really need the in_process.communicate()? Here's what I used
in a local script and it works, without communicate(). Then again,
I don’t think the data flowing through this pipe ever exceeded available memory.

pin = gscript.pipe_command('v.db.select',
map = firms_map,

total_turnover_map = 'turnover_%s' % nace2
p = gscript.start_command('r.in.xyz',
input_='-',
stdin=pin.stdout,

if p.wait() != 0:
gscript.fatal("Error in r.in.xyz with nace %s" % nace2)

The Popen.wait() documentation [1] says: “Warning: This will deadlock when using stdout=PIPE and/or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that.”

And since I’m using stdout=PIPE (pipe_command()), I use communicate(). What troubles me is that Popen.communicate(input=None) documentation [2] says: “Note: The data read is buffered in memory, so do not use this method if the data size is large or unlimited.”

It says "data read", so it probably refers to stdout=PIPE when communicate() returns data rather than None(s) (i.e. stdout=PIPE and communicate() with the same process), so it does not apply to this case and I don't need to worry. As for wait(), I think it may work (it works most of the time); it is just not guaranteed to work with large data, and it depends on how smart the OS is about buffering.

Vaclav

[1] https://docs.python.org/2/library/subprocess.html#subprocess.Popen.wait
[2] https://docs.python.org/2/library/subprocess.html#subprocess.Popen.communicate
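That reading can be checked directly with a stand-in child process (not a GRASS module): communicate() only buffers streams the parent actually owns.

```python
import os
import subprocess
import sys

# The child's stdout goes to a file, not a PIPE, so the parent never reads
# it: communicate() returns (None, None) and buffers nothing. The memory
# warning in the docs only concerns streams opened with PIPE and read by
# communicate() itself.
with open(os.devnull, "wb") as sink:
    p = subprocess.Popen(
        [sys.executable, "-c", "print('x' * 1000000)"],
        stdout=sink,
    )
    out, err = p.communicate()
print(out, err, p.returncode)  # prints None None 0
```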

On Fri, May 5, 2017 at 5:07 AM, Vaclav Petras <wenzeslaus@gmail.com> wrote:

On Wed, May 3, 2017 at 5:34 AM, Moritz Lennert <mlennert@club.worldonline.be> wrote:

On 02/05/17 15:53, Vaclav Petras wrote:

I'm using pipe_command(), which is just a convenience function setting
stdout=PIPE. Similarly, feed_command() just sets stdin=PIPE, which
I'm not using because I'm feeding the stdout of the other process
directly (stdin=first_process.stdout). What I don't understand,
regardless of using stdin=PIPE or stdin=first_process.stdout for the
second process, is what should come next.

Do you really need the in_process.communicate()? Here's what I used
in a local script and it works, without communicate(). Then again,
I don’t think the data flowing through this pipe ever exceeded available memory.

pin = gscript.pipe_command('v.db.select',
map = firms_map,

total_turnover_map = 'turnover_%s' % nace2
p = gscript.start_command('r.in.xyz',
input_='-',
stdin=pin.stdout,

if p.wait() != 0:
gscript.fatal("Error in r.in.xyz with nace %s" % nace2)

The Popen.wait() documentation [1] says: “Warning: This will deadlock when using stdout=PIPE and/or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that.”

And since I’m using stdout=PIPE (pipe_command()), I use communicate(). What troubles me is that Popen.communicate(input=None) documentation [2] says: “Note: The data read is buffered in memory, so do not use this method if the data size is large or unlimited.”

It says "data read", so it probably refers to stdout=PIPE when communicate() returns data rather than None(s) (i.e. stdout=PIPE and communicate() with the same process), so it does not apply to this case and I don't need to worry. As for wait(), I think it may work (it works most of the time); it is just not guaranteed to work with large data, and it depends on how smart the OS is about buffering.

Maybe it is safer to store the output of v.out.ascii in a temporary file, then use that file as input for r.in.xyz. You can then not only check if v.out.ascii finished successfully, but also use the percent option of r.in.xyz to reduce memory consumption for large computational regions. The percent option does not work when piping input to r.in.xyz.

Markus M
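Markus M's temporary-file approach might look roughly like this. The GRASS modules are replaced by a stand-in writer so the sketch is runnable anywhere; in the real wrapper, stage 1 would be v.out.ascii (with output=) and stage 2 r.in.xyz (with input= and, optionally, percent= to limit memory use):

```python
import os
import subprocess
import sys
import tempfile

# Stage the points in a temporary file instead of a pipe. The exit status
# of the first stage can be checked before the second stage starts, which
# a pipe does not allow as easily.
fd, tmp_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # stage 1: stand-in for v.out.ascii writing x|y|z records to the file
    subprocess.check_call(
        [sys.executable, "-c",
         "import sys\n"
         "with open(sys.argv[1], 'w') as f:\n"
         "    for i in range(1000):\n"
         "        f.write('%d|%d|%d\\n' % (i, i, i))",
         tmp_path])
    # stage 2: stand-in for r.in.xyz reading the staged file
    with open(tmp_path) as f:
        n = sum(1 for _ in f)
finally:
    os.remove(tmp_path)
print(n)  # prints 1000
```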

Vaclav

[1] https://docs.python.org/2/library/subprocess.html#subprocess.Popen.wait
[2] https://docs.python.org/2/library/subprocess.html#subprocess.Popen.communicate

On Thu, 4 May 2017 23:07:38 -0400,
Vaclav Petras <wenzeslaus@gmail.com> wrote:

On Wed, May 3, 2017 at 5:34 AM, Moritz Lennert
<mlennert@club.worldonline.be> wrote:
[...]
[...]
>
> Do you really need the in_process.communicate()? Here's what I used
> in a local script and it works, without communicate(). Then again,
> I don't think the data flowing through this pipe ever exceeded
> available memory.
>
> pin = gscript.pipe_command('v.db.select',
> map = firms_map,
> ...
> total_turnover_map = 'turnover_%s' % nace2
> p = gscript.start_command('r.in.xyz',
> input_='-',
> stdin=pin.stdout,
> ...
> if p.wait() != 0:
> gscript.fatal("Error in r.in.xyz with nace %s" %
> nace2)

The Popen.wait() documentation [1] says: "Warning: This will deadlock
when using stdout=PIPE and/or stderr=PIPE and the child process
generates enough output to a pipe such that it blocks waiting for the
OS pipe buffer to accept more data. Use communicate() to avoid that."

Doesn't the stdout/stderr=PIPE here refer to the process that you call
wait() on, i.e. p in my example? And p does not use stdout=PIPE. But I
have to admit that I'm far outside comfortable waters in terms of my
level of understanding... :wink:

Markus is probably right: working with an intermediate file solves all
this and I don't think it would imply a serious speed decrease.

Moritz

Hi,

2017-05-05 9:43 GMT+02:00 Moritz Lennert <mlennert@club.worldonline.be>:

Markus is probably right: working with an intermediate file solves all
this and I don't think it would imply a serious speed decrease.

I would tend to agree. BTW, meanwhile in r71022 I changed the default
method from 'mean' (the module was calculating the mean value of
categories) to 'n' (number of points per cell). I also added the new
parameters `column` and `method`. Later I will update the examples.

Martin

--
Martin Landa
http://geo.fsv.cvut.cz/gwiki/Landa
http://gismentors.cz/mentors/landa

On Fri, May 5, 2017 at 3:55 AM, Martin Landa <landa.martin@gmail.com> wrote:

BTW, meanwhile in r71022 I changed the default
method from 'mean' (the module was calculating the mean value of categories)
to 'n' (number of points per cell).

Before, it gave the expected result for points with Z, e.g. in the full NC location:

r.vect.stats in=elev_lid792_bepts out=ground

Now it gives an error (likely due to column=options['column']), but more
importantly, other binning modules (r.in.lidar, r.in.xyz) have mean as
default, so we need to be careful not to be inconsistent and produce
expected results. Can this have no default? Or perhaps the "use" paradigm
as with other modules such as v.to.rast is appropriate here
(use=attr|cat|z), then again we have the issue of default value versus no
value for default, but at least it is general enough to accommodate both
XYZ point clouds and points with attributes.
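For illustration only, the use=attr|cat|z idea could look like this in the script's parser definitions, following the v.to.rast pattern. This is a sketch, not the actual r.vect.stats interface; the key names and descriptions are assumptions:

```python
# Hypothetical GRASS parser definition; names and descriptions are assumptions.
#%option
#% key: use
#% type: string
#% required: yes
#% options: attr,cat,z
#% description: Source of raster values (attribute column, category, or z coordinate)
#%end
#%option G_OPT_DB_COLUMN
#% key: column
#% required: no
#% description: Name of attribute column (for use=attr)
#%end
```

Making `use` required sidesteps the default-value question at the cost of one extra keystroke for the user.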

Hi,

2017-05-05 15:23 GMT+02:00 Vaclav Petras <wenzeslaus@gmail.com>:

Before, it gave the expected result for points with Z, e.g. in the full NC location:

r.vect.stats in=elev_lid792_bepts out=ground

Now it gives an error (likely due to column=options['column']), but more

right, it needs to be fixed (checking if input is 2D or 3D).

importantly, other binning modules (r.in.lidar, r.in.xyz) have mean as
default, so we need to be careful not to be inconsistent and produce
expected results. Can this have no default? Or perhaps the "use" paradigm as

The mean value of cats when the input is 2D is misleading in any case.

with other modules such as v.to.rast is appropriate here (use=attr|cat|z),
then again we have the issue of default value versus no value for default,
but at least it is general enough to accommodate both XYZ point clouds and
points with attributes.

Would make sense to me. Thanks for starting this module. I was
surprised that such a tool did not exist in GRASS.

Martin

--
Martin Landa
http://geo.fsv.cvut.cz/gwiki/Landa
http://gismentors.cz/mentors/landa

On Fri, May 5, 2017 at 3:43 AM, Moritz Lennert <mlennert@club.worldonline.be

wrote:

Markus is probably right: working with an intermediate file solves all
this and I don't think it would imply a serious speed decrease.

I have tested r.vect.stats (with the pipe) and compared it with a pipe in the
shell - the time is the same (expected behavior). Here is the test with a pipe
in the shell for 19,693,050 (almost 20 million) points:

time sh -c "v.out.ascii input=points layer=-1 format=point sep=pipe |
r.in.xyz input=- output=bin2 method=mean sep=pipe --o"
real 1m5.982s
user 1m39.120s
sys 0m1.740s

And here is the same test but using a temporary file (similar result with and
without an SSD):

time sh -c "v.out.ascii input=points layer=-1 format=point sep=pipe
output=tmp.txt --o; r.in.xyz input=tmp.txt output=bin3 method=mean sep=pipe
--o"
real 1m32.933s
user 1m30.472s
sys 0m1.796s

According to this, using the temporary file takes about 1.4 times as long
(93 s vs 66 s of real time).
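Spelled out, the ratio comes from the two 'real' timings above:

```python
# Ratio of the file-based run to the piped run, from the timings above.
pipe_s = 1 * 60 + 5.982    # 1m5.982s  for v.out.ascii | r.in.xyz
file_s = 1 * 60 + 32.933   # 1m32.933s via the temporary file
print(round(file_s / pipe_s, 2))  # prints 1.41
```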