[GRASS-dev] About v.distance, v.what.vect (wrt "count points within...").

Nikos_Alexandris1 · August 7, 2010, 6:33am

Hi!

I attempted to measure the time required to process large vector point maps
(derived from the landcover.30m spearfish map) with v.what.vect/v.distance.

a. v.distance without "dmax=" and afterwards v.what.vect
b. v.distance with -pa, grabbing "dmax" and feeding afterwards v.what.vect
with "dmax="

Unfortunately, the process crashed because I (wrongly) ran concurrently
another heavy process for my work. I have no more time to repeat it.

Nevertheless, I have a question: can a good C/C++ programmer, without GRASS-
background, make it more efficient within a day let's say? Is the cause of
v.distance being slow known? I did not find any specific trac-ticket.

(I am asking this because I know somebody which could eventually help-out. In
case it takes a _lot_ of time to jump-in the GRASS-C/++ stuff then I won't
bother to asking him.)

A bit more...

"v.distance" is slow (for the impatient user) with very large vectors (from my
memory I estimate that it took ~20h for ~600.000 features). Trying to get the
"dmax=" first in order to tell "v.distance" to look for features within a
certain radius, might _not_ be very efficient as well. It takes time to get
results from "v.distance -pa" and, depending on the vector(s), the addition of
the _real_ v.distance can be a deal-breaker.

However, it might be worthwhile in the end to use "dmax=", as MoritzL
mentioned in some post. I think it makes sense when the area of interest (in
which v.distance _should_ operate) is much smaller than the queried vector map
itself (right?). Unfortunately, the process which was doing exactly this,
crashed (using >6000 points and a vector fishnet to count the points that fall
inside each box of the fishnet map)

In case "dmax=" does a real good job in some cases, maybe v.what.vect can be
improved. For example, check the features of the vector maps fed in "vect="
and "qvect=" respectively, compare, and, depending on the occasion, decide
automatically if or not to use first a "v.distance -pa" step.

Probably you are aware of all this but I post them anyway. Below is the
information that I have collected.

Thank for your time, Nikos

=======================
The stuff I collected so far:

--Count points within vectors--

-Posts-

"Point count with large vectors"
+ <http://lists.osgeo.org/pipermail/grass-user/2009-May/050317.html>
+ <http://lists.osgeo.org/pipermail/grass-dev/2007-May/031533.html>
+ (this is rather old) <http://lists.osgeo.org/pipermail/grass-user/2003-
September/010200.html>

-Tools-

v.count.points.sh (uses v.what.vect which in turn uses v.distance)
+ <http://wiki.iosa.it/dokuwiki/spatial_analysis:feature_count>
--%<---
+ (the following post is is rather old - maybe fixed in latest
v.count.points.sh?) <http://lists.osgeo.org/pipermail/grass-user/2007-
May/039737.html>:

...this script is very slow, because it uses a nested for loop, so if you have
10000 points and 1000 areas, it calls 10000 * 1000 database queries (in case
you provide a class column)...
-->%---

--Count points in raster cells (not relevant to v.distance)--

- "r.in.xyz method=n" -
+ <http://grass.osgeo.org/wiki/Count_points_in_raster_cells>
+ see also <http://lists.osgeo.org/pipermail/grass-user/2010-
April/055678.html>

-Information that might be relevant to "count points"-
+ <http://duncanjg.wordpress.com/2008/04/19/an-example-of-using-postgis-from-
r/>
+ <http://www.mail-archive.com/grass-user@lists.osgeo.org/msg01859.html>
+ <http://www.mail-archive.com/grass-user@lists.osgeo.org/msg01932.html>

[Point in polygon]
+ <http://grass.osgeo.org/wiki/Point_in_polygon>

[r.statistics base=name cover=name method=count ?]
<http://lists.osgeo.org/pipermail/grass-user/2009-September/052471.html>

[density surface from points in PostGIS]
+<http://lists.osgeo.org/pipermail/grass-user/2003-September/010182.html>

--Towards a faster "count of points within..." script--

+ About SQL statements, related to "v.what.vect": <http://www.mail-
archive.com/grass-stats@lists.osgeo.org/msg00321.html>

Nikos_Alexandris1 · August 7, 2010, 7:16am

Just something more,

my intention to ask an outsider is purely aiming to take-off some work-load of
the grass-dev core team. Solely that. And of course, this in case it is really
(a.) a problem and (b.) it is known and the question is just about who is
going to spend time on it.

Thanks, Nikos

mlennert · August 9, 2010, 7:58am

On 07/08/10 08:33, Nikos Alexandris wrote:

"v.distance" is slow (for the impatient user) with very large vectors (from my
memory I estimate that it took ~20h for ~600.000 features). Trying to get the
"dmax=" first in order to tell "v.distance" to look for features within a
certain radius, might _not_ be very efficient as well. It takes time to get
results from "v.distance -pa" and, depending on the vector(s), the addition of
the _real_ v.distance can be a deal-breaker.

Just an innocent question (partly directed to Markus M): could this be linked to the spatial index ? And is v.distance faster with spatial index on file in grass7 ?

Which version are you using ?

Moritz

Nikos_Alexandris1 · August 9, 2010, 10:52am

Nikos Alexandris wrote:

> "v.distance" is slow (for the impatient user) with very large vectors
> (from my memory I estimate that it took ~20h for ~600.000 features).

(+some details: Kubuntu 64-bit on Intel Core2 Duo T9400 @2.53GHz)

> Trying to get the "dmax=" first in order to tell "v.distance" to look
> for features within a certain radius, might _not_ be very efficient as
> well. It takes time to get results from "v.distance -pa" and, depending
> on the vector(s), the addition of the _real_ v.distance can be a
> deal-breaker.

Moritz L:

Just an innocent question (partly directed to Markus M): could this be
linked to the spatial index ? And is v.distance faster with spatial
index on file in grass7 ?

Which version are you using ?

Hmm... right, I am using latest 64-svn. My intention was to test all within
latest trunk but I gave up after the heavy process broke. I'll probably spend
more time on it after a month or so.

Thanks, Nikos

Nikos_Alexandris1 · August 9, 2010, 1:25pm

Nikos Alexandris wrote:

> > "v.distance" is slow (for the impatient user) with very large vectors
> > (from my memory I estimate that it took ~20h for ~600.000 features).

(+some details: Kubuntu 64-bit on Intel Core2 Duo T9400 @2.53GHz)

> > Trying to get the "dmax=" first in order to tell "v.distance" to look
> > for features within a certain radius, might _not_ be very efficient as
> > well. It takes time to get results from "v.distance -pa" and, depending
> > on the vector(s), the addition of the _real_ v.distance can be a
> > deal-breaker.

Moritz L:

> Just an innocent question (partly directed to Markus M): could this be
> linked to the spatial index ? And is v.distance faster with spatial
> index on file in grass7 ?
>
> Which version are you using ?

Hmm... right, I am using latest 64-svn.

+sqlite.

Nikos

Markus_GRASS · August 10, 2010, 9:19am

Moritz Lennert wrote:

Nikos Alexandris wrote:

"v.distance" is slow (for the impatient user) with very large vectors
(from my
memory I estimate that it took ~20h for ~600.000 features). Trying to get
the
"dmax=" first in order to tell "v.distance" to look for features within a
certain radius, might _not_ be very efficient as well. It takes time to
get
results from "v.distance -pa" and, depending on the vector(s), the
addition of
the _real_ v.distance can be a deal-breaker.

Just an innocent question (partly directed to Markus M): could this be
linked to the spatial index ? And is v.distance faster with spatial index on
file in grass7 ?

No consistent speed difference between grass 6.x and grass 7.
Sometimes grass 6 is faster, sometimes grass 7, but nothing dramatic.
The time-consuming parts are distance calculations and tests whether a
point is inside an area or inside an isle of an area. These
calculations should be done only when necessary. I have added some
tests using bounding boxes to my local copy to avoid
distance-from-point-to-line calculations; in combination with dmax and
without -a there is a real speed gain for point to area distances, but
I need to do more testing to make sure results stay identical.

It might save some time to get distances from points to centroids
first with -p (I don't think -a is necessary) because getting
distances to centroids is much faster than getting distances to areas.
Grab the maximum distance and use that as dmax with to_type=area. This
should be safe because the real maximum distance to areas should be
smaller than the maximum distance to area centroids.

Markus M

Nikos_Alexandris1 · August 10, 2010, 9:36am

Nikos Alexandris wrote:

>> "v.distance" is slow (for the impatient user) with very large vectors
>> (from my
>> memory I estimate that it took ~20h for ~600.000 features). Trying to
>> get the
>> "dmax=" first in order to tell "v.distance" to look for features within
>> a certain radius, might _not_ be very efficient as well. It takes time
>> to get
>> results from "v.distance -pa" and, depending on the vector(s), the
>> addition of
>> the _real_ v.distance can be a deal-breaker.

Moritz Lennert:

> Just an innocent question (partly directed to Markus M): could this be
> linked to the spatial index ? And is v.distance faster with spatial index
> on file in grass7 ?

Markus Metz:

No consistent speed difference between grass 6.x and grass 7.
Sometimes grass 6 is faster, sometimes grass 7, but nothing dramatic.
The time-consuming parts are distance calculations and tests whether a
point is inside an area or inside an isle of an area. These
calculations should be done only when necessary. I have added some
tests using bounding boxes to my local copy to avoid
distance-from-point-to-line calculations; in combination with dmax and
without -a there is a real speed gain for point to area distances, but
I need to do more testing to make sure results stay identical.

It might save some time to get distances from points to centroids
first with -p (I don't think -a is necessary) because getting
distances to centroids is much faster than getting distances to areas.

"Area = boundary + centroid", so what exactly is the point to area distance?
Point to boundary? Specifically, point to all-lines that compose a boundary?
Many distances? And then, keep the maximum?

Oh, I am still not within the logic of how exactly the "within area" point
detection works.

Grab the maximum distance and use that as dmax with to_type=area. This
should be safe because the real maximum distance to areas should be
smaller than the maximum distance to area centroids.

Markus, would you mind pointing out the code lines where distance measuring
(points to areas) is happening? I'll give it a try but it is (so) time
consuming for an ignorant like me.

Thanks, Nikos

Markus_GRASS · August 10, 2010, 10:15am

Nikos Alexandris wrote:

Nikos Alexandris wrote:

>> "v.distance" is slow (for the impatient user) with very large vectors
>> (from my
>> memory I estimate that it took ~20h for ~600.000 features). Trying to
>> get the
>> "dmax=" first in order to tell "v.distance" to look for features within
>> a certain radius, might _not_ be very efficient as well. It takes time
>> to get
>> results from "v.distance -pa" and, depending on the vector(s), the
>> addition of
>> the _real_ v.distance can be a deal-breaker.

Moritz Lennert:

> Just an innocent question (partly directed to Markus M): could this be
> linked to the spatial index ? And is v.distance faster with spatial index
> on file in grass7 ?

Markus Metz:

No consistent speed difference between grass 6.x and grass 7.
Sometimes grass 6 is faster, sometimes grass 7, but nothing dramatic.
The time-consuming parts are distance calculations and tests whether a
point is inside an area or inside an isle of an area. These
calculations should be done only when necessary. I have added some
tests using bounding boxes to my local copy to avoid
distance-from-point-to-line calculations; in combination with dmax and
without -a there is a real speed gain for point to area distances, but
I need to do more testing to make sure results stay identical.

It might save some time to get distances from points to centroids
first with -p (I don't think -a is necessary) because getting
distances to centroids is much faster than getting distances to areas.

"Area = boundary + centroid", so what exactly is the point to area distance?

If a point is inside an area (the polygon composed of the area's
boundaries), the distance is 0 (zero):

A centroid is always inside the area it belongs to, therefore the
distance of a point outside an area to the area's centroid is always
larger than the distance of that point to the area's boundaries.

Point to boundary? Specifically, point to all-lines that compose a boundary?
Many distances? And then, keep the maximum?

If a point is outside an area, v.distance determines the shortest
distance to the area's boundaries (point to all-lines that compose a
boundary), IOW, keep the minimum not the maximum because v.distance
searches for the nearest feature.
A point is also outside an area if it is inside an isle of the area.
This isle might in turn be another area with centroid.

Oh, I am still not within the logic of how exactly the "within area" point
detection works.

Do you really want to know?
Start with
http://en.wikipedia.org/wiki/Point_in_polygon
GRASS uses the ray casting algorithm

Markus, would you mind pointing out the code lines where distance measuring
(points to areas) is happening?

from
https://trac.osgeo.org/grass/browser/grass/trunk/vector/v.distance/main.c#L676
to
https://trac.osgeo.org/grass/browser/grass/trunk/vector/v.distance/main.c#L702

IMHO, there is not a lot of potential for speed improvements, a bit,
but that comes at the cost of more complicated code.

I think dmax is the crucial option to speed things up. Without dmax,
v.distance has to determine for each point in <from> the distance to
all areas in <to>. With many points and many areas, this will
obviously take some time. A reasonably fast way to estimate a good
dmax helps much more than trying to hack v.distance for distances to
areas.

Markus M

Nikos_Alexandris1 · August 10, 2010, 11:49am

@Markus M: Thank you Markus. Below I have put more questions (before reading
the document) on purpose. No need to reply (except if you have the time) -
it's for the sake of a potential beginner's "question and answer" wiki-page.

I'll go through the Ray casting algorithm details. I guess my questions will
be answered there.

Nikos A wrote:

>> >> "v.distance" is slow (for the impatient user) with very large vectors
>> >> (from my memory I estimate that it took ~20h for ~600.000 features).
>> >> Trying to get the "dmax=" first in order to tell "v.distance" to look
>> >> for features within a certain radius, might _not_ be very efficient as
>> >> well. It takes time to get results from "v.distance -pa" and,
>> >> depending on the vector(s), the addition of the _real_ v.distance can
>> >> be a deal-breaker.

Moritz L:

>> > Just an innocent question (partly directed to Markus M): could this be
>> > linked to the spatial index ? And is v.distance faster with spatial
>> > index on file in grass7 ?

Markus M:

>> No consistent speed difference between grass 6.x and grass 7.
>> Sometimes grass 6 is faster, sometimes grass 7, but nothing dramatic.
>> The time-consuming parts are distance calculations and tests whether a
>> point is inside an area or inside an isle of an area. These
>> calculations should be done only when necessary. I have added some
>> tests using bounding boxes to my local copy to avoid
>> distance-from-point-to-line calculations; in combination with dmax and
>> without -a there is a real speed gain for point to area distances, but
>> I need to do more testing to make sure results stay identical.
>>
>> It might save some time to get distances from points to centroids
>> first with -p (I don't think -a is necessary) because getting
>> distances to centroids is much faster than getting distances to areas.

Nikos A:

> "Area = boundary + centroid", so what exactly is the point to area
> distance?

Markus M:

If a point is inside an area (the polygon composed of the area's
boundaries), the distance is 0 (zero):

This sentence makes me think that it is a priori known (based on something
else - related to topology?) when a point is inside an area. Why all the need
to measure distances then in order to count how many points are inside?

A centroid is always inside the area it belongs to, therefore the
distance of a point outside an area to the area's centroid is always
larger than the distance of that point to the area's boundaries.

What happens when the centroid (gravity center) falls outside of the
boundaries?

> Point to boundary? Specifically, point to all-lines that compose a
> boundary? Many distances? And then, keep the maximum?

If a point is outside an area, v.distance determines the shortest
distance to the area's boundaries (point to all-lines that compose a
boundary), IOW, keep the minimum not the maximum because v.distance
searches for the nearest feature.

All right. But, (repeating the same question): why is distance measuring
required at all? To compare it with what?

A point is also outside an area if it is inside an isle of the area.
This isle might in turn be another area with centroid.

This complicates the process I guess.

> Oh, I am still not within the logic of how exactly the "within area"
> point detection works.

Do you really want to know?

At some points yes. I am curious to understand stuff - at least their basic
concept.

Start with
http://en.wikipedia.org/wiki/Point_in_polygon
GRASS uses the ray casting algorithm

Seems to have all the answers I want. Thanks.

> Markus, would you mind pointing out the code lines where distance
> measuring (points to areas) is happening?

from
https://trac.osgeo.org/grass/browser/grass/trunk/vector/v.distance/main.c#L
676 to
https://trac.osgeo.org/grass/browser/grass/trunk/vector/v.distance/main.c#L
702

IMHO, there is not a lot of potential for speed improvements, a bit,
but that comes at the cost of more complicated code.

Why are (supposedly) other tools faster in counting points within polygons?

I think dmax is the crucial option to speed things up. Without dmax,
v.distance has to determine for each point in <from> the distance to
all areas in <to>. With many points and many areas, this will
obviously take some time. A reasonably fast way to estimate a good
dmax helps much more than trying to hack v.distance for distances to
areas.

Then my intentions to test v.distanve/v.what.vect with and without "dmax="
where on the right track (right?).

Thank you very much for your time and the details. All the best, Nikos

mlennert · August 10, 2010, 1:17pm

On 10/08/10 13:49, Nikos Alexandris wrote:

Markus M:

If a point is inside an area (the polygon composed of the area's
boundaries), the distance is 0 (zero):

This sentence makes me think that it is a priori known (based on something
else - related to topology?) when a point is inside an area. Why all the need
to measure distances then in order to count how many points are inside?

As you can see in the code referenced by Markus, there is a Vect_point_in_area(), so yes, it is possible to more directly check if points are in areas. It all depends on which modules were written using this function. At this stage all point-in-polygon attempts in GRASS are scripts using workarounds...

A centroid is always inside the area it belongs to, therefore the
distance of a point outside an area to the area's centroid is always
larger than the distance of that point to the area's boundaries.

What happens when the centroid (gravity center) falls outside of the
boundaries?

In GRASS, by definition a centroid is within the boundaries, and so does not necessarily represent center of gravity.

Point to boundary? Specifically, point to all-lines that compose a
boundary? Many distances? And then, keep the maximum?

If a point is outside an area, v.distance determines the shortest
distance to the area's boundaries (point to all-lines that compose a
boundary), IOW, keep the minimum not the maximum because v.distance
searches for the nearest feature.

All right. But, (repeating the same question): why is distance measuring
required at all? To compare it with what?

Well, v.distance was not meant as a point-in-polygon measurement tool, but as a distance measurement tool. v.what.vect just "abuses" it as such, using dmax=0.0. But currently, there is not dedicated point-in-polygon counting module...

Why are (supposedly) other tools faster in counting points within polygons?

see above...

So, maybe, if you have a dedicated programmer at hand, you could imagine a module with point-in-polygon algorithm, maybe giving the option of counting the points, aggregating some attribute values of the points, etc, using the Vect_point_in_area() function...

Should just be a matter of:

- opening a polygon and a point layer
- for each polygon
- loop through all points and if Vect_point_in_area() is true, aggregate the values of those that fall into the polygon
- put results into table of polygon map, new table or stdout

I imagine that here the use of bounding boxes and the spatial index might here, but that's Markus' specialty

Moritz

Nikos_Alexandris1 · August 10, 2010, 1:30pm

Markus M:

>> If a point is inside an area (the polygon composed of the area's
>> boundaries), the distance is 0 (zero):

Nikos A:

> This sentence makes me think that it is a priori known (based on
> something else - related to topology?) when a point is inside an area.
> Why all the need to measure distances then in order to count how many
> points are inside?

Moritz L:

As you can see in the code referenced by Markus, there is a
Vect_point_in_area(), so yes, it is possible to more directly check if
points are in areas. It all depends on which modules were written using
this function. At this stage all point-in-polygon attempts in GRASS are
scripts using workarounds...

>> A centroid is always inside the area it belongs to, therefore the
>> distance of a point outside an area to the area's centroid is always
>> larger than the distance of that point to the area's boundaries.
>
> What happens when the centroid (gravity center) falls outside of the
> boundaries?

In GRASS, by definition a centroid is within the boundaries, and so does
not necessarily represent center of gravity.

Right!

>>> Point to boundary? Specifically, point to all-lines that compose a
>>> boundary? Many distances? And then, keep the maximum?

>> If a point is outside an area, v.distance determines the shortest
>> distance to the area's boundaries (point to all-lines that compose a
>> boundary), IOW, keep the minimum not the maximum because v.distance
>> searches for the nearest feature.

> All right. But, (repeating the same question): why is distance measuring
> required at all? To compare it with what?

Well, v.distance was not meant as a point-in-polygon measurement tool,
but as a distance measurement tool. v.what.vect just "abuses" it as
such, using dmax=0.0. But currently, there is not dedicated
point-in-polygon counting module...

That is rather good news (I think). There is room for a unique count points
within areas tool then.

> Why are (supposedly) other tools faster in counting points within
> polygons?

see above...

So, maybe, if you have a dedicated programmer at hand, you could imagine
a module with point-in-polygon algorithm, maybe giving the option of
counting the points, aggregating some attribute values of the points,
etc, using the Vect_point_in_area() function...

Should just be a matter of:

- opening a polygon and a point layer
- for each polygon
- loop through all points and if Vect_point_in_area() is true,
aggregate the values of those that fall into the polygon
- put results into table of polygon map, new table or stdout

I imagine that here the use of bounding boxes and the spatial index
might here, but that's Markus' specialty

Thanks a zillion
Nikos

(bcc-ing a potential dedicated programmer)

Markus_GRASS · August 10, 2010, 1:44pm

Nikos Alexandris wrote:

@Markus M: Thank you Markus. Below I have put more questions (before reading
the document) on purpose. No need to reply (except if you have the time) -
it's for the sake of a potential beginner's "question and answer" wiki-page.

Some short answers below.

Nikos A wrote:

>> >> "v.distance" is slow (for the impatient user) with very large vectors
>> >> (from my memory I estimate that it took ~20h for ~600.000 features).
>> >> Trying to get the "dmax=" first in order to tell "v.distance" to look
>> >> for features within a certain radius, might _not_ be very efficient as
>> >> well. It takes time to get results from "v.distance -pa" and,
>> >> depending on the vector(s), the addition of the _real_ v.distance can
>> >> be a deal-breaker.

Moritz L:

>> > Just an innocent question (partly directed to Markus M): could this be
>> > linked to the spatial index ? And is v.distance faster with spatial
>> > index on file in grass7 ?

Markus M:

>> No consistent speed difference between grass 6.x and grass 7.
>> Sometimes grass 6 is faster, sometimes grass 7, but nothing dramatic.
>> The time-consuming parts are distance calculations and tests whether a
>> point is inside an area or inside an isle of an area. These
>> calculations should be done only when necessary. I have added some
>> tests using bounding boxes to my local copy to avoid
>> distance-from-point-to-line calculations; in combination with dmax and
>> without -a there is a real speed gain for point to area distances, but
>> I need to do more testing to make sure results stay identical.
>>
>> It might save some time to get distances from points to centroids
>> first with -p (I don't think -a is necessary) because getting
>> distances to centroids is much faster than getting distances to areas.

Nikos A:

> "Area = boundary + centroid", so what exactly is the point to area
> distance?

Markus M:

If a point is inside an area (the polygon composed of the area's
boundaries), the distance is 0 (zero):

This sentence makes me think that it is a priori known (based on something
else - related to topology?) when a point is inside an area.

This is not known a priori, v.distance checks if a point is inside an
area, if yes, distance is set to zero.

Why all the need
to measure distances then in order to count how many points are inside?

If you want to count the number of points inside areas, set dmax to 0 (zero).

A centroid is always inside the area it belongs to, therefore the
distance of a point outside an area to the area's centroid is always
larger than the distance of that point to the area's boundaries.

What happens when the centroid (gravity center) falls outside of the
boundaries?

Corrupt topology. Happens only when manually moving a centroid outside
its area. When GRASS calculates a new centroid for an area, the
centroid is always within the area (e.g. v.in.* and v.centroids).

> Point to boundary? Specifically, point to all-lines that compose a
> boundary? Many distances? And then, keep the maximum?

If a point is outside an area, v.distance determines the shortest
distance to the area's boundaries (point to all-lines that compose a
boundary), IOW, keep the minimum not the maximum because v.distance
searches for the nearest feature.

All right. But, (repeating the same question): why is distance measuring
required at all? To compare it with what?

If distance measuring is required and distances are supposed to be
compared with something, then use v.distance The answer to these
questions comes before the decision to use v.distance.

A point is also outside an area if it is inside an isle of the area.
This isle might in turn be another area with centroid.

This complicates the process I guess.

All taken care of.

> Oh, I am still not within the logic of how exactly the "within area"
> point detection works.

Do you really want to know?

At some points yes. I am curious to understand stuff - at least their basic
concept.

Start with
http://en.wikipedia.org/wiki/Point_in_polygon
GRASS uses the ray casting algorithm

Seems to have all the answers I want. Thanks.

> Markus, would you mind pointing out the code lines where distance
> measuring (points to areas) is happening?

from
https://trac.osgeo.org/grass/browser/grass/trunk/vector/v.distance/main.c#L
676 to
https://trac.osgeo.org/grass/browser/grass/trunk/vector/v.distance/main.c#L
702

IMHO, there is not a lot of potential for speed improvements, a bit,
but that comes at the cost of more complicated code.

Why are (supposedly) other tools faster in counting points within polygons?

Because if it's only about counting points in polygons, you can write
a special module doing exactly and only that.

I think dmax is the crucial option to speed things up. Without dmax,
v.distance has to determine for each point in <from> the distance to
all areas in <to>. With many points and many areas, this will
obviously take some time. A reasonably fast way to estimate a good
dmax helps much more than trying to hack v.distance for distances to
areas.

Then my intentions to test v.distanve/v.what.vect with and without "dmax="
where on the right track (right?).

I think so. But wait a little bit, I might get v.distance much faster soon...

Markus M

Nikos_Alexandris1 · August 10, 2010, 2:36pm

fyi, I've filed a request-ticket: <http://trac.osgeo.org/grass/ticket/1130>

Thanks, Nikos

mlennert · August 10, 2010, 2:48pm

On 10/08/10 15:17, Moritz Lennert wrote:

On 10/08/10 13:49, Nikos Alexandris wrote:

Markus M:

If a point is inside an area (the polygon composed of the area's
boundaries), the distance is 0 (zero):

This sentence makes me think that it is a priori known (based on
something
else - related to topology?) when a point is inside an area. Why all
the need
to measure distances then in order to count how many points are inside?

As you can see in the code referenced by Markus, there is a
Vect_point_in_area(), so yes, it is possible to more directly check if
points are in areas. It all depends on which modules were written using
this function. At this stage all point-in-polygon attempts in GRASS are
scripts using workarounds...

As a follow-up:

The counting points in polygons algorithm I prefer at this stage is (using municipal boundaries and hospitals in the NC data set with an SQLite backend - DBF won't work):

g.copy hospitals,myhospitals
v.db.addcol myhospitals col="cat_municip int"
v.distance from=myhospitals@sqlite to=boundary_municp@PERMANENT upload=cat column=cat_municip dmax=0.0
db.select sql="select cat_municip, count(*) from myhospitals group by cat_municip"

If your hospital attribute table contains number of beds (nbeds), the you could sum the number of beds as such:

db.select sql="select cat_municip, sum(nbeds) from myhospitals group by cat_municip"

etc...

Using 6.5 to test a similar case to yours (I assume):

g.region vect=boundary_municp

v.random out=mypoints n=600000

v.db.addtable mypoints col="cat int, cat_municip int" (that's veeeeery slow, probably because of 600000 update statements to the database in the v.to.db call...)

time v.distance from=mypoints@sqlite to=boundary_municp@PERMANENT upload=cat column=cat_municip dmax=0.0

real 2m2.119s <= not so bad

db.select sql="select cat_municip, count(*) from mypoints group by cat_municip"

So, using the combination of v.distance and db.select I cannot reproduce your problem with 600,000 points, but maybe the number and nature of polygons can also play a role...

Moritz

Nikos_Alexandris1 · August 10, 2010, 3:50pm

Markus M:

>>> If a point is inside an area (the polygon composed of the area's
>>> boundaries), the distance is 0 (zero):

Nikos A:

>>> This sentence makes me think that it is a priori known (based on
>> something else - related to topology?) when a point is inside an area.
>> Why all the need to measure distances then in order to count how many
>> points are inside?

Moritz L:

> As you can see in the code referenced by Markus, there is a
> Vect_point_in_area(), so yes, it is possible to more directly check if
> points are in areas. It all depends on which modules were written using
> this function. At this stage all point-in-polygon attempts in GRASS are
> scripts using workarounds...

As a follow-up:

The counting points in polygons algorithm I prefer at this stage is
(using municipal boundaries and hospitals in the NC data set with an
SQLite backend - DBF won't work):

g.copy hospitals,myhospitals
v.db.addcol myhospitals col="cat_municip int"
v.distance from=myhospitals@sqlite to=boundary_municp@PERMANENT
upload=cat column=cat_municip dmax=0.0

Hmmm... "dmax=0.0": Is this _my_ problem perhaps? Instead of setting it
directly I was trying to estimate it first with "v.distance -pa" which meands
that I misunderstood the whole process :-/

db.select sql="select cat_municip, count(*) from myhospitals group by
cat_municip"

If your hospital attribute table contains number of beds (nbeds), the
you could sum the number of beds as such:

db.select sql="select cat_municip, sum(nbeds) from myhospitals group by
cat_municip"

etc...

Using 6.5 to test a similar case to yours (I assume):

g.region vect=boundary_municp

v.random out=mypoints n=600000

v.db.addtable mypoints col="cat int, cat_municip int" (that's veeeeery
slow, probably because of 600000 update statements to the database in
the v.to.db call...)

time v.distance from=mypoints@sqlite to=boundary_municp@PERMANENT
upload=cat column=cat_municip dmax=0.0

real 2m2.119s <= not so bad

That is perfect!

db.select sql="select cat_municip, count(*) from mypoints group by
cat_municip"

So, using the combination of v.distance and db.select I cannot reproduce
your problem with 600,000 points, but maybe the number and nature of
polygons can also play a role...

That's interesting. Maybe I have done once again something very messy(?). I
use the 3rd script inside the attached file in ticket # 804 [1]. Although this
(old) script still executes so inefficiently a very large number of SQL
statements, the problem is still only in v.what.vect (so in v.distance) before
the SQL calls.

The script counts several point maps (for example: 404347 points) that fall
inside boxes (that compose a fishnet which I call cell-grid, for example: 1320
vector cells). One run with the above mentioned numbers takes more than 10h.

The specific line(s) in the python script is:

# carry low resolution grid-cell "CAT"s over to reference vector points
                grass.run_command('v.what.vect',\
                flags = '-v',\
                quiet = False,\
                vector = reference_points_map,\
                qvect = lowres_vector_grid,\
                column = gridcell_column,\
                qcolumn = "cat")

Of course I checked the "problem" with the my data by testing only
"v.what.vect" commands out and apart of my messy script.

Equally, very slow are the trials I did with spearfish (random data). I can
pass some of my data (off-list please) or let me find some time later or
tomorrow to copy-paste from my history the exact commands of my test within
spearfish60.

(
Just a quick look: I did not set dmax=0.0 in my (v.distance) tests. Then
again, in "v.what.vect" it is set by default to 0.0, right? Isn't this default
dmax=0.0 passed (by default) to v.distance?
)

Thank you Moritz, Nikos
---
[1] <http://trac.osgeo.org/grass/ticket/804>

Markus_GRASS · August 10, 2010, 3:53pm

OK, here comes (soon) a speed-up for v.distance

test case is nc

I generated 10000 random vector points with r.random, all within North
Carolina. As areas I used boundary_municp, scattered areas, some
points are within an area, most are outside any area. No dmax used
with v.distance

Original: about 25s for updating the table, about 6m25s used for
distance calculations
Tuned: about 25s for updating the table, about 2s :-))) used for
distance calculations

Results for the 10000 points are identical (distance to nearest area
and category of nearest area).
The code is now a bit more complicated, but reducing processing time
for distance calculations from over 6m down to 2s might justify some
code complexity.

Markus M

Moritz Lennert wrote:

On 10/08/10 15:17, Moritz Lennert wrote:

On 10/08/10 13:49, Nikos Alexandris wrote:

Markus M:

If a point is inside an area (the polygon composed of the area's
boundaries), the distance is 0 (zero):

This sentence makes me think that it is a priori known (based on
something
else - related to topology?) when a point is inside an area. Why all
the need
to measure distances then in order to count how many points are inside?

As you can see in the code referenced by Markus, there is a
Vect_point_in_area(), so yes, it is possible to more directly check if
points are in areas. It all depends on which modules were written using
this function. At this stage all point-in-polygon attempts in GRASS are
scripts using workarounds...

As a follow-up:

The counting points in polygons algorithm I prefer at this stage is (using
municipal boundaries and hospitals in the NC data set with an SQLite backend
- DBF won't work):

g.copy hospitals,myhospitals
v.db.addcol myhospitals col="cat_municip int"
v.distance from=myhospitals@sqlite to=boundary_municp@PERMANENT upload=cat
column=cat_municip dmax=0.0
db.select sql="select cat_municip, count(*) from myhospitals group by
cat_municip"

If your hospital attribute table contains number of beds (nbeds), the you
could sum the number of beds as such:

db.select sql="select cat_municip, sum(nbeds) from myhospitals group by
cat_municip"

etc...

Using 6.5 to test a similar case to yours (I assume):

g.region vect=boundary_municp

v.random out=mypoints n=600000

v.db.addtable mypoints col="cat int, cat_municip int" (that's veeeeery slow,
probably because of 600000 update statements to the database in the v.to.db
call...)

time v.distance from=mypoints@sqlite to=boundary_municp@PERMANENT upload=cat
column=cat_municip dmax=0.0

real 2m2.119s <= not so bad

db.select sql="select cat_municip, count(*) from mypoints group by
cat_municip"

So, using the combination of v.distance and db.select I cannot reproduce
your problem with 600,000 points, but maybe the number and nature of
polygons can also play a role...

Moritz

mlennert · August 11, 2010, 9:11am

On 10/08/10 17:53, Markus Metz wrote:

OK, here comes (soon) a speed-up for v.distance

[...]

Results for the 10000 points are identical (distance to nearest area
and category of nearest area).
The code is now a bit more complicated, but reducing processing time
for distance calculations from over 6m down to 2s might justify some
code complexity.

Wow !

Could you explain (briefly) what you did ?

Moritz

Markus_GRASS · August 11, 2010, 9:33am

Moritz Lennert wrote:

Markus Metz wrote:

OK, here comes (soon) a speed-up for v.distance

[...]

Results for the 10000 points are identical (distance to nearest area
and category of nearest area).
The code is now a bit more complicated, but reducing processing time
for distance calculations from over 6m down to 2s might justify some
code complexity.

Wow !

Could you explain (briefly) what you did ?

This all assumes dmax unset or dmax > 0

Original: for each point, the distances to all areas in the To-vector
are calculated with dmax unset

Experimental: the search box is stepwise increased up to the extends
of the To vector. The search stops as soon as an area overlapping with
the current search box is found. A check is included if the area's
bbox overlaps with the search box, but the area itself is outside the
search box, does really happen regularly (e.g. halfmoon area). The
trick is how the search box is increased, in the first few iterations
the search box is very small, only getting larger if really necessary.
With this method, only few areas (the nearest areas) are selected and
distances to these areas are calculated.

For a To vector with many areas and a From vector with many points,
the original amount of distance calculations with dmax unset is n
points x n areas. Now it's only a bit larger than n points, I guess
maximum 4 x n points. Thus this experimental modification really pays
off with many areas. The test vector I used, boundary_municp in the NC
dataset, has 3579 areas, that's not that much, but I observed the
speed increase mentioned above. Now the slow part of v.distance is
updating the attribute table.

Markus M

mlennert · August 11, 2010, 11:42am

On 10/08/10 17:50, Nikos Alexandris wrote:

Hmmm... "dmax=0.0": Is this _my_ problem perhaps? Instead of setting it
directly I was trying to estimate it first with "v.distance -pa" which meands
that I misunderstood the whole process :-/

dmax=0.0 means: only those features that are in the same place, i.e. in the case of from=points and to=areas => only those points which fall into areas.

So, using the combination of v.distance and db.select I cannot reproduce
your problem with 600,000 points, but maybe the number and nature of
polygons can also play a role...

That's interesting. Maybe I have done once again something very messy(?). I
use the 3rd script inside the attached file in ticket # 804 [1]. Although this
(old) script still executes so inefficiently a very large number of SQL
statements, the problem is still only in v.what.vect (so in v.distance) before
the SQL calls.

The script counts several point maps (for example: 404347 points) that fall
inside boxes (that compose a fishnet which I call cell-grid, for example: 1320
vector cells). One run with the above mentioned numbers takes more than 10h.

The specific line(s) in the python script is:

# carry low resolution grid-cell "CAT"s over to reference vector points
                 grass.run_command('v.what.vect',\
                 flags = '-v',\
                 quiet = False,\
                 vector = reference_points_map,\
                 qvect = lowres_vector_grid,\
                 column = gridcell_column,\
                 qcolumn = "cat")

Of course I checked the "problem" with the my data by testing only
"v.what.vect" commands out and apart of my messy script.

Equally, very slow are the trials I did with spearfish (random data). I can
pass some of my data (off-list please) or let me find some time later or
tomorrow to copy-paste from my history the exact commands of my test within
spearfish60.

I just did a similar test with same points and a grid created by

v.mkgrid grid=35,40

(using same column cat_municip from previous test example)
time v.distance from=mypoints@sqlite to=mygrid upload=cat column=cat_municip dmax=0.0

real 2m21.205s

Then testing the idea from the link Markus N added to your bug report:

time v.db.update mygrid col=count value="(SELECT count(*) from mypoints WHERE mygrid.cat=mypoints.cat_municip group by cat_municip)"

real 5m28.312s

One hypothesis I had was that since v.what.vect uses the upload=to_attr option, thus making it necessary to query the to_map's attribute table, this might create significant overhead in database connection, but when using

time v.distance from=mypoints@sqlite to=mygrid upload=to_attr column=cat_municip to_column=cat dmax=0.0

I get

real 2m13.741s

so no significant difference...

And final test with v.what.vect:

time v.what.vect mypoints col=cat_municip qvector=mygrid qcolumn=cat

real 2m9.377s

I'm pretty much at large about what causes your problem...

(
Just a quick look: I did not set dmax=0.0 in my (v.distance) tests. Then
again, in "v.what.vect" it is set by default to 0.0, right? Isn't this default
dmax=0.0 passed (by default) to v.distance?
)

Yes.

Moritz

mlennert · August 11, 2010, 12:24pm

As a final (at least for today )follow-up, just for the record:

On 11/08/10 13:42, Moritz Lennert wrote:

Then testing the idea from the link Markus N added to your bug report:

time v.db.update mygrid col=count value="(SELECT count(*) from mypoints
WHERE mygrid.cat=mypoints.cat_municip group by cat_municip)"

real 5m28.312s

And to show the magic of database indices:

time echo "create index mypoints_cat_municip on mypoints (cat_municip)" | db.execute && time v.db.update mygrid col=count value="(SELECT count(*) from mypoints WHERE mygrid.cat=mypoints.cat_municip group by cat_municip)"

real 0m10.113s
user 0m6.320s
sys 0m1.300s

real 0m0.668s
user 0m0.544s
sys 0m0.124s

And, very interestingly, the difference of behaviour of the different db backends (previous examples were all with SQLite, the following is with PostgreSQL):

time echo "create index mypoints_cat_municip on mypoints_pg (cat_municip)" | db.execute && time v.db.update mygrid_pg col=count value="(SELECT count(*) from mypoints_pg WHERE mygrid_pg.cat=mypoints_pg.cat_municip group by cat_municip)"

real 0m2.905s
user 0m0.012s
sys 0m0.004s

real 0m7.948s
user 0m0.228s
sys 0m0.128s

So, SQLite takes lot's of time creating the index and then is very fast for the update, and the opposite is true for PostgreSQL. Don't know if that's anything we can do something about in GRASS...

Moritz