[GRASS-dev] How to calculate mean coordinates from big point datasets?

Hi,

I came across this question:

http://gis.stackexchange.com/questions/71734/how-to-calculate-mean-coordinates-from-big-point-datasets

and wondered if this approach would be the fastest:

# http://grass.osgeo.org/sampledata/north_carolina/points.las
v.in.lidar input=points.las output=lidarpoints -o
...
Number of points: 1287775
...

Now I would use
v.univar -d lidarpoints type=point

(still calculating here...)

Is it the best way?

Markus

Markus Neteler wrote:

I came across this question:

http://gis.stackexchange.com/questions/71734/how-to-calculate-mean-coordinates-from-big-point-datasets

so they want to find the average coordinate?

and wondered if this approach would be the fastest:

# http://grass.osgeo.org/sampledata/north_carolina/points.las
v.in.lidar input=points.las output=lidarpoints -o
...
Number of points: 1287775
...

Now I would use
v.univar -d lidarpoints type=point

(still calculating here...)

Is it the best way?

see also the v.points.cog addons script:
http://grasswiki.osgeo.org/wiki/AddOns/GRASS_6#v.points.cog

although I haven't tried it for anything as big as lidar data.

Hamish

On 18/09/13 00:10, Markus Neteler wrote:

Hi,

I came across this question:

http://gis.stackexchange.com/questions/71734/how-to-calculate-mean-coordinates-from-big-point-datasets

and wondered if this approach would be the fastest:

# http://grass.osgeo.org/sampledata/north_carolina/points.las
v.in.lidar input=points.las output=lidarpoints -o
...
Number of points: 1287775
...

Now I would use
v.univar -d lidarpoints type=point

This calculates geometry distances, not the mean coordinate:

$ v.univar -d myhosp type=point

number of primitives: 160
number of non zero distances: 12561
number of zero distances: 0
minimum: 9.16773
maximum: 760776
range: 760767
sum: 2.69047e+09
mean: 214193
mean of absolute values: 214193
population standard deviation: 128505
population variance: 1.65136e+10
population coefficient of variation: 0.599953
sample standard deviation: 128511
sample variance: 1.6515e+10
kurtosis: 0.277564
skewness: 0.801646

Is it the best way?

How about v.to.db -p op=coor and then calculating the mean of the coordinates with an ad-hoc script? But that's probably not any faster than Hamish's v.out.ascii approach in v.points.cog.
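Something like this quick sketch using grass.script (just an ad-hoc illustration, with Markus' "lidarpoints" map assumed):

import grass.script as grass

# print "cat|x|y[|z]" for every point and average the coordinates
out = grass.read_command("v.to.db", flags="p", map="lidarpoints",
                         option="coor", type="point")
n = 0
sum_x = sum_y = 0.0
for line in out.splitlines()[1:]:      # skip the "cat|x|y" header line
    cat, x, y = line.split("|")[:3]
    sum_x += float(x)
    sum_y += float(y)
    n += 1
print("%f %f" % (sum_x / n, sum_y / n))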

Would probably be a nice little module to have in C: calculate centroid of polygon, center point of line and centroid (possibly weighted) for points.

Moritz

On 17 September 2013 22:10, Markus Neteler <neteler@osgeo.org> wrote:

Hi,

I came across this question:

http://gis.stackexchange.com/questions/71734/how-to-calculate-mean-coordinates-from-big-point-datasets

and wondered if this approach would be the fastest:

# http://grass.osgeo.org/sampledata/north_carolina/points.las
v.in.lidar input=points.las output=lidarpoints -o
...
Number of points: 1287775
...

Now I would use
v.univar -d lidarpoints type=point

(still calculating here...)

Is it the best way?

maybe v.median [0] could help?

Markus

[0] http://trac.osgeo.org/grass/browser/grass-addons/grass7/vector/v.median

--
ciao
Luca

http://gis.cri.fmach.it/delucchi/
www.lucadelu.org

On 18/09/13 10:51, Luca Delucchi wrote:

On 17 September 2013 22:10, Markus Neteler<neteler@osgeo.org> wrote:

Hi,

I came across this question:

http://gis.stackexchange.com/questions/71734/how-to-calculate-mean-coordinates-from-big-point-datasets

and wondered if this approach would be the fastest:

# http://grass.osgeo.org/sampledata/north_carolina/points.las
v.in.lidar input=points.las output=lidarpoints -o
...
Number of points: 1287775
...

Now I would use
v.univar -d lidarpoints type=point

(still calculating here...)

Is it the best way?

maybe v.median [0] could help?

[0] http://trac.osgeo.org/grass/browser/grass-addons/grass7/vector/v.median

Right.

Here's a little test:

$ time v.median in=elev_lid792_randpts
638648.500000|220378.500000

real 0m0.249s
user 0m0.180s
sys 0m0.044s

$ time v.to.db elev_lid792_randpts op=coor -p | awk -F'|' 'BEGIN{SUMX=0; SUMY=0; N=0} {N+=1;SUMX+=$2;SUMY+=$3} END{print SUMX/N, SUMY/N}'
Reading features...
  100%
638544 220339

real 0m0.106s
user 0m0.100s
sys 0m0.020s

Would be interesting to see results for big data. And AFAIK median is a bit more difficult to do in awk. I imagine that replacing the median by the mean in numpy is no problem (might be a flag to add to v.median).
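For illustration (hypothetical, not an existing v.median flag), with the coordinates in a numpy array the switch is a one-liner:

import numpy as np

# hypothetical example values: an (n, 2) array of x/y coordinates
coords = np.array([[638000.0, 220000.0],
                   [638100.0, 220300.0],
                   [639200.0, 220150.0]])

print(np.median(coords, axis=0))   # per-column median, what v.median prints now
print(np.mean(coords, axis=0))     # per-column mean, i.e. the center of gravity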

I didn't try v.points.cog as that actually creates a new vector map.

Moritz

On Wed, Sep 18, 2013 at 11:41 AM, Moritz Lennert
<mlennert@club.worldonline.be> wrote:

On 18/09/13 10:51, Luca Delucchi wrote:

On 17 September 2013 22:10, Markus Neteler<neteler@osgeo.org> wrote:

Hi,

I came across this question:

http://gis.stackexchange.com/questions/71734/how-to-calculate-mean-coordinates-from-big-point-datasets

and wondered if this approach would be the fastest:

# http://grass.osgeo.org/sampledata/north_carolina/points.las
v.in.lidar input=points.las output=lidarpoints -o
...
Number of points: 1287775
...

Now I would use
v.univar -d lidarpoints type=point

(still calculating here...)

Is it the best way?

maybe v.median [0] could help?

[0]
http://trac.osgeo.org/grass/browser/grass-addons/grass7/vector/v.median

Right.

Here's a little test:

$ time v.median in=elev_lid792_randpts
638648.500000|220378.500000

Should be 638648|220378. It seems that numpy gets the median wrong...

real 0m0.249s
user 0m0.180s
sys 0m0.044s

$ time v.to.db elev_lid792_randpts op=coor -p | awk -F'|' 'BEGIN{SUMX=0;
SUMY=0; N=0} {N+=1;SUMX+=$2;SUMY+=$3} END{print SUMX/N, SUMY/N}'
Reading features...
100%
638544 220339

Should be 638650 220376

real 0m0.106s
user 0m0.100s
sys 0m0.020s

Would be interesting to see results for big data. And AFAIK median is a bit
more difficult to do in awk. I imagine that replacing the median by the mean
in numpy is no problem (might be a flag to add to v.median).

The v.to.db + v.db.univar approach is working just fine, and provides
correct results.

About a little module to calculate centroid of polygon, center point
of line and centroid (possibly weighted) for points, that would be
easy because all the components are there, even though there are in
theory alternatives to the current calculation of centroids for
polygons.

Markus M

On 18/09/13 16:24, Markus Metz wrote:

On Wed, Sep 18, 2013 at 11:41 AM, Moritz Lennert
<mlennert@club.worldonline.be> wrote:

On 18/09/13 10:51, Luca Delucchi wrote:

On 17 September 2013 22:10, Markus Neteler<neteler@osgeo.org> wrote:

Hi,

I came across this question:

http://gis.stackexchange.com/questions/71734/how-to-calculate-mean-coordinates-from-big-point-datasets

and wondered if this approach would be the fastest:

# http://grass.osgeo.org/sampledata/north_carolina/points.las
v.in.lidar input=points.las output=lidarpoints -o
...
Number of points: 1287775
...

Now I would use
v.univar -d lidarpoints type=point

(still calculating here...)

Is it the best way?

maybe v.median [0] could help?

[0]
http://trac.osgeo.org/grass/browser/grass-addons/grass7/vector/v.median

Right.

Here's a little test:

$ time v.median in=elev_lid792_randpts
638648.500000|220378.500000

Should be 638648|220378. It seems that numpy gets the median wrong...

When I look at the numbers coming out of v.to.db, there is a series of points at X=638648.5 around position 3000 (the middle of the 6000 points), and a series of points at Y=220378.5, so I do believe that numpy is right.
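Side note: with an even number of values, numpy.median() averages the two middle ones, e.g.:

import numpy as np

print(np.median([1, 2, 3, 4]))   # prints 2.5: mean of the two middle values 2 and 3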

real 0m0.249s
user 0m0.180s
sys 0m0.044s

$ time v.to.db elev_lid792_randpts op=coor -p | awk -F'|' 'BEGIN{SUMX=0;
SUMY=0; N=0} {N+=1;SUMX+=$2;SUMY+=$3} END{print SUMX/N, SUMY/N}'
Reading features...
  100%
638544 220339

Should be 638650 220376

Oops, I forgot the header line, so N goes up to 6001 instead of 6000.

real 0m0.106s
user 0m0.100s
sys 0m0.020s

Would be interesting to see results for big data. And AFAIK median is a bit
more difficult to do in awk. I imagine that replacing the median by the mean
in numpy is no problem (might be a flag to add to v.median).

The v.to.db + v.db.univar approach is working just fine, and provides
correct results.

Yes, but it seems a bit overkill to have to go through the attribute table to make such a calculation.

About a little module to calculate centroid of polygon, center point
of line and centroid (possibly weighted) for points, that would be
easy because all the components are there, even though there are in
theory alternatives to the current calculation of centroids for
polygons.

And these alternatives are better ?

Moritz

On Thu, Sep 19, 2013 at 9:15 AM, Moritz Lennert
<mlennert@club.worldonline.be> wrote:

On 18/09/13 16:24, Markus Metz wrote:

Moritz Lennert wrote:

Here's a little test:

$ time v.median in=elev_lid792_randpts
638648.500000|220378.500000

Should be 638648|220378. It seems that numpy gets the median wrong...

When I look at the numbers coming out of v.to.db, there is a series of
points at X=638648.5 around position 3000 (the middle of the 6000
points), and a series of points at Y=220378.5, so I do believe that
numpy is right.

Right, db.univar was printing out with insufficient precision, fixed in r57750.

About a little module to calculate centroid of polygon, center point
of line and centroid (possibly weighted) for points, that would be
easy because all the components are there, even though there are in
theory alternatives to the current calculation of centroids for
polygons.

And these alternatives are better ?

I am not sure. For areas without isles, there is a faster alternative.
For areas with isles, the current approach is fast, but the centroids
might be placed at somewhat unexpected locations (the next best
location inside the area, outside its isles). Here, alternatives would
be much slower.

Markus M

On Thu, Sep 19, 2013 at 2:32 PM, Markus Metz
<markus.metz.giswork@gmail.com> wrote:
...

I am not sure. For areas without isles, there is a faster alternative.
For areas with isles, the current approach is fast, but the centroids
might be placed at somewhat unexpected locations (the next best
location inside the area, outside its isles). Here, alternatives would
be much slower.

the original exercise was about LiDAR points only... perhaps that
needs to be treated differently.

markusN

Luca Delucchi wrote:

maybe v.median [0] could help?

Not for large datasets. First, it requires that the data will fit into
RAM. Second, numpy.median() sorts the entire array and takes the
middle value, which is somewhere between O(n.log(n)) for the typical
case and O(n^2) for the worst case (numpy.median uses the default
sorting algorithm, which is quicksort).

See r.quantile for a more efficient approach for large datasets. The
first pass computes a histogram which allows upper and lower bounds to
be determined for each quantile. The second pass extracts values which
lie within those bounds and sorts them. Except for pathological cases
(where the majority of the data lie within a tiny proportion of the
overall range), only a small fraction of the data are sorted.
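A rough Python sketch of that two-pass idea, just to illustrate it (not r.quantile's actual code):

import numpy as np

def two_pass_median(values, nbins=1000):
    # pass 1: histogram to find the bin that must contain the median
    counts, edges = np.histogram(values, bins=nbins)
    rank = values.size // 2                    # 0-based rank of the (upper) median
    cum = np.cumsum(counts)
    b = np.searchsorted(cum, rank + 1)         # first bin whose cumulative count covers the rank
    # pass 2: sort only the values that fall into that bin
    inside = np.sort(values[(values >= edges[b]) & (values <= edges[b + 1])])
    skipped = cum[b - 1] if b > 0 else 0       # values lying in earlier bins
    return inside[rank - skipped]

vals = np.random.rand(1287775)
print(two_pass_median(vals))   # same value that np.median(vals) gives (n is odd here)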

In any case: is this question about the mean or the median?
Calculating the mean is far simpler, and can easily be done in O(n)
time and O(1) space.
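E.g., a one-pass running mean over "cat|x|y" lines (as printed by v.to.db -p option=coor) only needs a counter and two accumulators; saved as, say, mean_coords.py (hypothetical name) and fed through a pipe:

import sys

n = 0
mean_x = mean_y = 0.0
for line in sys.stdin:
    parts = line.strip().split("|")
    if len(parts) < 3:
        continue
    try:
        x, y = float(parts[1]), float(parts[2])
    except ValueError:
        continue                       # skip the header line
    n += 1
    mean_x += (x - mean_x) / n         # incremental update, O(1) memory
    mean_y += (y - mean_y) / n
print("%f %f" % (mean_x, mean_y))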

--
Glynn Clements <glynn@gclements.plus.com>

Glynn Clements wrote:

Luca Delucchi wrote:

maybe v.median [0] could help?

Not for large datasets. First, it requires that the data will fit into
RAM. Second, numpy.median() sorts the entire array and takes the
middle value, which is somewhere between O(n.log(n)) for the typical
case and O(n^2) for the worst case (numpy.median uses the default
sorting algorithm, which is quicksort).

See r.quantile for a more efficient approach for large datasets. The
first pass computes a histogram which allows upper and lower bounds to
be determined for each quantile. The second pass extracts values which
lie within those bounds and sorts them.

The approach of v.median and r.quantile regards each dimension
separately which is IMHO not correct in a spatial context. The median
of a point cloud would be a multivariate median for 2 or 3 dimensions.
You would need to sort the points first by the first dimension, then
by the second dimension etc, then pick the coordinates of the mid
point. Alternatively you would need to find the point with the
smallest distance to all other points, which is a nightmare to calculate
((n - 1) x (n - 1) distance calculations).

Except for pathological cases
(where the majority of the data lie within a tiny proportion of the
overall range), only a small fraction of the data are sorted.

Coordinates of large point clouds can easily be such pathological cases.

In any case: is this question about the mean or the median?

I guess the median could make sense to account for outliers.

A v.centerpoint module could thus have options for point mode (mean,
median, distance) and line mode (mid point, center of gravity). Area
centroids are already calculated by v.category type=area, even though
they are not centers of gravity.

Markus M

On Fri, Sep 20, 2013 at 5:38 PM, Markus Metz
<markus.metz.giswork@gmail.com> wrote:

Glynn Clements wrote:

Luca Delucchi wrote:

maybe v.median [0] could help?

Not for large datasets. First, it requires that the data will fit into
RAM. Second, numpy.median() sorts the entire array and takes the
middle value, which is somewhere between O(n.log(n)) for the typical
case and O(n^2) for the worst case (numpy.median uses the default
sorting algorithm, which is quicksort).

See r.quantile for a more efficient approach for large datasets. The
first pass computes a histogram which allows upper and lower bounds to
be determined for each quantile. The second pass extracts values which
lie within those bounds and sorts them.

The approach of v.median and r.quantile regards each dimension
separately which is IMHO not correct in a spatial context.

This

The median
of a point cloud would be a multivariate median for 2 or 3 dimensions.
You would need to sort the points first by the first dimension, then
by the second dimension etc, then pick the coordinates of the mid
point.

is wrong, please ignore.

This

Alternatively you would need to find the point with the
smallest distance to all other points, which is a nightmare to calculate
((n - 1) x (n - 1) distance calculations).

is correct, but instead of finding the existing point with the
smallest distance, that point can be approximated with less effort.
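For the record, one well-known approximation is Weiszfeld's iteration; a rough numpy sketch (just to illustrate the idea, not necessarily what will be implemented):

import numpy as np

def geometric_median(points, eps=1e-6, max_iter=100):
    """Approximate the point of minimum summed distance (Weiszfeld's algorithm)."""
    y = points.mean(axis=0)                       # start at the center of gravity
    for _ in range(max_iter):
        d = np.linalg.norm(points - y, axis=1)
        nz = d > 0                                # ignore points coinciding with y
        w = 1.0 / d[nz]
        y_new = (points[nz] * w[:, None]).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < eps:
            break
        y = y_new
    return y_new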

Markus M

Markus Neteler wrote:

Hi,

I came across this question:

http://gis.stackexchange.com/questions/71734/how-to-calculate-mean-coordinates-from-big-point-datasets

Please try the new addon v.centerpoint [0]. It calculates various
center points for point clouds, lines, and areas. Standard options are
the geometric mean (center of gravity) and geometric median (more
robust for outliers). There are additional options to calculate center
points for points, lines and areas, explained in the manual. The
geometric medians (points of minimum distance to all other points) are
approximations, but fairly accurate and very fast. I would be
surprised if any other GIS software can calculate these center points.

Markus M

[0] http://grasswiki.osgeo.org/wiki/AddOns/GRASS7/vector#v.centerpoint

Markus,

While you are at it, can you also have a look at v.kcv?
http://grass.osgeo.org/grass70/manuals/v.kcv.html

I am wondering whether it works efficiently with lidar data sets (millions of points) - I heard that it takes forever,
but that was about a year ago and I haven't tried it myself.
For example, if I want to partition the data set into 1% test points and 99% given data points (for example to test
interpolation), it appears I may need 100 partitions, as there is no way to have just two partitions of different sizes.

One of the problems may be the table - perhaps if this was run without the table and the output was written into
two (or k) separate files, it could be faster? The core of the algorithm which is based on finding the closest
points to the set of random points should allow this.
Creating a subset of points which are farther apart than a given threshold (2D or 3D distance) would also be useful
(it is done internally in v.surf.rst, and I have a version which does it with 3D distances, but the resulting subset is not
written into an output file).
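Something along these lines (a rough grid-hash sketch, not what v.surf.rst actually does internally) is the kind of thing I have in mind:

import numpy as np

def thin_points(points, thresh):
    """Greedy 2D thinning: keep only points farther than thresh from all kept points."""
    grid = {}                                  # (ix, iy) -> kept points in that cell
    kept = []
    for p in points:
        ix, iy = int(p[0] // thresh), int(p[1] // thresh)
        # any kept point closer than thresh must sit in one of the 3x3 neighbour cells
        near = (q for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                for q in grid.get((ix + dx, iy + dy), ()))
        if any((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 < thresh ** 2 for q in near):
            continue
        kept.append(p)
        grid.setdefault((ix, iy), []).append(p)
    return np.array(kept)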

This is not urgent but if it is not too difficult it would be nice to have,
or let us know if it already works and I just cannot find the right module,

Helena

Helena Mitasova
Associate Professor
Department of Marine, Earth, and Atmospheric Sciences
2800 Faucette Drive, Rm. 1125 Jordan Hall
North Carolina State University
Raleigh, NC 27695-8208
hmitaso@ncsu.edu

"All electronic mail messages in connection with State business which are sent to or received by this account are subject to the NC Public Records Law and may be disclosed to third parties.”

On Sep 22, 2013, at 11:06 AM, Markus Metz wrote:

Markus Neteler wrote:

Hi,

I came across this question:

http://gis.stackexchange.com/questions/71734/how-to-calculate-mean-coordinates-from-big-point-datasets

Please try the new addon v.centerpoint [0]. It calculates various
center points for point clouds, lines, and areas. Standard options are
the geometric mean (center of gravity) and geometric median (more
robust for outliers). There are additional options to calculate center
points for points, lines and areas, explained in the manual. The
geometric medians (points of minimum distance to all other points) are
approximations, but fairly accurate and very fast. I would be
surprised if any other GIS software can calculate these center points.

Markus M

[0] http://grasswiki.osgeo.org/wiki/AddOns/GRASS7/vector#v.centerpoint

Helena Mitasova wrote:

Markus,

While you are at it, can you also have a look at v.kcv?
http://grass.osgeo.org/grass70/manuals/v.kcv.html

I am wondering whether it works efficiently with lidar data sets (millions of points) - I heard that it takes forever,
but that was about a year ago and I haven't tried it myself.

v.kcv has been improved recently in trunk, thanks to Jan Vandrol and
Jan Ruzicka.

For example, if I want to partition the data set into 1% test points and 99% given data points (for example to test
interpolation), it appears I may need 100 partitions, as there is no way to have just two partitions of different sizes.

The number of partitions does not influence speed (in trunk).

One of the problems may be the table - perhaps if this was run without the table and the output was written into
two (or k) separate files, it could be faster?

Yes, updating the table can be slow. For a large number of points it
is recommended to not use dbf. Creating a separate new vector for each
partition could be an alternative.

The core of the algorithm which is based on finding the closest
points to the set of random points should allow this.

This is the part that makes v.kcv slow in 6.x.

Creating a subset of points which are farther apart than a given threshold (2D or 3D distance) would also be useful
(it is done internally in v.surf.rst, and I have a version which does it with 3D distances, but the resulting subset is not
written into an output file).

For that you would need a spatial search structure in order to be
reasonably fast. I guess that v.surf.rst uses quad- or oct-trees for
this purpose.

This is not urgent but if it is not too difficult it would be nice to have,
or let us know if it already works and I just cannot find the right module,

As of 2013-07-19, v.kcv in trunk is much faster than in 6.x. Creating
subsets of points which are farther apart than a given threshold is not
implemented, but that would not be too difficult to add using a
spatial index for each partition.

Markus M

Hamish wrote:

see also the v.points.cog addons script:
http://grasswiki.osgeo.org/wiki/AddOns/GRASS_6#v.points.cog

although I haven't tried it for anything as big as lidar data.

oops, I completely forgot to mention the r.cog addon script too:

https://trac.osgeo.org/grass/browser/grass-addons/grass6/raster/r.cog

The main goal isn't just center-of-gravity, it's also to reduce the
dataset to a single trend plane. In that way you might subtract out the
principal component of the z-surface to show the underlying signal
(or bring the primary trend into focus). This is e.g. a common task in
working with magnetic data where the Earth's field dominates, but
is relatively smooth compared to the local signal.

Good sample data for it are the Spearfish elevation rasters, since there
is a strong N-S trend; for categorical features, try the Spearfish
fields map and choose different combinations of categories to move the
center point around.

I suspect that it could be quite effective with lidar if used with
r.in.{xyz,lidar}, and I would be interested to see how the results compare
numerically with MMetz's new module, especially if the 10% trimmed-mean
binning method were used to solve the mean vs. median outlier problem
(for really big datasets that method has a non-trivial RAM cost, though).
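For reference, a plain 10% trimmed mean over a single column of values is just (simple 1-D illustration, assuming enough values remain after trimming):

import numpy as np

def trimmed_mean(values, trim=0.10):
    v = np.sort(np.asarray(values, dtype=float))
    k = int(v.size * trim)                 # number of values cut from each end
    return v[k:v.size - k].mean()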

r.cog is not listed on the wiki yet since the dip angle and azimuth +
r.plane trend map is still a work in progress -- help welcome, I've
got the 3D pivot point and azimuth seemingly ok, but the mean slope
has proven tricky. My last attempt at it was to split the raster into
three equal-area triangular zones (sort of looking like the South African
flag), then do the COG for each (yes, a recursive script :-) to get 3 3D
points, then fit a plane formula to them. IIRC the trouble I was having
with that was that if there were large areas of NULL on one side, it threw
the whole thing off.
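Fitting z = a*x + b*y + c through the three (x, y, z) COG points is just a small linear solve, something like the following (made-up numbers, and the azimuth/dip conventions are only how I'd define them):

import numpy as np

# three centre-of-gravity points (x, y, z) -- made-up values
pts = np.array([[637000.0, 220000.0, 105.2],
                [638500.0, 221500.0, 112.7],
                [639800.0, 219400.0,  98.4]])

# solve [x y 1] * (a, b, c) = z
A = np.column_stack([pts[:, 0], pts[:, 1], np.ones(3)])
a, b, c = np.linalg.solve(A, pts[:, 2])

slope = np.degrees(np.arctan(np.hypot(a, b)))          # dip angle of the plane
azimuth = np.degrees(np.arctan2(-a, -b)) % 360.0       # downhill direction, 0 = N, clockwise
print("z = %g*x + %g*y + %g, dip %.2f deg, azimuth %.2f deg" % (a, b, c, slope, azimuth))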

See the example in the v.points.cog help page for an idea of what its
original target application was: placing labels in just the right spot,
while r.cog was written mainly with the idea of determining/creating
trend surfaces.

regards,
Hamish

Moritz:

Would be interesting to see results for big data. And AFAIK median is a
bit more difficult to do in awk. I imagine that replacing the median by
the mean in numpy is no problem (might be a flag to add to v.median).

see the old r.univar.sh shell script for an idea of how to get a median
using unix power tools, it's fairly trivial.

https://trac.osgeo.org/grass/browser/grass/branches/develbranch_6/scripts/r.univar.sh/r.univar.sh#L135

Hamish

On 22/09/13 17:06, Markus Metz wrote:

Markus Neteler wrote:

Hi,

I came across this question:

http://gis.stackexchange.com/questions/71734/how-to-calculate-mean-coordinates-from-big-point-datasets

Please try the new addon v.centerpoint [0]. It calculates various
center points for point clouds, lines, and areas. Standard options are
the geometric mean (center of gravity) and geometric median (more
robust for outliers). There are additional options to calculate center
points for points, lines and areas, explained in the manual. The
geometric medians (points of minimum distance to all other points) are
approximations, but fairly accurate and very fast. I would be
surprised if any other GIS software can calculate these center points.

Thanks a lot! This is a very nice addition and should be part of core GRASS, IMHO.

Moritz

Markus Metz wrote:

Please try the new addon v.centerpoint [0]. It calculates various
center points for point clouds, lines, and areas. Standard options are
the geometric mean (center of gravity)

That's the arithmetic mean.

The geometric mean of a set of N values is the Nth root of the product
of the values (i.e. the logarithm of the geometric mean is the
arithmetic mean of the logarithms of the values). It wouldn't be
meaningful to compute the geometric mean of coordinate data.
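In numpy terms:

import numpy as np

x = np.array([1.0, 10.0, 100.0])
gm_direct = x.prod() ** (1.0 / x.size)   # Nth root of the product
gm_logs = np.exp(np.log(x).mean())       # exp of the arithmetic mean of the logs
# both are ~10, whereas the arithmetic mean x.mean() is 37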

--
Glynn Clements <glynn@gclements.plus.com>

Glynn Clements wrote:

Markus Metz wrote:

Please try the new addon v.centerpoint [0]. It calculates various
center points for point clouds, lines, and areas. Standard options are
the geometric mean (center of gravity)

That's the arithmetic mean.

The geometric mean of a set of N values is the Nth root of the product
of the values (i.e. the logarithm of the geometric mean is the
arithmetic mean of the logarithms of the values). It wouldn't be
meaningful to compute the geometric mean of coordinate data.

Right, I fixed the documentation. Internally, the arithmetic mean was
already being used to calculate the centers of gravity.

Markus M