Now I would use
v.univar -d lidarpoints type=point

This calculates geometry distances, not the mean coordinate:

$ v.univar -d myhosp type=point

number of primitives: 160
number of non zero distances: 12561
number of zero distances: 0
minimum: 9.16773
maximum: 760776
range: 760767
sum: 2.69047e+09
mean: 214193
mean of absolute values: 214193
population standard deviation: 128505
population variance: 1.65136e+10
population coefficient of variation: 0.599953
sample standard deviation: 128511
sample variance: 1.6515e+10
kurtosis: 0.277564
skewness: 0.801646
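For intuition, here is a rough numpy sketch of what the -d flag presumably computes — distances between point primitives — using hypothetical coordinates (the map name and statistics above come from the real run; nothing below does):

```python
import numpy as np

# Hypothetical point coordinates; v.univar -d works on the actual map geometry.
rng = np.random.default_rng(42)
pts = rng.uniform(0, 1000, size=(160, 2))

# All pairwise Euclidean distances between distinct point pairs (i < j),
# i.e. n*(n-1)/2 distances for n points.
diff = pts[:, None, :] - pts[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
iu = np.triu_indices(len(pts), k=1)
d = dist[iu]

print(len(d))  # 160*159/2 = 12720 distances
print(d.min(), d.max(), d.mean())
```

For n distinct points this gives n*(n-1)/2 distances; how v.univar itself pairs primitives is not checked here.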

Is this the best way?

How about v.to.db -p op=coor and then calculating the mean of the coordinates with an ad-hoc script? But that's probably not any faster than Hamish's v.out.ascii approach in v.points.cog.
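A minimal sketch of that ad-hoc script idea — the column layout (cat|x|y with a header line) is an assumption here, so check it against actual v.to.db -p output:

```python
import io

# Hypothetical capture of `v.to.db -p map=points option=coor` output;
# assumed to be pipe-separated cat|x|y with a header line.
sample = """\
cat|x|y
1|638000.5|220100.0
2|639000.5|220500.0
3|638648.5|220378.5
"""

xs, ys, n = 0.0, 0.0, 0
for line in io.StringIO(sample):
    parts = line.strip().split("|")
    try:
        x, y = float(parts[1]), float(parts[2])
    except (IndexError, ValueError):
        continue  # skip the header line rather than counting it
    xs += x
    ys += y
    n += 1

print(xs / n, ys / n)  # mean coordinate
```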

It would probably be a nice little module to have in C: calculate the centroid of a polygon, the center point of a line, and a (possibly weighted) centroid for points.

Would be interesting to see results for big data. And AFAIK median is a bit more difficult to do in awk. I imagine that replacing the median by the mean in numpy is no problem (might be a flag to add to v.median).
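On the numpy side, median and mean are indeed interchangeable one-liners, each computed per dimension (coordinates below are made up):

```python
import numpy as np

# Hypothetical coordinate array, one row per point, columns x and y.
coords = np.array([
    [638000.0, 220100.0],
    [638648.0, 220378.0],
    [639200.0, 220900.0],
])

print(np.median(coords, axis=0))  # per-dimension median
print(np.mean(coords, axis=0))    # per-dimension mean
```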

I didn't try v.points.cog as that actually creates a new vector map.


The v.to.db + v.db.univar approach is working just fine, and provides
correct results.

About a little module to calculate centroid of polygon, center point
of line and centroid (possibly weighted) for points, that would be
easy because all the components are there, even though there are in
theory alternatives to the current calculation of centroids for
polygons.
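For the polygon part, the standard area-weighted centroid of a simple polygon is a few lines with the shoelace formula — a generic sketch, not necessarily how GRASS places its centroids internally:

```python
def polygon_centroid(pts):
    """Area-weighted centroid of a simple (non-self-intersecting) polygon.

    pts: list of (x, y) vertices, first vertex not repeated at the end.
    Uses the shoelace formula; isles are not handled here.
    """
    a = cx = cy = 0.0
    n = len(pts)
    for i in range(n):
        x0, y0 = pts[i]
        x1, y1 = pts[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        a += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    a *= 0.5
    return cx / (6.0 * a), cy / (6.0 * a)

print(polygon_centroid([(0, 0), (4, 0), (4, 2), (0, 2)]))  # (2.0, 1.0)
```

Note that for non-convex areas (or areas with isles) this point can fall outside the area, which is why centroid placement needs the extra logic discussed here.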

Should be 638648|220378. It seems that numpy gets the median wrong...

When I look at the numbers coming out of v.to.db, there is a series of points at X=638648.5 around the middle rank (~3000 of 6000), and a series of points at Y=220378.5, so I do believe that numpy is right.

Oops, forgot the header line, so N goes up to 6001, instead of 6000.

real 0m0.106s
user 0m0.100s
sys 0m0.020s


Yes, but it seems a bit overkill to have to go through the attribute table to make such a calculation.


Should be 638648|220378. It seems that numpy gets the median wrong...

When I look at the numbers coming out of v.to.db, there is a series of
points at X=638648.5 around the middle rank (~3000 of 6000), and a series
of points at Y=220378.5, so I do believe that numpy is right.

Right, db.univar was printing out with insufficient precision, fixed in r57750.

About a little module to calculate centroid of polygon, center point
of line and centroid (possibly weighted) for points, that would be
easy because all the components are there, even though there are in
theory alternatives to the current calculation of centroids for
polygons.

And are these alternatives better?

I am not sure. For areas without isles, there is a faster alternative.
For areas with isles, the current approach is fast, but the centroids
might be placed at somewhat unexpected locations (the next best
location inside the area, outside its isles). Here, alternatives would
be much slower.

On Thu, Sep 19, 2013 at 2:32 PM, Markus Metz
<markus.metz.giswork@gmail.com> wrote:
...


the original exercise was about LiDAR points only... perhaps that
needs to be treated differently.

Not for large datasets. First, it requires that the data fit into
RAM. Second, numpy.median() sorts the entire array and takes the
middle value, which is somewhere between O(n log n) for the typical
case and O(n^2) for the worst case (numpy.median uses the default
sorting algorithm, which is quicksort).

See r.quantile for a more efficient approach for large datasets. The
first pass computes a histogram which allows upper and lower bounds to
be determined for each quantile. The second pass extracts values which
lie within those bounds and sorts them. Except for pathological cases
(where the majority of the data lie within a tiny proportion of the
overall range), only a small fraction of the data are sorted.
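A sketch of that two-pass idea (not r.quantile's actual C code): pass 1 histograms the data to find the bin holding the median rank, pass 2 sorts only that bin:

```python
import numpy as np

def two_pass_median(a, nbins=1000):
    """Median via histogram bounding, as described for r.quantile.

    A sketch, not r.quantile's implementation: pass 1 builds a histogram
    and locates the bin containing the median rank; pass 2 sorts only
    the values falling in that bin.
    """
    a = np.asarray(a, dtype=float)
    n = len(a)
    counts, edges = np.histogram(a, bins=nbins)
    k = (n - 1) // 2                      # rank of the lower middle value
    cum = np.cumsum(counts)
    b = int(np.searchsorted(cum, k + 1))  # bin holding rank k
    lo, hi = edges[b], edges[b + 1]
    below = int((a < lo).sum())           # values below the bin
    in_bin = np.sort(a[(a >= lo) & (a <= hi)])
    return in_bin[k - below]

rng = np.random.default_rng(0)
data = rng.normal(size=100001)
# odd n, so the lower-middle rank is the exact median
print(two_pass_median(data), np.median(data))
```

Unless most of the data falls into one bin (the pathological case mentioned above), only a small slice of the array is ever sorted.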

In any case: is this question about the mean or the median?
Calculating the mean is far simpler, and can easily be done in O(n)
time and O(1) space.
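For the mean, a one-pass sketch — constant memory, so the data never has to fit into RAM at once:

```python
def streaming_mean_xy(points):
    """Mean coordinate in one pass: O(n) time, O(1) memory.

    points: any iterable of (x, y) tuples; nothing is stored.
    """
    n = sx = sy = 0
    for x, y in points:
        sx += x
        sy += y
        n += 1
    return sx / n, sy / n

print(streaming_mean_xy([(0.0, 0.0), (2.0, 4.0), (4.0, 8.0)]))  # (2.0, 4.0)
```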


The approach of v.median and r.quantile regards each dimension
separately, which is IMHO not correct in a spatial context. The median
of a point cloud would be a multivariate median for 2 or 3 dimensions.
You would need to sort the points first by the first dimension, then
by the second dimension, etc., then pick the coordinates of the mid
point. Alternatively, you would need to find the point with the
smallest distance to all other points, which is a nightmare to
calculate ((n - 1) x (n - 1) distance calculations).
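The brute-force search described above looks like this in numpy — note the full O(n^2) distance matrix, which is exactly what makes it infeasible for large point clouds:

```python
import numpy as np

# Hypothetical small point cloud.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [4.0, 5.0]])

# Sum of distances from every point to all others.
diff = pts[:, None, :] - pts[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
sums = dist.sum(axis=1)

# The existing point with the smallest total distance (the medoid).
medoid = pts[np.argmin(sums)]
print(medoid)
```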

Except for pathological cases
(where the majority of the data lie within a tiny proportion of the
overall range), only a small fraction of the data are sorted.

Coordinates of large point clouds can easily be such pathological cases.

In any case: is this question about the mean or the median?

I guess the median could make sense to account for outliers.

A v.centerpoint module could thus have options for point mode (mean,
median, distance) and line mode (mid point, center of gravity). Area
centroids are already calculated by v.category type=area, even though
they are not centers of gravity.

On Fri, Sep 20, 2013 at 5:38 PM, Markus Metz
<markus.metz.giswork@gmail.com> wrote:

Glynn Clements wrote:

Luca Delucchi wrote:

maybe v.median [0] could help?


The approach of v.median and r.quantile regards each dimension
separately which is IMHO not correct in a spatial context.

This

The median
of a point cloud would be a multivariate median for 2 or 3 dimensions.
You would need to sort the points first by the first dimension, then
by the second dimension etc, then pick the coordinates of the mid
point.

is wrong, please ignore.

This

Alternatively you would need to find the point with the
smallest distance to all other points, which is nightmare to calculate
( (n - 1) x (n - 1) distance calculations).

is correct, but instead of finding the existing point with the
smallest distance, that point can be approximated with less effort.
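One classic way to approximate that minimum-distance point with less effort is Weiszfeld's iteration, where each step is only O(n); the thread does not say which method v.centerpoint actually uses, so this is a generic sketch:

```python
import numpy as np

def weiszfeld(pts, iters=100, eps=1e-9):
    """Approximate the geometric median by Weiszfeld's iteration.

    Each iteration is O(n), avoiding the O(n^2) all-pairs search.
    Generic sketch; not necessarily v.centerpoint's method.
    """
    y = pts.mean(axis=0)  # start from the center of gravity
    for _ in range(iters):
        d = np.linalg.norm(pts - y, axis=1)
        d = np.maximum(d, eps)  # guard against division by zero
        w = 1.0 / d
        y_new = (pts * w[:, None]).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < eps:
            break
        y = y_new
    return y

pts = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0]])
print(weiszfeld(pts))  # ~ (0.5, 0.5), barely pulled toward the outlier
```

For contrast, the mean of these points is (2.75, 2.75) — dragged far toward the (10, 10) outlier, which is the robustness argument for the median made above.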

Please try the new addon v.centerpoint [0]. It calculates various
center points for point clouds, lines, and areas. Standard options are
the geometric mean (center of gravity) and geometric median (more
robust for outliers). There are additional options to calculate center
points for points, lines and areas, explained in the manual. The
geometric medians (points of minimum distance to all other points) are
approximations, but fairly accurate and very fast. I would be
surprised if any other GIS software can calculate these center points.

I am wondering whether it works efficiently with lidar data sets (millions of points) - I heard that it takes forever,
but that was about a year ago and I haven't tried it myself.
For example, if I want to partition the data set into 1% test points and 99% given data points (for example to test
interpolation), it appears I may need 100 partitions, as there is no way to have just two partitions of different sizes.

One of the problems may be the table - perhaps if this was run without the table and the output was written into
two (or k) separate files, it could be faster? The core of the algorithm which is based on finding the closest
points to the set of random points should allow this.
Creating a subset of points which are farther apart than given threshold (2d or 3d distance) would be also useful
(it is done internally in v.surf.rst and I have a version which does it with 3D distances, but the resulting subset is not
written into output file).

This is not urgent but if it is not too difficult it would be nice to have,
or let us know if it already works and I just cannot find the right module,

Helena

Helena Mitasova
Associate Professor
Department of Marine, Earth, and Atmospheric Sciences
2800 Faucette Drive, Rm. 1125 Jordan Hall
North Carolina State University
Raleigh, NC 27695-8208
hmitaso@ncsu.edu



I am wondering whether it works efficiently with lidar data sets (millions of points) - I heard that it takes forever,
but that was about a year ago and I haven't tried it myself.

v.kcv has been improved recently in trunk, thanks to Jan Vandrol and
Jan Ruzicka.

For example, if I want to partition the data set into 1% test points and 99% given data points (for example to test
interpolation) it appears I may need 100 partitions as there is no way to have just two partitions with different size.

The number of partitions does not influence speed (in trunk).

One of the problems may be the table - perhaps if this was run without the table and the output was written into
two (or k) separate files, it could be faster?

Yes, updating the table can be slow. For a large number of points it
is recommended to not use dbf. Creating a separate new vector for each
partition could be an alternative.

The core of the algorithm which is based on finding the closest
points to the set of random points should allow this.

This is the part that makes v.kcv slow in 6.x.

Creating a subset of points which are farther apart than given threshold (2d or 3d distance) would be also useful
(it is done internally in v.surf.rst and I have a version which does it with 3D distances, but the resulting subset is not
written into output file).

For that you would need a spatial search structure in order to be
reasonably fast. I guess that v.surf.rst uses quad- or oct-trees for
this purpose.

This is not urgent but if it is not too difficult it would be nice to have,
or let us know if it already works and I just cannot find the right module,

As of 2013-07-19, v.kcv in trunk is much faster than in 6.x. Creating
subsets of points which are farther apart than a given threshold is not
implemented, but that would not be too difficult to add using a
spatial index for each partition.
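A sketch of such threshold thinning with a simple uniform-grid spatial index (hypothetical code, not the quadtree inside v.surf.rst): each candidate point only checks the 3x3 neighbourhood of grid cells, so the greedy pass stays near O(n).

```python
import math
from collections import defaultdict

def thin_points(points, thresh):
    """Keep a greedy subset of 2D points pairwise farther apart than thresh.

    Uses a uniform grid (cell size = thresh) as the spatial index; a 3D
    variant would use the same idea with cubes.
    """
    grid = defaultdict(list)
    kept = []
    for x, y in points:
        i, j = int(math.floor(x / thresh)), int(math.floor(y / thresh))
        ok = True
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                for px, py in grid[(i + di, j + dj)]:
                    if (x - px) ** 2 + (y - py) ** 2 <= thresh ** 2:
                        ok = False
        if ok:
            grid[(i, j)].append((x, y))
            kept.append((x, y))
    return kept

pts = [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0), (3.2, 0.1)]
print(thin_points(pts, 1.0))  # [(0.0, 0.0), (3.0, 0.0)]
```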

The main goal isn't just center-of-gravity, it's also to reduce the
dataset to a single trend plane. In that way you might subtract out the
principal component of the z-surface to show the underlying signal
(or bring the primary trend into focus). This is e.g. a common task in
working with magnetic data where the Earth's field dominates, but
is relatively smooth compared to the local signal.

Good sample data for it are the spearfish elevation rasters, since there
is a strong N-S trend; for categorical features, use the spearfish
fields map and choose different combinations of categories to move the
center point around.

I suspect that it could be quite effective with lidar if used with
r.in.{xyz,lidar}, and I would be interested to see how the results compare
numerically with MMetz's new module, especially if the 10% trimmed-mean
binning method were used to solve the mean vs. median outliers problem.
(For really big datasets that method has a non-trivial RAM cost, though.)
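For reference, a 10% trimmed mean drops the lowest and highest 10% of values before averaging — a generic sketch; whether r.in.{xyz,lidar}'s trimming convention matches this exactly is not checked here:

```python
import numpy as np

def trimmed_mean(a, trim=0.10):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    a = np.sort(np.asarray(a, dtype=float))
    k = int(len(a) * trim)
    return a[k:len(a) - k].mean()

# One outlier (100.0) drags the plain mean; the trimmed mean ignores it.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
print(np.mean(data), trimmed_mean(data))  # 14.5 vs 5.5
```

This shows the compromise between mean and median: outliers are discarded, but (unlike a streaming mean) the values must be held and sorted, hence the RAM cost noted above.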

r.cog is not listed on the wiki yet since the dip angle and azimuth +
r.plane trend map is still a work in progress -- help welcome. I've
got the 3D pivot point and azimuth seemingly ok, but the mean slope
has proven tricky. My last attempt at it was to split the raster into
three equal-area triangular zones (sort of looking like the South African
flag), then do the COG for each (yes, a recursive script) to get three
3D points, then fit a plane formula to them. IIRC the trouble I was
having with that was that large areas of NULL on one side threw the
whole thing off.
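Fitting a plane formula to three 3D points (e.g. the three zone COGs) is just a 3x3 linear solve — hypothetical numbers below:

```python
import numpy as np

# Three hypothetical center-of-gravity points (x, y, z), one per zone.
p = np.array([[0.0, 0.0, 10.0],
              [10.0, 0.0, 12.0],
              [0.0, 10.0, 14.0]])

# A plane z = a*x + b*y + c through three non-collinear points:
# solve the 3x3 linear system [x y 1] @ (a, b, c) = z.
A = np.column_stack([p[:, 0], p[:, 1], np.ones(3)])
a, b, c = np.linalg.solve(A, p[:, 2])
print(a, b, c)  # slope in x, slope in y, z-intercept
```

The dip angle and azimuth then follow from the gradient (a, b), which is where the NULL-imbalance problem described above would skew the result.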

See the example in the v.points.cog help page for an idea of its
original target application: placing labels in just the right spot.
r.cog, on the other hand, was written mainly with the idea of
determining/creating trend surfaces.

Would be interesting to see results for big data. And AFAIK median is a
bit more difficult to do in awk. I imagine that replacing the median by
the mean in numpy is no problem (might be a flag to add to v.median).

see the old r.univar.sh shell script for an idea of how to get a median
using unix power tools, it's fairly trivial.

Please try the new addon v.centerpoint [0]. It calculates various
center points for point clouds, lines, and areas. Standard options are
the geometric mean (center of gravity) and geometric median (more
robust for outliers). There are additional options to calculate center
points for points, lines and areas, explained in the manual. The
geometric medians (points of minimum distance to all other points) are
approximations, but fairly accurate and very fast. I would be
surprised if any other GIS software can calculate these center points.

Thanks a lot! This is a very nice addition and should be part of core GRASS, IMHO.

Please try the new addon v.centerpoint [0]. It calculates various
center points for point clouds, lines, and areas. Standard options are
the geometric mean (center of gravity)

That's the arithmetic mean.

The geometric mean of a set of N values is the Nth root of the product
of the values (i.e. the logarithm of the geometric mean is the
arithmetic mean of the logarithms of the values). It wouldn't be
meaningful to compute the geometric mean of coordinate data.
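A quick numeric illustration of the difference (note also that the geometric mean is undefined as soon as a value is zero or negative, as coordinates can be):

```python
import math

values = [2.0, 8.0]

arithmetic = sum(values) / len(values)  # (2 + 8) / 2 = 5.0
# Nth root of the product, via exp of the mean of the logs:
geometric = math.exp(sum(math.log(v) for v in values) / len(values))  # ~4.0

print(arithmetic, geometric)
```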


Right, I fixed the documentation. Internally, the arithmetic mean was
already being used to calculate centers of gravity.