[GRASS-dev] v.univar question: Why not lines and areas?

v.univar only works with points. But since it is calculating stats on a field in the attributes table, it should work the same for all vector objects. Can we get rid of the limitation that it only works with points?

Michael


C. Michael Barton, Professor of Anthropology
Director of Graduate Studies
School of Human Evolution & Social Change
Center for Social Dynamics & Complexity
Arizona State University

Phone: 480-965-6262
Fax: 480-965-7671
www: <www.public.asu.edu/~cmbarton>

On 27/01/08 20:30, Michael Barton wrote:

v.univar only works with points. But since it is calculating stats on a field in the attributes table, it should work the same for all vector objects. Can we get rid of the limitation that it only works with points?

There was some debate [1] about the statistical validity of working with the other types because, the way it was programmed, the statistics were calculated with weights corresponding to line length or area.

I guess we might want to distinguish between a v.univar which works on the actual vector objects and a v.db.univar which works on any arbitrary attribute (or combination of attributes). We could write a C replacement for the current v.db.univar script on the basis of the code I have for the classification algorithms used in v.class.

As mentioned earlier, it might be better that I move the code from v.class into a library which can then be accessed by different modules...

Currently, I have defined the following statistics:

typedef struct
{
     double count;
     double min;
     double max;
     double sum;
     double sumsq;
     double mean;
     double var;
     double stdev;
} STATS;

But this could easily be extended according to needs and v.db.univar could also use the quantile classification algorithm to extract percentiles.

What are the statistics most people need?

Moritz

[1] http://lists.osgeo.org/pipermail/grass-dev/2004-July/014976.html

On Jan 28, 2008, at 5:50 AM, Moritz Lennert wrote:

[...]

I guess we might want to distinguish between a v.univar which works on the actual vector objects and a v.db.univar which works on any arbitrary attribute (or combination of attributes). We could write a C replacement for the current v.db.univar script on the basis of the code I have for the classification algorithms used in v.class.

AFAICT, v.univar does not calculate anything from vector topology, only from an attribute column. That is, it behaves the way you describe v.db.univar. For some weird (probably historical) reason, it won't calculate anything but N, max, and min of an attribute column linked to a non-point vector object.

"v.univar calculates univariate statistics of vector map features. This includes the number of features counted, minimum and maximum values, and range. Variance and standard deviation is calculated only for points if type=point is defined.
Extended statistics adds median, 1st and 3rd quartiles, and 90th percentile."

An attribute is the same whether it's linked to a point, line, or area.

It would be nice to be able to calculate some stats from topology, but that is not possible at the moment without loading topology.

[...]

What are the statistics most people need?

median, mode, and percentiles would be nice for any attribute or topological data. Any diversity or other non-parametric stats that would actually be useful here?
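As an illustration of how median and percentiles could be obtained from the same kind of value array, here is a small sketch using the nearest-rank definition. This is an assumption for illustration only: it is not the quantile algorithm from v.class nor the method v.univar uses for its extended statistics, and the function name percentile is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Comparison callback for qsort(): ascending order of doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;

    return (x > y) - (x < y);
}

/* Nearest-rank percentile: sorts vals in place, then returns the value
 * at 1-based rank ceil(p/100 * n). p must be in (0, 100].
 * No interpolation is done, so for even n the median (p = 50) is the
 * lower of the two middle values. */
static double percentile(double *vals, size_t n, double p)
{
    double pos;
    size_t rank;

    assert(vals != NULL && n > 0 && p > 0.0 && p <= 100.0);

    qsort(vals, n, sizeof(double), cmp_double);

    pos = p / 100.0 * (double)n;
    rank = (size_t)pos;
    if ((double)rank < pos)
        rank++;                 /* round up without needing libm */
    if (rank < 1)
        rank = 1;
    if (rank > n)
        rank = n;

    return vals[rank - 1];
}
```

Other percentile definitions interpolate between neighbouring values, so results can differ slightly from other tools for small samples.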

It would also be nice to get v.report-type information (count, length, area) summed by value of a string attribute column or a binned numeric column. But this might better be an extension of v.report.

Thanks for checking into it.

Michael


On Jan 28, 2008, at 10:22 AM, Michael Barton wrote:

[...]

median, mode, and percentiles would be nice for any attribute or topological data. Any diversity or other non-parametric stats that would actually be useful here?

I would like to add the mean of absolute values of the attribute. This is useful when the
attribute is a deviation or error, to measure the accuracy of interpolation/approximation
methods (there are some papers that explain why MAE is better than RMSE for this).
Although it is not in the man pages, v.univar computes it for the point option.
Writing out the range is a nice convenience too (see the v.univar output for points; it
has a pretty comprehensive set of statistical measures).

Helena


On 28/01/08 16:22, Michael Barton wrote:

[...]

AFAICT, v.univar does not calculate anything from vector topology,
only from an attribute column.

[...]

An attribute is the same whether it's linked to a point, line, or
area.

v.univar currently calculates as follows for lines and areas, even though the results are never printed (main.c):

[lines:]
    l = Vect_line_length ( Points );
    sum += l*val;
    sumsq += l*val*val;
    sum_abs += l * fabs (val);
    total_size += l;

[areas:]
    a = Vect_get_area_area ( &Map, area );
    sum += a*val;
    sumsq += a*val*val;
    sum_abs += a * fabs (val);
    total_size += a;

    if ( (otype & GV_LINES) || (otype & GV_AREA) ) {
        mean = sum / total_size;
        mean_abs = sum_abs / total_size;

So the mean is actually a weighted mean, with line length or area as the
weight. I don't really know why Radim coded it like this at the time, and
I think we should change this so that it just uses unweighted feature
counts, just as Roger suggested at the time. Try the attached (untested)
patch.
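To show how much the two approaches can differ, here is a small standalone sketch. It is not the patch and not the v.univar code; the sizes array is a hypothetical stand-in for the values returned by Vect_line_length() or Vect_get_area_area() in the excerpt above.

```c
#include <assert.h>
#include <stddef.h>

/* Mean weighted by feature size (line length or area), following the
 * pattern of the excerpt: sum(size * val) / sum(size). */
static double weighted_mean(const double *vals, const double *sizes, size_t n)
{
    double sum = 0.0, total_size = 0.0;
    size_t i;

    assert(vals != NULL && sizes != NULL && n > 0);
    for (i = 0; i < n; i++) {
        sum += sizes[i] * vals[i];
        total_size += sizes[i];
    }
    assert(total_size > 0.0);

    return sum / total_size;
}

/* Plain mean over features: every feature counts once. */
static double unweighted_mean(const double *vals, size_t n)
{
    double sum = 0.0;
    size_t i;

    assert(vals != NULL && n > 0);
    for (i = 0; i < n; i++)
        sum += vals[i];

    return sum / (double)n;
}
```

With two areas holding values 10 and 20 and sizes 1 and 3, the weighted mean is (10*1 + 20*3) / 4 = 17.5, while the unweighted mean is 15.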

One thing that does potentially matter, though, is whether to use the features or the attribute table rows as the base. If several features have the same cat value, this can make a difference: in the former case they will all be counted individually, whereas in the latter case they will be counted only once. If each of the features has an individual meaning, then the former case seems more correct, but not otherwise (e.g. each island of the Philippines counted separately in a table which lists population by country). Obviously we could just say that it is up to the user to make sure that the map data is correct (i.e. in the above example, there should be only one centroid linked to data per country).

The way the routines are written in v.class, they take an arbitrary array of floats, so it is up to the individual modules to decide how to create this array.
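A rough sketch of the two ways a module could build that array, under stated assumptions: the FEATURE record and both function names are hypothetical, and the attribute value is assumed to have been looked up for each feature's cat already.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical input record: one per vector feature, carrying its
 * category and the attribute value looked up for that category. */
typedef struct
{
    int cat;
    double val;
} FEATURE;

/* One array entry per feature: features sharing a cat are all counted.
 * out must have room for n values. Returns the number written. */
static size_t values_by_feature(const FEATURE *f, size_t n, double *out)
{
    size_t i;

    assert(f != NULL && out != NULL);
    for (i = 0; i < n; i++)
        out[i] = f[i].val;

    return n;
}

/* One array entry per distinct cat: only the first feature seen for a
 * cat contributes. Quadratic search, which is fine for a sketch. */
static size_t values_by_cat(const FEATURE *f, size_t n, double *out)
{
    size_t i, j, count = 0;

    assert(f != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        int seen = 0;

        for (j = 0; j < i; j++) {
            if (f[j].cat == f[i].cat) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            out[count++] = f[i].val;
    }

    return count;
}
```

For three islands sharing cat 1 and a single feature with cat 2, the first function yields four values and the second yields two; either array could then be handed to the same statistics routine.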

Moritz

(attachments)

v.univar.diff.gz (1.07 KB)

On Jan 29, 2008, at 5:12 PM, Moritz Lennert wrote:

[...]

This is all very interesting. It is a bit worrisome too. I don't want a mean of an attribute column weighted by area unless I specifically ask for it. This suggests that people using v.univar may not be getting what they think they are getting. I think it is an excellent option, but should not be a silent default.

How to count the features is a bit of an issue, but couldn't this be left up to the user too--summarize by cat or by individual feature as an option?

Michael


On 30/01/08 02:43, Michael Barton wrote:

[...]

This is all very interesting. It is a bit worrisome too. I don't want a mean of an attribute column weighted by area unless I specifically ask for it. This suggests that people using v.univar may not be getting what they think they are getting. I think it is an excellent option, but should not be a silent default.

Well, since the results are not printed, the problem doesn't really exist. The patch I sent doesn't weight at all, just counts features.

How to count the features is a bit of an issue, but couldn't this be left up to the user too--summarize by cat or by individual feature as an option?

That's why I think we should have a library function which calculates stats (i.e. extend what is in the v.class code), and let the modules deal with such issues.

Moritz

On Jan 29, 2008, at 11:38 PM, Moritz Lennert wrote:

[...]

That's why I think we should have a library function which calculates stats (i.e. extend what is in the v.class code), and let the modules deal with such issues.

Moritz

Sounds very good to me.

Michael