[GRASS5] v.univar

I have written v.univar, but I am not sure how to calculate statistics
for lines and areas. Does the code below make sense?
Radim

Lines:
   for each line {
        sum += line_length * variable;
        sumsq += line_length * variable * variable;
        total_length += line_length;
   }
   mean = sum / total_length;
   population_variance = (sumsq - sum*sum/total_length)/total_length;
   population_stdev = sqrt(population_variance);

Areas:
   for each area {
        sum += area_size * variable;
        sumsq += area_size * variable * variable;
        total_size += area_size;
   }
   mean = sum / total_size;
   population_variance = (sumsq - sum*sum/total_size)/total_size;
   population_stdev = sqrt(population_variance);
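
For reference, here is a minimal runnable sketch of the pseudocode above in Python. It is not the v.univar source; the function name weighted_stats and the sample numbers are made up for illustration, and the weights stand in for line_length or area_size.

```python
import math


def weighted_stats(values, weights):
    """Weighted mean, population variance and standard deviation.

    `weights` plays the role of line_length or area_size in the pseudocode.
    """
    total = 0.0   # total_length / total_size
    wsum = 0.0    # sum
    wsumsq = 0.0  # sumsq
    for value, weight in zip(values, weights):
        wsum += weight * value
        wsumsq += weight * value * value
        total += weight
    if total <= 0.0:
        raise ValueError("total weight must be positive")
    mean = wsum / total
    # Clamp at zero: rounding can make the one-pass formula slightly negative.
    variance = max((wsumsq - wsum * wsum / total) / total, 0.0)
    return mean, variance, math.sqrt(variance)


# Two hypothetical lines: value 10 on 1 unit of length, value 20 on 3 units.
mean, variance, stdev = weighted_stats([10.0, 20.0], [1.0, 3.0])
print(mean, variance, stdev)
```

With these inputs the longer line dominates, so the mean (17.5) sits closer to 20 than to 10.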

On Fri, 2 Jul 2004, Radim Blazek wrote:

> I have written v.univar. I am not sure how to calculate statistics
> for lines and areas, does the code below make sense?

Briefly, no. You are calculating weighted means, weighting by line length
or area size. I think it would be better to treat each line or area as a
discrete, unweighted unit, just like points/sites, unless some reason to
the contrary is given. It is probably more important to handle missing
data gracefully than to weight the means or other statistics, I think.
There may be reasons to weight sometimes, but most often I see ratios or
rates of two variables, rather than of a single variable and length or
area.
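
To make the distinction concrete, here is a small Python sketch comparing the two means on the same features. The function names and the three road segments are hypothetical, chosen only to show how far the results can drift apart.

```python
def unweighted_mean(values):
    """Each feature counts once, as with points/sites."""
    return sum(values) / len(values)


def length_weighted_mean(values, lengths):
    """Each feature counts in proportion to its length."""
    return sum(v * l for v, l in zip(values, lengths)) / sum(lengths)


# Three hypothetical road segments: attribute value and segment length.
values = [30.0, 50.0, 90.0]
lengths = [1.0, 1.0, 8.0]

print(unweighted_mean(values))                 # every segment is one unit
print(length_weighted_mean(values, lengths))   # the long segment dominates
```

Here the unweighted mean is about 56.7 while the length-weighted mean is 80.0, so which one is "right" depends on the question being asked of the data.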

Roger

--
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Breiviksveien 40, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 93 93
e-mail: Roger.Bivand@nhh.no

> I have written v.univar. I am not sure how to calculate statistics
> for lines and areas, does the code below make sense?

> Briefly, no. You are calculating weighted means, weighting by line
> length or area surface size. I think it would be better to treat each
> line or area as a discrete, unweighted, unit unless some reason to the
> contrary is given, just like points/sites. It is probably more
> important to handle missing data gracefully than weight the means or
> other statistics, I think. There may be reasons to weight sometimes,
> but most often I see ratios or rates of two variables, rather than of
> a single variable and length or area.

I was wondering if some practical examples could be given? The only
thing I could think of is comparing "ground covered in a day" lengths.

This may not be what you were going for, but I have need for something
to help with the "length of border" question. As this is a fractal
problem, the line length is best reported as a ratio (given that the
whole exists at the same resolution).

e.g.
the border with Spain covers x% of Portugal's total borders.

the part of the coastline contained within some vector area represents
x% of the overall coastline.
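
A ratio like the ones above could be sketched in Python as follows. The helper name length_share and the segment lengths are invented for illustration (they are not real border figures), and the sketch assumes all lengths were measured at the same resolution.

```python
def length_share(part_lengths, all_lengths):
    """Percentage of the total length made up by the selected segments."""
    total = sum(all_lengths)
    if total <= 0.0:
        raise ValueError("total length must be positive")
    return 100.0 * sum(part_lengths) / total


# Hypothetical border digitized as three segments; two of them are shared
# with one neighbour.
all_segments = [300.0, 200.0, 500.0]
shared_segments = [300.0, 200.0]

print(length_share(shared_segments, all_segments))  # share in percent
```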

The integral giving area converges to a real number though, so you could
give mean area + stdev of provinces in a country if you wanted (bad
example..).

In somewhat related matters, I'm planning on activating the C version of
r.univar soon, and for consistency changing r.series to use sample
variance instead of population variance. I think that with raster maps,
you only really have a whole population if you cover the entire planet..
???
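
For readers who want to see the difference, sample and population variance differ only in the divisor (n - 1 versus n). A short Python sketch using the standard library's statistics module, with made-up cell values:

```python
import statistics

# Hypothetical raster cell values.
cells = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(cells)
mean = sum(cells) / n
squared_deviations = sum((c - mean) ** 2 for c in cells)

population_variance = squared_deviations / n        # divide by n
sample_variance = squared_deviations / (n - 1)      # divide by n - 1

# The standard library gives the same two numbers.
print(population_variance, statistics.pvariance(cells))
print(sample_variance, statistics.variance(cells))
```

For large rasters the two values are nearly identical, so the choice matters mostly for small maps or small regions.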

This is still missing extended stats (quartiles, median, 10% trimmed
mean), hopefully someone can add that. I might keep the script version
around, renamed r.univar1, until we have that.
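
As a rough illustration of the extended statistics mentioned above, here is a Python sketch. It is not how r.univar computes them: quartile definitions vary between packages (the "inclusive" method of statistics.quantiles is just one choice), and "10% trimmed" is assumed here to mean dropping 10% of the values from each end. The function names and data are hypothetical.

```python
import statistics


def trimmed_mean(values, proportion=0.10):
    """Mean after dropping `proportion` of the sorted values from each end."""
    data = sorted(values)
    k = int(len(data) * proportion)  # number of values cut from each end
    kept = data[k:len(data) - k] if k else data
    return sum(kept) / len(kept)


def extended_stats(values):
    """Quartiles, median and 10% trimmed mean of a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "q1": q1,
        "median": statistics.median(values),
        "q3": q3,
        "trimmed_mean": trimmed_mean(values),
    }


# Hypothetical cell values with one outlier.
cells = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
print(extended_stats(cells))
```

The trimmed mean ignores the outlier (100.0) that would otherwise pull the plain mean far above the median.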

Hamish

On Friday 02 July 2004 19:52, Roger Bivand wrote:

On Fri, 2 Jul 2004, Radim Blazek wrote:
> I have written v.univar. I am not sure how to calculate statistics
> for lines and areas, does the code below make sense?

> Briefly, no. You are calculating weighted means, weighting by line length
> or area surface size. I think it would be better to treat each line or
> area as a discrete, unweighted, unit unless some reason to the contrary is
> given, just like points/sites.

Then v.univar and v.to.rast + r.univar will give completely different
results for the same data. Is that correct?

What should the 'unit' be: one geometry element in the map, or one category
(one record in the table)? Both fail in some cases, I think.
1) unit = geometry element
   One town, e.g. Bergen, is composed of several isolated areas
   (land + islands). All those areas share the same category and
   database record (town name, number of inhabitants). If I take each
   island (geometry element) as one 'unit' and calculate the mean number
   of inhabitants of the cities in Norway, the result is wrong, I think.
   The right approach in this case is to take one category as one unit.
2) unit = category
   Map of public lighting: each point is one light, but only two types
   of lights are installed, so there are only 2 categories and 2 records
   in the table (type, price). If I want the mean price of the installed
   lights and I use the category as the unit, the result is wrong again
   (the mean of 2 prices, not of all lights).
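
The two cases above can be reproduced with a small Python sketch. Each feature is a (category, value) pair, one per geometry element; the function names, populations and prices are all invented for illustration.

```python
def mean_per_geometry(features):
    """Every geometry element counts once."""
    values = [value for _, value in features]
    return sum(values) / len(values)


def mean_per_category(features):
    """Every category (table record) counts once, however many elements share it."""
    by_category = {}
    for category, value in features:
        by_category.setdefault(category, value)
    return sum(by_category.values()) / len(by_category)


# Case 1: town 1 is drawn as three areas (land + islands), town 2 as one.
# Populations are hypothetical.
towns = [(1, 200000.0), (1, 200000.0), (1, 200000.0), (2, 100000.0)]
print(mean_per_geometry(towns))   # counts town 1 three times
print(mean_per_category(towns))   # one value per town

# Case 2: five lights, but only two light types (categories) with prices.
lights = [(1, 100.0), (1, 100.0), (1, 100.0), (1, 100.0), (2, 200.0)]
print(mean_per_geometry(lights))  # mean price over installed lights
print(mean_per_category(lights))  # mean of the two list prices only
```

In the first case the per-category mean is the sensible one, in the second the per-geometry mean, which is exactly why neither works as the only definition of a unit.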

> It is probably more important to handle
> missing data gracefully than weight the means or other statistics, I
> think.

What precisely is 'missing data', and what is 'gracefully'?
Currently only non-NULL values are used in the calculation, and the
number of missing records and the number of NULL values are reported
at the end. Is that sufficient?
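
The behaviour described above might look roughly like this in Python. This is a sketch of the idea only, not the v.univar code: the function name univar_skip_nulls, the dict standing in for the attribute table, and None standing in for a database NULL are all assumptions.

```python
def univar_skip_nulls(categories, table):
    """Mean over non-NULL values, plus counts of what was skipped.

    categories: category of each feature in the map.
    table: dict mapping category -> value, with None standing in for NULL.
    """
    values = []
    missing_records = 0  # feature category has no record in the table
    null_values = 0      # record exists but the value is NULL
    for category in categories:
        if category not in table:
            missing_records += 1
        elif table[category] is None:
            null_values += 1
        else:
            values.append(table[category])
    mean = sum(values) / len(values) if values else None
    return {
        "n": len(values),
        "mean": mean,
        "missing_records": missing_records,
        "null_values": null_values,
    }


# Four hypothetical features: category 2 is NULL, category 4 has no record.
report = univar_skip_nulls([1, 2, 3, 4], {1: 10.0, 2: None, 3: 20.0})
print(report)
```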

> There may be reasons to weight sometimes, but most often I see
> ratios or rates of two variables, rather than of a single variable and
> length or area.

It could be an option:
1) unit=category (default?)
2) unit=geometry (default?)
3) weighted by area/length

Radim