[GRASS-dev] what is the ideal way to store spatial data

I don't know enough to comment on the math issues specifically, but would like to relate a conversation I had with John MacDonald of MacDonald Detweiler (a big Canadian company that makes ground link stations etc) while serving on an advisory panel to the US national remotely sensed data archive (it stores much of the Landsat data). I was pretty naive about what actually goes on in turning raw data as collected by a satellite into various products that we all end up using. I was just interested in having a land cover/use data set and was arguing for the archive storing such a data set. He made two points. The first was similar to the one you make, which is that any manipulation of raw data introduces artifacts. The second was that it will always be cheaper to do processing in the future than it is today. So the data should always be archived in raw form.

The result of his logic is that the archive does in fact store data in raw form, along with the operating characteristics of the satellite that collected it. A related recommendation he made, which has not been followed as far as I can tell, is that you should also archive the algorithms of the day (with a time stamp), so that you can recreate the products, which are what usually get used.

So, getting back to GRASS, it may be too much to ask of today's (and tomorrow's) CPUs to do processing on the fly. But I wouldn't want current processing constraints to be hard-wired into new versions of GRASS. Or at least I would encourage the developers to consider this issue. And I guess I would argue that the more usual user situation is one where the user knows less than the software, or at least less than the gurus who have written the software. I can guarantee that describes me!

Regards,

Jerry

---- Original message ----

Date: Tue, 1 Jan 2008 11:10:43 +0000
From: Glynn Clements <glynn@gclements.plus.com>
Subject: Re: [GRASS-dev] what is the ideal way to store spatial data
To: Gerald Nelson <gnelson@uiuc.edu>
Cc: grass-dev@lists.osgeo.org

Gerald Nelson wrote:

Since all spatial data are about describing a specific location on a
specific planet, usually earth, it would seem that the best way
conceptually to store data is with respect to a single easily defined
reference point such as the gravitational center of the planet. Any
location could then be measured with three values: x and y, like latitude
and longitude, and z, a distance measured from the reference point along
a ray.
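
(For illustration only, here is a minimal sketch of that idea; it is not
code from GRASS. It converts a geodetic latitude/longitude/height to
Earth-centred x, y, z, assuming the WGS84 ellipsoid; the constants and the
function name exist only for this example.)

/* Illustrative only -- not GRASS source.  Converts geodetic
 * latitude/longitude/height to Earth-centred Cartesian x, y, z,
 * assuming the WGS84 ellipsoid. */
#include <math.h>
#include <stdio.h>

#define WGS84_A 6378137.0               /* semi-major axis (metres) */
#define WGS84_F (1.0 / 298.257223563)   /* flattening               */

static void geodetic_to_geocentric(double lat_deg, double lon_deg, double h,
                                   double *x, double *y, double *z)
{
    double lat = lat_deg * M_PI / 180.0;
    double lon = lon_deg * M_PI / 180.0;
    double e2  = WGS84_F * (2.0 - WGS84_F);    /* first eccentricity squared */
    double n   = WGS84_A / sqrt(1.0 - e2 * sin(lat) * sin(lat));

    *x = (n + h) * cos(lat) * cos(lon);
    *y = (n + h) * cos(lat) * sin(lon);
    *z = (n * (1.0 - e2) + h) * sin(lat);
}

int main(void)
{
    double x, y, z;

    geodetic_to_geocentric(40.11, -88.23, 222.0, &x, &y, &z);  /* roughly Urbana, IL */
    printf("%.1f %.1f %.1f\n", x, y, z);
    return 0;
}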

Projections such as UTM, etc., are about how to convert the 3-D data
described above into 2-D with a minimum of distortion. Given the speed
of modern computers this conversion process ought to be increasingly
easy to do on the fly, as needed.

The reason I raise this question is to ask the experts whether it
would make sense (for 7.x) to think of a single standard way of
storing data in GRASS, with all operations doing the conversions
as necessary? There are (at least) two advantages to this. One is
standardization of data storage in a form that is closest to a true
representation of the real world. A second is to reduce the potential
for confusion/mistakes when data are shared and the metadata are not,
or are inadequate. I am continually getting access to data where the
units are not clearly defined. But even if they are defined, say, as
some UTM coordinate, there must be some measurement error built in.

Apart from wasting CPU time, conversion introduces error. Applying a
non-affine transformation to a regular grid (i.e. raster) doesn't
result in a regular grid. Applying a non-affine transformation to a
straight line doesn't result in a straight line. Any spatial
measurement which is constant for the original data (e.g. maximum
spatial error) will cease to be constant if the data is projected.
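
(As a toy illustration of the grid point, a sketch using the spherical
Mercator forward formula rather than anything from GRASS: grid rows that
are equally spaced in latitude come out unequally spaced after projection.)

/* Illustrative only.  Equally spaced latitudes are pushed through a
 * spherical Mercator forward projection; the output spacing grows with
 * latitude, so the regular grid does not stay regular. */
#include <math.h>
#include <stdio.h>

#define R 6371000.0     /* mean Earth radius (metres), spherical model */

static double mercator_y(double lat_deg)
{
    double lat = lat_deg * M_PI / 180.0;

    return R * log(tan(M_PI / 4.0 + lat / 2.0));
}

int main(void)
{
    double prev = mercator_y(0.0);
    int lat;

    for (lat = 10; lat <= 60; lat += 10) {
        double y = mercator_y((double) lat);

        printf("lat %2d: y = %10.0f m, step = %8.0f m\n", lat, y, y - prev);
        prev = y;
    }
    return 0;
}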

All things considered, the optimum form in which to store the data is
the form in which the user chooses to store it. There will always be
factors of which the user is aware but the software isn't.

--
Glynn Clements <glynn@gclements.plus.com>

Gerald Nelson
Professor, Dept. of Agricultural and Consumer Economics
University of Illinois, Urbana-Champaign
office: 217-333-6465
cell: 217-390-7888
315 Mumford Hall
1301 W. Gregory
Urbana, IL 61801

Gerald Nelson wrote:

I don't know enough to comment on the math issues specifically, but
would like to relate a conversation I had with John MacDonald of
MacDonald Detweiler (a big Canadian company that makes ground link
stations etc) while serving on an advisory panel to the US national
remotely sensed data archive (it stores much of the Landsat data). I
was pretty naive about what actually goes on in turning raw data as
collected by a satellite into various products that we all end up
using. I was just interested in having a land cover/use data set and was
arguing for the archive storing such a data set. He made two points.
The first was similar to the one you make, which is that any
manipulation of raw data introduces artifacts. The second was that it
will always be cheaper to do processing in the future than it is
today. So the data should always be archived in raw form.

The result of his logic is that the archive does in fact store data in
raw form, along with the operating characteristics of the satellite
that collected it. A related recommendation he made, which has not
been followed as far as I can tell, is that you should also archive
the algorithms of the day (with a time stamp), so that you can
recreate the products, which are what usually get used.

However, when to re-create and when to re-use isn't something which
the software can determine. E.g. if you are analysing trends, it is
important that all samples are processed consistently. If the older
samples were produced using inferior algorithms, you need to use the
same (inferior) algorithms for the newer samples.

If you have access to the original data, you could re-process that
using newer algorithms. But you might be producing data with the
expectation that others will be performing the analysis. In that
situation, you need to consider whether it's better to simply publish
new data consistent with older data, or to revise the older data. If
the user has already published results based upon the original data,
consistency may be more important.

So, getting back to GRASS, it may be too much to ask of today's (and
tomorrow's) CPUs to do processing on the fly. But I wouldn't want
current processing constraints to be hard-wired into new versions of
GRASS. Or at least I would encourage the developers to consider this
issue.

So far as CPU usage is concerned, conversion will always consume CPU
time which could have been used for something else, so you don't want
to perform unnecessary conversions. In particular, you don't want to
perform a specific conversion more than once.

Of the various processes which GRASS can perform, projection is one of
the most CPU intensive (and also memory-intensive, as it can't
generally be done row-by-row).
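
(A toy sketch of the access pattern, with a simple rotated coordinate
frame standing in for a real inverse projection; the column count and
angle are arbitrary. It shows why a single output row can need input
cells from many input rows, so the input can't be streamed one row at a
time.)

/* Illustrative only.  For one row of the output raster, see which rows
 * of the input raster its cells come from.  The "inverse projection"
 * here is just a 30-degree rotation -- a stand-in, not a real map
 * projection -- but the access pattern is the point. */
#include <math.h>
#include <stdio.h>

#define NCOLS 1000

int main(void)
{
    double a = 30.0 * M_PI / 180.0;
    int out_row = 0, out_col;
    double min_row = 1e30, max_row = -1e30;

    for (out_col = 0; out_col < NCOLS; out_col++) {
        /* map the output cell back into input raster coordinates */
        double in_row = -out_col * sin(a) + out_row * cos(a);

        if (in_row < min_row) min_row = in_row;
        if (in_row > max_row) max_row = in_row;
    }
    printf("output row %d needs input rows %.0f to %.0f\n",
           out_row, min_row, max_row);
    return 0;
}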

As CPUs get faster, the extra CPU power will often get used to perform
equivalent processing on higher-resolution data, rather than on more
complex processing. In that situation, the proportion of time spent on
projection will remain constant regardless of CPU speed.

And I guess I would argue that the more usual user situation is
one where the user knows less than the software, or at least less than
the gurus who have written the software. I can guarantee that describes me!

I wouldn't assume that this is the usual case. GRASS isn't a
word-processor or a paint program. It's targeted at users with a
certain level of skill.

In particular, I would expect most GRASS users to know a lot more
about geography and geographical sciences than I do (I dropped
geography at school at age 14; most of my knowledge has been acquired
while working on GRASS).

--
Glynn Clements <glynn@gclements.plus.com>