RE: [GRASS-user] Large vector files

Michael and Jonathan,

I would _highly_ recommend trying r.in.xyz if you have not already done so,
especially with LIDAR and other forms of remotely-sensed data. I've had good
success with it. Note there is also a parameter in r.in.xyz to control how
much of the input map to keep in memory, allowing you to run the data import
in multiple passes.
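
A typical invocation looks something like this (just a sketch; the file
name, column numbers, separator and percent value are assumptions to adjust
for your own data):

  # bin a comma-separated x,y,z point cloud into the current region,
  # averaging all points that fall in each cell; percent=10 holds only 10%
  # of the output grid in memory at a time, reading the input in ten passes
  r.in.xyz input=lidar_points.csv output=lidar_mean method=mean fs=, \
           x=1 y=2 z=3 percent=10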

Is it imperative that your data be imported as vector?

~ ERIC.

-----Original Message-----
From: grassuser-bounces@grass.itc.it
To: Jonathan Greenberg
Cc: GRASS Users Users
Sent: 10/6/2006 3:03 AM
Subject: Re: [GRASS-user] Large vector files

Jonathan,

I feel your pain. I'm one of those lidar users and our library has
just passed 100,000 km^2 collected at 1-2 points/m^2. Data management
is a real nightmare and, as far as I've seen, the commercial vendors
fail to deal with the problem. I'm pretty new to GRASS, but it,
combined with GMT, appears to offer a far more appealing solution.
Right now I've just been experimenting with everything at a very
superficial level, but I'll share what I've learned, although it is
biased toward working with lidar data.

-On my MacBook Pro (2 GB of RAM and lots of swap), v.in.ascii chokes
at around the 5 million point level (with topology building).

-Without topology I have no issues importing as many as 20 million
points, but it again choked when I tried another file with 100 million
points. However, the error I received was not a memory allocation
error. I never dug any further into the problem once I discovered
how slowly v.surf.rst ran.

-I've had really positive experiences working with the GMT programs
surface and triangulate. surface generated a grid that was comparable
with the one from v.surf.rst but was 2 orders of magnitude faster;
triangulate was 3 orders of magnitude faster.

-I've found that it is quite easy to write scripts that automatically
break the tasks up into smaller "tiles". Better yet, you can use an
idea posted earlier by Hamish (many thanks! :-)) to parallelize the
computations. Or at least I have been able to with GMT (I think the
way GRASS handles regions is going to cause me grief when multiple
threads are trying to work with different sub-regions... any thoughts?)

-But maybe the most important conclusion I've come to for working
with really large data sets is that files are not the way to go and
that a database serving the application manageable chunks of data is
a better option. Then again, I really don't know too much about
databases so I could be totally wrong on that one. Anyone have any
experience working with lidar through databases?

Cheers,

Mike

On 5-Oct-06, at 5:29 PM, Jonathan Greenberg wrote:

I wonder (and I'm thinking out loud here) if there are ways to "tile"
vector processes in an analogous (if not algorithmic) way to how we deal
with massive raster datasets? Are the issues I'm running into fundamentally
something with older file formats, operating system/file system
limitations, algorithmic maturity, or some mixture of all of these things?
As you pointed out, the Lidar community seems to have the most pressing
need for these issues to get sorted out -- however, as GIS analyses get
more advanced and require more data, I'm guessing the average user may run
into this as well.

On a related note, apparently ESRI may be releasing a new version of their
geodatabase format to get around some of the filesize issues in their 9.2
release (the beta apparently has this functionality). No word on whether it
a) works or b) has algorithmic advances to deal with these DBs...

--j

On 10/5/06 4:16 PM, "Hamish" <hamish_nospam@yahoo.com> wrote:

Jonathan Greenberg wrote:

Case in point: I just got this error on a v.in.ascii import of a
~200 MB CSV file with points:

G_realloc: out of memory (I have 4 GB of RAM and plenty of swap space,
and the program never hit that limit anyway).

The vector format has a small but finite memory overhead for each
feature, which makes more than several million data points impractical.

To get around this, v.in.ascii (and a couple of other modules) lets you
load in vector data without building topology (v.in.ascii -b -t).

Then it's unknown how many points you can load, but it's a lot.
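
For example, something along these lines (a rough sketch; the file name,
separator and column positions are assumptions for a comma-separated
x,y,z file):

  # import 3D points, skipping topology (-b) and the attribute table (-t)
  v.in.ascii -zbt input=points.csv output=lidar_pts format=point fs=, \
             x=1 y=2 z=3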

Without topology, about the only thing you can do with the data is run it
through v.surf.rst.

For multi-gigabyte x,y,z datasets (or x,y,f(x,y) just as well), you can use
r.in.xyz to bin it directly into a raster map.

see:
  http://grass.ibiblio.org/grass63/manuals/html63_user/r.in.xyz.html
  http://hamish.bowman.googlepages.com/grassfiles#xyz

With regard to the vector library and LFS support, I think you can expect
some "first user" problems; Radim commented on this some time ago on the
mailing lists, so you'll have to search there for a better answer.

Hamish

--
Jonathan A. Greenberg, PhD
NRC Research Associate
NASA Ames Research Center
MS 242-4
Moffett Field, CA 94035-1000
Office: 650-604-5896
Cell: 415-794-5043
AIM: jgrn307
MSN: jgrn307@hotmail.com

Unfortunately, I was hoping to work in a vector environment with the data.
I'm sure I could think up raster analogs to the analyses I'm trying to do
right now, but as Michael pointed out earlier, this is a problem that does
need to be solved. Lidar, in particular, is getting more popular, and
software support remains primitive. While my problem is not a Lidar one, it
has the same underlying issue: I need to be able to create and manipulate
massive vector files.

I am hearing a lot of suggestions about using things like PostGIS and
PostgreSQL here and elsewhere, but I am a total novice at this -- is there a
"dummy's guide" to working with these DBs instead of shapefiles? I'd like to
be able to use all of the various GRASS vector commands that already exist
on a "large vector" (what would the format be called?). I'm noticing there's
a lot more setup involved in getting a DB running, and the process has yet
to be streamlined. Is it possible to simply substitute some Postgres-driven
vector DB for a GRASS vector in the GRASS algorithms, or do the v.[whatever]
algorithms need to be reworked to support this?

--j

Jonathan Greenberg wrote:

Is it possible to simply substitute some Postgres-driven vector DB for a
GRASS vector in the GRASS algorithms, or do the v.[whatever] algorithms
need to be reworked to support this?

You can use any supported database system to store vectors. Use
db.connect to select which database to use.

After DBF, SQLite is the next easiest to use, as it doesn't involve a
separate server process. Unlike the DBF driver, SQLite will normally
have been built with large file support.
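
Switching the attribute back end to SQLite is typically a one-liner, e.g.
(a minimal sketch, assuming your GRASS build includes the SQLite driver;
the path shown is just the conventional per-mapset location):

  # store attribute tables for newly created maps in a per-mapset SQLite file
  db.connect driver=sqlite database='$GISDBASE/$LOCATION_NAME/$MAPSET/sqlite.db'
  db.connect -p   # print the current settings to verify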

Client-server database systems normally require a non-trivial amount
of setup before you can use them. E.g. installing the software,
setting the daemon to start on boot, creating a database, configuring
access to that database, etc.

However, none of this will help if you are running into memory limits
(because the module tries to keep the entire vector map in memory)
rather than file size limits.

I'm not particularly familiar with the vector side of GRASS, but it
appears that whether or not the map includes topology information has
a significant effect upon the maximum size. Maps which lack topology
information are less likely to cause problems than those which have
it.

--
Glynn Clements <glynn@gclements.plus.com>

Jonathan:

Contact me if you would like some tips on PostGIS; I use it all the time
for massive soil-survey-based analysis.

Cheers,

Dylan

--
Dylan Beaudette
Soils and Biogeochemistry Graduate Group
University of California at Davis
530.754.7341

On Fri, October 6, 2006 22:53, Glynn Clements wrote:

You can use any supported database system to store vectors. Use
db.connect to select which database to use.

No, db.connect tells you which database to use for the link between your
GRASS vector maps and attribute tables, not where to store the actual
vector geometries, which are normally stored internally.

From what I have gathered from your mails, your problem does not seem to be
the size of the attribute table, nor necessarily a large file problem, but
rather the size of the geometry file (number of points); thus the v.in.ascii
error. The memory problem here comes from the building of topology where, as
Hamish explained, there is a memory overhead which becomes significant with
large numbers of features.

If you want to use a vector format other than the internal GRASS format,
you can try to use v.external on a shapefile or a PostGIS table. But I don't
know whether that solves the memory issues if you need topology for your
operations.
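
For instance, something like this (a rough sketch; the connection string,
table name and output name are made-up placeholders):

  # link a PostGIS table into the current mapset without importing it
  v.external dsn="PG:host=localhost dbname=forest user=jgreenberg" \
             layer=tree_crowns output=tree_crowns_ext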

PostGIS also contains a series of vector operations, so you might be able
to do what you want to do without going through GRASS.

Maybe you could tell us what exactly you want to do...

I'm not particularly familiar with the vector side of GRASS, but it
appears that whether or not the map includes topology information has
a significant effect upon the maximum size. Maps which lack topology
information are less likely to cause problems than those which have
it.

The latest discussions on this went in the direction of re-enabling the
spatial index on file, instead of in memory. This should solve the topology
issue, but it still needs to be done (Radim gave a rough roadmap of how to
do it).

None of what I say above answers your question about large file support,
though...

Moritz

Eric Patton wrote:

I would _highly_recommend trying r.in.xyz if you have not already done
so. Especially with LIDAR and other forms of remotely-sensed data.
I've had good success with it. Note there is also a parameter in
r.in.xyz to control how much of the input map to keep in memory,
allowing you to run the data import in multiple passes.

note the r.in.xyz memory parameter is to help with massive RASTER
regions, nothing directly to do with the size of the input file.
The input file is not kept in memory! The output raster grid is.

I am always looking for feedback on how r.in.xyz goes with massive input
data. (>2gb? >4gb?)

Michael Perdue wrote:

-But maybe the most important conclusion I've come to for working
with really large data sets is that files are not the way to go and
that a database serving the application manageable chunks of data is
a better option.

I have not yet met a dataset that .csv + awk couldn't handle in an
efficient way. Simple, fast, no bells and whistles to cause problems.
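
For example, quick summary statistics on a multi-gigabyte CSV need nothing
more than a one-liner (a sketch, assuming comma-separated x,y,z with z in
the third column and no header row):

  # min, max and mean of the z column, streaming through the file once
  awk -F, 'NR==1 {min=$3; max=$3}
           {if ($3 < min) min=$3; if ($3 > max) max=$3; sum += $3}
           END {print "z min:", min, "max:", max, "mean:", sum/NR}' points.csv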

JG:

Unfortunately, I was hoping to work in a vector environment with the
data -- I'm sure I could think up raster analogs to the analyses I'm
trying to do right now,

I am interested to learn of a form of processing that couldn't be handled
by SQL query+expression, awk pre-processing, or by a raster analog.

Is your need something that the GRASS 5 sites format could handle, or
something more sophisticated? (The GRASS 5 sites format is just a text
file, as big as UNIX can handle.)

JG:

I am hearing a lot of suggestions about using things like PostGIS and
PostGRESQL here and elsewhere,

PostGIS seems to be widely recommended for massive datasets..
(tip: PostGIS is just PostgreSQL with a plugin)

JG:

is there a "dummy's guide" to working with these DB instead of
shapefiles?

there is this help page, but no tutorial I know of:
http://grass.ibiblio.org/grass63/manuals/html63_user/databaseintro.html

Probably there are lots of generic DB tutorials out there. Maybe this would
make for a nice OSGeo doc project, as this is not a GRASS-specific need.

Dylan Beaudette wrote:

contact me if you would like some tips on PostGIS, I use it all the
time for massive soil survey based analysis.

A "from scratch" tutorial on setting this up would make a /very/ nice
help page in the GRASS wiki or a GRASSNews article.

JG:

Is it possible to simply substitute some postgres driven vector DB for
a GRASS vector in the GRASS algorithms, or do the v.[whatever]
algorithms need to be reworked to support this?

GRASS vector coordinate info needs to be in GRASS vector format (or
live-translated with v.external). GRASS vector attributes are stored in
the DB of your choice. v.* modules don't care what DB you are using
(they do pass through limitations of the selected DB though [e.g. DBF
column name length]).

If your (large) data is just x,y,z (or x,y,[value]), it is probably best
to skip creating an empty attribute table; there's no need for it.

If you want to access a large dataset without importing to GRASS, use
v.external. I notice it can use these drivers: "..,CSV,Memory,..".
(What's "memory"?) Restating something Moritz has mentioned, I suspect
you'll encounter problems when GRASS tries to build a map which is a
derivative of your external data (creating a new massive GRASS vector).

in summary,

The best method I can suggest for more than 3 million points is PostGIS
or .csv+awk for storage and extraction; v.external for simple "GIS"
cartography of the existing dataset; and r.in.xyz when you want to
really use the dataset as a whole (upon transition from raw to
generalized data needs). And skip using an attribute table if you don't
need one.

I don't know the best way to pass the data to GRASS's R-stats interface.
(GRASS 5 sites format? directly (skipping GRASS)?)

Hamish

Hamish wrote:

> I would _highly_recommend trying r.in.xyz if you have not already done
> so. Especially with LIDAR and other forms of remotely-sensed data.
> I've had good success with it. Note there is also a parameter in
> r.in.xyz to control how much of the input map to keep in memory,
> allowing you to run the data import in multiple passes.

note the r.in.xyz memory parameter is to help with massive RASTER
regions, nothing directly to do with the size of the input file.
The input file is not kept in memory! The output raster grid is.

I am always looking for feedback on how r.in.xyz goes with massive input
data. (>2gb? >4gb?)

r.in.xyz doesn't use LFS, so it will be limited to 2Gb on 32-bit
systems (any system where "long" is 32 bits). As it uses ANSI stdio
functions (including ftell/fseek), extending it to support large files
would be non-trivial.

--
Glynn Clements <glynn@gclements.plus.com>

On Sun, 2006-10-08 at 13:26 +0100, Glynn Clements wrote:

r.in.xyz doesn't use LFS, so it will be limited to 2Gb on 32-bit
systems (any system where "long" is 32 bits). As it uses ANSI stdio
functions (including ftell/fseek), extending it to support large files
would be non-trivial.

Attached is a quick patch to enable LFS. It's "poorly" implemented with
fseeko/ftello, so I'm not sure if I should commit it.

--
Brad Douglas <rez touchofmadness com> KB8UYR
Address: 37.493,-121.924 / WGS84 National Map Corps #TNMC-3785

(attachments)

r.in.xyz.pat (2.32 KB)

Hamish, this was a great post, thanks! I want to give some examples of what
I'd like to do with this data to make it clearer why I think a vector
environment that can handle massive vectors seems to be a requirement (and
not trying a raster analog)...

Remember that the base dataset is a set of points with a radius parameter,
representing the positions and sizes of tree crowns, e.g. X, Y, crown
radius. We often work with "management polygons" for US Forest Service
applications, which are the units of management and the base data layer to
be analyzed (on the scale of many hectares, so it's a much smaller coverage
to work with) -- so we want to create summary stats based on our tree
points at the scale of the management polygons:

1) What management polygon does each tree belong to (spatial join between
the massive points layer and the management polygon layer)? What is the
tree count per polygon? What is the distribution of sizes of trees in each
polygon?
2) What is the tree cover within a polygon -- at first glance you'd think
I'd just convert the radius to area, and sum all areas from the previous
step for a given management polygon -- but tree crowns can overlap and the
overlapping area does NOT get counted twice -- so we need to do a spatial
dissolve on a BUFFERED set of tree POLYGONS (we can't work with points),
and then a spatial clip based on the management polygon layer, so if any
trees are partially in one poly and partially in the other, we deal with
that.
3) What is the distance from every tree to the nearest tree and, at a
management polygon level, what is the distribution of these minimum-tree
distances (this is relevant for fire ecology work)?

These are all classic vector problems, with the added issue that I'm dealing
with > 7 million trees.

--j

Brad Douglas wrote:

Attached is a quick patch to enable LFS. It's "poorly" implemented with
fseeko/ftello, so I'm not sure if I should commit it.

It shouldn't be committed.

fseeko/ftello aren't available on all platforms. That's why I said
that fixing it is non-trivial; you have to first check that
fseeko/ftello are available, and only use them if they are.

--
Glynn Clements <glynn@gclements.plus.com>

Jonathan Greenberg wrote:

1) What management polygon does each tree belong to (spatial join between
the massive points layer and the management polygon layer)? What is the
tree count per polygon? What is the distribution of sizes of trees in each
polygon?

Ok, this is the key -- step 1 is to crop the data to your region of
interest. After that, (presumably) fewer than several million points remain
and you can use the vector engine without further problems. Then just
repeat for each management polygon.

So what is needed is a point-in-polygon pre-filter.

I can see a couple of ways to do this; the easiest is to find the extent
of the management polygon (v.extract + "g.region vect=" or "v.info -g")
and only import values within that range. Then, if the polygon isn't just
the region rectangle, you can use v.select on the cropped point dataset
to refine it.

the pre-filter:
* a simple awk script: if(x<Max && x>=Min), ... (see the sketch below)

* add a "-r" flag to v.in.ascii to only import points falling within
the current region. (pretty easy) [like "s.univar -a" from GRASS 5, but
opposite]

* add a "spatial=" option to v.in.ascii to only import points falling
within the defined region. (pretty easy) [like "v.in.ogr spatial="]

* add a "vect_mask=" option to v.in.ascii to only import points falling
within a vector map's area polygons. (harder) [use Vect_point_in_area()]
I can think of a few optimizations, like performing a rough bounding box
check before the expensive point-in-polygon check..

As the last method could be done in a v.select step, I'm less inclined
to worry about it unless non-rectangular input masks are needed that
can't be dealt with by a few "v.in.ascii -r" + v.select + v.patch
steps.
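
The awk variant of the pre-filter could look something like this (a minimal
sketch; the separator, column order and bounding-box numbers are assumptions
to replace with the real extent from "g.region -g" or "v.info -g"):

  # keep only the points inside the management polygon's bounding box
  west=490000 ; east=495000 ; south=6200000 ; north=6205000
  awk -F, -v w=$west -v e=$east -v s=$south -v n=$north \
      '$1 >= w && $1 < e && $2 >= s && $2 < n' trees.csv > trees_crop.csv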

2) What is the tree cover within a polygon -- at first glance you'd think
I'd just convert the radius to area, and sum all areas from the previous
step for a given management polygon -- but tree crowns can overlap and the
overlapping area does NOT get counted twice -- so we need to do a spatial
dissolve on a BUFFERED set of tree POLYGONS (we can't work with points),
and then a spatial clip based on the management polygon layer, so if any
trees are partially in one poly and partially in the other, we deal with
that.

so:

g.region vect=management_polygon
# (a v.in.region or v.extract step could be useful for later?)
# expand the region slightly so out-of-region tree centers aren't missed;
# "r.in.xyz -s" can give you max_radius
g.region n=n+max_radius s=s-max_radius e=e+max_radius w=w-max_radius
# if management_polygon isn't a rectangle use
# v.buffer buffer=max_radius ; g.region vect=buffered_boundary
v.in.ascii -r ...            # the proposed -r region-crop flag
v.select ...                 # keep only trees inside buffered_boundary
v.buffer buffcol=radius ...  # grow tree crowns (or v.buffer + v.patch per tree)
v.overlay ...                # overlay grown tree areas with the original
                             # management polygon

3) What is the distance from every tree to the nearest tree and, at a
management polygon level, what is the distribution of these
minimum-tree distances (this is relevant for fire ecology work)?

v.distance, etc.
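
For question 3, something along these lines might do it (a sketch, not
tested; the dmin= option may only exist in newer v.distance builds and is
assumed here as a way to skip the zero-distance match of each tree with
itself, so check your version's manual first):

  # add a column to hold the result, then upload each tree's distance to
  # its nearest neighbouring tree
  v.db.addcol map=trees columns="nearest_tree_dist double precision"
  v.distance from=trees to=trees upload=dist column=nearest_tree_dist dmin=0.01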

These are all classic vector problems, with the added issue that I'm
dealing with > 7 million trees.

Once you have filtered down the 7m trees to something workable, the rest
is just a matter of using the classical vector modules.

This doesn't help with vector large file support (being worked on
separately), but 7 million x,y,radius data points shouldn't come
anywhere near 2gb.

Hamish

Really? Does this apply to the input ASCII data set, the output raster, or
both? The sample data set I've been working with is roughly 92 million
points and is 2.5 GB in size. I don't recall r.in.xyz complaining when I fed
it the file. The output raster was considerably smaller though (around 10
million cells).

Mike

Michael Perdue wrote:

Really? Does this apply to the input ASCII data set, the output
raster, or both?

It always applies to the input file, and also to the output file if
GRASS wasn't configured using --enable-largefile.
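
(For a source build, that means configuring along these lines before
compiling; an illustration only, with your usual other options added:)

  # rebuild GRASS with large file support enabled
  ./configure --enable-largefile
  make && make install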

--
Glynn Clements <glynn@gclements.plus.com>

Hamish wrote:

the pre-filter:
* simple awk script if(x<Max && x>=Min), ...

* add a "-r" flag to v.in.ascii to only import points falling within
the current region.

Hi,

here is a quick prototype of a "-r" flag. If it is useful we can clean
it up and add it to CVS.

http://bambi.otago.ac.nz/hamish/grass/r.in.xyz/v.in.ascii_RegionCrop.tgz

* use g.region to set region of interest first
    (use "g.region vect=" or "v.info -g" ?)
* "r.in.xyz -s" is useful for scanning data file's extent
* DDD:MM:SS input should be ok, but untested
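
Usage would be along these lines (a sketch; the map and file names are
placeholders, and the -r flag only exists with the patch above):

  g.region vect=mgmt_polygons      # set the region of interest first
  v.in.ascii -r -bt input=trees.csv output=trees_roi fs=, x=1 y=2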

Hamish

n.b. The v.in.ascii code is a real mess! I feel bad about grafting yet
another feature into the pile.