[GRASS-dev] vector libs: file based spatial index

Moritz wrote:

The largest file I have used is about 125000 areas with a
topo file weighing 42M, so taking your worst estimation,
this would mean around 200MB of spatial index, which is
still largely acceptable for me.

lidar and swath bathymetry data will easily have millions of points,
and as time goes on this will only expand. I seem to recall that one of
Radim's big disappointments was that the need to handle this technology/
data density only really became apparent just when GRASS's new vector
engine was nearing completion. With some earlier notice it could have
been designed to scale better. Still, there is much tuning which can
be done with the present model to reduce the memory overheads, etc.

FWIW the sites type (now vector points) in GRASS 4/5 scales well, just
as much as you can fit in the text file. (not sure if fseeks are 64bit-
proof there, probably not)

the biggest lidar file used that I know about is Doug's 379GB dataset
(14.5 billion points). The vector engine couldn't handle that* so r.in.xyz
was used. Certainly count on 5 million features with topology and DB table
for dataset sizes /today/.

* I don't know what limitation there is if imported without topology+
DB table.

In future memory, CPU, and HD sizes will only increase, but one thing I've
come to respect is that GRASS's raster modules scale so well today because
they were designed to function in the days of extremely tight memory and
CPU constraints.

you might look at libLAS (for lidar data -- an OSGeo semi-affiliated
project: http://liblas.org/). It is my understanding that Howard is
currently adding spatial index support in the development version.
You might check out his approach.

I have been, and still am ignorant of what advantage a spatial index
gives you for point data. ... interested to learn why "topology" would
be useful for points-only data.

In general I'm fairly happy with the no-topology solution for lidar
data in grass, but a few targeted modules (eg v.info) really need to
be modified to deal with them.

Hamish

ps- we still need to hunt through the archives for Radim's posts on these
issues which explain quite a bit.

Hamish wrote:

Moritz wrote:
  

The largest file I have used is about 125000 areas with a
topo file weighing 42M, so taking your worst estimation,
this would mean around 200MB of spatial index, which is
still largely acceptable for me.
    
lidar and swath bathymetry data will easily have millions of points,
and as time goes on this will only expand. I seem to recall that one of
Radim's big disappointments was that the need to handle this technology/
data density only really became apparent just when GRASS's new vector
engine was nearing completion. With some earlier notice it could have
been designed to scale better. Still, there is much tuning which can
be done with the present model to reduce the memory overheads, etc.
  

Yes. As an example, for a 2D point dataset, the topo file should be
about 4 times as large as the coor file, same for the spatial index.
This is because each x,y coordinate pair is stored 3 times in the topo
file, plus some other information that is for points not needed, e.g.
area/isle to the left and to the right, start node and end node (start
node = end node for points/centroids). Each x,y coordinate pair is
stored 2 times in the spatial index (rectangle of size zero with N S E W
and N = S, E = W). I see some potential for cleaning up.
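
To put rough numbers on the estimate above, here is a back-of-the-envelope sketch in Python. It counts only the coordinate copies named in this message (one in coor, three in topo, two in the spatial index). The 8 bytes per double and the helper names are assumptions for illustration, not the actual on-disk layout. Real files carry additional per-feature information, so these figures are lower bounds.

```python
BYTES_PER_DOUBLE = 8  # assumption: each coordinate is stored as an 8-byte double


def coordinate_bytes(n_points, copies, dims=2):
    """Bytes needed to hold `copies` copies of each point's coordinates."""
    return n_points * copies * dims * BYTES_PER_DOUBLE


def point_as_rect(x, y):
    """A point as a zero-size rectangle (N = S, E = W): x and y each appear twice."""
    return {"n": y, "s": y, "e": x, "w": x}


n = 5_000_000  # the "5 million features" figure mentioned earlier in the thread
coor = coordinate_bytes(n, copies=1)  # one copy in the coor file
topo = coordinate_bytes(n, copies=3)  # three copies in the topo file
sidx = coordinate_bytes(n, copies=2)  # two copies in the spatial index

for name, size in (("coor", coor), ("topo", topo), ("sidx", sidx)):
    print(f"{name}: at least {size / 1e6:.0f} MB of raw coordinates")
```

The same function can be called with the 14.5 billion points of the large lidar dataset to see why building topology for it is out of reach.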

FWIW the sites type (now vector points) in GRASS 4/5 scales well, just
as much as you can fit in the text file. (not sure if fseeks are 64bit-
proof there, probably not)
  

I guess that was without topo?

the biggest lidar file used that I know about is Doug's 379GB dataset
(14.5 billion points).

Frightening.

you might look at libLAS (for lidar data -- an OSGeo semi-affiliated
project: http://liblas.org/). It is my understanding that Howard is
currently adding spatial index support in the development version.
You might check out his approach.
  

Will do.

I have been, and still am ignorant of what advantage a spatial index
gives you for point data. ... interested to learn why "topology" would
be useful for points-only data.
  

Strictly speaking, topology and spatial index are two different things,
you could have a spatial index without topo. I can also not see the
usefulness of topology for point data. A spatial index may be useful to
extract a subset (v.select), but in this case you could just as well go
through the points in the coor file, read one at a time and select the
ones that fall into the study area. Should be slower than with a spatial
index but then you're not dragging along the spatial index.
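
The index-free alternative described above can be sketched in a few lines of Python. This is a generic sequential scan, not GRASS library code; the function name and the (x, y) tuple layout are assumptions for illustration. It touches every point once and keeps only one point in memory at a time.

```python
def select_in_box(points, west, south, east, north):
    """Yield the (x, y) points inside the box, reading one point at a time."""
    for x, y in points:
        if west <= x <= east and south <= y <= north:
            yield (x, y)


# Any iterable works, e.g. a generator that reads points from a file.
pts = [(0.5, 0.5), (2.0, 3.0), (1.0, 1.0), (-1.0, 0.2)]
inside = list(select_in_box(iter(pts), west=0.0, south=0.0, east=1.0, north=1.0))
print(inside)
```

The cost is linear in the number of points for every query, whereas a spatial index pays its cost once at build time and in the space it occupies.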

In general I'm fairly happy with the no-topology solution for lidar
data in grass, but a few targeted modules (eg v.info) really need to
be modified to deal with them.

Hamish

ps- we still need to hunt through the archives for Radim's posts on these
issues which explain quite a bit.
  

I remember one comment where he said that the spatial index is not
written out because of time and space concerns. Space should not be an
issue today, and opening an old vector is faster if the spatial index is
available in a file. Of course I would like a solution that needs less
memory and is faster when modifying a spatial index, but I have not the
faintest idea how to do that. Maybe Paul Kelly's tip on memory mapping
can help.
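
As a minimal sketch of what memory mapping could look like, the Python below maps a file of fixed-size rectangle records and reads one record by position without loading the whole file. The record layout (four doubles: N, S, E, W) and the file name are hypothetical. This is not the GRASS spatial index format, only an illustration of the technique.

```python
import mmap
import os
import struct
import tempfile

RECT = struct.Struct("<4d")  # hypothetical record: N, S, E, W as little-endian doubles


def write_rects(path, rects):
    """Write rectangles back to back as fixed-size binary records."""
    with open(path, "wb") as f:
        for r in rects:
            f.write(RECT.pack(*r))


def read_rect(mm, i):
    """Read record i straight from the mapping; the OS pages data in on demand."""
    return RECT.unpack_from(mm, i * RECT.size)


path = os.path.join(tempfile.mkdtemp(), "sidx_demo.bin")
write_rects(path, [(1.0, 1.0, 2.0, 2.0), (5.0, 4.0, 7.0, 6.0), (9.0, 9.0, 3.0, 3.0)])

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        second = read_rect(mm, 1)
print(second)
```

With a mapping, only the pages that are actually touched need to be in memory, which is the property that makes the approach interesting for a large index.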

Markus M

>> the biggest lidar file used that I know about is Doug's 379GB dataset
>> (14.5 billion points).
>Frightening.

The above dataset was for two watersheds collected in 2001; the larger of the two watersheds is 9000 square miles. I’ve recently been working with newer lidar data (2007) from a single county with an area of 744 sq. miles (Craven County, North Carolina, USA). This county had lidar flown at a submeter posting (0.7m? as a guess). The aggregated ASCII x,y,z,intensity file that I created (for processing using r.in.xyz) from the input las files for that single county was 80 GB.

I guess my point is that lidar datasets are getting quite massive. If we are going to be working with the lidar data as point data in the GRASS vector framework, go with the most scalable options. Scalability in working with large data sets is a huge benefit in using GRASS over other solutions.

Doug

Doug Newcomb
USFWS
Raleigh, NC
919-856-4520 ext. 14 doug_newcomb@fws.gov

The opinions I express are my own and are not representative of the official policy of the U.S. Fish and Wildlife Service or Dept. of Interior. Life is too short for undocumented, proprietary data formats.

Markus GRASS <markus.metz.giswork@googlemail.com>
Sent by: grass-dev-bounces@lists.osgeo.org
06/25/2009 03:21 AM
To: Hamish <hamish_b@yahoo.com>
cc: GRASS devel <grass-dev@lists.osgeo.org>
Subject: Re: [GRASS-dev] vector libs: file based spatial index

_______________________________________________
grass-dev mailing list
grass-dev@lists.osgeo.org
[http://lists.osgeo.org/mailman/listinfo/grass-dev](http://lists.osgeo.org/mailman/listinfo/grass-dev)


Doug_Newcomb@fws.gov wrote:

I guess my point is that lidar datasets are getting quite massive. If
we are going to be working with the lidar data as point data in the
GRASS vector framework, go with the most scalable options. Scalability
in working with large data sets is a huge benefit in using GRASS over
other solutions.

For the time being, the only reasonable way to deal with these massive
datasets is to *not* build topology. It's not only the spatial index
that is getting out of hand, also topology itself and the category
index. The grass vector libs must be told that there is nothing special
about point datasets (to cite Hamish) which means rewriting major parts
of the vector libs, and that takes time.

Markus M

Hi Markus,

2009/7/7 Markus Metz <markus.metz.giswork@googlemail.com>:

[...]

For the time being, the only reasonable way to deal with these massive
datasets is to *not* build topology. It's not only the spatial index
that is getting out of hand, also topology itself and the category
index. The grass vector libs must be told that there is nothing special
about point datasets (to cite Hamish) which means rewriting major parts
of the vector libs, and that takes time.

BTW, are you planning to commit your changes in sidx to trunk?

Martin

--
Martin Landa <landa.martin gmail.com> * http://gama.fsv.cvut.cz/~landa

Martin Landa wrote:

Hi Markus,

2009/7/7 Markus Metz <markus.metz.giswork@googlemail.com>:

[...]

For the time being, the only reasonable way to deal with these massive
datasets is to *not* build topology. It's not only the spatial index
that is getting out of hand, also topology itself and the category
index. The grass vector libs must be told that there is nothing special
about point datasets (to cite Hamish) which means rewriting major parts
of the vector libs, and that takes time.
    
BTW, are you planning to commit your changes in sidx to trunk?
  

Yes, after I'm done with testing. I have the probably unrealistic aim to
get building the new file-based spatial index as fast as the current
memory-based index, and I still have to implement the new memory-based
version; currently it's all file-based. I have polished the file-based
version with R*-tree methods and speed optimizations (amongst others a
custom quicksort), but adjusting the grass vector libs to use either the
memory-based version or the file-based version is really a lot of work.
It will take me at least another week to get it right, e.g. decide what
tasks should be done by Vlib and what tasks should be done by diglib,
and how the two should work together. I can send you a detailed
technical report if you want, but I'm afraid it will be very technical
and potentially boring unless you are interested in performance
differences between Toni Guttman's RTree and Norbert Beckmann's
R*-tree. I would need some help to get random file access optimized,
it's not too bad in my tests but I don't know if it can get better.
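
For readers unfamiliar with the two methods, one of the differences can be sketched as follows. When inserting a rectangle, Guttman's R-tree picks the node that needs the least area enlargement, while the R*-tree (at the level above the leaves) picks the node whose enlargement adds the least overlap with its siblings. The Python below is a toy illustration with assumed names and an (xmin, ymin, xmax, ymax) tuple layout. It is not the code discussed in this message, and the R*-tree differs in other ways too (split strategy, forced reinsertion).

```python
def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])


def union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def overlap(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0


def enlargement(node, new):
    return area(union(node, new)) - area(node)


def choose_guttman(nodes, new):
    """Guttman's rule: least area enlargement, ties broken by smallest area."""
    return min(range(len(nodes)),
               key=lambda i: (enlargement(nodes[i], new), area(nodes[i])))


def overlap_enlargement(nodes, i, new):
    """Extra overlap with the sibling nodes if node i grows to include `new`."""
    grown = union(nodes[i], new)
    others = [n for j, n in enumerate(nodes) if j != i]
    return (sum(overlap(grown, o) for o in others)
            - sum(overlap(nodes[i], o) for o in others))


def choose_rstar(nodes, new):
    """R*-tree rule: least overlap enlargement, ties by area enlargement, then area."""
    return min(range(len(nodes)),
               key=lambda i: (overlap_enlargement(nodes, i, new),
                              enlargement(nodes[i], new), area(nodes[i])))


# A flat node, a tall node next to it, and a new point to the right of both.
nodes = [(0.0, 0.0, 4.0, 1.0), (4.5, -5.0, 5.5, 5.0)]
new = (6.0, 0.5, 6.0, 0.5)  # a point stored as a zero-size rectangle
print(choose_guttman(nodes, new), choose_rstar(nodes, new))
```

In this example the two rules disagree: growing the flat node is cheaper in area but makes it cut across the tall node, so the overlap-based rule prefers to grow the tall node instead.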

Markus M