[GRASS-dev] GRASS 7 vector topology format changed

Hi all,

the GRASS 7 vector topology format changed a bit. I have removed
redundant information (bounding boxes) from vector topology and
updated all affected components. The changes are numerous and require
a full rebuild:

make distclean
./configure
make

This change also means that whenever you switch between GRASS 6.x and
GRASS 7, topology needs to be rebuilt. The easiest way is to run

v.build.all

The GRASS 7 format always consumes much less memory and is most of the
time, but not always, faster than the GRASS 6 format.

The network analysis modules should be a bit more robust now and can
more easily detect errors in the networks.

The new format offers new possibilities for the v.surf.* modules
because it is now possible to build (parts of the) topology even for
massive point clouds which in turn makes it possible to quickly
perform spatial queries with very low memory consumption (only a few
MB).

Please test!

Markus M

Hi,

2011/7/1 Markus Metz <markus.metz.giswork@googlemail.com>:

The new format offers new possibilities for the v.surf.* modules
because it is now possible to build (parts of the) topology even for
massive point clouds which in turn makes it possible to quickly
perform spatial queries with very low memory consumption (a very few
MB).

perfect, thanks a lot for your work! Martin

--
Martin Landa <landa.martin gmail.com> * http://geo.fsv.cvut.cz/~landa

>The new format offers new possibilities for the v.surf.* modules
>because it is now possible to build (parts of the) topology even for
>massive point clouds which in turn makes it possible to quickly
>perform spatial queries with very low memory consumption (a very few
>MB)
Markus,

By massive point clouds, do you mean greater than 4 Billion points? A good long weekend for testing at home
:-)

Doug Newcomb
USFWS
Raleigh, NC
919-856-4520 ext. 14 doug_newcomb@fws.gov

The opinions I express are my own and are not representative of the official policy of the U.S.Fish and Wildlife Service or Dept. of the Interior. Life is too short for undocumented, proprietary data formats.

Markus Metz wrote:

the GRASS 7 vector topology format changed a bit. I have
removed redundant information (bounding boxes) from vector
topology

Hi,

just wondering how redundant that is.. for point data completely,
but for a polygon of 500,000 vertices (forest boundary or the
coastline of Florida) knowing the bbox before you touch the data
array can be a huge speed up. Many of PostGIS's fns work that
way IIRC.

so are per-feature bounding boxes completely gone? or just some
double-storing of them, or redundant data stored within them?

are small vectors (ie points) sped up at the cost of worse
performance of scattered large non-point datasets?

is time to run d.vect in a sub-region of the overall map a good
way to test the performance difference?

(hoping this means we can have the best of both worlds!)

thanks,
Hamish

Hamish wrote:

Markus Metz wrote:

the GRASS 7 vector topology format changed a bit. I have
removed redundant information (bounding boxes) from vector
topology

> Hi,
>
> just wondering how redundant that is.. for point data completely,
> but for a polygon of 500,000 vertices (forest boundary or the
> coastline of Florida) knowing the bbox before you touch the data
> array can be a huge speed up. Many of PostGIS's fns work that
> way IIRC.

The bounding boxes were redundant because they are also stored in the
spatial index, i.e. if you want to get the bounding box of an area of
500,000 vertices distributed over 1000 boundaries, you do not need to
read 1000 boundaries but fetch the corresponding box from the spatial
index. For this reason and purpose, I have implemented this
functionality in the spatial index. IOW, the bounding boxes are not
gone, they are still there, but no longer stored in two different
locations, only in one location.
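
A minimal Python sketch of the idea described above, not the actual GRASS vector library code: the box of each area lives only in the spatial index, and anything that needs it asks the index by id instead of reading the boundaries. The names `Area`, `SpatialIndex` and `box_of` are hypothetical, and a plain dictionary stands in for the real index structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    west: float
    south: float
    east: float
    north: float


@dataclass
class Area:
    # Hypothetical topology record: it lists its boundaries but carries no box.
    area_id: int
    boundary_ids: list[int]


class SpatialIndex:
    """Toy stand-in for the spatial index: the only place boxes are stored."""

    def __init__(self) -> None:
        self._boxes: dict[int, Box] = {}

    def insert(self, area_id: int, box: Box) -> None:
        self._boxes[area_id] = box

    def box_of(self, area_id: int) -> Box:
        # One lookup by id; no boundary geometry is read to answer this.
        return self._boxes[area_id]


index = SpatialIndex()
area = Area(area_id=1, boundary_ids=list(range(1000)))
index.insert(area.area_id, Box(west=0.0, south=0.0, east=10.0, north=10.0))

area_box = index.box_of(area.area_id)
print(area_box)
```

The point of the design is that the box has a single owner, so there is no second copy to keep in sync or to hold in memory.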

> so are per-feature bounding boxes completely gone? or just some
> double-storing of them, or redundant data stored within them?

Double-storing is gone (for points, it was 4 times storing the same
box = 8 times storing the point's coordinates).
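
The arithmetic behind "4 times the box = 8 times the coordinates" can be checked quickly: the box of a point has identical minimum and maximum corners, so each stored box repeats the point's coordinates twice. The sketch below assumes a box is stored as six doubles and a 3D point as three doubles; these sizes are assumptions for illustration and are not taken from the message.

```python
import struct

DOUBLE = struct.calcsize("d")   # size of a C double in bytes on this platform

point_bytes = 3 * DOUBLE        # x, y, z (assumed layout)
box_bytes = 6 * DOUBLE          # min and max for each of the three axes (assumed layout)
copies = 4                      # number of times the box was stored, per the message

redundant_bytes = copies * box_bytes
ratio = redundant_bytes // point_bytes
print(ratio)
```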

> are small vectors (ie points) sped up at the cost of worse
> performance of scattered large non-point datasets?

No, the new, reduced format performs better with larger datasets. For
small datasets, there should not be much of a difference in terms of
speed, only in terms of memory requirements.

> is time to run d.vect in a sub-region of the overall map a good
> way to test the performance difference?

Maybe, but d.vect is not very efficient: it goes through all features,
checks if a feature is inside the current region and reads e.g. every
area and the area's isles twice instead of only once.
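
To illustrate why a scan over all features is a weak benchmark for the index, here is a hedged Python sketch that counts how many boxes each approach has to look at. The linear scan mirrors the behaviour described above; the uniform grid is only a simple stand-in for a spatial index and is not how GRASS or d.vect is implemented. All function names are hypothetical.

```python
from collections import defaultdict

Box = tuple[float, float, float, float]  # (west, south, east, north)


def overlaps(a: Box, b: Box) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def linear_scan(boxes: dict[int, Box], region: Box) -> tuple[list[int], int]:
    """Check every feature against the region and count the checks."""
    hits = [fid for fid, box in boxes.items() if overlaps(box, region)]
    return sorted(hits), len(boxes)


def cells(box: Box, cell: float):
    # Yield every grid cell touched by the box.
    w, s, e, n = box
    for cx in range(int(w // cell), int(e // cell) + 1):
        for cy in range(int(s // cell), int(n // cell) + 1):
            yield cx, cy


def build_grid(boxes: dict[int, Box], cell: float) -> dict:
    grid = defaultdict(set)
    for fid, box in boxes.items():
        for key in cells(box, cell):
            grid[key].add(fid)
    return grid


def grid_query(boxes, grid, cell: float, region: Box) -> tuple[list[int], int]:
    """Check only features registered in the cells the region touches."""
    candidates = set()
    for key in cells(region, cell):
        candidates |= grid.get(key, set())
    hits = [fid for fid in candidates if overlaps(boxes[fid], region)]
    return sorted(hits), len(candidates)


# 10,000 small boxes on a 100 x 100 lattice, and a small query region.
boxes = {i: (float(i % 100), float(i // 100), i % 100 + 0.5, i // 100 + 0.5)
         for i in range(10_000)}
region = (10.0, 10.0, 12.0, 12.0)

scan_hits, scan_checked = linear_scan(boxes, region)
grid = build_grid(boxes, cell=10.0)
grid_hits, grid_checked = grid_query(boxes, grid, 10.0, region)
print(scan_checked, grid_checked, len(grid_hits))
```

Both approaches return the same features; the difference is only in how many boxes are examined, which is the effect a benchmark of the index would need to expose.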

v.what, v.build, v.in.* are also good to test performance differences.
Note that because of substantially reduced memory requirements,
modules may fail with out-of-memory errors in 6.x but complete
successfully in 7. Also note that database operations can mask effects
of changed topology management because database management can be the
main time consuming factor (e.g. [r|v].in.lidar).

> (hoping this means we can have the best of both worlds!)

That's the aim.

Markus M

On Fri, Jul 1, 2011 at 2:48 PM, <Doug_Newcomb@fws.gov> wrote:

The new format offers new possibilities for the v.surf.* modules
because it is now possible to build (parts of the) topology even for
massive point clouds which in turn makes it possible to quickly
perform spatial queries with very low memory consumption (a very few
MB)

Markus,

By massive point clouds, do you mean greater than 4 Billion points? A good
long weekend for testing at home
:-)

Ummh, no, 4 billion is probably a bit much. It would however require
relatively few changes to support that number of points...

Markus M

Markus Metz wrote:

IOW, the bounding boxes are not
gone, they are still there, but no longer stored in two
different locations, only in one location.

...

No, the new, reduced format performs better with larger
datasets. For small datasets, there should be not much of a
difference in terms of speed, only in terms of memory
requirements.

(formerly we'd get to about 3 million points on a system with
~ 2gb ram)

> is time to run d.vect in a sub-region of the overall
> map a good way to test the performance difference?
>
Maybe, but d.vect is not very efficient: it goes through
all features, checks if a feature is inside the current region
and reads e.g. every area and the area's isles twice instead of
only once.
v.what, v.build, v.in.* are also good to test performance
differences.

ok, I meant something which made heavy use of the bounding boxes.
I guess the others do it in a less obvious way to the end-user.

Also note that database operations can mask effects
of changed topology management because database management
can be the main time consuming factor (e.g. [r|v].in.lidar).

one thing I noticed recently in 6.x.svn was that 'v.out.ascii
column=' was terribly slow, it seemed to open/close the dbf
db for every category. a vector with only 50000 points was taking
minutes to run.

[OT]
re. r.in.lidar, I worry about that much cloned code.
It is a shame that the #ifdef + Makefile solution to building
two modules from one set of code is problematic, as that would be
an ideal solution. (see recent r.colors, r3.colors frontend
wrappers instead, although I wonder if you could still use pre-
processor macros if the front-end wrapper #define'd them instead
of asking the Makefile to do it? hmm maybe not, there is still
just one set of *.o files)

I just have bad flashbacks of spending a lot of time re-sync'ing
fixes to i.points, i.vpoints, ..., after their cloned graphing
code had diverged for many years.

> (hoping this means we can have the best of both worlds!)
>
That's the aim.

great,
Hamish

Remind me - what was the reason not to move the common code to a (local)
library, and to prefer some arcane preprocessor/MAKE voodoo instead?

Not following so closely,
Maris.

2011/7/4 Hamish <hamish_b@yahoo.com>:

[OT]
re. r.in.lidar, I worry about that much cloned code.
It is a shame that the #ifdef + Makefile solution to building
two modules from one set of code is problematic, as that would be
an ideal solution. (see recent r.colors, r3.colors frontend
wrappers instead, although I wonder if you could still use pre-
processor macros if the front-end wrapper #define'd them instead
of asking the Makefile to do it? hmm maybe not, there is still
just one set of *.o files)

I just have bad flashbacks of spending a lot of time re-sync'ing
fixes to i.points, i.vpoints, ..., after their cloned graphing
code had diverged for many years.

Hamish wrote:

Markus Metz wrote:

IOW, the bounding boxes are not
gone, they are still there, but no longer stored in two
different locations, only in one location.

...

No, the new, reduced format performs better with larger
datasets. For small datasets, there should be not much of a
difference in terms of speed, only in terms of memory
requirements.

(formerly we'd get to about 3 million points on a system with
~ 2gb ram)

If GRASS_VECTOR_LOWMEM is set, hundreds of millions of points should
be possible. These changes should generally push up the limits for
processing larger vector datasets.
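
A small Python sketch of how the variable might be set for a single module call. It assumes that the variable only needs to be present in the environment of the GRASS module; the value "1", the helper name `lowmem_env` and the map name in the comment are placeholders, not something stated in this thread.

```python
import os


def lowmem_env(base=None):
    """Return a copy of the environment with GRASS_VECTOR_LOWMEM set."""
    env = dict(os.environ if base is None else base)
    env["GRASS_VECTOR_LOWMEM"] = "1"  # placeholder value (assumption)
    return env


env = lowmem_env({"PATH": "/usr/bin"})
print(env["GRASS_VECTOR_LOWMEM"])

# Inside a running GRASS session this could be passed to a module call, e.g.:
#   subprocess.run(["v.build", "map=my_points"], env=lowmem_env(), check=True)
# where "my_points" is a hypothetical map name.
```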

> is time to run d.vect in a sub-region of the overall
> map a good way to test the performance difference?
>
Maybe, but d.vect is not very efficient: it goes through
all features, checks if a feature is inside the current region
and reads e.g. every area and the area's isles twice instead of
only once.
v.what, v.build, v.in.* are also good to test performance
differences.

ok, I meant something which made heavy use of the bounding boxes.
I guess the others do it in a less obvious way to the end-user.

In general, anything that modifies areas makes heavy use of bounding
boxes, e.g. v.generalize with boundaries, v.clean tool=rmarea, v.select.
Building topology for areas also makes heavy use of bounding boxes.

v.what is not a fair test of performance differences because GRASS
6.x needs to build the spatial index on the fly, so it will always
take much longer in 6.x than in 7.

BTW, I have reduced file I/O for d.vect type=area in GRASS 7, now
waiting for d.mon to come back...

Markus M

Hi,

2011/7/4 Markus Metz <markus.metz.giswork@googlemail.com>:

BTW, I have reduced file I/O for d.vect type=area in GRASS 7, now
waiting for d.mon to come back...

just note: currently working on d.mon...

Martin

--
Martin Landa <landa.martin gmail.com> * http://geo.fsv.cvut.cz/~landa

On Mon, Jul 4, 2011 at 12:13 PM, Martin Landa <landa.martin@gmail.com> wrote:

Hi,

2011/7/4 Markus Metz <markus.metz.giswork@googlemail.com>:

BTW, I have reduced file I/O for d.vect type=area in GRASS 7, now
waiting for d.mon to come back...

just note: currently working on d.mon...

wonderful!

Markus M

On Sun, Jul 3, 2011 at 6:54 PM, Markus Metz <markus.metz.giswork@googlemail.com> wrote:

On Fri, Jul 1, 2011 at 2:48 PM, <Doug_Newcomb@fws.gov> wrote:

The new format offers new possibilities for the v.surf.* modules
because it is now possible to build (parts of the) topology even for
massive point clouds which in turn makes it possible to quickly
perform spatial queries with very low memory consumption (a very few
MB)

Markus,

By massive point clouds, do you mean greater than 4 Billion points? A good
long weekend for testing at home
:-)

Ummh, no, 4 billion is probably a bit much. It would however require
relatively little changes to support that amount of points…

It would be great if that weren’t a lot of work! I work with LiDAR clouds, and sometimes that’s a TON of point data.