[GRASS-dev] vector library changes

Hi all,

I have tried to make topology building in grass7 a bit faster, with limited success. Some functions are now a bit faster, but there are no drastic changes. Some other functions are now a bit slower, and there I would like to know if there are objections against these changes.

The first change causing a slight slowdown is in diglib: memory of structures that are no longer used is now freed. That was mentioned as a TODO in the source code, probably by Radim. All I can see now is that cleaning time, e.g. v.in.ogr of a polygon vector, increases from e.g. 15m30s to 15m35s, i.e. the speed loss in this case is about 0.5%, but memory consumption is not really lower, even when cleaning large vectors with many areas (> 50,000) where many structures should be freed. The new functions dig_free_node(), dig_free_line(), dig_free_area(), and dig_free_isle() in diglib/struct_alloc.c are called from within plus_line.c and plus_area.c. Maybe someone can have a look at the new functions in struct_alloc.c to check if I made a mistake? AFAICT, there is no obvious mistake: the resulting vectors are identical in my cleaning tests, with no warnings or errors.
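
As a rough cross-check of the 0.5% figure, the slowdown can be computed from the two example timings above. This is only a back-of-the-envelope sketch in Python; the helper name `to_seconds` is made up for illustration, and the timings are the approximate example values quoted above, not a new measurement.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair to plain seconds (hypothetical helper)."""
    return minutes * 60 + seconds


before = to_seconds(15, 30)  # cleaning time without freeing structures
after = to_seconds(15, 35)   # cleaning time with the new dig_free_*() calls

# Relative slowdown: extra time divided by the original time.
overhead = (after - before) / before
print(f"slowdown: {overhead:.2%}")
```

Five extra seconds on a 930-second run is a little over half a percent, which matches the estimate.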

Another similar slowdown (about 0.5%) is caused by G_percent, which I added to all cleaning functions. The reasoning is that users may wonder if anything is happening at all when importing a large polygon vector with v.in.ogr or cleaning a large vector, and G_percent shows that there is something happening. I find that reassuring to know.

In short, grass7 is still about as fast as grass6 with regard to cleaning vectors (should be a bit faster with building topology, only noticeable with really large vectors), but gives a bit more feedback on the progress. The vector API as well as the vector format is unchanged.

If there are objections against these changes, I will revert them.

Regards,

Markus M

Hi Markus,

On Mon, Mar 23, 2009 at 11:37 AM, Markus Metz
<markus.metz.giswork@googlemail.com> wrote:

Hi all,

I have tried to make topology building in grass7 a bit faster, with limited
success. Some functions are now a bit faster, but there are no drastic
changes. Some other functions are now a bit slower, and there I would like
to know if there are objections against these changes.

The first change causing a little slow down is in diglib: memory of no
longer used structures is freed. That was mentioned as a TODO in the source
code, probably by Radim. All I can see now is that cleaning time, e.g.
v.in.ogr of a polygon vector, increases from e.g. 15m30s to 15m35s, IOW
speed loss is in this case about 0.5%, but memory consumption is not really
lower, even when cleaning large vectors with many areas (> 50,000) where
many structures should be freed. The new functions dig_free_node(),
dig_free_line(), dig_free_area(), and dig_free_isle() in
diglib/struct_alloc.c are called from within plus_line.c and plus_area.c.
Maybe someone can have a look at the new functions in struct_alloc.c to
check if I made a mistake? AFAICT, there is no obvious mistake, resulting
vectors are identical in my cleaning tests, no warnings or errors.

Another similar slow down (about 0.5%) is caused by G_percent which I added
to all cleaning functions. The reasoning is that users may wonder if
anything is happening at all when importing a large polygon vector with
v.in.ogr or cleaning a large vector, and G_percent shows that there is
something happening. I find that reassuring to know.

Yes, that's very useful (and 0.5% is an acceptable tradeoff).

In short, grass7 is still about as fast as grass6 with regard to cleaning
vectors (should be a bit faster with building topology, only noticeable with
really large vectors), but gives a bit more feedback on the progress. The
vector API as well as the vector format is unchanged.

Here are my tests:

GRASS 6.5.svn:
time v.in.ogr usr_urb.shp out=tmp
...
78095 input polygons
Total area: 6.267569e+09 (78095 areas)
Overlapping area: 0.000000e+00 (0 areas)
Area without category: 0.000000e+00 (0 areas)
4020.05user 76.72system 1:14:33elapsed 91%CPU (0avgtext+0avgdata 0maxresident)k
298216inputs+762008outputs (26major+213960minor)pagefaults 0swaps

GRASS 7.svn:
time v.in.ogr usr_urb.shp out=tmp
...
78095 input polygons
Total area: 6.267569e+09 (78095 areas)
Overlapping area: 0.000000e+00 (0 areas)
Area without category: 0.000000e+00 (0 areas)
4137.73user 59.59system 1:15:52elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
1080inputs+793648outputs (4major+219703minor)pagefaults 0swaps

Best
Markus

PS: Could it output the percentage in the same line to reduce vertical
  space (in a "screen" session I didn't manage to scroll back)?
  Mhh, perhaps it is G_percent() which introduces the newline...

now:
...
-----------------------------------------------------
Remove duplicates:
100%
-----------------------------------------------------
Clean boundaries at nodes:
100%
...

ideally:
...
-----------------------------------------------------
Remove duplicates: 100%
-----------------------------------------------------
Clean boundaries at nodes: 100%
...
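
The "ideally" layout above could be sketched roughly like this: write the label and the percentage together and use a carriage return to overwrite the same line on each update. This is plain Python for illustration only, not GRASS code; `report_progress` is a hypothetical helper, and the sketch deliberately ignores quiet/verbose settings, the GUI, and translations.

```python
import io
import sys


def report_progress(label, current, total, stream=sys.stderr):
    """Rewrite a single 'label: NN%' line in place (hypothetical helper)."""
    percent = 100 * current // total
    # '\r' moves the cursor back to the start of the line, so the next
    # write overwrites the previous percentage instead of adding a line.
    stream.write(f"\r{label}: {percent}%")
    if current >= total:
        stream.write("\n")  # end the line once the step is complete
    stream.flush()


# Capture the output in a buffer so the result can be inspected.
buf = io.StringIO()
for done in (0, 50, 100):
    report_progress("Remove duplicates", done, 100, stream=buf)
```

On a terminal this leaves one line per cleaning step; whether the same trick behaves well inside a "screen" session or when output is redirected to a log file is an open assumption.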

Markus Neteler wrote:

Here are my tests:

GRASS 6.5.svn:
time v.in.ogr usr_urb.shp out=tmp
...
4020.05user 76.72system 1:14:33elapsed 91%CPU (0avgtext+0avgdata 0maxresident)k
298216inputs+762008outputs (26major+213960minor)pagefaults 0swaps

GRASS 7.svn:
time v.in.ogr usr_urb.shp out=tmp
...
4137.73user 59.59system 1:15:52elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
1080inputs+793648outputs (4major+219703minor)pagefaults 0swaps
  

I'll try to make up for that bit of extra time in grass7 at another place...

PS: Could it output the percentage in the same line to reduce vertical
  space (in a "screen" session I didn't manage to scroll back)?
  Mhh, perhaps it is G_percent() which introduces the newline...
  

...

ideally:
...
-----------------------------------------------------
Remove duplicates: 100%
  

"Remove duplicates:" is produced by G_message(), then comes G_percent(), which determines the format and the newlines. BTW, the extra line for the percentage is only present on the command line, not in the GUI. For that reason (it must work with both the GUI and the command line, must respect quiet and verbose where appropriate, and translations???), I want to use G_* wherever possible and not construct something with e.g. fprintf. If someone has a solution for that, the cleaning functions can make use of it, but I don't have one :-(

Thanks for testing,

Markus M

On Mon, Mar 23, 2009 at 12:59 PM, Markus Neteler <neteler@osgeo.org> wrote:

On Mon, Mar 23, 2009 at 11:37 AM, Markus Metz <markus.metz.giswork@googlemail.com> wrote:

...

usr_urb.shp is of 126MB size.

Here are my tests:

GRASS 6.5.svn:
time v.in.ogr usr_urb.shp out=tmp
...
78095 input polygons
Total area: 6.267569e+09 (78095 areas)
Overlapping area: 0.000000e+00 (0 areas)
Area without category: 0.000000e+00 (0 areas)
4020.05user 76.72system 1:14:33elapsed 91%CPU (0avgtext+0avgdata 0maxresident)k
298216inputs+762008outputs (26major+213960minor)pagefaults 0swaps

yesterday:

GRASS 7.svn:
time v.in.ogr usr_urb.shp out=tmp
...
78095 input polygons
Total area: 6.267569e+09 (78095 areas)
Overlapping area: 0.000000e+00 (0 areas)
Area without category: 0.000000e+00 (0 areas)
4137.73user 59.59system 1:15:52elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
1080inputs+793648outputs (4major+219703minor)pagefaults 0swaps

Today (the "Break polygons" part is incredibly fast now!):

GRASS 7.svn:
time v.in.ogr usr_urb.shp out=tmp
...
78095 input polygons
Total area: 6.267569e+09 (78095 areas)
Overlapping area: 0.000000e+00 (0 areas)
Area without category: 0.000000e+00 (0 areas)
1872.79user 38.02system 32:08.36elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
293928inputs+929648outputs (4major+190446minor)pagefaults 0swaps

1:15:52elapsed versus now 32:08elapsed!
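
To put a number on that, the elapsed times reported by `time` can be converted to seconds and compared. A minimal Python sketch; `elapsed_to_seconds` is a hypothetical helper, and the two strings are copied from the runs above.

```python
def elapsed_to_seconds(text):
    """Convert an 'h:mm:ss' or 'mm:ss.ff' elapsed string to seconds."""
    seconds = 0.0
    # Each colon-separated field is worth 60 times the one to its right.
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


before = elapsed_to_seconds("1:15:52")   # yesterday's GRASS 7 run
after = elapsed_to_seconds("32:08.36")   # today's GRASS 7 run

ratio = after / before
print(f"{before:.0f} s -> {after:.2f} s ({ratio:.1%} of the earlier run)")
```

By this measure today's run needs roughly 42% of yesterday's wall-clock time.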

Some comments:

It now spends its time in (as you will know):
Break boundaries: ...
and
Building areas...
and then after having created the temporary topology and before
the final topology (not sure what it's doing there; no screen output after
Layer: usr_urb
and the beast is at 100% CPU for a while), then it continues with
-----------------------------------------------------
-----------------------------------------------------
Building topology for vector map <tmp>...
Registering primitives...

Here it also spends a lot of time - could that be another candidate?

Great job - the new binary trees seem to be very fast and promising for
the other parts of the topology.

Markus

Markus Neteler wrote:

Some comments:

It now spends its time in (as you will know):
Break boundaries: ...
and
Building areas...
  

Yes, that's unchanged and behaves as before. Break boundaries can be made faster with boundary splitting.

and then after having created the temporary topology and before
the final topology (not sure what it's doing there; no screen output after
Layer: usr_urb
and the beast is at 100% CPU for a while), then it continues with
-----------------------------------------------------
Building topology for vector map <tmp>...
Registering primitives...

Here it also spends a lot of time - could that be another candidate?
  

There is some potential for improvement. I could sometime soon put a new version of v.in.ogr in trunk that optionally uses boundary splitting, which could not only speed up breaking boundaries, but also topology building. Independent of that, topology is built by v.in.ogr more often than actually needed, also a time killer with large imports.

Markus M

After the recent Vectlib changes in GRASS 7, here are new tests:

On Mon, Mar 23, 2009 at 3:53 PM, Markus Metz
<markus.metz.giswork@googlemail.com> wrote:

Markus Neteler wrote:

GRASS 6.5.svn:
time v.in.ogr usr_urb.shp out=tmp
...
4020.05user 76.72system 1:14:33elapsed 91%CPU (0avgtext+0avgdata
0maxresident)k
298216inputs+762008outputs (26major+213960minor)pagefaults 0swaps

GRASS 7.svn:
time v.in.ogr usr_urb.shp out=tmp
...
4137.73user 59.59system 1:15:52elapsed 92%CPU (0avgtext+0avgdata
0maxresident)k
1080inputs+793648outputs (4major+219703minor)pagefaults 0swaps

Today in GRASS 7:
...
1877.97user 37.02system 32:10.12elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
293568inputs+826792outputs (4major+192075minor)pagefaults 0swaps

Faster than ever! Less than 50% of the time if I interpret it correctly.
Congratulations, MarkusM!

MarkusN