Chris:
I have been trying to import a detailed shapefile of the Florida
coast (about 24M) into grass6.0b2, using v.in.ogr. Unfortunately,
having started 3 days ago, the import has not yet completed. It is
making progress, as the number of intersections is incrementing,
however this surely should not be taking quite this long. Are there
ways of speeding up the process?
I am on a 1GHz OS X Powerbook with 1GB RAM.
Radim:
Is it one or few big shapes or many small?
Chris:
It is the Florida coastline, so it should be one large shape, with
plenty of small islands scattered alongside.
Radim:
Then the problem is that the bounding box of that long boundary
intersects all the islands, and v.in.ogr will try to break those
lines, which takes a long time. It was already discussed here. Try to
modify v.in.ogr so that it writes long boundaries in several shorter
parts.
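A minimal sketch of the effect described above, not taken from v.in.ogr. The Box type and the functions box_of, boxes_overlap and count_candidates are hypothetical names, and a linear scan stands in for whatever index the real code uses. It counts how many island boxes are selected by the bounding box of a boundary, or of one part of it.

```c
#include <assert.h>
#include <stddef.h>

/* Axis-aligned bounding box (hypothetical type, not a GRASS struct). */
typedef struct { double w, s, e, n; } Box;

/* Bounding box of vertices first..last (inclusive) of a polyline. */
Box box_of(const double *x, const double *y, size_t first, size_t last)
{
    Box b = { x[first], y[first], x[first], y[first] };
    assert(first <= last);
    for (size_t i = first + 1; i <= last; i++) {
        if (x[i] < b.w) b.w = x[i];
        if (x[i] > b.e) b.e = x[i];
        if (y[i] < b.s) b.s = y[i];
        if (y[i] > b.n) b.n = y[i];
    }
    return b;
}

/* Non-zero if the two boxes overlap or touch. */
int boxes_overlap(Box a, Box b)
{
    return a.w <= b.e && b.w <= a.e && a.s <= b.n && b.s <= a.n;
}

/* Number of boxes in others[] selected by box b, i.e. the lines
 * that would have to be checked for intersections. */
size_t count_candidates(Box b, const Box *others, size_t n)
{
    size_t c = 0;
    for (size_t i = 0; i < n; i++)
        if (boxes_overlap(b, others[i]))
            c++;
    return c;
}
```

For a coastline running diagonally across the map, the box of the whole line covers every island inside that square, while the boxes of short parts select only the islands close to each part.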
It was discussed off list, correspondence (including patch) follows.
I only come up against this myself every couple of months, so I have just
let it be slow, but I've got a detailed coastline vector map with the
occasional offshore island. It is topologically clean, but has the same
long processing time issue.
Hamish
-------------------------------------------------------------------------
From: Radim Blazek <blazek@itc.it>
Subject: Re: polygon cleaning
Date: Mon, 9 Feb 2004 12:27:15 +0100
To: Hamish <hamish_nospam@yahoo.com>
On Wednesday 04 February 2004 23:42, you wrote:
Hi Radim,
I was just wondering if I was getting the expected behaviour out of
v.in.ogr or if something was going wrong.
Some feedback for you about how well things scale for very big files
anyway..
I've got a big shapefile (100mb shp, 12mb dbf) of all the forested areas in
my country. The import went smoothly, but during the "Break boundaries"
stage things seemed to get exponentially slower after about 12,000 lines.
On a new 2.8GHz Pentium4 it took about 36 hours to get through this single
step; the rest of the import went pretty quickly.
[It used 735mb RAM, but I have 2gig RAM, so no swapping]
Is "Break boundaries" inherently exponential, or can the algorithm be
improved?
An import of the 622mb shapefile (topographic contour lines) went pretty
quickly, by the way (minutes).
36 hours is very bad, and it seems strange: importing
my shapefile of circa 140 MB (shp), 267117 boundaries in GRASS,
with v.in.ogr takes 91 minutes on my 1.5GHz machine.
"Break boundaries" means Vect_break_lines(). It is of course possible
to improve everything, but the most time-consuming problem is already solved, I think.
Vect_break_lines uses a spatial index twice: first to find all lines in the bounding box
which could intersect the processed line (line A); then a second spatial index is built for
all segments in line B. This way, there should be no exponential dependency on input.
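The second, segment-level stage can be illustrated with a small sketch. This is not the Vect_break_lines() source: the Line type and the function names are hypothetical, and plain loops stand in for the spatial index. It counts the segment pairs of lines A and B whose bounding boxes overlap, which are the only pairs that would need an exact intersection test.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical polyline type: n vertices in x[] and y[]. */
typedef struct { const double *x, *y; size_t n; } Line;

double min2(double p, double q) { return p < q ? p : q; }
double max2(double p, double q) { return p > q ? p : q; }

/* Non-zero if the box of segment i of a overlaps the box of segment j of b. */
int seg_boxes_overlap(const Line *a, size_t i, const Line *b, size_t j)
{
    return min2(a->x[i], a->x[i + 1]) <= max2(b->x[j], b->x[j + 1]) &&
           min2(b->x[j], b->x[j + 1]) <= max2(a->x[i], a->x[i + 1]) &&
           min2(a->y[i], a->y[i + 1]) <= max2(b->y[j], b->y[j + 1]) &&
           min2(b->y[j], b->y[j + 1]) <= max2(a->y[i], a->y[i + 1]);
}

/* Number of segment pairs that survive the box filter. A real
 * implementation would query an index instead of looping over
 * every segment of b. */
size_t count_segment_candidates(const Line *a, const Line *b)
{
    size_t c = 0;
    assert(a->n >= 2 && b->n >= 2);
    for (size_t i = 0; i + 1 < a->n; i++)
        for (size_t j = 0; j + 1 < b->n; j++)
            if (seg_boxes_overlap(a, i, b, j))
                c++;
    return c;
}
```

A small island whose box lies inside the box of a long line A is still selected by the line-level stage, but it may produce no candidate pairs at the segment level.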
Could you try to localise the problem somehow? Perhaps select only the lines above 12,000.
Does v.clean tool=break also take such a long time?
Radim
?,
cheers,
Hamish
Here's the output:
G:topo4_nztm > v.in.ogr dsn=. layer=native_poly out=native_poly
WARNING: Datum 'NZGD_2000' not recognised by GRASS and no parameters found.
Datum transformation will not be possible using this projection
information.
Layer: native_poly
WARNING: Area size 1.2e-09, area not imported.
WARNING: 129 features without geometry.
-----------------------------------------------------
142948 primitives registered
142906 areas built
142868 isles built
Number of nodes : 142890
Number of primitives: 142948
Number of points : 0
Number of lines : 0
Number of boundaries: 142948
Number of centroids : 0
Number of areas : 142906
Number of isles : 142868
Number of incorrect boundaries : 44
Number of areas without centroid : 142906
-----------------------------------------------------
WARNING: Cleaning polygons, result is not guaranteed!
Building topology ...
Number of nodes : 142890
Number of primitives: 142948
Number of points : 0
Number of lines : 0
Number of boundaries: 142948
Number of centroids : 0
Number of areas : -
Number of isles : -
-----------------------------------------------------
Snap boundaries (threshold = 1.000e-03):
All vertices: 5815998
Registered points (unique coordinates): 5671966
Nodes marked as anchor : 5671439
Nodes marked to be snapped : 527
Snapped vertices : 554
New vertices : 113
-----------------------------------------------------
Break polygons:
Registering points ... 5671439
All points (vertices): 5815678
Registered points (unique coordinates): 5671439
Points marked for break: 143327
Breaks: 1930
-----------------------------------------------------
Remove duplicates:
Duplicates: 396
-----------------------------------------------------
Break boundaries:
Intersections: 4
-----------------------------------------------------
Remove duplicates:
Duplicates: 4
-----------------------------------------------------
Change dangles to lines:
Removed dangles: 5 removed lines: 5
-----------------------------------------------------
Remove bridges:
Removed bridges: 0 removed lines: 0
-----------------------------------------------------
Building topology ...
142973 areas built
142496 isles built
Number of nodes : 143782
Number of primitives: 145340
Number of points : 0
Number of lines : 0
Number of boundaries: 145340
Number of centroids : 0
Number of areas : 142973
Number of isles : 142496
Number of areas without centroid : 142973
Layer: native_poly
-----------------------------------------------------
Building topology ...
-----------------------------------------------------
257057 primitives registered
142973 areas built
142496 isles built
Number of nodes : 256580
Number of primitives: 257057
Number of points : 0
Number of lines : 0
Number of boundaries: 143803
Number of centroids : 113254
Number of areas : 142973
Number of isles : 142496
Number of areas without centroid : 29719
-----------------------------------------------------
WARNING: 3 areas represet more (overlapping) features, because polygons
overlap in input layer(s). Such areas are linked to more than 1
row in attribute table. The number of features for those areas is stored as
category in field 2.
113255 input polygons
total area: 7.250188e+10 (142973 areas)
overlapping area: 3.488045e+04 (3 areas)
area without category: 4.978083e+09 (29719 areas)
[Finish]
all 4 intersections were somewhere in the last 5000 or so lines
processed.
From: Radim Blazek <blazek@itc.it>
Subject: Re: polygon cleaning
Date: Tue, 10 Feb 2004 10:47:32 +0100
To: Hamish <hamish_nospam@yahoo.com>
On Monday 09 February 2004 14:23, you wrote:
> Could you try to localise the problem somehow?
is it useful to run with DEBUG=2?
I don't think so.
> Does v.clean tool=break take also so long time?
running now, let you know tomorrow..
I just broke out of the 'v.clean tool=break' after 90 minutes and 25147
of ~143000 lines processed. I think it would go the full 36 hours if I
left it.
BTW, it took 5 minutes just to copy(!):
It is because it copies line by line and builds topology. BTW,
which db driver? Postgres should be faster than DBF.
Number of boundaries: 143803
Number of centroids : 113254
Number of areas : 142973
Number of isles : 142496
There must be some mystery in your data. I have tried
v.clean tool=break on my 140MB shape, 1.5GHz CPU
Number of boundaries: 267117
Number of centroids: 97314
Number of areas: 98124
Number of islands: 21197
and it takes
real 17m0.085s
user 15m56.800s
sys 0m46.530s
What is strange is that the number of boundaries in your map is almost equal to
the number of areas and to the number of isles. That means that the
areas are isolated and do not share common boundaries, is that right?
But I don't see that as a reason for cleaning to be so slow.
Could you try to extract just a smaller part of your map
(v.select, v.extract) and run v.clean on it, to see whether it still
takes a proportional share of those 36 hours?
Try to find the type of data causing the problem.
Now I have just one idea: because the areas are probably isolated,
it could be that there is one BIG boundary around the whole map
(I think that something like that exists in ArcInfo).
In that case, when this boundary is processed, it selects the
lines which could intersect it by bounding box, here ALL
lines, and checking the intersection of ALL segments of that BIG line
with ALL segments of ALL other lines can take a very long time.
Is that a clear explanation?
Radim
From: Radim Blazek <blazek@itc.it>
Subject: Re: polygon cleaning
Date: Wed, 11 Feb 2004 15:40:07 +0100
To: Hamish <hamish_nospam@yahoo.com>
On Wednesday 11 February 2004 10:25, you wrote:
I'll try extracting the ones around that green & yellow PNG as there are
only a few there.
> Try somehow find the type of data causing the problem.
> Now I have just one idea, because the areas are probably isolated,
> it could be that exist one BIG boundary around all the map
> (I thing that something like that exists in ArcInfo),
> in that case, when this boundary is processed, it selects
> lines which could intersect it by bounding box, in that case ALL
> lines, and to check intersection of ALL segments of that BIG line
> with ALL segments of ALL other lines can take a very long time.
> Is it clear explenation?
Yes, clear explanation, but not the case.
Why not? I think it is. Not just one BIG boundary, but many big ones. The problem is
that the areas are big and do not share boundaries, so the bounding box of
one such big area is big and selects many other boundaries.
I have got an idea: split boundaries into smaller parts, something like
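for ( line = 1; line <= Vect_get_num_lines ( In ); line++) {
    type = Vect_read_line ( In, Points, Cats, line);
    v = 0;
    /* write the line in parts of at most maxvertex (>= 2) vertices */
    while ( v < Points->n_points ) {
        Vect_reset_line ( OPoints );
        for ( i = 0; i < maxvertex && v < Points->n_points; i++, v++ )
            Vect_append_point ( OPoints, Points->x[v], Points->y[v], 0);
        Vect_write_line ( Out, type, OPoints, Cats);
        /* next part starts at the last vertex of this one, so parts stay connected */
        if ( v < Points->n_points ) v--;
    }
}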
Could you try it? If it helps, it would be reasonable to add this splitting
to v.in.ogr / v.clean.
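A self-contained sketch of the splitting idea, independent of the GRASS vector library. split_ranges is a hypothetical helper that computes which vertex ranges each part would cover when a line of n_points vertices is cut into parts of at most maxvertex vertices. It assumes that consecutive parts should share one vertex so the pieces stay connected; that is an assumption of this sketch, not something confirmed for v.in.ogr.

```c
#include <assert.h>
#include <stddef.h>

/* Fill first[] and last[] with the inclusive vertex ranges of the
 * parts and return the number of parts written (at most max_parts).
 * Consecutive parts share one vertex: last[k] == first[k + 1]. */
size_t split_ranges(size_t n_points, size_t maxvertex,
                    size_t *first, size_t *last, size_t max_parts)
{
    size_t parts = 0, v = 0;

    assert(maxvertex >= 2);   /* a part needs at least one segment */
    while (v < n_points && parts < max_parts) {
        size_t end = v + maxvertex - 1;   /* last vertex of this part */
        if (end > n_points - 1)
            end = n_points - 1;
        first[parts] = v;
        last[parts] = end;
        parts++;
        if (end == n_points - 1)
            break;            /* reached the end of the line */
        v = end;              /* next part starts at the shared vertex */
    }
    return parts;
}
```

With n_points = 10 and maxvertex = 4 this yields the ranges 0..3, 3..6 and 6..9. Each range could then be copied into its own output line.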
Radim
From: Radim Blazek <blazek@itc.it>
Subject: Re: polygon cleaning
Date: Wed, 11 Feb 2004 15:41:44 +0100
To: Hamish <hamish_nospam@yahoo.com>
On Wednesday 11 February 2004 10:43, you wrote:
ok, figured out v.select+v.in.region.
extracted the small island (gray/pink "d.vect -c" PNG from prev. email)
Still seems a little slow 3 min for 500 areas, but FYI:
Yes, v.overlay is slow because it breaks all lines. Also, if you need
v.overlay between a big and a small vector (area size), it is better to run
v.select first.
Radim