[GRASS-dev] [GRASS GIS] #426: v.in.ogr: split long boundaries

#426: v.in.ogr: split long boundaries
-------------------------+--------------------------------------------------
Reporter: neteler | Owner: grass-dev@lists.osgeo.org
     Type: enhancement | Status: new
Priority: major | Milestone: 6.5.0
Component: Vector | Version: unspecified
Keywords: | Platform: All
      Cpu: All |
-------------------------+--------------------------------------------------
Moved here from
http://trac.osgeo.org/grass/browser/grass/trunk/doc/vector/TODO#L242

Radim suggested:

It would be useful to split long boundaries in v.in.ogr to smaller
pieces. Otherwise cleaning process can become very slow because
bounding box of long boundaries can overlap large part of the map (for
example outline around all areas) and cleaning process is checking
intersection with all boundaries falling in the bounding box.
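
The effect described above can be illustrated with a toy example. This is only a sketch on made-up data (a 10 x 10 grid of small boxes standing in for short boundaries, plus one outline around all of them); the helper names are hypothetical and it does not reproduce the cleaning code in v.in.ogr. It counts how many small boxes are bounding-box candidates for the outline, first as one long boundary and then split into its four sides.

```python
def bbox(points):
    """Return (xmin, ymin, xmax, ymax) for a list of (x, y) vertices."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def overlaps(a, b):
    """True if two bounding boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Hypothetical data: 100 short boundaries, represented by their boxes.
small = [bbox([(i, j), (i + 0.5, j + 0.5)])
         for i in range(10) for j in range(10)]

# One long outline around all of them.
outline = [(-1, -1), (11, -1), (11, 11), (-1, 11), (-1, -1)]

# Unsplit: the outline's box covers the whole map.
whole = bbox(outline)
candidates_whole = sum(overlaps(whole, s) for s in small)

# Split into four sides: each side has a thin box of its own.
sides = [bbox(outline[k:k + 2]) for k in range(4)]
candidates_split = sum(overlaps(side, s) for side in sides for s in small)

print(candidates_whole, candidates_split)
```

In this toy case the unsplit outline has to be checked against every small boundary, while none of the four sides overlaps any of them.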

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/426>
GRASS GIS <http://grass.osgeo.org>

#426: v.in.ogr: split long boundaries
--------------------------+-------------------------------------------------
  Reporter: neteler | Owner: grass-dev@lists.osgeo.org
      Type: enhancement | Status: new
  Priority: major | Milestone: 6.5.0
Component: Vector | Version: unspecified
Resolution: | Keywords:
  Platform: All | Cpu: All
--------------------------+-------------------------------------------------
Comment (by mmetz):

The lack of boundary splitting is only one aspect that leads to complaints about
v.in.ogr. I have added boundary splitting to my local copy of
develbranch_6 and found sometimes substantial speed improvements (up to 6x
as fast), sometimes no speed improvement (more or less the same, give and
take a few seconds). It depends on properties of the vector to be imported
that are independent of the size of the vector to be imported and the
number of features in it (see below). I have used ESRI Shapefiles for
testing, they are the most common vectors to be imported and they have
only what is called "simple features", most importantly no topology.

I figured out four possibilities to improve (ESRI Shapefile) vector import
with v.in.ogr

1) code comment quoting "TODO: is it necessary to build here? probably
not, consumes time"

Indeed, build partial with GV_BUILD_BASE is sufficient and results in a
substantial speed improvement for large vectors.

2) when area cleaning is desired (no -c flag), support for the output
vector is released, the vector is closed, opened for update and a partial
build with GV_BUILD_BASE is done

This gives me an error with large vectors: the size of the coor file does
not match the size given by topology (the topo file). I disabled that part
of the code and now it works. There must have been a reason to do that but
I can not think of a reason. But then I'm not that deep into GRASS vector
processing, please help me out. My argument is that v.in.ogr is far from
finished with that vector and that it is safer to keep it open and keep
working on it. For me, this is both a speed improvement and avoids import
failures (most important).

3) split boundaries when import vectors have many areas ( > 500)

I make this No 3) because No 2) is crucial, it avoids import failures, and
No 2) depends on No 1). Splitting boundaries is just another speed
improvement but probably welcome for many users because the speed
improvement can be quite a bit. With my proposed method, splitting
boundaries is done with a threshold for boundary length. Whenever a new
vertex is added to a boundary and the boundary length exceeds that
threshold, the boundary is written out and a new boundary started with the
same Cats if given. The reasoning to determine the threshold is that a
useful threshold is a function of vector area size and the number of
areas. I propose to use map unit / ln(features). Map unit is sqrt(area
size), reasonable for boundary length. For bounding box of boundary length
I would use area size directly (keep units identical). Using ln(features)
avoids creating tiny tiny thresholds when many many features are in the
vector to be imported. I would understand if you think that this is
nonsense but it works, really! Both for a global map of the world with
political boundaries in latlon and a vector with watershed basins in UTM
with 150x150 km extent. Anyway, splitting boundaries will only happen if
the -s flag is set (keeping compatibility with 6.4.0).
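
The threshold rule and the splitting step in No 3) can be sketched roughly as follows. This is an illustration in Python, not the patched C code: the function names are made up, the handling of fewer than two features is an assumption (ln(1) is zero), and Cats are left out.

```python
import math


def split_threshold(area_size, n_features):
    """Threshold = map unit / ln(features), with map unit = sqrt(area size)."""
    if n_features < 2:
        # Assumption: ln(1) == 0, so do not split at all in that case.
        return math.inf
    return math.sqrt(area_size) / math.log(n_features)


def split_by_length(vertices, threshold):
    """Write out a piece whenever adding a vertex pushes its length
    over the threshold, then start a new piece at that vertex."""
    pieces = []
    current = [vertices[0]]
    length = 0.0
    for prev, cur in zip(vertices, vertices[1:]):
        current.append(cur)
        length += math.dist(prev, cur)
        if length > threshold:
            pieces.append(current)
            current = [cur]  # the next piece starts where this one ended
            length = 0.0
    if len(current) > 1:
        pieces.append(current)
    return pieces


# Toy boundary with unit-length segments, split at a threshold of 1.5.
boundary = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
pieces = split_by_length(boundary, 1.5)
print(pieces)
```

Because the check happens after a vertex is added, a piece can be longer than the threshold by up to one segment. Consecutive pieces share their end vertex, so no geometry is lost.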

4) use a temporary vector, not by me but by Radim Blazek

Do all the processing and cleaning in a temporary vector, then copy only
alive lines to the output vector. In case of ESRI Shapefile polygon
import, this might reduce the size of the coor file by a factor of 2 (all
boundaries used by GRASS are present twice in the shapefile). That would
not only be a speed improvement for further processing but also be safer.
Thinking about it that should be No 1) because as Radim Blazek suggested,
every module should do that to keep the size of the coor file small and
speed up vector processing.
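
A minimal sketch of the idea in No 4), on made-up data: two adjacent squares are read as eight two-vertex boundary segments, the shared edge appears twice, cleaning marks the duplicate as dead, and only alive lines are copied to the output. The dict layout and function names are hypothetical and say nothing about the actual coor file format.

```python
def ring_segments(ring):
    """Break a closed ring into two-vertex segments, all alive at first."""
    return [{"coords": (a, b), "alive": True} for a, b in zip(ring, ring[1:])]


def mark_duplicates_dead(lines):
    """Mark later copies of a segment as dead, ignoring direction.
    Sorting the two end points is only valid for two-vertex segments."""
    seen = set()
    for line in lines:
        key = tuple(sorted(line["coords"]))
        if key in seen:
            line["alive"] = False
        else:
            seen.add(key)


def copy_alive(lines):
    """Copy only alive lines, as would be done from the temporary vector."""
    return [dict(line) for line in lines if line["alive"]]


# Temporary "vector": two unit squares sharing the edge (1,0)-(1,1).
temp = (ring_segments([(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)])
        + ring_segments([(1, 0), (2, 0), (2, 1), (1, 1), (1, 0)]))
mark_duplicates_dead(temp)
output = copy_alive(temp)
print(len(temp), len(output))
```

The temporary list still holds the dead duplicate, while the output holds each boundary once.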

My ultimate testing shapefile that I referred to above has a size of 4.6
MB and 3421 polygons. I would laugh at that and expect seconds to import
it. Then I noticed that the coor file created by GRASS is 4.5 GB = 4608 MB
large (no typo). That is the size after reading in all boundaries, before
cleaning. I'm not done yet with cleaning and am confident that the size
will go over 5 GB, the possible maximum size should be at least below 8
GB. If anybody out there gets as far as "Remove duplicates:" with the
current v.in.ogr version in 6.4.0.RC3 or devbr_6, I will be ready to
deliver a substantial prize. Let's say I pay your next pizza delivery :-)
I will send that shapefile on request.

This innocent looking little shapefile has one polygon with thousands of
islands, that one is responsible for the large coor file and the long time
needed to import that shapefile. I will manage to import it (some more
hours later, still busy) and thus prove that my suggested improvement No
2) does indeed make sense.

Markus M

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/426#comment:1>
GRASS GIS <http://grass.osgeo.org>

Comment (by cgsbob):

In Ticket #397 I reported that some large holes are being filled after
going through v.clean. This seems to happen when this large hole is
surrounded by a large and complex polygon. Do you think that your
v.in.ogr might solve my problem?

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/426#comment:2>
GRASS GIS <http://grass.osgeo.org>

Comment (by mmetz):

Replying to [comment:2 cgsbob]:
> In Ticket #397 I reported that some large holes are being filled after
> going through v.clean. This seems to happen when this large hole is
> surrounded by a large and complex polygon. Do you think that your
> v.in.ogr might solve my problem?

"My" v.in.ogr is not any different from "your" v.in.ogr in that respect.
You can use "v.in.ogr min_area=2800", but I don't think it's a good idea
to export your vector and then re-import it again, you may well lose some
information on the way.

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/426#comment:3>
GRASS GIS <http://grass.osgeo.org>

Comment (by cgsbob):

Replying to [comment:3 mmetz]:
> Replying to [comment:2 cgsbob]:
> > In Ticket #397 I reported that some large holes are being filled after
> > going through v.clean. This seems to happen when this large hole is
> > surrounded by a large and complex polygon. Do you think that your
> > v.in.ogr might solve my problem?
>
> "My" v.in.ogr is not any different from "your" v.in.ogr in that respect.
> You can use "v.in.ogr min_area=2800", but I don't think it's a good idea
> to export your vector and then re-import it again, you may well lose some
> information on the way.

I agree, but as you can see in ticket #397, I have not been able to clean
up my vector. I was just saying that if I did re-import my vector into
v.in.ogr w/ boundary splitting and then used v.clean tool=rmarea, I might
get something better than a filled large hole.

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/426#comment:4>
GRASS GIS <http://grass.osgeo.org>

Comment (by mmetz):

The attached patch is against devbr_6 r35761 and would give a bit of a
speed up and adds safety checks.

I added boundary splitting activated with a new -s flag, the speed
increase is zero to 6x, depending on the import vector.

The patch uses a temporary vector, this can reduce final coor file size by
a factor of 2 to 5 if a shapefile with polygons is imported.

Vector libraries do not support large files > 2GB, therefore my complaint
no 2) above is not valid and the error message is valid.

Closing a vector and opening again works like a file size limit check. The
patch does that not only before cleaning polygons as in the current
v.in.ogr (this is a very nice idea, it would be annoying to get a vector
cleaned for hours that was corrupt anyway...), it also checks file size
limits just before copying the temporary vector to the final output
vector. If that check is passed, only alive features of the temporary
vector are copied to the output vector, giving the file size reduction in
case polygons were cleaned. New warning messages appear when checking file
size limits (checking file size limits is done by the vector library when
opening a vector, I didn't come up with a weird new solution:-)).
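
For illustration only, a file size limit check of this kind could look like the sketch below. The thread does not show how the vector library performs its check, so the limit constant (largest signed 32-bit offset, matching the "> 2GB" remark) and the function names are assumptions.

```python
import os

# Assumption: without large file support, offsets are signed 32-bit values.
MAX_OFFSET = 2**31 - 1


def fits_without_lfs(size_bytes):
    """True if a file of this size is addressable with 32-bit offsets."""
    return size_bytes <= MAX_OFFSET


def coor_file_ok(path):
    """Check an existing file on disk against the limit."""
    return fits_without_lfs(os.path.getsize(path))


# Sizes mentioned earlier in this ticket: a 4.6 MB shapefile and the
# 4608 MB coor file it produced before cleaning.
print(fits_without_lfs(4_600_000))
print(fits_without_lfs(4608 * 1024 * 1024))
```

Under that assumption the shapefile itself is far below the limit, while the coor file from the earlier comment is more than twice as large as what can be addressed.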

Please test!

Markus M

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/426#comment:5>
GRASS GIS <http://grass.osgeo.org>

Comment (by mmetz):

After lots of testing and debugging with the help of Markus Neteler, I
would like to add boundary splitting in trunk and maybe in devbr6, but
want to ask first about how to implement it:

1) enable boundary splitting by default, disable with new flag. That's my
favorite.

2) enable boundary splitting with new flag, don't split by default.

3) always split boundaries, no option to disable.

4) forget about it, leave v.in.ogr as it is.

If boundary splitting gets added to v.in.ogr, the man page would be
updated with a hint to use v.build.polylines if boundary segments should
be joined again.

With boundary splitting, v.in.ogr is roughly 2x (1.5x - 3x) faster on
shapefiles with areas. That's not as much as I hoped for, but something.

Please vote :-)

Markus M

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/426#comment:6>
GRASS GIS <http://grass.osgeo.org>

Comment (by mlennert):

Replying to [comment:6 mmetz]:
> After lots of testing and debugging with the help of Markus Neteler, I
> would like to add boundary splitting in trunk and maybe in devbr6,

Great job !

> but want to ask first about how to implement it:

Before being able to answer this it would help to know what the impacts of
boundary splitting on any other vector operations might be. Speeding up
import is good, but if it causes any disruptions (or slow-downs) further
on, it might be preferable to disable it by default. If not I clearly
vote for:

>
> 1) enable boundary splitting by default, disable with new flag. That's
> my favorite.

Moritz

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/426#comment:7>
GRASS GIS <http://grass.osgeo.org>

Comment (by mmetz):

Replying to [comment:7 mlennert]:
> Replying to [comment:6 mmetz]:
>
> > want to ask first about how to implement it:
>
> Before being able to answer this it would help to know what the impacts
> of boundary splitting on any other vector operations might be. Speeding up
> import is good, but if it causes any disruptions (or slow-downs) further
> on, it might be preferable to disable it by default.

AFAICT there is one general trade-off: boundary splitting results in more
boundaries -> more disk space needed and more memory needed, particularly
for topology and the spatial index.

I did not thoroughly test any disruptions (or slow-downs) further on, so
far everything works. I'm also not sure if the speed gain is lost when you
have to run v.build.polylines afterward. This new feature needs more
testing with regard to its impact on other vector operations, therefore my
favorite is 1), submitting to trunk only and see how it performs.

Markus M

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/426#comment:8>
GRASS GIS <http://grass.osgeo.org>

Comment (by hamish):

Replying to [comment:6 mmetz]:
> After lots of testing and debugging with the help of Markus
> Neteler, I would like to add boundary splitting in trunk and
> maybe in devbr6, but want to ask first about how to implement it:
>
> 1) enable boundary splitting by default, disable with new flag.
> That's my favorite.

it would be nice to keep a method to import exact data, even if splitting
will be the default.

how would it be split? after a certain number of vertices of a polyline
like v.split or every "n" map units in a grid like v.in.gshhs?

note v.generalize currently requires to run v.build.polylines first to get
correct output. I don't know if that is a feature or a bug.
but if breaking polylines was the default for v.in.ogr it could become a
widespread subtle problem for it.

Hamish

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/426#comment:9>
GRASS GIS <http://grass.osgeo.org>

Comment (by mmetz):

Replying to [comment:9 hamish]:
>
> it would be nice to keep a method to import exact data, even if
> splitting will be the default.
>
Do you mean, don't split when polygons are not cleaned (-c flag)?

> how would it be split? after a certain number of vertices of a polyline
> like v.split or every "n" map units in a grid like v.in.gshhs?
>
I'm currently using line length in map units. The threshold is guesstimated
from feature density in each layer. BTW, v.in.gshhs restricts bounding box
dimensions, similar to every "n" map units in a grid, but better. Using
line length is faster than using max bbox dimensions.

I do the splitting during OGRFeature import in order not to inflate the
coor file size. Independent of splitting, after cleaning, more than 50% of
the coor file are dead lines, IOW if you would copy only alive lines to a
final output vector you could reduce the coor file size by more than 50%.
Splitting so early means that final boundary segments are shorter than the
threshold used because both "break polygons" and "break lines" break lines
at intersections. That's the reason why I don't want to make splitting
threshold a user option to avoid complaints.

The alternative would be to split after "break polygons" and "remove
duplicates", just before "break boundaries". Boundaries would then have a
max length of threshold, but this further inflates coor file size. The
reason why I don't want to further inflate the coor file size with more
dead lines is missing LFS support in the vector libs...
>
> note v.generalize currently requires to run v.build.polylines first to
> get correct output. I don't know if that is a feature or a bug.
> but if breaking polylines was the default for v.in.ogr it could become a
> widespread subtle problem for it.
>
Hmm, I think v.build.polylines should be run first anyway for
v.generalize. The cleaning process in v.in.ogr usually generates boundary
segments independent of whether boundaries are split.

Markus M

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/426#comment:10>
GRASS GIS <http://grass.osgeo.org>

Comment (by mmetz):

Fixed in trunk r37790.

Boundaries are split when cleaning, splitting is disabled when not
cleaning, no extra splitting flag. Additionally, boundaries are merged
later on, splitting is only needed to make Vect_break_lines() faster.
There is now a new cleaning tool in vector libs, Vect_merge_lines(). The
resulting vector should be topologically (c)leaner (less nodes and ready
to use with e.g. v.generalize) than grass6 imports.
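
To illustrate what merging split boundaries back together means, here is a small sketch that joins lines end to end at nodes where exactly two line ends meet. It is not the implementation of Vect_merge_lines(); the thread only names that function, and any further conditions it applies (for example topology checks for boundaries) are not modelled here. Lines are plain lists of (x, y) tuples and the function name is made up.

```python
from collections import defaultdict


def merge_at_simple_nodes(lines):
    """Repeatedly join two different lines that meet at a node where
    no other line ends. Merged lines may come out reversed."""
    lines = [list(line) for line in lines]
    merged = True
    while merged:
        merged = False
        ends = defaultdict(list)  # node -> indices of lines ending there
        for idx, line in enumerate(lines):
            ends[line[0]].append(idx)
            ends[line[-1]].append(idx)
        for node, idxs in ends.items():
            # Exactly two line ends, belonging to two different lines.
            if len(idxs) == 2 and idxs[0] != idxs[1]:
                a, b = lines[idxs[0]], lines[idxs[1]]
                if a[-1] != node:
                    a = a[::-1]  # orient a so that it ends at the node
                if b[0] != node:
                    b = b[::-1]  # orient b so that it starts at the node
                joined = a + b[1:]
                lines = [l for k, l in enumerate(lines) if k not in idxs]
                lines.append(joined)
                merged = True
                break  # indices changed, rebuild the node table
    return lines


# Three pieces of one split boundary are joined into a single line.
pieces = [[(0, 0), (1, 0)], [(1, 0), (2, 0)], [(2, 0), (3, 0)]]
result = merge_at_simple_nodes(pieces)

# With a third line ending at (1, 0), that node is left alone.
with_junction = merge_at_simple_nodes(pieces + [[(1, 0), (1, 1)]])
print(len(result), len(with_junction))
```

Nodes where three or more line ends meet are kept, so only the extra nodes introduced by splitting (and similar two-line nodes) disappear.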

Also new is the use of a temporary vector to reduce final coor file size
by a factor 2 to 5 for area imports.

Despite the new boundary merging and use of a temp vector, area imports
are considerably faster than with grass6, please test!

Markus M

--
Ticket URL: <https://trac.osgeo.org/grass/ticket/426#comment:11>
GRASS GIS <http://grass.osgeo.org>

#426: v.in.ogr: split long boundaries
--------------------------+-------------------------------------------------
  Reporter: neteler | Owner: grass-dev@…
      Type: enhancement | Status: closed
  Priority: major | Milestone: 6.5.0
Component: Vector | Version: unspecified
Resolution: fixed | Keywords:
  Platform: All | Cpu: All
--------------------------+-------------------------------------------------
Changes (by mmetz):

  * status: new => closed
  * resolution: => fixed

Comment:

Applied to all branches, closing ticket.

--
Ticket URL: <https://trac.osgeo.org/grass/ticket/426#comment:12>
GRASS GIS <http://grass.osgeo.org>