[GRASS-user] speeding up v.clean for large datasets

Hi All, we're looking for ways to speed up the cleaning of a large OSM road network (covering Australia). We're running on a large Amazon AWS EC2 instance.

What we've observed is that the time taken grows far faster than linearly as the number of linestrings increases.

This means it's taking about 3 days to clean the entire network.

We were wondering: if we were to split the dataset into, say, 4 subregions and clean each separately, is it then possible to patch them back together at the end without having to run v.clean afterwards? We want to be able to run v.net over the entire network spanning the subregions.

Alternatively, has anyone found a way to speed up v.clean for large network datasets?

GRASS 6.4.3svn (road_network):/data/grassdata > v.clean input=osm_roads output=osm_roads_cleaned tool=break,rmdupl
--------------------------------------------------
Tool: Threshold
Break: 0.000000e+00
Remove duplicates: 0.000000e+00
--------------------------------------------------
Copying vector lines...
Rebuilding parts of topology...
Building topology for vector map <osm_roads_cleaned>...
Registering primitives...
971074 primitives registered
13142529 vertices registered
Number of nodes: 1458192
Number of primitives: 971074
Number of points: 0
Number of lines: 971074
Number of boundaries: 0
Number of centroids: 0
Number of areas: -
Number of isles: -
--------------------------------------------------
Tool: Break lines at intersections

On Fri, Apr 19, 2013 at 9:06 AM, Mark Wynter <mark@dimensionaledge.com> wrote:

...

Alternatively, has anyone found a way to speed up v.clean for large network datasets?

Yes, implemented in GRASS 7 ;-)

Also, when breaking lines it is recommended to first split the lines
into smaller segments with v.split using the vertices option. Then run
v.clean tool=break. After that, use v.build.polylines to merge the
lines again. Or, in GRASS 7, use the -c flag with v.clean tool=break
type=line: the rmdupl tool is then added automatically, and the
splitting and merging are done internally.
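
A rough sketch of both workflows (untested as written; the vertices
value and map names are only examples):

# GRASS 6: split long lines, break at intersections and remove
# duplicates, then merge the segments back into polylines
v.split input=osm_roads output=osm_roads_split vertices=20
v.clean input=osm_roads_split output=osm_roads_clean tool=break,rmdupl
v.build.polylines input=osm_roads_clean output=osm_roads_merged cats=first

# GRASS 7: splitting, rmdupl and merging are handled internally
v.clean -c input=osm_roads output=osm_roads_clean tool=break type=line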

Markus M


On Fri, Apr 19, 2013 at 10:07 AM, Markus Metz
<markus.metz.giswork@gmail.com> wrote:

On Fri, Apr 19, 2013 at 9:06 AM, Mark Wynter <mark@dimensionaledge.com> wrote:

...

Alternatively, has anyone found a way to speed up v.clean for large network datasets?

Yes, implemented in GRASS 7 ;-)

...

Added to http://grasswiki.osgeo.org/wiki/Vector_topology_cleaning

markusN

Thanks Markus.
Upgraded to GRASS 7 and re-ran v.clean on the same OSM Australia dataset.
It was substantially faster. The bulk of the time was spent removing duplicates, and it got progressively slower as the process approached 100%. Overall it took 12 hours, but I'm wondering how it would perform if we were to run v.clean on even larger road networks, e.g. the USA or Europe.

I'm tempted to try dividing the input dataset into, say, 4 smaller subregions (i.e. vector tiles), and then patching them back together.
I read that we will still need to run v.clean over the patched dataset to remove duplicates.
Since the only duplicates should be nodes along the common tile edges, is there a way to, in effect, constrain the v.clean process to slivers containing the common edges?
I've had a quick go with g.region, but to no avail. The workflow I have in mind is sketched below.
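
In outline (map names are just placeholders):

# clean each tile separately (likewise for tiles 2-4)
v.clean -c input=roads_tile1 output=roads_tile1_clean tool=break type=line
# patch the cleaned tiles back together
v.patch input=roads_tile1_clean,roads_tile2_clean,roads_tile3_clean,roads_tile4_clean output=roads_patched
# duplicates should now occur only along the shared tile edges
v.clean input=roads_patched output=roads_final tool=break,rmdupl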

Thanks

GRASS 7.0.svn (PERMANENT):/data/grassdata > v.clean input=osm_roads_split output=osm_roads_split_cleaned tool=break type=line -c
--------------------------------------------------
Tool: Threshold
Break: 0
--------------------------------------------------
Copying vector features...
Copying features...
100%
Rebuilding parts of topology...
Building topology for vector map <osm_roads_split_cleaned@PERMANENT>...
Registering primitives...
971074 primitives registered
13142529 vertices registered
Number of nodes: 1458192
Number of primitives: 971074
Number of points: 0
Number of lines: 971074
Number of boundaries: 0
Number of centroids: 0
Number of areas: -
Number of isles: -
--------------------------------------------------
Tool: Break lines at intersections
100%
Tool: Remove duplicates
100%
--------------------------------------------------
Rebuilding topology for output vector map...
Building topology for vector map <osm_roads_split_cleaned@PERMANENT>...
Registering primitives...
2462829 primitives registered
13322052 vertices registered
Building areas...
100%
0 areas built
0 isles built
Attaching islands...
Attaching centroids...
100%
Number of nodes: 1819237
Number of primitives: 2462829
Number of points: 0
Number of lines: 2462829
Number of boundaries: 0
Number of centroids: 0
Number of areas: 0
Number of isles: 0

On 19/04/2013, at 6:07 PM, Markus Metz wrote:

...

On Sat, Apr 20, 2013 at 3:02 AM, Mark Wynter <mark@dimensionaledge.com> wrote:

Thanks Markus.
Upgraded to GRASS 7 and re-ran v.clean on the same OSM Australia dataset.
It was substantially faster. The bulk of the time was spent removing duplicates, and it got progressively slower as the process approached 100%. Overall it took 12 hours, but I'm wondering how it would perform if we were to run v.clean on even larger road networks, e.g. the USA or Europe.

Something is wrong there. Your dataset has 971074 roads; I tested with
an OSM dataset with 2645287 roads, 2.7 times as many as in your
dataset. Cleaning those 2645287 lines took me less than 15 minutes. I
suspect a slow database backend (dbf). Try using sqlite as the
database backend:

db.connect driver=sqlite
database=$GISDBASE/$LOCATION_NAME/$MAPSET/sqlite/sqlite.db

Do not substitute the variables; GRASS expands them itself for the current mapset.
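
You can verify the current settings with:

db.connect -p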

HTH,

Markus M


Thanks Markus.
Tried the sqlite backend suggestion - no improvement - then read that sqlite is already the default backend for GRASS 7.
I suspect the complexity of the input dataset may be a contributing factor. For example, I ran v.clean over the already-cleaned OSM dataset (2.6M lines), and it took only a few minutes since there were no intersections and no duplicates to remove.


On Mon, Apr 22, 2013 at 11:03 AM, Mark Wynter <mark@dimensionaledge.com> wrote:

Thanks Markus.
Tried the sqlite backend suggestion - no improvement - then read that sqlite is already the default backend for GRASS 7.
I suspect the complexity of the input dataset may be a contributing factor. For example, I ran v.clean over the already-cleaned OSM dataset (2.6M lines), and it took only a few minutes since there were no intersections and no duplicates to remove.

I tested with an OSM road vector with 2.6M lines; the output has 5.3M
lines: lots of intersections and duplicates, which were cleaned in
less than 15 minutes.

I am surprised that you experience slow removal of duplicates;
breaking lines should take much longer.

About why removing duplicates takes longer towards the end: when you
have 5 lines that could be duplicates, you could check

1 with 2, 3, 4, 5
2 with 1, 3, 4, 5
3 with 1, 2, 4, 5
4 with 1, 2, 3, 5
5 with 1, 2, 3, 4

or checking each combination only once:

1 with 2, 3, 4, 5
2 with 3, 4, 5
3 with 4, 5
4 with 5

alternatively

2 with 1
3 with 1, 2
4 with 1, 2, 3
5 with 1, 2, 3, 4

The current implementation uses the latter: each new line is compared against all lines processed so far, so the cost per line grows as the tool approaches 100%.
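
As a toy shell illustration of how the comparison count grows (n = 5 lines):

# line i is compared only against lines 1..i-1
total=0
for i in 2 3 4 5; do
    comps=$((i - 1))
    total=$((total + comps))
    echo "line $i: $comps comparisons, $total in total"
done
# total = n*(n-1)/2 = 10 comparisons for n = 5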

Markus M


Thanks Markus for the explanation. I've set PostGIS as my backend. I'll report back as I get further into v.net.


On Fri, Apr 26, 2013 at 8:33 AM, Mark Wynter <mark@dimensionaledge.com> wrote:

Thanks Markus for the explanation. I've set PostGIS as my backend. I'll report back as I get further into v.net.

Oops. Direct PostGIS access is 1) experimental and 2) slow. For vector
operations it is strongly recommended to use the native GRASS vector
format and to import vectors first with v.in.ogr. v.external and
v.external.out should not be used, i.e. v.external.out -g should
report format=native.
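
For example (the PostGIS datasource is only a placeholder, and option
names may differ slightly between GRASS versions):

# should report format=native
v.external.out -g
# import from PostGIS into the native GRASS vector format
v.in.ogr dsn="PG:dbname=osm user=postgres" layer=roads output=osm_roads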

Markus M


2013/4/26 Markus Metz <markus.metz.giswork@gmail.com>:

On Fri, Apr 26, 2013 at 8:33 AM, Mark Wynter <mark@dimensionaledge.com> wrote:

Thanks Markus for the explanation. I've set PostGIS as my backend. I'll report back as I get further into v.net.

Oops. Direct PostGIS access is 1) experimental and 2) slow. ...

Right, currently it's very slow; I'm working on improvements. The code
is in trunk, and I will announce the native PostGIS support when it is
ready for testing (the roadmap says July; unfortunately, until June I
need to focus on other important tasks). In any case, a partial
speed-up (for random access and for building areas from boundaries)
should be committed to trunk within a few days.

Martin

--
Martin Landa <landa.martin gmail.com> * http://geo.fsv.cvut.cz/~landa