[GRASS-dev] how to increase performance of v.out.ogr to geopackage ?

Hi,

I have large amount of vector maps (~ 400) which each has hundreds of thousands of areas and over 150 attribute columns. I'm looking for the most efficient way to export these to a format I can share with non-GRASS GIS users. My favourite would probably be Geopackage, amongst others because the attribute column names are sometimes fairly long, thus making export to a DBF based format not easily feasible.

I've seen several sources of information about possible performance increases, either in the SQLite world (as a Geopackage is based on SQLite) or in the OGR SQLite driver page.

For OGR [1], there is the -gt option to ogr2ogr which allows increasing the number of features per transaction (or even running the entire conversion in one single transaction). There is also mention of the OGR_SQLITE_CACHE and OGR_SQLITE_SYNCHRONOUS configuration options. However, I don't seem to understand if, and if yes, how I can set any of these when using v.out.ogr. In r73117, MarkusM added config and dataset open options to r.in.gdal and v.in.ogr. Would we need something like this to be able to set the above config options ? How about the transaction option ?

On the SQLite side, there are some pragma statements that allow increasing performance (generally at the cost of less data security, but in some cases this might be acceptable), such as journal_mode [2] or synchronous [3] (which I suppose is what OGR_SQLITE_SYNCHRONOUS sets as well), or possibly locking_mode (I don't know how this is handled internally in OGR).

Again, my question would be how these could be set when exporting data with v.out.ogr.

Other interesting options (which would also influence performance) might be a flag in v.out.ogr allowing exporting without attribute data (although a simple v.db.connect -d before export would probably be enough) and allowing selecting a subset of columns to export. But that's something else.

Moritz

[1] https://gdal.org/drv_sqlite.html
[2] https://sqlite.org/pragma.html#pragma_journal_mode
[3] https://sqlite.org/pragma.html#pragma_synchronous

On Thu, Nov 29, 2018 at 3:43 PM Moritz Lennert <mlennert@club.worldonline.be> wrote:

Hi,

I have large amount of vector maps (~ 400) which each has hundreds of
thousands of areas and over 150 attribute columns. I’m looking for the
most efficient way to export these to a format I can share with
non-GRASS GIS users. My favourite would probably be Geopackage, amongst
others because the attribute column names are sometimes fairly long,
thus making export to a DBF based format not easily feasible.

I’ve seen several sources of information about possible performance
increases, either in the SQLite world (as a Geopackage is based on
SQLite) or in the OGR SQLite driver page.

For OGR [1], there is the -gt option to ogr2ogr which allows increasing
the number of features per transaction (or even running the entire
conversion in one single transaction). There is also mention of the
OGR_SQLITE_CACHE and OGR_SQLITE_SYNCHRONOUS configuration options.
However, I don’t seem to understand if, and if yes, how I can set any of
these when using v.out.ogr. In r73117, MarkusM added config and dataset
open options to r.in.gdal and v.in.ogr. Would we need something like
this to be able to set the above config options ? How about the
transaction option ?

For GDAL/OGR, you can set configuration options in various ways [0], e.g.

OGR_SQLITE_CACHE=1024 OGR_SQLITE_SYNCHRONOUS=OFF v.out.ogr …

I’m afraid the transaction option (equivalent to ogr2ogr -gt) is out of control of v.out.ogr because it is format-specific and v.out.ogr does not use the output format drivers directly, instead it uses the generic OGR interface. However, we could try GDALDatasetStartTransaction()/GDALDatasetCommitTransaction() in v.out.ogr.

Markus M

[0] https://trac.osgeo.org/gdal/wiki/ConfigOptions

On the SQLite side, there are some pragma statements that allow
increasing performance (generally at the cost of less data security, but
in some cases this might be acceptable), such as journal_mode [2] or
synchronous [3] (which I suppose is what OGR_SQLITE_SYNCHRONOUS sets as
well), or possibly locking_mode (I don’t know how this is handled
internally in OGR).

Again, my question would be how these could be set when exporting data
with v.out.ogr.

Other interesting options (which would also influence performance)
might be a flag in v.out.ogr allowing exporting without attribute data
(although a simple v.db.connect -d before export would probably be
enough) and allowing selecting a subset of columns to export. But that’s
something else.

Moritz

[1] https://gdal.org/drv_sqlite.html
[2] https://sqlite.org/pragma.html#pragma_journal_mode
[3] https://sqlite.org/pragma.html#pragma_synchronous


grass-dev mailing list
grass-dev@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/grass-dev

On Thu, Dec 20, 2018 at 10:34 AM Markus Metz <markus.metz.giswork@gmail.com> wrote:

On Thu, Nov 29, 2018 at 3:43 PM Moritz Lennert <mlennert@club.worldonline.be> wrote:

Hi,

I have large amount of vector maps (~ 400) which each has hundreds of
thousands of areas and over 150 attribute columns. I’m looking for the
most efficient way to export these to a format I can share with
non-GRASS GIS users. My favourite would probably be Geopackage, amongst
others because the attribute column names are sometimes fairly long,
thus making export to a DBF based format not easily feasible.

I’ve seen several sources of information about possible performance
increases, either in the SQLite world (as a Geopackage is based on
SQLite) or in the OGR SQLite driver page.

For OGR [1], there is the -gt option to ogr2ogr which allows increasing
the number of features per transaction (or even running the entire

conversion in one single transaction).

[…]

How about the
transaction option ?

I’m afraid the transaction option (equivalent to ogr2ogr -gt) is out of control of v.out.ogr because it is format-specific and v.out.ogr does not use the output format drivers directly, instead it uses the generic OGR interface. However, we could try GDALDatasetStartTransaction()/GDALDatasetCommitTransaction() in v.out.ogr.

v.out.ogr already uses the OGR interface to start and commit transactions when possible.

Another reason for slow export can be a buggy NFS implementation if you export to a NFS mount: check if v.out.ogr is running at 100%. If not you could export to a local temp dir and then copy the GeoPackages to the final destination.

Markus M

On 20/12/18 20:10, Markus Metz wrote:

On Thu, Dec 20, 2018 at 10:34 AM Markus Metz <markus.metz.giswork@gmail.com <mailto:markus.metz.giswork@gmail.com>> wrote:
>
> On Thu, Nov 29, 2018 at 3:43 PM Moritz Lennert <mlennert@club.worldonline.be <mailto:mlennert@club.worldonline.be>> wrote:
> >
> > Hi,
> >
> > I have large amount of vector maps (~ 400) which each has hundreds of
> > thousands of areas and over 150 attribute columns. I'm looking for the
> > most efficient way to export these to a format I can share with
> > non-GRASS GIS users. My favourite would probably be Geopackage, amongst
> > others because the attribute column names are sometimes fairly long,
> > thus making export to a DBF based format not easily feasible.
> >
> > I've seen several sources of information about possible performance
> > increases, either in the SQLite world (as a Geopackage is based on
> > SQLite) or in the OGR SQLite driver page.
> >
> > For OGR [1], there is the -gt option to ogr2ogr which allows increasing
> > the number of features per transaction (or even running the entire
> > conversion in one single transaction).
[...]
> > How about the
> > transaction option ?
>
> I'm afraid the transaction option (equivalent to ogr2ogr -gt) is out of control of v.out.ogr because it is format-specific and v.out.ogr does not use the output format drivers directly, instead it uses the generic OGR interface. However, we could try GDALDatasetStartTransaction()/GDALDatasetCommitTransaction() in v.out.ogr.

v.out.ogr already uses the OGR interface to start and commit transactions when possible.

Another reason for slow export can be a buggy NFS implementation if you export to a NFS mount: check if v.out.ogr is running at 100%. If not you could export to a local temp dir and then copy the GeoPackages to the final destination.

Thanks for all the hints. Using

OGR_SQLITE_CACHE=1024 OGR_SQLITE_SYNCHRONOUS=OFF v.out.ogr ...

should already help.

In this particular case export was happening to local disk, so NFS was not an issue.

Moritz

Hi Ken,

On 20/12/18 16:08, Ken Mankoff wrote:

On Thu, Nov 29, 2018 at 3:43 PM Moritz Lennert <mlennert@club.worldonline.be>
wrote:

I have large amount of vector maps (~ 400) which each has hundreds of
thousands of areas and over 150 attribute columns. I'm looking for
the most efficient way to export these to a format I can share with
non-GRASS GIS users.

Do you need to combine them? If not, my preferred solution to speed things up is to use GNU parallel.

Parallelization is obviously part of a solution, but I would guess that one hits I/O limits at one point if you try to export too many maps in parallel to the same disk.

I realize your message was to GRASS dev and not users though, so you may be looking for a method to speed GRASS up, not necessarily your specific workflow.

The specific question was about the use of the OGR variables for speeding up the process of writing to SQLite DBs, but parallelization is always a path to look at as well.

Moritz

On Thu, Dec 20, 2018 at 10:34 AM Markus Metz
<markus.metz.giswork@gmail.com> wrote:

On Thu, Nov 29, 2018 at 3:43 PM Moritz Lennert <mlennert@club.worldonline.be> wrote:
>
> Hi,
>
> I have large amount of vector maps (~ 400) which each has hundreds of
> thousands of areas and over 150 attribute columns. I'm looking for the
> most efficient way to export these to a format I can share with
> non-GRASS GIS users. My favourite would probably be Geopackage, amongst
> others because the attribute column names are sometimes fairly long,
> thus making export to a DBF based format not easily feasible.

...

For GDAL/OGR, you can set configuration options in various ways [0], e.g.

OGR_SQLITE_CACHE=1024 OGR_SQLITE_SYNCHRONOUS=OFF v.out.ogr ...

...

[0] https://trac.osgeo.org/gdal/wiki/ConfigOptions

For the record, I have added these two variables on this overview page:
https://trac.osgeo.org/gdal/wiki/ConfigOptions#OGR_SQLITE_CACHE

and in r73866 to the v.out.ogr manual.

markusN

--
Markus Neteler, PhD
http://www.mundialis.de - free data with free software
http://grass.osgeo.org
http://courses.neteler.org/blog

On Thu, Dec 27, 2018 at 11:02 AM Markus Neteler <neteler@osgeo.org> wrote:

On Thu, Dec 20, 2018 at 10:34 AM Markus Metz
<markus.metz.giswork@gmail.com> wrote:
> On Thu, Nov 29, 2018 at 3:43 PM Moritz Lennert <mlennert@club.worldonline.be> wrote:
> >
> > Hi,
> >
> > I have large amount of vector maps (~ 400) which each has hundreds of
> > thousands of areas and over 150 attribute columns. I'm looking for the
> > most efficient way to export these to a format I can share with
> > non-GRASS GIS users. My favourite would probably be Geopackage, amongst
> > others because the attribute column names are sometimes fairly long,
> > thus making export to a DBF based format not easily feasible.
...
> For GDAL/OGR, you can set configuration options in various ways [0], e.g.
>
> OGR_SQLITE_CACHE=1024 OGR_SQLITE_SYNCHRONOUS=OFF v.out.ogr ...

# default settings
GRASS 7.6.svn (latlong_wgs84):~ > time -p v.out.ogr
input=brd_plz_gebiete_pop2011 output=brd_plz_gebiete_pop2011.gpkg
Exporting 45618 areas (may take some time)...
...
v.out.ogr complete. 47331 features (Polygon type) written to
<brd_plz_gebiete_pop2011> (GPKG format).
real 10.41
user 6.28
sys 7.16

# CACHE 1024
GRASS 7.6.svn (latlong_wgs84):~ > time -p OGR_SQLITE_CACHE=1024
v.out.ogr input=brd_plz_gebiete_pop2011
output=brd_plz_gebiete_pop2011_2.gpkg
Exporting 45618 areas (may take some time)...
...
v.out.ogr complete. 47331 features (Polygon type) written to
<brd_plz_gebiete_pop2011> (GPKG format).
real 9.71
user 6.11
sys 6.91

# CACHE 2048
GRASS 7.6.svn (latlong_wgs84):~ > time -p OGR_SQLITE_CACHE=2048
v.out.ogr input=brd_plz_gebiete_pop2011
output=brd_plz_gebiete_pop2011_3.gpkg
Exporting 45618 areas (may take some time)...
...
v.out.ogr complete. 47331 features (Polygon type) written to
<brd_plz_gebiete_pop2011> (GPKG format).
real 9.72
user 5.98
sys 6.96

# default again
GRASS 7.6.svn (latlong_wgs84):~ > time -p v.out.ogr
input=brd_plz_gebiete_pop2011 output=brd_plz_gebiete_pop2011_4.gpkg
Exporting 45618 areas (may take some time)...
...
v.out.ogr complete. 47331 features (Polygon type) written to
<brd_plz_gebiete_pop2011> (GPKG format).
real 9.84
user 6.12
sys 7.01

# CACHE AND SYNCHRONOUS=OFF
time -p OGR_SQLITE_CACHE=2048 OGR_SQLITE_SYNCHRONOUS=OFF v.out.ogr
input=brd_plz_gebiete_pop2011 output=brd_plz_gebiete_pop2011_5.gpkg
Exporting 45618 areas (may take some time)...
...
v.out.ogr complete. 47331 features (Polygon type) written to
<brd_plz_gebiete_pop2011> (GPKG format).
real 9.99
user 6.31
sys 6.95

--> (Here) the difference is very small.

Perhaps more striking with larger maps?

markusN