[GRASS-dev] [GRASS GIS] #516: v.extract slow on large datasets

#516: v.extract slow on large datasets
-------------------------+--------------------------------------------------
 Reporter:  gisboa       |       Owner:  grass-dev@lists.osgeo.org
     Type:  enhancement  |      Status:  new
 Priority:  minor        |   Milestone:  6.4.0
Component:  Vector       |     Version:  unspecified
 Keywords:               |    Platform:  All
      Cpu:  All          |
-------------------------+--------------------------------------------------
Using v.extract on large datasets is incredibly slow. From a 3,000,000
areas dataset I extracted the first 99 (id<100). It took 12 minutes to
extract the geometries, after that it says 'writing attributes' for
another 6 minutes. The pg process is a runner-up in top, consuming about
50% cpu time; the remaining 50% goes to v.extract.
What is going on here? Writing a hundred rows to PostgreSQL should take
only a split second. Is this also due to the fact that the geometry index
is not in a file? Would this be another reason to implement the file based
geometry index?
Maybe a few modules should be rewritten to perform a dedicated task on
their own, instead of relying on others, if that makes it slow.

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/516>
GRASS GIS <http://grass.osgeo.org>

Comment (by mmetz):

Replying to [ticket:516 gisboa]:
> Using v.extract on large datasets is incredibly slow. From a 3,000,000
> areas dataset I extracted the first 99 (id<100). It took 12 minutes to
> extract the geometries,

There are probably several reasons for this. The spatial index is built
from topology, which can take a while. The category index used to select
features is rather inefficient for large numbers of categories. These two
aspects are handled by the vector libs. v.extract itself also has potential
for speed improvement. Regarding the vector libs, changes to the spatial
index and the category index will only be made in grass7. Improving
v.extract is possible for grass6; I have some ideas, but I won't get to it
soon, and I don't know whether anybody else will rewrite v.extract soon.
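
To illustrate the category-index point in general terms, here is a minimal
Python sketch of why the lookup structure matters when selecting features
by category. This is not GRASS code: the function names are hypothetical,
and it assumes one category number per feature held in a plain list, which
may differ considerably from what the vector libs actually do.

```python
from bisect import bisect_left


def select_linear(feature_cats, wanted):
    """Return indices of features whose category is in `wanted`.

    `wanted` is a plain list, so every membership test scans it:
    roughly len(feature_cats) * len(wanted) comparisons in total.
    """
    return [i for i, cat in enumerate(feature_cats) if cat in wanted]


def select_sorted(feature_cats, wanted):
    """Same result, but each membership test is a binary search.

    Sorting `wanted` once makes each lookup O(log len(wanted)).
    """
    wanted_sorted = sorted(wanted)
    hits = []
    for i, cat in enumerate(feature_cats):
        pos = bisect_left(wanted_sorted, cat)
        if pos < len(wanted_sorted) and wanted_sorted[pos] == cat:
            hits.append(i)
    return hits
```

Both functions return the same indices; only the cost per lookup differs,
which is the kind of difference that becomes visible with millions of
categories.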

> after that it says 'writing attributes' for another 6 minutes. The pg
> process is a runner-up in top, consuming about 50% cpu time,

I think Glynn answered that in his comment to #513.

> Would this be another reason to implement the file based geometry index?

Probably yes. But that's not easy. There are "off-the-shelf" solutions for
that, but 1) someone needs to evaluate these solutions for their
suitability for grass, and 2) someone has to implement it.

> Maybe a few modules should be rewritten to perform a dedicated task on
> their own, instead of relying on others, if that makes it slow.

AFAICT, v.extract does not rely on other modules; it uses library
functions only. IMHO, modules should not bypass core libraries. If a
particular task is done inefficiently by the core libraries, these
libraries need to be improved. A workaround for a specific module would
only create a mess.

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/516#comment:1>
GRASS GIS <http://grass.osgeo.org>