[GRASS-dev] Speeding up v.out.ogr (again)

Benjamin_Ducke1 · February 29, 2012, 7:23am

Dear Devs,

A while ago, I submitted a small patch for v.out.ogr
that moves the SQL SELECT out of the mk_att() function,
so that this costly operation does not have to be
performed multiple times:

http://lists.osgeo.org/pipermail/grass-dev/2011-January/053150.html

Looking at 6.4.2 and 6.5 SVN copies of v.out.ogr,
I see that SQL SELECT is still (again?) in mk_att(), with
this comment (line 1053):

  /* Fetch all attribute records for cat <cat> */
  /* opening and closing the cursor is slow,
   * but the cursor really needs to be opened for each cat separately */

Is this because of GRASS maps that have more than one
attribute table connection?

If so, then I suggest putting in a condition, so that those
GRASS maps that have just one attribute table can perform
the SQL SELECT outside of mk_att(). Otherwise, we punish
all GRASS users with extremely slow export speed, even though
most of them might not even use multi-table attributes in
their projects. The difference in my test was something like
factor 500!

Best,

Ben

--
Benjamin Ducke
{*} Geospatial Consultant
{*} GIS Developer

benducke AT fastmail.fm

Markus_GRASS · March 1, 2012, 4:41pm

On Wed, Feb 29, 2012 at 8:23 AM, Benjamin Ducke <benducke@fastmail.fm> wrote:

> Dear Devs,
>
> A while ago, I submitted a small patch for v.out.ogr
> that moves the SQL SELECT out of the mk_att() function,
> so that this costly operation does not have to be
> performed multiple times:
>
> http://lists.osgeo.org/pipermail/grass-dev/2011-January/053150.html
>
> Looking at 6.4.2 and 6.5 SVN copies of v.out.ogr,
> I see that SQL SELECT is still (again?) in mk_att(), with
> this comment (line 1053):
>
>   /* Fetch all attribute records for cat <cat> */
>   /* opening and closing the cursor is slow,
>    * but the cursor really needs to be opened for each cat separately */
>
> Is this because of GRASS maps that have more than one
> attribute table connection?

No, this is because the i-th feature does not need to have category i,
it can have any category and multiple categories. Selecting all
attributes at once for all categories is also not memory-safe for
larger vectors.
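
To make the distinction concrete, here is a minimal Python sketch. It is not GRASS code: the feature list, the table contents, and the function names are made up for illustration, and taking the first category of a multi-category feature is a simplification of the sketch. It only shows why attaching the i-th row to the i-th feature goes wrong when features carry arbitrary, multiple, or no categories.

```python
# Hypothetical data: features in map order, each with its list of categories.
# Neither the order nor the values of the categories follow the position.
features = [
    {"fid": 1, "cats": [7]},
    {"fid": 2, "cats": [3, 9]},   # one feature, two categories
    {"fid": 3, "cats": []},       # no category at all
    {"fid": 4, "cats": [1]},
]

# Hypothetical attribute table, as a single "SELECT ... ORDER BY cat" would return it.
rows = [
    {"cat": 1, "name": "a"},
    {"cat": 3, "name": "b"},
    {"cat": 7, "name": "c"},
    {"cat": 9, "name": "d"},
]


def attach_by_position(features, rows):
    """Wrong: assumes the i-th feature belongs to the i-th row."""
    return [(f["fid"], r["name"]) for f, r in zip(features, rows)]


def attach_by_category(features, rows):
    """Look up each feature's first category; None if it has no category."""
    by_cat = {r["cat"]: r for r in rows}
    out = []
    for f in features:
        row = by_cat.get(f["cats"][0]) if f["cats"] else None
        out.append((f["fid"], row["name"] if row else None))
    return out


positional = attach_by_position(features, rows)
by_category = attach_by_category(features, rows)
```

In this made-up data, only feature 2 receives the same attribute either way; the positional pairing swaps the attributes of features 1 and 4 and gives feature 3 a row it should not have.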

> If so, then I suggest putting in a condition, so that those
> GRASS maps that have just one attribute table can perform
> the SQL SELECT outside of mk_att(). Otherwise, we punish
> all GRASS users with extremely slow export speed, even though
> most of them might not even use multi-table attributes in
> their projects. The difference in my test was something like
> factor 500!

Please test if attribute assignment is preserved and attributes are
not randomly swapped. A stream vector created with r.stream.extract
would be a good test case.

Markus M

Benjamin_Ducke1 · March 1, 2012, 7:16pm

> No, this is because the i-th feature does not need to have category i,
> it can have any category and multiple categories. Selecting all
> attributes at once for all categories is also not memory-safe for
> larger vectors.

Hmm, let's say we take the smallest and largest category values
in the map to export, then chop up the range of values into
reasonably sized chunks (let's say max. 1000 features), and select
a chunk at a time. Shouldn't that make sure that we get all
features and also have a guaranteed maximum memory footprint?

If the category values in the GRASS map are not strictly
increasing, then the feature order will be changed in the
output file. However, I am not sure whether this would be
a problem, given that the feature<->attribute links stay
intact?
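
The chunking idea can be sketched in Python with the standard `sqlite3` module. This is an illustration under stated assumptions, not v.out.ogr code: the table `attrs`, its columns, and the helper names are hypothetical, each feature is assumed to have exactly one category, and `cat` is assumed to be unique in the table, so a range of 1000 category values yields at most 1000 rows.

```python
import sqlite3

CHUNK = 1000  # width of one category range (the value suggested above)


def chunk_ranges(cat_min, cat_max, size=CHUNK):
    """Yield inclusive (lo, hi) category ranges covering cat_min..cat_max."""
    lo = cat_min
    while lo <= cat_max:
        hi = min(lo + size - 1, cat_max)
        yield lo, hi
        lo = hi + 1


def export_in_chunks(conn, features, size=CHUNK):
    """Return (fid, cat, name) tuples, fetching attributes one range at a time.

    `features` is a list of (fid, cat) pairs in map order. The output follows
    the category ranges, so its order can differ from the map order.
    """
    cats = [cat for _, cat in features]
    out = []
    for lo, hi in chunk_ranges(min(cats), max(cats), size):
        # Only the rows of the current range are held in memory.
        rows = dict(conn.execute(
            "SELECT cat, name FROM attrs WHERE cat BETWEEN ? AND ?", (lo, hi)))
        # Naive scan over all features; a real module would need a category index.
        for fid, cat in features:
            if lo <= cat <= hi:
                out.append((fid, cat, rows.get(cat)))
    return out


# Hypothetical table and map, only for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE attrs (cat INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO attrs VALUES (?, ?)",
                 [(1, "a"), (5, "b"), (2500, "c")])
features = [(1, 2500), (2, 1), (3, 5)]  # (feature id, category) in map order

result = export_in_chunks(conn, features)
```

With this data the feature stored first (category 2500) is written last, which is the reordering mentioned above, while every feature still ends up with the attribute of its own category. Sparse category ranges would also produce queries that return nothing.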

>>
>> If so, then I suggest putting in a condition, so that those
>> GRASS maps that have just one attribute table can perform
>> the SQL SELECT outside of mk_att(). Otherwise, we punish
>> all GRASS users with extremely slow export speed, even though
>> most of them might not even use multi-table attributes in
>> their projects. The difference in my test was something like
>> factor 500!
>
> Please test if attribute assignment is preserved and attributes are
> not randomly swapped. A stream vector created with r.stream.extract
> would be a good test case.

Will do.

Ben

> Markus M

Markus_GRASS · March 2, 2012, 7:15pm

On Thu, Mar 1, 2012 at 8:16 PM, Benjamin Ducke <benducke@fastmail.fm> wrote:

>> No, this is because the i-th feature does not need to have category i,
>> it can have any category and multiple categories. Selecting all
>> attributes at once for all categories is also not memory-safe for
>> larger vectors.

> Hmm, let's say we take the smallest and largest category values
> in the map to export, then chop up the range of values into
> reasonably sized chunks (let's say max. 1000 features), and select
> a chunk at a time. Shouldn't that make sure that we get all
> features and also have a guaranteed maximum memory footprint?
>
> If the category values in the GRASS map are not strictly
> increasing, then the feature order will be changed in the
> output file. However, I am not sure whether this would be
> a problem, given that the feature<->attribute links stay
> intact?

I think I understand your error. You confuse feature id with category
value. The feature order in the output file depends on the feature
order of the GRASS input vector, and the feature order of the GRASS
input vector has absolutely nothing to do with the category order.
There was a good reason why I wrote "the cursor really needs to be
opened for each cat separately".

You have again introduced a serious bug (the same one for the second
time) in one of the main GRASS modules. Please revert this change
immediately or I have to revert this change, again.

Thanks,

Markus M

_______________________________________________
grass-dev mailing list
grass-dev@lists.osgeo.org
http://lists.osgeo.org/mailman/listinfo/grass-dev

Benjamin_Ducke1 · March 2, 2012, 9:48pm

> I think I understand your error. You confuse feature id with category
> value. The feature order in the output file depends on the feature
> order of the GRASS input vector, and the feature order of the GRASS
> input vector has absolutely nothing to do with the category order.
> There was a good reason why I wrote "the cursor really needs to be
> opened for each cat separately".

I thought so. Just didn't know what that reason was.
That's why I asked.

> You have again introduced a serious bug (the same one for the second
> time) in one of the main GRASS modules. Please revert this change
> immediately or I have to revert this change, again.

I haven't touched anything, so there's nothing to revert.
Just wanted to discuss some thoughts on this list, that's all.

But I still think that something needs to be done
to speed up v.out.ogr.

Best,

Ben

Markus_GRASS · March 3, 2012, 2:03pm

On 3/2/12, Benjamin Ducke <benducke@fastmail.fm> wrote:

>> I think I understand your error. You confuse feature id with category
>> value. The feature order in the output file depends on the feature
>> order of the GRASS input vector, and the feature order of the GRASS
>> input vector has absolutely nothing to do with the category order.
>> There was a good reason why I wrote "the cursor really needs to be
>> opened for each cat separately".
>
> I thought so. Just didn't know what that reason was.
> That's why I asked.
>
>> You have again introduced a serious bug (the same one for the second
>> time) in one of the main GRASS modules. Please revert this change
>> immediately or I have to revert this change, again.
>
> I haven't touched anything, so there's nothing to revert.
> Just wanted to discuss some thoughts on this list, that's all.

OK. Apparently I misinterpreted the subject of the post (slow internet
connection while doing field work, did not check changes to
v.out.ogr).

> But I still think that something needs to be done
> to speed up v.out.ogr.

I think the slowness lies in dblib rather than in v.out.ogr. Recently I
have fixed a few memory leaks in dblib, but made no optimizations. The dbf
driver in particular is terribly slow. An index like those in real database
backends might help, although it would need to be created on the fly since
dbf does not support indexes. Nevertheless, I noticed that a lot can
be done at the module level. The new v.out.ply addon, for example, is based
on v.out.ascii, and v.out.ply with attribute export is orders of magnitude
faster than v.out.ascii with attribute export.
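
As a rough illustration of what an index created on the fly buys, here is a small Python sketch. It is not dblib or dbf driver code: the record list stands in for the rows of an unindexed table, and all names are hypothetical. One pass builds a map from category to row positions, after which a per-category lookup no longer scans the whole table.

```python
def build_cat_index(records, key="cat"):
    """One pass over the table: map each key value to the positions of its rows."""
    index = {}
    for pos, rec in enumerate(records):
        index.setdefault(rec[key], []).append(pos)
    return index


def rows_for_cat_scan(records, cat, key="cat"):
    """Without an index: every lookup scans the whole table."""
    return [rec for rec in records if rec[key] == cat]


def rows_for_cat_indexed(records, index, cat):
    """With the index: a lookup touches only the matching rows."""
    return [records[pos] for pos in index.get(cat, [])]


# Hypothetical unindexed table, standing in for the records of a dbf file.
records = [
    {"cat": 7, "name": "c"},
    {"cat": 1, "name": "a"},
    {"cat": 7, "name": "c2"},
]
index = build_cat_index(records)
```

For n features and m table rows, per-category scans cost on the order of n times m comparisons, while the indexed variant costs one pass over the table plus one dictionary lookup per feature. The index itself takes memory proportional to the number of rows.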

Markus M

Benjamin_Ducke1 · March 3, 2012, 5:06pm

> I think the slowness lies in dblib rather than in v.out.ogr. Recently I
> have fixed a few memory leaks in dblib, but made no optimizations. The dbf
> driver in particular is terribly slow. An index like those in real database
> backends might help, although it would need to be created on the fly since
> dbf does not support indexes. Nevertheless, I noticed that a lot can
> be done at the module level. The new v.out.ply addon, for example, is based
> on v.out.ascii, and v.out.ply with attribute export is orders of magnitude
> faster than v.out.ascii with attribute export.
>
> Markus M

OK then. For the time being, I will continue to experiment with
my own fork of v.out.ogr, and will report back if I make any
progress.

Best,

Ben
