[GRASS-dev] [GRASS GIS] #2692: v.in.ascii does not handle text in qoutes

#2692: v.in.ascii does not handle text in qoutes
-------------------------------------------------+-------------------------
Reporter: wenzeslaus | Owner: grass-dev@…
     Type: defect | Status: new
Priority: normal | Milestone: 7.0.1
Component: Default | Version: svn-trunk
Keywords: CSV, doublequote, singlequote, text | CPU: Unspecified
  delimiter |
Platform: Unspecified |
-------------------------------------------------+-------------------------
Since r63581 G7:v.in.ascii should be able to handle "CSV with quoted text
fields" (`text` option) but the quotes are stored to the database. This is
the result from `v.db.select` (same thing in `sqliteman`):

{{{
100|437343.6704|4061363.41525|"High Erosion"|"Low Deposition"
}}}

To reproduce, run:

{{{
$ cd grass/src
$ ./bin.../grass71 ... -text
...
> cd vector/v.in.ascii/testsuite/
> python test_csv.py
}}}

I added these tests in r65426 and r65427 if you wish to review them.

`G_tokenize2()` results online:

  *
http://fatra.cnr.ncsu.edu/grassgistests/reports_for_date-2015-06-10-07-00/report_for_nc_spm_08_grass7_nc/lib/gis/gis_lib_tokenize/index.html

`v.in.ascii` results online:

  *
http://fatra.cnr.ncsu.edu/grassgistests/reports_for_date-2015-06-10-07-00/report_for_nc_spm_08_grass7_nc/vector/v.in.ascii/test_csv/index.html

--
Ticket URL: <https://trac.osgeo.org/grass/ticket/2692&gt;
GRASS GIS <http://grass.osgeo.org>

#2692: v.in.ascii does not handle text in qoutes
-------------------------+-------------------------------------------------
  Reporter: wenzeslaus | Owner: grass-dev@…
      Type: defect | Status: new
  Priority: normal | Milestone: 7.0.1
Component: Default | Version: svn-trunk
Resolution: | Keywords: CSV, doublequote, singlequote, text
       CPU: | delimiter
  Unspecified | Platform: Unspecified
-------------------------+-------------------------------------------------

Comment (by mlennert):

Replying to [ticket:2692 wenzeslaus]:
> Since r63581 G7:v.in.ascii should be able to handle "CSV with quoted
text fields" (`text` option) but the quotes are stored to the database.
This is the result from `v.db.select` (same thing in `sqliteman`):
>
> {{{
> 100|437343.6704|4061363.41525|"High Erosion"|"Low Deposition"
> }}}
>

IIUC, this comes from
[https://trac.osgeo.org/grass/browser/grass/trunk/lib/gis/token.c#L107
this part of lib/gis/token.c]. The p++ just advances the pointer like for
any other character. I guess that in order to skip it, one would have to
create a second pointer which takes a copy of the first, but advances only
when the first one is not equal to the character you want to eliminate. Or
something like that. But my pointer foo is much too low to handle this
directly. Anyone more C-savy who could do this in 5 minutes ?

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/2692#comment:1&gt;
GRASS GIS <http://grass.osgeo.org>

#2692: v.in.ascii does not handle text in qoutes
-------------------------+-------------------------------------------------
  Reporter: wenzeslaus | Owner: grass-dev@…
      Type: defect | Status: new
  Priority: normal | Milestone: 7.0.1
Component: Default | Version: svn-trunk
Resolution: | Keywords: CSV, doublequote, singlequote, text
       CPU: | delimiter
  Unspecified | Platform: Unspecified
-------------------------+-------------------------------------------------
Changes (by mlennert):

* Attachment "tokenize_with_textdelim.diff" added.

--
Ticket URL: <https://trac.osgeo.org/grass/ticket/2692&gt;
GRASS GIS <http://grass.osgeo.org>

#2692: v.in.ascii does not handle text in qoutes
-------------------------+-------------------------------------------------
  Reporter: wenzeslaus | Owner: grass-dev@…
      Type: defect | Status: new
  Priority: normal | Milestone: 7.0.1
Component: Default | Version: svn-trunk
Resolution: | Keywords: CSV, doublequote, singlequote, text
       CPU: | delimiter
  Unspecified | Platform: Unspecified
-------------------------+-------------------------------------------------

Comment (by mlennert):

Replying to [comment:1 mlennert]:
> Replying to [ticket:2692 wenzeslaus]:
> > Since r63581 G7:v.in.ascii should be able to handle "CSV with quoted
text fields" (`text` option) but the quotes are stored to the database.
This is the result from `v.db.select` (same thing in `sqliteman`):
> >
> > {{{
> > 100|437343.6704|4061363.41525|"High Erosion"|"Low Deposition"
> > }}}
> >
>
> IIUC, this comes from
[https://trac.osgeo.org/grass/browser/grass/trunk/lib/gis/token.c#L107
this part of lib/gis/token.c]. The p++ just advances the pointer like for
any other character. I guess that in order to skip it, one would have to
create a second pointer which takes a copy of the first, but advances only
when the first one is not equal to the character you want to eliminate. Or
something like that. But my pointer foo is much too low to handle this
directly. Anyone more C-savy who could do this in 5 minutes ?

The attached patch seems to solve the problem for me, but it feels like a
quick fix and there probably is a better way...

Moritz

--
Ticket URL: <https://trac.osgeo.org/grass/ticket/2692#comment:2&gt;
GRASS GIS <http://grass.osgeo.org>

Glynn,

Could you just take a look at the patch and say whether this is an acceptable hack ?

Moritz

-------- Forwarded Message --------
Subject: Re: [GRASS GIS] #2692: v.in.ascii does not handle text in qoutes
Date: Sun, 14 Jun 2015 12:18:09 -0000
From: GRASS GIS <trac@osgeo.org>
Reply-To: grass-dev@lists.osgeo.org
To: undisclosed-recipients:;

#2692: v.in.ascii does not handle text in qoutes
-------------------------+-------------------------------------------------
   Reporter: wenzeslaus | Owner: grass-dev@…
       Type: defect | Status: new
   Priority: normal | Milestone: 7.0.1
  Component: Default | Version: svn-trunk
Resolution: | Keywords: CSV, doublequote, singlequote, text
        CPU: | delimiter
   Unspecified | Platform: Unspecified
-------------------------+-------------------------------------------------

Comment (by mlennert):

  Replying to [comment:1 mlennert]:
  > Replying to [ticket:2692 wenzeslaus]:
  > > Since r63581 G7:v.in.ascii should be able to handle "CSV with quoted
  text fields" (`text` option) but the quotes are stored to the database.
  This is the result from `v.db.select` (same thing in `sqliteman`):
  > >
  > > {{{
  > > 100|437343.6704|4061363.41525|"High Erosion"|"Low Deposition"
  > > }}}
  > >
  >
  > IIUC, this comes from
  [https://trac.osgeo.org/grass/browser/grass/trunk/lib/gis/token.c#L107
  this part of lib/gis/token.c]. The p++ just advances the pointer like for
  any other character. I guess that in order to skip it, one would have to
  create a second pointer which takes a copy of the first, but advances only
  when the first one is not equal to the character you want to eliminate. Or
  something like that. But my pointer foo is much too low to handle this
  directly. Anyone more C-savy who could do this in 5 minutes ?

  The attached patch seems to solve the problem for me, but it feels like a
  quick fix and there probably is a better way...

  Moritz

--
Ticket URL: <https://trac.osgeo.org/grass/ticket/2692#comment:2&gt;
GRASS GIS <http://grass.osgeo.org>

#2692: v.in.ascii does not handle text in qoutes
-------------------------+-------------------------------------------------
  Reporter: wenzeslaus | Owner: grass-dev@…
      Type: defect | Status: new
  Priority: normal | Milestone: 7.0.1
Component: Default | Version: svn-trunk
Resolution: | Keywords: CSV, doublequote, singlequote, text
       CPU: | delimiter
  Unspecified | Platform: Unspecified
-------------------------+-------------------------------------------------

Comment (by glynn):

Replying to [comment:2 mlennert]:
> The attached patch seems to solve the problem for me, but it feels like
a quick fix and there probably is a better way...
Simply removing the quotes isn't sufficient to support CSV. It's necessary
to treat the field separator as a normal character when it occurs within a
quoted field (that's the point of using quotes), and also to replace two
consecutive quotes with a single quote (so that the quote character itself
can occur within a field).

For that, an explicit state machine is likely to be more legible than ad-
hoc logic.

Off the top of my head:
{{{
classes: delimiter, separator, newline, other
states: start, in_quote, after_quote, error
actions: no_op, add_char, new_field, end_record, error

start,delimiter -> in_quote,no_op
start,separator -> start,new_field
start,newline -> start,end_record
start,other -> start,add_char
in_quote,delimiter -> after_quote,no_op
in_quote,separator -> in_quote,add_char
in_quote,newline -> error,error
in_quote,other -> in_quote,add_char
after_quote,delimiter -> in_quote,add_char
after_quote,separator -> start,new_field
after_quote,newline -> start,end_record
after_quote,other -> error,error
}}}

--
Ticket URL: <https://trac.osgeo.org/grass/ticket/2692#comment:3&gt;
GRASS GIS <http://grass.osgeo.org>

#2692: v.in.ascii does not handle text in qoutes
-------------------------+-------------------------------------------------
  Reporter: wenzeslaus | Owner: grass-dev@…
      Type: defect | Status: new
  Priority: normal | Milestone: 7.0.1
Component: Default | Version: svn-trunk
Resolution: | Keywords: CSV, doublequote, singlequote, text
       CPU: | delimiter
  Unspecified | Platform: Unspecified
-------------------------+-------------------------------------------------

Comment (by mlennert):

Replying to [comment:3 glynn]:
> Replying to [comment:2 mlennert]:
> > The attached patch seems to solve the problem for me, but it feels
like a quick fix and there probably is a better way...
> Simply removing the quotes isn't sufficient to support CSV. It's
necessary to treat the field separator as a normal character when it
occurs within a quoted field (that's the point of using quotes), and also
to replace two consecutive quotes with a single quote (so that the quote
character itself can occur within a field).
>
> For that, an explicit state machine is likely to be more legible than
ad-hoc logic.

Sounds like a fun programming challenge (for it's a challenge at least),
but I wonder whether we couldn't just integrate the work of others in the
same domain, notably [http://sourceforge.net/projects/libcsv/ libcsv].
It's GPL (2 or higher) and its fairly lightweight, so we could just
integrate it into the GRASS source (although Debian ships it as a package
as well, don't know for other distributions, nor for OSX and Windows).
AFAICS, there have been no modifications since 2013, so it does not look
like a heavy dependency to track.

--
Ticket URL: <https://trac.osgeo.org/grass/ticket/2692#comment:4&gt;
GRASS GIS <http://grass.osgeo.org>

#2692: v.in.ascii does not handle text in qoutes
-------------------------+-------------------------------------------------
  Reporter: wenzeslaus | Owner: grass-dev@…
      Type: defect | Status: new
  Priority: normal | Milestone: 7.0.1
Component: Default | Version: svn-trunk
Resolution: | Keywords: CSV, doublequote, singlequote, text
       CPU: | delimiter
  Unspecified | Platform: Unspecified
-------------------------+-------------------------------------------------

Comment (by mlennert):

Replying to [comment:4 mlennert]:
> Replying to [comment:3 glynn]:
> > Replying to [comment:2 mlennert]:
> > > The attached patch seems to solve the problem for me, but it feels
like a quick fix and there probably is a better way...
> > Simply removing the quotes isn't sufficient to support CSV. It's
necessary to treat the field separator as a normal character when it
occurs within a quoted field (that's the point of using quotes), and also
to replace two consecutive quotes with a single quote (so that the quote
character itself can occur within a field).
> >
> > For that, an explicit state machine is likely to be more legible than
ad-hoc logic.
>
> Sounds like a fun programming challenge (for it's a challenge at least),
but I wonder whether we couldn't just integrate the work of others in the
same domain, notably [http://sourceforge.net/projects/libcsv/ libcsv].
It's GPL (2 or higher) and its fairly lightweight, so we could just
integrate it into the GRASS source (although Debian ships it as a package
as well, don't know for other distributions, nor for OSX and Windows).
AFAICS, there have been no modifications since 2013, so it does not look
like a heavy dependency to track.

Another option would be to use [https://docs.python.org/2/library/csv.html
Python's csv module] and rewrite v.in.ascii in Python...

Moritz

--
Ticket URL: <https://trac.osgeo.org/grass/ticket/2692#comment:5&gt;
GRASS GIS <http://grass.osgeo.org>

#2692: v.in.ascii does not handle text in qoutes
-------------------------+-------------------------------------------------
  Reporter: wenzeslaus | Owner: grass-dev@…
      Type: defect | Status: new
  Priority: normal | Milestone: 7.0.1
Component: Default | Version: svn-trunk
Resolution: | Keywords: CSV, doublequote, singlequote, text
       CPU: | delimiter
  Unspecified | Platform: Unspecified
-------------------------+-------------------------------------------------
Changes (by glynn):

* Attachment "tokenise.diff" added.

improved tokeniser

--
Ticket URL: <https://trac.osgeo.org/grass/ticket/2692&gt;
GRASS GIS <http://grass.osgeo.org>

#2692: v.in.ascii does not handle text in qoutes
-------------------------+-------------------------------------------------
  Reporter: wenzeslaus | Owner: grass-dev@…
      Type: defect | Status: new
  Priority: normal | Milestone: 7.0.1
Component: Default | Version: svn-trunk
Resolution: | Keywords: CSV, doublequote, singlequote, text
       CPU: | delimiter
  Unspecified | Platform: Unspecified
-------------------------+-------------------------------------------------

Comment (by glynn):

Replying to [comment:3 glynn]:

> For that, an explicit state machine is likely to be more legible than
ad-hoc logic.

Please test attachment:tokenise.diff

An external library may be worth using for improved fault-tolerance (CSV
is a rather loose "standard", to say the least). But any such dependency
should be
a. on specific modules (e.g. v.in.ascii), not lib/gis (i.e. G_tokenize),
and
b. an optional alternative to G_tokenize(), i.e. modules should still
compile and work if the library isn't available.

Python is far too heavyweight a dependency for such a task.

--
Ticket URL: <https://trac.osgeo.org/grass/ticket/2692#comment:6&gt;
GRASS GIS <http://grass.osgeo.org>

#2692: v.in.ascii does not handle text in qoutes
-------------------------+-------------------------------------------------
  Reporter: wenzeslaus | Owner: grass-dev@…
      Type: defect | Status: new
  Priority: normal | Milestone: 7.0.1
Component: Default | Version: svn-trunk
Resolution: | Keywords: CSV, doublequote, singlequote, text
       CPU: | delimiter
  Unspecified | Platform: Unspecified
-------------------------+-------------------------------------------------

Comment (by mlennert):

Replying to [comment:6 glynn]:
> Replying to [comment:3 glynn]:
>
> > For that, an explicit state machine is likely to be more legible than
ad-hoc logic.
>
> Please test attachment:tokenise.diff

Great, thanks !

I propose two small changes (attached tokenise_corrected.diff), one seems
just a typo (in case A_END_RECORD: "*q++ - '\0';") and the other comes
from the fact that when we are in state AFTER_QUOTE and we reach a
delimiter, we have to go back to state S_START. Otherwise if the next
field starts again with a quote, this quote is treated as a second quote.

Using the following example:

{{{
echo "123|123|1|test1|'test2'|'\"test3\"'|'test''4'" | v.in.ascii in=-
out=testtext text=singlequote --o
}}}

With your patch:

{{{
> v.db.select testtextcat|int_1|int_2|int_3|str_1|str_2|str_3|str_4
1|123|123|1|test1|test2|'"test3"|'test'4t''4'
}}}

With the correction:

{{{
> v.db.select testtextcat|int_1|int_2|int_3|str_1|str_2|str_3|str_4
1|123|123|1|test1|test2|"test3"|test'4
}}}

>
> An external library may be worth using for improved fault-tolerance (CSV
is a rather loose "standard", to say the least). But any such dependency
should be
> a. on specific modules (e.g. v.in.ascii), not lib/gis (i.e. G_tokenize),
and
> b. an optional alternative to G_tokenize(), i.e. modules should still
compile and work if the library isn't available.
>
> Python is far too heavyweight a dependency for such a task.

Well, I thought about a new module v.in.ascii2/v.in.csv which would be
based on the Python csv module. As Python is a dependency anyway so on
module level, this shouldn't be a problem. But I think that with your
patch this particular bug is solved, and that we can leave handling of
more complex csv files to other tools which people can use to prepare the
data for v.in.ascii.

--
Ticket URL: <https://trac.osgeo.org/grass/ticket/2692#comment:7&gt;
GRASS GIS <http://grass.osgeo.org>

#2692: v.in.ascii does not handle text in qoutes
-------------------------+-------------------------------------------------
  Reporter: wenzeslaus | Owner: grass-dev@…
      Type: defect | Status: new
  Priority: normal | Milestone: 7.0.1
Component: Default | Version: svn-trunk
Resolution: | Keywords: CSV, doublequote, singlequote, text
       CPU: | delimiter
  Unspecified | Platform: Unspecified
-------------------------+-------------------------------------------------
Changes (by mlennert):

* Attachment "tokenise_corrected.diff" added.

--
Ticket URL: <https://trac.osgeo.org/grass/ticket/2692&gt;
GRASS GIS <http://grass.osgeo.org>

#2692: v.in.ascii does not handle text in qoutes
-------------------------+-------------------------------------------------
  Reporter: wenzeslaus | Owner: grass-dev@…
      Type: defect | Status: new
  Priority: normal | Milestone: 7.0.1
Component: Default | Version: svn-trunk
Resolution: | Keywords: CSV, doublequote, singlequote, text
       CPU: | delimiter
  Unspecified | Platform: Unspecified
-------------------------+-------------------------------------------------

Comment (by glynn):

Replying to [comment:7 mlennert]:

> I propose two small changes (attached tokenise_corrected.diff), one
seems just a typo (in case A_END_RECORD: "*q++ - '\0';") and the other
comes from the fact that when we are in state AFTER_QUOTE and we reach a
delimiter, we have to go back to state S_START. Otherwise if the next
field starts again with a quote, this quote is treated as a second quote.

Indeed, good catches.

Applied as r65499.

--
Ticket URL: <https://trac.osgeo.org/grass/ticket/2692#comment:8&gt;
GRASS GIS <http://grass.osgeo.org>