[GRASS-dev] new v.in.geonames: problems with UTF-8 Unicode text

Hi,

I am writing v.in.geonames to easily read in data from
http://download.geonames.org/export/dump/

The script is essentially using v.in.ascii to read in the CSV file encoded
in UTF-8 Unicode text. There are placenames in various languages including
Japanese.
v.in.ascii isn't able to read them properly and fails on such lines, example:

3165456 Torre del Greco Torre del Greco Torre d%27%27o Grieco,Torre
d''o Grieco,Torre d''o Grieco,Torre del Greco,トッレ・デル・グレーコ
40.7839209532791 14.3708038330078 P PPL IT
         04 NA 063084 90607 72
Europe/Rome 2008-06-28

(I have slightly improved the v.in.ascii error message, not yet submitted):

ERROR: Unparsable latitude value in column <4>: 'o Grieco,Torre d''o
       Grieco,Torre del Greco,トッレ・デル・グレーコ

How to fix this problem?

thanks
markus

Markus Neteler wrote:

I am writing v.in.geonames to easily read in data from
http://download.geonames.org/export/dump/

The script is essentially using v.in.ascii to read in the CSV file encoded
in UTF-8 Unicode text. There are placenames in various languages including
Japanese.
v.in.ascii isn't able to read them properly and fails on such lines, example:

3165456 Torre del Greco Torre del Greco Torre d%27%27o Grieco,Torre
d''o Grieco,Torre d''o Grieco,Torre del Greco,トッレ・デル・グレーコ
40.7839209532791 14.3708038330078 P PPL IT
         04 NA 063084 90607 72
Europe/Rome 2008-06-28

(I have slightly improved the v.in.ascii error message, not yet submitted):

ERROR: Unparsable latitude value in column <4>: 'o Grieco,Torre d''o
       Grieco,Torre del Greco,トッレ・デル・グレーコ

How to fix this problem?

Can you please provide accurate and sufficient information about the
problem?

I.e. the exact data being fed to v.in.ascii (*before* it has been
mangled by the various components of the email chain), the v.in.ascii
command which is failing, etc.

--
Glynn Clements <glynn@gclements.plus.com>

Markus:

I am writing v.in.geonames to easily read in data from
http://download.geonames.org/export/dump/

The script is essentially using v.in.ascii to read in the
CSV file encoded in UTF-8 Unicode text.

I think I may have found a buffer overflow in their DB... (NZ.zip export)

2181761 Taumatawhakatangihangakoauauotamateapokaiwhenuakitanatahu Taumatawhakatangihangakoauauotamateapokaiwhenuakitanatahu
  Taumata-whakatangihanga-koauau-a-Tamatea-pokai-whenua-ki-tana-tahu,Taumata-whakatangihanga-kōauau-a-Tamatea-pōkai-whenua-ki-tana-tahu,Taumatauakatangiangakoauauotamateaturipukakapikimaungakhoronukupokanuehnuakitanatakhu,Taumatawhakatangihangakoauau,Taumatawhakatangihangakoauauotamateapokaiwhenuakitanatahu,Taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu,Taumatawhakatangihangakoauotamateaturipukakapikimaungahoronukupokaiwhenuakitanataha,Tetaumatafakatangikhangakoauaotamateaurekhaeaturipukapikhimaungakhoronukupokaifenuaakitanarakhu,Tetaumatawhakatangihangakoauaotamateaurehaeaturipukapihimaungahoronukupokaiwhenuaakitanarahu,Тауматауакатангиангакоауауотаматеатурипукакапикимаунгахоронукупокануэнуакитанатаху,
Тетауматафакатангихангакоауаотаматеаурехаеатурипукапихимаунгахоронукупокаифенуаакитанараху,タウマタファカタンギハンガコアウアウオタマテアポカイフェヌアキタナタフ,塔乌玛塔法卡塔尼哈娜可阿乌阿乌欧塔玛提亚坡凯费努啊奇塔娜塔胡 -40.35 176.55 T HLL NZ NZ 00 0 191 Pacific/Auckland 2007-02-17

v.in.ascii chokes there for obvious reasons.

Move over, small Welsh towns!
http://en.wikipedia.org/wiki/Taumatawhakatangihangakoauauotamateapokaiwhenuakitanatahu
[check out the translation(s)]

Hamish

On Mon, Jun 30, 2008 at 5:26 PM, Glynn Clements
<glynn@gclements.plus.com> wrote:

Markus Neteler wrote:

I am writing v.in.geonames to easily read in data from
http://download.geonames.org/export/dump/

The script is essentially using v.in.ascii to read in the CSV file encoded
in UTF-8 Unicode text. There are placenames in various languages including
Japanese.
v.in.ascii isn't able to read them properly and fails on such lines, example:

3165456 Torre del Greco Torre del Greco Torre d%27%27o Grieco,Torre
d''o Grieco,Torre d''o Grieco,Torre del Greco,トッレ・デル・グレーコ
40.7839209532791 14.3708038330078 P PPL IT
         04 NA 063084 90607 72
Europe/Rome 2008-06-28

(I have slightly improved the v.in.ascii error message, not yet submitted):

ERROR: Unparsable latitude value in column <4>: 'o Grieco,Torre d''o
       Grieco,Torre del Greco,トッレ・デル・グレーコ

How to fix this problem?

Can you please provide accurate and sufficient information about the
problem?

As always, I try.

I.e. the exact data being fed to v.in.ascii (*before* it has been
mangled by the various components of the email chain), the v.in.ascii
command which is failing, etc.

Attached is the original file reduced to one offending line:
wget http://download.geonames.org/export/dump/IT.zip
cd /tmp
unzip IT.zip
grep 'Italian Republic' /tmp/IT.txt > /tmp/IT_example.csv

Import into a LatLong location, replicating the script functionality:

v.in.ascii cat=0 x=6 y=5 fs=tab in=/tmp/IT_example.csv out=test
columns='geonameid integer, name varchar(200), asciiname varchar(200),
alternatename varchar(4000), latitude double precision, longitude
double precision, featureclass varchar(1), featurecode varchar(10),
countrycode varchar(2), cc2 varchar(60), admin1code varchar(20),
admin2code varchar(20), admin3code varchar(20), admin4code
varchar(20), population integer, elevation varchar(5), gtopo30
integer, timezone varchar(50), modification date' --o
Scanning input for column types...
Current row: '�,དགའ་དའ་རས,ཨ་ཊ་ལ།,ཨཊ་ལ,იტალია,ጣሊያን,ጣልያን,អតាល,'Itāria,イタリア,イタリア共和国,意大利,ꑴꄊꆺ,이탈리아
     42.8333333 12.8333333A PCLI IT 00
                        58145000 762 Europe/Rome
2008-03-21'
ERROR: Unparsable latitude value in column <4>: PCLI

-----------
For the full example, just use the new scripts/v.in.geonames from SVN
(just fixed; the "tr" filter magic that stripped all non-Latin
characters is now deactivated).

Markus

(attachments)

IT_example.csv.gz (821 Bytes)

Markus Neteler wrote:

>> I am writing v.in.geonames to easily read in data from
>> http://download.geonames.org/export/dump/
>>
>> The script is essentially using v.in.ascii to read in the CSV file encoded
>> in UTF-8 Unicode text. There are placenames in various languages including
>> Japanese.
>> v.in.ascii isn't able to read them properly and fails on such lines, example:

>> How to fix this problem?
>
> Can you please provide accurate and sufficient information about the
> problem?

As always, I try.

As a general rule, if you're having problems with input which contains
non-ASCII text, use an attachment. The files appear to be UTF-8, but
your previous email used ISO-2022. They may seem "equivalent" to your
mail program, but they may not be equivalent so far as e.g. v.in.ascii
is concerned.

In this case, I don't think that encodings or non-ASCII characters are
actually the problem. However, if the data had actually been encoded
in ISO-2022, it could have been related.

> I.e. the exact data being fed to v.in.ascii (*before* it has been
> mangled by the various components of the email chain), the v.in.ascii
> command which is failing, etc.

Attached is the original file reduced to one offending line:
wget http://download.geonames.org/export/dump/IT.zip
cd /tmp
unzip IT.zip
grep 'Italian Republic' /tmp/IT.txt > /tmp/IT_example.csv

Import into a LatLong location, replicating the script functionality:

v.in.ascii cat=0 x=6 y=5 fs=tab in=/tmp/IT_example.csv out=test
columns='geonameid integer, name varchar(200), asciiname varchar(200),
alternatename varchar(4000), latitude double precision, longitude
double precision, featureclass varchar(1), featurecode varchar(10),
countrycode varchar(2), cc2 varchar(60), admin1code varchar(20),
admin2code varchar(20), admin3code varchar(20), admin4code
varchar(20), population integer, elevation varchar(5), gtopo30
integer, timezone varchar(50), modification date' --o
Scanning input for column types...

ERROR: Unparsable latitude value in column <4>: PCLI

I don't get this particular error, but I do have some other problems.

First, I had to increase the buffer size:

--- vector/v.in.ascii/points.c (revision 31901)
+++ vector/v.in.ascii/points.c (working copy)
@@ -74,7 +74,7 @@
     char *coorbuf, *tmp_token, *sav_buf;
     int skip = FALSE, skipped = 0;

-    buflen = 1000;
+    buflen = 4000;
     buf = (char *)G_malloc(buflen);
     buf_raw = (char *)G_malloc(buflen);
     coorbuf = (char *)G_malloc(256);

Otherwise, the input was truncated in the middle of the list of
translated names. This caused points_analyse[1] to see too few
columns, resulting in:

  Scanning input for column types...
  Maximum input row length: 999
  Maximum number of columns: 11
  Minimum number of columns: 4
  ERROR: x column number > minimum last column number
         (incorrect field separator?)

Fixing that, it now complains about:

  Scanning input for column types...
  Maximum input row length: 1309
  Maximum number of columns: 14
  Minimum number of columns: 14
  WARNING: Table <test> linked to vector map <test> does not exist
  ERROR: Number of columns defined (19) does not match number of columns (14)

This is caused by G_tokenize() skipping leading whitespace, including
tabs, even when the separator is a tab. Consequently, a run of
consecutive blank fields is interpreted as a single blank field.
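The failure mode described above can be reproduced with any strtok()-style tokenizer, which treats a run of separators as one separator; a field-preserving split has to count every tab. A minimal illustration (not the actual G_tokenize() code):

```c
#include <string.h>

/* strtok()-style counting: runs of separators collapse, so the empty
 * fields between consecutive tabs disappear (the bug described above). */
static int count_fields_strtok(char *s, const char *sep)
{
    int n = 0;
    char *tok;

    for (tok = strtok(s, sep); tok != NULL; tok = strtok(NULL, sep))
        n++;
    return n;
}

/* Correct TSV counting: every separator starts a new, possibly
 * empty, field. */
static int count_fields_split(const char *s, char sep)
{
    int n = 1;

    for (; *s; s++)
        if (*s == sep)
            n++;
    return n;
}
```

With the line "a\t\t\tb" the split version reports 4 columns while the strtok version reports only 2, which is exactly how a run of blank geonames fields collapses into one.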

After fixing that, I get:

  Scanning input for column types...
  Maximum input row length: 1309
  Maximum number of columns: 19
  Minimum number of columns: 19
  WARNING: Column number 11 <admin1code> defined as string has only integer
           values
  Importing points...
  Segmentation fault (core dumped)

This is caused by overflowing another 1000-byte buffer in
points_to_bin():

--- vector/v.in.ascii/points.c (revision 31911)
+++ vector/v.in.ascii/points.c (working copy)
@@ -269,7 +269,7 @@
       int *coltype, int xcol, int ycol, int zcol, int catcol,
       int skip_lines)
{
-    char *buf, buf2[1000];
+    char *buf, buf2[4000];
     int cat = 0;
     int row = 1;
     struct line_pnts *Points;

After which, the file appears to import without any problems.

I have committed a fix to G_tokenize(), and also enlarged the buffers
in v.in.ascii to 4000 bytes (although removing fixed limits altogether
would be better).
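One way to remove the fixed limits altogether, as suggested, is to read each row into a buffer that grows as needed. A sketch using POSIX getline() (illustrative only, not the actual v.in.ascii code):

```c
#define _POSIX_C_SOURCE 200809L

#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Read one row into a heap buffer that getline() reallocates as
 * needed, stripping the trailing newline. Returns the row length,
 * or -1 on EOF/error. No fixed row-length limit. */
static ssize_t read_row(FILE *fp, char **buf, size_t *cap)
{
    ssize_t len = getline(buf, cap, fp);

    if (len > 0 && (*buf)[len - 1] == '\n')
        (*buf)[--len] = '\0';
    return len;
}
```

The caller passes `char *buf = NULL; size_t cap = 0;` and frees the buffer once at the end; a 1309-byte geonames row is then no different from a 100-byte one.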

[1] BTW, don't we normally use US-English spellings, i.e. "analyze"
instead of "analyse"?

--
Glynn Clements <glynn@gclements.plus.com>

On Mon, Jun 30, 2008 at 11:16 PM, Glynn Clements
<glynn@gclements.plus.com> wrote:

Markus Neteler wrote:

...

After which, the file appears to import without any problems.

I have committed a fix to G_tokenize(), and also enlarged the buffers
in v.in.ascii to 4000 bytes (although removing fixed limits altogether
would be better).

Excellent, now the Polish file is imported as well.
This closes
http://trac.osgeo.org/grass/ticket/202

But:
when using the SQLite driver, it cuts the "alternatename" field at
256 chars due to a hardcoded limitation in describe.c. I don't
remember why the varchar() length isn't used there.
What about this patch?

Index: db/drivers/sqlite/describe.c

--- db/drivers/sqlite/describe.c (revision 31904)
+++ db/drivers/sqlite/describe.c (working copy)
@@ -173,7 +173,7 @@
            break;

        case SQLITE_TEXT:
-            fsize = 255;
+            fsize = 99999;
            break;

The PostgreSQL driver has a function to find out the column length.
With this change the full "alternatename" field is imported into SQLite.

Markus

On Tue, Jul 1, 2008 at 11:38 AM, Markus Neteler <neteler@osgeo.org> wrote:

The PostgreSQL driver has a function to find out the column length.
With this change the full "alternatename" field is imported into SQLite.

...

Using PostgreSQL instead, an escaping problem occurs (OK, slightly
off-topic for this thread, but still related to v.in.geonames):

v.in.geonames /tmp/IT.csv out=it
...
D0/0: Escaped SQL: insert into it values ( 27788, 6485478, 'Hotel
Roma', 'Hotel Roma', null, 43.72241, 10.39204, 'S', 'HTL', 'IT', null,
'16', 'PI', null, null, 0, null, 1, 'Europe/Rome', '2007-04-15')
D0/0: Escaped SQL: insert into it values ( 27789, 6485480, 'Villa
Milani - Residenza d\\''epoca', 'Villa Milani - Residenza d\\''epoca',
null, 42.74533, 12.73843, 'S', 'HTL', 'IT', null, '18', 'PG', null,
null, 0, null, 374, 'Europe/Rome', '2007-04-15')
WARNING: nonstandard use of \\ in a string literal
LINE 1: insert into it values ( 27789, 6485480, 'Villa Milani - Resi...
                                                ^
HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
WARNING: nonstandard use of \\ in a string literal
LINE 1: ..., 6485480, 'Villa Milani - Residenza d\\''epoca', 'Villa Mil...
                                                             ^
HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
D0/0: Escaped SQL: insert into it values ( 27790, 6485481, 'Hotel Sher
Dan', 'Hotel Sher Dan', null, 40.72241, 8.39582, 'S', 'HTL', 'IT',
null, '1

I assume that the escape magic in db/drivers/postgres/execute.c isn't
perfect. Any suggestions here?

Markus

Markus Neteler wrote:

> After which, the file appears to import without any problems.
>
> I have committed a fix to G_tokenize(), and also enlarged the buffers
> in v.in.ascii to 4000 bytes (although removing fixed limits altogether
> would be better).

Excellent, now the Polish file is imported as well.
This closes
http://trac.osgeo.org/grass/ticket/202

But:
when using the SQLite driver, it cuts the "alternatename" field at
256 chars due to a hardcoded limitation in describe.c. I don't
remember why the varchar() length isn't used there.

SQLite doesn't have VARCHAR. If you create a VARCHAR column, the
driver translates it to TEXT. The size you specify when the table is
created isn't stored anywhere.

Actually, SQLite doesn't really have column types at all. It remembers
the declaration type, although the SQLite driver doesn't use this; it
just returns the value types corresponding to the first row of a
"SELECT * FROM <table>" query.

[BTW, it appears that the PostgreSQL driver has a bug regarding TEXT
columns. get_column_info() (db/drivers/postgres/describe.c) doesn't
fill in the size field for TEXT columns, but describe_table() assumes
that it will always be filled in.]

What about this patch?

- fsize = 255;
+ fsize = 99999;

That will make every text column equivalent to VARCHAR(99999). That
may be problematic if you e.g. copy data to a "real" RDBMS (or any
file format which uses fixed-size fields), as it will create
VARCHAR(99999) columns, potentially wasting vast amounts of memory.

I would suggest lowering it to e.g. 1000. Or adding an environment
variable.

In the longer term, maybe the SQLite driver should try to store the
information which SQLite itself omits.

--
Glynn Clements <glynn@gclements.plus.com>

Markus Neteler wrote:

> The PostgreSQL driver has a function to find out the column length.
> With this change the full "alternatename" field is imported into SQLite.
...

Using PostgreSQL instead, an escaping problem occurs (OK, slightly
off-topic for this thread, but still related to v.in.geonames):

v.in.geonames /tmp/IT.csv out=it
...
D0/0: Escaped SQL: insert into it values ( 27788, 6485478, 'Hotel
Roma', 'Hotel Roma', null, 43.72241, 10.39204, 'S', 'HTL', 'IT', null,
'16', 'PI', null, null, 0, null, 1, 'Europe/Rome', '2007-04-15')
D0/0: Escaped SQL: insert into it values ( 27789, 6485480, 'Villa
Milani - Residenza d\\''epoca', 'Villa Milani - Residenza d\\''epoca',
null, 42.74533, 12.73843, 'S', 'HTL', 'IT', null, '18', 'PG', null,
null, 0, null, 374, 'Europe/Rome', '2007-04-15')
WARNING: nonstandard use of \\ in a string literal
LINE 1: insert into it values ( 27789, 6485480, 'Villa Milani - Resi...
                                                ^
HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
WARNING: nonstandard use of \\ in a string literal
LINE 1: ..., 6485480, 'Villa Milani - Residenza d\\''epoca', 'Villa Mil...
                                                             ^
HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
D0/0: Escaped SQL: insert into it values ( 27790, 6485481, 'Hotel Sher
Dan', 'Hotel Sher Dan', null, 40.72241, 8.39582, 'S', 'HTL', 'IT',
null, '1

I assume that the escape magic in db/drivers/postgres/execute.c isn't
perfect. Any suggestions here?

Is there actually a problem? All I see are warnings.

If I try inserting the above into a test table using the psql utility,
I get exactly what I would expect.

grass=> create table it ( a INT, b INT, c TEXT, d TEXT, e TEXT, x DOUBLE PRECISION, y DOUBLE PRECISION, f TEXT, g TEXT, h TEXT, i TEXT, j TEXT, k TEXT, l TEXT, m TEXT, n INTEGER, o TEXT, p INTEGER, q TEXT, r TEXT) ;
CREATE TABLE
grass=> insert into it values ( 27788, 6485478, 'Hotel Roma', 'Hotel Roma', null, 43.72241, 10.39204, 'S', 'HTL', 'IT', null, '16', 'PI', null, null, 0, null, 1, 'Europe/Rome', '2007-04-15') ;
INSERT 9686780 1
grass=> insert into it values ( 27789, 6485480, 'Villa Milani - Residenza d\\''epoca', 'Villa Milani - Residenza d\\''epoca', null, 42.74533, 12.73843, 'S', 'HTL', 'IT', null, '18', 'PG', null, null, 0, null, 374, 'Europe/Rome', '2007-04-15') ;
INSERT 9686781 1
grass=> select * from it ;
   a | b | c | d | e | x | y | f | g | h | i | j | k | l | m | n | o | p | q | r
-------+---------+-----------------------------------+-----------------------------------+---+----------+----------+---+-----+----+---+----+----+---+---+---+---+-----+-------------+------------
27788 | 6485478 | Hotel Roma | Hotel Roma | | 43.72241 | 10.39204 | S | HTL | IT | | 16 | PI | | | 0 | | 1 | Europe/Rome | 2007-04-15
27789 | 6485480 | Villa Milani - Residenza d\'epoca | Villa Milani - Residenza d\'epoca | | 42.74533 | 12.73843 | S | HTL | IT | | 18 | PG | | | 0 | | 374 | Europe/Rome | 2007-04-15
(2 rows)

Note that the backslash is present in the file, which appears to be an
error; many other entries have literal apostrophes without the
backslash.
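For reference, the standard-conforming way to embed an apostrophe in a SQL string literal is to double it rather than backslash-escape it. A minimal sketch of such quote doubling (a hypothetical helper, not the actual db_double_quote_string() code):

```c
#include <stdlib.h>
#include <string.h>

/* Wrap s in single quotes, doubling any embedded apostrophe -- the
 * portable SQL escaping that avoids PostgreSQL's backslash warnings.
 * Caller frees the result. */
static char *sql_quote(const char *s)
{
    /* worst case: every character is an apostrophe, plus the two
     * outer quotes and the terminator */
    char *out = malloc(2 * strlen(s) + 3);
    char *p = out;

    *p++ = '\'';
    for (; *s; s++) {
        if (*s == '\'')
            *p++ = '\'';
        *p++ = *s;
    }
    *p++ = '\'';
    *p = '\0';
    return out;
}
```

sql_quote("Residenza d'epoca") yields 'Residenza d''epoca' with no backslash involved; the \\'' sequences in the log above come from backslashes that were already present in the geonames data.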

--
Glynn Clements <glynn@gclements.plus.com>

On Tue, Jul 1, 2008 at 6:55 PM, Glynn Clements <glynn@gclements.plus.com> wrote:

Markus Neteler wrote:

when using the SQLite driver, it cuts the "alternatename" field at
256 chars due to a hardcoded limitation in describe.c. I don't
remember why the varchar() length isn't used there.

SQLite doesn't have VARCHAR. If you create a VARCHAR column, the
driver translates it to TEXT. The size you specify when the table is
created isn't stored anywhere.

This is not clear to me:
- I run v.in.ascii and define that I want varchar(4000).
- this is sent to DBMI which calls the SQLite driver
- in the SQLite driver 255 is hardcoded for TEXT and the
  "alternatename" field is cut
- hardcoding something much longer isn't convenient because of
  trouble when moving the table to another DBMS

so far understood.

-> question: any chance to remove the hardcoded 255 limit
   and use the length I defined in the v.in.ascii call instead?
   Or is that size information lost in the communication
   between v.in.ascii and the SQLite driver?

[BTW, it appears that the PostgreSQL driver has a bug regarding TEXT
columns. get_column_info() (db/drivers/postgres/describe.c) doesn't
fill in the size field for TEXT columns, but describe_table() assumes
that it will always be filled in.]

(not sure how to fix that)

Markus

Markus Neteler wrote:

>> when using the SQLite driver, it cuts the "alternatename" field at
>> 256 chars due to a hardcoded limitation in describe.c. I don't
>> remember why the varchar() length isn't used there.
>
> SQLite doesn't have VARCHAR. If you create a VARCHAR column, the
> driver translates it to TEXT. The size you specify when the table is
> created isn't stored anywhere.

This is not clear to me:
- I run v.in.ascii and define that I want varchar(4000).
- this is sent to DBMI which calls the SQLite driver
- in the SQLite driver 255 is hardcoded for TEXT and the
  "alternatename" field is cut
- hardcoding something much longer isn't convenient because of
  trouble when moving the table to another DBMS

so far understood.

-> question: any chance to remove the hardcoded 255 limit
   and use the length I defined in the v.in.ascii call instead?

Not easily.

   Or is that size information lost in the communication
   between v.in.ascii and the SQLite driver?

Clients communicate that information to the driver, but the SQLite
driver ignores it, as there's no way to communicate it to SQLite
itself. So far as SQLite is concerned, columns are either "text",
"integer" or "real". There's no way to specify a limit on the width of
a column, and thus no way to query it.

In theory, the driver could scan the entire table and determine the
widest element. But that would be slow for large tables, and it could
also confuse clients trying to add data to an existing table.
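The point above is easy to verify in the sqlite3 shell: the declared VARCHAR length is parsed but never enforced, so there is nothing for the driver to query (sketch, assuming a stock SQLite 3):

```sql
-- SQLite keeps the declaration text but does not enforce the length:
CREATE TABLE t (name VARCHAR(10));
INSERT INTO t VALUES ('abcdefghijklmnopqrstuvwxyz');  -- succeeds
SELECT length(name) FROM t;                           -- returns 26
```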

--
Glynn Clements <glynn@gclements.plus.com>

On Wed, Jul 2, 2008 at 3:08 AM, Glynn Clements <glynn@gclements.plus.com> wrote:

Markus Neteler wrote:

>> when using the SQLite driver, it cuts the "alternatename" field at
>> 256 chars due to a hardcoded limitation in describe.c. I don't
>> remember why the varchar() length isn't used there.
>
> SQLite doesn't have VARCHAR. If you create a VARCHAR column, the
> driver translates it to TEXT. The size you specify when the table is
> created isn't stored anywhere.

This is not clear to me:
- I run v.in.ascii and define that I want varchar(4000).
- this is sent to DBMI which calls the SQLite driver
- in the SQLite driver 255 is hardcoded for TEXT and the
  "alternatename" field is cut
- hardcoding something much longer isn't convenient because of
  trouble when moving the table to another DBMS

so far understood.

-> question: any chance to remove the hardcoded 255 limit
   and use the length I defined in the v.in.ascii call instead?

Not easily.

It seems that an environment variable is needed in the driver, which
v.in.geonames could then use, because:

v.in.geonames /tmp/IT.txt out=it2 --o
Converted 42860 place names.
Scanning input for column types...
Maximum input row length: 1309
Maximum number of columns: 19
Minimum number of columns: 19
WARNING: Table <it2> linked to vector map <it2> does not exist
ERROR: Length of column 4 <alternatename> (1000) is less than maximum value
       length (1190)

...

In theory, the driver could scan the entire table and determine the
widest element. But that would be slow for large tables, and it could
also confuse clients trying to add data to an existing table.

Right. It would be perfect to have a "bundled" env var like:
GRASS_SQLITE_TEXTCOL_LENGTH=name,length

Or better, two of them? That way we could specify it individually for
a column, and v.in.ascii would not fail.

Markus

On Tue, Jul 1, 2008 at 7:12 PM, Glynn Clements <glynn@gclements.plus.com> wrote:

Markus Neteler wrote:

...

I assume that the escape magic in db/drivers/postgres/execute.c isn't
perfect. Any suggestions here?

Is there actually a problem? All I see are warnings.

If I try inserting the above into a test table using the psql utility,
I get exactly what I would expect.

...

Note that the backslash is present in the file, which appears to be an
error; many other entries have literal apostrophes without the
backslash.

Ah, I overlooked that there is a mismatch in the input data.

(Un)related a segfault: When copying the map from the postgresql mapset
into the sqlite mapset, it crashes:

# in SQLite mapset:
GRASS 6.4.svn (latlong_tbe_climate): > gdb g.copy
GNU gdb 6.6-3mdv2008.0 (Mandriva Linux release 2008.0)
...
This GDB was configured as "x86_64-mandriva-linux-gnu"...
Using host libthread_db library "/lib64/libthread_db.so.1".
(gdb) r vect=it@postgresql,test --o
Starting program:
/home/neteler/grass64/dist.x86_64-unknown-linux-gnu/bin/g.copy
vect=it@postgresql,test --o
[Thread debugging using libthread_db enabled]
[New Thread 46964157956272 (LWP 10388)]
warning: Lowest section in /usr/lib64/libicudata.so.36 is .gnu.hash at
0000000000000120
Copy vector <it@postgresql> to current mapset as <test>
WARNING: Vector map <test> already exists and will be overwritten
WARNING: Table <test> linked to vector map <test> does not exist

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 46964157956272 (LWP 10388)]
0x00002ab6aa9e8888 in db_execute_immediate (driver=0xab8be18d88e1a38c,
SQLstatement=0x7fff00de1240)
    at c_execute.c:16
16 db__set_protocol_fds (driver->send, driver->recv);
(gdb) bt full
#0 0x00002ab6aa9e8888 in db_execute_immediate
(driver=0xab8be18d88e1a38c, SQLstatement=0x7fff00de1240)
    at c_execute.c:16
        ret_code = 32767
#1 0x00002ab6aa9ea276 in db__copy_table (from_drvname=0x610420 "pg",
    from_dbname=0x6103f0 "host=localhost,dbname=postgis",
from_tblname=0x6103b0 "it",
    to_drvname=0x610540 "sqlite",
    to_dbname=0x610600
"/home/neteler/grassdata/latlong_tbe_climate/sqlite/sqlite.db",
    to_tblname=0x6104c0 "test", where=0xbf82e3a482e32ca2 <Address
0xbf82e3a482e32ca2 out of bounds>,
    select=0x85e5a282e3aa83e3 <Address 0x85e5a282e3aa83e3 out of bounds>,
    selcol=0x2cbd9be58c92e5b1 <Address 0x2cbd9be58c92e5b1 out of
bounds>, ivals=0x88e5a7a4e58f84e6,
    nvals=-1846924119) at copy_tab.c:321
        select = -999011943
        col = -1109350210
        ncols = -1950555982
        sqltype = -527778336
        ctype = -1109357636
        more = -533944643
        selcol_found = -1296179037
        buf = "'''Itaria,An Eadailt,An Iodail,An
Iodáil,Etalie,Itaalia,Itaalje,Itaglia,Itali,Italia,Italia (Y),Italia
(�\235),Italie,Italien,Italija,Italio,Italiya,Italië,Italja,Italska,Italujo,Italy,Italya,Ita�\202ia,I"...
        ivalues = (int *) 0xe0a8bde02c8dbce0
        from_handle = {dbName = {string = 0x610680
"host=localhost,dbname=postgis", nalloc = 30}, dbSchema = {
    string = 0x610980 "", nalloc = 1}}
        to_handle = {dbName = {string = 0x611180
"/home/neteler/grassdata/latlong_tbe_climate/sqlite/sqlite.db",
    nalloc = 61}, dbSchema = {string = 0x611980 "", nalloc = 1}}
        tblname = {string = 0x2ab6aa1293b0 "", nalloc = 0}
        sql = {
  string = 0x615b70 "insert into test values ( 14885, 3175395,
'Italian Republic', 'Italian Republic', '''Itaria,An Eadailt,An
Iodail,An Iodáil,Etalie,Itaalia,Itaalje,Itaglia,Itali,Italia,Italia
(Y),Italia (�\235),Italie,It"..., nalloc = 1277}
        value_string = {
  string = 0x6156c0 "''Itaria,An Eadailt,An Iodail,An
Iodáil,Etalie,Itaalia,Itaalje,Itaglia,Itali,Italia,Italia (-Y),Italia
(�\235),Italie,Italien,Italija,Italio,Italiya,Italië,Italja,Italska,Italujo,Italy,Italya,Ita�\202ia,It"...,
nalloc = 1192}
        tblnames = (dbString *) 0x6119a0
        table = (dbTable *) 0x8bbce08abde080be
        out_table = (dbTable *) 0xe12cb2bde0a3bde0
        cursor = {token = 0, driver = 0x6109d0, table = 0x611a10,
column_flags = 0x0, type = 1, mode = 0}
        column = (dbColumn *) 0x9083e1a283e19883
        value = (dbValue *) 0x83e19883e19a83e1
        colname = 0x8a88e1a38ce12c90 <Address 0x8a88e1a38ce12c90 out of bounds>
        from_driver = (dbDriver *) 0xe12c958ae1ab8be1
        to_driver = (dbDriver *) 0xab8be18d88e1a38c
        count = 1
        i = -509436191
#2 0x82e3aa83e3bf82e3 in ?? ()
No symbol table info available.
#3 0xbf82e3a482e32ca2 in ?? ()
No symbol table info available.
#4 0x85e5a282e3aa83e3 in ?? ()
No symbol table info available.
#5 0x2cbd9be58c92e5b1 in ?? ()
No symbol table info available.
#6 0x88e5a7a4e58f84e6 in ?? ()
No symbol table info available.
#7 0x8a84eab491ea2ca9 in ?? ()
No symbol table info available.
#8 0xedb49dec2cba86ea in ?? ()
No symbol table info available.
#9 0x8495ecaca6eb8883 in ?? ()
No symbol table info available.
#10 0x0000000000610027 in ?? ()
No symbol table info available.
#11 0x0000000000610540 in ?? ()
No symbol table info available.
#12 0x00000000006103b0 in ?? ()
No symbol table info available.
#13 0x00000000006103f0 in ?? ()
No symbol table info available.
#14 0x0000000000610420 in ?? ()
No symbol table info available.
#15 0x00007fff00de5500 in ?? ()
No symbol table info available.
#16 0x00002ab6a9efec66 in Vect_copy (in=Cannot access memory at
address 0xa482e32c61693539
) at map.c:238
        i = Cannot access memory at address 0xa482e32c61697249
(gdb)

I tried to debug this with ddd but got lost in the detached processes
(or whatever they are called) of the DBMI driver.

Markus

Markus Neteler wrote:

On Tue, Jul 1, 2008 at 7:12 PM, Glynn Clements <glynn@gclements.plus.com> wrote:
>
> Markus Neteler wrote:
...
>> I assume that the escape magic in db/drivers/postgres/execute.c isn't
>> perfect. Any suggestions here?
>
> Is there actually a problem? All I see are warnings.
>
> If I try inserting the above into a test table using the psql utility,
> I get exactly what I would expect.
...
> Note that the backslash is present in the file, which appears to be an
> error; many other entries have literal apostrophes without the
> backslash.

Ah, I overlooked that there is a mismatch in the input data.

(Un)related a segfault: When copying the map from the postgresql mapset
into the sqlite mapset, it crashes:

# in SQLite mapset:
GRASS 6.4.svn (latlong_tbe_climate): > gdb g.copy
GNU gdb 6.6-3mdv2008.0 (Mandriva Linux release 2008.0)
...
This GDB was configured as "x86_64-mandriva-linux-gnu"...
Using host libthread_db library "/lib64/libthread_db.so.1".
(gdb) r vect=it@postgresql,test --o
Starting program:
/home/neteler/grass64/dist.x86_64-unknown-linux-gnu/bin/g.copy
vect=it@postgresql,test --o
[Thread debugging using libthread_db enabled]
[New Thread 46964157956272 (LWP 10388)]
warning: Lowest section in /usr/lib64/libicudata.so.36 is .gnu.hash at
0000000000000120
Copy vector <it@postgresql> to current mapset as <test>
WARNING: Vector map <test> already exists and will be overwritten
WARNING: Table <test> linked to vector map <test> does not exist

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 46964157956272 (LWP 10388)]
0x00002ab6aa9e8888 in db_execute_immediate (driver=0xab8be18d88e1a38c,
SQLstatement=0x7fff00de1240)
    at c_execute.c:16
16 db__set_protocol_fds (driver->send, driver->recv);
(gdb) bt full
#0 0x00002ab6aa9e8888 in db_execute_immediate
(driver=0xab8be18d88e1a38c, SQLstatement=0x7fff00de1240)
    at c_execute.c:16
        ret_code = 32767
#1 0x00002ab6aa9ea276 in db__copy_table (from_drvname=0x610420 "pg",

[snip]

    nvals=-1846924119) at copy_tab.c:321

At the top of db__copy_table (lib/db/dbmi_client/copy_tab.c) is:

    char buf[1000];

I suspect that this is overflowing at line 295:

294 db_double_quote_string ( &value_string );
295 sprintf (buf, "'%s'", db_get_string(&value_string) );
296 db_append_string ( &sql, buf);

I suggest:

      db_double_quote_string ( &value_string );
      db_append_string ( &sql, "'");
      db_append_string ( &sql, db_get_string(&value_string));
      db_append_string ( &sql, "'");

--
Glynn Clements <glynn@gclements.plus.com>

Markus Neteler wrote:

> In theory, the driver could scan the entire table and determine the
> widest element. But that would be slow for large tables, and it could
> also confuse clients trying to add data to an existing table.

Right. It would be perfect to have a "bundled" env var like:
GRASS_SQLITE_TEXTCOL_LENGTH=name,length

IMHO, it would be better to explicitly store metadata in the SQLite
database. E.g.

CREATE TABLE sqlite_column_types ( table TEXT , column TEXT , coltype INT , length INT )

Whenever a table is created, deleted or modified, the driver would
update the sqlite_column_types table with the appropriate data. When
querying the structure of a table (i.e. db__driver_describe_table()),
the driver would read the data from that table.

The main drawback here is that if you create, delete or modify tables
other than through the SQLite driver, the metadata could become
inconsistent.
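A sketch of how this proposal might look in practice (hypothetical table and column names; note that "table" and "column" are SQL keywords, so the literal CREATE TABLE given above would need quoting or renaming):

```sql
-- Driver-maintained metadata, written on every CREATE/ALTER/DROP:
CREATE TABLE sqlite_column_types (
    tab     TEXT,     -- table name
    col     TEXT,     -- column name
    coltype INTEGER,  -- DBMI column type code (placeholder value below)
    length  INTEGER   -- declared width, e.g. varchar(4000) -> 4000
);

-- When the client creates a table with, say, alternatename VARCHAR(4000),
-- the driver would additionally run:
INSERT INTO sqlite_column_types VALUES ('it', 'alternatename', 1, 4000);

-- and db__driver_describe_table() would read it back:
SELECT coltype, length FROM sqlite_column_types
 WHERE tab = 'it' AND col = 'alternatename';
```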

--
Glynn Clements <glynn@gclements.plus.com>

On Wed, Jul 2, 2008 at 6:43 PM, Glynn Clements <glynn@gclements.plus.com> wrote:

Markus Neteler wrote:

...

(Un)related a segfault: When copying the map from the postgresql mapset
into the sqlite mapset, it crashes:

# in SQLite mapset:

...

(gdb) r vect=it@postgresql,test --o

...

Program received signal SIGSEGV, Segmentation fault.

...

(gdb) bt full
#0 0x00002ab6aa9e8888 in db_execute_immediate
(driver=0xab8be18d88e1a38c, SQLstatement=0x7fff00de1240)
    at c_execute.c:16
        ret_code = 32767
#1 0x00002ab6aa9ea276 in db__copy_table (from_drvname=0x610420 "pg",

[snip]

    nvals=-1846924119) at copy_tab.c:321

At the top of db__copy_table (lib/db/dbmi_client/copy_tab.c) is:

   char buf[1000];

I suspect that this is overflowing at line 295:

294 db_double_quote_string ( &value_string );
295 sprintf (buf, "'%s'", db_get_string(&value_string) );
296 db_append_string ( &sql, buf);

I suggest:

                       db_double_quote_string ( &value_string );
                       db_append_string ( &sql, "'");
                       db_append_string ( &sql, db_get_string(&value_string));
                       db_append_string ( &sql, "'");

Perfect, this cures the problem.
Thanks a lot!

Markus

Glynn Clements <glynn@gclements.plus.com> writes:

[...]

>>> SQLite doesn't have VARCHAR. If you create a VARCHAR column, the
>>> driver translates it to TEXT. The size you specify when the table
>>> is created isn't stored anywhere.

  Not exactly. While I don't know of the GRASS SQLite driver
  specifics, the CREATE TABLE commands are preserved by SQLite:

--cut--
    The exact text of each CREATE TABLE statement is stored in the
    sqlite_master table. Every time the database is opened, all CREATE
    TABLE statements are read from the sqlite_master table and used to
    regenerate SQLite's internal representation of the table layout. If
    the original command was a CREATE TABLE AS, then an equivalent
    CREATE TABLE statement is synthesized and stored in sqlite_master in
    place of the original command. The text of CREATE TEMPORARY TABLE
    statements is stored in the sqlite_temp_master table.
--cut--

  E. g.:

$ sqlite3 /tmp/foo.db
SQLite version 3.3.8
Enter ".help" for instructions
sqlite> CREATE TABLE foo (bar VARCHAR (3));
sqlite> INSERT INTO foo VALUES ('bar baz');
sqlite> SELECT * FROM foo;
bar baz
sqlite> .schema
CREATE TABLE foo (bar VARCHAR (3));
sqlite> SELECT * FROM 'sqlite_master';
table|foo|foo|2|CREATE TABLE foo (bar VARCHAR (3))
sqlite>

  Of course, any width limit specified via VARCHAR () is ignored
  by SQLite itself.

>> This is not clear to me: - I run v.in.ascii and define that I want
>> varchar(4000). - this is sent to DBMI which calls the SQLite driver
>> - in the SQLite driver 255 is hardcoded for TEXT and the
>> "alternatename" field is cut - hardcoding something much longer
>> isn't convenient due to troubles when moving table to other DBMS

>> so far understood.

>> question: any chance to remove the hardcoded 255 limit and use
>> instead the length I defined in the v.in.ascii call?

> Not easily.

  I. e., by means of parsing SQL?

>> Or is that size information eliminated in the communication between
>> v.in.ascii and SQLite driver?

> Clients communicate that information to the driver, but the SQLite
> driver ignores it, as there's no way to communicate it to SQLite
> itself. So far as SQLite is concerned, columns are either "text",
> "integer" or "real". There's no way to specify a limit on the width
> of a column, and thus no way to query it.

  Does it mean that the GRASS SQLite driver doesn't try to put the
  column width into the CREATE TABLE statement? As shown in the
  example above, an arbitrary width limit may be passed to SQLite.

  IIUC, the very reason to allow SQL type specifications to be
  used in CREATE TABLE was the interoperability with other SQL
  implementations.

  But then, don't the major SQL RDBMS support the TEXT type? It
  has unlimited width and thus implies no incompatibility with
  SQLite.

> In theory, the driver could scan the entire table and determine the
> widest element. But that would be slow for large tables, and it
> could also confuse clients trying to add data to an existing table.

Ivan Shmakov wrote:

>>> SQLite doesn't have VARCHAR. If you create a VARCHAR column, the
>>> driver translates it to TEXT. The size you specify when the table
>>> is created isn't stored anywhere.

  Not exactly. While I don't know of the GRASS SQLite driver
  specifics, the CREATE TABLE commands are preserved by SQLite:

--cut--
    The exact text of each CREATE TABLE statement is stored in the
    sqlite_master table. Every time the database is opened, all CREATE
    TABLE statements are read from the sqlite_master table and used to
    regenerate SQLite's internal representation of the table layout. If
    the original command was a CREATE TABLE AS, then an equivalent
    CREATE TABLE statement is synthesized and stored in sqlite_master in
    place of the original command. The text of CREATE TEMPORARY TABLE
    statements is stored in the sqlite_temp_master table.
--cut--

Interesting. But not particularly useful unless we add our own SQL
parser to the SQLite driver. Well, it should only need to handle
CREATE TABLE statements, but it would still need to understand exactly
the syntax which SQLite supports (and the table might have been
created by a future version).

OTOH, that essentially rules out the idea of having the SQLite driver
store its own metadata. That would only work with tables created by
the DBMI driver, and given that limitation, we could add a parser for
the data from sqlite_master which just supports the syntax used by the
SQLite driver.

IOW, change db/drivers/sqlite/describe.c to parse the sqlite_master
table according to the syntax used by db/drivers/sqlite/create_table.c.

> Clients communicate that information to the driver, but the SQLite
> driver ignores it, as there's no way to communicate it to SQLite
> itself. So far as SQLite is concerned, columns are either "text",
> "integer" or "real". There's no way to specify a limit on the width
> of a column, and thus no way to query it.

  Does it mean that the GRASS SQLite driver doesn't try to put the
  column width into the CREATE TABLE statement? As shown in the
  example above, an arbitrary width limit may be passed to SQLite.

The SQLite driver doesn't use VARCHAR at all; see
db/drivers/sqlite/create_table.c.

But even if it did, it doesn't attempt to retrieve any information
from the sqlite_master table.

  IIUC, the very reason to allow SQL type specifications to be
  used in CREATE TABLE was the interoperability with other SQL
  implementations.

SQLite allows them, but essentially ignores them (other than the
information it puts into sqlite_master). If you create a VARCHAR(10)
column, it will happily allow you to store a 1000-character string in
that column, and to retrieve it without truncation.

  But then, don't the major SQL RDBMS support the TEXT type? It
  has unlimited width and thus implies no incompatibility with
  SQLite.

They support the TEXT type, but the DBMI seems to assume fixed sizes,
e.g. the aforementioned problem with the PostgreSQL driver failing to
report the size for TEXT fields.

The MySQL driver simply reports the length reported by MySQL; I have
no idea what that will be for TEXT columns.

--
Glynn Clements <glynn@gclements.plus.com>

Glynn Clements <glynn@gclements.plus.com> writes:

>>>>> SQLite doesn't have VARCHAR. If you create a VARCHAR column, the
>>>>> driver translates it to TEXT. The size you specify when the table
>>>>> is created isn't stored anywhere.

>> Not exactly. While I don't know of the GRASS SQLite driver
>> specifics, the CREATE TABLE commands are preserved by SQLite:

[...]

>> Of course, any width limit specified via VARCHAR () is ignored by
>> SQLite itself.

> Interesting. But not particularly useful unless we add our own SQL
> parser to the SQLite driver.

> Well, it should only need to handle
> CREATE TABLE statements, but it would still need to understand
> exactly the syntax which SQLite supports (and the table might have
> been created by a future version).

[...]

> IOW, change db/drivers/sqlite/describe.c to parse the sqlite_master
> table according to the syntax used by
> db/drivers/sqlite/create_table.c.

  Hm. Now I see sqlite3_column_decltype () in the SQLite API
  reference?

[...]

>> IIUC, the very reason to allow SQL type specifications to be used in
>> CREATE TABLE was the interoperability with other SQL
>> implementations.

> SQLite allows them, but essentially ignores them (other than the
> information it puts into sqlite_master).

  That's the point -- while it behaves its own way, it allows
  standards-compliant database schemata to be reused with little
  or no modification. (Or, rather, it relies on de-facto
  standards, for that matter.)

> If you create a VARCHAR(10) column, it will happily allow you to
> store a 1000-character string in that column, and to retrieve it
> without truncation.

  Yes.

>> But then, don't the major SQL RDBMS support the TEXT type? It has
>> unlimited width and thus implies no incompatibility with SQLite.

> They support the TEXT type, but the DBMI seems to assume fixed sizes,
> e.g. the aforementioned problem with the PostgreSQL driver failing to
> report the size for TEXT fields.

> The MySQL driver simply reports the length reported by MySQL; I have
> no idea what that will be for TEXT columns.

  Shouldn't there be a separate value for the ``no size limit''
  case?

Ivan Shmakov wrote:

>>>>> Glynn Clements <glynn@gclements.plus.com> writes:

>>>>> SQLite doesn't have VARCHAR. If you create a VARCHAR column, the
>>>>> driver translates it to TEXT. The size you specify when the table
>>>>> is created isn't stored anywhere.

>> Not exactly. While I don't know of the GRASS SQLite driver
>> specifics, the CREATE TABLE commands are preserved by SQLite:

[...]

>> Of course, any width limit specified via VARCHAR () is ignored by
>> SQLite itself.

> Interesting. But not particularly useful unless we add our own SQL
> parser to the SQLite driver.

> Well, it should only need to handle
> CREATE TABLE statements, but it would still need to understand
> exactly the syntax which SQLite supports (and the table might have
> been created by a future version).

[...]

> IOW, change db/drivers/sqlite/describe.c to parse the sqlite_master
> table according to the syntax used by
> db/drivers/sqlite/create_table.c.

  Hm. Now I see sqlite3_column_decltype () in the SQLite API
  reference?

Er, right. ISTR that there are reasons why the SQLite driver doesn't
rely upon that much. However, I suspect that it will produce the same
result as parsing the string from sqlite_master.

IIRC, the main problem is that sqlite3_column_decltype() only works
for actual columns, not expressions, subselects etc, but the code in
question has to be able to describe the format of rows returned by
arbitrary SELECT statements, not just tables.

So it falls back to sqlite3_column_type(), which returns the type of
the column's data. But that only works if you have a valid row. If a
select doesn't return any rows, you lose. Also, sqlite3_column_type()
only understands null, integer, float, text and blob types.

Apart from needing to fall back to the data type for expressions,
another problem is that it coerces the decltype to one of SQLite's
limited set of types, then converts that to one of the DB_SQL_* types,
losing information in the process.

However, I don't know if it does this for a specific reason. I suspect
the only way to find out is to try it and see what breaks.

>> IIUC, the very reason to allow SQL type specifications to be used in
>> CREATE TABLE was the interoperability with other SQL
>> implementations.

> SQLite allows them, but essentially ignores them (other than the
> information it puts into sqlite_master).

  That's the point -- while it behaves its own way, it allows
  standards-compliant database schemata to be reused with little
  or no modification. (Or, rather, it relies on de-facto
  standards, for that matter.)

This works well enough if the application already knows the schema,
but it's problematic if you're trying to obtain it from the database
itself.

>> But then, don't the major SQL RDBMS support the TEXT type? It has
>> unlimited width and thus implies no incompatibility with SQLite.

> They support the TEXT type, but the DBMI seems to assume fixed sizes,
> e.g. the aforementioned problem with the PostgreSQL driver failing to
> report the size for TEXT fields.

> The MySQL driver simply reports the length reported by MySQL; I have
> no idea what that will be for TEXT columns.

  Shouldn't there be a separate value for the ``no size limit''
  case?

Probably. But that potentially means re-writing the clients to handle
this case. OTOH, so far no-one seems to have actually noticed that you
get garbage for the length of a TEXT field with the PG driver:

  $ db.describe test
  table:test
  description:
  insert:?
  delete:?
  ncols:1
  nrows:0
  
  column:col
  description:
  type:TEXT
====> len:134591376
  scale:0
  precision:0
  default:
  nullok:yes
  select:?
  update:?

--
Glynn Clements <glynn@gclements.plus.com>