[GRASS5] Re: [GRASSLIST:9350] Re: v.in.ascii problems

There is an on-going discussion about this on the GRASS development
list. From a simple test I ran last night, v.in.ascii -b (the -b flag
is new in GRASS 6.1 CVS) does not build topology, and this removes one
of the two humps in memory consumption. The other hump (> 200MB for a
1M point file with a single attribute) was associated with writing the
dbf file (the file itself is 60MB), and is where things stick now. In
addition, the -b flag leaves the vector data set at level 1 (topology
absent), and almost all vector commands need level 2.

I do not know whether using a database driver other than the default
would help. The dbf writing stage precedes the topology building, so
the two memory-intensive humps are separate, with topology being a
little larger. Reading 1M points on a 1.5GHz P4 took about 7 minutes
with topology, and about half that without.

Use the -z and -t flags (and the z= option) to avoid creating the
table. If the input is just x,y,z data there is no need for a table.
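
For reference, such a bare x,y,z input is just one point per line with
a field separator (these two example lines are taken from the debug
output further down):

-75.622346,35.949693,500.054
-75.629469,35.949693,11.962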

At minimum, we need v.info, v.surf.rst, v.univar, v.out.ascii (points)
and some sort of subsampling module (i.e. an s.cellstats port) working
with this data. d.vect works already (with a warning); maybe
v.surf.idw too.

Probably not many more modules though? I think if Radim doesn't want
this to be a commonplace use of the vector model then it probably
shouldn't be; he knows it better than anyone. So for now massive point
datasets need to be treated as a special case of the vector model, and
only as a work-around solution.

e.g. with the sample LIDAR data (from the GRASS downloads page):

G61> v.in.ascii -zbt in=lidaratm2.txt out=lidaratm2 x=1 y=2 z=3 fs=,

The first 250k points take about 20 seconds to load.

If I use the full million it gets stuck on the scanning step:

D3/3: row 374430 : 28 chars

Interestingly, that line is the second value with elevation > 100.

Changing the first z value to 500.054, it segfaults pretty quickly:

D5/5: Vect_hist_write()
D4/5: G_getl2: ->-75.622346,35.949693,500.054<-
D3/5: row 1 : 28 chars
D4/5: token: -75.622346
D4/5: is_latlong north: -75.622346
D4/5: row 1 col 0: '-75.622346' is_int = 0 is_double = 1
D4/5: token: 35.949693
D4/5: is_latlong north: 35.949693
D4/5: row 1 col 1: '35.949693' is_int = 0 is_double = 1
D4/5: row 1 col 2: '500.054' is_int = 0 is_double = 1
D4/5: G_getl2: ->-75.629469,35.949693,11.962<-
D3/5: row 2 : 27 chars
D4/5: token: -75.629469
D4/5: is_latlong north: -75.629469
D4/5: row 2 col 0: 'H629469' is_int = 0 is_double = 0
Segmentation fault

where is 'H629469' coming from?

In v.in.ascii/points.c, tmp_token is getting corrupted, and the damage
cascades from there:

int points_analyse () {
...
    char **tokens;
...
    tmp_token = (char *) G_malloc (256);
...
    while (1) {
...
        tokens = G_tokenize (buf, fs);    /* tokens[] point into buf */
...
        for (i = 0; i < ntokens; i++) {
...
[*]         sprintf (tmp_token, "%f", northing);
...
            /* replace current DMS token by decimal degree */
            tokens[i] = tmp_token;        /* now aliases tmp_token */

BOOM: pointer abuse. (The bug is in the new lat/lon scanning code,
present only in 6.1 CVS.)
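
A quick standalone sketch of the aliasing problem (simplified; this is
not the actual points.c code): every converted column ends up pointing
at the same 256-byte tmp_token, so each conversion clobbers the
previous one, and tokens[] no longer point into buf as the
G_tokenize() comment (quoted further down) says they should.

#include <stdio.h>

int main(void)
{
    static char tmp_token[256];
    char *tokens[2];

    /* col 0: scanned coord rewritten as decimal degrees, and the
       token pointer replaced by tmp_token */
    sprintf(tmp_token, "%f", -75.622346);
    tokens[0] = tmp_token;

    /* col 1: the same tmp_token is reused, so col 0 silently
       changes out from under us */
    sprintf(tmp_token, "%f", 35.949693);
    tokens[1] = tmp_token;

    /* both columns now print "35.949693" */
    printf("col 0 = '%s'\ncol 1 = '%s'\n", tokens[0], tokens[1]);

    return 0;
}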

[*] Also, if the northing column is longer than 256 chars without
    hitting the fs, is that a buffer overflow? Should an int maxlength
    parameter be added to G_tokenize(), or can "%f" never produce more
    than 256 bytes? The same "%f" also effectively cuts the precision
    of lat/lon coords to 6 places after the decimal point (admittedly
    pretty small on the ground).
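
For what it's worth, a quick standalone check (plain C, not GRASS
code) of the two "%f" questions: the default precision is indeed 6
digits after the decimal point, and "%f" output from a double can in
principle exceed 256 bytes - though never for a sane coordinate value,
only for something astronomically large.

#include <stdio.h>
#include <float.h>

int main(void)
{
    char buf[512];
    int n;

    /* default "%f" precision: 6 decimal places, value rounded */
    n = snprintf(buf, sizeof(buf), "%f", -75.6293458123);
    printf("'%s' (%d chars)\n", buf, n);  /* '-75.629346' (10 chars) */

    /* worst case: sign + ~309 integer digits + '.' + 6 decimals */
    n = snprintf(buf, sizeof(buf), "%f", -DBL_MAX);
    printf("worst case: %d chars\n", n);  /* 317 - well over 256 */

    return 0;
}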

improvements come one bug at a time...

Hamish

On Thu, 8 Dec 2005, Hamish wrote:

> Use the -z and -t flags (and the z= option) to avoid creating the
> table. If the input is just x,y,z data there is no need for a table.

For me, in an x-y location, my data with 1M points and -zbt now read
in just under a minute; lidaratm2.txt took effectively the same time
(64 rather than 57 seconds; z here is double, not int), and v.in.ascii
stayed at a respectable 3.3MB. d.vect works, but as you say prints a
warning.

Emboldened by the 1M points case, I've tried 13.6M points, with:

GRASS > date; v.in.ascii -zbt input=gptsL.txt output=gptsLz x=1 y=2 z=3 ; date
Wed Dec 7 15:44:23 CET 2005
Maximum input row length: 37
Maximum number of columns: 3
Minimum number of columns: 3
Wed Dec 7 15:57:20 CET 2005

so it scales quite decently, at about 1M points a minute. Once again
the v.in.ascii process stayed at just over 3MB the whole time. So
putting the single attribute of interest in the z column, not building
topology, and not creating a database table seems to work - the coords
file is now about 480MB, though, so on 32-bit machines its size may be
a further limiting factor.

I was also seeing seg-faults with my data in a long-lat location, so
I switched to an x-y location (current CVS 6.1).

Roger

--
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand@nhh.no

More on this, from the comment on G_tokenize() in lib/gis/token.c:

/* break buf into tokens. delimiters are replaced by NULLs
   and tokens array will point to various locations in buf
   buf must not have a new line */

So overwriting (or attempting to overwrite :) a tokens[i] string with
a tmp_token which is longer will stray into tokens[i+1]'s storage, or,
for the last token, beyond the end of the buffer.

e.g. G_scan_northing("35N") -> "35.000000" will have the zeros stray
into the easting column.
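
And a standalone illustration of that overwrite (again simplified, not
the GRASS code): the tokens point into one buffer with NULs where the
separators were, so copying a longer replacement string over tokens[i]
tramples tokens[i+1].

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* as after G_tokenize("35N,-75.62", ","): a NUL sits where
       the ',' was, and both tokens point into the same buffer */
    char buf[] = "35N\0-75.62";
    char *tokens[2] = { buf, buf + 4 };

    printf("before: north='%s' east='%s'\n", tokens[0], tokens[1]);

    /* overwrite the 3-char token with its 9-char decimal form */
    strcpy(tokens[0], "35.000000");

    /* the zeros have strayed into the easting column */
    printf("after:  north='%s' east='%s'\n", tokens[0], tokens[1]);
    /* prints: north='35.000000' east='00000' */

    return 0;
}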

Hamish