[GRASSLIST:9361] More v.in.ascii problems

I have a 945MB text file that contains x,y,z, and cats. I run the following:

v.in.ascii -zt input=SiteB_0.5m_backscatter.txt output=TEST_backscatter x=3 y=2 z=4 cat=1 fs=' '

And receive the following error:

Maximum input row length: 34
Maximum number of columns: 4
Minimum number of columns: 4
Building topology ...
Registering lines: 6 [main] v.in.ascii 3388 fixup_mmaps_after_fork:
WARNING: VirtualProtectEx to return to previous state in parent failed for
MAP_PRIVATE address
0x5BF0000, Win32 error 87
113738 [main] v.in.ascii 3388 fixup_mmaps_after_fork: WARNING:
VirtualProtect to copy protection to child failed forMAP_PRIVATE address
0x5BF0000, Win32 error 487
212186 [main] v.in.ascii 3388 fixup_mmaps_after_fork: ReadProcessMemory
(2nd try) failed for MAP_PRIVATE address 0x5BF0000, Win32 error 487
C:\cygwin\usr\local\grass6.1.cvs\bin\v.in.ascii (3388): ***
recreate_mmaps_after_fork_failed
     76 [main] v.in.ascii 2884 fork_parent: child 3388 died waiting for dll
loading
45286676 [main] v.in.ascii 1364 fixup_mmaps_after_fork: WARNING:
VirtualProtect to copy protection to child failed forMAP_PRIVATE address
0x5BF0000, Win32 error 487
45347289 [main] v.in.ascii 1364 fixup_mmaps_after_fork: ReadProcessMemory
(2nd try) failed for MAP_PRIVATE address 0x5BF0000, Win32 error 487
C:\cygwin\usr\local\grass6.1.cvs\bin\v.in.ascii (1364): ***
recreate_mmaps_after_fork_failed
47040634 [main] v.in.ascii 2884 fork_parent: child 1364 died waiting for dll
loading
ERROR: G_realloc: out of memory
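
The three summary lines at the top of that output (longest row, largest and smallest column count) describe the shape of the input. The same figures can be reproduced outside GRASS to check that a file is well-formed before starting a long import. What follows is only a minimal Python sketch, not the v.in.ascii code: the name scan_points_file is made up for illustration, and it assumes fields are split on a single-character separator, as fs=' ' suggests.

```python
import io


def scan_points_file(lines, fs=" "):
    """Return (max_row_length, min_columns, max_columns) for an iterable of lines.

    Hypothetical helper, not part of GRASS. Blank lines are skipped.
    """
    max_len = 0
    min_cols = None
    max_cols = 0
    for line in lines:
        row = line.rstrip("\r\n")
        if not row:
            continue
        ncols = len(row.split(fs))
        max_len = max(max_len, len(row))
        max_cols = max(max_cols, ncols)
        min_cols = ncols if min_cols is None else min(min_cols, ncols)
    return max_len, (min_cols or 0), max_cols


# Made-up sample rows in "cat y x z" order, only to exercise the function.
sample = io.StringIO("1 4945000.5 455000.5 -12.25\n2 4945001.0 455000.5 -13\n")
print(scan_points_file(sample))
```

Passing an open file object instead of the StringIO reads the file one line at a time, so memory use should stay flat even for a file of this size.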

Would the -b flag mentioned by Roger alleviate this problem? I'm working on
Cygwin/XP with 6.1 CVS (Sept. 2). I do have an Ubuntu Breezy installation up and
running, but I can't use the latest 6.1 CVS on it until I get my Tcl and Tk
links sorted out.

~ Eric.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Eric Patton

Technologist, Geo-Spatial Data Services
Geological Survey of Canada (Atlantic)
Natural Resources Canada
Bedford Institute of Oceanography
Dartmouth, Nova Scotia, Canada B2Y 4A2

Postal address: P.O. Box 1006
Courier address: 1 Challenger Drive

Telephone: (902)426-7732
Facsimile: (902)426-4104
E-mail: epatton@NRCan.gc.ca

-----Original Message-----
From: owner-GRASSLIST@baylor.edu [mailto:owner-GRASSLIST@baylor.edu] On
Behalf Of Roger Bivand
Sent: Wednesday, December 07, 2005 10:04 AM
To: Hamish
Cc: jgomezdans@gmail.com; GRASSLIST@baylor.edu; grass5@grass.itc.it
Subject: [GRASSLIST:9358] Re: v.in.ascii problems

On Thu, 8 Dec 2005, Hamish wrote:

> There is an on-going discussion about this on the GRASS development
> list. From a simple test I ran last night, v.in.ascii -b (the -b
> flag is new in GRASS 6.1 CVS) does not build topology, and this
> removes one of the two humps in memory consumption. The other hump
> (> 200MB for a 1M point file with a single attribute) was associated
> with writing the dbf file (the file is 60MB), and is where things
> stick now. In addition, the -b flag leaves the vector data set at
> level 1 topology (absent), and almost all vector commands need level 2.
>
> I do not know whether the use of a different database driver than
> the default would help. The dbf writing stage precedes the topology
> building, so the two memory-intensive humps are separate, with
> topology being a little larger. Reading 1M points on a 1.5GHz P4
> with topology took about 7 minutes; without it, about half that time.

Use the -z and -t flags to avoid making the table. (and the z= option)
If the input is just x,y,z data there is no need for a table.

For me in an x-y location my data with 1M points and -zbt now read in just
under a minute; lidaratm2.txt in effectively the same time (64 rather than
57 seconds, z here is double not int) and v.in.ascii stays at a respectable
3.3MB size. d.vect works, but as you say prints a warning.

At minimum, we need v.info, v.surf.rst, v.univar, v.out.ascii (points)
and some sort of subsampling module (i.e. an s.cellstats port) working with
this data. d.vect works already (with a warning). Maybe v.surf.idw too.

Probably not many more modules though? -- I think if Radim doesn't
want this to be common-place use of the vector model then it probably
shouldn't be. He knows it better than anyone.. So for now massive
point datasets need to be treated as a special case to the vector
model & only a work-around solution.
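
Until a subsampling module exists, one stopgap is to thin the ASCII file before it ever reaches v.in.ascii. The sketch below is an assumption, not something proposed in this thread: it keeps every Nth row, which is plain decimation and not the per-cell statistics that an s.cellstats port would give. The name thin_points is hypothetical.

```python
import io


def thin_points(src, dst, keep_every=10):
    """Copy every keep_every-th non-blank line from src to dst.

    Works on any pair of text file objects and reads one line at a time,
    so the whole input never has to fit in memory. Returns rows written.
    """
    if keep_every < 1:
        raise ValueError("keep_every must be >= 1")
    seen = 0
    written = 0
    for line in src:
        if not line.strip():
            continue
        if seen % keep_every == 0:
            dst.write(line)
            written += 1
        seen += 1
    return written


# Made-up x,y,z rows, only to exercise the function.
demo_in = io.StringIO("".join(f"{i},0,0\n" for i in range(25)))
demo_out = io.StringIO()
print(thin_points(demo_in, demo_out, keep_every=10))
```

With real data the two StringIO objects would be replaced by files opened for reading and writing, and the thinned file passed to v.in.ascii as usual.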

e.g. with the sample LIDAR data (GRASS downloads page)

G61> v.in.ascii -zbt in=lidaratm2.txt out=lidaratm2 x=1 y=2 z=3 fs=,

The first 250k points take about 20 seconds to load.

If I use the full million it gets stuck on the scanning step:

D3/3: row 374430 : 28 chars

Interesting, that line is the second value with elevation > 100.

changing the first z value to 500.054 it segfaults pretty quick:

D5/5: Vect_hist_write()
D4/5: G_getl2: ->-75.622346,35.949693,500.054<-
D3/5: row 1 : 28 chars
D4/5: token: -75.622346
D4/5: is_latlong north: -75.622346
D4/5: row 1 col 0: '-75.622346' is_int = 0 is_double = 1
D4/5: token: 35.949693
D4/5: is_latlong north: 35.949693
D4/5: row 1 col 1: '35.949693' is_int = 0 is_double = 1
D4/5: row 1 col 2: '500.054' is_int = 0 is_double = 1
D4/5: G_getl2: ->-75.629469,35.949693,11.962<-
D3/5: row 2 : 27 chars
D4/5: token: -75.629469
D4/5: is_latlong north: -75.629469
D4/5: row 2 col 0: 'H629469' is_int = 0 is_double = 0
Segmentation fault

where is 'H629469' coming from?

I was also seeing seg-faults with my data in a long-lat location, so
switched to x-y (current CVS 6.1).

Roger

v.in.ascii/points.c
tmp_token is getting corrupted, cascades from there

int points_analyse (){
...
    char **tokens;
...
    tmp_token=(char *) G_malloc(256);
...
    while (1) {
...
        tokens = G_tokenize (buf, fs); ...
        for ( i = 0; i < ntokens; i++ ) { ...
[*]         sprintf(tmp_token, "%f", northing);
...
            /* replace current DMS token by decimal degree */
            tokens[i]=tmp_token;

BOOM. Pointer abuse. (The bug is in the new lat/lon scanning code, only in
6.1 CVS.)

[*] And if the northing column is longer than 256 characters without hitting
   the fs, is that a buffer overflow? Add an "int maxlength" parameter to
   G_tokenize()? Or can %f never produce more than 256 bytes?
   The same %f also effectively cuts the precision of lat/lon coords down to
   6 places after the decimal point (though that is pretty small on the ground).
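
Both %f questions can be checked numerically. Python's % operator follows the C convention for %f (fixed notation, six digits after the point), so the short sketch below stands in for the sprintf() call. It is an illustration, not the points.c code: the helper name fixed_repr and the metres-per-degree figure are assumptions. For a double, %f writes every integer digit, so a very large value can run to a little over 300 characters and would not fit in 256 bytes; any sane latitude or longitude needs only about a dozen.

```python
def fixed_repr(value):
    """Format like C's printf("%f"): fixed notation, six decimals."""
    return "%f" % value


# A plausible coordinate with more than six decimals in the input.
lat = 35.9496931234
text = fixed_repr(lat)
rounding_error_deg = abs(float(text) - lat)

# Assumption: roughly 111,320 m per degree along the equator.
METRES_PER_DEGREE = 111_320.0
worst_case_m = 0.5e-6 * METRES_PER_DEGREE  # largest rounding error on the ground

# A huge value (nonsensical as a coordinate) to probe the buffer question.
huge = fixed_repr(1e300)

print(text, len(text))
print(worst_case_m)
print(len(huge))
```

So the six-decimal truncation costs at most a few centimetres on the ground, while the 256-byte buffer is only safe if the value has already been validated as a real coordinate before formatting.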

improvements come one bug at a time...

Hamish

--
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand@nhh.no

On Wed, 7 Dec 2005, Patton, Eric wrote:

> Would the -b flag mentioned by Roger alleviate this problem? I'm working on
> Cygwin/XP with 6.1 CVS (Sept. 2). I do have an Ubuntu Breezy installation up and
> running, but I can't use the latest 6.1 CVS on it until I get my Tcl and Tk
> links sorted out.

In principle, yes, because the topology is not built, so the command exits
before you see the meltdown. The cat column is being thrown away by -t (as
far as I understand), as the database table is not being written. I'd
expect the coords file to be about the same size as the input file,
roughly 30M points. The -b flag is only in very recent CVS; 2 Sept.
predates it, so you'd need a more recent build to try it.
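
The "roughly 30M points" figure follows from the file size and the row length reported by the scan. A back-of-the-envelope check in Python, with the assumptions spelled out: "945MB" is read as MiB, every row ends in a single newline, and the average row length of 32 bytes is a guess just under the reported maximum of 34 plus newline.

```python
def estimate_points(file_size_bytes, bytes_per_row):
    """Rough row count for a text file whose rows are about the same length."""
    return file_size_bytes // bytes_per_row


size = 945 * 1024 * 1024              # "945MB" read as MiB (assumption)
low = estimate_points(size, 35)       # every row at the 34-char maximum + newline
typical = estimate_points(size, 32)   # assumed average row, slightly shorter
print(low, typical)
```

Either way the count lands between 28 and 31 million rows, so each point only needs a few tens of bytes of working memory before the total becomes a problem on a 32-bit system.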

Roger

--
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand@nhh.no
