There is an ongoing discussion about this on the GRASS development
list. From a simple test I ran last night, v.in.ascii -b (the -b
flag is new in GRASS 6.1 CVS) does not build topology, and this
removes one of the two humps in memory consumption. The other hump (>
200MB for a 1M point file with a single attribute) was associated with
writing the dbf file (the file is 60MB), and is where things stick
now. In addition, the -b flag leaves the vector data set at level 1
topology (absent), and almost all vector commands need level 2.

I do not know whether using a database driver other than the default
would help. The dbf writing stage precedes the topology building, so
the two memory-intensive humps are separate, with topology being
slightly larger. Reading 1M points on a 1.5GHz P4 took about 7
minutes with topology, and about half that time without.
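(If level 2 topology is needed later, presumably it could be built as
a separate step with v.build, e.g. "v.build map=lidaratm2" for the map
imported below, though at the cost of hitting the same topology memory
hump at that point.)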
Use the -z and -t flags (and the z= option) to avoid making the
table. If the input is just x,y,z data there is no need for a table.
At minimum, we need v.info, v.surf.rst, v.univar, v.out.ascii
(points) and some sort of subsampling module (i.e. an s.cellstats
port) working with this data. d.vect works already (with a warning);
maybe v.surf.idw too. Probably not many more modules though? -- I
think if Radim doesn't want this to become commonplace use of the
vector model then it probably shouldn't be; he knows it better than
anyone. So for now massive point datasets need to be treated as a
special case in the vector model, with only a workaround solution.
e.g. with the sample LIDAR data (from the GRASS downloads page):
G61> v.in.ascii -zbt in=lidaratm2.txt out=lidaratm2 x=1 y=2 z=3 fs=,
The first 250k points take about 20 seconds to load.
If I use the full million, it gets stuck on the scanning step:
D3/3: row 374430 : 28 chars
Interestingly, that line is the second value with elevation > 100.
Changing the first z value to 500.054, it segfaults pretty quickly:
D5/5: Vect_hist_write()
D4/5: G_getl2: ->-75.622346,35.949693,500.054<-
D3/5: row 1 : 28 chars
D4/5: token: -75.622346
D4/5: is_latlong north: -75.622346
D4/5: row 1 col 0: '-75.622346' is_int = 0 is_double = 1
D4/5: token: 35.949693
D4/5: is_latlong north: 35.949693
D4/5: row 1 col 1: '35.949693' is_int = 0 is_double = 1
D4/5: row 1 col 2: '500.054' is_int = 0 is_double = 1
D4/5: G_getl2: ->-75.629469,35.949693,11.962<-
D3/5: row 2 : 27 chars
D4/5: token: -75.629469
D4/5: is_latlong north: -75.629469
D4/5: row 2 col 0: 'H629469' is_int = 0 is_double = 0
Segmentation fault
Where is 'H629469' coming from?

In v.in.ascii/points.c, tmp_token is getting corrupted, and the
damage cascades from there:
int points_analyse (){
...
char **tokens;
...
tmp_token=(char *) G_malloc(256);
...
while (1) {
...
tokens = G_tokenize (buf, fs);
...
for ( i = 0; i < ntokens; i++ ) {
...
[*] sprintf(tmp_token, "%f", northing);
...
/* replace current DMS token by decimal degree */
tokens[i]=tmp_token;
BOOM. Pointer abuse. (The bug is in the new lat/lon scanning code,
only in 6.1 CVS.) Every converted column is pointed at the same
tmp_token buffer, so a later sprintf() clobbers earlier values; and
once the token array is cleaned up (presumably via G_free_tokens())
tmp_token itself is left dangling, so the next row writes through a
freed pointer. That would explain both the 'H629469' garbage and the
segfault.
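A self-contained toy (my sketch, not actual GRASS code) shows the
aliasing failure mode and the obvious repair, copying the scratch
buffer instead of aliasing it. In v.in.ascii itself the copy would
presumably be G_store(), with ownership reconciled against
G_free_tokens():

/* Toy reproduction of the aliasing pattern in points_analyse(),
 * not actual GRASS code: one shared scratch buffer is aliased into
 * the token array, so the second conversion clobbers the first.
 * The repair is to store a private copy per token. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *tmp_token = malloc(256);
    char *tokens[2];

    /* buggy pattern: both tokens alias tmp_token */
    snprintf(tmp_token, 256, "%f", -75.622346);
    tokens[0] = tmp_token;
    snprintf(tmp_token, 256, "%f", 35.949693);
    tokens[1] = tmp_token;
    /* prints the same (second) value twice */
    printf("aliased: col0=%s col1=%s\n", tokens[0], tokens[1]);

    /* fixed pattern: each token gets its own copy */
    snprintf(tmp_token, 256, "%f", -75.622346);
    tokens[0] = strdup(tmp_token);
    snprintf(tmp_token, 256, "%f", 35.949693);
    tokens[1] = strdup(tmp_token);
    printf("copied:  col0=%s col1=%s\n", tokens[0], tokens[1]);

    free(tokens[0]);
    free(tokens[1]);
    free(tmp_token);
    return 0;
}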
[*] And if the northing column parses to a value whose "%f" form is
longer than 256 bytes, buffer overflow?? (An int maxlength parameter
to G_tokenize() wouldn't help here; the overflow would be in the
sprintf(), not the tokenizing.) Or can %f never be more than 256
bytes long? It can, for pathological input: the integer part of %f
output grows with the value's magnitude, and formatting DBL_MAX that
way takes 300+ characters.
Also, the same "%f" effectively cuts the precision of lat/lon coords
to 6 places after the decimal point (about 0.1m on the ground, so
admittedly pretty small).
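For what it's worth, both holes could be closed at the sprintf() call
itself. A minimal sketch (256 matching the G_malloc'd size of
tmp_token, and G_fatal_error() just one way to bail out):

if (snprintf(tmp_token, 256, "%.15g", northing) >= 256)
    G_fatal_error("coordinate value too long to format");

snprintf() bounds the write no matter how big the value is, and
"%.15g" keeps (nearly) full double precision instead of %f's fixed 6
decimal places.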
improvements come one bug at a time...
Hamish