[GRASS5] v.in.ascii updates

Hi,

I'm looking to do some v.in.ascii updates, I thought I'd trawl for
comments first.. most deal with format=point mode.

Fairly straightforward updates:

+ update to use G_getl2() to make MacOS9 ascii import work.

+ add skip= option to skip any header lines when format=point
   (e.g. column headings in a CSV file)

+ add a -s flag to write those skipped header lines to the map's
   meta-data hist file (v.info -h).

Debatable updates:

? skip any line starting with a '#' in input file.
   (could silently ignore data..??)

? skip blank lines in input file.
   (currently extra newlines at EOF break import of points)

? strip "quotes" from both ends of varchar input.

? rename format=standard to format=grass as format=point is default,
   so standard mode is non-standard. Confusing! I'll fix any scripts/.

You can expect me to go ahead and do all these things if no one objects.

Hamish

Hamish wrote:

Hi,

I'm looking to do some v.in.ascii updates, I thought I'd trawl for
comments first.. most deal with format=point mode.

Fairly straightforward updates:

+ update to use G_getl2() to make MacOS9 ascii import work.

+ add skip= option to skip any header lines when format=point
   (e.g. column headings in a CSV file)

+ add a -s flag to write those skipped header lines to the map's
   meta-data hist file (v.info -h).

I think that you can simply write that to history without complicating the module with a new flag?

Debatable updates:

? skip any line starting with a '#' in input file.
   (could silently ignore data..??)

? skip blank lines in input file.
   (currently extra newlines at EOF break import of points)

The user must be warned if there are errors in the file; an empty line may be an error in the input. There should probably be a flag for such skipping.

Maybe it should be done in a GUI? A normal GRASS user can use
cat file | grep -v "^$" | grep -v "#"
?

? strip "quotes" from both ends of varchar input.

? rename format=standard to format=grass as format=point is default,
   so standard mode is non-standard. Confusing! I'll fix any scripts/.

ABSOLUTELY FORBIDDEN! The module options and behaviour MUST NOT change during the 6.x line. That applies to all modules and all developers without exceptions.

Radim

You can expect me to go ahead and do all these things if no one objects.

Hamish

I agree with the changes. I was going to suggest using format=vect (because
it is reading a GRASS vector file), but Radim says verboten.

Michael
____________________
C. Michael Barton, Professor of Anthropology
School of Human Evolution and Social Change
PO Box 872402
Arizona State University
Tempe, AZ 85287-2402
USA

Phone: 480-965-6262
Fax: 480-965-7671
www: <www.public.asu.edu/~cmbarton>

From: Daniel Calvelo Aros <dcalvelo@minag.gob.pe>
Reply-To: <daniel.calvelo@minag.gob.pe>
Date: Wed, 16 Mar 2005 01:02:48 -0500
To: grass5 <grass5@grass.itc.it>
Subject: Re: [GRASS5] v.in.ascii updates

From: Hamish <hamish_nospam@yahoo.com>
Sent: Wed, 16 Mar 2005 16:49:02 +1300

Debatable updates:

? skip any line starting with a '#' in input file.
   (could silently ignore data..??)

Add a flag to turn it off, in case weird #-marked data arises. How hard would
it be to be able to specify the comment marker as an argument? Better, as a
regular expression?
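
A minimal sketch of the configurable comment marker suggested above, written in Python rather than the module's C, purely to illustrate the idea. The helper name `filter_comments` and its `comment_re` parameter are hypothetical; nothing in this thread says v.in.ascii has such an option. Passing `None` switches comment skipping off, which covers the "weird #-marked data" case.

```python
import re


def filter_comments(lines, comment_re=r"^#"):
    """Yield the lines that do not match comment_re.

    Hypothetical helper: comment_re is a regular expression describing a
    comment line; None disables comment skipping entirely.
    """
    if comment_re is None:
        yield from lines
        return
    pattern = re.compile(comment_re)
    for line in lines:
        if not pattern.search(line):
            yield line


data = ["# header", "1|2|a", "; note", "3|4|#b"]
default = list(filter_comments(data))          # only a leading '#' is a comment
custom = list(filter_comments(data, r"^[#;]"))  # '#' or ';' at line start
raw = list(filter_comments(data, None))         # skipping turned off
```

Anchoring the pattern with `^` matters: a `#` inside a data field (as in the last sample line) is kept, whereas an unanchored pattern would silently drop that row.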

? skip blank lines in input file.
   (currently extra newlines at EOF break import of points)

Maybe add a flag to signal whether:

1) An empty line marks EOF (that might be useful to chop data by hand), except
at the beginning
2) Empty lines are ignored
3) Empty lines are an error (except prior to EOF)

? strip "quotes" from both ends of varchar input.

Shouldn't break anything; quoting-dequoting-overquoting is always tricky, so
users should be wary of "". Do the same for '', maybe?
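
One cautious way to do the de-quoting, sketched in Python for illustration only (the module itself is C, and `strip_quotes` is a hypothetical name): strip a single pair of quotes, and only when the first and last characters are the same quote character. That leaves unbalanced or embedded quotes untouched, and it handles '' the same way as "".

```python
def strip_quotes(field, quote_chars="\"'"):
    """Remove one pair of matching quotes from both ends of a field.

    The field is returned unchanged unless it starts and ends with the
    same character and that character is in quote_chars.
    """
    if len(field) >= 2 and field[0] == field[-1] and field[0] in quote_chars:
        return field[1:-1]
    return field


samples = ['"Main St"', "'x'", '"unbalanced', 'say "hi"', '""quoted""']
cleaned = [strip_quotes(s) for s in samples]
```

Only one level is removed per call, so a doubly quoted value keeps its inner quotes; whether that is the right behaviour for over-quoted input is a policy choice this sketch does not settle.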

? rename format=standard to format=grass as format=point is default,
   so standard mode is non-standard. Confusing! I'll fix any scripts/.

Agree.

Hamish wrote:
> Hi,
>
> I'm looking to do some v.in.ascii updates, I thought I'd trawl for
> comments first.. most deal with format=point mode.
>
>
> Fairly straightforward updates:
>
> + update to use G_getl2() to make MacOS9 ascii import work.

done. (point mode only)

> + add skip= option to skip any header lines when format=point
> (e.g. column headings in a CSV file)
>
> + add a -s flag to write those skipped header lines to the map's
> meta-data hist file (v.info -h).

I think that you can simply write that to history without complicating
the module with a new flag?

good point, done. (point mode only)

> Debatable updates:
>
> ? skip any line starting with a '#' in input file.
> (could silently ignore data..??)

done & documented in help page. (point mode only?)

> ? skip blank lines in input file.
> (currently extra newlines at EOF break import of points)

The user must be warned if there are errors in the file; an empty line may be
an error in the input. There should probably be a flag for such skipping.

not done. I'd like to ignore blank lines if they happen at end of file,
but don't see an easy way of doing that. I agree that blank lines
mid-file are an error.

Maybe it should be done in a GUI? A normal GRASS user can use
cat file | grep -v "^$" | grep -v "#"
?

I think that's a pretty serious definition of "Normal GRASS user".

> ? strip "quotes" from both ends of varchar input.

TODO.

> ? rename format=standard to format=grass as format=point is
> default,
> so standard mode is non-standard. Confusing! I'll fix any
> scripts/.

ABSOLUTELY FORBIDDEN! The module options and behaviour MUST NOT change
during the 6.x line. That applies to all modules and all developers
without exceptions.

right. left as is.

Thanks to all for comments, they helped.

Hash (#) comments and skip= are for points mode only right now. Worth
putting in standard mode too? (skip doesn't make much sense there, but
might be a good way to write stuff to the history file?)
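
For readers who want to see the point-mode behaviour described here in one place, a rough sketch follows. It is Python, not the module's C code, and `read_points` with its `skip` and `comment` parameters is a hypothetical stand-in that only mirrors the skip= option and the '#' rule as described in this message. The skipped header lines are returned separately so a caller could write them to the history file.

```python
def read_points(lines, skip=0, comment="#"):
    """Split input lines into (header, data) for a point-mode import.

    The first `skip` lines are kept as header text; after that, any line
    starting with `comment` is dropped and everything else is data.
    Trailing CR/LF characters are removed from every line.
    """
    header, data = [], []
    for number, line in enumerate(lines):
        line = line.rstrip("\r\n")
        if number < skip:
            header.append(line)
        elif line.startswith(comment):
            continue
        else:
            data.append(line)
    return header, data


header, data = read_points(
    ["east,north,name\r\n", "# surveyed 2005\n", "10,20,a\n", "30,40,b\n"],
    skip=1,
)
```

Here `header` holds the column-heading line and `data` holds the two coordinate rows; the '#' line is discarded rather than stored.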

Please report any problems.

Please report if the massive input points memory leak is fixed/same/
worse. (e.g. LIDAR)

Mac people please report if .csv files saved in Excel-for-OSX work.

Hamish

From: Hamish <hamish_nospam@yahoo.com>
Sent: Sun, 20 Mar 2005 22:30:46 +1200

> > ? skip blank lines in input file.
> > (currently extra newlines at EOF break import of points)
>
> The user must be warned if there are errors in the file; an empty line may be
> an error in the input. There should probably be a flag for such skipping.

not done. I'd like to ignore blank lines if they happen at end of
file, but don't see an easy way of doing that. I agree that blank
lines mid-file are an error.

In points.c, points_to_bin function, I guess.

If we are settling on:

a) blank line(s) followed by eof -> ok and
b) blank line(s) followed by non-eof -> error,

you might catch the first blank line then consume ascii_in until not blank or
eof; then the previous rule applies.
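
The rule above can be sketched as follows. This is an illustrative Python version, not the points.c code, and `check_blank_lines` is a hypothetical name. It remembers where a run of blank lines started and only complains if data shows up afterwards, so a trailing run of blank lines at EOF passes silently.

```python
def check_blank_lines(lines):
    """Return the data lines, allowing blank lines only as a run at EOF.

    Raises ValueError when a blank line is followed by more data.
    """
    data = []
    first_blank = None  # 1-based number of the first line in the current blank run
    for number, line in enumerate(lines, start=1):
        if line.strip() == "":
            if first_blank is None:
                first_blank = number
            continue
        if first_blank is not None:
            raise ValueError(
                f"blank line {first_blank} followed by data at line {number}"
            )
        data.append(line.rstrip("\r\n"))
    return data


points = check_blank_lines(["1|2\n", "3|4\n", "\n", "\n"])
```

Under rule (b) a blank line at the very start of the file is also an error, since data follows it; a real implementation might prefer to treat leading blanks differently.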

Just an idea...

Daniel.

> > I'm looking to do some v.in.ascii updates, I thought I'd trawl for
> > comments first.. most deal with format=point mode.

..

Please report if the massive input points memory leak is fixed/same/
worse. (e.g. LIDAR)

Still there...

I finally got around to installing valgrind to check where the memory
leaks are.

quick analysis:

just by watching "top" (hit "M" to sort by memory use) in another term,
the v.in.ascii program seems ok; it's the $GISBASE/driver/db/dbf process
which has the leak in it.

valgrind analysis results:
  http://bambi.otago.ac.nz/hamish/grass/memleak/

The main offenders are:

dig_alloc_node (struct_alloc.c:46)
RTreeNewNode (node.c:47)
dig_alloc_line (struct_alloc.c:112)

Left over allocated memory over 1mb at program end:

==880==
==880== 1951960 bytes in 48799 blocks are still reachable in loss record 32 of 34
==880== at 0x1B906EDD: malloc (vg_replace_malloc.c:131)
==880== by 0x1B950DB9: dig_alloc_node (struct_alloc.c:46)
==880== by 0x1B94B1CE: dig_add_node (plus_node.c:114)
==880== by 0x1B94A641: dig_add_line (plus_line.c:54)
==880==
==880==
==880== 3513528 bytes in 48799 blocks are still reachable in loss record 33 of 34
==880== at 0x1B906EDD: malloc (vg_replace_malloc.c:131)
==880== by 0x1B950F65: dig_alloc_line (struct_alloc.c:112)
==880== by 0x1B94A556: dig_add_line (plus_line.c:45)
==880== by 0x1B91A922: Vect_build_nat (build_nat.c:481)
==880==
==880==
==880== 9137296 bytes in 19196 blocks are still reachable in loss record 34 of 34
==880== at 0x1B906EDD: malloc (vg_replace_malloc.c:131)
==880== by 0x1B972EBE: RTreeNewNode (node.c:47)
==880== by 0x1B975076: RTreeSplitNode (split_q.c:326)
==880== by 0x1B97368A: RTreeAddBranch (node.c:205)
==880==
==880== LEAK SUMMARY:
==880== definitely lost: 4639 bytes in 54 blocks.
==880== possibly lost: 0 bytes in 0 blocks.
==880== still reachable: 15994816 bytes in 214537 blocks.
==880== suppressed: 200 bytes in 1 blocks.

Note for the first two, number of blocks = number of imported points.
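
A quick way to check that observation is to divide bytes by blocks for each loss record: a whole-number result is consistent with one fixed-size allocation per point. The snippet below is only a small Python helper for reading the log lines quoted above; `bytes_per_block` is a hypothetical name, not part of valgrind or GRASS.

```python
import re

# Matches the "N bytes in M blocks are ..." part of a valgrind loss record.
LOSS_RE = re.compile(
    r"(\d+) bytes in (\d+) blocks are (still reachable|definitely lost)"
)


def bytes_per_block(log_line):
    """Return (bytes, blocks, bytes // blocks) for a loss-record line, or None."""
    match = LOSS_RE.search(log_line)
    if match is None:
        return None
    total, blocks = int(match.group(1)), int(match.group(2))
    return total, blocks, total // blocks


records = [
    "==880== 1951960 bytes in 48799 blocks are still reachable in loss record 32 of 34",
    "==880== 3513528 bytes in 48799 blocks are still reachable in loss record 33 of 34",
    "==880== 9137296 bytes in 19196 blocks are still reachable in loss record 34 of 34",
]
sizes = [bytes_per_block(r)[2] for r in records]
```

For the three records above this gives 40, 72 and 476 bytes per block, each dividing evenly. Note that valgrind classes all three as "still reachable", meaning the memory was still pointed to at exit, not "definitely lost".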

no clue what needs to be freed to fix these..

Hamish

Can you do the same for the driver?

Radim

> I finally got around to installing valgrind to check where the
> memory leaks are.

..

Can you do the same for the driver?

Sure; but can you provide a command line test(s)?

Valgrind seems to go through all calls (even glibc, shared libs, etc) as
they are made during runtime, so to test a part of the driver you just
need to make a call to it. sorry, I don't know much about databases or
how the drivers work to come up with an intelligent test myself.

?
Hamish

Hamish wrote:

I finally got around to installing valgrind to check where the
memory leaks are.

..

Can you do the same for the driver?

Sure; but can you provide a command line test(s)?

What do you mean?

Valgrind seems to go through all calls (even glibc, shared libs, etc) as
they are made during runtime, so to test a part of the driver you just
need to make a call to it. sorry, I don't know much about databases or
how the drivers work to come up with an intelligent test myself.

The module creates the driver as a new process (fork-exec). I think that you have to use --trace-children=yes.

Radim

>>>I finally got around to installing valgrind to check where the
>>>memory leaks are.
> ..
>
>>Can you do the same for the driver?

..

The module creates the driver as a new process (fork-exec). I think
that you have to use --trace-children=yes.

ok,

CMD="v.in.ascii in=test_pts.dat out=test4 --o"
valgrind -v --tool=addrcheck --leak-check=yes --trace-children=yes $CMD

Memcheck log:
http://bambi.otago.ac.nz/hamish/grass/memleak/v.in.ascii/memchk_child.txt

Look for process 2230 (dbf):

==2230== 36748400 bytes in 48100 blocks are definitely lost in loss record 15 of 16
==2230== at 0x3414CFE5: calloc (vg_replace_malloc.c:176)
==2230== by 0x3416AAB9: sqpInitStmt (alloc.c:30)
==2230== by 0x804B09F: execute (dbfexe.c:58)
==2230== by 0x804D5CA: db__driver_execute_immediate (execute.c:29)

Also graphical Massif test, heap growth analysis: (end of page)
  http://bambi.otago.ac.nz/hamish/grass/memleak/v.in.ascii/

Hamish

It's funny, sqpFreeStmt(st) was really missing in execute(), I thought that I had checked it already before. 50000 points - dbf: 100MB -> 10MB
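
To make the shape of that bug concrete, here is a toy model in Python. It is not the dbf driver's C code: `StmtPool`, `init_stmt` and `free_stmt` are hypothetical stand-ins for the sqpInitStmt()/sqpFreeStmt() pair named in this thread, and the counter merely imitates unfreed allocations. The valgrind record above (36748400 bytes in 48100 blocks, which is 764 bytes per block) is consistent with one statement being allocated per execute() call and never released.

```python
class StmtPool:
    """Toy allocator that counts statements not yet freed."""

    def __init__(self):
        self.outstanding = 0

    def init_stmt(self):
        self.outstanding += 1
        return object()

    def free_stmt(self, stmt):
        self.outstanding -= 1


def execute(pool, sql, free=True):
    """Model one execute() call: allocate a statement, use it, release it."""
    stmt = pool.init_stmt()
    try:
        pass  # parsing and running `sql` would happen here
    finally:
        if free:  # free=False imitates the missing free call
            pool.free_stmt(stmt)


leaky, fixed = StmtPool(), StmtPool()
for i in range(48100):
    execute(leaky, "statement text", free=False)
    execute(fixed, "statement text")
```

After the loop `leaky.outstanding` equals the number of calls while `fixed.outstanding` is zero, which is the difference a single missing free per call makes over a large import.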

Thank you Hamish. Should I fix that also in the 6.0 branch?

Finally Helena can use 6.

Radim

cool.

I can now import the 1.1 million point sample LIDAR data set
(lidaratm2.txt.gz, 31.5mb uncompressed) from-
  http://mpa.itc.it/grasstutor/data_menu2nd.phtml

According to top, v.in.ascii using that file drives the dbf process up
to 248mb now and during "Registering lines" part the v.in.ascii process
goes up to 435mb memory use -- still a few small leaks? re-run valgrind?

(I don't understand why a point-only vector file is worrying about
lines?)

Thank you Hamish. Should I fix that also in the 6.0 branch?

seems like a perfect candidate for 6.0.1 to me.

H

Hamish wrote:

It's funny, sqpFreeStmt(st) was really missing in execute(), I thought that I had checked it already before. 50000 points - dbf: 100MB -> 10MB

cool.

that is really cool! - thank you both. The 280000 point file that took
30 minutes to import while the machine was completely frozen now takes 2 minutes
and you don't even know that it is running.
v.in.sites now works well with this data set too.
I haven't tried it with the data that have millions of points yet,
but I am sure that this fixed the major problem.

thanks a lot once more,

Helena

Thank you Hamish. Should I fix that also in the 6.0 branch?

seems like a perfect candidate for 6.0.1 to me.

It must be tested a bit in 6.1 first, I think.

Radim