[GRASS-dev] v.in.ascii memory errors again

I'm having problems importing huge lidar point sets using v.in.ascii. I
thought this issue was resolved by the -b flag, but v.in.ascii consumes
all available memory even during the initial scan of the data (before
building the topology, which the -b flag should skip).

My data set is comma-separated x,y,z points:

v.in.ascii -ztb input=BasinPoints.txt output=NeuseBasinPts fs="," z=3

Sample data:

1939340.84,825793.89,657.22
1939071.95,825987.78,660.22
1939035.52,826013.97,662.46
1938762.45,826210.15,686.28
1938744.05,826223.34,688.57
1938707.4,826249.58,694.1
1938689.21,826262.62,696.55
1938670.91,826275.77,698.07
1938616.48,826314.99,699.31
1938598.36,826328.09,698.58

I have over 300 million such records and the input file is over 11GB.

v.in.ascii runs out of memory and crashes during points_analyse() in
v.in.ascii/points.c.

I did a CVS update about a week ago.
In the current CVS version of grass/vector/v.in.ascii/points.c
(revision 1.16), the G_free_tokens() call is outside the loop that
scans each line, instead of inside the loop as it was in 1.13. The
commit message for 1.14, where the change was made, says it fixed a
segfault in LatLong, but the change leads to unbounded memory usage and
is almost certainly a memory leak, since G_tokenize() mallocs new
memory on each call.
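The pattern at issue can be sketched like this (a self-contained
illustration, not the actual points.c code; tokenize()/free_tokens()
are stand-ins for G_tokenize()/G_free_tokens()): since the tokenizer
mallocs a fresh token array for every input line, the free has to
happen once per iteration, or the scan leaks one array per line.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for G_tokenize(): split buf on sep into a NULL-terminated
 * array of malloc'd strings.  A fresh array is allocated per call. */
static char **tokenize(const char *buf, char sep, int *ntok)
{
    int n = 1;
    for (const char *p = buf; *p; p++)
        if (*p == sep)
            n++;
    char **tokens = malloc((n + 1) * sizeof *tokens);
    const char *start = buf;
    for (int i = 0; i < n; i++) {
        const char *end = strchr(start, sep);
        size_t len = end ? (size_t)(end - start) : strlen(start);
        tokens[i] = malloc(len + 1);
        memcpy(tokens[i], start, len);
        tokens[i][len] = '\0';
        start = end ? end + 1 : start + len;
    }
    tokens[n] = NULL;
    *ntok = n;
    return tokens;
}

/* Stand-in for G_free_tokens(). */
static void free_tokens(char **tokens)
{
    for (int i = 0; tokens[i]; i++)
        free(tokens[i]);
    free(tokens);
}

/* Correct pattern: free inside the per-line loop, so memory use stays
 * bounded no matter how many lines are scanned. */
int count_fields(const char *lines[], int nlines, char sep)
{
    int total = 0;
    for (int i = 0; i < nlines; i++) {
        int n;
        char **tokens = tokenize(lines[i], sep, &n);
        total += n;
        free_tokens(tokens);  /* moving this after the loop leaks one
                                 array per line -- ~300M arrays here */
    }
    return total;
}
```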

Can anyone comment on the change or confirm that the current CVS
behavior is buggy?

-Andy

A follow-up to my recent v.in.ascii error: by moving G_free_tokens()
back inside the loop, I am able to get through the first pass over the
input data. I can now see why the free was originally moved outside the
loop to fix lat/long problems: tokens[i] is redirected to a different
buffer in the LL case. This seems problematic and a possible source of
memory leaks.

I believe this problem can be solved with an extra free/malloc of
tokens inside the LL section of the code. But I am not using LL data,
and I ran into another, bigger problem.
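That fix can be sketched as follows (hypothetical helper name; the real
change would go in the LL branch of points.c): instead of redirecting
tokens[i] at a foreign buffer, free the original token and install a
malloc'd copy of the converted value, so a single G_free_tokens() per
line remains safe.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper sketching the proposed fix: after the lat/long
 * conversion produces `converted` (e.g. in a static buffer), do not
 * point tokens[i] at that buffer.  Free the token the tokenizer
 * allocated and store a heap copy instead, so freeing the whole token
 * array once per line stays valid. */
static void replace_token(char **tokens, int i, const char *converted)
{
    char *copy = malloc(strlen(converted) + 1);
    strcpy(copy, converted);
    free(tokens[i]);   /* release the original malloc'd token */
    tokens[i] = copy;  /* the array owns a heap copy again */
}
```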

When building the vector file, I get "ERROR: Cannot write line
(negative offset)". I suspect this comes from Vect_write_line() when
V1_write_line_nat() returns -1. It looks like V1_write_line_nat(),
dig_fseek(), and dig_ftell() use 32-bit file offsets (longs) instead of
off_t, which can be 32- or 64-bit depending on compiler flags. So it
seems the vector libraries do not support vector files over 2GB. Is it
possible/likely that someone could update dig_fseek() and dig_ftell()
to use off_t instead of long? How many places use these dig_f* functions?
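A sketch of what such an update might look like (hypothetical wrapper
names, assuming a POSIX platform; the real functions live in the dig_*
layer of the vector library): fseeko() and ftello() take and return
off_t, and with _FILE_OFFSET_BITS=64 defined before any system header,
off_t is 64-bit even on 32-bit systems, so offsets beyond 2GB no longer
wrap negative.

```c
/* Both defines must come before any #include so the C library
 * declares fseeko()/ftello() and picks a 64-bit off_t. */
#define _POSIX_C_SOURCE 200112L
#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical 64-bit-clean versions of the dig_fseek()/dig_ftell()
 * wrappers: same shape, but off_t instead of long. */
static int dig_fseek64(FILE *fp, off_t offset, int whence)
{
    return fseeko(fp, offset, whence);  /* 0 on success, like fseek */
}

static off_t dig_ftell64(FILE *fp)
{
    return ftello(fp);  /* never truncated to 32 bits */
}
```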

-Andy


_______________________________________________
grass-dev mailing list
grass-dev@grass.itc.it
http://grass.itc.it/mailman/listinfo/grass-dev

Andrew Danner wrote:

> I'm having problems importing huge lidar point sets using v.in.ascii.
> I thought this issue was resolved with the -b flag, but v.in.ascii is
> consuming all the memory even in the initial scan of the data (before
> building the topology, which should be skipped with the -b flag)
>
> I have over 300 million such records and the input file is over 11GB.
>
> v.in.ascii runs out of memory and crashes during points_analyse in
> v.in.ascii/points.c

Hi,

you'll probably want to use the new r.in.xyz module for anything more
than 3 million points. I think v.surf.rst is the only module which can
do something useful with vector points without topology.

http://grass.ibiblio.org/grass61/manuals/html61_user/r.in.xyz.html
http://hamish.bowman.googlepages.com/grassfiles#xyz

v.in.ascii (without -b) has large memory needs due to topology support.
Search the mailing list archives for the many comments on the
"v.in.ascii memory leak" from Radim, Helena, and myself. Here is some
valgrind analysis I did on it at the time:

http://bambi.otago.ac.nz/hamish/grass/memleak/v.in.ascii/

If you can find a way to lessen the memory footprint, then great!
Same for large file and 64bit support fixes.

> I can now see why the free was originally moved outside the loop
> to fix lat/long problems: because tokens[i] is redirected to a
> different buffer in the LL case. This seems problematic and a possible
> source of memory leaks.

But LL parsing support was only added relatively recently? Need to
check the CVS log; rev 1.9:
http://freegis.org/cgi-bin/viewcvs.cgi/grass6/vector/v.in.ascii/points.c

So it's not a "core", unfixable part of the code?

Radim said that freeing memory was slow. Maybe free a chunk of memory
every 50000th point or so?

Hamish

Hamish,

Thanks for the tips. I've used r.in.xyz a few times, and it is quite an
improvement over the s.in.ascii, s.to.rast (or the grass6 equivalent)
path, but I do want to use v.in.ascii without topology support, because
ultimately I want to run a modified version of v.surf.rst that I wrote
to process very large lidar point sets. With grass5 and the sites
format, I could process over 350 million points using my modified
v.surf.rst, and it was a bit faster than v.surf.rst on smaller inputs
of around 10 million points. I would like to clean up my code a bit and
submit it to the GRASS project, or perhaps offer it as an add-on, but
the new vector format is limiting the number of points I can get into
Grass6.

I followed the previous memory problems discussed by you, Helena, and
Radim, and I think this is a separate problem. I'm not sure whether the
freeing-memory-is-slow problem applies to the tokenizer in the first
pass over the data or to something in the topology building; I thought
it was the latter.

I think the recent G_free_tokens change for LL projections is a bug,
but one that is fixable in a rather short period of time. I don't have
any lat/long data to test on, and I won't have much time to look into
it in the next few weeks, but perhaps I can take a look at it in late
July. The bigger problem is 64-bit file support in the vector library.
Are the 64-bit file I/O functions ftello() and fseeko() portable to all
the different platforms that run GRASS? Perhaps Glynn knows? If there
is any point in the code that writes file offsets to disk, we would
need to be careful about compatibility issues there. I don't know this
section of the code very well, so I'm reluctant to make any changes
before getting some advice.
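For what it's worth, fseeko()/ftello() come from the POSIX large-file
work, so they should exist on the usual GRASS targets (Linux, the BSDs,
Mac OS X, Solaris); Windows compilers typically need different names
(e.g. _fseeki64() in MSVC). One cheap safeguard, sketched here assuming
a POSIX toolchain, is a compile-time check that off_t really is 64-bit
before the vector library relies on it:

```c
#define _FILE_OFFSET_BITS 64   /* request 64-bit off_t before headers */
#include <sys/types.h>

/* C89-style compile-time assertion: the array gets a negative size,
 * and the build fails, if off_t is narrower than 64 bits. */
typedef char off_t_is_64bit[sizeof(off_t) >= 8 ? 1 : -1];
```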

-Andy

On Thu, 2006-06-29 at 17:59 +1200, Hamish wrote:

> you'll probably want to use the new r.in.xyz module for anything more
> than 3 million points. I think v.surf.rst is the only module which can
> do something useful with vector points without topology.
>
> If you can find a way to lessen the memory footprint, then great!
> Same for large file and 64bit support fixes.
>
> Radim said that freeing memory was slow. Maybe free a chunk of memory
> every 50000th point or so?
>
> Hamish

Hamish,

thanks for the answer, but as Andy has mentioned, he is already using
v.in.ascii -b and v.surf.rst (his version). Andy is not the only one
who needs v.in.ascii -b to behave at least the way it did before the
LL-related change. I just emailed other users, who were asking about
it, that the v.in.ascii -b / v.surf.rst pipeline for processing large
point data sets works, and apparently it is broken now.

Helena

Hamish wrote:

> you'll probably want to use the new r.in.xyz module for anything more
> than 3 million points. I think v.surf.rst is the only module which can
> do something useful with vector points without topology.
>
> Radim said that freeing memory was slow. Maybe free a chunk of memory
> every 50000th point or so?
>
> Hamish

--
Helena Mitasova
Department of Marine, Earth and Atmospheric Sciences
North Carolina State University
1125 Jordan Hall
NCSU Box 8208
Raleigh, NC 27695-8208
http://skagit.meas.ncsu.edu/~helena/

email: hmitaso@unity.ncsu.edu
ph: 919-513-1327 (no voicemail)
fax: 919-515-7802