While looking into ways to optimize v.in.ogr, I noticed that GRASS does not support coor files larger than 2 GB. With topological information stored in that file, and often many dead lines wasting space, the coor file can easily exceed 2 GB nowadays. While v.in.ogr was cleaning one particular vector, the coor file grew to 9 GB. I killed v.in.ogr before it finished; the coor file resulting from writing out that vector might well have been above 10 GB. GRASS can process such large coor files (to a degree) as long as topo is kept in memory, e.g. with v.in.ogr and v.clean, and can close such a vector and write it to disk. But the vector cannot be opened again: the topo file stores the size of the coor file, and that stored value cannot exceed 2 GB (integer limit), so it no longer matches the actual file size and opening fails with an error.
I want to propose some solutions to this problem:
high-level
Modules modifying the coor file, e.g. v.in.ogr, v.clean, v.overlay, v.buffer, should do all the processing in a temporary vector and at the end copy only the alive lines to the final output vector; Vect_copy_map_lines() does that. When importing shapefiles with areas I noticed that this reduces the coor file size by a factor of 2 to 5, which is quite a lot (e.g. a 1 GB coor file can be melted down to 200 MB, much nicer). This is also suggested in the vector TODO [1]; I'm just pressing for it again.
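To illustrate why copying only alive lines shrinks the file, here is a toy model in plain C. It is not the GRASS coor format and does not call the vector library; struct toy_line, copy_alive() and total_bytes() are made-up names for this sketch, and the record sizes are invented to match the 1 GB to 200 MB example above (scaled down to bytes).

```c
#include <stddef.h>

/* Toy stand-in for one record in a coor file; NOT the GRASS on-disk format. */
struct toy_line {
    int alive;     /* 0 = dead line (deleted or replaced during cleaning) */
    size_t nbytes; /* bytes this record occupies in the file */
};

/* Copy only the alive records from src to dst; dst must have room for n
 * records. Returns the number of records copied. */
size_t copy_alive(const struct toy_line *src, size_t n, struct toy_line *dst)
{
    size_t i, kept = 0;

    for (i = 0; i < n; i++) {
        if (src[i].alive)
            dst[kept++] = src[i];
    }
    return kept;
}

/* Sum of the bytes occupied by the first n records. */
size_t total_bytes(const struct toy_line *lines, size_t n)
{
    size_t i, sum = 0;

    for (i = 0; i < n; i++)
        sum += lines[i].nbytes;
    return sum;
}
```

With four records of 100, 300, 500 and 100 bytes, of which only the first and last are alive, the temporary "file" holds 1000 bytes and the copied output holds 200. In a real module the copy step would be the Vect_copy_map_lines() call mentioned above.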
low-level
The coor file size is held in memory as type long, counting the bytes of the coor file. On a 32-bit system, and on my 64-bit Linux with 32-bit compatibility, long is a 32-bit integer, which gives the 2 GB limit. When the vector is closed, this number is written to the topo file. When the vector is opened again, the number is read from the topo file and compared to the actual coor file size; for a coor file larger than 2 GB the two cannot match.
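The arithmetic behind the 2 GB limit can be checked with a few lines of C. This is only a sketch: size_fits_int32() and gib_to_bytes() are hypothetical helpers written for this mail, not functions from the vector library.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the GRASS vector library:
 * returns 1 if a byte count fits in a signed 32-bit integer, else 0. */
int size_fits_int32(int64_t nbytes)
{
    return nbytes >= 0 && nbytes <= INT32_MAX;
}

/* Convert a size in GiB to bytes, using 64-bit arithmetic throughout
 * so the multiplication itself cannot overflow for realistic sizes. */
int64_t gib_to_bytes(int64_t gib)
{
    return gib * 1024 * 1024 * 1024;
}
```

INT32_MAX is 2147483647, one byte short of 2 GiB, so a 1 GiB coor file passes the check while the 9 GB file from my v.in.ogr run does not.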
If this coor file size information in the topo file is just a safety check and not needed to process the coor file, it could be omitted altogether. The supported coor file size would then be limited only by the system and filesystem. All references to the coor file size would need to be removed from the vector library.
If the coor file size stored in the topo file is indeed needed to properly process the coor file, the respective variables must be of some type other than long in order to support coor files larger than 2 GB, maybe long long? The same goes for all intermediate variables in the vector library that store the coor file size.
Looking at limits.h, long can have the same range as int or the same range as long long (the latter only on true 64-bit systems). I use 64-bit Linux with 32-bit compatibility, and here long is like int. Could someone more familiar with type limits and type declarations on different systems please help!
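To find out what long can hold on a given platform, something like the following could be compiled there. It uses only limits.h and stdint.h from standard C (C99); long_bits() and fits_in_long() are names I made up for this sketch.

```c
#include <limits.h>
#include <stdint.h>

/* Storage width of long, in bits, on the platform this is compiled for. */
int long_bits(void)
{
    return (int)(sizeof(long) * CHAR_BIT);
}

/* Returns 1 if a byte count can be stored in a long on this platform,
 * else 0. The comparison is done in the wider of the two types. */
int fits_in_long(int64_t nbytes)
{
    return nbytes >= LONG_MIN && nbytes <= LONG_MAX;
}
```

Where long_bits() returns 32, fits_in_long() rejects a 10 GB size; where it returns 64, the size fits. A fixed-width type such as int64_t, or long long, would give the same range everywhere, which might be the safer choice for the variables holding the coor file size.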
I also suspect some integer overflow in rtree for large coor files; maybe someone in the know could look into that?
Regards,
Markus M
[1] http://freegis.org/cgi-bin/viewcvs.cgi/*checkout*/grass6/doc/vector/TODO