[GRASS-dev] vector libs: file based spatial index

Hi,

I now have a completely file-based spatial index for the vector libs
that could reduce memory consumption considerably. The spatial index
file is usually 2-3 times larger than the topo file; for point datasets
it is about 5 times larger than the topo file. In GRASS 6.x, all of
this is always kept in memory and built from scratch.
My implementation is completely file based, also when creating or
updating a vector. This obviously comes with a speed penalty because
reading from memory is faster than reading from file. With all sorts of
tricks, and relying on the system to cache files, the file-based
spatial index takes about twice as long to build as the memory-based
spatial index, e.g. v.build takes twice as long (on my test system).
That's less than I was afraid it would take, but still twice as long.
Other new features: a dynamic 2D or 3D spatial index (previously fixed
to 3D, even for 2D vectors), higher portability across different
platforms, a speed increase and more control over memory consumption
from translating all recursive functions into non-recursive ones, and
recycling of dead space in files (the principle of Radim's suggestion
in the vector TODO). A complete rewrite of the rtree library, I'm
afraid.

Considering that a file-based spatial index is only useful for massive
vectors, where memory can become a limiting factor, I hesitate to
commit to trunk. A sophisticated compromise would be to build the
spatial index in memory and then write it out to disk. Opening an old
vector for reading only (Vect_open_old()) would be much faster, and
spatial queries would not be much slower than before, because only the
spatial index header is loaded; searches in the file are fairly fast.
When opening a new vector, the spatial index could be built in memory
and then written out. When opening an old vector for update, the
spatial index could be read into memory completely and written out
when closing.

What to do now? Leave it all in memory as in grass6, build in memory
then write out (risk of running out of memory on massive datasets), or
keep it always in file? I'll not commit any time soon (also waiting for
the lib/raster commotion to settle down); I need feedback on how to
proceed, or whether I should forget about it.

Markus M

Markus GRASS wrote:

What to do now? Leave it all in memory as in grass6, build in memory
then write out (risk of running out of memory on massive datasets), or
keep it always in file? I'll not commit any time soon (also waiting for
the lib/raster commotion to settle down); I need feedback on how to
proceed, or whether I should forget about it.

I think advice from Radim would be very useful here.
All the best.
--
Paolo Cavallini: http://www.faunalia.it/pc

Paolo Cavallini wrote:

Markus GRASS wrote:
[...]

I think advice from Radim would be very useful here.
All the best.

OK, let me rephrase. I think I have two alternatives to the current
implementation of the vector spatial index and would like to know
whether grass7 should get 1) faster vector display and lower memory
consumption at the cost of (sometimes) slower vector processing [1],
2) faster vector display and similar vector processing speed, but with
the continued risk of running out of memory when processing large
datasets, or 3) no changes to the spatial index. IMHO this should be a
general decision of the GRASS community, not of one or two developers.

The actual implementation is not easy and advice from Radim and other
developers would really be very useful.

Markus M

[1] the sometimes slow speed of v.lidar.edgedetection is caused by the
module, not the vector libs, as is probably true for other modules,
e.g. v.extract (I remember a remark earlier this year; haven't searched
the MLs)

Nice work.

What about making it scalable? In case the vector library would take
more than GRASS_VECTOR_MEMORY_MAX (or similar), make it file-based;
keep it in memory otherwise.
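
A minimal sketch of such a check (GRASS_VECTOR_MEMORY_MAX is only the
variable proposed above; the helper name and byte units are my
assumptions, not anything in the libs):

#include <stdlib.h>

/* Decide whether the spatial index should go to file, based on an
 * estimated index size and an optional limit in bytes taken from
 * the proposed environment variable. Returns 1 for file-based,
 * 0 for memory-based. */
static int sidx_use_file(size_t estimated_index_bytes)
{
    const char *limit_str = getenv("GRASS_VECTOR_MEMORY_MAX");

    if (!limit_str)
        return 0;    /* no limit set: keep the index in memory */

    return estimated_index_bytes > (size_t)strtoul(limit_str, NULL, 10);
}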

j

On Tue, Jun 23, 2009 at 08:28:49PM +0200, Markus GRASS wrote:
[...]

--
Jachym Cepicky
e-mail: jachym.cepicky gmail com
URL: http://les-ejk.cz
GPG: http://www.les-ejk.cz/pgp/JachymCepicky.pgp
Key fingerprint: 0C6D 0EAE 76BD 506C F299 ED8A C8AB 74B8 08D4 E08F

On 23/06/09 20:28, Markus GRASS wrote:

[...]

OK, let me rephrase. I think I have two alternatives to the current
implementation of the vector spatial index and would like to know
whether grass7 should get 1) faster vector display and lower memory
consumption at the cost of (sometimes) slower vector processing [1],
2) faster vector display and similar vector processing speed, but with
the continued risk of running out of memory when processing large
datasets, or 3) no changes to the spatial index. IMHO this should be a
general decision of the GRASS community, not of one or two developers.

What size of vector data are we talking about concerning the risk of running out of memory? Would it be possible to implement both 1) and 2), with 2) being the default and a flag to switch to 1) for very large vectors?

You wrote:

Considering that a file based spatial index is only useful for massive
vectors where memory can become a limiting factor, I hesitate to commit
to trunk.

Well, I don't know what you call massive, but one of the main problems with the memory-based index is that currently the spatial index is rebuilt for each run of v.what, as it is stored nowhere. This makes querying large maps (e.g. 20,000-30,000 polygons) _very_ slow when using the GUI (which will be the only option in grass7, as xmons have disappeared).

I'm not sure I understand everything correctly, but I have the feeling that there are two questions here:

1) Should we have file-based storage of the spatial index? This can then be read into memory when necessary, which should still be faster than rebuilding it each time.

To this I clearly say +1.
The question here is: how to make sure the index is up to date?

2) Should the entire treatment of the index be file-based, i.e. the index is never read into memory but always accessed via file, with the speed penalties you spoke about?

To this I would say: if we can have file-based, permanent storage of the index, but read it into memory for treatment unless a flag says not to, then this would probably be the ideal, within my limited understanding of the issue.

Moritz

Great work!

In keeping with good GRASS traditions, I would suggest making this
user-configurable via an env var. Perhaps:

GRASS_SPATIAL_INDEX =
  AUTO (default: keep in mem if enough RAM available)
  FILE
  MEMORY
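
For illustration, a rough sketch of parsing such a setting
(GRASS_SPATIAL_INDEX and its values are only the suggestion above;
nothing like this exists in the libs yet):

#include <stdlib.h>
#include <string.h>

enum sidx_mode { SIDX_AUTO, SIDX_FILE, SIDX_MEMORY };

/* Map the proposed GRASS_SPATIAL_INDEX variable to a mode,
 * falling back to AUTO when unset or unrecognized. */
static enum sidx_mode get_sidx_mode(void)
{
    const char *val = getenv("GRASS_SPATIAL_INDEX");

    if (val && strcmp(val, "FILE") == 0)
        return SIDX_FILE;
    if (val && strcmp(val, "MEMORY") == 0)
        return SIDX_MEMORY;
    return SIDX_AUTO;
}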

Cheers,

Ben

Moritz Lennert wrote:
[...]

--
Benjamin Ducke
Senior Geospatial Consultant

Oxford Archaeology
Janus House
Osney Mead
OX2 0ES
Oxford, U.K.

Tel: +44 (0)1865 263 800 (switchboard)
Tel: +44 (0)1865 980 758 (direct)
Fax :+44 (0)1865 793 496
benjamin.ducke@oxfordarch.co.uk


On Tue, 23 Jun 2009, Markus GRASS wrote:

My implementation is completely file based, also when creating or
updating a vector. This comes obviously with a speed penalty because
reading in memory is faster than reading from file. With all sorts of
tricks and relying on the system to cache files, the file based spatial

Hi Markus,

The last point above sounds interesting. What sort of tricks? Are you using memory-mapped files? My thinking would be to always read it from the file and rely on the operating system as far as possible not to slow down performance too much (compared to holding the index in memory) for small indexes. I think that would be a seamless, foolproof (from the user's point of view) approach that would scale very well with both increasing dataset sizes and computer hardware performance.

Paul

Paul Kelly wrote:

On Tue, 23 Jun 2009, Markus GRASS wrote:

My implementation is completely file based, also when creating or
updating a vector. This comes obviously with a speed penalty because
reading in memory is faster than reading from file. With all sorts of
tricks and relying on the system to cache files, the file based spatial

Hi Markus, The last point above sounds interesting. What sort of tricks?

When inserting, deleting or searching, the search path is remembered
(kept in memory) and walked back using a stack in non-recursive
functions. This reduces file IO operations by about half and should be
a faster algorithm than the recursive alternative. As an example,
inserting a new item into an RTree that already holds about 50,000
items requires a total of 3 file reads and 3-5 file writes; each file
IO reads or writes 528 bytes on a 32-bit system and 586 bytes on a
64-bit system. And when using intermediate files, the whole chunk of
528/586 bytes is read or written at once.
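
For illustration, a minimal sketch of that idea, descending without
recursion while remembering the path on a stack (node layout and all
names are invented for the example, not the actual diglib code):

#include <stdio.h>

#define MAXCARD 10    /* children per node */
#define MAXLEVEL 32   /* more than enough for any realistic tree */

struct rtree_node {
    int level;                /* 0 = leaf level */
    int count;                /* used children */
    long child_pos[MAXCARD];  /* file offsets of the children */
    /* ... child bounding boxes etc. ... */
};

struct path_entry {
    struct rtree_node node;   /* node as read from file */
    long pos;                 /* its file offset */
    int branch;               /* which child was descended into */
};

/* Placeholder: the real code picks the child whose bounding box
 * needs least enlargement; here we just take the first one. */
static int pick_branch(const struct rtree_node *n)
{
    (void)n;
    return 0;
}

/* Descend from the root to a leaf, keeping every visited node on a
 * stack. Walking the stack back (top--) allows updating parent nodes
 * after an insertion without re-reading them from file. */
static int descend(FILE *fp, long root_pos, struct path_entry *s)
{
    int top = 0;

    s[top].pos = root_pos;
    if (fseek(fp, root_pos, SEEK_SET) != 0 ||
        fread(&s[top].node, sizeof(struct rtree_node), 1, fp) != 1)
        return -1;

    while (s[top].node.level > 0 && top < MAXLEVEL - 1) {
        int b = pick_branch(&s[top].node);

        s[top].branch = b;
        s[top + 1].pos = s[top].node.child_pos[b];
        if (fseek(fp, s[top + 1].pos, SEEK_SET) != 0 ||
            fread(&s[top + 1].node, sizeof(struct rtree_node), 1, fp) != 1)
            return -1;
        top++;
    }
    return top;    /* s[0..top] now holds the full search path */
}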

Are you using memory-mapped files?

What are memory-mapped files? Excuse my ignorance, I'm just a
self-trained coder (learning by doing).

My thinking would be to always read it from the file, and rely on the
operating system as best as possible to not slow down performance too
much (compared to holding the index in memory) for small sizes of
indexes. I think that would be a seamless foolproof (from the user's
point of view) approach that would scale very well with both
increasing size of datasets and performance of computer hardware.

It would also require less complex code; it has already become quite a
bit more complex by now.

Markus M

On 24/06/09 22:49, Markus GRASS wrote:

Moritz:

I'm not sure I understand everything correctly, but I have the feeling
that there are two questions here:

1) Should we have file-based storage of the spatial index? This can
then be read into memory when necessary, which should still be faster
than rebuilding it each time.

If an old vector is opened just for reading (v.what, v.info, probably
also d.vect), the fastest solution is probably to only load the header
of the spatial index, as is done for the coor file, and perform spatial
queries in the file. This is very fast AFAICT.

Then the main issue is during editing? I guess it depends on the use cases, but I don't know if frequent editing happens very often on large files... IMHO, a slight speed penalty for the infrequent updates of vectors is acceptable compared to the huge advantage of not having to rebuild the index every time.

To this I clearly say +1.
The question here is: how to make sure the index is up to date?

Keeping it up to date is not a problem per se with any method. The real
issue is whether to keep it in the file or to load it into memory when
modifying it: speed vs. memory consumption. And this is where I would
like to get feedback. What is your experience: do larger vectors use
too much memory, or is vector processing relatively slow and should
not get any slower?

I haven't yet used vector files that caused memory problems, but I have had serious speed problems... So I would plead for whatever makes things faster.

Hmm yes, I think these massive datasets are still the exception; now
and then someone tries to work with huge vectors, but this is not the
everyday case (maybe because it takes so long...).

They will probably become more normal (cf Lidar), so we should plan with them in mind, without making everyone else "suffer" because a few people need to use them.

I would rather want to hard-code the way of modifying a spatial index.
There are different possibilities:
1) let the vector libs figure out what is best (very difficult)

-1

2) have
an env variable (could work),

As this retains flexibility for the future, I would favor this, but I have no idea what this entails in terms of increased code complexity.

3) have a new flag in all vector modules
(users shouldn't be bothered on module level with vector model details)

Yeah, I agree that this is probably not the best idea.

4) decide on a new standard method and hard-code that method as was done
for the coor file which is never loaded to memory.

The largest file I have used has about 125,000 areas with a topo file weighing 42 MB, so taking your worst estimate this would mean around 200 MB of spatial index, which is still largely acceptable for me.

I find it a bit difficult to give you a definitive answer on the basis of theory alone. Do you have any means of testing the impact of one choice over the other for different use cases (editing, v.build, v.what - the latter especially when used from the GUI)?

If the above is difficult, I would say go for your current preference, which seems to be file-based. Would it be possible (just thinking out loud, without any idea what this entails) to work on two levels, with high-level functions that can call either file-based or memory-based low-level functions? This way you could create the high-level API with a file-based system behind it, but allow future creation of another memory-based "backend" if the need arises. E.g. something like the high-level db functions, which you can call regardless of the actual db driver.

Moritz

Moritz Lennert wrote:

Markus:

If an old vector is opened just for reading (v.what, v.info, probably
also d.vect), the fastest solution is probably to only load the header
of the spatial index, as is done for the coor file, and perform spatial
queries in the file. This is very fast AFAICT.

Then the main issue is during editing?

Yes. Modules where a speed decrease will be observed are e.g. v.in.ogr,
v.clean, v.buffer. Some modules only build topology together with the
spatial index at the end, when the new vector is ready, e.g. v.select
and v.kernel; in these cases there will be no obvious speed
difference.

I haven't yet used vector files that caused memory problems, but I
have had serious speed problems... So I would plead for whatever makes
things faster.

Speed problems can have many reasons and need to be tackled one at a
time. Poor speed can be caused by either the module or the vector libs
or both.

Hmm yes, I think these massive datasets are still the exception; now
and then someone tries to work with huge vectors, but this is not the
everyday case (maybe because it takes so long...).

They will probably become more normal (cf Lidar), so we should plan
with them in mind, without making everyone else "suffer" because a few
people need to use them.

Lidar is a special case; I don't see a reason to drag topo and the
spatial index along with lidar data. Maybe the spatial index, but not
topo (will reply to Hamish too).

2) have
an env variable (could work),

As this retains flexibility for the future, I would favor this, but
have no idea of what this entails in terms of increase of complexity
of the code.

Two different rtree libraries, one memory-based and one file-based,
would need to be available and maintained. Higher-level code needs to
prepare temporary files only if needed; lower-level code
(diglib/spindex.c and spindex_rw.c) needs to handle both cases
properly. Some functions must be present in two versions, file-based
and memory-based. Not sure how much change is required in the headers,
particularly dig_structs.h.

The largest file I have used has about 125,000 areas with a topo file
weighing 42 MB, so taking your worst estimate this would mean around
200 MB of spatial index, which is still largely acceptable for me.

When testing LFS in the vector libs, I had vectors with a > 2 GB coor
file and 800 MB topo (> 400,000 areas). Importing them required 5-6 GB
of memory; handling them was simply a PITA, lots of coffee breaks.

I find it a bit difficult to give you a definitive answer on the basis
of theory alone. Do you have any means of testing the impact of one
choice over the other for different use cases (editing, v.build,
v.what - the latter especially when used from the GUI)?

v.build is slower when keeping the modified spatial index in the file,
but v.build means rebuilding the spatial index: lots of file IO if
intermediate data are stored in a file.

v.what is simply a charm when used from the GUI :-), both with query
in display mode and query in edit mode (db speed is the limiting
factor). There is nothing that needs to be rebuilt; a mere 113 bytes
are read (the sidx header), then the search can immediately take place
in the file.

I would suggest that I first implement a new version where the spatial
index is always written out when a new or modified vector is closed.
Intermediate data are still stored in memory. Opening an old vector in
read-only mode would then be faster; opening an old vector in update
mode would work as it currently does, i.e. the spatial index is loaded
into memory. This can then be tested and polished, and once it is
stable, an env var could be added to keep the spatial index in the
file when modifying (Vect_open_new or Vect_open_update). This would
only be needed for massive vectors.

Markus M

On Wed, 24 Jun 2009, Markus GRASS wrote:

Paul Kelly wrote:

[...]

Are you using memory-mapped files?

What are memory-mapped files? Excuse my ignorance, I'm just a
self-trained coder (learning by doing).

http://en.wikipedia.org/wiki/Memory-mapped_file
A chunk of a disk file is directly mapped into memory so you can access it using normal pointers as if it was permanently in memory, and the OS transparently handles paging the relevant chunks of the file in and out of memory as required.
It can be a useful and elegant way of managing access to large files a section at a time. For files small enough to be completely cached in memory the performance penalty over keeping the same data in memory (without a file) should be relatively small, and will be faster than accessing a file using conventional fopen()/fread() etc. functions.
But how much of a performance gain it would give for very large files strongly depends on the nature of access to the file, which I know very little about in this case. E.g. for completely random access there might not be a lot of gain. But if there was random access only within a certain section of a file, that section could be mapped into memory and access would then be quite fast. Overwriting "dead" areas of the index would be particularly simple I think, if it was memory mapped.
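
A minimal POSIX sketch of the idea (the file name and the whole-file
mapping are just for the example):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map an index file and access it through a plain pointer; the OS
 * transparently pages parts of the file in and out as needed. */
int main(void)
{
    int fd = open("sidx", O_RDWR);
    struct stat st;
    char *map;

    if (fd < 0 || fstat(fd, &st) < 0)
        return 1;

    map = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
               MAP_SHARED, fd, 0);   /* offset 0, whole file */
    if (map == MAP_FAILED)
        return 1;

    /* read or modify bytes as if the file were an in-memory array;
     * changes are written back to the file by the OS */
    map[0] ^= 0;

    munmap(map, st.st_size);
    close(fd);
    return 0;
}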

Paul

Paul Kelly wrote:

Markus:

What are memory-mapped files? Excuse my ignorance, I'm just a
self-trained coder (learning by doing).

[...]
But how much of a performance gain it would give for very large files
strongly depends on the nature of access to the file, which I know
very little about in this case. E.g. for completely random access
there might not be a lot of gain.

It is completely random; the next chunk to be read/written can be
anywhere in the file.

But if there was random access only within a certain section of a
file, that section could be mapped into memory and access would then
be quite fast.

Does it make sense to always map a different section of the file? This
mapping would then need to replace each call to fread/fwrite because
of the completely random access. I'm currently using the same method
as for the coor file, which seems to work fine so far.

Markus M

On Thu, 25 Jun 2009, Markus GRASS wrote:

[...]

very little about in this case. E.g. for completely random access
there might not be a lot of gain.

It is completely random; the next chunk to be read/written can be
anywhere in the file.

[...]

But if there was random access only within a certain section of a
file, that section could be mapped into memory and access would then
be quite fast.

Does it make sense to always map a different section of the file? This
mapping would then need to replace each call to fread/fwrite because
of the completely random access. I'm currently using the same method
as for the coor file, which seems to work fine so far.

No, I can't imagine that would give much improvement. The scenario I had hoped for was that when processing a particular sub-region of a vector map, all the needed data from the index would be near each other in memory (and thus in the file), so that area of the file (I was thinking up to several hundred megabytes max. for performance reasons, but this could obviously scale with increasing data sizes) could be mapped into memory and intensive I/O (or just reading) done there, with the memory-mapping mechanism automatically taking care of the reads and writes from/to the underlying file.

From what you have said it sounds like that would need a complete redesign of the algorithm used for indexing, or might not be possible. The other option would be to map the whole of the index file into memory; it might be worth trying, but as you have already tried to optimise reading/writing performance based on knowledge of the algorithm, it is unlikely the OS could do any better. Still might be worth trying, but maybe too much work. So I think I should shut up now since (as is obvious) I don't know how the indexing algorithm works...

Paul

Paul Kelly wrote:

On Thu, 25 Jun 2009, Markus GRASS wrote:

[...]

Does it make sense to always map a different section of the file? This
mapping would then need to replace each call to fread/fwrite because
of the completely random access. I'm currently using the same method
as for the coor file, which seems to work fine so far.

No, I can't imagine that would give much improvement. The scenario I
had hoped for was that when processing a particular sub-region of a
vector map, all the needed data from the index would be near each
other in memory

Unfortunately not; you need the whole search tree also for processing
a subregion. The index is not a linear array but a search tree, and
for a search tree I don't know how to keep spatially close items close
to each other in the file, particularly when modifying the tree.

(and thus in the file), so that area of the file (I was thinking up
to several hundred megabytes max. for performance reasons, but this
could obviously scale with increasing data sizes) could be mapped into
memory and intensive I/O (or just reading) done there, with the
memory-mapping mechanism automatically taking care of the reads and
writes from/to the underlying file.

From what you have said it sounds like that would need a complete
redesign of the algorithm used for indexing, or might not be possible.
The other option would be to map the whole of the index file into memory;

Then we can just as well keep the whole index in memory as before?

it might be worth trying, but as you have already tried to optimise
reading/writing performance based on knowledge of the algorithm, it is
unlikely the OS could do any better. Still might be worth trying, but
maybe too much work. So I think I should shut up now since (as is
obvious) I don't know how the indexing algorithm works...

It's essentially a glorified binary search tree where a node has not 2
but (in this case) 10 children. As with BSTs, a search always starts
at the root of the tree and then descends to the leaf nodes. Such a
search is also performed for insertion and deletion. With the current
implementation, the location in the file is completely mixed up and
has very little to do with spatial distance or search tree structure.
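
For illustration, a toy sketch of the search principle (recursive here
for clarity, although the actual code is non-recursive; the 10
children per node are from this thread, everything else is invented):

#include <stdio.h>

#define CARD 10   /* children per node, as mentioned above */

struct bbox { double xmin, xmax, ymin, ymax; };

struct node {
    int level;                /* 0 = leaf level */
    int count;                /* used children */
    struct bbox box[CARD];    /* bounding box of each child */
    struct node *child[CARD]; /* the real code stores file offsets
                                 and feature ids instead of pointers */
};

static int overlaps(const struct bbox *a, const struct bbox *b)
{
    return a->xmin <= b->xmax && b->xmin <= a->xmax &&
           a->ymin <= b->ymax && b->ymin <= a->ymax;
}

/* Search starts at the root; only subtrees whose bounding box
 * overlaps the query box are descended, which is what makes the
 * tree fast. Returns the number of matching leaf entries. */
static int search(const struct node *n, const struct bbox *query)
{
    int i, hits = 0;

    for (i = 0; i < n->count; i++) {
        if (!overlaps(&n->box[i], query))
            continue;
        if (n->level == 0)
            hits++;
        else
            hits += search(n->child[i], query);
    }
    return hits;
}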

The principle of a search tree is very fast and should be kept for
speed. AFAIU the nature of the tree does not allow keeping spatially
nearby features close to each other in the file, if only because the
root node and the nodes below it spatially cover the whole vector,
i.e. at that level all features are close to each other as far as the
search tree is concerned. I'm happy that I found a way to write out
the index in a way that allows searching in the file while always
keeping memory requirements down to a very few KB. There are other
file-based RTree implementations, but I don't understand them: cryptic
code, no documentation (SQLite for example has an RTree, didn't test
it; also Norbert Beckmann's R*-tree implementation).

Markus M

On 25/06/09 08:51, Markus GRASS wrote:

I would suggest that I first implement a new version where the spatial
index is always written out when a new or modified vector is closed.
Intermediate data are still stored in memory. Opening an old vector in
read-only mode would then be faster; opening an old vector in update
mode would work as it currently does, i.e. the spatial index is loaded
into memory. This can then be tested and polished, and once it is
stable, an env var could be added to keep the spatial index in the
file when modifying (Vect_open_new or Vect_open_update). This would
only be needed for massive vectors.

+1

Moritz

Moritz Lennert wrote:

On 25/06/09 08:51, Markus GRASS wrote:
[...]

+1

Now in trunk r38390, time to make distclean again...

To work with an existing vector in grass7, topology needs to be
rebuilt because a support file, the spatial index, is missing. After
that everything is fine, and grass6 can still read the vector as is.

The vector spatial index is now built in memory and written out to
file, like topology and the category index. When opening an old
vector, only the header of the spatial index file is loaded; searches
are done in the file. When opening an old vector for update, the
spatial index is loaded from file into memory, modified there and then
written out, like topology and the category index.
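
A minimal sketch of the open-for-read pattern described above (header
fields, sizes and names are invented for the example; the real format
is in diglib):

#include <stdio.h>

/* Hypothetical sidx header: only this small fixed-size block is
 * loaded when a vector is opened read-only. */
struct sidx_head {
    long root_offset;   /* file offset of the root node */
    int node_size;      /* bytes per node record */
    int with_z;         /* 2D or 3D index */
};

/* Nodes are then fetched on demand by offset during a search;
 * nothing else of the index is held in memory. */
static int read_node(FILE *fp, long offset, void *buf, int node_size)
{
    if (fseek(fp, offset, SEEK_SET) != 0)
        return -1;
    return fread(buf, (size_t)node_size, 1, fp) == 1 ? 0 : -1;
}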

The new spatial index algorithm (R*-tree) is a bit faster than the old
one (RTree); breaking lines profits from it, and thus so do v.in.ogr
and v.clean.

v.build is now a bit faster, sometimes the same speed, sometimes twice
as fast, with generally better performance for more complicated
geometry. v.what is now generally faster by a factor of 6 to 30,
depending on the vector.

The authors of the R*-tree claim that its search performance is better
particularly for massive point datasets. Using elev_lid792_bepts in
nc_spm_08 (not really massive), v.what now takes about 0.16s here
instead of 4.3s, ~25x faster: the combined improvement of the
file-based index and the better index algorithm.

Still, for massive point datasets I would recommend not building
topology, because all three support structures (topology, spatial
index and category index) can become massive. Keeping one in a file
and loading the other two into memory doesn't help much.

I hope I didn't mess up too much...

Markus M

Hi,

2009/7/13 Markus Metz <markus.metz.giswork@googlemail.com>:

To work with an existing vector in grass7, topology needs to be
rebuilt because a support file, the spatial index, is missing. After
that everything is fine, and grass6 can still read the vector as is.

great!

Just trying to build vector map 'bridges' from nc_spm. The module ends
up with the 'position mismatch' error.

$ v.build bridges
Building topology for vector map <bridges>...
[...]
Number of nodes: 10938
Number of primitives: 10938
Number of points: 10938
Number of lines: 0
Number of boundaries: 0
Number of centroids: 0
Number of areas: 0
Number of isles: 0
ERROR: position mismatch

Martin

--
Martin Landa <landa.martin gmail.com> * http://gama.fsv.cvut.cz/~landa

Hi,

2009/7/13 Martin Landa <landa.martin@gmail.com>:

[...]

Just trying to build vector map 'bridges' from nc_spm. The module ends
up with the 'position mismatch' error.

$ v.build bridges
Building topology for vector map <bridges>...

[...]

ERROR: position mismatch

Some debug info...

D4/5: dig_Wr_P_line() line = 10932
D5/5: line type 1 -> 1
D4/5: dig_Wr_P_line() line = 10933
D5/5: line type 1 -> 1
D4/5: dig_Wr_P_line() line = 10934
D5/5: line type 1 -> 1
D4/5: dig_Wr_P_line() line = 10935
D5/5: line type 1 -> 1
D4/5: dig_Wr_P_line() line = 10936
D5/5: line type 1 -> 1
D4/5: dig_Wr_P_line() line = 10937
D5/5: line type 1 -> 1
D4/5: dig_Wr_P_line() line = 10938
D5/5: line type 1 -> 1
D2/5: topo body offset 142
D1/5: Vect_save_spatial_index()
D1/5: Open sidx: /home/martin/grassdata/nc_spm_08/PERMANENT/vector/bridges/sidx
D1/5: dig_Wr_spidx()
D3/5: spidx offset node = 0 line = 0, area = 0 isle = 0
D1/5: spidx body offset 113
ERROR: position mismatch

Martin

--
Martin Landa <landa.martin gmail.com> * http://gama.fsv.cvut.cz/~landa

Martin Landa wrote:

[...]

$ v.build bridges
Building topology for vector map <bridges>...
[...]
ERROR: position mismatch

Fixed in r38397. No need for you to make distclean again, just recompile
diglib.

Markus M

Hi,

2009/7/13 Markus Metz <markus.metz.giswork@googlemail.com>:

[...]

Fixed in r38397. No need for you to make distclean again, just recompile
diglib.

Thanks for the quick fix.

Martin

--
Martin Landa <landa.martin gmail.com> * http://gama.fsv.cvut.cz/~landa