[pgrouting-dev] integer vs bigint for edge and node ids

Developers,

This topic came up off list and I thought this should be part of a more public discussion, so here are my thoughts on it. I think both the ideas below are worth pursuing but are huge tasks that are currently not funded and I'm concerned that we should not tackle them on a piecemeal, ad-hoc basis.

Problem:

postgis loads data with a gid column that is typed bigint.
OSM data has node and edge ids that can not be held in an integer field.
Today, all pgrouting code assumes ids are integer and larger numbers overflow the field causing problems.

So the obvious answer is to update the code to be bigint compatible.

The issues related to doing this are as follows:

* pgrouting has the bad habit of using ids as indexes into arrays which is a problem. The code partially minimizes the indexing issue by located the minimum id and subtracting that from all other ids shifting the range to eliminate the leading empty ids, but if the (max_id - min_id) is large and the ids are sparsely allocated across the range we still need to allocate (max_id - min_id + 1) elements in our storage structures. For example if you have two ids '2' and '1,000,002' the code allocates 1,000,001 elements. This can obviously be fixed with some additional coding.

* the code would/might need to be changed to such that the boost structures can support bigint values and this will double the amount of storage needed for ids. There is also the possibility that if the point above does renumbering and we limit the number of ids in any given request to unsigned integer that the code might not need changes and we would only limit the size of the problem we can solve and max unsigned integer sized problem is still larger than anyone would want to wait for a solution.

* this would also imply that the output data types we use would get converted to bigint also.

* If we make this change, we need to do it to all our algorithms so we keep the commands consistent and compatible with one another. This would involve updating all of the follow algorithms in pgrouting:

apsp_johnson
apsp_warshall
astar
bd_astar
bd_dijkstra
common
dijkstra
driving_distance
kdijkstra
ksp
shooting_star ** deprecated, will be deleted **
trsp
tsp
vrp_basic
vrpdptw

While this does not have anything to do with this specific problem, we need to consider any major reworking of the code like this would involve with the fact for historical evolution of the code base we have copied source code from one algorithm to another so we have a lot of duplicated code. It would make a lot of sense at this point to refactor some of this code into common set of reusable C++ base classes. To put this into context all of the follow code is in someway based off of the original dijkstra code:
   astar
   bd_astar
   bd_dijkstra
   dijkstra
   driving_distance
   kdijkstra
   ksp
So every time we make a change to one of these algorithms, there is a good chance that the problem exists in one or more of the other ones. We have been struggling with this problem for some time.

I don't want to blowup this problem into something that is so big it will never get done, but I do think it is important that we think about these issues and that we put together a plan that takes these issues into account in the planning of these and other potential changes.

Thoughts,
   -Steve

Hi Steve,

Has there been any GSOC campaign with pgRouting? That would be the
place to start IMHO. After that would be a call for BSc/MSc thesis;
even if there is no budget, students may become interested in the
project.

The main difficulty with the problems you describe is that they are
are not easy to tackle concurrently. The first step would have to be
the refactoring. It could start with a first task to refactor just two
of modules and extend it to the others on a second step. The bigint
issue would be the last to tackle.

Just my thoughts. Regards,

Luís

On 15 March 2015 at 00:46, Stephen Woodbridge <woodbri@swoodbridge.com> wrote:

Developers,

This topic came up off list and I thought this should be part of a more
public discussion, so here are my thoughts on it. I think both the ideas
below are worth pursuing but are huge tasks that are currently not funded
and I'm concerned that we should not tackle them on a piecemeal, ad-hoc
basis.

Problem:

postgis loads data with a gid column that is typed bigint.
OSM data has node and edge ids that can not be held in an integer field.
Today, all pgrouting code assumes ids are integer and larger numbers
overflow the field causing problems.

So the obvious answer is to update the code to be bigint compatible.

The issues related to doing this are as follows:

* pgrouting has the bad habit of using ids as indexes into arrays which is a
problem. The code partially minimizes the indexing issue by located the
minimum id and subtracting that from all other ids shifting the range to
eliminate the leading empty ids, but if the (max_id - min_id) is large and
the ids are sparsely allocated across the range we still need to allocate
(max_id - min_id + 1) elements in our storage structures. For example if you
have two ids '2' and '1,000,002' the code allocates 1,000,001 elements. This
can obviously be fixed with some additional coding.

* the code would/might need to be changed to such that the boost structures
can support bigint values and this will double the amount of storage needed
for ids. There is also the possibility that if the point above does
renumbering and we limit the number of ids in any given request to unsigned
integer that the code might not need changes and we would only limit the
size of the problem we can solve and max unsigned integer sized problem is
still larger than anyone would want to wait for a solution.

* this would also imply that the output data types we use would get
converted to bigint also.

* If we make this change, we need to do it to all our algorithms so we keep
the commands consistent and compatible with one another. This would involve
updating all of the follow algorithms in pgrouting:

apsp_johnson
apsp_warshall
astar
bd_astar
bd_dijkstra
common
dijkstra
driving_distance
kdijkstra
ksp
shooting_star ** deprecated, will be deleted **
trsp
tsp
vrp_basic
vrpdptw

While this does not have anything to do with this specific problem, we need
to consider any major reworking of the code like this would involve with the
fact for historical evolution of the code base we have copied source code
from one algorithm to another so we have a lot of duplicated code. It would
make a lot of sense at this point to refactor some of this code into common
set of reusable C++ base classes. To put this into context all of the follow
code is in someway based off of the original dijkstra code:
  astar
  bd_astar
  bd_dijkstra
  dijkstra
  driving_distance
  kdijkstra
  ksp
So every time we make a change to one of these algorithms, there is a good
chance that the problem exists in one or more of the other ones. We have
been struggling with this problem for some time.

I don't want to blowup this problem into something that is so big it will
never get done, but I do think it is important that we think about these
issues and that we put together a plan that takes these issues into account
in the planning of these and other potential changes.

Thoughts,
  -Steve
_______________________________________________
pgrouting-dev mailing list
pgrouting-dev@lists.osgeo.org
http://lists.osgeo.org/mailman/listinfo/pgrouting-dev

On 3/18/2015 2:40 PM, Luís de Sousa wrote:

Hi Steve,

Has there been any GSOC campaign with pgRouting? That would be the
place to start IMHO. After that would be a call for BSc/MSc thesis;
even if there is no budget, students may become interested in the
project.

After many years of mentoring GSoC, I decided that I need a break this year. Daniel may still take a student and I'll probably co-mentor for that. But I'm not sure what that project will be. One thing we do need to get get some students/developers with strong C++ backgrounds because as you mention below we need to do some refactoring of the code and this is not a good project for a junior programmer.

The main difficulty with the problems you describe is that they are
are not easy to tackle concurrently. The first step would have to be
the refactoring. It could start with a first task to refactor just two
of modules and extend it to the others on a second step. The bigint
issue would be the last to tackle.

Yes, I agree. This will also require some restructuring ot the source tree to move common classes somewhere that they are easily accessed by the various algorithm code that needs them.

If the refactored code is written to work with bigints, then it is just a matter of changing the C code that interfaces with postgresql to handle the bigints. I think one of the utility classes in the refactoring needs to be renumbering class that supports old-new and new-old number translations. old-new renumbers the postgres ids into a new internal numbering, and new-old translates the internal numbers back to the postgres numbers.

If anyone has some time and/or ideas on this, we could start some wiki pages to discuss the refactoring. The are a list of dijkstra related algorithms below that would be candidates for refactoring.

One of the ugly areas of the code is all the C code that implements the postgresql interfaces. This code is hugely redundant. I have been trying to think of ways to clean this up. Some random ideas are to make a code generator that takes a structure defining the interface and them automatically generate the C code. Or make the interface code data driven using some structures and standard functions to refactor the current copy and paste mess. I'm sure there are other ideas out there.

Just my thoughts. Regards,

Good ideas, thanks!

-Steve

Luís

On 15 March 2015 at 00:46, Stephen Woodbridge <woodbri@swoodbridge.com> wrote:

Developers,

This topic came up off list and I thought this should be part of a more
public discussion, so here are my thoughts on it. I think both the ideas
below are worth pursuing but are huge tasks that are currently not funded
and I'm concerned that we should not tackle them on a piecemeal, ad-hoc
basis.

Problem:

postgis loads data with a gid column that is typed bigint.
OSM data has node and edge ids that can not be held in an integer field.
Today, all pgrouting code assumes ids are integer and larger numbers
overflow the field causing problems.

So the obvious answer is to update the code to be bigint compatible.

The issues related to doing this are as follows:

* pgrouting has the bad habit of using ids as indexes into arrays which is a
problem. The code partially minimizes the indexing issue by located the
minimum id and subtracting that from all other ids shifting the range to
eliminate the leading empty ids, but if the (max_id - min_id) is large and
the ids are sparsely allocated across the range we still need to allocate
(max_id - min_id + 1) elements in our storage structures. For example if you
have two ids '2' and '1,000,002' the code allocates 1,000,001 elements. This
can obviously be fixed with some additional coding.

* the code would/might need to be changed to such that the boost structures
can support bigint values and this will double the amount of storage needed
for ids. There is also the possibility that if the point above does
renumbering and we limit the number of ids in any given request to unsigned
integer that the code might not need changes and we would only limit the
size of the problem we can solve and max unsigned integer sized problem is
still larger than anyone would want to wait for a solution.

* this would also imply that the output data types we use would get
converted to bigint also.

* If we make this change, we need to do it to all our algorithms so we keep
the commands consistent and compatible with one another. This would involve
updating all of the follow algorithms in pgrouting:

apsp_johnson
apsp_warshall
astar
bd_astar
bd_dijkstra
common
dijkstra
driving_distance
kdijkstra
ksp
shooting_star ** deprecated, will be deleted **
trsp
tsp
vrp_basic
vrpdptw

While this does not have anything to do with this specific problem, we need
to consider any major reworking of the code like this would involve with the
fact for historical evolution of the code base we have copied source code
from one algorithm to another so we have a lot of duplicated code. It would
make a lot of sense at this point to refactor some of this code into common
set of reusable C++ base classes. To put this into context all of the follow
code is in someway based off of the original dijkstra code:
   astar
   bd_astar
   bd_dijkstra
   dijkstra
   driving_distance
   kdijkstra
   ksp
So every time we make a change to one of these algorithms, there is a good
chance that the problem exists in one or more of the other ones. We have
been struggling with this problem for some time.

I don't want to blowup this problem into something that is so big it will
never get done, but I do think it is important that we think about these
issues and that we put together a plan that takes these issues into account
in the planning of these and other potential changes.

Thoughts,
   -Steve
_______________________________________________
pgrouting-dev mailing list
pgrouting-dev@lists.osgeo.org
http://lists.osgeo.org/mailman/listinfo/pgrouting-dev

_______________________________________________
pgrouting-dev mailing list
pgrouting-dev@lists.osgeo.org
http://lists.osgeo.org/mailman/listinfo/pgrouting-dev