[pgrouting-users] Performance problems with pgr_dijkstra

Hello,

I have some problems with the performance of the Dijkstra routing algorithm, and I hope someone can help me with that.

I started my routing experiments with the OpenStreetMap database of Saxony (a German state) and created a routing table with the program osm2po. First tries with the whole routing table of 934102 rows took several seconds to finish. So I created a temporary routing table with far fewer rows: I extracted only a subset of rows around the start and destination vertices. This table contained only about 5000 rows, and the Dijkstra routing algorithm took about 60 ms to complete. The route is intended for pedestrians and is approximately 1000 meters long. A very good result.

Now I upgraded to the whole OSM map of Europe and again created the routing table with osm2po. This table has about 78300000 rows. So I created the temp routing table again, filled it with almost the same 5000 rows and took the same start and destination points. But when I run the Dijkstra algorithm on this table, it takes about 2.1 seconds. That's not terrible, but much more than the 60 ms on the Saxony map. Both tables have indexes on the id, source and target fields (created as sketched below). I've repeated the routing query on the temp routing table but always got the same processing time.
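
For reference, the indexes are plain btree indexes like these (a sketch; table name as in my setup):

CREATE INDEX ON tmp_routing (id);
CREATE INDEX ON tmp_routing (source);
CREATE INDEX ON tmp_routing (target);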

I don't understand why the procedure takes so much longer on the map of Europe, although both temporary routing tables are almost identical in size and don't depend on the rest of the database (no joins or other complicated stuff). Maybe someone has an idea and can point me in the right direction.

Here is my routing query, although there is nothing special about it:
SELECT seq, id1 AS node, id2 AS edge_id, cost
FROM pgr_dijkstra('SELECT id, source, target, cost FROM tmp_routing', 12345, 67890, false, false);
I use PostgreSQL 9.1, PostGIS 2.0 and pgRouting 2.0.

Thank you in advance
Best regards
Eric Scheibler

I don't understand why the procedure takes so much longer on the map of Europe, although both temporary routing tables are almost identical in size and don't depend on the rest of the database (no joins or other complicated stuff). Maybe someone has an idea and can point me in the right direction.

Hi Eric,

First of all, I'm not sure if I really understand what you mean by "creating a temporary table". Did you run "CREATE TEMP TABLE ..."?

You're right that with pgRouting the amount of data selected from the network table matters. The fastest way to select only part of the network table is with a bounding box, and you should have an index on your geometry column as well. Then you don't need to create temporary tables.
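
For example, something along these lines (just a sketch; I'm assuming the osm2po geometry column is called geom_way and is stored in SRID 4326, and the envelope coordinates are made up, so you would build the box around your own start and destination points):

-- spatial index on the geometry column, so the && filter is fast
CREATE INDEX ON eu_2po_4pgr USING gist (geom_way);

SELECT seq, id1 AS node, id2 AS edge_id, cost
FROM pgr_dijkstra(
    'SELECT id, source, target, cost FROM eu_2po_4pgr
     WHERE geom_way && ST_MakeEnvelope(13.70, 51.02, 13.78, 51.07, 4326)',
    12345, 67890, false, false);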

Back to your question: as far as I remember, the size of the IDs can matter. I experienced this with data that already had source and target IDs in place, all with the same (large) number of digits. Renumbering, starting from 1, helped to improve the speed, though I can't tell whether that's the reason in your case.
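
If you want to try the renumbering, here is an untested sketch using a temporary mapping table (tmp_routing stands for your network table):

CREATE TEMP TABLE vertex_map AS
SELECT old_id, row_number() OVER (ORDER BY old_id) AS new_id
FROM (SELECT source AS old_id FROM tmp_routing
      UNION
      SELECT target FROM tmp_routing) AS vertices;

UPDATE tmp_routing t SET source = m.new_id FROM vertex_map m WHERE t.source = m.old_id;
UPDATE tmp_routing t SET target = m.new_id FROM vertex_map m WHERE t.target = m.old_id;

You would then have to translate your start and destination vertex IDs through vertex_map as well.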

Have you also run VACUUM and ANALYZE on your table?
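
If not, that would be something like:

VACUUM ANALYZE tmp_routing;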

Regards,
Daniel

--
Georepublic UG & Georepublic Japan
eMail: daniel.kastl@georepublic.de
Web: http://georepublic.info

Hello Daniel,

Daniel Kastl <daniel@georepublic.de> wrote on 11.02.2015, 10:54 +0900:

I don't understand why the procedure takes so much longer on the map of Europe, although both temporary routing tables are almost identical in size and don't depend on the rest of the database (no joins or other complicated stuff). Maybe someone has an idea and can point me in the right direction.

First of all, I'm not sure if I really understand what you mean by "creating a temporary table". Did you run "CREATE TEMP TABLE ..."?

Okay, let me explain briefly; maybe you even have a better solution for my problem altogether. I'm writing a routing server application in Python and issue the necessary SQL commands from there. So I start by creating a normal table like this:
CREATE TABLE tmp_routing AS SELECT * FROM eu_2po_4pgr LIMIT 0;

Then I fill it with rows from the eu_2po_4pgr table until both the start and destination points are included, which gives me the described temp table with about 5000 rows. The reason for this approach is that I need the table more than once, so it's more efficient to create it first and keep it as long as required.
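
One iteration of the filling step looks roughly like this (simplified; geom_way is the osm2po geometry column, the coordinates are placeholders, and the real code grows the box until both points are inside and avoids duplicate rows):

INSERT INTO tmp_routing
SELECT * FROM eu_2po_4pgr
WHERE geom_way && ST_Expand(ST_SetSRID(ST_MakePoint(13.73, 51.05), 4326), 0.01);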

A major problem for routing are ways that aren't connected to the main street network, for example a path in a backyard or on a roof garden. Because my router is for pedestrians, I can't restrict myself to streets but have to use all other tracks and paths as well. So the first step is to verify that the ways closest to the route's start and destination are connected to the main street network. Otherwise the routing algorithm can't find a solution.

To achieve that, I take the closest intersection with at least one big street and calculate a route from the potential best starting way to that intersection. Because it's very likely that an intersection with a big road is connected to the main street network, this gives me confidence that my potential starting way is also connected to the main network. After I've verified the starting and destination ways, I can run the actual routing query. When I'm done, I delete the created table manually.
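
The check itself is just another Dijkstra call between the candidate starting vertex and a vertex of that intersection, for example (the vertex IDs are placeholders; I treat an empty result as "not connected"):

SELECT count(*) AS hops
FROM pgr_dijkstra(
    'SELECT id, source, target, cost FROM tmp_routing',
    42, 77, false, false);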

So I need at least three, but often more, routing queries, and therefore the relatively long processing time of 2.1 seconds bothers me. For a single routing query it would be fine, but I don't know how else to deal with those disconnected way networks.

So I have two options to improve my router:
1. Find a better solution for the disconnected-way problem; then I would only need one routing query.
2. Find out why the routing query takes so much longer on the bigger database, although the routing table is almost the same size.

You're right that with pgRouting the amount of data selected from the network table matters. The fastest way to select only part of the network table is with a bounding box, and you should have an index on your geometry column as well. Then you don't need to create temporary tables.

Do you have an example of such a bounding box query? And how do I determine the box size? I know the distance between the starting and destination points in meters; could that be used?

Back to your question: as far as I remember, the size of the IDs can matter. I experienced this with data that already had source and target IDs in place, all with the same (large) number of digits. Renumbering, starting from 1, helped to improve the speed, though I can't tell whether that's the reason in your case.

Very interesting. You could be right. I created a temp routing table in the Saxony database, took the start and destination vertices from my program and checked the processing time and the result (4 rows and 60 ms for a very short route of approximately 100 meters). Then I dumped that table with pg_dump and restored it into the Europe database. Now the same routing query runs as fast as in the small database. So maybe the high source and target IDs are responsible for that. Do you have an idea how to recreate the source and target IDs, starting from 1?

Have you also run VACUUM and ANALYZE on your table?

Yes.

Best regards
Eric

Eric Scheibler <email@eric-scheibler.de> wrote on 11.02.2015, 12:23 +0100:

Daniel Kastl <daniel@georepublic.de> wrote on 11.02.2015, 10:54 +0900:

You're right that with pgRouting the amount of data selected from the network table matters. The fastest way to select only part of the network table is with a bounding box, and you should have an index on your geometry column as well. Then you don't need to create temporary tables.

Do you have an example of such a bounding box query? And how do I determine the box size? I know the distance between the starting and destination points in meters; could that be used?

Found that; it works.

Back to your question: as far as I remember, the size of the IDs can matter. I experienced this with data that already had source and target IDs in place, all with the same (large) number of digits. Renumbering, starting from 1, helped to improve the speed, though I can't tell whether that's the reason in your case.

Very interesting. You could be right. I created a temp routing table in the Saxony database, took the start and destination vertices from my program and checked the processing time and the result (4 rows and 60 ms for a very short route of approximately 100 meters). Then I dumped that table with pg_dump and restored it into the Europe database. Now the same routing query runs as fast as in the small database. So maybe the high source and target IDs are responsible for that.

Yes, that solved the problem. Now the routing query completes in 30 to 40 ms, so it's even a bit faster than on the small database. I've created an SQL function which renumbers the source and target IDs of the temp routing table:

CREATE OR REPLACE FUNCTION recreate_vertex_of_routing_table(tbl regclass)
RETURNS void
AS $$
DECLARE
    row RECORD;
    vertex_storage hstore;  -- maps old vertex ID (key) to new vertex ID (value)
    new_vertex int;
BEGIN
    vertex_storage := ''::hstore;
    new_vertex := 1;
    -- First pass: assign a consecutive new ID to every distinct vertex.
    -- %s with a regclass argument yields a safely quoted table name
    -- (%I would mis-quote schema-qualified names).
    FOR row IN EXECUTE format('SELECT source, target FROM %s', tbl)
    LOOP
        IF NOT vertex_storage ? row.source::text THEN
            vertex_storage := vertex_storage || hstore(row.source::text, new_vertex::text);
            new_vertex := new_vertex + 1;
        END IF;
        IF NOT vertex_storage ? row.target::text THEN
            vertex_storage := vertex_storage || hstore(row.target::text, new_vertex::text);
            new_vertex := new_vertex + 1;
        END IF;
    END LOOP;
    -- Second pass: rewrite both columns in one UPDATE. A single statement
    -- avoids re-matching rows whose freshly assigned ID happens to equal
    -- another vertex's old ID, which an UPDATE-per-vertex loop can do.
    EXECUTE format('UPDATE %s SET source = ($1 -> source::text)::int,
                                  target = ($1 -> target::text)::int', tbl)
    USING vertex_storage;
END;
$$ LANGUAGE plpgsql;
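
I call it like this after filling the temp table (and then look up the new IDs of my start and destination vertices, since the renumbering changes them):

SELECT recreate_vertex_of_routing_table('tmp_routing');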

Best regards
Eric