[pgrouting-users] dijkstra_sp_delta throwing a signal 6 (SIGABRT?)

Richard_Marsden · February 28, 2011, 7:49pm

Well I’ve moved forward and now have code in production calculating mileages from OpenStreetMap data: I’ve calculated mileage charts for Oceania and Africa.
The secret to get that far was to move operating systems from Windows to Ubuntu and then upgrade to pgRouting 1.05. (PostGres 8.4, Ubuntu 10)
The computations are being performed with dijkstra_sp_delta.

However now I’m hitting another “server closed the connection unexpectedly” error.

Looking in the server logs, I find the LOG message “server process (PID 19133) was terminated by signal 6: Aborted”

From what I can tell, Signal 6 on Ubuntu is indeed a SIGABRT. There are no other log messages to indicate why Postgres/pgRouting threw a SIGABRT.

This is then followed by warnings and log messages saying other active server processes are being terminated, transactions rolled back, etc.

This error occurs at a reproducible point in a fairly sophisticated (multi-processor, Python, psycopg) script. Although I’m pretty certain of the SQL that is causing the problem, at the moment I don’t have the exact parameters (ie. graph nodes). I’m about to run the script single-threaded with diagnostics so I should be able to get a single SQL statement to reproduce the problem on a psql command line. In the worst case, this could take a couple of days.
No other programs are running that are calling Postgres.

My graph consists of the global OSM street data loaded into PostGIS with osm2po. I have checked for links of zero length. In fact all links <1m long have been taken out of the graph. I’ve just double checked costs and reverse_costs: all are positive (I’ve set these to the lengths)

I’ve just checked for start & end nodes being the same (i.e. resulting in dijkstra_sp_delta being called with the same node identifier for the start and end): Yes, my data has a few of these, but I’m pretty certain the crash occurs before they appear. However, I’m going to add code to detect these - there’s no point in executing an SQL statement for something that can be calculated in a trivial line of Python.
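
A minimal sketch of that check, assuming a hypothetical `route_query` callable that runs the dijkstra_sp_delta SQL and returns a distance (neither the name nor its signature comes from the script discussed here):

```python
def mileage(start_node, end_node, route_query):
    """Return the route distance, skipping the database for trivial pairs."""
    # Identical start and end nodes mean a zero-length route, so no SQL is needed.
    if start_node == end_node:
        return 0.0
    # route_query is a hypothetical callable wrapping the real SQL statement.
    return route_query(start_node, end_node)
```

Passing the query function in as an argument keeps the check independent of how the SQL is actually issued.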

What else should I be looking for? Are there any known problems I should look for? Is there any way of finding out what is causing the Signal 6?

Once I have the node identifiers that are causing the problem, I should be able to make an exportable extract of the graph to give a reproducible dataset and matching SQL statement. Would anyone be able to investigate this?

Is there any way of making pgRouting / PostGres handle these situations more cleanly? At the moment, the crash is taking the server down with it. The crash is perhaps the first to occur after roughly 1 million route calculations: I can live with that failure rate - but only if my scripts can cleanly detect and recover from it. I guess ideally the server should stay up and a status value (or exception - but that probably wouldn’t work across so many code boundaries) be returned.

Best regards,

Richard Marsden

Stephen_Woodbridge1 · February 28, 2011, 10:43pm

The only thing(s) that I can think of are:

1. it could be caused by a call to abort() or assert() in the C code, but:

woodbri@mappy:~/work/pgrouting-git/pgrouting$ find * -type f -exec grep -l -i abort {} \;
woodbri@mappy:~/work/pgrouting-git/pgrouting$ find * -type f -exec grep -l -i assert {} \;
core/src/CMakeFiles/routing.dir/depend.make
core/src/CMakeFiles/routing.dir/astar_boost_wrapper.o
core/src/CMakeFiles/routing.dir/shooting_star_boost_wrapper.o
core/src/CMakeFiles/routing.dir/depend.internal
core/src/CMakeFiles/routing.dir/boost_wrapper.o
core/src/CMakeFiles/routing.dir/CXX.includecache
lib/librouting.so

So it does not look like we have one in our source code, but there appear to be references in the .o files that might come from compiler-generated code or from includes outside our source tree, like Boost or system libs.

2. I suppose it is possible that the server is sending a SIGABRT to a child process that is doing something bad like taking too much memory. Or maybe there is an OOM (Out Of Memory) watchdog process killing it with a SIGABRT.

Have you watched this with top? or some other process watcher?

Hopefully, you can extract the SQL and run it from the command line so we can get a better handle on what is happening and what the query is.

-Steve

_______________________________________________
Pgrouting-users mailing list
Pgrouting-users@lists.osgeo.org
http://lists.osgeo.org/mailman/listinfo/pgrouting-users

Richard_Marsden · March 1, 2011, 1:54am

Thanks for the reply and suggestions.

Well I have now run the same script but with just one thread/process. I would have expected this to work if it were a “bulk” out-of-memory problem (only one pgRouting process running). It failed. Also, with better diagnostics of my own, I tried to recreate the SQL statements on the command line - unfortunately these worked!

I have been using the System Monitor for a while. Previously this showed it hitting swap memory occasionally, so I’ve bumped the machine memory from 4GB to 8GB, and it hasn’t done so since.
(Yes this is running 32 bit mainly because PostGIS is 32 bit, but I understand modern Linux has a way to handle more memory (but limited per process) - and it was using the full 4GB). I do note that it does not appear to have gone beyond a full 4GB (+ephemera) memory usage.

I tried to adjust the shared memory parameter in PostGres but I think the default must be close to the maximum for standard Ubuntu (something about having to rebuild the kernel to change SHMEM). So the PostGres shared memory setting is back to its default (28MB). work_mem has been upped to 256MB. This change was after the first crash.

Otherwise it is difficult to watch with top or the system monitor because so far it has had to run a while (hours) before the crash occurs.
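
One way around watching by hand for hours is to log the backend's memory to a file and read it after the crash. The following is only a sketch, assuming Linux's /proc file system and that the backend PID is already known (for example from `SELECT pg_backend_pid()` on the connection in question); the function names and the log format are made up for illustration:

```python
import re
import time


def parse_vm_rss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def log_rss(pid, log_path, interval_seconds=60.0):
    """Append 'timestamp rss_kb' lines until the process disappears (Linux only)."""
    while True:
        try:
            with open("/proc/%d/status" % pid) as handle:
                rss_kb = parse_vm_rss_kb(handle.read())
        except IOError:
            break  # the backend has exited (or crashed)
        with open(log_path, "a") as log:
            log.write("%s %s\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), rss_kb))
        time.sleep(interval_seconds)
```

If the resident size climbs steadily until the abort, that would point towards a leak rather than one unusually expensive query.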

I guess as a kludgy workaround I could try trapping the client error, wait, and skip (or try again). This should work for a single thread, but might pose problems for my multi-threaded app. That’s the problem when the server dies - all client threads have trouble until it restarts.
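
Roughly, that workaround could look like the sketch below. It is not tied to any particular driver: `connect` and `run_query` are hypothetical callables, and the exception types to trap are passed in (with psycopg2 that would presumably be `psycopg2.OperationalError`, which is an assumption about the driver in use). Each attempt opens a fresh connection, since the old one is useless once the server has restarted.

```python
import time


def run_with_retry(connect, run_query, retries=3, wait_seconds=30.0,
                   retryable=(Exception,), sleep=time.sleep):
    """Run run_query(conn) on a fresh connection, waiting and retrying on failure."""
    last_error = None
    for attempt in range(retries + 1):
        conn = None
        try:
            conn = connect()
            return run_query(conn)
        except retryable as error:
            last_error = error
            if attempt < retries:
                sleep(wait_seconds)  # give the server time to restart
        finally:
            if conn is not None:
                try:
                    conn.close()
                except Exception:
                    pass  # the connection may already be dead
    raise last_error
```

With several worker processes, each worker would need its own wrapper like this, because every client connection is dropped when the server restarts.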

Richard Marsden

Stephen_Woodbridge1 · March 1, 2011, 3:11am

On 2/28/2011 8:54 PM, Richard Marsden wrote:

Thanks for the reply and suggestions.

Well I have now run the same script but with just one thread/process. I
would have expected this to have worked if it was a "bulk" out of memory
problem (only one pgRouting process running). It failed. Also with
better diagnostics of my own, I tried to recreate the SQL statements
on the command line - unfortunately these worked!

I have been using the System Monitor for a while. Previously this showed
it hitting swap memory occasionally, so I've bumped the machine memory
from 4GB to 8GB, and it hasn't done so since.
(Yes this is running 32 bit mainly because PostGIS is 32 bit, but I

I run Postgres on my amd64 boxes and I'm pretty sure they are all 64bit processes. I do not have that much memory though. I'm running Debian.

understand modern Linux has a way to handle more memory (but limited per
process) - and it was using the full 4GB). I do note that it does not
appear to have gone beyond a full 4GB (+ephemera) memory usage.

Offhand I would say it is not good that you are running close to 4GB because, if I recall, this is one of those magic boundaries where a 32-bit pointer wraps around or overflows.

Also, since you run threads, all your threads are in the same process memory and their combined memory cannot exceed whatever the process limit is.

I tried to adjust the shared memory parameter in PostGres but I think
the default must be close to the maximum for standard Ubuntu (something
about having to rebuild the kernel to change SHMEM). So the PostGres
shared memory setting is back to its default (28MB). work_mem has been
upped to 256MB. This change was after the first crash.

From:
http://www.postgresql.org/docs/8.4/static/kernel-resources.html

Linux

The default maximum segment size is 32 MB, which is only adequate for small PostgreSQL installations. However, the remaining defaults are quite generously sized, and usually do not require changes. The maximum shared memory segment size can be changed via the sysctl interface. For example, to allow 128 MB, and explicitly set the maximum total shared memory size to 2097152 pages (the default):

$ sysctl -w kernel.shmmax=134217728
$ sysctl -w kernel.shmall=2097152

In addition these settings can be saved between reboots in /etc/sysctl.conf.

Older distributions might not have the sysctl program, but equivalent changes can be made by manipulating the /proc file system:

$ echo 134217728 >/proc/sys/kernel/shmmax
$ echo 2097152 >/proc/sys/kernel/shmall

And from:
http://www.postgresql.org/docs/7.4/static/kernel-resources.html
Linux

The default shared memory limit (both SHMMAX and SHMALL) is 32 MB in 2.2 kernels, but it can be changed in the proc file system (without reboot). For example, to allow 128 MB:

$ echo 134217728 >/proc/sys/kernel/shmall
$ echo 134217728 >/proc/sys/kernel/shmmax

You could put these commands into a script run at boot-time.

Alternatively, you can use sysctl, if available, to control these parameters. Look for a file called /etc/sysctl.conf and add lines like the following to it:

kernel.shmall = 134217728
kernel.shmmax = 134217728

This file is usually processed at boot time, but sysctl can also be called explicitly later.

Other parameters are sufficiently sized for any application. If you want to see for yourself look in /usr/src/linux/include/asm-xxx/shmparam.h and /usr/src/linux/include/linux/sem.h.

Hope this helps,
-Steve

Richard_Marsden · March 1, 2011, 4:29pm

Thanks for the refs & links - some useful stuff there…

Oh the wonders of Linux documentation - of course all the forks and versions don’t help - the references I found to SHMMEM must have been old.
Thanks for those references. I’ve upped it to 128MB and shared_memory to a conservative 64MB (from 28MB) but the same result.
(I’m also printing off those two Postgres pages about recommended configurations, whilst I type - I’ll probably adjust them further. I have the Apress “From Beginner to Expert” PostGres book but of course it is cross-platform and is more interested in covering a wide range of topics including basic SQL)

After a few runs now, I’m seeing the abort occurs at different places in processing. Assuming that pgrouting’s search through the graph is deterministic (and I haven’t seen anything to say otherwise), this suggests the problem is not data (graph) specific.

I said “threads/processes”: Python’s implementation of multithreading is broken from the multi-core processing perspective - basically there’s one giant lock on the interpreter! Luckily the standard libraries include an alternative which uses OS processes in a thread-like way. I’m using the multiprocessing ‘Pool’ functions to implement what Google have christened “MapReduce” across 1-3 cpus (haven’t dared try 4 yet)
The problem occurs with one process as well as 3 processes, so I don’t think it is a bulk memory limit - instead it is a single process limit.

So far I’ve only seen the problem with Brazil (I’m batch processing country-wide mileage charts). This is the biggest chart I’ve tried (Australia was the previous largest and it wasn’t that much smaller).
One thought is that it could be related to the number of calls on a connection: perhaps a server-side garbage collector is not getting the chance to run? Or there’s a memory leak?
The maximum number of pgRouting calls per connection is currently 2000. Brazil is going to be in the 950-1900ish range (I first try with a small delta. If that fails to find a route, I try a larger delta - hence the two-fold uncertainty)
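
The two-stage delta attempt works roughly like this sketch. Here `find_route` is a hypothetical callable that wraps the dijkstra_sp_delta query and returns an empty result when no route is found; the delta values are placeholders, not the ones actually in use:

```python
def route_with_fallback(find_route, start_node, end_node, deltas=(0.1, 1.0)):
    """Try each delta in turn; return (route, delta) for the first success."""
    for delta in deltas:
        # One pgRouting call per attempt, so a failed small delta doubles the count.
        route = find_route(start_node, end_node, delta)
        if route:
            return route, delta
    return None, None
```

Every point therefore costs either one or two pgRouting calls, which is where the factor-of-two spread in calls per connection comes from.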

I’ve now switched to Asia. If I’m right with the above paragraph, it will probably fail with India but not China
(China has lots of cities but a large number are not matched with road data, so they are skipped)
So far it is running okay, but it has only reached Armenia…

Thanks for the suggestions - they are helping!

Richard

Richard_Marsden · March 2, 2011, 2:33pm

Update…

Well China did crash. Following from my theory that it is a memory leak or other resource limitation from a large number of calls to pgRouting, I added code to cleanly close and reopen the connection every 300 points (i.e. a max of 600 pgRouting calls). I also added a local garbage collection call and a 200ms sleep for good measure.

It appears to be working - China is still running (and should do for a week possibly), but it has run far longer than it did to the first crash or any of the Brazil crashes.

So there is a resource somewhere that is being “used up” when I have large numbers of queries on the same connection. When the connection closes, the resource in question is freed.
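
In outline, the recycling looks something like the sketch below. It is a simplified illustration rather than the production code: `connect` is a hypothetical factory returning a DB-API connection, and the class name is made up. The defaults mirror the numbers above (600 calls, 200ms pause).

```python
import gc
import time


class RecyclingConnection(object):
    """Hand out one connection, closing and reopening it every max_calls uses."""

    def __init__(self, connect, max_calls=600, pause_seconds=0.2, sleep=time.sleep):
        self._connect = connect
        self._max_calls = max_calls
        self._pause_seconds = pause_seconds
        self._sleep = sleep
        self._conn = None
        self._calls = 0

    def get(self):
        """Return a live connection, recycling it once the call budget is spent."""
        if self._conn is not None and self._calls >= self._max_calls:
            self.close()
            gc.collect()  # local (client-side) garbage collection
            self._sleep(self._pause_seconds)
        if self._conn is None:
            self._conn = self._connect()
            self._calls = 0
        self._calls += 1
        return self._conn

    def close(self):
        if self._conn is not None:
            self._conn.close()
            self._conn = None
```

Calling `get()` before each pgRouting query keeps the recycling logic out of the routing loop itself.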

The stack is Psycopg → Postgres → pgRouting.

I found a reference to someone having a similar problem (no pgRouting, but lots of SELECTs through Psycopg), and their solution was to do a commit after every SELECT:

http://stackoverflow.com/questions/4173168/psycopg-postgres-connections-hang-out-randomly
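
For reference, a commit after every SELECT could be sketched as below, using only standard DB-API calls (`cursor`, `execute`, `fetchall`, `commit`). The helper name is made up, and I have not confirmed that this alone cures the abort:

```python
def fetch_and_commit(conn, sql, params=None):
    """Run a SELECT, fetch all rows, then commit to end the open transaction."""
    cur = conn.cursor()
    try:
        if params is None:
            cur.execute(sql)
        else:
            cur.execute(sql, params)
        rows = cur.fetchall()
    finally:
        cur.close()
    # Without a commit (or rollback) the connection stays inside a transaction.
    conn.commit()
    return rows
```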

I haven’t found any other references to this kind of problem.

Richard
