On 2/28/2011 8:54 PM, Richard Marsden wrote:
Thanks for the reply and suggestions.
Well I have now run the same script but with just one thread/process. I
would have expected this to have worked if it was a "bulk" out of memory
problem (only one pgRouting process running). It failed. Also with
better diagnostics of my own, I tried to recreate the SQL statements
on the command line - unfortunately these worked!
I have been using the System Monitor for a while. Previously this showed
it hitting swap memory occasionally, so I've bumped the machine memory
from 4GB to 8GB, and it hasn't done so since.
(Yes this is running 32 bit mainly because PostGIS is 32 bit, but I
understand modern Linux has a way to handle more memory (but limited per
process) - and it was using the full 4GB). I do note that it does not
appear to have gone beyond a full 4GB (+ephemera) memory usage.
I run Postgres on my amd64 boxes and I'm pretty sure they are all 64-bit processes. I do not have that much memory though. I'm running Debian.
Offhand I would say it is not good that you are running close to 4GB because, if I recall, this is one of those magic boundaries where a 32-bit pointer wraps around or overflows.
Also, since you run threads, all your threads are in the same process memory and their combined memory cannot exceed whatever the process limit is.
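To make the 4GB boundary concrete, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and it only reports the pointer width of the Python interpreter that runs it, not of the Postgres server.

```python
import struct


def pointer_bits():
    """Return the pointer width, in bits, of the running Python process."""
    return struct.calcsize("P") * 8


def address_space_bytes(bits):
    """Bytes addressable by a flat pointer of the given width."""
    return 2 ** bits


if __name__ == "__main__":
    print("this interpreter uses %d-bit pointers" % pointer_bits())
    # A 32-bit pointer can address 2**32 bytes, i.e. 4 GiB in total.
    print("32-bit limit: %d GiB" % (address_space_bytes(32) // 2 ** 30))
```

On 32-bit Linux the usable per-process limit is normally lower than this figure, because the kernel reserves part of the address space for itself.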
I tried to adjust the shared memory parameter in PostGres but I think
the default must be close to the maximum for standard Ubuntu (something
about having to rebuild the kernel to change SHMEM). So the PostGres
shared memory setting is back to its default (28MB). work_mem has been
upped to 256MB. This change was after the first crash.
From:
http://www.postgresql.org/docs/8.4/static/kernel-resources.html
Linux
The default maximum segment size is 32 MB, which is only adequate for small PostgreSQL installations. However, the remaining defaults are quite generously sized, and usually do not require changes. The maximum shared memory segment size can be changed via the sysctl interface. For example, to allow 128 MB, and explicitly set the maximum total shared memory size to 2097152 pages (the default):
$ sysctl -w kernel.shmmax=134217728
$ sysctl -w kernel.shmall=2097152
In addition these settings can be saved between reboots in /etc/sysctl.conf.
Older distributions might not have the sysctl program, but equivalent changes can be made by manipulating the /proc file system:
$ echo 134217728 >/proc/sys/kernel/shmmax
$ echo 2097152 >/proc/sys/kernel/shmall
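The numbers in that excerpt can be checked with a few lines of Python. This is only a sketch: the helper names are hypothetical, and the 4096-byte page size is an assumption that should be confirmed with `getconf PAGE_SIZE` on the machine in question.

```python
import os

PAGE_SIZE = 4096  # assumed page size in bytes; verify with `getconf PAGE_SIZE`


def mb_to_bytes(megabytes):
    """Convert megabytes (1024 * 1024 bytes) to bytes, as used by shmmax."""
    return megabytes * 1024 * 1024


def pages_to_bytes(pages, page_size=PAGE_SIZE):
    """Convert a page count, as used by shmall, to bytes."""
    return pages * page_size


def read_kernel_param(name, proc_dir="/proc/sys/kernel"):
    """Return the integer value of a kernel parameter, or None if unreadable."""
    try:
        with open(os.path.join(proc_dir, name)) as handle:
            return int(handle.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None


if __name__ == "__main__":
    print("128 MB in bytes:", mb_to_bytes(128))
    print("2097152 pages in bytes:", pages_to_bytes(2097152))
    print("current shmmax:", read_kernel_param("shmmax"))
```

Comparing the current `shmmax` with the shared memory Postgres is configured to request should show whether the kernel limit is what is holding the setting back.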
And from:
http://www.postgresql.org/docs/7.4/static/kernel-resources.html
Linux
The default shared memory limit (both SHMMAX and SHMALL) is 32 MB in 2.2 kernels, but it can be changed in the proc file system (without reboot). For example, to allow 128 MB:
$ echo 134217728 >/proc/sys/kernel/shmall
$ echo 134217728 >/proc/sys/kernel/shmmax
You could put these commands into a script run at boot-time.
Alternatively, you can use sysctl, if available, to control these parameters. Look for a file called /etc/sysctl.conf and add lines like the following to it:
kernel.shmall = 134217728
kernel.shmmax = 134217728
This file is usually processed at boot time, but sysctl can also be called explicitly later.
Other parameters are sufficiently sized for any application. If you want to see for yourself look in /usr/src/linux/include/asm-xxx/shmparam.h and /usr/src/linux/include/linux/sem.h.
Hope this helps,
-Steve
Otherwise it is difficult to watch with top or the system monitor
because so far it has had to run a while (hours) before the crash occurs.
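Since the run takes hours, one option is to log memory use from a script rather than watch top. The sketch below assumes a Linux `/proc` file system. The function names are hypothetical, and the PID to watch would have to come from somewhere such as `SELECT pg_backend_pid()`.

```python
import re


def parse_vm_rss_kb(status_text):
    """Extract VmRSS (resident memory, in kB) from /proc/<pid>/status text."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kb(pid):
    """Return the resident memory of process `pid` in kB, or None if it is gone."""
    try:
        with open("/proc/%d/status" % pid) as handle:
            return parse_vm_rss_kb(handle.read())
    except OSError:
        return None
```

Calling `read_rss_kb` every few seconds and appending the result to a log file with a timestamp would leave a record of memory growth leading up to the crash.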
I guess as a kludgy workaround I could try trapping the client error,
wait, and skip (or try again). This should work for a single thread, but
might pose problems for my multi-threaded app. That's the problem when
the server dies - all client threads have trouble until it restarts.
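A rough sketch of that trap, wait, and retry-or-skip idea for a single thread follows. Everything here is hypothetical: `connect` and `run_query` are callables supplied by the caller. With psycopg2 the natural choice for `retry_on` would probably be `psycopg2.OperationalError`, but which exception a dropped connection actually raises is an assumption to verify.

```python
import time


def run_with_retry(connect, run_query, retries=3, delay=5.0,
                   retry_on=(Exception,), sleep=time.sleep):
    """Run `run_query(conn)` on a fresh connection, retrying on failure.

    `connect` is a zero-argument callable returning a new connection;
    `retry_on` lists the exception types treated as "server went away".
    Returns the query result, or None if every attempt failed (skip).
    """
    for attempt in range(retries):
        try:
            conn = connect()
            try:
                return run_query(conn)
            finally:
                conn.close()
        except retry_on:
            if attempt + 1 < retries:
                sleep(delay)  # give the server time to restart
    return None
```

Opening a new connection on every attempt is deliberate: after the server restarts, the old connection is unusable. In a multi-threaded client each worker would need its own connection and its own retry loop.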
Richard Marsden
On Mon, Feb 28, 2011 at 4:43 PM, Stephen Woodbridge
<woodbri@swoodbridge.com> wrote:
The only thing(s) that I can think of are:
1. it could be caused by a call to abort() or assert() in the C
code, but:
woodbri@mappy:~/work/pgrouting-git/pgrouting$ find * -type f -exec grep -l -i abort {} \;
woodbri@mappy:~/work/pgrouting-git/pgrouting$ find * -type f -exec grep -l -i assert {} \;
core/src/CMakeFiles/routing.dir/depend.make
core/src/CMakeFiles/routing.dir/astar_boost_wrapper.o
core/src/CMakeFiles/routing.dir/shooting_star_boost_wrapper.o
core/src/CMakeFiles/routing.dir/depend.internal
core/src/CMakeFiles/routing.dir/boost_wrapper.o
core/src/CMakeFiles/routing.dir/CXX.includecache
lib/librouting.so
So it does not look like we have one in our source code, but there
appear to be references in the .o files that might come from
compiler-generated code or includes outside our source tree, like
Boost or system libs.
2. I suppose it is possible that the server is sending a SIGABRT to
a child process that is doing something bad like taking too much
memory. Or maybe there is an OOM (Out Of Memory) watchdog process
killing it with a SIGABRT.
Have you watched this with top? or some other process watcher?
Hopefully, you can extract the SQL and run it from the command line
so we can get a better handle on what is happening and what the query is.
-Steve
On 2/28/2011 2:49 PM, Richard Marsden wrote:
Well I've moved forward and now have code in production calculating
mileages from OpenStreetMap data: I've calculated mileage charts for
Oceania and Africa.
The secret to get that far was to move operating systems from Windows to
Ubuntu and then upgrade to pgRouting 1.05. (PostGres 8.4, Ubuntu 10)
The computations are being performed with dijkstra_sp_delta.
However now I'm hitting another "server closed the connection
unexpectedly" error.
Looking in the server logs, I find the LOG message "server process (PID
19133) was terminated by signal 6: Aborted"
From what I can tell, Signal 6 on Ubuntu is indeed a SIGABRT. There are
no other log messages to indicate why Postgres/pgRouting threw a
SIGABRT.
This is then followed by warnings and log messages saying other active
server processes are being terminated, transactions rolled back, etc.
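The signal number in the log can be checked against the platform's own table. This is a small sketch using Python's standard `signal` module. The helper name is made up, and the result only describes the machine the snippet runs on.

```python
import signal


def signal_name(number):
    """Return the symbolic name for a signal number on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return "unknown"


if __name__ == "__main__":
    # The server log reported "terminated by signal 6".
    print(6, signal_name(6))
```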
This error occurs at a reproducible point in a fairly sophisticated
(multi-processor, Python, psycopg) script. Although I'm pretty certain
of the SQL that is causing the problem, at the moment I don't have the
exact parameters (ie. graph nodes). I'm about to run the script
single-threaded with diagnostics so I should be able to get a single SQL
statement to reproduce the problem on a psql command line. In the worst
case, this could take a couple of days.
No other programs are running that are calling Postgres.
My graph consists of the global OSM street data loaded into PostGIS with
osm2po. I have checked for links of zero length. In fact all links <1m
long have been taken out of the graph. I've just double checked costs
and reverse_costs: all are positive (I've set these to the lengths).
I've just checked for start & end nodes being the same (ie. resulting in
dijkstra_sp_delta being called with the same node identifier for the
start and end): Yes my data has a few of these, but I'm pretty certain
the crash occurs before they appear. However, I'm going to add code to
detect these - there's no point in executing an SQL statement for
something that can be calculated in a trivial line of python.
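The check described above could look something like this sketch. `query_route` is a hypothetical callable standing in for whatever runs the dijkstra_sp_delta SQL and returns a length. It is not part of pgRouting or psycopg.

```python
def route_length(start_node, end_node, query_route):
    """Return the route length, skipping the database for trivial routes.

    `query_route` is a hypothetical callable that runs the routing SQL
    and returns a length for the given start and end node identifiers.
    """
    if start_node == end_node:
        return 0.0  # same node: nothing to route, so no SQL is issued
    return query_route(start_node, end_node)
```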
What else should I be looking for? Are there any known problems I should
look for? Is there any way of finding out what is causing the Signal 6?
Once I have the node identifiers that are causing the problem, I should
be able to make an exportable extract of the graph to give a
reproducible dataset and matching SQL statement. Would anyone be able to
investigate this?
Is there any way of making pgRouting / PostGres handle these situations
more cleanly? At the moment, the crash is taking the server down with
it. The crash is perhaps the first to occur after roughly 1 million
route calculations: I can live with that failure rate - but only if my
scripts can cleanly detect and recover from it. I guess ideally the
server should stay up and a status value (or exception - but that
probably wouldn't work across so many code boundaries) be returned.
Best regards,
Richard Marsden
_______________________________________________
Pgrouting-users mailing list
Pgrouting-users@lists.osgeo.org
http://lists.osgeo.org/mailman/listinfo/pgrouting-users