[GRASSLIST:1032] openstreetmap data import into GRASS

I'm looking at how to import the OpenStreetMap project data into
GRASS. The data is stored as XML:

[ http://www.openstreetmap.org ]

..

This is a huge dataset,

..

It is likely that they'll get up to 10 million+ points though - is
PostGIS more appropriate for such a large dataset?

The idea is to use GRASS vector modules to clean up the data, do
shortest path routing, etc.

..

> The id links
> from nodes to segments
> from segments to ways
e.g.

<?xml version='1.0' encoding='UTF-8'?>
<osm version='0.3' generator='JOSM'>
  <node id='2222' lat='47.5118' lon='11.100' />
  <node id='3333' lat='48.4014' lon='10.988' />
  <node id='3334' lat='48.4014' lon='10.988'>
    <tag k='name' v='optional name of node' />
  </node>
...
  <segment id='44444' from='2222' to='3333'>
    <tag k='name' v='optional name of segment' />
  </segment>
...
  <way id='299353' timestamp='06-5-5 13:56:42'>
    <seg id='44444' />
    <seg id='44445' />
    <seg id='44446' />
...
    <tag k='name' v='A99' />
    <tag k='motorway' v='highway' />
...
  </way>
</osm>

Hi,

I have now written a bash-script prototype (attached). It's rather slow
and inefficient, and will scale poorly; using it on the whole OSM
dataset including the US TIGER data is inappropriate (would take a week
to run). Nonetheless it works* if you use it on the remaining non-US
data.

[*] see bash scripting problem at end of msg.

It is intended to demonstrate the method more clearly than a complex awk
script and to be understood by more readers than a C or Matlab script.
I am currently primarily interested in getting the method right.
For Mark II, maybe Python or Perl would be best? Or via queries to
PostGIS database(s)? (I leave that to others who know those)

There is a rather dense osm2shp converter written in awk:
http://article.gmane.org/gmane.comp.gis.openstreetmap/3005

[newlines have to be fixed before it will run]

I have doubts as to whether the above-linked osm2shp awk script is
attaching the correct cats to the correct segments; I haven't finished
processing to check, though. If the segments shapefile is loaded into GRASS
with v.in.ogr, v.build.polylines needs to be run afterwards. This awk
script discards the attributes as well.

How my script works (the name will change):

OSM data is split into three parts, each with its own ID code (the IDs are
not unique across the three parts):

1) nodes: x,y[,z,t,creator,..]
2) segments: made up of two (and only two) nodes (not polylines)
3) "ways" (i.e. routes): a series of segment lines defining a road.
Ways usually contain attribute data, but the field names and values seem to
be somewhat user-defined* from route to route, so the only common field
useful for route planning was "name". I find "way" too close to
"waypoints", so I will try to call these "routes" to lessen confusion.

[*] e.g. multiple "Road type"s, some "one way" (but which way?), etc.
    Thus mostly useless for SQL queries/parsing without lots of cleanup.

processing:

The first step is to create three lookup tables for nodes, segs, and
routes.
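
For illustration, the node and segment lookup tables could be pulled out by
splitting on the quote characters, something like this (a sketch of the idea
only, not the attached script; it assumes the input is planet.osm with one
XML element per line and single-quoted id/lat/lon and id/from/to attributes
in that order, as in the sample above):

  # node table:    node_id|x|y   (x=lon, y=lat)
  awk -F"'" '/<node / { print $2 "|" $6 "|" $4 }' planet.osm > planet_pt.dat
  # segment table: seg_id|from_node|to_node
  awk -F"'" '/<segment / { print $2 "|" $4 "|" $6 }' planet.osm > planet_seg.dat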

Step two is to populate a "segment_id|x1|y1|x2|y2" table from the
"seg_id|from_node|to_node" and node tables. These lines are then
reformatted into GRASS vector ASCII format, then loaded with v.in.ascii.
I construct the polylines from these line segments in GRASS as I don't
trust that all segments will be ordered end-to-end in the routes.
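
As a sketch of that reformatting and loading step (not the attached script;
the map name is made up, and the option/flag names are as in GRASS 6):

  # one GRASS ASCII "standard" format line per segment, using the OSM
  # segment id as the category
  awk -F'|' 'NR > 1 { printf("L 2 1\n %s %s\n %s %s\n 1 %s\n", $2,$3,$4,$5,$1) }' \
      planet_seg_lines.dat > planet_seg_lines.ascii

  # -n: no header in the ASCII file
  v.in.ascii -n input=planet_seg_lines.ascii output=osm_segments format=standard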

Step three is to create a routes table linking each route to a comma-separated
list of seg IDs, plus attributes:
$RTE_ID|$SEGS|$NAME
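
Again as a sketch (not the attached script; it assumes the same one-element-
per-line planet.osm and that the name values contain no apostrophes), the
way elements can be collapsed into that table in one pass:

  # routes table: rte_id|seg,seg,seg,...|name
  awk -F"'" '
      /<way /   { rte = $2; segs = ""; name = "" }
      /<seg /   { segs = (segs == "") ? $2 : (segs "," $2) }
      /<tag /   { if ($2 == "name") name = $4 }
      /<\/way>/ { print rte "|" segs "|" name }
  ' planet.osm > planet_route.dat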

Finally, the GRASS loop: v.extract the segments for each route defined
by the table in the last step, then build a polyline from them with
v.build.polylines, giving it the route_id as its cat. Patch this into an
aggregate vector. While we are at it, compose an SQL command to connect
the "name" attribute to the correct route_id cat. Once finished,
the DB is linked and db.execute is run to load the values into the DB.
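
Roughly, the loop looks like this (a simplified sketch with made-up map and
file names, GRASS 6 option names; not the attached script, which differs in
detail):

  rm -f routes.sql
  PATCH_LIST=""

  while read LINE ; do
     RTE_ID=`echo "$LINE" | cut -f1 -d'|'`
     SEGS=`echo "$LINE" | cut -f2 -d'|'`    # comma separated segment cats
     NAME=`echo "$LINE" | cut -f3 -d'|'`

     g.remove vect=tmp_segs,tmp_line > /dev/null 2>&1

     # pull out this route's segments, merge them into one polyline, and
     # give the polyline the route_id as its category
     v.extract input=osm_segments output=tmp_segs list="$SEGS"
     v.build.polylines input=tmp_segs output=tmp_line
     v.category input=tmp_line output=route_$RTE_ID option=add cat=$RTE_ID

     PATCH_LIST="$PATCH_LIST,route_$RTE_ID"
     # NB: names containing a ' would need escaping here
     echo "INSERT INTO osm_routes (cat, name) VALUES ($RTE_ID, '$NAME');" >> routes.sql
  done < planet_route.dat

  # patch the per-route maps into one aggregate vector, attach a table to
  # it, then load all the names in one db.execute run
  # (with very many routes, the patching would have to be done in batches)
  v.patch input=`echo "$PATCH_LIST" | sed 's/^,//'` output=osm_routes
  echo "CREATE TABLE osm_routes (cat integer, name varchar(99))" | db.execute
  v.db.connect map=osm_routes table=osm_routes
  db.execute < routes.sql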

v.clean, v.net.* etc can then be attempted.

There is much room for improvement here, but I think it's a good start.

problem:

while [ $ERRORCODE -eq 0 ] ; do
  read LINE
  ERRORCODE=$?
  test $ERRORCODE && continue
  if `echo $LINE | grep` ; then
    do_minimal_stuff()
  fi
done < 100mb_file.txt

bash takes 800mb ram (that's ok, I've got lots, no swapping) but runs
*incredibly* slowly. Like 80486-SX slowly.

why is that? What's a better way of working through lines containing
spaces? Set the file as a fd to pass to `read` instead of via redirect?

Hamish

(attachments)

osm2seg.txt (5.39 KB)

Hamish wrote:

> problem:
>
> while [ $ERRORCODE -eq 0 ] ; do
>   read LINE
>   ERRORCODE=$?
>   test $ERRORCODE && continue
>   if `echo $LINE | grep` ; then
>     do_minimal_stuff()
>   fi
> done < 100mb_file.txt
>
> bash takes 800mb ram (that's ok, I've got lots, no swapping) but runs
> *incredibly* slowly. Like 80486-SX slowly.
>
> why is that? What's a better way of working through lines containing
> spaces? Set the file as a fd to pass to `read` instead of via redirect?

Spawning one grep process per line isn't particularly efficient.

What is that test meant to do?

--
Glynn Clements <glynn@gclements.plus.com>

> problem:
>
> while [ $ERRORCODE -eq 0 ] ; do
> read LINE
> ERRORCODE=$?
> test $ERRORCODE && continue
> if `echo $LINE | grep` ; then
> do_minimal_stuff()
> fi
> done < 100mb_file.txt
>
> bash takes 800mb ram (that's ok, I've got lots, no swapping) but
> runs *incredibly* slowly. Like 80486-SX slowly.
>
> why is that? What's a better way of working through lines containing
> spaces? Set the file as a fd to pass to `read` instead of via
> redirect?

> Spawning one grep process per line isn't particularly efficient.

No it isn't. Doing this in a shell script for prototyping purposes.
So creating/destroying processes is the bottleneck?

top reports:
Cpu(s): 0.0% user, 82.7% system, 17.3% nice, 0.0% idle
  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1949 hamish 17 18 89144 87m 86m S 13.7 8.6 83:50.53 bash
31434 hamish 19 18 89144 87m 86m R 0.3 8.6 0:00.01 bash

(currently running on a 41mb input file)

> What is that test meant to do?

the `echo $LINE | grep` test is really more than that :-)
e.g. grepping the XML for the start of a <segment> record, or
using cut instead of grep to grab a value from the LINE.

an example:

#### create segment table fully populated with coordinates
num_segs=`wc -l planet_seg.dat | cut -f1 -d' '`
i=1
echo "segment|x1|y1|x2|y2" > planet_seg_lines.dat

for LINE in `cat planet_seg.dat` ; do
   SEG_ID=`echo "$LINE" | cut -f1 -d'|'`
   FROM_NODE=`echo "$LINE" | cut -f2 -d'|'`
   TO_NODE=`echo "$LINE" | cut -f3 -d'|'`
# echo "seg $SEG_ID from $FROM_NODE to $TO_NODE"
   FROM_COORD=`grep "^${FROM_NODE}|" planet_pt.dat | cut -f2,3 -d'|'`
   TO_COORD=`grep "^${TO_NODE}|" planet_pt.dat | cut -f2,3 -d'|'`

   if [ 0 -eq `echo $i | awk '{print $1 % 1000}'` ] ; then
      echo "seg $i of $num_segs"
   fi
   echo "$SEG_ID|$FROM_COORD|$TO_COORD" >> planet_seg_lines.dat

   i=`expr $i + 1`
done

(num_segs ~500k)
Yes, the FROM_COORD,TO_COORD greps take a little time but the loop still
runs slow if I skip past them.

I was trying to avoid using awk*, but most of that loop could be done by
a single awk process I guess. Would perl or python be a better solution?
Or will nothing approach C for speed?

[*] A large part of this exercise is to demonstrate the method. Awk's
clarity/readability is so dismal that I might as well do it in C in that
case.

I had the suggestion to change things to:
  while read LINE ; do
    ...
  done < inputfile.txt

instead of
  for LINE in `cat inputfile.txt` ; do
    ...
  done

in order to not load the entire 100mb file into memory, but I get the
same memory footprint, same slow result that way.

The speed does seem to be inversely proportional to the input file size(?),
so I wonder if this could be the problem, even if the above fix isn't
the right one.

I am talking like 1500 iterations per minute of the "for" loop. That
is fairly slow for a P4 2.8GHz, no swapping, 2.4.27 debian kernel ...
I wouldn't have thought it could be _that_ bad.

thanks,
Hamish

Hamish wrote:

> > problem:
> >
> > while [ $ERRORCODE -eq 0 ] ; do
> > read LINE
> > ERRORCODE=$?
> > test $ERRORCODE && continue
> > if `echo $LINE | grep` ; then
> > do_minimal_stuff()
> > fi
> > done < 100mb_file.txt
> >
> > bash takes 800mb ram (that's ok, I've got lots, no swapping) but
> > runs *incredibly* slowly. Like 80486-SX slowly.
> >
> > why is that? What's a better way of working through lines containing
> > spaces? Set the file as a fd to pass to `read` instead of via
> > redirect?
>
> > Spawning one grep process per line isn't particularly efficient.

> No it isn't. Doing this in a shell script for prototyping purposes.
> So creating/destroying processes is the bottleneck?

Either that or do_minimal_stuff(), depending upon what the latter
does.

> > What is that test meant to do?

> the `echo $LINE | grep` test is really more than that :-)
> e.g. grepping the XML for the start of a <segment> record, or
> using cut instead of grep to grab a value from the LINE.
>
> an example:
>
> #### create segment table fully populated with coordinates
> num_segs=`wc -l planet_seg.dat | cut -f1 -d' '`
> i=1
> echo "segment|x1|y1|x2|y2" > planet_seg_lines.dat
>
> for LINE in `cat planet_seg.dat` ; do
>    SEG_ID=`echo "$LINE" | cut -f1 -d'|'`
>    FROM_NODE=`echo "$LINE" | cut -f2 -d'|'`
>    TO_NODE=`echo "$LINE" | cut -f3 -d'|'`
> # echo "seg $SEG_ID from $FROM_NODE to $TO_NODE"
>    FROM_COORD=`grep "^${FROM_NODE}|" planet_pt.dat | cut -f2,3 -d'|'`
>    TO_COORD=`grep "^${TO_NODE}|" planet_pt.dat | cut -f2,3 -d'|'`
>
>    if [ 0 -eq `echo $i | awk '{print $1 % 1000}'` ] ; then
>       echo "seg $i of $num_segs"
>    fi
>    echo "$SEG_ID|$FROM_COORD|$TO_COORD" >> planet_seg_lines.dat
>
>    i=`expr $i + 1`
> done

Ah. That's a different matter. I took "echo $LINE | grep" to imply
that you were feeding the line to grep. If you're grepping a complete
other file for every line of input, you may have bigger problems than
just the command-spawning overhead.

And the command-spawning overhead is pretty bad: 5 invocations of cut
and 2 of grep per line of input.

[I dread to think how slow that would be on Cygwin, where spawning
commands is anything up to 100 times slower than Linux (or another
"real" Unix) on the same hardware.]

> (num_segs ~500k)
> Yes, the FROM_COORD,TO_COORD greps take a little time but the loop still
> runs slow if I skip past them.

Ok, so the command-spawning overhead probably is significant.

> I was trying to avoid using awk*, but most of that loop could be done by
> a single awk process I guess. Would perl or python be a better solution?
> Or will nothing approach C for speed?

Anything which can do everything in a single process will be faster
than spawning multiple external commands per line of input.

Simple string operations equivalent to the cut or grep commands
require a handful of CPU cycles per byte. Spawning a command means
opening the executable, setting up its memory segments, mapping the
shared libraries which it uses, undoing all that on termination, etc.

Also, grep will have to compile the regex into a state table every
time; if you were using the C regex functions or a language with
built-in regex support, you would normally only compile the regex once
(the same issues apply to a literal string; the same search algorithm
is used).

Even an interpreted language would be very much faster than that. C
might not be all that much faster. Or even any faster; if the
processing is simple, awk/perl/etc may be able to process the data as
fast as the disk can supply it.

> [*] A large part of this exercise is to demonstrate the method. Awk's
> clarity/readability is so dismal that I might as well do it in C in that
> case.

Hmm. Pure C can be pretty dismal for string handling; awk has the
advantage that strings are first-class values, i.e. you don't have to
worry about allocation and the like.

It may be that the awk script could be written better. OTOH, a C
program would probably either leave most of the work to an XML library
or use lex/yacc for the parsing.

Obviously, any approach which doesn't use a dedicated XML library is
going to be assuming a very restricted subset of XML (e.g. no newlines
in awkward places). There's no way that you can parse arbitrary XML
with a simple script.

> I had the suggestion to change things to:
>   while read LINE ; do
>     ...
>   done < inputfile.txt
>
> instead of
>   for LINE in `cat inputfile.txt` ; do
>     ...
>   done
>
> in order to not load the entire 100mb file into memory, but I get the
> same memory footprint, same slow result that way.

I hadn't even noticed that; yes, processing files should be done with
"while read ...", not a for loop.

> The speed does seem to be inversely proportional to the input file size(?),
> so I wonder if this could be the problem, even if the above fix isn't
> the right one.

If the time taken to run grep on planet_pt.dat is significant, the
time per line would be proportional to the size of planet_pt.dat (plus
any fixed overhead), and so the total time taken would be proportional
to the product of the files' sizes.

That issue won't go away simply by re-writing the program in another
language. To get around that issue (which could be a bigger issue than
the command-spawning overhead if planet_pt.dat is large), the lookup
needs to be more efficient, e.g. hashtable, btree, or bisection
search, rather than a linear search.
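
E.g. a single awk pass could hold the node coordinates in an associative
array (one form of hashed lookup) and do the whole join in one process
(an untested sketch, assuming planet_pt.dat holds node_id|x|y and
planet_seg.dat holds seg_id|from|to, as your greps imply):

  awk -F'|' '
      BEGIN { print "segment|x1|y1|x2|y2" }
      NR == FNR { coord[$1] = $2 "|" $3; next }    # 1st file: node_id|x|y
      ($2 in coord) && ($3 in coord) {             # 2nd file: seg_id|from|to
          print $1 "|" coord[$2] "|" coord[$3]
      }
  ' planet_pt.dat planet_seg.dat > planet_seg_lines.dat

That removes both the per-line command spawning and the linear search, and
the same approach carries over directly to perl or python.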

> I am talking like 1500 iterations per minute of the "for" loop. That
> is fairly slow for a P4 2.8GHz, no swapping, 2.4.27 debian kernel ...
> I wouldn't have thought it could be _that_ bad.

Oh, I would. Spawning a command is a /lot/ of work. Consider:

  $ strace /usr/bin/true
  execve("/usr/bin/true", ["true"], [/* 44 vars */]) = 0
  uname({sys="Linux", node="cerise", ...}) = 0
  brk(0) = 0x804c000
  access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
  open("/etc/ld.so.cache", O_RDONLY) = 3
  fstat64(3, {st_mode=S_IFREG|0644, st_size=71395, ...}) = 0
  mmap2(NULL, 71395, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7fa9000
  close(3) = 0
  open("/lib/libc.so.6", O_RDONLY) = 3
  read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0000V\1\000"..., 512) = 512
  fstat64(3, {st_mode=S_IFREG|0755, st_size=1236576, ...}) = 0
  mmap2(NULL, 1142068, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7e92000
  mmap2(0xb7fa3000, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x110) = 0xb7fa3000
  mmap2(0xb7fa7000, 7476, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb7fa7000
  close(3) = 0
  mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7e91000
  mprotect(0xb7fa3000, 4096, PROT_READ) = 0
  mprotect(0xb7fce000, 4096, PROT_READ) = 0
  munmap(0xb7fa9000, 71395) = 0
  open("/dev/urandom", O_RDONLY) = 3
  read(3, "2\6\217\366", 4) = 4
  close(3) = 0
  exit_group(0) = ?

That's a fork() and exec() (both of which are expensive), plus another
22 system calls in the startup, for a program which doesn't even do
anything; at least 48 context switches, plus any for page faults.

Equivalent string-processing functions such as strstr or strcpy are
trivial in comparison.

--
Glynn Clements <glynn@gclements.plus.com>