I'm looking at how to import the OpenStreetMap project data into
GRASS. The data is stored as XML;
[ http://www.openstreetmap.org ]
..
This is a huge dataset,
..
It is likely that they'll get up to 10 million+ points though - is
PostGIS more appropriate to use with such a large dataset?
The idea is to use GRASS vector modules to clean up the data, do
shortest path routing, etc.
..
> The id links
> from nodes to segments
> from segments to ways
e.g.
<?xml version='1.0' encoding='UTF-8'?>
<osm version='0.3' generator='JOSM'>
  <node id='2222' lat='47.5118' lon='11.100' />
  <node id='3333' lat='48.4014' lon='10.988' />
  <node id='3334' lat='48.4014' lon='10.988'>
    <tag k='name' v='optional name of node' />
  </node>
  ...
  <segment id='44444' from='2222' to='3333'>
    <tag k='name' v='optional name of segment' />
  </segment>
  ...
  <way id='299353' timestamp='06-5-5 13:56:42'>
    <seg id='44444' />
    <seg id='44445' />
    <seg id='44446' />
    ...
    <tag k='name' v='A99' />
    <tag k='motorway' v='highway' />
    ...
  </way>
</osm>
Hi,
I have now written a bash-script prototype (attached). It's rather slow
and inefficient, and will scale poorly; running it on the whole OSM
dataset including the US TIGER data is impractical (it would take a week
to run). Nonetheless it works* if you use it on the remaining non-US
data.
[*] see bash scripting problem at end of msg.
It is intended to demonstrate the method more clearly than a complex awk
script and to be understood by more readers than a C or Matlab script.
I am currently primarily interested in getting the method right.
For Mark II, maybe Python or Perl would be best? Or via queries to
PostGIS database(s)? (I leave that to others who know those)
There is also a rather dense osm2shp converter written in awk:
http://article.gmane.org/gmane.comp.gis.openstreetmap/3005
[newlines have to be fixed before it will run]
I have doubts as to whether the osm2shp awk script linked above attaches
the correct cats to the correct segments. I haven't finished processing
to check though. If the segments shapefile is loaded into GRASS with
v.in.ogr, v.build.polylines needs to be run afterwards. The awk script
also discards attributes.
how my script works: (the name will change)
OSM data is split into three parts, each with its own ID code (not
unique between the 3 DBs)
1) nodes. x,y[,z,t,creator,..]
2) segments. made up of two (and only two) nodes. (non-polylines)
3) "ways". (i.e. routes) a series of segment lines defining a road.
Usually contains attribute data, but the field names and values seem to
be somewhat user defined* from route to route, so the only common field
useful for route planning was "name". I find "way" too close to
"waypoints", so I will try to call these "routes" to lessen confusion.
[*] e.g. multiple "Road type"s, some "one way" (but which way?), etc.
Thus mostly useless for SQL queries/parsing without lots of cleanup.
processing:
The first step is to create three lookup tables for nodes, segs, and
routes.
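A minimal sketch of how the node and segment lookup tables could be pulled out of the XML, assuming every element sits on one line with single-quoted attributes in the order shown in the sample file. The function names and the "id|lon|lat" column order are illustrative assumptions and are not taken from the attached script; this is a line filter, not a real XML parser.

```shell
# Sketch only: assumes one element per line and single-quoted attributes
# in the same order as the sample OSM file.

# Print "node_id|lon|lat" (i.e. id|x|y) for every <node> line.
osm_nodes_table() {
    sed -n "s/.*<node id='\([^']*\)' lat='\([^']*\)' lon='\([^']*\)'.*/\1|\3|\2/p" "$@"
}

# Print "seg_id|from_node|to_node" for every <segment> line.
osm_segs_table() {
    sed -n "s/.*<segment id='\([^']*\)' from='\([^']*\)' to='\([^']*\)'.*/\1|\2|\3/p" "$@"
}
```

Both functions read the named files, or stdin if none are given, so each table costs a single pass with one sed process rather than one process per line.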
Step two is to populate a "segment_id|x1|y1|x2|y2" table from the
"seg_id|from_node|to_node" and node tables. These lines are then
reformatted into GRASS vector ASCII format, then loaded with v.in.ascii.
I construct the poly-lines from these line segments in GRASS as I don't
trust that all segments will be ordered end-to-end in the routes.
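As a rough illustration of step two, the join and the reformatting can each be done in a single awk pass. This is a sketch under assumptions: the node table is taken to be "id|x|y" and the segment table "seg_id|from_node|to_node", the function names are made up here, and the output follows the GRASS standard ASCII layout as I understand it (an "L" line giving the number of points and categories, the coordinate lines, then a "layer cat" line). Check it against the v.in.ascii manual before relying on it.

```shell
# Sketch only. $1 = node table (id|x|y), $2 = segment table (seg_id|from|to).
# Prints "seg_id|x1|y1|x2|y2"; segments that reference a missing node are skipped.
# Note: the NR == FNR trick assumes the node table is not empty.
join_segments() {
    awk -F'|' -v OFS='|' '
        NR == FNR { x[$1] = $2; y[$1] = $3; next }    # first file: node lookup
        ($2 in x) && ($3 in x) { print $1, x[$2], y[$2], x[$3], y[$3] }
    ' "$1" "$2"
}

# Turn "seg_id|x1|y1|x2|y2" rows into two-point line features,
# using the segment id as the category in layer 1.
segments_to_grass_ascii() {
    awk -F'|' '{ printf "L 2 1\n %s %s\n %s %s\n 1 %s\n", $2, $3, $4, $5, $1 }' "$@"
}
```

The whole node table is held in memory by awk, which is fine for a test extract but would need rethinking for the full dataset.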
Step three is to create a routes table linking each route id to a
comma-separated list of its segment ids and to its attributes:
$RTE_ID|$SEGS|$NAME
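One possible way to build that table is a small awk state machine over the <way> blocks. This is only a sketch: it assumes one element per line and single-quoted attributes as in the sample file, the function name is hypothetical, and a name containing a literal single quote or "|" would break it.

```shell
# Sketch only: prints "route_id|seg,seg,...|name" for each <way> block.
# Fields are split on the single quote, so $2 is the first attribute value
# and, for <tag k='name' v='...'>, $4 is the tag value.
osm_routes_table() {
    awk -F"'" '
        /<way /                         { id = $2; segs = ""; name = ""; inway = 1; next }
        inway && /<seg /                { segs = segs (segs == "" ? "" : ",") $2; next }
        inway && /<tag / && $2 == "name" { name = $4; next }
        inway && /<\/way>/              { print id "|" segs "|" name; inway = 0 }
    ' "$@"
}
```

All other tags are ignored, in keeping with "name" being the only field found useful so far.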
Finally the GRASS loop: v.extract the segments for each route defined
by the table in the last step, then build a polyline from them with
v.build.polylines of the given route_id cat. Patch this into an
aggregate vector. While we are at it, compose a SQL command to connect
the "name" attribute to the correct route_id cat. Once the loop has
finished, the DB is linked and db.execute is run to load the values into
the DB.
v.clean, v.net.* etc can then be attempted.
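The SQL part of that loop could also be generated in one pass from the routes table instead of inside the loop. A sketch, assuming the table has the usual "cat" key column plus a "name" column; the function name and the table name passed to it are hypothetical, and creating the table and feeding the output to db.execute are not shown.

```shell
# Sketch only: reads "route_id|segs|name" rows on stdin and prints one
# UPDATE statement per route. $1 = attribute table name (hypothetical).
# Single quotes inside a name are doubled so the SQL string stays valid.
routes_to_sql() {
    awk -F'|' -v tbl="$1" -v q="'" '
        {
            name = $3
            gsub(q, q q, name)
            printf "UPDATE %s SET name = %s%s%s WHERE cat = %s;\n", tbl, q, name, q, $1
        }'
}
```

For example, `routes_to_sql osm_routes < routes_table.txt > names.sql` would write the statements to a file for later loading.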
There is much room for improvement here, but I think it's a good start.
problem:
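ERRORCODE=0
while [ $ERRORCODE -eq 0 ] ; do
   read LINE
   ERRORCODE=$?
   # stop at end of file ("test $ERRORCODE" alone is always true)
   test $ERRORCODE -ne 0 && continue
   # one subshell plus one grep process per input line;
   # PATTERN stands for whatever is being searched for
   if echo "$LINE" | grep -q "$PATTERN" ; then
      do_minimal_stuff
   fi
done < 100mb_file.txt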
bash takes 800 MB of RAM (that's ok, I've got lots, no swapping) but runs
*incredibly* slowly. Like 80486-SX slowly.
Why is that? What's a better way of working through lines containing
spaces? Set the file as a fd to pass to `read` instead of via redirect?
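The most likely culprit is not the redirect but the pipeline inside the loop: every line forks a subshell and starts a new grep process, which adds up quickly on a file that size. A sketch of one alternative, which matches inside the shell with `case` and reads with `IFS= read -r` so that spaces and backslashes survive. The function name and the fixed-substring test are assumptions standing in for the real grep pattern, and this does not explain the memory use.

```shell
# Sketch only: count the lines of file $1 that contain substring $2,
# without starting any external process per line.
count_matching_lines() {
    local file=$1 needle=$2 line n=0
    # IFS= keeps leading/trailing whitespace, -r keeps backslashes literal
    while IFS= read -r line ; do
        case $line in
            *"$needle"*) n=$((n + 1)) ;;    # stand-in for do_minimal_stuff
        esac
    done < "$file"
    echo "$n"
}
```

If a real regular expression is needed, filtering once up front (`grep pattern file | while IFS= read -r line ; do ... done`) keeps it to a single grep process for the whole file.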
Hamish
(attachments)
osm2seg.txt (5.39 KB)