Hi,
here is an attempt to clear up what the two alternatives are and what they do. BOTH gwc and geosearch contain code that can generate sitemaps. The reason it's a bit rushed, even though we've been working on KML for ages, is that we only caught on to the sitemap stuff about 3 weeks ago. It took a week before we could confirm it worked, then I was busy with other things and finally flat out sick for a week.
The goal is to have Google index the KML we serve using GeoServer and/or GeoWebCache. There are a couple of premises:
1) Googlebot may not crawl or index all the pages / placemarks, hence we must hand them to the bot in order of importance. We currently do that by using the KML hierarchies we have built for Google Earth.
2) Googlebot cannot follow the links inside the KML documents, so we need to create exhaustive sitemaps that link to every feature we want Googlebot to see. So we must either precalculate the hierarchy or generate it iteratively.
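For reference, a sitemap is just an XML list of URLs (see http://www.sitemaps.org); the protocol also offers an optional <priority> element per URL. Below is a minimal sketch of writing one, assuming the tile URLs have already been sorted by importance. This is illustrative only, not the actual GeoServer/GeoWebCache code:

import java.io.PrintWriter;
import java.util.List;

// Illustrative sketch: emit a sitemap for tile URLs that have already
// been sorted by importance (premise 1). Hypothetical code, not the
// actual gwc/geosearch implementation.
public class SitemapSketch {
    public static void write(PrintWriter out, List<String> urlsByImportance) {
        out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        out.println("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">");
        for (String url : urlsByImportance) {
            out.println("  <url>");
            out.println("    <loc>" + url + "</loc>");
            // A <lastmod> element could be added here once we expose
            // when a tile was last regenerated (see gwc advantage 3 below).
            out.println("  </url>");
        }
        out.println("</urlset>");
    }
}

Note that the sitemap protocol caps a single file at 50 000 URLs, so large layers need several sitemaps plus a sitemap index.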
On request I backported the sitemap feature to the geosearch module, but I sense that maybe the importance of this has been overestimated since gwc appears to handle it fine in many cases. Because the two systems have different sources of information they must take different approaches, so the code and the issues are not the same.
The gwc module does the following:
Traverses the entire KML hierarchy (any tile that has not been generated previously is generated during the traversal) and adds all tiles on disk to the sitemap.
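Roughly, the traversal looks like the sketch below. This is simplified, and the names (Tile, existsOnDisk, etc.) are invented for illustration; the real GeoWebCache code is structured differently:

import java.util.List;

// Simplified, hypothetical sketch of the gwc approach: walk the tile
// pyramid top-down, generating any tile not already on disk, and record
// every non-empty tile in the sitemap.
public class GwcSitemapWalker {

    interface Tile {
        boolean existsOnDisk();
        void generate();       // render the KML tile and store it on disk
        boolean hasFeatures(); // empty tiles get no sitemap entry
        List<Tile> children(); // child tiles at the next zoom level
        String url();
    }

    void walk(Tile tile, List<String> sitemap) {
        if (!tile.existsOnDisk()) {
            tile.generate(); // this is why the first sitemap request
                             // effectively seeds the whole cache
        }
        if (!tile.hasFeatures()) {
            return; // assuming an empty tile has no non-empty descendants
        }
        sitemap.add(tile.url());
        for (Tile child : tile.children()) {
            walk(child, sitemap);
        }
    }
}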
Advantages:
1) We get every single placemark, for sure
2) Features stay permanently in the same tiles
3) We can tell Googlebot when a tile was last regenerated (though this is currently not implemented)
4) Serving the tiles is dirt cheap
Disadvantages:
1) Essentially seeds the entire cache when first asked for the sitemap. This may require a lot of space
2) Fairly I/O-intensive even after the cache has been seeded, at least for layers with more than 100 000 features
3) GeoServer has to be reloaded and caches cleared if you change the regionating attribute or number of features per tile
4) Does not honor the "indexable" checkmark you have in GeoServer, but you can still decide whether you want to submit all sitemaps or pick them individually.
The geosearch module does the following:
Looks into the H2 database and creates a sitemap with URLs to the WMS service, built the same way that gwc requests its tiles.
Advantages:
1) The H2 query is fairly expensive because of a "GROUP BY x,y,z", but I'm guessing it is still a lot cheaper than scanning the disk (there is a sketch of this query after these lists)
2) Any changes to the configuration are picked up instantly (but changes to the data are not, unless you kill the H2 database)
3) It's fairly easy to seed the H2 database by launching a client like Google Earth and panning around for a bit
Disadvantages:
1) If Googlebot refreshes the sitemap after picking up only half the links, the full tree may not have been generated yet
2) Depending on how the data is distributed, we will in some cases link to a lot of empty tiles. This may make Googlebot mad.
3) We currently have no automated way of seeding the H2 database (though we have a ticket for 1.7.2)
4) Creating a tile and serving it is medium-expensive
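For the curious, the geosearch side boils down to something like the sketch below. The table and column names, the reflector URL parameters and the layer name are all assumptions for illustration; the real H2 schema may differ:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hypothetical sketch of the geosearch approach: ask the H2 database
// which (x, y, z) tiles contain features and turn each one into a
// reflector URL for the sitemap. Schema and URL format are assumed.
public class GeosearchSitemapSketch {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:h2:./geosearch");
        Statement st = conn.createStatement();
        // The GROUP BY collapses per-feature rows into one row per tile;
        // this is the "fairly expensive" query mentioned in advantage 1.
        ResultSet rs = st.executeQuery(
                "SELECT x, y, z FROM tiles GROUP BY x, y, z");
        while (rs.next()) {
            System.out.println("http://localhost:8080/geoserver/wms/kml"
                    + "?layers=topp:states"
                    + "&x=" + rs.getInt("x")
                    + "&y=" + rs.getInt("y")
                    + "&z=" + rs.getInt("z"));
        }
        rs.close();
        st.close();
        conn.close();
    }
}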
The stuff that I'm having Google index right now exclusively uses the gwc solution. Hence we CAN ship without geosearch and still have sitemaps.
--
Arne Kepp
OpenGeo - http://opengeo.org
Expert service straight from the developers