[GeoNetwork-devel] Harvesting Thredds problems

Hello there,
This is Filipe Freire, from INESCTEC (www.inesctec.pt) in Portugal. I am currently working on an oceanography research project that requires harvesting Thredds catalogs.

I’m not able to harvest Thredds correctly.

I have GeoNetwork 3.0.1 (running on Tomcat 7 with a PostgreSQL database) on a VM. There is also a public GeoNetwork (RAIA Geonetwork), and we are trying to harvest the same Thredds catalog into both instances (the public one and the one set up on the VM).
While the harvester is running, geonetwork.log prints a couple of “SVN repository for metadata enabled but no repository available” messages. After harvesting finishes, some services are harvested, but the metadata isn’t harvested correctly. The title (or datasetName), for example, seems to be extracted, but then the whole catalog.xml shows up in the Overview. It’s as if no metadata was extracted properly. You can view a screenshot here, hosted on Google Drive.

After setting up an XSLT processor (using the same Saxon .jar file as GeoNetwork) and applying thredds-metadata.xsl directly to the catalog.xml, I get an output file that contains only an XML declaration. When I apply another stylesheet, such as the “SOS get capabilities” one, to a corresponding .xml file from an SOS, the output seems to show the metadata correctly.
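
An output containing only an XML declaration would be consistent with the stylesheet's templates matching nothing in the input, and one possible (unconfirmed) cause is a namespace mismatch between the stylesheet and the catalog. A minimal sketch of how that could be checked, in Python with the standard library only, follows. The inline catalog is a made-up sample standing in for the real catalog.xml, and the helper names `split_tag` and `describe_catalog` are hypothetical; the namespace URI is the one used by THREDDS InvCatalog 1.0 documents.

```python
import xml.etree.ElementTree as ET

# Namespace used by THREDDS InvCatalog 1.0 documents.
THREDDS_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"

# Hypothetical sample standing in for the real catalog.xml.
SAMPLE_CATALOG = """<catalog
    xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
    name="Example catalog">
  <dataset name="Model output" ID="model">
    <dataset name="Forecast 1" ID="model/f1" urlPath="model/f1.nc"/>
  </dataset>
</catalog>"""


def split_tag(tag):
    """Split an ElementTree tag of the form '{namespace}local' into its parts."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return "", tag


def describe_catalog(xml_text):
    """Report the root element, its namespace, and how many dataset elements exist."""
    root = ET.fromstring(xml_text)
    namespace, local = split_tag(root.tag)
    # Count dataset elements by local name, regardless of their namespace.
    datasets = [el for el in root.iter() if split_tag(el.tag)[1] == "dataset"]
    return {
        "root": local,
        "namespace": namespace,
        "namespace_matches": namespace == THREDDS_NS,
        "dataset_count": len(datasets),
    }


info = describe_catalog(SAMPLE_CATALOG)
print(info)
```

If `namespace_matches` came out false for the real catalog, or the dataset count were zero, that would point at the input document rather than at the stylesheet or the Saxon setup.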

Reading through the user guide and the developer documentation, I find the whole Thredds harvesting “pipeline” a bit confusing. In particular, it is unclear to me where thredds-metadata.xsl is applied to the Thredds catalog.xml to create fragments, and how or where those fragments are combined with a template to create the metadata.

After cloning the GeoNetwork git repository, I also found a Harvester.java, containing a Harvester class, in the harvesters folder. Reading through the code, I see the following pipeline:
“load categories and groups → set up the proxy (which has been done) → load the catalog → display the catalog read in the log file (which tries to resolve a “ThreddsCatalog-to-ISO19119_ISO19139.xsl” that I’m not able to find in schema_plugins) → get the base host URL → crawl all datasets in the Thredds catalog → show how many datasets have been processed → create service records → a TODO: add links to services provided by the Thredds catalog → save the metadata to GeoNetwork’s database”.
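
To illustrate what the “crawl all datasets” step roughly involves, here is a small Python sketch that walks the nested dataset elements of a catalog depth first. It is not the code from Harvester.java: the sample catalog, the `crawl` function and the record layout are all hypothetical, and catalogRef links to other catalogs are not followed.

```python
import xml.etree.ElementTree as ET

# Prefix mapping for the THREDDS InvCatalog 1.0 namespace.
NS = {"t": "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"}

# Hypothetical sample catalog; a real one would be fetched from the server.
SAMPLE = """<catalog
    xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
    name="Example">
  <service name="odap" serviceType="OPENDAP" base="/thredds/dodsC/"/>
  <dataset name="Ocean model">
    <dataset name="Run A" urlPath="ocean/run_a.nc"/>
    <dataset name="Run B" urlPath="ocean/run_b.nc"/>
  </dataset>
</catalog>"""


def crawl(element, parents=()):
    """Yield (name path, urlPath) for every dataset below element, depth first."""
    for dataset in element.findall("t:dataset", NS):
        names = parents + (dataset.get("name"),)
        # Collection datasets have no urlPath, so the second item may be None.
        yield "/".join(names), dataset.get("urlPath")
        yield from crawl(dataset, names)


records = list(crawl(ET.fromstring(SAMPLE)))
for path, url_path in records:
    print(path, url_path)
print(len(records), "datasets processed")
```

Comparing the datasets found by a walk like this with what the harvester reports as processed could help narrow down at which step the metadata gets lost.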

Can anyone point me to where the problem might be, and where I can look for the answer?

Thanks in advance,
Filipe Freire