GSoC 2025: Week 3 Report: AI-ready Dataset Metadata as a Service using ZOO-Project

Progress Report for Week 3 (June 16 – June 22)

1. What did I get done this week?

  • Implemented geocroissant_to_geodcat.py: Developed a modular Python script to convert GeoCroissant metadata (croissant.json) into GeoDCAT-compliant RDF using rdflib, designed for reuse and automation.
  • Created conversion-geodcat-notebook.ipynb: Built a companion Jupyter notebook that demonstrates and explains the GeoCroissant-to-GeoDCAT conversion process interactively, including validation, RDF inspection, and querying.
  • Mapped GeoCroissant to GeoDCAT vocabulary: Defined and applied a consistent mapping between metadata fields (e.g., name → dct:title, distribution → dcat:Distribution, etc.), ensuring semantic accuracy across standards.
  • Handled spatial and temporal dimensions: Extracted temporalExtent (startDate, endDate) and spatialExtent from the source metadata and represented them in the output (spatial coverage as dct:spatial), using controlled vocabularies and persistent URIs.
  • Generated and structured distributions: Translated each distribution into a dcat:Distribution, enriched with access URLs, media types, checksum validation (SHA256), and hierarchical relationships (isPartOf, hasPart).
  • Produced interoperable RDF outputs: Serialized final metadata into both JSON-LD (geodcat.jsonld) and Turtle (geodcat.ttl) formats to support semantic web integration and catalog ingestion.
  • Validated with SHACL constraints: Used pyshacl to confirm conformance of the RDF graphs to GeoDCAT structural requirements, improving metadata robustness and quality.
  • Audited metadata via querying: Inspected and extracted key elements such as dcat:distribution and dcat:accessURL from the RDF graph to verify completeness and correctness of dataset access metadata.
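
To illustrate the kind of mapping described in the list above, here is a minimal sketch. It is not the actual geocroissant_to_geodcat.py: it uses only the Python standard library and builds a JSON-LD dictionary directly, whereas the real script works with rdflib graphs. The function names, the sample record, and the example.org URLs are hypothetical, and the input field names (contentUrl, encodingFormat, sha256) are assumed to follow the usual Croissant distribution layout.

```python
import json

# Prefixes for the JSON-LD @context (DCAT, Dublin Core Terms, SPDX, XSD).
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "spdx": "http://spdx.org/rdf/terms#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def distribution_to_dcat(dist):
    """Map one Croissant-style distribution entry to a dcat:Distribution node."""
    node = {"@type": "dcat:Distribution", "dct:title": dist["name"]}
    if "contentUrl" in dist:
        node["dcat:accessURL"] = {"@id": dist["contentUrl"]}
    if "encodingFormat" in dist:
        # Assumption: kept as a plain string here; catalogs often expect a media type URI.
        node["dcat:mediaType"] = dist["encodingFormat"]
    if "sha256" in dist:
        node["spdx:checksum"] = {
            "@type": "spdx:Checksum",
            "spdx:algorithm": {"@id": "spdx:checksumAlgorithm_sha256"},
            "spdx:checksumValue": dist["sha256"],
        }
    return node


def croissant_to_geodcat(meta, dataset_uri):
    """Build a JSON-LD dcat:Dataset from a (Geo)Croissant-like dictionary."""
    dataset = {
        "@context": CONTEXT,
        "@id": dataset_uri,
        "@type": "dcat:Dataset",
        "dct:title": meta["name"],  # name -> dct:title
        "dcat:distribution": [
            distribution_to_dcat(d) for d in meta.get("distribution", [])
        ],
    }
    if "description" in meta:
        dataset["dct:description"] = meta["description"]
    temporal = meta.get("temporalExtent")
    if temporal:
        # temporalExtent (startDate, endDate) -> dct:temporal period
        dataset["dct:temporal"] = {
            "@type": "dct:PeriodOfTime",
            "dcat:startDate": {"@value": temporal["startDate"], "@type": "xsd:date"},
            "dcat:endDate": {"@value": temporal["endDate"], "@type": "xsd:date"},
        }
    return dataset


def access_urls(dataset):
    """Collect every dcat:accessURL, mirroring the audit step in the report."""
    return [
        d["dcat:accessURL"]["@id"]
        for d in dataset.get("dcat:distribution", [])
        if "dcat:accessURL" in d
    ]


# Hypothetical input record, for illustration only.
SAMPLE = {
    "name": "demo-dataset",
    "description": "Illustrative record, not real data.",
    "temporalExtent": {"startDate": "2020-01-01", "endDate": "2020-12-31"},
    "distribution": [
        {
            "name": "tiles",
            "contentUrl": "https://example.org/data/tiles.tif",
            "encodingFormat": "image/tiff",
            "sha256": "abc123",
        }
    ],
}

geodcat = croissant_to_geodcat(SAMPLE, "https://example.org/dataset/demo")
print(json.dumps(geodcat, indent=2))
```

The resulting dictionary can be written to a .jsonld file with json.dump. Loading it into an rdflib graph, serializing to Turtle, and validating with pyshacl would then follow as in the report; those steps are left out of this sketch.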

2. Plan for Next Week (June 23 – June 29):

  • Discuss GeoCroissant support in the Hugging Face dataset-viewer repository on GitHub.
  • Continue structuring Kaggle datasets into GeoCroissant format.
  • Add Hugging Face and Kaggle-specific attributes (e.g., citation, license, split, repository) to GeoCroissant metadata.
  • Explore using hf.co/datasets URLs and Kaggle dataset download links as dcat:accessURL values in each distribution.

3. Am I blocked on anything?

  • No, I am not currently blocked on anything.

Links to Work Done:

Best Regards,
Harsh Shinde