GSoC 2025: Week 1 Report: AI-ready Dataset Metadata as a Service using ZOO-Project

Hi all,

Here’s my first weekly progress report for the Official Coding Period of GSoC 2025.

Week 1 Report (June 2nd - June 8th)

What did I get done this week?

  • Explored the GeoCroissant data format and the mlcroissant Python library.
  • Installed and configured required Python packages (mlcroissant, rasterio, datasets, torch, etc.) for geospatial machine learning workflows.
  • Programmatically generated Croissant-compatible JSON-LD metadata for the HLS Burn Scars dataset using the mlcroissant library.
  • Validated the metadata and resolved warnings related to missing fields (citeAs, datePublished, version).
  • Created a custom Hugging Face dataset loading script (hls.py) for the HLS Burn Scars dataset, enabling seamless integration with the datasets library.
  • Loaded and explored the dataset’s training split using the custom script, and visualized sample satellite images and their corresponding annotation masks using rasterio and matplotlib.
  • Developed a PyTorch data pipeline (BurnScarsDataset) to handle loading and preprocessing of geospatial image-mask pairs.
  • Implemented a U-Net architecture in PyTorch and trained the model.
  • The updated wiki page can be found at [1].
  • The GSoC repository can be found at [2].
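
As a rough illustration of the missing-field check mentioned in the list above, here is a minimal sketch in plain Python. It does not use mlcroissant's own validator. The helper name `missing_fields` and the trimmed record are hypothetical, and all values are placeholders rather than the real HLS Burn Scars metadata.

```python
import json

# Fields that the validation step in this report flagged as missing.
FLAGGED_FIELDS = ("citeAs", "datePublished", "version")


def missing_fields(metadata, required=FLAGGED_FIELDS):
    """Return the required keys that are absent or empty in a metadata dict."""
    return [key for key in required if not metadata.get(key)]


# Hypothetical, heavily trimmed Croissant-style record (JSON-LD context omitted).
metadata = {
    "@type": "sc:Dataset",
    "name": "HLS Burn Scars",
    "description": "placeholder description",
}

# Before the fix: all three flagged fields are reported.
print(missing_fields(metadata))

# Fill the fields with placeholder values, then check again.
metadata.update(
    {
        "citeAs": "placeholder citation",
        "datePublished": "2025-01-01",
        "version": "0.1.0",
    }
)
print(missing_fields(metadata))
print(json.dumps(metadata, indent=2))
```

A check like this is only a quick pre-flight test before running the library's real validation, which covers far more than these three fields.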

What do I plan on doing next week? (June 9th - June 15th)

  • Demonstrate a GeoCroissant to STAC conversion example and explain the conversion process step by step.
  • Write Python scripts for the GeoCroissant to STAC conversion and document the key differences between the GeoCroissant and STAC metadata formats.
  • Explore existing tools or libraries that can support or simplify the conversion, and test the conversion script on the metadata.
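
To give a feel for what the planned conversion might involve, here is a minimal sketch that maps a Croissant-style dict to a bare-bones STAC Item using only the standard library. It is not the project's conversion script. The function name `croissant_to_stac_item`, the choice to map `datePublished` to the Item's `datetime`, and the explicit `bbox` argument are all assumptions made for illustration; the sample values are placeholders.

```python
import json


def croissant_to_stac_item(croissant, bbox):
    """Map a Croissant-style dict to a minimal STAC Item dict.

    bbox is [west, south, east, north] in WGS84 degrees (an assumed input;
    how GeoCroissant records spatial extent is not covered here).
    """
    west, south, east, north = bbox
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        # Assumption: derive the Item id from the dataset name.
        "id": croissant["name"].lower().replace(" ", "-"),
        "bbox": [west, south, east, north],
        # Closed polygon ring built from the bounding box corners.
        "geometry": {
            "type": "Polygon",
            "coordinates": [[
                [west, south], [east, south], [east, north],
                [west, north], [west, south],
            ]],
        },
        "properties": {
            # Assumption: reuse the publication date as the Item datetime.
            "datetime": croissant["datePublished"] + "T00:00:00Z",
            "description": croissant.get("description", ""),
        },
        "links": [],
        "assets": {},
    }


# Placeholder input; not the real dataset metadata.
sample = {"name": "Demo Dataset", "datePublished": "2025-01-01"}
item = croissant_to_stac_item(sample, [0.0, 0.0, 1.0, 1.0])
print(json.dumps(item, indent=2))
```

A real conversion would also need to carry over file references as STAC assets and decide between Item and Collection level metadata, which this sketch leaves empty.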

Am I blocked on anything?

  • No

References:

Best Regards,
Harsh Shinde

Hi @harshinde

Interesting stuff you have here. I hope for the best.

Regards
Vicky

Hi @cvvergara

Thanks a lot! I’ll do my best! :smiley:

Best,
Harsh