Introduction (Sionigdha Sadhukhan) — GSoC 2026 Proposal: GDAL Python Stub Hardening, Runtime Validation & Type Coverage

Sionigdha · March 9, 2026, 8:36pm

Who I Am

My name is Sionigdha Sadhukhan, a third-year B.Tech Computer Science & Engineering student at the Institute of Engineering & Management (IEM), Salt Lake, Kolkata, India. My technical interests lie at the intersection of open-source infrastructure, static analysis, Python tooling, and developer experience engineering.

I came to GDAL through geospatial data work and quickly became interested in understanding how the C++ API surfaces into Python through SWIG-generated bindings — specifically how the stub generation pipeline works, how docstrings drive it, and where it could be meaningfully strengthened. Over the past few months I have been actively contributing to GDAL and studying the codebase in preparation for a GSoC 2026 proposal.

My Contributions to GDAL

Merged Pull Requests

#13904 — gdal plug: add --fid option into GDALVectorInfo (Merged into GDAL 3.13.0, reviewed with extensive guidance from Even Rouault)

This PR plugs the --fid option of gdal vector info into the existing -fid option handled by GDALVectorInfo(), without re-implementing logic at the algorithm layer. It fixes issue #13763. The PR went through significant back-and-forth — 17 review comments, 2 commits, 7 files changed (+219/-164 lines) — with Even Rouault providing substantial guidance and changes during review. This was a valuable learning experience in understanding how algorithm-layer options flow through to the GDALVectorInfo() C++ API.

#13812 — tests: mark additional WMS/WMTS tests as network-dependent (Merged February 4)

Improved test infrastructure reliability by correctly marking network-dependent tests with pytest.mark.network, preventing false CI failures in offline environments. This required understanding how GDAL’s test suite is structured and how pytest markers are applied across driver-specific test files.

#13784 — DOC: add table summarizing raster blend operators (Merged January 28)

Added a structured reference table for raster blend operators in the documentation. This required reading the actual blend operator implementation to verify that each entry was correctly described, so it was not a purely mechanical documentation change.

#13725 — DOC: disable dummy Python Module Index in PDF output (Merged January 19, backported to release/3.12)

Fixed a Sphinx documentation build issue where a dummy Python Module Index was incorrectly generated in PDF output. This was backported to the stable release branch, meaning it met the bar for release-quality fixes.

What I Have Studied in the Codebase

To build this proposal I went significantly beyond my own PRs and specifically studied:

The docstub pipeline — _analysis.py, _docstrings.py, _stubs.py. I understand how the Lark grammar parses NumPy-style docstrings, how TypeMatcher resolves type names against known built-ins, typing, collections.abc, and module-level __all__, and where resolution falls through for GDAL-specific domain classes.

SWIG interface files — I read .i files to understand the gap between what SWIG generates at runtime and what docstrings describe. These two can drift independently with no automated detection currently in place.

The _typeshed.Incomplete fallback — I traced the code paths where unresolvable types silently fall back to Incomplete. This happens for unknown class names, missing annotations, and Lark parse failures. The fallback allows stub generation to succeed but makes typing gaps invisible.

PR #13198 — I studied this PR specifically because it was the most recent systematic improvement to docstring formatting for stub generation. Understanding what it fixed helped me identify what remains unaddressed.

GSoC 2026 Proposal Direction

I am working on a proposal around strengthening the GDAL Python stub generation pipeline. I want to be upfront that I am actively seeking mentor feedback on scope and priorities — what follows is the direction I am exploring, not a finalized plan.

The Problem Area

The GDAL Python stub generation pipeline is entirely docstring-driven. The docstub tool does not inspect SWIG typemaps, C++ signatures, or Python runtime callables — it relies entirely on docstring content and formatting.

This means stubs can drift from runtime behavior without any automated detection, and typing gaps accumulate silently rather than being surfaced. A few concrete examples of this:

- There is no automated check that a generated .pyi signature matches the actual inspect.signature() of the corresponding runtime callable in osgeo.gdal, osgeo.ogr, or osgeo.osr.
- GDAL domain types like Dataset, Layer, Geometry, Feature, and SpatialReference frequently fall through to _typeshed.Incomplete.
- Nullable returns described in docstrings as “returns None on failure” do not consistently produce Type | None unions in generated stubs.
- There is no CI step that catches when stubs become inconsistent with the actual runtime API.

Areas I Am Exploring

Runtime–Stub Consistency Validator

A tool that imports osgeo.gdal, osgeo.ogr, and osgeo.osr, walks all public callables using inspect, extracts their runtime signatures, parses the corresponding .pyi stubs using libcst, and compares parameter names, parameter counts, optional/default values, and the presence of a return annotation. It would produce a structured mismatch report such as:

MISMATCH: gdal.OpenEx
  Runtime:  OpenEx(filename, nOpenFlags=0, allowed_drivers=None, open_options=None, sibling_files=None)
  Stub:     OpenEx(pszFilename: str, nOpenFlags: int = ...) -> Dataset | None
  Issues:   parameter name mismatch (filename vs pszFilename), 3 parameters missing from stub
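A minimal sketch of the comparison step might look like the following. It is an illustration under stated assumptions, not the proposed tool: it uses the standard-library ast module in place of libcst to stay self-contained, it replaces the real osgeo.gdal.OpenEx with a stand-in function so it runs without GDAL installed, and the helper names stub_params and compare are hypothetical. It also ignores *args/**kwargs, which a real validator would need to handle if a wrapper exposes only a generic signature.

```python
import ast
import inspect


def stub_params(stub_source: str, func_name: str) -> list[str]:
    """Return the parameter names of a top-level function in .pyi source."""
    tree = ast.parse(stub_source)
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            args = node.args
            return [a.arg for a in args.posonlyargs + args.args + args.kwonlyargs]
    raise KeyError(func_name)


def compare(runtime_func, stub_source: str) -> list[str]:
    """List differences between a runtime callable and its stub signature."""
    runtime = list(inspect.signature(runtime_func).parameters)
    stub = stub_params(stub_source, runtime_func.__name__)
    issues = []
    # Compare names position by position over the shared prefix.
    for position, (r, s) in enumerate(zip(runtime, stub)):
        if r != s:
            issues.append(f"parameter {position} name mismatch ({r} vs {s})")
    # Then report any difference in parameter count.
    if len(runtime) > len(stub):
        issues.append(f"{len(runtime) - len(stub)} parameters missing from stub")
    elif len(stub) > len(runtime):
        issues.append(f"{len(stub) - len(runtime)} extra parameters in stub")
    return issues


# Stand-in for the runtime callable (assumption: mirrors the report above).
def OpenEx(filename, nOpenFlags=0, allowed_drivers=None,
           open_options=None, sibling_files=None):
    return None


STUB = "def OpenEx(pszFilename: str, nOpenFlags: int = ...) -> Dataset | None: ...\n"

issues = compare(OpenEx, STUB)
print(issues)
```

Comparing by position first and by count second keeps a renamed parameter from also being reported as a missing one.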

Strict Mode for _typeshed.Incomplete Tracking

Rather than silently falling back, a mode that surfaces unresolved types with structured metrics:

Stub generation summary:
  Total functions analyzed : 1,243
  Fully typed              :   891
  Partially typed          :   310
  _typeshed.Incomplete     :   258
  Parse failures           :    32

This makes typing debt visible and trackable over time.
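One way such metrics could be gathered is by scanning the generated stub text itself. The sketch below is only an illustration: it works on a small inline stub rather than real GDAL output, uses the standard-library ast module, and the category names and the helpers mentions_incomplete, classify, and summarize are hypothetical. It looks only at plain functions and would need extra handling for methods, whose self parameter carries no annotation.

```python
import ast
from collections import Counter


def mentions_incomplete(annotation: ast.expr) -> bool:
    """True if the annotation refers to Incomplete anywhere inside it."""
    for node in ast.walk(annotation):
        if isinstance(node, ast.Name) and node.id == "Incomplete":
            return True
        if isinstance(node, ast.Attribute) and node.attr == "Incomplete":
            return True
    return False


def classify(func: ast.FunctionDef) -> str:
    """Classify a stub function by how many of its annotations are resolved."""
    args = func.args
    params = args.posonlyargs + args.args + args.kwonlyargs
    annotations = [p.annotation for p in params] + [func.returns]
    resolved = sum(
        1 for a in annotations if a is not None and not mentions_incomplete(a)
    )
    if resolved == len(annotations):
        return "fully typed"
    if resolved == 0:
        return "untyped"
    return "partially typed"


def summarize(stub_source: str) -> Counter:
    """Count stub functions per category."""
    counts = Counter()
    for node in ast.walk(ast.parse(stub_source)):
        if isinstance(node, ast.FunctionDef):
            counts[classify(node)] += 1
    return counts


# Tiny illustrative stub, not real generated output.
STUB = """\
from _typeshed import Incomplete
def Open(path: str) -> Dataset | None: ...
def GetDriverByName(name: str) -> Incomplete: ...
def Warp(dst, src): ...
"""

counts = summarize(STUB)
for category, count in sorted(counts.items()):
    print(f"{category:<16}: {count}")
```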

Improved TypeMatcher for GDAL Domain Classes

Explicit resolution mappings for core GDAL classes — Dataset, Layer, Feature, Geometry, SpatialReference, Driver, Band — with correct import paths, directly reducing _typeshed.Incomplete fallbacks in generated stubs.
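As a rough illustration of what an explicit mapping could look like, the sketch below uses a plain dictionary and a hypothetical resolve helper. It does not reflect how docstub's TypeMatcher is actually configured, and mapping Driver to osgeo.gdal rather than osgeo.ogr is an assumption made only for the example.

```python
# Hypothetical lookup table from bare class names to qualified names.
GDAL_DOMAIN_TYPES = {
    "Dataset": "osgeo.gdal.Dataset",
    "Band": "osgeo.gdal.Band",
    "Driver": "osgeo.gdal.Driver",  # assumption: the gdal Driver is meant
    "Layer": "osgeo.ogr.Layer",
    "Feature": "osgeo.ogr.Feature",
    "Geometry": "osgeo.ogr.Geometry",
    "SpatialReference": "osgeo.osr.SpatialReference",
}


def resolve(name: str) -> tuple[str, str]:
    """Return (annotation, import line) for a type name found in a docstring.

    Unknown names still fall back to Incomplete, as they do today.
    """
    qualified = GDAL_DOMAIN_TYPES.get(name)
    if qualified is None:
        return "Incomplete", "from _typeshed import Incomplete"
    module, _, cls = qualified.rpartition(".")
    return cls, f"from {module} import {cls}"


print(resolve("Layer"))
print(resolve("Mystery"))
```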

Optional / None Inference Improvements

More consistent generation of Type | None unions from docstring phrases like “returns None on failure”, rather than depending on exact wording.
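A simple way to make this less wording-dependent would be to match a small family of phrases instead of one exact string. The sketch below is an assumption about how that could work, not a description of docstub: the pattern list and the infer_return helper are hypothetical, and a real implementation would need a broader, tested set of phrases.

```python
import re

# Phrases that suggest a nullable return (illustrative, not exhaustive).
_NONE_HINT = re.compile(
    r"\b(?:or\s+None"
    r"|None\s+(?:on|upon|in\s+case\s+of)\s+(?:failure|error)"
    r"|None\s+(?:if|when)\b)"
)


def infer_return(base_type: str, description: str) -> str:
    """Append ' | None' when the return description hints at a None result."""
    if "None" in base_type.split(" | "):
        return base_type  # already nullable, nothing to add
    if _NONE_HINT.search(description):
        return f"{base_type} | None"
    return base_type


print(infer_return("Dataset", "Returns None on failure."))
print(infer_return("Layer", "The layer, or None if it does not exist."))
print(infer_return("int", "Number of features."))
```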

CI Integration

A validation step that generates stubs, runs the validator, and fails if mismatch count exceeds a defined threshold — preventing long-term drift between C++ bindings → SWIG → Python runtime → docstrings → generated stubs.
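The threshold check at the end of that step could be as small as the sketch below. The report shape (a mapping from callable name to a list of issues) and the check_threshold helper are hypothetical; in a CI job the returned value would be passed to sys.exit so that a non-zero code fails the build.

```python
def check_threshold(report: dict[str, list[str]], max_mismatches: int) -> int:
    """Print mismatches and return an exit code (1 if over the threshold)."""
    mismatched = {name: issues for name, issues in report.items() if issues}
    for name, issues in sorted(mismatched.items()):
        print(f"MISMATCH: {name}: {'; '.join(issues)}")
    print(f"{len(mismatched)} mismatched callables (threshold {max_mismatches})")
    return 1 if len(mismatched) > max_mismatches else 0


# Illustrative report, reusing the OpenEx example from above.
report = {
    "gdal.OpenEx": [
        "parameter name mismatch (filename vs pszFilename)",
        "3 parameters missing from stub",
    ],
    "gdal.Open": [],
}

exit_code = check_threshold(report, max_mismatches=0)
```

Starting with a threshold equal to the current mismatch count and lowering it over time would let the check be adopted without first fixing every existing mismatch.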

What I Am Looking For

I have already posted to the gdal-dev mailing list about this direction and am actively looking for mentor guidance. Specifically I would appreciate:

- Feedback on whether these areas align with Python binding priorities for the project
- Guidance on finding a mentor familiar with the stub generation pipeline
- Any prior discussion, existing work, or planned changes in this area I should be aware of before going further

I am happy to share my full draft proposal, discuss any part of the technical approach in detail, or put together a small proof-of-concept if that would help establish fit.

Thank you for reading.

Sionigdha Sadhukhan
IEM Salt Lake, Kolkata | B.Tech CSE Year 3

GitHub: Sionigdha

Email: snigdha.lee75@gmail.com

robe · March 10, 2026, 1:28am

Unfortunately I don’t think GDAL is planning to participate in Google Summer of Code this year. The list of planned participating projects and their idea pages is:

Note that GDAL and GRASS are also members of NumFOCUS, and the ideas page for NumFOCUS is here: github.com/numfocus/gsoc, 2026/ideas-list.md (master branch)

# Ideas Pages

This is the home page of projects ideas of NumFOCUS for Google Summer of Code 2026.
Since NumFOCUS is an umbrella organization you will only find links to the ideas
page of each organization under the NumFOCUS umbrella at this page.

- [AiiDA](https://github.com/aiidateam/aiida-core/wiki/GSoC-2026-Projects)
- [ArviZ](https://github.com/arviz-devs/arviz/wiki/GSoC-2026-projects)
- [CVXPY](https://github.com/cvxpy/GSOC)
- [Data Retriever](https://github.com/weecology/retriever/wiki/GSoC-2026-Project-Ideas)
- [gammapy](https://github.com/gammapy/gammapy/wiki/GSoC-2026-Project)
- [GRASS](https://grasswiki.osgeo.org/wiki/GRASS_GSoC_Ideas_2026)
- [HoloViz](https://github.com/holoviz/holoviz/wiki/2026-GSoC-Project-List)
- [Gridap](https://github.com/gridap/GSoC/blob/main/2026/ideas-list.md)
- [JuMP](https://github.com/jump-dev/GSOC)
- [matplotlib](https://github.com/matplotlib/matplotlib/wiki/Matplotlib-GSoC-2026-Ideas)
- [pvlib](https://github.com/pvlib/pvlib-python/wiki/GSoC-2026-Projects)
- [PyMC](https://github.com/pymc-devs/pymc/wiki/GSoC-2026-projects)
- [PySAL](https://github.com/pysal/pysal/wiki/Google-Summer-of-Code-2026)
- [QuTiP](https://github.com/qutip/qutip/wiki//Google-Summer-of-Code-current)

This file has been truncated.

GRASS is doing GSoC under the NumFOCUS umbrella.

Sionigdha · March 10, 2026, 5:13am

Thank you for the clarification, robe. I have actually been contributing to both GDAL and QGIS over the past few months, so I will be shifting my GSoC 2026 focus to QGIS. I have prepared a proposal around enhancing GDAL inspection and metadata workflows in the QGIS Processing framework and will be posting it shortly! I would appreciate any guidance on finding a mentor for this direction.