Who I Am
I am Sionigdha Sadhukhan, a third-year B.Tech Computer Science & Engineering student at the Institute of Engineering & Management (IEM), Salt Lake, Kolkata, India. My technical interests lie at the intersection of open-source infrastructure, static analysis, Python tooling, and developer experience engineering.
I came to GDAL through geospatial data work and quickly became interested in understanding how the C++ API surfaces into Python through SWIG-generated bindings — specifically how the stub generation pipeline works, how docstrings drive it, and where it could be meaningfully strengthened. Over the past few months I have been actively contributing to GDAL and studying the codebase in preparation for a GSoC 2026 proposal.
My Contributions to GDAL
Merged Pull Requests
#13904 — gdal plug: add --fid option into GDALVectorInfo (Merged into GDAL 3.13.0, reviewed with extensive guidance from Even Rouault)
This PR plugs the --fid option of gdal vector info into the existing -fid option handled by GDALVectorInfo(), without re-implementing logic at the algorithm layer. It fixes issue #13763. The PR went through significant back-and-forth — 17 review comments, 2 commits, 7 files changed (+219/-164 lines) — with Even Rouault providing substantial guidance and changes during review. This was a valuable learning experience in understanding how algorithm-layer options flow through to the GDALVectorInfo() C++ API.
#13812 — tests: mark additional WMS/WMTS tests as network-dependent (Merged February 4)
Improved test infrastructure reliability by correctly marking network-dependent tests with pytest.mark.network, preventing false CI failures in offline environments. Required understanding how GDAL’s test suite is structured and how pytest markers are applied across driver-specific test files.
#13784 — DOC: add table summarizing raster blend operators (Merged January 28)
Added a structured reference table for raster blend operators in the documentation. Required reading the actual blend operator implementation to verify each entry was correctly described — not a mechanical documentation change.
#13725 — DOC: disable dummy Python Module Index in PDF output (Merged January 19, backported to release/3.12)
Fixed a Sphinx documentation build issue where a dummy Python Module Index was incorrectly generated in PDF output. This was backported to the stable release branch, meaning it met the bar for release-quality fixes.
What I Have Studied in the Codebase
To build this proposal I went significantly beyond my own PRs and specifically studied:
The docstub pipeline — _analysis.py, _docstrings.py, _stubs.py. I understand how the Lark grammar parses NumPy-style docstrings, how TypeMatcher resolves type names against known built-ins, typing, collections.abc, and module-level __all__, and where resolution falls through for GDAL-specific domain classes.
SWIG interface files — I read .i files to understand the gap between what SWIG generates at runtime and what docstrings describe. These two can drift independently with no automated detection currently in place.
The _typeshed.Incomplete fallback — I traced the code paths where unresolvable types silently fall back to Incomplete. This happens for unknown class names, missing annotations, and Lark parse failures. The fallback allows stub generation to succeed but makes typing gaps invisible.
PR #13198 — I studied this PR specifically because it was the most recent systematic improvement to docstring formatting for stub generation. Understanding what it fixed helped me identify what remains unaddressed.
GSoC 2026 Proposal Direction
I am working on a proposal around strengthening the GDAL Python stub generation pipeline. I want to be upfront that I am actively seeking mentor feedback on scope and priorities — what follows is the direction I am exploring, not a finalized plan.
The Problem Area
The GDAL Python stub generation pipeline is entirely docstring-driven. The docstub tool does not inspect SWIG typemaps, C++ signatures, or Python runtime callables; it works only from docstring content and formatting.
This means stubs can drift from runtime behavior without any automated detection, and typing gaps accumulate silently rather than being surfaced. A few concrete examples:
- There is no automated check that a generated .pyi signature matches the actual inspect.signature() of the corresponding runtime callable in osgeo.gdal, osgeo.ogr, or osgeo.osr
- GDAL domain types like Dataset, Layer, Geometry, Feature, and SpatialReference frequently fall through to _typeshed.Incomplete
- Nullable returns described in docstrings as “returns None on failure” do not consistently produce Type | None unions in generated stubs
- There is no CI step that catches when stubs become inconsistent with the actual runtime API
Areas I Am Exploring
Runtime–Stub Consistency Validator
A tool that imports osgeo.gdal, osgeo.ogr, and osgeo.osr, walks all public callables using inspect, extracts runtime signatures, parses the corresponding .pyi stubs using libcst, and compares parameter names, parameter counts, optional/default values, and return annotation presence. It would produce a structured mismatch report such as:
MISMATCH: gdal.OpenEx
Runtime: OpenEx(filename, nOpenFlags=0, allowed_drivers=None, open_options=None, sibling_files=None)
Stub: OpenEx(pszFilename: str, nOpenFlags: int = ...) -> Dataset | None
Issues: parameter name mismatch (filename vs pszFilename), 3 parameters missing from stub
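A minimal sketch of the comparison step, under stated assumptions: it uses the standard-library ast module in place of libcst to stay dependency-free, it only looks at top-level functions and named parameters, and the helper names (stub_params, runtime_params, compare) are hypothetical. The OpenEx function here is a local stand-in mirroring the report above, not the real osgeo.gdal.OpenEx. A real tool would also need a fallback for callables where inspect.signature() raises ValueError or where a wrapper may expose only generic *args/**kwargs.

```python
import ast
import inspect


def stub_params(stub_source: str) -> dict:
    """Map each top-level function in a .pyi source to its named parameters."""
    result = {}
    for node in ast.parse(stub_source).body:
        if isinstance(node, ast.FunctionDef):
            a = node.args
            result[node.name] = [p.arg for p in a.posonlyargs + a.args + a.kwonlyargs]
    return result


def runtime_params(func) -> list:
    """Named parameters reported by inspect (ignores *args / **kwargs)."""
    params = inspect.signature(func).parameters.values()
    return [p.name for p in params
            if p.kind not in (p.VAR_POSITIONAL, p.VAR_KEYWORD)]


def compare(name: str, func, stubs: dict) -> list:
    """Return human-readable issues for one callable."""
    if name not in stubs:
        return [f"{name}: missing from stub"]
    runtime, stub = runtime_params(func), stubs[name]
    issues = []
    if len(runtime) != len(stub):
        issues.append(
            f"{name}: parameter count differs (runtime {len(runtime)}, stub {len(stub)})"
        )
    for r, s in zip(runtime, stub):
        if r != s:
            issues.append(f"{name}: parameter name mismatch ({r} vs {s})")
    return issues


# Stand-in for the runtime callable; a real run would inspect osgeo.gdal.
def OpenEx(filename, nOpenFlags=0, allowed_drivers=None,
           open_options=None, sibling_files=None):
    ...


STUB = "def OpenEx(pszFilename: str, nOpenFlags: int = ...) -> Dataset | None: ...\n"

issues = compare("OpenEx", OpenEx, stub_params(STUB))
for line in issues:
    print(line)
```

Parsing the stub rather than importing it keeps the check independent of whether the names used in annotations can be resolved at runtime.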
Strict Mode for _typeshed.Incomplete Tracking
Rather than silently falling back, a mode that surfaces unresolved types with structured metrics:
Stub generation summary:
Total functions analyzed : 1,243
Fully typed : 891
Partially typed : 310
_typeshed.Incomplete : 258
Parse failures : 32
This makes typing debt visible and trackable over time.
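One way such metrics could be collected is by scanning generated stubs after the fact. The sketch below is an illustration only: it does not hook into docstub itself, the category names are assumptions, and it treats both a missing annotation and any annotation mentioning Incomplete as a gap. The sample stub is invented for the example.

```python
import ast
from collections import Counter


def is_gap(annotation) -> bool:
    """True if an annotation is missing or mentions the name Incomplete."""
    if annotation is None:
        return True
    return any(
        (isinstance(n, ast.Name) and n.id == "Incomplete")
        or (isinstance(n, ast.Attribute) and n.attr == "Incomplete")
        for n in ast.walk(annotation)
    )


def summarize(stub_source: str) -> Counter:
    """Count fully and partially typed functions in a .pyi source."""
    counts = Counter()
    for node in ast.walk(ast.parse(stub_source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        a = node.args
        params = [*a.posonlyargs, *a.args, *a.kwonlyargs]
        params += [p for p in (a.vararg, a.kwarg) if p is not None]
        # self / cls are conventionally left unannotated, so skip them.
        params = [p for p in params if p.arg not in ("self", "cls")]
        gaps = sum(is_gap(x) for x in [p.annotation for p in params] + [node.returns])
        counts["total"] += 1
        counts["fully_typed" if gaps == 0 else "partially_typed"] += 1
        counts["incomplete_annotations"] += gaps
    return counts


SAMPLE = """
from _typeshed import Incomplete
def Open(path: str) -> Dataset | None: ...
def GetDriverByName(name: str) -> Incomplete: ...
def Warp(dest, src, **kwargs): ...
"""

summary = summarize(SAMPLE)
for key in ("total", "fully_typed", "partially_typed", "incomplete_annotations"):
    print(f"{key:24}: {summary[key]}")
```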
Improved TypeMatcher for GDAL Domain Classes
Explicit resolution mappings for core GDAL classes — Dataset, Layer, Feature, Geometry, SpatialReference, Driver, Band — with correct import paths, directly reducing _typeshed.Incomplete fallbacks in generated stubs.
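As a rough illustration, the mapping could be as simple as a table from bare class names to qualified names. The table name GDAL_DOMAIN_TYPES and the resolve() helper are hypothetical, and which module each name should resolve to is an assumption to be confirmed with maintainers; how such a table would actually be wired into TypeMatcher is not shown here.

```python
# Hypothetical lookup table: bare name in a docstring -> qualified class.
GDAL_DOMAIN_TYPES = {
    "Dataset": "osgeo.gdal.Dataset",
    "Band": "osgeo.gdal.Band",
    "Driver": "osgeo.gdal.Driver",
    "Layer": "osgeo.ogr.Layer",
    "Feature": "osgeo.ogr.Feature",
    "Geometry": "osgeo.ogr.Geometry",
    "SpatialReference": "osgeo.osr.SpatialReference",
}


def resolve(name: str) -> tuple:
    """Return (annotation, import line), falling back to Incomplete."""
    qualified = GDAL_DOMAIN_TYPES.get(name)
    if qualified is None:
        return "Incomplete", "from _typeshed import Incomplete"
    module, _, cls = qualified.rpartition(".")
    return cls, f"from {module} import {cls}"


print(resolve("Layer"))
print(resolve("SomethingUnknown"))
```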
Optional / None Inference Improvements
More consistent generation of Type | None unions from docstring phrases like “returns None on failure”, rather than depending on exact wording.
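A minimal sketch of phrase-based inference, assuming a small set of wordings ("or None", "None on/if/when/otherwise") that is purely illustrative and would need to be derived from the actual docstrings. The function name infer_return_type is hypothetical and this is not how docstub currently works.

```python
import re

# Illustrative phrase patterns only; a real list would come from the docstrings.
_NONE_HINT = re.compile(
    r"\bor\s+None\b|\bNone\s+(?:on|if|when|otherwise)\b",
    re.IGNORECASE,
)


def infer_return_type(base_type: str, description: str) -> str:
    """Append ' | None' when the description suggests a nullable return."""
    if "None" in base_type.split(" | "):
        return base_type  # already nullable, nothing to add
    if _NONE_HINT.search(description):
        return f"{base_type} | None"
    return base_type


print(infer_return_type("Dataset", "Returns None on failure."))
print(infer_return_type("int", "The number of bands."))
```

Heuristics like this can produce false positives, so flagging inferred unions for review may be safer than applying them silently.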
CI Integration
A validation step that generates stubs, runs the validator, and fails if mismatch count exceeds a defined threshold — preventing long-term drift between C++ bindings → SWIG → Python runtime → docstrings → generated stubs.
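The gating logic itself could be very small. In this sketch the gate() function and its threshold parameter are hypothetical, and the mismatch list is assumed to come from a validator like the one described above; a CI job would pass the returned value to sys.exit().

```python
import sys


def gate(mismatches: list, threshold: int = 0) -> int:
    """Return a process exit code: 1 if mismatches exceed the threshold."""
    for line in mismatches:
        print(f"MISMATCH: {line}", file=sys.stderr)
    return 1 if len(mismatches) > threshold else 0


# Example: two mismatches against an allowed budget of one fails the step.
exit_code = gate(["gdal.OpenEx: parameter name mismatch",
                  "ogr.Open: missing from stub"], threshold=1)
print(exit_code)
```

Starting with a threshold equal to the current mismatch count and lowering it over time would let the check land without blocking unrelated work.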
What I Am Looking For
I have already posted to the gdal-dev mailing list about this direction and am actively looking for mentor guidance. Specifically I would appreciate:
- Feedback on whether these areas align with Python binding priorities for the project
- Guidance on finding a mentor familiar with the stub generation pipeline
- Any prior discussion, existing work, or planned changes in this area I should be aware of before going further
I am happy to share my full draft proposal, discuss any part of the technical approach in detail, or put together a small proof-of-concept if that would help establish fit.
Thank you for reading.
Sionigdha Sadhukhan
IEM Salt Lake, Kolkata | B.Tech CSE Year 3
GitHub: Sionigdha
Email: snigdha.lee75@gmail.com