Proposal for GSoC 2025: AI-Powered Natural Language Interface for GRASS GIS

Hello GRASS GIS Community,

My name is Sachintha Nadeeshan, and I’ve been contributing bug fixes to GRASS GIS, some of which have been merged. I am applying for GSoC 2025 with a proposal to develop an AI-powered Natural Language Interface for GRASS GIS. I wanted to share this idea with you and get feedback from the community to refine and improve it.

Project Overview:

The goal of this project is to develop a Generative AI-powered chatbot that enables users to interact with GRASS GIS using natural language queries. Users will be able to ask questions, request analysis, and get GRASS GIS commands or instructions in a conversational format.

The AI chatbot will:

  • Translate natural language queries like “How do I calculate the slope for a raster?” into GRASS GIS commands.
  • Suggest relevant tools and workflows based on user queries and dataset context.
  • Provide step-by-step guidance to help users complete geospatial analysis tasks.

Why This Is Useful:

  • Improved Accessibility: This feature will make GRASS GIS more accessible to users who are new to GIS or who prefer a conversational interface.
  • Efficiency for Advanced Users: It will also help experienced users by quickly providing tool suggestions, reducing the time spent searching for the correct syntax or workflow.
  • Broader User Base: By integrating natural language processing (NLP), we can reduce the learning curve for users and expand the potential user base of GRASS GIS.

Hi Sachintha,

Your proposal sounds interesting to me. I am currently working on a similar project, so, if your proposal is approved and you are selected, I volunteer to mentor.
Could you please elaborate on the architecture and the infrastructure requirements?
I imagine you plan to deploy an LLM on an OSGeo server. Do you envisage developing a semantic search using a knowledge graph? Please clarify these points and detail the technical requirements.
Thanks
Kind regards,
Margherita

Hi Margherita,

Thank you for your interest in my proposal, and for offering to volunteer as a mentor! I’m excited about the possibility of working on this project and I appreciate your input. Below, I’ll elaborate on the architecture and infrastructure requirements, and provide further clarification on the deployment, LLM usage, and the potential for semantic search with a knowledge graph.


1. Architecture Overview

The project will involve the integration of an AI-powered Natural Language Interface within GRASS GIS, where users can input natural language queries and receive corresponding GRASS GIS commands or suggestions.

Key Components:

  1. User Interface (CLI/GUI):
    • The user will interact with the system via a CLI initially (with a potential GUI version later). This will provide the platform for users to type natural language queries and receive answers or commands.
  2. NLP Model (LLM):
    • I will use a pre-trained NLP model, such as GPT-2 (a smaller model, chosen because of time and resource constraints), which will be fine-tuned using GRASS GIS documentation and common user queries. The model will process user inputs, translate them into GRASS GIS commands, and provide feedback.
    • The model will be deployed locally or on a cloud platform and will not necessarily need to be hosted on an OSGeo server unless there is infrastructure available to support it. Given the relatively small scope of the project (using a smaller model like GPT-2), hosting on a cloud platform (e.g., AWS or Google Cloud) or even on a local server would be a more scalable and cost-effective approach.
  3. Backend Integration with GRASS GIS:
    • The chatbot will be able to query GRASS GIS commands and generate them based on user input. This will be facilitated by integrating the model into GRASS GIS’ toolset. The model will provide suggestions and generate commands based on geospatial tasks such as calculating slope, reclassification, and hydrological modeling.
  4. Data Layer:
    • The model will be fine-tuned using GRASS GIS-specific data, including documentation, help manuals, and common queries. Over time, it can be expanded to support more complex geospatial queries and workflows.
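
To make the flow between these components more concrete, here is a minimal sketch of the query-to-command step. It is only an illustration under stated assumptions: the keyword lookup is a stand-in for the fine-tuned model, and `Suggestion`, `TEMPLATES`, `suggest`, and the map names are hypothetical. `r.slope.aspect` and `r.reclass` are existing GRASS GIS tools.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Suggestion:
    tool: str      # name of the GRASS GIS tool
    command: str   # command line the user can review and run


# Hypothetical lookup table standing in for the fine-tuned model.
# The tools are real; the map names and rules file are placeholders.
TEMPLATES = {
    "slope": Suggestion(
        "r.slope.aspect",
        "r.slope.aspect elevation={raster} slope={raster}_slope",
    ),
    "reclass": Suggestion(
        "r.reclass",
        "r.reclass input={raster} output={raster}_reclass rules=rules.txt",
    ),
}


def suggest(query: str, raster: str = "elevation") -> Optional[Suggestion]:
    """Return a command suggestion for a natural language query, or None."""
    text = query.lower()
    for keyword, template in TEMPLATES.items():
        if keyword in text:
            return Suggestion(template.tool, template.command.format(raster=raster))
    return None


if __name__ == "__main__":
    print(suggest("How do I calculate the slope for a raster?"))
```

The sketch returns a command string instead of executing anything, so the user can review the suggestion before running it in a GRASS session.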

2. Infrastructure Requirements

To deploy and operate the AI-powered chatbot, the following infrastructure is recommended:

  1. Hardware Requirements:
    • Server for Model Deployment: A cloud server or local server capable of running the fine-tuned NLP model. For smaller models like GPT-2, a typical server with 16GB RAM and a multi-core processor should be sufficient.
    • Storage: Depending on the size of the model, 5-10GB of storage may be required. Additional space will be needed for logs, user data, and documentation.
  2. Software Requirements:
    • NLP Framework: Tools like Hugging Face’s Transformers and PyTorch will be used for fine-tuning and deploying the NLP model.
    • GRASS GIS Setup: The server or local machine should have GRASS GIS installed, which is necessary for backend integration and command generation.
    • Python Environment: For the integration of the model and GRASS GIS commands, a Python environment will be required (Python 3.7+).
  3. Cloud-based Deployment (Optional):
    • Cloud Hosting: Platforms like Google Cloud, AWS, or Heroku will allow easy deployment of the NLP model and interface. These platforms offer scalable resources that can be adjusted based on usage.
    • API Gateway: If the interface is later built as a web-based GUI, a REST API or GraphQL API will be needed to connect the NLP backend with the frontend interface.
  4. Knowledge Base Storage:
    • The fine-tuned model will need access to a structured knowledge base (such as GRASS GIS documentation). This can be stored in a simple database (SQLite / PostgreSQL) or on the file system, depending on the size and structure of the data.
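
As a rough illustration of the SQLite option for the knowledge base, the sketch below stores one short summary per tool and does a simple substring search. The table layout, the `build_kb` and `search` helpers, and the summaries (paraphrased, not copied from the manual) are assumptions made for this example only.

```python
import sqlite3

# Paraphrased one-line summaries, used here only as sample data.
SAMPLE_DOCS = [
    ("r.slope.aspect", "Generates slope and aspect raster maps from an elevation raster."),
    ("r.reclass", "Reclassifies a raster map based on category values."),
    ("r.watershed", "Calculates hydrological parameters such as flow accumulation."),
]


def build_kb(rows):
    """Create an in-memory knowledge base with one row per tool."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (tool TEXT PRIMARY KEY, summary TEXT)")
    conn.executemany("INSERT INTO docs VALUES (?, ?)", rows)
    return conn


def search(conn, term):
    """Return tool names whose summary contains the term (LIKE match)."""
    cur = conn.execute(
        "SELECT tool FROM docs WHERE summary LIKE ? ORDER BY tool",
        (f"%{term}%",),
    )
    return [row[0] for row in cur]


if __name__ == "__main__":
    kb = build_kb(SAMPLE_DOCS)
    print(search(kb, "slope"))
```

A plain `LIKE` search is only keyword matching; it is shown here because it needs nothing beyond the Python standard library.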

3. Semantic Search using Knowledge Graph

While the initial version of the chatbot will focus on direct command generation and tool suggestions based on queries, I agree that implementing a semantic search using a knowledge graph could be a valuable extension. This would improve the system’s ability to understand contextual relationships between geospatial concepts, such as tools, datasets, and workflows.

How it Could Work:

  1. Knowledge Graph Construction:
    • We could create a knowledge graph where nodes represent GRASS GIS tools, geospatial datasets, and user queries, and edges represent the relationships between these entities (e.g., a “tool for calculating slope” is related to “raster data”).
    • This knowledge graph could initially be created manually by extracting relationships from GRASS GIS documentation and user queries.
  2. Search Functionality:
    • By incorporating a semantic search engine, the chatbot could not only match keywords in queries but also understand the underlying intent and context of the query.
    • For example, if the user asks, “How do I find areas with high elevation in this raster?”, the system could:
      • Identify that this is a raster analysis query.
      • Search the knowledge graph for relevant tools, such as raster reclassification or hillshade analysis, and suggest those tools to the user.

Technical Considerations:

  • Graph Database: A graph database like Neo4j or ArangoDB would be ideal for storing the knowledge graph.
  • Integration with NLP: The knowledge graph would be queried as a secondary layer after the NLP model processes the initial query. The model could then refine its response based on the semantic search results from the graph.
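
The node-and-edge idea can be sketched without a graph database. The example below is a minimal in-memory stand-in, not Neo4j or ArangoDB code; the `KnowledgeGraph` class, the relation names `performed_by` and `operates_on`, and the task labels are hypothetical choices for illustration.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Tiny in-memory triple store: (subject, relation) -> set of objects."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add(self, subject, relation, obj):
        """Add a directed, labeled edge subject --relation--> obj."""
        self._edges[(subject, relation)].add(obj)

    def objects(self, subject, relation):
        """All objects reachable from subject via relation, sorted."""
        return sorted(self._edges.get((subject, relation), ()))

    def subjects(self, relation, obj):
        """All subjects that point to obj via relation, sorted."""
        return sorted(
            s for (s, r), objs in self._edges.items() if r == relation and obj in objs
        )


kg = KnowledgeGraph()
kg.add("calculate slope", "performed_by", "r.slope.aspect")
kg.add("reclassify raster", "performed_by", "r.reclass")
kg.add("r.slope.aspect", "operates_on", "raster data")
kg.add("r.reclass", "operates_on", "raster data")

if __name__ == "__main__":
    print(kg.objects("calculate slope", "performed_by"))
    print(kg.subjects("operates_on", "raster data"))
```

In a real deployment the same two lookups (task to tool, data type to tools) would become graph queries, with the NLP model supplying the task label.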

However, this feature might exceed the 175-hour time frame and could be implemented as a future enhancement after the initial version of the tool is deployed.


4. Next Steps and Clarification

To clarify:

  • LLM Deployment: Deploying a large-scale LLM directly on an OSGeo server may not be feasible due to resource constraints, so cloud platforms (e.g., Google Cloud, AWS) will be used for hosting the fine-tuned model. The initial version, based on GPT-2 or a similar model, will not require massive computational resources.
  • Semantic Search: I envision this as a future enhancement, but it’s certainly feasible. I would recommend starting with basic command generation and then progressively expanding the system with contextual understanding and semantic search capabilities in future iterations.

Please let me know if this answers your questions or if you need further clarification on any of these points. I’m eager to hear your thoughts and look forward to working together on this project!

Kind regards,
Sachintha Nadeeshan Kodikara

Hey all,

This sounds great and I fully support this project. I did some experiments and built some prototypes a year or more ago, but was satisfied enough to call it “good”. So I would be more than happy to watch the project grow in brighter todays and maybe share an opinion or two during the project.

Dear Sachintha,

Thank you for the clarification. I agree that deploying the LLM on a cloud platform would be beneficial, but it would imply a maintenance cost. I’m not sure an open source project like GRASS could allocate resources to that. This is why I was thinking of an OSGeo server, but the feasibility and availability are up to OSGeo to decide. Any other ideas for a long-term solution?

Thanks
Margherita

Sachintha, you mentioned bug fixes; could you please elaborate on your previous experience with GRASS?

Madi, Ondrej, thanks for stepping up to be mentors! All participants need to be tested in some way, so please consider what would be a suitable test of skills for Sachintha. The traditional approach (write a test, fix a bug) may not be that applicable here.

Hi Anna,
yes, I agree that a bug fix might not be suitable to test the candidate’s capabilities. But at any rate Sachintha said that he contributed to GRASS, so I’m also looking forward to seeing his contribution.
The tasks I could think of (e.g. deploying a small LLM with Ollama) would be time-consuming, and I’m not sure it would be appropriate to demand this. I would be happy with a preliminary call in which we can exchange opinions on the deployment, if that’s ok, of course together with other members of the community.
Another point that I would like to check in advance is the feasibility in terms of infrastructure. This is something GRASS (PSC? developers?) / OSGeo should explore, ideally before approving the project.
Cheers,
Margherita

I agree that the infrastructure, together with ongoing maintenance, is an issue. I would like to see some cost estimates for cloud hosting, to judge whether that’s something we as a project can pay for. OSGeo hosting may not be feasible.

Hi Ondrej,

Thank you so much for your support and encouragement! It’s great to know that you’ve worked on similar prototypes before and that you’re willing to share your insights during the project. Your experience will be invaluable, and I’m excited to have the opportunity to learn from your expertise as the project progresses.

I look forward to collaborating with you and the rest of the community as we work to make this AI-powered tool a success for GRASS GIS.

Thanks again, and I appreciate your willingness to be part of this journey!

Dear Margherita,

Thank you for your thoughtful response and for highlighting the concerns regarding cloud hosting costs and ongoing maintenance. I completely understand the budget limitations of an open-source project like GRASS GIS, and I agree that relying on cloud services could be difficult to sustain in the long term.

I believe the solution may lie in exploring a hybrid approach for the infrastructure. Here are a couple of ideas that could help address the concerns:

  1. OSGeo Server Hosting: Hosting the model on an OSGeo server could be a good option to minimize costs. I wanted to ask if OSGeo has any resources available to host the LLM or if the GRASS GIS community could provide any insights into this possibility. Understanding whether OSGeo hosting is feasible will be crucial in deciding the infrastructure for the chatbot.

  2. Small Model Deployment: Instead of deploying large models like GPT-3, we could consider using smaller models (e.g., DistilGPT-2) which require less computational power and storage. This would help significantly reduce costs while still providing a functional AI-powered tool for the GRASS GIS users.

  3. Open-Source Alternatives for Semantic Search: Another possibility could be to explore lighter open-source NLP options, such as the spaCy library or BERT-based models, for interpreting geospatial queries. These are smaller in size and would reduce the infrastructure burden, while still offering the essential functionality for the chatbot.
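
To put model sizes in perspective, a back-of-the-envelope estimate of the memory needed just to hold the weights is the number of parameters times the bytes per parameter. The parameter counts below (about 82 million for DistilGPT-2, 124 million for GPT-2 small, 175 billion for GPT-3) are the commonly published figures and are used here as assumptions; real memory use is higher once activations and runtime overhead are included.

```python
# Approximate parameter counts (assumed from commonly published figures).
PARAMS = {
    "DistilGPT-2": 82e6,
    "GPT-2 (small)": 124e6,
    "GPT-3": 175e9,
}


def weights_gb(n_params: float, bytes_per_param: int = 4) -> float:
    """Memory for the weights alone, in GB (10^9 bytes), at the given precision."""
    return n_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for name, n in PARAMS.items():
        print(f"{name}: ~{weights_gb(n):.2f} GB in 32-bit floats")
```

Under these assumptions the two small models need well under 1 GB for their weights, while a GPT-3-scale model needs hundreds of gigabytes, which is why the smaller models are the realistic candidates here.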

I would love to hear your thoughts on the OSGeo hosting option and whether it’s something that can be explored further. I’m also open to discussing the best way to balance cost-efficiency and model performance as we move forward.

Thanks again for your input — I look forward to your feedback.

Best regards,
Sachintha Nadeeshan Kodikara

Hi Annakrat,

Thank you for asking! It’s not a big deal — I haven’t done much with GRASS GIS yet, but I’ve contributed to a couple of small improvements on the GRASS GIS website. Specifically, I worked on:

  • PR #486: This pull request involved updating the content and enhancing the clarity of some sections of the GRASS GIS website, making it more user-friendly.
  • PR #490: This PR focused on improving some styling and fixing minor issues to ensure a cleaner and more consistent design across the website.

I’m looking forward to diving deeper into the project with this AI-powered Natural Language Interface and contributing more to the community through this new initiative.

I hope this clarifies my previous contributions, and I’d be happy to provide more details if needed!

Best regards,
Sachintha Nadeeshan Kodikara
LinkedIn : https://www.linkedin.com/in/sachintha-nadeeshan-kodikara

Thank you all for the continued discussion and your valuable input regarding the infrastructure. After researching the potential costs for cloud hosting, I’ve compiled some rough estimates for running smaller models such as DistilGPT-2 and spaCy on major cloud platforms like AWS, Google Cloud, and Azure.

Cloud Hosting Cost Estimates:

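| Provider     | Light Model (Small Instance) | Large Model (GPU) | Storage/Other Costs |
|--------------|------------------------------|-------------------|---------------------|
| AWS          | ~$30/month                   | ~$2,200/month     | ~$0.023/GB/month    |
| Google Cloud | ~$30/month                   | ~$1,050/month     | ~$0.02/GB/month     |
| Azure        | ~$34/month                   | ~$650/month       | ~$0.02/GB/month     |
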
  • For small-scale deployments using DistilGPT-2 or spaCy, the cost would be around $30-50 per month for a small instance with moderate usage.
  • For larger models (e.g., GPT-3 or similar), costs can scale significantly, particularly if we need to use GPU instances for deployment, ranging from $650 to $2,200 per month depending on the provider and instance specifications.
  • Storage and data transfer costs are typically low, but they can add up with frequent interactions or large datasets.
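
For budgeting, the monthly figures can be turned into yearly totals. The sketch below uses only the rough estimates quoted in this post plus an assumed 10 GB of storage (the upper end of the storage estimate given earlier); `MONTHLY` and `annual_cost` are hypothetical names, and data transfer is not included.

```python
# (small instance $/month, GPU instance $/month, storage $/GB/month)
MONTHLY = {
    "AWS": (30, 2200, 0.023),
    "Google Cloud": (30, 1050, 0.02),
    "Azure": (34, 650, 0.02),
}


def annual_cost(provider: str, gpu: bool = False, storage_gb: float = 10) -> float:
    """Yearly cost in USD: 12 x (instance price + storage price x size)."""
    small, large, per_gb = MONTHLY[provider]
    monthly = (large if gpu else small) + per_gb * storage_gb
    return round(12 * monthly, 2)


if __name__ == "__main__":
    for provider in MONTHLY:
        print(provider, annual_cost(provider), annual_cost(provider, gpu=True))
```

With these inputs a small instance comes to a few hundred dollars per year, while the GPU options run from several thousand to tens of thousands of dollars per year.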


  • If OSGeo hosting is feasible, it would be ideal to explore that option further. I would love to hear your thoughts on whether OSGeo can provide any resources for hosting the model or if the GRASS GIS community can contribute.
  • If OSGeo hosting isn’t an option, I recommend proceeding with a smaller model deployment to reduce costs. We could explore DistilGPT-2 or other smaller models, which are more cost-effective for initial deployment.
  • If cloud hosting remains the best option, we will need to ensure we can sustain the costs over time.

Let me know if you have any further questions or thoughts on how we should proceed with the infrastructure. I’m looking forward to continuing this discussion and finding the best solution for the long-term success of the project.

Indeed, that seems unlikely, given that the OSGeo budget is currently under severe constraints, particularly the SAC budget: [OSGeo-Discuss] OSGeo finances

Given the current financial constraints, may I know if it would still be feasible to proceed with this project, or if it would be better to pause development due to resource limitations? :pleading_face:

I have been thinking about this. I would explore the feasibility of fine-tuning an SLM on the GRASS documentation and deploying it as a package. Small Language Models are designed to be lightweight, making them more suitable for deployment on edge devices than larger models.

For inspiration, see “A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness” (arXiv:2411.03350) and the GitHub repository FairyFali/SLMs-Survey (Survey of Small Language Models from Penn State, ...).

Thank you so much for the valuable suggestion! I truly appreciate your insight into using SLMs to avoid cloud costs. It’s a fantastic idea, and I’m grateful for your guidance. I believe fine-tuning an SLM on the GRASS GIS documentation will indeed provide an efficient and cost-effective solution. I’m excited to explore this approach further and appreciate your support in refining the direction of the project.

Hi Sachintha,

since you need to pass a test of skills, here is an idea of what you could work on:

I understand it’s not directly relevant to your project, but it deals with the GRASS documentation, which you would need to get familiar with anyway. We are finishing the migration of the old documentation (generated from HTML files) to new documentation based on Markdown files, but we still need the HTML files for the man pages. So this PR tries to generate man pages from Markdown (so that we can remove the HTML files altogether). I would like you to test the current script, report what is missing and what does not work properly, and ideally fix it. For example, tables have not been ported yet. You are welcome to use AI as long as the result makes sense. You can create a PR on top of my branch.
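
To give a feel for the table part of the task, here is a minimal sketch of turning a Markdown pipe table into roff `tbl` input (the `.TS` / `.TE` block that man pages use for tables). It is not taken from the PR's script: `md_table_to_tbl` is a hypothetical helper, all columns are left-aligned, and escaped pipes or inline formatting inside cells are not handled.

```python
def md_table_to_tbl(md_lines):
    """Convert the lines of a Markdown pipe table to roff tbl source."""
    rows = []
    for line in md_lines:
        # Drop the outer pipes, then split the row into trimmed cells.
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip the header separator row, e.g. |---|:---:|
        if all(c and set(c) <= set("-:") for c in cells):
            continue
        rows.append(cells)
    if not rows:
        return ""
    # One left-aligned column spec per column, terminated by a period.
    column_format = " ".join(["l"] * len(rows[0])) + "."
    # tbl separates the cells of a data row with tab characters by default.
    body = ["\t".join(row) for row in rows]
    return "\n".join([".TS", "allbox;", column_format, *body, ".TE"])


if __name__ == "__main__":
    table = [
        "| Option | Meaning |",
        "|--------|---------|",
        "| -a     | All     |",
    ]
    print(md_table_to_tbl(table))
```

A real converter would also need to escape roff control characters at the start of a line and map Markdown alignment markers to `l`, `c`, or `r` column specs.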

I am sure you will have questions. We have contributing guidelines, so please make sure to read the relevant parts first.

Best,
Anna

Sure, ma’am! I’ll work on it. However, as I’m currently in the middle of my university final exams, it might take a bit longer than expected. I sincerely apologize for the delay, but I’ll get to it as soon as possible.

Thank you for your understanding.