[GRASS-dev] [SoC] Parallelization of Raster and Vector libraries

Hi,

My name is Jordan. I’m currently a 3rd year (junior) computer science student at South Dakota School of Mines and Technology (SDSMT). I don’t have much knowledge about GIS, but Wolf did give me a few links and book (An Introduction to Geographic Information Systems by Heywood). I got the book from my school’s library, but haven’t had an opportunity to read any of it. But I’m willing to learn more. I’m also trying to find an interest area, since I’m not sure where I want to go with my career or if I want to continue my education.

Basic premise:
Begin to parallelize the suggested modules (from the OpenMP page) using openMP and profile/parallelize others if time permits.

Why I want this specific project? Practice and experience. I’m currently taking an Intro to Parallel Computing course, and I want to practice what I’ve learned. It’s interesting to me so I want to get better at it.

Initial thoughts and ideas on doing this:
I think my first step would be to understand the code by reading the documentation, and the code. To be understand the code I could probably generate some UML diagrams, and how functions are connected (I saw a tool a couple days ago, basically what I wanted). Which would help determine entry points and help me determine (likely) spots to parallelize since the code itself is not thread-safe, and breaking at or below entry point is likely safer (less dependency, fewer threads created for micro tasks, etc.). I would then profile the code. I’ve never used a profiling tool before, but good time to learn and useful in the future. Come up with a sharing strategy (do I need locks, can I avoid them (also ensuring const-correctness would likely help), is duplicate data possible/needed, etc.). Then try to parallelize the code. Afterwards make sure the results are the same as before parallelization (likely using the test suite) or other sort of testing.

Just kind of my thought process about how I would try to go about parallelizing a module. I haven’t had a chance to look over any code or read much documentation yet. Hopefully, I’ll get a chance to look at in-depth sometime this week, but it’s a fairly busy week as-is.

~Jordan

IRC Nick: RevisionD

Jordan Neumeyer wrote:

Just kind of my thought process about how I would try to go about
parallelizing a module.

The main issue with parallelising raster input is that the library
keeps a copy of the current row's data, so that consecutive reads of
the same row (as happen when upsampling) don't re-read the data.

For concurrent access to a single map, you would need to either keep
one row per thread, or abandon caching. Also, you would need to use
pread() rather than fseek()+read().

It's more straightfoward to read multiple maps concurrently. In 7.0,
this case should be thread-safe.

Alternatively, you could have one thread for reading, one for writing,
and multiple worker threads for the actual processing. However, unless
the processing is complex, I/O will be the bottleneck.

--
Glynn Clements <glynn@gclements.plus.com>

Opps, forgot to send this to mailing list as well…

On Tue, Mar 30, 2010 at 8:32 PM, Jordan Neumeyer <jordan.neumeyer@mines.sdsmt.edu> wrote:

Hi Glynn,

On Tue, Mar 30, 2010 at 8:29 AM, Glynn Clements <glynn@gclements.plus.com> wrote:

Jordan Neumeyer wrote:

Just kind of my thought process about how I would try to go about
parallelizing a module.

The main issue with parallelising raster input is that the library
keeps a copy of the current row’s data, so that consecutive reads of
the same row (as happen when upsampling) don’t re-read the data.

For concurrent access to a single map, you would need to either keep
one row per thread, or abandon caching. Also, you would need to use
pread() rather than fseek()+read().

It sounds like you’re talking about parallelism in I/O from a file or database. Neither of which is my intent or goal for this project. I will parallelize things after they have already been read into memory, and tasks are processor intensive. I wouldn’t want parallelize any I/O, but if I were to optimize I/O. I would make all operations I/O asynchronous, which is can mimic parallelism in a sense. Queuing up the chunks of data and then processing them as resources become available.

It’s more straightfoward to read multiple maps concurrently. In 7.0,
this case should be thread-safe.

Alternatively, you could have one thread for reading, one for writing,
and multiple worker threads for the actual processing. However, unless
the processing is complex, I/O will be the bottleneck.

I/O is generally a bottleneck anyway. Something always tends to be waiting on another.

Glynn Clements <glynn@gclements.plus.com>

Jordan Neumeyer wrote:

> > Just kind of my thought process about how I would try to go about
> > parallelizing a module.
>
> The main issue with parallelising raster input is that the library
> keeps a copy of the current row's data, so that consecutive reads of
> the same row (as happen when upsampling) don't re-read the data.
>
> For concurrent access to a single map, you would need to either keep
> one row per thread, or abandon caching. Also, you would need to use
> pread() rather than fseek()+read().

It sounds like you're talking about parallelism in I/O from a file or
database. Neither of which is my intent or goal for this project. I will
parallelize things after they have already been read into memory, and tasks
are processor intensive. I wouldn't want parallelize any I/O, but if I were
to optimize I/O. I would make all operations I/O asynchronous, which is can
mimic parallelism in a sense. Queuing up the chunks of data and then
processing them as resources become available.

Most GRASS raster modules process data row-by-row, rather than reading
entire maps into memory. Reading maps into memory is frowned upon, as
GRASS is regularly used with maps which are too large to fit into
memory. Where the algorithm cannot operate row-by-row, use of a tile
cache is the next best alternative; see e.g. r.proj.seg (renamed to
r.proj in 7.0).

Holding an entire map in memory is only considered acceptable if the
algorithm is inherently so slow that processing a gigabyte-sized map
simply wouldn't be feasible, or the access pattern is such that even a
tile-cache approach isn't feasible.

In general, GRASS should be able to process multi-gigabyte maps even
on 32-bit systems, and work on multi-user systems where a process
cannot assume that it can use a significant proportion of the system's
total physical memory.

> It's more straightfoward to read multiple maps concurrently. In 7.0,
> this case should be thread-safe.
>
> Alternatively, you could have one thread for reading, one for writing,
> and multiple worker threads for the actual processing. However, unless
> the processing is complex, I/O will be the bottleneck.
>

I/O is generally a bottleneck anyway. Something always tends to be waiting
on another.

When I refer to I/O, I'm referring not just to read() and write(), but
also the (de)compression, conversion and resampling, i.e. everything
performed by the get/put-row functions. For many GRASS modules, this
takes more time than the actual processing.

Finally, the thread title refers to libraries. Very little processing
occurs in the libraries; most of it is in the individual modules. So
there isn't much scope for "parallelising" the libraries. The main
issue for library functions is to ensure that they are thread-safe.
Most of the necessary work for the raster library has been done in
7.0.

--
Glynn Clements <glynn@gclements.plus.com>