r.texture currently uses schedule(static, 1) ordered in its OpenMP loop (execute.c:107-158). The ordered directive serializes Rast_put_row() calls, which limits scaling — benchmarks show efficiency dropping to ~65% at 8 threads for small windows (3×3).
I’m considering replacing this with chunked processing — buffer output for a block of rows, compute them with schedule(dynamic), then write sequentially after each chunk. This follows the pattern r.sun uses.
Trade-off: memory use goes from a few MB (current) to ~100 MB (chunk buffers), but this removes the synchronization bottleneck.
Is this worth a PR? Any concerns with the approach?
I am not an expert on this, but generally we try to keep the tools usable with low memory. That said, if you can let the user decide how much memory they want to use (more memory, faster processing; less memory, slower processing), that would be a good approach. I don’t think 100 MB is a problem; r.texture is doing some heavy processing.
Hi Anna, thank you for the feedback! That makes sense. I will implement it with a parameter to let the user define the memory usage (defaulting to a low value, but allowing higher for better performance). This way, we keep the tool accessible for low-memory systems while unlocking the scaling potential for others. I’ll start working on a draft PR with this approach.
Regarding your last review on PR#7044, I’ve addressed the memory budget logic and added the coarsening ratio benchmarks you requested. Whenever you have a moment to check the latest push, I’d appreciate it!
Hi, I’m the one who parallelized r.texture. Below is my opinion.
The current approach already processes output chunk by chunk, using fbuf_threads to store the output each thread computes for each row.
I feel that the efficiency of r.texture is already pretty high compared to several parallelized raster tools.
A serialized Rast_put_row() is necessary; we cannot avoid that. We can choose to do it inside the loop or outside the loop. Doing it outside the loop means we may need to reserve memory for the whole output raster. This is just my thought; you may have a better way.
schedule(dynamic) may not improve efficiency. If the work per iteration is similar for each thread, schedule(static) is the better choice. The parallel loop in r.sun, however, is more complex, and its many if statements can make threads behave differently, so schedule(dynamic) is a better choice there.
Overall, I cannot see which chunked processing you want to introduce here, because the current implementation is already somewhat chunked. I also suggest you run some runtime tests before diving into a significant modification; I expect these changes to yield minimal improvement. You could try:
1. Remove #pragma omp ordered.
2. Call Rast_put_row() the same number of times outside the parallel for rather than inside the parallel loop.
3. Try schedule(dynamic).
Please try these changes separately if you have time. Although these tests will produce wrong results, they give a sense of which change makes the largest improvement. If you see a significant difference, it may be worth pursuing; if not, you may want to work on tools that haven’t been parallelized yet, where you can make more meaningful contributions.
Hi cyliang, thank you for the detailed explanation; it’s really valuable to get insight from the person who originally parallelized this module.
You’re absolutely right that my description of “chunked processing” was misleading — the per-thread row buffers (fbuf_threads) already achieve this. I should have benchmarked more carefully before proposing changes.
The only thing worth testing seems to be removing #pragma omp ordered and moving Rast_put_row() calls outside the parallel loop. I’ll run that benchmark and share the results.
If the improvement is marginal, I’ll redirect my efforts toward tools that haven’t been parallelized yet, where the impact would be more meaningful. Thanks again for the guidance!