[GRASSLIST:8999] r.statistics vs v.rast.stats vs ??

I am trying to calculate statistics such as mean elevation, slope,
and maximum drainage area for a very large number of small plots. The
plots are in a vector file, and I have been trying to use v.rast.stats
on the various rasters that represent elevation, slope, drainage area,
etc.

The problem is that the rasters are very high resolution, the plots
are quite small, and I have more than 600 to process. If you look at
the code for v.rast.stats, it converts each plot into a raster to use
as a mask and then runs r.univar on the entire map (now mostly null
values), once for each statistic. In my situation this can take
several minutes per plot: hours for each of the rasters I am
interested in, and more than two days for the whole set.

Is there a more efficient way of calculating these statistics? I have
tried r.statistics but can't make sense of its inputs and outputs.
Alternatively, if I could control the region size in the critical part
of the v.rast.stats script, I might be able to improve its speed.

Any ideas?

--
David Finlayson
Marine Geology & Geophysics
School of Oceanography
Box 357940
University of Washington
Seattle, WA 98195-7940
USA

Office: Marine Sciences Building, Room 112
Phone: (206) 616-9407
Web: http://students.washington.edu/dfinlays

In my case (very large DEM, small vector areas) the creation of the
MASK during each pass through the loop was the rate-limiting step. I
added the following lines to reduce the size of the region prior to
creating the MASK, and it helped dramatically:

# Reduce size of region prior to calculating mask
g.remove vect=${VECTOR}_PTEMP
v.extract input=${VECTOR} output=${VECTOR}_PTEMP type=area new=1 list=$i
g.region vect=${VECTOR}_PTEMP

I need to do more testing to see if it is getting the same answer as before.
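
(One likely source of differences is grid alignment: g.region vect=
snaps the region to the vector's bounding box, which can shift the
cells relative to the raster. If g.region's align= option works the
way I read the manual, adding it should keep the cells identical to a
full-region run:)

# snap the shrunken region back onto the source raster's cell grid
# (align= here is my untested reading of the g.region manual)
g.region vect=${VECTOR}_PTEMP align=$RASTER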

Calculating the statistics multiple times is redundant, but for my
dataset it was not slow (less than 1 second for all of them). I wonder
whether it would take longer if my areas were larger.

Now the only slow part is updating sqlite. It seems to take a second
or so for each update call.

David

On 11/10/05, Hamish <hamish_nospam@yahoo.com> wrote:

> [original question snipped; see the top of the thread]
>
> Is there a more efficient way of calculating these statistics? I have
> tried r.statistics but can't make sense of its inputs and outputs.
> Alternatively, if I could control the region size in the critical part
> of the v.rast.stats script, I might be able to improve its speed.

for i in $CATSLIST ; do
echo "Processing category $i ($CURRNUM/$NUMBER)"
g.remove MASK > /dev/null 2> /dev/null
r.mapcalc "MASK=if(${VECTOR}_${TMPNAME} == $i, 1, null())" 2> /dev/null
#n, min, max, range, mean, stddev, variance, coeff_var, sum
n=`r.univar -g $RASTER | cut -d'=' -f2 | tr '\n' ' ' | cut -d' ' -f1`
min=`r.univar -g $RASTER | cut -d'=' -f2 | tr '\n' ' ' | cut -d' ' -f2`
max=`r.univar -g $RASTER | cut -d'=' -f2 | tr '\n' ' ' | cut -d' ' -f3`
range=`r.univar -g $RASTER | cut -d'=' -f2 | tr '\n' ' ' | cut -d' ' -f4`
mean=`r.univar -g $RASTER | cut -d'=' -f2 | tr '\n' ' ' | cut -d' ' -f5`
stddev=`r.univar -g $RASTER | cut -d'=' -f2 | tr '\n' ' ' | cut -d' ' -f6`
variance=`r.univar -g $RASTER | cut -d'=' -f2 | tr '\n' ' ' | cut -d' ' -f7`
cf_var=`r.univar -g $RASTER | cut -d'=' -f2 | tr '\n' ' ' | cut -d' ' -f8`
sum=`r.univar -g $RASTER | cut -d'=' -f2 | tr '\n' ' ' | cut -d' ' -f9`

A couple of things:

1) r.univar is running 9 times when it only needs to run once:
eval `r.univar -g $RASTER`

Note that $cf_var then becomes $coeff_var, and the rename needs to be
carried through the rest of the script.
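
Something like this inside the loop (the last line just keeps the old
variable name so the rest of the script is untouched):

#n, min, max, range, mean, stddev, variance, coeff_var, sum
eval `r.univar -g $RASTER`
cf_var="$coeff_var"   # keep the name the rest of the script expects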

2) if g.region zoom= respects MASK, we can do:
g.region save=${TMPNAME}_rastres

for i in $CATSLIST ; do
echo "Processing category $i ($CURRNUM/$NUMBER)"
g.remove MASK > /dev/null 2> /dev/null
r.mapcalc "MASK=if(${VECTOR}_${TMPNAME} == $i, 1, null())" 2> /dev/null

+g.region zoom=$RASTER

#n, min, max, range, mean, stddev, variance, coeff_var, sum
eval `r.univar -g $RASTER`

+g.region region=${TMPNAME}_rastres

...
done

g.remove region=${TMPNAME}_rastres

but g.region zoom= can take a _long_ time, so maybe there's no
advantage (or it's even worse)?

We would need to check/reset NSRES and EWRES to the raster's values,
or is that redundant? (What about g.region -a?) Some checking needs to
be done that we're not screwing up the cells; at minimum a
before/after check to see that the same answer comes out both ways.
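
One way to be safe about the grid (assuming align= can be used this
way; untested) is to snap the zoomed region back onto the raster's own
cells:

g.region zoom=$RASTER
g.region align=$RASTER   # re-align the bounds to the raster's cell grid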

3) rm twice??
cleanup()
{
#restore settings:
g.region region=${TMPNAME}
g.remove region=${TMPNAME} > /dev/null

twice??

4) isn't -p > /dev/null self-defeating? (it prints the region settings
and then throws the printout away)
g.region nsres=$NSRES ewres=$EWRES -ap > /dev/null

5) cleanup should happen before echo "Done."

Hamish

--
David Finlayson

I was able to reduce these updates from about 3 seconds per pass
through the loop to less than 1 second (undetectable) by writing the
SQL update statements to a file and then piping that file into sqlite3
directly. For the 650 plots in my data set that saves me 30 minutes
per run. I could probably save a bit more by accumulating all of the
updates into a single transaction and committing it at the end of the
script, but that is probably splitting hairs.
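
A rough sketch of the transaction idea (the table, column, and
database path below are placeholders, not the script's real names):

# accumulate one UPDATE per plot, then run them in a single transaction
SQLFILE=/tmp/updates_$$.sql
echo "BEGIN TRANSACTION;" > "$SQLFILE"
for i in $CATSLIST ; do
    # ... compute $mean for category $i as before ...
    echo "UPDATE plots SET mean_elev = $mean WHERE cat = $i;" >> "$SQLFILE"
done
echo "COMMIT;" >> "$SQLFILE"
sqlite3 /path/to/sqlite.db < "$SQLFILE"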

Using SQL directly requires knowledge of the back end, so this
modification may not be useful for others. v.db.update is really slow
with sqlite, though. Can this be improved?

David

On 11/10/05, David Finlayson <david.p.finlayson@gmail.com> wrote:

> [previous message quoted in full; snipped]

--
David Finlayson