#3198: r.stats.quantile: hardcoded max number of categories in base map
---------------------------------------+-------------------------
Reporter: mlennert | Owner: grass-dev@…
Type: defect | Status: new
Priority: normal | Milestone: 7.2.1
Component: Raster | Version: unspecified
Keywords: r.stats.quantile MAX_CATS | CPU: Unspecified
Platform: Unspecified |
---------------------------------------+-------------------------
r.stats.quantile
[https://trac.osgeo.org/grass/browser/grass/trunk/raster/r.stats.quantile/main.c#L21
limits] the number of categories the base map can have to 1000 through a
MAX_CATS variable.
Is there any specific reason for this? I would like to use
r.stats.quantile in i.segment.stats to calculate percentiles per segment,
but the number of segments can be much higher than 1000.
Comment (by glynn):
Replying to [ticket:3198 mlennert]:
> Is there any specific reason for this? I would like to use
> r.stats.quantile in i.segment.stats to calculate percentiles per segment,
> but the number of segments can be much higher than 1000.
The limit was added so that if someone tries to use a base map with a
million categories, it just fails quickly, rather than attempting
something which will either exhaust memory or take days to run.
For each category in the base map, it allocates a basecat structure, each
of which references several dynamically-allocated arrays. The .slots and
.slot_bins arrays are sized based upon the bins= option, the .values array
is sized to hold all of the values falling into any bin containing a
quantile, and the .quants and .bins arrays are sized according to the
number of quantiles.
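To make the per-category cost concrete, here is a rough sketch of such a structure and a size estimate. Only the five array names (.slots, .slot_bins, .values, .quants, .bins) come from the description above; the element types, the `_sketch` names and the `basecat_bytes` helper are assumptions for illustration and are not copied from main.c.

```c
#include <stddef.h>

/* Hypothetical layout: only the array names are taken from the ticket.
 * Element types are guesses made for the sake of the estimate. */
struct bin_sketch {
    double min, max;   /* value range covered by the bin */
    size_t base;       /* offset of the bin's first value in .values */
    size_t count;      /* number of values stored for this bin */
};

struct basecat_sketch {
    size_t *slots;              /* per-slot cell counts, sized by bins= */
    unsigned short *slot_bins;  /* per-slot bin index, sized by bins= */
    double *values;             /* values from bins containing a quantile */
    double *quants;             /* one entry per requested quantile */
    struct bin_sketch *bins;    /* one entry per requested quantile */
};

/* Rough estimate of the heap bytes one base category needs
 * (ignores allocator overhead and alignment padding). */
size_t basecat_bytes(size_t num_slots, size_t num_quants, size_t num_values)
{
    return sizeof(struct basecat_sketch)
        + num_slots * (sizeof(size_t) + sizeof(unsigned short))
        + num_values * sizeof(double)
        + num_quants * (sizeof(double) + sizeof(struct bin_sketch));
}
```

Multiplying `basecat_bytes(...)` by the number of base categories gives a feel for why a million categories with bins=1000 can become expensive under these assumptions.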
As well as the memory consumption, almost all processing is per-category.
Having said that, more categories will tend to result in less data per
category. However, there are some non-trivial per-category overheads. On
the other hand, sorting the bins containing quantiles should be faster
overall with more bins but proportionally less data in each bin.
There's no fundamental reason why the limit can't be raised; or even
abolished, if you don't mind an unsuitable choice of base map resulting in
"unable to allocate" errors, or just taking forever. Consider putting a
limit on num_cats*num_slots; a map with many categories should presumably
require fewer bins (assuming that the data isn't concentrated into a
handful of categories).
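A limit on num_cats*num_slots could look roughly like the following. This is only a sketch: the budget `MAX_TOTAL_SLOTS` and both function names are hypothetical, and no concrete number is proposed in this ticket.

```c
#include <stddef.h>

/* Hypothetical budget for num_cats * num_slots; pick to taste. */
#define MAX_TOTAL_SLOTS ((size_t)100000000)

/* Returns 1 if num_cats * num_slots stays within the budget, else 0. */
int slots_within_budget(size_t num_cats, size_t num_slots)
{
    if (num_cats == 0 || num_slots == 0)
        return 1;
    /* Compare by division so the product cannot overflow. */
    return num_slots <= MAX_TOTAL_SLOTS / num_cats;
}

/* Largest bins= value the budget allows for num_cats categories,
 * never less than 1. */
size_t max_slots_for(size_t num_cats)
{
    size_t n;

    if (num_cats == 0)
        return MAX_TOTAL_SLOTS;
    n = MAX_TOTAL_SLOTS / num_cats;
    return n > 0 ? n : 1;
}
```

With this assumed budget, 1000 categories with bins=1000 pass, while a million categories with bins=1000 fail and would be capped at 100 bins each.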
Comment (by mlennert):
Replying to [comment:1 glynn]:
> Replying to [ticket:3198 mlennert]:
>
> > Is there any specific reason for this? I would like to use
> > r.stats.quantile in i.segment.stats to calculate percentiles per segment,
> > but the number of segments can be much higher than 1000.
>
> The limit was added so that if someone tries to use a base map with a
> million categories, it just fails quickly, rather than attempting
> something which will either exhaust memory or take days to run.
>
> For each category in the base map, it allocates a basecat structure,
> each of which references several dynamically-allocated arrays. The .slots
> and .slot_bins arrays are sized based upon the bins= option, the .values
> array is sized to hold all of the values falling into any bin containing
> a quantile, and the .quants and .bins arrays are sized according to the
> number of quantiles.
>
> As well as the memory consumption, almost all processing is
> per-category.
>
> Having said that, more categories will tend to result in less data per
> category. However, there are some non-trivial per-category overheads. On
> the other hand, sorting the bins containing quantiles should be faster
> overall with more bins but proportionally less data in each bin.
>
> There's no fundamental reason why the limit can't be raised; or even
> abolished, if you don't mind an unsuitable choice of base map resulting in
> "unable to allocate" errors, or just taking forever.
A warning was maintained. At least the user is made aware and can stop the
module.
> Consider putting a limit on num_cats*num_slots; a map with many
> categories should presumably require fewer bins (assuming that the data
> isn't concentrated into a handful of categories).
In r69776 MarkusM introduced dynamic bins, although I don't really
understand what this means ;-).
More generally: the man page of r.stats.quantile lacks information about
its parameters, notably the 'bins' parameter. A short paragraph explaining
how the module works would be useful.
Comment (by mmetz):
Replying to [comment:2 mlennert]:
> Replying to [comment:1 glynn]:
> > [...]
> >
> > There's no fundamental reason why the limit can't be raised; or even
> > abolished, if you don't mind an unsuitable choice of base map resulting in
> > "unable to allocate" errors, or just taking forever.
>
> A warning was maintained. At least the user is made aware and can stop
> the module.
FWIW, I tested with more than a million categories in the base map and the
module finished within 19 seconds (on an old laptop).
>
> > Consider putting a limit on num_cats*num_slots; a map with many
> > categories should presumably require fewer bins (assuming that the data
> > isn't concentrated into a handful of categories).
>
> In r69776 MarkusM introduced dynamic bins, although I don't really
> understand what this means ;-).
For example, if there are only 10 cells for a given basemap category, it
does not make sense to allocate 1000 bins for that category; a single bin
is sufficient. With many basemap categories and only a few values for each
category, memory consumption can drop by 90%, to 10% of what the previous
version of r.stats.quantile needed. Still, with many basemap categories and
many cells per category, the module will be slow and will need a lot of
memory.
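The idea can be sketched as follows. This is not the code from r69776: the rule used here (roughly one bin per `CELLS_PER_BIN` cells, clamped between 1 and the requested bins= value), the constant and the function name are all assumptions chosen only to illustrate the principle.

```c
#include <stddef.h>

/* Assumed target, not taken from r69776: about this many cells per bin. */
#define CELLS_PER_BIN ((size_t)100)

/* Pick a bin count for one base category from its cell count,
 * never fewer than 1 and never more than the requested bins= value. */
size_t bins_for_category(size_t num_cells, size_t requested_bins)
{
    size_t n = num_cells / CELLS_PER_BIN;

    if (requested_bins < 1)
        requested_bins = 1;
    if (n < 1)
        n = 1;               /* tiny categories get a single bin */
    if (n > requested_bins)
        n = requested_bins;  /* large categories keep the user's limit */
    return n;
}
```

Under these assumptions a category with 10 cells gets one bin instead of 1000, while a category with a million cells still gets the full 1000.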