On 31/10/18 12:19, Nikos Alexandris wrote:
* Veronica Andreo <veroandreo@gmail.com> [2018-10-31 00:23:57 +0100]:
Hi devs,
Hi Vero,
(not a real dev, but I'll share what I think)
I'm writing to ask how do one determine the best number of classes/clusters
in a set of unsupervised classifications with different k in GRASS?
You already know better than me I guess, but I'd like to refresh my mind
on all this a bit.
I guess the only way to tell if the number of classes is "best", is to
judge yourself by inspecting the "quality" of the clusters returned.
One way to tell would be to compute the "error of clusters" which would
be the overall distance between the points that are assigned to a
cluster and its center. I guess comparing the overall errors between
different clustering settings (or even algorithms?), would give an idea
about how close points are around the centers of clusters.
Maybe we could implement something like this.
(All this I practiced during an generic Algorithmic Thinking course. I
guess it's applicable in our "domain" too.)
I use i.cluster with different number of classes and then i.maxlik that uses a
modified version of k-means according to the manual page. Now, I would like
to know which unsup classif is the best within the set.
Sorry, I guess I have to read up: what is "unsup classif"?
I check the i.cluster reports (looking for separability) and then explored the
rejection maps. But none of those seems to work as a crisp and clear
indicator. BTW, does anyone know which separability index does i.cluster
use?
I am interested to learn about the distance measure too. I am looking
at the source code of `i.cluster`. And then, searching around, I think
it's this file:
grasstrunk/lib/cluster/c_sep.c
and I/we just need to identify which distance it measures.
i.cluster uses a simple k-means approach based on the spectral euclidean distance between pixels or between pixels and existing clusters. By including a min cluster size and a min cluster separation parameter, total number of clusters might change which is different from a classical k-means.
i.cluster also works on a sample of the image pixels to define the clusters, so there is no guarantee that the clusters it identifies would be those one would find if using all pixels, but AFAIK it is generally reasonable close to justify the pay-off as it provides greater speed.
i.maxlik does not interfere in the clustering part. It uses the signatures of classes provided as input (possibly the signatures of the clusters if the input is the output of i.cluster) to then assign each pixel to one of the classes. The reject map of i.maxlik allows you see the probability of a pixels membership in the chosen class. It does not really allow you to measure cluster "quality", nor ideal number of clusters (well you could try with many different cluster numbers and then chose the one where the reject map values are the lowest on average).
If you want to use a very simple approach to Nikos' suggestion of calculating the error, you could use something like this:
- For each i.cluster + i.maxlik result:
- For each original band
- Create new pseudo band with mean values of the
original band per cluster (r.stats.zonal)
- Calculate euclidean distance in spectral space of each pixel
to its cluster (r.mapcalc):
(pixel_value_band1 - r.stats.zonal result on band 1)^2 +
(pixel_value_band2 - r.stats.zonal result on band 2)^2 +
etc
- Calculate mean euclidean distance on the result of (or median,
or whatever you are looking for) (r.univar)
- Identify the i.cluster + i.maxlik result that reaches the best score
Moritz