[GRASS-dev] how to determine best k in a set of unsupervised classifications?

Hi devs,

I’m writing to ask: how does one determine the best number of classes/clusters in a set of unsupervised classifications with different k in GRASS? I use i.cluster with different numbers of classes and then i.maxlik, which uses a modified version of k-means according to the manual page. Now I would like to know which unsup classif is the best within the set. I checked the i.cluster reports (looking for separability) and then explored the rejection maps, but none of those works as a crisp and clear indicator. BTW, does anyone know which separability index i.cluster uses?
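
For reference, my loop looks roughly like this in Python with grass.script
(just a sketch; the "lsat" group/subgroup and the k values are placeholders):

    import grass.script as gs

    # placeholder imagery group/subgroup containing the bands
    group = subgroup = "lsat"

    for k in (5, 10, 15, 20):
        sig = f"sig_k{k}"
        # unsupervised clustering: signatures for k classes
        gs.run_command("i.cluster", group=group, subgroup=subgroup,
                       signaturefile=sig, classes=k,
                       reportfile=f"report_k{k}.txt")
        # maximum-likelihood classification using those signatures
        gs.run_command("i.maxlik", group=group, subgroup=subgroup,
                       signaturefile=sig, output=f"classif_k{k}",
                       reject=f"reject_k{k}")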

In any case, I have seen some indices elsewhere (mainly in R and Python) that are used to choose the best clustering result (coming from the same or different clustering methods). Examples of such indices are Silhouette, Dunn, etc. Some are called internal, as they do not require test data and just characterize the compactness of the clusters. The ones requiring test data, on the other hand, are called external. I have seen them in the dtwclust R package [0] (the package is oriented to time series clustering, but the validation indices are more general) and in scikit-learn in Python [1]. Do any of you have something already implemented in this direction? Or how do you assess your unsup classification (clustering) results?
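
For instance, the silhouette index in scikit-learn only needs the feature
values and the cluster labels, so a minimal sketch (with toy stand-ins for
the actual pixel data) would be:

    import numpy as np
    from sklearn.metrics import silhouette_score

    # toy stand-ins: X = one row of band values per pixel,
    # labels = cluster id assigned to each pixel
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 4))
    labels = rng.integers(0, 5, size=5000)

    # internal index, no test data needed; closer to 1 is better;
    # sample_size keeps it tractable on large rasters
    print(silhouette_score(X, labels, sample_size=2000, random_state=0))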

Any ideas or suggestions within GRASS?

Thanks much in advance!
Vero

[0] https://rdrr.io/cran/dtwclust/man/cvi.html

[1] http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation

* Veronica Andreo <veroandreo@gmail.com> [2018-10-31 00:23:57 +0100]:

Hi devs,

Hi Vero,

(not a real dev, but I'll share what I think)

I'm writing to ask: how does one determine the best number of classes/clusters
in a set of unsupervised classifications with different k in GRASS?

You already know better than me, I guess, but I'd like to refresh my mind
on all this a bit.

I guess the only way to tell if the number of classes is "best" is to
judge for yourself by inspecting the "quality" of the clusters returned.

One way to tell would be to compute the "error of clusters", i.e. the
overall distance between the points that are assigned to a cluster and
its center. I guess comparing the overall errors between different
clustering settings (or even algorithms?) would give an idea of how
closely the points sit around the centers of their clusters.
Maybe we could implement something like this.
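
A minimal numpy sketch of what I mean (X would hold one row of band values
per pixel, labels the assigned cluster):

    import numpy as np

    def clustering_error(X, labels):
        """Sum of distances from each point to its cluster center."""
        error = 0.0
        for c in np.unique(labels):
            points = X[labels == c]
            center = points.mean(axis=0)
            error += np.linalg.norm(points - center, axis=1).sum()
        return error

One caveat I remember from that course: this raw error always shrinks as k
grows, so rather than taking the minimum one looks for the "elbow" where
adding more clusters stops paying off.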

(All this I practiced during a generic Algorithmic Thinking course. I
guess it's applicable in our "domain" too.)

I use i.cluster with different numbers of classes and then i.maxlik, which
uses a modified version of k-means according to the manual page. Now I would
like to know which unsup classif is the best within the set.

Sorry, I guess I have to read up: what is "unsup classif"?

I checked the i.cluster reports (looking for separability) and then explored
the rejection maps, but none of those works as a crisp and clear
indicator. BTW, does anyone know which separability index i.cluster
uses?

I am interested to learn about the distance measure too. I am looking
at the source code of `i.cluster`. Searching around, I think
it's this file:

grasstrunk/lib/cluster/c_sep.c

and I/we just need to identify which distance it measures.

Nikos

In any case, I have seen some indices elsewhere (mainly in R and Python) that
are used to choose the best clustering result (coming from the same or
different clustering methods). Examples of such indices are Silhouette,
Dunn, etc. Some are called internal, as they do not require test data and
just characterize the compactness of the clusters. The ones requiring test
data, on the other hand, are called external. I have seen them in the
dtwclust R package [0] (the package is oriented to time series clustering,
but the validation indices are more general) and in scikit-learn in Python
[1]. Do any of you have something already implemented in this direction?
Or how do you assess your unsup classification (clustering) results?

Any ideas or suggestions within GRASS?

Thanks much in advance!
Vero

[0] https://rdrr.io/cran/dtwclust/man/cvi.html
[1]
http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation


--
Nikos Alexandris | Remote Sensing & Geomatics
GPG Key Fingerprint 6F9D4506F3CA28380974D31A9053534B693C4FB3

On 31/10/18 12:19, Nikos Alexandris wrote:

* Veronica Andreo <veroandreo@gmail.com> [2018-10-31 00:23:57 +0100]:

Hi devs,

Hi Vero,

(not a real dev, but I'll share what I think)

I'm writing to ask: how does one determine the best number of classes/clusters
in a set of unsupervised classifications with different k in GRASS?

You already know better than me, I guess, but I'd like to refresh my mind
on all this a bit.

I guess the only way to tell if the number of classes is "best" is to
judge for yourself by inspecting the "quality" of the clusters returned.

One way to tell would be to compute the "error of clusters", i.e. the
overall distance between the points that are assigned to a cluster and
its center. I guess comparing the overall errors between different
clustering settings (or even algorithms?) would give an idea of how
closely the points sit around the centers of their clusters.
Maybe we could implement something like this.

(All this I practiced during a generic Algorithmic Thinking course. I
guess it's applicable in our "domain" too.)

I use i.cluster with different numbers of classes and then i.maxlik, which
uses a modified version of k-means according to the manual page. Now I would
like to know which unsup classif is the best within the set.

Sorry, I guess I have to read up: what is "unsup classif"?

I checked the i.cluster reports (looking for separability) and then explored
the rejection maps, but none of those works as a crisp and clear
indicator. BTW, does anyone know which separability index i.cluster
uses?

I am interested to learn about the distance measure too. I am looking
at the source code of `i.cluster`. Searching around, I think
it's this file:

grasstrunk/lib/cluster/c_sep.c

and I/we just need to identify which distance it measures.

i.cluster uses a simple k-means approach based on the spectral Euclidean distance between pixels, or between pixels and existing clusters. Because it includes a minimum cluster size and a minimum cluster separation parameter, the total number of clusters can change during the run, which differs from classical k-means.

i.cluster also works on a sample of the image pixels to define the clusters, so there is no guarantee that the clusters it identifies are those one would find using all pixels, but AFAIK the result is generally close enough to justify the trade-off, since sampling provides much greater speed.

i.maxlik does not interfere in the clustering part. It uses the signatures of the classes provided as input (possibly the signatures of the clusters, if the input is the output of i.cluster) to assign each pixel to one of the classes. The reject map of i.maxlik lets you see the probability of a pixel's membership in the chosen class. It does not really allow you to measure cluster "quality" or the ideal number of clusters (although you could try many different cluster numbers and then choose the one where the reject map values are lowest on average).
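
For what it's worth, pulling that average out of each reject map is a
one-liner per run with r.univar, e.g. (a grass.script sketch; the reject
map names are placeholders):

    import grass.script as gs

    # hypothetical reject maps, one per tested number of clusters
    for k in (5, 10, 15, 20):
        stats = gs.parse_command("r.univar", map=f"reject_k{k}", flags="g")
        print(k, float(stats["mean"]))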

If you want a very simple approach to Nikos' suggestion of calculating the error, you could use something like this (a script sketch follows the list):

- For each i.cluster + i.maxlik result:
  - For each original band
    - Create new pseudo band with mean values of the
      original band per cluster (r.stats.zonal)
  - Calculate the Euclidean distance in spectral space of each pixel
      to its cluster center (r.mapcalc):

    sqrt((pixel_value_band1 - r.stats.zonal result on band 1)^2 +
         (pixel_value_band2 - r.stats.zonal result on band 2)^2 +
         ...)

  - Calculate the mean (or median, or whatever you are looking for)
    of that distance map (r.univar)

- Identify the i.cluster + i.maxlik result that reaches the best (i.e. lowest) score
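
A rough Python sketch of this recipe (untested; band and map names are
placeholders, and you would repeat it for each i.cluster + i.maxlik result):

    import grass.script as gs

    bands = ["band1", "band2", "band3"]   # placeholder band names
    clustered = "classif_k10"             # one i.cluster + i.maxlik result

    # pseudo band per original band: mean of the band per cluster
    terms = []
    for band in bands:
        zonal = f"{band}_clustermean"
        gs.run_command("r.stats.zonal", base=clustered, cover=band,
                       method="average", output=zonal)
        terms.append(f"({band} - {zonal})^2")

    # Euclidean distance of each pixel to its cluster center
    gs.mapcalc("dist_to_center = sqrt(" + " + ".join(terms) + ")")

    # one score per result: mean distance (lower = tighter clusters)
    stats = gs.parse_command("r.univar", map="dist_to_center", flags="g")
    print(float(stats["mean"]))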

Moritz

Hi Nikos and Moritz,

Thanks for your replies and for the recipe :-)

Cheers,
Vero
