[GRASS-dev] What is the meaning of output from i.cluster

i.cluster produces a text output file that looks like this (for the landsat 2000 images from the nc_spm_08 demo data set).

#produced by i.cluster
#Class 1
247
69.1174 50.3603 41.3482 18.9514 17.4534 117.049 14.2105
10.5837
12.6608 22.8737
17.1663 29.6708 45.1628
7.36345 9.28993 16.6389 47.4367
9.14573 9.22619 16.9756 47.1115 70.631
7.75037 6.67348 10.1294 13.043 14.9941 17.5505
7.63372 6.74497 12.0565 27.4981 43.3025 12.0629 30.7522
#Class 2
2059
70.0257 53.2035 46.205 61.085 61.5935 121.271 35.6688
9.28262
7.5725 9.99696
11.3514 11.8868 22.0678
-6.02017 -0.713125 -9.81723 54.7357
-1.33501 2.43359 2.91134 15.8135 40.1529
4.98865 4.49973 6.3419 -2.73927 -1.77851 11.3745
5.31077 6.34246 10.6505 -4.60692 19.1398 3.18893 19.0243

[more for the other classes]

So what does this mean? There is no clear explanation of this in the manual, and there are no variable names in the output.

I am guessing that

line 1 is possibly the number of pixels in the class
line 2 is the cluster mean within each original raster input file (landsat maps bands in this case)

The remaining lines are some kind of matrix. If it is a correlation matrix, I don’t understand why the diagonal is not 1, or at least the biggest number for the relevant input maps.

Any explanation?

Michael


C. Michael Barton
Director, Center for Social Dynamics & Complexity
Professor of Anthropology, School of Human Evolution & Social Change
Arizona State University

voice: 480-965-6262 (SHESC), 480-727-9746 (CSDC)
fax: 480-965-7671 (SHESC), 480-727-0709 (CSDC)

www: http://www.public.asu.edu/~cmbarton, http://csdc.asu.edu

Hello Michael,

Michael Barton wrote:

i.cluster produces a text output file that looks like this (for the landsat
2000 images from the nc_spm_08 demo data set).

#produced by i.cluster
#Class 1
247
69.1174 50.3603 41.3482 18.9514 17.4534 117.049 14.2105
10.5837
12.6608 22.8737
17.1663 29.6708 45.1628
7.36345 9.28993 16.6389 47.4367
9.14573 9.22619 16.9756 47.1115 70.631
7.75037 6.67348 10.1294 13.043 14.9941 17.5505
7.63372 6.74497 12.0565 27.4981 43.3025 12.0629 30.7522
#Class 2
2059
70.0257 53.2035 46.205 61.085 61.5935 121.271 35.6688
9.28262
7.5725 9.99696
11.3514 11.8868 22.0678
-6.02017 -0.713125 -9.81723 54.7357
-1.33501 2.43359 2.91134 15.8135 40.1529
4.98865 4.49973 6.3419 -2.73927 -1.77851 11.3745
5.31077 6.34246 10.6505 -4.60692 19.1398 3.18893 19.0243

[more for the other classes]

So what does this mean? There is no clear explanation of this in the manual,
and there are no variable names in the output.

I am guessing that
line 1 is possibly the number of pixels in the class

I think you are right.

line 2 is the cluster mean within each original raster input file (landsat
maps bands in this case)

Yes.

*Note,* however, as previously well-explained by Moritz Lennert
(http://lists.osgeo.org/pipermail/grass-user/2012-October/066046.html):

--%<---
i.cluster does not cluster all pixels, but only a sample (see parameter
'sample'). The result of that clustering is not that all pixels are assigned
to a given cluster, but only that you have signatures that are
"representative" of a given cluster. If you run i.cluster on the same data
asking for the same number of classes, but with different sample sizes, you
will probably get slightly different signatures for each cluster at each run.
--->%--
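A toy illustration of the sampling point above, not i.cluster's actual algorithm: the synthetic "band" and the helper `sample_mean` are invented for this sketch. As I understand the manual, the 'sample' parameter takes row and column intervals, so the sketch takes every N-th row and column and shows that the resulting statistic depends on the interval chosen.

```python
import random

# Synthetic single-band "image": 100 x 100 pixels drawn around a mean of 60.
# The values are made up; only the sampling effect matters here.
rng = random.Random(42)
ROWS, COLS = 100, 100
grid = [[rng.gauss(60.0, 8.0) for _ in range(COLS)] for _ in range(ROWS)]


def sample_mean(grid, row_step, col_step):
    """Return (pixel count, mean) for a systematic sample of the grid."""
    picked = [
        grid[r][c]
        for r in range(0, len(grid), row_step)
        for c in range(0, len(grid[0]), col_step)
    ]
    return len(picked), sum(picked) / len(picked)


# Two different sampling intervals over the same data.
n_coarse, mean_coarse = sample_mean(grid, 10, 10)
n_fine, mean_fine = sample_mean(grid, 5, 5)

print(f"every 10th row/col: {n_coarse} pixels, mean {mean_coarse:.3f}")
print(f"every  5th row/col: {n_fine} pixels, mean {mean_fine:.3f}")
```

The two means are close but not identical, which is the same reason the signatures (and the pixel counts in them) change when the sample size changes.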

The remaining lines are some kind of matrix. If a correlation matrix, I
don't understand why the diagonal is not 1--or at least the biggest number
for the relevant input maps.
Any explanation?

It is a variance-covariance matrix (variances on the diagonal, covariances in
the off-diagonal elements), as described in the manual and mentioned elsewhere
in past threads on the list.
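Putting the answers so far together, each class block appears to be: a "#Class" line, the number of sampled pixels, one mean per band, and then the lower triangle of the variance-covariance matrix (row r holds r values). A minimal sketch of reading that layout, assuming the text format is exactly as in the sample quoted above; `parse_signatures` is a name made up for this example and is not part of GRASS.

```python
SAMPLE = """#produced by i.cluster
#Class 1
247
69.1174 50.3603 41.3482 18.9514 17.4534 117.049 14.2105
10.5837
12.6608 22.8737
17.1663 29.6708 45.1628
7.36345 9.28993 16.6389 47.4367
9.14573 9.22619 16.9756 47.1115 70.631
7.75037 6.67348 10.1294 13.043 14.9941 17.5505
7.63372 6.74497 12.0565 27.4981 43.3025 12.0629 30.7522
"""


def parse_signatures(text):
    """Parse signature text in the layout shown above into a list of dicts."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    classes = []
    i = 0
    while i < len(lines):
        if not lines[i].startswith("#Class"):
            i += 1  # skip the file header and anything unrecognised
            continue
        npoints = int(lines[i + 1])
        means = [float(v) for v in lines[i + 2].split()]
        n = len(means)
        # The next n lines are the lower triangle: row r has r + 1 values.
        lower = [[float(v) for v in lines[i + 3 + r].split()] for r in range(n)]
        if any(len(row) != r + 1 for r, row in enumerate(lower)):
            raise ValueError("unexpected matrix layout in " + lines[i])
        # Mirror the lower triangle into a full symmetric matrix.
        cov = [[lower[max(r, c)][min(r, c)] for c in range(n)] for r in range(n)]
        classes.append(
            {"label": lines[i][1:], "npoints": npoints, "means": means, "cov": cov}
        )
        i += 3 + n
    return classes


classes = parse_signatures(SAMPLE)
first = classes[0]
print(first["label"], first["npoints"], len(first["means"]))
print("variance of band 1:", first["cov"][0][0])
print("covariance of bands 1 and 2:", first["cov"][0][1])
```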

As previously noted by Hamish Bowman, let's have a look at
(http://lists.osgeo.org/pipermail/grass-user/2008-June/045108.html):

--%<---
I_fopen_signature_file_new() found in lib/imagery/sigfile.c
--->%--

but I can't find/understand if it helps.

What about looking at

a) /geo/osgeo/src/grass_trunk/lib/python/ctypes/OBJ.x86_64-unknown-linux-gnu/imagery.py

or

b) /geo/osgeo/src/grass_trunk/lib/python/ctypes/OBJ.x86_64-unknown-linux-gnu/cluster.py:

--%<--- a) lines 756-781 / b) lines 579-600 --%<---
# /geo/osgeo/src/grass_trunk/dist.x86_64-unknown-linux-gnu/include/grass/imagery.h: 51
class struct_One_Sig(Structure):
    pass

struct_One_Sig.__slots__ = [
    'desc',
    'npoints',
    'mean',
    'var',
    'status',
    'r',
    'g',
    'b',
    'have_color',
]
struct_One_Sig._fields_ = [
    ('desc', c_char * 100),
    ('npoints', c_int),
    ('mean', POINTER(c_double)),
    ('var', POINTER(POINTER(c_double))),
    ('status', c_int),
    ('r', c_float),
    ('g', c_float),
    ('b', c_float),
    ('have_color', c_int),
]
--->%--

?

And then, of course, at
/geo/osgeo/src/grass_trunk/dist.x86_64-unknown-linux-gnu/include/grass/imagery.h:

--%<---
struct One_Sig
{
    char desc[100];
    int npoints;
    double *mean; /* one mean for each band */
    double **var; /* covariance band-band */
    int status; /* may be used to 'delete' a signature */
    float r, g, b; /* color */
    int have_color;
};
-->%---
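To make the link between the struct and the text file concrete, here is a self-contained ctypes sketch that mirrors the fields of One_Sig and fills them with the first three bands of Class 1 from the sample output. It does not load the GRASS library; the class name `OneSig` is local to this example, and storing the full symmetric matrix in `var` is an assumption about the in-memory layout.

```python
from ctypes import POINTER, Structure, c_char, c_double, c_float, c_int, cast


class OneSig(Structure):
    """Stand-alone mirror of struct One_Sig (same field order and types)."""

    _fields_ = [
        ("desc", c_char * 100),
        ("npoints", c_int),
        ("mean", POINTER(c_double)),           # one mean for each band
        ("var", POINTER(POINTER(c_double))),   # covariance band-band
        ("status", c_int),
        ("r", c_float),
        ("g", c_float),
        ("b", c_float),
        ("have_color", c_int),
    ]


nbands = 3  # first three bands only, to keep the example short

# Means and variance-covariance values copied from Class 1 above.
means = (c_double * nbands)(69.1174, 50.3603, 41.3482)
full = [
    [10.5837, 12.6608, 17.1663],
    [12.6608, 22.8737, 29.6708],
    [17.1663, 29.6708, 45.1628],
]

# A double** is an array of row pointers, each pointing at an array of doubles.
rows = [(c_double * nbands)(*row) for row in full]
row_ptrs = (POINTER(c_double) * nbands)(
    *[cast(row, POINTER(c_double)) for row in rows]
)

sig = OneSig()
sig.desc = b"Class 1"
sig.npoints = 247
sig.mean = cast(means, POINTER(c_double))
sig.var = cast(row_ptrs, POINTER(POINTER(c_double)))

print(sig.desc.decode(), sig.npoints)
print("mean of band 3:", sig.mean[2])
print("covariance of bands 2 and 1:", sig.var[1][0])
```

So 'npoints' corresponds to the first number of each class block, 'mean' to the second line, and 'var' to the triangular block that follows.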

Best, Nikos

Thanks for the explanation Nikos.

But see below.

Michael

On Mar 25, 2013, at 5:34 PM, Nikos Alexandris <nik@nikosalexandris.net>
wrote:

Hello Michael,

Michael Barton wrote:

i.cluster produces a text output file that looks like this (for the landsat
2000 images from the nc_spm_08 demo data set).

#produced by i.cluster
#Class 1
247
69.1174 50.3603 41.3482 18.9514 17.4534 117.049 14.2105
10.5837
12.6608 22.8737
17.1663 29.6708 45.1628
7.36345 9.28993 16.6389 47.4367
9.14573 9.22619 16.9756 47.1115 70.631
7.75037 6.67348 10.1294 13.043 14.9941 17.5505
7.63372 6.74497 12.0565 27.4981 43.3025 12.0629 30.7522
#Class 2
2059
70.0257 53.2035 46.205 61.085 61.5935 121.271 35.6688
9.28262
7.5725 9.99696
11.3514 11.8868 22.0678
-6.02017 -0.713125 -9.81723 54.7357
-1.33501 2.43359 2.91134 15.8135 40.1529
4.98865 4.49973 6.3419 -2.73927 -1.77851 11.3745
5.31077 6.34246 10.6505 -4.60692 19.1398 3.18893 19.0243

[more for the other classes]

So what does this mean? There is no clear explanation of this in the manual,
and there are no variable names in the output.

I am guessing that
line 1 is possibly the number of pixels in the class

If so, the report should be:

Number of pixels in class: #####

I think you are right.

line 2 is the cluster mean within each original raster input file (landsat
maps bands in this case)

And here, there should be a headings line and explanation something like this:

              map 1      map 2      map 3      map 4
Mean values:  #########  #########  #########  #########

Yes.

*Note,* however, as previously well-explained by Moritz Lennert
([GRASS-user] Is "i.cluster" an implementation of the ISODATA algorithm?):

--%<---
i.cluster does not cluster all pixels, but only a sample (see parameter
'sample'). The result of that clustering is not that all pixels are assigned
to a given cluster, but only that you have signatures that are
"representative" of a given cluster. If you run i.cluster on the same data
asking for the same number of classes, but with different sample sizes, you
will probably get slightly different signatures for each cluster at each run.

This needs to be in the manual rather than lost in an email.

--->%--

The remaining lines are some kind of matrix. If a correlation matrix, I
don't understand why the diagonal is not 1--or at least the biggest number
for the relevant input maps.
Any explanation?

It is a variance(=the diagonal)-covariance matrix(=the off diagonal elements)
as described in the manual and mentioned else-where in past threads in the
list.

So the diagonal values = the variance in the values for that band in a cluster
The off diagonal values are the covariance between the band values for a cluster

Right? This is not in the manual but should be. IMHO, it should also be in the report output something like this.

Variance (diagonal) and covariance scores (off-diagonal)

         map 1      map 2      map 3      map 4
map 1    #########
map 2    #########  #########
map 3    #########  #########  #########
map 4    #########  #########  #########  #########
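A labelled report along the lines proposed above could be produced by something like the following. This is only a sketch of the suggested layout, not what i.cluster writes: `format_report` and the "map N" labels are made up for the example, and the numbers are the first four bands of Class 1 from the sample output.

```python
def format_report(npoints, means, lower, names, width=10):
    """Render one class as a labelled report (proposed layout, not GRASS output)."""
    header = " " * width + "".join(f"{name:>{width}}" for name in names)
    out = [f"Number of pixels in class: {npoints}", "", header]
    out.append(f"{'Mean':<{width}}" + "".join(f"{m:>{width}.4f}" for m in means))
    out.append("")
    out.append("Variance (diagonal) and covariance (off-diagonal)")
    out.append(header)
    # Each row of the lower triangle gets the name of its band as a label.
    for name, row in zip(names, lower):
        out.append(f"{name:<{width}}" + "".join(f"{v:>{width}.4f}" for v in row))
    return "\n".join(out)


names = ["map 1", "map 2", "map 3", "map 4"]
means = [69.1174, 50.3603, 41.3482, 18.9514]
lower = [
    [10.5837],
    [12.6608, 22.8737],
    [17.1663, 29.6708, 45.1628],
    [7.36345, 9.28993, 16.6389, 47.4367],
]

report = format_report(247, means, lower, names)
print(report)
```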

As previously noted by Hamish Bowman, let's have a look at
([GRASS-user] Can't find signature file for i.cluster or i.class):

I'm not sure what the following means. Sorry. If it is necessary to delve into the source code to find out what is being reported in the report, the report needs to be changed so that is not necessary.

It would also be useful to know how much each map contributes to each cluster.

--%<---
I_fopen_signature_file_new() found in lib/imagery/sigfile.c
--->%--

but I can't find/understand if it helps.

What about looking at

a) /geo/osgeo/src/grass_trunk/lib/python/ctypes/OBJ.x86_64-unknown-linux-gnu/imagery.py

or

b) /geo/osgeo/src/grass_trunk/lib/python/ctypes/OBJ.x86_64-unknown-linux-gnu/cluster.py:

--%<--- a) lines 756-781 / b) lines 579-600 --%<---
# /geo/osgeo/src/grass_trunk/dist.x86_64-unknown-linux-gnu/include/grass/imagery.h: 51
class struct_One_Sig(Structure):
   pass

struct_One_Sig.__slots__ = [
   'desc',
   'npoints',
   'mean',
   'var',
   'status',
   'r',
   'g',
   'b',
   'have_color',
]
struct_One_Sig._fields_ = [
   ('desc', c_char * 100),
   ('npoints', c_int),
   ('mean', POINTER(c_double)),
   ('var', POINTER(POINTER(c_double))),
   ('status', c_int),
   ('r', c_float),
   ('g', c_float),
   ('b', c_float),
   ('have_color', c_int),
]
--->%--

?

And then, of course, at
/geo/osgeo/src/grass_trunk/dist.x86_64-unknown-linux-gnu/include/grass/imagery.h:

--%<---
struct One_Sig
{
   char desc[100];
   int npoints;
   double *mean; /* one mean for each band */
   double **var; /* covariance band-band */
   int status; /* may be used to 'delete' a signature */
   float r, g, b; /* color */
   int have_color;
};
-->%---

Best, Nikos

Michael Barton wrote:

Thanks for the explanation Nikos.

And I re-direct this ("thanks") to the people who have done the real work -- I
am still stealing stuff ;-)

But see below.

Sure, i.cluster is a favourite re-call subject :-).
More below -- please have a look at a draft version of a "to-become" Wiki
page related to Clustering:

<http://grasswiki.osgeo.org/wiki/User:NikosA/About_Clustering> -- this link is
repeated below as I will try to justify the need for an extra page, besides
the actual manual.

Michael Barton wrote:

>> i.cluster produces a text output file that looks like this (for the
>> landsat 2000 images from the nc_spm_08 demo data set).
>>
>> #produced by i.cluster
>> #Class 1
>> 247
>> 69.1174 50.3603 41.3482 18.9514 17.4534 117.049 14.2105
>> 10.5837
>> 12.6608 22.8737
>> 17.1663 29.6708 45.1628
>> 7.36345 9.28993 16.6389 47.4367
>> 9.14573 9.22619 16.9756 47.1115 70.631
>> 7.75037 6.67348 10.1294 13.043 14.9941 17.5505
>> 7.63372 6.74497 12.0565 27.4981 43.3025 12.0629 30.7522
>> #Class 2
>> 2059
>> 70.0257 53.2035 46.205 61.085 61.5935 121.271 35.6688
>> 9.28262
>> 7.5725 9.99696
>> 11.3514 11.8868 22.0678
>> -6.02017 -0.713125 -9.81723 54.7357
>> -1.33501 2.43359 2.91134 15.8135 40.1529
>> 4.98865 4.49973 6.3419 -2.73927 -1.77851 11.3745
>> 5.31077 6.34246 10.6505 -4.60692 19.1398 3.18893 19.0243
>>
>> [more for the other classes]
>>
>> So what does this mean? There is no clear explanation of this in the
>> manual, and there are no variable names in the output.
>>
>> I am guessing that
>> line 1 is possibly the number of pixels in the class

If so, the report should be:
Number of pixels in class: #####

[..]

I agree.

>> line 2 is the cluster mean within each original raster input file
>> (landsat maps bands in this case)

And here, there should be a headings line and explanation something like
this:
              map 1      map 2      map 3      map 4
Mean values:  #########  #########  #########  #########

I agree!

> *Note,* however, as previously well-explained by Moritz Lennert
> (http://lists.osgeo.org/pipermail/grass-user/2012-October/066046.html):
> --%<---
> i.cluster does not cluster all pixels, but only a sample (see parameter
> 'sample'). The result of that clustering is not that all pixels are
> assigned to a given cluster, but only that you have signatures that are
> "representative" of a given cluster. If you run i.cluster on the same data
> asking for the same number of classes, but with different sample sizes,
> you will probably get slightly different signatures for each cluster at
> each run.
> --->%--

This needs to be in the manual rather than lost in an email.

Sure.

>> The remaining lines are some kind of matrix. If a correlation matrix, I
>> don't understand why the diagonal is not 1--or at least the biggest
>> number for the relevant input maps. Any explanation?

> It is a variance(=the diagonal)-covariance matrix(=the off diagonal
> elements) as described in the manual and mentioned else-where in past
> threads in the list.

So the diagonal values = the variance in the values for that band in a
cluster. The off diagonal values are the covariance between the band values
for a cluster.

Right? This is not in the manual but should be. IMHO, it should also be in
the report output something like this.

Variance (diagonal) and covariance scores (off-diagonal)

         map 1      map 2      map 3      map 4
map 1    #########
map 2    #########  #########
map 3    #########  #########  #########
map 4    #########  #########  #########  #########

Agreed!
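Since the original question was whether this is a correlation matrix: a correlation matrix can be derived from the variance-covariance values by dividing each covariance by the square root of the product of the two variances, which puts 1 on the diagonal. A minimal sketch using the first four bands of Class 1 from the sample output; the helper name `covariance_to_correlation` is made up for this example.

```python
import math


def covariance_to_correlation(lower):
    """Turn a lower-triangular covariance block into a full correlation matrix."""
    n = len(lower)
    # Mirror the lower triangle so cov[r][c] is defined for every pair.
    cov = [[lower[max(r, c)][min(r, c)] for c in range(n)] for r in range(n)]
    std = [math.sqrt(cov[k][k]) for k in range(n)]  # diagonal = variances
    return [[cov[r][c] / (std[r] * std[c]) for c in range(n)] for r in range(n)]


# Lower triangle for bands 1-4 of Class 1, copied from the sample output.
lower = [
    [10.5837],
    [12.6608, 22.8737],
    [17.1663, 29.6708, 45.1628],
    [7.36345, 9.28993, 16.6389, 47.4367],
]

corr = covariance_to_correlation(lower)
for row in corr:
    print(" ".join(f"{value:6.3f}" for value in row))
```

For these numbers the correlation between bands 1 and 2 comes out at roughly 0.81, even though the raw covariance (12.6608) is larger than the variance of band 1 (10.5837). That is why the diagonal of the reported matrix need not hold the biggest numbers.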

> As previously noted by Hamish Bowman, let's have a look at
> (http://lists.osgeo.org/pipermail/grass-user/2008-June/045108.html):

I'm not sure what the following means. Sorry. If it is necessary to delve
into the source code to find out what is being reported in the report, the
report needs to be changed so that is not necessary.

Absolutely. Note that we may need to distinguish between what the user report
looks like and the actual input for the classification (e.g. for i.maxlik),
which, I guess, is what the current structure of the signature files serves (?).

It would also be useful to know how much each map contributes to each
cluster.

(shrug)

I need to "think" a lot about that, meaning take time to understand the code
and propose...

Anyhow, I have started drafting a wiki page on the similarities and
differences between i.cluster and other well-known clustering algorithms such
as k-means and ISODATA. I consider it necessary after a) having read many
related threads, b) going line-by-line through the manual and c) working
recently on a project which involved "simple clustering" of NDVI maps.

I had already taken several notes (copy-pasted from threads, the manual, and
my own wording) towards an explanatory/comparative overview (text), but never
took the time to start the page...

I have started sharing (slowly) these notes at:
<http://grasswiki.osgeo.org/wiki/User:NikosA/About_Clustering>.

This is a draft version which, hopefully, will end up being a normal
GRASS-Wiki page -- ideally mentioned in the manual of i.cluster as well.

I am not sure who has the time and the will to go through the real hard-work,
i.e. adjusting the code so as to make it more informative for the user. I
will try to support the documentation efforts.

Thank you, Nikos
