[GRASS-dev] Compute mahalanobis distance using Scipy

I would like to compute a raster layer with, for each raster cell, the mahalanobis distance to the centre of the environmental space formed by all reference data points (raster cells). In R this can be done as explained here [1].

I would like to do this using python only (no dependency on R). I came up with the following, which works, but is very slow. I guess this is because it loops over every raster cell to compute the mahalanobis distance? Any idea how this can be done faster? (I am very new to python, so bear with me if I am making stupid mistakes.)

    ref = ['bio1@clim', 'bio2@clim', 'bio3@clim']

    import os
    import numpy as np
    from scipy.spatial import distance
    import scipy.linalg as linalg
    import grass.script as grass
    import grass.script.array as garray

    # Create covariance table (could do this in python instead)
    tmpcov = "tmpcov"
    text_file = open(tmpcov, "w")
    text_file.write(grass.read_command("r.covar", quiet=True, map=ref))
    text_file.close()
    covar = np.loadtxt(tmpcov, skiprows=1)
    os.remove(tmpcov)

    # Import data
    dat_ref = None
    stat_mean = None
    layer = garray.array()
    for i, map in enumerate(ref):
        layer.read(map, null=np.nan)
        s = len(ref)
        r, c = layer.shape
        if dat_ref is None:
            dat_ref = np.empty((s, r, c), dtype=np.double)
        if stat_mean is None:
            stat_mean = np.empty((s), dtype=np.double)
        dat_ref[i, :, :] = layer
        stat_mean[i] = np.nanmean(layer)
    del layer

    # Compute mahalanobis distance (note: the inverse covariance matrix is
    # recomputed for every single cell, which adds to the slowness)
    r, c = dat_ref[1, :, :].shape
    stat_mah = garray.array()
    for i in xrange(r):
        for j in xrange(c):
            cell_ref = dat_ref[..., i, j]
            stat_mah[i, j] = distance.mahalanobis(cell_ref, stat_mean,
                                                  linalg.inv(covar))
    stat_mah.write("mahalanobisMap")

[1] https://pvanb.wordpress.com/2014/05/13/a-new-method-and-tool-exdet-to-evaluate-novelty-environmental-conditions/

On 29/01/15 18:30, Paulo van Breugel wrote:

I would like to compute a raster layer with, for each raster cell, the
mahalanobis distance to the centre of the environmental space formed
by all reference data points (raster cells). In R this can be done as
explained here [1].

I would like to do this using python only (no dependency on R). I came
up with the following, which works, but is very slow. I guess this is
because it loops over every raster cell to compute the mahalanobis
distance? Any idea how this can be done faster? (I am very new to python,
so bear with me if I am making stupid mistakes.)

There are probably ways to accelerate this in Python (maybe you can try rewriting your for-loops as map() calls), but the Wikipedia page on mahalanobis distance [1] says that:

"Along each principal component axis, it measures the number of standard deviations from P to the mean of D. If each of these axes is rescaled to have unit variance, then Mahalanobis distance corresponds to standard Euclidean distance in the transformed space."

Couldn't you use i.pca to calculate principal components and then calculate distances of points in that space?

Just brainstorming...

Moritz

[1] http://en.wikipedia.org/wiki/Mahalanobis_distance
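A rough sketch of how that i.pca idea could look with grass.script (illustrative only and untested; the 'pca' output prefix and the layer names are placeholders, and the result only matches the Mahalanobis distance if all components are kept and the distance is measured from the mean):

    import grass.script as grass

    ref = ['bio1@clim', 'bio2@clim', 'bio3@clim']
    # run the PCA; the output maps are assumed to be named pca.1, pca.2, ...
    grass.run_command('i.pca', input=ref, output='pca')

    # standardise each component and take the Euclidean distance to the mean
    terms = []
    for i in range(1, len(ref) + 1):
        pc = 'pca.%d' % i
        stats = grass.parse_command('r.univar', flags='g', map=pc)
        terms.append('(({pc} - {mean}) / {sd})^2'.format(
            pc=pc, mean=stats['mean'], sd=stats['stddev']))
    grass.mapcalc('mahalanobisMap = sqrt(%s)' % ' + '.join(terms))

Standardising each component by its own mean and standard deviation also makes the result insensitive to any linear rescaling i.pca may apply to its output maps.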

Paulo van Breugel wrote:

I would like to compute a raster layer with, for each raster cell, the
mahalanobis distance to the centre of the environmental space formed by all
reference data points (raster cells). In R this can be done as explained
here [1].

I would like to do this using python only (no dependency on R). I came up
with the following, which works, but is very slow. I guess this is because
it loops over every raster cell to compute the mahalanobis distance? Any
idea how this can be done faster? (I am very new to python, so bear with me
if I am making stupid mistakes.)

The main speed-up will come from "inlining" distance.mahalanobis(),
which is essentially:

    delta = u - v
    m = np.dot(np.dot(delta, VI), delta)
    return np.sqrt(m)

np.dot(v, m) is equivalent to np.sum(v * m,axis=0), and np.dot(u,v) is
equivalent to np.sum(u * v), so the second line above is equivalent to

    m = np.sum(np.sum(delta[:,None] * VI * delta[None,:], axis=-1), axis=-1)

Thus, delta can be extended from a 1-D array to 3-D, changing the
result from a scalar to a 2-D array. The calculation of stat_mah then
becomes:

    VI = np.linalg.inv(covar)
    delta = dat_ref - stat_mean[:,None,None]
    m = np.sum(np.sum(delta[:,:,:,None] * delta[:,:,None,:] * VI[None,None,:,:],axis=-1),axis=-1)
    stat_mah = np.sqrt(m)

This moves the loops from Python into C, which usually results in a
significant speed increase.

--
Glynn Clements <glynn@gclements.plus.com>
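As a throwaway illustration of that identity (not part of the original exchange), the inlined expression matches SciPy on a small made-up example:

    import numpy as np
    from scipy.spatial import distance

    u = np.array([1.0, 2.0, 3.0])
    v = np.array([0.5, 1.8, 2.2])
    VI = np.linalg.inv(np.array([[2.0, 0.3, 0.1],
                                 [0.3, 1.5, 0.2],
                                 [0.1, 0.2, 1.0]]))

    delta = u - v
    m = np.sum(np.sum(delta[:, None] * VI * delta[None, :], axis=-1), axis=-1)
    print(np.sqrt(m))                          # same value as...
    print(distance.mahalanobis(u, v, VI))      # ...SciPy's implementation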

Thanks! I am traveling right now so will look at this as soon as I am back, but this looks promising. Much appreciated.

On Fri, 30 Jan 2015 15:40 Glynn Clements <glynn@gclements.plus.com> wrote:

Paulo van Breugel wrote:

I would like to compute a raster layer with, for each raster cell, the
mahalanobis distance to the centre of the environmental space formed by all
reference data points (raster cells). In R this can be done as explained
here [1].

I would like to do this using python only (no dependency on R). I came up
with the following, which works, but is very slow. I guess this is because
it loops over every raster cell to compute the mahalanobis distance? Any
idea how this can be done faster? (I am very new to python, so bear with me
if I am making stupid mistakes.)

The main speed-up will come from "inlining" distance.mahalanobis(),
which is essentially:

    delta = u - v
    m = np.dot(np.dot(delta, VI), delta)
    return np.sqrt(m)

np.dot(v, m) is equivalent to np.sum(v * m,axis=0), and np.dot(u,v) is
equivalent to np.sum(u * v), so the second line above is equivalent to

    m = np.sum(np.sum(delta[:,None] * VI * delta[None,:], axis=-1), axis=-1)

Thus, delta can be extended from a 1-D array to 3-D, changing the
result from a scalar to a 2-D array. The calculation of stat_mah then
becomes:

    VI = np.linalg.inv(covar)
    delta = dat_ref - stat_mean[:,None,None]
    m = np.sum(np.sum(delta[:,:,:,None] * delta[:,:,None,:] * VI[None,None,:,:],axis=-1),axis=-1)
    stat_mah = np.sqrt(m)

This moves the loops from Python into C, which usually results in a
significant speed increase.


Glynn Clements <glynn@gclements.plus.com>

On Fri, 30 Jan 2015 10:59 Moritz Lennert <mlennert@club.worldonline.be> wrote:

On 29/01/15 18:30, Paulo van Breugel wrote:

I would like to compute a raster layer with, for each raster cell, the
mahalanobis distance to the centre of the environmental space formed
by all reference data points (raster cells). In R this can be done as
explained here [1].

I would like to do this using python only (no dependency on R). I came
up with the following, which works, but is very slow. I guess this is
because it loops over every raster cell to compute the mahalanobis
distance? Any idea how this can be done faster? (I am very new to python,
so bear with me if I am making stupid mistakes.)

There are probably ways to accelerate this in Python (maybe you can try
rewriting your for-loops as map() calls), but the Wikipedia page on
mahalanobis distance [1] says that:

"Along each principal component axis, it measures the number of standard
deviations from P to the mean of D. If each of these axes is rescaled to
have unit variance, then Mahalanobis distance corresponds to standard
Euclidean distance in the transformed space."

Couldn't you use i.pca to calculate principal components and then
calculate distances of points in that space?

Just brainstorming…

Interesting thought. I will see if I can work that one out. Thanks!

Moritz

[1] http://en.wikipedia.org/wiki/Mahalanobis_distance

On Fri, Jan 30, 2015 at 3:39 PM, Glynn Clements <glynn@gclements.plus.com>
wrote:

Paulo van Breugel wrote:

> I would like to compute a raster layer with, for each raster cell, the
> mahalanobis distance to the centre of the environmental space formed by all
> reference data points (raster cells). In R this can be done as explained
> here [1].
>
> I would like to do this using python only (no dependency on R). I came up
> with the following, which works, but is very slow. I guess this is because
> it loops over every raster cell to compute the mahalanobis distance? Any
> idea how this can be done faster? (I am very new to python, so bear with me
> if I am making stupid mistakes.)

The main speed-up will come from "inlining" distance.mahalanobis(),
which is essentially:

    delta = u - v
    m = np.dot(np.dot(delta, VI), delta)
    return np.sqrt(m)

np.dot(v, m) is equivalent to np.sum(v * m,axis=0), and np.dot(u,v) is
equivalent to np.sum(u * v), so the second line above is equivalent to

    m = np.sum(np.sum(delta[:,None] * VI * delta[None,:], axis=-1), axis=-1)

Smart! I can follow the logic, but I am not sure how to solve the problem
below:

   Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    ValueError: operands could not be broadcast together with shapes
(3,1,77,78) (3,3)

Which refers to the different dimensions of delta and VI?

Thus, delta can be extended from a 1-D array to 3-D, changing the
result from a scalar to a 2-D array. The calculation of stat_mah then
becomes:

    VI = np.linalg.inv(covar)
    delta = dat_ref - stat_mean[:,None,None]
    m = np.sum(np.sum(delta[:,:,:,None] * delta[:,:,None,:] * VI[None,None,:,:],axis=-1),axis=-1)
    stat_mah = np.sqrt(m)

This moves the loops from Python into C, which usually results in a
significant speed increase.

--
Glynn Clements <glynn@gclements.plus.com>

Paulo van Breugel wrote:

> The main speed-up will come from "inlining" distance.mahalanobis(),
> which is essentially:
>
> delta = u - v
> m = np.dot(np.dot(delta, VI), delta)
> return np.sqrt(m)
>
> np.dot(v, m) is equivalent to np.sum(v * m,axis=0), and np.dot(u,v) is
> equivalent to np.sum(u * v), so the second line above is equivalent to
>
> m = np.sum(np.sum(delta[:,None] * VI * delta[None,:], axis=-1),
> axis=-1)

Smart! I can follow the logic, but I am not sure how to solve the problem
below:

   Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    ValueError: operands could not be broadcast together with shapes
(3,1,77,78) (3,3)

Which refers to the different dimensions of delta and VI?

The first version of the code (which is quoted above) will only work
with 1-D arrays. It's just a fairly direct translation of the existing
distance.mahalanobis() function, given to explain how it can then be
extended to 3-D arrays (i.e. a 2-D array of 1-D vectors).

The second version:

> VI = np.linalg.inv(covar)
> delta = dat_ref - stat_mean[:,None,None]
> m = np.sum(np.sum(delta[:,:,:,None] * delta[:,:,None,:] * VI[None,None,:,:],axis=-1),axis=-1)
> stat_mah = np.sqrt(m)

should work with delta being a 3-D array.

--
Glynn Clements <glynn@gclements.plus.com>

On Mon, Feb 2, 2015 at 1:38 PM, Glynn Clements <glynn@gclements.plus.com>
wrote:

Paulo van Breugel wrote:

> > The main speed-up will come from "inlining" distance.mahalanobis(),
> > which is essentially:
> >
> > delta = u - v
> > m = np.dot(np.dot(delta, VI), delta)
> > return np.sqrt(m)
> >
> > np.dot(v, m) is equivalent to np.sum(v * m,axis=0), and np.dot(u,v) is
> > equivalent to np.sum(u * v), so the second line above is equivalent to
> >
> > m = np.sum(np.sum(delta[:,None] * VI * delta[None,:], axis=-1),
> > axis=-1)
>
>
> Smart! I can follow the logic, but I am not sure how to solve the problem
> below:
>
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> ValueError: operands could not be broadcast together with shapes
> (3,1,77,78) (3,3)
>
> Which refers to the different dimensions of delta and VI?

The first version of the code (which is quoted above) will only work
with 1-D arrays. It's just a fairly direct translation of the existing
distance.mahalanobis() function, given to explain how it can then be
extended to 3-D arrays (i.e. a 2-D array of 1-D vectors).

The second version:

> > VI = np.linalg.inv(covar)
> > delta = dat_ref - stat_mean[:,None,None]
> > m = np.sum(np.sum(delta[:,:,:,None] * delta[:,:,None,:] * VI[None,None,:,:],axis=-1),axis=-1)
> > stat_mah = np.sqrt(m)

should work with delta being a 3-D array.

Yes, but when running the lines above, I am getting the same error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes
(3,77,78,78) (1,1,3,3)

From what I can see from broadcasting rules, there is a mismatch in the last
and second-to-last dimensions of VI[None,None,:,:] compared to the other two?

delta[:,:,:,None].shape
(3, 77, 78, 1)

delta[:,:,None,:].shape
(3, 77, 1, 78)

VI[None,None,:,:].shape
(1, 1, 3, 3)

I really have to get a better grasp of working with these arrays, but in
any case, after a bit of trial and error, I came up with the following to
compute m:

m = np.sum(np.sum(delta[None,:,:,:] * VI[:,:,None,None],axis=1) * delta, axis=0)

In my limited testing, this gives the same result as using the loop with

pow(distance.mahalanobis(cell_ref, stat_mean, VI),2).

To be sure, does the above make sense to you?

Paulo

--
Glynn Clements <glynn@gclements.plus.com>

On Mon, Feb 2, 2015 at 3:40 PM, Paulo van Breugel <p.vanbreugel@gmail.com>
wrote:

On Mon, Feb 2, 2015 at 1:38 PM, Glynn Clements <glynn@gclements.plus.com>
wrote:

Paulo van Breugel wrote:

> > The main speed-up will come from "inlining" distance.mahalanobis(),
> > which is essentially:
> >
> > delta = u - v
> > m = np.dot(np.dot(delta, VI), delta)
> > return np.sqrt(m)
> >
> > np.dot(v, m) is equivalent to np.sum(v * m,axis=0), and np.dot(u,v) is
> > equivalent to np.sum(u * v), so the second line above is equivalent to
> >
> > m = np.sum(np.sum(delta[:,None] * VI * delta[None,:], axis=-1),
> > axis=-1)
>
>
> Smart! I can follow the logic, but I am not sure how to solve the problem
> below:
>
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> ValueError: operands could not be broadcast together with shapes
> (3,1,77,78) (3,3)
>
> Which refers to the different dimensions of delta and VI?

The first version of the code (which is quoted above) will only work
with 1-D arrays. It's just a fairly direct translation of the existing
distance.mahalanobis() function, given to explain how it can then be
extended to 3-D arrays (i.e. a 2-D array of 1-D vectors).

The second version:

> > VI = np.linalg.inv(covar)
> > delta = dat_ref - stat_mean[:,None,None]
> > m = np.sum(np.sum(delta[:,:,:,None] * delta[:,:,None,:] * VI[None,None,:,:],axis=-1),axis=-1)
> > stat_mah = np.sqrt(m)

should work with delta being a 3-D array.

Yes, but when running the lines above, I am getting the same error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes
(3,77,78,78) (1,1,3,3)

From what I can see from broadcasting rules, there is a mismatch in the last
and second-to-last dimensions of VI[None,None,:,:] compared to the other two?

delta[:,:,:,None].shape
(3, 77, 78, 1)

delta[:,:,None,:].shape
(3, 77, 1, 78)

VI[None,None,:,:].shape
(1, 1, 3, 3)

I really have to get a better grasp of working with these arrays, but
in any case, after a bit of trial and error, I came up with the following
to compute m:

m = np.sum(np.sum(delta[None,:,:,:] * VI[:,:,None,None],axis=1) * delta, axis=0)

In my limited testing, this gives the same result as using the loop with

pow(distance.mahalanobis(cell_ref, stat_mean, VI),2).

To be sure, does the above make sense to you?

About the speed-up: that is indeed rather significant. With three layers
as input (1,122,582 cells) my original function took 30 seconds, the one
above 0.23 seconds. Not bad :-)

Paulo

--
Glynn Clements <glynn@gclements.plus.com>

Paulo van Breugel wrote:

> The second version:
>
> > > VI = np.linalg.inv(covar)
> > > delta = dat_ref - stat_mean[:,None,None]
> > > m = np.sum(np.sum(delta[:,:,:,None] * delta[:,:,None,:] * VI[None,None,:,:],axis=-1),axis=-1)
> > > stat_mah = np.sqrt(m)
>
> should work with delta being a 3-D array.
>

Yes, but when running the lines above, I am getting the same error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes
(3,77,78,78) (1,1,3,3)

Well, it's not quite the same; both operands now have the same number
of dimensions, but they're not compatible (essentially, for each
dimension either both values should be the same or one of them should
be one).

From what I can see from broadcasting rules, there is a mismatch in the last
and second-to-last dimensions of VI[None,None,:,:] compared to the other two?

delta[:,:,:,None].shape
(3, 77, 78, 1)

delta[:,:,None,:].shape
(3, 77, 1, 78)

VI[None,None,:,:].shape
(1, 1, 3, 3)

The problem was that delta is 3x77x78 whereas the code assumed that it
would be 77x78x3.

I really have to get a better grasp of working with these arrays, but in
any case, after a bit of trial and error, I came up with the following to
compute m:

m = np.sum(np.sum(delta[None,:,:,:] * VI[:,:,None,None],axis=1) * delta, axis=0)

In my limited testing, this gives the same result as using the loop with

pow(distance.mahalanobis(cell_ref, stat_mean, VI),2).

To be sure, does the above make sense to you?

I believe that's correct.

--
Glynn Clements <glynn@gclements.plus.com>
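For anyone following along, here are two ways to cross-check the axis handling on a small synthetic example (an illustrative addition, not part of the exchange above): put the band axis last so the earlier 4-D broadcast formula lines up, or write the same quadratic form with np.einsum.

    import numpy as np
    from scipy.spatial import distance

    # small synthetic stand-in for the real data: 3 bands, 6x5 cells
    np.random.seed(42)
    dat_ref = np.random.rand(3, 6, 5)
    stat_mean = dat_ref.reshape(3, -1).mean(axis=1)
    VI = np.linalg.inv(np.cov(dat_ref.reshape(3, -1)))
    delta = dat_ref - stat_mean[:, None, None]

    # (a) band axis last, then the earlier 4-D broadcast formula applies
    d = delta.transpose(1, 2, 0)                      # (rows, cols, bands)
    m1 = np.sum(np.sum(d[:, :, :, None] * d[:, :, None, :]
                       * VI[None, None, :, :], axis=-1), axis=-1)

    # (b) the same quadratic form written with einsum
    m2 = np.einsum('aij,ab,bij->ij', delta, VI, delta)

    # (c) the loop-based reference: pow(distance.mahalanobis(...), 2)
    m3 = np.array([[distance.mahalanobis(dat_ref[:, i, j], stat_mean, VI) ** 2
                    for j in range(5)] for i in range(6)])

    print(np.allclose(m1, m2), np.allclose(m2, m3))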

Hi Glynn,

Your suggestions to compute the mahalanobis distance work perfectly. However, I have a follow-up question. For large data sets, I am getting a MemoryError when running "mahdist = np.sum(np.sum(delta[None,:,:,:] * VI[:,:,None,None],axis=1) * delta,axis=0)"

    # pro is a list with raster names
    s = len(pro)
    dat_pro = None
    layer = garray.array()
    r, c = layer.shape
    for i, map in enumerate(pro):
        layer.read(map, null=np.nan)
        if dat_pro is None:
            dat_pro = np.empty((s, r, c), dtype=np.double)
        dat_pro[i, :, :] = layer
    del layer

    delta = dat_pro - stat_mean[:, None, None]
    mahdist = np.sum(np.sum(delta[None, :, :, :] * VI[:, :, None, None], axis=1) * delta, axis=0)

    stat_mahal = garray.array()
    stat_mahal[...] = mahdist

I guess this is because the calculations are done in-memory? Any way to avoid this memory problem when using large data sets (something like working with memmap objects?)

Cheers,

Paulo


On Mon, Feb 2, 2015 at 7:19 PM, Glynn Clements <glynn@gclements.plus.com> wrote:

Paulo van Breugel wrote:

The second version:

VI = np.linalg.inv(covar)
delta = dat_ref - stat_mean[:,None,None]
m = np.sum(np.sum(delta[:,:,:,None] * delta[:,:,None,:] * VI[None,None,:,:],axis=-1),axis=-1)
stat_mah = np.sqrt(m)

should work with delta being a 3-D array.

Yes, but when running the lines above, I am getting the same error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes
(3,77,78,78) (1,1,3,3)

Well, it’s not quite the same; both operands now have the same number
of dimensions, but they’re not compatible (essentially, for each
dimension either both values should be the same or one of them should
be one).

From what I can see from broadcasting rules, there is a mismatch in the last
and second-to-last dimensions of VI[None,None,:,:] compared to the other two?

delta[:,:,:,None].shape
(3, 77, 78, 1)

delta[:,:,None,:].shape
(3, 77, 1, 78)

VI[None,None,:,:].shape
(1, 1, 3, 3)

The problem was that delta is 3x77x78 whereas the code assumed that it
would be 77x78x3.

I really have to get a better grasp of working with these arrays, but in
any case, after a bit of trial and error, I came up with the following to
compute m:

m = np.sum(np.sum(delta[None,:,:,:] * VI[:,:,None,None],axis=1) * delta, axis=0)

In my limited testing, this gives the same result as using the loop with

pow(distance.mahalanobis(cell_ref, stat_mean, VI),2).

To be sure, does the above make sense to you?

I believe that’s correct.


Glynn Clements <glynn@gclements.plus.com>

Dear Paulo,

On Fri, Feb 13, 2015 at 9:57 AM, Paulo van Breugel
<p.vanbreugel@gmail.com> wrote:

I guess this is because the calculations are done in-memory? Any way to
avoid this memory problem when using large data sets (something like working
with memmap objects?)

With memmap you still have a limit of 2 GB I guess; you should try dask:

Dask Array implements the NumPy ndarray interface using blocked
algorithms, cutting up the large array into many small arrays. This
lets us compute on arrays larger than memory using all of our cores.
We coordinate these blocked algorithms using dask graphs.

http://dask.readthedocs.org/en/latest/array.html

I didn't have a chance to try it yet, but it supports a numpy array
syntax, and since you are using quite basic functionality I think
you should be able to work with it.

All the best

Pietro
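A minimal sketch of what that could look like with dask.array (untested here; the helper name and the chunk size are placeholders to be tuned, and dat_pro, stat_mean and VI follow the conventions of the earlier snippets):

    import dask.array as da

    def mahalanobis_squared_dask(dat_pro, stat_mean, VI, chunk=1000):
        """dat_pro: (bands, rows, cols) numpy array; stat_mean: (bands,);
        VI: (bands, bands) inverse covariance; chunk: tile size in cells."""
        delta = da.from_array(dat_pro, chunks=(dat_pro.shape[0], chunk, chunk))
        delta = delta - stat_mean[:, None, None]
        # the same formula as before, evaluated block by block
        m = da.sum(da.sum(delta[None, :, :, :] * VI[:, :, None, None],
                          axis=1) * delta, axis=0)
        return m.compute()   # back to a plain numpy array for garray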

Hi Pietro,

Thanks for the suggestion, I will have a look at the documentation.

Paulo


On Fri, Feb 13, 2015 at 10:09 AM, Pietro <peter.zamb@gmail.com> wrote:

Dear Paulo,

On Fri, Feb 13, 2015 at 9:57 AM, Paulo van Breugel
<p.vanbreugel@gmail.com> wrote:

I guess this is because the calculations are done in-memory? Any way to
avoid this memory problem when using large data sets (something like working
with memmap objects?)

With memmap you still have a limit of 2 GB I guess; you should try dask:

Dask Array implements the NumPy ndarray interface using blocked
algorithms, cutting up the large array into many small arrays. This
lets us compute on arrays larger than memory using all of our cores.
We coordinate these blocked algorithms using dask graphs.

http://dask.readthedocs.org/en/latest/array.html

I didn't have a chance to try it yet, but it supports a numpy array
syntax, and since you are using quite basic functionality I think
you should be able to work with it.

All the best

Pietro

On Fri, Feb 13, 2015 at 10:13 AM, Paulo van Breugel <p.vanbreugel@gmail.com>
wrote:

Hi Pietro,

Thanks for the suggestion, I will have a look at the documentation.

Paulo

On Fri, Feb 13, 2015 at 10:09 AM, Pietro <peter.zamb@gmail.com> wrote:

Dear Paulo,

On Fri, Feb 13, 2015 at 9:57 AM, Paulo van Breugel
<p.vanbreugel@gmail.com> wrote:
> I guess this is because the calculations are done in-memory? Any way to
> avoid this memory problem when using large data sets (something like
working
> with memmap objects?)

With memmap you still have a limit of 2 GB I guess; you should try dask:

I was just reading the memmap manual page, where it says: "Memory-mapped arrays
use the Python memory-map object which (prior to Python 2.5) does not allow
files to be larger than a certain size depending on the platform. This size
is always < 2GB even on 64-bit systems." That is unclear to me; I am not
sure whether that limit is different or does not apply on Python 2.5 or
newer (what is the minimum python version for GRASS?)

Dask Array implements the NumPy ndarray interface using blocked
algorithms, cutting up the large array into many small arrays. This
lets us compute on arrays larger than memory using all of our cores.
We coordinate these blocked algorithms using dask graphs.

http://dask.readthedocs.org/en/latest/array.html

I didn't have a chance to try it yet, but it supports a numpy array
syntax, and since you are using quite basic functionality I think
you should be able to work with it.

All the best

Pietro

Paulo van Breugel wrote:

>> With memmap you still have a limit of 2 GB I guess; you should try dask:
>>
>
I was just reading the memmap manual page, where it says: "Memory-mapped arrays
use the Python memory-map object which (prior to Python 2.5) does not allow
files to be larger than a certain size depending on the platform. This size
is always < 2GB even on 64-bit systems." That is unclear to me; I am not
sure whether that limit is different or does not apply on Python 2.5 or
newer (what is the minimum python version for GRASS?)

Even if Python itself no longer imposes a 2 GiB limit, you would
probably need to be using a 64-bit platform, the system would need
enough memory for the operation, and it would need to allow the
process to use that much memory.

The problem here is that

mahdist = np.sum(np.sum(delta[None,:,:,:] * VI[:,:,None,None],axis=1) * delta,axis=0)

expands both delta and VI to 4-D arrays then calculates their product
element-wise, then sums them over two axes. The intermediate array of
products could potentially be very large.

It may be possible to avoid this issue using numpy.tensordot, e.g.

mahdist = np.sum(np.tensordot(VI, delta, (1,0)) * delta,axis=0)

I don't know for sure whether this actually uses less memory. It could
do, as it rearranges the matrices so that the sum-of-products is
calculated using np.dot, which is a built-in function.

--
Glynn Clements <glynn@gclements.plus.com>
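In the same spirit, a plain-numpy fallback is to process the rows in blocks so the intermediate products stay small. A sketch (the helper name is mine; delta and VI follow the conventions of the snippets above):

    import numpy as np

    def mahalanobis_squared(delta, VI, step=512):
        """delta: (bands, rows, cols) deviations from the mean;
        VI: (bands, bands) inverse covariance; step: rows per block."""
        rows, cols = delta.shape[1], delta.shape[2]
        out = np.empty((rows, cols), dtype=np.double)
        for r0 in range(0, rows, step):
            d = delta[:, r0:r0 + step, :]
            # sum_a d[a] * (sum_b VI[a,b] * d[b]), for this block only
            tmp = np.tensordot(VI, d, axes=(1, 0))
            out[r0:r0 + step, :] = np.sum(tmp * d, axis=0)
        return out

mahalanobis_squared(delta, VI) should then give the same result as the earlier one-liner, without ever materialising the full 4-D intermediate array.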

On Sat, Feb 14, 2015 at 4:38 AM, Glynn Clements <glynn@gclements.plus.com>
wrote:

Paulo van Breugel wrote:

> >> With memmap you still have a limit of 2Gb I guess, you should try: dask
> >>
> >
> Just reading the memmap manual page, where it reads: "Memory-mapped arrays
> use the Python memory-map object which (prior to Python 2.5) does not allow
> files to be larger than a certain size depending on the platform. This size
> is always < 2GB even on 64-bit systems.". Which is unclear to me; I am not
> sure if that means that this limit is different or does not apply when on
> Python 2.5 or newer (what is the minimum python version for GRASS?)

Even if Python itself no longer imposes a 2 GiB limit, you would
probably need to be using a 64-bit platform, the system would need
enough memory for the operation, and it would need to allow the
process to use that much memory.

The problem here is that

mahdist = np.sum(np.sum(delta[None,:,:,:] * VI[:,:,None,None],axis=1) * delta,axis=0)

expands both delta and VI to 4-D arrays then calculates their product
element-wise, then sums them over two axes. The intermediate array of
products could potentially be very large.

It may be possible to avoid this issue using numpy.tensordot, e.g.

mahdist = np.sum(np.tensordot(VI, delta, (1,0)) * delta,axis=0)

I don't know for sure whether this actually uses less memory. It could
do, as it rearranges the matrices so that the sum-of-products is
calculated using np.dot, which is a built-in function.

I did some quick testing; it seems to use slightly less memory. The dask
solution seems promising, with the disadvantage that it is not widely
available / still experimental according to the website.

For a quick solution, what about using r.tile to split the input data into
tiles and computing the mahalanobis distance per tile? Or would that come
with too much overhead or other clear disadvantages?

--
Glynn Clements <glynn@gclements.plus.com>

On Sat, Feb 14, 2015 at 11:47 AM, Paulo van Breugel <p.vanbreugel@gmail.com>
wrote:

For a quick solution, what about using r.tile to split the input data in
tiles and compute the mahalanobis distance per tile.

See the PyGRASS GridModule class, which will do tiling (and a lot of other
things) for you.

http://grass.osgeo.org/grass70/manuals/libpython/pygrass.modules.grid.html
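As a sketch of how that might look (purely hypothetical: "r.mahalanobis" stands in for a module or script wrapping the numpy code above, and its reference/output parameters are invented for the example):

    from grass.pygrass.modules.grid.grid import GridModule

    # no overlap is needed because the distance is computed per cell
    grd = GridModule('r.mahalanobis',
                     width=1024, height=1024, overlap=0, processes=4,
                     reference='bio1@clim,bio2@clim,bio3@clim',
                     output='mahalanobisMap', overwrite=True)
    grd.run()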

Hi Paulo,

you can see an implementation of the mahalanobis distance computed in
parallel (using all computer processors) for a single image here:

https://github.com/javimarlop/eHabpy/blob/master/ehab.py

see lines 32-86 and 597. I hope it helps! Cheers, Javier

On Sat, Feb 14, 2015 at 7:29 PM, Vaclav Petras <wenzeslaus@gmail.com> wrote:

On Sat, Feb 14, 2015 at 11:47 AM, Paulo van Breugel <p.vanbreugel@gmail.com>
wrote:

For a quick solution, what about using r.tile to split the input data in
tiles and compute the mahalanobis distance per tile.

See PyGRASS GridModule class which will do tiling (and a lot of other
things) for you.

http://grass.osgeo.org/grass70/manuals/libpython/pygrass.modules.grid.html


Hi Javier,

This looks really useful, thanks! My python skills are very limited, so I will have to look at it more closely to see what I can understand. But to start: does the parallel function also solve the handling of potentially very large raster layers?

Btw, this does look like a very useful set of functions; any plan to develop this into a GRASS addon or function?

Paulo


On Sun, Feb 15, 2015 at 9:22 PM, Javier Martínez-López <javi.martinez.lopez@gmail.com> wrote:

Hi Paulo,

you can see an implementation of the mahalanobis distance computed in
parallel (using all computer processors) for a single image here:

https://github.com/javimarlop/eHabpy/blob/master/ehab.py

see lines 32-86 and 597. I hope it helps! Cheers, Javier

On Sat, Feb 14, 2015 at 7:29 PM, Vaclav Petras <wenzeslaus@gmail.com> wrote:

On Sat, Feb 14, 2015 at 11:47 AM, Paulo van Breugel <p.vanbreugel@gmail.com>
wrote:

For a quick solution, what about using r.tile to split the input data in
tiles and compute the mahalanobis distance per tile.

See PyGRASS GridModule class which will do tiling (and a lot of other
things) for you.

http://grass.osgeo.org/grass70/manuals/libpython/pygrass.modules.grid.html



Hi Paulo,

To use it with python, just copy the mahalanobis function definitions,
import all necessary libraries and use it as in line 597 (the
covariance matrix and the mean are computed just before this line). I
am sorry that the code is not documented yet. I am not sure if this
will solve the large raster layers problem, but you can try it. As you
can see in the script, I did not write the parallel mahalanobis
functions (Sturla Molden in CC did it) and it works perfectly for me,
so I agree that it would be great to implement it also in GRASS (maybe
as an add-on?). However, I am not so experienced with that. The
ehab.py script also does a lot of data preprocessing to allow batch runs
using heterogeneous data sources, and finally performs some patch
analysis and metrics on the resulting similarity maps using the
ndimage library (scipy). Any help on how to implement it in GRASS
would be more than welcome! Cheers, Javier
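Not the ehab.py code itself, just a generic sketch of the row-block idea with multiprocessing (function names are mine; delta and VI follow the conventions of the earlier snippets):

    import numpy as np
    from multiprocessing import Pool

    def mahalanobis_block(args):
        d, VI = args                          # d: (bands, block_rows, cols)
        tmp = np.tensordot(VI, d, axes=(1, 0))
        return np.sum(tmp * d, axis=0)        # squared distances for the block

    def mahalanobis_parallel(delta, VI, nproc=4, step=256):
        blocks = [(delta[:, r:r + step, :], VI)
                  for r in range(0, delta.shape[1], step)]
        pool = Pool(nproc)
        try:
            parts = pool.map(mahalanobis_block, blocks)
        finally:
            pool.close()
            pool.join()
        return np.vstack(parts)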

On Sun, Feb 15, 2015 at 10:21 PM, Paulo van Breugel
<p.vanbreugel@gmail.com> wrote:

Hi Javier,

This looks really useful, thanks! My python skills are very limited, so I
will have to look at it better to see what I can understand. But to start,
the parallel function, does that also solve the handling of potentially very
large raster layers?

Btw, this does look like a very useful set of functions, any plan to develop
this into a GRASS addon or function?

Paulo

On Sun, Feb 15, 2015 at 9:22 PM, Javier Martínez-López
<javi.martinez.lopez@gmail.com> wrote:

Hi Paulo,

you can see an implementation of the mahalanobis distance computed in
parallel (using all computer processors) for a single image here:

https://github.com/javimarlop/eHabpy/blob/master/ehab.py

see lines 32-86 and 597. I hope it helps! Cheers, Javier

On Sat, Feb 14, 2015 at 7:29 PM, Vaclav Petras <wenzeslaus@gmail.com>
wrote:
>
> On Sat, Feb 14, 2015 at 11:47 AM, Paulo van Breugel
> <p.vanbreugel@gmail.com>
> wrote:
>>
>>
>> For a quick solution, what about using r.tile to split the input data
>> in
>> tiles and compute the mahalanobis distance per tile.
>
>
> See PyGRASS GridModule class which will do tiling (and a lot of other
> things) for you.
>
>
> http://grass.osgeo.org/grass70/manuals/libpython/pygrass.modules.grid.html
>