[GRASS-dev] matplotlib example script

I’ve attached an example script and output for using the MatPlotLib Python library in GRASS. This script replicates the functionality of d.histogram, but makes a much nicer looking plot. It also has more options, although I’ve included only a very few of the formatting possibilities here. The script sends the output from r.stats into 9 lines of code to produce the histogram with formatting options. Most of it is done by 4 lines. This can be run from the command line and does not require a GUI to be installed. It writes to a PNG file, but could easily be set to create a TclTk or wxPython interactive window as a user-selectable option.

The following command produces the attached histogram:

histogram_mpldemo.py input=elevation.10m bins=25 output=“/users/cmbarton/documents/myhistogram” fillcolor=20:20:200

Maybe I’ll try an xy graph of 2 maps/images with density plot options next.

To run this, you need to install MatPlotLib <http://matplotlib.sourceforge.net/>, which requires Numpy http://numpy.scipy.org/. Both are installable as binary eggs via the Python easy_install package manager or from most Linux package managers. Of course you need Python too.

Enjoy
Michael


C. Michael Barton, Professor of Anthropology
Director of Graduate Studies
School of Human Evolution & Social Change
Center for Social Dynamics & Complexity
Arizona State University

Phone: 480-965-6262
Fax: 480-965-7671
www: <www.public.asu.edu/~cmbarton>

(attachments)

histogram_mpldemo.py (4.34 KB)

Michael Barton wrote:

I've attached an example script and output for using the MatPlotLib
Python library in GRASS.

One of the Apple Mail developers needs a dose of the clue-stick. The
script (and image) were attached to the HTML alternative of the
multipart/alternative section, so they don't exist for anyone using
the plain-text alternative.

This script replicates the functionality of
d.histogram, but makes a much nicer looking plot. It also has more
options, although I've included only a very few of the formatting
possibilities here. The script sends the output from r.stats into 9
lines of code to produce the histogram with formatting options. Most
of it is done by 4 lines. This can be run from the command line and
does not require a GUI to be installed. It writes to a PNG file, but
could easily be set to create a TclTk or wxPython interactive window
as a user-selectable option.

Before we start using MatPlotLib extensively, we need to figure out
how to handle output format selection so that all script can work with
any backend, and the scripts can be used alongside d.* commands
without the caller needing to distingiush between them.

Or, at least, we need to figure out an interface that will allow us to
do that later without having to modify large numbers of scripts. IOW,
somthing like a grass.plot module which hides the details behind
begin() and end() methods, so the scripts themselves don't reference
specific format types.

A couple of brief comments on the code:

if os.environ['PYTHONPATH'] != None:
    os.environ['PYTHONPATH'] = '/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages:'+os.environ['PYTHONPATH']

This doesn't belong here.

    mapstats = grass.read_command('r.stats', input = options['input'], fs = ',',
                                    nsteps = '255', flags = 'Acn')
    histlist = mapstats.splitlines()
    for pair in histlist:

Ideally, read_command() shouldn't used for anything which might create
a lot of output.

For processing "bulk" output, iterate over stdout, with e.g.:

  mapstats_p = grass.start_command('r.stats', input = options['input'], fs = ',',
                               nsteps = '255', flags = 'Acn', stdout = subprocess.PIPE)
  for pair in mapstats_p.stdout:
          try:
              pairlist = pair.strip().split(',')
  ...

  mapstats_p.wait()

[Note the added strip(); the "for line in file" idiom doesn't strip the
line terminator from the line.]

As that's likely to be quite common, I have added grass.pipe_command(),
which is like start_command() but adds the "stdout = subprocess.PIPE"
automatically, so you can use:

  mapstats_p = grass.pipe_command('r.stats', input = options['input'], fs = ',',
                               nsteps = '255', flags = 'Acn')
  for pair in mapstats_p.stdout:
  ...

It returns the Popen object rather than the pipe itself, as you need
the Popen object for the wait() call (even if you don't want the exit
code, the child process will remain as a zombie until you call wait()).

            for i in range(cells):
                plotlist.append(val)

Ouch. This is going to cause problems (i.e. fail) for large maps (even
for moderately-sized maps, it will be slow). As it stands, this script
*isn't* a substitute for d.histogram.

As axes.hist() just calls numpy.histogram() then uses axes.bar() (or
axes.barh()) to plot the result, you may as well just use axes.bar()
directly on the value,count pairs emitted by r.stats.

If you specifically want to bin the data, it would be better to either
add an option to r.stats, or to bin the data in the script as it is
being read, rather than read an entire raster map into a Python list.

        if len(fcolor.split(':')) == 3:
            #rgb color
            r = float(fcolor.split(':')[0])
            g = float(fcolor.split(':')[1])
            b = float(fcolor.split(':')[2])
            #translate to mpl rgb format
            r = r / 255
            g = g / 255
            b = b / 255
            fcolor = (r,g,b)

This should probably be made into a library function; it would also
need the named colours (lib/gis/named_colr.c).

    if options['linecolor'] != None:
        lcolor = options['linecolor']
    else: fcolor = 'k'

The else clause should presumably be lcolor. Does this not also need to
handle R:G:B?

--
Glynn Clements <glynn@gclements.plus.com>

Thanks for the input Glynn. A few responses below.

On Jul 24, 2008, at 12:24 PM, Glynn Clements wrote:

Michael Barton wrote:

I've attached an example script and output for using the MatPlotLib
Python library in GRASS.

One of the Apple Mail developers needs a dose of the clue-stick. The
script (and image) were attached to the HTML alternative of the
multipart/alternative section, so they don't exist for anyone using
the plain-text alternative.

Hmmm. I might be able to help that, but it's hard to remember to switch the way the program works between people here who expect html-like and plain text.

This script replicates the functionality of
d.histogram, but makes a much nicer looking plot. It also has more
options, although I've included only a very few of the formatting
possibilities here. The script sends the output from r.stats into 9
lines of code to produce the histogram with formatting options. Most
of it is done by 4 lines. This can be run from the command line and
does not require a GUI to be installed. It writes to a PNG file, but
could easily be set to create a TclTk or wxPython interactive window
as a user-selectable option.

Before we start using MatPlotLib extensively, we need to figure out
how to handle output format selection so that all script can work with
any backend, and the scripts can be used alongside d.* commands
without the caller needing to distingiush between them.

Or, at least, we need to figure out an interface that will allow us to
do that later without having to modify large numbers of scripts. IOW,
somthing like a grass.plot module which hides the details behind
begin() and end() methods, so the scripts themselves don't reference
specific format types.

I agree. One possibility is to use the ps backend if you are indeed planning to switch the output to postscript. In this example, I just wanted to show that TclTk, wxPython, Qt or other GUI is not required.

A couple of brief comments on the code:

if os.environ['PYTHONPATH'] != None:
   os.environ['PYTHONPATH'] = '/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages:'+os.environ['PYTHONPATH']

This doesn't belong here.

Forgot to take this out. This is an issue that my system is having and not only does it not belong in this script, it doesn't help inside this script.

   mapstats = grass.read_command('r.stats', input = options['input'], fs = ',',
                                   nsteps = '255', flags = 'Acn')
   histlist = mapstats.splitlines()
   for pair in histlist:

Ideally, read_command() shouldn't used for anything which might create
a lot of output.

For processing "bulk" output, iterate over stdout, with e.g.:

  mapstats_p = grass.start_command('r.stats', input = options['input'], fs = ',',
                               nsteps = '255', flags = 'Acn', stdout = subprocess.PIPE)
  for pair in mapstats_p.stdout:
          try:
             pairlist = pair.strip().split(',')
  ...

  mapstats_p.wait()

[Note the added strip(); the "for line in file" idiom doesn't strip the
line terminator from the line.]

The strip() is probably unnecessary; I have a habit of worrying about extraneous spaces causing formating problems.

As that's likely to be quite common, I have added grass.pipe_command(),
which is like start_command() but adds the "stdout = subprocess.PIPE"
automatically, so you can use:

  mapstats_p = grass.pipe_command('r.stats', input = options['input'], fs = ',',
                               nsteps = '255', flags = 'Acn')
  for pair in mapstats_p.stdout:
  ...

It returns the Popen object rather than the pipe itself, as you need
the Popen object for the wait() call (even if you don't want the exit
code, the child process will remain as a zombie until you call wait()).

Thanks.

           for i in range(cells):
               plotlist.append(val)

Ouch. This is going to cause problems (i.e. fail) for large maps (even
for moderately-sized maps, it will be slow). As it stands, this script
*isn't* a substitute for d.histogram.

Well, I was worried about that too. However, even with the 10m DEM in spearfish, the main time is spent in running r.stats. The rest of the code takes no appreciable time to run. I should try it on a big Terra/ASTER or Landsat ETM file and see how long it takes.

The fastest thing probably is to just read the GRASS histogram file. But this doesn't permit changing bin sizes or other plotting options.

As axes.hist() just calls numpy.histogram() then uses axes.bar() (or
axes.barh()) to plot the result, you may as well just use axes.bar()
directly on the value,count pairs emitted by r.stats.

If you specifically want to bin the data, it would be better to either
add an option to r.stats, or to bin the data in the script as it is
being read, rather than read an entire raster map into a Python list.

I agree about r.stats. Wouldn't binning in the script take as long as the current method?

       if len(fcolor.split(':')) == 3:
           #rgb color
           r = float(fcolor.split(':')[0])
           g = float(fcolor.split(':')[1])
           b = float(fcolor.split(':')[2])
           #translate to mpl rgb format
           r = r / 255
           g = g / 255
           b = b / 255
           fcolor = (r,g,b)

This should probably be made into a library function; it would also
need the named colours (lib/gis/named_colr.c).

Yes. And it might be one in MatPlotLib. It has some color function, but I haven't explored them yet. Do the grass color names have html equivalents? If so, we don't need /lib/gis/named_colr.c.

   if options['linecolor'] != None:
       lcolor = options['linecolor']
   else: fcolor = 'k'

The else clause should presumably be lcolor. Does this not also need to
handle R:G:B?

You are right.

Thanks again
Michael

Michael Barton wrote:

>> This script replicates the functionality of
>> d.histogram, but makes a much nicer looking plot. It also has more
>> options, although I've included only a very few of the formatting
>> possibilities here. The script sends the output from r.stats into 9
>> lines of code to produce the histogram with formatting options. Most
>> of it is done by 4 lines. This can be run from the command line and
>> does not require a GUI to be installed. It writes to a PNG file, but
>> could easily be set to create a TclTk or wxPython interactive window
>> as a user-selectable option.
>
> Before we start using MatPlotLib extensively, we need to figure out
> how to handle output format selection so that all script can work with
> any backend, and the scripts can be used alongside d.* commands
> without the caller needing to distingiush between them.
>
> Or, at least, we need to figure out an interface that will allow us to
> do that later without having to modify large numbers of scripts. IOW,
> somthing like a grass.plot module which hides the details behind
> begin() and end() methods, so the scripts themselves don't reference
> specific format types.

I agree. One possibility is to use the ps backend if you are indeed
planning to switch the output to postscript. In this example, I just
wanted to show that TclTk, wxPython, Qt or other GUI is not required.

Well, the general idea is to support as wide a range of backends as
possible. Essentially, we want something similar to the d.* commands,
which will output to a variety of backends, even those which are yet
to be invented.

Also, if we can't find any better way to integrate d.* commands and
matplotlib, we may end up adding a d.graph backend to matplotlib.

>> for i in range(cells):
>> plotlist.append(val)
>
> Ouch. This is going to cause problems (i.e. fail) for large maps (even
> for moderately-sized maps, it will be slow). As it stands, this script
> *isn't* a substitute for d.histogram.

Well, I was worried about that too. However, even with the 10m DEM in
spearfish, the main time is spent in running r.stats. The rest of the
code takes no appreciable time to run. I should try it on a big Terra/
ASTER or Landsat ETM file and see how long it takes.

Here:

$ g.region rast=elevation.10m
$ time r.stats -Acn elevation.10m >/dev/null
r.stats complete.

real 0m1.233s
user 0m1.220s
sys 0m0.013s
time stuff/histogram_mpldemo.py input=elevation.10m
r.stats complete.

real 0m5.058s
user 0m4.886s
sys 0m0.163s

Note that the time taken to get to the "r.stats complete" point
includes the above loop to insert "cells" copies of "val" in
"plotlist".

Also, r.stats alone is only using one core, while the script will be
using both (one running the script, another running r.stats
concurrently). With a single core (or if the other core(s) are busy),
it would be worse.

> As axes.hist() just calls numpy.histogram() then uses axes.bar() (or
> axes.barh()) to plot the result, you may as well just use axes.bar()
> directly on the value,count pairs emitted by r.stats.
>
> If you specifically want to bin the data, it would be better to either
> add an option to r.stats, or to bin the data in the script as it is
> being read, rather than read an entire raster map into a Python list.

I agree about r.stats. Wouldn't binning in the script take as long as
the current method?

Even if it takes just as long, you're less likely to have it fail
because "plotlist" consumed all available RAM. As it stands, plotlist
will have one entry for every non-null cell in the raster.

When processing "bulk" data, anything that uses a fixed amount of
memory (e.g. one integer per bin) is preferable to using memory
proportional to the size of the input.

Hence the recommendation to iterate over the lines of r.stats' output
rather than read it all into a list then iterate over the list.

>> if len(fcolor.split(':')) == 3:
>> #rgb color
>> r = float(fcolor.split(':')[0])
>> g = float(fcolor.split(':')[1])
>> b = float(fcolor.split(':')[2])
>> #translate to mpl rgb format
>> r = r / 255
>> g = g / 255
>> b = b / 255
>> fcolor = (r,g,b)
>
> This should probably be made into a library function; it would also
> need the named colours (lib/gis/named_colr.c).

Yes. And it might be one in MatPlotLib. It has some color function,
but I haven't explored them yet. Do the grass color names have html
equivalents? If so, we don't need /lib/gis/named_colr.c.

I suspect that we want our own code here. Eventually there may be
Python scripts which don't use matplotlib (i.e. non-d.* scripts) but
which still need to parse colours.

I've added this to grass.py:

  def parse_color(val, dflt = None):
      if val in named_colors:
          return named_colors[val]
  
      vals = val.split(':')
      if len(vals) == 3:
          return tuple(float(v) / 255 for v in vals)
  
      return dflt

[along with the named_colors dictionary.]

--
Glynn Clements <glynn@gclements.plus.com>

On Jul 24, 2008, at 6:03 PM, Glynn Clements wrote:

Michael Barton wrote:

This script replicates the functionality of
d.histogram, but makes a much nicer looking plot. It also has more
options, although I've included only a very few of the formatting
possibilities here. The script sends the output from r.stats into 9
lines of code to produce the histogram with formatting options. Most
of it is done by 4 lines. This can be run from the command line and
does not require a GUI to be installed. It writes to a PNG file, but
could easily be set to create a TclTk or wxPython interactive window
as a user-selectable option.

Before we start using MatPlotLib extensively, we need to figure out
how to handle output format selection so that all script can work with
any backend, and the scripts can be used alongside d.* commands
without the caller needing to distingiush between them.

Or, at least, we need to figure out an interface that will allow us to
do that later without having to modify large numbers of scripts. IOW,
somthing like a grass.plot module which hides the details behind
begin() and end() methods, so the scripts themselves don't reference
specific format types.

I agree. One possibility is to use the ps backend if you are indeed
planning to switch the output to postscript. In this example, I just
wanted to show that TclTk, wxPython, Qt or other GUI is not required.

Well, the general idea is to support as wide a range of backends as
possible. Essentially, we want something similar to the d.* commands,
which will output to a variety of backends, even those which are yet
to be invented.

Also, if we can't find any better way to integrate d.* commands and
matplotlib, we may end up adding a d.graph backend to matplotlib.

          for i in range(cells):
              plotlist.append(val)

Ouch. This is going to cause problems (i.e. fail) for large maps (even
for moderately-sized maps, it will be slow). As it stands, this script
*isn't* a substitute for d.histogram.

Well, I was worried about that too. However, even with the 10m DEM in
spearfish, the main time is spent in running r.stats. The rest of the
code takes no appreciable time to run. I should try it on a big Terra/
ASTER or Landsat ETM file and see how long it takes.

Here:

$ g.region rast=elevation.10m
$ time r.stats -Acn elevation.10m >/dev/null
r.stats complete.

real 0m1.233s
user 0m1.220s
sys 0m0.013s
time stuff/histogram_mpldemo.py input=elevation.10m
r.stats complete.

real 0m5.058s
user 0m4.886s
sys 0m0.163s

Note that the time taken to get to the "r.stats complete" point
includes the above loop to insert "cells" copies of "val" in
"plotlist".

Also, r.stats alone is only using one core, while the script will be
using both (one running the script, another running r.stats
concurrently). With a single core (or if the other core(s) are busy),
it would be worse.

As axes.hist() just calls numpy.histogram() then uses axes.bar() (or
axes.barh()) to plot the result, you may as well just use axes.bar()
directly on the value,count pairs emitted by r.stats.

If you specifically want to bin the data, it would be better to either
add an option to r.stats, or to bin the data in the script as it is
being read, rather than read an entire raster map into a Python list.

I agree about r.stats. Wouldn't binning in the script take as long as
the current method?

Even if it takes just as long, you're less likely to have it fail
because "plotlist" consumed all available RAM. As it stands, plotlist
will have one entry for every non-null cell in the raster.

When processing "bulk" data, anything that uses a fixed amount of
memory (e.g. one integer per bin) is preferable to using memory
proportional to the size of the input.

Hence the recommendation to iterate over the lines of r.stats' output
rather than read it all into a list then iterate over the list.

I'll give that a try.

      if len(fcolor.split(':')) == 3:
          #rgb color
          r = float(fcolor.split(':')[0])
          g = float(fcolor.split(':')[1])
          b = float(fcolor.split(':')[2])
          #translate to mpl rgb format
          r = r / 255
          g = g / 255
          b = b / 255
          fcolor = (r,g,b)

This should probably be made into a library function; it would also
need the named colours (lib/gis/named_colr.c).

Yes. And it might be one in MatPlotLib. It has some color function,
but I haven't explored them yet. Do the grass color names have html
equivalents? If so, we don't need /lib/gis/named_colr.c.

I suspect that we want our own code here. Eventually there may be
Python scripts which don't use matplotlib (i.e. non-d.* scripts) but
which still need to parse colours.

I've added this to grass.py:

  def parse_color(val, dflt = None):
      if val in named_colors:
          return named_colors[val]
  
      vals = val.split(':')
      if len(vals) == 3:
          return tuple(float(v) / 255 for v in vals)
  
      return dflt

[along with the named_colors dictionary.]

I'd added it as a method in the script, but can dispense with that now.

Michael

On Jul 24, 2008, at 6:03 PM, Glynn Clements wrote:

Even if it takes just as long, you're less likely to have it fail
because "plotlist" consumed all available RAM. As it stands, plotlist
will have one entry for every non-null cell in the raster.

When processing "bulk" data, anything that uses a fixed amount of
memory (e.g. one integer per bin) is preferable to using memory
proportional to the size of the input.

Hence the recommendation to iterate over the lines of r.stats' output
rather than read it all into a list then iterate over the list.

To do a histogram, I need to send ax.hist a list of values. So I don't know how I can get away without creating that list unless I use a completely different algorithm (something from numpy?).

This probably isn't a very good example, as there may be other more efficient ways to get the data to actually create the histogram. It's more a way to try this out and see if it is useful--and what needs to be done to use it.

On the other hand, in spite of recent improvements, d.hist is still pretty ugly, with formatting issues like varying font sizes on a single axis. And it is not very flexible. It would be nice to see where standard deviations lie, customize axis formatting, etc. For this module in particular, it is probably better to bin the data in another way than I have, and input it into one of the matplotlib plot methods.

Here are the results of trying this on a larger image (Terra/ASTER) with almost 30 million cells. The script takes about 2.5X as long as d.histogram (which surprising to me is also using r.stats internally). It still isn't bad at 15 seconds for an unoptimized first shot.

GRASS 7.0.svn (Spain_wgs84z30):~ > r.info ASTER.0536.vnir.3 +----------------------------------------------------------------------------+
  | Layer: ASTER.0536.vnir.3 Date: Sun Aug 17 23:08:30 2003 |
  | Mapset: aster Login of Creator: cmbarton |
  | Location: Spain_wgs84z30 |
  | DataBase: /Users/Shared/grassdata |
  | Title: ( ASTER.vnir.0536.3 ) |
  | Timestamp: none |
  |----------------------------------------------------------------------------|
  | |
  | Type of Map: raster Number of Categories: 252 |
  | Data Type: CELL |
  | Rows: 5175 |
  | Columns: 5787 |
  | Total Cells: 29947725 |
  | Projection: UTM (zone 30) |
  | N: 4423719.5 S: 4346094.5 Res: 15 |
  | E: 774409.5 W: 687604.5 Res: 15 |
  | Range of data: min = 5 max = 252 |
  | |
  | Data Description: |
  | generated by i.in.erdas |
  | |
  | |
  +----------------------------------------------------------------------------+

GRASS 7.0.svn (Spain_wgs84z30):~ > time r.stats -Acn ASTER.0536.vnir.3 >/dev/null
  100%
r.stats complete.

real 0m2.774s
user 0m2.418s
sys 0m0.122s
GRASS 7.0.svn (Spain_wgs84z30):~ > time histogram_mpldemo.py input=ASTER.0536.vnir.3 output=testaster
  100%
r.stats complete.

real 0m14.957s
user 0m11.938s
sys 0m1.443s

GRASS 7.0.svn (Spain_wgs84z30):~ > time d.histogram ASTER.0536.vnir.3 >/dev/null 100%
r.stats complete.

real 0m5.811s
user 0m2.440s
sys 0m0.144s

Michael

Hi Michael,

I tried it in Debian/Sid and here is what I got, a strange permission error....

Any idea/suggestion?

------------------------------

GRASS 7.0.svn (spearfish60):~ > python histogram_mpldemo.py
input=elevation.10m bins=25 output="~/myhistogram" fillcolor=20:20:200
start
execl() failed: Permission denied
GRASS 7.0.svn (spearfish60):~ >

------------------------------

thanks
Yann

2008/7/25 Yann Chemin <yann.chemin@gmail.com>:

Hi Michael,

Nice histogram!
I am going to get python compiled and try it here...
Yann

--
Yann Chemin
International Rice Research Institute
Office: http://www.irri.org/gis
Perso: http://www.freewebs.com/ychemin

Yann Chemin wrote:

I tried it in Debian/Sid and here is what I got, a strange permission error....

Any idea/suggestion?

  chmod +x histogram_mpldemo.py
  ./histogram_mpldemo.py

Even if you invoke the python interpreter explicitly, the script will
run g.parser, which will re-execute the script directly, so it needs
to be executable. This is true for any script which uses g.parser.

--
Glynn Clements <glynn@gclements.plus.com>

Michael Barton wrote:

> Even if it takes just as long, you're less likely to have it fail
> because "plotlist" consumed all available RAM. As it stands, plotlist
> will have one entry for every non-null cell in the raster.
>
> When processing "bulk" data, anything that uses a fixed amount of
> memory (e.g. one integer per bin) is preferable to using memory
> proportional to the size of the input.
>
> Hence the recommendation to iterate over the lines of r.stats' output
> rather than read it all into a list then iterate over the list.

To do a histogram, I need to send ax.hist a list of values. So I don't
know how I can get away without creating that list unless I use a
completely different algorithm (something from numpy?).

Don't use axes.hist(), use axes.bar(). I.e. leave the calculations to
GRASS, and only use matplotlib for plotting.

ax.hist() and numpy.histogram() are broken by design. They should
accept an iterator as an argument. Requiring the entire data to be
passed as a list makes them useless for large amounts of data.

If we were to impose a requirement that maps must fit into memory,
writing GRASS modules would be significantly simpler. GRASS would also
be significantly less useful.

On the other hand, in spite of recent improvements, d.hist is still
pretty ugly, with formatting issues like varying font sizes on a
single axis. And it is not very flexible. It would be nice to see
where standard deviations lie, customize axis formatting, etc. For
this module in particular, it is probably better to bin the data in
another way than I have, and input it into one of the matplotlib plot
methods.

Actually, for statistical information, the most important feature is
the ability output data in formats that are useful to *real*
statistical software. There must be packages out there (even free
ones) which will do a far better job than anything we are going to
provide.

Sure, we can provide "basic" functionality, but where do you draw the
line? Which features of a "real" statistics package *wouldn't* be
useful in GRASS? I'm really quite worried about the potential for
feature creep in this area.

And library of plotting functions isn't a statistics package. You want
something where GRASS hands over the data and forgets about it. We
shouldn't be responsible for communicating from the user to the
software details such as what type of graph to draw, or the colours or
symbols or whatever.

--
Glynn Clements <glynn@gclements.plus.com>

On Jul 25, 2008, at 11:59 AM, Glynn Clements wrote:

Michael Barton wrote:

Even if it takes just as long, you're less likely to have it fail
because "plotlist" consumed all available RAM. As it stands, plotlist
will have one entry for every non-null cell in the raster.

When processing "bulk" data, anything that uses a fixed amount of
memory (e.g. one integer per bin) is preferable to using memory
proportional to the size of the input.

Hence the recommendation to iterate over the lines of r.stats' output
rather than read it all into a list then iterate over the list.

To do a histogram, I need to send ax.hist a list of values. So I don't
know how I can get away without creating that list unless I use a
completely different algorithm (something from numpy?).

Don't use axes.hist(), use axes.bar(). I.e. leave the calculations to
GRASS, and only use matplotlib for plotting.

Fine and probably faster. But currently r.stats will only bin fp maps.

ax.hist() and numpy.histogram() are broken by design. They should
accept an iterator as an argument. Requiring the entire data to be
passed as a list makes them useless for large amounts of data.

Well. For *really* large amounts of data I suppose. And indeed sometimes people have Gb maps to work with. However, 15 sec. for histogramming a 30 million cell, 23Mb ASTER file isn't too bad. As you say, it would faster to use C modules in GRASS for binning if we had them.

If we were to impose a requirement that maps must fit into memory,
writing GRASS modules would be significantly simpler. GRASS would also
be significantly less useful.

On the other hand, in spite of recent improvements, d.hist is still
pretty ugly, with formatting issues like varying font sizes on a
single axis. And it is not very flexible. It would be nice to see
where standard deviations lie, customize axis formatting, etc. For
this module in particular, it is probably better to bin the data in
another way than I have, and input it into one of the matplotlib plot
methods.

Actually, for statistical information, the most important feature is
the ability output data in formats that are useful to *real*
statistical software. There must be packages out there (even free
ones) which will do a far better job than anything we are going to
provide.

Certainly. But there is a convenience factor too. r.univar is an example. If I want to do general purpose stats, I do dump to a stat program. But some stats may be used regularly enough that it makes sense to do them internally. And it is useful to be able to get and obtain some stats that can be piped into other scripts or modules.

Sure, we can provide "basic" functionality, but where do you draw the
line? Which features of a "real" statistics package *wouldn't* be
useful in GRASS? I'm really quite worried about the potential for
feature creep in this area.

I guess I'd leave it up to the user/developer base to decide what kinds of functionality we need. If it is something regularly necessary, someone will probably craft a script for it. If this is really useful, it might get translated into C-code. Visualization and spatial analysis is an important part of GIS functionality to me. In fact, a lot of graphing programs and stat packages would have difficulty in dealing with 20, 50, or 100 million points. MatPlotLib (or something like it) seems like a good tool to have available for development. It's also an encouragement for people to begin to develop scripts in Python.

And library of plotting functions isn't a statistics package. You want
something where GRASS hands over the data and forgets about it. We
shouldn't be responsible for communicating from the user to the
software details such as what type of graph to draw, or the colours or
symbols or whatever.

For some things like histogramming, I think you're right. there really isn't much point in even being able to change the color. But it should look nice enough to publish. For other graphing, there may be use for more user control. The profile graphing module in the wxGUI probably has more user options than needed. I admit to using it to see what was possible.

However, it's nice if there are easy programming options for creating nice graphs where these are useful. And GRASS currently lacks such tools AFAICT. d.graph is really primitive. It seems better to simply use a nice, pre-existing graphing library (like we do with gdal and proj), rather than continue to try to create our own. Maybe there are better graphing libraries for GRASS than matplotlib. It doesn't do everything, but does contain sufficient functionality to cover most of the graphing needs for a high-end GIS. And it's pretty easy to work with.

Michael

Hi,

after chmod +x, it created a nice histogram of elevation.10m.
So I tried on floating point image created by some process.

here is the problem:
Traceback (most recent call last):
  File "/home/yann/histogram_mpldemo.py", line 173, in <module>
    main()
  File "/home/yann/histogram_mpldemo.py", line 159, in main
    alpha=0.75)
  File "/usr/lib/python2.5/site-packages/matplotlib/axes.py", line 6035, in hist
    normed=bool(normed), new=True)
  File "/usr/lib/python2.5/site-packages/numpy/lib/function_base.py",
line 236, in histogram
    range = (a.min(), a.max())
ValueError: zero-size array to ufunc.reduce without identity

this happens because the data has NAN (see r.info output):
Range of data: min = nan max = nan

Is there a way to discard NAN in a.min() and a.max() calculations?
Or is there a NAN-resistant mode in matplotlib?

Yann

The code is pretty simple. I would guess that NAN data could be discarded when making the list that is histogrammed. I'm kind of surprised that it shows up as r.stats is run to discard nulls. But I guess these aren't seen as nulls by r.stats.

Michael
______________________________
C. Michael Barton, Professor of Anthropology
Director of Graduate Studies
School of Human Evolution & Social Change
Center for Social Dynamics & Complexity
Arizona State University
Tempe, AZ 85287-2402
USA

voice: 480-965-6262; fax: 480-965-7671
www: http://www.public.asu.edu/~cmbarton

On Jul 25, 2008, at 4:09 PM, Yann Chemin wrote:

Hi,

after chmod +x, it created a nice histogram of elevation.10m.
So I tried on floating point image created by some process.

here is the problem:
Traceback (most recent call last):
File "/home/yann/histogram_mpldemo.py", line 173, in <module>
   main()
File "/home/yann/histogram_mpldemo.py", line 159, in main
   alpha=0.75)
File "/usr/lib/python2.5/site-packages/matplotlib/axes.py", line 6035, in hist
   normed=bool(normed), new=True)
File "/usr/lib/python2.5/site-packages/numpy/lib/function_base.py",
line 236, in histogram
   range = (a.min(), a.max())
ValueError: zero-size array to ufunc.reduce without identity

this happens because the data has NAN (see r.info output):
Range of data: min = nan max = nan

Is there a way to discard NAN in a.min() and a.max() calculations?
Or is there a NAN-resistant mode in matplotlib?

Yann

Yann,

Have you tried this on a dataset without NAN values?

Michael
______________________________
C. Michael Barton, Professor of Anthropology
Director of Graduate Studies
School of Human Evolution & Social Change
Center for Social Dynamics & Complexity
Arizona State University
Tempe, AZ 85287-2402
USA

voice: 480-965-6262; fax: 480-965-7671
www: http://www.public.asu.edu/~cmbarton

On Jul 25, 2008, at 4:09 PM, Yann Chemin wrote:

Hi,

after chmod +x, it created a nice histogram of elevation.10m.
So I tried on floating point image created by some process.

here is the problem:
Traceback (most recent call last):
File "/home/yann/histogram_mpldemo.py", line 173, in <module>
   main()
File "/home/yann/histogram_mpldemo.py", line 159, in main
   alpha=0.75)
File "/usr/lib/python2.5/site-packages/matplotlib/axes.py", line 6035, in hist
   normed=bool(normed), new=True)
File "/usr/lib/python2.5/site-packages/numpy/lib/function_base.py",
line 236, in histogram
   range = (a.min(), a.max())
ValueError: zero-size array to ufunc.reduce without identity

this happens because the data has NAN (see r.info output):
Range of data: min = nan max = nan

Is there a way to discard NAN in a.min() and a.max() calculations?
Or is there a NAN-resistant mode in matplotlib?

Yann

Michael Barton wrote:

> ax.hist() and numpy.histogram() are broken by design. They should
> accept an iterator as an argument. Requiring the entire data to be
> passed as a list makes them useless for large amounts of data.

Well. For *really* large amounts of data I suppose. And indeed
sometimes people have Gb maps to work with. However, 15 sec. for
histogramming a 30 million cell, 23Mb ASTER file isn't too bad. As you
say, it would faster to use C modules in GRASS for binning if we had
them.

Things can be written if there is a need for them. But doing things
"right" is more important than doing things "right now".

Many of the current problems with GRASS are due to cases where
expediency was allowed to win out over good design (or *any* design).

[As to the specific task of binning, you could probably get a suitable
tool within 5 minutes by scavenging from r.quantile.]

> Sure, we can provide "basic" functionality, but where do you draw the
> line? Which features of a "real" statistics package *wouldn't* be
> useful in GRASS? I'm really quite worried about the potential for
> feature creep in this area.

I guess I'd leave it up to the user/developer base to decide what
kinds of functionality we need. If it is something regularly
necessary, someone will probably craft a script for it. If this is
really useful, it might get translated into C-code. Visualization and
spatial analysis is an important part of GIS functionality to me. In
fact, a lot of graphing programs and stat packages would have
difficulty in dealing with 20, 50, or 100 million points. MatPlotLib
(or something like it) seems like a good tool to have available for
development. It's also an encouragement for people to begin to develop
scripts in Python.

It's also far too low-level. If you start writing lots of scripts
which call matplotlib directly, you will quickly end up with a
situation where one script lets you set the axis colour and the label
font but not the symbols, another lets you control the symbols but not
the font, etc.

[I'm using stylistic properties here for simplicity; there are
probably other properties which are far more important, e.g. being
able to zoom, select log/linear scaling, etc.]

If you can't find or aren't happy with existing graphing libraries,
then write one. But don't require each script to include its own
written-from-scratch-in-one-hour graphing library.

What is required is something at a level where the script generates
some data then indicates that the data should be displayed as a
scatter plot. The library should take care of the rest (scales, sizes,
colours, ...) without the script having to explicitly read options
from the user and pass them to matplotlib.

Are you going to hard-code the fonts, colours, line width, symbols,
etc, or allow the user to set them? Assuming the latter, are you going
to force them to specify font= linecolor= textcolor= etc for every
command that they type, or allow them to set defaults? Assuming the
latter, how will the script obtain this information?

Also, are these scripts going to be usable from within the GUI? (I.e.
display the graphics in a window created by the GUI, not just dump a
PNG file to the disk). How are the file format, dimensions, filename
etc communicated between the two?

Will the script be able to communicate the set of available display
attributes to the GUI (in such a way that it can distinguish "what to
display" from "how to display it")?

This, in a nutshell, is the difference between "software engineering"
and "coding". I've worked with too many "coders"; and by "worked
with", I mean "cleaned up after".

That isn't to say that you shouldn't start to write anything until you
have a complete architectural design. Just that you should expect the
first dozen or so attempts to serve as learning exercises rather than
something which will eventually be used. And expect the first few
attempts to tell you more about what *won't* work than what will.

In many regards, trial-and-error can produce a better design than
trying to operate from a purely theoretical perspective. Hindsight
tends to be more accurate than foresight, albeit with the drawback
that it takes rather more effort to obtain.

--
Glynn Clements <glynn@gclements.plus.com>

Yann Chemin wrote:

this happens because the data has NAN (see r.info output):
Range of data: min = nan max = nan

Is there a way to discard NAN in a.min() and a.max() calculations?
Or is there a NAN-resistant mode in matplotlib?

NaN should never appear in GRASS rasters. When it does, it's a bug in
the module which created the map (do you know which module was
responsible in this case?).

One of the things which is on my to-do list for 7.x is to change
G_is_[fd]_null_value() to treat *all* NaN values as null (currently,
it checks for a specific bit pattern, which is just one possible NaN
value).

Once that is done, then r.stats' -n flag will prevent NaNs from making
it into the output.

In the meantime, you can clean up maps which contain NaNs with e.g.:

  r.mapcalc 'newmap = if(oldmap == oldmap,oldmap,null())'

or, with recent versions of r.null:

  r.null map setnull=nan

--
Glynn Clements <glynn@gclements.plus.com>

On Jul 25, 2008, at 6:56 PM, Glynn Clements wrote:

Michael Barton wrote:

ax.hist() and numpy.histogram() are broken by design. They should
accept an iterator as an argument. Requiring the entire data to be
passed as a list makes them useless for large amounts of data.

Well. For *really* large amounts of data I suppose. And indeed
sometimes people have Gb maps to work with. However, 15 sec. for
histogramming a 30 million cell, 23Mb ASTER file isn't too bad. As you
say, it would faster to use C modules in GRASS for binning if we had
them.

Your comments bring out a number of important points that we need to consider both in this particular case, and more in general too. Here are a few responses.

Things can be written if there is a need for them. But doing things
"right" is more important than doing things "right now".

I agree. This is a proposal accompanied by examples so that others can try it out. Sort of like Cairo.

Many of the current problems with GRASS are due to cases where
expediency was allowed to win out over good design (or *any* design).

[As to the specific task of binning, you could probably get a suitable
tool within 5 minutes by scavenging from r.quantile.]

This would be good to add to r.stats, which already has an interface for this but which only works with float maps.

Sure, we can provide "basic" functionality, but where do you draw the
line? Which features of a "real" statistics package *wouldn't* be
useful in GRASS? I'm really quite worried about the potential for
feature creep in this area.

I guess I'd leave it up to the user/developer base to decide what
kinds of functionality we need. If it is something regularly
necessary, someone will probably craft a script for it. If this is
really useful, it might get translated into C-code. Visualization and
spatial analysis is an important part of GIS functionality to me. In
fact, a lot of graphing programs and stat packages would have
difficulty in dealing with 20, 50, or 100 million points. MatPlotLib
(or something like it) seems like a good tool to have available for
development. It's also an encouragement for people to begin to develop
scripts in Python.

It's also far too low-level. If you start writing lots of scripts
which call matplotlib directly, you will quickly end up with a
situation where one script lets you set the axis colour and the label
font but not the symbols, another lets you control the symbols but not
the font, etc.

It's not nearly as low level as d.graph or bash. Using pyplot or pylab dispenses with even more code. But it's important to see how this would work in a GRASS environment rather than just for creating general purpose graphs. This means it's up to the script/module/GUI developer to decide how much control is appropriate to pass on to users.

[I'm using stylistic properties here for simplicity; there are
probably other properties which are far more important, e.g. being
able to zoom, select log/linear scaling, etc.

If you can't find or aren't happy with existing graphing libraries,
then write one. But don't require each script to include its own
written-from-scratch-in-one-hour graphing library.

My thought was that this could potentially serve as a 'standard' graphic library for Python scripting in GRASS. Depending on the graphic requirements, it could also serve for some of the 'built in' graphing functions in a wxPython GUI environment--that is, things built in to the GUI or modules that ship with GRASS. Obviously, we'd need to try it out and maybe look at other possible alternatives. This library is widely used and well-maintained, and has a lot of functionality. So it seems like a good candidate for something like this.

What is required is something at a level where the script generates
some data then indicates that the data should be displayed as a
scatter plot. The library should take care of the rest (scales, sizes,
colours, ...)

I agree, especially for 'built-in' graphing like histogramming.

without the script having to explicitly read options
from the user and pass them to matplotlib.

The idea is not that we should attempt to create general-purpose graphing applications, but to provide a flexible set of tools that GRASS developers and sophisticated users can use.

Are you going to hard-code the fonts, colours, line width, symbols,
etc, or allow the user to set them?

I guess it depends on the application. Sometimes this is superfluous and other times it's very useful.

Assuming the latter, are you going
to force them to specify font= linecolor= textcolor= etc for every
command that they type, or allow them to set defaults? Assuming the
latter, how will the script obtain this information?

The examples that I did gave the user virtually no choice over graph formatting or the type of graph displayed, although it is easy enough to build it into a script GUI. The existence of a library which allows a programmer to produce diverse, publication quality graphs doesn't mean that it is necessary to push all of the options onto users. This would be creating a general purpose graphing application. And as you point out, other applications can fill that role.

Something like MatPlotLib does give us a way to create an important class of visualization that is largely lacking in GRASS, which is otherwise rich in analytical tools. I realize that people vary in how they respond to information presented in different ways. I am someone who usually much prefers to see a graph than a table of numbers. So having something along the lines of MatPlotLib available seems like a good thing to me.

Also, are these scripts going to be usable from within the GUI? (I.e.
display the graphics in a window created by the GUI, not just dump a
PNG file to the disk).

Currently some scripts are usable in the GUI simply because they are called from the menu. There is no particular integration beyond that. Some scripts (and C modules) produce dense numerical output that could benefit by optionally having it drawn to a graph in a file, rather than only being written to a text file. In other cases, there may be functions, like interactive profiling, that need to be wrapped into the GUI to be fully useable. The current interactive profiling module uses a wxPython library. This is fine, but is difficult to use outside of the wxPython GUI. Because there is considerable demand (at least among the developer community) for maintaining the possibility inside and outside the GUI, I've been looking for something that would work in both environments. That is one reason I like MatPlotLib, although there may be better alternatives I'm not yet aware of. It has a wxPython backend that allows it to be wrapped completely in the GUI and display to a canvas. Or it can create its own display environment. Or it can output to a file. And it is as easy to insert into a stand-along script as it is to embed it into the wxPython GUI environment.

How are the file format, dimensions, filename
etc communicated between the two?

Will the script be able to communicate the set of available display
attributes to the GUI (in such a way that it can distinguish "what to
display" from "how to display it")?

Currently no independent, stand-alone modules communicate with the GUI display. Jachym and Martin have created some hooks to make it possible to send the output from a display command (e.g. d.rast) to the wxPython canvas. I'm not sure if this is still active and I'm not sure that it should be a high priority, though I know that others might disagree.

Since MatPlotLib is pure Python and has a wxPython backend, it shouldn't be too hard to be able to have a graph created by this library display in the mapdisplay canvas. But I'm not sure that is a good idea. The mapdisplay canvas, with its toolbar, is pretty specialized for displaying maps and map-like imagery. Usually, I'd think a user would prefer to have a graph display in a separate window and a different kind of window. This is easy enough to do with MatPlotLib in a wxPython (or other) environment.

If this seems after testing to be a potentially valuable addition to the GRASS system, it would probably be good to build some standardized convenience libraries to easily create graphs with a standard look, manage data flows from grass modules to MatPlotLib/Numpy, create a standard graph display window (the toolbar that comes with MatPlotLib is only partly useful IMHO), etc. Perhaps this is what you meant above. It's also the kind of examples I did--axis labels, title, scaling are all automatic.

This, in a nutshell, is the difference between "software engineering"
and "coding". I've worked with too many "coders"; and by "worked
with", I mean "cleaned up after".

I agree very much. I have far less experience in this than you, but over the past several years, I've had to sort through a lot of poorly designed, accretionary code. On the other hand, almost none of us on the development team are programmers and development has been a long-term accretionary process. But I suppose that is all the more reason to try to work out the concepts better.

That isn't to say that you shouldn't start to write anything until you
have a complete architectural design. Just that you should expect the
first dozen or so attempts to serve as learning exercises rather than
something which will eventually be used. And expect the first few
attempts to tell you more about what *won't* work than what will.

In many regards, trial-and-error can produce a better design than
trying to operate from a purely theoretical perspective. Hindsight
tends to be more accurate than foresight, albeit with the drawback
that it takes rather more effort to obtain.

And some parts of GRASS very much need more in the way of well-though-out design concepts, while others can be more evolutionary. In my mind, that is where we are with a graphing library at the moment. I, at least, think it would be very good to have one. Someone could certainly program one in C, but that seems a lot of work if we can just use one that is already built. That leaves us more resources to build the pieces that are NOT out there to use. I'm simply proposing MatPlotLib as a potential candidate to use for creating graphs in a Python-enabled GRASS. We need to work with it some to see what its potential and limitations are.

Michael

Hi Michael,

I got a nice CASC2D water depth histogram, in a different color...
It nicely work for other datasets... without NAN, which are different
from NULL values, I wish we had a G_set_nan_null_value() function
too...

Yann

2008/7/26 Michael Barton <michael.barton@asu.edu>:

Yann,

Have you tried this on a dataset without NAN values?

Michael
______________________________
C. Michael Barton, Professor of Anthropology
Director of Graduate Studies
School of Human Evolution & Social Change
Center for Social Dynamics & Complexity
Arizona State University
Tempe, AZ 85287-2402
USA

voice: 480-965-6262; fax: 480-965-7671
www: http://www.public.asu.edu/~cmbarton

On Jul 25, 2008, at 4:09 PM, Yann Chemin wrote:

Hi,

after chmod +x, it created a nice histogram of elevation.10m.
So I tried on floating point image created by some process.

here is the problem:
Traceback (most recent call last):
File "/home/yann/histogram_mpldemo.py", line 173, in <module>
  main()
File "/home/yann/histogram_mpldemo.py", line 159, in main
  alpha=0.75)
File "/usr/lib/python2.5/site-packages/matplotlib/axes.py", line 6035, in
hist
  normed=bool(normed), new=True)
File "/usr/lib/python2.5/site-packages/numpy/lib/function_base.py",
line 236, in histogram
  range = (a.min(), a.max())
ValueError: zero-size array to ufunc.reduce without identity

this happens because the data has NAN (see r.info output):
Range of data: min = nan max = nan

Is there a way to discard NAN in a.min() and a.max() calculations?
Or is there a NAN-resistant mode in matplotlib?

Yann

--
Yann Chemin
International Rice Research Institute
Office: http://www.irri.org/gis
Perso: http://www.freewebs.com/ychemin

(attachments)

myhistogram.png