[GRASS-dev] g.mlist as C implementation?

Hi,

when using g.mlist in mapsets with thousands of maps in
it (as happens with MODIS time series), it gets extremely
slow. I wonder

- if a C implementation would be faster (I usually need
  the * wildcard to match file names)
- if yes, how complicated it is to write.

?
Markus

------------------
ITC -> dall'1 marzo 2007 Fondazione Bruno Kessler
ITC -> since 1 March 2007 Fondazione Bruno Kessler
------------------

Markus Neteler wrote:

when using g.mlist in mapsets with thousands of maps in
it (as happens with MODIS time series), it gets extremely
slow. I wonder

- if a C implementation would be faster (I usually need
  the * wildcard to match file names)

It's certainly possible to improve upon the efficiency of g.mlist,
although there's a lot which could be done without resorting to C.

The main thing which stands out is that it invokes "grep" once for
every name returned by g.list:

    list=""
    for i in `g.list type=$type mapset=$mapset | grep -v '^-\+$' | grep -v "files available" | grep -vi "mapset"`
    do
        if [ ! "$search" ] ; then
      list="$list $i"
        else
      list="$list `echo $i | grep \"$search\"`"
        fi
    done

Feeding a stream of names to a single grep process would be far more
efficient. The g.list output can be "unformatted" so that each name
appears on a separate line by filtering through:

  sed 's/ */\<newline>/g'

where <newline> is a literal newline character (\n works, but appears
to be a GNU extension).

Also, use of "for var in `command` ..." is sub-optimal; using
"command | while read var ..." is preferable.

- if yes, how complicated it is to write.

The code itself is fairly straightforward, provided that you can rely
upon the existence of either fnmatch() and/or regexec() et al. The
former is POSIX.2, the latter POSIX.1. IOW, neither are standard on
Windows.

DIY glob matching isn't hard if you impose a restriction that the
pattern may not contain more than one asterisk: check that the part
before the asterisk matches the beginning of the string, the part
after it matches the end, and the two don't overlap (i.e. the string
is at least as long as the pattern without the asterisk).

Matching regular expressions is rather more complex; you don't want to
do it yourself.

--
Glynn Clements <glynn@gclements.plus.com>

Glynn Clements wrote:

Markus Neteler wrote:
> when using g.mlist in mapsets with thousands of maps in
> it (as happens with MODIS time series), it gets extremely
> slow.

..
Glynn Clements wrote:

The main thing which stands out is that it invokes "grep" once for
every name returned by g.list:

Hi,

see attached patch for consideration. Following Glynn's suggestion it
gets rid of the loop for each map name, making the script ~50% faster
than before, but still about twice as slow as a single call to g.list.

I haven't put it in CVS as I haven't really tested it.

Hamish

(attachments)

g.mlist_HB20070910.diff (2.24 KB)

Hamish wrote on 09/10/2007 05:52 AM:

Glynn Clements wrote:
  

Markus Neteler wrote:
    

when using g.mlist in mapsets with thousands of maps in
it (as happens with MODIS time series), it gets extremely
slow.
      

..
Glynn Clements wrote:
  

The main thing which stands out is that it invokes "grep" once for
every name returned by g.list:
    
Hi,

see attached patch for consideration. Following Glynn's suggestion it
gets rid of the loop for each map name, making the script ~50% faster
than before, but still about twice as slow as a single call to g.list.

I haven't put it in CVS as I haven't really tested it.

I have made these tests:

for i in `seq 1 2000` ; do r.mapcalc map$i=1 ;done
# old g.mlist
time g.mlist type=rast pat="map*"
real 0m7.496s
user 0m2.417s
sys 0m4.983s

# new g.mlist
time g.mlist type=rast pat="map*"
real 0m0.230s
user 0m0.157s
sys 0m0.058s

=> 0.230/7.496 = 0.031

for i in `seq 1 10000` ; do r.mapcalc map$i=1 ;done
# old g.mlist
real 0m54.725s
user 0m24.809s
sys 0m28.944s

# new g.mlist
real 0m0.416s
user 0m0.229s
sys 0m0.087s

=> 0.416/54.725 = 0.0076

# paranoia test: new g.mlist
g.mlist type=rast pat="map*" | wc -l
10000

Amazing!
Also g.mremove will be finally fast then since it calls g.mlist.

Markus

------------------
ITC -> dall'1 marzo 2007 Fondazione Bruno Kessler
ITC -> since 1 March 2007 Fondazione Bruno Kessler
------------------

Hamish wrote:

Glynn Clements wrote:
>
> Markus Neteler wrote:
> > when using g.mlist in mapsets with thousands of maps in
> > it (as happens with MODIS time series), it gets extremely
> > slow.
..
Glynn Clements wrote:
> The main thing which stands out is that it invokes "grep" once for
> every name returned by g.list:

see attached patch for consideration. Following Glynn's suggestion it
gets rid of the loop for each map name, making the script ~50% faster
than before, but still about twice as slow as a single call to g.list.

Here's my proposed alternative. It avoids the use of backticks in
favour of a pipeline ending in a while loop.

--
Glynn Clements <glynn@gclements.plus.com>

(attachments)

g.mlist.patch (1016 Bytes)

Markus Neteler wrote:

>>> when using g.mlist in mapsets with thousands of maps in
>>> it (as happens with MODIS time series), it gets extremely
>>> slow.

Hamish:

> see attached patch for consideration. Following Glynn's suggestion it
> gets rid of the loop for each map name,

Markus:

I have made these tests:

..

for i in `seq 1 10000` ; do r.mapcalc map$i=1 ;done
# old g.mlist
real 0m54.725s

..

# new g.mlist
real 0m0.416s

..

=> 0.416/54.725 = 0.0076

..

Amazing!

well, the time is just proportional to the number of program calls. by
removing that loop that number is now finite, whereas before it depended
on the number of maps. Of course the expanded g.list solution is just 1
program call.

Glynn:

Here's my proposed alternative. It avoids the use of backticks in
favour of a pipeline ending in a while loop.

I don't mind which patch gets used; the simpler the better.

I noticed parts of the script may be a bit fragile/inefficient, which is
of concern for a module mainly intended as a scripting tool.

- The script relies on "echo -n"; not portable?
  http://www.opengroup.org/onlinepubs/009695399/utilities/echo.html

+ g.list type=$type mapset=$mapset \
+ | grep -v '^-\+$' \
+ | grep -v "files available" \
+ | grep -vi "mapset" \
+ | sed 's/ */\n/g' \
+ | grep -v '^$' \
+ | grep "$search" \
+ | sort \
+ | while read i

- combine the groups of sequential greps into single calls with the \|
  "OR" expr.

- The `grep -v "files available"` will fail when translated,
    'raster files available in mapset <PERMANENT>:'
  I assume the `grep -vi "mapset"` is there as a poor workaround for that.
  ("mapset" will not be as translated as often as "files available") :-/

Is there a standard C library fn we can call to get regex support?
Direct pattern searching in g.list would be preferable; I agree that
a DIY reimplementation is a lot more trouble than it is worth. (unless
some standard GPL code is copied verbatim into a new G_regex() libgis fn)
?
things like r.out.mpeg have basic map-name matching capabilities already.

Hamish

Hamish wrote:

I noticed parts of the script may be a bit fragile/inefficient, which is
of concern for a module mainly intended as a scripting tool.

- The script relies on "echo -n"; not portable?

Nope. According to the official specification, echo doesn't accept any
switches. OTOH, the GNU version doesn't support the \c sequence. IOW,
there is no portable way to echo a string without appending a newline.

  http://www.opengroup.org/onlinepubs/009695399/utilities/echo.html

+ g.list type=$type mapset=$mapset \
+ | grep -v '^-\+$' \
+ | grep -v "files available" \
+ | grep -vi "mapset" \
+ | sed 's/ */\n/g' \
+ | grep -v '^$' \
+ | grep "$search" \
+ | sort \
+ | while read i

- combine the groups of sequential greps into single calls with the \|
  "OR" expr.

The second and third can certainly be combined that way, provided that
you can use (or not use) -i for both (which is probably the case). I'm
not sure whether you can include the anchors (^ and $) within a
branch.

Hmm; this is wrong:

  grep -vi "mapset"

Nothing (other than common sense) prevents you from creating a map
named "mapset".

- The `grep -v "files available"` will fail when translated,
    'raster files available in mapset <PERMANENT>:'
  I assume the `grep -vi "mapset"` is there as a poor workaround for that.
  ("mapset" will not be as translated as often as "files available") :-/

One more reason why we need a g.list which isn't quite so "hostile" to
scripting.

Is there a standard C library fn we can call to get regex support?
Direct pattern searching in g.list would be preferable; I agree that
a DIY reimplementation is a lot more trouble than it is worth. (unless
some standard GPL code is copied verbatim into a new G_regex() libgis fn)?
things like r.out.mpeg have basic map-name matching capabilities already.

For regular expressions, regcomp(), regexec(), regfree() and
regerror() are specified by POSIX.1. For shell "glob" patterns,
fnmatch() is specified by POSIX.2. None of these are specified by
either version of ANSI C, so we would need additional libraries for
Windows.

--
Glynn Clements <glynn@gclements.plus.com>

Glynn Clements wrote:

Hamish wrote:

I noticed parts of the script may be a bit fragile/inefficient, which is
of concern for a module mainly intended as a scripting tool.

- The script relies on "echo -n"; not portable?

Nope. According to the official specification, echo doesn't accept any
switches. OTOH, the GNU version doesn't support the \c sequence. IOW,
there is no portable way to echo a string without appending a newline.

We can use printf instead I guess. By default it does not append a
newline (to append it, add \n at the end of the string to be printed).

It seems to be portable and POSIX compliant, see eg. [1].

[1]http://pwet.fr/man/linux/commandes/posix/echo

Maciek

On Tue, 11 Sep 2007, Maciej Sieczka wrote:

Glynn Clements wrote:

Hamish wrote:

I noticed parts of the script may be a bit fragile/inefficient, which is
of concern for a module mainly intended as a scripting tool.

- The script relies on "echo -n"; not portable?

Nope. According to the official specification, echo doesn't accept any
switches. OTOH, the GNU version doesn't support the \c sequence. IOW,
there is no portable way to echo a string without appending a newline.

What about the echo that is included with GRASS: $GISBASE/etc/echo?
It suppresses a new line if the -n flag is used, and prints to stderr if -e is used, otherwise stdout.

Might be another use for it?

Paul

Another reason to consider extending the functionality of g.list over
improvements to the g.mlist script is that a C-module would run in a Windows
environment and the script contains a lot of *nix-isms that could be
problematic even with Mysys.

Michael

On 9/10/07 7:20 PM, "Glynn Clements" <glynn@gclements.plus.com> wrote:

Hamish wrote:

I noticed parts of the script may be a bit fragile/inefficient, which is
of concern for a module mainly intended as a scripting tool.

- The script relies on "echo -n"; not portable?

Nope. According to the official specification, echo doesn't accept any
switches. OTOH, the GNU version doesn't support the \c sequence. IOW,
there is no portable way to echo a string without appending a newline.

  echo

+ g.list type=$type mapset=$mapset \
+ | grep -v '^-\+$' \
+ | grep -v "files available" \
+ | grep -vi "mapset" \
+ | sed 's/ */\n/g' \
+ | grep -v '^$' \
+ | grep "$search" \
+ | sort \
+ | while read i

- combine the groups of sequential greps into single calls with the \|
  "OR" expr.

The second and third can certainly be combined that way, provided that
you can use (or not use) -i for both (which is probably the case). I'm
not sure whether you can include the anchors (^ and $) within a
branch.

Hmm; this is wrong:

grep -vi "mapset"

Nothing (other than common sense) prevents you from creating a map
named "mapset".

- The `grep -v "files available"` will fail when translated,
    'raster files available in mapset <PERMANENT>:'
  I assume the `grep -vi "mapset"` is there as a poor workaround for that.
  ("mapset" will not be as translated as often as "files available") :-/

One more reason why we need a g.list which isn't quite so "hostile" to
scripting.

Is there a standard C library fn we can call to get regex support?
Direct pattern searching in g.list would be preferable; I agree that
a DIY reimplementation is a lot more trouble than it is worth. (unless
some standard GPL code is copied verbatim into a new G_regex() libgis fn)?
things like r.out.mpeg have basic map-name matching capabilities already.

For regular expressions, regcomp(), regexec(), regfree() and
regerror() are specified by POSIX.1. For shell "glob" patterns,
fnmatch() is specified by POSIX.2. None of these are specified by
either version of ANSI C, so we would need additional libraries for
Windows.

__________________________________________
Michael Barton, Professor of Anthropology
Director of Graduate Studies
School of Human Evolution & Social Change
Center for Social Dynamics and Complexity
Arizona State University

phone: 480-965-6213
fax: 480-965-7671
www: http://www.public.asu.edu/~cmbarton

Paul Kelly wrote:

>>> I noticed parts of the script may be a bit fragile/inefficient, which is
>>> of concern for a module mainly intended as a scripting tool.
>>>
>>> - The script relies on "echo -n"; not portable?
>
>> Nope. According to the official specification, echo doesn't accept any
>> switches. OTOH, the GNU version doesn't support the \c sequence. IOW,
>> there is no portable way to echo a string without appending a newline.

What about the echo that is included with GRASS: $GISBASE/etc/echo?

That would work, but it could have a substantial performance hit, as
it can be executed twice per name ("echo" is normally a shell
built-in).

One solution would be to use "$echo", and set it to either "echo" or
"$GISBASE/etc/echo" depending upon whether the default echo command
understands the -n switch. Patch attached.

--
Glynn Clements <glynn@gclements.plus.com>

(attachments)

g.mlist.patch (554 Bytes)

Michael Barton wrote:

Another reason to consider extending the functionality of g.list over
improvements to the g.mlist script is that a C-module would run in a Windows
environment and the script contains a lot of *nix-isms that could be
problematic even with Mysys.

Or use Tcl or Python; both of those have regexp support as part of the
language. OTOH, there's no guarantee that either of those will be
installed.

The only practical option which doesn't require additional
dependencies is a C module with no support for regexps and glob
patterns limited to a single asterisk.

--
Glynn Clements <glynn@gclements.plus.com>

- The script relies on "echo -n"; not portable?

Instead of "echo -n" to build the output map by map, what if a string was
constructed within the loop like:

if [ $first ] ; then
  OUTSTRING="$MAPNAME$MAPSET"
else
  OUTSTRING="$OUTSTRING$SEP$MAPNAME$MAPSET"
fi

and then at end of the script just a single
  echo "$OUTSTRING"

is that going to max out at 4096 chars?

what should a the C flat 'g.list -g' output look like? how about:
[user1]
map1
map2
map3
[PERMANENT]
map_a
map_b
map_c

?
Hamish

Hi,

2007/9/12, Hamish <hamish_nospam@yahoo.com>:

what should a the C flat 'g.list -g' output look like? how about:
[user1]
map1
map2
map3
[PERMANENT]
map_a
map_b
map_c

BTW, I tried to implement something similar in the past...

g.list rast ncols=1
----------------------------------------------
raster files available in mapset <elev>:
dem_reggio_10
dem_reggio_10_fp
----------------------------------------------
raster files available in mapset <PERMANENT>:
MASK
elev_reggio
elev_reggio_10

Martin

--
Martin Landa <landa.martin@gmail.com> * http://gama.fsv.cvut.cz/~landa *

(attachments)

g-list-ncols.diff (4.33 KB)

Glynn Clements wrote on 09/10/2007 08:25 PM:

Hamish wrote:

Glynn Clements wrote:
    

Markus Neteler wrote:
      

when using g.mlist in mapsets with thousands of maps in
it (as happens with MODIS time series), it gets extremely
slow.
        

..
Glynn Clements wrote:
    

The main thing which stands out is that it invokes "grep" once for
every name returned by g.list:
      

see attached patch for consideration. Following Glynn's suggestion it
gets rid of the loop for each map name, making the script ~50% faster
than before, but still about twice as slow as a single call to g.list.
    
Here's my proposed alternative. It avoids the use of backticks in
favour of a pipeline ending in a while loop.

It appears that Hamish's implementation based on Glynn's former notes
is much faster:

time sh g.mlist_glynn type=rast map=neteler pat="map*"
real 0m1.200s
user 0m0.749s
sys 0m0.464s

time sh g.mlist_hamish type=rast map=neteler pat="map*"
real 0m0.338s
user 0m0.190s
sys 0m0.052s

Is that the contribution to portability? :slight_smile:

Markus

------------------
ITC -> dall'1 marzo 2007 Fondazione Bruno Kessler
ITC -> since 1 March 2007 Fondazione Bruno Kessler
------------------

Hamish wrote on 09/12/2007 12:47 PM:

what should a the C flat 'g.list -g' output look like? how about:
[user1]
map1
map2
map3
[PERMANENT]
map_a
map_b
map_c
  

For parsing reasons, it should be
[user1] map1
[user1] map2
[user1] map3
[PERMANENT] map1
[PERMANENT] map2
[PERMANENT] map3

or even
user1:map1
user1:map2
user1:map3
PERMANENT:map1
PERMANENT:map2
PERMANENT:map3

and 'g.list -gf':
user1:map1:title1
user1:map2:title2
PERMANENT:map1:title1

Markus

------------------
ITC -> dall'1 marzo 2007 Fondazione Bruno Kessler
ITC -> since 1 March 2007 Fondazione Bruno Kessler
------------------

Hamish wrote:

> - The script relies on "echo -n"; not portable?

Instead of "echo -n" to build the output map by map, what if a string was
constructed within the loop like:

if [ $first ] ; then
  OUTSTRING="$MAPNAME$MAPSET"
else
  OUTSTRING="$OUTSTRING$SEP$MAPNAME$MAPSET"
fi

and then at end of the script just a single
  echo "$OUTSTRING"

is that going to max out at 4096 chars?

It's going to create the string in memory, then write it in one go.

It could potentially hit a limit on the maximum length of a command if
echo isn't a built-in. I don't know whether this is anything other
than a theoretical possibility, though.

what should a the C flat 'g.list -g' output look like? how about:
[user1]
map1
map2
map3
[PERMANENT]
map_a
map_b
map_c

It would be simpler to parse if the mapset was part of each name,
e.g.:

map1@user1
map2@user1
map3@user1
map_a@PERMANENT
map_b@PERMANENT
map_c@PERMANENT

--
Glynn Clements <glynn@gclements.plus.com>

Markus Neteler wrote:

>>> The main thing which stands out is that it invokes "grep" once for
>>> every name returned by g.list:
>>>
>> see attached patch for consideration. Following Glynn's suggestion it
>> gets rid of the loop for each map name, making the script ~50% faster
>> than before, but still about twice as slow as a single call to g.list.
>>
>
> Here's my proposed alternative. It avoids the use of backticks in
> favour of a pipeline ending in a while loop.
>
>
It appears that Hamish's implementation based on Glynn's former notes
is much faster:

time sh g.mlist_glynn type=rast map=neteler pat="map*"
real 0m1.200s
user 0m0.749s
sys 0m0.464s

time sh g.mlist_hamish type=rast map=neteler pat="map*"
real 0m0.338s
user 0m0.190s
sys 0m0.052s

Is that the contribution to portability? :slight_smile:

I had overlooked the part where loop which adds the separator was
replaced with a call to "tr".

Technically, this corresponds to a reduction in functionality, as the
previous version would accept any string as a separator, but using
"tr" restricts it to a single character (which is probably all that's
required).

I've committed an update which has no loops other than the top-level
"for mapset in $mapsets" loop.

--
Glynn Clements <glynn@gclements.plus.com>

Martin,

Is this in the cvs yet?

Michael

On 9/12/07 4:22 AM, "Martin Landa" <landa.martin@gmail.com> wrote:

Hi,

2007/9/12, Hamish <hamish_nospam@yahoo.com>:

what should a the C flat 'g.list -g' output look like? how about:
[user1]
map1
map2
map3
[PERMANENT]
map_a
map_b
map_c

BTW, I tried to implement something similar in the past...

g.list rast ncols=1
----------------------------------------------
raster files available in mapset <elev>:
dem_reggio_10
dem_reggio_10_fp
----------------------------------------------
raster files available in mapset <PERMANENT>:
MASK
elev_reggio
elev_reggio_10

Martin

__________________________________________
Michael Barton, Professor of Anthropology
Director of Graduate Studies
School of Human Evolution & Social Change
Center for Social Dynamics & Complexity
Arizona State University

phone: 480-965-6213
fax: 480-965-7671
www: http://www.public.asu.edu/~cmbarton

Glynn Clements wrote:

I had overlooked the part where loop which adds the separator was
replaced with a call to "tr".

Technically, this corresponds to a reduction in functionality, as the
previous version would accept any string as a separator, but using
"tr" restricts it to a single character (which is probably all that's
required).

is it possible to use sed instead of tr?
  sed -e "s/\n/$sep/" # doesn't work

I've committed an update which has no loops other than the top-level
"for mapset in $mapsets" loop.

thanks.

I will have a look at merging Martin, Markus and Glynn's suggestions for
a flat output g.list flag. As the feature is needed for parsing, I'd
rather do it as a flag vs. a ncols= option which retains the text and
fancy format mapset underlines. I've never really known what the "g"
stands for, but as we use it a lot for parsable output, the flag will
be '-g' here too.

so it would look like:

G63> g.list -g rast
map1@user1
map2@user1
map3@user1
map_a@PERMANENT
map_b@PERMANENT
map_c@PERMANENT

G63> g.list -gf rast
map1@user1:
map2@user1:This map has a title
map3@user1:
map_a@PERMANENT:Raw x,y,z data binned into a raster grid by cell min
map_b@PERMANENT:Land Use
map_c@PERMANENT:

Once that is firm we can revisit g.mlist to make it fully locale
independent by removing the < grep -vi "files available\|mapset" >
part.

Hamish