[GRASS-dev] NULL file compression: loop over uncompressed maps to save disk space

Hi,

While hunting for more free GBs on my local disk, I found many raster maps
with still uncompressed NULL files (no surprise, since the optional
NULL file compression was only introduced in 7.2.0).

As an example - EU DEM25m:

uncompressed NULL file:
6000000000 Apr 13 2016 ./eu_laea/PERMANENT/cell_misc/eu_dem25/null

compressed NULL file:
32108798 Jan 3 15:09 eu_laea/PERMANENT/cell_misc/eu_dem25/nullcmpr

Ratio:

32108798 / 6000000000

[1] 0.005351466

... quite an improvement :-)

With tons of raster maps here, I thought of running r.null over all of
them. For a single map, the general recipe is:

export GRASS_COMPRESS_NULLS=1
r.null -z myrastermap
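
To cover many maps, the two commands above can be wrapped in a loop. The following is only a sketch, not a tested tool: it assumes a running GRASS session, uses g.list to enumerate the raster maps of the current mapset, and introduces a hypothetical DRY_RUN variable (not a GRASS setting) that prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: run "r.null -z" on every raster map name read from stdin.
# DRY_RUN is a hypothetical helper variable, not part of GRASS.

# NULL file compression is only applied when this variable is set.
export GRASS_COMPRESS_NULLS=1

compress_null_files() {
    while IFS= read -r map; do
        # Skip empty lines.
        [ -n "$map" ] || continue
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Dry run: only show what would be executed.
            echo "r.null -z map=$map"
        else
            r.null -z map="$map"
        fi
    done
}

# Usage inside a GRASS session (current mapset only):
#   g.list type=raster mapset=. | compress_null_files
```

Setting DRY_RUN=1 first gives a chance to review the list of maps before anything is rewritten on disk.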

Attached is a patch that adds a second line of output to "r.compress -p
myrastermap" so that the actual NULL file compression state of a map can
be checked:

r.compress -p eu_dem25
<eu_dem25> is compressed (method 2: ZLIB). Data type: FCELL
<eu_dem25> has an uncompressed NULL file

After compression it looks like this:

r.compress -p eu_dem25
<eu_dem25> is compressed (method 2: ZLIB). Data type: FCELL
<eu_dem25> has a compressed NULL file

Now, how to use that:

# all in one (check whether the NULL file is compressed; if not, compress it,
# otherwise leave the map untouched):
r.compress -p eu_dem25 2>&1 | grep uncompressed && r.null -z eu_dem25
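
A slightly more defensive variant of the one-liner above, as a sketch: it matches the full message printed by the patched r.compress -p rather than the bare word "uncompressed", and uses grep -q so that the matched line is not echoed. The function name has_uncompressed_null is made up for illustration.

```shell
# Succeeds (exit status 0) if the text on stdin reports an uncompressed
# NULL file, using the message wording of the patched "r.compress -p".
has_uncompressed_null() {
    grep -q 'has an uncompressed NULL file'
}

# Usage inside a GRASS session, with the same stderr redirect as in the
# one-liner above:
#   r.compress -p map=eu_dem25 2>&1 | has_uncompressed_null \
#       && r.null -z map=eu_dem25
```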

Questions:
I believe that an additional -g flag for shell style printing would be
useful as well.
Maybe with a -g flag there would be no need for the stderr redirect?
Any better ideas here? (If yes, feel free to submit to SVN for testing.)

################
Since r.null -z doesn't do anything useful if GRASS_COMPRESS_NULLS is
not set, I have tried (!) to add a G_message() that tells the user whether
that variable is set or not.
r.null -z eu_dem25
The GRASS_COMPRESS_NULLS environment variable is currently set
6%...

But it *always* reports that the variable is set, so my getenv() parsing
must be wrong. Can anyone help, please? That change is also attached...

thanks,
Markus

(attachments)

r_compress_r_null.diff (1.34 KB)

On Tue, Jan 3, 2017 at 6:08 PM, Markus Neteler <neteler@osgeo.org> wrote:

> Questions:
> I believe that an additional -g flag for shell style printing would be
> useful as well.
> Maybe with a -g flag no need to use the stderr redirect?
> Any better ideas here? (if yes, feel free to submit to SVN for testing)

I have added your changes and a new shell style option to trunk in r70337.
Note that this is not the standard shell style: because the module accepts
several input maps, the compression info is written to stdout as one line
per input map. The format is

input map name|data type|name of data compression method|NULL file compression

e.g.

eu_dem25|FCELL|ZLIB|NO

or

eu_dem25|FCELL|BZIP2|YES
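
With one line per map, the output can be filtered without any stderr redirect. A minimal sketch, assuming the four pipe-separated fields described above; the function name maps_with_uncompressed_nulls is hypothetical:

```shell
# Print the names of all maps whose NULL file is not compressed.
# Input: lines of the form  name|data type|compression method|YES or NO
maps_with_uncompressed_nulls() {
    awk -F'|' '$4 == "NO" { print $1 }'
}

# Usage inside a GRASS session, with GRASS_COMPRESS_NULLS=1 exported:
#   r.compress -g map=eu_dem25 | maps_with_uncompressed_nulls |
#       while IFS= read -r map; do r.null -z map="$map"; done
```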

> ################
> Since r.null -z doesn’t do anything useful if GRASS_COMPRESS_NULLS is
> not set I have tried (!) to add a G_message() to tell the user if that
> variable is set or not.
> r.null -z eu_dem25
> The GRASS_COMPRESS_NULLS environment variable is currently set
> 6%…
>
> But it always tells that it is set, so my getenv() parsing is wrong.
> Can anyone help please? Also attached…

Try trunk r70338. This also fixes the removal of a compressed NULL file:
the file name is nullcmpr, not null2.

Markus M

On Wed, Jan 11, 2017 at 10:34 AM, Markus Metz
<markus.metz.giswork@gmail.com> wrote:

> I have added your changes and a new shell style option to trunk in r70337.

Thanks for that!

> Note that this is not standard shell style because the module accepts
> several input maps, thus compression info is written to stdout as one line
> per input map. The format is
> input map name|data type|name of data compression method|NULL file compression
> e.g.
> eu_dem25|FCELL|ZLIB|NO
> or
> eu_dem25|FCELL|BZIP2|YES

Yes, that's quite useful like this. Maybe a small modification to make
it usable for the beloved eval() function?

r.compress -g eu_dem25
eu_dem25=FCELL|ZLIB|YES
?
This would keep the -g implementations consistent across different commands.

> Try trunk r70338.

Works.

> This also fixes removal of a compressed NULL file: the
> file name is nullcmpr, not null2

That old "null2" name is also here:

./raster/r.support/main.c: G_file_name_misc(path, "cell_misc",
"null2", raster->answer, G_mapset());

best,
markusN

On Wed, Jan 11, 2017 at 12:37 PM, Markus Neteler <neteler@osgeo.org> wrote:

> Yes, that’s quite useful like this. Maybe a small modification to make
> it usable for the beloved eval() function?
>
> r.compress -g eu_dem25
> eu_dem25=FCELL|ZLIB|YES
> ?

In this case, eu_dem25 would be both the name of a raster map and the name
of a variable, which I find confusing. And you would still need to split the
result. I guess parsing

eu_dem25|FCELL|ZLIB|YES

would be slightly easier in python than

eu_dem25=FCELL|ZLIB|YES

> This would keep the -g implementations consistent across different commands.

Full, easy support for eval() is not possible if the same parameters are
printed several times. See e.g. r.univar -gt with a zonal map, r.quantile,
r.stats.quantile or v.db.connect -g with several connections defined.
Therefore I would use one line with name=value where possible, and otherwise
one line per input with separated fields. I would not mix the two; IMHO that
would cause confusion.
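
The eval() problem can be reproduced in a few lines of shell. This is an illustrative sketch with made-up key names and values, not real module output: when the same key is printed more than once, eval keeps only the last assignment, whereas one separated line per input can be read field by field.

```shell
# Made-up name=value output in which the same keys occur twice.
repeated='n=10
mean=2.5
n=20
mean=7.5'

# eval runs all four assignments; the later ones overwrite the earlier.
eval "$repeated"
echo "after eval: n=$n mean=$mean"

# One line per input with separated fields loses nothing.
printf 'eu_dem25|FCELL|ZLIB|NO\nother_map|CELL|BZIP2|YES\n' |
while IFS='|' read -r name dtype method nullcmpr; do
    echo "$name: type=$dtype method=$method null_compressed=$nullcmpr"
done
```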

> That old "null2" name is also here:
>
> ./raster/r.support/main.c: G_file_name_misc(path, "cell_misc",
> "null2", raster->answer, G_mapset());

Fixed in r70340,1 (trunk, relbr72).

Markus M

On Wed, Jan 11, 2017 at 3:06 PM, Markus Metz
<markus.metz.giswork@gmail.com> wrote:

> Full easy support for eval() is not possible if the same parameters are
> printed several times. See e.g. r.univar -gt with a zonal map, r.quantile,
> r.stats.quantile or v.db.connect -g with several connections defined.
> Therefore I would use one line with name=value where possible, otherwise one
> line per input with separated fields. I would not mix the two, IMHO it will
> cause confusion.

OK, fine. After a bit of testing, I suggest backporting the new flag and
the r.null messages so that NULL compression also becomes easy in 7.2 (I
have already enjoyed using the --exec interface to loop over all mapsets
from outside and compress all maps therein).
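
Such a loop over mapsets could look like the sketch below. Assumptions: the start script is called grass72, a directory is treated as a mapset when it contains a WIND file, and the location path is only an example. list_mapsets is a hypothetical helper, not a GRASS command.

```shell
#!/bin/sh
# Print every mapset directory inside the location directory "$1".
# A subdirectory is treated as a mapset if it contains a WIND file.
list_mapsets() {
    for dir in "$1"/*/; do
        if [ -f "${dir}WIND" ]; then
            printf '%s\n' "${dir%/}"
        fi
    done
}

# Usage from outside GRASS (binary name and location path are assumptions;
# stdin of grass72 is redirected so it cannot consume the list of mapsets):
#   export GRASS_COMPRESS_NULLS=1
#   list_mapsets "$HOME/grassdata/eu_laea" | while IFS= read -r mapset; do
#       grass72 "$mapset" --exec sh -c \
#           'g.list type=raster mapset=. |
#            while IFS= read -r m; do r.null -z map="$m"; done' < /dev/null
#   done
```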

markusN