Hi again
I have now done some testing of the GeoTIFF approach with regard to disc space and performance:
For the disc space test I exported a MODIS dataset covering Scandinavia (7629 x 9387 pixels) from GRASS native format (type CELL, 27M compressed on disc) to GeoTIFF with two different data types (Int16 and Float64), each both LZW-compressed and uncompressed.
Results for the Int16 dataset:
MODIS_sizetest_compressed.tif (LZW-compressed, Predictor 2): 14M
MODIS_sizetest_uncompressed.tif (uncompressed): 137M
Results for the Float64 dataset:
MODIS_sizetest_compressed_Float64.tif (LZW-compressed): 29M
MODIS_sizetest_uncompressed_Float64.tif (uncompressed): 547M
So, disc capacity seems to be a factor one should consider, as uncompressed data is in this case roughly 10 to 20 times larger than compressed...
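For reference, the exports were done roughly along these lines (the input map name is a placeholder):

    # Int16, LZW-compressed with horizontal differencing predictor
    r.out.gdal input=modis_map output=MODIS_sizetest_compressed.tif \
        type=Int16 createopt="COMPRESS=LZW,PREDICTOR=2"

    # Int16, uncompressed
    r.out.gdal input=modis_map output=MODIS_sizetest_uncompressed.tif type=Int16

    # Float64, LZW-compressed (predictor left at default)
    r.out.gdal input=modis_map output=MODIS_sizetest_compressed_Float64.tif \
        type=Float64 createopt="COMPRESS=LZW"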
Maybe we will have to accept that raw data is kept in a less interoperable (GRASS native) format, as the processed results are mainly the ones of interest (and I guess a visual file browser, e.g. in ESRI software, would nevertheless almost freeze when opening a folder with hundreds or thousands of files, but I do not know).
I also tried to run performance tests using the time command and r.mapcalc, with external and native format both for input and output (all 4 combinations). The results were not reliable, and performance tests seem to be a bit tricky according to Glynn's post here:
http://osdir.com/ml/grass-development-gis/2010-09/msg00225.html
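What I had tried was essentially the following (map names are placeholders), which is presumably affected by exactly the caching effects Glynn describes:

    # link the compressed GeoTIFF as an external raster (read side)
    r.external input=MODIS_sizetest_compressed.tif output=modis_ext

    # native input -> native output
    time r.mapcalc expression="result_nat = modis_nat + 0"

    # external input -> native output
    time r.mapcalc expression="result_ext = modis_ext + 0"

    # write side: redirect newly created rasters to GeoTIFF via
    # r.external.out, then repeat the same r.mapcalc calls to get
    # the remaining two combinations
    r.external.out directory=/tmp/ext_out format=GTiff extension=tif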
Does anyone have a suggestion for how the tests Glynn describes could be run (by non-C-developers) without too much effort?
Cheers
Stefan
-----Original Message-----
From: grass-user-bounces@lists.osgeo.org [mailto:grass-user-bounces@lists.osgeo.org] On Behalf Of Blumentrath, Stefan
Sent: 4 December 2013 22:52
To: Sören Gebbert
Cc: grass-user@lists.osgeo.org list
Subject: Re: [GRASS-user] Organizing spatial (time series) data for mixed GIS environments
Hi Sören,
First of all thank you very much for the excellent temporal framework! It is really great work!
Thank you also for your answers; they are very helpful as well!
I will test the solution with external GeoTIFFs.
Updates of the GeoTIFFs by external software are to be expected (possibly via cron-jobs), so I have to think about a strategy for updating all space-time datasets downstream which depend on an updated file (decade, year, month, whatever...).
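My current (completely untested) idea for a single updated file looks roughly like this (paths, map and dataset names are just placeholders):

    # relink the file that the cron-job has rewritten
    r.external -o --overwrite input=/data/tiffs/ndvi_2013_12.tif output=ndvi_2013_12

    # refresh the registration so the space time dataset metadata is updated
    t.unregister input=ndvi_series maps=ndvi_2013_12
    t.register input=ndvi_series maps=ndvi_2013_12 start="2013-12-01"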
I'll report back after some first tests...
Best regards
Stefan
-----Original Message-----
From: Sören Gebbert [mailto:soerengebbert@googlemail.com]
Sent: 4 December 2013 18:02
To: Blumentrath, Stefan
Cc: grass-user@lists.osgeo.org list
Subject: Re: [GRASS-user] Organizing spatial (time series) data for mixed GIS environments
Hi Stefan,
2013/12/3 Blumentrath, Stefan <Stefan.Blumentrath@nina.no>:
Dear all,
On our Ubuntu server we are about to reorganize our GIS data in order
to develop a more efficient and consistent solution for data storage
in a mixed GIS environment.
By “mixed GIS environment” I mean that we have people working with
GRASS, QGIS and PostGIS, but also many people using R, and maybe the
largest fraction using ESRI products; furthermore we have people using
ENVI, ERDAS and some others. Only a few people (like me) actually work
directly on the server…
Until now I have stored “my” data mainly in GRASS (6/7) native format,
which I was very happy with. But I guess our ESRI and PostGIS people
would not accept that as a standard…
However, especially for time series data, we cannot have several
copies in different formats (tailor-made for each and every software package).
So I started thinking: what would be the most efficient and convenient
solution for storing a large amount of data (e.g. high resolution
raster and vector data with national extent, plus time series data) in
a way that is accessible to all (or at least most) remote users with
different GIS software? As I am very fond of the temporal framework
in GRASS 7, one precondition is that I can use these tools on the data
without unreasonable performance loss. Another precondition is that
users at remote computers in our (MS Windows) network can access the data.
In general, four options come to my mind:
a) Stick to GRASS native format and keep one copy in another format
b) Use the native formats the data come in (e.g. temperature and
precipitation come in zipped ASCII grid format)
c) Use PostGIS as a backend for data storage (raster / vector), linked
by r./v.external.*
d) Use another GDAL/OGR format for data storage (raster / vector),
linked by r./v.external.*
My question(s) are:
What solutions could you recommend or what solution did you choose?
I would suggest using r.external and uncompressed GeoTIFF files for raster data. But you have to make sure that external software does not modify these files, or if it does, that the temporal framework is triggered to update dependent space time raster datasets.
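As an untested sketch (the path is just an example), linking a whole directory of GeoTIFF files could look like this:

    # link every GeoTIFF in a directory instead of importing it
    for f in /data/rasters/*.tif ; do
        r.external input="$f" output="$(basename "$f" .tif)"
    done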
For vector data, I would suggest using the native GRASS format; hence vector data needs to be copied. But maybe PostgreSQL with topology support would be a solution? I think Martin Landa may have an opinion here.
Who is having experience with this kind of data management challenge?
No experience from my side.
How do externally linked data series perform compared to GRASS native?
It will be slower than the native format for sure, but I don't know how much slower.
I searched the mailing list a bit and found this thread:
http://osgeo-org.1560.x6.nabble.com/GRASS7-temporal-GIS-database-questions-td5054920.html
where Sören recommended “postgresql as temporal database backend”.
However, I am not sure if that was meant only for the temporal
metadata and not the rasters themselves…
My recommendation was related to the temporal metadata only. The SQLite database will not scale very well for select requests if you have more than 30,000 maps registered in your temporal database.
PostgreSQL will be much faster for select requests. But PostgreSQL performs very badly when managing (inserting, updating, deleting) many maps. I am not sure what the reason for this is, but from my experience PostgreSQL has a scaling problem with many tables. Hence, if you do not modify your data often, PostgreSQL is your temporal database backend of choice. Otherwise I would recommend SQLite, even if it is slower for select requests.
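The temporal backend is configured with t.connect, for example (the connection string is just an example):

    # show the current temporal database connection
    t.connect -p

    # use PostgreSQL as the temporal database backend
    t.connect driver=pg database="dbname=grass_temporal"

    # switch back to the default SQLite backend
    t.connect -d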
Furthermore, in the idea collection for the temporal framework
(http://grasswiki.osgeo.org/wiki/Time_series_development, “Open
issues” section), limitations were mentioned regarding the number of
files in a folder, which could possibly be a problem for file-based
storage. The ext2 file system had a “"soft" upper limit of about
10-15k files in a single directory”, but theoretically many more were
possible. Other file systems may allow for more, I guess… Will usage
of such big directories (> 10,000 files) lead to performance problems?
This discussion is pretty old and does not reflect the current temporal framework implementation. Please have a look at the new TGRASS paper:
https://www.sciencedirect.com/science/article/pii/S136481521300282X?np=y
and the Geostat workshop:
http://geostat-course.org/Topic_Gebbert
Modern file systems should not have problems with many files. I am using ext4 and the temporal framework with 100,000 maps without noticeable performance issues.
The “Working with external data in GRASS 7” wiki entry
(http://grasswiki.osgeo.org/wiki/Working_with_external_data_in_GRASS_7)
covers the technical part (and to some degree performance issues)
very well.
Would it be worth adding a part on the strategic considerations / pros
and cons of using external data? Or is that too user- and format-dependent?
It would be great if you could share your experience with us. 
Best regards
Soeren
Thanks for any feedback or thoughts on this topic…
Cheers
Stefan
_______________________________________________
grass-user mailing list
grass-user@lists.osgeo.org
http://lists.osgeo.org/mailman/listinfo/grass-user