Hello Jody,
Thanks for your email! That clarifies at least which direction we should be going with some of these issues. A few remaining important points:
1. Can you point me to a way to get the path to the DataDirectory without calling dir()? I'll have to make a patch for that, but I really did not see a way to do it in the current API if you are working with a ResourceStore. See the constructor GeoServerResourceLoader(ResourceStore resourceStore). We'll have to change the ResourceStore API to make this possible, no?
2. The problem with the GEOSERVER_DATA_DIRECTORY/data directory, or any other raster/vector data is slightly more complicated than you think.
* The REST API uploads both configuration files and data files, and it uses the same methods for both. I converted the whole module to use resources instead of files. This results (for now) in data files being uploaded to the database and then cached when the store is created.
* The distinction is not always simple to make. App-schema has configuration files (usually located in the workspaces dir) that are treated by geoserver in the same way as data files, and they are read by geotools.
Is there a reason why using the database to store and distribute the data files is not recommended? Is it a matter of performance/space?
Otherwise, I would indeed recommend allowing the user to specify in the jdbcstore configuration file which dirs to ignore. The jdbcstore would skip these folders on import and return a file-based resource whenever they are queried. Does that sound good?
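To make the ignore-list proposal concrete, here is a minimal sketch in Java. It is only an illustration: SimpleStore, RoutingStore and the ignoredDirs list are hypothetical names, not part of the GeoServer ResourceStore or jdbcstore API, and the "stores" just return a string saying which backend would serve the path.

```java
import java.util.List;

/** Hypothetical stand-in for a store; NOT the GeoServer ResourceStore API. */
interface SimpleStore {
    /** Returns a description of where the given path would be served from. */
    String describe(String path);
}

/** Sends paths under ignored directories to the file system, everything else to the database. */
final class RoutingStore implements SimpleStore {
    private final SimpleStore database;
    private final SimpleStore fileSystem;
    private final List<String> ignoredDirs; // assumed to come from the jdbcstore configuration file

    RoutingStore(SimpleStore database, SimpleStore fileSystem, List<String> ignoredDirs) {
        this.database = database;
        this.fileSystem = fileSystem;
        this.ignoredDirs = ignoredDirs;
    }

    /** True when the path is an ignored directory itself or lies beneath one. */
    boolean isIgnored(String path) {
        String normalized = path.startsWith("/") ? path.substring(1) : path;
        for (String dir : ignoredDirs) {
            // Compare whole path segments so that "database/x" does not match "data".
            if (normalized.equals(dir) || normalized.startsWith(dir + "/")) {
                return true;
            }
        }
        return false;
    }

    @Override
    public String describe(String path) {
        return isIgnored(path) ? fileSystem.describe(path) : database.describe(path);
    }
}
```

With an ignore list of "data" and "raster", a path such as data/roads.shp would be handled by the file-based store, while workspaces/topp/workspace.xml would still go through the database. The same check could be reused by the import step to skip those folders.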
3. I like the idea of deleting the data directory after import. But then point (1) _absolutely_ needs to be resolved, because otherwise the data directory will immediately be cached completely, repeatedly.
4. In my opinion, dir() should _always_ be avoided. I would recommend using resources as much as possible and for as long as possible, and only caching when absolutely necessary (usually for a 3rd party lib), which means dir() is rarely necessary and file() is usually sufficient. The issue with the usage of dir() is that it could encourage people to use the file system directly, forgetting that changes to the file system have no lasting effect when using the jdbcstore!
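As a rough illustration of "resources as long as possible", the sketch below reads a configuration through a stream and never stages anything on disk. MiniResource, InMemoryResource and ConfigReader are hypothetical names invented for this example; MiniResource is only loosely modelled on the in()/file() methods discussed in this thread and is not the real Resource interface.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Properties;

/** Hypothetical, minimal stand-in for a resource; NOT the GeoServer Resource API. */
interface MiniResource {
    InputStream in(); // stream the content, wherever it is stored
    File file();      // stage a local copy; costly when the store is a database
}

/** Test double that counts how often a local copy is requested. */
final class InMemoryResource implements MiniResource {
    private final byte[] content;
    int fileCalls = 0;

    InMemoryResource(String text) {
        this.content = text.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public InputStream in() {
        return new ByteArrayInputStream(content);
    }

    @Override
    public File file() {
        fileCalls++;
        try {
            File staged = File.createTempFile("staged", ".tmp");
            staged.deleteOnExit();
            Files.write(staged.toPath(), content);
            return staged;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

final class ConfigReader {
    /** Loads properties through the stream, so nothing is unpacked to disk. */
    static Properties load(MiniResource resource) {
        Properties props = new Properties();
        try (InputStream in = resource.in()) {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }
}
```

Only code that hands a path to a third-party library would call file(), and it would do so at the last moment. Everything else stays on the stream, so nothing written to the local disk can be mistaken for a lasting change.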
Kind Regards
Niels
On 26-10-15 22:28, Jody Garnett wrote:
Thanks Niels, some comments inline, assume this is for GSIP-132 <https://github.com/geoserver/geoserver/wiki/GSIP-132> (unless that is completed already).
On 7 October 2015 at 05:28, Niels Charlier <niels@anonymised.com> wrote:
Hi Jody, Gabriel, Kevin
I have been porting all modules to use the resources system
consistently and only use files when necessary (usually external
library). I still stumbled upon two minor questions/issues I
wanted to discuss.
1. Usage of the "data" directory. At the moment the import from
data directory -> jdbc store ignores the "data" directory. In a
clustered environment, this directory thus remains instance
specific, and it would be up to the user to refer to shared files.
Are you talking about GEOSERVER_DATA_DIRECTORY/data? If so that is only a convention, I have made data directories that used "raster" and "vector" folders for example.
For storing spatial data (GeoTIFF, Shapefile, Image Mosaic here) I had the idea of doing something like JNDI but for referencing an external folder used for this purpose. This could both provide an "ignore" list (so "data" was not hard coded) and allow for a cluster with RAID storage mapped to a specific mount.
At this moment, there is no reason why we couldn't include the
data dir in the jdbcstore and cache it before loading the geotools
datastore. This is actually what my modified version of the rest
service already does because it uses resources everywhere.
For configuration files this is what we want.
Another idea, was to program the jdbcstore to return file based
resources only when the "data" directory is used, so that it
definitely will never store those files in the database unnecessarily.
Okay pretty sure you are talking about GEOSERVER_DATA_DIRECTORY/data now.
Q: Is it worth removing the files that have been imported into JDBCConfig from GEOSERVER_DATA_DIRECTORY? This would prevent confusion, and allow GEOSERVER_DATA_DIRECTORY to work strictly as a cache (for the few things that require a file to be unpacked on to disk).
2. In the jdbcstore, should the children of a directory be cached
when dir() is called?
Caching happens on import (so yes). Should the resources be unpacked (staged) to the file system when dir() is called? Yes.
The DataDirectory class uses the dir() method to know the root of
the data directory, causing the whole data directory to be cached
at once multiple times unnecessarily, since the root dir is
usually requested just to know the path for some reason (all code
that actually needs files in the data dir has been replaced
by resources).
This is a bug; such logic should be replaced. There is another method to get the root of the GEOSERVER_DATA_DIRECTORY. While we may hard code some things now, it would be wise to have an extension point for modules (such as geowebcache) to mark off working directories that should not be cached.
Using dir() to determine the root of GEOSERVER_DATA_DIRECTORY is a bad idea: in addition to breaking the design of dir(), we are trying to avoid duplicating code that contains data directory structure logic.
We now always want to use resources as long as possible, only
calling file() at the last moment if necessary. As a consequence
the dir() method is actually hardly used for the purpose of
getting all the files inside that dir. I would suggest calling
dir() only to create the dir if it doesn't exist yet and not cache
its children. There is only one part of code left where that would
pose a problem, the community module "validation", which passes on
a whole dir to its geotools counterpart. This however could be
changed in the geotools module to pass on a collection of files
instead.
Validation only needed one "validation" folder, so that code could be changed to use resource("validation").dir().
After this change, I wonder if we should make a doc page on the
proper practices of using the Resource API in order to be
clustering-safe.
Yep, could add to the developers guide under "file access".