Dear Jim,
Quick feedback below.
First of all, congratulations on making this work. As I suspected, the
bottleneck is getting the data out of HDFS.
I can think of two things (which are not mutually exclusive):
-1- Maybe complex: put smaller bits into HDFS and use the mosaic to
serve them, or even develop a light(er)weight layer that can pull the
granules.
This would help with WMS requests over large files, as you'd end up
using smaller chunks to satisfy them most of the time.
-2- We could build a more complex ImageInputStream that:
- has an internal cache (file- and/or memory-based) that does not get
thrown away upon each request but tends to live longer for each single
file in HDFS
- lets different streams reuse the same cache. Multiple requests
might read data from the cache concurrently, but when data is not
there, we would block the thread for the request, go back to HDFS,
pull the data, write it to the cache, and so on (see the sketch below).
We could put together 1 and 2 to make things faster.
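To make -2- a bit more concrete, here is a very rough sketch of the
shared per-file cache (all the names are hypothetical, just to
illustrate the idea): readers hit the cache first, and on a miss only
the threads asking for the missing block wait while it is pulled from
HDFS and stored for everyone else.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: one block cache per HDFS file, shared by all
// the ImageInputStreams that read that file.
public class SharedBlockCache {
    private static final int BLOCK_SIZE = 512 * 1024;
    private final Map<Long, byte[]> blocks = new ConcurrentHashMap<>();
    private final HdfsBlockFetcher fetcher; // hypothetical HDFS reader

    public SharedBlockCache(HdfsBlockFetcher fetcher) {
        this.fetcher = fetcher;
    }

    // Concurrent readers get cached blocks immediately; a miss fetches
    // the block once, blocking only the callers that need it.
    public byte[] getBlock(long blockIndex) {
        return blocks.computeIfAbsent(blockIndex,
                idx -> fetcher.fetch(idx * BLOCK_SIZE, BLOCK_SIZE));
    }

    public interface HdfsBlockFetcher {
        byte[] fetch(long offset, int length);
    }
}

An ImageInputStream implementation would then serve its read() calls
out of getBlock() instead of going back to HDFS every time.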
Hope this helps. Anyway, I am in favour of exploring this in order to
allow the GeoServer stack to support data from HDFS.
Regards,
Simone Giannecchini
GeoServer Professional Services from the experts!
Visit http://goo.gl/it488V for more information.
Ing. Simone Giannecchini
@simogeo
Founder/Director
GeoSolutions S.A.S.
Via di Montramito 3/A
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 333 8128928
http://www.geo-solutions.it
http://twitter.com/geosolutions_it
On Sun, Apr 17, 2016 at 9:49 PM, Jim Hughes <jnh5y@anonymised.com> wrote:
Hi all,
I want to report on my success with registering and displaying GeoTiffs
stored on HDFS. There are some limitations with this approach; in
particular, I am unsure if there's any way to cache / memory-map the
data. As such, I believe each request re-downloads the entire file.
Generally, I hope to document my approach well enough so that others
could follow it (if needed) and to solicit feedback. In terms of
feedback, I'd love to hear 1) whether there are improvements to be
made, and 2) whether the changes are reasonable enough to be considered
for a proposal/merge request.
That out of the way, here's the rough outline:
1. Register additional URL handlers.
2. Convince validation layers in GeoServer that 'hdfs' is an ok URL scheme.
3. Get bytes out of the HDFS file.
For step 1, note that Java's URL scheme handling is pluggable via
java.net.URLStreamHandler. The docs (1) point out that one can call
URL.setURLStreamHandlerFactory to set up a factory providing such a
handler. This method can only be called once per JVM, and folks on the
internet (2) jump through hoops since Tomcat already registers a
factory. They seem to have missed the fact that the Tomcat factory
actually lets you add your own. I provide a gist (3) showing a little
bean which instantiates a Hadoop URL handler factory and tries to
install it using both of those methods.
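Roughly, the bean boils down to something like this (a minimal sketch
of the idea rather than the gist itself; it assumes Hadoop's
FsUrlStreamHandlerFactory and Tomcat 8's TomcatURLStreamHandlerFactory
are on the classpath):

import java.net.URL;
import org.apache.catalina.webresources.TomcatURLStreamHandlerFactory;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;

public class HdfsUrlHandlerRegistrar {
    public HdfsUrlHandlerRegistrar() {
        FsUrlStreamHandlerFactory factory = new FsUrlStreamHandlerFactory();
        try {
            // The standard JVM hook; succeeds only once per JVM.
            URL.setURLStreamHandlerFactory(factory);
        } catch (Error alreadySet) {
            // Tomcat got there first, but its factory chains user
            // factories, so we can still add ours.
            TomcatURLStreamHandlerFactory.getInstance()
                    .addUserFactory(factory);
        }
    }
}

Wiring this up as an eagerly-created bean gets the registration done at
context startup.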
There are two places I found in GeoServer which validate the URL given
on the page for adding a GeoTiff. The first is GeoServer's
FileExistValidator, which calls out to a Wicket UrlValidator. Telling
the Wicket class to allow all schemes (its ALLOW_ALL_SCHEMES option)
knocks out that issue. For the second, in FileModel, one needs to
provide a happy path for URLs which are not local to the filesystem.
Those two small changes are here (4).
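The Wicket half of that change amounts to one constructor option; a
tiny sketch of the effect (the hostname below is made up):

import org.apache.wicket.validation.validator.UrlValidator;

public class SchemeCheck {
    public static void main(String[] args) {
        // The default options only accept http, https, and ftp.
        UrlValidator strict = new UrlValidator();
        // ALLOW_ALL_SCHEMES lets 'hdfs://...' through as well.
        UrlValidator relaxed =
                new UrlValidator(UrlValidator.ALLOW_ALL_SCHEMES);
        String url = "hdfs://namenode.example.com:8020/data/sfdem.tif";
        System.out.println(strict.isValid(url));  // expected: false
        System.out.println(relaxed.isValid(url)); // expected: true
    }
}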
Once GeoServer can register a GeoTiff coverage with a non-'file://'
URL, we need to read the bytes. Javax has an interface,
javax.imageio.spi.ImageInputStreamSpi, which adapts between instances
of a particular class and an ImageInputStream.
For my prototype, I wrote an implementation of this interface which
takes a String, checks if it starts with "hdfs", creates a URL, and
returns new MemoryCacheImageInputStream(url.openStream()). The only
problem with this approach is that there is already an implementation
which handles Strings, and GeoTools's ImageIOExt tries the first one
and skips any others. One can update that handling (5) slightly to try
all the handlers. It'd probably be better to update (6) to try
url.openStream as a fallback.
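For reference, here's a sketch along the lines of my prototype SPI
(class and vendor names are illustrative, not the actual code):

import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.util.Locale;
import javax.imageio.spi.ImageInputStreamSpi;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;

public class HdfsImageInputStreamSpi extends ImageInputStreamSpi {

    public HdfsImageInputStreamSpi() {
        // vendor/version strings are placeholders
        super("example", "1.0", String.class);
    }

    @Override
    public ImageInputStream createInputStreamInstance(Object input,
            boolean useCache, File cacheDir) throws IOException {
        String spec = (String) input;
        if (!spec.startsWith("hdfs")) {
            return null; // not ours; let another SPI have a go
        }
        // Relies on the 'hdfs' URL handler registered in step 1.
        URL url = new URL(spec);
        return new MemoryCacheImageInputStream(url.openStream());
    }

    @Override
    public String getDescription(Locale locale) {
        return "ImageInputStream from an hdfs:// URL (sketch)";
    }
}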
During testing, I worked with the sfdem.tif which ships with GeoServer.
The HDFS layer was a little slower than the local filesystem layer, but
it wasn't unusable. To crank things up, I tried out a 600+ megabyte
GeoTiff from Natural Earth, and it was downright slow. Using a network
monitor, I was able to observe network traffic consistent with the
entire file being re-read for most requests. I think this approach may
be slightly useful for layers which are infrequently accessed, and then
only by a few users.
Thanks to everyone who had suggestions and encouragement for the
original thread!
Cheers,
Jim
Step 1: Register additional URL handlers:
1. http://download.java.net/jdk7/archive/b123/docs/api/java/net/URL.html#URL(java.lang.String,%20java.lang.String,%20int,%20java.lang.String)
2. http://skife.org/java/url/library/2012/05/14/java_url_handlers.html
3. Gist for a bean to register the Hadoop URL handlers:
https://gist.github.com/jnh5y/1739baa42466d66e383fa26ffd7235ca
Step 2: GeoServer changes:
4. https://github.com/jnh5y/geoserver/commit/5320f26a0574f034433aa96097054ec1ec782d45
(The FileModel change could be a little more robust.)
Step 3: GeoTools changes:
5. https://github.com/jnh5y/geotools/commit/f2db29339c7f7e43d0c52ab93195babc1abb6f49
6. Or one could modify the URL handling here:
https://github.com/geosolutions-it/imageio-ext/blob/master/library/streams/src/main/java/it/geosolutions/imageio/stream/input/spi/URLImageInputStreamSpi.java#L88-L97