[Geoserver-users] Importer REST extension - Many questions after experimenting with Big CSV files

Hello List,
We are implementing a Geoserver based application where we need to import bulky CSV data into Oracle. We think the importer is perfect for this, since it can handle CSV and can write to a OCI-based oracle datastore. However, before we go further we have been doing some evaluation.

The System specs is as follows

  • OS:RH 6.5
  • Db:Oracle 12x
  • Geoserver Vers: 2.8.1
  • Data size: 23 million records (1.5GB, but could be upto 5GB)

From our initial evaluations we have the following questions regarding the inner workings oft he importer:

  1. CSV Data:
  2. What separater does the importer use by default?
  3. Is the separator character configurable? We noticed that it only accepts comma as a separator
  4. Does quoting or not quoting column values have an effect on the importer ability to parse (from our experiment – NO)
  5. File upload:
  6. When uploading the CSV, is the data compressed by default, and if so which compression? GZIP?
  7. If the data is not compressed by default, can we compress it on the client side (say using Curl)
  8. If b above is correct, how does the server handle a compressed stream? Does it check if the incoming data is compressed?
  9. Database writes:
  10. Do writes begins immediately as soon as the first row oft he CSV data is read on the server side? Or does the importer waits till the entire CSV is uploaded before starting to write into the database
  11. Is there a way we can optimize the writes? Parallelize it?
  12. What settings in geoserver affects the execution of the Importer during the write operations
  13. What can we do on the database side in order to make the process faster? (22 million rows took 3 days to write)
  14. Monitoring:
  15. The progress interface reports only the number of rows inserted, the toal rows and the task status. Are the the only things that can be shown on a progress report?
  16. Is it possible also to retrieve info about:
  17. Start time/date
  18. End time/date
  19. Date of last successful import etc
  20. Others:
  21. Are we using the importer fort he right task (uploading bulk data)?
  22. Was importer envisioned for such use cases?
  23. Is there another known working alternative within Geoserver realm?

Thanks and regards,
Moses

T-Systems International GmbH
Telekom IT
System Integration Telekom IT
E-TSOEW0103
Moses Gone

+49 228 98413510 (Tel.)
E-Mail: moses.gone@anonymised.com

T-Systems International GmbH
Aufsichtsrat: Thomas Dannenfeldt (Vorsitzender)
Geschäftsführung: Reinhard Clemens (Vorsitzender), Dr. Ferri Abolhassan, Thilo Kusch, Dr. Markus Müller, Georg Pepping, Hagen Rickmann

Handelsregister: Amtsgericht Frankfurt am Main HRB 55933
Sitz der Gesellschaft: Frankfurt am Main
WEEE-Reg.-Nr. DE50335567
Hinweis: Diese E-Mail und/oder die Anhänge sind vertraulich und ausschließlich für den bezeichneten Adressaten bestimmt. Die Weitergabe oder Kopieren dieser E-Mail ist strengstens verboten. Wenn Sie diese E-Mail irrtümlich erhalten haben, informieren Sie bitte unverzüglich den Absender und vernichten Sie die Nachricht und alle Anhänge. Vielen Dank.

Große Veränderungen fangen klein an – Ressourcen schonen und nicht jede E-Mail drucken

On Fri, Jan 29, 2016 at 12:01 PM, <Moses.Gone@anonymised.com> wrote:

Hello List,
We are implementing a Geoserver based application where we need to import
bulky CSV data into Oracle. We think the importer is perfect for this,
since it can handle CSV and can write to a OCI-based oracle datastore.
However, before we go further we have been doing some evaluation.

Moses,
with this many questions you're making it look like a professional
consultation instead of a simpler
community question. That might related to why you did not get any answer.
I'll try to answer anyways, but given your use case, I'd also have a look
here:
http://geoserver.org/support/

The System specs is as follows

   - OS:RH 6.5
   - Db:Oracle 12x
   - Geoserver Vers: 2.8.1
   - Data size: 23 million records (1.5GB, but could be upto 5GB)

From our initial evaluations we have the following questions regarding the
inner workings oft he importer:

   1. CSV Data:

   1. What separater does the importer use by default?

Comma

   1. Is the separator character configurable? We noticed that it only
   accepts comma as a separator

Not that I know of, it could be made to

   1. Does quoting or not quoting column values have an effect on the
   importer ability to parse (from our experiment – NO)

Can't remember by heart.

   1. File upload:

   1. When uploading the CSV, is the data compressed by default, and if
   so which compression? GZIP?

I don't think it's compressed, but you can test it

   1. If the data is not compressed by default, can we compress it on the
   client side (say using Curl)

That needs a test to

   1. If b above is correct, how does the server handle a compressed
   stream? Does it check if the incoming data is compressed?

It should check for compression headers. Our code does not do that, but

the Restlet framework used for our REST services might handle it
transparently

   1.

   1. Database writes:

   1. Do writes begins immediately as soon as the first row oft he CSV
   data is read on the server side? Or does the importer waits till the entire
   CSV is uploaded before starting to write into the database

The latter

   1. Is there a way we can optimize the writes? Parallelize it?

You should be able to page it, make an initial import to create the table,

then hit the importer with many smaller pages in parallel. Untested, but it
might work. A single transaction with 23 million records is unlikely to
work anyways.

   1. What settings in geoserver affects the execution of the Importer
   during the write operations

I cannot think of particular settings that would help

   1. What can we do on the database side in order to make the process
   faster? (22 million rows took 3 days to write)

See above, page it.

   1. Monitoring:

   1. The progress interface reports only the number of rows inserted,
   the toal rows and the task status. Are the the only things that can be
   shown on a progress report?

Without changing the code, those are the ones as far as I remember.

   1. Is it possible also to retrieve info about:

   1. Start time/date
   2. End time/date
   3. Date of last successful import etc

Sure, change the importer code :slight_smile:

   1. Others:

   1. Are we using the importer fort he right task (uploading bulk data)?

   1. as importer envisioned for such use cases?

It's the right interface, but it is not yet developed to support such

volumes. It can be improved to

   1. Is there another known working alternative within Geoserver realm?

Not going though GeoServer. Such a big import seems to be more of a job of

a ETL tool, but as said, GeoServer importer could be improved
to handle it more efficiently.

Cheers
Andrea

--

GeoServer Professional Services from the experts! Visit
http://goo.gl/it488V for more information.

Ing. Andrea Aime
@geowolf
Technical Lead

GeoSolutions S.A.S.
Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549

http://www.geo-solutions.it
http://twitter.com/geosolutions_it

*AVVERTENZE AI SENSI DEL D.Lgs. 196/2003*

Le informazioni contenute in questo messaggio di posta elettronica e/o
nel/i file/s allegato/i sono da considerarsi strettamente riservate. Il
loro utilizzo è consentito esclusivamente al destinatario del messaggio,
per le finalità indicate nel messaggio stesso. Qualora riceviate questo
messaggio senza esserne il destinatario, Vi preghiamo cortesemente di
darcene notizia via e-mail e di procedere alla distruzione del messaggio
stesso, cancellandolo dal Vostro sistema. Conservare il messaggio stesso,
divulgarlo anche in parte, distribuirlo ad altri soggetti, copiarlo, od
utilizzarlo per finalità diverse, costituisce comportamento contrario ai
principi dettati dal D.Lgs. 196/2003.

The information in this message and/or attachments, is intended solely for
the attention and use of the named addressee(s) and may be confidential or
proprietary in nature or covered by the provisions of privacy act
(Legislative Decree June, 30 2003, no.196 - Italy's New Data Protection
Code).Any use not in accord with its purpose, any disclosure, reproduction,
copying, distribution, or either dissemination, either whole or partial, is
strictly forbidden except previous formal approval of the named
addressee(s). If you are not the intended recipient, please contact
immediately the sender by telephone, fax or e-mail and delete the
information in this message that has been received in error. The sender
does not give any warranty or accept liability as the content, accuracy or
completeness of sent messages and accepts no responsibility for changes
made after they were sent or for other risks which arise as a result of
e-mail transmission, viruses, etc.

-------------------------------------------------------