[GeoNetwork-users] Difference in order of importation for "Batch Import" tool in GeoNetwork 2.10.1 vs 2.6.4

I had been using GeoNetwork 2.6.4 on a Windows server for some time before
upgrading to 2.10.1 on a Linux server. Things were fine until I used the
"Batch Import" tool (under Administration).

Consider the following files:
- Blainville.xml;
- Boisbriand.xml;
- Bois-des-Fillions.xml;
- Charlemagne.xml;
- Deux-Montagnes.xml

Supposing that the last generated metadata ID is 10, in GeoNetwork 2.6.4
the Batch Import tool would import them in alphabetical order, so the IDs
would be as follows:
- Blainville.xml = 11;
- Boisbriand.xml = 12;
- Bois-des-Fillions.xml = 13;
- Charlemagne.xml = 14;
- Deux-Montagnes.xml = 15
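
For reference, the numbering above can be reproduced with a short Python sketch. It assumes IDs are handed out sequentially in import order; the names `predict_ids` and `ignore_hyphens` are illustrative only and are not part of GeoNetwork. Note that a plain code-point sort would place Bois-des-Fillions.xml before Boisbriand.xml, because "-" sorts before "b", so the order listed above corresponds to a comparison that ignores hyphens and case.

```python
def predict_ids(filenames, last_id, key=None):
    """Map each file name to the ID it would get if the files were
    imported one at a time in sorted order, starting after last_id."""
    ordered = sorted(filenames, key=key)
    return {name: last_id + offset for offset, name in enumerate(ordered, start=1)}


def ignore_hyphens(name):
    # Hypothetical collation: case-insensitive, hyphens dropped.
    return name.lower().replace("-", "")


files = [
    "Deux-Montagnes.xml",
    "Bois-des-Fillions.xml",
    "Charlemagne.xml",
    "Boisbriand.xml",
    "Blainville.xml",
]

# Order as listed in this message (hyphens ignored).
expected = predict_ids(files, last_id=10, key=ignore_hyphens)
# Order from a plain code-point sort, for comparison.
codepoint = predict_ids(files, last_id=10)

for name in sorted(files, key=ignore_hyphens):
    print(name, expected[name], codepoint[name])
```

The two orderings differ only for the two "Bois..." files, which is worth keeping in mind when "alphabetical" is computed on one system and relied upon on another.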

However, when I use this tool in 2.10.1, the IDs are as follows:
- Blainville.xml = 15;
- Boisbriand.xml = 12;
- Bois-des-Fillions.xml = 13;
- Charlemagne.xml = 11;
- Deux-Montagnes.xml = 14

That sure isn't alphabetical, so I looked at other attributes (creation
date, size, etc.), but none match that order. So, what's going on?

From the change logs
(http://www.softpedia.com/progChangelog/GeoNetwork-Changelog-211352.html), I
see that Batch Import is not transactional anymore, but what does this have
to do with anything? This is really driving me crazy, so any help would be
appreciated.

--
View this message in context: http://osgeo-org.1560.x6.nabble.com/Difference-in-order-of-importation-for-Batch-Import-tool-in-GeoNetwork-2-10-1-vs-2-6-4-tp5081523.html
Sent from the GeoNetwork users mailing list archive at Nabble.com.

Oh, I forgot to add that prior to installing 2.10.1 on CentOS, I tested 2.8
on Windows XP and the Batch Import tool functioned as expected.

Hi mutemog,

I think you'll find that batch import in alphabetical order was only ever an accidental feature, a side effect of the import being a straight directory scan. In 2.8.x and later things changed quite a lot: the import job is split among threads that work concurrently to drive the database as hard as you want (according to the number of threads you allocate in the System Configuration). This is a good thing because it speeds up loading quite a lot if your machine has multiple processors, but it does mean that the order in which the files get loaded is not as predictable as before.
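
To illustrate the effect, here is a minimal Python simulation. It is not GeoNetwork's actual code: the function `batch_import`, the random sleep standing in for parsing work, and the shared counter standing in for the database's ID sequence are all assumptions made for the sketch. With one worker the IDs follow the submission order; with several workers the IDs go to whichever file finishes first.

```python
import random
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def batch_import(filenames, threads, last_id=10):
    """Simulate a batch import: each worker 'parses' a file, then takes
    the next free ID. Returns {filename: assigned_id}."""
    lock = threading.Lock()
    next_id = [last_id]
    assigned = {}

    def import_one(name):
        # Stand-in for parsing/validation work of varying duration.
        time.sleep(random.uniform(0.0, 0.01))
        with lock:
            next_id[0] += 1
            assigned[name] = next_id[0]

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Tasks are submitted in sorted order, but with several workers
        # they may finish (and claim IDs) in a different order.
        list(pool.map(import_one, sorted(filenames)))
    return assigned


files = [
    "Blainville.xml",
    "Boisbriand.xml",
    "Bois-des-Fillions.xml",
    "Charlemagne.xml",
    "Deux-Montagnes.xml",
]

single = batch_import(files, threads=1)  # IDs follow sorted order
multi = batch_import(files, threads=4)   # IDs may come out in any order
print(single)
print(multi)
```

The same five IDs are always handed out; only the pairing of ID and file changes from run to run when more than one thread is used.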

You can probably get the behaviour you want (though I'm not sure I quite understand why you must have it :-)) by specifying only one thread in the System Configuration, or by using an import-config.xml to control batch loading, as described on the "Import facilities" page of the GeoNetwork opensource documentation.

On the transactional thing: in 2.10.1 batch import is transactional again (2.10.0 wasn't). You're right that this doesn't have anything to do with load order, but it is important: if something goes wrong, it is not nice to be stuck with a partially loaded database which then needs to be cleaned up and/or have the missing records identified and loaded. In short and quite brutal terms, the load either succeeds or the whole load is aborted and must be done again. There could be some other controls put on this behaviour to make it less brutal, but for now the changes in 2.10.1 restored the transactional behaviour that was lost sometime before 2.10.0 came out.
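
As a rough picture of what "all or nothing" means here, the following Python sketch uses the standard `sqlite3` module. The table layout and the `load_batch` helper are hypothetical and have nothing to do with GeoNetwork's real schema or code; the point is only that one failing record undoes the whole batch.

```python
import sqlite3


def load_batch(conn, records):
    """Insert all records in one transaction; roll back everything if any
    single insert fails. Returns True on success, False on rollback."""
    try:
        with conn:  # commits on success, rolls back on an exception
            for uuid, title in records:
                conn.execute(
                    "INSERT INTO metadata (uuid, title) VALUES (?, ?)",
                    (uuid, title),
                )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (uuid TEXT PRIMARY KEY, title TEXT NOT NULL)")
conn.commit()

good = [("a1", "Blainville"), ("b2", "Boisbriand")]
bad = [("c3", "Charlemagne"), ("c3", "Deux-Montagnes")]  # duplicate uuid

ok_good = load_batch(conn, good)
ok_bad = load_batch(conn, bad)
count = conn.execute("SELECT COUNT(*) FROM metadata").fetchone()[0]
print(ok_good, ok_bad, count)  # prints: True False 2
```

The second batch leaves no trace in the table, not even its first, valid record, which is the "brutal" behaviour described above.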

Cheers,
Simon

________________________________________
From: Wicker_Bin [mutemog@anonymised.com]
Sent: Thursday, 3 October 2013 5:30 AM
To: geonetwork-users@lists.sourceforge.net
Subject: [GeoNetwork-users] Difference in order of importation for "Batch Import" tool in GeoNetwork 2.10.1 vs 2.6.4


Simon, thanks a lot for that explanation!

After setting up 2.10.1 on another machine, everything seems to be in order,
and I think I've pinpointed my problem: it wasn't a thread thing, since it's
set to 1 by default, but rather a hardware problem.

The problematic machine was a virtual machine running on a test
machine that's been used and abused, making things very, very slow all
around. I guess I'd have to benchmark it to get some real quantitative
results for comparison, but comparing 2.8 Windows (Host) vs 2.8 CentOS VM vs
2.10.1 CentOS and getting faulty results only on the VM has led me to
believe this.

Also, the reason I need a fixed import order is that I'm importing XML files
that have been modified with a Python script in a way that groups thumbnails,
files and metadata together under a predicted ID number. If left unmodified,
it wouldn't really matter, since the XML is pointing to the correct
components, but if the user were to edit something, it would probably put the
files in the wrong directory. Anyway, once the massive importation is over
with, this won't be an issue and I'll set the threads.
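
For anyone relying on predicted IDs in the same way, a check along these lines could flag a bad import before anything gets edited. It is only a sketch: `find_mismatches` is a hypothetical helper, and how the actual file-to-ID mapping is read back from GeoNetwork is left out. The sample data are the two listings from my first message.

```python
def find_mismatches(predicted, actual):
    """Return {filename: (predicted_id, actual_id)} for every file whose
    assigned ID differs from the predicted one."""
    return {
        name: (pid, actual.get(name))
        for name, pid in predicted.items()
        if actual.get(name) != pid
    }


predicted = {
    "Blainville.xml": 11,
    "Boisbriand.xml": 12,
    "Bois-des-Fillions.xml": 13,
    "Charlemagne.xml": 14,
    "Deux-Montagnes.xml": 15,
}
actual = {
    "Blainville.xml": 15,
    "Boisbriand.xml": 12,
    "Bois-des-Fillions.xml": 13,
    "Charlemagne.xml": 11,
    "Deux-Montagnes.xml": 14,
}

mismatches = find_mismatches(predicted, actual)
for name, (want, got) in sorted(mismatches.items()):
    print(f"{name}: predicted {want}, got {got}")
```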

Thanks again!
