[GeoNetwork-users] [datahub] code 400 when getting amount of organisations

Hi,

I'm trying to deploy datahub v1.0.1 with GN 4.0.6 within geOrchestra. So far it works fine, *but* on the homepage the request that fetches the number of orgs fails with a 400 code.

The reply seems to come from ES (7.17.9, upstream ES Debian package), and is undecipherable mumbo-jumbo that probably only ES experts can understand:

Error is: Bad Request.

Request: {"aggregations":{"contact":{"nested":{"path":"contactForResource"},"aggs":{"org":{"terms":{"field":"contactForResource.organisation","exclude":"","size":1000,"order":{"_key":"asc"}},"aggs":{"mail":{"terms":{"size":1000,"exclude":"","field":"contactForResource.email.keyword"}}}}}}},"from":0,"size":0,"query":{"bool":{"must":[{"terms":{"isTemplate":["n"]}}],"must_not":{"terms":{"resourceType":["service","map","map/static","mapDigital"]}},"should":,"filter":[{"query_string":{"query":"(op0:(0 OR 1) ) AND (draft:n OR draft:e)"}}]}},"_source":}

Error: {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [contactForResource.organisation] in order to load field data by uninverting the inverted index. Note that this can use significant memory."}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"gn-records","node":"Xm14XIsPRQiXF8-EfHgK5w","reason":{"type":"illegal_argument_exception","reason":"Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [contactForResource.organisation] in order to load field data by uninverting the inverted index. Note that this can use significant memory."}}],"caused_by":{"type":"illegal_argument_exception","reason":"Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [contactForResource.organisation] in order to load field data by uninverting the inverted index. Note that this can use significant memory.","caused_by":{"type":"illegal_argument_exception","reason":"Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [contactForResource.organisation] in order to load field data by uninverting the inverted index. Note that this can use significant memory."}}},"status":400}

My understanding is that something is wrong in the ES index configuration, but I have no idea where this can be modified/tweaked. The GN config wrt ES should be 'stock' afaict. I've looked at https://geonetwork-opensource.org/manuals/4.0.x/en/maintainer-guide/statistics/setup-elasticsearch.html and https://geonetwork-opensource.org/manuals/4.0.x/en/install-guide/installing-index.html but that doesn't help much.
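
For readers who find the raw response hard to scan, here is a minimal Python sketch that pulls the useful parts out of an Elasticsearch error body shaped like the one above. The function name `summarize_es_error` and the regular expression are illustrative assumptions, not part of GeoNetwork or datahub; only the `error.root_cause[].type`, `error.root_cause[].reason` and `status` keys are taken from the response quoted in this message.

```python
import json
import re


# Hypothetical helper: not part of GeoNetwork, datahub, or any ES client.
def summarize_es_error(error_json):
    """Condense an Elasticsearch error body into status, type, field and a short reason."""
    payload = json.loads(error_json)
    causes = []
    for cause in payload["error"]["root_cause"]:
        reason = cause["reason"]
        # The fielddata message names the offending field in square brackets.
        match = re.search(r"fielddata=true on \[([^\]]+)\]", reason)
        causes.append({
            "type": cause["type"],
            "field": match.group(1) if match else None,
            # Keep only the first sentence of the (long) reason.
            "summary": reason.split(". ")[0],
        })
    return {"status": payload.get("status"), "causes": causes}
```

Applied to the response above, this should report the field `contactForResource.organisation`, i.e. the terms aggregation is hitting a field that the index maps as `text` rather than `keyword`.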

This is reproducible on 2 internal instances and on https://demo.georchestra.org/datahub/ (open the network console and look for the 400 error code on POST https://demo.georchestra.org/geonetwork/srv/api/search/records/_search?bucket=bucket).

I see it works fine on https://geo2france.fr/datahub/ but I don't know what is different there.

Thanks for any pointers/help.

--
Landry Breuil

Hi Landry,

Actually, there is an outdated spec in the default index definition
(records.json) about contact.organisation, which is used by the datahub.
This change
<https://github.com/geonetwork/core-geonetwork/blame/main/web/src/main/webResources/WEB-INF/data/config/index/records.json#L1298>
should hopefully work, let me know.
Be careful, because the index file that is actually used is generated from
the web resource one.
Then you need to delete your index and reindex (you can do that from the
admin console/tools).

BTW, I strongly recommend using the gn4.2.x branch with the datahub 1.0,
otherwise you won't have the favorite thing.

Maybe we should maintain a compatibility matrix between the datahub and
the GN 4 API.

Cheers

On Tue, Mar 7, 2023 at 12:33 PM Landry Breuil via GeoNetwork-users <
geonetwork-users@lists.sourceforge.net> wrote:

_______________________________________________
GeoNetwork-users mailing list
GeoNetwork-users@lists.sourceforge.net
geonetwork-users List Signup and Options
GeoNetwork OpenSource is maintained at
GeoNetwork - Geographic Metadata Catalog download | SourceForge.net

--
*camptocamp*
INNOVATIVE SOLUTIONS
BY OPEN SOURCE EXPERTS

*Florent Gravin*
*Technical Leader - Architect*
+33 4 58 48 20 36

On 07/03/2023 12:51, Florent Gravin wrote:

> Hi Landry,
>
> Actually, there is an outdated spec in the default index definition (records.json) about contact.organisation, which is used by the datahub.
> This change <https://github.com/geonetwork/core-geonetwork/blame/main/web/src/main/webResources/WEB-INF/data/config/index/records.json#L1298> should hopefully work, let me know.

Definitely helps once I've figured out the precise right steps.. see below :)

> Be careful, because the index file that is actually used is generated from the web resource one.
> Then you need to delete your index and reindex (you can do that from the admin console/tools).

Well, I *think* I got it right at some point. It's only confusing because there's a records.json in the geonetwork datadir (under config/index) and one in the webapp at geonetwork/WEB-INF/data/config/index/records.json, and I don't know which is supposed to be the 'right' one.

I've "fixed" both to reflect the change in https://github.com/geonetwork/core-geonetwork/commit/ca727fd70e6ae0ef7325dc9fd3733e2400b6cff5. I had a working index config once, but now clicking 'delete and reindex' only seems to recreate an old broken config with "org":

$ curl -s -ugn_user:xxxx -XGET http://xxx:9200/gn-records-demo | jq -r '."gn-records-demo".mappings.dynamic_templates[8].contact.mapping.properties'
{
   "role": {
     "type": "keyword"
   },
   "org": {
     "type": "keyword"
   }
}

I've tried manually deleting the index via curl -XDELETE and, afaict, the only 'working' setup is when I force-create the index via curl -XPUT -d@records.json:

$ jq '.mappings.dynamic_templates[8].contact.mapping' /srv/tomcat/georchestra/webapps/geonetwork/WEB-INF/data/config/index/records.json
{
   "type": "nested",
   "properties": {
     "organisation": {
       "type": "keyword"
     },
     "role": {
       "type": "keyword"
     }
   }
}

$ curl -ugn_user:xxx -XDELETE http://xxx:9200/gn-records-demo
$ curl -H 'Content-Type: application/json' -ugn_user:xxx -XPUT -d@/srv/tomcat/georchestra/webapps/geonetwork/WEB-INF/data/config/index/records.json http://xxx:9200/gn-records-demo
$ curl -ugn_user:xxx -XGET http://xxx:9200/gn-records-demo | jq -r '."gn-records-demo".mappings.dynamic_templates[8].contact.mapping.properties'
{
   "role": {
     "type": "keyword"
   },
   "organisation": {
     "type": "keyword"
   }
}

From that point, if I only hit 'reindex', things are fine and the organisations query from datahub gets a valid reply.

But if I let GN create the index configuration, it's wrong. Same thing if I have a working index config and hit 'delete and reindex': the newly created index will have the wrong config.
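
The manual comparison above can also be sketched in Python. This is a minimal sketch, assuming the two JSON shapes shown by the `jq` commands in this message: records.json has a top-level `mappings` key, and the `GET` on the index returns the same structure nested under the index name. The function names are hypothetical. It looks the `contact` template up by name instead of relying on position `[8]`, which may not be stable across versions.

```python
# Hypothetical helpers for comparing records.json with the live index mapping.

def contact_properties(mappings):
    """Return the properties of the 'contact' dynamic template, or None if absent."""
    for template in mappings.get("dynamic_templates", []):
        # Each dynamic template is a single-key object: {"<name>": {...}}.
        if "contact" in template:
            return template["contact"].get("mapping", {}).get("properties")
    return None


def index_matches_file(file_definition, index_response, index_name):
    """True when the live index has the same contact properties as records.json."""
    expected = contact_properties(file_definition["mappings"])
    actual = contact_properties(index_response[index_name]["mappings"])
    return expected is not None and expected == actual
```

Here `file_definition` would be the parsed records.json and `index_response` the parsed body of the `GET` on the index; a `False` result corresponds to the stale `org` mapping shown earlier.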

> BTW, I strongly recommend using the gn4.2.x branch with the datahub 1.0, otherwise you won't have the favorite thing.

Sure, but GN 4.2 has only been in georchestra master for 2 days, and I'm wary of running it in production this early, before all bugs are shaken out.

> Maybe we should maintain a compatibility matrix between the datahub and the GN 4 API.

That would be very welcome :)

--
Landry Breuil

On 07/03/2023 15:13, Landry Breuil via GeoNetwork-users wrote:

> On 07/03/2023 12:51, Florent Gravin wrote:
>> Hi Landry,
>>
>> Actually, there is an outdated spec in the default index definition (records.json) about contact.organisation, which is used by the datahub.
>> This change <https://github.com/geonetwork/core-geonetwork/blame/main/web/src/main/webResources/WEB-INF/data/config/index/records.json#L1298> should hopefully work, let me know.
>
> Definitely helps once I've figured out the precise right steps.. see below :)

Never mind, on this sandbox instance something had set the index config directory to ... another one. No wonder it didn't take my new config into account...

In the meantime I've deployed the correction on https://demo.georchestra.org/geonetwork (still running 4.0.6) and https://demo.georchestra.org/datahub is happy and gets the number of orgs from ES.

Thanks, Florent!

--
Landry Breuil

Hi Landry,

The index folder can be set from an environment variable, so it can be
confusing which records.json is the right one.
You can check that from admin/statistics/information.

The geOrchestra instance shows that we should increase the facet count
for the organisations (1000 is the current limit).
Also, it seems that ES is limited to 10K when counting the records,
which is weird because in GN the total count is ~15K.
We will try to figure that out.
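
One possible explanation for the 10K figure, offered as an assumption rather than a diagnosis: Elasticsearch 7 by default only tracks the total hit count accurately up to 10,000 and reports `"relation": "gte"` beyond that, unless the request sets `track_total_hits` to `true`. The Python sketch below shows how a search body like the one quoted at the top of this thread could be adjusted. The function names and the bucket size of 5000 are hypothetical, and whether datahub or the GeoNetwork search API passes these options through unchanged is not confirmed here.

```python
import copy


# Hypothetical helpers; the key names follow the request quoted in this thread.
def with_exact_total(search_body, org_bucket_size=5000):
    """Copy a search body, asking ES for an exact hit count and a larger org facet."""
    body = copy.deepcopy(search_body)
    # Default ES 7 behaviour stops counting hits precisely at 10,000.
    body["track_total_hits"] = True
    # Raise the terms aggregation size above the current 1000 buckets.
    body["aggregations"]["contact"]["aggs"]["org"]["terms"]["size"] = org_bucket_size
    return body


def describe_total(response):
    """Render hits.total, making a capped ('gte') count explicit."""
    total = response["hits"]["total"]
    prefix = "at least " if total["relation"] == "gte" else ""
    return f"{prefix}{total['value']}"
```

The original body is left untouched, so the adjusted and unadjusted requests can be sent side by side to compare the reported totals.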

Thanks

On Tue, Mar 7, 2023 at 4:55 PM Landry Breuil via GeoNetwork-users <
geonetwork-users@lists.sourceforge.net> wrote:

--
*camptocamp*
INNOVATIVE SOLUTIONS
BY OPEN SOURCE EXPERTS

*Florent Gravin*
*Technical Leader - Architect*
+33 4 58 48 20 36