[Geoserver-devel] JMS Cluster fixes and questions

Hello there,

I’ve created a PR to fix the jms-cluster extension build on behalf of koordinates.com, as its tests were failing (https://github.com/geoserver/geoserver/pull/4040).

I’m not sure if there’s a fixed module maintainer but based on commit history I’d say Nuno and Andrea.
In any case, I’m willing to put some more work into it, so would like to first check it is ok for me to contribute to those modules, and whether you guys could take a look at those pr’s in the near term, just so we don’t run off a custom branch for too long.

Now a couple of questions. While digging into the code I found a couple of things I’m not completely clear on, maybe I’m missing something, so bear with me.

The most concerning issue to me is that slaves apply changes when handling catalog pre-modify events. When handling post-modify events, slaves merely force their local catalog to fire a post-modify event.
This seems to mean that if something fails at master between pre-modify and post-modify, slaves might end up out of sync since they’ve applied the change while handing the pre-modify event. That is to say, unless I’m missing something important, that slaves should apply the modifications when handling post-modify events and not pre-modify events, and that post-modify events on the slaves are currently being applied twice for each change, when applying the change locally at pre-modify handling (the local catalog would fire them), and again when dealing with the post-modify event (in JMSCatalogPostModifyEventHandler)?

As for dependencies, there are 4 apache geronimo dependencies that seem unnecessary, like in, we’re not using JMS JTA transactions are we? or are those geronimo dependencies somehow needed at runtime? I can run all the tests without them but not sure whether they would be needed at run time based on some configuration aspect?

There are possibly other improvements that can be done, not as crucial, such as

  • making sure the updateSequence doesn’t get out of sync between master(s) and slaves (currently slaves end up with master’s updateSequence + 1)
  • assess and maybe improve JMS transaction throughput, reducing the message size (e.g. events payload is the fully xstream-encoded catalog object, which in some cases is rather large, where it could be proxified like the ones saved to disk - using ids instead of fully encoded CatalogInfo relationships)
  • I’m not totally certain of the implications of the following producer.disable/enable pattern under concurrency in all message handlers:

public @Override boolean synchronize(JMSGlobalModifyEvent ev) {
try {
final GeoServerInfo localObject = localizeGeoServerInfo(geoServer, ev);
producer.disable();
this.geoServer.save(localObject);
} finally {
producer.enable();
}
return true;
}

That is, “producer” is a ToggleSwitch, which toggles either master or slave behavior.
By looking at the bean wiring in applicationContext.xml, all handlers are given a ref to the server toggle (JMSToggleProducer), while the slave one (JMSToggleConsumer) is unused.

Calling producer.enable()/disable() dispatches a ToggleEvent that in turn is caught by all JMSApplicationListeners (i.e. slave message processor, and master catalog and configuration producers), to temporarily disable event dispatching.

So by looking at the above snippet, it seems to me like in a multi-master set up (e.g. nodes being both master and slave), the processing of an incoming message (handler.synchronize) would prevent the same node to broadcast events from changes being applied on that node to other nodes?

I guess that’s it for now, please let me know whether those are legit concerns or I’m missing something, I’m willing to set up test cases and fixes but wanted to check my understanding first.

Cheers,
Gabriel.

···

Gabriel Roldán

Hi Gabriel,
sorry for the feedback delay, busy times, please see my answers bellow:

Hello there,

I’ve created a PR to fix the jms-cluster extension build on behalf of koordinates.com, as its tests were failing (https://github.com/geoserver/geoserver/pull/4040).

I’m not sure if there’s a fixed module maintainer but based on commit history I’d say Nuno and Andrea.
In any case, I’m willing to put some more work into it, so would like to first check it is ok for me to contribute to those modules, and whether you guys could take a look at those pr’s in the near term, just so we don’t run off a custom branch for too long.

Your contributions are welcome, I just had a look at the PR.

Now a couple of questions. While digging into the code I found a couple of things I’m not completely clear on, maybe I’m missing something, so bear with me.

The most concerning issue to me is that slaves apply changes when handling catalog pre-modify events. When handling post-modify events, slaves merely force their local catalog to fire a post-modify event.
This seems to mean that if something fails at master between pre-modify and post-modify, slaves might end up out of sync since they’ve applied the change while handing the pre-modify event. That is to say, unless I’m missing something important, that slaves should apply the modifications when handling post-modify events and not pre-modify events, and that post-modify events on the slaves are currently being applied twice for each change, when applying the change locally at pre-modify handling (the local catalog would fire them), and again when dealing with the post-modify event (in JMSCatalogPostModifyEventHandler)?

Sounds like a good reasoning, I didn’t wrote that code so not sure why it was implemented that way.
But I remember that two or three years ago I asked the same question to myself, I have been scratching me head but I don’t remember the conclusion I come to.

I would say that as long the integrations tests and manual tests pass, I have no objection to those changes.

As for dependencies, there are 4 apache geronimo dependencies that seem unnecessary, like in, we’re not using JMS JTA transactions are we? or are those geronimo dependencies somehow needed at runtime? I can run all the tests without them but not sure whether they would be needed at run time based on some configuration aspect?

I don’t know, manual testing will need to performed :-).

There are possibly other improvements that can be done, not as crucial, such as

  • making sure the updateSequence doesn’t get out of sync between master(s) and slaves (currently slaves end up with master’s updateSequence + 1)
  • assess and maybe improve JMS transaction throughput, reducing the message size (e.g. events payload is the fully xstream-encoded catalog object, which in some cases is rather large, where it could be proxified like the ones saved to disk - using ids instead of fully encoded CatalogInfo relationships)
  • I’m not totally certain of the implications of the following producer.disable/enable pattern under concurrency in all message handlers:

public @Override boolean synchronize(JMSGlobalModifyEvent ev) {
try {
final GeoServerInfo localObject = localizeGeoServerInfo(geoServer, ev);
producer.disable();
this.geoServer.save(localObject);
} finally {
producer.enable();
}
return true;
}

That is, “producer” is a ToggleSwitch, which toggles either master or slave behavior.
By looking at the bean wiring in applicationContext.xml, all handlers are given a ref to the server toggle (JMSToggleProducer), while the slave one (JMSToggleConsumer) is unused.

Calling producer.enable()/disable() dispatches a ToggleEvent that in turn is caught by all JMSApplicationListeners (i.e. slave message processor, and master catalog and configuration producers), to temporarily disable event dispatching.

So by looking at the above snippet, it seems to me like in a multi-master set up (e.g. nodes being both master and slave), the processing of an incoming message (handler.synchronize) would prevent the same node to broadcast events from changes being applied on that node to other nodes?

Yes that’s correct, I think that’s why that if is there.

I guess that’s it for now, please let me know whether those are legit concerns or I’m missing something, I’m willing to set up test cases and fixes but wanted to check my understanding first.

Your remarks \ concerns look legit to me, I can dedicate some time to review your PRs.
I would just kindly ask you to break your changes in small PRs if possible, it will make the reviews easier :-)

Thank you,
Nuno Oliveira

Cheers,
Gabriel.

_______________________________________________
Geoserver-devel mailing list
[Geoserver-devel@anonymised.comsts.sourceforge.net](mailto:Geoserver-devel@lists.sourceforge.net)
[https://lists.sourceforge.net/lists/listinfo/geoserver-devel](https://lists.sourceforge.net/lists/listinfo/geoserver-devel)


<details class='elided'>
<summary title='Show trimmed content'>&#183;&#183;&#183;</summary>

Gabriel Roldán

</details>

Hi Nuno,

sorry for the late reply, and thanks for your reply and review. I’ve been on vacation and a complicated back to work week.

So essentially I’ll be looking at the comments in the PR asap. I know it was a rather large one, had to do a lot until I got comfortable with the code base and some of the stuff is more related to avoiding code duplication than to the fix itself, which is also hard to spot out due to the nature of the integration tests that blindly check for expected number of events. But of course I agree smaller, more focused PR’s would be better and will make sure to do so in the future.

For the time being, I guess getting the module back on its feet is important and I’m willing to continue making improvements.

(attachments)

face-smile.png

···

Gabriel Roldán

Hi Gabriel,

Hi Nuno,

sorry for the late reply, and thanks for your reply and review. I’ve been on vacation and a complicated back to work week.

Np, welcome back.

So essentially I’ll be looking at the comments in the PR asap. I know it was a rather large one, had to do a lot until I got comfortable with the code base and some of the stuff is more related to avoiding code duplication than to the fix itself, which is also hard to spot out due to the nature of the integration tests that blindly check for expected number of events. But of course I agree smaller, more focused PR’s would be better and will make sure to do so in the future.

Ok let’s see your feedback, but still that PR is too big, will do my best to review, thx for your understanding.

For the time being, I guess getting the module back on its feet is important and I’m willing to continue making improvements.

This plugin is running in several production environments, so we need to be careful to not break current systems expectations, heavy manual testing will also be needed here.

That say, I look forward to read your feedback on the current PR :-).

···

Gabriel Roldán