[Geoserver-users] GitHub integration - Encoding issue

Hello,

How do you understand the reply from Transifex?

Regards
Alexandre

---------- Forwarded message ---------
From: Ryan Bernstein <support@anonymised.com>
Date: Wed, Jan 5, 2022 at 15:33
Subject: Re: GitHub integration - Encoding issue
To: Alexandre Gacon <alexandre.gacon@anonymised.com>

Hello Alexandre,

I hope you are doing well!

It seems that you are very technical, so I’m just going to copy/paste the comments as-is from our developers who looked into this issue quite extensively…


In order to understand the following explanation, keep in mind that:

  • UTF-8 is the encoding that properly preserves all non-ASCII, non-latin1 characters
  • ISO-8859-1 (aka latin1) is an ASCII-based encoding that contains all the ASCII characters plus some additional ones used in Latin-alphabet languages (e.g. é, è, etc.)
  • us-ascii is the standard encoding for electronic communication and, as we already mentioned, a subset of the latin1 encoding.
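
As an illustration of the three definitions above, here is a minimal Python sketch. The sample strings and the helper name fits_latin1 are made up for this example; only the standard codecs are used.

```python
# Illustration only: how the same text is represented in each encoding.
ascii_text = "title=Layers"
latin1_text = "title=Aper\u00e7u des donn\u00e9es"  # contains ç and é

# Pure ASCII text yields identical bytes in all three encodings,
# which is why us-ascii is a subset of both latin1 and UTF-8.
same_bytes = (
    ascii_text.encode("ascii")
    == ascii_text.encode("iso-8859-1")
    == ascii_text.encode("utf-8")
)

# "é" (U+00E9) is a single byte in ISO-8859-1 but two bytes in UTF-8.
e_latin1 = "\u00e9".encode("iso-8859-1")
e_utf8 = "\u00e9".encode("utf-8")


def fits_latin1(text: str) -> bool:
    """Return True if every character of text exists in ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


# A Cyrillic letter such as U+0420 has no ISO-8859-1 representation.
cyrillic_fits = fits_latin1("\u0420")
```

The byte difference for "é" is what makes a latin1 file and a UTF-8 file incompatible as soon as one accented character appears.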

After the new tests regarding the retaining of the encoding of the file given in the ticket, we noticed the following:

  • If a non-latin1, non-ascii character exists in the translation (UTF-8 characters) then the final translation file will contain the UTF-8 escaped corresponding characters (e.g. \u0420 corresponds to some Cyrillic letter).
  • In our case, the latin1 character wasn’t part of the translated strings but part of the structure of the file, at the template of the file. This means that we don’t want to change it to the UTF-8 escaped character.
  • But on the other hand, the library that we are using in order to integrate github with transifex is not supporting latin1 but UTF-8 so when a non-ascii character appears it converts the whole file to the best encoding that can represent that character. In our case that is UTF-8.

In order to preserve the us-ascii encoding (not the latin1) in github one must make sure that the source keys and the comments of the file do not contain any non ascii characters.
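
A check along these lines could be run on a source file before pushing it. This is a minimal Python sketch under simplifying assumptions: the function name non_ascii_in_template is hypothetical, and the parsing ignores escaped separators and continuation lines that the Java Properties format allows.

```python
import re


def non_ascii_in_template(text: str):
    """Return (line_number, kind, content) for every comment or key
    that contains a non-ASCII character. Values are not checked."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.lstrip()
        if not stripped:
            continue
        if stripped[0] in "#!":
            # Comment lines start with '#' or '!'.
            kind, part = "comment", stripped
        else:
            # Simplification: the key ends at the first '=', ':' or whitespace.
            kind, part = "key", re.split(r"[=:\s]", stripped, maxsplit=1)[0]
        if not part.isascii():
            findings.append((number, kind, part))
    return findings


sample = "# Donn\u00e9es\nname=Nom\ncl\u00e9=valeur\nother=\u00e9\n"
report = non_ascii_in_template(sample)
```

In the sample, the comment on line 1 and the key on line 3 are reported, while the accented value on line 4 is not, since only keys and comments belong to the template.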

In case something wasn’t clear, what this means is that because the source file had a latin1 character (é) even though the translations for the strings did not, this character was kept as-is (not escaped) as part of the “template”. Therefore, the translation files sent back to GitHub are being encoded with UTF-8 by the library being used. We do not think we can do anything about this, unfortunately. So, the translation files for the Java Properties file format must be retrieved from Transifex directly instead of using the GitHub integration.
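
If a UTF-8 translation file has already arrived through the GitHub integration, one possible way to bring it back to ISO-8859-1 is sketched below in Python. This is an assumed workaround, not something Transifex provides; the function name utf8_to_latin1_properties is hypothetical.

```python
def utf8_to_latin1_properties(data: bytes) -> bytes:
    """Re-encode UTF-8 properties content as ISO-8859-1.

    Characters that exist in ISO-8859-1 (such as é) are kept as-is;
    everything else is written as Java-style \\uXXXX escapes, using
    UTF-16 code units as the Properties format expects.
    """
    text = data.decode("utf-8")
    pieces = []
    for ch in text:
        if ord(ch) <= 0xFF:
            pieces.append(ch)
            continue
        units = ch.encode("utf-16-be")
        for i in range(0, len(units), 2):
            code_unit = int.from_bytes(units[i:i + 2], "big")
            pieces.append(f"\\u{code_unit:04X}")
    return "".join(pieces).encode("iso-8859-1")


utf8_file = "# Donn\u00e9es\ntitle=\u0420\n".encode("utf-8")
latin1_file = utf8_to_latin1_properties(utf8_file)
```

Note that escapes are only interpreted in keys and values, so a character outside ISO-8859-1 inside a comment would stay visible as a literal \uXXXX sequence.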

Is this all clear? Do you have any other questions?

Kind regards,
Ryan


Ryan Bernstein
Customer Support Engineer | Transifex

On Fri, Dec 24, 2021 at 5:54 AM PST, Alexandre Gacon <alexandre.gacon@anonymised.com> wrote:

Yes Ryan.

It is clear now. It is a good start, since it will keep Transifex up to date on what remains to be translated. A two-way synchronisation would indeed be better, but for Christmas it is a nice present from you.

Have a good holiday and enjoy the coming events too!

Regards
Alexandre

On Fri, Dec 24, 2021 at 14:50, Ryan Bernstein <support@anonymised.com0…> wrote:

On Fri, Dec 24, 2021 at 5:49 AM PST, Ryan Bernstein <support@anonymised.com..> wrote:

Hi,

Source files will now be kept in the correct ISO-8859-1 encoding/format when using the GitHub integration.

Any translation files pushed back to GitHub via the integration will be in UTF-8 format instead of the correct (ISO-8859-1) format. Our developers will continue looking into this, but they haven’t found a solution so far, unfortunately. So, please use the UI (or the newer API/CLI for automation) to download these translation files.

Does this help clarify things?

Best,
Ryan


On Fri, Dec 24, 2021 at 5:39 AM PST, Alexandre Gacon <alexandre.gacon@anonymised.com> wrote:

Hi Ryan,

Just to be sure I understand: the encoding of the source properties file will be kept. If a translation file is provided by GitHub in the ISO encoding, it will be kept too, but if a new language is added through Transifex, the encoding will be wrong?

Or will all the translations be in UTF-8 when pushed to GitHub, so that the only way to get them in the correct encoding is to use the Transifex UI?

Regards
Alexandre

On Fri, Dec 24, 2021 at 14:15, Ryan Bernstein <support@anonymised.com0…> wrote:

On Fri, Dec 24, 2021 at 5:15 AM PST, Ryan Bernstein <support@anonymised.com..> wrote:

Hi Alexandre,

OK, we have provided a fix for creating a resource file in Java Properties format.
This means that whenever you create a new resource file, the source file will keep the ISO-8859-1 encoding instead of UTF-8.

Now, in your project where a UTF-8 source file already exists, there are 2 ways to change its encoding:

  1. Remove the resource and recreate it by using the GitHub integration.
  2. Update the content in the remote repository (for example, by adding a comment line such as # properties file); the next sync will then update the source file as well.

As for the translated file, we couldn’t find a solution there because a lot is happening in external libraries outside our control. For the time being, though, you can get the translated file directly from the TX UI in ISO-8859-1 encoding.

Does this make sense? We hope that this at least provides a way for you to continue using these files with Transifex for now?

Best,
Ryan

P.S. Happy Holidays!


On Mon, Dec 20, 2021 at 10:05 AM PST, Alexandre Gacon <alexandre.gacon@anonymised.com> wrote:

Ok, thanks for the update!

On Mon, Dec 20, 2021 at 14:24, Ryan Bernstein <support@anonymised.com0…> wrote:

On Mon, Dec 20, 2021 at 5:23 AM PST, Ryan Bernstein <support@anonymised.com..> wrote:

Hello Alexandre,

I hope you are well!

I just wanted to send a quick update that our developers are still looking into this. They do not have a resolution yet, but are trying to determine what can be done. We will keep you posted on any updates…

Kind regards,
Ryan


On Wed, Dec 15, 2021 at 10:35 AM PST, Alexandre Gacon <alexandre.gacon@anonymised.com> wrote:

Hi Ryan,

Happy to hear that you are going to try to solve this! I find that Transifex is a wonderful solution and I would be very happy if we manage to use it more and more for open source projects!

Regards
Alexandre

On Wed, Dec 15, 2021 at 19:27, Ryan Bernstein <support@anonymised.com0…> wrote:

On Wed, Dec 15, 2021 at 10:26 AM PST, Ryan Bernstein <support@anonymised.com…> wrote:

Hello Alexandre,

I hope you are well! Please allow me to jump in here in place of Cesar.

We have verified what you said, and are looking into it. We will probably need to create a ticket to get this resolved by our developers.

Further, I tried converting the translation file to the correct ISO-8859-1 encoding (using Sublime Text), but that didn’t work, unfortunately…

So, we understand that this issue needs to be addressed, and we will definitely keep you updated on our progress!

Our apologies for this issue :(

Kind regards,
Ryan


On Wed, Dec 15, 2021 at 9:11 AM PST, Alexandre Gacon <alexandre.gacon@anonymised.com> wrote:

Hello,

We made another attempt with the attached files. As you can see, both are encoded in ISO-8859-1.

After completing the translation and validating it, the GitHub pull request contained a file encoded as UTF-8.
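
For anyone wanting to reproduce this kind of check, the encoding of a received file can be guessed from its raw bytes. The Python sketch below is only a heuristic under stated assumptions: the function name guess_properties_encoding is hypothetical, and any byte sequence that is not valid UTF-8 is simply assumed to be an 8-bit encoding such as ISO-8859-1.

```python
def guess_properties_encoding(data: bytes) -> str:
    """Heuristically classify file content by encoding."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8 (with BOM)"
    try:
        data.decode("ascii")
        # Pure ASCII is valid in every encoding discussed here.
        return "us-ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # A lone 0xE9 byte (é in ISO-8859-1) is not valid UTF-8.
        return "iso-8859-1 (or another 8-bit encoding)"


as_latin1 = guess_properties_encoding("# caf\u00e9\n".encode("iso-8859-1"))
as_utf8 = guess_properties_encoding("# caf\u00e9\n".encode("utf-8"))
```

A pure ASCII file is reported as us-ascii because nothing in its bytes distinguishes the three encodings.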

Do you have any suggestions on how to solve this?

Regards
Alexandre Gacon

On Mon, Dec 13, 2021 at 23:06, Cesar Garcia <support@anonymised.com..> wrote:

On Wed, Jan 5, 2022 at 6:46 PM Alexandre Gacon <alexandre.gacon@anonymised.com> wrote:


In order to understand the following explanation, keep in mind that:

  • UTF-8 is the encoding that properly preserves all non-ASCII, non-latin1 characters
  • ISO-8859-1 (aka latin1) is an ASCII-based encoding that contains all the ASCII characters plus some additional ones used in Latin-alphabet languages (e.g. é, è, etc.)

Probably key to understanding the rest, latin1 and ISO8859-1 are the same (it confused me at first).

  • us-ascii is the standard encoding for electronic communication and, as we already mentioned, a subset of the latin1 encoding.

After the new tests regarding the retaining of the encoding of the file given in the ticket, we noticed the following:

  • If a non-latin1, non-ascii character exists in the translation (UTF-8 characters) then the final translation file will contain the UTF-8 escaped corresponding characters (e.g. \u0420 corresponds to some Cyrillic letter).

Ok, so Transifex won’t support the Wicket “.utf8.properties” convention, and will just escape chars so that they can be encoded in ISO-8859-1 instead.

  • In our case, the latin1 character wasn’t part of the translated strings but part of the structure of the file, at the template of the file. This means that we don’t want to change it to the UTF-8 escaped character.

I don’t understand what “the structure of the file” as opposed to “part of the translated strings” means. Maybe the latin1 character was in a key rather than in a value? Or maybe in a comment.

  • But on the other hand, the library that we are using in order to integrate github with transifex is not supporting latin1 but UTF-8 so when a non-ascii character appears it converts the whole file to the best encoding that can represent that character. In our case that is UTF-8.

It seems they have a technical limitation: they can do either us-ascii or escaped UTF-8, but do not support latin1 (ISO-8859-1).

In order to preserve the us-ascii encoding (not the latin1) in github one must make sure that the source keys and the comments of the file do not contain any non ascii characters.

It seems that we can either use only us-ascii chars (and encode anything else, including accented letters, using UTF-8 escape codes), or maybe go fully UTF-8? Regardless, it seems ISO-8859-1 is simply out of the equation?
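
To illustrate the first option, here is a minimal Python sketch of an ASCII-only round trip. The function names are hypothetical, and the unescape step is deliberately naive: it does not handle an escaped backslash in front of a "u".

```python
import re


def escape_non_ascii(text: str) -> str:
    """Replace every non-ASCII character with \\uXXXX escapes
    (UTF-16 code units, as Java properties files expect)."""
    def to_escapes(match):
        units = match.group().encode("utf-16-be")
        return "".join(
            "\\u{:04X}".format(int.from_bytes(units[i:i + 2], "big"))
            for i in range(0, len(units), 2)
        )
    return re.sub(r"[^\x00-\x7f]", to_escapes, text)


def unescape(text: str) -> str:
    """Turn \\uXXXX escapes back into characters."""
    raw = re.sub(
        r"\\u([0-9A-Fa-f]{4})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )
    # Recombine surrogate pairs produced by characters above U+FFFF.
    return raw.encode("utf-16", "surrogatepass").decode("utf-16")


original = "Aper\u00e7u \u0420"
escaped = escape_non_ascii(original)
# An ASCII-only file has the same bytes whatever encoding is declared.
encoding_agnostic = (
    escaped.encode("ascii")
    == escaped.encode("iso-8859-1")
    == escaped.encode("utf-8")
)
```

Because the escaped file is byte-identical in us-ascii, ISO-8859-1 and UTF-8, it no longer matters which of these the integration writes. The cost is that accented text becomes unreadable for anyone editing the file by hand.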


In case something wasn’t clear, what this means is that because the source file had a latin1 character (é) even though the translations for the strings did not, this character was kept as-is (not escaped) as part of the “template”. Therefore, the translation files sent back to GitHub are being encoded with UTF-8 by the library being used. We do not think we can do anything about this, unfortunately. So, the translation files for the Java Properties file format must be retrieved from Transifex directly instead of using the GitHub integration.

I believe the “é” character was added in a comment, as an attempt to force Transifex to use ISO-8859-1?
And Transifex is simply incapable of doing that?

Hum… well, Wicket does not really care and will, I believe, handle translation files made of us-ascii with UTF-8 escapes just fine, but translators who commit directly rather than going through Transifex might be less than pleased.
I believe Jody at one point mentioned a different platform, but I cannot remember which one that is.
Thinking out loud, I see two avenues ahead:

  • Put up with Transifex limitations
  • Try to extract the good work present in Transifex once, and then migrate to another translation system, if you can find one that works better for translators

Cheers

Andrea

==

GeoServer Professional Services from the experts!

Visit http://bit.ly/gs-services-us for more information.

Ing. Andrea Aime
@geowolf
Technical Lead

GeoSolutions Group
phone: +39 0584 962313

fax: +39 0584 1660272

mob: +39 333 8128928

https://www.geosolutionsgroup.com/

http://twitter.com/geosolutions_it


Con riferimento alla normativa sul trattamento dei dati personali (Reg. UE 2016/679 - Regolamento generale sulla protezione dei dati “GDPR”), si precisa che ogni circostanza inerente alla presente email (il suo contenuto, gli eventuali allegati, etc.) è un dato la cui conoscenza è riservata al/i solo/i destinatario/i indicati dallo scrivente. Se il messaggio Le è giunto per errore, è tenuta/o a cancellarlo, ogni altra operazione è illecita. Le sarei comunque grato se potesse darmene notizia.

This email is intended only for the person or entity to which it is addressed and may contain information that is privileged, confidential or otherwise protected from disclosure. We remind that - as provided by European Regulation 2016/679 “GDPR” - copying, dissemination or use of this e-mail or the information herein by anyone other than the intended recipient is prohibited. If you have received this email by mistake, please notify us immediately by telephone or e-mail

Hello Andrea,

Thank you for your input. Transifex developers are still trying to fix the issue, so there is still some hope!

I will keep the community updated.

Alexandre


Hello everyone,

In the end, Transifex cannot fix the issue they have with the encoding (they just updated the documentation about it, see https://docs.transifex.com/transifex-github-integrations/github-tx-ui).

I will finish reviewing the Transifex configuration and the synchronization with the GitHub repo to at least have one-way synchronization. After this, I will see whether I can set up a script using the Transifex APIs to automate the Transifex->Git synchronization.

Regards
Alexandre


Wow, bummer…
Thanks for all the work you’re pouring into this Alexandre!

Cheers
Andrea
