[GRASS5] Freetype failure

Folks,

A couple weeks ago I wrote for help with my problem using freetype. In short, when I tried to display characters using the freetype library the X driver crashed. I later found that the PNG driver had the same behavior.

In the absence of input, I managed to track the problem down to grass-6.0.2/display/drivers/lib/Text3.c (since renamed to text3.c). The convert_str function calls iconv. The third and fifth parameters in the iconv prototype are pointers to size_t variables, but convert_str instead passed pointers to int. iconv misinterpreted the values, which led to memory corruption. With those parameters changed to size_t the code works correctly.
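For illustration, a minimal sketch of the corrected calling pattern (this is not the actual text3.c code; the helper name and buffer handling are invented for the example):

    #include <stddef.h>
    #include <iconv.h>

    /* Hypothetical helper, for illustration only: the in/out byte counts
     * handed to iconv must be size_t, not int, or iconv reads garbage
     * lengths on LP64 platforms such as AMD64. */
    static int sample_to_ucs2be(const char *charset,
                                char *in, size_t in_len,
                                char *out, size_t out_size)
    {
        iconv_t cd;
        size_t in_left = in_len;      /* third iconv argument: size_t */
        size_t out_left = out_size;   /* fifth iconv argument: size_t */
        size_t ret;

        cd = iconv_open("UCS-2BE", charset);
        if (cd == (iconv_t) -1)
            return -1;

        ret = iconv(cd, &in, &in_left, &out, &out_left);
        iconv_close(cd);

        return ret == (size_t) -1 ? -1 : (int) (out_size - out_left);
    }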

This is the first problem I recall arising from using int and size_t interchangeably. This may reflect my environment (gcc 4.0.2 prerelease, GNU libc 2.3.5, SuSE Linux 10.0 on AMD64).

The initial fix was pretty easy, but in studying the code I found other problems (e.g., the arguments to memset in convert_str are in the wrong order). I also found the approach in the code to be rather indirect: it works by converting UTF-8 codes to UCS-2BE and then converting the UCS-2BE codes to FT_ULong for the FreeType library. Not all UTF-8 codes can be represented in UCS-2BE.
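As a reminder of the memset problem (an illustrative snippet, not the code from convert_str):

    #include <string.h>

    void clear_example(void)
    {
        char outbuf[256];

        /* memset(dest, fill_byte, length): the fill byte comes second,
         * the length third.  With the arguments swapped the length is 0,
         * so nothing is cleared. */
        memset(outbuf, sizeof(outbuf), 0);    /* wrong: clears nothing */
        memset(outbuf, 0, sizeof(outbuf));    /* right: zeroes the buffer */
    }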

I wrote a UTF-8 to FT_ULong converter to get a more direct solution and eliminated convert_str from the code. This is a working solution and probably in most respects a better solution than the current text3.c.

I no longer maintain a CVS installation of GRASS, but I think these changes need to be made. I can provide the code to someone who can introduce it into CVS. As it stands the code is largely complete, but I would like it to be checked by someone who is a more proficient programmer than I am. To be formally complete the UTF-8 decoder needs code added to catch malformed UTF-8 sequences that encode codepoints in the surrogate range 0xD800 to 0xDFFF. I think other malformations are caught by my existing code, but the checks could probably be made more efficient.
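For reference, a minimal sketch of this kind of decoder (not the converter described above; the name and error handling are invented, it is limited to the modern four-byte forms, and unsigned long stands in for FreeType's FT_ULong):

    #include <stddef.h>

    /* Illustrative only: decode one UTF-8 sequence starting at *p, with
     * end marking the end of the buffer.  Rejects truncated sequences,
     * stray continuation bytes and the surrogate range U+D800..U+DFFF;
     * a production decoder would also reject overlong 3- and 4-byte
     * forms. */
    unsigned long decode_utf8(const unsigned char **p, const unsigned char *end)
    {
        const unsigned char *s = *p;
        unsigned long cp;
        int len, i;

        if (s >= end)
            return 0;

        if (s[0] < 0x80)      { cp = s[0];        len = 1; }
        else if (s[0] < 0xC2) goto bad;   /* continuation byte or overlong lead */
        else if (s[0] < 0xE0) { cp = s[0] & 0x1F; len = 2; }
        else if (s[0] < 0xF0) { cp = s[0] & 0x0F; len = 3; }
        else if (s[0] < 0xF5) { cp = s[0] & 0x07; len = 4; }
        else                  goto bad;

        if (end - s < len)
            goto bad;

        for (i = 1; i < len; i++) {
            if ((s[i] & 0xC0) != 0x80)
                goto bad;
            cp = (cp << 6) | (s[i] & 0x3F);
        }

        /* Codepoints in the surrogate range are not legal in UTF-8. */
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            goto bad;

        *p = s + len;
        return cp;

      bad:
        *p = s + 1;
        return 0xFFFD;    /* U+FFFD REPLACEMENT CHARACTER */
    }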

Roger Miller

roger@spinn.net wrote:

A couple weeks ago I wrote for help with my problem using freetype. In
short, when I tried to display characters using the freetype library the X
driver crashed. I later found that the PNG driver had the same behavior.

In the absence of input, I managed to track the problem down to
grass-6.0.2/display/drivers/lib/Text3.c (since renamed to text3.c). The
convert_str function calls iconv. The third and fifth parameters in the
iconv prototype are pointers to size_t variables, but convert_str instead
passed pointers to int. iconv misinterpreted the values, which led to
memory corruption. With those parameters changed to size_t the code works
correctly.

This is the first problem I recall arising from using int and size_t
interchangeably. This may reflect my environment (gcc 4.0.2 prerelease, GNU
libc 2.3.5, SuSE Linux 10.0 on AMD64).

The initial fix was pretty easy, but in studying the code I found other
problems (e.g., the arguments to memset in convert_str are in the wrong
order). I also found the approach in the code to be rather indirect: it
works by converting UTF-8 codes to UCS-2BE and then converting the
UCS-2BE codes to FT_ULong for the FreeType library.

To be precise, it converts strings from the selected encoding (which
may be UTF-8 or something else) to UCS-2BE.

Not all UTF-8 codes can be represented in UCS-2BE.

But does FreeType support anything beyond the 16-bit range?

I wrote a UTF-8 to FT_ULong converter to get a more direct solution and
eliminated convert_str from the code. This is a working solution and
probably in most respects a better solution than the current text3.c.

Except for the most important issue, namely that the input string is
not necessarily in UTF-8; the encoding is specified by the charset=
option to d.font.freetype. As the FreeType support in the display
drivers was originally written to support Japanese, I suspect that
most of the existing users of this functionality probably won't be
using UTF-8.

I no longer maintain a CVS installation of GRASS but I think these changes
need to be made. I can provide the code to someone who can introduce it to
CVS. As it stands the code is largely complete. I would like it to be
checked by someone who is a more proficient programmer than I. To be
formally complete the UTF-8 decoder needs code added to catch malformed
UTF-8 2-byte codes in the range of 0xD800 to 0xDFFF. I think other
malformations are caught by my existing code, but their efficiency might be
improved.

Whilst a hard-coded UTF-8 to UCS-2 or UCS-4 decoder might be a useful
fall-back for systems which don't have iconv, the iconv code needs to
stay to support other encodings.

The int/size_t and memset() issues need to be fixed, and converting to
32-bit Unicode should be harmless (although pointless unless FreeType
supports it).

--
Glynn Clements <glynn@gclements.plus.com>

Thanks for getting back to me, Glynn:

I also found the approach in the code to be rather indirect: it works by converting UTF-8 codes to UCS-2BE and then converting the UCS-2BE codes to FT_ULong for the FreeType library.

To be precise, it converts strings from the selected encoding (which
may be UTF-8 or something else) to UCS-2BE.

Sorry, this is something I overlooked.

Not all UTF-8 codes can be represented in UCS-2BE.

But does FreeType support anything beyond the 16-bit range?

I believe FreeType does. At any rate, FreeType uses a 32-bit value to store its character codes, so support is possible. Whether FreeType-compatible fonts are available for the whole range is another question.
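For what it's worth, the lookup itself takes an FT_ULong, so nothing in the API stops codepoints above U+FFFF (a minimal sketch assuming an already-loaded FT_Face; the helper name is invented):

    #include <ft2build.h>
    #include FT_FREETYPE_H

    /* Illustrative only: FT_Get_Char_Index takes an FT_ULong charcode,
     * so a codepoint above U+FFFF can at least be passed in; whether
     * the font's cmap defines a glyph for it is another matter. */
    static FT_UInt lookup_glyph(FT_Face face, FT_ULong codepoint)
    {
        return FT_Get_Char_Index(face, codepoint);    /* 0 means "no glyph" */
    }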

I wrote a UTF-8 to FT_ULong converter to get a more direct solution and eliminated convert_str from the code. This is a working solution and probably in most respects a better solution than the current text3.c.

Except for the most important issue, namely that the input string is
not necessarily in UTF-8; the encoding is specified by the charset=
option to d.font.freetype. As the FreeType support in the display
drivers was originally written to support Japanese, I suspect that
most of the existing users of this functionality probably won't be
using UTF-8.

UTF-8 represents the entire range of UCS. Existing Japanese, Korean, Chinese (etc.) character sets are incorporated in UCS and are representable in UTF-8. That does not mean that everyone's software is delivering UTF-8 encoding, but the time when that happens is probably not too far off.

Whilst a hard-coded UTF-8 to UCS-2 or UCS-4 decoder might be a useful
fall-back for systems which don't have iconv, the iconv code needs to
stay to support other encodings.

That makes sense, but if everything is to be funneled into one encoding then I don't think it should be through UCS-2. There is the possibly academic fact that UCS-2 doesn't represent all of UCS. Also, UTF-8 is expected to be the future standard encoding and many of us are already working with it. UTF-8 has been the default encoding in all major Linux distributions for a couple years now -- longer for some distros. I haven't heard that UCS-2 is that widely used.

It makes more sense to translate anything that isn't already encoded in UTF-8 into UTF-8, then decode UTF-8 to FreeType. That way UTF-8 systems would not have to go through an encode-decode cycle.

Roger Miller

roger@spinn.net wrote:

>> I wrote a UTF-8 to FT_ULong converter to get a more direct solution and
>> eliminated convert_str from the code. This is a working solution and
>> probably in most respects a better solution than the current text3.c.
>
> Except for the most important issue, namely that the input string is
> not necessarily in UTF-8; the encoding is specified by the charset=
> option to d.font.freetype. As the FreeType support in the display
> drivers was originally written to support Japanese, I suspect that
> most of the existing users of this functionality probably won't be
> using UTF-8.

UTF-8 represents the entire range of UCS. Existing Japanese, Korean,
Chinese (etc.) character sets are incorporated in UCS and are
representable in UTF-8. That does not mean that everyone's software is
delivering UTF-8 encoding, but the time when that happens is probably not
too far off.

That's wishful thinking. Most of the CJK world is quite happy to stick
with their existing encodings regardless of how much western
programmers would like them all to switch to Unicode.

> Whilst a hard-coded UTF-8 to UCS-2 or UCS-4 decoder might be a useful
> fall-back for systems which don't have iconv, the iconv code needs to
> stay to support other encodings.

That makes sense, but if everything is to be funneled into one encoding then
I don't think it should be through UCS-2. There is the possibly academic
fact that UCS-2 doesn't represent all of UCS. Also, UTF-8 is expected to be
the future standard encoding and many of us are already working with it.
UTF-8 has been the default encoding in all major Linux distributions for a
couple years now -- longer for some distros. I haven't heard that UCS-2 is
that widely used.

FWIW, Windows uses UCS-2LE quite extensively, but that isn't relevant
here.

The main reason it is used in the FreeType code is that it's the
simplest encoding to decode to an integer codepoint. iconv only deals
with external encodings, so there's no way to decode directly to
Unicode codepoints in "host-endian" format (although you could decode
to either UCS-4LE or UCS-4BE according to the host's endianness, then
just cast the output buffer to "FT_ULong *").
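A sketch of that idea (illustrative only; the helper name is invented, and since FT_ULong is an unsigned long, which is wider than 32 bits on LP64 hosts, copying the output element by element is safer than casting the buffer):

    #include <iconv.h>
    #include <stdint.h>
    #include <string.h>

    /* Illustrative only: pick the UCS-4 byte order that matches the host
     * at run time, so every 4-byte unit iconv writes is already a native
     * 32-bit codepoint. */
    static const char *host_ucs4_name(void)
    {
        const uint16_t probe = 1;
        uint8_t first;

        memcpy(&first, &probe, 1);
        return first ? "UCS-4LE" : "UCS-4BE";
    }

    /* Usage sketch: cd = iconv_open(host_ucs4_name(), charset); each
     * 4-byte output unit can then be read as a uint32_t and widened to
     * FT_ULong. */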

It makes more sense to translate anything that isn't already encoded in
UTF-8 into UTF-8, then decode UTF-8 to FreeType. That way UTF-8 systems
would not have to go through an encode-decode cycle.

That's easier to program (all conversions other than UTF-8 to
UCS-2/UCS-4 become the responsibility of the user), but it's a lot
less useful (because the user has to explicitly convert everything).

To be useful, d.text needs to be able to accept text in the encoding
which other programs generate. In locales where the dominant language
doesn't use the latin alphabet, that probably isn't going to be UTF-8
(on Windows, it definitely won't be UTF-8).

--
Glynn Clements <glynn@gclements.plus.com>

Glynn Clements writes:

UTF-8 represents the entire range of UCS. Existing Japanese, Korean, Chinese (etc.) character sets are incorporated in UCS and are representable in UTF-8. That does not mean that everyone's software is delivering UTF-8 encoding, but the time when that happens is probably not too far off.

That's wishful thinking. Most of the CJK world is quite happy to stick
with their existing encodings regardless of how much western
programmers would like them all to switch to Unicode.

Perhaps it is wishful thinking, but according to the document at http://www.cl.cam.ac.uk/~mgk25/unicode.html
China, Korea and Japan already have national standards based on UCS. Microsoft uses Unicode, which is similar.

Don't confuse encoding with Unicode or UCS. Unicode and UCS standardize the values that represent characters; an encoding determines how those values are stored and processed. (For example, é is the single codepoint U+00E9 regardless of encoding, but it is stored as the byte 0xE9 in ISO-8859-1, as the bytes 0xC3 0xA9 in UTF-8, and as the bytes 0x00 0xE9 in UCS-2BE.)

The main reason it is used in the FreeType code is that it's the
simplest encoding to decode to an integer codepoint.

This is true for multibyte characters, but not single byte characters. Besides, I'm offering the decoder, so it shouldn't make a lot of difference whether it is more complex or not.

It makes more sense to translate anything that isn't already encoded in UTF-8 into UTF-8, then decode UTF-8 to FreeType. That way UTF-8 systems would not have to go through an encode-decode cycle.

That's easier to program (all conversions other than UTF-8 to
UCS-2/UCS-4 become the responsibility of the user), but it's a lot
less useful (because the user has to explicitly convert everything).

Sorry if I misled you. My suggestion was that the code would retain convert_str and convert_str would use iconv to convert all user-supplied encodings to UTF-8 instead of to UCS-2BE as it does now. Draw_text would decode UTF-8 to FT_ULong. There would be no responsibility on the user that isn't there now. Anything coming in from a UTF-8 system could skip convert_str.

But now that you mention it, just using iconv to convert everything to UCS-4BE and casting that to FT_ULong might be a simpler solution yet. That would leave iconv with the responsibility for checking the UTF-8 stream for malformed encodings. I'm not sure how much of that checking iconv actually does.

Roger Miller

roger@spinn.net wrote:

>> UTF-8 represents the entire range of UCS. Existing Japanese, Korean,
>> Chinese (etc.) character sets are incorporated in UCS and are
>> representable in UTF-8. That does not mean that everyone's software is
>> delivering UTF-8 encoding, but the time when that happens is probably not
>> too far off.
>
> That's wishful thinking. Most of the CJK world is quite happy to stick
> with their existing encodings regardless of how much western
> programmers would like them all to switch to Unicode.

Perhaps it is wishful thinking, but according to the document at
http://www.cl.cam.ac.uk/~mgk25/unicode.html
China, Korea and Japan already have national standards based on UCS.
Microsoft uses Unicode, which is similar.

Having standards for something and actually using it are very
different matters.

Part of the problem is that Windows doesn't provide much choice when
it comes to encodings. You have 16-bit Unicode (i.e. UCS-2LE) and the
system's codepage, and that's it. For Japanese, the system codepage is
CP932 (Shift-JIS), and anything which doesn't use UCS-2LE (i.e.
anything which needs to use an ASCII-compatible encoding, e.g.
virtually every external data format except those which mandate UTF-8)
will be in Shift-JIS.

> The main reason it is used in the FreeType code is that it's the
> simplest encoding to decode to an integer codepoint.

This is true for multibyte characters, but not single byte characters.

I don't understand what you're saying here. Or maybe you're
misunderstanding something. UCS-2BE is just 16-bit unicode codepoints
stored in big-endian byte order. This encoding was chosen because it's
trivial to convert to an FT_ULong. Decoding UCS-2BE to Unicode
codepoints is just:

  wchar_t *chars;               /* decoded codepoints (output) */
  const unsigned char *bytes;   /* UCS-2BE bytes from iconv (input) */
  size_t i, num_chars;

  for (i = 0; i < num_chars; i++)
    chars[i] = (bytes[2*i] << 8) | bytes[2*i+1];

[Not to be confused with UTF-16, which is almost the same as UCS-2,
except that UTF-16 supports codepoints above U+FFFF using surrogates
while UCS-2 is limited to the BMP.]

Besides, I'm offering the decoder, so it shouldn't make a lot of difference
whether it is more complex or not.

>> It makes more sense to translate anything that isn't already encoded in
>> UTF-8 into UTF-8, then decode UTF-8 to FreeType. That way UTF-8 systems
>> would not have to go through an encode-decode cycle.
>
> That's easier to program (all conversions other than UTF-8 to
> UCS-2/UCS-4 become the responsibility of the user), but it's a lot
> less useful (because the user has to explicitly convert everything).

Sorry if I misled you. My suggestion was that the code would retain
convert_str and convert_str would use iconv to convert all user-supplied
encodings to UTF-8 instead of to UCS-2BE as it does now. Draw_text would
decode UTF-8 to FT_ULong.

Using UTF-8 as the intermediate encoding doesn't make sense.

There would be no responsibility on the user that
isn't there now. Anything coming in from a UTF-8 system could skip
convert_str.

But now that you mention it, just using iconv to convert everything to
UCS-4BE and casting that to FT_ULong might be a simpler solution yet.

Yes; that's basically what happens now, except that it uses UCS-2
rather than UCS-4. Given the relatively small amounts of data
involved, the performance advantages of using UCS-2 are negligible.

Also, AFAIK, FT_ULong is in the host's byte order, so you either need
to convert char[4] to FT_ULong using shift+or (which is what happens
at present), or use either UCS-4BE or UCS-4LE depending upon the
host's byte order (Vax users are out of luck).
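For illustration, the byte-order-independent variant would look something like this (the function name is invented and unsigned long stands in for FT_ULong):

    #include <stddef.h>

    /* Illustrative only: assemble host-order codepoints from UCS-4BE
     * bytes (e.g. iconv output) with shift+or, just as the current code
     * does for UCS-2BE, so no endian-specific encoding is needed. */
    static void ucs4be_to_codes(const unsigned char *out, size_t num_chars,
                                unsigned long *codes)
    {
        size_t i;

        for (i = 0; i < num_chars; i++)
            codes[i] = ((unsigned long) out[4*i]     << 24) |
                       ((unsigned long) out[4*i + 1] << 16) |
                       ((unsigned long) out[4*i + 2] << 8)  |
                        (unsigned long) out[4*i + 3];
    }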

That would leave iconv with the responsibility for checking the
UTF-8 stream for malformed encodings. I'm not sure how much of that
checking iconv actually does.

iconv (at least the GNU implementation) is very rigid; it won't accept
anything which doesn't strictly conform to the input encoding.

--
Glynn Clements <glynn@gclements.plus.com>

On Wed, 2006-03-29 at 00:45 +0100, Glynn Clements wrote:

> This is true for multibyte characters, but not single byte characters.

I don't understand what you're saying here. Or maybe you're
misunderstanding something. UCS-2BE is just 16-bit unicode codepoints
stored in big-endian byte order.

UTF-8 sequences can be 1, 2, 3, 4, 5 or 6 bytes long. The single-byte
sequences coincide with the old ASCII standard, so ASCII text passes
through as one byte per character, and most of the characters in typical
Latin-1 (ISO-8859-1) text are ASCII and therefore also single bytes.

> Sorry if I misled you. My suggestion was that the code would retain
> convert_str and convert_str would use iconv to convert all user-supplied
> encodings to UTF-8 instead of to UCS-2BE as it does now. Draw_text would
> decode UTF-8 to FT_ULong.

Using UTF-8 as the intermediate encoding doesn't make sense.

It makes sense if you start with UTF-8 and there is no intermediate step
at all. Lots of us are using UTF-8 now (possibly without realizing it)
and more of us will be using it in the future.

Also, AFAIK, FT_ULong is in the host's byte order, so you either need
to convert char[4] to FT_ULong using shift+or (which is what happens
at present), or use either UCS-4BE or UCS-4LE depending upon the
host's byte order (Vax users are out of luck).

Does the existing code account for differences in byte order? I don't
see how it does.

Roger Miller

Roger Miller wrote:

> > > The main reason it is used in the FreeType code is that it's the
> > > simplest encoding to decode to an integer codepoint.
> >
> > This is true for multibyte characters, but not single byte characters.
>
> I don't understand what you're saying here. Or maybe you're
> misunderstanding something. UCS-2BE is just 16-bit unicode codepoints
> stored in big-endian byte order.

UTF-8 sequences can be 1, 2, 3, 4, 5 or 6 bytes long. The single-byte
sequences coincide with the old ASCII standard, so ASCII text passes
through as one byte per character, and most of the characters in typical
Latin-1 (ISO-8859-1) text are ASCII and therefore also single bytes.

To clarify: UCS-2/UCS-4 are the simplest Unicode encodings to decode
to integer Unicode codepoints. UCS-2/UCS-4 are just integer codepoints
stored in a specific byte order (technically, those names imply
big-endian ordering; the little-endian UCS-* encodings were invented
by Microsoft to avoid byte-swapping on import and export).

> > Sorry if I misled you. My suggestion was that the code would retain
> > convert_str and convert_str would use iconv to convert all user-supplied
> > encodings to UTF-8 instead of to UCS-2BE as it does now. Draw_text would
> > decode UTF-8 to FT_ULong.
>
> Using UTF-8 as the intermediate encoding doesn't make sense.

It makes sense if you start with UTF-8 and there is no intermediate step
at all. Lots of us are using UTF-8 now (possibly without realizing it)
and more of us will be using it in the future.

Certainly, forcing the user to supply UTF-8 simplifies matters for the
programmer, which is why it's so popular. But it's a major nuisance
for the user if you have been consistently using some other encoding
for the past 25 years (unless that encoding is ASCII).

The adoption of UTF-8 closely mirrors the use of ASCII.

It's most popular in English-speaking locales where almost everything
uses ASCII. It's reasonably popular in locales whose primary language
uses the roman alphabet, i.e. where you can adequately approximate the
language using ASCII (it's common in "European" locales to simply
coerce filenames, usernames etc to ASCII to sidestep any encoding
issues).

It's least popular in locales where the language doesn't use the latin
alphabet but e.g. Cyrillic or Han instead.

In the latter case, you are likely to have decades worth of data and
an installed base of software which use a specific encoding other than
ASCII, and where non-ASCII characters are commonplace in filenames,
usernames etc.

It doesn't help that the UTF-8 encoding isn't compatible with the
(older, and in many locales, well-established) ISO-2022 encoding
(unlike ISO-8859-*, EUC and others).

> Also, AFAIK, FT_ULong is in the host's byte order, so you either need
> to convert char[4] to FT_ULong using shift+or (which is what happens
> at present), or use either UCS-4BE or UCS-4LE depending upon the
> host's byte order (Vax users are out of luck).

Does the existing code account for differences in byte order? I don't
see how it does.

The existing code converts to UCS-2BE, then converts the result to
integer codepoints with:

    ch = (out[i]<<8) | out[i+1];

[display/drivers/lib/text3.c, line 194].

This doesn't rely upon the host's byte order.

--
Glynn Clements <glynn@gclements.plus.com>

On composing a reply I found myself repeating things I've said before.
I think that by definition that means this discussion is going nowhere.
All I can really ask is that someone fix the features that caused my
original problems.

Roger Miller

Roger Miller wrote:

On composing a reply I found myself repeating things I've said before.
I think that by definition that means this discussion is going nowhere.
All I can really ask is that someone fix the features that caused my
original problems.

I've applied fixes locally for the size_t and memset() issues, but
haven't had a chance to test them.

I've attached the patch (against current CVS) if anyone else wants to
check them.

I haven't changed from UCS-2 to UCS-4; there doesn't seem much point
unless there actually exists a TrueType font which uses codepoints
outside the BMP.

--
Glynn Clements <glynn@gclements.plus.com>

(attachments)

freetype.patch (652 Bytes)

Thanks, Glynn.

Incidentally, FreeType is not limited to TrueType fonts. It can use several different font formats. Just the same, it's unlikely anyone is mapping with symbols beyond the BMP.

Roger Miller
