Re: unicode

Guest section DW (
Fri, 15 May 1998 12:03:01 +0200 (MET DST)

Jan Vroonhof writes:

> To get around such problems you either need to have a single
> representation of each glyph which is basically what unicode is

This entire discussion has mostly been conducted by people who do not
know what they are talking about. Let me contradict this factoid.

"each" - this is false, but of less importance
"single representation" - extremely unfortunately, this is false, too

First of all there is the matter of precomposed accented symbols.
A symbol plus diacritic can be written as one combined symbol,
or as two (or more) symbols. For example, both
U+0061,U+0308 and U+00E4 represent a-umlaut.
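
To make this concrete, here is a minimal sketch in present-day Python,
assuming its standard unicodedata module (the variable names are only
illustrative). The two encodings of a-umlaut compare unequal as raw
strings and only match after normalization to a common form:

```python
import unicodedata

decomposed = "\u0061\u0308"   # LATIN SMALL LETTER A + COMBINING DIAERESIS
precomposed = "\u00e4"        # LATIN SMALL LETTER A WITH DIAERESIS

# Compared code point by code point, the two strings differ.
raw_equal = decomposed == precomposed

# NFC composes the pair into the single precomposed code point.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed

# NFD goes the other way and splits the precomposed code point.
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

print(raw_equal, nfc_equal, nfd_equal)
```

Any code that compares or hashes such strings has to pick one of these
forms first; a plain byte comparison sees two different names.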

Secondly, lots and lots of glyphs have two or more representations
as a single Unicode symbol. This was caused by the desire of the
Unicode Consortium to guarantee round-trip compatibility between
Unicode and various existing character sets.
For example, the glyph A is coded as U+0041, U+0391, U+0410
depending on whether you think it is Latin, Greek or Cyrillic.
The glyph K is coded as U+004B, U+039A, U+212A depending on
whether you think it is Latin, Greek or stands for the unit kelvin.
The aleph is U+05D0 if it is Hebrew, but U+2135 if it is maths.
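
The code points above can be checked against the character database.
A small sketch, again assuming present-day Python and its unicodedata
module (the dictionary name is hypothetical), that prints the official
name of each lookalike:

```python
import unicodedata

# Code points taken from the examples above, grouped by the glyph they share.
lookalikes = {
    "A": ["\u0041", "\u0391", "\u0410"],
    "K": ["\u004b", "\u039a", "\u212a"],
    "aleph": ["\u05d0", "\u2135"],
}

for glyph, chars in lookalikes.items():
    for ch in chars:
        print(f"{glyph}: U+{ord(ch):04X} {unicodedata.name(ch)}")

# Within each group, every entry is a different code point.
all_distinct = all(len(set(chars)) == len(chars) for chars in lookalikes.values())
```

The names differ (Latin, Greek, Cyrillic, a unit sign, a maths symbol),
but nothing in the rendered glyph tells you which one was meant.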

Note, these are precisely the same glyphs. No font difference.
So, even a human secretary who is typing my math notes will
in general be unable to decide whether a capital PI should be
U+220F because it is a product symbol, or U+03A0 because it is a
Greek symbol.

How do you think the kernel, or libc, will distinguish the
capital letter X (U+0058) from the Roman numeral X (U+2169)?
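
For what it is worth, a sketch of what mechanical folding can and
cannot do, assuming present-day Python's unicodedata module (fold is a
hypothetical helper, not anything the kernel or libc provides).
Compatibility normalization maps some of these lookalikes onto one
code point, but only in one direction, and it leaves others alone:

```python
import unicodedata

def fold(ch: str) -> str:
    """Hypothetical helper: apply NFKC compatibility normalization."""
    return unicodedata.normalize("NFKC", ch)

# These code points carry a decomposition, so folding collapses them.
roman_ten_folds_to_x = fold("\u2169") == "\u0058"   # ROMAN NUMERAL TEN -> X
kelvin_folds_to_k = fold("\u212a") == "\u004b"      # KELVIN SIGN -> K

# These have no decomposition, so they stay distinct.
product_folds_to_pi = fold("\u220f") == "\u03a0"     # N-ARY PRODUCT vs GREEK PI
greek_alpha_folds_to_a = fold("\u0391") == "\u0041"  # GREEK ALPHA vs LATIN A

print(roman_ten_folds_to_x, kelvin_folds_to_k,
      product_folds_to_pi, greek_alpha_folds_to_a)
```

Folding throws the distinction away; it does not decide which X the
writer intended, and going from U+0058 back to U+2169 still needs
someone who understands the text.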

An interesting artificial intelligence project.
