Re: A Great Idea (tm) about reimplementing NLS.

From: Bernd Eckenfels
Date: Fri Jun 17 2005 - 04:42:22 EST


In article <200506170450.12943.pmcfarland@xxxxxxxxxxxx> you wrote:
> (implication of utf8 and not utf16 goes here)
>
> Very few Unicode characters require three bytes, instead of the usual one or
> two.

Two-byte UTF-8 sequences end at U+07FF, which covers little more than Latin,
Greek, Cyrillic, Hebrew and Arabic.

All CJK Unified Ideographs (U+4E00-) and Extension A (U+3400-) have 3-byte
encodings in UTF-8. The Extension B ideographs (U+20000-) even need 4 bytes.
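
As a rough sketch of those ranges (the helper name utf8_len is made up for
this illustration and is not from any library), the boundaries can be
cross-checked against Python's built-in encoder:

```python
def utf8_len(cp: int) -> int:
    """Bytes UTF-8 needs for one code point (hypothetical helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0x7F:       # ASCII
        return 1
    if cp <= 0x7FF:      # last code point with a 2-byte encoding is U+07FF
        return 2
    if cp <= 0xFFFF:     # rest of the BMP, including U+3400- and U+4E00-
        return 3
    return 4             # supplementary planes, e.g. U+20000-


# Sample code points taken from the ranges mentioned in this mail.
for cp in (0x41, 0x7FF, 0x800, 0x3400, 0x4E00, 0x20000):
    # The hand-written ranges must agree with the real encoder.
    assert utf8_len(cp) == len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X}: {utf8_len(cp)} byte(s)")
```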

> For one byte you just have the byte.

For ASCII you have one byte.

> For two bytes, you really have three: a control code stating "the following
> two bytes are a two byte character", and then the two bytes.

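Umm, that's a bit misleading. UTF-8 works with bit prefixes, not byte
prefixes: the high bits of the lead byte say how long the sequence is, and
every continuation byte starts with the bits 10. Unicode code points are
integers and, depending on the encoding, are represented as one or more
code units, which in turn are serialized as bytes.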
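
To make the bit prefixes concrete, here is a minimal hand-rolled encoder. It
is only a sketch for illustration (utf8_encode is a hypothetical name; real
code would simply call str.encode), but it shows that the length marker
lives in the high bits of the first byte rather than in a separate control
byte:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point to UTF-8 by hand (illustrative sketch only)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for cp in (0x41, 0xE9, 0x4E00, 0x20000):
    encoded = utf8_encode(cp)
    # Sanity check against the built-in encoder.
    assert encoded == chr(cp).encode("utf-8")
    print(f"U+{cp:04X}:", " ".join(f"{b:08b}" for b in encoded))
```

A two-byte character therefore really is two bytes: five payload bits sit
behind the 110 prefix and six more behind the 10 prefix.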

> Unless I've completely misunderstood the Unicode specification, this is what
> is going on.

You might want to look up Joel's Tutorial or just browse the Unihan Database:
http://www.joelonsoftware.com/articles/Unicode.html
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=3400
http://www.unicode.org/cgi-bin/UnihanGrid.pl?codepoint=U+07F1&useutf8=false

Greetings
Bernd