Re: Unicode, etc. solution

H. Peter Anvin (hpa@transmeta.com)
27 Aug 1997 15:46:04 GMT


Followup to: <199708271116.HAA07433@lynx.dac.neu.edu>
By author: Albert Cahalan <acahalan@lynx.dac.neu.edu>
In newsgroup: linux.dev.kernel
>
> While having the right tables is better, the other method works.
> It is an easy default, which makes it a nice hack.
>

No, it makes it a crude hack.

> With UTF-8, repeated conversions are unavoidable and somewhat complex.
> Most apps will _severely_ mishandle text on a UTF-8 system. At least
> with raw Unicode you know that newer apps will operate without
> overhead and older apps won't split UTF-8 characters. At worst you
> have a byte order swap.

Most apps will simply not operate at all on the UCS-2 system you propose.
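
A rough illustration of why, as a minimal Python sketch. The filename
and the helper c_string_view() are made up for this example; the helper
only mimics a C routine that stops at the first zero byte.

```python
def c_string_view(buf: bytes) -> bytes:
    """Return what a NUL-terminated C string routine would see in buf."""
    end = buf.find(b"\x00")
    return buf if end == -1 else buf[:end]

name = "README"  # hypothetical ASCII filename

utf8 = name.encode("utf-8")
ucs2 = name.encode("utf-16-le")  # same bytes as UCS-2 for these characters

print(c_string_view(utf8))  # b'README': ASCII text is unchanged in UTF-8
print(c_string_view(ucs2))  # b'R': cut off at the first zero byte

# Every byte of a multibyte UTF-8 sequence has the high bit set, so
# NUL and '/' can never show up inside an encoded character.
multibyte = "\u00e5".encode("utf-8")
print(all(b >= 0x80 for b in multibyte))  # True
```

An old byte-oriented app passes UTF-8 through untouched even when it
does not understand it; the 16-bit form is truncated at the first
character.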

> > This is exactly the wrong thing to do. We *DON'T* want this
> > kind of crap in the system. If so, we're much better off
> > standardizing on Unicode. Otherwise the kernel has to know
> > about every bloody character set in existence -- this is
> > completely utterly intolerable.
>
> It is funny to see that from you, because I think you had something
> to do with loadable translation tables for the console. Do you also
> find that completely utterly intolerable? There are already several
> reimplementations of it for filesystems. Wouldn't it be better if
> they could share the same code and translation tables?

Sort of, but what you propose requires the kernel to know *EVERY*
character set. Loading a map for the console is only required for the
character set you want to *use*. Unfortunately, the tables the
console requires are not adequate for filesystem use, and the inverted
tables a filesystem would require are *MUCH* larger.
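
One way to picture the size difference, as a sketch only. It assumes
the forward table is a flat 256-entry byte-to-Unicode array, and it
uses cp437 purely as a stand-in character set; neither is taken from
the actual console code.

```python
# Forward table: one Unicode code point per byte value, 256 entries.
# cp437 is only a stand-in charset chosen for this sketch.
forward = [ord(ch) for ch in bytes(range(256)).decode("cp437")]

# Inverse: Unicode code point -> byte. The keys are scattered across
# the 16-bit range, so a flat lookup array indexed by code point would
# need max(forward) + 1 slots rather than 256.
inverse = {cp: byte for byte, cp in enumerate(forward)}

flat_inverse_slots = max(forward) + 1
print(len(forward), flat_inverse_slots)
```

A sparse structure shrinks the inverse again, but then every lookup
pays for a search or a hash, and that holds for each character set
that gets loaded.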

> > This is the wrong thing to do. Use UTF-8 encoding as the
> > multibyte set, and do conversion to wide characters if you
> > want to.
>
> Since the conversion is not cheap and UTF-8 breaks everything
> anyway, we might as well do this the Right Way with 16-bit
> characters all across the API. The old calls must remain
> as single-byte encoded for normal apps.

Except it isn't the right thing.

> > The Asians are -- for good reason -- already screaming bloody
> > murder over 16 bits; either we end up using an awful kluge
> > like UTF-16, or we stick to 8-bit bytes and use UTF-8, which
> > handles all of UCS-4 quite elegantly.
>
> Normal everyday "characters" fit in 16 bits. Since there are
> more characters every day, they can't all go into halfway
> portable filenames anyway. This is why word processors and HTML
> let you embed an image of <img src="foo.gif"> as needed.

Tell that to the Chinese person who can't write his name because it's
beyond 16 bits. It's a lose.
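
To make the 16-bit limit concrete, a small Python sketch. U+20000 is
used here only as a sample code point above 0xFFFF; it is an
assumption of this example, not a character named anywhere in this
thread.

```python
ch = "\U00020000"  # sample code point beyond the 16-bit range

print(ord(ch) > 0xFFFF)  # True: no single 16-bit unit can hold it

utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")

print(utf8.hex())   # f0a08080: four ordinary bytes, all >= 0x80
print(utf16.hex())  # d840dc00: a surrogate pair, two 16-bit units
```

UTF-8 just uses a longer sequence. The 16-bit API has to fall back on
surrogate pairs, which is the UTF-16 kluge again.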

-hpa

-- 
    PGP: 2047/2A960705 BA 03 D3 2C 14 A8 A8 BD  1E DF FE 69 EE 35 BD 74
    See http://www.zytor.com/~hpa/ for web page and full PGP public key
Always looking for a few good BOsFH.  **  Linux - the OS of global cooperation
        I am Baha'i -- ask me about it or see http://www.bahai.org/