Re: unicode

Theodore Y. Ts'o (tytso@MIT.EDU)
Thu, 14 May 1998 13:35:16 -0400

Date: Fri, 15 May 1998 00:55:43 -0700 (PDT)
From: Alex Belits <>

Because re-encoding is the last and worst thing that one may want to
happen with them -- charsets/language labels are necessary for displaying
characters with fonts that are mapped to charsets and applying rules that
are mapped to languages (capitalization, hyphenation, phonetic match). The
initial assumption is that adding reasonable support for fonts and rules
is possible without exposing any other encoding or charset to the application.
Then no one re-encodes anything except when handling charset-specific
devices or charset-specific filesystems.

None of the above (capitalization, hyphenation, and phonetic match) are
required for filenames. They are required if you are using a word
processor (such as Microsoft Office's Word, which also uses Unicode
internally to store all of its documents, so they've managed to solve
this problem), but that's not what we're talking about here on the
linux-kernel mailing list.

For filenames, it is really, really bad when a user sees two filenames
in a directory listing which look identical when printed on the screen,
but which have different encodings. It is also really bad when the user
sees a particular filename in a directory listing, tries to type it, but
because the user was unlucky and guessed wrong about which character set
was used, she gets a "file not found" error. Remember, most users look
at directory listings by looking at the glyphs on their display devices.
They do *not* look at directory listings with their favorite hex dump
editor. :-)
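
To make the failure concrete, here is a minimal sketch of a lookup that compares raw bytes. The filename, the dictionary standing in for a directory, the `lookup` helper, and the inode string are all made up for illustration; they are not taken from any real filesystem code.

```python
# Hypothetical illustration: a "directory" keyed by raw filename bytes,
# compared byte-for-byte with no knowledge of character sets.
name = "r\u00e9sum\u00e9.txt"       # displays as "résumé.txt"

stored = name.encode("latin-1")     # bytes written by a Latin-1 application
typed = name.encode("utf-8")        # bytes produced when the user types in UTF-8

directory = {stored: "inode 42"}    # made-up directory contents


def lookup(directory, raw_name):
    """Return the entry for raw_name, or None if the bytes do not match."""
    return directory.get(raw_name)


# Both byte strings render as the same glyphs once decoded correctly...
same_glyphs = stored.decode("latin-1") == typed.decode("utf-8")

# ...but only the name encoded the way it was stored is found.
found_stored = lookup(directory, stored)
found_typed = lookup(directory, typed)
```

The user sees one name on the screen, but the name she types only matches if her encoding happens to be the one used when the file was created.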

For this reason, re-encoding to a canonicalized form is absolutely a
requirement before such filenames are stored in the filesystem. UTF-8
provides such a canonicalization process, although granted it is not
trivial (requiring large table lookups), and probably will need to be in
userspace (such as in libc).
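
As a rough userspace sketch of that idea: the choice of normalization form NFC and the helper name `canonical_filename` are assumptions for illustration, since nothing above names a specific form. The normalization step itself is defined on Unicode characters, with UTF-8 used as the byte encoding of the result. Python's standard `unicodedata` module carries the lookup tables.

```python
import unicodedata


def canonical_filename(name: str) -> bytes:
    """Normalize to one canonical form (NFC, an assumption), then encode as UTF-8."""
    return unicodedata.normalize("NFC", name).encode("utf-8")


# Two spellings that render as the same glyphs on screen:
precomposed = "caf\u00e9"     # U+00E9, e with acute as a single code point
decomposed = "cafe\u0301"     # plain "e" followed by U+0301 combining acute

# Before canonicalization the two names are different byte strings.
raw_differs = precomposed.encode("utf-8") != decomposed.encode("utf-8")

# After canonicalization both map to the same bytes.
canon_a = canonical_filename(precomposed)
canon_b = canonical_filename(decomposed)
```

If every name passes through such a step before it is stored, two names that look identical can no longer coexist in one directory as distinct entries.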

- Ted

P.S. This is why, when all is said and done, an I18N expert at the OSF
is reported to have said, "You know, it would be easier to just teach
them all English." The entire I18N problem is *hard*, and anyone who
thinks they have an easy solution to it is either on drugs or is trying
to sell you something.
