Re: UTF-8, OSTA-UDF [why?], Unicode, and miscellaneous gibberi=

Alex Belits (abelits@phobos.illtel.denver.co.us)
Tue, 26 Aug 1997 22:58:37 -0700 (PDT)


On 27 Aug 1997, Matthias Urlichs wrote:

> > > Sort order is important. But cultural sort order (as opposed to any odd
> > > sort order) _cannot_ be done via naked byte order and picking the right
> > > character set. It's not even possible for English - you want to sort
> > >
> > > Andy
> > > boring
> > > John
> > >
> > > and no naked byte order will ever give you this.
> >
> > You don't have a clue.
> >
> ??? Of course he has a clue. You _cannot_ sort characters via any byte
> order.
> At the very least you have to map upper->lower case. You can only do
> that right in ASCII and some stupid national variants (they're stupid

...but since sorting is charset-dependent, I can always apply the charset's
local definition of sorting and case-mapping (or even phonetic matching
in loose search, or language-dependent word-searching rules in keyword
searches) if I know the charset. One can even write a C++ class to
handle such things automatically and derive charsets from it (or do it
in any OO language but Java, or in plain C if one wishes). With bare
Unicode I simply can't do that unless I convert things back.
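
A minimal sketch of that idea, in Python rather than C++ for brevity. The
names Charset, Koi8r, fold, weight and sort_key are hypothetical and come
from no existing library. Python's koi8_r codec is used only as a
convenience to build the weight table once; the comparison itself works
on the raw bytes.

```python
RUSSIAN_ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"


class Charset:
    """Hypothetical base class: ASCII rules, case ignored when sorting."""

    name = "ascii"

    def fold(self, b: int) -> int:
        # Map ASCII upper case to lower case; leave other bytes alone.
        return b + 32 if 0x41 <= b <= 0x5A else b

    def weight(self, b: int) -> int:
        # Collation weight of a single byte.
        return self.fold(b)

    def sort_key(self, raw: bytes) -> tuple:
        # Key usable with sorted(); one weight per byte.
        return tuple(self.weight(b) for b in raw)


class Koi8r(Charset):
    """Hypothetical derived charset with its own local sort order."""

    name = "koi8-r"

    def __init__(self) -> None:
        # KOI8-R byte values do not follow the Russian alphabet, so each
        # letter gets an explicit rank. Upper and lower case share a rank.
        self._weights = {}
        for rank, ch in enumerate(RUSSIAN_ALPHABET):
            self._weights[ch.encode("koi8_r")[0]] = 0x100 + rank
            self._weights[ch.upper().encode("koi8_r")[0]] = 0x100 + rank

    def weight(self, b: int) -> int:
        # Cyrillic letters use the table; everything else falls back to
        # the ASCII rules of the base class.
        return self._weights.get(b, self.fold(b))


english = [b"boring", b"John", b"Andy"]
russian = [w.encode("koi8_r") for w in ("юг", "Цена", "вода")]

by_bytes = sorted(english)
by_charset = sorted(english, key=Charset().sort_key)
russian_by_bytes = sorted(russian)
russian_by_charset = sorted(russian, key=Koi8r().sort_key)
```

Naked byte order puts "John" before "boring" and "юг" before "вода"; the
charset-specific keys restore the expected order. Whether capitals are
ignored or used as a secondary key is a policy choice that this sketch
hard-codes, and an application would want to control it.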

> because you can't put more than one language into one document -- try
> mixing the German idiocy of usurping ASCII []{}\| for umlaut characters
> with C source code -- oh, so you want to use trigraphs ???).

This is why they have switched to iso8859-1.

> There's also the question of what you want to achieve. Shall capital letters
> be wholly distinct? Ignored? Be used for some sort of secondary ordering?

I don't want anything but the application to make this decision -- I had
enough trouble with "smart" case-matching in DOS/Windows. The application
should handle that, and it should have the _means_ to handle that easily.

> Same with umlauts, diacriticals, and what-have-you (which also can be split
> into their secondary forms when sorting, as with ä -> ae, or maybe even
> (c) -> "copyright").
>
> Face it, the only way to do this right is via some generic mechanism, and
> as soon as you have that mechanism it's irrelevant whether the character
> set you use manages to place A B C in the right order or not.

"Generic mechanism" != "single charset". A generic mechanism can include
charsets and have a nice way of handling them (X11 has one, suitable for
its user interface needs -- nothing prevents making a generic mechanism
that uses charset definitions containing proper procedures for handling
the more complex aspects of charsets and languages). Of course, that
should be done entirely in userspace, and the kernel shouldn't interfere.

> > Lie. Windows has unicode support that is mainly broken and unused -- this
> > is why it has "localized" versions (that will be absolutely pointless if
> > it was really internationalized like Unicode's use claims to make
> > possible).
> >
> Great. But is that a fault of Microsoft or of Unicode??

It's another victory of Microsoft over common sense -- first it
declared an unusable standard, which guarantees that everything can be
"obsoleted" because it hasn't adopted the new standard yet, then it made a
completely broken implementation that even they have to kludge around, so
in the end the "great" specification isn't even widely used in their
own products. Not the first thing they did this way, and most likely not
the last one either.

>
> I'd suspect the former...

Why? Microsoft with all its lameness is capable of at least _utilizing_
things that are usable, even if it utilizes them in its own, perverted way
(preemptive multitasking, TCP/IP, RISC, programming languages). Unicode is
just way too broken to be used even by them.

> > > That's why you want to standardize those on UTF-8. You _don't_ want to
> > > have the FS have different names in different character sets.
> >
> > Why do you know what others want? You don't even speak their languages.
> >
> There are two alternatives here which would actually work. You can display
> names from non-local character sets via some sort of machine-readable
> transliteration, maybe UTF-7 so that you can actually type the thing, or
> you can display them in their native form and depend on the user to figure
> out for themselves where the Greek Alpha (or whatever) is.
>
> A third way would be some sort of human-readable transliteration, but
> you'll have to be careful with aliasing -- what if two different names get
> transliterated to the same string?

"Display" where? In readdir()? That's not where they are displayed. I
prefer to have some freedom in what to do with them in userspace, and
actually I prefer to have them displayed in the form I choose for them --
even on an ASCII-only console I don't want to see my name spelled "ES A SH
A BE E EL I TS".

> There are also alternatives which won't work. Inserting the disk from
> my Greek friend into the Russian friend's disk drive and having the
> filenames show up in some jumble of nonunderstandable Cyrillic letters is
> Not An Option

Why? Will he be able to read them any better? (Of course, he would have to
be a very unusual Russian if he used Unicode -- I don't know a single
Russian person who does.)

> (it gets worse with multibyte characters -- "sorry, but this
> character doesn't exist in Klingonese, so you can't type it, thus you can't
> open this file"

What????? If a program has the same bytes in the argument to open() as the
ones in the filename, the file will be opened, otherwise not. If the kernel
"translates" them the way DOS/Windows do, it will cause ambiguity.
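
The byte-for-byte behaviour can be checked with a short sketch. It assumes
a filesystem that treats names as opaque bytes (as ext2 does) and would
behave differently on one that normalizes or rejects names. The helpers
can_open and demo are made up for the illustration, and Python stands in
for C only to keep it short.

```python
import os
import tempfile


def can_open(directory: bytes, name: bytes) -> bool:
    """Return True if open() succeeds for exactly these name bytes."""
    try:
        fd = os.open(os.path.join(directory, name), os.O_RDONLY)
    except OSError:
        return False
    os.close(fd)
    return True


def demo() -> tuple:
    """Create a file under a KOI8-R name, then try two spellings of it."""
    with tempfile.TemporaryDirectory() as tmp:
        directory = os.fsencode(tmp)
        # The same four Russian letters, encoded in two different charsets.
        koi8_name = "файл".encode("koi8_r")
        cp1251_name = "файл".encode("cp1251")
        path = os.path.join(directory, koi8_name)
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        os.close(fd)
        # Only the identical byte sequence should name the file.
        return can_open(directory, koi8_name), can_open(directory, cp1251_name)
```

On such a filesystem demo() is expected to report success for the KOI8-R
bytes and failure for the CP1251 bytes, even though both spell the same
word to a human reader. No translation happens anywhere in between.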

> (insert appropriate Klingon insult, then wipe the screen
> clean please ;-) (yes I do know that you don't need multibyte characters
> for Klingon, this is an example).)
>
> Marking the disk as "On this disk, all names are Cyrillic" and another disk
> as "Greek" and another as "Big5" and another as ... doesn't make sense
> either. What shall a multilingual translator do, one hard disk per
> language??

A multilingual translator needs a more powerful thing than Unicode anyway,
but regular users, who right now are unlikely to need to handle more than
one or two charsets, definitely don't need to use a "new" one that is
overbloated and unusable with their language.

> You can pick nits with Unicode all you like, but please, if you want to
> replace it, offer us some alternative which actually can be made to work
> for everybody and which isn't just another 80% (or even 99%) non-solution.

Any solution should be as far from filesystem name handling as possible.

> Note that Unicode does have its flaws. I'm not too happy about the fact
> that they mixed up umlauted and diaresised(??) characters, but you got to
> draw the line _somewhere_, or else they'll give us different character
> codes for the distinct lower-case-g and -a glyphs next. :-(

koi8 has different character codes for the Cyrillic and ASCII capital "A",
which look exactly the same (even though people often use different styles
for them in fonts to make them distinguishable) -- and it has _really_
valid reasons that even Unicode's great unifiers accepted.
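
For the record, the two letters can be compared directly. This small
sketch uses Python's standard koi8_r codec and the built-in ord(); the
helper koi8_code is a made-up name for the illustration.

```python
def koi8_code(ch: str) -> int:
    """Return the single KOI8-R byte value for one character."""
    return ch.encode("koi8_r")[0]


CYRILLIC_A = "\u0410"  # CYRILLIC CAPITAL LETTER A
LATIN_A = "A"          # LATIN CAPITAL LETTER A

koi8_pair = (koi8_code(CYRILLIC_A), koi8_code(LATIN_A))
unicode_pair = (ord(CYRILLIC_A), ord(LATIN_A))
```

koi8_pair comes out as (0xE1, 0x41) and unicode_pair as (0x0410, 0x0041),
so both KOI8-R and Unicode keep the two look-alike letters apart.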

> Which BTW
> just shows that non-Latin-1 languages don't have a monopoly on that kind of
> difficulty.

You _used_ Latin1 before Unicode, so it can't be that bad. Others have
had charsets that suit their languages better for a long time, and are
_not_ satisfied with the mess that Unicode offers instead.

--
Alex