Re: [PATCH v9 10/13] exfat: add nls operations

From: Pali Rohár
Date: Sun Jan 05 2020 - 11:51:28 EST


On Thursday 02 January 2020 16:20:33 Namjae Jeon wrote:
> This adds the implementation of nls operations for exfat.

Hello! Throughout the patch series, different naming conventions are
used for nls/Unicode related terms, e.g. uni16s, utf16s, nls, vfsname, ...

Could this be fixed so that the names are unambiguous? The name "uni16s"
is misleading, as Unicode does not fit into a 16-bit type.
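
To illustrate why a 16-bit unit cannot hold every Unicode code point,
here is a small sketch. Python is used only for illustration (this is
not kernel code), and the variable names are made up for the example:

```python
# U+1F600 lies above U+FFFF, i.e. outside the Basic Multilingual Plane.
ch = "\U0001F600"

utf16 = ch.encode("utf-16-le")  # UTF-16 needs a surrogate pair here
utf32 = ch.encode("utf-32-le")  # UTF-32 needs a single 32-bit unit

# Split the UTF-16LE bytes into 16-bit code units.
units16 = [int.from_bytes(utf16[i:i + 2], "little")
           for i in range(0, len(utf16), 2)]

print([hex(u) for u in units16])  # ['0xd83d', '0xde00']
print(len(utf32) // 4)            # 1
```

So a single code point may occupy two 16-bit units, which is why a name
like "uni16s" does not describe what the buffer really holds.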

Based on what is in nls.h, I would propose the following names:

* unicode_t *utf32s always for strings in UTF-32/UCS-4 encoding (host
endianness) (or "unicode_t *unis" as this is the fixed-width encoding
for all Unicode code points)

* wchar_t *utf16s always for strings in UTF-16 encoding (host endianness)

* u8 *utf8s always for strings in UTF-8 encoding

* wchar_t *ucs2s always for strings in UCS-2 encoding (host endianness)

Plus in the case you need to work with UTF-16 or UCS-2 in little endian,
add appropriate naming suffixes.

And use e.g. "vfsname" (char * OR unsigned char * OR u8 *), like you
already have in some places, for strings in iocharset= encoding.


Looking at the whole code, the exfat specification and the usage, the
situation is:

Kernel NLS functions convert between UCS-2 and iocharset=.
The exfat upcase table has definitions only for UCS-2 characters.
All exfat string structures are stored in UTF-16LE, except the upcase
table, which is in UCS-2LE.

The specification is a great mess here, especially when it talks about
the Unicode upcase table for case insensitivity, which is limited to
code points up to U+FFFF, and it does not say anything about Unicode
normalization and normal forms.
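
A rough sketch of what a UCS-2-only upcase table means for UTF-16
strings, again in Python purely for illustration. The helpers
to_units(), upcase_unit() and upcase_units() are hypothetical, and
Python's str.upper() is only a stand-in for the real on-disk table,
which may differ in its individual entries:

```python
def to_units(s):
    """Encode a string as a list of UTF-16 code units."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

def upcase_unit(u):
    """Stand-in for a lookup in a table that only covers U+0000..U+FFFF."""
    if 0xD800 <= u <= 0xDFFF:
        return u                      # surrogate halves have no mapping
    up = chr(u).upper()
    if len(up) == 1 and ord(up) <= 0xFFFF:
        return ord(up)
    return u                          # no single-unit uppercase form

def upcase_units(units):
    """Upcase a name one 16-bit unit at a time."""
    return [upcase_unit(u) for u in units]

bmp_name = to_units("\u00e9")         # e with acute, inside the BMP
astral_name = to_units("\U00010428")  # DESERET SMALL LETTER LONG I

print(upcase_units(bmp_name) == to_units("\u00c9"))  # True
print(upcase_units(astral_name) == astral_name)      # True (unchanged)
```

In this sketch the letter above U+FFFF is never upcased, although
Unicode defines U+10400 as its uppercase form, because each surrogate
half is looked up on its own.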

=======================================================================

And this opens a new question: what should the kernel do if userspace
asks to create these 4 files? (Assume iocharset=utf8 for full Unicode
support.)

1. U+00e9
2. U+0065, U+0301
3. U+00c9
4. U+0045, U+0301

According to the Unicode uppercase algorithm, all 4 filenames result in
the same grapheme, "LATIN CAPITAL LETTER E WITH ACUTE".

But with the current exfat implementation, the first and third are
treated as the same, and the second and fourth are treated as the same.
Therefore the first and fourth are treated as different filenames, even
though they represent the same grapheme and differ only in case.

To prevent this we need to use some kind of Unicode normalization form
here.
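
The four names above can be compared in a short sketch. This is Python
for illustration only: str.upper() stands in for the upcase table, NFC
is just one possible choice of normalization form, and the helper names
are hypothetical:

```python
import unicodedata

names = {
    1: "\u00e9",        # U+00e9
    2: "\u0065\u0301",  # U+0065, U+0301
    3: "\u00c9",        # U+00c9
    4: "\u0045\u0301",  # U+0045, U+0301
}

def upcase_only(s):
    """Case-fold by upcasing code points, without any normalization."""
    return s.upper()

def nfc_then_upcase(s):
    """Normalize to NFC first, then upcase."""
    return unicodedata.normalize("NFC", s).upper()

without_norm = {k: upcase_only(v) for k, v in names.items()}
with_norm = {k: nfc_then_upcase(v) for k, v in names.items()}

# Without normalization: {1, 3} collide and {2, 4} collide, two groups.
print(len(set(without_norm.values())))  # 2
# With NFC applied first: all four names compare equal.
print(len(set(with_norm.values())))     # 1
```

Whether NFC, NFD or something else is the right form for exfat is
exactly the open question; the sketch only shows that some
normalization step is needed for all four names to match.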

What do you think the kernel's exfat driver should do in this case?

CCing Gabriel, as he was implementing some Unicode normalization for the
ext4 driver and maybe could shed some light on the new exfat driver too.

--
Pali Rohár
pali.rohar@xxxxxxxxx
