Re: [RFC PATCH 01/20] fat: Fix iocharset=utf8 mount option

From: OGAWA Hirofumi
Date: Sun Aug 15 2021 - 07:23:20 EST


To: Pali Rohár <pali@xxxxxxxxxx>
Cc: linux-fsdevel@xxxxxxxxxxxxxxx, linux-ntfs-dev@xxxxxxxxxxxxxxxxxxxxx, linux-cifs@xxxxxxxxxxxxxxx, jfs-discussion@xxxxxxxxxxxxxxxxxxxxx, linux-kernel@xxxxxxxxxxxxxxx, Alexander Viro <viro@xxxxxxxxxxxxxxxxxx>, Jan Kara <jack@xxxxxxx>, "Theodore Y . Ts'o" <tytso@xxxxxxx>, Luis de Bethencourt <luisbg@xxxxxxxxxx>, Salah Triki <salah.triki@xxxxxxxxx>, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>, Dave Kleikamp <shaggy@xxxxxxxxxx>, Anton Altaparmakov <anton@xxxxxxxxxx>, Pavel Machek <pavel@xxxxxx>, Marek Behún <marek.behun@xxxxxx>, Christoph Hellwig <hch@xxxxxxxxxxxxx>
Subject: Re: [RFC PATCH 01/20] fat: Fix iocharset=utf8 mount option
From: OGAWA Hirofumi <hirofumi@xxxxxxxxxxxxxxxxxx>
Gcc: nnimap+ibmpc.myhome.or.jp:Sent
--text follows this line--
To: Pali Rohár <pali@xxxxxxxxxx>
Cc: linux-fsdevel@xxxxxxxxxxxxxxx, linux-ntfs-dev@xxxxxxxxxxxxxxxxxxxxx, linux-cifs@xxxxxxxxxxxxxxx, jfs-discussion@xxxxxxxxxxxxxxxxxxxxx, linux-kernel@xxxxxxxxxxxxxxx, Alexander Viro <viro@xxxxxxxxxxxxxxxxxx>, Jan Kara <jack@xxxxxxx>, "Theodore Y . Ts'o" <tytso@xxxxxxx>, Luis de Bethencourt <luisbg@xxxxxxxxxx>, Salah Triki <salah.triki@xxxxxxxxx>, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>, Dave Kleikamp <shaggy@xxxxxxxxxx>, Anton Altaparmakov <anton@xxxxxxxxxx>, Pavel Machek <pavel@xxxxxx>, Marek Behún <marek.behun@xxxxxx>, Christoph Hellwig <hch@xxxxxxxxxxxxx>
Subject: Re: [RFC PATCH 01/20] fat: Fix iocharset=utf8 mount option
From: OGAWA Hirofumi <hirofumi@xxxxxxxxxxxxxxxxxx>
Gcc: nnimap+ibmpc.myhome.or.jp:Sent
--text follows this line--
Pali Rohár <pali@xxxxxxxxxx> writes:

>> This change is not equivalent to utf8=1. In the case of utf8=1, vfat
>> uses iocharset's conversion table and it can handle more than ascii.
>>
>> So this patch is incompatible changes, and handles less chars than
>> utf8=1. So I think this is clean though, but this would be regression
>> for user of utf8=1.
>
> I do not think so... But please correct me, as this code around is mess.
>
> Without this change when utf8=1 is set then iocharset= encoding is used
> for case-insensitivity implementation (toupper / tolower conversion).
> For all other parts are use correct utf8* conversion functions.
>
> But you use touppper / tolower functions from iocharset= encoding on
> stream of utf8 bytes then you either get identity or some unpredictable
> garbage in utf8. So when comparing two (different) non-ASCII filenames
> via this method you in most cases get that filenames are different.
> Because converting their utf8 bytes via toupper / tolower functions from
> iocharset= encoding results in two different byte sequences in most
> cases. Even for two utf8 case-insensitive same strings.
>
> But you can play with it and I guess it is possible to find two
> different utf8 strings which after toupper / tolower conversion from
> some iocharset= encoding would lead to same byte sequence.
>
> This patch uses for utf8 tolower / touppser function simple 7-bit
> tolower / toupper ascii function. And so for 7-bit ascii file names
> there is no change.
>
> So this patch changes behavior when comparing non 7-bit ascii file
> names, but only in cases when previously two different file names were
> marked as same. As now they are marked correctly as different. So this
> is changed behavior, but I guess it is bug fix which is needed.
> If you want I can put this change into separate patch.
>
> Issue that two case-insensitive same files are marked as different is
> not changed by this patch and therefore this issue stay here.

OK, sure. utf8 looks like broken than I was thinking (although user can
use iocharset=ascii and utf8=1 for this). The code might be better to
clean up a bit more though, looks like good basically.

One thing, please update FAT_DEFAULT_IOCHARSET help in Kconfig and
Documentation/filesystems/vfat.rst (with new warning about iocharset=utf8).

Thanks.
--
OGAWA Hirofumi <hirofumi@xxxxxxxxxxxxxxxxxx>