Re: [PATCH v2] checkpatch: Only encode UTF-8 quoted printable mail headers

From: Joe Perches
Date: Wed Jul 18 2018 - 17:55:18 EST


On Wed, 2018-07-18 at 14:42 -0700, Andrew Morton wrote:
> On Wed, 18 Jul 2018 16:52:54 +0200 Geert Uytterhoeven <geert+renesas@xxxxxxxxx> wrote:
>
> > As PERL uses its own internal character encoding, always calling
> > encode("utf8", ...) on the author name may cause corruption, leading to
> > an author signoff mismatch.
> >
> > This happens in the following cases:
> > - If a patch is in ISO-8859, and contains a non-ASCII author name in
> > the From: line, it is converted to UTF-8, while the Signed-off-by
> > line will still be in ISO-8859.
> > - If a patch is in UTF-8, and contains a non-ASCII author name in the
> > body (not header) From: line, it is assumed to be encoded in PERL's
> > internal character encoding, and converted to UTF-8 incorrectly,
> > while the Signed-off-by line will be in real UTF-8.
> >
> > Fix this by only doing the encode step if the From: line used UTF-8
> > quoted printable encoding.
>
> Works for me, thanks.

Me too so far, but I've more testing I'd like to do.

> Relatedly, would it be worth adding a checkpatch warning if a patch
> contains anything other than ASCII or UTF-8?
>
> I added this to my little local patch-checking script.
>
> if ! file $p | grep -q -P "ASCII text|Unicode text"
> then
> 	echo $p: weird charset
> fi

Might be hard to be effective.

For instance, the lkml mail I've kept so far this year uses
a mixture of ascii/utf-8/iso-8859/windows-1252, plus a few
other encodings.

$ grep -Poh "\bcharset=\S+" ~/.local/share/evolution/mail/local/.MailingLists.Linux-Kernel/cur/*|cut -f3- -d:|sort|uniq -c|sort -rn
821 charset=us-ascii
469 charset="UTF-8"
394 charset="ISO-8859-1"
252 charset=US-ASCII
221 charset=utf-8
118 charset=utf-8;
97 charset="utf-8"
66 charset=UTF-8
60 charset="us-ascii"
33 charset=ISO-8859-15
24 charset=iso-8859-1
18 charset=US-ASCII;
11 charset=us-ascii;
7 charset=windows-1252;
7 charset="utf-8";
6 charset="UTF-8";
5 charset=windows-1252
5 charset="iso-8859-1"
4 charset="windows-1252"
3 charset=UTF-8;
3 charset="US-ASCII"
2 charset="iso-2022-jp"
2 charset=gbk;
2 charset="gb2312"
1 charset="utf-7"
1 charset="iso-8859-15"
1 charset=ISO-8859-1
1 charset="gbk";
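
Many of the rows above are the same charset split by case, quoting
and a trailing semicolon. A minimal sketch of folding them together
before counting, where fold_labels is a hypothetical helper name and
not anything in checkpatch:

```shell
# Hypothetical helper: read one label per line on stdin, fold case,
# drop double quotes and a trailing semicolon, then count each
# distinct label, most frequent first.
fold_labels() {
    tr '[:upper:]' '[:lower:]' | tr -d '"' | sed 's/;$//' |
        sort | uniq -c | sort -rn
}

# Possible use with the grep above:
#   grep -Poh "\bcharset=\S+" <maildir>/cur/* | fold_labels
```

That should leave one row per charset, which makes the real spread
(us-ascii, utf-8, iso-8859-*, windows-1252, ...) easier to see.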

And for the transfer encodings:

$ grep "^Content-Transfer-Encoding:" ~/.local/share/evolution/mail/local/.MailingLists.Linux-Kernel/cur/*|cut -f3- -d:|sort|uniq -c|sort -rn
873 Content-Transfer-Encoding: 7bit
212 Content-Transfer-Encoding: 8bit
97 Content-Transfer-Encoding: quoted-printable
63 Content-Transfer-Encoding: base64
56 Content-Transfer-Encoding: 8BIT
24 Content-Transfer-Encoding: 7Bit
3 Content-Transfer-Encoding: 7BIT
2 Content-Transfer-Encoding: QUOTED-PRINTABLE
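
Given how varied the declared labels are, checking the bytes of the
patch rather than the headers might be one way to do it. A rough
sketch, assuming iconv is available and using a hypothetical helper
name (is_valid_utf8); this is not what checkpatch does today:

```shell
# Hypothetical helper: succeed only if the file decodes as UTF-8.
# Plain ASCII also passes, since it is a subset of UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Possible use in a local patch-checking script:
#   is_valid_utf8 "$p" || echo "$p: not ASCII or UTF-8"
```

This only looks at the raw bytes, so a body sent as quoted-printable
or base64 would need decoding first, and 7-bit encodings such as
iso-2022-jp or utf-7 would still pass because they are byte-wise
ASCII.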