Re: [PATCH] checkpatch: use utf-8 match for spell checking

From: Joe Perches
Date: Tue Dec 12 2023 - 14:07:32 EST


On Tue, 2023-12-12 at 10:43 +0100, Antonio Borneo wrote:
> The current code that checks for misspelling verifies, in a more
> complex regex, if $rawline matches [^\w]($misspellings)[^\w]
>
> Because $rawline is a byte string, a byte of a multi-byte utf-8
> character in $rawline can match the non-word-char [^\w].
> E.g.:
> ./scripts/checkpatch.pl --git 81c2f059ab9
> WARNING: 'ment' may be misspelled - perhaps 'meant'?
> #36: FILE: MAINTAINERS:14360:
> +M: Clément Léger <clement.leger@xxxxxxxxxxx>
>        ^^^^
>
> Use a utf-8 version of $rawline for spell checking.
>
> Signed-off-by: Antonio Borneo <antonio.borneo@xxxxxxxxxxx>
> Reported-by: Clément Le Goffic <clement.legoffic@xxxxxxxxxxx>

Seems sensible, thanks, but:

> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
[]
> @@ -3477,7 +3477,8 @@ sub process {
> # Check for various typo / spelling mistakes
> if (defined($misspellings) &&
> ($in_commit_log || $line =~ /^(?:\+|Subject:)/i)) {
> - while ($rawline =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/gi) {
> + my $rawline_utf8 = decode("utf8", $rawline);
> + while ($rawline_utf8 =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/gi) {
> my $typo = $1;
> my $blank = copy_spacing($rawline);

Maybe this needs to use $rawline_utf8 ?

> my $ptr = substr($blank, 0, $-[1]) . "^" x length($typo);

And maybe now the $fix bit will not always work properly
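
For illustration, a minimal standalone sketch of both effects. This is not
checkpatch code: the two sample lines, the two-entry misspelling list and the
typo_offset() helper are made up for the example. It shows the false positive
going away after decoding, and it shows that $-[1] is a byte offset on the raw
string but a character offset on the decoded one, so the spacing string and
the offset probably want to come from the same string.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for the real misspelling list.
my $misspellings = 'ment|teh';
my $re = qr/(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/i;

# Sample lines built as UTF-8 byte strings (assumed input form).
my $name_bytes = encode("UTF-8", "+M:\tCl\x{e9}ment L\x{e9}ger");
my $typo_bytes = encode("UTF-8", "+\t/* caf\x{e9} teh */");

# Hypothetical helper: start offset of the first match, or -1 if none.
sub typo_offset {
    my ($text) = @_;
    return $text =~ $re ? $-[1] : -1;
}

# Byte string: the second byte of the encoded e-acute is not \w,
# so "ment" looks like a standalone word.
print "name, bytes:   ", typo_offset($name_bytes), "\n";                  # 8
# Decoded string: e-acute is a word character, so nothing matches.
print "name, decoded: ", typo_offset(decode("UTF-8", $name_bytes)), "\n"; # -1

# A real typo after a non-ASCII character: the offsets differ by one.
print "typo, bytes:   ", typo_offset($typo_bytes), "\n";                  # 11
print "typo, decoded: ", typo_offset(decode("UTF-8", $typo_bytes)), "\n"; # 10
```

If the match runs on the decoded line while $blank is still derived from
$rawline, the "^" pointer may shift by one column per multi-byte character
that precedes the typo.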