Re: [PATCH] checkpatch: use utf-8 match for spell checking
From: Antonio Borneo
Date: Tue Jan 02 2024 - 11:23:43 EST
On Tue, 2023-12-12 at 11:07 -0800, Joe Perches wrote:
> On Tue, 2023-12-12 at 10:43 +0100, Antonio Borneo wrote:
> > The current code that checks for misspelling verifies, in a more
> > complex regex, if $rawline matches [^\w]($misspellings)[^\w]
> >
> > Since $rawline is a byte string, a single byte of a multi-byte
> > utf-8 character in $rawline can match the non-word-char [^\w].
> > E.g.:
> > ./scripts/checkpatch.pl --git 81c2f059ab9
> > WARNING: 'ment' may be misspelled - perhaps 'meant'?
> > #36: FILE: MAINTAINERS:14360:
> > +M: Clément Léger <clement.leger@xxxxxxxxxxx>
> > ^^^^
> >
> > Use a utf-8 version of $rawline for spell checking.
> >
> > Signed-off-by: Antonio Borneo <antonio.borneo@xxxxxxxxxxx>
> > Reported-by: Clément Le Goffic <clement.legoffic@xxxxxxxxxxx>
>
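For reference, the false positive can be reproduced outside checkpatch. This is only a minimal sketch: the one-word list 'ment' is a hypothetical stand-in for $misspellings, and the input line is typed as raw bytes to mimic what checkpatch reads from a utf-8 patch.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical one-entry word list standing in for $misspellings.
my $misspellings = 'ment';
my $re = qr/(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/i;

# Raw bytes of a utf-8 line: each "é" is the two bytes 0xC3 0xA9.
my $rawline = "+M: Cl\xC3\xA9ment L\xC3\xA9ger";

# On the byte string, 0xA9 is not a word character, so "ment" looks
# like a standalone word and the pattern matches.
my $byte_match = ($rawline =~ $re) ? 1 : 0;

# After decoding, "é" is a single word character and the match goes away.
my $rawline_utf8 = decode("utf8", $rawline);
my $char_match = ($rawline_utf8 =~ $re) ? 1 : 0;

print "bytes: $byte_match, decoded: $char_match\n";   # bytes: 1, decoded: 0
```

The sketch relies on Perl's default byte-string semantics (no "use utf8", no unicode_strings feature), which is also what the patch description assumes for $rawline.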
> Seems sensible, thanks, but:
>
> > diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> []
> > @@ -3477,7 +3477,8 @@ sub process {
> > # Check for various typo / spelling mistakes
> > if (defined($misspellings) &&
> > ($in_commit_log || $line =~ /^(?:\+|Subject:)/i)) {
> > - while ($rawline =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/gi) {
> > + my $rawline_utf8 = decode("utf8", $rawline);
> > + while ($rawline_utf8 =~ /(?:^|[^\w\-'`])($misspellings)(?:[^\w\-'`]|$)/gi) {
> > my $typo = $1;
> > my $blank = copy_spacing($rawline);
>
> Maybe this needs to use $rawline_utf8 ?
Correct, I will send a v2!
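The reason it matters: $-[1] is a character offset when the match runs on the decoded string, but a byte offset when it runs on the raw bytes, so the blank line for the caret has to be built from the same string that was matched. A minimal sketch with a made-up input line and a hypothetical one-word list ('teh'), not the actual checkpatch code:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical one-entry word list and made-up input line.
my $re = qr/(?:^|[^\w\-'`])(teh)(?:[^\w\-'`]|$)/i;
my $rawline = "+L\xC3\xA9ger teh";

# Return the start offset of the first capture group, or -1 if no match.
sub typo_offset {
    my ($str) = @_;
    return ($str =~ $re) ? $-[1] : -1;
}

my $byte_off = typo_offset($rawline);                  # offset in bytes
my $char_off = typo_offset(decode("utf8", $rawline));  # offset in characters

print "bytes: $byte_off, decoded: $char_off\n";   # bytes: 8, decoded: 7
```

The two offsets differ by one for every two-byte character that precedes the typo, which is enough to shift the caret line.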
>
> > my $ptr = substr($blank, 0, $-[1]) . "^" x length($typo);
>
> And maybe now the $fix bit will not always work properly
I have run some tests and it looks OK with the current ASCII file scripts/spelling.txt.
I have also tested adding some utf-8 strings to the spelling file, but checkpatch reads it as
ASCII; extending that to utf-8 would require further modifications in checkpatch, way beyond
this simple fix.
Thanks for the review.
Antonio