Re: [PATCH] checkpatch: handle utf8 while computing length of commit msg lines

From: Joe Perches
Date: Sat Oct 22 2022 - 01:48:36 EST


On Fri, 2022-10-21 at 21:15 +0200, Antonio Borneo wrote:
> The current check for the length of each line in the commit msg
> uses length($line), which counts the line's bytes.
> If the line contains utf8 characters, the byte count can exceed
> the cap even on quite short lines.
>
> Count the utf8 characters for checking line length.
>
> Signed-off-by: Antonio Borneo <antonio.borneo@xxxxxxxxxxx>
>
> ---
>
> Actually it's not fully clear to me if utf8 characters in the
> commit msg are acceptable/tolerated or to be avoided.

Nor is it to me. It's likely OK though, as checkpatch at least has an
existing test/comment for nominally valid UTF-8 in commit messages:

	CHK("INVALID_UTF8",
	    "Invalid UTF-8, patch and commit message should be encoded in UTF-8\n" . $hereptr);

> In the commit msg of 15662b3e8644 ("checkpatch: add a --strict
> check for utf-8 in commit logs") it is stated:
>   Some find using utf-8 in commit logs inappropriate.

I don't particularly care one way or another.

Andrew? Linus?

> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> index 1e5e66ae5a52..eaad5da50554 100755
> --- a/scripts/checkpatch.pl
> +++ b/scripts/checkpatch.pl
> @@ -3220,7 +3220,7 @@ sub process {
>
>  # Check for line lengths > 75 in commit log, warn once
>  		if ($in_commit_log && !$commit_log_long_line &&
> -		    length($line) > 75 &&
> +		    length(decode("utf8", $line)) > 75 &&
>  		    !($line =~ /^\s*[a-zA-Z0-9_\/\.]+\s+\|\s+\d+/ ||
>  					# file delta changes
>  		      $line =~ /^\s*(?:[\w\.\-\+]*\/)++[\w\.\-\+]+:/ ||
>
> base-commit: 9abf2313adc1ca1b6180c508c25f22f9395cc780
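
For anyone following along, the difference between the two length() calls
in the hunk above can be reproduced with a small standalone sketch. It is
not taken from checkpatch: the sample string is made up, and it assumes the
line is held as raw UTF-8 bytes (no encoding layer on the input), with
decode() imported from the core Encode module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical commit-log line stored as raw UTF-8 bytes.
# Each accented letter below is written out as its two-byte encoding.
my $line = "caf\xc3\xa9 na\xc3\xafve r\xc3\xa9sum\xc3\xa9";

# length() on an undecoded string counts bytes.
my $bytes = length($line);

# After decoding, length() counts characters instead.
my $chars = length(decode("utf8", $line));

printf "bytes=%d chars=%d\n", $bytes, $chars;
```

With four two-byte characters in the sample, the byte count comes out four
higher than the character count, which is the effect the patch description
refers to: a line well under the cap in characters can exceed it in bytes.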