Re: [PATCH v2] get_maintainer: correctly parse UTF-8 encoded names in files

From: Alvin Šipraga
Date: Fri Dec 15 2023 - 05:31:28 EST


On Thu, Dec 14, 2023 at 07:57:54AM -0800, Joe Perches wrote:
> On Thu, 2023-12-14 at 16:06 +0100, Alvin Šipraga wrote:
> > @@ -442,7 +443,7 @@ sub maintainers_in_file {
> > my $text = do { local($/) ; <$f> };
> > close($f);
> >
> > - my @poss_addr = $text =~ m$[A-Za-zÀ-ÿ\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
> > + my @poss_addr = $text =~ m$[\p{L}\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
> > push(@file_emails, clean_file_emails(@poss_addr));
> > }
> > }
>
> Rather than open _all_ files in utf-8, perhaps the block
> that opens a specific file to find maintainers
>
> sub maintainers_in_file {
> my ($file) = @_;
>
> return if ($file =~ m@\bMAINTAINERS$@);
>
> if (-f $file && ($email_file_emails || $file =~ /\.yaml$/)) {
> open(my $f, '<', $file)
> or die "$P: Can't open $file: $!\n";
> my $text = do { local($/) ; <$f> };
> close($f);
> ...
>
> should change the
>
> open(my $f...
> to
> use open qw(:std :encoding(UTF-8));
> open(my $f...

Yes, this also works for parsing the name in an arbitrary file. But with the
change you suggest above, the script then corrupts my name when it is lifted
from MAINTAINERS (!?):

$ ./scripts/get_maintainer.pl -f drivers/net/dsa/realtek/ | grep alsi
"Alvin Å ipraga" <alsi@xxxxxxxxxxxxxxx> (maintainer:REALTEK RTL83xx SMI DSA ROUTER CHIPS)

I'm not entirely sure why that happens, since my name doesn't get corrupted when
coming from MAINTAINERS with the upstream state of the script. But anyway, with
your suggestion I would then also have to add it here:

@@ -347,6 +346,7 @@ my @mfiles = ();
my @self_test_info = ();

sub read_maintainer_file {
+ use open qw(:std :encoding(UTF-8));
my ($file) = @_;

open (my $maint, '<', "$file")

... and I guess there might be other cases too.

Rather than scattering it all over, don't you think it's more robust to open all
files in UTF-8? I tried to show in one of my replies to v1 [1] that this should
be compatible with basically all of the source tree.

[1] https://lore.kernel.org/all/dzn6uco4c45oaa3ia4u37uo5mlt33obecv7gghj2l756fr4hdh@mt3cprft3tmq/

If you are still unconvinced then I will gladly send a v3 patching the two cases
we have discussed (read_maintainer_file() and maintainers_in_file()).

>
>
> And unrelated and secondarily, perhaps the
> $file =~ /\.yaml$/
> test should be
> $file =~ /\.(?:yaml|dtsi?)$/
>
> to also find any maintainer address in the dts* files
>
> https://lore.kernel.org/lkml/20231028174656.GA3310672@bill-the-cat/T/

Is this supposed to parse the "Copyright (c) 20xx John Doe <foo@xxxxxxxx>" in
the .dts* files? But sure, I can do a resend of Shawn's original patch
separately if you like.

Kind regards,
Alvin