Re: [PATCH v2] checkpatch: fix false positives in REPEATED_WORD warning

From: Aditya
Date: Thu Oct 22 2020 - 15:15:10 EST


On 22/10/20 9:40 pm, Joe Perches wrote:
> On Thu, 2020-10-22 at 20:20 +0530, Aditya Srivastava wrote:
>> Presence of hexadecimal address or symbol results in false warning
>> message by checkpatch.pl.
> []
>> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> []
>> @@ -3051,7 +3051,10 @@ sub process {
>> }
>>
>> # check for repeated words separated by a single space
>> - if ($rawline =~ /^\+/ || $in_commit_log) {
>> +# avoid false positive from list command eg, '-rw-r--r-- 1 root root'
>> + if (($rawline =~ /^\+/ || $in_commit_log) &&
>> + $rawline !~ /[bcCdDlMnpPs\?-][rwxsStT-]{9}/) {
>
> Alignment and use \b before and after the regex please.

If we use \b before the regex, after it, or both, it no longer matches
lines such as:
+ -rw-r--r--. 1 root root 112K Mar 20 12:16 selinux-policy-3.14.4-48.fc31.noarch.rpm

This is probably because '-' is a non-word character, so \b finds no
word boundary between '-' and the adjacent space or '.'.
I have not observed any negative effects from the pattern without \b, though.
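
The \b behaviour described above can be reproduced outside checkpatch.
The following is only a sketch in Python, whose `re` module treats \b
the same way as Perl for this ASCII input; the names LINE, PERMS and
matches() are hypothetical and not part of checkpatch.pl.

```python
import re

# The 'ls -l' style line quoted above (joined onto one line).
LINE = ("+ -rw-r--r--. 1 root root 112K Mar 20 12:16 "
        "selinux-policy-3.14.4-48.fc31.noarch.rpm")

# Permission-string pattern from the patch, without any \b anchors.
PERMS = r"[bcCdDlMnpPs\?-][rwxsStT-]{9}"


def matches(pattern, text=LINE):
    """Return True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None


# '-' is a non-word character, so there is no word boundary between
# ' ' and '-' (leading side) or between '-' and '.' (trailing side).
plain = matches(PERMS)
leading = matches(r"\b" + PERMS)
trailing = matches(PERMS + r"\b")
both = matches(r"\b" + PERMS + r"\b")

print(plain, leading, trailing, both)
```

Only the un-anchored pattern matches here, which is consistent with
keeping the regex without \b.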

>
> if (($rawline =~ /^\+/ || $in_commit_log) &&
> $rawline !~ /\b[bcCdDlMnpPs\?-][rwxsStT-]{9}\b/) {
>> @@ -3065,6 +3068,34 @@ sub process {
>> next if ($first ne $second);
>> next if ($first eq 'long');
>>
>> + # avoid repeating hex occurrences like 'ff ff fe 09 ...'
>> + if ($first =~ /\b[0-9a-f]{2,}/) {
>> + # if such sequence occurs more than 4, it is most probably part of some of code
>> + next if ((scalar @hex_seq)>4);
>> + # for hex occurrences which are less than 4
>> + # get first hex word in the line
>> + if ($rawline =~ /\b[0-9a-f]{2,} /) {
>> + my $post_hex_seq = $';
>> +
>> + # set suffieciently high default values to avoid ignoring or counting in absence of another
>> + my $non_hex_char_pos = 1000;
>> + my $special_chars_pos = 500;
>> +
>> + if ($post_hex_seq =~ /[g-z]+/) {
>> + # first non hex character in post_hex_seq
>> + $non_hex_char_pos = $-[0];
>> + }
>> + if($post_hex_seq =~ /[^a-zA-Z0-9]{2,}/) {
>> + # first occurrence of 2 or more special chars
>> + $special_chars_pos = $-[0];
>> + }
>
> What does all this code actually avoid?
>
>

Sir, this warning occurs for several variations of hex output,
e.g.:
1) 00 c0 06 16 00 00 ff ff 00 93 1c 18 00 00 ff ff ................
2) ffffffff ffffffff 00000000 c070058c
3) f5a: 48 c7 44 24 78 ff ff movq $0xffffffffffffffff,0x78(%rsp)
4) + fe fe
5) + fe fe - ? end marker ?
6) Code: ff ff 48 (...)

So I first check whether the repeated word matches /\b[0-9a-f]{2,}/. If it
does, and it occurs in a sequence of more than 4 such words (i.e. 5 or
more), then it is most probably part of a hexadecimal dump. This is
implemented here:

+ if ($first =~ /\b[0-9a-f]{2,}/) {
+ # if such sequence occurs more than 4, it is most probably part of some of code
+ next if ((scalar @hex_seq)>4);

This handles warnings similar to examples (1), (2) and (3).

But this still does not catch examples (4), (5) and (6). One could argue
that we can change:

+ next if ((scalar @hex_seq)>4);

to (scalar @hex_seq)>2 or (scalar @hex_seq)>3

but then we would also suppress genuine warnings such as:

7) + * sets this to -1, the slack value will be calculated to be be halfway
8) + * @seg: index of packet segment whose raw fields are to be be extracted
9) The data in destination buffer is expected to be be parsed in big
10) + * 1. New session or device can'be be created - session sysfs files

Here I observed that in hex output there are at least 2 consecutive
special characters before any non-hex character, e.g. in (5). Such
sequences are also very rare in written English, which helps in our
case.

This is implemented here:

>> + # avoid repeating hex occurrences like 'ff ff fe 09 ...'
>> + if ($first =~ /\b[0-9a-f]{2,}/) {
>> + # if such sequence occurs more than 4, it is most probably part of some of code
>> + next if ((scalar @hex_seq)>4);
>> + # for hex occurrences which are less than 4
>> + # get first hex word in the line
>> + if ($rawline =~ /\b[0-9a-f]{2,} /) {
>> + my $post_hex_seq = $';
>> +
>> + # set suffieciently high default values to avoid ignoring or counting in absence of another
>> + my $non_hex_char_pos = 1000;
>> + my $special_chars_pos = 500;
>> +
>> + if ($post_hex_seq =~ /[g-z]+/) {
>> + # first non hex character in post_hex_seq
>> + $non_hex_char_pos = $-[0];
>> + }
>> + if($post_hex_seq =~ /[^a-zA-Z0-9]{2,}/) {
>> + # first occurrence of 2 or more special chars
>> + $special_chars_pos = $-[0];
>> + }

I have used these two defaults for cases like example (4):
+ my $non_hex_char_pos = 1000;
+ my $special_chars_pos = 500;

Here, non-hex characters are missing, so the default values give us
the desired result.
Also, I have set the defaults high enough that if only one of the two
patterns occurs in a line, the result remains unaffected, which would
not hold with lower default values.
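
The whole heuristic can be sketched roughly as below. This is a Python
stand-in, not the code in the patch, and it rests on two assumptions
that the quoted hunk does not show: that @hex_seq holds the hex-like
words followed by a space on the line, and that the final decision is
"treat as hex if the special characters come before the first non-hex
character". The function name looks_like_hex_dump() is hypothetical.

```python
import re

HEX_WORD = r"\b[0-9a-f]{2,}"


def looks_like_hex_dump(line, first):
    """Return True if a repeated word should be skipped as hex output."""
    if not re.search(HEX_WORD, first):
        return False

    # ASSUMPTION: stand-in for @hex_seq (hex-like words followed by a space).
    hex_seq = re.findall(HEX_WORD + " ", line)
    if len(hex_seq) > 4:
        return True

    # Text after the first hex word on the line (Perl's $').
    first_hex = re.search(HEX_WORD + " ", line)
    if first_hex is None:
        return False
    post_hex_seq = line[first_hex.end():]

    # Defaults from the patch, used when a pattern does not occur.
    non_hex_char_pos = 1000
    special_chars_pos = 500

    non_hex = re.search(r"[g-z]+", post_hex_seq)
    if non_hex:
        non_hex_char_pos = non_hex.start()
    special = re.search(r"[^a-zA-Z0-9]{2,}", post_hex_seq)
    if special:
        special_chars_pos = special.start()

    # ASSUMPTION: the comparison itself is not shown in the quoted hunk.
    return special_chars_pos < non_hex_char_pos


print(looks_like_hex_dump("+ fe fe", "fe"))
print(looks_like_hex_dump("The data in destination buffer is expected "
                          "to be be parsed in big", "be"))
```

Under these assumptions, examples (4), (5) and (6) are skipped, while
the "be be" lines (7) to (10) still produce the warning.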


Thanks
Aditya