Re: [PROBLEM] [ADDITIONAL DIAG] seltests: net/forwarding/sch_ets.sh [HANG]

From: Petr Machata
Date: Sat Jul 22 2023 - 08:40:20 EST



Mirsad Todorovac <mirsad.todorovac@xxxxxxxxxxxx> writes:

> On 7/20/23 18:25, Mirsad Todorovac wrote:
>> On 7/20/23 18:07, Mirsad Todorovac wrote:
>>> root@defiant:/home/marvin/linux/kernel/linux_torvalds/tools/testing/selftests/net/forwarding#
>>>
>>>> That would imply that the reproducer needs to include the previous tests as
>>>> well.
>>>
>>> This is entirely possible, as timeouts and CTRL+C events do not seem to be caught
>>> and the cleanup is not done ...
>>>
>>> sch_ets_core.sh:    trap cleanup EXIT
>>>
>>> Some tests time out even after settings:timeout=240, so IMHO this should be taken into account.
>>>
>>> Best regards,
>>> Mirsad Todorovac
>>>
>>>> It looks like the test is stuck in ets_test_mixed in classifier_mode.
>>>> A way to run just this test would be:
>>>>
>>>>      TESTS="classifier_mode ets_test_mixed" ./sch_ets.sh
>>>>
>>>> Looking at the code, the only place that I can see that waits on
>>>> anything is the "{ kill %% && wait %%; } 2>/dev/null" line in
>>>> stop_traffic() (in lib.sh). Maybe something like this would let
>>>> us see if that's the case:
>>>>
>>>> modified   tools/testing/selftests/net/forwarding/lib.sh
>>>> @@ -1468,8 +1470,10 @@ start_tcp_traffic()
>>>>   stop_traffic()
>>>>   {
>>>> +    echo killing MZ
>>>>       # Suppress noise from killing mausezahn.
>>>>       { kill %% && wait %%; } 2>/dev/null
>>>> +    echo killed MZ
>>>>   }
>> FYI, this is the [incomplete] list of the tests that time out even after
>> being assigned long timeout of 240s instead of default.
>> marvin@defiant:~/linux/kernel/linux_torvalds$ grep TIMEOUT
>> ../kselftest-6.5-rc2-net-forwarding-8.log
>> not ok 13 selftests: net/forwarding: custom_multipath_hash.sh # TIMEOUT 240 seconds
>> not ok 18 selftests: net/forwarding: gre_custom_multipath_hash.sh # TIMEOUT 240 seconds
>> not ok 19 selftests: net/forwarding: gre_inner_v4_multipath.sh # TIMEOUT 240 seconds
>> not ok 21 selftests: net/forwarding: gre_multipath_nh_res.sh # TIMEOUT 240 seconds
>> not ok 22 selftests: net/forwarding: gre_multipath_nh.sh # TIMEOUT 240 seconds
>> not ok 27 selftests: net/forwarding: ip6gre_custom_multipath_hash.sh # TIMEOUT 240 seconds
>> not ok 34 selftests: net/forwarding: ip6gre_inner_v4_multipath.sh # TIMEOUT 240 seconds
>> not ok 58 selftests: net/forwarding: no_forwarding.sh # TIMEOUT 240 seconds
>> not ok 67 selftests: net/forwarding: router_mpath_nh_res.sh # TIMEOUT 240 seconds
>> not ok 68 selftests: net/forwarding: router_mpath_nh.sh # TIMEOUT 240 seconds
>> not ok 70 selftests: net/forwarding: router_multipath.sh # TIMEOUT 240 seconds
>> marvin@defiant:~/linux/kernel/linux_torvalds$
>
> The test seem to be stuck despite adding SIGTERM cleanup to all tests
> that timed out:
>
> # selftests: net/forwarding: sch_ets.sh
> # TEST: ping vlan 10 [ OK ]
> # TEST: ping vlan 11 [ OK ]
> # TEST: ping vlan 12 [ OK ]
> # Running in priomap mode
> # Testing ets bands 3 strict 3, streams 0 1
> # TEST: band 0 [ OK ]
> # INFO: Expected ratio >95% Measured ratio 100.00
> # TEST: band 1 [ OK ]
> # INFO: Expected ratio <5% Measured ratio 0
> # Testing ets bands 3 strict 3, streams 1 2
> # TEST: band 1 [ OK ]
> # INFO: Expected ratio >95% Measured ratio 100.00
> # TEST: band 2 [ OK ]
> # INFO: Expected ratio <5% Measured ratio 0
> # Testing ets bands 4 strict 1 quanta 5000 2500 1500, streams 0 1
> # TEST: band 0 [ OK ]
> # INFO: Expected ratio >95% Measured ratio 100.00
> # TEST: band 1 [ OK ]
> # INFO: Expected ratio <5% Measured ratio 0
> # Testing ets bands 4 strict 1 quanta 5000 2500 1500, streams 1 2
> # TEST: bands 1:2 [ OK ]
> # INFO: Expected ratio 2.00 Measured ratio 2.00
> # Testing ets bands 3 quanta 3300 3300 3300, streams 0 1 2
> # TEST: bands 0:1 [ OK ]
> # INFO: Expected ratio 1.00 Measured ratio .99
> # TEST: bands 0:2 [ OK ]
> # INFO: Expected ratio 1.00 Measured ratio .99
> # Testing ets bands 3 quanta 5000 3500 1500, streams 0 1 2
> # TEST: bands 0:1 [ OK ]
> # INFO: Expected ratio 1.42 Measured ratio 1.42
> # TEST: bands 0:2 [ OK ]
> # INFO: Expected ratio 3.33 Measured ratio 3.33
> # Testing ets bands 3 quanta 5000 8000 1500, streams 0 1 2
> # TEST: bands 0:1 [ OK ]
> # INFO: Expected ratio 1.60 Measured ratio 1.59
> # TEST: bands 0:2 [ OK ]
> # INFO: Expected ratio 3.33 Measured ratio 3.33
> # Testing ets bands 2 quanta 5000 2500, streams 0 1
> # TEST: bands 0:1 [ OK ]
> # INFO: Expected ratio 2.00 Measured ratio 1.99
> # Running in classifier mode
> # Testing ets bands 3 strict 3, streams 0 1
> # TEST: band 0 [ OK ]
> # INFO: Expected ratio >95% Measured ratio 100.00
> # TEST: band 1 [ OK ]
> # INFO: Expected ratio <5% Measured ratio 0
> # Testing ets bands 3 strict 3, streams 1 2
> # TEST: band 1 [ OK ]
> # INFO: Expected ratio >95% Measured ratio 100.00
> # TEST: band 2 [ OK ]
> # INFO: Expected ratio <5% Measured ratio 0
> # Testing ets bands 4 strict 1 quanta 5000 2500 1500, streams 0 1
>
> What is happening is a stuck mousezahn loop:
>
> root 558266 2732 0 01:17 pts/2 00:00:23 mausezahn veth0.10 -p 8000 -A 192.0.2.1 -B 192.0.2.2 -c 0 -a own -b ba:33:37:81:dc:5
> root 558273 2732 0 01:17 pts/2 00:00:23 mausezahn veth0.11 -p 8000 -A 192.0.2.17 -B 192.0.2.18 -c 0 -a own -b ba:33:37:81:dc

Sure, the test is stuck and therefore didn't reap its MZ invocations.
The test is supposed to invoke stop_traffic(), which then does { kill %%
&& wait %%; }, which should do it.

The question is, does it reach that statement? If yes, the wait is
likely where it's stuck, but why? If not, where else could it be stuck?
I don't see any other place where the test loops endlessly or waits.

> I am not sure if this will ever complete, or is it a runaway loop?

It won't, that's what the "-c 0" on the command like does.

> Probably it is stuck, for nothing was written to the test log for 71 minutes.
>
> I hope we are coming closer to the cause.
>
> Previously I didn't figure out what hanged.

The script did :)

The fact that there's mausezahn running alongside is relevant, but not
the cause.