Re: [BUG] [FIXED: TESTED] kmemleak in rtnetlink_rcv() triggered by selftests/drivers/net/team in build cdc9718d5e59

From: Mirsad Goran Todorovac
Date: Fri Apr 21 2023 - 07:37:00 EST


On 13.4.2023. 20:19, Ido Schimmel wrote:
On Mon, Apr 10, 2023 at 07:34:09PM +0200, Mirsad Goran Todorovac wrote:
I've run "make kselftest" with vanilla torvalds tree 6.3-rc5 + your patch.

It failed two lines after "enslaved device client - ns-A IP" which passed OK.

Is this a hang? The selftests: net: fcnal-test.sh test has been sitting for 5 hours
at the line below (please see the end of the output):

It's not clear to me if the test failed for you or just got stuck. The
output below is all "[ OK ]".

I ran the test with my patch and got:

Tests passed: 875
Tests failed: 5

I don't believe the failures are related to my patch given the test
doesn't use bonding.
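For anyone comparing runs, here is a rough sketch of how the pass/fail totals could be tallied from a saved log. It assumes that failed cases end in "[FAIL]", mirroring the "[ OK ]" marker in the output below; the helper name count_results is made up for illustration.

```shell
#!/bin/sh
# Tally result markers in a saved fcnal-test.sh log read from stdin.
# Assumption: failing cases end in "[FAIL]", passing ones in "[ OK ]".
count_results() {
    awk '
        /\[ OK \][ \t]*$/ { ok++ }
        /\[FAIL\][ \t]*$/ { fail++; failed[fail] = $0 }
        END {
            # Summary first, then every failing line for closer inspection.
            printf "passed=%d failed=%d\n", ok, fail
            for (i = 1; i <= fail; i++) print failed[i]
        }
    '
}
```

Usage would be along the lines of "count_results < fcnal.log", where fcnal.log is a hypothetical file holding the captured test output.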

See more below.


# ###########################################################################
# IPv4 address binds
# ###########################################################################
#
#
# #################################################################
# No VRF
#
# SYSCTL: net.ipv4.ping_group_range=0 2147483647
#
# TEST: Raw socket bind to local address - ns-A IP [ OK ]
# TEST: Raw socket bind to local address after device bind - ns-A IP [ OK ]
# TEST: Raw socket bind to local address - ns-A loopback IP [ OK ]
# TEST: Raw socket bind to local address after device bind - ns-A loopback IP [ OK ]
# TEST: Raw socket bind to nonlocal address - nonlocal IP [ OK ]
# TEST: TCP socket bind to nonlocal address - nonlocal IP [ OK ]
# TEST: ICMP socket bind to nonlocal address - nonlocal IP [ OK ]
# TEST: ICMP socket bind to broadcast address - broadcast [ OK ]
# TEST: ICMP socket bind to multicast address - multicast [ OK ]
# TEST: TCP socket bind to local address - ns-A IP [ OK ]
# TEST: TCP socket bind to local address after device bind - ns-A IP [ OK ]
#
# #################################################################
# With VRF
#
# SYSCTL: net.ipv4.ping_group_range=0 2147483647
#
# TEST: Raw socket bind to local address - ns-A IP [ OK ]
# TEST: Raw socket bind to local address after device bind - ns-A IP [ OK ]
# TEST: Raw socket bind to local address after VRF bind - ns-A IP [ OK ]
# TEST: Raw socket bind to local address - VRF IP [ OK ]
# TEST: Raw socket bind to local address after device bind - VRF IP [ OK ]
# TEST: Raw socket bind to local address after VRF bind - VRF IP [ OK ]
# TEST: Raw socket bind to out of scope address after VRF bind - ns-A loopback IP [ OK ]
# TEST: Raw socket bind to nonlocal address after VRF bind - nonlocal IP [ OK ]
# TEST: TCP socket bind to nonlocal address after VRF bind - nonlocal IP [ OK ]
# TEST: ICMP socket bind to nonlocal address after VRF bind - nonlocal IP [ OK ]
# TEST: ICMP socket bind to broadcast address after VRF bind - broadcast [ OK ]
# TEST: ICMP socket bind to multicast address after VRF bind - multicast [ OK ]
# TEST: TCP socket bind to local address - ns-A IP [ OK ]
# TEST: TCP socket bind to local address after device bind - ns-A IP [ OK ]
# TEST: TCP socket bind to local address - VRF IP [ OK ]
# TEST: TCP socket bind to local address after device bind - VRF IP [ OK ]
# TEST: TCP socket bind to invalid local address for VRF - ns-A loopback IP [ OK ]
# TEST: TCP socket bind to invalid local address for device bind - ns-A loopback IP [ OK ]
#
# ###########################################################################
# Run time tests - ipv4
# ###########################################################################
#
# TEST: Device delete with active traffic - ping in - ns-A IP [ OK ]
# TEST: Device delete with active traffic - ping in - VRF IP [ OK ]
# TEST: Device delete with active traffic - ping out - ns-B IP [ OK ]
# TEST: TCP active socket, global server - ns-A IP [ OK ]
# TEST: TCP active socket, global server - VRF IP [ OK ]
# TEST: TCP active socket, VRF server - ns-A IP [ OK ]
# TEST: TCP active socket, VRF server - VRF IP [ OK ]
# TEST: TCP active socket, enslaved device server - ns-A IP [ OK ]
# TEST: TCP active socket, VRF client - ns-A IP [ OK ]
# TEST: TCP active socket, enslaved device client - ns-A IP [ OK ]
# TEST: TCP active socket, global server, VRF client, local - ns-A IP [ OK ]
# TEST: TCP active socket, global server, VRF client, local - VRF IP [ OK ]
# TEST: TCP active socket, VRF server and client, local - ns-A IP [ OK ]
# TEST: TCP active socket, VRF server and client, local - VRF IP [ OK ]
# TEST: TCP active socket, global server, enslaved device client, local - ns-A IP [ OK ]
# TEST: TCP active socket, VRF server, enslaved device client, local - ns-A IP [ OK ]
# TEST: TCP active socket, enslaved device server and client, local - ns-A IP [ OK ]
# TEST: TCP passive socket, global server - ns-A IP [ OK ]
# TEST: TCP passive socket, global server - VRF IP [ OK ]
# TEST: TCP passive socket, VRF server - ns-A IP [ OK ]
# TEST: TCP passive socket, VRF server - VRF IP [ OK ]
# TEST: TCP passive socket, enslaved device server - ns-A IP [ OK ]
# TEST: TCP passive socket, VRF client - ns-A IP [ OK ]
# TEST: TCP passive socket, enslaved device client - ns-A IP [ OK ]
# TEST: TCP passive socket, global server, VRF client, local - ns-A IP [ OK ]

Hope this helps.

I also have an iwlwifi DEADLOCK, and I don't know whether it should be reported independently.
(I don't think it is related to the patch.)

If the test got stuck, then it might be related to the deadlock in
iwlwifi. Try running the test without iwlwifi and see if it helps. If
not, I suggest starting a different thread about this issue.
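A minimal sketch of checking whether the driver is loaded before re-running the test. The helper name module_loaded is hypothetical; it only parses a /proc/modules-style listing and does not unload anything itself.

```shell
#!/bin/sh
# Report whether a module appears in a /proc/modules-style listing.
# $1 = module name, $2 = listing file (defaults to /proc/modules).
# Exit status 0 means the module is listed, non-zero means it is not.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}
```

If "module_loaded iwlwifi" succeeds, the driver could then be removed as root with "modprobe -r", for example "modprobe -r iwlmvm iwlwifi". Naming iwlmvm there is an assumption: on many systems that op-mode module holds a reference to iwlwifi, but the modules actually in use on the affected machine are not stated in this thread.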

Will submit the bonding patch over the weekend.

Tested it again, with only the net selftest subtree:

tools/testing/selftests/Makefile:
TARGETS += drivers/net/bonding
TARGETS += drivers/net/team
TARGETS += net
TARGETS += net/af_unix
TARGETS += net/forwarding
TARGETS += net/hsr
# TARGETS += net/mptcp
TARGETS += net/openvswitch
TARGETS += netfilter

and it failed to reproduce the hang. (NOTE: it was in fact only the script stalling
indefinitely, not a process that "kill -9 <PID>" could not kill.)
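As an alternative to editing the Makefile, the same subset can be chosen by overriding TARGETS on the make command line, which is how the kselftest documentation describes running selected targets. A small sketch follows; the helper name selftest_cmd is made up, and it only prints the command rather than running it.

```shell
#!/bin/sh
# Print (do not run) the make invocation for a subset of selftest targets.
# All arguments are joined with spaces into a single TARGETS value.
selftest_cmd() {
    printf 'make -C tools/testing/selftests TARGETS="%s" run_tests\n' "$*"
}

# Example: show the command for two of the targets listed above.
selftest_cmd drivers/net/team net
```

The printed command is meant to be run from the top of the kernel source tree.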

With or without the iwlwifi module, it now appears to work as a standalone test.

The problem might indeed be a spurious lockup in iwlwifi. In the journalctl logs I've
noticed an attempt to take an already-held lock from within an interrupt, but I am really
not that familiar with the iwlwifi driver's code ... It is apparently not a deterministic
error bound to repeat with every test.
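A rough sketch of pulling such reports out of the kernel log. The patterns are common lockdep report headings; the exact message seen on this machine is not quoted here, so the list is an assumption and may need adjusting. The helper name lock_warnings is hypothetical.

```shell
#!/bin/sh
# Keep only lockdep-style warning headings from kernel log text on stdin.
# Assumption: the report uses one of these usual lockdep headings.
lock_warnings() {
    grep -E 'possible recursive locking detected|inconsistent lock state|possible circular locking dependency detected'
}
```

Typical usage would be "journalctl -k -b | lock_warnings", which scans the kernel messages of the current boot.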

I reckon the tests prior to the net subtree have done something to my kernel but thus far
I could not isolate the culprit test.

# tools/testing/selftests/net/fcnal-test.sh alone passes OK.

Thanks for testing

Not at all. I apologise for the false alarm.

Thanks for patching at such short notice.

The patch closes the memory leak. As the latest change it was naturally the prime suspect
for the hang, but it no longer appears to be the cause.
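For re-checking the leak after a test run, here is a minimal sketch that counts entries in a kmemleak report. It assumes the usual report format in which each entry starts with "unreferenced object"; the helper name count_leaks is made up.

```shell
#!/bin/sh
# Count kmemleak report entries in text read from stdin.
# Each entry is assumed to begin with "unreferenced object 0x...".
count_leaks() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c '^unreferenced object ' || true
}
```

As root, with debugfs mounted, a scan can be triggered with "echo scan > /sys/kernel/debug/kmemleak", after which "count_leaks < /sys/kernel/debug/kmemleak" gives the number of reported objects.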

It would require more work to isolate the particular test that caused the hang, but I don't
know whether I have enough resources, mainly time, or any assurance that I am heading in
the right direction. :-/
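One possible way to narrow it down, sketched under the assumption that the stall shows up when targets are run one at a time: wrap each run in a deadline and note which one has to be killed. The helper name run_with_deadline and the 7200 second limit are arbitrary choices for illustration; the sketch relies on coreutils timeout(1), which exits with status 124 when the deadline expires.

```shell
#!/bin/sh
# Run a command under a deadline and classify the outcome.
# $1 = deadline in seconds, remaining arguments = the command to run.
run_with_deadline() {
    secs=$1
    shift
    rc=0
    timeout "$secs" "$@" >/dev/null 2>&1 || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "STALL: $*"        # timeout(1) had to stop the command
    elif [ "$rc" -eq 0 ]; then
        echo "PASS: $*"
    else
        echo "FAIL($rc): $*"
    fi
}

# Hypothetical usage from the top of the kernel tree, one target per run:
#   for t in drivers/net/bonding drivers/net/team net; do
#       run_with_deadline 7200 make -C tools/testing/selftests TARGETS="$t" run_tests
#   done
```

Running targets in isolation would not catch a stall that depends on state left behind by an earlier test, so the order of the full run may still matter.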

Best regards,
Mirsad

--
Mirsad Todorovac
System engineer
Faculty of Graphic Arts | Academy of Fine Arts
University of Zagreb
Republic of Croatia, the European Union
