Re: POSSIBLE BUG: selftests/net/fcnal-test.sh: [FAIL][FIX TESTED] in vrf "bind - ns-B IPv6 LLA" test

From: Mirsad Goran Todorovac
Date: Sat Jun 10 2023 - 14:04:29 EST


On 6/9/23 18:13, Guillaume Nault wrote:
On Thu, Jun 08, 2023 at 07:37:15AM +0200, Mirsad Goran Todorovac wrote:
On 6/7/23 18:51, Guillaume Nault wrote:
On Wed, Jun 07, 2023 at 12:04:52AM +0200, Mirsad Goran Todorovac wrote:
[...]
TEST: ping local, VRF bind - ns-A IP [ OK ]
TEST: ping local, VRF bind - VRF IP [FAIL]
TEST: ping local, VRF bind - loopback [ OK ]
TEST: ping local, device bind - ns-A IP [FAIL]
TEST: ping local, device bind - VRF IP [ OK ]
[...]
TEST: ping local, VRF bind - ns-A IP [ OK ]
TEST: ping local, VRF bind - VRF IP [FAIL]
TEST: ping local, VRF bind - loopback [ OK ]
TEST: ping local, device bind - ns-A IP [FAIL]
TEST: ping local, device bind - VRF IP [ OK ]
[...]

I have the same failures here. They don't seem to be recent.
I'll take a look.

Certainly. I thought it might be something architecture-specific?

I have reproduced it also on a Lenovo IdeaPad 3 with Ubuntu 22.10,
but on Lenovo desktop with AlmaLinux 8.8 (CentOS fork), the result
was "888/888 passed".

I've taken a deeper look at these failures. That's actually a problem in
ping. That's probably why you have different results depending on the
distribution.

Thank you for your work. I feel encouraged by your aim to get to the bottom
of the problem ...

The problem is that, in some versions of ping, 'ping -I netdev ...' doesn't
bind the socket to 'netdev' if the IPv4 address to ping is set on that
same device. The VRF tests depend on this socket binding, so they fail
when ping refuses to bind. That was fixed upstream with commit
92ce8ef21393 ("Revert "ping: do not bind to device when destination IP
is on device"") (https://github.com/iputils/iputils/commit/92ce8ef2139353da3bf55fe2280bd4abd2155c9f).

Long story short, the tests should pass with the latest upstream ping
version.

Alternatively, you can modify the commands run by fcnal-test.sh and
provide the -I option twice: once for setting the device binding and once
for setting the source IPv4 address. This way ping should agree to
bind its socket.

Something like (not tested):

- run_cmd ping -c1 -w1 -I ${VRF} ${a}
+ run_cmd ping -c1 -w1 -I ${VRF} -I ${a} ${a}
[...]
- run_cmd ping -c1 -w1 -I ${NSA_DEV} ${a}
+ run_cmd ping -c1 -w1 -I ${NSA_DEV} -I ${a} ${a}
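
The double -I pattern in the diff above can also be sketched as a small
helper, to make the argument order explicit. This is only an illustration:
ping_bind_args is a hypothetical name that does not exist in fcnal-test.sh,
and the device name and address in the example call are placeholders.

```shell
# Hypothetical helper (not part of fcnal-test.sh): build ping arguments that
# pass -I twice, once for the device binding and once for the source address.
ping_bind_args() {
	local dev="$1"   # device or VRF to bind to, e.g. ${VRF} or ${NSA_DEV}
	local addr="$2"  # local IPv4 address, used as both source and destination
	printf '%s\n' "-c1 -w1 -I ${dev} -I ${addr} ${addr}"
}

# Print the resulting command line without running it (placeholder values).
echo "ping $(ping_bind_args red 172.16.1.1)"
```

The helper only builds the argument string; whether ping then binds its
socket still depends on the installed ping version.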

I have tested this and the fix appears to work:

#################################################################
With VRF

SYSCTL: net.ipv4.raw_l3mdev_accept=1

TEST: ping out, VRF bind - ns-B IP [ OK ]
TEST: ping out, device bind - ns-B IP [ OK ]
TEST: ping out, vrf device + dev address bind - ns-B IP [ OK ]
TEST: ping out, vrf device + vrf address bind - ns-B IP [ OK ]
TEST: ping out, VRF bind - ns-B loopback IP [ OK ]
TEST: ping out, device bind - ns-B loopback IP [ OK ]
TEST: ping out, vrf device + dev address bind - ns-B loopback IP [ OK ]
TEST: ping out, vrf device + vrf address bind - ns-B loopback IP [ OK ]
TEST: ping in - ns-A IP [ OK ]
TEST: ping in - VRF IP [ OK ]
TEST: ping local, VRF bind - ns-A IP [ OK ]
TEST: ping local, VRF bind - VRF IP [ OK ]
TEST: ping local, VRF bind - loopback [ OK ]
TEST: ping local, device bind - ns-A IP [ OK ]
TEST: ping local, device bind - VRF IP [ OK ]
TEST: ping local, device bind - loopback [ OK ]
TEST: ping out, vrf bind, blocked by rule - ns-B loopback IP [ OK ]
TEST: ping out, device bind, blocked by rule - ns-B loopback IP [ OK ]
TEST: ping in, blocked by rule - ns-A loopback IP [ OK ]
TEST: ping out, vrf bind, unreachable route - ns-B loopback IP [ OK ]
TEST: ping out, device bind, unreachable route - ns-B loopback IP [ OK ]
TEST: ping in, unreachable route - ns-A loopback IP [ OK ]
SYSCTL: net.ipv4.ping_group_range=0 2147483647

SYSCTL: net.ipv4.raw_l3mdev_accept=1

TEST: ping out, VRF bind - ns-B IP [ OK ]
TEST: ping out, device bind - ns-B IP [ OK ]
TEST: ping out, vrf device + dev address bind - ns-B IP [ OK ]
TEST: ping out, vrf device + vrf address bind - ns-B IP [ OK ]
TEST: ping out, VRF bind - ns-B loopback IP [ OK ]
TEST: ping out, device bind - ns-B loopback IP [ OK ]
TEST: ping out, vrf device + dev address bind - ns-B loopback IP [ OK ]
TEST: ping out, vrf device + vrf address bind - ns-B loopback IP [ OK ]
TEST: ping in - ns-A IP [ OK ]
TEST: ping in - VRF IP [ OK ]
TEST: ping local, VRF bind - ns-A IP [ OK ]
TEST: ping local, VRF bind - VRF IP [ OK ]
TEST: ping local, VRF bind - loopback [ OK ]
TEST: ping local, device bind - ns-A IP [ OK ]
TEST: ping local, device bind - VRF IP [ OK ]
TEST: ping local, device bind - loopback [ OK ]
TEST: ping out, vrf bind, blocked by rule - ns-B loopback IP [ OK ]
TEST: ping out, device bind, blocked by rule - ns-B loopback IP [ OK ]
TEST: ping in, blocked by rule - ns-A loopback IP [ OK ]
TEST: ping out, vrf bind, unreachable route - ns-B loopback IP [ OK ]
TEST: ping out, device bind, unreachable route - ns-B loopback IP [ OK ]
TEST: ping in, unreachable route - ns-A loopback IP [ OK ]

###########################################################################

This also works on the Lenovo IdeaPad 3 Ubuntu 22.10 laptop, but on the AlmaLinux 8.8
Lenovo desktop I have a problem:

[root@pc-mtodorov net]# grep FAIL ../fcnal-test-4.log
TEST: ping local, VRF bind - ns-A IP [FAIL]
TEST: ping local, VRF bind - VRF IP [FAIL]
TEST: ping local, device bind - ns-A IP [FAIL]
TEST: ping local, VRF bind - ns-A IP [FAIL]
TEST: ping local, VRF bind - VRF IP [FAIL]
TEST: ping local, device bind - ns-A IP [FAIL]
[root@pc-mtodorov net]#
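
Each failing test name shows up twice in the grep output above. A minimal
sketch for collapsing the repeats and counting them, assuming the log keeps
the "TEST: ... [FAIL]" line format shown here (distinct_fails is a
hypothetical name, not an existing script):

```shell
# Hypothetical helper: list distinct failing test names with a repeat count.
# Assumes each failure is logged as a single line ending in "[FAIL]".
distinct_fails() {
	grep -F '[FAIL]' "$1" |
		sed 's/[ \t]*\[FAIL\][ \t]*$//' |
		sort | uniq -c
}

# Intended use, with the log file from the grep above:
#   distinct_fails ../fcnal-test-4.log
```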

The kernel is a recent one:

[root@pc-mtodorov net]# uname -rms
Linux 6.4.0-rc5-testnet-00003-g5b23878f7ed9 x86_64
[root@pc-mtodorov net]#

However, I have a question:

In the ping + "With VRF" section, the tests with net.ipv4.raw_l3mdev_accept=1
are run twice, while the "No VRF" section has both versions:

SYSCTL: net.ipv4.raw_l3mdev_accept=0

and

SYSCTL: net.ipv4.raw_l3mdev_accept=1

The same happens with the IPv6 ping tests.

In that case, it could be that we have only 2 actual FAIL cases,
because the error is reported twice.

Is this intentional?

I don't know why the non-VRF tests are run once with raw_l3mdev_accept=0
and once with raw_l3mdev_accept=1. Unless I'm missing something, this
option shouldn't affect non-VRF users. Maybe the objective is to make
sure that it really doesn't affect them. David certainly knows better.

The problem appears to be that non-VRF tests are being run with
raw_l3mdev_accept={0|1}, while VRF tests are run with raw_l3mdev_accept={1|1} ...
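
One way to double-check this from a saved log is to print the
raw_l3mdev_accept value of every run next to the section it belongs to.
This is only a sketch: sysctl_by_section is a hypothetical name, and it
assumes the section headers appear on their own lines as "No VRF" and
"With VRF", as in the output quoted earlier.

```shell
# Hypothetical helper: print "<section>: raw_l3mdev_accept=<value>" for each
# SYSCTL line in a fcnal-test.sh log, tracking the most recent section header.
sysctl_by_section() {
	awk '
		/^[ \t]*(No VRF|With VRF)[ \t]*$/ {
			gsub(/^[ \t]+|[ \t]+$/, "")   # trim the header line
			section = $0
		}
		/SYSCTL: net\.ipv4\.raw_l3mdev_accept=/ {
			sub(/^.*raw_l3mdev_accept=/, "")  # keep only the value
			print section ": raw_l3mdev_accept=" $0
		}
	' "$1"
}

# Intended use, with a saved log file:
#   sysctl_by_section ../fcnal-test-4.log
```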

I will try to fix that, but I am not sure of the semantics either.

Regards,
Mirsad