Re: [RFC PATCH net] sfc: Fix use-after-free due to selftest_work

From: Martin Habets
Date: Fri Apr 14 2023 - 05:45:11 EST


On Thu, Apr 13, 2023 at 04:35:08PM +0800, Ding Hui wrote:
> On 2023/4/13 15:37, Martin Habets wrote:
> > On Wed, Apr 12, 2023 at 08:50:13AM +0800, Ding Hui wrote:
> > > There is a use-after-free scenario that is:
> > >
> > > When netif_running() is false, user set mac address or vlan tag to VF,
> > > the xxx_set_vf_mac() or xxx_set_vf_vlan() will invoke efx_net_stop()
> > > and efx_net_open(), since netif_running() is false, the port will not
> > > start and keep port_enabled false, but selftest_worker is scheduled
> > > in efx_net_open().
> > >
> > > If we remove the device before selftest_worker run, the efx is freed,
> > > then we will get a UAF in run_timer_softirq() like this:
> > >
> > > [ 1178.907941] ==================================================================
> > > [ 1178.907948] BUG: KASAN: use-after-free in run_timer_softirq+0xdea/0xe90
> > > [ 1178.907950] Write of size 8 at addr ff11001f449cdc80 by task swapper/47/0
> > > [ 1178.907950]
> > > [ 1178.907953] CPU: 47 PID: 0 Comm: swapper/47 Kdump: loaded Tainted: G O --------- -t - 4.18.0 #1
> > > [ 1178.907954] Hardware name: SANGFOR X620G40/WI2HG-208T1061A, BIOS SPYH051032-U01 04/01/2022
> > > [ 1178.907955] Call Trace:
> > > [ 1178.907956] <IRQ>
> > > [ 1178.907960] dump_stack+0x71/0xab
> > > [ 1178.907963] print_address_description+0x6b/0x290
> > > [ 1178.907965] ? run_timer_softirq+0xdea/0xe90
> > > [ 1178.907967] kasan_report+0x14a/0x2b0
> > > [ 1178.907968] run_timer_softirq+0xdea/0xe90
> > > [ 1178.907971] ? init_timer_key+0x170/0x170
> > > [ 1178.907973] ? hrtimer_cancel+0x20/0x20
> > > [ 1178.907976] ? sched_clock+0x5/0x10
> > > [ 1178.907978] ? sched_clock_cpu+0x18/0x170
> > > [ 1178.907981] __do_softirq+0x1c8/0x5fa
> > > [ 1178.907985] irq_exit+0x213/0x240
> > > [ 1178.907987] smp_apic_timer_interrupt+0xd0/0x330
> > > [ 1178.907989] apic_timer_interrupt+0xf/0x20
> > > [ 1178.907990] </IRQ>
> > > [ 1178.907991] RIP: 0010:mwait_idle+0xae/0x370
> > >
> > > I am thinking about several ways to fix the issue:
> > >
> > > [1] In this RFC, I cancel the selftest_worker unconditionally in
> > > efx_pci_remove().
> > >
> > > [2] Add a test condition, only invoke efx_selftest_async_start() when
> > > efx->port_enabled is true in efx_net_open().
> > >
> > > [3] Move invoking efx_selftest_async_start() from efx_net_open() to
> > > efx_start_all() or efx_start_port(), that matching cancel action in
> > > efx_stop_port().
> >
> > I think moving this to efx_start_port() is best, as you say to match
> > the cancel in efx_stop_port().
> >
>
> If we move it to efx_start_port(), should we worry about whether
> IRQ_TIMEOUT is still long enough?

The timeout is 1 second, which is a long time for a machine running code,
so it does not worry me.

> I'm not sure whether there is a long delay between scheduling
> selftest_work and the end of efx_net_open().

I see your point. In efx_start_all() the call to efx_start_datapath()
comes after the call to efx_start_port(), and it takes a relatively long
time (well under 200ms though).
Logically it would be better to call efx_selftest_async_start() after
efx_start_datapath(). What do you think?

The point here is that efx_start_all() calls efx_start_port() early, and
efx_stop_all() also calls efx_stop_port() early. Both calling sequences
are correct, but they are not the strict inverse of each other.
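
To make the ordering concrete, here is a rough user-space sketch of what
I have in mind. It is a toy model, not driver code: struct model_efx and
the model_*() functions are hypothetical stand-ins that only mimic the
call order discussed above, and the early-return checks are my assumption
of how the real functions behave.

```c
#include <stdbool.h>

/* Toy user-space model, NOT driver code. All names are hypothetical. */
struct model_efx {
	bool port_enabled;     /* set on start, cleared on stop */
	bool datapath_started;
	bool selftest_pending; /* stands in for the delayed selftest_work */
};

static void model_selftest_async_start(struct model_efx *efx)
{
	efx->selftest_pending = true;
}

static void model_selftest_async_cancel(struct model_efx *efx)
{
	efx->selftest_pending = false;
}

/* Models efx_start_all(): port first, then datapath, then selftest. */
static void model_start_all(struct model_efx *efx, bool netif_running)
{
	/* Assumed early return: if the interface is not running,
	 * nothing is started, so nothing is scheduled either. */
	if (efx->port_enabled || !netif_running)
		return;

	efx->port_enabled = true;        /* efx_start_port() step */
	efx->datapath_started = true;    /* efx_start_datapath() step (slow) */
	model_selftest_async_start(efx); /* proposed: only after datapath */
}

/* Models efx_stop_all(): the port is stopped early, which cancels the
 * selftest work before anything else is torn down. */
static void model_stop_all(struct model_efx *efx)
{
	if (!efx->port_enabled)
		return;

	model_selftest_async_cancel(efx); /* efx_stop_port() step */
	efx->port_enabled = false;
	efx->datapath_started = false;
}
```

In this model, calling model_start_all() with netif_running false leaves
selftest_pending clear, so a later remove has no work left to cancel.
When the work is scheduled, model_stop_all() always cancels it.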

Martin

>
> --
> Thanks,
> - Ding Hui