Re: [PATCH net v4] nfc: ... device_is_registered() is data race-able

From: Greg KH
Date: Thu Apr 28 2022 - 09:37:31 EST


On Thu, Apr 28, 2022 at 09:22:11PM +0800, Lin Ma wrote:
> Hello there,
>
> >
> > Yes, that looks better,
>
> Cool, thanks again for giving comments. :)
>
> > but what is the root problem here that you are
> > trying to solve? Why does NFC need this when no other subsystem does?
> >
>
> Well, in fact, me and Duoming are keep finding concurrency bugs that happen
> between the device cleanup/detach routine and other undergoing routines.
>
> That is to say, when a device, no matter real or virtual, is detached from
> the machine, the kernel awake cleanup routine to reclaim the resource.
> In current case, the cleanup routine will call nfc_unregister_device().
>
> Other routines, mainly from user-space system calls, need to be careful of
> the cleanup event. In another word, the kernel need to synchronize these
> routines to avoid race bugs.
>
> In our practice, we find that many subsystems are prone to this type of bug.
>
> For example, in bluetooth we fix
>
> BT subsystem
> * e2cb6b891ad2 ("bluetooth: eliminate the potential race condition when removing
> the HCI controller")
> * fa78d2d1d64f ("Bluetooth: fix data races in smp_unregister(), smp_del_chan()")
> ..
>
> AX25 subsystem
> * 1ade48d0c27d ("ax25: NPD bug when detaching AX25 device")
> ..
>
> we currently focus on the net relevant subsystems and we now is auditing the NFC
> code.
>
> In another word, all subsystems need to take care of the synchronization issues.
> But seems that the solutions are varied between different subsystem.
>
> Empirically speaking, most of them use specific flags + specific locks to prevent
> the race.
>
> In such cases, if the cleanup routine first hold the lock, the other routines will
> wait on the locks. Since the cleanup routine write the specific flag, the other
> routine, after check the specific flag, will be aware of the cleanup stuff and just
> abort their tasks.
> If the other routines first hold the lock, the cleanup routine just wait them to
> finish.
>
> NFC here is special because it uses device_is_registered. I thought the author may
> believe this macro is race free. However, it is not. So we need to replace this check
> to make sure the netlink functions will 100 percent be aware of the cleanup routine
> and abort the task if they grab the device_lock lately. Otherwise, the nelink routine
> will call sub-layer code and possilby dereference resources that already freed.
>
> For example, one of my recent fix 3e3b5dfcd16a ("NFC: reorder the logic in
> nfc_{un,}register_device") takes the suggestion from maintainer as he thought the
> device_is_registered is enough. And for now we find out this device_is_registered
> is not enough.

How do you physically remove a NFC device from a system? What types of
busses are they on? If non-removable one, then odds are there's not
going to be many races so this is a low-priority thing. If they can be
on removable busses (USB, PCI, etc.), then that's a bigger thing.

But again, the NFC bus code should handle all of this for the drivers,
there's nothing special about NFC that should warrant a special need for
this type of thing.

thanks,

greg k-h