Re: Converting dev->mutex into dev->spinlock ?

From: Tetsuo Handa
Date: Sat Feb 04 2023 - 20:32:25 EST


On 2023/02/05 5:01, Alan Stern wrote:
> On Sun, Feb 05, 2023 at 02:09:40AM +0900, Tetsuo Handa wrote:
>> On 2023/02/05 1:27, Alan Stern wrote:
>>> On Sun, Feb 05, 2023 at 01:12:12AM +0900, Tetsuo Handa wrote:
>>>> On 2023/02/05 0:34, Alan Stern wrote:
>>>> Lockdep validation on dev->mutex being disabled is really annoying, and
>>>> I want to make lockdep validation on dev->mutex enabled; that is the
>>>> "drivers/core: Remove lockdep_set_novalidate_class() usage" patch.
>>>
>>>> Even if it is always safe to acquire a child device's lock while holding
>>>> the parent's lock, disabling lockdep checks completely on device's lock is
>>>> not safe.
>>>
>>> I understand the problem you want to solve, and I understand that it
>>> can be frustrating. However, I do not believe you will be able to
>>> solve this problem.
>>
>> That is a declaration that driver developers are allowed to take it for granted
>> that driver callback functions can behave as if dev->mutex is not held.
>
> No it isn't. It is a declaration that driver developers must be extra
> careful because lockdep is unable to detect locking errors involving
> dev->mutex.

Driver developers are not always familiar with locks used by driver core,
like your

It's hard to figure out what's wrong from looking at the syzbot report.
What makes you think it is connected with dev->mutex?

At first glance, it seems that the ath6kl driver is trying to flush a
workqueue while holding a lock or mutex that is needed by one of the
jobs in the workqueue. That's obviously never going to work, no matter
what sort of lockdep validation gets used.

comment indicates that you did not notice that dev->mutex was connected to
this problem which involved ath6kl driver code and ath9k driver code and
driver core code.

Core developers can't assume that driver developers are extra careful, as
well as driver developers can't assume that core developers are familiar
with locks used by individual drivers. We need to fill the gap.

>
>> Some developers test their changes with lockdep enabled, and believe that their
>> changes are correct because lockdep did not complain.
>> https://syzkaller.appspot.com/bug?extid=9ef743bba3a17c756174 is an example.
>
> How do you know developers are making this mistake? That example
> doesn't show anything of the sort; the commit which introduced the bug
> says nothing about lockdep.

The commit which introduced the bug cannot say something about lockdep, for
lockdep validation is disabled and nobody noticed the possibility of deadlock
until syzbot reports it as hung. Since the possibility of deadlock cannot be
noticed until syzbot reports it as hung, I assume that there are many similar
deadlocks in the kernel that involves dev->mutex. How do you teach developers
that they are making this mistake, without keeping lockdep validation enabled?

By keeping lockdep validation disabled, you are declaring that driver developers
need not to worry about dev->mutex rather than declaring that driver developers
need to worry about dev->mutex.