Re: BUG: workqueue lockup (2)

From: Tetsuo Handa
Date: Sun May 13 2018 - 10:30:01 EST


Eric Biggers wrote:
> Generally it's best to close syzbot bug reports once the original cause is
> fixed, so that syzbot can continue to report other bugs with the same signature.

That's difficult to judge. Closing as soon as the original cause is fixed allows
syzbot to report different reproducers for different bugs. But at the same time,
different/similar bugs which were reported in that report (or comments in the discussion
for that report) will become almost invisible to users (because users are unlikely to
check other reports in already fixed bugs).

An example is

general protection fault in kernfs_kill_sb (2)
https://syzkaller.appspot.com/bug?id=903af3e08fc7ec60e57d9c9b93b035f4fb038d9a

where the cause of the above report was already pointed out in the discussion for
the report below.

general protection fault in kernfs_kill_sb
https://syzkaller.appspot.com/bug?id=d7db6ecf34f099248e4ff404cd381a19a4075653

Since the latter is marked as "fixed on May 08 18:30", I worry that very few
users would check the relationship.

> Note also that a "workqueue lockup" can be caused by almost anything in the
> kernel, I think. This one for example is probably in the sound subsystem:
> https://syzkaller.appspot.com/text?tag=CrashReport&x=1767232b800000
>

Right. Maybe we should not stop the test upon a "workqueue lockup" message, because
it is likely that the cause of the lockup is somebody busy looping, which
would have been reported shortly afterwards as "rcu detected stall".

Of course, there is a possibility that "workqueue lockup" is reported because
cond_resched() was used where an explicit schedule_timeout_*() is required, which
was the reason commit 82607adcf9cdf40f ("workqueue: implement lockup detector")
was added.

If we stop the test upon a "workqueue lockup" message, maybe a longer timeout (e.g.
300 seconds) is better so that rcu stall or hung task messages are reported
if an rcu stall or hung task is occurring.
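
As a rough sketch of the longer timeout, assuming the timeout meant here is the
workqueue watchdog threshold (that is an assumption; it could equally be a
syzbot-side setting): kernels built with CONFIG_WQ_WATCHDOG take the threshold
in seconds as a boot parameter, with 30 as the default, so the kernel command
line would carry something like

    workqueue.watchdog_thresh=300

With that value, the "workqueue lockup" message would appear only after the
default rcu stall timeout (21 seconds) and the default hung task timeout
(120 seconds) have had a chance to fire.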