Re: [BUG] Soft-lockup during cpu-hotplug in VFS callpaths

From: Srivatsa S. Bhat
Date: Tue Sep 06 2011 - 01:04:01 EST


On 09/06/2011 12:48 AM, Maciej Rutecki wrote:
> On Monday, 5 September 2011 at 11:07:55 Srivatsa S. Bhat wrote:
>> On 09/01/2011 12:10 AM, Maciej Rutecki wrote:
>>> On Wednesday, 24 August 2011 at 15:44:55 Srivatsa S. Bhat wrote:
>>>> Hi,
>>>>
>>>> While running stressful cpu hotplug tests along with kernel compilation
>>>> running in background, soft-lockups are detected on multiple CPUs.
>>>> Sometimes this also leads to hard lockups and kernel panic.
>>>> All the soft-lockups seem to occur at vfsmount_lock_local_lock_cpu() or other
>>>> VFS callpaths.
>>>>
>>>>
>>>> [37108.410813] BUG: soft lockup - CPU#5 stuck for 22s! [cc1:29669]
>>>> <snip>
>>>> [37108.694781] Call Trace:
>>>> [37108.697306] [<ffffffff81199e70>] ? vfsmount_lock_local_lock_cpu+0x70/0x70
>>>> [37108.704258] [<ffffffff81187cb5>] path_init+0x315/0x400
>>>> [37108.709558] [<ffffffff8127c398>] ? __raw_spin_lock_init+0x38/0x70
>>>> [37108.715812] [<ffffffff8118961c>] path_openat+0x8c/0x3f0
>>>> [37108.721203] [<ffffffff81012129>] ? sched_clock+0x9/0x10
>>>> [37108.726597] [<ffffffff8109416d>] ? sched_clock_cpu+0xcd/0x110
>>>> [37108.732508] [<ffffffff810a178d>] ? trace_hardirqs_off+0xd/0x10
>>>> [37108.738498] [<ffffffff8109421f>] ? local_clock+0x6f/0x80
>>>> [37108.743970] [<ffffffff81189a99>] do_filp_open+0x49/0xa0
>>>> [37108.749362] [<ffffffff811982f3>] ? alloc_fd+0xc3/0x210
>>>> [37108.754665] [<ffffffff8152584b>] ? _raw_spin_unlock+0x2b/0x40
>>>> [37108.760575] [<ffffffff811982f3>] ? alloc_fd+0xc3/0x210
>>>> [37108.765875] [<ffffffff81179607>] do_sys_open+0x107/0x1e0
>>>> [37108.771352] [<ffffffff810d610f>] ? audit_syscall_entry+0x1bf/0x1f0
>>>> [37108.777695] [<ffffffff81179720>] sys_open+0x20/0x30
>>>> [37108.782741] [<ffffffff8152e202>] system_call_fastpath+0x16/0x1b
>>>>
>>>> Kernel version: 3.0.1, 3.0.3
>>>> Hardware: Dual socket quad-core hyper-threaded Intel x86 machine
>>>> Scenario:
>>>> (a) Stressful cpu hotplug tests + kernel compilation
>>>>
>>>> (b) IRQ balancing had been disabled and all the IRQs were made to be
>>>>     routed to CPU 0 (except the ones that couldn't be routed).
>>>>
>>>> (c) Lockdep was enabled during kernel configuration.
>>>>
>>>> Steps (b) and (c) were done to dig deeper into the issue. However the
>>>> same issue was observed by just doing step (a).
>>>>
>>>> There definitely seems to be a race condition here, because the issue
>>>> is hit only some time after starting the tests, and the time it takes
>>>> to hit it increases as we increase the number of debug print
>>>> statements. In some cases (especially when the number of debug print
>>>> statements was quite high), the stress on the machine had to be
>>>> increased in order to hit the issue within a reasonable time. In my
>>>> tests, a maximum of about 2 to 2.5 hours was sufficient to hit this
>>>> bug.
>>>>
>>>> Please find the console log attached to this mail.
>>>>
>>>> Any ideas on how to go about fixing this bug?
>>>
>>> Is it a regression?
>>
>> Hi Maciej,
>>
>> Thank you for taking a look.
>> Yes, it seems to be a regression. I tested out kernel 2.6.39.3 with similar
>> test cases for quite a long time, and it did not hit any soft-lockup
>> issues.
>
> Thanks for the answer. I created a bug entry:
> https://bugzilla.kernel.org/show_bug.cgi?id=42402

Oh, thank you. But I had created an entry in bugzilla myself, immediately after
I posted on the mailing list (https://bugzilla.kernel.org/show_bug.cgi?id=42382).
However, I will delete my entry, since we don't want duplicates and, moreover, the
'Product' and 'Component' fields in your entry seem more appropriate for this bug.
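
For anyone who wants to try reproducing the scenario quoted above, here is a
rough sketch of the hotplug half of step (a). It is not the test harness used
for the original report. It assumes the standard sysfs interface
(/sys/devices/system/cpu/cpuN/online), and the helper names (online_path,
hotpluggable_cpus, set_online, stress) as well as the iteration count and delay
are made up for illustration. It has to run as root, with a kernel build
running separately in the background.

```python
import os
import random
import time

# Standard sysfs location for CPU devices; overridable so the helpers
# can be exercised against a scratch directory.
CPU_SYSFS = "/sys/devices/system/cpu"


def online_path(cpu, root=CPU_SYSFS):
    """Return the sysfs 'online' control file for the given CPU number."""
    return os.path.join(root, "cpu%d" % cpu, "online")


def hotpluggable_cpus(root=CPU_SYSFS):
    """List CPU numbers that expose an 'online' file.

    The boot CPU (usually CPU 0) often has no such file and is skipped.
    """
    cpus = []
    for name in os.listdir(root):
        if name.startswith("cpu") and name[3:].isdigit():
            if os.path.exists(os.path.join(root, name, "online")):
                cpus.append(int(name[3:]))
    return sorted(cpus)


def set_online(cpu, online, root=CPU_SYSFS):
    """Write 1 (online) or 0 (offline) to the CPU's control file."""
    with open(online_path(cpu, root), "w") as f:
        f.write("1" if online else "0")


def stress(iterations, delay=0.1, root=CPU_SYSFS, rng=random):
    """Repeatedly take a random CPU offline and bring it back online."""
    cpus = hotpluggable_cpus(root)
    if not cpus:
        return cpus
    for _ in range(iterations):
        cpu = rng.choice(cpus)
        set_online(cpu, False, root)
        time.sleep(delay)
        set_online(cpu, True, root)
        time.sleep(delay)
    return cpus
```

Calling stress(1000) from a root shell while "make -jN" runs in another
terminal gives a load of the same general shape; the values that actually
trigger the lockup will depend on the machine.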

Thanks again.

--
Regards,
Srivatsa S. Bhat <srivatsa.bhat@xxxxxxxxxxxxxxxxxx>
Linux Technology Center,
IBM India Systems and Technology Lab