Re: [patch 00/14] x86/irq: Plug various vector cleanup races

From: Joe Lawrence
Date: Tue Jan 19 2016 - 23:13:21 EST


On 01/18/2016 10:00 AM, Joe Lawrence wrote:
[... snip ... ]
> Hi Thomas,
>
> When logging in this morning and looking at the box running the 14
> patches + additional patch, I see it hit a hung task timeout in xhci USB
> code about 39 hours in. Stack trace below (looks to be waiting on a
> completion that never comes).
>
> I didn't see this when running only the *initial* 14 patches. Of
> course, before these irq cleanup fixes my tests never ran this long :)
> So it may or may not be related to the patchset, I'm still poking around
> the generated vmcore. Let me know if there is anything you might be
> interested in looking at from the wreckage.
>
> -- Joe
>
>
>
> INFO: task kworker/0:1:1506 blocked for more than 120 seconds.
> Tainted: P OE 4.3.0sra12+ #50
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/0:1 D 0000000000000000 0 1506 2 0x00000080
> Workqueue: usb_hub_wq hub_event
> ffff8801e46dba58 0000000000000046 ffff8810375dac00 ffff881038430000
> ffff8801e46dc000 ffff88025ac20440 ffff88025ac20438 ffff881038430000
> 0000000000000000 ffff8801e46dba70 ffffffff81659893 7fffffffffffffff
> Call Trace:
> [<ffffffff81659893>] schedule+0x33/0x80
> [<ffffffff8165c530>] schedule_timeout+0x200/0x2a0
> [<ffffffff810e2761>] ? internal_add_timer+0x71/0xb0
> [<ffffffff810e4994>] ? mod_timer+0x114/0x210
> [<ffffffff8165a371>] wait_for_completion+0xf1/0x130
> [<ffffffff810a70d0>] ? wake_up_q+0x70/0x70
> [<ffffffff814b14a1>] xhci_discover_or_reset_device+0x1e1/0x540
> [<ffffffff814723b8>] hub_port_reset+0x3c8/0x590
> [<ffffffff81472aa5>] hub_port_init+0x525/0xb00
> [<ffffffff81476068>] hub_port_connect+0x328/0x940
> [<ffffffff81476cbc>] hub_event+0x63c/0xb00
> [<ffffffff810947dc>] process_one_work+0x14c/0x3c0
> [<ffffffff81095044>] worker_thread+0x114/0x470
> [<ffffffff8165925f>] ? __schedule+0x2af/0x8b0
> [<ffffffff81094f30>] ? rescuer_thread+0x310/0x310
> [<ffffffff8109ab88>] kthread+0xd8/0xf0
> [<ffffffff8109aab0>] ? kthread_park+0x60/0x60
> [<ffffffff8165d75f>] ret_from_fork+0x3f/0x70
> [<ffffffff8109aab0>] ? kthread_park+0x60/0x60

Hi Thomas / Boris,

In an effort to exonerate the patchset, I instrumented xHCI to monitor
complementary wait_for_completion / complete calls in that driver,
hoping that an early exit in its probe might be simply skipping the
complete call ... but I ended up collecting two new crashes in
get_next_timer_interrupt:

(Again with proprietary and out-of-tree drivers loaded.)

general protection fault: 0000 [#1] SMP
Modules linked in: xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun
matroxfb(OE) ccmod(POE) ftmod(OE) videosw(OE) [ ... snip ... ]
CPU: 10 PID: 0 Comm: swapper/10 Tainted: P OE 4.3.0sra13+ #53
Hardware name: Stratus ftServer 6800/G7LYY, BIOS BIOS Version 8.1:61 09/10/2015
task: ffff881038e35800 ti: ffff881038e3c000 task.ti: ffff881038e3c000
RIP: 0010:[<ffffffff810e4c55>] [<ffffffff810e4c55>] get_next_timer_interrupt+0x1a5/0x240
RSP: 0018:ffff881038e3fde0 EFLAGS: 00010002
RAX: ffff88103fa8e8b8 RBX: 000013629b0c5740 RCX: 000000014140a6d6
RDX: 6b6b6b6b6b6b6b6b RSI: 0000000000000001 RDI: 00000000010140a7
RBP: ffff881038e3fe30 R08: 6b6b6b6b6b6b6b6b R09: 0000000000000027
R10: 0000000000000027 R11: ffff881038e3fde8 R12: 000000010140a6d6
R13: ffff88103fa8e080 R14: ffff881038e3fe00 R15: 0000000000000040
FS: 0000000000000000(0000) GS:ffff88103fa80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fcbf695a000 CR3: 0000002030582000 CR4: 00000000001406e0
Stack:
ffff88103fa8e8b8 ffff88103fa8eab8 ffff88103fa8ecb8 ffff88103fa8eeb8
c5067d2c236293b0 ffff88103fa8f500 ffff88103fa8d600 000013629b1632f3
000000010140a6d6 000013629b0c5740 ffff881038e3fe88 ffffffff810f421f
Call Trace:
[<ffffffff810f421f>] tick_nohz_stop_sched_tick+0x1bf/0x2b0
[<ffffffff810f43af>] __tick_nohz_idle_enter+0x9f/0x150
[<ffffffff810f47dc>] tick_nohz_idle_enter+0x3c/0x70
[<ffffffff810bf00c>] cpu_startup_entry+0x9c/0x330
[<ffffffff8104e700>] start_secondary+0x160/0x1a0
Code: 95 38 0e 00 00 48 89 55 c8 48 8d 55 b0 4c 8d 5a 08 4c 8d 72 20 41 89 fa 41 83 e2 3f 45 89 d1 49 63 d1 48 8b 14 d0 48 85 d2 74 1e <f6> 42 2a 10 75 10 4c 8b 42 10 be 01 00 00 00 49 39 c8 49 0f 48
RIP [<ffffffff810e4c55>] get_next_timer_interrupt+0x1a5/0x240
RSP <ffff881038e3fde0>

crash> dis -l get_next_timer_interrupt+0x1a5
/root/linus/kernel/time/timer.c: 1289
<get_next_timer_interrupt+0x1a5>: testb $0x10,0x2a(%rdx)

RDX: 6b6b6b6b6b6b6b6b

1246 static unsigned long __next_timer_interrupt(struct tvec_base *base)
...
1251 struct timer_list *nte;
...
1277 /* Check tv2-tv5. */
1278 varray[0] = &base->tv2;
1279 varray[1] = &base->tv3;
1280 varray[2] = &base->tv4;
1281 varray[3] = &base->tv5;
...
1283 for (array = 0; array < 4; array++) {
1284 struct tvec *varp = varray[array];
1285
1286 index = slot = timer_jiffies & TVN_MASK;
1287 do {
1288 hlist_for_each_entry(nte, varp->vec + slot, entry) {
1289 if (nte->flags & TIMER_DEFERRABLE)

So the nte pointer contains the slub_debug poison pattern.

>From the disassembly of __next_timer_interrupt, it looks like r13 is
used to store "base".

R13: ffff88103fa8e080

crash> struct tvec_base ffff88103fa8e080
struct tvec_base {
lock = {
{
rlock = {
raw_lock = {
val = {
counter = 0x1
}
}
}
}
},
running_timer = 0x0,
timer_jiffies = 0x10140a6d7,
next_timer = 0x10140a6d6,
active_timers = 0x7,
all_timers = 0x8,
cpu = 0xa,
migration_enabled = 0x1,
nohz_active = 0x1,
...
tv1 = {
vec = {{
...
first = 0xffff88203800dff8
...
tv2 = {
vec = {{
...
first = 0xffff88100aa9a550 << index 39 points to slub poison
...
first = 0xffff88103fa90ea0
...
tv3 = {
vec = {{
...
first = 0xffff8807452b3928
...
first = 0xffff88103fa8d380
...
tv4 = {
vec = {{
...
tv5 = {
vec = {{
...
first = 0xffff88100c8955e8

crash> struct hlist_node 0xffff88203800dff8
struct hlist_node {
next = 0x0,
pprev = 0xffff88103fa8e790
}
crash> struct hlist_node 0xffff88100aa9a550
struct hlist_node {
next = 0x6b6b6b6b6b6b6b6b, << uhoh!
pprev = 0x6b6b6b6b6b6b6b6b <<
}
crash> struct hlist_node 0xffff88103fa90ea0
struct hlist_node {
next = 0x0,
pprev = 0xffff88103fa8e9f8
}
crash> struct hlist_node 0xffff8807452b3928
struct hlist_node {
next = 0xffff88100aa9da68,
pprev = 0xffff88103fa8eae0
}
crash> struct hlist_node 0xffff88103fa8d380
struct hlist_node {
next = 0x0,
pprev = 0xffff88103fa8eb58
}
crash> struct hlist_node 0xffff88100c8955e8
struct hlist_node {
next = 0xffff881021e47598,
pprev = 0xffff88103fa8eec0
}

crash utility confirms it in its "timer" display:

crash> timer
TVEC_BASES[9]: ffff88103fa4e080
JIFFIES
4315982678
EXPIRES TIMER_LIST FUNCTION
4315982681 ffff88203800ddb0 ffffffff8150e7a0 <intel_pstate_timer_func>
4315982973 ffff88103fa50ea0 ffffffff81092d90 <delayed_work_timer_fn>
4316267973 ffff88103fa4d380 ffffffff81041930 <mce_timer_fn>

timer: invalid list entry: 6b6b6b6b6b6b6b6b
timer: ignoring faulty timer list at index 39 of timer array

timer: invalid list entry: 6b6b6b6b6b6b6b6b
timer: ignoring faulty timer list at index 39 of timer array
TVEC_BASES[10]: ffff88103fa8e080
JIFFIES
4315982678
EXPIRES TIMER_LIST FUNCTION
4315981531 ffff88203800dff8 ffffffff8150e7a0 <intel_pstate_timer_func>
4316034039 ffff88100aa9da68 ffffffff81092d90 <delayed_work_timer_fn>
4316034111 ffff8807452b3928 ffffffff81092d90 <delayed_work_timer_fn>
4316267970 ffff88103fa8d380 ffffffff81041930 <mce_timer_fn>
4401397760 ffff88100c8955e8 ffffffff8160bbe0 <ipv6_regen_rndid>
4401397760 ffff881021e47598 ffffffff8160bbe0 <ipv6_regen_rndid>
7740398493674204011 ffff88100aa9a550 6b6b6b6b6b6b6b6b

crash> struct timer_list ffff88100aa9a550
struct timer_list {
entry = {
next = 0x6b6b6b6b6b6b6b6b,
pprev = 0x6b6b6b6b6b6b6b6b
},
expires = 0x6b6b6b6b6b6b6b6b,
function = 0x6b6b6b6b6b6b6b6b,
data = 0x6b6b6b6b6b6b6b6b,
flags = 0x6b6b6b6b,
slack = 0x6b6b6b6b,
start_pid = 0x6b6b6b6b,
start_site = 0x6b6b6b6b6b6b6b6b,
start_comm = "kkkkkkkkkkkkkkkk"
}

A second crash of the same signature occurred a few hours later.


Unfortunately I only have a single box to run these tests in what
amounts to an after-hours effort. I started testing back in 4.3 but
avoided moving forward to avoid the 4.4 development cycle (and
incidental issues that it might have muddled the waters). That said,
what would be the best way to proceed?

Change device removal tests to avoid proprietary drivers.

What about the other out-of-tree device drivers (mpt3sas, ixgbe,
etc.)? These are open source, but contain much Stratus
device removal paranoia that upstream hasn't adopted.

Rebase evil(TM) proprietary/out-of-tree drivers against 4.4 or
4.5rcX, apply this patchset and any other required device removal
fixups.

If proprietary/out-of-tree drivers are a debugging deal breaker, I
understand. The platform offers a unique hotplug testbed, so I try to
contribute testing and bug reports where I feel they apply equally
to untainted upstream.

Regards,

-- Joe