SDHCI long sleep with interrupts off

From: David Jander
Date: Thu Dec 17 2015 - 05:47:14 EST



Hi all,

I was investigating the source of abnormal irq-latency spikes on an i.MX6
(ARM) board, and discovered this:

# tracer: preemptirqsoff
#
# preemptirqsoff latency trace v1.1.5 on 4.4.0-rc4+
# --------------------------------------------------------------------
# latency: 2068 us, #4/4, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:1)
# -----------------
# | task: mmcqd/0-92 (uid:0 nice:0 policy:0 rt_prio:0)
# -----------------
# => started at: _raw_spin_lock_irqsave
# => ended at: _raw_spin_unlock_irqrestore
#
#
# _------=> CPU#
# / _-----=> irqs-off
# | / _----=> need-resched
# || / _---=> hardirq/softirq
# ||| / _--=> preempt-depth
# |||| / delay
# cmd pid ||||| time | caller
# \ / ||||| \ | /
mmcqd/0-92 0d... 1us#: _raw_spin_lock_irqsave
mmcqd/0-92 0.n.1 2066us : _raw_spin_unlock_irqrestore
mmcqd/0-92 0.n.1 2070us+: trace_preempt_on <-_raw_spin_unlock_irqrestore
mmcqd/0-92 0.n.1 2107us : <stack trace>
=> sdhci_runtime_resume_host
=> __rpm_callback
=> rpm_callback
=> rpm_resume
=> __pm_runtime_resume
=> __mmc_claim_host
=> mmc_blk_issue_rq
=> mmc_queue_thread
=> kthread
=> ret_from_fork

2 ms with interrupts disabled!!! To much dismay, I later discovered that this
isn't even the worst case scenario. I also discovered that this has been in
the kernel for a long time without a fix (I have tested from 3.17 to 4.4-rc4).
There has been an attempt by someone to address this back in 2010, but
apparently it never hit mainline.
Going through the code in sdhci.c, I found this troublesome code-path:

sdhci_do_set_ios() {
spin_lock_irqsave(&host->lock, flags);
...
sdhci_reinit() --> sdhci_init() --> sdhci_do_reset() -->
host->ops->reset() --> sdhci_reset()
...
spin_unlock_irqrestore(&host->lock, flags);
}

And in sdhci_reset(), which may be called with held spinlock:

...
/* Wait max 100 ms */
timeout = 100;

/* hw clears the bit when it's done */
while (sdhci_readb(host, SDHCI_SOFTWARE_RESET) & mask) {
if (timeout == 0) {
pr_err("%s: Reset 0x%x never completed.\n",
mmc_hostname(host->mmc), (int)mask);
sdhci_dumpregs(host);
return;
}
timeout--;
mdelay(1);
}

I am wondering: There either must be a reason this hasn't been fixed in such a
long time, or I am not understanding this correctly, so please enlighten me.
Before trying a cowboy attempt at "fixing" this, I'd really like to know why am
I seeing this?
I mean... how can such a problem get unnoticed and unfixed for so long?
Will an attempt at fixing this issue even be accepted?

Best regards,

--
David Jander
Protonic Holland.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/