RFC: rt + uncontrollable kthreads/workqueues = generic evil

From: Mike Galbraith
Date: Mon Dec 23 2013 - 05:24:12 EST



1. rt tasks can kill the whole box or jam up random applications via
kthreadd and/or kworker starvation, even when the user is being careful.

2. uncontrollable kthreads create unfixable rt priority inversions in
the workqueue case, and even if workqueues could be prioritized, dynamic
worker pools can insert huge memory allocation latencies into any rt
task that depends upon a workqueue.

A couple samples:

CPUs 2 and 3 are "completely" isolated via cpusets, and CPU3 is running
a "super critical" rt hog (while(1);) at FIFO:1. Joe User fires up
firefox on a system cpuset CPU; firefox hangs, and lots of other things
do too.
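For concreteness, the setup amounts to something like this (a sketch,
not my actual script: it assumes a v1 cpuset hierarchy mounted at
/sys/fs/cgroup/cpuset, needs root, and glosses over evacuating CPUs 2-3
from the system set):

```shell
# Carve out CPUs 2-3 and park a "super critical" FIFO:1 hog on CPU3.
mkdir /sys/fs/cgroup/cpuset/rt
echo 2-3 > /sys/fs/cgroup/cpuset/rt/cpuset.cpus
echo 0   > /sys/fs/cgroup/cpuset/rt/cpuset.mems
# The while(1); hog, pinned to CPU3 at FIFO:1:
chrt -f 1 taskset -c 3 sh -c 'while :; do :; done' &
echo $! > /sys/fs/cgroup/cpuset/rt/tasks
```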


marge:~ # cat /proc/5840/stack
[<ffffffff81101d0e>] sleep_on_page+0xe/0x20
[<ffffffff81101f00>] wait_on_page_bit+0x80/0x90
[<ffffffff81102004>] filemap_fdatawait_range+0xf4/0x180
[<ffffffff811035ad>] filemap_write_and_wait_range+0x4d/0x80
[<ffffffff811cab8a>] ext4_sync_file+0xca/0x290
[<ffffffff81186e38>] do_fsync+0x58/0x80
[<ffffffff81187230>] SyS_fsync+0x10/0x20
[<ffffffff81559ed2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

rt_rq[3]:
.rt_nr_running : 1
.rt_throttled : 0
.rt_time : 0.000000
.rt_runtime : 0.000001

runnable tasks:
            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
        kthreadd     2     32390.390741        89   120     32390.390741         1.103162    417146.885135
    kworker/u8:0     6     32390.390741        50   120     32390.390741         0.679007    303904.697089
     kworker/3:1    37     32391.042046      4971   120     32391.042046        77.975683    197424.475026
    kworker/3:1H   269     32390.390741      2542   100     32390.390741        15.520425    193210.559919
R         cpuhog  5625         0.000000        13    98         0.000000    382385.886326        89.825704

Well now, kthreadd waking to an isolated and 100% rt consumed CPU
doesn't bode well for the future of this box, that's a killer.
kworker/3:1H is what was blocking firefox and more, though; bumping it
to FIFO:10 freed firefox and friends.
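For reference, the bump was nothing more exotic than chrt on the
kworker's pid (269 from the table above; needs root):

```shell
chrt -f -p 10 269    # kworker/3:1H -> SCHED_FIFO, priority 10
chrt -p 269          # verify: prints the thread's new policy/priority
```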

Try again with kthread prioritized.. evolution hangs at startup.

rt_rq[3]:
.rt_nr_running : 1
.rt_throttled : 0
.rt_time : 0.000000
.rt_runtime : 0.000001

runnable tasks:
            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
     kworker/3:1    37     32392.189438      5092   120     32392.189438        79.811331    318171.326151
R         cpuhog 15101         0.000000         4    98         0.000000     48118.123331         0.043160

marge:~ # pidof evolution
15103
marge:~ # cat /proc/15103/stack
[<ffffffff81064359>] flush_work+0x29/0x40
[<ffffffff81110113>] lru_add_drain_all+0x163/0x1a0
[<ffffffff8112df48>] SyS_mlock+0x38/0x130
[<ffffffff81559ed2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

cpuhog 15101 [003] 5027.777502: irq:softirq_entry: vec=1 [action=TIMER]
cpuhog 15101 [003] 5027.777504: workqueue:workqueue_queue_work: work struct=0xffff88022fd8f060 function=vmstat_update workqueue=0xffff880226c5aa00 req_cpu=64 cpu=3
cpuhog 15101 [003] 5027.777505: workqueue:workqueue_activate_work: work struct 0xffff88022fd8f060
cpuhog 15101 [003] 5027.777507: sched:sched_wakeup: comm=kworker/3:1 pid=37 prio=120 success=1 target_cpu=003
cpuhog 15101 [003] 5027.777508: irq:softirq_exit: vec=1 [action=TIMER]
cpuhog 15101 [003] 5027.777508: irq:softirq_entry: vec=9 [action=RCU]
cpuhog 15101 [003] 5027.777509: irq:softirq_exit: vec=9 [action=RCU]
cpuhog 15101 [003] 5027.781500: irq:softirq_raise: vec=1 [action=TIMER]

flush_work is gonna take a while. Bump pid 37 (kworker/3:1) to FIFO:10,
and evolution can finally run.

I created an ugly hack in enterprise to let the user prioritize kthreads
and/or workqueues. That works as far as empowering the user goes: he can
do whatever he wants to do without the box just falling over, and
without the stuff he thinks is super critical starving its own
dependencies (or innocent bystanders, as above), and ergo itself, no
matter how "clever" that "critical stuff" may seem to me.
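For the per-cpu kworkers, at least, something similar can be
approximated from userspace. A sketch (the function name is mine, it
only prints the chrt invocations rather than running them, and the
dynamic pool means workers can appear after the scan, so it's inherently
racy):

```shell
# emit_kworker_chrt PRIO CPU - print, don't run, a chrt command for
# every kworker currently bound to the given CPU, found by scanning
# /proc comm names.
emit_kworker_chrt() {
    prio=$1
    cpu=$2
    for comm in /proc/[0-9]*/comm; do
        name=$(cat "$comm" 2>/dev/null) || continue
        case "$name" in
            "kworker/$cpu:"*)
                pid=${comm#/proc/}
                echo "chrt -f -p $prio ${pid%/comm}"
                ;;
        esac
    done
}

emit_kworker_chrt 10 3    # pipe through sh (as root) to actually apply
```

It's a band-aid at best: any worker the dynamic pool spawns after the
scan is back at SCHED_OTHER.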

Most of the time, when I see these kinds of issues, it's stuff that I'd
call rt abuse, but I've also recently seen some image processing code
that looked much more legit, and which used to be able to get away with
using a workqueue, fall flat. I had to tell the user that the workqueue
should be removed from their driver, as the things are not the least bit
rt friendly. The dynamic worker pool constituted a regression for that
user.

-Mike
