Re: [RFC][PATCHv6 00/12] printk: introduce printing kernel thread

From: Sergey Senozhatsky
Date: Wed Dec 20 2017 - 02:10:41 EST


Hello,

not sure if you've been following the whole thread, so I'll try to
summarize it here. apologies if it massively repeats things that have
already been said, or if it ends up being too long.

On (12/19/17 15:31), Michal Hocko wrote:
> On Tue 19-12-17 10:24:55, Sergey Senozhatsky wrote:
> > On (12/18/17 20:08), Steven Rostedt wrote:
> > > > ... do you guys read my emails? which part of the traces I have provided
> > > > suggests that there is any improvement?
> > >
> > > The traces I've seen from you were from non-realistic scenarios.
> > > But I have hit issues with printk()s happening that cause one CPU to do all
> > > the work, where my patch would fix that. Those are the scenarios I'm
> > > talking about.
> >
> > any hints about what makes your scenario more realistic than mine?
>
> Well, Tetsuo had some semi-realistic scenario where alloc stall messages
> managed to stall other printk callers (namely oom handler). I am saying
> semi-realistic because he is testing OOM throughput with an unrealistic
> workload which itself is not very real on production systems. The
> essential thing here is that many processes might be in the allocation
> path and any printk there could just swamp some unrelated printk caller
> and cause hard to debug problems. Steven's patch should address that
> with a relatively simple lock handover. I was pessimistic this would
> work sufficiently well but it survived Tetsuo's testing IIRC so it
> sounds good enough to me.

sure, no denial. Tetsuo indeed said that Steven's patch passed his
test. and for the record, the most recent printk_kthread patch set passed
Tetsuo's test as well; I asked him privately to run the tests before I
published it. but this is not the point. and to make it clear - this is
not a "Steven's patch" vs. "my patch set + Steven's patch atop of it"
type of thing. not at all.


IMPORTANT DISCLAIMER
I SPEAK FOR MYSELF ONLY. ABOUT MY OBSERVATION ONLY. THIS IS AN
ATTEMPT TO ANALYSE WHY THE PATCH DIDN'T WORK ON MY SETUP AND WHY
MY SETUP NEEDS ANOTHER APPROACH. THIS IS NOT TO WARN ANYONE THAT
THE PATCH WON'T WORK ON THEIR SETUPS. I MEAN NO OFFENSE AND AM
NOT TRYING TO LOOK/SOUND SMART. AND, LIKE I SAID, IF STEVEN OR
PETR WANT TO PUSH THE PATCH, I'M NOT GOING TO BLOCK IT.

so why has Steven's patch not succeeded on my boards?... partly because
printk is "multidimensional" in its complexity and Steven's patch just
doesn't touch some of the problematic parts; partly because the patch
has requirements which can't be 100% true on my boards.

to begin with,
the patch works only when console_sem is contended. IOW, when multiple
sites perform printk-s concurrently, frequently enough for the hand off
logic to detect it and pass control over to another CPU. but this turned
out to be rather hard to count on. why? several problems.

(1) the really big one:
the console_sem owner can be preempted while holding console_sem. that
removes any console_sem competition - the lock is already taken and its
owner is off the CPU - and with it any possibility of a hand off. it
also unleashes the CPUs that do printk-s, because while console_sem is
locked printk-s from other CPUs become, basically, as fast as
sprintf+memcpy.

(1.1) the really big one, part two. more on this later. see @PART2

(2) another big one:
the printk() fast path - sprintf+memcpy - can be significantly faster
than a call to the console drivers. on my board I see that the printk
CPU can add 1140 new messages to the logbuf while an active console_sem
owner prints a single message to the serial console. and that's not even
counting console_sem owner preemption.

(1) and (2) combined can do magical things. in my extremely trivial
test - not an insanely huge number of printk-s (much less than 2000
lines) from a preemptible context, not that many printk-s from other
CPUs - I can see, and I confirmed it with the traces, that once
console_sem is handed over to another CPU and that new console_sem
owner either gets preempted or begins to print messages to the serial
console, the CPU that actually does most of the printk-s finishes its
job in almost no time, because all it has to do is sprintf+memcpy.
which basically means that console_sem is not contended anymore, and
thus it's either the current console_sem owner or _maybe_ some other
task that has to print all of the pending messages.

and to make it worse, the hand off logic does not distinguish between
the contexts it passes console_sem ownership to. it will be equally
happy to hand off console_sem from a non-atomic context to an atomic
one, or from one atomic context to another (and vice versa), regardless
of the number of pending messages in the logbuf (see the rough sketch
below). more on this later [see @LATER].
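for reference, the hand off logic in question is, very roughly, the
following (a deliberately simplified sketch, not the actual code of the
patch; the real thing serializes the console_owner/console_waiter
updates with a spinlock):

    /* printk() side, when console_sem is already taken (simplified) */
    if (READ_ONCE(console_owner)) {
            /* somebody is actively printing - spin and request a hand off */
            WRITE_ONCE(console_waiter, true);
            while (READ_ONCE(console_waiter))
                    cpu_relax();
            /* we are the new console_sem owner and must flush the logbuf */
    }

    /* console_unlock() side (simplified) */
    for (;;) {
            /* ... break out when there are no more pending records ... */
            WRITE_ONCE(console_owner, current);
            call_console_drivers(text, len);    /* slow on serial consoles */
            WRITE_ONCE(console_owner, NULL);
            if (READ_ONCE(console_waiter)) {
                    /*
                     * hand off: the spinner becomes the new owner. note
                     * that nothing here checks what context the waiter
                     * is in - atomic or not - nor how many messages are
                     * still pending in the logbuf.
                     */
                    WRITE_ONCE(console_waiter, false);
                    return;
            }
    }
    up(&console_sem);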

now, Steven and Petr said that my test was not realistic and thus the
observations were invalid. OK, let's take a closer look at OOM. and
let's start with the question - what is so special about OOM that
makes (1) or (1.1) or (2) invalid?
can we get preempted when we call out_of_memory()->printk()->console_unlock()?

yes.

lkml.kernel.org/r/201612221927.BGE30207.OSFJMFLFOHQtOV@xxxxxxxxxxxxxxxxxxx
>
> somebody who called out_of_memory() is still preempted by other threads consuming
> CPU time due to cond_resched() from console_unlock() as demonstrated by below patch.
>

that is - we are preempted under console_sem. other CPUs keep adding
messages to the logbuf in the meantime.
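the preemption point Tetsuo refers to here is the cond_resched() inside
the console_unlock() printing loop; schematically (simplified):

    /* console_unlock(), simplified */
    do_cond_resched = console_may_schedule;
    for (;;) {
            /* ... break out when there are no more pending records ... */
            call_console_drivers(text, len);
            if (do_cond_resched)
                    cond_resched();     /* the OOM path can lose the CPU
                                         * right here, with console_sem
                                         * still held */
    }
    up(&console_sem);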

can we get preempted for a very long time?

yes.

lkml.kernel.org/r/201612150136.GBC13980.FHQFLSOJOFOtVM@xxxxxxxxxxxxxxxxxxx
>
> In most cases, "Out of memory: " and "Killed process" lines are printed within 0.1
> second. But sometimes it took a few seconds. Less often it took longer than a minute.
> There was one big stall which lasted for minutes.
>

and yes.

lkml.kernel.org/r/201706060002.FCD65614.OFFLOVQtHSJFOM@xxxxxxxxxxxxxxxxxxx
>
> Regarding the OOM killer preempted by console_unlock() from printk()
> problem, it will be mitigated by offloading to the printk kernel thread.
> But offloading solves only the OOM killer preempted by console_unlock()
> case. The OOM killer can still be preempted by schedule_timeout_killable(1).
>

did Tetsuo try to fix that console_unlock() behaviour from the OOM side?
yes.

lkml.kernel.org/r/201706060002.FCD65614.OFFLOVQtHSJFOM@xxxxxxxxxxxxxxxxxxx
>
> This change is a subset of enclosing whole oom_kill_process() steps
> with preempt_disable()/preempt_enable(), which was already rejected.
>

and we also already know that preemption from console_unlock() does
interfere with OOM:

lkml.kernel.org/r/201612142037.EED00059.VJMOFLtSOQFFOH@xxxxxxxxxxxxxxxxxxx
>
> If we can map all printk() called inside oom_kill_process() to printk_deferred(),
> we can avoid cond_resched() inside console_unlock() with oom_lock held.
>

any other reports?

yes. for example:

lkml.kernel.org/r/201603022101.CAH73907.OVOOMFHFFtQJSL@xxxxxxxxxxxxxxxxxxx
>
> I'm trying to dump information of all threads which might be relevant
> to stalling inside memory allocator. But it seems to me that since this
> patch changed to allow calling cond_resched() from printk() if it is
> safe to do so, it is now possible that the thread which invoked the OOM
> killer can sleep for minutes with the oom_lock mutex held when my dump is
> in progress. I want to release oom_lock mutex as soon as possible so
> that other threads can call out_of_memory() to get TIF_MEMDIE and exit
> their allocations.
>
> So, how can I prevent printk() triggered by out_of_memory() from sleeping
> for minutes with oom_lock mutex held? Guard it with preempt_disable() /
> preempt_enable() ? Guard it with rcu_read_lock() / rcu_read_unlock()?

and many more.


the point I'm trying to make is that (1) and (1.1) and (2) are still
very much true even when the system is under OOM. and we can still
accumulate a ton of messages in the logbuf. we expect that there will
be a lot of printk-s during OOM and that the hand off logic will handle
it all nicely. no objections. but the reports that we already have
suggest that this expectation may be slightly over-optimistic. *it may
happen* (or it may not) that none of those "a lot of printk-s" will be
able to help us, simply because with the hand off enabled console_sem
contention does not matter at all as long as the console_sem owner is
preempted/rescheduled. so I guess it's still possible that once we
finally begin printing logbuf messages the OOM might be over, there
might be no more concurrent printk-s from other CPUs, and it's some
poor task that will single-handedly have to flush the pending logbuf
messages. but we expect that this won't happen, and that there will be
other printk-s.

and to make it worse, this "one task that prints the entire logbuf"
is possible even on non-preemptible kernels, and even if we were to
disable preemption in console_unlock(). the reason is that console_sem
is a sleeping lock.

https://marc.info/?l=dri-devel&m=149938825811219
>
> I noticed that printing to consoles can stop forever because console drivers
> can wait for memory allocation when memory allocation cannot make progress.
>

which turned out to be due to this thing
https://marc.info/?l=linux-mm&m=149939515214223&w=2

>
> what you had is a combination of factors
>
>
> CPU0                    CPU1                        CPU2
>                         console_callback()
>                          console_lock()
>                          ^^^^^^^^^^^^^
> vprintk_emit()           mutex_lock(&par->bo_mutex)
>                           kzalloc(GFP_KERNEL)
>  console_trylock()         kmem_cache_alloc()       mutex_lock(&par->bo_mutex)
>  ^^^^^^^^^^^^^^^^           io_schedule_timeout
>

that is, the console_sem owner is dependent on kmalloc(), under OOM.
and this is what @PART2 "(1.1) the really big one, part two" is all
about. and yes, we may have a ton of printk messages during OOM, but
none of them will be worth a penny until we return from
kzalloc(GFP_KERNEL). and in this scheme of things, it is possible that
it's the console_sem owner that will single-handedly have to print all
of the pending logbuf messages. but we expect that this won't happen,
and that there will be other printk-s.

// not to mention that we even have direct console_sem dependencies on
// MM, like "console_lock(); kmalloc(GFP_KERNEL);". take a look at
// vc_allocate() or vc_do_resize(), etc.
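schematically, that pattern is (a simplified illustration, not a
verbatim quote from those functions):

    console_lock();
    ...
    vc = kzalloc(sizeof(*vc), GFP_KERNEL);  /* may block in reclaim, or
                                             * wait for OOM progress, with
                                             * console_sem held */
    ...
    console_unlock();   /* nobody can print to the consoles until here */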


and this is where we can recall @LATER. so let's say we have the following:

> we have console_lock() from non-atomic context. console_sem owner is
> getting preempted, under console_sem. in the mean time we have OOM,
> which can print a lot of info. by the time the console_sem owner returns
> back to TASK_RUNNING the logbuf contains some number of pending messages.
> the console owner goes to console_unlock(). accidentally we have printk
> from IRQ on
> CPUz. console_owner hands over printing duty to CPUz. so now we have to
> print OOM messages from irq.
>

CPU0                        CPU1 ~ CPUx                 CPUz

console_lock

<< preempted /
   rescheduled >>
                            OOM printouts, lots of
                            OOM traces, etc.

<< back to RUNNING >>

console_unlock()

 for (;;)
  sets console_owner
  call_console_drivers()                                IRQ
                                                         printk
                                                          sees console_owner
                                                          sets console_waiter
  clears console_owner
  sees console_waiter
  hand off
                                                        for (;;) {
                                                         call_console_drivers()
                                                         ??? lockup
                                                        }
 up()

so we are happy to hand off printing from any context to any context.
Steven said that this scenario is possible but not of any particular
interest, because printk from IRQ or from any other atomic context is a
bad thing which should happen only when something wrong is going on in
the system. but we are in OOM, or have just returned from OOM, which
_is_ "something bad going on", isn't it? can we instead say that OOM
makes printk from atomic context more likely? if it does happen, will
there be non-atomic printk-s to take over printing from atomic CPUz?
we can't tell. I don't know much about Tetsuo's test, but I assume that
his VM does not have any networking activity during the test. I
wouldn't be too surprised to see a bunch of printk-s from atomic
contexts under OOM.

but the thing is, the very moment we do that "non-atomic -> atomic"
hand off/transfer, we are on very thin ice. why? because the fundamental
and crucial requirements that Steven's patch imposes on the workload and
the system are:

(A) an optimistic one:
every atomic printk() must be followed by a non-atomic one within
the watchdog_threshold (10 seconds) interval. and, additionally, that
non-atomic printk must happen at the right moment, so that we can hand
the console_sem ownership off to it. otherwise, we can lock up the
system by looping in console_unlock() from atomic context for more
than watchdog_threshold seconds.

surely it's rather hard to count on that. thus Steven's patch has a
second fundamental requirement:

(B) a pessimistic one:
if we have only atomic contexts to evict logbuf messages, then the
rule is that any atomic context must be able to flush the entire logbuf
within the watchdog_threshold (10 seconds) interval.

do we see a problem here?

(B) basically means that the size of the logbuf *must* be limited based
on the performance of the system's slowest console. I don't know if
Tetsuo is using a network console or a serial console, or just fbcon,
which can be really fast compared to a serial console, but I know that
I'm using a serial console on my boards. and under no circumstances can
I ever have a logbuf small enough to be flushed to a serial console
within 10 seconds. in fact, the slower the console is, the bigger the
logbuf needs to be (of course I'm not speaking about insanely huge
buffers). this might sound counter-intuitive, but it makes sense: the
smaller the buffer is, the sooner you start losing kernel messages. and
it's not just my slow serial console to blame. consoles don't scale.
the more CPUs the system has, the more logbuf messages the console
potentially has to print (at the same transfer rate); say 1024 CPUs
under OOM.
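a quick back-of-the-envelope estimate: assuming, again, a 115200 baud
8N1 serial console, we can push out roughly 11520 bytes per second,
which is ~112KB per 10-second watchdog interval. so even a common 128KB
logbuf (LOG_BUF_SHIFT=17) already cannot be fully flushed within one
watchdog window, and bigger logbufs - which is exactly what big
multi-CPU systems configure - only make it worse.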


realistically,
I can't make (B) true. and I can't guarantee (A) to be true.

and that's why Steven's patch is, *UNFORTUNATELY*, not doing very
well on my boards. I simply can't fulfill its pre-conditions. sure,
there might be cases where Steven's patch does the right thing. but
I believe I have already seen enough to conclude that on my boards I
need something different.

-ss