Re: [RFC][PATCHv6 00/12] printk: introduce printing kernel thread

From: Steven Rostedt
Date: Mon Dec 18 2017 - 20:04:01 EST


On Tue, 19 Dec 2017 09:52:48 +0900
Sergey Senozhatsky <sergey.senozhatsky.work@xxxxxxxxx> wrote:

> > The case here, you are talking about a CPU doing console_lock() from a
> > non printk() case. Which is what I was asking about how often this
> > happens.
>
> I'd say often enough. but the point I was trying to make is that we can
> have non-atomic CPUs which can do the print out, instead of "sharing the
> load" between atomic CPUs.

We don't even know if sharing between "atomic" and "non-atomic" is an
issue. Anything that does a printk() in an atomic location, is going to
have latency to begin with.

>
> > As for why there's no handoff. Does the non printk()
> > console_lock/unlock ever happen from a critical location? I don't think
> > it does (but I haven't checked). Then it is the perfect candidate to do
> > all the printing.
>
> that's right. that is the point I was trying making. we can have better
> candidates to do all the printing.

Sure, but we don't even know if we have to. A problem scenario hasn't
come up that wasn't due to the current implementation (which my patch
changes).


> I did tests yesterday, traces are available. I can't conclude that
> the patch fixes the unfairness of printk().

It doesn't fix the "unfairness" it fixes the unboundedness of printk.
That is what has been triggering all the issues from before.



> consider the following case
>
> we have console_lock() from non-atomic context. console_sem owner is
> getting preempted, under console_sem. which is totally possible and
> happens a lot. in the mean time we have OOM, which can print a lot of
> info. by the time console_sem returns back to TASK_RUNNING logbuf
> contains some number of pending messages [lets say 10 seconds worth
> of printing]. console owner goes to console_unlock(). accidentally
> we have printk from IRQ on CPUz. console_owner hands over printing
> duty to CPUz. so now we have to print 10 seconds worth of OOM messages
> from irq.

Yes that can happen. But printk's from irq context is not nice to have
either, and should only happen when things are going wrong to begin
with.

>
>
>
> CPU0 CPU1 ~ CPUx CPUz
>
> console_lock
>
> << preempted >>
>
>
> OOM OOM printouts, lots
> of OOM traces, etc.
>
> OOM end [progress done].
>
> << back to RUNNING >>
>
> console_unlock()
>
> for (;;)
> sets console_owner
> call_console_drivers() IRQ
> printk
> sees console_owner
> sets console_waiter
>
> clears console_owner
> sees console_waier
> handoff
> for (;;) {
> call_console_drivers()
> ??? lockup
> }
> up()
>
> this is how we down() from non-atomic and up() from atomic [if we make
> it to up(). we might end up in NMI panic]. this scenario is totally possible,

The printk buffer needs to be very big, and bad things have to happen
first. This is a theoretical scenario, and I'd like to see it happen in
the real world before we try to fix it. My patch should make printk
behave *MUCH BETTER* than it currently does.

If you are worried about NMI panics, then we could add a touch nmi
within the printk loop.


> isn't it? the optimistic expectation here is that some other printk() from
> non-atomic CPU will jump in and take over printing from atomic CPUz. but I
> don't see why we are counting on it.

I don't see why we even care. Placing a printk in an atomic context is
a problem to begin with, and should only happen if there's issues in
the system.

-- Steve